Project Duration: October 1, 2025 – September 30, 2027
Funding: Hessian Ministry of Science and Research, Arts and Culture (HMWK), LOEWE Exploration
Project Management: Michael Schonhardt, Andrea Rapp
Project Staff: Torben Jordan, Torsten Schenk, Serafina Fuchs, Elena Monzel
Description
The project “Embedding the Past,” funded by the LOEWE Exploration line, addresses the methodological challenge of making Large Language Models (LLMs) reliably usable for medieval and early modern research. While modern language stages are extensively represented in current AI models, conventional systems reach their limits when processing pre-modern sources, such as those in Latin or Middle High German. The consequences are hallucinations, bias, and limited hermeneutic reliability, which have so far hindered the scholarly application of these technologies.
The project operates at this intersection, pursuing a three-pillar exploratory approach:
1. Evaluation and Gold-Standard Datasets
To systematically test the suitability of existing multilingual embedding models for historical research, the project is developing a domain-specific evaluation scenario. A gold-standard dataset will be created to mirror traditional hermeneutic workflows in the humanities. The aim is to measure objectively how precisely current models capture semantic relations within historical language stages.
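As a rough illustration of what such a measurement could look like, the sketch below ranks passages by cosine similarity and scores the ranking against gold-standard relevance labels. The metrics (hit rate at k and mean reciprocal rank), the function names, and the toy vectors are assumptions made for this example; the project description does not specify its evaluation metrics or data format.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted by descending similarity to the query."""
    return sorted(
        passage_vecs,
        key=lambda pid: cosine(query_vec, passage_vecs[pid]),
        reverse=True,
    )

def evaluate(queries, passage_vecs, gold, k=1):
    """queries: id -> vector; gold: query id -> set of relevant passage ids."""
    hits = 0
    reciprocal_ranks = []
    for qid, qvec in queries.items():
        ranking = rank_passages(qvec, passage_vecs)
        # Hit if any gold passage appears among the top k results.
        if gold[qid] & set(ranking[:k]):
            hits += 1
        # Rank (1-based) of the first gold passage, if any.
        first = next(
            (i for i, pid in enumerate(ranking, start=1) if pid in gold[qid]),
            None,
        )
        reciprocal_ranks.append(1.0 / first if first else 0.0)
    n = len(queries)
    return {"hit_rate_at_k": hits / n, "mrr": sum(reciprocal_ranks) / n}

# Toy vectors standing in for model embeddings (hypothetical data).
passages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
queries = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
gold = {"q1": {"p1"}, "q2": {"p3"}}

result = evaluate(queries, passages, gold, k=1)
```

In a real evaluation, the vectors would come from the embedding model under test, and the gold labels from the expert-curated dataset.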
2. Data Generation and Fine-Tuning
A central component of the project is the optimization of Sentence Transformers for the pre-modern domain. This work is based on a specialized training corpus from the field of pre-modern canon law. Through alignment, that is, the mapping of content-related text segments across different languages and language stages, the models are specifically trained to “understand” historical terminology and concepts.
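One common way to train on aligned segment pairs is a contrastive objective with in-batch negatives, comparable in spirit to the MultipleNegativesRankingLoss in the Sentence Transformers library. The project description does not name its training objective, so the following is only a sketch under that assumption; the function names, the scale value, and the toy vectors are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def in_batch_contrastive_loss(source_vecs, target_vecs, scale=20.0):
    """Mean cross-entropy over the batch.

    target_vecs[i] is the aligned (positive) segment for source_vecs[i];
    every other target in the batch serves as a negative.
    """
    src = [normalize(v) for v in source_vecs]
    tgt = [normalize(v) for v in target_vecs]
    total = 0.0
    for i, s in enumerate(src):
        logits = [scale * dot(s, t) for t in tgt]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(src)

# Toy embeddings (hypothetical): two source segments and their aligned targets.
sources = [[1.0, 0.0], [0.0, 1.0]]
aligned_targets = [[0.9, 0.1], [0.1, 0.9]]
shuffled_targets = list(reversed(aligned_targets))

aligned_loss = in_batch_contrastive_loss(sources, aligned_targets)
shuffled_loss = in_batch_contrastive_loss(sources, shuffled_targets)
```

The loss is low when each source segment is closest to its own aligned counterpart and high when the pairing is wrong, which is the signal that pushes related segments together in the embedding space during training.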
3. Sustainable Software Suite for RAG Applications
The theoretical findings will feed into the development of a low-barrier and sustainable software suite. At its core is a locally executable application based on Retrieval-Augmented Generation (RAG). This tool will enable researchers to query their own source collections through LLMs and the embedding models developed in the project in a transparent, fact-based manner, while ensuring the sustainable storage of results.
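The retrieval step of a RAG pipeline can be sketched as follows: embed the question, select the most similar passages from the researcher's own collection, and pass only those passages, with their identifiers, to the LLM. This is a minimal sketch, not the project's software; the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and the collection, prompt wording, and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(question, collection, top_k=2):
    """Return the top_k (id, text) pairs most similar to the question."""
    q = embed(question)
    scored = sorted(
        collection.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question, passages):
    """Assemble a prompt that restricts the LLM to the retrieved passages."""
    context = "\n".join(f"[{pid}] {text}" for pid, text in passages)
    return (
        "Answer using only the passages below and cite their ids.\n"
        f"{context}\n"
        f"Question: {question}"
    )

# Toy source collection (hypothetical data).
collection = {
    "doc1": "the bishop confirms monastic privileges",
    "doc2": "the council regulates marriage between relatives",
    "doc3": "a charter records the sale of a vineyard",
}
question = "what does the council say about marriage"
passages = retrieve(question, collection, top_k=1)
prompt = build_prompt(question, passages)
```

Keeping the passage identifiers in the prompt is what makes the answer traceable: each claim in the generated text can be checked against the cited source.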
Open Science and Reuse
In the spirit of Open Science, all project results will be made available to the international research community for free reuse via repositories such as Hugging Face, Zenodo, and Git, subject to licensing restrictions.