Research Projects

Research Projects at the Department of Corpus and Computational Linguistics, English Philology

On this page, you will find an overview of the research projects of the department.

The project develops tutorials, how-tos, links, tools, and approaches for working with corpora, focusing on research in the fields of Linguistics, Corpus and Computational Linguistics, and other digital philologies.

The aim of the project is to support students and researchers in corpus- and computer-based research by providing materials and guidance for self-study and teaching, and to further the independent use of technologies and methods of Linguistics and other philological sciences.

The portal linguisticsweb-org is used by international researchers and teachers in research, teaching, and workshops. It was created as an independent online project in 2008-09 and has been developed further ever since.

To the website

The goal of this shared task was to encourage the developers of NLP applications to adapt their tools and resources for the processing of written German discourse in genres of computer-mediated communication (CMC). Examples of CMC genres are chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.

Processing CMC discourse is a desideratum and a relevant task in different research fields and application contexts in the Digital Humanities – e.g.:

- in the context of building, processing and analyzing corpora of computer-mediated communication / social media (chat corpora, news corpora, WhatsApp corpora, …)

- in the context of collecting, processing and analyzing large, genre-heterogeneous web corpora as resources in the field of Language Technology / Data Mining

- in the context of dealing with CMC data in corpus-based analyses on contemporary written language, language variation and language change

- in all research fields beyond linguistics which address social, cultural and educational aspects of social media and CMC technologies using language data from CMC genres

The shared task consisted of two subtasks:

- Tokenization of CMC discourse

- Part-of-speech tagging of CMC discourse

The two subtasks made use of two different data sets:

- CMC data set: a selection of data from different CMC genres (social chat, professional chat, Wikipedia talk pages, blog comments, tweets, WhatsApp dialogues).

- Web corpora data set: a selection of data which represents written discourse from heterogeneous WWW genres. It consists of crawled websites including small portions of CMC discourse (e.g. webpages, blogs, news sites, blog commentary).

Learn more

The LOEWE-Schwerpunkt Digital Humanities is a collaboration of the University of Frankfurt, the Technical University of Darmstadt, and the Freies Deutsches Hochstift / Goethe Museum Frankfurt. Objective: to connect basic research across the participating humanities disciplines, with a focus on information technology methods.

LOEWE Schwerpunkt Digital Humanities – Integrated editing and evaluation of text-based corpora, co-applicant and PI in the project area “Contemporary Corpora”, January 2011 to December 2013

Partners: Prof. Dr. Iryna Gurevych, Prof. Dr. Gert Webelhuth, January 2011 to December 2013

Funded by the State of Hesse as part of the LOEWE initiative of excellence.


To the website


Subproject: “Scientific and technical literacy: investigations of natural-language communication in collaborative product development”

Partners: Prof. Dr.-Ing. Reiner Anderl and Prof. Dr. Elke Teich

Funded by the Innovation Fund of the State of Hesse (Innovationsfonds des Landes Hessen), July 2004 to January 2006