Project lead: Dr. Sabine Bartsch
The goal of this shared task was to encourage the developers of NLP applications to adapt their tools and resources for the processing of written German discourse in genres of computer-mediated communication (CMC). Examples of CMC genres are chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.
Processing CMC discourse is both a desideratum and a relevant task in different research fields and application contexts in the Digital Humanities, e.g.:
- in the context of building, processing and analyzing corpora of computer-mediated communication / social media (chat corpora, news corpora, WhatsApp corpora, …)
- in the context of collecting, processing and analyzing large, genre-heterogeneous web corpora as resources in the field of Language Technology / Data Mining
- in the context of dealing with CMC data in corpus-based analyses on contemporary written language, language variation and language change
- in all research fields beyond linguistics which address social, cultural and educational aspects of social media and CMC technologies using language data from CMC genres
The shared task consisted of two subtasks:
- Tokenization of CMC discourse
- Part-of-speech tagging of CMC discourse
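To illustrate why CMC discourse needs adapted tools for the first subtask, the following is a minimal sketch of a rule-based tokenizer that keeps typical CMC items (URLs, @mentions, #hashtags, simple emoticons) together as single tokens. The pattern `TOKEN_RE`, the function `tokenize` and the example message are assumptions made for illustration; they do not reproduce the tokenization guidelines of the shared task or any participating system.

```python
import re

# Hypothetical illustration only: NOT the shared task's tokenization guidelines.
TOKEN_RE = re.compile(
    r"""
    https?://\S+          # URLs are kept as one token
  | [@\#]\w+              # @mentions and hashtags
  | [:;]-?[()DPp]         # a few simple emoticons
  | \w+(?:[-']\w+)*       # words, allowing internal hyphens and apostrophes
  | [^\w\s]               # any other single non-whitespace character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split a CMC message into a list of token strings."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    message = "@anna hab's gesehen :-) #empirist http://example.org/x !!"
    print(tokenize(message))
```

A whitespace-only split would treat `:-)` correctly but leave punctuation attached to words, while a standard newspaper-style tokenizer would typically break the emoticon, the hashtag and the URL into several pieces. The sketch has obvious limits: for instance, the URL rule also swallows punctuation that directly follows a link.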
The two subtasks made use of two different data sets:
- CMC data set: a selection of data from different CMC genres (social chat, professional chat, Wikipedia talk pages, blog comments, tweets, WhatsApp dialogues).
- Web corpora data set: a selection of data which represents written discourse from heterogeneous WWW genres. It consists of crawled websites (e.g. webpages, blogs, news sites, blog commentary) that include small portions of CMC discourse.