Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus

Department of Nordic Studies and Linguistics

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

In this paper we describe the Danish CLARIN resources, corpora, tools and workflow that we used and enhanced in order to build the Danish ParlaMint corpus as part of the CLARIN-funded ParlaMint project. More specifically, the article accounts for the manual and automatic processes involved in preparing the Danish parliamentary speeches, with a focus on the CLARIN-DK tools and Text Tonsorium workflow management. The tools annotated the speeches with metadata and linguistic information in compliance with the common ParlaMint TEI P5 format. As a spin-off of the project, the CLARIN-DK sentence tokenizer and the CST Named Entity Recognizer were improved. These tools, together with the CST-lemmatiser, the Danish UD-Pipe software and several data transformation utilities, produced all the linguistic annotations in the correct format. We conclude the paper with a report on a pilot evaluation of the quality of some of the linguistic annotations in the Danish ParlaMint corpus.

Original language	English
Title of host publication	CLARIN Annual Conference 2021 Proceedings
Publisher	CLARIN ERIC
Publication date	2021
Pages	70-73
Publication status	Published - 2021

Centre for Language Technology
