DK-CLARIN LSP corpus – University of Copenhagen

The LSP (Language for Special Purposes) corpus consists of texts from seven selected domains. It comprises 11 million tokens from the period 2000-2010 and complements the existing Danish general-language corpora.

The corpus was collected and compiled by Center for Sprogteknologi at the University of Copenhagen and by Dansk Sprognævn (the Danish Language Council).

This corpus was collected and processed as part of the Danish CLARIN project (see http://dkclarin.ku.dk/english).
The aim of the Danish CLARIN consortium was to construct a Danish research infrastructure for the humanities, integrating written, spoken, and visual records into a coherent and systematic digital repository. The project ran from January 2008 until the end of 2010.

Content of the LSP Corpus:

The corpus comprises the following domains:
- Health and Medicine
- Agriculture
- Climate and Environment
- Economics
- Information Technology (IT)
- Construction
- Nanotechnology

The texts come from communicative settings in which an expert or semi-expert addresses a semi-expert or layman. They cover text types with different communicative aims: informative (reports, product descriptions), normative (standards, regulations), instructive (textbooks, manuals), etc.

Annotations

All texts have the following annotations:
- Tokenization
- Lemmatization
- POS-tagging
- Termhood

License

All texts in the corpus are subject to the Danish CLARIN academic license and may be used for strictly non-commercial purposes only. See the full license at https://clarin.dk/clarindk/download-proxy.jsp?license=downloadacademic.

Documentation

Detailed documentation on the corpus composition and text processing is available in Danish.

Contact person

Sussi Olsen
Center for Sprogteknologi, Københavns Universitet
saolsen@hum.ku.dk