The DK-CLARIN LSP (Language for Special Purposes) corpus consists of texts from seven selected domains. It comprises 11 million tokens from the period 2000-2010 and complements the existing Danish general-language corpora.

The collection and compilation of the corpus were carried out by the Center for Sprogteknologi at the University of Copenhagen and by Dansk Sprognævn (the Danish Language Council).

This corpus was collected and processed in the Danish CLARIN project (see
The aim of the Danish CLARIN consortium was to construct a Danish research infrastructure for the humanities integrating written, spoken, and visual records into a coherent and systematic digital repository. The project ran from January 2008 until the end of 2010.

Content of the LSP Corpus:

The corpus includes the following domains:
- Health and Medicine
- Climate and Environment
- Information Technology (IT)

The texts come from communicative settings in which an expert or semi-expert addresses a semi-expert or layman. They cover text types with different communicative aims: informative (reports, product descriptions), normative (standards, regulations), instructive (textbooks, manuals), etc.


All texts have the following annotations:
- Tokenization
- Lemmatization
- POS-tagging
- Termhood
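
As a rough illustration of how the four annotation layers listed above might be consumed together, the sketch below models one token per record and collects the lemmas of tokens marked as terms. Everything in it is an assumption made for illustration: the `Token` class, its field names, the yes/no representation of termhood, the POS labels, and the sample sentence are hypothetical and are not taken from the corpus or its documentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One token with the four annotation layers (hypothetical layout)."""
    form: str      # surface form produced by tokenization
    lemma: str     # lemmatization
    pos: str       # POS tag (the tag set used here is invented)
    is_term: bool  # termhood, assumed here to be a simple yes/no flag


def term_lemmas(tokens: List[Token]) -> List[str]:
    """Return the lemmas of term tokens, in first-seen order, without repeats."""
    seen: List[str] = []
    for token in tokens:
        if token.is_term and token.lemma not in seen:
            seen.append(token.lemma)
    return seen


# Invented sample sentence, not corpus data.
sample = [
    Token("Patienterne", "patient", "N", True),
    Token("fik", "få", "V", False),
    Token("antibiotika", "antibiotikum", "N", True),
]

print(term_lemmas(sample))
```

The actual file format and the way termhood is encoded should be checked against the corpus documentation before adapting this.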


All the texts in the corpus are subject to the Danish CLARIN academic license, for strictly non-commercial use. See full license at


Detailed documentation of the corpus composition and text processing is available in Danish.

Contact person

Sussi Olsen
Center for Sprogteknologi, Københavns Universitet