Research
The CST staff carries out research within a range of language technology fields that interact in various ways. Much of this research is conducted through external projects.
NLP and Digital Humanities
We develop NLP methodology and statistical and neural language models for analysing textual data in its widest sense. This covers a wide variety of genres, such as poems, novels, letters, news articles, scientific articles and love song lyrics. We develop NLP pipelines and corpus tools, in particular for Danish, along with appropriate methods and gold standards for evaluating them.
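As a minimal illustration of what such a pipeline involves, the sketch below runs tokenisation, lemmatisation, part-of-speech tagging and named entity recognition on a Danish sentence using spaCy's freely available Danish model. This is an illustrative example only, not CST's own toolchain, and the model has to be downloaded separately.

```python
# Minimal sketch of a Danish NLP pipeline (illustrative only, not CST's own tools).
# Assumes: pip install spacy && python -m spacy download da_core_news_sm
import spacy

nlp = spacy.load("da_core_news_sm")  # tokenisation, lemmatisation, POS tagging, NER for Danish

doc = nlp("H.C. Andersen udgav sine første eventyr i København i 1835.")

for token in doc:
    # surface form, lemma and universal POS tag for each token
    print(token.text, token.lemma_, token.pos_)

for ent in doc.ents:
    # named entities recognised in the sentence (e.g. persons and places)
    print(ent.text, ent.label_)
```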
- Measuring Modernity and Mining the Meaning – two projects financed by the Carlsberg Foundation and the UCPH Data+ grant (Goal: To explore the “Modern Breakthrough” in Scandinavian literature through 900 digitised Danish and Norwegian novels from 1870–1900 by means of quantitative methods such as semantic processing. The PIs are from Literature at NorS and from the Department of Computer Science).
- Automated credibility assessment of dissemination of science news in media – faculty-funded PhD thesis (Goal: To examine fake news, to carry out cross-domain sentence-level hyperbole detection, and to explore an NLP approach to Danish sentence-level hyperbole detection; see the classification sketch after this list).
- Computational modelling of language change – Velux-funded PhD thesis (Goal: Among others, to analyse the phonological changes that can occur in the language used in manuscripts).
- ParlaMint and related projects – CLARIN projects (Goal: To examine parliamentary debates, including the compilation and annotation of corpora and machine learning experiments on the data).
- Poetry analysis and generation – internal project (Goals: To develop a poetry generation system for Basque using neural networks, to analyse the usefulness of scansion models for the recognition of old Scottish tunes, and to align the Milton and Shakespeare corpus with text aligned with audio, words, syllables, phonemes and scansion).
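As an illustration of how a sentence-level detection task of this kind can be framed, the sketch below casts hyperbole detection as binary sentence classification with a pretrained multilingual encoder. The model name, label set and example sentences are assumptions for illustration; a usable classifier would first be fine-tuned on annotated hyperbole data.

```python
# Hedged sketch: sentence-level hyperbole detection framed as binary text
# classification. Model name, labels and examples are illustrative assumptions;
# the classification head is untrained until fine-tuned on labelled data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"  # assumption: any multilingual encoder could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

sentences = [
    "Jeg har ventet i hundrede år på den bus.",  # hyperbolic
    "Bussen var ti minutter forsinket i dag.",   # literal
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # shape: (batch, 2)
predictions = logits.argmax(dim=-1)       # 0 = literal, 1 = hyperbole (after fine-tuning)
print(predictions.tolist())
```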
NLP Resources and Datasets
A central focus of the Centre is to develop principled methods for compiling high-quality language resources and datasets for NLP. Coming from a tradition of expertise in computational linguistics, and affiliated with the Department of Nordic Studies and Linguistics, the Centre's vision for this research area is to cultivate and further develop a linguistically driven approach to the development of language resources, one that pays specific attention to the particular characteristics of the Danish language, culture and society. We build most of our lexical resources in close collaboration with language institutions in Denmark, such as the Society for Danish Language and Literature and the Danish Language Council.
Current lexical and terminological projects
- ELEXIS – EU Horizon 2020 (Goal: To foster knowledge exchange between communities in lexicography and to examine and develop tools for making lexicographic data available for LT, including the development of semantically annotated multilingual corpora).
- DanNet2 – The Carlsberg Foundation (Goal: To examine how The Danish Thesaurus can be made applicable for LT and to extend the Danish wordnet, DanNet, with thesaurus data)
- Det Centrale OrdRegister (COR) – The Agency for Digitisation (Goal: To develop a common Danish lexical resource for LT and language-centric AI)
- FedTerm – EU CEF Programme (Goal: To develop federated terminology collections for the European languages)
CST also engages in the collection of resources and the compilation of corpora related to projects in digital humanities and computational cognitive modelling, as further described under those respective subareas. These include, among others:
- CLARIN resource development (including the ParlaMint corpus collections).
- The Danish eye tracking data collection – internal project with ITU (Goal: To develop a Danish eye tracking collection from natural reading of Danish texts. The data resource can be used for research in psycholinguistics as well as for cognitively enhanced NLP applications).
- Low-cost eye tracking corpus for explainable natural language processing – Carlsberg Foundation – with the Department of Computer Science at UCPH (Goal: To collect low-cost webcam-based eye-tracking data for explainable NLP).
- Gestures and Head Movements in Language (GEHM). Research network that supports cooperation among eight leading research groups that work in the area of gesture and language. (Goal: to foster new theoretical insights into the way hand gestures and head movements interact with speech in face-to-face multimodal communication).
Computational Cognitive Modelling and Multimodality
Computational cognition is an approach to the study of cognition that seeks to explain the way humans process information by developing mathematical and computational models that capture aspects of such processing. At CST, we focus on computational models of language processing and the extent to which these models make use of cognitive signals on the one hand and linguistic knowledge on the other. Furthermore, we investigate the way in which language – both written and spoken – interacts with other modalities, such as the visual-gestural modality.
- Gestures and Head Movements in Language (GEHM). Research network that supports cooperation among eight leading research groups that work in the area of gesture and language. (Goal: to foster new theoretical insights into the way hand gestures and head movements interact with speech in face-to-face multimodal communication).
- Project on the development of an automatic head movement classifier – internal project related to GEHM (Goal: To develop a classifier that makes use of visual and acoustic features to detect head movements in video data, to be used as an aid in the annotation of video-recorded language data; see the sketch after this list).
- The Danish eye tracking data collection – internal project with ITU (Goal: To develop a Danish eye tracking collection from natural reading of Danish texts. The data resource can be used for research in psycholinguistics as well as for cognitively enhanced NLP applications).
- Low-cost eye tracking corpus for explainable natural language processing – Carlsberg Foundation – with the Department of Computer Science at UCPH (Goal: To collect low-cost webcam-based eye-tracking data for explainable NLP).
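As a rough illustration of the head movement classifier idea referred to above (combining visual and acoustic features in a supervised model), the sketch below trains a standard classifier on randomly generated feature vectors. The feature names, dimensions and labels are assumptions standing in for real per-segment measurements extracted from video and audio.

```python
# Hedged sketch of a head-movement classifier combining visual features
# (e.g. per-segment head pose statistics) with acoustic features (e.g. pitch
# and intensity statistics). The random data stands in for real extracted features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n_segments = 500

visual = rng.normal(size=(n_segments, 6))    # e.g. mean/std of head pitch, yaw and roll angles
acoustic = rng.normal(size=(n_segments, 4))  # e.g. mean/std of F0 and intensity
X = np.hstack([visual, acoustic])
y = rng.integers(0, 2, size=n_segments)      # 1 = head movement (e.g. nod/shake), 0 = none

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```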
NLP Infrastructure and Policy
We address language policy issues regarding Danish LT and promote LT for low-resource languages in Denmark and the EU at the political and industrial levels. The Centre has been involved in promoting Danish LT around the world for decades and in addressing the interoperability and availability of Danish resources internationally, supporting the use of standards and the production and sharing of FAIR linguistic and language technology resources. Related to this mission is the Centre’s long-term involvement in CLARIN, which is a technological infrastructure for the social sciences and humanities, and which includes the development of NLP tools for the processing and annotation of text and other language-related material.
- CLARIN (Goal: To make all digital language resources and tools from across Europe and beyond accessible through a single sign-on online environment in support of researchers in the humanities and social sciences)
- European Language Equality (Goal: To prepare a strategic research, innovation and implementation agenda and a roadmap for achieving full digital language equality in Europe by 2030)
- European Language Grid (Goal: To establish the primary platform for Language Technology in Europe)
- European Language Resource Coordination (Goal: To improve the quality of automated translation solutions by coordinating the relevant language resources in all official languages of the EU and CEF associated countries).
Over the years, the Centre has also been highly active in a considerable number of international and national boards and committees relevant to infrastructures and language policy. Currently, these include organisations such as Digital Humanities in the Nordic and Baltic Countries (DHNB), the European Language Technology Council, the Social and Cultural Innovation SWG of ESFRI, the Advisory Board for Sprogteknologi.dk at the Danish Agency for Digitisation under the Danish Government, the Danish Language Council, the Society for Danish Language and Literature, and the Danish Terminology Group.
Representation Learning for Natural Language Processing (RL4NLP)
We contribute to explainable artificial intelligence by exploring the internal representations and learning mechanisms of computational models of natural language. By adopting a multilingual strategy, we align these mechanisms with linguistic theories to uncover universal linguistic patterns and demonstrate their practical applications in real-world tasks. In addition, we inform computational models of language with linguistic theories to enhance their transparency and performance. This alignment between computational models and linguistic theories improves model interpretability and enables more robust and accurate processing of diverse languages.
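One common way to explore internal representations in this spirit is diagnostic probing: hidden states are extracted from a frozen pretrained model, and a small linear classifier is trained to predict a linguistic property from them. The sketch below is a minimal, hedged example of this general technique; the model name, layer choice and placeholder labels are assumptions, not a description of the group's actual experiments.

```python
# Hedged sketch of diagnostic probing: extract frozen hidden states from a
# pretrained multilingual encoder and fit a linear probe for a linguistic
# property. Model, layer and labels are illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-multilingual-cased"  # assumption: any multilingual encoder works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

sentences = ["Hunden sover.", "Katten jagter musen."]  # illustrative Danish sentences
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[8]          # one intermediate layer: (batch, seq, dim)

mask = inputs["attention_mask"].bool()
X = hidden[mask].numpy()                               # one vector per (sub)token
y = np.arange(len(X)) % 2                              # placeholder labels; a real probe uses gold annotations

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on its training data:", probe.score(X, y))
```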
What is language technology?
Language technology is an interdisciplinary subject that combines the study of language with the study of IT.