The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification

Institut for Nordiske Studier og Sprogvidenskab

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Standard

The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification. / Navarretta, Costanza; Hansen, Dorte Haltrup.

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association, 2022. pp. 1428-1436.

Harvard

Navarretta, C & Hansen, DH 2022, The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification. in Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association, pp. 1428-1436. <https://aclanthology.org/2022.lrec-1.153.pdf>

APA

Navarretta, C., & Hansen, D. H. (2022). The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 1428-1436). European Language Resources Association. https://aclanthology.org/2022.lrec-1.153.pdf

Vancouver

Navarretta C, Hansen DH. The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association. 2022. p. 1428-1436

Author

Navarretta, Costanza ; Hansen, Dorte Haltrup. / The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association, 2022. pp. 1428-1436

Bibtex

@inproceedings{35e85f5c4e1549bca586179a434f95bb,

title = "The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification",

abstract = "This paper addresses the semi-automatic annotation of subjects, also called policy areas, in the Danish Parliament Corpus (2009-2017) v.2. Recently, the corpus has been made available through the CLARIN-DK repository, the Danish node of the European CLARIN infrastructure. The paper also contains an analysis of the subjects in the corpus, and a description of multi-label classification experiments conducted to verify the consistency of the subject annotation and the utility of the corpus for training classifiers on this type of data. The analysis of the corpus comprises an investigation of how often the parliament members addressed each subject and the relation between subjects and gender of the speaker. The classification experiments show that classifiers can determine the two co-occurring subjects of the speeches from the agenda titles with a performance similar to that of human annotators. Moreover, a multilayer perceptron achieved an F1-score of 0.68 on the same task when trained on bag-of-words vectors obtained from the speeches{\textquoteright} lemmas. This is an improvement of more than 0.6 with respect to the baseline, a majority classifier that accounts for the frequency of the classes. The result is promising given the high number of subject combinations (186) and the skewness of the data.",

author = "Costanza Navarretta and Hansen, {Dorte Haltrup}",

year = "2022",

language = "English",

pages = "1428--1436",

booktitle = "Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)",

publisher = "European Language Resources Association",

}

RIS

TY - GEN

T1 - The Subject Annotations of the Danish Parliament Corpus (2009-2017) v.2 - Evaluated with Automatic Multilabel Classification

AU - Navarretta, Costanza

AU - Hansen, Dorte Haltrup

PY - 2022

Y1 - 2022

N2 - This paper addresses the semi-automatic annotation of subjects, also called policy areas, in the Danish Parliament Corpus (2009-2017) v.2. Recently, the corpus has been made available through the CLARIN-DK repository, the Danish node of the European CLARIN infrastructure. The paper also contains an analysis of the subjects in the corpus, and a description of multi-label classification experiments conducted to verify the consistency of the subject annotation and the utility of the corpus for training classifiers on this type of data. The analysis of the corpus comprises an investigation of how often the parliament members addressed each subject and the relation between subjects and gender of the speaker. The classification experiments show that classifiers can determine the two co-occurring subjects of the speeches from the agenda titles with a performance similar to that of human annotators. Moreover, a multilayer perceptron achieved an F1-score of 0.68 on the same task when trained on bag-of-words vectors obtained from the speeches’ lemmas. This is an improvement of more than 0.6 with respect to the baseline, a majority classifier that accounts for the frequency of the classes. The result is promising given the high number of subject combinations (186) and the skewness of the data.

AB - This paper addresses the semi-automatic annotation of subjects, also called policy areas, in the Danish Parliament Corpus (2009-2017) v.2. Recently, the corpus has been made available through the CLARIN-DK repository, the Danish node of the European CLARIN infrastructure. The paper also contains an analysis of the subjects in the corpus, and a description of multi-label classification experiments conducted to verify the consistency of the subject annotation and the utility of the corpus for training classifiers on this type of data. The analysis of the corpus comprises an investigation of how often the parliament members addressed each subject and the relation between subjects and gender of the speaker. The classification experiments show that classifiers can determine the two co-occurring subjects of the speeches from the agenda titles with a performance similar to that of human annotators. Moreover, a multilayer perceptron achieved an F1-score of 0.68 on the same task when trained on bag-of-words vectors obtained from the speeches’ lemmas. This is an improvement of more than 0.6 with respect to the baseline, a majority classifier that accounts for the frequency of the classes. The result is promising given the high number of subject combinations (186) and the skewness of the data.

M3 - Article in proceedings

SP - 1428

EP - 1436

BT - Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

PB - European Language Resources Association

ER -

ID: 317676571

Center for Sprogteknologi
