Towards a Gold Standard for Evaluating Danish Word Embeddings

Department of Nordic Studies and Linguistics

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Documents

2020.lrec-1.585
Final published version, 310 KB, PDF document

Nina Schneidermann
Rasmus Hvingelby
Bolette Sandford Pedersen

This paper presents the process of compiling a model-agnostic similarity gold standard for evaluating Danish word embeddings based
on human judgments made by 42 native speakers of Danish. Word embeddings resemble semantic similarity solely by distribution
(meaning that word vectors do not reflect relatedness as differing from similarity), and we argue that this generalisation poses a problem
in most intrinsic evaluation scenarios. In order to be able to evaluate on both dimensions, our human-generated dataset is therefore
designed to reflect the distinction between relatedness and similarity. The gold standard is applied for evaluating the "goodness" of
six existing word embedding models for Danish, and it is discussed how a relatively low correlation can be explained by the fact that
semantic similarity is substantially more challenging to model than relatedness, and that there seems to be a need for future human
judgements to measure similarity in full context and along more than a single spectrum.
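
The intrinsic evaluation described in the abstract can be sketched roughly as follows: score each word pair by the cosine similarity of its vectors, then correlate those scores with the human ratings. This is a minimal sketch, not the authors' evaluation code. The abstract does not name the correlation measure, so Spearman's rank correlation is an assumption here, and the words, vectors, and ratings below are invented for illustration and are not taken from the dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(embeddings, gold):
    """Correlate model similarities with human ratings.

    gold is a list of (word1, word2, human_score) tuples. Pairs containing
    a word missing from the embeddings are skipped.
    Returns (rho, number_of_pairs_used).
    """
    model_scores, human_scores = [], []
    for w1, w2, score in gold:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores), len(human_scores)


# Hypothetical toy data: vectors and ratings are made up for this sketch.
toy_embeddings = {
    "hund": [1.0, 0.0],  # dog
    "kat": [0.9, 0.1],   # cat
    "bil": [0.0, 1.0],   # car
    "vej": [0.3, 0.7],   # road
}
toy_gold = [
    ("hund", "kat", 8.0),
    ("bil", "vej", 3.0),
    ("hund", "bil", 1.0),
    ("kat", "tog", 2.0),  # "tog" (train) has no vector, so this pair is skipped
]

rho, pairs_used = evaluate(toy_embeddings, toy_gold)
```

Running `evaluate` once against the similarity ratings and once against the relatedness ratings would give one correlation per dimension, which is the kind of comparison the dataset is designed to support.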

Original language	English
Title of host publication	Proceedings of the 12th Language Resources and Evaluation Conference
Number of pages	10
Place of Publication	Marseille, France
Publisher	European Language Resources Association
Publication date	2020
Pages	4756-4765
ISBN (Electronic)	9791095546344
Publication status	Published - 2020
Event	Language Resources and Evaluation Conference (LREC) 2020 - Marseille, France. Duration: 13 May 2020 → 15 May 2020. https://lrec2020.lrec-conf.org/en/

Conference

Conference	Language Resources and Evaluation Conference (LREC) 2020
Location	Marseille
Country	France
City	Marseille
Period	13/05/2020 → 15/05/2020
Web address	https://lrec2020.lrec-conf.org/en/

ID: 241358594

Centre for Language Technology
