The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations. / Agirrezabal, Manex; Boldsen, Sidsel; Hollenstein, Nora.

Proceedings of the Workshop on Computation and Written Language (CAWL 2023). Toronto: Association for Computational Linguistics, 2023. p. 6-13.

Harvard

Agirrezabal, M, Boldsen, S & Hollenstein, N 2023, The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations. in Proceedings of the Workshop on Computation and Written Language (CAWL 2023). Association for Computational Linguistics, Toronto, pp. 6-13. https://doi.org/10.18653/v1/2023.cawl-1.2

APA

Agirrezabal, M., Boldsen, S., & Hollenstein, N. (2023). The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations. In Proceedings of the Workshop on Computation and Written Language (CAWL 2023) (pp. 6-13). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.cawl-1.2

Vancouver

Agirrezabal M, Boldsen S, Hollenstein N. The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations. In Proceedings of the Workshop on Computation and Written Language (CAWL 2023). Toronto: Association for Computational Linguistics. 2023. p. 6-13 https://doi.org/10.18653/v1/2023.cawl-1.2

Author

Agirrezabal, Manex; Boldsen, Sidsel; Hollenstein, Nora. / The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations. Proceedings of the Workshop on Computation and Written Language (CAWL 2023). Toronto: Association for Computational Linguistics, 2023. pp. 6-13

Bibtex

@inproceedings{d9eb7eb754234bcf93ac92ad68b1216a,
title = "The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations",
abstract = "To gain a better understanding of the linguistic information encoded in character-based language models, we probe the multilingual contextual CANINE model. We design a range of phonetic probing tasks in six Nordic languages, including Faroese as an additional zero-shot instance. We observe that some phonetic information is indeed encoded in the character representations, as consonants and vowels can be well distinguished using a linear classifier. Furthermore, results for the Danish and Norwegian language seem to be worse for the consonant/vowel distinction in comparison to other languages. The information encoded in these representations can also be learned in a zero-shot scenario, as Faroese shows a reasonably good performance in the same vowel/consonant distinction task.",
author = "Manex Agirrezabal and Sidsel Boldsen and Nora Hollenstein",
year = "2023",
doi = "10.18653/v1/2023.cawl-1.2",
language = "English",
pages = "6--13",
booktitle = "Proceedings of the Workshop on Computation and Written Language (CAWL 2023)",
publisher = "Association for Computational Linguistics",
}

RIS

TY - GEN

T1 - The Hidden Folk

T2 - Linguistic Properties Encoded in Multilingual Contextual Character Representations

AU - Agirrezabal, Manex

AU - Boldsen, Sidsel

AU - Hollenstein, Nora

PY - 2023

Y1 - 2023

N2 - To gain a better understanding of the linguistic information encoded in character-based language models, we probe the multilingual contextual CANINE model. We design a range of phonetic probing tasks in six Nordic languages, including Faroese as an additional zero-shot instance. We observe that some phonetic information is indeed encoded in the character representations, as consonants and vowels can be well distinguished using a linear classifier. Furthermore, results for the Danish and Norwegian language seem to be worse for the consonant/vowel distinction in comparison to other languages. The information encoded in these representations can also be learned in a zero-shot scenario, as Faroese shows a reasonably good performance in the same vowel/consonant distinction task.

AB - To gain a better understanding of the linguistic information encoded in character-based language models, we probe the multilingual contextual CANINE model. We design a range of phonetic probing tasks in six Nordic languages, including Faroese as an additional zero-shot instance. We observe that some phonetic information is indeed encoded in the character representations, as consonants and vowels can be well distinguished using a linear classifier. Furthermore, results for the Danish and Norwegian language seem to be worse for the consonant/vowel distinction in comparison to other languages. The information encoded in these representations can also be learned in a zero-shot scenario, as Faroese shows a reasonably good performance in the same vowel/consonant distinction task.

U2 - 10.18653/v1/2023.cawl-1.2

DO - 10.18653/v1/2023.cawl-1.2

M3 - Article in proceedings

SP - 6

EP - 13

BT - Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

PB - Association for Computational Linguistics

CY - Toronto

ER -
