Semantic Processing across Domains
The project is closed.
Next-generation information technology will rely on adequate semantic processing. Extending such technology to Danish requires semantically annotated data, but also methods that are more robust to data scarcity and domain shifts than current state-of-the-art methods.
The project partners developed scalable sense inventories for Danish on the basis of existing lexical resources (The Danish Dictionary and the Danish wordnet, DanNet) and provided semantic corpus annotations of Danish texts.
The project went beyond the state of the art in semantic processing by developing machine learning methods that require less data and are less sensitive to domain shifts.
The semantic models induced from such data by the developed methods were evaluated in a semantic search engine on the national dictionary site http://ordnet.dk, developed at the Society for Danish Language and Literature, as well as in a Danish question-answering platform developed by the University of Copenhagen and the Technical University of Denmark in the ESICT project.
We launch the SemDaX corpus (https://github.com/kuhumcst/semdax), a recently completed human-annotated Danish corpus that relies on the combined wordnet and dictionary resources DanNet and Den Danske Ordbog and is available through a CLARIN academic license. The corpus comprises approx. 90,000 words from six textual domains and is annotated with sense inventories of different granularity.
The aim of the developed corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish.
To these ends, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than is normally seen: 60% of the material for the all-words task and 100% for the lexical sample task. The corpus includes not only the curated files but also the diverging annotations. In other words, we do not consider all disagreement to be noise; rather, we take it to contain valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.
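As an illustration of how agreement on double-annotated material can be scored, the sketch below computes Cohen's kappa for two annotators and lists the tokens on which they diverge. The project description does not state which agreement coefficient was used, so kappa is an assumption here, and the tokens and supersense-style labels are hypothetical examples rather than data from SemDaX.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty token sequence.")
    n = len(labels_a)
    # Observed agreement: share of tokens given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disagreements(tokens, labels_a, labels_b):
    """Return (token, label_a, label_b) for every token the annotators labelled differently."""
    return [(t, a, b) for t, a, b in zip(tokens, labels_a, labels_b) if a != b]


# Hypothetical example: four tokens labelled by two annotators.
tokens = ["lærer", "stol", "løbe", "elev"]
annotator_1 = ["noun.person", "noun.artifact", "verb.motion", "noun.person"]
annotator_2 = ["noun.person", "noun.object", "verb.motion", "noun.person"]

kappa = cohen_kappa(annotator_1, annotator_2)
diverging = disagreements(tokens, annotator_1, annotator_2)
```

Keeping the output of `disagreements` alongside the score mirrors the idea of retaining diverging annotations instead of discarding them as noise.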
The methodological considerations behind the corpus are presented in: Bolette S. Pedersen, Anna Braasch, Anders Johannsen, Héctor Martínez Alonso, Sanni Nimb, Sussi Olsen, Anders Søgaard, Nicolai Hartvig Sørensen (LREC 2016): The SemDaX corpus – sense annotations with scalable sense inventories.
Sanni Nimb (2018). The Danish FrameNet Lexicon: method and lexical coverage. In Proceedings of the International FrameNet Workshop at LREC 2018, Miyazaki, Japan.
Pedersen, B. S., Nimb, S., Søgaard, A., Hartmann, M., & Olsen, S. (2018). A Danish FrameNet Lexicon and an Annotated Corpus Used for Training and Evaluating a Semantic Frame Classifier. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference, Miyazaki, Japan.
Pedersen, B. S., Nimb, S., Olsen, S., & Sørensen, N. H. (2018). Combining Dictionaries, Wordnets and other Lexical Resources - Advantages and Challenges. In Globalex Proceedings 2018, Miyazaki, Japan.
Pedersen, B. S. (2018). Semantisk processering og leksikografi. In Nordiske Studier i leksikografi, Skrifter udgivet af Nordisk Forening for Leksikografi.
Pedersen, B. S., Aguirrezabal Zabaleta, M., Nimb, S., Olsen, S., & Rørmann, I. (2018). Towards a principled approach to sense clustering – a case study of wordnet and dictionary senses in Danish. In Proceedings of Global WordNet Conference 2018 Singapore.
Nimb, S., Braasch, A., Olsen, S., Pedersen, B. S., & Søgaard, A. (2017). From Thesaurus to Framenet. In I. Kosem, C. Tiberius, M. Jakubíček, J. Kallas, S. Krek, & V. Baisa (Eds.), Electronic Lexicography in the 21st Century: Proceedings of eLex 2017 conference (pp. 1-22). Lexical Computing CZ.
Pedersen, B. S. (2017). Leksikografisk viden som væsentlig medspiller i ny, intelligent teknologi. In Bók Jógvan (pp. 351-371). Torshavn: Faroe University Press. Annales Societatis Scientiarum Færoensis Supplementum 68.
Augenstein, Isabelle; Søgaard, Anders. 2017. Multi-task learning of keyphrase boundary classification. The 55th Annual Meeting of the Association for Computational Linguistics (ACL). Vancouver, Canada.
Levy, Omer; Søgaard, Anders; Goldberg, Yoav. 2017. A strong baseline for learning cross-lingual word embeddings from sentence alignments. The 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Valencia, Spain.
Martínez Alonso, Héctor; Anders Johannsen; Sanni Nimb; Sussi Olsen; Bolette Sandford Pedersen. 2016. An empirically grounded expansion of the supersense inventory. In Proceedings of Global Wordnet Conference 2016.
Nimb, Sanni. 2016. Der er ikke langt fra tanke til handling. In Simon Skovgaard Boeck & Henrik Blicher (eds.): Danske Studier 2016, Copenhagen, Universitets-Jubilæets danske Samfund 2016, pp. 25-59.
Nimb, Sanni. 2016. Semantic Processing across Domains. In Det Danske Sprog- og Litteraturselskabs årsberetning DSL 2015-16, pp. 65-67.
Nimb, Sanni; Bolette Sandford Pedersen. 2016. Fra begrebsordbog til sprogteknologisk ressource: verber, semantiske roller og rammer – et pilotstudie. In Nordiske Studier i Leksikografi, Vol. 13, Copenhagen, Denmark.
Pedersen, Bolette Sandford; Nimb, Sanni; Braasch, Anna; Olsen, Sussi. 2016. Betydningsinventarer – i ordbøger og i løbende tekst. In Nordiske Studier i Leksikografi, Vol. 13, Copenhagen, Denmark.
Pedersen, Bolette Sandford; Braasch, Anna; Johannsen, Anders Trærup; Martínez Alonso, Héctor; Nimb, Sanni; Olsen, Sussi; Søgaard, Anders; Sørensen, Nicolai. 2016. The SemDaX Corpus - sense annotations with scalable sense inventories. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference. Portorož, Slovenia.
Søgaard, Anders. 2016. Evaluating word embeddings with fMRI and eye-tracking. In RepEval, The 54th Annual Meeting of the Association for Computational Linguistics (ACL). Berlin, Germany.
Gouws, Stephan; Søgaard, Anders. 2015. Simple task-specific bilingual word embeddings. In North American Chapter of the Association for Computational Linguistics (NAACL). Denver, CO.
Johannsen, Anders; Héctor Martínez Alonso; Anders Søgaard. 2015. Any-language frame-semantic parsing. In Proceedings of EMNLP 2015.
Martínez Alonso, Héctor; Anders Johannsen; Sussi Olsen; Sanni Nimb; Nicolai Hartvig Sørensen; Anna Braasch; Anders Søgaard; Bolette Sandford Pedersen. 2015. Supersense tagging for Danish. In Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015, Linköping Electronic Conference Proceedings #109, ACL Anthology, Linköping University Electronic Press, Sweden.
Martínez Alonso, Héctor; Barbara Plank; Anders Johannsen; Anders Søgaard. 2015. Active learning for sense annotation. In Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015, Linköping Electronic Conference Proceedings #109, ACL Anthology, Linköping University Electronic Press, Sweden.
Olsen, Sussi; Bolette Sandford Pedersen; Héctor Martínez Alonso; Anders Johannsen. 2015. Coarse-grained sense annotation of Danish across textual domains. In Proceedings of the Workshop on Semantic resources and Semantic Annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015, Linköping University Electronic Press, Sweden.
Pedersen, Bolette Sandford (Editor); Olsen, Sussi (Editor); Borin, Lars (Editor): Proceedings of the Workshop on Semantic resources and Semantic Annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015. Linköping University Electronic Press, 2015. 43 pp. (Linköping Electronic Conference Proceedings).
Pedersen, Bolette Sandford; Nimb, Sanni; Olsen, Sussi. 2015. Eksperimenter med et skalérbart betydningsinventar til semantisk opmærkning af dansk. In Rette ord: Festskrift til Sabine Kirchmeier-Andersen i anledning af 60-årsdagen. Eds. Dorthe Duncker; Eva Skafte Jensen; Ole Ravnholt. Vol. 46, Dansk Sprognævns skrifter, pp. 247-261.
Fromreide, Hege; Søgaard, Anders. 2014. NER in tweets using bagging and a small crowdsourced dataset. In The 9th International Conference on Natural Language Processing (PolTAL), Lecture Notes in Computer Science, Vol. 8686, Springer.
Fromreide, Hege; Søgaard, Anders. 2014. Crowdsourcing and annotating NER for Twitter #drift. In Proceedings of Language Resources and Evaluation Conference 2014. ELRA, Reykjavik, Iceland.
Johannsen, Anders; Hovy, Dirk; Martínez Alonso, Héctor; Søgaard, Anders. 2014. More or less supervised super-sense tagging of Twitter. In The 3rd Joint Conference on Lexical and Computational Semantics (*SEM). Dublin, Ireland. Received Best Paper Award.
Pedersen, Bolette Sandford; Sanni Nimb; Sussi Olsen; Anders Søgaard; Nicolai Sørensen. 2014. Semantic Annotation of the Danish CLARIN Reference Corpus. In Proceedings of the isa-10, 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, LREC 2014, ELRA, Reykjavik, Iceland.
Søgaard, Anders; Johannsen, Anders; Plank, Barbara; Hovy, Dirk; Martínez Alonso, Héctor. 2014. What is in a p-value in NLP? In The 18th Conference on Computational Natural Language Learning (CoNLL). Baltimore, MD.
Workshop at NODALIDA 2015: Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities
Vilnius, Lithuania, May 11, 2015
Although language resources covering English tend to receive most attention in the LT community, recent years have shown an increased interest in developing lexical semantic resources and semantically annotated corpora for lesser-resourced languages as well, including the languages of the Nordic and Baltic region. Nevertheless, the lack of high-quality semantic resources with sufficient coverage still proves to be a serious bottleneck, not only in purely rule-based NLP applications but also in supervised corpus-based approaches. In the Digital Humanities, too, there is an increased interest in and need for semantic annotation, which would enable more refined search in, and better visualization and analysis of, large-scale corpus data.
This workshop focuses in particular on the interplay between lexical-semantic resources, as represented by wordnets, framenets, propbanks, and others, and their relation to practical corpus annotation. The workshop – a follow-up on the successful Nodalida 2009 and 2013 workshops on semantic resources – intends to bring together researchers involved in building and integrating semantic resources (lexicons and corpora) as well as researchers who apply these resources for semantic processing. Researchers with a more theoretical interest in the interplay between lexical semantics, lexicography, corpus linguistics and Digital Humanities are also welcome.
Talks and presentations
Invited talk: Johan Bos: Issues in Parallel Meaning Banking
Magnus Norrby and Pierre Nugues: Extraction of Lethal Events from Wikipedia and a Semantic Repository
Sussi Olsen, Bolette Pedersen, Héctor Martínez Alonso and Anders Johannsen: Coarse-Grained Sense Annotation of Danish across Textual Domains
Natalia Loukachevitch and Ilia Chetviorkin: Determining the Most Frequent Senses Using Russian Linguistic Ontology RuThes
Karin Friberg Heppin and Dana Dannélls: Polysemy and questions of lumping or splitting in the construction of Swedish FrameNet
Lars Borin, Luis Nieto Piña and Richard Johansson: Here be dragons? The perils and promises of inter-resource lexical-semantic mapping
Workshop on Semantic Annotation and Processing 2014
Copenhagen, November 3, 2014
University of Copenhagen with University of Gothenburg
Co-funded by the Danish Research Council via the project Semantic Processing across Domains
Bolette S. Pedersen: Semantic annotation of the Danish CLARIN Reference Corpus
Yvonne Adesam, Gerlof Bouma, Lars Borin, Markus Forsberg, Richard Johansson: The Koala project
Anders Johannsen & Héctor Martínez Alonso: Cross-domain and cross-language super sense tagging
Lars Borin, Dana Dannélls, Markus Forsberg, Maria Toporowska Gronostaj, Karin Friberg Heppin, Richard Johansson, Dimitrios Kokkinakis: The Swedish FrameNet
Anders Søgaard: Semantic parsing for the 99%
- Professor Pierre Nugues, Lund University, Sweden
- Associate prof. Christina Lioma, University of Copenhagen, Denmark
- Associate prof. Eneko Agirre, University of the Basque Country, Spain
- Director Sabine Kirchmeier, Dansk Sprognævn, Denmark
- Senior editor Jørg Asmussen, The Society for Danish Language and Literature, Denmark
Semantic Processing across Domains was funded by the Danish Council for Independent Research | Humanities for the period 2013-2017 with a grant of DKK 5.7 million (DFF-1319-00123).
The project was a collaborative project between the University of Copenhagen and The Society for Danish Language and Literature.
Project period: 2013 - 2017
Other project participants
Anders Johannsen, Postdoc, Centre for Language Technology
Héctor Martínez Alonso, Postdoc, Centre for Language Technology
Sussi A. Olsen, Research Associate, Centre for Language Technology
Nicolai Hartvig Sørensen, Senior Editor, The Society for Danish Language and Literature
Ida Hauerberg Wolthers, Student Assistant
Sara Lee Naldal, Student Assistant
Selma Rosenfeldt-Olsen, Student Assistant