Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

Department of Nordic Studies and Linguistics

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review


Bart Jongejan
Hercules Dalianis

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on full form-lemma pairs for Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish. We obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish compared to a suffix-only lemmatizer. Results for Icelandic deteriorated by 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.
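
To illustrate what a rule covering prefix, infix and suffix changes might look like, here is a minimal Python sketch. It is not the authors' implementation. It assumes that a rule is a wildcard pattern paired with a replacement, where each `*` stands for a stretch of the word that is kept, and that the first matching rule wins. The names `compile_rule`, `apply_rule`, `lemmatize` and `RULES`, as well as the three example rules, are hypothetical and hand-written, not trained rules from the paper.

```python
import re
from typing import List, Optional, Pattern, Tuple

Rule = Tuple[Pattern[str], str]

def compile_rule(pattern: str, replacement: str) -> Rule:
    """Compile a wildcard rule, e.g. ('ge*t', '*en'), into a regex plus template."""
    # Assumption: pattern and replacement carry the same number of wildcards.
    if pattern.count("*") != replacement.count("*"):
        raise ValueError("pattern and replacement need the same number of '*'")
    # Each '*' becomes a capturing group for a non-empty kept stretch.
    literals = pattern.split("*")
    regex = "^" + "(.+)".join(re.escape(part) for part in literals) + "$"
    return re.compile(regex), replacement

def apply_rule(rule: Rule, word: str) -> Optional[str]:
    """Return the rewritten word, or None if the rule does not match."""
    regex, replacement = rule
    match = regex.match(word)
    if match is None:
        return None
    kept = iter(match.groups())
    # Copy literal characters; fill each '*' with the next kept stretch.
    return "".join(next(kept) if ch == "*" else ch for ch in replacement)

def lemmatize(word: str, rules: List[Rule]) -> str:
    """Apply the first matching rule; fall back to the word itself."""
    for rule in rules:
        result = apply_rule(rule, word)
        if result is not None:
            return result
    return word

# Hypothetical hand-written rules, ordered from most to least specific.
RULES = [
    compile_rule("*ge*t", "**en"),  # infix + suffix: afgewerkt -> afwerken
    compile_rule("ge*t", "*en"),    # prefix + suffix: gesagt -> sagen
    compile_rule("*ed", "*"),       # suffix only: walked -> walk
]

for form in ["afgewerkt", "gesagt", "walked"]:
    print(form, "->", lemmatize(form, RULES))
```

A suffix-only lemmatizer corresponds to restricting every rule to the shape of the third one. The first two rules show the kind of change such a lemmatizer cannot express.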

Original language	English
Title of host publication	Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
Number of pages	9
Volume	1
Publisher	Association for Computational Linguistics
Publication date	2009
Pages	145-153
ISBN (Print)	978-1-932432-61-9
ISBN (Electronic)	1-932432-61-2
Publication status	Published - 2009
Event	ACL-IJCNLP 2009 - Singapore, Singapore Duration: 2 Aug 2009 → 7 Aug 2009 Conference number: 47

Conference

Conference	ACL-IJCNLP 2009
Number	47
Country	Singapore
City	Singapore
Period	02/08/2009 → 07/08/2009

Research areas

Faculty of Humanities - lemmatization, morphology, affix


ID: 14093025

Centre for Language Technology