What is STO?
A computational lexicon is a collection of lexical data intended for computational applications, as opposed to dictionaries intended primarily for human readers. STO is a computational lexicon for Danish and can therefore serve as a basic lexical component in systems that process Danish.
Some obvious application areas are online information retrieval, automatic and machine-aided translation, question-answering (QA) systems, linguistic components for people with disabilities, and language-teaching applications. STO data are also valuable for research in computational linguistics, for example when testing grammars.
STO contents
The vocabulary consists mainly of general-language words (about 68,000), drawn primarily from a newspaper corpus.
The technical vocabulary (about 13,500 words) originates from six selected subject fields: computer terminology, environment, health, finance, administration, and trade & industry.
Information types in STO:
- Morphology: part of speech, inflection, spelling variants and, for nouns, information about compounds.
- Syntax: the constructions in which the word can occur and, for verbs, the auxiliary verbs they take. Each construction pattern is illustrated with a prototypical corpus example.
- Semantics: descriptions vary in level of detail and include ontological type, semantic relations, argument structure, selectional restrictions, etc.
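The three information types can be pictured as nested layers of a single lexical entry. The Python sketch below is only an illustration under stated assumptions: the class names, field names and the sample entry are hypothetical and do not reflect the actual STO data format, and the example sentence is invented rather than taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Morphology:
    part_of_speech: str
    inflection: str                        # hypothetical inflection-pattern label
    spelling_variants: List[str] = field(default_factory=list)
    compound_info: Optional[str] = None    # only relevant for nouns


@dataclass
class SyntaxPattern:
    construction: str                      # hypothetical construction label
    example: str                           # prototypical example sentence


@dataclass
class Syntax:
    patterns: List[SyntaxPattern]
    auxiliary: Optional[str] = None        # only relevant for verbs


@dataclass
class Semantics:
    ontological_type: str
    relations: Dict[str, str] = field(default_factory=dict)
    argument_structure: List[str] = field(default_factory=list)
    selectional_restrictions: Dict[str, str] = field(default_factory=dict)


@dataclass
class Entry:
    lemma: str
    morphology: Morphology                 # every entry has a morphological layer
    syntax: Optional[Syntax] = None
    semantics: Optional[Semantics] = None

    def description_level(self) -> str:
        """Name the layers present, assuming they are cumulative."""
        if self.syntax is None:
            return "morphology"
        if self.semantics is None:
            return "morphology + syntax"
        return "morphology + syntax + semantics"


# Hypothetical sample entry; all values are placeholders for illustration.
entry = Entry(
    lemma="spise",
    morphology=Morphology(part_of_speech="verb", inflection="pattern-A"),
    syntax=Syntax(
        patterns=[SyntaxPattern("subject + verb + object", "Hun spiser et æble.")],
        auxiliary="have",
    ),
    semantics=Semantics(
        ontological_type="Act",
        argument_structure=["agent", "patient"],
    ),
)

print(entry.lemma, "->", entry.description_level())
```

Making the syntax and semantics layers optional mirrors the fact that every word has a morphological description while only part of the vocabulary is described at the other two levels.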
Tables showing the composition and linguistic description of the vocabulary:
| POS | Words | Morphology | Morphology + syntax | Morphology + syntax + semantics |
|---|---|---|---|---|
| Noun | 64735 | 100% | 53% | 12% |
| Adjective | 9773 | 100% | 68% | 13% |
| Verb | 5775 | 100% | 98% | 17% |
| Adverb | 771 | 100% | *0% | |
| Interjection | 158 | 100% | 0% | |
| Preposition | 80 | 100% | 0% | |
| Conjunction | 60 | 100% | 0% | |
| Pronoun | 44 | 100% | 0% | |
| Other | 128 | 100% | 0% | |
| Total | 81524 | | | |
Table 1. The total vocabulary distributed over parts of speech with information about the extent of the linguistic specification
*19% in preparation
| POS | Words |
|---|---|
| Noun | 52840 |
| Adjective | 8568 |
| Verb | 5410 |
| Adverb | 771 |
| Interjection | 158 |
| Preposition | 80 |
| Conjunction | 60 |
| Pronoun | 44 |
| Other | 128 |
| Total | 68059 |
Table 2. The general vocabulary, distributed over parts of speech
| Subject field | Noun | Verb | Adjective | Total |
|---|---|---|---|---|
| IT | 1730 | 160 | 115 | 2005 |
| Environment | 1770 | 50 | 300 | 2120 |
| Trade & Industry | 1800 | 60 | 160 | 2020 |
| Administration | 2430 | 25 | 220 | 2675 |
| Health | 2285 | 40 | 250 | 2575 |
| Finance | 1880 | 30 | 160 | 2070 |
| Total | 11895 | 365 | 1205 | 13465 |
Table 3. Vocabulary from each subject field, distributed over parts of speech
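The three tables are related: for the parts of speech that occur in the technical vocabulary, the count in Table 1 should equal the general-language count in Table 2 plus the subject-field total in Table 3. The short Python sketch below checks this with the figures copied from the tables; the function and variable names are illustrative only.

```python
# Word counts copied from Tables 1-3 for the parts of speech
# that appear in the technical vocabulary.
total_vocabulary = {"Noun": 64735, "Adjective": 9773, "Verb": 5775}      # Table 1
general_vocabulary = {"Noun": 52840, "Adjective": 8568, "Verb": 5410}    # Table 2
technical_vocabulary = {"Noun": 11895, "Adjective": 1205, "Verb": 365}   # Table 3, Total row


def check_composition(total, general, technical):
    """Return, per part of speech, whether general + technical equals total."""
    return {
        pos: general[pos] + technical[pos] == total[pos]
        for pos in total
    }


result = check_composition(total_vocabulary, general_vocabulary, technical_vocabulary)
for pos, consistent in result.items():
    combined = general_vocabulary[pos] + technical_vocabulary[pos]
    status = "ok" if consistent else "mismatch"
    print(f"{pos}: {general_vocabulary[pos]} + {technical_vocabulary[pos]} = {combined} ({status})")
```

With the figures above, the check passes for nouns, adjectives and verbs. The remaining parts of speech occur only in the general vocabulary, so their counts are identical in Tables 1 and 2.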