What is STO?

A computational lexicon is a collection of lexical data intended for computational applications, as opposed to dictionaries intended primarily for human use. As a computational lexicon, STO is a basic lexical component that can be used in systems that process Danish.

Some obvious application areas are online information retrieval, automatic and machine-aided translation, QA systems, linguistic components for people with disabilities, and applications for language teaching. STO data are also valuable in computational linguistics research, for example for testing grammars.

STO contents

The vocabulary consists mainly of general-language words (about 68,000) and is based primarily on a newspaper corpus.

The technical vocabulary (about 13,500 words) originates from six selected subject fields: computer terminology (IT), environment, health, finance, administration, and trade & industry.

Information types in STO:

  • Morphology: part of speech, inflection, spelling variants and, for nouns, information about compounds.
  • Syntax: the constructions in which the word can occur and, for verbs, the auxiliary verbs they take. Each construction pattern is illustrated with a prototypical corpus example.
  • Semantics: descriptions vary in level of detail and include ontological type, semantic relations, argument structure, selectional restrictions, etc.
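
To make the three layers concrete, here is a minimal Python sketch of how a single entry carrying these information types might be modelled. The class names, field names, and sample values are hypothetical and do not reflect STO's actual data format; the sketch only mirrors the information types listed above, with syntax and semantics optional because not every word is described at every level.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Morphology:
    part_of_speech: str
    inflection: str                              # hypothetical pattern label
    spelling_variants: List[str] = field(default_factory=list)
    compound_info: Optional[str] = None          # nouns only


@dataclass
class ConstructionPattern:
    pattern: str                                 # hypothetical pattern label
    corpus_example: str                          # prototypical corpus example


@dataclass
class Syntax:
    patterns: List[ConstructionPattern]
    auxiliary: Optional[str] = None              # verbs only


@dataclass
class Semantics:
    ontological_type: str
    semantic_relations: List[str] = field(default_factory=list)
    argument_structure: Optional[str] = None
    selectional_restrictions: List[str] = field(default_factory=list)


@dataclass
class LexicalEntry:
    lemma: str
    morphology: Morphology                       # every entry has morphology
    syntax: Optional[Syntax] = None
    semantics: Optional[Semantics] = None

    def description_level(self) -> str:
        """Name the deepest level of description present on this entry."""
        if self.semantics is not None:
            return "morphology + syntax + semantics"
        if self.syntax is not None:
            return "morphology + syntax"
        return "morphology"


# Invented sample entry; the values are placeholders, not STO data.
entry = LexicalEntry(
    lemma="hus",
    morphology=Morphology(part_of_speech="noun", inflection="placeholder"),
)
print(entry.description_level())
```

Keeping the layers as separate optional objects lets an application check how deeply a given word is described before relying on syntactic or semantic information.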

Tables showing the composition and linguistic description of the vocabulary:

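| POS | Words | Morphology | Morphology + syntax | Morphology + syntax + semantics |
|---|---|---|---|---|
| Noun | 64735 | 100% | 53% | 12% |
| Adjective | 9773 | 100% | 68% | 13% |
| Verb | 5775 | 100% | 98% | 17% |
| Adverb | 771 | 100% | *0% | |
| Interjection | 158 | 100% | 0% | |
| Preposition | 80 | 100% | 0% | |
| Conjunction | 60 | 100% | 0% | |
| Pronoun | 44 | 100% | 0% | |
| Other | 128 | 100% | 0% | |
| Total | 81524 | | | |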

Table 1. The total vocabulary distributed over parts of speech with information about the extent of the linguistic specification

*19% in preparation
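
To get a feel for what the percentages in Table 1 mean in absolute terms, they can be multiplied by the word counts. The short Python sketch below does this for the three parts of speech that have syntactic and semantic descriptions. Because the percentages in the table are rounded, the results are rough estimates only, not figures reported by STO; the function name `estimate` is illustrative.

```python
# Word counts and coverage percentages copied from Table 1:
# (words, % with syntax, % with semantics)
table1 = {
    "Noun": (64735, 53, 12),
    "Adjective": (9773, 68, 13),
    "Verb": (5775, 98, 17),
}


def estimate(words: int, percent: int) -> int:
    """Approximate number of words covered, rounded to a whole word."""
    return round(words * percent / 100)


for pos, (words, syntax_pct, semantics_pct) in table1.items():
    print(
        f"{pos}: roughly {estimate(words, syntax_pct)} words with syntax, "
        f"roughly {estimate(words, semantics_pct)} with semantics"
    )
```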

| POS | Words |
|---|---|
| Noun | 52840 |
| Adjective | 8568 |
| Verb | 5410 |
| Adverb | 771 |
| Interjection | 158 |
| Preposition | 80 |
| Conjunction | 60 |
| Pronoun | 44 |
| Other | 128 |
| Total | 68059 |

Table 2. The general vocabulary, distributed over parts of speech

| Subject field | Noun | Verb | Adjective | Total |
|---|---|---|---|---|
| IT | 1730 | 160 | 115 | 2005 |
| Environment | 1770 | 50 | 300 | 2120 |
| Trade & Industry | 1800 | 60 | 160 | 2020 |
| Administration | 2430 | 25 | 220 | 2675 |
| Health | 2285 | 40 | 250 | 2575 |
| Finance | 1880 | 30 | 160 | 2070 |
| Total | 11895 | 365 | 1205 | 13465 |

Table 3. Vocabulary from each subject field, distributed over parts of speech
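
The three tables can be cross-checked against each other: for each part of speech, the general vocabulary (Table 2) plus the technical vocabulary summed over the subject fields (Table 3) should equal the total vocabulary (Table 1). A minimal Python sketch of that check, using only the per-row figures printed in the tables (the variable names are illustrative):

```python
# Words per part of speech in the total vocabulary (Table 1).
total = {
    "Noun": 64735, "Adjective": 9773, "Verb": 5775, "Adverb": 771,
    "Interjection": 158, "Preposition": 80, "Conjunction": 60,
    "Pronoun": 44, "Other": 128,
}

# Words per part of speech in the general vocabulary (Table 2).
general = {
    "Noun": 52840, "Adjective": 8568, "Verb": 5410, "Adverb": 771,
    "Interjection": 158, "Preposition": 80, "Conjunction": 60,
    "Pronoun": 44, "Other": 128,
}

# Technical vocabulary per subject field (Table 3).
technical = {
    "IT": {"Noun": 1730, "Verb": 160, "Adjective": 115},
    "Environment": {"Noun": 1770, "Verb": 50, "Adjective": 300},
    "Trade & Industry": {"Noun": 1800, "Verb": 60, "Adjective": 160},
    "Administration": {"Noun": 2430, "Verb": 25, "Adjective": 220},
    "Health": {"Noun": 2285, "Verb": 40, "Adjective": 250},
    "Finance": {"Noun": 1880, "Verb": 30, "Adjective": 160},
}

# Sum the technical vocabulary per part of speech across all subject fields.
technical_by_pos = {}
for counts in technical.values():
    for pos, n in counts.items():
        technical_by_pos[pos] = technical_by_pos.get(pos, 0) + n

# For every part of speech, general + technical should equal the total.
for pos, n_total in total.items():
    assert general[pos] + technical_by_pos.get(pos, 0) == n_total, pos

technical_total = sum(technical_by_pos.values())
print("General vocabulary:  ", sum(general.values()))
print("Technical vocabulary:", technical_total)
print("Total vocabulary:    ", sum(total.values()))
```

The assertions pass for the per-row figures, so the three tables are mutually consistent at the level of individual parts of speech.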