ROMBAC | Corpus finder

Full Name

Romanian balanced corpus

Composer

Research Institute for Artificial Intelligence of the Romanian Academy

URL

http://catalogue.elra.info/en-us/repository/browse/ELRA-W0088/

Language

Romanian

Register

Written

Genre

Academic

Legislative

Newspaper

Other

Prose

Style

Formal

Period

2000-2100 AD

1900-2000 AD

Number of words

10,000,000 - 100,000,000

Number of words (details)

41,000,000 words

Annotation

Discourse and text linguistic annotation

POS tagging

Annotation remarks

The corpus is annotated at the paragraph, sentence, constituent-group and word levels. It provides morpho-syntactic descriptions (MSDs), assigned automatically by the high-accuracy TTL tagger (at least 98% accurate), which implements the tiered tagging methodology. About 20% of the MSDs have been manually checked, validated and, where necessary, corrected. The MSDs follow the MULTEXT-East specifications, which define 614 distinct MSDs for Romanian; the tagset has been slightly modified by adding new tags for named entities. The corpus is XML-encoded.
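As a rough illustration of how word-level MSD annotation in an XML-encoded corpus might be read, the sketch below counts tokens by the first character of their MSD, which in MULTEXT-East-style tags identifies the part-of-speech category. The element and attribute names (`s`, `w`, `lemma`, `ana`), the function name `count_msd_categories`, and the sample sentence with its MSD strings are assumptions made for this example; they are not taken from the ROMBAC distribution, whose actual schema may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the tag names, attribute names and MSD values
# below are invented for illustration and are not copied from ROMBAC.
SAMPLE = (
    '<p><s>'
    '<w lemma="copil" ana="Ncmsrn">copil</w> '
    '<w lemma="bun" ana="Afpms-n">bun</w> '
    '<w lemma="merge" ana="Vmip3s">merge</w>'
    '</s></p>'
)


def count_msd_categories(xml_text):
    """Count word tokens by the first character of their MSD.

    Assumes each token is a <w> element carrying its MSD in an
    'ana' attribute (an assumption, not the documented schema).
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for word in root.iter("w"):
        msd = word.get("ana", "")
        if msd:
            # The leading character of an MSD encodes the category,
            # e.g. N for noun, V for verb, A for adjective.
            counts[msd[0]] += 1
    return counts


counts = count_msd_categories(SAMPLE)
print(dict(counts))
```

For a corpus of about 41,000,000 words, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading whole files into memory.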

Format

Online

Availability

Commercial