ROMBAC

Full Name
Romanian balanced corpus
Composer
Research Institute for Artificial Intelligence of the Romanian Academy
Language
Romanian
Register
Written
Genre
Academic
Legislative
Newspaper
Other
Prose
Style
Formal
Period
2000-2100 AD
1900-2000 AD
Number of words
10,000,000 - 100,000,000
Number of words (details)
41,000,000 words
Annotation
Discourse and text linguistic annotation
POS tagging
Annotation remarks

The corpus is annotated at the paragraph, sentence, constituent-group and word levels. It provides morpho-syntactic descriptions (MSDs), assigned automatically by the TTL tagger, which implements the tiered tagging methodology and has an accuracy of at least 98%. About 20% of the MSDs have been manually checked, validated and, where necessary, corrected. The MSDs follow the MULTEXT-East specifications. For Romanian there are 614 different MSDs; the tagset has been slightly modified by adding new tags for named entities. The corpus is XML-encoded.
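
As an illustration of how an MSD can be read, the following is a minimal Python sketch, assuming the usual MULTEXT-East positional layout: the first character is the category code, each later character is an attribute value at a fixed position, and "-" marks a position that does not apply. The function name split_msd and the shortened category table are hypothetical and are not part of the corpus or of the TTL tagger; the new named-entity tags mentioned above are not covered.

```python
# Illustrative subset of MULTEXT-East category codes (assumption: not the full tagset).
CATEGORIES = {
    "N": "Noun",
    "V": "Verb",
    "A": "Adjective",
    "P": "Pronoun",
    "R": "Adverb",
    "S": "Adposition",
    "C": "Conjunction",
    "M": "Numeral",
}


def split_msd(msd):
    """Split a positional MSD string into (category, [(position, code), ...]).

    Positions are 1-based and count from the character after the category
    code. Positions holding "-" (attribute not applicable) are skipped.
    """
    if not msd:
        raise ValueError("empty MSD")
    category = CATEGORIES.get(msd[0], "Unknown")
    attributes = [
        (position, code)
        for position, code in enumerate(msd[1:], start=1)
        if code != "-"
    ]
    return category, attributes
```

For example, split_msd("Afp-p-n") returns ("Adjective", [(1, "f"), (2, "p"), (4, "p"), (6, "n")]). The meaning of each position depends on the category and the language, and has to be looked up in the MULTEXT-East specifications for Romanian.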

Format
Online
Availability
Commercial