ROCO

Full Name
Romanian journalistic corpus
Composer
ELRA
Language
Romanian
Language (details)
Romanian, Moldavian
Register
Written
Genre
Newspaper
Style
Formal
Period
2000-2100 AD
1900-2000 AD
Number of words
2,000,000 - 10,000,000
Number of words (details)
7.1 million tokens, 231,626 types
Annotation
Discourse and text linguistic annotation
Lemmatisation
POS tagging
Tokenization
Annotation remarks

The corpus contains morphosyntactic information (MSD annotations) assigned automatically by the TTL tagger, which implements the tiered tagging methodology and has an estimated accuracy of 98%. About 20% of the MSD annotations have been manually checked, validated and, where necessary, corrected. The MSDs follow the Multext-East specifications, which provide 614 different MSDs for Romanian; the tagset has been slightly modified by adding new tags for named entities. The corpus was first segmented, then PoS-tagged and lemmatized with the TTL processing chain. The corpus is XML-encoded, and each file includes metadata (cesHeader).
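
As a rough illustration of how token-level annotation of this kind (word form, lemma, MSD) might be read and summarised, here is a minimal Python sketch. The element and attribute names used (`s`, `w`, `lemma`, `ana`), the sample sentences, and the MSD strings are assumptions made for the example; they are not taken from the ROCO files, whose actual schema should be checked in the distribution and its cesHeader.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample. The tag and attribute names (<s>, <w>, lemma, ana)
# and the MSD strings are illustrative assumptions, not the real ROCO schema.
SAMPLE = """<text>
  <s>
    <w lemma="casă" ana="Ncfsry">casa</w>
    <w lemma="fi" ana="Vmip3s">este</w>
    <w lemma="mare" ana="Afpfsrn">mare</w>
  </s>
  <s>
    <w lemma="casă" ana="Ncfsry">Casa</w>
    <w lemma="fi" ana="Vmip3s">este</w>
    <w lemma="nou" ana="Afpfsrn">nouă</w>
  </s>
</text>"""


def corpus_stats(xml_text):
    """Return (token count, type count, MSD frequency table) for one document."""
    root = ET.fromstring(xml_text)
    words = list(root.iter("w"))
    # Tokens are the word elements as they occur in the text.
    tokens = [w.text for w in words]
    # Types are distinct word forms; case is folded here, which is a choice.
    types = {t.lower() for t in tokens}
    # Frequency of each MSD string stored in the assumed "ana" attribute.
    msd_counts = Counter(w.get("ana") for w in words)
    return len(tokens), len(types), msd_counts


n_tokens, n_types, msd_counts = corpus_stats(SAMPLE)
print(n_tokens, n_types)
print(msd_counts.most_common())
```

The same counting, applied to every file in the corpus, is how figures such as a token count and a type count are usually obtained. Whether case is folded and whether punctuation counts as a token both affect the totals, so the numbers produced by a script like this need not match the published ones.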

Format
Download
Availability
Commercial