Corpus

CLARE is a general corpus of written Latin texts, collected in electronic format and enriched with XML tags that facilitate their exploitation and analysis. The texts are imported from different Latin libraries on the Web and belong to genres as diverse as apology, biography, comedy, didactic, doctrinal, epic, epistolary, essay, fable, history, legislative, lyric, mythology, novel, oratory, philosophy, satire, and tragedy. In total, the corpus contains ca. 19.5 million words.

CLARE is pre-annotated on several levels: tokens, lemmas, morpho-syntactic features (e.g. case and number), parts of speech, and sentence boundaries. Tokenization, lemmatization, and PoS tagging were carried out with TreeTagger (Schmid, 1994, 1995) using Gabrielle Bandolini's parameters, and sentence boundary detection was carried out with CLTK (Kyle P. Johnson et al., 2014-2017).

CLARE is encoded in the CWB format (CWB, 2010) and can be queried with the Corpus Query Processor (CQP) (Evert, 2005).
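
To make the annotation layout more concrete, the following Python sketch reads a small sample in the "vertical" text format that CWB corpora are typically built from (one token per line, tab-separated attributes, XML-style tags on their own lines) and filters tokens by attribute. It is only an illustration under stated assumptions: the sample text, the column order (word, pos, lemma), the tag values, and the helper names `parse_vrt` and `find_tokens` are hypothetical and are not taken from CLARE itself. The filter only loosely mimics a CQP token query such as `[pos="V" & lemma="su.*"]`; it does not call CQP.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical sample in CWB-style vertical format. The column order and the
# tag values (N, V, ADJ) are assumptions; CLARE's real tagset may differ.
SAMPLE_VRT = """<text id="sample">
<s>
Gallia\tN\tGallia
est\tV\tsum
omnis\tADJ\tomnis
divisa\tV\tdivido
</s>
<s>
arma\tN\tarma
cano\tV\tcano
</s>
</text>
"""

# Assumed names of the tab-separated token attributes, in column order.
COLUMNS = ("word", "pos", "lemma")

Token = Dict[str, str]


def parse_vrt(text: str) -> List[List[Token]]:
    """Split vertical text into sentences, each a list of token dicts."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("<"):
            # Structural tag: a closing </s> ends the current sentence;
            # every other tag (<text>, <s>, ...) is skipped in this sketch.
            if line.strip() == "</s>" and current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(COLUMNS, line.split("\t"))))
    if current:  # tokens after the last </s>, if any
        sentences.append(current)
    return sentences


def find_tokens(sentences: List[List[Token]], **patterns: str) -> List[Tuple[int, Token]]:
    """Return (sentence index, token) pairs whose attributes fully match
    every given regular expression, e.g. pos="V", lemma="su.*"."""
    compiled = {attr: re.compile(rx) for attr, rx in patterns.items()}
    hits = []
    for s_idx, sentence in enumerate(sentences):
        for token in sentence:
            if all(rx.fullmatch(token.get(attr, "")) for attr, rx in compiled.items()):
                hits.append((s_idx, token))
    return hits


if __name__ == "__main__":
    corpus = parse_vrt(SAMPLE_VRT)
    for s_idx, token in find_tokens(corpus, pos="V"):
        print(s_idx, token["word"], token["lemma"])
```

Whole-string matching (`fullmatch`) is used because CQP regular expressions also have to match the entire attribute value. For real work on the encoded corpus, CQP itself is the appropriate tool, since it uses the CWB indexes rather than scanning the text.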