The Corpus of Late Modern English Texts, version 3.1

The Corpus of Late Modern English Texts, version 3.1 (CLMET3.1) was created by Hendrik De Smet, Susanne Flach, Hans-Jürgen Diller and Jukka Tyrkkö as an offshoot of a larger project to develop a database of text descriptors (Diller, De Smet & Tyrkkö 2011). CLMET3.1 is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. It incorporates CLMET, CLMETEV, and CLMET3.0, and was compiled following roughly the same principles, that is:

However, compared to the earlier versions, it comes with a number of important improvements (in addition to being substantially bigger):

The following table summarises the corpus make-up:

Sub-period   Number of authors   Number of texts   Number of words
1710-1780                   51                88        10,480,431
1780-1850                   70                99        11,285,587
1850-1920                   91               146        12,620,207
TOTAL                      212               333        34,386,225

The corpus covers five major genres: narrative fiction, narrative non-fiction, drama, letters and treatises, in addition to a number of unclassified texts. The number of words per genre in each sub-period is as follows:

Genre                   1710-1780   1780-1850   1850-1920
Narrative fiction       4,642,670   4,830,718   6,311,301
Narrative non-fiction   1,863,855   1,940,245     958,410
Drama                     407,885     347,493     607,401
Letters                 1,016,745     714,343     479,724
Treatise                1,114,521   1,692,992   1,782,124
Other                   1,434,755   1,759,796   2,481,247
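
The two tables above can be cross-checked against each other, since the genre counts in each sub-period should add up to that sub-period's word total. The following is a minimal sketch in Python, with the figures copied from the tables; the names GENRE_WORDS, PERIOD_TOTALS, column_sums and genre_shares are illustrative choices and not part of the corpus distribution.

```python
# Word counts per genre, copied from the genre table above.
# Tuple order: 1710-1780, 1780-1850, 1850-1920.
GENRE_WORDS = {
    "Narrative fiction": (4_642_670, 4_830_718, 6_311_301),
    "Narrative non-fiction": (1_863_855, 1_940_245, 958_410),
    "Drama": (407_885, 347_493, 607_401),
    "Letters": (1_016_745, 714_343, 479_724),
    "Treatise": (1_114_521, 1_692_992, 1_782_124),
    "Other": (1_434_755, 1_759_796, 2_481_247),
}

PERIODS = ("1710-1780", "1780-1850", "1850-1920")

# Word totals per sub-period, copied from the make-up table above.
PERIOD_TOTALS = (10_480_431, 11_285_587, 12_620_207)


def column_sums(table):
    """Sum the genre counts separately for each sub-period."""
    return tuple(sum(column) for column in zip(*table.values()))


def genre_shares(table, period_index):
    """Return each genre's share of the words in one sub-period."""
    total = sum(row[period_index] for row in table.values())
    return {genre: row[period_index] / total for genre, row in table.items()}


if __name__ == "__main__":
    sums = column_sums(GENRE_WORDS)
    for period, genre_sum, total in zip(PERIODS, sums, PERIOD_TOTALS):
        status = "consistent" if genre_sum == total else "MISMATCH"
        print(f"{period}: {genre_sum:,} vs {total:,} ({status})")
```

With the figures as published, the genre counts match the sub-period totals in all three columns. The genre_shares helper gives proportions rather than raw counts, which is often the more useful view when comparing sub-periods of different sizes.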

License and Download

CLMET3.1 is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

To download the corpus, simply click here (zip-archive, 687 MB, md5sum 1be90a10316bffa33a452d8281a51a28, sha1sum 590d575ad26b3ea5c4c0866d06db028d26b03045).

References:

Diller, H.-J., De Smet, H. & Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger 19, 21-35.