The Corpus of Late Modern English Texts, version 3.1 (CLMET3.1), was created by Hendrik De Smet, Susanne Flach, Hans-Jürgen Diller and Jukka Tyrkkö as an offshoot of a larger project developing a database of text descriptors (Diller, De Smet & Tyrkkö 2011). CLMET3.1 is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. It incorporates CLMET, CLMETEV, and CLMET3.0, and was compiled following roughly the same principles, that is:
However, compared to the earlier versions, it comes with a number of important improvements (in addition to being substantially bigger):
The following table summarises the corpus make-up:
Sub-period | Number of authors | Number of texts | Number of words
1710-1780 | 51 | 88 | 10,480,431
1780-1850 | 70 | 99 | 11,285,587
1850-1920 | 91 | 146 | 12,620,207
TOTAL | 212 | 333 | 34,386,225
The corpus covers five major genres: narrative fiction, narrative non-fiction, drama, letters and treatises, in addition to a number of unclassified texts (listed as "Other"). The breakdown by genre and sub-period, in number of words, is as follows:
Genre | 1710-1780 | 1780-1850 | 1850-1920
Narrative fiction | 4,642,670 | 4,830,718 | 6,311,301
Narrative non-fiction | 1,863,855 | 1,940,245 | 958,410
Drama | 407,885 | 347,493 | 607,401
Letters | 1,016,745 | 714,343 | 479,724
Treatise | 1,114,521 | 1,692,992 | 1,782,124
Other | 1,434,755 | 1,759,796 | 2,481,247
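The two tables can be cross-checked against each other: the genre figures in each column should add up to the word count given for that sub-period. The following is a minimal sketch of that check in Python; the variable names are illustrative only, and the figures are copied from the tables above.

```python
# Word counts per genre, copied from the genre table.
# Tuple order: 1710-1780, 1780-1850, 1850-1920.
genre_words = {
    "Narrative fiction": (4_642_670, 4_830_718, 6_311_301),
    "Narrative non-fiction": (1_863_855, 1_940_245, 958_410),
    "Drama": (407_885, 347_493, 607_401),
    "Letters": (1_016_745, 714_343, 479_724),
    "Treatise": (1_114_521, 1_692_992, 1_782_124),
    "Other": (1_434_755, 1_759_796, 2_481_247),
}

# Word counts per sub-period, copied from the corpus make-up table.
period_totals = (10_480_431, 11_285_587, 12_620_207)

# Sum each column of the genre table (one column per sub-period).
column_sums = tuple(sum(column) for column in zip(*genre_words.values()))
grand_total = sum(column_sums)

print("Column sums match sub-period totals:", column_sums == period_totals)
print("Grand total:", f"{grand_total:,}")
```

Running the sketch confirms that the genre breakdown accounts for every word in the corpus, including the unclassified texts.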
CLMET3.1 is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The corpus is available for download as a zip archive (687 MB; md5sum 1be90a10316bffa33a452d8281a51a28, sha1sum 590d575ad26b3ea5c4c0866d06db028d26b03045).
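After downloading, the archive can be checked against the published checksums. The following is a minimal sketch using Python's standard `hashlib` module; the file name `clmet3_1.zip` is an assumption (the actual name of the downloaded archive may differ), and the function names are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums as published on this page.
EXPECTED_MD5 = "1be90a10316bffa33a452d8281a51a28"
EXPECTED_SHA1 = "590d575ad26b3ea5c4c0866d06db028d26b03045"


def file_digests(path, chunk_size=1 << 20):
    """Return (md5, sha1) hex digests of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so the 687 MB archive is never held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def verify_download(path):
    """Return True only if both digests match the published values."""
    md5_hex, sha1_hex = file_digests(path)
    return md5_hex == EXPECTED_MD5 and sha1_hex == EXPECTED_SHA1


if __name__ == "__main__":
    archive = Path("clmet3_1.zip")  # hypothetical file name; adjust as needed
    if archive.exists():
        print("checksums match" if verify_download(archive) else "checksum mismatch")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.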
Diller, H., De Smet, H. & Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger 19, 21-35.