For corpus query and analysis, the RSC is encoded in CQP format (cf. the IMS Open Corpus Workbench (CWB)).


The corpus can be searched via the CQPweb server of the Department of Linguistics and Language Technology at Saarland University. Registration with your e-mail address is required but free.

Sample queries:

Helpful links:


The CWB requires a simple XML format as input. In the so-called VRT format (vertical text format), annotations on the token level (positional attributes, e.g. word, pos, lemma) are represented one word per line, with one tab-delimited column for each positional attribute. Annotations beyond the token level (structural attributes, e.g. texts, sentences, pages) are represented as SGML tags with optional attribute-value pairs. Metadata, for example, are encoded as attributes of the <text> element. More information on the annotation of the corpus can be found here.
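To make the layout described above concrete, here is a minimal Python sketch that reads a small VRT fragment. The sample text, the attribute names (`id`, `year`), the tag values, and the column order word/pos/lemma are assumptions made for illustration only; the actual RSC files may use different or additional attributes. The helper `parse_vrt` is hypothetical and not part of the CWB.

```python
import re

# Hypothetical VRT fragment: structural attributes as SGML-style tags,
# one token per line with tab-delimited positional attributes.
SAMPLE_VRT = (
    '<text id="example_1" year="1700">\n'
    "<s>\n"
    "The\tDT\tthe\n"
    "experiments\tNNS\texperiment\n"
    ".\t.\t.\n"
    "</s>\n"
    "</text>\n"
)

TAG_RE = re.compile(r"^<(/?)(\w+)([^>]*)>$")   # a whole line that is one tag
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')       # attribute-value pairs in a tag


def parse_vrt(lines, columns=("word", "pos", "lemma")):
    """Return a list of token dicts, each carrying its <text> metadata.

    Assumes that literal '<' in tokens is escaped in the file, so any line
    that looks like a single tag is treated as a structural attribute.
    """
    tokens = []
    text_meta = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        tag = TAG_RE.match(line)
        if tag:
            closing, name, rest = tag.groups()
            if name == "text":
                # Opening <text ...> sets the metadata, </text> clears it.
                text_meta = {} if closing else dict(ATTR_RE.findall(rest))
            continue  # other structural tags (e.g. <s>) are skipped here
        token = dict(zip(columns, line.split("\t")))
        token["text_meta"] = dict(text_meta)
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    for token in parse_vrt(SAMPLE_VRT.splitlines()):
        print(token["word"], token["pos"], token["lemma"], token["text_meta"])
```

For real work the CWB tools or CQPweb handle this format directly; the sketch is only meant to show how positional and structural attributes sit side by side in the file.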

You can download the corpus as a compressed VRT file (419 MB), which can be imported into the Corpus Workbench or CQPweb.

$ md5sum

You can download the OCR correction tools as an archive of sed source files (15 kB).

$ md5sum rsc-tool-2.0.tar.gz
b410be474c8df77ac952eb0ee1246bbc  rsc-tool-2.0.tar.gz
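Where `md5sum` is not available, the same check can be sketched in Python with the standard `hashlib` module. The function names `md5_of_file` and `verify` are illustrative, and the script assumes the archive has been saved under its original name in the current directory.

```python
import hashlib
from pathlib import Path

# Checksum published above for rsc-tool-2.0.tar.gz.
EXPECTED_MD5 = "b410be474c8df77ac952eb0ee1246bbc"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    archive = Path("rsc-tool-2.0.tar.gz")  # assumed local file name
    if archive.exists():
        print("checksum OK" if verify(archive, EXPECTED_MD5) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

MD5 is used here only because it is the checksum published for the download; it detects transfer corruption but is not a safeguard against deliberate tampering.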


Creative Commons License

The Royal Society Corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

If you use the Royal Society Corpus in your research, please cite:

Kermes, Hannah, Stefania Degaetano, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. “The Royal Society Corpus: From Uncharted Data to Corpus.” In Proceedings of the LREC 2016. Portoroz, Slovenia.

For the OCR correction scripts, please refer to:

Knappen, Jörg, Stefan Fischer, Hannah Kermes, Elke Teich, and Peter Fankhauser. 2017. “The Making of the Royal Society Corpus.” In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language. Göteborg, Sweden: Linköping University Electronic Press.

