For corpus query and analysis, the RSC is encoded in CQP format (cf. the IMS Open Corpus Workbench (CWB)).
The corpus can be searched via the CQPweb server of the Department of Linguistics and Language Technology at Saarland University. Registration with your e-mail address is required but free.
The CWB requires a simple XML format as input. In the so-called VRT format (vertical text format), annotations at the token level (positional attributes, e.g. word, pos, lemma) are represented in a one-word-per-line layout with TAB-delimited columns, one for each positional attribute. Annotations beyond the token level (structural attributes, e.g. texts, sentences, pages) are represented as SGML tags with optional attribute-value pairs. Metadata, for example, are encoded as attributes of the <text> element. More information on the annotation of the corpus can be found here.
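As a rough illustration of the format described above, the following Python sketch parses a tiny, made-up VRT fragment. The sample tokens, the attribute names on <text> (id, year), the column order (word, pos, lemma), and the helper parse_vrt are all assumptions for illustration; the actual RSC files may use different columns and metadata attributes.

```python
import re

# Hypothetical VRT fragment (not taken from the RSC): structural attributes
# appear as tags, tokens appear one per line, and the positional attributes
# of each token are separated by TAB characters.
SAMPLE_VRT = (
    '<text id="example_1" year="1700">\n'
    "<s>\n"
    "The\tDT\tthe\n"
    "experiments\tNNS\texperiment\n"
    "succeeded\tVBD\tsucceed\n"
    "</s>\n"
    "</text>\n"
)

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
COLUMNS = ("word", "pos", "lemma")  # assumed column order


def parse_vrt(vrt):
    """Return (metadata of the <text> element, list of token dicts)."""
    metadata, tokens = {}, []
    for line in vrt.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<text "):
            # Metadata are the attribute-value pairs of the <text> tag.
            metadata = dict(ATTR_RE.findall(line))
        elif line.startswith("<"):
            # Naive assumption: every other line starting with "<" is a tag.
            continue
        else:
            # Token line: split the TAB-delimited positional attributes.
            tokens.append(dict(zip(COLUMNS, line.split("\t"))))
    return metadata, tokens


metadata, tokens = parse_vrt(SAMPLE_VRT)
print(metadata)
print([token["lemma"] for token in tokens])
```

The sketch handles a single text and would misread a token that itself begins with "<"; it is meant only to show how positional and structural attributes sit side by side in a VRT file.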
You can download the corpus as a compressed VRT file (419 MB), which can be imported into the Corpus Workbench or CQPweb.
$ md5sum Royal_Society_Corpus_v2.0.2_final.zip
9dc54d20820a6507ac3a3957a24a5131 Royal_Society_Corpus_v2.0.2_final.zip
You can download the OCR correction tools as an archive of sed source files (15 kB).
$ md5sum rsc-tool-2.0.tar.gz
b410be474c8df77ac952eb0ee1246bbc rsc-tool-2.0.tar.gz
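Where md5sum is not available, the checksums listed above can be checked with a few lines of Python. This is a minimal sketch using the standard hashlib module; the function names md5_of_file and verify_download are illustrative, and the example assumes the archive has been saved under its original file name in the current directory.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the published value."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the download was saved.
    archive = "Royal_Society_Corpus_v2.0.2_final.zip"
    if os.path.exists(archive):
        print(verify_download(archive, "9dc54d20820a6507ac3a3957a24a5131"))
```

The same call works for the tools archive by passing rsc-tool-2.0.tar.gz and its checksum. An MD5 match indicates the download was not corrupted in transit; it is not a guarantee against deliberate tampering.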
The Royal Society Corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you use the Royal Society Corpus in your research, please refer to:
Kermes, Hannah, Stefania Degaetano, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. “The Royal Society Corpus: From Uncharted Data to Corpus.” In Proceedings of the LREC 2016. Portoroz, Slovenia. http://www.lrec-conf.org/proceedings/lrec2016/summaries/792.html.
For the OCR correction scripts, please refer to:
Knappen, Jörg, Stefan Fischer, Hannah Kermes, Elke Teich, and Peter Fankhauser. 2017. “The Making of the Royal Society Corpus.” In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language. Göteborg, Sweden: Linköping University Electronic Press. http://www.ep.liu.se/ecp/article.asp?issue=133&article=003&volume=.