The corpus is available for download and can be searched online.
The corpus can be searched via the CQPweb server of the Department of Language Science and Technology at Saarland University. Registration with your e-mail address is required but free.
Sample queries:
Helpful links:
The corpus can be downloaded in several file formats:
.vrt
.txt
.tei.xml
.tcf.xml
.vrt is the default file format containing all available annotations.
The other file formats are provided as a convenience only and may be incomplete, although they do contain all tokens.
The text metadata can be downloaded separately.
More information on the annotation of the corpus can be found on a separate page.
In the so-called vrt format (vertical text format), annotations on the token level (positional attributes, e.g. word, pos, lemma) are represented one word per line, with TAB-delimited columns for each positional attribute. Annotations beyond the token level (structural attributes, e.g. texts, sentences, pages) are represented as SGML tags with optional attribute-value pairs. Metadata, for example, are encoded as attributes of the <text> element. Files in vrt format can be imported into the Corpus Workbench or CQPweb.
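To make the layout concrete, here is a minimal Python sketch of a reader for vrt-style input. It is an illustration only, not the tooling used to build the corpus. The column order (word, pos, lemma), the attribute names id and year, and the sample lines are assumptions made for this example; check the actual files and the annotation page for the real column inventory.

```python
import re

# A structural line looks like <text id="..." year="..."> or </text>.
TAG_RE = re.compile(r'^<(/?)([\w.:-]+)((?:\s+[\w.:-]+="[^"]*")*)\s*/?>$')
ATTR_RE = re.compile(r'([\w.:-]+)="([^"]*)"')


def parse_vrt(lines, columns=("word", "pos", "lemma")):
    """Yield (text_metadata, token) pairs from vrt-style lines.

    `columns` names the TAB-separated positional attributes; the default
    order is an assumption for this sketch, not the documented layout.
    """
    text_meta = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            closing, name, attrs = match.groups()
            if name == "text":
                # Metadata live on the opening <text> tag; reset on </text>.
                text_meta = {} if closing else dict(ATTR_RE.findall(attrs))
            continue  # other structural tags (sentences, pages) are skipped
        # Anything else is a token line: one token, TAB-separated columns.
        yield text_meta, dict(zip(columns, line.split("\t")))


# Hypothetical sample input, made up for this example.
SAMPLE = [
    '<text id="demo_1" year="1665">',
    "<s>",
    "The\tDT\tthe",
    "experiments\tNNS\texperiment",
    "</s>",
    "</text>",
]

for meta, token in parse_vrt(SAMPLE):
    print(meta.get("id"), token["word"], token["pos"], token["lemma"])
```

For real work, importing the files into the Corpus Workbench or CQPweb is the more robust route; a reader like this is mainly useful for quick inspection.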
Checksums (md5sum):
7bf9c582f2af6d581eacf8dde4aabc66 Royal_Society_Corpus_v4.0.1_corpus.vrt.zip
6268a5239e9fcd156a8bf00d271fe3aa Royal_Society_Corpus_v4.0.1_corpus.tei.xml.zip
5a2b85ab8bdde684d82a0251a6d9fe62 Royal_Society_Corpus_v4.0.1_texts_vrt.zip
ad8881d6a5c96c6d11ed7ce4e623ec04 Royal_Society_Corpus_v4.0.1_texts_txt.zip
045579bc7cb2032641323a4845c317b7 Royal_Society_Corpus_v4.0.1_texts_tei.zip
15b62cd7e881b33554b68c27d02ecefa Royal_Society_Corpus_v4.0.1_texts_tcf.zip
8add09ff70ab18f210649d4911eeaa7b Royal_Society_Corpus_v4.0.1_meta.tsv.zip
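Where the md5sum command is not available, a downloaded archive can be checked against the values listed above with a few lines of Python. This is a minimal sketch using only the standard library; the function names md5_of_file and verify are chosen for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives do not have to
    fit into memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest equals the published checksum."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, verify("Royal_Society_Corpus_v4.0.1_corpus.vrt.zip", "7bf9c582f2af6d581eacf8dde4aabc66") should return True for an intact download of the vrt archive, assuming the file sits in the current working directory.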
You can also download our OCR correction scripts (2.7 MB).
$ md5sum tools-4.0.tar.gz
d4aa8252457d9366762382d8c4161625 tools-4.0.tar.gz
The topic model (6.3 MB) is also available.
$ md5sum topic-model_rsc_v2.0-24e-142.zip
cf91135ab1a5436087b522dafb3bc33b topic-model_rsc_v2.0-24e-142.zip
v4.0.1: more file formats (same data)
v4.0.0: long-term release
Checksums (md5sum):
5469d9cf30c1a4c74cf81ced861c95c6 Royal_Society_Corpus_v4.0.0_final.zip
The Royal Society Corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you use the Royal Society Corpus in your research, please refer to:
Kermes, Hannah, Stefania Degaetano, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. “The Royal Society Corpus: From Uncharted Data to Corpus.” In Proceedings of the LREC 2016. Portoroz, Slovenia. http://www.lrec-conf.org/proceedings/lrec2016/summaries/792.html.