The corpus is available for download and can be searched online.

CQPweb

The corpus can be searched via the CQPweb server of the Department of Language Science and Technology at Saarland University. Registration with your e-mail address is required but free.

Sample queries:

Helpful links:

Download

Formats

The corpus can be downloaded in several file formats:

  • vertical text format (CWB/CQPweb) .vrt
  • plain text format .txt
  • TEI format (Text Encoding Initiative) .tei.xml
  • TCF format (WebLicht Text Corpus Format) .tcf.xml

The .vrt format is the default file format and contains all available annotations.

The other file formats are provided as a convenience only. They contain all tokens but may otherwise be incomplete.

The text metadata can be downloaded separately.

More information on the annotation of the corpus can be found on a separate page.

In the vrt format (vertical text format), token-level annotations (positional attributes, e.g. word, pos, lemma) are represented in a one-word-per-line layout, with one tab-delimited column per positional attribute. Annotations above the token level (structural attributes, e.g. texts, sentences, pages) are represented as SGML tags, optionally carrying attribute-value pairs. Metadata, for example, are encoded as attributes of the <text> element. Files in vrt format can be imported into the Corpus Workbench or CQPweb.
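To illustrate this layout, here is a minimal Python sketch that reads vrt lines and pairs each token with the structural attributes currently open around it. The column order (word, pos, lemma), the sample tokens, and the attribute names id and year are assumptions made for the example, not the documented schema of the corpus. The sketch also assumes that every structural tag sits on its own line.

```python
import re

# Opening structural tag, e.g. <text id="x" year="1700"> or <s>
OPEN_TAG = re.compile(r'^<([A-Za-z_][\w-]*)((?:\s+[\w:-]+="[^"]*")*)\s*>$')
# Closing structural tag, e.g. </text>
CLOSE_TAG = re.compile(r'^</([A-Za-z_][\w-]*)>$')
# One attribute-value pair inside an opening tag
ATTR = re.compile(r'([\w:-]+)="([^"]*)"')


def parse_vrt(lines, columns=("word", "pos", "lemma")):
    """Yield (token, context) pairs from an iterable of vrt lines.

    token   -- dict mapping each column name to its value
    context -- dict mapping each open structural tag to its attributes
    The column names are an assumption; adjust them to the real file.
    """
    context = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        closing = CLOSE_TAG.match(line)
        if closing:
            # Leaving a structural region: forget its attributes.
            context.pop(closing.group(1), None)
            continue
        opening = OPEN_TAG.match(line)
        if opening:
            # Entering a structural region: remember its attributes.
            context[opening.group(1)] = dict(ATTR.findall(opening.group(2)))
            continue
        # Anything else is a token line with tab-delimited columns.
        token = dict(zip(columns, line.split("\t")))
        yield token, {name: dict(attrs) for name, attrs in context.items()}


# Hypothetical sample; real files carry more attributes and columns.
SAMPLE = [
    '<text id="rsc_demo" year="1700">',
    "<s>",
    "The\tDT\tthe",
    "experiments\tNNS\texperiment",
    "</s>",
    "</text>",
]

tokens = list(parse_vrt(SAMPLE))
for token, context in tokens:
    print(token["word"], token["pos"], context["text"]["year"])
```

For a downloaded file, the same function can be fed an open file handle instead of the sample list, since it only iterates over lines.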

Files

Checksums (md5sum):

7bf9c582f2af6d581eacf8dde4aabc66  Royal_Society_Corpus_v4.0.1_corpus.vrt.zip
6268a5239e9fcd156a8bf00d271fe3aa  Royal_Society_Corpus_v4.0.1_corpus.tei.xml.zip

5a2b85ab8bdde684d82a0251a6d9fe62  Royal_Society_Corpus_v4.0.1_texts_vrt.zip
ad8881d6a5c96c6d11ed7ce4e623ec04  Royal_Society_Corpus_v4.0.1_texts_txt.zip
045579bc7cb2032641323a4845c317b7  Royal_Society_Corpus_v4.0.1_texts_tei.zip
15b62cd7e881b33554b68c27d02ecefa  Royal_Society_Corpus_v4.0.1_texts_tcf.zip

8add09ff70ab18f210649d4911eeaa7b  Royal_Society_Corpus_v4.0.1_meta.tsv.zip
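
The checksums above can be compared against downloaded files with the md5sum tool or, as a minimal sketch, with Python's standard hashlib module. The helper names and the assumption that the archives sit in the current directory are illustrative only. The two entries in CHECKSUMS are copied from the list above; extend it with the files you actually downloaded.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_list(text):
    """Parse md5sum-style lines ('HASH  FILENAME') into {filename: hash}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            expected[parts[1].strip()] = parts[0].lower()
    return expected


def verify(directory, expected):
    """Map each expected filename to True (match) or False (missing/mismatch)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results


# Two entries copied from the checksum list on this page.
CHECKSUMS = """
7bf9c582f2af6d581eacf8dde4aabc66  Royal_Society_Corpus_v4.0.1_corpus.vrt.zip
8add09ff70ab18f210649d4911eeaa7b  Royal_Society_Corpus_v4.0.1_meta.tsv.zip
"""

if __name__ == "__main__":
    # Assumes the archives were saved to the current directory.
    for name, ok in verify(".", parse_checksum_list(CHECKSUMS)).items():
        print("OK      " if ok else "FAILED  ", name)
```

A file reported as FAILED is either missing from the directory or differs from the published archive, in which case downloading it again is the usual remedy.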

Tools

You can also download our OCR correction scripts (2.7 MB).

$ md5sum tools-4.0.tar.gz
d4aa8252457d9366762382d8c4161625  tools-4.0.tar.gz

The topic model (6.3 MB) is also available.

$ md5sum topic-model_rsc_v2.0-24e-142.zip
cf91135ab1a5436087b522dafb3bc33b  topic-model_rsc_v2.0-24e-142.zip

Release History

Checksums (md5sum):

5469d9cf30c1a4c74cf81ced861c95c6  Royal_Society_Corpus_v4.0.0_final.zip

License

The Royal Society Corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

If you use the Royal Society Corpus in your research, please cite:

Kermes, Hannah, Stefania Degaetano, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. “The Royal Society Corpus: From Uncharted Data to Corpus.” In Proceedings of the LREC 2016. Portoroz, Slovenia. http://www.lrec-conf.org/proceedings/lrec2016/summaries/792.html.


CLARIN-D German Research Foundation (DFG) German Federal Ministry of Education and Research