The corpus is available for download and can be searched online.

CQPweb

The corpus can be searched via the CQPweb server of the Department of Language Science and Technology at Saarland University. Registration with your e-mail address is required but free.

Visualizations

Access to the visualizations is password-protected.
Please ask for credentials.

Download

Formats

The corpus can be downloaded in several file formats:

  • vertical text format (CWB/CQPweb) .vrt
  • plain text format .txt
  • TEI format (Text Encoding Initiative) .tei.xml
  • TCF format (WebLicht Text Corpus Format) .tcf.xml

.vrt is the default file format containing all available annotations.

The other file formats are provided for convenience only and may be incomplete, although they do contain all tokens.

The text metadata can be downloaded separately.

More information on the annotation of the corpus can be found on a separate page.

In the vertical text format (.vrt), token-level annotations (positional attributes, e.g. word, pos, lemma) are represented in a one-word-per-line layout, with one tab-delimited column per positional attribute. Annotations above the token level (structural attributes, e.g. texts, sentences, pages) are represented as XML tags, optionally carrying attribute-value pairs. Text metadata, for example, is encoded as attributes of the <text> element. Files in .vrt format can be imported into the Corpus Workbench (CWB) or CQPweb.
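The layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the column order (word, pos, lemma), the attribute names `id` and `year`, and the two-token sample are assumptions made for illustration. Check the annotation documentation for the actual columns of the release you downloaded.

```python
import io
import re
from xml.sax.saxutils import unescape

# Matches attribute-value pairs such as id="demo_1" inside a structural tag.
ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')


def parse_vrt(lines, columns=("word", "pos", "lemma")):
    """Yield (text_metadata, tokens) for every <text> element.

    `columns` names the positional attributes in column order; the
    default order is an assumption, not taken from the corpus docs.
    """
    meta, tokens = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        is_tag = line.startswith("<") and line.endswith(">") and "\t" not in line
        if is_tag and (line.startswith("<text ") or line == "<text>"):
            # Metadata is stored as attributes of the <text> element.
            meta = {k: unescape(v, {"&quot;": '"'}) for k, v in ATTR_RE.findall(line)}
            tokens = []
        elif is_tag and line.startswith("</text>"):
            yield meta, tokens
            meta, tokens = None, []
        elif is_tag:
            continue  # other structural attributes (sentences, pages, ...)
        elif meta is not None:
            # One token per line, one tab-delimited column per attribute.
            tokens.append(dict(zip(columns, line.split("\t"))))


# Hypothetical sample in the described layout (not real corpus data).
sample = (
    '<text id="demo_1" year="1700">\n'
    "<s>\n"
    "The\tDT\tthe\n"
    "experiments\tNNS\texperiment\n"
    "</s>\n"
    "</text>\n"
)

for meta, tokens in parse_vrt(io.StringIO(sample)):
    print(meta, [t["lemma"] for t in tokens])
```

For the real corpus file, pass an open file handle instead of the `io.StringIO` sample. Because the function is a generator, texts are processed one at a time rather than loaded into memory all at once.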

Files

Checksums (md5sum):

33e50f29c2137a4152c6b996f83ee08f  Royal_Society_Corpus_open_v6.0.4_corpus.tei.xml.zip
999fb9aefad9ea47be4e1b0cb9494632  Royal_Society_Corpus_open_v6.0.4_corpus.vrt.zip
40e02025587649e9f23d5e83760b9230  Royal_Society_Corpus_open_v6.0.4_meta.tsv.zip
55940b45ba6bb7f330f7aeba20234f15  Royal_Society_Corpus_open_v6.0.4_texts_tcf.zip
09f425cd79aa72f5ff4f2cd87af85061  Royal_Society_Corpus_open_v6.0.4_texts_tei.zip
a38ea641fb2665db22a161bb1e9c97d4  Royal_Society_Corpus_open_v6.0.4_texts_txt.zip
b1f16a23637b2a48228b88ad6beb982e  Royal_Society_Corpus_open_v6.0.4_texts_vrt.zip
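A downloaded archive can be checked against the MD5 sums listed above. The sketch below uses Python's standard `hashlib` module; it assumes the archive sits in the current working directory, and the helper names `md5_of_file` and `verify` are illustrative only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Compare a file's MD5 digest with the published checksum."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # File name and checksum copied from the list above; the location
    # of the file (current directory) is an assumption.
    archive = Path("Royal_Society_Corpus_open_v6.0.4_corpus.vrt.zip")
    expected = "999fb9aefad9ea47be4e1b0cb9494632"
    if archive.exists():
        print("OK" if verify(archive, expected) else "MISMATCH")
    else:
        print(f"{archive} not found")
```

On systems with GNU coreutils, `md5sum -c` on a file containing the lines above performs the same check.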

Release History

  • v6.0.4 Open: Version 6.0.3 with additional topic annotation on texts
  • v6.0.3 Open: new long-term release
  • v4.0.1: more file formats (same data)
  • v4.0.0: new long-term release
  • v2.0.2: first long-term release

Persistent Identifier

Each release of the RSC has been assigned a persistent identifier (PID).

License

The Royal Society Corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

If you use the Royal Society Corpus in your research, please cite:

Fischer, Stefan, Jörg Knappen, Katrin Menzel, and Elke Teich. 2020. “The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study.” In Proceedings of the 12th Language Resources and Evaluation Conference, 794–802. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.99.

Kermes, Hannah, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. “The Royal Society Corpus: From Uncharted Data to Corpus.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation, 1928–31. Portorož, Slovenia: European Language Resources Association. https://www.aclweb.org/anthology/L16-1305.

