The Royal Society Corpus (RSC) 6.0 Open is based on the first centuries of the Philosophical Transactions of the Royal Society of London from its beginning in 1665 to 1920. It includes all publications of the journal written in English or mainly in English and containing running text. The Philosophical Transactions was the first periodical of scientific writing in England. Founded in 1665 by Henry Oldenburg, the first secretary of the Royal Society, it initially contained excerpts of letters of his scientific correspondence, reviews and summaries of recently-published books, and accounts of observations and experiments. In addition, the RSC also contains all texts from other Royal Society science journals such as the Proceedings of the Royal Society of London until 1920.
The previous releases of the corpus can be found here and here.
The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication (cf. Project B1 Information Density and Scientific Literacy in English: Synchronic and Diachronic Perspectives of SFB 1102).
With their long and continuous history, the Royal Society science journals provide a good basis for the diachronic analysis of English scientific writing.
We obtained the Philosophical Transactions from JSTOR in a well-formed XML format including metadata (e.g. author(s), text type (such as article or abstract), day, month and year of publication, volume, text ID, and title). In addition, we received data from the Royal Society. Although already digitized, the source texts can still contain noise, e.g., OCR errors, which can affect the quality of every step in corpus processing as well as corpus analysis. Inspired by Agile Software Development (Cockburn 2001), we intertwine corpus building, corpus annotation and analysis to produce new versions of the corpus whenever we encounter problems in data quality. The dedicated corpus building pipeline is divided into three main steps:
The steps in the pipeline are mostly automatic; manual work is kept to a minimum and is applied prior to the first automatic step in the pipeline (cf. Kermes et al. (2016)).
We also received full texts and metadata directly from the Royal Society of London and adapted our processing pipeline to ingest them into the new corpus build.
The corpus is tokenized and linguistically annotated for lemma and part-of-speech using TreeTagger (Schmid 1994, Schmid 1995). For spelling normalization we use a trained model of VARD (Baron and Rayson 2008). As a special feature, we encode with each unit (word token) its (average) surprisal (cf. Kermes and Teich 2017), with words as units and trigrams as contexts (cf. Genzel and Charniak 2002).
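To illustrate the idea behind the surprisal annotation, the following is a minimal sketch in Python, not the pipeline's actual implementation. It assumes that the average surprisal of a word is the mean of -log2 p(word | context) over all of its occurrences, that the context is the three preceding tokens, and that probabilities are unsmoothed relative frequencies. The function name average_surprisal, the parameter context_size and the padding symbol "<s>" are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def average_surprisal(tokens, context_size=3):
    """Return {word: mean of -log2 p(word | preceding context_size tokens)}.

    Assumption: probabilities are plain relative frequencies (no smoothing),
    estimated on the same token sequence they are applied to.
    """
    context_counts = Counter()
    ngram_counts = Counter()
    # Pad the start so that the first tokens also have a full-length context.
    padded = ["<s>"] * context_size + list(tokens)
    for i in range(context_size, len(padded)):
        context = tuple(padded[i - context_size:i])
        context_counts[context] += 1
        ngram_counts[(context, padded[i])] += 1

    totals = defaultdict(float)   # summed surprisal per word, in bits
    occurrences = Counter()       # number of occurrences per word
    for (context, word), count in ngram_counts.items():
        p = count / context_counts[context]
        totals[word] += count * -math.log2(p)
        occurrences[word] += count
    return {word: totals[word] / occurrences[word] for word in totals}
```

For example, average_surprisal("a b a c".split(), context_size=1) assigns 0 bits to "a", which is fully predictable from its preceding token in this toy sequence, and 1 bit each to "b" and "c", which are equally likely after "a". A realistic model would be estimated on the full corpus and would need smoothing for unseen contexts.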
Detailed information on the linguistic and structural annotation of the RSC can be found here.
The RSC 6.0 Open consists of approximately 78.6 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decades and 50-year periods is also available, allowing for diachronic analysis at different levels of granularity. Token sizes of the different subcorpora and other corpus statistics can be found here.
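As a small illustration of the two time granularities, the sketch below derives a decade and a 50-year period from a year of publication. It assumes that each bin is labeled by its first year (e.g. 1660 for the 1660s, 1650 for 1650 to 1699); the labels actually used in the corpus metadata may differ, and the function names decade_of and period_of are hypothetical.

```python
def decade_of(year):
    """Return the first year of the decade containing `year`."""
    return year - year % 10


def period_of(year, span=50):
    """Return the first year of the `span`-year period containing `year`."""
    return year - year % span


# Group a few publication years by both granularities.
for year in (1665, 1749, 1920):
    print(year, decade_of(year), period_of(year))
```

Under these assumptions, a text from 1665 falls into decade 1660 and period 1650, and a text from 1920 falls into decade 1920 and period 1900.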
The creation of the Royal Society Corpus was supported by the German section of the Common Language Resources and Technology Infrastructure (CLARIN-D), the German Research Foundation (DFG) and the Federal Ministry of Education and Research (BMBF).
The corpus is available for download and can be searched online under license. More information on how to access the corpus can be found here.
If you use the Royal Society Corpus in your research, please refer to it using the following articles:
Fischer, Stefan, Jörg Knappen, Katrin Menzel, and Elke Teich. 2020. “The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study.” In Proceedings of the 12th Language Resources and Evaluation Conference, 794–802. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.99.
Kermes, Hannah, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. “The Royal Society Corpus: From Uncharted Data to Corpus.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation, 1928–31. Portorož, Slovenia: European Language Resources Association. https://www.aclweb.org/anthology/L16-1305.