For corpus query and analysis, the RSC is encoded in CQP format (cf. IMS Open Corpus Workbench (CWB)). The CWB requires a simple XML-based input format. In the so-called vrt-format (vertical text format), annotations on the token level (positional attributes, e.g. word, pos, lemma) are represented in a one-word-per-line format with TAB-delimited columns, one column per positional attribute. Annotations beyond the token level (structural attributes, e.g. texts, sentences, pages) are represented as XML tags with optional attribute-value pairs. Metadata, for example, are encoded as attributes of the <text> element. In the following, we give a detailed overview of annotations on the token level (positional attributes) and on the structural level (structural attributes).
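The distinction between token lines and XML tag lines can be illustrated with a minimal Python sketch. The sample below is a hypothetical, shortened VRT fragment (only three positional columns and two metadata attributes), and the function name read_vrt is our own choice for illustration; it is not part of the CWB.

```python
import re

# Hypothetical miniature VRT sample: attribute lists are shortened and only
# three of the positional columns (word, pos, lemma) are shown.
SAMPLE_VRT = """\
<text id="100997" year="1669">
<s no="1">
of\tIN\tof
some\tDT\tsome
Books\tNPS\tBooks
</s>
</text>
"""

ATTR_RE = re.compile(r'([\w:-]+)="([^"]*)"')
OPEN_RE = re.compile(r'^<([A-Za-z_][\w-]*)((?:\s+[\w:-]+="[^"]*")*)\s*>$')
CLOSE_RE = re.compile(r'^</([A-Za-z_][\w-]*)>$')


def read_vrt(lines):
    """Yield ('open', name, attrs), ('close', name, None) or ('token', columns, None)."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        opened = OPEN_RE.match(line)
        if opened:
            # Structural attribute: XML tag with attribute-value pairs.
            yield ("open", opened.group(1), dict(ATTR_RE.findall(opened.group(2))))
            continue
        closed = CLOSE_RE.match(line)
        if closed:
            yield ("close", closed.group(1), None)
            continue
        # Everything else is a token line: one TAB-delimited column
        # per positional attribute.
        yield ("token", line.split("\t"), None)


events = list(read_vrt(SAMPLE_VRT.splitlines()))
for event in events:
    print(event)
```

This sketch treats any line that is not a well-formed opening or closing tag as a token line, which is a simplification; a token that itself looks like a tag would need additional handling.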

Annotation on Token Level

Attributes are listed in the order of the columns in the VRT file. Attribute names refer to the positional attributes encoded in the online corpus.

word   pos  lemma  orig   srp   srp_avg  srp_rnd  srp_avg_rnd  doc   doc_avg  doc_rnd  doc_avg_rnd  s50   s50_avg  s50_rnd  s50_avg_rnd  s10   s10_avg  s10_rnd  s10_avg_rnd
of     IN   of     of     0.11  1.621    0        2            0.04  0.367    0        0            0.06  2.023    0        2            0.10  2.214    0        2
some   DT   some   some   3.49  6.976    3        7            2.11  0.861    2        1            2.66  6.306    3        6            2.03  6.387    2        6
Books  NPS  Books  Books  3.48  8.743    3        9            0.65  1.500    1        2            0.90  6.109    1        6            0.88  5.479    1        5

Positional Attribute  Description
word                  Normalised word form (VARD)
pos                   Part-of-speech tag (Penn Treebank Tagset)
lemma                 Lemma, according to TreeTagger
orig                  Original word form
srp                   Surprisal
srp_avg               Average surprisal
srp_rnd               Surprisal (rounded)
srp_avg_rnd           Average surprisal (rounded)
doc                   Document surprisal
doc_avg               Average document surprisal
doc_rnd               Document surprisal (rounded)
doc_avg_rnd           Average document surprisal (rounded)
s50                   Surprisal on 50-year periods
s50_avg               Average surprisal on 50-year periods
s50_rnd               Surprisal on 50-year periods (rounded)
s50_avg_rnd           Average surprisal on 50-year periods (rounded)
s10                   Surprisal on decades
s10_avg               Average surprisal on decades
s10_rnd               Surprisal on decades (rounded)
s10_avg_rnd           Average surprisal on decades (rounded)
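
As an illustration of the column order, the following Python sketch maps one TAB-delimited token line to named attributes, using the three example rows from this section. The function names are hypothetical. The check that each rounded column equals its base column rounded to the nearest integer is an assumption that happens to hold for the example rows; it is not a documented rule.

```python
# Column order as listed in this section.
COLUMNS = [
    "word", "pos", "lemma", "orig",
    "srp", "srp_avg", "srp_rnd", "srp_avg_rnd",
    "doc", "doc_avg", "doc_rnd", "doc_avg_rnd",
    "s50", "s50_avg", "s50_rnd", "s50_avg_rnd",
    "s10", "s10_avg", "s10_rnd", "s10_avg_rnd",
]
STRING_COLUMNS = {"word", "pos", "lemma", "orig"}


def parse_token(line):
    """Turn one TAB-delimited VRT token line into a dict keyed by attribute name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    token = {}
    for name, value in zip(COLUMNS, fields):
        if name in STRING_COLUMNS:
            token[name] = value
        elif name.endswith("_rnd"):
            token[name] = int(value)
        else:
            token[name] = float(value)
    return token


def rounding_is_consistent(token):
    """Assumption: each *_rnd column is its base column rounded to the nearest integer."""
    bases = [c for c in COLUMNS
             if c not in STRING_COLUMNS and not c.endswith("_rnd")]
    return all(round(token[b]) == token[b + "_rnd"] for b in bases)


# The example rows from this section, re-joined with TABs as in a VRT file.
EXAMPLE_ROWS = [
    "of IN of of 0.11 1.621 0 2 0.04 0.367 0 0 0.06 2.023 0 2 0.10 2.214 0 2",
    "some DT some some 3.49 6.976 3 7 2.11 0.861 2 1 2.66 6.306 3 6 2.03 6.387 2 6",
    "Books NPS Books Books 3.48 8.743 3 9 0.65 1.500 1 2 0.90 6.109 1 6 0.88 5.479 1 5",
]
tokens = [parse_token("\t".join(row.split())) for row in EXAMPLE_ROWS]

for token in tokens:
    print(token["word"], token["srp"], token["srp_rnd"], rounding_is_consistent(token))
```

Note that Python's built-in round resolves exact .5 ties to the nearest even integer, so values ending in exactly .5 may be rounded differently here than in the corpus.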

Annotation on Structural Level

Structural attributes are given in two different representations:

  • a schematic XML-representation (cf. vrt-format)
  • structural attribute names as encoded in the online corpus

Texts

Metadata are encoded on the text level.

<text id="" issn="" title="" fpage="" lpage="" year="" volume="" journal="" author="" type="" corpusBuild="" doiLink="" language="" jrnl="" decade="" period="" century="" pages="" sentences="" tokens="" visualizationLink="" doi="" jstorLink="" isAbstractOf="" hasAbstract="" primaryTopic="" primaryTopicPercentage="" secondaryTopic="" secondaryTopicPercentage="">

Example:

<text id="100997" issn="03702316" title="An Extract of a Letter Written by Dr. Edward Brown from Vienna in Austria March 3. 1669. Concerning Two Parhelia's or Mocksuns, Lately Seen in Hungary" fpage="953" lpage="953" year="1669" volume="4" journal="Philosophical Transactions (1665-1678)" author="Edward Brown" type="fla" corpusBuild="5.2" doiLink="http://dx.doi.org/10.1098/rstl.1669.0015" language="" jrnl="transactions" decade="1660" period="1650" century="1600" pages="1" sentences="10" tokens="245" visualizationLink="http://corpora.clarin-d.uni-saarland.de/surprisal/6.0.3/?id=100997" doi="10.1098/rstl.1669.0015" jstorLink="http://www.jstor.org/stable/100997" isAbstractOf="" hasAbstract="" primaryTopic="Reporting" primaryTopicPercentage="47.2589599153001" secondaryTopic="Astronomy" secondaryTopicPercentage="34.1265051578955">

Structural Attribute             Description
<text_author>                    Author of the article
<text_century>                   Century of publication
<text_corpusBuild>               Internal version number
<text_decade>                    Decade of publication
<text_doi>                       DOI of article
<text_doiLink>                   Link to DOI resolver
<text_fpage>                     First page of the article
<text_hasAbstract>               ID of the corresponding abstract
<text_id>                        JSTOR ID
<text_isAbstractOf>              ID of the corresponding article
<text_issn>                      ISSN of the journal
<text_journal>                   Journal in which the article was published
<text_jrnl>                      Journal abbreviation
<text_jstorLink>                 Link to the source text on JSTOR
<text_language>                  Language of article
<text_lpage>                     Last page of the article
<text_pages>                     Number of pages in text
<text_period>                    50-year period of publication
<text_primaryTopic>              The most prominent topic according to our topic model
<text_primaryTopicPercentage>    Percentage of the most prominent topic in the text
<text_secondaryTopic>            The second most prominent topic according to our topic model
<text_secondaryTopicPercentage>  Percentage of the second most prominent topic in the text
<text_sentences>                 Number of sentences in text
<text_title>                     Title of the article
<text_tokens>                    Number of tokens in text
<text_type>                      Text type
<text_visualizationLink>         Link to visualization
<text_volume>                    Volume of the article
<text_year>                      Year of publication
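
The relation between the year and the time-bin attributes (decade, period, century) can be sketched in Python as follows. We assume here that each bin is labelled by its starting year, i.e. the year rounded down to a multiple of 10, 50 and 100. This agrees with the example above (1669 gives 1660, 1650 and 1600) but is otherwise an assumption; the function name time_bins is hypothetical.

```python
def time_bins(year):
    """Label the decade, 50-year period and century by their starting year.

    Assumption: bins are obtained by rounding the year down to a multiple
    of 10, 50 and 100, as suggested by the example <text> element.
    """
    return {
        "decade": year // 10 * 10,
        "period": year // 50 * 50,
        "century": year // 100 * 100,
    }


# Time-related attribute values copied from the example <text> element.
example = {"year": "1669", "decade": "1660", "period": "1650", "century": "1600"}

bins = time_bins(int(example["year"]))
matches = all(str(bins[name]) == example[name] for name in bins)
print(bins, matches)
```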

Pages

<page id="" no="" tokens="">

Structural Attribute  Description
<page>                Page (attribute from JSTOR)
<page_id>             Absolute page number
<page_no>             Relative page number
<page_tokens>         Number of tokens in page

Sentences

Texts are split into sentences based on the output of the TreeTagger.

<s srp="" doc="" s50="" s10="" no="" tokens="">

Structural Attribute  Description
<s>                   Sentence boundary (based on SENT tags of TreeTagger)
<s_srp>               Average surprisal of sentence based on srp
<s_doc>               Average surprisal of sentence based on doc
<s_s50>               Average surprisal of sentence based on s50
<s_s10>               Average surprisal of sentence based on s10
<s_no>                Relative sentence number (within a text)
<s_tokens>            Number of tokens in sentence
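
A sentence-level average such as s_srp can be sketched in Python as below. We assume the average is the arithmetic mean of the token-level values of the named attribute; the exact computation is not specified here. The three tokens are taken from the example rows in the token-level section and do not form a complete sentence, and the function name sentence_average is hypothetical.

```python
def sentence_average(tokens, attribute):
    """Arithmetic mean of one token-level attribute over the tokens of a sentence.

    Assumption: the sentence-level value is a plain (unweighted) mean.
    """
    values = [token[attribute] for token in tokens]
    if not values:
        raise ValueError("sentence has no tokens")
    return sum(values) / len(values)


# srp and doc values of the three example tokens ("of", "some", "Books").
sentence = [
    {"word": "of", "srp": 0.11, "doc": 0.04},
    {"word": "some", "srp": 3.49, "doc": 2.11},
    {"word": "Books", "srp": 3.48, "doc": 0.65},
]

s_srp = sentence_average(sentence, "srp")
s_doc = sentence_average(sentence, "doc")
print(round(s_srp, 2), round(s_doc, 2))
```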

Normalisation

Normalised words are represented both on the token level and on the structural level, on the one hand to account for one-to-many and many-to-one relations (e.g. ’tis → this is, my self → myself) and on the other hand to allow easy access.

<normalised orig="" auto="">

Structural Attribute  Description
<normalised>          Normalised token(s)
<normalised_auto>     Always “true” as all normalisations are automatic
<normalised_orig>     Original token(s)
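
To illustrate how the structural representation pairs original and normalised tokens, the Python sketch below collects each <normalised> region from a hypothetical VRT fragment. The sample shows only the word column, and we assume that multiple original tokens in the orig attribute are separated by spaces; both are assumptions for illustration, as is the function name normalisation_pairs.

```python
import re

# Hypothetical VRT fragment: only the word column is shown.
SAMPLE = """\
I
hurt
<normalised orig="my self" auto="true">
myself
</normalised>
<normalised orig="’tis" auto="true">
this
is
</normalised>
"""

OPEN_RE = re.compile(r'^<normalised\s+orig="([^"]*)"\s+auto="([^"]*)">$')


def normalisation_pairs(lines):
    """Return (original tokens, normalised tokens) for each <normalised> region."""
    pairs = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        opened = OPEN_RE.match(line)
        if opened:
            # Assumption: original tokens are space-separated in orig="".
            current = (opened.group(1).split(), [])
        elif line == "</normalised>":
            if current is not None:
                pairs.append(current)
            current = None
        elif current is not None and not line.startswith("<"):
            # Token line inside the region: the first column is the word form.
            current[1].append(line.split("\t")[0])
    return pairs


pairs = normalisation_pairs(SAMPLE.splitlines())
for original, normalised in pairs:
    print(" ".join(original), "->", " ".join(normalised))
```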

Inferred Text

The element <inferred> is part of the JSTOR distribution. It marks text that was illegible in the source and was recovered from the context.

Reference on annotation and metadata

For more information on annotation and metadata, please consult our paper in RiCL:

Katrin Menzel, Jörg Knappen, and Elke Teich (2021): "Generating linguistically relevant metadata for the Royal Society Corpus", Research in Corpus Linguistics 9(1): 1-18, DOI: 10.32714/ricl.09.01.02