For corpus query and analysis, the RSC is encoded in CQP format (cf. IMS Open Corpus Workbench (CWB)). The CWB requires a simple XML format as input, the so-called VRT format (vertical text format). In this format, annotations on the token level (positional attributes, e.g. word, pos, lemma) are represented one word per line, with TAB-delimited columns for each positional attribute. Annotations beyond the token level (structural attributes, e.g. texts, sentences, pages) are represented as SGML tags with optional attribute-value pairs. Metadata, for example, are encoded as attributes of the <text> element. In the following, we give a detailed overview of the annotations on the token level (positional attributes) and on the structural level (structural attributes).

Annotation on Token Level

Attributes are listed in the order of the columns in the VRT file. Attribute names refer to the positional attributes encoded in the online corpus.

word   pos  lemma  orig   avs   surprisal  avs50  avs10

of     IN   of     of     0.05  0.29       0.06   0.10
some   DT   some   some   3.06  0.41       2.66   2.04
Books  NPS  Books  Books  2.64  0.62       0.90   0.88

Positional Attribute  Description
word                  Normalised word form
pos                   Part-of-speech tag (Penn Treebank Tagset)
lemma                 Lemma, according to TreeTagger
orig                  Original word form
avs                   Average surprisal
surprisal             Surprisal
avs50                 Average surprisal based on 50-year periods
avs10                 Average surprisal based on decades
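
The column layout described above can be read with a few lines of Python. The following is a minimal sketch, assuming the eight columns appear in exactly the order listed in this section and that the four surprisal columns always hold decimal numbers; the names `Token` and `parse_token_line` are hypothetical helpers, not part of the CWB.

```python
from typing import NamedTuple

# Column order as listed in this section (assumed to match the VRT file).
COLUMNS = ("word", "pos", "lemma", "orig", "avs", "surprisal", "avs50", "avs10")
NUMERIC = {"avs", "surprisal", "avs50", "avs10"}


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str
    orig: str
    avs: float
    surprisal: float
    avs50: float
    avs10: float


def parse_token_line(line: str) -> Token:
    """Split one TAB-delimited VRT token line into its positional attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    # Convert the surprisal columns to float, keep the rest as strings.
    values = [float(v) if name in NUMERIC else v for name, v in zip(COLUMNS, fields)]
    return Token(*values)


# Third token line of the example above, with explicit TAB separators.
token = parse_token_line("Books\tNPS\tBooks\tBooks\t2.64\t0.62\t0.90\t0.88")
print(token.word, token.pos, token.avs10)
```

Lines that start with `<` are structural tags rather than tokens and would have to be filtered out before calling the parser.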

Annotation on Structural Level

Structural attributes are given in two different representations:

  • a schematic XML representation (cf. VRT format)
  • structural attribute names as encoded in the online corpus


Metadata are encoded on the text level.

<text id="" issn="" title="" fpage="" lpage="" year="" decade="" period="" century="" volume="" journal="" author="" type="" corpusBuild="" jstorLink="">


<text id="100997" issn="03702316" title="An Extract of a Letter Written by Dr. Edward Brown from Vienna in Austria March 3. 1669. Concerning Two Parhelia’s or Mocksuns, Lately Seen in Hungary" fpage="953" lpage="953" year="1669" decade="1660" period="1650" century="1600" volume="4" journal="Philosophical Transactions (1665-1678)" author="Edward Brown" type="fla" corpusBuild="2.0" jstorLink="">
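
Metadata encoded in this way can be extracted from the start tag with a regular expression. The sketch below assumes that attribute values are enclosed in plain double quotes and contain no double quotes themselves; `parse_start_tag` is a hypothetical helper, and the tag in the usage example is an abbreviated version of the one above (title, journal and author omitted for brevity).

```python
import re

# Matches name="value" pairs; assumes values contain no double quotes.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# Matches a whole start tag such as <text id="..." year="...">.
TAG_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>\s*$')


def parse_start_tag(line: str) -> tuple[str, dict[str, str]]:
    """Return the element name and the attribute dict of a VRT start tag."""
    match = TAG_RE.match(line)
    if match is None:
        raise ValueError(f"not a start tag: {line!r}")
    name, attr_text = match.groups()
    return name, dict(ATTR_RE.findall(attr_text))


name, meta = parse_start_tag(
    '<text id="100997" issn="03702316" fpage="953" lpage="953" year="1669" '
    'decade="1660" period="1650" century="1600" volume="4" type="fla">'
)
print(name, meta["year"], meta["type"])
```

All values are returned as strings; numeric fields such as `year` need an explicit `int()` conversion before any arithmetic.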

Structural Attribute   Description
<text>                 Text (based on JSTOR source articles)
<text_author>          Author of the article
<text_century>         Century of publication
<text_corpusBuild>     Internal version number
<text_decade>          Decade of publication
<text_fpage>           First page of the article
<text_hasAbstract>     ID of the corresponding abstract
<text_id>              JSTOR ID
<text_isAbstractOf>    ID of the corresponding article
<text_issn>            ISSN of the journal
<text_journal>         Journal in which the article was published
<text_jstorLink>       Link to the source text on JSTOR
<text_lpage>           Last page of the article
<text_period>          50-year period of publication
<text_title>           Title of the article
<text_type>            Text type: abs, brv, fla or nws (see table below for details)
<text_volume>          Volume of the article
<text_year>            Year of publication
<inferred>             Inferred word (attribute from JSTOR)

Text type   Description
abs         Abstract
brv         Book review
fla         Full article
nws         Obituary
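
In the example <text> element, the decade, period and century values (1660, 1650, 1600) are the publication year 1669 rounded down to the nearest 10, 50 and 100 years. The sketch below reproduces this binning; that every text in the corpus follows the same rule is an assumption based on that single example, and `time_bins` is a hypothetical helper.

```python
def time_bins(year: int) -> dict[str, int]:
    """Round a publication year down to its decade, 50-year period and century.

    Assumption: bins start at multiples of 10, 50 and 100, as in the
    example <text> element (1669 -> 1660, 1650, 1600).
    """
    return {
        "decade": year - year % 10,
        "period": year - year % 50,
        "century": year - year % 100,
    }


print(time_bins(1669))  # {'decade': 1660, 'period': 1650, 'century': 1600}
```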


<page id="" no="">

Structural Attribute   Description
<page>                 Page (attribute from JSTOR)
<page_id>              Absolute page number
<page_no>              Relative page number


Texts are split into sentences based on the output of the TreeTagger.

<s no="" surprisal="" avs="" avs50="" avs10="">

Structural Attribute   Description
<s>                    Sentence boundary (based on SENT tags of TreeTagger)
<s_avs>                Average surprisal of sentence based on avs
<s_avs10>              Average surprisal of sentence based on avs10
<s_avs50>              Average surprisal of sentence based on avs50
<s_no>                 Relative sentence number (within a text)
<s_surprisal>          Surprisal of sentence based on surprisal
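
One plausible way to obtain sentence-level values from the token-level columns is to average them over each <s> region. The sketch below does this for a single column. Whether the corpus computes its sentence attributes as an arithmetic mean is an assumption, the VRT fragment is a hypothetical one built from the token lines shown earlier, and `sentence_means` is an illustrative helper.

```python
from statistics import mean

# Hypothetical VRT fragment: one <s> region with three token lines
# (columns: word, pos, lemma, orig, avs, surprisal, avs50, avs10).
VRT = """<s no="1">
of\tIN\tof\tof\t0.05\t0.29\t0.06\t0.10
some\tDT\tsome\tsome\t3.06\t0.41\t2.66\t2.04
Books\tNPS\tBooks\tBooks\t2.64\t0.62\t0.90\t0.88
</s>
"""

AVS = 4  # zero-based index of the avs column


def sentence_means(vrt: str, column: int) -> list[float]:
    """Return the arithmetic mean of one numeric column for every <s> region."""
    means, current = [], None
    for line in vrt.splitlines():
        if line == "<s>" or line.startswith("<s "):
            current = []  # a new sentence starts
        elif line == "</s>":
            if current:
                means.append(mean(current))
            current = None
        elif current is not None and line and not line.startswith("<"):
            # Token line inside a sentence; other tags are skipped.
            current.append(float(line.split("\t")[column]))
    return means


print(sentence_means(VRT, AVS))
```

The check `not line.startswith("<")` is a simplification: a token whose word form itself begins with `<` would be skipped.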


Normalised words are represented on both the token level and the structural level, in order to account for one-to-many relations (e.g. ’tis → this is, my self → myself) on the one hand and to allow easy access on the other.

<normalised orig="" auto="">

Structural Attribute   Description
<normalised>           Normalised token(s)
<normalised_auto>      Always “true” as all normalisations are automatic
<normalised_orig>      Original token(s)
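
The structural representation makes it possible to recover the original token sequence even where the number of tokens changed. The following sketch assumes that a <normalised> region encloses the normalised token lines and carries the original token(s) in its `orig` attribute, separated by spaces, and that tokens outside such regions were not changed. The input lines are hypothetical and show only the word column; `original_tokens` is an illustrative helper.

```python
import re

# Captures the value of the orig attribute of a <normalised ...> start tag.
ORIG_RE = re.compile(r'<normalised\b[^>]*\borig="([^"]*)"')


def original_tokens(vrt_lines):
    """Yield original word forms, undoing <normalised> regions."""
    inside = False
    for line in vrt_lines:
        line = line.rstrip("\n")
        match = ORIG_RE.match(line)
        if match:
            # Emit the original token(s) once for the whole region.
            yield from match.group(1).split()
            inside = True
        elif line == "</normalised>":
            inside = False
        elif line and not line.startswith("<") and not inside:
            # Unchanged token: the word column is the original form.
            yield line.split("\t")[0]


# Hypothetical fragment; the remaining columns are omitted for brevity.
lines = ["I", '<normalised orig="my self" auto="true">', "myself", "</normalised>", "saw"]
print(list(original_tokens(lines)))  # ['I', 'my', 'self', 'saw']
```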

Inferred Text

The element <inferred> is part of the JSTOR distribution. It refers to illegible text which was recovered from the context.

at      IN      at      at      1.96    0.19    1.95    1.85
the     DT      the     the     0.52    0.29    0.55    0.56
Root    NP      Root    Root    1.99    0.38    0.99    0.99
of      IN      of      of      0.48    0.08    1.91    1.74
the     DT      the     the     0.53    1.26    1.45    1.48
Tongue  NP      Tongue  Tongue  5.48    0.72    5.95    4.74