Annotation

For corpus query and analysis, the RSC is encoded in CQP format (cf. IMS Open Corpus Workbench (CWB)). The CWB requires a simple XML-style input format, the so-called vrt-format (vertical text format). Annotations on the token level (positional attributes, e.g. word, pos, lemma) are represented in a one-word-per-line layout with TAB-delimited columns, one column per positional attribute. Annotations beyond the token level (structural attributes, e.g. texts, sentences, pages) are represented as SGML tags with optional attribute-value pairs. Metadata, for example, are encoded as attributes of the <text> element. In the following we give a detailed overview of annotations on the token level (positional attributes) and on the structural level (structural attributes).

Annotation on Token Level

Attributes are listed in the order of the columns in the VRT file. Attribute names refer to the positional attributes encoded in the online corpus.

word   pos  lemma  orig   srp   srp_avg  srp_rnd  srp_avg_rnd  doc   doc_avg  doc_rnd  doc_avg_rnd  s50   s50_avg  s50_rnd  s50_avg_rnd  s10   s10_avg  s10_rnd  s10_avg_rnd

of     IN   of     of     0.05  1.588    0        2            0.29  0.361    0        0            0.06  2.023    0        2            0.10  2.214    0        2
some   DT   some   some   3.06  6.597    3        7            0.41  0.789    0        1            2.66  6.306    3        6            2.04  6.387    2        6
Books  NPS  Books  Books  2.64  9.027    3        9            0.62  0.687    1        1            0.90  6.109    1        6            0.88  5.479    1        5

Positional Attribute	Description
`word`	Normalised word form (VARD)
`pos`	Part-of-speech tag (Penn Treebank Tagset)
`lemma`	Lemma, according to TreeTagger
`orig`	Original word form
`srp`	Surprisal
`srp_avg`	Average surprisal
`srp_rnd`	Surprisal (rounded)
`srp_avg_rnd`	Average surprisal (rounded)
`doc`	Document surprisal
`doc_avg`	Average document surprisal
`doc_rnd`	Document surprisal (rounded)
`doc_avg_rnd`	Average document surprisal (rounded)
`s50`	Surprisal on 50-year periods
`s50_avg`	Average surprisal on 50-year periods
`s50_rnd`	Surprisal on 50-year periods (rounded)
`s50_avg_rnd`	Average surprisal on 50-year periods (rounded)
`s10`	Surprisal on decades
`s10_avg`	Average surprisal on decades
`s10_rnd`	Surprisal on decades (rounded)
`s10_avg_rnd`	Average surprisal on decades (rounded)
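
As a minimal sketch of how one token line could be read, the following Python snippet splits a TAB-delimited line into the positional attributes listed above. The function name `parse_token_line` and the type conversions (integers for the `_rnd` columns, floats for the other surprisal columns) are assumptions made for illustration; they are not part of the corpus tooling.

```python
# Column order as listed in the table above.
POSITIONAL_ATTRIBUTES = [
    "word", "pos", "lemma", "orig",
    "srp", "srp_avg", "srp_rnd", "srp_avg_rnd",
    "doc", "doc_avg", "doc_rnd", "doc_avg_rnd",
    "s50", "s50_avg", "s50_rnd", "s50_avg_rnd",
    "s10", "s10_avg", "s10_rnd", "s10_avg_rnd",
]
STRING_COLUMNS = {"word", "pos", "lemma", "orig"}


def parse_token_line(line):
    """Split one TAB-delimited VRT token line into a dict keyed by attribute name.

    Hypothetical helper: rounded columns are read as int, the remaining
    surprisal columns as float (an assumption about the value types).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(POSITIONAL_ATTRIBUTES):
        raise ValueError(
            f"expected {len(POSITIONAL_ATTRIBUTES)} columns, got {len(fields)}"
        )
    token = {}
    for name, value in zip(POSITIONAL_ATTRIBUTES, fields):
        if name in STRING_COLUMNS:
            token[name] = value
        elif name.endswith("_rnd"):
            token[name] = int(value)
        else:
            token[name] = float(value)
    return token


# The "some" row from the example above, joined with TABs.
example = "\t".join([
    "some", "DT", "some", "some",
    "3.06", "6.597", "3", "7",
    "0.41", "0.789", "0", "1",
    "2.66", "6.306", "3", "6",
    "2.04", "6.387", "2", "6",
])
token = parse_token_line(example)
print(token["word"], token["pos"], token["srp"], token["s10_avg_rnd"])
```

Keeping the column names in a single list means that a change in column order only has to be made in one place.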

Annotation on Structural Level

Structural attributes are given in two different representations:

a schematic XML representation (cf. vrt-format)
structural attribute names as encoded in the online corpus

Texts

Metadata are encoded on the text level.

<text id="" issn="" title="" fpage="" lpage="" year="" volume="" journal="" author="" type="" corpusBuild="" jstorLink="" primaryTopic="" primaryTopicPercentage="" secondaryTopic="" secondaryTopicPercentage="" decade="" period="" century="" pages="" sentences="" tokens="" visualizationLink="" doi="" doiLink="" hasAbstract="" isAbstractOf="">

Example:

<text id="100997" issn="03702316" title="An Extract of a Letter Written by Dr. Edward Brown from Vienna in Austria March 3. 1669. Concerning Two Parhelia's or Mocksuns, Lately Seen in Hungary" fpage="953" lpage="953" year="1669" volume="4" journal="Philosophical Transactions (1665-1678)" author="Edward Brown" type="fla" corpusBuild="4.0" jstorLink="http://www.jstor.org/stable/100997" primaryTopic="Solar System" primaryTopicPercentage="45.9015834509" secondaryTopic="Observation" secondaryTopicPercentage="23.7890339956" decade="1660" period="1650" century="1600" pages="1" sentences="10" tokens="245" visualizationLink="http://corpora.clarin-d.uni-saarland.de/surprisal/4.0.0/?id=100997" doi="10.1098/rstl.1669.0015" doiLink="https://dx.doi.org/10.1098/rstl.1669.0015" hasAbstract="" isAbstractOf="">

Structural Attribute	Description
`<text>`	Text (based on JSTOR source articles)
`<text_author>`	Author of the article
`<text_century>`	Century of publication
`<text_corpusBuild>`	Internal version number
`<text_decade>`	Decade of publication
`<text_doi>`	DOI of article
`<text_doiLink>`	Link to DOI resolver
`<text_fpage>`	First page of the article
`<text_hasAbstract>`	ID of the corresponding abstract
`<text_id>`	JSTOR ID
`<text_isAbstractOf>`	ID of the corresponding article
`<text_issn>`	ISSN of the journal
`<text_journal>`	Journal in which the article was published
`<text_jstorLink>`	Link to the source text on JSTOR
`<text_lpage>`	Last page of the article
`<text_pages>`	Number of pages in text
`<text_period>`	50-year period of publication
`<text_primaryTopic>`	Primary topic according to topic model (see table below for details)
`<text_primaryTopicPercentage>`	Percentage of primary topic
`<text_secondaryTopic>`	Secondary topic according to topic model (see table below for details)
`<text_secondaryTopicPercentage>`	Percentage of secondary topic
`<text_sentences>`	Number of sentences in text
`<text_title>`	Title of the article
`<text_tokens>`	Number of tokens in text
`<text_type>`	Text type: `abs`, `brv`, `fla` or `nws` (see table below for details)
`<text_visualizationLink>`	Link to visualization
`<text_volume>`	Journal volume containing the article
`<text_year>`	Year of publication
`<inferred>`	Inferred word (attribute from JSTOR)

Topic Label	Topic Words
Botany	plant leaves plants tab tree foliis folio seeds flowers bark seed species leaf trees fruit ray fl roots soil
Chemistry I	water acid grains quantity salt iron solution air experiments found lime colour substance matter gold heat made part copper
Chemistry II	acid water solution hydrogen oxygen obtained action salt cent alcohol gave substance liquid chloride compound ammonia grm nitrogen oxide
Electromagnetism	wire iron electricity experiments current experiment made end electric copper power force length metal diameter effect glass magnetic electrical
Experiment	author present general subject state results nature similar case place great observations fact action form power made change view
Formulae	cos sin oo tan ab sine axis ac io nt cd aa log vi cc arc al bc ef
French	la les le des en dans du par qui une qu il ou pour ce je sur au ne
Galaxy	stars distance position star obs equatorial diff small vf double magnitudes nebula nf sp np sf passy measures observations
Geography	sea water great miles found north part time river south side earth land east west ground places high place
Headmatter	years year society time royal life age great number made letter part work published science john london men country
Latin	quae quam sed ab sit vero hoc sunt ac qui esse etiam autem pro erit inter quo haec aut
Mathematics	equation equations series number form terms values case equal order point curve roots line function term sum general method
Mechanics	force motion equal point surface velocity axis line plane body direction angle centre fluid distance forces parallel gravity perpendicular
Meteorology	observations time hours tide water station hill height diurnal made stations st moon high difference pendulum results arc day
Observation	made great found parts part make time small water body account long nature manner put find kind good common
Optics	light rays glass eye spectrum red lines colour colours surface blue white lens object line image angle dark part
Paleontology	bone part bones teeth surface upper lower side anterior length posterior jaw tooth skull process long large cartilage head
Physiology I	blood time animal day urine parts hours heart found food part days quantity case body fat lungs experiments made
Physiology II	fibres nerves nerve part muscles vessels side muscular posterior anterior left structure portion surface branches muscle substance heart membrane
Reproduction	cells form species surface structure cell membrane found part shell animal size ova corpuscles fluid egg ovum development appearance
Solar System	sun time observations made moon distance observed observation telescope limb difference latitude instrument star stars place found motion degrees
Terrestrial Magnetism	needle magnetic ship observations direct force compass north made dip erebus iron terror def intensity observed magnetism south mag
Thermodynamics	air water heat temperature experiments tube experiment gas time made mercury thermometer pressure glass atmosphere quantity weight cold vapour
Weather	rain cloudy ditto fair wind weather clear sw day fine ne cy se rn winds april night di nw

Text type	Description
`abs`	Abstract
`brv`	Book review
`fla`	Full article
`nws`	Obituary
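
The text-level metadata can be read with a few lines of Python. The sketch below assumes that attribute values in the VRT file are wrapped in plain double quotes and contain no double quotes themselves; `parse_start_tag` and `floor_to` are hypothetical helper names. The tag is an abbreviated version of the example above. The `floor_to` check only illustrates that `decade`, `period` and `century` in that one example equal the year rounded down to 10, 50 and 100 years; whether the corpus derives them this way in general is an assumption.

```python
import re

# Matches name="value" pairs; assumes values contain no double quotes.
ATTRIBUTE_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_start_tag(line):
    """Return (element name, attribute dict) for a start tag such as <text ...>."""
    match = re.match(r"<(\w+)", line.strip())
    if match is None:
        raise ValueError(f"not a start tag: {line!r}")
    return match.group(1), dict(ATTRIBUTE_PATTERN.findall(line))


def floor_to(year, span):
    """Round a year down to the start of its span (10, 50 or 100 years)."""
    return year - year % span


# Abbreviated version of the <text> example above.
tag = ('<text id="100997" year="1669" type="fla" '
       'decade="1660" period="1650" century="1600">')
name, attrs = parse_start_tag(tag)
year = int(attrs["year"])
print(name, attrs["id"], attrs["type"])
print(floor_to(year, 10), floor_to(year, 50), floor_to(year, 100))
```

All attribute values are returned as strings; numeric fields such as `year` or `tokens` have to be converted explicitly.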

Pages

<page id="" no="" tokens="">

Structural Attribute	Description
`<page>`	Page (attribute from JSTOR)
`<page_id>`	Absolute page number
`<page_no>`	Relative page number
`<page_tokens>`	Number of tokens in page

Sentences

Texts are split into sentences based on the output of the TreeTagger.

<s srp="" doc="" s50="" s10="" no="" tokens="">

Structural Attribute	Description
`<s>`	Sentence boundary (based on `SENT` tags of TreeTagger)
`<s_srp>`	Average surprisal of sentence based on `srp`
`<s_doc>`	Average surprisal of sentence based on `doc`
`<s_s50>`	Average surprisal of sentence based on `s50`
`<s_s10>`	Average surprisal of sentence based on `s10`
`<s_no>`	Relative sentence number (within a text)
`<s_tokens>`	Number of tokens in sentence
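
To illustrate what a sentence-level average over a token-level column looks like, the following sketch computes the arithmetic mean of `srp` over a list of tokens. It assumes that the sentence attributes are plain arithmetic means of the corresponding token values, which the table above suggests but does not state explicitly. The three tokens are taken from the token-level example and do not form a complete sentence; `sentence_average` is a hypothetical helper.

```python
from statistics import mean


def sentence_average(tokens, column):
    """Arithmetic mean of one positional attribute over the tokens of a sentence."""
    return mean(token[column] for token in tokens)


# Three tokens from the token-level example (not a complete sentence).
tokens = [
    {"word": "of", "srp": 0.05},
    {"word": "some", "srp": 3.06},
    {"word": "Books", "srp": 2.64},
]
avg = sentence_average(tokens, "srp")
print(round(avg, 3))
```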

Normalisation

Normalised words are represented on both the token level and the structural level, to account for one-to-many and many-to-one relations (e.g. ’tis → it is, my self → myself) on the one hand and to allow for easy access on the other.

<normalised orig="" auto="">

Structural Attribute	Description
`<normalised>`	Normalised token(s)
`<normalised_auto>`	Always “true” as all normalisations are automatic
`<normalised_orig>`	Original token(s)
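
The following sketch shows one way to pair original and normalised forms by reading `<normalised>` regions from VRT lines. The sample lines are hypothetical and abbreviated to the word column; they are modelled on the "my self" example above and are not taken from the corpus. The helper name `normalisation_pairs` is likewise an assumption, as is the premise that attribute values are wrapped in plain double quotes.

```python
import re

# Matches name="value" pairs; assumes values contain no double quotes.
ATTRIBUTE_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def normalisation_pairs(lines):
    """Collect (original, normalised) pairs from <normalised> regions."""
    pairs = []
    current_orig = None   # value of orig="" while inside a region
    current_words = []    # normalised word forms seen inside the region
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<normalised"):
            current_orig = dict(ATTRIBUTE_PATTERN.findall(stripped)).get("orig", "")
            current_words = []
        elif stripped.startswith("</normalised>"):
            if current_orig is not None:
                pairs.append((current_orig, " ".join(current_words)))
            current_orig = None
        elif current_orig is not None and stripped and not stripped.startswith("<"):
            # Token line: the first column is the normalised word form.
            current_words.append(stripped.split("\t")[0])
    return pairs


# Hypothetical sample, abbreviated to the word column.
sample = [
    '<normalised orig="my self" auto="true">',
    "myself",
    "</normalised>",
    "was",
    "present",
]
pairs = normalisation_pairs(sample)
print(pairs)
```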

Inferred Text

The element <inferred> is part of the JSTOR distribution. It refers to illegible text which was recovered from the context.

at      IN      at      at      1.96    5.765   2       6       0.19    0.787   0       1       1.95    6.497   2       6       1.85    6.796   2       7
the     DT      the     the     0.51    1.846   1       2       0.29    0.486   0       0       0.55    2.219   1       2       0.56    2.321   1       2
<inferred>
Root    NP      Root    Root    1.99    9.710   2       10      0.38    1.002   0       1       0.99    9.157   1       9       0.99    8.850   1       9
of      IN      of      of      0.48    1.588   0       2       0.08    0.361   0       0       1.91    2.023   2       2       1.74    2.230   2       2
</inferred>
the     DT      the     the     0.54    1.846   1       2       1.26    0.486   1       0       1.45    2.219   1       2       1.48    2.321   1       2
Tongue  NP      Tongue  Tongue  5.45    10.111  5       10      0.72    0.942   1       1       5.95    10.067  6       10      4.74    10.779  5       11
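
A minimal sketch of how the tokens inside such a region could be collected is given below. It uses the lines of the example above, abbreviated to the first four columns, and assumes that token lines never start with a literal `<` (i.e. that such characters are escaped in the VRT file). `tokens_in_region` is a hypothetical helper name.

```python
def tokens_in_region(lines, element):
    """Return the word forms of all tokens inside <element> ... </element>."""
    inside = False
    collected = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(f"</{element}>"):
            inside = False
        elif stripped == f"<{element}>" or stripped.startswith(f"<{element} "):
            inside = True
        elif inside and stripped and not stripped.startswith("<"):
            # Token line: the first TAB-delimited column is the word form.
            collected.append(stripped.split("\t")[0])
    return collected


# Lines from the example above, abbreviated to word, pos, lemma and orig.
vrt_lines = [
    "at\tIN\tat\tat",
    "the\tDT\tthe\tthe",
    "<inferred>",
    "Root\tNP\tRoot\tRoot",
    "of\tIN\tof\tof",
    "</inferred>",
    "the\tDT\tthe\tthe",
    "Tongue\tNP\tTongue\tTongue",
]
inferred = tokens_in_region(vrt_lines, "inferred")
print(inferred)
```

The same function works for any other structural attribute without nesting of the same element, e.g. `tokens_in_region(vrt_lines, "s")`.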

