Welcome to the website of the Saarbrücken Cookbook Corpora

 

 

Description

The SaCoCo diachronic corpus is made up of cooking recipes organized into two different collections: historical and contemporary.

Historical component

The historical component contains a selection of recipes from different works. The full list of sources is given in the metadata. Most of these recipes were collected and transcribed by Andrea Wurm as part of her PhD. For more information, see Wurm 2008.

Contemporary component

The contemporary component contains cooking recipes from rezeptewiki.org. The selection criteria were temporal (only the latest version of each recipe) and geographical (only recipes belonging to German-speaking regions).

Normalization

We automatically assign a normalized spelling to each word in the corpus:

  • by relying on clustering techniques based on string and semantic similarity measures,
  • by identifying a set of diachronic variations of the same word form.

normalized form    variants
magst              magst_1574, magstu_1602
Hühner             Hüner_1574, hüner_1574, Hünner_1611
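
The idea of grouping dated spelling variants under one normalized form can be illustrated with a minimal Python sketch. It is not the procedure used to build the corpus: it relies on string similarity only (via the standard library's difflib), ignores semantic similarity, and assumes that the candidate normalized forms are already known. The function names and the 0.7 threshold are hypothetical choices made for this illustration.

```python
from difflib import SequenceMatcher


def split_variant(tagged):
    """Split a dated variant such as 'Hüner_1574' into ('Hüner', 1574)."""
    form, _, year = tagged.rpartition("_")
    return form, int(year)


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assign_normalized(form, candidates, threshold=0.7):
    """Return the most similar candidate, or None if none reaches the threshold.

    The threshold is an arbitrary value chosen for this sketch.
    """
    best = max(candidates, key=lambda c: similarity(form, c))
    return best if similarity(form, best) >= threshold else None


# Candidate normalized forms and dated variants taken from the table above.
candidates = ["magst", "Hühner"]
variants = ["magst_1574", "magstu_1602", "Hüner_1574", "hüner_1574", "Hünner_1611"]

normalized = {v: assign_normalized(split_variant(v)[0], candidates) for v in variants}
for variant, norm in normalized.items():
    print(variant, "->", norm)
```

With these inputs, every variant is mapped to the normalized form listed for it in the table. A realistic setup would also have to discover the normalized forms themselves and handle words that are similar in spelling but unrelated in meaning.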

Annotation

The corpus contains two types of annotation: structural and positional.

Structural annotation is encoded in XML. It describes the textual structure and also records metatextual information and shallow semantics:

  • metadata: id, collection, source, url, year, decade, period, language, ref;
  • shallow semantics: type, course, cuisine, ingredient, method;
  • structure: title, body, segment, paragraph, sentence.

Positional annotation is provided at token level and contains linguistic information:

  • word form;
  • POS (TreeTagger, STTS tagset);
  • lemma (TreeTagger);
  • normalized form.
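
To show how the two annotation layers fit together, here is a minimal Python sketch that reads a small sample in the one-token-per-line layout commonly used by corpus tools of the CWB/CQPweb family. Several things are assumptions made for illustration and are not taken from the corpus itself: the enclosing element name `text`, the column order (word form, POS, lemma, normalized form), and the sample content, including its POS tag and lemma.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    """One token with its positional annotation (column order is assumed)."""
    word: str
    pos: str
    lemma: str
    norm: str


# Invented sample: XML tags carry structural annotation, and
# tab-separated lines carry the positional annotation of one token each.
SAMPLE = (
    '<text id="demo" collection="historical" year="1574">\n'
    "<sentence>\n"
    "Hüner\tNN\tHuhn\tHühner\n"
    "</sentence>\n"
    "</text>\n"
)

ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(text):
    """Return (metadata, tokens) from a vertical-format string."""
    metadata, tokens = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("<"):
            # Structural annotation: keep the attributes of the <text> tag.
            if line.startswith("<text "):
                metadata = dict(ATTR.findall(line))
            continue
        # Positional annotation: one token per line, four columns.
        tokens.append(Token(*line.split("\t")))
    return metadata, tokens


metadata, tokens = parse_vertical(SAMPLE)
print(metadata["collection"], metadata["year"])
print(tokens[0].word, "->", tokens[0].norm)
```

The sketch keeps metadata attributes as plain strings and ignores the other structural elements. A fuller reader would track title, body, segment, paragraph and sentence boundaries as well.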

Access to the corpus

You can access the corpus via a CQPweb interface: Diachronic corpus
Access is granted under license.
For further information, please contact José Manuel Martínez Martínez.