Welcome to the website of the VARTRA Corpus

Description

The VARTRA corpus consists of English original texts and multiple German translations of them, organised into subcorpora by translation method: EO - English originals; PT - translations by professional translators; CAT - translations by student translators with the help of CAT tools (translation memories); RBMT - rule-based machine translation; SMT1 - statistical machine translation with Google; SMT2 - statistical machine translation with Moses.

English originals (EO)

English original texts were exported from CroCo, a German-English corpus with parallel and comparable components (see clarind-uds:croco). For more information about this corpus, see Hansen-Schirra et al. (2012).

Professional Translations (PT)

Translations by professionals were also exported from CroCo, which has an English-German corpus component.

Student Translations (CAT)

Translations by students of the FR4.6 Department (Applied Linguistics, Translation and Interpreting) at Saarland University. All students used translation memories within the CAT tool ACROSS during the translation process.

Rule-Based Machine Translations (RBMT)

Translations produced with the SYSTRAN6 machine translation system.

Statistical Machine Translations (SMT1)

Translations produced with the Google translation toolkit.

Statistical Machine Translations (SMT2)

Translations produced with an in-house Moses-based machine translation system.

Text Genres / Registers

Each translation variant is saved as a subcorpus and covers seven registers of written language:

  • political essays (ESSAY),
  • fictional texts (FICTION),
  • manuals (INSTR),
  • popular-scientific articles (POPSCI),
  • letters to shareholders (SHARE),
  • prepared political speeches (SPEECH),
  • and tourism leaflets (TOU).
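
The design described above (each subcorpus covering the same seven registers) can be pictured as a simple grid. The following is only an illustrative sketch: the labels are taken from this page, but the function name design_cells and the idea of pairing labels into tuples are assumptions made for the example, not part of the corpus distribution.

```python
from itertools import product

# Subcorpus and register labels as listed on this page.
SUBCORPORA = ["EO", "PT", "CAT", "RBMT", "SMT1", "SMT2"]
REGISTERS = ["ESSAY", "FICTION", "INSTR", "POPSCI", "SHARE", "SPEECH", "TOU"]


def design_cells(subcorpora=SUBCORPORA, registers=REGISTERS):
    """Return every (subcorpus, register) pair of the corpus design.

    Hypothetical helper for illustration only.
    """
    return list(product(subcorpora, registers))


cells = design_cells()
print(len(cells))  # 6 subcorpora x 7 registers = 42
print(cells[0])    # ('EO', 'ESSAY')
```

Such a grid is convenient when comparing one register across translation methods, for example by selecting all pairs whose second element is "POPSCI".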

Annotation

The corpus contains two types of annotation: structural and positional.

Structural annotation is encoded in XML and describes the textual structure:

  • metadata: id, register;
  • structure: sentence.

Positional annotation is provided at token level and contains linguistic information:

  • word form;
  • POS (TreeTagger, STTS tagset);
  • lemma (TreeTagger);
  • chunk (TreeTagger).
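
To make the two annotation layers concrete, here is a minimal sketch of how such data could be read if it were exported in a one-token-per-line "vertical" format, as is common for corpora served through CQPweb. Everything about the file layout is an assumption: the element names text and s, the tab-separated column order (word, POS, lemma, chunk), the function name read_sentences, and the invented sample sentence with illustrative tags are not taken from the VARTRA distribution.

```python
import re

# Assumed column order for the positional (token-level) annotation.
COLUMNS = ("word", "pos", "lemma", "chunk")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_sentences(lines):
    """Yield one dict per sentence with text metadata and its tokens.

    Assumes structural tags stand on their own lines and token lines
    are tab-separated (hypothetical layout, see lead-in).
    """
    meta = {}
    tokens = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        is_tag = line.startswith("<") and "\t" not in line
        if is_tag and line.startswith("<text"):
            # Structural annotation: metadata (id, register) of a text.
            meta = dict(ATTR_RE.findall(line))
        elif is_tag and line.startswith("</s"):
            # End of a sentence: emit what has been collected.
            yield {"id": meta.get("id"),
                   "register": meta.get("register"),
                   "tokens": tokens}
            tokens = []
        elif is_tag:
            # Opening <s> and any other tags carry no token data here.
            continue
        else:
            # Positional annotation: one token with its attributes.
            tokens.append(dict(zip(COLUMNS, line.split("\t"))))


# Invented sample; identifiers and tags are illustrative only.
SAMPLE = (
    '<text id="sample_001" register="ESSAY">\n'
    "<s>\n"
    "Das\tPDS\tdie\tNC\n"
    "ist\tVAFIN\tsein\tVC\n"
    "ein\tART\teine\tNC\n"
    "Beispiel\tNN\tBeispiel\tNC\n"
    ".\t$.\t.\t0\n"
    "</s>\n"
    "</text>\n"
)

for sentence in read_sentences(SAMPLE.splitlines()):
    print(sentence["register"], [t["word"] for t in sentence["tokens"]])
```

Keeping the metadata of the enclosing text on every sentence makes it straightforward to filter sentences by register later on.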

Access to the corpus

You can access the corpus via a CQPweb interface: VARTRA
Access is granted under license and upon request.
For further information, please contact Ekaterina Lapshinova-Koltunski.