The EuroParl-UdS corpus consists of parliamentary debates of the European Parliament, enriched with metadata about the text and the speaker. It is currently available for English, German and Spanish, and the data is in XML format. Plain text versions, filtered by whether the speaker is a native speaker, are also available. The corpus contains texts of the European Parliament produced between 1999 and 2017.

More specifically, it consists of:

  1. metadata-rich XML files for each language,
  2. parallel (sentence-aligned) plain text corpora for English into German and English into Spanish, where the source side contains texts only by native English speakers,
  3. comparable monolingual plain text corpora for English, German and Spanish, containing texts produced only by native speakers of each language respectively,
  4. texts filtered based on criteria relevant for translationese research; for each language L:
    1. originals in L,
    2. originals by native speakers of L,
    3. all translations into L, and
    4. translations from a specific source language into L.
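
The four filters listed under item 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Segment` record and its field names (`lang`, `source_lang`, `speaker_native_lang`) are hypothetical and do not reflect the actual XML schema or the filtering scripts of the corpus.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    # Hypothetical record; the field names are assumptions, not the corpus schema.
    text: str
    lang: str                           # language the text is written in
    source_lang: str                    # language the speech was originally given in
    speaker_native_lang: Optional[str]  # speaker's native language, if known


def originals(segments: List[Segment], L: str) -> List[Segment]:
    """Filter 1: texts written in L that were also originally produced in L."""
    return [s for s in segments if s.lang == L and s.source_lang == L]


def native_originals(segments: List[Segment], L: str) -> List[Segment]:
    """Filter 2: originals in L whose speaker is a native speaker of L."""
    return [s for s in originals(segments, L) if s.speaker_native_lang == L]


def translations_into(
    segments: List[Segment], L: str, source: Optional[str] = None
) -> List[Segment]:
    """Filters 3 and 4: translations into L, optionally from one source language."""
    return [
        s
        for s in segments
        if s.lang == L
        and s.source_lang != L
        and (source is None or s.source_lang == source)
    ]


# Invented toy data, for illustration only.
sample = [
    Segment("Thank you, Mr President.", "en", "en", "en"),
    Segment("I welcome this report.", "en", "en", "de"),
    Segment("Vielen Dank, Herr Präsident.", "de", "en", "en"),
    Segment("Gracias, señor Presidente.", "es", "en", "en"),
    Segment("Thank you very much.", "en", "de", "de"),
    Segment("Thank you.", "en", "es", "es"),
]

english_native = native_originals(sample, "en")
english_from_german = translations_into(sample, "en", source="de")
```

Treating filter 4 as an optional `source` argument of filter 3 reflects that a source-specific subset is a restriction of the set of all translations into L.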

The full pipeline for collecting, structuring and compiling the corpus, together with the documentation for all necessary pre- and post-processing steps, is available on GitHub.

Citing the EuroParl-UdS corpus

If you use this corpus, please cite the following reference:

Alina Karakanta, Mihaela Vela, and Elke Teich. 2018. EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates. Proceedings of the LREC 2018. Miyazaki, Japan.


All versions of the EuroParl-UdS corpus are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.