The EuroParl-UdS corpus consists of parliamentary debates of the European Parliament, enriched with metadata about the text and the speaker. It is currently available for English, German and Spanish, and the data is in XML format. Plain-text versions, filtered by native-speaker status, are also available. The corpus contains European Parliament texts produced between 1999 and 2017.
More specifically, it consists of:
The full pipeline to compile the corpus as well as the documentation for all necessary (pre- and post-) processing steps is available on GitHub.
The EuroParl-UdS corpus files comprise metadata-rich XML files for English, German and Spanish; parallel corpora for English into German and English into Spanish; and comparable monolingual corpora for English, German and Spanish.
The code used for collecting and structuring the EuroParl-UdS corpus is available on GitHub.
If you use this corpus, please cite the following reference:
Alina Karakanta, Mihaela Vela, and Elke Teich. 2018. EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates. Proceedings of LREC 2018. Miyazaki, Japan.
All versions of the EuroParl-UdS corpus are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.