Contemporary News Corpus for Ukrainian (CNC-UA)

Description

CNC-UA was built from a database dump provided to us by Suspilne. Linguistic annotations were added by processing the texts with the Stanza NLP library. Each text carries an identifier, the article title, and the date and time of publication as metadata. We are currently analysing the corpus with various language modelling techniques, including topic models.

The motivation for building the corpus was to track language use in news reporting as the Russian war against Ukraine proceeded.

The corpus contains 87,210,364 words in 292,955 texts. The sources represent standard language and were published between 2019 and 2022 on https://suspilne.media, the news website of the national public broadcaster of Ukraine.

Citation

Persistent identifier: http://hdl.handle.net/21.11119/0000-000E-1C5C-D

Stefan Fischer, Kateryna Haidarzhyi, Jörg Knappen, Yuliya Stodolinska, and Elke Teich. 2024. A Contemporary News Corpus for Ukrainian (CNC-UA). Poster at 46. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, Bochum (Germany), February/March 2024.

Stefan Fischer, Kateryna Haidarzhyi, Jörg Knappen, Olha Polishchuk, Yuliya Stodolinska, and Elke Teich. 2024. A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication. In Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, pages 1–7, Torino, Italia. ELRA and ICCL.

Licence

The news texts in the corpus are copyright © Suspilne Movlennya (Public Broadcasting Company of Ukraine). They are licensed for strictly non-commercial use under the condition that the texts are not changed and that the copyright owner is acknowledged with a full citation.

CNC-UA is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Acknowledgements

The authors acknowledge financial support from the Deutsche Forschungsgemeinschaft (DFG), project numbers 460033370 (Text+) and 232722074 (SFB 1102), as well as from the Federal Republic of Germany and the 16 federal states within the framework of the National Research Data Infrastructure (NFDI) and its association NFDI e.V.

Download

The corpus in CoNLL-U format, as defined by the Universal Dependencies project: CNC-UA_v1_conll.zip (1.8 GB, md5sum: aa365325dfd42da89870860a4d1736ec; sha1sum: 66d7e8c38d692f50b62b0c0aaa1534be521f60a8)
The corpus in vrt format, as used by the Open Corpus Workbench. This version has fewer annotation columns: lemma, upos, and xpos only, where xpos uses the MULTEXT-East Morphosyntactic Specifications, Version 4 tagset: CNC-UA_v1_vrt.zip (571 MB, md5sum: c34814d6380bea98f40dd355bc10b807; sha1sum: ef7ad90f5febb804241fa177ffb77cff58ed2beb)
A metadata file with the internal text metadata, helpful when installing the corpus in CQPweb: CNC-UA_v1_meta.zip (15 MB, md5sum: 424bc0d73deebd443aa91efd2698f305; sha1sum: 4b3e66d65773556005061dd0c91945d598aa3090)
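
The checksums listed above can be used to confirm that a download arrived intact. The following is a minimal sketch using only the Python standard library; it assumes the archives were saved under the file names shown in the list, and the helper names (file_digest, verify, check_downloads) are illustrative and not part of any corpus tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the download list above.
EXPECTED = {
    "CNC-UA_v1_conll.zip": {
        "md5": "aa365325dfd42da89870860a4d1736ec",
        "sha1": "66d7e8c38d692f50b62b0c0aaa1534be521f60a8",
    },
    "CNC-UA_v1_vrt.zip": {
        "md5": "c34814d6380bea98f40dd355bc10b807",
        "sha1": "ef7ad90f5febb804241fa177ffb77cff58ed2beb",
    },
    "CNC-UA_v1_meta.zip": {
        "md5": "424bc0d73deebd443aa91efd2698f305",
        "sha1": "4b3e66d65773556005061dd0c91945d598aa3090",
    },
}


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are never fully loaded into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True only if every digest in `expected` matches the file at `path`."""
    return all(
        file_digest(path, algorithm) == value.lower()
        for algorithm, value in expected.items()
    )


def check_downloads(directory="."):
    """Check each known archive found in `directory`; archives not present are skipped."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = verify(path, expected)
    return results


if __name__ == "__main__":
    # Assumes the archives sit in the current working directory.
    for name, ok in check_downloads().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Run from the directory that holds the downloaded archives, the script prints one line per archive it finds. A MISMATCH usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.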
