CNC-UA was built from a database dump provided to us by Suspilne. Linguistic annotations were added by processing the texts with the Stanza NLP library. Each text carries metadata: an identifier, the article title, and the date and time of publication. We are currently analysing the corpus with various language modelling techniques, including topic models.
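To illustrate the kind of annotation step described above, here is a minimal sketch of running Stanza over a Ukrainian text and writing the result in a one-token-per-line layout. It is not the pipeline used to build CNC-UA: the helper names (`build_pipeline`, `annotate`, `to_vertical`), the choice of annotation layers (word form, lemma, universal POS tag), the output layout and the text identifier are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# One annotated token: (word form, lemma, universal POS tag).
# Which layers CNC-UA actually contains is not stated here; this is an assumption.
Token = Tuple[str, str, str]


def build_pipeline():
    """Load Stanza's default Ukrainian pipeline (downloads models on first use)."""
    import stanza  # imported lazily so the helpers below work without Stanza

    stanza.download("uk")
    return stanza.Pipeline(lang="uk")


def annotate(text: str, nlp) -> List[List[Token]]:
    """Run a Stanza pipeline on `text` and return one token list per sentence."""
    doc = nlp(text)
    return [
        # Missing lemmas or tags are replaced by "_" so every row has three fields.
        [(w.text, w.lemma or "_", w.upos or "_") for w in sent.words]
        for sent in doc.sentences
    ]


def to_vertical(text_id: str, sentences: Iterable[List[Token]]) -> str:
    """Serialise annotated sentences as tab-separated rows inside <text>/<s> tags.

    The layout is a hypothetical example, not the published corpus format.
    """
    lines = [f'<text id="{text_id}">']
    for sent in sentences:
        lines.append("<s>")
        lines.extend("\t".join(tok) for tok in sent)
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines)
```

With Stanza and its Ukrainian models installed, a call such as `to_vertical("demo-1", annotate("Це приклад речення.", build_pipeline()))` would return the annotated text as a single string; the identifier `demo-1` is a placeholder.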
The motivation for building the corpus was to track language use in news reporting as the Russian war against Ukraine proceeded.
The corpus contains 87,210,364 words in 292,955 texts. The sources represent standard language and were published between 2019 and 2022 on https://suspilne.media, the news website of the national public broadcaster of Ukraine.
Persistent identifier: http://hdl.handle.net/21.11119/0000-000E-1C5C-D
Stefan Fischer, Kateryna Haidarzhyi, Jörg Knappen, Yuliya Stodolinska, and Elke Teich (2024). A Contemporary News Corpus for Ukrainian (CNC-UA). Poster at 45. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, Cologne (Germany), March 2024.
The news texts in the corpus are copyright © Suspilne Movlennya (Public Broadcasting Company of Ukraine). They are licensed for strictly non-commercial use, on condition that the texts are not changed and that the copyright owner is acknowledged with a full citation.
CNC-UA is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The authors acknowledge financial support from the Deutsche Forschungsgemeinschaft (DFG), project numbers 460033370 (Text+) and 232722074 (SFB 1102), as well as from the Federal Republic of Germany and its 16 federal states within the framework of the National Research Data Infrastructure (NFDI) and its association NFDI e.V.