Welcome to the website of the GRUG Parallel Treebank

Description

This dataset consists of two types of resources: four monolingual Treebanks (German, Georgian, Russian and Ukrainian) and four parallel Treebanks (German-Georgian, German-Russian, German-Ukrainian and Georgian-Ukrainian). The parallel texts used for the experiment comprise German sentences and their translations into Georgian and Russian, compiled for the GREG NLP lexicon project. GREG itself contains valency data for manually aligned Georgian, Russian, English and German verbs (ca. 1250), augmented with example sentences considered to be translation equivalents. Each subcorpus used for the study contains roughly 2600 sentence pairs that correspond to different syntactic subcategorization frames considered to be German-Georgian translation equivalents. For Russian and Ukrainian, the translation equivalents were provided by Dr. Alla Mishchenko.

Morphological analysis

For the Georgian text, a finite-state morphological transducer was applied using the XEROX FST tools. For the other languages involved in the experiment (German, Russian and Ukrainian), morphological features, including POS tags, were assigned manually, drawing on the TIGER guidelines for German with the changes needed to fit the formal description of Russian and Ukrainian grammar.

Syntactic parsing

The syntactic annotation was done manually with Synpathy. It followed the TIGER guidelines, and the output is in TIGER-XML format.
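
As a rough illustration of what working with TIGER-XML output can look like, the following Python sketch reads terminals (tokens with POS tags) and nonterminals (phrase nodes with labelled edges) from a TIGER-XML string. The embedded sample sentence, its tags and the function name read_sentences are hypothetical and not taken from GRUG; only the general TIGER-XML element layout (s, graph, terminals/t, nonterminals/nt, edge) is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-sentence sample in TIGER-XML layout (not GRUG data).
TIGER_SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Er" pos="PPER"/>
          <t id="s1_2" word="kommt" pos="VVFIN"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SB" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def read_sentences(xml_text):
    """Return one dict per <s> element with its tokens and phrase nodes."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # Terminals carry the word form and the POS tag as attributes.
        tokens = [(t.get("word"), t.get("pos")) for t in s.iter("t")]
        # Each nonterminal has a category and labelled edges to its children.
        phrases = [
            (nt.get("cat"),
             [(e.get("label"), e.get("idref")) for e in nt.findall("edge")])
            for nt in s.iter("nt")
        ]
        sentences.append({"id": s.get("id"), "tokens": tokens, "phrases": phrases})
    return sentences


if __name__ == "__main__":
    for sentence in read_sentences(TIGER_SAMPLE):
        print(sentence["id"], sentence["tokens"], sentence["phrases"])
```

For a real treebank file, ET.parse(path).getroot() could replace ET.fromstring; the rest of the traversal would stay the same under the layout assumed here.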

Alignment of monolingual Treebanks into parallel bilingual Treebanks

The alignment of the monolingual (GO, RU, UK, GE) Treebanks into the bilingual (GE-GO, GE-RU, GE-UK, GO-UK) ones was done manually with the Stockholm TreeAligner. Alignment was performed at the sentence, phrase and word levels. Two types of translation equivalents are aligned: "exact" and "fuzzy".
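
The distinction between the two alignment types can be inspected programmatically. The sketch below counts alignment links by type in a small alignment document. It is a sketch under stated assumptions, not a description of the GRUG files: the element and attribute names (align, node, node_id, treebank_id, type), the type values "good" (for exact) and "fuzzy", and the sample content are assumptions about the TreeAligner XML layout and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical alignment sample; element names, attribute names and
# type values are assumptions, and the node ids are invented.
ALIGN_SAMPLE = """<treealign>
  <head>
    <treebanks>
      <treebank id="de" filename="de.xml"/>
      <treebank id="ka" filename="ka.xml"/>
    </treebanks>
  </head>
  <body>
    <alignments>
      <align type="good">
        <node node_id="s1_1" treebank_id="de"/>
        <node node_id="s1_1" treebank_id="ka"/>
      </align>
      <align type="fuzzy">
        <node node_id="s1_500" treebank_id="de"/>
        <node node_id="s1_501" treebank_id="ka"/>
      </align>
    </alignments>
  </body>
</treealign>"""


def count_alignment_types(xml_text):
    """Count <align> links per value of their type attribute."""
    root = ET.fromstring(xml_text)
    return Counter(a.get("type") for a in root.iter("align"))


def aligned_pairs(xml_text):
    """Return (type, {treebank_id: node_id}) for every alignment link."""
    root = ET.fromstring(xml_text)
    pairs = []
    for a in root.iter("align"):
        nodes = {n.get("treebank_id"): n.get("node_id") for n in a.findall("node")}
        pairs.append((a.get("type"), nodes))
    return pairs


if __name__ == "__main__":
    print(count_alignment_types(ALIGN_SAMPLE))
    print(aligned_pairs(ALIGN_SAMPLE))
```

Because node ids refer back to the monolingual treebank files, word-level and phrase-level links could be told apart by looking up whether an id belongs to a terminal or a nonterminal in the corresponding TIGER-XML file.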

Tools to explore GRUG

To explore the four monolingual treebanks (Georgian, Russian, Ukrainian and German), use the TIGERSearch software or SALTO, which also accepts TIGER-XML files as input.

To explore the four parallel treebanks (German-Georgian, German-Russian, German-Ukrainian and Georgian-Ukrainian), use the Stockholm TreeAligner.