Colloquium

Friday, 9 May 2008, building A2 2, Room 1.20

Time: 11.00 - 12.30
Programme: Exploiting statistical and linguistic knowledge in a text alignment system
Bettina Schrader, Universität Potsdam

Abstract

Within machine translation, the alignment of corpora has evolved into a mature research area, aimed primarily at providing training data for statistical machine translation (SMT). The alignment techniques used for this purpose fall roughly into two classes: sentence alignment approaches, which often combine statistical and linguistic information, and word alignment models, which are dominated by the SMT paradigm. Alignment approaches that use linguistic knowledge are rare, as are non-statistical word alignment strategies. Furthermore, parallel corpora are typically not aligned at all text levels simultaneously. Rather, a corpus is first sentence-aligned, and word alignments are computed in a subsequent step.

In this talk, I will present an alignment platform that does not distinguish between the two alignment classes. Instead, it has been designed to align at the paragraph, sentence, word, and phrase levels simultaneously, and it can combine linguistic and statistical information. Combining alignment cues from different knowledge sources, and combining the sentence and word alignment tasks, is made possible by the platform's modular design. Its main features are that it supports different kinds of linguistic corpus annotation and that it aligns a corpus hierarchically. Alignment cues are not used within a single global alignment model. Rather, different sub-models can be implemented and allowed to interact.
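
To make the idea of interacting sub-models more concrete, the following is a minimal sketch in Python. It is not the platform presented in the talk: every name in it (length_cue, make_lexicon_cue, combine, greedy_align, align_hierarchically), the toy lexicon, the equal weighting of cues, the greedy one-to-one linking, and the 0.6 threshold are assumptions made purely for illustration. It shows one statistical-style cue (length ratio) and one linguistic-style cue (a bilingual lexicon) being combined, with word-level evidence feeding the sentence-level decision and word links then being computed inside each linked sentence pair.

```python
from typing import Callable, Dict, List, Set, Tuple

# A cue scores how well a source unit matches a target unit (0.0 to 1.0).
Cue = Callable[[str, str], float]
Link = Tuple[int, int]


def length_cue(src: str, tgt: str) -> float:
    """Statistical-style cue (assumed): ratio of the shorter to the longer length."""
    longest = max(len(src), len(tgt))
    return min(len(src), len(tgt)) / longest if longest else 1.0


def make_lexicon_cue(lexicon: Set[Tuple[str, str]]) -> Cue:
    """Linguistic-style cue (assumed): 1.0 if the pair is in a bilingual lexicon."""
    def cue(src: str, tgt: str) -> float:
        return 1.0 if (src.lower(), tgt.lower()) in lexicon else 0.0
    return cue


def combine(cues: List[Cue]) -> Cue:
    """Let sub-models interact by averaging their scores (equal weights assumed)."""
    return lambda src, tgt: sum(c(src, tgt) for c in cues) / len(cues)


def greedy_align(src_units: List[str], tgt_units: List[str],
                 score: Cue, threshold: float = 0.6) -> List[Link]:
    """Greedy one-to-one linking: take the best-scoring pairs above the threshold."""
    candidates = sorted(
        ((score(s, t), i, j)
         for i, s in enumerate(src_units)
         for j, t in enumerate(tgt_units)),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    used_src: Set[int] = set()
    used_tgt: Set[int] = set()
    links: List[Link] = []
    for value, i, j in candidates:
        if value < threshold:
            break  # remaining candidates score even lower
        if i in used_src or j in used_tgt:
            continue
        used_src.add(i)
        used_tgt.add(j)
        links.append((i, j))
    return sorted(links)


def align_hierarchically(src_sents: List[str], tgt_sents: List[str],
                         word_cue: Cue, threshold: float = 0.6
                         ) -> Dict[Link, List[Link]]:
    """Link sentences using length plus word-level evidence, then link words."""
    def sentence_score(src: str, tgt: str) -> float:
        src_words, tgt_words = src.split(), tgt.split()
        if not src_words:
            return 0.0
        # Fraction of source words that have a plausible target counterpart.
        hits = sum(
            1 for w in src_words
            if any(word_cue(w, v) >= threshold for v in tgt_words)
        )
        return 0.5 * length_cue(src, tgt) + 0.5 * hits / len(src_words)

    result: Dict[Link, List[Link]] = {}
    for i, j in greedy_align(src_sents, tgt_sents, sentence_score, threshold):
        result[(i, j)] = greedy_align(
            src_sents[i].split(), tgt_sents[j].split(), word_cue, threshold
        )
    return result


if __name__ == "__main__":
    # Toy lexicon and sentences, invented for this example.
    lexicon = {("das", "the"), ("haus", "house"), ("ist", "is"),
               ("klein", "small"), ("ja", "yes")}
    word_cue = combine([make_lexicon_cue(lexicon), length_cue])
    alignment = align_hierarchically(
        ["Das Haus ist klein", "Ja"], ["Yes", "The house is small"], word_cue
    )
    for sentence_link, word_links in alignment.items():
        print(sentence_link, word_links)
```

In this sketch a new knowledge source would be added by writing another cue function and passing it to combine, without touching the alignment routine. A real system would also need to handle one-to-many links and paragraph- or phrase-level units, which the greedy one-to-one strategy here does not cover.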