Colloquium
Friday, 9 May 2008, building A2 2, Room 1.20
time | programme | speaker |
---|---|---|
11.00 - 12.30 | Exploiting statistical and linguistic knowledge in a text alignment system | Bettina Schrader, Universität Potsdam |
Abstract
Within machine translation, the alignment of corpora has evolved into a
mature research area, aimed mainly at providing training data for
statistical machine translation (SMT). The alignment techniques used for
these purposes fall roughly into two separate classes: sentence alignment
approaches that often combine statistical and linguistic information,
and word alignment models that are dominated by the SMT paradigm.
Alignment approaches that use linguistic knowledge are rare, as are
non-statistical word alignment strategies. Furthermore, parallel corpora
are typically not aligned at all text levels simultaneously. Rather, a
corpus is first sentence-aligned, and in a subsequent step, word
alignments are computed.
In this talk, I will present an alignment platform that does not
distinguish between the two alignment classes. Rather, it has been
designed to simultaneously align at the paragraph, sentence, word, and
phrase level. Furthermore, linguistic as well as statistical information
can be combined. This combination of alignment cues from different
knowledge sources, as well as the combination of the sentence and word
alignment tasks, is made possible by a modular
alignment platform. Its main features are support for different
kinds of linguistic corpus annotation and hierarchical alignment of a
corpus. Alignment cues are not used within a global alignment
model. Rather, different sub-models can be implemented and allowed to
interact.
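
As a rough illustration only, and not the platform presented in the talk, the idea of combining independent cues and aligning hierarchically might be sketched as follows. All names (`length_cue`, `shared_token_cue`, `align_hierarchically`), the averaging of scores, and the greedy best-match linking are assumptions made for this sketch; the shared-token overlap is merely a crude stand-in for a real linguistic cue.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cue interface: a sub-model scores a candidate pair in [0, 1].
Cue = Callable[[str, str], float]


def length_cue(src: str, tgt: str) -> float:
    """Statistical cue: units of similar character length score higher."""
    longer = max(len(src), len(tgt))
    return min(len(src), len(tgt)) / longer if longer else 1.0


def shared_token_cue(src: str, tgt: str) -> float:
    """Stand-in for a linguistic cue: Jaccard overlap of identical tokens
    (names, numbers, punctuation) in the two units."""
    a, b = set(src.lower().split()), set(tgt.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_score(src: str, tgt: str, cues: List[Cue]) -> float:
    """Let the sub-models interact by averaging their scores (an assumption)."""
    return sum(cue(src, tgt) for cue in cues) / len(cues)


def align_units(src: List[str], tgt: List[str],
                cues: List[Cue]) -> List[Tuple[int, int]]:
    """Greedily link each source unit to its best-scoring target unit."""
    if not tgt:
        return []
    links = []
    for i, s in enumerate(src):
        j = max(range(len(tgt)),
                key=lambda k: combined_score(s, tgt[k], cues))
        links.append((i, j))
    return links


def align_hierarchically(
    src_paras: List[List[str]], tgt_paras: List[List[str]], cues: List[Cue]
) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
    """Align paragraphs first, then sentences inside each linked pair."""
    para_links = align_units([" ".join(p) for p in src_paras],
                             [" ".join(p) for p in tgt_paras], cues)
    return {(pi, pj): align_units(src_paras[pi], tgt_paras[pj], cues)
            for pi, pj in para_links}
```

Here each paragraph is a list of tokenised sentences. Calling `align_hierarchically(src_paras, tgt_paras, [length_cue, shared_token_cue])` returns a mapping from linked paragraph index pairs to the sentence links found inside them. A real system would replace the greedy linking with a proper search and add a word level below the sentence level.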