Welcome to the website of SubCo, a machine and human Subtitle Corpus

Description

SubCo is a corpus comprising human translations (HT) and machine translations (MT) of subtitles, as well as post-edited versions of the MT output. The MT output is annotated with errors. All three versions (HT, MT, and post-edited MT) were evaluated by humans.

German translations were produced by human translators as well as by the machine translation (MT) system SUMAT, which was developed and trained specifically for the translation of subtitles.

Both human and machine translations were annotated with error labels and assigned evaluation scores at the sentence level.

The corpus was collected as part of a course on subtitling targeted at students enrolled in the Translation Studies programme at Saarland University.

Error annotation

The error annotation schema consists of four dimensions: 1) content, 2) language, 3) format, and 4) semiotics.

The first two dimensions correspond to classical error types described in the literature:

  • content: omission, addition, content shift, untranslated, terminology
  • language: syntax, morphology, function words, orthography

The last two dimensions are our contribution, aimed at describing features specific to subtitling:

  • format: punctuation, font-style, capitalisation, number of characters per line, number of lines per subtitle, number of seconds per subtitle, line breaks, positioning of subtitle, colour of subtitle, audio synchronisation, video synchronisation
  • semiotics: cases where there is a contradiction between other channels contributing to the meaning of the text and the translation
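
The four dimensions and their error types can be pictured as a small lookup table. The following is only an illustrative sketch: the names ERROR_SCHEMA and is_valid_label are hypothetical and say nothing about how the corpus files themselves encode the annotations.

```python
# Illustrative sketch only: ERROR_SCHEMA and is_valid_label are hypothetical
# names, not part of the SubCo release or its file format.
ERROR_SCHEMA = {
    "content": (
        "omission", "addition", "content shift", "untranslated", "terminology",
    ),
    "language": (
        "syntax", "morphology", "function words", "orthography",
    ),
    "format": (
        "punctuation", "font-style", "capitalisation",
        "number of characters per line", "number of lines per subtitle",
        "number of seconds per subtitle", "line breaks",
        "positioning of subtitle", "colour of subtitle",
        "audio synchronisation", "video synchronisation",
    ),
    # No subtypes are listed for semiotics; it covers contradictions between
    # the translation and the other channels contributing to the meaning.
    "semiotics": (),
}


def is_valid_label(dimension, error_type=None):
    """Check a (dimension, error type) pair against the schema above."""
    if dimension not in ERROR_SCHEMA:
        return False
    if error_type is None:
        # A bare dimension label, e.g. "semiotics", which has no subtypes.
        return True
    return error_type in ERROR_SCHEMA[dimension]


print(is_valid_label("format", "line breaks"))  # True
print(is_valid_label("language", "omission"))   # False
```

Keeping the subtypes grouped by dimension makes it easy to reject a label that pairs an error type with the wrong dimension.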

Assessment annotation

The quality of a translation is rated on a four-level acceptability scale:

  • perfect: no error at all, no modifications needed
  • acceptable: some minor errors but no major error; it can be used without modifications
  • revisable: one major error or several minor ones, requiring a cost-effective revision
  • unacceptable: fixing the translation would take so much revision that it is not worth the effort; re-translation is required

The assessment was carried out for each dimension of analysis: 1) content, 2) language, 3) format, and 4) semiotics.
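
A per-dimension assessment of one translation could be modelled roughly as follows. This is a sketch under stated assumptions: the numeric values assigned to the four levels, and the names Acceptability, DIMENSIONS and needs_revision, are hypothetical and are not taken from the corpus.

```python
from enum import IntEnum


class Acceptability(IntEnum):
    # The numeric ordering is an assumption made for this sketch; the
    # corpus description only names the four levels.
    UNACCEPTABLE = 0
    REVISABLE = 1
    ACCEPTABLE = 2
    PERFECT = 3


DIMENSIONS = ("content", "language", "format", "semiotics")


def needs_revision(assessment):
    """Return the dimensions rated below 'acceptable' for one translation."""
    missing = [d for d in DIMENSIONS if d not in assessment]
    if missing:
        raise ValueError(f"missing assessment for: {', '.join(missing)}")
    return [d for d in DIMENSIONS if assessment[d] < Acceptability.ACCEPTABLE]


# Hypothetical assessment of a single translated subtitle.
example = {
    "content": Acceptability.PERFECT,
    "language": Acceptability.REVISABLE,
    "format": Acceptability.ACCEPTABLE,
    "semiotics": Acceptability.UNACCEPTABLE,
}
print(needs_revision(example))  # ['language', 'semiotics']
```

An ordered enumeration is used here so that levels can be compared directly; any other encoding that preserves the order of the four levels would serve equally well.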

Access to the corpus

The corpus is available for download on the Download page.