Resources from SFB 1102

DeScript (Describing Script Structure)

DeScript is a corpus of event sequence descriptions (ESDs) for different scenarios, crowdsourced via Amazon Mechanical Turk. It covers 40 scenarios with approximately 100 ESDs each. The corpus also contains partial alignments of event descriptions that are semantically similar with respect to the given scenario.

Link to the resource

Persistent identifier http://hdl.handle.net/21.11119/0000-0000-5DCF-0

Reference: Wanzare, L., Zarcone, A., Thater, S. & Pinkal, M. (2016). DeScript: A Crowdsourced Database for the Acquisition of High-quality Script Knowledge. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Contact person: Lilian Wanzare

InScript (Narrative texts annotated with script information)

The InScript corpus contains a total of 1,000 narrative texts crowdsourced via Amazon Mechanical Turk. The texts cover 10 different scenarios describing everyday situations such as taking a bath or baking a cake. Each text is annotated with script information in the form of scenario-specific event and participant labels, as well as with coreference chains linking different mentions of the same entity within the document.

Link to the resource

Persistent identifier: http://hdl.handle.net/21.11119/0000-0000-5DD4-9

Reference: Modi, A., Anikina, T., Ostermann, S. & Pinkal, M. (2016). InScript: Narrative texts annotated with script information. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Contact person: Simon Ostermann

Modeling Semantic Expectations

This resource contains the discourse referent (DR) predictions made by human subjects on the InScript corpus. These were collected using Amazon Mechanical Turk. For details, please refer to the paper below.

Link to the resource

Persistent identifier: http://hdl.handle.net/21.11119/0000-0000-5DD9-4

Reference: Modi, A., Titov, I., Demberg, V., Sayeed, A. & Pinkal, M. (2016). Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction. Transactions of the Association for Computational Linguistics (TACL).

MCScript

MCScript is a dataset for the task of machine comprehension focusing on commonsense knowledge. Questions were collected based on script scenarios rather than individual texts, which resulted in question–answer pairs that explicitly involve commonsense knowledge. The corpus comprises 13,939 questions on 2,119 narrative texts, each annotated with one of 110 different everyday scenarios. Each question carries a crowdsourced label indicating whether it can be answered from the text alone or whether commonsense knowledge is needed to find an answer.

Link to resource

Persistent Identifier: http://hdl.handle.net/21.11119/0000-0001-D3A7-4

Reference: Ostermann, S., Modi, A., Roth, M., Thater, S. & Pinkal, M. (2018). MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge. Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

MCScript-2.0

MCScript 2.0 is a machine comprehension corpus for the end-to-end evaluation of script knowledge. It contains approximately 20,000 questions on approximately 3,500 texts, crowdsourced with a new collection process that yields challenging questions. Half of the questions cannot be answered from the reading texts alone but require the use of commonsense and, in particular, script knowledge. The task is not challenging for humans, but existing machine comprehension models fail to perform well on the data, even when they make use of a commonsense knowledge base.

Link to resource

Persistent Identifier: http://hdl.handle.net/21.11119/0000-000A-3606-3

Reference: Ostermann, S., Roth, M. & Pinkal, M. (2019). MCScript2.0: A Machine Comprehension Corpus Focused on Script Events and Participants. Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), Minneapolis, USA. doi:10.18653/v1/S19-1012.

Event Surprisal Estimates Based on DeScript

The data set contains unigram and bigram event surprisal estimates based on the data of 24 scenarios taken from the DeScript corpus of script knowledge (Wanzare et al. 2016). The event sequence descriptions (ESDs) of DeScript were semi-automatically transformed into sequences of schematic event labels (SELs). These SELs are verb-noun pairs that subsume descriptions referring to the same event and are uniquely assigned per scenario. Based on these SELs, overall event probabilities and transition probabilities from one event to the next were computed with the SRILM language modeling toolkit (Stolcke 2002). Probabilities were then converted to Shannon information (Shannon 1948), also known as surprisal (Hale 2001).
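The probability-to-surprisal transform described above can be sketched as follows. This is a simplified maximum-likelihood illustration, not the released pipeline: the actual estimates were produced with SRILM, which additionally applies smoothing, and the SEL sequences shown here are hypothetical examples.

```python
import math
from collections import Counter

def bigram_surprisal(sequences):
    """Estimate bigram event surprisal (in bits) from SEL sequences.

    Surprisal of an event given its predecessor is -log2 of the
    maximum-likelihood transition probability (unsmoothed sketch).
    """
    bigrams = Counter()
    contexts = Counter()
    for seq in sequences:
        padded = ["<s>"] + seq  # sentence-start marker for the first event
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return {
        (prev, cur): -math.log2(count / contexts[prev])
        for (prev, cur), count in bigrams.items()
    }

# Hypothetical SEL sequences for a "making coffee" scenario:
esds = [
    ["get_cup", "pour_coffee", "drink_coffee"],
    ["get_cup", "add_sugar", "drink_coffee"],
]
s = bigram_surprisal(esds)
# "get_cup" always opens a sequence, so its surprisal is 0 bits;
# each continuation of "get_cup" has probability 0.5, i.e. 1 bit.
```

A fully predictable transition thus carries 0 bits of surprisal, while each halving of the transition probability adds one bit.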

Preliminary location: /event-surprisal-descript.zip (password protected)

N-gram Language Models based on DeWaC German web corpus

The resource is a set of language models at the syllable and phone level. The context length (n-gram length) of the models ranges from 1 to 4 at the syllable level and from 1 to 6 at the phone level. For each n-gram length, two model versions are provided: a forward version, which gives the probability of a unit given the preceding context, and a backward version, which gives the probability of a unit given the following context.

Each forward and backward model has a version that includes syllable boundary information and a version without syllable boundaries.

The models were trained on the DeWaC German web corpus (Baroni and Kilgarriff 2006) using the SRILM language modeling toolkit (Stolcke 2002). Syllabification was performed using the HMM syllable tagger of Schmid, Möbius and Weidenkaff (2007).
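The forward/backward distinction can be illustrated with a toy bigram model. This is an unsmoothed maximum-likelihood sketch for illustration only; the released models were trained with SRILM, cover higher n-gram orders, and the phone sequences below are made up.

```python
from collections import Counter

def bigram_model(sequences, direction="forward"):
    """Maximum-likelihood bigram model over syllable or phone units.

    direction="forward":  P(unit | preceding unit)
    direction="backward": P(unit | following unit)
    """
    pair_counts, ctx_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            # In the backward model the *following* unit is the context.
            ctx, unit = (prev, cur) if direction == "forward" else (cur, prev)
            pair_counts[(ctx, unit)] += 1
            ctx_counts[ctx] += 1
    return lambda unit, ctx: pair_counts[(ctx, unit)] / ctx_counts[ctx]

# Toy phone-level sequences (hypothetical):
seqs = [["h", "a", "l", "o"], ["h", "a", "n", "t"]]
fwd = bigram_model(seqs, "forward")
bwd = bigram_model(seqs, "backward")
fwd("a", "h")  # P(a | preceding h) = 1.0
bwd("a", "l")  # P(a | following l) = 1.0
```

The same counting scheme extends to longer contexts; only the context definition (preceding vs. following units) distinguishes the two model versions.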

Link to resource

Reference and Permanent Link: Phonetics Group at Saarland University, 2021. N-gram Language Models based on DeWaC German web corpus. Persistent Identifier http://hdl.handle.net/21.11119/0000-000B-4141-2.

Word level N-gram Language Models based on DeWaC German web corpus

The resource is a set of language models at the word level; for syllable- and phone-level models, see the resource above. The context length (n-gram length) of the models ranges from 1 to 6. For each n-gram length, two model versions are provided: a forward version, which gives the probability of a unit given the preceding context, and a backward version, which gives the probability of a unit given the following context.

The models were trained on the DeWaC German web corpus (Baroni and Kilgarriff 2006) using the SRILM language modeling toolkit (Stolcke 2002). Syllabification was performed using the HMM syllable tagger of Schmid, Möbius and Weidenkaff (2007).

Link to resource

Reference and Permanent Link: Phonetics Group at Saarland University, 2022. Word level N-gram Language Models based on DeWaC German web corpus. Persistent Identifier http://hdl.handle.net/21.11119/0000-000E-1712-4.
