Annotation of Cohesive Devices

Funded by the German Federal Ministry of Education and Research

in EO-GECCO2013, GO-GECCO2013, EO-SPOKEN2013, GO-SPOKEN2013, EO/GO-GECCOCOH2014

Lexical chains

Representation

There are the following Structural Attributes:

lexicalcohesion
lexicalcohesion_id            [A]
lexicalcohesion_distance      [A]
lexicalcohesion_lexical_type  [A]
lexicalcohesion_distance_type [A]
lexicalcohesion_lexical_chain [A]

lexicalcohesion_id: is an ID of a markable

lexicalcohesion_distance: is intended to be the distance between anaphors (it was not annotated in most cases and had to be calculated)

lexicalcohesion_lexical_type: indicates the semantic relation an item has

lexicalcohesion_distance_type: is not annotated, is always immediate

lexicalcohesion_lexical_chain: has the value ‘set_ZAHL’ (where ZAHL stands for a number) and indicates the ID of the chain

Elements in the chains

antecedents have the following features: lexicalcohesion_id="ZAHL" lexical_distance="ZAHL" lexicalcohesion_lexical_type="first" lexical_chain!="empty"

All non-cohesive items (not belonging to a chain) have the following features: lexicalcohesion_id="ZAHL" lexical_distance="ZAHL" lexicalcohesion_lexical_type="first|First|none" lexical_chain="none|empty"
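
The two feature patterns above can be read as a small decision rule. The following Python sketch illustrates it under stated assumptions: the function name classify_item and the three labels it returns are hypothetical (they are not part of the annotation scheme), and the attribute values are assumed to be available as plain strings.

```python
# Values that mark an item as not belonging to any chain (taken from the patterns above).
NO_CHAIN = {"none", "empty"}


def classify_item(lexical_type: str, lexical_chain: str) -> str:
    """Classify one annotated item by its lexical_type and lexical_chain values.

    Hypothetical helper: the returned labels are illustrative only.
    """
    if lexical_chain in NO_CHAIN:
        # The item is annotated but does not belong to any chain.
        return "non-cohesive"
    if lexical_type == "first":
        # First element of a chain, i.e. the antecedent.
        return "antecedent"
    # Any other item that carries a chain ID.
    return "chain-member"
```

The chain value is checked first because non-cohesive items may also carry the type "first" or "First".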

Semantic relations

repetition: same word is repeated
synonym: same meaning
antonym: opposite meaning
hypernym: more generic word
hyponym: more specific word
co-hyponym: word with same degree of specificity
meronym: a word that names a part of a larger whole
co-meronym: a word that names another part of a larger whole
holonym: the larger whole mentioned before
instance: is-relation
type: is-relation
co-instance: word which has the same is-relation

Querying lexical cohesion

every token which belongs to a member of a lexical chain (unsorted, unclassified):

[lexicalcohesion];

every member of every chain (one and multiword):

/region[lexicalcohesion];

STATISTICS:

to get the number of elements:

size Last;

At the moment, some elements are annotated but do not belong to any chain. To filter them out, please use the following command:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"]+</lexicalcohesion>;

If you want to query the first element in the chain (for instance to count the number of chains):

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"&_.lexicalcohesion_lexical_type="first|none"]+</lexicalcohesion>;

size Last;

To query all elements of chains except for the first elements, use:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"&_.lexicalcohesion_lexical_type!="first|none"]+</lexicalcohesion>;

size Last;

To export a table with the number of first elements (chains) per register:

tabulate Last match text_register > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/first-register-SUBCORPUS.csv";

To export a table with the number of first elements (chains) per text:

tabulate Last match text_id > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/first-text-SUBCORPUS.csv";

To assess the distributions of semantic relations:

group Last match lexicalcohesion_lexical_type;

To export a table (with semantic relations) sorted according to registers:

tabulate Last match text_register, match lexicalcohesion_lexical_type > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/relations-register-SUBCORPUS.csv";

To export a table (with semantic relations) sorted according to texts:

tabulate Last match text_id, match lexicalcohesion_lexical_type > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/relations-text-SUBCORPUS.csv";
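
An exported table of this kind can be summarised outside CQP. The Python sketch below counts how often each semantic relation occurs per register. It assumes tab-separated rows with the register in the first column and the relation in the second (the plain tabulate column order used above); the format after the local sorttabulate step may differ. The function name count_relations and the sample rows are invented for illustration.

```python
from collections import Counter


def count_relations(lines):
    """Count (register, relation) pairs in tab-separated rows.

    Assumes column 1 = text_register, column 2 = lexicalcohesion_lexical_type.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip empty or malformed rows
        counts[(fields[0], fields[1])] += 1
    return counts


# Invented sample rows, not real corpus data.
sample = [
    "ESSAY\trepetition",
    "ESSAY\tsynonym",
    "SPEECH\trepetition",
    "ESSAY\trepetition",
]
table = count_relations(sample)
for (register, relation), n in sorted(table.items()):
    print(register, relation, n, sep="\t")
```

To use it on a real export, pass an open file object instead of the sample list.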

Average chain sizes cannot be obtained directly and require some post-processing. The size of each chain per text, however, can already be extracted:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"]+</lexicalcohesion>;

tabulate Last match text_id, match lexicalcohesion_lexical_chain > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/chains-text-SUBCORPUS.csv";
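
The post-processing for average chain sizes could look like the following Python sketch. It is a minimal illustration, not the project's actual procedure: it assumes one tab-separated row per chain element, with the text ID in the first column and the chain ID in the second, and the function names chain_sizes and average_chain_size are hypothetical. The sample rows (including the text ID EO_ESSAY_008) are invented.

```python
from collections import Counter, defaultdict


def chain_sizes(lines):
    """Count the elements of each chain, keyed by (text_id, chain_id)."""
    sizes = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip empty or malformed rows
        sizes[(fields[0], fields[1])] += 1
    return sizes


def average_chain_size(sizes):
    """Return the mean chain size for every text."""
    per_text = defaultdict(list)
    for (text_id, _chain_id), n in sizes.items():
        per_text[text_id].append(n)
    return {text_id: sum(v) / len(v) for text_id, v in per_text.items()}


# Invented sample rows, not real corpus data.
sample = (
    ["EO_ESSAY_007\tset_3"] * 3
    + ["EO_ESSAY_007\tset_5"]
    + ["EO_ESSAY_008\tset_1"] * 2
)
sizes = chain_sizes(sample)
averages = average_chain_size(sizes)
```

Chain IDs such as set_3 are only unique within a text, which is why the counts are keyed by the pair of text ID and chain ID.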

To assess certain chains, you could use the information from the extraction on chain size and analyse, for instance, the longest chains. Here is an example:

In the text EO_ESSAY_007, we have a chain which consists of more than 30 elements, so we have a closer look at it:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain="set_3"&_.text_id="EO_ESSAY_007"]+</lexicalcohesion>;

Coreference chains

Representation

There are the following Structural Attributes:

mention
mention_type         [A]
mention_func         [A]
mention_chain_id     [A]
mention_cohesive     [A]
mention_anaphor      [A]
mention_problematic  [A]

antecedents have the following attribute-values:

chain_id="set_ZAHL" problematic="yes|no" cohesive="anaphoric" func="modifier" type="dem" antecedent="none"

anaphors have the following attribute-values:

anaphors without antecedents (e.g. comparative) have the following attribute-values:

referring expressions have the following subtypes (saved as values of mention_type):

pers
dem
comp

And the subtypes or functions are (saved as values of mention_func):

pers in EO

personal
possessive
it-endophoric
it-exophoric

pers in GO

personal
possessive
es/it-endophoric
es/it-exophoric

dem

head
modifier
local
temporal
pronadv
correlat-pronadv

comp

particular
general

All referring expressions have annotations on the features of the antecedent they refer to (saved as values of mention_antecedent):

np
fact-s
pronominal
other
event-vp
is-a

Querying coreference

every token which belongs to a referring expression (unsorted, unclassified):

[mention];

every member of every chain (one and multiword):

/region[mention];

STATISTICS:

to get the number of elements:

size Last;

to get the types of referring expressions:

group Last match mention_type;

to get the functions of referring expressions:

group Last match mention_func;

NOTE!

When querying demonstrative referring expressions for German, please exclude one function:

mention_func="correlat-pronadv"

For example, use the following query:

[_.mention_func!="correlat-pronadv"];

to get the list of referents (antecedents):

/region[mention]::match.mention_antecedent="none";

to get the information on the number of chains per text:

tabulate Last match text_id > "|sorttabulate>/data/projects/steiner/gecco/results/reference-chains.csv";

to get the list of anaphors:

/region[mention]::match.mention_antecedent!="none";

to get the types of antecedents of the queried anaphors:

group Last match mention_antecedent;

in EO-GECCO2013, ETRANS-GECCO, GO-GECCO2013, GTRANS-GECCO, EO-SPOKEN2013 and GO-SPOKEN2013

Assessing chunk information

run a query to extract all referring expressions:

[_.mention_antecedent!="none"] expand to mention;

count grammatical function of these:

group Last match NP_gf;

group Last match NP_gf1;

You need to repeat the grouping once for each NP level.

Another possibility:

tabulate Last match NP_gf;

tabulate Last match NP_gf, match NP_gf1;

etc.

For information on grammatical functions, see the information on chunks for [[howtos:extractions:eo-gecco#grammatical_functions|EO]] and [[howtos:extractions:go-gecco#grammatical_functions|GO]].

Substitution

Representation

nominal substitution (expressed with //one, einen//, etc. )
verbal substitution (expressed with //do so, tun//, etc.)
clausal substitution (expressed with //so//)

saved as the following Structural Attributes:

substitution
substitution_type

Querying

any token (unsorted, unclassified):

[substitution];

or for the whole structure:

/region[substitution];

or:

<substitution>[]+</substitution>;

STATISTICS:

size Last;

group Last match substitution;

group Last match substitution_type;

with Register information:

group Last match text_register by match substitution;

group Last match text_register by match substitution_type;

substitution types (concrete Querying):

[_.substitution_type="nominal"];

[_.substitution_type="verbal"];

[_.substitution_type="clausal"];

STATISTICS

size Last;

EXAMPLE of querying all types of substitution in EO-GECCO:

EO-GECCO;

<substitution>[]+</substitution>;

group Last match text_register by match substitution;

tabulate Last match substitution_type, match text_register > "|sorttabulate > substitution.csv";

Conjunctions

Representation

syntactic types:

subjuncts
connectors
adverbials

semantic types:

additive
adversative
temporal
causal
modal

saved as the following Structural Attributes:

conj
conj_type
conj_func
conj_problematic

And the types are (saved as values of conj_type):

subjunct
connect
adverbial

And the functions or semantic types are (saved as conj_func):

additive
adversative
causal
temporal
modal

Querying

Any token

[conj];

all conjunctions (unsorted, unclassified):

<conj>[]+[]*</conj>;

/region[conj];

activate the borders of syntactic types of conjunctions:

show +conj_type;

activate the borders of semantic types of conjunctions:

show +conj_func;

concrete Querying (one element only):

<conj>[_.conj_type="connect"]</conj>;

<conj>[_.conj_type="subjunct"]</conj>;

<conj>[_.conj_type="adverbial"]</conj>;

concrete Querying (one and multi-word element):

<conj>[_.conj_type="connect"]+[_.conj_type="connect"]*</conj>;

<conj>[_.conj_type="subjunct"]+[_.conj_type="subjunct"]*</conj>;

<conj>[_.conj_type="adverbial"]+[_.conj_type="adverbial"]*</conj>;

same for semantic types:

<conj>[_.conj_func="additive"]+[_.conj_func="additive"]*</conj>;

<conj>[_.conj_func="adversative"]+[_.conj_func="adversative"]*</conj>;

<conj>[_.conj_func="causal"]+[_.conj_func="causal"]*</conj>;

<conj>[_.conj_func="temporal"]+[_.conj_func="temporal"]*</conj>;

<conj>[_.conj_func="modal"]+[_.conj_func="modal"]*</conj>;

STATISTICS:

group Last match conj_type;

group Last match conj_func;

tabulate Last match conj_type, match conj_func > "|sorttabulate> /data/projects/steiner/gecco/results/conj.csv";

EXAMPLE - For statistics of different semantic types in different registers in EO

EO-GECCO;

<conj>[]+[]*</conj>;

group Last match conj_func by match text_register;

tabulate Last match conj_func, match text_register > "|sorttabulate > conj_func-eo.csv";

General Nouns

Representation

nouns (general)

saved as the following Structural Attributes:

noun
noun_type

And the types are (saved as values of noun_type):

general

Querying

Any token

[noun];

/region[noun];

activate the borders of types:

show +noun_type;

STATISTICS:

size Last;

group Last match noun_type;

For statistics of different semantic types in different registers

group Last match noun_type by match text_register;

tabulate Last match lemma, match noun_type, match text_register > "|sorttabulate > generalnoun-CORPUS.csv";

Ellipsis

Representation

ellipsis

saved as the following Structural Attributes:

ellipsis
ellipsis_problematic [A]
ellipsis_func        [A]
ellipsis_type        [A]
ellipsis_antecedent  [A]
ellipsis_item        [A]
ellipsis_id          [A]

The items are (saved as values of ellipsis_item):

unspecified
antecedent
clause-internal antecedent
antecedent_ambig

The antecedents can be (saved as values of ellipsis_antecedent) either empty or some markables

The possible types are saved as ellipsis_type and include:

texttype
split
nominal
nonclausal
clausal
verbal
other
mixed_nominal+verbal/clausal
yes_no

The possible functions (saved as ellipsis_func):

non-cohesive
cohesive
clause-internal

If there were problems in annotation (e.g. disambiguation, unclear context), the ellipses were annotated as problematic (ellipsis_problematic) with the value ‘yes’ or ‘no’.

NOTE: we were able to annotate triggers only, not empty items. The annotated structure therefore indicates that an ellipsis follows.

antecedents are recognisable as follows:

Querying

All ellipsis

/region[ellipsis];

all cohesive ellipsis (anaphors of elliptical chains):

[_.ellipsis_func="cohesive"&_.ellipsis_item="unspecified"] expand to ellipsis;

all antecedents of cohesive elliptical chains:

[_.ellipsis_func="cohesive"&_.ellipsis_item="antecedent|antecedent_ambig"] expand to ellipsis;

activate the borders of types:

show +ellipsis_type;

activate other structures:

show +ellipsis_ATTRIBUTE;

It is possible to restrict the search to certain registers, for example, in this case for essays and speeches.

/region[ellipsis]::match.text_register="ESSAY|SPEECH";

It is also possible to exclude certain types from the query, for example, extract non-problematic cases only:

/region[ellipsis]::match.ellipsis_problematic="no";

To restrict your search to certain types of ellipsis, use the same type of query:

/region[ellipsis]::match.ellipsis_ATTRIBUTE="VALUE";

STATISTICS:

size Last;

group Last match ellipsis_type;

For statistics of different types in different registers

group Last match ellipsis_type by match text_register;

tabulate Last match ellipsis_type, match text_register > "|sorttabulate > ellipsis-CORPUS.csv";

Collect all statistics on the extracted ellipsis

tabulate Last match ellipsis_item, match ellipsis_type, match ellipsis_func, match text_register > "|sorttabulate > ellipsis-CORPUS.csv";
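
A table with all four columns can be reduced to simple proportions afterwards. The Python sketch below computes, for each register, the share of ellipses whose function is "cohesive". It assumes tab-separated rows in the column order item, type, function, register (the order of the tabulate command above); the output of the local sorttabulate step may be formatted differently. The function name cohesive_share and the sample rows are invented for illustration.

```python
from collections import defaultdict


def cohesive_share(lines):
    """Return the proportion of cohesive ellipses per register.

    Assumes columns: ellipsis_item, ellipsis_type, ellipsis_func, text_register.
    """
    totals = defaultdict(int)
    cohesive = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip empty or malformed rows
        _item, _type, func, register = fields[:4]
        totals[register] += 1
        if func == "cohesive":
            cohesive[register] += 1
    return {register: cohesive[register] / totals[register] for register in totals}


# Invented sample rows, not real corpus data.
sample = [
    "unspecified\tnominal\tcohesive\tESSAY",
    "antecedent\tnominal\tcohesive\tESSAY",
    "unspecified\tverbal\tnon-cohesive\tESSAY",
    "unspecified\tclausal\tclause-internal\tESSAY",
    "unspecified\tyes_no\tcohesive\tSPEECH",
]
shares = cohesive_share(sample)
```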

Problematic Cases

conj, mention, reference and ellipsis have annotations of problematic cases. These are cases which are either ambiguous or were problematic for human annotators for some reason.

To extract them, use:

/region[conj]::match.conj_problematic="yes";

/region[mention]::match.mention_problematic="yes";

/region[ellipsis]::match.ellipsis_problematic="yes";

Imprint

Corpus by AuthorName and ContributorName is licensed under a Creative Commons Attribution 4.0 International License.