in EO-GECCO2013, GO-GECCO2013, EO-SPOKEN2013, GO-SPOKEN2013, EO/GO-GECCOCOH2014

Lexical chains

Representation

There are the following Structural Attributes:

lexicalcohesion
lexicalcohesion_id            [A]
lexicalcohesion_distance      [A]
lexicalcohesion_lexical_type  [A]
lexicalcohesion_distance_type [A]
lexicalcohesion_lexical_chain [A]

lexicalcohesion_id: the ID of a markable

lexicalcohesion_distance: intended to hold the distance between anaphors (in most cases it was not annotated and had to be calculated)

lexicalcohesion_lexical_type: indicates the semantic relation an item has

lexicalcohesion_distance_type: not annotated; the value is always immediate

lexicalcohesion_lexical_chain: has the value ‘set_ZAHL’, where ZAHL stands for a number, and indicates the ID of the chain

Elements in the chains

antecedents have the following features: lexicalcohesion_id="ZAHL" lexical_distance="ZAHL" lexicalcohesion_lexical_type="first" lexical_chain!="empty"

anaphors (chain members which are not first) have the following features: lexicalcohesion_id="ZAHL" lexical_distance="ZAHL" lexicalcohesion_lexical_type="repetition|synonym|antonym|hypernym|hyponym|co-hyponym|meronym|co-meronym|holonym|instance|type|co-instance" lexical_chain!="none|empty"

All non-cohesive items (not belonging to a chain) have the following features: lexicalcohesion_id="ZAHL" lexical_distance="ZAHL" lexicalcohesion_lexical_type="first|First|none" lexical_chain="none|empty"
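
The three feature patterns above amount to a small decision rule. The following is a minimal sketch in Python, assuming each markable is available as a dict keyed by the attribute names used in the patterns above; the function name `classify_markable` and the dict layout are hypothetical and not part of the corpus tooling.

```python
# Hypothetical helper: classify a lexical-cohesion markable by the feature
# patterns listed above. The dict layout is an assumption for illustration.
RELATIONS = {
    "repetition", "synonym", "antonym", "hypernym", "hyponym", "co-hyponym",
    "meronym", "co-meronym", "holonym", "instance", "type", "co-instance",
}
NO_CHAIN = {"none", "empty", ""}


def classify_markable(features):
    """Return 'antecedent', 'anaphor', 'non-cohesive' or 'unknown'."""
    lexical_type = features.get("lexicalcohesion_lexical_type", "")
    chain = features.get("lexical_chain", "")
    in_chain = chain not in NO_CHAIN
    if in_chain and lexical_type == "first":
        return "antecedent"
    if in_chain and lexical_type in RELATIONS:
        return "anaphor"
    if not in_chain and lexical_type in {"first", "First", "none"}:
        return "non-cohesive"
    # Any other combination is not covered by the patterns above.
    return "unknown"
```

Combinations that none of the three patterns covers (for example a semantic relation on an item without a chain) are reported as 'unknown' rather than guessed.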

Semantic relations

  • repetition: same word is repeated
  • synonym: same meaning
  • antonym: opposite meaning
  • hypernym: more generic word
  • hyponym: more specific word
  • co-hyponym: word with same degree of specificity
  • meronym: a word that names a part of a larger whole
  • co-meronym: a word that names another part of a larger whole
  • holonym: the larger whole mentioned before
  • instance: is-relation
  • type: is-relation
  • co-instance: word which has the same is-relation

Querying lexical cohesion

  • every token which belongs to a member of a lexical chain (unsorted, unclassified):
[lexicalcohesion];
  • every member of every chain (one and multiword):
/region[lexicalcohesion];

STATISTICS:

to get the number of elements:

size Last;

At the moment, some annotated elements do not belong to any chain. To exclude them, use the following query:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"]+</lexicalcohesion>;

If you want to query the first element in the chain (for instance to count the number of chains):

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"&_.lexicalcohesion_lexical_type="first|none"]+</lexicalcohesion>;
size Last;

To query all elements of chains except for the first elements, use:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"&_.lexicalcohesion_lexical_type!="first|none"]+</lexicalcohesion>;
size Last;

To export a table with the number of first elements (chains) per register:

tabulate Last match text_register > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/first-register-SUBCORPUS.csv";

To export a table with the number of first elements (chains) per text:

tabulate Last match text_id > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/first-text-SUBCORPUS.csv";

To assess the distributions of semantic relations:

group Last match lexicalcohesion_lexical_type;

To export a table (with semantic relations) sorted according to registers:

tabulate Last match text_register, match lexicalcohesion_lexical_type > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/relations-register-SUBCORPUS.csv";

To export a table (with semantic relations) sorted according to texts:

tabulate Last match text_id, match lexicalcohesion_lexical_type > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/relations-text-SUBCORPUS.csv";
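
If the table is needed as counts per combination, the rows can also be tallied outside CQP. The following is a minimal sketch in Python, assuming the raw tab-separated tabulate output (one row per match, columns in the order given in the command), that is, the output before it is passed through sorttabulate, whose output format is not described here. The function name `count_relations` and the sample rows are made up for illustration and are not corpus results.

```python
from collections import Counter


def count_relations(lines):
    """Count (first column, relation) pairs in tab-separated tabulate rows."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        group, relation = line.split("\t")[:2]
        counts[(group, relation)] += 1
    return counts


# Made-up rows for illustration only; not corpus results.
sample = [
    "ESSAY\trepetition\n",
    "ESSAY\tsynonym\n",
    "ESSAY\trepetition\n",
    "SPEECH\thyponym\n",
]
relation_counts = count_relations(sample)
for (group, relation), n in sorted(relation_counts.items()):
    print(group, relation, n, sep="\t")
```

The first column can be either the register or the text ID, so the same function works for both exports above.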

To obtain average chain sizes, some post-processing steps are needed. But we can already extract information on chain size per text:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"]+</lexicalcohesion>;
tabulate Last match text_id, match lexicalcohesion_lexical_chain > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/chains-text-SUBCORPUS.csv";
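
The post-processing for average chain sizes could look roughly as follows. This is a sketch in Python, not the project's own script: it assumes one (text_id, chain ID) row per chain member, as in the raw tabulate output before sorttabulate, and it assumes that chain IDs such as set_3 are only unique within a text, so chains are keyed by text and chain ID together. The function names and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def chain_sizes(rows):
    """Map (text_id, chain_id) to the number of chain members."""
    return Counter((text_id, chain_id) for text_id, chain_id in rows)


def average_chain_size_per_text(sizes):
    """Average number of members per chain, for each text."""
    per_text = defaultdict(list)
    for (text_id, _chain_id), size in sizes.items():
        per_text[text_id].append(size)
    return {text_id: sum(v) / len(v) for text_id, v in per_text.items()}


# Made-up rows (text_id, chain_id), one row per chain member.
rows = [
    ("EO_ESSAY_001", "set_1"), ("EO_ESSAY_001", "set_1"),
    ("EO_ESSAY_001", "set_1"), ("EO_ESSAY_001", "set_2"),
    ("EO_ESSAY_002", "set_1"), ("EO_ESSAY_002", "set_1"),
]
sizes = chain_sizes(rows)
averages = average_chain_size_per_text(sizes)
```

The `sizes` mapping can also be sorted by value to find the longest chains in each text.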

To assess certain chains, you could use the information from the extraction on chain size and analyse, for instance, the longest chains. Here is an example:

In the text EO_ESSAY_007, we have a chain which consists of more than 30 elements, so we have a closer look at it:

<lexicalcohesion>[_.lexicalcohesion_lexical_chain="set_3"&_.text_id="EO_ESSAY_007"]+</lexicalcohesion>;

Coreference chains

Representation

There are the following Structural Attributes:

mention
mention_type         [A]
mention_func         [A]
mention_chain_id     [A]
mention_cohesive     [A]
mention_anaphor      [A]
mention_problematic  [A]

antecedents have the following attribute-values:

chain_id="set_ZAHL" problematic="yes|no" cohesive="anaphoric" func="modifier" type="dem" antecedent="none"

anaphors have the following attribute-values:

chain_id="set_ZAHL" problematic="yes|no" cohesive="anaphoric" func="es/it-endophoric/exophoric|modifier|person/possessive-endophoric|article|head|general|particular|local|temporal|pronadv" type="pers|dem|comp" antecedent="np|pronominal|fact-s|other|event-vp|is-a"

anaphors without antecedents (e.g. comparative) have the following attribute-values:

chain_id="empty" problematic="yes|no" cohesive="none" func="es/it-endophoric/exophoric|modifier|person/possessive-endophoric|article|head|general|particular|local|temporal|pronadv" type="pers|dem|comp"

referring expressions have the following subtypes (saved as values of mention_type):

  • pers
  • dem
  • comp

And the subtypes or functions are (saved as values of mention_func):

pers in EO

  • personal
  • possessive
  • it-endophoric
  • it-exophoric

pers in GO

  • personal
  • possessive
  • es/it-endophoric
  • es/it-exophoric

dem

  • head
  • modifier
  • local
  • temporal
  • pronadv
  • correlat-pronadv

comp

  • particular
  • general

All referring expressions have annotations on the features of the antecedent they refer to (saved as values of mention_antecedent):

  • np
  • fact-s
  • pronominal
  • other
  • event-vp
  • is-a

Querying coreference

  • every token which belongs to a referring expression (unsorted, unclassified):
[mention];
  • every member of every chain (one and multiword):
/region[mention];

STATISTICS:

to get the number of elements:

size Last;

to get the types of referring expressions:

group Last match mention_type;

to get the functions of referring expressions:

group Last match mention_func;

NOTE!

When querying demonstrative referring expressions for German, please exclude one function:

mention_func="correlat-pronadv"

For this, use for example the following query:

[_.mention_func!="correlat-pronadv"];

to get the list of referents (antecedents):

/region[mention]::match.mention_antecedent="none";

to get the information on the number of chains per text:

tabulate Last match text_id > "|sorttabulate>/data/projects/steiner/gecco/results/reference-chains.csv";

to get the list of anaphors:

/region[mention]::match.mention_antecedent!="none";

to get the types of antecedents of the queried anaphors:

group Last match mention_antecedent;

in EO-GECCO2013, ETRANS-GECCO, GO-GECCO2013, GTRANS-GECCO, EO-SPOKEN2013 and GO-SPOKEN2013

Assessing chunk information

run a query to extract all referring expressions:

[_.mention_antecedent!="none"] expand to mention;

count grammatical function of these:

group Last match NP_gf;
group Last match NP_gf1;

Repeat the grouping once for each NP level in the corpus.

Another possibility:

tabulate Last match NP_gf;

or

tabulate Last match NP_gf, match NP_gf1;

etc.
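
Writing out these commands for several NP levels can be automated. The following Python sketch generates them under the assumption that the attribute names continue the pattern shown above (NP_gf, NP_gf1, then NP_gf2 and so on); only NP_gf and NP_gf1 are confirmed by this page, and the number of levels has to be checked in the corpus itself. The function names are hypothetical.

```python
def np_gf_attributes(levels):
    """Attribute names for `levels` NP nesting levels: NP_gf, NP_gf1, ..."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    # Assumption: deeper levels are numbered NP_gf1, NP_gf2, ...
    return ["NP_gf"] + [f"NP_gf{i}" for i in range(1, levels)]


def group_commands(levels):
    """One `group` command per NP level."""
    return [f"group Last match {attr};" for attr in np_gf_attributes(levels)]


def tabulate_command(levels):
    """A single `tabulate` command with one column per NP level."""
    columns = ", ".join(f"match {attr}" for attr in np_gf_attributes(levels))
    return f"tabulate Last {columns};"


for command in group_commands(2):
    print(command)
print(tabulate_command(2))
```

For two levels this reproduces the group and tabulate commands given above.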

For information on grammatical functions, see the information on chunks for [[howtos:extractions:eo-gecco#grammatical_functions|EO]] and [[howtos:extractions:go-gecco#grammatical_functions|GO]].

Substitution

Representation

  • nominal substitution (expressed with //one, einen//, etc. )
  • verbal substitution (expressed with //do so, tun//, etc.)
  • clausal substitution (expressed with //so//)

saved as the following Structural Attributes:

substitution
substitution_type

Querying

  • any token (unsorted, unclassified):
[substitution];

or for the whole structure:

/region[substitution];

or:

<substitution>[]+</substitution>;

STATISTICS:

size Last;
group Last match substitution;
group Last match substitution_type;

with Register information:

group Last match text_register by match substitution;
group Last match text_register by match substitution_type;
  • substitution types (concrete Querying):
[_.substitution_type="nominal"];
[_.substitution_type="verbal"];
[_.substitution_type="clausal"];

STATISTICS

size Last;

EXAMPLE of querying all types of substitution in EO-GECCO:

EO-GECCO;
<substitution>[]+</substitution>;
group Last match text_register by match substitution;
tabulate Last match substitution_type, match text_register > "|sorttabulate > substitution.csv";

Conjunctions

Representation

syntactic types:

  • subjuncts
  • connectors
  • adverbials

semantic types:

  • additive
  • adversative
  • temporal
  • causal
  • modal

saved as the following Structural Attributes:

conj
conj_type
conj_func
conj_problematic

And the types are (saved as values of conj_type):

subjunct
connect
adverbial

And the functions or semantic types are (saved as conj_func):

additive
adversative
causal
temporal
modal

Querying

  • Any token
[conj];
  • all conjunctions (unsorted, unclassified):
<conj>[]+[]*</conj>;
/region[conj];
  • activate the borders of syntactic types of conjunctions:
show +conj_type;
  • activate the borders of semantic types of conjunctions:
show +conj_func;
  • concrete Querying (one element only):
<conj>[_.conj_type="connect"]</conj>;
<conj>[_.conj_type="subjunct"]</conj>;
<conj>[_.conj_type="adverbial"]</conj>;
  • concrete Querying (one and multi-word element):
<conj>[_.conj_type="connect"]+[_.conj_type="connect"]*</conj>;
<conj>[_.conj_type="subjunct"]+[_.conj_type="subjunct"]*</conj>;
<conj>[_.conj_type="adverbial"]+[_.conj_type="adverbial"]*</conj>;
  • same for semantic types:
<conj>[_.conj_func="additive"]+[_.conj_func="additive"]*</conj>;
<conj>[_.conj_func="adversative"]+[_.conj_func="adversative"]*</conj>;
<conj>[_.conj_func="causal"]+[_.conj_func="causal"]*</conj>;
<conj>[_.conj_func="temporal"]+[_.conj_func="temporal"]*</conj>;
<conj>[_.conj_func="modal"]+[_.conj_func="modal"]*</conj>;
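
All one- and multi-word queries above follow a single pattern, so they can be generated instead of typed. The following is a small Python sketch of that pattern; the function name `conj_query` is hypothetical, and the value lists are copied from the types and functions listed under Representation.

```python
def conj_query(attribute, value):
    """Build the one- and multi-word query for a conj attribute value."""
    token = f'[_.{attribute}="{value}"]'
    return f"<conj>{token}+{token}*</conj>;"


CONJ_TYPES = ["connect", "subjunct", "adverbial"]
CONJ_FUNCS = ["additive", "adversative", "causal", "temporal", "modal"]

# One query per syntactic type, then one per semantic type.
queries = [conj_query("conj_type", v) for v in CONJ_TYPES]
queries += [conj_query("conj_func", v) for v in CONJ_FUNCS]

for query in queries:
    print(query)
```

The printed lines can be pasted into a CQP session or saved to a file of queries.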

STATISTICS:

group Last match conj_type;
group Last match conj_func;
tabulate Last match conj_type, match conj_func > "|sorttabulate> /data/projects/steiner/gecco/results/conj.csv";

EXAMPLE - For statistics of different semantic types in different registers in EO

EO-GECCO;
<conj>[]+[]*</conj>;
group Last match conj_func by match text_register;
tabulate Last match conj_func, match text_register > "|sorttabulate > conj_func-eo.csv";

General Nouns

Representation

  • nouns (general)

saved as the following Structural Attributes:

noun
noun_type

And the types are (saved as values of noun_type):

  • general

Querying

  • Any token
[noun];
  • or:
/region[noun];
  • activate the borders of types:
show +noun_type;

STATISTICS:

size Last;
group Last match noun_type;
  • For statistics of different semantic types in different registers
group Last match noun_type by match text_register;
tabulate Last match lemma, match noun_type, match text_register > "|sorttabulate > generalnoun-CORPUS.csv";

Ellipsis

Representation

  • ellipsis

saved as the following Structural Attributes:

ellipsis
ellipsis_problematic [A]
ellipsis_func        [A]
ellipsis_type        [A]
ellipsis_antecedent  [A]
ellipsis_item        [A]
ellipsis_id          [A]

The items are (saved as values of ellipsis_item):

unspecified
antecedent
clause-internal antecedent
antecedent_ambig

The antecedents can be (saved as values of ellipsis_antecedent) either empty or some markables

The possible types are saved as ellipsis_type and include:

texttype
split
nominal
nonclausal
clausal
verbal
other
mixed_nominal+verbal/clausal
yes_no

The possible functions (saved as ellipsis_func):

non-cohesive
cohesive
clause-internal

If there were problems in annotation (e.g. disambiguation, unclear context, etc.), the ellipsis was annotated with the category problematic (ellipsis_problematic), with the values ‘yes’ or ‘no’.

NOTE: we were able to annotate triggers only, not empty items. So the annotated structures indicate that an ellipsis follows.

antecedents are recognisable as follows:

id="markable_ZAHL" type="texttype|split|nominal|nonclausal|clausal|verbal|other|mixed_nominal+verbal/clausal|yes_no" func="cohesive|non-cohesive|clause-internal" item="antecedent|antecedent_ambig|clause-internal antecedent" cohesive="yes|no" antecedent="empty" problematic="yes|no"

anaphors are recognisable as follows:

id="markable_ZAHL" type="texttype|split|nominal|nonclausal|clausal|verbal|other|mixed_nominal+verbal/clausal|yes_no" func="endophoric|exophoric" item="unspecified" antecedent="markable_ZAHL" problematic="yes|no"

Querying

  • All ellipsis
/region[ellipsis];
  • all cohesive ellipsis (anaphors of elliptical chains):
[_.ellipsis_func="cohesive"&_.ellipsis_item="unspecified"] expand to ellipsis;
  • all antecedents of cohesive elliptical chains:
[_.ellipsis_func="cohesive"&_.ellipsis_item="antecedent|antecedent_ambig"] expand to ellipsis;
  • activate the borders of types:
show +ellipsis_type;
  • activate other structures:
show +ellipsis_ATTRIBUTE;
  • It is possible to restrict the search to certain registers, for example, in this case for essays and speeches.
/region[ellipsis]::match.text_register="ESSAY|SPEECH";
  • It is also possible to exclude certain types from the query, for example, extract non-problematic cases only:
/region[ellipsis]::match.ellipsis_problematic="no";
  • To restrict your search to certain types of ellipsis, use the same type of query:
/region[ellipsis]::match.ellipsis_ATTRIBUTE="VALUE";

STATISTICS:

size Last;
group Last match ellipsis_type;
  • For statistics of different types in different registers
group Last match ellipsis_type by match text_register;
tabulate Last match ellipsis_type, match text_register > "|sorttabulate > ellipsis-CORPUS.csv";
  • Collect all statistics on the extracted ellipsis
tabulate Last match ellipsis_item, match ellipsis_type, match ellipsis_func, match text_register > "|sorttabulate > ellipsis-CORPUS.csv";

Problematic Cases

conj, mention, reference and ellipsis have annotations of problematic cases. These are cases which are either ambiguous or were problematic for the human annotators for some reason.

To extract them, use:

/region[conj]::match.conj_problematic="yes";
/region[mention]::match.mention_problematic="yes";
/region[ellipsis]::match.ellipsis_problematic="yes";