in EO-GECCO2013, GO-GECCO2013, EO-SPOKEN2013, GO-SPOKEN2013, EO/GO-GECCOCOH2014
There are the following Structural Attributes:
lexicalcohesion
lexicalcohesion_id [A]
lexicalcohesion_distance [A]
lexicalcohesion_lexical_type [A]
lexicalcohesion_distance_type [A]
lexicalcohesion_lexical_chain [A]
lexicalcohesion_id: the ID of a markable
lexicalcohesion_distance: intended to be the distance between various anaphors (in most cases it was not annotated and has to be calculated)
lexicalcohesion_lexical_type: the semantic relation an item has
lexicalcohesion_distance_type: not annotated; the value is always immediate
lexicalcohesion_lexical_chain: has the value ‘set_ZAHL’ (ZAHL stands for a number) and gives the ID of the chain
Antecedents have the following features:
lexicalcohesion_id="ZAHL"
lexical_distance="ZAHL"
lexicalcohesion_lexical_type="first"
lexical_chain!="empty"
Anaphors (chain members which are not first) have the following features:
lexicalcohesion_id="ZAHL"
lexical_distance="ZAHL"
lexicalcohesion_lexical_type="repetition|synonym|antonym|hypernym|hyponym|co-hyponym|meronym|co-meronym|holonym|instance|type|co-instance"
lexical_chain!="none|empty"
All non-cohesive items (not belonging to a chain) have the following features:
lexicalcohesion_id="ZAHL"
lexical_distance="ZAHL"
lexicalcohesion_lexical_type="first|First|none"
lexical_chain="none|empty"
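The three feature profiles above can be read as a small decision rule, which may be handy when attribute values are post-processed outside CQP. The following Python sketch is only an illustration: the function name `classify_markable` and the returned labels are hypothetical, and treating the type `none` on a chain member as chain-initial follows the first-element query used further down this page rather than a documented annotation rule.

```python
# Values of lexical_chain that mark an item as not belonging to any chain.
NO_CHAIN = {"none", "empty"}
# Values of lexical_type that the first-element query treats as chain-initial.
FIRST_TYPES = {"first", "none"}


def classify_markable(lexical_type: str, lexical_chain: str) -> str:
    """Return 'non-cohesive', 'antecedent' or 'anaphor' for one markable."""
    if lexical_chain in NO_CHAIN:
        return "non-cohesive"
    # lower() also covers the capitalised variant "First".
    if lexical_type.lower() in FIRST_TYPES:
        return "antecedent"
    # Any remaining type is a semantic relation (repetition, synonym, ...).
    return "anaphor"


print(classify_markable("hyponym", "set_12"))
```

The chain membership test comes first because non-cohesive items may also carry the type `first`.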
[lexicalcohesion];
/region[lexicalcohesion];
STATISTICS:
to get the number of elements:
size Last;
At the moment, we have some elements which are there but do not belong to any chain. To sort them out, please use the following command:
<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"]+</lexicalcohesion>;
If you want to query the first element in the chain (for instance to count the number of chains):
<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"&_.lexicalcohesion_lexical_type="first|none"]+</lexicalcohesion>;
size Last;
To query all elements of chains except for the first elements, use:
<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"&_.lexicalcohesion_lexical_type!="first|none"]+</lexicalcohesion>;
size Last;
To export a table with the number of first elements (chains) per register:
tabulate Last match text_register > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/first-register-SUBCORPUS.csv";
To export a table with the number of first elements (chains) per text:
tabulate Last match text_id > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/first-text-SUBCORPUS.csv";
To assess the distributions of semantic relations:
group Last match lexicalcohesion_lexical_type;
To export a table (with semantic relations) sorted according to registers:
tabulate Last match text_register, match lexicalcohesion_lexical_type > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/relations-register-SUBCORPUS.csv";
To export a table (with semantic relations) sorted according to texts:
tabulate Last match text_id, match lexicalcohesion_lexical_type > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/relations-text-SUBCORPUS.csv";
Average chain sizes cannot be obtained directly and need some post-processing. The size of each chain per text, however, can already be extracted:
<lexicalcohesion>[_.lexicalcohesion_lexical_chain!="none|empty"]+</lexicalcohesion>;
tabulate Last match text_id, match lexicalcohesion_lexical_chain > "|sorttabulate|sort -k 2|cat > /data/projects/steiner/gecco/results/lexical-cohesion/chains/chains-text-SUBCORPUS.csv";
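The post-processing for average chain sizes could look roughly like the Python sketch below. It is not part of the documented workflow. It assumes that the raw `tabulate` output is available as tab-separated rows of text ID and chain ID, one row per chain element, i.e. before any aggregation by `sorttabulate`, whose output format is not described here. The function name `average_chain_size` and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def average_chain_size(lines):
    """Map each text ID to the mean number of elements per chain."""
    sizes = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        text_id, chain_id = row[0], row[1]
        sizes[(text_id, chain_id)] += 1  # one row = one chain element

    per_text = defaultdict(list)
    for (text_id, _chain_id), n in sizes.items():
        per_text[text_id].append(n)
    return {text_id: sum(ns) / len(ns) for text_id, ns in per_text.items()}


# Invented sample rows (assumption: text_id <TAB> chain_id per element).
sample = io.StringIO(
    "EO_ESSAY_007\tset_3\n"
    "EO_ESSAY_007\tset_3\n"
    "EO_ESSAY_007\tset_5\n"
    "EO_ESSAY_001\tset_1\n"
)
print(average_chain_size(sample))
```

Chain IDs are counted per text because the same `set_ZAHL` value can occur in more than one text.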
To assess particular chains, you can use the information from the chain size extraction and analyse, for instance, the longest chains. Here is an example:
In the text EO_ESSAY_007, there is a chain which consists of more than 30 elements, so we have a closer look at it:
<lexicalcohesion>[_.lexicalcohesion_lexical_chain="set_3"&_.text_id="EO_ESSAY_007"]+</lexicalcohesion>;
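Finding candidates for such a closer look can be sketched in Python as follows. The sketch assumes the chain elements are available as (text ID, chain ID) pairs, one pair per element, for example parsed from raw tab-separated `tabulate` output. The names `longest_chains` and `chain_query` are hypothetical helpers, not part of CQP.

```python
from collections import Counter


def longest_chains(rows, top=5):
    """Return up to `top` (text_id, chain_id, size) triples, largest first."""
    sizes = Counter((text_id, chain_id) for text_id, chain_id in rows)
    return [(t, c, n) for (t, c), n in sizes.most_common(top)]


def chain_query(text_id, chain_id):
    """Build a CQP query string for all elements of one chain in one text."""
    return (
        '<lexicalcohesion>[_.lexicalcohesion_lexical_chain="{c}"'
        '&_.text_id="{t}"]+</lexicalcohesion>;'
    ).format(c=chain_id, t=text_id)


# Invented sample pairs, one per chain element.
rows = [("T1", "set_1")] * 3 + [("T1", "set_2")] + [("T2", "set_1")] * 2
for text_id, chain_id, size in longest_chains(rows, top=2):
    print(size, chain_query(text_id, chain_id))
```

The generated query strings can then be pasted into a CQP session one at a time.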
There are the following Structural Attributes:
mention
mention_type [A]
mention_func [A]
mention_chain_id [A]
mention_cohesive [A]
mention_anaphor [A]
mention_problematic [A]
antecedents have the following attribute-values:
chain_id="set_ZAHL"
problematic="yes|no"
cohesive="anaphoric"
func="modifier"
type="dem"
antecedent="none"
anaphors have the following attribute-values:
chain_id="set_ZAHL"
problematic="yes|no"
cohesive="anaphoric"
func="es/it-endophoric/exophoric|modifier|person/possessive-endophoric|article|head|general|particular|local|temporal|pronadv"
type="pers|dem|comp"
antecedent="np|pronominal|fact-s|other|event-vp|is-a"
anaphors without antecedents (e.g. comparative) have the following attribute-values:
chain_id="empty"
problematic="yes|no"
cohesive="none"
func="es/it-endophoric/exophoric|modifier|person/possessive-endophoric|article|head|general|particular|local|temporal|pronadv"
type="pers|dem|comp"
referring expressions have the following subtypes (saved as values of mention_type):
And the subtypes or functions are (saved as values of mention_func):
pers in EO
pers in GO
dem
comp
All referring expressions have annotations on the features of the antecedent they refer to (saved as values of mention_antecedent):
[mention];
/region[mention];
STATISTICS:
to get the number of elements:
size Last;
to get the types of referring expressions:
group Last match mention_type;
to get the functions of referring expressions:
group Last match mention_func;
NOTE!
When querying demonstrative referring expressions for German, please exclude one function:
mention_func="correlat-pronadv"
For this, use for example the following query:
[_.mention_func!="correlat-pronadv"];
to get the list of referents (antecedents):
/region[mention]::match.mention_antecedent="none";
to get the information on the number of chains per text:
tabulate Last match text_id > "|sorttabulate>/data/projects/steiner/gecco/results/reference-chains.csv";
to get the list of anaphors:
/region[mention]::match.mention_antecedent!="none";
to get the types of antecedents of the queried anaphors:
group Last match mention_antecedent;
in EO-GECCO2013, ETRANS-GECCO, GO-GECCO2013, GTRANS-GECCO, EO-SPOKEN2013 and GO-SPOKEN2013
run a query to extract all referring expressions:
[_.mention_antecedent!="none"] expand to mention;
count grammatical function of these:
group Last match NP_gf;
group Last match NP_gf1;
Repeat the grouping once for each NP level that exists.
Another possibility:
tabulate Last match NP_gf;
or
tabulate Last match NP_gf, match NP_gf1;
etc.
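As an alternative to repeating `group` for every NP level, the tabulated output can be counted in a single pass outside CQP. The Python sketch below assumes the `tabulate` output was saved as tab-separated text with one column per NP level and an empty cell where a level does not exist. The function name `gf_counts_per_level` and the labels in the sample rows are invented for illustration.

```python
from collections import Counter


def gf_counts_per_level(lines):
    """Return one Counter per column (NP level) of tab-separated rows."""
    levels = []
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        # Grow the list when a row has more columns than seen so far.
        while len(levels) < len(cells):
            levels.append(Counter())
        for i, cell in enumerate(cells):
            value = cell.strip()
            if value:  # skip empty cells
                levels[i][value] += 1
    return levels


# Invented sample rows: NP_gf <TAB> NP_gf1.
sample = ["subj\tobj\n", "subj\t\n", "obj\tsubj\n"]
levels = gf_counts_per_level(sample)
for i, counts in enumerate(levels):
    print(i, dict(counts))
# Pooled counts over all levels:
print(dict(sum(levels, Counter())))
```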
For information on grammatical functions, see the information on chunks for [[howtos:extractions:eo-gecco#grammatical_functions|EO]] and [[howtos:extractions:go-gecco#grammatical_functions|GO]].
saved as the following Structural Attributes:
substitution
substitution_type
[substitution];
or for the whole structure:
/region[substitution];
or:
<substitution>[]+</substitution>;
STATISTICS:
size Last;
group Last match substitution;
group Last match substitution_type;
with Register information:
group Last match text_register by match substitution;
group Last match text_register by match substitution_type;
[_.substitution_type="nominal"];
[_.substitution_type="verbal"];
[_.substitution_type="clausal"];
STATISTICS
size Last;
EXAMPLE of querying all types of substitution in EO-GECCO:
EO-GECCO;
<substitution>[]+</substitution>;
group Last match text_register by match substitution;
tabulate Last match substitution_type, match text_register > "|sorttabulate > substitution.csv";
saved as the following Structural Attributes:
conj
conj_type
conj_func
conj_problematic
And the types are (saved as values of conj_type):
subjunct
connect
adverbial
And the functions or semantic types are (saved as conj_func):
additive
adversative
causal
temporal
modal
[conj];
<conj>[]+[]*</conj>;
/region[conj];
show +conj_type;
show +conj_func;
<conj>[_.conj_type="connect"]</conj>;
<conj>[_.conj_type="subjunct"]</conj>;
<conj>[_.conj_type="adverbial"]</conj>;
<conj>[_.conj_type="connect"]+[_.conj_type="connect"]*</conj>;
<conj>[_.conj_type="subjunct"]+[_.conj_type="subjunct"]*</conj>;
<conj>[_.conj_type="adverbial"]+[_.conj_type="adverbial"]*</conj>;
<conj>[_.conj_func="additive"]+[_.conj_func="additive"]*</conj>;
<conj>[_.conj_func="adversative"]+[_.conj_func="adversative"]*</conj>;
<conj>[_.conj_func="causal"]+[_.conj_func="causal"]*</conj>;
<conj>[_.conj_func="temporal"]+[_.conj_func="temporal"]*</conj>;
<conj>[_.conj_func="modal"]+[_.conj_func="modal"]*</conj>;
STATISTICS:
group Last match conj_type;
group Last match conj_func;
tabulate Last match conj_type, match conj_func > "|sorttabulate> /data/projects/steiner/gecco/results/conj.csv";
EXAMPLE - For statistics of different semantic types in different registers in EO
EO-GECCO;
<conj>[]+[]*</conj>;
group Last match conj_func by match text_register;
tabulate Last match conj_func, match text_register > "|sorttabulate > conj_func-eo.csv";
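If the counts per semantic type and register are needed as a cross-table, the tabulated rows can be pivoted outside CQP. The Python sketch below assumes raw tab-separated `tabulate` output with one row per match, in the column order conj_func then text_register, taken before any aggregation by `sorttabulate`. The function name `crosstab` and the sample rows are invented for illustration.

```python
from collections import Counter


def crosstab(lines):
    """Count rows per (conj_func, text_register) pair as a nested mapping."""
    table = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        func, register = parts[0], parts[1]
        table.setdefault(func, Counter())[register] += 1
    return table


# Invented sample rows: conj_func <TAB> text_register.
sample = [
    "additive\tESSAY\n",
    "additive\tESSAY\n",
    "causal\tSPEECH\n",
    "additive\tSPEECH\n",
]
for func, counts in sorted(crosstab(sample).items()):
    print(func, dict(counts))
```

Because the inner mappings are `Counter` objects, a register that never occurs with a given function simply yields 0.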
saved as the following Structural Attributes:
noun
noun_type
And the types are (saved as values of noun_type):
[noun];
/region[noun];
show +noun_type;
STATISTICS:
size Last;
group Last match noun_type;
group Last match noun_type by match text_register;
tabulate Last match lemma, match noun_type, match text_register > "|sorttabulate > generalnoun-CORPUS.csv";
saved as the following Structural Attributes:
ellipsis
ellipsis_problematic [A]
ellipsis_func [A]
ellipsis_type [A]
ellipsis_antecedent [A]
ellipsis_item [A]
ellipsis_id [A]
The items are (saved as values of ellipsis_item):
unspecified
antecedent
clause-internal antecedent
antecedent_ambig
The antecedents can be (saved as values of ellipsis_antecedent) either empty or some markables
The possible types are saved as ellipsis_type and include:
texttype
split
nominal
nonclausal
clausal
verbal
other
mixed_nominal+verbal/clausal
yes_no
The possible functions (saved as ellipsis_func):
non-cohesive
cohesive
clause-internal
If there were problems during annotation (e.g. disambiguation, unclear context), the ellipsis was annotated with the category problematic (ellipsis_problematic), which takes the values ‘yes’ or ‘no’.
NOTE: we were able to annotate triggers only, not empty items. The annotated structure therefore indicates that an ellipsis follows.
antecedents are recognisable as follows:
id="markable_ZAHL"
type="texttype|split|nominal|nonclausal|clausal|verbal|other|mixed_nominal+verbal/clausal|yes_no"
func="cohesive|non-cohesive|clause-internal"
item="antecedent|antecedent_ambig|clause-internal antecedent"
cohesive="yes|no"
antecedent="empty"
problematic="yes|no"
anaphors are recognisable as follows:
id="markable_ZAHL"
type="texttype|split|nominal|nonclausal|clausal|verbal|other|mixed_nominal+verbal/clausal|yes_no"
func="endophoric|exophoric"
item="unspecified"
antecedent="markable_ZAHL"
problematic="yes|no"
/region[ellipsis];
[_.ellipsis_func="cohesive"&_.ellipsis_item="unspecified"] expand to ellipsis;
[_.ellipsis_func="cohesive"&_.ellipsis_item="antecedent|antecedent_ambig"] expand to ellipsis;
show +ellipsis_type;
show +ellipsis_ATTRIBUTE;
/region[ellipsis]::match.text_register="ESSAY|SPEECH";
/region[ellipsis]::match.ellipsis_problematic="no";
/region[ellipsis]::match.ellipsis_ATTRIBUTE="VALUE";
STATISTICS:
size Last;
group Last match ellipsis_type;
group Last match ellipsis_type by match text_register;
tabulate Last match ellipsis_type, match text_register > "|sorttabulate > ellipsis-CORPUS.csv";
tabulate Last match ellipsis_item, match ellipsis_type, match ellipsis_func, match text_register > "|sorttabulate > ellipsis-CORPUS.csv";
conj, mention, reference and ellipsis carry annotations of problematic cases. These are cases which are either ambiguous or were problematic for human annotators for some reason.
To query them, use:
/region[conj]::match.conj_problematic="yes";
/region[mention]::match.mention_problematic="yes";
/region[ellipsis]::match.ellipsis_problematic="yes";