In this tutorial we will learn how to download query results from CQPweb as a delimited text file, more precisely as a TAB-delimited text file. See Data and data formats for more information on formats for data sets.
CQP describes each hit of a query in terms of anchors. The main anchors are the corpus positions of the first and the last word in the hit. Additional anchors are the so-called target and keyword, both of which are optional and have to be specifically defined for a query (we will ignore them in this tutorial).

In order to understand the concept better, let’s have a look at an example. The diagram below shows the beginning of a corpus. In the first column you see the corpus position and in the second column the words.
0 Alice
1 's
2 adventures
3 in
4 wonderland
5 Down
6 the
7 Rabbit-Hole
8 ALICE
9 was
10 beginning
11 to
12 get
13 very
14 tired
15 of
16 sitting
17 by
18 her
19 sister
20 on
Let’s assume we searched for noun phrases in this subset of the corpus and got the following results:
Alice 's adventures
wonderland
the Rabbit-Hole
ALICE
her sister
CQP represents these results in terms of corpus positions, i.e., the first token (referred to as match) and the last token (referred to as matchend) in the result item. These positions are also referred to as anchor positions. Thus, the internal representation of the query results is:
match matchend
0 2
4 4
6 7
8 8
18 19
Attention: each hit is represented by two corpus positions (the anchors match and matchend), no matter how long the string is. For hits consisting of a single word, match and matchend are the same, e.g. for ALICE in our example. These anchor positions play an important role if we want to download results in a TAB-delimited format.
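The relation between anchors and hit strings can be sketched in a few lines of Python. This is only an illustration of the idea, not how CQP is implemented; the function name hit_text is made up for this example, and the token list and anchor pairs are copied from the tables above.

```python
# Tokens of the example corpus; the list index is the corpus position.
tokens = ["Alice", "'s", "adventures", "in", "wonderland", "Down", "the",
          "Rabbit-Hole", "ALICE", "was", "beginning", "to", "get", "very",
          "tired", "of", "sitting", "by", "her", "sister", "on"]

# Query hits as (match, matchend) pairs, copied from the table above.
hits = [(0, 2), (4, 4), (6, 7), (8, 8), (18, 19)]


def hit_text(tokens, match, matchend):
    """Return the words between the two anchors (both ends inclusive)."""
    return " ".join(tokens[match:matchend + 1])


# Print one TAB-delimited line per hit: match, matchend, hit string.
for match, matchend in hits:
    print(match, matchend, hit_text(tokens, match, matchend), sep="\t")
```

Note that a one-word hit such as (8, 8) still carries both anchors; the slice simply covers a single position.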
Select Download from the menu in the upper right corner of the concordance window and click on Go. Then choose Download query as plain-text tabulation.
Under Frequently-used tabulations you can find a number of preinstalled tabulation commands.
In principle, we can download any attribute annotated in the corpus for our match string. With the form under Specify custom tabulation, we define which values or features we want to extract for which tokens.
Let’s have a look at the form under Specify custom tabulation in more detail:
Col.no.: column in the download table; each column stands for a particular value
Begin at and End at: specify the token(s) for which we want to extract the values
Anchor: corpus position anchor (e.g. match, matchend)
Offset: offset for the anchor
Attribute: specifies the value

Example query: [pos="VV.*"].
Specifications for the download table:
Anchor: match with no offset
Attributes: lemma, pos, text_reg
Output format: simple tabulate output
The Specify custom tabulation form looks like this:
The result should look like this. We have three columns (one for each variable) and one row for each instance in the corpus.
This format is well suited as input format for statistical analysis programs such as R.
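The same kind of file can also be read with other tools. Below is a minimal Python sketch, assuming a download with exactly the three columns lemma, pos and text_reg and no header row (this is an assumption about the file, so check your own download). The sample rows are invented for illustration and are not real CQPweb output; read_tabulation is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a downloaded file; real downloads will differ.
sample = "go\tVVB\tspoken\nsay\tVVD\twritten\ngo\tVVG\tspoken\n"


def read_tabulation(handle, columns=("lemma", "pos", "text_reg")):
    """Read TAB-delimited rows into dictionaries keyed by column name."""
    # QUOTE_NONE: treat quotation marks in corpus text as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader]


rows = read_tabulation(io.StringIO(sample))

# Count how often each lemma occurs across all rows.
lemma_counts = Counter(row["lemma"] for row in rows)
print(lemma_counts.most_common())
```

For a real file, replace the io.StringIO object with open(path, newline="", encoding="utf-8"), where path points to the downloaded file.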
In principle, we have the following output format options:
simple tabulate output, as described above
sort and group output, which calculates frequencies for each combination of attributes
sort and group output, display as matrix, which presents the results in a matrix (useful only for two variables). Column 1 will be the matrix rows, column 2 the matrix columns.

In the example here, we used the following parameters:
[pos="N.*"] [pos="IN"] [pos="N.*"]
Note that the preposition is neither at the beginning (match) nor at the end (matchend) of the pattern. In order to extract the lemma of the preposition, we need to specify its corpus position relative to one of these anchors. This is done with the so-called offset. A positive offset refers to the right, i.e. to what follows the anchor; a negative offset refers to the left, i.e. to what precedes the anchor. As the preposition is one position to the right of match (match[+1]) and one position to the left of matchend (matchend[-1]), we could use either anchor.
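The arithmetic behind offsets is simple addition on corpus positions. The following Python sketch illustrates this under stated assumptions: the three-token corpus and the helper resolve are invented for this example and are not part of CQP or CQPweb.

```python
# Hypothetical mini-corpus as (word, pos) pairs; the index is the corpus position.
corpus = [("history", "NN"), ("of", "IN"), ("science", "NN")]


def resolve(anchors, anchor, offset=0):
    """Return the corpus position addressed by an anchor plus an offset."""
    return anchors[anchor] + offset


# One hit of [pos="N.*"] [pos="IN"] [pos="N.*"] spanning positions 0 to 2.
hit = {"match": 0, "matchend": 2}

left = resolve(hit, "match", +1)      # match[+1]
right = resolve(hit, "matchend", -1)  # matchend[-1]

# Both expressions address the same token, the preposition.
print(left, right, corpus[left][0])
```

Either anchor works here only because the pattern has a fixed length of three tokens; with patterns of variable length, the two expressions may point to different tokens.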
The Specify custom tabulation form using match[+1] looks like this:
Note that we extract text_decade at the position match (without offset).
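What the two sort and group options compute can be approximated offline. The sketch below is a rough Python illustration, not CQPweb's own procedure: the rows are invented pairs of preposition lemma and text_decade, as_matrix is a hypothetical helper, and the ordering and layout of a real download may differ.

```python
from collections import Counter

# Invented (preposition lemma, text_decade) rows, one per hit.
rows = [("of", "1950"), ("in", "1950"), ("of", "1960"), ("of", "1950")]

# "sort and group": frequency of each combination of values.
grouped = Counter(rows)


def as_matrix(grouped):
    """Arrange pair counts with column 1 as rows and column 2 as columns."""
    row_keys = sorted({r for r, _ in grouped})
    col_keys = sorted({c for _, c in grouped})
    table = [[grouped.get((r, c), 0) for c in col_keys] for r in row_keys]
    return row_keys, col_keys, table


row_keys, col_keys, table = as_matrix(grouped)

# Print the matrix as TAB-delimited text with a header line.
print("\t".join([""] + col_keys))
for key, counts in zip(row_keys, table):
    print("\t".join([key] + [str(n) for n in counts]))
```

Combinations that never occur are filled with 0, which is why the matrix view is practical only for two variables.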
The result of the download should look like this