The result of a corpus linguistic study is usually a so-called data set. Such a data set contains information about a particular linguistic phenomenon extracted from a particular corpus.
In this tutorial we will learn how to create such data sets using the online search tool CQPweb.
A data set
In a corpus linguistic study, observations are usually associated with instances in a given corpus, and variables are often called features.
Datasets are structured to make relations between elements explicit. Most commonly, a dataset is structured so that each row represents one observation and each column one variable/feature:
Wickham (2014) calls this a tidy dataset, which is easy to manipulate, model and visualize.
Datasets are usually stored in a delimited plain-text format. Typical delimiters are commas (CSV), semicolons or TABs. For datasets containing language material, a TAB-delimited format is most common: while most other candidate characters (commas, semicolons, white space, …) may themselves be part of the linguistic signal, TABs are not, so they do not conflict with potential elements of a value.
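To illustrate why TABs are safe delimiters, the following small Python sketch (column names and values are invented for illustration) writes and reads a TAB-delimited dataset whose values themselves contain commas and spaces:

```python
import csv
import io

# A value containing a comma and a space: harmless in a TAB-delimited file.
rows = [
    ["lemma", "pos", "register"],
    ["come, on", "VV", "fiction"],  # comma inside the value
]

# Write the rows TAB-delimited into an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerows(rows)

# Read them back: the comma survives intact inside its cell.
buf.seek(0)
reader = csv.reader(buf, delimiter="\t")
parsed = list(reader)
print(parsed[1][0])  # come, on
```

With a comma delimiter, the same value would have needed quoting; with TABs, no escaping is required.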
Let us assume we are interested in the distribution of content verbs and their parts-of-speech across registers in the Brown corpus. We could formulate our study as follows:
The corresponding tidy dataset should look like this: distr_vfull_lemma-pos-reg_brown.txt.
In the following, you will learn how to create this data set.
Anchors
CQP represents query results by means of so-called anchors: match for the beginning of the match and matchend for the end of the match. There are two further anchors, target and keyword. They are both optional and have to be specifically defined in a query (we will ignore them for the time being). In order to understand the concept better, let's have a look at an example. The diagram below shows the beginning of a corpus. In the first column you see the corpus position and in the second column the words.
0 Alice
1 's
2 adventures
3 in
4 wonderland
5 Down
6 the
7 Rabbit-Hole
8 ALICE
9 was
10 beginning
11 to
12 get
13 very
14 tired
15 of
16 sitting
17 by
18 her
19 sister
20 on
Let’s assume we looked for noun phrases in this subset of the corpus and got the following results.
Alice 's adventures
wonderland
the Rabbit-Hole
ALICE
her sister
CQP represents these results in terms of corpus positions, i.e., the corpus position of the first token (referred to as match) and of the last token (referred to as matchend). Thus, the internal representation of the query result looks like this:
0 2
4 4
6 7
8 8
18 19
Attention: each hit is represented by two corpus positions (for match and matchend), no matter the length of the matched string. For results consisting of a single word, match and matchend are identical (e.g. in the case of wonderland and ALICE; line 2 and line 4). These anchor positions play an important role when we want to extract a data set from a corpus in CQPweb.
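The anchor logic can be sketched in a few lines of Python, using the token list and the (match, matchend) pairs from the toy example above:

```python
# Tokens of the toy corpus, indexed by corpus position 0, 1, 2, ...
tokens = ["Alice", "'s", "adventures", "in", "wonderland",
          "Down", "the", "Rabbit-Hole", "ALICE", "was",
          "beginning", "to", "get", "very", "tired",
          "of", "sitting", "by", "her", "sister", "on"]

# Each hit is a (match, matchend) pair of corpus positions.
hits = [(0, 2), (4, 4), (6, 7), (8, 8), (18, 19)]

# Recover the matched strings: matchend is inclusive, hence the +1.
strings = [" ".join(tokens[match:matchend + 1]) for match, matchend in hits]
for s in strings:
    print(s)
```

For the one-word hits (4, 4) and (8, 8), the slice contains exactly one token, confirming that match and matchend coincide for single-word results.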
The first thing we have to do in order to extract data is obviously to run a query.
Let us return to our example study of content verbs and their parts-of-speech across registers in the Brown corpus:
The query is rather simple, searching for every occurrence of a content verb:
[pos="VV.*"]
After executing the query in CQPweb, we get concordances showing each instance of a content verb (in this case in the Brown corpus) in context.
See Introduction to CQPweb: Corpus Search for more information on how to execute queries in CQPweb.
In order to download the results of a query as a data set, choose Download from the menu in the upper right corner of the concordance window and click on Go. This takes you to the page Download query as plain-text tabulation.
Under Frequently-used tabulations you can find a number of preinstalled tabulation commands.
If none of the preinstalled tabulation commands extracts the features we need (as is the case in our example), we have to specify a custom tabulation.
Let's have a look at the form Specify custom tabulation in more detail:
Col. no.: column in the download table - each column represents a variable/feature
Begin at and End at: specify the begin and end corpus positions of the token(s) for which we want to extract the variable/feature
Anchor: anchor for the corpus position (e.g. match, matchend)
Offset: offset for the anchor - with the offset, we can specify a corpus position relative to one of the anchors
Attribute: attribute of the variable/feature
To recall the parameters of our example study:
[pos="VV.*"]
Parameters for the custom tabulation:
Column 1: lemma; anchor: match
Column 2: pos; anchor: match
Column 3: text_reg; anchor: match
Output mode: simple tabulate output
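Conceptually, each tabulation column reads one attribute at one anchor position (plus an optional offset). The following Python sketch mimics what happens per hit; the token annotations and metadata values are invented for illustration, not taken from the Brown corpus:

```python
# Hypothetical token-level annotation at one corpus position (values invented).
corpus = {
    10: {"word": "beginning", "lemma": "begin", "pos": "VVG"},
}
# Hypothetical document-level metadata for the text containing position 10.
text_meta = {"text_reg": "fiction"}

def tabulate(match, columns):
    """Build one tabulation row: each column = (offset from anchor, attribute)."""
    cells = []
    for offset, attribute in columns:
        pos = match + offset
        if attribute.startswith("text_"):
            cells.append(text_meta[attribute])    # metadata attribute
        else:
            cells.append(corpus[pos][attribute])  # token-level attribute
    return "\t".join(cells)

# Our three columns, all read at the match anchor (offset 0).
row = tabulate(10, [(0, "lemma"), (0, "pos"), (0, "text_reg")])
print(row)
```

Each hit thus becomes one TAB-delimited row — exactly the tidy structure described at the beginning of this tutorial.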
Once you have specified all parameters in the Specify custom tabulation form, you can download the results by clicking on Download query tabulation with settings above.
Save the file under data/distr_vfull_lemma-pos-reg_brown.txt
in the course directory. If you haven’t filled in the name for the download file
before, you can specify the name of the file now.
Result of the download: distr_vfull_lemma-pos-reg_brown.txt.
We have now created a tidy data set, which is maximally flexible with regard to manipulation, modeling and visualization.
In principle, however, we have the following output format options:
simple tabulate output, as described above
sort and group output, which calculates frequencies for each combination of features
sort and group output, display as matrix, which presents the results in a matrix (two variables only)
The same data extraction methods can be used for queries whose patterns include more than one token.
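The sort and group output amounts to counting how often each combination of features occurs in the simple tabulation. A minimal Python sketch of that grouping step (the tabulation rows below are invented for illustration):

```python
from collections import Counter

# Invented tabulation rows: (lemma, pos, register) per hit.
rows = [
    ("say", "VVD", "press"),
    ("say", "VVD", "press"),
    ("go", "VVG", "fiction"),
    ("say", "VVD", "fiction"),
]

# "sort and group output": frequency of each feature combination,
# sorted by descending frequency.
freq = Counter(rows)
for combo, n in freq.most_common():
    print(*combo, n, sep="\t")
```

The matrix display is a further pivot of these counts, which is why CQPweb restricts it to two variables.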
[pos="N.*"] [pos="IN"] [pos="N.*"]
Pay attention:
use word instead of lemma for the complex noun phrase and normalize it for case (to avoid differences due to capitalization of words)
use text_period (50-year periods)
(50-year periods)Save the results under data/distr_n-prep-n_np-word-period_rsc.txt
in your course directory.
Result of the download: distr_n-prep-n_np-word-period_rsc.txt
[pos="N.*"] [pos="IN"] [pos="N.*"]
Pay attention:
use lemma instead of word
the preposition is neither at the beginning (match) nor at the end (matchend) of the matching pattern; thus, we have to specify the corpus position of the preposition relative to one of the anchors. This is done with the so-called offset. In our case, we can use either match or matchend, as the preposition is one position right of (following) match (match[+1]) and one position left of (preceding) matchend (matchend[-1]). Here we use match[+1].
use text_decade instead of text_period
text_decade needs only one column, as it is the same for the whole pattern
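The offset arithmetic can be checked directly: for any three-token hit, match[+1] and matchend[-1] point to the same corpus position, namely the middle token. A tiny Python sketch (the corpus positions below are invented):

```python
# A few invented (match, matchend) pairs for three-token hits.
hits = [(100, 102), (250, 252)]

for match, matchend in hits:
    # The middle token (the preposition) seen from both anchors:
    assert match + 1 == matchend - 1
    print("preposition at corpus position", match + 1)
```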
Save the results under data/distr_n-prep-n_lemma-decade_rsc.txt in your course directory.
Result of the download: distr_n-prep-n_lemma-decade_rsc.txt
Bartsch, Sabine. 2004. Structural and Functional Properties of Collocations in English. A Corpus Study of Lexical and Pragmatic Constraints on Lexical Co-occurrence. Tübingen: Narr.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.