In this tutorial you will learn how to use a simple markup for corpus data: the vertical text format (VRT).
This tutorial assumes that:
word is NOT plain text)
The goal is to get you acquainted with the
The VRT-format is the input format for the Corpus Workbench (CWB), which allows you to encode and query corpora efficiently using the command-line tool CQP (Corpus Query Processor) or its web-based GUI CQPweb.
The VRT-format is simple and can easily be processed and transformed. It combines a one-word-per-line (vertical) format with simple XML markup.
Annotation on token level is represented in a one-word-per-line, TAB-delimited format, where each column represents a different annotation (e.g. word, part-of-speech, lemma). These attributes are also referred to as positional attributes. An example is given below.
ALICE NP Alice
was VBD be
beginning VVG begin
to TO to
get VV get
very RB very
tired JJ tired
of IN of
sitting VVG sit
by IN by
her PP$ her
sister NN sister
on IN on
the DT the
bank NN bank
, , ,
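As a rough illustration of how such lines can be processed, the following Python sketch splits one line into its positional attributes. The column order (word, part-of-speech, lemma) is taken from the example above; the names `Token` and `parse_token_line` are made up for this sketch and are not part of CWB.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One corpus position with its positional attributes (hypothetical names)."""
    word: str
    pos: str
    lemma: str


def parse_token_line(line: str) -> Token:
    """Split one TAB-delimited token line into word, pos and lemma."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 3:
        raise ValueError(
            f"expected 3 TAB-separated columns, got {len(columns)}: {line!r}"
        )
    return Token(*columns)


if __name__ == "__main__":
    token = parse_token_line("beginning\tVVG\tbegin\n")
    print(token.word, token.pos, token.lemma)
```

Splitting on the TAB character only (rather than on any white space) keeps tokens that contain spaces intact, which is why the columns have to be TAB-delimited in the file.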
Annotation beyond token level, spanning a sequence of tokens or a whole section of the text (e.g. titles, sentences, paragraphs, but also named entities or phrases), is represented using XML tags (see also Short Introduction to XML). These annotations are referred to as structural attributes.
The following is an excerpt from a VRT-file including positional and structural attributes.
<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
<s>
Alice NP Alice
's POS 's
adventures NNS adventure
in IN in
wonderland NN wonderland
</s>
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
<s>
Down IN down
the DT the
Rabbit-Hole NN rabbit-hole
</s>
</head>
<p>
<s>
ALICE NP Alice
was VBD be
beginning VVG begin
to TO to
get VV get
very RB very
tired JJ tired
of IN of
sitting VVG sit
by IN by
her PP$ her
sister NN sister
on IN on
the DT the
bank NN bank
, , ,
[...]
</s>
[...]
</p>
[...]
</text>
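To make the interplay of the two attribute types more concrete, here is a minimal Python sketch that walks through VRT lines and reports, for every token, which structural elements are currently open. It assumes that every XML tag stands on its own line and that token columns are TAB-separated; the function name `tokens_with_context`, the sample data and its `id` value are invented for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# An opening tag such as <div id="x"> and a closing tag such as </div>.
OPEN_TAG = re.compile(r"^<([A-Za-z0-9_]+)(\s[^>]*)?>$")
CLOSE_TAG = re.compile(r"^</([A-Za-z0-9_]+)>$")


def tokens_with_context(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (columns, open_elements) for every token line."""
    stack: List[str] = []  # names of the structural elements currently open
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        closing = CLOSE_TAG.match(line)
        opening = OPEN_TAG.match(line)
        if closing:
            name = closing.group(1)
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected closing tag </{name}>")
            stack.pop()
        elif opening:
            stack.append(opening.group(1))
        else:
            # Everything else is a token line with positional attributes.
            yield line.split("\t"), list(stack)


# Invented sample data in the style of the excerpt above.
SAMPLE = """<text id="alice">
<p>
<s>
ALICE\tNP\tAlice
was\tVBD\tbe
</s>
</p>
</text>
"""

if __name__ == "__main__":
    for columns, context in tokens_with_context(SAMPLE.splitlines()):
        print(columns[0], "/".join(context))
```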
There are some restrictions with respect to the naming of elements and attribute-value pairs in structural attributes, some of which are CQPweb specific:
- Names of elements and attributes may consist only of the characters [a-zA-Z0-9_], i.e. no white space, no dashes or diacritics, to name just the most frequent error sources.
- " signals the end of a value, so it cannot be part of a value (you can use single quotes instead).
CQPweb specific:
- Each text has to be enclosed in a <text>-element with an obligatory unique identifier (ID) as attribute-value pair.
- The attribute name of the identifier has to be id in lower case letters, and it has to be the first attribute-value pair of the <text>-element.
- Values of attributes that are to be used with the distribution feature implemented in CQPweb (e.g. information on register, year) may consist only of the characters [a-zA-Z0-9_]; again no white space, dashes or diacritics (among others).
- <s>-elements are required.
Before you actually start with the annotation, it is a good idea to think about the (linguistic) information you want to annotate and how it is best represented. As said before, the VRT-format provides two different attribute types: positional attributes for annotation on token level and structural attributes for annotation beyond token level.
The annotation itself can be divided into:
What you actually annotate depends on:
Meta-data is like the ID card of your texts. It is essential for grouping texts into subgroups (subcorpora), e.g., according to time of publication/utterance, register/genre, language, author, … But also to find particular texts in a large collection and to retrieve the source of a text. The meta-data you annotate for your corpus depend on
Typical meta-data information for a corpus are:
In the VRT-format, meta-data information is represented as attribute-value pairs of the <text>-element.
Information related to word sequences or text sections is represented as XML-elements within the text. Typical structural information includes
- sentences (<s></s>)
- paragraphs (<p></p>)
- titles (<title></title>); note that title information in the meta-data and the marking of the title within the text are two different things
- headlines (<head></head>)
Linguistic annotation can include annotation on
There are many tools for linguistic annotation such as part-of-speech taggers, named entity recognizers, syntactic or semantic parsers, …
However, the more complex the linguistic annotation, the more error-prone a tool usually is.
Part-of-speech (pos) taggers, which perform basic annotation on the token level (word, lemma, part-of-speech), are relatively reliable (precision: 95-98% on languages such as English and German). For many investigations this basic annotation suffices.
The reason why there is no ready-made tool for data preparation is that each data source is different. Thus, the first step is really to look at the data, to get a feeling of its quality and of the information it contains.
search and replace.
You can store all your texts in one file or use separate files for each text. However, if you want to add metadata for each text in your corpus (e.g. in order to create subcorpora), automatic processing is easier if each text is in a separate file. To add metadata for each text, you need to enclose the text in a <text>-element and add the metadata as attribute-value pairs of that element.
In any case you need at least one <text>-element for the installation of your corpus in CQPweb.
A schematic example is given below:
<text id="text_01" title="Title of first text" author="Firstname Lastname" year="2000" register="fiction" language="en">
...
</text>
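When every text is stored in a separate file, an opening tag like the one above can be generated from a metadata record. The following Python sketch shows one possible way to do this while respecting the restrictions on the id attribute; the function name `text_open_tag` is hypothetical, and restricting the id value itself to [a-zA-Z0-9_] is an assumption made here to stay on the safe side.

```python
import re

# Characters allowed in attribute names (and, by assumption, in the id value).
SAFE = re.compile(r"^[a-zA-Z0-9_]+$")


def text_open_tag(text_id: str, metadata: dict) -> str:
    """Build a <text ...> opening tag with a lower-case id as first attribute."""
    if not SAFE.match(text_id):
        raise ValueError(f"id may only contain [a-zA-Z0-9_]: {text_id!r}")
    parts = [f'id="{text_id}"']  # id always comes first
    for name, value in metadata.items():
        if name.lower() == "id":
            raise ValueError("pass the identifier via text_id, not in metadata")
        if not SAFE.match(name):
            raise ValueError(f"attribute name may only contain [a-zA-Z0-9_]: {name!r}")
        if '"' in value:
            # A double quote would end the value prematurely.
            raise ValueError(f"value of {name} must not contain a double quote")
        parts.append(f'{name}="{value}"')
    return "<text " + " ".join(parts) + ">"


if __name__ == "__main__":
    print(text_open_tag("text_01", {
        "title": "Title of first text",
        "year": "2000",
        "register": "fiction",
        "language": "en",
    }))
```

The sketch does not check the values of classification attributes such as register or year; if they are to be used for grouping texts in CQPweb, they would need the same character check as the names.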
Attention:
- The id has to be unique.
- id has to be in lower case and has to be the first attribute in the <text>-element.
- Values used for grouping texts (e.g. register, year) may consist only of the characters [a-zA-Z0-9_] (no white space, dashes, diacritics, …).
Structural information (e.g. title, paragraph, headlines, divisions) may be inherent in the text:
Depending on your research interest, you might want to keep some of the explicit markup or add markup for structural information manually or with a dedicated script.
Explicit markup might need some transformation to meet the requirements of the XML-markup in VRT-files. For a list of typical structural information see the paragraph on structural information above.
Sentences are a special case. If not already marked explicitly in the text, sentence boundaries may be derived automatically from pos-tagging using a dedicated script.
An example of a corpus with structural information is given below.
<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
Alice's adventures in wonderland
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
Down the Rabbit-Hole
</head>
<p>
ALICE was beginning to get very
tired of sitting by her sister on
the bank, and of having nothing
to do: once or twice she had
peeped into the book her sister was reading,
but it had no pictures or conversations in
it, "and what is the use of a book," thought
Alice, "without pictures or conversations?"
</p>
...
</div>
...
</text>
You might have realized that the texts are not yet in the one-word-per-line format.
We will use a tokenizer to split the text into tokens and a part-of-speech tagger to assign lemmas and part-of-speech tags.
The TreeTagger is easy to use and produces the output required for the VRT-format. See Tagging with TreeTagger for more information on how to use the TreeTagger.
After this step, your data should look similar to this:
<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
Alice NP Alice
's POS 's
adventures NNS adventure
in IN in
wonderland NN wonderland
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
Down IN down
the DT the
Rabbit-Hole NN rabbit-hole
</head>
<p>
ALICE NP Alice
was VBD be
beginning VVG begin
to TO to
get VV get
very RB very
tired JJ tired
of IN of
sitting VVG sit
by IN by
her PP$ her
sister NN sister
on IN on
the DT the
bank NN bank
, , ,
[...]
[...]
</p>
[...]
</text>
This is already valid VRT for command-line CQP and for import into CQPweb.
However, we do not yet have sentence markers. Pos-tags (e.g. SENT in the UPenn tagset) and annotated XML-elements (e.g. p for paragraph and title; option --sentmark|sm) indicate sentence boundaries. We can use these in a dedicated script (add-sent-markers.perl) to add sentence markers to the corpus. Additionally, the script can add a <text>-element with a given ID at the beginning and end of the file (option --textmark|tm).
Usage: add-sent-markers.perl <infile> <outputfile> <sentdeliminator>
[--sentmark|sm=s] list of xml-elements for sentence marking
[--textmark|tm=s] creates a text-element with given id surrounding enclosing the content of the file
[--excludeword|ew=s] exclude list of word as sentence markers
The script takes the tagged file, the desired output file and a comma-separated list of pos-tags marking sentence boundaries as input.
An example is given below:
add-sent-markers.perl alice.tagged alice.tagged.vrt SENT --sm div,title,head,p
- alice.tagged: the tagged input file
- alice.tagged.vrt: the output file
- SENT: the pos-tag marking sentence boundaries
- div,title,head,p: the XML-elements that mark sentence boundaries
After this step your data should look like this.
<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
<s>
Alice NP Alice
's POS 's
adventures NNS adventure
in IN in
wonderland NN wonderland
</s>
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
<s>
Down IN down
the DT the
Rabbit-Hole NN rabbit-hole
</s>
</head>
<p>
<s>
ALICE NP Alice
was VBD be
beginning VVG begin
to TO to
get VV get
very RB very
tired JJ tired
of IN of
sitting VVG sit
by IN by
her PP$ her
sister NN sister
on IN on
the DT the
bank NN bank
, , ,
[...]
</s>
[...]
</p>
[...]
</text>
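For readers who want to see the logic of this step spelled out, here is a minimal Python sketch of sentence marking. It is not the Perl script used above and may differ from it in detail; it only assumes what the text describes: a sentence ends after a token whose pos-tag is in a given list (unless the word is excluded) and at any of the listed XML-elements. The function name `add_sentence_markers` and its parameters are invented for this sketch.

```python
import re

# Any opening or closing XML tag standing on its own line.
TAG = re.compile(r"^</?([A-Za-z0-9_]+)(\s[^>]*)?>$")


def add_sentence_markers(lines, sent_tags, boundary_elements, exclude_words=()):
    """Return the lines with <s> and </s> added (illustrative only)."""
    out = []
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        tag = TAG.match(line)
        if tag:
            # A listed XML element ends the current sentence, if there is one.
            if in_sentence and tag.group(1) in boundary_elements:
                out.append("</s>")
                in_sentence = False
            out.append(line)
            continue
        if not line.strip():
            continue  # skip empty lines
        if not in_sentence:
            out.append("<s>")
            in_sentence = True
        out.append(line)
        columns = line.split("\t")
        word = columns[0]
        pos = columns[1] if len(columns) > 1 else ""
        # A sentence-final pos-tag closes the sentence after this token.
        if pos in sent_tags and word not in exclude_words:
            out.append("</s>")
            in_sentence = False
    if in_sentence:
        out.append("</s>")
    return out


if __name__ == "__main__":
    tagged = ["<p>", "Alice\tNP\tAlice", "slept\tVVD\tsleep", ".\tSENT\t.",
              "She\tPP\tshe", "woke\tVVD\twake", "</p>"]
    print("\n".join(add_sentence_markers(tagged, {"SENT"}, {"p"})))
```

Opening the sentence lazily, only when the first token after a boundary arrives, avoids empty <s></s> pairs between consecutive XML-elements.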
Another example for a different tagset (STTS), excluding ; as sentence boundary marker (although ; is tagged as $.), looks as follows:
add-sent-markers.perl text.tagged text.tagged.vrt $. --sm head,title,p --ew ';'
The file is now ready for encoding in CQP and CQPweb, including installing the corpus as precompiled corpus in CQPweb.
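Before encoding, it can be worth checking the file against the restrictions listed earlier in this tutorial. The following Python sketch is one possible sanity check, not an official CWB or CQPweb tool: it assumes that each XML tag stands on its own line, and the function name `check_vrt` and its messages are made up. It checks element and attribute names, the position and uniqueness of id in <text>-elements, and the presence of <text>- and <s>-elements.

```python
import re

TAG = re.compile(r"^<(/?)([^\s>/]+)([^>]*)>$")
NAME_OK = re.compile(r"^[a-zA-Z0-9_]+$")
# name="value" or name='value'
ATTRIBUTE = re.compile(r"""([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def check_vrt(lines):
    """Return a list of (line_number, message); line 0 means the whole file."""
    problems = []
    seen_ids = set()
    has_text = has_s = False
    for number, raw in enumerate(lines, start=1):
        tag = TAG.match(raw.rstrip("\n"))
        if not tag or tag.group(1):
            continue  # token line or closing tag
        name, rest = tag.group(2), tag.group(3)
        if not NAME_OK.match(name):
            problems.append((number, f"element name {name!r} has invalid characters"))
        attributes = [
            (m.group(1), m.group(2) if m.group(2) is not None else m.group(3))
            for m in ATTRIBUTE.finditer(rest)
        ]
        for attr_name, _ in attributes:
            if not NAME_OK.match(attr_name):
                problems.append((number, f"attribute name {attr_name!r} has invalid characters"))
        if name == "s":
            has_s = True
        if name == "text":
            has_text = True
            if not attributes or attributes[0][0] != "id":
                problems.append((number, "id must be the first attribute of <text>"))
            elif attributes[0][1] in seen_ids:
                problems.append((number, f"duplicate id {attributes[0][1]!r}"))
            else:
                seen_ids.add(attributes[0][1])
    if not has_text:
        problems.append((0, "no <text>-element found"))
    if not has_s:
        problems.append((0, "no <s>-elements found"))
    return problems


if __name__ == "__main__":
    sample = ['<text title="Alice" id="alice">', "<s>", "Alice\tNP\tAlice", "</s>", "</text>"]
    for line_number, message in check_vrt(sample):
        print(line_number, message)
```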