How to create valid VRT files for encoding in CQP

In this tutorial you will learn how to use a simple markup for corpus data: the vertical text format (VRT).

This tutorial assumes that:

you have already collected the texts for your corpus
the data is in plain text format (ASCII or UTF-8; a Word document is NOT plain text)

The goal is to get you acquainted with the VRT format.

The VRT format

The VRT-format is the input format for the Corpus Workbench (CWB), which allows you to encode and query corpora efficiently using the command-line tool CQP (Corpus Query Processor) or its web-based GUI CQPweb.
The VRT-format is simple and can easily be processed and transformed. It combines a one-word-per-line (vertical) format with simple XML markup.

Annotation on token level is represented in a one-word-per-line, TAB-delimited format, where each column represents a different annotation (e.g. word, part-of-speech, lemma). These attributes are also referred to as positional attributes. An example is given below.

ALICE   NP  Alice
was  VBD    be
beginning VVG   begin
to  TO  to
get VV  get
very    RB  very
tired   JJ  tired
of  IN  of
sitting VVG sit
by  IN  by
her PP$ her
sister  NN  sister
on  IN  on
the DT  the
bank    NN  bank
,   ,   ,
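Reading such a file programmatically is straightforward. The following Python sketch is an illustration only: it assumes the three columns of the example above (word, part-of-speech, lemma) separated by real TAB characters, and the names Token and parse_token_line are made up for this tutorial.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_token_line(line):
    """Split one TAB-delimited token line into its positional attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 TAB-separated columns, got {len(fields)}: {line!r}")
    return Token(*fields)


sample = "ALICE\tNP\tAlice\nwas\tVBD\tbe\nbeginning\tVVG\tbegin\n"
tokens = [parse_token_line(line) for line in sample.splitlines()]
print(tokens[1].lemma)
```

A line whose columns are separated by spaces instead of TABs raises a ValueError, which is a common problem after copying data between editors.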

Annotation beyond token level, spanning a sequence of tokens or whole sections of the text (e.g. titles, sentences, paragraphs, but also named entities or phrases), is represented using XML-tags (see also Short Introduction to XML). These annotations are referred to as structural attributes.
The following is an excerpt from a VRT-file including positional and structural attributes.

<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
<s>
Alice   NP  Alice
's  POS 's
adventures  NNS adventure
in  IN  in
wonderland  NN  wonderland
</s>
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
<s>
Down    IN  down
the DT  the
Rabbit-Hole NN  rabbit-hole
</s>
</head>
<p>
<s>
ALICE   NP  Alice
was VBD be
beginning   VVG begin
to  TO  to
get VV  get
very    RB  very
tired   JJ  tired
of  IN  of
sitting VVG sit
by  IN  by
her PP$ her
sister  NN  sister
on  IN  on
the DT  the
bank    NN  bank
,   ,   ,
[...]
</s>
[...]
</p>
[...]
</text>
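Since every structural attribute sits on a line of its own, a file like the one above can be checked with very little code. The following Python sketch is an illustration only: the function name check_nesting is made up for this tutorial, and it assumes that elements should be properly nested, which is a stricter requirement than anything stated here (CWB itself may be more tolerant).

```python
import re

# A line that consists of exactly one opening or closing tag.
TAG = re.compile(r"^<(/?)([A-Za-z0-9_]+)(\s[^<>]*)?>$")


def check_nesting(lines):
    """Return a list of problems found in the nesting of structural attributes."""
    problems, stack = [], []
    for number, line in enumerate(lines, start=1):
        match = TAG.match(line.rstrip("\n"))
        if not match:
            continue  # token line (positional attributes); not checked here
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"line {number}: unexpected </{name}>")
    problems.extend(f"<{name}> is never closed" for name in stack)
    return problems


vrt = ['<text id="t1">', "<s>", "Alice\tNP\tAlice", "</s>", "</text>"]
print(check_nesting(vrt))
```

An empty list means that every opened element was closed in the right order.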

Restrictions

There are some restrictions with respect to the naming of elements and attribute-value pairs in structural attributes, some of which are specific to CQPweb:

element and attribute names may only consist of the characters [a-zA-Z0-9_], i.e. no white space, dashes or diacritics, to name just the most frequent sources of error
values are in principle free text
- BUT ATTENTION: " signals the end of a value, so it cannot be part of a value (you can use single quotes instead).
XML-elements have to be on a single, separate line; XML-elements spanning more than one line cannot be identified properly and lead to errors in the processing
angle brackets (<>) in token-level annotation may also lead to errors in the processing, as they might be misinterpreted as a broken XML-element
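The naming restrictions listed above can be checked mechanically. The following Python sketch is one possible way to do so and not part of CWB or CQPweb; the function name check_tag_line and the wording of the messages are made up for this tutorial. It looks at a single line that is supposed to hold one opening tag.

```python
import re

NAME = re.compile(r"[A-Za-z0-9_]+")
ATTRIBUTE = re.compile(r'\s+([^\s="]+)="([^"]*)"')


def check_tag_line(line):
    """Check one line holding an opening tag against the naming restrictions."""
    line = line.strip()
    if not (line.startswith("<") and line.endswith(">")):
        return ["the tag does not start and end on this line"]
    inner = line[1:-1].strip()
    element = inner.split(None, 1)[0] if inner else ""
    problems = []
    if not NAME.fullmatch(element):
        problems.append(f"invalid element name: {element!r}")
    rest = inner[len(element):]
    for name, _value in ATTRIBUTE.findall(rest):
        if not NAME.fullmatch(name):
            problems.append(f"invalid attribute name: {name!r}")
    # Whatever is left after removing well-formed pairs should not be there.
    if ATTRIBUTE.sub("", rest).strip():
        problems.append("malformed attribute part (stray double quote?)")
    return problems


print(check_tag_line('<div chapter-title="Down">'))
```

A double quote inside a value, as in title="the "best" book", is reported as a malformed attribute part, because the second quote ends the value early.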

CQPweb specific:

there must be at least one <text>-element with an obligatory unique identifier (ID) as attribute-value pair.
the ID has to be stored in the attribute id (in lower-case letters), which has to be the first attribute-value pair of the <text>-element
attribute-value pairs that you want to use in the frequency distribution implemented in CQPweb (e.g. information on register, year) may consist only of the characters [a-zA-Z0-9_], i.e. again no white space, dashes or diacritics (among others)
for importing pre-encoded corpora, <s>-elements are required
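The requirements on the <text>-element can be verified before a corpus is handed to CQPweb. The following Python sketch is an illustration under the rules listed above; the function name check_text_ids is hypothetical and the check is not part of CQPweb.

```python
import re

TEXT_TAG = re.compile(r"^<text[\s>]")
FIRST_ATTRIBUTE = re.compile(r'^<text\s+([^\s=]+)="([^"]*)"')


def check_text_ids(lines):
    """Check that every <text> starts with a unique id attribute."""
    problems, seen, texts = [], set(), 0
    for number, line in enumerate(lines, start=1):
        if not TEXT_TAG.match(line):
            continue
        texts += 1
        first = FIRST_ATTRIBUTE.match(line)
        if first is None or first.group(1) != "id":
            problems.append(f"line {number}: the first attribute of <text> is not id")
            continue
        if first.group(2) in seen:
            problems.append(f"line {number}: duplicate id {first.group(2)!r}")
        seen.add(first.group(2))
    if texts == 0:
        problems.append("no <text>-element found")
    return problems


print(check_text_ids(['<text title="Alice" id="text_01">', "</text>"]))
```

The example call reports a problem because id is present but is not the first attribute.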

What to annotate and how?

Before you actually start with the annotation, it is a good idea to think about the (linguistic) information you want to annotate and how it is best represented. As said before, the VRT-format provides two different attribute types:

positional attributes (for token level annotation)
structural attributes (for annotation beyond the token level)

The annotation itself can be divided into:

meta-data (information about the texts)
structural information (information about structural characteristics of the texts)
linguistic annotation

What you actually annotate depends on:

what you need for your research
what is available
how much effort you want to invest

Some thoughts on meta-data

Meta-data is like the ID card of your texts. It is essential for grouping texts into subgroups (subcorpora), e.g. according to time of publication/utterance, register/genre, language, author, … It also helps you to find particular texts in a large collection and to retrieve the source of a text. The meta-data you annotate for your corpus depends on

the purpose of your study, e.g. diachronic investigations require the time of publication, and for register variation you need to know the register of your texts
what is provided to you as meta-data

Typical meta-data for a corpus include:

author
title
time of publication (year, decade, century, …)
register (academic, fiction, news, …)
mode (spoken, written)
language (DE, EN, FR, …)
place of publication (journal, website, …)

In the VRT-format, meta-data information is represented as attribute-value-pairs of the <text>-element.

Structural information

Information related to word sequences or text sections is represented as XML-elements within the text. Typical kinds of structural information are

sentences (<s></s>)
paragraphs (<p></p>)
title (<title></title>) - Title information in the meta-data and marking of the title within the text are two different things.
headline (<head></head>)

Linguistic annotation

Linguistic annotation can include annotation on

token level: typically word, lemma, part-of-speech
structural level (word sequences): named entities, phrases (noun phrases, prepositional phrases)

There are many tools for linguistic annotation such as part-of-speech taggers, named entity recognizers, syntactic or semantic parsers, …
However, the more complex the linguistic annotation, the more error prone a tool usually is.

Part-of-speech (pos) taggers, which perform basic annotation on the token level (word, lemma, part-of-speech), are relatively reliable (accuracy: 95-98% for languages such as English and German). For many investigations this basic annotation suffices.

Preparing your data - Step-by-Step

Step 0: Housekeeping

The reason why there is no ready-made tool for data preparation is that each data source is different. Thus, the first step is really to look at the data, to get a feeling for its quality and for the information it contains.

Check for noise: OCR errors, non-text items (e.g. markup).
Not all non-text items are noise. Some markup might capture information you want to keep (e.g. paragraphs, meta-data information).
Preprocessing: Get rid of all items in your data that you do not want to keep. Depending on the size of your data this process may be performed manually or (semi-)automatically. In the latter case you will need some (basic) programming skills. But there is also a lot you can do with search and replace.
Keep in mind that you need plain text. Thus, Word is NOT a good editor. If you edit manually, use a plain text editor such as Notepad++ or Emacs.
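Some of the housekeeping can be automated. The following Python sketch is a minimal example and only one of many possible clean-up steps: it assumes that the data is meant to be UTF-8, and the function name load_plain_text is made up for this tutorial. It fails loudly on input that is not valid UTF-8, unifies line breaks and removes control characters other than line breaks and TABs.

```python
import unicodedata


def load_plain_text(raw):
    """Decode bytes as UTF-8 and tidy up characters that often cause trouble."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError for non-UTF-8 input
    text = text.lstrip("\ufeff")  # drop a byte-order mark, if present
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove control characters, but keep line breaks and TABs.
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


cleaned = load_plain_text(b"\xef\xbb\xbfAlice\r\nwas\x00 tired\r")
print(repr(cleaned))
```

To apply it to a file, read the file in binary mode (open(path, "rb")) and pass the bytes to the function.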

Step 1: Text level

You can store all your texts in one file or use a separate file for each text. However, if you want to add metadata for each text in your corpus (e.g. in order to create subcorpora), automatic processing is easier if each text is in a separate file. To add metadata for each text you need to

enclose each text in a <text>-element
add meta-data information as attribute-value pairs

In any case you need at least one <text>-element for the installation of your corpus in CQPweb.

A schematic example is given below:

<text id="text_01" title="Title of first text" author="Firstname Lastname" year="2000" register="fiction" language="en">
...
</text>

Attention:

each text has to have a unique ID stored in the attribute id
the attribute name id has to be in lower case and has to be the first attribute in the <text>-element
all attribute names are limited to the characters [a-zA-Z0-9_] (no white space, dashes, diacritics, …)
the same limitation holds for attribute-values that you want to use for sorting in CQPweb
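If the metadata is available in machine-readable form, the opening <text>-tag can be generated automatically. The following Python sketch is an illustration under the rules above; the function name text_open_tag is hypothetical. Restricting the ID itself to [a-zA-Z0-9_] and replacing double quotes by single quotes are cautious choices made for this sketch, not documented requirements.

```python
import re

NAME = re.compile(r"[A-Za-z0-9_]+")


def text_open_tag(text_id, metadata):
    """Build the opening <text> tag with id as the first attribute."""
    if not NAME.fullmatch(text_id):
        raise ValueError(f"invalid id: {text_id!r}")
    if "id" in metadata:
        raise ValueError("pass the id separately, not as part of the metadata")
    parts = [f'id="{text_id}"']
    for name, value in metadata.items():
        if not NAME.fullmatch(name):
            raise ValueError(f"invalid attribute name: {name!r}")
        # Keep the tag on one line; a double quote would end the value early.
        value = " ".join(str(value).split()).replace('"', "'")
        parts.append(f'{name}="{value}"')
    return "<text " + " ".join(parts) + ">"


print(text_open_tag("text_01", {"title": 'The "best" book', "year": 2000}))
```

The attributes appear in the order in which they are stored in the dictionary, with id always first.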

Step 2: Add structural information

Structural information (e.g. title, paragraphs, headlines, divisions) may be inherent in the text:

in explicit markup (e.g. HTML markup)
in the visual layout (e.g. empty lines separating paragraphs, title at the beginning of the text)

Depending on your research interest, you might want to keep some of the explicit markup or add markup for structural information manually or with a dedicated script.

Explicit markup might need some transformation to meet the requirements of the XML-markup in VRT-files. For a list of typical structural information see the paragraph on structural information above.

Sentences are a special case. If not already marked explicitly in the text, sentence boundaries may be derived automatically from pos-tagging using a dedicated script.

An example of a corpus with structural information is given below.

<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
Alice's adventures in wonderland
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
Down the Rabbit-Hole
</head>
<p>
ALICE was beginning to get very
tired of sitting by her sister on
the bank, and of having nothing
to do: once or twice she had
peeped into the book her sister was reading,
but it had no pictures or conversations in
it, "and what is the use of a book," thought
Alice, "without pictures or conversations?"
</p>
...
</div>
...
</text>
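Structure that is only visible in the layout can often be made explicit with a short script. The following Python sketch is a minimal example for one such case, paragraphs separated by empty lines; the function name mark_paragraphs is made up for this tutorial, and the sketch does not deal with angle brackets that may occur in the text itself.

```python
def mark_paragraphs(text):
    """Wrap blocks of text separated by empty lines in <p>-elements."""
    output, block = [], []
    for line in text.splitlines() + [""]:  # the extra "" flushes the last block
        if line.strip():
            block.append(line.strip())
        elif block:
            output.extend(["<p>", *block, "</p>"])
            block = []
    return "".join(line + "\n" for line in output)


print(mark_paragraphs("First line\nsecond line\n\n\nNext paragraph\n"), end="")
```

Each tag is written on a line of its own, as required for XML-elements in VRT-files.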

Step 3: Tokenizing and part-of-speech tagging

You might have realized that the texts are not yet in the one-word-per-line format.
We will use a tokenizer to split the text into tokens and a part-of-speech tagger to assign lemmas and part-of-speech tags.

The TreeTagger is easy to use and produces the output required for the VRT-format. See Tagging with TreeTagger for more information on how to use the TreeTagger.

After this step, your data should look similar to this:

<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
Alice   NP  Alice
's  POS 's
adventures  NNS adventure
in  IN  in
wonderland  NN  wonderland
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
Down    IN  down
the DT  the
Rabbit-Hole NN  rabbit-hole
</head>
<p>
ALICE   NP  Alice
was VBD be
beginning   VVG begin
to  TO  to
get VV  get
very    RB  very
tired   JJ  tired
of  IN  of
sitting VVG sit
by  IN  by
her PP$ her
sister  NN  sister
on  IN  on
the DT  the
bank    NN  bank
,   ,   ,
[...]
[...]
</p>
[...]
</text>

This is already valid VRT, both for command-line CQP and for importing the VRT file into CQPweb.
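Before going on, it can be worth checking that the tagger output is consistent throughout the file. The following Python sketch is a heuristic illustration; the function name check_columns is made up for this tutorial. It treats every line that starts with < and ends with > as an XML-element and expects all other lines to have the same number of TAB-separated columns as the first token line.

```python
def check_columns(lines):
    """Report token lines whose column count differs from the first token line."""
    expected, problems = None, []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("<") and line.endswith(">"):
            continue  # structural attribute (XML-element)
        columns = len(line.split("\t"))
        if expected is None:
            expected = columns
        elif columns != expected:
            problems.append(f"line {number}: {columns} column(s), expected {expected}")
    return problems


tagged = ["<p>", "ALICE\tNP\tAlice", "was\tVBD\tbe", "tired JJ tired", "</p>"]
print(check_columns(tagged))
```

The fourth line of the example is reported because its columns are separated by spaces rather than TABs.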

Step 4: Adding sentence markers

However, we do not yet have sentence markers. Sentence boundaries are indicated by pos-tags (e.g. SENT in the Penn Treebank tagset as used by the TreeTagger) and by annotated XML-elements (e.g. p for paragraph, or title; option --sentmark|sm). A dedicated script (add-sent-markers.perl) uses these to add sentence markers to the corpus. Additionally, the script can add a <text>-element with a given ID at the beginning and end of the file (option --textmark|tm).

 Usage: add-sent-markers.perl <infile> <outputfile> <sentdeliminator>
    [--sentmark|sm=s] list of xml-elements for sentence marking
    [--textmark|tm=s] creates a  text-element with given id surrounding enclosing the content of the file
    [--excludeword|ew=s] exclude list of word as sentence markers

The script takes the tagged file, the desired output file and a comma-separated list of pos-tags marking sentence boundaries as input.

An example is given below:

add-sent-markers.perl alice.tagged alice.tagged.vrt SENT --sm div,title,head,p

Input file: alice.tagged
Output file: alice.tagged.vrt
the pos-tag marking sentence boundaries is SENT
in addition, the XML-elements div, title, head and p mark sentence boundaries

After this step your data should look like this:

<text title="Alice's adventures in wonderland" author="Lewis Carroll">
<title>
<s>
Alice   NP  Alice
's  POS 's
adventures  NNS adventure
in  IN  in
wonderland  NN  wonderland
</s>
</title>
<div id="chpt 1" title="Down the Rabbit-Hole">
<head>
<s>
Down    IN  down
the DT  the
Rabbit-Hole NN  rabbit-hole
</s>
</head>
<p>
<s>
ALICE   NP  Alice
was VBD be
beginning   VVG begin
to  TO  to
get VV  get
very    RB  very
tired   JJ  tired
of  IN  of
sitting VVG sit
by  IN  by
her PP$ her
sister  NN  sister
on  IN  on
the DT  the
bank    NN  bank
,   ,   ,
[...]
</s>
[...]
</p>
[...]
</text>

Another example, for a different tagset (STTS), excludes ; as a sentence boundary marker (although ; is tagged as $.):

add-sent-markers.perl text.tagged text.tagged.vrt $. --sm head,title,p --ew ';'

The file is now ready for encoding in CQP and CQPweb, including installing the corpus as a precompiled corpus in CQPweb.
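To make the idea behind Step 4 more concrete, the following Python sketch re-implements the basic logic in a simplified form. It is NOT the add-sent-markers.perl script, whose actual behaviour may differ; the function name add_sentence_markers and its parameters are made up for this tutorial. The sketch assumes that the pos-tag is in the second column, that a sentence ends after a token with a sentence-final pos-tag (unless the word is excluded), and that any tag of an element listed in sentmark also ends an open sentence.

```python
import re

TAG = re.compile(r"^</?([A-Za-z0-9_]+)[^<>]*>$")


def add_sentence_markers(lines, sent_tags, sentmark=(), exclude_words=()):
    """Yield the lines with <s> and </s> inserted around sentences."""
    open_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        tag = TAG.match(line)
        if tag:
            # An element listed in sentmark ends any sentence still open.
            if open_sentence and tag.group(1) in sentmark:
                yield "</s>"
                open_sentence = False
            yield line
            continue
        if not open_sentence:
            yield "<s>"
            open_sentence = True
        yield line
        fields = line.split("\t")
        pos = fields[1] if len(fields) > 1 else ""
        if pos in sent_tags and fields[0] not in exclude_words:
            yield "</s>"
            open_sentence = False
    if open_sentence:
        yield "</s>"


tagged = ["<p>", "Alice\tNP\tAlice", "slept\tVVD\tsleep", ".\tSENT\t.",
          "She\tPP\tshe", "woke\tVVD\twake", "</p>"]
print("\n".join(add_sentence_markers(tagged, {"SENT"}, sentmark={"p"})))
```

In the example, the first sentence is closed by the SENT tag and the second one by the closing </p>, which corresponds to the two kinds of boundary markers described in Step 4.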