Corpus Query with Regular Expressions: Introduction to CQPweb

In this tutorial we will learn how to query a corpus by formulating regular expressions in CQPweb.

Background

The Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the Corpus Query Processor (CQP), which lets you query language corpora efficiently using pattern matching with regular expressions.

CQPweb is the web-based graphical user interface (GUI) for some elements of the CWB, in particular the query processor CQP.

In this tutorial we will learn how to query corpora with CQP through its graphical user interface, using the CQPweb server at Saarland University.

CQPweb has two query language options: simple query and CQP syntax. In this tutorial we will use the CQP syntax only, because:

it is not really more complicated than the simple query
you have more options available
you can use the same queries on the command line

Getting started

you need to register with the CQPweb installation of the UdS. Be aware that every CQPweb installation has its own user management, so a user account from another CQPweb installation (e.g., the CQPweb at Lancaster) will not be valid here.
the registration process is mostly automatic; users are assigned rights based on their email address. To be granted the necessary rights, please use:
- your university email address or
- the email address you used for registration to a workshop/course
please also indicate your real name and affiliation (if applicable); this makes it easier to identify your account in case of problems
after registration you will get a confirmation email with a link to activate your account. If you do not receive this email, e.g. because it was caught by a spam filter, please contact the administrator.
there is also a video tutorial of the registration process by Andrew Hardie (Lancaster University)

Logging in and choosing a corpus

Once you have registered you can log in to the CQPweb installation of the UdS.

go to the CQPweb main page
enter your username and password
you will be directed to your account
to see which corpora are available to you, click on the link: click here to view your own corpus access privileges.
return to the CQPweb main page
you are now ready to select a corpus

CQPweb at Saarland University

The CQPweb at the Universität des Saarlandes provides access to a number of different corpora:

parallel and comparable corpora
English corpora
German corpora
scientific corpora
historical corpora
corpora compiled by/for students

On the CQPweb main page, you can get an overview of the different corpora available at UdS. Depending on copyright restrictions, not all corpora may be available to you.

Accessing a corpus

select a corpus by clicking on it, e.g.
- the Royal Society Corpus (RSC)
- the Brown corpus

Basic queries

The default Query mode in this CQPweb installation is Simple query. In this tutorial, we use CQP syntax. So, before we get started, we need to change Query mode from Simple query to CQP syntax!

A word (or, more precisely, a token) in CQP syntax is surrounded by square brackets [].
Inside these brackets you can specify the token more closely using an attribute-value syntax: [att="value"]

If you want to search for a word, e.g. the word word, we use the attribute word with the value word:

[word="word"]

As a result we get concordance lines for all instances of the word word in the corpus. Executed on the BLOB corpus, the concordance should look like this.

A concordance is an alphabetical listing of the words in a text, given together with the contexts in which they appear. The most common form of concordance today is the Keyword-in-Context (KWIC) index, in which each word is centered in a fixed-length field (e.g., 80 characters).
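The KWIC layout described above can be sketched in a few lines of Python. This is only an illustration of the display format, not how CQPweb builds its concordances; the function name kwic, the sample sentence and the context width of 30 characters are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=30):
    """Return one KWIC line per occurrence of `keyword` in `tokens`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            # Keep at most `width` characters of context on each side.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            # Pad both sides so that the keyword starts in the same column.
            lines.append(f"{left:>{width}}  {tok}  {right:<{width}}")
    return lines

# Hypothetical sample text, split into tokens on whitespace.
text = "a word is a unit of language and every word has a meaning".split()
for line in kwic(text, "word"):
    print(line)
```

Because both context fields are padded to the same width, the keyword lines up in one column, which is what makes a concordance easy to scan.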

For the attribute word - and only for this attribute - there is a second option, namely simply to enter the word in quotation marks:

"word"

The attribute word retrieves only the exact word form, i.e., if you want to look for the word word in its plural form words, the query would be:

[word="words"]

To look for both forms (word and words) at the same time, we have to use regular expressions, e.g. using the or-operator |:

[word="word|words"]

or making the s at the end optional using a question mark ?:

[word="words?"]
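To get a feeling for what these two expressions match, they can be tried outside CQPweb on a small list of word forms. The sketch below uses Python's re.fullmatch as a stand-in for CQP, on the assumption that a CQP regular expression has to match the whole token and is case-sensitive by default; the list of word forms is made up for the example.

```python
import re

def matching_forms(pattern, word_forms):
    """Return the word forms that the pattern matches as a whole."""
    return [w for w in word_forms if re.fullmatch(pattern, w)]

# Hypothetical word forms; not taken from any of the corpora.
forms = ["word", "words", "sword", "wording", "Word"]

print(matching_forms("word|words", forms))  # or-operator
print(matching_forms("words?", forms))      # optional final s
```

Both calls select only word and words: sword and wording are rejected because the pattern has to cover the entire token, and Word is rejected because of the capital letter.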

Another possibility, depending on the annotation of the corpus, is to use the attribute lemma (base form) with the value word. This gives us all word forms in the corpus that have the lemma (base form) word: word, words

[lemma="word"]

In order to look for all occurrences of a common noun in the singular (given your text is tagged with the UPenn tagset), we can use the following query:

[pos="NN"]

Note that different tagsets may be used for different corpora. The menu item Corpus Info usually tells you which tagset has been used for the pos annotation of a corpus.
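The attribute-value idea behind word, lemma and pos can be pictured as a small table with one row per token. The following Python sketch models each token as a dictionary and emulates a single-attribute query; the toy corpus, its tags and the helper name query are assumptions for illustration, and re.fullmatch again stands in for CQP's whole-value matching.

```python
import re

# A toy annotated corpus: one dictionary per token (hypothetical data).
corpus = [
    {"word": "The", "lemma": "the", "pos": "DT"},
    {"word": "words", "lemma": "word", "pos": "NNS"},
    {"word": "form", "lemma": "form", "pos": "VBP"},
    {"word": "a", "lemma": "a", "pos": "DT"},
    {"word": "word", "lemma": "word", "pos": "NN"},
]

def query(tokens, attribute, pattern):
    """Emulate [attribute="pattern"]: return the matching word forms."""
    return [t["word"] for t in tokens if re.fullmatch(pattern, t[attribute])]

print(query(corpus, "lemma", "word"))  # both word forms of the lemma
print(query(corpus, "pos", "NN"))      # singular common nouns only
```

Note that the pos query does not return words: its tag is NNS, and NN has to match the complete tag, not just its beginning.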

Quantifiers

We already learned that ? makes the preceding s optional in the query:

[word = "words?"] # final "s" is optional

Generally speaking, the ?-operator makes the preceding element optional (it may occur 0 or 1 times).

What happens if we use round brackets () to group two characters and put ? after the group?

[word="(re)?search"] # the sequence "re" is optional

Execute the query and use the Frequency breakdown to see which word forms are matched.

We could also use * (the Kleene-star) to make the final s in words “optional”.

[word = "words*"]

The Frequency breakdown shows that we get occurrences of word and words, the same as with the query using ?. Note, however, that * would also match forms such as wordss if they occurred in the corpus.

Let's replace the s with a . (dot).

[word="word.?"]
[word="word.*"]

Execute the two queries and use Frequency breakdown to see the difference!

Note:
. stands for any single character
Quantifiers specify the number of occurrences of the preceding element
? makes the preceding element optional (may occur 0 or 1 times)
* makes the preceding element optional, AND it may occur 0 to an unlimited number of times
+ makes the preceding element obligatory, AND it may occur 1 to an unlimited number of times
{n} the preceding element occurs exactly n times
{n,m} the preceding element may occur n to m times
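The quantifiers listed in the note can be compared side by side on a made-up list of word forms. This is a rough sketch, assuming that Python's re.fullmatch behaves like CQP's whole-token matching for these simple patterns; the helper name breakdown and the word forms are hypothetical, and a real Frequency breakdown depends on what the corpus actually contains.

```python
import re

def breakdown(pattern, word_forms):
    """Return the word forms that the pattern matches as a whole token."""
    return [w for w in word_forms if re.fullmatch(pattern, w)]

# Hypothetical word forms; a real corpus will have a different inventory.
forms = ["word", "words", "wordss", "wordy", "wording", "sword"]

for pattern in ["words?", "words*", "words+", "words{2}",
                "words{1,2}", "word.?", "word.*"]:
    print(f"{pattern:12} {breakdown(pattern, forms)}")

# A quantifier after a group applies to the whole group.
print(breakdown("(re)?search", ["search", "research", "researcher"]))
```

In this toy list, word.? accepts at most one extra character after word, while word.* accepts any number of them, so only the second pattern picks up wording and wordss.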

Characters and tokens

We already learned that . stands for any single character. But we can also define a character using character sets. Character sets are enclosed in [], just like tokens. Take a look at the following example.

[word="[whm]orse"]

We have two sets of square brackets:

one specifying a token, enclosing the attribute-value pair
one specifying a character inside the value of the expression, enclosing a character set [whm]

The character set can consist of a list of characters, e.g. [whm], or of a character range such as [a-z] (all lowercase letters from a to z) or [0-9] (all digits from 0 to 9).

[word="[a-zA-Z]orse"]

You can also negate a character set using ^ inside the square brackets.

[word="[^whm]orse"]
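The three character-set queries can be compared on a small, made-up list of word forms. As before, this is a sketch in Python with re.fullmatch standing in for CQP's whole-token, case-sensitive matching; the helper name breakdown and the word forms are assumptions.

```python
import re

def breakdown(pattern, word_forms):
    """Return the word forms that the pattern matches as a whole token."""
    return [w for w in word_forms if re.fullmatch(pattern, w)]

# Hypothetical word forms ending in "orse".
forms = ["worse", "horse", "morse", "Morse", "Norse", "gorse", "remorse"]

print(breakdown("[whm]orse", forms))     # w, h or m, then "orse"
print(breakdown("[a-zA-Z]orse", forms))  # any single letter, then "orse"
print(breakdown("[^whm]orse", forms))    # any single character except w, h, m
```

A character set always stands for exactly one character, which is why remorse is never matched here. The negated set [^whm] still accepts Morse, because the uppercase M is a different character from the lowercase m.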

Operators

Patterns and restrictions can be combined using operators:
() group elements
| stands for “or”
& stands for “and”
! negates a pattern or restriction (note the difference to the negation of character sets ^)

Try out the following queries and check the results using Frequency breakdown:

[word="im(portant|possible)"]
[lemma="research|investigation"]
[lemma="research" & pos="NN"]
[pos!="V.*|N.*"]
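The logic of the four queries can also be traced on a toy token list. The Python sketch below is an emulation under stated assumptions: tokens are modelled as dictionaries, re.fullmatch stands in for CQP's whole-value matching, & is rendered as Python's and, != as a negated match, and the token data and helper names (has, words) are invented for the example.

```python
import re

# Hypothetical annotated tokens (word form, lemma, part-of-speech tag).
tokens = [
    {"word": "research", "lemma": "research", "pos": "NN"},
    {"word": "researched", "lemma": "research", "pos": "VBD"},
    {"word": "investigations", "lemma": "investigation", "pos": "NNS"},
    {"word": "important", "lemma": "important", "pos": "JJ"},
    {"word": "impossible", "lemma": "impossible", "pos": "JJ"},
    {"word": "imagine", "lemma": "imagine", "pos": "VB"},
]

def has(token, attribute, pattern):
    """True if the attribute value as a whole matches the pattern."""
    return re.fullmatch(pattern, token[attribute]) is not None

def words(selected):
    return [t["word"] for t in selected]

# [word="im(portant|possible)"]
q1 = [t for t in tokens if has(t, "word", "im(portant|possible)")]
# [lemma="research|investigation"]
q2 = [t for t in tokens if has(t, "lemma", "research|investigation")]
# [lemma="research" & pos="NN"]
q3 = [t for t in tokens if has(t, "lemma", "research") and has(t, "pos", "NN")]
# [pos!="V.*|N.*"]
q4 = [t for t in tokens if not has(t, "pos", "V.*|N.*")]

for q in (q1, q2, q3, q4):
    print(words(q))
```

The third query keeps only the noun reading of research, and the last one returns every token whose tag starts with neither V nor N, here the two adjectives.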