In this tutorial we will learn how to query a corpus by formulating regular expressions in CQPweb.

Background

The Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the Corpus Query Processor (CQP), which makes it possible to query language corpora efficiently using pattern matching with regular expressions.

CQPweb is the web-based graphical user interface (GUI) for some elements of the CWB, in particular the query processor CQP.

In this tutorial we will learn how to query corpora with CQP through its graphical user interface, using the CQPweb server at Saarland University.

CQPweb has two query language options: simple query and CQP syntax. In this tutorial we will use the CQP syntax only.

Getting started

Logging in and choosing a corpus

Once you have registered, you can log in to the CQPweb installation of the UdS.

  • go to the CQPweb main page
  • enter your username and password
  • you will be directed to your account
  • to see which corpora are available to you, click on the link "click here to view your own corpus access privileges"
  • return to the CQPweb main page
  • you are now ready to select a corpus

CQPweb at Saarland University

The CQPweb at the Universität des Saarlandes provides access to a number of different corpora:

  • parallel and comparable corpora
  • English corpora
  • German corpora
  • scientific corpora
  • historical corpora
  • corpora compiled by/for students

On the CQPweb main page, you can get an overview of the different corpora available at UdS. Because of copyright restrictions, some corpora may not be available to you.

Accessing a corpus

Basic queries

The standard setting in this CQPweb installation is Simple query as Query mode. In this tutorial, we use CQP syntax. So, before we get started, we need to change Query mode from Simple query to CQP syntax!

A word (or, more precisely, a token) in CQP syntax is surrounded by square brackets [].
Inside these brackets you can specify the token more closely using an attribute-value syntax: [att="value"]

If you want to search for a word, e.g. the word word, we use the attribute word with the value word:

[word="word"]

As a result we get concordance lines for all instances of the word word in the corpus. Executed on the BLOB corpus, the concordance should look like this.

A concordance is an alphabetical listing of the words in a text, given together with the contexts in which they appear. The most common form of concordance today is the Keyword-in-Context (KWIC) index, in which each word is centered in a fixed-length field (e.g., 80 characters).
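The KWIC layout can be imitated in a few lines of Python. This is only a rough sketch for illustration, not how CQPweb builds its concordance: the function name `kwic`, the `width` parameter, and the example sentence are made up, and the context is cut by characters rather than by tokens.

```python
def kwic(tokens, keyword, width=20):
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Keep at most `width` characters of context on each side.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            # Right-align the left context so the keyword starts in the same column.
            lines.append(f"{left:>{width}}  [{token}]  {right}")
    return lines


# Made-up example text (assumption), split into tokens at whitespace.
text = "a word is a word and every word counts".split()
for line in kwic(text, "word"):
    print(line)
```

Each output line shows one hit with the keyword aligned in the same column, which is what makes a concordance easy to scan.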

For the attribute word - and only for this attribute - there is a second option, namely simply to enter the word in quotation marks:

"word"

The attribute word retrieves only the exact word form, i.e., if you want to look for the word word in its plural form words, the query would be:

[word="words"]

To look for both forms (word and words) at the same time, we have to use regular expressions, e.g. using the or-operator |:

[word="word|words"]

or making the s at the end optional using a question mark ?:

[word="words?"]
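To check that the two patterns select the same word forms, the matching can be imitated outside CQPweb. The following is a minimal Python sketch, not part of CQP: it assumes that the regular expression has to match the whole token (approximated here with `re.fullmatch`), and the word list and the helper `matching_forms` are made up for illustration.

```python
import re

# Made-up word forms standing in for the tokens of a corpus (assumption).
TOKENS = ["word", "words", "wordy", "sword"]


def matching_forms(pattern, tokens=TOKENS):
    """Return the tokens that the pattern matches as a whole."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


print(matching_forms("word|words"))  # alternation with the or-operator
print(matching_forms("words?"))      # optional final s
```

Both calls return only word and words. Forms such as wordy or sword are not returned, because the pattern has to cover the complete token.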

Another possibility, depending on the annotation of the corpus, is to use the attribute lemma (base form) with the value word. This gives us all word forms in the corpus that have the lemma (base form) word: word, words

[lemma="word"]

In order to look for all occurrences of a common noun in the singular (given your text is tagged with the UPenn tagset), we can use the following query:

[pos="NN"]

Note that different tagsets may be used for different corpora. The menu item Corpus Info usually tells you which tagset has been used for the pos annotation of a corpus.
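One way to picture the attribute-value syntax is to think of each token as a small record with one value per attribute. The Python sketch below only illustrates this idea. The `Token` class, the `query` helper, and the hand-annotated sentence are assumptions made for this example, not CQP internals.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    word: str   # word form as it appears in the text
    lemma: str  # base form
    pos: str    # part-of-speech tag


# Hand-annotated toy sentence with Penn-style tags (made up for illustration).
SENTENCE = [
    Token("These", "this", "DT"),
    Token("words", "word", "NNS"),
    Token("need", "need", "VBP"),
    Token("a", "a", "DT"),
    Token("word", "word", "NN"),
]


def query(attribute, pattern, tokens=SENTENCE):
    """Return the word forms whose attribute value matches the pattern as a whole."""
    return [t.word for t in tokens if re.fullmatch(pattern, getattr(t, attribute))]


print(query("word", "word"))   # exact word form only
print(query("lemma", "word"))  # all word forms of the lemma
print(query("pos", "NN"))      # singular common nouns only
```

The plural words is found through the lemma attribute but not through pos="NN", because in this toy annotation it carries the tag NNS.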

Quantifiers

We already learned that ? makes the preceding s optional in the query:

[word = "words?"] # final "s" is optional

Generally speaking, the ?-operator makes the preceding element optional (it may occur 0 or 1 times).

What happens if we use round brackets () to group two characters and put ? after the group?

[word="(re)?search"] # the sequence "re" is optional

Execute the query and use the Frequency breakdown to see which word forms are matched.

We could also use * (the Kleene star) to make the final s in words “optional”.

[word = "words*"]

The Frequency breakdown shows that we get occurrences of word and words, the same as with the query using ?.

Let's replace the s with a dot (.).

[word="word.?"]
[word="word.*"]

Execute the two queries and use Frequency breakdown to see the difference!

Note:
. stands for any single character
Quantifiers specify the number of occurrences of the preceding element
? makes the preceding element optional (it may occur 0 or 1 times)
* makes the preceding element optional, AND it may occur any number of times (0 or more)
+ makes the preceding element obligatory, AND it may occur any number of times (1 or more)
{n} the preceding element must occur exactly n times
{n,m} the preceding element may occur n to m times
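The effect of the quantifiers can be compared side by side on a small word list. This is a hedged Python sketch, not CQP itself: it assumes whole-token matching (approximated with `re.fullmatch`), and the word forms, including the artificial wordsss, are made up to make the differences visible.

```python
import re

# Made-up word forms (assumption); a real corpus will contain other forms.
FORMS = ["word", "words", "wordy", "wording", "wordsss"]


def matching_forms(pattern, forms=FORMS):
    """Return the forms that the pattern matches as a whole."""
    return [f for f in forms if re.fullmatch(pattern, f)]


patterns = ["words?", "words*", "words+", "word.?", "word.*", "word.{3}", "word.{1,3}"]
for pattern in patterns:
    # Print each pattern next to the forms it selects.
    print(pattern, matching_forms(pattern))
```

In this list, words? and words* differ only in the artificial form wordsss, which is why the two queries can return the same results on a real corpus.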

Characters and tokens

We already learned that . stands for any single character. But we can also define a character using character sets. Character sets are enclosed in [], the same as tokens. Take a look at the following example.

[word="[whm]orse"]

We have two sets of square brackets:

  • one specifying a token, enclosing the attribute-value pair
  • one specifying a character inside the value of the expression, enclosing a character set [whm]

The character set can consist of a list of characters, e.g. [whm], or of a character range such as [a-z] (all lowercase letters from a to z) or [0-9] (all digits from 0 to 9).

[word="[a-zA-Z]orse"]

You can also negate a character set using ^ inside the square brackets.

[word="[^whm]orse"]
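The three character-set queries above can be tried on a small word list. Again, this is only a Python sketch under the assumption that the pattern has to match the whole token (`re.fullmatch`). The word list and the helper name are made up.

```python
import re

# Made-up word forms (assumption); "remorse" is included to show that
# a character set stands for exactly one character.
FORMS = ["horse", "worse", "morse", "Norse", "gorse", "remorse"]


def matching_forms(pattern, forms=FORMS):
    """Return the forms that the pattern matches as a whole."""
    return [f for f in forms if re.fullmatch(pattern, f)]


print(matching_forms("[whm]orse"))     # one of w, h, m, followed by orse
print(matching_forms("[a-zA-Z]orse"))  # any single letter, followed by orse
print(matching_forms("[^whm]orse"))    # any single character except w, h, m
```

None of the three patterns returns remorse, because each character set fills exactly one position before orse.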

Operators

Patterns and restrictions can be combined using operators:
() group elements
| stands for “or”
& stands for “and”
! negates a pattern or restriction (note the difference to the negation of character sets ^)

Try out the following queries and check the results using Frequency breakdown:

[word="im(portant|possible)"]
[lemma="research|investigation"]
[lemma="research" & pos="NN"]
[pos!="V.*|N.*"]
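The logic of these four queries can be spelled out as conditions on the attributes of each token. The Python sketch below is an illustration under stated assumptions, not CQP's implementation: tokens are made-up (word, lemma, pos) triples, `matches` approximates whole-value matching with `re.fullmatch`, and `words_where` is a hypothetical helper.

```python
import re

# Toy tokens as (word, lemma, pos) triples; the annotation is made up.
TOKENS = [
    ("important", "important", "JJ"),
    ("impossible", "impossible", "JJ"),
    ("research", "research", "NN"),
    ("researched", "research", "VBD"),
    ("investigations", "investigation", "NNS"),
    ("the", "the", "DT"),
]


def matches(pattern, value):
    """True if the pattern matches the whole attribute value."""
    return re.fullmatch(pattern, value) is not None


def words_where(condition):
    """Return the word forms of all tokens that satisfy the condition."""
    return [word for (word, lemma, pos) in TOKENS if condition(word, lemma, pos)]


# [word="im(portant|possible)"]: grouping with () and alternation with |
print(words_where(lambda w, l, p: matches("im(portant|possible)", w)))
# [lemma="research|investigation"]: either lemma
print(words_where(lambda w, l, p: matches("research|investigation", l)))
# [lemma="research" & pos="NN"]: both restrictions must hold for the same token
print(words_where(lambda w, l, p: matches("research", l) and matches("NN", p)))
# [pos!="V.*|N.*"]: the pos tag must NOT match the pattern
print(words_where(lambda w, l, p: not matches("V.*|N.*", p)))
```

Note that & combines restrictions on one and the same token, while | inside the quotation marks is part of the regular expression for a single attribute value.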