In this tutorial we will learn how to query a corpus using CQPweb.

Background

The Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the Corpus Query Processor (CQP), which allows you to query language corpora efficiently using pattern matching with regular expressions.

CQPweb is the web-based graphical user interface (GUI) for some elements of the CWB, in particular the query processor CQP.

In this tutorial we will learn how to query corpora with CQP through its graphical user interface, using the CQPweb server at the Universität des Saarlandes.

CQPweb has two query language options: simple query and CQP syntax. In this tutorial we will use the CQP syntax only.

Getting started

Logging in and choosing a corpus

Once you have registered, you can log in to the CQPweb installation of the UdS.

  • go to the CQPweb main page
  • enter your username and password
  • you will be directed to your account
  • to see which corpora are available to you, click on "click here to view your own corpus access privileges"
  • return to the CQPweb main page
  • you are now ready to select a corpus

CQPweb at the Universität des Saarlandes

The CQPweb at the Universität des Saarlandes provides access to a number of different corpora:

  • parallel and comparable corpora
  • English corpora
  • German corpora
  • scientific corpora
  • historical corpora
  • corpora compiled by/for students

On the CQPweb main page, you can get an overview of the different corpora available at UdS. Depending on copyright restrictions, some corpora may not be available to you.

Accessing a corpus

Basic queries

A word (or, more precisely, a token) in CQP syntax is enclosed in square brackets: [].
Inside these brackets you can specify the token more closely using an attribute-value syntax: [att="value"]

If you want to search for a word, e.g. the word word, you use the attribute word with the value word:

[word="word"]

As a result, we get concordance lines for all instances of the word word in the corpus, for example when the query is executed on the BLOB corpus.

A concordance is an alphabetical listing of the words in a text, given together with the contexts in which they appear. The most common form of concordance today is the Keyword-in-Context (KWIC) index, in which each word is centered in a fixed-length field (e.g., 80 characters).
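To make the KWIC layout concrete, here is a minimal Python sketch. It is not how CQPweb builds its concordance display; the function name kwic, the sample sentence and the width parameter are assumptions made for illustration only.

```python
def kwic(tokens, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Keep at most `width` characters of context on each side.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            # Right-align the left context so the keywords line up.
            lines.append(f"{left:>{width}}  {token}  {right:<{width}}")
    return lines


# Hypothetical sample text, split on whitespace.
text = "a word is a word and the last word counts".split()
for line in kwic(text, "word", width=20):
    print(line)
```

Because the left context is padded to a fixed width, every hit starts in the same column, which is what makes a KWIC display easy to scan.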

For the attribute word - and only for this attribute - there is a second option, namely simply to enter the word in quotation marks:

"word"

The attribute word retrieves only the exact word form, i.e., if you want to look for the word word in its plural form words, the query would be:

[word="words"]

To look for both forms (word and words) at the same time, we have to use regular expressions, e.g. the or-operator |:

[word="word|words"]

or making the s at the end optional using a question mark ?

[word="words?"]

Another possibility, depending on the annotation of the corpus, is to use the attribute lemma (base form) with the value word. This gives us all word forms in the corpus that have the lemma (base form) word: word, words

[lemma="word"]

In order to look for all occurrences of a common noun in the singular (provided your text is tagged with the UPenn tagset), we can use the following query:

[pos="NN"]

See the tutorial Tagging with TreeTagger for more information on tagging and tagsets. The menu item Corpus Info usually tells you which tagset has been used for the pos annotation of a corpus.
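The difference between the attributes word, lemma and pos can be illustrated outside CQPweb with a small Python sketch. The five-token corpus and the function name query are hypothetical, and Python's re.fullmatch is used on the assumption that it approximates CQP's behaviour of matching a regular expression against the whole attribute value.

```python
import re

# Hypothetical mini-corpus: each token carries the positional attributes
# word, lemma and pos (Penn-style tags assumed for illustration).
CORPUS = [
    {"word": "The", "lemma": "the", "pos": "DT"},
    {"word": "words", "lemma": "word", "pos": "NNS"},
    {"word": "of", "lemma": "of", "pos": "IN"},
    {"word": "a", "lemma": "a", "pos": "DT"},
    {"word": "word", "lemma": "word", "pos": "NN"},
]


def query(corpus, attribute, value):
    """Emulate [attribute="value"]: the regex must match the whole value."""
    return [tok["word"] for tok in corpus if re.fullmatch(value, tok[attribute])]


print(query(CORPUS, "word", "word"))    # ['word']
print(query(CORPUS, "word", "words?"))  # ['words', 'word']
print(query(CORPUS, "lemma", "word"))   # ['words', 'word']
print(query(CORPUS, "pos", "NN"))       # ['word']
```

Note that pos="NN" does not return the plural form, because the tag NNS is not matched as a whole by the value NN.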

Exercise

Try out the queries above and check the results choosing Frequency breakdown from the drop-down menu in the upper right corner and clicking on Go.

To remember

  • Word and lemma are called positional attributes in CQP.
  • Positional attributes are attributes on the token level.
  • Typical positional attributes of a corpus are: word, lemma and pos (part-of-speech).
  • A token is surrounded by square brackets: [].
  • Use attribute-value syntax to define the query more closely: [attribute="value"], e.g., [lemma="word"]
  • The attribute specifies the type of annotation (word, lemma or part-of-speech), the value specifies the search string.

Using regular expressions

Quantifiers

We already learned that ? makes the preceding s optional in the query:

[word = "words?"] # final "s" is optional

Generally speaking, the ?-operator makes the preceding element optional (it may occur 0 to 1 times)

What happens if we use parentheses to group two characters, followed by ?

[word="(re)?search"] # the sequence "re" is optional

Execute the query and use the Frequency breakdown to see which word forms are matched.

We could also use * (the Kleene-star) to make the final s in words “optional”.

[word = "words*"]

The Frequency breakdown shows that we get occurrences of word and words, the same as with the query using ?.

Let's replace the s with a . (dot).

[word="word.?"]
[word="word.*"]

Execute the two queries and use Frequency breakdown to see the difference!

Note:
. stands for any character
Quantifiers specify the number of occurrences of the preceding element
? makes the preceding element optional (may occur 0 to 1 times)
* makes the preceding element optional, AND it may occur 0 to an infinite number of times
+ makes the preceding element obligatory, AND it may occur 1 to an infinite number of times
{n} the preceding element occurs exactly n times
{n,m} the preceding element may occur n to m times
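The effect of each quantifier can be checked offline with a short Python sketch. The list of word forms is made up for illustration, and re.fullmatch is assumed to approximate CQP's whole-token matching; results in CQPweb depend on the corpus you query.

```python
import re

# Hypothetical word forms; not taken from any particular corpus.
FORMS = ["word", "words", "wordy", "wordless", "wording"]


def matching_forms(pattern, forms=FORMS):
    """Return the forms that the pattern matches as a whole."""
    return [form for form in forms if re.fullmatch(pattern, form)]


print(matching_forms("words?"))      # ['word', 'words']
print(matching_forms("word.?"))      # ['word', 'words', 'wordy']
print(matching_forms("word.*"))      # all five forms
print(matching_forms("word.+"))      # everything except 'word'
print(matching_forms("word.{3}"))    # ['wording']
print(matching_forms("word.{1,3}"))  # ['words', 'wordy', 'wording']
```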

Exercise

Try out:

[word="search.+"]
[word="search.{2}"]
[word="search.{2,3}"]

and check the results using Frequency breakdown

Characters and tokens

We already learned that . stands for any single character. But we can also define a character using character sets. Character sets are enclosed in [], just like tokens. Take a look at the following example.

[word="[whm]orse"]

We have two sets of square brackets:

  • one specifying a token, enclosing the attribute-value pair
  • one specifying a character inside the value of the expression, enclosing a character set [whm]

The character set can consist of a list of characters, e.g. [whm], or of a character range, e.g. [a-z] (all lowercase letters from a to z) or [0-9] (all digits from 0 to 9).

[word="[a-zA-Z]orse"]

You can also negate a character set using ^ inside the square brackets.

[word="[^whm]orse"]
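As a rough illustration of the three character-set queries, the following Python sketch applies the same patterns to a hypothetical list of word forms. It assumes case-sensitive matching of the whole token, which is why Morse is not matched by [whm].

```python
import re

# Hypothetical word forms chosen to show the differences between the sets.
FORMS = ["horse", "worse", "Morse", "Norse", "gorse", "5orse"]


def matching_forms(pattern):
    """Return the forms that the pattern matches as a whole."""
    return [form for form in FORMS if re.fullmatch(pattern, form)]


print(matching_forms("[whm]orse"))     # ['horse', 'worse']
print(matching_forms("[a-zA-Z]orse"))  # every form except '5orse'
print(matching_forms("[^whm]orse"))    # ['Morse', 'Norse', 'gorse', '5orse']
```

The negated set matches any single character that is not w, h or m, including uppercase letters and digits.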

Exercise

Execute the queries above and check the results using Frequency breakdown

Operators

Patterns and restrictions can be combined using operators:
() group elements
| stands for “or”
& stands for “and”
! negates a pattern or restriction (note the difference from ^, which negates a character set)

Try out the following queries and check the results using Frequency breakdown

[word="im(portant|possible)"]
[lemma="research|investigation"]
[lemma="research" & pos="NN"]
[pos!="V.*|N.*"]
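The way & and != act on a single token can be sketched in Python by treating each restriction as a small test function. The token list and the helper names attr, all_of and words are hypothetical; this is only a model of the logic, not of CQP's implementation.

```python
import re

# Hypothetical annotated tokens (Penn-style tags assumed).
TOKENS = [
    {"word": "research", "lemma": "research", "pos": "NN"},
    {"word": "researched", "lemma": "research", "pos": "VVD"},
    {"word": "investigations", "lemma": "investigation", "pos": "NNS"},
    {"word": "important", "lemma": "important", "pos": "JJ"},
]


def attr(name, pattern, negate=False):
    """Build a test for name="pattern", or name!="pattern" if negate is set."""
    def test(token):
        matched = re.fullmatch(pattern, token[name]) is not None
        return matched != negate
    return test


def all_of(*tests):
    """Combine several restrictions on the same token with & (logical and)."""
    return lambda token: all(test(token) for test in tests)


def words(test):
    """Return the word forms of all tokens that pass the test."""
    return [token["word"] for token in TOKENS if test(token)]


# [lemma="research|investigation"]
print(words(attr("lemma", "research|investigation")))
# [lemma="research" & pos="NN"]
print(words(all_of(attr("lemma", "research"), attr("pos", "NN"))))
# [pos!="V.*|N.*"]
print(words(attr("pos", "V.*|N.*", negate=True)))
```

In this model, | lives inside the regular expression of one attribute, whereas & joins restrictions on different attributes of the same token.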

Exercise

Execute the queries and check the result using Frequency breakdown

Examples of queries using regular expressions

determiners or nouns:

[pos="DT|NN"]

verbs (including full verbs and auxiliary verbs, but no modals)

[pos="V.*"]

full verbs only:

[pos="VV.*"]

full verbs in the past tense or past participle form

[pos="VV[DN]"]

verbs beginning with under or over

[(pos = "V.*") & (word = "(under|over).+")]

Simple patterns

Up to now, we have looked for one token only. However, we can also look for a sequence of tokens, a pattern.
Just to recall: one set of square brackets stands for one token. To match a sequence of tokens, we simply need a sequence of square brackets.

A simple word sequence:

[word="I"][word="believe"][word="that"]

A sequence of part-of-speech tags, here adjective-noun co-occurrences, in order to find adjective-noun collocations

[pos="JJ.*"][pos="NN.*"]

Using coordinate constructions to find semantically related words

[word="\w+"][word="and|or"][word="\w+"]

\w stands for any word character (compare to ., which stands for any character)

verbally derived adjectives:

[(word=".+(ed|ing)") & (pos="JJ")][pos="NN"]
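A sequence of square brackets can be thought of as a window that slides over the corpus, with one condition per position. The Python sketch below models this for fixed-length patterns; the tagged sentence and the function name find_sequences are assumptions for illustration.

```python
import re

# Hypothetical tagged sentence (Penn-style tags assumed).
TOKENS = [
    {"word": "Large", "pos": "JJ"},
    {"word": "corpora", "pos": "NNS"},
    {"word": "need", "pos": "VVP"},
    {"word": "efficient", "pos": "JJ"},
    {"word": "tools", "pos": "NNS"},
    {"word": "and", "pos": "CC"},
    {"word": "fast", "pos": "JJ"},
    {"word": "queries", "pos": "NNS"},
]


def find_sequences(tokens, pattern):
    """Slide a window over the tokens; each position must meet its condition."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(regex, token[attribute])
            for token, condition in zip(window, pattern)
            for attribute, regex in condition.items()
        ):
            hits.append(" ".join(token["word"] for token in window))
    return hits


# [pos="JJ.*"][pos="NN.*"]
print(find_sequences(TOKENS, [{"pos": "JJ.*"}, {"pos": "NN.*"}]))
# [word="\w+"][word="and|or"][word="\w+"]
print(find_sequences(TOKENS, [{"word": r"\w+"}, {"word": "and|or"}, {"word": r"\w+"}]))
```

On this sample, the first pattern finds three adjective-noun pairs and the second finds the single coordination "tools and fast".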

Simple patterns with optional token

Often, the pattern we are looking for may be modified at certain positions. In other words, we want to include discontinuous patterns with optional modifying elements. We can use the same quantifiers for tokens as we used for characters (?, *, +, {n,m}).

A query with one unspecified optional token

[word="I"][]?[word="believe"][word="that"]

A query with n to m unspecified tokens (here 1 to 3, so at least one token must intervene)

[word="I"][]{1,3}[word="believe"][word="that"]

A query with 0 or more optional tokens

[word="I"][]*[word="believe"][word="that"]
  • Why is the last query not such a good idea?

A query with 0 or more optional tokens within a sentence

[word="I"][]*[word="believe"][word="that"] within s
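The effect of the within s restriction can be modelled with a small backtracking matcher in Python. This is a sketch under assumptions: the three sample sentences, the names match_from and find, and the choice to report the shortest match from each start position are all illustrative and do not reproduce CQP's own matching strategy.

```python
import re

# Hypothetical corpus: (word, sentence_id) pairs for three short sentences.
TOKENS = [
    ("I", 0), ("agree", 0), (".", 0),
    ("We", 1), ("believe", 1), ("that", 1), ("too", 1), (".", 1),
    ("I", 2), ("really", 2), ("believe", 2), ("that", 2), (".", 2),
]

ANY = None  # stands for the unspecified token []

# [word="I"][]*[word="believe"][word="that"] as (regex, min, max) triples;
# max=None means "no upper limit".
PATTERN = [("I", 1, 1), (ANY, 0, None), ("believe", 1, 1), ("that", 1, 1)]


def match_from(tokens, pos, pattern, sent=None):
    """Return the end index of the shortest match starting at pos, or None."""
    if not pattern:
        return pos
    regex, lo, hi = pattern[0]
    count, end = 0, pos
    while True:
        if count >= lo:
            result = match_from(tokens, end, pattern[1:], sent)
            if result is not None:
                return result
        if count == hi or end >= len(tokens):
            return None
        word, sentence_id = tokens[end]
        if regex is not None and not re.fullmatch(regex, word):
            return None
        if sent is not None and sentence_id != sent:
            return None  # the match would leave the starting sentence
        count, end = count + 1, end + 1


def find(tokens, pattern, within_s=False):
    """Collect matches, optionally confined to the sentence they start in."""
    hits = []
    for start in range(len(tokens)):
        sent = tokens[start][1] if within_s else None
        end = match_from(tokens, start, pattern, sent)
        if end is not None:
            hits.append(" ".join(word for word, _ in tokens[start:end]))
    return hits


print(find(TOKENS, PATTERN))                 # includes a hit crossing sentences
print(find(TOKENS, PATTERN, within_s=True))  # ['I really believe that']
```

Without the restriction, the first I is paired with believe that from the following sentence, which illustrates one reason why an unbounded []* is usually combined with within s.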

Exercise

Execute the queries and check the result using Frequency breakdown

Powerful post-processing

In this tutorial we learned how to query a corpus in CQPweb. However, CQPweb not only allows querying but also offers powerful post-processing.

Find out more in the tutorial on Post-processing in CQPweb.