In this tutorial we will learn how to query a corpus by formulating regular expressions in Notepad++.

Regular Expressions

Definition

A regular expression describes one or more strings to match when you search a body of text. The expression serves as a character pattern that is compared with the text being searched. You can use regular expressions to search for patterns in a string, replace text, and extract substrings.

A regular expression consists of ordinary characters (for example, letters a through z) and special characters, known as metacharacters.
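
The three uses named above (search, replace, extract) can be sketched outside Notepad++ as well. The following minimal example uses Python's re module, which is an assumption of this sketch and not part of the tutorial's toolchain; the sample sentence and the pattern .at are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "The cat sat on the mat."

# Ordinary characters "a" and "t" plus one metacharacter:
# "." matches any single character except a newline.
pattern = re.compile(r".at")

# Search: collect every substring that matches the pattern.
matches = pattern.findall(text)

# Replace: substitute every match with other text.
replaced = pattern.sub("dog", text)

# Extract: pull out the first matching substring.
first = pattern.search(text).group(0)

print(matches)   # ['cat', 'sat', 'mat']
print(replaced)  # The dog dog on the dog.
print(first)     # cat
```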

Special Characters

Special character Description Example
* Matches the preceding character or subexpression zero or more times. Equivalent to {0,}. bo* matches “b” and “boo”
bo{0,} the same as above
+ Matches the preceding character or subexpression one or more times. Equivalent to {1,}. bo+ matches “bo” and “boo”, but not “b”
bo{1,} the same as above
? Matches the preceding character or subexpression zero or one time. Equivalent to {0,1}.
When ? immediately follows any other quantifier (*, +, ?, {n}, {n,}, or {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible. The default greedy pattern matches as much of the searched string as possible.
bo? matches “b” and “bo”, but not “boo”
o+? matches a single “o” in “oooo”, whereas o+ matches all four
o{1,}? the same as o+?
do(es)? matches the “do” in “do” or “does”
^ Matches the position at the start of the searched string. If the m (multiline search) character is included with the flags, ^ also matches the position following \n or \r. When used as the first character in a bracket expression, ^ negates the character set. ^\d{3} matches 3 numeric digits at the start of the searched string
[^abc] matches any character except a, b, and c
$ Matches the position at the end of the searched string. If the m (multiline search) character is included with the flags, $ also matches the position before \n or \r. \d{3}$ matches 3 numeric digits at the end of the searched string
. Matches any single character except the newline character \n. To match any character including the \n, use a pattern like [\s\S]. a.c matches “abc”, “a1c”, and “a-c”
[] Marks the start and end of a bracket expression. [1-4] matches “1”, “2”, “3”, or “4”
[^aAeEiIoOuU] matches any non-vowel character
{} Marks the start and end of a quantifier expression. a{2,3} matches “aa” and “aaa”
() Marks the start and end of a subexpression. Subexpressions can be saved for later use. A(\d) matches “A0” to “A9”
The digit is saved for later use
| Indicates a choice between two or more items. z|food matches “z” or “food”
(z|f)ood matches “zood” or “food”
\ Marks the next character as a special character, a literal, a backreference, or an octal escape. \n matches a newline character
\( matches “(”
\\ matches “\”

Most special characters lose their meaning and represent ordinary characters when they occur inside a bracket expression.
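
Several rows of the table above can be checked with a short script. This is a minimal sketch in Python's re module (an assumption; Notepad++ uses its own regex engine, although the constructs shown here are written the same way). The multiline behaviour of ^ is switched on with re.MULTILINE in Python, and the sample strings beyond those in the table are hypothetical.

```python
import re

words = ("b", "bo", "boo")

# * : preceding character zero or more times
star = [re.fullmatch(r"bo*", w) is not None for w in words]
# + : preceding character one or more times
plus = [re.fullmatch(r"bo+", w) is not None for w in words]
# ? : preceding character zero or one time
optional = [re.fullmatch(r"bo?", w) is not None for w in words]

# Greedy (default) versus non-greedy (quantifier followed by ?)
greedy = re.search(r"o+", "oooo").group(0)
lazy = re.search(r"o+?", "oooo").group(0)

# ^ matches only at the start of the string unless multiline mode is on
two_lines = "123 abc\n456 def"
start_only = re.findall(r"^\d{3}", two_lines)
each_line = re.findall(r"^\d{3}", two_lines, flags=re.MULTILINE)

# () saves the matched subexpression for later use
saved_digit = re.search(r"A(\d)", "room A7").group(1)

# | chooses between alternatives inside the group
alternation = [re.fullmatch(r"(z|f)ood", w) is not None
               for w in ("zood", "food", "good")]

# Inside a bracket expression, ., *, + and ? are ordinary characters
literal_meta = re.findall(r"[.*+?]", "a.b*c+d?")

print(star, plus, optional)    # [True, True, True] [False, True, True] [True, True, False]
print(greedy, lazy)            # oooo o
print(start_only, each_line)   # ['123'] ['123', '456']
print(saved_digit, alternation, literal_meta)
```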

Multiple-character Special Characters

Special character Description Example
\b Matches a word boundary; that is, the position between a word and a space. er\b matches the “er” in “never” but not the “er” in “verb”
\B Matches a word non-boundary. er\B matches the “er” in “verb” but not the “er” in “never”
\d Matches a digit character. Equivalent to [0-9]. In the searched string “12 345”, \d{2} matches “12” and “34”
\d matches “1”, “2”, “3”, “4”, and “5”
\D Matches a nondigit character. Equivalent to [^0-9]. \D+ matches “abc” and “ def” in “abc123 def”
\w Matches any of the following characters: A-Z, a-z, 0-9, and underscore.
Equivalent to [A-Za-z0-9_].
In the searched string “The quick brown fox…”,
\w+ matches “The”, “quick”, “brown”, and “fox”
\W Matches any character except A-Z, a-z, 0-9, and underscore. Equivalent to [^A-Za-z0-9_]. In the searched string “The quick brown fox…”, \W+ matches “…” and all of the spaces
[xyz] A character set. Matches any one of the specified characters. [abc] matches the “a” in “plain”
[^xyz] A negative character set. Matches any character that is not specified. [^abc] matches the “p”, “l”, “i”, and “n” in “plain”
[a-z] A range of characters. Matches any character in the specified range. [a-z] matches any lowercase alphabetical character in the range “a” through “z”
[^a-z] A negative range of characters. Matches any character that is not in the specified range. [^a-z] matches any character that is not in the range “a” through “z”
{n} Matches exactly n times. n is a nonnegative integer. o{2} does not match the “o” in “Bob”, but does match the two “o”s in “food”
{n,} Matches at least n times. n is a nonnegative integer. * is equivalent to {0,}.
+ is equivalent to {1,}.
o{2,} does not match the “o” in “Bob” but does match all the “o”s in “foooood”
{n,m} Matches at least n and at most m times. n and m are nonnegative integers, where n <= m. There cannot be a space between the comma and the numbers. ? is equivalent to {0,1}. In the searched string “1234567”, \d{1,3} matches “123”, “456”, and “7”.
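
The examples in this table can be reproduced with the sketch below. It uses Python's re module, which is an assumption of this example and not the engine inside Notepad++. The re.ASCII flag is passed so that \d and \w are restricted to [0-9] and [A-Za-z0-9_] as the table states, because Python 3 otherwise also accepts other Unicode digits and letters. The ellipsis in the sample sentence is written as three ASCII dots.

```python
import re

A = re.ASCII  # keep \d, \w, \b limited to ASCII, as in the table

# \b word boundary, \B word non-boundary
er_at_end_never = re.search(r"er\b", "never", A) is not None
er_at_end_verb = re.search(r"er\b", "verb", A) is not None
er_inside_verb = re.search(r"er\B", "verb", A) is not None
er_inside_never = re.search(r"er\B", "never", A) is not None

# \d and \D
digit_pairs = re.findall(r"\d{2}", "12 345", A)
non_digits = re.findall(r"\D+", "abc123 def", A)

# \w and \W
sentence = "The quick brown fox..."
word_runs = re.findall(r"\w+", sentence, A)
non_word_runs = re.findall(r"\W+", sentence, A)

# Character sets and negated sets
in_set = re.findall(r"[abc]", "plain")
not_in_set = re.findall(r"[^abc]", "plain")

# Counted quantifiers
exactly_two = re.findall(r"o{2}", "food")
no_double_o = re.search(r"o{2}", "Bob") is None
two_or_more = re.findall(r"o{2,}", "foooood")
one_to_three = re.findall(r"\d{1,3}", "1234567", A)

print(er_at_end_never, er_at_end_verb, er_inside_verb, er_inside_never)
print(digit_pairs, non_digits)
print(word_runs, non_word_runs)
print(in_set, not_in_set)
print(exactly_two, no_double_o, two_or_more, one_to_three)
```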

Non Printing Characters

The following table contains escape sequences that represent nonprinting characters.

Special character Matches
\n Newline character.
\r Carriage-return character.
\s Any white-space character. This includes space, tab, and form feed.
\S Any non-white-space character.
\t Tab character.

Notepad++

Notepad++ is a free source code editor that runs under MS Windows. Like almost all editors, Notepad++ provides Count, Search, and Replace functions. In addition, it allows these operations to be performed with regular expressions. To do so, open Notepad++ and choose Replace in the Search menu. Once you tick Regular expression, all the operations you perform in this window treat the search text as a regular expression. Even in regular expression mode you can still run string-based queries, because a pattern made only of ordinary characters matches itself.
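
For readers who want to cross-check a query outside the editor, the three operations can be approximated in a few lines of Python. This is only a sketch under assumptions: it uses Python's re module instead of the Notepad++ engine, and the corpus string, the pattern, and the replacement text are hypothetical.

```python
import re

# Hypothetical one-line corpus, used only for illustration.
corpus = "She does the work. They do the rest. He does not."

# "do" optionally followed by "es", as a whole word
pattern = re.compile(r"\bdo(es)?\b")

# Count: how many matches the corpus contains
count = sum(1 for _ in pattern.finditer(corpus))

# Search: where each match starts (character offsets)
positions = [m.start() for m in pattern.finditer(corpus)]

# Replace: substitute every match with new text
replaced = pattern.sub("DO", corpus)

print(count)      # 3
print(positions)
print(replaced)   # She DO the work. They DO the rest. He DO not.
```

finditer is used here instead of findall because, when a pattern contains a group, Python's findall returns the group contents and not the whole match.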