Pattern matching with grep -E, part 2
Last updated on 2024-10-25
Estimated time: 60 minutes
Overview
Questions
- How do we use predefined character classes for more complex search patterns?
Objectives
- Learn to incorporate predefined character classes into regular expressions.
Predefined character classes
The Extended Regular Expression syntax has a number of predefined
character groupings that may be written as a word, rather than a
collection or range of characters.
For example, you may write [[:digit:]] instead of writing [0-9].
Written as | Equivalent to |
---|---|
[[:digit:]] | [0-9] |
[[:alpha:]] | [a-zA-Z] |
[[:alnum:]] | [a-zA-Z0-9] |
[[:upper:]] | [A-Z] |
[[:lower:]] | [a-z] |
[[:space:]] | Spaces, tabs, in some contexts new-lines |
[[:graph:]] | Any printable character other than space |
[[:punct:]] | Any printable character other than space or [a-zA-Z0-9] |
Why?
Why would you write [[:digit:]] instead of writing [0-9]? Or [[:upper:]]
instead of [A-Z]? In general, for our use, there’s little difference
other than readability/style. The difference matters if you need to make
things more universal, as Arabic numerals are not the only number system
in use around the world and the 26-letter English alphabet is not the
only writing system. So, for example, while [0-9] will always be just
those specific 10 characters, [[:digit:]] may, depending on the tool and
locale, have multiple alternate numeric systems encoded.
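As a quick check of that equivalence, here is a minimal sketch using a made-up sample line (the text ‘Room 101, floor 3’ is an assumption for illustration, not part of the lesson files). For plain ASCII input like this, both patterns should pick out the same digits:

```shell
# Hypothetical sample line, invented for this illustration
line='Room 101, floor 3'

# Named class: print each run of digits on its own line
printf '%s\n' "$line" | grep -E -o '[[:digit:]]+'

# Explicit range: selects the same text for this input
printf '%s\n' "$line" | grep -E -o '[0-9]+'
```

Each command prints ‘101’ and then ‘3’.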
Regex shorthand escape symbols
The other way you may refer to predefined character classes, and the
way you will likely most commonly do so from here on, is using the
following shorthands, formed by “escaping” certain characters with a
backslash. For example ‘\w’ can be used to match to any “word”
character, which means any letter, number, or (counterintuitively) an
underscore.
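To see the underscore behaviour in practice, here is a small sketch on an invented sample line (the text is an assumption for illustration only). The underscore is kept inside a “word”, while the hyphen splits one:

```shell
# Hypothetical sample line: one name joined by '_' and one joined by '-'
printf '%s\n' 'snake_case kebab-case' | grep -E -o '\w+'
```

This prints ‘snake_case’, ‘kebab’ and ‘case’ on separate lines, because ‘-’ is not a “word” character.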
The shorthand symbols available are:
Written as | Equivalent to |
---|---|
\w | “Word” character [a-zA-Z0-9] OR a _ (underscore) |
\W | Inverse of \w, any non-“word” character (same as [^[:alnum:]_]) |
\s | Spaces, tabs, in some contexts new-lines |
\S | Inverse of \s, any non-space character (same as [^[:space:]]) |
\b | Boundary between a “word” character and a non-“word” character such as a space (0-length) |
\B | Inverse of \b, any position that is not such a boundary: in the middle of a “word” or of multiple “spaces” (0-length) |
\< | Boundary at the start of a “word” (0-length) |
\> | Boundary at the end of a “word” (0-length) |
Those last four are considered “anchors”. They don’t actually match characters, but they can give a regex pattern more context, helping to orientate components. For example, a letter being specifically at the start of a word.
The following are also commonly used within regex syntax (e.g. will work in Python or R), but are not understood by grep or sed:
Written as | Equivalent to |
---|---|
\d | [0-9] A digit |
\D | [^0-9] Not a digit |
\t | A tab character (does work in some versions of sed, test yours) |
\n | A newline, if program supports multi-line matching |
Here are some examples.
The first two words of a line (start of line, word, space(s), word):
OUTPUT
word1 word_2
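The command for this example is not shown above. A sketch that gives the same output, assuming the input is the single line ‘word1 word_2 thirdWord!?’ (inferred from the outputs in this section, not confirmed by the lesson):

```shell
# Assumed input line, inferred from the outputs shown in this section
line='word1 word_2 thirdWord!?'

# ^ start of line, \w+ a word, \s+ one or more spaces, \w+ a second word
printf '%s\n' "$line" | grep -E -o '^\w+\s+\w+'
```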
A word with spaces at both ends:
OUTPUT
word_2
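One pattern that would produce this match, as a sketch only: the input line ‘word1 word_2 thirdWord!?’ is an assumption based on the outputs shown in this section.

```shell
# Assumed input line (not confirmed by the lesson)
line='word1 word_2 thirdWord!?'

# \s a space, \w+ a word, \s another space; the spaces are part of the match
printf '%s\n' "$line" | grep -E -o '\s\w+\s'
```

Only ‘word_2’ qualifies, since ‘word1’ has no space before it and ‘thirdWord!?’ has none after it.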
Every set of consecutive non-space characters:
OUTPUT
word1
word_2
thirdWord!?
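A plausible command for this output, assuming (from the outputs in this section) that the input is the line ‘word1 word_2 thirdWord!?’:

```shell
# Assumed input line (an inference, not the lesson's confirmed input)
line='word1 word_2 thirdWord!?'

# \S+ is one or more non-space characters, so punctuation stays attached
printf '%s\n' "$line" | grep -E -o '\S+'
```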
Everything up to the boundary of the last word:
OUTPUT
word1 word_2
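The original pattern for this example is not shown, so the following is a guess that happens to give the same output. It assumes the input line ‘word1 word_2 thirdWord!?’ and uses the start-of-word anchor ‘\<’ as the boundary; the lesson may have used a different anchor.

```shell
# Assumed input line (inferred from the outputs in this section)
line='word1 word_2 thirdWord!?'

# ^.* is greedy, so the match runs to the LAST position where a word starts
printf '%s\n' "$line" | grep -E -o '^.*\<'
```

The match ends just before ‘thirdWord’, so the output keeps the trailing space after ‘word_2’.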
The middle characters of each word (bounded by not-a-word-boundary):
OUTPUT
ord
ord_
hirdWor
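A sketch of a command that gives this output, assuming the input line is ‘word1 word_2 thirdWord!?’ (an inference from the outputs in this section):

```shell
# Assumed input line (not confirmed by the lesson)
line='word1 word_2 thirdWord!?'

# \B on both sides forbids the match from touching either edge of a word,
# so the first and last character of each word are left out
printf '%s\n' "$line" | grep -E -o '\B\w+\B'
```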
Try it 1
- Use ‘grep -E -o’ on wordplay1.txt to print the first 2 words of any line using
  - [[:alpha:]] and [[:space:]]
  - \w and \s
  - \S and \s
- Use ‘grep -E -o’, with \w and \b, on wordplay1.txt to print a word that starts with ‘p’ and ends with ‘g’.
- Use ‘grep -E’, with \w and \<, on wordplay1.txt to highlight the first letter of every word.
Tab characters
There are a few options for getting a tab character to work with grep
or sed on a bash command line (if a space ‘\s’ or word boundary anchor
will not be specific enough):
1. On some systems, pressing ctrl+v followed by tab will insert a
literal tab character.
2. On some systems, a literal tab can be copied and pasted in from a
text editor.
3. A dollar sign in front of a single-quoted pattern enables escape
character interpretation in bash, based on ANSI-C rules, where ‘\t’ is a
tab.
E.g. echo $'|\t|'
OR grep -E $'\t'
This last option is neatest, but beware that bash interprets other
escape sequences in this mode too (‘\b’, for example, becomes a
backspace character). To pass ‘\w’ or ‘\b’ etc. through to grep, you
will need to double-escape them, with two backslashes.
E.g. for a word with tabs either side:
grep -E $'\t\\w+\t'
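To try the third option without a data file, here is a minimal bash sketch. The sample line is invented for illustration, and ‘tr’ is only used to make the matched tabs visible as ‘|’:

```shell
# Hypothetical sample line: three fields separated by literal tabs
line=$'left\tword_2\tright'

# $'...' turns \t into a real tab; \\w reaches grep as \w
# tr swaps each tab in the match for a visible '|'
printf '%s\n' "$line" | grep -E -o $'\t\\w+\t' | tr '\t' '|'
```

This prints ‘|word_2|’. The ‘$'...'’ quoting is a bash feature, so the sketch may not work in other shells.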
Capturing groups and back-references
We’d mentioned earlier that round brackets ( ) have multiple uses. One use is to “capture” a match seen within the round brackets, remembering the contents of what matched within, such that a copy of the contents may be referred to again. We’ll make much use of this feature in the next lesson, for find-n-replace substitutions! Within grep though, a “back-reference” to a captured group can be used to identify something that repeats identically within a line. The reference to the previously seen item is used in the form ‘\number’. It works like this:
OUTPUT
blah2 blah2
Here a word ‘\w+’ is “captured” by the round brackets, is followed by space, then is referenced again by ‘\1’, which represents a stored copy of what was originally matched by the ‘\w+’. Hence this grep only matches a case where a word is followed by another copy of that same word.
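The command itself is not shown here. A sketch matching that description, run on an invented input line (the lesson's own input is not known):

```shell
# Hypothetical input line, made up for this illustration
line='blah1 blah2 blah2 blah3'

# (\w+) captures a word, then a space, then \1 demands the same text again
printf '%s\n' "$line" | grep -E -o '(\w+) \1'
```

This prints ‘blah2 blah2’. The pairs ‘blah1 blah2’ and ‘blah2 blah3’ are skipped because the second word is not an exact copy of the first.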
If we’d like to make multiple back-references, we use multiple pairs of round brackets and increment our back-reference number by one for each open bracket from left-to-right. E.g.:
OUTPUT
blah2 blah2
We had the same outcome, but stored the letters part “blah” separately from the digits part “2”, then referred to the two captured parts as ‘\1’ and ‘\2’ respectively.
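A pattern along these lines would behave as described. This is a sketch on an invented input line, and the exact character classes used in the lesson are an assumption:

```shell
# Hypothetical input line, made up for this illustration
line='blah1 blah2 blah2 blah3'

# Group 1 = the letters, group 2 = the digits; \1\2 asks for both again, in order
printf '%s\n' "$line" | grep -E -o '([a-z]+)([0-9]+) \1\2'
```

Groups are numbered by the position of their opening bracket, so ‘([a-z]+)’ is ‘\1’ and ‘([0-9]+)’ is ‘\2’.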
Consider the following:
OUTPUT
EFGGFE
Our pattern matched three letters, then those same three letters again, but in reverse order!
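One way to write such a pattern, as a sketch: the input line below is invented, and the lesson's actual command is not shown.

```shell
# Hypothetical input line containing a six-letter mirrored run
line='ABCDEFGGFEHIJ'

# Capture three letters separately, then reference them back-to-front
printf '%s\n' "$line" | grep -E -o '([A-Z])([A-Z])([A-Z])\3\2\1'
```

This prints ‘EFGGFE’: ‘E’, ‘F’ and ‘G’ are stored as ‘\1’, ‘\2’ and ‘\3’, and the tail ‘\3\2\1’ then requires ‘GFE’.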
Try it 2
- Grep ‘wordplay1.txt’ to print the only line that contains the same word repeated twice. Hint: You may need to use the ‘\b’ word-boundary anchor.
- Grep ‘namesndates.txt’ to print the name of a person whose first name and surname start with the same letter.
- Grep ‘namesndates.txt’ to print a date where the month and the day are the same number.
Key Points
- grep in Extended Regex mode has a number of predefined character classes:
[:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:]
- and escape-character enabled shorthand character classes and anchors:
  - \w : Word character [a-zA-Z0-9] OR a _ (underscore)
  - \W : Inverse of \w, any non-word character
  - \s : Spaces, tabs, in some contexts new-lines
  - \S : Inverse of \s, any non-space character
  - \b : Boundary between a word character and a non-word character such as a space, 0-length anchor
  - \B : Inverse of \b, in the middle of a word or of multiple spaces, 0-length anchor
  - \< : Boundary at start of word, 0-length anchor
  - \> : Boundary at end of word, 0-length anchor
- You can refer back to an exact copy of a matched (group) using \1, \2, etc.