Pattern matching with grep -E, part 1
Overview
Teaching: 45 min
Exercises: 45 minQuestions
How do we write regexs to match complex patterns in files or output streams?
Objectives
Learn the basic syntax of Extended Regular Expressions, using grep -E
You may have used the command ‘grep’ before, which allows you to search for a matching string. It may be given a file name, to make it search within a file:
grep 'react' wordplay1.txt
class react exercise impartial heavenly
trees shelf amused reactive sheet
relation reacting handsome ashamed
Or it can search piped output from another command:
echo "Hello Andrew, you are Andrew?" | grep -o "Andrew"
Andrew
Andrew
In this example, the ‘-o’ flag was used to make grep print only the matches it finds, rather than whole matching lines. This will be a handy option when we start testing more ambiguous search patterns later.
For the following lessons, we’ll be making use of grep with a ‘-E’ flag, which enables Extended Regular Expression (ERE) mode for searching regex patterns. On some systems, an aliased command ‘egrep’ exists. The regular expression syntax of ‘grep -E’ is the same as that of ‘sed -E’, which we’ll be using for “find & replace” commands later, and basically the same as most other implementations of regular expressions.
’|’ for matching this or that
Like in the ‘if’ statements of most programming languages, the pipe character acts as an OR, allowing matching to any of the listed options.
grep -E 'boat|bolt|bell' wordplay1.txt
bolt glow shaky lumpy hypnotic
bell spiritual materialistic rule sponge
exclusive color harbor boat bedroom
Round brackets may be used to group terms together. This grouping has multiple uses, as you’ll see as we continue, but one is to group OR options together. E.g.:
grep -E 'b(oat|olt|ell)' wordplay1.txt
bolt glow shaky lumpy hypnotic
bell spiritual materialistic rule sponge
exclusive color harbor boat bedroom
Consider the following:
echo "ABC" | grep -E -o '(A|B)C'
BC
Using ‘-o’ to print matching part only, why wasn’t the ‘A’ included in result? The search pattern asked for either a A or a B which was followed by a C. Only the B met this criteria.
Try it 1
- Write a grep -E command to search wordplay1.txt for either ‘corn’ or ‘cow’
- Try the above with -w option enabled (match as whole words only)
- Write a grep -E command to search wordplay1.txt for either ‘reaction’ or ‘reactive’
- Write an alternative working answer to 3.
Solution
grep -E 'corn|cow' wordplay1.txt grep -E -w 'corn|cow' wordplay1.txt grep -E 'reaction|reactive' wordplay1.txt grep -E 'reacti(on|ve)' wordplay1.txt
’[ ]’ to match any single character in a list or range
This may sound familiar…
Use of square brackets [ list ] allows matching to a single character, where that character has to match any of the options listed between the brackets. The brackets may contain a list, or a range, or a mix of both. For example, to match any one digit from 0 to 9, the following are equivalent:
[0123456789]
[0-9]
[123-789]
Ranges work for alphabet characters too. The following are equivalent:
[ABCDEFG]
[A-G]
Important: as always on a Unix shell, case matters! A list of all alphanumeric characters is:
[a-zA-Z0-9]
Example:
echo "bog bag cog cot tog log lag" | grep -E -o '[a-z]og'
bog
cog
tog
log
Example:
echo "1952 1986 2003 1995 2018" | grep -E -o '20[0-9][0-9]'
2003
2018
Try it 2
- Write a grep -E command to search wordplay1.txt for 4-letter words ending in ‘oat’
- What if you didn’t know if the words started with lower case or capitol letters?
- What if any of the letters could be upper case or lower case?
- Write an alternative working answer to 3.
Solution
grep -E -w -o '[a-z]oat' wordplay1.txt grep -E -w -o '[a-zA-Z]oat' wordplay1.txt grep -E -w -o '[a-zA-Z][oO][aA][tT]' wordplay1.txt # OR grep -E -w -o -i '[a-z]oat' wordplay1.txt # grep -i ignores case (see 'man grep')
When used within square brackets, a ‘^’ means “NOT”. For example ‘[^ABC]’ would match any one character that wasn’t A or B or C.
Example:
echo "bog bag cog cot tog log lag" | grep -E -o '[^b][a-z]g'
cog
tog
log
lag
Try it 3
- Modify your ‘oat’ word finder to find any 4-letter ‘oat’ words other than ‘boat’ and ‘goat’
Solution
grep -E -w -o '[^bg]oat' wordplay1.txt
Match start of line ^ or end of line $
The ‘^’ symbol has another use. When placed at the start of a regex pattern, it anchors the
pattern such that it must match from the start of a line.
Similarly, a ‘$’, when placed at the end of a regex pattern, anchors the pattern such that it
must match up to the end of a line.
Consider the following examples demonstrating the effect:
grep -E 'repeat' wordplay1.txt
repeat drum quilt superficial uncovered
copper bad unpack repeat old-fashioned
sheet accessible word enthusiastic repeat
grep -E '^repeat' wordplay1.txt
repeat drum quilt superficial uncovered
grep -E 'repeat$' wordplay1.txt
sheet accessible word enthusiastic repeat
There will be more use for these anchors when we get into more complex patterns with wildcards.
’.’ is the match-anything regex wildcard
Speaking of, the match-anything wildcard (any single character) for regular expressions is a period, ‘.’
Consider:
grep -E -o '.oat' wordplay1.txt
boat
oat
loat
coat
goat
For the short word “oat”, it was the preceeding tab character that the wildcard matched.
One more concept to introduce before we get into some better examples of using this wildcard.
Matching multiples of things
What if you wanted to match something that repeated multiple times?
Any single character, bracket expression or grouping, can be modified to specify number
of occurances, with one of the following:
Symbol(s) | Effect |
---|---|
? | Present either one time or no times (0-1) |
* | Present any number of times, including zero (0+) |
+ | Present at least once, but any number of times (1+) |
{n} | Repeated exactly ‘n’ times |
{n,m} | Repeated between ‘n’ and ‘m’ times |
{n,} | Repeated at least ‘n’ times |
{,m} | Repeated at most ‘m’ times |
For example, what if we wanted to search for ‘colour’ but maybe it had US spelling?
grep -E 'colou?r' wordplay1.txt
exclusive color harbor boat bedroom
Here the ‘?’ modifier let the preceeding ‘u’ match either one time OR zero times.
Lets find all words from our words file that are at least 12 letters long:
grep -E -o '[a-z]{12,}' wordplay1.txt
materialistic
distribution
enthusiastic
sophisticated
Pattern repetition
Note, it’s probably clear from the previous example that, when specifying repetition of a pattern in this way, e.g. “match any letter from a to z, 12 or more times”, it’s the general ambiguous pattern concept which is repeated, not a specific matching case. Hence we found 12 lots of “any letter” in a row, rather than “any letter” and then 12 repeats of that letter.
Greediness
Regular expression search patterns are considered “greedy”. That is, when you use ‘+’ or ‘*’ or ‘{n,}’, they will always match the longest possible string that still fits the pattern.
E.g. ‘.*’, which means “any character, zero or more times”, will always match an entire line without further specific context around it.
Try it 4
- Use ‘grep -E -o’ on wordplay1.txt to match all of any line that starts with a ‘c’
- Use ‘grep -E -o’ on wordplay1.txt to match the first word of lines starting with a ‘c’
- Modify previous answer to keep words 5 or 6 letters long only.
- Use ‘grep -E -o’ on wordplay1.txt to match both ‘stand’ and ‘outstanding’
Solution
grep -E -o '^c.+' wordplay1.txt grep -E -o '^c[a-z]+' wordplay1.txt grep -E -o -w '^c[a-z]{4,5}' wordplay1.txt grep -E -o '(out)?stand(ing)?' wordplay1.txt
What if we wanted to match a date, and didn’t know if the year would be 2 digits or 4? Our pattern for a date is “a digit, from 0-9, either one or two of them, then a forward slash, then a digit, either one or two of them, then either 2 digits OR 4 digits.”
echo "11/06/91 not/a/date 5/9/2018" | grep -E -o '[0-9]{1,2}/[0-9]{1,2}/([0-9]{2}|[0-9]{4})'
11/06/91
5/9/2018
Alternatively, if we were sure about the sensibleness of the contents we were searching, we may get away with just:
echo "11/06/91 5/9/2018" | grep -E -o '[0-9]+/[0-9]+/[0-9]+'
11/06/91
5/9/2018
Try it 5
Have a look at ‘namesndates.txt’ (
cat namesndates.txt
)
- Use ‘grep -E -o’ on namesndates.txt to list times (e.g. 20:57).
- Use ‘grep -E -o’ on namesndates.txt to list all calendar dates.
- Consider extra information we know about dates: days will be at most 31, months at most 12.
Could we use aspects of this information within our regex for more sensibleness checking?Solution
grep -E -o '[0-9]{1,2}:[0-9]{1,2}' namesndates.txt grep -E -o '[0-9]{1,2}/[0-9]{1,2}/([0-9]{2}|[0-9]{4})' namesndates.txt grep -E -o '[0-3]?[0-9]/[0-1]?[0-9]/([0-9]{2}|[0-9]{4})' namesndates.txt
Key Points
grep in Extended Regex mode (or egrep) allows complex pattern matching in files/streams.
|
acts as an OR between options
( )
allows grouping, e.g. for OR modifier, with quantifiers, etc..
[ ]
matches a character from a list or range of contained options
[^ ]
matches a character NOT in a list or range of contained options
^
at the start of a regex means match at start of line
$
at the end of a regex means match at end of line
.
is the match-all (any single character) wildcard
?
quantifies previous character or group as occurring zero or one time
*
quantifies previous character or group as occurring zero or more times
+
quantifies previous character or group as occurring one or more times
{n,m}
quantifies previous character or group as occurring between n and m timesQuantifiers are greedy- will always match longest possible fit.