Content from Regular Expressions: The pitch


Last updated on 2024-10-25 | Edit this page

Estimated time: 10 minutes

Overview

Questions

  • What sort of capabilities do regular expressions provide?
  • What are regular expressions?

Objectives

  • Introduce the concept of regular expressions.
  • Sell the idea of the power of regular expressions.

What are regular expressions?


In short, a regular expression, or rational expression, or regex for short, is a sequence of characters that act as a pattern for searching within text. So, what do we mean by pattern?

Consider a calendar date written down, say, 23/08/2018. We all recognise a date as being a date when we see one written down this way. Why? Because there’s a consistent pattern to it. We could describe the pattern of a date written this way as: one or two digits, a forward slash, one or two digits, a forward slash, then either 2 digits or 4 digits. Using that pattern of what a date looks like, our eyes can scan a page of text and pick out other dates.

Regular expressions are a formalisation of such patterns into a consistent syntax, making use of metacharacters of certain meaning, such that we can guide a computer to search text in a way that is as specific or as ambiguous as required. It’s a very powerful tool, being able to search text for patterns instead of just exact words. Taken further, it also allows for complex find & replaces, where the replacing is also following a pattern described with the same syntax.

The pitch


So, have you ever wanted to pull specific information out of a file, where a straight search for a specific word or phrase wouldn’t cut it? Perhaps all you knew was the general “shape” of what you were looking for. For example, you know there’s a date involved, not what the date is, but that it would look like a date- day/month/year. Or maybe it’s part of a specific file format; you know the format’s rules, but not the contents.

Have you ever wanted to search a file for something, but only actually print out some different nearby information from the same line?

Have you ever wanted to make bulk changes to a file, where a simple ‘find & replace’ wouldn’t work? Perhaps you’d like to rename a list of samples, according to a rule based on their current names. Or fix up some formatting inconsistencies.

Or have you ever wanted to completely rearrange and modify lines of a file or program output?

These are the situations that regular expressions make really quick and easy compared to other solutions.

Here’s an example. We have a list of files in folders, each representing a sample and replicate in an experiment, with some sample details incorporated into the file names. We wish to automatically transform this into a table, listing samples, sample information, and file details.

Regex demo1

There are numerous ways to do this, but most solutions would be multi-part, requiring cutting out different bits of information at a time.
Using a regular expression, this entire transformation can be done in a single command:

BASH

's/^(.+\/)(s([0-9]+)-R([0-9]+)_([0-9]+)h-(\w+)\.fasta)/Sample\3\tRep\4\t\5hours\t\6\t\2\t\1/'

It’s ugly, but it works (which might be the regular expression catch-phrase).

This is a regular expression substitution, containing a pattern that matches each part of our file names and another pattern that defines the rearrangement. A goal at the end of today will be to understand how this works and to be able to implement similar effects yourselves.

There have been numerous implementations of regular expressions, but the major common implementations available today all follow a mostly-consistent syntax, which is what we’ll be learning. So, for example, once you learn this regex syntax, you’ll then be able to use it within certain Unix command line utilities and in Python and in R. It even works in most text editors too!

Key Points

  • Regexs are powerful tools for searching and transforming text.
  • A search pattern, using a defined syntax, allows non-specific but directed matching.

Content from Shell wildcards - a type of regex


Last updated on 2024-10-25 | Edit this page

Estimated time: 40 minutes

Overview

Questions

  • Are regular expressions available in standard Unix commands?

Objectives

  • (Re)Introduce wildcards available in the Unix shell for file selection.
  • *’, ‘?’, ‘[ ]’, ‘[! ]’ and ‘{ }

While using a Unix shell, you may have already become familiar with a simple form of regular expression. Have you ever used a command like this?

BASH

ls *.txt

It’s a searchable pattern which means “match anything that ends in .txt”. The * works as a wildcard, expanding to match any of zero or more characters.

However, there are other wildcards available that allow for more specific and complex patterns.

‘?’ to match any single character


The ? wildcard matches any character, but always exactly one character.
For example:

BASH

ls sample?.txt  

OUTPUT

sample1.txt  sample4.txt  sample7.txt  sampleA.txt  sampleD.txt  sampleY.txt
sample2.txt  sample5.txt  sample8.txt  sampleB.txt  sampleE.txt  sampleZ.txt
sample3.txt  sample6.txt  sample9.txt  sampleC.txt  sampleF.txt

Try it 1

  1. Write a similar ls command to list .txt files with a 2 digit/char sample number.
  2. Write a similar ls command to list for samples specifically in the range of 10-19.

BASH

ls sample??.txt
ls sample1?.txt

That last challenge was a bit of a gotcha. The solution using ‘?’ would have also picked up a sample “1A”. Not what we were after, but a good segue for the next concept- matching to a list of options or ranges.

‘[ ]’ to match any single character in a list or range


Use of square brackets [ *list* ] allows matching to a single character, where that character has to match any of the options listed between the brackets. The brackets may contain a list, or a range, or a mix of both. For example, to match any one digit from 0 to 9, the following are equivalent:

BASH

[0123456789]
[0-9]
[0-789]

Ranges work for alphabet characters too. The following are equivalent:

BASH

[ABCDEFG]
[A-G]

Important: as always on a Unix shell, case matters! A list of all alphanumeric characters is:

BASH

[a-zA-Z0-9]

So back to our earlier challenge of listing files for sample numbers 10-19.
A better solution may be:

BASH

ls sample1[0-9].txt

Remember: an entire ‘[list]’ set will only match a single listed character at a time.

Try it 2

  1. Write an ls command to list .txt files for samples 2 to 5.
  2. Write an ls command to list .txt files for samples X, Y and Z and sample A.
  3. Write an ls command to list .txt files for samples denoted by a number followed by a letter (e.g. sample1A).
  4. Write an ls command to list .txt files for all sample names ending in a letter.

BASH

ls sample[2-5].txt
ls sample[AX-Z].txt
ls sample[0-9][A-Z].txt
ls sample*[A-Z].txt

‘!’ to match any single character NOT in a list or range


Connected to the square bracket notation is the NOT(!) symbol ‘[! ]’. Use of a ‘!’ inside ‘[ ]’ makes it match to any single character NOT in the list.
E.g.:

BASH

[!A]

… would match any character that’s not A.

BASH

[!0-9]

… would match any character that’s not a digit from 0 to 9.

Try it 3

  1. Write an ls command to list .txt files for all samples other than those numbered 4 to 7 and 9.

BASH

ls sample[!4-79]*.txt

‘{ }’ to list whole word or expressions as options


Lastly for this section we have the curly brackets ‘{ }’. Like for square brackets, curly brackets lets you match to one of the options listed inside. The difference is, this time we list not just single characters, but entire words or expressions to choose between.
E.g. ‘{alpha,beta,gamma}’ would match ‘alpha’ and ‘beta’ and ‘gamma’.

Real example:

BASH

ls sample11.{txt,tab,csv}
sample11.csv  sample11.tab  sample11.txt

You can use other wildcard patterns within the listed options.
So these are equivalent:

BASH

ls *.{tab,csv}
ls {*.tab,*.csv}

And these are equivalent:

BASH

ls sample[A-Z0-9].txt
ls sample{[A-Z],[0-9]}.txt

The curly braces actually have their own range capability, even more powerful in that it allows for more than single digit numbers:

BASH

ls sample{1..15}.txt
ls sample{A..C}.txt

BUT if you’ve spotted the warnings while trying these examples, you may have guessed the catch with using curly brackets- They aren’t actually a form of pattern matching! Rather, the first thing a shell does is expand the brackets out to form all posibilities listed, in full.

So, when you run the first command below, you’re actually executing the second command below:

BASH

ls sample{A..C}.txt
ls sampleA.txt sampleB.txt sampleC.txt

Contrast:

BASH

echo [A-C][1-3]
[A-C][1-3]
  
echo {A..C}{1..3}
A1 A2 A3 B1 B2 B3 C1 C2 C3

Brace expansion is always the first order of business when the command line interprets a command. It’s a useful tool, and while not quite fitting the regular expression theme, the concept of listing longer options in a this-OR-that fashion will be visited again in proper regexs.

Try it 4

  1. Write an ls command to list files for samples 8 to 13 and CD.
  2. Write an ls command to list just .csv and .tab files for samples 11 and CD.

BASH

ls sample{{8..13},CD}.*
ls sample{11,CD}.{csv,tab}

While syntax will differ, the concepts learnt here will remain very relevant as we next move onto more advanced forms of regular expression.

Key Points

  • Use of wildcards in the Unix shell for file selection is a simple form of regular expressions.
  • * matches zero or more characters
  • ? matches exactly one character
  • [ ] matches a character from a list or range of contained options
  • [! ] matches a character NOT in a list or range of contained options
  • { } expands to produce forms of all listed contained options

Content from Pattern matching with grep -E, part 1


Last updated on 2024-11-01 | Edit this page

Estimated time: 90 minutes

Overview

Questions

  • How do we write regexs to match complex patterns in files or output streams?

Objectives

  • Learn the basic syntax of Extended Regular Expressions, using grep -E

You may have used the command ‘grep’ before, which allows you to search for a matching string. It may be given a file name, to make it search within a file:

BASH

grep 'react' wordplay1.txt

OUTPUT

class   react   exercise        impartial       heavenly
trees   shelf   amused  reactive        sheet
relation        reacting        handsome        ashamed

Or it can search piped output from another command:

BASH

echo "Hello Andrew, you are Andrew?" | grep -o "Andrew"

OUTPUT

Andrew
Andrew

In this example, the ‘-o’ flag was used to make grep print only the matches it finds, rather than whole matching lines. This will be a handy option when we start testing more ambiguous search patterns later.

For the following lessons, we’ll be making use of grep with a ‘-E’ flag, which enables Extended Regular Expression (ERE) mode for searching regex patterns. On some systems, an aliased command ‘egrep’ exists. The regular expression syntax of ‘grep -E’ is the same as that of ‘sed -E’, which we’ll be using for “find & replace” commands later, and basically the same as most other implementations of regular expressions.

‘|’ for matching this or that


Like in the ‘if’ statements of most programming languages, the pipe character acts as an OR, allowing matching to any of the listed options.

BASH

grep -E 'boat|bolt|bell' wordplay1.txt 

OUTPUT

bolt    glow    shaky   lumpy   hypnotic
bell    spiritual       materialistic   rule    sponge
exclusive       color   harbor  boat    bedroom

Round brackets may be used to group terms together. This grouping has multiple uses, as you’ll see as we continue, but one is to group OR options together. E.g.:

BASH

grep -E 'b(oat|olt|ell)' wordplay1.txt 

OUTPUT

bolt    glow    shaky   lumpy   hypnotic
bell    spiritual       materialistic   rule    sponge
exclusive       color   harbor  boat    bedroom

Consider the following:

BASH

echo "ABC" | grep -E -o '(A|B)C'

OUTPUT

BC

Using ‘-o’ to print matching part only, why wasn’t the ‘A’ included in result? The search pattern asked for either a A or a B which was followed by a C. Only the B met this criteria.

Try it 1

  1. Write a grep -E command to search wordplay1.txt for either ‘corn’ or ‘cow’
  2. Try the above with -w option enabled (match as whole words only)
  3. Write a grep -E command to search wordplay1.txt for either ‘reaction’ or ‘reactive’
  4. Write an alternative working answer to 3.

BASH

grep -E 'corn|cow' wordplay1.txt
grep -E -w 'corn|cow' wordplay1.txt
grep -E 'reaction|reactive' wordplay1.txt
grep -E 'reacti(on|ve)' wordplay1.txt

‘[ ]’ to match any single character in a list or range


This may sound familiar…

Use of square brackets [ *list* ] allows matching to a single character, where that character has to match any of the options listed between the brackets. The brackets may contain a list, or a range, or a mix of both. For example, to match any one digit from 0 to 9, the following are equivalent:

BASH

[0123456789]
[0-9]
[123-789]

Ranges work for alphabet characters too. The following are equivalent:

BASH

[ABCDEFG]
[A-G]

Important: as always on a Unix shell, case matters! A list of all alphanumeric characters is:

BASH

[a-zA-Z0-9]

Example:

BASH

echo "bog bag cog cot tog log lag" | grep -E -o '[a-z]og'

OUTPUT

bog
cog
tog
log

Example:

BASH

echo "1952 1986 2003 1995 2018" | grep -E -o '20[0-9][0-9]'

OUTPUT

2003
2018

Try it 2

  1. Write a grep -E command to search wordplay1.txt for 4-letter words ending in ‘oat’
  2. What if you didn’t know if the words started with lower case or capitol letters?
  3. What if any of the letters could be upper case or lower case?
  4. Write an alternative working answer to 3.

BASH

grep -E -w -o '[a-z]oat' wordplay1.txt
grep -E -w -o '[a-zA-Z]oat' wordplay1.txt
grep -E -w -o '[a-zA-Z][oO][aA][tT]' wordplay1.txt
# OR
grep -E -w -o -i '[a-z]oat' wordplay1.txt
# grep -i ignores case (see 'man grep')

When used within square brackets, a ‘^’ means “NOT”. For example ‘[^ABC]’ would match any one character that wasn’t A or B or C.

Example:

BASH

echo "bog bag cog cot tog log lag" | grep -E -o '[^b][a-z]g'

OUTPUT

cog
tog
log
lag

Try it 3

  1. Modify your ‘oat’ word finder to find any 4-letter ‘oat’ words other than ‘boat’ and ‘goat’

BASH

grep -E -w -o '[^bg]oat' wordplay1.txt

Match start of line ^ or end of line $


The ‘^’ symbol has another use. When placed at the start of a regex pattern, it anchors the pattern such that it must match from the start of a line.
Similarly, a ‘$’, when placed at the end of a regex pattern, anchors the pattern such that it must match up to the end of a line.
Consider the following examples demonstrating the effect:

BASH

grep -E 'repeat' wordplay1.txt

OUTPUT

repeat  drum    quilt   superficial     uncovered
copper  bad     unpack  repeat  old-fashioned
sheet   accessible      word    enthusiastic    repeat

BASH

grep -E '^repeat' wordplay1.txt

OUTPUT

repeat  drum    quilt   superficial     uncovered

BASH

grep -E 'repeat$' wordplay1.txt

OUTPUT

sheet   accessible      word    enthusiastic    repeat

There will be more use for these anchors when we get into more complex patterns with wildcards.

‘.’ is the match-anything regex wildcard


Speaking of, the match-anything wildcard (any single character) for regular expressions is a period, ‘.’

Consider:

BASH

grep -E -o '.oat' wordplay1.txt

OUTPUT

boat
        oat
loat
coat
goat

For the short word “oat”, it was the preceeding tab character that the wildcard matched.

One more concept to introduce before we get into some better examples of using this wildcard.

Matching multiples of things


What if you wanted to match something that repeated multiple times?
Any single character, bracket expression or grouping, can be modified to specify number of occurances, with one of the following:

Symbol(s) Effect
? Present either one time or no times (0-1)
* Present any number of times, including zero (0+)
+ Present at least once, but any number of times (1+)
{n} Repeated exactly ‘n’ times
{n,m} Repeated between ‘n’ and ‘m’ times
{n,} Repeated at least ‘n’ times
{,m} Repeated at most ‘m’ times

For example, what if we wanted to search for ‘colour’ but maybe it had US spelling?

BASH

grep -E 'colou?r' wordplay1.txt

OUTPUT

exclusive       color   harbor  boat    bedroom

Here the ‘?’ modifier let the preceeding ‘u’ match either one time OR zero times.

Lets find all words from our words file that are at least 12 letters long:

BASH

grep -E -o '[a-z]{12,}' wordplay1.txt 

OUTPUT

materialistic
distribution
enthusiastic
sophisticated

Pattern repetition

Note, it’s probably clear from the previous example that, when specifying repetition of a pattern in this way, e.g. “match any letter from a to z, 12 or more times”, it’s the general ambiguous pattern concept which is repeated, not a specific matching case. Hence we found 12 lots of “any letter” in a row, rather than “any letter” and then 12 repeats of that letter.

Greediness

Regular expression search patterns are considered “greedy”. That is, when you use ‘+’ or ‘*’ or ‘{n,}’, they will always match the longest possible string that still fits the pattern.
E.g. ‘.*’, which means “any character, zero or more times”, will always match an entire line without further specific context around it.

Try it 4

  1. Use ‘grep -E -o’ on wordplay1.txt to match all of any line that starts with a ‘c’
  2. Use ‘grep -E -o’ on wordplay1.txt to match the first word of lines starting with a ‘c’
  3. Modify previous answer to keep words 5 or 6 letters long only.
  4. Use ‘grep -E -o’ on wordplay1.txt to match both ‘stand’ and ‘outstanding’

BASH

grep -E -o '^c.+' wordplay1.txt
grep -E -o '^c[a-z]+' wordplay1.txt
grep -E -o -w '^c[a-z]{4,5}' wordplay1.txt
grep -E -o '(out)?stand(ing)?' wordplay1.txt

What if we wanted to match a date, and didn’t know if the year would be 2 digits or 4? Our pattern for a date is “a digit, from 0-9, either one or two of them, then a forward slash, then a digit, either one or two of them, then either 2 digits OR 4 digits.”

BASH

echo "11/06/91  not/a/date  5/9/2018" | \
  grep -E -o '[0-9]{1,2}/[0-9]{1,2}/([0-9]{2}|[0-9]{4})'

OUTPUT

11/06/91
5/9/2018

Alternatively, if we were sure about the sensibleness of the contents we were searching, we may get away with just:

BASH

echo "11/06/91 5/9/2018" | grep -E -o '[0-9]+/[0-9]+/[0-9]+'

OUTPUT

11/06/91
5/9/2018

Try it 5

Have a look at ‘namesndates.txt’ (cat namesndates.txt)

  1. Use ‘grep -E -o’ on namesndates.txt to list times (e.g. 20:57).
  2. Use ‘grep -E -o’ on namesndates.txt to list all calendar dates.
  3. Consider extra information we know about dates: days will be at most 31, months at most 12.
    Could we use aspects of this information within our regex for more sensibleness checking?

BASH

grep -E -o '[0-9]{1,2}:[0-9]{1,2}' namesndates.txt
grep -E -o '[0-9]{1,2}/[0-9]{1,2}/([0-9]{2}|[0-9]{4})' namesndates.txt
grep -E -o '[0-3]?[0-9]/[0-1]?[0-9]/([0-9]{2}|[0-9]{4})' namesndates.txt

Key Points

  • grep in Extended Regex mode (or egrep) allows complex pattern matching in files/streams.
  • | acts as an OR between options
  • ( ) allows grouping, e.g. for OR modifier, with quantifiers, etc..
  • [ ] matches a character from a list or range of contained options
  • [^ ] matches a character NOT in a list or range of contained options
  • ^ at the start of a regex means match at start of line
  • $ at the end of a regex means match at end of line
  • . is the match-all (any single character) wildcard
  • ? quantifies previous character or group as occurring zero or one time
  • * quantifies previous character or group as occurring zero or more times
  • + quantifies previous character or group as occurring one or more times
  • {n,m} quantifies previous character or group as occurring between n and m times
  • Quantifiers are greedy- will always match longest possible fit.

Content from Pattern matching with grep -E, part 2


Last updated on 2024-10-25 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do we use predefined character classes for more complex search patterns?

Objectives

  • Learn to incorporate predefined character classes into regular expressions.

Predefined character classes


The Extended Regular Expression syntax has a number of predefined character groupings that may be written as a word, rather than a collection or range of characters.
For example, you may write [[:digit:]] instead of writing [0-9].

Written as Equivalent to
[[:digit:]] [0-9]
[[:alpha:]] [a-zA-Z]
[[:alnum:]] [a-zA-Z0-9]
[[:upper:]] [A-Z]
[[:lower:]] [a-z]
[[:space:]] Spaces, tabs, in some contexts new-lines
[[:graph:]] Any printable character other than space
[[:punct:]] Any printable character other than space or [a-zA-Z0-9]

Why?

Why would you write [[:digit:]] instead of writing [0-9]? Or [[:upper:]] instead of [A-Z]? In general, for our use, there’s little difference other than readability/style. The difference comes if you need to make things more universal, as Arabic-numerals are not the only number system in use around the world and the 26-letter English alphabet is not the only writing system. So, for example, while [0-9] will always just be those specific 10 characters, [[:digit:]] may have multiple alternate numeric systems encoded.

Regex shorthand escape symbols


The other way you may refer to predefined character classes, and the way you will likely most commonly do so from here on, is using the following shorthands, formed by “escaping” certain characters with a backslash. For example ‘\w’ can be used to match to any “word” character, which means any letter, number, or (counterintuitively) an underscore.
The shorthand symbols available are:

Written as Equivalent to
\w “Word” character [a-zA-Z0-9] OR a _ (underscore)
\W [^\w] Inverse of \w, any non-“word” character
\s Spaces, tabs, in some contexts new-lines
\S [^\s] Inverse of \s, any non-space character
\b Boundary between “words” and “spaces” (0-length)
\B [^\b] In the middle of a “word” or multiple “spaces” (0-length)
\< Boundary at start of “word” between “words” and “spaces” (0-length)
\> Boundary at end of “word” between “words” and “spaces” (0-length)

Those last four are considered “anchors”. They don’t actually match characters, but they can give a regex pattern more context, helping to orientate components. For example, a letter being specifically at the start of a word.

The following are also commonly used within regex syntax (e.g. will work in Python or R), but are not understood by grep or sed:

Written as Equivalent to
\d [0-9] A digit
\D [^0-9] Not a digit
\t A tab character (does work in some version of sed, test yours)
\n A newline, if program supports multi-line matching

Here are some examples.
The first two words of a line (start of line, word, space(s), word):

BASH

echo 'word1 word_2 thirdWord' | grep -E -o '^\w+\s+\w+'

OUTPUT

word1 word_2

A word with spaces at both ends:

BASH

echo 'word1 word_2 thirdWord!?' | grep -E -o '\s\w+\s'

OUTPUT

 word_2 

Every set of consecutive non-space characters:

BASH

echo 'word1 word_2 thirdWord!?' | grep -E -o '\S+'

OUTPUT

word1
word_2
thirdWord!?

Everything up to the boundary of the last word:

BASH

echo 'word1 word_2 thirdWord!?' | grep -E -o '.+\<'

OUTPUT

word1 word_2  

The middle characters of each word (bounded by not-a-word-boundary):

BASH

echo 'word1 word_2 thirdWord!?' | grep -E -o '\B\w+\B'

OUTPUT

ord
ord_
hirdWor

Try it 1

  1. Use ‘grep -E -o’ on wordplay1.txt to print the first 2 words of any line using
  1. [[:alpha:]] and [[:space:]]
  2. \w and \s
  3. \S and \s
  1. Use ‘grep -E -o’, with \w and \b, on wordplay1.txt to print a word that starts with ‘p’ and ends with ‘g’.
  2. Use ‘grep -E’, with \w and \<, on wordplay1.txt to highlight the first letter of every word.

BASH

grep -E -o '^[[:alpha:]]+[[:space:]]+[[:alpha:]]+' wordplay1.txt 
grep -E -o '^\w+\s+\w+' wordplay1.txt  
grep -E -o '^\S+\s+\S+' wordplay1.txt
grep -E -o '\bp\w+g\b' wordplay1.txt
grep -E '\<\w' wordplay1.txt

Tab characters

There are a few options for getting a tab character to work with grep or sed on a bash command line (if a space ‘\s’ or word boundary anchor will not be specific enough):
1. On some systems, pressing ctrl+v followed by tab, will insert a literal tab character.
2. On some systems, a literal tab could be copy and pasted in from a text editor.
3. A dollar sign in front of pattern can enable escape character interpretation in bash, based on ANSI-C rules, where ‘\t’ is a tab.
E.g. echo $'|\t|' OR grep -E $'\t'
This last option is neatest, but beware other conflicting escape character interpretations in this mode, meaning that to use ‘\w’ or ‘\b’ etc., you will need to double-escape them, with two slashes.
E.g. for a word with tabs either side: grep -E $'\t\w+\t'

Capturing groups and back-references


We’d mentioned earlier that round brackets ( ) have multiple uses. One use is to “capture” a match seen within the round brackets, remembering the contents of what matched within, such that a copy of the contents may be referred to again. We’ll make much use of this feature in the next lesson, for find-n-replace substitutions! Within grep though, a “back-reference” to a captured group can be used to identify something that repeats identically within a line. The reference to the previously seen item is used in the form ‘\number’. It works like this:

BASH

echo 'blah1 blah2 blah2 blah4?' | grep -E -o '(\w+)\s+\1'

OUTPUT

blah2 blah2

Here a word ‘\w+’ is “captured” by the round brackets, is followed by space, then is referenced again by ‘\1’, which represents a stored copy of what was originally matched by the ‘\w+’. Hence this grep only matches a case where a word is followed by another copy of that same word.

If we’d like to make multiple back-references, we use multiple pairs of round brackets and increment our back-reference number by one for each open bracket from left-to-right. E.g.:

BASH

echo 'blah1 blah2 blah2 blah4?' | grep -E -o '([a-z]+)([0-9])\s+\1\2'

OUTPUT

blah2 blah2

We had the same outcome, but stored letters part “blah” separate to the digits part “2”, then referred to the two captured parts as ‘\1’ and ‘\2’ respectively.

Consider the following:

BASH

echo 'ABCDEFGGFEDCBA' | grep -E -o '(\w)(\w)(\w)\3\2\1'

OUTPUT

EFGGFE

Our pattern matched to three letters, then to those same three letters repeated again, but in reverse order!

Try it 2

  1. Grep ‘wordplay1.txt’ to print the only line that contains the same word repeated twice. Hint: You may need to use the ‘\b’ word-boundary anchor.
  2. Grep ‘namesndates.txt’ to print the name of a person whose firstname and surname start with the same letter.
  3. Grep ‘namesndates.txt’ to print a date where the month and the day are the same number.

BASH

grep -E '\b(\w+)\b.+\b\1\b' wordplay1.txt
grep -E -o '^(\w)\w+\s\1\w+' namesndates.txt
grep -E -o '([0-9]{2})/\1/[0-9]{4}' namesndates.txt

Key Points

  • grep in Extended Regex mode has a number of predefined character classes:
  • [:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:]
  • and escape-character enabled shorthand character classes and anchors:
  • \w : Word character [a-zA-Z0-9] OR a _ (underscore)
  • \W : [^\w] Inverse of \w, any non-word character
  • \s : Spaces, tabs, in some contexts new-lines
  • \S : [^\s] Inverse of \s, any non-space character
  • \b : Boundary between adjacent word and space, 0-length anchor
  • \B : [^\b] In the middle of a word or multiple spaces, 0-length anchor
  • \< : Boundary at start of word between word and space, 0-length anchor
  • \> : Boundary at end of word between word and space, 0-length anchor
  • You can refer back to an exact copy of a matched (group) using \1, \2, etc..

Content from Find... and replace! With sed.


Last updated on 2024-10-25 | Edit this page

Estimated time: 120 minutes

Overview

Questions

  • How do we find and replace content using regex patterns?

Objectives

  • Learn regex substitutions using sed.

Regex substitution syntax


Regex substitutions allow you to use the pattern syntax we’ve learned so far to describe complex ‘find & replace’ operations. For the following exercises we’ll be using the program ‘sed’, but the substitution syntax is common to other implementations too. It’s this:

BASH

's/pattern/replacement/'

‘s’ for substitution. The ‘pattern’ component is as we’ve been using with ‘grep -E’ so far; the same concept and syntax, describing a pattern to be matched. The ‘replacement’ component is what will replace a match of the pattern.

E.g. Substitute “hello” for “goodbye” would be:

BASH

's/hello/goodbye/'

Sed for regex substitutions


‘sed’ is a Unix command line utility, name short for “stream editor”. It will parse text passed to it, from a file or piped input stream, and can perform various transformations to the text. It has many different capabilities, as can be found in the sed manual. Today we’ll just be playing with the regular expression substitution capability. We’ll be using ‘sed -E’, which enables the same “Extended Regular Expression” syntax as we’d been using with ‘grep -E’.

There are a few ways to use it:

BASH

# Modify stream from another program, print result
other-command | sed -E 's/pat/replace/'

# Read input file, print contents with changes
sed -E 's/pat/replace/' inFile

# Read input file, write contents with changes to another file
sed -E 's/pat/replace/' inFile > outFile

# Read input file, write with changes to the same file (caution!)
sed -E -i 's/pat/replace/' file

Example:

BASH

echo "hello Andrew" | sed -E 's/Andrew/Bob/'

OUTPUT

hello Bob

All of our regex pattern capabilities still work:

BASH

echo "hello Andrew" | sed -E 's/A\w+/Bob/'

OUTPUT

hello Bob

We replace the entirety of what we match:

BASH

echo "hello Andrew" | sed -E 's/.+/A whole new line/'

OUTPUT

A whole new line

Only the first possible extended match is replaced:

BASH

echo "hello Andrew" | sed -E 's/\w+/blah/'

OUTPUT

blah Andrew

However, a greedy mode may be enabled which keeps looking for subsequent matches to replace, by adding a ‘g’ to the end of your substitution string:

BASH

echo "hello Andrew" | sed -E 's/\w+/blah/g'

OUTPUT

blah blah

The replacement section is allowed to be empty too:

BASH

echo "hello    Andrew" | sed -E 's/\s+//g'

OUTPUT

helloAndrew

Try it 1

Using sed -E 's/ / /g' wordplay1.txt

  1. Replace all vowels (a,e,i,o,u) with ‘oo’
  2. Replace all words longer than 4 characters with ‘blah’
  3. Replace an entire line if it has any words longer than 4 characters with “had long words” (as a literal phrase)
  4. Insert “Some words:” (as a literal phrase) at the start of every line.

BASH

sed -E 's/[aeiou]/oo/g' wordplay1.txt
sed -E 's/\w{5,}/blah/g' wordplay1.txt
sed -E 's/.*\w{5,}.*/had long words/' wordplay1.txt
sed -E 's/^/Some words: /' wordplay1.txt

Making use of back-references


Recall with grep we could “capture” or remember a part of a match by encasing it in round brackets ‘\(\)’ and then refer back to an exact copy of what was matched, using ‘\1’ for the first bracketed group, ‘\2’ for second bracketed group, etc.. Such back-references may also be used within the replacement section of a substitution. This is how we start doing more interesting find and replaces.

For example, we could add an exclamation mark after every word:

BASH

echo "hello Andrew" | sed -E 's/(\w+)/\1!/g'

OUTPUT

hello! Andrew!

The back-reference is necessary, so that we can define the replacement as being “the same as what we matched, plus an exclamation mark.”

Another example, we could swap every second word:

BASH

echo "one two three four" | sed -E 's/(\w+) (\w+)/\2 \1/g'

OUTPUT

two one four three

We separately capture two adjacent words, then in the replacement, refer back to them in reverse order.

Remember that the entirety of a matched pattern is replaced. So if we do something like this, where a ‘.+’ matches the whole rest of the line, then the whole rest of the line is replaced.

BASH

echo "one two three four" | sed -E 's/(\w+) (\w+).+/\2 \1/g'

OUTPUT

two one

Try it 2

Using sed -E 's/ / /g' wordplay1.txt

  1. Replace all vowels (a,e,i,o,u) with two copies of that same vowel
  2. Switch the places of the first and second word on each line
  3. Switch the places of the first and last word on each line, while retaining everything between

BASH

sed -E 's/([aeiou])/\1\1/g' wordplay1.txt
sed -E 's/^(\w+)\t(\w+)/\2\t\1/' wordplay1.txt
sed -E 's/^(\w+)(.+)\b(\w+)/\3\2\1/' wordplay1.txt

Forward slashes

Note that, as a substitution pattern is bookended by forward slashes, there will be a conflict if you actually want to include a literal forward slash. There are two ways of handling this.
1. Escape your literal forward slash with a back slash, ensuring it acts as a literal character.
\/
2. Alternatively, you can actually use any other character to bookend the substitution! E.g.:
's;pattern;replacement;'
or
's#pattern#replacement#'
If doing this, choose a character that won’t otherwise appear in your substitution string and that doesn’t have any other special meaning.

Try it - date format conversion

Complete the following, to convert dates in namesndates.txt from dd/mm/yyyy format, into yymmdd format. E.g. change 23/08/2012 into 120823.
sed -E 's; ; ;' namesndates.txt

BASH

sed -E 's;([0-9]{2})/([0-9]{2})/[0-9]{2}([0-9]{2});\3\2\1;' namesndates.txt

Try it - make a FASTA file

DNA and RNA sequences are often represented in the “FASTA” format, which looks like:

>seqID
ATCGTACGTAGCTACGT

The file ‘dnaSequences.txt’ contains DNA sequences in a tab-separated format:

seqID    ATCGTACGTAGCTACGT
  1. Use sed to convert dnaSequences.txt sequences into a FASTA format representation. Hint: ‘\n’ may be used in the replacement pattern to insert a newline.

  2. Some sequences may be very long. Can you make it so that there’s never more than 20 characters per line, with longer sequences split over multiple lines?

BASH

sed -E 's/(\S+)\s+(\w+)/>\1\n\2/' dnaSequences.txt
sed -E 's/(\S+)\s+(\w+)/>\1\n\2/' dnaSequences.txt | sed -E 's/(\w{20})/\1\n/g'

Try it - Fixing a file

Have another look at ‘namesndates.txt’ (cat namesndates.txt)

A number of typos and inconsistencies were made when adding rows to ‘namesndates.txt’:
1. One row has a name written as “surname, firstname”, instead of “firstname surname”.
2. One row has a date written in “dd-mm-yyyy” format instead of “dd/mm/yyyy” format.
3. One row is comma-separated, while the rest are tab-separated.
4. Some rows have multiple spaces, or a mix of tabs and spaces, in place of just single tabs.
(cat -T namesndates.txt can help distinguish tabs from spaces)

Using multiple piped sed substitutions, can you correct all of these mistakes?

BASH

sed -E 's/^(\w+), (\w+)/\2 \1/' namesndates.txt | \
  sed -E 's;([0-9]{2})-([0-9]{2})-([0-9]{4});\1/\2/\3;' | \
  sed -E 's/,/\t/g' | \
  sed -E 's/\s{2,}/\t/g' > namesndates-FIXED.txt

Prototyping solutions

Tips for turning longer complex strings into regular expression substitutions:
1. Start by copying a real example of a whole string into your pattern section.
2. Add escape back slashes to any forward slashes, literal brackets, etc., as necessary.
3. “Circle” the parts of the string you’d like to separately retain, with round brackets.
4. Write out your replacement pattern, using back-reference to what you circled.
5. At this stage, the substitution should work, but only for the specific real example string that you’ve started with.
6. Finally, start abstracting your search pattern, replacing parts of your example string with wild-cards or character-classes as needed, to strike the balance between specificity and ambiguity required to match all that you want and not all that you don’t want.

Try it - information from file names

‘fileExample1.txt’ contains the input text from the first regex example in the intro of this course. Can you recreate that transformation?
Convert this list of files (fileExample1.txt)…

/path/to/my/data/folder1/s1-R1_6h-shade.fasta
/path/to/my/data/folder1/s2-R1_6h-sun.fasta
/path/to/my/data/folder1/s3-R1_24h-shade.fasta
/path/to/my/data/folder1/s4-R1_24h-sun.fasta
/path/to/my/data/folder2/s1-R2_6h-shade.fasta
/path/to/my/data/folder2/s2-R2_6h-sun.fasta
/path/to/my/data/folder2/s3-R2_24h-shade.fasta
/path/to/my/data/folder2/s4-R2_24h-sun.fasta

…into this table of sample information:

OUTPUT

Sample1 Rep1    6hours  shade   s1-R1_6h-shade.fasta    /path/to/my/data/folder1/
Sample2 Rep1    6hours  sun     s2-R1_6h-sun.fasta      /path/to/my/data/folder1/
Sample3 Rep1    24hours shade   s3-R1_24h-shade.fasta   /path/to/my/data/folder1/
Sample4 Rep1    24hours sun     s4-R1_24h-sun.fasta     /path/to/my/data/folder1/
Sample1 Rep2    6hours  shade   s1-R2_6h-shade.fasta    /path/to/my/data/folder2/
Sample2 Rep2    6hours  sun     s2-R2_6h-sun.fasta      /path/to/my/data/folder2/
Sample3 Rep2    24hours shade   s3-R2_24h-shade.fasta   /path/to/my/data/folder2/
Sample4 Rep2    24hours sun     s4-R2_24h-sun.fasta     /path/to/my/data/folder2/

BASH

sed -E \
's/^(.+\/)(s([0-9]+)-R([0-9]+)_([0-9]+)h-(\w+)\.fasta)/Sample\3\tRep\4\t\5hours\t\6\t\2\t\1/' \
    fileExample1.txt

Key Points

  • sed -E 's/pattern/replacement/'
  • 's/pattern/replacement/g' - enables Greedy, replace-all mode.
  • Use grouping () in pattern and back-reference \1 in replacement…
  • … to rearrange or recontextualise parts of the matched input.
  • Tips for writing complex substitutions:
  • 1- Start with a complete real example pasted as your pattern.
  • 2- Escape ‘\’ any forward slashes, literal brackets, etc., as necessary.
  • 3- Circle the parts to retain, with round brackets.
  • 4- Write your replacement rules, using back-references.
  • 5- Substitution should now work for your specific real example.
  • 6- Abstract pattern with wildcards, etc., to make ambiguous enough for all required cases.

Content from Regexs within text editors


Last updated on 2024-10-25 | Edit this page

Estimated time: 5 minutes

Overview

Questions

  • How can we invoke regular expressions within certain text editors?

Objectives

  • Introduce regex substitutions implemented in text editors, e.g. Nano, VSCode, Notepad++, etc…

Regular expressions, for both searching and substituting, with all the capabilities we’ve looked at today, are fully implemented in many modern text editors.

For example, on Windows: Visual Studio Code, Notepad++, Sublime Text…

Patterns follow all the same syntax you’ve learnt using grep and sed. However, you’ll usually find that the following patterns work that didn’t work with Grep:

Written as Equivalent to
\d [0-9] A digit
\D [^0-9] Not a digit
\t A tab character
\n A newline, if program supports multi-line matching

In some text editors, for example VSCode, back-reference variables are referred to with a ‘$’ sign instead of a backslash. E.g. instead of \1 and \2, you use $1 and $2. Experiment with your favourite text editor to see which it expects.

In Nano, find and replace options are listed at the bottom of the screen as ^W Where Is and ^\ Replace respectively, which in Windows corresponds to Ctrl+W for Find or Ctrl+\ for Replace.
Once either is enabled a new option for toggling Regex mode on or off is listed at the bottom: M-R Reg.exp.. This command (Meta+R) corresponds to Alt+R on Windows.
Toggling Regex mode on will change the prompt from Search: to Search [Regexp]:.
With Regex mode on, all the same syntax from grep and sed should work, including using \1, \2, etc. to reference groups captured by round brackets when doing a replace.

Key Points

  • Regular expression capabilities are incorporated in most modern text editors for find and replace.

Content from Python regular expressions


Last updated on 2024-10-25 | Edit this page

Estimated time: 10 minutes

Overview

Questions

  • How can we invoke regular expressions using Python?

Objectives

  • Introduce regex capabilities in Python.

Regular expressions, with search, replace and other capabilities, are available in through Python through the ‘re’ library. E.g. import re.

Patterns follow all the same syntax you’ve learnt using grep and sed. However, you’ll usually find that the following patterns work that didn’t work with Grep:

Written as Equivalent to
\d [0-9] A digit
\D [^0-9] Not a digit
\t A tab character
\n A newline

Python regexs, using the ‘re’ library, are implemented through functions, where search patterns and strings are given as function arguments.

Searching


The grep equivalent is the function re.search().
match = re.search( pattern, string )
The returned match object looks “true” if there was a match and has functions for interrogating aspects of the match, for example individual bracketed groups matched and span range of matched parts.

PYTHON

import re
myString = 'word1 1234 word2'
match = re.search(r'\b(\d+) (\w+)', myString)
if match:
    print(match.group(0))
    print(match.span(0))
    print(match.group(1))
    print(match.group(2))
else:
    print("No Match")

OUTPUT

1234 word2
(6, 16)
1234
word2

Note the ‘r’ in front of the search pattern, r'\b(\d+) (\w+)'. This enables the pattern string to be passed to the re function in its literal form, which prevents backslashes being being interpreted as escapes too early, before the function looks at it. Without the preceeding ‘r’, double-backslashes would be necessary.
E.g. '\\b(\\d+) (\\w+)'

Splitting


A neat ‘re’ feature is splitting of a string into an array of substrings, using a regex as a delimiter, instead of just a literal comma or tab or space, etc.. With this functionality, you can, for example, split up a line into individual words, using a regex “anything that’s not a word” as the delimiter for the split.

PYTHON

listOfWords = re.split(r'\W+', 'word_1 $%^ 1234,word2-word3')
print(listOfWords)

OUTPUT

['word_1', '1234', 'word2', 'word3']

Substituting


Substitution commands in Python ‘re’ take the form:
newstring = re.sub( pattern, replacement, string )
Again, both the pattern and replacement/back-reference syntax is as we’ve learnt already.

PYTHON

import re
oldstring = 'Four 123 Five'
newstring = re.sub( r'(\w+)\s+(\d+)\s+(\w+)', r'\2-\1-\3', oldstring )
print(newstring)

OUTPUT

'123-Four-Five'

Try it - Reformatting a file using Python

Write a Python script that reads through the file ‘namesndates_v2.txt’ and, for each line, rearranges it (using re.sub) to the following format, and prints the result:

month-year: surname,firstname @ place

For example, the line…

Neve Erindale 23/08/2012 20:57 Coombs

…should be printed as:

08-2012: Erindale,Neve @ Coombs

Hints:
1. \t may be used to match the tab characters used between the fields in the input file.
2. Basic Python file reading:

PYTHON

with open(filename) as inFile:
  for line in inFile:

PYTHON

import re
with open('namesndates_v2.txt') as inFile:
  for line in inFile:
    rearranged = re.sub( r'^(\w+) (\w+)\t\d+/(\d+)/(\d+)\t\S+\t(.+)', r'\3-\4: \2,\1 @ \5', line )
    print(rearranged, end="")

RE Manual

More reference for the Python ‘re’ library may be found at: https://docs.python.org/3/library/re.html

Key Points

  • Regular expressions through Python:
  • import re
  • match = re.search(r'pattern', 'string')
  • OR
  • list = re.split(r'pattern', 'string')
  • OR
  • re.sub( r'pattern', r'replacement', 'string' )
  • Reference: https://docs.python.org/3/library/re.html

Content from R regular expressions


Last updated on 2024-10-25 | Edit this page

Estimated time: 10 minutes

Overview

Questions

  • How can we invoke regular expressions using R?

Objectives

  • Introduce regex capabilities in R.

Regular expressions, with search, replace capabilities, are also available in R. Again, patterns follow all the same syntax you’ve learnt using grep and sed. And again, you’ll find that the following patterns work that didn’t work with Grep:

Written as Equivalent to
\d [0-9] A digit
\D [^0-9] Not a digit
\t A tab character

There are multiple ways of utilising regular expressions through R, but we’re going to focus on a set of functions provided by the ‘stringr’ package:

R

library(stringr)
# We'll also load tidyverse for later
library(tidyverse)

The grep equivalent in ‘stringr’ is str_detect. It’s intended for use in searching vectors of strings.

str_detect( string.vector, 'pattern' )

Basic use returns a list of TRUE/FALSE for which vector entries matched the search pattern.

Note that you’ll need to use double-backslashes in place of any single backslash in R.

For example, find words where the same letter is repeated consecutively:

R

words <- c("apple", "banana", "carrot")
str_detect( words, '(\\w)\\1' )

OUTPUT

[1]  TRUE FALSE  TRUE

Applying the output of str_detect back to the original string vector, as indices, allows access to the actual matched entries:

R

words[ str_detect( words, '(\\w)\\1' ) ]

OUTPUT

[1] "apple"  "carrot"

‘str_detect’ can also be used in conjunction with data frames. Within a data frame, each column is stored as a separate vector and so we can use ‘str_detect’ within any function that accesses a single data frame column (such as mutate()).

For the next example, we’re going to make use of the ‘stringr’ package coming pre-loaded with a longer list of words, provided exactly for demonstrations like this, named ‘fruit’. Check it out:

R

fruit
length(fruit)

First we’ll convert the fruit vector into a data frame:

R

framed_fruit <- tibble(fruit_name = fruit)
framed_fruit

We can then add a new column to the data frame, which is the result of using ‘str_detect’ to find our repeated letter pattern:

R

framed_fruit %>% 
  mutate(has_double = str_detect(fruit_name, '(\\w)\\1'))

OUTPUT

# A tibble: 80 × 2
   fruit_name   has_double
   <chr>        <lgl>
 1 apple        TRUE
 2 apricot      FALSE
 3 avocado      FALSE
 4 banana       FALSE
 5 bell pepper  TRUE
 6 bilberry     TRUE
 7 blackberry   TRUE
 8 blackcurrant TRUE
 9 blood orange TRUE
10 blueberry    TRUE
# ... 70 more rows

Another approach is to use the ‘filter’ function to reduce the data frame to only matching rows:

R

framed_fruit %>% 
  filter( str_detect(fruit_name, '(\\w)\\1') )

OUTPUT

# A tibble: 29 × 1
   fruit_name
   <chr>
 1 apple
 2 bell pepper
 3 bilberry
 4 blackberry
 5 blackcurrant
 6 blood orange
 7 blueberry
 8 boysenberry
 9 cherry
10 chili pepper
# ... 19 more rows

Substitution in R is enabled through the ‘str_replace’ or ‘str_replace_all’ functions.

str_replace( string.vector, 'pattern', 'replacement' )

These functions will complete substitutions in all matching entries in a string vector. The difference is that ‘str_replace_all’ will also perform greedy matching and replacing within individual vector entries, where ‘str_replace’ will only replace the first match in each vector entry. The functions return the modified string vector.

R

str_replace(words, '(\\w)\\1', '\\1a\\1a\\1')

OUTPUT

[1] "apapaple"  "banana"    "carararot"

And again, these functions may be combined with data frame use too. For example:

R

framed_fruit %>% 
  mutate( long_vowels = str_replace_all(fruit_name, '([aeiou])', '\\1\\1\\1\\1\\1') )

OUTPUT

# A tibble: 80 × 2
   fruit_name   long_vowels
   <chr>        <chr>
 1 apple        aaaaappleeeee
 2 apricot      aaaaapriiiiicooooot
 3 avocado      aaaaavooooocaaaaadooooo
 4 banana       baaaaanaaaaanaaaaa
 5 bell pepper  beeeeell peeeeeppeeeeer
 6 bilberry     biiiiilbeeeeerry
 7 blackberry   blaaaaackbeeeeerry
 8 blackcurrant blaaaaackcuuuuurraaaaant
 9 blood orange blooooooooood oooooraaaaangeeeee
10 blueberry    bluuuuueeeeebeeeeerry
# ... 70 more rows

Key Points

  • Regular expressions through R:
  • str_detect( string.vector, 'pattern' )
  • str_replace( string.vector, 'pattern', 'replacement )
  • str_replace_all( string.vector, 'pattern', 'replacement ) for ‘greedy’ match & replace.
  • Need to double escape \\ any back slashes in patterns.