Content from Regular Expressions: The pitch
Last updated on 2024-10-25 | Edit this page
Estimated time: 10 minutes
Overview
Questions
- What sort of capabilities do regular expressions provide?
- What are regular expressions?
Objectives
- Introduce the concept of regular expressions.
- Sell the idea of the power of regular expressions.
What are regular expressions?
In short, a regular expression, or rational expression, or regex for short, is a sequence of characters that act as a pattern for searching within text. So, what do we mean by pattern?
Consider a calendar date written down, say, 23/08/2018. We all recognise a date as being a date when we see one written down this way. Why? Because there’s a consistent pattern to it. We could describe the pattern of a date written this way as: one or two digits, a forward slash, one or two digits, a forward slash, then either 2 digits or 4 digits. Using that pattern of what a date looks like, our eyes can scan a page of text and pick out other dates.
Regular expressions are a formalisation of such patterns into a consistent syntax, making use of metacharacters of certain meaning, such that we can guide a computer to search text in a way that is as specific or as ambiguous as required. It’s a very powerful tool, being able to search text for patterns instead of just exact words. Taken further, it also allows for complex find & replaces, where the replacing is also following a pattern described with the same syntax.
The pitch
So, have you ever wanted to pull specific information out of a file, where a straight search for a specific word or phrase wouldn’t cut it? Perhaps all you knew was the general “shape” of what you were looking for. For example, you know there’s a date involved, not what the date is, but that it would look like a date- day/month/year. Or maybe it’s part of a specific file format; you know the format’s rules, but not the contents.
Have you ever wanted to search a file for something, but only actually print out some different nearby information from the same line?
Have you ever wanted to make bulk changes to a file, where a simple ‘find & replace’ wouldn’t work? Perhaps you’d like to rename a list of samples, according to a rule based on their current names. Or fix up some formatting inconsistencies.
Or have you ever wanted to completely rearrange and modify lines of a file or program output?
These are the situations that regular expressions make really quick and easy compared to other solutions.
Here’s an example. We have a list of files in folders, each representing a sample and replicate in an experiment, with some sample details incorporated into the file names. We wish to automatically transform this into a table, listing samples, sample information, and file details.
There are numerous ways to do this, but most solutions would be
multi-part, requiring cutting out different bits of information at a
time.
Using a regular expression, this entire transformation can be done in a
single command:
It’s ugly, but it works (which might be the regular expression catch-phrase).
This is a regular expression substitution, containing a pattern that matches each part of our file names and another pattern that defines the rearrangement. A goal at the end of today will be to understand how this works and to be able to implement similar effects yourselves.
There have been numerous implementations of regular expressions, but the major common implementations available today all follow a mostly-consistent syntax, which is what we’ll be learning. So, for example, once you learn this regex syntax, you’ll then be able to use it within certain Unix command line utilities and in Python and in R. It even works in most text editors too!
Key Points
- Regexs are powerful tools for searching and transforming text.
- A search pattern, using a defined syntax, allows non-specific but directed matching.
Content from Shell wildcards - a type of regex
Last updated on 2024-10-25 | Edit this page
Estimated time: 40 minutes
Overview
Questions
- Are regular expressions available in standard Unix commands?
Objectives
- (Re)Introduce wildcards available in the Unix shell for file selection.
- ‘
*
’, ‘?
’, ‘[ ]
’, ‘[! ]
’ and ‘{ }
’
While using a Unix shell, you may have already become familiar with a simple form of regular expression. Have you ever used a command like this?
It’s a searchable pattern which means “match anything that ends in .txt”. The * works as a wildcard, expanding to match any of zero or more characters.
However, there are other wildcards available that allow for more specific and complex patterns.
‘?’ to match any single character
The ? wildcard matches any character, but always exactly one
character.
For example:
OUTPUT
sample1.txt sample4.txt sample7.txt sampleA.txt sampleD.txt sampleY.txt
sample2.txt sample5.txt sample8.txt sampleB.txt sampleE.txt sampleZ.txt
sample3.txt sample6.txt sample9.txt sampleC.txt sampleF.txt
Try it 1
- Write a similar ls command to list .txt files with a 2 digit/char
sample number.
- Write a similar ls command to list for samples specifically in the range of 10-19.
That last challenge was a bit of a gotcha. The solution using ‘?’ would have also picked up a sample “1A”. Not what we were after, but a good segue for the next concept- matching to a list of options or ranges.
‘[ ]’ to match any single character in a list or range
Use of square brackets [ *list* ]
allows matching to a
single character, where that character has to match any of the options
listed between the brackets. The brackets may contain a list, or a
range, or a mix of both. For example, to match any one digit from 0 to
9, the following are equivalent:
Ranges work for alphabet characters too. The following are equivalent:
Important: as always on a Unix shell, case matters! A list of all alphanumeric characters is:
So back to our earlier challenge of listing files for sample numbers
10-19.
A better solution may be:
Remember: an entire ‘[list]’ set will only match a single listed character at a time.
Try it 2
- Write an ls command to list .txt files for samples 2 to 5.
- Write an ls command to list .txt files for samples X, Y and Z and
sample A.
- Write an ls command to list .txt files for samples denoted by a
number followed by a letter (e.g. sample1A).
- Write an ls command to list .txt files for all sample names ending in a letter.
‘!’ to match any single character NOT in a list or range
Connected to the square bracket notation is the NOT(!) symbol ‘[! ]’.
Use of a ‘!’ inside ‘[ ]’ makes it match to any single character NOT in
the list.
E.g.:
… would match any character that’s not A.
… would match any character that’s not a digit from 0 to 9.
Try it 3
- Write an ls command to list .txt files for all samples other than those numbered 4 to 7 and 9.
‘{ }’ to list whole word or expressions as options
Lastly for this section we have the curly brackets ‘{ }’. Like for
square brackets, curly brackets lets you match to one of the options
listed inside. The difference is, this time we list not just single
characters, but entire words or expressions to choose between.
E.g. ‘{alpha,beta,gamma}’ would match ‘alpha’ and ‘beta’ and
‘gamma’.
Real example:
You can use other wildcard patterns within the listed options.
So these are equivalent:
And these are equivalent:
The curly braces actually have their own range capability, even more powerful in that it allows for more than single digit numbers:
BUT if you’ve spotted the warnings while trying these examples, you may have guessed the catch with using curly brackets- They aren’t actually a form of pattern matching! Rather, the first thing a shell does is expand the brackets out to form all posibilities listed, in full.
So, when you run the first command below, you’re actually executing the second command below:
Contrast:
Brace expansion is always the first order of business when the command line interprets a command. It’s a useful tool, and while not quite fitting the regular expression theme, the concept of listing longer options in a this-OR-that fashion will be visited again in proper regexs.
Try it 4
- Write an ls command to list files for samples 8 to 13 and CD.
- Write an ls command to list just .csv and .tab files for samples 11 and CD.
While syntax will differ, the concepts learnt here will remain very relevant as we next move onto more advanced forms of regular expression.
Key Points
- Use of wildcards in the Unix shell for file selection is a simple form of regular expressions.
-
*
matches zero or more characters -
?
matches exactly one character -
[ ]
matches a character from a list or range of contained options -
[! ]
matches a character NOT in a list or range of contained options -
{ }
expands to produce forms of all listed contained options
Content from Pattern matching with grep -E, part 1
Last updated on 2024-11-01 | Edit this page
Estimated time: 90 minutes
Overview
Questions
- How do we write regexs to match complex patterns in files or output streams?
Objectives
- Learn the basic syntax of Extended Regular Expressions, using grep -E
You may have used the command ‘grep’ before, which allows you to search for a matching string. It may be given a file name, to make it search within a file:
OUTPUT
class react exercise impartial heavenly
trees shelf amused reactive sheet
relation reacting handsome ashamed
Or it can search piped output from another command:
OUTPUT
Andrew
Andrew
In this example, the ‘-o’ flag was used to make grep print only the matches it finds, rather than whole matching lines. This will be a handy option when we start testing more ambiguous search patterns later.
For the following lessons, we’ll be making use of grep with a ‘-E’ flag, which enables Extended Regular Expression (ERE) mode for searching regex patterns. On some systems, an aliased command ‘egrep’ exists. The regular expression syntax of ‘grep -E’ is the same as that of ‘sed -E’, which we’ll be using for “find & replace” commands later, and basically the same as most other implementations of regular expressions.
‘|’ for matching this or that
Like in the ‘if’ statements of most programming languages, the pipe character acts as an OR, allowing matching to any of the listed options.
OUTPUT
bolt glow shaky lumpy hypnotic
bell spiritual materialistic rule sponge
exclusive color harbor boat bedroom
Round brackets may be used to group terms together. This grouping has multiple uses, as you’ll see as we continue, but one is to group OR options together. E.g.:
OUTPUT
bolt glow shaky lumpy hypnotic
bell spiritual materialistic rule sponge
exclusive color harbor boat bedroom
Consider the following:
OUTPUT
BC
Using ‘-o’ to print matching part only, why wasn’t the ‘A’ included in result? The search pattern asked for either a A or a B which was followed by a C. Only the B met this criteria.
Try it 1
- Write a grep -E command to search wordplay1.txt for either ‘corn’ or ‘cow’
- Try the above with -w option enabled (match as whole words only)
- Write a grep -E command to search wordplay1.txt for either ‘reaction’ or ‘reactive’
- Write an alternative working answer to 3.
‘[ ]’ to match any single character in a list or range
This may sound familiar…
Use of square brackets [ *list* ]
allows matching to a
single character, where that character has to match any of the options
listed between the brackets. The brackets may contain a list, or a
range, or a mix of both. For example, to match any one digit from 0 to
9, the following are equivalent:
Ranges work for alphabet characters too. The following are equivalent:
Important: as always on a Unix shell, case matters! A list of all alphanumeric characters is:
Example:
OUTPUT
bog
cog
tog
log
Example:
OUTPUT
2003
2018
Try it 2
- Write a grep -E command to search wordplay1.txt for 4-letter words ending in ‘oat’
- What if you didn’t know if the words started with lower case or capitol letters?
- What if any of the letters could be upper case or lower case?
- Write an alternative working answer to 3.
When used within square brackets, a ‘^’ means “NOT”. For example ‘[^ABC]’ would match any one character that wasn’t A or B or C.
Example:
OUTPUT
cog
tog
log
lag
Try it 3
- Modify your ‘oat’ word finder to find any 4-letter ‘oat’ words other than ‘boat’ and ‘goat’
Match start of line ^ or end of line $
The ‘^’ symbol has another use. When placed at the start of a regex
pattern, it anchors the pattern such that it must match from the start
of a line.
Similarly, a ‘$’, when placed at the end of a regex pattern, anchors the
pattern such that it must match up to the end of a line.
Consider the following examples demonstrating the effect:
OUTPUT
repeat drum quilt superficial uncovered
copper bad unpack repeat old-fashioned
sheet accessible word enthusiastic repeat
OUTPUT
repeat drum quilt superficial uncovered
OUTPUT
sheet accessible word enthusiastic repeat
There will be more use for these anchors when we get into more complex patterns with wildcards.
‘.’ is the match-anything regex wildcard
Speaking of, the match-anything wildcard (any single character) for regular expressions is a period, ‘.’
Consider:
OUTPUT
boat
oat
loat
coat
goat
For the short word “oat”, it was the preceeding tab character that the wildcard matched.
One more concept to introduce before we get into some better examples of using this wildcard.
Matching multiples of things
What if you wanted to match something that repeated multiple
times?
Any single character, bracket expression or grouping, can be modified to
specify number of occurances, with one of the following:
Symbol(s) | Effect |
---|---|
? | Present either one time or no times (0-1) |
* | Present any number of times, including zero (0+) |
+ | Present at least once, but any number of times (1+) |
{n} | Repeated exactly ‘n’ times |
{n,m} | Repeated between ‘n’ and ‘m’ times |
{n,} | Repeated at least ‘n’ times |
{,m} | Repeated at most ‘m’ times |
For example, what if we wanted to search for ‘colour’ but maybe it had US spelling?
OUTPUT
exclusive color harbor boat bedroom
Here the ‘?’ modifier let the preceeding ‘u’ match either one time OR zero times.
Lets find all words from our words file that are at least 12 letters long:
OUTPUT
materialistic
distribution
enthusiastic
sophisticated
Pattern repetition
Note, it’s probably clear from the previous example that, when specifying repetition of a pattern in this way, e.g. “match any letter from a to z, 12 or more times”, it’s the general ambiguous pattern concept which is repeated, not a specific matching case. Hence we found 12 lots of “any letter” in a row, rather than “any letter” and then 12 repeats of that letter.
Greediness
Regular expression search patterns are considered “greedy”. That is,
when you use ‘+’ or ‘*’ or ‘{n,}’, they will always match the
longest possible string that still fits the pattern.
E.g. ‘.*’, which means “any character, zero or more times”, will always
match an entire line without further specific context around it.
Try it 4
- Use ‘grep -E -o’ on wordplay1.txt to match all of any line
that starts with a ‘c’
- Use ‘grep -E -o’ on wordplay1.txt to match the first word
of lines starting with a ‘c’
- Modify previous answer to keep words 5 or 6 letters long only.
- Use ‘grep -E -o’ on wordplay1.txt to match both ‘stand’ and ‘outstanding’
What if we wanted to match a date, and didn’t know if the year would be 2 digits or 4? Our pattern for a date is “a digit, from 0-9, either one or two of them, then a forward slash, then a digit, either one or two of them, then either 2 digits OR 4 digits.”
OUTPUT
11/06/91
5/9/2018
Alternatively, if we were sure about the sensibleness of the contents we were searching, we may get away with just:
OUTPUT
11/06/91
5/9/2018
Try it 5
Have a look at ‘namesndates.txt’
(cat namesndates.txt
)
- Use ‘grep -E -o’ on namesndates.txt to list times
(e.g. 20:57).
- Use ‘grep -E -o’ on namesndates.txt to list all calendar
dates.
- Consider extra information we know about dates: days will be at most
31, months at most 12.
Could we use aspects of this information within our regex for more sensibleness checking?
Key Points
- grep in Extended Regex mode (or egrep) allows complex pattern matching in files/streams.
-
|
acts as an OR between options -
( )
allows grouping, e.g. for OR modifier, with quantifiers, etc.. -
[ ]
matches a character from a list or range of contained options -
[^ ]
matches a character NOT in a list or range of contained options -
^
at the start of a regex means match at start of line -
$
at the end of a regex means match at end of line -
.
is the match-all (any single character) wildcard -
?
quantifies previous character or group as occurring zero or one time -
*
quantifies previous character or group as occurring zero or more times -
+
quantifies previous character or group as occurring one or more times -
{n,m}
quantifies previous character or group as occurring between n and m times - Quantifiers are greedy- will always match longest possible fit.
Content from Pattern matching with grep -E, part 2
Last updated on 2024-10-25 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- How do we use predefined character classes for more complex search patterns?
Objectives
- Learn to incorporate predefined character classes into regular expressions.
Predefined character classes
The Extended Regular Expression syntax has a number of predefined
character groupings that may be written as a word, rather than a
collection or range of characters.
For example, you may write [[:digit:]]
instead of writing
[0-9]
.
Written as | Equivalent to |
---|---|
[[:digit:]] | [0-9] |
[[:alpha:]] | [a-zA-Z] |
[[:alnum:]] | [a-zA-Z0-9] |
[[:upper:]] | [A-Z] |
[[:lower:]] | [a-z] |
[[:space:]] | Spaces, tabs, in some contexts new-lines |
[[:graph:]] | Any printable character other than space |
[[:punct:]] | Any printable character other than space or [a-zA-Z0-9] |
Why?
Why would you write [[:digit:]]
instead of writing
[0-9]
? Or [[:upper:]]
instead of
[A-Z]
? In general, for our use, there’s little difference
other than readability/style. The difference comes if you need to make
things more universal, as Arabic-numerals are not the only number system
in use around the world and the 26-letter English alphabet is not the
only writing system. So, for example, while [0-9]
will
always just be those specific 10 characters, [[:digit:]]
may have multiple alternate numeric systems encoded.
Regex shorthand escape symbols
The other way you may refer to predefined character classes, and the
way you will likely most commonly do so from here on, is using the
following shorthands, formed by “escaping” certain characters with a
backslash. For example ‘\w’ can be used to match to any “word”
character, which means any letter, number, or (counterintuitively) an
underscore.
The shorthand symbols available are:
Written as | Equivalent to |
---|---|
\w | “Word” character [a-zA-Z0-9] OR a _ (underscore) |
\W | [^\w] Inverse of \w, any non-“word” character |
\s | Spaces, tabs, in some contexts new-lines |
\S | [^\s] Inverse of \s, any non-space character |
\b | Boundary between “words” and “spaces” (0-length) |
\B | [^\b] In the middle of a “word” or multiple “spaces” (0-length) |
\< | Boundary at start of “word” between “words” and “spaces” (0-length) |
\> | Boundary at end of “word” between “words” and “spaces” (0-length) |
Those last four are considered “anchors”. They don’t actually match characters, but they can give a regex pattern more context, helping to orientate components. For example, a letter being specifically at the start of a word.
The following are also commonly used within regex syntax (e.g. will work in Python or R), but are not understood by grep or sed:
Written as | Equivalent to |
---|---|
\d | [0-9] A digit |
\D | [^0-9] Not a digit |
\t | A tab character (does work in some version of sed, test yours) |
\n | A newline, if program supports multi-line matching |
Here are some examples.
The first two words of a line (start of line, word, space(s), word):
OUTPUT
word1 word_2
A word with spaces at both ends:
OUTPUT
word_2
Every set of consecutive non-space characters:
OUTPUT
word1
word_2
thirdWord!?
Everything up to the boundary of the last word:
OUTPUT
word1 word_2
The middle characters of each word (bounded by not-a-word-boundary):
OUTPUT
ord
ord_
hirdWor
Try it 1
- Use ‘grep -E -o’ on wordplay1.txt to print the first 2 words of any
line using
- [[:alpha:]] and [[:space:]]
- \w and \s
- \S and \s
- Use ‘grep -E -o’, with \w and \b, on wordplay1.txt to print a word
that starts with ‘p’ and ends with ‘g’.
- Use ‘grep -E’, with \w and \<, on wordplay1.txt to highlight the first letter of every word.
Tab characters
There are a few options for getting a tab character to work with grep
or sed on a bash command line (if a space ‘\s’ or word boundary anchor
will not be specific enough):
1. On some systems, pressing ctrl+v followed by tab, will insert a
literal tab character.
2. On some systems, a literal tab could be copy and pasted in from a
text editor.
3. A dollar sign in front of pattern can enable escape character
interpretation in bash, based on ANSI-C rules, where ‘\t’ is a
tab.
E.g. echo $'|\t|'
OR grep -E $'\t'
This last option is neatest, but beware other conflicting escape
character interpretations in this mode, meaning that to use ‘\w’ or ‘\b’
etc., you will need to double-escape them, with two slashes.
E.g. for a word with tabs either side:
grep -E $'\t\w+\t'
Capturing groups and back-references
We’d mentioned earlier that round brackets ( ) have multiple uses. One use is to “capture” a match seen within the round brackets, remembering the contents of what matched within, such that a copy of the contents may be referred to again. We’ll make much use of this feature in the next lesson, for find-n-replace substitutions! Within grep though, a “back-reference” to a captured group can be used to identify something that repeats identically within a line. The reference to the previously seen item is used in the form ‘\number’. It works like this:
OUTPUT
blah2 blah2
Here a word ‘\w+’ is “captured” by the round brackets, is followed by space, then is referenced again by ‘\1’, which represents a stored copy of what was originally matched by the ‘\w+’. Hence this grep only matches a case where a word is followed by another copy of that same word.
If we’d like to make multiple back-references, we use multiple pairs of round brackets and increment our back-reference number by one for each open bracket from left-to-right. E.g.:
OUTPUT
blah2 blah2
We had the same outcome, but stored letters part “blah” separate to the digits part “2”, then referred to the two captured parts as ‘\1’ and ‘\2’ respectively.
Consider the following:
OUTPUT
EFGGFE
Our pattern matched to three letters, then to those same three letters repeated again, but in reverse order!
Try it 2
- Grep ‘wordplay1.txt’ to print the only line that contains the same word repeated twice. Hint: You may need to use the ‘\b’ word-boundary anchor.
- Grep ‘namesndates.txt’ to print the name of a person whose firstname and surname start with the same letter.
- Grep ‘namesndates.txt’ to print a date where the month and the day are the same number.
Key Points
- grep in Extended Regex mode has a number of predefined character classes:
[:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:]
- and escape-character enabled shorthand character classes and anchors:
-
\w
: Word character [a-zA-Z0-9] OR a _ (underscore) -
\W
:[^\w]
Inverse of \w, any non-word character -
\s
: Spaces, tabs, in some contexts new-lines -
\S
:[^\s]
Inverse of \s, any non-space character -
\b
: Boundary between adjacent word and space, 0-length anchor -
\B
:[^\b]
In the middle of a word or multiple spaces, 0-length anchor -
\<
: Boundary at start of word between word and space, 0-length anchor -
\>
: Boundary at end of word between word and space, 0-length anchor - You can refer back to an exact copy of a matched (group) using \1, \2, etc..
Content from Find... and replace! With sed.
Last updated on 2024-10-25 | Edit this page
Estimated time: 120 minutes
Overview
Questions
- How do we find and replace content using regex patterns?
Objectives
- Learn regex substitutions using sed.
Regex substitution syntax
Regex substitutions allow you to use the pattern syntax we’ve learned so far to describe complex ‘find & replace’ operations. For the following exercises we’ll be using the program ‘sed’, but the substitution syntax is common to other implementations too. It’s this:
‘s’ for substitution. The ‘pattern’ component is as we’ve been using with ‘grep -E’ so far; the same concept and syntax, describing a pattern to be matched. The ‘replacement’ component is what will replace a match of the pattern.
E.g. Substitute “hello” for “goodbye” would be:
Sed for regex substitutions
‘sed’ is a Unix command line utility, name short for “stream editor”. It will parse text passed to it, from a file or piped input stream, and can perform various transformations to the text. It has many different capabilities, as can be found in the sed manual. Today we’ll just be playing with the regular expression substitution capability. We’ll be using ‘sed -E’, which enables the same “Extended Regular Expression” syntax as we’d been using with ‘grep -E’.
There are a few ways to use it:
BASH
# Modify stream from another program, print result
other-command | sed -E 's/pat/replace/'
# Read input file, print contents with changes
sed -E 's/pat/replace/' inFile
# Read input file, write contents with changes to another file
sed -E 's/pat/replace/' inFile > outFile
# Read input file, write with changes to the same file (caution!)
sed -E -i 's/pat/replace/' file
Example:
OUTPUT
hello Bob
All of our regex pattern capabilities still work:
OUTPUT
hello Bob
We replace the entirety of what we match:
OUTPUT
A whole new line
Only the first possible extended match is replaced:
OUTPUT
blah Andrew
However, a greedy mode may be enabled which keeps looking for subsequent matches to replace, by adding a ‘g’ to the end of your substitution string:
OUTPUT
blah blah
The replacement section is allowed to be empty too:
OUTPUT
helloAndrew
Try it 1
Using sed -E 's/ / /g' wordplay1.txt
…
- Replace all vowels (a,e,i,o,u) with ‘oo’
- Replace all words longer than 4 characters with ‘blah’
- Replace an entire line if it has any words longer than 4 characters with “had long words” (as a literal phrase)
- Insert “Some words:” (as a literal phrase) at the start of every line.
Making use of back-references
Recall with grep we could “capture” or remember a part of a match by encasing it in round brackets ‘\(\)’ and then refer back to an exact copy of what was matched, using ‘\1’ for the first bracketed group, ‘\2’ for second bracketed group, etc.. Such back-references may also be used within the replacement section of a substitution. This is how we start doing more interesting find and replaces.
For example, we could add an exclamation mark after every word:
OUTPUT
hello! Andrew!
The back-reference is necessary, so that we can define the replacement as being “the same as what we matched, plus an exclamation mark.”
Another example, we could swap every second word:
OUTPUT
two one four three
We separately capture two adjacent words, then in the replacement, refer back to them in reverse order.
Remember that the entirety of a matched pattern is replaced. So if we do something like this, where a ‘.+’ matches the whole rest of the line, then the whole rest of the line is replaced.
OUTPUT
two one
Try it 2
Using sed -E 's/ / /g' wordplay1.txt
…
- Replace all vowels (a,e,i,o,u) with two copies of that same vowel
- Switch the places of the first and second word on each line
- Switch the places of the first and last word on each line, while retaining everything between
Forward slashes
Note that, as a substitution pattern is bookended by forward slashes,
there will be a conflict if you actually want to include a literal
forward slash. There are two ways of handling this.
1. Escape your literal forward slash with a back slash, ensuring it acts
as a literal character.\/
2. Alternatively, you can actually use any other character to bookend
the substitution! E.g.:'s;pattern;replacement;'
or's#pattern#replacement#'
If doing this, choose a character that won’t otherwise appear in your
substitution string and that doesn’t have any other special meaning.
Try it - date format conversion
Complete the following, to convert dates in namesndates.txt from
dd/mm/yyyy format, into yymmdd format. E.g. change 23/08/2012 into
120823.sed -E 's; ; ;' namesndates.txt
Try it - make a FASTA file
DNA and RNA sequences are often represented in the “FASTA” format, which looks like:
>seqID
ATCGTACGTAGCTACGT
The file ‘dnaSequences.txt’ contains DNA sequences in a tab-separated format:
seqID ATCGTACGTAGCTACGT
Use sed to convert dnaSequences.txt sequences into a FASTA format representation. Hint: ‘\n’ may be used in the replacement pattern to insert a newline.
Some sequences may be very long. Can you make it so that there’s never more than 20 characters per line, with longer sequences split over multiple lines?
Try it - Fixing a file
Have another look at ‘namesndates.txt’
(cat namesndates.txt
)
A number of typos and inconsistencies were made when adding rows to
‘namesndates.txt’:
1. One row has a name written as “surname, firstname”, instead of
“firstname surname”.
2. One row has a date written in “dd-mm-yyyy” format instead of
“dd/mm/yyyy” format.
3. One row is comma-separated, while the rest are tab-separated.
4. Some rows have multiple spaces, or a mix of tabs and spaces, in place
of just single tabs.
(cat -T namesndates.txt
can help distinguish tabs from
spaces)
Using multiple piped sed substitutions, can you correct all of these mistakes?
Prototyping solutions
Tips for turning longer complex strings into regular expression
substitutions:
1. Start by copying a real example of a whole string into your pattern
section.
2. Add escape back slashes to any forward slashes, literal brackets,
etc., as necessary.
3. “Circle” the parts of the string you’d like to separately retain,
with round brackets.
4. Write out your replacement pattern, using back-reference to what you
circled.
5. At this stage, the substitution should work, but only for the
specific real example string that you’ve started with.
6. Finally, start abstracting your search pattern, replacing parts of
your example string with wild-cards or character-classes as needed, to
strike the balance between specificity and ambiguity required to match
all that you want and not all that you don’t want.
Try it - information from file names
‘fileExample1.txt’ contains the input text from the first regex
example in the intro of this course. Can you recreate that
transformation?
Convert this list of files (fileExample1.txt)…
/path/to/my/data/folder1/s1-R1_6h-shade.fasta
/path/to/my/data/folder1/s2-R1_6h-sun.fasta
/path/to/my/data/folder1/s3-R1_24h-shade.fasta
/path/to/my/data/folder1/s4-R1_24h-sun.fasta
/path/to/my/data/folder2/s1-R2_6h-shade.fasta
/path/to/my/data/folder2/s2-R2_6h-sun.fasta
/path/to/my/data/folder2/s3-R2_24h-shade.fasta
/path/to/my/data/folder2/s4-R2_24h-sun.fasta
…into this table of sample information:
OUTPUT
Sample1 Rep1 6hours shade s1-R1_6h-shade.fasta /path/to/my/data/folder1/
Sample2 Rep1 6hours sun s2-R1_6h-sun.fasta /path/to/my/data/folder1/
Sample3 Rep1 24hours shade s3-R1_24h-shade.fasta /path/to/my/data/folder1/
Sample4 Rep1 24hours sun s4-R1_24h-sun.fasta /path/to/my/data/folder1/
Sample1 Rep2 6hours shade s1-R2_6h-shade.fasta /path/to/my/data/folder2/
Sample2 Rep2 6hours sun s2-R2_6h-sun.fasta /path/to/my/data/folder2/
Sample3 Rep2 24hours shade s3-R2_24h-shade.fasta /path/to/my/data/folder2/
Sample4 Rep2 24hours sun s4-R2_24h-sun.fasta /path/to/my/data/folder2/
Key Points
sed -E 's/pattern/replacement/'
-
's/pattern/replacement/g'
- enables Greedy, replace-all mode. - Use grouping () in pattern and back-reference \1 in replacement…
- … to rearrange or recontextualise parts of the matched input.
- Tips for writing complex substitutions:
- 1- Start with a complete real example pasted as your pattern.
- 2- Escape ‘\’ any forward slashes, literal brackets, etc., as necessary.
- 3- Circle the parts to retain, with round brackets.
- 4- Write your replacement rules, using back-references.
- 5- Substitution should now work for your specific real example.
- 6- Abstract pattern with wildcards, etc., to make ambiguous enough for all required cases.
Content from Regexs within text editors
Last updated on 2024-10-25 | Edit this page
Estimated time: 5 minutes
Overview
Questions
- How can we invoke regular expressions within certain text editors?
Objectives
- Introduce regex substitutions implemented in text editors, e.g. Nano, VSCode, Notepad++, etc…
Regular expressions, for both searching and substituting, with all the capabilities we’ve looked at today, are fully implemented in many modern text editors.
For example, on Windows: Visual Studio Code, Notepad++, Sublime Text…
Patterns follow all the same syntax you’ve learnt using grep and sed. However, you’ll usually find that the following patterns work that didn’t work with Grep:
Written as | Equivalent to |
---|---|
\d | [0-9] A digit |
\D | [^0-9] Not a digit |
\t | A tab character |
\n | A newline, if program supports multi-line matching |
In some text editors, for example VSCode, back-reference variables are referred to with a ‘$’ sign instead of a backslash. E.g. instead of \1 and \2, you use $1 and $2. Experiment with your favourite text editor to see which it expects.
In Nano, find and replace options are listed at the bottom of the
screen as ^W Where Is
and ^\ Replace
respectively, which in Windows corresponds to Ctrl+W
for
Find or Ctrl+\
for Replace.
Once either is enabled a new option for toggling Regex mode on or off is
listed at the bottom: M-R Reg.exp.
. This command (Meta+R)
corresponds to Alt+R
on Windows.
Toggling Regex mode on will change the prompt from Search:
to Search [Regexp]:
.
With Regex mode on, all the same syntax from grep and sed should work,
including using \1, \2, etc. to reference groups captured by round
brackets when doing a replace.
Key Points
- Regular expression capabilities are incorporated in most modern text editors for find and replace.
Content from Python regular expressions
Last updated on 2024-10-25 | Edit this page
Estimated time: 10 minutes
Overview
Questions
- How can we invoke regular expressions using Python?
Objectives
- Introduce regex capabilities in Python.
Regular expressions, with search, replace and other capabilities, are
available in through Python through the ‘re’ library. E.g.
import re
.
Patterns follow all the same syntax you’ve learnt using grep and sed. However, you’ll usually find that the following patterns work that didn’t work with Grep:
Written as | Equivalent to |
---|---|
\d | [0-9] A digit |
\D | [^0-9] Not a digit |
\t | A tab character |
\n | A newline |
Python regexs, using the ‘re’ library, are implemented through functions, where search patterns and strings are given as function arguments.
Searching
The grep equivalent is the function re.search().match = re.search( pattern, string )
The returned match object looks “true” if there was a match and has
functions for interrogating aspects of the match, for example individual
bracketed groups matched and span range of matched parts.
PYTHON
import re
myString = 'word1 1234 word2'
match = re.search(r'\b(\d+) (\w+)', myString)
if match:
print(match.group(0))
print(match.span(0))
print(match.group(1))
print(match.group(2))
else:
print("No Match")
OUTPUT
1234 word2
(6, 16)
1234
word2
Note the ‘r’ in front of the search pattern,
r'\b(\d+) (\w+)'
. This enables the pattern string to be
passed to the re function in its literal form, which prevents
backslashes being being interpreted as escapes too early,
before the function looks at it. Without the preceeding ‘r’,
double-backslashes would be necessary.
E.g. '\\b(\\d+) (\\w+)'
Splitting
A neat ‘re’ feature is splitting of a string into an array of substrings, using a regex as a delimiter, instead of just a literal comma or tab or space, etc.. With this functionality, you can, for example, split up a line into individual words, using a regex “anything that’s not a word” as the delimiter for the split.
OUTPUT
['word_1', '1234', 'word2', 'word3']
Substituting
Substitution commands in Python ‘re’ take the form:newstring = re.sub( pattern, replacement, string )
Again, both the pattern and replacement/back-reference syntax is as
we’ve learnt already.
PYTHON
import re
oldstring = 'Four 123 Five'
newstring = re.sub( r'(\w+)\s+(\d+)\s+(\w+)', r'\2-\1-\3', oldstring )
print(newstring)
OUTPUT
'123-Four-Five'
Try it - Reformatting a file using Python
Write a Python script that reads through the file ‘namesndates_v2.txt’ and, for each line, rearranges it (using re.sub) to the following format, and prints the result:
month-year: surname,firstname @ place
For example, the line…
Neve Erindale 23/08/2012 20:57 Coombs
…should be printed as:
08-2012: Erindale,Neve @ Coombs
Hints:
1. \t
may be used to match the tab characters used between
the fields in the input file.
2. Basic Python file reading:
RE Manual
More reference for the Python ‘re’ library may be found at: https://docs.python.org/3/library/re.html
Key Points
- Regular expressions through Python:
import re
match = re.search(r'pattern', 'string')
- OR
list = re.split(r'pattern', 'string')
- OR
re.sub( r'pattern', r'replacement', 'string' )
- Reference: https://docs.python.org/3/library/re.html
Content from R regular expressions
Last updated on 2024-10-25 | Edit this page
Estimated time: 10 minutes
Overview
Questions
- How can we invoke regular expressions using R?
Objectives
- Introduce regex capabilities in R.
Regular expressions, with search, replace capabilities, are also available in R. Again, patterns follow all the same syntax you’ve learnt using grep and sed. And again, you’ll find that the following patterns work that didn’t work with Grep:
Written as | Equivalent to |
---|---|
\d | [0-9] A digit |
\D | [^0-9] Not a digit |
\t | A tab character |
There are multiple ways of utilising regular expressions through R, but we’re going to focus on a set of functions provided by the ‘stringr’ package:
R
library(stringr)
# We'll also load tidyverse for later
library(tidyverse)
The grep equivalent in ‘stringr’ is str_detect
. It’s
intended for use in searching vectors of strings.
str_detect( string.vector, 'pattern' )
Basic use returns a list of TRUE/FALSE for which vector entries matched the search pattern.
Note that you’ll need to use double-backslashes in place of any single backslash in R.
For example, find words where the same letter is repeated consecutively:
R
words <- c("apple", "banana", "carrot")
str_detect( words, '(\\w)\\1' )
OUTPUT
[1] TRUE FALSE TRUE
Applying the output of str_detect back to the original string vector, as indices, allows access to the actual matched entries:
R
words[ str_detect( words, '(\\w)\\1' ) ]
OUTPUT
[1] "apple" "carrot"
‘str_detect’ can also be used in conjunction with data frames. Within a data frame, each column is stored as a separate vector and so we can use ‘str_detect’ within any function that accesses a single data frame column (such as mutate()).
For the next example, we’re going to make use of the ‘stringr’ package coming pre-loaded with a longer list of words, provided exactly for demonstrations like this, named ‘fruit’. Check it out:
R
fruit
length(fruit)
First we’ll convert the fruit vector into a data frame:
R
framed_fruit <- tibble(fruit_name = fruit)
framed_fruit
We can then add a new column to the data frame, which is the result of using ‘str_detect’ to find our repeated letter pattern:
R
framed_fruit %>%
mutate(has_double = str_detect(fruit_name, '(\\w)\\1'))
OUTPUT
# A tibble: 80 × 2
fruit_name has_double
<chr> <lgl>
1 apple TRUE
2 apricot FALSE
3 avocado FALSE
4 banana FALSE
5 bell pepper TRUE
6 bilberry TRUE
7 blackberry TRUE
8 blackcurrant TRUE
9 blood orange TRUE
10 blueberry TRUE
# ... 70 more rows
Another approach is to use the ‘filter’ function to reduce the data frame to only matching rows:
R
framed_fruit %>%
filter( str_detect(fruit_name, '(\\w)\\1') )
OUTPUT
# A tibble: 29 × 1
fruit_name
<chr>
1 apple
2 bell pepper
3 bilberry
4 blackberry
5 blackcurrant
6 blood orange
7 blueberry
8 boysenberry
9 cherry
10 chili pepper
# ... 19 more rows
Substitution in R is enabled through the ‘str_replace’ or ‘str_replace_all’ functions.
str_replace( string.vector, 'pattern', 'replacement' )
These functions will complete substitutions in all matching entries in a string vector. The difference is that ‘str_replace_all’ will also perform greedy matching and replacing within individual vector entries, where ‘str_replace’ will only replace the first match in each vector entry. The functions return the modified string vector.
R
str_replace(words, '(\\w)\\1', '\\1a\\1a\\1')
OUTPUT
[1] "apapaple" "banana" "carararot"
And again, these functions may be combined with data frame use too. For example:
R
framed_fruit %>%
mutate( long_vowels = str_replace_all(fruit_name, '([aeiou])', '\\1\\1\\1\\1\\1') )
OUTPUT
# A tibble: 80 × 2
fruit_name long_vowels
<chr> <chr>
1 apple aaaaappleeeee
2 apricot aaaaapriiiiicooooot
3 avocado aaaaavooooocaaaaadooooo
4 banana baaaaanaaaaanaaaaa
5 bell pepper beeeeell peeeeeppeeeeer
6 bilberry biiiiilbeeeeerry
7 blackberry blaaaaackbeeeeerry
8 blackcurrant blaaaaackcuuuuurraaaaant
9 blood orange blooooooooood oooooraaaaangeeeee
10 blueberry bluuuuueeeeebeeeeerry
# ... 70 more rows
Documentation
More reference for the R stringr functions may be found here:
https://www.rdocumentation.org/packages/stringr/versions/0.3/topics/str_detect
https://www.rdocumentation.org/packages/stringr/versions/0.3/topics/str_replace
https://www.rdocumentation.org/packages/stringr/versions/0.3/topics/str_replace_all
Older alternate functions that achieve the same result:
https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/grep
https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/regex
Key Points
- Regular expressions through R:
str_detect( string.vector, 'pattern' )
str_replace( string.vector, 'pattern', 'replacement )
-
str_replace_all( string.vector, 'pattern', 'replacement )
for ‘greedy’ match & replace. - Need to double escape
\\
any back slashes in patterns.