Regular Expressions: The pitch
- Regexes are powerful tools for searching and transforming text.
- A search pattern, using a defined syntax, allows non-specific but directed matching.
Shell wildcards - a type of regex
- Use of wildcards in the Unix shell for file selection is a simple form of regular expressions.
- * matches zero or more characters
- ? matches exactly one character
- [ ] matches a character from a list or range of contained options
- [! ] matches a character NOT in a list or range of contained options
- { } expands to produce forms of all listed contained options
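The wildcard rules above can be tried outside the shell. This is a minimal sketch using Python's standard fnmatch module, which follows the same *, ?, [ ] and [! ] conventions. The file names are hypothetical, and { } is left out because brace expansion is done by the shell itself, not by fnmatch.

```python
import fnmatch

# Hypothetical file names, used only for illustration.
names = ["data1.txt", "data2.txt", "data10.txt", "dataA.txt", "notes.md"]

# * : zero or more characters
all_txt = fnmatch.filter(names, "*.txt")

# ? : exactly one character
one_char = fnmatch.filter(names, "data?.txt")

# [ ] : one character from a list or range
one_digit = fnmatch.filter(names, "data[0-9].txt")

# [! ] : one character NOT in the list or range
not_digit = fnmatch.filter(names, "data[!0-9].txt")

print(all_txt)    # ['data1.txt', 'data2.txt', 'data10.txt', 'dataA.txt']
print(one_char)   # ['data1.txt', 'data2.txt', 'dataA.txt']
print(one_digit)  # ['data1.txt', 'data2.txt']
print(not_digit)  # ['dataA.txt']
```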
Pattern matching with grep -E, part 1
- grep in Extended Regex mode (or egrep) allows complex pattern matching in files/streams.
- | acts as an OR between options
- ( ) allows grouping, e.g. for the OR modifier, with quantifiers, etc.
- [ ] matches a character from a list or range of contained options
- [^ ] matches a character NOT in a list or range of contained options
- ^ at the start of a regex means match at start of line
- $ at the end of a regex means match at end of line
- . is the match-all (any single character) wildcard
- ? quantifies previous character or group as occurring zero or one time
- * quantifies previous character or group as occurring zero or more times
- + quantifies previous character or group as occurring one or more times
- {n,m} quantifies previous character or group as occurring between n and m times
- Quantifiers are greedy: they will always match the longest possible fit.
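The operators above can be sketched with Python's re module, whose syntax for these constructs closely resembles grep -E (the two flavours are similar, not identical). The input strings are hypothetical and chosen only to show each operator.

```python
import re

# Hypothetical input lines, for illustration only.
lines = ["colour chart", "color wheel", "no match here", "chart of colours"]

# ( ) groups the | alternatives; ^ anchors the match to the start of the line.
at_start = [s for s in lines if re.search(r"^col(o|ou)r", s)]

# ? makes the preceding 'u' optional; $ anchors the match to the end of the line.
at_end = [s for s in lines if re.search(r"colou?rs$", s)]

# [ ] with a range plus {n,m}: runs of two to three digits.
digit_runs = re.findall(r"[0-9]{2,3}", "7 42 123 12345")

# [^ ] : delete every character that is NOT a digit.
digits_only = re.sub(r"[^0-9]", "", "tel: 555-0199")

# Greedy quantifier: .+ takes the longest possible fit, not the first '>'.
greedy = re.search(r"<.+>", "<a><b>").group()

print(at_start)     # ['colour chart', 'color wheel']
print(at_end)       # ['chart of colours']
print(digit_runs)   # ['42', '123', '123', '45']
print(digits_only)  # 5550199
print(greedy)       # <a><b>
```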
Pattern matching with grep -E, part 2
- grep in Extended Regex mode has a number of predefined character classes, used inside a bracket expression, e.g. [[:digit:]]:
[:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:]
- and escape-character enabled shorthand character classes and anchors:
- \w : Word character, [a-zA-Z0-9] OR a _ (underscore)
- \W : Inverse of \w ([^\w]), any non-word character
- \s : Whitespace: spaces, tabs, in some contexts new-lines
- \S : Inverse of \s ([^\s]), any non-whitespace character
- \b : Boundary between a word character and a non-word character (or the start/end of the line), 0-length anchor
- \B : Inverse of \b, any position that is not a word boundary, e.g. in the middle of a word or of a run of spaces, 0-length anchor
- \< : Boundary at start of word, 0-length anchor
- \> : Boundary at end of word, 0-length anchor
- You can refer back to an exact copy of the text matched by a ( ) group using \1, \2, etc.
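A short sketch of the shorthand classes, anchors and back-references, again using Python's re module as a stand-in. Note the assumption: Python supports \w, \W, \s, \S, \b, \B and \1, but not the POSIX [:alpha:]-style classes or the \< and \> anchors, so those are not shown. The sample text is hypothetical.

```python
import re

# Hypothetical sample text, for illustration only.
text = "the cat sat on the the mat, concatenating cats"

# \b : 'cat' only where it stands alone as a whole word.
whole_word = re.findall(r"\bcat\b", text)

# \w* on either side: every whole word that contains 'cat'.
containing = re.findall(r"\b\w*cat\w*\b", text)

# \1 : an exact copy of what group 1 matched, here a doubled word.
doubled = re.search(r"\b(\w+) \1\b", text).group()

# \s+ : split on any run of spaces, tabs or new-lines.
fields = re.split(r"\s+", "a  b\tc\nd")

print(whole_word)  # ['cat']
print(containing)  # ['cat', 'concatenating', 'cats']
print(doubled)     # the the
print(fields)      # ['a', 'b', 'c', 'd']
```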
Find... and replace! With sed.
sed -E 's/pattern/replacement/'
- 's/pattern/replacement/g' - the g flag enables global (replace-all) mode: every match on a line is replaced, not just the first.
- Use grouping ( ) in the pattern and back-references (\1, \2, etc.) in the replacement to rearrange or recontextualise parts of the matched input.
- Tips for writing complex substitutions:
- 1- Start with a complete real example pasted as your pattern.
- 2- Escape, with a back slash ‘\’, any forward slashes, literal brackets, etc., as necessary.
- 3- Circle the parts to retain, with round brackets.
- 4- Write your replacement rules, using back-references.
- 5- Substitution should now work for your specific real example.
- 6- Generalise the pattern with wildcards, character classes, etc., until it is loose enough to cover all required cases.
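The six steps above can be walked through in a small sketch. It uses Python's re.sub rather than sed, so that it can be run as-is; the grouping and \1 back-reference ideas carry over. The names are hypothetical examples. One difference to keep in mind: re.sub replaces every match by default (like sed's g flag), and count=1 gives the first-match-only behaviour of sed without g.

```python
import re

# Steps 1-5: start from one real example, group the parts to keep,
# and write the replacement with back-references.
example = "Smith, John"
specific = re.sub(r"(Smith), (John)", r"\2 \1", example)

# Step 6: generalise the literal text into \w+ so other names match too.
general_pattern = r"(\w+), (\w+)"
names = ["Smith, John", "Curie, Marie"]  # hypothetical inputs
swapped = [re.sub(general_pattern, r"\2 \1", n) for n in names]

# First match only (sed without g) versus every match (sed with g).
first_only = re.sub(r"a", "X", "banana", count=1)
every_match = re.sub(r"a", "X", "banana")

print(specific)     # John Smith
print(swapped)      # ['John Smith', 'Marie Curie']
print(first_only)   # bXnana
print(every_match)  # bXnXnX
```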
Regexes within text editors
- Regular expression capabilities are incorporated in most modern text editors for find and replace.
Python regular expressions
- Regular expressions through Python:
import re
match = re.search(r'pattern', 'string')
- OR
parts = re.split(r'pattern', 'string')
- OR
new_string = re.sub(r'pattern', r'replacement', 'string')
- Reference: https://docs.python.org/3/library/re.html
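As a minimal sketch of the three calls above on a concrete string (the input line and variable names are hypothetical, chosen only for illustration):

```python
import re

# Hypothetical input line, for illustration only.
line = "sample_42: 3.5 mg; 7.25 mg"

# re.search returns a match object (or None); group(1) is the first ( ) group.
match = re.search(r"sample_(\d+)", line)
sample_id = match.group(1) if match else None

# re.split cuts the string wherever the pattern matches.
parts = re.split(r"[:;]\s*", line)

# re.sub returns a new string; \1 re-inserts the captured number.
relabelled = re.sub(r"(\d+\.\d+) mg", r"\1 milligrams", line)

print(sample_id)   # 42
print(parts)       # ['sample_42', '3.5 mg', '7.25 mg']
print(relabelled)  # sample_42: 3.5 milligrams; 7.25 milligrams
```

Raw strings (the r'' prefix) keep back slashes intact, so the pattern reaches the regex engine exactly as written.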
R regular expressions
- Regular expressions through R (functions from the stringr package):
str_detect( string.vector, 'pattern' )
str_replace( string.vector, 'pattern', 'replacement' )
- str_replace_all( string.vector, 'pattern', 'replacement' ) replaces every match, not just the first.
- Need to double escape (\\) any back slashes in patterns.