Regular Expressions: The pitch
|
Regexes (regular expressions) are powerful tools for searching and transforming text.
A search pattern, written in a defined syntax, allows matching that is flexible but still targeted.
|
Shell wildcards - a close relative of regex
|
Wildcards (glob patterns) used in the Unix shell for file selection are a simpler relative of regular expressions.
* matches zero or more characters
? matches exactly one character
[ ] matches a character from a list or range of contained options
[! ] matches a character NOT in a list or range of contained options
{ } expands to every listed option (brace expansion: done whether or not matching files exist)
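To try the wildcard rules above without touching real files, here is a minimal sketch using Python's standard fnmatch module, which implements shell-style *, ?, [ ] and [! ] matching. The file names are made up for illustration, and fnmatch does not perform { } brace expansion, so that rule is not shown.

```python
import fnmatch

# Hypothetical file names, used only for illustration.
names = ["data1.txt", "data2.txt", "data10.txt", "datax.txt", "notes.md"]

# * matches zero or more characters
txt_files = fnmatch.filter(names, "*.txt")

# ? matches exactly one character (so data10.txt is excluded)
one_char = fnmatch.filter(names, "data?.txt")

# [ ] matches one character from a list or range
one_digit = fnmatch.filter(names, "data[0-9].txt")

# [! ] matches one character NOT in the list or range
not_digit = fnmatch.filter(names, "data[!0-9].txt")

print(txt_files)
print(one_char)
print(one_digit)
print(not_digit)
```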
|
Pattern matching with grep -E, part 1
|
grep in Extended Regex mode (grep -E; egrep is the older, now-deprecated name) allows complex pattern matching in files/streams.
| acts as an OR between options
( ) allows grouping, e.g. to limit the scope of an OR or to apply a quantifier to several characters
[ ] matches a character from a list or range of contained options
[^ ] matches a character NOT in a list or range of contained options
^ at the start of a regex means match at start of line
$ at the end of a regex means match at end of line
. is the match-all (any single character) wildcard
? quantifies previous character or group as occurring zero or one time
* quantifies previous character or group as occurring zero or more times
+ quantifies previous character or group as occurring one or more times
{n,m} quantifies previous character or group as occurring between n and m times
Quantifiers are greedy: they always match the longest possible fit.
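The operators above can be tried out with a small sketch in Python. Python's re module is not grep -E, but for these basic operators the syntax is the same. The grep() helper and the sample lines are hypothetical, written only for this illustration.

```python
import re

# Hypothetical sample lines, standing in for a file.
lines = ["cat sat", "the dog", "catalog", "bird", "color", "colour", "aaa"]

def grep(pattern, lines):
    """Return the lines containing a match, loosely like grep -E."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

either = grep(r"cat|dog", lines)      # | is OR
at_start = grep(r"^cat", lines)       # ^ anchors to start of line
at_end = grep(r"(og|or)$", lines)     # ( ) groups the OR, $ anchors to end
optional = grep(r"colou?r", lines)    # ? makes the u optional
negated = grep(r"^[^a-c]", lines)     # first character NOT in a-c
any_char = grep(r"^.a", lines)        # . is any single character
counted = grep(r"^a{2,3}$", lines)    # whole line is 2 to 3 a's

# Greedy: .* runs from the first 'a' to the LAST 'a' it can reach.
greedy = re.search(r"a.*a", "banana bandana").group(0)

print(either, at_start, at_end, optional, negated, any_char, counted)
print(greedy)
```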
|
Pattern matching with grep -E, part 2
|
grep in Extended Regex mode has a number of predefined character classes, used inside a bracket expression, e.g. [[:digit:]]:
[:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:]
and backslash shorthands for character classes and anchors (GNU grep extensions):
\w : Word character, [a-zA-Z0-9_] (letters, digits or underscore)
\W : Inverse of \w, any non-word character
\s : Whitespace: spaces, tabs, in some contexts new-lines
\S : Inverse of \s, any non-whitespace character
\b : Boundary between a word character and a non-word character (or line start/end), 0-length anchor
\B : Inverse of \b, any position that is NOT a word boundary, 0-length anchor
\< : Boundary at start of word, 0-length anchor
\> : Boundary at end of word, 0-length anchor
You can refer back to an exact copy of the text matched by a (group) using \1, \2, etc.
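A short sketch of the shorthands and back-references, again in Python. Python's re module supports \w, \s, \S, \b, \B and \1, but not the [:alpha:]-style classes or \< and \>, so those are left out here. The sample text is made up.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "the cat scattered the the catnip"

# \b on both sides: "cat" only as a whole word
whole_word = re.findall(r"\bcat\b", text)

# \b at the start only: words that begin with "cat"
word_start = re.findall(r"\bcat\w*", text)

# \B on both sides: "cat" strictly inside a longer word
inside_word = re.findall(r"\w*\Bcat\B\w*", text)

# \1 must repeat exactly what group 1 matched: finds a doubled word
doubled = re.search(r"\b(\w+)\s+\1\b", text)

# \S+ grabs runs of non-whitespace, whatever the separators are
fields = re.findall(r"\S+", "a  b\tc\n")

print(whole_word, word_start, inside_word)
print(doubled.group(0), fields)
```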
|
Find... and replace! With sed.
|
sed -E 's/pattern/replacement/'
's/pattern/replacement/g' - g enables Global mode: replace every match on the line, not just the first.
Use grouping () in pattern and back-reference \1 in replacement…
… to rearrange or recontextualise parts of the matched input.
Tips for writing complex substitutions:
1- Start with a complete real example pasted as your pattern.
2- Escape, with a backslash '\', any forward slashes, literal brackets, etc., as necessary.
3- Wrap the parts to retain in round brackets.
4- Write your replacement rules, using back-references.
5- Substitution should now work for your specific real example.
6- Abstract the pattern with wildcards, etc., to make it general enough for all required cases.
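The six steps above can be walked through in a small sketch. It uses Python's re.sub rather than sed so that each step can be checked; the group and \1 back-reference syntax is the same as with sed -E. The name/year strings are invented examples. Note that re.sub replaces every match by default (like sed's g flag); passing count=1 mimics sed without g.

```python
import re

# Hypothetical real example: we want "John Smith [1975]".
example = "Smith, John (1975)"

# Step 1: paste the real example as the pattern.
# Unescaped, ( ) act as a group, so this does NOT match the literal text.
step1 = r"Smith, John (1975)"
step1_matches = re.search(step1, example) is not None

# Step 2: escape the literal brackets.
step2 = r"Smith, John \(1975\)"

# Step 3: wrap the parts to retain in round brackets.
step3 = r"(Smith), (John) \((1975)\)"

# Step 4: write the replacement using back-references.
replacement = r"\2 \1 [\3]"

# Step 5: the substitution works for the specific example.
specific = re.sub(step3, replacement, example)

# Step 6: generalise the pattern to cover all required cases.
general = r"(\w+), (\w+) \((\d{4})\)"
others = "Curie, Marie (1867); Turing, Alan (1912)"
all_replaced = re.sub(general, replacement, others)
first_only = re.sub(general, replacement, others, count=1)

print(specific)
print(all_replaced)
print(first_only)
```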
|
Regexes within text editors
|
|
Python regular expressions
|
Regular expressions through Python:
import re
match = re.search(r'pattern', 'string')
or
parts = re.split(r'pattern', 'string')
or
result = re.sub(r'pattern', r'replacement', 'string')
Reference: https://docs.python.org/3/library/re.html
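A runnable sketch of the three calls above, with made-up input strings. re.search returns None when nothing matches, so the result is checked before use.

```python
import re

# Hypothetical input line, used only for illustration.
line = "sample_42: 3.5,7.25;  9"

# re.search: find the first match anywhere in the string
match = re.search(r"sample_(\d+)", line)
sample_id = match.group(1) if match else None

# re.split: cut the string wherever the pattern matches
values = re.split(r"[,;]\s*", "3.5,7.25;  9")

# re.sub: replace every match (here, each digit becomes '#')
masked = re.sub(r"\d", "#", "tel 555-0199")

# No match: re.search gives None rather than raising an error
missing = re.search(r"control_(\d+)", line)

print(sample_id, values, masked, missing)
```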
|
R regular expressions
|
Regular expressions through R:
grep( 'pattern', string.vector )
sub( 'pattern', 'replacement', string.vector )
Use gsub instead of sub for global (replace-all) find+replace mode.
Backslashes in patterns must be doubled in R strings, e.g. '\\d' for \d.
|