Regular Expressions: The pitch


  • Regular expressions (regexes) are powerful tools for searching and transforming text.
  • A search pattern, written in a defined syntax, describes a whole family of strings, so matching can be flexible yet targeted.

Shell wildcards - a type of regex


  • Using wildcards in the Unix shell to select files is a simple form of pattern matching, closely related to regular expressions.
  • * matches zero or more characters
  • ? matches exactly one character
  • [ ] matches one character from the enclosed list or range
  • [! ] matches one character NOT in the enclosed list or range
  • { } expands to one word per listed option, e.g. file.{txt,csv} becomes file.txt file.csv
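
The wildcard rules above can be tried outside the shell. The following is a minimal sketch using Python's standard fnmatch module, which implements *, ?, [ ] and [! ] matching (it does not perform { } brace expansion, which is a shell feature). The file names are invented for illustration.

```python
import fnmatch

# Hypothetical file names, invented for illustration.
names = ["data1.csv", "data2.csv", "data10.csv", "dataA.csv", "notes.txt"]


def select(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


any_csv = select("*.csv")             # * : zero or more characters
one_char = select("data?.csv")        # ? : exactly one character
one_digit = select("data[0-9].csv")   # [ ] : one character from a range
not_digit = select("data[!0-9].csv")  # [! ] : one character NOT in the range

for label, result in [("*.csv", any_csv), ("data?.csv", one_char),
                      ("data[0-9].csv", one_digit),
                      ("data[!0-9].csv", not_digit)]:
    print(label, "->", result)
```

Note that data10.csv is selected by *.csv but not by data?.csv, because ? stands for exactly one character.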

Pattern matching with grep -E, part 1


  • grep -E (Extended Regex mode; egrep is an older name for the same thing) allows complex pattern matching in files/streams.
  • | acts as an OR between options
  • ( ) groups part of a pattern, e.g. to limit the scope of an OR or to apply a quantifier to the whole group
  • [ ] matches one character from the enclosed list or range
  • [^ ] matches one character NOT in the enclosed list or range
  • ^ at the start of a regex means match at start of line
  • $ at the end of a regex means match at end of line
  • . matches any single character (the regex counterpart of the shell's ?)
  • ? quantifies previous character or group as occurring zero or one time
  • * quantifies previous character or group as occurring zero or more times
  • + quantifies previous character or group as occurring one or more times
  • {n,m} quantifies previous character or group as occurring between n and m times
  • Quantifiers are greedy: they match as much text as they can while still letting the overall pattern match.
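
The operators listed above behave the same way in Python's re module, so they can be explored without grep. The following is a minimal sketch; the sample lines and the small grep() helper are invented for illustration and only mimic grep's "print matching lines" behaviour.

```python
import re

# Hypothetical input lines, invented for illustration.
lines = ["colour chart", "color wheel", "the colors", "grey", "gray cat"]


def grep(pattern):
    """Return the lines containing a match, loosely like grep -E."""
    return [line for line in lines if re.search(pattern, line)]


optional_u = grep(r"colou?r")            # ? : the 'u' may occur zero or one time
at_start = grep(r"^(gr[ae]y|colou?r)")   # ^, ( ), | and [ ] combined
at_end = grep(r"(chart|cat)$")           # $ : match at end of line

# [^ ] : every character that is neither a vowel nor a space
consonants = re.findall(r"[^aeiou ]", "gray cat")

# Greedy quantifiers take the longest fit they can
greedy = re.search(r"a.*c", "abcabc").group()
bounded = re.search(r"a{2,3}", "aaaa").group()

print(optional_u, at_start, at_end, consonants, greedy, bounded, sep="\n")
```

In the greedy example, a.*c matches the whole of abcabc rather than stopping at the first c, and a{2,3} takes three of the four available characters.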

Pattern matching with grep -E, part 2


  • grep in Extended Regex mode has a number of predefined character classes, which must be used inside a bracket expression, e.g. [[:digit:]]:
  • [:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:]
  • and backslash shorthands for character classes and anchors:
  • \w : Word character, [a-zA-Z0-9_] (a letter, digit or underscore)
  • \W : Inverse of \w, any non-word character
  • \s : Whitespace: spaces, tabs and, in some contexts, newlines
  • \S : Inverse of \s, any non-whitespace character
  • \b : Word boundary, between a word character and a non-word character (or the start/end of the line), 0-length anchor
  • \B : Inverse of \b, any position that is not a word boundary (e.g. in the middle of a word or between two spaces), 0-length anchor
  • \< : Boundary at the start of a word, 0-length anchor
  • \> : Boundary at the end of a word, 0-length anchor
  • You can refer back to an exact copy of the text matched by a ( ) group using \1, \2, etc.
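
The shorthands and back-references above can be sketched in Python's re module, which supports \w, \W, \s, \S, \b, \B and \1. Two caveats: Python's re does not support the POSIX [:alpha:]-style classes or the \< and \> anchors, so \b stands in for both word-edge anchors here. The sample strings are invented for illustration.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "the cat sat on the the mat"

words = re.findall(r"\w+", text)             # runs of word characters
chunks = re.findall(r"\S+", "a_1  b\tc")     # runs of non-whitespace
fields = re.split(r"\s+", "a  b\tc")         # split on any whitespace

# \b : 'at' only as a whole word
whole_word = re.findall(r"\bat\b", "cat at bat")
# \B : 'at' only when it is NOT at the start of a word
word_ending = re.findall(r"\Bat\b", "cat at bat")

# \1 : the same text that group 1 matched, i.e. a repeated word
repeated = re.search(r"\b(\w+)\s+\1\b", text)

print(words, chunks, fields, whole_word, word_ending, sep="\n")
print(repeated.group(), "| repeated word:", repeated.group(1))
```

The back-reference \1 matches only an exact copy of what the group captured, which is why the search finds "the the" and not two different neighbouring words.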

Find... and replace! With sed.


  • sed -E 's/pattern/replacement/'
  • 's/pattern/replacement/g' - the g (global) flag replaces every match on a line, not just the first.
  • Use grouping () in pattern and back-reference \1 in replacement…
  • … to rearrange or recontextualise parts of the matched input.
  • Tips for writing complex substitutions:
  • 1- Start with a complete real example pasted as your pattern.
  • 2- Escape any forward slashes, literal brackets, etc. with a backslash (\), as necessary.
  • 3- Wrap the parts to retain in round brackets ( ).
  • 4- Write your replacement rules, using back-references.
  • 5- Substitution should now work for your specific real example.
  • 6- Generalise the pattern with wildcards, character classes, quantifiers, etc., until it covers all required cases.
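
The workflow in the tips above can be sketched with Python's re.sub, which uses the same ( ) grouping and \1 back-reference ideas as sed -E. The input line and the date-reordering task are invented for illustration. One difference to keep in mind: re.sub replaces every match by default (like sed's g flag), and count=1 gives sed's default first-match-only behaviour.

```python
import re

# Hypothetical input line, invented for illustration.
line = "Sample taken 2023-01-15 at site A"

# Steps 1-5: paste a real example as the pattern, bracket the parts to
# keep, and rearrange them with back-references in the replacement.
specific = re.sub(r"(2023)-(01)-(15)", r"\3/\2/\1", line)

# Step 6: generalise the literal digits so that any date of the same
# shape is handled.
date_pattern = r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
general = re.sub(date_pattern, r"\3/\2/\1", "Dates: 1999-12-31, 2024-02-29")

# First match only (sed without g) versus every match (sed with g).
first_only = re.sub(r"a", "A", "banana", count=1)
every_match = re.sub(r"a", "A", "banana")

print(specific, general, first_only, every_match, sep="\n")
```

Checking the specific version first, as the tips suggest, makes it easier to tell whether a later failure comes from the replacement rule or from the generalised pattern.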

Regexes within text editors


  • Regular expression capabilities are incorporated in most modern text editors for find and replace.

Python regular expressions


  • Regular expressions through Python:
  • import re
  • match = re.search(r'pattern', 'string')
  • OR
  • parts = re.split(r'pattern', 'string')
  • OR
  • new_string = re.sub(r'pattern', r'replacement', 'string')
  • Reference: https://docs.python.org/3/library/re.html
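
The three calls above can be seen together in a small runnable sketch. The record string and its field layout are invented for illustration. The r'...' prefix makes a raw string, so backslashes reach the regex engine without needing to be doubled.

```python
import re

# Hypothetical record, invented for illustration.
record = "id=42; name=Ada Lovelace; year=1843"

# re.search returns a match object, or None if nothing matches.
match = re.search(r"name=(\w+) (\w+)", record)
if match:
    first, last = match.group(1), match.group(2)
    print("first:", first, "| last:", last)

# re.split cuts the string wherever the pattern matches.
parts = re.split(r";\s*", record)

# re.sub returns a new string; the original is left unchanged.
swapped = re.sub(r"name=(\w+) (\w+)", r"name=\2, \1", record)

# A raw string is equivalent to doubling every backslash.
same_pattern = (r"\d+" == "\\d+")

print(parts, swapped, same_pattern, sep="\n")
```

Testing the result of re.search before calling .group() avoids an AttributeError when the pattern does not match.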

R regular expressions


  • Regular expressions through R (using the stringr package):
  • str_detect( string.vector, 'pattern' )
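  • str_replace( string.vector, 'pattern', 'replacement' ) replaces the first match in each string.
  • str_replace_all( string.vector, 'pattern', 'replacement' ) replaces every match in each string.
  • Backslashes in patterns must be doubled (e.g. '\\d' for \d), because R string literals also treat the backslash as an escape character.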