Intro to regular expressions: Glossary

Key Points

Regular Expressions: The pitch
  • Regexes are powerful tools for searching and transforming text.

  • A search pattern, written in a defined syntax, describes a whole class of strings, so matching can be flexible yet targeted.

Shell wildcards - a type of regex
  • Wildcards used in the Unix shell for file selection are a simple form of regular expression, with their own, more limited syntax.

  • * matches zero or more characters

  • ? matches exactly one character

  • [ ] matches a character from a list or range of contained options

  • [! ] matches a character NOT in a list or range of contained options

  • { } expands to one word for each of the comma-separated options it contains
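
The wildcard rules above can be tried out without touching the filesystem. The following is a minimal sketch using Python's standard fnmatch module, which implements shell-style *, ?, [ ] and [! ] matching. It does not cover { }, because brace expansion is performed by the shell itself. The file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only for illustration.
files = ["data1.txt", "data2.txt", "data10.txt", "notes.md", "dataA.txt"]

def select(pattern):
    """Return the file names that match a shell-style wildcard pattern."""
    return [name for name in files if fnmatchcase(name, pattern)]

star = select("data*.txt")             # * : zero or more characters
question = select("data?.txt")         # ? : exactly one character
in_set = select("data[0-9].txt")       # [ ] : one character from the range
not_in_set = select("data[!0-9].txt")  # [! ] : one character NOT in the range

print(star, question, in_set, not_in_set, sep="\n")
```

Note that data10.txt is selected by * but not by ?, since ? stands for exactly one character.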

Pattern matching with grep -E, part 1
  • grep in Extended Regex mode (or egrep) allows complex pattern matching in files/streams.

  • | acts as an OR between options

  • ( ) allows grouping, e.g. to limit the scope of an OR, or to apply a quantifier to several characters at once.

  • [ ] matches a character from a list or range of contained options

  • [^ ] matches a character NOT in a list or range of contained options

  • ^ at the start of a regex means match at start of line

  • $ at the end of a regex means match at end of line

  • . is the match-all (any single character) wildcard

  • ? quantifies previous character or group as occurring zero or one time

  • * quantifies previous character or group as occurring zero or more times

  • + quantifies previous character or group as occurring one or more times

  • {n,m} quantifies previous character or group as occurring between n and m times

  • Quantifiers are greedy: they always match the longest possible fit.
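
The operators listed above can be explored interactively. This is a minimal sketch that uses Python's re module as a stand-in for grep -E. The two dialects are not identical, but every operator used here behaves the same way in both. The sample lines and the grep helper are hypothetical.

```python
import re

# Hypothetical input lines, standing in for the lines of a file.
lines = ["colour", "color", "colouur", "the color red", "gray cat", "grey cat"]

def grep(pattern):
    """Return the lines in which the pattern matches anywhere, like grep -E."""
    return [line for line in lines if re.search(pattern, line)]

either = grep(r"gr(a|e)y")        # ( ) and | : grouping with OR
optional = grep(r"^colou?r$")     # ^ and $ anchors, ? : zero or one 'u'
repeated = grep(r"^colou+r$")     # + : one or more 'u'
bounded = grep(r"^colou{2,3}r$")  # {n,m} : between 2 and 3 'u'

# Greedy: .* grabs as much as it can, so the match runs to the LAST 'r'.
greedy = re.search(r"c.*r", "the color red").group()

print(either, optional, repeated, bounded, repr(greedy), sep="\n")
```

The greedy match is "color r" rather than "color": .* extends as far as it can while still leaving an r to finish the pattern.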

Pattern matching with grep -E, part 2
  • grep in Extended Regex mode has a number of predefined character classes:

  • [:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:] (used inside a bracket expression, e.g. [[:digit:]])

  • and escape-character enabled shorthand character classes and anchors:

  • \w : Word character [a-zA-Z0-9] OR a _ (underscore)

  • \W : [^\w] Inverse of \w, any non-word character

  • \s : Spaces, tabs, in some contexts new-lines

  • \S : [^\s] Inverse of \s, any non-space character

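  • \b : Word boundary, the position between a word character and a non-word character (or the start/end of the line), 0-length anchor

  • \B : Inverse of \b, any position that is not a word boundary (in the middle of a word or between spaces), 0-length anchor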

  • \< : Boundary at start of word between word and space, 0-length anchor

  • \> : Boundary at end of word between word and space, 0-length anchor

  • You can refer back to an exact copy of the text matched by a ( ) group using \1, \2, etc.
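
The shorthand classes, word boundaries and back-references above can be sketched in Python as follows. This is an illustration under one assumption worth stating: Python's re module supports \w, \s, \b, \B and \1, but not the [:alpha:]-style classes or the \< and \> anchors, so those are left out. The sample strings are made up.

```python
import re

# \w+ : runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", "snake_case, 2 words")

# \b vs \B : 'cat' as a whole word, versus 'cat' buried inside a word.
sample = "cat concatenate cat"
whole_word_hits = len(re.findall(r"\bcat\b", sample))
inside_word_hits = len(re.findall(r"\Bcat\B", sample))

# \1 must repeat exactly what group 1 captured: this finds a doubled word.
doubled = re.search(r"\b(\w+)\s+\1\b", "the cat sat on the the mat").group()

print(words, whole_word_hits, inside_word_hits, repr(doubled), sep="\n")
```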

Find... and replace! With sed.
  • sed -E 's/pattern/replacement/'

  • 's/pattern/replacement/g' - the g flag enables global mode: every match on a line is replaced, not just the first.

  • Use grouping ( ) in the pattern and back-references (\1, \2, etc.) in the replacement to rearrange or recontextualise parts of the matched input.

  • Tips for writing complex substitutions:

  • 1- Start with a complete real example pasted as your pattern.

  • 2- Escape, with a backslash (\), any forward slashes, literal brackets, etc., as necessary.

  • 3- Circle the parts to retain, with round brackets.

  • 4- Write your replacement rules, using back-references.

  • 5- Substitution should now work for your specific real example.

  • 6- Generalise the pattern with wildcards, character classes, etc., until it is broad enough to cover all required cases.
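
The six steps above can be walked through in code. In this minimal sketch Python's re.sub stands in for sed -E (an assumption made so the example is runnable anywhere), and the file names are hypothetical. The goal is to turn a year-month-day date embedded in a file name into day/month/year.

```python
import re

example = "report_2021-03-15.csv"   # hypothetical file name

# Steps 1-5: the real example pasted as the pattern, the literal dot escaped,
# the parts to keep wrapped in ( ), and back-references in the replacement.
specific = re.sub(r"report_(2021)-(03)-(15)\.csv", r"\3/\2/\1", example)

# Step 6: generalise each group so that other dates match as well.
general = r"report_([0-9]{4})-([0-9]{2})-([0-9]{2})\.csv"
names = ["report_2021-03-15.csv", "report_1999-12-31.csv"]
others = [re.sub(general, r"\3/\2/\1", name) for name in names]

print(specific)
print(others)
```

Getting the substitution to work on one concrete example first means that, when the generalised pattern misbehaves, the fault can only lie in the generalisation.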

Regexes within text editors
  • Regular expression capabilities are incorporated in most modern text editors for find and replace.

Python regular expressions
  • Regular expressions through Python:

  • import re

  • match = re.search(r'pattern', 'string')

  • or

  • parts = re.split(r'pattern', 'string')

  • or

  • result = re.sub(r'pattern', r'replacement', 'string')

  • Reference: https://docs.python.org/3/library/re.html
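
The three calls above, filled in with a concrete input, might look like the following. The sample line and variable names are invented for illustration. The one detail worth noting is that re.search returns None when nothing matches, so the result should be checked before use.

```python
import re

line = "sample_01, sample_02;sample_10"   # hypothetical input

# search: first match only; returns None if the pattern is not found.
match = re.search(r"sample_(\d+)", line)
first_id = match.group(1) if match else None

# split: cut the string wherever the pattern matches.
parts = re.split(r"[,;]\s*", line)

# sub: replace every match, reusing group 1 via the \1 back-reference.
renamed = re.sub(r"sample_(\d+)", r"S\1", line)

print(first_id, parts, renamed, sep="\n")
```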

R regular expressions
  • Regular expressions through R:

  • grep( 'pattern', string.vector )

  • sub( 'pattern', 'replacement', string.vector )

  • Use gsub instead of sub for global find+replace mode (replace every match, not just the first).

  • Backslashes in patterns must be doubled (e.g. '\\w'), because R string literals consume one level of escaping.
