Regular Expressions: The pitch
- Regexes are powerful tools for searching and transforming text.
 - A search pattern, written in a defined syntax, matches any text that fits a general shape rather than one exact string.
 
Shell wildcards - a type of regex
- Use of wildcards in the Unix shell for file selection is a simple form of regular expressions.
 - * matches zero or more characters
 - ? matches exactly one character
 - [ ] matches a character from a list or range of contained options
 - [! ] matches a character NOT in a list or range of contained options
 - { } expands to produce forms of all listed contained options
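
The wildcard rules above can be tried outside the shell as well. The following is a minimal sketch using Python's standard fnmatch module, which follows the same *, ?, [ ] and [! ] conventions; the file names are invented for illustration, and brace expansion { } is a shell feature that fnmatch does not provide.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, invented for this illustration.
names = ["data1.csv", "data2.csv", "data10.csv", "notes.txt", "dataA.csv"]


def select(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


star = select("*.csv")                # * : zero or more characters
single = select("data?.csv")          # ? : exactly one character
in_range = select("data[0-9].csv")    # [ ] : one character from the range
not_range = select("data[!0-9].csv")  # [! ] : one character NOT in the range

print(star)
print(single)
print(in_range)
print(not_range)
```

Note that data10.csv is selected by *.csv but not by data?.csv, because ? stands for exactly one character.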
Pattern matching with grep -E, part 1
- grep in Extended Regex mode (grep -E, or egrep) allows complex pattern matching in files/streams.
 - | acts as an OR between options
 - ( ) allows grouping, e.g. to limit the scope of an OR or to apply a quantifier to several characters
 - [ ] matches a character from a list or range of contained options
 - [^ ] matches a character NOT in a list or range of contained options
 - ^ at the start of a regex means match at start of line
 - $ at the end of a regex means match at end of line
 - . is the match-all (any single character) wildcard
 - ? quantifies previous character or group as occurring zero or one time
 - * quantifies previous character or group as occurring zero or more times
 - + quantifies previous character or group as occurring one or more times
 - {n,m} quantifies previous character or group as occurring between n and m times
- Quantifiers are greedy: they will always match the longest possible fit.
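
The operators listed above behave the same way in Python's re module, so they can be explored interactively before being used with grep -E. This is a small sketch, not a grep replacement; the sample lines are invented for illustration.

```python
import re

# Hypothetical log lines, invented for this illustration.
lines = [
    "error: disk full",
    "warning: disk almost full",
    "info: all good",
]

# ^, ( ) and | : keep lines that start with either "error" or "warning".
flagged = [ln for ln in lines if re.search(r"^(error|warning):", ln)]

# ? : the "u" is optional, so both spellings are found.
spellings = re.findall(r"colou?r", "colour and color")

# [ ] and {n,m} : runs of two to three digits.
two_or_three = re.findall(r"[0-9]{2,3}", "7 42 123")

# Greedy: .+ takes the longest possible fit, from the first < to the last >.
greedy = re.search(r"<.+>", "<a><b>").group(0)

print(flagged, spellings, two_or_three, greedy)
```

The last line shows greediness in action: the pattern could have stopped at the first >, but it extends to the final one.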
 
Pattern matching with grep -E, part 2
- grep in Extended Regex mode has a number of predefined character classes, used inside a bracket expression, e.g. [[:digit:]]:
 [:alpha:] [:alnum:] [:digit:] [:upper:] [:lower:] [:punct:] [:space:]
- and escape-character enabled shorthand character classes and anchors:
 - \w: Word character, [a-zA-Z0-9] OR a _ (underscore)
 - \W: Inverse of \w, any non-word character
 - \s: Whitespace: spaces, tabs and, in some contexts, newlines
 - \S: Inverse of \s, any non-space character
 - \b: Boundary between a word character and a non-word character (or the edge of the line), 0-length anchor
 - \B: Inverse of \b, a position in the middle of a word or between non-word characters, 0-length anchor
 - \<: Boundary at start of word, 0-length anchor
 - \>: Boundary at end of word, 0-length anchor
- You can refer back to an exact copy of a matched (group) using \1, \2, etc.
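
The shorthand classes, the word-boundary anchor and back-references can be illustrated with a short Python sketch. Note the differences from grep: Python's re module does not support the [:alpha:]-style classes or the \< and \> anchors, so \b is used for both ends of a word here. The sample text is invented.

```python
import re

text = "the the cat sat on on the mat_1"  # hypothetical sample text

# \w+ : runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)

# \b : zero-length word boundary, so "cat" inside "concatenate" is skipped.
whole_word = re.findall(r"\bcat\b", "cat concatenate cat.")

# \1 : an exact copy of whatever group 1 matched, here a repeated word.
repeated = re.findall(r"\b(\w+)\s+\1\b", text)

print(words)
print(whole_word)
print(repeated)
```

Because \1 must repeat the captured text exactly, "on on" is reported but "cat sat" is not.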
 
Find... and replace! With sed.
- sed -E 's/pattern/replacement/' replaces the first match on each line.
- 's/pattern/replacement/g' enables global, replace-all mode.
- Use grouping ( ) in the pattern and back-references (\1, \2, etc.) in the replacement to rearrange or recontextualise parts of the matched input.
- Tips for writing complex substitutions:
 - 1. Start with a complete real example pasted as your pattern.
 - 2. Escape, with a backslash (\), any forward slashes, literal brackets, etc., as necessary.
 - 3. Enclose the parts to retain in round brackets.
 - 4. Write your replacement rules, using back-references.
 - 5. The substitution should now work for your specific real example.
 - 6. Abstract the pattern with wildcards, etc., until it is general enough to cover all required cases.
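
The workflow in these tips can be sketched in Python, whose re.sub accepts the same group and back-reference ideas as sed -E. The file names and the date-reordering task are hypothetical, chosen only to show the steps. One difference to keep in mind: re.sub replaces every match by default (like the g flag), and count=1 gives sed's first-match-only behaviour.

```python
import re

example = "report_2023-04-15.txt"  # hypothetical file name

# Steps 1-5: the literal example as the pattern, with the literal dot escaped,
# round brackets around the parts to keep, and back-references in the
# replacement.
specific = re.sub(
    r"report_(2023)-(04)-(15)\.txt", r"\3-\2-\1_report.txt", example
)

# Step 6: abstract the literals so that other dates match too.
general_pattern = r"report_([0-9]{4})-([0-9]{2})-([0-9]{2})\.txt"
general = re.sub(general_pattern, r"\3-\2-\1_report.txt", "report_1999-12-31.txt")

# First match only (sed without g) versus every match (sed with g).
first_only = re.sub(r"a", "X", "banana", count=1)
all_matches = re.sub(r"a", "X", "banana")

print(specific, general, first_only, all_matches)
```

Checking the literal version first makes it easier to tell whether a later failure comes from the replacement rule or from the generalised pattern.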
 
Regexes within text editors
- Regular expression capabilities are incorporated in most modern text editors for find and replace.
 
Python regular expressions
- Regular expressions through Python:
 import re
 match = re.search(r'pattern', 'string')
- OR
 parts = re.split(r'pattern', 'string')
- OR
 result = re.sub(r'pattern', r'replacement', 'string')
- Reference: https://docs.python.org/3/library/re.html
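
As a concrete illustration of the three calls, here is a minimal sketch on an invented input string; the variable names and the sample_ prefix are assumptions made for the example only.

```python
import re

line = "sample_42, sample_7;sample_103"  # hypothetical input string

# re.search returns a match object, or None when nothing matches.
match = re.search(r"sample_([0-9]+)", line)
first_id = match.group(1) if match else None

# re.split cuts the string wherever the pattern matches.
parts = re.split(r"[,;]\s*", line)

# re.sub replaces every match; \1 re-inserts the captured digits.
renamed = re.sub(r"sample_([0-9]+)", r"S\1", line)

print(first_id)
print(parts)
print(renamed)
```

Raw strings (the r prefix) keep backslashes such as \s and \1 from being interpreted by Python before the regex engine sees them.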
 
R regular expressions
- Regular expressions through R (stringr package):
 str_detect( string.vector, 'pattern' )
 str_replace( string.vector, 'pattern', 'replacement' )
- str_replace_all( string.vector, 'pattern', 'replacement' ) to replace every match rather than only the first.
- Need to double escape (\\) any backslashes in patterns.