
Learn [Reg]ular [Ex]pressions in 45 minutes or less
Gabriel Barbu
16th of November 2019
1. What is a regular expression anyway?
2. Basics
a. Metacharacters
b. Character classes
c. Quantifiers
d. Negation
e. Alternations
f. Grouping
g. Flags
3. Uses and flavors
4. Tools & cheat sheet
5. Q&A
I am sure you are familiar with the term “wildcard” and have used it at least once.

If you work with Windows maybe you used the search from Windows Explorer
to find specific files by type such as:
- Documents: *.doc
- Text files: *.txt
- mp3 files: *.mp3
If you work with Linux (or Mac – not judging) maybe you used the search
for files into a specific folder, or some content inside a file:
- ls *.doc
- grep “CC*” CreditCardInformation.txt
If you work with databases, you definitely used wildcards in your queries, such as:

- select * from <table> where <field> like “%something”


- select * from <table> where <field> like “_something”
Regular expressions are a formal language that takes advantage of
wildcards in order to define patterns to search (and/or replace) in text.

A regular expression looks like this:

/^[2-9]\d{2}-\d{3}-\d{4}$/
or
/^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$/g
Let's put that scary thing away and start understanding regular
expressions.

Structure of a basic regular expression:

/search pattern/flags
- search pattern: what to search for
- flags: search flags

/search pattern/replacement/flags
- replacement: what to replace the match with
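The slash notation above is how tools such as sed, Perl or JavaScript write a regular expression. As a rough sketch, assuming Python and its standard `re` module (a choice made here for illustration, not something the talk prescribes), the same parts become function arguments. The sample text is made up.

```python
import re

text = "Hello World, hello regex"  # hypothetical sample text

# /hello/i  ->  search pattern "hello", flag i (case insensitive)
matches = re.findall(r"hello", text, flags=re.IGNORECASE)
print(matches)

# /hello/hi/i  ->  search pattern "hello", replacement "hi", flag i
replaced = re.sub(r"hello", "hi", text, flags=re.IGNORECASE)
print(replaced)
```

Note that `re.sub` replaces every match by default, so there is no separate flag to ask for that.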


A regular expression can be built using the following things:

- Metacharacters
- Character classes
- Alternations
- Capturing groups
- Flags

Let's dive deep!


In regular expressions there are two types of characters: literal
characters and metacharacters.

Literal characters are any printable character from the ASCII table.

Metacharacters are the building blocks of regular expressions. Those
characters are special because they can work as a wildcard for a subset
of literal characters or as a positional character.
To match any character, except newline: . (dot)
To match any white space characters (\r, \n, \t, \f, \v, [space]): \s
To match any numeric characters: \d
To match any “word” characters (all lowercase and uppercase alphabet
letters, the numeric characters and underscore character): \w
To match a word boundary (matches the start and end of a word): \b
To match the start of the string: ^
To match the end of the string: $
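A small sketch of the metacharacters listed above, assuming Python's `re` module. The sample strings are invented for illustration.

```python
import re

text = "Room 42 is free\tnow"  # hypothetical sample text

digits = re.findall(r"\d", text)        # every numeric character
spaces = re.findall(r"\s", text)        # every whitespace character
words = re.findall(r"\w+", text)        # runs of "word" characters
start = re.findall(r"^R..m", text)      # ^ anchors to the start, . is any character
end = re.findall(r"n.w$", text)         # $ anchors to the end

# \b keeps "is" from matching inside "this"
whole_word = re.findall(r"\bis\b", "this is it")

print(digits, spaces, words, start, end, whole_word)
```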
Character classes are defined by square brackets (“[“ and “]”). Anything
contained inside the brackets represents the character class.

A character class can contain multiple types of items, such as:

- Literal characters: [abc] – will match the character a, b or c
- Metacharacters: [\s\d]
- Character ranges: [a-z], [A-Z], [0-9], [a-d] or even a combination of
them such as [a-zA-Z]
Note that character ranges follow the order of the ASCII table. We
cannot use a range from “a” to “Z”: “a” (code 97) comes after “Z”
(code 90), so the range is invalid. We could use a range from “A” to
“z”, but be careful: it also contains the characters that sit between
the two alphabets in the ASCII table ([, \, ], ^, _ and `).
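To see the ASCII ordering at work, here is a minimal sketch assuming Python's `re` module. The sample string is made up to include some of the characters that sit between “Z” and “a”.

```python
import re

sample = "aZ_[^m"  # hypothetical sample text

# [A-z] covers ASCII codes 65..122, so it also picks up _ [ and ^
wide = re.findall(r"[A-z]", sample)

# [a-zA-Z] matches letters only
strict = re.findall(r"[a-zA-Z]", sample)

# [a-Z] runs backwards in the ASCII table and is rejected at compile time
try:
    re.compile(r"[a-Z]")
    range_is_invalid = False
except re.error:
    range_is_invalid = True

print(wide, strict, range_is_invalid)
```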
A character class can contain multiple items in any combination, such as:

- [a-z\d] – will match all lowercase characters between a and z and all
the numeric characters
- [a-fz] – will match all lowercase characters between a and f and the
character z
- [-a-z\\.] – will match the character “-”, all the lowercase characters
between a and z, the character “\” and the character “.” (notice the
double backslash: this is used to escape the backslash character)

A particularity of character classes is that we can write
metacharacters as character classes. The benefit of character classes is
that we gain more control over the predefined characters in
metacharacters:

- \d can be written as [0-9]


- \w can be written as [a-zA-Z\d_] or [a-zA-Z0-9_]
- \s can be written as [\r\n\t\f\v ]
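The equivalences above can be checked with a short sketch, assuming Python's `re` module. One caveat specific to Python 3: on text strings \d, \w and \s also match non-ASCII Unicode characters unless the `re.ASCII` flag is passed, so the flag is used here to get the ASCII-only behavior the list describes. The sample text is invented.

```python
import re

text = "user_42 logged in\tat 09:15\n"  # hypothetical sample text

equivalents = [
    (r"\d", r"[0-9]"),
    (r"\w", r"[a-zA-Z0-9_]"),
    (r"\s", r"[\r\n\t\f\v ]"),
]

# Each metacharacter and its character class should find the same characters.
same_results = [
    re.findall(meta, text, flags=re.ASCII) == re.findall(char_class, text)
    for meta, char_class in equivalents
]
print(same_results)

# The class form can be narrowed, which the metacharacter cannot:
octal_digits = re.findall(r"[0-7]", "0789")
print(octal_digits)
```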
But what do we do in case we want to match multiple characters?

In this case we can use quantifiers to indicate how many things we want
to match.
The simple quantifiers defined in regular expressions are:

- * : match zero or more


- + : match one or more
- ? : match zero or one (makes the preceding item optional)

Besides the simple quantifiers, there are range quantifiers:

- {n} : match exactly n times


- {n,} : match at least n times
- {n,m} : match between n and m times
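A sketch of how each quantifier behaves, assuming Python's `re` module. The helper `matching` and the candidate strings are hypothetical, made up for this illustration. Each pattern must match the whole candidate.

```python
import re

candidates = ["b", "ab", "aab", "aaab", "aaaab"]

def matching(pattern):
    """Return the candidates that the pattern matches completely."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for pattern in [r"a*b", r"a+b", r"a?b", r"a{2}b", r"a{2,}b", r"a{2,3}b"]:
    print(pattern, "->", matching(pattern))
```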

All the above-mentioned quantifiers, except {n}, are greedy quantifiers.


Greedy means that it will try to match as much as possible. In the
following example we see that the match doesn’t stop at the first b
that is encountered.
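A minimal sketch of greedy matching, assuming Python's `re` module. The pattern and test string are assumptions chosen to fit the description, not the ones from the original slide.

```python
import re

text = "xa1b2b3bz"  # hypothetical test string with three b characters

# Greedy: .* grabs as much as it can, so the match runs to the LAST b.
greedy = re.search(r"a.*b", text).group()
print(greedy)  # a1b2b3b
```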

If we want, we can change this behavior: to make a quantifier lazy, so
that it matches as little as possible, we add “?” after the quantifier.
The previously mentioned quantifiers can be made lazy as follows:

- *? : Repeat any number of times, but as few times as possible


- +? : Repeat one or more times, but as few times as possible
- ?? : Repeat zero or one time, but as few times as possible
- {n,m}? : Repeat at least n, but no more than m times, but as few
times as possible
- {n,}? : Repeat at least n times, but as few times as possible
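The difference between greedy and lazy quantifiers can be sketched as follows, assuming Python's `re` module. All sample strings are made up.

```python
import re

text = "xa1b2b3bz"  # hypothetical test string

# Lazy: .*? stops at the FIRST b it can reach.
lazy = re.search(r"a.*?b", text).group()

html = "<b>bold</b>"
greedy_tags = re.findall(r"<.+>", html)   # one match spanning both tags
lazy_tags = re.findall(r"<.+?>", html)    # one match per tag

digits_greedy = re.search(r"\d{2,4}", "123456").group()   # takes 4 digits
digits_lazy = re.search(r"\d{2,4}?", "123456").group()    # takes only 2

print(lazy, greedy_tags, lazy_tags, digits_greedy, digits_lazy)
```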
In regular expressions, we can use negation to match the opposite of a
character class or a metacharacter.

In order to negate a metacharacter, we need to capitalize the metacharacter letter:

- \S is the opposite of \s – match any character which is not a whitespace
- \D is the opposite of \d – match any character which is not a numeric character
- \W is the opposite of \w – match any character which is not a word character
- … and so on
In order to negate a character class we can use the caret character (“^”),
but only if it is the first character in the character class. If it is not
the first character in a character class, the caret will be part of the
character class:

- [^abc] – will match all characters that are not a, b or c
- [^a-c] – will match all characters that are not included in the list
of characters between a and c
- [^a-c^] – will match all characters that are neither in the list of
characters between a and c nor the ^ character
As in the previous chapter, a negated metacharacter can be written as a
negated character class:

- \D can be written as [^0-9] or [^0123456789]


- \S can be written as [^\r\n\t\f\v ] or even as [^\s] – isn’t RegEx
fun?
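A short sketch of negation, assuming Python's `re` module. The sample strings are invented for illustration.

```python
import re

text = "Order #42: 3 apples!"  # hypothetical sample text

non_digits = re.findall(r"\D+", text)           # runs of non-numeric characters
non_digits_class = re.findall(r"[^0-9]+", text) # same thing as a negated class
non_spaces = re.findall(r"\S+", text)           # runs of non-whitespace characters

# A caret that is not first is a literal member of the class.
not_abc = re.findall(r"[^a-c]", "abc^d")           # ^ is allowed through
not_abc_or_caret = re.findall(r"[^a-c^]", "abc^def")  # ^ is excluded too

print(non_digits, non_spaces, not_abc, not_abc_or_caret)
```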
Alternations or alternatives can be used to match one out of a group of
multiple possibilities.

To define an alternation, the pipe character (“|”) is used to separate
the options. You can consider the alternation as a logical OR.

The syntax for alternations is: <first_option>|<second_option>|<third_option>

Any character, metacharacter or character class can be used as an option
to an alternation:

- a|b – will match character a or character b


- \s|a – will match a white space character or character a
- [a-c]|[x-z] – will match all lowercase characters between a and c or
all lowercase characters between x and z
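The alternations above can be tried out with a small sketch, assuming Python's `re` module. The sample strings are made up.

```python
import re

# Whole words can be options too, not just single characters.
pets = re.findall(r"cat|dog", "cat, dog, bird")

space_or_a = re.findall(r"\s|a", "a b")
edges = re.findall(r"[a-c]|[x-z]", "abcmxyz")

print(pets, space_or_a, edges)
```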
Capture groups in regular expressions are used to capture (doh!) matches
which can be referenced later in the regular expression or in the
replacement expression.

Capture groups can also be used, by a programming language, to return
the part of the match that we need from a text.
The syntax for a capture group is: (<any_regular_expression>)

Each capture group gets an index starting from 1 and can be referenced
using a backslash (\) and the index.

For example, if we have two capture groups in a regular expression,
(\d)([\w]), the first group (\1) will contain a digit (\d) and the second
group (\2) will contain a word character ([\w]). Group 0 refers to the
entire match; how it is written (\0, $0, $& and so on) depends on the
flavor.
In the following example we will see how capture groups work:

As we can see, we have a capture group for \d+ and the \1 (Group 1) has
captured “10” from our test string, and the \0 (Full match) has captured
“10 types” from the same test string.
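The screenshot from the original slide is not available here, so the following is only a sketch, assuming Python's `re` module. The test string and pattern are assumptions chosen to reproduce the values quoted above (“10” and “10 types”).

```python
import re

test_string = "There are 10 types of people"  # assumed test string
match = re.search(r"(\d+) types", test_string)  # assumed pattern

full_match = match.group(0)   # the entire match
group_1 = match.group(1)      # what (\d+) captured
print(full_match, "|", group_1)

# A group can also be referenced inside the pattern itself:
# (\w)\1 means "a word character followed by the same character again".
doubled = re.search(r"(\w)\1", "hello").group()
print(doubled)
```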
In the following example we will see how to replace using capture groups:
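The original slide example is not reproduced here. As a stand-in sketch, assuming Python's `re` module, captured groups can be reordered in the replacement string. The date format and sample values are assumptions made for this illustration.

```python
import re

# Reorder year-month-day into day/month/year using \1, \2 and \3.
iso_date = "2019-11-16"
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", iso_date)
print(reordered)

# In Python the whole match is written \g<0> in the replacement.
wrapped = re.sub(r"\d+", r"[\g<0>]", "10 types")
print(wrapped)
```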
Regular expressions support some flags (or modifiers) to change how the
match is done.

The flag(s) are added at the end of the regular expression, after the
closing slash (/).
The most common flags in regular expressions are:

- g : (global) don’t return after the first match, return all the
matches
- m : (multi line) changes the behavior of ^ to match the start of the
line and of $ to match the end of the line
- i : (insensitive) makes the search case insensitive
- s : (single line) changes the behavior of “.” to also match newline
- U : (ungreedy) makes all the quantifiers lazy (uppercase U in PCRE; lowercase u usually means Unicode)
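A sketch of the m, i and s flags, assuming Python's `re` module, where they are spelled `re.MULTILINE`, `re.IGNORECASE` and `re.DOTALL`. Python has no ungreedy flag, and the global behavior depends on which function is called rather than on a flag. The sample text is made up.

```python
import re

text = "First line\nsecond LINE"  # hypothetical two-line sample

# m: ^ matches at the start of every line, not only at the start of the string
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# i: case insensitive search
case_sensitive = re.findall(r"line", text)
case_insensitive = re.findall(r"line", text, flags=re.IGNORECASE)

# s: the dot also matches the newline
dot_default = re.search(r"line.second", text)
dot_all = re.search(r"line.second", text, flags=re.DOTALL)

print(starts_default, starts_multiline)
print(case_sensitive, case_insensitive)
print(dot_default is None, dot_all is not None)
```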
The following examples show how the g flag changes how regular
expressions return matches, first without the g flag and then with it:
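The original screenshots are not available, so this is only a sketch of the same idea, assuming Python's `re` module. Python has no g flag: `re.search` behaves like a pattern without g (first match only), while `re.findall` behaves like a pattern with g (all matches). The sample text is invented.

```python
import re

text = "10 types, 2 kinds"  # hypothetical sample text

# Without g: stop after the first match.
first_only = re.search(r"\d+", text).group()

# With g: return all the matches.
all_matches = re.findall(r"\d+", text)

# The same distinction for replacing: count=1 limits re.sub to the first match.
replace_first = re.sub(r"\d+", "N", text, count=1)
replace_all = re.sub(r"\d+", "N", text)

print(first_only, all_matches, replace_first, replace_all)
```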


Regular expressions are often used in, but not limited to, the following
scenarios:

- File renaming
- Text search (and replace)
- Web directives (Apache .htaccess)
- Database queries (MySQL)
- Input validation (and sanitization)
- Parsing log files
There are many flavors of regular expressions, which influence what is
supported (and how it is supported) from the generic syntax. Most of the
time, all the simple stuff (which was presented here) works in all flavors.

Most programming languages that support regular expressions use their
own (custom) flavor, but it is mostly the same.

A comparison of different regular expression flavors can be found on
Wikipedia:
https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
You can find a regular expression cheat sheet here:
https://www.rexegg.com/regex-quickstart.html or here: https://regexr.com/

Useful tools to write and test regular expressions online:

- https://regex101.com/
- https://regexr.com/
- https://www.regextester.com/
Thank you