
1. Character Escapes

The backslash character (\) in the following table indicates that the character that follows it is a
special character.

Escaped character   Description   Pattern   Matches

\a         It will match a bell character, \u0007.   \a   "\u0007" in "Error!" + '\u0007'

\b         It will match a backspace within a character class, \u0008.   [\b]{3,}   "\b\b\b\b" in "\b\b\b\b"

\t         It will match a tab, \u0009.   (\w+)\t   "i1\t", "i2\t" in "i1\ti2\t"

\r         It will match a carriage return, \u000D. (\r is not equivalent to the newline character, \n.)   \r\n(\w+)   "\r\nThese" in "\r\nThese are\ntwo lines."

\v         It will match a vertical tab, \u000B.   [\v]{2,}   "\v\v\v" in "\v\v\v"

\f         It will match a form feed, \u000C.   [\f]{2,}   "\f\f\f" in "\f\f\f"

\n         It will match a new line, \u000A.   \r\n(\w+)   "\r\nThese" in "\r\nThese are\ntwo lines."

\e         It will match an escape, \u001B.   \e   "\x001B" in "\x001B"

\nnn       It uses octal representation to specify a character (nnn consists of two or three digits).   \w\040\w   "a b", "c d" in "a bc d"

\xnn       It uses the hexadecimal representation to specify a character (nn consists of exactly two digits).   \w\x20\w   "a b", "c d" in "a bc d"

\cX, \cx   It will match the ASCII control character that is specified by X or x, where X or x is the letter of the control character.   \cC   "\x0003" in "\x0003" (Ctrl-C)

\unnnn     It will match a Unicode character using hexadecimal representation (exactly four digits, as represented by nnnn).   \w\u0020\w   "a b", "c d" in "a bc d"
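
The escapes in the table can be tried out with a short sketch. It assumes Python's built-in `re` module rather than the .NET engine the table appears to describe; the variable names are illustrative only. Python accepts the octal, hexadecimal, and four-digit Unicode forms shown here, but other escapes in the table (such as \e and \cX) may not be available in every engine.

```python
import re

text = "a bc d"

# Three spellings of the same space character: octal, hex, Unicode.
octal_hits = re.findall(r"\w\040\w", text)
hex_hits = re.findall(r"\w\x20\w", text)
unicode_hits = re.findall(r"\w\u0020\w", text)

# finditer + group(0) returns the whole match, including the tab,
# whereas findall would return only the captured group.
tab_hits = [m.group(0) for m in re.finditer(r"(\w+)\t", "i1\ti2\t")]

print(octal_hits, hex_hits, unicode_hits, tab_hits)
```

All three space escapes should find the same two matches, which mirrors the "a b", "c d" rows above.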

2. Character Classes

A character class will match any one of a set of characters. Character classes include the
language elements that are listed in the following table.

Character class   Description   Pattern   Matches

[character_group]    It will match any single character present in the character_group. By default, the match is case-sensitive.   [ae]   "a" in "bay"; "a", "e" in "stake"

[^character_group]   Negation: it will match any single character that is not present in the character_group. By default, characters in character_group are case-sensitive.   [^aei]   "r", "g", "n" in "reign"

[first-last]         Character range: it will match any single character present in the range from first to last.   [A-Z]   "A", "B" in "AB123"

.                    Wildcard: it will match any single character except \n. If you want to match a literal period character (. or \u002E), you have to precede it with the escape character (\.).   a.e   "ave" in "have"; "ate" in "hater"

\p{name}             It will match any single character available in the Unicode general category or named block specified by name.   \p{Lu}, \p{IsCyrillic}   "C", "L" in "City Lights"; "Д", "Ж" in "ДЖem"

\P{name}             It will match any single character not available in the Unicode general category or named block specified by name.   \P{Lu}, \P{IsCyrillic}   "i", "t", "y" in "City"; "e", "m" in "ДЖem"

\w                   It will match any word character.   \w   "I", "D", "A", "1", "3" in "ID A1.3"

\W                   It will match any non-word character.   \W   " ", "." in "ID A1.3"

\s                   It will match any white-space character.   \w\s   "D " in "ID A1.3"

\S                   It will match any non-white-space character.   \s\S   " _" in "int __ctr"

\d                   It will match any decimal digit.   \d   "4" in "4 = IV"

\D                   It will match any character other than a decimal digit.   \D   " ", "=", " ", "I", "V" in "4 = IV"
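
A few of the rows above can be checked with a small sketch. It assumes Python's `re` module, which does not support the \p{name} and \P{name} classes, so those rows are left out; the `results` dictionary is just an illustrative container.

```python
import re

# Each entry pairs a pattern from the table with its sample input.
results = {
    "group": re.findall(r"[ae]", "stake"),
    "negated": re.findall(r"[^aei]", "reign"),
    "range": re.findall(r"[A-Z]", "AB123"),
    "wildcard": re.findall(r"a.e", "have hater"),
    "word": re.findall(r"\w", "ID A1.3"),
    "non_word": re.findall(r"\W", "ID A1.3"),
    "non_digit": re.findall(r"\D", "4 = IV"),
}

for name, hits in results.items():
    print(name, hits)
```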

3. Character Class Operations


Class operation   Legend   Example   Sample Match

[…-[…]]     .NET: it is a character class subtraction. One character on the left, but not in the subtracted class.   [a-z-[aeiou]]   Any lowercase consonant

[…-[…]]     .NET: it is a character class subtraction.   [\p{IsArabic}-[\D]]   An Arabic character and not a non-digit, i.e., an Arabic digit

[…&&[…]]    Java, Ruby 2+: it is a character class intersection. One character on the left and in the && class.   [\S&&[\D]]   A non-whitespace character and a non-digit.

[…&&[…]]    Java, Ruby 2+: character class intersection.   [\S&&[\D]&&[^a-zA-Z]]   A non-whitespace character that is a non-digit and not a letter.

[…&&[^…]]   Java, Ruby 2+: a character class subtraction is obtained by intersecting a class with a negated class.   [a-z&&[^aeiou]]   An English lowercase letter that is not a vowel.

[…&&[^…]]   Java, Ruby 2+: it is a character class subtraction.   [\p{InArabic}&&[^\p{L}\p{N}]]   An Arabic character and not a letter or a number

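
Engines without these operators can often approximate them with a lookahead placed in front of a class. The sketch below is an emulation under that assumption, not the .NET or Java syntax from the table: it uses Python's `re` module, which has neither subtraction nor intersection, and the pattern names are hypothetical.

```python
import re

# Approximates [a-z-[aeiou]] (.NET) or [a-z&&[^aeiou]] (Java):
# the negative lookahead rejects vowels, then [a-z] consumes one character.
consonant = re.compile(r"(?![aeiou])[a-z]")

# Approximates [\S&&[\D]] (Java): the positive lookahead requires a
# non-whitespace character, then \D requires it to be a non-digit.
non_space_non_digit = re.compile(r"(?=\S)\D")

consonants = consonant.findall("regex 101")
symbols = non_space_non_digit.findall("a1 b2")

print(consonants, symbols)
```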
4. Anchors

Anchors are also known as atomic zero-width assertions. They cause a match to succeed or fail
based on the current position in the string, but they do not cause the engine to advance
through the string or consume characters. The metacharacters that are listed in the
following table are anchors.

Assertion   Description   Pattern   Matches

^    By default, the match must start at the beginning of the string. In multiline mode, it must start at the beginning of the line.   ^\d{3}   "111" in "111-333-"

$    By default, the match will occur at the end of the string or just before \n at the end of the string. In multiline mode, it will occur just before the end of the line or before \n at the end of the line.   -\d{3}$   "-444" in "-901-444"

\A   The match occurs at the start of the string.   \A\d{3}   "222" in "222-333-"

\Z   The match occurs at the end of the string or before \n at the end of the string.   -\d{3}\Z   "-111" in "-555-111"

\z   The match occurs at the end of the string.   -\d{3}\z   "-111" in "-901-111"

\G   The match occurs at the point where the previous match ended.   \G\(\d\)   "(1)", "(3)", "(5)" in "(1)(3)(5)[7](9)"

\b   The match occurs on a boundary between a \w (alphanumeric) and a \W (nonalphanumeric) character.   \b\w+\s\w+\b   "them theme", "them them" in "them theme them them"

\B   The match will not occur on a \b boundary.   \Bend\w*\b   "ends", "ender" in "end sends endure lender"
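
The effect of multiline mode on ^ and $ can be seen in a short sketch. It assumes Python's `re` module, where `re.MULTILINE` plays the role of the multiline option; note that Python's \Z matches only at the very end of the string (like \z in the table), and \G is not supported there. The two-line sample text is made up for illustration.

```python
import re

text = "111-333\n222-444"

# Without multiline mode, ^ and $ only anchor to the whole string.
start_only = re.findall(r"^\d{3}", text)
end_only = re.findall(r"-\d{3}$", text)

# With multiline mode, they anchor to every line.
each_start = re.findall(r"^\d{3}", text, flags=re.MULTILINE)
each_end = re.findall(r"-\d{3}$", text, flags=re.MULTILINE)

# \B requires that "end" does NOT start at a word boundary.
inner = re.findall(r"\Bend\w*\b", "end sends endure lender")

print(start_only, end_only, each_start, each_end, inner)
```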

5. Grouping Constructs
Grouping constructs delineate subexpressions of a regular expression and capture substrings of
the provided string. Grouping constructs use the following language elements.

Grouping construct   Description   Pattern   Matches

(subexpression)   It will capture the matched subexpression and assign it a one-based ordinal number.   (\w)\1   "ll" in "hello"

(?<name>subexpression) or (?'name'subexpression)   It will capture the matched subexpression into a named group.   (?<double>\w)\k<double>   "ll" in "hello"

(?<name1-name2>subexpression) or (?'name1-name2'subexpression)   It will define a balancing group definition.   (((?'Open'\()[^\(\)]*)+((?'Close-Open'\))[^\(\)]*)+)*(?(Open)(?!))$   "((1-3)*(3-1))" in "3+2^((1-3)*(3-1))"

(?:subexpression)   It will define a noncapturing group.   Write(?:Line)?   "WriteLine" in "Console.WriteLine()"; "Write" in "Console.Write(value)"

(?imnsx-imnsx:subexpression)   It will apply or disable the specified options within subexpression.   A\d{2}(?i:\w+)\b   "A12xl", "A12XL" in "A12xl A12XL a12xl"

(?=subexpression)   Zero-width positive lookahead assertion.   \b\w+\b(?=.+and.+)   "rats", "bats" in "rats, bats and some mice."

(?!subexpression)   Zero-width negative lookahead assertion.   \b\w+\b(?!.+and.+)   "and", "some", "mice" in "rats, bats and some mice."

(?<=subexpression)   Zero-width positive lookbehind assertion.   \b\w+\b(?<=.+and.+)   "some", "mice" in "rats, bats and some mice."
                                                                 \b\w+\b(?<=.+and.*)   "and", "some", "mice" in "rats, bats and some mice."

(?<!subexpression)   Zero-width negative lookbehind assertion.   \b\w+\b(?<!.+and.+)   "rats", "bats", "and" in "rats, bats and some mice."
                                                                 \b\w+\b(?<!.+and.*)   "rats", "bats" in "rats, bats and some mice."

(?>subexpression)   Atomic group.   (?>a|ab)c   "ac" in "ac"; nothing in "abc"
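
The capturing, named, and noncapturing rows can be illustrated with a brief sketch. It assumes Python's `re` module, which spells a named group as (?P<name>...) instead of (?<name>...) and has no balancing groups; the group names word1 and word2 are illustrative.

```python
import re

# Numbered and named groups captured in one search.
m = re.search(r"(?P<word1>\w+)(\s)(?P<word2>\w+)", "one two")
first = m.group("word1")
second = m.group("word2")
named = m.groupdict()  # only the named groups appear here

# A noncapturing group still takes part in the match but stores nothing,
# so findall returns whole matches.
calls = re.findall(r"Write(?:Line)?", "Console.WriteLine() Console.Write(value)")

# Positive lookahead: words that are followed somewhere by "and".
before_and = re.findall(r"\b\w+\b(?=.+and.+)", "rats, bats and some mice.")

print(first, second, named, calls, before_and)
```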

6. Lookarounds

When the regex engine reaches a lookaround expression, it takes the substring from the current
position to the start (lookbehind) or end (lookahead) of the original string, and then runs
Regex.IsMatch on that substring using the lookaround pattern. Whether the lookaround succeeds
depends on that result and on whether the assertion is positive or negative.

Lookaround   Name   Example   Sample Match

(?=check)    Positive Lookahead    (?=\d{10})\d{5}     06678 in 0667856789

(?<=check)   Positive Lookbehind   (?<=\d)rat          rat in 1rat

(?!check)    Negative Lookahead    (?!theatre)the\w+   theme

(?<!check)   Negative Lookbehind   \w{3}(?<!mon)ster   Munster
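
The four examples can be run side by side, including an input that each assertion rejects. This is a sketch assuming Python's `re` module (which requires lookbehind patterns to have a fixed width, as these do); `first_match` is a hypothetical helper, and the rejected inputs are added for illustration.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Positive lookahead: ten digits must follow, but only five are consumed.
ahead = first_match(r"(?=\d{10})\d{5}", "0667856789")

# Positive lookbehind: "rat" only counts when a digit precedes it.
behind_ok = first_match(r"(?<=\d)rat", "1rat")
behind_no = first_match(r"(?<=\d)rat", "rat")

# Negative lookahead: words starting with "the", except "theatre".
not_ahead_ok = first_match(r"(?!theatre)the\w+", "theme")
not_ahead_no = first_match(r"(?!theatre)the\w+", "theatre")

# Negative lookbehind: three word characters that are not "mon", then "ster".
not_behind_ok = first_match(r"\w{3}(?<!mon)ster", "Munster")
not_behind_no = first_match(r"\w{3}(?<!mon)ster", "monster")

print(ahead, behind_ok, behind_no, not_ahead_ok, not_ahead_no,
      not_behind_ok, not_behind_no)
```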

7. Quantifiers

A quantifier specifies how many instances of the previous element must be present in
the input string for a match to occur. Quantifiers include the following language
elements.

Quantifier   Description   Pattern   Matches

*        It will match the previous element zero or more times.   \d*\.\d   ".0", "19.9", "219.9"

+        It will match the previous element one or more times.   se+   "see" in "seen", "se" in "sent"

?        It will match the previous element zero or one time.   mai?n   "man", "main"

{n}      It will match the previous element exactly n times.   ,\d{3}   ",043" in "1,043.6"; ",876", ",543", and ",210" in "9,876,543,210"

{n,}     It will match the previous element at least n times.   \d{2,}   "166", "29", "1930"

{n,m}    It will match the previous element at least n times, but no more than m times.   \d{3,5}   "166", "17668"; "19302" in "193024"

*?       It will match the previous element zero or more times, but as few times as possible.   \d*?\.\d   ".0", "19.9", "219.9"

+?       It will match the previous element one or more times, but as few times as possible.   se+?   "se" in "seen", "se" in "sent"

??       It will match the previous element zero or one time, but as few times as possible.   mai??n   "man", "main"

{n}?     It will match the preceding element exactly n times.   ,\d{3}?   ",043" in "1,043.6"; ",876", ",543", and ",210" in "9,876,543,210"

{n,}?    It will match the previous element at least n times, but as few times as possible.   \d{2,}?   "16" in "166"; "29"; "19", "30" in "1930"

{n,m}?   It will match the previous element between n and m times, but as few times as possible.   \d{3,5}?   "166"; "176" in "17668"; "193", "024" in "193024"

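
The difference between the greedy and lazy forms is easiest to see by running both on the same input. The sketch below assumes Python's `re` module, whose greedy and lazy quantifiers follow the same rules as the table; the variable names are illustrative.

```python
import re

# Greedy + takes as many "e" characters as it can; lazy +? stops at one.
greedy_plus = re.findall(r"se+", "seen sent")
lazy_plus = re.findall(r"se+?", "seen sent")

# Greedy {3,5} takes five digits; lazy {3,5}? stops after three,
# which leaves room for a second match.
greedy_range = re.findall(r"\d{3,5}", "193024")
lazy_range = re.findall(r"\d{3,5}?", "193024")

# Lazy {2,}? stops as soon as two digits have been matched.
lazy_at_least = re.findall(r"\d{2,}?", "1930")

print(greedy_plus, lazy_plus, greedy_range, lazy_range, lazy_at_least)
```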
8. Backreference Constructs

A backreference lets you refer back, later in the same regular expression, to a subexpression
that has already been matched. The following table highlights the backreference constructs:

Backreference construct   Description   Pattern   Matches

\number     Backreference. It will match the value of a numbered subexpression.   (\w)\1   "ee" in "peek"

\k<name>    Named backreference. It will match the value of a named expression.   (?<char>\w)\k<char>   "ee" in "peek"
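
Both rows can be reproduced in a few lines. The sketch assumes Python's `re` module, where a named backreference is written (?P=name) rather than \k<name>; the repeated-word pattern and its sample sentence are added here purely for illustration.

```python
import re

# Numbered backreference: \1 must repeat whatever group 1 captured.
numbered = re.search(r"(\w)\1", "peek").group(0)

# Named backreference, in Python's spelling.
named = re.search(r"(?P<char>\w)(?P=char)", "peek").group(0)

# Illustrative use: find words that are immediately repeated.
# findall returns the captured word, not the whole match.
repeated = re.findall(r"\b(\w+)\s+\1\b", "it is is what it it is")

print(numbered, named, repeated)
```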

9. Alternation Constructs

Alternation constructs modify a regular expression to enable either/or matching. These
constructs include the language elements that are listed in the following table.

Alternation construct   Description   Pattern   Matches

|   It will match any one element that is separated by the vertical bar (|) character.   th(e|is|at)   "the", "this" in "this is the day."

(?(expression)yes|no) or (?(expression)yes)   It will match “yes” if the regex pattern designated by expression matches; else, it will match the optional “no” part. The provided expression is interpreted as a zero-width assertion. To avoid ambiguity with a named or numbered capturing group, you must use the optional explicit assertion, such as (?( (?=expression) )yes|no).   (?(A)A\d{2}\b|\b\d{3}\b)   "A10", "910" in "A10 C103 910"

(?(name)yes|no) or (?(name)yes)   It will match “yes” if name, a named or numbered capturing group, has a match; else, it will match the optional no.   (?<quoted>")?(?(quoted).+?"|\S+\s)   "Dogs.jpg ", "\"Yiska playing.jpg\"" in "Dogs.jpg \"Yiska playing.jpg\""
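
Plain alternation and the group-based conditional can be sketched as follows. This assumes Python's `re` module, which supports (?(name)yes|no) for a capturing group but not the expression-based conditional in the second row, and which writes the named group as (?P<quoted>...).

```python
import re

# Plain alternation: "th" followed by one of three endings.
alternatives = [m.group(0) for m in re.finditer(r"th(e|is|at)", "this is the day.")]

# Conditional on a group: if an opening quote was captured, read up to the
# closing quote; otherwise read one whitespace-terminated token.
pattern = r'(?P<quoted>")?(?(quoted).+?"|\S+\s)'
names = [m.group(0) for m in re.finditer(pattern, 'Dogs.jpg "Yiska playing.jpg"')]

print(alternatives, names)
```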

10. Substitutions

Substitutions are regex language elements that are used in replacement patterns. The following
table lists the substitution elements.

Character   Description   Pattern   Replacement pattern   Input string   Result string

$number   It will substitute the substring matched by group number.   \b(\w+)(\s)(\w+)\b   $3$2$1   "one two"   "two one"

${name}   It will substitute the substring matched by the named group name.   \b(?<word1>\w+)(\s)(?<word2>\w+)\b   ${word2} ${word1}   "one two"   "two one"

$$        It will substitute a literal "$".   \b(\d+)\s?USD   $$$1   "44 USD"   "$44"

$&        It will substitute a copy of the whole match.   \$?\d*\.?\d+   **$&**   "$1.30"   "**$1.30**"

$`        It will substitute all the text of the input string before the match.   B+   $`   "DDBBCC"   "DDDDCC"

$'        It will substitute all the text of the input string after the match.   B+   $'   "AABBCC"   "AACCCC"

$+        It will substitute the last group that was captured.   B+(C+)   $+   "AABBCCDD"   "AACCDD"

$_        It will substitute the entire input string.   B+   $_   "AABBCC"   "AAAABBCCCC"
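
Several of these replacements can be reproduced in a short sketch. It assumes Python's `re.sub`, whose replacement syntax differs from the table: groups are written \1 or \g<name>, the whole match is \g<0>, and a "$" is already literal. Python has no direct equivalent of $`, $' or $_, so those rows are emulated here with replacement functions.

```python
import re

# $3$2$1 and ${word2} ${word1}: swap two words.
swapped = re.sub(r"\b(\w+)(\s)(\w+)\b", r"\3\2\1", "one two")
swapped_named = re.sub(
    r"\b(?P<word1>\w+)(\s)(?P<word2>\w+)\b", r"\g<word2> \g<word1>", "one two"
)

# $$$1: a literal dollar sign followed by group 1.
dollars = re.sub(r"\b(\d+)\s?USD", r"$\1", "44 USD")

# **$&**: wrap the whole match.
wrapped = re.sub(r"\$?\d*\.?\d+", r"**\g<0>**", "$1.30")

# Emulations of $` (text before), $' (text after) and $_ (entire input).
before = re.sub(r"B+", lambda m: m.string[: m.start()], "DDBBCC")
after = re.sub(r"B+", lambda m: m.string[m.end():], "AABBCC")
entire = re.sub(r"B+", lambda m: m.string, "AABBCC")

print(swapped, swapped_named, dollars, wrapped, before, after, entire)
```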

11. Inline Options

The following are the inline options supported by the .NET regex engine:

Option   Description   Pattern   Matches

i   It is for case-insensitive matching.   \b(?i)a(?-i)a\w+\b   "aardvark", "aaaAuto" in "aardvark AAAuto aaaAuto Adam breakfast"

m   It will use multiline mode: ^ and $ match the beginning and end of a line, instead of the beginning and end of a string.

n   It will not capture unnamed groups.

s   It will use the single-line mode.

x   It will ignore the unescaped white space in the regular expression pattern.   \b(?x) \d+ \s \w+   "1 aardvark", "2 cats" in "1 aardvark 2 cats IV centurions"
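
The i and x rows can be approximated in a short sketch. It assumes Python's `re` module, where an option cannot be switched on and off mid-pattern with (?i)...(?-i); the scoped form (?i:...) is used instead, and a global option such as (?x) has to appear at the very start of the pattern.

```python
import re

text = "aardvark AAAuto aaaAuto Adam breakfast"

# Only the first "a" is case-insensitive; the second must be lowercase.
scoped = re.findall(r"\b(?i:a)a\w+\b", text)

# With x, unescaped whitespace in the pattern is ignored.
verbose = re.findall(r"(?x) \b \d+ \s \w+", "1 aardvark 2 cats IV centurions")

print(scoped, verbose)
```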

12. POSIX Character Classes

A POSIX character class matches any single character from a named set of characters. We
can use POSIX character classes only within bracket expressions. The POSIX standard
supports the following classes of characters to create regular expressions.

Character class   Legend   Example   Sample Match

[:alpha:]   PCRE (C, PHP, R…): ASCII letters A-Z and a-z   [8[:alpha:]]+   WellDone88

[:alpha:]   Ruby 2: Unicode letter or ideogram   [[:alpha:]\d]+   кошка99

[:alnum:]   PCRE (C, PHP, R…): ASCII digits and letters A-Z and a-z   [[:alnum:]]{10}   ABC1235251

[:alnum:]   Ruby 2: Unicode digit, letter or ideogram   [[:alnum:]]{10}   кошка90210

[:punct:]   PCRE (C, PHP, R…): ASCII punctuation mark   [[:punct:]]+   ?!.,:;

[:punct:]   Ruby: Unicode punctuation mark   [[:punct:]]+   ‽,:〽⁆

13. Inline Modifiers

JavaScript does not support the following inline modifiers. If you are using Ruby, take care
with “?s” and “?m”: as the table notes, (?m) in Ruby means DOTALL mode rather than multiline mode.

Modifier   Legend   Example   Sample Match

(?i)   Case-insensitive mode (except JavaScript)   (?i)Monday   monDAY

(?s)   DOTALL mode (except JS and Ruby). The dot (.) will match the new line characters (\r\n). It is also called "single-line mode" because the dot treats the entire input as a single line.   (?s)From A.*to Z   From A to Z (spanning a line break)

(?m)   Multiline mode (except Ruby and JS): ^ and $ match at the beginning and end of every line.   (?m)1\r\n^2$\r\n^3$   1, 2, 3 (on consecutive lines)

(?m)   In Ruby: it is the same as (?s) in other engines, i.e. DOTALL mode, i.e. dot matches line breaks.   (?m)From A.*to Z   From A to Z (spanning a line break)

(?x)   Free-spacing mode (except JavaScript). It is also called comment mode or whitespace mode.   (multi-line example below)   abc d
       (?x) # this is a
       # comment
       abc # write on multiple
       # lines
       [ ]d # spaces must be
       # in brackets

(?n)   .NET, PCRE 10.30+: named capture only. Turns all (parentheses) into non-capture groups. To capture, use named groups.

(?d)   Java: Unix linebreaks only. The dot and the ^ and $ anchors are only affected by \n.

(?^)   PCRE 10.32+: unset modifiers. Unsets ismnx modifiers.
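
The (?s), (?m) and (?x) rows can be checked with a small sketch. It assumes Python's `re` module, where these three modifiers carry the meanings given for "other engines" in the table (not Ruby's); the sample strings use \n line breaks for brevity.

```python
import re

# (?s): the dot also matches the line break between "A" and "to".
with_dotall = re.search(r"(?s)From A.*to Z", "From A\nto Z") is not None
without_dotall = re.search(r"From A.*to Z", "From A\nto Z") is not None

# (?m): ^ and $ match at every line, not just at the ends of the string.
multiline = re.findall(r"(?m)^\d$", "1\n2\n3")
single = re.findall(r"^\d$", "1\n2\n3")

# (?x): whitespace and # comments are ignored; a literal space goes in brackets.
free_spacing = re.search(
    r"""(?x) # this is a
    # comment
    abc # write on multiple
    # lines
    [ ]d # spaces must be
    # in brackets
    """,
    "abc d",
).group(0)

print(with_dotall, without_dotall, multiline, single, free_spacing)
```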
