
10/11/2014

Advanced Regex Tutorial: Regex Syntax


Reducing (? ) Syntax Confusion


What the (? )
A question mark inside a parenthesis: So many uses!
I thought I would bring them all together in one place.
I don't know the fine details of the history of regular expressions. Stephen Kleene and Ken Thompson,
who started them, obviously wanted something very compact. Maybe they were into hieroglyphs, maybe
they were into cryptography, or maybe that was just the way you did things when you only had a few
kilobytes of RAM.
The heroes who expanded regular expressions (such as Henry Spencer and Larry Wall) followed in these
footsteps. One of the things that make regexes hard to read for beginners is that many points of syntax
that serve vastly different purposes all start with the same two characters:
(?

In the regex tutorials and books I have read, these various points of syntax are introduced in stages. But
(?: ) looks a lot like (?= ), so that at some point they are bound to clash in the mind of the regex
apprentice. To facilitate study, I have pulled all the (? ) usages I know about into one place. I'll start
by pointing out three confusing couples; details of usage will follow.
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:
Confusing Couples
Lookahead and Lookbehind: (?= ), (?! ), (?<= ), (?<! )
Non-Capturing Groups: (?: ) and (?is: )
Atomic Groups: (?> )
Named Capture: (?<foo> ) and (?P<foo> )
Inline Modifiers: (?isx-m)
Subroutines: (?1)
Recursion: (?R)
Conditionals: (?(A)B) and (?(A)B|C)
Pre-Defined Subroutines: (?(DEFINE)(<foo> )(<bar> )) and (?&foo)
Branch Reset: (?| )
Inline Comments: (?# )
http://www.rexegg.com/regex-disambiguation.html


Confusing Couples
Confusing Couple #1: (?: ) and (?= )
These false twins have very different jobs. (?: ) contains a non-capturing group, while (?= ) is a
lookahead.
Confusing Couple #2: (?<= ) and (?> )
(?<= ) is a lookbehind, so (?> ) must be a lookahead, right? Not so. (?> ) contains an atomic
group. The actual lookahead marker is (?= ). More about all these guys below.
Confusing Couple #3: (?(1) ) and (?1)
This pair is delightfully confusing. The first is a conditional expression that tests whether Group 1 has
been captured. The second is a subroutine call that matches the sub-pattern contained within the
capturing parentheses of Group 1.
Now that these three "big ones" are out of the way, let's drill into the syntax.

Lookarounds: (?<= ) and (?= ), (?<! ) and (?! )
Collectively, lookbehinds and lookaheads are known as lookarounds. This section gives you basic
examples of the syntax, but further down the track I encourage you to read the dedicated regex
lookaround page, as it covers subtleties that need to be grasped if you'd like lookaheads and lookbehinds
to become your trusted friends.
In the meantime, if there is one thing you should remember, it is this: a lookahead or a lookbehind does
not "consume" any characters on the string. This means that after the lookahead or lookbehind's closing
parenthesis, the regex engine is left standing on the very same spot in the string from which it started
looking: it hasn't moved. From that position, the engine can start matching characters again, or, why not,
look ahead (or behind) for something else, a useful technique, as we'll later see.
Here is how the syntax works.
Lookahead After the Match: \d+(?= dollars)
Sample Match: 100 in 100 dollars
Explanation: \d+ matches the digits 100, then the lookahead (?= dollars) asserts that at that position in
the string, what immediately follows is the characters "dollars".
Lookahead Before the Match: (?=\d+ dollars)\d+
Sample Match: 100 in 100 dollars
Explanation: The lookahead (?=\d+ dollars) asserts that at the current position in the string, what
follows is digits then the characters "dollars". If the assertion succeeds, the engine matches the digits
with \d+.

Note that this pattern achieves the same result as \d+(?= dollars) from above, but it is less efficient
because \d+ is matched twice. A better use of looking ahead before matching characters is to validate
multiple conditions in a password.
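To make the two placements concrete, here is a minimal sketch using Python's standard re module (my choice of engine for illustration). The password rule at the end is a made-up policy, assumed only to show several lookaheads stacked at the same position.

```python
import re

text = "100 dollars"

# Lookahead after the match: match the digits, then assert " dollars" follows.
after = re.search(r"\d+(?= dollars)", text)

# Lookahead before the match: assert "digits then dollars" first, then match the digits.
before = re.search(r"(?=\d+ dollars)\d+", text)

# Both matches stop after the digits: the lookahead consumes nothing.
print(after.group(), after.span())
print(before.group(), before.span())

# Hypothetical password rule (an assumption for this sketch): at least eight
# characters, including one digit, one lower-case and one upper-case letter.
# Each lookahead starts from the same spot, the beginning of the string.
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")


def is_valid_password(candidate):
    """Return True when the candidate satisfies every lookahead condition."""
    return PASSWORD.search(candidate) is not None
```

Because each lookahead leaves the engine where it started, the conditions can be listed in any order.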
Negative Lookahead After the Match: \d+(?! dollars)
Sample Match: 100 in 100 pesos
Explanation: \d+ matches 100, then the negative lookahead (?! dollars) asserts that at that position in the
string, what immediately follows is not the characters "dollars".
Negative Lookahead Before the Match: (?!\d+ dollars)\d+
Sample Match: 100 in 100 pesos
Explanation: The negative lookahead (?!\d+ dollars) asserts that at the current position in the string,
what follows is not digits then the characters "dollars". If the assertion succeeds, the engine matches the
digits with \d+.
Note that on this sample, this pattern achieves the same result as \d+(?! dollars) from above, but it is less efficient
because \d+ is matched twice. A better use of looking ahead before matching characters is to validate
multiple conditions in a password.
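A small sketch in Python's re module (assumed here purely for illustration) shows both negative lookaheads on the sample string. It also tries them on "100 dollars", where backtracking makes the two placements behave differently.

```python
import re

after = r"\d+(?! dollars)"       # negative lookahead after the match
before = r"(?!\d+ dollars)\d+"   # negative lookahead before the match

# On the sample string, both patterns return the full number.
on_pesos = [re.search(p, "100 pesos").group() for p in (after, before)]

# On "100 dollars", \d+ gives back one digit so that the assertion can pass:
# the first pattern still matches, but only "10".
backtracked = re.search(after, "100 dollars")

# The second pattern checks the whole "digits then dollars" sequence from
# every starting position, so it finds no match at all.
rejected = re.search(before, "100 dollars")

print(on_pesos, backtracked.group(), rejected)
```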
Lookbehind Before the Match: (?<=USD)\d{3}
Sample Match: 100 in USD100
Explanation: The lookbehind (?<=USD) asserts that at the current position in the string, what precedes
is the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.
Lookbehind After the Match: \d{3}(?<=USD\d{3})
Sample Match: 100 in USD100
Explanation: \d{3} matches 100, then the lookbehind (?<=USD\d{3}) asserts that at that position in the
string, what immediately precedes is the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<=USD)\d{3} from above, but it is less efficient
because \d{3} is matched twice.
Negative Lookbehind Before the Match: (?<!USD)\d{3}
Sample Match: 100 in JPY100
Explanation: The negative lookbehind (?<!USD) asserts that at the current position in the string, what
precedes is not the characters "USD". If the assertion succeeds, the engine matches three digits with
\d{3}.
Negative Lookbehind After the Match: \d{3}(?<!USD\d{3})
Sample Match: 100 in JPY100
Explanation: \d{3} matches 100, then the negative lookbehind (?<!USD\d{3}) asserts that at that
position in the string, what immediately precedes is not the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<!USD)\d{3} from above, but it is less efficient
because \d{3} is matched twice.
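The four lookbehind variants can be checked side by side. This is a sketch in Python's re module (an assumption of mine, chosen because all four patterns are fixed-width); the dictionary keys and the helper name are illustrative only.

```python
import re

patterns = {
    "lookbehind before": r"(?<=USD)\d{3}",
    "lookbehind after": r"\d{3}(?<=USD\d{3})",
    "negative lookbehind before": r"(?<!USD)\d{3}",
    "negative lookbehind after": r"\d{3}(?<!USD\d{3})",
}


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    found = re.search(pattern, text)
    return found.group() if found else None


# For each pattern: (result on "USD100", result on "JPY100").
results = {
    name: (first_match(p, "USD100"), first_match(p, "JPY100"))
    for name, p in patterns.items()
}

for name, outcome in results.items():
    print(name, outcome)
```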

Support for Lookarounds


All major engines have some form of support for lookarounds, with some important differences. For
instance, JavaScript doesn't support lookbehind, though it supports lookahead (one of the many blotches
on its regex scorecard). Ruby 1.8 suffered from the same condition.
Lookbehind: Fixed-Width / Constrained Width / Infinite Width
One important difference is whether lookbehind accepts variable-width patterns.
At the moment, I am aware of only three engines that allow infinite repetition within a lookbehind, as
in (?<=\s*): .NET, Matthew Barnett's outstanding regex module for Python, whose features far outstrip
those of the standard re module, and the JGSoft engine used by Jan Goyvaerts' software such as EditPad
Pro.
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a
pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four
characters. Likewise, (?<=A{1,10}) is valid.
PCRE (C, PHP, R), Java and Ruby 2+ allow lookbehinds to contain alternations that match strings
of different but pre-determined lengths (such as (?<=cat|raccoon)).
Perl and Python require a lookbehind to match strings of a fixed length, so (?<=cat|raccoon) will not
work.
To master lookarounds, there is a bit more you should really know. For these finer details, visit the
lookaround page.
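Here is how the fixed-width restriction shows up in Python's standard re module, together with one possible way around it: splitting the alternation into two separate lookbehinds, each of fixed width. The trailing s in the patterns is my own addition so that there is something to match.

```python
import re

# An alternation of different lengths inside one lookbehind is rejected by re.
try:
    re.compile(r"(?<=cat|raccoon)s")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround sketch: alternate between two fixed-width lookbehinds instead.
workaround = re.compile(r"(?:(?<=cat)|(?<=raccoon))s")

# Positions of every "s" that directly follows "cat" or "raccoon".
found = [m.start() for m in workaround.finditer("cats and raccoons")]

print(variable_width_ok, found)
```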

Non-Capturing Groups: (?: )


In regex as in the (2+3)*(5-2) of arithmetic, parentheses are often needed to group components of an
expression together. For instance, the above operation yields 15. Without the parentheses, because the *
operator has higher precedence than the + and -, 2+3*5-2 is interpreted as 2+(3*5)-2, yielding, er, 15
(a happy coincidence).
In regex, normal parentheses not only group parts of a pattern, they also capture the sub-match to a
capture group. This is often tremendously useful. At other times, you do not need the overhead.
In .NET, this capturing behavior of parentheses can be overridden by the (?n) flag or the
RegexOptions.ExplicitCapture option. But in all flavors, .NET included, it is far more common to use (?:
), which is the syntax for a non-capturing group. Watch out, as the syntax closely resembles that for a
lookahead (?= ).
For instance (?:Bob|Chloe) matches Bob or Chloe, but the name is not captured.
Within a non-capturing group, you can still use capture groups. For instance, (?:Bob says: (\w+)) would
match Bob says: Go and capture Go in Group 1.

Likewise, you can capture the content of a non-capturing group by surrounding it with parentheses. For
instance, ((?:Bob|Chloe)\d\d) would capture "Chloe44".
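The three cases above can be verified with a short sketch in Python's re module (my choice of engine for the illustration). The sample strings "Hello Chloe" and "user Chloe44" are made up.

```python
import re

# A non-capturing group groups the alternation but creates no capture group.
names = re.compile(r"(?:Bob|Chloe)")
name_match = names.search("Hello Chloe")

# A capture group nested inside a non-capturing group still captures.
inner = re.search(r"(?:Bob says: (\w+))", "Bob says: Go")

# Wrapping a non-capturing group in plain parentheses captures its content.
outer = re.search(r"((?:Bob|Chloe)\d\d)", "user Chloe44")

print(names.groups, name_match.group())   # number of capture groups, full match
print(inner.group(1))
print(outer.group(1))
```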
Mode Modifiers within Non-Capture Groups
On all engines that support inline modifiers such as (?i), except Python, you can blend the non-capture group syntax with mode modifiers. Here are some examples:
(?i:Bob|Chloe) This non-capturing group is case-insensitive.
(?ism:^BEGIN.*?END) This non-capturing group matches everything between "begin" and "end"
(case-insensitive), allowing such content to span multiple lines (the s modifier), starting at the beginning
of any line (the m modifier allows the ^ anchor to match the beginning of any line).
(?i-sm:^BEGIN.*?END) As above, but turns off the "s" and "m" modifiers.
See below for more on inline modifiers.

Atomic Groups: (?> )


An atomic group is an expression that becomes solid as a block once the regex leaves the closing
parenthesis. If the regex fails later down the string and needs to backtrack, a regular group containing a
quantifier would give up characters one at a time, allowing the engine to try other matches. Likewise, if
the group contained an alternation, the engine would try the next branch. An atomic group won't do that:
it's all or nothing.
Example 1: With Alternation
(?>A|.B)C

This will fail against ABC, whereas (?:A|.B)C would have succeeded. After matching the A in the atomic
group, the engine tries to match the C but fails. Because it is atomic, it is unable to try the .B part of the
alternation, which would also succeed, and allow the final token C to match.
Example 2: With Quantifier
(?>A+)[A-Z]C

This will fail against AAC, whereas (?:A+)[A-Z]C would have succeeded. After matching the AA in the
atomic group, the engine tries to match the [A-Z], succeeds by matching the C, then tries to match the
token C but fails as the end of the string has been reached. Because the group is atomic, it is unable to
give up the second A, which would allow the rest of the pattern to match.
If, before the atomic group, there were other options to which the engine can backtrack (such as
quantifiers or alternations), then the whole atomic group can be given up in one go.
When are Atomic Groups Important?
When a series of characters only makes sense as a block, using an atomic group can prevent needless
backtracking. This is explored in the section on possessive quantifiers. In such situations atomic
groups can be useful, but not necessarily mission-critical.
On the other hand, there are situations where atomic groups can save your pattern from disaster. They
are particularly useful:

In order to avoid the Lazy Trap with patterns that contain lazy quantifiers whose token can eat the
delimiter
To avoid certain forms of the Explosive Quantifier Trap
Supported Engines, and Workaround
Atomic groups are supported in most of the major engines: .NET, Perl, PCRE and Ruby. For engines that
don't support atomic grouping syntax, such as Python and JavaScript, see the well-known pseudo-atomic
group workaround.
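As a sketch of that workaround, a lookahead captures the would-be atomic content and a back-reference then consumes it. Since the engine does not backtrack into a lookahead that has already succeeded, the captured text behaves like a solid block. The snippet below uses Python's re module and replays the two examples from this section; newer Python versions (3.11 and later) also accept (?> ) directly, but the sketch avoids relying on that.

```python
import re

# Example 1: alternation. The regular group can fall back to the .B branch.
regular_alt = re.search(r"(?:A|.B)C", "ABC")
# Pseudo-atomic version of (?>A|.B)C: once A is chosen, .B is never tried.
atomic_alt = re.search(r"(?=(A|.B))\1C", "ABC")

# Example 2: quantifier. The regular group can give back the second A.
regular_quant = re.search(r"(?:A+)[A-Z]C", "AAC")
# Pseudo-atomic version of (?>A+)[A-Z]C: the captured AA is kept whole.
atomic_quant = re.search(r"(?=(A+))\1[A-Z]C", "AAC")

print(regular_alt.group(), atomic_alt)
print(regular_quant.group(), atomic_quant)
```

One side effect to keep in mind: unlike a true atomic group, this version uses up a capture group number.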
Alternate Syntax: Possessive Quantifier
When an atomic group only contains a token with a quantifier, an alternate syntax (in engines that
support it) is a possessive quantifier, where a + is added to the quantifier. For instance,
(?>A+) is equivalent to A++
(?>A*) is equivalent to A*+
(?>A?) is equivalent to A?+
(?>A{,}) is equivalent to A{,}+
This works in Perl, PCRE, Java and Ruby 2+.
For more, see the possessive quantifiers section of the quantifiers page.
Non-Capturing
Atomic groups are non-capturing, though as with other non-capturing groups, you can place the group
inside another set of parentheses to capture the group's entire match; and you can place parentheses
inside the atomic group to capture a section of the match.
Watch out, as the atomic group syntax is confusingly similar to the lookbehind syntax (?<= ).

Named Capture: (?<foo> ), (?P<foo> ) and (?P=foo)
When you cut and paste a piece of a pattern, Group 3 can suddenly become Group 1. That's a problem if
you were using a back-reference \3 or replacement $3.
One way around this problem is named capture groups. The syntax varies across engines (see Naming
Groups and referring back to them for the gory details). It's worth noting that named groups also have a
number that obeys the left-to-right numbering rules, and can be referenced by their number as well as
their name.
In short, the two capturing flavors are (?<foo> ) and (?P<foo> ). For instance, in the right engines,
^(?<intpart>\d+)\.(?<decpart>\d+)$
or
^(?P<intpart>\d+)\.(?P<decpart>\d+)$
would both match a string containing a decimal number such as 12.22, storing the integer portion to a
group named intpart, and storing the decimal portion to a group named decpart.


To create a back-reference to the intpart group in the pattern, depending on the engine, you'll use
\k<intpart> or (?P=intpart). To insert the named group in a replacement string, depending on the engine,
you'll either use ${intpart}, \g<intpart>, $+{intpart} or the group number \1. For the gory details, see
Naming Groups and referring back to them.
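To see one engine's combination in action, here is a sketch in Python's re module, which uses the (?P<foo> ) flavor for the group, (?P=foo) for the back-reference and \g<foo> in replacement strings. The word-repetition pattern and the sample strings are my own illustrations.

```python
import re

DECIMAL = re.compile(r"^(?P<intpart>\d+)\.(?P<decpart>\d+)$")

m = DECIMAL.search("12.22")
# A named group can be read by name or by its left-to-right number.
by_name = m.group("intpart")
by_number = m.group(1)

# Named groups in a replacement string: swap the two halves of the number.
swapped = DECIMAL.sub(r"\g<decpart>.\g<intpart>", "12.22")

# Named back-reference inside the pattern: the same word twice in a row.
repeated = re.search(r"\b(?P<word>\w+) (?P=word)\b", "Hey Hey you")

print(by_name, by_number, m.group("decpart"), swapped, repeated.group())
```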
To name, or not to name?
I'll admit that I don't use named groups a whole lot, but some people love them.
Sure, named captures are bulkier than a quick (capture) and reference to \1, but they can save hassles in
expressions that contain many groups.
Do they make your patterns easier to read? That's subjective. For my part, if the regex is short, I always
prefer numbered groups. And if it is long, I would rather read a regex with numbered groups and good
comments in free-spacing mode than a one-liner with named groups.

Inline Modifiers: (?isx-m)


All popular regex flavors apart from JavaScript support inline modifiers, which allow you to tell the
engine, in a pattern, to change how to interpret the pattern. For instance, (?i) turns on case-insensitivity.
Except in Python, (?-i) turns it off.
If a modifier appears at the head of the pattern, it modifies the matching mode for the whole pattern
unless it is later turned off. But (except in Python) a modifier can appear in mid-pattern, in which case it
only affects the portion of the pattern that follows.
Modifiers can be combined: for instance, (?ix) turns on both case-insensitive and free-spacing mode.
(?ix-s) does the same, but also turns off single-line (a.k.a. DOTALL) mode.
Summary of inline modifiers
(?i) turns on case insensitive mode.
Except in Ruby, (?s) activates "single-line mode", a.k.a. DOTALL mode, allowing the dot to match
line break characters. In Ruby, the same function is served by (?m).
Except in Ruby, (?m) activates "multi-line mode", which allows the caret ^ and dollar $ assertions to
match at the beginning and end of lines. In Ruby, (?m) does what (?s) does in other flavors: it activates
DOTALL mode.
(?x) turns on the free-spacing mode (a.k.a. whitespace mode or comment mode). This allows you to
write your regex on multiple lines, like the example on the home page, with comments preceded by
a #. Warning: You will usually want to make sure that (?x) appears immediately after the quote
character that starts the pattern string. For instance, if you try placing it on a newline because it would
look better, the engine will try matching the newline characters before it activates free-spacing mode.
In .NET, (?n) turns on "named capture only" mode, which means that regular parentheses are treated as
non-capture groups.

In Java, (?d) turns on "Unix lines mode", which means that the dot and the anchors ^ and $ only
care about line break characters when they are line feeds \n.
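For a feel of the common modifiers, here is a sketch in Python's re module (assumed for illustration; the .NET and Java-specific modifiers have no counterpart there). Note how (?x) sits immediately after the opening quote, in line with the warning above. The sample strings are made up.

```python
import re

# (?i): case-insensitive matching.
case_insensitive = re.search(r"(?i)bob", "BOB")

# (?s): DOTALL mode lets the dot match a line break.
dot_plain = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# (?m): multi-line mode lets ^ and $ match at every line.
lines = re.findall(r"(?m)^\w+$", "one\ntwo")

# (?x): free-spacing mode, placed right after the opening quote.
DECIMAL = re.compile(r"""(?x)   # switch on free-spacing mode first
    ^ (?P<intpart>\d+)          # integer part
    \.                          # literal dot
    (?P<decpart>\d+) $          # decimal part
""")

print(case_insensitive.group(), dot_plain, dot_all.group(), lines)
print(DECIMAL.search("12.22").groupdict())
```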
Combining Non-Capture Group with Inline Modifiers
As we saw in the section on non-capture groups, you can blend mode modifiers into the non-capture
group syntax in all engines that support inline modifiers, except Python. For instance, (?i:bob) is a non-capturing group with the case insensitive flag turned on. It matches strings such as "bob" and "boB".
But don't get carried away: you cannot blend inline modifiers with any random bit of regex syntax. For
instance, the following are all illegal: (?i=bob), (?iP<name>bob) and (?i>bob)
Using Inline Modifiers in the Middle of a Pattern
Usually, you'll use your inline modifiers at the start of the regex string to set the mode for the entire
pattern. However, changing modes in the middle of a pattern can be useful, so I'll give you two examples.
(\b[A-Z]+\b)(?i).*?\b\1\b

This ensures that an upper-case word is repeated somewhere in the string, in
any letter-case. First we capture an upper-case word to Group 1 (for instance DOG), then we set case-insensitive mode, then .*? matches any characters up to the back-reference \1, which could be dog or
dOg. As a neat variation, (\b[A-Z]+\b).*?\b(?=[a-z]+\b)(?i)\1\b ensures that the back-reference is in
lower-case.

^(\w+)\b.*\r?\n(?s).*?\b\1\b

This ensures that the first word of the string is repeated on a different
line. First we capture a word to Group 1, then we get to the end of the line with .*, match a line break,
then set DOTALL mode, allowing the .*? to match across lines, which brings us to our back-reference
\1.
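Since Python does not accept a bare modifier in mid-pattern, the sketch below rewrites both examples (and the lower-case variation) with the scoped forms (?i: ) and (?s: ) wrapped around the tail of each pattern. This assumes a version of Python's re module recent enough to accept scoped modifiers (3.6 or later); the sample strings are made up.

```python
import re

# An upper-case word repeated later in any letter-case.
repeated_any_case = re.compile(r"(\b[A-Z]+\b)(?i:.*?\b\1\b)")
any_case = repeated_any_case.search("The DOG chased a dOg")

# Variation: the repetition must be entirely in lower-case.
repeated_lower = re.compile(r"(\b[A-Z]+\b).*?\b(?=[a-z]+\b)(?i:\1)\b")
lower_ok = repeated_lower.search("DOG and dog")
lower_fail = repeated_lower.search("DOG and dOg")

# The first word of the string repeated on a different line.
first_word_again = re.compile(r"^(\w+)\b.*\r?\n(?s:.*?\b\1\b)")
other_line = first_word_again.search("apple pie\nbanana split\nan apple a day")
same_line_only = first_word_again.search("apple apple\nbanana")

print(any_case.group(), lower_ok.group(), lower_fail)
print(other_line.group(1), same_line_only)
```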


Subroutines: (?1) and (?&foo)


As you well know by now, when you create a capture group such as (\d+), you can then create a back-reference to that group (for instance \1 for Group 1) to match the very characters that were captured by
the group. For instance, (\w+) \1 matches Hey Hey.
In Perl, PCRE (C, PHP, R) and Ruby 1.9+, you can also repeat the actual pattern defined by a capture
group. In Perl and PCRE, the syntax to repeat the pattern of Group 1 is (?1) (in Ruby 2+, it is \g<1>).
For instance,
(\w+) (?1)

will match Hey Ho. The parentheses in (\w+) not only capture Hey to Group 1, they also define
Subroutine 1, whose pattern is \w+. Later, (?1) is a call to subroutine 1. The entire regex is therefore
equivalent to (\w+) \w+.
Subroutines can make long expressions much easier to look at and far less prone to copy-paste errors.
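The difference between repeating the captured text and repeating the pattern can be shown even in an engine without subroutine calls. Python's standard re module is not among the engines listed above, so this sketch stands in for (?1) by pasting a reusable string fragment into the pattern, which produces the expanded form (\w+) \w+. The name WORD is purely illustrative.

```python
import re

WORD = r"\w+"   # hypothetical reusable fragment, standing in for subroutine 1

# Back-reference: the second word must be the very text captured by Group 1.
backref = re.compile(r"(\w+) \1")

# Pattern reuse: the second word only has to match the same sub-pattern.
reused = re.compile(rf"({WORD}) {WORD}")

print(reused.pattern)
print(bool(backref.fullmatch("Hey Hey")), bool(backref.fullmatch("Hey Ho")))
print(bool(reused.fullmatch("Hey Ho")))
```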
Relative Subroutines
Instead of referring to a subroutine by its number, you can refer to the relative position of its defining
group, counting left or right from the current position in the pattern. For instance, (?-1) refers to the last

defined subroutine, and (?+1) refers to the next defined subroutine. Therefore,
(\w+) (?-1) and (?+1) (\w+)
are both equivalent to our first example with numbered group 1. In Ruby 2+, for relative subroutine calls,
you would use \g<-1> and \g<+1>.

Named Subroutines
Instead of using numbered groups, you can use named groups. In that case, in Perl and PHP the syntax
for the subroutine call will be (?&group_name). In Ruby 2+ the syntax is \g<some_word>. For instance,
(?<some_word>\w+) (?&some_word) is equivalent to our first example with numbered group 1.
Pre-Defined Subroutines
So far, when we defined our subroutines, we also matched something. For instance, (\w+) defines
subroutine 1 but also immediately matches some word characters. It so happens that Perl and PCRE have
terrific syntax that allows you to pre-define a subroutine without initially matching anything. This
syntax is extremely useful to build large, modular expressions. We will look at it in the corresponding
section: Pre-Defined Subroutines: (?(DEFINE)(<foo> )(<bar> ))
Subroutines and Recursion
If you place a subroutine such as (?1) within the very capture group to which it refers (Group 1 in this
case), then you have a recursive expression. For instance, the regex ^(A(?1)?Z)$ contains a recursive
sub-pattern, because the call (?1) to subroutine 1 is embedded in the parentheses that define Group 1.
If you try to trace the matching path of this regex in your mind, you will see that it matches strings like
AAAZZZ, strings which start with any number of letters A and end with letters Z that perfectly balance the
As. After you open the parenthesis, the A matches an A, then the optional (?1)? opens another
parenthesis and tries to match an A and so on.
We'll look at recursion syntax in the next section. There is also a page dedicated to recursion.
Warning
Note that the (?1) syntax looks confusingly similar to the (?(1) ) found in conditionals.

Recursive Expressions: (?R) and old friends


A recursive pattern allows you to repeat an expression within itself any number of times. This is quite
handy to match patterns where some tokens on the left must be balanced by some tokens on the right.
Recursive calls are available in PCRE (C, PHP, R), Perl, Ruby 2+ and the alternate regex module for
Python.
Recursion of the Entire Pattern: (?R)
To repeat the entire pattern, the syntax in Perl and PCRE is (?R). In Ruby, it is \g<0>.
For instance,
A(?R)?Z matches strings or substrings such as AAAZZZ, where a number of letters A at the start are

perfectly balanced by a number of letters Z at the end. The initial token A matches an A. Then the
optional (?R)? tries to repeat the whole pattern right there, and therefore attempts the token A to match an
A and so on.
Recursion of a Subroutine: (?1) and (?-1)
You also have recursion when a subroutine calls itself. For instance, in
^(A(?1)?Z)$ subroutine 1 (defined by the outer parentheses) contains a call to itself. This regex matches
entire strings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a
number of letters Z at the end.
As we saw in the section on subroutines, you can also call a subroutine by the relative position of its
defining group at the current position in the pattern. Therefore,
^(A(?-1)?Z)$ performs exactly like the above regex.
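If tracing the recursion in your head feels slippery, it may help to see the same steps as ordinary code. The sketch below is plain Python rather than a regex (Python's standard re module is not among the engines with recursive calls), and the function names are my own. Each call mirrors one pass through Group 1 of ^(A(?1)?Z)$.

```python
def match_group(text, pos=0):
    """Mirror A(?1)?Z starting at pos; return the end position, or None."""
    if not text.startswith("A", pos):      # the token A
        return None
    pos += 1
    inner = match_group(text, pos)         # the optional recursive call (?1)?
    if inner is not None:
        pos = inner                        # the nested A...Z pair was consumed
    if text.startswith("Z", pos):          # the token Z
        return pos + 1
    return None


def is_balanced(text):
    """Mirror ^(A(?1)?Z)$: the group must consume the entire string."""
    return match_group(text) == len(text)


for sample in ("AZ", "AAAZZZ", "AAZ", "AZZ"):
    print(sample, is_balanced(sample))
```

Every A opens a new level, and each level closes only when it finds its own Z, which is why the counts must balance.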
There is much more to be said about recursion. See the page dedicated to recursive regex patterns.


Conditionals: (?(A)B) and (?(A)B|C)


This section covers the basics on conditional syntax. For more, you'll want to explore the page dedicated
to regex conditionals.
In (?(A)B), condition A is evaluated. If it is true, the engine must match pattern B. In the full form
(?(A)B|C), when condition A is not true, the engine must match pattern C. Conditionals therefore allow
you to inject some if() then {} else {} logic into your patterns.
Typically, condition A will be that a given capture group has been set. For instance, (?(1)}) says: If
capture Group 1 has been set, match a closing curly brace. Likewise, (?(foo) ) checks if the capture
group named foo has been set. The numbered check would be useful in
^({)?\d+(?(1)})$
This pattern matches a string of digits that may or may not be embedded in curly braces. The optional
capture Group 1 ({)? captures an opening brace. Later, the conditional checks if capture 1 was set, and if
so it matches the closing brace.
Let's expand this example to use the "else" part of the syntax:
^(?:({)|")\d+(?(1)}|")$

This pattern matches strings of digits that are either embedded in double quotes or in curly braces. The
non-capture group (?:({)|") matches the opening delimiter, capturing it to Group 1 if it is a curly brace.
After matching the digits, (?(1)}|") checks whether Group 1 was set. If so, we match a closing curly
brace. If not, we match a double quote.
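Both patterns can be tried out in Python's re module, which supports the group-number form of the conditional. In this sketch the braces are escaped as a precaution (the patterns above write them bare), and the helper name is illustrative.

```python
import re

# "If" only: a closing brace is required exactly when an opening one was captured.
OPTIONAL_BRACES = re.compile(r"^(\{)?\d+(?(1)\})$")

# "If / else": close with a brace if Group 1 was set, otherwise with a quote.
BRACES_OR_QUOTES = re.compile(r'^(?:(\{)|")\d+(?(1)\}|")$')


def matches(pattern, text):
    """Return True when the pattern matches the text."""
    return pattern.search(text) is not None


for sample in ("{123}", "123", "{123", "123}"):
    print(sample, matches(OPTIONAL_BRACES, sample))

for sample in ("{12}", '"12"', '{12"', '"12}'):
    print(sample, matches(BRACES_OR_QUOTES, sample))
```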
Lookaround in Conditions
In (?(A)B), the condition you'll most frequently see is a check as to whether a capture group has been set.
In .NET, PCRE and Perl (but not Python and Ruby), you can also use lookarounds:
\b(?(?<=5D:)\d{5}|\d{10})\b

If the prefix 5D: can be found, the pattern will match five digits. Otherwise, it will match ten digits.

Needless to say, that is not the only way to perform this task.
Checking if a relative capture group was set
(?(1)A) checks whether Group 1 was set. In PCRE, instead of hard-coding the group number, we can also
check whether a group at a relative position to the current position in the pattern has been set: for
instance, (?(-1)A) checks whether the previous group has been set. Likewise, (?(+1)A) checks whether
the next capture group has been set. (This last scenario would be found within a larger repeating group,
so that on the second pass through the pattern, the next capture group may indeed have been set on the
previous pass.)
Checking if a recursion level was reached
This is not the place to be talking in depth about recursion, which has a section above and a dedicated
page, but for completeness I should mention two other uses of conditionals, available in Perl and PCRE:
(?(R)A) tests whether the regex engine is currently working within a recursion depth (reached from a
recursive call to the whole pattern or a subroutine).
(?(R1)A) tests whether the current recursion level has been reached by a recursive call to subroutine 1.
See examples here.
Availability of Regex Conditionals
Conditionals are available in PCRE, Perl, .NET, Python, and Ruby 2+. In other engines, the work of a
conditional can usually be handled by the careful use of lookarounds.
Similar Syntax
Note that the (?(1)B) syntax can look confusingly similar to (?1) which stands for a regex subroutine,
where the regex pattern defined by Group 1 must be matched.

Pre-Defined Subroutines: (?(DEFINE)(<foo> )(<bar> )) and (?&foo)
Available in Perl and PCRE (and therefore C, PHP, R), pre-defined subroutines allow you to produce
regular expressions that are beautifully modular and start to feel like clean procedural code.
Within a (?(DEFINE) ) block, you can pre-define one or several named subroutines without matching
any characters at that time. You can even pre-define subroutines based on other subroutines. When you
get to the matching part of the regex, this allows you to match complex expressions with compact and
readable syntax, and to match the same kind of expressions in multiple places without needing to repeat
your regex code.
This makes your regex more maintainable, both because it is easier to understand and because you don't
need to fix a sub-pattern in multiple places.
But an example is worth a thousand words, so let's dive in. If you like, you can play with the pattern and

sample text in this online demo.


A quick note first: in case you wonder what the \ are all about, they simply match one space character.
The regex is in free-spacing mode: the x flag is implied but could be made part of the pattern using the
(?x) modifier. In free-spacing mode, spaces that you do want to match must either be escaped as in \ or
specified inside a character class as in [ ].
(?(DEFINE)                 # start DEFINE block
  # pre-define quant subroutine
  (?<quant>many|some|five)
  # pre-define adj subroutine
  (?<adj>blue|large|interesting)
  # pre-define object subroutine
  (?<object>cars|elephants|problems)
  # pre-define noun_phrase subroutine
  (?<noun_phrase>(?&quant)\ (?&adj)\ (?&object))
  # pre-define verb subroutine
  (?<verb>borrow|solve|resemble)
)                          # end DEFINE block

##### The regex matching starts here #####
(?&noun_phrase)\ (?&verb)\ (?&noun_phrase)

This regex would match phrases such as:


five blue elephants solve many interesting problems
many large problems resemble some interesting cars
Note that the portion that does the matching is extremely compact and readable:
(?&noun_phrase)\ (?&verb)\ (?&noun_phrase)

The subroutine noun_phrase is called twice: there is no need to paste a large repeated regex sub-pattern,
and if we decide to change the definition of noun_phrase, that immediately trickles to the two places
where it is used.
Note also that noun_phrase itself is built by assembling smaller blocks: its code
(?&quant)\ (?&adj)\ (?&object) uses the quant, adj and object subroutines.
With this kind of modularity, you can build regex cathedrals. There is a beautiful example on the page
with the regex to match numbers in plain English.
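In an engine without (?(DEFINE) ) blocks, some of that modularity can be approximated outside the regex. The sketch below does so in Python by assembling the pattern from string constants before compiling it. This is a stand-in of my own rather than an equivalent feature: the blocks are expanded as text, and the constant names are illustrative.

```python
import re

# Building blocks, each wrapped in a non-capturing group so it can be reused.
QUANT = r"(?:many|some|five)"
ADJ = r"(?:blue|large|interesting)"
OBJECT = r"(?:cars|elephants|problems)"
VERB = r"(?:borrow|solve|resemble)"

# A block built from smaller blocks, like the noun_phrase subroutine.
NOUN_PHRASE = rf"{QUANT} {ADJ} {OBJECT}"

# The matching part stays compact and uses NOUN_PHRASE twice.
SENTENCE = re.compile(rf"{NOUN_PHRASE} {VERB} {NOUN_PHRASE}")

samples = [
    "five blue elephants solve many interesting problems",
    "many large problems resemble some interesting cars",
    "five elephants solve problems",
]
verdicts = [SENTENCE.fullmatch(s) is not None for s in samples]
print(verdicts)
```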
A Note on Group Numbering
Please be mindful that each named subroutine consumes one capture group number, so if you use capture
groups later in the regex, remember to count from left to right. The gory details are on the page about
Capture Group Numbering & Naming.

Branch Reset: (?| )



If you've read the page about Capture Group Numbering & Naming, you'll remember that capture groups
get numbered from left to right. Therefore, if you have two sets of capturing parentheses, they have two
group numbers. Sometimes, you might wish that these two sets of parentheses might capture to the same
numbered group.
Perl and PCRE (and therefore C, PHP, R) have a feature that lets you reuse a group number when
capturing parentheses are present on different sides of an alternation.
This is rather abstract, so let's take an example. Let's say you want to match a number, but only in three
situations:
If it follows an A, as in A00
If it precedes a B, as in 11B
If it is sandwiched between C and D, as in C22D
This poses no problem using lookahead and lookbehind, but the branch reset syntax (?| ) gives you
another, potentially more readable, option:
(?|A(\d+)|(\d+)B|C(\d+)D)

After the initial (?|, which introduces a branch reset, the group has a three-piece alternation (two |). Each
of those contains a capture group (\d+). The number of all of those capture groups is the same: Group 1.
You are not limited to one group. For instance, if you are also interested in capturing a potential suffix
after the number (which can happen in the situations 11B and C22D), place another set of parentheses
wherever you find a suffix:
(?|A(\d+)|(\d+)(B)|C(\d+)(D))

Using this regex to match the string A00 11B C22D, you obtain these groups:
Match    Group 1: Number    Group 2: Suffix
-----    ---------------    ---------------
A00      00                 (not set)
11B      11                 B
C22D     22                 D

How Useful is Branch Reset?


When I first read about branch reset in the PCRE documentation a few years ago, I was excited and
certain I'd use it often. Since then, I've written several thousand regular expression patterns, but I've used
branch reset less than a handful of times. It's probably my fault for always jumping on other ways to do
things first, but this leaves me with a sense that the feature is not all that useful after all.
That being said, on rare occasions, it's just the most direct and elegant way of doing things.
Let's look at one more example, less contrived than the first, which was pared down in order to explain
the feature.
A Branch Reset Example: Tokenization with Variable Formats
To me, this is an example where branch reset seems to offer benefits over competing idioms.
Suppose you want to parse strings such as
song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don't Worry"
into pairs of keys and values. When the value following the colon is between quotes, you only want the
inside of the quotes. Therefore, you expect something like:
Group 1    Group 2
-------    -------
song       Sweet Home Alabama
fruit      apple
color      blue
motto      Don't Worry

This branch reset regex will get you there:


(\S+):(?|([^"\s]+)|"([^"]+))

Group 1 (\S+) is a straight capture group that captures the key. In the branch reset, the two sets of
capturing parentheses allow you to capture different kinds of values in different formats to the same
group, i.e. Group 2. You can check the group captures in the right pane of this online regex demo.
To me, this alternative with a conditional and a lookbehind
(\S+):"?((?(?<!")[^"\s]+|[^"]+)) feels a little less satisfying. But hey, it works too.
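If you work in an engine that has neither branch reset nor lookaround conditionals, a third option is to let the two value formats land in separate groups and pick whichever one was set. Here is a minimal sketch with Python's built-in re module, assuming the input looks like the sample string above. TOKEN and parse_pairs are hypothetical names.

```python
import re

# Group 1 is the key; Groups 2 and 3 are the two value formats that
# branch reset would have merged into a single Group 2.
TOKEN = re.compile(r'(\S+):(?:([^"\s]+)|"([^"]+))')


def parse_pairs(text):
    """Return (key, value) pairs, taking whichever value group was set."""
    pairs = []
    for m in TOKEN.finditer(text):
        bare, quoted = m.group(2), m.group(3)
        pairs.append((m.group(1), bare if bare is not None else quoted))
    return pairs


line = 'song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don\'t Worry"'
pairs = parse_pairs(line)
for key, value in pairs:
    print(key, "=", value)
```

As in the branch reset regex, the closing quote is never consumed; the pattern simply stops before it and the next search skips over it.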

Inline Comments: (?# )


By now you must be familiar with the free-spacing mode, which makes it possible to unroll long regexes
and comment them, as in the many code boxes on this site. To turn on free-spacing for an entire
pattern, the syntax varies:
The (?x) modifier works in .NET, Perl, PCRE, Java, Python and Ruby.
The x flag can be added after the pattern delimiter in Perl, PHP and Ruby.
.NET lets you turn on the RegexOptions.IgnorePatternWhitespace option.
Python lets you turn on re.VERBOSE.
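As a small illustration of two of these options, the Python sketch below writes the same pattern twice: once with the (?x) modifier at the very start of the pattern, and once with the re.VERBOSE flag. The year-month pattern and the variable names are made up for the example.

```python
import re

# Option 1: the inline (?x) modifier, placed at the start of the pattern.
inline = re.compile(r"""(?x)   # turn on free-spacing from inside the pattern
    (\d{4})   # the year
    -
    (\d{2})   # the month
""")

# Option 2: the re.VERBOSE flag passed to the compile call.
flagged = re.compile(r"""
    (\d{4})   # the year
    -
    (\d{2})   # the month
""", re.VERBOSE)

inline_groups = inline.fullmatch("2014-10").groups()
flagged_groups = flagged.fullmatch("2014-10").groups()
print(inline_groups, flagged_groups)
```

In both versions the whitespace and the # comments are ignored by the engine, so the two patterns behave identically.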
What if you only want to insert a single comment without turning on free-spacing mode for the entire
pattern? In Perl, PCRE (and therefore C, PHP, R), Python and Ruby, you can write an inline comment
with this syntax: (?# )
For instance, in:
(?# the year)\d{4}

\d{4} matches four digits, while (?# the year) tells you what we are trying to match.
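To see that the comment has no effect on matching, here is a quick Python check comparing the commented pattern with its bare equivalent. The sample sentence is invented for the example.

```python
import re

# (?# ...) is skipped by the engine, so both patterns match the same text.
commented = re.compile(r"(?# the year)\d{4}")
plain = re.compile(r"\d{4}")

sample = "Released in 2014"
commented_hit = commented.search(sample).group()
plain_hit = plain.search(sample).group()
print(commented_hit, plain_hit)
```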
How useful is this? Not very. I almost never use this feature: when I want comments, I just turn on free-spacing mode for the whole regex.


Ask Rex
Duncan UK
March 12, 2014 - 02:40
Subject: Removing Confusion Around (? Regex Syntax
This topic is very well written and much appreciated. Distills large works like Friedl's book into an easily
digestible quarter of an hour. I look forward to reading the rest!
xtello France
February 19, 2014 - 08:03
Subject: RE: Your banner regex
Thanks Rex, you really made me laugh!! I see you always have the same excellent sense of humor as in
your (brilliant) articles & tutorials! Thank you for this great site and for the joke :) (and for the new
regex)
Greetings from (the south of) France! Xavier Tello
Reply to xtello
Rex
February 21, 2014 - 10:45
Subject: RE: Your banner regex
Hi Xavier, Thank you for your very kind encouragements! If only everyone could be like you. When the
technology becomes available, would you mind if I get back in touch in order to clone you? Wishing you
a fun weekend, Rex
xtello France
February 17, 2014 - 10:07
Subject: Your banner regex
I looked at the regex displayed in your banner. Applying this regex to the string [spoiler] will produce
[spoiler] (if I'm not wrong!). What's this easter egg? ;-)
Reply to xtello
Rex
February 17, 2014 - 16:37
Subject: RE: Your banner regex
Hi Xavier, Thank you for writing, it was a treat to hear from you. Wow, you are the first person to notice!
In fact, you made me change the banner to satisfy your sense of completion (and make it harder for the
next guy). > What's this easter egg? This Easter Egg (pun intended, I presume) is that you are the grand
winner of a secret contest. From the time I launched the site, I had planned that the first person to

discover this would win a free trip to the South of France. You won!!! :) :) :) Wishing you a beautiful
day, Rex
Nicolas Brussels
August 05, 2013 - 10:09
Subject: Little question about capture
Hi Andy. Thank you for all these articles, they are amazing! I learn a lot with this website. So glad to
have found it! Like they said: best resource on the internet :)
I tried some of your examples, and I'm stuck with one of them: (?:(\()|-)\d{6}(?(1)\)). When I try
"(111111)" with "preg_match_all", it captures "(". Do you think it's possible to bypass this capture? When
I use "-222222", it catches an empty string, and I don't understand why. Could you please explain this?
Thank you Andy! And again: Nice work!
Reply to Nicolas
Rex
August 05, 2013 - 18:56
Subject: RE: Little question about capture
Hi Nicolas,
Run this:
$regex='~(?:(\()|-)\d{6}(?(1)\))~';
$string='(such as "(444444)"), or it is preceded by a minus sign (such as "-333333").';
preg_match_all($regex,$string,$m);
var_dump( $m );
You will see that the MATCHES are (444444) and -333333
The CAPTURES are "(" and "". The captured left par is what makes the ?(1) work later in the regex.
Let me know if this is still unclear.
Aravind P S
May 03, 2013 - 17:39
Subject: Great Work man.
I enjoyed reading this article and learnt a lot. Thanks for your wonderful work. :)
Vin Switzerland
November 28, 2012 - 21:05
Subject: Brilliant
Best resource I've found yet on regular expressions. Much appreciate the work you put into this. Why not
create an eBook that could be downloaded? I for one would willingly cough up a few dollars. Regards
Vin
Reply to Vin
Andy
December 02, 2012 - 09:03
Subject: Re: Brilliant
Hi Vin, Thank you very much for your encouragements, and also for your suggestion. I've been itching to
make a print-on-demand book with the lowest price possible, to make it easy to read offline. Will
probably do that as soon as they extend the length of a day to 49 hours. Wishing you a fun weekend,
Andy

Skrell
November 22, 2012 - 08:21
Subject: amazing
These articles you post on regular expressions are among the best, I've found on the entire internet! No
joke! Much appreciated!!!
Reply to Skrell
Andy
November 22, 2012 - 21:13
Subject: Re: amazing
Hi Skrell, thank you very much for your supportive comment. I'm glad to know that someone likes these
pages! They took weeks to write and I've been surprised by how little time visitors have spent on them.
To enjoy a certain presentation of technical information I guess we must be of like minds at least in some
small way. :) Wishing you a fun end of the week, -A


Copyright RexEgg.com
