10/11/2014

Advanced Regex Tutorial—Regex Syntax


Reducing (? … ) Syntax Confusion
What the (? … )
A question mark inside a parenthesis: So many uses!
I thought I would bring them all together in one place.
I don't know the fine details of the history of regular expressions. Stephen Kleene and Ken Thompson,
who started them, obviously wanted something very compact. Maybe they were into hieroglyphs, maybe
they were into cryptography, or maybe that was just the way you did things when you only had a few
kilobytes of RAM.
The heroes who expanded regular expressions (such as Henry Spencer and Larry Wall) followed in these
footsteps. One of the things that make regexes hard to read for beginners is that many points of syntax
that serve vastly different purposes all start with the same two characters:
(?

In the regex tutorials and books I have read, these various points of syntax are introduced in stages. But
(?: … ) looks a lot like (?= … ), so that at some point they are bound to clash in the mind of the regex
apprentice. To facilitate study, I have pulled all the (? … ) usages I know about into one place. I'll start
by pointing out three confusing couples; details of usage will follow.
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:
✽ Confusing Couples
✽ Lookahead and Lookbehind: (?= … ), (?! … ), (?<= … ), (?<! … )
✽ Non-Capturing Groups: (?: … ) and (?is: … )
✽ Atomic Groups: (?> … )
✽ Named Capture: (?<foo> … ) and (?P<foo> … )
✽ Inline Modifiers: (?isx-m)
✽ Subroutines: (?1)
✽ Recursion: (?R)
✽ Conditionals: (?(A)B) and (?(A)B|C)
✽ Pre-Defined Subroutines: (?(DEFINE)(?<foo> … )(?<bar> … )) and (?&foo)
✽ Branch Reset: (?| … )
✽ Inline Comments: (?# … )
http://www.rexegg.com/regex-disambiguation.html


Confusing Couples
Confusing Couple #1: (?: … ) and (?= … )
These false twins have very different jobs. (?: … ) contains a non-capturing group, while (?= … ) is a lookahead.
Confusing Couple #2: (?<= … ) and (?> … )
(?<= … ) is a lookbehind, so (?> … ) must be a lookahead, right? Not so. (?> … ) contains an atomic group. The actual lookahead marker is (?= … ). More about all these guys below.
Confusing Couple #3: (?(1) … ) and (?1)
This pair is delightfully confusing. The first is a conditional expression that tests whether Group 1 has been captured. The second is a subroutine call that matches the sub-pattern contained within the capturing parentheses of Group 1.
Now that these three "big ones" are out of the way, let's drill into the syntax.

Lookarounds: (?<= … ) and (?= … ), (?<! … ) and (?! … )
Collectively, lookbehinds and lookaheads are known as lookarounds. This section gives you basic examples of the syntax, but further down the track I encourage you to read the dedicated regex lookaround page, as it covers subtleties that need to be grasped if you'd like lookaheads and lookbehinds to become your trusted friends.
In the meantime, if there is one thing you should remember, it is this: a lookahead or a lookbehind does not "consume" any characters on the string. This means that after the lookahead or lookbehind's closing parenthesis, the regex engine is left standing on the very same spot in the string from which it started looking: it hasn't moved. From that position, the engine can start matching characters again, or, why not, look ahead (or behind) for something else: a useful technique, as we'll later see.
Here is how the syntax works.

Lookahead After the Match: \d+(?= dollars)
Sample Match: 100 in 100 dollars
Explanation: \d+ matches the digits 100, then the lookahead (?= dollars) asserts that at that position in the string, what immediately follows is the characters " dollars"

Lookahead Before the Match: (?=\d+ dollars)\d+
Sample Match: 100 in 100 dollars
Explanation: The lookahead (?=\d+ dollars) asserts that at the current position in the string, what follows is digits then the characters " dollars". If the assertion succeeds, the engine matches the digits with \d+.

Note that this pattern achieves the same result as \d+(?= dollars) from above, but it is less efficient because \d+ is matched twice. A better use of looking ahead before matching characters is to validate multiple conditions in a password.

Negative Lookahead After the Match: \d+(?! dollars)
Sample Match: 100 in 100 pesos
Explanation: \d+ matches 100, then the negative lookahead (?! dollars) asserts that at that position in the string, what immediately follows is not the characters " dollars"

Negative Lookahead Before the Match: (?!\d+ dollars)\d+
Sample Match: 100 in 100 pesos
Explanation: The negative lookahead (?!\d+ dollars) asserts that at the current position in the string, what follows is not digits then the characters " dollars". If the assertion succeeds, the engine matches the digits with \d+.
Note that this pattern achieves the same result as \d+(?! dollars) from above, but it is less efficient because \d+ is matched twice. A better use of looking ahead before matching characters is to validate multiple conditions in a password.

Lookbehind Before the Match: (?<=USD)\d{3}
Sample Match: 100 in USD100
Explanation: The lookbehind (?<=USD) asserts that at the current position in the string, what precedes is the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.

Lookbehind After the Match: \d{3}(?<=USD\d{3})
Sample Match: 100 in USD100
Explanation: \d{3} matches 100, then the lookbehind (?<=USD\d{3}) asserts that at that position in the string, what immediately precedes is the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<=USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.

Negative Lookbehind Before the Match: (?<!USD)\d{3}
Sample Match: 100 in JPY100
Explanation: The negative lookbehind (?<!USD) asserts that at the current position in the string, what precedes is not the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.

Negative Lookbehind After the Match: \d{3}(?<!USD\d{3})
Explanation: \d{3} matches 100, then the negative lookbehind (?<!USD\d{3}) asserts that at that position in the string, what immediately precedes is not the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<!USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.
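To try the lookaround patterns from this section, here is a minimal sketch in Python, assuming the standard re module (which only accepts fixed-width lookbehinds). The variable names are made up for illustration. The spans show that a lookaround consumes nothing: only the digits end up in the match.

```python
import re

# Lookahead after the match: digits that are followed by " dollars".
after = re.search(r"\d+(?= dollars)", "100 dollars")

# Lookahead before the match: assert first, then consume the digits.
before = re.search(r"(?=\d+ dollars)\d+", "100 dollars")

# Negative lookahead, on the sample string used in the text.
negative = re.search(r"\d+(?! dollars)", "100 pesos")

# Lookbehind: three digits preceded by "USD".
behind = re.search(r"(?<=USD)\d{3}", "USD100")

# Negative lookbehind: three digits not preceded by "USD".
not_behind = re.search(r"(?<!USD)\d{3}", "JPY100")

# The text inspected by a lookaround is never part of the match itself.
for m in (after, before, negative, behind, not_behind):
    print(m.group(), m.span())
```

In each case the matched text is 100; the word " dollars" or the prefix "USD" is checked but stays outside the match.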

Support for Lookarounds
All major engines have some form of support for lookarounds, with some important differences. For instance, JavaScript doesn't support lookbehind, though it supports lookahead (one of the many blotches on its regex scorecard). Ruby 1.8 suffered from the same condition.

Lookbehind: Fixed-Width / Constrained Width / Infinite Width
One important difference is whether lookbehind accepts variable-width patterns.
✽ At the moment, I am aware of only three engines that allow infinite repetition within a lookbehind, as in (?<=\s*): .NET, Matthew Barnett's outstanding regex module for Python, whose features far outstrip those of the standard re module, and the JGSoft engine used by Jan Goyvaerts' software such as EditPad Pro.
✽ Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
✽ PCRE (C, PHP, R …), Java and Ruby 2+ allow lookbehinds to contain alternations that match strings of different but pre-determined lengths (such as (?<=cat|raccoon))
✽ Perl and Python require a lookbehind to match strings of a fixed length, so (?<=cat|raccoon) will not work.
To master lookarounds, there is a bit more you should really know. For these finer details, visit the lookaround page.

Non-Capturing Groups: (?: … )
In regex as in the (2+3)*(5-2) of arithmetic, parentheses are often needed to group components of an expression together. For instance, the above operation yields 15. Without the parentheses, because the * operator has higher precedence than the + and -, 2+3*5-2 is interpreted as 2+(3*5)-2, yielding… er… 15 (a happy coincidence).
In regex, normal parentheses not only group parts of a pattern, they also capture the sub-match to a capture group. This is often tremendously useful. At other times, you do not need the overhead.
In .NET, this capturing behavior of parentheses can be overridden by the (?n) flag or the RegexOptions.ExplicitCapture option. But in all flavors, .NET included, it is far more common to use (?: … ), which is the syntax for a non-capturing group. Watch out, as the syntax closely resembles that for a lookahead (?= … ).
For instance (?:Bob|Chloe) matches Bob or Chloe, but the name is not captured.
Within a non-capturing group, you can still use capture groups. For instance, (?:Bob says: (\w+)) would match Bob says: Go and capture Go in Group 1.
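The difference between grouping and capturing can be checked quickly. The following is a small sketch in Python's standard re module; the sample strings and variable names are illustrative assumptions, not taken from the original page.

```python
import re

# (?:Bob|Chloe) groups the alternation without capturing anything.
name = re.search(r"(?:Bob|Chloe)", "Hello Chloe")
print(name.group())    # the overall match is still available
print(name.groups())   # but no capture group was created

# A capture group nested inside a non-capturing group still captures.
says = re.search(r"(?:Bob says: (\w+))", "Bob says: Go")
print(says.group(1))
```

The first pattern reports an empty tuple of groups, while the second stores the word after "Bob says: " in Group 1.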

Likewise, you can capture the content of a non-capturing group by surrounding it with parentheses. For instance, ((?:Bob|Chloe)\d\d) would capture "Chloe44".

Mode Modifiers within Non-Capture Groups
On all engines that support inline modifiers such as (?i), except Python, you can blend the non-capture group syntax with mode modifiers. Here are some examples:
✽ (?i:Bob|Chloe) This non-capturing group is case-insensitive.
✽ (?ism:^BEGIN.*?END) This non-capturing group matches everything between "begin" and "end" (case-insensitive), allowing such content to span multiple lines (the s modifier), starting at the beginning of any line (the m modifier allows the ^ anchor to match the beginning of any line).
✽ (?i-sm:^BEGIN.*?END) As above, but turns off the "s" and "m" modifiers
See below for more on inline modifiers.

Atomic Groups: (?> … )
An atomic group is an expression that becomes solid as a block once the regex leaves the closing parenthesis. If the regex fails later down the string and needs to backtrack, a regular group containing a quantifier would give up characters one at a time, allowing the engine to try other matches. Likewise, if the group contained an alternation, the engine would try the next branch. An atomic group won't do that: it's all or nothing.
Example 1: With Alternation
(?>A|.B)C
This will fail against ABC, whereas (?:A|.B)C would have succeeded. After matching the A in the atomic group, the engine tries to match the C but fails. Because the group is atomic, it is unable to try the .B part of the alternation, which would also succeed, and allow the final token C to match.
Example 2: With Quantifier
(?>A+)[A-Z]C
This will fail against AAC, whereas (?:A+)[A-Z]C would have succeeded. After matching the AA in the atomic group, the engine tries to match the [A-Z], succeeds by matching the C, then tries to match the token C but fails as the end of the string has been reached. Because it is atomic, it is unable to give up the second A, which would allow the rest of the pattern to match.
If, before the atomic group, there were other options to which the engine can backtrack (such as quantifiers or alternations), then the whole atomic group can be given up in one go.

When are Atomic Groups Important?
When a series of characters only makes sense as a block, using an atomic group can prevent needless backtracking. This is explored on the section on possessive quantifiers.
In such situations atomic quantifiers can be useful, but not necessarily mission-critical. On the other hand, there are situations where atomic quantifiers can save your pattern from disaster. They are particularly useful:

✽ In order to avoid the Lazy Trap with patterns that contain lazy quantifiers whose token can eat the delimiter
✽ To avoid certain forms of the Explosive Quantifier Trap

Supported Engines, and Workaround
Atomic groups are supported in most of the major engines: .NET, Perl, PCRE and Ruby. For engines that don't support atomic grouping syntax, such as Python and JavaScript, see the well-known pseudo-atomic group workaround.

Non-Capturing
Atomic groups are non-capturing, though as with other non-capturing groups, you can place the group inside another set of parentheses to capture the group's entire match, and you can place parentheses inside the atomic group to capture a section of the match.
Watch out, as the atomic group syntax is confusingly similar to the lookbehind syntax (?<= … ).

Alternate Syntax: Possessive Quantifier
When an atomic group only contains a token with a quantifier, an alternate syntax (in engines that support it) is a possessive quantifier, where a + is added to the quantifier. For instance,
✽ (?>A+) is equivalent to A++
✽ (?>A*) is equivalent to A*+
✽ (?>A?) is equivalent to A?+
✽ (?>A{…,…}) is equivalent to A{…,…}+
This works in Perl, PCRE, Java and Ruby 2+. For more, see the possessive quantifiers section of the quantifiers page.

Named Capture: (?<foo> … ), (?P<foo> … ) and (?P=foo)
When you cut and paste a piece of a pattern, Group 3 can suddenly become Group 1. That's a problem if you were using a back-reference \3 or replacement $3. One way around this problem is named capture groups. The syntax varies across engines (see Naming Groups, and referring back to them, for the gory details). In short, in the right engines, the two capturing flavors are (?<foo> … ) and (?P<foo> … ). It's worth noting that named groups also have a number that obeys the left-to-right numbering rules, and can be referenced by their number as well as their name.
For instance,
^(?<intpart>\d+)\.(?<decpart>\d+)$
or
^(?P<intpart>\d+)\.(?P<decpart>\d+)$
would both match a string containing a decimal number such as 12.22, storing the integer portion to a group named intpart, and storing the decimal portion to a group named decpart.
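As a quick illustration of the (?P<foo> … ) flavor, here is a minimal sketch using Python's standard re module. It reuses the intpart and decpart names from the example above; the variable names pattern and m are illustrative. It also shows that a named group keeps its ordinary left-to-right number.

```python
import re

# The (?P<name> ... ) flavor of named capture, as accepted by Python's re.
pattern = re.compile(r"^(?P<intpart>\d+)\.(?P<decpart>\d+)$")

m = pattern.match("12.22")

# Access by name...
print(m.group("intpart"), m.group("decpart"))

# ...or by number: named groups are still numbered from left to right.
print(m.group(1), m.group(2))

# All named groups at once, as a dictionary.
print(m.groupdict())
```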

To create a back-reference to the intpart group in the pattern, depending on the engine, you'll use \k<intpart> or (?P=intpart). To insert the named group in a replacement string, depending on the engine, you'll either use ${intpart}, \g<intpart>, $+{intpart} or the group number \1. For the gory details, see Naming Groups, and referring back to them.

To name, or not to name?
I'll admit that I don't use named groups a whole lot, but some people love them. Sure, named captures are bulkier than a quick (capture) and reference to \1, but they can save hassles in expressions that contain many groups. Do they make your patterns easier to read? That's subjective. For my part, if the regex is short, I always prefer numbered groups. And if it is long, I would rather read a regex with numbered groups and good comments in free-spacing mode than a one-liner with named groups.

Inline Modifiers: (?isx-m)
All popular regex flavors apart from JavaScript support inline modifiers, which allow you to tell the engine, in a pattern, to change how to interpret the pattern. For instance, (?i) turns on case-insensitivity. Except in Python, (?-i) turns it off.
Modifiers can be combined: for instance, (?ix) turns on both case-insensitive and free-spacing mode. (?ix-s) does the same, but also turns off single-line (a.k.a. DOTALL) mode.
If a modifier appears at the head of the pattern, it modifies the matching mode for the whole pattern, unless it is later turned off. But (except in Python) a modifier can appear in mid-pattern, in which case it only affects the portion of the pattern that follows.

Summary of inline modifiers
✽ (?i) turns on case-insensitive mode.
✽ Except in Ruby, (?s) activates "single-line mode", a.k.a. DOTALL mode, allowing the dot to match line break characters. In Ruby, the same function is served by (?m)
✽ Except in Ruby, (?m) activates "multi-line mode", which allows the dollar $ and caret ^ assertions to match at the beginning and end of lines. In Ruby, (?m) does what (?s) does in other flavors: it activates DOTALL mode.
✽ (?x) turns on the free-spacing mode (a.k.a. whitespace mode or comment mode). This allows you to write your regex on multiple lines, like on the example on the home page, with comments preceded by a #. Warning: You will usually want to make sure that (?x) appears immediately after the quote character that starts the pattern string. For instance, if you try placing it on a newline because it would look better, the engine will try matching the newline characters before it activates free-spacing mode.
✽ In .NET, (?n) turns on "named capture only" mode, which means that regular parentheses are treated as non-capture groups.

✽ In Java, (?d) turns on "Unix lines" mode, which means that the dot and the anchors ^ and $ only care about line break characters when they are line feeds \n.

Combining Non-Capture Group with Inline Modifiers
As we saw in the section on non-capture groups, you can blend mode modifiers into the non-capture group syntax in all engines that support inline modifiers, except Python. For instance, (?i:bob) is a non-capturing group with the case-insensitive flag turned on. It matches strings such as "bob" and "boB"
But don't get carried away: you cannot blend inline modifiers with any random bit of regex syntax. For instance, the following are all illegal: (?i=bob), (?iP<name>bob) and (?i>bob)

Using Inline Modifiers in the Middle of a Pattern
Usually, you'll use your inline modifiers at the start of the regex string to set the mode for the entire pattern. However, changing modes in the middle of a pattern can be useful, so I'll give you two examples.
(\b[A-Z]+\b)(?i).*?\b\1\b
This ensures that an upper-case word is repeated somewhere in the string, in any letter-case. First we capture an upper-case word to Group 1 (for instance DOG), then we set case-insensitive mode, then .*? matches any characters up to the back-reference \1, which could be dog or dOg.
As a neat variation, (\b[A-Z]+\b).*?\b(?=[a-z]+\b)(?i)\1\b ensures that the back-reference is in lower-case.
^(\w+)\b.*\r?\n(?s).*?\b\1\b
This ensures that the first word of the string is repeated on a different line. First we capture a word to Group 1, then we get to the end of the line with .*, match a line break, then set DOTALL mode (allowing the .*? to match across lines), which brings us to our back-reference \1.

Subroutines: (?1) and (?&foo)
As you well know by now, when you create a capture group such as (\d+), you can then create a back-reference to that group (for instance \1 for Group 1) to match the very characters that were captured by the group. For instance, (\w+) \1 matches Hey Hey.
In Perl, PCRE (C, PHP, R …) and Ruby 1.9+, you can also repeat the actual pattern defined by a capture group. In Perl and PCRE, the syntax to repeat the pattern of Group 1 is (?1) (in Ruby 2+, it is \g<1>)
For instance, (\w+) (?1) will match Hey Ho. The parentheses in (\w+) not only capture Hey to Group 1, they also define Subroutine 1, whose pattern is \w+. Later, (?1) is a call to subroutine 1. The entire regex is therefore equivalent to (\w+) \w+
Subroutines can make long expressions much easier to look at and far less prone to copy-paste errors.

Relative Subroutines
Instead of referring to a subroutine by its number, you can refer to the relative position of its defining group, counting left or right from the current position in the pattern. For instance, (?-1) refers to the last defined subroutine, and (?+1) refers to the next defined subroutine. Therefore,
(\w+) (?-1)
and
(?+1) (\w+)
are both equivalent to our first example with numbered group 1. In Ruby 2+, for relative subroutine calls, you would use \g<-1> and \g<+1>.

Named Subroutines
Instead of using numbered groups, you can use named groups. In that case, in Perl and PHP the syntax for the subroutine call will be (?&group_name). In Ruby 2+ the syntax is \g<some_word>. For instance,
(?<some_word>\w+) (?&some_word)
is equivalent to our first example with numbered group 1.

Pre-Defined Subroutines
So far, when we defined our subroutines, we also matched something. For instance, (\w+) defines subroutine 1 but also immediately matches some word characters. It so happens that Perl and PCRE have terrific syntax that allows you to pre-define a subroutine without initially matching anything. This syntax is extremely useful to build large, modular expressions. We will look at it in the corresponding section: Pre-Defined Subroutines: (?(DEFINE)(?<foo> … )(?<bar> … ))

Subroutines and Recursion
If you place a subroutine such as (?1) within the very capture group to which it refers (Group 1 in this case), then you have a recursive expression. For instance, the regex ^(A(?1)?Z)$ contains a recursive sub-pattern, because the call (?1) to subroutine 1 is embedded in the parentheses that define Group 1.
If you try to trace the matching path of this regex in your mind, you will see that it matches strings like AAAZZZ, strings which start with any number of letters A and end with letters Z that perfectly balance the As. After you open the parenthesis, the A matches an A… then the optional (?1)? opens another parenthesis and tries to match an A… and so on.
We'll look at recursion syntax in the next section. There is also a page dedicated to recursion.
Warning
Note that the (?1) syntax looks confusingly similar to the (?(1) … ) found in conditionals.

Recursive Expressions: (?R) … and old friends
A recursive pattern allows you to repeat an expression within itself any number of times. This is quite handy to match patterns where some tokens on the left must be balanced by some tokens on the right.
Recursive calls are available in PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python.

Recursion of the Entire Pattern: (?R)
To repeat the entire pattern, the syntax in Perl and PCRE is (?R). In Ruby, it is \g<0>.
For instance, A(?R)?Z matches strings or substrings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end. The initial token A matches an A… Then the optional (?R)? tries to repeat the whole pattern right there, and therefore attempts the token A to match an A… and so on.

Recursion of a Subroutine: (?1) and (?-1)
You also have recursion when a subroutine calls itself. For instance, in ^(A(?1)?Z)$ subroutine 1 (defined by the outer parentheses) contains a call to itself. This regex matches entire strings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end.
As we saw in the section on subroutines, you can also call a subroutine by the relative position of its defining group at the current position in the pattern. Therefore, ^(A(?-1)?Z)$ performs exactly like the above regex.
There is much more to be said about recursion. See the page dedicated to recursive regex patterns.

Conditionals: (?(A)B) and (?(A)B|C)
This section covers the basics on conditional syntax. For more, you'll want to explore the page dedicated to regex conditionals.
In (?(A)B), condition A is evaluated. If it is true, the engine must match pattern B. In the full form (?(A)B|C), when condition A is not true, the engine must match pattern C.
Conditionals therefore allow you to inject some if(…) then {…} else {…} logic into your patterns.
Typically, condition A will be that a given capture group has been set. For instance, (?(1)}) says: If capture Group 1 has been set, match a closing curly brace. This would be useful in
^({)?\d+(?(1)})$
This pattern matches a string of digits that may or may not be embedded in curly braces. The optional capture Group 1 ({)? captures an opening brace. Later, the conditional checks if capture 1 was set, and if so it matches the closing brace.
Likewise, (?(foo)…) checks if the capture group named foo has been set.
Let's expand this example to use the "else" part of the syntax:
^(?:({)|")\d+(?(1)}|")$
This pattern matches strings of digits that are either embedded in double quotes or in curly braces. The non-capture group (?:({)|") matches the opening delimiter, capturing it to Group 1 if it is a curly brace. After matching the digits, (?(1)}|") checks whether Group 1 was set. If so, we match a closing curly brace. If not, we match a double quote.

Lookaround in Conditions
In (?(A)B), the condition you'll most frequently see is a check as to whether a capture group has been set. In .NET, PCRE and Perl (but not Python and Ruby), you can also use lookarounds:
\b(?(?<=5D:)\d{5}|\d{10})\b
If the prefix 5D: can be found, the pattern will match five digits. Otherwise, it will match ten digits.
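The group-based conditionals can be exercised with a short sketch in Python, whose standard re module supports the (?(1) … ) form but, as noted above, not lookarounds in the condition. The braces are escaped here purely as a stylistic choice of this sketch, and the names braces, delimited and the sample inputs are illustrative assumptions.

```python
import re

# "If" form: a closing brace is required only when Group 1 (the opening
# brace) was set. Braces are escaped for readability.
braces = re.compile(r"^(\{)?\d+(?(1)\})$")

# "If / else" form: close with } when Group 1 was set, otherwise with ".
delimited = re.compile(r'^(?:(\{)|")\d+(?(1)\}|")$')

brace_results = {s: bool(braces.match(s)) for s in ["123", "{123}", "{123", "123}"]}
delim_results = {s: bool(delimited.match(s)) for s in ['{42}', '"42"', '{42"', '"42}']}

print(brace_results)
print(delim_results)
```

Only the inputs whose delimiters are balanced (or absent, in the first pattern) should be reported as matches.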

Needless to say, that is not the only way to perform this task.

Checking if a relative capture group was set
(?(1)A) checks whether Group 1 was set. In PCRE, instead of hard-coding the group number, we can also check whether a group at a relative position to the current position in the pattern has been set: for instance, (?(-1)A) checks whether the previous group has been set. Likewise, (?(+1)A) checks whether the next capture group has been set. (This last scenario would be found within a larger repeating group, so that on the second pass through the pattern, the next capture group may indeed have been set on the previous pass.)

Checking if a recursion level was reached
This is not the place to be talking in depth about recursion, which has a section above and a dedicated page, but for completeness I should mention two other uses of conditionals, available in Perl and PCRE:
✽ (?(R)A) tests whether the regex engine is currently working within a recursion depth (reached from a recursive call to the whole pattern or a subroutine).
✽ (?(R1)A) tests whether the current recursion level has been reached by a recursive call to subroutine 1.

Availability of Regex Conditionals
Conditionals are available in PCRE, Perl, .NET, Python, and Ruby 2+. In other engines, the work of a conditional can usually be handled by the careful use of lookarounds.

Similar Syntax
Note that the (?(1)B) syntax can look confusingly similar to (?1) which stands for a regex subroutine, where the regex pattern defined by Group 1 must be matched.

Pre-Defined Subroutines: (?(DEFINE)(?<foo> … )(?<bar> … )) and (?&foo)
Available in Perl and PCRE (and therefore C, PHP, R…), pre-defined subroutines allow you to produce regular expressions that are beautifully modular and start to feel like clean procedural code.
Within a (?(DEFINE) … ) block, you can pre-define one or several named subroutines without matching any characters at that time. You can even pre-define subroutines based on other subroutines. When you get to the matching part of the regex, this allows you to match complex expressions with compact and readable syntax, and to match the same kind of expressions in multiple places without needing to repeat your regex code. This makes your regex more maintainable, both because it is easier to understand and because you don't need to fix a sub-pattern in multiple places.
But an example is worth a thousand words, so let's dive in.
A quick note first: in case you wonder what the \ are all about, they simply match one space character. The regex is in free-spacing mode: the x flag is implied but could be made part of the pattern using the (?x) modifier. In free-spacing mode, spaces that you do want to match must either be escaped as in \ or specified inside a character class as in [ ].

(?(DEFINE) # start DEFINE block
  # pre-define quant subroutine
  (?<quant>many|some|five)
  # pre-define adj subroutine
  (?<adj>blue|large|interesting)
  # pre-define object subroutine
  (?<object>cars|elephants|problems)
  # pre-define noun_phrase subroutine
  (?<noun_phrase>(?&quant)\ (?&adj)\ (?&object))
  # pre-define verb subroutine
  (?<verb>borrow|solve|resemble)
) # end DEFINE block
##### The regex matching starts here #####
(?&noun_phrase)\ (?&verb)\ (?&noun_phrase)

This regex would match phrases such as:
✽ five blue elephants solve many interesting problems
✽ many large problems resemble some interesting cars
Note that the portion that does the matching is extremely compact and readable: (?&noun_phrase)\ (?&verb)\ (?&noun_phrase)
The subroutine noun_phrase is called twice: there is no need to paste a large repeated regex sub-pattern, and if we decide to change the definition of noun_phrase, that immediately trickles to the two places where it is used.
Note also that noun_phrase itself is built by assembling smaller blocks: its code (?&quant)\ (?&adj)\ (?&object) uses the quant, adj and object subroutines.
With this kind of modularity, you can build regex cathedrals. There is a beautiful example on the page with the regex to match numbers in plain English.

A Note on Group Numbering
Please be mindful that each named subroutine consumes one capture group number, so if you use capture groups later in the regex, remember to count from left to right. The gory details are on the page about Capture Group Numbering & Naming.

Branch Reset: (?| … )

If you've read the page about Capture Group Numbering & Naming, you'll remember that capture groups get numbered from left to right. Therefore, if you have two sets of capturing parentheses, they have two group numbers. Sometimes, you might wish that these two sets of parentheses would capture to the same numbered group.
Perl and PCRE (and therefore C, PHP, R…) have a feature that lets you reuse a group number when capturing parentheses are present on different sides of an alternation. This is rather abstract, so let's take an example. Let's say you want to match a number, but only in three situations:
✽ If it follows an A, as in A00
✽ If it precedes a B, as in 11B
✽ If it is sandwiched between C and D, as in C22D
This poses no problem using lookahead and lookbehind, but the branch reset syntax (?| … ) gives you another, potentially more readable, option:
(?|A(\d+)|(\d+)B|C(\d+)D)
After the initial (?|, which introduces a branch reset, the group has a three-piece alternation (two |). Each of those pieces contains a capture group (\d+). The number of all of those capture groups is the same: Group 1.
You are not limited to one group. For instance, if you are also interested in capturing a potential suffix after the number (which can happen in the situations 11B and C22D), place another set of parentheses wherever you find a suffix:
(?|A(\d+)|(\d+)(B)|C(\d+)(D))
Using this regex to match the string A00 11B C22D, you obtain these groups:
Match   Group 1: Number   Group 2: Suffix
-----   ---------------   ---------------
A00     00                (not set)
11B     11                B
C22D    22                D
How Useful is Branch Reset?
When I first read about branch reset in the PCRE documentation a few years ago, I was excited and certain I'd use it often. Since then, I've written several thousand regular expression patterns, but I've used branch reset fewer than a handful of times. It's probably my fault for always jumping on other ways to do things first, but this leaves me with a sense that the feature is not all that useful after all. That being said, on rare occasions, it's just the most direct and elegant way of doing things. Let's look at one more example, less contrived than the first, which was pared down in order to explain the feature.
A Branch Reset Example: Tokenization with Variable Formats
To me, this is an example where branch reset seems to offer benefits over competing idioms. Suppose you want to parse strings such as
song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don't Worry"
into pairs of keys and values. When the value following the colon is between quotes, you only want the inside of the quotes. Therefore, you expect something like:

Group 1   Group 2
-------   -------
song      Sweet Home Alabama
fruit     apple
color     blue
motto     Don't Worry
This branch reset regex will get you there:
(\S+):(?|([^"\s]+)|"([^"]+))
Group 1 (\S+) is a straight capture group that captures the key. In the branch reset, the two sets of capturing parentheses allow you to capture different kinds of values in different formats to the same group, i.e. Group 2. You can check the group captures in the right pane of this online regex demo.
To me, this alternative with a conditional and a lookbehind…
(\S+):"?((?(?<!")[^"\s]+|[^"]+))
…feels a little less satisfying. But hey, it works too.
(direct link)
Inline Comments: (?# … )
By now you must be familiar with the free-spacing mode, which makes it possible to unroll long regexes and comment them, as in the many code boxes on this site. To turn on free-spacing for an entire pattern, the syntax varies:
✽ the (?x) modifier works in .NET, Perl, PCRE, Java, Python and Ruby.
✽ the x flag can be added after the pattern delimiter in Perl, PHP and Ruby.
✽ .NET lets you turn on the RegexOptions.IgnorePatternWhitespace option.
✽ Python lets you turn on re.VERBOSE
What if you only want to insert a single comment without turning on free-spacing mode for the entire pattern? In Perl, PCRE (and therefore C, PHP, R…), Python and Ruby, you can write an inline comment with this syntax:
(?# … )
For instance, in:
(?# the year)\d{4}
\d{4} matches four digits, while (?# the year) tells you what we are trying to match.
How useful is this? Not very. I almost never use this feature: when I want comments, I just turn on free-spacing mode for the whole regex.
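To make the commenting options from the Inline Comments section concrete, here is a minimal Python sketch using the standard re module. The sample string and the variable names are made up for the illustration; the three patterns are intended as equivalent ways of annotating \d{4}, not as a recommendation of one style.

```python
import re

# 1. Inline comment: the (?# ... ) group is ignored by the engine,
#    so no flag is needed.
inline = re.compile(r"(?# the year)\d{4}")

# 2. Free-spacing mode for the whole pattern, turned on with re.VERBOSE.
#    Whitespace is ignored and "#" starts a comment up to the end of the line.
verbose = re.compile(
    r"""
    \d{4}   # the year
    """,
    re.VERBOSE,
)

# 3. Free-spacing mode turned on with the (?x) modifier at the very start
#    of the pattern instead of a flag argument.
modifier = re.compile(r"(?x) \d{4}  # the year")

# Hypothetical sample text, only for the demonstration.
sample = "Released in 2014"
results = [pattern.search(sample).group() for pattern in (inline, verbose, modifier)]
print(results)
```

All three patterns should find the same four digits. The inline form keeps the pattern on one line, while the free-spacing forms scale better once a regex needs more than one comment.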

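Python's standard re module is not among the engines listed as supporting branch reset, and to my knowledge it rejects (?| … ). A common workaround, sketched below under that assumption, is to let each alternative keep its own group number and then pick whichever group actually took part in the match. The helper first_set and the names NUMBER and PAIR are hypothetical, introduced only for this illustration; the patterns are the two branch reset examples (the A00 11B C22D numbers and the key:value pairs) rewritten with a plain alternation.

```python
import re

# Without branch reset, the three (\d+) groups are numbered 1, 2 and 3.
NUMBER = re.compile(r"A(\d+)|(\d+)B|C(\d+)D")

# Key in group 1; an unquoted value lands in group 2, a quoted one in group 3.
PAIR = re.compile(r'(\S+):(?:([^"\s]+)|"([^"]+))')


def first_set(match, *group_numbers):
    """Return the first listed group that participated in the match, else None."""
    for number in group_numbers:
        value = match.group(number)
        if value is not None:
            return value
    return None


def numbers(text):
    """Collect the digits from every A.., ..B or C..D token."""
    return [first_set(m, 1, 2, 3) for m in NUMBER.finditer(text)]


def pairs(text):
    """Collect (key, value) tuples, dropping the quotes around quoted values."""
    return [(m.group(1), first_set(m, 2, 3)) for m in PAIR.finditer(text)]


print(numbers("A00 11B C22D"))
print(pairs('song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don\'t Worry"'))
```

The cost compared with a real branch reset is the extra coalescing step after each match; the benefit is that the pattern stays portable to engines without (?| … ).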
Ask Rex
1-7 of 7 Threads

Duncan – UK
March 12, 2014 - 02:40
Subject: Removing Confusion Around (? Regex Syntax
This topic is very well written and much appreciated. Distills large works like Friedl's book into an easily digestible quarter of an hour. I look forward to reading the rest!

xtello – France
February 19, 2014 - 08:03
Subject: RE: Your banner regex
Thanks Rex, you really made me laugh!! I see you always have the same excellent sense of humor as in your (brilliant) articles & tutorials! Thank you for this great site and for the joke :) (and for the new regex) Greetings from (the south of) France!
Xavier Tello

Reply to xtello
Rex
February 21, 2014
Subject: RE: Your banner regex
Hi Xavier, Thank you for your very kind encouragements! If only everyone could be like you. When the technology becomes available, would you mind if I get back in touch in order to clone you? Wishing you a fun weekend,
Rex

xtello – France
February 17, 2014 - 10:07
Subject: Your banner regex
I looked at the regex displayed in your banner… Applying this regex to the string [spoiler] will produce [spoiler] (if I'm not wrong!). What's this easter egg? ;-)

Reply to xtello
Rex
February 17, 2014
Subject: RE: Your banner regex
Hi Xavier, Thank you for writing, it was a treat to hear from you. Wow, you are the first person to notice! In fact, you made me change the banner to satisfy your sense of completion (and make it harder for the next guy).
> What's this easter egg?
This Easter Egg (pun intended, I presume) is that you are the grand winner of a secret contest. From the time I launched the site, I had planned that the first person to

discover this would win a free trip to the South of France. You won!!! :) :) :) Wishing you a beautiful day,
Rex

Nicolas – Brussels
August 05, 2013 - 10:09
Subject: Little question about capture
Hi Andy, Thank you for all these articles, they are amazing! I learn a lot with this website. So glad to have found it! Like they said: Best resource on internet :) I tried some of your examples, and I'm stuck with one of them:
(?:(\()|-)\d{6}(?(1)\))
When I'm trying "(111111)" with "preg_match_all", it captures "(". Do you think it's possible to bypass this capture? When I use "-222222", it catches an empty string… And I don't understand why. Could you please explain this? Thank you Andy! And again: Nice work!

Reply to Nicolas
Rex
August 05, 2013 - 18:56
Subject: RE: Little question about capture
Hi Nicolas, Run this:
$regex='~(?:(\()|-)\d{6}(?(1)\))~';
$string='(such as "(444444)"), or it is preceded by a minus sign (such as "-333333").';
preg_match_all($regex,$string,$m);
var_dump( $m );
You will see that the MATCHES are (444444) and -333333
The CAPTURES are "(" and "".
The captured left par is what makes the ?(1) work later in the regex. Let me know if this is still unclear.

Aravind P S
May 03, 2013 - 17:39
Subject: Great Work man
I enjoyed reading this article and learnt a lot. Thanks for your wonderful work. :)

Vin – Switzerland
November 28, 2012 - 21:05
Subject: Brilliant
Best resource I've found yet on regular expressions. Much appreciate the work you put into this. Why not create an eBook that could be downloaded? I for one would willingly cough up a few dollars.
Regards
Vin

Reply to Vin
Andy
December 02, 2012 - 09:03
Subject: Re: Brilliant
Hi Vin, Thank you very much for your encouragements, and also for your suggestion. I've been itching to make a print-on-demand book with the lowest price possible, to make it easy to read offline. Will probably do that as soon as they extend the length of a day to 49 hours. Wishing you a fun weekend,
Andy

Skrell
November 22, 2012 - 08:21
Subject: amazing
These articles you post on regular expressions are among the best I've found on the entire internet! No joke! Much appreciated!!!

Reply to Skrell
Andy
November 22, 2012 - 21:13
Subject: Re: amazing
Hi Skrell, thank you very much for your supportive comment. I'm glad to know that someone likes these pages! They took weeks to write and I've been surprised by how little time visitors have spent on them. To enjoy a certain presentation of technical information I guess we must be of like minds at least in some small way. :) Wishing you a fun end of the week,
-A


© Copyright RexEgg.com