Applications
Dr. Curtis Gittens
Lecture V
Regular Expressions
Understanding How to Create Regular Expressions
Introduction
Full regular expressions are composed of two types of characters.
Special characters, e.g. *, are called metacharacters
Normal text characters are called literals
Special-Meaning Characters
Punctuation characters with special meanings in regular expressions
^ $ . * + ? = ! : | \ / ( ) [ ] { }
Character Classes
A character class matches any one character that is contained within it
Combine individual characters into character classes by placing them within
square brackets
Negated character classes can also be defined
Character Classes
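Character   Matches
[...]       Any one character between the brackets.
[^...]      Any one character not between the brackets.
\w          Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W          Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s          Any Unicode whitespace character.
\S          Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.
\d          Any ASCII digit. Equivalent to [0-9].
\D          Any character other than an ASCII digit. Equivalent to [^0-9].
[\b]        A literal backspace (special case).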
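A minimal sketch of how character classes behave in JavaScript. The sample strings and the variable names are made up for illustration; the last two checks show why \w and \S are not interchangeable.

```javascript
// Character classes match exactly one character from a set.
const vowel = /[aeiou]/;      // any one of a, e, i, o, u
const notVowel = /[^aeiou]/;  // any one character except a, e, i, o, u

const classChecks = {
  vowelInCat: vowel.test("cat"),        // "a" is in the class
  vowelInSky: vowel.test("sky"),        // no character of "sky" is in the class
  nonVowelInAeiou: notVowel.test("aeiou"), // every character is excluded by the negated class
  digitInRoom7: /\d/.test("room 7"),
  underscoreIsWordChar: /\w/.test("_"),
  bangIsNonSpace: /\S/.test("!"),       // "!" is not whitespace...
  bangIsWordChar: /\w/.test("!"),       // ...but it is not a word character either
};

console.log(classChecks);
```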
Repetition
Writing out every repeated character explicitly is inefficient
Use repetition to specify an arbitrary or unknown number of characters
in a regular expression
Character   Meaning
{n,m}       Match the previous item at least n times but no more than m times.
{n,}        Match the previous item n or more times.
{n}         Match exactly n occurrences of the previous item.
?           Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.
+           Match one or more occurrences of the previous item. Equivalent to {1,}.
*           Match zero or more occurrences of the previous item. Equivalent to {0,}.
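The repetition characters can be tried out with a short sketch like the one below. The patterns and inputs are illustrative only; the ^ and $ anchors are used here simply to force each pattern to cover the whole string.

```javascript
// Each pattern is anchored so the quantifier must account for the entire input.
const exactlyThreeDigits = /^\d{3}$/;   // {n}
const twoToFourDigits = /^\d{2,4}$/;    // {n,m}
const atLeastTwoAs = /^a{2,}$/;         // {n,}
const optionalU = /^colou?r$/;          // ? makes the "u" optional

const repetitionChecks = {
  three123: exactlyThreeDigits.test("123"),
  three1234: exactlyThreeDigits.test("1234"),
  range12: twoToFourDigits.test("12"),
  range12345: twoToFourDigits.test("12345"),
  singleA: atLeastTwoAs.test("a"),
  fiveAs: atLeastTwoAs.test("aaaaa"),
  color: optionalU.test("color"),
  colour: optionalU.test("colour"),
};

console.log(repetitionChecks);
```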
Repetition
Be careful when using the * and ? repetition characters.
These characters may match zero instances of whatever precedes them
Essentially they are allowed to match nothing
Example: /a*/ matches "bbbb" because the string contains zero
occurrences of the letter a!
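The pitfall above can be seen directly in a small sketch. The variable names are illustrative; the point is that /a*/ succeeds with an empty match, while /a+/ requires at least one a.

```javascript
// /a*/ is allowed to match zero occurrences, so it "succeeds" on a string with no a.
const starMatch = "bbbb".match(/a*/);
const matchedText = starMatch[0];      // the empty string
const matchedAt = starMatch.index;     // position where the empty match was found

// /a+/ needs at least one "a", so there is no match at all.
const plusMatch = "bbbb".match(/a+/);

console.log(JSON.stringify(matchedText), matchedAt, plusMatch);
```

If at least one occurrence is required, + is usually the safer choice.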
Non-greedy
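Matches as few characters as possible instead of as many as possible
Written by following the repetition character with a question mark: ??, +?, *?
Only available from JavaScript 1.5
Example: /a+?/ returns only the first a from the string aaa as a match, whereas /a+/ matches all three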
Can be enabled in PHP by using the U pattern modifier
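A minimal sketch contrasting greedy and non-greedy repetition in JavaScript. The tag-matching input is a made-up illustration of where the difference usually matters.

```javascript
// Greedy: + takes as many characters as it can.
const greedyAs = "aaa".match(/a+/)[0];
// Non-greedy: +? takes as few characters as it can.
const lazyAs = "aaa".match(/a+?/)[0];

// The difference is easiest to see when a later part of the pattern can match in several places.
const html = "<b>bold</b>";
const greedyTag = html.match(/<.+>/)[0];  // runs to the last ">"
const lazyTag = html.match(/<.+?>/)[0];   // stops at the first ">"

console.log(greedyAs, lazyAs, greedyTag, lazyTag);
```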
Character   Meaning
|           Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)       Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.
(?:...)     Grouping only. Group items into a single unit, but do not remember the characters that match this group.
\n          Match the same characters that were matched when group number n was first matched. Groups are sub-expressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.
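Alternation, grouping, and back-references can be sketched as follows. The patterns and sample strings are illustrative assumptions, not taken from the lecture.

```javascript
// Alternation inside a non-capturing group, followed by an optional "s".
const pet = /^(?:cat|dog)s?$/;
const petChecks = [pet.test("cats"), pet.test("dog"), pet.test("bird")];

// Capturing groups are numbered by their opening parenthesis, left to right.
const range = /(\d+)-(\d+)/.exec("pages 10-20");
const rangeStart = range[1];
const rangeEnd = range[2];

// (?:...) does not take a number, so the first numbered group here is (c).
const firstNumbered = /(?:ab)+(c)/.exec("ababc")[1];

// \1 must match the same text that group 1 matched: the closing quote must equal the opening one.
const quoted = /(['"])[^'"]*\1/;
const matchingQuotes = quoted.test('say "hi" now');
const mismatchedQuotes = quoted.test("'abc\"");

console.log(petChecks, rangeStart, rangeEnd, firstNumbered, matchingQuotes, mismatchedQuotes);
```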
Matching Position
Some regular expression elements match the position between characters
instead of the characters themselves
For example, \s matches a whitespace character, whereas \b matches the
boundary between a word character and a non-word character
\B is the negation of \b
Matching Position
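Character   Meaning
^           Match the beginning of the string and, in multiline searches, the beginning of a line.
$           Match the end of the string and, in multiline searches, the end of a line.
\b          Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string. (Note, however, that [\b] matches backspace.)
\B          Match a position that is not a word boundary.
(?=p)       A positive lookahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.
(?!p)       A negative lookahead assertion. Require that the following characters do not match the pattern p.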
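The position-matching elements can be tried with a sketch like this one. The sample text and variable names are assumptions for illustration; the m flag is what turns on multiline behaviour for ^ and $ in JavaScript.

```javascript
const twoLines = "first line\nsecond line";

// Without the m flag, ^ and $ only match at the start and end of the whole string.
const caretSingle = /^second/.test(twoLines);
const dollarSingle = /first line$/.test(twoLines);
// With the m flag, they also match at the start and end of each line.
const caretMulti = /^second/m.test(twoLines);
const dollarMulti = /first line$/m.test(twoLines);

// \b requires a word boundary on each side; \B requires that there is none.
const catAsWord = /\bcat\b/.test("the cat sat");
const catInsideWord = /\bcat\b/.test("concatenate");
const catNotAtBoundary = /\Bcat\B/.test("concatenate");

// Positive lookahead: "Script" must follow, but it is not part of the match.
const javaBeforeScript = /Java(?=Script)/.exec("JavaScript")[0];

console.log(caretSingle, dollarSingle, caretMulti, dollarMulti);
console.log(catAsWord, catInsideWord, catNotAtBoundary, javaBeforeScript);
```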
Assertions
Two types of assertions:
Lookahead and lookbehind, collectively called "lookaround"
Simple zero-length assertions such as start of line (^), end of line ($), and \b
Key difference: lookaround actually matches characters, but then
gives up the match, returning only the result: match or no match.
That is why they are called "assertions".
They do not consume characters in the string
They only assert whether a match is possible or not
Assertions
Use of negative lookahead
Allows you to match something not followed by something else.
Example: How do you match a q not followed by a u?
q(?!u)
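A small sketch of the q(?!u) example in JavaScript. The test words are chosen for illustration; note that the match contains only the q, because the lookahead does not consume the character it inspects.

```javascript
const qNotU = /q(?!u)/;

const inQuit = qNotU.exec("quit");    // q is followed by u, so the assertion fails
const inIraqi = qNotU.exec("Iraqi");  // q is followed by i
const inIraq = qNotU.exec("Iraq");    // q is at the end of the string, so no u follows

console.log(inQuit, inIraqi && inIraqi[0], inIraq && inIraq.index);
```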
Assertions
You can use any regular expression inside the lookahead
In many regex flavors this does not apply to lookbehind, which is restricted to fixed-length patterns
Assertions
Example:
A password must meet four conditions:
1.
2.
3.
4.
Assertions
Lookbehind has the same effect, but works backwards.
It tells the regex engine to temporarily step backwards in the string,
to check if the text inside the lookbehind can be matched there.
Example: (?<!a)b matches a "b" that is not preceded by an "a", using
negative lookbehind.
It doesn't match cab, but matches the b (and only the b) in bed or
debt.
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab,
but does not match bed or debt.
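The lookbehind examples above can be checked with the sketch below. It assumes a JavaScript engine with lookbehind support (added to the language in ES2018); older engines reject these patterns as syntax errors.

```javascript
const bNotAfterA = /(?<!a)b/;  // negative lookbehind
const bAfterA = /(?<=a)b/;     // positive lookbehind

const negative = {
  cab: bNotAfterA.exec("cab"),    // the only b is preceded by a
  bed: bNotAfterA.exec("bed"),
  debt: bNotAfterA.exec("debt"),
};
const positive = {
  cab: bAfterA.exec("cab"),
  bed: bAfterA.exec("bed"),
  debt: bAfterA.exec("debt"),
};

console.log(negative, positive);
```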
Questions:
Would A(?=5) match the A in the string AB25?
Would A(?=5)(?=[A-Z]) match the A in the string A5B?
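One way to check your answers is to run the two patterns and inspect the results. Keep in mind that a lookahead does not consume characters, so two lookaheads in a row are both tested at the same position. The variable names are illustrative.

```javascript
// Question 1: A followed by a lookahead for 5, tested against AB25.
const questionOne = /A(?=5)/.test("AB25");

// Question 2: two consecutive lookaheads, both evaluated right after the A in A5B.
const questionTwo = /A(?=5)(?=[A-Z])/.test("A5B");

console.log("Question 1 matches:", questionOne);
console.log("Question 2 matches:", questionTwo);
```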
Character   Meaning
g           Perform a global match; that is, find all matches rather than stopping after the first match.
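A minimal sketch of the effect of a global match in JavaScript, where the flag is written as g after the closing slash. The sample word is illustrative.

```javascript
// Without g, match() stops after the first match and reports its position.
const firstOnly = "banana".match(/an/);
// With g, match() returns every match in the string.
const allMatches = "banana".match(/an/g);

console.log(firstOnly[0], firstOnly.index, allMatches);
```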