
COMP3170 Web Based Applications
Dr. Curtis Gittens
Lecture V

Regular Expressions
Understanding How to Create Regular Expressions

Introduction
Full regular expressions are composed of two types of characters:
The special characters, e.g. *, are called metacharacters
Normal text characters are called literals

Different from filename patterns
The expressive power of the metacharacters makes the difference
Filename patterns provide limited metacharacters for limited needs
The regular expression language provides rich and expressive metacharacters for advanced uses

Special-Meaning Characters
Punctuation characters with special meanings in regular expressions:
^ $ . * + ? = ! : | \ / ( ) [ ] { }

Some of these characters have special meaning only within certain contexts of a regular expression
They are treated as literals in other contexts
As a general rule, to match one of these characters literally, escape it by preceding it with a backslash \
Quotation marks and the @ character are treated as literals
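
The escaping rule can be illustrated with a minimal JavaScript sketch. The patterns and the name escapeResults are made up for illustration; they are not from the lecture.

```javascript
// + is a metacharacter ("one or more"), so it must be escaped
// to match a literal plus sign.
const unescaped = /1+1=2/; // one or more "1", then the text "1=2"
const escaped = /1\+1=2/;  // the literal text "1+1=2"

const escapeResults = {
  unescapedOnSum: unescaped.test("1+1=2"), // false: the text has no "11=2"
  unescapedOnRun: unescaped.test("11=2"),  // true
  escapedOnSum: escaped.test("1+1=2"),     // true
  escapedOnRun: escaped.test("11=2"),      // false
  quoteAndAt: /"@"/.test('say "@" now'),   // true: " and @ need no escaping
};
console.log(escapeResults);
```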

Character Classes
A character class matches any one character that is contained within it
Combine individual characters into character classes by placing them within square brackets, e.g. [abc]
Negated character classes can also be defined, e.g. [^abc]

Common character classes are represented by special characters and escape sequences
E.g. \s and \S
In JavaScript, \w and \d (and their negations \W and \D) match only ASCII characters
They have not been extended to work with Unicode characters
Unicode character classes can be defined explicitly, e.g. [\u0400-\u04FF] for Cyrillic
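
A short JavaScript sketch of bracketed, negated and explicit Unicode classes. The sample strings and variable names are assumptions chosen for illustration.

```javascript
const vowel = /[aeiou]/;            // any one of the listed characters
const notDigit = /[^0-9]/;          // any one character that is NOT a digit
const cyrillic = /[\u0400-\u04FF]/; // explicit Unicode range (Cyrillic block)

const classResults = {
  vowelInSea: vowel.test("sea"),         // true
  vowelInSky: vowel.test("sky"),         // false
  notDigitIn2024: notDigit.test("2024"), // false: every character is a digit
  notDigitIn20a4: notDigit.test("20a4"), // true: "a" is not a digit
  cyrillicHit: cyrillic.test("\u041F\u0440\u0438\u0432\u0435\u0442"), // true (Russian "Privet")
  cyrillicMiss: cyrillic.test("Hello"),  // false
};
console.log(classResults);
```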

Character Classes
Character   Matches
[...]       Any one character between the brackets.
[^...]      Any one character not between the brackets.
.           Any character except newline or another Unicode line terminator.
\w          Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W          Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s          Any Unicode whitespace character.
\S          Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.
\d          Any ASCII digit. Equivalent to [0-9].
\D          Any character other than an ASCII digit. Equivalent to [^0-9].
[\b]        A literal backspace (special case).
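
The difference between \w and \S, and a few other rows of the table, can be checked with a small JavaScript sketch. It assumes default flags (no u or s flag); the test characters are illustrative.

```javascript
const shorthandResults = {
  wordOnLetter: /\w/.test("e"),          // true
  wordOnAccent: /\w/.test("\u00E9"),     // false: e-acute is outside [a-zA-Z0-9_]
  nonSpaceOnAccent: /\S/.test("\u00E9"), // true: it is not whitespace
  wordOnHyphen: /\w/.test("-"),          // false
  nonSpaceOnHyphen: /\S/.test("-"),      // true
  dotOnNewline: /./.test("\n"),          // false: . skips line terminators
  spaceOnNbsp: /\s/.test("\u00A0"),      // true: no-break space is Unicode whitespace
  backspace: /[\b]/.test("\b"),          // true: [\b] is a literal backspace
};
console.log(shorthandResults);
```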

Can You Do This?
Construct regular expressions for the following:
1. Accepts a string that starts with an uppercase "A" followed by anything except "x", "y" or "z".
2. Accepts any string that contains an "a" or "b" followed by any 2 characters followed by an "a" or a "b". The strings "axxb", "alfa" and "blka" match, and "ab" does not.
3. Accepts any 5-digit integer.

Repetition
Writing out every repeated character explicitly is impractical
Use repetition to specify an arbitrary or unknown number of characters in a regular expression

Character   Meaning
{n,m}       Match the previous item at least n times but no more than m times.
{n,}        Match the previous item n or more times.
{n}         Match exactly n occurrences of the previous item.
?           Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.
+           Match one or more occurrences of the previous item. Equivalent to {1,}.
*           Match zero or more occurrences of the previous item. Equivalent to {0,}.
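
Each repetition form in the table can be tried out in JavaScript. This is a sketch with made-up sample text; the names digits and repetitionResults are illustrative.

```javascript
const digits = "id 1234567";

const repetitionResults = {
  twoToFour: digits.match(/\d{2,4}/)[0],  // "1234": at most four digits
  threeOrMore: digits.match(/\d{3,}/)[0], // "1234567": no upper limit
  exactlyThree: digits.match(/\d{3}/)[0], // "123"
  optionalU: [/colou?r/.test("color"), /colou?r/.test("colour")], // [true, true]
  onePlus: [/go+gle/.test("google"), /go+gle/.test("ggle")],      // [true, false]
  zeroPlus: /go*gle/.test("ggle"),                                // true
};
console.log(repetitionResults);
```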

Can You Do This?
Construct regular expressions for the following:
1. Matches a valid PHP variable name using English letters. A PHP variable starts with a dollar sign $ followed by letters, digits or underscores, but the name cannot start with a digit. E.g. $1var_a is incorrect; $var1_a and $_var1_a are correct.
2. Matches an HTML anchor tag, e.g. <a href=blah>

Repetition
Be careful when using the * and ? repetition characters.
These characters may match zero instances of whatever precedes them
Essentially they are allowed to match nothing
Example: /a*/ matches "bbbb" because the string contains zero
occurrences of the letter a!
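
The zero-occurrence case above can be observed directly in JavaScript. In this sketch the match succeeds but the matched text is empty; the variable names are illustrative.

```javascript
const starMatch = "bbbb".match(/a*/);

const zeroMatchResults = {
  starSucceeds: /a*/.test("bbbb"), // true: zero a's is an acceptable match
  starText: starMatch[0],          // "": the match is the empty string
  starIndex: starMatch.index,      // 0: found at the very start
  plusSucceeds: /a+/.test("bbbb"), // false: + needs at least one a
};
console.log(zeroMatchResults);
```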

Repetition is usually greedy. To use non-greedy repetition, add a question mark after the repetition character:
??, +?, *?, {n,m}?


Greedy vs. Non-Greedy (Lazy)


Greedy
Match as many times as possible
Example: /a+/ returns the entire string aaa as a match
Has always been available in JavaScript
PHP PCRE quantifiers are greedy by default

Non-greedy
Match as few times as possible
Only available from JavaScript 1.5
Example: /a+?/ returns only the first a from the string aaa as a match
Can be made the default in PHP by using the U pattern modifier, which inverts the greediness of quantifiers
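
A JavaScript sketch contrasting greedy and non-greedy repetition. The HTML snippet is an assumed example, not taken from the lecture.

```javascript
const html = "<b>one</b> and <b>two</b>";

const greedyResults = {
  greedyRun: "aaa".match(/a+/)[0],        // "aaa": as many as possible
  lazyRun: "aaa".match(/a+?/)[0],         // "a": as few as possible
  greedyTag: html.match(/<b>.*<\/b>/)[0], // runs to the LAST </b>
  lazyTag: html.match(/<b>.*?<\/b>/)[0],  // stops at the FIRST </b>
};
console.log(greedyResults);
```

The tag example shows why the distinction matters in practice: the greedy pattern swallows both bold elements, while the lazy one returns only the first.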

Alternation, Grouping and References


Specifying Alternatives
The | character separates alternatives
Example: /ab|12|a1/ matches the string "ab" or the string "12" or the string "a1"
Alternatives are considered left to right until a match is found
If the left alternative matches, the right alternative is ignored, even if it would have produced a "better" match
Example: /ab|abc/ matches only ab in the string abc, even though abc would be the better match
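
The left-to-right rule can be verified with a short JavaScript sketch. The name alternationResults is illustrative.

```javascript
const alternationResults = {
  shortFirst: "abc".match(/ab|abc/)[0], // "ab": the left alternative wins
  longFirst: "abc".match(/abc|ab/)[0],  // "abc": reordering changes the result
  // Which of these strings contain "ab", "12" or "a1"?
  any: ["ab", "12", "a1", "a2"].map((s) => /ab|12|a1/.test(s)), // [true, true, true, false]
};
console.log(alternationResults);
```

Putting the longer alternative first is a common way to get the longer match.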


Alternation, Grouping and References


Defining Sub-expressions
Use parentheses to group separate items into a sub-expression
The grouped items are treated as a single unit by other regular expression operators like |, *, +, ?, etc.
Example: What does this regular expression match: /java(script)?/

Sub-expressions can be referenced as sub-patterns by using references
References are numbered by counting left parentheses from left to right
Example: What does reference \2 refer to in the regular expression below:
/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/
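
One way to explore the two examples is to run them in JavaScript and inspect the captured groups. The sample sentences are assumptions chosen for illustration.

```javascript
const optional = /java(script)?/;
const nested = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;

const withScript = nested.exec("JavaScript is funny");
const withoutScript = nested.exec("java is fun");

const groupResults = {
  optionalShort: optional.test("java"),          // true: (script)? may be absent
  optionalLong: "javascript".match(optional)[0], // "javascript"
  group1: withScript[1],           // "JavaScript": first left parenthesis
  group2: withScript[2],           // "Script": second left parenthesis (nested)
  group3: withScript[3],           // "funny": third left parenthesis
  group2Absent: withoutScript[2],  // undefined: the optional group did not take part
};
console.log(groupResults);
```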


Alternation, Grouping and References


More About References
A reference to a previous sub-expression of a regular expression does
not refer to the pattern for that sub-expression but to the text that
matched the pattern
They can be used to enforce constraints that ensure different portions of
a string contain exactly the same characters
Example: What does this regular expression do?: /(['"])[^'"]*\1/
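
Running the example pattern in JavaScript shows the constraint at work: \1 must repeat the exact quote character that group 1 matched. The test strings are illustrative.

```javascript
const quoted = /(['"])[^'"]*\1/;

const quoteResults = {
  doubleQuoted: quoted.test('"hello"'), // true: opens and closes with "
  singleQuoted: quoted.test("'hello'"), // true: opens and closes with '
  mismatched: quoted.test("\"hello'"),  // false: opens with " but closes with '
};
console.log(quoteResults);
```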


Alternation, Grouping and References


Character   Meaning
|           Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)       Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.
(?:...)     Grouping only. Group items into a single unit, but do not remember the characters that match this group.
\n          Match the same characters that were matched when group number n was first matched. Groups are sub-expressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.
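
The numbering difference between (...) and (?:...) can be seen in a small JavaScript sketch. The pattern and input are made up for illustration.

```javascript
const capturing = /(ab)+(c)/.exec("ababc");
const nonCapturing = /(?:ab)+(c)/.exec("ababc");

const numberingResults = {
  sameOverallMatch: capturing[0] === nonCapturing[0], // true: both match "ababc"
  capturingGroups: capturing.length - 1,        // 2 numbered groups
  capturingFirst: capturing[1],                 // "ab": last repetition of group 1
  nonCapturingGroups: nonCapturing.length - 1,  // 1 numbered group
  nonCapturingFirst: nonCapturing[1],           // "c": (c) is now group 1
};
console.log(numberingResults);
```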

Can You Do This?


Create a regular expression that:
1. Accepts only numbers that are greater than 100
2. Accepts any word (a word is defined as a sequence of alphanumeric characters with no whitespace) that contains a double letter, for example "book" has a double "o" and "feed" has a double "e".
3. Accepts any string that contains an HTML tag and its corresponding end tag. The following should match: <H1>Big World</H1> and so should <TITLE>No class today</TITLE>, but this should not match: <TITLE>Not right</H2>.


Matching Position
Some regular expression elements match the position between characters instead of the characters themselves
E.g. \s matches a whitespace character, whereas \b matches the boundary between a word character and a non-word character
\B is the negation of \b

These elements do not specify any characters to be used in a matched string; they specify the position at which the match is to occur
They are also called anchors because they anchor a pattern to a specific position
Examples: /^JavaScript$/
/\B[Ss]cript/
Question: How do you find the word exam by itself?
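
A JavaScript sketch of the two example patterns, plus one possible answer to the question using \b on both sides. The pattern /\bexam\b/ and the sample strings are suggestions, not necessarily the answer intended in the lecture.

```javascript
const wholeString = /^JavaScript$/;
const insideWord = /\B[Ss]cript/;
const examAlone = /\bexam\b/; // assumed answer: word boundary on each side

const anchorResults = {
  exact: wholeString.test("JavaScript"),           // true
  withExtra: wholeString.test("JavaScript rocks"), // false: $ requires the end here
  scriptInside: insideWord.test("JavaScript"),     // true: "Script" follows a word character
  scriptAtStart: insideWord.test("Script"),        // false: start of string is a boundary
  scriptAfterSpace: insideWord.test("my script"),  // false: space then "s" is a boundary
  examWord: examAlone.test("the exam is hard"),    // true
  examPrefix: examAlone.test("an example"),        // false: "exam" is followed by "p"
};
console.log(anchorResults);
```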

Matching Position
Character   Meaning
^           Match the beginning of the string and, in multiline searches, the beginning of a line.
$           Match the end of the string and, in multiline searches, the end of a line.
\b          Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string. (Note, however, that [\b] matches backspace.)
\B          Match a position that is not a word boundary.
(?=p)       A positive lookahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.
(?!p)       A negative lookahead assertion. Require that the following characters do not match the pattern p.

Assertions
Two types of assertions:
Lookahead and lookbehind, collectively called "lookaround"
Zero-length assertions like start of line (^), end of line ($) and \b
Key difference: lookaround actually matches characters, but then gives up the match, returning only the result: match or no match
That is why they are called "assertions"
They do not consume characters in the string
They only assert whether a match is possible or not

Assertions
Use of negative lookahead
Allows you to match something not followed by something else.
Example: How do you match a q not followed by a u?
q(?!u)

Use of positive lookahead


Allows you to match something followed by something else.
Example: Matching q followed by u.
q(?=u)
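
Both lookahead examples can be run in JavaScript. Note that the matched text is only the q; the u is checked but never included. The words "Iraq" and "quit" are illustrative test inputs.

```javascript
const notBeforeU = /q(?!u)/;
const beforeU = /q(?=u)/;

const lookaheadResults = {
  negativeHit: "Iraq".match(notBeforeU).index, // 3: this q is not followed by u
  negativeMiss: notBeforeU.test("quit"),       // false: q is followed by u
  positiveText: "quit".match(beforeU)[0],      // "q": the u is not part of the match
  positiveMiss: beforeU.test("Iraq"),          // false
};
console.log(lookaheadResults);
```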

Assertions
Any valid regular expression can be used inside the lookahead
This does not apply to lookbehind, which many regex flavors restrict
Capturing groups inside the lookahead capture as normal
Backreferences to them work normally, even outside the lookahead

The lookahead itself is not a capturing group
It is not included in the count when numbering backreferences
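
A JavaScript sketch of a capturing group placed inside a lookahead. The patterns are made up for illustration: the first captures digits without consuming them, the second uses the captured digit in a backreference outside the lookahead.

```javascript
// Group 1 sits inside the lookahead; the lookahead itself gets no number.
const peek = /(?=(\d+))\w+/.exec("123abc");

// Capture the next digit without consuming it, consume it with \d,
// then require the same digit again via \1 (i.e. find a doubled digit).
const doubled = /(?=(\d))\d\1/.exec("1223");

const captureResults = {
  peekMatch: peek[0],          // "123abc": \w+ consumed the text
  peekGroup: peek[1],          // "123": captured by the lookahead's group
  peekGroupCount: peek.length - 1, // 1: only the inner group is numbered
  doubledMatch: doubled[0],    // "22"
  doubledIndex: doubled.index, // 1
};
console.log(captureResults);
```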

Assertions
Example:
A password must meet four conditions:
1. It must have between six and ten word characters \w
2. It must include at least one lowercase character [a-z]
3. It must include at least three uppercase characters [A-Z]
4. It must include at least one digit \d

Use lookaheads to create a regular expression that will meet these requirements
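
One possible answer, sketched in JavaScript: each condition becomes its own lookahead anchored at the start of the string. This is an assumed solution, not necessarily the one intended in the lecture, and the sample passwords are invented.

```javascript
const password = new RegExp(
  "^" +
  "(?=\\w{6,10}$)" +      // 1. six to ten word characters in total
  "(?=.*[a-z])" +         // 2. at least one lowercase letter
  "(?=(?:.*[A-Z]){3})" +  // 3. at least three uppercase letters
  "(?=.*\\d)" +           // 4. at least one digit
  "\\w+$"                 // finally consume the (already validated) characters
);

const passwordResults = {
  valid: password.test("ABCde1"),            // true
  twoUppercase: password.test("ABcde1"),     // false: only two uppercase letters
  tooShort: password.test("ABCd1"),          // false: five characters
  noDigit: password.test("ABCdef"),          // false
  tooLong: password.test("ABCDEFGHIJ1k"),    // false: twelve characters
  noLowercase: password.test("ABCDE1"),      // false
};
console.log(passwordResults);
```

Because lookaheads do not consume characters, all four checks start from the same position, which is what lets independent conditions be combined in one pattern.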

Assertions
Lookbehind has the same effect, but works backwards.
It tells the regex engine to temporarily step backwards in the string,
to check if the text inside the lookbehind can be matched there.
Example: (?<!a)b matches a "b" that is not preceded by an "a", using
negative lookbehind.
It doesn't match cab, but matches the b (and only the b) in bed or
debt.
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab,
but does not match bed or debt.
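
The lookbehind examples can be reproduced in JavaScript, assuming an engine that supports lookbehind (ECMAScript 2018 or later, e.g. a current Node.js); older engines reject the syntax.

```javascript
const notAfterA = /(?<!a)b/; // negative lookbehind
const afterA = /(?<=a)b/;    // positive lookbehind

const lookbehindResults = {
  negCab: notAfterA.test("cab"),         // false: the only b follows an a
  negBed: "bed".match(notAfterA).index,  // 0
  negDebt: "debt".match(notAfterA).index, // 2
  posCabText: "cab".match(afterA)[0],    // "b": only the b is matched
  posCabIndex: "cab".match(afterA).index, // 2
  posBed: afterA.test("bed"),            // false
  posDebt: afterA.test("debt"),          // false
};
console.log(lookbehindResults);
```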

Assertions: Lookaround Caveats


Lookarounds do not move the engine or consume characters
They look immediately to the left or right of the engine's current position in the string
They do not alter that position

Questions:
Would A(?=5) match the A in the string AB25?
Would A(?=5)(?=[A-Z]) match the A in the string A5B?
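
The two questions can be checked by running the patterns in JavaScript. The third pattern is an added contrast (not from the lecture) that consumes the 5 before looking ahead again.

```javascript
const caveatResults = {
  // The lookahead inspects the character right after A, which is B, not 5.
  firstQuestion: /A(?=5)/.test("AB25"),           // false
  // Both lookaheads test the SAME position; "5" is not in [A-Z].
  secondQuestion: /A(?=5)(?=[A-Z])/.test("A5B"),  // false
  // Consuming the 5 moves the position, so the second lookahead sees B.
  consumeThenLook: /A(?=5)5(?=[A-Z])/.test("A5B"), // true
  matchedText: "A5B".match(/A(?=5)/)[0],           // "A": the 5 is not consumed
};
console.log(caveatResults);
```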

Regular Expression Flags


Character   Meaning
i           Perform case-insensitive matching.
g           Perform a global match, that is, find all matches rather than stopping after the first match.
m           Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.
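
The effect of each flag can be seen in a short JavaScript sketch. The sample strings and the name flagResults are illustrative.

```javascript
const lines = "one\ntwo";

const flagResults = {
  caseSensitive: /javascript/.test("JavaScript"),    // false
  caseInsensitive: /javascript/i.test("JavaScript"), // true with the i flag
  firstOnly: "a1b2c3".match(/\d/)[0],                // "1": stops at the first match
  allMatches: "a1b2c3".match(/\d/g),                 // ["1", "2", "3"] with the g flag
  singleLine: lines.match(/^\w+/g),                  // ["one"]: ^ is start of string only
  multiLine: lines.match(/^\w+/gm),                  // ["one", "two"]: ^ also matches after \n
};
console.log(flagResults);
```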
