You are on page 1of 11

Regular Expressions

Ashley J.S Mills

<ashley@ashleymills.com>
Copyright © 2005 The University Of Birmingham

Table of Contents
1. Introduction ................................................................................................................................................. 1
2. Basics ......................................................................................................................................................... 1
2.1. Single Character ................................................................................................................................. 1
2.2. Any Character: . ................................................................................................................................. 1
2.3. The Escape Character: \ ....................................................................................................................... 1
2.4. The Caret: ^ ...................................................................................................................................... 2
2.5. The Dollar Symbol: $ .......................................................................................................................... 2
2.6. The Kleene star: * .............................................................................................................................. 2
2.7. The Kleene plus: + ............................................................................................................................. 2
2.8. Ranges: [ ], [cn-cm] and [^cn-cm] ......................................................................................................... 2
2.9. Grouping: \( \) ................................................................................................................................... 2
2.10. Alternatives: | .................................................................................................................................. 2
2.11. Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\} ................................................................................................. 2
3. Grep Examples ............................................................................................................................................. 3
4. java.util.regex, Java 1.4 .................................................................................................................................. 5
5. Emacs Regular Expressions ............................................................................................................................ 8
6. References ................................................................................................................................................... 10

1. Introduction
Pattern matching is an important topic in Computer Science, it is the process of matching defined patterns to information. Humans
use pattern matching everyday to recognise objects and faces, computers use pattern matching everyday to perform the most basic
of operations, when you execute a command at the command line, some kind of pattern matching is being employed to determine
what your command is asking the computer to do, pattern matching is used in compilers and programming languages.
Regular Expressions are a particular kind of pattern matching located in the Regular Language subclass of pattern matching lan-
guages. They are considered the least complex of the pattern matching languages but are very useful. You have probably used reg-
ular expressions before, for instance if you have specified that you want to delete *.* at the command line, referring to any base-
name followed by a dot followed by any extension, then you have used the concepts of regular expressions at least once. Most of
you will be aware that the * character, known as a Kleene star or asterisk, means "match anything" and indeed it is used in a very
similar manner in the regular expressions we are about to discuss.
There are many programs out there that have some kind of builtin regular expression handling capabilities. The thing is, they all
seem to have slight syntactical variation, fortunately the concepts are identical in each case and the differences are often marginal,
this text will describe the most common components of a regular expression and will present program specific examples where ap-
propriate.

2. Basics
Regular expressions consist of literal characters and meta characters, literal characters are the actual characters you want to find,
meta characters are special characters, like the Kleene star, and are the core concept behind regular expressions hence we will be-
gin this section with a brief introduction to the most common meta characters.

2.1. Single Character


A single character such as Q is a regular expression, it is the regular expression that matches every string that contains the charac-
ter Q, so it would match Quick, Quiet and Quantum but not quick.

2.2. Any Character: .


The period, or full-stop as we call it in Britain, is used to signify that any character may be replaced by it in the search, it matches
any character. For example, ".t.m would match atom, item and stem and probably some other words too. A fun example of using
this character can be found at http://www.oneacross.com/ where it is used to help people find words for their crosswords, they
also use the the character ?' as an alternative.

2.3. The Escape Character: \


1
Regular Expressions
\ is used to signify that we want to use a meta character as a literal character, this is necessary otherwise the character in question
would be interpreted as meta-data, the character that the is being escaped is the character immediately following the escape char-
acter. For example, "\*" would match the string containing the character that has been escaped, that is, it would match the string
(or any string containing) *.
The converse can also be true, sometimes \ is used to signify that we want to use a literal character as a meta character, for exam-
ple, within a double quoted string in an implementation that requires that meta characters are escaped. You should read the docu-
mentation of the particular regular expression implementation you are using to find out which approach your implementation
takes.

2.4. The Caret: ^


^, known as a caret, is used to match the beginning of a line, so "^CAPITAL" would match "CAPITAL's signify emphasised
speech, anger or SHOUTING", it would not match "Your such a CAPITAL idiot!".

2.5. The Dollar Symbol: $


$ is used to match the end of a line, so "here$" would match "I like it here" but would not match "here is a potato".

2.6. The Kleene star: *


* is used to match zero or more occurrences of the regular expression immediately preceding the meta character. "10*" would
match "1", "10", "100", "1000" and so on.

2.7. The Kleene plus: +


+ is used to match one or more occurrences of the regular expression immediately preceding the meta character. "10+" would
match "10", "100", "1000" and so on but would not match "1".

Note
(regular expression)+ is the same as (regular expression)(regular expression)*.

2.8. Ranges: [ ], [cn-cm] and [^cn-cm]


[ ] is used to signify that any of the characters or expressions enclosed within them may be matched. 1[123]512 would match
"11512", "12512" and "13512".
[cn-cm] is used to specify a range of characters (inclusively) that may be matched at this point in the regular expression. ";[b-f]oo"
would match "boo", "coo", "doo", "eoo" and "foo" but not "goo".
[^cn-cm] is used to exclude a range of characters from a match, notice that the caret has been used again, when it is used immedi-
ately after an opening [ it has this special meaning, if you want to exclude the caret then you would escape it: "[^\^]. "[^1-8]00"
would match "900" but not any of the other three digit hundreds such as "500".

2.9. Grouping: \( \)
\( \) is used to treat regular expression contained within the (escaped in this case) brackets as a group, this group can then be back
referenced later like \1 to refer to the first group defined. How this is implemented in various programs that use regular expres-
sions varies, some tools do not require you to escape the brackets, some use different conventions to back reference defined
groups. For instance a program may use "$1" to refer to the first bracketed group instead of "\1". There may also be limits on the
number of groups that can be referenced in this way, sometimes it is a maximum of nine. In the program grep "\(a\)b\1" would
match "aba".

2.10. Alternatives: |
| is used to delimit the OR operator, in this case the operands are the regular expressions either side of it, signifying that if either
the first expression OR the second expression matches, then the whole expression will match. For example "^aba\|b$" will match
the lines "aba", "abb" but not "abc". The | meta character may or may not need to be escaped depending on the program.

2.11. Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\}


\{n\} is used to specify that the regular expression immediately preceding must be matched n times exactly. "^10\{3\}$" will match
the line "1000" but not "100" or "10000".
\{,n\} is used to specify that the regular expression immediately preceding may be matched up to a maximum of n times.
"^10\{,3\}$" will match the lines "1", "10", "100" and "1000" but will not match "10000".
\{n,\} is used to specify that the regular expression immediately preceding must be matched at least n times. "^10\{3,\}$" will
match the lines "1000", "10000", "100000" and so on but will not match "100".

Note
This is an alternative to using the Kleene star and the Kleene plus, they may not be supported in your implementation.
"a\{0,}\" is the same as "a*" and "a\{1,}\" is the same as "a+".
\{n,m\} is used to specify that the regular expression immediately preceding must be matched at least n times but may not exceed
2
Regular Expressions
m matches. "^10\{3,4\}$" will match the lines "1000" and "10000" but not "100" or "100000". The necessity to escape the charac-
ters may vary. Not all programs support all the types of repetition described.

3. Grep Examples
Grep is a tool used to search text using regular expressions, its origins highlight its function, according to
http://www.faqs.org/faqs/usenet/faq/part1/section-21.html [http://www.faqs.org/faqs/usenet/faq/part1/section-21.html:] its origins
are as follows:
The original UNIX text editor "ed" has a construct g/re/p, where "re" stands for a regular expression, to Globally
search for matches to the Regular Expression and Print the lines containing them. This was so often used that it
was packaged up into its own command, thus named "grep". According to Dennis Ritchie, this is the true origin
of the command.

I will present a few examples, of which the first two are based on the following text file, mb.txt:

NAME MAKE HP YEAR PRICE


NSR250R Honda 60 1993 £1340
NSR250R-SP Honda 65 1994 £2000
KR1S Kawasaki 60 1989 £1250
GSX250 Suzuki 26 1981 £300
GS250T Suzuki 26 1982 £250
RGV250 Suzuki 60 1993 £1400
RGV250-SP Suzuki 65 1994 £2400

The examples will use the -E option which specifies that grep should expect syntax in the form of an extended regular expression.

grep -e Honda mb.txt

Lists all the lines that contain the text string "Honda":

NSR250R Honda 60 1993 £1340


NSR250R-SP Honda 65 1994 £2000

grep -e "6. 1" mb.txt | grep -v

Output:

KR1S Kawasaki 60 1989 £1250


RGV250 Suzuki 60 1993 £1400
RGV250-SP Suzuki 65 1994 £2400

First lists all bikes that are sixty something BHP and then pipes this to another instance of grep which excludes all the lines con-
taining "Honda" with the -v

Note
Quotes are used to preserve whitespace and are used whenever '\' is used since this is also special within the shell so
needs to be hidden from the shell

ls -l | grep -e "Aristotle\.txt"

Output:

Aristotle.txt

Pipes the output from a directory listing to grep which searches filters the lines containing "Aristotle.txt", note the use of the es-
cape character '.' to literally match '\'

grep -e "\(101\)\1" in.file

Matches the string "101101", notice that "101" is first grouped by enclosing within an escaped opening parentheses "\(" and an es-
caped closing parentheses "\)". The first group is then referenced with "\1". Something like:

grep -e "1\(0\)\*" in.file

3
Regular Expressions
Would match "1" followed by zero or more occurrences of "0", whereas:

grep -e "1\(0\)\+" in.file

Would match "1" followed by at leastone occurrence of "0", this is the same as:

grep -e "1\(0\)\(\1\)*" in.file

'[' followed by ']' can be used to match a range of characters and some special ranges are already defined:

• [[:alnum:]] matches [0-9a-zA-Z]

• [[:alpha:]] matches [a-zA-Z]

• [[:cntrl:]] matches control characters

• [[:digit:]] matches [0-9]

• [[:lower:]] matches [a-z]

• [[:punct:]] matches punctuation characters

• [[:upper:]] matches [A-Z]

• [[:space:]] matches any white space

grep -e "\([[:alpha:]]\)\+[[:digit:]][[:upper:]]"

Would match one or more characters in the range [a-zA-Z] followed by one character in the range [0-9] followed by one character
in the range [A-Z]. So it would match the string "abc9Z". The number of times a pattern must be matched can be specified after
the group:

grep -e "\(abc\)\{3\}" in.file

Would match any lines containing 3 occurrences of the pattern "abc".

grep -e "^\(abc\)\{22\}" in.file

Would match any lines containing 2 occurrences of the pattern "abc", with the restriction that the sequence must start at the begin-
ning of a line, specified by the use of '^', there are similar commands to '^':

• $ matches the end of a line

• \ matches the beginning of a word

• \> matches the end of a word

• \b matches the empty string at the edge of a word

• \B matches the empty string provided it is not at the edge of a word.

grep -e "^\([[:alpha:]][[:alnum:]]*\)=\1"

Would match an an alpha character beginning at the start of a line followed by zero or more alphanumeric characters followed by
'=' followed by the same sequence of characters that were matched before the '=', so "abc=abc" would be matched but "abc=abd"
would not be matched.
Suppose you wanted to match the h1, h2, h3... etc. elements in an HTML file. Assume the text file html.txt:

<h1 blah="cool"<Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
4
Regular Expressions
<h6>Title5</h6>
<h1>Title2</h2>

One could use the following:

grep -e "^<h[1-6][^>]*>[^<]*</h[1-6]>" html.txt

Which says to match, from the start of a line: "<h" then a character from the range [1-6] then anything but the '>' character (so that
the opening tag may contain attributes) then anything but a '<' character then "</h" then a character from the range [1-6] then the
character '>'. The output from executing this is shown below:

<h1 blah="cool">Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
<h1>Title2</h2>

Which is everything that the file contained, but if you look carefully, the last line is not a valid header element because it opens
with h1 and closes h2 so the correct regular expression would take this into account and only output lines that have matching
opening and closing tags, this can be achieved as follows:

grep -e "^<h\([1-6]\)[^>]*>[^<]*</h\1>" html.txt

The expression is the same as before but instead of having two [1-6] sections, the first [1-6] section is enclosed within "\(" and
"\)" so that it can be back-referenced. In the closing tag the group is back-referenced using \1 which means that the string matched
by the back-referenced group must be matched again, hence the opening and closing header tags must be of the same level. This
produces the correct output:

<h1 blah="cool">Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>

4. java.util.regex, Java 1.4


java.util.regex provides classes for matching character sequences against regular expressions. The two classes of java.util.regex
are Matcher and Pattern. Pattern provides the regular expression in an efficient compiled Java version. Matcher provides the
methods needed to match a character sequence against a Pattern. The java.util.regex entry in the Java API 1.4 can be found at
http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html.
So that Java regex can be illustrated efficiently, a program will be developed that takes a searchPattern and a searchString as argu-
ments and then prints out some information regarding the application of the searchPattern to the searchString. This will promote
efficient demonstration of java.util.regex compliant regular expressions by removing the need to re-compile the test class with a
new searchPattern and searchString. The process of building the program will illustrate how one can use the features provided by
java.util.regex. The program is shown below:

import java.util.regex.*;
public class Regex {
public static void main(String args[]) {
String searchString = "",
searchPattern = "";
if(args.length==2) {
searchPattern = args[0];
searchString = args[1];
} else {
output("Usage:");
output("java regex searchPattern searchString");
System.exit(0);
}
Pattern p = Pattern.compile(searchPattern);
Matcher m = p.matcher(searchString);
boolean b = m.find();
output("\nMatch found : "+b);
while(b) {
output("Match start : " + m.start());
output("Match end : " + m.end());
5
Regular Expressions
output("Match content : " + m.group(0));
if(m.groupCount()!=0) {
for(int i=1; i<=m.groupCount(); i++) {
output("Group " + i + " : " + m.group(i));
}
}
b = m.find();
if(b) output("\nMatch found : "+b);
}
}
private static void output(String s) {
System.out.println(s);
}
}

The program begins with the importation of the Java regular expression package java.util.regex, the Strings searchString and
searchString are declared ready for their use later. The number of command line arguments is checked, if it is not equal to two the
usage message is output, if it is equal to two the first command-line argument is assigned to the String variable searchPattern and
the second command-line argument is assigned to the String variable searchString.

Pattern p = Pattern.compile(searchPattern);
Matcher m = p.matcher(searchString);
boolean b = m.find();

The Pattern is created from the searchPattern using the compile method which compiles the given regular expression into a pat-
tern. A Matcher is created based on the recently created Pattern and the searchString, the Matcher will match instances of the
searchPattern within the searchString. A boolean called b is set to the result of m.find() which is the Matcher method which at-
tempts to find the next subsequence of input sequence that matches the Pattern defined by searchPattern. The state of the Matcher
is updated upon a successful match to contain information about the match such as where it occurred in the string and the content
of marked groups.

output("\nMatch found : "+b);


while(b) {
output("Match start : " + m.start());
output("Match end : " + m.end());
output("Match content : " + m.group(0));
if(m.groupCount()!=0) {
for(int i=1; i<=m.groupCount(); i++) {
output("Group " + i + " : " + m.group(i));
}
}
b = m.find();
if(b) output("\nMatch found : "+b);
}

This loop first prints whether or not a match was found, if it was then the start position of the match is output using the Matcher
method start(), the end position (+1) of the match is output using the Matcher method end. The portion of searchString that
matched the pattern is printed using the Matcher method group(int i) which returns the portion of searchString matched by the i'th
bracketed group within the pattern, group(0) returns the portion of searchString that is matched by the whole pattern.

if(m.groupCount()!=0) {
for(int i=1; i<=m.groupCount(); i++) {
output("Group " + i + " : " + m.group(i));
}
}
b = m.find();
if(b) output("\nMatch found : "+b);

If any groups were defined in the pattern, this section loops through the groups and prints out there content, group(0) is not printed
because it was printed earlier. The boolean b is set to the result of the next call to find() and if it is true, indicating another instance
of the pattern has been matched, the program prints that it has found a match. This condition is necessary so that at the end of all
the matches, "Match found : false" is not printed.
The program can be downloaded from here: Regex.java [files/Regex.java] and it is executed like this:

java Regex patternString searchString

Here are a few examples:

java Regex "Hello" "Hello World!"


Match found : true
Match start : 0

6
Regular Expressions
Match end : 5
Match content : Hello

java Regex "[Hh]ello" "Hello there Peter! Oh hello there James!"


Match found : true
Match start : 0
Match end : 5
Match content : Hello
Match found : true
Match start : 22
Match end : 27
Match content : hello

java Regex "(H)(e)(l)(l)(o)" "Hello ello ello!"


Match found : true
Match start : 0
Match end : 5
Match content : Hello
Group 1 : H
Group 2 : e
Group 3 : l
Group 4 : l
Group 5 : o

java Regex "H(e(l(l(o))))" "Hello ello ello"


Match found : true
Match start : 0
Match end : 5
Match content : Hello
Group 1 : ello
Group 2 : llo
Group 3 : lo
Group 4 : o

java Regex "!*" "0+ !!!"


Match found : true
Match start : 0
Match end : 0
Match content :
Match found : true
Match start : 1
Match end : 1
Match content :
Match found : true
Match start : 2
Match end : 2
Match content :
Match found : true
Match start : 3
Match end : 6
Match content : !!!
Match found : true
Match start : 6
Match end : 6
Match content :

The example shown last is quite strange in that it illustrates how each character in the input sequence matches the pattern since the
pattern specifies that zero or more '!' characters should be matched, the content is empty however since zero '!' were matched,
eventually the three '!'s are matched and then finally the newline produced when the command was entered is matched.

java Regex "S.*t" "Spontaneous combustion"


Match found : true
Match start : 0
Match end : 19

7
Regular Expressions
Match content : Spontaneous combust
java Regex "S.*?t" "Spontaneous combustion"
Match found : true
Match start : 0
Match end : 5
Match content : Spont

Notice the difference between these two, in that, the first expression uses the greedy version of ".*" which matches as many char-
acters as it can whilst still producing a match where as the second expression uses the reluctant version ".*?" which matches the
least amount of characters it can whilst still producing a match.

java Regex "10{3}\b" "10 100 1000 10000"


Match found : true
Match start : 7
Match end : 11
Match content : 1000

Matches '1' followed by three '0's. The "\b" matches a word boundary so that this expression does not match "10000".

java Regex "10{2,}\b" "10 100 1000 10000"


Match found : true
Match start : 3
Match end : 6
Match content : 100
Match found : true
Match start : 7
Match end : 11
Match content : 1000
Match found : true
Match start : 12
Match end : 17
Match content : 10000

Matches '1' followed by at least two '0's followed by a word boundary

java Regex "10{1,3}\b" "10 100 1000 10000"


Match found : true
Match start : 0
Match end : 2
Match content : 10
Match found : true
Match start : 3
Match end : 6
Match content : 100
Match found : true
Match start : 7
Match end : 11
Match content : 1000

Matches '1' followed by a minimum of one '0' and a maximum of 3 '0's followed by a word boundary.

5. Emacs Regular Expressions


Emacs has builtin regular expression support. Regular expressions may be used within searches by typing the Emacs command se-
quence C-M-s, this is CTRL-ALT-s on most computers. An example is shown below:

Figure 1. RegExp search

8
Regular Expressions

More useful however is the regular expression search and replace function. It is activated by typing C-M-%, that is CTRL-ALT-%
on most computers. When this command sequence is entered, the user is asked to enter an expression to find the text to replace
and then an expression to use to replace the text found. The regular expression syntax is shown below:

Regular Expressions
any single character except a newline . (dot)
zero or more repeats *
one or more repeats +
zero or one repeat ?
any character in the set [ : : :]
any character not in the set [^ : : :]
beginning of line ^
end of line $
quote a special character c \c
alternative (\or\) \_
grouping \( : : :\)
nth group \n
beginning of buffer \`
end of buffer \'
word break \b
not beginning or end of word \B
beginning of word \lt;
end of word \gt;
any word-syntax character \w
any non-word-syntax character \W
character with syntax c \sc
character with syntax not c \Sc

If you had a HTML file and you wanted to replace every occurrence of "<table>" with "<table border="1">" you could use:
Query replace regexp: <table> with: <table border=\"1\">
You can use back-references too:
Query replace regexp \(this\)\(.*\)\(that\) with \3\2\1
When operated on:

Switch this and that!


Switch this and then switch that!
Take this! and take that too!

Produces:

Switch that and this!


Switch that and then switch this!
Take that! and take this too!

".*" comes in greedy and non greedy flavours, consider the line:
You are greedy not greedy!
Using the greedy flavour:
Query replace regexp Y.*y with You are
Causes the whole line to be replaced with "You are", where as the non-greedy:

9
Regular Expressions

Query replace regexp Y.*?y with You are


Causes only the first part of the line to be replaced with "You are", leaving "You are not greedy!"
As a more useful example, imagine you have a tab separated file like this:

NAME AGE SEX


MAN1 32 M
MAN2 23 M
WOMAN1 33 F
WOMAN2 34 F

The following emacs regular expression could be used to swap the last two columns around:
^\([^ ]+\)\([ ]+\)\([^ ]\)\([ ]+\)\(.+?\)$
The regular expression begins with '^' to say that the pattern should begin with the beginning of a line. The next block is "\([^ ]+\)"
which says to match one or more non tab characters, in the example the tab looks like a space, this is because in emacs one would
actually enter the tab character like one would any other character so I have replaced the large gap that a tab makes with a smaller
one in this document, in emacs the expression looks like this:

Figure 2. Emacs tabs

Notice how emacs varies the sizes of some of the tabs, if one did not know they were tabs it would be possible to mistake them for
spaces. The "one or more non-tab characters" pattern is enclosed between "/(" and "/)", this is emacs regular expression grouping
so that the group can be back-referenced later. The next section in the pattern is "\([ ]+\)" which says to match one or more tab
characters and save them in a group. After this is another grouped "one or more non-tab characters" section followed by another
grouped "one or more tab characters" section followed finally by a grouped "one or more any characters" section. Terminated with
an end of line. This can be summarised to:

(StartOfLine) (Non-Tab+) (Tab+) (Non-Tab+) (Tab+) (Anything+) (EndOfLine)


group 1 group2 group3 group4 group5

When asked for the replacement text, this is used:


\1\2\5\4\3
Which replaces the line with the text from the groups, in the order specified, so that the text file is changed to:

NAME SEX AGE


MAN1 M 32
MAN2 M 23
WOMAN1 F 33
WOMAN2 F 34

The columns sex and age have been swapped.

6. References

• http://www.evolt.org/article/rating/20/22700/
Evolt Regular Expression Tutorial

• http://sitescooper.org/tao_regexps.html
A Tao Of Regular Expressions

• http://www.zytrax.com/tech/web/regex.htm
Zytrax Regular Expression Tutorial

• http://etext.lib.virginia.edu/helpsheets/regex.html
Using Regular Expressions - Stephen Ramsay

• http://www.grymoire.com/Unix/Regular.html
Regular Expressions - Bruce Barnett & General Electric Company

• http://jakarta.apache.org/regexp/
Jakarta Regexp
10
Regular Expressions

• http://www.grymoire.com/Unix/Regular.html
Regular ExPressions - Bruce Barnett

• http://theory.uwinnipeg.ca/localfiles/infofiles/lisp/lispref_Searching_and_Matching.html
XEmacs Lisp Reference Manual - Searching and Matching

11

You might also like