Regular Expressions

Ashley J.S Mills
<ashley@ashleymills.com>

Copyright © 2005 The University Of Birmingham

Table of Contents
1. Introduction
2. Basics
2.1. Single Character
2.2. Any Character: .
2.3. The Escape Character: \
2.4. The Caret: ^
2.5. The Dollar Symbol: $
2.6. The Kleene star: *
2.7. The Kleene plus: +
2.8. Ranges: [ ], [cn-cm] and [^cn-cm]
2.9. Grouping: \( \)
2.10. Alternatives: |
2.11. Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\}
3. Grep Examples
4. java.util.regex, Java 1.4
5. Emacs Regular Expressions
6. References

1. Introduction
Pattern matching is an important topic in Computer Science: it is the process of matching defined patterns against information. Humans use pattern matching every day to recognise objects and faces, and computers use it to perform the most basic of operations. When you execute a command at the command line, some kind of pattern matching determines what your command is asking the computer to do. Pattern matching is also used in compilers and programming languages.

Regular expressions are a particular kind of pattern matching, belonging to the regular language subclass of pattern matching languages. They are considered the least complex of the pattern matching languages but are very useful. You have probably used the concepts behind regular expressions before: for instance, you may have asked to delete *.* at the command line, referring to any basename followed by a dot followed by any extension. Most of you will be aware that in this setting the * character, known as a Kleene star or asterisk, means "match anything". It is used in a similar, though not identical, manner in the regular expressions we are about to discuss, where it repeats whatever precedes it.

Many programs have some kind of built-in regular expression handling. They all seem to have slight syntactic variations, but fortunately the concepts are identical in each case and the differences are often marginal. This text describes the most common components of a regular expression and presents program-specific examples where appropriate.

2. Basics
Regular expressions consist of literal characters and meta characters. Literal characters are the actual characters you want to find. Meta characters are special characters, like the Kleene star, and are the core concept behind regular expressions, so we will begin this section with a brief introduction to the most common meta characters.

2.1. Single Character
A single character such as Q is a regular expression: it matches every string that contains the character Q, so it would match Quick, Quiet and Quantum but not quick.

2.2. Any Character: .
The period, or full stop as we call it in Britain, matches any single character. For example, ".t.m" would match atom, item and stem, and probably some other words too. A fun example of using this character can be found at http://www.oneacross.com/ where it is used to help people find words for their crosswords; they also accept the character '?' as an alternative.
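As a minimal sketch of the two ideas so far, the following Java program (Java being the language used later in this text) checks a literal character and the "." meta character. The class name DotDemo and the helper found are hypothetical names chosen for illustration, and a Java 5 or later JDK is assumed.

```java
import java.util.regex.Pattern;

class DotDemo {
    // True when the pattern matches anywhere inside the text (a grep-like search).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // A literal character is case sensitive: Q is found in "Quick" but not in "quick".
        System.out.println("Quick -> " + found("Q", "Quick"));
        System.out.println("quick -> " + found("Q", "quick"));

        // Each "." stands for exactly one character, so "tm" is too short to match.
        for (String word : new String[] {"atom", "item", "stem", "tm"}) {
            System.out.println(word + " -> " + found(".t.m", word));
        }
    }
}
```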

2.3. The Escape Character: \
\ is used to signify that we want to use a meta character as a literal character. This is necessary because otherwise the character in question would be interpreted as a meta character. The character being escaped is the one immediately following the escape character. For example, "\*" would match the character that has been escaped, that is, it would match the string * (or any string containing *).

The converse can also be true: sometimes \ is used to signify that we want to use a literal character as a meta character, for example in an implementation that requires meta characters to be escaped. You should read the documentation of the particular regular expression implementation you are using to find out which approach it takes.
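The sketch below shows escaping in Java, where there is an extra wrinkle: the backslash is also special inside a Java string literal, so the regular expression \* is written "\\*" in source code. The class name EscapeDemo and the helper found are hypothetical. Pattern.quote is a standard java.util.regex method that escapes a whole string at once.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True when the pattern matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // "\\*" in Java source is the two-character regular expression \*
        System.out.println(found("\\*", "2*3")); // the text contains a literal asterisk
        System.out.println(found("\\*", "23"));  // no asterisk in the text

        // Pattern.quote makes every character of its argument literal.
        System.out.println(found(Pattern.quote("*.*"), "delete *.*"));
    }
}
```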

2.4. The Caret: ^
^, known as a caret, is used to match the beginning of a line, so "^CAPITAL" would match "CAPITALS signify emphasised speech, anger or SHOUTING" but would not match "You're such a CAPITAL idiot!".

2.5. The Dollar Symbol: $
$ is used to match the end of a line, so "here$" would match "I like it here" but would not match "here is a potato".
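The two anchors can be tried out with a short Java sketch. The class name AnchorDemo and the helper found are hypothetical. Each input here is a single line, so by default ^ and $ refer to the start and end of the whole input.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True when the pattern matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // ^ pins the match to the start of the line.
        System.out.println(found("^CAPITAL", "CAPITALS signify emphasised speech"));
        System.out.println(found("^CAPITAL", "You're such a CAPITAL idiot!"));

        // $ pins the match to the end of the line.
        System.out.println(found("here$", "I like it here"));
        System.out.println(found("here$", "here is a potato"));
    }
}
```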

2.6. The Kleene star: *
* is used to match zero or more occurrences of the regular expression immediately preceding the meta character. "10*" would match "1", "10", "100", "1000" and so on.

2.7. The Kleene plus: +
+ is used to match one or more occurrences of the regular expression immediately preceding the meta character. "10+" would match "10", "100", "1000" and so on but would not match "1".

Note
(regular expression)+ is the same as (regular expression)(regular expression)*.
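A small Java sketch of the star, the plus and the equivalence stated in the note. The class name StarPlusDemo and the helper whole are hypothetical. Pattern.matches is used because it requires the whole string to match, which makes the difference between * and + visible.

```java
import java.util.regex.Pattern;

class StarPlusDemo {
    // Pattern.matches succeeds only if the entire text matches the pattern.
    static boolean whole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String[] samples = {"1", "10", "100", "1000"};
        for (String s : samples) {
            System.out.println(s
                    + "  10*: " + whole("10*", s)
                    + "  10+: " + whole("10+", s)
                    // 1(0)(0)* spells out "one 0, then zero or more 0s", i.e. 10+
                    + "  1(0)(0)*: " + whole("1(0)(0)*", s));
        }
    }
}
```

Only the first sample, "1", should separate the two operators: it satisfies "10*" but not "10+".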

2.8. Ranges: [ ], [cn-cm] and [^cn-cm]
[ ] is used to signify that any one of the characters enclosed within the brackets may be matched. "1[123]512" would match "11512", "12512" and "13512".

[cn-cm] is used to specify a range of characters (inclusive) that may be matched at this point in the regular expression. "[b-f]oo" would match "boo", "coo", "doo", "eoo" and "foo" but not "goo".

[^cn-cm] is used to exclude a range of characters from a match. Notice that the caret has been used again: when it appears immediately after an opening [ it has this special meaning. If you want to exclude the caret itself then, in implementations that allow escaping inside brackets, you would escape it: "[^\^]". "[^1-8]00" would match "900" but not any of the other three-digit hundreds such as "500".
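The bracket forms above behave the same way in Java, as this sketch suggests. The class name RangeDemo and the helper whole are hypothetical. Java is one of the implementations in which a caret can be escaped inside brackets.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // Succeeds only if the entire text matches the pattern.
    static boolean whole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(whole("1[123]512", "12512")); // 2 is in the set
        System.out.println(whole("1[123]512", "14512")); // 4 is not in the set
        System.out.println(whole("[b-f]oo", "foo"));     // f is inside b..f
        System.out.println(whole("[b-f]oo", "goo"));     // g is outside b..f
        System.out.println(whole("[^1-8]00", "900"));    // 9 is not in 1..8
        System.out.println(whole("[^1-8]00", "500"));    // 5 is excluded
        // "[^\\^]" is the regular expression [^\^]: any character except a caret.
        System.out.println(whole("[^\\^]", "a"));
        System.out.println(whole("[^\\^]", "^"));
    }
}
```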

2.9. Grouping: \( \)
\( \) is used to treat the regular expression contained within the (escaped in this case) brackets as a group. The group can then be back-referenced later, for example with \1 to refer to the first group defined. How this is implemented varies between programs that use regular expressions: some tools do not require you to escape the brackets, and some use different conventions to back-reference defined groups. For instance, a program may use "$1" to refer to the first bracketed group instead of "\1". There may also be limits on the number of groups that can be referenced in this way; sometimes the maximum is nine. In the program grep, "\(a\)b\1" would match "aba".
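Java illustrates both conventions mentioned above: its brackets are not escaped, a back-reference inside the pattern is written \1, and a reference to a group in replacement text is written $1. The sketch below assumes nothing beyond the standard library. The class name BackrefDemo and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // True when the pattern matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Wraps the text captured by group 1 in square brackets.
    static String mark(String text) {
        return text.replaceAll("(a)b\\1", "[$1]");
    }

    public static void main(String[] args) {
        // (a)b\1 : an "a", then "b", then whatever group 1 captured (another "a").
        System.out.println(found("(a)b\\1", "aba"));
        System.out.println(found("(a)b\\1", "abb"));
        System.out.println(mark("aba"));
    }
}
```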

2.10. Alternatives: |
| is the OR operator. Its operands are the regular expressions on either side of it: if either the first expression OR the second expression matches, then the whole expression matches. For example, "^aba\|b$" will match the lines "aba" and "abb" but not "abc". Note that the alternatives here are "^aba" and "b$", because | binds less tightly than the characters around it. The | meta character may or may not need to be escaped, depending on the program.
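The following Java sketch (in Java the | is written unescaped) contrasts the example above with a grouped variant, to show how far each alternative extends. The class name AlternationDemo, the helper found and the extra test line "xb" are illustrative assumptions.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // True when the pattern matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String loose = "^aba|b$";    // either "starts with aba" or "ends with b"
        String grouped = "^ab(a|b)$"; // exactly "ab" followed by a or b

        for (String line : new String[] {"aba", "abb", "abc", "xb"}) {
            System.out.println(line
                    + "  loose: " + found(loose, line)
                    + "  grouped: " + found(grouped, line));
        }
    }
}
```

The line "xb" is the interesting case: it ends with b, so the ungrouped pattern accepts it, whereas the grouped pattern does not.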

2.11. Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\}
\{n\} is used to specify that the regular expression immediately preceding must be matched n times exactly. "^10\{3\}$" will match the line "1000" but not "100" or "10000". \{,n\} is used to specify that the regular expression immediately preceding may be matched up to a maximum of n times. "^10\{,3\}$" will match the lines "1", "10", "100" and "1000" but will not match "10000". \{n,\} is used to specify that the regular expression immediately preceding must be matched at least n times. "^10\{3,\}$" will match the lines "1000", "10000", "100000" and so on but will not match "100".

Note
This is an alternative to using the Kleene star and the Kleene plus, which may not be supported in your implementation. "a\{0,\}" is the same as "a*" and "a\{1,\}" is the same as "a+".

\{n,m\} is used to specify that the regular expression immediately preceding must be matched at least n times but may not exceed m matches. "^10\{3,4\}$" will match the lines "1000" and "10000" but not "100" or "100000".

The necessity to escape the characters may vary. Not all programs support all the types of repetition described.

3. Grep Examples
Grep is a tool used to search text using regular expressions. Its origins highlight its function; according to http://www.faqs.org/faqs/usenet/faq/part1/section-21.html they are as follows:

    The original UNIX text editor "ed" has a construct g/re/p, where "re" stands for a regular expression, to Globally search for matches to the Regular Expression and Print the lines containing them. This was so often used that it was packaged up into its own command, thus named "grep". According to Dennis Ritchie, this is the true origin of the command.

I will present a few examples, of which the first two are based on the following text file, mb.txt:
NAME        MAKE      HP  YEAR  PRICE
NSR250R     Honda     60  1993  £1340
NSR250R-SP  Honda     65  1994  £2000
KR1S        Kawasaki  60  1989  £1250
GSX250      Suzuki    26  1981  £300
GS250T      Suzuki    26  1982  £250
RGV250      Suzuki    60  1993  £1400
RGV250-SP   Suzuki    65  1994  £2400

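The examples use the -e option, which tells grep that the next argument is the pattern. Without the -E option, grep interprets the pattern as a basic regular expression, which is why the grouping and repetition characters in these examples are escaped. (With -E, grep expects an extended regular expression, in which these characters are written unescaped.)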
grep -e Honda mb.txt

Lists all the lines that contain the text string "Honda":
NSR250R     Honda     60  1993  £1340
NSR250R-SP  Honda     65  1994  £2000

grep -e "6. 1" mb.txt | grep -v Honda

Output:
KR1S        Kawasaki  60  1989  £1250
RGV250      Suzuki    60  1993  £1400
RGV250-SP   Suzuki    65  1994  £2400

First lists all bikes that are sixty-something BHP and then pipes the result to another instance of grep, which uses the -v option to exclude all the lines containing "Honda".

Note
Quotes are used to preserve whitespace. They are also used whenever '\' appears, since this character is special within the shell and so needs to be hidden from it.
ls -l | grep -e "Aristotle\.txt"

Output:
Aristotle.txt

Pipes the output from a directory listing to grep, which filters out everything except the lines containing "Aristotle.txt". Note the use of the escape character '\' to literally match '.'.
grep -e "\(101\)\1" in.file

Matches the string "101101", notice that "101" is first grouped by enclosing within an escaped opening parentheses "\(" and an escaped closing parentheses "\)". The first group is then referenced with "\1". Something like:
grep -e "1\(0\)*" in.file

Would match "1" followed by zero or more occurrences of "0", whereas:
grep -e "1\(0\)\+" in.file

Would match "1" followed by at least one occurrence of "0"; this is the same as:
grep -e "1\(0\)\(\1\)*" in.file

'[' followed by ']' can be used to match a range of characters and some special ranges are already defined:
• [[:alnum:]] matches [0-9a-zA-Z]
• [[:alpha:]] matches [a-zA-Z]
• [[:cntrl:]] matches control characters
• [[:digit:]] matches [0-9]
• [[:lower:]] matches [a-z]
• [[:punct:]] matches punctuation characters
• [[:upper:]] matches [A-Z]
• [[:space:]] matches any white space
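For comparison, java.util.regex offers the same predefined ranges under a different spelling: \p{Alnum}, \p{Alpha}, \p{Cntrl}, \p{Digit}, \p{Lower}, \p{Punct}, \p{Upper} and \p{Space}. The sketch below is only an illustration of that spelling. The class name PosixClassDemo, the helper found and the sample strings are assumptions.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // One or more letters, then one digit, then one upper-case letter.
    static final Pattern P = Pattern.compile("\\p{Alpha}+\\p{Digit}\\p{Upper}");

    // True when the pattern matches anywhere inside the text.
    static boolean found(String text) {
        return P.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("abc9Z")); // letters, digit, upper-case letter
        System.out.println(found("abc9z")); // final letter is lower case
        System.out.println(found("9Z"));    // no leading letter
    }
}
```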

grep -e "\([[:alpha:]]\)\+[[:digit:]][[:upper:]]"

Would match one or more characters in the range [a-zA-Z] followed by one character in the range [0-9] followed by one character in the range [A-Z]. So it would match the string "abc9Z". The number of times a pattern must be matched can be specified after the group:
grep -e "\(abc\)\{3\}" in.file

Would match any lines containing 3 consecutive occurrences of the pattern "abc".
grep -e "^\(abc\)\{2\}" in.file

Would match any lines containing 2 consecutive occurrences of the pattern "abc", with the restriction that the sequence must start at the beginning of a line, as specified by the use of '^'. There are other anchors similar to '^':
• $ matches the end of a line
• \< matches the beginning of a word
• \> matches the end of a word
• \b matches the empty string at the edge of a word
• \B matches the empty string provided it is not at the edge of a word
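The word-edge anchors \b and \B are also available in Java, as sketched below. As far as I know java.util.regex has no direct equivalent of the separate word-start and word-end anchors, so only \b and \B are shown. The class name WordBoundaryDemo, the helper found and the sample sentences are hypothetical.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // True when the pattern matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // \bcat\b : "cat" with a word edge on both sides.
        System.out.println(found("\\bcat\\b", "a cat sat"));
        System.out.println(found("\\bcat\\b", "concatenate"));

        // \Bcat\B : "cat" with no word edge on either side, i.e. buried in a word.
        System.out.println(found("\\Bcat\\B", "concatenate"));
        System.out.println(found("\\Bcat\\B", "a cat sat"));
    }
}
```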

grep -e "^\([[:alpha:]][[:alnum:]]*\)=\1"

Would match an alpha character at the start of a line, followed by zero or more alphanumeric characters, followed by '=', followed by the same sequence of characters that was matched before the '='. So "abc=abc" would be matched but "abc=abd" would not.

Suppose you wanted to match the h1, h2, h3... etc. elements in an HTML file. Assume the text file html.txt:
<h1 blah="cool">Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
<h1>Title2</h2>

One could use the following:
grep -e "^<h[1-6][^>]*>[^<]*</h[1-6]>" html.txt

Which says to match, from the start of a line: "<h" then a character from the range [1-6] then anything but the '>' character (so that the opening tag may contain attributes) then anything but a '<' character then "</h" then a character from the range [1-6] then the character '>'. The output from executing this is shown below:
<h1 blah="cool">Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
<h1>Title2</h2>

Which is everything that the file contained. If you look carefully, however, the last line is not a valid header element because it opens with h1 and closes with h2. The correct regular expression would take this into account and only output lines that have matching opening and closing tags. This can be achieved as follows:
grep -e "^<h\([1-6]\)[^>]*>[^<]*</h\1>" html.txt

The expression is the same as before but instead of having two [1-6] sections, the first [1-6] section is enclosed within "\(" and "\)" so that it can be back-referenced. In the closing tag the group is back-referenced using \1 which means that the string matched by the back-referenced group must be matched again, hence the opening and closing header tags must be of the same level. This produces the correct output:
<h1 blah="cool">Title1</h1>
<h2>Title2</h2>
<h3>Title3</h3>
<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
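The same back-referenced heading check can be sketched in Java, where the group brackets are unescaped and the back-reference \1 becomes "\\1" inside a string literal. The class name HeadingDemo and the method isHeading are hypothetical. The two sample lines are taken from html.txt.

```java
import java.util.regex.Pattern;

class HeadingDemo {
    // Group 1 captures the heading level; \1 demands the same level in the closing tag.
    static final Pattern HEADING =
            Pattern.compile("^<h([1-6])[^>]*>[^<]*</h\\1>");

    // True when the line starts with a heading whose tags have matching levels.
    static boolean isHeading(String line) {
        return HEADING.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isHeading("<h1 blah=\"cool\">Title1</h1>")); // levels agree
        System.out.println(isHeading("<h1>Title2</h2>"));               // h1 closed by h2
    }
}
```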

4. java.util.regex, Java 1.4
java.util.regex provides classes for matching character sequences against regular expressions. Its two main classes are Matcher and Pattern. Pattern holds the regular expression in an efficient compiled form. Matcher provides the methods needed to match a character sequence against a Pattern. The java.util.regex entry in the Java API 1.4 can be found at http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html.

To illustrate Java regular expressions efficiently, a program will be developed that takes a searchPattern and a searchString as arguments and then prints out some information regarding the application of the searchPattern to the searchString. This removes the need to re-compile the test class for each new searchPattern and searchString. The process of building the program will also illustrate how one can use the features provided by java.util.regex. The program is shown below:
import java.util.regex.*;

public class Regex {
    public static void main(String args[]) {
        String searchString = "", searchPattern = "";
        // Expect exactly two arguments: the pattern, then the string to search.
        if(args.length==2) {
            searchPattern = args[0];
            searchString = args[1];
        } else {
            output("Usage:");
            output("java Regex searchPattern searchString");
            System.exit(0);
        }
        Pattern p = Pattern.compile(searchPattern);
        Matcher m = p.matcher(searchString);
        boolean b = m.find();
        output("\nMatch found : "+b);
        // Report every match, together with the content of any groups.
        while(b) {
            output("Match start : " + m.start());
            output("Match end : " + m.end());
            output("Match content : " + m.group(0));
            if(m.groupCount()!=0) {
                for(int i=1; i<=m.groupCount(); i++) {
                    output("Group " + i + " : " + m.group(i));
                }
            }
            b = m.find();
            if(b) output("\nMatch found : "+b);
        }
    }

    private static void output(String s) {
        System.out.println(s);
    }
}

The program begins by importing the Java regular expression package java.util.regex. The Strings searchString and searchPattern are declared ready for use later. The number of command-line arguments is then checked: if it is not equal to two, the usage message is output; if it is equal to two, the first command-line argument is assigned to the String variable searchPattern and the second to the String variable searchString.
Pattern p = Pattern.compile(searchPattern);
Matcher m = p.matcher(searchString);
boolean b = m.find();

The Pattern is created from the searchPattern using the compile method which compiles the given regular expression into a pattern. A Matcher is created based on the recently created Pattern and the searchString, the Matcher will match instances of the searchPattern within the searchString. A boolean called b is set to the result of m.find() which is the Matcher method which attempts to find the next subsequence of input sequence that matches the Pattern defined by searchPattern. The state of the Matcher is updated upon a successful match to contain information about the match such as where it occurred in the string and the content of marked groups.
output("\nMatch found : "+b);
while(b) {
    output("Match start : " + m.start());
    output("Match end : " + m.end());
    output("Match content : " + m.group(0));
    if(m.groupCount()!=0) {
        for(int i=1; i<=m.groupCount(); i++) {
            output("Group " + i + " : " + m.group(i));
        }
    }
    b = m.find();
    if(b) output("\nMatch found : "+b);
}

This code first prints whether or not a match was found. If one was, the start position of the match is output using the Matcher method start(), and the position just past the last matched character is output using the Matcher method end(). The portion of searchString that matched the pattern is printed using the Matcher method group(int i), which returns the portion of searchString matched by the i'th bracketed group within the pattern; group(0) returns the portion of searchString that is matched by the whole pattern.
if(m.groupCount()!=0) {
    for(int i=1; i<=m.groupCount(); i++) {
        output("Group " + i + " : " + m.group(i));
    }
}
b = m.find();
if(b) output("\nMatch found : "+b);

If any groups were defined in the pattern, this section loops through the groups and prints out their content; group(0) is not printed because it was printed earlier. The boolean b is set to the result of the next call to find() and, if it is true, indicating that another instance of the pattern has been matched, the program prints that it has found a match. This condition is necessary so that "Match found : false" is not printed after the last match. The program can be downloaded from here: Regex.java [files/Regex.java] and it is executed like this:
java Regex patternString searchString
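Because the pattern comes straight from the command line, it may be malformed. Pattern.compile reports this by throwing the unchecked PatternSyntaxException, which the program above does not catch. The following is a minimal sketch of one way to handle it, not part of the original Regex.java. The class name SafeCompile and the method tryCompile are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SafeCompile {
    // Returns the compiled pattern, or null when the expression is malformed.
    static Pattern tryCompile(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            // getDescription() gives a short explanation of what went wrong.
            System.err.println("Bad pattern \"" + regex + "\": " + e.getDescription());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCompile("[Hh]ello") != null); // well formed
        System.out.println(tryCompile("(H)(e") != null);    // second group is never closed
    }
}
```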

Here are a few examples:
java Regex "Hello" "Hello World!"

Match found : true
Match start : 0
Match end : 5
Match content : Hello

java Regex "[Hh]ello" "Hello there Peter! Oh hello there James!"

Match found : true
Match start : 0
Match end : 5
Match content : Hello

Match found : true
Match start : 22
Match end : 27
Match content : hello

java Regex "(H)(e)(l)(l)(o)" "Hello ello ello!"

Match found : true
Match start : 0
Match end : 5
Match content : Hello
Group 1 : H
Group 2 : e
Group 3 : l
Group 4 : l
Group 5 : o

java Regex "H(e(l(l(o))))" "Hello ello ello"

Match found : true
Match start : 0
Match end : 5
Match content : Hello
Group 1 : ello
Group 2 : llo
Group 3 : lo
Group 4 : o

java Regex "!*" "0+ !!!"

Match found : true
Match start : 0
Match end : 0
Match content :

Match found : true
Match start : 1
Match end : 1
Match content :

Match found : true
Match start : 2
Match end : 2
Match content :

Match found : true
Match start : 3
Match end : 6
Match content : !!!

Match found : true
Match start : 6
Match end : 6
Match content :

The last example is quite strange. The pattern specifies that zero or more '!' characters should be matched, so it also matches the empty string at each position where there is no '!', and the content of those matches is empty since zero '!' were matched. Eventually the three '!'s are matched, and finally one more empty match is found at the very end of the input, at position 6.
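The empty matches can be made easier to see by collecting only the start and end positions of each match. The sketch below does this and, for contrast, runs the same input through "!+", which cannot match the empty string. The class name EmptyMatchDemo and the method spans are hypothetical, and a Java 7 or later JDK is assumed for the diamond syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Returns each match as "start-end"; an empty match has start equal to end.
    static List<String> spans(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("!*", "0+ !!!")); // prints [0-0, 1-1, 2-2, 3-6, 6-6]
        System.out.println(spans("!+", "0+ !!!")); // prints [3-6]
    }
}
```

If the empty matches are not wanted, using + instead of * is usually the simplest way to avoid them.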
java Regex "S.*t" "Spontaneous combustion"

Match found : true
Match start : 0
Match end : 19
Match content : Spontaneous combust

java Regex "S.*?t" "Spontaneous combustion"

Match found : true
Match start : 0
Match end : 5
Match content : Spont

Notice the difference between these two: the first expression uses the greedy version ".*", which matches as many characters as it can whilst still producing a match, whereas the second expression uses the reluctant version ".*?", which matches as few characters as it can whilst still producing a match.
java Regex "10{3}\b" "10 100 1000 10000"

Match found : true
Match start : 7
Match end : 11
Match content : 1000

Matches '1' followed by three '0's. The "\b" matches a word boundary so that this expression does not match "10000".
java Regex "10{2,}\b" "10 100 1000 10000"

Match found : true
Match start : 3
Match end : 6
Match content : 100

Match found : true
Match start : 7
Match end : 11
Match content : 1000

Match found : true
Match start : 12
Match end : 17
Match content : 10000

Matches '1' followed by at least two '0's followed by a word boundary.
java Regex "10{1,3}\b" "10 100 1000 10000"

Match found : true
Match start : 0
Match end : 2
Match content : 10

Match found : true
Match start : 3
Match end : 6
Match content : 100

Match found : true
Match start : 7
Match end : 11
Match content : 1000

Matches '1' followed by a minimum of one '0' and a maximum of 3 '0's followed by a word boundary.

5. Emacs Regular Expressions
Emacs has built-in regular expression support. Regular expressions may be used within searches by typing the Emacs command sequence C-M-s, which is CTRL-ALT-s on most computers. An example is shown below:

Figure 1. RegExp search


More useful however is the regular expression search and replace function. It is activated by typing C-M-%, that is CTRL-ALT-% on most computers. When this command sequence is entered, the user is asked to enter an expression to find the text to replace and then an expression to use to replace the text found. The regular expression syntax is shown below:
any single character except a newline    . (dot)
zero or more repeats                     *
one or more repeats                      +
zero or one repeat                       ?
any character in the set                 [ ... ]
any character not in the set             [^ ... ]
beginning of line                        ^
end of line                              $
quote a special character c              \c
alternative ("or")                       \|
grouping                                 \( ... \)
nth group                                \n
beginning of buffer                      \`
end of buffer                            \'
word break                               \b
not beginning or end of word             \B
beginning of word                        \<
end of word                              \>
any word-syntax character                \w
any non-word-syntax character            \W
character with syntax c                  \sc
character with syntax not c              \Sc

If you had an HTML file and you wanted to replace every occurrence of "<table>" with "<table border="1">" you could use:
Query replace regexp: <table> with: <table border="1">

You can use back-references too:
Query replace regexp \(this\)\(.*\)\(that\) with \3\2\1

When operated on:
Switch this and that!
Switch this and then switch that!
Take this! and take that too!

Produces:
Switch that and this!
Switch that and then switch this!
Take that! and take this too!
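For comparison with the Java section, the same swap can be sketched with String.replaceAll, where the groups are unescaped and the replacement refers to them as $3, $2 and $1 rather than \3, \2 and \1. The class name SwapDemo and the method swap are hypothetical.

```java
class SwapDemo {
    // Exchanges the first "this" with the last "that" on the line, keeping the text between them.
    static String swap(String line) {
        return line.replaceAll("(this)(.*)(that)", "$3$2$1");
    }

    public static void main(String[] args) {
        String[] lines = {
            "Switch this and that!",
            "Switch this and then switch that!",
            "Take this! and take that too!"
        };
        for (String line : lines) {
            System.out.println(swap(line));
        }
    }
}
```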

".*" comes in greedy and non greedy flavours, consider the line:
You are greedy not greedy!

Using the greedy flavour:
Query replace regexp Y.*y with You are

Causes everything from the first "Y" to the last "y", which is almost the whole line, to be replaced with "You are", leaving "You are!", whereas the non-greedy:
Query replace regexp Y.*?y with You are

Causes only the first part of the line to be replaced with "You are", leaving "You are not greedy!".

As a more useful example, imagine you have a tab-separated file like this:
NAME    AGE  SEX
MAN1    32   M
MAN2    23   M
WOMAN1  33   F
WOMAN2  34   F

The following emacs regular expression could be used to swap the last two columns around:
^\([^ ]+\)\([ ]+\)\([^ ]+\)\([ ]+\)\(.+?\)$

The regular expression begins with '^' to say that the pattern should begin at the beginning of a line. The next block is "\([^ ]+\)" which says to match one or more non-tab characters. In the example the tab looks like a space: in emacs one would actually enter the tab character like any other character, so I have replaced the large gap that a tab makes with a smaller one in this document. In emacs the expression looks like this:

Figure 2. Emacs tabs

Notice how emacs varies the sizes of some of the tabs; if one did not know they were tabs it would be possible to mistake them for spaces. The "one or more non-tab characters" pattern is enclosed between "\(" and "\)"; this is emacs regular expression grouping, so that the group can be back-referenced later. The next section in the pattern is "\([ ]+\)" which says to match one or more tab characters and save them in a group. After this is another grouped "one or more non-tab characters" section, followed by another grouped "one or more tab characters" section, followed finally by a grouped "one or more of any character" section. The pattern is terminated with an end of line. This can be summarised as:
(StartOfLine) (Non-Tab+) group 1 (Tab+) group2 (Non-Tab+) group3 (Tab+) group4 (Anything+) group5 (EndOfLine)

When asked for the replacement text, this is used:
\1\2\5\4\3

Which replaces the line with the text from the groups, in the order specified, so that the text file is changed to:
NAME    SEX  AGE
MAN1    M    32
MAN2    M    23
WOMAN1  F    33
WOMAN2  F    34

The columns sex and age have been swapped.
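The same column swap can be sketched in Java, where the tab can be written visibly as \t instead of being typed as an invisible character. This is an illustration under the assumption of exactly three tab-separated columns per line. The class name ColumnSwapDemo and the method swapLastTwo are hypothetical.

```java
class ColumnSwapDemo {
    // Groups: 1 = first column, 2 = tabs, 3 = second column, 4 = tabs, 5 = rest of line.
    static String swapLastTwo(String line) {
        return line.replaceAll("^([^\t]+)(\t+)([^\t]+)(\t+)(.+?)$", "$1$2$5$4$3");
    }

    public static void main(String[] args) {
        String[] rows = {
            "NAME\tAGE\tSEX",
            "MAN1\t32\tM",
            "WOMAN2\t34\tF"
        };
        for (String row : rows) {
            System.out.println(swapLastTwo(row));
        }
    }
}
```

A line that does not have this shape, for example one with only two columns, does not match the pattern and is returned unchanged.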

6. References
• http://www.evolt.org/article/rating/20/22700/ - Evolt Regular Expression Tutorial
• http://sitescooper.org/tao_regexps.html - A Tao Of Regular Expressions
• http://www.zytrax.com/tech/web/regex.htm - Zytrax Regular Expression Tutorial
• http://etext.lib.virginia.edu/helpsheets/regex.html - Using Regular Expressions - Stephen Ramsay
• http://www.grymoire.com/Unix/Regular.html - Regular Expressions - Bruce Barnett & General Electric Company
• http://jakarta.apache.org/regexp/ - Jakarta Regexp
• http://theory.uwinnipeg.ca/localfiles/infofiles/lisp/lispref_Searching_and_Matching.html - XEmacs Lisp Reference Manual - Searching and Matching