Perl Regular Expressions by Example

4/10/13 Perl Regular Expressions by Example

www.somacon.com/p127.php 1/6
Introduction
Regular expressions are very powerful tools for matching, searching, and replacing text.
Unfortunately, they are also very cryptic. It does not help that most explanations of regular
expressions start from the specification, which is like learning to love Friends reruns by reading a
VCR manual. This page provides some simple examples for reference.
You should know a little programming and how to run basic Perl scripts before reading this article.
Section 1: Basic matching and substitution
Declare a local variable called $mystring.
my $mystring;
Assign a value (string literal) to the variable.
$mystring = "Hello world!";
Does the string contain the word "World"?
if($mystring =~ m/World/) { print "Yes"; }
No, it doesn't. The binding operator =~ with the match operator m// does a pattern search on
$mystring and returns true if the pattern is found. The pattern is whatever is between the m/ and the
trailing /. (Note, there is no such thing as a ~= operator, and using it will give a compile error.)
Does the string contain the word "World", ignoring case?
if($mystring =~ m/World/i) { print "Yes"; }
Yes, it does. The pattern modifier i immediately after the trailing / changes the match to be case-
insensitive.
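The two matches above can be compared side by side. This is a minimal sketch; the variable names $exact, $anycase, and $missing are made up for illustration, and the negated binding operator !~ is an addition not covered in the text above.

```perl
use strict;
use warnings;

my $mystring = "Hello world!";

# =~ yields true when the pattern is found, so the result can be stored.
my $exact   = ($mystring =~ m/World/)  ? 1 : 0;   # case-sensitive
my $anycase = ($mystring =~ m/World/i) ? 1 : 0;   # case-insensitive

# !~ is the negated form: true when the pattern is NOT found.
my $missing = ($mystring !~ m/World/)  ? 1 : 0;

print "exact=$exact anycase=$anycase missing=$missing\n";
# prints "exact=0 anycase=1 missing=1"
```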
I want "Hello world!" to be changed to "Hello mom!" instead.
$mystring =~ s/world/mom/;
print $mystring;
Prints "Hello mom!". The substitution operator s/// replaces the text matched by the pattern between
the s/ and the middle / with the replacement string between the middle / and the last /. In this case,
"world" is replaced with the word "mom".
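It can also be useful to know whether a substitution actually happened. The sketch below relies on s/// returning the number of substitutions it made (and a false value when it made none); the names $changed and $again are hypothetical.

```perl
use strict;
use warnings;

my $mystring = "Hello world!";

# s/// returns the number of substitutions made (false if none).
my $changed = ($mystring =~ s/world/mom/);
print "changed=$changed result=$mystring\n";
# prints "changed=1 result=Hello mom!"

# A second attempt finds no "world" left, so the string is untouched.
my $again = ($mystring =~ s/world/mom/) ? 1 : 0;
print "again=$again result=$mystring\n";
# prints "again=0 result=Hello mom!"
```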
Now change "Hello mom!" to say "Goodbye mom!".
$mystring =~ s/hello/Goodbye/;
print $mystring;
This does not substitute, and prints "Hello mom!" as before. By default, the search is case
sensitive. As before, use the pattern modifier i immediately after the trailing / to make the search
case-insensitive.
Okay, ignoring case, change "Hello mom!" to say "Goodbye mom!".
$mystring =~ s/hello/Goodbye/i;
print $mystring;
Prints "Goodbye mom!".
Section 2: Extracting substrings
I want to see if my string contains a digit.
$mystring = "[2004/04/13] The date of this article.";
if($mystring =~ m/\d/) { print "Yes"; }
Prints "Yes". The pattern \d matches any single digit. In this case, the search will finish as soon as
it reads the "2". Searching always goes left to right.
Huh? Why doesn't "\d" match the exact characters '\' and 'd'?
This is because Perl also uses letters of the alphabet to stand for things with special meaning,
like digits. To distinguish the special meaning from the plain letter, the letter is immediately
preceded by a backslash. In front of a character that already has a special meaning, the backslash
does the opposite and makes that character literal. Therefore, whenever you read '\' followed by
any character, you treat the two together as one symbol. For example, '\d' means a digit, '\w' means
an alphanumeric character or '_', '\/' means a forward slash, and '\\' means a single
backslash. Preceding a character with a '\' is called escaping, and the '\' together with its character
is called an escape sequence.
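The escape sequences just mentioned can be tried on a single string. This is only a sketch; the sample value in $path and the flag names are invented for the example.

```perl
use strict;
use warnings;

# Single quotes keep the backslash in the string as a literal character.
my $path = 'C:\temp/report_7.txt';

my $has_digit     = ($path =~ m/\d/) ? 1 : 0;  # \d : any single digit
my $has_wordchar  = ($path =~ m/\w/) ? 1 : 0;  # \w : letter, digit, or _
my $has_backslash = ($path =~ m/\\/) ? 1 : 0;  # \\ : one literal backslash
my $has_slash     = ($path =~ m/\//) ? 1 : 0;  # \/ : one literal forward slash

print "$has_digit $has_wordchar $has_backslash $has_slash\n";
# prints "1 1 1 1"
```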
Okay, how do I return the first matching digit from my string?
$mystring = "[2004/04/13] The date of this article.";
if($mystring =~ m/(\d)/) {
print "The first digit is $1.";
}
Prints "The first digit is 2." To designate part of a pattern for extraction, place parentheses
around it. If the pattern matches, the text matched inside the parentheses is stored in the Perl
special variable called $1. If there are multiple parenthesized expressions, their matches are
stored in the variables $1, $2, $3, etc., from left to right.
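As a small illustration of several parenthesized expressions in one pattern, the sketch below captures the first four consecutive digits separately. The array @digits is a hypothetical name used only to hold the results.

```perl
use strict;
use warnings;

my $mystring = "[2004/04/13] The date of this article.";
my @digits;

# Four groups in a row: each (\d) captures one digit into $1 through $4.
if ($mystring =~ m/(\d)(\d)(\d)(\d)/) {
    @digits = ($1, $2, $3, $4);
    print "Digits: $1 $2 $3 $4\n";   # prints "Digits: 2 0 0 4"
}
```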
Huh? Why don't '(' and ')' match the parentheses themselves?
This is because the designers of regular expressions felt that some constructs are so common that
they should use unescaped characters to represent them. Besides parentheses, there are a
number of other characters that have special meanings when unescaped, and these are called
metacharacters. To match parentheses or other metacharacters literally, you have to escape
them like '\(' and '\)'. They designed it for their convenience, not to make it easy to learn.
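Escaped and unescaped metacharacters can be mixed in one pattern. In this sketch the flag names and $first are made up; the string is the one used throughout this section.

```perl
use strict;
use warnings;

my $mystring = "[2004/04/13] The date of this article.";

my $has_bracket = ($mystring =~ m/\[/) ? 1 : 0;  # \[ : a literal [
my $has_paren   = ($mystring =~ m/\(/) ? 1 : 0;  # \( : a literal ( (none here)
my $has_period  = ($mystring =~ m/\./) ? 1 : 0;  # \. : a literal period

# The escaped bracket is matched literally, while the unescaped
# parentheses still capture the digit that follows it.
my $first = "";
if ($mystring =~ m/\[(\d)/) { $first = $1; }

print "$has_bracket $has_paren $has_period $first\n";
# prints "1 0 1 2"
```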
Okay, how do I extract a complete number, like the year?
$mystring = "[2004/04/13] The date of this article.";
if($mystring =~ m/(\d+)/) {
print "The first number is $1.";
}
Prints "The first number is 2004." First, when one says "complete number", one really means a
grouping of one or more digits. The pattern quantifier + matches one or more of the pattern that
immediately precedes it, in this case, the \d. The search will finish as soon as it reads the "2004".
How do I print all the numbers from the string?
$mystring = "[2004/04/13] The date of this article.";
while($mystring =~ m/(\d+)/g) {
print "Found number $1. ";
}
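Prints "Found number 2004. Found number 04. Found number 13. ". This introduces another
pattern modifier g, which tells Perl to do a global search on the string. In other words, each time
the loop condition is evaluated, the search resumes where the previous match ended, so the loop
visits every match from left to right and stops when no more are found.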
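To see how the search moves through the string, the built-in pos() function can be called after each match; it reports the offset where the next search will begin. This is a sketch only, and @seen is a hypothetical name for collecting the output.

```perl
use strict;
use warnings;

my $mystring = "[2004/04/13] The date of this article.";
my @seen;

while ($mystring =~ m/(\d+)/g) {
    # pos() gives the offset just past the most recent /g match.
    push @seen, "$1 ends at " . pos($mystring);
}

print join("; ", @seen), "\n";
# prints "2004 ends at 5; 04 ends at 8; 13 ends at 11"
```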
How do I get all the numbers from the string into an array instead?
$mystring = "[2004/04/13] The date of this article.";
@myarray = ($mystring =~ m/(\d+)/g);
print join(",", @myarray);
Prints "2004,04,13". This does the same search as before, but because the match is assigned to an
array (list context), it returns all the captured values at once, and they are stored in @myarray.
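Assigning a match to a list works in a couple of other handy ways. The following sketch assumes the same date string; $year, $month, $day, and @numbers are names chosen for the example.

```perl
use strict;
use warnings;

my $mystring = "[2004/04/13] The date of this article.";

# Without g, a match in list context returns the captured groups.
my ($year, $month, $day) = ($mystring =~ m/(\d+)\/(\d+)\/(\d+)/);

# With g and no parentheses in the pattern, it returns every whole match.
my @numbers = ($mystring =~ m/\d+/g);

print "$year-$month-$day\n";       # prints "2004-04-13"
print join(",", @numbers), "\n";   # prints "2004,04,13"
```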
Section 3: Common tasks
How do I extract everything between the words "start" and "end"?
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
Prints " text always precedes the end of the ". The pattern .* is two different metacharacters
that together tell Perl to match everything between "start" and "end". Specifically, the
metacharacter . means match any character except a newline. The pattern quantifier * means match
zero or more of the preceding symbol.
That isn't exactly what I expected. How do I extract everything between "start" and the
first "end" encountered?
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*?)end/) {
print $1;
}
Prints " text always precedes the ". By default, the quantifiers are greedy. This means that when
you say .*, Perl matches every character (except a newline) all the way to the end of the string, and
then works backward until it finds "end". To make the quantifier non-greedy (minimal), follow it
with ?. This tells Perl to match as few as possible of the preceding symbol before
continuing to the next part of the pattern.
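The same difference shows up whenever a closing delimiter appears more than once. Here is a short sketch on a made-up string with two bracketed dates; $text, $greedy, and $minimal are hypothetical names.

```perl
use strict;
use warnings;

my $text = "[2004/04/13] and [2005/01/02]";

# Greedy: .* runs to the end of the string, then backs up to the LAST ].
my ($greedy)  = ($text =~ m/\[(.*)\]/);

# Non-greedy: .*? grows one character at a time and stops at the FIRST ].
my ($minimal) = ($text =~ m/\[(.*?)\]/);

print "greedy:  $greedy\n";    # prints "greedy:  2004/04/13] and [2005/01/02"
print "minimal: $minimal\n";   # prints "minimal: 2004/04/13"
```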
Conclusion
Regular expressions in Perl are very powerful, and there are many ways to do the same thing. I
hope you find this page useful to get started in regular expressions. Hopefully, now you can read
the specifications and get more out of them.
Perl Book Recommendations
Perl Black Book: The Most Comprehensive Perl Reference Available Today - This book
takes a really hands-on, task-oriented approach, and has lots of examples. I would highly
recommend it.
Perl in a Nutshell - This book concisely summarizes Perl features.
As an experienced, non-Perl programmer, I have been able to get by with the above two books, the
comp.lang.perl newsgroup, and the perldoc documentation. The first book I use when I need some
example code to get something working quickly, and the second book I use for reference when I
need to look up some regular expression syntax or a specific function call. The Nutshell book is
easier to use on my desk as a reference, because it is lightweight. However, if I were to own one
book, I would own the Perl Black Book. Neither of these books is for novice programmers who
don't understand things like control structures and functions.
Quick (Incomplete) Reference
Metacharacters
These need to be escaped to be matched literally.
\ . ^ $ * + ? { } [ ] ( ) |
Escape sequences for pre-defined character classes
\d - a digit - [0-9]
\D - a nondigit - [^0-9]
\w - a word character (alphanumeric including underscore) - [a-zA-Z_0-9]
\W - a nonword character - [^a-zA-Z_0-9]
\s - a whitespace character - [ \t\n\r\f]
\S - a non-whitespace character - [^ \t\n\r\f]
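The character classes above can be compared on one line of text. This is a rough sketch; the sample string in $line and all variable names are invented for illustration.

```perl
use strict;
use warnings;

my $line = "user_42 logged in\tat 09:15";

my ($word)     = ($line =~ m/(\w+)/);      # first run of word characters
my ($digits)   = ($line =~ m/(\d+)/);      # first run of digits
my ($nondigit) = ($line =~ m/(\D+)/);      # first run of non-digits
my ($nonword)  = ($line =~ m/(\W)/);       # first non-word character
my ($chunk)    = ($line =~ m/(\S+)/);      # first run of non-whitespace
my ($gap)      = ($line =~ m/in(\s)at/);   # the whitespace between "in" and "at"

print "$word|$digits|$nondigit|$chunk\n";
# prints "user_42|42|user_|user_42"
```

Here $nonword holds a single space and $gap holds the tab character.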
Assertions
Assertions have zero width.
^ - Matches the beginning of the string (or of each line with the m modifier)
$ - Matches the end of the string or before a newline at the end (or the end of each line with the m modifier)
\B - Matches everywhere except between a word character and non-word character
\b - Matches between word character and non-word character
\A - Matches only at the beginning of a string
\Z - Matches only at the end of a string or before a newline at the end
\z - Matches only at the end of a string
\G - Matches where previous m//g left off
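A few of these assertions are easier to tell apart with a concrete test. The sketch below uses made-up sample strings and variable names; it counts matches of "cat" with and without word boundaries, then checks the end-of-string assertions against a string that ends in a newline.

```perl
use strict;
use warnings;

my $text = "cat concatenate bobcat cat";

my @anywhere = ($text =~ m/cat/g);       # every "cat", even inside words
my @whole    = ($text =~ m/\bcat\b/g);   # "cat" as a whole word only
my @inside   = ($text =~ m/\Bcat\B/g);   # "cat" with word characters on both sides

my $line = "hello\n";

my $start   = ($line =~ m/\Ahello/) ? 1 : 0;  # \A : start of string
my $dollar  = ($line =~ m/hello$/)  ? 1 : 0;  # $  : end, or before a final newline
my $big_z   = ($line =~ m/hello\Z/) ? 1 : 0;  # \Z : same, regardless of modifiers
my $small_z = ($line =~ m/hello\z/) ? 1 : 0;  # \z : only the very end of the string

print scalar(@anywhere), " ", scalar(@whole), " ", scalar(@inside), "\n";
# prints "4 2 1"
print "$start $dollar $big_z $small_z\n";
# prints "1 1 1 0"
```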
Minimal Matching Quantifiers
The quantifiers below match their preceding element in a non-greedy way.
*? - zero or more times
+? - one or more times
?? - zero or one time
{n}? - n times
{n,}? - at least n times
{n,m}? - at least n times but not more than m times
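To compare greedy and minimal quantifiers directly, the sketch below runs each form against the same string of four letters. The string and variable names are hypothetical.

```perl
use strict;
use warnings;

my $s = "aaaa";

my ($greedy_range) = ($s =~ m/(a{2,3})/);    # takes as many as allowed
my ($lazy_range)   = ($s =~ m/(a{2,3}?)/);   # takes as few as allowed
my ($greedy_plus)  = ($s =~ m/(a+)/);
my ($lazy_plus)    = ($s =~ m/(a+?)/);
my ($lazy_star)    = ($s =~ m/(a*?)/);       # zero is allowed, so it takes none

print "$greedy_range $lazy_range $greedy_plus $lazy_plus [$lazy_star]\n";
# prints "aaa aa aaaa a []"
```

Note that {n}? behaves the same as {n}, since an exact count leaves nothing to minimize.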
Created 2004-12-16, Last Modified 2012-06-07, Shailesh N. Humbad
Disclaimer: This content is provided as-is. The information may be incorrect.
