
$address =~ m/(\d* .*)\n(.*?), ([A-Z]{2}) (\d{5})-?(\d{0,5})/
Introduction to Regular Expressions

• It’s all about patterns


• Character Classes match any text of a certain type
• Repetition operators specify a recurring pattern
• Search flags change how the RegEx operates
• In this presentation…
• green denotes a character class
• yellow denotes a repetition quantifier
• orange denotes a search flag or other symbol

• My examples use Perl syntax


Introduction to Regular Expressions

• Basic syntax
• All RegEx statements must begin and end with /
• /something/
• Escaping reserved characters is crucial
• /(i.e. / is invalid because ( must be closed
• However, /\(i\.e\. / is valid for finding ‘(i.e. ’
• Reserved characters include:
• .*?+()[]{}/\|
• Also some characters have special meanings
based on their position in the statement
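A minimal sketch of the escaping rule above. The sample sentence and variable names are made up for illustration; quotemeta() is Perl's built-in that backslash-escapes every non-word character, which is one way to avoid escaping by hand.

```perl
use strict;
use warnings;

# Hypothetical sample text, not from the slides.
my $text = 'Use a quantifier (i.e. * or +) here.';

# Hand-escaped: ( and . are reserved, so each one gets a backslash.
my $manual = $text =~ /\(i\.e\. / ? 1 : 0;

# quotemeta() escapes the reserved characters of a literal string for us.
my $literal = '(i.e. ';
my $escaped = quotemeta($literal);
my $auto    = $text =~ /$escaped/ ? 1 : 0;

print "manual=$manual auto=$auto\n";
```

Both tests succeed because the escaped pattern treats the parenthesis and periods as plain text.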
Regular Expression Matching

• Text Matching
• A RegEx can match plain text
• ex. if ($name =~ /Dan/) { print "match"; }
• But this will match Dan, Danny, Daniel, etc…
• Full Text Matching with Anchors
• Might want to match a whole line (or string)
• ex. if ($name =~ /^Dan$/) { print "match"; }
• This will only match Dan
• ^ anchors to the front of the line
• $ anchors to the end of the line
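To see the difference anchors make, here is a small sketch that runs both patterns from this slide over a list of names. The list itself is a made-up example.

```perl
use strict;
use warnings;

# Hypothetical sample names, not from the slides.
my @names = ('Dan', 'Danny', 'Daniel', 'Jordan');

# Unanchored: keeps every string that contains "Dan" somewhere.
my @loose = grep { /Dan/ } @names;

# Anchored: keeps only strings that are exactly "Dan".
my @exact = grep { /^Dan$/ } @names;

print "loose: @loose\n";   # loose: Dan Danny Daniel
print "exact: @exact\n";   # exact: Dan
```

'Jordan' is rejected by both patterns because matching is case sensitive by default.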
Regular Expression Matching

• Order of results
• The search will begin at the start of the string
• This can be altered, don’t ask yet
• Every character is important
• Any plain text in the expression is treated literally
• Nothing is neglected (close doesn’t count)
• / s/ (one space) is not the same as /  s/ (two spaces)
• Far easier to write than to debug!
Regular Expression Char Classes

• Allows specification of only certain allowable chars


• [dofZ] matches only the letters d, o, f, and Z
• If you have a string ‘dog’ then /[dofZ]/ would
match ‘d’ only even though ‘o’ is also in the class
• So this expression can be stated “match one of
either d, o, f, or Z.”
• [A-Za-z] matches any letter
• [a-fA-F0-9] matches any hexadecimal character
• [^*$/\\] matches anything BUT *, $, /, or \
• The ^ in the front of the char class specifies ‘not’
• In a char class, you only need to escape \ ] - ^ (plus / when it is also the delimiter)
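The classes on this slide can be tried out as follows. This is a sketch with made-up test strings; it borrows the ^ and $ anchors and the + quantifier to test whole strings, and uses m{...} delimiters so the / inside the negated class needs no escaping.

```perl
use strict;
use warnings;

# [dofZ] consumes exactly one character, so only the 'd' of 'dog' is matched.
my ($first) = 'dog' =~ /([dofZ])/;

# A range-based class repeated with + checks that every character is a hex digit.
my $is_hex  = 'c0ffee' =~ /^[a-fA-F0-9]+$/ ? 1 : 0;
my $not_hex = 'coffee' =~ /^[a-fA-F0-9]+$/ ? 1 : 0;

# Negated class: capture the first character that is NOT *, $, / or \.
# The $ is escaped so Perl does not try to interpolate a variable.
my ($other) = '*$/\\x' =~ m{([^*\$/\\])};

print "first=$first is_hex=$is_hex not_hex=$not_hex other=$other\n";
# first=d is_hex=1 not_hex=0 other=x
```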
Regular Expression Char Classes

• Special character classes match specific characters


• \d matches a single digit
• \w matches a word character (A-Z, a-z, 0-9, _)
• \b matches a word boundary /\bword\b/
• \s matches a whitespace character (spc, tab, newln)
• . wildcard matches everything except newlines
• Use very carefully, you could get anything!

• To match “anything but…” capitalize the char class


• e.g. \D matches anything that isn’t a digit
Regular Expression Char Classes

• Character Class Examples


• $bodyPart =~ /e\w\w/;
• Matches ear, eye, etc
• $thing = '1, 2, 3 strikes!'; $thing =~ /\s\d/;
• Matches ‘ 2’
• $thing = '1, 2, 3 strikes!'; $thing =~ /[\s\d]/;
• Matches ‘1’
• Not always useful to match single characters
• $phone =~ /\d\d\d-\d\d\d-\d\d\d\d/;
• There’s a better way…
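The examples on this slide can be checked by wrapping each pattern in parentheses and capturing what it matched. This is only a sketch; the phone number is a made-up value, and the phone pattern is anchored here so it tests the whole string.

```perl
use strict;
use warnings;

my $thing = '1, 2, 3 strikes!';

# \s\d needs a whitespace character followed by a digit: two characters.
my ($pair) = $thing =~ /(\s\d)/;

# [\s\d] needs one character that is either whitespace or a digit.
my ($single) = $thing =~ /([\s\d])/;

# \D is the negation of \d: the first non-digit is the comma.
my ($non_digit) = $thing =~ /(\D)/;

# Hypothetical phone number matched one digit at a time.
my $phone_ok = '555-123-4567' =~ /^\d\d\d-\d\d\d-\d\d\d\d$/ ? 1 : 0;

print "pair='$pair' single='$single' non_digit='$non_digit' phone_ok=$phone_ok\n";
# pair=' 2' single='1' non_digit=',' phone_ok=1
```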
Regular Expression Repetition

• Repetition allows for flexibility


• Range of occurrences
• $weight =~ /\d{2,3}/;
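• Matches any 2- or 3-digit weight, from 10 to 999 (unanchored, so it also matches inside longer numbers)
• $name =~ /\w{5,}/;
• Matches any name of 5 or more word characters
• if ($SSN !~ /^\d{9}$/) { print "Invalid SSN!"; }
• Anchored, so it accepts exactly 9 digits and flags anything else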
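A short sketch of counted repetition. The helper name is_valid_ssn is hypothetical, and the pattern is anchored with ^ and $ because an unanchored /\d{9}/ would also accept longer digit strings.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the whole string is exactly 9 digits.
sub is_valid_ssn {
    my ($ssn) = @_;
    return $ssn =~ /^\d{9}$/ ? 1 : 0;
}

# Unanchored {2,3} stops after three digits, even inside a longer number.
my ($weight) = '1000' =~ /(\d{2,3})/;

# {5,} means "five or more", with no upper limit.
my $long_enough = 'Daniel' =~ /^\w{5,}$/ ? 1 : 0;
my $too_short   = 'Dan'    =~ /^\w{5,}$/ ? 1 : 0;

print is_valid_ssn('123456789'), is_valid_ssn('12345678'), is_valid_ssn('1234567890'), "\n"; # 100
print "weight=$weight long_enough=$long_enough too_short=$too_short\n";
# weight=100 long_enough=1 too_short=0
```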
Regular Expression Repetition

• General Quantifiers
• Some more special characters
• $favoriteNumber =~ /\d*/;
• Matches any size number or no number at all
• $firstName =~ /\w+/;
• Matches one or more word characters
• $middleInitial =~ /\w?/;
• Matches one or zero word characters
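One consequence of "or no number at all" is easy to miss, so here is a sketch with a made-up input: because * is satisfied by zero repetitions, /\d*/ succeeds at the very start of the string with an empty match, while /\d+/ keeps looking until it finds a digit.

```perl
use strict;
use warnings;

# Hypothetical input with letters before the digits.
my $input = 'abc42';

# \d* accepts zero digits, so it matches the empty string at position 0.
my ($star) = $input =~ /(\d*)/;

# \d+ requires at least one digit, so it finds '42'.
my ($plus) = $input =~ /(\d+)/;

# \w? accepts one word character or none, but not two.
my @initial = map { /^\w?$/ ? 1 : 0 } ('J', '', 'JR');

print "star='$star' plus='$plus' initial=@initial\n";
# star='' plus='42' initial=1 1 0
```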
Regular Expression Repetition

• Greedy vs Nongreedy matching


• Greedy matching gets the longest results possible
• Nongreedy matching gets the shortest possible
• Let’s say $robot = 'The12thRobotIs2ndInLine'
• $robot =~ /\w*\d+/; (greedy)
• Matches The12thRobotIs2
• Maximizes the length of \w
• $robot =~ /\w*?\d+/; (nongreedy)
• Matches The12
• Minimizes the length of \w
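The robot example can be run directly. This sketch only adds capturing parentheses around each pattern so the matched text can be printed.

```perl
use strict;
use warnings;

my $robot = 'The12thRobotIs2ndInLine';

# Greedy: \w* grabs as much as it can, then backs off until \d+ can match.
my ($greedy) = $robot =~ /(\w*\d+)/;

# Nongreedy: \w*? grabs as little as it can before \d+ takes over.
my ($lazy) = $robot =~ /(\w*?\d+)/;

print "greedy=$greedy\n";  # greedy=The12thRobotIs2
print "lazy=$lazy\n";      # lazy=The12
```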
Regular Expression Repetition

• Greedy vs Nongreedy matching


• Suppose $txt = 'something is so cool';
• $txt =~ /something/;
• Matches ‘something’
• $txt =~ /so(mething)?/g; (g finds every match, not just the first)
• Matches ‘something’ and the second ‘so’
• $txt =~ /so(mething)??/g;
• Matches only the ‘so’ in ‘something’ and the second ‘so’
• Doesn’t really make sense to do this
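Seeing both matches requires a global search, since a plain match stops at the first hit. The sketch below assumes the g flag and swaps the group for the non-capturing form (?:...) so that list context returns whole matches rather than group contents.

```perl
use strict;
use warnings;

my $txt = 'something is so cool';

# Greedy optional group: takes 'mething' whenever it is available.
my @greedy = $txt =~ /so(?:mething)?/g;

# Nongreedy optional group: skips 'mething' even when it is available.
my @lazy = $txt =~ /so(?:mething)??/g;

print join('|', @greedy), "\n";  # something|so
print join('|', @lazy), "\n";    # so|so
```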
Regular Expression Real Life Examples

• Using what you’ve learned so far, you can…


• Validate a standard 8.3 file name
• $path =~ /^\w{1,8}\.[A-Za-z0-9]{2,3}$/
• Account for poorly spelled user input
• $answer =~ /^ban{1,2}an{1,2}a$/
• $iansLastName =~ /^P[ae]t{1,2}ers[oe]n$/
• $iansFirstName =~ /^E?[Ii]?[aeo]?n$/
• Matches Ian, Ean, Eian, Eon, Ien, Ein
• At least everyone gets the n right…
Alternation

• Alternation allows multiple possibilities


• Let $story = 'He went to get his mother';
• $story =~ /^(He|She)\b.*?\b(his|her)\b.*?(mother|father|brother|sister|dog)/;
• Also matches ‘She punched her fat brother’
• Make sure the grouping is correct!
• $ans =~ /^(true|false)$/
• Matches only ‘true’ or ‘false’
• $ans =~ /^true|false$/ (same as /(^true|false$)/)
• Matches ‘true never’ or ‘not really false’
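The grouping pitfall is easy to reproduce. This sketch filters the four strings mentioned on the slide through both versions of the pattern.

```perl
use strict;
use warnings;

my @inputs = ('true', 'false', 'true never', 'not really false');

# Grouped: the anchors apply to both alternatives.
my @strict = grep { /^(true|false)$/ } @inputs;

# Ungrouped: reads as "starts with true" OR "ends with false".
my @sloppy = grep { /^true|false$/ } @inputs;

print scalar(@strict), " strict: ", join(', ', @strict), "\n";
# 2 strict: true, false
print scalar(@sloppy), " sloppy: ", join(', ', @sloppy), "\n";
# 4 sloppy: true, false, true never, not really false
```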
Grouping for Backreferences

• Backreferences
• With all these wildcards and possible matches, we
usually need to know what the expression finally
ended up matching.
• Backreferences let you see what was matched
• Can be used after the expression has evaluated or
even inside the expression itself
• Handled very differently in different languages
• Numbered from left to right, starting at 1
Grouping for Backreferences

• Perl backreferences
• Used inside the expression
• $txt =~ /\b(\w+)\s+\1\b/
• Finds any duplicated word, must use \1 here
• Used after the expression
• $class =~ /(.+?)-(\d+)/
• The text before the hyphen is stored in the
Perl variable $1 (not \1) and the number after it goes in $2
• print "I am in class $1, section $2";
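Both uses of Perl backreferences can be sketched in a few lines. The sample sentence and the class code 'CS101-3' are made-up values; checking that the match succeeded before reading $1 and $2 avoids picking up stale values from an earlier match.

```perl
use strict;
use warnings;

# Inside the expression: \1 must repeat whatever group 1 just matched.
my $txt = 'Paris in the the spring';   # hypothetical sample text
my $dup;
if ($txt =~ /\b(\w+)\s+\1\b/) {
    $dup = $1;
}

# After the expression: $1 and $2 hold the text of groups 1 and 2.
my $class = 'CS101-3';                  # hypothetical class code
my ($course, $section);
if ($class =~ /(.+?)-(\d+)/) {
    ($course, $section) = ($1, $2);
    print "I am in class $course, section $section\n";
    # I am in class CS101, section 3
}

print "duplicated word: $dup\n";        # duplicated word: the
```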
Grouping for Backreferences

• Java backreferences
• Annoying but still useful
• Pattern p = Pattern.compile("(.+?)-(\\d+)");
Matcher m = p.matcher(mySchedule);
if (m.find()) { // group() throws unless a match was found
System.out.println("I am in class " + m.group(1) +
", section " + m.group(2));
}
• Ugly, but usually better than the alternative
• m.group() returns the entire string matched
Grouping for Backreferences

• Javascript backreferences
• Used inside the expression
• Supported with \1, as in Perl
• Used after the expression
• /(.+?)-(\d+)/.test(className); (class is a reserved word)
• alert(RegExp.$1);
• str = str.replace(/(\S+)\s+(\S+)/, "$2 $1");
• RegExp supports several of Perl’s special backreference
variables (wait a few slides)
Grouping for Backreferences

• PHP/Python backreferences
• Allows the use of specifically named backreferences
• Groups also maintain their numbers
• .NET backreferences
• Allows named backreferences
• Unnamed groups are numbered first and named groups after
them, so accessing named groups by number is error-prone

• Check the web for info on how to use backreferences in these and other languages.
Grouping without Backreferences

• Sometimes you just need to make a group


• If important groups must be backreferenced, disable
backreferencing for any unimportant groups
• $sentence =~ /(?:He|She) likes (\w+)\./;
• I don’t care if it’s a he or she
• All I want to know is what he/she likes
• Therefore I use (?:) to forgo the backreference
• $1 will contain that thing that he/she likes
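A quick sketch of the difference, using a made-up sentence: with a capturing group the pronoun takes $1 and pushes the interesting part to $2, while (?:) leaves $1 free.

```perl
use strict;
use warnings;

my $sentence = 'She likes pizza.';   # hypothetical sample text

# Capturing group: the pronoun lands in group 1, the liked thing in group 2.
my @with_capture = $sentence =~ /(He|She) likes (\w+)\./;

# Non-capturing group: only the liked thing is captured.
my @without_capture = $sentence =~ /(?:He|She) likes (\w+)\./;

print "with: @with_capture\n";        # with: She pizza
print "without: @without_capture\n";  # without: pizza
```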
Matching Modes

• Matching has different functional modes


• Modes can be set by flags outside the expression (only
in some languages & implementations)
• $name =~ /[a-z]+/i;
• i turns off case sensitivity
• $xml =~ /title="([\w ]*)".*keywords="([\w ]*)"/s;
• s enables . to match newlines
• $report =~ /^\s*Name:[\s\S]*?The End\.\s*$/m;
• m lets ^ and $ match at the start and end of every line, not just of the whole string
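The three flags can be compared side by side. This is a sketch with a made-up two-line string; the g flag is added to the last two patterns only to collect every match.

```perl
use strict;
use warnings;

# i: the lowercase class also accepts uppercase letters.
my $ci = 'DAN' =~ /^[a-z]+$/i ? 1 : 0;

my $text = "first\nsecond";   # hypothetical two-line input

# Without s, the . refuses to match the newline between the words.
my $dot_default = $text =~ /first.second/  ? 1 : 0;
my $dot_s       = $text =~ /first.second/s ? 1 : 0;

# Without m, ^ matches only at the start of the string;
# with m, it also matches right after each newline.
my @starts_default = $text =~ /^(\w+)/g;
my @starts_m       = $text =~ /^(\w+)/mg;

print "ci=$ci dot_default=$dot_default dot_s=$dot_s\n";  # ci=1 dot_default=0 dot_s=1
print "default: @starts_default\n";                      # default: first
print "with m: @starts_m\n";                             # with m: first second
```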
Matching Modes
• Matching has different functional modes
• Modes can be set by flags inside the expression
(except in older Javascript engines)
• $password =~ /^[a-z](?i)[a-jp-xz0-9]{4,11}$/;
• Handles an insane web site that requires a
password beginning with a lowercase letter,
followed by 4 to 11 upper/lower alphanumeric
characters excluding k through o and y.
• $element =~ /^(?i)[A-Z](?-i)[a-z]?$/;
• (?i) makes the first letter case insensitive (if
they type o, but meant O, we still know they
mean oxygen). (?-i) makes sure the second
letter is lowercase, otherwise it’s 2 elements
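The element pattern can be exercised with a handful of made-up inputs. In this sketch, (?i) switches case insensitivity on for the first letter and (?-i) switches it back off for the optional second letter.

```perl
use strict;
use warnings;

# Hypothetical user inputs for a chemical element symbol.
my @inputs = ('O', 'o', 'Na', 'na', 'NA');

# First letter: any case. Optional second letter: lowercase only.
my @accepted = grep { /^(?i)[A-Z](?-i)[a-z]?$/ } @inputs;

print "accepted: @accepted\n";   # accepted: O o Na na
```

'NA' is rejected because the uppercase A fails the case-sensitive [a-z] class.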
Regular Expression Replacing
• Replacements simplify complex data modification
• Generally the first part of a replace command is the
regular expression and the second part is what to
replace the matched text with
• Usually a backreference variable can be used in the
replacement text to refer to a group matched in the
expression
• The RegEx engine continues searching at the point in
the string following the replacement
• Replacements use all the same syntax, but have
several unique features and are implemented very
differently in various languages.
Regular Expression Replacing
• Perl replacement syntax
• $phone =~ s/\D//;
• Removes the first non-digit character in a phone #
• Note that leaving the replacement blank deletes
• $html =~ s/^(\s*)/$1\t/;
• Adds a tab to a line of HTML using backreferences
• $sample =~ s/[abc]/[ABC]/;
• Might not do what is expected
• The second part is NOT a regular expression, it’s a
string
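The three substitutions on this slide behave as follows. The input values are made up, and each result is written to a copy so the original string stays visible.

```perl
use strict;
use warnings;

# Remove only the FIRST non-digit: the opening parenthesis.
(my $phone = '(555) 123-4567') =~ s/\D//;

# Re-insert the captured leading whitespace, then add a tab after it.
(my $html = '  <p>hi</p>') =~ s/^(\s*)/$1\t/;

# The replacement is a plain string, so the brackets are inserted literally.
(my $sample = 'cab') =~ s/[abc]/[ABC]/;

print "$phone\n";    # 555) 123-4567
print "$sample\n";   # [ABC]ab
```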
Regular Expression Replacing
• Java replacement syntax (sucks)
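• Pattern p = Pattern.compile("\\\\\\\\server(\\d)");
• p.matcher(netPath).replaceAll("\\\\\\\\workstation$1");
• Yes, you actually have to use 8 \’s to make \\
• Any \ in the expression needs to be doubled once for the Java string and once more for the RegEx
• The replacement string also treats \ as an escape, so it needs 8 \’s to produce \\ as well
• The Matcher expands $1 in the replacement
• This has the same effect but is slightly faster than
• netPath.replaceAll("\\\\\\\\server(\\d)",
"\\\\\\\\workstation$1");
• No, you can’t use .replace() here: it treats both arguments as literal text, not as a RegEx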
Replacement Modes
• Replacements can be performed singly or globally
• The examples I have been using replace only single
occurrences of patterns
• Use the g flag to force the expression to scan the
entire string
• $phone =~ s/\D//g;
• Removes all non-digits in the phone number
• $myGarage =~ s/Jeep|Cougar/Boeing/g;
• Gives me jets in exchange for cars
• Don’t use it if it’s not necessary
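A sketch of the g flag in action, using made-up inputs. In Perl, s///g also returns the number of replacements it made, which is handy for checking that the expected amount of text changed.

```perl
use strict;
use warnings;

# With g, every non-digit is removed, not just the first one.
(my $digits = '(555) 123-4567') =~ s/\D//g;

# Hypothetical garage contents; the return value counts the replacements.
my $garage = 'Jeep, Cougar, bicycle';
my $count  = ($garage =~ s/Jeep|Cougar/Boeing/g);

print "$digits\n";            # 5551234567
print "$garage ($count)\n";   # Boeing, Boeing, bicycle (2)
```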
Combining Replace and Match Modes
• Combining modes is easy
• To combine modes, just append the flags
• $alphabet =~ s/Q//gi;
• Get rid of the pesky letter Q (and q too)
• $response =~ /(?im)"([aeiou].*?)"(?-m)(.*)/;
• This example sucks. Point is you can combine
modes inside the statement, too.
References for Learning More
• Tutorials for other programming languages
• http://www.regular-expressions.info/

• In-depth syntax
• http://kobesearch.cpan.org/htdocs/perl/perlreref.html

• Code Search (ex: ‘ip address regex’)


• http://www.google.com/codesearch
