
Learning Activity - Introduction to Regular Expressions

Details

Objective(s)
After completing this Learning Activity, the learner should be able to:
    Describe what regular expressions do
    Explain when to use regular expressions
    Manipulate data using simple regular expressions

Time required: 30 minutes

Overview
Introduction to Regular Expressions
Web Resources
Learning Activity
    o Basic Command Structure
    o Boundary Symbols "^", "$", "\<" and "\>"
        - The caret symbol (^)
        - The dollar sign symbol ($)
        - The escaped pointy brackets
    o Wildcards "*", "+", and "?"
    o Matching a group of single characters "[" and "]"
    o Backreferencing "\(", "\)", "\1", "\2", "\3" and so forth
    o Matching a group of words "|"
    o Special Characters

Introduction to Regular Expressions


Regular expressions are used to find certain strings of characters within a file. They are often used to
search for a certain string, and then replace that string with a different string.

Web Resources
http://main.rtfiber.com.tw/~changyj/sed/sed_commands.html

http://www.grymoire.com/Unix/Regular.html

Some other sites that have regex testers: http://regexpal.com/, http://gskinner.com/RegExr/


Learning Activity
This is a brief introduction to writing search-and-replace command lines, or regular expressions.

Basic Command Structure:


s/[search-string]/[replace-string]/g

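Matches are case-sensitive.

The search-string is written as a regular expression (regex); the replace-string is the replacement
text, which can also use a few special sequences. The trailing g replaces every match on a line
instead of only the first.
Note: in a vi-style editor such as Vim, s/ only searches and replaces on the line you are on. To
search and replace through the whole document, use :%s/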

Examples:

s/dog/cat/g

This replaces all instances of "dog" with "cat". That is,

the dog -> the cat


hotdog -> hotcat
doggy -> catgy
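
The substitution above can also be tried from a shell. The following is a minimal sketch, assuming a POSIX shell with sed installed; the function names dog_to_cat and dog_to_cat_once are made up for this illustration. The second function drops the trailing g to show that only the first match on each line is then replaced.

```shell
# Hypothetical helper: the substitution from the example, applied to standard input.
dog_to_cat() {
  sed 's/dog/cat/g'
}

# Hypothetical helper: the same substitution without the trailing g.
dog_to_cat_once() {
  sed 's/dog/cat/'
}

# The three sample strings from the text, one per line.
printf '%s\n' 'the dog' 'hotdog' 'doggy' | dog_to_cat

# Matching is case-sensitive, so "Dog" is left alone.
printf '%s\n' 'the Dog' | dog_to_cat

# With g both occurrences change; without it only the first one does.
printf '%s\n' 'dog eat dog' | dog_to_cat
printf '%s\n' 'dog eat dog' | dog_to_cat_once
```

Unlike s/ in an editor, sed applies the command to every line of its input, so no % prefix is needed here.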

Boundary Symbols "^" , "$", "\<" and "\>"


The caret symbol (^)

matches the beginning of a line.

Examples:

s/^dog/cat/g

replaces "dog" with "cat" only if it is anchored to the beginning of a line. That is,

the dog -> the dog


dog and cat -> cat and cat
doggy's home -> catgy's home

The dollar sign symbol ($)

matches the end of a line.

Examples:

s/dog$/cat/g

replaces "dog" with "cat" only if it is anchored to the end of a line. That is,

the dog -> the cat
dog and cat -> dog and cat
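
Both anchors can be checked side by side. This is a small sketch, assuming a POSIX shell with sed; at_line_start and at_line_end are hypothetical names chosen for the illustration.

```shell
# Hypothetical helpers wrapping the two anchored substitutions from the text.
at_line_start() { sed 's/^dog/cat/g'; }
at_line_end()   { sed 's/dog$/cat/g'; }

# The sample lines used in the examples above.
printf '%s\n' 'the dog' 'dog and cat' "doggy's home" | at_line_start
printf '%s\n' 'the dog' 'dog and cat' | at_line_end

# A line with "dog" at both ends: each anchor changes only its own end,
# even though the g flag is present.
printf '%s\n' 'dog eat dog' | at_line_start
printf '%s\n' 'dog eat dog' | at_line_end
```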

The escaped pointy brackets

mark the beginning (\<) and end (\>) of an (alphanumeric) word. They can be used in place of the
caret and dollar sign symbols when a match should be tied to a word rather than to a line. For
instance, if you want to convert the number "1" to the word "one", without changing "12" to
"one2", you can have four lines using carets and dollar signs:

s/^1 /one /g
s/ 1 / one /g
s/ 1$/ one/g
s/^1$/one/g

Or, you could have one line using pointy brackets:

s/\<1\>/one/g
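
The two approaches can be compared directly. The sketch below assumes GNU sed, where \< and \> are supported; the function names one_by_word and one_by_lines are invented for the comparison.

```shell
# Hypothetical helper: the single command using word boundaries (GNU sed).
one_by_word() {
  sed 's/\<1\>/one/g'
}

# Hypothetical helper: the four commands using carets, dollar signs and spaces.
one_by_lines() {
  sed -e 's/^1 /one /g' -e 's/ 1 / one /g' -e 's/ 1$/ one/g' -e 's/^1$/one/g'
}

# On these lines both versions give the same result, and "12" is never touched.
printf '%s\n' '1 apple' 'take 1 or 12' 'apples: 1' '1' '12 apples' | one_by_word
printf '%s\n' '1 apple' 'take 1 or 12' 'apples: 1' '1' '12 apples' | one_by_lines

# Next to punctuation only the word-boundary version finds the number.
printf '%s\n' '(1)' | one_by_word
printf '%s\n' '(1)' | one_by_lines
```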

Wildcards "*", "+", and "?"


The asterisk (*) matches zero or more instances of the preceding character; "+" matches one or
more instances of the preceding character; and "?" matches zero or one instance of the
preceding character. In the examples below, "+" and "?" are written with a backslash ("\+" and "\?").

Examples:

s/dog*/cat/g

replaces "do", "dog", "dogg", "doggg" with "cat".

s/do*g/cat/g

replaces "dg", "dog", "doog", "dooog" with "cat".

s/dog\+/cat/g

replaces "dog", "dogg", "doggg" with "cat".

s/dog\?/cat/g

replaces "do" and "dog" with "cat".
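
The difference between the three symbols is easiest to see on the same input. This is a sketch assuming GNU sed, since \+ and \? are GNU extensions in this syntax; the function names are hypothetical.

```shell
# Hypothetical helpers, one per symbol, each using a command from the examples.
match_star()     { sed 's/dog*/cat/g'; }    # "do" followed by zero or more "g"
match_plus()     { sed 's/dog\+/cat/g'; }   # "do" followed by one or more "g"
match_optional() { sed 's/dog\?/cat/g'; }   # "do" followed by zero or one "g"

# Same three inputs through each command.
printf '%s\n' 'do' 'dog' 'doggg' | match_star
printf '%s\n' 'do' 'dog' 'doggg' | match_plus
printf '%s\n' 'do' 'dog' 'doggg' | match_optional
```

Only \+ leaves "do" unchanged, and only \? leaves extra letters behind on "doggg", because it consumes at most one "g".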

Matching a group of single characters "[" and "]"


Square brackets can be used to match one of a group of characters.

Examples:

s/d[aeiou]g/cat/g
replaces "dag", "deg", "dig", "dog" and "dug" with "cat".

Note also the following:

[A-Z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ]


[a-z] is equivalent to [abcdefghijklmnopqrstuvwxyz]
[0-9] is equivalent to [0123456789]
[^symbols] matches any single character except the listed symbols. Do not confuse this with the caret
symbol (outside square brackets) that matches the beginning of a line.

Example:

s/d[A-Z]g/cat/g

only "d" followed by an uppercase letter followed by a "g" will be replaced with "cat". That is,

the dog -> the dog


the dOg -> the cat
the dOOg -> the dOOg

s/d[0-9]g/cat/g

only "d" followed by a single digit followed by a "g" will be replaced with "cat". That is,

the d1g -> the cat


the d22g -> the d22g

s/d[^aeiou]g/cat/g

"d" followed by any single character that is not a lowercase vowel, followed by "g", will be replaced with "cat". That is,

the dog -> the dog


the dyg -> the cat
the d$g -> the cat
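
The three bracket examples can be run as follows. This is a sketch assuming a POSIX shell with sed; the function names are made up, and LC_ALL=C is set as a precaution because in some locales a range such as [A-Z] may also match lowercase letters.

```shell
# Hypothetical helpers, one per bracket expression from the examples.
upper_between() { LC_ALL=C sed 's/d[A-Z]g/cat/g'; }
digit_between() { LC_ALL=C sed 's/d[0-9]g/cat/g'; }
non_vowel_between() { LC_ALL=C sed 's/d[^aeiou]g/cat/g'; }

# A bracket expression stands for exactly one character,
# so "dOOg" and "d22g" do not match.
printf '%s\n' 'the dog' 'the dOg' 'the dOOg' | upper_between
printf '%s\n' 'the d1g' 'the d22g' | digit_between

# Inside brackets the caret negates the set: anything but a lowercase vowel.
printf '%s\n' 'the dog' 'the dyg' 'the d$g' | non_vowel_between
```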

Backreferencing "\(", "\)", "\1", "\2", "\3" and so forth


Backreferencing allows you to store a certain string of characters (in your pocket) for later retrieval. To
back-reference, use "\(" and "\)" in the search-string to capture the string you want to keep, and use "\1",
"\2" and so forth in the replace-string to retrieve it. The captures are numbered from left to right, and
you can store up to nine of them: "\1" through "\9".

Examples:

s/d\([aeiou]\)g/c\1t/g

replaces "dag" with "cat"; "deg" with "cet"; "dig" with "cit", etc.

s/c\([ei]\)/s\1/g

replaces "c" followed by "e" or "i" with "s", keeping the vowel.

s/c\([auo]\)/k\1/g

replaces "c" followed by "a", "u" or "o" with "k", keeping the vowel.
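
The backreference examples can be run together. This is a minimal sketch, assuming a POSIX shell with sed; the function names swap_consonants and respell_c are hypothetical, and the second one simply chains the two "c" commands with -e.

```shell
# Hypothetical helper: keep the captured vowel, change the letters around it.
swap_consonants() {
  sed 's/d\([aeiou]\)g/c\1t/g'
}

# Hypothetical helper: apply both "c" rules in order, each keeping its vowel.
respell_c() {
  sed -e 's/c\([ei]\)/s\1/g' -e 's/c\([auo]\)/k\1/g'
}

# \1 puts back whichever vowel was captured.
printf '%s\n' 'dag' 'deg' 'dig' 'dog' 'dug' | swap_consonants

# "c" before e or i becomes "s"; "c" before a, u or o becomes "k".
printf '%s\n' 'cell' 'cat' 'circus' | respell_c
```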

Matching a group of words "|"


The escaped vertical bar ("\|") matches any one of a group of words. Wrapped in "\(" and "\)", it allows you to backreference whichever "word" matched.

Examples:

s/\(Henry\|John\|Alexander\|Charles\) VII/\1 the seventh/g

This is a more compact way of writing the following sed commands:

s/Henry VII/Henry the seventh/g


s/John VII/John the seventh/g
s/Alexander VII/Alexander the seventh/g
s/Charles VII/Charles the seventh/g

s/\(^\|# \)\([aeiou]\)/\1? \2/g

In LexSpeak, this inserts a glottal stop (/?/) at the beginning of a word that begins with a vowel sound
(/a/, /e/, /i/, /o/, /u/).
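
Both alternation examples can be sketched in a shell, assuming GNU sed (the escaped bar \| is a GNU extension in this syntax). The function names are hypothetical. The glottal-stop command is written here with the escaped bar, and its input, space-separated sounds with "# " between words, is only a guess at the LexSpeak format made for this illustration.

```shell
# Hypothetical helper: one command covering all four names.
spell_seventh() {
  sed 's/\(Henry\|John\|Alexander\|Charles\) VII/\1 the seventh/g'
}

# Hypothetical helper: insert "? " before a vowel at the start of a line or after "# ".
add_glottal() {
  sed 's/\(^\|# \)\([aeiou]\)/\1? \2/g'
}

# Names in the group are rewritten; other names are left alone.
printf '%s\n' 'Henry VII' 'Charles VII' 'Louis VII' | spell_seventh

# Assumed input format: sounds separated by spaces, words separated by "# ".
printf '%s\n' 'a t # i n # d o g' | add_glottal
```

One thing to watch with the first command: it also matches the start of "Henry VIII" and leaves a stray "I" behind, which a trailing \> after VII would prevent.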

Special Characters
These characters should be escaped, that is, preceded by a backslash ("\"), if they are to be
used in the search-string of a search-and-replace command. In the replace-string, the backslash is not
necessary unless specified.

Symbol Name             Escaped Symbol

period                  \.
asterisk                \*
left square bracket     \[
right square bracket    \]
tab                     \t
newline                 \n (in search-string)
                        \r (in replace-string)
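
The effect of escaping can be seen by running the same input with and without the backslash. This is a sketch assuming GNU sed (needed for \t); the function names are invented for the illustration. Newline handling depends on the tool being used and is left out.

```shell
# Hypothetical helpers: an escaped period versus an unescaped one.
literal_dot() { sed 's/\./!/g'; }   # matches only a real period
any_char()    { sed 's/./!/g'; }    # unescaped, the period matches any character

printf '%s\n' 'a.b' | literal_dot
printf '%s\n' 'a.b' | any_char

# Hypothetical helpers for the other symbols in the table.
star_to_x()    { sed 's/\*/x/g'; }                  # literal asterisk
unbracket()    { sed 's/\[\([a-z]*\)\]/\1/g'; }     # literal square brackets
tab_to_space() { sed 's/\t/ /g'; }                  # tab character (GNU sed)

printf '%s\n' '2*3' | star_to_x
printf '%s\n' 'see [dog]' | unbracket
printf 'a\tb\n' | tab_to_space
```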