
Regular expressions and grep

The grep command (remember)


grep - print lines matching a pattern
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]

DESCRIPTION
grep searches the named input FILEs for lines containing a match to
the given PATTERN.

-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input
files.

-n, --line-number
Prefix each line of output with the 1-based line number
within its input file.

-c, --count
Suppress normal output; instead print a count of matching
lines for each input file.

- Display lines that contain the string Married

head -n 100 adult.data | grep "Married"

- Same, ignoring case

head -n 100 adult.data | grep -i "Married"

- Same, also showing the line number

head -n 100 adult.data | grep -in "Married"

- Count the matching lines (lines, not individual occurrences)

head -n 100 adult.data | grep -c "Married"
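
The difference between counting lines and counting occurrences can be checked on a small made-up sample (the sample text below is hypothetical, not taken from adult.data). This sketch assumes a grep that supports -o, which GNU and BSD grep do, although POSIX does not require it.

```shell
# Hypothetical sample: the first line contains "Married" twice.
sample=$(printf '%s\n' 'Married, Never-Married' 'Divorced' 'married')

# -c counts the lines that match (the search is case-sensitive here).
lines=$(printf '%s\n' "$sample" | grep -c "Married")

# -o prints every match on its own line, so wc -l counts occurrences.
occurrences=$(printf '%s\n' "$sample" | grep -o "Married" | wc -l)

echo "matching lines: $lines, occurrences: $((occurrences))"
```

Only one line matches, but that line holds two occurrences, so the two counts differ.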


The grep patterns (regular expressions)

- A pattern starting with ^ matches only at the beginning of a line
- A pattern ending with $ matches only at the end of a line
- A period (.) matches any single character
- A group of characters within brackets [ ] matches any single
  character in the group
- If the first symbol within the brackets is ^, the pattern
  matches any single character not in the group
- The symbol * after a character or expression means zero or
  more repetitions


- Display all lines starting with 25

head -n 100 adult.data | grep "^25"

- Display all lines ending with “Mexico, <=50K”

head -n 100 adult.data | grep "Mexico, <=50K$"

- Display all lines containing “th” preceded by any character

head -n 100 adult.data | grep ".th"

- Lines containing “th” preceded by any numeric character

head -n 100 adult.data | grep "[0-9]th"

- Same, preceded by a number (a nonzero digit followed by any further digits)

head -n 100 adult.data | grep "[1-9][0-9]*th"

- Lines whose first field is a number and whose second field is “Private”

head -n 100 adult.data | grep "^[1-9][0-9]*, Private"

- Lines containing at least one non-numeric character

head -n 100 adult.data | grep "[^0-9]"
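
One detail worth checking when * follows a bracket expression: "zero or more" includes zero, so a starred expression on its own can match the empty string and therefore selects every line. A minimal sketch on a hypothetical three-line sample (not taken from adult.data):

```shell
# Hypothetical sample: digits only, mixed content, letters only.
sample=$(printf '%s\n' '12345' '39, State-gov' 'abc')

# [^0-9]* can match the empty string, so all three lines are selected.
star_count=$(printf '%s\n' "$sample" | grep -c "[^0-9]*")

# [^0-9] needs one actual non-digit character, so only two lines are selected.
plain_count=$(printf '%s\n' "$sample" | grep -c "[^0-9]")

echo "with star: $star_count, without star: $plain_count"
```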


Exercises

- Ex. 1: Write a grep regular expression to extract the lines
  that contain only numbers

Escaping meta-characters

- If we want to search for a character that has a special meaning
  (such as . or ^), we have to precede it with a backslash \
- Lines starting with a capital letter and ending with a period

head -n 100 adult.data | grep "^[A-Z].*\.$"

- Lines that start with the ^ symbol

head -n 100 adult.data | grep "^\^"
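
The effect of the backslash can be seen on a small hypothetical sample. The sketch also uses -F, a standard grep option that treats the whole pattern as a fixed string, as an alternative to escaping; it is not mentioned in the slides above.

```shell
# Hypothetical sample: only the first line contains a literal period.
sample=$(printf '%s\n' '3.5' '315' '3x5')

# Unescaped, the period matches any character: all three lines match.
unescaped=$(printf '%s\n' "$sample" | grep -c '3.5')

# Escaped, the period is literal: only "3.5" matches.
escaped=$(printf '%s\n' "$sample" | grep -c '3\.5')

# -F disables all meta-characters, giving the same result without a backslash.
fixed=$(printf '%s\n' "$sample" | grep -cF '3.5')

echo "unescaped: $unescaped, escaped: $escaped, fixed string: $fixed"
```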

Splice-junction Gene Sequences data set

- For the following examples you must download the splice.data file

wget https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/splice-junction-gene-sequences/splice.data

Grouping

- Sometimes we may want to group expressions using parentheses
- With plain grep (basic regular expressions), if we want the ()
  symbols to be interpreted as group delimiters, we must escape them
- One or more consecutive appearances of the string GTA

cat splice.data | grep "GTA\(GTA\)*"

- The string “GTA(GTA”, possibly followed by one or more “)”

cat splice.data | grep "GTA(GTA)*"

Extended regular expressions

- The option -E allows grep to understand a more extensive set
  of regular expressions (egrep)
- Extended regular expressions include all the basic
  meta-characters plus additional ones
- For example, the () are group delimiters with the -E option

What do you think the following commands will do?

cat splice.data | grep -E "GTA\(GTA\)*"

cat splice.data | grep -E "GTA(GTA)*"
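
One way to check an answer to the question above is to run both spellings, with and without -E, on a tiny hypothetical input and look at what each one actually matches. The sketch assumes a grep with the -o option (GNU and BSD grep have it), which prints only the matched text.

```shell
# Hypothetical two-line sample: repeated GTA, and GTA with literal parentheses.
sample=$(printf '%s\n' 'GTAGTAGTA' 'GTA(GTA))')

# Parentheses as group delimiters: escaped in basic mode, bare with -E.
bre_group=$(printf '%s\n' "$sample" | grep -o 'GTA\(GTA\)*')
ere_group=$(printf '%s\n' "$sample" | grep -oE 'GTA(GTA)*')

# Parentheses as literal characters: bare in basic mode, escaped with -E.
bre_literal=$(printf '%s\n' "$sample" | grep -o 'GTA(GTA)*')
ere_literal=$(printf '%s\n' "$sample" | grep -oE 'GTA\(GTA\)*')

printf 'group (basic):\n%s\n' "$bre_group"
printf 'group (-E):\n%s\n' "$ere_group"
printf 'literal (basic): %s\n' "$bre_literal"
printf 'literal (-E): %s\n' "$ere_literal"
```

The two group variants print the same matches, and so do the two literal variants: -E swaps the meaning of escaped and unescaped parentheses.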

Extended regular expressions - Alternation

- The | symbol indicates alternative matches
- Lines containing either GACC or TCAG (or both)

cat splice.data | grep -E "GACC|TCAG"

- Lines containing one or more consecutive pairs of equal letters

cat splice.data | grep -E "(AA|GG|TT|CC)(AA|GG|TT|CC)*"
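
Alternation binds more loosely than anything else in the pattern, including the ^ and $ anchors, which is why the last example wraps the alternatives in parentheses. A short sketch on a hypothetical four-line sample shows the difference:

```shell
# Hypothetical sample sequences.
sample=$(printf '%s\n' 'GACCAAA' 'AAATCAG' 'AGACCA' 'TCAG')

# Without parentheses: "starts with GACC" OR "ends with TCAG".
loose=$(printf '%s\n' "$sample" | grep -cE '^GACC|TCAG$')

# With parentheses: the whole line is exactly GACC or exactly TCAG.
grouped=$(printf '%s\n' "$sample" | grep -cE '^(GACC|TCAG)$')

echo "without grouping: $loose lines, with grouping: $grouped line"
```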

Extended regular expressions - Quantifiers

After any expression:


- The symbol * means zero or more repetitions

cat splice.data | grep -E "(GACC)*"

- The symbol ? means zero or one repetition

cat splice.data | grep -E "(GACC)?"

- The symbol + means one or more repetitions

cat splice.data | grep -E "(GACC)+"


- {N}, where N is a number, means exactly N repetitions

cat splice.data | grep -E "(GACC){3}"

- {N,M}, where N ≤ M, means between N and M repetitions

cat splice.data | grep -E "(GACC){3,5}"
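
Because grep prints whole lines, it is not obvious how much of a line a quantifier consumed. The sketch below uses -o (available in GNU and BSD grep) on hypothetical sequences to show the matched text, and -c to show that * also accepts lines with no GACC at all.

```shell
# Hypothetical sequence with four consecutive copies of GACC.
seq4='GACCGACCGACCGACC'

# {2,3} takes as many copies as allowed (three); the one left over is too short.
interval=$(printf '%s\n' "$seq4" | grep -oE '(GACC){2,3}')

# + takes all four copies as a single match.
plus=$(printf '%s\n' "$seq4" | grep -oE '(GACC)+')

# * accepts zero copies, so it selects a line without any GACC.
star_lines=$(printf '%s\n' 'TTTT' | grep -cE '(GACC)*')

# + needs at least one copy. grep exits with status 1 when nothing
# matches, hence the "|| true".
plus_lines=$(printf '%s\n' 'TTTT' | grep -cE '(GACC)+' || true)

echo "{2,3} matched: $interval"
echo "+ matched: $plus"
echo "lines selected on TTTT: * -> $star_lines, + -> $plus_lines"
```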

Exercises

- Ex. 2: Write a grep regular expression that matches strings in
  the file splice.data with no T and an odd number of Gs
- Ex. 3: Write a grep regular expression that matches e-mail
  addresses of the form x.y@t.z, where:
  - x and y are non-empty strings that may contain lowercase
    letters, digits or the underscore symbol (_), but must start with
    a letter
  - t is a non-empty string that contains only lowercase letters
  - z is either es or com

Expansions

Arithmetic expansion

Expressions of the form

$((oper))
where oper is any operation with integer numbers, are expanded
by the shell to their corresponding value
- Examples:

echo $((4+3))

echo $((10/3))
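
Since only integers are handled, division truncates its result. The following sketch, with hypothetical variable names, also shows that shell variables can be used inside the double parentheses and that ordinary parentheses group sub-expressions.

```shell
# Hypothetical operands.
a=10
b=3

quotient=$((a / b))          # integer division: the fractional part is dropped
remainder=$((a % b))         # remainder of the division
grouped=$(( (a + b) * 2 ))   # parentheses change the evaluation order

echo "quotient=$quotient remainder=$remainder grouped=$grouped"
```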

Brace expansion

Expressions of the form

{list}
where list is a comma separated list of strings, are expanded by the
shell as follows
- Print the list A B C

echo {A,B,C}
- Print the list horse home hope horoscope

echo ho{rs,m,p,roscop}e


Expressions of the form

{x..y}
where x and y are integers or single characters, are expanded by the
shell in a similar way
- Print the list a b ... z

echo {a..z}
- Print the list 3 4 5 6 7

echo {3..7}


- List expansions can be in reverse order

echo {z..a}

echo {9..1}
- Brace expansions can be nested or placed side by side to combine them

echo {X{1,2},Y{3,4}}

echo {X,Y}{1..4}
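
Brace expansion is a feature of bash and similar shells rather than of every POSIX sh, so the sketch below calls bash explicitly; that wrapper is only there to keep the example independent of the shell running it. It captures the results of the nested and side-by-side forms and shows that quoted braces are left untouched.

```shell
# Nested braces: each inner list is expanded inside the outer one.
nested=$(bash -c 'echo {X{1,2},Y{3,4}}')

# Adjacent braces: every item on the left is combined with every item on the right.
product=$(bash -c 'echo {X,Y}{1..4}')

# Inside double quotes no brace expansion takes place.
quoted=$(bash -c 'echo "{A,B}"')

echo "nested:  $nested"
echo "product: $product"
echo "quoted:  $quoted"
```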

Exercises

- Ex. 4: Write a single line command that, using brace
  expansion, creates the list of directories mmm-yy, where
  mmm is a month (jan, feb,...) and yy is a year (16, 17, 18)

Command and process substitution

Command substitution

With command substitution we can use the output of a command
as an expansion

$(command)
- Print with echo the output of ls

echo $(ls)
- Make directories with the names of the unique values in the
  4th column of adult.data

mkdir $(cut -d "," -f 4 adult.data | sort | uniq)
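
The mkdir example relies on the shell splitting the unquoted substitution into separate words at spaces and newlines, which is also why leading spaces in the field values do no harm. A minimal sketch with a hypothetical helper, count_args, that reports how many arguments it receives:

```shell
# Hypothetical helper: prints the number of arguments it was given.
count_args() { echo "$#"; }

# Unquoted: the output is split into words, one argument per value.
unquoted=$(count_args $(printf ' Bachelors\n HS-grad\n'))

# Quoted: the whole output is passed as a single argument.
quoted=$(count_args "$(printf ' Bachelors\n HS-grad\n')")

echo "unquoted: $unquoted arguments, quoted: $quoted argument"
```

The same splitting means that a value containing a space would be turned into two directory names.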

Process substitution

Process substitution allows a process's input or output to be referred
to using a filename

command <(list)
Runs the commands in list and makes their output available under a
filename (typically a named pipe or a /dev/fd entry rather than a
regular file), which is passed to command as an argument to read from

command >(list)
Passes command a filename to write to; whatever command writes there
is used as input to list


- Add a first column with the line number to file adult.data

paste -d "," <(seq 1 32562) adult.data > adult-nums.data

- Display the difference in the output from two commands

diff <(command1) <(command2)

- Display first 10 sorted rows (similar to a pipeline)

sort splice.data > >(head -n 10)
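
Process substitution is provided by bash (and ksh and zsh), not by every POSIX sh, so this sketch calls bash explicitly. It repeats the paste idea on a tiny made-up input and then prints what a substitution expands to; the exact name varies by system, so no particular value is assumed.

```shell
# Number three hypothetical rows, as in the paste example but without a data file.
numbered=$(bash -c 'paste -d "," <(seq 1 3) <(printf "a\nb\nc\n")')

# The substitution itself is replaced by a filename that the command can open.
name=$(bash -c 'echo <(true)')

echo "$numbered"
echo "filename seen by the command: $name"
```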

Exercises

- Ex. 5: Extract the columns 6 and 4 (in this order) of file
  adult.data using a single line command
