Digital Evidence Search with the Power of Regular Expressions
A Pattern Matching Game
What types of evidence to match?
• Personal information
  • name, phone number, email address, date of birth, zip code, SSN
• User account validation
  • username, password
• Source code
  • HTML, Java
• Network
  • website visited (URL), IP address, hex, MAC address, timestamp
• File
  • file names, file attributes
What is a regular expression (regex)?
• A special text string that describes a search pattern
  • the string is written in an expression language
• Extremely useful for extracting information from
  • text files: source code, log files
  • documents: spreadsheets, PowerPoint, Word (need to unzip them first)
  • binary strings in files
• Often used with the grep command
Basic building blocks
f, r, a, n, k   Literal characters
|               Logic: or
( | )           Or relation within a group
[Link]
Download the lab files
Unzip the file
Verify some of the unzipped files
Verify the txt file. We will search the content of this file using regular expressions.
Search for a specific name “Frank”
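The literal and "or" building blocks can be tried out in a few lines. This is a minimal sketch that uses Python's re module as a stand-in for grep; the sample text is hypothetical and is not the lab's txt file.

```python
import re

# Hypothetical sample text (assumption: not the lab file).
text = "Frank Xu met frank and Franklin. Contact: Frank_Xu"

# Literal characters match themselves: F, r, a, n, k.
literal_hits = re.findall(r"Frank", text)

# ( | ) expresses "or" inside a group. (?: ) is the non-capturing form,
# used here so that findall returns the whole match.
either_case = re.findall(r"(?:F|f)rank", text)

print(literal_hits)  # the literal also hits the start of "Franklin"
print(either_case)
```

Note that a bare literal also matches inside longer words, which is why later slides add word boundaries.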
A simple pattern for all names
Frank_Xu: Frank, space, Xu
Part        Count        Pattern
Uppercase   one          [A-Z]
Lowercase   1-10         [a-z]{1,10}
Space       1 or more    \s+
Uppercase   one          [A-Z]
Lowercase   1-10         [a-z]{1,10}
Full pattern: [A-Z][a-z]{1,10}\s+[A-Z][a-z]{1,10}
Match all names in a text file
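As a quick check of the name pattern built above, here is a minimal sketch in Python (re standing in for grep). The sample sentence is hypothetical.

```python
import re

# Uppercase, 1-10 lowercase, 1+ whitespace, uppercase, 1-10 lowercase.
NAME = re.compile(r"[A-Z][a-z]{1,10}\s+[A-Z][a-z]{1,10}")

# Hypothetical text (assumption); note the three spaces in the second name.
text = "Frank Xu emailed Alice   Johnson about the 2 reports."

names = NAME.findall(text)
print(names)
```

Because \s+ allows one or more whitespace characters, names separated by several spaces are still matched.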
Word character: [a-zA-Z0-9_]
Word string: a run of word characters
Word boundary: \b
1. Before the first character in the word string
2. After the last character in the word string
! is NOT a word character
Example sentence: I am 50 years old, he is 2!
With each \b marked as |: |I| |am| |50| |years| |old|, |he| |is| |2|!
Search for two characters between any two \b
\<..\>: does not cross a \b, AND must include a word string [a-zA-Z0-9_]
\b..\b: any boundaries
-w: match only whole words [a-zA-Z0-9_]+
Test the pattern in a text file
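The boundary positions in the example sentence can be listed programmatically. The sketch below uses Python's re module; Python has no \< and \> anchors, so they are emulated here with \b plus a lookahead or lookbehind on \w, which is an assumption of this example rather than grep syntax.

```python
import re

text = "I am 50 years old, he is 2!"

# Every position where \b matches (start or end of a word string).
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# \b..\b: any two characters sitting between two boundaries.
any_two = re.findall(r"\b..\b", text)

# Emulation of \<..\>: start-of-word, two characters, end-of-word.
whole_two = re.findall(r"\b(?=\w)..\b(?<=\w)", text)

print(len(boundaries))
print(any_two)
print(whole_two)
```

The \b..\b form also returns pairs such as "I " and ", " because a space or comma can sit between two boundaries, while the start-of-word/end-of-word form returns only two-character words.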
Lookbehind and lookahead (the pattern has to be Perl-compatible)
Lookbehind   (?<=foo)xxx   Match xxx with a preceding string foo
Lookahead    xxx(?=foo)    Match xxx with a following string foo
Shorthand classes
\w   "word" character (letter, digit, or underscore): [a-zA-Z0-9_]
\d   digit
\s   whitespace (space, tab, vtab, newline)
grep options: -Po (-P: Perl-compatible regex, -o: print only the matched part)
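A minimal sketch of lookbehind and lookahead, using Python's re module in place of grep -Po. Both input strings are hypothetical; findall returns only the matched part, much like -o.

```python
import re

# Hypothetical log line and file listing (assumptions for illustration).
log = "user=frank pass=secret1 user=alice pass=hunter22"
listing = "a.txt 120KB b.txt 7MB c.txt 45KB"

# Lookbehind: a word string that is preceded by "user=".
after_user = re.findall(r"(?<=user=)\w+", log)

# Lookahead: digits that are followed by "KB".
before_kb = re.findall(r"\d+(?=KB)", listing)

print(after_user)
print(before_kb)
```

The text inside the lookaround is checked but is not part of the returned match.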
Negative lookaround
Negative lookbehind   (?<!foo)xxx   Match xxx without a preceding string foo
Negative lookahead    xxx(?!foo)    Match xxx without a following string foo
• A negative lookaround pattern MUST be in single quotes ' ' (inside double quotes, bash treats ! as history expansion)
• A lookbehind string must have a fixed length
Fix name mismatch problems using negative lookaround
This is not a name!
First try (negative): a name can't have numbers before it.
Second try (negative): “Ave” can't be the last name. Do more testing if necessary.
Exclude Ave, St, and Dr from the last name
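The two fixes can be sketched as follows. The exact patterns from the slides are not shown in the text, so the ones below are assumptions that follow the stated rules: no digit before the name, and no Ave, St, or Dr as the last name. The sample sentence is hypothetical.

```python
import re

# Hypothetical text (assumption): two real names and three street names.
text = ("Frank Xu lives at 100 Charles Ave, and John Smith "
        "works on Park Ave near Oak St.")

naive = r"\b[A-Z][a-z]{1,10}\s+[A-Z][a-z]{1,10}\b"

# First try: reject a match preceded by a digit and one whitespace
# character (fixed-length lookbehind).
first_try = r"(?<![0-9]\s)" + naive

# Second try: also reject Ave, St, and Dr as the last name.
second_try = (r"(?<![0-9]\s)\b[A-Z][a-z]{1,10}\s+"
              r"(?!(?:Ave|St|Dr)\b)[A-Z][a-z]{1,10}\b")

print(re.findall(naive, text))
print(re.findall(first_try, text))
print(re.findall(second_try, text))
```

Each step removes one class of false positives, which is why both positive and negative test cases are needed.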
Match phone numbers
• 1234567890
• 123.456.7890
• 123-456-7890
• 123 456 7890
• (123)456-7890
• +11234567890
Match any 10-digit phone number (xxxxxxxxxx, where x is a digit)
1234567890
[0-9]{10}
Match any 10-digit phone number with the pattern [Link], where x is a digit
123.456.7890
\b[0-9]{3}\.[0-9]{3}\.[0-9]{4}\b
Match both? Make each dot optional with {0,1}
1234567890
123.456.7890
Test both phone number types
Match all four?
1234567890
123.456.7890
123-456-7890
123 456 7890
Test all four phone number types
Match all five?
1234567890
123.456.7890
123-456-7890
123 456 7890
(123)456-7890
Conditional syntax: (?(if)then|else)
(\()?\b[0-9]{3}(?(1)\)|[. -]?)[0-9]{3}[. -]?[0-9]{4}\b
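The progression from one format to five can be checked against the sample numbers. This is a minimal sketch in Python, whose re module also supports the (?(1)then|else) conditional. The intermediate patterns BOTH and FOUR are assumptions reconstructed from the {0,1} and [. -]? pieces; only the last pattern is given in full above.

```python
import re

DIGITS_ONLY = r"\b[0-9]{10}\b"
DOTTED = r"\b[0-9]{3}\.[0-9]{3}\.[0-9]{4}\b"
BOTH = r"\b[0-9]{3}\.{0,1}[0-9]{3}\.{0,1}[0-9]{4}\b"       # assumption
FOUR = r"\b[0-9]{3}[. -]?[0-9]{3}[. -]?[0-9]{4}\b"          # assumption
# If group 1 matched "(", require ")"; otherwise allow ., space, or -.
FIVE = r"(\()?\b[0-9]{3}(?(1)\)|[. -]?)[0-9]{3}[. -]?[0-9]{4}\b"

samples = ["1234567890", "123.456.7890", "123-456-7890",
           "123 456 7890", "(123)456-7890", "+11234567890"]

def matched(pattern: str) -> list[str]:
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

for name, pattern in [("BOTH", BOTH), ("FOUR", FOUR), ("FIVE", FIVE)]:
    print(name, matched(pattern))
```

None of these patterns accepts +11234567890, because the eleven digits leave no word boundary after the tenth.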
Match email addresses
[Link]@[Link]
john-doe@[Link]
johnDoe001@[Link]
johndoe@[Link]
Part                          Rule                                              Pattern
Local part                    letters, digits, -, . ; 2-15 characters long      [a-zA-Z0-9.-]{2,15}
@                             literal @                                         @
Domain name (before the dot)  letters, digits; 1-15 characters long             [a-zA-Z0-9]{1,15}
.                             literal dot                                       \.
Domain name (after the dot)   letters, digits; 3-4 characters long              [a-zA-Z0-9]{3,4}
Full pattern: [a-zA-Z0-9.-]{2,15}@[a-zA-Z0-9]{1,15}\.[a-zA-Z0-9]{3,4}
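A short check of the email pattern, sketched in Python. The addresses below use example.com-style domains and are hypothetical, chosen to exercise each length rule.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9.-]{2,15}@[a-zA-Z0-9]{1,15}\.[a-zA-Z0-9]{3,4}")

# Hypothetical addresses (assumptions for illustration).
candidates = [
    "john.doe@example.com",     # dot in local part
    "john-doe@example.org",     # dash in local part
    "johnDoe001@example.info",  # 4-character ending
    "j@example.com",            # local part too short
    "johndoe@example.co",       # ending too short
]

valid = [c for c in candidates if EMAIL.fullmatch(c)]
print(valid)
```

The pattern is deliberately narrow; real addresses with subdomains or longer endings would need a wider pattern and more tests.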
Check if a password is valid
Pattern definition:
• Minimum length of 3, maximum length of 18
• Composed of letters, digits, dashes, or @
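The password rule maps onto one character class with a length range. This is a minimal sketch in Python; the pattern is an assumption derived from the two bullets above, and the sample passwords are hypothetical.

```python
import re

# Letters, digits, dash, or @; 3 to 18 characters; whole string must match.
PASSWORD = re.compile(r"[a-zA-Z0-9@-]{3,18}")

def is_valid(password: str) -> bool:
    return PASSWORD.fullmatch(password) is not None

print(is_valid("abc-123@X"))   # allowed characters, length 9
print(is_valid("ab"))          # too short
print(is_valid("has space"))   # space is not allowed
print(is_valid("a" * 19))      # too long
```

With grep, the same whole-string requirement would be written with ^ and $ anchors around the class.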
Match Java source code
Search for a string in Java code and show line numbers
Show how many times a keyword appears in Java source code
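Both tasks (line numbers and keyword counts) can be sketched in Python as a stand-in for grep's line-number and count output. The Java snippet and the keyword String are hypothetical choices.

```python
import re

# Hypothetical Java source (assumption: not one of the lab files).
java_source = """public class Demo {
    public static void main(String[] args) {
        String user = "frank";
        System.out.println(user);
    }
}"""

keyword = re.compile(r"\bString\b")

# Line numbers (1-based) of the lines that contain the keyword.
hits = [(number, line.strip())
        for number, line in enumerate(java_source.splitlines(), start=1)
        if keyword.search(line)]

# Total number of occurrences across the whole file.
count = len(keyword.findall(java_source))

for number, line in hits:
    print(f"{number}:{line}")
print(count)
```

Counting matches and counting matching lines can differ when a keyword appears twice on one line.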
Match HTML code (including tags and content) using backreference \n
<h1>This is a heading. </h1>
Opening tag: <h1>   Content: This is a heading.   Closing tag: </h1>
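A backreference lets the closing tag repeat whatever the opening tag captured. The pattern below is an assumption, since the slide's exact pattern is not in the text; it is sketched in Python, and the second element in the sample is deliberately mismatched.

```python
import re

# Group 1 captures the tag name; \1 requires the same name in the closing tag.
ELEMENT = re.compile(r"<([a-z][a-z0-9]*)>.*</\1>")

# The <p>...</h2> part is a hypothetical mismatched pair.
html = "<h1>This is a heading. </h1><p>Body</h2>"

elements = [m.group(0) for m in ELEMENT.finditer(html)]
tag_names = [m.group(1) for m in ELEMENT.finditer(html)]

print(elements)
print(tag_names)
```

The mismatched pair is skipped because \1 must equal the captured opening tag name.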
Match HTML tags using a lazy quantifier (MUST use -P)
<h1>This is a heading. </h1>
Greedy quantifier   Lazy quantifier   Description
*                   *?                Star quantifier: 0 or more
+                   +?                Plus quantifier: 1 or more
?                   ??                Optional quantifier: 0 or 1
{n}                 {n}?              Quantifier: exactly n
{n,}                {n,}?             Quantifier: n or more
{n,m}               {n,m}?            Quantifier: between n and m
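The difference between the two columns of the table shows up directly on the heading example. A minimal sketch in Python, whose quantifiers follow the same greedy and lazy rules:

```python
import re

html = "<h1>This is a heading. </h1>"

# Greedy: .+ grabs as much as possible, so one match spans both tags.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">", so each tag is matched separately.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```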
Match content of HTML code
<h1>This is a heading. </h1>
(?<=<([a-z0-9]{2})>).*(?=</\1>)
Lookbehind: (?<=<([a-z0-9]{2})>)   Lookahead: (?=</\1>)
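The content-only pattern can be verified on the heading example. This is a minimal sketch in Python, which accepts the pattern because the lookbehind has a fixed length of four characters.

```python
import re

CONTENT = re.compile(r"(?<=<([a-z0-9]{2})>).*(?=</\1>)")

html = "<h1>This is a heading. </h1>"

match = CONTENT.search(html)
content = match.group(0)   # text between the tags, tags excluded
tag = match.group(1)       # tag name captured inside the lookbehind

print(repr(content))
print(tag)
```

The {2} keeps the lookbehind fixed-length, so this form only handles two-character tag names such as h1.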
Match IPv4 address with group ()
[Link]
[Link]
[Link]
[Link]
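One way to use a group for IPv4 addresses is to repeat "1-3 digits plus a dot" three times. Both patterns below are assumptions, since the slide's pattern is not in the text, and the addresses are hypothetical; the sketch is in Python.

```python
import re

# Simple form: a group of "1-3 digits and a dot", repeated three times.
SIMPLE = r"\b([0-9]{1,3}\.){3}[0-9]{1,3}\b"

# Stricter form: each octet limited to 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1?[0-9]?[0-9])"
STRICT = rf"\b(?:{OCTET}\.){{3}}{OCTET}\b"

# Hypothetical addresses (assumptions for illustration).
candidates = ["192.0.2.10", "255.255.255.0", "999.1.1.1", "1.2.3"]

simple_ok = [c for c in candidates if re.fullmatch(SIMPLE, c)]
strict_ok = [c for c in candidates if re.fullmatch(STRICT, c)]

print(simple_ok)
print(strict_ok)
```

The simple form is easier to read but accepts out-of-range octets such as 999, which is a useful negative test.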
Match HTTP requests
• [Link]
• [Link]
Match variations of a website
• [Link]
• [Link]
• [Link]
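Since the example URLs are not preserved in the text, the following is only a sketch of how website variations (http or https, with or without www, with or without a path) might be covered by one pattern. The pattern and the example.com URLs are assumptions; the code is Python.

```python
import re

# Optional "s", optional "www.", dotted host name, optional path.
URL = re.compile(
    r"https?://(www\.)?[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+(/\S*)?"
)

# Hypothetical URLs (assumptions for illustration).
candidates = [
    "http://example.com",
    "https://www.example.com/index.html",
    "ftp://example.com",
    "http://localhost",
]

urls = [c for c in candidates if URL.fullmatch(c)]
print(urls)
```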
Match hexadecimal numbers
Match hex color codes
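A minimal sketch of both hex searches in Python. The 0x prefix convention, the six-digit #RRGGBB color form, and the sample text are assumptions, since the slide patterns are not in the text.

```python
import re

HEX_NUMBER = r"\b0[xX][0-9a-fA-F]+\b"   # e.g. 0x1F4A
HEX_COLOR = r"#[0-9a-fA-F]{6}\b"        # e.g. #FF8800

# Hypothetical text with two invalid candidates at the end.
text = ("offset 0x1F4A, mask 0XFF, color #FF8800 and #0a0b0c, "
        "not #12345G or 0xZZ")

hex_numbers = re.findall(HEX_NUMBER, text)
hex_colors = re.findall(HEX_COLOR, text)

print(hex_numbers)
print(hex_colors)
```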
Match MAC address
The standard (IEEE 802) format for printing MAC-48 addresses in human-friendly form is six groups of two hexadecimal digits, separated by hyphens (-) or colons (:).
01-23-45-67-89-AB
[Link]
Match MAC patterns
PaloAlto_[Link]
VMware_[Link]
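The six-groups-of-two-hex-digits format translates directly into a repeated group. The pattern below is an assumption based on that description, sketched in Python; the colon-separated and truncated samples are hypothetical.

```python
import re

# Five "two hex digits + separator" groups, then a final pair.
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}[-:]){5}[0-9A-Fa-f]{2}\b")

# Hypothetical capture text; the last address is one group short.
text = "src 01-23-45-67-89-AB dst 01:23:45:67:89:ab bad 01-23-45-67-89"

macs = MAC.findall(text)
print(macs)
```

This sketch accepts mixed separators within one address; a stricter version could capture the first separator and reuse it with a backreference.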
Match MAC from a pcap file
tshark help
• -r <infile>, --read-file <infile>: set the filename to read from (or '-' for stdin)
• -e <field>: field to print if -Tfields selected (e.g. [Link], _ws.[Link])
Convert pcap to text
Match the first pattern in a pcap file
Match the second pattern in a pcap file
Grep email from [Link]
.docx is a compressed file
content
Unzip .docx to a directory
The content of a .docx file is saved in an XML file.
grep “[Link]”, but the results show a lot of Word formatting information
Remove XML tags using sed
sed commands
Replace the character “1” in a phone number with “4”
Replace the character “-” in a phone number with “.”
Remove all “-”
<h1>This is a heading. </h1>
Remove all HTML tags (using a lazy match): first attempt failed because sed doesn't support Perl-style lazy quantifiers
Remove all HTML tags using a negated character set [^...]
Remove all XML tags
Minor issue
Replace the paragraph (paraID) tag with spaces
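The substitutions above can be mirrored with re.sub, used here as a Python stand-in for sed's s/pattern/replacement/ command. The phone number and the XML fragment are hypothetical, and the reading of the "minor issue" (adjacent paragraphs running together once tags are removed) is an assumption.

```python
import re

phone = "123-456-7891"
first_one = re.sub("1", "4", phone, count=1)  # like s/1/4/ (first match only)
all_ones = re.sub("1", "4", phone)            # like s/1/4/g
dotted = re.sub("-", ".", phone)              # like s/-/./g
no_dash = re.sub("-", "", phone)              # like s/-//g

# Negated character set: "<", anything that is not ">", then ">".
html = "<h1>This is a heading. </h1>"
no_tags = re.sub(r"<[^>]*>", "", html)

# Hypothetical Word XML fragment with two paragraphs.
xml = ('<w:p w14:paraId="11111111"><w:t>alice@example.com</w:t></w:p>'
       '<w:p w14:paraId="22222222"><w:t>bob@example.com</w:t></w:p>')
glued = re.sub(r"<[^>]*>", "", xml)             # paragraphs run together
spaced = re.sub(r"<w:p [^>]*>", " ", xml)       # paragraph tag -> space
separated = re.sub(r"<[^>]*>", "", spaced)

print(first_one, all_ones, dotted, no_dash)
print(repr(no_tags))
print(glued)
print(separated)
```

Replacing the paragraph tag with a space first keeps the two addresses apart, so a later email search sees them as separate strings.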
grep emails
Show .docx content without unzipping to disk (unzip -p: extract files to pipe, no messages)
Show .docx properties without unzipping to disk
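Reading a .docx member without writing it to disk can be sketched with Python's zipfile module, used here in place of unzip -p. To stay self-contained, the example builds a tiny stand-in archive in memory; its XML content is hypothetical, while word/document.xml and docProps/core.xml are the usual member names inside a .docx file.

```python
import io
import re
import zipfile

def build_sample_docx() -> bytes:
    """Build a minimal stand-in for a .docx archive (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            "<w:document><w:p><w:t>Contact frank@example.com</w:t></w:p></w:document>")
        archive.writestr(
            "docProps/core.xml",
            "<cp:coreProperties><dc:creator>Frank</dc:creator></cp:coreProperties>")
    return buffer.getvalue()

def read_member(docx_bytes: bytes, member: str) -> str:
    """Read one archive member straight into memory, nothing written to disk."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(member).decode("utf-8")

docx = build_sample_docx()

# Replace tags with spaces, then search the remaining text for emails.
body = re.sub(r"<[^>]*>", " ", read_member(docx, "word/document.xml"))
emails = re.findall(
    r"[a-zA-Z0-9.-]{2,15}@[a-zA-Z0-9]{1,15}\.[a-zA-Z0-9]{3,4}", body)

# Document properties: text that follows the opening dc:creator tag.
props = read_member(docx, "docProps/core.xml")
creators = re.findall(r"(?<=<dc:creator>)[^<]*", props)

print(emails)
print(creators)
```

For a real file, zipfile.ZipFile can be given the .docx path directly instead of an in-memory buffer.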
Summary
• grep is a powerful tool for extracting digital forensic evidence
• sed is a stream editor
• grep and sed use regular expressions (regex/patterns) to match text
• Key regex operations
• literal string: cat, character classes: [], [^], or: a|b, group: ()
• quantification: ?, *,+, {}
• scope: \b, \< \>, \w
• greedy vs. lazy: +?, *?, {}?
• back reference: \1, \2, …,\n
• lookahead and lookbehind: (?=), (?<=)
• Need both positive tests and negative tests
[Link]