Lecture 2: Regular Expressions, Text Normalization
Lecture Objectives:

• Students will be able to understand NLP tasks
• Students will be able to understand parsing algorithms

CSC-441: Natural Language Processing


What is NLP?
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Text Annotation Tasks

• Classification of individual word tokens


• Identify phrases
• Parsing
• Text Classification
• Semantic annotation
What is Regular Expression?

• Each regular expression (RE) represents a set of strings that share a certain pattern.
• In NLP, we can use REs to find strings having certain patterns in a given text.
Simple definition of regular expressions over an alphabet Σ:
• ε (the empty string) is a regular expression
• If a ∈ Σ, a is a regular expression
• Or: if E1 and E2 are REs, then E1 | E2 is a regular expression
• Concatenation: if E1 and E2 are REs, then E1E2 is a regular expression
• Kleene closure: if E is an RE, then E* is a regular expression
• Positive closure: if E is an RE, then E+ is a regular expression
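As a rough illustration of the or, concatenation, Kleene closure and positive closure rules, the sketch below uses Python's re module. The sample strings and the helper name accepts are made up for this example; re.fullmatch is used so that a pattern has to describe the whole string.

```python
import re

# One pattern per construction rule from the definition.
union = re.compile(r"a|b")        # or: E1 | E2
concat = re.compile(r"ab")        # concatenation: E1E2
kleene = re.compile(r"(ab)*")     # Kleene closure: zero or more copies of ab
positive = re.compile(r"(ab)+")   # positive closure: one or more copies of ab

def accepts(pattern, s):
    """Return True if the whole string s is in the set described by pattern."""
    return pattern.fullmatch(s) is not None

print(accepts(union, "a"), accepts(union, "ab"))            # True False
print(accepts(concat, "ab"), accepts(concat, "a"))          # True False
print(accepts(kleene, ""), accepts(positive, ""))           # True False
print(accepts(kleene, "ababab"), accepts(positive, "aba"))  # True False
```

The only difference between the two closures is the empty string: E* accepts it, E+ does not.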
Searching Strings with Regular Expressions

• How can we search for any of the following strings?


– woodchuck
– woodchucks
– Woodchuck
– Woodchucks
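One possible answer, sketched in Python: a character set handles the upper-case or lower-case first letter, and an optional final s handles the plural. The sample sentence is made up for illustration.

```python
import re

# [wW] accepts either case for the first letter; s? makes the final s optional.
pattern = re.compile(r"[wW]oodchucks?")

text = "Woodchucks are rodents. A woodchuck is also called a groundhog."
print(pattern.findall(text))  # ['Woodchucks', 'woodchuck']

# All four target strings are accepted as whole strings.
for word in ["woodchuck", "woodchucks", "Woodchuck", "Woodchucks"]:
    print(word, pattern.fullmatch(word) is not None)
```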

Regular Expression Application
• In a Web search engine, the texts being searched might be entire documents or Web pages
• In a word processor, they might be individual words or lines of a document
• E.g., the UNIX grep command
• E.g., dir *.*
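To give a feel for what a grep-style search does, here is a minimal Python sketch that keeps only the lines containing a match. The function name grep and the sample lines are assumptions made for this example; it is not how the real UNIX tool is implemented.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain at least one match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up sample lines standing in for the lines of a document.
lines = ["the woodchuck ran", "nothing here", "two Woodchucks slept"]
print(grep(r"[wW]oodchucks?", lines))
```

Note that dir *.* uses file-name wildcards, a simpler pattern language than regular expressions.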

Regular Expression Guide
Pattern Definition
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
+? Repeats a character one or more times (non-greedy)
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
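Two entries in this guide are easier to see in action: greedy versus non-greedy repetition, and parentheses for extraction. The sketch below uses Python's re module; the sample strings and the address in them are hypothetical.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so one long match is returned.
print(re.findall(r"<.+>", text))   # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first possible closing bracket.
print(re.findall(r"<.+?>", text))  # ['<b>', '</b>', '<i>', '</i>']

# Parentheses mark the part of the match to extract.
line = "From: alice@example.org Sat Jan 5"
print(re.findall(r"^From: (\S+@\S+)", line))  # ['alice@example.org']
```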
Regular Expressions: Anchors ^ $
• Anchors are special characters that anchor regular expressions to particular places in a string.
• The caret ^ matches the start of a string.
– The regular expression ^The matches the word The only at the start of a string.
• The dollar sign $ matches the end of a line.
Regular Expression Matches
.$ any character at the end of a string
\.$ dot character at the end of a string
^[A-Z] any uppercase character at the beginning of a string
^[A-Z][^\.]*is[^\.]*\.$ a string that contains “is”, starts with a capital letter, and ends with “.”
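The anchored patterns in this table can be checked quickly in Python. The test sentences below are made up; note that the last pattern looks for the letters "is" anywhere, so the "is" inside "This" also counts.

```python
import re

# ^The matches "The" only at the start of the string.
print(bool(re.search(r"^The", "The end")))     # True
print(bool(re.search(r"^The", "At The end")))  # False

# Starts with a capital, contains "is", ends with a period.
pattern = re.compile(r"^[A-Z][^\.]*is[^\.]*\.$")
sentences = [
    "This is a test.",   # capital, contains "is", ends with "."
    "this is a test.",   # no capital at the start
    "That was a test.",  # no "is" anywhere
    "This is a test",    # no final "."
]
for s in sentences:
    print(repr(s), bool(pattern.search(s)))
```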
Regular Expressions: Disjunctions, Negations in []
• Negations in []:
– The square brackets can also be used to specify what a single character cannot be, by use of the caret ^.
– If the caret ^ is the first symbol after the open square bracket [, the resulting pattern is negated.
Regular Expression Matches
[^A-Z] Not an upper case letter
[^a-z] Not a lower case letter
[^Ss] Neither ‘S’ nor ‘s’
[^e^] Neither e nor ^
a^b The pattern a^b
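A short Python check of the negated sets, using made-up sample strings. One caveat for Python specifically: outside square brackets its re module still treats ^ as an anchor, so the sketch escapes the caret as a\^b to match the literal text a^b.

```python
import re

print(re.findall(r"[^A-Z]", "NLP is Fun"))  # [' ', 'i', 's', ' ', 'u', 'n']
print(re.findall(r"[^a-z]", "NLP is"))      # ['N', 'L', 'P', ' ']
print(re.findall(r"[^Ss]", "Sass"))         # ['a']

# Inside brackets, a caret that is not first is an ordinary character.
print(re.findall(r"[^e^]", "e^x"))          # ['x']

# Literal a^b: escaped here because Python treats a bare ^ as an anchor.
print(re.findall(r"a\^b", "look up a^b now"))  # ['a^b']
```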
Regular Expressions: {} . ?
• {m,n} causes the resulting RE to match from m to n repetitions of the preceding RE.
• {m} specifies that exactly m copies of the previous RE should be matched
• The question mark ? marks optionality of the previous expression.
Regular Expression Matches
woodchucks? woodchuck or woodchucks
colou?r color or colour
(a|b)?c ac, bc, c
(ba){2,3} baba, bababa

• The wildcard expression dot . matches any single character (except a line break).
Regular Expression Matches
beg.n begin, begun, begxn, …
a.*b any string that starts with a and ends with b
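The rows of the two tables above can be verified with a small Python loop. This is only a sketch: the dictionary name examples is hypothetical, and re.fullmatch is used so that each string must match the pattern in its entirety.

```python
import re

# Pattern -> strings the tables say it should match.
examples = {
    r"woodchucks?": ["woodchuck", "woodchucks"],
    r"colou?r": ["color", "colour"],
    r"(a|b)?c": ["ac", "bc", "c"],
    r"(ba){2,3}": ["baba", "bababa"],
    r"beg.n": ["begin", "begun", "begxn"],
    r"a.*b": ["ab", "a long string ending in b"],
}

for pattern, strings in examples.items():
    for s in strings:
        print(pattern, repr(s), bool(re.fullmatch(pattern, s)))

# Counter-examples: one copy of "ba" is too few, four copies are too many.
print(bool(re.fullmatch(r"(ba){2,3}", "ba")))        # False
print(bool(re.fullmatch(r"(ba){2,3}", "babababa")))  # False
```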
Regular Expression: Writing Patterns
Online regex testing apps: PyRegex, Pythex

Pakistan, China and Iran are friends from years.


asiasamreen.bukc@bahria .edu.pk sent on Sat Jan 5 09:14:16 2008

What will be the results for:


•\S+@\S+
•[A-Z][a-zA-Z]+
•[A-Z][a-z]*
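The three patterns can be tried out in Python as well as in the online tools. In the sketch below the first sample sentence is taken from the slide as written, while the address in the second line is a hypothetical stand-in (someone@example.edu) and not the one shown above.

```python
import re

line1 = "Pakistan, China and Iran are friends from years."
# Hypothetical address used in place of the one on the slide.
line2 = "someone@example.edu sent on Sat Jan 5 09:14:16 2008"

# Non-whitespace, then @, then non-whitespace: an email-like token.
print(re.findall(r"\S+@\S+", line2))        # ['someone@example.edu']

# A capital letter followed by at least one more letter.
print(re.findall(r"[A-Z][a-zA-Z]+", line1)) # ['Pakistan', 'China', 'Iran']

# A capital letter followed by zero or more lower-case letters.
print(re.findall(r"[A-Z][a-z]*", line2))    # ['Sat', 'Jan']
```

The last two patterns differ on a lone capital letter: [A-Z][a-z]* accepts it, while [A-Z][a-zA-Z]+ needs at least two letters.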
Regular Expression: Writing Patterns in Python

>>> import re
>>> text1 = 'We will be moving next month to earn PKR 20000'
>>> w = re.findall(r'[0-9]+', text1)  # find every run of digits
>>> print(w)
['20000']
Books : Regular Expression

Text Classification
• Given:
– A representation of a document d
   • Issue: how to represent text documents.
   • Usually some type of high-dimensional space, e.g. bag of words
– A fixed set of classes: C = {c1, c2, …, cJ}
• Determine:
– The category of d: γ(d) ∈ C, where γ(d) is a classification function
– We want to build classification functions (“classifiers”).
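To make the notation concrete, here is a toy sketch of a classification function γ over a bag-of-words representation. Everything specific in it is an assumption for illustration: the two classes, the hand-picked cue words, and the scoring rule. A real classifier would learn its parameters from labelled data.

```python
from collections import Counter

CLASSES = ["spam", "not_spam"]  # hypothetical class set C

# Hypothetical cue words per class, chosen by hand for this example.
CUE_WORDS = {
    "spam": {"free", "winner", "prize"},
    "not_spam": {"meeting", "lecture", "assignment"},
}

def bag_of_words(document):
    """Represent a document as word counts, ignoring word order."""
    return Counter(document.lower().split())

def gamma(document):
    """Toy classification function: pick the class whose cue words occur most."""
    bag = bag_of_words(document)
    scores = {c: sum(bag[w] for w in CUE_WORDS[c]) for c in CLASSES}
    # On a tie, max returns the first class listed in CLASSES.
    return max(CLASSES, key=lambda c: scores[c])

print(gamma("free prize for the lucky winner"))                  # spam
print(gamma("lecture moved and assignment due after the meeting"))  # not_spam
```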
Text Categorization
• Is it spam?
• Is it Urdu?
• Is it interesting to this user?
– News filtering
– Helpdesk routing
• Is it interesting to this NLP program?
– e.g., should my calendar system try to interpret this email as an appointment (using info. extraction)?
• Where should it go in the directory?
– Yahoo! / Open Directory / digital libraries
– Which mail folder? (work, friends, junk, urgent ...)
Measuring Performance
• Precision = good messages kept / all messages kept

• Recall = good messages kept / all good messages
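The two ratios can be computed directly from sets of message ids. The sketch below is a minimal illustration; the function name precision_recall and the sample ids are made up.

```python
def precision_recall(kept, good):
    """kept: ids of messages the filter kept; good: ids of truly good messages."""
    good_kept = kept & good  # good messages that were kept
    precision = len(good_kept) / len(kept) if kept else 0.0
    recall = len(good_kept) / len(good) if good else 0.0
    return precision, recall

kept = {1, 2, 3, 4}        # the filter kept four messages
good = {2, 3, 4, 5, 6, 7}  # six messages were actually good
p, r = precision_recall(kept, good)
print(p, r)  # 0.75 0.5
```

Here three of the four kept messages are good (precision 0.75), but only three of the six good messages were kept (recall 0.5).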

Summary
Basic text processing includes

• Word tokenization
• Ordering
• Removal of unnecessary information, such as uninformative words
• The processed text can be used for text categorization.
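A minimal sketch of two of these steps, tokenization and removal of uninformative words, is given below. The stop-word list and both function names are assumptions made for this example, and the regular expression is only one of many possible tokenization rules.

```python
import re

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("Seventy-seven days of friendship")
print(tokens)                     # ['seventy', 'seven', 'days', 'of', 'friendship']
print(remove_stop_words(tokens))  # ['seventy', 'seven', 'days', 'friendship']
```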

Test Data
Hahhahhaha
Ye bat to manany wali h ap ki.
Very well said, Irshad ;);D
Seventy-seven days of friendship
Beautiful humanity.
My son's US friends have left Paris to be with their
families. We had a long talk and decided that he'd stay
in Paris to finish his semester. He doesn't want further
disruption in his studies that have already moved from
regular classes to online ones. Solitude he's fine with
