Expressions, Text
Normalization
Lecture Objectives:
Regular Expression Application
• In a Web search engine, the texts being searched might be
entire documents or Web pages
• In a word processor, they might be
individual words or lines of a document
Regular Expression Guide
Pattern Definition
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
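The guide entries can be tried out with Python's re module. The following is a minimal sketch; the sample strings and variable names are made up for illustration and do not come from the lecture.

```python
import re

# Hypothetical sample line, used only for illustration.
line = "From: alice@example.org to bob@example.com on Sat"

# ^ anchors the match at the start; \S+ is one or more non-whitespace characters.
first_word = re.findall(r"^\S+", line)

# Parentheses mark the part of each match that is extracted.
hosts = re.findall(r"@(\S+)", line)

# Greedy .+ takes as much as it can; non-greedy .+? takes as little as it can.
tagged = "<b>bold</b>"
greedy = re.findall(r"<.+>", tagged)
lazy = re.findall(r"<.+?>", tagged)

print(first_word)  # ['From:']
print(hosts)       # ['example.org', 'example.com']
print(greedy)      # ['<b>bold</b>']
print(lazy)        # ['<b>', '</b>']
```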
Regular Expressions:
Anchors ^ $
• Anchors are special characters that anchor regular expressions to particular
places in a string.
• The caret ^ matches the start of a string.
– The regular expression ^The matches the word The only at the start of a string.
• The dollar sign $ matches the end of a string.
Regular Expression Matches
.$ any character at the end of a string
\.$ dot character at the end of a string
^[A-Z] any uppercase character at the beginning of a string
^[A-Z][^\.]*is[^\.]*\.$ a string that starts with a capital letter, contains “is”, and ends with “.” (with no other “.” in between)
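The anchored patterns in the table can be checked with a short Python sketch. The sample strings are assumptions chosen to show one match and three near misses.

```python
import re

# Pattern from the table: capital letter first, "is" somewhere, final period only.
pattern = re.compile(r"^[A-Z][^\.]*is[^\.]*\.$")

# Hypothetical test strings.
samples = [
    "This is fine.",     # matches
    "this is fine.",     # no capital letter at the start
    "This is fine",      # no period at the end
    "It was. This is.",  # a period appears before the end
]
matches = [s for s in samples if pattern.search(s)]
print(matches)  # ['This is fine.']

# .$ accepts any final character; \.$ accepts only a literal dot.
last_char = re.search(r".$", "Hello!").group()
ends_with_dot = re.search(r"\.$", "Hello!") is not None
print(last_char, ends_with_dot)  # ! False
```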
Regular Expressions: Disjunctions
Negations in []
• Negations in []:
– Square brackets can also specify what a single character cannot be, by use of
the caret ^.
– If the caret ^ is the first symbol after the open square bracket [, the resulting pattern
is negated.
Regular Expression Matches
[^A-Z] Not an upper case letter
[^a-z] Not a lower case letter
[^Ss] Neither ‘S’ nor ‘s’
[^e^] Neither e nor ^
a\^b The literal text a^b (outside square brackets, ^ is an anchor and must be escaped)
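A quick way to see negated character classes at work is to list the characters each one accepts. This is a small sketch with made-up input strings. In Python's re module a caret outside square brackets acts as an anchor, so the last line writes the literal caret as \^.

```python
import re

# Each call returns the characters that the negated class accepts.
not_upper = re.findall(r"[^A-Z]", "AbC")
not_lower = re.findall(r"[^a-z]", "AbC")
not_s = re.findall(r"[^Ss]", "Says")
not_e_or_caret = re.findall(r"[^e^]", "e^f^")

print(not_upper)       # ['b']
print(not_lower)       # ['A', 'C']
print(not_s)           # ['a', 'y']
print(not_e_or_caret)  # ['f']

# Outside brackets, a literal caret has to be escaped.
literal_caret = re.search(r"a\^b", "look up a^b now").group()
print(literal_caret)   # a^b
```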
Regular Expressions: {} . ?
• {m,n} causes the resulting RE to match from m to n repetitions of the
preceding RE.
• {m} specifies that exactly m copies of the previous RE should be matched
• The question mark ? marks optionality of the previous expression.
Regular Expression Matches
woodchucks? woodchuck or woodchucks
colou?r color or colour
(a|b)?c ac, bc, c
(ba){2,3} baba, bababa
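The rows of this table can be verified with re.fullmatch, which succeeds only when the whole string fits the pattern. The helper name fully_matches and the candidate strings are assumptions made for this sketch.

```python
import re

def fully_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

plural = fully_matches(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"])
spelling = fully_matches(r"colou?r", ["color", "colour", "colouur"])
optional = fully_matches(r"(a|b)?c", ["ac", "bc", "c", "abc"])
repeated = fully_matches(r"(ba){2,3}", ["ba", "baba", "bababa", "babababa"])
exact = fully_matches(r"[0-9]{4}", ["123", "2024", "12345"])

print(plural)    # ['woodchuck', 'woodchucks']
print(spelling)  # ['color', 'colour']
print(optional)  # ['ac', 'bc', 'c']
print(repeated)  # ['baba', 'bababa']
print(exact)     # ['2024']
```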
>>> import re
>>> text1 = 'We will be moving next month to earn PKR 20000'
>>> w = re.findall(r'[0-9]+', text1)
>>> print(w)
['20000']
Books : Regular Expression
Text Classification
• Given:
– A representation of a document d. The issue is how to represent text
documents; usually this is some type of high-dimensional space, such as a bag of words.
– A fixed set of classes:
C = {c1, c2, …, cJ}
• Determine:
– The category of d: γ(d) ∈ C, where γ is a
classification function
– We want to build classification functions (“classifiers”).
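The setup above can be sketched in a few lines of Python. This is only a toy illustration: the class labels, the cue words, and the threshold are assumptions, and a real classifier would learn its decision rule from labelled data.

```python
import re
from collections import Counter

# The fixed class set C (hypothetical labels for this sketch).
CLASSES = ("spam", "not_spam")

# Hypothetical cue words; a trained classifier would learn weights instead.
SPAM_CUES = {"free", "winner", "prize"}

def bag_of_words(text):
    """Represent a document d as word counts, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def gamma(text):
    """Toy classification function: map a document to one class in CLASSES."""
    bag = bag_of_words(text)
    spam_hits = sum(bag[word] for word in SPAM_CUES)
    return "spam" if spam_hits >= 2 else "not_spam"

print(bag_of_words("the cat and the hat")["the"])  # 2
print(gamma("Free prize for every winner"))        # spam
print(gamma("We had a long talk"))                 # not_spam
```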
Text Categorization
• Is it spam?
• Is it Urdu?
• Is it interesting to this user?
– News filtering
– Helpdesk routing
• Is it interesting to this NLP program?
– e.g., should my calendar system try to interpret this email
as an appointment (using info. extraction)?
• Where should it go in the directory?
– Yahoo! / Open Directory / digital libraries
– Which mail folder? (work, friends, junk, urgent ...)
Measuring Performance
• Precision = good messages kept / all messages kept
• Recall = good messages kept / all good messages
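Both measures follow directly from two sets: the messages the filter kept and the messages that are actually good. The sketch below assumes messages are identified by made-up IDs; the function name precision_recall is also an assumption.

```python
def precision_recall(kept, good):
    """Compute precision and recall from kept messages and truly good messages."""
    kept, good = set(kept), set(good)
    good_kept = kept & good
    # Guard against division by zero when a set is empty.
    precision = len(good_kept) / len(kept) if kept else 0.0
    recall = len(good_kept) / len(good) if good else 0.0
    return precision, recall

# Hypothetical message IDs: 4 messages kept, 3 messages truly good, 2 in common.
kept = {"m1", "m2", "m3", "m4"}
good = {"m1", "m2", "m5"}
precision, recall = precision_recall(kept, good)

print(precision)         # 0.5
print(round(recall, 3))  # 0.667
```

Here precision is 2/4 because only two of the four kept messages are good, and recall is 2/3 because one good message was not kept.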
Summary
Basic text processing includes:
• Word tokenization
• Ordering
• Removal of unnecessary information, such as uninformative words
• The result can be used for categorization of text.
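Two of the steps in this summary, tokenization and removal of uninformative words, can be sketched with a regular expression and a small word list. The stop-word list and function names here are assumptions made for illustration, not a standard resource.

```python
import re

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "and", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into words, keeping inner hyphens and apostrophes."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("Seventy-seven days of friendship")
content_words = remove_stop_words(tokens)

print(tokens)         # ['seventy-seven', 'days', 'of', 'friendship']
print(content_words)  # ['seventy-seven', 'days', 'friendship']
```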
Test Data
Hahhahhaha
Ye bat to manany wali h ap ki.
Very well said, Irshad ;);D
Seventy-seven days of friendship
Beautiful humanity.
My son's US friends have left Paris to be with their
families. We had a long talk and decided that he'd stay
in Paris to finish his semester. He doesn't want further
disruption in his studies that have already moved from
regular classes to online ones. Solitude he's fine with