Chapter 12
Text and Web Mining
hammad.afzal@mcs.edu.pk
TWO DIFFERENT DEFINITIONS OF MINING
Goal-oriented (effectiveness driven)
Any process that generates useful results that are non-obvious
is called “mining”.
Keywords: “useful” + “non-obvious”
Data isn’t necessarily massive
WHAT IS INSIDE WATSON?
“Watson had access to 200 million pages of structured
and unstructured content consuming four terabytes of
disk storage including the full text of Wikipedia” – PC
World
TEXT MINING AROUND US
Sentiment analysis
Document summarization
Movie recommendation
News recommendation
HOW TO PERFORM TEXT MINING?
As computer scientists, we view it as
Text Mining = Data Mining + Text Data
[Figure: data mining techniques (applied machine learning, natural language processing, information retrieval) combined with text data sources (email, software documentation, scientific literature, blogs, tweets, web pages, news articles)]
MINING TEXT DATA: AN INTRODUCTION
05/25/2022
Structured Data:
  HomeLoan (
    Loanee: Frank Rizzo
    Lender: MWF
    Agency: Lake View
    Amount: $200,000
    Term: 15 years
  )
Multimedia:
  Loans($200K,[map],...)
Free Text:
  Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.
Hypertext:
  <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...
NATURAL LANGUAGE PROCESSING
Example sentence: A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
Syntactic analysis (parsing): Noun Phrase (A dog), Complex Verb (is chasing), Noun Phrase (a boy), Prep Phrase (on + Noun Phrase the playground); these combine into Verb Phrase, Verb Phrase, and finally Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: the semantic facts + the rule Scared(x) if Chasing(_,x,_). give Scared(b1)
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
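The inference step on this slide can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of facts and the helper name `apply_scared_rule` are assumptions made for the example, not part of the slides.

```python
# Facts from the semantic analysis, encoded as (predicate, arguments) tuples.
# This encoding is an assumption made for illustration.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}

def apply_scared_rule(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) once and return all facts."""
    derived = {
        ("Scared", (args[1],))          # x is the second argument of Chasing
        for pred, args in facts
        if pred == "Chasing"
    }
    return facts | derived

all_facts = apply_scared_rule(facts)
print(("Scared", ("b1",)) in all_facts)  # True
```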
LEVELS OF TEXT REPRESENTATION
Character (character n-grams and sequences)
Part-of-Speech Tags
Taxonomies/Thesaurus
CHARACTER LEVEL
A character-level representation of a text consists of sequences of characters…
…a document is represented by a frequency distribution of these sequences
Usually we deal with contiguous strings…
…each character sequence of length 1, 2, 3, …
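A minimal sketch of the character-level representation described above, assuming contiguous sequences of length 1 to 3. The function names `char_ngrams` and `char_profile` and the cut-off `max_n` are choices made for this example only.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all contiguous character sequences of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_profile(text, max_n=3):
    """Frequency distribution over character sequences of length 1..max_n."""
    profile = Counter()
    for n in range(1, max_n + 1):
        profile.update(char_ngrams(text, n))
    return profile

# Represent a short document by its character n-gram frequencies.
profile = char_profile("text mining", max_n=3)
print(profile["in"])  # "in" occurs twice in "mining"
```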
CHARACTER LEVEL: GOOD AND BAD SIDES
Useful for, e.g., spam detection and copy detection
WORD LEVEL
The most common representation of text, used by many techniques
Tokenization: split the text into words
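Tokenization can be sketched with a simple regular expression. This is a deliberately naive version: it assumes lowercased English text and treats any run of letters, digits, or apostrophes as a word. Real tokenizers handle many more cases (hyphens, abbreviations, other scripts).

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("A dog is chasing a boy on the playground.")
print(tokens)
# ['a', 'dog', 'is', 'chasing', 'a', 'boy', 'on', 'the', 'playground']
```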
STOP-WORDS
Word frequencies in texts follow a power-law distribution: a small number of very frequent words and a large number of low-frequency words
Stop-words are very frequent words that carry little information on their own (e.g. a, the, of)
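A small sketch of stop-word removal on already tokenized text. The stop-word set below is a tiny illustrative list chosen for this example, not a standard one; practical lists are much longer and often task-specific.

```python
from collections import Counter

# Illustrative stop-word list (an assumption for this example).
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = "a dog is chasing a boy on the playground".split()

# The most frequent token is a stop-word.
print(Counter(tokens).most_common(1))  # [('a', 2)]

print(remove_stop_words(tokens))
# ['dog', 'chasing', 'boy', 'playground']
```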
STEMMING
Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)
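The idea of mapping word forms to a common stem can be illustrated with naive suffix stripping. This is not the Porter stemmer or any other standard algorithm. The suffix list and the minimum stem length are assumptions made for the example, and the rule set will mis-stem many words (e.g. "news" becomes "new").

```python
# Illustrative suffixes, checked in this order (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

words = ["learns", "learned", "learning", "learn"]
print([naive_stem(w) for w in words])
# ['learn', 'learn', 'learn', 'learn']
```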
NOUN/VERB PHRASES
Instead of having just single words we can deal with phrases
Noun Phrase
Verb Phrase
PART OF SPEECH
Part-of-speech tags introduce word types, which make it possible to differentiate the functions of words
WORDNET – DATABASE OF LEXICAL RELATIONS
The most well-developed and widely used lexical database for English
Consists of 4 databases (nouns, verbs, adjectives, adverbs)
BAG-OF-TOKENS APPROACHES
COSINE SIMILARITY
Applications: information retrieval, biological taxonomy, gene feature mapping, ...
EXAMPLE: COSINE SIMILARITY
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||) = 25 / (6.481 * 4.123) = 0.94
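The worked example can be checked with a short Python sketch. The function name `cosine_similarity` is chosen for this example. It assumes both vectors have the same length and are non-zero (a zero vector would cause a division by zero).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))       # d1 · d2
    norm_u = math.sqrt(sum(a * a for a in u))    # ||d1||
    norm_v = math.sqrt(sum(b * b for b in v))    # ||d2||
    return dot / (norm_u * norm_v)

# Term-frequency vectors from the example slide.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```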