
CS423

DATA WAREHOUSING AND DATA MINING

Chapter 12
Text and Web Mining

Dr. Hammad Afzal

hammad.afzal@mcs.edu.pk

Department of Computer Software Engineering


National University of Sciences and Technology (NUST)
WHAT IS “TEXT MINING”?
 “Text mining, also referred to as text data mining,
roughly equivalent to text analytics, refers to the process
of deriving high-quality information from text.” -
Wikipedia

 “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” - Hearst, 1999

2
TWO DIFFERENT DEFINITIONS OF
MINING
 Goal-oriented (effectiveness driven)
 Any process that generates useful results that are non-obvious
is called “mining”.
 Keywords: “useful” + “non-obvious”
 Data isn’t necessarily massive

 Method-oriented (efficiency driven)
 Any process that involves extracting information from massive data is called “mining”
 Keywords: “massive” + “pattern”
 Patterns aren’t necessarily useful
3
KNOWLEDGE DISCOVERY FROM TEXT
DATA
 IBM’s Watson wins at Jeopardy! - 2011

4
WHAT IS INSIDE WATSON?
 “Watson had access to 200 million pages of structured
and unstructured content consuming four terabytes of
disk storage including the full text of Wikipedia” – PC
World

 “The sources of information for Watson include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Watson also used databases, taxonomies, and ontologies. Specifically, DBPedia, WordNet, and Yago were used.” – AI Magazine

5
TEXT MINING AROUND US
 Sentiment analysis

6
TEXT MINING AROUND US
 Sentiment analysis

7
TEXT MINING AROUND US
 Document summarization

8
TEXT MINING AROUND US
 Document summarization

9
TEXT MINING AROUND US
 Movie recommendation

10
TEXT MINING AROUND US
 News recommendation

11
HOW TO PERFORM TEXT MINING?
 As computer scientists, we view it as
 Text Mining = Data Mining + Text Data

[Figure: word cloud of text data sources and related fields – email, tweets, blogs, web pages, news articles, scientific literature, software documentation; natural language processing, information retrieval, machine learning]
12
MINING TEXT DATA: AN INTRODUCTION

Data Mining / Knowledge Discovery

The same information can appear as structured data, multimedia, free text, or hypertext:

Structured Data:
HomeLoan (
Loanee: Frank Rizzo
Lender: MWF
Agency: Lake View
Amount: $200,000
Term: 15 years )
Loans($200K, [map], ...)

Free Text:
Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.

Hypertext:
<a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...

13
NATURAL LANGUAGE PROCESSING
Example: “A dog is chasing a boy on the playground”

Lexical analysis (part-of-speech tagging):
Det Noun Aux Verb Det Noun Prep Det Noun

Syntactic analysis (parsing):
the tagged words group into Noun Phrases, a Complex Verb, and a Prep Phrase, which combine into Verb Phrases and finally a Sentence

Semantic analysis:
Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).

Inference:
Scared(x) if Chasing(_,x,_). => Scared(b1)

Pragmatic analysis (speech act):
a person saying this may be reminding another person to get the dog back…

14
LEVELS OF TEXT REPRESENTATION

 Character (character n-grams and sequences)

 Words (stop-words, stemming, lemmatization)

 Phrases (word n-grams, proximity features)

 Part-of-Speech Tags

 Taxonomies/Thesaurus
15
CHARACTER LEVEL
 Character-level representation of a text
 consists of sequences of characters…
 …a document is represented by a frequency distribution of character sequences

 Usually we deal with contiguous strings…
 …each character sequence of length 1, 2, 3, … represents a feature, with its frequency as the value

16
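A minimal sketch of building such a representation in plain Python (the helper name char_ngrams is ours, not from any library):

```python
from collections import Counter

def char_ngrams(text, n_max=3):
    """Count contiguous character n-grams of length 1..n_max."""
    text = text.lower()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Each n-gram becomes a feature; its count is the feature value.
print(char_ngrams("a dog is chasing a boy").most_common(5))
```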
CHARACTER LEVEL: GOOD AND BAD SIDES

 It captures simple patterns at the character level
 useful e.g. for spam detection, copy detection

 It is used as a basis for “string kernels”, in combination with SVMs, for capturing complex character-sequence patterns

 For deeper semantic tasks, the representation is too weak

17
WORD LEVEL
 The most common representation of text; used by many techniques

 Tokenization: split text into words

 Relations among word surface forms and their senses:
 Synonymy: different form, same meaning (e.g. singer, vocalist)
 Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)

18
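A minimal regex-based tokenizer as a sketch (real tokenizers handle hyphens, clitics, and punctuation more carefully):

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens using a simple regex."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Frank Rizzo bought his home in 1992."))
# ['frank', 'rizzo', 'bought', 'his', 'home', 'in', '1992']
```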
STOP-WORDS
 Word frequencies in text follow a power-law distribution: a small number of very frequent words and a large number of low-frequency words

 Stop-words are words that do not carry information

 Usually we remove them to help the methods perform better

 Stop-words are language dependent – examples:
 English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, …

19
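A sketch of stop-word removal; the STOP_WORDS set below is a tiny illustrative sample, not a complete list:

```python
# Toy stop-word list; real systems use larger, language-specific lists.
STOP_WORDS = {"a", "about", "above", "across", "after", "again", "against",
              "all", "almost", "alone", "along", "already", "is", "on", "the"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["a", "dog", "is", "chasing", "a", "boy"]))
# ['dog', 'chasing', 'boy']
```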
STEMMING
 Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)

 Stemming is a process of transforming a word into its stem (normalized form)

 Stemming provides an inexpensive mechanism to merge such variants into a single feature, as in the sketch below

20
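A sketch using the Porter stemmer from NLTK (assumes the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["learns", "learned", "learning"]:
    # All three variants collapse to the same stem, "learn".
    print(word, "->", stemmer.stem(word))
```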
NOUN/VERB PHRASES
 Instead of having just single words we can deal with phrases
 Noun Phrase
 Verb Phrase

21
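A sketch of extracting noun phrases with NLTK's regexp chunker (assumes nltk is installed with its tokenizer and tagger models downloaded; the grammar is a toy pattern, not a full NP grammar):

```python
import nltk

# Toy noun-phrase grammar: optional determiner, any adjectives, then noun(s).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("A dog is chasing a boy on the playground"))
print(chunker.parse(tagged))  # NP subtrees mark the noun phrases
```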
PART OF SPEECH
 Introduces word types, enabling us to differentiate words by their function

 Mainly used for “information extraction”, where we are interested in e.g. named entities, which are “noun phrases”

 Another possible use is reduction of the vocabulary (features)
 E.g. nouns carry most of the information in text documents

 POS taggers are usually learned with HMMs on manually tagged data

22
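A sketch of off-the-shelf tagging with NLTK (assumes nltk and its downloaded tokenizer/tagger models; note that NLTK's current default tagger is a perceptron tagger rather than an HMM):

```python
import nltk

tokens = nltk.word_tokenize("A dog is chasing a boy on the playground")
print(nltk.pos_tag(tokens))
# [('A', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('chasing', 'VBG'), ...]
```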
PART OF SPEECH

23
WORDNET – DATABASE OF LEXICAL RELATIONS
 The most well-developed and widely used lexical database for English

 Consists of 4 databases (nouns, verbs, adjectives, adverbs)

 Each database consists of sense entries – each sense consists of a set of synonyms, e.g.:
 musician, instrumentalist, player
 person, individual, someone
 life form, organism, being

24
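A sketch of looking up senses through NLTK's WordNet interface (assumes nltk is installed with the 'wordnet' corpus downloaded):

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("musician"):
    # Each synset is one sense; its lemma names are the synonym set.
    print(synset.name(), synset.lemma_names())
```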
WORDNET – DATABASE OF LEXICAL
RELATIONS

25
BAG-OF-TOKENS APPROACHES

Documents → Feature Extraction → Token Sets

Document (the Gettysburg Address):
“Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …”

Extracted token set:
nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1

Loses all order-specific information!

26

Severely limits context!

27
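A minimal bag-of-tokens sketch in plain Python; note that the counts keep no trace of word order:

```python
import re
from collections import Counter

text = ("Four score and seven years ago our fathers brought forth on this "
        "continent, a new nation, conceived in Liberty, and dedicated to the "
        "proposition that all men are created equal.")

tokens = re.findall(r"[A-Za-z]+", text)
bag = Counter(tokens)
print(bag["and"], bag["nation"])  # 2 1
```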
COSINE SIMILARITY
 A document can be represented by thousands of
attributes, each recording the frequency of a particular
word (such as keywords) or phrase in the document.

28
COSINE SIMILARITY

 Other vector objects: gene features in micro-arrays, …

 Applications: information retrieval, biological taxonomy, gene feature mapping, …

 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d

29
EXAMPLE: COSINE SIMILARITY

 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

30
EXAMPLE: COSINE SIMILARITY

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5·3 + 0·0 + 3·2 + 0·0 + 2·1 + 0·1 + 0·0 + 2·1 + 0·0 + 0·1 = 25
||d1|| = (5·5 + 0·0 + 3·3 + 0·0 + 2·2 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (3·3 + 0·0 + 2·2 + 0·0 + 1·1 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1)^0.5 = (17)^0.5 ≈ 4.123
cos(d1, d2) = 25 / (6.481 · 4.123) ≈ 0.94

31
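A from-scratch check of the worked example in plain Python (no external libraries):

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))  # 0.94
```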
