
CS423

DATA WAREHOUSING AND DATA MINING

Chapter 12
Text and Web Mining

Dr. Hammad Afzal

hammad.afzal@mcs.edu.pk

Department of Computer Software Engineering


National University of Sciences and Technology (NUST)
WHAT IS “TEXT MINING”?
 “Text mining, also referred to as text data mining,
roughly equivalent to text analytics, refers to the process
of deriving high-quality information from text.” -
Wikipedia

 “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” - Hearst, 1999

2
TWO DIFFERENT DEFINITIONS OF
MINING
 Goal-oriented (effectiveness driven)
 Any process that generates useful results that are non-obvious
is called “mining”.
 Keywords: “useful” + “non-obvious”
 Data isn’t necessarily massive

 Method-oriented (efficiency driven)
 Any process that involves extracting information from massive data is called “mining”
 Keywords: “massive” + “pattern”
 Patterns aren’t necessarily useful
3
KNOWLEDGE DISCOVERY FROM TEXT
DATA
 IBM’s Watson wins at Jeopardy! - 2011

4
WHAT IS INSIDE WATSON?
 “Watson had access to 200 million pages of structured
and unstructured content consuming four terabytes of
disk storage including the full text of Wikipedia” – PC
World

 “The sources of information for Watson include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Watson also used databases, taxonomies, and ontologies. Specifically, DBPedia, WordNet, and Yago were used.” – AI Magazine

5
TEXT MINING AROUND US
 Sentiment analysis

6
TEXT MINING AROUND US
 Sentiment analysis

7
TEXT MINING AROUND US
 Document summarization

8
TEXT MINING AROUND US
 Document summarization

9
TEXT MINING AROUND US
 Movie recommendation

10
TEXT MINING AROUND US
 News recommendation

11
HOW TO PERFORM TEXT MINING?
 As computer scientists, we view it as
 Text Mining = Data Mining + Text Data

[Figure: word cloud of text data sources and related fields – email, tweets, blogs, web pages, news articles, scientific literature, software documentation; natural language processing, information retrieval, machine learning]
12
MINING TEXT DATA: AN INTRODUCTION

Data Mining / Knowledge Discovery

The same information can appear as structured data, multimedia, free text, or hypertext:

Structured Data:
HomeLoan (
Loanee: Frank Rizzo
Lender: MWF
Agency: Lake View
Amount: $200,000
Term: 15 years )
Loans($200K, [map], ...)

Free Text:
Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.

Hypertext:
<a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...

13
NATURAL LANGUAGE PROCESSING
Example: “A dog is chasing a boy on the playground”

Lexical analysis (part-of-speech tagging):
Det Noun Aux Verb Det Noun Prep Det Noun

Syntactic analysis (parsing):
the tagged words group into Noun Phrases, a Complex Verb, and a Prep Phrase, which combine into Verb Phrases and finally a Sentence

Semantic analysis:
Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).

Inference:
Scared(x) if Chasing(_,x,_). => Scared(b1)

Pragmatic analysis (speech act):
a person saying this may be reminding another person to get the dog back…

14
LEVELS OF TEXT REPRESENTATION

 Character (character n-grams and sequences)

 Words (stop-words, stemming, lemmatization)

 Phrases (word n-grams, proximity features)

 Part-of-Speech Tags

 Taxonomies/Thesaurus
15
CHARACTER LEVEL
 Character-level representation of a text
 consists of sequences of characters…
 …a document is represented by a frequency distribution of character sequences

 Usually we deal with contiguous strings…
 …each character sequence of length 1, 2, 3, … represents a feature, with its frequency as the value

16
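A minimal sketch of building such a representation in plain Python (the helper name char_ngrams is ours, not from any library):

```python
from collections import Counter

def char_ngrams(text, n_max=3):
    """Count contiguous character n-grams of length 1..n_max."""
    text = text.lower()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Each n-gram becomes a feature; its count is the feature value.
print(char_ngrams("a dog is chasing a boy").most_common(5))
```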
CHARACTER LEVEL: GOOD AND BAD SIDES

 It captures simple patterns at the character level
 useful e.g. for spam detection, copy detection

 It is used as a basis for “string kernels”, in combination with SVMs, for capturing complex character-sequence patterns

 For deeper semantic tasks, the representation is too weak

17
WORD LEVEL
 The most common representation of text; used by many techniques

 Tokenization: split text into words

 Relations among word surface forms and their senses:
 Synonymy: different form, same meaning (e.g. singer, vocalist)
 Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)

18
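A minimal regex-based tokenizer as a sketch (real tokenizers handle hyphens, clitics, and punctuation more carefully):

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens using a simple regex."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Frank Rizzo bought his home in 1992."))
# ['frank', 'rizzo', 'bought', 'his', 'home', 'in', '1992']
```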
STOP-WORDS
 Word frequencies in text follow a power-law distribution: a small number of very frequent words and a large number of low-frequency words

 Stop-words are words that do not carry information

 Usually we remove them to help the methods perform better

 Stop-words are language dependent – examples:
 English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, …

19
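A sketch of stop-word removal; the STOP_WORDS set below is a tiny illustrative sample, not a complete list:

```python
# Toy stop-word list; real systems use larger, language-specific lists.
STOP_WORDS = {"a", "about", "above", "across", "after", "again", "against",
              "all", "almost", "alone", "along", "already", "is", "on", "the"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["a", "dog", "is", "chasing", "a", "boy"]))
# ['dog', 'chasing', 'boy']
```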
STEMMING
 Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)

 Stemming is a process of transforming a word into its stem (normalized form)

 Stemming provides an inexpensive mechanism to merge such variants into a single feature, as in the sketch below

20
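A sketch using the Porter stemmer from NLTK (assumes the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["learns", "learned", "learning"]:
    # All three variants collapse to the same stem, "learn".
    print(word, "->", stemmer.stem(word))
```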
NOUN/VERB PHRASES
 Instead of having just single words we can deal with phrases
 Noun Phrase
 Verb Phrase

21
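A sketch of extracting noun phrases with NLTK's regexp chunker (assumes nltk is installed with its tokenizer and tagger models downloaded; the grammar is a toy pattern, not a full NP grammar):

```python
import nltk

# Toy noun-phrase grammar: optional determiner, any adjectives, then noun(s).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("A dog is chasing a boy on the playground"))
print(chunker.parse(tagged))  # NP subtrees mark the noun phrases
```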
PART OF SPEECH
 Introduces word types, enabling us to differentiate words by their function

 Mainly used for “information extraction”, where we are interested in e.g. named entities, which are “noun phrases”

 Another possible use is reduction of the vocabulary (features)
 E.g. nouns carry most of the information in text documents

 POS taggers are usually learned with HMMs on manually tagged data

22
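A sketch of off-the-shelf tagging with NLTK (assumes nltk and its downloaded tokenizer/tagger models; note that NLTK's current default tagger is a perceptron tagger rather than an HMM):

```python
import nltk

tokens = nltk.word_tokenize("A dog is chasing a boy on the playground")
print(nltk.pos_tag(tokens))
# [('A', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('chasing', 'VBG'), ...]
```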
PART OF SPEECH

23
WORDNET – DATABASE OF LEXICAL RELATIONS
 The most well-developed and widely used lexical database for English

 Consists of 4 databases (nouns, verbs, adjectives, adverbs)

 Each database consists of sense entries – each sense consists of a set of synonyms, e.g.:
 musician, instrumentalist, player
 person, individual, someone
 life form, organism, being

24
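A sketch of looking up senses through NLTK's WordNet interface (assumes nltk is installed with the 'wordnet' corpus downloaded):

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("musician"):
    # Each synset is one sense; its lemma names are the synonym set.
    print(synset.name(), synset.lemma_names())
```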
WORDNET – DATABASE OF LEXICAL
RELATIONS

25
BAG-OF-TOKENS APPROACHES

Documents → Feature Extraction → Token Sets

Document (the Gettysburg Address):
“Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …”

Extracted token set:
nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1

Loses all order-specific information!

26

Severely limits context!

27
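A minimal bag-of-tokens sketch in plain Python; note that the counts keep no trace of word order:

```python
import re
from collections import Counter

text = ("Four score and seven years ago our fathers brought forth on this "
        "continent, a new nation, conceived in Liberty, and dedicated to the "
        "proposition that all men are created equal.")

tokens = re.findall(r"[A-Za-z]+", text)
bag = Counter(tokens)
print(bag["and"], bag["nation"])  # 2 1
```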
COSINE SIMILARITY
 A document can be represented by thousands of
attributes, each recording the frequency of a particular
word (such as keywords) or phrase in the document.

28
COSINE SIMILARITY

 Other vector objects: gene features in micro-arrays, …

 Applications: information retrieval, biological taxonomy, gene feature mapping, …

 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d

29
EXAMPLE: COSINE SIMILARITY

 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

30
EXAMPLE: COSINE SIMILARITY

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5·3 + 0·0 + 3·2 + 0·0 + 2·1 + 0·1 + 0·0 + 2·1 + 0·0 + 0·1 = 25
||d1|| = (5·5 + 0·0 + 3·3 + 0·0 + 2·2 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (3·3 + 0·0 + 2·2 + 0·0 + 1·1 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1)^0.5 = (17)^0.5 ≈ 4.123
cos(d1, d2) = 25 / (6.481 · 4.123) ≈ 0.94

31
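A from-scratch check of the worked example in plain Python (no external libraries):

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))  # 0.94
```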
