What is Text-Mining?
[Diagram: text mining at the intersection of Computational Linguistics and Data Analysis, applied to professional and cultural documents.]
Tutorial Contents
Word Level
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Word Level
Word Properties
Stop-Words
Stemming
Frequent N-Grams
Thesaurus (WordNet)
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Word Properties
Stop-words
Original text

Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on the Asia Pacific region.

Survey of Information Retrieval - a guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.
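A minimal sketch of stop-word filtering; the stop-word list below is a small illustrative subset, not the list used in the tutorial:

```python
# Minimal stop-word removal sketch; STOP_WORDS is a small illustrative
# subset, not a full stop-word list.
STOP_WORDS = {"a", "an", "and", "by", "even", "for", "of", "on",
              "the", "to", "with"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, strip punctuation, drop stop-words.
    tokens = (w.strip(".,-") for w in text.lower().split())
    return [w for w in tokens if w and w not in STOP_WORDS]

print(remove_stop_words("Survey of Information Retrieval - a guide to IR, "
                        "with an emphasis on web-based projects."))
```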
Stemming

Example suffix-rewriting rules from the Porter stemmer (http://www.tartarus.org/~martin/PorterStemmer/):

relational   -> relate
conditional  -> condition
valenci      -> valence
hesitanci    -> hesitance
digitizer    -> digitize
conformabli  -> conformable
radicalli    -> radical
differentli  -> different
vileli       -> vile
analogousli  -> analogous
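NLTK ships an implementation of the Porter stemmer, so the rules are easy to try (assuming NLTK is installed):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["relational", "conditional", "hesitancy", "digitizer",
             "radically", "analogously"]:
    # The full algorithm chains several rule steps, so final stems can be
    # shorter than the intermediate forms shown in the rule table above.
    print(word, "->", stemmer.stem(word))
```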
Frequent N-Grams

An n-gram is a sequence of n consecutive words (e.g., "machine learning" is a 2-gram). Keeping only the frequent n-grams drastically shrinks the candidate set:

2-grams: 1.4M -> 207K
3-grams: 742K -> 243K
4-grams: 309K -> 252K
5-grams: 262K -> 256K
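A minimal sketch of frequent n-gram extraction (tokenization and the frequency threshold are illustrative):

```python
from collections import Counter

def frequent_ngrams(tokens, n, min_count=2):
    # Count every sequence of n consecutive tokens, keep the frequent ones.
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return {ng: c for ng, c in counts.items() if c >= min_count}

tokens = "machine learning is fun and machine learning is useful".split()
print(frequent_ngrams(tokens, 2))
# {('machine', 'learning'): 2, ('learning', 'is'): 2}
```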
Thesaurus (WordNet)

Category    Unique Forms    Number of Senses
Noun        94474           116317
Verb        10319           22066
Adjective   20170           29881
Adverb      4546            5677
WordNet relations

Relation     Definition
Hypernym     From concepts to superordinate concepts
Hyponym      From concepts to subtypes
Has-Member   From groups to their members
Member-Of    From members to their groups
Has-Part     From wholes to parts
Part-Of      From parts to wholes
Antonym      Opposites
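These relations can be queried programmatically; a small sketch with NLTK's WordNet interface (assuming the corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

meal = wn.synsets("meal")[0]
print(meal.hypernyms())      # superordinate concepts (Hypernym)
print(meal.hyponyms())       # subtypes (Hyponym)
print(meal.part_meronyms())  # parts (Has-Part)
print(wn.synsets("leader")[0].lemmas()[0].antonyms())  # Antonym
```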
Word Level
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Word Level
Sentence Level
Document Level
Summarization
Single Document Visualization
Text Segmentation
Document-Collection Level
Linked-Document-Collection Level
Application Level
Summarization
Selection-based summarization

Each candidate unit U (typically a sentence) is scored:

Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)

The components involve a lot of heuristics and tuning of parameters (also with machine learning). Units whose weight exceeds a selection threshold are selected for the summary; a toy sketch follows.
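A toy sketch of this scheme using only the LocationInText and Statistics components, with made-up weights and the mean score as the selection threshold:

```python
from collections import Counter

def summarize(sentences):
    # Statistics(U): average corpus frequency of the sentence's words.
    freq = Counter(w.lower() for s in sentences for w in s.split())
    def weight(i, s):
        location = 1.0 if i == 0 else 0.0   # LocationInText(U): lead bonus
        words = [w.lower() for w in s.split()]
        stats = sum(freq[w] for w in words) / len(words)
        return location + stats
    scores = [weight(i, s) for i, s in enumerate(sentences)]
    threshold = sum(scores) / len(scores)   # selection threshold: the mean
    return [s for s, sc in zip(sentences, scores) if sc >= threshold]
```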
Visualization of a single document

Simple approach:
1. The text is split into sentences.
2. Each sentence is deep-parsed into its logical form.
[Figures: visualizations of a Clarence Thomas article and an Alan Greenspan article.]
Text Segmentation

Task: divide a text into segments of topically coherent, contiguous text.

Algorithm (TextTiling-style; see the sketch below): compare the lexical similarity of adjacent blocks of text and place segment boundaries where the similarity drops to a local minimum.
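A minimal sketch of this idea (window size and the boundary rule are illustrative; real TextTiling adds smoothing and depth scoring):

```python
def segment_boundaries(sentences, window=2):
    # Jaccard similarity between word sets of adjacent sentence windows.
    def words(block):
        return {w.lower() for s in block for w in s.split()}
    sims = []
    for i in range(window, len(sentences) - window + 1):
        left = words(sentences[i - window:i])
        right = words(sentences[i:i + window])
        sims.append(len(left & right) / len(left | right))
    # Place segment boundaries at local minima of the similarity curve.
    return [i + window for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]
```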
Word Level
Sentence Level
Document Level
Document-Collection Level
Representation
Feature Selection
Document Similarity
Representation Change (LSI)
Categorization (flat, hierarchical)
Clustering (flat, hierarchical)
Visualization
Information Extraction
Linked-Document-Collection Level
Application Level
Representation

Bag-of-words document representation, with word weighting by TF-IDF:

tfidf(w) = tf(w) * log( N / df(w) )

where tf(w) is the frequency of word w in the document, N is the number of documents in the collection, and df(w) is the number of documents containing w.
TRUMP MAKES BID FOR CONTROL OF RESORTS Casino owner and real estate developer
Donald Trump has offered to acquire all Class B common shares of Resorts
International Inc, a spokesman for Trump said. The estate of late Resorts
chairman James M. Crosby owns 340,783 of the 752,297 Class B shares.
Resorts also has about 6,432,000 Class A common shares outstanding. Each
Class B share has 100 times the voting power of a Class A share, giving the
Class B stock about 93 pct of Resorts' voting power.
[RESORTS:0.624] [CLASS:0.487] [TRUMP:0.367] [VOTING:0.171]
[ESTATE:0.166] [POWER:0.134] [CROSBY:0.134] [CASINO:0.119]
[DEVELOPER:0.118] [SHARES:0.117] [OWNER:0.102] [DONALD:0.097]
[COMMON:0.093] [GIVING:0.081] [OWNS:0.080] [MAKES:0.078] [TIMES:0.075]
[SHARE:0.072] [JAMES:0.070] [REAL:0.068] [CONTROL:0.065]
[ACQUIRE:0.064] [OFFERED:0.063] [BID:0.063] [LATE:0.062]
[OUTSTANDING:0.056] [SPOKESMAN:0.049] [CHAIRMAN:0.049]
[INTERNATIONAL:0.041] [STOCK:0.035] [YORK:0.035] [PCT:0.022]
[MARCH:0.011]
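A direct sketch of this weighting (natural log here; implementations differ in log base and normalization):

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, corpus):
    # corpus: list of token lists (the document collection).
    N = len(corpus)
    df = Counter(w for d in corpus for w in set(d))  # df(w)
    tf = Counter(doc_tokens)                          # tf(w)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}
```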
Feature Selection

InformationGain: IG(W) = sum over C in {pos, neg} and F in {W, not W} of P(F) * P(C|F) * log( P(C|F) / P(C) )

CrossEntropyTxt: CE(W) = P(W) * sum over C in {pos, neg} of P(C|W) * log( P(C|W) / P(C) )

MutualInfoTxt: MI(W) = sum over C in {pos, neg} of P(C) * log( P(W|C) / P(W) )

WeightOfEvidTxt: WE(W) = sum over C in {pos, neg} of P(C) * P(W) * | log( P(C|W) * (1 - P(C)) / ( P(C) * (1 - P(C|W)) ) ) |

OddsRatio: OR(W) = log( P(W|pos) * (1 - P(W|neg)) / ( (1 - P(W|pos)) * P(W|neg) ) )

Frequency: Freq(W)
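A sketch of the OddsRatio score with add-one smoothing to avoid division by zero (the smoothing choice is an assumption; the tutorial does not specify one):

```python
import math

def odds_ratio(word, pos_docs, neg_docs):
    # P(W|C) estimated as the fraction of class documents containing W,
    # with add-one smoothing.
    def p(docs):
        return (sum(word in d for d in docs) + 1) / (len(docs) + 2)
    p_pos, p_neg = p(pos_docs), p(neg_docs)
    return math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
```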
Odds Ratio: highest-scoring features

feature                score  [P(F|pos), P(F|neg)]
IR                     5.28   [0.075, 0.0004]
INFORMATION RETRIEVAL  5.13   ...
RETRIEVAL              4.77   [0.075, 0.0007]
GLASGOW                4.72   [0.03, 0.0003]
ASIA                   4.32   [0.03, 0.0004]
PACIFIC                4.02   [0.015, 0.0003]
INTERESTING            4.02   [0.015, 0.0003]
EMPHASIS               4.02   [0.015, 0.0003]
GROUP                  3.64   [0.045, 0.0012]
MASSACHUSETTS          3.46   [0.015, ...]
COMMERCIAL             3.46   [0.015, 0.0005]
REGION                 3.1    [0.015, 0.0007]
Odds Ratio: lowest-scoring features

feature         score  [P(F|pos), P(F|neg)]
LIBRARY         0.46   [0.015, 0.091]
PUBLIC          0.23   [0, 0.034]
PUBLIC LIBRARY  0.21   [0, 0.029]
UNIVERSITY      0.21   [0.045, 0.028]
LIBRARIES       0.197  [0.015, 0.026]
INFORMATION     0.17   [0.119, 0.021]
REFERENCES      0.117  [0.015, 0.012]
RESOURCES       0.11   [0.029, 0.0102]
COUNTY          0.096  [0, 0.0089]
INTERNET        0.091  [0, 0.00826]
LINKS           0.091  [0.015, 0.00819]
SERVICES        0.089  [0, 0.0079]
Document Similarity

Cosine similarity between document vectors x1 and x2:

sim(x1, x2) = sum_i x1_i * x2_i / ( sqrt(sum_j x1_j^2) * sqrt(sum_k x2_k^2) )
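The same formula on sparse bag-of-words dictionaries, as a minimal sketch:

```python
import math

def cosine(x1, x2):
    # x1, x2: sparse vectors as {word: weight} dicts.
    dot = sum(w * x2.get(k, 0.0) for k, w in x1.items())
    norm1 = math.sqrt(sum(w * w for w in x1.values()))
    norm2 = math.sqrt(sum(w * w for w in x2.values()))
    return dot / (norm1 * norm2)
```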
Representation Change:
Latent Semantic Indexing

LSI Example
[Table: term-by-document occurrence matrix over the terms cosmonaut, astronaut, moon, car, truck and the documents d1-d6.]
Correlation matrix

      d1     d2     d3     d4     d5     d6
d1    1.00
d2    0.8    1.00
d3    0.4    0.9    1.00
d4    0.5   -0.2   -0.6    1.00
d5    0.7    0.2   -0.3    0.9    1.00
d6    0.1   -0.5   -0.9    0.9    0.7    1.00
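LSI takes the truncated SVD of the term-by-document matrix; a minimal numpy sketch on an illustrative 0/1 matrix (not the slide's original data):

```python
import numpy as np

# Rows: terms (cosmonaut, astronaut, moon, car, truck); columns: d1-d6.
# Illustrative 0/1 occurrence values, not the slide's actual matrix.
A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # keep the 2 largest singular values
docs_2d = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the 2-D latent space
print(docs_2d)
```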
Text Categorization

Document categorization: a machine-learning method is trained on labeled documents; the resulting model assigns a category (label) to an unlabeled document.

Automatic Document Categorization Task
Perceptron algorithm

Input: set of pre-classified documents
Output: model, one weight for each word from the vocabulary
Algorithm (a runnable sketch follows):
initialize the model by setting all word weights to 0
iterate through the documents N times
  classify the document X represented as a bag-of-words:
  if sum_{i=1..V} x_i * w_i >= 0 predict the positive class, else predict the negative class
  if the document classification is wrong, then adjust the weights of all words occurring in the document:
  w_{t+1} = w_t + sign(trueClass) * beta, with beta > 0, where sign(positive) = 1 and sign(negative) = -1
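A direct sketch of the algorithm above (documents as bag-of-words count dicts; beta and the iteration count are the free parameters):

```python
def train_perceptron(docs, labels, vocab, n_iter=10, beta=1.0):
    # docs: list of bag-of-words dicts {word: count}; labels: +1 / -1.
    w = {word: 0.0 for word in vocab}            # initialize weights to 0
    for _ in range(n_iter):
        for x, y in zip(docs, labels):
            score = sum(c * w.get(word, 0.0) for word, c in x.items())
            pred = 1 if score >= 0 else -1
            if pred != y:                        # wrong: adjust weights of
                for word in x:                   # all words in the document
                    w[word] += y * beta          # w_{t+1} = w_t + sign*beta
    return w
```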
Measuring success:
Classification accuracy
Precision averaged over categories: Precision(M) = sum_i P(C_i) * Precision(M, C_i)
Break-even point (precision = recall)
F-measure (combines precision and recall = sensitivity):
F_beta(M, targetC) = (1 + beta^2) * Precision(M, targetC) * Recall(M, targetC) / ( beta^2 * Precision(M, targetC) + Recall(M, targetC) )
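For reference, the F-measure as a function (beta = 1 gives the common F1 score):

```python
def f_measure(precision, recall, beta=1.0):
    # (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_measure(0.8, 0.6))  # 0.6857...
```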
Reuters dataset
Categorization to flat categories
Distribution of documents (Reuters-21578)
[Figure: number of documents (0-4000) per category for the top 20 categories of Reuters news in 1987-91: earn, acq, money-fx, grain, crude, trade, interest, wheat, ship, corn, money-supply, dlr, sugar, gnp, coffee, veg-oil, gold, nat-gas, soybean, bop.]
Features with the highest positive class weight in the perceptron model:

Feature       Positive Class Weight
STAKE         11.5
MERGER        9.5
TAKEOVER      9
ACQUIRE       9
ACQUIRED      8
COMPLETES     7.5
OWNERSHIP     7.5
SALE          7.5
BUYOUT        7
ACQUISITION   6.5
UNDISCLOSED   6.5
BUYS          6.5
ASSETS        6
BID           6
BP            6
DIVISION      5.5
[Figure: break-even point (0-1) achieved by SVM, Perceptron, and Winnow across document representations (1-grams, 2-5-grams, 5-grams, proximity 3-grams with window 10, subject-object-predicate strings; with and without stemming).]
Comparison on Lewis-split
[Figure: break-even point (0.50-1.00) per category (acq, corn, crude, earn, grain, interest, money-fx, ship, trade, wheat, microavg.) comparing our SVM on stemmed 1-grams ("Us") with the results of Thorsten Joachims and Susan Dumais.]
Document to categorize:
CFP for CoNLL-2000
Some predicted categories:
System architecture

Training: labeled documents (from the Yahoo! hierarchy) are used for subproblem definition, feature construction (vectors of n-grams), feature selection, and classifier construction.

Classification: an unlabeled document from the Web is passed to the Document Classifier, which outputs content categories.
The classifier predicts the category C_i with the highest Naive Bayes score:

P(C_i | Doc) proportional to P(C_i) * product over W in Doc of P(W | C_i)^Freq(W, Doc)
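A sketch of this decision rule computed in log space, with a small floor probability for words unseen in the class (the floor value is an illustrative assumption):

```python
import math

def nb_score(doc_counts, prior, word_probs):
    # doc_counts: {word: Freq(W, Doc)}; word_probs: {word: P(W|C)}.
    score = math.log(prior)
    for w, freq in doc_counts.items():
        score += freq * math.log(word_probs.get(w, 1e-9))
    return score  # the argmax over categories gives the prediction
```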
Summary of experimental results

Domain      probability  rank  precision  recall
Entertain.  0.96         16    0.44       0.80
Arts        0.99         10    0.40       0.83
Computers   0.98         12    0.40       0.84
Education   0.99               0.57       0.65
Reference   0.99               0.51       0.81
Document Clustering

Common approaches:
K-Means clustering
Agglomerative hierarchical clustering
EM (Gaussian Mixture)

K-Means clustering
Given: a set of documents (e.g., as TF-IDF vectors), a distance measure (e.g., cosine similarity), and the number of clusters K (a minimal sketch follows).
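A minimal K-Means run on TF-IDF vectors using scikit-learn (the library choice, toy documents, and parameters are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cosmonaut flew to the moon", "astronaut on the moon",
        "truck and car on the road", "car dealership sells trucks"]
X = TfidfVectorizer().fit_transform(docs)       # documents as TF-IDF vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                # cluster assignment per document
```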
Visualization

Text visualizations:
WebSOM
ThemeScape
Graph-Based Visualization
Tiling-Based Visualization
A collection of approaches is available at http://nd.loopback.org/hyperd/zb/

WebSOM
Demo at http://websom.hut.fi/websom/
[Figure: WebSOM visualization.]
ThemeScape

Example of visualizing EU IST projects corpora

ThemeRiver
Information Extraction
(slides borrowed from William Cohen's tutorial on IE)
IE: from free text to a structured table

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Information Extraction =
segmentation + classification + association + clustering
(segmentation + classification is aka "named entity extraction")

Segmentation finds the entity strings in the text:
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
Information Extraction =
segmentation + classification + association + clustering

Classification assigns a type to each segment:
organizations: Microsoft Corporation, Microsoft, Free Software Foundation; titles: CEO, VP, founder; persons: Bill Gates, Gates, Bill Veghte, Richard Stallman
Information Extraction =
segmentation + classification + association + clustering

Association links the segments that belong to the same fact:
Bill Gates - CEO - Microsoft Corporation; Bill Veghte - VP - Microsoft; Richard Stallman - founder - Free Software Foundation
Information Extraction =
segmentation + classification + association + clustering

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying...
Clustering merges the mentions that refer to the same real-world entity; here the starred organization mentions all resolve to one entity:

* Microsoft Corporation, CEO, Bill Gates
* Microsoft, Gates
* Microsoft, Bill Veghte
* Microsoft, VP
Richard Stallman, founder, Free Software Foundation
The resulting table:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
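Segmentation plus classification corresponds to off-the-shelf named entity recognition; a quick sketch with spaCy (assuming the en_core_web_sm model is installed):

```python
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp('"We can be open source," said Bill Veghte, a Microsoft VP.')
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., Bill Veghte PERSON, Microsoft ORG
```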
IE in Context

A typical pipeline: create an ontology; spider a document collection; filter the documents by relevance; run IE (segment, classify, associate, cluster); load the results into a database; then query, search, and data-mine the database.
Typical approaches to IE
Word Level
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Using unlabeled data: EM with Naive Bayes
E-step: use the current classifier to probabilistically label the unlabeled documents.
M-step: use all documents to rebuild the classifier.
Co-training

Two independent views of the same web page train two classifiers (a schematic sketch follows):
Page Classifier: uses the document content, e.g., "Welcome all. I teach Physics at the University of North Dakota. See my publications".
Link Classifier: uses the hyperlinks pointing to the document, e.g., the anchor text "Bill's home page" on Professor Sly's page.
Starting from a small labeled set (12 labeled pages), each classifier labels additional examples for the other.
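A schematic sketch of the co-training loop (classifier objects with scikit-learn-style fit/predict are assumed; a real implementation would add each classifier's most confident predictions, which this sketch glosses over):

```python
def co_train(page_clf, link_clf, labeled, unlabeled, rounds=10, k=2):
    # labeled: list of ((page_view, link_view), label); unlabeled: view pairs.
    for _ in range(rounds):
        if not unlabeled:
            break
        page_clf.fit([v[0] for v, y in labeled], [y for v, y in labeled])
        link_clf.fit([v[1] for v, y in labeled], [y for v, y in labeled])
        # Each classifier labels k unlabeled examples for the other.
        batch, unlabeled = unlabeled[:2 * k], unlabeled[2 * k:]
        for views in batch[:k]:
            labeled.append((views, page_clf.predict([views[0]])[0]))
        for views in batch[k:]:
            labeled.append((views, link_clf.predict([views[1]])[0]))
    return page_clf, link_clf
```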
Word Level
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Question-Answering
Mixing Data Sources (KDD Cup 2003)
Question Answering
A typical pipeline, from question to answers:
1. Parse and classify the question (WordNet expansion, verb transformation, noun phrase identification).
2. Generate a keyword query.
3. Retrieve documents from the IR system.
4. Segment the responses (identification of sentence and paragraph boundaries, density of query terms in each segment, TextTiling).
5. Match the segments with the question and rank the segments.
6. Parse the top segments to produce the answers.
Mixing Data Sources (KDD Cup 2003)
(slides borrowed from Janez Brank & Jure Leskovec)

Solution
Most of our work was spent trying various features and adjusting their weights (more on that later).
A Look Back
[Figure: cross-validation training and test errors as feature sets are added (author + abstract, + in-degree, + in-links + out-links, + journal, + title-characters, + title-word-length, + year, + ClusCentDist, + ClusDlMed), compared with predicting the median, our submission, and the best three entries on the KDD Cup.]
References to Conferences
Autonomy
ClearForest
Megaputer
SAS/Enterprise-Miner
SPSS - Clementine
Oracle - ConText
IBM - Intelligent Miner for Text
Final Remarks