
Natural Language Processing
Caixia Yuan
yuancx@bupt.edu.cn


Overview




Supervised classification (review):
e.g., Naive Bayes: model P(x, y; θ) and, for an input x, predict
    y = arg max_y P(x, y; θ)
e.g., maximum entropy models (a.k.a. logistic regression): learn a function f ∈ F and, for an input x, predict
    y = sign(f(x))



Overview

Clustering: discovering structure in unlabeled data ("magic").
Applications: NLP / IR, recommendation systems, exploratory data analysis.



Example: clustering a collection of documents (here, product reviews mentioning items such as Samsung Note 4 and iPhone 7) into groups.
(figure: documents 1-6 and the clusters they fall into, e.g., {1, 3, 5, ...}, {4, 6, ...}, ..., {2, ...})



Example: clustering web search results.
A query such as Jaguar, NLP, or Paris Hilton is ambiguous: the top 10 results may mix pages about the animal with pages about other senses of the word.
Clustering the returned web pages by sense lets the user pick the intended one, e.g., Jaguar the animal or Jaguar the car.



Goal of clustering: intra-cluster distances are minimized, while inter-cluster distances are maximized.







The data can be summarized by an m × m distance matrix whose entry in row i and column j is d(x_i, x_j), the distance between data points x_i and x_j (i, j = 1, ..., m).






Data represented as vectors -- distance: Euclidean / cosine
Data represented as probability distributions -- distance: divergences





A distance function d is a metric if:
1. d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
2. d(x, y) + d(y, z) ≥ d(x, z)   (triangle inequality)
3. d(x, y) = d(y, x)   (symmetry)



Examples:

Euclidean Distance (L2):
    d(x, y) = ||x − y||_2 = sqrt( (x − y)^T (x − y) ) = sqrt( Σ_{i=1..d} (x_i − y_i)² )
Manhattan Distance (L1):
    d(x, y) = ||x − y||_1 = Σ_{i=1..d} |x_i − y_i|
Chebyshev Distance (L∞):
    d(x, y) = max_{1≤i≤d} |x_i − y_i|
Note: d(x, y) is a metric, whereas d²(x, y) is not a metric (the triangle inequality fails) but can still be used as a dissimilarity measure.



Examples:

Mahalanobis Distance:

    d(x, y) = sqrt( (x − y)^T Σ⁻¹ (x − y) ),   where Σ is the covariance matrix of the data
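A small NumPy sketch of these four distances (the helper names and the toy vectors below are mine, for illustration only):

```python
import numpy as np

def euclidean(x, y):
    # L2 norm: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    # L-infinity norm: largest absolute coordinate difference
    return np.max(np.abs(x - y))

def mahalanobis(x, y, cov):
    # sqrt((x - y)^T Sigma^{-1} (x - y)) for a given covariance matrix
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])
    cov = np.array([[2.0, 0.3, 0.0],
                    [0.3, 1.0, 0.0],
                    [0.0, 0.0, 0.5]])
    print(euclidean(x, y), manhattan(x, y), chebyshev(x, y), mahalanobis(x, y, cov))
```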



Distance between a point x and a set A:
    d(x, A) = (1/|A|) Σ_{y∈A} d(x, y)
Distance between two sets A and B:
    d(A, B) = (1/(|A| |B|)) Σ_{x∈A, y∈B} d(x, y)
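As a sketch, both averages can be computed from the full pairwise distance matrix, here with SciPy's cdist (the toy arrays are made up):

```python
import numpy as np
from scipy.spatial.distance import cdist

def point_to_set(x, A):
    # d(x, A) = (1/|A|) * sum over y in A of d(x, y)
    return cdist(x[None, :], A).mean()

def set_to_set(A, B):
    # d(A, B) = (1/(|A||B|)) * sum over x in A, y in B of d(x, y)
    return cdist(A, B).mean()

if __name__ == "__main__":
    A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    B = np.array([[5.0, 5.0], [6.0, 5.0]])
    x = np.array([2.0, 2.0])
    print(point_to_set(x, A), set_to_set(A, B))
```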



K-means
Hierarchical Clustering
Density-based Clustering
Gaussian Mixture Model
Spectral Clustering



K-means

Partition the data into K clusters; each cluster is represented by its centroid (the mean of its points), and every point is assigned to the cluster with the closest centroid.



The basic algorithm:
1: select K points as the initial centroids
2: repeat
3: Form K clusters by assigning all points to the closest centroid
4: Recompute the centroid of each cluster
5: until the centroids don't change



K-means Clustering

(figure: snapshots of a K-means run, showing assignments and centroids over successive iterations)



K-means


Objective: the Sum of Squared Error (SSE)
    SSE = Σ_{n=1..N} Σ_{k=1..K} r_nk dist(x_n, μ_k),   e.g., dist(x_n, μ_k) = ||x_n − μ_k||²
where μ_k is the centroid of cluster C_k, and r_nk = 1 if x_n ∈ C_k, r_nk = 0 otherwise.
Goal: find the centroids μ_k and assignments r_nk that minimize the SSE.



K-means

1: choose some initial values for the μ_k (k = 1, ..., K)
2: repeat
3:   minimize SSE with respect to the r_nk (n = 1, ..., N):
         r_nk = 1 if k = arg min_j ||x_n − μ_j||²,   0 otherwise
4:   minimize SSE with respect to the μ_k:
         μ_k = ( Σ_n r_nk x_n ) / ( Σ_n r_nk )
5: until convergence
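A minimal NumPy sketch of this alternating minimization (function and variable names are mine; in practice a library implementation such as sklearn.cluster.KMeans would normally be used):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: X is (N, d); returns centroids (K, d) and labels (N,)."""
    rng = np.random.default_rng(seed)
    # step 1: pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # step 3: assign each point to the closest centroid (minimize SSE w.r.t. r_nk)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # step 5: assignments stopped changing
        labels = new_labels
        # step 4: recompute each centroid as the mean of its points (minimize SSE w.r.t. mu_k)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
    centroids, labels = kmeans(X, K=2)
    print(centroids)
```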



K-means Details


K-means is sensitive to outliers; a more robust variant is K-medoids, where each cluster is represented by an actual data point instead of the mean.
Complexity: O(n · d · K · I), where n = number of points, d = number of attributes, K = number of clusters, I = number of iterations.



K-means Details

(figure: a K-means run on a 2-D data set, shown over iterations 1-6)



K-means Details

(figure: another K-means run on the same 2-D data set, from a different initialization)



K-means Details

(figures: the same data partitioned into two, four, and six clusters)



Hierarchical Clustering


Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram.
Two basic strategies:
  agglomerative: start with each point as its own cluster and repeatedly merge the closest pair (bottom-up)
  divisive: start with one all-inclusive cluster and repeatedly split (top-down)
(figure: six nested clusters and the corresponding dendrogram with merge heights)



Hierarchical Agglomerative Clustering (HAC)


Basic algorithm:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
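A sketch of running this agglomerative procedure with SciPy (the toy data and the choice of average linkage are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D data: two loose groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# steps 1-6: repeatedly merge the two closest clusters;
# 'average' = group-average linkage ('single', 'complete', 'ward' are alternatives)
Z = linkage(X, method="average", metric="euclidean")

# cut the dendrogram to obtain a flat clustering with 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```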




Starting situation: every data point is its own cluster, and we compute the proximity matrix between all points.
(figure: points p1, p2, ..., p12 and the point-level proximity matrix, one row and column per point)



Intermediate situation: after some merges we have clusters C1, C2, C3, C4, C5, and the proximity matrix is now defined between clusters.
(figure: the current clusters over the points p1, ..., p12 and the cluster-level proximity matrix)




We merge the two closest clusters (C2 and C5) and then need to update the proximity matrix.
(figure: the clusters before the merge, with the rows and columns of the proximity matrix affected by merging C2 and C5)





After the merge, the entries of the proximity matrix involving the new cluster C2 ∪ C5 (its proximity to C1, C3, and C4) are unknown and must be recomputed; how to recompute them is exactly the question of inter-cluster similarity.
(figure: the updated proximity matrix with "?" entries in the row and column of C2 ∪ C5)



Inter-Cluster Similarity

How do we measure the similarity between two clusters G and H?
The dissimilarity D(G, H) between the two groups is computed from the pairwise dissimilarities D(i, j), with i ∈ G and j ∈ H.
(figure: two candidate clusters highlighted in the point-level proximity matrix, labeled "Similarity?")



Inter-Cluster Similarity

Common definitions:
  MIN (single linkage)
  MAX (complete linkage)
  Group Average
  Distance Between Centroids
  Other methods driven by an objective function, e.g., Ward's Method uses squared error
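A small sketch contrasting four of these definitions on two made-up clusters, computed directly from the pairwise distance matrix:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])

D = cdist(A, B)                      # all pairwise distances between the two clusters
single   = D.min()                   # MIN (single linkage): closest pair
complete = D.max()                   # MAX (complete linkage): farthest pair
average  = D.mean()                  # group average: mean over all pairs
centroid = np.linalg.norm(A.mean(0) - B.mean(0))  # distance between centroids

print(single, complete, average, centroid)
```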



Hierarchical Clustering: Comparison
(figure: the same data set clustered with MIN, MAX, Group Average, and Ward's Method, showing how the resulting clusters differ)


Divisive Hierarchical Clustering
Example: social network graphs (figure: a graph to be split into communities)



Divisive Hierarchical Clustering

E.g., an MST-based approach:
  build the Minimum Spanning Tree (MST) of the proximity graph, e.g., with Prim's algorithm
  then form the hierarchy by repeatedly breaking the largest remaining MST edge



Divisive Hierarchical Clustering

E.g., an MST-based approach:
  the Minimum Spanning Tree can equally be built with Kruskal's algorithm
  (figure: step-by-step construction of the MST)
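One possible sketch of the MST-based divisive idea (the rule of cutting the longest edges and the toy data are my own illustration, not prescribed by the slides):

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(X, n_clusters):
    """Split data into n_clusters by cutting the longest MST edges."""
    D = squareform(pdist(X))                    # dense pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()    # nonzero entries are MST edge weights
    # remove the (n_clusters - 1) heaviest edges of the MST
    edges = np.argwhere(mst > 0)
    weights = mst[edges[:, 0], edges[:, 1]]
    for i, j in edges[np.argsort(weights)[::-1][: n_clusters - 1]]:
        mst[i, j] = 0.0
    # the connected components of what remains are the clusters
    _, labels = connected_components(mst, directed=False)
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])
    print(mst_divisive(X, 2))
```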



Hierarchical Clustering: Strengths

Do not have to assume any particular number of clusters:
  any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
The clusters may correspond to meaningful taxonomies:
  examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...)



Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
  sensitivity to noise and outliers
  difficulty handling different-sized clusters and convex shapes
  breaking large clusters



Probabilistic Clustering

Represent the probability distribution of the data as a mixture model:
  captures uncertainty in cluster assignments
  gives a model of the data distribution
  a Bayesian mixture model even allows us to determine K
Here we consider mixtures of Gaussians.



The Gaussian Distribution

The single (multivariate) Gaussian model:
    N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −(1/2) (x − μ)^T Σ⁻¹ (x − μ) )
where x is a D-dimensional vector, μ is the D-dimensional mean, and Σ is the D × D covariance matrix.




Maximum likelihood for a Gaussian: given a data set D = {x_i}, i = 1, ..., N, assumed to be drawn i.i.d. from a single Gaussian, estimate the Gaussian's parameters μ and Σ.




Maximize the log-likelihood
    ln p(D | μ, Σ) = Σ_{i=1..N} ln N(x_i | μ, Σ)
Setting its derivatives with respect to μ and Σ to zero gives the closed-form estimates
    μ_ML = (1/N) Σ_{i=1..N} x_i
    Σ_ML = (1/N) Σ_{i=1..N} (x_i − μ_ML)(x_i − μ_ML)^T
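A quick NumPy check of these closed-form estimates on synthetic data (the simulated Gaussian is an arbitrary choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.4], [0.4, 0.5]], size=500)

# maximum likelihood estimates for a single Gaussian
mu_ml = X.mean(axis=0)                            # (1/N) * sum_i x_i
sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)   # (1/N) * sum_i (x_i - mu)(x_i - mu)^T

# the maximized log-likelihood ln p(D | mu, Sigma)
loglik = multivariate_normal(mu_ml, sigma_ml).logpdf(X).sum()
print(mu_ml, sigma_ml, loglik)
```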



Mixtures of Gaussians
A Gaussian Mixture Model (GMM) is a linear superposition of Gaussians:
    p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k),   with π_k ≥ 0 and Σ_k π_k = 1




(figure: a density formed by mixing 3 Gaussians)



Sampling from a mixture of Gaussians
To generate a point x_n: first pick a component k with probability π_k, then draw x_n from that component's Gaussian N(x | μ_k, Σ_k).



Fitting a mixture of Gaussians
For clustering we observe only the data points, not which Gaussian generated each one: the component labels are latent (hidden) variables.




The posterior probability that component k generated a point x is called its responsibility; it is computed with Bayes' theorem:
    γ_k(x) = π_k N(x | μ_k, Σ_k) / Σ_{j=1..K} π_j N(x | μ_j, Σ_j)



Maximum likelihood for the GMM
The log-likelihood of the data is
    ln p(X | π, μ, Σ) = Σ_{n=1..N} ln { Σ_{k=1..K} π_k N(x_n | μ_k, Σ_k) }
Because of the sum inside the log, there is no closed-form maximum likelihood solution.



Maximum likelihood for the GMM
Setting the derivative of ln p(X | π, μ, Σ) with respect to μ_k to zero (the responsibilities γ_k(x_i) appear naturally) gives
    μ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) x_i,   where N_k = Σ_{i=1..N} γ_k(x_i)
N_k is the effective number of points assigned to component k.



Maximum likelihood for the GMM
Similarly, for the covariances:
    Σ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) (x_i − μ_k)(x_i − μ_k)^T
and, maximizing with respect to the mixing coefficients under the constraint Σ_k π_k = 1 (using a Lagrange multiplier):
    π_k = N_k / N = (1/N) Σ_{i=1..N} γ_k(x_i)



EM Algorithm





Starting from some initialization of the parameters, iterate:
  E-step: compute the responsibilities using the current parameter values
  M-step: re-estimate the parameters by MLE, using the current responsibilities
Each EM iteration is guaranteed not to decrease the likelihood.



EM Algorithm

E-step: responsibilities
    γ_k(x_i) = π_k N(x_i | μ_k, Σ_k) / Σ_{j=1..K} π_j N(x_i | μ_j, Σ_j)
M-step: MLE re-estimation
    μ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) x_i
    Σ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) (x_i − μ_k)(x_i − μ_k)^T
    π_k = N_k / N,   where N_k = Σ_{i=1..N} γ_k(x_i)
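A compact NumPy/SciPy sketch of these E and M steps (the initialization strategy and the fixed iteration count are simplified choices of mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Fit a K-component Gaussian mixture to X (N, d) with EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # crude initialization: random points as means, shared covariance, uniform weights
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_k(x_i) = pi_k N(x_i|mu_k,Sigma_k) / sum_j ...
        dens = np.stack([pi[k] * multivariate_normal(mu[k], sigma[k]).pdf(X)
                         for k in range(K)], axis=1)            # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_k, Sigma_k, pi_k from the responsibilities
        Nk = gamma.sum(axis=0)                                   # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
    return pi, mu, sigma, gamma

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([3, 3], 0.8, (100, 2))])
    pi, mu, sigma, gamma = gmm_em(X, K=2)
    print(pi, mu, sep="\n")
```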



Example



Clustering with a GMM
After fitting, each component defines a cluster: assign x to cluster C_k when component k has the largest responsibility for x (i.e., the largest π_k N(x | μ_k, Σ_k)); otherwise x ∉ C_k.



Relation to K-means
If the GMM's component covariances are shared and shrunk towards 0 (Σ_k = εI with ε → 0), the responsibilities become hard 0/1 assignments, and the EM algorithm reduces to the K-means algorithm.



Other issues: Over-fitting in Gaussian Mixture Models

If one component collapses onto a single data point, e.g., Σ_j = σ²I with μ_j = x_n, then as σ² → 0 that point's contribution grows without bound and the log-likelihood diverges: maximum likelihood for mixtures suffers from such singularities.



Problems and Solutions


The singularities of maximum likelihood can be avoided with a Bayesian treatment (priors over the parameters).
Choosing the number of components K can likewise be addressed with Bayesian model selection.



Spectral Clustering
The goal of spectral clustering is to cluster data that is connected but not necessarily compact or clustered within convex boundaries.
Compactness: close data points are in the same cluster; data points in different clusters are far away.
Connectivity: but data points in the same cluster may also be far away, even farther away than points in different clusters.
Our goal then is to transform the space so that when two points are close they are always in the same cluster, and when they are far apart they are in different clusters.
Spectral Clustering
Suppose the data points {x1, ..., xn} are organized into a similarity graph G = (V, E, W)
Each vertex vi in this graph represents a data point xi
Each edge eij is weighted by similarity wij between vi and vj
The graph construction depends on the application
The problem of clustering can now be reformulated
using the similarity graph:
to find a partition of the graph such that the edges between
different groups have very low weights and the edges within a
group have high weights



Graph partitioning

Clustering partitions the vertices of the graph.


A good clustering places dissimilar vertices in different
partitions.
The loss function for a partition (A, Ā) of the weighted graph (with weighted adjacency W) is given by the cut:
    cut(A, Ā) = Σ_{i∈A, j∈Ā} w_ij
For a given number k of subsets, the mincut approach simply consists in choosing a partition A_1, ..., A_k which minimizes
    cut(A_1, ..., A_k) = Σ_{i=1..k} cut(A_i, Ā_i)
Find a partition that minimizes the cut (mincut criterion) -- HOW?



In many cases, the solution of mincut simply separates one
individual vertex from the rest of the graph.

e.g., 2-way Partitioning...


Of course this is not what we want to achieve in clustering
Explicitly request that the sets A1, ..., Ak are "reasonably
large"

Two ways of measuring the "size" of a subset A:


|A| := the number of vertices in A
vol(A) := Σ_{i∈A} d_i, the sum of the vertex degrees in A



Graph partitioning
A loss function that favors such balanced clusters is the Normalized cut:
    Ncut(A_1, ..., A_k) = (1/2) Σ_{i=1..k} W(A_i, Ā_i) / vol(A_i) = Σ_{i=1..k} cut(A_i, Ā_i) / vol(A_i)
where vol(A) = Σ_{i∈A} d_i and d_i = Σ_{j=1..n} w_ij.

A good partition should separate dissimilar vertices


and should produce balanced clusters
Minimizing normalized cut is NP-hard.
One way of approximately optimizing normalized cuts
leads to spectral clustering



2-way Spectral Graph Partitioning

Partition membership indicator: a vector q with one entry q_i per vertex, encoding which side of the partition vertex i belongs to.
D = diag(d_1, ..., d_n),   d_i = Σ_{j=1..n} w_ij
Relax the indicators q_i from discrete values to continuous values; the solution for min J(q) is then given by the eigenvectors of
    (D − W) q = λ q



Properties of Graph Laplacian (Proof omitted)

Laplacian matrix of the graph: L = D − W
L is positive semi-definite: x^T L x ≥ 0 for any x.
Assume the eigenvalues are ordered increasingly: 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n.
The first eigenvector is q_1 = (1, ..., 1)^T, with λ_1 = 0.
The eigenvectors with small eigenvalues provide important information for clustering:
  the second eigenvector q_2 can serve as the desired (relaxed) solution
  higher eigenvectors are also useful



Recovering Partitions

From the definition of the cluster indicators, the partitions A and B are determined by the sign of the second eigenvector:
    A = {i | q_2(i) < 0},   B = {i | q_2(i) ≥ 0}
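A small sketch of this 2-way split using the second eigenvector of the unnormalized Laplacian L = D − W (the toy similarity graph is made up):

```python
import numpy as np

# toy similarity graph: vertices {0, 1, 2} and {3, 4} are only weakly connected
W = np.array([
    [0.0, 5.0, 4.0, 0.1, 0.0],
    [5.0, 0.0, 6.0, 0.0, 0.0],
    [4.0, 6.0, 0.0, 0.2, 0.0],
    [0.1, 0.0, 0.2, 0.0, 7.0],
    [0.0, 0.0, 0.0, 7.0, 0.0],
])

D = np.diag(W.sum(axis=1))           # degree matrix, d_i = sum_j w_ij
L = D - W                            # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues returned in increasing order
q2 = eigvecs[:, 1]                    # second eigenvector (Fiedler vector)

A = np.where(q2 < 0)[0]
B = np.where(q2 >= 0)[0]
print(A, B)
```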



Spectral Clustering

In the most common view:
  Given a set of points X = {x1, ..., xn} in R^D that we want to cluster into k subsets:
  Build a weighted graph G = (V, E, W).
  Construct a Laplacian matrix L = f(W) (different variants of spectral clustering result from different functions f).
  Compute the eigenvectors u1, ..., uk of the k smallest eigenvalues of L and stack them in columns, forming the matrix U = [u1, ..., uk] ∈ R^{n×k}. The rows of U provide a new representation of the original data points.
  Cluster the points in this new representation (e.g., using K-means).
  Note that there is no guarantee on the quality of the solution.



Spectral Clustering

Spectral clustering algorithm

Input: similarity matrix S ∈ R^{n×n}, number k of clusters to construct.
1: Construct a similarity graph G = (V, E). Let W be its weighted adjacency matrix.
2: Compute the Laplacian matrix L.
3: Compute the k eigenvectors u1, ..., uk corresponding to the k smallest eigenvalues of L.
4: Let U ∈ R^{n×k} be the matrix containing the vectors u1, ..., uk as columns.
5: For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row of U
   (i.e., x_i is represented by y_i = [U_i1, ..., U_ik]).
6: Cluster the points (y_i), i = 1, ..., n, in R^k with the K-means algorithm into clusters C1, ..., Ck.
Output: clusters A1, ..., Ak with A_i = {j | y_j ∈ C_i}.
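A sketch of the full pipeline in NumPy (the Gaussian similarity function, its width, and the ring-shaped toy data are illustrative choices; variants differ mainly in how L is normalized):

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def spectral_clustering(X, k, gamma=1.0, seed=0):
    """Unnormalized spectral clustering followed by K-means in R^k."""
    # 1: similarity graph (fully connected, Gaussian weights)
    W = np.exp(-gamma * squareform(pdist(X, "sqeuclidean")))
    np.fill_diagonal(W, 0.0)
    # 2: unnormalized Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # 3-5: rows of the k smallest eigenvectors give the new representation y_i
    _, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, :k]
    # 6: cluster the embedded points with K-means (simple Lloyd iterations)
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), k, replace=False)]
    for _ in range(100):
        labels = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
        centroids = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels

if __name__ == "__main__":
    # two concentric rings: compact methods struggle, connectivity-based clustering helps
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 200)
    r = np.r_[np.ones(100), 3 * np.ones(100)] + rng.normal(0, 0.05, 200)
    X = np.c_[r * np.cos(t), r * np.sin(t)]
    print(np.bincount(spectral_clustering(X, k=2, gamma=2.0)))
```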



In summary:
  build the similarity matrix W, e.g., from cosine similarity
  construct the Laplacian L = D − W (row-normalizing gives the transition-matrix variant)
  take the eigenvectors for the first k eigenvalues of L: this spectral mapping embeds each point in R^k
  run clustering (e.g., K-means) in R^k



Discussion

What are the time and space requirements of hierarchical clustering?
What are the pros and cons of single linkage vs. complete linkage and average linkage as the cluster similarity?
How to do multi-way graph partitioning?
  2-way partitioning applied recursively?
  using higher eigenvectors?
How does spectral clustering compare with K-means and PCA?



Summary

Clustering is cool
It's easy to find the most salient pattern
It's quite hard to find the pattern you want
It's hard to know how to fix when broken
EM is a useful optimization technique you
should understand well if you don't already



Next two lectures

Machine translation:
Ref. Philipp Koehn, Ch4, 5, 6
