
Natural Language Processing
Caixia Yuan
yuancx@bupt.edu.cn


Overview




Supervised classification (review):
e.g., Naive Bayes: model P(x, y; θ) and, for an input x, predict
    y = arg max_y P(x, y; θ)
e.g., maximum entropy models (a.k.a. logistic regression): learn a function f ∈ F and, for an input x, predict
    y = sign(f(x))



Overview

Clustering: discovering structure in unlabeled data ("magic").
Applications: NLP / IR, recommendation systems, exploratory data analysis.



Example: clustering a collection of documents (here, product reviews mentioning items such as Samsung Note 4 and iPhone 7) into groups.
(figure: documents 1-6 and the clusters they fall into, e.g., {1, 3, 5, ...}, {4, 6, ...}, ..., {2, ...})



Example: clustering web search results.
A query such as Jaguar, NLP, or Paris Hilton is ambiguous: the top 10 results may mix pages about the animal with pages about other senses of the word.
Clustering the returned web pages by sense lets the user pick the intended one, e.g., Jaguar the animal or Jaguar the car.



Goal of clustering: intra-cluster distances are minimized, while inter-cluster distances are maximized.







The data can be summarized by an m × m distance matrix whose entry in row i and column j is d(x_i, x_j), the distance between data points x_i and x_j (i, j = 1, ..., m).






Data represented as vectors -- distance: Euclidean / cosine
Data represented as probability distributions -- distance: divergences





A distance function d is a metric if:
1. d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
2. d(x, y) + d(y, z) ≥ d(x, z)   (triangle inequality)
3. d(x, y) = d(y, x)   (symmetry)



Examples:

Euclidean Distance (L2):
    d(x, y) = ||x − y||_2 = sqrt( (x − y)^T (x − y) ) = sqrt( Σ_{i=1..d} (x_i − y_i)² )
Manhattan Distance (L1):
    d(x, y) = ||x − y||_1 = Σ_{i=1..d} |x_i − y_i|
Chebyshev Distance (L∞):
    d(x, y) = max_{1≤i≤d} |x_i − y_i|
Note: d(x, y) is a metric, whereas d²(x, y) is not a metric (the triangle inequality fails) but can still be used as a dissimilarity measure.



Examples:

Mahalanobis Distance:

    d(x, y) = sqrt( (x − y)^T Σ⁻¹ (x − y) ),   where Σ is the covariance matrix of the data
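A small NumPy sketch of these four distances (the helper names and the toy vectors below are mine, for illustration only):

```python
import numpy as np

def euclidean(x, y):
    # L2 norm: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    # L-infinity norm: largest absolute coordinate difference
    return np.max(np.abs(x - y))

def mahalanobis(x, y, cov):
    # sqrt((x - y)^T Sigma^{-1} (x - y)) for a given covariance matrix
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])
    cov = np.array([[2.0, 0.3, 0.0],
                    [0.3, 1.0, 0.0],
                    [0.0, 0.0, 0.5]])
    print(euclidean(x, y), manhattan(x, y), chebyshev(x, y), mahalanobis(x, y, cov))
```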



Distance between a point x and a set A:
    d(x, A) = (1/|A|) Σ_{y∈A} d(x, y)
Distance between two sets A and B:
    d(A, B) = (1/(|A| |B|)) Σ_{x∈A, y∈B} d(x, y)
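As a sketch, both averages can be computed from the full pairwise distance matrix, here with SciPy's cdist (the toy arrays are made up):

```python
import numpy as np
from scipy.spatial.distance import cdist

def point_to_set(x, A):
    # d(x, A) = (1/|A|) * sum over y in A of d(x, y)
    return cdist(x[None, :], A).mean()

def set_to_set(A, B):
    # d(A, B) = (1/(|A||B|)) * sum over x in A, y in B of d(x, y)
    return cdist(A, B).mean()

if __name__ == "__main__":
    A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    B = np.array([[5.0, 5.0], [6.0, 5.0]])
    x = np.array([2.0, 2.0])
    print(point_to_set(x, A), set_to_set(A, B))
```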



K-means
Hierarchical Clustering
Density-based Clustering
Gaussian Mixture Model
Spectral Clustering



K-means

Partition the data into K clusters; each cluster is represented by its centroid (the mean of its points), and every point is assigned to the cluster with the closest centroid.



The basic algorithm:
1: select K points as the initial centroids
2: repeat
3: Form K clusters by assigning all points to the closest centroid
4: Recompute the centroid of each cluster
5: until the centroids don't change



K-means Clustering

(figure: snapshots of a K-means run, showing assignments and centroids over successive iterations)



K-means


Objective: the Sum of Squared Error (SSE)
    SSE = Σ_{n=1..N} Σ_{k=1..K} r_nk dist(x_n, μ_k),   e.g., dist(x_n, μ_k) = ||x_n − μ_k||²
where μ_k is the centroid of cluster C_k, and r_nk = 1 if x_n ∈ C_k, r_nk = 0 otherwise.
Goal: find the centroids μ_k and assignments r_nk that minimize the SSE.



K-means

1: choose some initial values for the μ_k (k = 1, ..., K)
2: repeat
3:   minimize SSE with respect to the r_nk (n = 1, ..., N):
         r_nk = 1 if k = arg min_j ||x_n − μ_j||²,   0 otherwise
4:   minimize SSE with respect to the μ_k:
         μ_k = ( Σ_n r_nk x_n ) / ( Σ_n r_nk )
5: until convergence
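A minimal NumPy sketch of this alternating minimization (function and variable names are mine; in practice a library implementation such as sklearn.cluster.KMeans would normally be used):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: X is (N, d); returns centroids (K, d) and labels (N,)."""
    rng = np.random.default_rng(seed)
    # step 1: pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # step 3: assign each point to the closest centroid (minimize SSE w.r.t. r_nk)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # step 5: assignments stopped changing
        labels = new_labels
        # step 4: recompute each centroid as the mean of its points (minimize SSE w.r.t. mu_k)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
    centroids, labels = kmeans(X, K=2)
    print(centroids)
```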



K-means Details


K-means is sensitive to outliers; a more robust variant is K-medoids, where each cluster is represented by an actual data point instead of the mean.
Complexity: O(n · d · K · I), where n = number of points, d = number of attributes, K = number of clusters, I = number of iterations.



K-means Details

(figure: a K-means run on a 2-D data set, shown over iterations 1-6)



K-means Details

(figure: another K-means run on the same 2-D data set, from a different initialization)



K-means Details

(figures: the same data partitioned into two, four, and six clusters)



Hierarchical Clustering


Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram.
Two basic strategies:
  agglomerative: start with each point as its own cluster and repeatedly merge the closest pair (bottom-up)
  divisive: start with one all-inclusive cluster and repeatedly split (top-down)
(figure: six nested clusters and the corresponding dendrogram with merge heights)



Hierarchical Agglomerative Clustering (HAC)


Basic algorithm:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
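A sketch of running this agglomerative procedure with SciPy (the toy data and the choice of average linkage are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D data: two loose groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# steps 1-6: repeatedly merge the two closest clusters;
# 'average' = group-average linkage ('single', 'complete', 'ward' are alternatives)
Z = linkage(X, method="average", metric="euclidean")

# cut the dendrogram to obtain a flat clustering with 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```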




Starting situation: every data point is its own cluster, and we compute the proximity matrix between all points.
(figure: points p1, p2, ..., p12 and the point-level proximity matrix, one row and column per point)



Intermediate situation: after some merges we have clusters C1, C2, C3, C4, C5, and the proximity matrix is now defined between clusters.
(figure: the current clusters over the points p1, ..., p12 and the cluster-level proximity matrix)




We merge the two closest clusters (C2 and C5) and then need to update the proximity matrix.
(figure: the clusters before the merge, with the rows and columns of the proximity matrix affected by merging C2 and C5)





After the merge, the entries of the proximity matrix involving the new cluster C2 ∪ C5 (its proximity to C1, C3, and C4) are unknown and must be recomputed; how to recompute them is exactly the question of inter-cluster similarity.
(figure: the updated proximity matrix with "?" entries in the row and column of C2 ∪ C5)



Inter-Cluster Similarity

How do we measure the similarity between two clusters G and H?
The dissimilarity D(G, H) between the two groups is computed from the pairwise dissimilarities D(i, j), with i ∈ G and j ∈ H.
(figure: two candidate clusters highlighted in the point-level proximity matrix, labeled "Similarity?")



Inter-Cluster Similarity

Common definitions:
  MIN (single linkage)
  MAX (complete linkage)
  Group Average
  Distance Between Centroids
  Other methods driven by an objective function, e.g., Ward's Method uses squared error
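A small sketch contrasting four of these definitions on two made-up clusters, computed directly from the pairwise distance matrix:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])

D = cdist(A, B)                      # all pairwise distances between the two clusters
single   = D.min()                   # MIN (single linkage): closest pair
complete = D.max()                   # MAX (complete linkage): farthest pair
average  = D.mean()                  # group average: mean over all pairs
centroid = np.linalg.norm(A.mean(0) - B.mean(0))  # distance between centroids

print(single, complete, average, centroid)
```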



Hierarchical Clustering: Comparison
(figure: the same data set clustered with MIN, MAX, Group Average, and Ward's Method, showing how the resulting clusters differ)


Divisive Hierarchical Clustering
Example: social network graphs (figure: a graph to be split into communities)



Divisive Hierarchical Clustering

E.g., an MST-based approach:
  build the Minimum Spanning Tree (MST) of the proximity graph, e.g., with Prim's algorithm
  then form the hierarchy by repeatedly breaking the largest remaining MST edge



Divisive Hierarchical Clustering

E.g., an MST-based approach:
  the Minimum Spanning Tree can equally be built with Kruskal's algorithm
  (figure: step-by-step construction of the MST)
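One possible sketch of the MST-based divisive idea (the rule of cutting the longest edges and the toy data are my own illustration, not prescribed by the slides):

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(X, n_clusters):
    """Split data into n_clusters by cutting the longest MST edges."""
    D = squareform(pdist(X))                    # dense pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()    # nonzero entries are MST edge weights
    # remove the (n_clusters - 1) heaviest edges of the MST
    edges = np.argwhere(mst > 0)
    weights = mst[edges[:, 0], edges[:, 1]]
    for i, j in edges[np.argsort(weights)[::-1][: n_clusters - 1]]:
        mst[i, j] = 0.0
    # the connected components of what remains are the clusters
    _, labels = connected_components(mst, directed=False)
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])
    print(mst_divisive(X, 2))
```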



Hierarchical Clustering: Strengths

Do not have to assume any particular number of clusters:
  any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
The clusters may correspond to meaningful taxonomies:
  examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...)



Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
  sensitivity to noise and outliers
  difficulty handling different-sized clusters and convex shapes
  breaking large clusters



Probabilistic Clustering

Represent the probability distribution of the data as a mixture model:
  captures uncertainty in cluster assignments
  gives a model of the data distribution
  a Bayesian mixture model even allows us to determine K
Here we consider mixtures of Gaussians.



The Gaussian Distribution

The single (multivariate) Gaussian model:
    N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −(1/2) (x − μ)^T Σ⁻¹ (x − μ) )
where x is a D-dimensional vector, μ is the D-dimensional mean, and Σ is the D × D covariance matrix.




Maximum likelihood for a Gaussian: given a data set D = {x_i}, i = 1, ..., N, assumed to be drawn i.i.d. from a single Gaussian, estimate the Gaussian's parameters μ and Σ.




Maximize the log-likelihood
    ln p(D | μ, Σ) = Σ_{i=1..N} ln N(x_i | μ, Σ)
Setting its derivatives with respect to μ and Σ to zero gives the closed-form estimates
    μ_ML = (1/N) Σ_{i=1..N} x_i
    Σ_ML = (1/N) Σ_{i=1..N} (x_i − μ_ML)(x_i − μ_ML)^T
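A quick NumPy check of these closed-form estimates on synthetic data (the simulated Gaussian is an arbitrary choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.4], [0.4, 0.5]], size=500)

# maximum likelihood estimates for a single Gaussian
mu_ml = X.mean(axis=0)                            # (1/N) * sum_i x_i
sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)   # (1/N) * sum_i (x_i - mu)(x_i - mu)^T

# the maximized log-likelihood ln p(D | mu, Sigma)
loglik = multivariate_normal(mu_ml, sigma_ml).logpdf(X).sum()
print(mu_ml, sigma_ml, loglik)
```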



Mixtures of Gaussians
A Gaussian Mixture Model (GMM) is a linear superposition of Gaussians:
    p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k),   with π_k ≥ 0 and Σ_k π_k = 1




(figure: a density formed by mixing 3 Gaussians)



Sampling from a mixture of Gaussians
To generate a point x_n: first pick a component k with probability π_k, then draw x_n from that component's Gaussian N(x | μ_k, Σ_k).



Fitting a mixture of Gaussians
For clustering we observe only the data points, not which Gaussian generated each one: the component labels are latent (hidden) variables.




The posterior probability that component k generated a point x is called its responsibility; it is computed with Bayes' theorem:
    γ_k(x) = π_k N(x | μ_k, Σ_k) / Σ_{j=1..K} π_j N(x | μ_j, Σ_j)



Maximum likelihood for the GMM
The log-likelihood of the data is
    ln p(X | π, μ, Σ) = Σ_{n=1..N} ln { Σ_{k=1..K} π_k N(x_n | μ_k, Σ_k) }
Because of the sum inside the log, there is no closed-form maximum likelihood solution.



Maximum likelihood for the GMM
Setting the derivative of ln p(X | π, μ, Σ) with respect to μ_k to zero (the responsibilities γ_k(x_i) appear naturally) gives
    μ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) x_i,   where N_k = Σ_{i=1..N} γ_k(x_i)
N_k is the effective number of points assigned to component k.



Maximum likelihood for the GMM
Similarly, for the covariances:
    Σ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) (x_i − μ_k)(x_i − μ_k)^T
and, maximizing with respect to the mixing coefficients under the constraint Σ_k π_k = 1 (using a Lagrange multiplier):
    π_k = N_k / N = (1/N) Σ_{i=1..N} γ_k(x_i)



EM Algorithm





Starting from some initialization of the parameters, iterate:
  E-step: compute the responsibilities using the current parameter values
  M-step: re-estimate the parameters by MLE, using the current responsibilities
Each EM iteration is guaranteed not to decrease the likelihood.



EM Algorithm

E-step: responsibilities
    γ_k(x_i) = π_k N(x_i | μ_k, Σ_k) / Σ_{j=1..K} π_j N(x_i | μ_j, Σ_j)
M-step: MLE re-estimation
    μ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) x_i
    Σ_k = (1/N_k) Σ_{i=1..N} γ_k(x_i) (x_i − μ_k)(x_i − μ_k)^T
    π_k = N_k / N,   where N_k = Σ_{i=1..N} γ_k(x_i)
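A compact NumPy/SciPy sketch of these E and M steps (the initialization strategy and the fixed iteration count are simplified choices of mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Fit a K-component Gaussian mixture to X (N, d) with EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # crude initialization: random points as means, shared covariance, uniform weights
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_k(x_i) = pi_k N(x_i|mu_k,Sigma_k) / sum_j ...
        dens = np.stack([pi[k] * multivariate_normal(mu[k], sigma[k]).pdf(X)
                         for k in range(K)], axis=1)            # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_k, Sigma_k, pi_k from the responsibilities
        Nk = gamma.sum(axis=0)                                   # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
    return pi, mu, sigma, gamma

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([3, 3], 0.8, (100, 2))])
    pi, mu, sigma, gamma = gmm_em(X, K=2)
    print(pi, mu, sep="\n")
```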



Example



Clustering with a GMM
After fitting, each component defines a cluster: assign x to cluster C_k when component k has the largest responsibility for x (i.e., the largest π_k N(x | μ_k, Σ_k)); otherwise x ∉ C_k.



Relation to K-means
If the GMM's component covariances are shared and shrunk towards 0 (Σ_k = εI with ε → 0), the responsibilities become hard 0/1 assignments, and the EM algorithm reduces to the K-means algorithm.



Other issues: Over-fitting in Gaussian Mixture Models

If one component collapses onto a single data point, e.g., Σ_j = σ²I with μ_j = x_n, then as σ² → 0 that point's contribution grows without bound and the log-likelihood diverges: maximum likelihood for mixtures suffers from such singularities.



Problems and Solutions


The singularities of maximum likelihood can be avoided with a Bayesian treatment (priors over the parameters).
Choosing the number of components K can likewise be addressed with Bayesian model selection.



Spectral Clustering
The goal of spectral clustering is to cluster data that is connected but not necessarily compact or clustered within convex boundaries.
Compactness: close data points are in the same cluster; data points in different clusters are far away.
Connectivity: but data points in the same cluster may also be far away, even farther away than points in different clusters.
Our goal then is to transform the space so that when two points are close they are always in the same cluster, and when they are far apart they are in different clusters.
Spectral Clustering
Suppose the data points {x1, ..., xn} are organized into a similarity graph G = (V, E, W)
Each vertex vi in this graph represents a data point xi
Each edge eij is weighted by similarity wij between vi and vj
The graph construction depends on the application
The problem of clustering can now be reformulated
using the similarity graph:
to find a partition of the graph such that the edges between
different groups have very low weights and the edges within a
group have high weights



Graph partitioning

Clustering partitions the vertices of the graph.


A good clustering places dissimilar vertices in different
partitions.
The loss function for a partition (A, Ā) of the weighted graph (with weighted adjacency W) is given by the cut:
    cut(A, Ā) = Σ_{i∈A, j∈Ā} w_ij
For a given number k of subsets, the mincut approach simply consists in choosing a partition A_1, ..., A_k which minimizes
    cut(A_1, ..., A_k) = Σ_{i=1..k} cut(A_i, Ā_i)
Find a partition that minimizes the cut (mincut criterion) -- HOW?



In many cases, the solution of mincut simply separates one
individual vertex from the rest of the graph.

e.g., 2-way Partitioning...


Of course this is not what we want to achieve in clustering
Explicitly request that the sets A1, ..., Ak are "reasonably
large"

Two ways of measuring the "size" of a subset A:


|A| := the number of vertices in A
vol(A) := Σ_{i∈A} d_i, the sum of the vertex degrees in A



Graph partitioning
A loss function that favors such balanced clusters is the Normalized cut:
    Ncut(A_1, ..., A_k) = (1/2) Σ_{i=1..k} W(A_i, Ā_i) / vol(A_i) = Σ_{i=1..k} cut(A_i, Ā_i) / vol(A_i)
where vol(A) = Σ_{i∈A} d_i and d_i = Σ_{j=1..n} w_ij.

A good partition should separate dissimilar vertices


and should produce balanced clusters
Minimizing normalized cut is NP-hard.
One way of approximately optimizing normalized cuts
leads to spectral clustering



2-way Spectral Graph Partitioning

Partition membership indicator: a vector q with one entry q_i per vertex, encoding which side of the partition vertex i belongs to.
D = diag(d_1, ..., d_n),   d_i = Σ_{j=1..n} w_ij
Relax the indicators q_i from discrete values to continuous values; the solution for min J(q) is then given by the eigenvectors of
    (D − W) q = λ q



Properties of Graph Laplacian (Proof omitted)

Laplacian matrix of the graph: L = D − W
L is positive semi-definite: x^T L x ≥ 0 for any x.
Assume the eigenvalues are ordered increasingly: 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n.
The first eigenvector is q_1 = (1, ..., 1)^T, with λ_1 = 0.
The eigenvectors with small eigenvalues provide important information for clustering:
  the second eigenvector q_2 can serve as the desired (relaxed) solution
  higher eigenvectors are also useful



Recovering Partitions

From the definition of the cluster indicators, the partitions A and B are determined by the sign of the second eigenvector:
    A = {i | q_2(i) < 0},   B = {i | q_2(i) ≥ 0}
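A small sketch of this 2-way split using the second eigenvector of the unnormalized Laplacian L = D − W (the toy similarity graph is made up):

```python
import numpy as np

# toy similarity graph: vertices {0, 1, 2} and {3, 4} are only weakly connected
W = np.array([
    [0.0, 5.0, 4.0, 0.1, 0.0],
    [5.0, 0.0, 6.0, 0.0, 0.0],
    [4.0, 6.0, 0.0, 0.2, 0.0],
    [0.1, 0.0, 0.2, 0.0, 7.0],
    [0.0, 0.0, 0.0, 7.0, 0.0],
])

D = np.diag(W.sum(axis=1))           # degree matrix, d_i = sum_j w_ij
L = D - W                            # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues returned in increasing order
q2 = eigvecs[:, 1]                    # second eigenvector (Fiedler vector)

A = np.where(q2 < 0)[0]
B = np.where(q2 >= 0)[0]
print(A, B)
```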



Spectral Clustering

In the most common view:
  Given a set of points X = {x1, ..., xn} in R^D that we want to cluster into k subsets:
  Build a weighted graph G = (V, E, W).
  Construct a Laplacian matrix L = f(W) (different variants of spectral clustering result from different functions f).
  Compute the eigenvectors u1, ..., uk of the k smallest eigenvalues of L and stack them in columns, forming the matrix U = [u1, ..., uk] ∈ R^{n×k}. The rows of U provide a new representation of the original data points.
  Cluster the points in this new representation (e.g., using K-means).
  Note that there is no guarantee on the quality of the solution.



Spectral Clustering

Spectral clustering algorithm

Input: similarity matrix S ∈ R^{n×n}, number k of clusters to construct.
1: Construct a similarity graph G = (V, E). Let W be its weighted adjacency matrix.
2: Compute the Laplacian matrix L.
3: Compute the k eigenvectors u1, ..., uk corresponding to the k smallest eigenvalues of L.
4: Let U ∈ R^{n×k} be the matrix containing the vectors u1, ..., uk as columns.
5: For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row of U
   (i.e., x_i is represented by y_i = [U_i1, ..., U_ik]).
6: Cluster the points (y_i), i = 1, ..., n, in R^k with the K-means algorithm into clusters C1, ..., Ck.
Output: clusters A1, ..., Ak with A_i = {j | y_j ∈ C_i}.
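A sketch of the full pipeline in NumPy (the Gaussian similarity function, its width, and the ring-shaped toy data are illustrative choices; variants differ mainly in how L is normalized):

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def spectral_clustering(X, k, gamma=1.0, seed=0):
    """Unnormalized spectral clustering followed by K-means in R^k."""
    # 1: similarity graph (fully connected, Gaussian weights)
    W = np.exp(-gamma * squareform(pdist(X, "sqeuclidean")))
    np.fill_diagonal(W, 0.0)
    # 2: unnormalized Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # 3-5: rows of the k smallest eigenvectors give the new representation y_i
    _, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, :k]
    # 6: cluster the embedded points with K-means (simple Lloyd iterations)
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), k, replace=False)]
    for _ in range(100):
        labels = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
        centroids = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels

if __name__ == "__main__":
    # two concentric rings: compact methods struggle, connectivity-based clustering helps
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 200)
    r = np.r_[np.ones(100), 3 * np.ones(100)] + rng.normal(0, 0.05, 200)
    X = np.c_[r * np.cos(t), r * np.sin(t)]
    print(np.bincount(spectral_clustering(X, k=2, gamma=2.0)))
```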



In summary:
  build the similarity matrix W, e.g., from cosine similarity
  construct the Laplacian L = D − W (row-normalizing gives the transition-matrix variant)
  take the eigenvectors for the first k eigenvalues of L: this spectral mapping embeds each point in R^k
  run clustering (e.g., K-means) in R^k



Discussion

What are the time and space requirements of hierarchical clustering?
What are the pros and cons of single linkage vs. complete linkage and average linkage as the cluster similarity?
How to do multi-way graph partitioning?
  2-way partitioning applied recursively?
  using higher eigenvectors?
How does spectral clustering compare with K-means and PCA?



Summary

Clustering is cool
It's easy to find the most salient pattern
It's quite hard to find the pattern you want
It's hard to know how to fix when broken
EM is a useful optimization technique you
should understand well if you don't already



Next two lectures

Machine translation:
Ref. Philipp Koehn, Ch4, 5, 6
