L11-12 Text Clustering
Caixia Yuan (yuancx@bupt.edu.cn)
Generative models, e.g., Naïve Bayes:
Training: $\hat{\theta} = \arg\max_{\theta} \prod_{(x,y)} P(x, y; \theta)$
Prediction: $\hat{y} = \arg\max_{y} P(x, y; \hat{\theta})$ for input $x$
Discriminative models, e.g., maximum entropy models (a.k.a. logistic regression):
Training: choose $\hat{f} = \arg\max_{f \in F} (\text{training objective, e.g., conditional likelihood})$ over the hypothesis class $F$
Prediction: $\hat{y} = \mathrm{sign}(f(x))$ for input $x$
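A minimal sketch contrasting the two families, assuming scikit-learn is available; the toy corpus, labels, and variable names are invented for illustration (MultinomialNB stands in for the generative model, LogisticRegression for the maximum entropy model):

# Sketch: generative vs. discriminative classifiers on a made-up toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

docs = ["buy cheap pills now", "meeting at noon", "cheap meds buy", "lunch meeting today"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (illustrative labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)

nb = MultinomialNB().fit(X, labels)           # generative: models P(x, y)
maxent = LogisticRegression().fit(X, labels)  # discriminative: models P(y | x)

x_new = vec.transform(["cheap lunch pills"])
print(nb.predict(x_new), maxent.predict(x_new))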
Clustering can feel like magic: it is used throughout NLP, IR, recommendation systems, and exploratory data analysis
[Figure: a search-results page automatically grouped into numbered clusters, e.g., "Samsung Note 4" and "iPhone 7" product clusters]
Clustering web search results for ambiguous queries
E.g., "Jaguar", "NLP", "Paris Hilton"
The top 10 hits for "Jaguar" may all concern the animal, while the user wanted the car
Grouping the web results lets the user pick the intended sense: Jaguar (animal) or Jaguar (car)
Goal: intra-cluster distances are minimized, inter-cluster distances are maximized
[Figure: $m \times m$ distance matrix over points $x_1, \ldots, x_m$, with entry $(i, j)$ equal to $d(x_i, x_j)$]
Distance between vectors: Euclidean / cosine
Distance between probability distributions: divergences
A metric must satisfy:
1. $d(x, y) \ge 0$, and $d(x, y) = 0 \iff x = y$
2. $d(x, y) + d(y, z) \ge d(x, z)$
3. $d(x, y) = d(y, x)$
Examples:
Euclidean distance (the $L_2$ norm):
$d(x, y) = \sqrt{(x - y)^T (x - y)} = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$
Manhattan distance (the $L_1$ norm):
$d(x, y) = |x - y| = \sum_{i=1}^{d} |x_i - y_i|$
Chebyshev distance (the $L_\infty$ norm):
$d(x, y) = \max_{1 \le i \le d} |x_i - y_i|$
Note: even when $d(x, y)$ is a metric, $d^2(x, y)$ need not be one (the triangle inequality can fail); it is then only a measure.
Examples:
Mahalanobis distance (with covariance matrix $\Sigma$):
$d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$
Distance from a point $x$ to a set $A$:
$d(x, A) = \frac{1}{|A|} \sum_{y \in A} d(x, y)$
Distance between two sets $A$ and $B$:
$d(A, B) = \frac{1}{|A||B|} \sum_{x \in A,\, y \in B} d(x, y)$
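These definitions translate directly into code; a minimal NumPy sketch (the function names are mine, not from the slides):

# NumPy sketches of the distances above.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))        # L2 norm

def manhattan(x, y):
    return np.sum(np.abs(x - y))                # L1 norm

def chebyshev(x, y):
    return np.max(np.abs(x - y))                # L-infinity norm

def mahalanobis(x, y, cov):
    d = x - y
    return np.sqrt(d @ np.linalg.inv(cov) @ d)  # accounts for feature correlations

def point_to_set(x, A, dist=euclidean):
    # average distance from point x to every point in set A
    return np.mean([dist(x, y) for y in A])

def set_to_set(A, B, dist=euclidean):
    # average pairwise distance between sets A and B
    return np.mean([dist(x, y) for x in A for y in B])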
K-means
Hierarchical Clustering
Density-based Clustering
Gaussian Mixture Model
Spectral Clustering
K-means: partition the data into $K$ clusters, each represented by its centroid
The basic algorithm:
1: select K points as the initial centroids
2: repeat
3: Form K clusters by assigning all points to the closest centroid
4: Recompute the centroid of each cluster
5: until the centroids don't change
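A compact NumPy sketch of these five steps (illustrative only: empty clusters and smarter initialization are not handled):

# Plain K-means following the algorithm above; returns labels, centroids, SSE.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # step 1
    for _ in range(n_iter):
        # step 3: assign every point to its closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its cluster
        new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new, centroids):                       # step 5
            break
        centroids = new
    sse = (d.min(axis=1) ** 2).sum()  # the SSE objective discussed next
    return labels, centroids, sse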
Objective: the Sum of Squared Error (SSE)
$\mathrm{SSE} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \mathrm{dist}(x_n, \mu_k)$, e.g., $\mathrm{dist}(x_n, \mu_k) = \lVert x_n - \mu_k \rVert^2$
$\mu_k$ is the centroid of cluster $C_k$
$r_{nk} = 1$ if $x_n \in C_k$, otherwise $r_{nk} = 0$
Fixing the assignments $r_{nk}$ and minimizing SSE with respect to $\mu_k$ gives
$\mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$
In practice the stopping rule is relaxed to: 5: until convergence (e.g., until SSE stops decreasing)
Variant: K-medoids, where each cluster center must be an actual data point
Complexity: $O(n \cdot d \cdot K \cdot I)$, where $n$ = number of points, $d$ = number of attributes, $K$ = number of clusters, $I$ = number of iterations
[Figure: K-means on 2-D data shown at successive iterations (e.g., iterations 4 and 6); the centroids stabilize and the final partition has six clusters]
Hierarchical clustering: build a tree of nested clusters (a dendrogram)
Two strategies: agglomerative (bottom-up, merge clusters) and divisive (top-down, split clusters)
[Figure: agglomerative clustering of six points (1-6) and the resulting dendrogram; merge heights range from about 0.05 to 0.2, leaf order 1 3 2 5 4 6]
Basic algorithm:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
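SciPy's hierarchy module implements this merge loop; a small sketch assuming SciPy is available (the toy data is random and illustrative):

# Agglomerative clustering with SciPy; linkage() runs steps 1-6 above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(12, 2))  # toy 2-D points

Z = linkage(X, method="average")                   # full merge tree
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
# dendrogram(Z) plots the tree, as in the figure earlier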
The key operation is updating the proximity matrix after each merge
[Figures: evolution of the proximity matrix. Start: every point p1 ... p12 is its own cluster. Intermediate: five clusters C1 ... C5. After merging the two closest clusters C2 and C5, their rows and columns are replaced by a single row/column for C2 U C5, whose proximities to C1, C3, C4 (marked "?") must be recomputed]
How to define inter-cluster similarity?
For two clusters $G$ and $H$, the proximity $D(G, H)$ is computed from the pairwise proximities $D(i, j)$, $i \in G$, $j \in H$. Common choices:
MIN (single linkage)
MAX (complete linkage)
Group Average
Distance Between Centroids
Other methods driven by an objective function: Ward's Method uses squared error
[Figure: the same six points clustered with Ward's Method and with Group Average; different linkage criteria produce different merges and final clusters]
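To see the effect of the linkage criterion, one can rerun the SciPy sketch from earlier with each method on the same data (a quick, illustrative comparison):

# Same data, different linkage criteria -> different clusterings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(12, 2))
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    print(method, fcluster(Z, t=2, criterion="maxclust"))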
Gaussian Mixture Model (GMM)
Given a dataset $D = \{x_i\},\ i = 1, \ldots, N$, model the density as a mixture of Gaussians and maximize the log-likelihood of the data
[Figure: example density formed by 3 Gaussian components]
Generative story: for each point $x_n$, first draw a component $k$ (with probability $\pi_k$), then draw $x_n$ from the $k$-th Gaussian
The component labels are latent/hidden variables
The posterior probability that component $k$ generated a point, its responsibility $\gamma_k(x)$, follows from Bayes' theorem
In the log-likelihood $\ln p(X \mid \pi, \mu, \Sigma) = \sum_n \ln \sum_k \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ the sum over components sits inside the log, so there is no closed-form maximizer. Setting the derivative with respect to the $k$-th Gaussian's parameters to zero gives:
$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)\, x_i$
where $N_k = \sum_{i=1}^{N} \gamma_k(x_i)$ is the effective number of points assigned to component $k$
$\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)\,(x_i - \mu_k)(x_i - \mu_k)^T$
Maximizing over the mixing weights with a Lagrange multiplier (for the constraint $\sum_k \pi_k = 1$) gives
$\pi_k = \frac{N_k}{N} = \frac{1}{N} \sum_{i=1}^{N} \gamma_k(x_i)$
EM for GMMs alternates:
E-step: compute the responsibilities with the current parameters
M-step: re-estimate the parameters by MLE using those responsibilities
Each iteration of EM cannot decrease the likelihood
E-step: responsibilities
$\gamma_k(x_i) = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
M-step: MLE
$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)\, x_i$
$\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)\,(x_i - \mu_k)(x_i - \mu_k)^T$
$\pi_k = \frac{N_k}{N}$, with $N_k = \sum_{i=1}^{N} \gamma_k(x_i)$
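One EM iteration, written directly from the E-step and M-step formulas above (a NumPy/SciPy sketch; numerical safeguards such as covariance regularization are omitted):

# One EM step for a GMM; scipy supplies the Gaussian density.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, sigma):
    N, K = len(X), len(pi)
    # E-step: gamma[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)
    gamma = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k]) for k in range(K)
    ])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities
    Nk = gamma.sum(axis=0)                # effective cluster sizes
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    sigma = []
    for k in range(K):
        d = X - mu[k]
        sigma.append((gamma[:, k, None] * d).T @ d / Nk[k])
    return pi, mu, np.array(sigma)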
Relation to K-means:
K-means makes hard assignments: $x \in C_k$ or not
GMM makes soft assignments through $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ and the responsibilities
If every covariance is fixed to $\Sigma_j = \sigma^2 I$, then as $\sigma^2 \to 0$ the responsibilities become 0/1 indicators and EM reduces to the K-means algorithm
Caveats:
Maximizing the log-likelihood suffers from singularities (a component mean sitting exactly on one data point, $\mu_j = x_n$, with vanishing variance drives the likelihood to infinity); a Bayesian treatment avoids this
The number of components $K$ must be chosen in advance; Bayesian methods can also infer it
Connectivity: clusters can also be defined by how points are connected, not only by how close they are
Spectral Clustering
Suppose the data points {x1, ..., xn} are organized into a similarity graph G = (V, E, W)
Each vertex vi in this graph represents a data point xi
Each edge eij is weighted by the similarity wij between vi and vj
The graph construction depends on the application
The problem of clustering can now be reformulated
using the similarity graph:
to find a partition of the graph such that the edges between
different groups have very low weights and the edges within a
group have high weights
Algorithm sketch:
1. Build the graph Laplacian $L = D - W$ ($D$: degree matrix, $W$: weight matrix); optionally normalize the rows (the random-walk / transition-matrix view)
2. Compute the eigenvectors $v_1, \ldots, v_k$ belonging to the first (smallest) $k$ eigenvalues $\lambda_1, \ldots, \lambda_k$ of $L$
3. Map each point to its row of the eigenvector matrix (the spectral mapping)
4. Run a standard algorithm (e.g., K-means) on the mapped points in $\mathbb{R}^k$
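A sketch of the unnormalized variant, under the assumption of a fully connected graph with Gaussian similarities (NumPy/SciPy/scikit-learn assumed; the function and parameter names are mine):

# Unnormalized spectral clustering following the four steps above.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # step 1: fully connected similarity graph with Gaussian weights w_ij
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # unnormalized graph Laplacian
    # steps 2-3: eigenvectors of the k smallest eigenvalues = spectral mapping
    _, V = eigh(L, subset_by_index=[0, k - 1])
    # step 4: cluster the mapped points in R^k with K-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)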
Clustering is cool
It's easy to find the most salient pattern
It's quite hard to find the pattern you want
It's hard to know how to fix it when it breaks
EM is a useful optimization technique you should understand well if you don't already
Next: machine translation
Ref.: Philipp Koehn, Statistical Machine Translation, Ch. 4, 5, 6