Professional Documents
Culture Documents
In [ ]: import numpy as np
import matplotlib.pyplot as plt
In layman terms, clustering is simply clustering group of samples into a few cohesive
"clusters". Popular application is customer segmentation or image segmentation or for
visualization purposes.
Expectation-Maximization
Here the "E-step" or Expectation step is so-named because it involves updating our
expectation of which cluster each point belongs to. The "M-step" or Maximization step is
so-named because it involves maximizing some fitness function that defines the location of
the cluster centers—in this case, that maximization is accomplished by taking a simple mean
of the data in each cluster.
1. Define k (manually)
3. For each centroid c = 1, 2, ⋯ , k , calculate the nearest centroid for each sample
i = 1, 2, ⋯ , m which is based on each feature j = 1, 2, ⋯ , n :
where
n
(i) (c) 2 (i) (c) 2
||x − μ || = ∑(x − μ )
j j
j=1
i=1
(c)
μ =
m
(i)
∑ I {c = c}
i=1
5. Repeat 3-4 until either (1) max iterations, (2) samples no longer move to another class.
Choosing k
To choose k, we define "good" clusters as having minimum within-cluster variation. For
each c, the within-cluster variation (W (c)) is simply the sum of squared Euclidean distances
between the observations belonging to c and mean across all features. The formula is as
follows:
m n
(i) (c)
2
W (c) = ∑ ∑(x − μ )
j j
i∈c j=1
Using this equation, we can come up with total within-cluster variation (W (C)) or
sometimes we call total sum of squares:
k m n
(i) (c)
2
W (C) = ∑ ∑ ∑(x − μ )
j j
We can say that best k should minimize the following objective function:
\argmin W (C)
k
1. Scratch
In [ ]: from sklearn.metrics import pairwise_distances_argmin
pairwise_distances_argmin(X, centers)
iteration = 0
while True:
#2. assign lables based on closest center
#return the index of centers having smallest
#distance with X
labels = pairwise_distances_argmin(X, centers)
#plotting purpose
#plot every 5th iteration to save space
#remove this if, if you want to see each snapshot
if (iteration % 5 == 0):
pred = pairwise_distances_argmin(X, new_centers)
plt.figure(figsize=(5, 2))
plt.title(f"Iteration: {iteration}")
plt.scatter(X[:, 0], X[:, 1], c=pred)
plt.scatter(new_centers[:, 0], new_centers[:, 1], s=100, c="black", alp
DBSCAN identifies clusters based on the density of data points. It can find clusters of
arbitrary shapes and is robust to noise.
3. Assume spherical distribution. This also means that all k-means assume that clusters
have equal number of samples (which may not be true!)
4. Similar to K-nearest neighbors and MDS, k-means can be ridiciously slow for large
number of samples. One way to fix this is using the concept of Mini-Batch. It is
implemented in sklearn.cluster.MiniBatchMeans . You can also take a look at our
Extended - MiniBatch KMeans.ipynb
Workshop
Egg price Gold price Oil price
1 1 1 2
2 1 2 1
3 2 2 3
4 2 3 2
5 3 3 4
6 3 4 3
1. Based on this data, perform K-Means clustering for 2 iterations. Assume k = 3 , with the
initial center as [1, 1, 1], [2, 2, 2], and [3, 3, 3].
2. How do we know how many k to choose?
3. How do we think about between principle-first and data-first approach, when picking k?
4. In which case K-Means performs badly?
Gaussian Mixture Models Clustering
In [ ]: import numpy as np
import matplotlib.pyplot as plt
We can use Gaussian Mixture Models (GMM), another clustering algorithm, to address these
limitations.
Problem definition
In a Gaussian mixture model, we define each sample i = 1, 2, ⋯ , m with features of
j = 1, 2, ⋯ , n belong to clusters c = 1, 2, ⋯ , k , where each cluster c (i)
has a mean of μ (c)
and Σ (c)
. Combining these two parameters give us the multivariate gaussian distribution
N (x|μ
(c)
,Σ
(c)
) where
1 1
(c) T (c)−1 (c)
(c) (c) (− (x−μ ) Σ (x−μ ))
N (x|μ ,Σ ) = e 2
n 1
(c)
(2π) 2
|Σ |2
For those who are confused, you may compare it with univariate gaussian distribution and
you will realize they are the same but for multiple dimensions:
2
(x−μ)
1
1 −
2 2
e σ
√2πσ
Note that the determinant of the covariance matrix is the product of its diagonal entries and
each diagonal entry is the variance, thus is the same.
(c) (c)
0 ≤ π ≤ 1 ; ∑π = 1
c=1
Then, using Baye's theorem, we can define the likelihood of a sample x (i)
belong to a
cluster c as simply:
Finally, based on all the information, GMM defines its objective, i.e., to maximize
m k
(c) (i) (c) (c)
∏∑π N (x |μ ,Σ )
i=1 c=1
Derivatives
Our objective function is to find θ that maximize the log-likehood L. (Here we apply log for
mathematical stability.)
m k
(c) (c) (c)
max ∑ log ∑ π N (x|μ ,Σ )
θ
i=1 c=1
Our "normal" procedure would be to compute the gradient and set it to 0. however, if
dL
dθ
you try this yourself at home, you will find that it is not possible to find the closed form.
Responsibilities r
Before anything, let us introduce a quantity that will play a central role in this algorithm:
responsibilities. We define the quantity
basically gives us
(i)
rc
(i)
Probability of x belonging to cluster c
(i)
Probability of x over all clusters
k
(i)
∑ rc = 1
c=1
(i)
rc ≥ 0
EM algorithm
Notice that r actually depends on mean, covariance and pi. But then, mean, covariance,
and pi also depends on r. Based on this, we can use EM algorithm, where we can (1) create a
random mean, covariance, and pi, (2) calculate r, and then repeat 1 and 2 until certain
stopping criteria.
i=1
m
(i)
(i)
∑ rc x
(c) i=1
μnew =
m
(i)
∑ rc
i=1
Proof:
We take a partial derivative of our objective function with respect to the mean parameters
μ
(c)
. To simplify things, let's perform partial derivative without the log first and only consider
one sample.
Now, applying log, since we know the partial derivative of log of x is , thus
1
m (i) m (i)
∂L ∂ log p(x |θ) 1 ∂p(x |θ)
= ∑ = ∑ =
(c) (c) (i) (c)
∂μ ∂μ p(x |θ) ∂μ
i=1 i=1
m
(i) (i) (c) T (c)−1
= ∑ rc (x − μ ) Σ
i=1
m
(i) (i) (c) T (c)−1
∑ rc (x − μ ) Σ = 0
i=1
Multiply both sides by Σ will cancel out the inverse Σ, and move μ (c)
to another side
m m
(i) (i)
(i) (c)
∑ rc x = ∑ rc μ
i=1 i=1
m
(i)
(i)
∑ rc x
i=1
(c)
= μ
m
(i)
∑ rc
i=1
m
1 (i)
(i) (c)
∑ rc x = μ
(c)
N
i=1
m
(c) 1 (i) (i) (c) (i) (c) T
Σnew = ∑ rc (x − μ )(x − μ )
(c)
N
i=1
Proof
We take a partial derivative of our objective function with respect to the Sigma parameters
Σ
(c)
. Similarly, to simplify things, let's perform partial derivative without the log first and only
consider one sample.
(i)
∂p(x |θ) ∂ n 1 1 (i) (c) T (c)−1 (i) (c)
(c) − (c) (− (x −μ ) Σ (x −μ )))
= (π (2π) 2
det(Σ )2 e 2
(c) (c)
∂Σ ∂Σ
−
n
∂ −
1
1
(c) (c) (i) (c) T (c)−1 (i) (c)
= π (2π) 2
[ det(Σ ) 2
exp( − (x − μ ) Σ (x − μ )) +
(c)
∂Σ 2
1
∂ 1
(c) − (i) (c) T (c) (i) (c)
det(Σ ) 2
exp( − (x − μ ) Σ (x − μ )]
(c)
∂Σ 2
∂ ∂f (x)
−1
det(f (x)) = det(f (x))tr(f (x) )
∂X ∂x
We get that
∂ 1
1 1
(c) − (c) − (c)−1
det(Σ ) 2
= − det(Σ ) 2
Σ
(c)
∂Σ 2
We get that
∂
(i) (c) T (c)−1 (i) (c) (c)−1 (i) (c) (i) (c) T (c)−1
(x − μ ) Σ (x − μ ) = −Σ (x − μ )(x − μ ) Σ
(c)
∂Σ
(i)
∂p(x |θ)
(c) (i) (c) (c)
= π N (x |μ ,Σ )
(c)
∂Σ
1
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
∗ [ − (Σ − Σ (x − μ )(x − μ ) Σ ]
2
Now consider all samples and log as well, the partial derivative of the log-likelihood with
respect to Σ (c)
is given by
m (i)
∂L ∂ log p(x |θ)
= ∑
(c) (c)
∂Σ ∂Σ
i=1
m (i)
1 ∂p(x |θ)
= ∑
(i) (c)
(p(x |θ) ∂Σ
i=1
1
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
∗ [ − (Σ − Σ (x − μ )(x − μ ) Σ )]
2
Substituting r , we got
(i)
c
m
1 (i)
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
= − ∑ rc (Σ − Σ (x − μ )(x − μ ) Σ )
2
i=1
m m
1 (i) 1 (i)
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
= − Σ ∑ rc + Σ ( ∑ rc (x − μ )(x − μ ) )Σ
2 2
i=1 i=1
i=1
m
1 (i)
(c) (i) (c) (i) (c) T
Σ = ∑ rc (x − μ )(x − μ )
(c)
N
i=1
Updating the pi
The update of the mixture weights π (c)
:
(c)
(c)
N
πnew =
m
Proof
(c)
∑π = 1
c=1
The Lagrangian L is
(c)
L = L + β( ∑ π − 1)
c=1
m k k
(c) (c) (c) (c)
= ∑ log ∑ π N (x|μ ,Σ ) + β( ∑ π − 1)
(c)
N
= + β
(c)
π
k
∂L
(c)
= ∑π − 1
∂β
c=1
(c)
N
(c)
π = −
β
(c)
1 = ∑π
c=1
m
= − = 1
β
= β = −m
(c)
N
(c)
π =
m
Let's code!
Step 1: Define k random clusters
Define k clusters from k random number of gaussian distribution. Specifically, for each
cluster c, randomly initialize parameters mean μ (c)
, covariance Σ (c)
, fraction per class π (c)
Recall that gaussian distribution is parametrized by the mean μ and the covariance Σ
#get m and n
m, n = X.shape
X: [[ 5.48674679e+00 -4.72331117e+00]
[-2.97867201e+00 9.55684617e+00]
[ 1.09496992e+00 3.07303535e+00]
[-9.29984808e-01 9.78172086e+00]
[-7.33363923e+00 -7.58626144e+00]
[-6.14680281e+00 -6.99299774e+00]
[-1.39733358e+00 5.16333160e-03]
[-2.97261532e+00 8.54855637e+00]
[-6.84586309e+00 -7.59248369e+00]]
Mean: [[-1.39733358e+00 -6.14680281e+00 -7.33363923e+00]
[ 5.16333160e-03 -6.99299774e+00 -7.58626144e+00]]
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]]
m i 2
(a − μ)
cov(a) = var(a) = ∑
m
i
m i i
(a − μ)(b − μ)
cov(a, b) = ∑
m
i
1.3 Define pi
In [ ]: #initialize the weights (here we called pi)
pi = np.full((k, ), fill_value=1/k) #simply use 1/k for pi
print("PI (weights; must add up to 1): ", pi) #shape: (k, ) - similar to p(y) of e
Step 2: EM-step
Repeat until converged:
1. E-step: for each sample x , evaluate responsibilities r for every data point x using
(i)
(i) (i)
c
2. M-step: for each cluster, update the gaussian distribution of each cluster, i.e., restimate
parameters π using the updated responsibilites r .
(i)
(c) (c) (c) (c)
,μ ,Σ ,N c
m
(i)
(c)
N = ∑ rc
i=1
m
(c) 1 (i) (i)
μnew = ∑ rc x
(c)
N
i=1
m
(c) 1 (i) (i) (c) (i) (c) T
Σnew = ∑ rc (x − μ )(x − μ )
(c)
N i=1
(c)
(c)
N
πnew =
m
2.1 E-Step
In [ ]: from scipy.stats import multivariate_normal
for i in range(m):
for c in range(k):
xi_pdf = multivariate_normal.pdf(X[i], mean=mean[:, c], cov=cov[c])
r[i, c] = pi[c] * xi_pdf #p(y) * p(x | y)
r[i] /= np.sum(r[i]) #normalize
print("Shape of r: ", r.shape) #likelihood for each sample for each cluster
print("r: ", r)
Shape of r: (9, 3)
r: [[0.84400661 0.10948349 0.0465099 ]
[0.67145945 0.18427787 0.14426268]
[0.7732715 0.143172 0.08355649]
[0.78065746 0.13071049 0.08863205]
[0.13151509 0.42526555 0.44321936]
[0.17936309 0.41880079 0.40183612]
[0.57970857 0.24827633 0.1720151 ]
[0.65276438 0.1949 0.15233562]
[0.14707751 0.42495322 0.42796927]]
2.2 M-Step
In [ ]: # Find Nc first for latter use, you can think this as normalized factor
Nc = np.sum(r, axis=0)
assert Nc.shape == (k, )
print(Nc)
2.3 Update Pi
In [ ]: #Update pi
print("Pi (before): ", pi)
pi = Nc / m
assert pi.shape == (k, )
print("Pi (after): ", pi)
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]]
Cov (after) [[[15.89840478 -4.32093459]
[-4.32093459 51.74583404]]
[[19.29089449 17.4159892 ]
[17.4159892 71.79771663]]
[[17.61841975 18.12780574]
[18.12780574 73.37236612]]]
Step 3: Predict
In [ ]: #assume you finish repeating step 2
#get preds
yhat = np.argmax(r, axis=1)
print(yhat)
[0 0 0 0 2 1 0 0 2]
#==initialization==
#responsibliity
r = np.full(shape=(m, K), fill_value=1/K)
#pi
pi = np.full((K, ), fill_value=1/K) #simply use 1/k for pi
#mean
random_row = np.random.randint(low=0, high=m, size=K)
mean = np.array([X[idx,:] for idx in random_row ]).T #.T to make to shape (M, K)
#covariance
cov = np.array([np.cov(X.T) for _ in range (K)])
#===E-Step=====
#Update r_ik of each sample
for i in range(m):
for c in range(K):
xi_pdf = multivariate_normal.pdf(X[i], mean=mean[:, c], cov=cov[c])
r[i, c] = pi[c] * xi_pdf
r[i] /= np.sum(r[i])
#===M-Step====
# Find NK first for latter use
Nc = np.sum(r, axis=0)
assert Nc.shape == (K, )
#PI
pi = Nc / m
assert pi.shape == (K, )
#mean
mean = ( X.T @ r ) / Nc
assert mean.shape == (n, K)
#get preds
yhat = np.argmax(r, axis=1)
#plot
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=yhat)
plt.title("Final")
2. Sklearn
Though GMM is often categorized as a clustering algorithm, fundamentally it is an algorithm
for density estimation. That is to say, the result of a GMM fit to some data is technically not a
clustering model, but a generative probabilistic model describing the distribution of the
data.
If we try to fit this with a two-component GMM viewed as a clustering model, the results are
not particularly useful:
In [ ]: Xnew, _ = gmm16.sample(400)
plt.scatter(Xnew[:, 0], Xnew[:, 1])
Cons: - Just like K-mean, this algorithm can sometimes miss the globally optimal solution,
and thus in practice multiple random initializations are used.
Workshop
1. How GMM differs from K-means for centroid intialization?
2. In GMM, how do we initialize means?
3. In GMM, what's the shape of means?
4. In above code, we get 6 numbers in the mean; what these numbers describe?
5. In GMM, how do we initialize covariance matrix?
6. In GMM, what's the shape of covariance matrix?
7. In above code, we get 12 numbers in covariance matrix; what these numbers describe?
8. What π describes?
9. What's the shape of π?
10. Does π needs to sum to 1?
11. What responsibilities (r) describes?
12. What's the shape of r?
13. In GMM, what is the primary objective function?
14. Though Chaky did not talk about it, what do you think are some sensible stopping
criteria?