01 K Means - Merged

K-Means Clustering
In [ ]: import numpy as np
import matplotlib.pyplot as plt
In a clustering problem, we are given a training set {x and x , and we

(1) (m) (i) n
,⋯,x } ∈ R
are tasked to find {y (1)

,⋯,y
(m)
} . Since no labels y (i)
are given, this is an unsupervised
learning problem.
In layman terms, clustering is simply clustering group of samples into a few cohesive
"clusters". Popular application is customer segmentation or image segmentation or for
visualization purposes.
We can visualize the algorithm as shown in the following figure.
Expectation-Maximization
Here the "E-step" or Expectation step is so-named because it involves updating our
expectation of which cluster each point belongs to. The "M-step" or Maximization step is
so-named because it involves maximizing some fitness function that defines the location of
the cluster centers—in this case, that maximization is accomplished by taking a simple mean
of the data in each cluster.
More formal version

The k-means clustering algorithm is as follows:
1. Define k (manually)
2. Initialize cluster centroids μ randomly

(1) (2) (k) n
,μ ,⋯,μ ∈ R
3. For each centroid c = 1, 2, ⋯ , k , calculate the nearest centroid for each sample
i = 1, 2, ⋯ , m which is based on each feature j = 1, 2, ⋯ , n :
(i) (i) (c) 2

c = \argmin ||x − μ ||
c
where
n
(i) (c) 2 (i) (c) 2
||x − μ || = ∑(x − μ )
j j
j=1
4. Set the new centroid mean based on each centroid samples:

m
(i) (i)
∑ I {c = c}x
i=1
(c)
μ =
m
(i)
∑ I {c = c}
i=1
5. Repeat 3-4 until either (1) max iterations, (2) samples no longer move to another class.
Choosing k
To choose k, we define "good" clusters as having minimum within-cluster variation. For
each c, the within-cluster variation (W (c)) is simply the sum of squared Euclidean distances
between the observations belonging to c and mean across all features. The formula is as
follows:
m n
(i) (c)
2
W (c) = ∑ ∑(x − μ )
j j
i∈c j=1
Using this equation, we can come up with total within-cluster variation (W (C)) or
sometimes we call total sum of squares:
k m n
(i) (c)
2
W (C) = ∑ ∑ ∑(x − μ )
j j
c=1 i∈c j=1
We can say that best k should minimize the following objective function:
\argmin W (C)
k
1. Scratch
In [ ]: from sklearn.metrics import pairwise_distances_argmin
X = np.array([[2, 3], [1, 2], [6, 7]]) #close to 1, 0, 2 centers

centers = np.array([[1, 1], [2, 2], [3, 3]])
pairwise_distances_argmin(X, centers)
Out[ ]: array([1, 0, 2])
In [ ]: #Implement K-means from scratch

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin
from time import time
X, y_true = make_blobs(n_samples=1500, centers=4,

cluster_std=0.60, random_state=0)
def kmeans(X, n_clusters):

m, n = X.shape
#1. randomly choose n clusters from X
#you can also randomly generate any two points
rng = np.random.RandomState(42)
i = rng.permutation(m)[:n_clusters]
centers = X[i]
iteration = 0
while True:
#2. assign lables based on closest center
#return the index of centers having smallest
#distance with X
labels = pairwise_distances_argmin(X, centers)
#3. find new centers

new_centers = []
for i in range(n_clusters):
new_centers.append(X[labels == i].mean(axis=0))
#convert list to np.array; you can actually combine #3

#with np.array in one sentence
new_centers = np.array(new_centers)
#compute total within-variation score

total_with_variation_score = 0
labels = pairwise_distances_argmin(X, centers) #<---Note I use X here. Why
for i in range(n_clusters):
cluster_mean = X[labels==i].mean(axis=0)
total_with_variation_score += ((X[labels==i] - cluster_mean)** 2).sum()
print("Total with variation score: ", total_with_variation_score)
#plotting purpose
#plot every 5th iteration to save space
#remove this if, if you want to see each snapshot
if (iteration % 5 == 0):
pred = pairwise_distances_argmin(X, new_centers)
plt.figure(figsize=(5, 2))
plt.title(f"Iteration: {iteration}")
plt.scatter(X[:, 0], X[:, 1], c=pred)
plt.scatter(new_centers[:, 0], new_centers[:, 1], s=100, c="black", alp
#4 stopping criteria - if centers do not

#change anymore, we stop!
if(np.allclose(centers, new_centers)):
break
else:
centers = new_centers
iteration+=1
print(f"Done in {iteration} iterations")

return centers
def predict(X, centers):

return pairwise_distances_argmin(X, centers)
start = time()
preds = predict(X, kmeans(X, n_clusters=4))
print(f"Fit and predict time: {time() - start}")
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=preds, s=50)
plt.title("Final result")
Total with variation score: 3349.4613338786116

Done in 14 iterations
Fit and predict time: 0.0459291934967041
Out[ ]: Text(0.5, 1.0, 'Final result')
2. Sklearn
In [ ]: from sklearn.datasets import make_blobs
import numpy as np
X, y_true = make_blobs(n_samples=300, centers=4,

cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
How to know how many clusters?
In [ ]: from sklearn.cluster import KMeans
#sum of squared distances

ssd = []
for k in range(2, 20):
kmeans = KMeans(n_clusters=k, n_init='auto')
kmeans.fit(X)
ssd.append(kmeans.inertia_)
fig = plt.figure(figsize=(15, 5))

plt.plot(range(2, 20), ssd)
plt.xticks(range(2, 20))
plt.grid(True)
plt.ylabel("within cluster variation")
plt.xlabel("k")
plt.title('Elbow curve')
plt.annotate('elbow', xy=(4.3, 220), xytext=(5, 600), #xytext ---> xy

arrowprops=dict(arrowstyle="->"))
Out[ ]: Text(5, 600, 'elbow')

Here, 4 is best. Although 5+ can further minimize W (C), more k leads to more
computational resources, thus 4 gives a good balance between performance and efficiency.
Limitation: K-means assumes equal-sized spherical

distribution
In [ ]: from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)
labels = KMeans(2, random_state=0, n_init='auto').fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels,
s=50, cmap='viridis');
Limitation: K-means does not work well with

uneven size clusters
In [ ]: n_samples_1 = 1000
n_samples_2 = 100
centers = [[0.0, 0.0], [5.0, 5.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2],
centers=centers,
cluster_std=clusters_std,
random_state=0, shuffle=False)
kmeans = KMeans(n_clusters=2, n_init='auto')

kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
Out[ ]: <matplotlib.collections.PathCollection at 0x17cd27710>

Alternatives
When K-means (centroid-based) fail you, you may explore other alternatives:
Hierarchical Clustering (hierachical-based): This method starts with individual data

points as clusters and iteratively merges them into larger clusters until a stopping
criterion is met.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (density-
based):
DBSCAN identifies clusters based on the density of data points. It can find clusters of
arbitrary shapes and is robust to noise.
Gaussian Mixture Models (GMM) (distribution-based): GMM models the data as a

mixture of Gaussian distributions. It can capture complex cluster shapes and also
provides probabilistic cluster assignments.
When to use K-means

1. May not guarantee optimal solution. Depends on initialization. Can be fix by running k-
means many times with different init random values. Sklearn supports n_init=auto
which runs k-means with different center seeds and get the best results.
2. Require knowing how many clusters beforehand. Simple way is to use the elbow
method which compute within clusters distances. In sklearn, this can be easily computed
using kmeans inertia_ variable .
3. Assume spherical distribution. This also means that all k-means assume that clusters
have equal number of samples (which may not be true!)
4. Similar to K-nearest neighbors and MDS, k-means can be ridiciously slow for large
number of samples. One way to fix this is using the concept of Mini-Batch. It is
implemented in sklearn.cluster.MiniBatchMeans . You can also take a look at our
Extended - MiniBatch KMeans.ipynb
Workshop
Egg price Gold price Oil price
1 1 1 2
2 1 2 1
3 2 2 3
4 2 3 2
5 3 3 4
6 3 4 3
1. Based on this data, perform K-Means clustering for 2 iterations. Assume k = 3 , with the
initial center as [1, 1, 1], [2, 2, 2], and [3, 3, 3].
2. How do we know how many k to choose?
3. How do we think about between principle-first and data-first approach, when picking k?
4. In which case K-Means performs badly?
Gaussian Mixture Models Clustering
In [ ]: import numpy as np
We can use Gaussian Mixture Models (GMM), another clustering algorithm, to address these
limitations.
Problem definition
In a Gaussian mixture model, we define each sample i = 1, 2, ⋯ , m with features of
j = 1, 2, ⋯ , n belong to clusters c = 1, 2, ⋯ , k , where each cluster c (i)
has a mean of μ (c)
and Σ (c)
. Combining these two parameters give us the multivariate gaussian distribution
N (x|μ
(c)
,Σ
(c)
) where
1 1
(c) T (c)−1 (c)
(c) (c) (− (x−μ ) Σ (x−μ ))
N (x|μ ,Σ ) = e 2
n 1
(c)
(2π) 2
|Σ |2
For those who are confused, you may compare it with univariate gaussian distribution and
you will realize they are the same but for multiple dimensions:
2
(x−μ)
1
1 −
2 2
e σ
√2πσ
Note that the determinant of the covariance matrix is the product of its diagonal entries and
each diagonal entry is the variance, thus is the same.
We shall also further define π (c)

, which is simply the probability of being in the cluster (think
of it as p(y)). Since π is the probability of each cluster, it should sum to 1.
(c) (c)
0 ≤ π ≤ 1 ; ∑π = 1
c=1
Then, using Baye's theorem, we can define the likelihood of a sample x (i)
belong to a
cluster c as simply:
p(y|x) = p(y) ∗ p(x|y)
(i) (c) (i) (c) (c)

p(c|x ) = π ∗ N (x |μ ,Σ )
Finally, based on all the information, GMM defines its objective, i.e., to maximize
m k
(c) (i) (c) (c)
∏∑π N (x |μ ,Σ )
i=1 c=1
Derivatives
Our objective function is to find θ that maximize the log-likehood L. (Here we apply log for
mathematical stability.)
m k
(c) (c) (c)
max ∑ log ∑ π N (x|μ ,Σ )
θ
i=1 c=1
Our "normal" procedure would be to compute the gradient and set it to 0. however, if
dL
dθ
you try this yourself at home, you will find that it is not possible to find the closed form.
Responsibilities r
Before anything, let us introduce a quantity that will play a central role in this algorithm:
responsibilities. We define the quantity
(c) (i) (c) (c)

π N (x ∣ μ ,Σ )
(i)
rc =
k (c) (i) (c) (c)
Σ π N (x ∣ μ ,Σ )
c=1
basically gives us
(i)
rc
(i)
Probability of x belonging to cluster c
(i)
Probability of x over all clusters
Note that r for a sample i
(i) (i) (i) (i) k

r = r ,r ,⋯,r ∈ R
1 2 k
k
(i)
∑ rc = 1
c=1
(i)
rc ≥ 0
EM algorithm
Notice that r actually depends on mean, covariance and pi. But then, mean, covariance,
and pi also depends on r. Based on this, we can use EM algorithm, where we can (1) create a
random mean, covariance, and pi, (2) calculate r, and then repeat 1 and 2 until certain
stopping criteria.
Total responsibility N (c)

By summing all the total responsibility of the c th cluster along all samples, we get N (c)
.
m
(i) k
N c = ∑ rc ∈ R
i=1
Note that this value does not necessarily sum to 1.
Updating the mean

The update of the mean μ (c)
is given by:
m
(i)
(i)
∑ rc x
(c) i=1
μnew =
m
(i)
∑ rc
i=1
Proof:
We take a partial derivative of our objective function with respect to the mean parameters
μ
(c)
. To simplify things, let's perform partial derivative without the log first and only consider
one sample.
(i) k (i) (c) (c) (i) (c) (c)

∂p(x |θ) ∂N (x |μ ,Σ ) ∂N (x |μ ,Σ )
(c) (c)
= ∑π = π
(c) (c) (c)
∂μ ∂μ ∂μ
c=1
(c) (i) (c) T (c)−1 (i) (c) (c)

= π (x − μ ) Σ N (x |μ ,Σ )
Now, applying log, since we know the partial derivative of log of x is , thus
1
m (i) m (i)
∂L ∂ log p(x |θ) 1 ∂p(x |θ)
= ∑ = ∑ =
(c) (c) (i) (c)
∂μ ∂μ p(x |θ) ∂μ
i=1 i=1
m (c) (i) (c) (c)

π N (x ∣ μ ,Σ )
(i) (c) T (c)−1
∑(x − μ ) Σ
k (c) (i) (c) (c)
i=1
Σ π N (x ∣ μ ,Σ )
c=1
To simplify, we can substitute r into the equation, thus

(i)
m
(i) (i) (c) T (c)−1
= ∑ rc (x − μ ) Σ
i=1
We can now solve for μ so that and obtain

(c) ∂L
= 0
(c)
∂μ
m
(i) (i) (c) T (c)−1
∑ rc (x − μ ) Σ = 0
i=1
Multiply both sides by Σ will cancel out the inverse Σ, and move μ (c)
to another side
m m
(i) (i)
(i) (c)
∑ rc x = ∑ rc μ
i=1 i=1
m
(i)
(i)
∑ rc x
i=1
(c)
= μ
m
(i)
∑ rc
i=1
We can further substitute N (c)

so that
m
1 (i)
(i) (c)
∑ rc x = μ
(c)
N
i=1
Updating the covariances

The update of the covariance Σ (c)
is given by:
m
(c) 1 (i) (i) (c) (i) (c) T
Σnew = ∑ rc (x − μ )(x − μ )
(c)
N
i=1
Proof
We take a partial derivative of our objective function with respect to the Sigma parameters
Σ
(c)
. Similarly, to simplify things, let's perform partial derivative without the log first and only
consider one sample.
(i)
∂p(x |θ) ∂ n 1 1 (i) (c) T (c)−1 (i) (c)
(c) − (c) (− (x −μ ) Σ (x −μ )))
= (π (2π) 2
det(Σ )2 e 2
(c) (c)
∂Σ ∂Σ
Using derivative multiplication rule, we got
−
n
∂ −
1
1
(c) (c) (i) (c) T (c)−1 (i) (c)
= π (2π) 2
[ det(Σ ) 2
exp( − (x − μ ) Σ (x − μ )) +
(c)
∂Σ 2
1
∂ 1
(c) − (i) (c) T (c) (i) (c)
det(Σ ) 2
exp( − (x − μ ) Σ (x − μ )]
(c)
∂Σ 2
Using this following rule
∂ ∂f (x)
−1
det(f (x)) = det(f (x))tr(f (x) )
∂X ∂x
We get that
∂ 1
1 1
(c) − (c) − (c)−1
det(Σ ) 2
= − det(Σ ) 2
Σ
(c)
∂Σ 2
Using this following rule

T
∂a Xb
T
= ab
∂X
We get that
∂
(i) (c) T (c)−1 (i) (c) (c)−1 (i) (c) (i) (c) T (c)−1
(x − μ ) Σ (x − μ ) = −Σ (x − μ )(x − μ ) Σ
(c)
∂Σ
Putting them together, we got:
(i)
∂p(x |θ)
(c) (i) (c) (c)
= π N (x |μ ,Σ )
(c)
∂Σ
1
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
∗ [ − (Σ − Σ (x − μ )(x − μ ) Σ ]
2
Now consider all samples and log as well, the partial derivative of the log-likelihood with
respect to Σ (c)
is given by
m (i)
∂L ∂ log p(x |θ)
= ∑
(c) (c)
∂Σ ∂Σ
i=1
m (i)
1 ∂p(x |θ)
= ∑
(i) (c)
(p(x |θ) ∂Σ
i=1
m (c) (i) (c) (c)

π N (x ∣ μ ,Σ )
= ∑
k (c) (i) (c) (c)
Σ π N (x ∣ μ ,Σ )
i=1 c=1
1
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
∗ [ − (Σ − Σ (x − μ )(x − μ ) Σ )]
2
Substituting r , we got
(i)
c
m
1 (i)
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
= − ∑ rc (Σ − Σ (x − μ )(x − μ ) Σ )
2
i=1
m m
1 (i) 1 (i)
(c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
= − Σ ∑ rc + Σ ( ∑ rc (x − μ )(x − μ ) )Σ
2 2
i=1 i=1
Setting this to zero, we obtain:

m
(i)
(c) (c)−1 (c)−1 (i) (c) (i) (c) T (c)−1
N Σ = Σ ( ∑ rc (x − μ )(x − μ ) )Σ
i=1
By solving for Σ (c)

we got
m
1 (i)
(c) (i) (c) (i) (c) T
Σ = ∑ rc (x − μ )(x − μ )
(c)
N
i=1
Updating the pi
The update of the mixture weights π (c)
:
(c)
(c)
N
πnew =
m
Proof
To find the partial derivative, we account for the equality constraint
(c)
∑π = 1
c=1
The Lagrangian L is
(c)
L = L + β( ∑ π − 1)
c=1
m k k
(c) (c) (c) (c)
= ∑ log ∑ π N (x|μ ,Σ ) + β( ∑ π − 1)
i=1 c=1 c=1
Taking the partial derivative with respect to π (c)

as
m (i) (c) (c)

∂L N (x ∣ μ ,Σ )
= ∑ + β
(c) k (c) (i) (c) (c)
∂π Σ π N (x ∣ μ ,Σ )
i=1 c=1
m (c) (i) (c) (c)

1 π N (x ∣ μ ,Σ )
= ∑ + β
(c) k (c) (i) (c) (c)
π Σ π N (x ∣ μ ,Σ )
i=1 c=1
(c)
N
= + β
(c)
π
Taking the partial derivative with respect to β is
k
∂L
(c)
= ∑π − 1
∂β
c=1
Setting both partial derivatives to zero yield
(c)
N
(c)
π = −
β
(c)
1 = ∑π
c=1
Using the top formula to solve for the bottom formula:

m (c)
N
−∑ = 1
β
i=1
m
= − = 1
β
= β = −m
Substitute −m for β yield
(c)
N
(c)
π =
m
Let's code!
Step 1: Define k random clusters
Define k clusters from k random number of gaussian distribution. Specifically, for each
cluster c, randomly initialize parameters mean μ (c)
, covariance Σ (c)
, fraction per class π (c)
and responsiblities (likelihood) of each sample r

(i)
c
Recall that gaussian distribution is parametrized by the mean μ and the covariance Σ
In [ ]: from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=9, cluster_std=[1.0, 3.5, 0.5], random_state=42)
plt.scatter(X[:, 0], X[:, 1], c =y);
1.1 Define means
In [ ]: #define number of cluster (use domain knowledge or elbow rule)
k = 3
#get m and n
m, n = X.shape
#initialize K number of gaussian distribution
#first, initialize mean

random_row = np.random.randint(low=0, high=m, size=k)
mean = np.array([X[idx,:] for idx in random_row ]).T
print("X: ", X)
print("Mean: ", mean) #shape: n, k (mean for each feature for each cluster)
X: [[ 5.48674679e+00 -4.72331117e+00]
[-2.97867201e+00 9.55684617e+00]
[ 1.09496992e+00 3.07303535e+00]
[-9.29984808e-01 9.78172086e+00]
[-7.33363923e+00 -7.58626144e+00]
[-6.14680281e+00 -6.99299774e+00]
[-1.39733358e+00 5.16333160e-03]
[-2.97261532e+00 8.54855637e+00]
[-6.84586309e+00 -7.59248369e+00]]
Mean: [[-1.39733358e+00 -6.14680281e+00 -7.33363923e+00]
[ 5.16333160e-03 -6.99299774e+00 -7.58626144e+00]]
1.2 Define covariance

In [ ]: #initialize covariance (std for multi-dimension)
cov = np.array([np.cov(X.T) for _ in range (k)])
print("Cov: ", cov) #shape: k, n, n (covriance matrix (n x n) for each cluster)
Cov: [[[17.04621881 9.37794282]

[ 9.37794282 56.76310386]]
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]]
For those who forget about covariance matrix,
m i 2
(a − μ)
cov(a) = var(a) = ∑
m
i
m i i
(a − μ)(b − μ)
cov(a, b) = ∑
m
i
1.3 Define pi
In [ ]: #initialize the weights (here we called pi)
pi = np.full((k, ), fill_value=1/k) #simply use 1/k for pi
print("PI (weights; must add up to 1): ", pi) #shape: (k, ) - similar to p(y) of e
PI (weights; must add up to 1): [0.33333333 0.33333333 0.33333333]
1.4 Define responsibilities

In [ ]: #initialize the responsiblities (likelihood)
r = np.full(shape=(m, k), fill_value=1/k)
print("Responsilibities (likelihood - add up to 1): ", r) #shape: (m, k) - the like
Responsilibities (likelihood - add up to 1): [[0.33333333 0.33333333 0.33333333]

[0.33333333 0.33333333 0.33333333]
[0.33333333 0.33333333 0.33333333]
[0.33333333 0.33333333 0.33333333]
[0.33333333 0.33333333 0.33333333]
[0.33333333 0.33333333 0.33333333]
[0.33333333 0.33333333 0.33333333]
[0.33333333 0.33333333 0.33333333]
[0.33333333 0.33333333 0.33333333]]
Step 2: EM-step
Repeat until converged:
1. E-step: for each sample x , evaluate responsibilities r for every data point x using
(i)
(i) (i)
c
(c) (i) (c) (c)

π N (x ∣ μ ,Σ )
(i)
rc =
k (c) (i) (c) (c)
Σ π N (x ∣ μ ,Σ )
c=1
2. M-step: for each cluster, update the gaussian distribution of each cluster, i.e., restimate
parameters π using the updated responsibilites r .
(i)
(c) (c) (c) (c)
,μ ,Σ ,N c
m
(i)
(c)
N = ∑ rc
i=1
m
(c) 1 (i) (i)
μnew = ∑ rc x
(c)
N
i=1
m
(c) 1 (i) (i) (c) (i) (c) T
Σnew = ∑ rc (x − μ )(x − μ )
(c)
N i=1
(c)
(c)
N
πnew =
m
2.1 E-Step
In [ ]: from scipy.stats import multivariate_normal
for i in range(m):
for c in range(k):
xi_pdf = multivariate_normal.pdf(X[i], mean=mean[:, c], cov=cov[c])
r[i, c] = pi[c] * xi_pdf #p(y) * p(x | y)
r[i] /= np.sum(r[i]) #normalize
print("Shape of r: ", r.shape) #likelihood for each sample for each cluster
print("r: ", r)
Shape of r: (9, 3)
r: [[0.84400661 0.10948349 0.0465099 ]
[0.67145945 0.18427787 0.14426268]
[0.7732715 0.143172 0.08355649]
[0.78065746 0.13071049 0.08863205]
[0.13151509 0.42526555 0.44321936]
[0.17936309 0.41880079 0.40183612]
[0.57970857 0.24827633 0.1720151 ]
[0.65276438 0.1949 0.15233562]
[0.14707751 0.42495322 0.42796927]]
2.2 M-Step
In [ ]: # Find Nc first for latter use, you can think this as normalized factor
Nc = np.sum(r, axis=0)
assert Nc.shape == (k, )
print(Nc)
[4.75982366 2.27983975 1.96033659]
2.3 Update Pi
In [ ]: #Update pi
print("Pi (before): ", pi)
pi = Nc / m
assert pi.shape == (k, )
print("Pi (after): ", pi)
Pi (before): [0.33333333 0.33333333 0.33333333]

Pi (after): [0.5288693 0.25331553 0.21781518]
2.4 Update mean

In [ ]: #Update mean
print("X.T: ", X.T) #(n, m)
print("r: ", r) #(m, k)
print("X.T @ r ", X.T @ r)
print("Nc: ", Nc)
print("Mean (before) ", mean) #(n, k)
mean = ( X.T @ r ) / Nc
assert mean.shape == (n, k)
print("Mean (after) ", mean) #(n, k)
X.T: [[ 5.48674679e+00 -2.97867201e+00 1.09496992e+00 -9.29984808e-01
-7.33363923e+00 -6.14680281e+00 -1.39733358e+00 -2.97261532e+00
-6.84586309e+00]
[-4.72331117e+00 9.55684617e+00 3.07303535e+00 9.78172086e+00
-7.58626144e+00 -6.99299774e+00 5.16333160e-03 8.54855637e+00
-7.59248369e+00]]
r: [[0.84400661 0.10948349 0.0465099 ]
[0.67145945 0.18427787 0.14426268]
[0.7732715 0.143172 0.08355649]
[0.78065746 0.13071049 0.08863205]
[0.13151509 0.42526555 0.44321936]
[0.17936309 0.41880079 0.40183612]
[0.57970857 0.24827633 0.1720151 ]
[0.65276438 0.1949 0.15233562]
[0.14707751 0.42495322 0.42796927]]
X.T @ r [[-3.07282733 -9.44147402 -9.50889278]
[14.65750205 -4.7513667 -5.83586731]]
Nc: [4.75982366 2.27983975 1.96033659]
Mean (before) [[-1.39733358e+00 -6.14680281e+00 -7.33363923e+00]
[ 5.16333160e-03 -6.99299774e+00 -7.58626144e+00]]
Mean (after) [[-0.64557588 -4.14128845 -4.85064291]
[ 3.07942123 -2.08407924 -2.97697208]]
2.5 Update covariance

In [ ]: print("Cov (before) ", cov) #(n, k)
#covariance (also called Sigma)

for c in range(k):
for i in range(m):
X_mean = (X[i]-mean[:, c]).reshape(-1, 1)
cov[c] += r[i, c] * (X_mean @ X_mean.T)
cov[c] /= Nc[c]
assert cov.shape == (k, n, n)
print("Cov (after) ", cov) #(n, k)
Cov (before) [[[17.04621881 9.37794282]

[ 9.37794282 56.76310386]]
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]
[[17.04621881 9.37794282]
[ 9.37794282 56.76310386]]]
Cov (after) [[[15.89840478 -4.32093459]
[-4.32093459 51.74583404]]
[[19.29089449 17.4159892 ]
[17.4159892 71.79771663]]
[[17.61841975 18.12780574]
[18.12780574 73.37236612]]]
Step 3: Predict
In [ ]: #assume you finish repeating step 2
#get preds
yhat = np.argmax(r, axis=1)
print(yhat)
[0 0 0 0 2 1 0 0 2]
Putting everything together

In [ ]: from sklearn.cluster import KMeans
from scipy.stats import multivariate_normal
import math
X, y = make_blobs(n_samples=1500, cluster_std=[1.0, 3.5, 0.5], random_state=42)
#define basic params

m, n = X.shape
K = 3
max_iter = 20
#==initialization==
#responsibliity
r = np.full(shape=(m, K), fill_value=1/K)
#pi
pi = np.full((K, ), fill_value=1/K) #simply use 1/k for pi
#mean
random_row = np.random.randint(low=0, high=m, size=K)
mean = np.array([X[idx,:] for idx in random_row ]).T #.T to make to shape (M, K)
#covariance
cov = np.array([np.cov(X.T) for _ in range (K)])
for iteration in range(max_iter):
#===E-Step=====
#Update r_ik of each sample
for i in range(m):
for c in range(K):
xi_pdf = multivariate_normal.pdf(X[i], mean=mean[:, c], cov=cov[c])
r[i, c] = pi[c] * xi_pdf
r[i] /= np.sum(r[i])
#===M-Step====
# Find NK first for latter use
Nc = np.sum(r, axis=0)
assert Nc.shape == (K, )
#PI
pi = Nc / m
assert pi.shape == (K, )
#mean
mean = ( X.T @ r ) / Nc
assert mean.shape == (n, K)
#covariance (also called Sigma)

cov = np.zeros((K, n, n))
for k in range(K):
for i in range(m):
X_mean = (X[i]-mean[:, k]).reshape(-1, 1)
cov[k] += r[i, k] * (X_mean @ X_mean.T)
cov[k] /= Nc[k]
assert cov.shape == (K, n, n)
#get preds
yhat = np.argmax(r, axis=1)
#plot
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=yhat)
plt.title("Final")
Out[ ]: Text(0.5, 1.0, 'Final')
2. Sklearn
Though GMM is often categorized as a clustering algorithm, fundamentally it is an algorithm
for density estimation. That is to say, the result of a GMM fit to some data is technically not a
clustering model, but a generative probabilistic model describing the distribution of the
data.
In [ ]: from sklearn.datasets import make_moons

Xmoon, ymoon = make_moons(200, noise=.05, random_state=0)
plt.scatter(Xmoon[:, 0], Xmoon[:, 1]);
If we try to fit this with a two-component GMM viewed as a clustering model, the results are
not particularly useful:
In [ ]: from sklearn.mixture import GaussianMixture as GMM

gmm2 = GMM(n_components=2, covariance_type='full', random_state=0)
pred = gmm2.fit(Xmoon).predict(Xmoon)
plt.scatter(Xmoon[:, 0], Xmoon[:, 1], c=pred, s=40, cmap='viridis')
Out[ ]: <matplotlib.collections.PathCollection at 0x166b69b50>

In [ ]: gmm16 = GMM(n_components=16, covariance_type='full', random_state=0)
pred = gmm16.fit(Xmoon).predict(Xmoon)
plt.scatter(Xmoon[:, 0], Xmoon[:, 1], c=pred, s=40, cmap='viridis')
Out[ ]: <matplotlib.collections.PathCollection at 0x168bfcd10>

Here the mixture of 16 Gaussians serves not to find separated clusters of data, but rather to
model the overall distribution of the input data. This is a generative model of the
distribution, meaning that the GMM gives us the recipe to generate new random data
distributed similarly to our input. For example, here are 400 new points drawn from this 16-
component GMM fit to our original data:
In [ ]: Xnew, _ = gmm16.sample(400)
plt.scatter(Xnew[:, 0], Xnew[:, 1])
Out[ ]: <matplotlib.collections.PathCollection at 0x168c6b2d0>
When to use GMM

Pros: - Address the limitations of K-means - Can be used to generate data, since we know
P (x|y)
Cons: - Just like K-mean, this algorithm can sometimes miss the globally optimal solution,
and thus in practice multiple random initializations are used.
Workshop
1. How GMM differs from K-means for centroid intialization?
2. In GMM, how do we initialize means?
3. In GMM, what's the shape of means?
4. In above code, we get 6 numbers in the mean; what these numbers describe?
5. In GMM, how do we initialize covariance matrix?
6. In GMM, what's the shape of covariance matrix?
7. In above code, we get 12 numbers in covariance matrix; what these numbers describe?
8. What π describes?
9. What's the shape of π?
10. Does π needs to sum to 1?
11. What responsibilities (r) describes?
12. What's the shape of r?
13. In GMM, what is the primary objective function?
14. Though Chaky did not talk about it, what do you think are some sensible stopping
criteria?

01 K Means - Merged

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

01 K Means - Merged

Uploaded by

Copyright:

Available Formats

K-Means Clustering

In a clustering problem, we are given a training set {x and x , and we

are tasked to find {y (1)

We can visualize the algorithm as shown in the following figure.

More formal version

2. Initialize cluster centroids μ randomly

(i) (i) (c) 2

4. Set the new centroid mean based on each centroid samples:

c=1 i∈c j=1

X = np.array([[2, 3], [1, 2], [6, 7]]) #close to 1, 0, 2 centers

Out[ ]: array([1, 0, 2])

In [ ]: #Implement K-means from scratch

X, y_true = make_blobs(n_samples=1500, centers=4,

def kmeans(X, n_clusters):

#3. find new centers

#convert list to np.array; you can actually combine #3

#compute total within-variation score

print("Total with variation score: ", total_with_variation_score)

#4 stopping criteria - if centers do not

print(f"Done in {iteration} iterations")

def predict(X, centers):

Total with variation score: 3349.4613338786116

X, y_true = make_blobs(n_samples=300, centers=4,

In [ ]: from sklearn.cluster import KMeans

#sum of squared distances

fig = plt.figure(figsize=(15, 5))

plt.annotate('elbow', xy=(4.3, 220), xytext=(5, 600), #xytext ---> xy

Out[ ]: Text(5, 600, 'elbow')

Limitation: K-means assumes equal-sized spherical

labels = KMeans(2, random_state=0, n_init='auto').fit_predict(X)

Limitation: K-means does not work well with

kmeans = KMeans(n_clusters=2, n_init='auto')

Out[ ]: <matplotlib.collections.PathCollection at 0x17cd27710>

Hierarchical Clustering (hierachical-based): This method starts with individual data

Gaussian Mixture Models (GMM) (distribution-based): GMM models the data as a

When to use K-means

We shall also further define π (c)

p(y|x) = p(y) ∗ p(x|y)

(i) (c) (i) (c) (c)

(c) (i) (c) (c)

Note that r for a sample i

(i) (i) (i) (i) k

Total responsibility N (c)

Note that this value does not necessarily sum to 1.

Updating the mean

(i) k (i) (c) (c) (i) (c) (c)

(c) (i) (c) T (c)−1 (i) (c) (c)

m (c) (i) (c) (c)

To simplify, we can substitute r into the equation, thus

We can now solve for μ so that and obtain

We can further substitute N (c)

Updating the covariances

Using derivative multiplication rule, we got

Using this following rule

Using this following rule

Putting them together, we got:

m (c) (i) (c) (c)

Setting this to zero, we obtain:

By solving for Σ (c)

To find the partial derivative, we account for the equality constraint

i=1 c=1 c=1

Taking the partial derivative with respect to π (c)