K-Means++
K-means++ is an initialization scheme for the K-means clustering
algorithm that selects the initial centroids in a principled way. It
aims to improve both the convergence speed and the quality of the
final clustering result.

How does K-Means++ work?


K-Means++ is a smart centroid initialization technique; the rest of
the algorithm is identical to standard K-Means. The steps for
centroid initialization are as follows (a short Python sketch
appears after the list):

• Pick the first centroid c_1 uniformly at random from the data
points.

• Compute the distance of every point in the dataset from its
nearest already-chosen centroid. The distance d_i of point x_i is

  d_i = min_{j = 1, ..., m} ||x_i - c_j||

  where m is the number of centroids already picked.

• Select the next centroid from the data points with probability
proportional to d_i^2, so points that are far from every existing
centroid are the most likely to be chosen.

• Repeat the previous two steps until k centroids have been chosen.
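
As a concrete illustration, here is a minimal NumPy sketch of this
initialization. The function name and interface are our own, not
from any particular library:

import numpy as np

def kmeans_pp_init(X, k, seed=None):
    # X: (n, d) array of data points; k: number of centroids to pick
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(n)]]
    while len(centroids) < k:
        C = np.array(centroids)
        # d_i^2: squared distance of each point to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Sample the next centroid with probability proportional to d_i^2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)

Calling kmeans_pp_init(X, 3) on the nine-point dataset used later in
this article would return three of those points as initial centroids.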


Real-World Examples

Here are a few real-world examples where K-means++ can be applied:

Customer Segmentation:
In the e-commerce industry, K-means++ can be used to segment
customers based on their purchasing behavior. By analyzing
their buying patterns, preferences, and demographic
information, businesses can create targeted marketing
campaigns and personalize their offerings to different customer
segments.

Image Compression:
K-means++ can be employed in image compression algorithms.
It can cluster similar colors together and replace them with the
centroid of the cluster, reducing the number of unique colors
required to represent an image. This helps in reducing the file
size of the image while preserving its overall visual quality.
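
A rough sketch of the idea with scikit-learn, whose KMeans uses
k-means++ initialization by default. The random image and the choice
of 16 colors are illustrative stand-ins:

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real image: an (H, W, 3) uint8 RGB array
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
pixels = img.reshape(-1, 3).astype(float)
# Cluster the pixel colors into 16 representative colors
km = KMeans(n_clusters=16, init="k-means++", n_init=1).fit(pixels)
# Replace each pixel with its cluster's centroid color
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)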

Anomaly Detection:
K-means++ can be utilized in anomaly detection tasks. By
clustering data points based on their similarity, any data points
that fall far away from the cluster centroids can be considered as
anomalies or outliers. This can be applied in various domains
such as fraud detection, network intrusion detection, or
detecting faulty equipment in manufacturing processes.
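
One simple way to operationalize this, sketched with scikit-learn.
The cluster count and the top-1% threshold are illustrative modeling
choices, not fixed rules:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(500, 2)  # stand-in for real feature vectors
km = KMeans(n_clusters=5, init="k-means++", n_init=10).fit(X)
# Distance of each point to its assigned cluster centroid
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
# Flag the farthest 1% of points as candidate anomalies
anomalies = X[dists > np.quantile(dists, 0.99)]
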
Document Clustering:
K-means++ can be applied to cluster documents based on their
content similarity. This can be useful in information retrieval
systems or document organization tasks, where documents with
similar themes or topics can be grouped together.
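
A minimal sketch with scikit-learn, assuming the documents are plain
strings. The toy corpus, TF-IDF features, and two clusters are all
illustrative choices:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["k-means groups similar items", "clustering partitions data",
        "cats are popular pets", "dogs are loyal pets"]  # toy documents
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
# Group documents with similar term profiles (k-means++ init is the default)
labels = KMeans(n_clusters=2, init="k-means++", n_init=10).fit_predict(tfidf)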

Social Network Analysis:


K-means++ can be used in social network analysis to identify
groups or communities within a network. By clustering
individuals based on their connections, interests, or interactions,
it becomes possible to understand the structure and dynamics of
the network, identify influential individuals, or detect
communities of interest.

Numerical Example

Let's consider an example with the following data points in a
two-dimensional space:

Data Points:
1. (2, 4)
2. (4, 6)
3. (3, 2)
4. (8, 5)
5. (7, 3)
6. (6, 8)
7. (5, 7)
8. (1, 2)
9. (2, 1)

To initialize the k-means++ algorithm, we need to choose the first
centroid randomly from the given data points. Let's say we randomly
choose data point 1 as the first centroid.

Step 1: Randomly choose the first centroid


Centroid 1: (2, 4)

Step 2: Compute distances and select next centroids


Next, we calculate the distance from each data point to the
nearest centroid that has already been chosen. The probability of
selecting a data point as the next centroid is proportional to the
square of its distance to the nearest centroid.

Calculating the distances from each data point to the nearest
centroid:

Data Point 2: (4, 6)
Distance to nearest centroid: sqrt((4 - 2)^2 + (6 - 4)^2) = sqrt(8) = 2.83

Data Point 3: (3, 2)
Distance to nearest centroid: sqrt((3 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24

Data Point 4: (8, 5)
Distance to nearest centroid: sqrt((8 - 2)^2 + (5 - 4)^2) = sqrt(37) = 6.08

Data Point 5: (7, 3)
Distance to nearest centroid: sqrt((7 - 2)^2 + (3 - 4)^2) = sqrt(26) = 5.10

Data Point 6: (6, 8)
Distance to nearest centroid: sqrt((6 - 2)^2 + (8 - 4)^2) = sqrt(32) = 5.66

Data Point 7: (5, 7)
Distance to nearest centroid: sqrt((5 - 2)^2 + (7 - 4)^2) = sqrt(18) = 4.24

Data Point 8: (1, 2)
Distance to nearest centroid: sqrt((1 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24

Data Point 9: (2, 1)
Distance to nearest centroid: sqrt((2 - 2)^2 + (1 - 4)^2) = sqrt(9) = 3.00

Calculating the probabilities
To select the next centroid, we calculate the probability for each
data point based on its squared distance to the nearest centroid:

Probability of selecting Data Point 2 = (2.83^2) / (2.83^2 + 2.24^2 +
6.08^2 + 5.10^2 + 5.66^2 + 4.24^2 + 2.24^2 + 3.00^2) = 8 / 140 = 0.06

Probability of selecting Data Point 3 = 0.04
Probability of selecting Data Point 4 = 0.26
Probability of selecting Data Point 5 = 0.19
Probability of selecting Data Point 6 = 0.23
Probability of selecting Data Point 7 = 0.13
Probability of selecting Data Point 8 = 0.04
Probability of selecting Data Point 9 = 0.06

To choose the next centroid, we sample a data point based on these
probabilities. Let's say we randomly select Data Point 4 as the
second centroid.

Centroid 1: (2, 4)
Centroid 2: (8, 5)
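
These distances and probabilities can be reproduced with a few lines
of NumPy:

import numpy as np

X = np.array([(2, 4), (4, 6), (3, 2), (8, 5), (7, 3),
              (6, 8), (5, 7), (1, 2), (2, 1)], dtype=float)
c1 = X[0]                               # Centroid 1: (2, 4)
d = np.linalg.norm(X[1:] - c1, axis=1)  # distances of points 2-9
probs = d**2 / (d**2).sum()             # selection probabilities
print(np.round(d, 2))      # approx. [2.83, 2.24, 6.08, 5.10, 5.66, 4.24, 2.24, 3.00]
print(np.round(probs, 2))  # approx. [0.06, 0.04, 0.26, 0.19, 0.23, 0.13, 0.04, 0.06]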

Step 3: Repeat Step 2 until all centroids are chosen


We repeat Step 2 to select the remaining centroids until we have
k centroids. In this example, let's assume we want to find 3
centroids.

Calculating distances and probabilities for the remaining data
points:

Note that each point is now measured against whichever of the two
centroids, (2, 4) or (8, 5), is closer:

Data Point 2: distance to nearest centroid (Centroid 1): sqrt((4 - 2)^2 + (6 - 4)^2) = sqrt(8) = 2.83
Data Point 3: distance to nearest centroid (Centroid 1): sqrt((3 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24
Data Point 5: distance to nearest centroid (Centroid 2): sqrt((7 - 8)^2 + (3 - 5)^2) = sqrt(5) = 2.24
Data Point 6: distance to nearest centroid (Centroid 2): sqrt((6 - 8)^2 + (8 - 5)^2) = sqrt(13) = 3.61
Data Point 7: distance to nearest centroid (Centroid 2): sqrt((5 - 8)^2 + (7 - 5)^2) = sqrt(13) = 3.61
Data Point 8: distance to nearest centroid (Centroid 1): sqrt((1 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24
Data Point 9: distance to nearest centroid (Centroid 1): sqrt((2 - 2)^2 + (1 - 4)^2) = sqrt(9) = 3.00

Probability of selecting Data Point 2 = (2.83^2) / (2.83^2 + 2.24^2 +
2.24^2 + 3.61^2 + 3.61^2 + 2.24^2 + 3.00^2) = 8 / 58 = 0.14

Probability of selecting Data Point 3 = 0.09
Probability of selecting Data Point 5 = 0.09
Probability of selecting Data Point 6 = 0.22
Probability of selecting Data Point 7 = 0.22
Probability of selecting Data Point 8 = 0.09
Probability of selecting Data Point 9 = 0.16
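
The same check works here, now taking the minimum distance over both
chosen centroids:

import numpy as np

X = np.array([(2, 4), (4, 6), (3, 2), (8, 5), (7, 3),
              (6, 8), (5, 7), (1, 2), (2, 1)], dtype=float)
C = np.array([(2, 4), (8, 5)], dtype=float)  # centroids chosen so far
# distance of every point to its nearest centroid
d = np.min(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
keep = d > 0                                 # drop the chosen centroids themselves
probs = d[keep]**2 / (d[keep]**2).sum()
print(np.round(probs, 2))  # approx. [0.14, 0.09, 0.09, 0.22, 0.22, 0.09, 0.16]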

Randomly sampling a data point based on these probabilities, let's
say the draw selects Data Point 8 as the third centroid.

Centroid 1: (2, 4)
Centroid 2: (8, 5)
Centroid 3: (1, 2)

These three centroids serve as the initial cluster centers for the
k-means algorithm, and the algorithm proceeds to assign each
data point to its nearest centroid and iteratively update the
centroids until convergence.
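
In practice, a library can handle both the initialization and the
subsequent iterations. For example, scikit-learn's KMeans uses
k-means++ by default:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([(2, 4), (4, 6), (3, 2), (8, 5), (7, 3),
              (6, 8), (5, 7), (1, 2), (2, 1)], dtype=float)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # final centroids after convergence
print(km.labels_)           # cluster assignment of each point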

Note: The final clustering results and the order in which the
centroids are selected may vary depending on the random
choices made during the process.

K-Means Limitations and Weaknesses


Unfortunately, k-means has limitations, even with k-means++
initialization. It does not do well when:

• The clusters are of unequal size or density.
• The clusters are non-spherical.
• There are outliers in the data.
