K-Means++
K-means++ is an initialization scheme for the K-means clustering
algorithm that selects the initial centroids in a principled way. It
aims to improve both the convergence speed and the quality of the
final clustering result.

How does K-Means++ work?


K-Means++ is a smart centroid initialization technique; the rest of
the algorithm is identical to standard K-Means. The steps for
centroid initialization are as follows (a short Python sketch
appears after the list):

• Pick the first centroid c_1 uniformly at random from the data
points.

• Compute the distance of every point in the dataset from its
nearest already-chosen centroid. The distance d_i of point x_i is

  d_i = min_{j = 1, ..., m} ||x_i - c_j||

  where m is the number of centroids already picked.

• Select the next centroid from the data points with probability
proportional to d_i^2, so points that are far from every existing
centroid are the most likely to be chosen.

• Repeat the previous two steps until k centroids have been chosen.
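
As a concrete illustration, here is a minimal NumPy sketch of this
initialization. The function name and interface are our own, not
from any particular library:

import numpy as np

def kmeans_pp_init(X, k, seed=None):
    # X: (n, d) array of data points; k: number of centroids to pick
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(n)]]
    while len(centroids) < k:
        C = np.array(centroids)
        # d_i^2: squared distance of each point to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Sample the next centroid with probability proportional to d_i^2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)

Calling kmeans_pp_init(X, 3) on the nine-point dataset used later in
this article would return three of those points as initial centroids.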


Real-World Examples

Here are a few real-world examples where K-means++ can be applied:

Customer Segmentation:
In the e-commerce industry, K-means++ can be used to segment
customers based on their purchasing behavior. By analyzing
their buying patterns, preferences, and demographic
information, businesses can create targeted marketing
campaigns and personalize their offerings to different customer
segments.

Image Compression:
K-means++ can be employed in image compression algorithms.
It can cluster similar colors together and replace them with the
centroid of the cluster, reducing the number of unique colors
required to represent an image. This helps in reducing the file
size of the image while preserving its overall visual quality.
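
A rough sketch of the idea with scikit-learn, whose KMeans uses
k-means++ initialization by default. The random image and the choice
of 16 colors are illustrative stand-ins:

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real image: an (H, W, 3) uint8 RGB array
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
pixels = img.reshape(-1, 3).astype(float)
# Cluster the pixel colors into 16 representative colors
km = KMeans(n_clusters=16, init="k-means++", n_init=1).fit(pixels)
# Replace each pixel with its cluster's centroid color
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)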

Anomaly Detection:
K-means++ can be utilized in anomaly detection tasks. By
clustering data points based on their similarity, any data points
that fall far away from the cluster centroids can be considered as
anomalies or outliers. This can be applied in various domains
such as fraud detection, network intrusion detection, or
detecting faulty equipment in manufacturing processes.
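
One simple way to operationalize this, sketched with scikit-learn.
The cluster count and the top-1% threshold are illustrative modeling
choices, not fixed rules:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(500, 2)  # stand-in for real feature vectors
km = KMeans(n_clusters=5, init="k-means++", n_init=10).fit(X)
# Distance of each point to its assigned cluster centroid
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
# Flag the farthest 1% of points as candidate anomalies
anomalies = X[dists > np.quantile(dists, 0.99)]
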
Document Clustering:
K-means++ can be applied to cluster documents based on their
content similarity. This can be useful in information retrieval
systems or document organization tasks, where documents with
similar themes or topics can be grouped together.
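
A minimal sketch with scikit-learn, assuming the documents are plain
strings. The toy corpus, TF-IDF features, and two clusters are all
illustrative choices:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["k-means groups similar items", "clustering partitions data",
        "cats are popular pets", "dogs are loyal pets"]  # toy documents
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
# Group documents with similar term profiles (k-means++ init is the default)
labels = KMeans(n_clusters=2, init="k-means++", n_init=10).fit_predict(tfidf)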

Social Network Analysis:


K-means++ can be used in social network analysis to identify
groups or communities within a network. By clustering
individuals based on their connections, interests, or interactions,
it becomes possible to understand the structure and dynamics of
the network, identify influential individuals, or detect
communities of interest.

Numerical Example

Let's consider an example with the following data points in a
two-dimensional space:

Data Points:
1. (2, 4)
2. (4, 6)
3. (3, 2)
4. (8, 5)
5. (7, 3)
6. (6, 8)
7. (5, 7)
8. (1, 2)
9. (2, 1)

To initialize the k-means++ algorithm, we need to choose the first
centroid randomly from the given data points. Let's say we randomly
choose data point 1 as the first centroid.

Step 1: Randomly choose the first centroid


Centroid 1: (2, 4)

Step 2: Compute distances and select next centroids


Next, we calculate the distance from each data point to the
nearest centroid that has already been chosen. The probability of
selecting a data point as the next centroid is proportional to the
square of its distance to the nearest centroid.

Calculating the distances from each data point to the nearest
centroid:

Data Point 2: (4, 6)
Distance to nearest centroid: sqrt((4 - 2)^2 + (6 - 4)^2) = sqrt(8) = 2.83

Data Point 3: (3, 2)
Distance to nearest centroid: sqrt((3 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24

Data Point 4: (8, 5)
Distance to nearest centroid: sqrt((8 - 2)^2 + (5 - 4)^2) = sqrt(37) = 6.08

Data Point 5: (7, 3)
Distance to nearest centroid: sqrt((7 - 2)^2 + (3 - 4)^2) = sqrt(26) = 5.10

Data Point 6: (6, 8)
Distance to nearest centroid: sqrt((6 - 2)^2 + (8 - 4)^2) = sqrt(32) = 5.66

Data Point 7: (5, 7)
Distance to nearest centroid: sqrt((5 - 2)^2 + (7 - 4)^2) = sqrt(18) = 4.24

Data Point 8: (1, 2)
Distance to nearest centroid: sqrt((1 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24

Data Point 9: (2, 1)
Distance to nearest centroid: sqrt((2 - 2)^2 + (1 - 4)^2) = sqrt(9) = 3.00

Calculating the probabilities
To select the next centroid, we calculate the probability for each
data point based on its squared distance to the nearest centroid:

Probability of selecting Data Point 2 = (2.83^2) / (2.83^2 + 2.24^2 +
6.08^2 + 5.10^2 + 5.66^2 + 4.24^2 + 2.24^2 + 3.00^2) = 8 / 140 = 0.06

Probability of selecting Data Point 3 = 0.04
Probability of selecting Data Point 4 = 0.26
Probability of selecting Data Point 5 = 0.19
Probability of selecting Data Point 6 = 0.23
Probability of selecting Data Point 7 = 0.13
Probability of selecting Data Point 8 = 0.04
Probability of selecting Data Point 9 = 0.06

To choose the next centroid, we sample a data point based on these
probabilities. Let's say we randomly select Data Point 4 as the
second centroid.

Centroid 1: (2, 4)
Centroid 2: (8, 5)
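
These distances and probabilities can be reproduced with a few lines
of NumPy:

import numpy as np

X = np.array([(2, 4), (4, 6), (3, 2), (8, 5), (7, 3),
              (6, 8), (5, 7), (1, 2), (2, 1)], dtype=float)
c1 = X[0]                               # Centroid 1: (2, 4)
d = np.linalg.norm(X[1:] - c1, axis=1)  # distances of points 2-9
probs = d**2 / (d**2).sum()             # selection probabilities
print(np.round(d, 2))      # approx. [2.83, 2.24, 6.08, 5.10, 5.66, 4.24, 2.24, 3.00]
print(np.round(probs, 2))  # approx. [0.06, 0.04, 0.26, 0.19, 0.23, 0.13, 0.04, 0.06]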

Step 3: Repeat Step 2 until all centroids are chosen


We repeat Step 2 to select the remaining centroids until we have
k centroids. In this example, let's assume we want to find 3
centroids.

Calculating distances and probabilities for the remaining data
points:

Note that each point is now measured against whichever of the two
centroids, (2, 4) or (8, 5), is closer:

Data Point 2: distance to nearest centroid (Centroid 1): sqrt((4 - 2)^2 + (6 - 4)^2) = sqrt(8) = 2.83
Data Point 3: distance to nearest centroid (Centroid 1): sqrt((3 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24
Data Point 5: distance to nearest centroid (Centroid 2): sqrt((7 - 8)^2 + (3 - 5)^2) = sqrt(5) = 2.24
Data Point 6: distance to nearest centroid (Centroid 2): sqrt((6 - 8)^2 + (8 - 5)^2) = sqrt(13) = 3.61
Data Point 7: distance to nearest centroid (Centroid 2): sqrt((5 - 8)^2 + (7 - 5)^2) = sqrt(13) = 3.61
Data Point 8: distance to nearest centroid (Centroid 1): sqrt((1 - 2)^2 + (2 - 4)^2) = sqrt(5) = 2.24
Data Point 9: distance to nearest centroid (Centroid 1): sqrt((2 - 2)^2 + (1 - 4)^2) = sqrt(9) = 3.00

Probability of selecting Data Point 2 = (2.83^2) / (2.83^2 + 2.24^2 +
2.24^2 + 3.61^2 + 3.61^2 + 2.24^2 + 3.00^2) = 8 / 58 = 0.14

Probability of selecting Data Point 3 = 0.09
Probability of selecting Data Point 5 = 0.09
Probability of selecting Data Point 6 = 0.22
Probability of selecting Data Point 7 = 0.22
Probability of selecting Data Point 8 = 0.09
Probability of selecting Data Point 9 = 0.16
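
The same check works here, now taking the minimum distance over both
chosen centroids:

import numpy as np

X = np.array([(2, 4), (4, 6), (3, 2), (8, 5), (7, 3),
              (6, 8), (5, 7), (1, 2), (2, 1)], dtype=float)
C = np.array([(2, 4), (8, 5)], dtype=float)  # centroids chosen so far
# distance of every point to its nearest centroid
d = np.min(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
keep = d > 0                                 # drop the chosen centroids themselves
probs = d[keep]**2 / (d[keep]**2).sum()
print(np.round(probs, 2))  # approx. [0.14, 0.09, 0.09, 0.22, 0.22, 0.09, 0.16]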

Randomly sampling a data point based on these probabilities, let's
say the draw selects Data Point 8 as the third centroid.

Centroid 1: (2, 4)
Centroid 2: (8, 5)
Centroid 3: (1, 2)

These three centroids serve as the initial cluster centers for the
k-means algorithm, and the algorithm proceeds to assign each
data point to its nearest centroid and iteratively update the
centroids until convergence.
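
In practice, a library can handle both the initialization and the
subsequent iterations. For example, scikit-learn's KMeans uses
k-means++ by default:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([(2, 4), (4, 6), (3, 2), (8, 5), (7, 3),
              (6, 8), (5, 7), (1, 2), (2, 1)], dtype=float)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # final centroids after convergence
print(km.labels_)           # cluster assignment of each point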

Note: The final clustering results and the order in which the
centroids are selected may vary depending on the random
choices made during the process.

K-Means Limitations and Weaknesses


Unfortunately, k-means has limitations, even with k-means++
initialization. It does not do well when:

• The clusters are of unequal size or density.
• The clusters are non-spherical.
• There are outliers in the data.
