
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

"Jnana Sangama", Belagavi-18, Karnataka, India

Report on

“ANALYSIS OF K-MEANS CLUSTERING”

Project Submitted in partial fulfillment of the requirement for the degree of

Bachelor of Engineering
In
INDUSTRIAL ENGINEERING AND MANAGEMENT
By
PREETHI M 1DS19ET062
CHETHAN M N 1DS20ET018
VAISHAK M 1DS18ET058

7th Sem B.E.
Under the guidance of

Prof.
Assistant Professor

Department of Industrial Engineering and Management


DAYANANDA SAGAR COLLEGE OF ENGINEERING
BENGALURU-560078

2023-2024
INTRODUCTION

The advent of data-driven decision-making has propelled the exploration of clustering algorithms to discern inherent patterns within datasets. In this report, we present a clustering analysis performed on a dataset named "income.csv", focusing on two pivotal variables: age and income. The overarching goal is to uncover natural groupings, or clusters, within the data, shedding light on potential relationships between individuals' age and income.

Understanding the dynamics between age and income is paramount in various fields, ranging from marketing strategies and targeted advertising to demographic studies. The K-Means clustering algorithm, a stalwart of unsupervised machine learning, serves as our tool of choice for unravelling patterns within this dataset. Initial observations are based on the unaltered dataset, and subsequent analyses examine the impact of data scaling on clustering outcomes.

Through visualization techniques and cluster assignments, we aim to provide a comprehensive understanding of how individuals might naturally segregate based on age and income. As we progress through this report, key findings, visual representations, and recommendations for further exploration illuminate the significance of this clustering analysis in deciphering the underlying structure of the dataset.
METHODOLOGY

K-Means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into distinct groups, or clusters, based on inherent patterns. The algorithm operates iteratively, aiming to minimize the sum of squared distances between data points and their assigned cluster centres. The following steps outline the typical K-Means methodology:

1. Initialization:

- Begin by selecting the number of clusters (k) that the algorithm should identify
in the dataset.

- Randomly initialize k cluster centres within the feature space of the dataset.

2. Assignment of Data Points to Clusters:

- For each data point in the dataset, calculate the distance to each cluster centre.

- Assign the data point to the cluster with the nearest centre.

3. Update Cluster Centres:

- Recalculate the cluster centres by computing the mean of all data points
assigned to each cluster.

- This step adjusts the cluster centres based on the current assignment of data points.

4. Repeat:

- Repeat steps 2 and 3 until convergence, where either the cluster assignments
stabilize or a predefined number of iterations is reached.
5. Convergence Criteria:

- Convergence is typically determined by assessing whether the cluster assignments remain unchanged between iterations or whether the change falls below a specified threshold.

6. Final Results:

- The final result is a partitioning of the dataset into k clusters, each characterized by its
centroid (mean location) and the data points assigned to it.
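The iterative procedure above can be sketched as a minimal from-scratch implementation. This is an illustrative NumPy sketch only (the analysis in this report uses scikit-learn's KMeans), and the two-blob data at the bottom is a synthetic assumption for demonstration:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial cluster centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centre as the mean of its assigned points
        # (an empty cluster keeps its previous centre)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Steps 4-5: stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Two well-separated synthetic blobs should come back as two clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centres = kmeans(X, k=2)
```

The empty-cluster guard in Step 3 is one simple way to keep the update well-defined; production implementations typically re-seed such centres instead.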

Key Considerations:

- Number of Clusters (k): Selecting an appropriate value for k is crucial. Techniques such as the elbow method or silhouette analysis can help determine the optimal number of clusters.
- Initialization Sensitivity: K-Means is sensitive to the initial placement of cluster centres. Running the algorithm multiple times with different initializations and choosing the best result can mitigate this sensitivity.
- Scaling Features: It is often advisable to scale or normalize features to ensure equal importance during distance calculations.
- Handling Outliers: Outliers can disproportionately influence cluster centres. Preprocessing steps like outlier removal, or more robust clustering algorithms, may be considered.
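The elbow method mentioned above can be illustrated with a short scikit-learn sketch. The three-group data here is synthetic, standing in for a two-feature table such as the Age/Income($) pair used later in this report:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in: three well-separated groups in two features
X = np.vstack([rng.normal(loc, 0.5, (30, 2)) for loc in (0, 5, 10)])

# Inertia = within-cluster sum of squared distances to the centres.
# It always falls as k grows; the "elbow" where it flattens suggests k.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
```

For this data the drop in inertia is steep up to k = 3 and marginal afterwards, so the elbow points to three clusters.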

K-Means clustering finds applications in diverse fields, including customer segmentation, image
compression, and anomaly detection, offering a versatile and efficient solution for partitioning
datasets into meaningful groups.
Dataset Overview

The dataset used for this analysis is named "income.csv." It includes the following
columns:

Name: The name of the individual.

Age: The age of the individual.

Income($): The income of the individual.

Data Exploration

The initial exploration involved visualizing the relationship between age and income
through a scatter plot. This provided an overview of the data distribution and potential
patterns.

# Scatter plot code snippet
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("income.csv")

plt.scatter(df.Age, df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.show()

K-Means Clustering

- Initial Clustering: The K-Means clustering algorithm was applied to the data with three clusters. The resulting cluster assignments were added to the dataset.

# KMeans clustering code snippet
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
df['cluster'] = y_predicted


- Visualizing Initial Clusters: The initial clusters were visualized on the scatter plot, with cluster centres marked.

# Visualization of initial clusters code snippet
df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]

plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            color='purple', marker='*', label='centroid')
plt.legend()
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.show()

To ensure uniformity in feature scales, MinMaxScaler was applied to the 'Age' and 'Income($)' columns.

# Data scaling code snippet
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df[['Age', 'Income($)']])
df[['Age', 'Income($)']] = scaler.transform(df[['Age', 'Income($)']])

- Re-run K-Means on Scaled Data: The K-Means algorithm was re-applied to the scaled data, and new cluster assignments were updated in the 'cluster' column.

# Re-running KMeans on scaled data code snippet
y_predicted = km.fit_predict(df[['Age', 'Income($)']])  # labels from the scaled features
df['cluster'] = y_predicted

- Visualizing Scaled Clusters: The clusters obtained from the scaled data were visualized on the scatter plot, with cluster centres marked.

# Visualization of scaled clusters code snippet
df1 = df[df.cluster == 0]  # re-select using the new cluster labels
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]

plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            color='purple', marker='*', label='centroid')
plt.legend()
plt.xlabel('Age (scaled)')
plt.ylabel('Income ($) (scaled)')
plt.show()
CONCLUSION

In conclusion, the K-Means clustering analysis conducted on the "income.csv" dataset has
revealed significant patterns and structures within the data. The initial clustering, based on
three clusters, provided a preliminary understanding of how individuals naturally group based
on age and income. Visualizations, including scatter plots and centroids, enhanced the
interpretability of these clusters. However, the sensitivity to initial cluster placement
highlighted the need for further exploration.

To address this concern, we applied MinMax scaling to standardize the age and income
features, leading to a more equitable influence of both variables. The subsequent clustering
on the scaled data presented clusters with enhanced cohesion and stability. This step not only
improved the robustness of the analysis but also offered a more reliable basis for
interpretation.

In essence, the K-Means clustering analysis has successfully unveiled insights into the
inherent structures of the dataset. These findings have implications across various domains,
including marketing, demographics, and targeted services. While this analysis has provided a
solid foundation, further investigation, possibly incorporating different cluster numbers or
alternative algorithms, could deepen our understanding and refine strategic decision-making.
Fig: Dataset used for K-Means clustering

Fig: Data points after K-Means clustering
RESULTS

The results of the K-Means clustering analysis are noteworthy. The initial clustering,
performed with three clusters, revealed distinct groups of individuals based on their age and
income. Visualizations, including scatter plots and centroid markers, illustrated the
characteristics of each cluster. However, the sensitivity of the algorithm to initializations
prompted a refinement strategy.

The introduction of MinMax scaling significantly improved the stability and reliability of the
clustering results. The subsequent analysis of the scaled data exhibited more coherent
clusters, mitigating the influence of initial cluster placements. This step enhances the
interpretability of the clusters and provides a more accurate representation of the inherent
patterns within the dataset.

In summary, the results affirm the efficacy of K-Means clustering in uncovering meaningful
structures in the "income.csv" dataset. The clusters obtained, particularly after scaling, offer
valuable insights into how age and income contribute to natural groupings. These results form
a solid basis for informed decision-making in fields such as targeted marketing and
demographic analysis.

The dataset has been partitioned into three clusters based on age and income:

Cluster 0: Comprising individuals with moderate age and income levels.

Cluster 1: Defined by younger individuals with lower incomes.

Cluster 2: Encompassing older individuals with higher incomes.
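Because the final model was fitted on MinMax-scaled features, its centroids lie in the [0, 1] range; to describe each cluster in original Age and Income($) units, as above, the centroids can be mapped back with the fitted scaler. A sketch with synthetic stand-in data (income.csv itself is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the Age / Income($) columns
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.integers(20, 60, 50),            # Age
    rng.integers(30_000, 150_000, 50),   # Income($)
]).astype(float)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)      # both features now in [0, 1]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Map the scaled centroids back to interpretable Age / Income($) units
centroids_original = scaler.inverse_transform(km.cluster_centers_)
```

Reporting `centroids_original` alongside the cluster labels makes descriptions such as "younger individuals with lower incomes" directly verifiable from the model.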
