“JnanaSangama”, Belagavi-18, Karnataka, India

Report on
Bachelor of Engineering
in
INDUSTRIAL ENGINEERING AND MANAGEMENT

By
PREETHI M      1DS19ET062
CHETHAN M N    1DS20ET018
VAISHAK M      1DS18ET058
7th Sem B.E.

Under the guidance of
Prof.
Assistant Professor

2023-2024
INTRODUCTION
Understanding the relationship between age and income is important in fields ranging
from marketing and targeted advertising to demographic studies. The K-Means
clustering algorithm, a mainstay of unsupervised machine learning, serves as our tool
of choice for uncovering patterns in this dataset. Initial observations are based on the
unaltered data; subsequent analyses examine the impact of feature scaling on the
clustering outcome.
1. Initialization:
- Begin by selecting the number of clusters (k) that the algorithm should identify
in the dataset.
- Randomly initialize k cluster centres within the feature space of the dataset.
2. Assignment:
- For each data point in the dataset, calculate the distance to each cluster centre.
- Assign the data point to the cluster with the nearest centre.
3. Update:
- Recalculate the cluster centres by computing the mean of all data points
assigned to each cluster.
- This step adjusts the cluster centres based on the current assignment of data points.
4. Repeat:
- Repeat steps 2 and 3 until convergence, where either the cluster assignments
stabilize or a predefined number of iterations is reached.
5. Convergence Criteria:
- The algorithm has converged when cluster assignments no longer change between
iterations, or when the movement of the cluster centres falls below a small threshold.
6. Final Results:
- The final result is a partitioning of the dataset into k clusters, each characterized by its
centroid (mean location) and the data points assigned to it.
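The steps above can be sketched as a minimal NumPy implementation. This is an illustrative sketch only: the function name, stopping rule, and test data are ours, and it assumes no cluster becomes empty during iteration (the scikit-learn implementation used later in this report handles such cases robustly).

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: returns (labels, centres)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as initial centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centre as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```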
Key Considerations:
Number of Clusters (k): Selecting an appropriate value for k is crucial. Techniques such as the
elbow method or silhouette analysis can help determine the optimal number of clusters.
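The elbow method mentioned above can be sketched with scikit-learn's `inertia_` attribute (the sum of squared distances to the nearest centre). The helper name and the data are ours; in this report it would be applied to the Age/Income features.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_curve(X, k_max=9):
    """Inertia (sum of squared distances to nearest centre) for k = 1..k_max."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, k_max + 1)]

# Inertia always decreases as k grows; the "elbow" is the k at which
# the improvement flattens out, suggesting a reasonable cluster count.
```

Plotting `range(1, k_max + 1)` against the returned list with matplotlib makes the elbow visible.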
Initialization Sensitivity: K-Means is sensitive to the initial placement of cluster centers.
Running the algorithm multiple times with different initializations and choosing the best
result can mitigate this sensitivity.
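Scikit-learn exposes exactly this mitigation through its `n_init` parameter (run the algorithm several times and keep the lowest-inertia result) and the `k-means++` initialisation scheme. A small comparison on synthetic data (the data itself is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in ((0, 0), (4, 0), (2, 4))])

# A single random initialisation may land in a poor local minimum;
# n_init=10 runs K-Means ten times and keeps the lowest-inertia result.
single = KMeans(n_clusters=3, init='random', n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
```

The best of ten runs can never have higher inertia than a comparable single run, which is why `n_init > 1` is the usual default.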
Scaling Features: It's often advisable to scale or normalize features to ensure equal importance
during distance calculations.
Handling Outliers: Outliers can disproportionately influence cluster centers. Preprocessing
steps like outlier removal or using more robust clustering algorithms may be considered.
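One common preprocessing step of this kind is interquartile-range (IQR) filtering. The sketch below is a generic helper (function name and sample data are ours, not part of this report's pipeline):

```python
import pandas as pd

def drop_iqr_outliers(df, col, k=1.5):
    """Remove rows where col lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
```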
K-Means clustering finds applications in diverse fields, including customer segmentation, image
compression, and anomaly detection, offering a versatile and efficient solution for partitioning
datasets into meaningful groups.
Dataset overview
The dataset used for this analysis is named "income.csv." The columns used in this
analysis are 'Age' and 'Income($)', which serve as the clustering features.
Data Exploration
The initial exploration involved visualizing the relationship between age and income
through a scatter plot. This provided an overview of the data distribution and potential
patterns.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('income.csv')

# Scatter plot of the raw (unscaled) data
plt.scatter(df.Age, df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.show()
K-Means Clustering
Initial Clustering: The K-Means clustering algorithm was applied to the data
with three clusters. The resulting cluster assignments were added to the dataset.
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
df['cluster'] = y_predicted

# Plot each cluster in its own colour, with the centroids marked
for i, colour in zip(range(3), ('green', 'red', 'black')):
    subset = df[df.cluster == i]
    plt.scatter(subset.Age, subset['Income($)'], color=colour, label=f'cluster {i}')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color='purple', marker='*', label='centroid')
plt.legend()
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.show()
from sklearn.preprocessing import MinMaxScaler

# Scale Age and Income($) to the [0, 1] range
scaler = MinMaxScaler()
df[['Age', 'Income($)']] = scaler.fit_transform(df[['Age', 'Income($)']])
Re-run K-Means on Scaled Data: The K-Means algorithm was re-applied to the
scaled data, and new cluster assignments were updated in the 'cluster' column.
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
df['cluster'] = y_predicted
Visualizing Scaled Clusters: The clusters obtained from the scaled data
were visualized on the scatter plot, with cluster centers marked.
plt.legend()
plt.xlabel('Age (scaled)')
plt.ylabel('Income($) (scaled)')
plt.show()
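Beyond visual inspection, the quality of the scaled clustering can be checked with the silhouette analysis mentioned earlier. The helper below (its name and the synthetic test data are ours) picks the k with the highest mean silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 7)):
    """Return (best k, scores) where scores maps k to its mean silhouette."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```

Applied to the scaled Age/Income features, this would confirm (or challenge) the choice of three clusters.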
CONCLUSION
In conclusion, the K-Means clustering analysis conducted on the "income.csv" dataset has
revealed significant patterns and structures within the data. The initial clustering, based on
three clusters, provided a preliminary understanding of how individuals naturally group based
on age and income. Visualizations, including scatter plots and centroids, enhanced the
interpretability of these clusters. However, the sensitivity to initial cluster placement
highlighted the need for further exploration.
To address this concern, we applied MinMax scaling to standardize the age and income
features, leading to a more equitable influence of both variables. The subsequent clustering
on the scaled data presented clusters with enhanced cohesion and stability. This step not only
improved the robustness of the analysis but also offered a more reliable basis for
interpretation.
In essence, the K-Means clustering analysis has successfully unveiled insights into the
inherent structures of the dataset. These findings have implications across various domains,
including marketing, demographics, and targeted services. While this analysis has provided a
solid foundation, further investigation, possibly incorporating different cluster numbers or
alternative algorithms, could deepen our understanding and refine strategic decision-making.
Fig: Dataset used for K-Means clustering
The results of the K-Means clustering analysis are noteworthy. The initial clustering,
performed with three clusters, revealed distinct groups of individuals based on their age and
income. Visualizations, including scatter plots and centroid markers, illustrated the
characteristics of each cluster. However, the sensitivity of the algorithm to initializations
prompted a refinement strategy.
The introduction of MinMax scaling significantly improved the stability and reliability of the
clustering results. The subsequent analysis of the scaled data exhibited more coherent
clusters, mitigating the influence of initial cluster placements. This step enhances the
interpretability of the clusters and provides a more accurate representation of the inherent
patterns within the dataset.
In summary, the results affirm the efficacy of K-Means clustering in uncovering meaningful
structures in the "income.csv" dataset. The clusters obtained, particularly after scaling, offer
valuable insights into how age and income contribute to natural groupings. These results form
a solid basis for informed decision-making in fields such as targeted marketing and
demographic analysis.
The dataset has been partitioned into three clusters based on age and income.