
MSc AIBT : Machine Learning with Python

Practice 2 – k-means
The dataset is available on Moodle (or by email).

1) Data Visualization
a) Load the database (Customers_practice.csv).

b) Print the first 10 rows of the dataset (with the head() function). Determine the number
of examples and the number of features of the problem.

c) Display a scatter plot of the data. You should obtain the following expected result:
[Figure: expected scatter plot of the data]
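
As a starting point for part 1, here is a minimal sketch of the three steps (load, inspect, plot). The real file is not reproduced here, so the code builds a synthetic stand-in; the column names `income` and `transactions` and the assumption of exactly two numeric features are illustrative guesses, not taken from Customers_practice.csv.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# With the real file you would write: df = pd.read_csv("Customers_practice.csv")
# The frame below is a synthetic stand-in with two made-up numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50, 15, size=200),
    "transactions": rng.normal(20, 5, size=200),
})

print(df.head(10))                      # first 10 rows
n_examples, n_features = df.shape       # (rows, columns)
print(n_examples, n_features)

fig, ax = plt.subplots()
ax.scatter(df["income"], df["transactions"], s=15)
ax.set_xlabel("income")
ax.set_ylabel("transactions")
ax.set_title("Raw data")
# In an interactive session, plt.show() displays the figure.
```

With the real dataset, replace the synthetic frame by the `pd.read_csv` call and use the actual column names printed by `head()`.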

2) K-means algorithm
The scikit-learn documentation is available here:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

a) Test the k-means algorithm with k=3 (n_clusters=3) and random_state=0. Use the fit()
function on your dataset. Because there is no target column, you can use all of the data
to train your model.
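
The fitting step could look like the sketch below. It runs on synthetic points generated with `make_blobs` as a stand-in for the customer features (an assumption, since the real file is not included here); `n_init=10` is also an addition, set explicitly because its default value differs between scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the feature columns of Customers_practice.csv.
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)

# k = 3 clusters; random_state fixes the centroid initialisation.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
kmeans.fit(X)  # unsupervised: no target column, so every row is used for training

print(kmeans.n_iter_)  # number of iterations run before stopping
```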


b) Once the model is trained, you can access the labels that k-means assigned to the data
through the labels_ attribute (see the documentation for a usage example).
Display the distinct classes assigned by k-means (use np.unique()).
c) You can access the centroids of the clusters through the cluster_centers_ attribute
(see the documentation for a usage example). Print them.
d) Draw the scatter plot again, this time colouring each point according to the label
assigned by the k-means algorithm. You should obtain the following plot:
[Figure: expected scatter plot coloured by k-means label]

e) Explain why k=3 does not seem to be an appropriate number of clusters.
f) Find a way to display the centroids on the plot. Be practical and write a function that
plots everything.
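
One possible way to approach questions b), c), d) and f) is sketched below. The data are again a synthetic `make_blobs` stand-in, and `plot_clusters` is a hypothetical helper written for this sheet, not a scikit-learn function. It assumes two features and that `X` is the data the model was fitted on, so that `labels_` lines up with the rows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer features (assumption).
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

print(np.unique(kmeans.labels_))   # distinct cluster labels
print(kmeans.cluster_centers_)     # one row per centroid


def plot_clusters(X, model, ax=None):
    """Scatter the training points coloured by label and overlay the centroids.

    Hypothetical helper: X must be the array the model was fitted on.
    """
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap="viridis", s=15)
    centers = model.cluster_centers_
    ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=150,
               label="centroids")
    ax.legend()
    return ax


ax = plot_clusters(X, kmeans)
```

Returning the axes object lets you add a title or reuse the same function later with a different value of k.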

3) Find the optimal value of k


Find in the documentation the attribute that returns the SSD (sum of squared distances)
value of a trained k-means model.
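
To check your answer, you can compute the SSD by hand and compare it with the value stored on the model. The sketch below assumes that the attribute in question is `inertia_` (the sum of squared distances of samples to their closest cluster centre, as described in the scikit-learn documentation) and uses synthetic data as a stand-in for the real file.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer features (assumption).
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Centroid assigned to each point, then the sum of squared distances to it.
assigned = kmeans.cluster_centers_[kmeans.labels_]
ssd_manual = np.sum((X - assigned) ** 2)

# The two values should agree up to floating-point error.
print(ssd_manual, kmeans.inertia_)
```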
a) Using the whole dataset, write a script that does the following:
- Find the optimal value of k using the elbow method (use the range [1, 16), i.e. k = 1 to 15).

- Use the following parameters when initializing KMeans: random_state=42 and init='k-means++'.
- Draw the elbow plot. You should obtain the following plot:
  [Figure: expected elbow plot]

- Conclude which value of k is best.
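
The steps above can be sketched as a single loop. This is only an illustration on synthetic `make_blobs` data (an assumption), so the elbow it produces says nothing about the right k for the customer dataset; `n_init=10` is also an addition, set explicitly because its default differs between scikit-learn versions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer features (assumption).
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)

ks = list(range(1, 16))  # k = 1, ..., 15
ssd = []
for k in ks:
    model = KMeans(n_clusters=k, init="k-means++", random_state=42, n_init=10)
    model.fit(X)
    ssd.append(model.inertia_)  # SSD of the fitted model

fig, ax = plt.subplots()
ax.plot(ks, ssd, marker="o")
ax.set_xlabel("k")
ax.set_ylabel("SSD (inertia_)")
ax.set_title("Elbow method")
```

The best k is usually read off as the point where the curve stops dropping sharply and starts to flatten.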

b) Train a k-means model with the best value of k obtained above:
- Use random_state=42 and init='k-means++'.
- Draw the associated scatter plot.
- Observe and describe the clusters obtained in terms of the axes (e.g. cluster 1 contains the
customers having low income but a high number of transactions).
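
To help describe the clusters, a per-cluster summary of each feature can be useful alongside the plot. The sketch below is one way to do it, under several assumptions: the data are synthetic, the column names `income` and `transactions` are made up, and `best_k = 5` is only a placeholder to be replaced by the value you read off your own elbow plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in with made-up column names (assumption).
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)
df = pd.DataFrame(X, columns=["income", "transactions"])

best_k = 5  # placeholder: use the k found with your own elbow plot
model = KMeans(n_clusters=best_k, init="k-means++", random_state=42, n_init=10)
df["cluster"] = model.fit_predict(df[["income", "transactions"]])

# Mean of each feature per cluster, and number of customers per cluster.
summary = df.groupby("cluster")[["income", "transactions"]].mean()
counts = df["cluster"].value_counts().sort_index()
print(summary)
print(counts)

fig, ax = plt.subplots()
ax.scatter(df["income"], df["transactions"], c=df["cluster"], cmap="viridis", s=15)
ax.set_xlabel("income")
ax.set_ylabel("transactions")
```

Comparing the rows of `summary` with the overall feature ranges gives the "low income, high number of transactions" kind of description asked for.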

5) More
Load the test samples (Customers_practice_test.csv).

a) Use the k-means model trained with the optimal value of k (found in part 3) to predict
the clusters of the test samples just loaded.
b) Print the predictions.
c) Plot the decision boundaries. Here is an example with k=3:
[Figure: example decision boundaries with k=3]
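
A possible approach to this last part is sketched below: predict the test rows with `predict()`, then predict every point of a dense grid and colour the grid by cluster to reveal the boundaries. The train and test arrays are synthetic stand-ins for the two CSV files, `best_k = 5` is a placeholder, and the grid resolution and padding are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-ins for the training and test files (assumption).
X_all, _ = make_blobs(n_samples=260, centers=5, n_features=2, random_state=0)
X_train, X_test = X_all[:200], X_all[200:]

best_k = 5  # placeholder: use the k found with your own elbow plot
model = KMeans(n_clusters=best_k, init="k-means++", random_state=42,
               n_init=10).fit(X_train)

pred = model.predict(X_test)  # index of the closest centroid for each test row
print(pred)

# Predict every point of a dense grid to colour each region by its cluster.
pad = 1.0
x_min, x_max = X_all[:, 0].min() - pad, X_all[:, 0].max() + pad
y_min, y_max = X_all[:, 1].min() - pad, X_all[:, 1].max() + pad
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
grid_labels = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots()
ax.pcolormesh(xx, yy, grid_labels, cmap="viridis", alpha=0.3, shading="auto",
              vmin=0, vmax=best_k - 1)
ax.scatter(X_test[:, 0], X_test[:, 1], c=pred, cmap="viridis", edgecolor="k",
           s=25, vmin=0, vmax=best_k - 1)
ax.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
           c="red", marker="X", s=150)
```

Fixing `vmin` and `vmax` keeps the colours of the background regions and of the test points consistent even if some cluster receives no test sample.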
