Assignment

1) Study the code showing K-means clustering using the Iris dataset. The number of
clusters is chosen to be 5.

Click here to browse the K-means clustering code in Google Colab.

a) Experiment with different values for the number of clusters (say, from 1 to 10) and
store the error for each value in a list.
(Hint: Error = []; Error.append(model_kmeans.inertia_))
The K-means algorithm aims to choose centroids that minimise the inertia, or
within-cluster sum-of-squares criterion (https://scikit-learn.org/stable/modules/clustering.html).
b) Plot a graph where X axis represents the number of clusters and Y axis represents
the error. What is the optimal value of the number of clusters?
View this video to understand the graph that you have plotted.
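
The steps in a) and b) can be sketched roughly as follows. This is only a minimal illustration, not the code from the Colab notebook: the helper names compute_errors and plot_elbow are hypothetical, the Iris data is loaded through scikit-learn's load_iris, and the n_init and random_state values are assumptions chosen so that repeated runs give the same numbers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris


def compute_errors(X, cluster_counts):
    """Fit K-means once per cluster count and collect the inertia of each fit."""
    Error = []
    for k in cluster_counts:
        # n_init and random_state are assumed values, chosen for repeatable runs
        model_kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
        model_kmeans.fit(X)
        Error.append(model_kmeans.inertia_)
    return Error


def plot_elbow(cluster_counts, Error):
    """Plot the error (inertia) against the number of clusters."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    plt.plot(cluster_counts, Error, marker="o")
    plt.xlabel("Number of clusters")
    plt.ylabel("Error (inertia)")
    plt.title("K-means error vs. number of clusters")
    plt.show()


X = load_iris().data  # 150 samples, 4 features
cluster_counts = list(range(1, 11))  # 1 to 10 clusters
Error = compute_errors(X, cluster_counts)
```

Calling plot_elbow(cluster_counts, Error) in a notebook cell draws the graph. The error always tends to fall as clusters are added, so the value to look for is the bend in the curve after which further clusters reduce the error only slightly.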

2) Study the code for a simple spam classifier using a Bag of Words representation (each
feature is the frequency of a particular word in the document).

Click here to browse the code in Colab.

Now enhance the code to use Term Frequency-Inverse Document Frequency (TF-IDF)
as the feature instead of the word count.

Hint: Please refer to the examples at the links below:

Explanation of TF-IDF
SK-learn page of TfidfVectorizer
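
As a rough sketch of the change asked for in question 2, the word-count vectorizer can be swapped for scikit-learn's TfidfVectorizer while the rest of the pipeline stays the same. The toy messages, the labels and the choice of a MultinomialNB classifier below are all assumptions made for illustration; the Colab notebook may use a different dataset and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = spam, 0 = not spam
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free lottery prize",
    "meeting scheduled for tomorrow morning",
    "please review the attached project report",
    "lunch with the team tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TfidfVectorizer replaces a word-count vectorizer: each feature is a word's
# frequency in the message, down-weighted if the word occurs in many messages
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Assumed classifier; any model that accepts sparse features could be used
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# New messages must be transformed with the already fitted vectorizer
test_texts = ["free prize offer", "project meeting tomorrow"]
X_test = vectorizer.transform(test_texts)
predictions = classifier.predict(X_test)
```

Note that fit_transform is called only on the training messages, and transform is used for new messages, so that the vocabulary and document frequencies come from the training data alone.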
