SCHOOL OF ENGINEERING
PRACTICAL RECORD
Course:______________Branch:____________Reg.No:______________________
Name of Laboratory:___________________________________________________
ST.MARY’S GROUP OF INSTITUTIONS GUNTUR
(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212
SCHOOL OF ENGINEERING
Certificate
This is to certify that Mr. / Ms. _______________________________________
academic year ____________.
Signature of HOD
Index
S.No    Name of the Program    Page No.    Date    Signature of the Faculty
Course Objective:
● Practical exposure on implementation of well-known data mining algorithms
● Learning performance evaluation of data mining algorithms in a supervised and an
unsupervised setting.
Course Outcomes:
Upon successful completion of the course, the student will be able to:
● Apply preprocessing techniques on real world datasets
● Apply apriori algorithm to generate frequent itemsets.
● Apply classification and clustering algorithms on different datasets.
Note: Use python library scikit-learn wherever necessary
LIST OF PROGRAMS
1. Demonstrate the following data preprocessing tasks using python libraries. a) Loading the
dataset b) Identifying the dependent and independent variables c) Dealing with missing data
7. Generate frequent itemsets using Apriori Algorithm in python and also generate
association rules for any market basket data.
1. Demonstrate the following data preprocessing tasks using python libraries. a) Loading the
dataset b) Identifying the dependent and independent variables c) Dealing with missing data
import pandas as pd
dataset = pd.read_excel("age_salary.xls")
Having seen the data, we can clearly identify the dependent and independent factors. Here we
have just 2 factors: age and salary. Salary is the dependent factor that changes with the
independent factor, age. Now let's classify them programmatically.
We have already noticed the missing fields in the data, denoted by "nan". Machine learning
models cannot accommodate missing fields in the data they are provided with, so the missing
fields must be filled with values that will not affect the variance of the data or make it noisier.
The scikit-learn library's SimpleImputer class allows us to impute the missing fields in a dataset
with valid data. In the code below, we use the default strategy for filling missing values,
which is the mean. The imputer cannot be applied to 1D arrays, and since Y is a 1D array, it
needs to be converted to a compatible shape. The reshape function allows us to reshape any
array. The fit_transform() method fits the imputer object and then transforms the array.
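The imputation code being described is not reproduced here; below is a minimal sketch, assuming the two-column age/salary layout and the standard SimpleImputer API:
import numpy as np
from sklearn.impute import SimpleImputer

# independent factor (age) and dependent factor (salary)
X = dataset.iloc[:, :-1].values        # 2D array of the independent column(s)
Y = dataset.iloc[:, -1].values         # 1D array of the dependent column

# impute missing values with the column mean (the default strategy)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
Y = imputer.fit_transform(Y.reshape(-1, 1)).ravel()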
Output
When dealing with large and real-world datasets, categorical data is almost
inevitable. Categorical variables represent types of data which may be divided into
groups; examples are race, sex, age group, educational level, etc. These variables often
have letters or words as their values. Since machine learning models are all about
numbers and calculations, these categorical variables need to be coded into numbers.
Even so, having coded the categorical variable into numbers may not be enough.
For example, consider the dataset below with 2 categorical features, nation and
purchased_item. Let us assume that the dataset is a record of how the age, salary and
country of a person determine if an item is purchased or not. Thus purchased_item is the
dependent factor, and age, salary and nation are the independent factors.
It has 3 countries listed. In a larger dataset, these may be large groups of data. Since
countries do not have a mathematical relation between them (unless we are considering
some known factors such as size or population), coding them as plain numbers will not
work, because one number would appear less than or greater than another. Dummy variables
are the solution. Using one-hot encoding, we create a dummy variable for each category
in the column and use binary encoding for each dummy variable. We do not need to create
dummy variables for the feature purchased_item, as it has only 2 categories: yes or no.
dataset = pd.read_csv("dataset.csv")
X = dataset.iloc[:, [0, 2, 3]].values
Y = dataset.iloc[:, 1].values

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label-encode the nation column, then one-hot encode it
le_X = LabelEncoder()
X[:, 0] = le_X.fit_transform(X[:, 0])
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                                      remainder='passthrough')
X = np.array(columnTransformer.fit_transform(X), dtype=str)
print(X)
Output
The first 3 columns are the dummy features representing Germany, India and Russia
respectively. The 1's in each column indicate that the person belongs to that specific
country.
Y = le_X.fit_transform(Y)
Output:
All machine learning models require us to provide a training set so that the model can
learn from that data to understand the relations between features and predict for new
observations. When we are provided a single huge dataset with too many observations, it is
a good idea to split the dataset into two, a training set and a test set, so that we can
test our model after it has been trained with the training set.
Scikit-learn comes with a method called train_test_split to help us with this task.
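The split itself is not shown in this record; a minimal sketch, assuming an 80/20 split:
from sklearn.model_selection import train_test_split

# split X and Y into training and test subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)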
The above code will split X and Y into two subsets each.
Since machine learning models rely on numbers to solve relations, it is important to have
similarly scaled data in a dataset. Scaling ensures that all data in a dataset falls in the
same range. Unscaled data can cause inaccurate or false predictions. Some machine
learning algorithms can handle feature scaling on their own and do not require it explicitly.
The StandardScaler class from the scikit-learn library can help us scale the dataset.
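The feature matrix is scaled first; a minimal sketch, assuming we fit the scaler on the training set only:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)   # fit on the training data and scale it
X_test = sc_X.transform(X_test)         # scale the test data with the same parameters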
sc_y = StandardScaler()
Y_train = Y_train.reshape((len(Y_train), 1))
Y_train = sc_y.fit_transform(Y_train)
Y_train = Y_train.ravel()
Output
X_train before scaling :
Many data science techniques are based on measuring similarity and dissimilarity between
objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In
Unsupervised Learning, K-Means is a clustering method which uses Euclidean distance to
compute the distance between the cluster centroids and its assigned data points.
Recommendation engines use neighborhood based collaborative filtering methods which
identify an individual's neighbors based on the similarity/dissimilarity to other users.
Similarity-based methods treat the objects with the highest similarity values as the most
similar, as this implies they live in closer neighborhoods.
Pearson's Correlation
The Pearson correlation coefficient between x and y is
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / ( √Σ(xᵢ - x̄)² · √Σ(yᵢ - ȳ)² )
where x̄ and ȳ are the means of x and y respectively.
The Pearson correlation can take a range of values from -1 to +1. A value of exactly +1 or
-1 requires a perfectly linear relationship; two variables that merely increase or decrease
together will not necessarily reach these extremes.
Implementation in Python:
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# compute Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearson correlation: %.3f' % corr)

# plot x and y with a fitted line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order
to calculate the cosine similarity we use the following formula:
cosine similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)
Recall the cosine function: on the left, the red vectors point at different angles and the
graph on the right shows the resulting function.
Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors
point in the exact same direction, the cosine similarity is +1. If the vectors point in
opposite directions, the cosine similarity is -1.
The cosine similarity is very popular in text analysis. It is used to determine how
similar documents are to one another irrespective of their size. The TF-IDF text
analysis technique helps convert the documents into vectors, where each value in
the vector corresponds to the TF-IDF score of a word in the document. Each word has
its own axis; the cosine similarity then determines how similar the documents are.
Implementation in Python
We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine
similarity for a single sample.
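A minimal sketch using scikit-learn's cosine_similarity (x and y are the vectors defined in the Pearson example):
from sklearn.metrics.pairwise import cosine_similarity

# reshape the 1D arrays into single-sample 2D arrays as required by cosine_similarity
cos_sim = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))
print('Cosine similarity: %.3f' % cos_sim[0][0])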
Jaccard Similarity
Cosine similarity is for comparing two real-valued vectors, but Jaccard similarity is
for comparing two binary vectors (sets).
J(A, B) = |A ∩ B| / |A ∪ B|
We can see that the Jaccard similarity divides the size of the intersection by the size of
the union of the sample sets.
Both cosine similarity and Jaccard similarity are common metrics for calculating text
similarity. Calculating the Jaccard similarity is computationally more expensive, as it
matches all the terms of one document against another document. The Jaccard similarity
turns out to be useful for detecting duplicates.
Implementation in Python
from sklearn.metrics import jaccard_score

A = [1, 1, 1, 0]
B = [1, 1, 0, 1]

jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)
Distance-based methods treat the objects with the lowest distance values as the most
similar.
Euclidean Distance
Compared to the cosine and Jaccard similarity, Euclidean distance is not used very
often in the context of NLP applications. It is appropriate for continuous numerical
variables. Euclidean distance is not scale invariant, therefore scaling the data prior to
computing the distance is recommended. Additionally, Euclidean distance amplifies
the effect of redundant information in the dataset: if five heavily correlated variables
are all taken as input, this redundancy effect is weighted five times.
Implementation in Python
from scipy.spatial import distance

dst = distance.euclidean(x, y)
print('Euclidean distance: %.3f' % dst)
Manhattan Distance
Different from the Euclidean distance is the Manhattan distance, also called 'cityblock'
distance, from one vector to another. You can imagine this metric as a way to compute
the distance between two points when you are not able to go through buildings.
In the figure, the green line gives you the Euclidean distance, while the purple line gives
you the Manhattan distance.
Implementation in Python
from scipy.spatial import distance

dst = distance.cityblock(x, y)
print('Manhattan distance: %.3f' % dst)
import pandas as pd
Next, import NumPy, which is a popular library for numerical computing. NumPy is known
for its array data structure as well as its useful methods reshape, arange,
and append.
It is convention to import NumPy under the alias np. You can import NumPy with the
following statement:
import numpy as np
Then import matplotlib, which is Python's most popular library for data visualization.
matplotlib is typically imported under the alias plt. You can import matplotlib with
the following statements:
import matplotlib.pyplot as plt
%matplotlib inline
Lastly, you will want to import seaborn, which is another Python data visualization
library that makes it easier to create beautiful visualizations using matplotlib.
You can import seaborn with the following statement:
import seaborn as sns
housing = pd.read_csv('housing.csv')
housing.shape
For this data set, (20640, 7) should be printed. This represents 20640 rows and 7
columns.
housing.head()
If you wanted to print out from the bottom upwards, you would use the tail() function
instead.
By default, it will print out 5 rows. For this particular data set, this means rows 20635 to
20639.
Next, we should try and plot the data. We can do so with the following command:
housing.plot("median_income", "median_house_value")
There are various lines making it difficult to see individual trends. So, to remedy this,
we should use a scatter plot without individual lines:
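One way to draw the scatter plot, assuming the pandas plotting interface:
# scatter plot of income vs. house value, without connecting lines
housing.plot(kind="scatter", x="median_income", y="median_house_value")
plt.show()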
As can be seen, the correlation is significantly more apparent because there are no
longer random lines to distract.
Now, it is time to actually start analyzing the data. We can start by splitting it into
training and test sets.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(housing.median_income,
housing.median_house_value, test_size = 0.2)
This line of code is very important. The primary function is to split up the data as
“train” and “test.”
The overall data will be split up into 80% as train and 20% as test. The “y-values” will
be the “median_house_value,” and the “x-values” will be the “median_income.”
Next, impose a linear regression. This can be done with the following.
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
This will call LinearRegression(), and then allow us to use our own data to predict.
regr.fit(np.array(x_train).reshape(-1,1), y_train)
This will fit the model using one predictor. Reshape is being applied to convert the
pandas Series into a NumPy array and then into a column vector. (Reshape turns the
one-dimensional array into a vertical, single-column shape.)
We can compare our predictions with the actual values. This can be done with the
code that follows.
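A minimal sketch of that comparison; the variable name preds is an assumption carried through the rest of this program:
# predict house values for the test incomes and compare with the actual values
preds = regr.predict(np.array(x_test).reshape(-1, 1))
print(y_test.head())   # actual values
print(preds[:5])       # predicted values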
Compare the first values. For the actual, it is equal to 252,900. Our prediction, on the
other hand, guesses approximately 180,156. (That is not bad, but that is not great!)
Looking at values is great visually, but there are thousands of data points to be
considered. So, we need a more sophisticated way of doing so.
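A minimal sketch of computing the residuals referred to below (the name residuals matches the histogram call that follows):
# residuals: predicted value minus actual test value for every data point
residuals = preds - np.array(y_test)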
This will show how far off the values are. This is showing the predicted value minus
the actual test value for all the data points.
Then, we should plot with a histogram to see how “off” each value is. This can be done
with the following command.
plt.hist(residuals)
Lastly, we should use the root mean squared error to find the error. This can be done as
follows:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, preds) ** 0.5
A decision tree is a machine learning algorithm that uses a tree-like model of decisions and
their subsequent consequences to arrive at a particular decision. It is a Supervised Machine
Learning model, where the data is continuously split according to a certain parameter, and
finally, a decision is made.
Usually, a decision tree is drawn upside down, with the root node at the top and the leaf
nodes at the bottom. A decision tree usually contains 3 types of nodes.
1. Root node: The very top node that represents the entire population or sample.
2. Decision nodes: Sub-nodes that split from the root node.
3. Leaf nodes: Nodes with no children, also known as terminal nodes.
Decision trees work in a step-wise manner, meaning that they perform a step-by-step
process instead of following a continuous process. Decision trees follow a tree-like structure,
where the nodes of a tree are split using the features based on defined criteria. The main
criteria based on which decision trees split are:
• Variance: normally used in regression trees; it is a measure of the variation of each
data point from the mean.
• Gini impurity / Information gain (entropy): normally used in classification trees (the
scikit-learn classifier used below defaults to the Gini criterion).
Let us now apply the decision tree algorithm on the IRIS dataset in Python. You can follow
the steps below to create a feasible and useful decision tree:
• Use the test dataset to make a prediction and check the accuracy score of the model.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
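The remainder of the listing that the bullets below describe is not reproduced here. A minimal sketch, assuming the standard scikit-learn workflow (the line numbers in the bullets refer to the original full listing, not to this sketch):
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Store the IRIS dataset in the variable data
data = load_iris()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create a decision tree classifier (criterion defaults to 'gini') and fit it
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Accuracy on the training and the test data
print(accuracy_score(y_train, clf.predict(X_train)))
print(accuracy_score(y_test, clf.predict(X_test)))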
• In lines 1 to 4, we import the necessary libraries to read and analyze the dataset.
• In line 7, we store the IRIS dataset in the variable data. Since the sklearn library
contains the IRIS dataset by default, you do not need to upload it again.
• From lines 22 to 24, we create a decision tree classifier and fit it against the training
dataset. By default, the criterion parameter is set to gini. From lines 27 to 30, we
import the “accuracy_score” module and implement the same to find the accuracy of
both the training and test data.
• In lines 28 and 29, we get the output as 1, i.e., 100%, for the training data and 0.947,
which is approximately 95%, for the test dataset.
Naive Bayes is one of the most straightforward and fastest classification algorithms, and it
is suitable for large chunks of data. The Naive Bayes classifier is successfully used in
various applications such as spam filtering, text classification, sentiment analysis, and
recommender systems. It uses the Bayes theorem of probability to predict the class of
unknown samples.
Whenever you perform classification, the first step is to understand the problem and identify
potential features and the label. Features are those characteristics or attributes which affect
the results of the label. For example, in the case of loan distribution, bank managers identify
the customer's occupation, income, age, location, previous loan history, transaction history, and
credit score. These characteristics are known as features and help the model classify
customers.
The classification has two phases, a learning phase and an evaluation phase. In the learning
phase, the classifier trains its model on a given dataset, and in the evaluation phase it tests
the classifier's performance. Performance is evaluated on the basis of various parameters such
as accuracy, error, precision, and recall.
Alternatively:
Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and
reliable algorithm, with high accuracy and speed on large datasets.
Naive Bayes classifier assumes that the effect of a particular feature in a class is independent
of other features. For example, a loan applicant is desirable or not depending on his/her
income, previous loan and transaction history, age, and location. Even if these features are
interdependent, these features are still considered independently. This assumption simplifies
computation, and that's why it is considered as naive. This assumption is called class
conditional independence.
Bayes' theorem states: P(h|D) = P(D|h) · P(h) / P(D), where:
• P(h): the probability of hypothesis h being true (regardless of the data). This is known
as the prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the
evidence.
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior
probability.
• P(D|h): the probability of data D given that the hypothesis h was true. This is known as
the likelihood.
Defining Dataset
In this example, you can use a dummy dataset with three columns: weather, temperature,
and play. The first two are features (weather, temperature) and the other is the label.
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','
Hot','Mild']
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
Encoding Features
First, you need to convert these string labels into numbers, for example: 'Overcast', 'Rainy',
'Sunny' as 0, 1, 2. This is known as label encoding. Scikit-learn provides the LabelEncoder
class for encoding labels with a value between 0 and one less than the number of discrete
classes.
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print(weather_encoded)
[2 2 0 1 1 1 0 2 2 1 2 0 0 1]
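The temp and play columns are encoded the same way (the names temp_encoded and label are assumptions consistent with the code that follows); their encoded values are shown below:
# Encode the temperature feature and the play label
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)
print("Temp:", temp_encoded)
print("Play:", label)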
Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
Now combine both the features (weather and temp) in a single variable (list of tuples).
# Combining weather and temp into a single list of tuples
features = list(zip(weather_encoded, temp_encoded))
print(features)
Generating Model
Next, generate the Naive Bayes model and use it to perform a prediction.
# Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
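A minimal sketch of creating the model, fitting it, and predicting (the sample [0, 2], i.e. Overcast weather and Mild temperature, is an assumption):
# Create a Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model using the encoded features and label
model.fit(features, label)

# Predict for a new observation: 0 = Overcast weather, 2 = Mild temperature
predicted = model.predict([[0, 2]])
print("Predicted value:", predicted)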
Predicted value:1
The Apriori algorithm is the perfect algorithm to start association analysis with, as it is not
just easy to understand and interpret but also easy to implement.
Python has many libraries for apriori implementation, and one can also implement the algorithm
from scratch. But there is mlxtend to the rescue: this library has a beautiful
implementation of apriori and it also allows us to extract association rules from the result.
Required Libraries
1) mlxtend (ML extended) will be used for the apriori implementation and for extracting
association rules.
2) matplotlib will be used for visualizing the results.
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
df = pd.read_csv('retail_dataset.csv')
## Print first 10 rows
df.head(10)
Each row of the dataset represents items that were purchased together on the same
day at the same store. The dataset is a sparse dataset, as a relatively high percentage of
the data is NA or NaN or equivalent.
These NaNs make it hard to read the table. Let's find out how many unique items are
actually there in the table.
items = set()
for col in df:
    items.update(df[col].unique())
print(items)
Out:
{'Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil', 'Diaper', 'Milk'}
There are only 9 items in total that make up the entire dataset.
Data Preprocessing
To make use of the apriori module provided by the mlxtend library, we need to convert the
dataset to its liking. The apriori module requires a dataframe that has either 0
and 1 or True and False as data. The data we have is all strings (names of items), so we
need to one-hot encode the data.
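One possible way to build the required True/False table (a sketch; the ohe_df name and this particular encoding approach are assumptions, not the original listing):
# Build the set of items bought in each transaction, dropping the NaN placeholders
transactions = df.apply(lambda row: set(row.dropna()), axis=1)

# One row per transaction, one True/False column per item
ohe_df = pd.DataFrame([{item: (item in t) for item in items if pd.notna(item)}
                       for t in transactions])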
Applying Apriori
apriori module from mlxtend library provides fast and efficient apriori
implementation.
Parameters
• min_support : A floating point value between 0 and 1 that indicates the minimum
support required for an itemset to be selected, where
support = (# of observations with the itemset) / (total # of observations).
• use_colnames : This allows us to preserve the column names for the itemsets, making them
more readable.
• max_len : Max length of itemset generated. If not set, all possible lengths are
evaluated.
• verbose : Shows the number of iterations if >= 1 and low_memory is True. If >= 1 and
low_memory is False, shows the number of combinations.
• low_memory : If True, uses an iterator to search for combinations above min_support.
Note that low_memory=True should only be used for large datasets when memory resources
are limited, because this implementation is approximately 3-6x slower than the default.
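A minimal sketch of applying apriori to the encoded data (the ohe_df name and the min_support value are assumptions):
# Generate frequent itemsets with at least 20% support
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True)
print(freq_items.head())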
The output is a data frame with the support for each itemset.
The result of association analysis shows which item is frequently purchased with
other items.
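A sketch of extracting the association rules used in the plots below (the confidence threshold is an assumption):
# Derive association rules, keeping those with confidence of at least 60%
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
print(rules.head())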
Visualizing results
1. Support vs Confidence
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()
2. Support vs Lift
plt.scatter(rules['support'], rules['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()
3. Lift vs Confidence
fit = np.polyfit(rules['lift'], rules['confidence'], 1)
fit_fn = np.poly1d(fit)
plt.plot(rules['lift'], rules['confidence'], 'yo', rules['lift'],
fit_fn(rules['lift']))
K here is the pre-defined number of clusters to be formed by the algorithm. If K=3, it means
the number of clusters to be formed from the dataset is 3.
Step-3: Assign each data point, based on its distance from the randomly selected points
(centroids), to the nearest/closest centroid, which will form the predefined clusters.
Step-5: Repeat step 3, reassigning each datapoint to the new closest centroid of each
cluster.
Step-7: FINISH
STEP 1: Let's choose the number k of clusters, i.e., K=2, to segregate the dataset and put
the points into different respective clusters. We will choose 2 random points which will act
as centroids to form the clusters.
STEP 2: Now we will assign each data point in the scatter plot to its closest K-point or
centroid. This is done by drawing a median line between the two centroids.
Consider the below image:
STEP 3: Points on the left side of the line are nearer to the blue centroid, and points to the
right of the line are closer to the yellow centroid. The left ones form a cluster with the blue
centroid and the right ones with the yellow centroid.
STEP 4: Repeat the process by choosing new centroids. To choose the new centroids, we will
find the new center of gravity of each cluster, as depicted below:
STEP 5: Next, we will reassign each datapoint to the new centroids. We will repeat the same
process as above (using a median line). The yellow data point on the blue side of the
median line will be included in the blue cluster.
STEP 6: As reassignment has taken place, so we will repeat the above step of finding new
centroids.
STEP 7: We will repeat the above process of finding the center of gravity of the centroids, as
depicted below.
STEP 8: After finding the new centroids we will again draw the median line and reassign the
data points, as in the above steps.
STEP 9: We will finally segregate the points based on the median line, so that two groups are
formed and no dissimilar point is included in a single group.
The number of clusters that we choose for the algorithm should not be random. Every
cluster is formed by calculating and comparing the mean distances of the data points within
a cluster from its centroid.
We can choose the right number of clusters with the help of the Within-Cluster-Sum-of-
Squares (WCSS) method.
WCSS stands for the sum of the squares of the distances of the data points in each and every
cluster from its centroid:
WCSS = Σ over clusters Σ over points in the cluster (distance from point to the cluster centroid)²
The main idea is to minimize the distance between the data points and the centroid of the
clusters. The process is iterated until we reach a minimum value for the sum of distances.
To find the optimal number of clusters, the elbow method follows the below
steps:
1 Execute K-means clustering on a given dataset for different K values (ranging from 1-
10).
2 For each K value, calculate the WCSS.
3 Plot a graph/curve between the WCSS values and the respective number of clusters K.
4 The sharp point of bend (a point looking like an elbow joint) of the arm-like plot will
be considered as the best/optimal value of K.
Python Implementation
Clustering
from sklearn.cluster import KMeans

kmeans = KMeans(3)
kmeans.fit(x)
Clustering Results
identified_clusters = kmeans.fit_predict(x)
identified_clusters
array([1, 1, 0, 0, 0, 2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_
with_clusters['Clusters'],cmap='rainbow')
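The WCSS values plotted in the elbow curve below are not computed in the listing above; a minimal sketch for 1 to 6 clusters:
# Compute WCSS (inertia) for 1 to 6 clusters
wcss = []
for k in range(1, 7):
    km = KMeans(k)
    km.fit(x)
    wcss.append(km.inertia_)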
number_clusters = range(1,7)
plt.plot(number_clusters,wcss)
plt.title('The Elbow title')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
From the elbow plot, we can choose 3 as the number of clusters; this method shows what a
good number of clusters is.
Let’s say we have the below points and we want to cluster them into groups:
Now, based on the similarity of these clusters, we can combine the most similar clusters
together and repeat this process until only a single cluster is left:
We are essentially building a hierarchy of clusters. That’s why this algorithm is called
hierarchical clustering. I will discuss how to decide the number of clusters in a later section.
For now, let’s look at the different types of hierarchical clustering.
We assign each point to an individual cluster in this technique. Suppose there are 4 data
points. We will assign each of these points to a cluster and hence will have 4 clusters in the
beginning:
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a
single cluster is left:
We are merging (or adding) the clusters at each step, right? Hence, this type of clustering is
also known as additive hierarchical clustering.
Divisive hierarchical clustering works in the opposite way. Instead of starting with n clusters
(in case of n observations), we start with a single cluster and assign all the points to that
cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to the
same cluster at the beginning:
Now, at each iteration, we split the farthest point in the cluster and repeat this process until
each cluster only contains a single point:
We are splitting (or dividing) the clusters at each step, hence the name divisive hierarchical
clustering.
Agglomerative Clustering is widely used in the industry and that will be the focus in this
article. Divisive hierarchical clustering will be a piece of cake once we have a handle on the
agglomerative type.
We merge the most similar points or clusters in hierarchical clustering – we know this. Now
the question is – how do we decide which points are similar and which are not? It’s one of
the most important questions in clustering!
Here’s one way to calculate similarity – Take the distance between the centroids of these
clusters. The points having the least distance are referred to as similar points and we can
merge them. We can refer to this as a distance-based algorithm as well (since we are
calculating the distances between the clusters).
In hierarchical clustering, we have a concept called a proximity matrix. This stores the
distances between each point. Let’s take an example to understand this matrix as well as the
steps to perform hierarchical clustering.
Suppose a teacher wants to divide her students into different groups. She has the marks
scored by each student in an assignment and based on these marks, she wants to segment
them into groups. There’s no fixed target here as to how many groups to have. Since the
teacher does not know what type of students should be assigned to which group, it cannot
be solved as a supervised learning problem. So, we will try to apply hierarchical clustering
here and segment the students into different groups.
First, we will create a proximity matrix which will tell us the distance between each of these
points. Since we are calculating the distance of each point from each of the other points, we
will get a square matrix of shape n X n (where n is the number of observations).
The diagonal elements of this matrix will always be 0 as the distance of a point with itself is
always 0. We will use the Euclidean distance formula to calculate the rest of the distances.
So, let’s say we want to calculate the distance between point 1 and 2:
√(10-7)^2 = √9 = 3
Similarly, we can calculate all the distances and fill the proximity matrix.
Different colors here represent different clusters. You can see that we have 5 different
clusters for the 5 points in our data.
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge the
points with the smallest distance. We then update the proximity matrix:
Here, the smallest distance is 3 and hence we will merge point 1 and 2:
Let’s look at the updated clusters and accordingly update the proximity matrix:
Here, we have taken the maximum of the two marks (7, 10) to replace the marks for this
cluster. Instead of the maximum, we can also take the minimum value or the average values
as well. Now, we will again calculate the proximity matrix for these clusters:
So, we will first look at the minimum distance in the proximity matrix and then merge the
closest pair of clusters. We will get the merged clusters as shown below after repeating
these steps:
We started with 5 clusters and finally have a single cluster. This is how agglomerative
hierarchical clustering works. But the burning question still remains – how do we decide
the number of clusters? Let’s understand that in the next section.
Ready to finally answer this question that’s been hanging around since we started learning?
To get the number of clusters for hierarchical clustering, we make use of an awesome
concept called a Dendrogram.
Let’s get back to our teacher-student example. Whenever we merge two clusters, a
dendrogram will record the distance between these clusters and represent it in graph form.
Let's see what a dendrogram looks like:
We have the samples of the dataset on the x-axis and the distance on the y-axis. Whenever
two clusters are merged, we will join them in this dendrogram and the height of the
join will be the distance between these points. Let’s build the dendrogram for our
example:
Take a moment to process the above image. We started by merging sample 1 and 2 and the
distance between these two samples was 3 (refer to the first proximity matrix in the previous
section). Let’s plot this in the dendrogram:
Here, we can see that we have merged sample 1 and 2. The vertical line represents the
distance between these samples. Similarly, we plot all the steps where we merged the
clusters and finally, we get a dendrogram like this:
We can clearly visualize the steps of hierarchical clustering. The longer the vertical
lines in the dendrogram, the greater the distance between those clusters.
Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the
threshold in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and
draw a horizontal line:
The number of clusters will be the number of vertical lines which are intersected by the
line drawn using the threshold. In the above example, since the red line intersects 2
vertical lines, we will have 2 clusters. One cluster will have the samples (1, 2, 4)
and the other will have the samples (3, 5). Pretty straightforward, right?
This is how we can decide the number of clusters using a dendrogram in Hierarchical
Clustering. In the next section, we will implement hierarchical clustering which will help you
to understand all the concepts that we have learned in this article.
We will be working on a wholesale customer segmentation problem. You can download the
dataset using this link. The data is hosted on the UCI Machine Learning repository. The aim
of this problem is to segment the clients of a wholesale distributor based on their annual
spending on diverse product categories, like milk, grocery, region, etc.
Let’s explore the data first and then apply Hierarchical Clustering to segment the clients.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import normalize
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

data = pd.read_csv('Wholesale customers data.csv')
data.head()
There are multiple product categories – Fresh, Milk, Grocery, etc. The values represent
the number of units purchased by each client for each product. Our aim is to make
clusters from this data that can segment similar clients together. We will, of course,
use Hierarchical Clustering for this problem.
But before applying Hierarchical Clustering, we have to normalize the data so that the
scale of each variable is the same. Why is this important? Well, if the scale of the
variables is not the same, the model might become biased towards the variables with a
higher magnitude like Fresh or Milk (refer to the above table).
So, let’s first normalize the data and bring all the variables to the same scale:
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
Here, we can see that the scale of all the variables is almost similar. Now, we are good to
go. Let’s first draw the dendrogram to help us decide the number of clusters for this
particular problem:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled,
method='ward'))
The x-axis contains the samples and the y-axis represents the distance between these
samples. The vertical line with the maximum distance is the blue line, and hence we can
decide a threshold distance and cut the dendrogram there by drawing a horizontal line:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled,
method='ward'))
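To actually draw the horizontal threshold line, one possibility is shown below; the threshold value of 6 is an assumption:
plt.axhline(y=6, color='r', linestyle='--')  # horizontal cut; the value 6 is an assumption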
We have two clusters, as this line cuts the dendrogram at two points. Let's now apply
hierarchical clustering for 2 clusters:
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
cluster.fit_predict(data_scaled)
We can see the values of 0s and 1s in the output since we defined 2 clusters. 0
represents the points that belong to the first cluster and 1 represents points in the second
cluster. Let’s now visualize the two clusters:
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'],
c=cluster.labels_)
Awesome! We can clearly visualize the two clusters here. This is how we can implement
hierarchical clustering in Python.