SCHOOL OF ENGINEERING
PRACTICAL RECORD
Course:______________Branch:____________Reg.No:______________________
Name of Laboratory:___________________________________________________
ST.MARY’S GROUP OF INSTITUTIONS GUNTUR
(Approved by AICTE & Govt. of AP, Affiliated to JNTU-KAKINADA, Accredited by 'NAAC')
Chebrolu (V&M), Guntur (Dist), Andhra Pradesh, INDIA-522212
SCHOOL OF ENGINEERING
Certificate
This is to certify that Mr. / Ms. _______________________________________
academic year ____________.
Signature of HOD
Index
S.No    Name of the Program    Page No.    Date    Signature of the Faculty
Course Objective:
● Practical exposure on implementation of well-known data mining algorithms
● Learning performance evaluation of data mining algorithms in a supervised and an
unsupervised setting.
Course Outcomes:
Upon successful completion of the course, the student will be able to:
● Apply preprocessing techniques on real world datasets
● Apply apriori algorithm to generate frequent itemsets.
● Apply classification and clustering algorithms on different datasets.
Note: Use python library scikit-learn wherever necessary
LIST OF PROGRAMS
1. Demonstrate the following data preprocessing tasks using python libraries. a) Loading the
dataset b) Identifying the dependent and independent variables c) Dealing with missing data
7. Generate frequent itemsets using Apriori Algorithm in python and also generate
association rules for any market basket data.
1. Demonstrate the following data preprocessing tasks using python libraries. a) Loading the
dataset b) Identifying the dependent and independent variables c) Dealing with missing data
import pandas as pd
dataset = pd.read_excel("age_salary.xls")
Having seen the data, we can clearly identify the dependent and independent factors. Here we
have just 2 factors: age and salary. Salary is the dependent factor that changes with the
independent factor, age. Now let's classify them programmatically.
We have already noticed the missing fields in the data, denoted by "nan". Machine learning
models cannot accommodate missing fields in the data they are provided with, so the missing
fields must be filled with values that will not affect the variance of the data or make it noisier.
The scikit-learn library's SimpleImputer class allows us to impute the missing fields in a dataset
with valid data. In the code below, we use the default strategy for filling missing values,
which is the mean. The imputer cannot be applied to 1D arrays, and since Y is a 1D array, it
needs to be converted to a compatible shape. The reshape function allows us to reshape any
array. The fit_transform() method fits the imputer object and then transforms the array.
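The imputation code being described is not reproduced here; below is a minimal sketch, assuming the two-column age/salary layout and the standard SimpleImputer API:
import numpy as np
from sklearn.impute import SimpleImputer

# independent factor (age) and dependent factor (salary)
X = dataset.iloc[:, :-1].values        # 2D array of the independent column(s)
Y = dataset.iloc[:, -1].values         # 1D array of the dependent column

# impute missing values with the column mean (the default strategy)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
Y = imputer.fit_transform(Y.reshape(-1, 1)).ravel()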
Output
When dealing with large and real-world datasets, categorical data is almost
inevitable. Categorical variables represent types of data which may be divided into
groups; examples are race, sex, age group, educational level, etc. These variables often
have letters or words as their values. Since machine learning models are all about
numbers and calculations, these categorical variables need to be coded into numbers.
Even so, having coded the categorical variable into numbers may not be enough.
For example, consider the dataset below with 2 categorical features, nation and
purchased_item. Let us assume that the dataset is a record of how the age, salary and
country of a person determine if an item is purchased or not. Thus purchased_item is the
dependent factor, and age, salary and nation are the independent factors.
It has 3 countries listed. In a larger dataset, these may be large groups of data. Since
countries do not have a mathematical relation between them (unless we are considering
some known factors such as size or population), coding them as plain numbers will not
work, because one number would appear less than or greater than another. Dummy variables
are the solution. Using one-hot encoding, we create a dummy variable for each category
in the column and use binary encoding for each dummy variable. We do not need to create
dummy variables for the feature purchased_item, as it has only 2 categories: yes or no.
dataset = pd.read_csv("dataset.csv")
X = dataset.iloc[:, [0, 2, 3]].values
Y = dataset.iloc[:, 1].values

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label-encode the nation column, then one-hot encode it
le_X = LabelEncoder()
X[:, 0] = le_X.fit_transform(X[:, 0])
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                                      remainder='passthrough')
X = np.array(columnTransformer.fit_transform(X), dtype=str)
print(X)
Output
The first 3 columns are the dummy features representing Germany, India and Russia
respectively. The 1's in each column indicate that the person belongs to that specific
country.
Y = le_X.fit_transform(Y)
Output:
All machine learning models require us to provide a training set so that the model can
learn from that data to understand the relations between features and predict for new
observations. When we are provided a single huge dataset with too many observations, it is
a good idea to split the dataset into two, a training set and a test set, so that we can
test our model after it has been trained with the training set.
Scikit-learn comes with a method called train_test_split to help us with this task.
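The split itself is not shown in this record; a minimal sketch, assuming an 80/20 split:
from sklearn.model_selection import train_test_split

# split X and Y into training and test subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)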
The above code will split X and Y into two subsets each.
Since machine learning models rely on numbers to solve relations, it is important to have
similarly scaled data in a dataset. Scaling ensures that all data in a dataset falls in the
same range. Unscaled data can cause inaccurate or false predictions. Some machine
learning algorithms can handle feature scaling on their own and do not require it explicitly.
The StandardScaler class from the scikit-learn library can help us scale the dataset.
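The feature matrix is scaled first; a minimal sketch, assuming we fit the scaler on the training set only:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)   # fit on the training data and scale it
X_test = sc_X.transform(X_test)         # scale the test data with the same parameters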
sc_y = StandardScaler()
Y_train = Y_train.reshape((len(Y_train), 1))
Y_train = sc_y.fit_transform(Y_train)
Y_train = Y_train.ravel()
Output
X_train before scaling :
Many data science techniques are based on measuring similarity and dissimilarity between
objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In
Unsupervised Learning, K-Means is a clustering method which uses Euclidean distance to
compute the distance between the cluster centroids and its assigned data points.
Recommendation engines use neighborhood based collaborative filtering methods which
identify an individual's neighbors based on the similarity/dissimilarity to other users.
Similarity-based methods treat the objects with the highest similarity values as the most
similar, as this implies they live in closer neighborhoods.
Pearson's Correlation
The Pearson correlation coefficient between x and y is
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / ( √Σ(xᵢ - x̄)² · √Σ(yᵢ - ȳ)² )
where x̄ and ȳ are the means of x and y respectively.
The Pearson correlation can take a range of values from -1 to +1. A value of exactly +1 or
-1 requires a perfectly linear relationship; two variables that merely increase or decrease
together will not necessarily reach these extremes.
Implementation in Python:
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# compute Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearson correlation: %.3f' % corr)

# plot x and y with a fitted line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order
to calculate the cosine similarity we use the following formula:
cosine similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)
Recall the cosine function: on the left, the red vectors point at different angles and the
graph on the right shows the resulting function.
Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors
point in the exact same direction, the cosine similarity is +1. If the vectors point in
opposite directions, the cosine similarity is -1.
The cosine similarity is very popular in text analysis. It is used to determine how
similar documents are to one another irrespective of their size. The TF-IDF text
analysis technique helps convert the documents into vectors, where each value in
the vector corresponds to the TF-IDF score of a word in the document. Each word has
its own axis; the cosine similarity then determines how similar the documents are.
Implementation in Python
We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine
similarity for a single sample.
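A minimal sketch using scikit-learn's cosine_similarity (x and y are the vectors defined in the Pearson example):
from sklearn.metrics.pairwise import cosine_similarity

# reshape the 1D arrays into single-sample 2D arrays as required by cosine_similarity
cos_sim = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))
print('Cosine similarity: %.3f' % cos_sim[0][0])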
Jaccard Similarity
Cosine similarity is for comparing two real-valued vectors, but Jaccard similarity is
for comparing two binary vectors (sets).
J(A, B) = |A ∩ B| / |A ∪ B|
We can see that the Jaccard similarity divides the size of the intersection by the size of
the union of the sample sets.
Both cosine similarity and Jaccard similarity are common metrics for calculating text
similarity. Calculating the Jaccard similarity is computationally more expensive, as it
matches all the terms of one document against another document. The Jaccard similarity
turns out to be useful for detecting duplicates.
Implementation in Python
from sklearn.metrics import jaccard_score

A = [1, 1, 1, 0]
B = [1, 1, 0, 1]

jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)
Distance-based methods treat the objects with the lowest distance values as the most
similar.
Euclidean Distance
Compared to the cosine and Jaccard similarity, Euclidean distance is not used very
often in the context of NLP applications. It is appropriate for continuous numerical
variables. Euclidean distance is not scale invariant, therefore scaling the data prior to
computing the distance is recommended. Additionally, Euclidean distance amplifies
the effect of redundant information in the dataset: if five heavily correlated variables
are all taken as input, this redundancy effect is weighted five times.
Implementation in Python
from scipy.spatial import distance

dst = distance.euclidean(x, y)
print('Euclidean distance: %.3f' % dst)
Manhattan Distance
Different from the Euclidean distance is the Manhattan distance, also called 'cityblock'
distance, from one vector to another. You can imagine this metric as a way to compute
the distance between two points when you are not able to go through buildings.
In the figure, the green line gives you the Euclidean distance, while the purple line gives
you the Manhattan distance.
Implementation in Python
from scipy.spatial import distance

dst = distance.cityblock(x, y)
print('Manhattan distance: %.3f' % dst)
import pandas as pd
Next, import NumPy, which is a popular library for numerical computing. NumPy is known
for its array data structure as well as its useful methods reshape, arange,
and append.
It is convention to import NumPy under the alias np. You can import NumPy with the
following statement:
import numpy as np
Then import matplotlib, which is Python's most popular library for data visualization.
matplotlib is typically imported under the alias plt. You can import matplotlib with
the following statements:
import matplotlib.pyplot as plt
%matplotlib inline
Lastly, you will want to import seaborn, which is another Python data visualization
library that makes it easier to create beautiful visualizations using matplotlib.
You can import seaborn with the following statement:
import seaborn as sns
housing = pd.read_csv('housing.csv')
housing.shape
For this data set, (20640, 7) should be printed. This represents 20640 rows and 7
columns.
housing.head()
If you wanted to print out from the bottom upwards, you would use the tail() function
instead.
By default, it will print out 5 rows. For this particular data set, this means rows 20635 to
20639.
Next, we should try and plot the data. We can do so with the following command:
housing.plot("median_income", "median_house_value")
There are various lines making it difficult to see individual trends. So, to remedy this,
we should use a scatter plot without individual lines:
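One way to draw the scatter plot, assuming the pandas plotting interface:
# scatter plot of income vs. house value, without connecting lines
housing.plot(kind="scatter", x="median_income", y="median_house_value")
plt.show()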
As can be seen, the correlation is significantly more apparent because there are no
longer random lines to distract.
Now, it is time to actually start analyzing the data. We can start by splitting it into
training and test sets.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(housing.median_income,
housing.median_house_value, test_size = 0.2)
This line of code is very important. The primary function is to split up the data as
“train” and “test.”
The overall data will be split up into 80% as train and 20% as test. The “y-values” will
be the “median_house_value,” and the “x-values” will be the “median_income.”
Next, impose a linear regression. This can be done with the following.
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
This will call LinearRegression(), and then allow us to use our own data to predict.
regr.fit(np.array(x_train).reshape(-1,1), y_train)
This will fit the model using one predictor. Reshape is being applied to convert the
pandas Series into a NumPy array and then into a column vector. (Reshape turns the
one-dimensional array into a vertical, single-column shape.)
We can compare our predictions with the actual values. This can be done with the
code that follows.
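A minimal sketch of that comparison; the variable name preds is an assumption carried through the rest of this program:
# predict house values for the test incomes and compare with the actual values
preds = regr.predict(np.array(x_test).reshape(-1, 1))
print(y_test.head())   # actual values
print(preds[:5])       # predicted values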
Compare the first values. For the actual, it is equal to 252,900. Our prediction, on the
other hand, guesses approximately 180,156. (That is not bad, but that is not great!)
Looking at values is great visually, but there are thousands of data points to be
considered. So, we need a more sophisticated way of doing so.
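A minimal sketch of computing the residuals referred to below (the name residuals matches the histogram call that follows):
# residuals: predicted value minus actual test value for every data point
residuals = preds - np.array(y_test)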
This will show how far off the values are. This is showing the predicted value minus
the actual test value for all the data points.
Then, we should plot with a histogram to see how “off” each value is. This can be done
with the following command.
plt.hist(residuals)
Lastly, we should use the root mean squared error to find the error. This can be done as
follows:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, preds) ** 0.5
A decision tree is a machine learning algorithm that uses a tree-like model of decisions and
their subsequent consequences to arrive at a particular decision. It is a Supervised Machine
Learning model, where the data is continuously split according to a certain parameter, and
finally, a decision is made.
Usually, a decision tree is drawn upside down, with the root node at the top and the leaf
nodes at the bottom. A decision tree usually contains 3 types of nodes.
1. Root node: The very top node that represents the entire population or sample.
2. Decision nodes: Sub-nodes that split from the root node.
3. Leaf nodes: Nodes with no children, also known as terminal nodes.
Decision trees work in a step-wise manner, meaning that they perform a step-by-step
process instead of following a continuous process. Decision trees follow a tree-like structure,
where the nodes of a tree are split using the features based on defined criteria. The main
criteria based on which decision trees split are:
• Variance: normally used in regression trees; it is a measure of the variation of each
data point from the mean.
• Gini impurity / Information gain (entropy): normally used in classification trees (the
scikit-learn classifier used below defaults to the Gini criterion).
Let us now apply the decision tree algorithm on the IRIS dataset in Python. You can follow
the steps below to create a feasible and useful decision tree:
• Use the test dataset to make a prediction and check the accuracy score of the model.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
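The remainder of the listing that the bullets below describe is not reproduced here. A minimal sketch, assuming the standard scikit-learn workflow (the line numbers in the bullets refer to the original full listing, not to this sketch):
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Store the IRIS dataset in the variable data
data = load_iris()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create a decision tree classifier (criterion defaults to 'gini') and fit it
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Accuracy on the training and the test data
print(accuracy_score(y_train, clf.predict(X_train)))
print(accuracy_score(y_test, clf.predict(X_test)))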
• In lines 1 to 4, we import the necessary libraries to read and analyze the dataset.
• In line 7, we store the IRIS dataset in the variable data. Since the sklearn library
contains the IRIS dataset by default, you do not need to upload it again.
• From lines 22 to 24, we create a decision tree classifier and fit it against the training
dataset. By default, the criterion parameter is set to gini. From lines 27 to 30, we
import the “accuracy_score” module and implement the same to find the accuracy of
both the training and test data.
• In lines 28 and 29, we get the output as 1, i.e., 100%, for the training data and 0.947,
which is approximately 95%, for the test dataset.
Naive Bayes is one of the most straightforward and fastest classification algorithms, and it
is suitable for large chunks of data. The Naive Bayes classifier is successfully used in
various applications such as spam filtering, text classification, sentiment analysis, and
recommender systems. It uses the Bayes theorem of probability to predict the class of
unknown samples.
Whenever you perform classification, the first step is to understand the problem and identify
potential features and the label. Features are those characteristics or attributes which affect
the results of the label. For example, in the case of loan distribution, bank managers identify
the customer's occupation, income, age, location, previous loan history, transaction history, and
credit score. These characteristics are known as features and help the model classify
customers.
The classification has two phases, a learning phase and an evaluation phase. In the learning
phase, the classifier trains its model on a given dataset, and in the evaluation phase it tests
the classifier's performance. Performance is evaluated on the basis of various parameters such
as accuracy, error, precision, and recall.
Alternatively:
Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and
reliable algorithm, with high accuracy and speed on large datasets.
Naive Bayes classifier assumes that the effect of a particular feature in a class is independent
of other features. For example, a loan applicant is desirable or not depending on his/her
income, previous loan and transaction history, age, and location. Even if these features are
interdependent, these features are still considered independently. This assumption simplifies
computation, and that's why it is considered as naive. This assumption is called class
conditional independence.
Bayes' theorem states: P(h|D) = P(D|h) · P(h) / P(D), where:
• P(h): the probability of hypothesis h being true (regardless of the data). This is known
as the prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the
evidence.
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior
probability.
• P(D|h): the probability of data D given that the hypothesis h was true. This is known as
the likelihood.
Defining Dataset
In this example, you can use a dummy dataset with three columns: weather, temperature,
and play. The first two are features (weather, temperature) and the other is the label.
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','
Hot','Mild']
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
Encoding Features
First, you need to convert these string labels into numbers, for example: 'Overcast', 'Rainy',
'Sunny' as 0, 1, 2. This is known as label encoding. Scikit-learn provides the LabelEncoder
class for encoding labels with a value between 0 and one less than the number of discrete
classes.
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print(weather_encoded)
[2 2 0 1 1 1 0 2 2 1 2 0 0 1]
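The temp and play columns are encoded the same way (the names temp_encoded and label are assumptions consistent with the code that follows); their encoded values are shown below:
# Encode the temperature feature and the play label
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)
print("Temp:", temp_encoded)
print("Play:", label)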
Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
Now combine both the features (weather and temp) in a single variable (list of tuples).
# Combining weather and temp into a single list of tuples
features = list(zip(weather_encoded, temp_encoded))
print(features)
Generating Model
Next, generate the Naive Bayes model and use it to perform a prediction.
# Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
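A minimal sketch of creating the model, fitting it, and predicting (the sample [0, 2], i.e. Overcast weather and Mild temperature, is an assumption):
# Create a Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model using the encoded features and label
model.fit(features, label)

# Predict for a new observation: 0 = Overcast weather, 2 = Mild temperature
predicted = model.predict([[0, 2]])
print("Predicted value:", predicted)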
Predicted value:1
The Apriori algorithm is the perfect algorithm to start association analysis with, as it is not
just easy to understand and interpret but also easy to implement.
Python has many libraries for apriori implementation, and one can also implement the algorithm
from scratch. But there is mlxtend to the rescue: this library has a beautiful
implementation of apriori and it also allows us to extract association rules from the result.
Required Libraries
1) mlxtend (ML extended) will be used for the apriori implementation and for extracting
association rules.
2) matplotlib will be used for visualizing the results.
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
df = pd.read_csv('retail_dataset.csv')
## Print first 10 rows
df.head(10)
Each row of the dataset represents items that were purchased together on the same
day at the same store. The dataset is a sparse dataset, as a relatively high percentage of
the data is NA or NaN or equivalent.
These NaNs make it hard to read the table. Let's find out how many unique items are
actually there in the table.
items = set()
for col in df:
    items.update(df[col].unique())
print(items)
Out:
{'Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil', 'Diaper', 'Milk'}
There are only 9 items in total that make up the entire dataset.
Data Preprocessing
To make use of the apriori module provided by the mlxtend library, we need to convert the
dataset to its liking. The apriori module requires a dataframe that has either 0
and 1 or True and False as data. The data we have is all strings (names of items), so we
need to one-hot encode the data.
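One possible way to build the required True/False table (a sketch; the ohe_df name and this particular encoding approach are assumptions, not the original listing):
# Build the set of items bought in each transaction, dropping the NaN placeholders
transactions = df.apply(lambda row: set(row.dropna()), axis=1)

# One row per transaction, one True/False column per item
ohe_df = pd.DataFrame([{item: (item in t) for item in items if pd.notna(item)}
                       for t in transactions])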
Applying Apriori
apriori module from mlxtend library provides fast and efficient apriori
implementation.
Parameters
• min_support : A floating point value between 0 and 1 that indicates the minimum
support required for an itemset to be selected, where
support = (# of observations with the itemset) / (total # of observations).
• use_colnames : This allows us to preserve the column names for the itemsets, making them
more readable.
• max_len : Max length of itemset generated. If not set, all possible lengths are
evaluated.
• verbose : Shows the number of iterations if >= 1 and low_memory is True. If >= 1 and
low_memory is False, shows the number of combinations.
• low_memory : If True, uses an iterator to search for combinations above min_support.
Note that low_memory=True should only be used for large datasets when memory resources
are limited, because this implementation is approximately 3-6x slower than the default.
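A minimal sketch of applying apriori to the encoded data (the ohe_df name and the min_support value are assumptions):
# Generate frequent itemsets with at least 20% support
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True)
print(freq_items.head())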
The output is a data frame with the support for each itemset.
The result of association analysis shows which item is frequently purchased with
other items.
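A sketch of extracting the association rules used in the plots below (the confidence threshold is an assumption):
# Derive association rules, keeping those with confidence of at least 60%
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
print(rules.head())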
Visualizing results
1. Support vs Confidence
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()
2. Support vs Lift
plt.scatter(rules['support'], rules['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()
3. Lift vs Confidence
fit = np.polyfit(rules['lift'], rules['confidence'], 1)
fit_fn = np.poly1d(fit)
plt.plot(rules['lift'], rules['confidence'], 'yo', rules['lift'],
fit_fn(rules['lift']))
K here is the pre-defined number of clusters to be formed by the algorithm. If K=3, it means
the number of clusters to be formed from the dataset is 3.
Step-3: Assign each data point, based on its distance from the randomly selected points
(centroids), to the nearest/closest centroid, which will form the predefined clusters.
Step-5: Repeat step 3, reassigning each datapoint to the new closest centroid of each
cluster.
Step-7: FINISH
STEP 1: Let's choose the number k of clusters, i.e., K=2, to segregate the dataset and put
the points into different respective clusters. We will choose 2 random points which will act
as centroids to form the clusters.
STEP 2: Now we will assign each data point in the scatter plot to its closest K-point or
centroid. This is done by drawing a median line between the two centroids.
Consider the below image:
STEP 3: Points on the left side of the line are nearer to the blue centroid, and points to the
right of the line are closer to the yellow centroid. The left ones form a cluster with the blue
centroid and the right ones with the yellow centroid.
STEP 4: Repeat the process by choosing new centroids. To choose the new centroids, we will
find the new center of gravity of each cluster, as depicted below:
STEP 5: Next, we will reassign each datapoint to the new centroids. We will repeat the same
process as above (using a median line). The yellow data point on the blue side of the
median line will be included in the blue cluster.
STEP 6: As reassignment has taken place, so we will repeat the above step of finding new
centroids.
STEP 7: We will repeat the above process of finding the center of gravity of the centroids, as
depicted below.
STEP 8: After finding the new centroids we will again draw the median line and reassign the
data points, as in the above steps.
STEP 9: We will finally segregate the points based on the median line, so that two groups are
formed and no dissimilar point is included in a single group.
The number of clusters that we choose for the algorithm should not be random. Every
cluster is formed by calculating and comparing the mean distances of the data points within
a cluster from its centroid.
We can choose the right number of clusters with the help of the Within-Cluster-Sum-of-
Squares (WCSS) method.
WCSS stands for the sum of the squares of the distances of the data points in each and every
cluster from its centroid:
WCSS = Σ over clusters Σ over points in the cluster (distance from point to the cluster centroid)²
The main idea is to minimize the distance between the data points and the centroid of the
clusters. The process is iterated until we reach a minimum value for the sum of distances.
To find the optimal number of clusters, the elbow method follows the below
steps:
1 Execute K-means clustering on a given dataset for different K values (ranging from 1-
10).
2 For each K value, calculate the WCSS.
3 Plot a graph/curve between the WCSS values and the respective number of clusters K.
4 The sharp point of bend (a point looking like an elbow joint) of the arm-like plot will
be considered as the best/optimal value of K.
Python Implementation
Clustering
from sklearn.cluster import KMeans

kmeans = KMeans(3)
kmeans.fit(x)
Clustering Results
identified_clusters = kmeans.fit_predict(x)
identified_clusters
array([1, 1, 0, 0, 0, 2])
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'],data_with_clusters['Latitude'],c=data_
with_clusters['Clusters'],cmap='rainbow')
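The WCSS values plotted in the elbow curve below are not computed in the listing above; a minimal sketch for 1 to 6 clusters:
# Compute WCSS (inertia) for 1 to 6 clusters
wcss = []
for k in range(1, 7):
    km = KMeans(k)
    km.fit(x)
    wcss.append(km.inertia_)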
number_clusters = range(1,7)
plt.plot(number_clusters,wcss)
plt.title('The Elbow title')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
From the elbow plot, we can choose 3 as the number of clusters; this method shows what a
good number of clusters is.
Let’s say we have the below points and we want to cluster them into groups:
Now, based on the similarity of these clusters, we can combine the most similar clusters
together and repeat this process until only a single cluster is left:
We are essentially building a hierarchy of clusters. That’s why this algorithm is called
hierarchical clustering. I will discuss how to decide the number of clusters in a later section.
For now, let’s look at the different types of hierarchical clustering.
We assign each point to an individual cluster in this technique. Suppose there are 4 data
points. We will assign each of these points to a cluster and hence will have 4 clusters in the
beginning:
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a
single cluster is left:
We are merging (or adding) the clusters at each step, right? Hence, this type of clustering is
also known as additive hierarchical clustering.
Divisive hierarchical clustering works in the opposite way. Instead of starting with n clusters
(in case of n observations), we start with a single cluster and assign all the points to that
cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to the
same cluster at the beginning:
Now, at each iteration, we split the farthest point in the cluster and repeat this process until
each cluster only contains a single point:
We are splitting (or dividing) the clusters at each step, hence the name divisive hierarchical
clustering.
Agglomerative Clustering is widely used in the industry and that will be the focus in this
article. Divisive hierarchical clustering will be a piece of cake once we have a handle on the
agglomerative type.
We merge the most similar points or clusters in hierarchical clustering – we know this. Now
the question is – how do we decide which points are similar and which are not? It’s one of
the most important questions in clustering!
Here’s one way to calculate similarity – Take the distance between the centroids of these
clusters. The points having the least distance are referred to as similar points and we can
merge them. We can refer to this as a distance-based algorithm as well (since we are
calculating the distances between the clusters).
In hierarchical clustering, we have a concept called a proximity matrix. This stores the
distances between each point. Let’s take an example to understand this matrix as well as the
steps to perform hierarchical clustering.
Suppose a teacher wants to divide her students into different groups. She has the marks
scored by each student in an assignment and based on these marks, she wants to segment
them into groups. There’s no fixed target here as to how many groups to have. Since the
teacher does not know what type of students should be assigned to which group, it cannot
be solved as a supervised learning problem. So, we will try to apply hierarchical clustering
here and segment the students into different groups.
First, we will create a proximity matrix which will tell us the distance between each of these
points. Since we are calculating the distance of each point from each of the other points, we
will get a square matrix of shape n X n (where n is the number of observations).
The diagonal elements of this matrix will always be 0 as the distance of a point with itself is
always 0. We will use the Euclidean distance formula to calculate the rest of the distances.
So, let’s say we want to calculate the distance between point 1 and 2:
√(10-7)^2 = √9 = 3
Similarly, we can calculate all the distances and fill the proximity matrix.
Different colors here represent different clusters. You can see that we have 5 different
clusters for the 5 points in our data.
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge the
points with the smallest distance. We then update the proximity matrix:
Here, the smallest distance is 3 and hence we will merge point 1 and 2:
Let’s look at the updated clusters and accordingly update the proximity matrix:
Here, we have taken the maximum of the two marks (7, 10) to replace the marks for this
cluster. Instead of the maximum, we can also take the minimum value or the average values
as well. Now, we will again calculate the proximity matrix for these clusters:
So, we will first look at the minimum distance in the proximity matrix and then merge the
closest pair of clusters. We will get the merged clusters as shown below after repeating
these steps:
We started with 5 clusters and finally have a single cluster. This is how agglomerative
hierarchical clustering works. But the burning question still remains – how do we decide
the number of clusters? Let’s understand that in the next section.
Ready to finally answer this question that’s been hanging around since we started learning?
To get the number of clusters for hierarchical clustering, we make use of an awesome
concept called a Dendrogram.
Let’s get back to our teacher-student example. Whenever we merge two clusters, a
dendrogram will record the distance between these clusters and represent it in graph form.
Let's see what a dendrogram looks like:
We have the samples of the dataset on the x-axis and the distance on the y-axis. Whenever
two clusters are merged, we will join them in this dendrogram and the height of the
join will be the distance between these points. Let’s build the dendrogram for our
example:
Take a moment to process the above image. We started by merging sample 1 and 2 and the
distance between these two samples was 3 (refer to the first proximity matrix in the previous
section). Let’s plot this in the dendrogram:
Here, we can see that we have merged sample 1 and 2. The vertical line represents the
distance between these samples. Similarly, we plot all the steps where we merged the
clusters and finally, we get a dendrogram like this:
We can clearly visualize the steps of hierarchical clustering. The longer the vertical
lines in the dendrogram, the greater the distance between those clusters.
Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the
threshold in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and
draw a horizontal line:
The number of clusters will be the number of vertical lines which are intersected by the
line drawn using the threshold. In the above example, since the red line intersects 2
vertical lines, we will have 2 clusters. One cluster will have the samples (1, 2, 4)
and the other will have the samples (3, 5). Pretty straightforward, right?
This is how we can decide the number of clusters using a dendrogram in Hierarchical
Clustering. In the next section, we will implement hierarchical clustering which will help you
to understand all the concepts that we have learned in this article.
We will be working on a wholesale customer segmentation problem. You can download the
dataset using this link. The data is hosted on the UCI Machine Learning repository. The aim
of this problem is to segment the clients of a wholesale distributor based on their annual
spending on diverse product categories, like milk, grocery, region, etc.
Let’s explore the data first and then apply Hierarchical Clustering to segment the clients.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import normalize
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

data = pd.read_csv('Wholesale customers data.csv')
data.head()
There are multiple product categories – Fresh, Milk, Grocery, etc. The values represent
the number of units purchased by each client for each product. Our aim is to make
clusters from this data that can segment similar clients together. We will, of course,
use Hierarchical Clustering for this problem.
But before applying Hierarchical Clustering, we have to normalize the data so that the
scale of each variable is the same. Why is this important? Well, if the scale of the
variables is not the same, the model might become biased towards the variables with a
higher magnitude like Fresh or Milk (refer to the above table).
So, let’s first normalize the data and bring all the variables to the same scale:
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
Here, we can see that the scale of all the variables is almost similar. Now, we are good to
go. Let’s first draw the dendrogram to help us decide the number of clusters for this
particular problem:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled,
method='ward'))
The x-axis contains the samples and the y-axis represents the distance between these
samples. The vertical line with the maximum distance is the blue line, and hence we can
decide a threshold distance and cut the dendrogram there by drawing a horizontal line:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled,
method='ward'))
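To actually draw the horizontal threshold line, one possibility is shown below; the threshold value of 6 is an assumption:
plt.axhline(y=6, color='r', linestyle='--')  # horizontal cut; the value 6 is an assumption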
We have two clusters, as this line cuts the dendrogram at two points. Let's now apply
hierarchical clustering for 2 clusters:
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
cluster.fit_predict(data_scaled)
We can see the values of 0s and 1s in the output since we defined 2 clusters. 0
represents the points that belong to the first cluster and 1 represents points in the second
cluster. Let’s now visualize the two clusters:
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'],
c=cluster.labels_)
Awesome! We can clearly visualize the two clusters here. This is how we can implement
hierarchical clustering in Python.