
Experiment 1
Objective: To perform classification using the Bayesian classification algorithm in Python
Bayes Theorem
• Based on prior knowledge of conditions that may be related to an event, Bayes theorem
describes the probability of the event
• It gives a way to compute a conditional probability.
• Assume we have a hypothesis (H) and evidence (E). According to Bayes' theorem, the relationship
between the probability of the hypothesis before seeing the evidence, P(H), and the probability of
the hypothesis after seeing the evidence, P(H|E), is:
P(H|E) = P(E|H)*P(H)/P(E)
• Prior probability = P(H) is the probability before getting the evidence
• Posterior probability = P(H|E) is the probability after getting evidence
• In general,
P(class|data) = (P(data|class) * P(class)) / P(data)
Approach:
Naive Bayes classifier calculates the probability of an event in the following steps:
• Step 1: Calculate the prior probability for given class labels
• Step 2: Find Likelihood probability with each attribute for each class
• Step 3: Put these values into the Bayes formula and calculate the posterior probability for each class.
• Step 4: Assign the input to the class with the higher posterior probability.
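To make these steps concrete, here is a minimal from-scratch sketch on a tiny made-up categorical dataset (the weather/play values are hypothetical and used only for illustration):

from collections import Counter, defaultdict

# Hypothetical training data: (weather, played)
train = [("sunny", "yes"), ("sunny", "no"), ("rainy", "yes"),
         ("rainy", "yes"), ("sunny", "yes"), ("rainy", "no")]

# Step 1: prior probability P(class)
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Step 2: likelihood P(attribute | class)
cond_counts = defaultdict(Counter)
for value, label in train:
    cond_counts[label][value] += 1
likelihood = {c: {v: n / class_counts[c] for v, n in counts.items()}
              for c, counts in cond_counts.items()}

# Steps 3 and 4: posterior (up to the constant P(data)) and pick the larger one
x = "sunny"
posteriors = {c: likelihood[c].get(x, 0) * priors[c] for c in priors}
print(posteriors, "->", max(posteriors, key=posteriors.get))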
Advantages
• It is not only a simple approach but also a fast and accurate method for prediction.
• Naive Bayes has a very low computation cost.
• It can efficiently work on a large dataset.
• It performs well with a discrete response (class) variable; it is less suited to continuous targets.
• It can be used with multiple class prediction problems.
• It also performs well in the case of text analytics problems.
• When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression.
Disadvantages
• It assumes independent features. In practice, it is almost impossible for the model to get a set of
predictors that are entirely independent.
• If a category value never occurs with a class in the training data, the model assigns that class a zero
posterior probability and cannot make a prediction. This is known as the Zero Probability/Frequency
Problem.
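A common remedy is Laplace (add-one) smoothing. As a sketch, scikit-learn's MultinomialNB exposes this through its alpha parameter (the count data here is made up; the GaussianNB used in the program below models continuous features, so the issue shows up differently there):

from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Hypothetical word-count features for two classes; the word at index 2
# never appears with class 0 in the training data.
X_train = np.array([[3, 0, 0], [2, 1, 0], [0, 2, 3], [1, 0, 4]])
y_train = np.array([0, 0, 1, 1])

# alpha=1.0 applies add-one (Laplace) smoothing, so unseen feature/class
# combinations get a small non-zero probability instead of zero.
clf = MultinomialNB(alpha=1.0)
clf.fit(X_train, y_train)
print(clf.predict_proba(np.array([[0, 0, 2]])))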

Classification using Bayesian Classification Algorithm

# generating the dataset


from sklearn.datasets import make_classification

X, y = make_classification(
    n_features=6,
    n_classes=3,
    n_samples=800,
    n_informative=2,
    random_state=1,
    n_clusters_per_class=1,
)
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=y, marker="*")

# train test split


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)
# model building and training
from sklearn.naive_bayes import GaussianNB
# Build a Gaussian Classifier
model = GaussianNB()
# Model training
model.fit(X_train, y_train)
# Predict Output
predicted = model.predict([X_test[6]])
print("Actual Value:", y_test[6])
print("Predicted Value:", predicted[0])

Actual Value: 0
Predicted Value: 0

# model evaluation
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test, average="weighted")
print("Accuracy:", accuracy)
print("F1 Score:", f1)

Accuracy: 0.8484848484848485
F1 Score: 0.8491119695890328

# visualizing the confusion matrix
labels = [0, 1, 2]

cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot();


Experiment 2
Objective: To perform cluster analysis by the k-means method using Python
K-means Clustering:
K-means is an unsupervised learning method for clustering data points. The algorithm iteratively
divides data points into K clusters by minimizing the variance in each cluster.
Approach:
First, each data point is randomly assigned to one of the K clusters. Then, we compute the centroid
(functionally the centre) of each cluster, and reassign each data point to the cluster with the closest
centroid. We repeat this process until the cluster assignments for each data point are no longer
changing.
K-means clustering requires us to select K, the number of clusters we want to group the data into.
The elbow method lets us graph the inertia (a distance-based metric) and visualize the point at
which it starts decreasing linearly. This point is referred to as the "elbow" and is a good estimate for
the best value for K based on our data.

1. Decide how many clusters you want, i.e. choose k


2. Randomly assign a centroid to each of the k clusters
3. Calculate the distance of all observations to each of the k centroids
4. Assign observations to the closest centroid
5. Find the new location of the centroid by taking the mean of all the observations in each
cluster
6. Repeat steps 3-5 until the centroids do not change position
K-means clustering performs best on data that form roughly spherical clusters. Data whose clusters
are not spherical do not work well with k-means. For example, k-means would struggle to separate
two concentric circles or two interleaving arcs, because no pair of distinct centroids can split them
correctly, even though they are clearly two visually distinct groups.
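To make steps 1 to 6 concrete, here is a minimal from-scratch sketch with NumPy using the same ten toy points as the program below (in practice you would use scikit-learn's KMeans as shown there):

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.array([[4, 21], [5, 19], [10, 24], [4, 17], [3, 16],
                   [11, 25], [14, 24], [6, 22], [10, 21], [12, 21]], dtype=float)
labels, centroids = kmeans(points, k=2)
print(labels, centroids, sep="\n")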
Clustering using K-means Clustering
import matplotlib.pyplot as plt
x = [4, 5, 10, 4, 3, 11, 14 , 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21,21]

plt.scatter(x,y)
plt.show()


from sklearn.cluster import KMeans


data = list(zip(x,y))
inertias = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x,y,c=kmeans.labels_)
plt.show()


Experiment 3
Objective: To perform hierarchical clustering using Python
Hierarchical Clustering: Hierarchical clustering is an unsupervised learning method for clustering
data points. The algorithm builds clusters by measuring the dissimilarities between data.
Unsupervised learning means that a model does not have to be trained, and we do not need a
"target" variable. This method can be used on any data to visualize and interpret the relationship
between individual data points.
We will use Agglomerative Clustering, a type of hierarchical clustering that follows a bottom-up
approach. We begin by treating each data point as its own cluster. Then, we join clusters together
that have the shortest distance between them to create larger clusters. This step is repeated until
one large cluster is formed containing all of the data points.
Hierarchical clustering requires us to decide on both a distance metric and a linkage method. We will
use Euclidean distance and the Ward linkage method, which at each merge minimizes the increase in
total within-cluster variance.
Approach:
• Step 1: Initially, treat each data point as its own cluster (e.g., 6 points give 6 clusters).
• Step 2: Merge the two closest data points into a single cluster, leaving 5 clusters.
• Step 3: Again, merge the two closest clusters into a single cluster, leaving 4 clusters.
• Step 4: Repeat step 3 until a single cluster containing all data points is obtained.
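Once the linkage matrix is built, SciPy's fcluster can also cut the hierarchy into a chosen number of flat clusters; a short sketch using the same toy points as the program below:

from scipy.cluster.hierarchy import linkage, fcluster

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = list(zip(x, y))

# Build the hierarchy with Ward linkage and Euclidean distance
linkage_data = linkage(data, method='ward', metric='euclidean')

# Cut the dendrogram so that exactly 2 flat clusters remain
labels = fcluster(linkage_data, t=2, criterion='maxclust')
print(labels)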
Hierarchical Clustering using Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
x = [4, 5, 10, 4, 3, 11, 14 , 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21,21]

# visualizing some datapoints


plt.scatter(x, y)
plt.show()

# compute the ward linkage using euclidean distance


data = list(zip(x, y))


linkage_data = linkage(data, method='ward', metric='euclidean')


dendrogram(linkage_data)

# visualize it using a dendrogram


plt.show()

from sklearn.cluster import AgglomerativeClustering

# visualize on a 2-dimensional plot


# 'metric' replaces the older 'affinity' argument in recent scikit-learn versions
hierarchical_cluster = AgglomerativeClustering(
    n_clusters=2, metric='euclidean', linkage='ward'
)
labels = hierarchical_cluster.fit_predict(data)
plt.scatter(x,y,c=labels)
plt.show()


Experiment 4
Objective: Study of Regression Analysis Using Python
Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of
future events.
Linear Regression
Linear regression uses the relationship between the data points to draw a straight line through
them.
This line can be used to predict future values.
Python has methods for finding a relationship between data points and for drawing a line of linear
regression. We will show how to use these methods instead of going through the mathematical
formulas.
Multiple Regression
Multiple regression is like linear regression, but with more than one independent value, meaning
that we try to predict a value based on two or more variables.
Polynomial Regression
Polynomial regression, like linear regression, uses the relationship between the variables x and y
to find the best way to draw a line through the data points.
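As a sketch of an alternative to NumPy's polyfit used below, scikit-learn can fit the same kind of model by expanding x into polynomial features and running an ordinary linear regression on them (the points are borrowed from the polynomial example that follows):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# A few x/y points just to illustrate the pipeline
x = np.array([1, 2, 3, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([100, 90, 80, 60, 60, 55, 60, 65, 70])

# Expand x into [x, x^2, x^3] and fit a linear model on the expanded features
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)

# Predict for a new value of x
print(model.predict(poly.transform(np.array([[4]]))))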
Regression Analysis Using Python
# linear regression example
import matplotlib.pyplot as plt
from scipy import stats

# x-axis represents age
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]

# y-axis represents speed


y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x,y)
plt.show()


# draw the regression line


slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)

# predicting the speed of a 10-year-old car
speed = myfunc(10)
print(speed)
85.59308314937454
# multiple regression example
import pandas
from sklearn import linear_model

df = pandas.read_csv("/content/sample_data/data.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# predict the CO2 emission of a car where the weight is 2300kg
# and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)

[107.2087328]

# polynomial regression example
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

print(r2_score(y, mymodel(x)))

0.9432150416451026

# Predict the speed of a car passing at 17:00
speed = mymodel(17)
print(speed)

88.87331269698001


EXPERIMENT -5
Objective: Outlier detection using Python.
There are several ways to treat outliers in a dataset, depending on the nature of the outliers
and the problem being solved. Here are some of the most common ways of treating outlier
values:
1. Z-Score Treatment
2. IQR based filtering
3. Percentile Method

Assuming a normal distribution, we use the Z-score treatment method to treat the outliers.
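For comparison, the IQR-based filtering listed as method 2 could be sketched as follows, assuming the same placement.csv file and cgpa column used in the Z-score walkthrough below:

import pandas as pd

df = pd.read_csv('placement.csv')

# Compute the interquartile range of the cgpa column
q1 = df['cgpa'].quantile(0.25)
q3 = df['cgpa'].quantile(0.75)
iqr = q3 - q1

# Keep only the rows within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered_df = df[(df['cgpa'] >= lower) & (df['cgpa'] <= upper)]
print(filtered_df.shape)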


1. Step 1: Importing necessary dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. Step 2: Read and load the dataset
df = pd.read_csv('placement.csv')
df.sample(5)

3. Step 3: Plot the distribution plots for the features


import warnings
warnings.filterwarnings('ignore')
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
sns.distplot(df['cgpa'])
plt.subplot(1,2,2)
sns.distplot(df['placement_exam_marks'])
plt.show()

4. Step 4: Finding the boundary values


print("Highest allowed", df['cgpa'].mean() + 3*df['cgpa'].std())
print("Lowest allowed", df['cgpa'].mean() - 3*df['cgpa'].std())

Output:

Highest allowed 8.808933625397177


Lowest allowed 5.113546374602842

5. Step 5: Finding the outliers


df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]

6. Step 6: Trimming of outliers


new_df = df[(df['cgpa'] < 8.80) & (df['cgpa'] > 5.11)]
new_df
7. Step 7: Capping on outliers
upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()
8. Step 8: Now, apply the capping
df['cgpa'] = np.where(df['cgpa'] > upper_limit, upper_limit,
                      np.where(df['cgpa'] < lower_limit, lower_limit, df['cgpa']))

9. Step 9: Now, see the statistics using the describe() function


df['cgpa'].describe()

Output:

EXPERIMENT -06
Objective: Demonstration of association rule mining using the Apriori algorithm on supermarket
data.
Association rule mining is a data mining technique used to discover interesting patterns or
associations in a dataset. The Apriori algorithm is one of the most widely used algorithms for
this purpose.
Data Preparation: Prepare a dataset containing supermarket transaction data. Each
transaction should list the items purchased by a customer.

Set Minimum Support and Minimum Confidence Thresholds


Before running the Apriori algorithm, we need to set the minimum support and minimum
confidence thresholds. These thresholds determine which itemsets and rules are considered
interesting. For example, we can set a minimum support of 0.2 (meaning an itemset must
appear in at least 20% of transactions) and a minimum confidence of 0.6 (meaning a rule must
have at least 60% confidence to be considered).
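To make these two measures concrete, here is a quick worked check on the five sample transactions used by the program further below:

# Five sample transactions
transactions = [
    {'Bread', 'Milk', 'Eggs'},
    {'Bread', 'Juice', 'Chips', 'Milk'},
    {'Eggs', 'Milk', 'Juice', 'Sausage'},
    {'Bread', 'Eggs', 'Sausage', 'Yogurt'},
    {'Milk', 'Juice', 'Sausage'},
]

# support(X) = fraction of transactions containing X
def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# confidence(X -> Y) = support(X union Y) / support(X)
def confidence(x, y):
    return support(x | y) / support(x)

print(support({'Bread', 'Milk'}))       # 2/5 = 0.4
print(confidence({'Milk'}, {'Juice'}))  # 3/4 = 0.75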
Generate Frequent Itemsets
Use the Apriori algorithm to find frequent itemsets. This involves iteratively finding itemsets
that meet the minimum support threshold. The algorithm starts with frequent itemsets of
length 1 and then iteratively generates longer itemsets based on the previous step's results.
Generate Association Rules
From the frequent itemsets, you can generate association rules that meet the minimum confidence
threshold. Association rules typically have the form "if X, then Y," where X and Y are itemsets. For
example, a rule might be "if Bread and Milk are purchased, then Eggs are also purchased."
Evaluate and Interpret Results
Once we have the frequent itemsets and association rules, we can evaluate and interpret the results.
We can identify interesting rules and patterns that provide insights into customer behavior, product
placement, and promotions.

from mlxtend.frequent_patterns import apriori


from mlxtend.frequent_patterns import association_rules

# Define your dataset


dataset = [
['Bread', 'Milk', 'Eggs'],
['Bread', 'Juice', 'Chips', 'Milk'],
['Eggs', 'Milk', 'Juice', 'Sausage'],
['Bread', 'Eggs', 'Sausage', 'Yogurt'],
['Milk', 'Juice', 'Sausage']]

# Convert the dataset into a one-hot encoded DataFrame


import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Generate frequent itemsets


frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

# Generate association rules


rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

# Print the frequent itemsets and association rules


print("Frequent Itemsets:")
print(frequent_itemsets)

print("\nAssociation Rules:")
print(rules)
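As a follow-up, the rules DataFrame returned by association_rules also contains a lift column, so the output can be narrowed further, for example keeping only rules whose items co-occur more often than expected by chance:

# Keep rules with lift > 1 and show the strongest ones first
strong_rules = rules[rules['lift'] > 1.0].sort_values('confidence', ascending=False)
print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])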

EXPERIMENT-07
Objective: Demonstration of FP Growth algorithm on supermarket data
The FP-Growth (Frequent Pattern Growth) algorithm is another popular technique for mining
frequent itemsets and association rules in transactional data. It has an advantage over the
Apriori algorithm in terms of speed and efficiency.

1. Step 1: Data Preparation

2. Step 2: Install Required Libraries


pip install mlxtend
3. Step 3: Perform Association Rule Mining with FP-Growth Algorithm
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules
import pandas as pd
dataset = [
['Bread', 'Milk', 'Eggs'],
['Bread', 'Juice', 'Chips', 'Milk'],
['Eggs', 'Milk', 'Juice', 'Sausage'],
['Bread', 'Eggs', 'Sausage', 'Yogurt'],
    ['Milk', 'Juice', 'Sausage'],
]

from mlxtend.preprocessing import TransactionEncoder


te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = fpgrowth(df, min_support=0.2, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print("Frequent Itemsets:")
print(frequent_itemsets)

print("\nAssociation Rules:")
print(rules)
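To see the speed difference mentioned above, both miners can be timed on the same one-hot encoded DataFrame (a rough sketch; on a dataset this small the gap is negligible and only becomes visible on larger transaction data):

import time
from mlxtend.frequent_patterns import apriori, fpgrowth

# Time both algorithms on the same one-hot encoded DataFrame df
for name, miner in [("apriori", apriori), ("fpgrowth", fpgrowth)]:
    start = time.perf_counter()
    miner(df, min_support=0.2, use_colnames=True)
    print(name, round(time.perf_counter() - start, 4), "seconds")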

EXPERIMENT -08
Objective: To perform the statistical analysis of data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
data = {
'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
'Income': [50000, 60000, 75000, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
'Score': [75, 80, 85, 88, 90, 92, 95, 96, 98, 99]}
df = pd.DataFrame(data)
summary_statistics = df.describe()
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(df['Age'], df['Income'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs. Income')
plt.subplot(1, 2, 2)
plt.hist(df['Score'], bins=5)
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Score Distribution')
plt.show()
age_income_ttest, p_value = stats.ttest_ind(df['Age'], df['Income'])
if p_value < 0.05:
    print("There is a significant difference between Age and Income.")
else:
    print("There is no significant difference between Age and Income.")
confidence_interval = stats.norm.interval(0.95, loc=df['Score'].mean(), scale=df['Score'].std())
import statsmodels.api as sm
X = sm.add_constant(df['Age'])
model = sm.OLS(df['Income'], X).fit()
regression_summary = model.summary()
print("Summary Statistics:\n", summary_statistics)

EXPERIMENT-9

Case study for XYZ Retail


Implementing a data warehouse for XYZ Retail is a strategic solution to address the data-related
challenges they are facing. Below is a high-level overview of the steps and considerations for
implementing a data warehouse at XYZ Retail:

1.Data Extraction (Python): Assume you have data in various formats (e.g., CSV files, databases).
import pandas as pd
# Load data from CSV files
sales_data = pd.read_csv('sales_data.csv')
customer_data = pd.read_csv('customer_data.csv')
product_data = pd.read_csv('product_data.csv')

2.Data Transformation
# Merge data
merged_data = pd.merge(sales_data, customer_data, on='customer_id', how='inner')
merged_data = pd.merge(merged_data, product_data, on='product_id', how='inner')

# Transform and clean data as needed


# Convert date strings to datetime objects
merged_data['order_date'] = pd.to_datetime(merged_data['order_date'])

3.Data Loading
import sqlite3
conn = sqlite3.connect('data_warehouse.db')

# Load data into the database


merged_data.to_sql('fact_sales', conn, if_exists='replace', index=False)

4.Data Querying (Python/SQL): Once the data is loaded, you can run SQL queries to extract insights.
# Query the data warehouse
query = """
SELECT product_name, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY product_name ORDER BY total_sales DESC
"""

# Execute the query


result = pd.read_sql(query, conn)

# Display the result


print(result)

Output:
product_name total_sales
0 Product A 5000
1 Product B 4500
2 Product C 3500
3 Product D 3000
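To push the warehouse further toward a star schema, the customer and product frames loaded in step 1 could also be stored as dimension tables next to the fact table (a sketch; the dim_customer/dim_product table names and the join columns are assumptions based on the earlier steps):

# Load the dimension tables next to the fact table
customer_data.to_sql('dim_customer', conn, if_exists='replace', index=False)
product_data.to_sql('dim_product', conn, if_exists='replace', index=False)

# Example join across the fact and dimension tables
query = """
SELECT c.customer_id, SUM(f.sales_amount) AS total_spent
FROM fact_sales AS f
JOIN dim_customer AS c ON c.customer_id = f.customer_id
GROUP BY c.customer_id
ORDER BY total_spent DESC
"""
print(pd.read_sql(query, conn))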
