
Degree Engineering

A Laboratory Manual for

Data Mining
(3160714)

[ B.E. (Computer Engineering) : Semester - 6 ]

Enrolment No: 200170107049
Name: Patel Tanmay Anilkumar
Branch: Computer Engineering
Academic Term: 2022-2023
Institute Name: VGEC

Directorate of Technical Education, Gandhinagar, Gujarat

Preface

The main aim of any laboratory/practical/field work is to enhance the required skills and to build students' ability to solve real-time problems by developing the relevant competencies in the psychomotor domain. With this in view, GTU has designed a competency-focused, outcome-based curriculum for its engineering degree programmes in which sufficient weightage is given to practical work. This underlines the importance of skill development among students and encourages students, instructors and faculty members to utilize every second of the time allotted to practicals to achieve the intended outcomes by actually performing the experiments, rather than conducting merely study-type experiments. For the effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical is carefully designed to serve as a tool to develop and enhance, in every student, the relevant competencies required by industry. Such psychomotor skills are very difficult to develop through the traditional chalk-and-board method of content delivery in the classroom. Accordingly, this lab manual is designed to focus on industry-defined relevant outcomes, rather than the old practice of conducting practicals merely to prove concepts and theory.

By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea before the session. This, in turn, strengthens the attainment of the predetermined outcomes. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes and practical outcomes (objectives). Students are also informed of the safety measures and necessary precautions to be taken while performing the practical.

This manual also provides guidelines to faculty members to facilitate student-centric lab activities for each experiment by arranging and managing the necessary resources, so that students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also indicates how students will be assessed by providing rubrics.

Data mining is a key to sentiment analysis, price optimization, database marketing, credit risk
management, training and support, fraud detection, healthcare and medical diagnoses, risk
assessment, recommendation systems and much more. It can be an effective tool in just about
any industry, including retail, wholesale distribution, service industries, telecom,
communications, insurance, education, manufacturing, healthcare, banking, science,
engineering, and online marketing or social media.
Utmost care has been taken while preparing this lab manual; however, there is always scope for improvement. We therefore welcome constructive suggestions for improvement and the reporting of any errors.
Vishwakarma Government Engineering College
Department of Computer Engineering

CERTIFICATE

This is to certify that Mr./Ms. _____________________________________________

Enrollment No. _______________ of B.E. Semester ________ from Computer Engineering

Department of this Institute (GTU Code: 017) has satisfactorily completed the Practical /

Tutorial work for the subject Data Mining (3160714) for the academic year 2022-23.

Place: ___________

Date: ___________

Signature of Course Faculty                              Head of the Department


DTE’s Vision

 To provide globally competitive technical education
 Remove geographical imbalances and inconsistencies
 Develop student-friendly resources with a special focus on girls’ education and support to weaker sections
 Develop programs relevant to industry and create a vibrant pool of technical professionals

Institute’s Vision

 To create an ecosystem for proliferation of socially responsible and technically sound engineers, innovators and entrepreneurs.

Institute’s Mission

 To develop state-of-the-art laboratories and well-equipped academic infrastructure.
 To motivate faculty and staff for qualification up-gradation and enhancement of subject knowledge.
 To promote research, innovation and real-life problem-solving skills.
 To strengthen linkages with industries, academic and research organizations.
 To reinforce concern for sustainability, natural resource conservation and social
responsibility.

Department’s Vision

 To create an environment for providing value-based education in Computer Engineering through innovation, team work and ethical practices.

Department’s Mission

 To produce computer engineering graduates according to the needs of industry, government, society and the scientific community.
 To develop state of the art computing facilities and academic infrastructure.
 To develop partnership with industries, government agencies and R & D organizations for
knowledge sharing and overall development of faculties and students.
 To solve industrial, governance and societal issues by applying computing techniques.
 To create environment for research and entrepreneurship.

Programme Outcomes (POs)

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis
of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant
to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give and
receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.

Program Specific Outcomes (PSOs)

 Sound knowledge of fundamentals of computer science and engineering including software and hardware.
 Develop the software using sound software engineering principles having web
based/mobile based interface.
 Use various tools and technology supporting modern software frameworks for solving
problems having large volume of data in the domain of data science and machine learning.

Program Educational Objectives (PEOs)

 Possess technical competence in solving real-life problems related to computing.
 Acquire good analysis, design, development, implementation and testing skills to formulate simple computing solutions to business and societal needs.
 Provide requisite skills to pursue entrepreneurship, higher studies, research and development, and imbibe a high degree of professionalism in the fields of computing.
 Embrace life-long learning and remain continuously employable.
 Work and excel in a highly competitive, supportive, multicultural and professional environment while abiding by legal and ethical responsibilities.

Practical – Course Outcome matrix

Course Outcomes (COs):


CO_3160714.1 Perform the preprocessing of data and apply mining techniques on it.
CO_3160714.2 Identify the association rules, classification, and clusters in large data sets.
CO_3160714.3 Solve real-world problems in business and scientific information using data
mining.
CO_3160714.4 Use data analysis tools for scientific applications.
CO_3160714.5 Implement various supervised machine learning algorithms.

Sr. No. | Objective(s) of Experiment | CO1 | CO2 | CO3 | CO4 | CO5
1. Identify how data mining is an interdisciplinary field by an application. | – | – | √ | – | –
2. Write programs to perform the following preprocessing tasks (any language): 2.1 noisy data handling (equal width binning; equal frequency/depth binning); 2.2 normalization techniques (min-max normalization; z-score normalization; decimal scaling); 2.3 implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries. | √ | – | – | – | –
3. Perform hands-on experiments of data preprocessing with sample data on the Orange tool. | √ | – | – | √ | –
4. Implement the Apriori algorithm of association rule data mining in any programming language. | – | √ | – | – | –
5. Apply the association rule data mining technique on sample data sets using the XLMiner analysis tool. | – | √ | – | √ | –
6. Apply the classification data mining technique on sample data sets in Weka. | – | – | – | √ | √
7. 7.1 Implement a classification technique with quality measures in any programming language; 7.2 implement a regression technique in any programming language. | – | – | – | – | √
8. Apply the K-means clustering algorithm in any programming language. | – | √ | – | √ | –
9. Perform a hands-on experiment on any advanced mining technique using an appropriate tool. | – | – | – | √ | –
10. Solve a real-world problem using data mining techniques in the Python programming language. | – | – | √ | – | –

Guidelines for Faculty members

1. The teacher should provide guidelines and demonstrate the practical to the students with all its features.
2. The teacher shall explain the basic concepts/theory related to the experiment to the students before starting each practical.
3. Involve all the students in the performance of each experiment.
4. The teacher is expected to share the skills and competencies to be developed in the students and ensure that the respective skills and competencies are developed after the completion of the experimentation.
5. Teachers should give students the opportunity for hands-on experience after the demonstration.
6. The teacher may provide additional knowledge and skills to the students, even if not covered in the manual, as expected of the students by the concerned industry.
7. Give practical assignments and assess the performance of students based on the assigned tasks, checking whether they are done as per the instructions or not.
8. The teacher is expected to refer to the complete curriculum of the course and follow the guidelines for implementation.

Instructions for Students

1. Students are expected to carefully listen to all the theory classes delivered by the faculty
members and understand the COs, content of the course, teaching and examination
scheme, skill set to be developed etc.
2. Students will have to perform the experiments as per the given practical list.
3. Students have to show the output of each program in their practical file.
4. Students are instructed to submit the practical list as per the sample list shown on the next page.
5. Student should develop a habit of submitting the experimentation work as per the schedule
and s/he should be well prepared for the same.

Common Safety Instructions

Students are expected to

1) switch on the PC carefully (do not use wet hands)
2) shut down the PC properly at the end of the lab
3) carefully handle the peripherals (mouse, keyboard, network cable, etc.)
4) use a laptop in the lab only after getting permission from the teacher

Index
(Progressive Assessment Sheet)

Columns: Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Teacher with date | Remarks

1. Identify how data mining is an interdisciplinary field by an application.
2. Write programs to perform the following preprocessing tasks (any language):
   2.1 Noisy data handling: equal width binning, equal frequency/depth binning
   2.2 Normalization techniques: min-max normalization, z-score normalization, decimal scaling
   2.3 Data dispersion measure: Five Number Summary and box plot using Python libraries
3. Perform hands-on experiments of data preprocessing with sample data on the Orange tool.
4. Implement the Apriori algorithm of association rule data mining in any programming language.
5. Apply the association rule data mining technique on sample data sets using the XLMiner analysis tool.
6. Apply the classification data mining technique on sample data sets in Weka.
7. 7.1 Implement a classification technique with quality measures in any programming language. 7.2 Implement a regression technique in any programming language.
8. Apply the K-means clustering algorithm in any programming language.
9. Perform a hands-on experiment on any advanced mining technique using an appropriate tool.
10. Solve a real-world problem using data mining techniques in the Python programming language.
Total

Experiment No - 1
Aim: Identify how data mining is an interdisciplinary field by an Application.

Data mining is an interdisciplinary field that involves computer science, statistics, mathematics, and domain-specific knowledge. One application that showcases the interdisciplinary nature of data mining is discussed below.

Date:

Competency and Practical Skills: Understanding and Analyzing

Relevant CO: CO3

Objectives: (a) To understand the application domain
(b) To understand preprocessing techniques.
(c) To understand the application's use of data mining functionalities.
Equipment/Instruments: Personal Computer

Theory:

System Name: Car price prediction systems

Car price prediction systems are used to predict the prices of Cars based on various factors such as
brand, specifications, features, and market trends. These systems are valuable for both consumers
and sellers, allowing them to make informed decisions about purchasing or selling Cars. The
process of creating a car price prediction system involves the following steps:

Dataset: A car price prediction system requires a dataset that contains information about cars and their attributes. Here are some examples of datasets:

 Car Prices: This is a dataset of car prices collected from various sources such as online marketplaces and retailers. It contains information about the brand, model, specifications, and price of cars.
 Car Specifications: This is a dataset of car specifications collected from various sources such as manufacturer websites and online retailers. It contains information about the company, segment, engine and other features of cars.

Preprocessing: It involves cleaning and transforming the data to make it suitable for analysis. Here
are some preprocessing techniques commonly used in cars price prediction systems:

 Data Cleaning: This involves removing missing or irrelevant data, correcting errors, and removing duplicates. For example, if a car has missing information such as its top speed, it may be removed from the dataset or the information may be imputed.
 Data Normalization: This involves scaling the data to a common range or standard deviation. For example, prices from different retailers may be normalized to a common currency or a common range of values.
 Data Transformation: This involves transforming the data into a format suitable for analysis. For example, car brands may be encoded as binary variables to enable analysis using machine learning algorithms.
 Feature Generation: This involves creating new features from the existing data that may be useful for analysis. For example, the age of a car may be calculated from its release date.
 Data Reduction: This involves reducing the dimensionality of the data to improve processing efficiency and reduce noise. For example, principal component analysis (PCA) may be used to identify the most important features in the dataset.

These preprocessing techniques help to ensure that the data is clean, normalized, and transformed in a way that enables accurate analysis and prediction of car prices. A few of these steps are sketched in code below.
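
As an illustration, a few of these steps can be sketched in Python with pandas and scikit-learn. This is a minimal sketch: the file name and column names (brand, engine_cc, year, price) are hypothetical, not taken from a real dataset.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load a hypothetical car dataset (assumed columns: brand, engine_cc, year, price)
df = pd.read_csv("cars.csv")

# Data cleaning: drop duplicates and rows with a missing price
df = df.drop_duplicates()
df = df.dropna(subset=["price"])

# Feature generation: derive the age of the car from its release year
df["age"] = 2023 - df["year"]

# Data transformation: encode the brand as binary (one-hot) variables
df = pd.get_dummies(df, columns=["brand"])

# Data normalization: scale numeric attributes to the [0, 1] range
scaler = MinMaxScaler()
df[["engine_cc", "age"]] = scaler.fit_transform(df[["engine_cc", "age"]])

print(df.head())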

Data Mining Techniques: Association rule mining, clustering, and classification are all data mining techniques that can be applied to car price prediction systems. Here is a brief overview of how each of these techniques can be used:

 Association Rule Mining: Association rule mining is a data mining technique used to find associations or relationships among variables in large datasets. In the context of car price prediction, association rule mining can be used to identify patterns and relationships between different features that might affect the price of a car. For example, the technique can be used to find out whether the brand, engine, features, or segment type are related to the car price. These associations can then be used to make predictions about the price of a car with similar features.

 Clustering: Clustering is a data mining technique used to group similar data points or objects together based on their similarities or differences. In the context of car price prediction, clustering can be used to group cars with similar features together, such as cars with a similar colour, engine, or segment type. Clustering can help in identifying the different price ranges for cars with similar features, which can be useful in predicting the price of a car based on its features.
 Classification: Classification is a data mining technique used to categorize data points or objects into pre-defined classes based on their characteristics or features. In the context of car price prediction, classification can be used to classify cars into different price ranges based on their features, such as brand, engine, and segment type. This technique can also be used to predict the price range of a car based on its features, which can be useful in making pricing decisions.

Observations:
In a Car price prediction system, data mining techniques are used to analyze large amounts of data
about cars and generate accurate price predictions. These systems can be used by consumers to
make informed decisions about purchasing cars, and by sellers to set prices that are competitive and
profitable.

Conclusion: Car price prediction systems provide a compelling example of how data mining can be
used to analyze and predict trends in the market. By using these systems, consumers and sellers can
make informed decisions that are based on accurate and up-to-date information.

Quiz:

(1) What different preprocessing techniques can be applied to a dataset?
(2) What is the use of data mining techniques in a particular system?

Suggested References:

1. Han, J., & Kamber, M. (2011). Data mining: concepts and techniques.
2. https://www.kaggle.com/code/rounakbanik/movie-recommender-systems

References used by the students:


1. Han, J., & Kamber, M. (2011). Data mining: concepts and techniques.
2. https://www.kaggle.com/code/rounakbanik/movie-recommender-systems
3. https://www.geeksforgeeks.org/data-preprocessing-and-its-types/

Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Completeness and Accuracy (2) | Team Work (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:


Experiment No - 2
Aim: Write programs to perform the following tasks of preprocessing (any language).
2.1 Noisy data handling
 Equal Width Binning
 Equal Frequency/Depth Binning
2.2 Normalization Techniques
 Min max normalization
 Z score normalization
 Decimal scaling
2.3. Implement data dispersion measure Five Number Summary generate box plot using
python libraries

Date:

Competency and Practical Skills: Programming and statistical methods

Relevant CO: CO1

Objectives: (a) To understand basic preprocessing techniques and statistical measures.
(b) To show how to implement preprocessing techniques.
(c) To show how to use different Python libraries to implement the techniques.
Equipment/Instruments: Personal Computer, open-source software for programming

Theory:

2.1 Noisy data handling


Equal Width Binning
Equal Frequency/Depth Binning

Noise: random error or variance in a measured variable


Incorrect attribute values may be due to

• Faulty Data Collection Instruments


• Data Entry Problems
• Data Transmission Problems
• Technology Limitation
• Inconsistency in Naming Convention

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing.

Example. Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:


Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by median:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

Smoothing by bin boundaries:


Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Equal Width Binning:

The bins have equal width; the bin boundaries are defined as [min + w], [min + 2w], …, [min + Nw], where w = (max − min) / N and N is the number of bins.

Example:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

w = (215 − 5) / 3 = 70

bin1: 5, 10, 11, 13, 15, 35, 50, 55, 72 (i.e. all values between 5 and 75)
bin2: 92 (i.e. all values between 75 and 145)
bin3: 204, 215 (i.e. all values between 145 and 215)

2.2 Normalization Techniques
Min-max normalization
Z-score normalization
Decimal scaling

Normalization techniques are used in data preprocessing to scale numerical data to a common range. Here are three commonly used normalization techniques:

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute a greater effect or “weight.” To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as [−1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics, the latter term also has other connotations.) Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks (such as backpropagation) or distance measurements such as nearest-neighbor classification and clustering. There are many methods for data normalization. We focus on min-max normalization, z-score normalization, and normalization by decimal scaling.

Min-Max Normalization: This technique scales the data to a range of 0 to 1. The formula for min-
max normalization is:
X_norm = (X - X_min) / (X_max - X_min)
where X is the original data, X_min is the minimum value in the dataset, and X_max is the
maximum value in the dataset.

Z-Score Normalization: This technique scales the data to have a mean of 0 and a standard
deviation of 1. The formula for z-score normalization is:
X_norm = (X - X_mean) / X_std
where X is the original data, X_mean is the mean of the dataset, and X_std is the standard deviation
of the dataset.

Decimal Scaling: This technique scales the data by moving the decimal point a certain number of
places to the left or right. The formula for decimal scaling is:
X_norm = X / 10^j
where X is the original data and j is the number of decimal places to shift.

2.3. Implement data dispersion measure Five Number Summary generate box plot using
python libraries

Five Number Summary


Descriptive Statistics involves understanding the distribution and nature of the data. Five
number summary is a part of descriptive statistics and consists of five values and all these
values will help us to describe the data.
The minimum value (the lowest value)
25th Percentile or Q1
50th Percentile or Q2 or Median
75th Percentile or Q3
Maximum Value (the highest value)

Let’s understand this with the help of an example. Suppose we have some data such as:
11,23,32,26,16,19,30,14,16,10

Here, in the above set of data points our Five Number Summary are as follows:
First of all, we will arrange the data points in ascending order and then calculate the
summary: 10,11,14,16,16,19,23,26,30,32

Minimum value: 10
25th percentile (Q1): 14
Calculation of the 25th percentile: (25/100)·(n+1) = (25/100)·(11) = 2.75, i.e. the 3rd value of the data
50th percentile (Q2): 17.5
Calculation of the 50th percentile: (16 + 19)/2 = 17.5
75th percentile (Q3): 26
Calculation of the 75th percentile: (75/100)·(n+1) = (75/100)·(11) = 8.25, i.e. the 8th value of the data
Maximum value: 32
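
As a quick cross-check, NumPy can compute these percentiles directly. Note that np.percentile uses linear interpolation by default, so its quartiles can differ slightly from the (n+1) rank method shown above:

import numpy as np

data = [10, 11, 14, 16, 16, 19, 23, 26, 30, 32]
# Minimum, Q1, median, Q3, maximum
print(np.percentile(data, [0, 25, 50, 75, 100]))
# expected: [10. 14.5 17.5 25.25 32.]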

Box plots

Boxplots are the graphical representation of the distribution of the data using Five Number
summary values. It is one of the most efficient ways to detect outliers in our dataset.

[Figure: Box plot using the Five Number Summary]

In statistics, an outlier is a data point that differs significantly from other observations. An
outlier may be due to variability in the measurement or it may indicate experimental error;
the latter are sometimes excluded from the dataset. An outlier can cause serious problems in
statistical analyses.

Program
Code:
# 2.1 Noisy Data Handling
# Generate a random sample of numbers
import random

data = random.sample(range(10, 100), 20)
data = sorted(data)
print("Random data sample: ", data)

# Number of bins the user wants
bins = int(input('Enter the number of bins: '))

# 1. Equal width binning
equal_width = []
min_val = min(data)
max_val = max(data)
diff_val = (max_val - min_val) // bins

def range_val(j, limit):
    # Collect consecutive sorted values up to the bin's upper limit
    d = []
    while j < len(data) and data[j] <= limit:
        d.append(data[j])
        j = j + 1
    return j, d

j = 0
for i in range(1, bins + 1):
    # The last bin extends to max_val so that the integer division in
    # diff_val does not drop the largest values
    limit = max_val if i == bins else min_val + (i * diff_val)
    j, val = range_val(j, limit)
    equal_width.append(val)

print("Equal Width : ")
print(equal_width)

# 2. Equal Frequency/Depth Binning
equal_freq = []
size_bin = len(data) // bins

for i in range(1, bins + 1):
    start = (i - 1) * size_bin
    # The last bin absorbs any leftover values when len(data) is not
    # divisible by the number of bins
    stop = len(data) if i == bins else start + size_bin
    equal_freq.append(data[start:stop])

print("Equal frequency : ")
print(equal_freq)

# Apply smoothing techniques: by bin means, by bin medians and by bin boundaries
from statistics import mean, median

def smooth_mean(data):
    # Replace every value in a bin by the mean of that bin
    smooth_data = []
    for i in range(bins):
        if not data[i]:  # skip empty bins (possible with equal width binning)
            smooth_data.append([])
            continue
        mean_data = mean(data[i])
        smooth_data.append([mean_data for j in range(len(data[i]))])
    return smooth_data

def smooth_median(data):
    # Replace every value in a bin by the median of that bin
    smooth_data = []
    for i in range(bins):
        if not data[i]:
            smooth_data.append([])
            continue
        median_data = median(data[i])
        smooth_data.append([median_data for j in range(len(data[i]))])
    return smooth_data

def smooth_bound(data):
    # Replace each interior value by the nearer bin boundary; the first
    # and last values of each bin remain the boundaries themselves
    smooth_data = []
    for i in range(bins):
        if not data[i]:
            smooth_data.append([])
            continue
        d = [data[i][0]]
        min_d = min(data[i])
        max_d = max(data[i])
        for j in range(1, len(data[i]) - 1):
            if (data[i][j] - min_d) <= (max_d - data[i][j]):
                d.append(min_d)
            else:
                d.append(max_d)
        if len(data[i]) > 1:
            d.append(data[i][-1])
        smooth_data.append(d)
    return smooth_data

print("Smooth mean for Equal frequency : ")
print(smooth_mean(equal_freq))
print("Smooth mean for Equal Width : ")
print(smooth_mean(equal_width))
print("Smooth median for Equal frequency : ")
print(smooth_median(equal_freq))
print("Smooth median for Equal Width : ")
print(smooth_median(equal_width))
print("Smooth bound for Equal frequency : ")
print(smooth_bound(equal_freq))
print("Smooth bound for Equal Width : ")
print(smooth_bound(equal_width))

2.2 Normalization Techniques

1) Min-max normalization
2) Z-score normalization
3) Decimal scaling

Code:
# 2.2 Normalization Techniques
# Min-max normalization
# Z-score normalization
# Decimal scaling
from statistics import mean, median, stdev
import random

data = random.sample(range(100, 1000), 20)
data = sorted(data)
print("Sample data: ")
print(data)

min_data = min(data)
max_data = max(data)

# 1. Min-max normalization: rescale values to [new_min, new_max]
def min_max(val, new_min, new_max):
    return round(((val - min_data) / (max_data - min_data)) * (new_max - new_min) + new_min, 2)

new_min = 0.0
new_max = 1.0

min_max_norm = [min_max(i, new_min, new_max) for i in data]
print('Min-max norm: ')
print(min_max_norm)

# 2. Z-score normalization: zero mean, unit standard deviation
def z_score(val, mean_val, std):
    return round((val - mean_val) / std, 2)

data_mean = mean(data)
data_std = stdev(data)

z_norm = [z_score(i, data_mean, data_std) for i in data]
print('Z norm: ')
print(z_norm)

# 3. Decimal scaling: divide by 10^j, where j is the number of digits
# in the largest absolute value
def dec_scale(n, j):
    return round(n / (10 ** j), 2)

abs_max_data = max(list(map(abs, data)))
len_abs_max_data = len(str(abs_max_data))

dec_norm = [dec_scale(i, len_abs_max_data) for i in data]
print('Decimal Scaling norm: ')
print(dec_norm)

# 2.3 Implement the data dispersion measure Five Number Summary and
# generate a box plot using Python libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Split the sorted data into lower and upper halves; for an odd count
# the middle value is excluded from both halves
half = len(data) // 2
lower = data[:half]
upper = data[half + 1:] if len(data) % 2 else data[half:]

Q1 = median(lower)
Q2 = median(data)
Q3 = median(upper)

print('Five Number Summary')
print('Minimum value: ', min_data)
print('Q1 (25%): ', Q1)
print('Q2 (50%)(Median):', Q2)
print('Q3 (75%): ', Q3)
print('Maximum value: ', max_data)

sns.boxplot(data)
plt.show()


Observations:

2.1 Noisy Data Handling


Output:

2.2 Normalization Techniques


Output:

2.3 Implement data dispersion measure Five Number Summary


Output:


Conclusion:

Binning, normalization techniques and the Five Number Summary are all important tools in data preprocessing that help prepare data for data mining tasks.

Quiz:

(1) What is the Five Number Summary? How do you generate a box plot using Python libraries?
(2) What are normalization techniques?
(3) What are the different smoothing techniques?

Suggested Reference:

J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

References used by the students:

https://www.geeksforgeeks.org/binning-in-data-mining/
https://www.geeksforgeeks.org/data-normalization-in-data-mining/
https://stackoverflow.com/questions/53388096/generate-box-plot-from-5-number-summary-min-max-quantiles

Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Logic Building (2) | Completeness and Accuracy (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:


Experiment No - 3
Aim: To perform hand on experiments of data preprocessing with sample data on Orange
tool.

Date:

Competency and Practical Skills: Exploration and Understanding of Tool

Relevant CO: CO1 & CO4

Objectives: 1) To improve students' understanding of data preprocessing techniques
2) To familiarize them with the tool
Equipment/Instruments: Orange tool

Demonstration of Tool:

Data Preprocessing with the Orange Tool

The Preprocess widget preprocesses data with selected methods.
Inputs – Data: input dataset.
Outputs – Preprocessor: preprocessing method; Preprocessed Data: data preprocessed with the selected methods.

Preprocessing is crucial for achieving better-quality analysis results. The Preprocess widget offers several preprocessing methods that can be combined in a single preprocessing pipeline. Some methods are available as separate widgets, which offer advanced techniques and greater parameter tuning.

[Figure: Handling Missing Values]

1. List of preprocessors. Double-click the preprocessors you wish to use and shuffle their order by dragging them up or down. You can also add preprocessors by dragging them from the left menu to the right.
2. Preprocessing pipeline.
3. When the box is ticked (Send Automatically), the widget communicates changes automatically. Alternatively, click Send.
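
For reference, the kind of missing-value handling the Preprocess widget performs can be sketched outside Orange with scikit-learn. This is a rough equivalent, not Orange's own API, and the data file name is hypothetical:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("sample.csv")  # hypothetical dataset containing missing values

# Impute missing numeric values with the column mean, similar in spirit
# to the widget's "Impute Missing Values" option
num_cols = df.select_dtypes(include="number").columns
imputer = SimpleImputer(strategy="mean")
df[num_cols] = imputer.fit_transform(df[num_cols])

print(df.head())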

⮚ Preprocessed Technique:

[Figure: Discretize continuous variables – “Most frequent” used as the base method]


⮚ Data Table of Preprocessed Data

Conclusion: Orange is a powerful open-source data analysis and visualization tool for machine
learning and data mining tasks. It provides a wide variety of functionalities including data
visualization, data preprocessing, feature selection, classification, regression, clustering, and more.
Its user-friendly interface and drag-and-drop workflow make it easy for non-experts to work with
and understand machine learning concepts.
Quiz:

(1) What is the purpose of Orange's Preprocess method?


(2) What is the use of orange tool?


Suggested Reference:
1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann
2. https://orangedatamining.com/docs/

References used by the students:

1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann


2. https://orangedatamining.com/docs/

Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Tool Usage/Demonstration (2) | Communication Skill (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:


Experiment No - 4

Aim: Implement Apriori algorithm of association rule data mining technique in any Programming
language.

Date:

Competency and Practical Skills: Logic building, Programming and Analyzing

Relevant CO: CO2

Objectives: To implement the basic logic of an association rule mining algorithm with support and confidence measures.
Equipment/Instruments: Personal Computer, open-source software for programming

Program:

Implement Apriori algorithm of association rule data mining technique in any Programming
language.

Code:
# Define the dataset
transactions = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I1", "I2", "I4"],
    ["I1", "I3"],
    ["I2", "I3"],
    ["I1", "I3"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"]
]

def apriori(transactions, min_support, min_confidence):
    # Get the unique items in the transactions
    unique_items = sorted(set(item for transaction in transactions for item in transaction))

    # Count support for the candidate 1-itemsets
    frequent_itemsets = {frozenset([item]): 0 for item in unique_items}
    for transaction in transactions:
        for item in transaction:
            frequent_itemsets[frozenset([item])] += 1

    # Remove items that don't meet the minimum support threshold
    frequent_itemsets = {itemset: count for itemset, count in frequent_itemsets.items()
                         if count >= min_support}

    # Candidates for the next level are the frequent 1-itemsets
    candidate_itemsets = set(frequent_itemsets.keys())

    # Create frequent itemsets of length 2 or greater
    k = 2
    while candidate_itemsets:
        # Generate candidate itemsets of length k by joining frequent (k-1)-itemsets
        candidate_itemsets = set(
            itemset1.union(itemset2)
            for itemset1 in candidate_itemsets
            for itemset2 in candidate_itemsets
            if len(itemset1.union(itemset2)) == k)

        # Calculate support for the candidate itemsets
        itemset_counts = {itemset: 0 for itemset in candidate_itemsets}
        for transaction in transactions:
            for itemset in itemset_counts:
                if itemset.issubset(transaction):
                    itemset_counts[itemset] += 1

        # Keep only the candidates meeting the minimum support (Apriori
        # pruning) and carry only these forward to generate the next level
        candidate_itemsets = {itemset for itemset, count in itemset_counts.items()
                              if count >= min_support}
        frequent_itemsets.update({itemset: itemset_counts[itemset]
                                  for itemset in candidate_itemsets})

        # Increment k
        k += 1

    # Generate association rules with single-item consequents
    rules = []
    for itemset, count in frequent_itemsets.items():
        if len(itemset) > 1:
            for item in itemset:
                left_side = itemset - frozenset([item])
                support_left = frequent_itemsets[left_side]
                confidence = count / support_left
                if confidence >= min_confidence:
                    rules.append((left_side, frozenset([item]), confidence))

    return frequent_itemsets, rules

frequent_itemsets, rules = apriori(transactions, 3, 0.6)
print("Frequent Items with Support:")
for itemset in frequent_itemsets:
    print("(", ", ".join(sorted(itemset)), "):", frequent_itemsets[itemset])
print("Rules with confidence:")
for left, right, conf in rules:
    print("(", ", ".join(sorted(left)), "=>", ", ".join(sorted(right)), "):", conf)

Observations:
Output:

Conclusion:
The Apriori algorithm is an effective and widely used approach for discovering frequent itemsets and association rules in large transaction datasets. It has been used in various applications such as market basket analysis, customer segmentation, and web usage mining.

Quiz:

(1) What do you mean by association rule mining?
(2) What different measures are used in the Apriori algorithm?

Suggested Reference:

 J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

References used by the students:

https://www.geeksforgeeks.org/apriori-algorithm/
Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Logic Building (2) | Completeness and Accuracy (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:


Experiment No - 5

Aim: Apply association rule data mining technique on sample data sets using Weka
Analysis Tool.

Date:

Competency and Practical Skills: Exploration and Understanding of Tool

Relevant CO: CO2 & CO4

Objectives: 1) To improve students' understanding of association rule mining techniques
2) To familiarize them with the tool
Equipment/Instruments: WEKA Tool

Demonstration of Tool:

1. Open the Weka Analysis Tool and load your dataset. For this example, we will use the
"supermarket.arff" dataset which contains information about customers' purchases at a
supermarket.


2. Preprocess the dataset by selecting the "Filter" tab and choosing the "Nominal to Binary"
filter. This will convert the nominal attributes in the dataset to binary ones, which is
necessary for association rule mining.

3. Select the "Associate" tab and choose the "Apriori" algorithm from the list of association
rule algorithms.

4. Set the minimum support and confidence values for the algorithm. For this example, we will
set the minimum support to 0.2 and the minimum confidence to 0.5.
5. Click on the "Start" button to run the algorithm. The results will be displayed in the output
window, showing the generated association rules based on the selected support and
confidence values.

6. Analyze the generated association rules to identify interesting patterns and insights. For
example, you may find that customers who buy bread are more likely to buy milk, or that
customers who buy vegetables are less likely to buy junk food.

7. You can further refine your analysis by adjusting the support and confidence values, or by
using other association rule algorithms such as FP-Growth or Eclat.
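
Outside Weka, the same Apriori analysis can be sketched in Python with the mlxtend library. This is a parallel illustration, not part of Weka, and the transactions below are made up:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical market-basket transactions
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "vegetables"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Same thresholds as in the Weka example: support 0.2, confidence 0.5
frequent = apriori(df, min_support=0.2, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])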

Observations: NA

Conclusion:

One of the key strengths of WEKA is its wide range of data mining techniques, including decision
trees, neural networks, clustering, and association rules, among others. These techniques are
accessible through an intuitive graphical user interface (GUI), which allows users to easily build
models and analyze data without needing advanced programming skills.

Another advantage of WEKA is its portability and interoperability: it is written in Java, runs on all major platforms, and can load data from common formats such as ARFF and CSV, so users can prepare data in familiar tools and then take advantage of WEKA's advanced analytics capabilities.

Quiz:

(1) What is WEKA tool?


(2) What is association analysis, and how can it be performed using WEKA?
(3) What is the difference between the support and confidence measures in association rule mining?

Suggested Reference:

1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann


2. https://www.solver.com/WEKA-data-mining

References used by the students:


https://www.softwaretestinghelp.com/weka-explorer-tutorial/
Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Tool Usage/Demonstration (2) | Communication Skill (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:


Experiment No - 6

Aim: Apply Classification data mining technique on sample data sets in WEKA.

Date:

Competency and Practical Skills: Exploration and Understanding of Tool

Relevant CO: CO4 & CO5

Objectives: 1) To improve students' understanding of classification techniques
2) To familiarize them with the tool
Equipment/Instruments: WEKA Tool

Demonstration of Tool:

WEKA:

WEKA is open-source software that provides tools for data preprocessing, implementations of several machine learning algorithms, and visualization tools, so that you can develop machine learning techniques and apply them to real-world data mining problems. What WEKA offers is summarized in the following diagram –


Now we will be performing data mining techniques on sample data set in arff extension available in
WEKA.
Now we will be completing the process in the following steps.
Step-1:
First open the WEKA application and open the “Explorer” tab in the menu bar.

Step-2:
Now we will load our sample data set, weather.nominal.arff, from the data directory under the weka folder on our system.

Step-3:
Now we can visualize our sample datasets available in WEKA.

Step-4:
Now we can use tools available in WEKA to partition our sample data into training data and testing
data and we can print our outcomes.
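
For comparison, the same load/partition/evaluate workflow can be sketched outside Weka with scikit-learn. The tiny weather-style dataset below is re-typed by hand for illustration only:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A small weather-style dataset, one-hot encoded so the tree can use it
X = pd.get_dummies(pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "windy": [False, True, False, False, True, True],
}))
y = ["no", "no", "yes", "yes", "no", "yes"]

# Partition into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))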


Conclusion:

Weka is a widely used and highly regarded data mining and machine learning tool that provides a
comprehensive suite of data preprocessing, classification, regression, clustering, and association
rule mining algorithms. It is an open-source software that is available for free and is written in Java,
making it platform-independent and easily accessible.

One of the key strengths of Weka is its extensive set of machine learning algorithms, which can be
easily applied to various types of data and problems. It offers a wide range of algorithms, including
decision trees, support vector machines, neural networks, random forests, and others, which are
supported by a comprehensive set of evaluation metrics and visualization tools.

Quiz:

1) What is classification and how can it be performed using WEKA tool?


2) What are the key evaluation metrics used to assess the accuracy of a model in WEKA?

Suggested Reference:

1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann


2. https://waikato.github.io/weka-wiki/documentation/
References used by the students:
https://www.softwaretestinghelp.com/weka-datasets/

Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Tool Usage (2) | Demonstration (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:

Experiment No - 7

Aim: 7.1 Implement a classification technique with quality measures in any programming language.
7.2 Implement a regression technique in any programming language.

Date:

Competency and Practical Skills: Logic building, Programming and Analyzing

Relevant CO: CO5


Objectives:
(a) To evaluate the quality of the classification model using accuracy and confusion
matrix.
(b) To evaluate a regression model.

Equipment/Instruments: open-source software for programming


Program:

Code (7.1) :

 Import needed Libraries:

 Now we read data:

 Describe our data set:

 Checking for null values in our Data set:

 Drop some columns and split labels and features

 Fill missing values in data set

 Convert categorical data to numerical values:

 Split dataset into training and testing data set:

 Scaling our data set:

 Training different data models:

Logistic Regression:

Random Forest:

Decision Tree:

Naïve Bayes:
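
Since the code above appears only as screenshots, here is a minimal self-contained sketch of the same idea: training a classifier and evaluating it with accuracy and a confusion matrix. The synthetic data stands in for the real dataset, whose columns and parameters may differ:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the dataset used above
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, pred))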

Code (7.2) :

 Import Libraries:

 Read data set:



 Checking null values in data set:

 Split data into labels and features:

 Split data set into training and testing data set:

 Scaling:


 Training different models:

Linear Regression:

Random Forest:

Ridge Regression:

Lasso Regression:

Decision Tree:
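
Likewise, a minimal sketch of fitting and evaluating a regression model on synthetic data (the actual notebook's dataset and metrics may differ):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the dataset used above
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("Mean squared error:", mean_squared_error(y_test, pred))
print("R^2 score:", r2_score(y_test, pred))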

Observations:

The classification models (logistic regression, random forest, decision tree and naïve Bayes) achieved accuracy scores ranging from 74% to 79% and mean absolute errors from 20% to 24%.

The regression models (linear, ridge, lasso, random forest and decision tree) achieved mean squared errors ranging from 8% to 24%.

Conclusion:

Classification models are used to classify data into different categories or classes based on certain
features or attributes. This can be useful in a variety of applications, such as image recognition,
spam filtering, or fraud detection. Commonly used classification models include decision trees,
logistic regression, and naive Bayes classifiers.

Regression models, on the other hand, are used to predict a numerical value based on input features.
For example, a regression model might be used to predict the price of a house based on its size,
location, and other features. Popular regression models include linear regression, polynomial
regression, and decision trees.

Quiz:

(1) What is the use of precision, recall, specificity, sensitivity, etc.?
(2) What are the different regression techniques?
(3) What are information gain, Gini index and gain ratio in the decision tree induction method?
Suggested Reference:

 J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann


References used by the students:
https://www.javatpoint.com/regression-vs-classification-in-machine-learning
Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Logic Building (2) | Completeness and Accuracy (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:

Experiment No - 8

Aim: Apply the K-means clustering algorithm in any programming language.

Date:

Competency and Practical Skills: Logic building, Programming and Analyzing

Relevant CO: CO2 & CO4

Objectives: To implement a clustering algorithm.
Equipment/Instruments: open-source software for programming

Program:
Code:
import matplotlib.pyplot as plt

data = [[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]]

# A simplified single-pass variant of k-means with k = 2: the first two
# points seed the centroids, each remaining point is assigned to the
# nearer centroid, and that centroid is moved to the midpoint between
# its old position and the new point.
k1 = [1]  # 1-based indices of points in cluster 1
k2 = [2]  # 1-based indices of points in cluster 2
c1 = data[0]
c2 = data[1]

for i in range(2, len(data)):
    # Euclidean distances from the point to the two centroids
    E1 = ((data[i][0] - c1[0]) ** 2 + (data[i][1] - c1[1]) ** 2) ** 0.5
    E2 = ((data[i][0] - c2[0]) ** 2 + (data[i][1] - c2[1]) ** 2) ** 0.5

    if E1 < E2:
        k1.append(i + 1)
        c1 = [(c1[0] + data[i][0]) / 2, (c1[1] + data[i][1]) / 2]
    else:
        k2.append(i + 1)
        c2 = [(c2[0] + data[i][0]) / 2, (c2[1] + data[i][1]) / 2]

print("Cluster 1:", k1)
print("Cluster 2:", k2)
plt.scatter([data[i - 1][0] for i in k1], [data[i - 1][1] for i in k1], marker="*", label='Cluster 1')
plt.scatter([data[i - 1][0] for i in k2], [data[i - 1][1] for i in k2], label='Cluster 2')
plt.legend()
plt.show()
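
Note that the code above makes a single pass over the data and moves each centroid to a midpoint, so it only approximates k-means. A minimal sketch of the standard iterative algorithm, in pure Python on the same data list (the iteration cap of 100 is an assumption):

def kmeans(points, k=2, max_iter=100):
    # Seed the centroids with the first k points
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = kmeans(data, k=2)
print("Clusters:", clusters)
print("Centroids:", centroids)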

Observations:
Output:

Conclusion:
One of the key advantages of k-means is its scalability, as it can efficiently handle large datasets
with high-dimensional features. However, it also has some limitations, such as its sensitivity to
initial centroid positions, and its tendency to converge to local optima.
Quiz:
(1) What are the different distance measures?
(2) What do you mean by centroid in K-means Algorithm?
Suggested Reference:
J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann
References used by the students:
https://www.youtube.com/watch?v=CLKW6uWJtTc&ab_channel=5MinutesEngineering
Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Logic Building (2) | Completeness and Accuracy (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:

Experiment No - 9

Aim: Perform a hands-on experiment on any advanced mining technique using an appropriate tool.

Date:
Competency and Practical Skills: Exploration and Understanding of Tool

Relevant CO: CO4

Objectives:

1) To improve students' understanding of advanced mining techniques like text mining, stream mining, and web content mining using an appropriate tool
2) To familiarize them with the tool
Equipment/Instruments: Octoparse

Demonstration of Tool:

Web Mining:
Web mining is the application of data mining techniques to extract knowledge from web data. This web data could be a number of things: web documents, hyperlinks between documents, and/or usage logs of websites, etc. Once you have the extracted information, you can analyze it to derive insights as per your requirement. For instance, you could align your marketing or sales strategy based on the results that your web mining throws up.
Since you have access to a lot of data, you have your finger on the market pulse. You can study customer behaviour patterns to know and understand what the customers want. With this sort of analysis of data, you can discover internal bottlenecks and troubleshoot. Overall, you can get ahead of everyone in how you anticipate industry trends and plan accordingly.
Web Mining Tools
A web mining tool is computer software that uses data mining techniques to identify or discover patterns in large data sets. There are various web mining tools available; here is a list of some of them.

 R :- R is a language or a free environment for statistical computing and graphics.


 Octoparse:- Octoparse is a simple but powerful web data mining tool that automates web
data extraction.
 Oracle Data Mining (ODM)
 Tableau.
 Scrapy.
 HITS algorithm.
 PageRank Algorithm.
I have decided to go with Octoparse as it is easy to use and does automated mining; a tiny hand-written example of what such tools automate is sketched below.
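
For a flavour of what such tools automate, here is a tiny hand-written scraper using the requests and BeautifulSoup libraries. It is illustrative only: the URL is a placeholder, and Octoparse itself needs no code. Real scraping should respect a site's robots.txt and terms of use.

import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration
resp = requests.get("https://example.com/products")
soup = BeautifulSoup(resp.text, "html.parser")

# Extract every hyperlink and its text from the page
for link in soup.find_all("a"):
    print(link.get("href"), "-", link.get_text(strip=True))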
Octoparse:
Octoparse is a modern visual web data extraction tool. Both experienced and inexperienced users find it easy to bulk-extract information from websites with it. For most scraping tasks, no coding is needed.
Octoparse supports Windows XP, 7, 8 and 10. It works well for both static and dynamic websites, including web pages using Ajax. To export the data, there are various data formats of your choice, like CSV, Excel, HTML, TXT, and databases (MySQL, SQL Server, and Oracle via API). Octoparse simulates human operation to interact with web pages.
Its remarkable features, such as filling out forms and entering a search term into a textbox, make extracting web data an easy process. You can run your extraction project either on your local machine (Local Extraction) or in the cloud (Cloud Extraction). Some clients use Octoparse's cloud service, which can extract and store large amounts of data to meet large-scale extraction needs.
The free and paid editions of Octoparse share some features in common. Paid editions allow users to extract enormous amounts of data on a 24/7 basis using Octoparse's cloud service. The prices of each plan can be viewed on the Octoparse website.
INSTALLATION :

(1) Download Octoparse setup and run ,then set the destination folder

(2) Complete the octoparse setup


(3) User dashboard of Octoparse.

As we can see, there are multiple options available on the left-hand side related to project creation and management. We just need to enter the URL of the site we want to scrape, and after the processing is done we will be able to get all the data that Octoparse has found.
(4) Here we need to go to the advanced section and insert the link of the site we want to scrape. Here I have selected the Amazon website to scrape.


(5) After we save this project, Octoparse loads the website on its own and then starts to auto-scrape it.

(6) After the auto-scraping is done, it generates a report/file of all the things it has found and represents them in a tabular format.

This is the list of all the tags, links and data that Octoparse has found. Here it has identified 20 items and 11 columns related to each item.

As we can see in the above figure, all the fields that Octoparse has identified are represented in red squares.
(7) Now we can generate a report of all the data that has been gathered and use it for our purpose.

Observations: NA

Conclusion:

Octoparse is a powerful web scraping tool that allows users to extract data from websites without
the need for coding skills. It offers a user-friendly interface and a range of features such as
scheduling, data export, and cloud extraction.

The tool is highly customizable, and users can easily create their own scraping workflows with the
built-in point-and-click editor. Octoparse also provides excellent customer support and a helpful
community forum where users can share their experiences and ask for assistance

Quiz:

1) What different data mining techniques are used in your tool?

Suggested Reference:

1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

References used by the students:


https://www.octoparse.com/

Rubric wise marks obtained:

Rubrics: Knowledge (2) | Problem Recognition (2) | Tool Usage/Demonstration (2) | Communication Skill (2) | Ethics (2) | Total
Each criterion is graded Good (2) or Average (1).

Marks:

Experiment No - 10

Aim: Solve Real world problem using Data Mining Techniques using Python Programming
Language.
Date:

Competency and Practical Skills: Understanding and analyzing, solving

Relevant CO: CO3

Objectives: (a) To understand real-world problems.


(b) To analyze which data mining technique can be used to solve your problem.

Equipment/Instruments: Personal Computer, open-source software for programming


Theory:
System Name: Car price prediction systems

Car price prediction systems are used to predict the prices of cars based on various factors such as
brand, specifications, features, and market trends. These systems are valuable for both consumers
and sellers, allowing them to make informed decisions about purchasing or selling cars. The process
of creating a car price prediction system involves the following steps:

Dataset: A car price prediction system requires a dataset that contains information about cars and
their attributes. Here are some examples of datasets:

 Car Prices: This is a dataset of car prices collected from various sources such as online
marketplaces and retailers. It contains information about the brand, model, specifications,
and price of cars.
 Car Specifications: This is a dataset of car specifications collected from various sources such
as manufacturer websites and online retailers. It contains information about the speed,
engine, segment, colour, and other features of cars.

Preprocessing: It involves cleaning and transforming the data to make it suitable for analysis. Here
are some preprocessing techniques commonly used in car price prediction systems:

 Data Cleaning: This involves removing missing or irrelevant data, correcting errors, and
removing duplicates. For example, if a car has missing information such as its engine
specification, it may be removed from the dataset or the missing value may be imputed.
 Data Normalization: This involves scaling the data to a common range or to unit variance.
For example, prices from different retailers may be normalized to a common currency or a
common range of values.
 Data Transformation: This involves transforming the data into a format suitable for analysis.
For example, car brands may be encoded as binary (one-hot) variables to enable analysis using
machine learning algorithms.
 Feature Generation: This involves creating new features from the existing data that may be
useful for analysis. For example, the age of the car may be calculated from its release date, or
brand and segment may be combined into a single categorical feature.
 Data Reduction: This involves reducing the dimensionality of the data to improve processing
efficiency and reduce noise. For example, principal component analysis (PCA) may be used
to identify the most important features in the dataset.

These preprocessing techniques help to ensure that the data is clean, normalized, and transformed in
a way that enables accurate analysis and prediction of car prices. A short sketch of these steps follows.
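
The following is a minimal sketch of these preprocessing steps, assuming a hypothetical
car_data.csv with brand, price, and release_year columns (all names are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

cars = pd.read_csv('car_data.csv')

# Data cleaning: drop duplicates and rows with a missing target
cars = cars.drop_duplicates().dropna(subset=['price'])

# Data normalization: scale price to the [0, 1] range
cars['price_scaled'] = MinMaxScaler().fit_transform(cars[['price']])

# Data transformation: one-hot encode the brand column
brand_dummies = pd.get_dummies(cars['brand'], prefix='brand')

# Feature generation: derive car age from the release year (assumed column)
cars['age'] = 2023 - cars['release_year']

# Data reduction: compress the one-hot brand columns with PCA
brand_components = PCA(n_components=2).fit_transform(brand_dummies)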

Data Mining Techniques: Association rule mining, clustering, and classification are all data
mining techniques that can be applied to car price prediction. Here is a brief overview
of how each of these techniques can be used:

 Association Rule Mining: Association rule mining is a data mining technique used to find
associations or relationships among variables in large datasets. In the context of car price
prediction, it can be used to identify patterns and relationships between different features
that might affect the price of a car. For example, the technique can be used to find out
whether the brand, segment, engine, body type, or colour is related to the car price. These
associations can then be used to make predictions about the price of a car with similar features.

 Clustering: Clustering is a data mining technique used to group similar data points or
objects together based on their similarities or differences. In the context of car price
prediction, clustering can be used to group cars with similar features, such as a similar
brand, engine, or segment. Clustering can help in identifying the different price ranges for
cars with similar features, which is useful in predicting the price of a car based on its
features (a short sketch follows this list).
 Classification: Classification is a data mining technique used to categorize data points or
objects into pre-defined classes based on their characteristics or features. In the context of
car price prediction, classification can be used to classify cars into different price ranges
based on their features, such as brand, engine, segment, and colour. This technique can also
be used to predict the price range of a car from its features, which can
be useful in making pricing decisions.
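
As a minimal sketch of the clustering idea, the snippet below groups cars by engine size and price
with K-means and inspects the price band of each cluster; the engine_cc and price column names
are assumptions for the sketch:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cars = pd.read_csv('car_data.csv')
features = cars[['engine_cc', 'price']].dropna()   # assumed numeric columns

# Standardize before a distance-based method such as K-means
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)

# Inspect the price range covered by each cluster
features['cluster'] = kmeans.labels_
print(features.groupby('cluster')['price'].agg(['min', 'mean', 'max']))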

Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder

# Step 1: Collect and Clean Data


car_data = pd.read_csv('car_data.csv')
car_data.drop_duplicates(inplace=True)
car_data.dropna(inplace=True)

# Step 2: Feature Engineering


X = car_data[['brand', 'engine', 'segment', 'fuel', 'body_type', 'colour']]  # assumed car attribute columns
y = car_data['price']

# Convert categorical variables to numerical using one-hot encoding


encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)

# Step 3: Split Data


X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Step 4: Model Selection and Training


rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Step 5: Model Evaluation


y_pred = rf.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean Squared Error:', mse)
print('Mean Absolute Error:', mae)
print('R-squared:', r2)

# Step 6: Model Deployment


import streamlit as st
import pickle
import numpy as np

# import the model


pipe = pickle.load(open('pipe.pkl','rb'))
df = pickle.load(open('df.pkl','rb'))

st.title("Car Predictor")

# brand
company = st.selectbox('Brand',df['Company'].unique())

# type of car (renamed to avoid shadowing the built-in type)
car_type = st.selectbox('Type',df['TypeName'].unique())

# engine
engine = st.selectbox('Engine',['4C','V4','V6','V8','V12'])

# weight
weight = st.number_input('Weight of the Car')

# Infotainment system
infotainment = st.selectbox('Infotainment',['No','Yes'])

# Luxury
luxury = st.selectbox('luxury',['No','Yes'])

# length size
length_size = st.number_input('Length Size')

# Fuel Type
fuel = st.selectbox('Fuel',['EV','Petrol','Diesel','CNG'])

# wheel type
wheel = st.selectbox('WHEEL',df['wheel_type'].unique())

gear = st.selectbox('Gear',['MT','AT'])

music = st.selectbox('Music',df['Music brand'].unique())

colour = st.selectbox('Colour',df['col'].unique())

if st.button('Predict Price'):
    # Encode the Yes/No inputs as binary flags
    infotainment = 1 if infotainment == 'Yes' else 0
    luxury = 1 if luxury == 'Yes' else 0

    # Assemble the query in the same column order the pipeline was trained on
    # (12 features, matching the inputs collected above)
    query = np.array([company, car_type, engine, weight, infotainment, luxury,
                      length_size, fuel, wheel, gear, music, colour])
    query = query.reshape(1, 12)

    # The pipeline was trained on log(price), so invert the transform with exp
    st.title("The predicted price of this configuration is " +
             str(int(np.exp(pipe.predict(query)[0]))))

Observations:

Conclusion:

In this project, we have analyzed a car dataset and performed various data cleaning and
preprocessing techniques. We have also extracted useful features from the existing ones, such as
the presence of infotainment and luxury features, colour, fuel type, engine, wheel type, gear, and
length. Finally, we have built a machine learning model using the Random Forest Regressor
algorithm to predict car prices based on these features.

The model has achieved an R-squared score of 0.89 on the test data, which indicates that it can
predict car prices with high accuracy. We have also visualized some of the features most strongly
related to car prices, such as company, car type, engine, segment, and body type, which can help
users make better decisions while buying a car. Overall, this project provides useful insights into
the car industry and shows how machine learning can be used to predict car prices.

Quiz:
1) What are other techniques that can be used to solve your system problem?

References used by the students:


https://www.youtube.com/watch?v=BgpM2IiCH6k&ab_channel=CampusX
Rubric wise marks obtained:

Rubrics: Knowledge (2) | Teamwork (2) | Logic Building (2) | Completeness and accuracy (2) | Ethics (2) | Total
Scale: Good (2) / Average (1) for each rubric

Marks:
