
SAL College of Engineering

Department of
Information Technology

DATA WAREHOUSING & MINING


(3161610)

Laboratory Manual

Academic Year: 2026



INDEX

NO.  EXPERIMENT                                                              PAGE   DATE   SIGN

1    Identify how data mining is an interdisciplinary field by an application.

2    Write programs to perform the following tasks of preprocessing (any language).
     2.1 Noisy data handling
         ◻ Equal Width Binning
         ◻ Equal Frequency/Depth Binning
     2.2 Normalization techniques
         ◻ Min-max normalization
         ◻ Z-score normalization
         ◻ Decimal scaling
     2.3 Implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries.

3    Perform hands-on experiments of data preprocessing with sample data on the Orange tool.

4    Implement the Apriori algorithm of the association rule mining technique in any programming language.

5    Apply the association rule mining technique on sample data sets using WEKA.

6    Apply a classification data mining technique on sample data sets in WEKA or Python.

7    7.1 Implement a classification technique with quality measures in any programming language.
     7.2 Implement a regression technique in any programming language.

8    Apply the K-means clustering algorithm in any programming language.

9    Perform a hands-on experiment on any advanced mining technique using an appropriate tool.

10   Solve a real-world problem using data mining techniques in the Python programming language.



EXPERIMENT 1 DATE:
Aim: Identify how data mining is an interdisciplinary field by an Application.

Objectives: (a) To understand the application domain.
(b) To understand preprocessing techniques.
(c) To understand the application's use of data mining functionalities.
Theory:
System Name: Movie recommendation systems
Movie recommendation systems are a great example of how data mining is an interdisciplinary field
involving multiple expertise areas. The process of creating a movie recommendation system involves
the following steps:

Dataset: A movie recommendation system requires a dataset that contains information about movies and their attributes, as well as information about users and their preferences. Here are some examples of such datasets:

● MovieLens: This is a dataset of movie ratings collected by the GroupLens research group at
the University of Minnesota. It contains over 27,000 movies and 21 million ratings from around
250,000 users.
● Netflix Prize: This is a dataset of movie ratings collected by the online movie rental company
Netflix. It contains over 100 million ratings from around 480,000 users on 17,000 movies.
● IMDb: This is a dataset of movie metadata collected by the Internet Movie Database (IMDb).
It contains information about over 600,000 movies, including titles, cast and crew, plot
summaries, ratings, and reviews.

Preprocessing: It involves cleaning and transforming the data to make it suitable for analysis. Here
are some preprocessing techniques commonly used in movie recommendation systems:

● Data Cleaning: This involves removing missing or irrelevant data, correcting errors, and
removing duplicates. For example, if a movie has missing information such as the release date,
it may be removed from the dataset or the information may be imputed.
● Data Normalization: This involves scaling the data to a common range or standard deviation.
For example, ratings from different users may be normalized to a common scale of 1-5.
● Data Transformation: This involves transforming the data into a format suitable for analysis.
For example, movie genres may be encoded as binary variables to enable analysis using
machine learning algorithms.
● Feature Generation: This involves creating new features from the existing data that may be
useful for analysis. For example, the average rating for each movie may be calculated based on
user ratings, or the number of times a movie has been watched may be counted.



● Data Reduction: This involves reducing the dimensionality of the data to improve processing
efficiency and reduce noise. For example, principal component analysis (PCA) may be used to
identify the most important features in the dataset.

These preprocessing techniques help to ensure that the data is clean, normalized, and transformed in
a way that enables accurate analysis and prediction of user preferences for movie recommendations.
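As a small, hedged illustration of the cleaning, normalization, and transformation steps described above, the sketch below applies them to a hypothetical movie table in pandas; the column names and values are invented for demonstration and are not from any of the datasets listed earlier.

    # Hedged sketch: preprocessing a hypothetical movie table with pandas.
    import pandas as pd

    movies = pd.DataFrame({
        "title":  ["Movie A", "Movie B", "Movie B", "Movie C"],
        "genre":  ["Comedy", "Drama", "Drama", None],
        "rating": [8.2, 6.5, 6.5, 7.1],
    })

    movies = movies.drop_duplicates()                # data cleaning: remove duplicate rows
    movies = movies.dropna(subset=["genre"])         # data cleaning: drop rows with missing genre
    movies["rating_norm"] = (movies["rating"] - movies["rating"].min()) / (
        movies["rating"].max() - movies["rating"].min())   # min-max normalization of ratings
    genre_dummies = pd.get_dummies(movies["genre"])  # data transformation: one-hot encode genres
    print(pd.concat([movies, genre_dummies], axis=1))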
Data Mining Techniques: Association rule mining, clustering, and classification are all data mining
techniques that can be applied to movie recommendation systems. Here is a brief overview of how
each of these techniques can be used:

● Association Rule Mining: This technique can be used to identify patterns and relationships
between movies that can help with recommendations. For example, if users who watch
romantic comedies also tend to watch dramas, then a recommendation system can use this
association to suggest dramas to users who have watched romantic comedies. Association rule
mining can also be used to identify frequent item sets, which can help identify popular movies
or genres.
● Clustering: This technique can be used to group movies and users based on their
characteristics. For example, movies can be clustered based on genre, rating, release year, or
other attributes. Users can also be clustered based on their movie preferences or ratings.
Clustering can help with making personalized recommendations by identifying groups of
similar users or movies.
● Classification: This technique can be used to predict the preferences or ratings of users for a
particular movie. For example, a classification algorithm can be trained on past user ratings to
predict whether a new user will like a particular movie or not. Classification algorithms can
also be used to predict the genre or rating of a new movie based on its characteristics.

EXERCISES:
(1) What different preprocessing techniques can be applied to a dataset?
(2) What is the use of data mining techniques in a particular system?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date



EXPERIMENT 2 DATE:
Aim: Write programs to perform the following tasks of preprocessing (any language).

2.1 Noisy data handling
    ◻ Equal Width Binning
    ◻ Equal Frequency/Depth Binning
2.2 Normalization techniques
    ◻ Min-max normalization
    ◻ Z-score normalization
    ◻ Decimal scaling
2.3 Implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries.

Objectives: (a) To understand basic preprocessing techniques and statistical measures.
(b) To show how to implement preprocessing techniques.
(c) To show how to use different Python libraries to implement these techniques.

Theory:

2.1 Noisy data handling
    ◻ Equal Width Binning
    ◻ Equal Frequency/Depth Binning

Noise: random error or variance in a measured variable.

Incorrect attribute values may be due to:
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.

Example — sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:


Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by median:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

Smoothing by bin boundaries:


Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Equal Width Binning:
Bins have equal width; the bin boundaries are min + w, min + 2w, ..., min + Nw, where w = (max − min) / N and N is the number of bins.

Example: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

w = (215 − 5) / 3 = 70

Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72   (all values between 5 and 75)
Bin 2: 92                                  (all values between 75 and 145)
Bin 3: 204, 215                            (all values between 145 and 215)
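A minimal Python sketch of both binning methods, plus bin-means smoothing, follows. It assumes plain Python lists and reuses the price data from the equal-frequency example above; input validation is omitted.

    # Minimal sketch: equal-width / equal-frequency binning and bin-means smoothing.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

    def equal_width_bins(data, n_bins):
        """Split sorted data into bins of equal value range."""
        data = sorted(data)
        w = (data[-1] - data[0]) / n_bins
        bins = [[] for _ in range(n_bins)]
        for x in data:
            # Index of the bin whose range contains x; clamp the maximum into the last bin.
            i = min(int((x - data[0]) / w), n_bins - 1)
            bins[i].append(x)
        return bins

    def equal_frequency_bins(data, n_bins):
        """Split sorted data into bins holding (roughly) the same number of values."""
        data = sorted(data)
        size = len(data) // n_bins
        return [data[i * size:(i + 1) * size] for i in range(n_bins)]

    def smooth_by_means(bins):
        """Replace every value in a bin with the bin mean (smoothing by bin means)."""
        return [[round(sum(b) / len(b))] * len(b) for b in bins if b]

    print(equal_frequency_bins(prices, 3))                   # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(smooth_by_means(equal_frequency_bins(prices, 3)))  # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]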
2.2 Normalization Techniques
    ◻ Min-max normalization
    ◻ Z-score normalization
    ◻ Decimal scaling

Normalization techniques are used in data preprocessing to scale numerical data to a common range.
Here are three commonly used normalization techniques:

The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units leads to a larger range for that attribute, and thus tends to give such an attribute greater effect or "weight." To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as [−1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.) Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering; for example, when using the neural network backpropagation algorithm for classification, normalizing the input values helps speed up the learning phase. There are many methods for data normalization. We focus on min-max normalization, z-score normalization, and normalization by decimal scaling.

Min-Max Normalization: This technique scales the data to a range of 0 to 1. The formula for min-
max normalization is:
X_norm = (X - X_min) / (X_max - X_min)
where X is the original data, X_min is the minimum value in the dataset, and X_max is the maximum
value in the dataset.

Z-Score Normalization: This technique scales the data to have a mean of 0 and a standard deviation
of 1. The formula for z-score normalization is:
X_norm = (X - X_mean) / X_std
where X is the original data, X_mean is the mean of the dataset, and X_std is the standard deviation
of the dataset.

Decimal Scaling: This technique scales the data by moving the decimal point a certain number of places. The formula for decimal scaling is:
X_norm = X / 10^j
where X is the original data and j is the smallest integer such that max(|X_norm|) < 1.
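The three formulas above can be checked with a short, standard-library-only Python sketch; the data list is illustrative.

    # Minimal sketch of the three normalization formulas; data is illustrative.
    from statistics import mean, pstdev

    data = [200, 300, 400, 600, 1000]

    def min_max(x):
        lo, hi = min(x), max(x)
        return [(v - lo) / (hi - lo) for v in x]

    def z_score(x):
        m, s = mean(x), pstdev(x)   # population mean and standard deviation
        return [(v - m) / s for v in x]

    def decimal_scaling(x):
        # j is the smallest integer such that every scaled value has |v| < 1
        j = 0
        while max(abs(v) for v in x) / (10 ** j) >= 1:
            j += 1
        return [v / (10 ** j) for v in x]

    print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
    print(z_score(data))
    print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]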

2.3 Implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries

Five Number Summary

Descriptive statistics involves understanding the distribution and nature of the data. The five number summary is part of descriptive statistics and consists of five values that together describe the data:
The minimum value (the lowest value)
25th percentile, or Q1
50th percentile, or Q2 (the median)
75th percentile, or Q3
The maximum value (the highest value)

Let's understand this with the help of an example. Suppose we have the data points:
11, 23, 32, 26, 16, 19, 30, 14, 16, 10

First, we arrange the data points in ascending order and then calculate the summary:
10, 11, 14, 16, 16, 19, 23, 26, 30, 32

Minimum value: 10
25th percentile: 14
  Calculation: (25/100) × (n+1) = (25/100) × 11 = 2.75, i.e. the 3rd value of the data
50th percentile: 17.5
  Calculation: (16 + 19) / 2 = 17.5
75th percentile: 26
  Calculation: (75/100) × (n+1) = (75/100) × 11 = 8.25, i.e. the 8th value of the data
Maximum value: 32

Box plots
Boxplots are a graphical representation of the distribution of the data using the Five Number Summary values. They are one of the most efficient ways to detect outliers in a dataset.

[Fig: Box plot using the Five Number Summary]

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the dataset. Outliers can cause serious problems in statistical analyses.
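A minimal sketch using NumPy and Matplotlib on the example data above follows. Note that NumPy's default linear interpolation can give slightly different quartiles than the (n+1)-position hand method used in the calculation above.

    # Minimal sketch: five number summary and box plot for the example data.
    import numpy as np
    import matplotlib.pyplot as plt

    data = [11, 23, 32, 26, 16, 19, 30, 14, 16, 10]

    # NumPy's default linear interpolation may differ slightly from the
    # (n+1)-position method used in the hand calculation above.
    minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
    print(f"Min={minimum}, Q1={q1}, Median={median}, Q3={q3}, Max={maximum}")

    # The box spans Q1..Q3 with a line at the median; points beyond the
    # whiskers (1.5 * IQR by default) are drawn as outliers.
    plt.boxplot(data)
    plt.title("Box plot using the Five Number Summary")
    plt.show()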

EXERCISES:
(1) What is the Five Number Summary? How do you generate a box plot using Python libraries?
(2) What are normalization techniques?
(3) What are the different smoothing techniques?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date


EXPERIMENT 3 DATE:
Aim: To perform hands-on experiments of data preprocessing with sample data on the Orange tool.

Objectives: 1) To improve users' understanding of data preprocessing techniques.
2) To familiarize them with the tool.
Theory:

Demonstration of the Tool:

Data Preprocessing with the Orange Tool

The Preprocess widget preprocesses data with selected methods.
Inputs — Data: input dataset.
Outputs — Preprocessor: the preprocessing method; Preprocessed Data: data preprocessed with the selected methods.
Preprocessing is crucial for achieving better-quality analysis results. The Preprocess widget offers several preprocessing methods that can be combined in a single preprocessing pipeline. Some methods are also available as separate widgets, which offer advanced techniques and greater parameter tuning.

[Fig: Handling Missing Values]

1. List of preprocessors. Double click the preprocessors you wish to use and shuffle their order by
dragging them up or down. You can also add preprocessors by dragging them from the left menu to
the right.
2. Preprocessing pipeline.
3. When the box is ticked (Send Automatically), the widget will communicate changes
automatically. Alternatively, click Send.
Applied preprocessing technique:

[Fig: Continuize Discrete Variables — Most Frequent Is Base]



[Fig: Data table of the preprocessed data]

EXERCISES:

(1) What is the purpose of Orange's Preprocess widget?
(2) What is the use of the Orange tool?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date



EXPERIMENT 4 DATE:
Aim: Implement the Apriori algorithm of the association rule mining technique in any programming language.
Objectives: To implement the basic logic of an association rule mining algorithm with support and confidence measures.

Theory:
The Apriori algorithm is a classic and fundamental data mining algorithm used for discovering
association rules in transactional datasets.

● Apriori is designed for finding associations or relationships between items in a dataset. It's
commonly used in market basket analysis and recommendation systems.
● Apriori discovers frequent itemsets, which are sets of items that frequently co-occur in
transactions. A frequent itemset is a set of items that appears in a minimum number of
transactions, known as the "support threshold."
● Support and Confidence: Support measures how often an itemset appears in the dataset, while
confidence measures how often a rule is true. High-confidence rules derived from frequent
itemsets are of interest.

● Apriori uses an iterative approach to progressively discover frequent itemsets of increasing size. It starts by finding frequent 1-itemsets, then 2-itemsets, and so on.
● The algorithm employs pruning techniques to reduce the number of candidate itemsets that need to be checked, making it more efficient.
● Apriori is widely used in retail for market basket analysis. It helps retailers understand which
products are often purchased together, allowing for optimized store layouts, targeted marketing,
and product recommendations.
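A bare-bones Python sketch of Apriori follows. The transaction list and the support/confidence thresholds are illustrative assumptions; for simplicity, candidate generation relies on the support check alone, whereas a full Apriori would also prune candidates any of whose (k−1)-subsets are infrequent.

    # Minimal Apriori sketch: frequent itemsets, then rules above a confidence threshold.
    from itertools import combinations

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    ]
    MIN_SUPPORT = 0.4      # fraction of transactions containing the itemset
    MIN_CONFIDENCE = 0.7   # support(X U Y) / support(X)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets.
    items = sorted({i for t in transactions for i in t})
    frequent = {frozenset([i]) for i in items if support({i}) >= MIN_SUPPORT}
    all_frequent = set(frequent)
    while frequent:
        k = len(next(iter(frequent))) + 1
        # Candidate generation: unions of frequent itemsets that are one item larger.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c for c in candidates if support(c) >= MIN_SUPPORT}
        all_frequent |= frequent

    # Rule generation: split each frequent itemset into antecedent -> consequent.
    for itemset in all_frequent:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = support(itemset) / support(antecedent)
                if conf >= MIN_CONFIDENCE:
                    print(f"{set(antecedent)} -> {set(itemset - antecedent)} "
                          f"(support={support(itemset):.2f}, confidence={conf:.2f})")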

EXERCISES:
(1) What do you mean by association rule mining?
(2) What are the different measures used in the Apriori algorithm?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date



EXPERIMENT 5 DATE:
Aim: Apply association rule data mining technique on sample data sets using WEKA.

Objectives: 1) To improve users' understanding of association rule mining techniques.
2) To familiarize them with the tool.

EXERCISE:
(1) What is the WEKA tool?
(2) What is association analysis, and how can it be performed using the WEKA tool?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date



EXPERIMENT 6 DATE:
Aim: Apply Classification data mining technique on sample data sets in WEKA or Python.

Objectives: 1) To improve users' understanding of classification techniques.
2) To familiarize them with the tool.

EXERCISES:

1) What is classification and how can it be performed using WEKA or any Python tool?
2) What are the key evaluation metrics used to assess the accuracy of a model in WEKA?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date



EXPERIMENT 7 DATE:
Aim:
7.1 Implement a classification technique with quality measures in any programming language.
7.2 Implement a regression technique in any programming language.

Objectives:
(a) To evaluate the quality of a classification model using accuracy and a confusion matrix.
(b) To evaluate a regression model.

Theory:

Classification
The classification algorithm is a supervised learning technique that is used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.
Unlike regression, the output variable of classification is a category, not a value, such as "Green or Blue", "fruit or animal", etc. Since the classification algorithm is a supervised learning technique, it takes labeled input data, meaning each input comes with its corresponding output.
In a classification algorithm, a discrete output function y is mapped to the input variable x:
y = f(x), where y is the categorical output
The best example of an ML classification algorithm is an email spam detector.
The main goal of the classification algorithm is to identify the category of a given dataset, and these algorithms are mainly used to predict the output for categorical data.
Classification algorithms can be better understood using the diagram below. In the diagram, there are two classes, Class A and Class B. These classes have features that are similar to each other and dissimilar to the other class.

[Fig: Two classes, Class A and Class B]


The algorithm which implements the classification on a dataset is known as a classifier. There are two types of classification:
Binary Classifier: If the classification problem has only two possible outcomes, it is called a binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, it is called a multi-class classifier.
Examples: classification of types of crops, classification of types of music.
Learners in Classification Problems: In classification problems, there are two types of learners:
Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.
Examples: K-NN algorithm, case-based reasoning.
Eager Learners: Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction.
Examples: decision trees, Naïve Bayes, ANN.
Types of ML Classification Algorithms:
Classification algorithms can be divided into two main categories:
Linear Models
o Logistic Regression
o Support Vector Machines
Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
Evaluating a Classification model:
Once our model is complete, it is necessary to evaluate its performance, whether it is a classification or a regression model. A classification model can be evaluated in the following ways:
1. Log Loss or Cross-Entropy Loss:
It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
For a good binary classification model, the value of log loss should be near 0.
The value of log loss increases as the predicted probability deviates from the actual value.
A lower log loss represents a higher accuracy of the model.
2. Confusion Matrix:
The confusion matrix provides a matrix/table as output and describes the performance of the model. It is also known as the error matrix.
The matrix summarizes the prediction results, giving the total numbers of correct and incorrect predictions, as in the table below:

                     Actual Positive     Actual Negative
Predicted Positive   True Positive       False Positive
Predicted Negative   False Negative      True Negative

3. AUC-ROC curve:
ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.
It is a graph that shows the performance of the classification model at different thresholds; the AUC-ROC curve is also used to visualize the performance of multi-class classification models.
The ROC curve is plotted with the TPR (True Positive Rate) on the Y-axis against the FPR (False Positive Rate) on the X-axis.
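A short scikit-learn sketch of these quality measures is given below. The use of the built-in iris data (restricted to two classes) and a logistic regression classifier is an illustrative assumption, not part of the manual; any labeled dataset and classifier would do.

    # Minimal sketch: accuracy, confusion matrix, log loss and ROC AUC with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, log_loss, roc_auc_score

    X, y = load_iris(return_X_y=True)
    X, y = X[y < 2], y[y < 2]        # keep two classes for a binary task
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("Log loss:", log_loss(y_test, y_prob))
    print("ROC AUC:", roc_auc_score(y_test, y_prob))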
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
Email spam detection
Speech recognition
Identification of cancer tumor cells
Drug classification
Biometric identification, etc.

Regression
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes with respect to an independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the example below:
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales accordingly. The list below shows the advertisements made by the company in the last 5 years and the corresponding sales:

[Table: advertisement spend vs. corresponding sales for the last 5 years]

Now the company wants to run a $200 advertisement in the year 2019 and wants to know the prediction for that year's sales. To solve such prediction problems in machine learning, we need regression analysis. Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining causal relationships between variables.
In regression, we plot a graph between the variables which best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Some examples of regression can be as:
o Prediction of rain using temperature and other factors
o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:


o Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
o Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. An outlier may hamper the results, so it should be handled.
o Multicollinearity: If the independent variables are highly correlated with each other, this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. If our algorithm does not perform well even on the training dataset, the problem is called underfitting.

o Regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales prediction, marketing trends, etc.; for such cases we need a technique that can make predictions accurately.
o Regression estimates the relationship between the target and the independent variables.
o It is used to find trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.

Types of Regression
There are various types of regressions which are used in data science and machine learning. Each type
has its own importance on different scenarios, but at the core, all the regression methods analyze the
effect of the independent variable on dependent variables. Here we are discussing some important
types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
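A minimal linear regression sketch for the advertising example is given below; the spend/sales numbers are illustrative stand-ins for the missing table above, not data from the manual.

    # Minimal sketch: fit a linear regression and predict sales for a $200 advertisement.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    ad_spend = np.array([[90], [120], [150], [100], [130]])   # hypothetical yearly spend
    sales = np.array([1000, 1300, 1800, 1200, 1380])          # hypothetical yearly sales

    model = LinearRegression().fit(ad_spend, sales)
    pred = model.predict(ad_spend)

    print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
    print("MSE:", mean_squared_error(sales, pred), "R^2:", r2_score(sales, pred))
    print("Predicted sales for a $200 advertisement:", model.predict([[200]])[0])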



EXERCISES:

(1) What is the use of precision, recall, specificity, sensitivity, etc.?
(2) What are the different regression techniques?
(3) What are information gain, Gini index, and gain ratio in the decision tree induction method?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date



EXPERIMENT 8 DATE:
Aim: Apply the K-means clustering algorithm in any programming language.

Objectives: To implement a clustering algorithm.

Theory:
K means Clustering
Unsupervised learning is the process of teaching a computer to use unlabeled, unclassified data, enabling the algorithm to operate on that data without supervision. Without any previous training on the data, the machine's job in this case is to organize the unsorted data according to similarities, patterns, and differences.
The goal of clustering is to divide the population or set of data points into a number of groups so that
the data points within each group are more comparable to one another and different from the data
points within the other groups. It is essentially a grouping of things based on how similar and different
they are to one another.
We are given a data set of items, with certain features, and values for these features (like a vector).
The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm, an unsupervised learning algorithm. 'K' in the name of the algorithm represents the number of groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space). The algorithm will categorize
the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean
distance as a measurement.
The algorithm works as follows:
● First, we randomly initialize k points, called means or cluster centroids.
● We categorize each item to its closest mean and we update the mean’s coordinates, which are
the averages of the items categorized in that cluster so far.
● We repeat the process for a given number of iterations and at the end, we have our clusters.
● The “points” mentioned above are called means because they are the mean values of the items
categorized in them. To initialize these means, we have a lot of options. An intuitive method is
to initialize the means at random items in the data set.
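A compact NumPy sketch of these steps follows. The two-blob data, k = 2, and the fixed iteration count are illustrative assumptions; production code would also check for convergence and handle clusters that become empty.

    # Minimal K-means sketch with NumPy on illustrative 2-D points.
    import numpy as np

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    k, n_iters = 2, 10

    # 1. Randomly initialize k means (centroids) at existing points.
    centroids = points[rng.choice(len(points), k, replace=False)]

    for _ in range(n_iters):
        # 2. Assign each point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update each centroid to the mean of the points assigned to it
        #    (assumes no cluster becomes empty).
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

    print("Final centroids:\n", centroids)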
EXERCISES:
(1) What are the different distance measures?
(2) What do you mean by centroid in K-means Algorithm?

EVALUATION:

Involvement(4) Understanding/Proble Timely Completion(3) Total(10)


m Solving(3)

Signature with date


EXPERIMENT 9 DATE:
Aim: Perform a hands-on experiment on any advanced mining technique using an appropriate tool.

Objectives:

1) To improve users' understanding of advanced mining techniques, such as text mining, stream mining, and web content mining, using an appropriate tool.
2) To familiarize them with the tool.

EXERCISES:
1. What different data mining techniques are used in your tool?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date



EXPERIMENT 10 DATE:
Aim: Solve a real-world problem using data mining techniques in the Python programming language.

Objectives: (a) To understand real-world problems.
(b) To analyze which data mining technique can be used to solve your problem.

EXERCISES:
1) What other techniques can be used to solve your system's problem?

EVALUATION:

Involvement (4) | Understanding/Problem Solving (3) | Timely Completion (3) | Total (10)

Signature with date

