MACHINE LEARNING with Big Data
UNIT 3
What is Data Mining?
Data mining is also called knowledge discovery in databases (KDD)
Data mining is
the extraction of useful patterns from data sources, e.g.
databases, texts, the web, images
Patterns must be:
valid, novel, potentially useful, understandable
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
Watch out: not everything is “data mining”:
Simple search and query processing
(Deductive) expert systems
Why is Data Mining important?
Rapid computerization of businesses produces huge amounts of data
How do we make the best use of this data?
Data mining is used to discover patterns and relationships in the data in order to help make better business decisions
Data mining technology can generate new
business opportunities by:
Automated prediction of trends and behaviors
Automated discovery of previously unknown patterns
Why Data Mining?
The Explosive Growth of Data: from terabytes to zettabytes
Data and storage predictions for 2025 (in zettabytes):
The storage industry will ship 42 ZB of capacity over the next seven years.
90 ZB of data will be created on IoT devices by 2025.
By 2025, 49 percent of data will be stored in public cloud environments.
Nearly 30 percent of the data generated will be consumed in real time by 2025.
Why Data Mining?
The Explosive Growth of Data: from terabytes to zettabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
Example of Discovered Patterns
Association rules:
“80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together”
Cheese, Milk → Bread [support = 5%, confidence = 80%]
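To make support and confidence concrete, here is a minimal Python sketch (not from the slides; the toy transaction list is invented for illustration) that counts both quantities for a rule of this form:

```python
# Toy sketch: support and confidence for the rule {cheese, milk} -> {bread}.
# The transactions below are hypothetical illustration data.
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"cheese", "milk", "bread", "eggs"},
    {"bread"},
]

antecedent = {"cheese", "milk"}
consequent = {"bread"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)  # all items present
ante = sum(1 for t in transactions if antecedent <= t)               # antecedent present

support = both / n        # fraction of all transactions containing every item in the rule
confidence = both / ante  # of those with the antecedent, fraction also containing the consequent

print(f"sup = {support:.0%}, conf = {confidence:.0%}")  # sup = 40%, conf = 67%
```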
Data mining algorithms can be characterized as consisting of three parts:
Model: the structure to be fit to the data.
Preference: criteria for selecting one model over another.
Search: techniques for searching the data and the space of candidate models.
Goal of Data Mining
The goal of data mining is to provide efficient tools and techniques for KDD.
Knowledge Discovery (KDD) Process
This is a view from typical
database systems and data
warehousing communities
Data mining plays an essential
role in the knowledge discovery
process
Knowledge Discovery (KDD) Process
1. Developing an understanding of:
The application domain
The relevant prior knowledge
The goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
4. Data reduction and projection
Finding useful features to represent the data depending on the goal of the task.
Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
Knowledge Discovery (KDD) Process
5. Choosing the data mining task.
Deciding whether the goal of the KDD process is classification, regression,
clustering, etc.
6. Choosing the data mining algorithm(s)
Selecting method(s) to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
Searching for patterns of interest in a particular representational form or a set
of such representations as classification rules or trees, regression, clustering,
and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
Sequence of the steps
Data Cleaning: To remove noise and inconsistent data.
Data Integration: where multiple data sources may be combined.
Data Selection: Where data relevant to the analysis task are retrieved
from the database.
Data Transformation: where data are transformed and consolidated into
forms appropriate for mining, by performing summary or aggregation
operations.
Data Mining: An essential process where intelligent methods are
applied to extract data patterns.
Pattern Evaluation: To identify the truly interesting patterns representing
knowledge, based on interestingness measures.
Knowledge Presentation: where visualization and knowledge
representation techniques are used to present the mined knowledge to users.
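As an illustration of the cleaning, selection, and transformation steps, here is a minimal pandas sketch; the file name (sales.csv) and the column names (region, month, amount) are hypothetical:

```python
# Minimal sketch of the preprocessing steps above using pandas.
import pandas as pd

df = pd.read_csv("sales.csv")                      # hypothetical raw data

# Data cleaning: remove inconsistent records (here: missing values, duplicates).
df = df.dropna().drop_duplicates()

# Data selection: retrieve only the attributes relevant to the analysis task.
df = df[["region", "month", "amount"]]

# Data transformation: consolidate into a mining-ready form via aggregation.
monthly = df.groupby(["region", "month"], as_index=False)["amount"].sum()

print(monthly.head())   # ready for a mining step, e.g. clustering the regions
```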
Applications of Data Mining
Data mining applications are widely used in:
Direct marketing
Health industry
E-commerce
Customer relationship management (CRM)
Telecommunication industry and financial sector, etc.
Data mining is available in various forms:
Text mining
Web mining
Audio & video data mining
Pictorial data mining
Relational databases data mining
Social networks data mining
Machine Learning
Subset of AI (Artificial Intelligence)
Computer systems/machines analyze huge data sets and learn patterns which can help in
making future predictions about new data.
Definition
“Learning is any process by which a system improves performance from experience.”- Herbert Simon
“Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.” - Tom Mitchell (1998)
Machine Learning
In Traditional Programming:
▪ We provide input to the machine together with a program written in some programming language.
▪ The system generates the output based on the program.
In Machine Learning:
▪ Data is provided as input, and in some ML approaches the output values are also provided as input.
▪ The machine trains/learns from the inputs and outputs by identifying patterns and relationships, using statistical measures on the input values that lead to the output.
▪ This trained model/program is the output, which can then be applied to new data for predictions (see the sketch below).
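A minimal sketch of this paradigm, assuming scikit-learn is available: both the inputs and the known outputs go into training, and what comes out is the trained model, which is then applied to new data:

```python
# Data in, model out: the ML paradigm sketched above on a tiny toy dataset.
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0]]   # input values
y = [0, 0, 1, 1]                   # known output values supplied during training

model = LogisticRegression().fit(X, y)   # training: learn the input-output pattern

print(model.predict([[2.5]]))      # apply the trained model/program to new data
```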
Significance/Motivation for
Machine Learning
Humans can’t explain their expertise (speech recognition)
Human expertise does not exist (Mars navigation)
Models must be customized (personalized medicine)
Analyzing huge amounts of data with little or no human intervention
No explicit programming needed
Customer behavior analysis for business
Building marketing strategies.
Defence :
➢ Surveillance to detect/prevent any intrusion
➢ self-control, self-regulation, and self-actuation of combat systems through embedded AI
➢ Cybersecurity
➢ Effective logistics and transportation of goods, ammunitions, armaments, troops
➢ Detect any anomalies in the goods and armaments for timely repair or maintenance
➢ AI can mine soldiers’ medical records and assist in complex diagnosis.
Machine Learning
Types of Data from an ML perspective:
1. Numerical Data: quantitative data
➢ Can be discrete or continuous
2. Categorical Data: characteristics
➢ Can take numerical values for representation, for example 0 for No, 1 for Yes
3. Time-series Data: a sequence of numbers collected at regular intervals over some period of time
➢ Temporal values (date and time associated)
4. Text: words
➢ Text analysis
Types of Machine learning approaches
1. Supervised (inductive) learning
➢ Given: training data + desired outputs (labels)
➢ Uses labeled datasets to train algorithms to classify data or predict outcomes accurately when new, unseen data values are provided after training.
➢ Supervised learning helps organizations solve a variety of real-world problems at scale, such as classifying spam into a separate folder from your inbox.
➢ Aim: find a mapping function f(x) to map the input variable (x) to the output variable (y).
1. Supervised Learning
Input: a basket of fruits
Training:
If the color of the fruit is red and its shape has a depression at the top, then it is labelled as an “Apple”.
If the color of the fruit is green/yellow and its shape is like a curving cylinder, then it is labelled as a “Banana”.
Testing:
Suppose one separate fruit is now given to the trained model to identify whether it is an apple or a banana. The model will apply the knowledge gained during the training phase about the color and shape of the fruit and correctly classify it.
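The training rules above can be written out directly as a toy classifier. In real supervised learning the model would learn such rules from the labelled examples rather than having them hard-coded; this sketch only makes the training/testing phases concrete:

```python
# Toy encoding of the fruit-classification rules learned during training.
def classify_fruit(color: str, shape: str) -> str:
    if color == "red" and shape == "depression at top":
        return "Apple"
    if color in ("green", "yellow") and shape == "curving cylinder":
        return "Banana"
    return "unknown"

# Testing phase: a separate, unseen fruit is classified with the trained knowledge.
print(classify_fruit("red", "depression at top"))      # -> Apple
print(classify_fruit("yellow", "curving cylinder"))    # -> Banana
```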
Types of Tasks in Supervised
Learning
Supervised learning is classified into two categories of algorithms:
1. Classification: A classification problem is when the output variable is a category,
such as “red” or “blue”, or “disease” and “no disease”.
• It predicts the category the data belongs to.
2. Regression: A regression problem is when the output variable is a real value, such
as “dollars”, “weight”, or “house price”.
• It predicts a numerical value based on previously observed data.
Supervised Classification
A learning algorithm induces a model from a training set (induction); the model is then applied to a test set of records with unknown class labels (deduction).

Training set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test set (apply the learned model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
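One way to reproduce this induction/deduction loop in code, assuming scikit-learn and using a decision tree as the learning algorithm (the slide does not fix a particular one); the numeric encodings of the attributes are our own choice:

```python
# Induction: learn a model from the training set; deduction: apply it to the test set.
from sklearn.tree import DecisionTreeClassifier

# Encodings (our own): Attrib1 Yes/No -> 1/0; Attrib2 Small/Medium/Large -> 0/1/2;
# Attrib3 given in $K as a plain number.
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # induction

# Test set (Tid 11-15), class unknown:
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]
print(model.predict(X_test))   # deduction: predicted class labels
```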
Classification Process
▪ Model construction: describing a set of predetermined
classes
▪ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
▪ The set of tuples used for model construction is training set
▪ The model is represented as classification rules, decision trees, or
mathematical formulae
▪ Model usage: for classifying future or unknown objects
▪ Estimate accuracy of the model
▪ The known label of each test sample is compared with the classified result from the model
▪ Accuracy rate is the percentage of test set samples that are correctly
classified by the model
▪ Test set is independent of training set, otherwise over-fitting will occur
▪ If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Step 1: Model Construction
A classification algorithm is run over the training data to build a classifier (model). For example, the learned model might be the rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Step 2: Model Usage
The classifier is applied to unseen testing data. For example, for the unseen record (Jeff, Professor, 4) the model answers the query “Tenured?”.
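A minimal sketch of the two steps together, with the learned rule hard-coded as a function (step 1) and applied to the unseen record (step 2):

```python
# Step 1: the classifier learned from the training data, as an explicit rule.
def tenured(rank: str, years: int) -> str:
    if rank == "professor" or years > 6:
        return "yes"
    return "no"

# Step 2: model usage on the unseen record (Jeff, Professor, 4).
print(tenured("professor", 4))   # -> 'yes' (the rank clause of the rule fires)
```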
Supervised Machine Learning
Supervised Learning Algorithms
➢ Logistic Regression
➢ K-Nearest Neighbor
➢ Support Vector Machine
➢ Naïve Bayes
➢ Decision Tree
➢ Ensemble algorithms
Naïve Bayes
Classification algorithm.
Performs probabilistic prediction: computes class membership probabilities.
Based on Bayes’ theorem.
Assumes each feature in a class is unrelated to any other feature, even if they depend on each other; therefore it is called ‘Naïve’.
Useful, and scales well.
Principle: if it walks like a duck and quacks like a duck, then it is probably a duck.
Naïve Bayes Algorithm
Formula: Let E1, E2, …, En be n mutually exclusive and exhaustive events associated with a random experiment. If A is any event which occurs with E1 or E2 or … or En, then

P(Ei | A) = P(Ei) · P(A | Ei) / Σ i=1..n P(Ei) · P(A | Ei)

Applied to classification:

P(class | data) = P(data | class) · P(class) / P(data)

• P(class|data) is the posterior probability of the class (target) given the predictor (attribute): the probability that a data point belongs to a class, given the data point. This is the value we want to calculate.
• P(class) is the prior probability of the class.
• P(data|class) is the likelihood: the probability of the predictor given the class.
• P(data) is the prior probability of the predictor, or marginal likelihood.
Example
Consider the following computer purchase dataset. Identify the independent and dependent variables.

age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Step 1: Calculate P(class) = number of data points in the class / total number of observations.
Step 2: Calculate the likelihood P(data|class) = number of similar observations in the class / total number of points in the class.

Q. Which class does the following data instance belong to?
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Solution
P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (buys_computer = “yes”).
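The whole computation can be reproduced from the 14-row table with a short from-scratch sketch (standard-library Python only):

```python
# Naive Bayes by hand on the buys_computer dataset from the worked example.
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

X = ("<=30", "medium", "yes", "fair")             # the query instance
class_counts = Counter(row[-1] for row in data)   # yes: 9, no: 5

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    likelihood = 1.0                              # P(X | class), naive independence
    for i, value in enumerate(X):
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    score = likelihood * class_counts[c] / len(data)   # proportional to P(class | X)
    print(c, round(score, 3))                     # yes 0.028, no 0.007 -> predict 'yes'
```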
Types of Machine learning approaches
2. Unsupervised learning
➢ Given: training data without desired outputs (labels)
➢ Training a machine using information that is neither classified nor labeled, and allowing the algorithm to act on that information without guidance
➢ Groups unsorted information according to similarities, patterns, and differences without any prior training on the data
2. Unsupervised Learning Tasks
Unsupervised learning cannot be directly applied to a regression or classification problem.
The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
It is much like how a human learns to think from their own experiences, which makes it closer to real AI.
Types:
Clustering: grouping based on similarity
Association: finding relationships between variables
Unsupervised learning Algorithms
• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
2. Unsupervised Learning: k-Means Clustering
The k-means clustering algorithm was proposed by J. Hartigan and M. A. Wong [1979].
Given a set of n distinct objects, the k-means clustering algorithm partitions the objects into k clusters such that intracluster similarity is high but intercluster similarity is low.
In this algorithm, the user has to specify k, the number of clusters. The objects are assumed to be described by numeric attributes, so any one of the distance metrics can be used to demarcate the clusters.
Clustering
Clustering is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets (clusters), so that the
data in each subset (ideally) share some common trait - often
according to some defined distance measure.
Clustering
Organizing data into classes such that there is:
high intra-class similarity
low inter-class similarity
Finding the class labels and the number of classes directly from the data (in contrast to classification).
More informally, finding natural groupings among objects.
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups:
intra-cluster distances are minimized
inter-cluster distances are maximized
Clustering Example
What is a natural grouping among these objects? Clustering is subjective: the same people could be grouped as the Simpson's family vs. school employees, or as females vs. males.
What is Similarity?
“The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)
Similarity is hard to define, but “we know it when we see it”.
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Common Distance Measures
The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. Common measures include:
1. The Euclidean distance (also called the 2-norm distance), given by
d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )
2. The Manhattan distance (also called the taxicab or 1-norm distance), given by
d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
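Minimal Python sketches of both measures:

```python
# Euclidean (2-norm) and Manhattan (1-norm) distances between two points.
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((1, 1), (4, 3)))   # sqrt(13) ~ 3.606 (reused in the k-means example)
print(manhattan((1, 1), (4, 3)))   # 5
```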
Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continent faults
What Is Good Clustering?
A good clustering method will produce high-quality clusters with:
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Clustering: Hierarchical vs. Partitional
Partitional algorithms: construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion
Partitional Clustering
[Figure: a set of original points and one partitional clustering of them]
Hierarchical Clustering
[Figure: traditional hierarchical clustering of points p1-p4 with its dendrogram; non-traditional hierarchical clustering with its dendrogram]
K-Means Clustering
Simply speaking, k-means clustering is an algorithm to classify or group objects based on attributes/features into k groups, where k is a positive integer.
The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid.
Working of K-Means Clustering
Classical partitioning method: k-means
In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed.
This criterion tries to make the resulting k clusters as compact and as separate as possible.
Working of K-Means Clustering
• Begin with a decision on the value of k = number of clusters.
• Arbitrarily assign k objects from D as the initial cluster centers.
• Each object is assigned to the cluster whose center is nearest to it.
• Next, the cluster centers are updated, i.e. the mean value of each cluster is recalculated based on the current objects in the cluster.
• Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is nearest.
• This process iterates.
• Eventually, no redistribution of objects in any cluster occurs, and so the process terminates.
• The resulting clusters are returned by the clustering process.
Algorithm of K-Means Clustering
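The steps above can be condensed into a short Python sketch. This is a minimal illustration rather than a production implementation: points are tuples, the distance is Euclidean, and the initial centers are passed in explicitly, so a run with hand-picked seeds need not match the slides' randomly partitioned runs:

```python
# Minimal k-means sketch: assignment step, update step, repeat until stable.
from math import dist   # Euclidean distance (Python 3.8+)

def k_means(points, centers, max_iter=100):
    for _ in range(max_iter):
        # Assignment: each object joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update: each center becomes the mean of its current cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # no redistribution occurred: terminate
            break
        centers = new_centers
    return clusters, centers

# 1-D usage on the data of Example 1 below (seeds 2, 8, 15 chosen by hand,
# so the final clusters may differ from the slide's random-partition run).
data = [(x,) for x in (2, 3, 6, 8, 9, 12, 15, 18, 22)]
clusters, centers = k_means(data, [(2.0,), (8.0,), (15.0,)])
print(clusters)
```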
K-Means Clustering-Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22}. Assume k = 3.
Solution:
Randomly partition the given data set:
K1 = {2, 8, 15}, mean = 8.33
K2 = {3, 9, 18}, mean = 10
K3 = {6, 12, 22}, mean = 13.33
Reassign:
K1 = {2, 3, 6, 8, 9}, mean = 5.6
K2 = { } (empty), mean = 0
K3 = {12, 15, 18, 22}, mean = 16.75
K-Means Clustering-Example 1
Reassign:
K1 = {3, 6, 8, 9}, mean = 6.5
K2 = {2}, mean = 2
K3 = {12, 15, 18, 22}, mean = 16.75
Reassign:
K1 = {6, 8, 9}, mean = 7.67
K2 = {2, 3}, mean = 2.5
K3 = {12, 15, 18, 22}, mean = 16.75
Reassign:
K1 = {6, 8, 9}, mean = 7.67
K2 = {2, 3}, mean = 2.5
K3 = {12, 15, 18, 22}, mean = 16.75
No assignments change, so STOP.
Example
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}. Assume k = 2.
Solution
K1 = {2, 3, 4, 10, 11, 12}
K2 = {20, 25, 30}
K-Means Clustering-Example 2
A simple example showing the implementation of the k-means algorithm (using k = 2)
Step 1:
Initialization: we randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
K-Means Clustering-Example 2
Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}, and their new centroids are recomputed as the means of these clusters.
K-Means Clustering-Example 2
Step 3:
Now using these centroids we compute the Euclidean distance of each object, as shown in the table.
Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}.
The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
K-Means Clustering-Example 2
Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
Therefore, there is no change in the clusters.
Thus, the algorithm comes to a halt here, and the final result consists of the 2 clusters {1, 2} and {3, 4, 5, 6, 7}.
[Figure: plots of the cluster assignments after Step 1 and Step 2]
K-Means Clustering-Example 3
We have 4 medicines as our training data points, and each medicine has 2 attributes. Each attribute represents a coordinate of the object. We have to determine which medicines belong to cluster 1 and which belong to the other cluster.

Object | Attribute 1 (X): weight | Attribute 2 (Y): pH
Medicine A | 1 | 1
Medicine B | 2 | 1
Medicine C | 4 | 3
Medicine D | 5 | 4
K-Means Clustering-Example 3
Step 1:
Initial value of centroids: suppose we use medicine A and medicine B as the first centroids.
Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
K-Means Clustering-Example 3
Step 1 (continued):
Objects-centroids distance: we calculate the distance between each cluster centroid and each object.
Using Euclidean distance, we obtain the distance matrix at iteration 0.
■ Each column in the distance matrix symbolizes an object.
■ The first row of the distance matrix corresponds to the distance of each object to the first centroid, and the second row is the distance of each object to the second centroid.
■ For example, the distance from medicine C = (4, 3) to the first centroid c1 = (1, 1) is sqrt((4−1)² + (3−1)²) = sqrt(13) ≈ 3.61, and its distance to the second centroid c2 = (2, 1) is sqrt((4−2)² + (3−1)²) = sqrt(8) ≈ 2.83.
K-Means Clustering-Example 3
Step 2:
Objects clustering: we assign each object based on the minimum distance.
Medicine A is assigned to group 1; medicine B to group 2, medicine C to group 2, and medicine D to group 2.
An element of the group matrix is 1 if and only if the object is assigned to that group.
K-Means Clustering-Example 3
Iteration 1, objects-centroids distances: the next step is to compute the distance of all objects to the new centroids, c1 = (1, 1) and c2 = ((2+4+5)/3, (1+3+4)/3) ≈ (3.67, 2.67). Similar to step 2, we obtain the distance matrix at iteration 1.
K-Means Clustering-Example 3
Iteration 1, objects clustering:
Based on the new distance matrix, we move medicine B to group 1 while all the other objects remain. The group matrix is updated accordingly.
Iteration 2, determine centroids:
Now we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, so the new centroids are m1 = ((1+2)/2, (1+1)/2) = (1.5, 1) and m2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).
K-Means Clustering-Example 3
Iteration 2, objects-centroids distances:
Repeating step 2 again, we obtain the new distance matrix at iteration 2.
K-Means Clustering-Example 3
Iteration 2, objects clustering:
Again, we assign each object based on the minimum distance. Comparing the grouping of the last iteration and this iteration reveals that the objects no longer move between groups.
Thus, the computation of the k-means clustering has reached stability and no more iterations are needed.
K-Means Clustering-Example 3
We get the final grouping as the result:

Object | Feature 1 (X): weight | Feature 2 (Y): pH | Group (result)
Medicine A | 1 | 1 | 1
Medicine B | 2 | 1 | 1
Medicine C | 4 | 3 | 2
Medicine D | 5 | 4 | 2
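As a cross-check, the same final grouping can be obtained with scikit-learn's KMeans (assuming scikit-learn is installed), seeding the initial centers at medicines A and B as in step 1:

```python
# Example 3 reproduced with scikit-learn, seeded at medicines A and B.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # A, B, C, D
init = np.array([[1.0, 1.0], [2.0, 1.0]])                     # c1 = A, c2 = B

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)            # [0 0 1 1]: A, B in group 1; C, D in group 2
print(km.cluster_centers_)   # [[1.5 1. ] [4.5 3.5]]
```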
Advantages & Disadvantages of K-Means Clustering
Advantages:
K-means is relatively scalable and efficient in processing large data sets.
The computational complexity of the algorithm is O(nkt), where
n: the total number of objects
k: the number of clusters
t: the number of iterations
Normally, k << n and t << n.
Disadvantages:
Can be applied only when the mean of a cluster is defined.
Users need to specify k.
K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
It is sensitive to noise and outlier data points.
Weaknesses of K-Means Clustering
When the number of data points is small, the initial grouping will determine the clusters significantly.
The number of clusters, k, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
We never know the real clusters: using the same data, if it is input in a different order it may produce different clusters when the number of data points is small.
It is sensitive to the initial condition: different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum.
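A small demonstration of this sensitivity, assuming scikit-learn: several single-init runs with different random seeds can settle in different local optima on the same data, which shows up as different final inertia (sum of squared distances) values:

```python
# Different random initialisations may yield different k-means results.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose 2-D blobs centered around (0,0), (4,4), (8,8).
X = np.vstack([rng.normal(loc, 1.5, size=(20, 2)) for loc in (0, 4, 8)])

for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    # If the inertia differs between seeds, the runs found different local optima.
    print(seed, round(km.inertia_, 2))
```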
Supervised v/s Unsupervised Learning

Supervised Learning | Unsupervised Learning
Two tasks can be addressed: classification and regression | Two tasks can be addressed: clustering and association
Input data is provided to the model along with the output | Only input data is provided
Output is predicted by the model | Hidden patterns in the data can be found
Contains labelled data/classes | Unlabelled data/classes
Objective: training the model to predict the output when new data is provided | Objective: finding useful insights and hidden patterns from the unknown dataset
Examples: Naïve Bayes, Decision Tree | Examples: k-Means Clustering, Apriori algorithm
Applications: spam detection, handwriting recognition, speech recognition | Applications: customer segmentation, detecting fraudulent transactions, data preprocessing
References
Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition
https://builtin.com/data-science/supervised-machine-learning-classification
Anuradha Srinivasaraghavan and Vincy Joseph, Machine Learning, Wiley Publication
https://www.javatpoint.com/reinforcement-learning