MACHINE LEARNING with Big Data
UNIT 3
What is Data Mining?
Data mining is also called knowledge discovery in databases (KDD)
Data mining is
the extraction of useful patterns from data sources, e.g.
databases, texts, the web, images
Patterns must be:
valid, novel, potentially useful, understandable
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
Watch out: not everything is “data mining”:
Simple search and query processing
(Deductive) expert systems
Why is Data Mining important?
Rapid computerization of businesses produces huge amounts of data
How do we make the best use of this data?
Data mining is used to discover patterns and relationships in the data in order to help make better business decisions
Data mining technology can generate new
business opportunities by:
Automated prediction of trends and behaviors
Automated discovery of previously unknown patterns
Why Data Mining?
The Explosive Growth of Data: from terabytes to zettabytes
Data and storage predictions for 2025 (in zettabytes):
The storage industry will ship 42 ZB of capacity over the next seven years.
90 ZB of data will be created on IoT devices by 2025.
By 2025, 49 percent of data will be stored in public cloud environments.
Nearly 30 percent of the data generated will be consumed in real time by 2025.
Why Data Mining?
The Explosive Growth of Data: from terabytes to zettabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
Example of Discovered Patterns
Association rules:
“80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together”
Cheese, Milk → Bread [support = 5%, confidence = 80%]
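To make support and confidence concrete, here is a minimal Python sketch (not from the slides; the toy transaction list is invented for illustration) that counts both quantities for a rule of this form:

```python
# Toy sketch: support and confidence for the rule {cheese, milk} -> {bread}.
# The transactions below are hypothetical illustration data.
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"cheese", "milk", "bread", "eggs"},
    {"bread"},
]

antecedent = {"cheese", "milk"}
consequent = {"bread"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)  # all items present
ante = sum(1 for t in transactions if antecedent <= t)               # antecedent present

support = both / n        # fraction of all transactions containing every item in the rule
confidence = both / ante  # of those with the antecedent, fraction also containing the consequent

print(f"sup = {support:.0%}, conf = {confidence:.0%}")  # sup = 40%, conf = 67%
```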
Data mining algorithms can be characterized as consisting of three parts:
Model: the structure to be fit to the data.
Preference: criteria for selecting one model over another.
Search: techniques for searching the data and the space of candidate models.
Goal of Data Mining
The goal of data mining is to provide efficient tools and techniques for KDD.
Knowledge Discovery (KDD) Process
This is a view from typical
database systems and data
warehousing communities
Data mining plays an essential
role in the knowledge discovery
process
Knowledge Discovery (KDD) Process
1. Developing an understanding of:
The application domain
The relevant prior knowledge
The goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
4. Data reduction and projection
Finding useful features to represent the data depending on the goal of the task.
Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
Knowledge Discovery (KDD) Process
5. Choosing the data mining task.
Deciding whether the goal of the KDD process is classification, regression,
clustering, etc.
6. Choosing the data mining algorithm(s)
Selecting method(s) to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
Searching for patterns of interest in a particular representational form or a set
of such representations as classification rules or trees, regression, clustering,
and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
Sequence of the steps
Data Cleaning: To remove noise and inconsistent data.
Data Integration: where multiple data sources may be combined.
Data Selection: Where data relevant to the analysis task are retrieved
from the database.
Data Transformation: where data are transformed and consolidated into
forms appropriate for mining, by performing summary or aggregation
operations.
Data Mining: An essential process where intelligent methods are
applied to extract data patterns.
Pattern Evaluation: To identify the truly interesting patterns representing
knowledge, based on interestingness measures.
Knowledge Presentation: where visualization and knowledge
representation techniques are used to present the mined knowledge to users.
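As an illustration of the cleaning, selection, and transformation steps, here is a minimal pandas sketch; the file name (sales.csv) and the column names (region, month, amount) are hypothetical:

```python
# Minimal sketch of the preprocessing steps above using pandas.
import pandas as pd

df = pd.read_csv("sales.csv")                      # hypothetical raw data

# Data cleaning: remove inconsistent records (here: missing values, duplicates).
df = df.dropna().drop_duplicates()

# Data selection: retrieve only the attributes relevant to the analysis task.
df = df[["region", "month", "amount"]]

# Data transformation: consolidate into a mining-ready form via aggregation.
monthly = df.groupby(["region", "month"], as_index=False)["amount"].sum()

print(monthly.head())   # ready for a mining step, e.g. clustering the regions
```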
Applications of Data Mining
Data mining applications are widely used in:
Direct marketing
Health industry
E-commerce
Customer relationship management (CRM)
Telecommunication industry and financial sector, etc.
Data mining is available in various forms:
Text mining
Web mining
Audio & video data mining
Pictorial data mining
Relational databases data mining
Social networks data mining
Machine Learning
Subset of AI (Artificial Intelligence)
Computer systems/machines analyze huge data sets and learn patterns which can help in
making future predictions about new data.
Definition
“Learning is any process by which a system improves performance from experience.”- Herbert Simon
“Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.” - Tom Mitchell (1998)
Machine Learning
In Traditional Programming:
▪ We provide input to the machine together with a program written in some programming language.
▪ The system generates the output based on the program.
In Machine Learning:
▪ Data is provided as input, and in some ML approaches the output values are also provided as input.
▪ The machine trains/learns from the inputs and outputs by identifying patterns and relationships, using statistical measures on the input values that lead to the output.
▪ This trained model/program is the output, which can then be applied to new data for predictions (see the sketch below).
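A minimal sketch of this paradigm, assuming scikit-learn is available: both the inputs and the known outputs go into training, and what comes out is the trained model, which is then applied to new data:

```python
# Data in, model out: the ML paradigm sketched above on a tiny toy dataset.
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0]]   # input values
y = [0, 0, 1, 1]                   # known output values supplied during training

model = LogisticRegression().fit(X, y)   # training: learn the input-output pattern

print(model.predict([[2.5]]))      # apply the trained model/program to new data
```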
Significance/Motivation for
Machine Learning
Humans can’t explain their expertise (speech recognition)
Human expertise does not exist (Mars navigation)
Models must be customized (personalized medicine)
Analyzing huge amounts of data with little or no human intervention
No explicit programming needed
Customer behavior analysis for business
Building marketing strategies.
Defence :
➢ Surveillance to detect/prevent any intrusion
➢ self-control, self-regulation, and self-actuation of combat systems through embedded AI
➢ Cybersecurity
➢ Effective logistics and transportation of goods, ammunitions, armaments, troops
➢ Detect any anomalies in the goods and armaments for timely repair or maintenance
➢ AI can mine soldiers’ medical records and assist in complex diagnosis.
Machine Learning
Types of Data from an ML perspective:
1. Numerical Data: quantitative data
➢ Can be discrete or continuous
2. Categorical Data: characteristics
➢ Can take numerical values for representation, for example 0 for No, 1 for Yes
3. Time-series Data: a sequence of numbers collected at regular intervals over some period of time
➢ Temporal values (date and time associated)
4. Text: words
➢ Text analysis
Types of Machine learning approaches
1. Supervised (inductive) learning
➢ Given: training data + desired outputs (labels)
➢ Uses labeled datasets to train algorithms to classify data or predict outcomes accurately when new, unseen data values are provided after training.
➢ Supervised learning helps organizations solve a variety of real-world problems at scale, such as classifying spam into a separate folder from your inbox.
➢ Aim: find a mapping function f(x) to map the input variable (x) to the output variable (y).
1. Supervised Learning
Input: a basket of fruits
Training:
If the color of the fruit is red and its shape has a depression at the top, then it is labelled as an “Apple”.
If the color of the fruit is green/yellow and its shape is like a curving cylinder, then it is labelled as a “Banana”.
Testing:
Suppose one separate fruit is now given to the trained model to identify whether it is an apple or a banana. The model will apply the knowledge gained during the training phase about the color and shape of the fruit and correctly classify it.
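The training rules above can be written out directly as a toy classifier. In real supervised learning the model would learn such rules from the labelled examples rather than having them hard-coded; this sketch only makes the training/testing phases concrete:

```python
# Toy encoding of the fruit-classification rules learned during training.
def classify_fruit(color: str, shape: str) -> str:
    if color == "red" and shape == "depression at top":
        return "Apple"
    if color in ("green", "yellow") and shape == "curving cylinder":
        return "Banana"
    return "unknown"

# Testing phase: a separate, unseen fruit is classified with the trained knowledge.
print(classify_fruit("red", "depression at top"))      # -> Apple
print(classify_fruit("yellow", "curving cylinder"))    # -> Banana
```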
Types of Tasks in Supervised
Learning
Supervised learning is classified into two categories of algorithms:
1. Classification: A classification problem is when the output variable is a category,
such as “red” or “blue”, or “disease” and “no disease”.
• It predicts the category the data belongs to.
2. Regression: A regression problem is when the output variable is a real value, such
as “dollars”, “weight”, or “house price”.
• It predicts a numerical value based on previously observed data.
Supervised Classification
A learning algorithm induces a model from a training set (induction); the model is then applied to a test set of records with unknown class labels (deduction).

Training set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test set (apply the learned model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
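One way to reproduce this induction/deduction loop in code, assuming scikit-learn and using a decision tree as the learning algorithm (the slide does not fix a particular one); the numeric encodings of the attributes are our own choice:

```python
# Induction: learn a model from the training set; deduction: apply it to the test set.
from sklearn.tree import DecisionTreeClassifier

# Encodings (our own): Attrib1 Yes/No -> 1/0; Attrib2 Small/Medium/Large -> 0/1/2;
# Attrib3 given in $K as a plain number.
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # induction

# Test set (Tid 11-15), class unknown:
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]
print(model.predict(X_test))   # deduction: predicted class labels
```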
Classification Process
▪ Model construction: describing a set of predetermined
classes
▪ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
▪ The set of tuples used for model construction is training set
▪ The model is represented as classification rules, decision trees, or
mathematical formulae
▪ Model usage: for classifying future or unknown objects
▪ Estimate accuracy of the model
▪ The known label of each test sample is compared with the classified result from the model
▪ Accuracy rate is the percentage of test set samples that are correctly
classified by the model
▪ Test set is independent of training set, otherwise over-fitting will occur
▪ If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Step 1: Model Construction
A classification algorithm is run over the training data to build a classifier (model). For example, the learned model might be the rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Step 2: Model Usage
The classifier is applied to unseen testing data. For example, for the unseen record (Jeff, Professor, 4) the model answers the query “Tenured?”.
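A minimal sketch of the two steps together, with the learned rule hard-coded as a function (step 1) and applied to the unseen record (step 2):

```python
# Step 1: the classifier learned from the training data, as an explicit rule.
def tenured(rank: str, years: int) -> str:
    if rank == "professor" or years > 6:
        return "yes"
    return "no"

# Step 2: model usage on the unseen record (Jeff, Professor, 4).
print(tenured("professor", 4))   # -> 'yes' (the rank clause of the rule fires)
```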
Supervised Machine Learning
Supervised Learning Algorithms
➢ Logistic Regression
➢ K-Nearest Neighbor
➢ Support Vector Machine
➢ Naïve Bayes
➢ Decision Tree
➢ Ensemble algorithms
Naïve Bayes
Classification algorithm.
Performs probabilistic prediction: computes class membership probabilities.
Based on Bayes’ theorem.
Assumes each feature in a class is unrelated to any other feature, even if they depend on each other; therefore it is called ‘Naïve’.
Useful, and scales well.
Principle: if it walks like a duck and quacks like a duck, then it is probably a duck.
Naïve Bayes Algorithm
Formula: Let E1, E2, …, En be n mutually exclusive and exhaustive events associated with a random experiment. If A is any event which occurs with E1 or E2 or … or En, then

P(Ei | A) = P(Ei) · P(A | Ei) / Σ i=1..n P(Ei) · P(A | Ei)

Applied to classification:

P(class | data) = P(data | class) · P(class) / P(data)

• P(class|data) is the posterior probability of the class (target) given the predictor (attribute): the probability that a data point belongs to a class, given the data point. This is the value we want to calculate.
• P(class) is the prior probability of the class.
• P(data|class) is the likelihood: the probability of the predictor given the class.
• P(data) is the prior probability of the predictor, or marginal likelihood.
Example
Consider the following computer purchase dataset. Identify the independent and dependent variables.

age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Step 1: Calculate P(class) = number of data points in the class / total number of observations.
Step 2: Calculate the likelihood P(data|class) = number of similar observations in the class / total number of points in the class.

Q. Which class does the following data instance belong to?
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Solution
P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (buys_computer = “yes”).
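The whole computation can be reproduced from the 14-row table with a short from-scratch sketch (standard-library Python only):

```python
# Naive Bayes by hand on the buys_computer dataset from the worked example.
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

X = ("<=30", "medium", "yes", "fair")             # the query instance
class_counts = Counter(row[-1] for row in data)   # yes: 9, no: 5

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    likelihood = 1.0                              # P(X | class), naive independence
    for i, value in enumerate(X):
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    score = likelihood * class_counts[c] / len(data)   # proportional to P(class | X)
    print(c, round(score, 3))                     # yes 0.028, no 0.007 -> predict 'yes'
```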
Types of Machine learning approaches
2. Unsupervised learning
➢ Given: training data without desired outputs (labels)
➢ Training a machine using information that is neither classified nor labeled, and allowing the algorithm to act on that information without guidance
➢ Groups unsorted information according to similarities, patterns, and differences without any prior training on the data
2. Unsupervised Learning Tasks
Unsupervised learning cannot be directly applied to a regression or classification problem.
The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
It is much like how a human learns to think from their own experiences, which makes it closer to real AI.
Types:
Clustering: grouping based on similarity
Association: finding relationships between variables
Unsupervised learning Algorithms
• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
2. Unsupervised Learning: k-Means Clustering
The k-means clustering algorithm was proposed by J. Hartigan and M. A. Wong [1979].
Given a set of n distinct objects, the k-means clustering algorithm partitions the objects into k clusters such that intracluster similarity is high but intercluster similarity is low.
In this algorithm, the user has to specify k, the number of clusters. The objects are assumed to be described by numeric attributes, so any one of the distance metrics can be used to demarcate the clusters.
Clustering
Clustering is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets (clusters), so that the
data in each subset (ideally) share some common trait - often
according to some defined distance measure.
Clustering
Organizing data into classes such that there is:
high intra-class similarity
low inter-class similarity
Finding the class labels and the number of classes directly from the data (in contrast to classification).
More informally, finding natural groupings among objects.
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups:
intra-cluster distances are minimized
inter-cluster distances are maximized
Clustering Example
What is a natural grouping among these objects? Clustering is subjective: the same people could be grouped as the Simpson's family vs. school employees, or as females vs. males.
What is Similarity?
“The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)
Similarity is hard to define, but “we know it when we see it”.
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Common Distance Measures
The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. Common measures include:
1. The Euclidean distance (also called the 2-norm distance), given by
d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )
2. The Manhattan distance (also called the taxicab or 1-norm distance), given by
d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
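Minimal Python sketches of both measures:

```python
# Euclidean (2-norm) and Manhattan (1-norm) distances between two points.
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((1, 1), (4, 3)))   # sqrt(13) ~ 3.606 (reused in the k-means example)
print(manhattan((1, 1), (4, 3)))   # 5
```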
Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continent faults
What Is Good Clustering?
A good clustering method will produce high-quality clusters with:
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Clustering: Hierarchical vs. Partitional
Partitional algorithms: construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion
Partitional Clustering
[Figure: a set of original points and one partitional clustering of them]
Hierarchical Clustering
[Figure: traditional hierarchical clustering of points p1-p4 with its dendrogram; non-traditional hierarchical clustering with its dendrogram]
K-Means Clustering
Simply speaking, k-means clustering is an algorithm to classify or group objects based on attributes/features into k groups, where k is a positive integer.
The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid.
Working of K-Means Clustering
Classical partitioning method: k-means
In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed.
This criterion tries to make the resulting k clusters as compact and as separate as possible.
Working of K-Means Clustering
• Begin with a decision on the value of k = number of clusters.
• Arbitrarily assign k objects from D as the initial cluster centers.
• Each object is assigned to the cluster whose center is nearest to it.
• Next, the cluster centers are updated, i.e. the mean value of each cluster is recalculated based on the current objects in the cluster.
• Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is nearest.
• This process iterates.
• Eventually, no redistribution of objects in any cluster occurs, and so the process terminates.
• The resulting clusters are returned by the clustering process.
Algorithm of K-Means Clustering
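The steps above can be condensed into a short Python sketch. This is a minimal illustration rather than a production implementation: points are tuples, the distance is Euclidean, and the initial centers are passed in explicitly, so a run with hand-picked seeds need not match the slides' randomly partitioned runs:

```python
# Minimal k-means sketch: assignment step, update step, repeat until stable.
from math import dist   # Euclidean distance (Python 3.8+)

def k_means(points, centers, max_iter=100):
    for _ in range(max_iter):
        # Assignment: each object joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update: each center becomes the mean of its current cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # no redistribution occurred: terminate
            break
        centers = new_centers
    return clusters, centers

# 1-D usage on the data of Example 1 below (seeds 2, 8, 15 chosen by hand,
# so the final clusters may differ from the slide's random-partition run).
data = [(x,) for x in (2, 3, 6, 8, 9, 12, 15, 18, 22)]
clusters, centers = k_means(data, [(2.0,), (8.0,), (15.0,)])
print(clusters)
```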
K-Means Clustering-Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22}. Assume k = 3.
Solution:
Randomly partition the given data set:
K1 = {2, 8, 15}, mean = 8.33
K2 = {3, 9, 18}, mean = 10
K3 = {6, 12, 22}, mean = 13.33
Reassign:
K1 = {2, 3, 6, 8, 9}, mean = 5.6
K2 = { } (empty), mean = 0
K3 = {12, 15, 18, 22}, mean = 16.75
K-Means Clustering-Example 1
Reassign:
K1 = {3, 6, 8, 9}, mean = 6.5
K2 = {2}, mean = 2
K3 = {12, 15, 18, 22}, mean = 16.75
Reassign:
K1 = {6, 8, 9}, mean = 7.67
K2 = {2, 3}, mean = 2.5
K3 = {12, 15, 18, 22}, mean = 16.75
Reassign:
K1 = {6, 8, 9}, mean = 7.67
K2 = {2, 3}, mean = 2.5
K3 = {12, 15, 18, 22}, mean = 16.75
No assignments change, so STOP.
Example
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}. Assume k = 2.
Solution
K1 = {2, 3, 4, 10, 11, 12}
K2 = {20, 25, 30}
K-Means Clustering-Example 2
A simple example showing the implementation of the k-means algorithm (using k = 2)
Step 1:
Initialization: we randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
K-Means Clustering-Example 2
Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}, and their new centroids are recomputed as the means of these clusters.
K-Means Clustering-Example 2
Step 3:
Now using these centroids we compute the Euclidean distance of each object, as shown in the table.
Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}.
The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
K-Means Clustering-Example 2
Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
Therefore, there is no change in the clusters.
Thus, the algorithm comes to a halt here, and the final result consists of the 2 clusters {1, 2} and {3, 4, 5, 6, 7}.
[Figure: plots of the cluster assignments after Step 1 and Step 2]
K-Means Clustering-Example 3
We have 4 medicines as our training data points, and each medicine has 2 attributes. Each attribute represents a coordinate of the object. We have to determine which medicines belong to cluster 1 and which belong to the other cluster.

Object | Attribute 1 (X): weight | Attribute 2 (Y): pH
Medicine A | 1 | 1
Medicine B | 2 | 1
Medicine C | 4 | 3
Medicine D | 5 | 4
K-Means Clustering-Example 3
Step 1:
Initial value of centroids: suppose we use medicine A and medicine B as the first centroids.
Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
K-Means Clustering-Example 3
Step 1 (continued):
Objects-centroids distance: we calculate the distance between each cluster centroid and each object.
Using Euclidean distance, we obtain the distance matrix at iteration 0.
■ Each column in the distance matrix symbolizes an object.
■ The first row of the distance matrix corresponds to the distance of each object to the first centroid, and the second row is the distance of each object to the second centroid.
■ For example, the distance from medicine C = (4, 3) to the first centroid c1 = (1, 1) is sqrt((4−1)² + (3−1)²) = sqrt(13) ≈ 3.61, and its distance to the second centroid c2 = (2, 1) is sqrt((4−2)² + (3−1)²) = sqrt(8) ≈ 2.83.
K-Means Clustering-Example 3
Step 2:
Objects clustering: we assign each object based on the minimum distance.
Medicine A is assigned to group 1; medicine B to group 2, medicine C to group 2, and medicine D to group 2.
An element of the group matrix is 1 if and only if the object is assigned to that group.
K-Means Clustering-Example 3
Iteration 1, objects-centroids distances: the next step is to compute the distance of all objects to the new centroids, c1 = (1, 1) and c2 = ((2+4+5)/3, (1+3+4)/3) ≈ (3.67, 2.67). Similar to step 2, we obtain the distance matrix at iteration 1.
K-Means Clustering-Example 3
Iteration 1, objects clustering:
Based on the new distance matrix, we move medicine B to group 1 while all the other objects remain. The group matrix is updated accordingly.
Iteration 2, determine centroids:
Now we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, so the new centroids are m1 = ((1+2)/2, (1+1)/2) = (1.5, 1) and m2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).
K-Means Clustering-Example 3
Iteration 2, objects-centroids distances:
Repeating step 2 again, we obtain the new distance matrix at iteration 2.
K-Means Clustering-Example 3
Iteration 2, objects clustering:
Again, we assign each object based on the minimum distance. Comparing the grouping of the last iteration and this iteration reveals that the objects no longer move between groups.
Thus, the computation of the k-means clustering has reached stability and no more iterations are needed.
K-Means Clustering-Example 3
We get the final grouping as the result:

Object | Feature 1 (X): weight | Feature 2 (Y): pH | Group (result)
Medicine A | 1 | 1 | 1
Medicine B | 2 | 1 | 1
Medicine C | 4 | 3 | 2
Medicine D | 5 | 4 | 2
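As a cross-check, the same final grouping can be obtained with scikit-learn's KMeans (assuming scikit-learn is installed), seeding the initial centers at medicines A and B as in step 1:

```python
# Example 3 reproduced with scikit-learn, seeded at medicines A and B.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # A, B, C, D
init = np.array([[1.0, 1.0], [2.0, 1.0]])                     # c1 = A, c2 = B

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)            # [0 0 1 1]: A, B in group 1; C, D in group 2
print(km.cluster_centers_)   # [[1.5 1. ] [4.5 3.5]]
```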
Advantages & Disadvantages of K-Means Clustering
Advantages:
K-means is relatively scalable and efficient in processing large data sets.
The computational complexity of the algorithm is O(nkt), where
n: the total number of objects
k: the number of clusters
t: the number of iterations
Normally, k << n and t << n.
Disadvantages:
Can be applied only when the mean of a cluster is defined.
Users need to specify k.
K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
It is sensitive to noise and outlier data points.
Weaknesses of K-Means Clustering
When the number of data points is small, the initial grouping will determine the clusters significantly.
The number of clusters, k, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
We never know the real clusters: using the same data, if it is input in a different order it may produce different clusters when the number of data points is small.
It is sensitive to the initial condition: different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum.
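A small demonstration of this sensitivity, assuming scikit-learn: several single-init runs with different random seeds can settle in different local optima on the same data, which shows up as different final inertia (sum of squared distances) values:

```python
# Different random initialisations may yield different k-means results.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose 2-D blobs centered around (0,0), (4,4), (8,8).
X = np.vstack([rng.normal(loc, 1.5, size=(20, 2)) for loc in (0, 4, 8)])

for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    # If the inertia differs between seeds, the runs found different local optima.
    print(seed, round(km.inertia_, 2))
```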
Supervised v/s Unsupervised Learning

Supervised Learning | Unsupervised Learning
Two tasks can be addressed: classification and regression | Two tasks can be addressed: clustering and association
Input data is provided to the model along with the output | Only input data is provided
Output is predicted by the model | Hidden patterns in the data can be found
Contains labelled data/classes | Unlabelled data/classes
Objective: training the model to predict the output when new data is provided | Objective: finding useful insights and hidden patterns from the unknown dataset
Examples: Naïve Bayes, Decision Tree | Examples: k-Means Clustering, Apriori algorithm
Applications: spam detection, handwriting recognition, speech recognition | Applications: customer segmentation, detecting fraudulent transactions, data preprocessing
References
Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition
https://builtin.com/data-science/supervised-machine-learning-classification
Anuradha Srinivasaraghavan and Vincy Joseph, Machine Learning, Wiley Publication
https://www.javatpoint.com/reinforcement-learning