
Clustering

Machine learning: Supervised vs Unsupervised

Supervised learning - discover patterns in the data that relate data attributes to a target (class) attribute.
• There must be a training data set in which the solution (the target class) is already known.

Unsupervised learning - the outcomes are unknown, or the data have no target attribute.
• Cluster the data to reveal meaningful partitions and hierarchies.
• We have to explore the data to find intrinsic structure in them.
Introduction
What is clustering?
• Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.
Examples
• Let us see some real-life examples
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers
according to their similarities
– To do targeted marketing.
Current Applications
• Document Classification: Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard problem, and k-means is a highly suitable algorithm for this purpose.
•  Identifying Crime Localities: With data related to
crimes available in specific localities in a city, the
category of crime, the area of the crime, and the
association between the two can give quality insight
into crime-prone areas within a city.
Current Applications
• Customer Segmentation: Clustering helps
marketers improve their customer base, work on
target areas, and segment customers based
on purchase history, interests, or activity
monitoring.
• Insurance Fraud Detection: Utilizing historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns.
Current Applications
• Rideshare Data Analysis: The publicly
available Uber ride information dataset
provides a large amount of valuable data
around traffic, transit time, peak pickup
localities, and more. Analyzing this data is
useful not just in the context of Uber but also
in providing insight into urban traffic patterns
and helping us plan for the cities of the future.
Current Applications
• Social network analysis - Facebook
"smartlists"
• Organizing computer clusters and data
centers for network layout and location
• Astronomical data analysis - Understanding
galaxy formation
Illustration
• The data set has three natural groups of data
points, i.e., 3 natural clusters.
Aspects of clustering
• A distance (similarity, or dissimilarity)
function
• Clustering quality
– Inter-cluster distance: maximized
– Intra-cluster distance: minimized
• The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
Types of clustering
• Hierarchical algorithms: these find
successive clusters
1. Agglomerative ("bottom-up"): Agglomerative
algorithms begin with each element as a
separate cluster and merge them into
successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin
with the whole set and proceed to divide it into
successively smaller clusters.
Types of clustering
• Partitional clustering: Partitional algorithms determine all clusters at once. They include K-means and its derivatives.
• The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
• It assumes that the object attributes form a vector space.
Other Approaches
• Density-based
• Mixture model
• Spectral methods
K-means clustering
• k-means clustering is an algorithm to classify or group objects, based on their attributes/features, into K groups.
• K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids.
How does it work?
Algorithm
• Step 1: Begin with a decision on the value of k = number of clusters.
• Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
  1. Take the first k training samples as single-element clusters.
  2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Algorithm
• Step 3: Take each sample in sequence and compute
its distance from the centroid of each of the clusters.
If a sample is not currently in the cluster with the
closest centroid, switch this sample to that cluster
and update the centroid of the cluster gaining the
new sample and the cluster losing the sample.
• Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
Simple example
• Take some values: 3, 8, 24, 91, 53, 75, 31, 9, 6, 44, 62, 15
• Choose the number of clusters and the initial centroids, say two clusters with 24 and 62 as initial centroids.
• Points closer to 24: 3, 8, 31, 9, 6, 15
• Points closer to 62: 44, 53, 91, 75
• Now repeat with three clusters, taking 15, 44 and 75 as initial centroids.
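A quick check of the first assignment pass of this example (a hypothetical snippet, assuming the initial centroids 24 and 62 above):

```python
import numpy as np

data = np.array([3, 8, 24, 91, 53, 75, 31, 9, 6, 44, 62, 15])
centroids = np.array([24, 62])

# Assign each value to the nearest of the two centroids.
labels = np.abs(data[:, None] - centroids[None, :]).argmin(axis=1)
for j, c in enumerate(centroids):
    print(f"cluster around {c}: {data[labels == j]}")
# The next iteration would use the cluster means as updated centroids.
print("updated centroids:", [data[labels == j].mean() for j in range(2)])
```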
A simple example showing the implementation of the k-means algorithm (using K=2)
Step 1:
Initialization: we randomly choose the following two centroids (k=2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• Thus, we obtain two clusters containing: {1,2,3} and {4,5,6,7}.
• Their new centroids are then recomputed from the members of each cluster.
Step 3:
• Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
• Therefore, the new clusters are: {1,2} and {3,4,5,6,7}.
• The next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
Step 4:
• The clusters obtained are: {1,2} and {3,4,5,6,7}.
• Therefore, there is no change in the clusters.
• Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
[Plot: clustering result with K=2]
[Plots: Steps 1 and 2 of the algorithm repeated with K=3, and the resulting clusters]
Limitations
• K-means is extremely
sensitive to cluster
center initializations
• Bad initialization can lead to poor convergence speed
• Bad initialization can lead to a bad overall clustering
Choosing the ‘k’ value
• Elbow method: plot the Within-Group Sum of Squares (WGSS) against k.
• The value of k at which the curve bends (the “elbow”), i.e., where WGSS stops decreasing sharply, is chosen.
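A hedged sketch of the elbow method using scikit-learn, whose inertia_ attribute is the within-group sum of squares; the toy data here is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three blobs, so the elbow should appear around k = 3.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (9, 0)]])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ is the within-group sum of squares
# Plot k vs. inertia and pick the k where the curve flattens (the "elbow").
```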
Practise
• We have 4 medicines as our training data points, and each medicine has 2 attributes. Each attribute represents a coordinate of the object. We have to determine which medicines belong to cluster 1 and which medicines belong to the other cluster.
Object        Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A    1                               1
Medicine B    2                               1
Medicine C    4                               3
Medicine D    5                               4
Hierarchical Clustering

• Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: agglomerative (AGNES) proceeds from Step 0 to Step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; divisive (DIANA) runs the same steps in reverse order.]
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical packages, e.g., Splus
• Use the single-link method and the dissimilarity matrix
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing AGNES progressively merging nearby points into larger clusters]
Dendrogram: Shows How Clusters are Merged
• Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)


• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

[Figure: three scatter plots showing DIANA progressively splitting the full data set into smaller clusters]
Example of converting data points
into distance matrix
• Clustering analysis with agglomerative
algorithm

data matrix

distance matrix

Euclidean distance 38
Example
     X      Y
A   0.40   0.53
B   0.22   0.38
C   0.35   0.32
D   0.26   0.19
E   0.08   0.41
F   0.45   0.30
Example
      A     B     C     D     E     F
A     0
B    0.23   0
C    0.22  0.15   0
D    0.37  0.20  0.15   0
E    0.34  0.14  0.28  0.29   0
F    0.23  0.25  0.11  0.22  0.39   0

The smallest off-diagonal distance is d(C,F) = 0.11, so C and F are merged first.
Example
        A      B     C,F     D      E
A       0
B      0.23    0
C,F    0.22   0.15    0
D      0.37   0.20   0.15    0
E      0.34   0.14   0.28   0.29    0

Using single link, e.g., d((C,F), A) = min(0.22, 0.23) = 0.22. The smallest entry is now d(B,E) = 0.14, so B and E are merged next.
Example
        A     B,E    C,F     D
A       0
B,E    0.23    0
C,F    0.22   0.15    0
D      0.37   0.20   0.15    0

The smallest entry is d((B,E), (C,F)) = 0.15, so these two clusters are merged next.
Example
               A    (B,E),(C,F)    D
A              0
(B,E),(C,F)   0.22       0
D             0.37      0.15       0

Finally, D joins the cluster ((B,E),(C,F)) at distance 0.15, and A joins at 0.22, completing the dendrogram.
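This walkthrough can be reproduced with SciPy's single-link agglomerative clustering; a sketch using the A-F coordinates from the earlier slide (the SciPy calls are real, the rest is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

points = {"A": (0.40, 0.53), "B": (0.22, 0.38), "C": (0.35, 0.32),
          "D": (0.26, 0.19), "E": (0.08, 0.41), "F": (0.45, 0.30)}
X = np.array(list(points.values()))

# Condensed Euclidean distance matrix, then single-link (MIN) agglomeration.
Z = linkage(pdist(X, metric="euclidean"), method="single")
print(Z)  # each row: the two clusters merged, the merge distance, and the new cluster size

# dendrogram(Z, labels=list(points.keys()))  # draws the merge tree (needs matplotlib)
```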
Practise
1 2 3 4 5
1 0
2 9 0
3 3 7 0
4 6 5 9 0
5 11 10 2 8 0
MIN or Single Link
Inter-cluster distance
• The distance between two clusters is
represented by the distance of the closest pair
of data objects belonging to different clusters.
• Determined by one pair of points, i.e., by one
link in the proximity graph
MAX or Complete Link
Inter-cluster distance
• The distance between two clusters is
represented by the distance of the farthest pair
of data objects belonging to different clusters
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
  – Medoid: a chosen, centrally located object in the cluster
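An illustrative NumPy snippet (not from the slides) computing several of these inter-cluster distances for two small clusters:

```python
import numpy as np

Ki = np.array([[1.0, 1.0], [2.0, 1.5]])                # cluster i
Kj = np.array([[5.0, 6.0], [6.0, 7.0], [5.5, 5.0]])    # cluster j

# All pairwise Euclidean distances between points of the two clusters.
D = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

print("single link  :", D.min())    # closest pair
print("complete link:", D.max())    # farthest pair
print("average link :", D.mean())   # mean over all pairs
print("centroid     :", np.linalg.norm(Ki.mean(0) - Kj.mean(0)))
```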
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster
  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
• Radius: square root of the average distance from any point of the cluster to its centroid
  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
• Diameter: square root of the average mean squared distance between all pairs of points in the cluster
  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{i'=1}^{N} (t_{ip} - t_{i'q})^2}{N(N-1)}}$
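A small NumPy check of these three statistics (the cluster data and names are illustrative):

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0], [2.5, 2.5]])
N = len(cluster)

centroid = cluster.mean(axis=0)                                     # C_m
radius = np.sqrt(((cluster - centroid) ** 2).sum(axis=1).mean())    # R_m
pair_sq = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=2)
diameter = np.sqrt(pair_sq.sum() / (N * (N - 1)))                   # D_m

print(centroid, radius, diameter)
```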
Parametric vs Nonparametric Estimation
Learning a Function

• Machine learning can be summarized as


learning a function (f) that maps input
variables (X) to output variables (Y).
Y = f(X)

• An algorithm learns this target mapping


function from training data
• Different algorithms make different
assumptions or biases about the form of the
function and how it can be learned.
Parametric Machine Learning Algorithms

• Assumptions can greatly simplify the learning


process, but can also limit what can be learned.
Algorithms that simplify the function to a known
form are called parametric machine learning
algorithms.
• A learning model that summarizes data with a set of
parameters of fixed size (independent of the number
of training examples) is called a parametric model.
• No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.
The algorithms involve two steps:

1. Select a form for the function.


2. Learn the coefficients for the function from the
training data.
• An easy-to-understand functional form for the mapping function is a line, as used in linear regression: b0 + b1*x1 + b2*x2 = 0
• where b0, b1 and b2 are the coefficients of the line that control the intercept and slope, and x1 and x2 are two input variables.
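A hedged sketch of these two steps: assume a linear form, then learn the coefficients from the training data by ordinary least squares (the data here is synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))                               # two inputs x1, x2
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)   # noisy linear target

# Step 1: select the form y = b0 + b1*x1 + b2*x2.
# Step 2: learn the coefficients by least squares.
A = np.column_stack([np.ones(len(X)), X])
b0, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
print(b0, b1, b2)   # should be close to 3.0, 1.5, -2.0
```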
Parametric Estimation
• Assuming the functional form of a line greatly simplifies
the learning process. Now, all we need to do is estimate the
coefficients of the line equation and we have a predictive
model for the problem.
• Some more examples of parametric machine learning
algorithms include
– Logistic Regression
– Linear Discriminant Analysis
– Perceptron
– Naive Bayes
– Simple Neural Networks
Benefits of Parametric Machine Learning
Algorithms:
• Simpler: These methods are easier to
understand and interpret results.
• Speed: Parametric models are very fast to
learn from data.
• Less Data: They do not require as much
training data and can work well even if the fit
to the data is not perfect.
Limitations of Parametric Machine
Learning Algorithms:
• Constrained: By choosing a functional form
these methods are highly constrained to the
specified form.
• Limited Complexity: The methods are more
suited to simpler problems.
• Poor Fit: In practice the methods are unlikely
to match the underlying mapping function.
Nonparametric Machine Learning Algorithms

• Algorithms that do not make strong assumptions


about the form of the mapping function are called
nonparametric machine learning algorithms. By
not making assumptions, they are free to learn
any functional form from the training data.
• Nonparametric methods are good when you have
a lot of data and no prior knowledge, and when
you don’t want to worry too much about choosing
just the right features.
Nonparametric Estimation
• Nonparametric methods seek to best fit the training data
in constructing the mapping function, whilst maintaining
some ability to generalize to unseen data. As such, they
are able to fit a large number of functional forms.
• An easy-to-understand nonparametric model is the k-nearest neighbors algorithm, which makes predictions based on the k most similar training patterns for a new data instance. The method does not assume anything about the form of the mapping function other than that patterns that are close are likely to have a similar output variable.
Nonparametric Estimation
• Some more examples of popular nonparametric machine learning algorithms are:
• k-Nearest Neighbours
• Decision Trees like CART and C4.5
• Support Vector Machines
Benefits of Nonparametric Machine Learning
Algorithms:
• Flexibility: Capable of fitting a large number
of functional forms.
• Power: No assumptions (or weak
assumptions) about the underlying function.
• Performance: Can result in higher
performance models for prediction.
Limitations of Nonparametric Machine
Learning Algorithms:
• More data: Require a lot more training data to
estimate the mapping function.
• Slower: A lot slower to train as they often
have far more parameters to train.
• Overfitting: More of a risk to overfit the
training data and it is harder to explain why
specific predictions are made.
K Nearest Neighbour Classification
• It classifies new points based on a similarity measure.
• It uses data points that have already been separated into several classes to predict the class of a new sample point.
K Nearest Neighbor Classification
• Step 1: Initialize ‘k’.
• Step 2: For each sample in the training data,
  – Calculate the distance between the query point and the current point.
  – Add the distance and the index of the example to an ordered collection.
• Step 3: Sort the ordered collection of distances and indexes from smallest to largest.
• Step 4: Pick the first ‘k’ entries from the sorted list.
• Step 5: Get the labels of the selected ‘k’ entries and return the most frequent label (majority vote) as the prediction.
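A minimal sketch of these steps in Python (illustrative names; Euclidean distance assumed):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    # Steps 2-3: distance from the query to every training point, then sort.
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]            # Step 4: indices of the k closest points
    labels = [y_train[i] for i in nearest]     # Step 5: their labels
    return Counter(labels).most_common(1)[0][0]  # majority vote
```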
K Nearest Neighbor Classification
Height Weight T Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L
For K = 5, predict the T-shirt size for an input height of 161 cm and weight of 61 kg.
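One way to run this example is with scikit-learn's KNeighborsClassifier (a sketch; feature scaling is ignored here, as on the slide):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[158, 58], [158, 59], [158, 63], [160, 59], [160, 60], [163, 60],
              [163, 61], [160, 64], [163, 64], [165, 61], [165, 62], [165, 65],
              [168, 62], [168, 63], [168, 66], [170, 63], [170, 64], [170, 68]])
y = ["M"] * 7 + ["L"] * 11   # T-shirt sizes in the same order as the table

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[161, 61]]))   # size suggested by the 5 nearest customers
```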
K Nearest Neighbor Classification-
Visualization
KNN vs. K-means
• K-means is an unsupervised learning technique (no dependent variable), whereas KNN is a supervised learning algorithm (a dependent variable exists).
• K-means is a clustering technique which tries to split data points into K clusters such that the points in each cluster tend to be near each other, whereas K-nearest neighbors determines the classification of a point by combining the classifications of the K nearest points.
K Nearest Neighbor Classification -
Practise
Perform the KNN classification algorithm on the following dataset and predict the class for P1 = 3 and P2 = 7. Consider k = 3.
P1 P2 Class
7 7 False
7 5 False
5 6 False
3 4 True
2 3 True
4 3 True
Voronoi Diagram
Nonparametric Regression:
Smoothing Models
Regression
• In regression, given the training set X = {x^t, r^t} where r^t ∈ R, we assume
  $r^t = g(x^t) + \varepsilon$
• In parametric regression, we assume a polynomial of a certain order and compute its coefficients that minimize the sum of squared errors on the training set.
Nonparametric regression
• Nonparametric regression is used when no such polynomial can be assumed;
• we only assume that close x values have close g(x) values.
• As in nonparametric density estimation, given x, our approach is to find the neighborhood of x and average the r values in the neighborhood to calculate $\hat{g}(x)$.
Nonparametric regression
• The nonparametric regression estimator is also
called a smoother and the estimate is called a
smooth
Regressogram

• A commonly used, simple nonparametric method.
Regressogram

• This is an analysis of astronomy data. The X-axis is the galaxy distance to some cosmological structure and the Y-axis is the correlation for some features of the galaxy. We bin the data according to galaxy distance and take the mean within each bin as a landmark (or summary), which shows how this landmark changes with galaxy distance.
Regressogram

• Note that the range of Y in the raw scatter plot is (0, 1), while in the regressogram the range is (0.7, 0.8). If you want to visualize the data, the scatter plot will not be helpful. The regressogram, however, is a simple approach to visualizing hidden structure within this complicated data.
• Here are the steps for constructing a regressogram. First we bin the data according to the X-axis (shown by the red lines in the figure).
• Then we compute the mean within each bin (shown by the blue points).
• We can show only the blue points (and the blue curve, which just connects the points) so that the result looks much more concise.
• However, since the range of the Y-axis is too large, this does not show the trend. So we zoom in and compute the error for estimating the mean within each bin.
• The advantage of the regressogram is its simplicity. Since we are summarizing the whole data by points representing the mean within each bin, the interpretation is very straightforward.
• Also, it shows the trend (and error bars) for the data, so that we have a rough idea of what is going on. Moreover, no matter how complicated the original plot is, the regressogram uses only a few statistics (the mean within each bin) to summarize the whole data. Notice that we do not make any assumption on the distribution of the data (such as normality); thus, the regressogram is a nonparametric method.
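A small sketch of a regressogram in Python (the bin edges, data, and names are illustrative, not the astronomy data from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
r = np.sin(x) + rng.normal(0, 0.3, 500)        # noisy response values

bins = np.linspace(0, 10, 11)                  # first, bin the data along the X-axis
which = np.digitize(x, bins) - 1
# then take the mean of r within each bin as the estimate for that bin
g_hat = np.array([r[which == b].mean() if np.any(which == b) else np.nan
                  for b in range(len(bins) - 1)])
centers = (bins[:-1] + bins[1:]) / 2
print(np.column_stack([centers, g_hat]))       # bin center, bin mean
```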
Kernel smoother
• Kernel density estimation (KDE): $\hat{f}(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$
• K – kernel function (non-negative)
• h – smoothing parameter (bandwidth)
• For regression, the kernel smoother estimates $\hat{g}(x)$ as a kernel-weighted average of the $r^t$ values of nearby training points.
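A sketch of a Gaussian kernel smoother in the Nadaraya-Watson style, assuming the kernel-weighted-average form above (the names and data are illustrative):

```python
import numpy as np

def kernel_smooth(x_train, r_train, x_query, h):
    """Estimate g(x_query) as a kernel-weighted average of the r values."""
    # Gaussian kernel weights: closer training points get larger weights.
    w = np.exp(-0.5 * ((x_query - x_train) / h) ** 2)
    return np.sum(w * r_train) / np.sum(w)

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
r = np.sin(x) + rng.normal(0, 0.3, 200)
print(kernel_smooth(x, r, x_query=5.0, h=0.5))   # smoothed estimate near x = 5
```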
