
Data Mining

Unsupervised Learning
Clustering
PART 6

◼ Prof. Ahmed Sultan Al-Hegami


◼ Professor of AI & Intelligent Systems
◼ Sana’a University
Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Supervised learning vs. unsupervised
learning
◼ Supervised learning: discover patterns in the
data that relate data attributes with a target
(class) attribute.
❑ These patterns are then utilized to predict the
values of the target attribute in future data
instances.
◼ Unsupervised learning: The data have no
target (Class) attribute.
❑ We want to explore the data to find some intrinsic
structures in them.
Clustering
◼ Clustering is a technique for finding similarity groups
in data, called clusters. I.e.,
❑ it groups data instances that are similar to (near) each other
in one cluster and data instances that are very different (far
away) from each other into different clusters.
◼ Clustering is often called an unsupervised learning
task as no class values denoting an a priori grouping
of the data instances are given, which is the case in
supervised learning.
◼ Due to historical reasons, clustering is often
considered synonymous with unsupervised learning.
❑ In fact, association rule mining is also unsupervised



An illustration
◼ The data set has three natural groups of data points,
i.e., 3 natural clusters.



What is clustering for?
◼ Let us see some real-life examples
◼ Example 1: group people of similar sizes
together to make “small”, “medium” and
“large” T-Shirts.
❑ Tailor-made for each person: too expensive
❑ One-size-fits-all: does not fit all.
◼ Example 2: In marketing, segment customers
according to their similarities
❑ To do targeted marketing.



What is clustering for? (cont…)
◼ Example 3: Given a collection of text
documents, we want to organize them
according to their content similarities,
❑ To produce a topic hierarchy
◼ In fact, clustering is one of the most utilized
data mining techniques.
❑ It has a long history and is used in almost every
field, e.g., medicine, psychology, botany,
sociology, biology, archeology, marketing,
insurance, libraries, etc.
❑ In recent years, due to the rapid increase of online
documents, text clustering has become important.
General Applications of Clustering
◼ Pattern Recognition
◼ Spatial Data Analysis
❑ create thematic maps in GIS by clustering feature
spaces
❑ detect spatial clusters and explain them in spatial data
mining
◼ Image Processing
◼ Economic Science (especially market research)
◼ WWW
❑ Document classification
❑ Cluster Weblog data to discover groups of similar
access patterns



Examples of Clustering Applications
◼ Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
◼ Land use: Identification of areas of similar land use in an
earth observation database
◼ Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
◼ City-planning: Identifying groups of houses according to
their house type, value, and geographical location
◼ Earthquake studies: Observed earthquake epicenters
should be clustered along continent faults



What is not Cluster Analysis?
◼ Supervised classification
❑ Have class label information

◼ Simple segmentation
❑ Dividing students into different registration groups
alphabetically, by last name

◼ Results of a query
❑ Groupings are a result of an external specification

◼ Graph partitioning
❑ Some mutual relevance and synergy, but areas are not
identical



What Is Good Clustering?
◼ A good clustering method will produce high quality
clusters with
❑ high intra-class similarity
❑ low inter-class similarity
◼ The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
◼ The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.



Data Structures in Clustering
◼ Data matrix
❑ (two modes)

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

◼ Dissimilarity matrix
❑ (one mode)

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$



Measuring Similarity
◼ Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
◼ There is a separate “quality” function that measures the
“goodness” of a cluster.
◼ The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
◼ Weights should be associated with different variables
based on applications and data semantics.
◼ It is hard to define “similar enough” or “good enough”
❑ the answer is typically highly subjective.



Aspects of clustering
◼ A clustering algorithm
❑ Partitional clustering
❑ Hierarchical clustering
❑ …
◼ A distance (similarity, or dissimilarity) function
◼ Clustering quality
❑ Inter-cluster distance → maximized
❑ Intra-cluster distance → minimized
◼ The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Types of Clusterings
◼ A clustering is a set of clusters
◼ Important distinction between hierarchical
and partitional sets of clusters
◼ Partitional Clustering
❑ A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

◼ Hierarchical clustering
❑ A set of nested clusters organized as a hierarchical tree



Partitional Clustering

(Figure: the original points and a partitional clustering of them.)



Hierarchical Clustering

(Figures: a traditional hierarchical clustering of points p1–p4 with its traditional dendrogram (tree), and a non-traditional hierarchical clustering of the same points with its non-traditional dendrogram.)
Other Distinctions Between Sets of Clusters

◼ Exclusive versus non-exclusive


❑ In non-exclusive clusterings, points may belong to multiple
clusters.
❑ Can represent multiple classes or ‘border’ points
◼ Fuzzy versus non-fuzzy
❑ In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
❑ Weights must sum to 1
❑ Probabilistic clustering has similar characteristics
◼ Partial versus complete
❑ In some cases, we only want to cluster some of the data
◼ Heterogeneous versus homogeneous
❑ Clusters of widely different sizes, shapes, and densities



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Types of Clusters
◼ Well-separated clusters

◼ Center-based clusters

◼ Contiguous clusters

◼ Density-based clusters

◼ Property or Conceptual

◼ Described by an Objective Function



Types of Clusters: Well-Separated

◼ Well-Separated Clusters:
❑ A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.

◼3 well-separated clusters



Types of Clusters: Center-Based

◼ Center-based
❑ A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
❑ The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster

4 center-based clusters



Types of Clusters: Contiguity-Based

◼ Contiguous Cluster (Nearest neighbor or Transitive)
❑ A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

◼8 contiguous clusters



Types of Clusters: Density-Based

◼ Density-based
❑ A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
❑ Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters



Types of Clusters: Conceptual Clusters

◼ Shared Property or Conceptual Clusters


❑ Finds clusters that share some common property or represent
a particular concept.

2 Overlapping Circles



Types of Clusters: Objective Function

◼ Clusters Defined by an Objective Function


❑ Finds clusters that minimize or maximize an objective function.
❑ Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using
the given objective function. (NP Hard)
❑ Can have global or local objectives.
◼ Hierarchical clustering algorithms typically have local objectives
◼ Partitional algorithms typically have global objectives
❑ A variation of the global objective function approach is to fit the data
to a parameterized model.
◼ Parameters for the model are determined from the data.
◼ Mixture models assume that the data is a ‘mixture' of a number of
statistical distributions.



Types of Clusters: Objective Function …

◼ Map the clustering problem to a different domain and solve a related problem in that domain
❑ A proximity matrix defines a weighted graph,
where the nodes are the points being clustered, and
the weighted edges represent the proximities between
points

❑ Clustering is equivalent to breaking the graph into connected components, one for each cluster.

❑ Want to minimize the edge weight between clusters and maximize the edge weight within clusters
Characteristics of the Input Data Are Important

◼ Type of proximity or density measure


❑ This is a derived measure, but central to clustering
◼ Sparseness
❑ Dictates type of similarity
❑ Adds to efficiency
◼ Attribute type
❑ Dictates type of similarity
◼ Type of Data
❑ Dictates type of similarity
❑ Other characteristics, e.g., autocorrelation
◼ Dimensionality
◼ Noise and Outliers
◼ Type of Distribution



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Clustering Algorithms

◼ K-means and its variants

◼ Hierarchical clustering

◼ Density-based clustering



K-means Clustering
◼ Partitional clustering approach
◼ Each cluster is associated with a centroid (center point)
◼ Each point is assigned to the cluster with the closest
centroid
◼ Number of clusters, K, must be specified
◼ The basic algorithm is very simple



K-means Clustering – Details
◼ Initial centroids are often chosen randomly.
❑ Clusters produced vary from one run to another.
◼ The centroid is (typically) the mean of the points in the
cluster.
◼ ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
◼ K-means will converge for common similarity measures
mentioned above.
◼ Most of the convergence happens in the first few
iterations.
❑ Often the stopping condition is changed to ‘Until relatively few
points change clusters’
◼ Complexity is O( n * K * I * d )
❑ n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes



Hierarchical Clustering

◼ Produces a set of nested clusters organized as a hierarchical tree
◼ Can be visualized as a dendrogram
❑ A tree-like diagram that records the sequences of merges or splits
(Figure: a hierarchical clustering of six points and the corresponding dendrogram; the dendrogram leaves are ordered 1, 3, 2, 5, 4, 6 and the vertical axis shows merge distances from 0 to about 0.2.)



Strengths of Hierarchical Clustering

◼ Do not have to assume any particular number of clusters
❑ Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level

◼ They may correspond to meaningful taxonomies
❑ Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)



Hierarchical Clustering
◼ Two main types of hierarchical clustering
❑ Agglomerative:
◼ Start with the points as individual clusters
◼ At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left

❑ Divisive:
◼ Start with one, all-inclusive cluster
◼ At each step, split a cluster until each cluster contains a point (or
there are k clusters)

◼ Traditional hierarchical algorithms use a similarity or distance matrix
❑ Merge or split one cluster at a time



Agglomerative Clustering Algorithm
◼ More popular hierarchical clustering technique
◼ Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

◼ Key operation is the computation of the proximity of two clusters
❑ Different approaches to defining the distance between clusters distinguish the different algorithms
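In practice, the merge loop above is rarely written by hand. Below is a minimal sketch using SciPy's hierarchical clustering routines (assuming SciPy is available); the toy data and the choice of single-link proximity are illustrative, not part of the original slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually obvious groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# linkage() performs steps 1-6 above internally: it starts from singleton
# clusters and repeatedly merges the two closest clusters, recording every
# merge (and its height) in Z.
Z = linkage(X, method="single")   # "single" = closest-pair proximity;
                                  # "complete", "average", ... are other choices

# 'Cut' the resulting hierarchy to obtain a flat clustering with 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```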



Density-Based Clustering Methods

◼ Clustering based on density (local cluster criterion), such as density-connected points
◼ Major features:
❑ Discover clusters of arbitrary shape
❑ Handle noise
❑ One scan
❑ Need density parameters as termination condition
◼ Several interesting studies:
❑ DBSCAN: Ester, et al. (KDD’96)
❑ OPTICS: Ankerst, et al (SIGMOD’99).
❑ DENCLUE: Hinneburg & D. Keim (KDD’98)
❑ CLIQUE: Agrawal, et al. (SIGMOD’98)



DBSCAN

◼ DBSCAN is a density-based algorithm.


◼ Definitions:
❑ Density = number of points within a specified radius (Eps)

❑ A point is a core point if it has more than a specified number of points (MinPts) within Eps
◼ These are points that are at the interior of a cluster

❑ A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

❑ A noise point is any point that is not a core point or a border point.
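As a rough illustration of these definitions (not the full DBSCAN algorithm), the sketch below labels each point as core, border, or noise. The function name and the use of a "at least MinPts" test are assumptions; implementations differ slightly in such conventions.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using the
    definitions above (density = number of points within Eps)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    neighbors = d <= eps               # Eps-neighborhood membership (includes self)
    core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():     # within Eps of at least one core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [0.3, 0.3], [5.0, 5.0]])
print(classify_points(X, eps=0.5, min_pts=3))
# ['core', 'core', 'core', 'core', 'noise']
```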



DBSCAN: Core, Border, and Noise Points



DBSCAN Algorithm

◼ Eliminate noise points


◼ Perform clustering on the remaining points



DBSCAN: Core, Border and Noise Points

(Figure: the original points and their point types — core, border, and noise — computed with Eps = 10 and MinPts = 4.)



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



K-means clustering
◼ K-means is a partitional clustering algorithm
◼ Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ Rr, and r is the number of attributes (dimensions) in the data.
◼ The k-means algorithm partitions the given
data into k clusters.
❑ Each cluster has a cluster center, called centroid.
❑ k is specified by the user



K-means algorithm
◼ Given k, the k-means algorithm works as
follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships
4) If a convergence criterion is not met, go to 2)
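To make the four steps concrete, here is a minimal NumPy sketch (an illustrative implementation, not the one used in the course); it uses Euclidean distance and stops when the centroids no longer move.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following the four steps above (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute the centroids from the current cluster memberships.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4) Stop when the centroids no longer change (a convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on a tiny 2-D data set:
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```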



K-means algorithm – (cont …)



Stopping/convergence criterion
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE),
$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} dist(\mathbf{x}, \mathbf{m}_j)^2 \qquad (1)$$

❑ Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
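A small helper that evaluates Eq. (1). It assumes X is a NumPy array of data points, labels holds each point's cluster index, and centroids holds the k mean vectors; all names are illustrative.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors, Eq. (1): for each cluster Cj, add up the
    squared distances between its points and its centroid mj."""
    labels = np.asarray(labels)
    total = 0.0
    for j, m_j in enumerate(centroids):
        members = X[labels == j]
        total += ((members - m_j) ** 2).sum()
    return total
```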



An example

(Figures: a step-by-step worked example of k-means — the data points with two initial centroids marked “+”, an example distance function, and the successive cluster assignments and centroid re-computations over several iterations.)


Strengths of k-means
◼ Strengths:
❑ Simple: easy to understand and to implement
❑ Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
❑ Since both k and t are small, k-means is considered a
linear algorithm.
◼ K-means is the most popular clustering algorithm.



K-means summary
◼ Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
❑ Other clustering algorithms have their own lists of weaknesses.
◼ No clear evidence that any other clustering
algorithm performs better in general
❑ although they may be more suitable for some
specific types of data or applications.
◼ Comparing different clustering algorithms is a
difficult task. No one knows the correct
clusters!
Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Hierarchical Clustering
◼ Produce a nested sequence of clusters, a tree,
also called Dendrogram.



Types of hierarchical clustering
◼ Agglomerative (bottom up) clustering: It builds the
dendrogram (tree) from the bottom level, and
❑ merges the most similar (or nearest) pair of clusters
❑ stops when all the data points are merged into a single
cluster (i.e., the root cluster).
◼ Divisive (top down) clustering: It starts with all data
points in one cluster, the root.
❑ Splits the root into a set of child clusters. Each child cluster
is recursively divided further
❑ stops when only singleton clusters of individual data points
remain, i.e., each cluster with only a single point



Agglomerative clustering

It is more popular than divisive methods.


◼ At the beginning, each data point forms a
cluster (also called a node).
◼ Merge nodes/clusters that have the least
distance.
◼ Go on merging

◼ Eventually all nodes belong to one cluster



Agglomerative clustering algorithm



An example: working of the algorithm



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Distance functions
◼ Key to clustering. “Similarity” and
“dissimilarity” are also commonly used terms.
◼ There are numerous distance functions for
❑ Different types of data
◼ Numeric data
◼ Nominal data
❑ Different specific applications



Distance functions for numeric attributes
◼ Most commonly used functions are
❑ Euclidean distance and
❑ Manhattan (city block) distance
◼ We denote distance with: dist(xi, xj), where xi
and xj are data points (vectors)
◼ They are special cases of the Minkowski distance, where h is a positive integer:

$$dist(\mathbf{x}_i, \mathbf{x}_j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ir} - x_{jr}|^h\right)^{1/h}$$



Euclidean distance and Manhattan distance
◼ If h = 2, it is the Euclidean distance:

$$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2}$$

◼ If h = 1, it is the Manhattan distance:

$$dist(\mathbf{x}_i, \mathbf{x}_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|$$

◼ Weighted Euclidean distance:

$$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2}$$



Squared distance and Chebychev distance
◼ Squared Euclidean distance: to place
progressively greater weight on data points
that are further apart.
$$dist(\mathbf{x}_i, \mathbf{x}_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2$$

◼ Chebychev distance: one wants to define two data points as "different" if they are different on any one of the attributes.

$$dist(\mathbf{x}_i, \mathbf{x}_j) = \max\left(|x_{i1} - x_{j1}|,\; |x_{i2} - x_{j2}|,\; \dots,\; |x_{ir} - x_{jr}|\right)$$
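These distance functions translate directly into code. A small NumPy sketch follows; the function names are illustrative, and absolute values are used so the Minkowski form is valid for any positive h.

```python
import numpy as np

def minkowski(xi, xj, h):
    """General Minkowski distance; h = 2 gives Euclidean, h = 1 Manhattan."""
    return (np.abs(xi - xj) ** h).sum() ** (1.0 / h)

def euclidean(xi, xj):
    return np.sqrt(((xi - xj) ** 2).sum())

def manhattan(xi, xj):
    return np.abs(xi - xj).sum()

def chebychev(xi, xj):
    """Points are 'different' if they differ on any single attribute."""
    return np.abs(xi - xj).max()

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])
print(minkowski(xi, xj, 2), euclidean(xi, xj), manhattan(xi, xj), chebychev(xi, xj))
```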



Distance functions for binary and
nominal attributes
◼ Binary attribute: has two values or states but
no ordering relationships, e.g.,
❑ Gender: male and female.
◼ We use a confusion matrix to introduce the
distance functions/measures.
◼ Let the ith and jth data points be xi and xj
(vectors)



Confusion matrix

◼ For two data points xi and xj with binary attributes, count over the attributes:
❑ a = number of attributes where xi = 1 and xj = 1
❑ b = number of attributes where xi = 1 and xj = 0
❑ c = number of attributes where xi = 0 and xj = 1
❑ d = number of attributes where xi = 0 and xj = 0



Symmetric binary attributes
◼ A binary attribute is symmetric if both of its
states (0 and 1) have equal importance, and
carry the same weights, e.g., male and
female of the attribute Gender
◼ Distance function: Simple Matching
Coefficient, proportion of mismatches of their
values
$$dist(\mathbf{x}_i, \mathbf{x}_j) = \frac{b + c}{a + b + c + d}$$



Symmetric binary attributes: example



Asymmetric binary attributes
◼ Asymmetric: if one of the states is more
important or more valuable than the other.
❑ By convention, state 1 represents the more
important state, which is typically the rare or
infrequent state.
❑ Jaccard coefficient is a popular measure
$$dist(\mathbf{x}_i, \mathbf{x}_j) = \frac{b + c}{a + b + c}$$
❑ We can have some variations, adding weights
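Both measures follow directly from the a, b, c, d counts of the confusion matrix. A short sketch (the function name is illustrative; 0/1 NumPy vectors are assumed):

```python
import numpy as np

def binary_distances(xi, xj):
    """Simple matching and Jaccard distances for 0/1 attribute vectors."""
    a = int(np.sum((xi == 1) & (xj == 1)))
    b = int(np.sum((xi == 1) & (xj == 0)))
    c = int(np.sum((xi == 0) & (xj == 1)))
    d = int(np.sum((xi == 0) & (xj == 0)))
    smc = (b + c) / (a + b + c + d)     # symmetric binary attributes
    jaccard = (b + c) / (a + b + c)     # asymmetric binary attributes
    return smc, jaccard

xi = np.array([1, 0, 1, 1, 0])
xj = np.array([1, 1, 0, 1, 0])
print(binary_distances(xi, xj))         # (0.4, 0.5)
```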



Nominal attributes

◼ Nominal attributes: with more than two states or values.
❑ the commonly used distance measure is also
based on the simple matching method.
❑ Given two data points xi and xj, let the number of
attributes be r, and the number of values that
match in xi and xj be q.
$$dist(\mathbf{x}_i, \mathbf{x}_j) = \frac{r - q}{r}$$
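A direct implementation of this simple matching distance for nominal attributes (the function name is illustrative):

```python
def nominal_distance(xi, xj):
    """(r - q) / r, where q is the number of matching attribute values."""
    r = len(xi)
    q = sum(1 for a, b in zip(xi, xj) if a == b)
    return (r - q) / r

print(nominal_distance(["Apple", "Red", "Small"],
                       ["Apple", "Green", "Small"]))   # 1/3 ≈ 0.333
```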



Distance function for text documents
◼ A text document consists of a sequence of
sentences and each sentence consists of a
sequence of words.
◼ To simplify: a document is usually considered a
“bag” of words in document clustering.
❑ Sequence and position of words are ignored.
◼ A document is represented with a vector just like a
normal data point.
◼ It is common to use similarity to compare two
documents rather than distance.
❑ The most commonly used similarity function is the cosine
similarity. We will study this later.
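Although cosine similarity is covered later, a bag-of-words sketch shows the idea: each document becomes a word-count vector, and the cosine of the angle between the two vectors measures their similarity. The tokenization here is deliberately naive and only for illustration.

```python
import numpy as np
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Bag-of-words cosine similarity; word order and position are ignored."""
    c1, c2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    vocab = sorted(set(c1) | set(c2))
    v1 = np.array([c1[w] for w in vocab], dtype=float)
    v2 = np.array([c2[w] for w in vocab], dtype=float)
    return float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine_similarity("data mining finds clusters in data",
                        "clustering groups similar data"))
```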



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Data standardization
◼ In the Euclidean space, standardization of attributes
is recommended so that all attributes can have
equal impact on the computation of distances.
◼ Consider the following pair of data points
❑ xi: (0.1, 20) and xj: (0.9, 720).

$$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(0.9 - 0.1)^2 + (720 - 20)^2} = 700.000457$$

◼ The distance is almost completely dominated by (720 − 20) = 700.
◼ Standardize attributes: to force the attributes to have
a common value range
Interval-scaled attributes
◼ Their values are real numbers following a
linear scale.
❑ The difference in Age between 10 and 20 is the
same as that between 40 and 50.
❑ The key idea is that intervals keep the same
importance through out the scale
◼ Two main approaches to standardize interval
scaled attributes, range and z-score. f is an
attribute
$$range(x_{if}) = \frac{x_{if} - \min(f)}{\max(f) - \min(f)},$$
Interval-scaled attributes (cont …)
◼ Z-score: transforms the attribute values so that they
have a mean of zero and a mean absolute
deviation of 1. The mean absolute deviation of
attribute f, denoted by sf, is computed as follows
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right),$$

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right),$$

Z-score:
$$z(x_{if}) = \frac{x_{if} - m_f}{s_f}.$$
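Both standardizations are one-liners over a column of attribute values. A sketch assuming the column is a NumPy array; note that this z-score uses the mean absolute deviation s_f defined above, not the usual standard deviation.

```python
import numpy as np

def range_scale(col):
    """Range standardization: (x - min) / (max - min); result lies in [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

def z_score(col):
    """Z-score with the mean absolute deviation s_f as the scale."""
    m_f = col.mean()
    s_f = np.abs(col - m_f).mean()
    return (col - m_f) / s_f

col = np.array([20.0, 720.0, 150.0, 400.0])
print(range_scale(col))
print(z_score(col))
```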
Ratio-scaled attributes
◼ Numeric attributes, but unlike interval-scaled
attributes, their scales are exponential,
◼ For example, the total amount of
microorganisms that evolve in a time t is
approximately given by
$Ae^{Bt}$,
❑ where A and B are some positive constants.
◼ Do a log transform: $\log(x_{if})$
❑ Then treat it as an interval-scaled attribute



Nominal attributes
◼ Sometimes, we need to transform nominal
attributes to numeric attributes.
◼ Transform nominal attributes to binary
attributes.
❑ The number of values of a nominal attribute is v.
❑ Create v binary attributes to represent them.
❑ If a data instance for the nominal attribute takes a
particular value, the value of its binary attribute is
set to 1, otherwise it is set to 0.
◼ The resulting binary attributes can be used as
numeric attributes, with two values, 0 and 1.
Nominal attributes: an example
◼ Nominal attribute fruit: has three values,
❑ Apple, Orange, and Pear
◼ We create three binary attributes called,
Apple, Orange, and Pear in the new data.
◼ If a particular data instance in the original
data has Apple as the value for fruit,
❑ then in the transformed data, we set the value of
the attribute Apple to 1, and
❑ the values of attributes Orange and Pear to 0
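The fruit example can be written as a tiny transformation; the helper name and the category tuple are illustrative.

```python
def nominal_to_binary(values, categories=("Apple", "Orange", "Pear")):
    """Replace the nominal attribute 'fruit' with three binary attributes."""
    return [{cat: int(v == cat) for cat in categories} for v in values]

print(nominal_to_binary(["Apple", "Pear"]))
# [{'Apple': 1, 'Orange': 0, 'Pear': 0}, {'Apple': 0, 'Orange': 0, 'Pear': 1}]
```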



Ordinal attributes

◼ Ordinal attribute: an ordinal attribute is like a nominal attribute, but its values have a numerical ordering. E.g.,
❑ Age attribute with values: Young, MiddleAge and
Old. They are ordered.
❑ Common approach to standardization: treat it as
an interval-scaled attribute.



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Mixed attributes
◼ Our distance functions given are for data with
all numeric attributes, or all nominal
attributes, etc.
◼ Practical data has different types:
❑ Any subset of the 6 types of attributes,
◼ interval-scaled,
◼ symmetric binary,
◼ asymmetric binary,
◼ ratio-scaled,
◼ ordinal and
◼ nominal



Convert to a single type
◼ One common way of dealing with mixed
attributes is to
❑ Decide the dominant attribute type, and
❑ Convert the other types to this type.
◼ E.g., if most attributes in a data set are
interval-scaled,
❑ we convert ordinal attributes and ratio-scaled
attributes to interval-scaled attributes.
❑ It is also appropriate to treat symmetric binary
attributes as interval-scaled attributes.



Convert to a single type (cont …)
◼ It does not make much sense to convert a
nominal attribute or an asymmetric binary
attribute to an interval-scaled attribute,
❑ but it is still frequently done in practice by
assigning some numbers to them according to
some hidden ordering, e.g., prices of the fruits
◼ Alternatively, a nominal attribute can be
converted to a set of (symmetric) binary
attributes, which are then treated as numeric
attributes.
Combining individual distances
◼ This approach computes individual attribute distances and then combines them:

$$dist(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{f=1}^{r} \delta_{ij}^{f}\, d_{ij}^{f}}{\sum_{f=1}^{r} \delta_{ij}^{f}}$$

❑ $d_{ij}^{f}$ is the distance contribution of attribute f, and $\delta_{ij}^{f}$ is an indicator that is 0 when attribute f should be ignored for this pair (e.g., a missing value, or a 0–0 match on an asymmetric binary attribute) and 1 otherwise.
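Below is a sketch of how the combination formula can be applied to one pair of data points. The attribute-type names, the handling of missing values, and the assumption that numeric attributes are already range-standardized to [0, 1] are all illustrative choices, not part of the original slides.

```python
def mixed_distance(xi, xj, attr_types):
    """Combine per-attribute distances d_ij^f weighted by the indicators delta_ij^f."""
    num, den = 0.0, 0.0
    for f, kind in enumerate(attr_types):
        a, b = xi[f], xj[f]
        if a is None or b is None:                 # missing value: delta = 0
            continue
        if kind == "asym_binary" and a == 0 and b == 0:
            continue                               # 0-0 match carries no information
        if kind == "numeric":
            d_f = abs(a - b)                       # assumes values scaled to [0, 1]
        else:                                      # nominal or binary attribute
            d_f = 0.0 if a == b else 1.0
        num += d_f                                 # delta = 1 for this attribute
        den += 1.0
    return num / den if den else 0.0

xi = [0.2, "Apple", 1]
xj = [0.5, "Pear", 0]
print(mixed_distance(xi, xj, ["numeric", "nominal", "asym_binary"]))
# (0.3 + 1.0 + 1.0) / 3 ≈ 0.767
```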



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



How to choose a clustering algorithm
◼ Clustering research has a long history. A vast
collection of algorithms are available.
❑ We only introduced several main algorithms.
◼ Choosing the “best” algorithm is a challenge.
❑ Every algorithm has limitations and works well with certain
data distributions.
❑ It is very hard, if not impossible, to know what distribution
the application data follow. The data may not fully follow
any “ideal” structure or distribution required by the
algorithms.
❑ One also needs to decide how to standardize the data, to
choose a suitable distance function and to select other
parameter values.



Choose a clustering algorithm (cont …)
◼ Due to these complexities, the common practice is
to
❑ run several algorithms using different distance functions
and parameter settings, and
❑ then carefully analyze and compare the results.
◼ The interpretation of the results must be based on
insight into the meaning of the original data together
with knowledge of the algorithms used.
◼ Clustering is highly application dependent and to
certain extent subjective (personal preferences).



Road map
◼ Basic concepts
◼ Types of Clusterings
◼ Types of Clusters
◼ Clustering Algorithms
◼ K-means algorithm
◼ Hierarchical clustering
◼ Distance functions
◼ Data standardization
◼ Handling mixed attributes
◼ Which clustering algorithm to use?
◼ Summary



Summary
◼ Clustering has a long history and is still an active area
❑ There are a huge number of clustering algorithms
❑ More are still coming every year.
◼ We only introduced several main algorithms. There are
many others, e.g.,
❑ density-based algorithms, sub-space clustering, scale-up
methods, neural-network-based methods, fuzzy clustering,
co-clustering, etc.
◼ Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
◼ Clustering is highly application dependent and to some
extent subjective.



Thank you !!!
Prof. Ahmed Sultan Al-Hegami
