Understanding Cluster Analysis in Data Mining

The document discusses cluster analysis and its applications. Cluster analysis involves grouping similar data objects into clusters, where objects within a cluster are similar to each other and dissimilar to objects in other clusters. It is an unsupervised learning technique used to gain insight into data distribution or as a preprocessing step. Examples of applications include marketing, land use analysis, insurance, and city planning. The document also discusses evaluating the quality of clustering results.

Uploaded by

drazil

What is Cluster Analysis?

 Cluster: a collection of data objects

 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
March 28, 2019 Data Mining: Concepts and Techniques 1
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
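
The two quality criteria above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the similarity measure and using average pairwise distance as a stand-in for similarity (a small distance means high similarity). The function name `mean_intra_inter` and the sample points are illustrative and not part of the original slides.

```python
from itertools import combinations
from math import dist


def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label contribute to intra-cluster distance.
        (intra if lp == lq else inter).append(dist(p, q))

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(intra), mean(inter)


# Hypothetical data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

A good clustering under this proxy has a small first value and a large second value.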


Measure the Quality of Clustering

 Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric
 There is a separate “quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Ability to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability


Data Structures
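 Data matrix (two modes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

 Dissimilarity matrix (one mode)

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$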
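
As a rough illustration of how the two structures relate, the sketch below derives a lower-triangular dissimilarity matrix from a small data matrix. Euclidean distance is an assumption made here for concreteness, and the name `dissimilarity_matrix` and the sample rows are illustrative only.

```python
from math import dist


def dissimilarity_matrix(data):
    """Lower-triangular d(i, j) for an n-by-p data matrix (Euclidean distance assumed)."""
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(len(data))]


# Hypothetical data matrix with n = 3 objects and p = 2 variables.
data = [(1.0, 2.0), (4.0, 6.0), (1.0, 2.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Row i holds d(i, 1) through d(i, i), so the diagonal is always 0 and only n(n-1)/2 distinct values need to be stored.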


Types of Data in Cluster Analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types


Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

where

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

 Calculate the standardized measurement (z-score):

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

 Using mean absolute deviation is more robust than using standard deviation
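
The standardization steps above can be sketched for a single variable f as follows. The function name `standardize` and the sample values are illustrative, and the error raised for a constant variable is a choice made for this sketch, not something the slides specify.

```python
def standardize(values):
    """z-scores for one variable, using the mean absolute deviation s_f."""
    n = len(values)
    m = sum(values) / n                      # m_f: mean of the variable
    s = sum(abs(x - m) for x in values) / n  # s_f: mean absolute deviation
    if s == 0:
        raise ValueError("all values are identical; the z-score is undefined")
    return [(x - m) / s for x in values]


# Hypothetical measurements of one variable across four objects.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)  # [-1.5, -0.5, 0.5, 1.5]
```

Here m_f = 5 and s_f = 2, so each value is shifted by 5 and divided by 2.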


Similarity and Dissimilarity Between
Objects
 Distances are normally used to measure the similarity or
dissimilarity between two data objects
 Some popular ones include: Minkowski distance:

$$ d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} $$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
 If q = 1, d is Manhattan distance:

$$ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$
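
A minimal sketch of the Minkowski distance follows, with q as a parameter so that q = 1 gives the Manhattan distance. The function name `minkowski` and the two sample objects are illustrative.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical 2-dimensional objects.
i = (1.0, 2.0)
j = (4.0, 6.0)
manhattan = minkowski(i, j, 1)  # |1 - 4| + |2 - 6| = 7
order_two = minkowski(i, j, 2)  # sqrt(3^2 + 4^2) = 5
print(manhattan, order_two)
```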


Similarity and Dissimilarity Between
Objects (Cont.)
 If q = 2, d is Euclidean distance:

$$ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} $$

 Properties
 $d(i, j) \ge 0$
 $d(i, i) = 0$
 $d(i, j) = d(j, i)$
 $d(i, j) \le d(i, k) + d(k, j)$
 Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
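
The weighted distance mentioned above can take several forms. The sketch below assumes one common form, a weighted Euclidean distance in which each squared difference is multiplied by a non-negative weight, and then spot-checks the listed properties on three sample points. The name `weighted_euclidean`, the weights, and the points are illustrative assumptions.

```python
from math import sqrt


def weighted_euclidean(x, y, w):
    """Euclidean distance with a non-negative weight per variable (assumed form)."""
    return sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))


# Hypothetical objects; unit weights reduce to the ordinary Euclidean distance.
i, j, k = (0.0, 0.0), (3.0, 4.0), (3.0, 0.0)
w = (1.0, 1.0)

d_ij = weighted_euclidean(i, j, w)
d_ik = weighted_euclidean(i, k, w)
d_kj = weighted_euclidean(k, j, w)

print(d_ij >= 0)                               # non-negativity
print(weighted_euclidean(i, i, w) == 0)        # d(i, i) = 0
print(d_ij == weighted_euclidean(j, i, w))     # symmetry
print(d_ij <= d_ik + d_kj)                     # triangle inequality
```

Checking three points does not prove the properties in general; it only illustrates what each one states.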


Binary Variables
 A contingency table for binary data:

                  Object j
                  1       0       sum
Object i   1      a       b       a + b
           0      c       d       c + d
           sum    a + c   b + d   p

 Distance measure for symmetric binary variables:

$$ d(i, j) = \frac{b + c}{a + b + c + d} $$

 Distance measure for asymmetric binary variables:

$$ d(i, j) = \frac{b + c}{a + b + c} $$

 Jaccard coefficient (similarity measure for asymmetric binary variables):

$$ sim_{Jaccard}(i, j) = \frac{a}{a + b + c} $$
Dissimilarity between Binary Variables

 Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

$$ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$

$$ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$

$$ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
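
The three values in this example can be reproduced with a short sketch. It encodes only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) with Y and P as 1 and N as 0, counts a, b, c as in the contingency table, and applies the asymmetric distance (b + c) / (a + b + c). The function names are illustrative.

```python
def binary_counts(x, y):
    """Return (a, b, c): counts of 1/1, 1/0 and 0/1 positions for objects x and y."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    return a, b, c


def asymmetric_dissimilarity(x, y):
    """(b + c) / (a + b + c); undefined when both objects are all zeros."""
    a, b, c = binary_counts(x, y)
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 from the table above.
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

d_jack_mary = asymmetric_dissimilarity(jack, mary)
d_jack_jim = asymmetric_dissimilarity(jack, jim)
d_jim_mary = asymmetric_dissimilarity(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))  # 0.33 0.67 0.75
```

The Jaccard similarity for the same pair is simply one minus this distance, since a / (a + b + c) and (b + c) / (a + b + c) sum to 1.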
Nominal Variables

 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
 m: # of matches, p: total # of variables

$$ d(i, j) = \frac{p - m}{p} $$

 Method 2: use a large number of binary variables
 creating a new binary variable for each of the M nominal states
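
Both methods can be sketched in a few lines. The first function implements simple matching as (p - m) / p. The second shows one way to create a binary variable per nominal state. The function names and the sample objects are illustrative assumptions.

```python
def simple_matching(x, y):
    """Method 1: (p - m) / p, where m is the number of matching variables."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable for each of the M nominal states."""
    return [1 if value == s else 0 for s in states]


# Hypothetical objects described by three nominal variables.
d = simple_matching(("red", "small", "round"), ("red", "large", "round"))
encoded = one_hot("blue", ["red", "yellow", "blue", "green"])
print(d, encoded)
```

After the encoding in Method 2, the binary-variable measures from the earlier slides apply directly.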


Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
 map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

 compute the dissimilarity using methods for interval-scaled variables
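
The rank mapping above can be sketched as follows, assuming the ordered list of states for the variable is known in advance. The function name `ordinal_to_unit_interval` and the medal example are illustrative.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), with ranks r in 1..M."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    M = len(ordered_states)
    if M < 2:
        raise ValueError("at least two ordered states are required")
    return [(rank[v] - 1) / (M - 1) for v in values]


# Hypothetical ordinal variable with M = 3 states, listed from lowest to highest.
z = ordinal_to_unit_interval(["bronze", "gold", "silver"], ["bronze", "silver", "gold"])
print(z)  # [0.0, 1.0, 0.5]
```

The resulting values lie in [0, 1] and can be passed to any interval-scaled distance.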


Ratio-Scaled Variables

 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
 Methods:
 treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
 apply logarithmic transformation: $y_{if} = \log(x_{if})$
 treat them as continuous ordinal data and treat their ranks as interval-scaled
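
To see why the logarithmic transformation helps, the sketch below generates values of the form A e^(Bt) and shows that their logarithms are evenly spaced. The constants A = 2 and B = 0.5 and the function name `log_transform` are assumptions chosen only for illustration.

```python
from math import exp, log


def log_transform(values):
    """Apply y = log(x) to positive ratio-scaled measurements."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [log(v) for v in values]


# Hypothetical measurements following A * e^(B * t) at t = 0, 1, 2, 3.
A, B = 2.0, 0.5
x = [A * exp(B * t) for t in range(4)]
y = log_transform(x)

# Gaps between consecutive raw values grow, but gaps between logs stay near B.
raw_gaps = [b - a for a, b in zip(x, x[1:])]
log_gaps = [b - a for a, b in zip(y, y[1:])]
print(raw_gaps)
print(log_gaps)
```

After the transformation the values behave like an interval-scaled variable, so the earlier distance measures no longer overweight the large measurements.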


Variables of Mixed Types

 A database may contain all the six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
 One may use a weighted formula to combine their effects

$$ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

 f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled
 compute ranks $r_{if}$ and

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

 and treat $z_{if}$ as interval-scaled
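
A rough sketch of the weighted formula follows. Two points are assumptions made here because the slide does not spell them out: the indicator is taken to be 0 when either value is missing (represented as `None`) and 1 otherwise, and the normalized distance for an interval variable is taken to be the absolute difference divided by the variable's range. Ordinal variables are assumed to have been mapped to [0, 1] beforehand, so they can be passed as interval variables with range 1. All names are illustrative.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities; kinds[f] is 'nominal' or 'interval'."""
    numerator = 0.0
    denominator = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue  # indicator = 0: the variable contributes nothing (assumption)
        if kinds[f] == "nominal":  # also covers binary variables
            d = 0.0 if a == b else 1.0
        elif kinds[f] == "interval":
            d = abs(a - b) / ranges[f]  # normalized by the variable's range (assumption)
        else:
            raise ValueError(f"unknown variable kind: {kinds[f]}")
        numerator += d
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no variable is available for both objects")
    return numerator / denominator


# Hypothetical objects: one nominal variable, one interval variable, one pre-mapped ordinal.
kinds = ("nominal", "interval", "interval")
ranges = (None, 40.0, 1.0)
x = ("red", 10.0, 0.5)
y = ("blue", 20.0, 0.5)
d_xy = mixed_dissimilarity(x, y, kinds, ranges)  # (1 + 0.25 + 0) / 3

x_missing = ("red", None, 0.5)
d_missing = mixed_dissimilarity(x_missing, y, kinds, ranges)  # (1 + 0) / 2
print(d_xy, d_missing)
```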


Vector Objects

 Vector objects: keywords in documents, gene features in micro-arrays, etc.
 Broad applications: information retrieval, biological taxonomy, etc.
 Cosine measure

 A variant: Tanimoto coefficient
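
The formulas for these two measures did not survive in the slide text, so the sketch below uses their standard definitions as an assumption: the cosine measure is the dot product divided by the product of the vector lengths, and the Tanimoto coefficient is the dot product divided by (x·x + y·y - x·y). The function names and the sample keyword vectors are illustrative.

```python
from math import sqrt


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine measure: x.y / (|x| |y|); undefined for a zero vector."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))


def tanimoto(x, y):
    """Tanimoto coefficient: x.y / (x.x + y.y - x.y)."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# Hypothetical keyword-presence vectors for two documents.
x = (1, 1, 0, 0)
y = (1, 0, 1, 0)
print(cosine(x, y), tanimoto(x, y))
```

For binary vectors such as these, the Tanimoto coefficient equals the Jaccard coefficient: shared keywords divided by keywords present in either document.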


Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
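
As an illustration of the partitioning approach and its sum-of-squared-errors criterion, here is a minimal k-means sketch. It is not a production implementation: the initial centroids are simply the first k points, which is a simplifying assumption, and the function names and sample points are illustrative.

```python
from math import dist


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c])) for p in points]


def kmeans(points, k, iterations=100):
    """Basic k-means; returns (labels, centroids, sum of squared errors)."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialization (assumption)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the coordinate-wise mean of the members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))
    return labels, centroids, sse


# Hypothetical 2-dimensional points forming two obvious groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (6.0, 5.0)]
labels, centroids, sse = kmeans(points, k=2)
print(labels, centroids, sse)
```

Each iteration can only lower or keep the sum of squared errors, which is why the loop stops once the centroids no longer change. The result still depends on the initial centroids.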


Major Clustering Approaches (II)
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
Summary
 Cluster analysis groups objects based on their similarity
and has wide applications
 Measure of similarity can be computed for various types
of data
 Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
 Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
 Many research issues in cluster analysis remain open


Problems and Challenges

 Considerable progress has been made in scalable

clustering methods
 Partitioning: k-means, k-medoids, CLARANS
 Hierarchical: BIRCH, ROCK, CHAMELEON
 Density-based: DBSCAN, OPTICS, DenClue
 Grid-based: STING, WaveCluster, CLIQUE
 Model-based: EM, Cobweb, SOM
 Frequent pattern-based: pCluster
 Constraint-based: COD, constrained-clustering
 Current clustering techniques do not address all the requirements adequately; clustering is still an active area of research