Unsupervised classification course


Mining

Clustering I

Vasileios Megalooikonomou

(based on notes by Jiawei Han and Micheline Kamber)

Agenda

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary

Pattern Recognition

Spatial Data Analysis

create thematic maps in GIS by clustering feature spaces

detect spatial clusters and explain them in spatial data mining

e.g., land use, city planning, earth-quake studies

Image Processing

Economic Science (especially market research) e.g.,

marketing, insurance

WWW

Document classification

Cluster Weblog data to discover groups of similar access patterns

A good clustering method will produce high-quality clusters with

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape (not just spherical

clusters)

Minimal requirements for domain knowledge to determine

input parameters (such as # of clusters)

Able to deal with noise and outliers

Insensitive to order of input records

Incremental clustering

High dimensionality (especially very sparse and highly

skewed data)

Incorporation of user-specified constraints

Interpretability and usability (close to semantics)


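Data matrix (two modes): n objects, p variables

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode)

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$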

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.

There is a separate quality function that measures the goodness of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define "similar enough" or "good enough": the answer is typically highly subjective.

Interval-valued variables

Continuous measurements of a roughly linear scale (e.g., weight,

height, temperature, etc)

Standardize data (to avoid dependence on the measurement units)

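Calculate the mean absolute deviation of a variable f with n measurements:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

where

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

Calculate the standardized measurement (z-score):

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

Using the mean absolute deviation is more robust than using the standard deviation.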
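A minimal Python sketch of this standardization, assuming a variable is given as a plain list of numbers. The function name `standardize` and the sample heights are illustrative and do not come from the slides.

```python
def standardize(values):
    """Return z-scores that use the mean absolute deviation as the scale."""
    n = len(values)
    mean = sum(values) / n                         # m_f
    mad = sum(abs(x - mean) for x in values) / n   # s_f
    if mad == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mean) / mad for x in values]      # z_if


# Hypothetical measurements of one interval-scaled variable (heights in cm).
heights_cm = [150.0, 160.0, 170.0, 180.0]
z_scores = standardize(heights_cm)
print(z_scores)
```

After this step the values no longer depend on the measurement unit: expressing the same heights in metres would give the same z-scores.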

Distances are normally used to measure the similarity

or dissimilarity between two data objects

Some popular ones include the Minkowski distance:

$$ d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} $$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

If q = 1, d is the Manhattan distance:

$$ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$

If q = 2, d is the Euclidean distance:

$$ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} $$

Properties

$d(i,j) \ge 0$

$d(i,i) = 0$

$d(i,j) = d(j,i)$ (symmetry)

$d(i,j) \le d(i,k) + d(k,j)$ (triangle inequality)

One can also use the Pearson product-moment correlation or other dissimilarity measures.
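The Minkowski family can be sketched in a few lines of Python. The function name `minkowski` and the two sample points are assumptions made for illustration; objects are taken to be equal-length tuples of numbers.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


x = (1.0, 2.0)
y = (4.0, 6.0)
manhattan = minkowski(x, y, 1)   # |1-4| + |2-6|
euclidean = minkowski(x, y, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

Swapping the arguments or comparing an object with itself gives a quick check of the symmetry and zero-distance properties.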

Binary Variables

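A contingency table for binary data

$$
\begin{array}{c|cc|c}
 & \text{Object } j = 1 & \text{Object } j = 0 & \text{sum} \\ \hline
\text{Object } i = 1 & a & b & a+b \\
\text{Object } i = 0 & c & d & c+d \\ \hline
\text{sum} & a+c & b+d & p
\end{array}
$$

Simple matching coefficient (invariant, if the binary variable is symmetric (both states same weight)):

$$ d(i,j) = \frac{b+c}{a+b+c+d} $$

Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric, e.g., outcomes of a disease test):

$$ d(i,j) = \frac{b+c}{a+b+c} $$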

Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

the attributes other than gender are asymmetric binary

let the values Y and P be set to 1, and the value N be set to 0

$$ d(\text{jack}, \text{mary}) = \frac{0+1}{2+0+1} = 0.33 $$

$$ d(\text{jack}, \text{jim}) = \frac{1+1}{1+1+1} = 0.67 $$

$$ d(\text{jim}, \text{mary}) = \frac{1+2}{1+1+2} = 0.75 $$
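The three dissimilarities in this example can be reproduced with a short Python sketch. The helper names `contingency` and `jaccard_dissimilarity` are illustrative; the 0/1 vectors encode Fever, Cough and Test-1 to Test-4 with Y and P mapped to 1 and N mapped to 0, and gender is left out.

```python
def contingency(i, j):
    """Count the four cells (a, b, c, d) of the binary contingency table."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def jaccard_dissimilarity(i, j):
    """(b + c) / (a + b + c): joint absences (d) are ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        raise ValueError("both objects are all zeros; the ratio is undefined")
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = jaccard_dissimilarity(jack, mary)
d_jack_jim = jaccard_dissimilarity(jack, jim)
d_jim_mary = jaccard_dissimilarity(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Jack and Mary come out as the most similar pair, Jim and Mary as the least similar.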

Nominal Variables

A generalization of the binary variable in that it can take

more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

m: # of matches (# of variables for which i and j are in the same

state), p: total # of variables

$$ d(i,j) = \frac{p - m}{p} $$

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states

Ordinal Variables

An ordinal variable can be discrete or continuous

Resembles nominal var but order is important, e.g., rank

Can be treated like interval-scaled

replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$, where ordinal variable f has $M_f$ states and $x_{if}$ is the value of f for the i-th object

map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

compute the dissimilarity using methods for interval-scaled variables
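As a small illustration of the rank mapping, the following Python sketch converts ordinal values to the interval [0, 1]. The function name `ordinal_to_unit_interval` and the medal example are hypothetical; the caller is assumed to supply the states in increasing order.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map each ordinal value to z = (r - 1) / (M - 1), where r is its rank."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal variable with M = 3 states.
states = ["bronze", "silver", "gold"]
z = ordinal_to_unit_interval(["bronze", "gold", "silver"], states)
print(z)
```

The resulting values can then be compared with any of the distances used for interval-scaled variables.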

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, where A and B are positive constants (e.g., decay of radioactive elements)

Methods:

treat them like interval-scaled variables: not a good choice! (why?)

apply a logarithmic transformation $y_{if} = \log(x_{if})$ and treat them as interval-valued

treat them as continuous ordinal data and treat their rank as interval-scaled.

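A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

One may use a weighted formula to combine their effects:

$$ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

where the weight $\delta_{ij}^{(f)}$ is 0 if variable f should not contribute for this pair (for example, when a value is missing) and 1 otherwise

f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise.

f is interval-based: use the normalized distance

f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled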
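A rough Python sketch of combining per-variable dissimilarities for objects of mixed type. Several details are assumptions made for illustration only: the function name `mixed_dissimilarity`, the use of `None` for a missing value (which gives that variable weight 0), and normalizing interval variables by a caller-supplied range (max minus min). Only nominal and interval variables are handled here; ordinal values would first be mapped to [0, 1] and then passed as interval variables with range 1.

```python
def mixed_dissimilarity(i, j, kinds, ranges):
    """Average the per-variable dissimilarities over the variables that count.

    kinds[f]  is "nominal" or "interval";
    ranges[f] is the value range of interval variable f (ignored otherwise).
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        if i[f] is None or j[f] is None:
            continue                      # weight 0: variable does not count
        if kind == "nominal":
            d = 0.0 if i[f] == j[f] else 1.0
        elif kind == "interval":
            d = abs(i[f] - j[f]) / ranges[f]
        else:
            raise ValueError("unsupported variable kind: " + kind)
        numerator += d                    # weight 1 times d_ij^(f)
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no variable is available for both objects")
    return numerator / denominator


kinds = ("nominal", "interval", "interval")
ranges = (None, 20.0, 5.0)
obj_i = ("red", 10.0, None)       # third value missing
obj_j = ("blue", 15.0, 3.0)
d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
print(d_ij)
```

Here the colour mismatch contributes 1, the interval variable contributes 5/20, and the third variable is skipped, so the result is their average over two variables.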


Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate

them by some criterion (k-means, k-medoids)

Hierarchical algorithms: Create a hierarchical decomposition

(agglomerative or divisive) of the set of data (or objects) using some

criterion (CURE, Chameleon, BIRCH)

Density-based: based on connectivity and density functions grow a

cluster as long as density in the neighborhood exceeds a threshold

(DBSCAN, CLIQUE)

Grid-based: based on a multiple-level grid structure (i.e., quantized

space) (STING, CLIQUE)

Model-based: A model is hypothesized for each of the clusters and the

idea is to find the best fit of the data to the given model (EM)


Partitioning method: Construct a partition of a database D

of n objects into a set of k clusters

Given a k, find a partition of k clusters that optimizes the

chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

4 steps:

1. Partition objects into k nonempty subsets

2. Compute seed points as the centroids of the clusters of

the current partition. The centroid is the center (mean

point) of the cluster.

3. Assign each object to the cluster with the nearest seed

point.

4. Go back to Step 2, stop when no more new assignment

(or fractional drop of SSE or MSE is less than a

threshold).
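The four steps can be sketched in plain Python as follows. This is an illustrative variant, not a reference implementation: the initial partition is obtained by picking k distinct objects as seed points (an assumption; the slides only ask for k nonempty subsets), squared Euclidean distance is used, and the names `k_means`, `centroid` and `squared_distance` are made up for this example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Mean point of a nonempty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # Step 1 (assumed initialization)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # Step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 2: recompute the seed points as centroids of the current clusters.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                   # keep the old center if a cluster empties
                centers[c] = centroid(members)
    return centers, assignment


data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignment = k_means(data, k=2)
print(centers, assignment)
```

On this small data set the two visually obvious groups are recovered; on harder data the result depends on the initial seeds, which is the local-optimum behaviour noted under the method's properties.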

Example (figure not reproduced; axes range from 0 to 10)

Strengths

Relatively efficient: O(tkn), where n is # objects, k is # clusters,

and t is # iterations. Normally, k, t << n.

Often terminates at a local optimum. The global optimum may

be found using techniques such as: deterministic annealing and

genetic algorithms

Weaknesses

Applicable only when mean is defined, then what about

categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes

A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical

objects

Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: k-prototype

method

Find representative (i.e., the most centrally located) objects,

called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces one of the

medoids by one of the non-medoids if it improves the total distance of the

resulting clustering

PAM works effectively for small data sets, but does not scale well for large

data sets

More robust than k-means

Complexity: $O(k(n-k)^2)$ for n objects and k clusters

CLARANS (Ng & Han, 1994): Randomized sampling

Focusing + spatial data structure (Ester et al., 1995)

1. Select k representative objects arbitrarily

2. For each pair of non-selected object h and selected object i, calculate the total swapping cost $TC_{ih}$

3. For each pair of i and h, if $TC_{ih} < 0$, i is replaced by h; then assign each non-selected object to the most similar representative object

4. Repeat steps 2-3 until there is no change
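A compact Python sketch of this swap-based search. It is a simplification under stated assumptions: the total swapping cost is computed as the difference between the clustering cost after and before the swap, the first improving swap is applied immediately, the first k objects serve as the arbitrary initial medoids, and Manhattan distance is used. The names `pam`, `total_cost` and `manhattan` are illustrative.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(points, medoids, dist):
    """Sum of distances from every object to its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist=manhattan):
    medoids = list(points[:k])                    # Step 1: arbitrary choice
    cost = total_cost(points, medoids, dist)
    changed = True
    while changed:                                # Step 4: repeat until stable
        changed = False
        for pos in range(k):                      # selected object i
            for h in points:                      # non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                candidate_cost = total_cost(points, candidate, dist)
                if candidate_cost - cost < 0:     # Steps 2-3: swap if TC_ih < 0
                    medoids, cost = candidate, candidate_cost
                    changed = True
    # Assign each object to the most similar representative object.
    labels = [min(range(k), key=lambda m: dist(p, medoids[m])) for p in points]
    return medoids, labels


data = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids, labels = pam(data, k=2)
print(medoids, labels)
```

Every accepted swap strictly lowers the total cost, so the loop terminates; each pass evaluates k(n-k) candidate swaps, and each evaluation scans all objects, which is why the approach does not scale to large data sets.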

(Figure not reproduced: plots on axes from 0 to 10 showing objects i and h, with one case labeled $C_{jih} = 0$.)

CLARA (Kaufmann and Rousseeuw, 1990), $O(ks^2 + k(n-k))$

Built in statistical analysis packages, such as S+

It draws multiple samples of the data set, applies PAM on each

sample, and gives the best clustering as the output

Strength: deals with larger data sets than PAM

Weakness:

efficiency depends on the sample size and

the sample: A good clustering based on samples will not

necessarily represent a good clustering of the whole data set if

the sample is biased

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994), $O(n^2)$

CLARANS draws a sample of neighbors dynamically

The clustering process can be presented as searching a graph

where every node is a potential solution, that is, a set of k

medoids

If the local optimum is found, CLARANS starts with new

randomly selected node in search for a new local optimum

It is more efficient and scalable than both PAM and CLARA

Focusing techniques and spatial access structures may further

improve its performance (Ester et al.95)


Hierarchical Clustering

Use distance matrix as clustering criteria. This

method does not require the number of clusters k as an

input, but needs a termination condition

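(Figure not reproduced: starting from the single objects a, b, c, d, e at Step 0, the agglomerative method (AGNES) merges them into ab, de, cde and finally abcde at Step 4; the divisive method (DIANA) proceeds in the reverse direction, from abcde back to the single objects.)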

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix:

each cluster is represented by all of its objects and the similarity between clusters is

measured by the similarity of the closest pair of data points belonging to different

clusters

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster
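A naive Python sketch of agglomerative merging with the single-link rule. It is meant only to illustrate the idea and is not the Splus implementation; the names `agnes` and `single_link` and the `num_clusters` stopping rule (one possible termination condition) are assumptions.

```python
def single_link(c1, c2, dist):
    """Distance of the closest pair of points taken from two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)


def agnes(points, dist, num_clusters=1):
    clusters = [[p] for p in points]          # every object starts alone
    merges = []                               # (distance, cluster, cluster)
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


values = [1.0, 2.0, 6.0, 7.0, 15.0]
clusters, merges = agnes(values, dist=lambda p, q: abs(p - q), num_clusters=2)
merge_distances = [d for d, _, _ in merges]
print(clusters, merge_distances)
```

The recorded merge distances never decrease, which matches the non-descending order described above; running with `num_clusters=1` continues until all objects share one cluster.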


Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own


Major weakness of agglomerative clustering methods

do not scale well: time complexity of at least $O(n^2)$, where n is the number of total objects

can never undo (backtrack) what was done previously

clustering

BIRCH (1996): uses CF-tree (clustering feature tree) and

incrementally adjusts the quality of sub-clusters

CURE (1998): selects well-scattered (representative) points

from the cluster and then shrinks them towards the center of

the cluster by a specified fraction

CHAMELEON (1999): hierarchical clustering using dynamic

modeling

BIRCH (1996)

Birch: Balanced Iterative Reducing and Clustering using

Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD96)

Incrementally construct a CF (Clustering Feature) tree, a

hierarchical data structure for multiphase clustering

Phase 1: scan DB to build an initial in-memory CF tree (a multi-level

compression of the data that tries to preserve the inherent clustering

structure of the data)

Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of

the CF-tree

improves the quality with a few additional scans

Weakness: handles only numeric data, and sensitive to the order

of the data record.

Clustering Feature: CF = (N, LS, SS)

N: Number of data points

LS: $\sum_{i=1}^{N} X_i$

SS: $\sum_{i=1}^{N} X_i^2$

Clustering features are additive

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
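The clustering feature of the five example points, and the additivity property, can be checked with a short Python sketch. The function names `clustering_feature` and `merge_cf` are illustrative, and SS is kept per dimension to match the form used in the example.

```python
def clustering_feature(points):
    """Return (N, LS, SS) with LS and SS computed per dimension."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return (n, ls, ss)


def merge_cf(cf1, cf2):
    """Additivity: the CF of a union is the component-wise sum of the CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (
        n1 + n2,
        tuple(a + b for a, b in zip(ls1, ls2)),
        tuple(a + b for a, b in zip(ss1, ss2)),
    )


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf_all = clustering_feature(points)
cf_merged = merge_cf(clustering_feature(points[:2]), clustering_feature(points[2:]))
print(cf_all, cf_merged)
```

Because two sub-clusters can be combined without revisiting their points, a tree of such summaries can be maintained incrementally while the data is scanned.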

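CF Tree

(Figure not reproduced: a CF tree with B = 7 and L = 6. The root holds the entries CF1, CF2, CF3, ..., CF6, each with a pointer to a child node (child1, child2, child3, ..., child6). A non-leaf node holds CF1, CF2, CF3, ..., CF5 with pointers child1, child2, child3, ..., child5. Leaf nodes hold CF entries, CF1, CF2, ..., CF6 in one and CF1, CF2, ..., CF4 in another, and are chained through prev and next pointers.)

L: Threshold: max diameter of subclusters at leaf nodes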

CURE (Clustering Using REpresentatives)

Stops the creation of a cluster hierarchy if a level consists of k

clusters

Uses multiple representative points to evaluate the distance

between clusters

Adjusts well to arbitrary shaped clusters and avoids single-link

effect

Consider only one point as representative of a cluster

Good only for convex shaped, similar size and density, and

if k can be reasonably estimated

Draw random sample s.

Partition sample to p partitions with size s/p

Partially cluster partitions into s/pq clusters

Eliminate outliers

By random sampling

If a cluster grows too slow, eliminate it.

Cluster partial clusters.

Label data in disk

Example (figure not reproduced; x-y scatter plots): s = 50, p = 2, s/p = 25, s/pq = 5

Shrink the multiple representative points towards the gravity center by a specified fraction.

Multiple representatives capture the shape of the cluster

ROCK: Robust Clustering using linKs,

by S. Guha, R. Rastogi, K. Shim (ICDE99).

Use links to measure similarity/proximity (# points from

different clusters who have neighbors in common) ->

# of links of two points = # common neighbors

Not distance based

Hierarchical clustering

Computational complexity: $O(n^2 + n m_m m_a + n^2 \log n)$

Basic ideas:

Similarity function and neighbors:

$$ Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} $$

Let $T_1 = \{1,2,3\}$, $T_2 = \{3,4,5\}$

$$ Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2 $$
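The similarity and link ideas can be illustrated in Python. The similarity function follows the set-overlap ratio used in the example; the neighbor rule is an assumption for this sketch (two transactions are neighbors when their similarity reaches a threshold `theta`, and a transaction is not counted as its own neighbor). The names `jaccard_similarity`, `neighbor_sets` and `links` and the sample transactions are hypothetical.

```python
def jaccard_similarity(t1, t2):
    """|T1 intersect T2| / |T1 union T2| for two collections of items."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    if not union:
        raise ValueError("both transactions are empty")
    return len(t1 & t2) / len(union)


def neighbor_sets(transactions, theta):
    """For each transaction, the indices of the others with similarity >= theta."""
    n = len(transactions)
    return [
        {j for j in range(n)
         if j != i and jaccard_similarity(transactions[i], transactions[j]) >= theta}
        for i in range(n)
    ]


def links(neighbors, i, j):
    """Number of links between i and j = number of common neighbors."""
    return len(neighbors[i] & neighbors[j])


sim = jaccard_similarity({1, 2, 3}, {3, 4, 5})

transactions = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {6, 7, 8}]
nbrs = neighbor_sets(transactions, theta=0.5)
print(sim, links(nbrs, 0, 1), links(nbrs, 0, 3))
```

Two transactions can share links even when their direct similarity is modest, because the count depends on how many other transactions are close to both.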

CHAMELEON

CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han and V. Kumar (1999), $O(n^2)$

Measures the similarity based on a dynamic model

Two clusters are merged only if the interconnectivity and

closeness (proximity) between two clusters are high

relative to the internal interconnectivity of the clusters

and closeness of items within the clusters

1. Use a graph partitioning algorithm: cluster objects into

a large number of relatively small sub-clusters

2. Use an agglomerative hierarchical clustering algorithm:

find the genuine clusters by repeatedly combining these

sub-clusters

Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partition -> Final Clusters

To construct the sparse graph: each vertex represents an object, and an edge exists between two vertices if one object is among the k most similar objects of the other
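The sparse graph construction can be sketched in Python as a brute-force k-nearest-neighbor graph. The function name `knn_graph`, the use of distance as the inverse of similarity, and the 1-D sample data are assumptions for illustration; the partitioning and merging phases are not shown.

```python
def knn_graph(points, k, dist):
    """Undirected edge set: (i, j) is present if j is among the k nearest
    objects of i, or i is among the k nearest objects of j."""
    n = len(points)
    edges = set()
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))   # store each edge once
    return edges


values = [0.0, 1.0, 2.0, 10.0, 11.0]
edges = knn_graph(values, k=1, dist=lambda p, q: abs(p - q))
print(sorted(edges))
```

With k = 1 the two distant groups end up in separate connected components, which hints at why a sparse neighbor graph is a convenient starting point for the partitioning step.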
