
Cluster Analysis

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary

October 15, 2008 Data Mining: Concepts and Techniques 1

What is Cluster Analysis?

Cluster: a collection of data objects

Similar to one another within the same cluster

Dissimilar to the objects in other clusters

Cluster analysis: grouping a set of data objects into clusters

Clustering is unsupervised classification: no predefined classes

Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms

General Applications of Clustering

Pattern Recognition

Spatial Data Analysis

create thematic maps in GIS by clustering feature spaces

detect spatial clusters and explain them in spatial data mining

Image Processing

Economic Science (especially market research)

WWW

Document classification

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

What Is Good Clustering?

A good clustering method will produce high-quality clusters with

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Ability to deal with noise and outliers

Insensitivity to the order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability


Data Structures

Data matrix (two modes): n objects × p variables

  [ x_11  ...  x_1f  ...  x_1p ]
  [ ...   ...  ...   ...  ...  ]
  [ x_i1  ...  x_if  ...  x_ip ]
  [ ...   ...  ...   ...  ...  ]
  [ x_n1  ...  x_nf  ...  x_np ]

Dissimilarity matrix (one mode): n × n, storing d(i, j)

  [ 0                            ]
  [ d(2,1)  0                    ]
  [ d(3,1)  d(3,2)  0            ]
  [ :       :       :            ]
  [ d(n,1)  d(n,2)  ...  ...  0  ]

Measure the Quality of Clustering

Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j)

There is a separate “quality” function that measures the “goodness” of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.

Type of data in clustering analysis

Interval-scaled variables:

Binary variables:

Nominal, ordinal, and ratio variables:

Variables of mixed types:

Interval-valued variables

Standardize data

Calculate the mean absolute deviation:

  s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)

where m_f is the mean of variable f.

Calculate the standardized measurement (z-score):

  z_if = (x_if − m_f) / s_f

Using mean absolute deviation is more robust than using standard deviation
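The standardization above is easy to check with a short Python sketch (not part of the original slides; the function name is illustrative). It computes m_f, the mean absolute deviation s_f, and the resulting z-scores for one variable:

```python
def standardize(values):
    """Standardize one variable f across n objects using the mean
    absolute deviation s_f (more robust than the standard deviation,
    since deviations from the mean are not squared)."""
    n = len(values)
    m_f = sum(values) / n                           # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n     # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]        # z-scores z_if

heights = [160.0, 170.0, 180.0]
print([round(z, 6) for z in standardize(heights)])  # approximately [-1.5, 0.0, 1.5]
```

Here m_f = 170 and s_f = (10 + 0 + 10)/3, so the outer values standardize to ±1.5.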

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include the Minkowski distance:

  d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer

If q = 1, d is the Manhattan distance:

  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

If q = 2, d is the Euclidean distance:

  d(i, j) = sqrt( |x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2 )

Properties

d(i,j) ≥ 0

d(i,i) = 0

d(i,j) = d(j,i)

d(i,j) ≤ d(i,k) + d(k,j)

Also one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures.
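The Minkowski family collapses to Manhattan and Euclidean distance for q = 1 and q = 2. A minimal Python sketch (illustrative, not from the slides):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points;
    q = 1 gives Manhattan distance, q = 2 gives Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

p1, p2 = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(p1, p2, 1))  # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(p1, p2, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```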

Binary Variables

A contingency table for binary data

                   Object j
                 1     0    sum
  Object i   1   a     b    a+b
             0   c     d    c+d
           sum  a+c   b+d    p

Simple matching coefficient (invariant, if the binary variable is symmetric):

  d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant, if the binary variable is asymmetric):

  d(i, j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables

Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

the remaining attributes (besides gender) are asymmetric binary

let the values Y and P be set to 1, and the value N be set to 0

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33

  d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67

  d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
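The three dissimilarities in the example can be reproduced with a small Python sketch (illustrative; Y/P are coded as 1 and N as 0, as the slide says):

```python
def binary_dissim(x, y):
    """Asymmetric binary dissimilarity d(i, j) = (b + c) / (a + b + c),
    i.e. the Jaccard coefficient, ignoring negative matches d."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # positive matches
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1..Test-4 with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissim(jack, mary), 2))  # 0.33
print(round(binary_dissim(jack, jim), 2))   # 0.67
print(round(binary_dissim(jim, mary), 2))   # 0.75
```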

Nominal Variables

Can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

  m: # of matches, p: total # of variables

  d(i, j) = (p − m) / p

Method 2: use a large number of binary variables

  creating a new binary variable for each of the M nominal states

Ordinal Variables

An ordinal variable can be discrete or continuous

order is important, e.g., rank

Can be treated like interval-scaled

  replacing x_if by its rank r_if ∈ {1, ..., M_f}

  mapping the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    z_if = (r_if − 1) / (M_f − 1)

  then computing the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt)

Methods:

  treat them like interval-scaled variables — not a good choice! (why?)

  apply logarithmic transformation: y_if = log(x_if)

  treat them as continuous ordinal data and treat their ranks as interval-scaled

Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio

One may use a weighted formula to combine their effects:

  d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f) / Σ_{f=1..p} δ_ij^(f)

f is binary or nominal:

  d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise

f is interval-based: use the normalized distance

f is ordinal or ratio-scaled:

  compute ranks r_if and treat z_if = (r_if − 1) / (M_f − 1) as interval-scaled
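The weighted formula can be sketched in Python for the two simplest cases, nominal and interval variables (illustrative; the indicator δ is taken to be 1 for every variable here, which ignores the asymmetric-binary missing-match rule):

```python
def mixed_dissim(x, y, kinds, ranges):
    """Weighted-formula dissimilarity over mixed variable types.
    kinds[f] is 'nominal' or 'interval'; ranges[f] gives the value range
    of an interval variable, used to normalize |x_f - y_f| into [0, 1].
    All indicator weights delta are taken as 1 for simplicity."""
    num = den = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if kinds[f] == 'nominal':
            d_f = 0.0 if a == b else 1.0      # simple matching
        else:                                  # interval-scaled
            d_f = abs(a - b) / ranges[f]       # normalized distance
        num += d_f
        den += 1.0
    return num / den

x = ['red', 10.0]
y = ['blue', 20.0]
print(mixed_dissim(x, y, ['nominal', 'interval'], {1: 50.0}))  # (1 + 0.2) / 2 = 0.6
```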


Major Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model


Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen’67): Each cluster is represented by the center of the cluster

k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

  Partition objects into k nonempty subsets

  Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)

  Assign each object to the cluster with the nearest seed point

  Go back to Step 2; stop when there are no more new assignments

The K-Means Clustering Method: Example

[Figure: four scatterplots on a 10 × 10 grid showing objects being reassigned to the nearest cluster mean and the means being recomputed, until assignments stabilize.]
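The four steps above can be sketched directly in Python (illustrative code, not from the slides; plain 2-D points and squared Euclidean distance assumed):

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means on 2-D points: seed centroids, assign each point to
    the nearest centroid, recompute centroids as cluster means, and repeat
    until assignments stop changing."""
    centroids = random.sample(points, k)          # step 1: initial seeds
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
                      for p in points]            # step 3: nearest seed
        if new_assign == assign:                  # step 4: no new assignment
            break
        assign = new_assign
        for c in range(k):                        # step 2: recompute centroids
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assign

random.seed(0)
pts = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
cents, labels = kmeans(pts, 2)
print(labels)  # the two tight groups of points end up in different clusters
```

On this toy data the two well-separated groups are recovered regardless of which points are drawn as seeds.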

Comments on the K-Means Method

Strength

  Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n

  Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weakness

  Need to specify k, the number of clusters, in advance

  Unable to handle noisy data and outliers

Variations of the K-Means Method

A few variants of the k-means which differ in

  Selection of the initial k means

  Dissimilarity calculations

Handling categorical data:

  Replacing means of clusters with modes

  Using new dissimilarity measures to deal with categorical objects

  Using a frequency-based method to update modes of clusters

  A mixture of categorical and numerical data: k-prototype method

The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

  starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

  PAM works effectively for small data sets, but does not scale well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)

CLARANS (Ng & Han, 1994): Randomized sampling

PAM (Partitioning Around Medoids) (1987)

Built in Splus

Use real objects to represent the clusters:

  Select k representative objects arbitrarily

  For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih

  For each pair of i and h, if TC_ih < 0, i is replaced by h

  Then assign each non-selected object to the most similar representative object

PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih

[Figure: four scatterplots illustrating the cases for the reassignment cost C_jih of an object j when medoid i is swapped with non-medoid h; the recovered case formulas are C_jih = d(j, t) − d(j, i) and C_jih = d(j, h) − d(j, t), where t is another medoid.]
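The total swapping cost TC_ih can be evaluated directly as the change in each non-selected object's distance to its nearest medoid. A simplified Python sketch of PAM's swap evaluation (illustrative; function names and the toy data are not from the slides):

```python
def total_swap_cost(points, medoids, i, h, dist):
    """TC_ih: total change, over the non-selected objects j, of the
    distance to the nearest medoid if medoid i is swapped with
    non-medoid h (a simplified sketch of PAM's swap test)."""
    new_medoids = [m for m in medoids if m != i] + [h]
    cost = 0.0
    for j in points:
        if j in medoids or j == h:
            continue
        cost += (min(dist(j, m) for m in new_medoids)
                 - min(dist(j, m) for m in medoids))
    return cost

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
# swapping medoid (9, 9) for candidate (5, 5) lowers the total cost
print(total_swap_cost(pts, [(0, 0), (9, 9)], (9, 9), (5, 5), manhattan))  # -6.0
```

A negative TC_ih means the swap improves the clustering, so PAM would replace i by h.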

CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)

Built in statistical analysis packages, such as S+

It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

Strength: deals with larger data sets than PAM

Weakness:

  Efficiency depends on the sample size

  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)

CLARANS draws a sample of neighbors dynamically

The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids

If a local optimum is found, CLARANS starts with a new randomly selected node in search for a new local optimum

It is more efficient and scalable than both PAM and CLARA

Focusing techniques and spatial access structures may further improve its performance


Hierarchical Clustering

Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: agglomerative clustering (AGNES) runs Step 0 to Step 4, merging a, b, c, d, e into ab, de, then cde, and finally abcde; divisive clustering (DIANA) runs the same steps in reverse, from Step 4 back to Step 0.]

AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Use the single-link method and the dissimilarity matrix

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster

[Figure: three scatterplots showing two nearby groups of points being merged step by step into a single cluster.]
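The merge loop above can be sketched in a few lines of Python (illustrative code; 1-D points and an absolute-difference dissimilarity are assumed for brevity):

```python
def single_link(points, k, dist):
    """Agglomerative clustering, AGNES-style: start with every object in
    its own cluster and repeatedly merge the two clusters with the least
    dissimilarity, where single link means the closest pair of members."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)          # least dissimilar pair so far
        _, a, b = best
        clusters[a] += clusters.pop(b)        # merge the two clusters
    return clusters

d1 = lambda x, y: abs(x - y)
print(single_link([1, 2, 9, 10, 25], 2, d1))  # [[1, 2, 9, 10], [25]]
```

Without the `k` stopping condition the loop would continue until all objects fall into one cluster, which is exactly the AGNES behavior on the slide.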

A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

DIANA (Divisive Analysis)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own

[Figure: three scatterplots showing one cluster being split step by step until every point stands alone.]

More on Hierarchical Clustering Methods

Major weakness of agglomerative clustering methods

  do not scale well: time complexity of at least O(n²), where n is the total number of objects

  can never undo what was done previously

Integration of hierarchical with distance-based clustering

  BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters

  CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

  CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (1996)

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)

Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering

  Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)

  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans

Clustering Feature Vector

CF = (N, LS, SS)

  N: number of data points

  LS: the linear sum of the N points, Σ_{i=1..N} X_i

  SS: the square sum of the N points, Σ_{i=1..N} X_i²

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give

  CF = (5, (16, 30), (54, 190))
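The CF triple for the slide's five points can be computed with a short Python sketch (illustrative; per-dimension sums are assumed, matching the slide's (16, 30) and (54, 190)):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): number of points, per-dimension linear sum, and
    per-dimension square sum. CFs of disjoint point sets add
    component-wise, which is what lets BIRCH maintain them incrementally."""
    dims = range(len(points[0]))
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in dims)       # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)  # square sum
    return n, ls, ss

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # (5, (16, 30), (54, 190))
```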

CF Tree

[Figure: a CF tree whose root holds entries CF1 ... CF5, each pointing to a child non-leaf node with its own CF entries; leaf nodes hold CF entries and are chained together with prev/next pointers.]

CURE (Clustering Using REpresentatives)

Stops the creation of a cluster hierarchy if a level consists of k clusters

Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrary-shaped clusters and avoids the single-link effect

Drawbacks of Distance-Based Method

Drawbacks of the distance-based clustering method

  Considers only one point as representative of a cluster

  Good only for convex-shaped clusters of similar size and density

Cure: The Algorithm

  Draw a random sample s

  Partition the sample into p partitions, each of size s/p

  Partially cluster each partition into s/pq clusters

  Eliminate outliers

    by random sampling

    if a cluster grows too slowly, eliminate it

  Cluster the partial clusters

  Label the data on disk

Data Partitioning and Clustering

[Figure: with s = 50 sample points and p = 2 partitions of s/p = 25 points each, every partition is partially clustered into s/pq = 5 clusters.]

Cure: Shrinking Representative Points

[Figure: two scatterplots showing representative points being pulled towards the cluster center.]

Shrink the multiple representative points towards the gravity center by a fraction α

Multiple representatives capture the shape of the cluster

Clustering Categorical Data: ROCK

ROCK: Robust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE’99)

Use links to measure similarity/proximity

Computational complexity:

Basic ideas:

  Similarity function and neighbours:

    Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|

  Let T1 = {1, 2, 3} and T2 = {3, 4, 5}:

    Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2

Rock: Algorithm

Links: the number of common neighbours for the two points

Example: consider the ten transactions

  {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}
  {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}

  link({1,2,3}, {1,2,4}) = 3 (with Sim ≥ 0.5 as the neighbour condition, their common neighbours are {1,2,5}, {1,3,4}, {2,3,4})

Algorithm

  Draw a random sample
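Both numbers on these slides, Sim = 0.2 and link = 3, can be checked in Python (illustrative code; neighbours are counted with a similarity threshold of 0.5, excluding the two points themselves, which is an assumption consistent with the example):

```python
def sim(t1, t2):
    """Jaccard similarity between two transactions: |T1 ∩ T2| / |T1 ∪ T2|."""
    return len(t1 & t2) / len(t1 | t2)

def link(p1, p2, data, theta):
    """ROCK link(p1, p2): number of common neighbours, where q counts as
    a neighbour of p when sim(p, q) >= theta; p1 and p2 themselves are
    excluded from the count."""
    return sum(1 for q in data if q not in (p1, p2)
               and sim(p1, q) >= theta and sim(p2, q) >= theta)

data = [frozenset(s) for s in ([1, 2, 3], [1, 2, 4], [1, 2, 5], [1, 3, 4],
                               [1, 3, 5], [1, 4, 5], [2, 3, 4], [2, 3, 5],
                               [2, 4, 5], [3, 4, 5])]
print(sim({1, 2, 3}, {3, 4, 5}))                                    # 0.2
print(link(frozenset({1, 2, 3}), frozenset({1, 2, 4}), data, 0.5))  # 3
```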

CHAMELEON

CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E. H. Han and V. Kumar’99

Measures the similarity based on a dynamic model

  Two clusters are merged only if the interconnectivity and closeness between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters

A two-phase algorithm

  Phase 1: use a graph partitioning algorithm to cluster objects into a large number of relatively small sub-clusters

  Phase 2: use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters

Overall Framework of CHAMELEON

[Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.]


Density-Based Clustering Methods

Clustering based on density (a local cluster criterion), such as density-connected points

Major features:

  Discover clusters of arbitrary shape

  Handle noise

  One scan

  Need density parameters as termination condition

Several interesting studies:

  DBSCAN: Ester, et al. (KDD’96)

Density-Based Clustering: Background

Two parameters:

  Eps: maximum radius of the neighbourhood

  MinPts: minimum number of points in an Eps-neighbourhood of that point

N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}

Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if

  1) p belongs to N_Eps(q)

  2) core point condition: |N_Eps(q)| >= MinPts

[Figure: p lies inside q’s Eps-neighbourhood, with MinPts = 5 and Eps = 1 cm.]

Density-Based Clustering: Background (II)

Density-reachable:

  A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i

Density-connected:

  A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

A cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.]

DBSCAN: The Algorithm

  Arbitrarily select a point p

  Retrieve all points density-reachable from p wrt Eps and MinPts

  If p is a core point, a cluster is formed

  If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database

  Continue the process until all of the points have been processed
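A bare-bones Python sketch of this loop (illustrative, not the slides' pseudocode; cluster labels are integers and noise is -1):

```python
def dbscan(points, eps, min_pts, dist):
    """Grow a cluster from each unvisited core point by repeatedly pulling
    in the Eps-neighbourhoods of its core members; points reachable from
    no core point keep the noise label -1."""
    labels = {}
    cid = 0
    for p in points:
        if p in labels:
            continue
        neigh = [q for q in points if dist(p, q) <= eps]
        if len(neigh) < min_pts:
            labels[p] = -1                     # not core: noise, for now
            continue
        labels[p] = cid                        # p is core: start a cluster
        seeds = list(neigh)
        while seeds:
            q = seeds.pop()
            if labels.get(q, -1) == -1:        # unvisited, or previously noise
                labels[q] = cid                # q joins the current cluster
                q_neigh = [r for r in points if dist(q, r) <= eps]
                if len(q_neigh) >= min_pts:    # q is itself core: keep expanding
                    seeds.extend(r for r in q_neigh
                                 if labels.get(r, -1) == -1)
        cid += 1
    return labels

d2 = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(dbscan(pts, 1.5, 3, d2))
# the four corner points form one cluster; (10, 10) stays noise (-1)
```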

OPTICS: A Cluster-Ordering Method (1999)

OPTICS: Ordering Points To Identify the Clustering Structure

  Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)

  Produces a special order of the database wrt its density-based clustering structure

  This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings

  Good for both automatic and interactive analysis of the cluster structure

  Can be represented graphically or using visualization techniques

OPTICS: Some Extension from DBSCAN

Index-based:

  k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5

  Complexity: O(kN²)

Core distance of an object o: the smallest distance that makes o a core object

Reachability distance of p from o: Max(core-distance(o), d(o, p))

Example (MinPts = 5, ε = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm

[Figure: reachability-distance plot over the cluster order of the objects; valleys below a chosen threshold ε′ correspond to clusters, and undefined or high reachability values separate them.]

DENCLUE: Using Density Functions

DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)

Major features

  Solid mathematical foundation

  Good for data sets with large amounts of noise

  Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets

  Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)

  But needs a large number of parameters

Denclue: Technical Essence

Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure.

Influence function: describes the impact of a data point within its neighborhood.

Overall density of the data space can be calculated as the sum of the influence functions of all data points.

Clusters can be determined mathematically by identifying density attractors.

Density attractors are local maxima of the overall density function.

Gradient: The Steepness of a Slope

Example: Gaussian influence function

  f_Gaussian(x, y) = e^(−d(x, y)² / (2σ²))

  f_Gaussian^D(x) = Σ_{i=1..N} e^(−d(x, x_i)² / (2σ²))

  ∇f_Gaussian^D(x, x_i) = Σ_{i=1..N} (x_i − x) · e^(−d(x, x_i)² / (2σ²))

Density Attractor

[Figure: center-defined clusters and arbitrary-shape clusters found from density attractors.]
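The Gaussian influence function and the summed density f^D above are easy to evaluate directly; a short Python sketch (illustrative names and toy data, not from the slides):

```python
import math

def influence(x, y, sigma):
    """Gaussian influence of data point y at position x:
    f_Gaussian(x, y) = exp(-d(x, y)^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(influence(x, xi, sigma) for xi in data)

data = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0)]
# the density near the pair of close points exceeds the density at the lone point
print(density((1.1, 0.95), data, sigma=1.0)
      > density((8.0, 8.0), data, sigma=1.0))  # True
```

A density attractor is where hill-climbing along the gradient of this summed density stops, i.e. a local maximum.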


Grid-Based Clustering Method

Using multi-resolution grid data structure

Several interesting methods

STING (a STatistical INformation Grid

approach) by Wang, Yang and Muntz (1997)

WaveCluster by Sheikholeslami, Chatterjee,

and Zhang (VLDB’98)

A multi-resolution clustering approach

using wavelet method

CLIQUE: Agrawal, et al. (SIGMOD’98)

STING: A Statistical Information Grid Approach

Wang, Yang and Muntz (VLDB’97)

The spatial area is divided into rectangular cells

There are several levels of cells corresponding to different levels of resolution

STING: A Statistical Information Grid Approach (2)

Each cell at a high level is partitioned into a number of smaller cells in the next lower level

Statistical info of each cell is calculated and stored

Parameters of higher level cells can be easily calculated from parameters of lower level cells

  count, mean, s, min, max

Use a top-down approach to answer spatial data queries

  Start from a pre-selected layer—typically with a small number of cells

  For each cell in the current level compute the confidence interval

STING: A Statistical Information Grid Approach (3)

  Remove the irrelevant cells from further consideration

  When finished examining the current layer, proceed to the next lower level

  Repeat this process until the bottom layer is reached

Advantages:

  incremental update

  O(K), where K is the number of grid cells at the lowest level

Disadvantages:

  all cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected

WaveCluster (1998)

Sheikholeslami, Chatterjee, and Zhang (VLDB’98)

A multi-resolution clustering approach which applies a wavelet transform to the feature space

  A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands

Both grid-based and density-based

Input parameters:

  # of grid cells for each dimension

  the wavelet, and the # of applications of the wavelet transform

What is Wavelet (1)?

[Figure: illustration of a wavelet transform decomposing a signal into frequency sub-bands.]

WaveCluster (1998)

How to apply the wavelet transform to find clusters

  Summarize the data by imposing a multidimensional grid structure onto the data space

  These multidimensional spatial data objects are represented in an n-dimensional feature space

  Apply the wavelet transform on the feature space to find the dense regions in the feature space

  Apply the wavelet transform multiple times, which results in clusters at different scales from fine to coarse

What Is Wavelet (2)?

[Figure: quantization of the feature space followed by wavelet transformation.]

WaveCluster (1998)

Why is wavelet transformation useful for clustering?

  Unsupervised clustering: it emphasizes regions where points cluster, but simultaneously suppresses weaker information on their boundary

  Multi-resolution

  Cost efficiency

Major features:

  Complexity O(N)

  Detects clusters at different scales

CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)

Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space

CLIQUE can be considered as both density-based and grid-based

  It partitions each dimension into the same number of equal-length intervals

  It partitions an m-dimensional data space into non-overlapping rectangular units

  A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter

CLIQUE: The Major Steps

  Partition the data space and find the number of points that lie inside each cell of the partition

  Identify the subspaces that contain clusters using the Apriori principle

  Identify clusters:

    Determine dense units in all subspaces of interest

    Determine connected dense units in all subspaces of interest

  Generate minimal descriptions for the clusters:

    Determine maximal regions that cover a cluster of connected dense units for each cluster

    Determine the minimal cover for each cluster
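The first step, counting points per unit along one dimension and keeping the dense units, can be sketched in Python (illustrative; ξ intervals over an assumed range [lo, hi) and a density threshold τ on the raw count):

```python
def dense_units(points, xi, tau, lo=0.0, hi=100.0):
    """1-D step of CLIQUE: partition the range [lo, hi) of one dimension
    into xi equal-length intervals and keep the units whose point count
    exceeds the density threshold tau."""
    width = (hi - lo) / xi
    counts = [0] * xi
    for v in points:
        idx = min(int((v - lo) / width), xi - 1)  # clamp the top edge
        counts[idx] += 1
    return [u for u, c in enumerate(counts) if c > tau]

ages = [21, 22, 23, 24, 25, 26, 55, 90]
print(dense_units(ages, xi=10, tau=3, lo=0.0, hi=100.0))  # [2], i.e. [20, 30)
```

Dense units found per dimension are then combined Apriori-style: a unit in a k-dimensional subspace can only be dense if all of its (k−1)-dimensional projections are dense.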

[Figure: CLIQUE example with τ = 3: histograms of points over age (20–60) in the (age, salary) and (age, vacation) subspaces; intersecting the dense regions over the age range 30–50 identifies a cluster in the (age, salary, vacation) space.]

Strength and Weakness of CLIQUE

Strength

  It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces

  It is insensitive to the order of records in input and does not presume any canonical data distribution

  It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases

Weakness

  The accuracy of the clustering result may be degraded at the expense of the simplicity of the method


Model-Based Clustering Methods

Attempt to optimize the fit between the data and some mathematical model

Statistical and AI approaches

Conceptual clustering

  A form of clustering in machine learning

  Produces a classification scheme for a set of unlabeled objects

  Finds a characteristic description for each concept (class)

COBWEB (Fisher’87)

  A popular and simple method of incremental conceptual learning

  Creates a hierarchical clustering in the form of a classification tree

  Each node refers to a concept and contains a probabilistic description of that concept

COBWEB Clustering Method

[Figure: an example classification tree produced by COBWEB.]

More on Statistical-Based Clustering

Limitations of COBWEB

  The assumption that the attributes are independent of each other is often too strong, because correlations may exist

  Not suitable for clustering large database data: skewed trees and expensive probability distributions

CLASSIT

  an extension of COBWEB for incremental clustering of continuous data

  suffers similar problems as COBWEB

AutoClass

  Uses Bayesian statistical analysis to estimate the number of clusters

  Popular in industry

Other Model-Based Clustering Methods

Neural network approaches

  Represent each cluster as an exemplar, acting as a “prototype” of the cluster

  New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure

Competitive learning

  Involves a hierarchical architecture of several units (neurons)

  Neurons compete in a “winner-takes-all” fashion for the object currently being presented

Model-Based Clustering Methods: Self-Organizing Feature Maps (SOMs)

Clustering is performed by having several units competing for the current object

The unit whose weight vector is closest to the current object wins

The winner and its neighbors learn by having their weights adjusted

SOMs are believed to resemble processing that can occur in the brain

Useful for visualizing high-dimensional data in 2- or 3-D space


What Is Outlier Discovery?

What are outliers?

  The set of objects that are considerably dissimilar from the remainder of the data

  Example in sports: Michael Jordan, Wayne Gretzky, ...

Problem

  Find the top n outlier points

Applications:

  Credit card fraud detection

  Customer segmentation

  Medical analysis

Outlier Discovery: Statistical Approaches

Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)

Use discordancy tests, depending on the data distribution

Drawbacks

  in many cases, the data distribution may not be known

Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods

  We need multi-dimensional analysis without knowing the data distribution

Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O

Algorithms for mining distance-based outliers

  Index-based algorithm

  Nested-loop algorithm

  Cell-based algorithm
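The DB(p, D) definition can be checked object by object; a naive nested-loop sketch in Python (illustrative, and O(n²), unlike the index- and cell-based algorithms just listed):

```python
def db_outliers(data, p, D, dist):
    """Naive nested-loop detector: O is a DB(p, D)-outlier when at least
    a fraction p of the other objects lie at distance greater than D."""
    out = []
    for o in data:
        far = sum(1 for x in data if x != o and dist(o, x) > D)
        if far / (len(data) - 1) >= p:
            out.append(o)
    return out

d = lambda a, b: abs(a - b)
print(db_outliers([1, 2, 3, 100], p=0.9, D=10, dist=d))  # [100]
```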

Outlier Discovery: Deviation-Based Approach

Identifies outliers by examining the main characteristics of objects in a group

Objects that “deviate” from this description are considered outliers

Sequential exception technique

  simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects

OLAP data cube technique

  uses data cubes to identify regions of anomalies in large multidimensional data


Problems and Challenges

Considerable progress has been made in scalable clustering methods:

Partitioning: k-means, k-medoids, CLARANS

Hierarchical: BIRCH, CURE

Density-based: DBSCAN, CLIQUE, OPTICS

Grid-based: STING, WaveCluster

Model-based: AutoClass, DENCLUE, COBWEB

Current clustering techniques still do not address all the requirements adequately

Constraint-based clustering analysis: constraints may exist in the data space (e.g., bridges and highways) or in user queries
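As a reminder of what the partitioning family above does, here is a minimal k-means sketch (illustrative only; not the k-medoids or CLARANS variants): alternately assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment
    and centroid re-averaging for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the centroid with the smallest squared distance
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        for j, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
cents, cls = kmeans(pts, k=2)
print(sorted(len(c) for c in cls))  # [3, 3]
```

On these two well-separated groups the sketch converges to a 3/3 split; on real data, sensitivity to the initial centroids and to outliers is exactly what k-medoids and CLARANS were designed to mitigate.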


Constraint-Based Clustering Analysis

Clustering analysis with fewer parameters but more user-desired constraints, e.g., an ATM allocation problem

Summary

Cluster analysis groups objects based on their similarity and has wide applications

Measures of similarity can be computed for various types of data

Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods

Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches

There are still many open research issues in cluster analysis, such as constraint-based clustering

