
Abstract

Introduction

Preliminaries

Drifting concept detection

Clustering relationship analysis

Experimental results

Conclusions

Abstract

The problem of how to allocate unlabeled data points into proper clusters remains a challenging issue in the categorical domain. In this paper, a mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the appropriate cluster. MARDL has two advantages: 1) MARDL exhibits high execution efficiency, and 2) MARDL achieves high intracluster similarity and low intercluster similarity.

Introduction

As the concepts behind the data evolve with time, the underlying clusters may also change considerably. Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration. The problem of clustering time-evolving data in the categorical domain therefore remains a challenging issue.

We first propose a practical categorical clustering representative, named "Node Importance Representative" (NIR). NIR represents clusters by measuring the importance of each attribute value in the clusters.

Based on NIR, we propose "Drifting Concept Detection" (DCD). In DCD, the incoming categorical data points of the present sliding window are first allocated into the corresponding proper clusters of the last clustering result. If the distribution has changed (exceeding some criteria), the concepts are said to drift.

The framework presented in this paper not only detects the drifting concepts in the categorical data but also explains them by analyzing the relationship between clustering results at different times. The analyzing algorithm is named "Cluster Relationship Analysis" (CRA).

Preliminaries

The problem of clustering categorical time-evolving data is formulated as follows. A series of categorical data sets D is given, where each data point is described by a set of categorical attributes.

1. Suppose that the window size N is also given. The data set D is separated into several continuous subsets S^t.
2. The superscript t is the identification number of the sliding window, and t is also called the time stamp in this paper. For example, the first N data points in D are located in the first subset S^1.

The objective of the framework is to perform clustering on the data set D, detect the drifting concepts between S^t and S^{t+1}, and analyze the relationship between the different clustering results.
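The windowing step above can be sketched as follows (a minimal illustration; the function name and the toy data are ours, not the paper's):

```python
def sliding_windows(D, N):
    """Split the data set D into continuous subsets S^1, S^2, ...
    of N data points each (the last window may be shorter)."""
    return [D[i:i + N] for i in range(0, len(D), N)]

# Example: 10 categorical data points, window size N = 4.
D = [("A", "E"), ("A", "F"), ("B", "E"), ("B", "F"), ("X", "M"),
     ("X", "N"), ("Y", "M"), ("Y", "N"), ("Z", "M"), ("Z", "N")]
windows = sliding_windows(D, 4)
print(len(windows))   # 3 windows
print(windows[0])     # the first subset S^1 holds the first N points
```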

The basic idea behind NIR is to represent a cluster as the distribution of its attribute values, which are called "nodes". The importance of a node is evaluated based on the following two concepts: 1) the node is important in the cluster when the frequency of the node is high in this cluster, and 2) the node is important in the cluster if the node appears prevalently in this cluster rather than in other clusters.

Definition 1 (node). A node, I_r, is defined as an attribute name combined with an attribute value: [attribute name = attribute value].

For example, when the age is in the range 50-59 and the weight is also in the range 50-59, the attribute value 50-59 is confusing if we separate the attribute value from the attribute name. The nodes [age = 50-59] and [weight = 50-59] avoid this ambiguity.

Definition 2 (node importance). The importance value of node I_r in cluster c_i is calculated by the following equations:

$$w(c_i, I_{ir}) = \frac{|I_{ir}|}{m_i} \cdot f(I_r)$$

$$f(I_r) = 1 + \frac{1}{\log k} \sum_{y=1}^{k} p(I_{yr}) \log p(I_{yr}), \qquad p(I_{yr}) = \frac{|I_{yr}|}{\sum_{z=1}^{k} |I_{zr}|}$$

where |I_{yr}| is the number of occurrences of node I_r in cluster c_y, m_i is the number of data points in cluster c_i, and k is the number of clusters (0 log 0 is treated as 0). The factor f(I_r) is one minus the normalized entropy of the node across clusters, so it is close to 1 when the node concentrates in a single cluster and close to 0 when the node spreads evenly over all clusters.

For example, suppose node [A1 = A] occurs in all 3 data points of cluster c_1 and in none of the 2 data points of cluster c_2 (k = 2):

$$f(I_{A1=A}) = 1 + \frac{1}{\log 2}\left(\frac{3}{3}\log\frac{3}{3} + \frac{0}{3}\log\frac{0}{3}\right) = 1$$

$$w(c_1, I_{A1=A}) = \frac{3}{3} \cdot 1 = 1, \qquad w(c_2, I_{A1=A}) = \frac{0}{2} \cdot 1 = 0$$
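Definition 2 can be sketched in code as follows (a minimal illustration that assumes clusters are given as lists of attribute-value tuples and that there are at least two clusters; all names are ours):

```python
from math import log
from collections import Counter

def node_importance(clusters):
    """Compute w(c_i, I_r) = (|I_ir| / m_i) * f(I_r) for every node,
    following Definition 2; 0 * log(0) is treated as 0.
    Assumes k >= 2 so that log k is nonzero."""
    k = len(clusters)
    # Count each node (attribute index, value) per cluster.
    counts = [Counter((a, v) for point in c for a, v in enumerate(point))
              for c in clusters]
    nodes = set().union(*counts)
    weights = {}
    for node in nodes:
        total = sum(cnt[node] for cnt in counts)
        # f(I_r) = 1 + (1 / log k) * sum_y p(I_yr) * log p(I_yr)
        f = 1.0
        for cnt in counts:
            p = cnt[node] / total
            if p > 0:
                f += p * log(p) / log(k)
        for i, c in enumerate(clusters):
            weights[(i, node)] = counts[i][node] / len(c) * f
    return weights

# The slide's example: [A1 = A] occurs in all 3 points of c1 and nowhere in c2.
c1 = [("A", "E"), ("A", "F"), ("A", "E")]
c2 = [("B", "M"), ("B", "N")]
w = node_importance([c1, c2])
print(w[(0, (0, "A"))])   # 1.0: frequent in c1 and exclusive to it
print(w[(1, (0, "A"))])   # 0.0
```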

Drifting Concept Detection

The objective of the DCD algorithm is to detect the difference of cluster distributions between the current data subset S^t and the last clustering result, and to decide whether reclustering of S^t is required.

The goal of data labeling is to decide the most appropriate cluster label for each incoming data point.

Definition 3 (resemblance and maximal resemblance). Given a data point p_j and the NIR table of clusters c_i, the data point is labeled to the cluster that obtains the maximal resemblance:

$$R(p_j, c_i) = \sum_{r=1}^{q} w(c_i, I_{ir})$$

where q is the number of nodes (attribute values) of p_j.

When a data point contains nodes that are more important in cluster c_x than in cluster c_y, R(p_j, c_x) will be larger than R(p_j, c_y). If the maximal resemblance (the score of the most appropriate cluster) is smaller than the threshold λ_i of that cluster, the data point is regarded as an outlier:

$$\mathrm{Label}(p_j) = \begin{cases} C_i^{*}, & \text{if } \max R(p_j, c_i) > \lambda_i, \text{ where } 1 \le i \le k \\ \text{outlier}, & \text{otherwise} \end{cases}$$

For example, consider two incoming data points, p_6 = (B, E, G) and p_7 = (X, M, P), with threshold 0.5:

R(p_6, c_1^1) = 0 and R(p_6, c_2^1) = 0. Both are smaller than the threshold 0.5, so p_6 is regarded as an outlier.

R(p_7, c_1^1) = 0.029 and R(p_7, c_2^1) = 0.5 + 0.029 + 1 = 1.529. Since 1.529 > 0.029 and 1.529 is larger than the threshold 0.5, p_7 is labeled to the second cluster.
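The labeling rule and the example above can be sketched as follows (the NIR weights are hard-coded from the slide's numbers, and the function names are illustrative):

```python
def resemblance(point, nir):
    """R(p_j, c_i): sum the importance of the point's nodes in cluster c_i.
    `nir` maps a node (attribute index, value) to its importance in c_i."""
    return sum(nir.get((a, v), 0.0) for a, v in enumerate(point))

def label(point, nir_tables, thresholds):
    """Label the point to the cluster with maximal resemblance, or mark it
    as an outlier when even the maximum falls below that cluster's threshold."""
    scores = [resemblance(point, t) for t in nir_tables]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > thresholds[best] else "outlier"

# Weights taken from the slide's example (threshold 0.5 for both clusters):
nir_c1 = {(0, "X"): 0.029}
nir_c2 = {(0, "X"): 0.029, (1, "M"): 0.5, (2, "P"): 1.0}
print(label(("B", "E", "G"), [nir_c1, nir_c2], [0.5, 0.5]))  # outlier
print(label(("X", "M", "P"), [nir_c1, nir_c2], [0.5, 0.5]))  # 1 (second cluster)
```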

The clustering results are said to be different according to the following two criteria: 1) the clustering results are different if quite a large number of outliers are found by data labeling, and 2) the clustering results are different if the ratio of data points varies in quite a large number of clusters.

For example, there are three outliers in S^2, and the ratio of outliers in S^2 is 3/5 = 0.6 > 0.4. Therefore, S^2 is considered a concept-drifting window, and reclustering is performed.

In S^3, the ratio of outliers is 1/5 = 0.2 < 0.4, so the first criterion is not triggered. However, the variation of the ratio of data points between corresponding clusters is large:

$$d(c_1^2, c_1'^{2}) = \left|\frac{2}{5} - \frac{4}{5}\right| = 0.4 > 0.3, \qquad d(c_2^2, c_3'^{2}) = \left|\frac{3}{5} - \frac{0}{5}\right| = 0.6 > 0.3$$

Since 2/2 = 1 > 0.5 of the clusters vary, S^3 is also considered a concept-drifting window.
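The two criteria, applied to the S^2 and S^3 examples above, can be sketched as follows (theta, epsilon, and delta are our labels for the slide's thresholds 0.4, 0.3, and 0.5):

```python
def concept_drifts(labels, last_ratios, theta=0.4, epsilon=0.3, delta=0.5):
    """Apply the two DCD criteria to one sliding window's labeling result.
    `labels` holds the data-labeling output (cluster index or "outlier");
    `last_ratios[i]` is cluster i's share of points in the last result."""
    n = len(labels)
    # Criterion 1: too many outliers.
    if labels.count("outlier") / n > theta:
        return True
    # Criterion 2: the per-cluster ratio of points varies in too many clusters.
    varied = 0
    for i, old in enumerate(last_ratios):
        new = labels.count(i) / n
        if abs(new - old) > epsilon:
            varied += 1
    return varied / len(last_ratios) > delta

# S^2 from the slides: 3 outliers among 5 points -> 0.6 > 0.4, drift.
print(concept_drifts(["outlier", "outlier", "outlier", 0, 1], [0.4, 0.6]))  # True
# S^3: only 1 outlier, but both cluster ratios shift by more than 0.3.
print(concept_drifts([0, 0, 0, 0, "outlier"], [0.4, 0.6]))  # True
```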


The bottlenecks of the execution time in DCD may occur in the reclustering step when the concept drifts, and in the NIR-table updating step when the concept does not drift. If prior knowledge, such as the frequency of the drifting concepts, can be obtained from domain experts, it can help us set proper parameter values.

Clustering Relationship Analysis

CRA measures the similarity of clusters between the clustering results at different time stamps, and links two clusters when their similarity is higher than the threshold. CRA provides clues for catching the time-evolving trends in the data set.

Node Importance Vector and Cluster Distance

Each cluster c_i is represented by a node importance vector, and the dimensions of all the vectors are the same.

Example. Vector space (14 nodes):

([A1=A], [A1=B], [A1=X], [A1=Y], [A1=Z], [A2=E], [A2=F], [A2=M], [A2=N], [A3=C], [A3=D], [A3=G], [A3=P], [A3=T])^T

Cosine measure: the similarity of two clusters is measured by the cosine of the angle between their node importance vectors.
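A minimal sketch of the cosine measure (illustrative code and toy vectors, not from the paper):

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two node importance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Two clusters in a 4-node vector space.
print(cosine([1.0, 0.5, 0.0, 0.0], [1.0, 0.5, 0.0, 0.0]))  # ~1.0: same direction
print(cosine([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]))  # 0.0: no shared nodes
```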

Example: the similarity between the vectors of c_1^1 and c_1^2 is computed with the cosine measure.

Visualizing the Evolving Clusters

(Figure: the clustering results are plotted along a time axis, cluster by cluster, so that linked clusters show how the clustering result evolves.)

Experimental Results-Test Environment

Synthetic data sets: numerical clustering data are used, and a drifting concept is generated by combining two different clustering results.

Experimental Results-Test Environment

Real data set（KDD-CUP’99 Network intrusion

Detection）

Each record：normal connection、attack

Drifting concept： the change is continued

for at least 300 connections.

493,857 records；each record contains 42

attributes.

33 drifting concepts

Evaluation on Efficiency

The number of drifting concepts directly impacts the execution time of DCD, but only slightly. The execution time of DCD is shorter than that of EM. (Settings: dimensionality = 20, number of clusters = 20, N = 500.)

Evaluation on Scalability

(Settings: data size = 50,000, N = 500, number of clusters = 20, dimensionality = 20.) The bottleneck is the number of drifting concepts that require reclustering.

Evaluation on Accuracy

We test the accuracy of the drifting concepts that are detected by DCD. The CU (category utility) function is maximized when data points in the same cluster share the same attribute values while different clusters have different attribute values.

Confusion matrix accuracy (CMA) evaluates the clustering results by comparing the output clusters c_i with the original clustering labels j: the count of pairs (i, j) is maximized under a mapping in which one output cluster c_i is mapped to one original clustering label j.
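CMA can be sketched as a brute-force search over one-to-one mappings (an illustration of the idea; the paper does not specify this implementation, and the toy labels below are ours):

```python
from itertools import permutations

def cma(output_labels, true_labels):
    """Confusion matrix accuracy: try every one-to-one mapping from output
    clusters to original labels and keep the one that maximizes the number
    of correctly mapped points (brute force; fine for a few clusters)."""
    clusters = sorted(set(output_labels))
    originals = sorted(set(true_labels))
    best = 0
    for perm in permutations(originals, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[o] == t for o, t in zip(output_labels, true_labels))
        best = max(best, hits)
    return best / len(output_labels)

# Output clusters 0/1 best map to labels "b"/"a": 5 of the 6 points match.
print(cma([0, 0, 0, 1, 1, 1], ["b", "b", "a", "a", "a", "a"]))
```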

Accuracy Evaluation on Synthetic Data Set

Each synthetic data set is generated by randomly combining 50 clustering results. DCD is effective for detecting drifting concepts: both precision and recall stay above 0.8 and are the highest among the compared settings. When the data set varies dramatically, a smaller N should be used; when the data set is stable, a larger N saves execution time. (Settings: thresholds 0.1, 0.1, and 0.5, with k set to the maximum number of clusters; results are averages of 20 experiments.)

Clustering results: DCD vs. EM under the D_1 setting. With N = 2000, drifting concepts occur once per five sliding windows (50 x 10000 / 2000 = 250 windows, 250 / 50 = 5). The variation of CU and CMA when performing EM once is considerably larger than that of DCD.

Clustering results: DCD vs. EM under the D_2 setting, in which the drifting concepts occur irregularly. DCD performs better than EM when clustering categorical time-evolving data.

Accuracy Evaluation on Real Data Set

A small sliding window size leads to a high recall but a slightly lower precision. Since the data set does not evolve frequently, a larger N is chosen. (Settings: thresholds 0.1, 0.1, and 0.5, with k = 10 and N = 3000.)

The records are identical in sliding windows 51-114, 134-149, and 155-160. The peak values of CU in DCD appear at the time stamps where drifting concepts occur. DCD is able to quickly reflect the drifting concepts and generate better clustering results.

Conclusions

We propose a framework to perform clustering on categorical time-evolving data. The framework detects the drifting concepts in different sliding windows by DCD, analyzes and shows the changes between different clustering results by CRA, and shows the relationship between clustering results by visualization. DCD provides high-quality clustering results with correctly detected drifting concepts.

