|Views: 384|Likes: 2

Published by ijcsis

Categorical data clustering is an interesting challenge for researchers in the data mining and machine learning, because of many practical aspects associated with efficient processing and concepts are often not stable but

change with time. Typical examples of this are weather prediction rules and customer’s preferences, intrusion detection in a network traffic stream. Another example is the case of text data points, such as that occurring in Twitter/search engines. In this regard the sampling is an important technique to improve the efficiency of clustering. However, with sampling applied, those sampled points that are not having their labels after the normal process. Even though there is straight forward method for numerical domain and categorical data. But still it has a problem that is how to allocate those unlabeled data points into appropriate clusters in efficient manner. In this paper the concept-drift phenomenon is studied, and we first propose an adaptive threshold for outlier detection, which is a playing

vital role detection of cluster. Second, we propose a probabilistic approach for detection of cluster using POur-NIR method which is an alternative method

change with time. Typical examples of this are weather prediction rules and customer’s preferences, intrusion detection in a network traffic stream. Another example is the case of text data points, such as that occurring in Twitter/search engines. In this regard the sampling is an important technique to improve the efficiency of clustering. However, with sampling applied, those sampled points that are not having their labels after the normal process. Even though there is straight forward method for numerical domain and categorical data. But still it has a problem that is how to allocate those unlabeled data points into appropriate clusters in efficient manner. In this paper the concept-drift phenomenon is studied, and we first propose an adaptive threshold for outlier detection, which is a playing

vital role detection of cluster. Second, we propose a probabilistic approach for detection of cluster using POur-NIR method which is an alternative method

Categorical data clustering is an interesting challenge for researchers in the data mining and machine learning, because of many practical aspects associated with efficient processing and concepts are often not stable but

change with time. Typical examples of this are weather prediction rules and customer’s preferences, intrusion detection in a network traffic stream. Another example is the case of text data points, such as that occurring in Twitter/search engines. In this regard the sampling is an important technique to improve the efficiency of clustering. However, with sampling applied, those sampled points that are not having their labels after the normal process. Even though there is straight forward method for numerical domain and categorical data. But still it has a problem that is how to allocate those unlabeled data points into appropriate clusters in efficient manner. In this paper the concept-drift phenomenon is studied, and we first propose an adaptive threshold for outlier detection, which is a playing

vital role detection of cluster. Second, we propose a probabilistic approach for detection of cluster using POur-NIR method which is an alternative method

change with time. Typical examples of this are weather prediction rules and customer’s preferences, intrusion detection in a network traffic stream. Another example is the case of text data points, such as that occurring in Twitter/search engines. In this regard the sampling is an important technique to improve the efficiency of clustering. However, with sampling applied, those sampled points that are not having their labels after the normal process. Even though there is straight forward method for numerical domain and categorical data. But still it has a problem that is how to allocate those unlabeled data points into appropriate clusters in efficient manner. In this paper the concept-drift phenomenon is studied, and we first propose an adaptive threshold for outlier detection, which is a playing

vital role detection of cluster. Second, we propose a probabilistic approach for detection of cluster using POur-NIR method which is an alternative method

See More

See less

Clustering of Concept Drift Categorical Data usingPOur-NIR Method

N.Sudhakar Reddy K.V.N. Sunitha

Professor in CSE Prof. in CSESVCE, TirupatiGNITS, HyderabadIndia India

Abstract -

Categorical data clustering is aninteresting challenge for researchers in the datamining and machine learning, because of manypractical aspects associated with efficientprocessing and concepts are often not stable butchange with time. Typical

examples

of this areweather prediction rules and customer’spreferences, intrusion detection in a networktraffic stream . Another example is the case of text data points, such as that occurring inTwitter/search engines. In this regard the sampling is animportant technique to improve the efficiency of clustering. However, with sampling applied, thosesampled points that are not having their labels after thenormal process. Even though there is straight forwardmethod for numerical domain and categorical data. Butstill it has a problem that is how to allocate thoseunlabeled data points into appropriate clusters in efficientmanner. In this paper the concept-drift phenomenon isstudied, and we first propose an adaptivethreshold for outlier detection, which is a playingvital role detection of cluster. Second, we proposea probabilistic approach for detection of clusterusing POur-NIR method which is an alternativemethod

Keywords- clustering, NIR, POur-NIR, ConceptDrift nd node

.

I.

INTRODUCTIONExtracting Knowledge from large amount of data isdifficult which is known as data mining. Clustering isa collection of similar objects from a given data setand objects in different collection are dissimilar.Most of the algorithms developed for numerical datamay be easy, but not in Categorical data [1, 2, 12,13]. It is challenging in categorical domain, wherethe distance between data points is not defined. It isalso not easy to find out the class label of unknowndata point in categorical domain. Samplingtechniques improve the speed of clustering and weconsider the data points that are not sampled toallocate into proper clusters. The data which dependson time called time evolving data. For example, the buying preferences of customers may change withtime, depending on the current day of the week,availability of alternatives, discounting rate etc. Sincedata evolve with time, the underlying clusters mayalso change based on time by the data driftingconcept [11, 17]. The clustering time-evolving data inthe numerical domain [1, 5, 6, 9] has been exploredin the previous works, where as in categorical domainnot that much. Still it is a challenging problem in thecategorical domain.As a result, our contribution in modifyingthe frame work which is proposed by Ming-SyanChen in 2009[8] utilizes any clustering algorithm todetect the drifting concepts. We adopted slidingwindow technique and initial data (at time t=0) isused in initial clustering. These clusters arerepresented by using POur-NIR [19], where eachattribute value importance is measured. We findwhether the data points in the next sliding window(current sliding window) belongs to appropriateclusters of last clustering results or they are outliers.We call this clustering result as a temporal andcompare with last clustering result to drift the data points or not. If the concept drift is not detected toupdate the POur-NIR otherwise dump attribute value based on importance and then reclustering usingclustering techniques.The rest of the paper is organized as follows.In section II discussed related work, in section III basic notations and concept drift, in section IV newmethods for node importance representativediscussed and also contains results with comparisonof Ming-Syan Chen method and our method, insection V discussed distribution of clustering andfinally concluded with section VI.II. RELATED WORK In this section, we discuss various clusteringalgorithms on categorical data with cluster representatives and data labeling. We studied manydata clustering algorithms with time evolving.

(IJCSIS) International Journal of Computer Science and Information Security,Vol. 9, No. 7, July 2011109http://sites.google.com/site/ijcsis/ISSN 1947-5500

Cluster representative is used to summarizeand characterize the clustering result, which is notfully discussed in categorical domain unlikenumerical domain.In K-modes which is an extension of K-meansalgorithm in categorical domain a cluster isrepresented by ‘mode’ which is composed by themost frequent attribute value in each attribute domainin that cluster. Although this cluster representative issimple, only use one attribute value in each attributedomain to represent a cluster is questionable. Itcomposed of the attribute values with high co-occurrence. In the statistical categorical clusteringalgorithms [3,4] such as COOLCAT and LIMBO,data points are grouped based on the statistics. Inalgorithm COOLCAT, data points are separated insuch a way that the expected entropy of the wholearrangements is minimized. In algorithm LIMBO, theinformation bottleneck method is applied to minimizethe information lost which resulted fromsummarizing data points into clusters.However, all of the above categoricalclustering algorithms focus on performing clusteringon the entire dataset and do not consider the time-evolving trends and also the clusteringrepresentatives in these algorithms are not clearlydefined.The new method is related to the idea of conceptual clustering [9], which creates a conceptualstructure to represent a concept (cluster) duringclustering. However, NIR only analyzes theconceptual structure and does not perform clustering,i.e., there is no objective function such as categoryutility (CU) [12] in conceptual clustering to lead theclustering procedure. In this aspect our method can provide in better manner for the clustering of data points on time based.The main reason is that in concept driftingscenarios, geometrically close items in theconventional vector space might belong to differentclasses. This is because of a concept change (drift)that occurred at some time point.Our previous work [19] addresses the nodeimportance in the categorical data with the help of sliding window. That is new approach to the best of our knowledge that proposes these advancedtechniques for concept drift detection and clusteringof data points. In this regard the concept driftshandling by the headings such as node importanceand resemblance. In this paper, the main objective of the idea of representing the clusters by aboveheadings. This representation is more efficient thanusing the representative points.After scanning the literature, it is clear thatclustering categorical data is untouched many tiesdue to the complexity involved in it. A time-evolvingcategorical data is to be clustered within the duecourse hence clustering data can be viewed asfollows: there are a series of categorical data points Dis given, where each data point is a vector of qattribute values, i.e.,

p

j

=(p

j1

,p

j2

,...,p

jq

)

. And

A = {A

1

, A

2

,..., A

q

}

, where

A

a

is the

a

th

categorical attribute,

1

≤

a

≤

q

. The window size

N

is to be given so that thedata set

D

is separated into several continuoussubsets

S

t

, where the number of data points in each

S

t

is

N

. The superscript number

t

is the identificationnumber of the sliding window and

t

is also calledtime stamp. Here in we consider the first

N data points of data set

D

this makes the first data slide or the first

sliding window S

0

. C

i j

or Cij is representingfor the cluster, in this the j indication of the cluster number respect to sliding window i. Our intension isto cluster every data slide and relate the clusters of every data slide with previous clusters formed by the previous data slides. Several notations andrepresentations are used in our work to ease the process of presentation:

III.

CONCEPT DRIFT DETECTION

Concept drift is an sudden

substitution

of one sliding window S

1

(with an underlying

probability distribution

Π

S1

), with another sliding window

S

2

(with

distribution

Π

S2

). Asconcept drift is assumed to be

unpredictable,

periodic

seasonality

is usually not considered as aconcept drift problem. As an exception, if

sea

sonality is not known with certainty

,

it

migh

t

be regarded as a concept drift problem. The

core

assumption,

when dealing with the concept drift problem, is

uncertain

t

y

about the future - weassume

that

the source of the

target

instance isnot known with

certain

t

y.

For successfulautomatic clustering data points we are not onlylooking for fast and accurate clustering algorithms, but also for complete methodologies that can detectand quickly adapt to time varying concepts. This problem is usually called “concept drift” anddescribes the change of concept of a target class withthe passing of time.As said earlier in this section that meansdetects the difference of cluster distribution betweenthe current data subset S

t

( i.e. sliding window 2)andthe last clustering result C

[tr,t-1]

(sliding window1)and to decide whether the resulting is required or not in S

t

. Hence the upcoming data points in the slideS

t

should be able to be allocated into thecorresponding proper cluster at the last clusteringresult. Such process of allocating the data points tothe proper cluster is named as “labeled data”. Labeled

(IJCSIS) International Journal of Computer Science and Information Security,Vol. 9, No. 7, July 2011110http://sites.google.com/site/ijcsis/ISSN 1947-5500

data in our work even detects the outlier data pointsas few data points may not be assigned to the cluster,“outlier detection”.If the comparison between the last clustersand the temporal clusters availed from the newsliding window data labeling, produce the enoughdifferences in the cluster distributions, then the latestsliding window is considered as a concept-driftingwindow. A re-clustering is done on the latest slidingwindow. This includes the consideration of theoutliers that are obtained in the latest sliding window,and forming new clusters which are the new conceptsthat help in the new decisions. The above process can be handled by the following headings such Nodeselection, POur-NIR, Resemble method and thresholdvalue. This is new scenario because of we introducedthe POur-NIR method compared with existingmethod and also published in[19]

3.1 Node selection

:

In this category, proposedsystems try to select the most appropriate set of pastcases in order to make future clustering. The work related to representatives of the categorical data withsliding window technique based on time. In slidingwindow technique, older points are useless for clustering of new data and therefore, adapting toconcept drift is synonym to successfully forgettingold instances /knowledge. Examples of this groupcan be found in [10, 15]

3.2 Node Importance

:

In this group, we assume thatold knowledge becomes less important as time goes by. All data points are taken under consideration for building clusters, but this time, new coming pointshave larger effect in the model than the older ones.To achieve this goal, we introduced a newweighting scheme for the finding of the nodeimportance and also published in [15, 19].

3.3 Resemblance Method

:

The main aimof this method is to have a number of clusters that are effective only on a certainconcept. It has importance that is to findlabel for unlabeled data points and storeinto appropriate cluster.

3.3.1 Maximal resemblance

All the weights associated with a single data point corresponding to the unique cluster forms theresemblance. This can be given with the equation:R(Pj,Ci)=

[i, r]1

(,N)

qr

WCi

=

∑

----------------------------- (1)Here a data point p

j

of the new data slide and thePOur-NIR of the data point with all the clusters arecalculated and are placed in the table. Henceresemblance R(pj,ci) can be obtained by summing upthe POur-NIR of the cluster c

i

. This just gives themeasurement of the resemblance of the node withcluster. And now these measurements are used to findthe maximal resemblance. i.e, if data point p

j

hasmaximum resemblance R (P

j

,C

x

),towards a cluster C

x

, then the data point is labeled to that cluster.If any data point is not similar or has anyresemblance to any of the cluster then that data point is considered to be the outlier.We even introduce the threshold to simplify theoutlier detection. With the threshold value the data points with small resemblance towards many clusterscan be considered as the outlier if the resemblance isless than the threshold.

IV.

VALUE OF THRESHOLD

In this section, we introduce the decisionfunction that is to find out the threshold, whichdecides the quality of the cluster and the number of the clusters. Here we have to calculate the threshold(

λ

) for every cluster can be set identical, i.e.,

λ

1

=

λ

2

=…=

λ

n

=

λ

. Even then we have a problem to findthe main

λ

(threshold) that can be find with comparingall the clusters. Hence an intermediate solution ischosen to identify the threshold (

λ

i

) the smallestresemblance value of the last clustering result is usedas the new threshold for the new clustering. After data labeling we obtain clustering results which arecompared to the clusters formed at the last clusteringresult which are base for the formation of the newclusters. This leads to the “Cluster DistributionComparison” step.

4.1 Labeling and Outlier Detection using adaptivethreshold

The data point is identified as an outlier

if it is outsidethe radius of all the data points in the resemblancemethods. Therefore, if the data point is outside thecluster of a data point, but very close to its cluster, itwill still be an outlier. However, this case might befrequent due to concept- drift or noise, As a result,detecting existing clusters as novel would be high.In order to solve this problem. Here we adapted thethreshold for detecting the outliers/labeling. The mostimportant step in the detection of the drift in theconcept starts at the data labeling. The conceptformation from the raw data which is used for thedecision making is to be perfect to produce proper results after the decision, hence the formation of clustering with the incoming data points is animportant step.Comaprision of the incoming data point with the initial clusters generated with the previous data available gives rise to the new clusters.If a data point p

j

is the next incoming data point in the current sliding window, this data point is

(IJCSIS) International Journal of Computer Science and Information Security,Vol. 9, No. 7, July 2011111http://sites.google.com/site/ijcsis/ISSN 1947-5500