(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 4, April 2011
A Study on the Performance of Classical Clustering Algorithms with Uncertain Moving Object Data Sets
Angeline Christobel Y.
College of Computer Studies
AMA International University
Salmabad, Kingdom of Bahrain
angeline_christobel@yahoo.com

Dr. Sivaprakasam
Department of Computer Science
Sri Vasavi College
Erode, India
psperode@yahoo.com
Abstract — In recent years, real-world application domains have been generating data that are uncertain, incomplete, and probabilistic in nature. Examples of such data include location-based services, sensor networks, and scientific and biological databases. Data mining is widely used to extract interesting patterns from the large amounts of data generated by such applications. In this paper, we address the classical mining and data-analysis algorithms, particularly clustering algorithms, for clustering uncertain and probabilistic data. To model an uncertain database, we simulated a moving object database with two states: one containing the real locations and the other containing the outdated recorded locations. We evaluated the performance and compared the results of clustering the two states of location data with k-means, DBSCAN, and SOM.

Key Words: Data Mining, Uncertain Data, Moving Objects Database, Clustering.
 
I. INTRODUCTION
Data uncertainty naturally arises in many real-world applications due to reasons such as outdated sources or imprecise measurement. This is true for applications such as location-based services [12] and sensor monitoring [6] that need interaction with the physical world. For example, in the case of moving objects, it is impossible for the database to track the exact locations of all objects at all times, so the location of each object is associated with uncertainty between updates [7]. In order to produce good mining results, these uncertainties have to be considered.

In recent years, there has been much research on the management of uncertain data in databases, such as the representation of uncertainty in databases and the querying of data with uncertainty, but only little research work has addressed the issue of mining uncertain data. Many scientific methods for data collection are known to have error-estimation methodologies built into the data collection and feature extraction process. In [2], [13], a number of real applications in which such error information can be known or estimated a priori are summarized as follows:
 
• The statistical error of data collection can be estimated by prior experimentation if the inaccuracy arises out of the limitations of the data collection equipment. In such cases, different features of an observation may be collected to different levels of approximation.

• Imputation procedures can be used to estimate missing values in the case of missing data. The statistical error of imputation for a given entry is often known a priori if such procedures are used.

• Data mining methods are applied to derived data sets that are generated by statistical methods such as forecasting. In such cases, the error of the data can be derived from the methodology used to construct the data.

• In many applications, such as demographic data sets, the data is available only on a partially aggregated basis. Each aggregated record is actually a probability distribution.

• The trajectory of the objects may be unknown in many mobile applications. In fact, many spatiotemporal applications are inherently uncertain, since the future behavior of the data can be predicted only approximately.

This paper will neither address the existing techniques for uncertain data clustering nor propose a new one. Instead, it will address the impact of uncertain data on clustering results using a primitive model of a moving object database.
II. CLUSTERING ALGORITHMS
Clustering is a data mining technique used to identify clusters based on the similarity between data objects. Traditionally, clustering is applied to unclassified data objects with the objective of maximizing the distance between clusters and minimizing the distance within each cluster. Clustering is widely used in many applications including pattern recognition, dense region identification, customer purchase pattern analysis, web page grouping, information retrieval, and scientific and engineering analysis. Clustering algorithms deal with a set of objects whose positions are accurately known [3].
To study the performance of the clustering algorithms with uncertain moving object data sets, we have chosen the k-means, DBSCAN, and SOM algorithms, which are discussed below.
A) K-Means Clustering Algorithm

One of the best known and most popular clustering algorithms is the k-means algorithm. K-means clustering involves search and optimization. K-means is a partition-based clustering algorithm: its goal is to partition the data D into k parts such that there is little similarity across groups but great similarity within a group. More specifically, k-means aims to minimize the mean square error of each point in a cluster with respect to its cluster centroid. The square error is

SE = Σ_{i=1}^{k} Σ_{x_j ∈ c_i} || x_j − M_ci ||²

where k is the number of clusters, |c_i| is the number of elements in cluster c_i, and M_ci is the mean of cluster c_i.
Steps of the K-Means Algorithm

The k-means algorithm is explained in the following steps. The algorithm normally converges in a small number of iterations, but each iteration can take considerable time if the number of data points and the dimensionality of the data are high.

Step 1: Choose k random points as the cluster centroids.

Step 2: For every point p in the data, assign it to the closest centroid; that is, compute d(p, M_ci) for all clusters and assign p to the cluster c* for which d(p, M_c*) <= d(p, M_ci) for all i.

Step 3: Recompute the center point of each cluster based on all points assigned to that cluster.

Step 4: Repeat steps 2 and 3 until there is convergence. (Note: convergence can mean repeating for a fixed number of times, or until SE_new − SE_old <= ε, where ε is some small constant; that is, we stop the clustering when the new SE objective is sufficiently close to the old one.)
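As an illustration, here is a minimal Python sketch of these four steps. It is not the paper's implementation; the data array, k, and the tolerance eps are assumed inputs.

```python
import numpy as np

def kmeans(data, k, eps=1e-6, max_iter=100, seed=0):
    """Minimal k-means following Steps 1-4 above."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k random points as the initial centroids.
    centroids = data[rng.choice(len(data), k, replace=False)].astype(float)
    old_se = np.inf
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid from its assigned points.
        for i in range(k):
            if (labels == i).any():
                centroids[i] = data[labels == i].mean(axis=0)
        # Step 4: stop when SE_new - SE_old <= eps.
        se = ((data - centroids[labels]) ** 2).sum()
        if old_se - se <= eps:
            break
        old_se = se
    return labels, centroids
```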
B) DBSCAN Algorithm

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) relies on a density-based notion of clusters; it is designed to discover clusters of arbitrary shape and also has the ability to handle noise. DBSCAN requires two parameters:

• Eps: the maximum radius of the neighborhood
• MinPts: the minimum number of points in an Eps-neighborhood

The clustering process is based on the classification of the points in the dataset as core points, border points, and noise points, and on the use of density relations between points (directly density-reachable, density-reachable, density-connected [Ester 1996]) to form the clusters.
Core points: The points that are at the interior of a cluster are called core points. A point is an interior point if there are enough points in its Eps-neighborhood, NEps(p) = {q ∈ D | dist(p, q) <= Eps}.

Border points: Points on the border of a cluster are called border points.

Noise points: A noise point is any point that is not a core point or a border point.
Directly Density-Reachable: A point p is directly density-reachable from a point q with respect to Eps and MinPts if p belongs to NEps(q) and |NEps(q)| >= MinPts.

Density-Reachable: A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi.

Density-Connected: A point p is density-connected to a point q with respect to Eps and MinPts if there is a point o such that both p and q are density-reachable from o with respect to Eps and MinPts.
Algorithm: The DBSCAN algorithm proceeds as follows (M. Ester, H. P. Kriegel, J. Sander, 1996); a code sketch is given after the steps.

• Arbitrarily select a point p.
• Retrieve all points density-reachable from p with respect to Eps and MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed.
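The following is a minimal Python sketch of this procedure, assuming Euclidean distance and in-memory data. It is a generic textbook DBSCAN, not the paper's code.

```python
import numpy as np

def dbscan(data, eps, min_pts):
    """Label points with a cluster id; -1 marks noise."""
    n = len(data)
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def region_query(i):
        # NEps(p) = {q in D | dist(p, q) <= Eps}
        return np.where(np.linalg.norm(data - data[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(region_query(i))
        if len(neighbors) < min_pts:
            continue                   # border or noise point: visit the next one
        labels[i] = cluster            # i is a core point: a cluster is formed
        while neighbors:               # expand via density-reachable points
            j = neighbors.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_pts:
                    neighbors.extend(j_neighbors)   # j is also a core point
            if labels[j] == -1:
                labels[j] = cluster
        cluster += 1
    return labels
```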
C) The Self-Organizing Map (SOM)

The Self-Organizing Map (SOM) was developed by Professor Teuvo Kohonen in the early 1980s. It is a computational method for the visualization and analysis of high-dimensional data. A self-organizing map consists of components called nodes. The nodes of the network are connected to each other, so that it becomes possible to determine the neighborhood of a node. Each node receives all elements of the training set, one at a time, in vector format. For each element, the Euclidean distance is calculated to determine the fit between that element and
the weight of the node. The weight is a vector of the same dimension as the input vectors. This allows determining the "winning node", that is, the node that best represents the training element. Once the winning node is found, the neighbors of the winning node are identified. The winning node and these neighbors are then updated to reflect the new training element.

It is customary for both the neighborhood function and the learning rate to be decreasing functions of time. This means that as more training elements are learned, the neighborhood becomes smaller and the nodes are less affected by new elements. We express this update as the following function: for a node x,

x(t+1) = x(t) + N(x,t) α(t) (ξ(t) − x(t))

where
x(t+1) is the next value of the weight vector,
x(t) is the current value of the weight vector,
N(x,t) is the neighborhood function, which decreases the size of the neighborhood as a function of time,
α(t) is the learning rate, which decreases as a function of time,
ξ(t) is the vector representing the input document.

Based on this information, the algorithm is given below.
Algorithm

1. Initialize the weights of the nodes, either to random or to precomputed values.
2. For all input elements:
   • Take the input and get its vector.
   • For each node in the map, compare the node with the input's vector.
   • The node with the vector closest to the input vector is the winning node.
   • Update the winning node and its neighbors according to the formula above.
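Below is a minimal Python sketch of this training loop for a one-dimensional map. The linear decay schedules and the Gaussian neighborhood function are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def train_som(data, n_nodes=10, n_epochs=20, seed=0):
    """Train a 1-D SOM: x(t+1) = x(t) + N(x,t)*alpha(t)*(xi(t) - x(t))."""
    rng = np.random.default_rng(seed)
    # 1. Initialize the node weights to random values.
    weights = rng.random((n_nodes, data.shape[1]))
    positions = np.arange(n_nodes)
    t, t_max = 0, n_epochs * len(data)
    for _ in range(n_epochs):
        for xi in data:
            # 2. Winning node: smallest Euclidean distance to the input.
            winner = np.linalg.norm(weights - xi, axis=1).argmin()
            # Learning rate and neighborhood radius both decay with time.
            alpha = 0.5 * (1 - t / t_max)
            sigma = max(n_nodes / 2 * (1 - t / t_max), 1e-3)
            # Gaussian neighborhood function N(x, t) around the winner.
            n_func = np.exp(-((positions - winner) ** 2) / (2 * sigma ** 2))
            # Update the winner and its neighbors toward the input.
            weights += n_func[:, None] * alpha * (xi - weights)
            t += 1
    return weights
```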
The Metric Used to Measure the Performance

In order to compare clustering results against external criteria, a measure of agreement is needed. Since we assume that each record is assigned to only one class in the external criterion and to only one cluster, measures of agreement between two partitions can be used.

The Rand index, or Rand measure, is a commonly used technique for measuring such similarity between two data clusterings. The measure was introduced by W. M. Rand in his paper "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association (1971).

Given a set of n objects S = {O1, ..., On} and two clusterings of S to compare, X = {x1, ..., xR} and Y = {y1, ..., yS}, where the partitions within each clustering are disjoint and their union equals S, we can compute the following values over all pairs of elements:

a is the number of pairs of elements in S that are in the same partition in X and in the same partition in Y,
b is the number of pairs that are not in the same partition in X and not in the same partition in Y,
c is the number of pairs that are in the same partition in X but not in the same partition in Y,
d is the number of pairs that are not in the same partition in X but are in the same partition in Y.

Intuitively, one can think of a + b as the number of agreements between X and Y and c + d as the number of disagreements. The Rand index R then becomes

R = (a + b) / (a + b + c + d)

The Rand index has a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the clusterings are exactly the same.
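A small Python sketch of this computation over all pairs (illustrative; labels_x and labels_y are assumed label arrays):

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Rand index R = (a + b) / (a + b + c + d) over all pairs."""
    agree = total = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        # a counts pairs together in both, b counts pairs apart in both.
        agree += (same_x == same_y)
        total += 1
    return agree / total

# Example: two clusterings of five objects.
print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.8
```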
III. MODELING MOVING OBJECT DATABASE WITH UNCERTAINTY
The following figure from [1] illustrates the problem when a clustering algorithm is applied to moving objects with location uncertainty. Figure 4(a) shows the actual locations of a set of objects, Figure 4(b) shows the recorded locations of these objects, which are already outdated, and Figure 4(c) shows the uncertain data locations. The clusters obtained from these outdated values could be significantly different from those obtained if the actual locations were available. If we rely solely on the recorded values, many objects could be put into the wrong clusters. Even worse, each misplaced member of a cluster would shift the cluster centroids, resulting in further errors.

Figure 4: The Uncertain Data Clustering Scenario

We have modeled a moving object database to resemble the previously explained scenario. Here we present an example case of the model under consideration. The attributes of the simulated moving object database presented here are:

The Number of Groups: 5
The Number of Dimensions: 2
Number of Objects per Group: 50
The Standard Deviation: 0.6
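A plausible Python sketch of such a simulation is shown below. The Gaussian group model and the uniform drift used to produce the outdated "recorded" state are assumptions for illustration; only the four attribute values above come from the text.

```python
import numpy as np

def simulate_moving_objects(n_groups=5, n_dims=2, n_per_group=50,
                            std=0.6, max_drift=1.0, seed=0):
    """Return (actual, recorded): true locations and outdated ones."""
    rng = np.random.default_rng(seed)
    # Each group is a Gaussian cloud around a random center.
    centers = rng.uniform(0, 10, size=(n_groups, n_dims))
    actual = np.vstack([rng.normal(c, std, size=(n_per_group, n_dims))
                        for c in centers])
    # Recorded locations are outdated: each object has moved by some
    # random drift since its location was last reported.
    drift = rng.uniform(-max_drift, max_drift, size=actual.shape)
    recorded = actual - drift
    return actual, recorded

actual, recorded = simulate_moving_objects()
print(actual.shape, recorded.shape)  # (250, 2) (250, 2)
```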