
# Density Based Clustering Algorithm using Sparse Memory Mapped File



(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 5, 2010

J. Hencil Peter
Department of Computer Science, St. Xavier’s College, Palayamkottai, India.
hencilpeter@hotmail.com

A. Antonysamy
Department of Mathematics, St. Xavier’s College, Kathmandu, Nepal.
fr_antonysamy@hotmail.com
Abstract: The DBSCAN [1] algorithm is popular in the data mining field because it can mine noise-free clusters of arbitrary shape in an elegant way. Because the original DBSCAN algorithm computes distances between objects, it consumes considerable processing time and its computational complexity is O(N^2). In this paper we propose a new algorithm for mining density-based clusters using a Sparse Memory Mapped File (Sparse MMF) [3]. All the given objects are initially loaded into their corresponding locations in the Sparse Memory Mapped File, and during the SparseMemoryRegionQuery operation each object’s surrounding cells are visited to find neighbour objects, instead of computing the distance between each pair of objects in the data set. Using the Sparse MMF approach, it is shown that the DBSCAN algorithm can process a huge number of objects without runtime issues, and the performance analysis shows that the proposed solution is much faster than the existing algorithm.

Keywords: Sparse Memory Mapped File; Sparse MMF; Sparse Memory; Neighbour Cells; Sparse Memory DBSCAN.
I. INTRODUCTION
Data mining is a fast-growing field in which clustering plays a very important role. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [2]. Among the many algorithms proposed in the clustering field, DBSCAN is one of the most popular due to the high quality of its noise-free output clusters.

Most density-based clustering algorithms require O(N^2) computation time and a huge amount of main memory to process data in real-time scenarios. Since the seed object list grows at run time, it is very difficult to predict the memory required to process all the objects in the data set. If the memory is insufficient to hold the growing seed objects, the DBSCAN algorithm will crash at run time. To remove this instability and improve performance, a new solution is proposed in this paper.

The rest of the paper is organised as follows. Section 2 gives a brief history of related work in the same area. Section 3 introduces the original DBSCAN algorithm and Section 4 explains the proposed solution. Section 5 presents the experimental results, and Section 6 presents the conclusion and future work associated with this algorithm.

II. RELATED WORK
The DBSCAN (Density Based Spatial Clustering of Applications with Noise) [1] algorithm is the basic clustering algorithm for mining clusters based on object density. In this algorithm, the number of objects present within the neighbour region (Eps) is computed first. If the neighbour object count is below the given threshold value, the object is marked as NOISE. Otherwise a new cluster is formed from the core object by finding the group of density-connected objects that is maximal w.r.t. density-reachability.

The OPTICS [4] algorithm adapts the original DBSCAN algorithm to deal with clusters of varying density. This algorithm computes an ordering of the objects based on the reachability distance to represent the intrinsic hierarchical clustering structure. The valleys in the plot indicate the clusters, but the input parameter ξ is critical for identifying the valleys as ξ clusters.

The DENCLUE [5] algorithm uses kernel density estimation. The density function yields local density maxima, and these local density values are used to form the clusters. If the local density value is very small, the objects of the cluster are discarded as NOISE.

The Fast DBSCAN (FDBSCAN) algorithm [6] was devised to improve the speed of the original DBSCAN algorithm. The performance improvement is achieved by considering only a few selected representative objects inside a core object’s neighbour region as seed objects for further expansion. Hence this algorithm is faster than the basic version of DBSCAN, but it suffers from a loss of result accuracy.

The MEDBSCAN [7] algorithm was proposed recently to improve the performance of the DBSCAN algorithm without losing result accuracy. This algorithm uses three queues: the first stores the neighbours of the core object that lie within Eps distance, the second stores the neighbours of the core object that lie within 2*Eps distance, and the third is the seeds queue, which stores the unhandled objects for further expansion. This algorithm guarantees a notable performance improvement if the Eps value is not very sensitive.

Though the complexity of the DBSCAN algorithm can be reduced to O(N * logN) using spatial trees, constructing and organizing the tree takes extra effort and the tree requires additional memory to hold the objects. The new algorithm achieves a different complexity, O(N * 2^Eps), and it is shown that this complexity is better than that of the previous versions of the DBSCAN algorithm when the Eps value is small.

III. INTRODUCTION TO DBSCAN ALGORITHM
The working principles of the DBSCAN algorithm are based on the following definitions:

Definition 1: Eps Neighbourhood of an object p
The Eps Neighbourhood of an object p, denoted NEps(p), is defined as
NEps(p) = {q ∈ D | dist(p,q) <= Eps}.

Definition 2: Core Object Condition
An object p is a core object if its neighbour object count is >= the given threshold value (MinObjs), i.e.
|NEps(p)| >= MinObjs
where MinObjs is the minimum number of neighbour objects needed to satisfy the core object condition. In other words, if the number of neighbours of p that lie within the Eps radius is >= MinObjs, p is a core object.

Definition 3: Directly Density Reachable Object
An object p is directly density reachable from another object q w.r.t. Eps and MinObjs if p ∈ NEps(q) and |NEps(q)| >= MinObjs (core object condition).

Definition 4: Density Reachable Object
An object p is density reachable from another object q w.r.t. Eps and MinObjs if there is a chain of objects p1,…,pn, with p1=q and pn=p, such that pi+1 is directly density reachable from pi.

Definition 5: Density Connected Object
An object p is density connected to another object q if there is an object o such that both p and q are density reachable from o w.r.t. Eps and MinObjs.

Definition 6: Cluster
A cluster C is a non-empty subset of a database D w.r.t. Eps and MinObjs that satisfies the following conditions.
For every p and q, if p ∈ C and q is density reachable from p w.r.t. Eps and MinObjs, then q ∈ C.
For every p and q ∈ C, p is density connected to q w.r.t. Eps and MinObjs.

Definition 7: Noise
An object that does not belong to any cluster is called noise.

The DBSCAN algorithm finds the Eps Neighbourhood of each object in a database during the clustering process. Before cluster expansion, if the algorithm finds a non-core object, it marks the object as NOISE. When it finds a core object, the algorithm initiates a cluster and adds the surrounding objects to a queue for further expansion. Each object in the queue is popped in turn and its Eps neighbour objects are found. When the popped object is a core object, all its neighbour objects are assigned the current cluster id and its unprocessed neighbour objects are pushed into the queue for further processing. This process is repeated until no object remains in the queue.

IV. PROPOSED SOLUTION
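For comparison with the proposed solution, the baseline procedure of Section III (a distance-based region query followed by queue-driven expansion) can be sketched as follows. This is a minimal illustration, not the authors' code: the names `region_query`, `dbscan`, `NOISE` and `UNCLASSIFIED` are assumptions, the points are assumed to be two-dimensional tuples, and the distance is assumed to be Euclidean.

```python
from collections import deque
from math import hypot

NOISE = -1          # assumed label for noise objects
UNCLASSIFIED = 0    # assumed label for objects not yet visited


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (Definition 1).

    Every call scans the whole data set, which is the source of the
    O(N^2) behaviour of the baseline algorithm.
    """
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_objs):
    """Return one label per point: a cluster id >= 1, or NOISE."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNCLASSIFIED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_objs:      # not a core object (Definition 2)
            labels[i] = NOISE
            continue
        cluster_id += 1                     # start a new cluster at this core object
        queue = deque()
        for j in neighbours:
            if labels[j] == UNCLASSIFIED and j != i:
                queue.append(j)             # seed for further expansion
            if labels[j] in (UNCLASSIFIED, NOISE):
                labels[j] = cluster_id
        while queue:
            q = queue.popleft()
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) < min_objs:
                continue                    # border object: do not expand from it
            for r in q_neighbours:
                if labels[r] == UNCLASSIFIED:
                    queue.append(r)
                if labels[r] in (UNCLASSIFIED, NOISE):
                    labels[r] = cluster_id
    return labels
```

The queue in this sketch is the growing seed list whose memory use the proposed solution aims to eliminate.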
A new algorithm is proposed in this paper to improve performance as well as to process huge amounts of data. The algorithm relies entirely on the Sparse MMF, a concept that is briefly explained below:
A. Sparse Memory Mapped File (Sparse MMF)
The Sparse MMF [3] is a mechanism derived from the Memory Mapped File. A Memory Mapped File [3] is like virtual memory: it allows reserving a region of address space and committing physical storage to the region. The difference is that the physical storage comes from a file that is already on the disk instead of the system’s paging file. A memory mapped file can be used to access a data file on disk (even a very large one), to load and execute executable files and libraries, and to allow multiple processes running on the same machine to share data with each other. The Sparse MMF is similar to a Memory Mapped File but occupies only the required storage space in the physical file. If a Memory Mapped File is used to reserve the region of memory, then when the changes are committed to the file on disk, the file size will equal the size of the created Memory Mapped File. If the same mapping is replaced with a Sparse MMF, the final file’s size will correspond to the non-zero elements stored in the Sparse MMF. The Sparse MMF therefore gives better storage results, and hence it has been used in our research.
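The difference between the logical size of a mapping and the storage it actually occupies can be illustrated with a short sketch using Python's standard `mmap` module. This is an illustration under stated assumptions rather than the paper's implementation: the function name `sparse_mapping_demo` is hypothetical, and whether the file is really stored sparsely depends on the operating system and filesystem (on NTFS, for example, a file has to be marked sparse explicitly, which this sketch does not do).

```python
import mmap
import os
import tempfile


def sparse_mapping_demo(size=64 * 1024 * 1024, payload=b"object"):
    """Map a large, mostly empty file and write one small record into it.

    Returns (logical_size, physical_size_or_None, bytes_read_back).
    """
    offset = size // 2                      # arbitrary location inside the mapping
    fd, path = tempfile.mkstemp()
    try:
        # Extending the file leaves a "hole" on filesystems that support
        # sparse files, so no storage is committed for the empty region.
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as mm:
            mm[offset:offset + len(payload)] = payload
            mm.flush()
            readback = bytes(mm[offset:offset + len(payload)])
        st = os.stat(path)
        # st_blocks (512-byte units) is only available on POSIX systems.
        blocks = getattr(st, "st_blocks", None)
        physical = None if blocks is None else blocks * 512
        return st.st_size, physical, readback
    finally:
        os.close(fd)
        os.remove(path)
```

On a filesystem with sparse-file support, the reported physical size is typically a small fraction of the logical size, which is the storage behaviour the Sparse MMF relies on.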
B. Object’s Structure
As the core of this algorithm is the Sparse MMF, the objects to be processed are organized a bit differently: each object’s structure has three additional fields, NextObjectOffset, NextSeedObjectOffset and NextTempObjectOffset.
Figure 1. Sparse Memory Mapped File Object’s Structure
While loading the objects into the Sparse MMF, all the objects are chained in a sequence similar to a linked list (but not exactly a linked list). The first object’s NextObjectOffset field holds the offset of the next object, the second object’s field holds the offset of its immediate successor, and so on; the final object’s NextObjectOffset is set to NULL to indicate that there are no more objects to visit during the clustering process. The first object’s address must therefore always be retained in order to visit all the objects loaded in the Sparse MMF. The other two fields, NextSeedObjectOffset and NextTempObjectOffset, are used by the SparseMemoryRegionQuery function, which is explained in the next section.
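The chaining through NextObjectOffset can be sketched with fixed-size binary records. Several details here are assumptions made for illustration only: the data fields (`x`, `y`, `cluster_id`), the record layout, the use of absolute byte offsets, and the sentinel `NULL_OFFSET = -1` standing in for NULL. The records are also packed contiguously for brevity, whereas the paper places each object at the location of its cell.

```python
import struct

# Assumed layout: x, y (float64), cluster_id (int32), then the three
# offset fields NextObjectOffset, NextSeedObjectOffset, NextTempObjectOffset.
RECORD = struct.Struct("<ddiqqq")
NULL_OFFSET = -1  # assumed stand-in for NULL


def load_objects(points):
    """Pack 2-D points into a buffer and chain them via NextObjectOffset."""
    buf = bytearray(RECORD.size * len(points))
    for i, (x, y) in enumerate(points):
        offset = i * RECORD.size
        is_last = i + 1 == len(points)
        next_offset = NULL_OFFSET if is_last else offset + RECORD.size
        RECORD.pack_into(buf, offset, x, y, 0,
                         next_offset, NULL_OFFSET, NULL_OFFSET)
    return buf


def iter_objects(buf, first_offset=0):
    """Visit every object by following NextObjectOffset from the first one."""
    if not buf:
        return
    offset = first_offset
    while offset != NULL_OFFSET:
        x, y, cluster_id, next_offset, _seed, _temp = RECORD.unpack_from(buf, offset)
        yield offset, (x, y), cluster_id
        offset = next_offset
```

Only the first offset has to be remembered; every other object is reachable by following the chain, which matches the requirement that the first object's address be retained.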
C. SparseMemoryRegionQuery function
The proposed algorithm does not use any extra buffer or queue to store the seed objects or neighbour objects at run time. Instead, each object has a corresponding offset field in which the exact offset of the next seed object is stored. In the original DBSCAN algorithm, the RegionQuery function is used to retrieve the neighbour objects; in the new algorithm, the SparseMemoryRegionQuery function is introduced in its place. This function visits all the required surrounding cells in memory, and the objects in non-empty cells are chained and returned as seed objects. That is, the function starts from the center cell and visits the neighbour cells one by one. When the first non-empty cell is found, the center object’s NextSeedObjectOffset field is assigned the offset of the new object (Address(NewObject) - Address(CenterObject)); when the next object is found, its offset is stored in the previous object’s NextSeedObjectOffset field, and so on. Finally, the last object’s NextSeedObjectOffset field is assigned NULL. Thus the extra memory and the buffer/queue needed to store the seed objects are eliminated in this solution.

The function can update the neighbour objects’ offsets in either the NextSeedObjectOffset or the NextTempObjectOffset field. If it receives the update flag UpdateMasterSeedOffset, the neighbour objects’ offsets are stored in the NextSeedObjectOffset field; if the update flag is UpdateTempSeedOffset, the NextTempObjectOffset field is updated with the neighbour objects’ offsets instead.

The computational complexity of the DBSCAN algorithm depends on the RegionQuery function, which uses a distance function to find the neighbours present within a certain radius (Eps). In the new approach, distance computation during the SparseMemoryRegionQuery call is removed; the function visits only the required number of neighbour cells around the center cell.
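The chaining performed by SparseMemoryRegionQuery can be sketched as below. This is a simplified model, not the authors' implementation: the sparse memory is modelled as a dictionary keyed by cell address (`row * width + col`), every link is stored as an offset relative to the object holding it, a relative offset of 0 is assumed to play the role of NULL, and the two update flags are represented by the name of the field to write. All function and constant names are hypothetical.

```python
NULL = 0  # assumption: a relative offset of 0 marks the end of a chain

# Assumed stand-ins for the two update flags: each names the field to write.
UPDATE_MASTER_SEED_OFFSET = "next_seed_object_offset"
UPDATE_TEMP_SEED_OFFSET = "next_temp_object_offset"


def new_object():
    """An object holding only the two chain fields used by the query."""
    return {UPDATE_MASTER_SEED_OFFSET: NULL, UPDATE_TEMP_SEED_OFFSET: NULL}


def sparse_memory_region_query(cells, width, height, center, offsets, flag):
    """Chain the objects found in the neighbour cells of `center`.

    cells   : dict mapping cell address -> object (absent key = empty cell)
    offsets : pre-computed (dx, dy) neighbour cell offsets
    flag    : which field receives the chain
    Returns the number of neighbour objects found. No distance is computed.
    """
    row, col = divmod(center, width)
    prev = center
    found = 0
    for dx, dy in offsets:
        c, r = col + dx, row + dy
        if (dx, dy) == (0, 0) or not (0 <= c < width and 0 <= r < height):
            continue                        # skip the center and cells off the grid
        addr = r * width + c
        if addr not in cells:
            continue                        # empty cell
        cells[prev][flag] = addr - prev     # link previous object to the new one
        prev = addr
        found += 1
    cells[prev][flag] = NULL                # terminate the chain
    return found


def walk_chain(cells, start, flag):
    """Addresses of the chained objects, in the order they were found."""
    chain, addr = [], start
    while cells[addr][flag] != NULL:
        addr += cells[addr][flag]
        chain.append(addr)
    return chain
```

Because the links live inside the objects themselves, the query needs no auxiliary list, and the two fields allow two independent chains to coexist over the same objects.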
Figure 2. Neighbour Cells Diagram

In this proposed solution, a two-dimensional data set has been selected for the experiment, and the above diagram shows the neighbour cells at different distances. The center cell is painted red and the distance of an object stored in it is zero; the immediate neighbours, whose distance from the center cell is 1, are painted blue; the yellow cells have distances greater than 1 and <= 2; and so on. These neighbour cell offsets are pre-computed and stored in an M x 2 array, which is passed to the SparseMemoryRegionQuery function so that only the required neighbour cells are visited. Thus distance computation between objects is not required.
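One way to pre-compute the M x 2 array of neighbour cell offsets is sketched below. The paper does not state how cell distance is measured, so this sketch assumes unit-sized cells and Euclidean distance between cell centres; the function name `neighbour_cell_offsets` is hypothetical.

```python
import math


def neighbour_cell_offsets(eps):
    """Return the (dx, dy) offsets of all cells within eps of the center cell.

    The result is an M x 2 list, ordered from nearest to farthest, and it
    excludes the center cell itself. It depends only on eps, so it can be
    computed once and reused for every region query.
    """
    reach = math.floor(eps)
    offsets = [
        (dx, dy)
        for dy in range(-reach, reach + 1)
        for dx in range(-reach, reach + 1)
        if (dx, dy) != (0, 0) and dx * dx + dy * dy <= eps * eps
    ]
    offsets.sort(key=lambda o: o[0] ** 2 + o[1] ** 2)
    return offsets
```

Under these assumptions the distance test is paid once per offset when the table is built, and the number of cells visited per query grows with Eps but is independent of the number of objects N.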