
Gravitational Based Hierarchical Clustering Algorithm

P S Bishnu1 and V Bhattacherjee2

Department of Computer Science and Engineering, Birla Institute of Technology, Ranchi, India
Email: 1psbishnu@gmail.com, 2vbhattacharya@bitmesra.ac.in

Abstract—We propose a new gravitational based hierarchical clustering algorithm using the kd-tree. The kd-tree generates densely populated packets, and clusters are found using the gravitational force between the packets. The resulting clusters are of high quality, and the method is both effective and robust. The proposed algorithm is tested on a synthetic dataset and the results are presented.

Index Terms—Data Mining, hierarchical clustering, kd-tree

ACEEE Int. J. on Communication, Vol. 01, No. 03, Dec 2010. © 2010 ACEEE, DOI: 01.IJCOM.01.03.49

I. INTRODUCTION

Hierarchical clustering generates a hierarchical series of nested clusters, which can be represented graphically by a tree called a dendrogram. By cutting the dendrogram at some level, we can obtain a specified number of clusters [3]. Owing to its nested structure, hierarchical clustering is effective and provides richer structural information. This paper presents a new gravity-based hierarchical technique using the kd-tree. We call the new algorithm GLHL (an anagram of the bold letters in GravitationaL based Hierarchical aLgorithm). The paper is organized as follows: Section II surveys related work, Section III gives an overview of hierarchical clustering and kd-trees, Section IV presents the proposed algorithm, Section V discusses results, and Section VI concludes.

II. RELATED WORK

Here we review some of the pioneering methods in hierarchical clustering. Zhang et al. proposed the BIRCH technique [6]. It overcomes two limitations of agglomerative clustering: poor scalability and the inability to undo what was done in a previous step [1][6]. BIRCH is designed for clustering large datasets; it is incremental and hierarchical and can handle outliers. It introduces the clustering feature (CF), which summarizes information about a cluster; a CF tree is built from CFs and holds CF information about its subclusters [2][6]. BIRCH applies only to numeric data [2], and it does not perform well when clusters are not spherical [1]. Jiang et al. proposed DHC (density-based hierarchical clustering) [7]. It uses two data structures: a density tree, which uncovers the embedded clusters, and an attraction tree, which explores the inner structure of clusters, their boundaries, and the outliers.

III. OVERVIEW OF AGGLOMERATIVE CLUSTERING ALGORITHM AND KD-TREE

A. Agglomerative hierarchical clustering algorithm

Agglomerative hierarchical algorithms begin with each data object as an individual cluster. At each step the two most similar clusters are merged, so the total number of clusters decreases by one. These steps are repeated until the desired number of clusters is obtained or the distance between the two closest clusters exceeds a threshold [5].

B. kd-tree

The kd-tree is a geometrical, top-down hierarchical tree data structure. At the root, the whole data space is divided by a vertical line into two subsets of roughly equal size. The splitting line is stored at the root; the data points on its left become the content of the left subtree and those on its right the content of the right subtree. At the next level, each node is again partitioned along the alternate direction (if the previous level split on a vertical line, this level splits on a horizontal line). At each partitioning, data points to the left of or on a vertical line are assigned to the left subtree and the rest to the right subtree. The process continues until the criterion function converges [8].

IV. THE PROPOSED GRAVITATIONAL BASED HIERARCHICAL ALGORITHM

The proposed algorithm uses a kd-tree to divide the data space into regions called leaf buckets and calculates the density of each leaf bucket. The mean of each leaf bucket is then computed and treated as that bucket's center of gravity. An object (leaf bucket) with high density attracts objects (leaf buckets) with lower density [7]. Using a gravity function, we can calculate the attraction force between two leaf buckets: the degree of gravity is in direct ratio to the product of the two buckets' densities and in inverse ratio to the square of their distance [9].

A. Calculation of density of a leaf bucket

Consider a set of n data points (o_1, o_2, ..., o_n) occupying a t-dimensional space.

Each point o_i has an associated coordinate (o_i1, o_i2, ..., o_it). There exists a bounding box (bucket) which contains all the data points and whose extrema are defined by the maximum and minimum coordinate values of the data points in each dimension [Figure 1a]. The data are then divided into two sub-buckets by splitting along the median coordinate value of the parent bucket, so that the number of points in each sub-bucket remains roughly the same. The division process is repeated recursively on each sub-bucket until leaf buckets are created, where each leaf bucket contains a single data point [Figures 1b-1c]. The density of each leaf bucket is computed from the bucket size and the number of data points it contains (here each leaf bucket contains a single data point, but the bucket sizes differ): the more data points a bucket of a given size holds, the higher its density is likely to be [11].
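To make the leaf-bucket construction and density calculation concrete, the following sketch recursively median-splits 2-D points into single-point leaf buckets and assigns each a density. It is an illustrative Python sketch, not the authors' C implementation; the paper gives no closed form for density, so density = points / bucket area is an assumption here, and all function names are hypothetical.

```python
def kd_partition(points, bounds, depth=0):
    """Median-split `points` into single-point leaf buckets (Section III-B).

    bounds is [[xmin, xmax], [ymin, ymax]]; returns (points, bounds) pairs.
    """
    if len(points) <= 1:
        return [(points, bounds)]
    axis = depth % 2                          # alternate the split direction
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    # split halfway between the two middle points so no bucket collapses
    # to zero width (assumes distinct coordinates along the split axis)
    split = (pts[mid - 1][axis] + pts[mid][axis]) / 2.0
    left_bounds = [list(b) for b in bounds]
    left_bounds[axis][1] = split
    right_bounds = [list(b) for b in bounds]
    right_bounds[axis][0] = split
    return (kd_partition(pts[:mid], left_bounds, depth + 1)
            + kd_partition(pts[mid:], right_bounds, depth + 1))

def bucket_density(points, bounds):
    """Assumed density estimate: points per unit area of the leaf bucket."""
    area = 1.0
    for lo, hi in bounds:
        area *= hi - lo
    return len(points) / area

data = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
box = [[min(p[k] for p in data), max(p[k] for p in data)] for k in range(2)]
leaves = kd_partition(data, box)
print(len(leaves))                            # 6: one leaf bucket per point
print([round(bucket_density(pts, b), 3) for pts, b in leaves])
```

Splitting halfway between the two middle points keeps every leaf bucket's area positive, which the density estimate requires; smaller buckets receive higher densities, matching the intuition above.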

Table 5.1 (Dataset)

Figure 1 (1a, 1b, 1c): Implementation of the kd-tree

B. Calculation of force of attraction

The force of gravity is given by F = G * m1 * m2 / r^2, where m1 and m2 are the masses of the two objects, r is the distance between them, and G is the universal gravitational constant. The formula used in this paper is adapted from [9], using the density values calculated in Section IV-A. The force of attraction between two buckets b_i and b_j is

    f_ij = (d_i * d_j) / dis_ij^2,

where d_i and d_j are the densities of the two buckets b_i and b_j, and dis_ij is the distance between them.

C. Calculation of gain

Jung et al. proposed clustering gain as a measure of clustering optimality [10]. It is based on the squared error sum as a clustering algorithm proceeds, and it applies to both hierarchical and partitional clustering methods for estimating the desired number of clusters. The authors showed that the clustering gain attains its maximum at the optimal number of clusters. In addition, clustering gain is cheap to compute, so it can be evaluated at every step of the clustering process to determine the optimal number of clusters without increasing the computational complexity [10]. The clustering gain is computed from

    Δ = Σ_{j=1}^{K} (s_j − 1) * ||o_0 − o_0^(j)||^2,

where s_j is the number of data points in cluster C_j, K denotes the number of clusters, o_0 is the global centroid, defined as o_0 = (1/n) * Σ_{i=1}^{n} o_i, where n is the total number of data points and the o_i are the data points, and o_0^(j) denotes the centroid of cluster j, defined as o_0^(j) = (1/s_j) * Σ_{i=1}^{s_j} o_i^(j), where o_i^(j) denotes a data point belonging to the j-th cluster.

D. The GLHL Algorithm

This technique is based on picking the highest-density leaf bucket and calculating its gravitational attraction force with the lower-density leaf buckets. The pair of buckets with the maximum force of attraction is merged in each iteration. The iterations continue until a single bucket remains in the bucket set B. The algorithm uses the following data structures:

    B               : set of buckets
    D[1..n]         : density vector, where n is the number of data points
    g[0..n]         : gain array
    dis[1..n][1..n] : distance array
    F1[1..n][1..n]  : force array
    c[1..n]         : number of clusters

All these data structures (B, D, g, dis, and F1) need to be recalculated at each iteration because of the merging of buckets. The pseudocode for the GLHL algorithm is given in Algorithm 1.
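The merging procedure described above can be sketched as follows. This is a hypothetical Python rendering, not the authors' C implementation: it assumes initial bucket densities are supplied by the caller, that a merged bucket's density is the sum of its parts (the paper does not specify the update), and that distances are taken between bucket centroids (each bucket's mean is its center of gravity).

```python
def centroid(bucket):
    n = len(bucket)
    return tuple(sum(p[k] for p in bucket) / n for k in range(len(bucket[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gain(clusters, all_points):
    """Clustering gain: sum_j (s_j - 1) * ||o_0 - o_0^(j)||^2 (Section IV-C)."""
    o0 = centroid(all_points)
    return sum((len(c) - 1) * sq_dist(o0, centroid(c)) for c in clusters)

def glhl(points, densities):
    """Merge the max-force bucket pair until one bucket remains (Algorithm 1)."""
    buckets = [[p] for p in points]        # each data point starts as a bucket
    dens = list(densities)
    gains = [gain(buckets, points)]        # g[0]: every point its own cluster
    counts = [len(buckets)]
    while len(buckets) > 1:
        best, bi, bj = -1.0, 0, 1
        for i in range(len(buckets)):
            for j in range(i + 1, len(buckets)):
                # force f_ij = d_i * d_j / dis_ij^2; sq_dist gives dis_ij^2
                f = dens[i] * dens[j] / sq_dist(centroid(buckets[i]),
                                                centroid(buckets[j]))
                if f > best:
                    best, bi, bj = f, i, j
        buckets[bi] = buckets[bi] + buckets[bj]   # merge the max-force pair
        dens[bi] = dens[bi] + dens[bj]            # assumed density update
        del buckets[bj], dens[bj]
        gains.append(gain(buckets, points))
        counts.append(len(buckets))
    k = max(range(len(gains)), key=gains.__getitem__)
    return counts[k]                       # cluster count at maximum gain

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(glhl(pts, [1.0] * len(pts)))         # two well-separated triples -> 2
```

In this small example the gain peaks when the two natural groups of three points have been formed, so the algorithm returns two clusters, matching the intended behaviour of the gain criterion [10].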


Algorithm 1: Gravitational based hierarchical clustering algorithm (GLHL Algorithm)

    B = {b_i} : b_i = kd_tree_partitioning(dataset);
    D = [d_i] : d_i = calculate_density(b_i);
    dis = [dis_ij] : dis_ij = calculate_distance(b_i, b_j) for all buckets;
    F1 = [f_ij] : f_ij = calculate_force(b_i, b_j) for all buckets;
    g[0] = calculate_gain(); // each data point is a cluster; g stores gain values
    q = 1; // counter for iterations
    Repeat
        N = |B|; // number of buckets
        If N = 1 then break; // only one bucket remaining
        merge(b_i, b_j) : f_ij = find_max(F1);
        Update B, D, dis, F1;
        g[q] = calculate_gain();
        c[q] = N; // the c vector stores the number of clusters (buckets)
        q = q + 1;
    Until true;
    Return c[k] as the optimal number of clusters : g[k] = find_max(g);

The GLHL algorithm functions in four phases. Phase 1 initializes the buckets by kd-tree partitioning, together with the data structures: the densities D, the distances dis, the gravitational forces F1, and the initial gain g[0] for all n buckets, where n is the total number of data points. Phase 2 checks the exit criterion, namely that only one bucket remains. Phase 3 finds the maximum entry f_ij in the F1 matrix and merges the buckets b_i and b_j corresponding to the (i, j) pair. Phase 4 updates B, D, dis, F1, and g, and stores the number of clusters in c; the algorithm returns the number of clusters k corresponding to the maximum gain.

V. RESULTS

The GLHL algorithm was implemented in the C language and validations were performed. The data set used consists of synthetic two-dimensional data points generated by mouse clicks. Results are presented in Table 5.1; the rows in bold font give the optimal number of clusters in each case, indicated by the maximum gain value.

VI. CONCLUSION

In this paper a new gravitational based hierarchical clustering algorithm using the kd-tree has been proposed. The kd-tree structure is used to generate densely populated packets, and clusters are formed by calculating the gravitational force between the packets. The performance of the algorithm has been validated using the gain measure of Section IV-C, and the validation results are presented in Section V.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
[2] M. H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education.
[3] Y. Liu and Z. Liu, "An Improved Hierarchical K-Means Algorithm for Web Document Clustering," Int. Conf. on Computer Science and Information Technology, IEEE, 2008.
[4] S. Lee, W. Lee, S. Chung, D. An, I. Bok, and H. Ryu, "Selection of Cluster Hierarchy Depth in Hierarchical Clustering using K-Means Algorithm," Int. Symp. on Information Technology Convergence, IEEE, 2007.
[5] G. Karypis, E.-H. Han, and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling," Computer, 32:68-75, IEEE, 1999.
[6] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 103-114, Canada, 1996.
[7] D. Jiang, J. Pei, and A. Zhang, "DHC: A Density-based Hierarchical Clustering Method for Time Series Gene Expression Data," IEEE Transactions.
[8] M. de Berg et al., Computational Geometry: Algorithms and Applications, 2nd ed., Springer.
[9] M. Jian, L. Cheng, and W. Xiang, "Research of Gravity-based Outliers Detection," Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, IEEE, 2008.
[10] Y. Jung, H. Park, D.-Z. Du, and B. L. Drake, "A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering," Journal of Global Optimization, 25:91-111, 2003.
[11] S. J. Redmond and C. Heneghan, "A Method for Initialising the K-Means Clustering Algorithm using kd-trees," Pattern Recognition Letters, 28:965-973, 2007.
