(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 5, 2011

Potential Research into Spatial Cancer Database by Using Data Clustering Techniques
N. Naga Saranya,
Research Scholar (C.S), Karpagam University, Coimbatore-641021, Tamilnadu, India. E-mail: nachisaranya01@gmail.com

Dr. M. Hemalatha,
Head, Department of Software Systems, Karpagam University, Coimbatore-641021, Tamilnadu, India. E-mail: hema.bioinf@gmail.com

Abstract: Data mining is the extraction of hidden analytical information from large databases. Data mining tools forecast future trends and behaviors, allowing businesses to make practical, knowledge-driven decisions. This paper discusses data analysis tools and data mining techniques, which allow users to analyze data from many different dimensions or angles, sort it, and summarize the relationships identified. Here we analyze medical data as well as spatial data. Spatial data mining is the process of discovering patterns in geographic data. It is considered a more difficult task than traditional mining because of the difficulties associated with analyzing objects that have a concrete existence in space and time. We applied clustering techniques to form efficient clusters in discrete and continuous spatial medical databases. Clusters of arbitrary shape are created if the data is continuous in nature. Furthermore, this application investigated clustering techniques such as exclusive clustering and hierarchical clustering on the spatial data set to generate well-organized clusters. The experimental results showed that certain facts emerge that cannot be readily retrieved from the raw data.

Keywords: Data Mining, Spatial Data Mining, Clustering Techniques, K-means, HAC, Standard Deviation, Medical Database, Cancer Patients, Hidden Analytical.

I. INTRODUCTION

Recently many commercial data mining clustering techniques have been developed, and their usage is increasing tremendously. Researchers are working hard to find fast and well-organized algorithms for the abstraction of spatial medical data sets. Cancer has become one of the foremost causes of death in India. An analysis of the most recent data has shown that over 7 lakh new cases of cancer and 3 lakh deaths due to cancer occur annually in India. Cancer care has striven against near insurmountable obstacles of financial difficulty and an almost indifferent environment to bring the most refined scientific technology and excellent patient care to the poorest in the land. Furthermore, cancer is a preventable disease if it is detected at an early stage. There are different sites of cancer such as oral, stomach, liver, lungs, kidney, cervix, prostate, testis, bladder, blood, bone, breast and many others.

There has been a huge growth in clinical data over the past decades, so we need proper data analysis techniques and more sophisticated methods of data exploration. In this study, we use different data mining techniques for the effective analysis of clinical data. The main aim of this work is to explore various data mining techniques on clinical and spatial data sets. Data mining techniques include pattern recognition, clustering, association, and classification. Our proposed work applies clustering techniques to medical spatial datasets. A large number of fast clustering algorithms have been developed for large datasets, such as CURE, MAFIA, DBSCAN, CLARANS, BIRCH, and STING.

II. CLUSTERING ALGORITHMS AND TECHNIQUES IN DATA MINING

The process of organizing objects into groups whose members are similar in some way is called clustering. The goal of clustering is thus to determine the intrinsic grouping in a set of unlabeled data. The main kinds of clustering algorithms are partitioning-based, hierarchical, density-based and grid-based.

A. Partitioning Algorithm

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. As shown in Fig. 1, the K-means algorithm consists of the following centroid-based steps:
1. Fix the number of clusters (assume k clusters) and choose k points as the initial group centroids.
2. Assign each object to the group with the nearest centroid, based on the Euclidean distance.
3. Recompute each centroid as the mean value of the objects in its group.
4. Repeat steps 2 and 3 until the centroids no longer move.
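The four K-means steps of Section II.A can be illustrated with a minimal sketch. It is not the XLMiner implementation used in this study; the function name `kmeans`, the caller-supplied initial centroids and the `max_iter` safeguard are assumptions made only for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Cluster `points` (equal-length tuples) around the given initial centroids.

    Step 1 is done by the caller: `centroids` holds the k initial group
    centroids (hypothetical choice, e.g. k of the data points).
    """
    centroids = [tuple(c) for c in centroids]
    k = len(centroids)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its group.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty group keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

For example, calling `kmeans(points, [points[0], points[3]])` on six two-dimensional points that form two well-separated groups returns one label per point and the two group means. The result depends on the initial centroids, which is why they are passed in explicitly here.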


http://sites.google.com/site/ijcsis/ ISSN 1947-5500


Figure 1: Work flow of partition-based clustering algorithms

B. Hierarchical Clustering Algorithms

Hierarchical clustering essentially works by combining the closest clusters until the desired number of clusters is reached. This kind of hierarchical clustering is called agglomerative because it joins the clusters iteratively. There is also divisive hierarchical clustering, which does the reverse: every data item starts in the same cluster, which is then divided into smaller groups (Jain, Murty and Flynn, 1999). The distance between clusters can be measured in several ways, and that is how the single, average and complete linkage variants of hierarchical clustering differ. Many hierarchical clustering algorithms have the interesting property that the nested sequence of clusters can be represented graphically as a tree, called a dendrogram (Chipman and Tibshirani, 2006). There are two approaches to hierarchical clustering: we can go from the bottom up, grouping small clusters into larger ones, or from the top down, splitting big clusters into small ones. These are called agglomerative and divisive clustering, respectively.

Figure 2: Hierarchical Clustering

1) The agglomerative approach uses a bottom-up strategy to cluster the objects. It merges the atomic clusters into larger and larger clusters until all the objects are merged into a single cluster.
2) The divisive approach uses a top-down strategy to cluster the objects. In this method the larger clusters are divided into smaller clusters until each object forms a cluster of its own. Fig. 2 shows a simple example of hierarchical clustering.

C. Density Based Clustering Algorithm

Density-based clustering is a technique for finding clusters of arbitrary shape. There are different density-based clustering techniques, such as DBSCAN, SNN, OPTICS and DENCLUE.

DBSCAN algorithm: The DBSCAN algorithm was first introduced by Ester et al. (1996) [2] and relies on a density-based notion of clusters. Clusters are recognized by looking at the density of points. Regions with a high density of points indicate the existence of clusters, whereas regions with a low density of points indicate noise or outliers. This algorithm is particularly suited to large datasets with noise, and is able to identify clusters of different sizes and shapes.

The algorithm: The key idea of DBSCAN is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points; that is, the density in the neighborhood has to exceed some predefined threshold. The algorithm as described here requires three input parameters:
- k, the neighbors list size;
- Eps, the radius that delimits the neighborhood area of a point (Eps-neighborhood);
- MinPts, the minimum number of points that must exist in the Eps-neighborhood.
The clustering process is based on the classification of the points in the dataset as core points, border points and noise points, and on the use of density relations between points (directly density-reachable, density-reachable, density-connected [2]) to form the clusters.

D. SNN algorithm

The SNN algorithm (Ertöz et al., 2003) [3], like DBSCAN, is a density-based clustering algorithm. The main difference between this algorithm and DBSCAN is that it defines the similarity between points by looking at the number of nearest neighbors that two points share. Using this similarity measure, the density is defined as the sum of the similarities of the nearest neighbors of a point. Points with high density become core points, while points with low density represent noise points. All remaining points that are strongly similar to a specific core point join that point's cluster.

The algorithm: The SNN algorithm needs three input parameters:
- K, the neighbors' list size;
- Eps, the threshold density;


- MinPts, the threshold that defines the core points.
After the input parameters are defined, the SNN algorithm first finds the K nearest neighbors of each point of the dataset. Then the similarity between pairs of points is calculated in terms of how many nearest neighbors the two points share. Using this similarity measure, the density of each point is calculated as the number of neighbors with which it shares Eps (density threshold) or more neighbors. Next, a point is classified as a core point if its density is equal to or greater than MinPts (core point threshold). At this point, the algorithm has all the information needed to start building the clusters.

OPTICS: OPTICS (Ordering Points To Identify the Clustering Structure) is a clustering technique that produces an augmented ordering of the dataset for cluster analysis. This ordering represents the density-based clustering structure of the dataset. The advantage of OPTICS is that it is not sensitive to the parameter values supplied by the user; it automatically generates the number of clusters.

DENCLUE: DENCLUE (clustering based on density distribution functions) is a clustering technique in which the clustering depends on a density distribution function. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e. points going to the same local maximum are put into the same cluster. The disadvantage of DENCLUE 1.0 is that the hill climbing may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it just comes close. The technique is based on the influence function (the impact of a data point on its neighborhood); the overall density of the data space is calculated as the sum of the influence functions applied to all data points, and clusters are determined using density attractors (local maxima of the overall density function).

E. Grid Based Clustering

A grid-based clustering algorithm partitions the data space into a finite number of cells to form a grid structure, and then performs all clustering operations on the obtained grid structure. It is an efficient clustering algorithm, but its effect is seriously influenced by the size of the cells. Grid-based approaches are popular for mining clusters in a large multidimensional space, in which clusters are regarded as regions denser than their surroundings. The computational complexity of most clustering algorithms is at least linearly proportional to the size of the data set. The great advantage of grid-based clustering is its significant reduction of the computational complexity, especially for clustering very large data sets [8]. In general, a typical grid-based clustering algorithm consists of the following five basic steps (Grabusts and Borisov, 2002) [7]:
1. Creating the grid structure, i.e. partitioning the data space into a finite number of cells.
2. Calculating the cell density for each cell.
3. Sorting the cells according to their densities.
4. Identifying the cluster centers.
5. Traversing the neighboring cells.

III. EXPERIMENTAL RESULTS

Here we have taken several series of datasets from several websites and direct surveys, and we derive applicable pattern detection for medical diagnosis.

Cancer Database (SEER Datasets): The web site www-dep.iarc.fr/globocan/database.htm hosts the datasets. They contain the cancer patients who registered themselves. The dataset consists of basic attributes such as sex, age, marital status, height and weight. Data were taken for the age group 20 to 75+ years, and the major cancers in this group were examined. Male and female cases were examined for the various cancers. The data were collected, and a substantial distribution was found for incidence and mortality by sex and cancer site. The analysis suggests that there were more male than female cases of cancer. In this study, the data were taken from the SEER datasets, which hold records of cancer patients from 1975 to 2008. Spatial datasets include remotely sensed images, geographical information with spatial attributes such as location, digital sky survey data, mobile phone usage data, and medical data. Five major cancer sites (lung, kidney, bones, small intestine and liver) were studied. Data mining algorithms such as K-means, SOM and hierarchical clustering were then applied to the data sets. The database analysis was done using the XLMiner tool kit. Fig. 3 is a statistical chart comparing the number of male and female cancer cases. The data consist of discrete data sets with the following attributes: type of cancer, male cases, female cases, and cases of death pertaining to the specific cancer.
About 21 cancers were used in the analysis. The XLMiner tool does not accept discrete values, so they had to be transformed into continuous attribute values.
Figure 3: Female and Male cases of Cancer (chart categories: Small intestine, Lung and Bronchus, Bones and Joints, Kidney and Renal Pelvis; years 2000, 2002, 2004, 2006, 2008)

The data were subdivided into X and Y values, and the results were produced using the K-means and HAC clustering algorithms. In XLMiner, the low-level clusters are formed using K-means and SOM; HAC clustering then builds the dendrogram from the low-level clusters. Fig. 3 shows the number of cases for both sexes; males are more affected than females.

TABLE 1: Input of Hierarchical Clustering
Data
  Input data: ['both si dendrogram.xls']'Sheet1'!$A$2:$F$5
  # Records in the input data: 4
  Input variables normalized: Yes
  Data type: Raw data
  # Selected variables / Selected variables: X1, X2, X3, X4
Parameters/Options
  Draw dendrogram: Yes
  Show cluster membership: Yes
  # Clusters: 3
  Selected similarity measure: Euclidean distance
  Selected clustering method: Average group linkage

Figure 4: Female cases of Cancer (chart categories: Small intestine, Lung and Bronchus, Bones and Joints, Kidney and Renal Pelvis; years 2000, 2002, 2004, 2006, 2008)

Figure 5: Male cases of Cancer (chart categories: Small intestine, Lung and Bronchus, Bones and Joints, Kidney and Renal Pelvis; years 2000, 2002, 2004, 2006, 2008)

Fig. 4 and Fig. 5 show the clusters of female and male cases for the different cancers. This sample was collected from patients who did not survive the disease. The analysis shows that the proportion of males was larger than that of females. By analyzing the collected data, we may be able to develop measures for better control of this disease. Fig. 6 shows the number of deaths due to cancer in both male and female cases, obtained using XLMiner.
Figure 6: Female and Male death cases of Cancer (chart categories: Small intestine, Lung and Bronchus, Bones and Joints, Kidney and Renal Pelvis; years 2000, 2002, 2004, 2006, 2008)

TABLE 2: Clustering Stages
Stage   Cluster 1   Cluster 2   Distance
1       1           3           0.079582
2       1           4           1.69234
3       1           2           3.146232
Elapsed Time: Overall (secs) 3.00

Table 3 presents the HAC (hierarchical agglomerative clustering) results, in which clusters of appropriate size were determined. Clusters are subdivided into sub-clusters, and the attributes are Xn (n = 1, 2, 3, 4, 5). Here we predicted the clusters using hierarchical clustering.

TABLE 3: Hierarchical Clustering - Predicted Clusters

The K-means method is an efficient technique for clustering large data sets and is used to determine the size of each cluster. The input of the hierarchical clustering is shown in Table 1; it contains the data, variables and parameters. The clusters are computed from the distance measure used inside the hierarchical clustering. Here Xn (n = 1, 2, 3, 4, 5) are the selected variables in the datasets. HAC (hierarchical agglomerative clustering), a tree-based partition method, was then applied to our datasets; the resulting clustering stages and the elapsed time are shown in Table 2. In our experiments HAC gave far better results than the other clustering methods. Principal component analysis was used to visualize the data. The X, Y coordinates identify the position of each object. The coordinates were used and the clusters were determined by the appropriate attribute values. The mean and standard deviation of each cluster were determined.
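The per-cluster mean and standard deviation mentioned above can be computed as in the following minimal sketch. The function name `cluster_summary` and the use of the population standard deviation (`pstdev`) are assumptions; the tool used in this study may apply the sample standard deviation instead.

```python
from statistics import mean, pstdev


def cluster_summary(rows, labels):
    """Return {cluster label: [(mean, std dev) for each attribute column]}.

    `rows` are equal-length tuples of continuous attribute values and
    `labels` gives the cluster of each row (both hypothetical inputs).
    """
    summary = {}
    for lab in sorted(set(labels)):
        members = [row for row, l in zip(rows, labels) if l == lab]
        # zip(*members) turns the member rows into attribute columns.
        summary[lab] = [(mean(col), pstdev(col)) for col in zip(*members)]
    return summary
```

A smaller standard deviation for an attribute indicates that the cluster is more compact along that attribute. The population form is used here so that a cluster with a single member yields 0 rather than an error.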

Figure 7 shows the dendrogram in which the dataset has been partitioned into three clusters with K-means.

Figure 7: Dendrogram (average group linkage; vertical axis: distance, 0 to 3.5)


The HAC clustering algorithm is applied to the K-means result to generate the dendrogram. In a dendrogram, elements are grouped together in one cluster when they have the closest values of all the elements available. Cluster 2 and cluster 3 are combined in the diagram. The analysis is carried out on the subdivisions of the clusters.
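The way HAC builds a dendrogram stage by stage can be sketched as follows. This is an illustration under assumptions, not the XLMiner procedure: it uses plain average linkage (the mean pairwise Euclidean distance between two clusters), which may differ from XLMiner's "average group linkage", and the names `hac` and `average_linkage` are hypothetical. The input `points` could be, for example, the low-level cluster centroids produced by K-means.

```python
import math
from itertools import combinations


def average_linkage(a, b, points):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def hac(points, n_clusters=1):
    """Agglomerative clustering; returns (clusters, stages).

    Each stage is (cluster_1, cluster_2, distance), where a cluster is
    identified by the smallest index of its members.
    """
    clusters = [[i] for i in range(len(points))]  # start from atomic clusters
    stages = []
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under average linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        lo, hi = sorted((min(a), min(b)))
        stages.append((lo, hi, average_linkage(a, b, points)))
        # Merge the pair into one larger cluster.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return clusters, stages
```

The returned `stages` list has the same shape as a clustering-stages table (stage, two merged clusters, merge distance), and stopping at `n_clusters=3` corresponds to cutting the dendrogram into three clusters.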
TABLE 4: Representation of Cluster Mean and Standard Deviation
Continuous attributes: Mean (StdDev)

Cluster_HAC_1=c_hac_1, Examples [59.3 %] 16
Att - Desc          Test value   Group            Overall
Year ofDiagnosis     0.5         82.50 (4.76)     80.59 (24.14)
fliver              -1.7         11.79 (4.34)     13.03 (4.47)
fstomach            -2.1         24.98 (5.31)     26.61 (4.84)
mstomach            -3.5         17.15 (1.90)     19.57 (4.22)
flungs              -3.6         18.88 (1.08)     19.83 (1.64)
mliver              -3.9          4.20 (1.92)      6.53 (3.72)
mlungs              -3.9         13.90 (0.68)     14.56 (1.05)
mkidney             -4           58.71 (3.83)     62.13 (5.30)
fkidney             -4.1         62.14 (5.17)     66.97 (7.28)

Cluster_HAC_1=c_hac_2, Examples [7.4 %] 2
Att - Desc          Test value   Group            Overall
moral                2.9         64.90 (0.99)     55.78 (4.52)
mliver               2.7         13.60 (0.00)      6.53 (3.72)
mstomach             2.7         27.60 (2.97)     19.57 (4.22)
fliver               2.6         21.05 (1.77)     13.03 (4.47)
flungs               2.4         22.55 (0.78)     19.83 (1.64)
mkidney              2.1         70.00 (0.99)     62.13 (5.30)
fkidney              2.1         77.75 (2.05)     66.97 (7.28)
fstomach             1.7         32.20 (1.56)     26.61 (4.84)
Year ofDiagnosis    -4.8          0.50 (0.71)     80.59 (24.14)

Table 4 characterizes each cluster by the mean and standard deviation of its attributes. The first comparison is among the objects of cluster 1. The second comparison is between the objects of cluster 2 and cluster 1. The third comparison is between cluster 3 and cluster 1. The results show the mean and standard deviation of each cluster and also of the objects within each cluster. Cluster 1 has the lowest number of cancer cases, cluster 2 has an average number of cancer cases, whereas cluster 3 has a large number of cancer cases.

The cluster compactness is determined by the standard deviation: a cluster becomes more compact as its standard deviation decreases and more dispersed as it increases.

IV. DISCUSSION

This paper focuses on the clustering algorithms HAC and K-means, where HAC is applied to the K-means result to determine the number of clusters. The quality of the clusters improves if HAC is applied to the K-means result. The paper has referenced and discussed the issues of the specified algorithms for data analysis. The analysis does not include missing records. The application can be used to demonstrate how data mining techniques can be combined with medical data sets and can be effectively applied to advance various kinds of cancer-related research.

V. CONCLUSION

This study shows that data mining techniques are promising for cancer datasets. Our future work will address missing values and apply various algorithms for the fast processing of records. In addition, the research will focus on spatial data clustering to develop a new spatial data mining algorithm. Once our tool is implemented as a complete data analysis environment in the cancer registry of SEER datasets, we aim to transfer the tool to related domains, thus showing the flexibility and extensibility of the underlying basic concepts and system architecture.

REFERENCES

1. Rao, Y.N., Sudir Gupta and S.P. Agarwal. 2003. National Cancer Control Programme: Current status and strategies. 50 years of cancer control in India, NCD Section, Director General of Health.
2. Ester, M., Kriegel, H.P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases. In Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, pp. 226-231.
3. Ertöz, L., Steinbach, M., and Kumar, V. 2003. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data. In Proc. of SIAM Int. Conf. on Data Mining, pp. 1-12.
4. Aberer, Karl. 2001. P-Grid: A self-organizing access structure for P2P information systems. In Proc. International Conference on Cooperative Information Systems, pp. 179-194. Springer.
5. Bar-Yossef, Ziv, and Maxim Gurevich. 2006. Random sampling from a search engine's index. In Proc. WWW, pp. 367-376. ACM Press. DOI: doi.acm.org/10.1145/1135777.1135833.
6. Ng, R.T., and Han, J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155.


7. Grabusts, P., and Borisov, A. 2002. Using grid-clustering methods in data classification. Proc. of the International Conference on Parallel Computing & Electrical Engineering (PARELEC'2002), Warsaw, Poland, pp. 425-426.
8. H. Pilevar, M. Sukumar. 2005. GCHL: A grid-clustering algorithm for high-dimensional very large spatial data bases. Pattern Recognition Letters 26, pp. 999-1010.
9. Jones, C., et al. 2002. Spatial Information Retrieval and Geographical Ontologies: An Overview of the SPIRIT Project. In Proc. 25th ACM Conference of the Special Interest Group in Information Retrieval, pp. 387-388.
10. Processing of Spatial Joins, Proc. ACM SIGMOD Int. Conf. on Management of Data, Minneapolis, MN, 1994, pp. 197-208.
11. T. Zhang, R. Ramakrishnan, and M. Livny. 1996. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int'l Conf. on Management of Data, ACM Press, pp. 103-114.
12. M. Ester, H. Kriegel, J. Sander, and X. Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. of 2nd Int. Conf. on KDD, pp. 226-231.
13. Wei Wang, Jiong Yang, and Richard R. Muntz. 1997. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proc. of 23rd Int. Conf. on VLDB, pp. 186-195.
14. Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco.
15. M. J. Horner, L. A. G. Ries, M. Krapcho, N. Neyman, R. Aminou, N. Howlader, et al. 2009. SEER Cancer Statistics Review, 1975-2007, based on November 2008 SEER data submission.
16. Gondek, D., and Hofmann, T. 2007. Non-redundant data clustering. Knowl Inf Syst 12(1):1-24.

Authors Profile

N. Naga Saranya received her first degree in Mathematics from Periyar University, Tamilnadu, India, in 2006. She obtained her master's degree in Computer Applications from Anna University, Tamilnadu, India, in 2009. She is currently pursuing her Ph.D. degree under the guidance of Dr. M. Hemalatha, Head, Dept. of Software Systems, Karpagam University, Tamilnadu, India.

Dr. M. Hemalatha completed MCA, MPhil. and PhD in Computer Science and is currently working as an Asst. Professor and Head, Dept. of Software Systems, Karpagam University. Dr. Hemalatha has ten years of experience in teaching, has published twenty-seven papers in international journals, and has presented seventy papers in various national conferences and one international conference. Areas of research are Data Mining, Software Engineering, Bioinformatics and Neural Networks. Dr. Hemalatha is also a reviewer for several national and international journals.
