International Journal of Computer Trends and Technology (IJCTT) - volume4Issue4 April 2013

Clustering In Data Mining

Nimrat Kaur Sidhu*, Rajneet Kaur**
*(Research Scholar, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India.)

**(Assistant Professor, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India.)

Abstract-- Clustering is one of the most commonly used techniques of data mining, in which patterns are discovered in the underlying data. This paper describes how clustering is carried out and surveys its applications. It also presents a framework for the mixed-attribute clustering problem and shows how customer data can be clustered to identify high-profit, high-value and low-risk customers.

Keywords-- Data Mining, Customer Clustering, Categorical Data

I. INTRODUCTION
Data mining systems can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications. Three important components of a data mining system are the databases, the data mining engine, and the pattern evaluation modules. The following are a few important definitions used in the clustering technique of data mining [1].

A. Cluster
A cluster is an ordered list of objects which have some common characteristics. The objects belong to an interval [a, b].

B. Distance between Two Clusters
The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed. The distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points $p = (p_1, p_2, \ldots)$ and $q = (q_1, q_2, \ldots)$ as

$d = \left[ \sum_i (p_i - q_i)^2 \right]^{1/2}$

C. Similarity Measures
A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between two documents i and j. A typical similarity measure generates a value of 0 for documents exhibiting no agreement among the assigned index terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.

D. Threshold
The threshold is the lowest value of similarity required to join two objects in one cluster. A threshold T(J) is given for the Jth variable (1 ≤ J ≤ N). Cases are partitioned into clusters so that within each cluster the Jth variable has a range less than T(J). The thresholds should be chosen fairly large, especially if there are many variables. The procedure is equivalent to converting each variable to a category variable (using the thresholds to define the categories); the clusters are then the cells of the multidimensional contingency table between all variables.

E. Similarity Matrix
The similarity between objects, calculated by the function SIMILAR(Di, Dj) and represented in the form of a matrix, is called a similarity matrix.

F. Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e. the similarity of every incoming object is compared with the initiator. The initiator is called the cluster seed.
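The definitions above can be tied together in a small sketch. It computes the Euclidean distance, builds a similarity matrix, and groups points by comparing each incoming point with the cluster seeds against a similarity threshold. The similarity function 1 / (1 + d) and the names `similar`, `similarity_matrix` and `seed_clustering` are assumptions made for illustration; the paper does not prescribe a particular formula for SIMILAR(Di, Dj) or a particular seeding procedure.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def similar(p, q):
    """Assumed similarity: 1 for identical points, tending to 0 as distance grows."""
    return 1.0 / (1.0 + euclidean(p, q))


def similarity_matrix(points):
    """Pairwise similarities arranged as a square matrix (list of rows)."""
    return [[similar(p, q) for q in points] for p in points]


def seed_clustering(points, threshold):
    """Single pass over the data. Each cluster is a list of point indices whose
    first entry is the seed. A point joins the first cluster whose seed it
    resembles with similarity >= threshold; otherwise it seeds a new cluster."""
    clusters = []
    for i, p in enumerate(points):
        for cluster in clusters:
            seed = points[cluster[0]]
            if similar(seed, p) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing seed is similar enough, so this point becomes a seed.
            clusters.append([i])
    return clusters


points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (3.0, 5.0)]
matrix = similarity_matrix(points)
clusters = seed_clustering(points, threshold=0.5)
print(clusters)
```

With this toy data the points at indices 0 and 2 share one seed and the points at indices 1 and 3 share another. The result of a seed-based pass depends on the order in which objects arrive, which is one reason the threshold needs to be chosen with care.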

ISSN: 2231-2803

http://www.ijcttjournal.org

Page 710

II. Basic Clustering Steps

A. Preprocessing and feature selection
Most clustering models assume that all data items are represented by n-dimensional feature vectors. This step therefore involves choosing appropriate features, and doing appropriate preprocessing and feature extraction on data items to measure the values of the chosen feature set. It will often be desirable to choose a subset of all the features available, to reduce the dimensionality of the problem space. This step often requires a good deal of domain knowledge and data analysis.

B. Similarity measure
The similarity measure plays an important role in the process of clustering, where a set of objects is grouped into several clusters so that similar objects fall in the same cluster and dissimilar ones in different clusters. In clustering, an object is represented by its features, and the similarity relationship between objects is measured by a similarity function. This is a function which takes two sets of data items as input and returns as output a similarity measure between them.

C. Clustering algorithm
Clustering algorithms are general schemes which use particular similarity measures as subroutines. The particular choice of clustering algorithm depends on the desired properties of the final clustering, e.g. what is the relative importance of compactness, parsimony, and inclusiveness? Other considerations include the usual time and space complexity. A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. The clustering algorithm also finds the centroid of each group of data. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids with the number of components in each cluster [2].

D. Result validation
Do the results make sense? If not, we may want to iterate back to some prior stage. It may also be useful to do a test of clustering tendency, to try to guess if clusters are present at all; note that any clustering algorithm will produce some clusters regardless of whether or not natural clusters exist.

E. Result interpretation and application
Typical applications of clustering include data compression (via representing data samples by their cluster representative), hypothesis generation (looking for patterns in the clustering of data), hypothesis testing (e.g. verifying feature correlation or other data properties through a high degree of cluster formation), and prediction (once clusters have been formed from data and characterized, new data items can be classified by the characteristics of the cluster to which they would belong).

III. Clustering Techniques
Traditionally, clustering techniques are broadly divided into hierarchical and partitioning. Hierarchical clustering is further subdivided into agglomerative and divisive.

A. Agglomerative
Start with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.

B. Divisive
Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide, at each step, which cluster to split and how to perform the split.

Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level (or splitting a cluster from the next higher level). The result of a hierarchical clustering algorithm can be graphically displayed as a tree, called a dendrogram. This tree graphically displays the merging process and the intermediate clusters. For document clustering, this dendrogram provides a taxonomy, or hierarchical index [2].

IV. CUSTOMER DATA CLUSTERING
Customer clustering is one of the most important data mining methodologies used in marketing and customer relationship management (CRM) [3]. Customer clustering would use
customer-purchase transaction data to track buying behavior and create strategic business initiatives. Companies want to keep high-profit, high-value, and low-risk customers. This cluster typically represents the 10 to 20 percent of customers who create 50 to 80 percent of a company's profits. A company would not want to lose these customers, and the strategic initiative for the segment is obviously retention. A low-profit, high-value, and low-risk customer segment is also an attractive one, and the obvious goal here would be to increase profitability for this segment.

A. Architecture
The approach is a two-phase model. In the first phase, the data is collected from our organization's retail smart store and then cleansed. Cleansing involves removing the noise first, so that incomplete, missing and irrelevant data are removed, and the remaining data is formatted as required. In the second phase, the clusters are generated and profiled to identify the best clusters. Fig. 1 illustrates the whole process [3].

Fig 1. Clustering Process

B. Experiments and Results
For this study, the transaction data of our organization's retail smart store has been taken. Using these data, customers have been clustered using the IBM Intelligent Miner tool. The first steps in the clustering process involve selecting the data set and the algorithm. There are two types of algorithms available in I-Miner [3]:
1) Demographic clustering process
2) Neural clustering process

In this exercise, the demographic clustering process has been chosen, since it works best for the continuous data type, and all attributes in the data set are continuous. The next step in the process is to choose the basic run parameters. The basic parameters available for demographic clustering are:
1) Maximum number of clusters
2) Maximum number of passes through the data
3) Accuracy
4) Similarity threshold

The input parameters for the customer clustering are:
1) Recency
2) Total customer profit
3) Total customer revenue
4) Top revenue department

The data is first extracted from the Oracle databases and flat files and converted into flat files. Subsequently, the I-Miner process picks up the file and processes it. The entire output data set would have customer information appended to the end of each record.

C. Cluster Profiling
The next step in the clustering process is to profile the clusters by executing SQL queries. The purpose of profiling is to assess the potential business value of each cluster quantitatively by profiling the aggregate values of the shareholder value variables by cluster.

V. CLUSTERING NUMERIC AND CATEGORICAL DATA
Clustering typically groups data into sets in such a way that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized.

A. Cluster Ensembles
Cluster ensembles is the method to combine several runs of different clustering algorithms to get a common partition of
the original dataset, aiming for consolidation of the results from a portfolio of individual clustering results. In [4], the authors formally define the cluster ensemble problem as an optimization problem and propose combiners for solving it based on a hyper-graph model.

B. Cluster Ensemble: The Viewpoint of Categorical Data Clustering
Clustering aims at discovering groups and patterns in data sets. In general, the output produced by a specific clustering algorithm is the assignment of the data objects in a dataset to different groups. In other words, it is sufficient to identify each data object with a unique cluster label. From the viewpoint of clustering, data objects with different cluster labels are considered to be in different clusters: if two objects are in the same cluster they are considered to be fully similar, and if not they are fully dissimilar. Thus, cluster labels cannot be given a natural ordering in the way real numbers can; that is to say, the output of a clustering algorithm can be viewed as categorical.

Since the output of an individual clustering algorithm is categorical, the cluster ensemble problem can be viewed as a categorical data clustering problem, in which runs of different clustering algorithms are combined into a new categorical dataset. Transforming the cluster ensemble problem into a categorical data clustering problem has the following advantages. First, some efficient algorithms for clustering categorical data have been proposed recently [1,8,14]. These algorithms can be fully exploited, and the cluster ensemble problem can also benefit from advances in the research on categorical data clustering. Further, the problem of categorical data clustering is relatively simple and provides a unified framework for problem formalization. For clustering datasets with mixed types of attributes, we propose a novel divide-and-conquer technique, whose steps are given in the overview that follows.

C. Overview
The steps involved in the cluster ensemble based algorithm framework are described in Fig. 2. First, the original mixed dataset is divided into two sub-datasets: the pure categorical dataset and the pure numeric dataset. Next, existing well-established clustering algorithms designed for the different types of datasets are employed to produce the corresponding clusters. Finally, the clustering results on the categorical and numeric datasets are combined as a categorical dataset, on which a categorical data clustering algorithm is exploited to get the final clusters. Because this algorithm framework obtains its clustering output from both the split categorical dataset and the split numeric dataset, it is named CEBMDC (Cluster Ensemble Based Mixed Data Clustering).
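The three steps of the overview can be illustrated with a minimal sketch. It is not the implementation behind the framework: the single-pass `leader_cluster` routine stands in for whichever well-established numeric and categorical algorithms would actually be used (for example Squeezer [5] on the categorical side), and the function names, thresholds and toy records are all assumptions made for illustration.

```python
import math


def leader_cluster(items, is_close):
    """Stand-in clusterer (an assumption, not part of CEBMDC itself).
    An item joins the first cluster whose leader it is close to;
    otherwise it becomes the leader of a new cluster."""
    leaders, labels = [], []
    for item in items:
        for label, leader in enumerate(leaders):
            if is_close(leader, item):
                labels.append(label)
                break
        else:
            leaders.append(item)
            labels.append(len(leaders) - 1)
    return labels


def numeric_close(max_dist):
    """Two numeric tuples are close if their Euclidean distance is small."""
    return lambda p, q: math.dist(p, q) <= max_dist


def categorical_close(min_match):
    """Two categorical tuples are close if enough attributes agree
    (simple matching: fraction of positions with equal values)."""
    return lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a) >= min_match


def cebmdc_sketch(records, numeric_idx, categorical_idx,
                  max_dist=1.5, min_match=0.5, final_match=1.0):
    # Step 1: divide the mixed dataset into numeric and categorical parts.
    numeric = [tuple(r[i] for i in numeric_idx) for r in records]
    categorical = [tuple(r[i] for i in categorical_idx) for r in records]
    # Step 2: cluster each part with an algorithm suited to its type.
    num_labels = leader_cluster(numeric, numeric_close(max_dist))
    cat_labels = leader_cluster(categorical, categorical_close(min_match))
    # Step 3: treat the two label columns as a new categorical dataset
    # and cluster it with a categorical algorithm to get the final clusters.
    combined = list(zip(num_labels, cat_labels))
    return leader_cluster(combined, categorical_close(final_match))


records = [
    (1.0, 2.0, "a", "x"),
    (1.5, 2.5, "a", "x"),
    (9.0, 9.0, "b", "y"),
    (9.5, 8.5, "b", "y"),
    (1.2, 2.2, "b", "y"),
]
final_labels = cebmdc_sketch(records, numeric_idx=(0, 1), categorical_idx=(2, 3))
print(final_labels)
```

In this toy run the last record is numerically close to the first two but categorically identical to the third and fourth, so the two base clusterings disagree about it. With `final_match=1.0` the combining step requires both labels to agree, and that record ends up in a cluster of its own.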

Fig 2 : Overview of CEBMDC algorithm framework

VI. CONCLUSION
Clustering has proved to be one of the most extensively used data mining techniques. In this paper, its various applications were discussed. For customer clustering, the demographic clustering technique was used. For categorical data, the existing clustering algorithms were integrated and can be fully exploited. In future, alternative clustering algorithms can be integrated into the algorithm framework to gain better insight and to further advance the clustering technique.

VII. REFERENCES
[1] I.K. Ravichandra Rao, "Data Mining and Clustering Techniques", DRTC Workshop on Semantic Web, 8-10 December 2003, DRTC, Bangalore.
[2] John A. Hartigan, "Clustering Algorithms", John Wiley, New York, 1975.
[3] Dr. Sankar Rajagopal, "Customer Data Clustering Using Data Mining Technique", International Journal of Database Management Systems (IJDMS), Vol. 3, No. 4, November 2011.
[4] A. Strehl, J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitions", Proc. of the 18th National Conference on Artificial Intelligence and 14th Conference on Innovative Applications of Artificial Intelligence, pp. 93-99, 2002.
[5] Z. He, X. Xu, S. Deng, "Squeezer: An Efficient Algorithm for Clustering Categorical Data", Journal of Computer Science and Technology, vol. 17, no. 5, pp. 611-625, 2002.
[6] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Proc. 1999 Int. Conf. Data Engineering, pp. 512-521, Sydney, Australia, Mar. 1999.
[7] Ke Wang, Chu Xu, Bing Liu, "Clustering Transactions Using Large Items", Proceedings of the 1999 ACM International Conference on Information and Knowledge Management, pp. 483-490, 1999.
