
Nimrat Kaur Sidhu*, Rajneet Kaur**

*(Research Scholar, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India.)

**(Assistant Professor, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India.)

Abstract--Clustering is the most commonly used technique of data mining, under which patterns are discovered in the underlying data. This paper presents how clustering is carried out and describes its applications. It also provides a framework for the mixed-attribute clustering problem and shows how customer data can be clustered to identify high-profit, high-value and low-risk customers.

Keywords-- Data Mining, Customer Clustering, Categorical Data

I. INTRODUCTION

Data mining systems can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications. Three important components of a data mining system are the databases, the data mining engine, and the pattern evaluation modules. The following definitions, taken from [1], are used in the clustering technique of data mining.

A. Cluster
A cluster is an ordered list of objects that have some characteristics in common. The objects belong to an interval [a, b].

B. Distance between Two Clusters
The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed. The distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as

$$d(p, q) = \left[ \sum_{i} (p_i - q_i)^2 \right]^{1/2}$$

C. Similarity Measures
A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between two documents i and j. A typical similarity measure yields 0 for documents exhibiting no agreement among the assigned index terms and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.

D. Threshold
The threshold is the lowest value of similarity required to join two objects in one cluster. A threshold T(J) is given for the Jth variable (1 ≤ J ≤ N). Cases are partitioned into clusters so that within each cluster the Jth variable has a range less than T(J). The thresholds should be chosen fairly large, especially if there are many variables. The procedure is equivalent to converting each variable to a category variable (using the thresholds to define the categories); the clusters are then the cells of the multidimensional contingency table between all variables.

E. Similarity Matrix
The similarity between objects, calculated by the function SIMILAR(Di, Dj) and represented in the form of a matrix, is called a similarity matrix.

F. Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e. the similarity of every incoming object is compared with the initiator. The initiator is called the cluster seed.
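The definitions above can be tied together in a minimal Python sketch. It is an illustration under stated assumptions, not a procedure given in [1]: the mapping from distance to similarity, 1 / (1 + d), and the function names `similar`, `similarity_matrix` and `seed_clustering` are hypothetical choices made here. Each incoming object is compared with the seed of every existing cluster and joins the most similar one if the similarity reaches the threshold; otherwise it becomes the seed of a new cluster.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def similar(p, q):
    """Assumed similarity: 1 for identical points, tending to 0 as distance grows."""
    return 1.0 / (1.0 + euclidean(p, q))


def similarity_matrix(objects):
    """Matrix whose (i, j) entry is the similarity between objects i and j."""
    return [[similar(a, b) for b in objects] for a in objects]


def seed_clustering(objects, threshold):
    """Group object indices; the first index in each cluster is its seed."""
    clusters = []
    for idx, obj in enumerate(objects):
        # Compare the incoming object with the seed (initiator) of each cluster.
        sims = [similar(obj, objects[cluster[0]]) for cluster in clusters]
        if sims and max(sims) >= threshold:
            clusters[sims.index(max(sims))].append(idx)
        else:
            clusters.append([idx])  # the object becomes a new cluster seed
    return clusters


points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (3.0, 5.0)]
matrix = similarity_matrix(points)
clusters = seed_clustering(points, threshold=0.4)
print(clusters)
```

With the threshold of 0.4 assumed here, the four sample points fall into two clusters seeded by the first and second points. A higher threshold would produce more, smaller clusters.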

ISSN: 2231-2803

http://www.ijcttjournal.org

Page 710

International Journal of Computer Trends and Technology (IJCTT) - volume4Issue4 April 2013

II. Basic Clustering Step

A. Preprocessing and feature selection
Most clustering models assume that all data items are represented by n-dimensional feature vectors. This step therefore involves choosing an appropriate feature set and doing appropriate preprocessing and feature extraction on the data items to measure the values of the chosen features. It is often desirable to choose a subset of all the features available, to reduce the dimensionality of the problem space. This step often requires a good deal of domain knowledge and data analysis.

B. Similarity measure
The similarity measure plays an important role in clustering, where a set of objects is grouped into several clusters so that similar objects fall in the same cluster and dissimilar ones in different clusters. In clustering, an object is represented by its features, and the similarity relationship between objects is measured by a similarity function. This function takes two sets of data items as input and returns a similarity measure between them.

C. Clustering algorithm
Clustering algorithms are general schemes that use particular similarity measures as subroutines. The choice of clustering algorithm depends on the desired properties of the final clustering, e.g. the relative importance of compactness, parsimony, and inclusiveness. Other considerations include the usual time and space complexity. A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. The clustering algorithm also finds the centroid of each group of data. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids together with the number of components in each cluster (2).

D. Result validation
Do the results make sense? If not, we may want to iterate back to some prior stage. It may also be useful to do a test of clustering tendency, to try to guess whether clusters are present at all; note that any clustering algorithm will produce some clusters regardless of whether or not natural clusters exist.

E. Result interpretation and application
Typical applications of clustering include data compression (via representing data samples by their cluster representative), hypothesis generation (looking for patterns in the clustering of data), hypothesis testing (e.g. verifying feature correlation or other data properties through a high degree of cluster formation), and prediction (once clusters have been formed from data and characterized, new data items can be classified by the characteristics of the cluster to which they would belong).

III. Clustering Techniques

Traditionally, clustering techniques are broadly divided into hierarchical and partitioning. Hierarchical clustering is further subdivided into agglomerative and divisive.

A. Agglomerative
Start with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.

B. Divisive
Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide, at each step, which cluster to split and how to perform the split.

Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level (or splitting a cluster from the next higher level). The result of a hierarchical clustering algorithm can be displayed graphically as a tree, called a dendrogram. This tree displays the merging process and the intermediate clusters. For document clustering, the dendrogram provides a taxonomy, or hierarchical index. [2]

IV. CUSTOMER DATA CLUSTERING [3]

Customer clustering is one of the most important data mining methodologies used in marketing and customer relationship management (CRM). Customer clustering would use
customer-purchase transaction data to track buying behavior and create strategic business initiatives. Companies want to keep high-profit, high-value, and low-risk customers. This cluster typically represents the 10 to 20 percent of customers who create 50 to 80 percent of a company's profits. A company would not want to lose these customers, and the strategic initiative for the segment is obviously retention. A low-profit, high-value, and low-risk customer segment is also an attractive one, and the obvious goal here would be to increase profitability for this segment.

A. Architecture
The approach is a two-phase model. In the first phase, the data are collected from our organization's retail smart store and then cleansed. Cleansing involves removing the noise first, so that incomplete, missing and irrelevant data are removed and the remaining data are formatted according to the required format. In the second phase, the clusters are generated and profiled to identify the best clusters. Fig. 1 illustrates the whole process. [3]

Fig 1. Clustering Process

B. Experiments and Results
For this study, the transaction data of our organization's retail smart store have been taken. Using these data, customers have been clustered using the IBM Intelligent Miner tool. The first steps in the clustering process involve selecting the data set and the algorithm. There are two types of algorithms available in the I-Miner process [3]:
1) Demographic clustering process
2) Neural clustering process

In this exercise, the Demographic clustering process has been chosen, since it works best for the continuous data type; all the attributes in the data set are continuous. The next step in the process is to choose the basic run parameters. The basic parameters available for demographic clustering are:
1) Maximum number of clusters
2) Maximum number of passes through the data
3) Accuracy
4) Similarity threshold

The input parameters for the customer clustering are:
1) Recency
2) Total customer profit
3) Total customer revenue
4) Top revenue department

The data are first extracted from the Oracle databases and flat files and converted into flat files. Subsequently, the I-Miner process picks up the file and processes it. The entire output data set has customer information appended to the end of each record.

C. Cluster Profiling
The next step in the clustering process is to profile the clusters by executing SQL queries. The purpose of profiling is to assess the potential business value of each cluster quantitatively by profiling the aggregate values of the shareholder value variables by cluster.

V. CLUSTERING NUMERIC AND CATEGORICAL DATA

Clustering typically groups data into sets in such a way that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized.

A. Cluster Ensembles
Cluster ensembles is the method to combine several runs of different clustering algorithms to get a common partition of
the original dataset, aiming to consolidate the results from a portfolio of individual clustering results. In [4], the authors formally defined the cluster ensemble problem as an optimization problem and proposed combiners for solving it based on a hyper-graph model.

B. Cluster Ensemble: The Viewpoint of Categorical Data Clustering
Clustering aims at discovering groups and patterns in data sets. In general, the output produced by a particular clustering algorithm is the assignment of the data objects in the dataset to different groups. In other words, it is sufficient to identify each data object with a unique cluster label. From the viewpoint of clustering, data objects with different cluster labels are considered to be in different clusters: if two objects are in the same cluster they are considered fully similar, and if not they are fully dissimilar. Thus cluster labels cannot be given a natural ordering in the way real numbers can; that is to say, the output of a clustering algorithm can be viewed as categorical. Since the output of each individual clustering algorithm is categorical, the cluster ensemble problem can be viewed as a categorical data clustering problem, in which the runs of different clustering algorithms are combined into a new categorical dataset.

Transforming the cluster ensemble problem into a categorical data clustering problem has the following advantages. First, some efficient algorithms for clustering categorical data have been proposed recently [1,8,14]. These algorithms can be fully exploited, and the cluster ensemble problem can benefit from advances in the research on categorical data clustering. Further, the problem of categorical data clustering is relatively simple and provides a unified framework for problem formalization.

For clustering datasets with mixed types of attributes, we propose a novel divide-and-conquer technique. First, the original mixed dataset is divided into two sub-datasets: the pure categorical dataset and the pure numeric dataset. Next, existing well-established clustering algorithms designed for the different types of datasets are employed to produce the corresponding clusters. Last, the clustering results on the categorical and numeric datasets are combined as a categorical dataset, on which a categorical data clustering algorithm is applied to get the final clusters.

C. Overview
The steps involved in the cluster-ensemble-based algorithm framework are described in figure 1 and follow the three stages just outlined: dividing the mixed dataset into a pure categorical and a pure numeric sub-dataset, clustering each sub-dataset with an algorithm suited to its type, and combining the two results as a categorical dataset that is clustered once more to get the final clusters. Because the framework obtains clustering output from both the categorical sub-dataset and the numeric sub-dataset, it is named CEBMDC (Cluster Ensemble Based Mixed Data Clustering).
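The three stages of the framework can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not fix the base algorithms, so a plain k-means (seeded with the first k points) stands in for the numeric clusterer and a simple one-pass, seed-matching procedure stands in for the categorical clusterer. All function names, the sample records and the parameter values are hypothetical.

```python
def split_mixed(records, numeric_idx, categorical_idx):
    """Stage 1: divide a mixed dataset into numeric and categorical sub-datasets."""
    numeric = [tuple(r[i] for i in numeric_idx) for r in records]
    categorical = [tuple(r[i] for i in categorical_idx) for r in records]
    return numeric, categorical


def kmeans(points, k, iterations=20):
    """Assumed numeric clusterer: plain k-means seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        for n, p in enumerate(points):
            labels[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def one_pass_categorical(rows, threshold):
    """Assumed categorical clusterer: join the best-matching seed row or start a new cluster."""
    seeds, labels = [], []
    for row in rows:
        # Similarity = fraction of attributes whose values equal the seed's values.
        scores = [sum(a == b for a, b in zip(row, seed)) / len(row) for seed in seeds]
        if scores and max(scores) >= threshold:
            labels.append(scores.index(max(scores)))
        else:
            seeds.append(row)
            labels.append(len(seeds) - 1)
    return labels


def cebmdc(records, numeric_idx, categorical_idx, k, threshold):
    """Stages 1 to 3: split, cluster each part, then cluster the combined labels."""
    numeric, categorical = split_mixed(records, numeric_idx, categorical_idx)
    numeric_labels = kmeans(numeric, k)                               # stage 2a
    categorical_labels = one_pass_categorical(categorical, threshold)  # stage 2b
    # Stage 3: the two label columns form a new, purely categorical dataset.
    combined = list(zip(numeric_labels, categorical_labels))
    return one_pass_categorical(combined, threshold)


records = [
    (1.0, 2.0, "a", "x"),
    (1.2, 1.8, "a", "x"),
    (9.0, 9.5, "b", "y"),
    (8.8, 9.1, "b", "y"),
]
final_labels = cebmdc(records, numeric_idx=[0, 1], categorical_idx=[2, 3], k=2, threshold=0.5)
print(final_labels)
```

In the final stage the cluster labels are compared only for equality, never for order, which matches the observation above that the output of a clustering algorithm is categorical. Any other numeric or categorical clustering algorithm could be substituted for the two stand-ins without changing the structure of `cebmdc`.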

VI. CONCLUSION

Clustering has proved to be one of the most extensively used data mining techniques. In this paper, its various applications were discussed. For customer clustering, the demographic clustering technique was used. For categorical data, existing clustering algorithms were integrated so that they can be fully exploited. In future, alternative clustering algorithms can be integrated into the algorithm framework to gain better insight and to advance the clustering technique.

VII. REFERENCES

[1] I.K. Ravichandra Rao, "Data Mining and Clustering Techniques," DRTC Workshop on Semantic Web, 8-10 December 2003, DRTC, Bangalore.

[2] John A. Hartigan, "Clustering Algorithms," John Wiley, New York, 1975.

[3] Dr. Sankar Rajagopal, "Customer Data Clustering Using Data Mining Technique," International Journal of Database Management Systems (IJDMS), Vol. 3, No. 4, November 2011.

[4] A. Strehl, J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitions," Proc. of the 8th National Conference on Artificial Intelligence and 4th Conference on Innovative Applications of Artificial Intelligence, pp. 93-99, 2002.

[5] Z. He, X. Xu, S. Deng, "Squeezer: An Efficient Algorithm for Clustering Categorical Data," Journal of Computer Science and Technology, vol. 17, no. 5, pp. 611-625, 2002.

[6] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," in Proc. 1999 Int. Conf. Data Engineering, pp. 512-521, Sydney, Australia, Mar. 1999.

[7] Ke Wang, Chu Xu, Bing Liu, "Clustering Transactions Using Large Items," Proceedings of the 1999 ACM International Conference on Information and Knowledge Management, pp. 483-490, 1999.
