Narasimha Murthy
Computer Science and Automation
Indian Institute of Science, Bangalore
Applications in many domains often lead to very large and high-dimensional data, with the dimensionality of the data in the hundreds or thousands, for example in text/web mining and bioinformatics. In addition to being high-dimensional, these data sets are often sparse. Clustering such large, high-dimensional data sets is a contemporary challenge: successful algorithms must avoid the curse of dimensionality while remaining computationally efficient. Finding useful patterns in large datasets has attracted considerable interest recently; the task is to devise a method that, on the basis of quantitative measurements on a set of objects, assigns the objects to meaningful subclasses. The primary goal of this project is to implement an efficient new tree-based clustering method, and to combine the clustering method with classification. The implementation of the algorithm involves challenging issues such as achieving good accuracy in little time. The algorithm incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering possible with the available resources (i.e., under memory and time constraints). We evaluate the time and space efficiency, the sensitivity to data input order, and the clustering quality through several experiments.
With an increasing number of new database applications dealing with very large high-dimensional data sets, data mining on such data sets has emerged as an important research area. These applications include multimedia content-based retrieval, geographic and molecular biology data analysis, text mining, bioinformatics, medical applications, and time-series matching. Clustering of very large high-dimensional data sets is an important problem. There are a number of different clustering algorithms that are applicable to very large data sets, and a few that address large high-dimensional data.

Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters). Data clustering identifies the sparse and the crowded places, and hence discovers the overall distribution patterns of the dataset. Besides, the derived clusters can be visualized more efficiently and effectively than the original dataset. Generally, there are two types of attributes involved in the data to be clustered: metric and nonmetric. Metric variables are ones that take on values across a dimensional range, while nonmetric variables are those that change from one categorical state to another.

A pattern classifier has two phases: a design phase, where abstractions are created, and a classification phase, where the classification of test patterns is done using these abstractions. Corresponding to these two phases, we have design time and classification time. Classification based on neural networks and genetic algorithms needs more design time, because such classifiers typically access the training data a large number of times. On the other hand, classification based on neighbourhood classifiers has no design phase, and hence zero design time; however, the classification phase can be computationally expensive.
The goals of this project and the work done towards achieving them are described in this section. Our first goal is to come up with a new tree-based clustering algorithm, which is a combination of the CF tree and the KD tree. Towards this direction, we have completed the analysis of the CF tree and the KD tree. Our second goal is to combine the clustering algorithm with the classification problem. We are also looking at the incremental mining and Divide and Conquer approaches.

1.4 Outline of the Report

This report is organized as follows. Section 2 provides a review of relevant work in this area. Section 3 describes different classification approaches. Section 4 gives a description of incremental mining. Section 5 gives the details about the different data sets. Section 6 has experimental results of our preliminary implementation of the two clustering algorithms. We conclude this report with a road map to future work.

1.5 Review of Literature

We give a brief review of the related work done in this area, in particular of two tree-based clustering algorithms, namely the CF tree and the KD tree, which have been used for clustering.

The classification problem is concerned with generating a description or a model for each class from the given data set. Using the training set, the classification process attempts to generate the descriptions of the classes, and those descriptions help to classify the unknown records. It is possible to use frequent itemset mining algorithms to generate the clusters and their descriptions efficiently. Two popular algorithms are:

• Apriori algorithm: the most popular algorithm to find all the frequent itemsets. The first pass of the algorithm simply counts item occurrences to determine the frequent itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets found in pass k-1 are used to generate the candidate itemsets. Next, the database is scanned and the support of the candidates is counted.

• FP-tree: a frequent pattern tree is a tree structure consisting of an item-prefix-tree and a frequent-item-header table. The item-prefix-tree consists of a root node labelled null; each non-root node consists of three fields: item name, support count and node link. The frequent-item-header table consists of item name and head of node link, which points to the first node in the FP-tree carrying that item name.

2 Clustering

Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. The algorithm attempts to automatically partition the data space into a set of regions or clusters, to which the examples in the tables are assigned, either deterministically or probability-wise. The goal of the process is to identify, in some optimal fashion, all sets of similar examples in the data.

There are two main approaches to clustering: hierarchical clustering and partitioning clustering. The hierarchical clustering technique produces a sequence of partitions, in which each partition is nested into the next partition in the sequence; it creates a hierarchy of clusters from small to big or from big to small. The partitioning clustering technique partitions the database into a predefined number of clusters; partitioning clustering algorithms include the k-means and k-medoid algorithms. Numerous algorithms have been developed for clustering large data sets. Here we give a brief description of some of the algorithms.

• K-means algorithm: the algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated. The algorithm is simple, but it suffers from two major drawbacks: it assumes that all objects fit into main memory, and the result is very sensitive to the input order.

• CLARANS (Clustering Large Applications based on Randomized Search): CLARANS is a medoid-based method, which is more efficient.

• DBSCAN (Density Based Spatial Clustering of Applications with Noise): DBSCAN uses a density-based notion of clusters to discover clusters of arbitrary shapes. The key idea is that, for each object of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of data objects, i.e. the density of the neighbourhood must exceed a threshold.
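The four K-means steps listed above can be sketched in a few lines of Python. This is a minimal illustration with made-up 2-D points and Euclidean distance, not the implementation evaluated in this report.

```python
# Minimal K-means sketch: steps 1-4 from the description above.
# Assumptions: Euclidean distance, random initial centroids, toy 2-D data.
import math
import random

def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: place K points as the initial group centroids.
    centroids = rng.sample(data, k)
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in data:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)
        # Step 3: recalculate the positions of the K centroids.
        new_centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(data, 2)))  # two centroids, one near each group of points
```

Changing `seed` changes the starting centroids, which is where the sensitivity to initialization and input order mentioned above comes from.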
The objectives of clustering are:
• to uncover natural groupings,
• to initiate hypotheses about the data, and
• to find a consistent and valid organization of the data.
In this section we briefly discuss two tree-based clustering algorithms, namely the CF tree and the KD tree. We also give a brief overview of the Divide and Conquer approach.

2.1 CF Tree

The CF tree [8, 9] is based on the principle of agglomerative clustering: at any given stage there are smaller subclusters, and the decision at the current stage is to merge subclusters based on some criteria. The tree maintains a set of cluster features, one for each subcluster: CF vector = (n, ls, ss), where n is the number of data objects in the CF, ls is the linear sum of the data objects, and ss is the square sum of the data objects in the CF. The cluster features of the different subclusters are maintained in a tree (in a B+ tree fashion); this tree is called a CF tree.

A CF tree is a height-balanced tree with two parameters: the branching factor B and the threshold T. Each nonleaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, ..., B, childi is a pointer to its i-th child node, and CFi is the CF of the subcluster represented by this child. So a nonleaf node represents a cluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, ..., L; a leaf node also represents a cluster made up of all the subclusters represented by its entries. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together for efficient scans. All entries in a leaf node must satisfy a threshold requirement with respect to the threshold value T: the diameter has to be less than T.

2.2 KD Tree

A KD-tree is a data structure for storing a finite set of points from a k-dimensional space, designed to handle spatial data in a simple way. It was examined in detail by Bentley, Friedman et al., 1977. A KD tree is a binary tree. At each step, one of the coordinates is chosen as the basis for dividing the rest of the points. For example, if x is chosen as the basis at the root, then, like in binary search trees, all items to the left of the root will have an x-coordinate less than that of the root, and all items to the right of the root will have an x-coordinate greater than (or equal to) that of the root.

Example: a 2d-tree of four elements, including the points (3,7), (5,10) and (8,12); the (5,10) node splits along the X=5 plane, and the (3,7) node splits along the Y=7 plane. (Figure: the way the nodes partition the plane.)
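As a small, hypothetical sketch of the splitting rule just described (not the report's implementation), a 2d-tree can be built by cycling through the coordinates with depth:

```python
# Minimal 2d-tree insertion sketch: the splitting coordinate alternates
# with depth (x at even depths, y at odd depths for k = 2).
class Node:
    def __init__(self, point):
        self.point = point
        self.left = None    # points with coordinate < splitting value
        self.right = None   # points with coordinate >= splitting value

def insert(root, point, depth=0, k=2):
    if root is None:
        return Node(point)
    axis = depth % k  # which coordinate is the basis at this level
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1, k)
    else:
        root.right = insert(root.right, point, depth + 1, k)
    return root

# Inserting (5,10) first makes it the root, splitting along the X=5 plane;
# (3,7) then splits the left subtree along the Y=7 plane.
root = None
for p in [(5, 10), (3, 7), (8, 12)]:
    root = insert(root, p)

print(root.point)        # (5, 10)
print(root.left.point)   # (3, 7)
print(root.right.point)  # (8, 12)
```

Note that, unlike the CF tree, the shape of this tree depends on the order in which points are inserted.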
2.3 Divide-and-Conquer

We study clustering under the data stream model of computation, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. One of the first requisites of clustering a data stream is that the computation be carried out in small space.

Algorithm Small-Space(S):
1. Divide S into l disjoint pieces X1, X2, ..., Xl.
2. For each i, find O(k) centers in Xi. Assign each point in Xi to its closest center.
3. Let X be the O(lk) centers obtained in (2), where each center c is weighted by the number of points assigned to it.
4. Cluster X to find k centers.

3 Classification

The input for classification is the training data set, whose class labels are already known. Classification analyzes the training data set, constructs a model based on the class labels, and aims to assign a class label to future unlabelled records. Since the class field is known, this type of classification is known as supervised learning. A set of classification rules is generated by such a classification process, which can be used to classify future data and to develop a better understanding of each class in the database. The quality of predictive models is measured in terms of accuracy on unseen data.

3.1 KNNC

K-Nearest Neighbor (KNN) classification is a very simple, yet powerful classification method. The key idea behind KNN classification is that similar observations belong to similar classes. Thus, one simply has to look for the class designators of a certain number of the nearest neighbours and weigh their class numbers to assign a class number to the unknown pattern. The weighing scheme of the class numbers is often a majority rule, but other schemes are conceivable. The number of nearest neighbours, k, should be odd in order to avoid ties, and it should be kept small, since a large k tends to create misclassifications unless the individual classes are well-separated. It can be shown that the performance of a KNN classifier is always at least half of the best possible classifier for a given problem. One of the major drawbacks of KNN classifiers is that the classifier needs all available data; this may lead to considerable overhead if the training data set is large.

So, to reduce that overhead, we will find KNNs only for those patterns which belong to border clusters, i.e. the clusters which contain data points of multiple classes, where the clusters have been generated by some clustering method. In the remaining clusters all points belong to the same class; so if a pattern belongs to such a cluster, we do not have to find its K nearest neighbours, because all points are in the same class. When we see a test point of a border cluster, we will use the KNNC for that cluster: we will find the K nearest training points with respect to Euclidean distance, and the classification label of that test pattern will be the majority among those K neighbours. This will reduce the complexity of classification.

3.2 Decision trees

Classification involves finding rules that partition the data into disjoint groups. A decision tree takes as input an object described by a set of properties, and outputs a decision. A decision tree classifier is a predictive model which is presented as a tree; the internal nodes of this tree are patterns and the leaf nodes are categories. At each level of the tree, we select a feature on which we take a decision, and based on the decision we split the set of patterns. The partitioning variables are selected automatically according to a given quality measurement criterion. In decision tree classification, there are two factors that we can measure: accuracy and the number of nodes.

After generating the clusters, for each test pattern which belongs to a border cluster, we can use any of the following two methods for classification:
• Single decision tree: in this we construct only one decision tree for all training patterns, so the tree construction time will not be reduced.
• Separate decision tree for each border cluster: in this we construct a separate decision tree for each border cluster. When we see a test point of one of those clusters, we use that cluster's decision tree for identifying the class, so the classification time can be reduced.

3.3 CPAR

CPAR (Classification based on Predictive Association Rules) combines the advantages of both associative classification and traditional rule-based classification. In our approach, we use these rules only for border cluster points.
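Returning to the KNNC rule of Section 3.1, the plain decision rule (Euclidean distance, odd k, majority vote) can be sketched as follows; the training points are illustrative, not from the datasets used later.

```python
# Minimal KNN classification sketch: majority vote among the k nearest
# training points under Euclidean distance.
import math
from collections import Counter

def knn_classify(train, test_point, k=3):
    """train is a list of (point, label) pairs; k should be odd."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'),
         ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
print(knn_classify(train, (0.5, 0.5)))  # 'a'
print(knn_classify(train, (5.5, 5.5)))  # 'b'
```

The sort over the whole training set is exactly the overhead discussed above; restricting the call to border-cluster test points avoids it for the easy clusters.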
CPAR inherits the basic idea of FOIL. Instead of generating a large number of candidate rules as in associative classification, CPAR adopts a greedy algorithm to generate rules directly from the training data; it generates and tests more rules than traditional rule-based classifiers to avoid missing important rules. To avoid overfitting, CPAR uses the expected accuracy to evaluate each rule and uses the best k rules in prediction. In this algorithm, we consider all positive and negative examples for each class. We calculate the Laplace accuracy for each rule, and we use it for finding the best k rules. After finding the best k rules for each class among the satisfied rules, we find the average Laplace accuracy for each class, and we assign the label of the class for which we get the maximum average Laplace accuracy.

While classifying a pattern, we check the rules of that particular cluster to which the test pattern belongs. So we take the border clusters and generate rules for those clusters only; this will reduce the number of rules to be generated. If we have clusters of a single class, all patterns will be positive only, so rule generation will not be useful for those clusters.

4 Incremental Mining

Historically, inductive machine learning has focused on non-incremental learning tasks, where the training set can be constructed a priori and learning stops once this set has been duly processed. There are, however, a number of areas, such as agents, where learning tasks are incremental. Efficient and scalable approaches to data-driven knowledge acquisition from distributed, dynamic data sources call for algorithms that can modify knowledge structures (e.g., pattern classifiers) in an incremental fashion, without having to revisit previously processed data (examples). In this section we focus on CanTree, a tree structure for efficient incremental mining of frequent patterns.

4.1 CanTree

In a CanTree, items are arranged according to some canonical order, which can be determined by the user prior to the mining process or at runtime during the mining process. Specifically, items can be consistently arranged in lexicographic or alphabetical order. Alternatively, items can be arranged according to some specific order depending on the item properties (e.g., their price values, the validity of some constraints). The ordering of items is unaffected by changes in frequency caused by incremental updates: any insertions, deletions, and/or modifications of transactions have no effect on the ordering of items in the tree. The frequency of a node in the CanTree is at least as high as the sum of the frequencies of its children.

The construction of the CanTree requires only one database scan, and it is independent of the threshold values; it does not require such user thresholds as preMinsup. Since items are consistently ordered in the CanTree, swapping of tree nodes, which may lead to merging and splitting of tree nodes, is not needed. As a result, CanTrees provide users with efficient incremental mining: they easily find mergeable paths and require only upward path traversals, and thus significantly reduce the computation time. However, a CanTree may take a large amount of memory. So we are thinking of combining the CanTree with the FP tree.

4.2 FP Tree

A fundamental problem for mining association rules is to mine frequent itemsets (FIs): if we know the support of all frequent itemsets, the association rule generation is straightforward. However, when a transaction database contains a large number of large frequent itemsets, mining all frequent itemsets might not be a good idea.

An FP-tree (frequent pattern tree) is a variation of the trie data structure, which is a prefix-tree structure for storing crucial and compressed information about frequent patterns. It consists of one root labeled NULL, a set of item-prefix subtrees as the children of the root, and a frequent-item-header table. Each node in an item-prefix subtree consists of three fields: item-name, count, and node-link, where item-name indicates which item this node represents, count indicates the number of transactions containing the items in the portion of the path reaching this node, and node-link links to the next node in the FP-tree carrying the same item-name, or is null if there is none. Each entry in the frequent-item-header table consists of two fields: item-name and head of node-link; the latter points to the first node in the FP-tree carrying that item name.
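Both the CanTree and the FP-tree are prefix trees over ordered transactions. The one-scan insertion they share can be sketched as below, with the CanTree's canonical (alphabetical) order assumed; this is a hypothetical illustration, not the CanTree authors' code.

```python
# Prefix-tree (CanTree-style) insertion sketch: items of each transaction
# are arranged in a fixed canonical order (alphabetical here) and walked
# down the tree, so incremental updates never require reordering or
# swapping of tree nodes.
class CanTreeNode:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def insert_transaction(root, transaction):
    node = root
    for item in sorted(transaction):  # canonical (alphabetical) order
        child = node.children.get(item)
        if child is None:
            child = CanTreeNode(item)
            node.children[item] = child
        child.count += 1
        node = child

root = CanTreeNode(None)
for t in [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'd'}]:
    insert_transaction(root, t)

print(root.children['a'].count)                # 3
print(root.children['a'].children['c'].count)  # 1, from the second transaction
```

A node's count is always at least the sum of its children's counts, matching the property stated in Section 4.1.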
5 Datasets

This section gives the details about the data sets which have been used for clustering.

5.1 KDDcup99

Network intrusion detection data.
1. KDD10 data: this is the KDDCUP 99 10% data. Training patterns: 494021. Test patterns: 311029. Number of attributes: 36. Number of classes: 36, but we convert it into a 5-class problem.
2. KDD100 data: this is the KDDCUP 99 100% testing data, but the training data is the 10% data only. The number of classes is the same as for the 10% data. Training patterns: 494021. Test patterns: 4898431.

5.2 Covtype

The forest cover type is the classification problem. Number of patterns: 581012. Number of attributes: 54. Number of classes: 8. Attribute breakdown: 54 columns of data (10 quantitative variables, 4 binary wilderness areas and 40 binary soil type variables). Missing attribute values: none. Variable information: given are the variable name, variable type, the measurement unit and a brief description; the order of this listing corresponds to the order of numerals along the rows of the database.

5.3 SONAR

This is the data set used by Gorman and Sejnowski in their study of the classification of sonar signals using a neural network. The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. Number of attributes: 60. Number of training patterns: 104. Number of test patterns: 104. Number of classes: 2.

6 Results

1. KDD10
• CF-Tree: accuracy 91.437; tree construction 8 sec; testing 7 sec
• KD-Tree: accuracy 96.93; tree construction 19 sec; testing 4605 sec
2. KDD100
• CF-Tree: accuracy 98.018; tree construction 4 sec; testing 26 sec
• KD-Tree: accuracy 99.93; tree construction 19 sec; testing 46143 sec
3. Covtype
• CF-Tree: accuracy 73.93; tree construction 60 sec; testing 47.13 sec
• KD-Tree: accuracy 98; tree construction 0.01 sec; testing 0.02 sec
4. SONAR
• CF-Tree: accuracy 87.028; tree construction 4 sec; testing 3 sec
• KD-Tree: accuracy 96.5; tree construction 0.006 sec; testing 0.004 sec

From the above results, it is seen that the CF tree gives less accuracy than the KD tree, but the time taken by the KD tree is very high; on the other hand, the CF tree takes less space than the KD tree. So the idea is to construct a new tree-type data structure for clustering which gets better accuracy than the CF tree and takes less time than the KD tree.

7 Conclusions and Future Work

Although there are many clustering algorithms, current clustering techniques do not address all the requirements adequately and concurrently, and dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity. So, as future work, we are planning to implement a new clustering algorithm which will perform better than the existing algorithms, by using tree-based clustering and combining classification with clustering.
We are looking at divide and conquer approaches to work with large datasets. We are also looking at incremental mining for online training data sets.

References

• A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, 1999.
• Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California 95120, June 1994.
• Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In H. V. Jagadish and Inderpal Singh Mumick, editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, 4–6 June 1996, pages 103–114.
• Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, pages 226–231. AAAI Press, 1996.
• Raymond T. Ng and Jiawei Han. CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng., 14(5):1003–1016, 2002.
• J. MacQueen. Some methods for classification and analysis of multivariate observations. In L. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, Berkeley, 1967.
• Xiaoxin Yin and Jiawei Han. CPAR: classification based on predictive association rules. In Daniel Barbará and Chandrika Kamath, editors, Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, May 1–3, 2003. SIAM.
• Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In KDD, pages 80–86, 1998.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: a midterm report. In P. B. Brazdil, editor, Machine Learning (ECML-93), European Conference on Machine Learning Proceedings, Vienna, Austria, pages 3–20. Springer-Verlag, Berlin, Germany, 1993.
• Carson Kai-Sang Leung, Quamrul I. Khan, and Tariqul Hoque. CanTree: a tree structure for efficient incremental mining of frequent patterns. In ICDM, pages 274–281. IEEE Computer Society, Los Alamitos, 2005.
• S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 359–366. IEEE Computer Society, November 12–14, 2000.
• Andrew W. Moore. An introductory tutorial on kd-trees. In Danielle C. Young, editor. October 08, 1997.
• Arun K. Pujari. Data Mining Techniques. Universities Press, 2001.
• Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems. 2nd edition, Benjamin/Cummings, Redwood City, California, 1994.
• L. Kuncheva and L. Jain. Nearest neighbor classifier: simultaneous editing and feature selection. Pattern Recognition Letters, 20(11–13):1149–1156, November 1999.
• B. Zhang and S. Srihari. Fast k-nearest neighbor classification using cluster-based trees. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(4):525–528, April 2004.
• M. Prakash and M. N. Murty. Growing subspace pattern-recognition methods and their neural-network models. IEEE Trans. Neural Networks, 8(1):161–168, January 1997.