
Data Mining and Clustering
Benjamin Lam, Spring 2007, SJSU

Overview

Definition of clustering
Existing clustering methods
Clustering examples
Classification
Classification examples
Conclusion

Definition

Clustering can be considered the most important unsupervised learning technique; like every problem of this kind, it deals with finding structure in a collection of unlabeled data. Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

Mu-Yu Lu, SJSU

Why clustering? A few good reasons:
Simplifications
Pattern detection
Useful in data concept construction
Unsupervised learning process

Where to use clustering?
Data mining
Information retrieval
Text mining
Web analysis
Marketing
Medical diagnostics

Which method should I use? Consider:
Type of attributes in the data
Scalability to larger datasets
Ability to work with irregular data
Time and cost complexity
Data order dependency
Result presentation

Major existing clustering methods:
Distance-based
Hierarchical
Partitioning
Probabilistic

Measuring Similarity
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j). The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables. Weights should be associated with different variables based on applications and data semantics. There is a separate quality function that measures the goodness of a cluster. It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective. (Professor Sin-Min Lee)
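As a concrete illustration of a distance function d(i, j) for interval-scaled records, here is a minimal sketch of two common metrics (the function names are our own, not from the slides):

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric records."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Manhattan (city-block) distance between two records."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # -> 5.0
print(manhattan((0, 0), (3, 4)))  # -> 7
```

Which metric is appropriate depends on the attribute types and data semantics discussed above.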

Distance-based method
In this case we easily identify the 4 clusters into which the data can be divided; this is called distance-based clustering. The similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance.

Hierarchical clustering
Agglomerative (bottom up): 1. Start with each point as a singleton cluster. 2. Recursively merge two or more appropriate clusters. 3. Stop when k clusters are reached.
Divisive (top down): 1. Start with one big cluster. 2. Recursively divide it into smaller clusters. 3. Stop when k clusters are reached.

General steps of hierarchical clustering
Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S. C. Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into K clusters.

Exclusive vs. non-exclusive clustering
In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. A simple example is shown in the figure below, where the separation of points is achieved by a straight line on a two-dimensional plane. The second type, overlapping clustering, on the contrary uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.

Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively go through each subset and relocate points between clusters (opposite to the visit-once approach of the hierarchical method).
This recursive relocation yields higher-quality clusters.

Probabilistic clustering
1. Data are picked from a mixture of probability distributions.
2. Use the mean and variance of each distribution as parameters for a cluster.
3. Single cluster membership.
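A minimal sketch of the idea, assuming a mixture of two one-dimensional Gaussian components whose means and standard deviations are hypothetical (in practice they would be estimated from the data); each point gets a single cluster membership by picking the component with the highest density:

```python
from statistics import NormalDist

# Hypothetical parameters (mean, standard deviation) for two Gaussian
# components of the mixture; real values would be estimated from the data.
clusters = {"A": NormalDist(mu=0.0, sigma=1.0),
            "B": NormalDist(mu=5.0, sigma=1.0)}

def assign(x):
    """Single cluster membership: the component with the highest density at x."""
    return max(clusters, key=lambda c: clusters[c].pdf(x))

print([assign(x) for x in (0.2, 4.7, 2.4)])
```

Points near 0 fall in cluster "A", points near 5 in cluster "B".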

Single-Linkage Clustering (hierarchical)
The N*N proximity matrix is D = [d(i,j)]. The clusterings are assigned sequence numbers 0, 1, ..., (n-1). L(k) is the level of the kth clustering. A cluster with sequence number m is denoted (m). The proximity between clusters (r) and (s) is denoted d[(r),(s)].

The algorithm is composed of the following steps:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.

The algorithm is composed of the following steps (cont.):
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined in this way: d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }.
5. If all objects are in one cluster, stop. Else, go to step 2.

Hierarchical clustering example
Let's now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage. Input distance matrix (L = 0 for all the clusters):

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.

After merging MI with TO we obtain the following matrix:

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM; L(NA/RM) = 219, m = 2.

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM; L(BA/NA/RM) = 255, m = 3.

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM; L(BA/FI/NA/RM) = 268, m = 4.

Finally, we merge the last two clusters at level 295. The process is summarized by the following hierarchical tree:
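The whole single-linkage procedure can be sketched in a few lines. Note the distance matrix below is partly a reconstruction: the text only quotes d(MI,TO) = 138 and d(MI,RM) = 564, so the remaining entries are assumed values, chosen to reproduce the quoted merge levels 138, 219, 255, 268 and 295:

```python
# Single-linkage clustering of the Italian-cities example. Only d(MI,TO)=138
# and d(MI,RM)=564 are quoted in the text; the other matrix entries are
# plausible reconstructions consistent with the merges described above.
dist = {
    ("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255,
    ("BA", "RM"): 412, ("BA", "TO"): 996, ("FI", "MI"): 295,
    ("FI", "NA"): 468, ("FI", "RM"): 268, ("FI", "TO"): 400,
    ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
    ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669,
}

def d(a, b):
    return dist.get((a, b)) or dist[(b, a)]

def link(c1, c2):
    """Single link: shortest distance between any members of two clusters."""
    return min(d(x, y) for x in c1 for y in c2)

def single_linkage(items):
    clusters = [(c,) for c in items]   # L(0) = 0: one singleton per item
    levels = []                        # L(m) for each successive merge
    while len(clusters) > 1:
        # find the least dissimilar pair of clusters in the current clustering
        r, s = min(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]), key=lambda p: link(*p))
        levels.append(link(r, s))
        clusters = [c for c in clusters if c not in (r, s)] + [r + s]
    return levels

print(single_linkage(["BA", "FI", "MI", "NA", "RM", "TO"]))
# -> [138, 219, 255, 268, 295], matching the merge levels quoted above
```

The returned levels are exactly the L(m) values of the dendrogram.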

K-means algorithm
1. It accepts the number of clusters to group data into, and the dataset to cluster, as input values.
2. It then creates the first K initial clusters (K = number of clusters needed) from the dataset by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset as the initial clusters. Each of the 3 initial clusters formed will have just one row of data.

3. The K-means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. For example, suppose the dataset we are discussing is a set of height, weight and age measurements for students in a university, where a record P in the dataset S is represented by {Age, Height, Weight}. Then a record containing the measurements of a student John, whose Age = 20 years, Height = 1.70 metres (stored as 170) and Weight = 80 pounds, would be represented as John = {20, 170, 80}. In each of the first K initial clusters there is only one record, and the arithmetic mean of a cluster with one record is simply the set of values that make up that record; so the arithmetic mean of a cluster with only the record for John as a member = {20, 170, 80}.

4. Next, K-means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity like the Euclidean distance measure or the Manhattan/city-block distance measure.
5. K-means re-assigns each record in the dataset to the most similar cluster and recalculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean is represented as Pmean = {Agemean, Heightmean, Weightmean}, where Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean = (80 + 120)/2. The arithmetic mean of this cluster = {25, 165, 100}. This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.

6. K-means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-means clustering procedure is completed. Stable clusters are formed when new iterations of the K-means algorithm do not create new clusters, i.e. when the cluster center (arithmetic mean) of each cluster formed is the same as the old cluster center. There are different techniques for determining when stable clusters are formed, i.e. when the K-means clustering procedure is completed.
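Steps 1-7 above can be sketched as follows; the function name `k_means` and the toy {Age, Height, Weight} records are our own illustration, not from the slides:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def k_means(records, k, iters=100, seed=0):
    """Minimal K-means sketch: pick K random rows as initial clusters,
    assign each record to its nearest center, recompute the arithmetic
    means, and stop once the centers no longer change (stable clusters)."""
    centers = random.Random(seed).sample(records, k)       # steps 1-2
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for r in records:                                  # steps 4 and 6
            groups[min(range(k), key=lambda i: dist(r, centers[i]))].append(r)
        # step 5: the new center is the arithmetic mean of each cluster
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
               for g, c in zip(groups, centers)]
        if new == centers:                                 # step 7: stable
            break
        centers = new
    return centers

# Toy {Age, Height, Weight} records forming two obvious groups
data = [(20, 170, 80), (22, 172, 82), (30, 160, 120), (28, 158, 118)]
print(sorted(k_means(data, k=2)))  # two centers, one near each group
```

With these four records the procedure converges to the per-group means, e.g. {21, 171, 81} for the first pair, mirroring the John/Henry mean computed above.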

Classification
Goal: provide an overview of the classification problem and introduce some of the basic algorithms.
Classification problem overview
Classification techniques: Regression, Distance, Decision trees, Rules, Neural networks

Classification Examples
Teachers classify students' grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition.
Pattern recognition.

Classification Ex: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 50 then grade = F.
[Decision tree: successive splits on x, with leaves A (>= 90), B (>= 80), C (>= 70), D (>= 60), F (x < 50).]
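The rules above translate directly into code. Note the slide's rules leave scores in [50, 60) unassigned (D starts at 60 but F is stated as x < 50); the sketch below assumes anything below 60 maps to F:

```python
def grade(x):
    """Grading rules from the slide; scores in [50, 60) are not covered by
    the stated rules, so this sketch (an assumption) treats them as F."""
    if x >= 90: return "A"
    if x >= 80: return "B"
    if x >= 70: return "C"
    if x >= 60: return "D"
    return "F"

print([grade(x) for x in (95, 85, 72, 61, 40)])  # -> ['A', 'B', 'C', 'D', 'F']
```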

Classification Techniques
Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model developed to new data.
Classes must be predefined. Most common techniques use DTs or NNs, or are based on distances or statistical methods.

Defining Classes
Distance based
Partitioning based

Classification Using Regression
Division: use a regression function to divide the area into regions.
Prediction: use a regression function to predict a class membership function. Input includes the desired class.

Height Example Data

Name       Gender  Height   Output1  Output2
Kristina   F       1.6 m    Short    Medium
Jim        M       2 m      Tall     Medium
Maggie     F       1.9 m    Medium   Tall
Martha     F       1.88 m   Medium   Tall
Stephanie  F       1.7 m    Short    Medium
Bob        M       1.85 m   Medium   Medium
Kathy      F       1.6 m    Short    Medium
Dave       M       1.7 m    Short    Medium
Worth      M       2.2 m    Tall     Tall
Steven     M       2.1 m    Tall     Tall
Debbie     F       1.8 m    Medium   Medium
Todd       M       1.95 m   Medium   Medium
Kim        F       1.9 m    Medium   Tall
Amy        F       1.8 m    Medium   Medium
Wynette    F       1.75 m   Medium   Medium

Division .

Prediction .

Classification Using Distance
Place items in the class to which they are closest. Must determine the distance between an item and a class. Classes represented by: Centroid (central value), Medoid (representative point), or individual points. Algorithm: KNN.

K Nearest Neighbor (KNN)
The training set includes classes. Examine the K items nearest to the item to be classified. The new item is placed in the class with the largest number of close items. O(q) for each tuple to be classified. (Here q is the size of the training set.)
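A minimal KNN sketch; the function name and the one-dimensional height data are our own illustration, loosely following the height example table:

```python
from collections import Counter
from math import dist

def knn_classify(training, item, k=3):
    """Place `item` in the class with the most members among its k nearest
    neighbors; costs O(q) distance computations, q = training-set size."""
    nearest = sorted(training, key=lambda tc: dist(tc[0], item))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# (height in metres,) -> class, loosely based on the height example data
train = [((1.6,), "Short"), ((1.7,), "Short"), ((1.88,), "Medium"),
         ((1.9,), "Medium"), ((2.1,), "Tall"), ((2.2,), "Tall")]
print(knn_classify(train, (2.05,), k=3))  # -> Tall
```

The three nearest neighbors of 2.05 m are 2.1 m (Tall), 1.9 m (Medium) and 2.2 m (Tall), so the majority vote is Tall.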

KNN .

KNN Algorithm .

Classification Using Decision Trees
Partitioning based: divide the search space into rectangular regions. A tuple is placed into a class based on the region within which it falls.
DT approaches differ in how the tree is built (DT induction). Internal nodes are associated with an attribute, and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART.

Decision Tree
Given: D = {t1, ..., tn} where ti = <ti1, ..., tih>; the database schema contains {A1, A2, ..., Ah}; classes C = {C1, ..., Cm}.
A decision (or classification) tree is a tree associated with D such that:
Each internal node is labeled with an attribute, Ai.
Each arc is labeled with a predicate which can be applied to the attribute at its parent.
Each leaf node is labeled with a class, Cj.

DT Induction .

Comparing DTs
Balanced
Deep

ID3
Creates a tree using information-theory concepts and tries to reduce the expected number of comparisons. ID3 chooses the split attribute with the highest information gain, using entropy as the base for the calculation.
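Entropy and information gain, the quantities ID3 maximizes when choosing a split attribute, can be sketched as follows (the helper names are our own, and this is only the attribute-scoring step, not ID3's full tree builder):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting the rows on attribute index `attr`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p)
                                 for p in partitions.values())

# A perfectly separating attribute recovers the full entropy (gain = 1 bit)
rows = [("x",), ("x",), ("y",), ("y",)]
print(information_gain(rows, ["A", "A", "B", "B"], attr=0))  # -> 1.0
```

ID3 would evaluate this gain for every candidate attribute and split on the highest-scoring one.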

Conclusion
Clustering and classification are very useful in data mining and applicable to both text and graphical data. They help simplify data complexity, and classification can detect hidden patterns in data.

References
Dr. M. H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
Dr. Sin-Min Lee, San Jose State University
Mu-Yu Lu, SJSU
Database System Concepts, Silberschatz, Korth, Sudarshan

