5 views

Uploaded by PhạmTrungKiên

nhan dang thi giac

save

- Performance Evaluation of K-means Clustering Algorithm with Various Distance Metrics
- PRIZM NE Segment Narratives 2007.pdf
- Clustering of High-Volume Data Streams In Network Traffic
- Matt
- 4835932
- Lecture 9 Clustering
- Essentials of Machine Learning Algorithms (With Python and R Codes)
- Analysis of Clustering Approaches for Data Mining in Large Data Sources
- SIMILARITY BASED CLUSTERING TECHNIQUES IN UNCERTAIN DATASET – A LITERATURE SURVEY
- 20-463 internal and external validity.pdf
- Untitled 1
- Clustering
- K Mean Evaluation in Weka Tool and Modifying It Using Standard Score Method
- [IJCT-V3I2P10]
- Real World Document Clustering Using Modified Balanced Iterative Reducing and Clustering using Hierarchies
- 3434
- Enhanced Document Clustering Using K-Means With Support Vector Machine (SVM) Approach
- 10.1.1.108
- ijsrp-may-2012-89
- Performance Evaluation of Clustering Algorithms
- Fm 3210271031
- Segmentation of Images by using Fuzzy k-means clustering with ACO
- 50120140501012
- baili2011
- Graph Partitioning Advance Clustering Technique
- IJCSE11-03-12-085
- 3_A Vector Quantization
- IRJET-Detection and Classification of Plant Diseases
- Image Segmentation FINAL PAPER (16)
- COLOUR BASED IMAGE SEGMENTATION USING HYBRID KMEANS WITH WATERSHED SEGMENTATION
- Input ThuocLa CapLai
- Slide Oop Ontap 0275
- Cài Đặt SOLR Trên Windows
- sqljion.docx
- Company Trip
- [biboo.vn] - Chuong 03 - Lop Va Doi Tuong.pdf
- Input ThuocLa CapMoi
- SQL Injection
- sqljion
- automaper
- sqljion.docx
- [biboo.vn] - Chuong 07 - Tinh Da Hinh.pdf
- sql injection.docx
- fedu.vn_31-07-2017 (2)
- www cho web.txt
- New Text Document
- A Simple Shopping Cart
- Dang Ky Thuc Hanh
- 7-de-thi-co-lg
- Quynh
- New Text Document
- New Text Document
- Ky Thuat Lap Trinh NukeViet 3
- bosung_150259_en
- Hướng Dẫn Đăng Ký Khóa Học Online
- Code Complete 2
- slidewebservice-1210738041070959-9
- 100 toeic bstudent
- Mvc

You are on page 1of 28

Clustering

• Partition unlabeled examples into disjoint

subsets of clusters, such that:

– Examples within a cluster are very similar

– Examples in different clusters are very different

**• Discover new categories in an unsupervised
**

manner (no sample category labels

provided).

Slide credit: Raymond J. Mooney

2

. . . . . . .. . . . . 3 . . .Clustering Example . .

4 .Hierarchical Clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. mammal worm insect crustacean • Recursive application of a standard clustering algorithm can produce a hierarchical clustering. animal vertebrate invertebrate fish reptile amphib.

Divisive Clustering • Aglommerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters. • Divisive (partitional.Aglommerative vs. top-down) separate all examples immediately into clusters. 5 .

• A clustering evaluation function assigns a realvalue quality measure to a clustering. desired.Direct Clustering Method • Direct clustering methods require a specification of the number of clusters. k. 6 . • The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC) • Assumes a similarity function for determining the similarity of two instances. • The history of merging forms a binary tree or hierarchy. • Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. 7 .

that are most similar.HAC Algorithm Start with all instances in their own cluster. determine the two clusters. Replace ci and cj with a single cluster ci cj 8 . ci and cj. Until there is only one cluster: Among the current clusters.

– Cosine similarity of vectors.Cluster Similarity • Assume a similarity function that determines the similarity of two instances: sim(x. – Group Average: Average similarity between members. 9 .y). – Complete Link: Similarity of two least similar members. • How to compute similarity of two clusters each possibly containing multiple instances? – Single Link: Similarity of two most similar members.

such as clustering islands. 10 .Single Link Agglomerative Clustering • Use maximum similarity of pairs: sim(ci . y) xci . yc j • Can result in “straggly” (long and thin) clusters due to chaining effect.c j ) max sim( x. – Appropriate in some domains.

Single Link Example 11 .

y ) xci . yc j • Makes more “tight.” spherical clusters that are typically preferable. 12 .c j ) min sim( x.Complete Link Agglomerative Clustering • Use minimum similarity of pairs: sim(ci .

Complete Link Example 13 .

all HAC methods need to compute similarity of all pairs of n individual instances which is O(n2). • In order to maintain an overall O(n2) performance. 14 . it must compute the distance between the most recently created cluster and all other existing clusters. • In each of the subsequent n2 merging iterations.Computational Complexity • In the first iteration. computing similarity to each other cluster must be done in constant time.

sim(c j . ck ) min( sim(ci . ck ) max( sim(ci . ck )) – Complete Link: sim((ci c j ).Computing Cluster Similarity • After merging ci and cj. can be computed by: – Single Link: sim((ci c j ). the similarity of the resulting cluster to any other cluster. ck )) 15 . ck ). ck ). ck. sim(c j .

1 sim(ci .Group Average Agglomerative Clustering • Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. • Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters (to encourage tighter final clusters). y) ci c j ( ci c j 1) x( ci c j ) y( ci c j ): y x • Compromise between single and complete link. c j ) sim( x. 16 .

c j ) ( s (ci ) s (c j )) ( s (ci ) s (c j )) (| ci | | c j |) (| ci | | c j |)(| ci | | c j | 1) 17 . • Always maintain sum of vectors in each cluster. s(c ) x j xc j • Compute similarity of clusters in constant time: sim(ci .Computing Group Average Similarity • Assume cosine similarity and normalized vectors with unit length.

repeatedly reallocating instances to different clusters to improve the overall clustering. k.Non-Hierarchical Clustering • Typically must provide the number of desired clusters. • Stop when clustering converges or after a fixed number of iterations. • Iterate. • Randomly choose k instances as seeds. one per cluster. • Form initial clusters based on these seeds. 18 .

or mean of points in a cluster.K-Means • Assumes instances are real-valued vectors. • Clusters based on centroids. 19 . center of gravity. c: 1 μ(c) x | c | xc • Reassignment of instances to clusters is based on distance to the current cluster centroids.

y ) xi yi i 1 • Cosine Similarity (transform to a distance by subtracting from 1): x y 1 x y 20 . y ) • L1 norm: m 2 ( x y ) i i i 1 m L1 ( x .Distance Metrics • Euclidian distance (L2 norm): L2 ( x .

sj) is minimal. (Update the seeds to the centroid of each cluster) For each cluster cj sj = (cj) 21 . Until clustering converges or other stopping criterion: For each instance xi: Assign xi to the cluster cj such that d(xi.… sk} as seeds. Select k random instances {s1.K-Means Algorithm Let d be the distance measure between instances. s2.

K Means Example (K=2) Pick seeds Reassign clusters Compute centroids Reasssign clusters x x x x Compute centroids Reassign clusters Converged! 22 .

• Reassigning clusters: O(kn) distance computations. • Assume these two steps are each done once for I iterations: O(Iknm). 23 . assuming a fixed number of iterations. or O(knm). more efficient than O(n2) HAC. • Computing centroids: Each instance vector gets added once to some centroid: O(nm). • Linear in all relevant factors.Time Complexity • Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors.

24 . • The k-means algorithm is guaranteed to converge a local optimum. K l 1 xi X l || xi l || 2 • Finding the global optimum is NP-hard.K-Means Objective • The objective of k-means is to minimize the total sum of the squared distance of every point to its corresponding cluster centroid.

• Select good seeds using a heuristic or the results of another method. • Some seeds can result in poor convergence rate. 25 . or convergence to sub-optimal clusterings.Seed Choice • Results can vary based on random seed selection.

• First randomly take a sample of instances of size n • Run group-average HAC on this sample. which takes only O(n) time. • Overall algorithm is O(n) and avoids problems of bad seed selection.Buckshot Algorithm • Combines HAC and K-Means clustering. 26 . • Use the results of HAC as initial seeds for Kmeans.

• Overlapping clustering.Issues in Clustering • How to evaluate clustering? – Internal: • Tightness and separation of clusters (e. 27 .g. k-means objective) • Fit of probabilistic model to data – External • Compare to known class labels on benchmark data • Improving search to converge faster and avoid local minima.

Conclusions • Unsupervised learning induces categories from unlabeled data. • There are a variety of approaches. including: – HAC – k-means • Semi-supervised learning uses both labeled and unlabeled data to improve results. 28 .

- Performance Evaluation of K-means Clustering Algorithm with Various Distance MetricsUploaded byNkiru Edith
- PRIZM NE Segment Narratives 2007.pdfUploaded bycyzyjy
- Clustering of High-Volume Data Streams In Network TrafficUploaded byijcsis
- MattUploaded bykajomi
- 4835932Uploaded byMilkha Singh
- Lecture 9 ClusteringUploaded byMohsin Iqbal
- Essentials of Machine Learning Algorithms (With Python and R Codes)Uploaded byAbhishek Patel
- Analysis of Clustering Approaches for Data Mining in Large Data SourcesUploaded byEditor IJRITCC
- SIMILARITY BASED CLUSTERING TECHNIQUES IN UNCERTAIN DATASET – A LITERATURE SURVEYUploaded byijaert
- 20-463 internal and external validity.pdfUploaded byAkhmad Mustafa
- Untitled 1Uploaded bysreekantha
- ClusteringUploaded byRafael
- K Mean Evaluation in Weka Tool and Modifying It Using Standard Score MethodUploaded byEditor IJRITCC
- [IJCT-V3I2P10]Uploaded byIjctJournals
- Real World Document Clustering Using Modified Balanced Iterative Reducing and Clustering using HierarchiesUploaded byIRJET Journal
- 3434Uploaded byHelloproject
- Enhanced Document Clustering Using K-Means With Support Vector Machine (SVM) ApproachUploaded byEditor IJRITCC
- 10.1.1.108Uploaded byKau Chau
- ijsrp-may-2012-89Uploaded byRishabh Garg
- Performance Evaluation of Clustering AlgorithmsUploaded byAnonymous F1whTR
- Fm 3210271031Uploaded byAnonymous 7VPPkWS8O
- Segmentation of Images by using Fuzzy k-means clustering with ACOUploaded byijtetjournal
- 50120140501012Uploaded byIAEME Publication
- baili2011Uploaded byRajendra Prasad
- Graph Partitioning Advance Clustering TechniqueUploaded byijcses
- IJCSE11-03-12-085Uploaded byDian_Kartika_S_4446
- 3_A Vector QuantizationUploaded byHadeelAl-hamaam
- IRJET-Detection and Classification of Plant DiseasesUploaded byIRJET Journal
- Image Segmentation FINAL PAPER (16)Uploaded byPrajwal Prakash
- COLOUR BASED IMAGE SEGMENTATION USING HYBRID KMEANS WITH WATERSHED SEGMENTATIONUploaded byIAEME Publication

- Input ThuocLa CapLaiUploaded byPhạmTrungKiên
- Slide Oop Ontap 0275Uploaded byPhạmTrungKiên
- Cài Đặt SOLR Trên WindowsUploaded byPhạmTrungKiên
- sqljion.docxUploaded byPhạmTrungKiên
- Company TripUploaded byPhạmTrungKiên
- [biboo.vn] - Chuong 03 - Lop Va Doi Tuong.pdfUploaded byPhạmTrungKiên
- Input ThuocLa CapMoiUploaded byPhạmTrungKiên
- SQL InjectionUploaded byPhạmTrungKiên
- sqljionUploaded byPhạmTrungKiên
- automaperUploaded byPhạmTrungKiên
- sqljion.docxUploaded byPhạmTrungKiên
- [biboo.vn] - Chuong 07 - Tinh Da Hinh.pdfUploaded byPhạmTrungKiên
- sql injection.docxUploaded byPhạmTrungKiên
- fedu.vn_31-07-2017 (2)Uploaded byPhạmTrungKiên
- www cho web.txtUploaded byPhạmTrungKiên
- New Text DocumentUploaded byPhạmTrungKiên
- A Simple Shopping CartUploaded byPhạmTrungKiên
- Dang Ky Thuc HanhUploaded byPhạmTrungKiên
- 7-de-thi-co-lgUploaded byPhạmTrungKiên
- QuynhUploaded byPhạmTrungKiên
- New Text DocumentUploaded byPhạmTrungKiên
- New Text DocumentUploaded byPhạmTrungKiên
- Ky Thuat Lap Trinh NukeViet 3Uploaded byPhạmTrungKiên
- bosung_150259_enUploaded byPhạmTrungKiên
- Hướng Dẫn Đăng Ký Khóa Học OnlineUploaded byPhạmTrungKiên
- Code Complete 2Uploaded byPhạmTrungKiên
- slidewebservice-1210738041070959-9Uploaded byPhạmTrungKiên
- 100 toeic bstudentUploaded byPhạmTrungKiên
- MvcUploaded byPhạmTrungKiên