Notes on clustering

Ke Chen

COMP24111 Machine Learning

Outline

• Introduction

• Cluster Distance Measures

• Agglomerative Algorithm

• Example and Demo

• Relevant Issues

• Summary

Introduction

• Hierarchical Clustering Approach
  – A typical clustering analysis approach that partitions the data set sequentially
  – Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
  – Uses a (generalised) distance matrix as the clustering criterion

• Agglomerative vs. Divisive
  – Two sequential clustering strategies for constructing a tree of clusters
  – Agglomerative: a bottom-up strategy. Initially each data object is in its own (atomic) cluster; these atomic clusters are then merged into larger and larger clusters
  – Divisive: a top-down strategy. Initially all objects are in one single cluster; the cluster is then subdivided into smaller and smaller clusters

Introduction

• Illustrative Example: agglomerative and divisive clustering on the data set {a, b, c, d, e}
  – Agglomerative (steps 0 to 4): a and b merge into (a, b); d and e merge into (d, e); c joins to form (c, d, e); finally (a, b) and (c, d, e) merge into (a, b, c, d, e)
  – Divisive traverses the same tree top-down (steps 4 down to 0)
  – The cluster distance measure and the termination condition determine where the tree is cut

Cluster Distance Measures

• Single link (min): smallest distance between an element in one cluster and an element in the other, i.e. d(Ci, Cj) = min{d(xip, xjq)}

• Complete link (max): largest distance between an element in one cluster and an element in the other, i.e. d(Ci, Cj) = max{d(xip, xjq)}

• Average: average distance between elements in one cluster and elements in the other, i.e. d(Ci, Cj) = avg{d(xip, xjq)}

Note that d(C, C) = 0 under all three measures.
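These definitions translate directly into code. Below is a minimal NumPy sketch (function names and the choice of Euclidean distance for d(xip, xjq) are illustrative, not from the slides); each cluster is an array with one row per object.

```python
import numpy as np

def pairwise_distances(C1, C2):
    """All Euclidean distances d(x_ip, x_jq) between members of two clusters.

    C1 and C2 have shapes (n1, d) and (n2, d): one row per object.
    """
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    return np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=-1)

def single_link(C1, C2):     # smallest pairwise distance: d(Ci, Cj) = min{...}
    return pairwise_distances(C1, C2).min()

def complete_link(C1, C2):   # largest pairwise distance: d(Ci, Cj) = max{...}
    return pairwise_distances(C1, C2).max()

def average_link(C1, C2):    # average pairwise distance: d(Ci, Cj) = avg{...}
    return pairwise_distances(C1, C2).mean()
```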

Example: Given a data set of five objects characterised by a single continuous feature:

  Object:   a  b  c  d  e
  Feature:  1  2  4  5  6

Assume there are two clusters, C1 = {a, b} and C2 = {c, d, e}.

1. Calculate the distance matrix.
2. Calculate the three cluster distances between C1 and C2.

Distance matrix:

      a  b  c  d  e
  a   0  1  3  4  5
  b   1  0  2  3  4
  c   3  2  0  1  2
  d   4  3  1  0  1
  e   5  4  2  1  0

Single link:
  dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)}
               = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
  dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)}
               = max{3, 4, 5, 2, 3, 4} = 5

Average:
  dist(C1, C2) = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5
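As a quick numerical check of this example, a self-contained sketch (only NumPy assumed) reproduces all three cluster distances:

```python
import numpy as np

C1 = np.array([[1.0], [2.0]])            # a, b
C2 = np.array([[4.0], [5.0], [6.0]])     # c, d, e

D = np.abs(C1 - C2.T)                    # all pairwise distances between C1 and C2
print(D.min())    # single link:   2.0  (= d(b, c))
print(D.max())    # complete link: 5.0  (= d(a, e))
print(D.mean())   # average:       3.5  (= 21 / 6)
```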

Agglomerative Algorithm

• The agglomerative algorithm is carried out in three steps (a code sketch follows below):
  1) Convert all object features into a distance matrix
  2) Set each object as a cluster (thus, with N objects, we start with N clusters)
  3) Repeat until the number of clusters is one (or a known # of clusters is reached):
     – Merge the two closest clusters
     – Update the distance matrix
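A minimal sketch of these three steps, assuming a precomputed NumPy distance matrix and single-link merging (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def agglomerative(D, n_clusters=1):
    """Naive single-link agglomerative clustering on a distance matrix D.

    Returns the surviving clusters as lists of object indices.
    """
    clusters = [[i] for i in range(len(D))]          # step 2: one cluster per object
    while len(clusters) > n_clusters:                # step 3: repeat until termination
        # find the two closest clusters (single-link distance)
        best = (0, 1, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(D[p][q] for p in clusters[i] for q in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters[j]                   # merge the two closest clusters
        del clusters[j]                              # cluster distances are recomputed next pass
    return clusters

# Distance matrix from the five-object example above (a, b, c, d, e with features 1, 2, 4, 5, 6)
D = np.abs(np.subtract.outer([1, 2, 4, 5, 6], [1, 2, 4, 5, 6]))
print(agglomerative(D, n_clusters=2))    # [[0, 1], [2, 3, 4]], i.e. {a, b} and {c, d, e}
```

For simplicity this sketch recomputes cluster distances from the original matrix at every pass rather than maintaining an explicitly updated distance matrix; the result is the same.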

Example

• Problem: clustering analysis with the agglomerative algorithm
  – Convert the data matrix into a distance matrix using the Euclidean distance
• Iteration 1: merge the two closest clusters, then update the distance matrix
• Iteration 2: merge the two closest clusters, then update the distance matrix
• Iteration 3: merge the two closest clusters and update the distance matrix
• Iteration 4: merge the two closest clusters and update the distance matrix
• Final result (meeting the termination condition)

Example

• Dendrogram tree representation (the vertical axis shows the lifetime/merge distance, the leaves are the objects):
  1. In the beginning we have 6 clusters: A, B, C, D, E and F.
  2. We merge clusters D and F into cluster (D, F) at distance 0.50.
  3. We merge cluster A and cluster B into (A, B) at distance 0.71.
  4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00.
  5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
  6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
  7. The last cluster contains all the objects, which concludes the computation.
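The same merge sequence can be reproduced with SciPy's hierarchical clustering routines. The 2-D coordinates below are an assumption (the slide's data matrix appears only as a figure); they are chosen so that single-link merging yields the distances 0.50, 0.71, 1.00, 1.41 and 2.50 listed above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumed coordinates consistent with the merge distances in the example
points = np.array([[1.0, 1.0],    # A
                   [1.5, 1.5],    # B
                   [5.0, 5.0],    # C
                   [3.0, 4.0],    # D
                   [4.0, 4.0],    # E
                   [3.0, 3.5]])   # F

Z = linkage(points, method='single')   # each row: (cluster i, cluster j, distance, size)
print(Z)                               # merge order: D+F, A+B, E+(D,F), +C, +(A,B)
dendrogram(Z, labels=list('ABCDEF'))
plt.show()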

Example

• Dendrogram tree representation: "clustering" USA states

Exercise

Given a data set of five objects characterised by a single continuous feature:

  Object:   a  b  c  d  e
  Feature:  1  2  4  5  6

Apply the agglomerative algorithm with the single-link, complete-link and average cluster distance measures to produce three dendrogram trees, respectively.

Distance matrix:

      a  b  c  d  e
  a   0  1  3  4  5
  b   1  0  2  3  4
  c   3  2  0  1  2
  d   4  3  1  0  1
  e   5  4  2  1  0
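One way to check the three requested dendrograms is SciPy, whose 'single', 'complete' and 'average' linkage methods correspond to the three cluster distance measures defined earlier. A minimal sketch, assuming matplotlib is available for plotting:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

features = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])   # a, b, c, d, e

for method in ('single', 'complete', 'average'):
    Z = linkage(features, method=method)   # Euclidean distances on the 1-D feature
    plt.figure()
    plt.title(f'{method}-link dendrogram')
    dendrogram(Z, labels=list('abcde'))
plt.show()
```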

Demo

• Agglomerative demo

Relevant Issues

• How to determine the number of clusters
  – If the number of clusters is known, the termination condition is given!
  – The K-cluster lifetime is the range of threshold values on the dendrogram tree that leads to the identification of K clusters
  – Heuristic rule: cut the dendrogram tree at the maximum K-cluster lifetime (a code sketch follows below)
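With N objects the dendrogram has N - 1 merge distances, and exactly K clusters exist for thresholds between the (N - K)-th and (N - K + 1)-th merge distances, so the K-cluster lifetime is the gap between consecutive merge distances. A small sketch of the heuristic rule under that reading (the function name is illustrative; Z is a SciPy-style linkage matrix):

```python
import numpy as np

def best_k_by_lifetime(Z):
    """Pick K with the maximum K-cluster lifetime from a linkage matrix Z.

    K = 1 is excluded, since its lifetime is unbounded above.
    """
    heights = Z[:, 2]                    # merge distances, in ascending order
    n = len(heights) + 1                 # number of objects
    lifetimes = {}
    for k in range(2, n + 1):
        lower = heights[n - k - 1] if k < n else 0.0
        lifetimes[k] = heights[n - k] - lower
    return max(lifetimes, key=lifetimes.get), lifetimes

# Merge distances from the six-object dendrogram example above:
# the maximum lifetime is 2.50 - 1.41 = 1.09, so the heuristic picks K = 2.
Z_heights = np.array([0.50, 0.71, 1.00, 1.41, 2.50])
Z = np.column_stack([np.zeros((5, 2)), Z_heights, np.zeros(5)])   # only column 2 is used here
print(best_k_by_lifetime(Z)[0])   # 2
```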

Summary

• The hierarchical algorithm is a sequential clustering algorithm
  – Uses a distance matrix to construct a tree of clusters (dendrogram)
  – Gives a hierarchical representation without the need to know the # of clusters (a termination condition can be set when the # of clusters is known)

• Major weaknesses of agglomerative clustering methods
  – Can never undo what was done previously
  – Sensitive to the cluster distance measure and to noise/outliers
  – Less efficient: O(n^2 log n), where n is the total number of objects

• There are several variants that overcome these weaknesses
  – BIRCH: scalable to large data sets
  – ROCK: clustering categorical data
  – CHAMELEON: hierarchical clustering using dynamic modelling

• Online tutorial: the hierarchical clustering functions in Matlab
  https://www.youtube.com/watch?v=aYzjenNNOcc
