
Unsupervised models: clustering techniques

Unsupervised learning models


Cluster analysis
Introduction
• Clustering is used in many areas from astronomy to sociology
• Clustering → segments the data
• Famous example
– Mendeleev's periodic table (chemistry)
• Examples in business
– Market segmentation (personas)
– Balanced portfolios in finance
– Industry analysis
• Useful to improve performance of supervised methods
– Model each cluster separately instead of one heterogeneous dataset

Why?
– Example: group similar utilities to predict the cost impact of deregulation
Distance measures
• Record i: (xi1, xi2, ..., xip)
• Record j: (xj1, xj2, ..., xjp)
• dij: distance metric (dissimilarity measure) between records i and j
• Properties for distances
– Non-negative: dij ≥ 0
– Self-proximity: dii = 0
– Symmetry: dij = dji
– Triangle inequality: dij ≤ dik + dkj

Euclidean distance
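
For reference, using the record notation defined above, the unweighted Euclidean distance between records i and j is:

```latex
d_{ij} = \sqrt{\sum_{k=1}^{p} \left(x_{ik} - x_{jk}\right)^{2}}
```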

• Normalization
– Before computing distance
– z-scores: zi = (xi − x̄) / σx
• Unequal weights possible

• Features of this measure


– Highly scale-dependent
– Ignores relationships between variables
– Sensitive to outliers
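
A minimal sketch of the scale-dependence point, with illustrative variable names (not from the source); z-scoring puts the variables on a comparable footing before computing distance:

```python
import numpy as np

# Two records with variables on very different scales:
# income (tens of thousands) dominates age (tens) in raw Euclidean distance.
records = np.array([
    [25, 50_000.0],   # age, income
    [60, 52_000.0],
])

raw_dist = np.linalg.norm(records[0] - records[1])

# Standardize each column to z-scores: z = (x - mean) / std.
z = (records - records.mean(axis=0)) / records.std(axis=0)
z_dist = np.linalg.norm(z[0] - z[1])

print(f"raw distance: {raw_dist:.1f}")    # driven almost entirely by income
print(f"z-score distance: {z_dist:.2f}")  # both variables now contribute equally
```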

Choosing distance measure


• The choice of distance measure plays a role in the analysis
– Domain-dependent
• Distance measures
– Euclidean → dissimilarity
– Squared Pearson correlation → correlation-based similarity
– Mahalanobis → accounts for correlation between variables
– Manhattan → sum of absolute differences
– Maximum (Chebyshev) → driven by the variable with the largest deviation
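
A hedged sketch of several of these measures via scipy.spatial.distance (SciPy is my choice here, not named in the source):

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy data, used to estimate a covariance matrix
u, v = X[0], X[1]

print(distance.euclidean(u, v))    # straight-line dissimilarity
print(distance.cityblock(u, v))    # Manhattan: sum of absolute differences
print(distance.chebyshev(u, v))    # maximum: the variable with the largest deviation
print(distance.correlation(u, v))  # 1 − Pearson correlation (correlation-based)

VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix
print(distance.mahalanobis(u, v, VI))        # accounts for correlation between variables
```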

Distance for categorical variables


• Example: binary variables
• Matching coefficient: (a+d) / p
• Jaccard's coefficient: d / (b+c+d) (ignores matches on zeros)
• Gower's similarity: for mixed (numerical and categorical) data
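
A small sketch of the two binary-variable coefficients above, following the slide's convention that d counts positions where both records are 1 (helper names are my own):

```python
import numpy as np

def matching_coefficient(x, y):
    """(a + d) / p: fraction of positions where the binary records agree."""
    x, y = np.asarray(x), np.asarray(y)
    return np.mean(x == y)

def jaccard_coefficient(x, y):
    """d / (b + c + d): agreement on 1s only, ignoring shared 0s."""
    x, y = np.asarray(x), np.asarray(y)
    d = np.sum((x == 1) & (y == 1))   # both 1
    b_plus_c = np.sum(x != y)         # mismatches
    return d / (b_plus_c + d)

x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
print(matching_coefficient(x, y))  # 3/5 = 0.6
print(jaccard_coefficient(x, y))   # 2/(2+2) = 0.5
```
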
Clustering techniques

Hierarchical clustering
• Arrange groups into a natural hierarchy
• Algorithm for agglomerative
1. Start with n clusters (each record is a cluster)
2. Merge the two closest records into a cluster
3. Repeat, merging the two clusters/records with the smallest distance
Hierarchical agglomerative clustering
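
One way to run the agglomerative algorithm above, sketched with SciPy (the library choice, linkage method, and toy data are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))   # 20 records, 2 variables

# Each row of Z describes one merge: the two clusters joined, the
# distance at which they merged, and the size of the new cluster.
Z = linkage(X, method="single")   # single linkage: merge the closest pair
print(Z[:3])
```
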
Dendrograms
• Summarizes process of clustering
• x-axis: records
• y-axis: distance
• Cutoff distance: horizontal line
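
Continuing the same sketch: plotting the dendrogram and cutting it at an (arbitrary, illustrative) distance to get flat cluster labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
Z = linkage(X, method="single")

dendrogram(Z)                        # x-axis: records, y-axis: merge distance
plt.axhline(y=0.5, linestyle="--")   # cutoff distance as a horizontal line
plt.show()

labels = fcluster(Z, t=0.5, criterion="distance")  # clusters formed below the cutoff
print(labels)
```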

Interpreting the dendrogram
K-Means
• Pre-specified number of clusters
• Minimize measure of dispersion within clusters
– Sum of distances of records to centroid
– Sum of squared Euclidean distances of records to centroid
• Algorithm
1. Start with k initial clusters
2. At every step, reassign each record to the cluster with the closest centroid
3. Recompute the centroids of clusters that lost/gained records and repeat step 2
4. Stop when moving records would increase cluster dispersion
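
A minimal scikit-learn sketch of this algorithm (k and the toy data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))   # toy records

# n_init restarts from several random initial centroids and keeps the
# run with the lowest within-cluster dispersion (inertia).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])       # cluster assignment per record
print(km.cluster_centers_)   # one centroid per cluster
print(km.inertia_)           # sum of squared distances to the closest centroid
```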

Choosing k
• Previous knowledge
• Practical constraints
• Try a few k’s and compare results
– Randomly generated starting points help avoid poor local optima
• An optimal solution exists, but it is computationally expensive to find
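
One common way to "try a few k's and compare", sketched as an elbow check on within-cluster dispersion (the range of k is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))

# Look for the "elbow" where adding more clusters stops paying off.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```
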
Interpreting the clusters

Validating clusters
• Meaningful clusters that generate insights
• Interpretability
– Can you assign a label?
• Stability
– Do clusters change a lot with a slight input change?
– Check with a data partition (see the sketch after this list)
• Number of clusters
– Result must be useful
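
One possible way to operationalize the stability check with a data partition; the split-and-compare scheme and the adjusted Rand index are my own choices, not prescribed by the source:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))

# Cluster two disjoint halves of the data separately, then label the same
# records with both solutions; high agreement suggests stable clusters.
idx = rng.permutation(len(X))
half_a, half_b = idx[:150], idx[150:]

km_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[half_a])
km_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[half_b])

labels_a = km_a.predict(X[half_a])   # assignments from model A
labels_b = km_b.predict(X[half_a])   # same records, assignments from model B
print(adjusted_rand_score(labels_a, labels_b))  # near 1 → stable clustering
```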
