Why?
- Group similar utilities to predict cost impact of deregulation
Distance measures
• Record i: (xi1, xi2, ..., xip)
• Record j: (xj1, xj2, ..., xjp)
• dij: distance (dissimilarity) measure between records i and j
• Properties for distances
– Non-negative: dij ≥ 0
– Self-proximity: dii = 0
– Symmetry: dij = dji
– Triangle inequality: dij ≤ dik + dkj
Euclidean distance
• Normalization
– Before computing distance
– z-score = (xi − x̄) / σx, where x̄ is the mean and σx the standard deviation of the variable
• Unequal weights possible
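As an illustration of the points above, the Euclidean distance between records i and j is dij = sqrt((xi1 − xj1)² + ... + (xip − xjp)²). The sketch below standardizes each variable to z-scores before computing it, and accepts optional per-variable weights. The function names, the sample data, and the choice of the population standard deviation are assumptions made for this example, not something the slides specify.

```python
import math
from statistics import mean, pstdev


def zscore_columns(records):
    """Standardize every variable (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*records))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; the sample version is equally common.
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in records
    ]


def euclidean(a, b, weights=None):
    """Euclidean distance between two records, with optional non-negative weights."""
    if weights is None:
        weights = [1.0] * len(a)  # equal weights by default
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical data: the second variable is on a much larger scale than the first.
records = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
scaled = zscore_columns(records)

raw_distance = euclidean(records[0], records[1])    # dominated by the second variable
scaled_distance = euclidean(scaled[0], scaled[1])   # both variables contribute
```

Without normalization, the variable with the largest scale dominates the distance, which is why the z-scores are computed first.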
Hierarchical clustering
• Arrange groups into a natural hierarchy
• Algorithm for agglomerative
1. Start with n clusters (each record is a cluster)
2. Merge two closest records into a cluster
3. Repeat: at each step, merge the two clusters/records with the smallest distance, until all records are in one cluster
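The agglomerative steps can be sketched in a few lines. The slides do not say how the distance between two clusters is measured, so this example assumes single linkage (the smallest distance between any pair of members). The function names and sample points are hypothetical, and the naive search is meant for small data only.

```python
import math


def single_linkage(cluster_a, cluster_b, points):
    """Assumed cluster distance: smallest distance between any two members."""
    return min(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerative(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(points))]  # step 1: each record is a cluster
    merges = []
    while len(clusters) > 1:
        # Steps 2-3: find the pair of clusters with the smallest distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical records: two tight pairs that are far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
merges = agglomerative(points)
```

Each entry in `merges` records which clusters were joined and at what distance, which is the information a dendrogram displays.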
Hierarchical agglomerative clustering
Dendrograms
• Summarizes process of clustering
• x-axis: records
• y-axis: distance
• Cutoff distance: horizontal line
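Drawing the horizontal cutoff line amounts to keeping only the merges that happened at or below that distance. The sketch below shows this on a small, made-up merge history. The tuple layout (members of cluster a, members of cluster b, merge distance) and the function name are assumptions for this example, and it presumes merge distances never decrease from one merge to the next.

```python
def cut_dendrogram(n_records, merges, cutoff):
    """Return the clusters obtained by cutting the dendrogram at `cutoff`."""
    label = list(range(n_records))  # every record starts in its own cluster
    for members_a, members_b, distance in merges:
        if distance > cutoff:
            continue  # this merge lies above the horizontal line
        members = members_a + members_b
        shared = min(label[i] for i in members)
        for i in members:
            label[i] = shared  # records merged below the line share a label
    clusters = {}
    for record, lab in enumerate(label):
        clusters.setdefault(lab, []).append(record)
    return sorted(clusters.values())


# Hypothetical merge history for four records.
merges = [((0,), (1,), 1.0), ((2,), (3,), 2.0), ((0, 1), (2, 3), 6.4)]

two_clusters = cut_dendrogram(4, merges, cutoff=3.0)
three_clusters = cut_dendrogram(4, merges, cutoff=1.5)
```

Lowering the cutoff yields more, smaller clusters; raising it yields fewer, larger ones.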
Interpreting
K-Means
• Pre-specified number of clusters
• Minimize measure of dispersion within clusters
– Sum of distances of records to centroid
– Sum of squared Euclidean distances of records to centroid
• Algorithm
1. Start with k initial clusters
2. At every step, reassign each record to the cluster with the closest centroid
3. Recompute the centroids of clusters that lost or gained records, then repeat step 2
4. Stop when moving any more records between clusters would increase cluster dispersion
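A minimal sketch of these steps follows. Two simplifications are assumptions of this example rather than part of the slides: the run starts from k initial centroids supplied by the caller (instead of k initial clusters), and it stops when no record changes cluster, a common stand-in for the dispersion-based stopping rule. Function names and data are hypothetical.

```python
import math


def centroid(members):
    """Coordinate-wise mean of a non-empty list of records."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def k_means(records, initial_centroids, max_iter=100):
    """Return (assignment, centroids) after iterating from the given centroids."""
    centroids = list(initial_centroids)  # step 1 (assumed form: k starting centroids)
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(r, centroids[c])) for r in records
        ]
        if new_assignment == assignment:
            break  # step 4 (assumed form): no record moved, so stop
        assignment = new_assignment
        # Step 3: recompute centroids from the current members.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
    return assignment, centroids


# Hypothetical records forming two well-separated groups.
records = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(records, [(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids, so a single run can end in a poor local solution.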
Choosing k
• Previous knowledge
• Practical constraints
• Try a few k’s and compare results
– Randomly generated starting points help avoid poor results
• An optimal solution exists, but finding it is computationally expensive
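One way to try a few values of k, sketched under stated assumptions: run k-means from several random starts for each k, keep the run with the lowest within-cluster sum of squared distances, and compare those values across k. The function names, the number of restarts, the seed, and the data are all hypothetical, and the k-means loop is a simplified version that stops when no record changes cluster.

```python
import math
import random


def assign(records, centroids):
    """Index of the closest centroid for every record."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(r, centroids[c]))
        for r in records
    ]


def run_k_means(records, centroids, max_iter=100):
    """Simplified k-means iterations from the given starting centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(records, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def within_ss(records, labels, centroids):
    """Sum of squared Euclidean distances of records to their centroid."""
    return sum(math.dist(r, centroids[lab]) ** 2 for r, lab in zip(records, labels))


def best_of_restarts(records, k, n_starts=10, seed=0):
    """Lowest dispersion found over several random starting points."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        start = rng.sample(records, k)  # assumption: k distinct records as starting centroids
        labels, centroids = run_k_means(records, start)
        score = within_ss(records, labels, centroids)
        if best is None or score < best[0]:
            best = (score, labels)
    return best


# Hypothetical records forming two well-separated groups.
records = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
dispersion_by_k = {k: best_of_restarts(records, k)[0] for k in (1, 2, 3)}
```

Dispersion generally keeps falling as k grows, so the comparison is about where the improvement levels off, not about the smallest value.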
Interpreting
Validating clusters
• Meaningful clusters that generate insights
• Interpretability
– Can you assign a label?
• Stability
– Do clusters change a lot with a slight input change?
– Check with data partition
• Number of clusters
– Result must be useful