
Clustering Algorithms

Minimize distance
But to Centers of Groups
Clustering
• First need to identify clusters
– Can be done automatically
– Often clusters determined by problem
• Then simple matter to measure distance
from new observation to each cluster
– Use same measures as with Memory-based reasoning
Partitioning
• Define new categorical variables
– Divide data into a fixed number (k) of regions
– K-means clustering
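
A minimal k-means sketch in plain Python, for illustration only: the toy 2-D points, the value of k, the random initialization, and the stopping rule are all my own choices, not from the slides.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: alternate assigning points to the nearest
    center and moving each center to its cluster's mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # k random starting centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d2.index(min(d2))].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for c, cl in zip(centers, clusters)
        ]
        if new_centers == centers:                   # no movement: converged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9)], 2)
```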
Clustering Uses
• Segment customers
– Find profitability of each,
treat accordingly
• Star classification:
– Red giants, white dwarfs,
normal
– Brightness & temperature
used to classify
• U.S. Army
– Identify sizes needed for
female soldiers
– (males – one size fits all)
Tires
• Segment customers into product
categories
– High end (they would buy Michelins)
– Intermediate & Low
• Standardize data (as in Memory-based reasoning)
Raw Tire Data
BRAND INCOME AGE OF CAR
Michelin $182,200 5 months
Michelin $171,200 3 years
Goodyear $28,800 7 years
Goodyear $37,800 6 years
Goodyear $42,200 5 years
Goodyear $55,600 4 years
Goodyear $51,200 9 years
Goodyear $173,400 7 years
Opie’s tires $13,400 3 years
Opie’s tires $68,800 6 years
Standardize
• INCOME
– MIN(1, INCOME / 200,000)
• AGE OF CAR
– IF(AGE OF CAR < 12 months, 1, MAX((8 - years) / 7, 0))
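
A sketch of these two transforms in Python; the $200,000 income cap and the (8 - years)/7 age scale come from the slide, while the function names and the month-based input are my own framing.

```python
def standardize_income(income):
    # Cap income at $200,000, then scale to [0, 1].
    return min(1.0, income / 200_000)

def standardize_car_age(months):
    # Bought within the last 12 months -> 1;
    # otherwise (8 - whole years) / 7, floored at 0 for cars 8+ years old.
    if months < 12:
        return 1.0
    return max((8 - months // 12) / 7, 0)

print(standardize_income(182_200))  # 0.911    (first Michelin row)
print(standardize_car_age(5))       # 1.0      (5 months)
print(standardize_car_age(36))      # ~0.714   (3 years)
```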
Sort Data by Outcome
BRAND INCOME AGE OF CAR
Michelin High income Bought this year
Michelin High income Bought 1-3 yrs ago
Goodyear Low income Bought 4+ yrs ago
Goodyear Low income Bought 4+ yrs ago
Goodyear Low income Bought 4+ yrs ago
Goodyear Avg income Bought 1-3 yrs ago
Goodyear Avg income Bought 4+ yrs ago
Goodyear High income Bought 4+ yrs ago
Opie’s tires Low income Bought 1-3 yrs ago
Opie’s tires Avg income Bought 4+ yrs ago
Standardized Training Data
BRAND INCOME AGE OF CAR
Michelin 0.911 1
Michelin 0.856 0.714
Goodyear 0.144 0.143
Goodyear 0.189 0.286
Goodyear 0.211 0.429
Goodyear 0.278 0.571
Goodyear 0.256 0
Goodyear 0.867 0.143
Opie’s tires 0.067 0.714
Opie’s tires 0.344 0.286
Identify Cluster Means
(could use median, mode)
BRAND INCOME CAR AGE
Michelin 0.884 0.857
Goodyear 0.324 0.262
Opie’s tires 0.206 0.500
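
A sketch recomputing these centroids from the standardized training table; `data` simply transcribes the rows above, and the comprehension takes the per-brand column means.

```python
# (income, car age) rows from the standardized training data
data = {
    "Michelin":     [(0.911, 1.0), (0.856, 0.714)],
    "Goodyear":     [(0.144, 0.143), (0.189, 0.286), (0.211, 0.429),
                     (0.278, 0.571), (0.256, 0.0), (0.867, 0.143)],
    "Opie's tires": [(0.067, 0.714), (0.344, 0.286)],
}

centroids = {
    brand: tuple(sum(col) / len(col) for col in zip(*rows))
    for brand, rows in data.items()
}
# Matches the slide after rounding:
# Michelin (0.884, 0.857), Goodyear (0.324, 0.262), Opie's tires (0.206, 0.500)
```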


New Case #1
• For a new observation (from a test set, or a new case to classify), compute the squared distance to each centroid:
Michelin: 0.840
Goodyear: 0.025
Opie’s tires: 0.047
• Minimum distance is to Goodyear, so assign the case to the Goodyear cluster
New Case #2
• Squared distance to each centroid:
Michelin: 0.634
Goodyear: 0.255
Opie’s tires: 0.057
• Minimum distance is to Opie’s, so assign the case to the Opie’s tires cluster
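
A sketch of the nearest-centroid assignment for both cases. The slides quote only the squared distances, not the new cases' standardized values, so the inputs below are back-solved from those distances (roughly income 0.1665 with a 6-year-old car, and income 0.100 with a 3-year-old car) and should be read as assumptions.

```python
centroids = {                       # unrounded means from the training data
    "Michelin":     (0.8835, 0.857),
    "Goodyear":     (0.3242, 0.262),
    "Opie's tires": (0.2055, 0.500),
}

def classify(case):
    """Return the nearest brand (squared Euclidean) and all distances."""
    d2 = {b: (case[0] - c[0]) ** 2 + (case[1] - c[1]) ** 2
          for b, c in centroids.items()}
    return min(d2, key=d2.get), d2

for case in [(0.1665, 2 / 7), (0.100, 5 / 7)]:   # assumed inputs, see above
    brand, d2 = classify(case)
    print(brand, {b: f"{v:.3f}" for b, v in d2.items()})
# Goodyear     {'Michelin': '0.840', 'Goodyear': '0.025', "Opie's tires": '0.047'}
# Opie's tires {'Michelin': '0.634', 'Goodyear': '0.255', "Opie's tires": '0.057'}
```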
Software Methods
• Hierarchical clustering (sketch after this list)
– Number of clusters unspecified a priori
– Two-step is a form of hierarchical clustering
• K-Means clustering
• Self-Organizing Maps
– Neural network
• Hybrids combine methods
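
A toy sketch of the first item above: agglomerative single-linkage clustering, where the number of clusters is not fixed in advance but falls out of a merge-distance threshold. The points and the 0.2 threshold are illustrative choices, not from the slides.

```python
def single_linkage(points, max_dist):
    """Bottom-up hierarchical clustering: repeatedly merge the two
    closest clusters until no pair is within max_dist of each other."""
    clusters = [[p] for p in points]              # start from singletons

    def dist(a, b):                               # single-linkage distance:
        return min(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for p in a for q in b)         # closest pair across clusters

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        if dist(clusters[i], clusters[j]) > max_dist:
            break                                 # nothing left close enough
        clusters[i] += clusters.pop(j)            # merge cluster j into i
    return clusters

print(single_linkage([(0.1, 0.1), (0.15, 0.12), (0.9, 0.8), (0.88, 0.85)], 0.2))
# -> two clusters: the two low points together, the two high points together
```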
Application: Credit Cards
• Credit scoring critical
• Use past applicants to develop a model that predicts payback
– Look for indicators providing early warning of
trouble
British Credit Card Company
• Monthly account status – over 90 thousand customers, one year of operations
• Outcome variable STATE: cumulative
months of missed payments (integer)
– Some errors & missing data (affected observations eliminated)
– Biased sample of 10 thousand observations
– Required initial STATE of 0
British Credit Card Company
• Compared clustering approaches with a pattern-detection method
• Used medians rather than centroids
– More stable
– Partitioned data
• Clustering useful for profiling general behavior
• Pattern search method sought local clusters
– Unable to partition entire data set
– Identified a few groups with unusual behavior
Insurance Claim Application
• Large data warehouse of financial
transactions & claims
• Customer retention very important
– Recent heavy growth in policies
– Decreased profitability
• Used clustering to analyze claim patterns
– Wanted hidden trends & patterns
Insurance Claim Mining
• Undirected knowledge discovery
– Cluster analysis to identify risk categories
• Data for 1996-1998
– Quarterly data
– Claims for prior 12 months
– Contribution to profit of each policy
– Over 100,000 samples
– Heavy growth in young people with expensive
automobiles
– Transformed data to normalize, remove outliers
Insurance Claim Mining
• Number of clusters
– Too few clusters give no discrimination; the best number here was 50
– Used the k-means algorithm to minimize squared error
• Identified a few clusters with high claims frequency and unprofitability
• Compared 1998 data with 1996 to find trends
• Developed model to predict new policy holder
performance
– Used for pricing
Computational Constraints
• Each cluster should have adequate sample size
• Since cluster averages are used, cluster analysis is less sensitive to disproportionate cluster sizes than matching methods
• The more variables you have, the greater the
computational complexity
– The curse of dimensionality
– (it won’t run in a reasonable time if you have too
many variables)
Fuzzy Clustering Methods
• K-Means
• Fuzzy c-means (membership sketch after this list)
• Hierarchical
• Bayesian Classification
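
For the fuzzy c-means item above: instead of a hard assignment, each point gets a graded membership in every cluster. A sketch of the standard membership formula u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)); the fuzzifier m = 2 is a common default, and the centers and point below are made up.

```python
def memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster."""
    d = [((point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2) ** 0.5
         for c in centers]
    if 0.0 in d:                      # point sits exactly on a center
        return [1.0 if di == 0.0 else 0.0 for di in d]
    return [1.0 / sum((di / dj) ** (2 / (m - 1)) for dj in d) for di in d]

# A point near the first (hypothetical) center gets most of its weight there.
print(memberships((0.2, 0.3), [(0.25, 0.3), (0.8, 0.9)]))  # ~[0.997, 0.003]
```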
Rough Set Clustering
• Group indiscernible objects together
– Objects whose similarity index falls below the threshold are treated as indiscernible
– A higher threshold yields fewer clusters
– Distance is a weighted sum of:
• Euclidean distance for numerical attributes
• Hamming distance for nominal attributes
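
A sketch of that weighted mixed distance: Euclidean over the numeric fields, Hamming (mismatch count) over the nominal ones, combined by a weight. The 0.5 weight, the record layout, and the example values are all illustrative assumptions.

```python
def mixed_distance(a, b, num_idx, nom_idx, w=0.5):
    """Weighted sum of Euclidean distance on numeric fields and
    Hamming distance (count of mismatches) on nominal fields."""
    euclid = sum((a[i] - b[i]) ** 2 for i in num_idx) ** 0.5
    hamming = sum(a[i] != b[i] for i in nom_idx)
    return w * euclid + (1 - w) * hamming

# Records: (income, car age, brand) -- standardized numerics, nominal brand.
x = (0.167, 0.286, "Goodyear")
y = (0.206, 0.500, "Opie's tires")
print(mixed_distance(x, y, num_idx=(0, 1), nom_idx=(2,)))  # ~0.61
```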
