Business Analytics
ISOM 2600
Topic 5: Clustering
Birds of a feather flock together
Introduction
Clustering is the task of partitioning the dataset into
groups, called clusters.
Application
In biology, clustering analysis has been used for
decades in the area of taxonomy.
The most general classification is kingdom, followed by
phylum, subphylum, etc.
Clustering analysis has been used in medicine to
assign patients to specific diagnostic categories on
the basis of their presenting symptoms and signs.
Comparison to the classification problem
Clustering analysis is highly empirical.
Clustering is subjective
Some naive methods for clustering
Scatter diagram
Profile diagram
Distance measures
Scatter diagram
A hypothetical data set
Practical only for two variables
Profile Diagram/Plot
A helpful technique for a moderate number of variables
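A profile plot can be sketched in a few lines. The company names and standardized values below are made up for illustration (they are not the Forbes data), and matplotlib is assumed to be available:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

variables = ["P/E", "D/E", "SALESGR5", "NPM1"]  # Forbes-style variables
# Hypothetical standardized values for three companies
profiles = {
    "Company A": [0.5, -1.0, 1.2, 0.8],
    "Company B": [0.4, -0.9, 1.1, 0.7],
    "Company C": [-1.5, 1.2, -0.8, -1.0],
}

fig, ax = plt.subplots()
for name, values in profiles.items():
    # Each observation becomes one line across the variables
    ax.plot(variables, values, marker="o", label=name)
ax.set_ylabel("standardized value")
ax.legend()
```

Companies whose lines track each other (A and B above) are candidates for the same cluster.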
Example - Forbes Financial Data
The financial performance data from the January 1981 issue of Forbes
Background of these companies
Among the chemical companies, all of the large diversified
firms were selected.
From the major supermarket chains, the top six rated for
return on equity were included.
Details of the variables
P/E: price-to-earnings ratio, which is the price of one share of
common stock divided by the earnings per share for the last
year. The ratio shows the dollar amount investors are willing to
pay for the stock per dollar of current earnings of the company.
D/E: debt-to-equity (invested capital) ratio for the last year. This
ratio indicates the extent to which management is using
borrowed funds to operate the company.
SALESGR5: percent annual compound growth rate of sales,
computed from the most recent five years compared with the
previous five years.
NPM1: percent net profit margin, which is the net profits divided
by the net sales for the past year, expressed as a percentage.
Conclusions based on the profile diagram
Companies 20 and 21 are very similar.
Companies 16 and 18 are very similar.
Companies 15 and 17 are very similar.
Company 19 stands alone.
Distance Measures
For a large dataset, analytical methods such as K-means
are necessary.
Euclidean distance
the sum of the squared differences between coordinates of
each variable for two observations:
distance = Σⱼ₌₁ᵖ (xⱼ − yⱼ)²,  for x = (x₁, …, xₚ)ᵀ and y = (y₁, …, yₚ)ᵀ
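The formula can be checked numerically. A minimal sketch with NumPy and two made-up observations:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Sum of squared coordinate-wise differences, as in the formula above
sq_dist = np.sum((x - y) ** 2)
# Taking the square root gives the usual Euclidean distance
euc_dist = np.sqrt(sq_dist)
print(sq_dist, euc_dist)  # 25.0 5.0
```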
Issue: scaling
An electronic online retailer sells two items: socks and
computers.
Left: the number of pairs of socks and the number of computers purchased
Center: the same data, after scaling each variable by its standard deviation
Right: the same data, with the y-axis now measured in dollars
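Standardizing each variable before clustering is a one-liner. A sketch with hypothetical purchase counts (the numbers are made up):

```python
import numpy as np

# Hypothetical customers: pairs of socks bought vs. computers bought
data = np.array([[10.0, 0.0],
                 [8.0, 1.0],
                 [12.0, 0.0],
                 [9.0, 1.0]])

# Standardize each variable: subtract the column mean, divide by the column sd
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
```

After scaling, both variables have mean 0 and standard deviation 1, so neither dominates the distance calculation.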
K-means
K-means clustering is a
popular nonhierarchical
clustering technique.
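A minimal version of the K-means (Lloyd's) algorithm fits in a few lines of NumPy. The synthetic data and the deterministic initialization below are illustrative choices, not part of the slides:

```python
import numpy as np

def kmeans(X, centers, n_iter=100):
    """Minimal Lloyd's algorithm: alternate nearest-center assignment and
    centroid updates until the centers stop moving.
    (No empty-cluster guard; fine for well-separated toy data.)"""
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic groups of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])

# Deterministic initialization (one starting center from each group,
# just to keep the sketch reproducible)
labels, centers = kmeans(X, X[[0, 20]].copy())
```

With such clean separation, the two recovered clusters match the two generating groups.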
A simulation study
K-means results for simulation data
Misspecification of the 𝐾 value
[Figure: K-means results on the simulated data with K = 2 (left) and K = 5 (right)]
Elbow Plot
For each selected 𝑘, we evaluate the clustering result by
calculating the within-group sum of squares (WSS).
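The WSS computation behind an elbow plot can be sketched as follows; the synthetic data, the simple Lloyd's loop, and the evenly spaced initialization are all illustrative assumptions:

```python
import numpy as np

def lloyd(X, k, n_iter=50):
    # Spread the initial centers evenly through the data
    # (a simple deterministic choice for this sketch)
    centers = X[:: max(1, len(X) // k)][:k].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Three well-separated synthetic groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (15, 2)) for c in (0.0, 4.0, 8.0)])

# Within-group sum of squares for each candidate k
wss_by_k = {}
for k in range(1, 7):
    labels, centers = lloyd(X, k)
    wss_by_k[k] = float(((X - centers[labels]) ** 2).sum())
```

Plotting k against wss_by_k[k] gives the elbow plot: WSS drops sharply up to the true number of groups (here 3) and flattens afterwards.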
Result from K-means
Interpreting clusters from K-means clustering
Clusters by K-means clustering
Business implication
Cluster 0: growth companies/stocks (largest SALESGR5 and large P/E).
Takeaways from Topic 5
Profile diagram/plot
K-means
Algorithm
Elbow plot
That’s all, folks!
Adios~
HIERARCHICAL CLUSTERING
Hierarchical Clustering
Hierarchical methods can be either agglomerative or
divisive.
Most programs are of the agglomerative type, and we
focus only on it.
Algorithm
How to merge two clusters?
Linkage method
Single Linkage: minimal inter-cluster distance
Linkage method
Ward Linkage: it picks the two clusters such that the sum of
squared deviations within all clusters increases the least. This
often leads to clusters that are relatively equally sized.
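Both linkage rules are available in SciPy (assuming `scipy` is installed). A small sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight, well-separated synthetic groups of 10 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])

# Agglomerative clustering under two different merge criteria
Z_single = linkage(X, method="single")  # merge at the minimal inter-cluster distance
Z_ward = linkage(X, method="ward")      # merge with the smallest increase in within-cluster SS

# Cut each tree into two clusters
labels_single = fcluster(Z_single, t=2, criterion="maxclust")
labels_ward = fcluster(Z_ward, t=2, criterion="maxclust")
```

On data this well separated, both criteria recover the same two groups; they differ mainly on elongated or unevenly sized clusters.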
Dendrogram or tree graph
With two variables and a large number of data
points, the representation of the steps in a graph can
get too cluttered to interpret.
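Under the hood, a dendrogram is just a drawing of the linkage matrix. A small sketch (assuming SciPy) that inspects the matrix directly:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 1-D points forming two obvious groups
X = np.array([[0.0], [0.1], [5.0], [5.2], [5.1]])
Z = linkage(X, method="ward")

# Each row of Z records one merge: (cluster i, cluster j, merge distance, new size);
# scipy.cluster.hierarchy.dendrogram(Z) would draw this table bottom-up.
print(Z)
```

The last row joins the two big groups at by far the largest distance, which is exactly the long vertical branch a dendrogram would show.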
A simulation study with 3 clusters
Dendrogram (Linkage=ward)
Example - Forbes Financial Data
(Continued)
Hierarchical clustering
We use the centroid linkage method with the Euclidean
distance.
The horizontal axis lists the observation numbers.
The distances are measured between the centers of the
two clusters just joined.
The distances are progressively increasing.
Result from Hierarchical clustering
How to choose the number of clusters in the
dendrogram plot?
As a general rule, large increases in the sequence
should be a signal for examining those particular
steps.
A large distance indicates that at least two very
different observations exist, one from each of the two
clusters just combined.
The large increase in distance occurs when we
combine the last two clusters.
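The "large increase" rule can be automated by looking for the biggest jump between consecutive merge distances. A sketch on made-up 1-D data (assuming SciPy):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three well-separated made-up 1-D groups
X = np.array([[0.0], [0.2], [0.1], [4.0], [4.1], [8.0], [8.2]])
Z = linkage(X, method="ward")

# Merge distances grow as clustering proceeds; a large jump between
# consecutive merges flags a natural place to stop combining
heights = Z[:, 2]
jumps = np.diff(heights)
# Number of clusters left just before the biggest jump
k = len(X) - (int(np.argmax(jumps)) + 1)
labels = fcluster(Z, t=k, criterion="maxclust")
```

Here the biggest jump occurs once the three within-group merges are done, so the rule recovers k = 3.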
Analysis of the results
Companies 1, 2, and 3 form a single cluster when
there are 22 clusters.
Chemical companies
13 of the chemical companies, all except no. 14 (rci),
are clustered together with only one non-chemical
firm, company 19 (ahs), when the number of clusters
is eight.
Hospital management firms
At the level of nine clusters, three of the four hospital management
firms, companies 16, 17, and 18 (hca, nme, and ami), have been
clustered together; the other, company 15 (hum), is added to that
cluster before it is aggregated with any non-hospital-management
firms.
From the data table, company 15 has clearly a different D/E value
from others.
The misfit in this health group is company 19, clustered with chemical
firms.
In fact, company 19 is a large, established supplies and equipment
firm.
Grocery firms
The grocery firms do not cluster tightly.