Introduction to

Business Analytics
ISOM 2600

Topic 5: Clustering

Birds of a feather flock together

Introduction
Clustering is the task of partitioning a dataset into
groups, called clusters.

The goal is to split the data in such a way that points
within a single cluster are very similar and points in
different clusters are different.

As with classification algorithms, clustering
algorithms assign (or predict) a number to each data
point, indicating which cluster that point belongs to.

Application
In biology, clustering analysis has been used for
decades in the area of taxonomy.
The most general classification is kingdom, followed by
phylum, subphylum, etc.

Clustering analysis has been used in medicine to
assign patients to specific diagnostic categories on
the basis of their presenting symptoms and signs.

It has also been used in anthropology to classify
stone tools, shards, or fossil remains by the
civilization that produced them.

In market research, consumers can be clustered on
the basis of their purchase choices.

Comparison to the classification problem
Clustering analysis is highly empirical.

Different methods can lead to different groupings,
both in number and in content.

Since the groups are not known a priori, it is usually
difficult to judge whether the results make sense in
the context of the problem being studied.

Clustering is subjective

Some naive methods for clustering
Scatter diagram
Profile diagram
Distance measures

Scatter diagram
A hypothetical data set
usable only when there are two variables

Profile Diagram/Plot
A helpful technique for a moderate number of variables

To plot a profile of an individual case in the sample, the
investigator customarily first standardizes the data by
subtracting the mean and dividing by the standard
deviation for each variable

However, this step is omitted by some researchers,
especially if the units of measurement of the
variables are comparable

Example - Forbes Financial Data
The financial performance
data are from the January 1981
issue of Forbes

The table shows the data for 25
companies from three
industries:
chemical companies (the first 14)
health care companies (15-19)
supermarket companies (20-25)

Background of these companies
Among the chemical companies, all of the large diversified
firms were selected.

From the major supermarket chains, the top six rated for
return on equity were included.

In the health care industry, four of the five companies
included were those connected with hospital management;
the remaining company is involved in hospital supplies and
equipment (company 19).

Details of the variables
P/E: price-to-earnings ratio, which is the price of one share of
common stock divided by the earnings per share for the last
year. The ratio shows the dollar amount investors are willing to
pay for the stock per dollar of current earnings of the company.

ROR5: percent rate of return on total capital (invested plus
debt) averaged over the past five years.

D/E: debt-to-equity (invested capital) ratio for the last year. This
ratio indicates the extent to which management is using
borrowed funds to operate the company.

SALESGR5: percent annual compound growth rate of sales,
computed from the most recent five years compared with the
previous five years.

EPS5: percent annual compound growth in earnings per share,
computed from the most recent five years compared with the
previous five years.

NPM1: percent net profit margin, which is net profits divided
by net sales for the past year, expressed as a percentage.

PAYOUTR1: annual dividend divided by the latest 12-month
earnings per share. This value represents the proportion of
earnings paid out to shareholders rather than used to operate
and expand the company.
Profile diagram (standardized)

Conclusions based on the profile diagram
Companies 20 and 21 are very similar.
Companies 16 and 18 are very similar.
Companies 15 and 17 are very similar.
Company 19 stands alone.

The clusters are consistent with the types of companies,
especially noting that company 19 deals with hospital
supplies.

Distance Measures
For a large dataset, analytical methods such as K-means
are necessary.

These methods require defining some measure of
closeness or similarity between two observations.

The converse of similarity is distance.

Euclidean distance
the sum of the squared differences between the coordinates of
each variable for two observations:

$$\mathrm{distance} = \sum_{j=1}^{p} \left( x_j - y_j \right)^2 \quad \text{for } x = (x_1, \dots, x_p)^T \text{ and } y = (y_1, \dots, y_p)^T$$

When the variables have different units, it is advisable to
standardize the data before computing the distances.

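As a small sketch of why standardization matters (the numbers below are illustrative, not from the slides' data):

```python
import numpy as np

# Hypothetical data: rows are observations, columns are two variables
# with very different units (e.g. a ratio vs. sales in dollars).
X = np.array([[10.0, 50000.0],
              [12.0, 52000.0],
              [30.0, 51000.0]])

# Standardize each variable: subtract its mean, divide by its
# standard deviation (the step recommended on the slide).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Squared Euclidean distance between observations 0 and 1,
# following the formula: sum over j of (x_j - y_j)^2.
dist_raw = np.sum((X[0] - X[1]) ** 2)
dist_std = np.sum((Z[0] - Z[1]) ** 2)
```

On the raw data the dollar-scaled variable completely dominates the distance; after standardization both variables contribute on comparable scales.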
Issue: scaling
An electronic online retailer sells two items: socks and
computers.

In the data set, we have n = 8 customers measured on two
variables.

The choice of scaling of the variables has a dramatic
effect on the clustering result.

Left: the number of pairs of socks and computers purchased
Center: the same data, after scaling each variable by its
standard deviation
Right: the same data, but with the y-axis representing the
number of dollars spent

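The scaling effect can be sketched with a few made-up customers (not the textbook's n = 8 data set): on raw counts the sock variable dominates the distances, while dividing by the standard deviation makes the rare computer purchase matter much more.

```python
import numpy as np

# Hypothetical (socks, computers) counts for 3 customers.
X = np.array([[8.0, 0.0],    # many socks, no computer
              [1.0, 1.0],    # one pair of socks, one computer
              [2.0, 0.0]])   # few socks, no computer

def sqdist(a, b):
    # Squared Euclidean distance between two observations.
    return np.sum((a - b) ** 2)

# On raw counts, customer 1 looks far from 0 and close to 2,
# purely because of the sock counts.
raw_01, raw_12 = sqdist(X[0], X[1]), sqdist(X[1], X[2])

# After scaling each variable by its standard deviation, the
# computer purchase carries much more weight and the distance
# ratio shrinks dramatically.
Z = X / X.std(axis=0)
std_01, std_12 = sqdist(Z[0], Z[1]), sqdist(Z[1], Z[2])
```

Which clustering is "right" depends on whether a computer should count as much as a pair of socks, which is exactly the subjectivity the slide is pointing at.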
K-means
K-means clustering is a
popular nonhierarchical
clustering technique.

For a specified number of
clusters 𝐾, the basic
algorithm proceeds in the
following steps: choose 𝐾
initial cluster centers, assign
each point to the nearest
center, recompute each center
as the mean of its assigned
points, and repeat until the
assignments no longer change.

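A minimal sketch of the basic K-means iteration (the tiny data set and the fixed initial centers are illustrative choices for reproducibility; real implementations pick the initial centers at random, often with several restarts):

```python
import numpy as np

def kmeans(X, K, init, n_iter=50):
    # Basic K-means: start from K initial centers, then alternate
    # (1) assign each point to its nearest center and
    # (2) move each center to the mean of its assigned points,
    # until the centers stop changing.
    centers = X[init].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two tiny, well-separated groups (illustrative data).
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(X, K=2, init=[0, 3])
```

With 𝐾 = 2 the algorithm recovers the two groups and places each center at its group mean.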
A simulation study

K-means results for simulation data

Misspecification of the 𝐾 value
Left: 𝐾 = 2. Right: 𝐾 = 5.

Elbow Plot
For each selected 𝑘, we evaluate the clustering rule by
calculating the within-group sum of squares (WSS).

The idea is that if 𝑘 is less than the number of
clusters, WSS decreases dramatically as 𝑘 increases.

Once 𝑘 exceeds the number of clusters, WSS only
goes down slowly.

The plot of WSS versus 𝑘 will therefore exhibit a sudden
change in slope, which looks like an elbow.
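The elbow idea can be sketched on simulated data with three well-separated clusters (the seeding scheme below, farthest-first, is a deterministic stand-in for the usual random initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated simulated clusters, 30 points each.
X = np.vstack([rng.normal(c, 0.4, size=(30, 2))
               for c in [(0, 0), (6, 0), (3, 5)]])

def wss_for_k(X, k, n_iter=30):
    # Deterministic farthest-first seeding: start from point 0 and
    # repeatedly add the point farthest from the current seeds.
    seeds = [0]
    for _ in range(k - 1):
        d = ((X[:, None, :] - X[seeds][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        seeds.append(int(d.argmax()))
    centers = X[seeds].astype(float)
    # Standard assign/update iterations.
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):  # keep the old center if a cluster empties
                centers[j] = pts.mean(axis=0)
    # Within-group sum of squares for the final clustering.
    return float(((X - centers[labels]) ** 2).sum())

# WSS drops sharply until k reaches the true number of clusters
# (3 here), then levels off -- the elbow.
wss = [wss_for_k(X, k) for k in range(1, 7)]
```

Plotting `wss` against 𝑘 = 1, …, 6 would show the sharp bend at 𝑘 = 3.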
Example - Forbes Financial Data
(Continued)

Now, we apply K-means to the financial performance
data set.

The information on the type of company is not used to
derive the clusters. However, this information will be
used to interpret the results.

Because there are three types of companies, we set 𝐾 = 3.

Result from K-means
Interpreting the clusters from K-means clustering

Clusters by K-means clustering

Business implication
Cluster 0: growth companies(/stocks): largest SALESGR5
and large P/E.

Cluster 1: income companies(/stocks): large ROR5 and
PAYOUTR1.

Cluster 2: value companies(/stocks): large PAYOUTR1 and
lowest P/E.

Takeaways from Topic 5
Profile diagram/plot

K-means
Algorithm
Elbow plot

That’s all, folks!
Adios~

HIERARCHICAL CLUSTERING

Hierarchical Clustering
Hierarchical methods can be either agglomerative or
divisive.
Most programs are of the agglomerative type, and we
focus only on it.

Algorithm: start with each observation as its own cluster;
at each step, merge the two closest clusters; repeat until
all observations are in a single cluster.

How to merge two clusters?
Linkage methods
Single linkage: minimal inter-cluster distance

Complete linkage: maximal inter-cluster distance

Average linkage: mean inter-cluster distance

Centroid linkage: distance between the centroids of
cluster A and cluster B

Linkage method

Ward linkage: it picks the two clusters whose merge increases
the sum of squared deviations within all clusters the least. This
often leads to clusters that are relatively equally sized.

Mathematically, suppose there are K clusters after each merge;
then the sum of squared deviations within all clusters is defined
as

$$\sum_{j=1}^{K} \sum_{i \in C_j} \left( X_{ij} - \bar{X}_{\cdot j} \right)^2$$

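The Ward criterion can be sketched directly on a tiny hypothetical 1-D example: evaluate every possible merge and pick the one that increases the total within-cluster sum of squares the least. (This brute-force scan is for illustration; production implementations use efficient distance-update formulas instead.)

```python
import numpy as np

# Hypothetical 1-D data currently split into three clusters.
clusters = [np.array([0.0, 1.0]), np.array([5.0, 6.0]), np.array([2.5])]

def total_wss(cls):
    # Sum of squared deviations within all clusters (the Ward criterion).
    return sum(((c - c.mean()) ** 2).sum() for c in cls)

# Try every pair of clusters and record the merge that increases
# the total WSS the least.
best = None  # (increase, index a, index b)
for a in range(len(clusters)):
    for b in range(a + 1, len(clusters)):
        merged = [c for i, c in enumerate(clusters) if i not in (a, b)]
        merged.append(np.concatenate([clusters[a], clusters[b]]))
        inc = total_wss(merged) - total_wss(clusters)
        if best is None or inc < best[0]:
            best = (inc, a, b)
```

Here Ward merges the singleton {2.5} into the nearby cluster {0, 1} rather than joining the two tight pairs, keeping the total WSS increase small.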
Dendrogram or tree graph
With two variables and a large number of data
points, a graph representing the merge steps can
get too cluttered to interpret.

If the number of variables is more than two, such a
graph is not feasible.

Another tool for visualizing hierarchical clustering,
called a dendrogram, can handle multidimensional
datasets.

A simulation study with 3 clusters
Dendrogram (Linkage=ward)

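A sketch of the full pipeline, assuming SciPy is available (the six-point data set is illustrative, not the slides' simulation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points forming two tight, well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# Agglomerative clustering with Ward linkage: the linkage matrix Z
# records, for each of the n-1 merges, which two clusters were
# joined and at what distance.
Z = linkage(X, method='ward')

# Cutting the tree into 2 clusters recovers the two groups.
labels = fcluster(Z, t=2, criterion='maxclust')
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` (with matplotlib) draws the tree; the large jump in merge distance at the final join is what signals the two-cluster cut.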
Example - Forbes Financial Data
(Continued)

Now, we apply hierarchical clustering to the financial
performance data set.

The information on the type of company is not used to
derive the clusters. However, this information will be
used to interpret the results.

Hierarchical clustering
We use the centroid linkage method with Euclidean
distance.
The horizontal axis lists the observation numbers.
The distances are measured between the centers of the
two clusters just joined.
The distances are progressively increasing.

Result from Hierarchical clustering

How to choose the number of clusters in the
dendrogram plot?
As a general rule, large increases in the sequence of
merge distances should be a signal for examining those
particular steps.
A large distance indicates that at least two very
different observations exist, one from each of the two
clusters just combined.
The largest increase in distance occurs when we
combine the last two clusters.

Analysis of the results
Companies 1, 2, and 3 form a single cluster at the step
where there are 22 clusters.

At the opposite end, companies 15, 16, 17, and 18 form a
single cluster at the step in which there are two clusters.

Company 22 stays by itself until there are only four
clusters.

Chemical companies
13 of the chemical companies, all except no. 14 (rci),
are clustered together with only one non-chemical
firm, company 19 (ahs), when the number of clusters
is eight.

The results are impressive when one considers that
these are large diversified companies with varied
emphases, ranging from industrial chemicals to
textiles to oil and gas production.

Hospital management firms
At the level of nine clusters, three of the four hospital management
firms, companies 16, 17, and 18 (hca, nme, and ami), have been
clustered together, and the other, company 15 (hum), is added to that
cluster before it is aggregated with any non-hospital-management
firms.

From the data table, company 15 clearly has a different D/E value
from the others.
The misfit in this health care group is company 19, which is clustered
with the chemical firms.
In fact, company 19 is a large, established supplies and equipment
firm.
Grocery firms
The grocery firms do not cluster tightly.

From the table, they vary substantially on most variables.
Company 22 is highly leveraged (high D/E) and company
21 has low leverage (low D/E) relative to the others.
Important disparities:
Three of them, companies 21, 24, and 25 (win, ka, sa), are three
of the four largest United States grocery supermarket chains.
Two others, companies 20 and 22 (kls, sgl), have a diversified mix
of grocery, drug, department, and other stores.
The remaining firm, company 23 (slc), concentrates on convenience
stores (7-Eleven) and has a substantial franchising operation.
Thus, the six are quite different from each other.
