Introduction to

Business Analytics
ISOM 2600

Topic 5: Clustering

Birds of a feather flock together

Introduction
Clustering is the task of partitioning a dataset into
groups, called clusters.

The goal is to split the data in such a way that points
within a single cluster are very similar and points in
different clusters are different.

As with classification algorithms, clustering
algorithms assign (or predict) a number to each data
point, indicating which cluster that point belongs to.

Application
In biology, clustering analysis has been used for
decades in the area of taxonomy.
The most general classification is kingdom, followed by
phylum, subphylum, etc.

Clustering analysis has been used in medicine to
assign patients to specific diagnostic categories on
the basis of their presenting symptoms and signs.

It has also been used in anthropology to classify
stone tools, shards, or fossil remains by the
civilization that produced them.

In market research, consumers can be clustered on
the basis of their purchase choices.

Comparison to the classification problem
Clustering analysis is highly empirical.

Different methods can lead to different groupings,
both in number and in content.

Since the groups are not known a priori, it is usually
difficult to judge whether the results make sense in
the context of the problem being studied.

Clustering is subjective

Some naive methods for clustering
Scatter diagram
Profile diagram
Distance measures

Scatter diagram
A hypothetical data set
usable only when there are two variables

Profile Diagram/Plot
A helpful technique for a moderate number of variables

To plot a profile of an individual case in the sample, the
investigator customarily first standardizes the data by
subtracting the mean and dividing by the standard
deviation for each variable

However, this step is omitted by some researchers,
especially if the units of measurement of the
variables are comparable

Example - Forbes Financial Data
The financial performance
data are from the January 1981
issue of Forbes

The table shows the data for 25
companies from three
industries:
chemical companies (the first 14)
health care companies (15-19)
supermarket companies (20-25)

Background of these companies
Among the chemical companies, all of the large diversified
firms were selected.

From the major supermarket chains, the top six rated for
return on equity were included.

In the health care industry, four of the five companies
included were those connected with hospital management;
the remaining company is involved in hospital supplies and
equipment (company 19).

Details of the variables
P/E: price-to-earnings ratio, which is the price of one share of
common stock divided by the earnings per share for the last
year. The ratio shows the dollar amount investors are willing to
pay for the stock per dollar of current earnings of the company.

ROR5: percent rate of return on total capital (invested plus
debt) averaged over the past five years.

D/E: debt-to-equity (invested capital) ratio for the last year. This
ratio indicates the extent to which management is using
borrowed funds to operate the company.

SALESGR5: percent annual compound growth rate of sales,
computed from the most recent five years compared with the
previous five years.

EPS5: percent annual compound growth in earnings per share,
computed from the most recent five years compared with the
previous five years.

NPM1: percent net profit margin, which is net profits divided
by net sales for the past year, expressed as a percentage.

PAYOUTR1: annual dividend divided by the latest 12-month
earnings per share. This value represents the proportion of
earnings paid out to shareholders rather than used to operate
and expand the company.
Profile diagram (standardized)

Conclusions based on the profile diagram
Companies 20 and 21 are very similar.
Companies 16 and 18 are very similar.
Companies 15 and 17 are very similar.
Company 19 stands alone.

The clusters are consistent with the types of companies,
especially noting that company 19 deals with hospital
supplies.

Distance Measures
For a large dataset, analytical methods such as K-means
are necessary.

These methods require defining some measure of
closeness or similarity between two observations.

The converse of similarity is distance.

Euclidean distance
the sum of the squared differences between the coordinates of
each variable for two observations:

$$\mathrm{distance} = \sum_{j=1}^{p} \left( x_j - y_j \right)^2 \quad \text{for } x = (x_1, \dots, x_p)^T \text{ and } y = (y_1, \dots, y_p)^T$$

When the variables have different units, it is advisable to
standardize the data before computing the distances.

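As a small sketch of why standardization matters (the numbers below are illustrative, not from the slides' data):

```python
import numpy as np

# Hypothetical data: rows are observations, columns are two variables
# with very different units (e.g. a ratio vs. sales in dollars).
X = np.array([[10.0, 50000.0],
              [12.0, 52000.0],
              [30.0, 51000.0]])

# Standardize each variable: subtract its mean, divide by its
# standard deviation (the step recommended on the slide).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Squared Euclidean distance between observations 0 and 1,
# following the formula: sum over j of (x_j - y_j)^2.
dist_raw = np.sum((X[0] - X[1]) ** 2)
dist_std = np.sum((Z[0] - Z[1]) ** 2)
```

On the raw data the dollar-scaled variable completely dominates the distance; after standardization both variables contribute on comparable scales.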
Issue: scaling
An electronic online retailer sells two items: socks and
computers.

In the data set, we have n = 8 customers measured on two
variables.

The choice of scaling of the variables has a dramatic
effect on the clustering result.

Left: the number of pairs of socks and computers purchased
Center: the same data, after scaling each variable by its
standard deviation
Right: the same data, but with the y-axis representing the
number of dollars spent

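The scaling effect can be sketched with a few made-up customers (not the textbook's n = 8 data set): on raw counts the sock variable dominates the distances, while dividing by the standard deviation makes the rare computer purchase matter much more.

```python
import numpy as np

# Hypothetical (socks, computers) counts for 3 customers.
X = np.array([[8.0, 0.0],    # many socks, no computer
              [1.0, 1.0],    # one pair of socks, one computer
              [2.0, 0.0]])   # few socks, no computer

def sqdist(a, b):
    # Squared Euclidean distance between two observations.
    return np.sum((a - b) ** 2)

# On raw counts, customer 1 looks far from 0 and close to 2,
# purely because of the sock counts.
raw_01, raw_12 = sqdist(X[0], X[1]), sqdist(X[1], X[2])

# After scaling each variable by its standard deviation, the
# computer purchase carries much more weight and the distance
# ratio shrinks dramatically.
Z = X / X.std(axis=0)
std_01, std_12 = sqdist(Z[0], Z[1]), sqdist(Z[1], Z[2])
```

Which clustering is "right" depends on whether a computer should count as much as a pair of socks, which is exactly the subjectivity the slide is pointing at.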
K-means
K-means clustering is a
popular nonhierarchical
clustering technique.

For a specified number of
clusters 𝐾, the basic
algorithm proceeds in the
following steps: choose 𝐾
initial cluster centers, assign
each point to the nearest
center, recompute each center
as the mean of its assigned
points, and repeat until the
assignments no longer change.

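A minimal sketch of the basic K-means iteration (the tiny data set and the fixed initial centers are illustrative choices for reproducibility; real implementations pick the initial centers at random, often with several restarts):

```python
import numpy as np

def kmeans(X, K, init, n_iter=50):
    # Basic K-means: start from K initial centers, then alternate
    # (1) assign each point to its nearest center and
    # (2) move each center to the mean of its assigned points,
    # until the centers stop changing.
    centers = X[init].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two tiny, well-separated groups (illustrative data).
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(X, K=2, init=[0, 3])
```

With 𝐾 = 2 the algorithm recovers the two groups and places each center at its group mean.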
A simulation study

K-means results for simulation data

Misspecification of the 𝐾 value
Left: 𝐾 = 2. Right: 𝐾 = 5.

Elbow Plot
For each selected 𝑘, we evaluate the clustering rule by
calculating the within-group sum of squares (WSS).

The idea is that if 𝑘 is less than the number of
clusters, WSS decreases dramatically as 𝑘 increases.

Once 𝑘 exceeds the number of clusters, WSS only
goes down slowly.

The plot of WSS versus 𝑘 will therefore exhibit a sudden
change in slope, which looks like an elbow.
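The elbow idea can be sketched on simulated data with three well-separated clusters (the seeding scheme below, farthest-first, is a deterministic stand-in for the usual random initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated simulated clusters, 30 points each.
X = np.vstack([rng.normal(c, 0.4, size=(30, 2))
               for c in [(0, 0), (6, 0), (3, 5)]])

def wss_for_k(X, k, n_iter=30):
    # Deterministic farthest-first seeding: start from point 0 and
    # repeatedly add the point farthest from the current seeds.
    seeds = [0]
    for _ in range(k - 1):
        d = ((X[:, None, :] - X[seeds][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        seeds.append(int(d.argmax()))
    centers = X[seeds].astype(float)
    # Standard assign/update iterations.
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):  # keep the old center if a cluster empties
                centers[j] = pts.mean(axis=0)
    # Within-group sum of squares for the final clustering.
    return float(((X - centers[labels]) ** 2).sum())

# WSS drops sharply until k reaches the true number of clusters
# (3 here), then levels off -- the elbow.
wss = [wss_for_k(X, k) for k in range(1, 7)]
```

Plotting `wss` against 𝑘 = 1, …, 6 would show the sharp bend at 𝑘 = 3.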
Example - Forbes Financial Data
(Continued)

Now, we apply K-means to the financial performance
data set.

The information on the type of company is not used to
derive the clusters. However, this information will be
used to interpret the results.

Because there are three types of companies, we set 𝐾 = 3.

Result from K-means
Interpreting the clusters from K-means clustering

Clusters by K-means clustering

Business implication
Cluster 0: growth companies(/stocks): largest SALESGR5
and large P/E.

Cluster 1: income companies(/stocks): large ROR5 and
PAYOUTR1.

Cluster 2: value companies(/stocks): large PAYOUTR1 and
lowest P/E.

Takeaways from Topic 5
Profile diagram/plot

K-means
Algorithm
Elbow plot

That’s all, folks!
Adios~

HIERARCHICAL CLUSTERING

Hierarchical Clustering
Hierarchical methods can be either agglomerative or
divisive.
Most programs are of the agglomerative type, and we
focus only on it.

Algorithm: start with each observation as its own cluster;
at each step, merge the two closest clusters; repeat until
all observations are in a single cluster.

How to merge two clusters?
Linkage methods
Single linkage: minimal inter-cluster distance

Complete linkage: maximal inter-cluster distance

Average linkage: mean inter-cluster distance

Centroid linkage: distance between the centroids of
cluster A and cluster B

Linkage method

Ward linkage: it picks the two clusters whose merge increases
the sum of squared deviations within all clusters the least. This
often leads to clusters that are relatively equally sized.

Mathematically, suppose there are K clusters after each merge;
then the sum of squared deviations within all clusters is defined
as

$$\sum_{j=1}^{K} \sum_{i \in C_j} \left( X_{ij} - \bar{X}_{\cdot j} \right)^2$$

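The Ward criterion can be sketched directly on a tiny hypothetical 1-D example: evaluate every possible merge and pick the one that increases the total within-cluster sum of squares the least. (This brute-force scan is for illustration; production implementations use efficient distance-update formulas instead.)

```python
import numpy as np

# Hypothetical 1-D data currently split into three clusters.
clusters = [np.array([0.0, 1.0]), np.array([5.0, 6.0]), np.array([2.5])]

def total_wss(cls):
    # Sum of squared deviations within all clusters (the Ward criterion).
    return sum(((c - c.mean()) ** 2).sum() for c in cls)

# Try every pair of clusters and record the merge that increases
# the total WSS the least.
best = None  # (increase, index a, index b)
for a in range(len(clusters)):
    for b in range(a + 1, len(clusters)):
        merged = [c for i, c in enumerate(clusters) if i not in (a, b)]
        merged.append(np.concatenate([clusters[a], clusters[b]]))
        inc = total_wss(merged) - total_wss(clusters)
        if best is None or inc < best[0]:
            best = (inc, a, b)
```

Here Ward merges the singleton {2.5} into the nearby cluster {0, 1} rather than joining the two tight pairs, keeping the total WSS increase small.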
Dendrogram or tree graph
With two variables and a large number of data
points, a graph representing the merge steps can
get too cluttered to interpret.

If the number of variables is more than two, such a
graph is not feasible.

Another tool for visualizing hierarchical clustering,
called a dendrogram, can handle multidimensional
datasets.

A simulation study with 3 clusters
Dendrogram (Linkage=ward)

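A sketch of the full pipeline, assuming SciPy is available (the six-point data set is illustrative, not the slides' simulation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points forming two tight, well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# Agglomerative clustering with Ward linkage: the linkage matrix Z
# records, for each of the n-1 merges, which two clusters were
# joined and at what distance.
Z = linkage(X, method='ward')

# Cutting the tree into 2 clusters recovers the two groups.
labels = fcluster(Z, t=2, criterion='maxclust')
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` (with matplotlib) draws the tree; the large jump in merge distance at the final join is what signals the two-cluster cut.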
Example - Forbes Financial Data
(Continued)

Now, we apply hierarchical clustering to the financial
performance data set.

The information on the type of company is not used to
derive the clusters. However, this information will be
used to interpret the results.

Hierarchical clustering
We use the centroid linkage method with Euclidean
distance.
The horizontal axis lists the observation numbers.
The distances are measured between the centers of the
two clusters just joined.
The distances are progressively increasing.

Result from Hierarchical clustering

How to choose the number of clusters in the
dendrogram plot?
As a general rule, large increases in the sequence of
merge distances should be a signal for examining those
particular steps.
A large distance indicates that at least two very
different observations exist, one from each of the two
clusters just combined.
The largest increase in distance occurs when we
combine the last two clusters.

Analysis of the results
Companies 1, 2, and 3 form a single cluster at the step
where there are 22 clusters.

At the opposite end, companies 15, 16, 17, and 18 form a
single cluster at the step in which there are two clusters.

Company 22 stays by itself until there are only four
clusters.

Chemical companies
13 of the chemical companies, all except no. 14 (rci),
are clustered together with only one non-chemical
firm, company 19 (ahs), when the number of clusters
is eight.

The results are impressive when one considers that
these are large diversified companies with varied
emphases, ranging from industrial chemicals to
textiles to oil and gas production.

Hospital management firms
At the level of nine clusters, three of the four hospital management
firms, companies 16, 17, and 18 (hca, nme, and ami), have been
clustered together, and the other, company 15 (hum), is added to that
cluster before it is aggregated with any non-hospital-management
firms.

From the data table, company 15 clearly has a different D/E value
from the others.
The misfit in this health care group is company 19, which is clustered
with the chemical firms.
In fact, company 19 is a large, established supplies and equipment
firm.
Grocery firms
The grocery firms do not cluster tightly.

From the table, they vary substantially on most variables.
Company 22 is highly leveraged (high D/E) and company
21 has low leverage (low D/E) relative to the others.
Important disparities:
Three of them, companies 21, 24, and 25 (win, ka, sa), are three
of the four largest United States grocery supermarket chains.
Two others, companies 20 and 22 (kls, sgl), have a diversified mix
of grocery, drug, department, and other stores.
The remaining firm, company 23 (slc), concentrates on convenience
stores (7-Eleven) and has a substantial franchising operation.
Thus, the six are quite different from each other.
