# Class 6 Cluster Analysis


03/18/2014


# Cluster Analysis

Analysis and Output Interpretation using Hierarchical Cluster Technique & SPSS 6.00


Dr. Rohit Vishal Kumar

Reader, Department of Marketing Xavier Institute of Social Service


Cluster Analysis - Introduction
• Cluster Analysis is a multivariate analysis technique that seeks to organize information about variables so that relatively homogeneous groups, or "clusters," can be formed. The clusters formed with this family of methods should be highly internally homogeneous (members are similar to one another) and highly externally heterogeneous (members are not like members of other clusters).

• Although cluster analysis is relatively simple, and can use a variety of input data, it is a relatively new technique and is not supported by a comprehensive body of statistical literature. So, most of the guidelines for using cluster analysis are rules of thumb and some authors caution that researchers should use cluster analysis


Cluster Analysis - Key Features
• Cluster analysis is not so much a typical statistical test as a "collection" of different algorithms that "put objects into clusters."

• Cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here.


Cluster Analysis - Applications
• Medicine: clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful classification and better diagnosis.

• Psychiatry: the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy.

• Archeology: researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques.

• Marketing: researchers have attempted to use cluster analysis to identify the closeness or difference (real or perceived) between brand images, identify relatively homogeneous marketing segments, identify similarities in ideas of communications, etc.

In general, whenever one needs to classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.


Four Common Distance Measures
Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. It is computed as:

$$\mathrm{distance}(x, y) = \Big\{ \sum_i (x_i - y_i)^2 \Big\}^{1/2}$$

• Note: Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data.

• Advantage: the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers.

• Disadvantage: the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected, and consequently, the results of cluster analyses may be very different.
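The scale effect can be seen with three made-up objects, each described by a length and a rating. This is only an illustrative sketch: the objects, the `to_mm` helper and the variable names are assumptions for the example, not part of the original data.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def to_mm(obj):
    """Convert the length (first dimension) from centimeters to millimeters."""
    length_cm, rating = obj
    return (length_cm * 10, rating)

# Hypothetical objects: (length in cm, rating on some scale)
a, b, c = (10.0, 3.0), (11.0, 7.0), (13.0, 3.0)

# Which of b and c is the nearest neighbour of a?
nearest_cm = min((b, c), key=lambda other: euclidean(a, other))
nearest_mm = min((to_mm(b), to_mm(c)), key=lambda other: euclidean(to_mm(a), other))

print(nearest_cm)  # c: in cm, a-c is 3.0 and a-b is sqrt(17), about 4.12
print(nearest_mm)  # b: in mm, a-b is sqrt(116), about 10.77, and a-c is 30.0
```

Only the unit of one dimension changed, yet the nearest neighbour of `a` flips from `c` to `b`, which is exactly the kind of change that can alter which objects are clustered first.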


Four Common Distance Measures
Squared Euclidean distance. One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as:

$$\mathrm{distance}(x, y) = \sum_i (x_i - y_i)^2$$

City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:

$$\mathrm{distance}(x, y) = \sum_i |x_i - y_i|$$

Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:

$$\mathrm{distance}(x, y) = \max_i |x_i - y_i|$$
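The four measures can be written as short Python functions. This is a minimal sketch for illustration; the function names and the two example vectors are assumptions, not taken from the slides.

```python
import math

def squared_euclidean(x, y):
    """Sum of squared differences across dimensions."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))

def city_block(x, y):
    """Sum of absolute differences across dimensions (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute difference on any single dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical three-dimensional objects
x, y = (1, 2, 3), (4, 0, 3)

print(euclidean(x, y))          # sqrt(9 + 4 + 0) = sqrt(13), about 3.61
print(squared_euclidean(x, y))  # 13
print(city_block(x, y))         # 3 + 2 + 0 = 5
print(chebychev(x, y))          # 3
```

On this pair the single largest difference (3) drives the Chebychev distance, contributes 9 of the 13 units of squared Euclidean distance, but only 3 of the 5 units of city-block distance, which shows how squaring gives more weight to large differences.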


Cluster Analysis

The Example and SPSS Procedure


The Raw Data
| Respondent | V1 | V2 | V3 | V4 | V5 | V6 |
|---|---|---|---|---|---|---|
| 1 | 6 | 4 | 7 | 3 | 2 | 3 |
| 2 | 2 | 3 | 1 | 4 | 5 | 4 |
| 3 | 7 | 2 | 6 | 4 | 1 | 3 |
| 4 | 4 | 6 | 4 | 5 | 3 | 6 |
| 5 | 1 | 3 | 2 | 2 | 6 | 4 |
| 6 | 6 | 4 | 6 | 3 | 3 | 4 |
| 7 | 5 | 3 | 6 | 3 | 3 | 4 |
| 8 | 7 | 3 | 7 | 4 | 1 | 4 |
| 9 | 2 | 4 | 3 | 3 | 6 | 3 |
| 10 | 3 | 5 | 3 | 6 | 4 | 6 |
| 11 | 1 | 3 | 2 | 3 | 5 | 3 |
| 12 | 5 | 4 | 5 | 4 | 2 | 4 |
| 13 | 2 | 2 | 1 | 5 | 4 | 4 |
| 14 | 4 | 6 | 4 | 6 | 4 | 7 |
| 15 | 6 | 5 | 4 | 2 | 1 | 4 |
| 16 | 3 | 5 | 4 | 6 | 4 | 7 |
| 17 | 4 | 4 | 7 | 2 | 2 | 5 |
| 18 | 3 | 7 | 2 | 6 | 4 | 3 |
| 19 | 4 | 6 | 3 | 7 | 2 | 7 |
| 20 | 2 | 3 | 2 | 4 | 7 | 2 |

The above data were collected from 20 respondents. The respondents were asked to rate the following statements on a 7-point scale:

V1 : Shopping is Fun
V2 : Shopping is bad for your budget
V3 : I combine shopping with eating out
V4 : I try to get the best buys while shopping
V5 : I don’t care about shopping
V6 : You can save money by comparing prices

SCALE USED: 1 = Completely Disagree, 4 = Neither Agree Nor Disagree, 7 = Completely Agree

SPSS Screen 1
The data entry screen in SPSS


SPSS Screen 2 : Hierarchical Cluster
Choose Statistics -> Data Reduction -> Hierarchical Cluster. We are shown the Hierarchical Cluster screen as follows:

1. Select all six variables (V1-V6) and transfer them to the variable(s) box
2. Select Cluster “Cases”
3. Select Display “Statistics” and “Plots”
4. Press the “Statistics” button

SPSS Screen 3 : Hierarchical Cluster
On Pressing the “Statistics” Button we are shown the following screen

1. “Agglomeration Schedule” and “Cluster Membership -> None” should be checked by default. If not, select these options
2. Press “Continue”
3. Select “Plots” from “Screen 2”

SPSS Screen 4 : Hierarchical Cluster
On Pressing the “Plots” Button we are shown the following screen

1. Select “Dendrogram”
2. Select “All Icicles”
3. Select Orientation “Vertical”
4. Select “Methods” from “Screen 2”

SPSS Screen 5 : Hierarchical Cluster
On Pressing the “Methods” Button we are shown the following screen

1. Choose in Cluster Method: “Between Group Linkage”
2. Select in Measure “Interval” and select “Squared Euclidean Distances”
3. In “Transform Values”, select “None” in the Standardize dropdown list
4. Select “Continue”
5. In Screen 2 select “OK”
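For readers without SPSS, the same settings (between-groups average linkage on squared Euclidean distances, no standardization) can be sketched in plain Python. This is a naive illustration, not the SPSS implementation: the `DATA` list is transcribed from the raw data slide, the function names are made up for this example, and ties between equally close pairs are broken by lowest case number, so the order of tied stages may differ from SPSS.

```python
from itertools import combinations

# Ratings V1..V6 for respondents 1..20, transcribed from the raw data slide.
DATA = [
    (6, 4, 7, 3, 2, 3), (2, 3, 1, 4, 5, 4), (7, 2, 6, 4, 1, 3), (4, 6, 4, 5, 3, 6),
    (1, 3, 2, 2, 6, 4), (6, 4, 6, 3, 3, 4), (5, 3, 6, 3, 3, 4), (7, 3, 7, 4, 1, 4),
    (2, 4, 3, 3, 6, 3), (3, 5, 3, 6, 4, 6), (1, 3, 2, 3, 5, 3), (5, 4, 5, 4, 2, 4),
    (2, 2, 1, 5, 4, 4), (4, 6, 4, 6, 4, 7), (6, 5, 4, 2, 1, 4), (3, 5, 4, 6, 4, 7),
    (4, 4, 7, 2, 2, 5), (3, 7, 2, 6, 4, 3), (4, 6, 3, 7, 2, 7), (2, 3, 2, 4, 7, 2),
]

def sq_euclidean(a, b):
    """Squared Euclidean distance between two rating vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def average_linkage(rows):
    """Return one (cluster_1, cluster_2, coefficient) tuple per merge stage."""
    # Each cluster is keyed by its lowest case number (1-based) and
    # holds the 0-based row indices of its members.
    clusters = {i + 1: [i] for i in range(len(rows))}
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Between-groups average: mean distance over all cross-cluster pairs.
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            d = sum(sq_euclidean(rows[i], rows[j]) for i, j in pairs) / len(pairs)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters[a].extend(clusters.pop(b))  # merged cluster keeps the lower label
        schedule.append((a, b, d))
    return schedule

schedule = average_linkage(DATA)
for stage, (a, b, d) in enumerate(schedule, start=1):
    print(f"{stage:>2}  {a:>2}  {b:>2}  {d:10.6f}")
```

Each printed row corresponds to one stage of an agglomeration schedule: the two clusters combined and the average squared Euclidean distance at which they were joined.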


Cluster Analysis

The SPSS Output


SPSS Output 1 : Hierarchical Cluster
The following output “Proximities” is displayed by SPSS
Data Information: 20 unweighted cases accepted. 0 cases rejected because of missing value. Squared Euclidean measure used.

PROXIMITIES

Agglomeration Schedule using Average Linkage (Between Groups)

| Stage | Clusters Combined: Cluster 1 | Clusters Combined: Cluster 2 | Coefficient | Stage Cluster 1st Appears: Cluster 1 | Stage Cluster 1st Appears: Cluster 2 | Next Stage |
|---|---|---|---|---|---|---|
| 1 | 14 | 16 | 2.000000 | 0 | 0 | 3 |
| 2 | 6 | 7 | 2.000000 | 0 | 0 | 7 |
| 3 | 10 | 14 | 3.000000 | 0 | 1 | 8 |
| 4 | 2 | 13 | 3.000000 | 0 | 0 | 14 |
| 5 | 5 | 11 | 3.000000 | 0 | 0 | 9 |
| 6 | 3 | 8 | 3.000000 | 0 | 0 | 15 |
| 7 | 6 | 12 | 4.000000 | 2 | 0 | 10 |
| 8 | 4 | 10 | 4.333333 | 0 | 3 | 11 |
| 9 | 5 | 9 | 4.500000 | 5 | 0 | 12 |
| 10 | 1 | 6 | 5.000000 | 0 | 7 | 13 |
| 11 | 4 | 19 | 7.250000 | 8 | 0 | 17 |
| 12 | 5 | 20 | 7.333333 | 9 | 0 | 14 |
| 13 | 1 | 17 | 8.250000 | 10 | 0 | 15 |
| 14 | 2 | 5 | 10.750000 | 4 | 12 | 18 |
| 15 | 1 | 3 | 11.300000 | 13 | 6 | 16 |
| 16 | 1 | 15 | 14.000000 | 15 | 0 | 19 |
| 17 | 4 | 18 | 20.200001 | 11 | 0 | 18 |
| 18 | 2 | 4 | 38.611111 | 14 | 17 | 19 |
| 19 | 1 | 2 | 48.291668 | 16 | 18 | 0 |

SPSS Output 1 : Hierarchical Cluster
The Analysis : Proximities

• The "average linkage (between groups)" clustering method was used.

• There were a total of 20 data points. In the first stage, two data points (14 and 16) were combined. This information is provided in the "Cluster 1" and "Cluster 2" columns under "Clusters Combined".

• The squared Euclidean distance between data points 14 and 16 is equal to 2.00. This is shown in the "Coefficient" column.

• The columns entitled "Stage Cluster 1st Appears" indicate the stage at which each of the two clusters being combined was first formed. An entry of 0 means that the item is still a single case that has not yet been part of any cluster, so the entries 0 and 0 at stage 1 show that two individual cases were joined. At stage 3 the entries are 0 and 1: case 10 is joined to the cluster formed at stage 1 (cases 14 and 16).

• The "Next Stage" column gives the stage at which the cluster formed at the current stage is next combined with another case or cluster. For stage 1 the entry is 3, and at stage 3 we find that data point 10 is combined with the cluster containing data point 14.
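The coefficients for stages 1 and 3 can be checked by hand from the raw ratings of respondents 10, 14 and 16. The short sketch below does that arithmetic; the variable and function names are illustrative only.

```python
# Ratings V1..V6 of three respondents, taken from the raw data
case_10 = (3, 5, 3, 6, 4, 6)
case_14 = (4, 6, 4, 6, 4, 7)
case_16 = (3, 5, 4, 6, 4, 7)

def sq_euclidean(a, b):
    """Squared Euclidean distance between two rating vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Stage 1: cases 14 and 16 differ by 1 on V1 and on V2 only.
stage_1 = sq_euclidean(case_14, case_16)

# Stage 3: case 10 joins the cluster {14, 16}; between-groups average linkage
# takes the mean of the distances from case 10 to each member.
stage_3 = (sq_euclidean(case_10, case_14) + sq_euclidean(case_10, case_16)) / 2

print(stage_1)  # 2
print(stage_3)  # (4 + 2) / 2 = 3.0
```

Both values agree with the coefficients 2.000000 and 3.000000 reported for stages 1 and 3.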


SPSS Output 2 : Hierarchical Cluster
The following output “Icicle Plot” is displayed by SPSS
Vertical Icicle Plot using Average Linkage (Between Groups)

(Down) Number of Clusters; (Across) Case Label and number

Case order across the plot, left to right: 18, 19, 16, 14, 10, 4, 20, 9, 11, 5, 13, 2, 15, 8, 3, 17, 12, 7, 6, 1

Each run of X's is one cluster; a run of 3n - 2 X's spans n adjacent cases.

 1 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 2 +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX
 3 +XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX
 4 +X XXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX
 5 +X XXXXXXXXXXXXX XXXXXXXXXXXXXXXX X XXXXXXXXXXXXXXXXXXX
 6 +X XXXXXXXXXXXXX XXXXXXXXXXXXXXXX X XXXX XXXXXXXXXXXXX
 7 +X XXXXXXXXXXXXX XXXXXXXXXX XXXX X XXXX XXXXXXXXXXXXX
 8 +X XXXXXXXXXXXXX XXXXXXXXXX XXXX X XXXX X XXXXXXXXXX
 9 +X XXXXXXXXXXXXX X XXXXXXX XXXX X XXXX X XXXXXXXXXX
10 +X X XXXXXXXXXX X XXXXXXX XXXX X XXXX X XXXXXXXXXX
11 +X X XXXXXXXXXX X XXXXXXX XXXX X XXXX X XXXXXXX X
12 +X X XXXXXXXXXX X X XXXX XXXX X XXXX X XXXXXXX X
13 +X X XXXXXXX X X X XXXX XXXX X XXXX X XXXXXXX X
14 +X X XXXXXXX X X X XXXX XXXX X XXXX X X XXXX X
15 +X X XXXXXXX X X X XXXX XXXX X X X X X XXXX X
16 +X X XXXXXXX X X X X X XXXX X X X X X XXXX X
17 +X X XXXXXXX X X X X X X X X X X X X XXXX X
18 +X X XXXX X X X X X X X X X X X X X XXXX X
19 +X X XXXX X X X X X X X X X X X X X X X X


SPSS Output 2 : Hierarchical Cluster
The Analysis : Icicle Plot

• The icicle plot shows the cluster combination. It is read from bottom to top.

• Initially, each of the 20 cases is its own cluster. In the row labelled 19, the first combination has been made and 19 clusters remain.

• The icicle plot represents the whole process of cluster formation in pictorial form. For example, if we take the row labelled 7 we see that there are 7 clusters, denoted by a series of X's: X XXXXXXXXXXXXX XXXXXXXXXX XXXX X XXXX XXXXXXXXXXXXX

• Each subsequent step leads to the formation of a new cluster in one of the following three (3) ways:
– Two individual cases are grouped together
– A case is joined to an already existing cluster
– Two clusters are grouped together
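The membership behind any row of the icicle plot can be recovered by replaying the merges listed in the agglomeration schedule. The sketch below is one way to do this; `MERGES` is copied from the "Clusters Combined" columns of the schedule, and the `clusters_at` helper is a hypothetical name introduced for the example.

```python
# (Cluster 1, Cluster 2) for stages 1..19, copied from the agglomeration schedule.
MERGES = [
    (14, 16), (6, 7), (10, 14), (2, 13), (5, 11), (3, 8), (6, 12),
    (4, 10), (5, 9), (1, 6), (4, 19), (5, 20), (1, 17), (2, 5),
    (1, 3), (1, 15), (4, 18), (2, 4), (1, 2),
]

def clusters_at(n_clusters, merges=MERGES, n_cases=20):
    """Replay merges until n_clusters remain; return sorted member lists."""
    groups = {case: [case] for case in range(1, n_cases + 1)}
    # After k merges there are n_cases - k clusters.
    for a, b in merges[: n_cases - n_clusters]:
        groups[a].extend(groups.pop(b))  # merged cluster keeps the Cluster 1 label
    return sorted(sorted(members) for members in groups.values())

for members in clusters_at(7):
    print(members)
```

For 7 clusters this lists the groups {1, 6, 7, 12, 17}, {2, 13}, {3, 8}, {4, 10, 14, 16, 19}, {5, 9, 11, 20}, {15} and {18}, which are the seven runs of X's in the row labelled 7. Calling `clusters_at(3)` gives clusters of 8, 6 and 6 respondents.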


SPSS Output 3 : Hierarchical Cluster
The following output “Dendrogram” is displayed by SPSS

Dendrogram using Average Linkage (Between Groups)

Rescaled Distance Cluster Combine (horizontal scale from 0 to 25)

Case order, top to bottom: 14, 16, 10, 4, 19, 18, 2, 13, 5, 11, 9, 20, 3, 8, 6, 7, 12, 1, 17, 15

[Dendrogram figure]


SPSS Output 3 : Hierarchical Cluster
The Analysis : Dendrogram

• The dendrogram is a graphical output which is useful in identifying the clusters. It is read from left to right.

• Vertical lines represent the clusters that are joined together. The position of the vertical line on the scale indicates the distance at which the clusters were joined. Because many of the distances in the early stages are of similar magnitude, it is difficult to tell the sequence in which some of the early clusters were formed. However, it is clear that in the last two stages, the distances at which the clusters are combined are large. This information is useful in deciding the number of clusters to retain.


Cluster Analysis

Exercises and Final Notes


Practice Example
• The following data was collected for US basketball champions:
– Height : Height in Inches
– Weight : Weight in Pounds
– FGPct : Field Goal Percentage
– Points : Average Points per game
– Rebound : Average rebounds per game

| Champion | Height | Weight | FGPct | Points | Rebound |
|---|---|---|---|---|---|
| Jabbar K.A. | 86 | 230 | 55.9 | 24.6 | 11.2 |
| Barry R | 79 | 205 | 44.9 | 23.2 | 6.7 |
| Baylor E | 77 | 225 | 43.1 | 27.4 | 13.5 |
| Bird L | 81 | 220 | 50.3 | 25.0 | 10.2 |
| Chamberlain W | 85 | 275 | 54.0 | 30.1 | 22.9 |
| Cousy B | 73 | 175 | 37.5 | 18.4 | 5.2 |
| Erving J | 79 | 200 | 50.6 | 24.2 | 8.5 |
| Johnson M | 81 | 215 | 53.0 | 19.5 | 7.4 |
| Jordan M | 78 | 195 | 51.3 | 32.6 | 6.2 |
| Robertson O | 77 | 210 | 48.5 | 25.7 | 7.5 |
| Russell B | 82 | 220 | 44.0 | 15.1 | 22.6 |
| West J | 75 | 180 | 47.4 | 27.0 | 5.8 |

• Conduct a Hierarchical Cluster Analysis using
a) Height, Weight, FGPct, Points and Rebound
b) Height, FGPct, Points and Rebound
c) FGPct, Points and Rebound

Analyse the dendrograms to identify how the clusters have changed between (a), (b) and (c).


Warning
• We have only shown the output of a hierarchical Cluster Analysis.

• Similar interpretations may or may not be applicable to non-hierarchical Cluster Analysis.

• The analysis software used was SPSS® 6.0. The output may vary with the type of analysis tool selected.

• Cluster Analysis should be run more than once using different distance measures, and the results compared, before a final interpretation is attempted.
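As a small first step toward such a comparison, the sketch below finds the closest pair(s) of respondents in the shopping data under three of the distance measures. It is an illustrative sketch only, not a full re-run of the cluster analysis: `DATA` is transcribed from the raw data slide, and `MEASURES` and `closest_pairs` are names made up for this example.

```python
from itertools import combinations

# Ratings V1..V6 for respondents 1..20, transcribed from the raw data slide.
DATA = [
    (6, 4, 7, 3, 2, 3), (2, 3, 1, 4, 5, 4), (7, 2, 6, 4, 1, 3), (4, 6, 4, 5, 3, 6),
    (1, 3, 2, 2, 6, 4), (6, 4, 6, 3, 3, 4), (5, 3, 6, 3, 3, 4), (7, 3, 7, 4, 1, 4),
    (2, 4, 3, 3, 6, 3), (3, 5, 3, 6, 4, 6), (1, 3, 2, 3, 5, 3), (5, 4, 5, 4, 2, 4),
    (2, 2, 1, 5, 4, 4), (4, 6, 4, 6, 4, 7), (6, 5, 4, 2, 1, 4), (3, 5, 4, 6, 4, 7),
    (4, 4, 7, 2, 2, 5), (3, 7, 2, 6, 4, 3), (4, 6, 3, 7, 2, 7), (2, 3, 2, 4, 7, 2),
]

MEASURES = {
    "squared Euclidean": lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)),
    "city-block": lambda a, b: sum(abs(p - q) for p, q in zip(a, b)),
    "Chebychev": lambda a, b: max(abs(p - q) for p, q in zip(a, b)),
}

def closest_pairs(measure):
    """Return the smallest distance and every respondent pair (1-based) at that distance."""
    dists = {
        (i + 1, j + 1): measure(DATA[i], DATA[j])
        for i, j in combinations(range(len(DATA)), 2)
    }
    smallest = min(dists.values())
    return smallest, sorted(pair for pair, d in dists.items() if d == smallest)

for name, measure in MEASURES.items():
    smallest, pairs = closest_pairs(measure)
    print(f"{name}: minimum distance {smallest}, pairs {pairs}")
```

If a measure produces many tied closest pairs, the early merge order becomes less clear-cut under that measure, which is one reason to compare the resulting clusters before interpreting them.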


Thank You