Part 2. Exercise (10 points)


A study was conducted to examine the consequences of sleeping and waking problems. 95 individuals
were randomly selected. They rated twelve characteristics on a scale from 1 to 10 (1 being the
lowest level, 10 being the highest level):
1. Anxiety (excluded / PCA)
2. Depression (excluded / PCA)
3. Lethargy (PC2)
4. Tiredness (PC2)
5. Drowsiness (PC2)
6. Energy (PC2)
7. Mood (PC1)
8. Concentration (excluded / PCA)
9. Memory (excluded / PCA)
10. Life satisfaction (PC1)
11. Overall well-being (PC1)
12. Relationship with the others (family, friends, colleagues) (PC1)
After conducting a principal component analysis (we found 2 principal components: PC1 = impact; PC2
= bad consequences), we aim to run a hierarchical classification.
The HCA results are shown below:

Table 1. Descriptive Statistics


Variable N Minimum Maximum Mean Std. Deviation Skewness Kurtosis
Mood 95 1 10 5,72 2,407 -,253 -,831
Life satisfaction 95 1 10 5,34 2,381 -,159 -,829
Overall well-being 95 1 10 5,66 2,338 -,201 -,666
Relationships 95 1 10 4,82 2,627 ,214 -1,100
Lethargy 95 1 10 4,72 2,354 ,121 -1,229
Tiredness 95 1 10 5,30 2,347 -,128 -1,171
Drowsiness 95 1 10 5,57 2,220 -,157 -1,017
Energy 95 1 10 5,53 2,353 -,159 -1,166
Valid N (listwise) 95

Table 2. Agglomeration Schedule


         Cluster Combined                        Stage Cluster First Appears
Stage    Cluster 1    Cluster 2    Coefficients  Cluster 1    Cluster 2      Next Stage
1        72           97           ,000          0            0              21
…        …            …            …             …            …              …
89       3            7            45,866        79           88             93
90       1            10           56,598        80           86             91
91       1            11           73,934        90           85             93
92       2            6            93,099        87           83             94
93       1            3            135,647       91           89             94
94       1            2            193,897       93           92             0
Graph 1. Dendrogram (hierarchical tree)

1) Using table 2 and the dendrogram (graph 1): how many clusters do you feel are optimal?
Sketch a line on the dendrogram to illustrate the number of clusters you have chosen.
Maximum of 10 lines (2 points)

We have 95 participants. In the agglomeration schedule, the first large jump in the coefficients
appears after stage 92 (from 93,099 to 135,647), so the merging should stop at stage 92, which
leaves 95 – 92 = 3 clusters (stopping at stage 91 would leave 95 – 91 = 4 clusters). On the
dendrogram, cutting the tree at mid-height gives the same solution: the line we sketch crosses
3 branches. Therefore, we retain 3 clusters.
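
The reading of the agglomeration schedule can be checked numerically. The sketch below takes the coefficients printed in Table 2 (stages 89 to 94) and computes the increase between consecutive stages, in absolute and in relative terms. Choosing the largest relative increase is only one possible heuristic and is an assumption of this sketch, not a rule stated in the exercise; the variable names are illustrative.

```python
# Coefficients copied from Table 2 (stages 89 to 94), written with decimal points.
N_INDIVIDUALS = 95
coefficients = {
    89: 45.866,
    90: 56.598,
    91: 73.934,
    92: 93.099,
    93: 135.647,
    94: 193.897,
}

stages = sorted(coefficients)
jumps = {}  # stage s -> (absolute, relative) increase caused by the merge at stage s + 1
for stage, next_stage in zip(stages, stages[1:]):
    absolute = coefficients[next_stage] - coefficients[stage]
    relative = absolute / coefficients[stage]
    jumps[stage] = (absolute, relative)
    # After stage s, N_INDIVIDUALS - s clusters remain.
    print(f"stop after stage {stage}: {N_INDIVIDUALS - stage} clusters, "
          f"next merge adds {absolute:.3f} (+{relative:.1%})")

# Assumed heuristic: stop just before the merge with the largest relative increase.
best_stage = max(jumps, key=lambda s: jumps[s][1])
n_clusters = N_INDIVIDUALS - best_stage
print(f"suggested stop: stage {best_stage}, {n_clusters} clusters")
```

With these figures, the relative increase is largest for the merge that follows stage 92, which agrees with the 3-cluster choice. The largest absolute increase belongs to the final merge (stage 94), which would point to 2 clusters; this is why the hesitation between 2 and 3 clusters in the next question is reasonable.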

2) You still hesitate between 2 and 3 clusters. So you decide to conduct an ANOVA test
(analysis of variance), with post hoc tests to determine the best number of clusters. Using the
tables below, which solution do you prefer? Explain why. Maximum of 10 lines (2 points)

Solution with 3 clusters

Table 3. ANOVA (3 clusters)


Component Source Sum of Squares df Mean Square F Sig.
Impact Between Groups 57,594 2 28,797 61,522 ,000
Impact Within Groups 43,063 92 ,468
Impact Total 100,657 94
Bad consequences Between Groups 43,204 2 21,602 39,719 ,000
Bad consequences Within Groups 50,036 92 ,544
Bad consequences Total 93,240 94

Table 4. Impact (Post hoc with 3 clusters)

Tukey HSD
Ward Method   N    Subset for alpha = 0.05
                   1         2
3             23   -,5798
1             46   -,3693
2             26             1,2966
Sig.               ,474      1,000

Table 5. Bad consequences (Post hoc with 3 clusters)

Tukey HSD
Ward Method   N    Subset for alpha = 0.05
                   1         2         3
3             23   -1,0601
2             26             -,2245
1             46                       ,5923
Sig.               1,000     1,000     1,000

Solution with 2 clusters


Table 6. ANOVA (with 2 clusters) (Nota Bene: same result with a Student T test)
Component Source Sum of Squares df Mean Square F Sig.
Impact Between Groups 56,914 1 56,914 121,003 ,000
Impact Within Groups 43,743 93 ,470
Impact Total 100,657 94
Bad consequences Between Groups 1,336 1 1,336 1,352 ,248
Bad consequences Within Groups 91,904 93 ,988
Bad consequences Total 93,240 94

With the 2-cluster solution, the 2nd component is not significant (p-value = ,248 > 5%, so H0
cannot be rejected: the clusters do not differ on this component, which therefore does not help to
separate them). Only the 1st component discriminates between the clusters. With the 3-cluster
solution, both components are useful (H0 is rejected in favor of H1 for each of them): the F-test
is significant, which shows that at least two clusters behave differently on each component. It is
therefore better to interpret the solution with 3 clusters.
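
The comparison can be retraced from the sums of squares in Tables 3 and 6. The sketch below recomputes each F statistic as the between-groups mean square divided by the within-groups mean square. It also reports the share of the total sum of squares that lies between the clusters; that share is an addition of this sketch, not a figure printed in the tables, and the function and variable names are illustrative.

```python
def f_statistic(ss_between, df_between, ss_within, df_within):
    """One-way ANOVA F: between-groups mean square over within-groups mean square."""
    return (ss_between / df_between) / (ss_within / df_within)

# (sum of squares between, df between, sum of squares within, df within),
# copied from Table 3 (3 clusters) and Table 6 (2 clusters), with decimal points.
anova_rows = {
    ("3 clusters", "Impact"): (57.594, 2, 43.063, 92),
    ("3 clusters", "Bad consequences"): (43.204, 2, 50.036, 92),
    ("2 clusters", "Impact"): (56.914, 1, 43.743, 93),
    ("2 clusters", "Bad consequences"): (1.336, 1, 91.904, 93),
}

results = {}
for key, (ssb, dfb, ssw, dfw) in anova_rows.items():
    f_value = f_statistic(ssb, dfb, ssw, dfw)
    between_share = ssb / (ssb + ssw)  # part of the total variation lying between clusters
    results[key] = (f_value, between_share)
    solution, component = key
    print(f"{solution:10s} {component:17s} F = {f_value:8.3f}  "
          f"between share = {between_share:.1%}")
```

Under the 2-cluster solution, only about 1.4% of the variation in "bad consequences" lies between the clusters, against about 46% under the 3-cluster solution, which is the numerical counterpart of the argument above.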

For the post hoc tests: they show where the differences between the clusters lie.

• For the component “impact”: group 3 and group 1 are similar (they fall in the same
homogeneous subset) and their means are below the average. Group 2 is significantly different
from the 2 other groups, and its mean is above the average.
• For the component “bad consequences”: all the groups are significantly different from
each other. Group 3 and group 2 are below the average, but group 1 is above it. (not
mandatory)
3) Why don’t we need to run the post hoc tests for the solution with 2 clusters? Maximum of
4 lines (1 point)
When the F-test is significant, it means that at least two groups behave differently. With only
2 groups, there is a single possible comparison, so a significant test already tells us that these
2 groups differ (and a non-significant test that they do not). Post hoc tests would add nothing.
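
The note under Table 6 (same result with a Student t test) can be illustrated directly: with two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic. The sketch below checks this on two small groups of made-up values; the data are purely hypothetical and are not taken from the study.

```python
from statistics import mean

def pooled_t(a, b):
    """Student t statistic for two independent samples, assuming equal variances."""
    mean_a, mean_b = mean(a), mean(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_variance = (ss_a + ss_b) / (len(a) + len(b) - 2)
    standard_error = (pooled_variance * (1 / len(a) + 1 / len(b))) ** 0.5
    return (mean_a - mean_b) / standard_error

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    values = [x for group in groups for x in group]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical factor scores for two clusters (illustration only).
group_a = [0.9, 1.4, 0.3, 1.1, 0.7]
group_b = [-0.6, -0.2, -1.1, 0.1, -0.8, -0.4]

t_value = pooled_t(group_a, group_b)
f_value = one_way_f([group_a, group_b])
print(f"t = {t_value:.4f}, t squared = {t_value ** 2:.4f}, F = {f_value:.4f}")
```

Because the two statistics carry the same information, the single pairwise comparison is already contained in the ANOVA table.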

4) Using graph 2: interpret the graph considering the solution with 3 clusters. Maximum of
5 lines per cluster (3 points)

Graph 2. Graphical representation of the clusters

Cluster 1 (stars) (+ see question 2)


Interpretation:
High level on bad consequences and low level on impact

Name of the cluster: Subjective interpretation, so no formal answer

Cluster 2 (diamonds) (+ see question 2)


Interpretation:
This cluster is widely dispersed. Its members behave differently on the component “bad
consequences” but share the same behavior on the component “impact” (high level).

Name of the cluster: Subjective interpretation, so no formal answer

Cluster 3 (rectangles) (+ see question 2)


Interpretation:
Low level on bad consequences and low level on impact

Name of the cluster: Subjective interpretation, so no formal answer

5) What is the use of a supplementary variable in a hierarchical classification analysis? Then,
using graph 3 (below): interpret the graph, with the supplementary variable “gender”.
Maximum of 8 lines (2 points)
A supplementary variable does not take part in building the clusters. It is used afterwards to
describe the observations in each cluster in more detail. For example: are women more concentrated
in a particular cluster than men?

On the graph, men and women are mixed together in every cluster, and there is no particular trend.
So we cannot say that either women or men are over-represented in a cluster.
Graph 3. Using the supplementary variable “gender”
Part II. Interpreting the results (10 points)

A study was conducted with a random sample of 100 households to describe their budget allocation
(per year). Eight expenditure items were taken into account, namely:

Expenditure items              Description
Food at home                   Bread, meat and fish, dairy products, fruits and vegetables, beverages
Clothing                       Clothes, shoes, other accessories
Food away from home            Restaurants, canteens, coffee bars, fast food, other related expenses
Housing                        Rent, charges, taxes, utilities (water, electricity, gas)
Health care                    Products and medical devices, health care-related services
Leisure activities             Games, books, travels, sport activities
Transportation (= Transport)   Vehicles, transport services
Savings                        Savings accounts, financial products, life insurance

N Minimum Maximum Mean Standard Deviation


Food at home 100 1866,33 6501,17 3915,87 919,02
Food away from home 100 196,62 2979,55 1398,48 670,58
Clothing 100 164,74 3689,23 2198,31 745,80
Housing 100 1810,60 21919,27 11176,58 4530,43
Health care 100 262,37 1835,70 973,99 340,50
Transportation 100 176,98 9545,76 4774,57 1966,34
Leisure activities 100 76,95 5237,68 2436,76 1187,31
Savings 100 17,16 3444,84 1374,13 733,63

By carrying out a principal component analysis, we found two principal components.


The first principal component describes the short-term and daily expenditures (health care, food
away from home, clothing, food at home, and leisure activities), while the second principal
component describes the long-term and monthly expenditures (transportation, savings, and housing).
We aim to perform a hierarchical classification.
1. Using the dendrogram and the table below, what is the optimal number of clusters?
Explain why. (maximum of 10 lines, 3 points)

Agglomeration schedule
         Cluster Combined
Stage    Cluster 1    Cluster 2    Coefficients
90       50           95           24,749
91       2            7            27,926
92       46           81           31,422
93       46           57           35,644
94       1            17           43,200
95       46           90           52,436
96       46           50           71,738
97       1            48           92,244
98       1            2            137,878
99       1            46           198,000

According to the agglomeration schedule, there is a large jump in the coefficients column
after stage 97 (from 92,244 to 137,878). 100 – 97 = 3 => 3 clusters

Besides, if we cut the dendrogram at mid-height, we also obtain 3 clusters (ideal for
maximizing intergroup variance and minimizing intragroup variance).
This is in line with the agglomeration schedule.

2. Using figure 1 (output for ANOVA): what do you deduce from this output?
(maximum of 10 lines, 3 points)
Figure 1. Output for ANOVA

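ANOVA
Component Source Sum of Squares df Mean Square F Sig.
REGR factor score 1 (Short-term expenditures) Between Groups 52,526 2 26,263 54,815 ,000
REGR factor score 1 (Short-term expenditures) Within Groups 46,474 97 ,479
REGR factor score 1 (Short-term expenditures) Total 99,000 99
REGR factor score 2 (Long-term expenditures) Between Groups 53,231 2 26,615 56,406 ,000
REGR factor score 2 (Long-term expenditures) Within Groups 45,769 97 ,472
REGR factor score 2 (Long-term expenditures) Total 99,000 99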

H0: there is no difference between groups (so the component is useless for the classification
analysis: the groups of individuals behave in the same way on the component). H1: there is at
least one significant difference between groups (so the component is useful for the
classification analysis).

We use the “critical probability” method: to reject H0 in favour of H1, the probability
(p-value or α’) must be less than alpha (α).

With a risk level α = 5%, both p-values (Sig.) are below 5%, which means that we can reject H0
in favour of H1 for each component. In this sense, thanks to the classification based on the
previous PCA, we can distinguish groups of individuals with different behaviours on the two
principal components.
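
The ANOVA output can also be tied back to the agglomeration schedule. The figures are consistent with the Coefficients column being the cumulative within-cluster sum of squares of Ward's method; this reading is an assumption of the sketch below, supported by the numbers but not stated in the exercise. Under that reading, the coefficient at stage 97 (3 clusters left) should match the two within-groups sums of squares added together, and the final coefficient should match the two totals.

```python
# Coefficients copied from the agglomeration schedule (decimal points instead of commas).
ward_coefficient = {97: 92.244, 99: 198.000}

# Sums of squares copied from Figure 1 (output for ANOVA).
anova = {
    "Short-term expenditures": {"between": 52.526, "within": 46.474, "total": 99.000},
    "Long-term expenditures": {"between": 53.231, "within": 45.769, "total": 99.000},
}

within_sum = sum(row["within"] for row in anova.values())
total_sum = sum(row["total"] for row in anova.values())

print(f"within-groups SS, both components: {within_sum:.3f} "
      f"(stage 97 coefficient: {ward_coefficient[97]:.3f})")
print(f"total SS, both components:         {total_sum:.3f} "
      f"(stage 99 coefficient: {ward_coefficient[99]:.3f})")
```

The two pairs agree up to rounding, so the jump after stage 97 can be read as the within-cluster variation that a fourth merge would add.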

3. Finally, we have chosen three clusters. Using figure 2 and the tables below: describe
the three clusters (maximum of 4 lines per cluster, 3 points)
Figure 2. Graph
REGR factor score 1 / Short-term expenditures
Tukey
Ward Method   N    Subset for alpha = 0.05
                   1             2             3
2             21   -1,0745828
1             37                 -,2744185
3             42                               ,7790411
Sig.               1,000         1,000         1,000

REGR factor score 2 / Long-term expenditures
Tukey
Ward Method   N    Subset for alpha = 0.05
                   1             2
1             37   -,9444119
3             42                 ,4725967
2             21                 ,7187705
Sig.               1,000         ,346
Cluster 1: Low spenders

This cluster behaves significantly differently from the other two on both components.
This group does not make major expenditures either in the long term (below the overall average)
or in the short term (slightly below the overall average).

Cluster 2: Investors

This group behaves significantly differently from the two other groups on the principal
component related to short-term expenditures. Nevertheless, it is similar to cluster 3 on the
long-term expenditures component. In this cluster, individuals spend a lot in the long term
(highest average of the three clusters) but little in the short term (lowest average).

The expenses tend to focus on the “long-term”, and less on the “short-term”.

Cluster 3: Big spenders

This group behaves significantly differently from the two other clusters on the principal
component related to short-term expenditures. It is similar to cluster 2 on the principal
component related to long-term expenditures. In this cluster, individuals spend a lot both in
the short term (highest average of the three clusters) and in the long term (average above the
general trend).

The expenses tend to focus both on the “long-term” and on the “short-term”.
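
The cluster profiles above can be retraced from the cluster sizes and means in the two Tukey tables. The sketch below computes the size-weighted mean of each component (close to zero, since factor scores are centred) and rebuilds the between-groups sum of squares of the ANOVA table from the cluster means. The dictionary layout and the "above"/"below" labels are illustrative choices of this sketch.

```python
# Cluster sizes and mean factor scores copied from the two Tukey tables
# (decimal points instead of commas).
clusters = {
    1: {"n": 37, "short_term": -0.2744185, "long_term": -0.9444119},
    2: {"n": 21, "short_term": -1.0745828, "long_term": 0.7187705},
    3: {"n": 42, "short_term": 0.7790411, "long_term": 0.4725967},
}
components = ("short_term", "long_term")
total_n = sum(c["n"] for c in clusters.values())

summary = {}
for component in components:
    # Size-weighted mean of the cluster means = overall mean of the component.
    grand_mean = sum(c["n"] * c[component] for c in clusters.values()) / total_n
    # Between-groups sum of squares: n_k * (cluster mean - overall mean)^2, summed over clusters.
    ss_between = sum(c["n"] * (c[component] - grand_mean) ** 2 for c in clusters.values())
    summary[component] = (grand_mean, ss_between)
    print(f"{component}: overall mean = {grand_mean:.4f}, between-groups SS = {ss_between:.3f}")

profiles = {}
for label, c in clusters.items():
    profiles[label] = {
        comp: "above" if c[comp] > summary[comp][0] else "below" for comp in components
    }
    print(f"cluster {label} (n = {c['n']}): {profiles[label]}")
```

The rebuilt between-groups sums of squares agree with Figure 1 (52,526 and 53,231) up to rounding, and the above/below pattern matches the three names given to the clusters.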

4. How could we improve this study? Suggest one line of inquiry. (maximum of 5
lines, 1 point)

• You can use qualitative variables, such as the job of the household head, to identify some
tendencies (Do managers invest more than others? etc.).

• Clusters 1 and 3 are very scattered. It could be interesting to perform a new classification
using 4 clusters or more in order to obtain more concentrated groups, that is to say, groups
with less within-group variation.
