Answer Part 222
1) Using table 2 and the dendrogram (graph 1): how many clusters do you feel are optimal?
Sketch a line on the dendrogram to illustrate the number of clusters you have chosen.
Maximum of 10 lines (2 points)
We have 95 participants. According to the agglomeration schedule, the first large jump between
the coefficients appears at stage 91 or 92. The number of clusters is therefore 95 – 92 = 3 clusters
(or 95 – 91 = 4 clusters). On the dendrogram, we can cut the tree in the middle to obtain the
solution. When we sketch this line, it crosses 3 branches. Therefore, we can study 3 clusters.
2) You still hesitate between 2 and 3 clusters. So you decide to conduct an ANOVA test
(analysis of variance), with post hoc tests to determine the best number of clusters. Using the
tables below, which solution do you prefer? Explain why. Maximum of 10 lines (2 points)
Table 4. Impact (Post hoc with 3 clusters)
Tukey HSD
Ward Method   N    Subset for alpha = 0.05
                   1          2
3             23   -,5798
1             46   -,3693
2             26              1,2966
Sig.               ,474       1,000

Post hoc with 3 clusters for the second component (“bad consequences”)
Ward Method   N    Subset for alpha = 0.05
                   1          2          3
3             23   -1,0601
2             26              -,2245
1             46                         ,5923
Sig.               1,000      1,000      1,000
With the 2-cluster solution, the 2nd component is not significant (p-value > 5%, so H0 cannot be
rejected: the component does not discriminate the clusters, which overlap on it), so only the
1st component is useful to discriminate the clusters. With the 3-cluster solution, both
components are useful (H0 is rejected in favor of H1: the component is useful). Indeed, the F-test is
significant and shows that at least 2 clusters behave differently on each component. It is
therefore better to interpret the solution with 3 clusters.
For the post hoc tests: we can clearly see the differences between the clusters.
For the component “impact”: group 3 and group 1 are similar (they fall in the same subset) and
their means are below the overall average. Group 2 is significantly different from the 2 other
groups, and its mean is above the average.
For the component “bad consequences”: all the groups are significantly different from
each other. Group 3 and group 2 are below the average, but group 1 is above it. (not
mandatory)
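The reading of the homogeneous subsets can be sketched in a few lines of Python. This is only an illustration: the subsets are typed in by hand from the post hoc tables, and the helper name `differing_pairs` is hypothetical. Two clusters are treated as significantly different when they never appear in the same subset.

```python
from itertools import combinations

# Homogeneous subsets read by hand from the Tukey HSD output (cluster labels only).
subsets = {
    "impact": [{3, 1}, {2}],
    "bad consequences": [{3}, {2}, {1}],
}

def differing_pairs(homogeneous_subsets):
    """Return the cluster pairs that never share a homogeneous subset."""
    clusters = sorted(set().union(*homogeneous_subsets))
    return [
        (a, b)
        for a, b in combinations(clusters, 2)
        if not any(a in s and b in s for s in homogeneous_subsets)
    ]

for component, groups in subsets.items():
    print(component, differing_pairs(groups))
```

For “impact”, the pair (1, 3) is absent from the result because both clusters sit in the same subset.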
3) Why don’t we need to run the post hoc tests for the solution with 2 clusters? Maximum of
4 lines (1 point)
When the test is significant, it means that at least 2 groups behave differently.
With only 2 groups, there is a single pair to compare, so a significant test already tells us
that these 2 groups differ. That is why post hoc tests are unnecessary.
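A small Python sketch makes the counting argument concrete. The function name `pairwise_comparisons` is hypothetical; it simply lists every pair of groups that a post hoc test would have to compare.

```python
from itertools import combinations

def pairwise_comparisons(n_groups):
    """List every pair of group labels (1..n_groups) a post hoc test would compare."""
    return list(combinations(range(1, n_groups + 1), 2))

print(pairwise_comparisons(2))  # a single pair: the F-test already covers it
print(pairwise_comparisons(3))  # several pairs: the F-test cannot say which ones differ
```

With 2 groups there is only one comparison, whereas 3 groups give 3 comparisons, which is where post hoc tests become informative.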
4) Using graph 2: interpret the graph considering the solution with 3 clusters. Maximum of
5 lines per cluster (3 points)
We can see on the graph that the observations are mixed together, with no particular trend. So we
cannot say that either women or men are over-represented in any cluster.
Graph 3. Using the supplementary variable “gender”
Part II. Interpreting the results (10 points)
A study was conducted with a random sample of 100 households to describe their budget allocation
(per year). Eight expenditure items were taken into account, namely:
Agglomeration schedule
Stage   Cluster Combined         Coefficients
        Cluster 1   Cluster 2
90      50          95           24,749
91      2           7            27,926
92      46          81           31,422
93      46          57           35,644
94      1           17           43,200
95      46          90           52,436
96      46          50           71,738
97      1           48           92,244
98      1           2            137,878
99      1           46           198,000
According to the agglomeration schedule, there is a large jump in the coefficients
column after stage 97 (from 92,244 to 137,878). 100 – 97 = 3 => 3 clusters
Besides, if we cut the dendrogram in two, we can also see that there is the possibility of having
3 clusters (ideal for maximizing intergroup variance and for minimizing intragroup variance).
This is in line with the agglomeration schedule.
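The “100 – stage” rule can be checked with a short Python sketch. The coefficients are copied from the agglomeration schedule. Measuring the jump as a relative increase is an assumption of this sketch, not a rule stated in the answer, and other criteria (such as the absolute increase) may rank the stages differently.

```python
# (stage, coefficient) pairs copied from the agglomeration schedule;
# decimal commas are written as decimal points here.
schedule = [
    (90, 24.749), (91, 27.926), (92, 31.422), (93, 35.644), (94, 43.200),
    (95, 52.436), (96, 71.738), (97, 92.244), (98, 137.878), (99, 198.000),
]
N_OBSERVATIONS = 100  # households in the sample

def jumps(schedule, n_observations):
    """Compare each stage's coefficient with the coefficient of the next stage."""
    rows = []
    for (stage, coef), (_, next_coef) in zip(schedule, schedule[1:]):
        rows.append({
            "stage": stage,
            "clusters_left": n_observations - stage,  # clusters remaining after this stage
            "absolute_jump": next_coef - coef,
            "relative_jump": next_coef / coef - 1,
        })
    return rows

rows = jumps(schedule, N_OBSERVATIONS)
# Assumption: pick the stage followed by the largest relative increase.
best = max(rows, key=lambda r: r["relative_jump"])
print(best["stage"], best["clusters_left"])
```

Printing `rows` in full shows both the absolute and the relative jump for every stage, which helps when comparing candidate cut points against the dendrogram.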
2. Using figure 1 (output for ANOVA): what do you deduce from this output?
(maximum of 10 lines, 3 points)
Figure 1. Output for ANOVA
ANOVA
                                              Sum of Squares   df   Mean Square   F        Sig.
REGR factor score 1         Between Groups    52,526           2    26,263        54,815   ,000
(Short-term expenditures)   Within Groups     46,474           97   ,479
                            Total             99,000           99
REGR factor score 2         Between Groups    53,231           2    26,615        56,406   ,000
(Long-term expenditures)    Within Groups     45,769           97   ,472
                            Total             99,000           99
H0: there is no difference between groups (so, the component is useless for the classification
analysis, groups of individuals have the same behaviour on the component), H1: there is at
least one significant difference between groups (so, the component is useful for the
classification analysis).
We use the method of “critical probability”: to reject H0 in favour of H1, the
p-value (Sig.) must be below the risk level α.
With a risk level α=5%, both p-values (Sig.) are below 5%, which means that we
can reject H0 in favour of H1. Hence, thanks to the classification based on the previous
PCA, we can distinguish different groups of individuals with different behaviours on the two
principal components.
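As a quick check of the ANOVA output, the mean squares and the F statistics can be recomputed from the sums of squares and degrees of freedom. This is a minimal sketch using only the figures printed in the table; the function name `f_statistic` is hypothetical, and small differences in the last digit come from the rounding of the printed sums of squares.

```python
def f_statistic(ss_between, df_between, ss_within, df_within):
    """Return (mean square between, mean square within, F) for a one-way ANOVA."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within

# Figures copied from the ANOVA output (decimal commas written as points).
# df between = 3 clusters - 1 = 2; df within = 100 households - 3 clusters = 97.
anova_rows = {
    "Short-term expenditures": (52.526, 2, 46.474, 97),
    "Long-term expenditures": (53.231, 2, 45.769, 97),
}

for name, row in anova_rows.items():
    ms_b, ms_w, f = f_statistic(*row)
    print(f"{name}: MS between={ms_b:.3f}, MS within={ms_w:.3f}, F={f:.3f}")
```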
This cluster has a significantly different behaviour from the other two on the two components.
This group makes major expenditures neither in the long-term (below the total average) nor in
the short-term.
Cluster 2: Investors
This group has a significantly different behaviour from the two other groups on the principal
component related to the short-term expenditures. Nevertheless, this group is similar to
cluster number 3 on the long-term expenditures component. In this cluster, individuals spend
a lot on the long-term (highest total average) but not on the short-term (lowest
total average).
The expenses tend to focus on the “long-term”, and less on the “short-term”.
This group has a significantly different behaviour from the two other clusters on the principal
component related to the short-term expenditures. It is quite similar to the cluster number 2
on the principal component related to the long-term expenditures. In this cluster, individuals
make a lot of expenditures both on short-term (highest total average) and long-term (average
The expenses tend to focus both on the “long-term” and on the “short-term”.
4. How could we improve this study? Suggest one line of inquiry. (maximum of 5
lines, 1 point)
You can use qualitative variables like the job of the household head to identify some
using 4 clusters or more in order to obtain more concentrated groups, that is to say with