Answer Part 222

Part 2. Exercise (10 points)

A study was conducted to examine the consequences of sleeping and waking problems. 95 individuals
were selected at random. They rated twelve characteristics on a scale from 1 to 10 (1 being the
lowest level, 10 being the highest level):
1. Anxiety (excluded / PCA)
2. Depression (excluded / PCA)
3. Lethargy (PC2)
4. Tiredness (PC2)
5. Drowsiness (PC2)
6. Energy (PC2)
7. Mood (PC1)
8. Concentration (excluded / PCA)
9. Memory (excluded / PCA)
10. Life satisfaction (PC1)
11. Overall well-being (PC1)
12. Relationship with the others (family, friends, colleagues) (PC1)
After conducting a principal component analysis (we found 2 principal components: PC1 = impact; PC2
= bad consequences), we aim to run a hierarchical classification.
The HCA results are shown below:
Table 1. Descriptive Statistics

Variable            N   Minimum  Maximum  Mean  Std. Deviation  Skewness  Kurtosis
Mood                95  1        10       5,72  2,407           -,253     -,831
Life satisfaction   95  1        10       5,34  2,381           -,159     -,829
Overall well-being  95  1        10       5,66  2,338           -,201     -,666
Relationships       95  1        10       4,82  2,627           ,214      -1,100
Lethargy            95  1        10       4,72  2,354           ,121      -1,229
Tiredness           95  1        10       5,30  2,347           -,128     -1,171
Drowsiness          95  1        10       5,57  2,220           -,157     -1,017
Energy              95  1        10       5,53  2,353           -,159     -1,166
Valid N (listwise)  95
Table 2. Agglomeration Schedule

       Cluster Combined                    Stage Cluster First Appears
Stage  Cluster 1  Cluster 2  Coefficients  Cluster 1  Cluster 2  Next Stage
1      72         97         ,000          0          0          21
…      …          …          …             …          …          …
89     3          7          45,866        79         88         93
90     1          10         56,598        80         86         91
91     1          11         73,934        90         85         93
92     2          6          93,099        87         83         94
93     1          3          135,647       91         89         94
94     1          2          193,897       93         92         0
Graph 1. Dendrogram (hierarchical tree)
1) Using table 2 and the dendrogram (graph 1): how many clusters do you feel are optimal?
Sketch a line on the dendrogram to illustrate the number of clusters you have chosen.
Maximum of 10 lines (2 points)
We have 95 participants, so stopping after stage k leaves 95 – k clusters. In the agglomeration
schedule, the first large jump in the coefficients comes after stage 92 (from 93,099 to 135,647,
against a rise from 73,934 to 93,099 at the previous stage), although the increase is already
visible from stage 91. The number of clusters should therefore be 95 – 92 = 3 clusters (or
95 – 91 = 4 clusters). On the dendrogram, we can cut the tree at the middle of its height to
obtain the solution. Here, the line we sketch crosses 3 branches. Therefore, we can study 3
clusters.
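The reading of the agglomeration schedule can be checked with a short calculation. The sketch below uses only the coefficients printed in Table 2 (decimal commas written as points). The selection rule, which keeps the solution just before the merge whose increase grows most relative to the previous merge, is an assumption made for illustration and is only one possible way to formalise "the first big jump".

```python
# Coefficients copied from Table 2 (stages 89 to 94).
N_INDIVIDUALS = 95
coefficients = {89: 45.866, 90: 56.598, 91: 73.934,
                92: 93.099, 93: 135.647, 94: 193.897}

stages = sorted(coefficients)

# Increase of the coefficient caused by the merge performed at each stage.
increase = {s: round(coefficients[s] - coefficients[s - 1], 3) for s in stages[1:]}

# Assumed heuristic: find the merge whose increase grows most compared with
# the previous merge, then stop just before that merge.
ratios = {s: increase[s] / increase[s - 1] for s in stages[2:]}
costly_stage = max(ratios, key=ratios.get)
last_stage_kept = costly_stage - 1
n_clusters = N_INDIVIDUALS - last_stage_kept

for s in stages[1:]:
    print(f"stage {s}: increase {increase[s]:.3f}, "
          f"clusters left after this stage: {N_INDIVIDUALS - s}")
print(f"stop after stage {last_stage_kept}: {n_clusters} clusters")
```

With these figures the rule stops after stage 92, which agrees with the 3-cluster answer. The dendrogram remains the main argument, since a purely numerical rule can be sensitive to small changes in the coefficients.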
2) You still hesitate between 2 and 3 clusters. So you decide to conduct an ANOVA test
(analysis of variance), with post hoc tests to determine the best number of clusters. Using the
tables below, which solution do you prefer? Explain why. Maximum of 10 lines (2 points)
Solution with 3 clusters
Table 3. ANOVA (3 clusters)

                                  Sum of Squares  df  Mean Square  F       Sig.
Impact            Between Groups  57,594          2   28,797       61,522  ,000
                  Within Groups   43,063          92  ,468
                  Total           100,657         94
Bad consequences  Between Groups  43,204          2   21,602       39,719  ,000
                  Within Groups   50,036          92  ,544
                  Total           93,240          94
Table 4. Impact (Post hoc with 3 clusters)
Tukey HSD

                 Subset for alpha = 0.05
Ward Method  N   1       2
3            23  -,5798
1            46  -,3693
2            26          1,2966
Sig.             ,474    1,000

Table 5. Bad consequences (Post hoc with 3 clusters)
Tukey HSD

                 Subset for alpha = 0.05
Ward Method  N   1        2       3
3            23  -1,0601
2            26           -,2245
1            46                   ,5923
Sig.             1,000    1,000   1,000
Solution with 2 clusters

Table 6. ANOVA (with 2 clusters) (Nota Bene: same result with a Student T test)

                                  Sum of Squares  df  Mean Square  F        Sig.
Impact            Between Groups  56,914          1   56,914       121,003  ,000
                  Within Groups   43,743          93  ,470
                  Total           100,657         94
Bad consequences  Between Groups  1,336           1   1,336        1,352    ,248
                  Within Groups   91,904          93  ,988
                  Total           93,240          94
With the solution with 2 clusters: the test on the 2nd component is not significant (p-value = ,248
> 5%, so H0 cannot be rejected: the component does not discriminate between the clusters, which
overlap on it), so only the 1st component is useful to discriminate the clusters. With the solution
with 3 clusters, the 2 components are useful (H0 is rejected in favor of H1: the component is
useful). Indeed, each F-test is significant and shows that at least 2 clusters behave differently
on each component. It is therefore better to interpret the solution with 3 clusters.
For the post hoc tests: we can clearly see the differences between the clusters.
- For the component “impact”: we can see that group 3 and group 1 are similar and their
means are below the average trend. Group 2 is significantly different from the 2 other groups,
and its mean is above the average trend.
- For the component “bad consequences”: all the groups are significantly different from
each other. Group 3 and group 2 are under the average, but group 1 is above the trend. (not
mandatory)
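The comparison between the two solutions can be reproduced from the sums of squares alone. The sketch below recomputes each F statistic from Tables 3 and 6 and adds eta squared (between-groups sum of squares divided by the total), a share-of-variance measure that is not part of the original output and is shown here only as an illustrative addition. The function name `f_and_eta2` is hypothetical.

```python
# Sums of squares (between, within) copied from Table 3 (3 clusters)
# and Table 6 (2 clusters); decimal commas written as points.
N = 95
tables = {
    (3, "Impact"): (57.594, 43.063),
    (3, "Bad consequences"): (43.204, 50.036),
    (2, "Impact"): (56.914, 43.743),
    (2, "Bad consequences"): (1.336, 91.904),
}


def f_and_eta2(k, ss_between, ss_within, n=N):
    """Return the one-way ANOVA F statistic and eta squared for k groups."""
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta2 = ss_between / (ss_between + ss_within)
    return f_stat, eta2


results = {key: f_and_eta2(key[0], *ss) for key, ss in tables.items()}
for (k, component), (f_stat, eta2) in results.items():
    print(f"{k} clusters, {component}: F = {f_stat:.2f}, eta^2 = {eta2:.3f}")
```

The recomputed F values match the tables, and the share of variance explained on “bad consequences” falls from about 46% with 3 clusters to under 2% with 2 clusters, which is the numerical counterpart of the argument for keeping 3 clusters.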
3) Why don’t we need to run the post hoc tests for the solution with 2 clusters? Maximum of
4 lines (1 point)
When the test is significant, it means that at least 2 groups behave differently.
With only 2 groups, there is a single pair to compare, so a significant test already shows that
these 2 groups differ. That is why post hoc tests are not needed.
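The argument comes down to counting pairs, as the short sketch below shows. It also illustrates the Nota Bene of Table 6: with two groups, the ANOVA F statistic is the square of the pooled-variance Student t statistic, so the absolute value of t can be recovered from the F printed for “impact”.

```python
from math import comb, sqrt

# Number of pairwise comparisons a post hoc test would have to make.
pairs = {k: comb(k, 2) for k in (2, 3)}
for k, n_pairs in pairs.items():
    print(f"{k} clusters: {n_pairs} pairwise comparison(s)")

# With two groups, F = t^2 (pooled-variance Student t test).
f_impact = 121.003           # F for "Impact" in Table 6
t_impact = sqrt(f_impact)    # absolute value of the equivalent t statistic
print(f"|t| for Impact with 2 clusters: {t_impact:.3f}")
```

With 2 clusters there is one comparison, which the global test already performs; with 3 clusters there are three, and the global test does not say which of them are significant.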
4) Using graph 2: interpret the graph considering the solution with 3 clusters. Maximum of
5 lines per cluster (3 points)
Graph 2. Graphical representation of the clusters
Cluster 1 (stars) (+ see question 2)

Interpretation:
High level on bad consequences and low level on impact
Name of the cluster: Subjective interpretation, so no formal answer
Cluster 2 (diamonds) (+ see question 2)

Interpretation:
This cluster is too dispersed. We have different behaviors on the component “bad consequences”,
but the same behavior on the component “impact” (high level)
Name of the cluster: Subjective interpretation, so no formal answer
Cluster 3 (rectangles) (+ see question 2)

Interpretation:
Low level on bad consequences and low level on impact
Name of the cluster: Subjective interpretation, so no formal answer
5) What is the use of a supplementary variable in a hierarchical classification analysis? Then,
using graph 3 (below): interpret the graph, with the supplementary variable “gender”.
Maximum of 8 lines (2 points)
A supplementary variable does not take part in building the clusters. It is used afterwards to
describe in more detail the observations in each cluster. Do women tend to be more in a particular
cluster compared to men? Etc.
We can see on the graph that men and women are mixed in every cluster; there is no particular
trend. So we cannot say that women or men are more represented in a given cluster.
Graph 3. Using the supplementary variable “gender”
Part II. Interpreting the results (10 points)
A study was conducted with a random sample of 100 households to describe their budget allocation
(per year). Eight expenditure items were taken into account, namely:
Expenditure items             Description
Food at home                  Bread, meat and fish, dairy products, fruits and vegetables, beverages
Clothing                      Clothes, shoes, other accessories
Food away from home           Restaurants, canteens, coffee bars, fast food, other related expenses
Housing                       Rent, charges, taxes, utilities (water, electricity, gas)
Health care                   Products and medical devices, health care-related services
Leisure activities            Games, books, travels, sport activities
Transportation (= Transport)  Vehicles, transport services
Savings                       Savings accounts, financial products, life insurance
Expenditure item     N    Minimum  Maximum   Mean      Standard Deviation
Food at home         100  1866,33  6501,17   3915,87   919,02
Food away from home  100  196,62   2979,55   1398,48   670,58
Clothing             100  164,74   3689,23   2198,31   745,80
Housing              100  1810,60  21919,27  11176,58  4530,43
Health care          100  262,37   1835,70   973,99    340,50
Transportation       100  176,98   9545,76   4774,57   1966,34
Leisure activities   100  76,95    5237,68   2436,76   1187,31
Savings              100  17,16    3444,84   1374,13   733,63
By carrying out a principal component analysis, we found two principal components.
The first principal component describes the short-term and daily expenditures (health care, food
away from home, clothing, food at home, and leisure activities), while the second principal
component describes the long-term and monthly expenditures (transportation, savings, and housing).
We aim to perform a hierarchical classification.
1. Using the dendrogram and the table below, what is the optimal number of clusters?
Explain why. (maximum of 10 lines, 3 points)
Agglomeration schedule

       Cluster Combined
Stage  Cluster 1  Cluster 2  Coefficients
90     50         95         24,749
91     2          7          27,926
92     46         81         31,422
93     46         57         35,644
94     1          17         43,200
95     46         90         52,436
96     46         50         71,738
97     1          48         92,244
98     1          2          137,878
99     1          46         198,000
According to the agglomeration schedule, there is a large jump in the coefficients
column after stage 97 (from 92,244 to 137,878). Stopping at stage 97: 100 – 97 = 3 => 3 clusters
Besides, if we cut the dendrogram at the middle of its height, we can also see that there is the
possibility of having 3 clusters (ideal for maximizing intergroup variance and for minimizing
intragroup variance).
This is in line with the agglomeration schedule.
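Another way to read the same schedule is to ask what each additional merge costs. The sketch below, using only the coefficients printed in the agglomeration schedule, computes the increase in the coefficient when going from k clusters to k – 1 clusters. The helper name `merge_cost` is hypothetical, and treating "more than twice the previous cost" as the sign of a jump is an assumption, not a rule stated in the exercise.

```python
# Coefficients copied from the agglomeration schedule (stages 90 to 99).
N_HOUSEHOLDS = 100
coefficients = {90: 24.749, 91: 27.926, 92: 31.422, 93: 35.644, 94: 43.200,
                95: 52.436, 96: 71.738, 97: 92.244, 98: 137.878, 99: 198.000}


def merge_cost(k):
    """Coefficient increase when going from k clusters to k - 1 clusters."""
    stage = N_HOUSEHOLDS - k + 1  # the merge that leaves k - 1 clusters
    return coefficients[stage] - coefficients[stage - 1]


costs = {k: round(merge_cost(k), 3) for k in range(2, 7)}
for k in sorted(costs, reverse=True):
    print(f"from {k} to {k - 1} clusters: increase of {costs[k]:.3f}")
```

Going from 4 to 3 clusters costs about as much as the previous merge, whereas going from 3 to 2 clusters costs more than twice as much, so stopping at 3 clusters is consistent with the answer above.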
2. Using figure 1 (output for ANOVA): what do you deduce from this output?
(maximum of 10 lines, 3 points)
Figure 1. Output for ANOVA

ANOVA
                                            Sum of Squares  df  Mean Square  F       Sig.
REGR factor score 1         Between Groups  52,526          2   26,263       54,815  ,000
(Short-term expenditures)   Within Groups   46,474          97  ,479
                            Total           99,000          99
REGR factor score 2         Between Groups  53,231          2   26,615       56,406  ,000
(Long-term expenditures)    Within Groups   45,769          97  ,472
                            Total           99,000          99
H0: there is no difference between groups (so the component is useless for the classification
analysis: the groups of individuals have the same behaviour on the component). H1: there is at
least one significant difference between groups (so the component is useful for the
classification analysis).
We use the method of “critical probability”: to reject H0 in favour of H1, the
probabilities (p-values or α’) must be less than alpha (α).
With a risk level α = 5%, we can see that both p-values (Sig.) are below 5%, so we
can reject H0 in favour of H1 for both components. In this sense, thanks to the classification
based on the previous PCA, we can distinguish different groups of individuals with different
behaviours on the two principal components.
3. Finally, we have chosen three clusters.

Using figure 2 and the tables below: describe the three clusters (maximum of 4 lines
per cluster, 3 points)
Figure 2. Graph

REGR factor score 1 / Short-term expenditures
Tukey
                 Subset for alpha = 0.05
Ward Method  N   1            2           3
2            21  -1,0745828
1            37               -,2744185
3            42                           ,7790411
Sig.             1,000        1,000       1,000

REGR factor score 2 / Long-term expenditures
Tukey
                 Subset for alpha = 0.05
Ward Method  N   1           2
1            37  -,9444119
3            42              ,4725967
2            21              ,7187705
Sig.             1,000       ,346
Cluster 1: Low spenders
This cluster has a significantly different behaviour from the other two on the two components.
This group does not make major expenditures either in the long term (below the overall average)
or in the short term (slightly below the overall average).
Cluster 2: Investors
This group has a significantly different behaviour from the two other groups on the principal
component related to the short-term expenditures. Nevertheless, this group is similar to
cluster 3 on the long-term expenditures component. In this cluster, individuals spend
a lot in the long term (highest cluster mean) but not in the short term (lowest
cluster mean).
The expenses tend to focus on the “long-term”, and less on the “short-term”.
Cluster 3: Big spenders
This group has a significantly different behaviour from the two other clusters on the principal
component related to the short-term expenditures. It is quite similar to cluster 2
on the principal component related to the long-term expenditures. In this cluster, individuals
spend a lot both in the short term (highest cluster mean) and in the long term (mean
above the overall average).
The expenses tend to focus both on the “long-term” and on the “short-term”.
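The phrases “above” and “below the average” in these descriptions can be checked against the Tukey tables. The sketch below uses only the cluster sizes and mean factor scores printed there (decimal commas written as points). It recomputes the overall mean of each component as the size-weighted mean of the cluster means, then places each cluster relative to it. The variable names are illustrative.

```python
# Cluster sizes and mean factor scores copied from the two Tukey tables.
sizes = {1: 37, 2: 21, 3: 42}
means = {
    "short-term": {1: -0.2744185, 2: -1.0745828, 3: 0.7790411},
    "long-term": {1: -0.9444119, 2: 0.7187705, 3: 0.4725967},
}

n_total = sum(sizes.values())
overall = {}   # overall mean of each component
profile = {}   # position of each cluster relative to that mean

for component, cluster_means in means.items():
    overall[component] = sum(sizes[c] * cluster_means[c] for c in sizes) / n_total
    profile[component] = {
        c: "above" if cluster_means[c] > overall[component] else "below"
        for c in sizes
    }

for c in sizes:
    print(f"cluster {c}: short-term {profile['short-term'][c]} average, "
          f"long-term {profile['long-term'][c]} average")
```

The overall means come out at practically zero, as expected for factor scores, so the sign of a cluster mean is enough to tell on which side of the average the cluster lies. The resulting profile agrees with the three descriptions: cluster 1 is below on both components, cluster 2 is below on short-term and above on long-term, and cluster 3 is above on both.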
4. How could we improve this study? Suggest one line of inquiry. (maximum of 5
lines, 1 point)
- You can use qualitative variables like the job of the household head to identify some
tendencies (Do managers invest more than others? etc.).
- Clusters 1 and 3 are very scattered. It could be interesting to perform a new classification
using 4 clusters or more in order to obtain more concentrated groups, that is to say with
less variation within group.
