
Statistics for management

E&I AY 2022-23

EXERCISES TIME SERIES SOLVED

EXERCISE 1

To forecast the sales of the next period, a company has used the simple exponential smoothing
method with smoothing constants α = 0.02 and α = 0.4. The following table reports the time
series yt, the smoothed series with α = 0.02 and the smoothed series with α = 0.4 up to t = 9.
t    y      Smoothed series (α = 0.02)   Smoothed series (α = 0.4)
1 19.00 20.50 20.50
2 22.00 20.47 19.90
3 24.00 20.50 20.74
4 23.00 20.57 22.04
5 21.00 20.62 22.43
6 17.00 20.63 21.86
7 18.00 20.55 19.91
8 22.00 20.50 19.15
9 24.00 20.53 20.29
10 23.00 ? ?

a) Calculate the last value for each series.


b) Dividing the time series into a training set (70%) and a test set (30%), which is the better
smoothing constant: α = 0.02 or α = 0.4?
c) Determine the forecast for t = 11 with the chosen constant.

a) The forecast for time t + 1 can be written as: Ŷt+1 = α·Yt + (1 − α)·Ŷt

For t = 10 and α = 0.02:

Ŷ10 = 0.02·Y9 + (1 − 0.02)·Ŷ9 = 0.02·24 + 0.98·20.53 = 20.60

For t = 10 and α = 0.4:

Ŷ10 = 0.4·Y9 + (1 − 0.4)·Ŷ9 = 0.4·24 + 0.6·20.29 = 21.77
b) To determine the best constant we calculate the sum of squares of the errors:
t    y    ŷ (α=0.02)  error  error²    ŷ (α=0.4)  error  error²
1    19   20.50       -1.50   2.25     20.50      -1.50   2.25
2    22   20.47        1.53   2.34     19.90       2.10   4.41
3    24   20.50        3.50  12.25     20.74       3.26  10.63
4    23   20.57        2.43   5.90     22.04       0.96   0.91
5    21   20.62        0.38   0.15     22.43      -1.43   2.03
6    17   20.63       -3.63  13.15     21.86      -4.86  23.58
7    18   20.55       -2.55   6.52     19.91      -1.91   3.66
8    22   20.50        1.50   2.24     19.15       2.85   8.13
9    24   20.53        3.47  12.02     20.29       3.71  13.77
10   23   20.60        2.40   5.75     21.77       1.23   1.50
sum                          62.57                       70.87

RMSE = √(62.57/10) = 2.50              RMSE = √(70.87/10) = 2.66

The best constant is α = 0.02. The same ranking holds if the RMSE is computed only on the test
observations (t = 8, 9, 10): 2.58 for α = 0.02 versus 2.79 for α = 0.4.

c) The forecast is:

Ŷ11 = 0.02·Y10 + (1 − 0.02)·Ŷ10 = 0.02·23 + 0.98·20.60 = 20.65
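As a check, here is a minimal R sketch of the whole exercise (it assumes the initial smoothed value Ŷ1 = 20.50 given in the table):

y <- c(19, 22, 24, 23, 21, 17, 18, 22, 24, 23)
ses <- function(y, alpha, init) {
  f <- numeric(length(y) + 1)
  f[1] <- init                                   # Y^1, taken from the table
  for (t in seq_along(y))                        # Y^(t+1) = alpha*y(t) + (1-alpha)*Y^(t)
    f[t + 1] <- alpha * y[t] + (1 - alpha) * f[t]
  f                                              # f[t] is the one-step-ahead forecast of y[t]
}
f02 <- ses(y, 0.02, 20.5); f04 <- ses(y, 0.4, 20.5)
sqrt(mean((y - f02[1:10])^2))                    # RMSE for alpha = 0.02: ~2.50
sqrt(mean((y - f04[1:10])^2))                    # RMSE for alpha = 0.4:  ~2.66
f02[11]                                          # forecast for t = 11:   ~20.65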

EXERCISE 2

Consider the time series of the production of a good:

y = (70, 71, 73, 70, 73, 75, 73, 74, 75, 76).

1) Calculate the autocorrelation function for lags 1 and 2. Draw the correlogram with the
confidence limits for lags 1 and 2. Are these two values inside the confidence limits?
2) Divide the time series into training (60%) and test (40%) sets and determine the best forecasting
method among: the average method, the naïve method, the drift method, the moving
average of order 3 and simple exponential smoothing with α = 0.2. Using the best
method, calculate the forecast for t = 12.

1) Autocorrelation function at lag 1 and 2:

r1 = Σt=2..n (yt − ȳ)(yt−1 − ȳ) / Σt=1..n (yt − ȳ)²        r2 = Σt=3..n (yt − ȳ)(yt−2 − ȳ) / Σt=1..n (yt − ȳ)²
t    yt   yt−1  yt−2   (yt−ȳ)  (yt−1−ȳ)  (yt−2−ȳ)  (yt−ȳ)(yt−1−ȳ)  (yt−ȳ)(yt−2−ȳ)  (yt−ȳ)²
1    70                -3                                                             9
2    71   70           -2      -3                   6                                 4
3    73   71    70      0      -2        -3         0                0                0
4    70   73    71     -3       0        -2         0                6                9
5    73   70    73      0      -3         0         0                0                0
6    75   73    70      2       0        -3         0               -6                4
7    73   75    73      0       2         0         0                0                0
8    74   73    75      1       0         2         0                2                1
9    75   74    73      2       1         0         2                0                4
10   76   75    74      3       2         1         6                3                9
sum  730                                           14                5               40

ȳ = 730/10 = 73        σ² = (1/n)·Σt=1..n (yt − ȳ)² = 40/10 = 4

r1 = Σt=2..n (yt − ȳ)(yt−1 − ȳ) / Σt=1..n (yt − ȳ)² = 14/40 = 0.35        r2 = Σt=3..n (yt − ȳ)(yt−2 − ȳ) / Σt=1..n (yt − ȳ)² = 5/40 = 0.125

Confidence limits: [−2/√n, 2/√n] ⇒ [−2/√10, 2/√10] ⇒ [−0.6325, 0.6325]

The two values of the autocorrelation function are inside the confidence limits, i.e. they are
not significantly different from zero.
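The same values can be checked with base R's acf(), which uses exactly this formula:

y <- c(70, 71, 73, 70, 73, 75, 73, 74, 75, 76)
acf(y, lag.max = 2, plot = FALSE)   # autocorrelations: 0.35 at lag 1, 0.125 at lag 2
2 / sqrt(length(y))                 # confidence limit used above: 0.6325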

2) The following table reports the fitted values for the training part, the predictions for the test
part, and the prediction errors:

        t   yt  1:Average  2:Naïve  3:Drift  4:MA(3)  5:Exp.S.   errors (yt − ŷt): 1     2     3     4     5
TRAIN   1   70  70         70       70       -        70
        2   71  71         71       71       -        70
        3   73  73         73       73       -        70.2
        4   70  70         70       70       71.3     70.8
        5   73  73         73       73       71.3     70.6
        6   75  75         75       75       72       71.1
TEST    7   73  72         75       76       72.7     71.9                         1    -2    -3    0.3   1.1
        8   74  72         75       77       72.7     71.9                         2    -1    -3    1.3   2.1
        9   75  72         75       78       72.7     71.9                         3     0    -3    2.3   3.1
        10  76  72         75       79       72.7     71.9                         4     1    -3    3.3   4.1
PRED    11      forecast = 76 (naïve method, chosen below)
        12      forecast = 76
For 1 (average): mean of the training data = (70 + 71 + 73 + 70 + 73 + 75)/6 = 72

For 2 (naïve): last training observation = 75

For 3 (drift): slope = (yT − y1)/(T − 1) = (75 − 70)/(6 − 1) = 1, so ŷ7 = 76, ŷ8 = 77, ŷ9 = 78, ŷ10 = 79

For 4 (MA(3)): ŷ7 = (y4 + y5 + y6)/3 = (70 + 73 + 75)/3 = 72.7

For 5 (exponential smoothing): ŷ1 = y1 = 70

ŷ2 = 0.2·y1 + 0.8·ŷ1 = 0.2·70 + 0.8·70 = 70
ŷ3 = 0.2·y2 + 0.8·ŷ2 = 0.2·71 + 0.8·70 = 70.2
ŷ4 = 0.2·y3 + 0.8·ŷ3 = 0.2·73 + 0.8·70.2 = 70.8
ŷ5 = 0.2·y4 + 0.8·ŷ4 = 0.2·70 + 0.8·70.8 = 70.6
ŷ6 = 0.2·y5 + 0.8·ŷ5 = 0.2·73 + 0.8·70.6 = 71.1
ŷ7 = 0.2·y6 + 0.8·ŷ6 = 0.2·75 + 0.8·71.1 = 71.9

Evaluation of the prediction methods: RMSE


Average Naïve Drift MA(3) Exp.S.
7 1.00 4.00 9.00 0.09 1.21
8 4.00 1.00 9.00 1.69 4.41
9 9.00 0.00 9.00 5.29 9.61
10 16.00 1.00 9.00 10.89 16.81
Sum/4 7.50 1.50 9.00 4.49 8.01
Square root 2.74 1.22 3.00 2.12 2.83
Evaluation of the prediction methods: MAE
Average Naïve Drift MA(3) Exp.S.
7 1 2 3 0.3 1.1
8 2 1 3 1.3 2.1
9 3 0 3 2.3 3.1
10 4 1 3 3.3 4.1
Sum/4 2.5 1 3 1.8 2.6

Evaluation of the prediction methods: MAPE


t     Average  Naïve   Drift   MA(3)   Exp.S.
7     0.0137   0.0274  0.0411  0.0041  0.0151
8     0.0270   0.0135  0.0405  0.0176  0.0284
9     0.0400   0.0000  0.0400  0.0307  0.0413
10    0.0526   0.0132  0.0395  0.0434  0.0539
Mean  3.3%     1.4%    4.0%    2.4%    3.5%

The best method is the naïve one (lowest RMSE, MAE and MAPE). The prediction at t = 12 is 76.
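A sketch of the train/test comparison in base R (the MA(3) and exponential smoothing forecasts are kept flat over the test horizon, as in the table above):

y <- c(70, 71, 73, 70, 73, 75, 73, 74, 75, 76)
train <- y[1:6]; test <- y[7:10]; h <- 1:4
f <- train[1]                                                # exponential smoothing, Y^1 = y1
for (t in seq_along(train)) f <- 0.2 * train[t] + 0.8 * f    # ends as Y^7 = 71.9
fc <- rbind(average = rep(mean(train), 4),
            naive   = rep(tail(train, 1), 4),
            drift   = tail(train, 1) + h * (tail(train, 1) - train[1]) / (length(train) - 1),
            ma3     = rep(mean(tail(train, 3)), 4),
            expsm   = rep(f, 4))
apply(fc, 1, function(p) sqrt(mean((test - p)^2)))   # RMSE: the naive method is lowest (1.22)
apply(fc, 1, function(p) mean(abs(test - p)))        # MAE:  the naive method is lowest (1)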
EXERCISE 3

Consider the following table with: the series of sales of a good in the last 4 months (the test set)
and the sales prediction calculated on the train set using
a. Holt-Winters model;
b. SARIMA model.
Sales Prediction with H-W Prediction with SARIMA
664 677.5 791.9
628 651.3 795.8
308 453.6 508.9
324 350.2 354.4

Determine which model produces the best forecasts. Why?

To determine the best model let’s calculate the prediction error:



Prediction Prediction with Error H-W Error SARIMA
Sales with H-W SARIMA
664 677.5 791.9 -13.5 -127.9
628 651.3 795.8 -23.3 -167.8
308 453.6 508.9 -145.6 -200.9
324 350.2 354.4 -26.2 -30.4


MAE(H-W) = (13.5 + 23.3 + 145.6 + 26.2)/4 = 208.6/4 = 52.15

MAE(SARIMA) = (127.9 + 167.8 + 200.9 + 30.4)/4 = 527/4 = 131.75

The model that produces the best forecasts is Holt-Winters, since its MAE is much lower
(52.15 versus 131.75).
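The MAE comparison is a one-liner per model in R:

sales  <- c(664, 628, 308, 324)
hw     <- c(677.5, 651.3, 453.6, 350.2)
sarima <- c(791.9, 795.8, 508.9, 354.4)
mean(abs(sales - hw))       # MAE for Holt-Winters: 52.15
mean(abs(sales - sarima))   # MAE for SARIMA: 131.75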


EXERCISE 4

Let's consider the time series of the monthly births (in thousands) in the city of New York from
January 1946 to December 1959.

year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1946 26.663 23.598 26.931 24.740 25.806 24.364 24.477 23.901 23.175 23.227 21.672 21.870
1947 21.439 21.089 23.709 21.669 21.752 20.761 23.479 23.824 23.105 23.110 21.759 22.073
1948 21.937 20.035 23.590 21.672 22.222 22.123 23.950 23.504 22.238 23.142 21.059 21.573
1949 21.548 20.000 22.424 20.615 21.761 22.874 24.104 23.748 23.262 22.907 21.519 22.025
1950 22.604 20.894 24.677 23.673 25.320 23.583 24.671 24.454 24.122 24.252 22.084 22.991
1951 23.287 23.049 25.076 24.037 24.430 24.667 26.451 25.618 25.014 25.110 22.964 23.981
1952 23.798 22.270 24.775 22.646 23.988 24.737 26.276 25.816 25.210 25.199 23.162 24.707
1953 24.364 22.644 25.565 24.062 25.431 24.635 27.009 26.606 26.268 26.462 25.246 25.180
1954 24.657 23.304 26.982 26.199 27.210 26.122 26.706 26.878 26.152 26.379 24.712 25.688
1955 24.990 24.239 26.721 23.475 24.767 26.219 28.361 28.599 27.914 27.784 25.693 26.881
1956 26.217 24.218 27.914 26.975 28.527 27.139 28.982 28.169 28.056 29.136 26.291 26.987
1957 26.589 24.848 27.543 26.896 28.878 27.390 28.065 28.141 29.048 28.484 26.634 27.735
1958 27.132 24.924 28.963 26.589 27.931 28.009 29.229 28.759 28.405 27.945 25.912 26.619
1959 26.076 25.286 27.660 25.951 26.398 25.565 28.865 30.000 29.261 29.012 26.992 27.897

To predict the births we consider two methods: Holt's method and Holt-Winters' seasonal model.
The estimated components for the last three observations of 1959 are reported in the two
tables below.

Holt's method
level Trend
Oct 1959 29.46330 0.449243525
Nov 1959 29.31788 0.327384560
Dec 1959 27.89322 -0.031649239

Holt-Winters' model


level trend season
Oct 1959 27.51263 0.0297128961 1.0433886
Nov 1959 27.66990 0.0334333005 0.9633125
Dec 1959 27.85678 0.0379091677 0.9920814

a) Predict the births in June 1960 with Holt's method


b) Predict the births in December 1960 with the Holt-Winters' model.

a) June 1960 => k = 6

Ŷt+6 = Lt + 6·Tt = 27.89322 + 6·(−0.031649239) = 27.70

b) December 1960 => k=12 and s=12

Ŷt+12 = (Lt + 12·Tt)·St+12−12 = (27.85678 + 12·0.0379091677)·0.9920814 = 28.087
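Both hand calculations can be checked in R with the component values reported in the tables above:

# Holt: Y^(t+k) = L(t) + k*T(t), December 1959 components
L <- 27.89322; Tr <- -0.031649239
L + 6 * Tr                                 # June 1960: ~27.70
# Holt-Winters (multiplicative): Y^(t+k) = (L(t) + k*T(t)) * S(t+k-s)
Lhw <- 27.85678; Thw <- 0.0379091677; Sdec <- 0.9920814
(Lhw + 12 * Thw) * Sdec                    # December 1960: ~28.09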


Statistics for management
E&I AY 2022/23

EXERCISES SIMPLE AND MULTIPLE LINEAR MODEL SOLVED

EXERCISE 1
A manager wants to study the relationship between years of experience X and productivity Y
(mean number of processes completed in 1 hour) of the blue-collar workers of his company. He
collects information on the years of experience and on productivity. The results are in the table below:

Σxi      1092
mean(x)  27.3
Σyi      1777
Σyi²     98903
Σxi²     30974
Σxiyi    50305

Calculate:
a) the correlation between the two variables
b) the slope and the intercept of the simple linear regression
c) the coefficient of determination
d) predict the productivity of a blue collar with 10 years of experience.

a) Given that:

x̄ = 27.3,  Σxi = 1092  ⇒  n = 1092/27.3 = 40,  ȳ = 1777/40 = 44.4

The correlation between X and Y is:

corr(x, y) = Σi (yi − ȳ)(xi − x̄) / [ √(Σi (yi − ȳ)²) · √(Σi (xi − x̄)²) ]
           = (Σi yi·xi − n·ȳ·x̄) / [ √(Σi yi² − n·ȳ²) · √(Σi xi² − n·x̄²) ]
           = (50305 − 40·44.4·27.3) / [ √(98903 − 40·44.4²) · √(30974 − 40·27.3²) ]
           = 1820.2 / (√20048.6 · √1162.4) = 1820.2 / (141.6 · 34.1) = 0.377
b)

β̂1 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)² = (Σi yi·xi − n·ȳ·x̄) / (Σi xi² − n·x̄²) = 1820.2/1162.4 = 1.56

β̂0 = ȳ − β̂1·x̄ = 44.4 − 1.56·27.3 = 1.81

The model is : Y = 1.81 + 1.56 X

c) R² = corr(x, y)² = 0.377² = 0.142

d) The productivity of a blue collar with 10 years of experience is:

Y = 1.81 + 1.56·10 = 17.41
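The same calculations in R, directly from the summary statistics (using ȳ = 44.4 as rounded in the text; note that carrying full precision gives an intercept of about 1.65, while the hand solution, rounding β̂1 to 1.56, reports 1.81):

n <- 40
Sxy <- 50305; Sxx <- 30974; Syy <- 98903
xbar <- 27.3; ybar <- 44.4
sxy <- Sxy - n * xbar * ybar        # centered cross-product: 1820.2
sxx <- Sxx - n * xbar^2             # 1162.4
syy <- Syy - n * ybar^2             # 20048.6
sxy / sqrt(sxx * syy)               # correlation: 0.377
b1 <- sxy / sxx                     # slope: ~1.566
b0 <- ybar - b1 * xbar              # intercept: ~1.65
b0 + b1 * 10                        # prediction at x = 10: ~17.3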

EXERCISE 2
Identify the violation of the assumptions of the linear model by examining the residuals of each fitted model. (The ten residual plots are not reproduced here; the answers below refer to them in order.)

1) Heteroscedasticity 2) Correlation of the errors

3) No assumption violated 4) Non linearity


5) Non linearity 6) Heteroscedasticity

7) No assumption violated 8) Heteroscedasticity

9) No assumption violated 10) Outlier


EXERCISE 3
Consider a dataset that collects characteristics of houses in some city. The following linear
regression model is fitted on the data to predict the price of a house, based on the number of
baths (Baths), the surface in square feet (size) and the taxes paid in each city (Taxes). Results are
reported below.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13645.325 11728.640 1.163 0.247543
Baths 7340.274 7601.035 0.966 0.336624
size 30.571 8.697 3.515 0.000672 ***
Taxes 31.482 4.414 7.132 1.86e-10 ***
---
Residual standard error: 29580 on 96 degrees of freedom
Multiple R-squared: 0.7328, Adjusted R-squared: ----
F-statistic: 87.78 on 3 and 96 DF, p-value: < 2.2e-16

i) What is the predicted price for a house with 2 baths, 50 square feet and 1000 $ of taxes?

Y = 13645.325 + 7340.274·2 + 30.571·50 + 31.482·1000 = 61336.42 $
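The same linear combination in R, with the coefficients from the output above:

coefs <- c(13645.325, 7340.274, 30.571, 31.482)   # intercept, Baths, size, Taxes
x     <- c(1, 2, 50, 1000)
sum(coefs * x)                                    # ~61336.42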

ii) From the R output, it is possible to state that:

[ ] the model is useful in the prediction of the price of a house and all independent variables
involved are statistically significant in the prediction
[ ] correlation is low among the variables
[ ] multicollinearity is present among the variables
[x] the variable Baths is not significant (p = 0.34), so that the model can be reduced to a simpler model

EXERCISE 4
Consider some multiple regression models that study the relationship between the price of a house
and some explanatory variables collected on a sample of 100 houses.
Taxes = taxes
Bedrooms = number of bedrooms
Baths = Number of baths
NW = if the house is exposed to North-West
Size = surface (in feet2)

Here are some descriptive statistics on the variables involved in the models

Taxes Bedrooms Baths


Min. : 20 Min. :1.00 Min. :1.000
1st Qu.: 970 1st Qu.:3.00 1st Qu.:1.875
Median :1535 Median :3.00 Median :2.000
Mean :1668 Mean :2.99 Mean :1.890
3rd Qu.:2042 3rd Qu.:3.00 3rd Qu.:2.000
Max. :4900 Max. :5.00 Max. :3.000

price size
Min. : 21000 Min. : 370 NW =0 in 25% of the cases
1st Qu.: 86875 1st Qu.:1158 NW =1 in 75% of the cases
Median :123750 Median :1410
Mean :126698 Mean :1526
3rd Qu.:153075 3rd Qu.:1760
Max. :338000 Max. :4050
a) Which model would you choose to predict the price of a house? Justify your choice

Model 1: Model 2:
Adj.R2 = 0.7418 Adj.R2 = 0.7247
5 var., 2 to drop 2 var., 0 to drop

Model 3: Model 4:
Adj.R2 = 0.7272 Adj.R2 = 0.7413
3 var., 1 to drop 3 var., 0 to drop

Model 1 has the highest Adj.R2, but model 4 with a negligible difference in the Adj.R2 is more parsimonious.

b) Considering model 4, which variables would you drop from the model at
(i) a level of significance of 5%
(ii) a level of significance of 10%?

None: all three predictors (NW, size, Taxes) have p-values below 0.05, hence also below 0.10.

c) Considering your best model predict the price of a house of “top” characteristics.

Y = 5172.33 + 18495.37·1 + 38.86·4050 + 28.98·4900 =

= 5172.33 + 18495.37 + 157383 + 142002 = 323052.7
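A hypothetical sketch of the same prediction with predict(), assuming the data sit in a data frame houses (not shown here):

m4  <- lm(price ~ NW + size + Taxes, data = houses)
top <- data.frame(NW = 1, size = 4050, Taxes = 4900)   # maxima from the summaries above
predict(m4, newdata = top)                             # ~323050 with the reported coefficients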

Model 1
lm(formula = price ~ Bedrooms + Baths + NW + size + Taxes)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12967.271 16013.779 0.810 0.4201
Bedrooms -6952.988 5438.688 -1.278 0.2042
Baths 6952.504 7499.397 0.927 0.3563
NW 16909.815 6953.530 2.432 0.0169 *
size 40.172 9.237 4.349 3.47e-05 ***
Taxes 28.403 4.417 6.430 5.26e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28640 on 94 degrees of freedom


Multiple R-squared: 0.7548, Adjusted R-squared: 0.7418
F-statistic: 57.88 on 5 and 94 DF, p-value: < 2.2e-16

Model 2
lm(formula = price ~ size + Taxes)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21115.050 8813.367 2.396 0.0185 *
size 34.069 7.904 4.310 3.92e-05 ***
Taxes 32.121 4.363 7.362 5.93e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29570 on 97 degrees of freedom


Multiple R-squared: 0.7302, Adjusted R-squared: 0.7247
F-statistic: 131.3 on 2 and 97 DF, p-value: < 2.2e-16
Model 3
lm(formula = price ~ Bedrooms + size + Taxes)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35765.660 13799.284 2.592 0.011 *
Bedrooms -7543.736 5484.510 -1.375 0.172
size 39.535 8.815 4.485 2.02e-05 ***
Taxes 31.859 4.347 7.328 7.29e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29440 on 96 degrees of freedom


Multiple R-squared: 0.7355, Adjusted R-squared: 0.7272
F-statistic: 88.96 on 3 and 96 DF, p-value: < 2.2e-16

Model 4
lm(formula = price ~ NW + size + Taxes)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5172.334 10395.753 0.498 0.6199
NW 18495.375 6872.445 2.691 0.0084 **
size 38.858 7.865 4.940 3.3e-06 ***
Taxes 28.981 4.387 6.606 2.2e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28660 on 96 degrees of freedom


Multiple R-squared: 0.7492, Adjusted R-squared: 0.7413
F-statistic: 95.58 on 3 and 96 DF, p-value: < 2.2e-16

EXERCISE 5
We want to investigate the determinants of the wages of male workers in the Mid-Atlantic
region. The data frame has 3000 observations on the following 8 variables:
age Age of worker
maritl A factor indicating marital status with 5 levels:
1. Never Married 2. Married 3. Widowed 4. Divorced 5. Separated
race A factor indicating race with 4 levels:
1. White 2. Black 3. Asian 4. Other
education A factor indicating education level with 5 levels:
1. < HS Grad 2. HS Grad 3. Some College
4. College Grad 5. Advanced Degree
jobclass A factor indicating type of job with 2 levels
1. Industrial 2. Information
health A factor indicating health level of worker with 2 levels
1. <=Good 2. >=Very Good
health_ins A factor indicating if worker has health insurance with 2 levels
1. Yes 2. No
wage Worker's raw wage

The model and its residuals are reported below:

Call:
lm(formula = wage ~ age + maritl + race + education + jobclass +
health + health_ins)

Residuals:
Min 1Q Median 3Q Max
-103.663 -18.706 -3.473 13.853 211.966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.8356 3.5407 18.594 < 2e-16 ***
age 0.2837 0.0623 4.554 5.47e-06 ***
maritl2. Married 16.9253 1.7234 9.821 < 2e-16 ***
maritl3. Widowed 0.9009 8.0206 0.112 0.910578
maritl4. Divorced 3.6329 2.8929 1.256 0.209287
maritl5. Separated 11.5439 4.8563 2.377 0.017512 *
race2. Black -4.8977 2.1505 -2.277 0.022830 *
race3. Asian -2.5041 2.6087 -0.960 0.337193
race4. Other -5.9525 5.6809 -1.048 0.294809
education2. HS Grad 7.8432 2.3754 3.302 0.000972 ***
education3. Some College 18.3040 2.5265 7.245 5.49e-13 ***
education4. College Grad 31.3257 2.5547 12.262 < 2e-16 ***
education5. Advanced Degree 54.1677 2.8180 19.222 < 2e-16 ***
jobclass2. Information 3.4806 1.3273 2.622 0.008775 **
health2. >=Very Good 6.5454 1.4244 4.595 4.51e-06 ***
health_ins2. No -17.4482 1.4069 -12.402 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 34.09 on 2984 degrees of freedom


Multiple R-squared: 0.336, Adjusted R-squared: 0.3327
F-statistic: 100.7 on 15 and 2984 DF, p-value: < 2.2e-16



a. Looking at the model and its residuals, what improvement to the model would you suggest?

The residuals are strongly right-skewed (max 211.97 against min −103.66), indicating non-normality.
A suggestion is to transform the dependent variable and model log(wage) instead of wage.

b. Given the model, predict the wage of a 45-year-old white male, never married, with
some college, in good health, with no insurance, who works in the industrial sector.

Y = 65.84 + 0.28·45 + 18.30·1 − 17.45·1 = 79.29
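A hedged sketch of the same prediction with predict(), assuming the data sit in a data frame wages with the factor levels coded exactly as in the output above:

fit <- lm(wage ~ age + maritl + race + education + jobclass + health + health_ins,
          data = wages)
newx <- data.frame(age = 45, maritl = "1. Never Married", race = "1. White",
                   education = "3. Some College", jobclass = "1. Industrial",
                   health = "1. <=Good", health_ins = "2. No")
predict(fit, newdata = newx)   # ~79.5 (79.29 with the rounded coefficients used above)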


Statistics for management
E&I AY 2021/22

Cluster Analysis: Exercises


EXERCISE 1
Suppose we have four observations, for which the following dissimilarity matrix has been computed:

A B C D
A 0 0.3 0.4 0.7
B 0 0.5 0.8
C 0 0.45
D 0

Build the dendrogram as a result of:
1. Complete linkage method (MAX)
2. Single linkage method (MIN)


Solution EXERCISE 1

1. Complete linkage method (MAX)

Starting from 0.3 the first cluster is AB. Let's update the distance matrix:

d(AB,C) = max(d(A,C), d(B,C)) = max(0.4, 0.5) = 0.5        d(AB,D) = max(d(A,D), d(B,D)) = max(0.7, 0.8) = 0.8



C D AB
C 0 0.45 0.5

D 0 0.8
AB 0


Considering 0.45 the second cluster is CD. The dendrogram is the following:
(Dendrogram: A and B merge at height 0.3, C and D at 0.45, and the two clusters join at 0.8.)
2. Single linkage method (MIN)

Starting from 0.3 the first cluster is AB. Let's update the distance matrix:

d(AB,C) = min(d(A,C), d(B,C)) = min(0.4, 0.5) = 0.4        d(AB,D) = min(d(A,D), d(B,D)) = min(0.7, 0.8) = 0.7

C D AB
C 0 0.45 0.4
D 0 0.7
AB 0

Considering 0.4 the second cluster is ABC. The dendrogram is the following:
(Dendrogram: A and B merge at 0.3, C joins at 0.4, and D joins at 0.45.)
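Both hierarchies can be reproduced with base R's hclust():

d <- as.dist(matrix(c(0, 0.3, 0.4, 0.7,
                      0.3, 0, 0.5, 0.8,
                      0.4, 0.5, 0, 0.45,
                      0.7, 0.8, 0.45, 0), 4, 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))
plot(hclust(d, method = "complete"))   # AB at 0.3, CD at 0.45, merged at 0.8
plot(hclust(d, method = "single"))     # AB at 0.3, C joins at 0.4, D at 0.45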
EXERCISE 2
Suppose we have four observations, for which the following similarity matrix has been computed:

A B C D
A 1 0.2 0.7 0.6
B 1 0.5 0.3
C 1 0.4
D 1

Build the dendrogram as a result of:
1. Complete linkage method (MAX)
2. Single linkage method (MIN)

Solution EXERCISE 2

1. Complete linkage method (MAX)

Note that the matrix contains similarities, so clusters are merged starting from the largest value,
and the complete linkage (farthest neighbour) criterion takes the minimum similarity between clusters.

Starting from 0.7 the first cluster is AC. Let's update the similarity matrix:

s(AC,B) = min(s(A,B), s(C,B)) = min(0.2, 0.5) = 0.2        s(AC,D) = min(s(A,D), s(C,D)) = min(0.6, 0.4) = 0.4

B D AC
B 1 0.3 0.2
D 1 0.4
AC 1

Considering 0.4 the second cluster is AC-D. The dendrogram is the following:
(Dendrogram: A and C merge at similarity 0.7, D joins at 0.4, and B joins at 0.2.)
2. Single linkage method (MIN)

With similarities, the single linkage (nearest neighbour) criterion takes the maximum similarity between clusters.

Starting from 0.7 the first cluster is AC. Let's update the similarity matrix:

s(AC,B) = max(s(A,B), s(C,B)) = max(0.2, 0.5) = 0.5        s(AC,D) = max(s(A,D), s(C,D)) = max(0.6, 0.4) = 0.6

B D AC
B 1 0.3 0.5
D 1 0.6
AC 1

Considering 0.6 the second cluster is AC-D. The dendrogram is the following:
(Dendrogram: A and C merge at 0.7, D joins at 0.6, and B joins at 0.5.)
EXERCISE 3
Suppose we have five observations, for which the following distance matrix has been computed; use the
complete and single linkage methods to find the clusters.

A B C D E
A 0 1 5 6 8
B 0 3 8 7
C 0 4 6
D 0 2
E 0

Solution EXERCISE 3

Complete linkage method (MAX)

1st cluster: AB (smallest distance, d(A,B) = 1)

Updated distance matrix:

     C   D   E   AB
C    0   4   6   5
D        0   2   8
E            0   8
AB               0

2nd cluster: DE (d(D,E) = 2)

Updated distance matrix:

     C   DE  AB
C    0   6   5
DE       0   8
AB           0

3rd cluster: ABC (d(AB,C) = 5); finally ABC and DE merge at 8.

Clusters: ABC, DE

Single linkage method (MIN)

1st cluster: AB (d(A,B) = 1)

Updated distance matrix:

     C   D   E   AB
C    0   4   6   3
D        0   2   6
E            0   7
AB               0

2nd cluster: DE (d(D,E) = 2)

Updated distance matrix:

     C   DE  AB
C    0   4   3
DE       0   6
AB           0

3rd cluster: ABC (d(AB,C) = 3); finally ABC and DE merge at 4.

Clusters: ABC, DE





EXERCISE 4
Suppose we have five observations, for which the following similarity matrix has been computed:

A B C D E
A 1 0.6 0.4 0.2 0.5
B 1 0.9 0.5 0.8
C 1 0.1 0.7
D 1 0.3
E 1

Computing the updated similarity matrices, build the dendrogram as a result of (i) the single
linkage method (MIN) (ii) the complete linkage method (MAX).

Solution EXERCISE 4

(i) the single linkage method (MIN)

With similarities, the single linkage (nearest neighbour) criterion takes the maximum similarity between clusters.

Starting from 0.9 the first cluster is BC. Let's update the similarity matrix:
s(BC,A) = max(s(A,B), s(A,C)) = max(0.6, 0.4) = 0.6
s(BC,D) = max(s(B,D), s(C,D)) = max(0.5, 0.1) = 0.5
s(BC,E) = max(s(B,E), s(C,E)) = max(0.8, 0.7) = 0.8

BC A D E
BC 1 0.6 0.5 0.8
A 1 0.2 0.5
D 1 0.3
E 1

Considering 0.8 the second cluster is BC-E. The updated similarity matrix is:

s(BCE,A) = max(s(BC,A), s(E,A)) = max(0.6, 0.5) = 0.6
s(BCE,D) = max(s(BC,D), s(E,D)) = max(0.5, 0.3) = 0.5

BCE A D
BCE 1 0.6 0.5
A 1 0.2
D 1

Considering 0.6 the next cluster is BCEA. The dendrogram is:
(Dendrogram: B and C merge at similarity 0.9, E joins at 0.8, A joins at 0.6, and finally D at 0.5.)
(ii) the complete linkage method (MAX)

With similarities, the complete linkage (farthest neighbour) criterion takes the minimum similarity between clusters.

Starting from 0.9 the first cluster is BC. Let's update the similarity matrix:
s(BC,A) = min(s(A,B), s(A,C)) = min(0.6, 0.4) = 0.4
s(BC,D) = min(s(B,D), s(C,D)) = min(0.5, 0.1) = 0.1
s(BC,E) = min(s(B,E), s(C,E)) = min(0.8, 0.7) = 0.7

BC A D E
BC 1 0.4 0.1 0.7
A 1 0.2 0.5
D 1 0.3
E 1

Considering 0.7 the second cluster is BC-E. The updated similarity matrix is:

s(BCE,A) = min(s(BC,A), s(E,A)) = min(0.4, 0.5) = 0.4
s(BCE,D) = min(s(BC,D), s(E,D)) = min(0.1, 0.3) = 0.1

BCE A D
BCE 1 0.4 0.1
A 1 0.2
D 1

Considering 0.4 the next cluster is BCEA. The dendrogram is:
(Dendrogram: B and C merge at similarity 0.9, E joins at 0.7, A joins at 0.4, and finally D at 0.1.)
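Since hclust() expects dissimilarities, a common trick for similarity matrices is to cluster d = 1 − s; the merge order is then the same as in the hand solutions, at heights equal to 1 minus the similarities:

s <- matrix(c(1, 0.6, 0.4, 0.2, 0.5,
              0.6, 1, 0.9, 0.5, 0.8,
              0.4, 0.9, 1, 0.1, 0.7,
              0.2, 0.5, 0.1, 1, 0.3,
              0.5, 0.8, 0.7, 0.3, 1), 5, 5,
            dimnames = list(LETTERS[1:5], LETTERS[1:5]))
d <- as.dist(1 - s)
plot(hclust(d, method = "single"))     # BC, BCE, BCEA, then D (heights 0.1, 0.2, 0.4, 0.5)
plot(hclust(d, method = "complete"))   # same merge order, heights 0.1, 0.3, 0.6, 0.9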
Statistics for management
E&I AY 2022/23

EXERCISES LOGISTIC REGRESSION MODEL SOLVED

EXERCISE 1
Consider a logistic model in which the Y variable is the outcome of a job interview
(HIRED: YES/NO). We want to predict the probability of being hired, based on the predictors Gender
(Male/Female) and holding of a Degree (YES/NO). Results are shown in the following R output.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2685 0.3773 -3.362 0.000774***
DegreeYES 1.1775 0.4335 2.716 0.006599**
GenderMale 1.2685 0.4386 2.892 0.003825**

a) Given the estimated model, what is the probability of being hired for a woman without a degree?
b) What is the effect of having a degree on the probability of being hired?

α = −1.2685;  β1 = 1.1775 (DegreeYES);  β2 = 1.2685 (GenderMale)

X1 = 0  ← Degree = NO
X2 = 0  ← Gender = Female

The probability of being hired for a woman without a degree is:

P(Y = 1) = e^(α + β1·X1 + β2·X2) / (1 + e^(α + β1·X1 + β2·X2)) = e^(−1.2685 + 1.1775·0 + 1.2685·0) / (1 + e^(−1.2685 + 1.1775·0 + 1.2685·0)) = e^(−1.2685) / (1 + e^(−1.2685)) = 0.2195

The odds ratio of being hired associated with having a degree is: e^(1.1775) = 3.2

This means that having a degree, compared to not having one, multiplies the odds
of being hired by 3.2.
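Both numbers in R, using the coefficients reported above (plogis(x) is 1/(1 + e^(−x))):

a <- -1.2685; b_degree <- 1.1775; b_male <- 1.2685
plogis(a + b_degree * 0 + b_male * 0)   # P(hired) for a woman without a degree: 0.2195
exp(b_degree)                           # odds ratio for holding a degree: ~3.25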

EXERCISE 2
If a logistic model estimates, for a given event, is P(Y=1/X)=1/4 what is the odds estimated by the
model for the same event?

Odds = p/(1 − p);  p = 1/4 = 0.25;  1 − p = 0.75

Odds = 0.25/0.75 = 0.33
EXERCISE 3
Let's consider the following classification table resulting from a logistic model (rows: predicted
smoker status; columns: observed status). What is the correct classification rate? And the
sensitivity and specificity?

                 Observed YES   Observed NO
Predicted YES        310            287
Predicted NO         121            199

The correct classification rate is equal to (310 + 199)/(310 + 287 + 121 + 199) = 509/917 = 0.5551

Sensitivity = 310/(310+121) = 0.719

Specificity = 199/(199+287) = 0.409
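The three rates from the confusion matrix in R:

tab <- matrix(c(310, 121, 287, 199), 2, 2,
              dimnames = list(predicted = c("YES", "NO"), observed = c("YES", "NO")))
sum(diag(tab)) / sum(tab)              # correct classification rate: 0.5551
tab["YES", "YES"] / sum(tab[, "YES"])  # sensitivity: 0.719
tab["NO", "NO"] / sum(tab[, "NO"])     # specificity: 0.409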

EXERCISE 4
We are interested in identifying consumers who reply positively to advertising (the "true
positives"). A logistic model classifies a consumer into the "positives" group if the estimated
probability for this group is equal to or greater than 0.2. Results: the percentage of incorrect
classifications among the observed positives is 16%; the percentage of incorrect classifications
among the observed negatives is 35%. What are the sensitivity and specificity of the classification rule?

Let's indicate in the table the percentage of incorrect classifications for the observed positives
(16%). Its complement, 84%, is the sensitivity. The same reasoning gives the specificity.

                                Observed status
                                Positives   Negatives
Status predicted   Positives      84%         35%
by the model       Negatives      16%         65%

Total                            100%        100%

84% is the sensitivity


65% is the specificity

EXERCISE 5
Consider a sample of 462 males in a heart-disease high-risk region of the Western Cape, South
Africa. The response variable is chd, i.e. whether the individual suffers from heart disease or not.
The goal is to study the determinants of heart disease, such as personal characteristics and/or
behaviour of the individuals. The variables considered are:

sbp systolic blood pressure
tobacco cumulative tobacco (kg)
ldl low density lipoprotein cholesterol
adiposity
famhist family history of heart disease (Present, Absent)
typea type-A behavior
obesity
alcohol current alcohol consumption
age age at onset
chd response, coronary heart disease

The summary statistics and the results of two logistic regression models and the corresponding
confusion matrix are in the following tables.

sbp tobacco ldl adiposity famhist
Min. :101.0 Min. : 0.0000 Min. : 0.980 Min. : 6.74 Absent :270
1st Qu.:124.0 1st Qu.: 0.0525 1st Qu.: 3.283 1st Qu.:19.77 Present:192
Median :134.0 Median : 2.0000 Median : 4.340 Median :26.11
Mean :138.3 Mean : 3.6356 Mean : 4.740 Mean :25.41
3rd Qu.:148.0 3rd Qu.: 5.5000 3rd Qu.: 5.790 3rd Qu.:31.23
Max. :218.0 Max. :31.2000 Max. :15.330 Max. :42.49

typea obesity alcohol age chd


Min. :13.0 Min. :14.70 Min. : 0.00 Min. :15.00 Min. :0.0000
1st Qu.:47.0 1st Qu.:22.98 1st Qu.: 0.51 1st Qu.:31.00 1st Qu.:0.0000
Median :53.0 Median :25.80 Median : 7.51 Median :45.00 Median :0.0000
Mean :53.1 Mean :26.04 Mean : 17.04 Mean :42.82 Mean :0.3463
3rd Qu.:60.0 3rd Qu.:28.50 3rd Qu.: 23.89 3rd Qu.:55.00 3rd Qu.:1.0000
Max. :78.0 Max. :46.58 Max. :147.19 Max. :64.00 Max. :1.0000

MODEL 1
Call:
glm(formula = chd ~ typea + age + alcohol + famhist + tobacco +
ldl, family = binomial, data = southAf)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.4635237 0.9260840 -6.979 2.96e-12 ***
typea 0.0370655 0.0121680 3.046 0.00232 **
age 0.0505720 0.0102340 4.942 7.75e-07 ***
alcohol 0.0008268 0.0044158 0.187 0.85147
famhistPresent 0.9051115 0.2263255 3.999 6.36e-05 ***
tobacco 0.0794551 0.0263199 3.019 0.00254 **
ldl 0.1628561 0.0551413 2.953 0.00314 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null deviance: 596.11 on 461 degrees of freedom


Residual deviance: 475.65 on 455 degrees of freedom
AIC: 489.65
chd
glm.pred2 0 1
No 256 73
Yes 46 87

MODEL 2
glm(formula = chd ~ typea + obesity + age + famhist + sbp, family = binomial,
data = southAf)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.544967 1.189465 -5.502 3.75e-08 ***
typea 0.039038 0.011847 3.295 0.000984 ***
obesity -0.018358 0.027349 -0.671 0.502070
age 0.062872 0.009762 6.440 1.19e-10 ***
famhistPresent 0.924630 0.220338 4.196 2.71e-05 ***
sbp 0.007442 0.005616 1.325 0.185122
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null deviance: 596.11 on 461 degrees of freedom


Residual deviance: 493.65 on 456 degrees of freedom
AIC: 505.65

chd
glm.pred3 0 1
No 252 82
Yes 50 78

a. Given that the individuals with heart disease are 160 calculate the odds of heart disease.

Odds=p(x) / (1-p(x)) = (160/462) / (302/462) = 160/302 = 0.53

b. On the basis of: the estimates of the coefficient, the index to evaluate the goodness of fit
of the model and the confusion matrix, choose the models you think best fit the data
(report all the calculation needed to justify your response).

Model 1                                          Model 2
AIC = 489.65                                     AIC = 505.65
6 variables, 1 to drop (alcohol)                 5 variables, 2 to drop (obesity, sbp)
% CC = [(256+87)/462] x 100 = 74 %               % CC = [(252+78)/462] x 100 = 71 %
Sens. = [87/(87+73)] x 100 = 54.4 %              Sens. = [78/(78+82)] x 100 = 48.8 %
Spec. = [256/(256+46)] x 100 = 84.8 %            Spec. = [252/(252+50)] x 100 = 83.4 %

According to all the elements presented above the best model is model 1.

c. Given the chosen model, what is the effect on the probability of heart disease of having a
family history of heart disease?

The effect on the probability is positive: having a family history of heart disease increases the
probability of heart disease. To quantify the effect we can calculate the odds ratio:

exp(0.905) = 2.47

Having a family history of heart disease multiplies the odds of heart disease by 2.47 with
respect to not having it.

d. Give the chosen model and the descriptive statistics reported above, calculate the
probability of heart disease of the “median” individual.

p(median individual) = exp(η)/(1 + exp(η)) = exp(−1.337)/(1 + exp(−1.337)) = 0.263/1.263 = 0.21

where:

η = −6.46 + 0.037·53 + 0.051·45 + 0.0008·7.51 + 0.079·2 + 0.162·4.34 = −1.337

(the median values of typea, age, alcohol, tobacco and ldl; famhist = Absent for the median
individual, so it contributes 0)
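The same calculation in R:

eta <- -6.46 + 0.037 * 53 + 0.051 * 45 + 0.0008 * 7.51 + 0.079 * 2 + 0.162 * 4.34
plogis(eta)   # P(chd = 1) for the median individual: ~0.21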
