
CH/LF

ST22 - Advanced Statistical Methods – 2020/2021

Exercises WITH SOLUTIONS

Exercise 1: Testing a model

A staff restaurant conducted a survey collecting data from a random sample of 32 clients.
They were asked, among other things: how many times they ate at the restaurant during the
last month (variable: FREQUENCY); how much they spent on a meal (variable:
SPENDING); how old they were (variable: AGE).

The restaurant manager would like to construct a model that would explain the spending
amount in terms of the frequency and the age for all clients.

You will find below:


Appendix 1: scatterplot of SPENDING w.r.t. FREQUENCY
Appendix 2: scatterplot of SPENDING w.r.t. AGE
Appendix 3: linear regression results obtained using a statistical package modeling
SPENDING in terms of FREQUENCY and AGE.

You are asked to carry out the different tests required to validate the linear
regression model for the population of clients from which the sample was drawn.

Appendix 1 (1st scatterplot)

Response variable: SPENDING
Explanatory variable: FREQUENCY

Appendix 2 (2nd scatterplot)

Response variable: SPENDING
Explanatory variable: AGE

Appendix 3 (multiple linear regression model)


Response variable: SPENDING
Explanatory variables: FREQUENCY and AGE

Variable #1 (SPENDING)
Mean 4.17188
Corrected Standard Deviation 1.53249
Variable #2 (FREQUENCY)
Mean 10.59375
Corrected Standard Deviation 6.76738
Variable #3 (AGE)
Mean 35.75
Corrected Standard Deviation 11.6453
Count n 32
R-square 0.58999

Coefficient Standard Error


Intercept 6.13803
FREQUENCY -0.17208 0.02782
AGE -0.00401 0.01617

ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 29.8504
Total ?
72.8047

SOLUTIONS:
ANOVA TABLE
Source       SS        df               MS           F            p-value
Regression   42.9543   2                21.47715     20.8652933   < 0.01
Residual     29.8504   29 = 32-(2+1)    1.02932414
Total        72.8047   31 = 32-1

SS Regression = 72.8047 - 29.8504 = 42.9543

Overall test (F test)

$H_0$: $\beta_1 = \beta_2 = 0$
$H_1$: at least one of the $\beta_j$ is not 0, $j = 1, 2$

From the F table (df1 = 2; df2 = 29), the critical value at the 5% level is 3.33.

The observed statistic F = 20.87 exceeds this critical value, so there is enough statistical evidence to reject the null hypothesis (no model).
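
As a rough check of the ANOVA table and the overall F test, the arithmetic can be replayed in a few lines of Python. This is only a sketch: the variable names are illustrative, and the critical value 3.33 is the one read from the F table above, not computed by the code.

```python
# ANOVA table for Exercise 1, rebuilt from the figures given in Appendix 3.
n = 32                  # sample size
k = 2                   # explanatory variables: FREQUENCY and AGE
ss_total = 72.8047      # total sum of squares (given)
ss_residual = 29.8504   # residual sum of squares (given)

# Regression sum of squares is obtained by difference.
ss_regression = ss_total - ss_residual
df_regression = k
df_residual = n - (k + 1)

# Mean squares and the F statistic.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual

# Critical value taken from the F table (df1 = 2, df2 = 29, 5 %), as quoted in the text.
f_critical = 3.33
reject_h0 = f_statistic > f_critical

# Cross-check: SS regression / SS total should match the reported R-square (0.58999).
r_square = ss_regression / ss_total

print(f"SS regression = {ss_regression:.4f}, df = {df_regression}")
print(f"MS regression = {ms_regression:.5f}, MS residual = {ms_residual:.8f}")
print(f"F = {f_statistic:.4f}, reject H0: {reject_h0}")
print(f"R-square = {r_square:.5f}")
```

The last line doubles as a consistency check between the ANOVA table and the R-square reported by the statistical package.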

Variable     Coefficient   Standard Error   t             p-value
Intercept    6.13803
FREQUENCY    -0.17208      0.02782          -6.18547807   < 0.05
AGE          -0.00401      0.01617          -0.24799011   > 0.05

Frequency:

95% CI for $\beta_1$:

$$\hat{\beta}_1 \pm t_{29,\alpha/2} \times s_{\hat{\beta}_1} = [-0.17208 \pm 2.045 \times 0.02782] = [-0.22897 ; -0.11519]$$

We run a Student test for the coefficient $\beta_1$ of the explanatory variable FREQUENCY as follows:
$H_0$: $\beta_1 = 0$ in the presence of AGE
$H_1$: $\beta_1 \neq 0$ in the presence of AGE

We calculate the test statistic

$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{-0.17208}{0.02782} = -6.18547807$$

The critical value of a Student distribution with df = 29 and a type I error risk
$\alpha = 0.05$ is $t_{29;0.025} = 2.045$ (two-tailed test).
$|t| \gg$ critical value, so we can reject $H_0$.

Age:

95% CI for $\beta_2$:

$$\hat{\beta}_2 \pm t_{29,\alpha/2} \times s_{\hat{\beta}_2} = [-0.00401 \pm 2.045 \times 0.01617] = [-0.03708 ; 0.02906]$$

We run a Student test for the coefficient $\beta_2$ of the explanatory variable AGE as follows:
$H_0$: $\beta_2 = 0$ in the presence of FREQUENCY
$H_1$: $\beta_2 \neq 0$ in the presence of FREQUENCY

We calculate the test statistic

$$t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{-0.00401}{0.01617} = -0.24799011$$

The critical value of a Student distribution with df = 29 and a type I error risk
$\alpha = 0.05$ is $t_{29;0.025} = 2.045$ (two-tailed test).
$|t| <$ critical value, so we cannot reject $H_0$.
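
The two Student tests and their confidence intervals follow the same recipe, so they can be sketched with one small helper. The function name `coefficient_test` is hypothetical and introduced only for illustration; the critical value 2.045 is the tabulated one quoted above.

```python
def coefficient_test(estimate, std_error, t_critical):
    """Two-tailed Student test of H0: beta = 0, plus the matching confidence interval."""
    t_statistic = estimate / std_error
    margin = t_critical * std_error
    return {
        "t": t_statistic,
        "ci": (estimate - margin, estimate + margin),
        "reject_h0": abs(t_statistic) > t_critical,
    }


# Tabulated value t(29; 0.025), as quoted in the solution.
T_CRITICAL_DF29 = 2.045

# Coefficients and standard errors copied from Appendix 3.
frequency = coefficient_test(-0.17208, 0.02782, T_CRITICAL_DF29)
age = coefficient_test(-0.00401, 0.01617, T_CRITICAL_DF29)

for name, result in (("FREQUENCY", frequency), ("AGE", age)):
    low, high = result["ci"]
    print(f"{name}: t = {result['t']:.5f}, "
          f"95% CI = [{low:.5f} ; {high:.5f}], reject H0: {result['reject_h0']}")
```

The test and the interval agree by construction: a 95% interval that contains 0, as for AGE, corresponds to not rejecting $H_0$ at the 5% level.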

Exercise 2: Choosing a model and using it

The HR director of an industrial group would like to construct a model explaining the
monthly salary of all employees.
Using data collected from a random sample of 36 employees, he tests two explanatory
variables that he deems relevant: the number of years of graduate studies (X1) and the
number of years of service (X2).
You can find below results from three regression models he tested using Excel.

1) For each of the three suggested models:


a. Calculate the coefficient of determination r²
b. Conduct the Student tests for the explanatory variables.

2) Can you help the HR director choose the most suitable model?
a. Using the information provided, which model would you suggest using? Justify
your choice.
b. Estimate the parameters of the chosen model.

3) Pierre Durand, an employee of this group, is 38 years old, with 10 years of service
and 4 years of graduate studies. His monthly salary is 2050 Euros and he thinks he is
underpaid.
Calculate a 95% confidence interval for the mean salary of an employee with Pierre
Durand’s profile.
If you were the HR director of that firm, what would you tell Pierre Durand about his
salary?

Variable   Mean    Sample standard deviation*
Y          1850    795.88
X1         3.5     2.40
X2         4.17    2.15
* Reminder: this is the unbiased point estimate (computed with the coefficient 1/(n-1)).

Regression of Y w.r.t. X1
ANALYSIS OF VARIANCE
             Sum of Squares
Regression   ?
Residual     694 925
Total        22 170 000

              Coefficients   Standard error
Constant      706
Variable X1   326.9          10.08

Regression of Y w.r.t. X2
ANALYSIS OF VARIANCE
             Sum of Squares
Regression   ?
Residual     22 156 025
Total        ??

              Coefficients   Standard error
Constant      1 811.2
Variable X2   9.32           63.62

Regression of Y w.r.t. X1, X2

ANALYSIS OF VARIANCE
             Sum of Squares
Regression   ?
Residual     681 981.4
Total        ?

              Coefficients   Standard error
Constant      742
Variable X1   327.27         10.15
Variable X2   -8.98          11.35

SOLUTIONS:

1) For each of the three suggested models:


a. Calculate the coefficient of determination r²
b. Conduct the Student tests for the explanatory variables.

a. r²=SSRegression/SSTotal=(SSTotal-SSResidual)/SSTotal

Regression of Y w.r.t X1
ANALYSIS OF VARIANCE
Sum of Squares
Regression 21475075
Residual 694 925
Total 22 170 000

r²=(22170000-694925)/22170000= 21475075/22170000=0,96865471

Regression of Y w.r.t. X2
ANALYSIS OF VARIANCE
Sum of Squares
Regression 13975
Residual 22 156 025
Total 22 170 000

r²=(22170000-22 156 025)/22170000=13975/22170000=0,00063036

Regression of Y w.r.t. X1, X2


ANALYSIS OF VARIANCE
Sum of Squares
Regression 21488018,6
Residual 681 981.4
Total 22 170 000

r²=(22170000-681 981.4)/22170000=21488018,6/22170000=0,96923855
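
The three coefficients of determination all come from the same formula, so they can be recomputed together. The sketch below assumes, as the solution does, that the total sum of squares 22 170 000 is shared by the three regressions; the dictionary keys are just labels chosen for this example.

```python
# Total sum of squares of Y, common to the three regressions.
SS_TOTAL = 22_170_000

# Residual sums of squares copied from the three Excel outputs.
ss_residual = {
    "X1": 694_925,
    "X2": 22_156_025,
    "X1, X2": 681_981.4,
}

# r2 = SS regression / SS total = (SS total - SS residual) / SS total
r_squared = {
    model: (SS_TOTAL - ss_res) / SS_TOTAL
    for model, ss_res in ss_residual.items()
}

for model, r2 in r_squared.items():
    print(f"Regression of Y w.r.t. {model}: r2 = {r2:.8f}")
```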

b.

Regression of Y w.r.t. X1
              Coefficients   Standard error
Constant      706
Variable X1   326.9          10.08

We run a Student test for the coefficient $\beta_1$ of the explanatory variable X1 as follows:
$H_0$: $\beta_1 = 0$
$H_1$: $\beta_1 \neq 0$

We calculate the test statistic

$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{326.9}{10.08} = 32.4305556$$

The critical value of a Student distribution with df = n-2 = 36-2 = 34 and a type I
error risk $\alpha = 0.05$, $t_{34;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$ and
$t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.
$|t| \gg$ critical value, so we can reject $H_0$.

Regression of Y w.r.t. X2
              Coefficients   Standard error
Constant      1 811.2
Variable X2   9.32           63.62

We run a Student test for the coefficient $\beta_2$ of the explanatory variable X2 as follows:
$H_0$: $\beta_2 = 0$
$H_1$: $\beta_2 \neq 0$

We calculate the test statistic

$$t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{9.32}{63.62} = 0.14649481$$

The critical value of a Student distribution with df = n-2 = 36-2 = 34 and a type I
error risk $\alpha = 0.05$, $t_{34;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$ and
$t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.
$|t| \ll$ critical value, so we do not reject $H_0$.

Regression of Y w.r.t. X1, X2


              Coefficients   Standard error
Constant      742
Variable X1   327.27         10.15
Variable X2   -8.98          11.35

X1:
We run a Student test for the coefficient $\beta_1$ of the explanatory variable X1 as follows:
$H_0$: $\beta_1 = 0$ in the presence of X2
$H_1$: $\beta_1 \neq 0$ in the presence of X2

We calculate the test statistic

$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{327.27}{10.15} = 32.2433498$$

The critical value of a Student distribution with df = n-(k+1) = n-3 = 36-3 = 33 and a
type I error risk $\alpha = 0.05$, $t_{33;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$
and $t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.

$|t| \gg$ critical value, so we can reject $H_0$.

X2:
We run a Student test for the coefficient $\beta_2$ of the explanatory variable X2 as follows:
$H_0$: $\beta_2 = 0$ in the presence of X1
$H_1$: $\beta_2 \neq 0$ in the presence of X1

We calculate the test statistic

$$t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{-8.98}{11.35} = -0.79118943$$

The critical value of a Student distribution with df = n-(k+1) = n-3 = 36-3 = 33 and a
type I error risk $\alpha = 0.05$, $t_{33;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$
and $t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.
$|t| <$ critical value, so we cannot reject $H_0$.
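
Because the exact critical values for df = 33 and df = 34 are not tabulated, the decision relies on bracketing them between the df = 40 and df = 30 entries. A minimal sketch of that reasoning is given below; the helper `student_decision` and the model labels are hypothetical names used only for this illustration.

```python
# Two-tailed 5 % critical values quoted in the solution (Student table).
T_CRIT_DF30 = 2.042
T_CRIT_DF40 = 2.021


def student_decision(estimate, std_error):
    """Return (t, decision), bracketing the true critical value in [t40, t30]."""
    t = estimate / std_error
    if abs(t) > T_CRIT_DF30:
        # Larger than the upper bound of the bracket: significant for df = 33 or 34.
        decision = "reject H0"
    elif abs(t) < T_CRIT_DF40:
        # Smaller than the lower bound of the bracket: not significant.
        decision = "do not reject H0"
    else:
        # Falls inside the bracket: the table alone cannot settle it.
        decision = "undecided from the table"
    return t, decision


# (coefficient, standard error) pairs copied from the three Excel outputs.
tests = {
    "Y ~ X1: X1": (326.9, 10.08),
    "Y ~ X2: X2": (9.32, 63.62),
    "Y ~ X1 + X2: X1": (327.27, 10.15),
    "Y ~ X1 + X2: X2": (-8.98, 11.35),
}

results = {name: student_decision(b, se) for name, (b, se) in tests.items()}

for name, (t, decision) in results.items():
    print(f"{name:16s} t = {t:9.4f} -> {decision}")
```

None of the four statistics falls inside the bracket, so the tabulated values are enough to decide every test here.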

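2.a Choose the most suitable model

From 1.b, we choose the regression of Y w.r.t. X1: X1 is significant in every model where it
appears, whereas X2 is not significant either alone or in the presence of X1. From 1.a, adding
X2 to X1 raises r² only from 0.9687 to 0.9692.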

2.b parameters

              Coefficients   Standard error
Constant      706
Variable X1   326.9          10.08

$\hat{\beta}_0 = 706$ ; $\hat{\beta}_1 = 326.9$

ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 694 925
Total 22 170 000

$S^2 = SS_{Residual}/(n-2) = 694925/34 = 20438.9706$

3) Pierre Durand, an employee of this group, is 38 years old, with X2=10 years of
service and X1=4 years of graduate studies. His monthly salary is Y=2050 Euros and he
thinks he is underpaid.
Calculate a 95% confidence interval for the mean salary of an employee with Pierre
Durand’s profile.

We use the model chosen in 2.a

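95% CI for $E(y)$ when $x_p = 4$:

$$\hat{y} \pm t_{n-2,\alpha/2} \times s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} = 2013.6 \pm 2.042 \times 142.964928 \times \sqrt{\frac{1}{36} + \frac{(4 - 3.5)^2}{35 \times 2.40^2}}$$

$$= [2013.6 \pm 2.042 \times 142.964928 \times 0.17034629] = [2013.6 \pm 49.7299378] = [1963.87006 ; 2063.32994]$$

where:

$s_\varepsilon = \sqrt{S^2} = 142.964928$
$\hat{y} = 706 + 326.9 \times 4 = 2013.6$
$t_{34;0.025} \approx 2.042$, using the Student table with df = 30.
$\sum (x - \bar{x})^2 = (n-1) \times s_x^2 = 35 \times 2.40^2$
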
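
The confidence interval for the mean salary can be retraced numerically as follows. This is a sketch under the same assumptions as the solution: the simple regression on X1 is used, and the critical value 2.042 is the df = 30 table entry standing in for df = 34. Variable names are illustrative.

```python
import math

n = 36                     # sample size
ss_residual = 694_925      # residual sum of squares, regression of Y w.r.t. X1
b0, b1 = 706, 326.9        # estimated intercept and slope
x_mean, s_x = 3.5, 2.40    # mean and sample standard deviation of X1
x_p = 4                    # years of graduate studies for the profile of interest
t_critical = 2.042         # table value for df = 30, used as an approximation for df = 34

# Residual standard deviation s_eps = sqrt(SS residual / (n - 2)).
s_eps = math.sqrt(ss_residual / (n - 2))

# Point prediction of the mean salary at x_p.
y_hat = b0 + b1 * x_p

# Sum of squared deviations of X1, recovered from its sample standard deviation.
sxx = (n - 1) * s_x ** 2

# Half-width of the 95 % confidence interval for E(y) at x_p.
half_width = t_critical * s_eps * math.sqrt(1 / n + (x_p - x_mean) ** 2 / sxx)
ci = (y_hat - half_width, y_hat + half_width)

print(f"s_eps = {s_eps:.6f}, y_hat = {y_hat:.1f}")
print(f"95% CI for E(y) at x = {x_p}: [{ci[0]:.2f} ; {ci[1]:.2f}]")
```

Comparing an observed salary with the returned bounds is then a simple inequality check against `ci[0]` and `ci[1]`.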
If you were the HR director of that firm, what would you tell Pierre Durand about his
salary?

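His salary (2050 Euros) lies within the CI [1963.87 ; 2063.33] for the mean salary of an
employee with his profile.

He is not underpaid (in fact, his salary is higher than the predicted mean of 2013.6 Euros!).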
