Coursework2 QUANT

Coursework and Project Cover Sheet
Department of Civil and Environmental Engineering
Cluster: Transport and business management
Module: Quantitative Methods
Assignment: Coursework
Assignment Setter: Daniel GRAHAM
Submission Deadline: 19-NOV-2020
DECLARATION
I understand that, unless specifically designated as Group work, all coursework submissions are
individual, and that sharing my work, either electronically or in paper, is prohibited. If my
submission is copied by another student, I will also be considered to have actively taken part in
plagiarism.
I certify that I have NOT:

- Shared my coursework with any other person.
- Given my coursework to someone else to submit on my behalf.
Signature: Name: Samuel GUIOSE
I certify that I have read the definition of plagiarism given overleaf, and that the work submitted
for this coursework assignment is my own work, except where specifically indicated otherwise.
In signing this document, I agree that this work may be submitted to an electronic plagiarism
test at any time and I will provide a further version of this work in an appropriate format when
requested.
CID: 01977602 Date: 19-NOV-2020
Note: Until an assignment carries this completed front page it will not be accepted for marking.
TO BE COMPLETED BY THE MARKER
Grade awarded: ____________________ Late penalty applied: ______________________________
Late submission penalties are 50% for the 24 hours and 100% thereafter.
PLAGIARISM
You are reminded that all work submitted as part of the requirements for any examination (including
coursework) of Imperial College must be expressed in your own writing and incorporate your own
ideas and judgements.
Plagiarism, that is the presentation of another person's thoughts or words as though they are your own,
must be avoided with particular care in coursework, essays and reports written in your own time.
Note that you are encouraged to read and criticise the work of others as much as possible. You
are expected to incorporate this in your thinking and in your coursework and assessments. But
you must acknowledge and label your sources.
Direct quotations from the published or unpublished work of others, from the internet, or from any
other source must always be clearly identified as such. A full reference to their source must be
provided in the proper form and quotation marks used. Remember that a series of short quotations
from several different sources, if not clearly identified as such, constitutes plagiarism just as much as
a single unacknowledged long quotation from a single source. Equally if you summarise another
person's ideas, judgements, figures, diagrams or software, you must refer to that person in your text,
and include the work referred to in your bibliography and/or reference list. Departments are able to
give advice about the appropriate use and correct acknowledgement of other sources in your
own work.
The direct and unacknowledged repetition of your own work, which has already been submitted
for assessment, can constitute self-plagiarism. Where group work is submitted, this should be
presented in a way approved by your department. You should therefore consult your tutor or
course director if you are in any doubt about what is permissible. You should be aware that you
have a collective responsibility for the integrity of group work submitted for assessment.
The use of the work of another student, past or present, constitutes plagiarism. Where work is
used without the consent of that student, this will normally be regarded as a major offence of
plagiarism.
Failure to observe any of these rules may result in an allegation of cheating. Cases of suspected
plagiarism will be dealt with under the College's Cheating Offences Policy and Procedures and
may result in a penalty being taken against any student found guilty of plagiarism.
Q1.
We set up a linear-in-parameters model. As we have only two variables, the model has a single
explanatory variable. We take income as the explanatory (independent) variable x and explain car
ownership, the dependent variable y, with respect to income, on the grounds that the more people
earn, the more cars they can afford to buy.
The model takes the form:
$$y_i = \alpha + \beta x_i + \epsilon_i$$
where
- $\alpha$ and $\beta$ are the parameters and $\epsilon_i$ is the random error of observation $i$
- $y_i$ is observation $i$ on the dependent variable $y$ (the variable to explain)
- $x_i$ is observation $i$ on the independent variable $x$
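We assume $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\epsilon^2)$, which implies the Gauss-Markov conditions, the key hypotheses
for linear regression:
- $E(\epsilon_i) = 0$ for all $i$
- $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$: the error terms are uncorrelated with one another.
- $\mathrm{Var}(\epsilon_i) = \sigma_\epsilon^2$ for all $i$
- $\epsilon_i$ is statistically independent of $\epsilon_j$ for $i \neq j$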
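We estimate the parameters by the least squares method.
We express the error as $\epsilon_i = y_i - (\alpha + \beta x_i)$. The objective is to minimise the sum of squares of the
errors $\epsilon_i$.
The sum of squared errors is:
$$S_\epsilon^2 = \sum_{i=1}^{n} \left( y_i - (\alpha + \beta x_i) \right)^2$$
By differentiating $S_\epsilon^2$ with respect to the parameters $\alpha$ and $\beta$, then equating to zero, we obtain:
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
$$\hat{\beta} = \frac{S_{xy}^2}{S_{xx}^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$
where
$$S_{xy}^2 = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{and} \quad S_{xx}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
and $\bar{x}$ and $\bar{y}$ are the means of the 2 samples.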
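The differentiation step can be written out as follows. This is a sketch of the standard derivation in the notation used above; the intermediate lines are not taken from the original working.

```latex
\begin{align*}
\frac{\partial S_\epsilon^2}{\partial \alpha}
  &= -2 \sum_{i=1}^{n} \left( y_i - \alpha - \beta x_i \right) = 0
  &&\Rightarrow\quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \\
\frac{\partial S_\epsilon^2}{\partial \beta}
  &= -2 \sum_{i=1}^{n} x_i \left( y_i - \alpha - \beta x_i \right) = 0
  &&\Rightarrow\quad \sum_{i=1}^{n} x_i \left( (y_i - \bar{y}) - \hat{\beta}\,(x_i - \bar{x}) \right) = 0 \\
\hat{\beta}
  &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
   = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because $\sum_i \bar{x}\,(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}\,(x_i - \bar{x}) = 0$.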

$\bar{x} = 24541$ dollars
$\bar{y} = 1.625$ cars/inhabitant
$S_{xy}^2 = 10758.1$
$S_{xx}^2 = 801312056.5$
We find
$\hat{\beta} = 1.343 \times 10^{-5}$ (cars/inhabitant)/dollar
$\hat{\alpha} = 1.296$ cars/inhabitant
Those are the estimated values of the parameters.
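The estimates can be checked from the summary statistics alone. The short Python sketch below assumes only the four values reported above (the raw observations are not reproduced in this report) and the variable names are illustrative.

```python
# Summary statistics reported in the text; the raw data are not available here.
x_bar = 24541.0       # mean income, dollars
y_bar = 1.625         # mean car ownership, cars/inhabitant
s_xy = 10758.1        # sum of cross-products, written S_xy^2 in the text
s_xx = 801312056.5    # sum of squared deviations of x, written S_xx^2 in the text

# Least squares estimators of the slope and the intercept.
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

print(f"beta_hat  = {beta_hat:.4g} (cars/inhabitant)/dollar")
print(f"alpha_hat = {alpha_hat:.4g} cars/inhabitant")
```

Up to the rounding of the reported means, the results agree with the values quoted above.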
Because the Gauss-Markov conditions are assumed to hold, these least squares estimators are BLUE
(Best Linear Unbiased Estimators). The estimators are:
- Unbiased
- With variance as small as possible among linear unbiased estimators
- Convergent (consistent)
Coefficient of determination:
$$r^2 = \frac{S_m^2}{S_t^2}$$
where
$$S_m^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 0.1444$$
$$S_t^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 0.7062$$
$$r^2 = 0.2045$$
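Standard errors:
We obtain:
- $$s_\beta = \frac{s}{S_{xx}}$$
where $S_{xx} = \sqrt{S_{xx}^2}$ and $s^2 = \dfrac{S_\epsilon^2}{n-2}$ is an estimate of the variance of the population error ($n = 17$):
$$S_\epsilon^2 = S_t^2 - S_m^2 = 0.562$$
$$s = \sqrt{\frac{0.562}{15}} = 0.19$$
- $$s_\alpha = s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}^2}}$$
$s_\beta = 6.837 \times 10^{-6}$
$s_\alpha = 0.1742$
Those are the estimated values of the standard errors of the parameters.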
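The standard errors can be reproduced in the same way. The sketch below assumes only the sums of squares and summary statistics reported above; the variable names are illustrative.

```python
import math

# Values reported in the text.
n = 17
x_bar = 24541.0
s_xx = 801312056.5    # S_xx^2 in the text
ss_total = 0.7062     # S_t^2
ss_model = 0.1444     # S_m^2

# Residual sum of squares and residual standard deviation.
ss_resid = ss_total - ss_model
s = math.sqrt(ss_resid / (n - 2))

# Standard errors of the slope and of the intercept.
se_beta = s / math.sqrt(s_xx)
se_alpha = s * math.sqrt(1.0 / n + x_bar ** 2 / s_xx)

print(f"s        = {s:.4f}")
print(f"se_beta  = {se_beta:.4g}")
print(f"se_alpha = {se_alpha:.4f}")
```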
Q2.
Coefficient of determination:
We found $r^2 = 0.2045$. The coefficient of determination measures the proportion of the variation in
y (car ownership) explained by the systematic part of the model, x (income). This means that about
20% of the variation in car ownership is explained by income.
Besides, by running the linear regression in R, we can access the "adjusted coefficient of
determination" (Figure 2). Unlike the "normal" coefficient of determination, it takes into account the
number of variables used in the model. According to R, the adjusted r-squared is 0.1515, about 15%.
So, after this adjustment, income accounts for only about 15% of the variation in car ownership, which
is low and should warn us that the model does not perform well. The closer the coefficient of
determination is to 0, the weaker the linear association between the variables. Looking at the graph of
the linear regression (Figure 1), we can see that the correlation between the two variables is not
satisfactory, as the points are widely scattered around the fitted line.
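Both coefficients can be recomputed from the sums of squares. The sketch below assumes the usual definition of the adjusted coefficient, $1 - (1 - r^2)(n-1)/(n-k-1)$, with $k = 1$ explanatory variable; the inputs are the rounded values reported in Q1.

```python
# Values reported in Q1.
n = 17             # number of observations
k = 1              # number of explanatory variables
ss_model = 0.1444  # S_m^2
ss_total = 0.7062  # S_t^2

# Coefficient of determination and its adjusted version.
r2 = ss_model / ss_total
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

print(f"r2     = {r2:.4f}")
print(f"adj_r2 = {adj_r2:.4f}")
```

The adjusted value is always below $r^2$ when $k \geq 1$ and $r^2 < 1$, and the gap widens as the sample gets smaller.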
Figure 1. Linear regression on Excel: car ownership against income ($), with fitted line f(x) = 1.34255884402106E-05 x + 1.29581912095734 and R² = 0.204515464928242
Figure 2. Linear model on R
We now assess the explanatory power of the model. For that, we test the overall significance of the
model with 2 methods: a test on the correlation coefficient behind the coefficient of determination,
and the F-test. Those are tests of calibration, used to see whether or not the model is appropriate for
the data to which it has been fitted.
- First, we test the significance of the correlation coefficient r at level 0.05.
We test the null hypothesis:
$H_0: r = 0$
$H_1: r \neq 0$
Under the assumption that the error term is normally distributed with a mean of 0, the test statistic
$t_r$ has Student's t distribution with (n-2) degrees of freedom under $H_0$:
$$t_r = r \sqrt{\frac{n-2}{1-r^2}} = \frac{S_m}{s} = 1.96$$
We have $t_{n-2,\,p/2} = t_{15,\,0.025} = 2.131$. Then $t_r < t_{15,\,0.025}$ and we cannot reject $H_0$ at level 0.05.
- We can conclude that the correlation coefficient, and hence the coefficient of determination of the
model, is not significantly different from 0 at level 0.05.
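The test statistic can be recomputed directly from $r^2$ and $n$. In the sketch below the critical value 2.131 is the tabulated one quoted above, not a value computed by the code.

```python
import math

n = 17
r2 = 0.2045     # coefficient of determination from Q1
t_crit = 2.131  # tabulated t value for 15 degrees of freedom, quoted in the text

# t statistic for H0: r = 0.
t_r = math.sqrt(r2) * math.sqrt((n - 2) / (1.0 - r2))

# Two-sided test at level 0.05: reject H0 only if |t_r| exceeds the critical value.
reject_h0 = abs(t_r) > t_crit

print(f"t_r = {t_r:.3f}, reject H0: {reject_h0}")
```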
- F-test:
We can also test the overall significance of the model with a test based on the F-distribution, where:
$H_0: \beta = 0$
$H_1: \beta \neq 0$
This test is a form of analysis of variance (ANOVA) test. We obtain an F-distributed test statistic with
(k, n-(k+1)) degrees of freedom, where k is the number of independent variables. Here k = 1 and n =
17.
$$F = \frac{S_m^2 / k}{S_\epsilon^2 / (n-(k+1))} = \frac{0.1444}{0.562/15} = 3.856$$
The F-test is one-sided: we reject $H_0$ only for large values of F, that is, if $F > f_{0.05,1,15} = 4.54$.
We have $F < f_{0.05,1,15}$, so we cannot reject $H_0$.
- $\beta$ is not significantly different from 0 at level 0.05.
Neither of those two tests allows us to accept this model as one that performs well.
At the 5% level, we do not find significant evidence that income affects car ownership. This is not
the same as showing that the two are unrelated. The model should be fitted again.
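The F statistic of the overall significance test can be checked as follows. The sketch assumes the sums of squares from Q1 and takes the critical value 4.54 from the tables, as in the text. With a single explanatory variable, F is the square of the t statistic, so the F-test and the t-test necessarily lead to the same decision.

```python
import math

n = 17
k = 1
ss_model = 0.1444  # S_m^2
ss_resid = 0.562   # S_eps^2
f_crit = 4.54      # tabulated upper 5% point of F(1, 15), quoted in the text

# ANOVA F statistic with (k, n - (k + 1)) degrees of freedom.
f_stat = (ss_model / k) / (ss_resid / (n - (k + 1)))

# One-sided test: reject H0 only for large values of F.
reject_h0 = f_stat > f_crit

print(f"F = {f_stat:.3f}, sqrt(F) = {math.sqrt(f_stat):.3f}, reject H0: {reject_h0}")
```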
Q3.
Discussion of the performance of the model
- We can first do some qualitative checks:

o Expected signs: the sign of $\beta$ is coherent, as the more you earn the more you can
afford to buy cars. So, we could say that the predictions are plausible.
- Inspection of residuals:
o Omission of explanatory variable
Let's discuss the validity of the Gauss-Markov conditions:
- $E(\epsilon_i) = 0$ for all $i$ (zero-mean errors)
- $\mathrm{Var}(\epsilon_i) = \sigma_\epsilon^2$ for all $i$ (homoscedasticity)
- $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$: the error terms are uncorrelated with one another
- $\epsilon_i$ is statistically independent of $\epsilon_j$ for $i \neq j$ (independence)
Figure 3. R spread-location plot
This graph lets us check whether the variance of the errors is constant (homoscedasticity). If the
variance is constant, the red line should be horizontal and the values should be equally spread around it.
Here, the red line is close to horizontal, so we cannot reject the model on this basis. However, we can
observe an outlier that could "disturb" the model: value number 13, circled in red (Figure 3).
Figure 4. Residual vs Leverage
This plot shows whether some points unduly influence the linear regression. We don't look for patterns
but for outlying values which fall beyond the boundaries (the Cook's distance contours) in the upper
right and lower right corners. Here we don't see any values beyond the limits, so we cannot deduce
any undue influence on the regression results.
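The quantities behind a residual-versus-leverage plot can be computed by hand for a one-variable regression. The sketch below uses hypothetical data, not the coursework data, and standard textbook formulas for leverage, standardised residuals and Cook's distance. It is an illustration only and does not reproduce the output of R.

```python
import math

# Hypothetical (income, car ownership) pairs for illustration only;
# these are NOT the coursework observations.
xs = [18000.0, 21000.0, 24000.0, 27000.0, 30000.0, 38000.0]
ys = [1.40, 1.70, 1.55, 1.80, 1.60, 1.85]

n = len(xs)
p = 2  # number of estimated parameters (alpha and beta)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares fit and residuals.
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar
residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys)]
s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate

# Leverage, standardised residual and Cook's distance of each observation.
leverages = [1.0 / n + (x - x_bar) ** 2 / s_xx for x in xs]
std_residuals = [e / math.sqrt(s2 * (1.0 - h)) for e, h in zip(residuals, leverages)]
cooks = [r ** 2 * h / (p * (1.0 - h)) for r, h in zip(std_residuals, leverages)]

for i, (h, r, d) in enumerate(zip(leverages, std_residuals, cooks), start=1):
    print(f"obs {i}: leverage={h:.3f}  std_residual={r:+.3f}  cook={d:.3f}")
```

An observation combines high leverage (an x value far from the mean) with a large standardised residual when its Cook's distance is large. Such points are the ones that would fall beyond the boundaries in the corners of the plot.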
Figure 5. Residual vs Fitted

This plot shows whether or not there is a linear relationship between the independent and
dependent variables by revealing patterns in the residuals (assumptions of linearity and
independence of errors).
The red line should appear horizontal to show that there is a linear relationship between x
and y, and the residuals should be equally spread around this horizontal line. Here we cannot see
a clear pattern. We can observe a small shift between the fitted values 1.55 and 1.65, with
residual number 13 (circled in red in Figure 5). We can accept this plot as a "good" one.
Figure 6. Normal Q-Q
This plot shows whether the residuals are normally distributed. The points should follow the
straight line and not deviate from it. Here, the points more or less follow the straight line, which
means the assumption $\epsilon_i \sim N(0, \sigma_\epsilon^2)$ is respected. We can observe 2 severe outliers which
could be removed to obtain a better model: number 13, which is the same as in Figure 5,
and number 16.
In Figure 7, we can also assume that $E(\epsilon_i) = 0$ for all $i$ is a respected condition.
The fact that this graph shows the points following a curved trend could indicate
that there is a relationship between x and y that we did not take into account.
Figure 7. Excel table
As the model meets the assumptions on the errors, we cannot totally reject it. However, it should be
fitted again, as its functional form does not appear to be correct. We may also have omitted other
explanatory variables.
