Assignment: Coursework
Submission Deadline: 19-NOV-2020
DECLARATION
I understand that, unless specifically designated as Group work, all coursework submissions are
individual, and that sharing my work, either electronically or in paper, is prohibited. If my
submission is copied by another student, I will also be considered to have actively taken part in
plagiarism.
I certify that I have read the definition of plagiarism given overleaf, and that the work submitted
for this coursework assignment is my own work, except where specifically indicated otherwise.
In signing this document, I agree that this work may be submitted to an electronic plagiarism
test at any time and I will provide a further version of this work in an appropriate format when
requested:
Note: Until an assignment carries this completed front page it will not be accepted for marking.
Late submission penalties are 50% for the first 24 hours and 100% thereafter.
PLAGIARISM
You are reminded that all work submitted as part of the requirements for any examination (including
coursework) of Imperial College must be expressed in your own writing and incorporate your own
ideas and judgements.
Plagiarism, that is the presentation of another person's thoughts or words as though they are your own,
must be avoided with particular care in coursework, essays and reports written in your own time.
Note that you are encouraged to read and criticise the work of others as much as possible. You
are expected to incorporate this in your thinking and in your coursework and assessments. But
you must acknowledge and label your sources.
Direct quotations from the published or unpublished work of others, from the internet, or from any
other source must always be clearly identified as such. A full reference to their source must be
provided in the proper form and quotation marks used. Remember that a series of short quotations
from several different sources, if not clearly identified as such, constitutes plagiarism just as much as
a single unacknowledged long quotation from a single source. Equally if you summarise another
person's ideas, judgements, figures, diagrams or software, you must refer to that person in your text,
and include the work referred to in your bibliography and/or reference list. Departments are able to
give advice about the appropriate use and correct acknowledgement of other sources in your
own work.
The direct and unacknowledged repetition of your own work, which has already been submitted
for assessment, can constitute self-plagiarism. Where group work is submitted, this should be
presented in a way approved by your department. You should therefore consult your tutor or
course director if you are in any doubt about what is permissible. You should be aware that you
have a collective responsibility for the integrity of group work submitted for assessment.
The use of the work of another student, past or present, constitutes plagiarism. Where work is
used without the consent of that student, this will normally be regarded as a major offence of
plagiarism.
Failure to observe any of these rules may result in an allegation of cheating. Cases of suspected
plagiarism will be dealt with under the College's Cheating Offences Policy and Procedures and
may result in a penalty being taken against any student found guilty of plagiarism.
Q1.
We specify a model that is linear in its parameters. As we have only two variables, the model has
a single explanatory variable. We take income as the explanatory (independent) variable x and
explain car ownership, the dependent variable y, with respect to income, on the reasoning that the
more a person earns, the more cars they can afford to buy.
$$y_i = \alpha + \beta x_i + \epsilon_i$$
where
- $\alpha$ and $\beta$ are the parameters and $\epsilon_i$ is the random error of observation $i$
- $y_i$ is observation $i$ on the dependent variable $y$ (the variable to explain)
- $x_i$ is observation $i$ on the independent variable $x$
We assume the $\epsilon_i$ are i.i.d. $N(0, \sigma_\epsilon^2)$, which implies the Gauss-Markov conditions, the key
hypotheses of linear regression:
- $E(\epsilon_i) = 0$ for all $i$
- $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$: error term is uncorrelated with covariates.
- $\mathrm{Var}(\epsilon_i) = \sigma_\epsilon^2$ for all $i$
- $\epsilon_i$ is statistically independent of $\epsilon_j$ for $i \neq j$
We express the error as $\epsilon_i = y_i - (\alpha + \beta x_i)$. The objective is to minimise the sum of squares of the
errors $\epsilon_i$.
The sum of squared errors is:
$$S_\epsilon^2 = \sum_{i=1}^{n} \left( y_i - (\alpha + \beta x_i) \right)^2$$
By differentiating $S_\epsilon^2$ with respect to the parameters $\alpha$ and $\beta$, then equating to zero, we obtain:
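$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
$$\hat{\beta} = \frac{S_{xy}^2}{S_{xx}^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$
where
$$S_{xy}^2 = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{and} \quad S_{xx}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
With
$$S_{xy}^2 = 10758.1, \qquad S_{xx}^2 = 801312056.5$$
we find
$$\hat{\beta} = 1.343\text{E-05} \ \text{(cars/inhabitant)/dollar}, \qquad \hat{\alpha} = 1.296 \ \text{cars/inhabitant}$$
Those are the estimated values of the parameters.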
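The closed-form estimators above can be sketched in a few lines of Python. This is only an illustration: the function name `ols_fit` and the toy data are hypothetical and do not come from the coursework data set. The only check made against the report is the slope, which is recomputed from the two sums quoted in the text.

```python
from statistics import fmean


def ols_fit(x, y):
    """Least-squares estimates (alpha_hat, beta_hat) for y = alpha + beta * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    # Sums of cross-products and squares about the means (S_xy^2 and S_xx^2 in the text).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Slope recomputed from the two sums reported in the text.
S_XY = 10758.1
S_XX = 801312056.5
beta_from_sums = S_XY / S_XX

# Hypothetical toy data lying exactly on y = 1 + 2x (not the coursework data).
toy_alpha, toy_beta = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

print(f"beta from reported sums: {beta_from_sums:.3e}")
print(f"toy fit: alpha = {toy_alpha}, beta = {toy_beta}")
```

The intercept cannot be recomputed in the same way, because the sample means of x and y are not quoted in the text.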
Under the Gauss-Markov conditions, these estimators are BLUE (Best Linear Unbiased Estimators).
The estimators are:
- Unbiased
- Of minimum variance among linear unbiased estimators
- Convergent (consistent)
Coefficient of determination:
$$r^2 = \frac{S_m^2}{S_t^2}$$
where
$$S_m^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 0.1444, \qquad S_t^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 0.7062$$
$$r^2 = 0.2045$$
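As a quick arithmetic check, the ratio can be recomputed from the two sums of squares quoted above. The variable names are illustrative only, and the residual sum of squares is obtained from the usual decomposition of the total sum of squares for a least-squares fit with an intercept.

```python
# Sums of squares as reported in the text.
S_M = 0.1444  # explained (model) sum of squares
S_T = 0.7062  # total sum of squares

# Share of the total variation captured by the fitted line.
r_squared = S_M / S_T

# Residual sum of squares from the decomposition S_t^2 = S_m^2 + S_eps^2.
s_eps_sq = S_T - S_M

print(f"r^2 = {r_squared:.4f}, residual sum of squares = {s_eps_sq:.4f}")
```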
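Standard errors:
We obtain:
$$s_\beta = \frac{s}{S_{xx}}$$
where $S_{xx} = \sqrt{S_{xx}^2}$ and $s^2 = \frac{S_\epsilon^2}{n-2}$ is an estimate of the variance of the population error ($n = 17$).
$$S_\epsilon^2 = S_t^2 - S_m^2 = 0.562$$
$$s = \sqrt{\frac{0.562}{15}} = 0.19$$
$$s_\alpha = s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}^2}}$$
$$s_\beta = 6.837\text{E-06}, \qquad s_\alpha = 0.1742$$
Those are the estimated values of the standard errors of the parameters.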
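The standard error of the slope can be reproduced from the quantities already reported. The sketch below uses only numbers quoted in the text. The helper `s_alpha` is a hypothetical name, and it is not evaluated on the coursework data here because the sample mean of income is not quoted.

```python
import math

N = 17                 # number of observations
S_T = 0.7062           # total sum of squares
S_M = 0.1444           # explained sum of squares
S_XX = 801312056.5     # sum of squared deviations of x (S_xx^2 in the text)

s_eps_sq = S_T - S_M                  # residual sum of squares
s = math.sqrt(s_eps_sq / (N - 2))     # residual standard error
s_beta = s / math.sqrt(S_XX)          # standard error of the slope


def s_alpha(s, n, x_bar, s_xx):
    """Standard error of the intercept; x_bar is the sample mean of x."""
    return s * math.sqrt(1.0 / n + x_bar ** 2 / s_xx)


print(f"s = {s:.4f}, s_beta = {s_beta:.3e}")
```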
Q2.
Coefficient of determination:
We found $r^2 = 0.2045$. The coefficient of determination measures the proportion of the variation in
y (car ownership) explained by the systematic part of the model, x (income). Here, about 20% of
the variation in car ownership is explained by income.
In addition, fitting the linear regression in R gives the “adjusted coefficient of
determination” (Figure 2). Unlike the ordinary coefficient of determination, it accounts for the
number of variables used in the model. According to R, the adjusted r-squared is 0.1515, or about 15%.
After adjustment, income therefore explains only about 15% of the variation in car ownership, which
is low and should warn us that the model performs poorly. The closer the coefficient of determination
is to 0, the weaker the linear association between the variables. The graph of the linear regression
(Figure 1) confirms that the correlation between the two variables is unsatisfactory, as the points
are widely scattered.
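The adjusted value reported by R can be checked by hand with the standard adjustment formula, which penalises $r^2$ for the number of explanatory variables $k$. A minimal sketch, where the function name is illustrative:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted coefficient of determination for n observations and k regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Values from the text: r^2 = 0.2045, n = 17 observations, k = 1 explanatory variable.
adj_r2 = adjusted_r_squared(0.2045, 17, 1)
print(f"adjusted r^2 = {adj_r2:.4f}")
```

With a single explanatory variable the adjustment only rescales the unexplained share by (n - 1)/(n - 2), so the gap between the two values shrinks as the sample grows.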
[Figure 1. Linear regression. Horizontal axis: Income ($)]
We now assess the explanatory power of the model by testing its overall significance with two
methods: a test on the coefficient of determination and the F-test. These are calibration tests,
which check whether the model is appropriate for the data to which it has been fitted.
$H_0: r = 0$
$H_1: r \neq 0$
Under the assumption that the error term is normally distributed with a mean of 0, the test statistic
$t_r$ follows Student's t-distribution with $(n-2)$ degrees of freedom:
$$t_r = r\sqrt{\frac{n-2}{1-r^2}} = \frac{S_m}{s} = 1.96$$
We have $t_{n-2,\,p/2} = t_{15,\,0.025} = 2.131$. Then $t_r < t_{15,\,0.025}$ and we cannot reject $H_0$ at level 0.05.
We can conclude that the coefficient of determination of the model is not significantly
different from 0 at level 0.05.
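The test statistic and the decision can be reproduced from the reported values. In this sketch the critical value is copied from the text rather than computed, and r is taken as the positive square root of $r^2$ because the estimated slope is positive.

```python
import math

N = 17               # number of observations
R_SQUARED = 0.2045   # coefficient of determination from the text
T_CRIT = 2.131       # t_{15, 0.025}, as quoted in the text

r = math.sqrt(R_SQUARED)                       # positive root: the fitted slope is positive
t_r = r * math.sqrt((N - 2) / (1.0 - R_SQUARED))

# Two-sided test at level 0.05.
reject_h0 = abs(t_r) > T_CRIT
print(f"t_r = {t_r:.2f}, reject H0: {reject_h0}")
```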
F-test:
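We can also test the overall significance of the model with a test based on the F-distribution, where:
$H_0: \beta = 0$
$H_1: \beta \neq 0$
This test is a form of analysis of variance (ANOVA). We obtain an F-distributed test statistic with
$(k, n-(k+1))$ degrees of freedom, where $k$ is the number of independent variables. Here $k = 1$ and $n = 17$.
$$F = \frac{S_m^2 / k}{S_\epsilon^2 / (n-(k+1))} = \frac{0.1444}{0.562/15} = 3.856$$
The upper critical value is $f_{0.05,1,15} = 4.54$. Since $F = 3.856 < 4.54$, we cannot reject $H_0$ at level 0.05.
The lower quantile is obtained by taking the reciprocal with the degrees of freedom swapped:
$$f_{0.95,1,15} = \frac{1}{f_{0.05,15,1}} = \frac{1}{245.9} \approx 0.004$$
It is not needed for the decision, because the F-test rejects only for large values of $F$.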
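The F statistic can be recomputed in two ways from the numbers already reported: directly from the sums of squares, and from $r^2$. The sketch below takes the critical value from the text rather than computing it, and the variable names are illustrative.

```python
N, K = 17, 1       # observations and explanatory variables
S_M = 0.1444       # explained sum of squares
S_EPS = 0.562      # residual sum of squares (rounded, as in the text)
R_SQUARED = 0.2045
F_CRIT = 4.54      # f_{0.05, 1, 15}, as quoted in the text

dof_resid = N - (K + 1)

# Ratio of the explained and residual mean squares.
f_stat = (S_M / K) / (S_EPS / dof_resid)

# Equivalent form in terms of the coefficient of determination.
f_from_r2 = (R_SQUARED / K) / ((1.0 - R_SQUARED) / dof_resid)

reject_h0 = f_stat > F_CRIT
print(f"F = {f_stat:.3f} (from sums), {f_from_r2:.3f} (from r^2), reject H0: {reject_h0}")
```

The two forms differ slightly only because the inputs are rounded. With a single explanatory variable, F is the square of the t statistic for the slope, so the two tests necessarily lead to the same decision.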
Neither test rejects its null hypothesis at level 0.05, so we cannot accept the model as a
well-performing one.
On this evidence, income is not shown to have a significant effect on car ownership. The model should be fitted again.
Q3.
Discussion of the performance of model
- $E(\epsilon_i) = 0$ for all $i$ (normality)
- $\mathrm{Var}(\epsilon_i) = \sigma_\epsilon^2$ for all $i$ (homoscedasticity)
- $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$: error term is uncorrelated with covariates (homogeneity-linearity)
- $\epsilon_i$ is statistically independent of $\epsilon_j$ for $i \neq j$ (independence)
This graph lets us check whether the variance of the errors is constant (homoscedasticity). If the
variance is constant, the red line should be horizontal and the values should be evenly spread.
Here, the red line is close to horizontal, so we cannot reject the model on this basis. However, we
can observe an outlier that could “disturb” the model: value number 13, circled in red (Figure 3).
Figure 4. Residual vs Leverage
This plot shows whether some points unduly influence the linear regression. We do not look for patterns
but for outlying values that fall beyond the boundaries at the upper right and lower right corners. Here
no values lie beyond the limits, so we cannot deduce any undue influence on the regression results.
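For readers who want to see what lies behind such a plot, the sketch below computes the two usual ingredients by hand for a simple regression: the leverage of each observation and its Cook's distance. This assumes the boundaries in the figure are Cook's distance contours, as in R's standard residuals vs leverage diagnostic plot, which the text does not state explicitly. The function name and the toy data are hypothetical.

```python
from statistics import fmean


def influence_measures(x, y):
    """Return (leverage, Cook's distance) lists for a simple least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    alpha = y_bar - beta * x_bar
    residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate
    # Leverage grows with the squared distance of x_i from the mean of x.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
    # Cook's distance combines residual size and leverage.
    cooks = [
        (e ** 2 / (p * s2)) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return leverage, cooks


# Hypothetical toy data (not the coursework data): the last x value is far from the rest.
X_TOY = [1.0, 2.0, 3.0, 4.0, 10.0]
Y_TOY = [1.1, 1.9, 3.2, 3.9, 6.0]
leverage, cooks = influence_measures(X_TOY, Y_TOY)

for i, (h, d) in enumerate(zip(leverage, cooks), start=1):
    print(f"obs {i}: leverage = {h:.3f}, Cook's distance = {d:.3f}")
```

A point needs both a large residual and a high leverage to obtain a large Cook's distance, which is why the plot is read in its right-hand corners.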
This plot shows whether the residuals are normally distributed. The points should follow the
straight line and not deviate from it. Here, the points more or less follow the straight line, which
means the assumption $\epsilon_i \sim N(0, \sigma_\epsilon^2)$ is respected. We can observe 2 severe outliers which
could be removed to obtain a better model: number 13, which is the same as in Figure 5,
and number 16.
In Figure 7, we can also assume that the condition $E(\epsilon_i) = 0$ for all $i$ is respected.
However, the points in this graph follow a curved trend, which could indicate
a relationship between x and y that we did not take into account.
As the model meets the assumptions on the errors, we cannot totally reject it. It should be
fitted again, as its functional form was not correct. We may also have omitted other explanatory
variables.