Unit 2: Review of Correlation and The General Linear Model (GLM)

Objectives
At the end of this unit students are expected to be able to
1. define correlation and explain what it measures,
2. state the properties of correlation,
3. define partial correlation and relate it to regression coefficients,
4. distinguish between correlation and partial correlation,
5. calculate partial correlation using recursive formulae and/or matrix methods,
6. state the 5 main assumptions of the General Linear Model,
7. give a meaningful interpretation of the regression coefficients,
8. state the 3 main properties of Ordinary Least Squares (OLS) estimators,
9. conduct diagnostic tests for goodness of fit of a model,
10. test the significance of each regression coefficient,
11. construct confidence intervals for the parameters of the model,
12. make forecasts using regression models.
This Unit presents one of the most commonly used regression models in econometric
modelling, namely the general linear model.
1.1 Correlation and Regression
Correlation is a scale-free measure of the degree or strength of the relationship between
two or more variables. The relationship between two variables can be linear or non-linear,
so in principle correlation can be measured in a variety of ways. It is, however, customary
to postulate a linear functional form for the relationship, and we shall do the same. It is
important to note that, while association between variables is often characterised as
linear, this is only an assumption.
Definition 1.1 Let X and Y be two random variables. Then the correlation between
X and Y is defined as
$$ \rho = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)}\,\sqrt{\operatorname{var}(Y)}} $$
From your courses on regression you may have already learnt that correlation has the
following properties.
(i) −1 ≤ ρ ≤ 1
(ii) ρ = 1 implies perfect positive correlation
(iii) ρ = −1 implies perfect negative correlation
(iv) ρ = 0 implies no (linear) relationship.
The sample correlation ρ̂ = r, in analogy with the above, is given by
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{\sum xy - \left(\sum x\right)\left(\sum y\right)/n}{\sqrt{\left[\sum x^2 - \left(\sum x\right)^2/n\right]\left[\sum y^2 - \left(\sum y\right)^2/n\right]}} $$
where x̄ and ȳ are sample means. Formal conclusions concerning correlation can be
based on a statistical test with test statistic
$$ T = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = r \times \sqrt{\frac{n - 2}{1 - r^2}} $$
which, under the assumption of normality and the null hypothesis ρ = 0, has a Student's
t-distribution with n − 2 degrees of freedom.
Example 1.1 The following data show monthly income and expenditure in thousands
of dollars, for 5 families.
income(x) expenditure(y)
2 2.0
3 2.5
4 2.6
5 2.9
6 3.0
(i) Plot expenditure against income and comment on the relationship between the two
variables.
(ii) Estimate the correlation between income and expenditure. Hence test the hypoth-
esis that there is a positive correlation between income and expenditure. Use a
5% significance level.
Solution 1.1
(i) A plot of expenditure against income is shown in figure(1.1) below.
Figure 1.1 Plot of the data on expenditure
From the plot, it is clear that expenditure increases as income increases.

(ii) The hypotheses being tested are
H0 : ρ = 0 versus H1 : ρ > 0
The required sample calculations are
$$ \sum x = 20, \quad \sum y = 13, \quad \sum x^2 = 90, \quad \sum y^2 = 34.42, \quad \sum xy = 54.4 $$
Thus
$$ r = \frac{\sum xy - \left(\sum x\right)\left(\sum y\right)/n}{\sqrt{\left[\sum x^2 - \left(\sum x\right)^2/n\right]\left[\sum y^2 - \left(\sum y\right)^2/n\right]}} = \frac{54.4 - (20)(13)/5}{\sqrt{(90 - 80)(34.42 - 33.8)}} = 0.9639 $$
Hence the test statistic (computed with the unrounded value of r) is
$$ t = \frac{0.9639}{\sqrt{[1 - (0.9639)^2]/[5 - 2]}} = 0.9639 \times \sqrt{\frac{5 - 2}{1 - 0.9639^2}} = 6.2668 $$
The critical region is t > t_{3,0.05} = 2.353, i.e. [2.353, ∞). Since 6.2668 > 2.353, H0 is
rejected and we conclude that there is positive correlation between income and expenditure.
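The arithmetic of Solution 1.1(ii) can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative, and the critical value 2.353 is copied from the t-tables as in the text rather than computed.

```python
from math import sqrt

# Data from Example 1.1: monthly income (x) and expenditure (y), thousands of dollars.
x = [2, 3, 4, 5, 6]
y = [2.0, 2.5, 2.6, 2.9, 3.0]
n = len(x)

# Raw sums used in the computational formula for r.
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(v * v for v in x)
sum_yy = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Corrected sums of squares and cross-products.
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_xx - sum_x ** 2 / n
syy = sum_yy - sum_y ** 2 / n

# Sample correlation coefficient.
r = sxy / sqrt(sxx * syy)

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * sqrt((n - 2) / (1 - r ** 2))

# One-sided 5% critical value t_{3, 0.05}, taken from tables (not computed here).
t_crit = 2.353
reject_h0 = t_stat > t_crit

print(round(r, 4), round(t_stat, 4), reject_h0)
```

Because r is kept unrounded when the test statistic is formed, the result matches the value 6.2668 quoted in the solution.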
Simple correlation provides a measure of dependence between two random variables.
In practice, especially in multiple regression, one is often dealing with more than two
variables. Pairwise correlations can be calculated for any two variables. For a system
or vector (X1 , X2 , . . . , Xp ) consisting of p variables, the pairwise correlations
can be conveniently displayed in a matrix called the correlation matrix:
$$ \rho = (\rho_{ij}) = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix} \tag{1.1} $$
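where ρij = cor(Xi , Xj ). The correlation matrix ρ can be calculated from the covariance
matrix Σ using the relation
$$ \rho = \begin{pmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma_p} \end{pmatrix} \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} \begin{pmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma_p} \end{pmatrix} = D^{-1/2} \Sigma D^{-1/2} $$
where $\sigma_{ij} = \operatorname{cov}(X_i, X_j)$, $\sigma_i = \sqrt{\sigma_{ii}} = \sqrt{\sigma_i^2}$ and $D = \operatorname{diag}(\sigma_{11}, \ldots, \sigma_{pp})$ is the diagonal matrix of variances.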
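An estimate S of the covariance matrix Σ based on a sample of size n is given by
$$ S = (\hat{\sigma}_{ij})_{p \times p} = (s_{ij}), \quad \text{where} \quad s_{ij} = \frac{1}{n - 1} \sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j). $$
The divisor n − 1 gives the unbiased estimate and is the one used in the examples of this
Unit; dividing by n instead gives the maximum likelihood estimate under normality.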
Example 1.2 Using the expenditure data given in the above example, calculate
(i) the sample covariance matrix S
(ii) the sample correlation matrix R.
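Solution 1.2
(i) The sample covariance matrix is
$$ S = \begin{pmatrix} 2.500 & 0.600 \\ 0.600 & 0.155 \end{pmatrix} $$
(ii) The diagonal scaling matrix is
$$ \hat{D}^{-1/2} = \operatorname{diag}(1/\hat{\sigma}_1, 1/\hat{\sigma}_2) = \begin{pmatrix} 0.6325 & 0 \\ 0 & 2.5400 \end{pmatrix} $$
Hence
$$ R = \hat{D}^{-1/2} S \hat{D}^{-1/2} = \begin{pmatrix} 1.0000 & 0.9639 \\ 0.9639 & 1.0000 \end{pmatrix} $$
which agrees with the value r = 0.9639 obtained in Example 1.1.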
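The calculation in Solution 1.2 can be reproduced numerically. The sketch below assumes NumPy is available and uses the divisor n − 1, which is what yields the covariance matrix shown in the solution; the variable names are illustrative.

```python
import numpy as np

# Example 1.1 data: columns are income (x) and expenditure (y).
data = np.array([[2, 2.0],
                 [3, 2.5],
                 [4, 2.6],
                 [5, 2.9],
                 [6, 3.0]])
n = data.shape[0]

# Centre each column, then form the sample covariance matrix (divisor n - 1).
centred = data - data.mean(axis=0)
S = centred.T @ centred / (n - 1)

# D^{-1/2}: diagonal matrix of reciprocal standard deviations.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))

# Correlation matrix R = D^{-1/2} S D^{-1/2}.
R = D_inv_sqrt @ S @ D_inv_sqrt

print(np.round(S, 3))
print(np.round(R, 4))
```

The same R is returned by `np.corrcoef(data, rowvar=False)`, which is a convenient independent check.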
Another important measure of dependence between two variables of a multivariate
system is the partial correlation coefficient.
1.2 Partial correlation

Suppose that we would like to measure the correlation between, say, the number of cold
drinks X1 consumed by visitors in a summer resort and the number of tourists X2
coming to that resort. It is clear that both variables are influenced by weather conditions
such as temperature or rainfall. Let X3 be an index of temperature or rainfall. If
a large number of tourists visit the resort then one would expect a high consumption
of cold drinks and vice versa. That is, we expect the number of cold drinks X1 and
the number of tourists X2 to be positively correlated. If weather conditions change,
this relationship may be distorted to the extent that it may even become negative. Thus,
to get a more accurate and more meaningful measure of the dependence or correlation
between X1 and X2 , it is necessary to fix X3 . This leads to the following definition.
Definition 1.2 Let (X1 , X2 , . . . , Xp ) be a p-dimensional random vector with finite
second order moments. Then the partial correlation between any pair of the random
variables, say (X1 , X2 ), given X3 , . . . , Xp is defined as
$$ \rho_{12.34 \ldots p} = \operatorname{cor}(X_1, X_2 \mid X_3, \ldots, X_p). \tag{1.2} $$
The partial correlation coefficient measures the correlation between two variables when
all other variables are held constant, i.e. after removing the effect or influence of the
other (intervening) variables. It follows that, in a multivariate system, partial correlation
measures the direct association between two random variables, free of the influence of
the remaining variables. By definition, a partial correlation matrix is the conditional
correlation matrix, i.e. the correlation matrix of the conditional distribution of the first
set of variables given the second set of variables. In practice it is customary to assume
that the data follow a multivariate normal distribution. In that case an expression for
the partial correlation matrix is given in the following theorem.
Theorem 1.1 Let (X1 , X2 , . . . , Xp ) be a p-dimensional multivariate normal random
vector with covariance matrix
$$ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. $$
Then the conditional covariance matrix and conditional correlation matrix, i.e. the matrix
of partial correlations, of say X1 , . . . , Xk given Xk+1 , . . . , Xp are given respectively by
$$ \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \tag{1.3} $$
$$ P = D^{-1/2} \Sigma_{1|2} D^{-1/2} \tag{1.4} $$
where $D = \operatorname{diag}(\Sigma_{1|2})$ is a diagonal matrix of partial variances.
If p = 3 then it can easily be verified that the first-order partial correlation of X1 and
X2 given X3 , i.e. in which only one variable is held constant, is given by
$$ \rho_{12.3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}}. $$
Similarly the partial correlation of X1 and X3 given X2 is
$$ \rho_{13.2} = \frac{\rho_{13} - \rho_{12}\rho_{32}}{\sqrt{(1 - \rho_{12}^2)(1 - \rho_{32}^2)}} $$
and that of X2 and X3 given X1 is
$$ \rho_{23.1} = \frac{\rho_{23} - \rho_{21}\rho_{31}}{\sqrt{(1 - \rho_{21}^2)(1 - \rho_{31}^2)}}. $$
Example 1.3 The following data show the quarterly percentage change in earnings, i.e.
wages wt , of workers in the manufacturing industry, the reciprocal vt = 1/ut of
unemployment ut , which is the percentage of the workforce, and the lagged percentage
change pt−1 in the consumer price index (CPI), i.e. lagged quarterly inflation.
year  quarter  wt     vt = 1/ut  pt−1
1954  2        3.53   0.2312     1.2290
      3        1.74   0.1942     0.7020
      4        1.72   0.1794     0.0000
1955  1        1.71   0.1827     -0.522
      2        2.27   0.1942     -0.260
      3        4.57   0.2128     -0.173
      4        4.52   0.2260     0.0900
1956  1        5.06   0.2353     0.699
      2        5.56   0.2367     0.263
      3        4.37   0.2367     1.048
      4        5.95   0.2381     1.997
1957  1        6.42   0.2395     2.507
      2        5.26   0.2424     3.447
      3        5.24   0.2410     3.589
      4        4.08   0.2286     3.376
1958  1        3.52   0.2010     3.023
      2        3.50   0.1747     3.415
      3        3.48   0.1544     3.221
      4        2.94   0.1460     2.297
1959  1        3.88   0.1487     1.966
      2        4.35   0.1613     0.814
      3        2.88   0.1747     0.404
      4        3.33   0.1810     0.967
1960  1        3.24   0.1878     1.367
      2        2.78   0.1869     1.528
      3        3.74   0.1869     1.762
Given that the sample covariance matrix is
$$ S = \begin{pmatrix} 1.6384 & 0.0264 & 0.5924 \\ 0.0264 & 0.0010 & 0.0030 \\ 0.5924 & 0.0030 & 1.6835 \end{pmatrix}, $$
(a) find the sample correlation matrix R,
(b) calculate first-order partial correlations,
(i) using the matrix approach i.e. from the partial covariance matrix.
(ii) using explicit formulae for first-order partial correlations.
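Solution 1.3
(a) The diagonal scaling matrix is
$$ \hat{D}^{-1/2} = \operatorname{diag}(1/\hat{\sigma}_1, 1/\hat{\sigma}_2, 1/\hat{\sigma}_3) = \begin{pmatrix} 0.7812 & 0 & 0 \\ 0 & 31.5643 & 0 \\ 0 & 0 & 0.7707 \end{pmatrix} $$
Hence
$$ R = \hat{D}^{-1/2} S \hat{D}^{-1/2} = \begin{pmatrix} 1.0000 & 0.6508 & 0.3567 \\ 0.6508 & 1.0000 & 0.0726 \\ 0.3567 & 0.0726 & 1.0000 \end{pmatrix} $$
(b) (i) Using the matrix approach we have
$$ S_{11} = \begin{pmatrix} 1.6384 & 0.0264 \\ 0.0264 & 0.0010 \end{pmatrix}, \quad S_{12} = \begin{pmatrix} 0.5924 \\ 0.0030 \end{pmatrix} \quad \text{and} \quad S_{22} = 1.6835 $$
Hence
$$ S_{1|2} = S_{11} - S_{12} S_{22}^{-1} S_{21} = \begin{pmatrix} 1.4299 & 0.0253 \\ 0.0253 & 0.0010 \end{pmatrix} $$
Thus, as for the simple correlation matrix, the partial correlation matrix is given
by
$$ R_{1|2} = \hat{D}^{-1/2} S_{1|2} \hat{D}^{-1/2} = \begin{pmatrix} 1.0000 & 0.6706 \\ 0.6706 & 1.0000 \end{pmatrix} $$
where $\hat{D} = \operatorname{diag}(S_{1|2})$. It follows that the partial correlation of (X1 , X2 )
given X3 is r12.3 = ρ̂12.3 = 0.6706.
(ii) Using the explicit formulae we have
$$ r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} = \frac{0.6508 - 0.3567(0.0726)}{\sqrt{[1 - (0.3567)^2][1 - (0.0726)^2]}} = 0.6706 $$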
As we will see later partial correlations are related to regression coefficients.
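The two routes to r12.3 used in Solution 1.3 can be compared in code. The sketch below assumes NumPy; the function names `partial_corr_matrix` and `partial_corr_formula` are illustrative, not part of any library. Note that S is entered as printed, rounded to four decimals, so the matrix route gives about 0.672 rather than 0.6706; the printed values of R appear to come from unrounded data.

```python
import numpy as np

# Sample covariance matrix from Example 1.3 (order: w, v, p), as printed (rounded).
S = np.array([[1.6384, 0.0264, 0.5924],
              [0.0264, 0.0010, 0.0030],
              [0.5924, 0.0030, 1.6835]])


def partial_corr_matrix(S, keep, given):
    """Correlation matrix of the variables in `keep`, conditional on those in `given`."""
    S11 = S[np.ix_(keep, keep)]
    S12 = S[np.ix_(keep, given)]
    S22 = S[np.ix_(given, given)]
    # Conditional covariance: S11 - S12 S22^{-1} S21 (equation 1.3).
    S_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
    # Rescale by the partial standard deviations (equation 1.4).
    d = 1.0 / np.sqrt(np.diag(S_cond))
    return S_cond * np.outer(d, d)


def partial_corr_formula(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2 given variable 3."""
    return (r12 - r13 * r23) / np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# (i) Matrix approach applied to the rounded S.
r12_3_matrix = partial_corr_matrix(S, [0, 1], [2])[0, 1]

# (ii) Explicit formula applied to the correlations printed in Solution 1.3(a).
r12_3_formula = partial_corr_formula(0.6508, 0.3567, 0.0726)

# Consistency check: the formula applied to correlations derived from the same S
# must agree with the matrix approach exactly (up to floating-point error).
d = 1.0 / np.sqrt(np.diag(S))
R = S * np.outer(d, d)
r12_3_check = partial_corr_formula(R[0, 1], R[0, 2], R[1, 2])

print(round(r12_3_matrix, 4), round(r12_3_formula, 4), round(r12_3_check, 4))
```

The agreement between `r12_3_matrix` and `r12_3_check` illustrates that the explicit formula and the matrix method are the same calculation; only the rounding of the inputs differs.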
1.3 General Linear Model (GLM) and econometric modelling
In much research the objective is to establish or assess the nature of relationships,
if any, among variables of a physical, i.e. natural, system. Regression analysis is
concerned with the construction of statistical models in which one variable, the variable
of interest, is described in terms of the other variables of the system. It is by means of
the model that inferences and conclusions about the system can be made, e.g. the strength
and direction of the relationships, predictions etc.
Although in practice any form of relationship can exist among variables, a useful and
widely used class of regression models is that of linear regression models. A linear
regression model for the dependence of a response variable Y on a suspected predictor
or predictors X1 , X2 , . . . , Xp−1 is of the form
$$ Y_t = X_t' \beta + u_t $$
where $X_t = (1, X_{1t}, \ldots, X_{p-1,t})'$ is the vector of regressors for observation t,
$\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})'$ is the vector of regression coefficients and $u_t$ is a
random disturbance term.
The explanatory variables in a linear regression model can be qualitative or quantitative,
discrete or continuous etc. The model is called a general linear model since it
allows explanatory variables measured on different measurement scales e.g. nominal,
continuous, etc.
1.4 Assumptions of the General Linear Model

By definition, a model is an idealised representation of a natural system or phenomenon.
In order to be able to make some inference about the data generating mechanism being
investigated, we must make some simplifying and presumably reasonable i.e. realistic,
assumptions about the structure or behaviour of the process. The GLM has five basic
assumptions. These are:
1. LINEARITY: The general linear model Y = X′β + u assumes that
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_{p-1} X_{p-1} + u $$
with E(u) = 0. That is, the dependent variable Y is linearly related to the
independent variables X1 , . . . , Xp−1 and has mean µY = E(Y ) = X′β, where
X = (1, X1 , . . . , Xp−1 )′.
2. INDEPENDENCE: This assumption asserts that the error terms {ut }, and
hence the observations {Yt }, are independent or at least uncorrelated. In other
words, an observation in one period is independent of the observation in any
other period. That is, cov(ut , us ) = 0 for t ≠ s. When this assumption does not
hold we say that there is auto-correlation. The assumption ensures that parameter
estimates have certain optimal properties. Since in econometric modelling
the regressors can be random, it is also assumed that u and X are uncorrelated.
That is, cov(u, X) = 0.
3. HOMOGENEITY: The assumption states that variability at different times, or
in different periods, is the same, i.e. constant or homogeneous. Symbolically the
assumption states that var(ut ) = σ² for all t. When this assumption is violated
we say that there is heteroscedasticity.
4. NO MULTICOLLINEARITY: Most estimation procedures, in particular least squares
estimation, do not allow redundancy in the form of perfect linear relationships
among the regressors or explanatory variables. An exact linear relationship makes
X′X singular, so that the least squares estimates are not uniquely defined, while a
near-exact relationship results in inflated standard errors and consequently imprecise
parameter estimation. Thus, in the GLM it is assumed that the moment matrix
(X′X) is non-singular, where X is the design matrix.
5. NORMALITY: Statistical inference requires knowledge of the sampling
distributions of model parameter estimates. A common distributional assumption
in parametric statistical inference is that the observations are normally
distributed. The normality, independence and homogeneity assumptions
can be combined and expressed more compactly as ut ∼ NID(0, σ²).
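Assumption 4 lends itself to a quick numerical check. The sketch below assumes NumPy and uses a small hypothetical design matrix (intercept plus two regressors); the helper name `has_full_column_rank` is illustrative. It tests whether the design matrix has full column rank, which is equivalent to X′X being non-singular.

```python
import numpy as np

# Hypothetical design matrix: a column of ones and two regressors.
X = np.array([[1, 3, 5],
              [1, 1, 4],
              [1, 5, 6],
              [1, 2, 4],
              [1, 4, 6]], dtype=float)

# Add a redundant regressor equal to the sum of the other two:
# this creates a perfect linear relationship among the columns.
X_bad = np.column_stack([X, X[:, 1] + X[:, 2]])


def has_full_column_rank(design):
    """True when no column is an exact linear combination of the others."""
    return np.linalg.matrix_rank(design) == design.shape[1]


print(has_full_column_rank(X))      # X'X can be inverted
print(has_full_column_rank(X_bad))  # X'X is singular

# A large condition number of X'X signals near-collinearity even when the
# rank test passes; what counts as "large" is a matter of judgement.
print(np.linalg.cond(X.T @ X))
```

In practice exact collinearity is rare; the condition number (or variance inflation factors) is the more informative diagnostic for the near-collinear case.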
The next section presents a summary of common inferences that can be made with the
GLM.
1.5 Ordinary Least Squares Estimation(OLS)

In this section we summarise properties of the ordinary least squares estimators and
inferential procedures with the General Linear Model Y = Xβ + u.
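Let
$$ Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \quad \text{and} \quad u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}_{n \times 1}. $$
Then the Ordinary Least Squares (OLS) estimator β̂ of β is that value of β which
minimises the error sum of squares
$$ Q(\beta) = u'u = (Y - X\beta)'(Y - X\beta) = Y'Y - 2Y'X\beta + \beta'(X'X)\beta $$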
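As a brief worked step (a standard derivation, sketched here for completeness), the minimiser of Q(β) follows from setting the gradient to zero. It assumes X′X is non-singular, which is assumption 4 of the GLM.

```latex
\begin{aligned}
\frac{\partial Q(\beta)}{\partial \beta} &= -2X'Y + 2(X'X)\beta = 0 \\
\Longrightarrow \quad (X'X)\hat{\beta} &= X'Y \qquad \text{(the normal equations)} \\
\Longrightarrow \quad \hat{\beta} &= (X'X)^{-1}X'Y .
\end{aligned}
```

The matrix of second derivatives is 2X′X, which is positive definite when X has full column rank, so the stationary point is indeed a minimum.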
If the assumptions of the GLM are satisfied then one can proceed with the following
steps of (econometric) modeling using the GLM:
1. $\hat{\beta} = (X'X)^{-1}X'Y$ is the vector of parameter estimates.
2. $\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = HY$ is the vector of fitted values, where $H = X(X'X)^{-1}X'$.
3. A basic analysis of variance table showing a breakdown of the total variation into
systematic variation due to regression and random variation due to the disturbance
term is
source      df     SS                  MS                      F
regression  p−1    SSR = Y′[H − J]Y    MSR = SSR/(p−1)         MSR/MSE
error       n−p    SSE = Y′[I − H]Y    MSE = SSE/(n−p) = s²
total       n−1    SST = Y′[I − J]Y
where J is the n × n square matrix with entries each equal to 1/n.
4. $R^2 = SSR/SST$ is the proportion of variation explained by the model.
5. $\hat{\beta} \sim N_p[\beta, \sigma^2 (X'X)^{-1}]$ is the sampling distribution of $\hat{\beta}$, where p = k + 1.
6. $\hat{\beta}_i \pm t_{n-p,\alpha/2}\, s_{\hat{\beta}_i}$ is a 100(1 − α)% confidence interval for $\beta_i$, where $s_{\hat{\beta}_i} = s\sqrt{c_{ii}}$ and $c_{ii}$ is the ith diagonal entry of $(X'X)^{-1}$.
7. $\hat{\mu}_Y(X_0) = X_0'\hat{\beta}$ is an estimate of $\mu_Y$, or forecast of Y, at $X_0' = (1, X_1, \ldots, X_{p-1})$.
8. $\sigma_{\hat{\mu}_Y(X_0)} = \sigma \sqrt{X_0'(X'X)^{-1}X_0}$ is the standard error of $\hat{\mu}_Y(X_0)$.
9. $\hat{\mu}_Y(X_0) \pm t_{n-p,\alpha/2}\, s \sqrt{X_0'(X'X)^{-1}X_0}$ is a confidence interval for $\mu_Y(X_0)$.
10. $\hat{Y}(X_0) \pm t_{n-p,\alpha/2}\, s \sqrt{1 + X_0'(X'X)^{-1}X_0}$ is a prediction interval for $Y(X_0)$.
We now present an example to illustrate the above inferential procedures.
Example 1.4 The following data show a segment of the series on percentage wage
change Y , unemployment X1 and percentage price changes, i.e. inflation, X2 .
Y X1 X2
3 3 5
1 1 4
8 5 6
3 2 4
5 4 6
(a) Fit a general linear model Y = β0 + β1 X1 + β2 X2 + u to these data using OLS. Write
down the estimated regression model.
(b) Calculate the fitted values Ŷ.
(c) Hence compute the residuals û = Y − Ŷ.
(d) Set up the basic analysis of variance table. Hence test the significance of the
GLM. Use a 10% significance level.
(e) Determine the coefficient of determination R² (the squared multiple correlation
coefficient) and comment on this value.
(f ) Find a 95% confidence interval for β1 .
(g) Find a 95% confidence interval for µY (10, 10).
(h) Forecast, i.e. predict, the change in wages if the values of unemployment and inflation
rise to 10 and 10 respectively. Obtain a 95% prediction interval for your forecast.
Solution 1.4
(a) Let
$$ Y = \begin{pmatrix} 3 \\ 1 \\ 8 \\ 3 \\ 5 \end{pmatrix}, \quad X = \begin{pmatrix} 1 & 3 & 5 \\ 1 & 1 & 4 \\ 1 & 5 & 6 \\ 1 & 2 & 4 \\ 1 & 4 & 6 \end{pmatrix}, \quad \text{and} \quad (X'X) = \begin{pmatrix} 5 & 15 & 25 \\ 15 & 55 & 81 \\ 25 & 81 & 129 \end{pmatrix}. $$
Then
$$ (X'X)^{-1} = \begin{pmatrix} 26.7 & 4.5 & -8.0 \\ 4.5 & 1.0 & -1.5 \\ -8.0 & -1.5 & 2.5 \end{pmatrix} \quad \text{and} \quad X'Y = \begin{pmatrix} 20 \\ 76 \\ 109 \end{pmatrix}. $$
Thus the parameter estimates are $\hat{\beta} = (X'X)^{-1}X'Y = (4.0, 2.5, -1.5)'$.
The estimated regression equation is therefore Ŷ = 4.0 + 2.5X1 − 1.5X2 .
(b) The fitted values are $\hat{Y} = X\hat{\beta} = (4.0, 0.5, 7.5, 3.0, 5.0)'$.
(c) The residual values are $\hat{u} = Y - \hat{Y} = (-1.0, 0.5, 0.5, 0.0, 0.0)'$.
(d) The residual sum of squares is $SSE = \sum_{i=1}^{5} \hat{u}_i^2 = \hat{u}'\hat{u} = Y'[I - H]Y = 1.5$.
The total sum of squares is $SST = \sum_{i=1}^{5} (y_i - \bar{y})^2 = Y'[I - J]Y = 28.0$.
The regression sum of squares is $SSR = \sum_{i=1}^{5} (\hat{y}_i - \bar{y})^2 = Y'[H - J]Y = 26.5$.
The basic analysis of variance table is therefore given by
source      df   SS     MS      F
regression  2    26.5   13.25   17.67
error       2    1.5    0.75
total       4    28.0
A test of the significance of the regression model is given by
H0 : Y = β0 + u
H1 : Y = β0 + β1 X1 + β2 X2 + u
The test statistic is
$$ F = \frac{SSR/(p - 1)}{SSE/(n - p)} = \frac{26.5/2}{1.5/2} = 17.67. $$
The critical region is $[F_{2,2,0.10}, \infty) = [9, \infty)$.
Thus at the 10% level we can reject H0 and conclude that the regression model
is significant. That is, unemployment and/or inflation can be used to explain
wage changes.
(e) The coefficient of determination is $R^2 = 26.5/28.0 = 0.9464$.
This means that 94.6% of the total variation in Y is explained by the
regression.
(f ) An estimate of the error variance is
$$ \hat{\sigma}^2 = s^2 = \frac{1}{n - p} \sum_{i=1}^{5} \hat{u}_i^2 = \frac{SSE}{n - p} = \frac{1.5}{2} = 0.75 $$
Thus an estimate of the standard error of β̂1 is
$$ s_{\hat{\beta}_1} = s\sqrt{c_{11}} = 0.866\sqrt{1.0} = 0.866, $$
where $c_{11} = 1.0$ is the diagonal entry of $(X'X)^{-1}$ corresponding to β1 .
The table value is $t_{2,0.025} = 4.3$. Hence a 95% confidence interval for β1 is
2.5 ± 4.3(0.866) = (−1.2, 6.2).
(g) Substituting the values $X_0 = (1, X_1, X_2)' = (1, 10, 10)'$ into the equation
we get $\hat{\mu}_Y(1, 10, 10) = 4 + 2.5(10) - 1.5(10) = 14.0$. Further, $X_0'(X'X)^{-1}X_0 = 6.7$.
Hence a 95% confidence interval for µY (10, 10) is
$$ 14.0 \pm t_{n-p,\alpha/2}\, s \sqrt{X_0'(X'X)^{-1}X_0} = 14.0 \pm 4.3(0.866)\sqrt{6.7} = (4.36, 23.64) $$
(h) A forecast for Y when X1 = 10 and X2 = 10 is Ŷ (10, 10) = 14.0. Hence
a 95% prediction interval for Y (10, 10) is
$$ 14.0 \pm t_{n-p,\alpha/2}\, s \sqrt{1 + X_0'(X'X)^{-1}X_0} = 14.0 \pm 4.3(0.866)\sqrt{1 + 6.7} = (3.66, 24.34). $$
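The whole of Solution 1.4 can be reproduced with a short script. The sketch below assumes NumPy; the critical value t_{2,0.025} = 4.303 is copied from tables rather than computed, and the variable names are illustrative. Interval end-points may differ from those in the text in the second decimal because the text rounds the table value to 4.3.

```python
import numpy as np

# Data from Example 1.4: wage change Y, unemployment X1, inflation X2.
Y = np.array([3, 1, 8, 3, 5], dtype=float)
X = np.array([[1, 3, 5],
              [1, 1, 4],
              [1, 5, 6],
              [1, 2, 4],
              [1, 4, 6]], dtype=float)
n, p = X.shape

# (a) OLS estimates: beta_hat = (X'X)^{-1} X'Y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# (b), (c) Fitted values and residuals.
fitted = X @ beta_hat
resid = Y - fitted

# (d), (e) Analysis of variance quantities, F statistic and R^2.
SSE = resid @ resid
SST = np.sum((Y - Y.mean()) ** 2)
SSR = SST - SSE
F = (SSR / (p - 1)) / (SSE / (n - p))
R2 = SSR / SST

# (f) 95% confidence interval for beta_1.
s = np.sqrt(SSE / (n - p))
t_crit = 4.303  # t_{2, 0.025} from tables (assumed, not computed)
se_beta1 = s * np.sqrt(XtX_inv[1, 1])
ci_beta1 = (beta_hat[1] - t_crit * se_beta1, beta_hat[1] + t_crit * se_beta1)

# (g), (h) Mean response and forecast at X1 = 10, X2 = 10.
x0 = np.array([1.0, 10.0, 10.0])
forecast = x0 @ beta_hat
q = x0 @ XtX_inv @ x0  # quadratic form X0'(X'X)^{-1}X0
half_mean = t_crit * s * np.sqrt(q)
half_pred = t_crit * s * np.sqrt(1 + q)
ci_mean = (forecast - half_mean, forecast + half_mean)
pi_new = (forecast - half_pred, forecast + half_pred)

print(np.round(beta_hat, 4), round(F, 2), round(R2, 4))
print(np.round(ci_beta1, 2), np.round(ci_mean, 2), np.round(pi_new, 2))
```

Forming the explicit inverse mirrors the hand calculation; for larger problems `np.linalg.lstsq` or `np.linalg.solve` on the normal equations is numerically preferable.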
As indicated earlier, we now relate regression to the partial correlation discussed
earlier in this Unit.
It is easy to show that the partial correlation between Y and, say, X1 given X2 is
$$ r_{01.2} = \frac{\sum (y - b_{02} x_2)(x_1 - b_{12} x_2)}{\sqrt{\sum (y - b_{02} x_2)^2 \sum (x_1 - b_{12} x_2)^2}} $$
where y, x1 and x2 are measured as deviations from their sample means, b02 is the slope
of the regression of Y on X2 and b12 is the slope of the regression of X1 on X2 . That is,
r01.2 is the correlation of the residuals from the two regressions. Since
the residuals represent variation remaining after the linear effect of the regressor(s) has
been removed, an interpretation of the partial correlation between Y and X1 is that it
represents the correlation between Y and X1 after the linear effect of X2 on both Y
and X1 has been removed.
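The residual interpretation can be illustrated numerically with the Example 1.4 data. The sketch below uses only the standard library; the helper names `deviations`, `slope` and `corr` are illustrative. It computes r01.2 as the correlation of the two sets of residuals and compares it with the first-order formula from Section 1.2.

```python
from math import sqrt

# Example 1.4 data: wage change Y, unemployment X1, inflation X2.
Y = [3, 1, 8, 3, 5]
X1 = [3, 1, 5, 2, 4]
X2 = [5, 4, 6, 4, 6]


def deviations(values):
    """Deviations of the observations from their sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope(dep, reg):
    """Simple regression slope, both arguments in deviation form."""
    return sum(a * b for a, b in zip(dep, reg)) / sum(b * b for b in reg)


def corr(a, b):
    """Correlation of two series that are already in deviation (mean-zero) form."""
    return sum(p * q for p, q in zip(a, b)) / sqrt(
        sum(p * p for p in a) * sum(q * q for q in b))


y, x1, x2 = deviations(Y), deviations(X1), deviations(X2)

# Residuals from regressing Y on X2 and X1 on X2.
b02 = slope(y, x2)
b12 = slope(x1, x2)
e_y = [a - b02 * c for a, c in zip(y, x2)]
e_1 = [a - b12 * c for a, c in zip(x1, x2)]

# Partial correlation as the correlation of the residuals.
r01_2_resid = corr(e_y, e_1)

# The same quantity from the first-order partial correlation formula.
r01, r02, r12 = corr(y, x1), corr(y, x2), corr(x1, x2)
r01_2_formula = (r01 - r02 * r12) / sqrt((1 - r02 ** 2) * (1 - r12 ** 2))

# Regressing one residual series on the other recovers the multiple
# regression coefficient of X1.
beta1_from_residuals = slope(e_y, e_1)

print(round(r01_2_resid, 4), round(r01_2_formula, 4), beta1_from_residuals)
```

The last quantity equals the coefficient 2.5 on X1 in the fitted model of Example 1.4, which is one concrete sense in which partial correlations and regression coefficients are related.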
If the assumptions of the linear model are satisfied then the inferences based on the
model are valid. The assumptions, however, do not always hold. In the next section
we examine some common ways of assessing the assumptions of a model.
1.6 Residual analysis

Inference concerning relationships of an economic system, or any system, must be based
on a satisfactory model, that is, a model which seems to fit the data well. A model
is plausible, i.e. satisfactory, if none of its assumptions are (grossly) violated. Thus,
before a model is used to make inference it must be subjected to diagnostic checking
for adequacy. In the literature, this process of checking model compliance or conformity
is dubbed Residual Analysis. A summary of the steps and procedures of residual analysis
is presented below.
1. Plot of residuals against fitted values: If the linearity, independence, equal variance
and normality assumptions, i.e. all basic assumptions of the GLM, hold, then the
result cov(û, Ŷ ) = 0, or equivalently $\sum_{i=1}^{n} \hat{u}_i \hat{y}_i = 0$, implies that a plot of residuals
against fitted values should show a good fit characterised by small residuals with
no apparent structure or pattern.
2. Plot of residuals against each predictor: If u is independent of each predictor,
say Xj , so that cov(u, Xj ) = 0 or, equivalently, $\sum_{i=1}^{n} \hat{u}_i x_{ji} = 0$, a plot of residuals
against each predictor should show a random pattern. If the variance, for
example, is changing, say increasing or decreasing with a predictor variable, the
pattern will be reflected in the plot of residuals against the predictor variable.
3. Plot of residuals against time or index: The assumption that
$$ \operatorname{cov}(u_t, u_s) = \begin{cases} \sigma^2, & t = s \\ 0, & t \neq s \end{cases} $$
implies that a plot of residuals against time t = 1, 2, . . . should show constant
variability, at least approximately. The plot should not show any trend or systematic
pattern.
4. Correlogram: The assumption of no auto-correlation implies that a plot of the
auto-correlation function against lag, i.e. the correlogram, should show correlation
coefficients which are insignificant, i.e. lying within the confidence limits $\pm 1.96/\sqrt{n}$.
Further, the correlation coefficients should not show any pattern or structure.
5. The assumption of normality can be checked by constructing and examining the
histogram of the residuals. The assumption can be checked more carefully by
plotting the residuals against normal scores. Let $\hat{u}_{(1)} \le \hat{u}_{(2)} \le \ldots \le \hat{u}_{(n)}$ be the
ordered residuals. Then the ith normal score $z_i$ corresponding to these data is
defined as the $[(i - 1/2)/n] \times 100\%$ quantile of the standard normal distribution.
That is,
$$ z_i = \Phi^{-1}\left(\frac{i - 1/2}{n}\right) $$
where Φ is the cumulative distribution function of
the standard normal distribution. If the data are normally distributed then a
plot of the residuals $\hat{u}_{(1)}, \hat{u}_{(2)}, \ldots, \hat{u}_{(n)}$ against the corresponding normal scores
$z_1, z_2, \ldots, z_n$ should produce an approximate straight line.
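Two of the checks above, the correlogram limits and the normal scores, reduce to short calculations. The sketch below uses only the Python standard library (Python 3.8 or later for `statistics.NormalDist`); the function names are illustrative, the residuals are those of Example 1.4, and the limits are taken as ±1.96/√n. With only five residuals the output is purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def normal_scores(n):
    """Normal scores z_i = Phi^{-1}((i - 1/2) / n) for i = 1, ..., n."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def autocorrelation(u, lag):
    """Sample autocorrelation of the series u at the given lag."""
    n = len(u)
    mean = sum(u) / n
    denom = sum((v - mean) ** 2 for v in u)
    num = sum((u[t] - mean) * (u[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Residuals from Example 1.4.
resid = [-1.0, 0.5, 0.5, 0.0, 0.0]
n = len(resid)

# Pairs (z_i, ordered residual) for a normal scores plot.
qq_pairs = list(zip(normal_scores(n), sorted(resid)))

# Correlogram values at lags 1..n-1 and the approximate 95% limits.
limit = 1.96 / sqrt(n)
acf = [autocorrelation(resid, k) for k in range(1, n)]
lags_outside = [k for k, r in enumerate(acf, start=1) if abs(r) > limit]

for z, u in qq_pairs:
    print(round(z, 4), u)
print([round(r, 4) for r in acf], round(limit, 4), lags_outside)
```

Plotting `qq_pairs` (for example with a scatter plot) gives the normal scores plot described in item 5; an approximately straight line supports the normality assumption.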
1.7 Summary of this Unit

Econometric modelling is concerned with the measurement of economic relationships.
In this Unit we have learnt common ways of investigating relationships among economic
variables. These include simple correlation, partial correlation, etc. If a relationship
exists between two or more variables then a common and useful model is provided by
the General Linear Model. The basic assumptions of this model, upon which inference
may be based, are linearity, independence, homogeneous variances, absence of perfect
multicollinearity and normality.
In the next Unit we examine what happens when these assumptions are violated.
Activity 1.1 .
1. The following data show a bookshop’s figures pertaining to prices of Statistics text
books and monthly sales made over a period of 8 months.
month 1 2 3 4 5 6 7 8
sales 120 186 292 157 257 352 147 111
price 192 205 197 213 208 199 178 170
(a) Find the sample correlation between sales and price.
(b) Test the hypothesis that the correlation between sales and price is 0. State
and justify your alternative hypothesis. Use a 5% significance level.
2. In a study of 30 organisations a sample correlation of 0.7 was obtained between
productivity and age of the organisation. Test the hypothesis that productivity is
related to the age of the organisation.
3. In a study of thirty firms the sample correlation between average worker age and
productivity was 0.15. Formulate an appropriate research hypothesis relating
productivity to the age of a worker and test it at the 5% significance level.
4. Prove that the multiple correlation coefficient is the simple correlation between Y
and fitted values Ŷ .
5. If u1 , u2 , . . . , un are observations from a N (0, σ 2 ) distribution, then a plot of the

ordered observations u(1) , u(2) , . . . , u(n) against the corresponding normal scores
should produce an approximate straight line
u = α + βz
State the values of the intercept and slope of the straight line. Justify your answer.
6. Consider the data on wages, unemployment and inflation presented earlier in the
text, which are reproduced below.
year quarter wt vt = 1/ut pt−1
1954 2 3.53 0.2312 1.2290

3 1.74 0.1942 0.7020
4 1.72 0.1794 0.0000
1955 1 1.71 0.1827 -0.522
2 2.27 0.1942 -0.260
3 4.57 0.2128 -0.173
4 4.52 0.2260 0.0900
1956 1 5.06 0.2353 0.699
2 5.56 0.2367 0.263
3 4.37 0.2367 1.048
4 5.95 0.2381 1.997
1957 1 6.42 0.2395 2.507
2 5.26 0.2424 3.447
3 5.24 0.2410 3.589
4 4.08 0.2286 3.376
1958 1 3.52 0.2010 3.023
2 3.50 0.1747 3.415
3 3.48 0.1544 3.221
4 2.94 0.1460 2.297
1959 1 3.88 0.1487 1.966
2 4.35 0.1613 0.814
3 2.88 0.1747 0.404
4 3.33 0.1810 0.967
1960 1 3.24 0.1878 1.367
2 2.78 0.1869 1.528
3 3.74 0.1869 1.762
Use the GLM to model wage changes as a function of unemployment and/or inflation.
Perform residual analysis to assess the adequacy of the model. State the R² statistic
for the final model.
7. Given sample size n = 25, Ŷ = 9.1266 − 0.0724X1 + 0.2029X2 , s² = 0.4377,
X0′ = [1, 32, 22] and
$$ (X'X)^{-1} = \begin{pmatrix} 2.77847 & -0.011242 & -0.106098 \\ -0.011242 & 0.000146 & 0.000175 \\ -0.106098 & 0.000175 & 0.00479 \end{pmatrix}, $$
(a) find a 95% confidence interval for the intercept term,
(b) find
(i) a 95% confidence interval for µY (X0 ),
(ii) a 95% prediction interval for Y (X0 ).
8. Perform a correlation and partial correlation analysis of the following data. Cal-
culate the partial correlation of X1 and X2 given X3 using simple linear regres-
sion.
X1 X2 X3
12.3 263.3 93.1
16.0 275.4 93.9
15.7 278.3 92.5
21.2 296.7 89.2
17.9 309.3 91.7
18.8 315.8 96.5
15.4 318.8 100.0
19.0 333.0 103.9
20.0 340.2 102.5
18.4 350.7 102.5
21.8 361.3 102.1
24.1 381.3 101.5
25.6 406.5 101.2
30.0 430.8 99.0
