Unit 2: Review of Correlation and The General Linear Model (GLM)
Objectives
This Unit presents one of the most commonly used regression models in econometric
modelling, namely the general linear model.
1.1 Correlation and Regression
Definition 1.1 Let X and Y be two random variables. Then the correlation between
X and Y is defined as
\[ \rho = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)}\,\sqrt{\operatorname{var}(Y)}} \]
From your courses on regression you may have already learnt that correlation has the
following properties.
(i) −1 ≤ ρ ≤ 1
(ii) ρ = 1 implies perfect positive correlation
(iii) ρ = −1 implies perfect negative correlation
(iv) ρ = 0 implies no (linear) relationship.
where x̄ and ȳ are sample means. Formal conclusions concerning correlation can be
based on a statistical test with test statistic
\[ T = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = r \times \sqrt{\frac{n - 2}{1 - r^2}} \]
which, under the assumption of normality, has a Student's t-distribution with n − 2
degrees of freedom.
Example 1.1 The following data show monthly income and expenditure in thousands
of dollars, for 5 families.
income(x) expenditure(y)
2 2.0
3 2.5
4 2.6
5 2.9
6 3.0
(i) Plot expenditure against income and comment on the relationship between the two
variables.
(ii) Estimate the correlation between income and expenditure. Hence test the
hypothesis that there is a positive correlation between income and expenditure.
Use a 5% significance level.
Solution 1.1
(i) A plot of expenditure against income is shown in figure(1.1) below.
H0 : ρ = 0 versus H1 : ρ > 0
Thus
\[ r = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\left[\sum x^2 - (\sum x)^2/n\right]\left[\sum y^2 - (\sum y)^2/n\right]}} = \frac{54.4 - (20)(13)/5}{\sqrt{(90 - 80)(34.42 - 33.8)}} = 0.9639 \]
The critical region is t > t3,0.05 = 2.353, i.e. [2.353, ∞). The observed value
T = 0.9639 × √(3/(1 − 0.9639²)) ≈ 6.27 lies in this region. Hence H0 is rejected and
we conclude that there is a positive correlation between income and expenditure.
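The calculation in Solution 1.1 can be checked with a short script. This is a minimal sketch in plain Python; the function names `pearson_r` and `t_statistic` are illustrative choices, not part of the text, and the critical value 2.353 is simply the table value quoted above.

```python
import math

# Data from Example 1.1 (thousands of dollars per month).
income = [2, 3, 4, 5, 6]
expenditure = [2.0, 2.5, 2.6, 2.9, 3.0]

def pearson_r(x, y):
    """Sample correlation via the computational (sums) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = sxy - sx * sy / n
    denominator = math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
    return numerator / denominator

def t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), referred to t with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = pearson_r(income, expenditure)
t_obs = t_statistic(r, len(income))
T_CRIT = 2.353  # t_{3, 0.05}, the table value quoted in the text

print(f"r = {r:.4f}, T = {t_obs:.2f}, reject H0: {t_obs > T_CRIT}")
```

With these data the script should report r close to 0.964 and an observed T of roughly 6.27, well inside the critical region.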
where ρij = cor(Xi , Xj ). The correlation matrix ρ can be calculated from the covari-
ance matrix Σ using the relation
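\[ \rho = D^{-1/2} \Sigma D^{-1/2} =
\begin{pmatrix} 1/\sigma_1 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sigma_p \end{pmatrix}
\begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}
\begin{pmatrix} 1/\sigma_1 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sigma_p \end{pmatrix} \]
where σij = cov(Xi, Xj), σi = √σii = √(σi²) and D is the diagonal matrix of variances.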
An estimate S of the covariance matrix Σ based on a sample of size n is given by
\[ S = (\hat{\sigma}_{ij})_{p \times p} = (s_{ij}), \quad \text{where } s_{ij} = \frac{1}{n} \sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j) \]
Example 1.2 Using the expenditure data given in the above example, calculate
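Solution 1.2 .
The covariance matrix is
\[ S = \begin{pmatrix} 2.500 & 0.600 \\ 0.600 & 0.155 \end{pmatrix} \]
\[ \hat{D}^{-1/2} = \operatorname{diag}(1/\hat{\sigma}_1, 1/\hat{\sigma}_2) = \begin{pmatrix} 0.6325 & 0 \\ 0 & 2.5400 \end{pmatrix} \]
Hence
\[ R = \hat{D}^{-1/2} S \hat{D}^{-1/2} = \begin{pmatrix} 1.0000 & 0.9638 \\ 0.9638 & 1.0000 \end{pmatrix} \]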
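The relation R = D^{-1/2} S D^{-1/2} is easy to verify numerically. The sketch below is illustrative only: the helper names are hypothetical, and the divisor n − 1 is an assumption made here because it reproduces the matrix S printed in Solution 1.2 (a divisor of n would give smaller entries, although R itself does not depend on the divisor).

```python
import math

# Data from Example 1.1: income (x) and expenditure (y).
x = [2, 3, 4, 5, 6]
y = [2.0, 2.5, 2.6, 2.9, 3.0]

def covariance_matrix(columns, divisor_offset=1):
    """Covariance matrix of equal-length data columns.

    divisor_offset=1 divides by n - 1 (assumed, since it matches the S shown
    in Solution 1.2); divisor_offset=0 divides by n.
    """
    n = len(columns[0])
    p = len(columns)
    means = [sum(c) / n for c in columns]
    return [[sum((columns[i][k] - means[i]) * (columns[j][k] - means[j])
                 for k in range(n)) / (n - divisor_offset)
             for j in range(p)]
            for i in range(p)]

def correlation_from_covariance(s):
    """R = D^{-1/2} S D^{-1/2}, with D the diagonal (variances) of S."""
    p = len(s)
    d = [1.0 / math.sqrt(s[i][i]) for i in range(p)]  # diagonal of D^{-1/2}
    return [[d[i] * s[i][j] * d[j] for j in range(p)] for i in range(p)]

S = covariance_matrix([x, y])
R = correlation_from_covariance(S)
for row in R:
    print([round(v, 4) for v in row])
```

Because D^{-1/2} is diagonal, the two matrix products reduce to scaling entry (i, j) of S by 1/(σ̂i σ̂j), which is what the list comprehension does.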
1.2 Partial Correlation
Definition 1.2 Let (X1, X2, . . . , Xp) be a p-dimensional random vector with finite
second order moments. Then the partial correlation between any pair of the random
variables, say (X1, X2) given X3, . . . , Xp is defined as
The partial correlation coefficient measures the correlation between two variables when
all other variables are held constant, i.e. after removing the effect or influence of the
other (intervening) variables. It follows that in a multivariate system,
partial correlation gives a true measure of the correlation between two random variables.
By definition a partial correlation matrix is the conditional correlation matrix, i.e. the
correlation matrix of the conditional distribution of the first set of variables given the
second set of variables. In practice it is customary to assume that the data follow a
multivariate normal distribution. In that case an expression for the partial correlation
matrix is given in the following theorem.
Then the conditional covariance matrix and conditional correlation matrix i.e. matrix
of partial correlations of say X1 , . . . , Xk given Xk+1 , . . . , Xp are given respectively by
If p = 3 then it can easily be verified that the first-order partial correlation of X1 and
X2 given X3 i.e. in which only one variable is held constant, is given by
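As an illustration, the sketch below uses the standard textbook expression for a first-order partial correlation, r12.3 = (r12 − r13 r23) / √((1 − r13²)(1 − r23²)). That expression is assumed here rather than taken from the text, and the function name is hypothetical. The inputs are the sample correlations reported in Solution 1.3.

```python
import math

def first_order_partial(r12, r13, r23):
    """Partial correlation of X1 and X2 given X3 from simple correlations.

    Assumed standard formula:
    (r12 - r13 * r23) / sqrt((1 - r13^2) * (1 - r23^2)).
    """
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Simple correlations from the matrix R in Solution 1.3.
r12, r13, r23 = 0.6508, 0.3567, 0.0726
r12_3 = first_order_partial(r12, r13, r23)
print(f"r12.3 = {r12_3:.4f}")
```

The result agrees, up to rounding, with the value 0.6706 obtained from the matrix approach in Solution 1.3.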
Example 1.3 The following data show the quarterly percentage change in earnings, i.e.
wages wt, of workers in the manufacturing industry, the reciprocal vt = 1/ut of
unemployment ut, expressed as a percentage of the workforce, and the lagged percentage
change pt−1 in the consumer price index (CPI), i.e. lagged quarterly inflation.
Year  Quarter  wt     vt       pt−1
1954  2        3.53   0.2312   1.2290
1954  3        1.74   0.1942   0.7020
1954  4        1.72   0.1794   0.0000
1955  1        1.71   0.1827   -0.522
1955  2        2.27   0.1942   -0.260
1955  3        4.57   0.2128   -0.173
1955  4        4.52   0.2260   0.0900
1956  1        5.06   0.2353   0.699
1956  2        5.56   0.2367   0.263
1956  3        4.37   0.2367   1.048
1956  4        5.95   0.2381   1.997
1957  1        6.42   0.2395   2.507
1957  2        5.26   0.2424   3.447
1957  3        5.24   0.2410   3.589
1957  4        4.08   0.2286   3.376
1958  1        3.52   0.2010   3.023
1958  2        3.50   0.1747   3.415
1958  3        3.48   0.1544   3.221
1958  4        2.94   0.1460   2.297
1959  1        3.88   0.1487   1.966
1959  2        4.35   0.1613   0.814
1959  3        2.88   0.1747   0.404
1959  4        3.33   0.1810   0.967
1960  1        3.24   0.1878   1.367
1960  2        2.78   0.1869   1.528
1960  3        3.74   0.1869   1.762
Given that the sample covariance matrix is
\[ S = \begin{pmatrix} 1.6384 & 0.0264 & 0.5924 \\ 0.0264 & 0.0010 & 0.0030 \\ 0.5924 & 0.0030 & 1.6835 \end{pmatrix}, \]
(a) find the sample correlation matrix R,
(b) calculate first-order partial correlations,
(i) using the matrix approach i.e. from the partial covariance matrix.
(ii) using explicit formulae for first-order partial correlations.
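Solution 1.3 .
(a)
\[ \hat{D}^{-1/2} = \operatorname{diag}(1/\hat{\sigma}_1, 1/\hat{\sigma}_2, 1/\hat{\sigma}_3) = \begin{pmatrix} 0.7812 & 0 & 0 \\ 0 & 31.5643 & 0 \\ 0 & 0 & 0.7707 \end{pmatrix} \]
Hence
\[ R = \hat{D}^{-1/2} S \hat{D}^{-1/2} = \begin{pmatrix} 1.0000 & 0.6508 & 0.3567 \\ 0.6508 & 1.0000 & 0.0726 \\ 0.3567 & 0.0726 & 1.0000 \end{pmatrix} \]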
Thus, as for the simple correlation matrix, the partial correlation matrix is given
by
\[ R_{1|2} = \hat{D}^{-1/2} S_{1|2} \hat{D}^{-1/2} = \begin{pmatrix} 1.0000 & 0.6706 \\ 0.6706 & 1.0000 \end{pmatrix} \]
where D̂ = diag(S1|2). It follows that the partial correlation of (X1, X2)
given X3 is r12.3 = ρ̂12.3 = 0.6706.
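The matrix approach of part (b)(i) can be sketched numerically. The code assumes the standard result for the partial (conditional) covariance matrix under multivariate normality, S1|2 = S11 − S12 S22⁻¹ S21, which is not reproduced in the text above; the function names are hypothetical. Since only one variable is conditioned on, S22 is a scalar and no matrix inversion is needed.

```python
import math

# Sample covariance matrix of (w_t, v_t, p_{t-1}) given in Example 1.3.
S = [[1.6384, 0.0264, 0.5924],
     [0.0264, 0.0010, 0.0030],
     [0.5924, 0.0030, 1.6835]]

def partial_covariance_given_last(s):
    """S_{1|2} = S11 - S12 S22^{-1} S21, conditioning on the last variable only.

    Assumed standard formula; S22 is the scalar variance of the last variable.
    """
    k = len(s) - 1
    s22 = s[k][k]
    return [[s[i][j] - s[i][k] * s[k][j] / s22 for j in range(k)]
            for i in range(k)]

def to_correlation(s):
    """Rescale a covariance matrix to a correlation matrix."""
    p = len(s)
    d = [1.0 / math.sqrt(s[i][i]) for i in range(p)]
    return [[d[i] * s[i][j] * d[j] for j in range(p)] for i in range(p)]

S_partial = partial_covariance_given_last(S)
R_partial = to_correlation(S_partial)
print(f"r12.3 = {R_partial[0][1]:.3f}")
```

Because the printed S is rounded to four decimals (the variance of vt in particular keeps very few significant digits), the value obtained this way is only expected to match 0.6706 to about two decimal places.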
1.3 General Linear Model (GLM) and Econometric Modelling
Although in practice any form of relationship can exist among variables, a useful and
widely used class of regression models is that of linear regression models. A linear
regression model for the dependence of a response variable Y on a suspected predictor
or predictors X1, X2, . . . , Xp−1 is of the form
\[ Y = X'\beta + u_t \]
The next section presents a summary of common inferences that can be made with the
GLM.
Then the Ordinary Least Squares (OLS) estimator β̂ of β is that value of β which
minimises the error sum of squares
If the assumptions of the GLM are satisfied then one can proceed with the following
steps of (econometric) modeling using the GLM:
3. A basic analysis of variance table showing a breakdown of the total variation into
systematic variation due to regression and random variation due to the
disturbance term is
source      df     SS                  MS                     F
regression  p−1    SSR = Y′[H − J]Y    MSR = SSR/(p−1)        MSR/MSE
error       n−p    SSE = Y′[I − H]Y    MSE = SSE/(n−p) = s²
Total       n−1    SST = Y′[I − J]Y
8. \( \sigma_{\hat{\mu}_Y}(X_0) = \sigma \sqrt{X_0'(X'X)^{-1}X_0} \) is the standard error of µ̂Y(X0).
9. \( \hat{\mu}_Y(X_0) \pm t_{n-p,\alpha/2}\, s \sqrt{X_0'(X'X)^{-1}X_0} \) is a confidence interval for µY(X0).
10. \( \hat{Y}(X_0) \pm t_{n-p,\alpha/2}\, s \sqrt{1 + X_0'(X'X)^{-1}X_0} \) is a prediction interval for Y(X0).
Example 1.4 The following data show a segment of the series on percentage wage
change Y, unemployment X1 and percentage price changes, i.e. inflation, X2.
Y X1 X2
3 3 5
1 1 4
8 5 6
3 2 4
5 4 6
(d) Set up the basic analysis of variance table. Hence test the significance of the
GLM. Use a 10% significance level.
(e) Determine the multiple correlation coefficient R² and comment on this value.
(h) Forecast, i.e. predict, the change in wages if the values of unemployment and inflation
rise to 10 and 10 respectively. Obtain a 95% prediction interval for your forecast.
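Solution 1.4 .
(a) Let
\[ Y = \begin{pmatrix} 3 \\ 1 \\ 8 \\ 3 \\ 5 \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & 3 & 5 \\ 1 & 1 & 4 \\ 1 & 5 & 6 \\ 1 & 2 & 4 \\ 1 & 4 & 6 \end{pmatrix}, \quad \text{and} \quad
X'X = \begin{pmatrix} 5 & 15 & 25 \\ 15 & 55 & 81 \\ 25 & 81 & 129 \end{pmatrix}. \]
Then
\[ (X'X)^{-1} = \begin{pmatrix} 26.7 & 4.5 & -8.0 \\ 4.5 & 1.0 & -1.5 \\ -8.0 & -1.5 & 2.5 \end{pmatrix} \quad \text{and} \quad X'Y = \begin{pmatrix} 20 \\ 76 \\ 109 \end{pmatrix}. \]
Thus the parameter estimates are β̂ = (X′X)⁻¹X′Y = (4.0, 2.5, −1.5)′.
The estimated regression equation is therefore Ŷ = 4.0 + 2.5X1 − 1.5X2.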
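The estimates in part (a) can be reproduced by forming and solving the normal equations (X′X)β = X′Y. The following is a minimal sketch in plain Python; the helper names `normal_equations` and `solve` are illustrative, and Gauss-Jordan elimination is used here only as a simple stand-in for a library solver.

```python
# Data from Example 1.4.
Y = [3, 1, 8, 3, 5]
X1 = [3, 1, 5, 2, 4]
X2 = [5, 4, 6, 4, 6]
X = [[1.0, a, b] for a, b in zip(X1, X2)]  # design matrix with intercept column

def normal_equations(X, Y):
    """Return X'X and X'Y for a design matrix X and response Y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]
    return xtx, xty

def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

xtx, xty = normal_equations(X, Y)
beta_hat = solve(xtx, xty)
print([round(b, 4) for b in beta_hat])
```

Solving the linear system directly avoids forming (X′X)⁻¹ explicitly; the inverse is only needed later for standard errors and interval estimates.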
(b) The fitted values are Ŷ = Xβ̂ = (4.0, 0.5, 7.5, 3.0, 5.0)′.
(c) The residual values are û = Y − Ŷ = (−1.0, 0.5, 0.5, 0.0, 0.0)′.
(d) The residual sum of squares is
\[ SSE = \sum_{i=1}^{5} \hat{u}_i^2 = \hat{u}'\hat{u} = Y'[I - H]Y = 1.5 \]
The total sum of squares is
\[ SST = \sum_{i=1}^{5} (y_i - \bar{y})^2 = Y'[I - J]Y = 28.0 \]
The regression sum of squares is
\[ SSR = \sum_{i=1}^{5} (\hat{y}_i - \bar{y})^2 = Y'[H - J]Y = 26.5. \]
source df SS F
regression 2 26.5 17.67
error 2 1.5
Total 4 28.0
H0 : Y = β0 + u
H1 : Y = β0 + β1 X1 + β2 X2 + u
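The test statistic is
\[ F = \frac{SSR/(p-1)}{SSE/(n-p)} = \frac{26.5/2}{1.5/2} = 17.67. \]
Since 17.67 exceeds the table value F2,2;0.10 = 9.00, at the 10% level we can reject H0
and conclude that the regression model is significant in explaining wage changes.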
(e) The multiple correlation coefficient is R² = 26.5/28.0 = 0.9464, i.e. about 94.6% of
the total variation in wage changes is accounted for by the regression.
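The analysis of variance quantities in parts (d) and (e) follow directly from the observed and fitted values. A minimal sketch, assuming the fitted values from part (b) and p = 3 parameters (variable names are illustrative):

```python
# Observed and fitted values from Example 1.4, parts (a) and (b).
Y = [3, 1, 8, 3, 5]
Y_hat = [4.0, 0.5, 7.5, 3.0, 5.0]
n, p = len(Y), 3  # p counts the intercept and the two slopes

y_bar = sum(Y) / n
sse = sum((y - f) ** 2 for y, f in zip(Y, Y_hat))  # error sum of squares
sst = sum((y - y_bar) ** 2 for y in Y)             # total sum of squares
ssr = sum((f - y_bar) ** 2 for f in Y_hat)         # regression sum of squares

msr = ssr / (p - 1)
mse = sse / (n - p)   # this is s^2
f_stat = msr / mse
r_squared = ssr / sst

print(f"SSR = {ssr}, SSE = {sse}, SST = {sst}")
print(f"F = {f_stat:.2f}, R^2 = {r_squared:.4f}")
```

The decomposition SST = SSR + SSE holds here (28.0 = 26.5 + 1.5), which is a useful check on the arithmetic.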
The table value is t2,0.025 = 4.3. Hence a 95% confidence interval for β1 is
2.5 ± 4.3(0.866) = (−1.2, 6.2).
(g) Substituting the values of X0 = (1, X1, X2)′ = (1, 10, 10)′ into the equation
we get µ̂Y(1, 10, 10) = 4 + 2.5(10) − 1.5(10) = 14.0. Further, X0′(X′X)⁻¹X0 = 6.7.
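The forecast and the two interval half-widths can be sketched as follows. The code reuses (X′X)⁻¹ and β̂ from part (a), takes s² = SSE/(n − p) = 1.5/2, and uses the table value t2,0.025 = 4.3 quoted in the text; the helper name `quadratic_form` is hypothetical.

```python
import math

# Quantities taken from Solution 1.4.
XTX_INV = [[26.7, 4.5, -8.0],
           [4.5, 1.0, -1.5],
           [-8.0, -1.5, 2.5]]   # (X'X)^{-1} from part (a)
beta_hat = [4.0, 2.5, -1.5]
x0 = [1, 10, 10]                # (1, X1, X2)' at the forecast point
s = math.sqrt(1.5 / 2)          # s^2 = MSE = SSE / (n - p)
T_TABLE = 4.3                   # t_{2, 0.025}, as quoted in the text

def quadratic_form(v, a):
    """Compute v' a v for a vector v and a square matrix a."""
    size = len(v)
    return sum(v[i] * a[i][j] * v[j] for i in range(size) for j in range(size))

mu_hat = sum(b * x for b, x in zip(beta_hat, x0))  # point forecast
q = quadratic_form(x0, XTX_INV)                    # X0'(X'X)^{-1}X0
ci_half = T_TABLE * s * math.sqrt(q)               # for the mean response
pi_half = T_TABLE * s * math.sqrt(1 + q)           # for a new observation

print(f"forecast = {mu_hat:.1f}")
print(f"95% CI half-width = {ci_half:.2f}, 95% PI half-width = {pi_half:.2f}")
```

The prediction interval is always wider than the confidence interval because of the extra 1 under the square root, which accounts for the disturbance in a new observation. Here both intervals are very wide, reflecting the two error degrees of freedom and a forecast point far outside the observed range of X1 and X2.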
As indicated earlier, we now relate regression to the partial correlation discussed
earlier in this Unit.
It is easy to show that the partial correlation between Y and, say, X1 given X2 is
\[ r_{01.2} = \frac{\sum (y - b_{02} x_2)(x_1 - b_{12} x_2)}{\sqrt{\sum (y - b_{02} x_2)^2 \sum (x_1 - b_{12} x_2)^2}} \]
where y, x1 and x2 are measured as deviations from their sample means, b02 is the slope
of the regression of Y on X2 and b12 is the slope of the regression of X1 on X2. That is,
r01.2 is the correlation of the residuals from the two regressions. Since
the residuals represent variation remaining after the linear effect of the regressor(s) has
been removed, an interpretation of the partial correlation between Y and X1 is that it
represents the correlation between Y and X1 after the linear effect of X2 on both Y
and X1 has been removed.
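The residual interpretation can be checked numerically. The sketch below uses the Example 1.4 data, regresses Y and X1 separately on X2, correlates the two sets of residuals, and compares the result with the usual first-order formula based on simple correlations (that formula is assumed here, not quoted from the text). All function names are illustrative.

```python
import math

# Data from Example 1.4.
Y = [3, 1, 8, 3, 5]
X1 = [3, 1, 5, 2, 4]
X2 = [5, 4, 6, 4, 6]

def centered(v):
    """Deviations of v from its sample mean."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def residuals(y, x):
    """Residuals from the simple linear regression of y on x (with intercept)."""
    yc, xc = centered(y), centered(x)
    slope = sum(a * b for a, b in zip(yc, xc)) / sum(b * b for b in xc)
    return [a - slope * b for a, b in zip(yc, xc)]

def corr(u, v):
    """Simple sample correlation between u and v."""
    uc, vc = centered(u), centered(v)
    return sum(a * b for a, b in zip(uc, vc)) / math.sqrt(
        sum(a * a for a in uc) * sum(b * b for b in vc))

# Partial correlation of Y and X1 given X2 as a correlation of residuals.
r01_2 = corr(residuals(Y, X2), residuals(X1, X2))

# Cross-check with the first-order formula from simple correlations.
r01, r02, r12 = corr(Y, X1), corr(Y, X2), corr(X1, X2)
r_formula = (r01 - r02 * r12) / math.sqrt((1 - r02 ** 2) * (1 - r12 ** 2))

print(f"residual approach: {r01_2:.4f}, formula: {r_formula:.4f}")
```

Both routes give the same value (about 0.898 for these data), which illustrates that the two definitions are equivalent.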
If the assumptions of the linear model are satisfied then the inferences based on the
model are valid. The assumptions, however, do not always hold. In the next section
we examine some common ways of assessing the assumptions of a model.
1.6 Residual Analysis
1. Plot of residuals against fitted values: If the linearity, independence, equal
variance, and normality assumptions, i.e. all basic assumptions of the GLM, hold, then
the result cov(û, Ŷ) = 0, or equivalently Σ ûi ŷi = 0 (summing over i = 1, . . . , n),
implies that a plot of residuals against fitted values should show a good fit
characterised by small residuals with no apparent structure or pattern.
implies that a plot of residuals against time t = 1, 2, . . . should show constant
variability, at least approximately. The plot should not show any trend or systematic
pattern.
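The orthogonality of residuals and fitted values that underlies the first plot can be confirmed on the Example 1.4 fit. This is a small illustrative check, not a substitute for the plots themselves; the text listing at the end is only a crude stand-in for a residual plot.

```python
# Observed and fitted values from Example 1.4.
Y = [3, 1, 8, 3, 5]
Y_hat = [4.0, 0.5, 7.5, 3.0, 5.0]

residuals = [y - f for y, f in zip(Y, Y_hat)]

# For an OLS fit with an intercept, residuals sum to zero and are
# orthogonal to the fitted values.
residual_sum = sum(residuals)
cross_product = sum(u * f for u, f in zip(residuals, Y_hat))
print(f"sum of residuals = {residual_sum}, sum of u*y_hat = {cross_product}")

# Crude text listing of residuals ordered by fitted value.
for f, u in sorted(zip(Y_hat, residuals)):
    print(f"fitted {f:4.1f}  residual {u:+4.1f}")
```

With only five observations no pattern can be judged reliably; in practice the same quantities would be plotted for the full data set.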
In the next Unit we examine what happens when these assumptions are violated.
Activity 1.1 .
1. The following data show a bookshop’s figures pertaining to prices of Statistics text
books and monthly sales made over a period of 8 months.
month 1 2 3 4 5 6 7 8
3. In a study of thirty firms the sample correlation between the average age of a worker
and productivity is 0.15. Formulate an appropriate research hypothesis relating
productivity to the age of a worker and test it at the 5% significance level.
4. Prove that the multiple correlation coefficient is the simple correlation between Y
and fitted values Ŷ .
u = α + βz
State the values of the intercept and slope of the straight line. Justify your answer.
6. Consider the data on wages, unemployment and inflation presented earlier in the
text, which is represented below.
Use the GLM to model wage changes as a function of unemployment and/or
inflation. Perform residual analysis to assess the adequacy of the model. State the
R² statistic for the final model.
7. Given sample size n = 25 and
Ŷ = 9.1266 − 0.0724X1 + 0.2029X2, s² = 0.4377, X0 = [1, 32, 22] and
\[ (X'X)^{-1} = \begin{pmatrix} 2.77847 & -0.011242 & -0.106098 \\ -0.011242 & 0.000146 & 0.000175 \\ -0.106098 & 0.000175 & 0.00479 \end{pmatrix}, \]
(a) find a 95% confidence interval for the intercept term,
(b) find
(i) a 95% confidence interval for µY(X0),
(ii) a 95% prediction interval for Y(X0).
8. Perform a correlation and partial correlation analysis of the following data.
Calculate the partial correlation of X1 and X2 given X3 using simple linear
regression.
X1 X2 X3
12.3 263.3 93.1
16.0 275.4 93.9
15.7 278.3 92.5
21.2 296.7 89.2
17.9 309.3 91.7
18.8 315.8 96.5
15.4 318.8 100.0
19.0 333.0 103.9
20.0 340.2 102.5
18.4 350.7 102.5
21.8 361.3 102.1
24.1 381.3 101.5
25.6 406.5 101.2
30.0 430.8 99.0