
Correlation and regression analysis

Reference: Anderson et al., ch. 14

08/09/2023
Overview
• Correlation analysis
• Linear regression analysis
• Ordinary least squares (OLS)
• Goodness-of-fit & the coefficient of determination
• CI and tests about 𝛽1
• Prediction
Objectives
• Estimate the correlation coefficient, and perform a hypothesis test to
determine if two variables are statistically significantly related to each other
• Use OLS to fit a sample regression model to data
• Test if the slope coefficient of the estimated OLS equation is statistically
significant
• Use the coefficient of determination to assess the goodness-of-fit of the
estimated equation
• Estimate confidence and prediction intervals for predicted values
Outcomes
• After studying this chapter, you should be able to:
• explain the meaning of regression analysis
• identify practical examples where regression analysis can be used
• construct a simple linear regression model
• use the regression line for prediction purposes
• calculate and interpret the correlation coefficient
• calculate and interpret the coefficient of determination
• conduct a hypothesis test on the regression model to test for significance
Correlation analysis
• Recall that the correlation coefficient provides a measure of the strength of
the LINEAR association between two variables.
• For example:
• Advertising expenditure is assumed to influence sales volumes
• Education level is likely to influence wages.
• A company’s share price is likely to be influenced by its return on investment
• The number of hours of operator training is believed to impact positively on productivity
• Foreign Direct Investment (FDI) is likely to have a positive effect on economic growth (GDP
growth).
• Correlation analysis allows us to quantify the relationship between these
variables and measure the strength of this relationship.
Correlation analysis
• Recall that the correlation coefficient is computed as follows:
• ρ_XY = σ_XY/(σ_X σ_Y), and its sample estimate is r_XY = s_xy/(s_x s_y) = (Σx_i y_i − nx̄ȳ)/√[(Σx² − nx̄²)(Σy² − nȳ²)]

• −1 ≤ 𝜌 ≤ 1, −1 ≤ 𝑟 ≤ 1, values close to 0 indicate no linear relationship between the two variables.


• If both X and Y are normally distributed random variables, then
t = r√(n − 2)/√(1 − r²) ∼ t_{n−2}
• Can now use the above to test whether there exists ANY (linear) relationship between
X and Y
• Instead of using the rule of thumb |r| > 2/√n, we can conduct a test for zero population
correlation.
Example:

One-sided right-sided:
• Suppose that H0: ρ ≤ 0. This means that H1: ρ > 0
• Specify α
• Test statistic: t = r√(n − 2)/√(1 − r²)
• Critical value: t_{n−2,α}
• Reject H0 if t > t_{n−2,α}

One-sided left-sided:
• Suppose that H0: ρ ≥ 0. This means that H1: ρ < 0
• Specify α
• Test statistic: t = r√(n − 2)/√(1 − r²)
• Critical value: −t_{n−2,α} (= t_{n−2,1−α})
• Reject H0 if t < −t_{n−2,α}

Two-sided:
• Suppose that H0: ρ = 0. This means that H1: ρ ≠ 0
• Specify α
• Test statistic: t = r√(n − 2)/√(1 − r²)
• Critical value: t_{n−2,α/2}
• Reject H0 if |t| > t_{n−2,α/2}
Example - Correlation analysis
The table contains data for the number of years of education received and the hourly wage earned
of a random sample of 10 urban Free State women that are between 40 and 50 years old. Assume
that the number of years of education received, as well as the hourly wage earned are normally
distributed.

Woman               1    2    3    4    5    6    7    8    9    10
Hourly wage         145  153  162  138  149  149  147  143  157  140
Years of education  8    11   14   8    14   12   8    10   11   5

Is the number of years of education received positively related to the hourly wage? Test at
the 10% level.
Solution
• X = wage, Y = education, x̄ = 148.3, ȳ = 10.1
• Σx² = 220431, s_x² = 55.789, s_x = 7.469
• Σy² = 1095, s_y² = 8.322, s_y = 2.885
• Σxy = 15122, s_xy = (Σxy − nx̄ȳ)/(n − 1) = (15122 − 10(148.3)(10.1))/9 = 15.967
• r = s_xy/(s_x s_y) = 15.967/(7.469 × 2.885) = 0.741
• H0: ρ ≤ 0, H1: ρ > 0
• α = 0.1
• Test statistic: t = r√(n − 2)/√(1 − r²) = 0.741√8/√(1 − 0.741²) = 3.121
• Critical value: t_{n−2,α} = t_{8,0.1} = 1.3968
• Reject H0 if t > t_{8,0.1}; because t = 3.121 > t_{8,0.1} = 1.3968, reject H0
• Thus, we can conclude at the 10% level that the number of years of education received is
positively related to the hourly wage.
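The calculations above can be checked with a short script (a sketch using only the Python standard library; the data and formulas are exactly those from the example):

```python
import math

# Data from the example: hourly wage (x) and years of education (y)
wage = [145, 153, 162, 138, 149, 149, 147, 143, 157, 140]
educ = [8, 11, 14, 8, 14, 12, 8, 10, 11, 5]
n = len(wage)

xbar = sum(wage) / n
ybar = sum(educ) / n

# Sample covariance and sample standard deviations
s_xy = (sum(x * y for x, y in zip(wage, educ)) - n * xbar * ybar) / (n - 1)
s_x = math.sqrt((sum(x**2 for x in wage) - n * xbar**2) / (n - 1))
s_y = math.sqrt((sum(y**2 for y in educ) - n * ybar**2) / (n - 1))

# Correlation coefficient and the t statistic for testing H0: rho <= 0
r = s_xy / (s_x * s_y)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

print(round(r, 3), round(t, 3))  # 0.741 3.121
```

Because t = 3.121 exceeds the table value t_{8,0.1} = 1.3968, the script reproduces the rejection of H0.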
Scatterplot
The relationship between any pair of variables – labelled x and y – can be examined
graphically by producing a scatter plot of their data values.
• Each scatter point represents a pair of data values from the two random variables, x and y.
• The pattern of the scatter points indicates the nature of the relationship, which can be
positive, negative or none.
[Figure: scatter plot of wages (vertical axis, 135–165) against education (horizontal axis, 4–14)]
Simple linear regression analysis - Ordinary least
squares (OLS) 14.1
• Regression analysis is a statistical technique that quantifies the relationship between a
single response variable and one or more predictor variables.
• Simple linear regression analysis finds a straight-line equation that represents the
relationship between a single response variable and one predictor variable.
• One variable is called the independent or predictor variable, x, and the other is
called the dependent or response variable, y.
• The x-variable influences the outcome of the y-variable.
• Its values are usually known or easily determined.
• The dependent variable, y, is influenced by (or responds to) the independent variable,
x.
• Values for the dependent variable are estimated from values of the independent variable.
Possible relationships
Independent variable (x)    Dependent variable (y)
(predictor of y)            (variable estimated from x)
Advertising                 Sales volume
Training                    Labour productivity
Speed                       Fuel consumption
Hours worked                Machine output
Daily temperature           Electricity demand
Hours studied               Statistics grade
Amount of fertiliser used   Crop yield
Product price               Company turnover
Bond interest rate          Number of bond defaulters
Cost of living              Poverty
Simple linear regression analysis - Ordinary least
squares (OLS) 14.1
• Thus, suppose that we think that the variables Y and X are linearly related ...
• The equation that describes how Y is related to X and an error term is called the
regression model.
• Consider the population regression function (PRF) or simple linear regression
function:
• 𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀, where 𝛽0 = 𝑦 intercept and 𝛽1 = slope coefficient - are the parameters of the
model and 𝜀 = error term - is a random variable that accounts for the variation in Y not
explained by the linear relationship between Y and X.
• Now consider the simple linear regression equation:
• E(Y) = E(Y|X = x) = β0 + β1x
• The graph of the simple linear regression equation is a straight line, where β0 is the y
intercept of the regression line, β1 is the slope of the regression line and E(Y) is the
expected value of Y for a given x value.
Simple linear regression analysis - Ordinary least
squares (OLS) 14.1
• Fig. 14.1 illustrates possible regression lines associated with
simple linear regressions (also see next two slides)
• Next, consider the sample regression function (SRF):
• Y = b0 + b1X + e, where b0 and b1 are the sample estimators of β0 and β1
and e is the residual (sample estimate of the error)
• Finally, consider the estimated simple linear regression
equation:
• ŷ = b0 + b1x
Simple Linear Regression Equation
• Positive linear relationship: the regression line slopes upward – the slope b1 is positive; b0 is the y intercept.
• Negative linear relationship: the regression line slopes downward – the slope b1 is negative.
• No relationship: the regression line is horizontal at E(Y) = b0 – the slope b1 is 0.
[Figure: three panels plotting E(Y) against x, each showing a regression line with intercept b0 and positive, negative or zero slope b1]
Fitting a sample regression function to
sample data
• Specify the regression model and regression equation, to determine the
parameters to be estimated
• Obtain sample data
• Estimate the sample regression equation and obtain the sample statistics as
estimates of the population parameters:
• ŷ = b0 + b1x
• As already said, the graph is called the estimated regression line: b0 is the y
intercept of the line, b1 is the slope of the line and ŷ is the estimated value
of Y for a given x value.
• ŷ_i is the predicted or fitted value of y_i, obtained by plugging the value of
x_i into b0 + b1x_i
Estimation Process (Fig. 14.2)
• Regression model: Y = β0 + β1x + ε; regression equation: E(Y) = β0 + β1x; unknown parameters: β0, β1
• Sample data: (x1, y1), …, (xn, yn)
• Estimated regression equation: ŷ = b0 + b1x; sample statistics: b0, b1
• b0 and b1 provide estimates of β0 and β1
Ordinary least squares (OLS) (14.2)
• The method of least squares (LS) is the most commonly used estimation methodology
to estimate the sample coefficients of the sample regression function (i.e. b0 and b1)
• If the sample regression function is: 𝑦𝑖 = 𝑏0 + 𝑏1 x𝑖 + 𝑒𝑖 , then:
• 𝑦𝑖 is the observed value of the dependent variable (from data set)
• 𝑒𝑖 is the residual, which is the difference between the observed and predicted values of y.
• For ordinary least squares (OLS), the values of the coefficients are chosen to minimise
the residual sum of squares, where the residual sum of squares (SSE) is found as
follows:
• e_i = y_i − (b0 + b1x_i)
• e_i² = (y_i − (b0 + b1x_i))²
• Σe_i² = Σ(y_i − (b0 + b1x_i))²
• Note that b0 + b1x_i = ŷ_i, where ŷ_i is the predicted or fitted value. Thus, e_i = y_i − ŷ_i
and Σe_i² = Σ(y_i − ŷ_i)²
Ordinary least squares (OLS) (14.2)
• Note that Σe_i² = Σ(y_i − ŷ_i)²
• The least squares criterion: min_{b0,b1} Σ(y_i − ŷ_i)² = min_{b0,b1} Σ(y_i − (b0 + b1x_i))²
• The OLS estimators of b0 and b1:
• b1 = s_xy/s_x² = Σ(x_i − x̄)(y_i − ȳ)/Σ(x_i − x̄)² = (Σxy − nx̄ȳ)/(Σx² − nx̄²). Recall that
s_xy = (Σxy − nx̄ȳ)/(n − 1)
• b0 = ȳ − b1x̄
• Note that this implies that b1 = r(s_y/s_x)
• Furthermore, ŷ_i = b0 + b1x_i → ŷ_i = ȳ + b1(x_i − x̄)
• The SRF always passes through the point (x̄, ȳ): when x = x̄, then ŷ = ȳ
• b1 is the slope coefficient – it shows by how much the predicted value ŷ will change if x
changes by one unit.
• Generally, Δŷ = b1Δx – the marginal rate of change measure: for a unit change in x, ŷ changes by
the value of b1.
• b0 is the intercept coefficient (the value of ŷ when x = 0)
• Often doesn’t have an interesting/meaningful economic interpretation
Example 1:
• The table contains data about weekly household consumption expenditure and weekly
household disposable income for 8 independently and randomly sampled Free State
households. Assume that household consumption expenditure and household
disposable income are normally distributed.

Household 1 2 3 4 5 6 7 8
Consumption 562 762 569 804 888 990 543 874
Income 364 759 851 1166 1452 1050 670 1091

Use OLS to estimate the sample regression equation consumption = b0 + b1(income) + e, where
consumption = weekly household consumption expenditure and income = weekly disposable
household income. Interpret your estimate of b1
Also work through the example on p. 355-8
Solution
• NB: don’t estimate the equation the wrong way round – consumption = b0 + b1(income) + e
IS NOT THE SAME as income = b0 + b1(consumption) + e
• X = income and Y = consumption, x̄ = 925.375, ȳ = 749
• Σx² = 7642319, s_x² = 113109.7, s_x = 336.318
• Σy² = 4694034, s_y² = 29432.286, s_y = 171.558
• Σxy = 5850829, s_xy = (Σxy − nx̄ȳ)/(n − 1) = (5850829 − 8(925.375)(749))/7 = 43711.71
• b1 = s_xy/s_x² = 43711.71/113109.7 = 0.386
• b0 = ȳ − b1x̄ = 749 − 0.386(925.375) = 391.805
• Calculator tip: given the two variables, enter the x (income) variable first and the y
(consumption) variable last.
Solution
• The sample regression equation is therefore: consumption-hat = 391.805 + 0.386(income)
• Interpretation:
• b1: if income increases by R1, the predicted/expected value of consumption increases by R0.386
(almost 39 cents)
• b0: if income is R0, the predicted/expected value of consumption is R392
• Excel tip: given your Y and X data, to draw the scatterplot put income (X) in the first
column and consumption (Y) in the second column. The trendline option on the scatterplot
also gives the sample simple regression equation.
[Figure: scatter plot of consumption (500–1000) against income (400–1400), with fitted line
y = 0.3865x + 391.38 and R² = 0.5739]
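The OLS formulas b1 = s_xy/s_x² and b0 = ȳ − b1x̄ can be applied directly to the income–consumption data (a minimal sketch in plain Python; the (n − 1) factors in s_xy and s_x² cancel, so sums of squares suffice):

```python
# OLS for Example 1: consumption regressed on income (data from the slide)
income = [364, 759, 851, 1166, 1452, 1050, 670, 1091]
consumption = [562, 762, 569, 804, 888, 990, 543, 874]
n = len(income)

xbar = sum(income) / n
ybar = sum(consumption) / n

# Corrected sums of squares and cross-products
sxy = sum(x * y for x, y in zip(income, consumption)) - n * xbar * ybar
sxx = sum(x**2 for x in income) - n * xbar**2

b1 = sxy / sxx           # slope: s_xy / s_x^2
b0 = ybar - b1 * xbar    # intercept: passes through (xbar, ybar)

print(round(b1, 4), round(b0, 2))  # 0.3865 391.39
```

With the unrounded slope the intercept is 391.39, matching the Excel trendline; the slide's 391.805 comes from rounding b1 to 0.386 before computing b0.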
Simple Linear Regression
• Consider a simple regression equation:
y_i = β0 + β1x_i + ε_i
• The n realisations of the relationship can be written in matrix form as Y = Xβ + ε,
where Y = (y1, …, yn)′, X is the n×2 matrix whose i-th row is (1, x_i), β = (β0, β1)′
and ε = (ε1, …, εn)′
• Objective in regression analysis
• To estimate the elements of the vector of parameters, β
• OLS – minimise the residual sum of squares
• Get the least squares estimate of β as:
b = (X′X)⁻¹(X′Y)
Simple Linear Regression
• Matrices used in simple linear regression can be generalized as
follows:
X′X = [ n     Σx_i  ]        X′Y = [ Σy_i     ]
      [ Σx_i  Σx_i² ]              [ Σx_i y_i ]

• Find the least squares regression line for the following observations:

x 1 2 3 4 5
y 1 2 4 4 6
Example
• Y = (1, 2, 4, 4, 6)′; X has rows (1, x_i) with x = (1, 2, 3, 4, 5), so X′ = [1 1 1 1 1; 1 2 3 4 5]
• X′X = [5, 15; 15, 55] and X′Y = [17; 63]
• |X′X| = 5(55) − 15(15) = 50, so (X′X)⁻¹ = (1/50)[55, −15; −15, 5] = [11/10, −3/10; −3/10, 1/10]
• b = (X′X)⁻¹(X′Y) = [11/10, −3/10; −3/10, 1/10][17; 63] = [−0.2; 1.2]
• ŷ = −0.2 + 1.2x
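The matrix computation b = (X′X)⁻¹(X′Y) can be sketched without any linear-algebra library, since X′X is only 2×2 and can be inverted by hand:

```python
# Matrix form of OLS for the five observations above
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 4, 4, 6]
n = len(xs)

# X'X and X'Y assembled from the summation formulas on the previous slide
xtx = [[n, sum(xs)], [sum(xs), sum(x**2 for x in xs)]]
xty = [sum(ys), sum(x * y for x, y in zip(xs, ys))]

# Invert the 2x2 matrix X'X directly via its determinant
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det, xtx[0][0] / det]]

# b = (X'X)^{-1} (X'Y)
b0 = inv[0][0] * xty[0] + inv[0][1] * xty[1]
b1 = inv[1][0] * xty[0] + inv[1][1] * xty[1]

print(round(b0, 1), round(b1, 1))  # -0.2 1.2
```

The result matches the hand calculation: ŷ = −0.2 + 1.2x.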
ANOVA table for regression
• Can divide variation in y into two sources: variability explained by the regression;
variability not explained by the regression: SST = SSR + SSE
• Where SST is the total sum of squares (total variation in y):
• SST = Σ(y − ȳ)² = Σy² − nȳ² = (n − 1)s_y²
• If you have the sample variance of y, use the last expression to calculate SST
• SSR is the sum of squares due to the regression (variability in y attributable to the model):
• SSR = Σ(ŷ − ȳ)² = b1²Σ(x − x̄)² = b1²(Σx² − nx̄²) = b1²s_x²(n − 1)
• Last expression is the easiest to use for computations
• And SSE is the sum of squares due to the error (variability in y NOT explained by the model):
• SSE = Σe² = Σ(y − ŷ)² = Σ(y − (b0 + b1x))² = SST − SSR
ANOVA table for regression
• We can also construct an ANOVA table when fitting a sample regression function to data
using OLS
• In this table, SSR = variation due to regression (also known as SSM = variation due to model),
and is the variation in y explained by the variation in x
• SSE = error sum of squares (sometimes referred to as residual sum of squares) and is all
variation in y not explained by variation in x
• The F-statistic resulting from MSR/MSE tests the significance of the regression: does
variation in x explain a significant part of the variation in y (we won’t use this F test here; will
rather use t-test for slope, to be discussed later) (see T14.5)
Source SS Df MS F
Regression SSR 1 SSR/1 MSR/MSE
Error SSE n-2 SSE/(n-2)
Total SST n-1
Coefficient of determination (R²)
• A measure of the goodness-of-fit of an estimated regression equation (how well does the
estimated equation fit the data?)
• R² = SSR/SST = 1 − SSE/SST, 0 ≤ R² ≤ 1
• Shows the proportion of the variation in y that is explained by the variation in x
• The closer R² is to one, the better the fit; the closer to zero, the worse the fit
• Multiplied by 100: shows the percentage of sample variation in y explained by sample variation in x
• For a given SST: R² is greater, the greater SSR is (i.e. the smaller SSE is)
• Also note that R² = r_xy², while r_xy = (sign of b1)√R²
• Just like the correlation coefficient, R² is a measure of LINEAR association between x and y
• However, the coefficient of determination is more useful than the correlation coefficient when
interpreting the strength of association between two random variables, because it measures the
strength as a percentage.
Computationally most efficient way to obtain SST, SSR and SSE
• Quickest to estimate SST, SSR and SSE as follows:
• First, find SST, using SST = (n − 1)s_y²
• Second, find SSR using SSR = b1²s_x²(n − 1)
• Third, find SSE using SSE = SST − SSR

Excel SUMMARY OUTPUT for the consumption example:
Regression Statistics: Multiple R = 0.757593; R Square = 0.5739471; Adj R Square = 0.5029383;
Standard Error = 120.95314; Observations = 8

ANOVA
Source      df   SS            MS         F           Significance F
Regression  1    118248.021    118248     8.08275759  0.029450042
Residual    6    87777.97898   14629.66
Total       7    206026

            Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept   391.3850      132.8574        2.9459   0.0257   66.2946    716.4754
Income      0.3865        0.1359          2.8430   0.0295   0.0538     0.7191
Example
• Consider the following estimated sample regression equation:

𝑐𝑜𝑛𝑠𝑢𝑚𝑝𝑡𝑖𝑜𝑛 = 391.805 + 0.386(income)
• Also suppose that the summary statistics for consumption and income
are:

                 Consumption  Income
Sample mean      749          925.375
Sample variance  29432.286    113109.7

Estimate and interpret the coefficient of determination (and complete
the ANOVA table for this linear regression)
Solution
• From the regression equation: x = income and y = consumption
• R² = SSR/SST
• SST = (n − 1)s_y² = 7(29432.286) = 206026.002
• SSR = b1²s_x²(n − 1) = 0.386²(7)(113109.7) = 117970.25
• R² = 117970.25/206026.002 = 0.5726
• 57.26% of the sample variation in consumption (y) is explained by the sample variation in
income (x)

• ANOVA Table
Source      SS          Df  MS         F
Regression  117970.25   1   117970.25  8.038
Error       88055.752   6   14675.96
Total       206026.002  7
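The sums of squares can be reproduced directly from the data (a sketch; using the unrounded slope gives the Excel figures, while rounding b1 to 0.386 first gives a slightly smaller R²):

```python
income = [364, 759, 851, 1166, 1452, 1050, 670, 1091]
consumption = [562, 762, 569, 804, 888, 990, 543, 874]
n = len(income)

xbar = sum(income) / n
ybar = sum(consumption) / n

sxx = sum(x**2 for x in income) - n * xbar**2        # (n-1) * s_x^2
syy = sum(y**2 for y in consumption) - n * ybar**2   # (n-1) * s_y^2 = SST
sxy = sum(x * y for x, y in zip(income, consumption)) - n * xbar * ybar

b1 = sxy / sxx
sst = syy
ssr = b1**2 * sxx     # SSR = b1^2 * (n-1) * s_x^2
sse = sst - ssr       # SSE = SST - SSR

r2 = ssr / sst        # coefficient of determination
print(round(sst), round(ssr), round(sse), round(r2, 4))  # 206026 118248 87778 0.5739
```

These are exactly the SS column and R Square of the Excel summary output shown earlier.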
Model assumptions (14.4) – just take note
When fitting a sample regression function to sample data using OLS, we are assuming that the
following are true:
• The model is specified correctly and is linear in parameters, i.e. y = β0 + β1x + ε is the correct
specification
• The error term (ε) is a random variable with a mean or expected value of zero: E(ε) = 0
• The error term is independent of X: E(ε|X) = E(ε) = 0. This implies that Cov(ε, X) = 0
• The error term has a constant variance that does not depend on the value of X:
• Var(ε) = σ_ε², Var(ε|X) = Var(ε) = σ_ε². Known as homoskedasticity
• The error terms are independent: E(ε_i|ε_j) = E(ε_i) ⇒ Cov(ε_i, ε_j) = 0. Implies no
autocorrelation
• The error terms are normally distributed: ε ∼ N(0, σ_ε²)
Estimating intervals and testing hypotheses
about β1
• To test hypotheses and estimate CIs about the population slope coefficient, β1, we need:
• The estimate b1
• An estimate of the error variance: σ̂² = s_e², where s_e² = MSE = SSE/(n − 2)
• The standard error of the regression: s_e = RMSE = √s_e² = √(SSE/(n − 2))
• An estimate of s_b1²
• Assuming that ε ∼ N, we use the t-distribution to test hypotheses/estimate intervals
• The sampling distribution of b1 is b1 ∼ N(β1, σ_b1²) (under the OLS assumptions)
• Where σ_b1² = σ²/(Σx² − nx̄²), with σ² = σ_ε² the error variance
• Then, σ_b1 = √σ_b1² = σ/√(Σx² − nx̄²)
• Because we don’t know what σ_ε is, we need sample statistics as estimates of the
population parameters.
Variance of b1
• To estimate the variance of b1, s_b1², we have to proceed as follows:
• First, estimate SSE, where SSE = SST − SSR
• Then, estimate the error variance (s_e²) using s_e² = SSE/(n − 2)
• Note that SSE/(n − 2) = MSE (from the ANOVA table)
• Note that the square root of the error variance is known as the standard error of the regression:
• s_e = √s_e² = RMSE (where RMSE means root mean squared error)
• Then use the error variance to help estimate s_b1²: s_b1² = s_e²/(s_x²(n − 1)) = s_e²/Σ(x − x̄)²
• The standard error of b1, s_b1, is then equal to √s_b1²
Steps for computing s_b1 quickly
• First, find SSE, where SSE = SST − SSR
• Recall that SST = (n − 1)s_y², while SSR = b1²s_x²(n − 1)
• Then, find MSE = s_e² = SSE/(n − 2)
• Note that SSE/(n − 2) = MSE (from the ANOVA table)
• Then find s_b1² = s_e²/(s_x²(n − 1))
• Finally, find s_b1 = √s_b1²
Test statistic and confidence interval
• To test hypotheses about β1, use t = (b1 − β1*)/s_b1 = (b1 − β1*)/(s_e/(s_x√(n − 1))) as the test statistic
• β1* is the value under H0
• Note that: (b1 − β1*)/s_b1 ∼ t_{n−2}
• Note: if β1* = 0 → t = b1/s_b1
• Economic theory and results from previous studies will guide the decision about the
formulation of the H0 and H1 to be tested
• A 100(1 − α)% CI for β1 is given by b1 ± t_{n−2,α/2}s_b1
Hypothesis tests about β1

One-sided right-sided:
• H0: β1 ≤ β1*, H1: β1 > β1*
• Test statistic: t = (b1 − β1*)/s_b1
• Critical value: t_{n−2,α}
• Reject H0 if t > t_{n−2,α}
• Use this if you want to test whether there is a positive relationship between x and y
(i.e. β1 > 0): H0: β1 ≤ 0, H1: β1 > 0; the test statistic is then t = b1/s_b1

One-sided left-sided:
• H0: β1 ≥ β1*, H1: β1 < β1*
• Test statistic: t = (b1 − β1*)/s_b1
• Critical value: −t_{n−2,α} = t_{n−2,1−α}
• Reject H0 if t < −t_{n−2,α}
• Use this if you want to test whether there is a negative relationship between x and y
(i.e. β1 < 0): H0: β1 ≥ 0, H1: β1 < 0; the test statistic is then t = b1/s_b1

Two-sided:
• H0: β1 = β1*, H1: β1 ≠ β1*
• Test statistic: t = (b1 − β1*)/s_b1
• Critical value: t_{n−2,α/2}
• Reject H0 if |t| > t_{n−2,α/2}
• Use this if you want to test whether there is any relationship between x and y
(i.e. β1 ≠ 0): H0: β1 = 0, H1: β1 ≠ 0; the test statistic is then t = b1/s_b1
Summary for testing for Significance: t Test
1. Determine the hypotheses: H0: β1 = 0, H1: β1 ≠ 0
2. Specify the level of significance: α = 0.05
3. Select the test statistic: t = b1/s_b1
4. State the rejection rule: reject H0 if p-value < 0.05 or |t| > t_{n−2,α/2}
Confidence Interval for β1
• The form of a confidence interval for β1 is: b1 ± t_{α/2}s_b1
• b1 is the point estimator and t_{α/2}s_b1 is the margin of error
• where t_{α/2} is the t value providing an area of α/2 in the upper tail of a t distribution
with n − 2 degrees of freedom.
Example
• Consider the following estimated sample regression equation:

• 𝑐𝑜𝑛𝑠𝑢𝑚𝑝𝑡𝑖𝑜𝑛 = 391.805 + 0.386(income)
• Also suppose that the summary statistics for consumption and income
are:
                 Consumption  Income
Sample mean      749          925.375
Sample variance  29432.286    113109.7

Furthermore, SSE = 88055.756

a. Estimate a 99% interval for β1
b. Is β1 positive? Test at the 5% level
Solution
a. A 99% CI for β1 is given by b1 ± t_{n−2,α/2}s_b1
• s_e² = SSE/(n − 2) = 88055.756/6 = 14675.959
• s_b1² = s_e²/(s_x²(n − 1)) = 14675.959/(7(113109.7)) = 0.0185 → s_b1 = √s_b1² = 0.136
• t-score: t_{n−2,α/2} = t_{6,0.005} = 3.707. So the 99% CI is 0.386 ± 3.707(0.136). Therefore,
CI = (−0.118, 0.890).
• Given the 99% confidence interval for β1, H0: β1 = 0 would be rejected only if zero fell
outside the interval. Because the interval includes zero, we would not reject H0 at the 1% level.
b. Because you are testing whether β1 is positive: H0: β1 ≤ 0 and H1: β1 > 0
• α = 0.05; test statistic: t = 0.386/0.136 = 2.838; critical value: t_{n−2,α} = t_{6,0.05} = 1.9432
• Reject H0 if t > t_{6,0.05}. Because t = 2.838 > t_{6,0.05} = 1.9432, reject H0: at the 5% level,
the slope is statistically significantly positive.
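The slope inference can be reproduced with unrounded intermediate values (a sketch; the table value t_{6,0.005} = 3.7074 is hardcoded). Working at full precision gives the t statistic of the Excel output, 2.843, rather than the hand-rounded 2.838:

```python
import math

income = [364, 759, 851, 1166, 1452, 1050, 670, 1091]
consumption = [562, 762, 569, 804, 888, 990, 543, 874]
n = len(income)

xbar = sum(income) / n
ybar = sum(consumption) / n
sxx = sum(x**2 for x in income) - n * xbar**2
syy = sum(y**2 for y in consumption) - n * ybar**2
sxy = sum(x * y for x, y in zip(income, consumption)) - n * xbar * ybar

b1 = sxy / sxx
sse = syy - b1**2 * sxx       # SSE = SST - SSR
se2 = sse / (n - 2)           # MSE, the error variance
sb1 = math.sqrt(se2 / sxx)    # standard error of b1

t = b1 / sb1                  # test statistic for H0: beta1 = 0
t_crit = 3.7074               # t_{6, 0.005} from the t table
ci = (b1 - t_crit * sb1, b1 + t_crit * sb1)

print(round(t, 3), round(ci[0], 3), round(ci[1], 3))
```

The 99% interval still straddles zero, so the two-sided test at the 1% level does not reject H0.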
Prediction (14.6)
• If slope coefficient of sample regression equation is statistically
significant, and if 𝑅2 is relatively high, then we can use the estimated
sample regression equation for prediction.
• Recall that the fitted value (ŷ) is also known as the predicted value; we
obtain predicted values by plugging specific values for x into the
estimated sample regression equation.
• If the estimated equation is ŷ = 60 + 5x, then the predicted value is 110 if x = 10,
and 120 if x = 12, etc.
Prediction: intervals
• Point estimates, however, do not provide information about the precision
associated with a prediction
• To that end, we use two confidence intervals for predictions:
• Confidence interval for mean value of y for a given value of x
• Prediction interval for specific/individual value of y for a given value of
x
• Due to larger error associated with making individual predictions,
prediction intervals tend to be larger than confidence intervals for
predicted values of y
Confidence intervals for prediction
• Notation:
• x_p = value of x for which a prediction must be found
• ŷ_p = value of the prediction, given x_p
• E(y_p) = mean/expected value of y given x_p
• CI for E(y_p): ŷ_p ± t_{n−2,α/2} · s · √(1/n + (x_p − x̄)²/Σ(x − x̄)²)
• Note that s = s_e = √(SSE/(n − 2))
• Can also find the CI for E(y_p) as follows: ŷ_p ± t_{n−2,α/2} · s · √(1/n + (x_p − x̄)²/(s_x²(n − 1)))
Prediction intervals
• The prediction interval (PI) for y_p is ŷ_p ± t_{n−2,α/2} · s · √(1 + 1/n + (x_p − x̄)²/Σ(x − x̄)²)
• Can also find the PI using ŷ_p ± t_{n−2,α/2} · s · √(1 + 1/n + (x_p − x̄)²/(s_x²(n − 1)))
• Note that the prediction interval is wider than the confidence interval
• More uncertainty is associated with making a prediction for a particular value than
with making a prediction for the mean/expected value
• When making predictions, choose x_p so that x_min ≤ x_p ≤ x_max
Example
• Consider the following estimated sample regression equation:

𝑐𝑜𝑛𝑠𝑢𝑚𝑝𝑡𝑖𝑜𝑛 = 391.805 + 0.386(income)
• Also suppose that the summary statistics for consumption and income
are:

                 Consumption  Income
Sample mean      749          925.375
Sample variance  29432.286    113109.7

Furthermore, SSE = 88055.756

a. Estimate a 95% confidence interval for the expected value of
consumption (E(consumption)) if income = 900
b. Estimate a 95% prediction interval for consumption if income = 900
Solution
• Income = x, consumption = y
• x_p = 900 → ŷ_p = 391.805 + 0.386(900) = 739.205
• s_e² = SSE/(n − 2) = 88055.756/6 = 14675.959; s_x² = 113109.7
• CI: ŷ_p ± t_{n−2,α/2} · s · √(1/n + (x_p − x̄)²/(s_x²(n − 1))), with t_{6,0.025} = 2.447 and
s = s_e = √14675.959 = 121.144
• CI: 739.205 ± 2.447(121.144)√(1/8 + (900 − 925.375)²/(7(113109.7)))
• 95% CI: (634.062, 844.34). The 95% confidence interval for the expected value of consumption
(E(consumption)) if income = 900 lies between 634.062 and 844.34.
• 95% PI: ŷ_p ± t_{n−2,α/2} · s · √(1 + 1/n + (x_p − x̄)²/(s_x²(n − 1)))
• PI: 739.205 ± 2.447(121.144)√(1 + 1/8 + (900 − 925.375)²/(7(113109.7)))
• Interval: (424.683, 1053.727). The 95% prediction interval for consumption if income = 900 lies
between 424.683 and 1053.727.
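Both intervals follow the same recipe and differ only by the extra "1 +" under the square root (a sketch using the slide's rounded coefficients and given SSE; the table value t_{6,0.025} = 2.447 is hardcoded, so the endpoints match the slide within rounding):

```python
import math

n = 8
b0, b1 = 391.805, 0.386       # estimated equation from the slides
sse = 88055.756               # given on the slide
s_x2 = 113109.7               # sample variance of income
xbar = 925.375

xp = 900
y_hat = b0 + b1 * xp          # point prediction at income = 900

s = math.sqrt(sse / (n - 2))  # standard error of the regression
t_crit = 2.447                # t_{6, 0.025} from the t table

sxx = s_x2 * (n - 1)          # sum of squared deviations of x
ci_half = t_crit * s * math.sqrt(1 / n + (xp - xbar)**2 / sxx)      # CI for E(y_p)
pi_half = t_crit * s * math.sqrt(1 + 1 / n + (xp - xbar)**2 / sxx)  # PI for y_p

print(round(y_hat, 3))                                       # 739.205
print(round(y_hat - ci_half, 1), round(y_hat + ci_half, 1))  # 634.1 844.3
print(round(y_hat - pi_half, 1), round(y_hat + pi_half, 1))  # 424.7 1053.7
```

Note how the prediction interval is roughly three times as wide as the confidence interval for the mean.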
Final Example
• Marginal productivity theory predicts that increases in worker productivity
(output per worker) lead to increases in the real wages that workers receive. The
table contains data about the output per hour (productivity, measured as an
index) and real wages per hour (wage, measured in Rand) of 8 economic sectors for
2020.

Sector           Productivity  Wage
Gold mining      94            28
Platinum mining  125           40
Timber           63            12
Coal mining      96            24
Manufacturing    122           40
Fisheries        70            18
Agriculture      82            25
Forestry         70            16

Questions:
a. Use OLS to fit the sample regression function (wage) = b0 + b1(productivity) + e to the data.
b. Write down the estimated sample regression equation, and interpret your estimate of b1
c. Estimate and interpret the coefficient of determination.
d. Estimate a 90% confidence interval for β1. Are wages and productivity related?
e. Estimate a 95% prediction interval for wage if productivity = 110
Solution – Regression Equation
• X = productivity and Y = wage, x̄ = 90.25, ȳ = 25.375
• Σx² = 69054, s_x² = 556.214, s_x = 23.5842
• Σy² = 5909, s_y² = 108.268, s_y = 10.4052
• Σxy = 20002, s_xy = (Σxy − nx̄ȳ)/(n − 1) = (20002 − 8(90.25)(25.375))/7 = 240.179
a. b1 = s_xy/s_x² = 240.179/556.214 = 0.432 and b0 = ȳ − b1x̄ = 25.375 − 0.432(90.25) = −13.613
b. wage-hat = −13.613 + 0.432(productivity)
• If productivity increases by 1 unit, the predicted/expected wage will increase by R0.43 (43
cents).
• Calculator tip: enter the x (productivity) variable first and the y (wage) variable last.
Solution
The coefficient of determination
c. R² = SSR/SST
• SSR = b1²s_x²(n − 1) = 0.432²(7)(556.214) = 726.620
• SST = (n − 1)s_y² = 7(108.268) = 757.876
• R² = SSR/SST = 726.620/757.876 = 0.9588
• 95.88% of the sample variation in wage (y) is explained by the sample variation in productivity (x)
d. 90% confidence interval for β1: b1 ± t_{n−2,α/2}s_b1, with t_{6,0.05}
• SSE = SST − SSR = 757.876 − 726.620 = 31.256
• MSE = s_e² = SSE/(n − 2) = 31.256/6 = 5.209
• s_b1² = s_e²/(s_x²(n − 1)) = 5.209/(556.214(7)) = 0.0013 and s_b1 = √s_b1² = √0.0013 = 0.0366
Solution
• t_{6,0.05} = 1.9432
• CI = 0.432 ± 1.9432(0.0366) = (0.361, 0.503)
• At the 10% level, one can conclude that wages and productivity are related, because 0 lies
outside this interval (i.e. we can reject H0: β1 = 0)
e. Prediction interval
• X = productivity and Y = wage
• x_p = 110 → ŷ_p = 0.432(110) − 13.613 = 33.907
• t_{6,0.025} = 2.4467, s = s_e = √s_e² = √5.209 = 2.282
• 95% PI: ŷ_p ± t_{n−2,α/2} · s · √(1 + 1/n + (x_p − x̄)²/(s_x²(n − 1)))
• PI: 33.907 ± 2.4467(2.282)√(1 + 1/8 + (110 − 90.25)²/(7(556.214)))
• Interval: (27.726, 40.088)
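All the pieces of the final example can be chained together in one script (a sketch working at full precision, so the later decimals differ slightly from the slide's hand-rounded figures; table t values are hardcoded):

```python
import math

productivity = [94, 125, 63, 96, 122, 70, 82, 70]
wage = [28, 40, 12, 24, 40, 18, 25, 16]
n = len(productivity)

xbar = sum(productivity) / n
ybar = sum(wage) / n
sxx = sum(x**2 for x in productivity) - n * xbar**2
syy = sum(y**2 for y in wage) - n * ybar**2
sxy = sum(x * y for x, y in zip(productivity, wage)) - n * xbar * ybar

# a/b: OLS coefficients
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# c: coefficient of determination R^2 = SSR / SST
r2 = (b1**2 * sxx) / syy

# d: 90% CI for beta1 (t_{6,0.05} = 1.9432 from the t table)
sse = syy - b1**2 * sxx
se = math.sqrt(sse / (n - 2))
sb1 = se / math.sqrt(sxx)
ci = (b1 - 1.9432 * sb1, b1 + 1.9432 * sb1)

# e: 95% PI for wage at productivity = 110 (t_{6,0.025} = 2.4467)
xp = 110
y_hat = b0 + b1 * xp
half = 2.4467 * se * math.sqrt(1 + 1 / n + (xp - xbar)**2 / sxx)
pi = (y_hat - half, y_hat + half)

print(round(b1, 3), round(b0, 3), round(r2, 4))  # 0.432 -13.596 0.9579
```

Since the whole 90% interval for β1 lies above zero, the positive wage–productivity relationship is confirmed.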


Homework 1
• The table below contains data about the unionisation rates (union, in percentage points) and the
Gini coefficients (gini, in percentage points) of a random sample of 9 Bloemfontein retailers.
Assume that union and gini are normally distributed. Note that union is the percentage of
workers who are trade union members, while gini is a measure of (earnings) inequality that ranges
between 0 and 100% – 0 = perfectly equal earnings distribution; 100 = perfectly unequal earnings
distribution.

Retailer  A   B   C   D   E   F   G   H   I
union     25  34  38  42  24  37  7   17  28
gini      85  81  41  34  78  72  90  88  83

1. Write down the sample means and sample variances for 𝑢𝑛𝑖𝑜𝑛 and 𝑔𝑖𝑛𝑖
2. Use OLS to estimate the sample regression function 𝑔𝑖𝑛𝑖 = 𝑏0 + 𝑏1 (𝑢𝑛𝑖𝑜𝑛) + 𝑒. Write down the estimated
sample regression equation and interpret your estimate of 𝑏1 .
3. What are the values of the sums of squares (SS) for this regression?
4. Estimate the coefficient of determination, 𝑅2 .
5. What is the value of the error variance (𝑀𝑆𝐸 = 𝑠𝑒2 ) of this regression?
Homework 2
The table below contains data (in thousands of rand, so that 1=R1000, etc.) about monthly household
income at age 15 (𝑖𝑛𝑐𝑜𝑚𝑒15) and current monthly household income (𝑖𝑛𝑐𝑜𝑚𝑒𝑛𝑜𝑤) for a random
sample of ten 40-50-year-old South African men. Assume that 𝑖𝑛𝑐𝑜𝑚𝑒15 and 𝑖𝑛𝑐𝑜𝑚𝑒𝑛𝑜𝑤 are
normally distributed. NB: use the values as they appear in the data table to obtain all of your
estimates.

Man A B C D E F G H I J
𝑖𝑛𝑐𝑜𝑚𝑒15 3.5 4.4 5.2 2.8 3.9 4.0 3.8 3.3 4.8 3.1
𝑖𝑛𝑐𝑜𝑚𝑒𝑛𝑜𝑤 5.1 6.3 7.5 5.3 7.5 6.8 5.3 6.0 6.4 4.2

1) Find the sample means and sample standard deviations of 𝑖𝑛𝑐𝑜𝑚𝑒15 and 𝑖𝑛𝑐𝑜𝑚𝑒𝑛𝑜𝑤 for these 10 South
African men.
2) Use ordinary least squares (OLS) to fit the sample regression function 𝑖𝑛𝑐𝑜𝑚𝑒𝑛𝑜𝑤 𝑖 = 𝑏0 +
𝑏1 𝑖𝑛𝑐𝑜𝑚𝑒15 𝑖 + 𝑒𝑖 to the data. Write down the estimated sample regression function and interpret your
estimate of 𝑏1 .
3) Estimate and interpret the coefficient of determination, 𝑅 2 .
4) Is the relationship between household income at age 15 and current household income positive? Test at the 5%
level.
5) Estimate a 90% prediction interval for 𝑖𝑛𝑐𝑜𝑚𝑒𝑛𝑜𝑤, if 𝑖𝑛𝑐𝑜𝑚𝑒15 = 4.
Homework 3
• The table below contains data about the annual inflation rates (𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛, in percentage points)
and annual economic growth rates (𝑔𝑟𝑜𝑤𝑡ℎ, in percentage points) for a random sample of 11
emerging market economies, for the period 1990-1999. Assume that 𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 and 𝑔𝑟𝑜𝑤𝑡ℎ are
normally distributed.

Emerging market A B C D E F G H I J K
economy
𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 18.0 26.1 17.6 16.4 20.1 7.3 19.6 18.8 26.8 28.2 20.7
𝑔𝑟𝑜𝑤𝑡ℎ 4.4 2.7 4.6 4.6 4.3 6.7 4.1 5.0 3.1 3.5 3.8
1) Use ordinary least squares (OLS) to fit the sample regression function 𝑔𝑟𝑜𝑤𝑡ℎ𝑖 = 𝑏0 + 𝑏1 𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛𝑖 + 𝑒𝑖 to
the data. Write down the estimated sample regression equation and interpret your estimate of 𝑏1 .
2) Estimate and interpret the coefficient of determination, 𝑅2 .
3) Estimate a 90% confidence interval for 𝑏1 . Use this confidence interval to test (at the 10%) if there was no
relationship between economic growth and inflation in emerging market economies during the 1990s.
4) Estimate a 90% prediction interval for 𝑔𝑟𝑜𝑤𝑡ℎ if 𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 = 12.
End of Unit 9 - Chapter 14

Next time: Unit 10 (Multiple linear regression), ch. 15
