
Data Analysis in Finance

Master in Finance

Class #5
Jorge Caiado, PhD Luís Silveira Santos, PhD
CEMAPRE/ISEG, University of Lisbon
Email: jcaiado@iseg.ulisboa.pt | lsantos@iseg.ulisboa.pt
Web: http://jcaiado100.wixsite.com/jorgecaiado
III. Regression Analysis
Correlation Analysis
Correlation coefficient. It measures the strength of (linear) association of two
variables:
r_{XY} = s_{XY} / (s_X s_Y)

where s_{XY} = (1/n) \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) is the sample covariance of X and Y, and s_X and s_Y are the sample standard deviations of X and Y, respectively.
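A minimal Python sketch of this formula (not part of the original slides); the two return series are made-up illustrative values:

```python
# Sample correlation coefficient computed from the covariance and standard
# deviations, using divisor n as in the formula above.
import numpy as np

x = np.array([0.8, 1.2, -0.5, 0.3, 1.1, -0.2])   # hypothetical returns of market X
y = np.array([0.6, 1.0, -0.4, 0.1, 0.9, -0.3])   # hypothetical returns of market Y

n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / n   # sample covariance
s_x, s_y = x.std(), y.std()                          # sample standard deviations
r_xy = s_xy / (s_x * s_y)
print(r_xy)                                          # equals np.corrcoef(x, y)[0, 1]
```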

Illustrative example: return co-movements and volatility co-movements across international stock markets.

III. Regression Analysis
How to test the significance of correlation coefficient?
To test the null hypothesis (H0) that the correlation in the population, \rho_{XY}, is zero (H0: \rho_{XY} = 0) against the alternative hypothesis that the correlation in the population is different from zero (H1: \rho_{XY} \neq 0), we use the following test statistic:

t = r_{XY} \sqrt{n - 2} / \sqrt{1 - r_{XY}^2} ~ t(n - 2)

Example: The sample correlations among GDP, money supply (M2) and interest rate
(TB1YR) in the US during 1997Q1-2001Q1 are shown in the following table. Test the
null hypothesis that each of these correlations, individually, is equal to 0 at the 0.05
level of significance.
M2 GDP TB1YR
M2 1.000000 0.993685 0.068864
GDP 0.993685 1.000000 0.167160
TB1YR 0.068864 0.167160 1.000000
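A minimal sketch of this test in Python (not part of the original slides); the sample period 1997Q1-2001Q1 is taken to imply n = 17 quarterly observations, and the r values come from the table above:

```python
# Two-sided t-test of H0: rho = 0 for each pairwise correlation.
import numpy as np
from scipy import stats

def corr_t_test(r, n, alpha=0.05):
    """t = r*sqrt(n-2)/sqrt(1-r^2); reject H0 if |t| exceeds the critical value."""
    t_obs = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_obs, t_crit, abs(t_obs) > t_crit   # True -> reject H0

for pair, r in [("M2-GDP", 0.993685), ("M2-TB1YR", 0.068864), ("GDP-TB1YR", 0.167160)]:
    print(pair, corr_t_test(r, n=17))
```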
III. Regression Analysis
Simple Linear Regression
The problem of estimation
Simple linear regression or two-variable linear regression is a linear regression in which
the dependent variable is related to a single independent or explanatory variable. The
population regression specification of a simple regression model is as follows:
Y_i = b_0 + b_1 X_i + u_i,  i = 1, ..., N   (1)

where Y_i is the dependent variable, X_i is the independent variable, b_0 and b_1 are the regression coefficients (intercept and slope coefficients, respectively) and u_i is the stochastic error term. In this model, b_0 + b_1 X_i is the deterministic component and u_i is the stochastic component.
Since the model (1) is not directly observable, we estimate it from the sample regression function:

Y_i = \hat{b}_0 + \hat{b}_1 X_i + \hat{u}_i   (2)

or

Y_i = \hat{Y}_i + \hat{u}_i   (3)

where \hat{u}_i = Y_i - \hat{Y}_i are the residuals and \hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i.
III. Regression Analysis

The method of ordinary least squares (OLS) chooses the values of \hat{b}_0 and \hat{b}_1 that minimize the sum of squared residuals, \sum \hat{u}_i^2. In other words, it provides unique estimates of b_0 and b_1 that minimize the distances between the observations and the regression line.

The differentiation of \sum \hat{u}_i^2 = \sum (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2 yields the following normal equations:

\sum Y_i = n \hat{b}_0 + \hat{b}_1 \sum X_i

and

\sum Y_i X_i = \hat{b}_0 \sum X_i + \hat{b}_1 \sum X_i^2

III. Regression Analysis
Solving the normal equations, we obtain
\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}

and

\hat{b}_1 = [ (1/n) \sum Y_i X_i - (1/n) \sum X_i (1/n) \sum Y_i ] / [ (1/n) \sum X_i^2 - ((1/n) \sum X_i)^2 ]
          = s_{XY} / s_X^2 = r_{XY} \times (s_Y / s_X)
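A minimal Python sketch of these closed-form OLS estimates (not part of the original slides); the (X, Y) data are made-up illustrative values:

```python
# OLS slope and intercept from the solved normal equations.
import numpy as np

X = np.array([2.0, 3.1, 4.5, 5.2, 6.8, 7.3])
Y = np.array([1.1, 1.9, 2.8, 3.0, 4.2, 4.4])

b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)  # s_XY / s_X^2
b0_hat = Y.mean() - b1_hat * X.mean()
print(b0_hat, b1_hat)   # matches np.polyfit(X, Y, 1), which returns [slope, intercept]
```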

III. Regression Analysis
Example: Fitted regression line of GDP on M2

III. Regression Analysis
Exercise: Source: DeFusco et al., 2015

III. Regression Analysis
Assumptions of the Simple Linear Regression Model (SLRM):
A1: The regression model is linear in the parameters b_0 and b_1, as follows:
Y_i = b_0 + b_1 X_i + u_i
A2: The independent variable X is non-stochastic.
A3: The conditional mean value of the error term u is zero: E(u_i | X_i) = 0, for each i.
A4: The conditional variance of the error term u is the same for all observations: Var(u_i | X_i) = \sigma^2.
A5: The error term u is uncorrelated across observations: Cov(u_i, u_j | X_i, X_j) = 0, i \neq j.
A6: The error term u is normally distributed.

Comments on the assumptions of the SLRM:


• Assumption A3 implies that the error term u and the explanatory variable X are uncorrelated: Cov(u_i, X_i) = 0.
• The number of observations n must be greater than the number of parameters to
be estimated.
III. Regression Analysis
Standard Errors of Least Squares Estimates
The standard errors of the OLS estimates are obtained as follows:

se(\hat{b}_0) = \hat{\sigma} \sqrt{ \sum X_i^2 / (n \sum (X_i - \bar{X})^2) }

se(\hat{b}_1) = \hat{\sigma} / \sqrt{ \sum (X_i - \bar{X})^2 }

where \hat{\sigma} = SEE = \sqrt{ \sum \hat{u}_i^2 / (n - 2) } is the OLS estimator of \sigma and SEE stands for standard error of estimate.
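A minimal Python sketch of these standard-error formulas (not part of the original slides), reusing the same made-up (X, Y) data as in the earlier OLS snippet:

```python
# SEE and the standard errors of the OLS estimates.
import numpy as np

X = np.array([2.0, 3.1, 4.5, 5.2, 6.8, 7.3])
Y = np.array([1.1, 1.9, 2.8, 3.0, 4.2, 4.4])
n = len(X)

b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0_hat = Y.mean() - b1_hat * X.mean()

u_hat = Y - (b0_hat + b1_hat * X)                      # residuals
sigma_hat = np.sqrt(np.sum(u_hat**2) / (n - 2))        # SEE
sxx = np.sum((X - X.mean())**2)
se_b0 = sigma_hat * np.sqrt(np.sum(X**2) / (n * sxx))  # se(b0_hat)
se_b1 = sigma_hat / np.sqrt(sxx)                       # se(b1_hat)
print(sigma_hat, se_b0, se_b1)
```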

Statistical Properties of OLS Estimators


• Under assumptions A1 to A3, the least squares (OLS) estimator is unbiased (and
consistent).
• Under assumptions A1 to A5, the OLS estimator is said to be BLUE (best linear unbiased estimator). This means that it has minimum variance in the class of all linear unbiased estimators.
III. Regression Analysis
The Coefficient of Determination
The coefficient of determination r^2 is a measure of goodness of fit. In the simple regression model, it tells how well the sample regression line fits the data.
The total variation, or total sum of squares (TSS or SST), can be partitioned into two
parts: the explained variation, or explained sum of squares (ESS or SSE), and the
unexplained variation, or residual sum of squares (RSS or SSR):
TSS = ESS + RSS
The coefficient of determination r^2 can be defined as

r^2 = \sum (\hat{Y}_i - \bar{Y})^2 / \sum (Y_i - \bar{Y})^2 = ESS / TSS

or, equivalently,

r^2 = 1 - \sum (Y_i - \hat{Y}_i)^2 / \sum (Y_i - \bar{Y})^2 = 1 - RSS / TSS
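A minimal Python sketch of the sum-of-squares decomposition (not part of the original slides), again with made-up (X, Y) data:

```python
# TSS = ESS + RSS and the two equivalent expressions for r^2.
import numpy as np

X = np.array([2.0, 3.1, 4.5, 5.2, 6.8, 7.3])
Y = np.array([1.1, 1.9, 2.8, 3.0, 4.2, 4.4])

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

tss = np.sum((Y - Y.mean())**2)       # total sum of squares
ess = np.sum((Y_hat - Y.mean())**2)   # explained sum of squares
rss = np.sum((Y - Y_hat)**2)          # residual sum of squares
print(ess / tss, 1 - rss / tss)       # the two expressions coincide
```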

III. Regression Analysis
Exercise: Source: DeFusco et al., 2015

Solutions: A. r^2 = 0.4279; B. s_{XY} = -0.6542; C. \hat{\sigma} = 1.1775; D. s_Y = 1.5569

III. Regression Analysis
Confidence Intervals and Hypothesis Testing
If the sampling or probability distributions of the OLS estimators are known, one can
perform a hypothesis test using the confidence interval approach.

Assuming the error term u is normally distributed, the OLS estimators \hat{b}_0 and \hat{b}_1 are themselves normally distributed as follows:

\hat{b}_0 ~ N(b_0, \sigma^2_{\hat{b}_0})   and   \hat{b}_1 ~ N(b_1, \sigma^2_{\hat{b}_1})

where

\sigma^2_{\hat{b}_0} = \sigma^2 \sum X_i^2 / (n \sum (X_i - \bar{X})^2)   and   \sigma^2_{\hat{b}_1} = \sigma^2 / \sum (X_i - \bar{X})^2

Then, by the properties of the normal distribution, the variables

Z_0 = (\hat{b}_0 - b_0) / \sigma_{\hat{b}_0} ~ N(0, 1)   and   Z_1 = (\hat{b}_1 - b_1) / \sigma_{\hat{b}_1} ~ N(0, 1)

III. Regression Analysis
Since \sigma is rarely known, we may replace \sigma by its estimate, \hat{\sigma}, and use the Student's t distribution as follows:

t_0 = (\hat{b}_0 - b_0) / se(\hat{b}_0) ~ t(n - 2)   and   t_1 = (\hat{b}_1 - b_1) / se(\hat{b}_1) ~ t(n - 2)

Therefore, a confidence interval for b_0 can be written as

CI_{1-\alpha}(b_0) = \hat{b}_0 \pm t_{\alpha/2} \times se(\hat{b}_0)

Analogously, a confidence interval for b_1 can be written as

CI_{1-\alpha}(b_1) = \hat{b}_1 \pm t_{\alpha/2} \times se(\hat{b}_1)

where t_{\alpha/2} is obtained from a Student's t distribution with n - 2 degrees of freedom.
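A minimal Python sketch of this interval (not part of the original slides); the estimate, standard error and sample size passed to the function are taken from the consumption-GDP example on the next slide:

```python
# (1 - alpha) confidence interval for a regression coefficient.
from scipy import stats

def coef_ci(b_hat, se_b, n, alpha=0.05):
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{alpha/2} with n-2 df
    return b_hat - t_crit * se_b, b_hat + t_crit * se_b

print(coef_ci(b_hat=0.7064, se_b=0.007827, n=15))   # roughly (0.689, 0.723)
```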

III. Regression Analysis
To illustrate the confidence interval approach, consider a simple regression of
personal consumption expenditure (Y) on gross domestic product (X), both in billions
of dollars, in the United States, 1982-1996 (N=15):
\hat{Y}_i = -184.078 + 0.7064 X_i
\hat{\sigma} = 411.4913;  se(\hat{b}_0) = 46.2619;  se(\hat{b}_1) = 0.007827

Obtain a 95% confidence interval for both b_0 and b_1.

Solution:
CI_{0.95}(b_0) = -184.078 \pm 2.16 \times 46.2619  ⇒  -284.004 < b_0 < -84.152

CI_{0.95}(b_1) = 0.7064 \pm 2.16 \times 0.007827  ⇒  0.689 < b_1 < 0.723

Now suppose that we postulate H0: b_1 = 0.5 versus H1: b_1 \neq 0.5. Is the observed \hat{b}_1 compatible with H0?
III. Regression Analysis
An alternative but complementary approach to the confidence interval method of estimation is hypothesis testing.

Suppose that b_k^* is the conjectured value of b_k under the null hypothesis (H0: b_k = b_k^*); then the test statistic can be defined as:

t_0 = (\hat{b}_0 - b_0^*) / se(\hat{b}_0) ~ t(n - 2)   or   t_1 = (\hat{b}_1 - b_1^*) / se(\hat{b}_1) ~ t(n - 2)

Here, again, the critical value depends on the significance level, α, and on the type of
the test:
• For H0: b_k = b_k^* vs. H1: b_k \neq b_k^* (two-tailed hypothesis test), two rejection points exist (one negative, one positive), \pm t_{\alpha/2}.
• For H0: b_k = b_k^* or H0: b_k \leq b_k^* vs. H1: b_k > b_k^* (right-tailed hypothesis test), one rejection point exists (positive only), t_{\alpha}.
• For H0: b_k = b_k^* or H0: b_k \geq b_k^* vs. H1: b_k < b_k^* (left-tailed hypothesis test), one rejection point exists (negative only), -t_{\alpha}.
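A minimal Python sketch of these rejection points (not part of the original slides); the significance level and degrees of freedom are illustrative values:

```python
# Critical values for two-tailed, right-tailed and left-tailed t-tests.
from scipy import stats

alpha, df = 0.05, 13
print(stats.t.ppf(1 - alpha / 2, df))   # +/- t_{alpha/2}: two-tailed test
print(stats.t.ppf(1 - alpha, df))       # +t_{alpha}: right-tailed test
print(stats.t.ppf(alpha, df))           # -t_{alpha}: left-tailed test
```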

III. Regression Analysis
In the consumption-GDP example, we know that \hat{b}_0 = -184.078, se(\hat{b}_0) = 46.2619, \hat{b}_1 = 0.7064, se(\hat{b}_1) = 0.007827. How do we test H0: b_1 = 0.5 vs. H1: b_1 \neq 0.5 at the 5% level? And H0: b_0 = -200 vs. H1: b_0 \neq -200 at the same level?

Solution:
n - 2 = 15 - 2 = 13 df → t(13), thus the critical value is equal to t_{0.05/2}(13) = 2.16.
The observed values of the test statistics are:

t_{0,obs} = (-184.078 - (-200)) / 46.2619 = 0.344   and   t_{1,obs} = (0.7064 - 0.5) / 0.007827 = 26.370

|t_{0,obs}| < t_{0.05/2}(13) ⇒ Do not reject H0

|t_{1,obs}| > t_{0.05/2}(13) ⇒ Reject H0
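A minimal Python sketch of this solution (not part of the original slides), using the estimates and standard errors reported above for the consumption-GDP example:

```python
# Coefficient t-tests for H0: b0 = -200 and H0: b1 = 0.5 at the 5% level.
from scipy import stats

n = 15
t_crit = stats.t.ppf(0.975, df=n - 2)              # about 2.16

t0 = (-184.078 - (-200)) / 46.2619                 # test of H0: b0 = -200
t1 = (0.7064 - 0.5) / 0.007827                     # test of H0: b1 = 0.5
print(t_crit, abs(t0) > t_crit, abs(t1) > t_crit)  # False, True -> reject H0 only for b1
```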

III. Regression Analysis
Analysis of Variance (ANOVA) in a Simple Linear Regression
Recall that in the previous section we decomposed the total sum of squares (TSS) into two components: the explained sum of squares (ESS) and the residual sum of squares (RSS).

In regression analysis, we use analysis of variance (ANOVA) to test whether the independent variable (or variables) explains the variation in the dependent variable. The statistical procedure conducted in ANOVA is called the F-test:
F = (ESS / 1) / (RSS / (n - 2)) = ((ESS / TSS) / 1) / ((RSS / TSS) / (n - 2)) = (R^2 / 1) / ((1 - R^2) / (n - 2)) ~ F(1, n - 2)

Under the null hypothesis H0: b_1 = 0 (vs. H1: b_1 \neq 0), the F statistic follows an F distribution with 1 df in the numerator and n - 2 df in the denominator (note that F(1, m) = t_m^2).
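A minimal Python sketch of this F-test computed from R^2 (not part of the original slides); the inputs r^2 = 0.3698 and n = 55 are taken from the food-expenditure example below:

```python
# ANOVA F-test for H0: b1 = 0 in a simple regression, from R^2 and n.
from scipy import stats

def f_test_from_r2(r2, n, alpha=0.05):
    f_obs = (r2 / 1) / ((1 - r2) / (n - 2))
    f_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=n - 2)
    return f_obs, f_crit, f_obs > f_crit   # True -> reject H0

print(f_test_from_r2(0.3698, 55))   # f_obs around 31.1, i.e. t^2 = 5.5770^2
```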
III. Regression Analysis
To illustrate, consider the following regression of food expenditure (F) on total expenditure (T) in India, for a sample of 55 rural households (Source: Mukherjee et al., 1998):

\hat{F}_i = 94.2087 + 0.4368 T_i
se(\hat{b}_0) = 50.8563,  se(\hat{b}_1) = 0.0783
t_{0,obs} = 1.8524,  t_{1,obs} = 5.5770
f_obs = 31.1034  (p-value = 0.000)
r^2 = 0.3698

Test the null hypothesis that there is no relationship between food expenditure and total expenditure (H0: b_1 = 0).

Solution:
f_obs = t_{obs}^2 = 5.5770^2 = 31.103
At the 5% level, the critical value for the F(1, 55 - 2) distribution is f_\alpha = 4.03 (obtained from the statistical tables, based on an F(1, 50) distribution). Since f_obs > f_\alpha, H0 can be rejected.
III. Regression Analysis
Exercise: Source: DeFusco et al., 2015
