
Introduction to Econometrics

Chapter 2 and 3

Sebastiano Manzan

ECON-UB251

1 / 70
Brief Overview of the Course

▶ Economics suggests important relationships, often with policy
implications, but virtually never suggests quantitative magnitudes of
causal effects.
▶ What is the quantitative effect of reducing class size on student
achievement?
▶ How does another year of education change earnings?
▶ What is the price elasticity of cigarettes? or Peloton bikes?
▶ What is the effect on output growth of a 1 percentage point increase in
interest rates by the Fed?
▶ What is the effect on housing prices of environmental improvements?

2 / 70
This course is about using data to measure causal effects

▶ Ideally, we would like an experiment


▶ What would be an experiment to estimate the effect of class size on
standardized test scores?

▶ But almost always we only have observational (nonexperimental) data.


▶ returns to education
▶ cigarette prices
▶ monetary policy

▶ Most of the course deals with difficulties arising from using
observational data to estimate causal effects
▶ confounding effects (omitted factors)
▶ simultaneous causality
▶ "correlation does not imply causation"

3 / 70
In this course you will

▶ Learn methods for estimating causal effects using observational data


▶ Learn methods for prediction, including forecasting using time series
data
▶ Learn to evaluate the regression analysis of others – this means you
will be able to read/understand empirical economics and finance
papers in other econ courses
▶ Get some hands-on experience with regression analysis in your
problem sets

4 / 70
Review of Probability and Statistics (SW Chapters 2, 3)

▶ Empirical problem: Class size and educational output

▶ Policy question: What is the effect on test scores (or some other
outcome measure) of reducing class size by one student per class? by
8 students/class?

▶ We must use data to find out (is there any way to answer this without
data?)

5 / 70
The California Test Score Data Set

▶ All K-6 and K-8 California school districts (n = 420). Variables:


▶ 5th grade test scores (Stanford-9 achievement test, combined math
and reading), district average
▶ Student-teacher ratio (STR) = no. of students in the district divided
by no. of full-time equivalent teachers

Figure 1: Summary of the Distribution of Student–Teacher Ratios and
Fifth-Grade Test Scores for 420 K–8 Districts in California in 1999

▶ This table does not tell us anything about the relationship between
test scores and the STR

6 / 70
Do districts with smaller classes have higher test scores?

▶ Scatterplot of test score v. student-teacher ratio

▶ What does this figure show?

7 / 70
R

library(readxl)
caschool <- read_xlsx("./data/caschool.xlsx")
dplyr::glimpse(caschool)

Rows: 420
Columns: 18
$ `Observation Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15~
$ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, ~
$ county <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "F~
$ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Th~
$ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-~
$ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 22~
$ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 4~
$ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.31~
$ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.95~
$ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56,~
$ testscr <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 6~
$ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0~
$ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988,~
$ str <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133,~
$ avginc <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333~
$ el_pct <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.85767~
$ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, ~
$ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, ~

8 / 70
mean(caschool$testscr)

[1] 654.1565

sd(caschool$testscr)

[1] 19.05335

c(mean(caschool$str), sd(caschool$str))

[1] 19.640425 1.891812

9 / 70
library(ggplot2)
ggplot(caschool, aes(str, testscr)) + geom_point()

[Scatterplot of test score (testscr) against student–teacher ratio (str) for the 420 districts]
10 / 70
▶ We need to get some numerical evidence on whether districts with low
STRs have higher test scores:
1. Compare average test scores in districts with low STRs to those with
high STRs (estimation)

2. Test the null hypothesis that the mean test scores in the two types of
districts are the same, against the alternative hypothesis that they
differ (hypothesis testing)

3. Estimate an interval for the difference in the mean test scores, high v.
low STR districts (confidence interval)

11 / 70
Initial data analysis

▶ Compare districts with "small" (STR < 20) and "large" (STR ≥ 20)
class sizes

Class Size Average St. Dev n


Small 657.4 19.4 238
Large 650.0 17.9 182

▶ Estimation of Δ = difference between group means


▶ Test the hypothesis that Δ = 0
▶ Construct a confidence interval for Δ

12 / 70
1. Estimation

$$\bar{Y}_{small} - \bar{Y}_{large} = \frac{1}{n_{small}}\sum_{i=1}^{n_{small}} Y_i - \frac{1}{n_{large}}\sum_{i=1}^{n_{large}} Y_i = 657.4 - 650.0 = 7.4$$

▶ Is 7.4 a large difference in a real-world sense?


▶ Standard deviation across districts = 19.1
▶ Is this a big enough difference to be important for school reform
discussions, for parents, or for a school committee?

13 / 70
2. Hypothesis testing

▶ Let’s test the null hypothesis that there is no difference between the
two groups
▶ The test statistic is

$$t = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{\sqrt{\dfrac{s^2_{small}}{n_{small}} + \dfrac{s^2_{large}}{n_{large}}}} = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{SE(\bar{Y}_{small} - \bar{Y}_{large})}$$

▶ where $SE(\bar{Y}_{small} - \bar{Y}_{large})$ represents the standard error of
$\bar{Y}_{small} - \bar{Y}_{large}$ and the variance estimates are calculated as

$$s^2_{small} = \frac{1}{n_{small} - 1}\sum_{i=1}^{n_{small}} \left(Y_i - \bar{Y}_{small}\right)^2$$

14 / 70
2. Hypothesis testing (continued)

▶ Compute the difference-of-means t -statistic:

Size Average Std. Dev. n


small 657.4 19.4 238
large 650.0 17.9 182
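
▶ Plugging the (rounded) table values into the t-statistic formula from the previous slide
gives

$$t = \frac{657.4 - 650.0}{\sqrt{\dfrac{19.4^2}{238} + \dfrac{17.9^2}{182}}} = \frac{7.4}{1.83} \approx 4.05$$

▶ (R's t.test later in these slides reports 4.04, using the unrounded group means and
standard deviations)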

▶ |𝑡| > 1.96: reject the null hypothesis that the two means are the same
(at the 5% significance level)

15 / 70
3. Confidence interval

A 95% confidence interval for the difference between the means is,
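
▶ using the rounded estimates above (this matches, up to sign, the interval reported by
t.test later):

$$\bar{Y}_{small} - \bar{Y}_{large} \pm 1.96\,SE(\bar{Y}_{small} - \bar{Y}_{large}) = 7.4 \pm 1.96 \times 1.83 \approx (3.8,\; 11.0)$$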

▶ Two equivalent statements:


▶ The 95% confidence interval for Δ doesn’t include 0

▶ The hypothesis that Δ = 0 is rejected at the 5% level

16 / 70
R

▶ Package dplyr is great for data manipulation


▶ The symbol %>% is used to chain commands and reads as "then"

library(dplyr)
caschool <- mutate(caschool, group = ifelse(str < 20, "small", "large"))
table <- group_by(caschool, group) %>%
summarize(Average = mean(testscr),
StdDev = sd(testscr),
N = n())
knitr::kable(table)

group Average StdDev N


large 649.9788 17.85336 182
small 657.3513 19.35801 238

17 / 70
library(ggplot2)
ggplot(caschool, aes(group,testscr)) + geom_boxplot() + theme_bw()

[Boxplots of test scores (testscr) for the large and small class-size groups]

18 / 70
t.test(testscr ~ group, data = caschool)

Welch Two Sample t-test

data: testscr by group


t = -4.0426, df = 403.61, p-value = 6.333e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.957525 -3.787296
sample estimates:
mean in group large mean in group small
649.9788 657.3513

19 / 70
What comes next

▶ The mechanics of estimation, hypothesis testing, and confidence
intervals should be familiar

▶ These concepts extend directly to regression

▶ We will review some of the underlying theory of estimation,
hypothesis testing, and confidence intervals:
▶ Why do these procedures work, and why use these rather than others?
▶ We will review the intellectual foundations of statistics and
econometrics

20 / 70
Review of Statistical Theory

1. The probability framework for statistical inference

2. Estimation
3. Testing
4. Confidence Intervals

The probability framework for statistical inference:
▶ Population, random variable, and distribution
▶ Moments of a distribution (mean, variance, standard deviation,
covariance, correlation)
▶ Conditional distributions and conditional means
▶ Distribution of a sample of data drawn randomly from a population:
𝑌1 , ⋯ , 𝑌𝑛

21 / 70
Population, random variable, and distribution

▶ Population
▶ The group or collection of all possible entities of interest (school
districts)
▶ We will think of populations as infinitely large (∞ is an approximation
to "very big")

▶ Random variable Y
▶ Numerical summary of a random outcome (district average test score,
district STR)

▶ Population distribution of Y
▶ The probabilities of different values of Y that occur in the population,
for ex. Pr[Y = 650] (when Y is discrete)
▶ The probabilities of sets of these values, for ex. Pr[640 < Y < 660]
(when Y is continuous)

22 / 70
Moments of a population distribution: mean, variance, standard
deviation, covariance, correlation

▶ A random variable Y can be described by certain characteristics of its
distribution

1. mean = expected value (expectation) of Y


▶ E(Y) = 𝜇𝑌
▶ long-run average value of Y over repeated realizations of Y

2. variance
▶ $E[(Y - \mu_Y)^2] = \sigma^2_Y$
▶ measure of the squared spread of the distribution

3. standard deviation: $\sigma_Y = \sqrt{\sigma^2_Y}$

23 / 70
4. skewness $= \dfrac{E[(Y - \mu_Y)^3]}{\sigma^3_Y}$

▶ measure of asymmetry of a distribution


▶ skewness = 0 : distribution is symmetric
▶ skewness > (<) 0 : distribution has long right (left) tail

5. kurtosis $= \dfrac{E[(Y - \mu_Y)^4]}{\sigma^4_Y}$

▶ measure of probability in the tails


▶ measure of probability of large values
▶ kurtosis = 3: normal distribution
▶ kurtosis > 3: heavy tails (leptokurtotic)

24 / 70
25 / 70
2 random variables: joint distributions and covariance

▶ Random variables X and Z have a joint distribution


▶ The covariance between X and Z is

𝑐𝑜𝑣(𝑋, 𝑍) = 𝐸[(𝑋 − 𝜇𝑋 )(𝑍 − 𝜇𝑍 )] = 𝜎𝑋𝑍

▶ The covariance is a measure of the linear association between X and
Z; its units are units of X times units of Z
▶ cov( X , Z ) > 0 means a positive relation between X and Z
▶ If X and Z are independently distributed, then cov( X , Z ) = 0
▶ The covariance of a r.v. with itself is its variance:

$$cov(X, X) = E[(X - \mu_X)(X - \mu_X)] = E[(X - \mu_X)^2] = \sigma^2_X$$

26 / 70
▶ The covariance between Test Score and STR is negative:

▶ Data from 420 California school districts. There is a weak negative
relationship between the student–teacher ratio and test scores: the
sample correlation is −0.23

c(cov(caschool$testscr, caschool$str),
cor(caschool$testscr, caschool$str))

[1] -8.1593241 -0.2263628
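
▶ As a quick check (a sketch; it assumes caschool is loaded as above), the covariance and
correlation formulas can be computed by hand and compared with cov() and cor():

n <- nrow(caschool)
dx <- caschool$testscr - mean(caschool$testscr)
dz <- caschool$str - mean(caschool$str)
cov_xz <- sum(dx * dz) / (n - 1)                               # sample covariance
cor_xz <- cov_xz / (sd(caschool$testscr) * sd(caschool$str))   # rescaled covariance
c(cov_xz, cor_xz)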

27 / 70
The correlation coefficient is defined in terms of the covariance

▶ The correlation coefficient is defined as:

$$corr(X, Z) = \frac{cov(X, Z)}{\sqrt{var(X)\,var(Z)}} = \frac{\sigma_{XZ}}{\sqrt{\sigma^2_X\,\sigma^2_Z}} = r_{XZ}$$

▶ The correlation coefficient represents a rescaled covariance


▶ Property: –1 ≤ 𝑐𝑜𝑟𝑟(𝑋, 𝑍) ≤ 1
▶ corr( X, Z) = 1 means perfect positive linear association
▶ corr( X, Z) = –1 means perfect negative linear association
▶ corr( X, Z) = 0 means no linear association

28 / 70
▶ The correlation coefficient measures linear association

29 / 70
(c) Conditional distributions and conditional means

▶ Conditional distributions
▶ The distribution of Y, given value(s) of some other random variable, X
▶ Ex: the distribution of test scores, given that STR < 20

▶ Conditional expectations and conditional moments


▶ conditional mean = mean of conditional distribution = 𝐸(𝑌 |𝑋 = 𝑥)
▶ conditional variance = variance of conditional distribution
▶ Example: 𝐸(test score|𝑆𝑇 𝑅 < 20) (the mean of test scores among
districts with small class sizes)

▶ The difference in means is the difference between the means of two
conditional distributions:

Δ = 𝐸(test score|𝑆𝑇 𝑅 < 20) − 𝐸(test score|𝑆𝑇 𝑅 ≥ 20)

30 / 70
▶ Other examples of conditional means:
▶ Mean wages of all female workers (Y = wages, X = gender)
▶ Mortality rate of those given an experimental treatment (Y = live/die;
X = treated/not treated)

▶ Notice that: if 𝐸(𝑋|𝑍) is constant, then 𝑐𝑜𝑟𝑟(𝑋, 𝑍) = 0


▶ Suppose you want to predict Y given the value of X
▶ Example: you want to predict someone’s income, given their years of
education
▶ A common measure of the quality of a prediction m of Y is the mean
squared prediction error (MSPE), given X: 𝐸[(𝑌 − 𝑚)2 |𝑋]
▶ Of all possible predictions m that depend on X, the conditional mean
𝐸(𝑌 |𝑋) has the smallest mean squared prediction error (see the sketch below)
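
▶ A minimal simulation sketch of this claim (a hypothetical example, not from the slides):
predicting with the true conditional mean E(Y|X) gives a smaller MSPE than any other
rule based on X

set.seed(42)
n <- 1e5
x <- rnorm(n)
y <- 2 * x + rnorm(n)                   # by construction, E(Y|X) = 2X
mspe_cond_mean <- mean((y - 2 * x)^2)   # predict with the conditional mean
mspe_other     <- mean((y - 1.5 * x)^2) # any other X-based prediction rule
c(mspe_cond_mean, mspe_other)           # roughly 1 vs 1.25: the conditional mean wins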

31 / 70
Distribution of a sample of data drawn randomly from a
population: 𝑌1 , ⋯ , 𝑌𝑛

▶ We will assume simple random sampling


▶ Choose an individual (district, entity) at random from the population

▶ Randomness and data


▶ Prior to sample selection, the value of Y is random because the
individual selected is random
▶ Once the individual is selected and the value of Y is observed, then Y
is just a number (not random)
▶ The data set is (𝑌1 , 𝑌2 , ⋯ , 𝑌𝑛 ), where 𝑌𝑖 = value of Y for the i-th
individual (district, entity) sampled

32 / 70
Distribution of 𝑌1 , ⋯ , 𝑌𝑛 under simple random sampling

▶ Because individuals #1 and #2 are selected at random, the value of
𝑌1 has no information content for 𝑌2 . Thus:
▶ 𝑌1 and 𝑌2 are independently distributed
▶ 𝑌1 and 𝑌2 come from the same distribution, that is, 𝑌1 and 𝑌2 are
identically distributed
▶ Under simple random sampling, 𝑌1 and 𝑌2 are independently and
identically distributed (i.i.d.)
▶ More generally, under simple random sampling, 𝑌𝑖 , 𝑖 = 1, ⋯ , 𝑛 , are
i.i.d.

33 / 70
Estimation

▶ This set of assumptions (i.e., statistical model) allows rigorous
statistical inference about moments of population distributions
using a sample of data from that population
▶ 𝑌 ̄ is the natural estimator of the mean. But:
▶ What are the properties of 𝑌 ̄ ?
▶ Why should we use 𝑌 ̄ rather than some other estimator?
▶ 𝑌1 (the first observation)
▶ maybe unequal weights – not simple average
▶ median (𝑌1 , ⋯ , 𝑌𝑛 )
▶ The starting point is the sampling distribution of 𝑌 ̄

34 / 70
Sampling distribution of 𝑌 ̄

▶ 𝑌 ̄ is a random variable, and its properties are determined by the
sampling distribution of 𝑌 ̄
1. The individuals in the sample are drawn at random
2. Thus the values of (𝑌1 , ⋯ , 𝑌𝑛 ) are random
3. Thus functions of (𝑌1 , ⋯ , 𝑌𝑛 ) , such as 𝑌 ̄ , are random; a different
sample would produce a different average value
4. The distribution of 𝑌 ̄ over different possible samples of size n is called
the sampling distribution of 𝑌 ̄
▶ The mean and variance of 𝑌 ̄ are the mean and variance of its
sampling distribution, 𝐸(𝑌 ̄ ) and 𝑉 𝑎𝑟(𝑌 ̄ )

35 / 70
▶ Example: Suppose Y takes on 0 or 1 (a Bernoulli random variable)
with probability distribution Pr(Y = 0) = 0.22, Pr(Y = 1) = 0.78
▶ Then
$$E(Y) = p \cdot 1 + (1 - p) \cdot 0 = p = 0.78$$

$$\sigma^2_Y = E[(Y - \mu_Y)^2] = p(1 - p)^2 + (1 - p)(0 - p)^2 = p(1 - p) = 0.78 \times 0.22 = 0.1716$$


▶ The sampling distribution of 𝑌 ̄ depends on n
▶ Consider n = 2. The sampling distribution of 𝑌 ̄ is,

$$Pr(\bar{Y} = 0) = 0.22^2 = 0.0484$$

$$Pr(\bar{Y} = 0.5) = 2 \times 0.22 \times 0.78 = 0.3432$$

$$Pr(\bar{Y} = 1) = 0.78^2 = 0.6084$$
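
▶ A one-line check in R (the number of 1s in n = 2 draws is binomial):

dbinom(0:2, size = 2, prob = 0.78)   # Pr(Ybar = 0), Pr(Ybar = 0.5), Pr(Ybar = 1)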

36 / 70
The sampling distribution of 𝑌 ̄ when Y is Bernoulli (p = 0.78):

37 / 70
R

myfunction <- function(p, N, B) {
  sample <- matrix(NA, B, 1)
  for (i in 1:B) sample[i] <- mean(rbinom(N, 1, p))
  library(ggplot2)
  ggplot(data.frame(sample)) +
    geom_histogram(aes(sample), bins = 20, fill = "seagreen2") +
    geom_vline(xintercept = mean(sample), color = "tomato3") +
    theme_classic()
}

# Left panel: N = 2
p = 0.78   # Pr(Y = 1)
N = 2      # SAMPLE SIZE
B = 10000  # NUMBER OF SAMPLES
myfunction(p,N,B)

# Right panel: N = 100
p = 0.78   # Pr(Y = 1)
N = 100    # SAMPLE SIZE
B = 10000  # NUMBER OF SAMPLES
myfunction(p,N,B)

[Histograms of the 10,000 simulated sample averages (count vs. sample): left panel N = 2, right panel N = 100]

38 / 70
▶ What is the mean of 𝑌 ̄ ? If 𝐸(𝑌 ̄ ) = 𝜇 = 0.78, then 𝑌 ̄ is an unbiased
estimator of 𝜇
▶ What is the variance of 𝑌 ̄ ? How does it depend on n?
▶ Does 𝑌 ̄ become close to 𝜇 when n is large?
▶ Law of large numbers: 𝑌 ̄ is a consistent estimator of 𝜇
▶ 𝑌 ̄ − 𝜇 appears bell shaped for n large: is this generally true?
▶ In fact, 𝑌 ̄ − 𝜇 is approximately normally distributed for n large
(Central Limit Theorem)

39 / 70
The mean and variance of the sampling distribution of 𝑌 ̄

▶ In the general case that 𝑌𝑖 is i.i.d. from any distribution (not just
Bernoulli)
▶ Mean:

$$E(\bar{Y}) = E\left(\frac{1}{N}\sum_{i=1}^{N} Y_i\right) = \frac{1}{N}\sum_{i=1}^{N} E(Y_i) = \frac{1}{N}\sum_{i=1}^{N}\mu_Y = \mu_Y$$

▶ Variance:

$$
\begin{aligned}
Var(\bar{Y}) &= E\left[\bar{Y} - E(\bar{Y})\right]^2 = E\left[\bar{Y} - \mu_Y\right]^2 \\
&= E\left[\frac{1}{N}\sum_{i=1}^{N} Y_i - \mu_Y\right]^2 = E\left[\frac{1}{N}\sum_{i=1}^{N}(Y_i - \mu_Y)\right]^2 \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} E\left[(Y_i - \mu_Y)(Y_j - \mu_Y)\right] \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} cov(Y_i, Y_j) = \frac{1}{N^2}\sum_{i=1}^{N}\sigma^2_Y = \frac{\sigma^2_Y}{N}
\end{aligned}
$$

▶ where the last step uses that, under i.i.d. sampling, $cov(Y_i, Y_j) = 0$ for $i \neq j$
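
▶ A small simulation sketch of these two results (a hypothetical check, reusing the
Bernoulli example with p = 0.78):

set.seed(1)
p <- 0.78; N <- 50; B <- 10000
ybars <- replicate(B, mean(rbinom(N, 1, p)))   # B sample averages, each from N draws
c(mean(ybars), p)                              # E(Ybar) is approximately mu_Y = p
c(var(ybars), p * (1 - p) / N)                 # Var(Ybar) is approximately sigma^2_Y / N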

40 / 70
The mean and variance of the sampling distribution of 𝑌 ̄

▶ MEAN: 𝐸(𝑌 ̄ ) = 𝜇𝑌
▶ 𝑌 ̄ is an unbiased estimator of 𝜇𝑌
▶ VARIANCE: $Var(\bar{Y}) = \dfrac{\sigma^2_Y}{N}$
▶ the spread of the sampling distribution is inversely related to 𝑁
▶ the larger the sample the more accurate the estimate
▶ As n increases, the distribution of 𝑌 ̄ becomes more tightly centered
around 𝜇𝑌 (Law of Large Numbers)
▶ Moreover, the distribution of 𝑌 ̄ − 𝜇𝑌 becomes normal (Central Limit
Theorem)

41 / 70
Law of Large Numbers (LLN)

▶ An estimator is consistent if the probability that it falls within an
interval of the true population value tends to one as the sample size
increases
▶ [LLN] If 𝑌1 , ⋯ , 𝑌𝑁 are i.i.d. and 𝜎𝑌2 < ∞, then 𝑌 ̄ is a consistent
estimator of 𝜇𝑌 , that is,

$$Pr(|\bar{Y} - \mu_Y| < \epsilon) \to 1 \text{ as } n \to \infty$$

which can be written as $\bar{Y} \xrightarrow{p} \mu_Y$ ($\bar{Y}$ converges in probability to $\mu_Y$)

42 / 70
Central Limit Theorem (CLT)

▶ [CLT] If 𝑌1 , ⋯ , 𝑌𝑁 are i.i.d. and 𝜎𝑌2 < ∞, then when n is large the
distribution of 𝑌 ̄ is well approximated by a normal distribution

$$\bar{Y} \sim N\left(\mu_Y, \frac{\sigma^2_Y}{N}\right)$$

▶ and

$$\sqrt{n}\,\frac{\bar{Y} - \mu_Y}{\sigma_Y} \sim N(0, 1)$$
▶ The larger n, the better the approximation

43 / 70
Sampling distribution of 𝑌 ̄ when Y is Bernoulli with p = 0.78:

44 / 70
▶ distribution of the standardized sample average $(\bar{Y} - E(\bar{Y}))/\sqrt{var(\bar{Y})}$

45 / 70
R

▶ If 𝑌 is Bernoulli with 𝑝 = 0.78 then 𝐸(𝑌 ) = 0.78 and
𝜎𝑌2 = 0.78 ∗ 0.22 = 0.1716
myfunction <- function(p, N, B) {
  sample <- data.frame(matrix(NA, B, 2))
  colnames(sample) <- c("average", "std_average")

  for (i in 1:B) {
    Y <- rbinom(N, 1, p)
    sample[i, 1] <- mean(Y, na.rm = T)
    sample[i, 2] <- (sample[i, 1] - p) / (p * (1 - p) / N)^0.5  # here we are standardizing
  }

  library(ggplot2)
  ggplot(sample, aes(std_average)) +
    geom_histogram(aes(y = ..density..), binwidth = 0.2, fill = "seagreen2") +
    stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
    theme_classic()
}

46 / 70
set.seed(3453)
p = 0.78
N = 1000
B = 10000
myfunction(p,N,B)

[Histogram of the standardized sample average (density vs. std_average) with the N(0,1) density overlaid]

47 / 70
Summary: The Sampling Distribution of 𝑌 ̄

▶ For 𝑌1 , ⋯ , 𝑌𝑁 i.i.d. with 𝜎𝑌2 < ∞:

▶ In finite samples the distribution of 𝑌 ̄ has mean 𝜇𝑌 and variance 𝜎𝑌2 /𝑁
▶ the exact distribution depends on the distribution of 𝑌𝑖
▶ if we assume that 𝑌𝑖 is normal then 𝑌 ̄ is also normally distributed
▶ When n is large, the LLN and the CLT imply that

1. $\bar{Y} \xrightarrow{p} \mu_Y$
2. $\sqrt{n}\,\dfrac{\bar{Y} - \mu_Y}{\sigma_Y} \sim N(0, 1)$

48 / 70
(b) Why Use 𝑌 ̄ To Estimate 𝜇𝑌 ?

▶ 𝑌 ̄ is unbiased: 𝐸(𝑌 ̄ ) = 𝜇𝑌
▶ 𝑌 ̄ is consistent: $\bar{Y} \xrightarrow{p} \mu_Y$
▶ In addition: 𝑌 ̄ is the least squares estimator of 𝜇𝑌 , that is, it
minimizes the sum of squared errors (or residuals)

$$\min_m \sum_{i=1}^{n} (Y_i - m)^2$$

▶ the estimator 𝑚̂ that solves the least squares problem is 𝑚̂ = 𝑌 ̄
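
▶ To see why (a one-line derivation filling in the step above): setting the derivative of
the sum of squared errors with respect to m equal to zero gives

$$\frac{d}{dm}\sum_{i=1}^{n}(Y_i - m)^2 = -2\sum_{i=1}^{n}(Y_i - m) = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i = n\,\hat{m} \;\Rightarrow\; \hat{m} = \bar{Y}$$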

49 / 70
Hypothesis Testing

▶ The hypothesis testing problem (for the mean): decide based on the
sample evidence whether
▶ a null hypothesis is true
▶ an alternative hypothesis is true

▶ In statistical notation:
▶ 𝐻0 ∶ 𝐸(𝑌 ) = 𝜇𝑌 ,0 vs. 𝐻1 ∶ 𝐸(𝑌 ) > 𝜇𝑌 ,0 (1-sided, >)
▶ 𝐻0 ∶ 𝐸(𝑌 ) = 𝜇𝑌 ,0 vs. 𝐻1 ∶ 𝐸(𝑌 ) < 𝜇𝑌 ,0 (1-sided, <)
▶ 𝐻0 ∶ 𝐸(𝑌 ) = 𝜇𝑌 ,0 vs. 𝐻1 ∶ 𝐸(𝑌 ) ≠ 𝜇𝑌 ,0 (2-sided)

50 / 70
▶ p-value = probability of drawing a statistic (e.g., 𝑌 ̄ ) at least as
adverse to the null as the value actually computed with your data,
assuming that the null hypothesis is true
▶ significance level: a pre-specified probability of incorrectly rejecting
the null, when the null is true
▶ Calculating the p-value based on 𝑌 ̄ :

p-value = 𝑃 𝑟 [|𝑌 ̄ − 𝜇𝑌 ,0 | > |𝑌 ̄ 𝑎𝑐𝑡𝑢𝑎𝑙 − 𝜇𝑌 ,0 |]

▶ where 𝑌 ̄ 𝑎𝑐𝑡𝑢𝑎𝑙 is the value of 𝑌 ̄ actually observed

51 / 70
▶ If n is large, the sampling distribution of 𝑌 ̄ is obtained using the
normal approximation (CLT):

$$\text{p-value} = Pr_{H_0}\left[|\bar{Y} - \mu_{Y,0}| > |\bar{Y}^{actual} - \mu_{Y,0}|\right] = Pr_{H_0}\left[\left|\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right| > \left|\frac{\bar{Y}^{actual} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right|\right]$$

▶ where $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{n}$
▶ Under normality, the p-value is calculated as the probability, in the left and right
tails, of values larger in absolute value than $\frac{\bar{Y}^{actual} - \mu_{Y,0}}{\sigma_{\bar{Y}}}$

52 / 70
Calculating the p-value with 𝜎𝑌 known

▶ For large n, the p-value is the probability that a N(0,1) random variable
falls outside $\left|\frac{\bar{Y}^{actual} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right|$

▶ In practice, 𝜎𝑌 ̄ is unknown and must be estimated

53 / 70
Estimator of the variance of Y

▶ Since we do not know the population variance of 𝑌 , we need to
estimate 𝜎𝑌2 as well

$$s^2_Y = \frac{1}{n - 1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

▶ if 𝑌𝑖 are i.i.d. then we can apply the LLN to 𝑠2𝑌 , since it is an
average, and be confident that $s^2_Y \xrightarrow{p} \sigma^2_Y$
▶ Define $t = \frac{\bar{Y} - \mu_{Y,0}}{s_Y / \sqrt{N}}$; then the p-value is obtained as

$$\text{p-value} = Pr_{H_0}\left(|t| > |t^{actual}|\right)$$
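
▶ A minimal sketch in R (an illustration, not part of the slides; it assumes caschool is
loaded as above and tests the hypothetical null 𝜇𝑌,0 = 650 for the mean test score):

Y <- caschool$testscr
mu0 <- 650                                # hypothetical null value
t_actual <- (mean(Y) - mu0) / (sd(Y) / sqrt(length(Y)))
p_value <- 2 * pnorm(-abs(t_actual))      # large-n (normal) approximation
c(t_actual, p_value)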

54 / 70
What is the link between the p-value and the significance level?

▶ The significance level (typically denoted by 𝛼) is prespecified at the
typical values of 1, 5, or 10%
▶ if 𝛼 = 0.05 then you reject the null hypothesis 𝐻0 if
1. |𝑡| > 1.96 (the critical value)
2. p-value ≤ 0.05
▶ The p-value is referred to as the marginal significance level
▶ Often, it is better to communicate the p-value than simply whether a
test rejects or not
▶ the p-value contains more information than the "yes/no" statement
about whether the test rejects.

55 / 70
▶ The value of 1.96 that we used to reject the null is the 5% critical
value
▶ The critical values of the Student t-distribution are tabulated in all
statistics books
▶ To reject the null hypothesis using the critical values:
1. Compute the t -statistic
2. Compute the degrees of freedom, which is 𝑛–1
3. Look up the 𝛼 critical value
4. The null is rejected if the t-statistic exceeds (in absolute value) this
critical value

56 / 70
▶ The difference between the t-distribution and N(0,1) critical values
depends on the sample size n and becomes negligible once the sample is
larger than 120.
▶ Some 5% critical values for a 2-sided test:

d.o.f. ( n – 1) 5% critical value


10 2.23
20 2.09
30 2.04
60 2.00
∞ 1.96
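
▶ These values can be reproduced in R (a quick check for a two-sided 5% test):

round(qt(0.975, df = c(10, 20, 30, 60)), 2)   # 2.23 2.09 2.04 2.00
round(qnorm(0.975), 2)                        # 1.96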

▶ The Student t-distribution is only relevant when we are willing to
assume that the population distribution of Y is normal
▶ In economic data, the normality assumption is rarely credible (e.g.,
the earnings distribution)

57 / 70
Figure 2: The four distributions of earnings are for women and men,
for those with only a high school diploma (a and c) and those whose
highest degree is from a four-year college (b and d).

58 / 70
▶ For n > 30 , the t-distribution and N(0,1) are very close and as n
grows large, the 𝑡𝑛–1 distribution converges to N(0,1)
▶ When sample sizes were small and there were no computers, the
t-distribution provided exact inference (under the normality assumption)
▶ For historical reasons, statistical software typically still uses the
t-distribution to compute p-values, but this is irrelevant when the
sample size is moderate or large
▶ For these reasons, in this class we will focus on the large-n
approximation given by the CLT

59 / 70
Confidence Intervals

▶ A 95% confidence interval for 𝜇𝑌 is an interval that contains the true
value of 𝜇𝑌 in 95% of repeated samples
▶ A 95% confidence interval can always be constructed as the set of
values of 𝜇𝑌 not rejected by a hypothesis test with a 5% significance
level.

$$\mu_Y \in \left(\bar{Y} - 1.96\,\frac{s_Y}{\sqrt{n}},\; \bar{Y} + 1.96\,\frac{s_Y}{\sqrt{n}}\right)$$

▶ Here we are making two assumptions: 1) 𝑌 ̄ is approximately normal,
2) $s^2_Y \xrightarrow{p} \sigma^2_Y$
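
▶ As an illustration (a sketch, assuming caschool is loaded as earlier), a 95% confidence
interval for the mean district test score:

Y <- caschool$testscr
c(mean(Y) - 1.96 * sd(Y) / sqrt(length(Y)),
  mean(Y) + 1.96 * sd(Y) / sqrt(length(Y)))   # approximately (652.3, 656.0)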

60 / 70
R

library(tidyquant)
sp <- tq_get("SPY", get = "stock.prices", from = "1995-01-01")
head(sp)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 1995-01-03 45.7 45.8 45.7 45.8 324300 28.3
2 SPY 1995-01-04 46.0 46 45.8 46 351800 28.4
3 SPY 1995-01-05 46.0 46.1 46.0 46 89800 28.4
4 SPY 1995-01-06 46.1 46.2 45.9 46.0 448400 28.5
5 SPY 1995-01-09 46.0 46.1 46 46.1 36800 28.5
6 SPY 1995-01-10 46.2 46.4 46.1 46.1 229800 28.5

tail(sp)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2021-08-31 452. 452. 451. 452. 59300200 452.
2 SPY 2021-09-01 453. 453. 452. 452. 48721400 452.
3 SPY 2021-09-02 453. 454. 452. 453. 42501000 453.
4 SPY 2021-09-03 452. 454. 452. 453. 47170500 453.
5 SPY 2021-09-07 453. 453. 451. 451. 51671500 451.
6 SPY 2021-09-08 451. 452. 449. 451. 56121700 451.

61 / 70
ggplot(sp, aes(date, adjusted)) +
  geom_line() + theme_bw()

ggplot(sp, aes(date, log(adjusted))) +
  geom_line() + theme_bw()

[Line plots of SPY adjusted price (left panel) and log adjusted price (right panel) against date]

62 / 70
sp <- mutate(sp, return = 100*(log(adjusted) - log(lag(adjusted))))
tail(sp)

# A tibble: 6 x 9
symbol date open high low close volume adjusted return
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2021-08-31 452. 452. 451. 452. 59300200 452. -0.148
2 SPY 2021-09-01 453. 453. 452. 452. 48721400 452. 0.0531
3 SPY 2021-09-02 453. 454. 452. 453. 42501000 453. 0.307
4 SPY 2021-09-03 452. 454. 452. 453. 47170500 453. -0.0243
5 SPY 2021-09-07 453. 453. 451. 451. 51671500 451. -0.358
6 SPY 2021-09-08 451. 452. 449. 451. 56121700 451. -0.122

ggplot(sp, aes(date, return)) + geom_line() + theme_bw()

[Line plot of SPY daily log returns (return vs. date)]

63 / 70
ggplot(sp, aes(return)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.1, fill = "steelblue1") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sp$return, na.rm = T),
                            sd = sd(sp$return, na.rm = T))) +
  theme_light()

[Histogram of SPY daily returns (density vs. return) with the fitted normal density overlaid]

64 / 70
library(psych)
c(skew(sp$return), kurtosi(sp$return))

[1] -0.298641 11.287862

describe(sp$return)

vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 6718 0.04 1.21 0.07 0.07 0.79 -11.59 13.56 25.15 -0.3 11.29
se
X1 0.01

65 / 70
fin.data <- tq_get(c("AAPL", "BAC", "SPY"), get = "stock.prices", from = "1995-01-01")
head(fin.data)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 1995-01-03 0.347 0.347 0.338 0.343 103868800 0.291
2 AAPL 1995-01-04 0.345 0.354 0.345 0.352 158681600 0.298
3 AAPL 1995-01-05 0.350 0.352 0.346 0.347 73640000 0.295
4 AAPL 1995-01-06 0.372 0.385 0.367 0.375 1076622400 0.318
5 AAPL 1995-01-09 0.372 0.374 0.366 0.368 274086400 0.312
6 AAPL 1995-01-10 0.368 0.393 0.368 0.390 614790400 0.331

tail(fin.data)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2021-08-31 452. 452. 451. 452. 59300200 452.
2 SPY 2021-09-01 453. 453. 452. 452. 48721400 452.
3 SPY 2021-09-02 453. 454. 452. 453. 42501000 453.
4 SPY 2021-09-03 452. 454. 452. 453. 47170500 453.
5 SPY 2021-09-07 453. 453. 451. 451. 51671500 451.
6 SPY 2021-09-08 451. 452. 449. 451. 56121700 451.

66 / 70
ret.data = fin.data %>%
group_by(symbol) %>%
mutate(return = 100 * (log(adjusted) - log(lag(adjusted)))) %>%
select(symbol, date, return)
head(ret.data)

# A tibble: 6 x 3
# Groups: symbol [1]
symbol date return
<chr> <date> <dbl>
1 AAPL 1995-01-03 NA
2 AAPL 1995-01-04 2.57
3 AAPL 1995-01-05 -1.28
4 AAPL 1995-01-06 7.73
5 AAPL 1995-01-09 -1.92
6 AAPL 1995-01-10 5.85

ret.data %>%
group_by(symbol) %>%
summarize(Average = mean(return, na.rm = T),
StdDev = sd(return, na.rm = T),
N = n()) %>%
knitr::kable()

symbol Average StdDev N


AAPL 0.0934655 2.811128 6719
BAC 0.0293548 2.738327 6719
SPY 0.0412123 1.213518 6719

67 / 70
library(tidyr)
ret.data <- ret.data %>%
pivot_wider(names_from = symbol, values_from = return)
head(ret.data)

# A tibble: 6 x 4
date AAPL BAC SPY
<date> <dbl> <dbl> <dbl>
1 1995-01-03 NA NA NA
2 1995-01-04 2.57 0.816 0.477
3 1995-01-05 -1.28 0.810 0
4 1995-01-06 7.73 -0.269 0.102
5 1995-01-09 -1.92 0 0.102
6 1995-01-10 5.85 1.60 0.102

cor(select(ret.data, -date), use = "complete")

AAPL BAC SPY


AAPL 1.0000000 0.2667942 0.4722566
BAC 0.2667942 1.0000000 0.6521778
SPY 0.4722566 0.6521778 1.0000000

68 / 70
▶ $H_0: \mu_{SPY} = 0$ vs $H_1: \mu_{SPY} \neq 0$

symbol Average StdDev N


SPY 0.0412123 1.213518 6719

▶ t = …..
▶ Reject or fail to reject?

69 / 70
Summary

▶ Based on two assumptions:

1. random sampling from a population: 𝑌𝑖 , 𝑖 = 1, ⋯ , 𝑛 are i.i.d.
2. 0 < 𝐸(𝑌 4 ) < ∞
▶ We derived, for large samples (large n), the:
1. Theory of estimation (sampling distribution of 𝑌 ̄ )
2. Theory of hypothesis testing (large-n distribution of t-statistic and
computation of the p-value)
3. Theory of confidence intervals (constructed by inverting the test
statistic)

70 / 70
