
Introduction to Econometrics

Chapter 2 and 3

Sebastiano Manzan

ECON-UB251

1 / 70
Brief Overview of the Course

▶ Economics suggests important relationships, often with policy
implications, but virtually never suggests quantitative magnitudes of
causal effects.
▶ What is the quantitative effect of reducing class size on student
achievement?
▶ How does another year of education change earnings?
▶ What is the price elasticity of cigarettes? or Peloton bikes?
▶ What is the effect on output growth of a 1 percentage point increase in
interest rates by the Fed?
▶ What is the effect on housing prices of environmental improvements?

2 / 70
This course is about using data to measure causal effects

▶ Ideally, we would like an experiment


▶ What would be an experiment to estimate the effect of class size on
standardized test scores?

▶ But almost always we only have observational (nonexperimental) data.


▶ returns to education
▶ cigarette prices
▶ monetary policy

▶ Most of the course deals with difficulties arising from using
observational data to estimate causal effects
▶ confounding effects (omitted factors)
▶ simultaneous causality
▶ "correlation does not imply causation"

3 / 70
In this course you will

▶ Learn methods for estimating causal effects using observational data


▶ Learn methods for prediction, including forecasting using time series
data
▶ Learn to evaluate the regression analysis of others – this means you
will be able to read/understand empirical economics and finance
papers in other econ courses
▶ Get some hands-on experience with regression analysis in your
problem sets

4 / 70
Review of Probability and Statistics (SW Chapters 2, 3)

▶ Empirical problem: Class size and educational output

▶ Policy question: What is the effect on test scores (or some other
outcome measure) of reducing class size by one student per class? by
8 students/class?

▶ We must use data to find out (is there any way to answer this without
data?)

5 / 70
The California Test Score Data Set

▶ All K-6 and K-8 California school districts (n = 420). Variables:


▶ 5th grade test scores (Stanford-9 achievement test, combined math
and reading), district average
▶ Student-teacher ratio (STR) = no. of students in the district divided
by no. of full-time equivalent teachers

Figure 1: Summary of the Distribution of Student–Teacher Ratios and
Fifth-Grade Test Scores for 420 K–8 Districts in California in 1999

▶ This table does not tell us anything about the relationship between
test scores and the STR

6 / 70
Do districts with smaller classes have higher test scores?

▶ Scatterplot of test score v. student-teacher ratio

▶ What does this figure show?

7 / 70
R

library(readxl)
caschool <- read_xlsx("./data/caschool.xlsx")
dplyr::glimpse(caschool)

Rows: 420
Columns: 18
$ `Observation Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15~
$ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, ~
$ county <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "F~
$ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Th~
$ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-~
$ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 22~
$ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 4~
$ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.31~
$ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.95~
$ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56,~
$ testscr <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 6~
$ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0~
$ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988,~
$ str <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133,~
$ avginc <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333~
$ el_pct <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.85767~
$ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, ~
$ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, ~

8 / 70
mean(caschool$testscr)

[1] 654.1565

sd(caschool$testscr)

[1] 19.05335

c(mean(caschool$str), sd(caschool$str))

[1] 19.640425 1.891812

9 / 70
library(ggplot2)
ggplot(caschool, aes(str, testscr)) + geom_point()

[Scatterplot of test score (testscr) against student–teacher ratio (str) for the 420 districts]
10 / 70
▶ We need to get some numerical evidence on whether districts with low
STRs have higher test scores:
1. Compare average test scores in districts with low STRs to those with
high STRs (estimation)

2. Test the null hypothesis that the mean test scores in the two types of
districts are the same, against the alternative hypothesis that they
differ (hypothesis testing)

3. Estimate an interval for the difference in the mean test scores, high v.
low STR districts (confidence interval)

11 / 70
Initial data analysis

▶ Compare districts with "small" (STR < 20) and "large" (STR ≥ 20)
class sizes

Class Size Average St. Dev n


Small 657.4 19.4 238
Large 650.0 17.9 182

▶ Estimation of Δ = difference between group means


▶ Test the hypothesis that Δ = 0
▶ Construct a confidence interval for Δ

12 / 70
1. Estimation

$$\bar{Y}_{small} - \bar{Y}_{large} = \frac{1}{n_{small}}\sum_{i=1}^{n_{small}} Y_i - \frac{1}{n_{large}}\sum_{i=1}^{n_{large}} Y_i = 657.4 - 650.0 = 7.4$$

▶ Is 7.4 a large difference in a real-world sense?


▶ Standard deviation across districts = 19.1
▶ Is this a big enough difference to be important for school reform
discussions, for parents, or for a school committee?

13 / 70
2. Hypothesis testing

▶ Let’s test the null hypothesis that there is no difference between the
two groups
▶ The test statistic is

$$t = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{\sqrt{\dfrac{s^2_{small}}{n_{small}} + \dfrac{s^2_{large}}{n_{large}}}} = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{SE(\bar{Y}_{small} - \bar{Y}_{large})}$$

▶ where $SE(\bar{Y}_{small} - \bar{Y}_{large})$ represents the standard error of
$\bar{Y}_{small} - \bar{Y}_{large}$ and the variance estimates are calculated as

$$s^2_{small} = \frac{1}{n_{small} - 1}\sum_{i=1}^{n_{small}} \left(Y_i - \bar{Y}_{small}\right)^2$$

14 / 70
2. Hypothesis testing (continued)

▶ Compute the difference-of-means t -statistic:

Size Average Std. Dev. n


small 657.4 19.4 238
large 650.0 17.9 182
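
▶ Plugging the (rounded) table values into the t-statistic formula from the previous slide
gives

$$t = \frac{657.4 - 650.0}{\sqrt{\dfrac{19.4^2}{238} + \dfrac{17.9^2}{182}}} = \frac{7.4}{1.83} \approx 4.05$$

▶ (R's t.test later in these slides reports 4.04, using the unrounded group means and
standard deviations)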

▶ |𝑡| > 1.96: reject the null hypothesis that the two means are the same
(at the 5% significance level)

15 / 70
3. Confidence interval

A 95% confidence interval for the difference between the means is,
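
▶ using the rounded estimates above (this matches, up to sign, the interval reported by
t.test later):

$$\bar{Y}_{small} - \bar{Y}_{large} \pm 1.96\,SE(\bar{Y}_{small} - \bar{Y}_{large}) = 7.4 \pm 1.96 \times 1.83 \approx (3.8,\; 11.0)$$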

▶ Two equivalent statements:


▶ The 95% confidence interval for Δ doesn’t include 0

▶ The hypothesis that Δ = 0 is rejected at the 5% level

16 / 70
R

▶ Package dplyr is great for data manipulation


▶ The symbol %>% is used to chain commands and reads as "then"

library(dplyr)
caschool <- mutate(caschool, group = ifelse(str < 20, "small", "large"))
table <- group_by(caschool, group) %>%
summarize(Average = mean(testscr),
StdDev = sd(testscr),
N = n())
knitr::kable(table)

group Average StdDev N


large 649.9788 17.85336 182
small 657.3513 19.35801 238

17 / 70
library(ggplot2)
ggplot(caschool, aes(group,testscr)) + geom_boxplot() + theme_bw()

[Boxplots of test scores (testscr) for the large and small class-size groups]

18 / 70
t.test(testscr ~ group, data = caschool)

Welch Two Sample t-test

data: testscr by group


t = -4.0426, df = 403.61, p-value = 6.333e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.957525 -3.787296
sample estimates:
mean in group large mean in group small
649.9788 657.3513

19 / 70
What comes next

▶ The mechanics of estimation, hypothesis testing, and confidence
intervals should be familiar

▶ These concepts extend directly to regression

▶ We will review some of the underlying theory of estimation,
hypothesis testing, and confidence intervals:
▶ Why do these procedures work, and why use these rather than others?
▶ We will review the intellectual foundations of statistics and
econometrics

20 / 70
Review of Statistical Theory

1. The probability framework for statistical inference

2. Estimation
3. Testing
4. Confidence Intervals

The probability framework for statistical inference:
▶ Population, random variable, and distribution
▶ Moments of a distribution (mean, variance, standard deviation,
covariance, correlation)
▶ Conditional distributions and conditional means
▶ Distribution of a sample of data drawn randomly from a population:
𝑌1 , ⋯ , 𝑌𝑛

21 / 70
Population, random variable, and distribution

▶ Population
▶ The group or collection of all possible entities of interest (school
districts)
▶ We will think of populations as infinitely large (∞ is an approximation
to "very big")

▶ Random variable Y
▶ Numerical summary of a random outcome (district average test score,
district STR)

▶ Population distribution of Y
▶ The probabilities of different values of Y that occur in the population,
for ex. Pr[Y = 650] (when Y is discrete)
▶ The probabilities of sets of these values, for ex. Pr[640 < Y < 660]
(when Y is continuous)

22 / 70
Moments of a population distribution: mean, variance, standard
deviation, covariance, correlation

▶ A random variable Y can be described by certain characteristics of its
distribution

1. mean = expected value (expectation) of Y


▶ E(Y) = 𝜇𝑌
▶ long-run average value of Y over repeated realizations of Y

2. variance
▶ $E[(Y - \mu_Y)^2] = \sigma^2_Y$
▶ measure of the squared spread of the distribution

3. standard deviation: $\sigma_Y = \sqrt{\sigma^2_Y}$

23 / 70
4. skewness $= \dfrac{E[(Y - \mu_Y)^3]}{\sigma^3_Y}$

▶ measure of asymmetry of a distribution


▶ skewness = 0 : distribution is symmetric
▶ skewness > (<) 0 : distribution has long right (left) tail

5. kurtosis $= \dfrac{E[(Y - \mu_Y)^4]}{\sigma^4_Y}$

▶ measure of probability in the tails


▶ measure of probability of large values
▶ kurtosis = 3: normal distribution
▶ kurtosis > 3: heavy tails (leptokurtotic)

24 / 70
25 / 70
2 random variables: joint distributions and covariance

▶ Random variables X and Z have a joint distribution


▶ The covariance between X and Z is

𝑐𝑜𝑣(𝑋, 𝑍) = 𝐸[(𝑋 − 𝜇𝑋 )(𝑍 − 𝜇𝑍 )] = 𝜎𝑋𝑍

▶ The covariance is a measure of the linear association between X and
Z; its units are units of X times units of Z
▶ cov( X , Z ) > 0 means a positive relation between X and Z
▶ If X and Z are independently distributed, then cov( X , Z ) = 0
▶ The covariance of a r.v. with itself is its variance:

$$cov(X, X) = E[(X - \mu_X)(X - \mu_X)] = E[(X - \mu_X)^2] = \sigma^2_X$$

26 / 70
▶ The covariance between Test Score and STR is negative:

▶ Data from 420 California school districts. There is a weak negative
relationship between the student–teacher ratio and test scores: the
sample correlation is −0.23

c(cov(caschool$testscr, caschool$str),
cor(caschool$testscr, caschool$str))

[1] -8.1593241 -0.2263628
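
▶ As a quick check (a sketch; it assumes caschool is loaded as above), the covariance and
correlation formulas can be computed by hand and compared with cov() and cor():

n <- nrow(caschool)
dx <- caschool$testscr - mean(caschool$testscr)
dz <- caschool$str - mean(caschool$str)
cov_xz <- sum(dx * dz) / (n - 1)                               # sample covariance
cor_xz <- cov_xz / (sd(caschool$testscr) * sd(caschool$str))   # rescaled covariance
c(cov_xz, cor_xz)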

27 / 70
The correlation coefficient is defined in terms of the covariance

▶ The correlation coefficient is defined as:

$$corr(X, Z) = \frac{cov(X, Z)}{\sqrt{var(X)\,var(Z)}} = \frac{\sigma_{XZ}}{\sqrt{\sigma^2_X\,\sigma^2_Z}} = r_{XZ}$$

▶ The correlation coefficient represents a rescaled covariance


▶ Property: –1 ≤ 𝑐𝑜𝑟𝑟(𝑋, 𝑍) ≤ 1
▶ corr( X, Z) = 1 means perfect positive linear association
▶ corr( X, Z) = –1 means perfect negative linear association
▶ corr( X, Z) = 0 means no linear association

28 / 70
▶ The correlation coefficient measures linear association

29 / 70
(c) Conditional distributions and conditional means

▶ Conditional distributions
▶ The distribution of Y, given value(s) of some other random variable, X
▶ Ex: the distribution of test scores, given that STR < 20

▶ Conditional expectations and conditional moments


▶ conditional mean = mean of conditional distribution = 𝐸(𝑌 |𝑋 = 𝑥)
▶ conditional variance = variance of conditional distribution
▶ Example: 𝐸(test score|𝑆𝑇 𝑅 < 20) (the mean of test scores among
districts with small class sizes)

▶ The difference in means is the difference between the means of two
conditional distributions:

Δ = 𝐸(test score|𝑆𝑇 𝑅 < 20) − 𝐸(test score|𝑆𝑇 𝑅 ≥ 20)

30 / 70
▶ Other examples of conditional means:
▶ Mean wages of all female workers (Y = wages, X = gender)
▶ Mortality rate of those given an experimental treatment (Y = live/die;
X = treated/not treated)

▶ Notice that: if 𝐸(𝑋|𝑍) is constant, then 𝑐𝑜𝑟𝑟(𝑋, 𝑍) = 0


▶ Suppose you want to predict Y given the value of X
▶ Example: you want to predict someone’s income, given their years of
education
▶ A common measure of the quality of a prediction m of Y is the mean
squared prediction error (MSPE), given X: 𝐸[(𝑌 − 𝑚)2 |𝑋]
▶ Of all possible predictions m that depend on X, the conditional mean
𝐸(𝑌 |𝑋) has the smallest mean squared prediction error (see the sketch below)
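
▶ A minimal simulation sketch of this claim (a hypothetical example, not from the slides):
predicting with the true conditional mean E(Y|X) gives a smaller MSPE than any other
rule based on X

set.seed(42)
n <- 1e5
x <- rnorm(n)
y <- 2 * x + rnorm(n)                   # by construction, E(Y|X) = 2X
mspe_cond_mean <- mean((y - 2 * x)^2)   # predict with the conditional mean
mspe_other     <- mean((y - 1.5 * x)^2) # any other X-based prediction rule
c(mspe_cond_mean, mspe_other)           # roughly 1 vs 1.25: the conditional mean wins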

31 / 70
Distribution of a sample of data drawn randomly from a
population: 𝑌1 , ⋯ , 𝑌𝑛

▶ We will assume simple random sampling


▶ Choose an individual (district, entity) at random from the population

▶ Randomness and data


▶ Prior to sample selection, the value of Y is random because the
individual selected is random
▶ Once the individual is selected and the value of Y is observed, then Y
is just a number (not random)
▶ The data set is (𝑌1 , 𝑌2 , ⋯ , 𝑌𝑛 ), where 𝑌𝑖 = value of Y for the i-th
individual (district, entity) sampled

32 / 70
Distribution of 𝑌1 , ⋯ , 𝑌𝑛 under simple random sampling

▶ Because individuals #1 and #2 are selected at random, the value of
𝑌1 has no information content for 𝑌2 . Thus:
▶ 𝑌1 and 𝑌2 are independently distributed
▶ 𝑌1 and 𝑌2 come from the same distribution, that is, 𝑌1 and 𝑌2 are
identically distributed
▶ Under simple random sampling, 𝑌1 and 𝑌2 are independently and
identically distributed (i.i.d.)
▶ More generally, under simple random sampling, 𝑌𝑖 , 𝑖 = 1, ⋯ , 𝑛 , are
i.i.d.

33 / 70
Estimation

▶ This set of assumptions (i.e., statistical model) allows rigorous
statistical inference about moments of population distributions
using a sample of data from that population
▶ 𝑌 ̄ is the natural estimator of the mean. But:
▶ What are the properties of 𝑌 ̄ ?
▶ Why should we use 𝑌 ̄ rather than some other estimator?
▶ 𝑌1 (the first observation)
▶ maybe unequal weights – not simple average
▶ median (𝑌1 , ⋯ , 𝑌𝑛 )
▶ The starting point is the sampling distribution of 𝑌 ̄

34 / 70
Sampling distribution of 𝑌 ̄

▶ 𝑌 ̄ is a random variable, and its properties are determined by the
sampling distribution of 𝑌 ̄
1. The individuals in the sample are drawn at random
2. Thus the values of (𝑌1 , ⋯ , 𝑌𝑛 ) are random
3. Thus functions of (𝑌1 , ⋯ , 𝑌𝑛 ) , such as 𝑌 ̄ , are random; a different
sample would produce a different average value
4. The distribution of 𝑌 ̄ over different possible samples of size n is called
the sampling distribution of 𝑌 ̄
▶ The mean and variance of 𝑌 ̄ are the mean and variance of its
sampling distribution, 𝐸(𝑌 ̄ ) and 𝑉 𝑎𝑟(𝑌 ̄ )

35 / 70
▶ Example: Suppose Y takes on 0 or 1 (a Bernoulli random variable)
with probability distribution Pr(Y = 0) = 0.22, Pr(Y = 1) = 0.78
▶ Then
$$E(Y) = p \cdot 1 + (1 - p) \cdot 0 = p = 0.78$$

$$\sigma^2_Y = E[(Y - \mu_Y)^2] = p(1 - p)^2 + (1 - p)(0 - p)^2 = p(1 - p) = 0.78 \times 0.22 = 0.1716$$


▶ The sampling distribution of 𝑌 ̄ depends on n
▶ Consider n = 2. The sampling distribution of 𝑌 ̄ is,

$$Pr(\bar{Y} = 0) = 0.22^2 = 0.0484$$

$$Pr(\bar{Y} = 0.5) = 2 \times 0.22 \times 0.78 = 0.3432$$

$$Pr(\bar{Y} = 1) = 0.78^2 = 0.6084$$
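
▶ A one-line check in R (the number of 1s in n = 2 draws is binomial):

dbinom(0:2, size = 2, prob = 0.78)   # Pr(Ybar = 0), Pr(Ybar = 0.5), Pr(Ybar = 1)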

36 / 70
The sampling distribution of 𝑌 ̄ when Y is Bernoulli (p = 0.78):

37 / 70
R

myfunction <- function(p, N, B) {
  sample <- matrix(NA, B, 1)
  for (i in 1:B) sample[i] <- mean(rbinom(N, 1, p))
  library(ggplot2)
  ggplot(data.frame(sample)) +
    geom_histogram(aes(sample), bins = 20, fill = "seagreen2") +
    geom_vline(xintercept = mean(sample), color = "tomato3") +
    theme_classic()
}

# Left panel: N = 2
p = 0.78   # Pr(Y = 1)
N = 2      # SAMPLE SIZE
B = 10000  # NUMBER OF SAMPLES
myfunction(p,N,B)

# Right panel: N = 100
p = 0.78   # Pr(Y = 1)
N = 100    # SAMPLE SIZE
B = 10000  # NUMBER OF SAMPLES
myfunction(p,N,B)

[Histograms of the 10,000 simulated sample averages (count vs. sample): left panel N = 2, right panel N = 100]

38 / 70
▶ What is the mean of 𝑌 ̄ ? If 𝐸(𝑌 ̄ ) = 𝜇 = 0.78, then 𝑌 ̄ is an unbiased
estimator of 𝜇
▶ What is the variance of 𝑌 ̄ ? How does it depend on n?
▶ Does 𝑌 ̄ become close to 𝜇 when n is large?
▶ Law of large numbers: 𝑌 ̄ is a consistent estimator of 𝜇
▶ 𝑌 ̄ − 𝜇 appears bell shaped for n large: is this generally true?
▶ In fact, 𝑌 ̄ − 𝜇 is approximately normally distributed for n large
(Central Limit Theorem)

39 / 70
The mean and variance of the sampling distribution of 𝑌 ̄

▶ In the general case that 𝑌𝑖 is i.i.d. from any distribution (not just
Bernoulli)
▶ Mean:

$$E(\bar{Y}) = E\left(\frac{1}{N}\sum_{i=1}^{N} Y_i\right) = \frac{1}{N}\sum_{i=1}^{N} E(Y_i) = \frac{1}{N}\sum_{i=1}^{N}\mu_Y = \mu_Y$$

▶ Variance:

$$
\begin{aligned}
Var(\bar{Y}) &= E\left[\bar{Y} - E(\bar{Y})\right]^2 = E\left[\bar{Y} - \mu_Y\right]^2 \\
&= E\left[\frac{1}{N}\sum_{i=1}^{N} Y_i - \mu_Y\right]^2 = E\left[\frac{1}{N}\sum_{i=1}^{N}(Y_i - \mu_Y)\right]^2 \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} E\left[(Y_i - \mu_Y)(Y_j - \mu_Y)\right] \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} cov(Y_i, Y_j) = \frac{1}{N^2}\sum_{i=1}^{N}\sigma^2_Y = \frac{\sigma^2_Y}{N}
\end{aligned}
$$

▶ where the last step uses that, under i.i.d. sampling, $cov(Y_i, Y_j) = 0$ for $i \neq j$
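
▶ A small simulation sketch of these two results (a hypothetical check, reusing the
Bernoulli example with p = 0.78):

set.seed(1)
p <- 0.78; N <- 50; B <- 10000
ybars <- replicate(B, mean(rbinom(N, 1, p)))   # B sample averages, each from N draws
c(mean(ybars), p)                              # E(Ybar) is approximately mu_Y = p
c(var(ybars), p * (1 - p) / N)                 # Var(Ybar) is approximately sigma^2_Y / N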

40 / 70
The mean and variance of the sampling distribution of 𝑌 ̄

▶ MEAN: 𝐸(𝑌 ̄ ) = 𝜇𝑌
▶ 𝑌 ̄ is an unbiased estimator of 𝜇𝑌
▶ VARIANCE: $Var(\bar{Y}) = \dfrac{\sigma^2_Y}{N}$
▶ the spread of the sampling distribution is inversely related to 𝑁
▶ the larger the sample the more accurate the estimate
▶ As n increases, the distribution of 𝑌 ̄ becomes more tightly centered
around 𝜇𝑌 (Law of Large Numbers)
▶ Moreover, the distribution of 𝑌 ̄ − 𝜇𝑌 becomes normal (Central Limit
Theorem)

41 / 70
Law of Large Numbers (LLN)

▶ An estimator is consistent if the probability that it falls within an
interval of the true population value tends to one as the sample size
increases
▶ [LLN] If 𝑌1 , ⋯ , 𝑌𝑁 are i.i.d. and 𝜎𝑌2 < ∞, then 𝑌 ̄ is a consistent
estimator of 𝜇𝑌 , that is,

$$Pr(|\bar{Y} - \mu_Y| < \epsilon) \to 1 \text{ as } n \to \infty$$

which can be written as $\bar{Y} \xrightarrow{p} \mu_Y$ ($\bar{Y}$ converges in probability to $\mu_Y$)

42 / 70
Central Limit Theorem (CLT)

▶ [CLT] If 𝑌1 , ⋯ , 𝑌𝑁 are i.i.d. and 𝜎𝑌2 < ∞, then when n is large the
distribution of 𝑌 ̄ is well approximated by a normal distribution

$$\bar{Y} \sim N\left(\mu_Y, \frac{\sigma^2_Y}{N}\right)$$

▶ and

$$\sqrt{n}\,\frac{\bar{Y} - \mu_Y}{\sigma_Y} \sim N(0, 1)$$
▶ The larger n, the better the approximation

43 / 70
Sampling distribution of 𝑌 ̄ when Y is Bernoulli with p = 0.78:

44 / 70
▶ distribution of the standardized sample average $(\bar{Y} - E(\bar{Y}))/\sqrt{var(\bar{Y})}$

45 / 70
R

▶ If 𝑌 is Bernoulli with 𝑝 = 0.78 then 𝐸(𝑌 ) = 0.78 and
𝜎𝑌2 = 0.78 ∗ 0.22 = 0.1716
myfunction <- function(p, N, B) {
  sample <- data.frame(matrix(NA, B, 2))
  colnames(sample) <- c("average", "std_average")

  for (i in 1:B) {
    Y <- rbinom(N, 1, p)
    sample[i, 1] <- mean(Y, na.rm = T)
    sample[i, 2] <- (sample[i, 1] - p) / (p * (1 - p) / N)^0.5  # here we are standardizing
  }

  library(ggplot2)
  ggplot(sample, aes(std_average)) +
    geom_histogram(aes(y = ..density..), binwidth = 0.2, fill = "seagreen2") +
    stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
    theme_classic()
}

46 / 70
set.seed(3453)
p = 0.78
N = 1000
B = 10000
myfunction(p,N,B)

[Histogram of the standardized sample average (density vs. std_average) with the N(0,1) density overlaid]

47 / 70
Summary: The Sampling Distribution of 𝑌 ̄

▶ For 𝑌1 , ⋯ , 𝑌𝑁 i.i.d. with 𝜎𝑌2 < ∞:

▶ In finite samples the distribution of 𝑌 ̄ has mean 𝜇𝑌 and variance 𝜎𝑌2 /𝑁
▶ the exact distribution depends on the distribution of 𝑌𝑖
▶ if we assume that 𝑌𝑖 is normal then 𝑌 ̄ is also normally distributed
▶ When n is large, the LLN and the CLT imply that

1. $\bar{Y} \xrightarrow{p} \mu_Y$
2. $\sqrt{n}\,\dfrac{\bar{Y} - \mu_Y}{\sigma_Y} \sim N(0, 1)$

48 / 70
(b) Why Use 𝑌 ̄ To Estimate 𝜇𝑌 ?

▶ 𝑌 ̄ is unbiased: 𝐸(𝑌 ̄ ) = 𝜇𝑌
▶ 𝑌 ̄ is consistent: $\bar{Y} \xrightarrow{p} \mu_Y$
▶ In addition: 𝑌 ̄ is the least squares estimator of 𝜇𝑌 , that is, it
minimizes the sum of squared errors (or residuals)

$$\min_m \sum_{i=1}^{n} (Y_i - m)^2$$

▶ the estimator 𝑚̂ that solves the least squares problem is 𝑚̂ = 𝑌 ̄
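
▶ To see why (a one-line derivation filling in the step above): setting the derivative of
the sum of squared errors with respect to m equal to zero gives

$$\frac{d}{dm}\sum_{i=1}^{n}(Y_i - m)^2 = -2\sum_{i=1}^{n}(Y_i - m) = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i = n\,\hat{m} \;\Rightarrow\; \hat{m} = \bar{Y}$$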

49 / 70
Hypothesis Testing

▶ The hypothesis testing problem (for the mean): decide based on the
sample evidence whether
▶ a null hypothesis is true
▶ an alternative hypothesis is true

▶ In statistical notation:
▶ 𝐻0 ∶ 𝐸(𝑌 ) = 𝜇𝑌 ,0 vs. 𝐻1 ∶ 𝐸(𝑌 ) > 𝜇𝑌 ,0 (1-sided, >)
▶ 𝐻0 ∶ 𝐸(𝑌 ) = 𝜇𝑌 ,0 vs. 𝐻1 ∶ 𝐸(𝑌 ) < 𝜇𝑌 ,0 (1-sided, <)
▶ 𝐻0 ∶ 𝐸(𝑌 ) = 𝜇𝑌 ,0 vs. 𝐻1 ∶ 𝐸(𝑌 ) ≠ 𝜇𝑌 ,0 (2-sided)

50 / 70
▶ p-value = probability of drawing a statistic (e.g., 𝑌 ̄ ) at least as
adverse to the null as the value actually computed with your data,
assuming that the null hypothesis is true
▶ significance level: a pre-specified probability of incorrectly rejecting
the null, when the null is true
▶ Calculating the p-value based on 𝑌 ̄ :

p-value = 𝑃 𝑟 [|𝑌 ̄ − 𝜇𝑌 ,0 | > |𝑌 ̄ 𝑎𝑐𝑡𝑢𝑎𝑙 − 𝜇𝑌 ,0 |]

▶ where 𝑌 ̄ 𝑎𝑐𝑡𝑢𝑎𝑙 is the value of 𝑌 ̄ actually observed

51 / 70
▶ If n is large, the sampling distribution of 𝑌 ̄ is obtained using the
normal approximation (CLT):

$$\text{p-value} = Pr_{H_0}\left[|\bar{Y} - \mu_{Y,0}| > |\bar{Y}^{actual} - \mu_{Y,0}|\right] = Pr_{H_0}\left[\left|\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right| > \left|\frac{\bar{Y}^{actual} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right|\right]$$

▶ where $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{n}$
▶ Under normality, the p-value is calculated as the probability, in the left and right
tails, of values larger in absolute value than $\frac{\bar{Y}^{actual} - \mu_{Y,0}}{\sigma_{\bar{Y}}}$

52 / 70
Calculating the p-value with 𝜎𝑌 known

▶ For large n, the p-value is the probability that a N(0,1) random variable
falls outside $\left|\frac{\bar{Y}^{actual} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right|$

▶ In practice, 𝜎𝑌 ̄ is unknown and must be estimated

53 / 70
Estimator of the variance of Y

▶ Since we do not know the population variance of 𝑌 , we need to
estimate 𝜎𝑌2 as well

$$s^2_Y = \frac{1}{n - 1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

▶ if 𝑌𝑖 are i.i.d. then we can apply the LLN to 𝑠2𝑌 , since it is an
average, and be confident that $s^2_Y \xrightarrow{p} \sigma^2_Y$
▶ Define $t = \frac{\bar{Y} - \mu_{Y,0}}{s_Y / \sqrt{N}}$; then the p-value is obtained as

$$\text{p-value} = Pr_{H_0}\left(|t| > |t^{actual}|\right)$$
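
▶ A minimal sketch in R (an illustration, not part of the slides; it assumes caschool is
loaded as above and tests the hypothetical null 𝜇𝑌,0 = 650 for the mean test score):

Y <- caschool$testscr
mu0 <- 650                                # hypothetical null value
t_actual <- (mean(Y) - mu0) / (sd(Y) / sqrt(length(Y)))
p_value <- 2 * pnorm(-abs(t_actual))      # large-n (normal) approximation
c(t_actual, p_value)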

54 / 70
What is the link between the p-value and the significance level?

▶ The significance level (typically denoted by 𝛼) is prespecified at the
typical values of 1, 5, or 10%
▶ if 𝛼 = 0.05 then you reject the null hypothesis 𝐻0 if
1. |𝑡| > 1.96 (the critical value)
2. p-value ≤ 0.05
▶ The p-value is referred to as the marginal significance level
▶ Often, it is better to communicate the p-value than simply whether a
test rejects or not
▶ the p-value contains more information than the "yes/no" statement
about whether the test rejects.

55 / 70
▶ The value of 1.96 that we used to reject the null is the 5% critical
value
▶ The critical values of the Student t-distribution are tabulated in all
statistics books
▶ To reject the null hypothesis using the critical values:
1. Compute the t -statistic
2. Compute the degrees of freedom, which is 𝑛–1
3. Look up the 𝛼 critical value
4. The null is rejected if the t-statistic exceeds (in absolute value) this
critical value

56 / 70
▶ The difference between the t-distribution and N(0,1) critical values
depends on the sample size n and becomes negligible once the sample is
larger than 120.
▶ Some 5% critical values for a 2-sided test:

d.o.f. ( n – 1) 5% critical value


10 2.23
20 2.09
30 2.04
60 2.00
∞ 1.96
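
▶ These values can be reproduced in R (a quick check for a two-sided 5% test):

round(qt(0.975, df = c(10, 20, 30, 60)), 2)   # 2.23 2.09 2.04 2.00
round(qnorm(0.975), 2)                        # 1.96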

▶ The Student t-distribution is only relevant when we are willing to
assume that the population distribution of Y is normal
▶ In economic data, the normality assumption is rarely credible (e.g.,
the earnings distribution)

57 / 70
Figure 2: The four distributions of earnings are for women and men,
for those with only a high school diploma (a and c) and those whose
highest degree is from a four-year college (b and d).

58 / 70
▶ For n > 30 , the t-distribution and N(0,1) are very close and as n
grows large, the 𝑡𝑛–1 distribution converges to N(0,1)
▶ When sample sizes were small and there were no computers, the
t-distribution provided exact inference (under the normality assumption)
▶ For historical reasons, statistical software typically still uses the
t-distribution to compute p-values, but this is irrelevant when the
sample size is moderate or large
▶ For these reasons, in this class we will focus on the large-n
approximation given by the CLT

59 / 70
Confidence Intervals

▶ A 95% confidence interval for 𝜇𝑌 is an interval that contains the true
value of 𝜇𝑌 in 95% of repeated samples
▶ A 95% confidence interval can always be constructed as the set of
values of 𝜇𝑌 not rejected by a hypothesis test with a 5% significance
level.

$$\mu_Y \in \left(\bar{Y} - 1.96\,\frac{s_Y}{\sqrt{n}},\; \bar{Y} + 1.96\,\frac{s_Y}{\sqrt{n}}\right)$$

▶ Here we are making two assumptions: 1) 𝑌 ̄ is approximately normal,
2) $s^2_Y \xrightarrow{p} \sigma^2_Y$
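
▶ As an illustration (a sketch, assuming caschool is loaded as earlier), a 95% confidence
interval for the mean district test score:

Y <- caschool$testscr
c(mean(Y) - 1.96 * sd(Y) / sqrt(length(Y)),
  mean(Y) + 1.96 * sd(Y) / sqrt(length(Y)))   # approximately (652.3, 656.0)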

60 / 70
R

library(tidyquant)
sp <- tq_get("SPY", get = "stock.prices", from = "1995-01-01")
head(sp)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 1995-01-03 45.7 45.8 45.7 45.8 324300 28.3
2 SPY 1995-01-04 46.0 46 45.8 46 351800 28.4
3 SPY 1995-01-05 46.0 46.1 46.0 46 89800 28.4
4 SPY 1995-01-06 46.1 46.2 45.9 46.0 448400 28.5
5 SPY 1995-01-09 46.0 46.1 46 46.1 36800 28.5
6 SPY 1995-01-10 46.2 46.4 46.1 46.1 229800 28.5

tail(sp)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2021-08-31 452. 452. 451. 452. 59300200 452.
2 SPY 2021-09-01 453. 453. 452. 452. 48721400 452.
3 SPY 2021-09-02 453. 454. 452. 453. 42501000 453.
4 SPY 2021-09-03 452. 454. 452. 453. 47170500 453.
5 SPY 2021-09-07 453. 453. 451. 451. 51671500 451.
6 SPY 2021-09-08 451. 452. 449. 451. 56121700 451.

61 / 70
ggplot(sp, aes(date, adjusted)) +
  geom_line() + theme_bw()

ggplot(sp, aes(date, log(adjusted))) +
  geom_line() + theme_bw()

[Line plots of SPY adjusted price (left panel) and log adjusted price (right panel) against date]

62 / 70
sp <- mutate(sp, return = 100*(log(adjusted) - log(lag(adjusted))))
tail(sp)

# A tibble: 6 x 9
symbol date open high low close volume adjusted return
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2021-08-31 452. 452. 451. 452. 59300200 452. -0.148
2 SPY 2021-09-01 453. 453. 452. 452. 48721400 452. 0.0531
3 SPY 2021-09-02 453. 454. 452. 453. 42501000 453. 0.307
4 SPY 2021-09-03 452. 454. 452. 453. 47170500 453. -0.0243
5 SPY 2021-09-07 453. 453. 451. 451. 51671500 451. -0.358
6 SPY 2021-09-08 451. 452. 449. 451. 56121700 451. -0.122

ggplot(sp, aes(date, return)) + geom_line() + theme_bw()

[Line plot of SPY daily log returns (return vs. date)]

63 / 70
ggplot(sp, aes(return)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.1, fill = "steelblue1") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sp$return, na.rm = T),
                            sd = sd(sp$return, na.rm = T))) +
  theme_light()

[Histogram of SPY daily returns (density vs. return) with the fitted normal density overlaid]

64 / 70
library(psych)
c(skew(sp$return), kurtosi(sp$return))

[1] -0.298641 11.287862

describe(sp$return)

vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 6718 0.04 1.21 0.07 0.07 0.79 -11.59 13.56 25.15 -0.3 11.29
se
X1 0.01

65 / 70
fin.data <- tq_get(c("AAPL", "BAC", "SPY"), get = "stock.prices", from = "1995-01-01")
head(fin.data)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 1995-01-03 0.347 0.347 0.338 0.343 103868800 0.291
2 AAPL 1995-01-04 0.345 0.354 0.345 0.352 158681600 0.298
3 AAPL 1995-01-05 0.350 0.352 0.346 0.347 73640000 0.295
4 AAPL 1995-01-06 0.372 0.385 0.367 0.375 1076622400 0.318
5 AAPL 1995-01-09 0.372 0.374 0.366 0.368 274086400 0.312
6 AAPL 1995-01-10 0.368 0.393 0.368 0.390 614790400 0.331

tail(fin.data)

# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2021-08-31 452. 452. 451. 452. 59300200 452.
2 SPY 2021-09-01 453. 453. 452. 452. 48721400 452.
3 SPY 2021-09-02 453. 454. 452. 453. 42501000 453.
4 SPY 2021-09-03 452. 454. 452. 453. 47170500 453.
5 SPY 2021-09-07 453. 453. 451. 451. 51671500 451.
6 SPY 2021-09-08 451. 452. 449. 451. 56121700 451.

66 / 70
ret.data = fin.data %>%
group_by(symbol) %>%
mutate(return = 100 * (log(adjusted) - log(lag(adjusted)))) %>%
select(symbol, date, return)
head(ret.data)

# A tibble: 6 x 3
# Groups: symbol [1]
symbol date return
<chr> <date> <dbl>
1 AAPL 1995-01-03 NA
2 AAPL 1995-01-04 2.57
3 AAPL 1995-01-05 -1.28
4 AAPL 1995-01-06 7.73
5 AAPL 1995-01-09 -1.92
6 AAPL 1995-01-10 5.85

ret.data %>%
group_by(symbol) %>%
summarize(Average = mean(return, na.rm = T),
StdDev = sd(return, na.rm = T),
N = n()) %>%
knitr::kable()

symbol Average StdDev N


AAPL 0.0934655 2.811128 6719
BAC 0.0293548 2.738327 6719
SPY 0.0412123 1.213518 6719

67 / 70
library(tidyr)
ret.data <- ret.data %>%
pivot_wider(names_from = symbol, values_from = return)
head(ret.data)

# A tibble: 6 x 4
date AAPL BAC SPY
<date> <dbl> <dbl> <dbl>
1 1995-01-03 NA NA NA
2 1995-01-04 2.57 0.816 0.477
3 1995-01-05 -1.28 0.810 0
4 1995-01-06 7.73 -0.269 0.102
5 1995-01-09 -1.92 0 0.102
6 1995-01-10 5.85 1.60 0.102

cor(select(ret.data, -date), use = "complete")

AAPL BAC SPY


AAPL 1.0000000 0.2667942 0.4722566
BAC 0.2667942 1.0000000 0.6521778
SPY 0.4722566 0.6521778 1.0000000

68 / 70
▶ $H_0: \mu_{SPY} = 0$ vs $H_1: \mu_{SPY} \neq 0$

symbol Average StdDev N


SPY 0.0412123 1.213518 6719

▶ t = …..
▶ Reject or fail to reject?

69 / 70
Summary

▶ Based on two assumptions:

1. random sampling from a population: 𝑌𝑖 , 𝑖 = 1, ⋯ , 𝑛 are i.i.d.
2. 0 < 𝐸(𝑌 4 ) < ∞
▶ We derived, for large samples (large n), the:
1. Theory of estimation (sampling distribution of 𝑌 ̄ )
2. Theory of hypothesis testing (large-n distribution of t-statistic and
computation of the p-value)
3. Theory of confidence intervals (constructed by inverting the test
statistic)

70 / 70
