
Econometrics Workshop - Session 2

Basic Statistical Analysis & Univariate OLS Model
Wednesday, 27 August 2021
Rakhmat Nurul

Department of Economics
Faculty of Economics and Business
Universitas Hasanuddin
Outline
1. Basic statistical analysis: tests for one or two means
2. Basic statistical analysis: bivariate correlation
3. Univariate OLS Regression
4. Empirical Analysis using STATA
Goals of today’s workshop
• Understand the concepts of some basic statistical analyses: the t-test &
correlation
• Have basic knowledge of how the univariate OLS model works and the
assumptions of the model
• Gain insight into how to perform basic statistical analysis and
univariate OLS regression in STATA
1. Basic Statistical Analysis
Types of Statistical Methods
• In doing econometric research, lots of statistical tests are performed
• Key insight of statistics: one can learn about the characteristics of a population
by selecting a random sample from that population
• Using statistical methods, we can use this random sample to draw
statistical inferences about characteristics of the full population
• Three types of statistical methods are used throughout econometrics:
1. Estimation
2. Hypothesis testing
3. Confidence intervals
Types of Statistical Methods: Estimation
• Estimation is a statistical method that computes a "best guess" numerical
value for an unknown characteristic of a population by using sample data
• An estimator is a function of sample data, drawn randomly from a
population, used to infer an estimate of an unknown parameter. Hence:
1. An estimator is a random variable because of the randomness in selecting the sample
2. An estimate is the numerical value of the estimator when it is actually computed using
data from a specific sample
• Examples:
1. The sample mean is the estimator of the population mean
2. An estimator of the population variance is the sample variance
3. An estimator of the population covariance/correlation is the sample
covariance/correlation
Desirable Properties of the Estimator
1. Unbiasedness
• $\hat{\mu}_Y$ is an unbiased estimator of $\mu_Y$ if $E(\hat{\mu}_Y) = \mu_Y$, i.e., $\mu_Y$ is the mean of the
sampling distribution of $\hat{\mu}_Y$
2. Consistency
• As the sample size increases, the sampling distribution of the estimator
becomes increasingly concentrated at the true parameter value
• $\hat{\mu}_Y$ is a consistent estimator of $\mu_Y$ if $\hat{\mu}_Y \xrightarrow{p} \mu_Y$ as $n$ gets larger
3. Efficiency
• The estimator has the smallest variance among the unbiased estimators
• $\hat{\mu}_Y$ is more efficient than an alternative estimator $\tilde{\mu}_Y$ if $\mathrm{var}(\hat{\mu}_Y) < \mathrm{var}(\tilde{\mu}_Y)$
Hypothesis testing: terminology (1)
• Hypothesis testing is a process of formulating a specific hypothesis
about the population, then using the sample evidence to decide
whether it is true
• The null hypothesis is that the population mean takes on a specific value: $H_0: E(Y) = \mu_{Y,0}$
• The two-sided alternative hypothesis specifies what is true if $H_0$ is not: $H_1: E(Y) \neq \mu_{Y,0}$
• Type I error: $H_0$ is rejected when in fact it is true
• Type II error: $H_0$ is not rejected when in fact it is false
• Significance level of the test: the probability of a Type I error, $\Pr(\text{reject } H_0 \mid H_0 \text{ true})$. Often
prespecified at 5%
Hypothesis testing: terminology (2)
• The power of the test: the probability that the test correctly rejects $H_0$
when the alternative is true, $\Pr(\text{reject } H_0 \mid H_1 \text{ true})$
• The p-value: the smallest significance level at which you can reject $H_0$. If
the significance level is prespecified at 5%, we reject $H_0$ if the p-value < 0.05
• A test statistic is a statistic used to perform a hypothesis test,
e.g. the t-statistic: $t = \dfrac{\bar{Y} - \mu_{Y,0}}{SE(\bar{Y})}$
Hypothesis testing & Confidence interval
• Critical value of the test statistic: the value of the test statistic for
which the test just rejects $H_0$ at the prespecified significance level
(1.96 for a two-sided test at the 5% level)
• Rejection region: the set of values of the t-statistic for which the test
rejects $H_0$
• Confidence interval: a set of values constructed from a random
sample that contains the true population mean with a prespecified
probability, e.g. the 95% confidence interval $\bar{Y} \pm 1.96\, SE(\bar{Y})$
t-test
• The t-test is one of the statistical tests used to perform hypothesis testing about the
population mean
• The procedure is the same as described in the previous section
• We need to formulate a specific hypothesis first:
• Null hypothesis: $H_0: E(Y) = \mu_{Y,0}$
• Alternative hypothesis: $H_1: E(Y) \neq \mu_{Y,0}$
• Set the prespecified significance level: e.g. 5%
• Compute the t-statistic:
$t = \dfrac{\bar{Y} - \mu_{Y,0}}{SE(\bar{Y})}$, where $SE(\bar{Y}) = s_Y / \sqrt{n}$
• Reject $H_0$ if the p-value is less than 5%, or alternatively:
• Reject $H_0$ if $|t| > 1.96$
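In STATA, this test is performed with the built-in ttest command. A minimal sketch using the workshop's earnings variable, with a purely hypothetical null value of 40000:

* One-sample t-test of H0: E(earnings) = 40000 (the null value is hypothetical)
ttest earnings == 40000

* Two-sample t-test comparing mean earnings across groups (assumes a binary sex variable)
ttest earnings, by(sex)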
Correlation and covariance (1)
• Correlation and covariance are the statistics used to examine the strength of
the linear association between $X$ and $Y$
• We are often interested in the extent to which two random variables
move together:
$\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$
• The correlation is always between -1 and 1:
$\mathrm{corr}(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$
• To note:
Correlation ≠ Causality!
Correlation and covariance (2)
•  The sample covariance and correlation are estimators of the
population covariance and correlation
• The sample covariance is denoted by :

• The sample correlation is denoted


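These estimators are what STATA's correlate command reports. A minimal check, assuming two variables x and y are in memory:

* Sample correlation r_XY between x and y
correlate x y

* Sample covariance s_XY (the covariance option reports the covariance matrix instead)
correlate x y, covariance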
2. Basic Knowledge of the Univariate OLS Model
Regression and causal inference
• Economists are often interested in the causal relationship between two
variables
• E.g.: If someone attends college, what will be the effect on his/her earnings?
• The gold standard for identifying causality is experimental
research
• However, it is often hard to perform experiments for several reasons:
1. Financial
2. Ethics
3. Feasibility
• Alternatively, economists attempt to use observational data in order to
identify causality
The linear regression model
• It is assumed that the causal relation between the two variables can be
specified as a linear relation (model):
$Y_i = \beta_0 + \beta_1 X_i + u_i$
• The model is illustrated by drawing a straight line that fits the
observations of the data
• The line is drawn by minimizing the error
• Components of the model:
1. Deterministic part: $\beta_0 + \beta_1 X_i$, the conditional mean of $Y_i$ given $X_i$
2. Stochastic part: $u_i$, also referred to as the error term
Estimating the coefficients of the linear regression model (1)
• The problem in econometric research is that we would like to
estimate $\beta_0$ and $\beta_1$ by using the sample data
• The way we derive the estimators of $\beta_0$ and $\beta_1$ is by applying OLS
• Recall that:
$u_i = Y_i - \beta_0 - \beta_1 X_i$
• OLS minimizes the sum of the squared errors by selecting $b_0$ and $b_1$:
$\min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$
so the first order conditions (FOC) are:
$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0, \quad \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0$
Estimating the coefficients of the linear regression model (2)
• Solving the minimization problem gives:
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \dfrac{s_{XY}}{s_X^2}$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
• $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimators of $\beta_0$ and $\beta_1$, respectively
• $\hat{\beta}_0$ and $\hat{\beta}_1$ are functions of the randomly drawn data, and hence are random variables
• Therefore, $\hat{\beta}_0$ and $\hat{\beta}_1$ have sampling distributions
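As a sanity check, the slope and intercept formulas above can be computed by hand in STATA from the stored results of correlate and summarize, and compared with regress. A minimal sketch, assuming variables y and x are in memory:

* Slope: b1 = s_XY / s_X^2, taken from the sample covariance matrix
quietly correlate y x, covariance
scalar b1 = r(cov_12) / r(Var_2)

* Intercept: b0 = ybar - b1 * xbar
quietly summarize y
scalar ybar = r(mean)
quietly summarize x
scalar b0 = ybar - b1 * r(mean)
display "b1 = " b1 "   b0 = " b0

* The coefficients reported by regress should match
regress y x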
Estimating the coefficients of the linear regression model (3)
• The OLS predicted values, $\hat{Y}_i$, and residuals, $\hat{u}_i$, are:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
$\hat{u}_i = Y_i - \hat{Y}_i$
• $\hat{u}_i$ is the predicted value of the error term $u_i$
• $\hat{Y}_i$ is the predicted value of $Y_i$
Desirable Properties
1. Unbiasedness: the OLS estimator is unbiased if:
$E(\hat{\beta}_1) = \beta_1$
2. Consistency: the OLS estimator is consistent if:
$\hat{\beta}_1 \xrightarrow{p} \beta_1$ as $n \to \infty$
• Both conditions are satisfied if the zero conditional mean assumption
holds:
$E(u_i \mid X_i) = 0$
3. Efficiency: the OLS estimator is efficient, i.e., under the least squares
assumptions it has the smallest variance among linear unbiased estimators
Least squares assumptions
• OLS is the appropriate estimator if the following assumptions
hold:
1. Key identifying assumption: the error term has conditional mean
zero given $X_i$: $E(u_i \mid X_i) = 0$
2. $(X_i, Y_i)$, $i = 1, \dots, n$, are independently and identically distributed (i.i.d.)
3. Large outliers are unlikely
4. Homoskedasticity: the variance of the error term is constant across observations
Least squares assumptions: 1. Zero conditional mean assumption
• The zero conditional mean assumption: $E(u_i \mid X_i) = 0$
– Key identifying assumption that makes the OLS estimator unbiased
– Implies that the other factors contained in $u_i$ are unrelated to $X_i$
– If this assumption holds, $\beta_1$ is the causal effect of $X$ on $Y$
– If this assumption does not hold, $\beta_1$ only describes the association or correlation
between $X$ and $Y$
– It rules out that there are:
  – Unobserved confounders: unobserved variables correlated with $X_i$ that also affect $Y_i$
  – Reversed causality: $Y$ affects $X$
Least squares assumptions: 2. Data are i.i.d.
• $(X_i, Y_i)$, $i = 1, \dots, n$, are independently and identically distributed
– Independently distributed: the sample is randomly (independently) drawn
from the population
– Identically distributed: all the observations in the sample have the same
(identical) distribution. This is the case when they are drawn from the same
population
– Time-series data are not i.i.d.: they might have identical distributions, but the
observations are not independent
Least squares assumptions: 3. Large outliers are unlikely (1)
• Observations with values of $X_i$, $Y_i$, or both that are far outside the usual
range of the data are unlikely
• Large outliers can make OLS regression results misleading
• Typical source of large outliers: data entry errors, such as typographical errors
or incorrectly using different units for different observations
• Outliers can be easily detected by plotting the data
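In STATA, a quick visual check for outliers, using the workshop's variables as an example:

* Scatter plot of earnings against height; large outliers stand out visually
twoway scatter earnings height

* Detailed summary statistics (percentiles, min, max) also help flag suspect values
summarize earnings height, detail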
Least squares assumptions: 3. Large outliers are unlikely (2)
Least squares assumptions: 4. Homoskedasticity
• The error term is homoskedastic if the variance of the conditional distribution of $u_i$ given $X_i$,
$\mathrm{var}(u_i \mid X_i = x)$, is constant for $i = 1, \dots, n$, and in particular does not depend on $x$
• Homoskedastic error terms:
$\mathrm{var}(u_i \mid X_i) = \sigma_u^2$
• The variance of the OLS estimator is often computed assuming homoskedastic error terms:
$\mathrm{var}(\hat{\beta}_1) = \dfrac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
• Hence, the precision of the estimator improves if:
• The sample size ($n$) increases
• The variance of $X$, $\mathrm{var}(X_i)$, increases
• The variance of the residuals decreases
• Violation of the homoskedasticity assumption:
• Leads to incorrect standard errors
• Does not affect the computation of the parameter estimates
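In practice, the common remedy in STATA is to report heteroskedasticity-robust standard errors, which change the standard errors but leave the coefficient estimates untouched. A sketch using the workshop's model:

* Default output: standard errors computed assuming homoskedastic errors
regress earnings height

* Same coefficients, heteroskedasticity-robust standard errors
regress earnings height, vce(robust)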
Least squares assumptions: 4. Illustration of heteroskedasticity
Measures of fit
• There are, at least, two measures of fit for the regression
line:
1. The $R^2$
2. The standard error of the regression (SER)
• They answer questions such as:
• How well does the regression line describe the data?
• Does the regressor account for much or for little of the variation in
the dependent variable?
• Are the observations tightly clustered around the regression line, or
are they spread out?
 
The $R^2$
• The $R^2$ measures the fraction of the sample variance of $Y_i$ that can be
explained or predicted by $X_i$:
$R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{SSR}{TSS}$
• Explained sum of squares: $ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
• Sum of squared residuals: $SSR = \sum_{i=1}^{n} \hat{u}_i^2$
• Total sum of squares: $TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
The Standard Error of the Regression
• The standard error of the regression (SER) is an estimator of the
standard deviation of the regression error $u_i$:
$SER = s_{\hat{u}} = \sqrt{\dfrac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2}$
• The units of $u_i$ and $Y_i$ are the same
• The SER is a measure of the spread of the observations around the
regression line, measured in the units of the dependent variable
Hypothesis test in the regression with a single
regressor (1)
• In the OLS regression, we also perform hypothesis tests
• They are applied to test hypotheses about the slope $\beta_1$
• Because $\hat{\beta}_1$ is a random variable and a function of the data, it also has a
sampling distribution
• In large samples, $\hat{\beta}_1$ has a normal sampling distribution, so
hypotheses about the true value of $\beta_1$ can be tested using the same
general approach used in testing hypotheses about the population
mean
Hypothesis test in the regression with a single
regressor (2)
• The null and alternative hypotheses need to be stated precisely before they can be tested
• More generally, under the null hypothesis, the true population coefficient of $X_i$ takes on some specific value, $\beta_{1,0}$
• Under the two-sided alternative, $\beta_1$ does not equal $\beta_{1,0}$
• The null hypothesis and the two-sided alternative hypothesis are:
$H_0: \beta_1 = \beta_{1,0}$ vs. $H_1: \beta_1 \neq \beta_{1,0}$
• Often, in econometric research, we would like to show that a variable has a causal effect on the
outcome/dependent variable; recall that:
$Y_i = \beta_0 + \beta_1 X_i + u_i$
• Using the sample data, we would like to estimate $\beta_1$ by using its estimator $\hat{\beta}_1$
• After obtaining the sample evidence, we test whether the estimated causal effect is
statistically significant. If not, we cannot infer the identified causal effect for the
population coefficient, so we test:
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
Hypothesis test in the regression with a single
regressor (3)
• Three steps of testing the hypothesis about the slope $\beta_1$:
1. Compute the standard error of $\hat{\beta}_1$, $SE(\hat{\beta}_1)$
2. Compute the t-statistic:
$t = \dfrac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$
3. Compute the p-value. Reject the null hypothesis at the 5% significance level if
the p-value is less than 0.05 or, equivalently, if $|t| > 1.96$
• The standard error and (typically) the t-statistic and p-value
are computed automatically by the software
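STATA's regress performs all three steps automatically, but they can be reproduced by hand from the stored results. A minimal sketch for testing $H_0: \beta_1 = 0$ in the workshop's model:

* Steps 1-2: t-statistic from the stored coefficient and standard error
regress earnings height
display "t = " _b[height] / _se[height]

* Step 3: two-sided p-value from the t distribution with e(df_r) degrees of freedom
display "p = " 2 * ttail(e(df_r), abs(_b[height] / _se[height]))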
3. STATA Practice: Basic Statistics & the Univariate OLS Model
Introduction to the dataset
• In this session we use the
Earnings_and_Height dataset
• This is the dataset used in the research by
Professors Anne Case (Princeton University)
and Christina Paxson (Brown University) in
their paper "Stature and Status: Height, Ability,
and Labor Market Outcomes", published in the
Journal of Political Economy, 2008
• The dataset contains the data on earnings,
height, and other characteristics of a random
sample of US workers
• The dataset is supplied by Pearson Education,
the publisher of Introduction to Econometrics,
Stock & Watson
The setting of the analysis
• Now, we would like to run the regression of Earnings on Height and
estimate this model:
$Earnings_i = \beta_0 + \beta_1 Height_i + u_i$
• We are interested in how large the effect of an additional inch of a
worker's height is on the earnings of US workers (in USD)
• So we use the random sample data of US workers to perform the
regression, in order to find the estimate of the population coefficient
of the effect ($\beta_1$)
Preparing the dataset
• However, in many cases, the data are not readily available to perform the analysis
• The data are provided, but they need some arrangement, so we have to prepare the data in
order to run our analysis
• That is where data management in STATA is needed!
• So, we assume that the dataset is not yet ready to be used in the analysis: it is still
separated into two different datasets
• Some actions are still needed before we can use the data!
• In this session, continuing from the previous session, we will demonstrate how to prepare
our dataset
• Two kinds of dataset preparation are most commonly conducted in STATA:
1. Merging datasets
2. Cleaning data
Preparing the dataset: starting commands
clear
set more off
cd "C:\Jobs\STATA Workshop-Department of Economics UNHAS"
log using "workshop_2.txt", text append
Preparing the dataset: merging datasets (1)
• The syntax used in merging datasets:
merge 1:1 varlist using filename [, options]
• We need to pay attention to these elements when merging datasets
into a single dataset (see the sketch after this list):
1. The currently opened file: the master file
2. The using file: the other dataset that we would like to combine with
3. The expression 1:1 means that each value of the merge variable identifies only one
observation in each file. Other expressions might be m:1 or 1:m, depending on the master
file and the other files used in the merge
4. The unique variables: the variables used as the basis for merging the files. Their
values must be unique across observations. Use the codebook command to check
whether they are unique or not
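Putting these elements together, a minimal merge workflow might look as follows; the file names and the key variable id are hypothetical placeholders for the two workshop files:

* Open the master file (hypothetical file name)
use "earnings_part1.dta", clear

* Check that the key variable uniquely identifies observations
codebook id

* 1:1 merge on id with the using file (hypothetical file name)
merge 1:1 id using "earnings_part2.dta"

* Inspect the match results stored in _merge
tab _merge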
Preparing the dataset: merging datasets (2)
Preparing the dataset: merging datasets (3)
• After merging the datasets into a single file, a new
variable called _merge will be generated
• This variable describes whether the merging of each
observation was successful (matched) or not. If the observation is
matched, it is labeled 3 (Matched)
Preparing the dataset: merging datasets (4)
Preparing the dataset: Cleaning data (1)
• Now, we would like to do some data cleaning before running the estimation
• It is very common that, before we perform the regression/analysis, we make
sure that our data do not contain any of these:
1. Missing values
2. Outliers (due to typographical errors in data entry)
• Some useful ways of finding and handling missing values (see the sketch after this list):
1. Use the codebook command: codebook [varlist] [if] [in] [, options]
2. Use the tab, m command: tab varname, m
3. Use the drop command to remove observations with missing values: drop if exp
4. For outliers, try to locate the observation and fix it manually, using the replace
command
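A minimal cleaning sketch combining these commands; the variable names follow the workshop dataset, and the outlier fix shown is a purely hypothetical example:

* Inspect the variables and count missing values
codebook earnings height
tab earnings, m

* Drop observations with missing earnings
drop if missing(earnings)

* Fix a hypothetical data-entry outlier located by inspection:
* replace height = 65 if height == 650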
Preparing the dataset: Cleaning data (2)
Before running the regression: summary statistics and correlation
• It is very common to compute summary statistics and correlations before
running the regression (see the example below)
• To compute the summary statistics, we can use the summarize command:
summarize [varlist] [if] [in] [weight] [, options]
• If we only type summarize, STATA will provide summary statistics for all variables
• To compute the correlations, we use the correlate command:
correlate [varlist] [if] [in] [weight] [, correlate_options]
• If we only type correlate, STATA will compute the correlation for each pair of
variables
• The commands summarize and correlate can be written with the shorter syntax su
and corr
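Applied to the workshop dataset, the pre-regression checks are one line each:

* Summary statistics for the two variables of interest
su earnings height

* Pairwise correlation between earnings and height
corr earnings height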
Running the regression (1)
• To perform the regression, we use this syntax:
regress depvar [indepvars] [if] [in] [weight] [, options]
• The dependent variable must be placed first, followed
by the predictors (independent variables)
• The command regress can also be shortened to reg
Running the regression (2): subsample
analysis
• In this case, we would also like to estimate the effect of height on
earnings by gender
• Is the identified effect larger for men or for women?
• This is often called a subsample analysis or heterogeneity analysis
• So we perform the regression for each gender,
using if in the regression syntax (see the sketch below):
regress depvar [indepvars] [if] [in] [weight] [, options]
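A sketch of the two subsample regressions; the coding of the sex variable (e.g. 1 = male, 0 = female) is an assumption about the dataset and should be verified first, for instance with codebook sex:

* Regression for one gender group (assumes sex == 1 denotes male)
reg earnings height if sex == 1

* Regression for the other group
reg earnings height if sex == 0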
Interpreting the output of analysis: correlation and summary statistics
Interpreting the output of analysis: regression output, main estimation
Interpreting the output of analysis: regression output, subsample analysis
STATA Practice: Saving the output (1)
• STATA also provides a way to save our regression results into a
Word/text file
• In this case, we use the commands eststo and esttab
• First, we prefix the regression with eststo:
eststo: reg earnings height
• Type esttab afterwards to see the layout of the table created from the
analysis
STATA Practice: Saving the output (2)
• To save the table into an MS Word file, use:
esttab using reg_output.rtf, se r2
• The file reg_output.rtf will be saved in the working
directory set by the cd command
References
1. Acock, A. C. (2008). A Gentle Introduction to Stata. Stata Press.
2. Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics (4th ed.). New York: Pearson.
Thank You!
Questions can be sent here: rakhmat.nurul@unhas.ac.id
Office: Department of Economics, Faculty of Economics and Business, 2nd Floor,
Jl. Perintis Kemerdekaan Km.10
Phone: 0821-8821-1856 (WA)
