
# WHAT IS ECONOMETRICS?

Literally interpreted, econometrics means “economic measurement.” Although
measurement is an important part of econometrics, the scope of econometrics is
much broader, as can be seen from the following quotations:
Econometrics, the result of a certain outlook on the role of economics, consists of
the application of mathematical statistics to economic data to lend empirical
support to the models constructed by mathematical economics and to obtain numerical
results.

Econometrics may be defined as the quantitative analysis of actual economic
phenomena based on the concurrent development of theory and observation,
related by appropriate methods of inference.

Econometrics may be defined as the social science in which the tools of economic
theory, mathematics, and statistical inference are applied to the analysis of
economic phenomena.

Econometrics is concerned with the empirical determination of economic laws.

# METHODOLOGY OF ECONOMETRICS
How do econometricians proceed in their analysis of an economic problem?
That is, what is their methodology? Although there are several schools of thought
on econometric methodology, we present here the traditional or classical
methodology, which still dominates empirical research in economics and other
social and behavioral sciences.
Broadly speaking, traditional econometric methodology proceeds along the
following lines:
1. Statement of theory or hypothesis.
2. Specification of the mathematical model of the theory
3. Specification of the statistical, or econometric, model
4. Obtaining the data
5. Estimation of the parameters of the econometric model
6. Hypothesis testing
7. Forecasting or prediction
8. Using the model for control or policy purposes.

To illustrate the preceding steps, let us consider the well-known Keynesian theory
of consumption.

## 1. Statement of Theory or Hypothesis

Keynes stated:
The fundamental psychological law . . . is that men [women] are disposed, as a rule
and on average, to increase their consumption as their income increases, but not as
much as the increase in their income.
In short, Keynes postulated that the marginal propensity to consume (MPC), the rate of
change of consumption for a unit (say, a dollar) change in income, is greater than
zero but less than 1.

## 2. Specification of the Mathematical Model of Consumption

Although Keynes postulated a positive relationship between consumption and
income, he did not specify the precise form of the functional relationship between
the two. For simplicity, a mathematical economist might suggest the following form
of the Keynesian consumption function:

$$Y = \beta_1 + \beta_2 X, \qquad 0 < \beta_2 < 1 \tag{I.3.1}$$

where Y = consumption expenditure and X = income, and where β1 and β2, known as
the parameters of the model, are, respectively, the intercept and slope coefficients.
The slope coefficient β2 measures the MPC. Geometrically, Eq. (I.3.1) is as shown in
Figure I.1. This equation, which states that consumption is linearly related to
income, is an example of a mathematical model of the relationship between
consumption and income that is called the consumption function in economics.
A model is simply a set of mathematical equations. If the model has only one
equation, as in the preceding example, it is called a single-equation model, whereas
if it has more than one equation, it is known as a multiple-equation model (the
latter will be considered later in the book).
In Eq. (I.3.1) the variable appearing on the left side of the equality sign is called the
dependent variable and the variable(s) on the right side are called the independent,
or explanatory, variable(s). Thus, in the Keynesian consumption function, Eq. (I.3.1),
consumption (expenditure) is the dependent variable and income is the explanatory
variable.

## 3. Specification of the Econometric Model of Consumption

The purely mathematical model of the consumption function given in Eq. (I.3.1) is of
limited interest to the econometrician, for it assumes that there is an exact or
deterministic relationship between consumption and income. But relationships
between economic variables are generally inexact.
Thus, if we were to obtain data on consumption expenditure and disposable (i.e.,
after tax) income of a sample of, say, 500 American families and plot these data on
graph paper with consumption expenditure on the vertical axis and disposable
income on the horizontal axis, we would not expect all 500 observations to lie exactly
on the straight line of Eq. (I.3.1) because, in addition to income, other variables
affect consumption expenditure. For example, size of family, ages of the members
in the family, family religion, etc., are likely to exert some influence on consumption.
To allow for the inexact relationships between economic variables, the
econometrician would modify the deterministic consumption function (I.3.1) as
follows:
$$Y = \beta_1 + \beta_2 X + u \tag{I.3.2}$$
where u, known as the disturbance, or error, term, is a random (stochastic) variable
that has well-defined probabilistic properties. The disturbance term u may well
represent all those factors that affect consumption but are not taken into account
explicitly.
Equation (I.3.2) is an example of an econometric model. More technically, it is an
example of a linear regression model, which is the major concern of this book. The
econometric consumption function hypothesizes that the dependent variable Y
(consumption) is linearly related to the explanatory variable X (income) but that the
relationship between the two is not exact; it is subject to individual variation.
The econometric model of the consumption function can be depicted as shown in
Figure I.2.
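
To make the difference between (I.3.1) and (I.3.2) concrete, the following Python sketch simulates families from a stochastic consumption function. The parameter values, the income range, and the normal distribution chosen for u are illustrative assumptions, not values taken from the text.

```python
import random

def simulate_consumption(n, beta1, beta2, sigma, seed=0):
    """Draw n (income, consumption) pairs from Y = beta1 + beta2*X + u."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(20_000, 100_000)   # hypothetical family income
        u = rng.gauss(0.0, sigma)          # disturbance: everything besides income
        data.append((x, beta1 + beta2 * x + u))
    return data

# Assumed values, for illustration only.
BETA1, BETA2, SIGMA = 5_000.0, 0.7, 3_000.0
sample = simulate_consumption(n=500, beta1=BETA1, beta2=BETA2, sigma=SIGMA)

# Vertical distance of each family from the deterministic line of Eq. (I.3.1).
deviations = [y - (BETA1 + BETA2 * x) for x, y in sample]
on_line = sum(1 for d in deviations if d == 0.0)
mean_dev = sum(deviations) / len(deviations)

print(f"families exactly on the line: {on_line} of {len(sample)}")
print(f"mean deviation from the line: {mean_dev:.1f}")
```

Individual families scatter around the line, while the deviations average out to roughly zero, which is the sense in which the relationship holds "on average".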

## 4. Obtaining Data

To estimate the econometric model given in (I.3.2), that is, to obtain the numerical
values of β1 and β2, we need data. Let us look at the data given in Table I.1, which
relate to personal consumption expenditure (Y) and gross domestic product (X) for
1982–1996.

Table I.1: Data on Y (personal consumption expenditure) and X (gross domestic
product), 1982–1996, both in billions of 1992 dollars

## 5. Estimation of the Econometric Model

Now that we have the data, our next task is to estimate the parameters of the
consumption function. The numerical estimates of the parameters give empirical
content to the consumption function. The actual mechanics of estimating the
parameters will be discussed in Chapter 3. For now, note that the statistical
technique of regression analysis is the main tool used to obtain the estimates. Using
this technique and the data given in Table I.1, we obtain the following estimates of
β1 and β2, namely, −184.08 and 0.7064.
Thus, the estimated consumption function is:

$$\hat{Y}_i = -184.08 + 0.7064\,X_i \tag{I.3.3}$$
The hat on the Y indicates that it is an estimate; as a matter of convention, a hat
over a variable or parameter indicates that it is an estimated value.
The estimated consumption function (i.e., regression line) is shown in Figure I.3.
As Figure I.3 shows, the regression line fits the data quite well in that the data points
are very close to the regression line. From this figure we see that for the period
1982–1996 the slope coefficient (i.e., the MPC) was about 0.70, suggesting that for
the sample period an increase in real income of 1 dollar led, on average, to an
increase of about 70 cents in real consumption expenditure.
We say on average because the relationship between consumption and income is
inexact; as is clear from Figure I.3, not all the data points lie exactly on the regression
line. In simple terms we can say that, according to our data, the average, or mean,
consumption expenditure went up by about 70 cents for a dollar’s increase in real
income.
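
The mechanics are deferred to Chapter 3, but the usual least-squares formulas for a one-regressor model are short enough to sketch here. Table I.1 is not reproduced in this section, so the data below are synthetic and generated from assumed parameters. The point is only that the estimator recovers a slope close to the one used to generate the data.

```python
import random

def ols_simple(x, y):
    """Least-squares intercept and slope for y = b1 + b2*x + u."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx            # slope: the estimated MPC in this setting
    b1 = y_bar - b2 * x_bar     # intercept
    return b1, b2

# Synthetic data (NOT Table I.1): assumed intercept -200 and slope 0.7.
rng = random.Random(42)
income = [4000.0 + 200.0 * t for t in range(15)]
consumption = [-200.0 + 0.7 * xi + rng.gauss(0.0, 25.0) for xi in income]

b1_hat, b2_hat = ols_simple(income, consumption)
print(f"intercept = {b1_hat:.2f}, slope (MPC) = {b2_hat:.4f}")
```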

## 6. Hypothesis Testing

Assuming that the fitted model is a reasonably good approximation of reality, we
have to develop suitable criteria to find out whether the estimates obtained in, say,
Eq. (I.3.3) are in accord with the expectations of the theory that is being tested.
According to “positive” economists like Milton Friedman, a theory or hypothesis that
is not verifiable by appeal to empirical evidence may not be admissible as a part of
scientific enquiry.
As noted earlier, Keynes expected the MPC to be positive but less than 1. In our
example we found the MPC to be about 0.70. But before we accept this finding as
confirmation of Keynesian consumption theory, we must enquire whether this
estimate is sufficiently below unity to convince us that this is not a chance
occurrence or peculiarity of the particular data we have used. In other words, is 0.70
statistically less than 1? If it is, it may support Keynes’ theory.
Such confirmation or refutation of economic theories on the basis of sample
evidence is based on a branch of statistical theory known as statistical inference
(hypothesis testing). Throughout this book we shall see how this inference process
is actually conducted.
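
As a preview of that inference process, the sketch below computes a t statistic for the null hypothesis that the slope equals 1 against the alternative that it is less than 1. The data are synthetic (Table I.1 is not reproduced here), the function name is hypothetical, and the critical value assumes the classical regression assumptions hold with 15 observations.

```python
import math
import random

def slope_t_stat(x, y, beta2_null):
    """Slope, its standard error, and the t statistic for H0: beta2 = beta2_null."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b1 = y_bar - b2 * x_bar
    rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)              # error-variance estimate, n - 2 df
    se_b2 = math.sqrt(sigma2_hat / s_xx)    # standard error of the slope
    return b2, se_b2, (b2 - beta2_null) / se_b2

# Synthetic data (NOT Table I.1), generated with an assumed slope of 0.7.
rng = random.Random(1)
x = [4000.0 + 200.0 * t for t in range(15)]
y = [-200.0 + 0.7 * xi + rng.gauss(0.0, 25.0) for xi in x]

b2, se, t = slope_t_stat(x, y, beta2_null=1.0)
T_CRIT = -1.771   # one-sided 5% critical value, Student t with 13 df
print(f"b2 = {b2:.4f}, se = {se:.4f}, t = {t:.2f}")
print("reject H0: beta2 = 1" if t < T_CRIT else "fail to reject H0: beta2 = 1")
```

A t statistic far below the critical value is what "statistically less than 1" means in practice.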

## 7. Forecasting or Prediction

If the chosen model does not refute the hypothesis or theory under consideration,
we may use it to predict the future value(s) of the dependent, or forecast, variable Y
on the basis of known or expected future value(s) of the explanatory, or predictor,
variable X.
To illustrate, suppose we want to predict the mean consumption expenditure for
1997. The GDP value for 1997 was 7269.8 billion dollars.
*Do not worry now about how these values were obtained. As we show in Chap. 3, the statistical
method of least squares has produced these estimates. Also, for now do not worry about the
negative value of the intercept.
*See Milton Friedman, “The Methodology of Positive Economics,” Essays in Positive Economics,
University of Chicago Press, Chicago, 1953.
*Data on PCE and GDP were available for 1997 but we purposely left them out to illustrate the topic
discussed in this section. As we will discuss in subsequent chapters, it is a good idea to save a portion
of the data to find out how well the fitted model predicts the out-of-sample observations.

Putting this GDP figure on the right-hand side of (I.3.3), we obtain:

$$\hat{Y}_{1997} = -184.0779 + 0.7064\,(7269.8) = 4951.3167 \tag{I.3.4}$$
or about 4951 billion dollars. Thus, given the value of the GDP, the mean, or average,
forecast consumption expenditure is about 4951 billion dollars. The actual value of
the consumption expenditure reported in 1997 was 4913.5 billion dollars. The
estimated model (I.3.3) thus overpredicted the actual consumption expenditure by
about 37.82 billion dollars. We could say the forecast error is about 37.82 billion
dollars, which is about 0.77 percent of the actual consumption expenditure for 1997.
When we fully discuss the linear regression model in subsequent chapters, we will try
to find out if such an error is “small” or “large.” But what is important for now is to
note that such
forecast errors are inevitable given the statistical nature of our analysis.
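
The forecast arithmetic can be checked in a few lines of Python. The sketch uses only the rounded figures quoted in the text, so the result can differ from 4951.3167 in the later decimals, presumably because the published figure was computed with an unrounded slope.

```python
# Rounded estimates from Eq. (I.3.4) and the 1997 figures quoted in the text
# (billions of 1992 dollars).
B1_HAT = -184.0779
B2_HAT = 0.7064
GDP_1997 = 7269.8
ACTUAL_PCE_1997 = 4913.5

forecast = B1_HAT + B2_HAT * GDP_1997
error = forecast - ACTUAL_PCE_1997            # positive means overprediction
pct_of_actual = 100.0 * error / ACTUAL_PCE_1997

print(f"forecast PCE:             {forecast:.2f}")
print(f"forecast error:           {error:.2f}")
print(f"error as % of actual PCE: {pct_of_actual:.2f}")
```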
There is another use of the estimated model (I.3.3). Suppose the President decides
to propose a reduction in the income tax. What will be the effect of such a policy on
income and thereby on consumption expenditure and ultimately on employment?
Suppose that, as a result of the proposed policy change, investment expenditure
increases. What will be the effect on the economy? As macroeconomic theory
shows, the change in income following, say, a dollar’s worth of change in investment
expenditure is given by the income multiplier M, which is defined as

$$M = \frac{1}{1 - \text{MPC}}$$

If we use the MPC of 0.70 obtained in (I.3.3), this multiplier becomes about M=3.33.
That is, an increase (decrease) of a dollar in investment will eventually lead to more
than a threefold increase (decrease) in income; note that it takes time for the
multiplier to work.
The critical value in this computation is MPC, for the multiplier depends on it. And
this estimate of the MPC can be obtained from regression models such as (I.3.3).
Thus, a quantitative estimate of MPC provides valuable information for policy
purposes. Knowing MPC, one can predict the future course of income, consumption
expenditure, and employment following a change in the government’s fiscal policies.
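
The multiplier arithmetic is easy to reproduce. The sketch below uses the standard simple Keynesian relation M = 1/(1 − MPC), which is the relation implied by the text's figures (an MPC of 0.70 giving M of about 3.33); the function name is hypothetical.

```python
def income_multiplier(mpc):
    """Simple Keynesian income multiplier, M = 1 / (1 - MPC)."""
    if not 0.0 < mpc < 1.0:
        raise ValueError("MPC must lie strictly between 0 and 1")
    return 1.0 / (1.0 - mpc)

# The rounded MPC used in the text, and the four-decimal estimate from (I.3.3).
for mpc in (0.70, 0.7064):
    print(f"MPC = {mpc:.4f} -> M = {income_multiplier(mpc):.2f}")
```

Because 1 − MPC sits in the denominator, small changes in the estimated MPC move the multiplier noticeably, which is why the precision of the MPC estimate matters for policy.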

## 8. Use of the Model for Control or Policy Purposes

Suppose we have the estimated consumption function given in (I.3.3). Suppose
further the government believes that consumer expenditure of about 4900 (billions
of 1992 dollars) will keep the unemployment rate at its

# LOGIT, PROBIT & TOBIT MODELS

Logit and probit models are appropriate when attempting to model a dichotomous dependent
variable, e.g. yes/no, agree/disagree, like/dislike, etc. The problems with utilizing the familiar
linear regression line are most easily understood visually. As an example, say we want to model
whether somebody does or does not have Bieber fever by how much beer they’ve consumed. We
collect data from a college frat house and attempt to model the relationship with linear (OLS)
regression.

There are several problems with this approach. First, the regression line may lead to predictions
outside the range of zero and one. Second, the functional form assumes the first beer has the same
marginal effect on Bieber fever as the tenth, which is probably not appropriate. Third, a residuals
plot would quickly reveal heteroskedasticity.
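
The first problem is easy to reproduce. The sketch below fits an ordinary least-squares line to a made-up 0/1 outcome (the eleven data points are invented for illustration) and prints the fitted values at the two ends of the beer scale.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Made-up data: beers consumed and whether the person has Bieber fever (1) or not (0).
beers = list(range(11))
fever = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

a, b = ols_simple(beers, fever)
fitted = [a + b * x for x in beers]

print(f"fitted line: {a:.3f} + {b:.3f} * beers")
print(f"prediction at 0 beers:  {fitted[0]:.3f}")    # below 0
print(f"prediction at 10 beers: {fitted[-1]:.3f}")   # above 1
```

Both end-point predictions fall outside the zero-to-one range, so they cannot be read as probabilities.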
Logit and probit models solve each of these problems by fitting a nonlinear function to the data
that looks like the following:

The straight line has been replaced by an S-shaped curve that 1) respects the boundaries of the
dependent variable; 2) allows for different rates of change at the low and high ends of the beer
scale; and 3) (assuming proper specification of independent variables) does away with
heteroskedasticity.

What logit and probit do, in essence, is take the linear model and feed it through a function to
yield a nonlinear relationship. Whereas the linear regression predictor looks like:

$$\hat{y} = \alpha + \beta x$$

the logit and probit predictors can be written as:

$$\hat{y} = f(\alpha + \beta x)$$

Logit and probit differ in how they define f(·). The logit model uses the
cumulative distribution function of the logistic distribution. The probit model uses the
cumulative distribution function of the standard normal distribution to define f(·). Both
functions will take any number and rescale it to fall between 0 and 1. Hence, whatever α + βx
equals, it can be transformed by the function to yield a predicted probability. Any function that
would return a value between zero and one would do the trick, but there is a deeper theoretical
model underpinning Logit and probit that requires the function to be based on a probability
distribution. The logistic and standard normal cdfs turn out to be convenient mathematically and
are programmed into just about any general purpose statistical package.
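
Both functions can be written with nothing beyond Python's standard library, which makes the rescaling easy to see. In this sketch the coefficients α = −3 and β = 0.8 are assumed values chosen only for illustration.

```python
import math

def logistic_cdf(z):
    """CDF of the standard logistic distribution (the f used by logit)."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """CDF of the standard normal distribution (the f used by probit)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

ALPHA, BETA = -3.0, 0.8   # assumed coefficients, for illustration only
for x in (0, 2, 4, 6, 8):
    index = ALPHA + BETA * x   # the linear part, which can be any real number
    print(f"x={x}: index={index:+.1f}  "
          f"logit p={logistic_cdf(index):.3f}  probit p={normal_cdf(index):.3f}")
```

Whatever the value of the linear index, both transformed values stay strictly between 0 and 1 and trace out the S-shaped curve described above.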
Is logit better than probit, or vice versa? Both methods will yield similar (though not identical)
inferences. Logit, also known as logistic regression, is more popular in health sciences like
epidemiology partly because coefficients can be interpreted in terms of odds ratios. Probit models
can be generalized to account for non-constant error variances in more advanced econometric
settings (known as heteroskedastic probit models) and hence are used in some contexts by
economists and political scientists. If these more advanced applications are not of relevance, then
it does not matter which method you choose.
## Tobit Model
• This model is for a metric dependent variable that is “limited” in the sense that
we observe it only if it is above or below some cutoff level. For example:
– Wages may be limited from below by the minimum wage
– The donation amount given to charity
– “Top coding” income at, say, \$300,000
– Time use and leisure activity of individuals
– Extramarital affairs
• It is also called the censored regression model. Censoring can be from below or from
above, also called left and right censoring. [Do not confuse the term “censoring” with
the one used in dynamic modeling.]
## Uses of Logit Model
Logistic regression is used in various fields, including machine learning, most medical
fields, and social sciences. For example, the Trauma and Injury Severity Score (TRISS),
which is widely used to predict mortality in injured patients, was originally developed by
Boyd et al. using logistic regression. Another example might be to predict whether an
American voter will vote Democratic or Republican, based on age, income, sex, race, state
of residence, votes in previous elections, etc. The technique can also be used
in engineering, especially for predicting the probability of failure of a given process,
system or product. It is also used in marketing applications such as prediction of a
customer's propensity to purchase a product or halt a subscription, etc. In economics it
can be used to predict the likelihood of a person's choosing to be in the labor force, and
a business application would be to predict the likelihood of a homeowner defaulting on
a mortgage.
## Examples of probit regression
Example 1: Suppose that we are interested in the factors that influence whether a
political candidate wins an election. The outcome (response) variable is binary (0/1); win
or lose. The predictor variables of interest are the amount of money spent on the
campaign, the amount of time spent campaigning negatively and whether the candidate
is an incumbent.
Example 2: A researcher is interested in how variables, such as GRE (Graduate Record
Exam scores), GPA (grade point average) and prestige of the undergraduate institution,
affect admission into graduate school. The response variable, admit/don’t admit, is a
binary variable.
## Examples of Tobit Model
Example 1. In the 1980s there was a federal law restricting speedometer readings to no more than
85 mph. So if you wanted to try and predict a vehicle’s top-speed from a combination of
horse-power and engine size, you would get a reading no higher than 85, regardless of
how fast the vehicle was really traveling. This is a classic case of right-censoring (censoring
from above) of the data. The only thing we are certain of is that those vehicles were
traveling at least 85 mph.
Example 2. A research project is studying the level of lead in home drinking water as a
function of the age of a house and family income. The water testing kit cannot detect lead
concentrations below 5 parts per billion (ppb). The EPA considers levels above 15 ppb to
be dangerous. These data are an example of left-censoring (censoring from below).
Example 3. Consider the situation in which we have a measure of academic aptitude
(scaled 200-800) which we want to model using reading and math test scores, as well as
the type of program the student is enrolled in (academic, general, or vocational). The
problem here is that students who answer all questions on the academic aptitude test
correctly receive a score of 800, even though it is likely that these students are not “truly”
equal in aptitude. The same is true of students who answer all of the questions
incorrectly. All such students would have a score of 200, although they may not all be of
equal aptitude.
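
The effect of censoring on an ordinary regression can be seen in a small simulation based on the speedometer example. This is not a Tobit estimator; it is only a sketch, on made-up data with an assumed true slope of 0.2, of the problem Tobit is meant to address: fitting least squares to readings capped at 85 understates the relationship.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

CAP = 85.0                 # speedometer ceiling from Example 1
rng = random.Random(7)

# Made-up relationship: true top speed rises 0.2 mph per unit of horsepower.
horsepower = [100.0 + 3.0 * i for i in range(101)]
true_speed = [40.0 + 0.2 * hp + rng.gauss(0.0, 5.0) for hp in horsepower]
reading = [min(s, CAP) for s in true_speed]   # right-censored at CAP

n_censored = sum(1 for s in true_speed if s >= CAP)
slope_true = ols_slope(horsepower, true_speed)
slope_reading = ols_slope(horsepower, reading)

print(f"censored observations: {n_censored} of {len(reading)}")
print(f"slope using true speeds:       {slope_true:.3f}")
print(f"slope using censored readings: {slope_reading:.3f}")
```

The slope estimated from the capped readings is flatter than the one from the true speeds, because every fast vehicle is recorded at the same value.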

# Importance of Quantitative Research

1. More reliable and objective

2. Can use statistics to generalize a finding
3. Often reduces and restructures a complex problem to a limited number of
variables
4. Looks at relationships between variables and can establish cause and effect in
highly controlled circumstances
5. Tests theories or hypotheses
6. Assumes sample is representative of the population
7. Subjectivity of researcher in methodology is recognized less
8. Less detailed than qualitative data and may miss a desired response from the
participant

1. Basis for scientific analysis – With the increasing complexity of modern
business, it is not possible to rely on unscientific decisions based on
intuition. Quantitative techniques provide scientific methods for tackling the
various problems of modern business.
2. Tools for scientific analysis– Quantitative techniques provide the managers with a
variety of tools from mathematics, statistics, economics and operational research.
These tools help the manager to provide a more precise description and solution
of the problem. The solutions obtained by using quantitative techniques are often
free from the bias of the manager or the owner of the business.
3. Solution for various business problems – Quantitative techniques provide solutions
in almost every area of a business. They can be used in production, marketing,
inventory, finance and other areas to answer questions such as (a) how should
resources be used in production so that profits are maximized? and (b) how
should production be matched to demand so as to minimize the cost of
inventory?
4. Optimum allocation of resources – An allocation of resources is said to be optimal
if either a given level of output is being produced at minimum cost or maximum
output is being produced at a given cost. Quantitative techniques enable a
manager to allocate the resources of a business or industry optimally.
5. Selection of an optimal strategy – Using quantitative techniques, it is possible to
determine the optimal strategy of a business or firm that is facing competition
from its rivals. The techniques for determining the optimal strategy are based
on game theory.
6. Optimal deployment of resources – Using quantitative techniques, it is possible to
find the earliest and latest times for successful completion of a project; this is
called the program evaluation and review technique (PERT).
7. Facilitate the process of decision making – Quantitative techniques provide a
method of decision making in the face of uncertainty. These techniques are based
upon decision theory.

# What is Multicollinearity?

Multicollinearity can adversely affect your regression results.

Multicollinearity generally occurs when there are high correlations between two or
more predictor variables. In other words, one predictor variable can be used to predict the
other. This creates redundant information, skewing the results in a regression model.
Examples of correlated predictor variables (also called multicollinear predictors) are: a
person’s height and weight, age and sales price of a car, or years of education and annual
income.
An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs
of predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called
perfect multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be
removed from the model if at all possible.
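
The pairwise check described above can be sketched as follows. The data are synthetic, with weight deliberately constructed to track height, and the 0.8 flagging threshold is a common rule of thumb assumed here rather than a value from the text.

```python
import math
import random
from itertools import combinations

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Synthetic predictors: weight is built from height, age is unrelated.
rng = random.Random(3)
height = [rng.gauss(170.0, 10.0) for _ in range(200)]
weight = [0.9 * h - 85.0 + rng.gauss(0.0, 4.0) for h in height]
age = [rng.uniform(20.0, 60.0) for _ in range(200)]

predictors = {"height": height, "weight": weight, "age": age}
THRESHOLD = 0.8   # assumed rule-of-thumb cutoff
for (name1, v1), (name2, v2) in combinations(predictors.items(), 2):
    r = pearson_r(v1, v2)
    flag = "  <-- highly correlated" if abs(r) > THRESHOLD else ""
    print(f"r({name1}, {name2}) = {r:+.3f}{flag}")
```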
It’s more common for multicollinearity to rear its ugly head in observational studies; it’s
less common with experimental data. When the condition is present, it can result in
unstable and unreliable regression estimates. Several other problems can interfere with
analysis of results, including:

• The t-statistic will generally be very small and coefficient confidence intervals will be
very wide. This means that it is harder to reject the null hypothesis.
• The partial regression coefficient may be an imprecise estimate; standard errors may
be very large.
• Partial regression coefficients may have sign and/or magnitude changes as they pass
from sample to sample.
• Multicollinearity makes it difficult to gauge the effect of independent
variables on dependent variables.

## What Causes Multicollinearity?

The two types are:

• Data-based multicollinearity: caused by poorly designed experiments, data that is
100% observational, or data collection methods that cannot be manipulated. In some
cases, variables may be highly correlated (usually due to collecting data from purely
observational studies) and there is no error on the researcher’s part. For this reason,
you should conduct experiments whenever possible, setting the level of the predictor
variables in advance.
• Structural multicollinearity: caused by you, the researcher, creating new predictor
variables.
Causes for multicollinearity can also include:

• Insufficient data. In some cases, collecting more data can resolve the issue.
• Dummy variables may be incorrectly used. For example, the researcher may fail to
exclude one category, or add a dummy variable for every category (e.g. spring,
summer, autumn, winter).
• Including a variable in the regression that is actually a combination of two other
variables. For example, including “total investment income” when total investment
income = income from stocks and bonds + income from savings interest.
• Including two identical (or almost identical) variables. For example, weight in pounds
and weight in kilos, or investment income and savings/bond income.
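
The dummy-variable case in the list above can be verified directly. In this sketch (the function name and the six observations are made up), a model with an intercept and a dummy for every season has columns that are exactly linearly dependent, and dropping one category removes the dependence.

```python
SEASONS = ("spring", "summer", "autumn", "winter")

def design_row(season, drop=None):
    """Intercept plus one 0/1 dummy per season, optionally omitting `drop`."""
    dummies = [1 if season == s else 0 for s in SEASONS if s != drop]
    return [1] + dummies

observations = ["spring", "winter", "summer", "summer", "autumn", "spring"]

# With all four dummies, the dummy columns add up to the intercept column in
# every row: an exact linear dependence, i.e. perfect multicollinearity.
full = [design_row(s) for s in observations]
full_dependent = all(sum(row[1:]) == row[0] for row in full)
print(full_dependent)       # True

# After dropping one category (the baseline), winter rows have all-zero
# dummies, so the dummies no longer reproduce the intercept column.
reduced = [design_row(s, drop="winter") for s in observations]
reduced_dependent = all(sum(row[1:]) == row[0] for row in reduced)
print(reduced_dependent)    # False
```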

## What is Multicollinearity?
As stated in the lesson overview, multicollinearity exists whenever two or more of the predictors in a
regression model are moderately or highly correlated. Now, you might be wondering why a researcher
can’t just collect his data in such a way to ensure that the predictors aren't highly correlated. Then,
multicollinearity wouldn't be a problem, and we wouldn't have to bother with this silly lesson.

Unfortunately, researchers often can't control the predictors. Obvious examples include a person's
gender, race, grade point average, math SAT score, IQ, and starting salary. For each of these predictor
examples, the researcher just observes the values as they occur for the people in her random sample.

Multicollinearity happens more often than not in such observational studies. And, unfortunately,
regression analyses most often take place on data obtained from observational studies. If you aren't
convinced, consider the example data sets for this course. Most of the data sets were obtained from
observational studies, not experiments. It is for this reason that we need to fully understand the impact
of multicollinearity on our regression analyses.

## Types of multicollinearity
There are two types of multicollinearity:

• Structural multicollinearity is a mathematical artifact caused by creating new predictors from other
predictors, such as creating the predictor x² from the predictor x.
• Data-based multicollinearity, on the other hand, is a result of a poorly designed experiment, reliance
on purely observational data, or the inability to manipulate the system on which the data are
collected.
In the case of structural multicollinearity, the multicollinearity is induced by what you have done. Data-
based multicollinearity is the more troublesome of the two types of multicollinearity. Unfortunately it
is the type we encounter most often!
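
Structural multicollinearity is easy to produce on purpose. The sketch below builds a squared predictor from x and measures how strongly the two are correlated; the values 1 through 20 are arbitrary illustrative data.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

x = [float(v) for v in range(1, 21)]   # arbitrary predictor values
x_squared = [v ** 2 for v in x]        # new predictor created from x

r = pearson_r(x, x_squared)
print(f"corr(x, x^2) = {r:.3f}")
```

The correlation is close to 1 even though nothing about the data collection went wrong: the dependence was introduced entirely by constructing one predictor from the other.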

## Example

Let's take a quick look at an example in which data-based
multicollinearity exists. Some researchers observed (notice the choice of word!) the following data
(bloodpress.txt) on 20 individuals with high blood pressure:
• blood pressure (y = BP, in mm Hg)
• age (x1 = Age, in years)
• weight (x2 = Weight, in kg)
• body surface area (x3 = BSA, in sq m)
• duration of hypertension (x4 = Dur, in years)
• basal pulse (x5 = Pulse, in beats per minute)
• stress index (x6 = Stress)
The researchers were interested in determining if a relationship exists between blood pressure and age,
weight, body surface area, duration, pulse rate and/or stress level.

[Figure: matrix plot of BP, Age, Weight, and BSA]

# Dummy Variables
A dummy variable is a numerical variable used in regression analysis to
represent subgroups of the sample in your study. In research design, a
dummy variable is often used to distinguish different treatment groups. In
the simplest case, we would use a 0,1 dummy variable where a person
is given a value of 0 if they are in the control group or a 1 if they are in
the treated group. Dummy variables are useful because they enable us
to use a single regression equation to represent multiple groups. This
means that we don't need to write out separate equation models for each
subgroup. The dummy variables act like 'switches' that turn various
parameters on and off in an equation. Another advantage of a 0,1
dummy-coded variable is that even though it is a nominal-level variable
you can treat it statistically like an interval-level variable (if this made no
sense to you, you probably should refresh your memory on levels of
measurement). For instance, if you take an average of a 0,1 variable, the
result is the proportion of 1s in the distribution.

To illustrate dummy variables, consider the simple regression model for
a posttest-only two-group randomized experiment:

$$y_i = \beta_0 + \beta_1 Z_i + e_i$$

Here y_i is the outcome (posttest) score, Z_i is the 0,1 dummy variable
(0 = control group, 1 = treatment group), and e_i is the error term. This model is
essentially the same as conducting a t-test on the posttest means for two
groups or conducting a one-way Analysis of Variance (ANOVA). The key
term in the model is β1, the estimate of the difference between the
groups. To see how dummy variables work, we'll use this simple model
to show you how to use them to pull out the separate sub-equations for
each subgroup. Then we'll show how you estimate the difference
between the subgroups by subtracting their respective equations. You'll
see that we can pack an enormous amount of information into a single
equation using dummy variables. All I want to show you here is that β1 is
the difference between the treatment and control groups.

To see this, the first step is to compute what the equation would be for
each of our two groups separately. For the control group, Z = 0. When
we substitute that into the equation, and recognize that by assumption
the error term averages to 0, we find that the predicted value for the
control group is β0, the intercept. Now, to figure out the treatment group
line, we substitute the value of 1 for Z, again recognizing that by
assumption the error term averages to 0. The equation for the treatment
group indicates that the treatment group value is the sum of the two beta
values, β0 + β1.

Now, we're ready to move on to the second step: computing the
difference between the groups. How do we determine that? Well, the
difference must be the difference between the equations for the two
groups that we worked out above. In other words, to find the difference
between the groups we just find the difference between the equations for
the two groups: (β0 + β1) − β0 = β1. It should be obvious from the figure
that the difference is β1. Think about what this means. The difference
between the groups is β1. OK, one more time just for the sheer heck of it.
The difference between the groups in this model is β1!
Whenever you have a regression model with dummy variables, you can
always see how the variables are being used to represent multiple
subgroup equations by following the two steps described above:

• create separate equations for each subgroup by substituting the dummy values
• find the difference between groups by finding the difference between their equations
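
The two steps can be confirmed numerically. In the sketch below the posttest scores are made up for illustration; regressing the outcome on a 0,1 dummy by least squares returns the control-group mean as the intercept and the difference between the group means as the slope.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x + e."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Made-up posttest scores, for illustration only.
control = [48.0, 52.0, 50.0, 47.0, 53.0]
treatment = [58.0, 61.0, 55.0, 60.0, 56.0]

z = [0] * len(control) + [1] * len(treatment)   # dummy: 0 = control, 1 = treatment
y = control + treatment

b0, b1 = ols_simple(z, y)
mean_control = sum(control) / len(control)
mean_treatment = sum(treatment) / len(treatment)

print(f"b0 = {b0:.2f}  (control mean = {mean_control:.2f})")
print(f"b1 = {b1:.2f}  (treatment mean - control mean = {mean_treatment - mean_control:.2f})")
```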

## Dummy variable (statistics)

A dummy variable is a dichotomous variable which has been coded to represent a variable with a
higher level of measurement. Dummy variables are often used in multiple linear regression (MLR).
Dummy coding refers to the process of coding a categorical variable into dichotomous variables. For
example, we may have data about participants' religion, with each participant coded as follows:
A categorical or nominal variable with three categories

| Religion | Code |
|---|---|
| Christian | 1 |
| Muslim | 2 |
| Atheist | 3 |

This is a nominal variable (see level of measurement) which would be inappropriate as a predictor in
MLR. However, this variable could be represented using a series of three dichotomous variables
(coded as 0 or 1), as follows:
Full dummy coding for a categorical variable with three categories

| Religion | Christian | Muslim | Atheist |
|---|---|---|---|
| Christian | 1 | 0 | 0 |
| Muslim | 0 | 1 | 0 |
| Atheist | 0 | 0 | 1 |

There is some redundancy in this dummy coding. For instance, in this simplified data set, if we know
that someone is not Christian and not Muslim, then they are Atheist.
So we only need to use two of these three dummy-coded variables as predictors. More generally, the
number of dummy-coded variables needed is one less than the number of categories.
Choosing which dummy variable not to use is arbitrary and depends on the researcher's logic. For
example, if I'm interested in the effect of being religious, my reference (or baseline) category would be
Atheist. I would then be interested to see the extent to which being Christian (0 (No) or 1
(Yes)) or Muslim (0 (No) or 1 (Yes)) predicts the variance in a dependent variable (such as Happiness)
in a regression analysis. In this case, the dummy coding to be used would be the following subset of
the previous full dummy coding table:
Dummy coding for a categorical variable with three categories, using Atheist as the reference category

| Religion | Christian | Muslim |
| --- | --- | --- |
| Christian | 1 | 0 |
| Muslim | 0 | 1 |
| Atheist | 0 | 0 |
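
The reference-category coding can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `dummy_code` and the sample data are hypothetical, and statistical packages offer their own equivalents.

```python
def dummy_code(values, categories, reference):
    """Return one dict of 0/1 indicators per value, omitting the reference category."""
    # Keep every category except the reference, so k categories give k - 1 dummies.
    kept = [c for c in categories if c != reference]
    return [{c: int(v == c) for c in kept} for v in values]


religions = ["Christian", "Muslim", "Atheist"]
sample = ["Christian", "Muslim", "Atheist", "Muslim"]  # hypothetical participants

coded = dummy_code(sample, religions, reference="Atheist")
for person, row in zip(sample, coded):
    print(person, row)
```

A participant in the reference category is the one whose dummies are all 0, which is why the third indicator column is not needed.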

Alternatively, I may simply want to recode religion into a single dichotomous variable to indicate, for
example, whether a participant is Atheist (0) or Religious (1), where Religious is Christian or Muslim.
The coding would be as follows:
Recoding religion into a single dichotomous variable

| Religiosity | Code |
| --- | --- |
| Atheist | 0 |
| Religious | 1 |

Autocorrelation
In this part of the book (Chapters 20 and 21), we discuss issues especially related to the study of
economic time series. A time series is a sequence of observations on a variable over time.
Macroeconomists generally work with time series (e.g., quarterly observations on GDP and monthly
observations on the unemployment rate). Time series econometrics is a huge and complicated subject.
Our goal is to introduce you to some of the main issues.

We concentrate in this book on static models. A static model deals with the contemporaneous
relationship between a dependent variable and one or more independent variables. A simple example
would be a model that relates average cigarette consumption in a given year for a given state to the
average real price of cigarettes in that year:
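
A static model of this kind can be sketched as follows. The names $Quantity_t$, $\beta_0$, $\beta_1$, and $\varepsilon_t$ are assumed here for illustration; only RealPrice is named in the text.

$$
Quantity_t = \beta_0 + \beta_1 \, RealPrice_t + \varepsilon_t
$$

Every variable carries the same time subscript $t$, which is what makes the relationship contemporaneous.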

In this model we assume that the price of cigarettes in a given year affects quantity demanded in that
year. In many cases, a static model does not adequately capture the relationship between the variables
of interest. For example, cigarettes are addictive, and so quantity demanded this year might depend on
prices last year. Capturing this idea in a model requires some additional notation and terminology. If we
denote year t's real price by $RealPrice_t$, then the previous year's price is $RealPrice_{t-1}$. The latter quantity
is called a one-period lag of RealPrice. We could then write down a distributed lag model:
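
With a single one-period lag, such a model might take the following form. The dependent-variable name $Quantity_t$, the coefficients, and the error term $\varepsilon_t$ are assumed symbols, and the number of lags included is an illustrative choice.

$$
Quantity_t = \beta_0 + \beta_1 \, RealPrice_t + \beta_2 \, RealPrice_{t-1} + \varepsilon_t
$$

Here $\beta_2$ captures the effect of last year's price on this year's quantity demanded.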

Although highly relevant to time series applications, distributed lag models are an advanced topic which
we will not cover in this book.

Let us return to the static model:

As always, before we can proceed to draw inferences from regressions from sample data, we need a
model of the data generating process. We will attempt to stick as close as possible to the classical
econometric model. Thus, to keep things simple, in our discussion of static models we continue to
assume that the X’s, the independent variables, are fixed in repeated samples. Although this assumption
is pretty clearly false for most time series, for static models it does not do too much harm to pretend it
is true. Chapter 21 points out how things change when one considers more realistic models for the data
generating process.

Unfortunately, we cannot be so cavalier with another key assumption of the classical econometric model:
the assertion that the error terms for each observation are independent of one another. In the case we
are considering, the error term reflects omitted variables that influence the demand for cigarettes. For
example, social attitudes toward cigarette smoking and the amount of cigarette advertising both
probably affect the demand for cigarettes. Now social attitudes are fairly similar from one year to the
next, though they may vary considerably over longer time periods. Thus, social attitudes in 1961 were
probably similar to those in 1960, and those in 1989 were probably similar to those in 1988. If that is
true and if social attitudes are an important component of the error term in our model of cigarette
demand, the assumption of independent error terms across observations is violated.

These considerations apply quite generally. In most time series, it is plausible that the omitted variables
change slowly over time. Thus, the influence of the omitted variable is similar from one time period to
the next. Therefore, the error terms are correlated with one another. This violation of the classical
econometric model is generally known as autocorrelation of the errors. As is the case with
heteroskedasticity, OLS estimates remain unbiased, but the estimated SEs are biased.

For both heteroskedasticity and autocorrelation there are two approaches to dealing with the problem.
You can either attempt to correct the bias in the estimated SE, by constructing a heteroskedasticity- or
autocorrelation-robust estimated SE, or you can transform the original data and use generalized least
squares (GLS) or feasible generalized least squares (FGLS). The advantage of the former method is that
it is not necessary to know the exact nature of the heteroskedasticity or autocorrelation to come up with
consistent estimates of the SE. The advantage of the latter method is that, if you know enough about
the form of the heteroskedasticity or autocorrelation, the GLS or FGLS estimator has a smaller SE than
OLS. In our discussion of heteroskedasticity we have chosen to emphasize the first method of dealing
with the problem; this chapter emphasizes the latter method. These choices reflect the actual practice
of empirical economists who have spent much more time trying to model the exact nature of the
autocorrelation in their data sets than the heteroskedasticity.
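
The transform-then-estimate idea can be sketched in Python under one specific assumption that the text does not make: that the errors follow a first-order autoregressive process, $e_t = \rho e_{t-1} + u_t$. The sketch uses a single Cochrane-Orcutt-style quasi-differencing step on simulated data. All function names and parameter values are hypothetical.

```python
import random


def ols(x, y):
    """Simple-regression OLS: return (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def estimate_rho(resid):
    """Regress e_t on e_{t-1} (no intercept) to estimate rho."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid[:-1])
    return num / den


def fgls_ar1(x, y):
    """One-step feasible GLS under an assumed AR(1) error process."""
    b0, b1 = ols(x, y)
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    rho = estimate_rho(resid)
    # Quasi-difference the data: z*_t = z_t - rho * z_{t-1}; the first observation is dropped.
    x_star = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    a0, slope = ols(x_star, y_star)
    # The transformed intercept equals beta0 * (1 - rho), so rescale it.
    return a0 / (1 - rho), slope, rho


# Simulated data with assumed true values: y = 2 + 0.5 x + e, e_t = 0.7 e_{t-1} + u_t
random.seed(0)
n = 500
x = [t / 10 + random.gauss(0, 1) for t in range(n)]
e = [0.0] * n
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + random.gauss(0, 1)
y = [2 + 0.5 * xi + ei for xi, ei in zip(x, e)]

intercept, slope, rho = fgls_ar1(x, y)
print(round(intercept, 2), round(slope, 2), round(rho, 2))
```

The estimate is "feasible" because $\rho$ is estimated from the OLS residuals rather than known in advance. If the AR(1) assumption is a poor description of the errors, the transformation need not help.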

In this chapter, we analyze autocorrelation in the errors and apply the results to the study of static time
series models. In many ways our discussion of autocorrelation parallels that of heteroskedasticity. The
chapter is organized in four main parts:

- Understanding Autocorrelation
- Consequences of Autocorrelation for the OLS estimator
- Diagnosing the Presence of Autocorrelation
- Correcting for Autocorrelation

Chapter 21 goes on to consider several topics that stem from the discussion of
autocorrelation in static models: trends and seasonal adjustment, issues surrounding the data
generation process (stationarity and weak dependence), forecasting, and lagged dependent variable
models.

Autocorrelation
Autocorrelation is a characteristic of data in which values of the same variable are
correlated across related observations. It violates the assumption of instance
independence, which underlies most conventional models. It generally exists in
data sets in which the data, instead of being randomly selected, come from the same
source.
Presence
Researchers generally do not expect autocorrelation. It arises mostly from
dependencies within the data, and its presence is a strong motivation for
researchers who are interested in relational learning and inference.

Examples
In order to understand autocorrelation, we can discuss some instances based on
cross-sectional and time series data. In cross-sectional data, if a change in the income of
person A affects the savings of person B (a person other than person A), then
autocorrelation is present. In time series data, if the observations are correlated with one
another, especially when the time intervals between them are small, this inter-correlation
is called autocorrelation.
In time series data, autocorrelation is defined as the delayed correlation of a given
series with itself, where the delay is some specific number of time units (the lag).
Serial correlation is another name for this. The lagged correlation between two
different series, by contrast, is called cross-correlation.
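
The idea of a series correlated with a delayed copy of itself can be made concrete with a small sketch. The function below computes a common form of the sample autocorrelation at a given lag. The function name and the two input series are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Total variation of the series around its mean.
    denom = sum((v - mean) ** 2 for v in series)
    # Co-variation between each value and the value `lag` periods earlier.
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom


trending = [1, 2, 3, 4, 5, 6, 7, 8]            # moves steadily upward
alternating = [1, -1, 1, -1, 1, -1, 1, -1]     # flips sign every period

r_trend = autocorrelation(trending, 1)
r_alt = autocorrelation(alternating, 1)
print(r_trend, r_alt)
```

The steadily rising series yields a positive lag-1 value, while the series that flips sign every period yields a negative one.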

Patterns
Plots of autocorrelated data show various kinds of patterns: for example, a curve that
shows a discernible pattern among the residuals, or a curve that shows a cyclical
pattern of upward and downward movement.

In time series, autocorrelation generally occurs due to sluggishness or inertia within the
data. A researcher who uses an incorrect functional form for time series data can also
introduce autocorrelation.
Handling of the data that involves extrapolation or interpolation can give rise to
autocorrelation as well. Thus, one should make the data stationary in order to
remove autocorrelation when handling time series data.

Autocorrelation is a matter of degree, and it can be positive as well as negative. If the series
(like an economic series) moves persistently upward or downward, it is
considered to exhibit positive autocorrelation. If, on the other hand, the series
constantly alternates between upward and downward movements, it is considered to
exhibit negative autocorrelation.

When a researcher applies ordinary least squares in the presence of
autocorrelation, the resulting estimator is inefficient.

## Detecting the Presence

A very popular test, the Durbin-Watson test, detects the presence of
autocorrelation. If the researcher detects autocorrelation in the data, the first thing to
do is to find out whether or not it is pure. If it is pure, then one can
transform the original model into one that is free from pure autocorrelation.
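
The Durbin-Watson statistic itself is simple to compute from regression residuals: it is the sum of squared successive differences divided by the sum of squared residuals. The sketch below uses two made-up residual series. The function name and the data are hypothetical, and a formal test would also compare the statistic with tabulated critical values.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    numer = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denom = sum(e ** 2 for e in residuals)
    return numer / denom


# Hypothetical residuals: one series changes slowly, the other flips sign every period.
persistent = [1.0, 0.8, 0.9, 0.7, -0.6, -0.8, -0.9, -0.7]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)
print(dw_persistent, dw_alternating)
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.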