CHAPTER 1

Economic Questions and Data

Ask a half dozen econometricians what econometrics is and you could get a half dozen different answers. One might tell you that econometrics is the science of testing economic theories. A second might tell you that econometrics is the set of tools used for forecasting future values of economic variables, such as a firm's sales, the overall growth of the economy, or stock prices. Another might say that econometrics is the process of fitting mathematical economic models to real-world data. A fourth might tell you that it is the science and art of using historical data to make numerical, or quantitative, policy recommendations in government and business. In fact, all these answers are right. At a broad level, econometrics is the science and art of using economic theory and statistical techniques to analyze economic data.

Econometric methods are used in many branches of economics, including finance, labor economics, macroeconomics, microeconomics, marketing, and economic policy. Econometric methods are also commonly used in other social sciences, including political science and sociology. This book introduces you to the core set of methods used by econometricians. We will use these methods to answer a variety of specific, quantitative questions taken from the world of business and government policy. This chapter poses four of those questions and discusses, in general terms, the econometric approach to answering them. The chapter concludes with a survey of the main types of data available to econometricians for answering these and other quantitative economic questions.

1.1 Economic Questions We Examine
Many decisions in economics, business, and government hinge on understanding
relationships among variables in the world around us. These decisions require

quantitative answers to quantitative questions.
This book examines several quantitative questions taken from current issues

in economics. Four of these questions concern education policy, racial bias in mortgage lending, cigarette consumption, and macroeconomic forecasting.


Question #1: Does Reducing Class Size Improve Elementary School Education?

Proposals for reform of the U.S. public education system generate heated debate. Many of the proposals concern the youngest students, those in elementary school. Elementary school education has various objectives, such as developing social skills, but for many parents and educators, the most important objective is basic academic learning: reading, writing, and basic mathematics. One prominent proposal for improving basic learning is to reduce class sizes at elementary schools. With fewer students in the classroom, the argument goes, each student gets more of the teacher's attention, there are fewer class disruptions, learning is enhanced, and grades improve.

But what, precisely, is the effect on elementary school education of reducing class size? Reducing class size costs money: It requires hiring more teachers and, if the school is already at capacity, building more classrooms. A decision maker contemplating hiring more teachers must weigh these costs against the benefits. To weigh costs and benefits, however, the decision maker must have a precise quantitative understanding of the likely benefits. Is the beneficial effect on basic learning of smaller classes large or small? Is it possible that smaller class size actually has no effect on basic learning?

Although common sense and everyday experience may suggest that more learning occurs when there are fewer students, common sense cannot provide a quantitative answer to the question of what exactly is the effect on basic learning of reducing class size. To provide such an answer, we must examine empirical evidence, that is, evidence based on data, relating class size to basic learning in elementary schools.

In this book, we examine the relationship between class size and basic learning using data gathered from 420 California school districts in 1999. In the California data, students in districts with small class sizes tend to perform better on standardized tests than students in districts with larger classes. While this fact is consistent with the idea that smaller classes produce better test scores, it might simply reflect many other advantages that students in districts with small classes have over their counterparts in districts with large classes. For example, districts with small class sizes tend to have wealthier residents than districts with large classes, so students in small-class districts could have more opportunities for learning outside the classroom. It could be these extra learning opportunities that lead to higher test scores, not smaller class sizes. In Part II, we use multiple regression analysis to isolate the effect of changes in class size from changes in other factors, such as the economic background of the students.


Question #2: Is There Racial Discrimination in the Market for Home Loans?

Most people buy their homes with the help of a mortgage, a large loan secured by the value of the home. By law, U.S. lending institutions cannot take race into account when deciding to grant or deny a request for a mortgage: Applicants who are identical in all ways but their race should be equally likely to have their mortgage applications approved. In theory, then, there should be no racial bias in mortgage lending.

In contrast to this theoretical conclusion, researchers at the Federal Reserve Bank of Boston found (using data from the early 1990s) that 28% of black applicants are denied mortgages, while only 9% of white applicants are denied. Do these data indicate that, in practice, there is racial bias in mortgage lending? If so, how large is it?

The fact that more black than white applicants are denied in the Boston Fed data does not by itself provide evidence of discrimination by mortgage lenders, because the black and white applicants differ in many ways other than their race. Before concluding that there is bias in the mortgage market, these data must be examined more closely to see if there is a difference in the probability of being denied for otherwise identical applicants and, if so, whether this difference is large or small. To do so, in Chapter 11 we introduce econometric methods that make it possible to quantify the effect of race on the chance of obtaining a mortgage, holding constant other applicant characteristics, notably their ability to repay the loan.

Question #3: How Much Do Cigarette Taxes Reduce Smoking?

Cigarette smoking is a major public health concern worldwide. Many of the costs of smoking, such as the medical expenses of caring for those made sick by smoking and the less quantifiable costs to nonsmokers who prefer not to breathe secondhand cigarette smoke, are borne by other members of society. Because these costs are borne by people other than the smoker, there is a role for government intervention in reducing cigarette consumption. One of the most flexible tools for cutting consumption is to increase taxes on cigarettes.

Basic economics says that if cigarette prices go up, consumption will go down. But by how much? If the sales price goes up by 1%, by what percentage will the quantity of cigarettes sold decrease? The percentage change in the quantity demanded resulting from a 1% increase in price is the price elasticity of demand.
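The arithmetic that connects the elasticity to a policy question can be sketched in a few lines. The elasticity value used below (-0.4) is an assumption chosen purely for illustration, not an estimate from this chapter:

```python
# Hypothetical illustration of the price elasticity of demand. The value
# -0.4 is an assumed number for this sketch, not an estimate from the text.

def required_price_change(elasticity, target_quantity_change):
    # Percentage price change needed to produce a given percentage change
    # in quantity demanded, since
    # elasticity = (% change in quantity) / (% change in price).
    return target_quantity_change / elasticity

# How large a price rise would cut consumption by 20% if the elasticity
# were -0.4?
print(required_price_change(-0.4, -20.0))  # 50.0, i.e., a 50% price increase
```

The smaller the elasticity in absolute value, the larger the tax-driven price increase needed to achieve a given cut in consumption, which is why the numerical value of the elasticity matters for policy.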


If we want to reduce smoking consumption by a certain amount, say 20%, by raising taxes, then we need to know the price elasticity to calculate the price increase necessary to achieve this reduction in consumption. But what is the price elasticity of demand for cigarettes?

Although economic theory provides us with the concepts that help us answer this question, it does not tell us the numerical value of the price elasticity of demand. To learn the elasticity, we must examine empirical evidence about the behavior of smokers and potential smokers; in other words, we need to analyze data on cigarette consumption and prices.

The data we examine are cigarette sales, prices, taxes, and personal income for U.S. states in the 1980s and 1990s. In these data, states with low taxes, and thus low cigarette prices, have high smoking rates, and states with high prices have low smoking rates. However, the analysis of these data is complicated because causality runs both ways: Low taxes lead to high demand, but if there are many smokers in the state, then local politicians might try to keep cigarette taxes low to satisfy their smoking constituents. In Chapter 12, we study methods for handling this "simultaneous causality" and use those methods to estimate the price elasticity of cigarette demand.

Question #4: What Will the Rate of Inflation Be Next Year?

It seems that people always want a sneak preview of the future. What will sales be next year at a firm considering investing in new equipment? Will the stock market go up next month and, if so, by how much? Will city tax receipts next year cover planned expenditures on city services? Will your microeconomics exam next week focus on externalities or monopolies? Will Saturday be a nice day to go to the beach?

One aspect of the future in which macroeconomists and financial economists are particularly interested is the rate of overall price inflation during the next year. A financial professional might advise a client whether to make a loan or to take one out at a given rate of interest, depending on her best guess of the rate of inflation over the coming year. Economists at central banks like the Federal Reserve Board in Washington, D.C., and the European Central Bank in Frankfurt, Germany, are responsible for keeping the rate of price inflation under control, so their decisions about how to set interest rates rely on the outlook for inflation over the next year. If they think the rate of inflation will increase by a percentage point, then they might increase interest rates by more than that to slow down an economy that, in their view, risks overheating. If they guess wrong, they risk causing either an unnecessary recession or an undesirable jump in the rate of inflation.

Professional economists who rely on precise numerical forecasts use econometric models to make those forecasts. A forecaster's job is to predict the future using the past, and econometricians do this by using economic theory and statistical techniques to quantify relationships in historical data.

The data we use to forecast inflation are the rates of inflation and unemployment in the United States. An important empirical relationship in macroeconomic data is the "Phillips curve," in which a currently low value of the unemployment rate is associated with an increase in the rate of inflation over the next year. One of the inflation forecasts we develop and evaluate in Chapter 14 is based on the Phillips curve.

Quantitative Questions, Quantitative Answers

Each of these four questions requires a numerical answer. Economic theory provides clues about that answer (cigarette consumption ought to go down when the price goes up), but the actual value of the number must be learned empirically, that is, by analyzing data. Because we use data to answer quantitative questions, our answers always have some uncertainty: A different set of data would produce a different numerical answer. Therefore, the conceptual framework for the analysis needs to provide both a numerical answer to the question and a measure of how precise the answer is.

The conceptual framework used in this book is the multiple regression model, the mainstay of econometrics. This model, introduced in Part II, provides a mathematical way to quantify how a change in one variable affects another variable, holding other things constant. For example, what effect does a change in class size have on test scores, holding constant or controlling for student characteristics (such as family income) that a school district administrator cannot control? What effect does your race have on your chances of having a mortgage application granted, holding constant other factors such as your ability to repay the loan? What effect does a 1% increase in the price of cigarettes have on cigarette consumption, holding constant the income of smokers and potential smokers? The multiple regression model and its extensions provide a framework for answering these questions using data and for quantifying the uncertainty associated with those answers.

1.2 Causal Effects and Idealized Experiments

Like many questions encountered in econometrics, the first three questions in Section 1.1 concern causal relationships among variables. In common usage, an action is said to cause an outcome if the outcome is the direct result, or consequence, of that action.

Touching a hot stove causes you to get burned; drinking water causes you to be less thirsty; putting air in your tires causes them to inflate; putting fertilizer on your tomato plants causes them to produce more tomatoes. Causality means that a specific action (applying fertilizer) leads to a specific, measurable consequence (more tomatoes).

Estimation of Causal Effects
How best might we measure the causal effect on tomato yield (measured in kilograms) of applying a certain amount of fertilizer, say 100 grams of fertilizer per square meter? One way to measure this causal effect is to conduct an experiment. In that experiment, a horticultural researcher plants many plots of tomatoes. Each plot is tended identically, with one exception: Some plots get 100 grams of fertilizer per square meter, while the rest get none. Moreover, whether a plot is fertilized or not is determined randomly by a computer, ensuring that any other differences between the plots are unrelated to whether they receive fertilizer. At the end of the growing season, the horticulturalist weighs the harvest from each plot. The difference between the average yield per square meter of the treated and untreated plots is the effect on tomato production of the fertilizer treatment.

This is an example of a randomized controlled experiment. It is controlled in the sense that there are both a control group that receives no treatment (no fertilizer) and a treatment group that receives the treatment (100 g/m2 of fertilizer). It is randomized in the sense that the treatment is assigned randomly. This random assignment eliminates the possibility of a systematic relationship between, for example, how sunny the plot is and whether it receives fertilizer, so that the only systematic difference between the treatment and control groups is the treatment. If this experiment is properly implemented on a large enough scale, then it will yield an estimate of the causal effect on the outcome of interest (tomato production) of the treatment (applying 100 g/m2 of fertilizer).

In this book, the causal effect is defined to be the effect on an outcome of a given action or treatment, as measured in an ideal randomized controlled experiment. In such an experiment, the only systematic reason for differences in outcomes between the treatment and control groups is the treatment itself.

It is possible to imagine an ideal randomized controlled experiment to answer each of the first three questions in Section 1.1. For example, to study class size, one can imagine randomly assigning "treatments" of different class sizes to different groups of students. If the experiment is designed and executed so that the only systematic difference between the groups of students is their class size, then


in theory this experiment would estimate the effect on test scores of reducing class size, holding all else constant. The concept of an ideal randomized controlled experiment is useful because it gives a definition of a causal effect. In practice, however, it is not possible to perform ideal experiments. In fact, experiments are rare in econometrics because

often they are unethical, impossible to execute satisfactorily, or prohibitively expensive. The concept of the ideal randomized controlled experiment does, however, provide a theoretical benchmark for an econometric analysis of causal effects using actual data.
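The logic of the idealized experiment can be sketched in a short simulation. All numbers below (the baseline yield, the role of sunniness, and the true treatment effect of 0.5 kg per square meter) are invented for illustration; the point is only that random assignment makes the difference in group means an estimate of the causal effect:

```python
# Toy simulation of a randomized controlled experiment like the fertilizer
# example. All parameter values are invented for illustration.
import random

random.seed(0)
TRUE_EFFECT = 0.5  # assumed causal effect of the treatment, kg per sq. meter

treated_yields, control_yields = [], []
for _ in range(10_000):
    treated = random.random() < 0.5      # randomized assignment by "computer"
    sunniness = random.gauss(0.0, 1.0)   # unrelated to treatment, by design
    yield_kg = 3.0 + sunniness + (TRUE_EFFECT if treated else 0.0)
    (treated_yields if treated else control_yields).append(yield_kg)

# Difference in mean yields between treatment and control groups.
estimate = (sum(treated_yields) / len(treated_yields)
            - sum(control_yields) / len(control_yields))
print(round(estimate, 2))  # close to the true effect of 0.5
```

Because sunniness is assigned independently of the treatment, it washes out of the comparison of group means; with nonrandom assignment it would not, which is exactly the problem observational data pose.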

Forecasting and Causality
Although the first three questions in Section 1.1 concern causal effects, the fourth, forecasting inflation, does not. You do not need to know a causal relationship to make a good forecast. A good way to "forecast" if it is raining is to observe whether pedestrians are using umbrellas, but the act of using an umbrella does not cause it to rain.

Even though forecasting need not involve causal relationships, economic theory suggests patterns and relationships that might be useful for forecasting. As we see in Chapter 14, multiple regression analysis allows us to quantify historical relationships suggested by economic theory, to check whether those relationships have been stable over time, to make quantitative forecasts about the future, and to assess the accuracy of those forecasts.

1.3 Data: Sources and Types
In econometrics, data come from one of two sources: experiments or nonexperimental observations of the world. This book examines both experimental and nonexperimental data sets.

Experimental Versus Observational Data

Experimental data come from experiments designed to evaluate a treatment or policy or to investigate a causal effect. For example, the state of Tennessee financed a large randomized controlled experiment examining class size in the 1980s. In that experiment, which we examine in Chapter 13, thousands of students were randomly assigned to classes of different sizes for several years and were given annual standardized tests.

The Tennessee class size experiment cost millions of dollars and required the ongoing cooperation of many administrators, parents, and teachers over several years. Because real-world experiments with human subjects are difficult to administer and to control, they have flaws relative to ideal randomized controlled experiments. Moreover, in some circumstances experiments are not only expensive and difficult to administer but also unethical. (Would it be ethical to offer randomly selected teenagers inexpensive cigarettes to see how many they would buy?) Because of these financial, practical, and ethical problems, experiments in economics are rare. Instead, most economic data are obtained by observing real-world behavior.

Data obtained by observing actual behavior outside an experimental setting are called observational data. Observational data are collected using surveys, such as a telephone survey of consumers, and administrative records, such as historical records on mortgage applications maintained by lending institutions.

Observational data pose major challenges to econometric attempts to estimate causal effects, and the tools of econometrics are designed to tackle these challenges. In the real world, levels of "treatment" (the amount of fertilizer in the tomato example, the student-teacher ratio in the class size example) are not assigned at random, so it is difficult to sort out the effect of the "treatment" from other relevant factors. Much of econometrics, and much of this book, is devoted to methods for meeting the challenges encountered when real-world data are used to estimate causal effects.

Whether the data are experimental or observational, data sets come in three main types: cross-sectional data, time series data, and panel data. In this book, you will encounter all three types.


Cross-Sectional Data

Data on different entities (workers, consumers, firms, governmental units, and so forth) for a single time period are called cross-sectional data. For example, the data on test scores in California school districts are cross-sectional. Those data are for 420 entities (school districts) for a single time period (1999). In general, the number of entities on which we have observations is denoted n; for example, in the California data set, n = 420.

The California test score data set contains measurements of several different variables for each district. Some of these data are tabulated in Table 1.1. Each row lists data for a different district. For example, the average test score for the first district ("district #1") is 690.8; this is the average of the math and science test scores for all fifth graders in that district in 1999 on a standardized test (the Stanford Achievement Test). The average student-teacher ratio in that district is 17.89; that is, the number of students in district #1 divided by the number of classroom


TABLE 1.1  Selected Observations on Test Scores and Other Variables for California School Districts in 1999

Observation          District Average Test   Student-Teacher   Expenditure     Percentage of Students
(District) Number    Score (fifth grade)     Ratio             per Pupil ($)   Learning English

  1                  690.8                   17.89             $6385            0.0%
  2                  661.2                   21.52              5099            4.6
  3                  643.6                   18.70              5502           30.0
  4                  647.7                   17.36              7102            0.0
  5                  640.8                   18.67              5236           13.9
. . .
418                  645.0                   21.89              4403           24.3
419                  672.2                   20.20              4776            3.0
420                  655.8                   19.04              5993            5.0

Note: The California test score data set is described in Appendix 4.1.

teachers in district #1 is 17.89. Average expenditure per pupil in district #1 is $6385. The percentage of students in that district still learning English-that is, the percentage of students for whom English is a second language and who are not yet proficient in English-is 0%. The remaining rows present data for other districts. The order of the rows is arbitrary, and the number of the district, which is called the observation number, is an arbitrarily assigned number that organizes the data. As you can see in the table, all the variables listed vary considerably. With cross-sectional data, we can learn about relationships among variables by studying differences across people, firms, or other economic entities during a single time period.
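The cross-sectional structure just described can be sketched in code using the first few rows of Table 1.1 (only five of the n = 420 districts, and only two of the variables, are included here for brevity):

```python
# Cross-sectional data: multiple entities, one time period (1999).
# Rows are the first five districts from Table 1.1; the full set has n = 420.
districts = [
    {"district": 1, "test_score": 690.8, "student_teacher_ratio": 17.89},
    {"district": 2, "test_score": 661.2, "student_teacher_ratio": 21.52},
    {"district": 3, "test_score": 643.6, "student_teacher_ratio": 18.70},
    {"district": 4, "test_score": 647.7, "student_teacher_ratio": 17.36},
    {"district": 5, "test_score": 640.8, "student_teacher_ratio": 18.67},
]

n = len(districts)  # number of entities observed in the single time period
avg_score = sum(d["test_score"] for d in districts) / n
print(n, round(avg_score, 1))  # 5 656.8
```

Averages and differences across entities like this one are the raw material for studying cross-sectional relationships among variables.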

Time Series Data
Time series data are data for a single entity (person, firm, country) collected at multiple time periods. Our data set on the rates of inflation and unemployment in the United States is an example of a time series data set. The data set contains observations on two variables (the rates of inflation and unemployment) for a

single entity (the United States) for 183 time periods. Each time period in this data set is a quarter of a year (the first quarter is January, February, and March; the second quarter is April, May, and June; and so forth). Some observations in this data set are listed in Table 1.2. The data set begins in the second quarter of 1959, which is denoted 1959:II, and ends in the fourth quarter of 2004 (2004:IV). The number of observations (time periods) in a time series data set is denoted T. Because there are 183 quarters from 1959:II to 2004:IV, this data set contains T = 183 observations.

TABLE 1.2  Selected Observations on the Rates of Consumer Price Index (CPI) Inflation and Unemployment in the United States: Quarterly Data, 1959-2004

Observation    Date             CPI Inflation Rate            Unemployment
Number         (year:quarter)   (% per year at an annual rate)   Rate (%)

  1            1959:II          0.7%                          5.1%
  2            1959:III         2.1                           5.3
. . .
183            2004:IV          . . .                         . . .

Note: The U.S. inflation and unemployment data set is described in Appendix 14.1.

The data in each row correspond to a different time period (year and quarter). In the second quarter of 1959, for example, the rate of price inflation was 0.7% per year at an annual rate; that is, if inflation had continued for 12 months at its rate during the second quarter of 1959, the overall price level (as measured by the Consumer Price Index, CPI) would have increased by 0.7%. In the second quarter of 1959, the rate of unemployment was 5.1%; that is, 5.1% of the labor force reported that they did not have a job but were looking for work. In the third quarter of 1959, CPI inflation was 2.1%, and the rate of unemployment was 5.3%.

By tracking a single entity over time, time series data can be used to study the evolution of variables over time and to forecast future values of those variables.
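Counting the T = 183 quarterly observations is a small exercise in date arithmetic; a sketch:

```python
# Counting the quarterly time periods from 1959:II through 2004:IV.
# Quarters are (year, quarter) tuples, which compare in chronological order.
all_quarters = [(year, q) for year in range(1959, 2005) for q in range(1, 5)]
sample = [yq for yq in all_quarters if (1959, 2) <= yq <= (2004, 4)]

T = len(sample)
print(T, sample[0], sample[-1])  # 183 (1959, 2) (2004, 4)
```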

Panel Data

Panel data, also called longitudinal data, are data for multiple entities in which each entity is observed at two or more time periods. Our data on cigarette consumption and prices are an example of a panel data set, and selected variables and observations in that data set are listed in Table 1.3. The number of entities in a panel data set is denoted n, and the number of time periods is denoted T. In the cigarette data set, we have observations on n = 48 continental U.S. states (entities) for T = 11 years (time periods) from 1985 to 1995. Thus there is a total of n x T = 48 x 11 = 528 observations.

Some data from the cigarette consumption data set are listed in Table 1.3. The first block of 48 observations lists the data for each state in 1985, organized alphabetically from Alabama to Wyoming. The next block of 48 observations lists the data for 1986, and so forth, through 1995.

TABLE 1.3  Selected Observations on Cigarette Sales, Prices, and Taxes, by State and Year for U.S. States, 1985-1995

Observation                       Cigarette Sales      Average Price per Pack   Total Taxes (cigarette
Number        State        Year   (packs per capita)   (including taxes)        excise tax + sales tax)

  1           Alabama      1985   . . .                . . .                    . . .
  2           Arkansas     1985   128.5                $1.015                   $0.370
  3           Arizona      1985   . . .                . . .                    . . .
. . .
528           Wyoming      1995   . . .                . . .                    . . .

Note: The cigarette consumption data set is described in Appendix 12.1.
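The panel layout just described can be sketched in a few lines. Three state names are used here for brevity (the actual data set has all 48 continental states), so the row counts below are illustrative:

```python
# A minimal sketch of a panel (longitudinal) layout: every entity appears
# once in every time period, giving n * T rows, grouped by year as in
# Table 1.3. Three states are used for brevity; the real data set has 48.
states = ["Alabama", "Arizona", "Arkansas"]   # n = 3 entities
years = list(range(1985, 1996))               # T = 11 time periods

rows = [(state, year) for year in years for state in states]

print(len(rows))   # 33, i.e., n * T = 3 * 11
print(rows[0])     # ('Alabama', 1985)
print(rows[3])     # ('Alabama', 1986): the next year's block begins
```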

For example, cigarette sales in Arkansas in 1985 were 128.5 packs per capita (the total number of packs of cigarettes sold in Arkansas in 1985 divided by the total population of Arkansas in 1985 equals 128.5). The average price of a pack of cigarettes in Arkansas in 1985, including tax, was $1.015, of which 37 cents went to federal, state, and local taxes.

Panel data can be used to learn about economic relationships from the experiences of the many different entities in the data set and from the evolution over time of the variables for each entity.

The definitions of cross-sectional data, time series data, and panel data are summarized in Key Concept 1.1.

KEY CONCEPT 1.1  Cross-Sectional, Time Series, and Panel Data

• Cross-sectional data consist of multiple entities observed at a single time period.
• Time series data consist of a single entity observed at multiple time periods.
• Panel data (also known as longitudinal data) consist of multiple entities, where each entity is observed at two or more time periods.

Summary

1. Many decisions in business and economics require quantitative estimates of how a change in one variable affects another variable.
2. Conceptually, the way to estimate a causal effect is in an ideal randomized controlled experiment, but performing such experiments in economic applications is usually unethical, impractical, or too expensive.
3. Econometrics provides tools for estimating causal effects using either observational (nonexperimental) data or data from real-world, imperfect experiments.
4. Cross-sectional data are gathered by observing multiple entities at a single point in time; time series data are gathered by observing a single entity at multiple points in time; and panel data are gathered by observing multiple entities, each of which is observed at multiple points in time.

Key Terms

randomized controlled experiment (6)
control group (6)
treatment group (6)
causal effect (6)
experimental data (7)
observational data (8)
cross-sectional data (8)
observation number (9)
time series data (9)
panel data (11)
longitudinal data (11)

Review the Concepts

1.1  Design a hypothetical ideal randomized controlled experiment to study the effect of hours spent studying on performance on microeconomics exams. Suggest some impediments to implementing this experiment in practice.

1.2  Design a hypothetical ideal randomized controlled experiment to study the effect on highway traffic deaths of wearing seat belts. Suggest some impediments to implementing this experiment in practice.

1.3  You are asked to study the causal effect of hours spent on employee training (measured in hours per worker per week) in a manufacturing plant on the productivity of its workers (output per worker per hour). Describe:

a. an ideal randomized controlled experiment to measure this causal effect.
b. an observational cross-sectional data set with which you could study this effect.
c. an observational time series data set for studying this effect.
d. an observational panel data set for studying this effect.

CHAPTER 2

Review of Probability

This chapter reviews the core ideas of the theory of probability that are needed to understand regression analysis and econometrics. We assume that you have taken an introductory course in probability and statistics. If your knowledge of probability is stale, you should refresh it by reading this chapter. If you feel confident with the material, you still should skim the chapter and the terms and concepts at the end to make sure you are familiar with the ideas and notation.

Most aspects of the world around us have an element of randomness. The theory of probability provides mathematical tools for quantifying and describing this randomness. Section 2.1 reviews probability distributions for a single random variable, and Section 2.2 covers the mathematical expectation, mean, and variance of a single random variable. Most of the interesting problems in economics involve more than one variable, and Section 2.3 introduces the basic elements of probability theory for two random variables. Section 2.4 discusses three special probability distributions that play a central role in statistics and econometrics: the normal, chi-squared, and F distributions.

The final two sections of this chapter focus on a specific source of randomness of central importance in econometrics: the randomness that arises by randomly drawing a sample of data from a larger population. For example, suppose you survey ten recent college graduates selected at random, record (or "observe") their earnings, and compute the average earnings using these ten data points (or "observations"). Because you chose the sample at random, you could have chosen ten different graduates by pure random chance; had you done so, you would have observed ten different earnings and you would have computed a different sample average. Because the average earnings vary from one randomly chosen sample to the next, the sample average is itself a random variable. Therefore, the sample average has a probability distribution, which is referred to as its sampling distribution, because this distribution describes the different possible values of the sample average that might have occurred had a different sample been drawn.

Section 2.5 discusses random sampling and the sampling distribution of the sample average. This sampling distribution is, in general, complicated. When the sample size is sufficiently large, however, the sampling distribution of the sample average is approximately normal, a result known as the central limit theorem, which is discussed in Section 2.6.
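The idea that the sample average varies from sample to sample can be illustrated with a short simulation. The population of "earnings" below is invented purely for illustration:

```python
# Different random samples give different sample averages, so the sample
# average is itself a random variable. Population values here are invented.
import random

random.seed(1)
population = [random.gauss(50_000, 10_000) for _ in range(100_000)]  # earnings

sample_averages = []
for _ in range(5):
    sample = random.sample(population, 10)  # survey ten "graduates" at random
    sample_averages.append(sum(sample) / len(sample))

# Five draws, five different sample averages: one realization each from
# the sampling distribution of the sample average.
print([round(a) for a in sample_averages])
```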

2.1 Random Variables and Probability Distributions

Probabilities, the Sample Space, and Random Variables

Probabilities and outcomes. The gender of the next new person you meet, your grade on an exam, and the number of times your computer will crash while you are writing a term paper all have an element of chance or randomness. In each of these examples, there is something not yet known that is eventually revealed.

The mutually exclusive potential results of a random process are called the outcomes. For example, your computer might never crash, it might crash once, it might crash twice, and so on. Only one of these outcomes will actually occur (the outcomes are mutually exclusive), and the outcomes need not be equally likely.

The probability of an outcome is the proportion of the time that the outcome occurs in the long run. If the probability of your computer not crashing while you are writing a term paper is 80%, then over the course of writing many term papers you will complete 80% without a crash.

The sample space and events. The set of all possible outcomes is called the sample space. An event is a subset of the sample space; that is, an event is a set of one or more outcomes. The event "my computer will crash no more than once" is the set consisting of two outcomes: "no crashes" and "one crash."

Random variables. A random variable is a numerical summary of a random outcome. The number of times your computer crashes while you are writing a term paper is random and takes on a numerical value, so it is a random variable.

Some random variables are discrete and some are continuous. As their names suggest, a discrete random variable takes on only a discrete set of values, like 0, 1, 2, . . . , whereas a continuous random variable takes on a continuum of possible values.

96 0. and the probability bution is plotted in Figure 2. Pr(M variable that M M is = 0.90 0. An example of a probability distribution for M is given in the second your computer crashes row of Table 2. 6%. Cumulative probability distribution. is the probability The cumulative probability distribution is less than or equal to a particular probability distribution of of at mo t one crash. For example.06 0. According to this distribution. value. and 1 %. These probabilities Probabilities of events.06 = 0. These probabilities sum to l. the probability is. The probability distribution the list of probabilities denoted Pr(M of each possible outcome: = 0). of no crashes is 80%. The last row of Table 2. The probability of an event can be c mputed of the constituent = from the For example.10 + 0. and so forth. r 16%. 1).16. probability distribution. in this distribution. the probability of one crash is 10%. respecdistrisum to 100%. or four crashes by hand.1 gives the cumulative the random variable the probability Pr(M :s. the probability of the event of one or two outc meso That is. of no crashes (80%) and that the random variable M. is the probability that each while you of the random The probability crashes. is 90%. let M be tbe number of times your computer crashes are writing a term paper. crashes is the sum of the probabilities Pr(M = 1.80 0. This probability four times.03 0.16 CHAPTER 2 Review 01 Probability Probability Distribution of a Discrete Random Variable Probability distribution. three. if of no computer = 1. For example.10 0. OD!II Probability of Your Computer Crashing M Times Outcome (number of crashes) 0 Probability distribution Cumulative probability distribution 1 2 3 4 0.00 .1.99 om 1.80 0. or M = 2) = Pr(M = 1) + Pr(M = 2) 0.1. 3%. which is the sum of the probabilities of one crash (10%).) is the probability of a single computer crash. you will quit and write the paper of two. 
The probability distribution of a discrete random variable is the list of all possible values of the variable and the probability value will occur. tively.
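The calculations in Table 2.1 are easy to mechanize. The following sketch (illustrative code, not part of the text) rebuilds the cumulative distribution and an event probability from the probability distribution of M:

```python
# Probability distribution of M from Table 2.1 (crash counts 0 through 4).
from itertools import accumulate

pmf = [0.80, 0.10, 0.06, 0.03, 0.01]   # Pr(M = m) for m = 0, ..., 4
assert abs(sum(pmf) - 1.0) < 1e-12     # the probabilities sum to 1

# Cumulative probability distribution Pr(M <= m), the last row of Table 2.1.
cdf = [round(c, 2) for c in accumulate(pmf)]
print(cdf)  # [0.8, 0.9, 0.96, 0.99, 1.0]

# Probability of an event: Pr(M = 1 or M = 2) = Pr(M = 1) + Pr(M = 2).
print(round(pmf[1] + pmf[2], 2))  # 0.16
```

Running totals with `accumulate` mirror the definition of a cumulative distribution: each entry adds one more outcome's probability to the previous total.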

8. so the probabilityof 0 computer crashes is 80%.1 0.6 0.L.1 Random Variables and Probability Distributions 17 cmmIJD Probability Distribution of the Number of Computer Crashes Probability The height of each bar is the probabilitythat the computer crashesthe indicated number of times. 0. The Bernoulli distribution. A cumulative to as a cumulative dis- An important special case of a discrete random variable is when the rand m variable is binary. For example.1) is the Bernoulli distribution. G = (2.1. The height of the second bar is 0. or a cumulative distribution.The probability distribution in Equation (2.The ut ornes of G and their probabilitie thu are J with probability { 0 with probability p 1 . a c.2. where = 0 indicates that the person is male and G = I indicates that she i female.d.0 o 2 3 4 Number of crashes pr bability distribution is also referred Iribution function.so the probabilityof 1 computer crash is 10%.8 0.5 height of the firstbar is 0. . and its probability di tributi n is called the Bernoulli distribution. let be the gender of the next new per on you meet.p. binary random variable is called a Bernoulli random variable (in honor f the evententhentury wis mathematician and cienti t J cob Bernoulli).1 ) where pis the probability of the next new person you meet being a \ ornan. that is. The 0. and so fonh for the other bars. the outcomes are 0 or I.7 0.

Probability Distribution of a Continuous Random Variable

Cumulative probability distribution. The cumulative probability distribution for a continuous variable is defined just as it is for a discrete random variable; that is, the cumulative probability distribution of a continuous random variable is the probability that the random variable is less than or equal to a particular value.

For example, consider a student who drives from home to school. This student's commuting time can take on a continuum of values and, because it depends on random factors such as the weather and traffic conditions, it is natural to treat it as a continuous random variable. Figure 2.2a plots a hypothetical cumulative distribution of commuting times. For example, the probability that the commute takes less than 15 minutes is 20%, and the probability that it takes less than 20 minutes is 78%.

Probability density function. Because a continuous random variable can take on a continuum of possible values, the probability distribution used for discrete variables, which lists the probability of each possible value of the random variable, is not suitable for continuous variables. Instead, the probability is summarized by the probability density function. The area under the probability density function between any two points is the probability that the random variable falls between those two points. A probability density function is also called a p.d.f., a density function, or simply a density.

Figure 2.2b plots the probability density function of commuting times corresponding to the cumulative distribution in Figure 2.2a. The probability that the commute takes between 15 and 20 minutes is given by the area under the p.d.f. between 15 minutes and 20 minutes, which is 0.58, or 58%. Equivalently, this probability can be seen on the cumulative distribution in Figure 2.2a as the difference between the probability that the commute is less than 20 minutes (78%) and the probability that it is less than 15 minutes (20%). Thus the probability density function and the cumulative probability distribution show the same information in different formats.

2.2 Expected Values, Mean, and Variance

The Expected Value of a Random Variable

Expected value. The expected value of a random variable Y, denoted E(Y), is the long-run average value of the random variable over many repeated trials or occurrences. The expected value of a discrete random variable is computed as a weighted average of the possible outcomes of that random variable, where the weights are the probabilities of the outcomes. The expected value of Y is also called the expectation of Y or the mean of Y and is denoted μY.

Figure 2.2  Cumulative Probability Distribution and Probability Density Functions of Commuting Time. Figure 2.2a shows the cumulative probability distribution (or c.d.f.) of commuting times. The probability that a commuting time is less than 15 minutes is 0.20 (20%), and the probability that it is less than 20 minutes is 0.78 (78%). Figure 2.2b shows the probability density function (or p.d.f.) of commuting times. Probabilities are given by areas under the p.d.f. The probability that a commuting time is between 15 and 20 minutes is 0.58 (58%) and is given by the area under the curve between 15 and 20 minutes.
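The arithmetic linking the two panels of Figure 2.2 can be sketched in a few lines (illustrative code; the two cumulative probabilities are the ones shown in the figure):

```python
# Reading probabilities off a cumulative distribution: for any a < b,
# Pr(a < T <= b) = F(b) - F(a).  Values are those shown in Figure 2.2.
F = {15: 0.20, 20: 0.78}     # cumulative probabilities at 15 and 20 minutes

pr_between = F[20] - F[15]   # area under the p.d.f. between 15 and 20 minutes
print(round(pr_between, 2))  # 0.58
```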

For example, consider the number of computer crashes M with the probability distribution given in Table 2.1. The expected value of M is the average number of crashes over many term papers, weighted by the frequency with which a crash of a given size occurs. Accordingly,

E(M) = 0 × 0.80 + 1 × 0.10 + 2 × 0.06 + 3 × 0.03 + 4 × 0.01 = 0.35.  (2.2)

That is, the expected number of computer crashes while writing a term paper is 0.35. Of course, the actual number of crashes must always be an integer; it makes no sense to say that the computer crashed 0.35 times while writing a particular term paper! Rather, the calculation in Equation (2.2) means that the average number of crashes over many such term papers is 0.35.

As a second example, suppose you loan a friend $100 at 10% interest. If the loan is repaid, you get $110 (the principal of $100 plus interest of $10), but there is a risk of 1% that your friend will default and you will get nothing at all. Thus the amount you are repaid is a random variable that equals $110 with probability 0.99 and equals $0 with probability 0.01. Over many such loans, 99% of the time you would be paid back $110, but 1% of the time you would get nothing, so on average you would be repaid $110 × 0.99 + $0 × 0.01 = $108.90. Thus the expected value of your repayment (or the "mean repayment") is $108.90.

The formula for the expected value of a discrete random variable Y that can take on k different values is given as Key Concept 2.1. (Key Concept 2.1 uses "summation notation," which is reviewed in Exercise 2.25.)

Key Concept 2.1  Expected Value and the Mean

Suppose the random variable Y takes on k possible values, y1, . . . , yk, where y1 denotes the first value, y2 denotes the second value, and so forth, and that the probability that Y takes on y1 is p1, the probability that Y takes on y2 is p2, and so forth. The expected value of Y, denoted E(Y), is

E(Y) = y1p1 + y2p2 + · · · + ykpk = Σ(i=1 to k) yi pi,  (2.3)

where the notation Σ(i=1 to k) yi pi means "the sum of yi pi for i running from 1 to k." The expected value of Y is also called the mean of Y or the expectation of Y and is denoted μY.
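Key Concept 2.1 translates directly into code. This sketch (illustrative, not from the text) reproduces the two expected values computed above:

```python
# Expected value of a discrete random variable: E(Y) = sum_i y_i * p_i.
def expected_value(values, probs):
    return sum(y * p for y, p in zip(values, probs))

# Number of computer crashes M, from Table 2.1 and Equation (2.2).
e_m = expected_value([0, 1, 2, 3, 4], [0.80, 0.10, 0.06, 0.03, 0.01])
print(round(e_m, 2))  # 0.35

# Loan repayment: $110 with probability 0.99, $0 with probability 0.01.
e_repay = expected_value([110, 0], [0.99, 0.01])
print(round(e_repay, 2))  # 108.9
```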

Expected value of a Bernoulli random variable. An important special case of the general formula in Key Concept 2.1 is the mean of a Bernoulli random variable. Let G be the Bernoulli random variable with the probability distribution in Equation (2.1). The expected value of G is

E(G) = 1 × p + 0 × (1 − p) = p.  (2.4)

Thus the expected value of a Bernoulli random variable is p, the probability that it takes on the value 1.

Expected value of a continuous random variable. The expected value of a continuous random variable is also the probability-weighted average of the possible outcomes of the random variable. Because a continuous random variable can take on a continuum of possible values, the formal mathematical definition of its expectation involves calculus, and its definition is given in Appendix 17.1.

The Standard Deviation and Variance

The variance and standard deviation measure the dispersion or the "spread" of a probability distribution. The variance of a random variable Y, denoted var(Y), is the expected value of the square of the deviation of Y from its mean: var(Y) = E[(Y − μY)²].

Because the variance involves the square of Y, the units of the variance are the units of the square of Y, which makes the variance awkward to interpret. It is therefore common to measure the spread by the standard deviation, which is the square root of the variance and is denoted σY. The standard deviation has the same units as Y. These definitions are summarized in Key Concept 2.2.

Key Concept 2.2  Variance and Standard Deviation

The variance of the discrete random variable Y, denoted σ²Y, is

σ²Y = var(Y) = E[(Y − μY)²] = Σ(i=1 to k) (yi − μY)² pi.  (2.5)

The standard deviation of Y is σY, the square root of the variance. The units of the standard deviation are the same as the units of Y.

For example, the variance of the number of computer crashes M is the probability-weighted average of the squared difference between M and its mean, 0.35:

var(M) = (0 − 0.35)² × 0.80 + (1 − 0.35)² × 0.10 + (2 − 0.35)² × 0.06 + (3 − 0.35)² × 0.03 + (4 − 0.35)² × 0.01 = 0.6475.  (2.6)

The standard deviation of M is the square root of the variance, so σM = √0.6475 ≅ 0.80.

Variance of a Bernoulli random variable. The mean of the Bernoulli random variable G with the probability distribution in Equation (2.1) is μG = p [Equation (2.4)], so its variance is

var(G) = σ²G = (0 − p)² × (1 − p) + (1 − p)² × p = p(1 − p).  (2.7)

Thus the standard deviation of a Bernoulli random variable is σG = √(p(1 − p)).

Mean and Variance of a Linear Function of a Random Variable

This section discusses random variables (say, X and Y) that are related by a linear function. For example, consider an income tax scheme under which a worker is taxed at a rate of 20% on his or her earnings and then given a (tax-free) grant of $2000. Under this tax scheme, after-tax earnings Y are related to pre-tax earnings X by the equation

Y = 2000 + 0.8X.  (2.8)

That is, after-tax earnings Y is 80% of pre-tax earnings X, plus $2000.

Suppose an individual's pre-tax earnings next year are a random variable with mean μX and variance σ²X. Because pre-tax earnings are random, so are after-tax earnings. What are the mean and standard deviation of her after-tax earnings under this tax? After taxes, her earnings are 80% of the original pre-tax earnings, plus $2000. Thus the expected value of her after-tax earnings is

E(Y) = μY = 2000 + 0.8μX.  (2.9)

The variance of after-tax earnings is the expected value of (Y − μY)². Because Y = 2000 + 0.8X, Y − μY = 2000 + 0.8X − (2000 + 0.8μX) = 0.8(X − μX).

Thus E[(Y − μY)²] = E{[0.8(X − μX)]²} = 0.64E[(X − μX)²], so var(Y) = 0.64 var(X) and the standard deviation of Y is

σY = 0.8σX.  (2.10)

That is, the standard deviation of the distribution of her after-tax earnings is 80% of the standard deviation of the distribution of pre-tax earnings.

This analysis can be generalized so that Y depends on X with an intercept a (instead of $2000) and a slope b (instead of 0.8), so that

Y = a + bX.  (2.11)

Then the mean and variance of Y are

μY = a + bμX and  (2.12)

σ²Y = b²σ²X,  (2.13)

and the standard deviation of Y is σY = bσX. The expressions in Equations (2.9) and (2.10) are applications of the more general formulas in Equations (2.12) and (2.13) with a = 2000 and b = 0.8.

Other Measures of the Shape of a Distribution

The mean and standard deviation measure two important features of a distribution: its center (the mean) and its spread (the standard deviation). This section discusses measures of two other features of a distribution: the skewness, which measures the lack of symmetry of a distribution, and the kurtosis, which measures how thick, or "heavy," are its tails. The mean, variance, skewness, and kurtosis are all based on what are called the moments of a distribution.

Skewness. Figure 2.3 plots four distributions, two which are symmetric (Figures 2.3a and 2.3b) and two which are not (Figures 2.3c and 2.3d). Visually, the distribution in Figure 2.3d appears to deviate more from symmetry than does the distribution in Figure 2.3c. The skewness of a distribution provides a mathematical way to describe how much a distribution deviates from symmetry. The skewness of the distribution of a random variable Y is

Skewness = E[(Y − μY)³] / σ³Y,  (2.14)

where σY is the standard deviation of Y. For a symmetric distribution, a value of Y a given amount above its mean is just as likely as a value of Y the same amount below its mean. If so, then positive values of (Y − μY)³ will be offset on average (in expectation) by equally likely negative values. Thus, for a symmetric distribution, E[(Y − μY)³] = 0; the skewness of a symmetric distribution is zero.

Figure 2.3  Four Distributions with Different Skewness and Kurtosis. All of these distributions have a mean of 0 and a variance of 1. The distributions with skewness of 0 (a and b) are symmetric; the distributions with nonzero skewness (c and d) are not symmetric. The distributions with kurtosis exceeding 3 (b, c, and d) have heavy tails. (a) Skewness = 0, kurtosis = 3; (b) skewness = 0, kurtosis = 20; (c) skewness = −0.1, kurtosis = 5; (d) skewness = 0.6, kurtosis = 5.

If a distribution is not symmetric, then a positive value of (Y − μY)³ generally is not offset on average by an equally likely negative value, so the skewness is nonzero for a distribution that is not symmetric. Dividing by σ³Y in the denominator of Equation (2.14) cancels the units of Y³ in the numerator, so the skewness is unit free; in other words, changing the units of Y does not change its skewness. Below each of the four distributions in Figure 2.3 is its skewness. If a distribution has a long right tail, positive values of (Y − μY)³ are not fully offset by negative values, and the skewness is positive. If a distribution has a long left tail, its skewness is negative.

Kurtosis. The kurtosis of a distribution is a measure of how much mass is in its tails and, therefore, is a measure of how much of the variance of Y arises from extreme values. An extreme value of Y is called an outlier. The greater the kurtosis of a distribution, the more likely are outliers. The kurtosis of the distribution of Y is

Kurtosis = E[(Y − μY)⁴] / σ⁴Y.  (2.15)

If a distribution has a large amount of mass in its tails, then some extreme departures of Y from its mean are likely, and these very large values will lead, on average (in expectation), to large values of (Y − μY)⁴. Thus, for a distribution with a large amount of mass in its tails, the kurtosis will be large. Because (Y − μY)⁴ cannot be negative, the kurtosis cannot be negative.

The kurtosis of a normally distributed random variable is 3, so a random variable with kurtosis exceeding 3 has more mass in its tails than a normal random variable. A distribution with kurtosis exceeding 3 is called leptokurtic or, more simply, heavy-tailed. Like skewness, the kurtosis is unit free, so changing the units of Y does not change its kurtosis. Below each of the four distributions in Figure 2.3 is its kurtosis; the distributions in Figures 2.3b–d are heavy-tailed.

Moments. The mean of Y, E(Y), is also called the first moment of Y, and the expected value of the square of Y, E(Y²), is called the second moment of Y. In general, the expected value of Y raised to the power r, E(Y^r), is called the rth moment of the random variable Y. The skewness is a function of the first, second, and third moments of Y, and the kurtosis is a function of the first through fourth moments of Y.
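Skewness and kurtosis can be computed directly from the central moments just described. As an illustration not carried out in the text, here is the calculation for the crash-count distribution of Table 2.1, which has a long right tail:

```python
# Skewness [Equation (2.14)] and kurtosis [Equation (2.15)] of a discrete
# random variable, computed from its probability distribution.
def shape(values, probs):
    mu = sum(y * p for y, p in zip(values, probs))
    def central(r):  # r-th central moment E[(Y - mu)^r]
        return sum((y - mu) ** r * p for y, p in zip(values, probs))
    sigma = central(2) ** 0.5
    return central(3) / sigma ** 3, central(4) / sigma ** 4

skew, kurt = shape([0, 1, 2, 3, 4], [0.80, 0.10, 0.06, 0.03, 0.01])
print(round(skew, 2))  # positive: the distribution has a long right tail
print(round(kurt, 2))  # exceeds 3: heavier tails than a normal random variable
```

Both numbers are unit free: rescaling the values (say, measuring crashes in dozens) leaves the results unchanged, because the powers of σ in the denominators cancel the units.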

2.3 Two Random Variables

Most of the interesting questions in economics involve two or more variables. Are college graduates more likely to have a job than nongraduates? How does the distribution of income for women compare to that for men? These questions concern the distribution of two random variables, considered together (education and employment status in the first example, income and gender in the second). Answering such questions requires an understanding of the concepts of joint, marginal, and conditional probability distributions.

Joint and Marginal Distributions

Joint distribution. The joint probability distribution of two discrete random variables, say X and Y, is the probability that the random variables simultaneously take on certain values, say x and y. The probabilities of all possible (x, y) combinations sum to 1. The joint probability distribution can be written as the function Pr(X = x, Y = y).

For example, weather conditions (whether or not it is raining) affect the commuting time of the student commuter in Section 2.1. Let Y be a binary random variable that equals 1 if the commute is short (less than 20 minutes) and equals 0 otherwise, and let X be a binary random variable that equals 0 if it is raining and 1 if not. Between these two random variables, there are four possible outcomes: it rains and the commute is long (X = 0, Y = 0); rain and short commute (X = 0, Y = 1); no rain and long commute (X = 1, Y = 0); and no rain and short commute (X = 1, Y = 1). The joint probability distribution is the frequency with which each of these four outcomes occurs over many repeated commutes.

An example of a joint distribution of these two variables is given in Table 2.2. According to this distribution, over many commutes, 15% of the days have rain and a long commute; that is, the probability of a long, rainy commute is 15%, or Pr(X = 0, Y = 0) = 0.15. Also, Pr(X = 0, Y = 1) = 0.15, Pr(X = 1, Y = 0) = 0.07, and Pr(X = 1, Y = 1) = 0.63.

Table 2.2  Joint Distribution of Weather Conditions and Commuting Times

                         Rain (X = 0)    No Rain (X = 1)    Total
Long commute (Y = 0)         0.15             0.07           0.22
Short commute (Y = 1)        0.15             0.63           0.78
Total                        0.30             0.70           1.00

These four possible outcomes are mutually exclusive and constitute the sample space, so the four probabilities sum to 1.

Marginal probability distribution. The marginal probability distribution of a random variable Y is just another name for its probability distribution. This term is used to distinguish the distribution of Y alone (the marginal distribution) from the joint distribution of Y and another random variable.

The marginal distribution of Y can be computed from the joint distribution of X and Y by adding up the probabilities of all possible outcomes for which Y takes on a specified value. If X can take on l different values x1, . . . , xl, then the marginal probability that Y takes on the value y is

Pr(Y = y) = Σ(i=1 to l) Pr(X = xi, Y = y).  (2.16)

For example, in Table 2.2, the probability of a long rainy commute is 15% and the probability of a long commute with no rain is 7%, so the probability of a long commute (rainy or not) is 22%. The marginal distribution of commuting times is given in the final column of Table 2.2. Similarly, the marginal probability that it will rain is 30%, as shown in the final row of Table 2.2; that is, over many commutes, it rains 30% of the time.

Conditional Distributions

Conditional distribution. The distribution of a random variable Y conditional on another random variable X taking on a specific value is called the conditional distribution of Y given X. The conditional probability that Y takes on the value y when X takes on the value x is written Pr(Y = y | X = x).

For example, what is the probability of a long commute (Y = 0) if you know it is raining (X = 0)? From Table 2.2, the joint probability of a rainy short commute is 15% and the joint probability of a rainy long commute is 15%, so if it is raining, a long commute and a short commute are equally likely. Thus the probability of a long commute (Y = 0), conditional on it being rainy (X = 0), is 50%, or Pr(Y = 0 | X = 0) = 0.50. Equivalently, the marginal probability of rain is 30%; of this 30% of commutes, 50% of the time the commute is long (0.15/0.30).

In general, the conditional distribution of Y given X = x is

Pr(Y = y | X = x) = Pr(X = x, Y = y) / Pr(X = x).  (2.17)
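Equations (2.16) and (2.17) can be checked mechanically against Table 2.2. A small illustrative sketch (not code from the text):

```python
# Joint distribution of Table 2.2: X = 0 rain / 1 no rain, Y = 0 long / 1 short.
joint = {(0, 0): 0.15, (0, 1): 0.15, (1, 0): 0.07, (1, 1): 0.63}

# Marginal distributions [Equation (2.16)]: sum the joint over the other variable.
pr_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}
pr_x = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
print(round(pr_y[0], 2), round(pr_x[0], 2))  # 0.22 0.3

# Conditional distribution [Equation (2.17)]: Pr(Y = 0 | X = 0).
print(round(joint[(0, 0)] / pr_x[0], 2))  # 0.5
```

The marginal of Y reproduces the final column of Table 2.2, the marginal of X reproduces the final row, and the conditional probability matches the 50% chance of a long commute on rainy days computed in the text.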

For example, applying Equation (2.17) to Table 2.2, Pr(Y = 0 | X = 0) = Pr(X = 0, Y = 0)/Pr(X = 0) = 0.15/0.30 = 0.50.

As a second example, consider a modification of the crashing computer example. Suppose you use a computer in the library to type your term paper, and the librarian randomly assigns you a computer from those available, half of which are new and half of which are old. Because you are randomly assigned to a computer, the age of the computer you use, A (= 1 if the computer is new, = 0 if it is old), is a random variable. Suppose the joint distribution of the random variables M and A is given in Part A of Table 2.3. Then the conditional distribution of computer crashes, given the age of the computer, is given in Part B of the table. For example, the joint probability that M = 0 and A = 0 is 0.35; because half the computers are old, the conditional probability of no crashes, given that you are using an old computer, is Pr(M = 0 | A = 0) = Pr(M = 0, A = 0)/Pr(A = 0) = 0.35/0.50 = 0.70, or 70%. In contrast, the conditional probability of no crashes given that you are assigned a new computer is 90%. According to the conditional distributions in Part B of Table 2.3, the newer computers are less likely to crash than the old ones; for example, the probability of three crashes is 5% with an old computer but 1% with a new computer.

Table 2.3  Joint and Conditional Distributions of Computer Crashes (M) and Computer Age (A)

A. Joint Distribution
                       M = 0    M = 1    M = 2    M = 3    M = 4    Total
Old computer (A = 0)    0.35    0.065    0.05     0.025    0.01     0.50
New computer (A = 1)    0.45    0.035    0.01     0.005    0.00     0.50
Total                   0.80    0.10     0.06     0.03     0.01     1.00

B. Conditional Distributions of M given A
                       M = 0    M = 1    M = 2    M = 3    M = 4    Total
Pr(M | A = 0)           0.70    0.13     0.10     0.05     0.02     1.00
Pr(M | A = 1)           0.90    0.07     0.02     0.01     0.00     1.00

Conditional expectation. The conditional expectation of Y given X, also called the conditional mean of Y given X, is the mean of the conditional distribution of Y given X. That is, the conditional expectation is the expected value of Y, computed using the conditional distribution of Y given X.

If Y takes on k values y1, . . . , yk, then the conditional mean of Y given X = x is

E(Y | X = x) = Σ(i=1 to k) yi Pr(Y = yi | X = x).  (2.18)

For example, based on the conditional distributions in Table 2.3, the expected number of computer crashes, given that the computer is old, is E(M | A = 0) = 0 × 0.70 + 1 × 0.13 + 2 × 0.10 + 3 × 0.05 + 4 × 0.02 = 0.56. The expected number of computer crashes, given that the computer is new, is E(M | A = 1) = 0.14, less than for the old computers. The conditional expectation of Y given X = x is just the mean value of Y when X = x: in the example of Table 2.3, the mean number of crashes is 0.56 for old computers and 0.14 for new computers, so those are the conditional expectations of M given that the computer is old and new, respectively.

The law of iterated expectations. The mean of Y is the weighted average of the conditional expectation of Y given X, weighted by the probability distribution of X. For example, the mean height of adults is the weighted average of the mean height of men and the mean height of women, weighted by the proportions of men and women. Stated mathematically, if X takes on the l values x1, . . . , xl, then

E(Y) = Σ(i=1 to l) E(Y | X = xi) Pr(X = xi).  (2.19)

Equation (2.19) follows from Equations (2.18) and (2.17) (see Exercise 2.19). Stated differently, the expectation of Y is the expectation of the conditional expectation of Y given X,

E(Y) = E[E(Y | X)],  (2.20)

where the inner expectation on the right-hand side of Equation (2.20) is computed using the conditional distribution of Y given X and the outer expectation is computed using the marginal distribution of X. Equation (2.20) is known as the law of iterated expectations.
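The conditional means just computed, together with the law of iterated expectations, can be verified numerically from Table 2.3 (an illustrative sketch, not code from the text):

```python
# Conditional expectation [Equation (2.18)] and the law of iterated
# expectations [Equation (2.19)], using the distributions of Table 2.3.
m_vals = [0, 1, 2, 3, 4]
pr_m_old = [0.70, 0.13, 0.10, 0.05, 0.02]   # Pr(M | A = 0)
pr_m_new = [0.90, 0.07, 0.02, 0.01, 0.00]   # Pr(M | A = 1)
pr_a = [0.50, 0.50]                         # marginal distribution of A

def cond_mean(probs):
    return sum(m * p for m, p in zip(m_vals, probs))

e_m_old, e_m_new = cond_mean(pr_m_old), cond_mean(pr_m_new)
print(round(e_m_old, 2), round(e_m_new, 2))  # 0.56 0.14

# E(M) = E(M | A = 0) Pr(A = 0) + E(M | A = 1) Pr(A = 1)
print(round(e_m_old * pr_a[0] + e_m_new * pr_a[1], 2))  # 0.35
```

The weighted average of the two conditional means recovers the mean of the marginal distribution of M from Equation (2.2), exactly as the law of iterated expectations says it must.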

For example, the mean number of crashes M is the weighted average of the conditional expectation of M given that the computer is old and the conditional expectation of M given that it is new, weighted by the proportions of computers that are old and new: E(M) = E(M | A = 0) × Pr(A = 0) + E(M | A = 1) × Pr(A = 1) = 0.56 × 0.50 + 0.14 × 0.50 = 0.35. This is the mean of the marginal distribution of M, as calculated in Equation (2.2).

The law of iterated expectations implies that if the conditional mean of Y given X is zero, then the mean of Y is zero. This is an immediate consequence of Equation (2.20): if E(Y | X) = 0, then E(Y) = E[E(Y | X)] = E[0] = 0. Said differently, if the mean of Y given X is zero, then it must be that the probability-weighted average of these conditional means is zero; that is, the mean of Y must be zero.

The law of iterated expectations also applies to expectations that are conditional on multiple random variables. For example, let X, Y, and Z be random variables that are jointly distributed. Then the law of iterated expectations says that E(Y) = E[E(Y | X, Z)], where E(Y | X, Z) is the conditional expectation of Y given both X and Z. For example, in the computer crash illustration of Table 2.3, let P denote the number of programs installed on the computer; then E(M | A, P) is the expected number of crashes for a computer with age A that has P programs installed. The expected number of crashes overall, E(M), is the weighted average of the expected number of crashes for a computer with age A and number of programs P, weighted by the proportion of computers with that value of both A and P. Exercise 2.20 provides some additional properties of conditional expectations with multiple variables.

Conditional variance. The variance of Y conditional on X is the variance of the conditional distribution of Y given X. Stated mathematically, the conditional variance of Y given X = x is

var(Y | X = x) = Σ(i=1 to k) [yi − E(Y | X = x)]² Pr(Y = yi | X = x).  (2.21)

For example, the conditional variance of the number of crashes given that the computer is old is var(M | A = 0) = (0 − 0.56)² × 0.70 + (1 − 0.56)² × 0.13 + (2 − 0.56)² × 0.10 + (3 − 0.56)² × 0.05 + (4 − 0.56)² × 0.02 ≅ 0.99. The standard deviation of the conditional distribution of M given that A = 0 is thus √0.99 = 0.99. The conditional variance of M given that A = 1 is the variance of the distribution in the second row of Panel B of Table 2.3, which is 0.22, so the standard deviation of M for new computers is √0.22 = 0.47. For the conditional distributions in Table 2.3, the expected number of crashes for new computers (0.14) is less than that for old computers (0.56), and the spread of the distribution of the number of crashes, as measured by the conditional standard deviation, is smaller for new computers (0.47) than for old computers (0.99).
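Equation (2.21) and the conditional standard deviations quoted above can be reproduced from the same distributions (illustrative code, not from the text):

```python
# Conditional variance [Equation (2.21)] of M given A, from Table 2.3.
m_vals = [0, 1, 2, 3, 4]
pr_m_old = [0.70, 0.13, 0.10, 0.05, 0.02]   # Pr(M | A = 0)
pr_m_new = [0.90, 0.07, 0.02, 0.01, 0.00]   # Pr(M | A = 1)

def cond_var(probs):
    mu = sum(m * p for m, p in zip(m_vals, probs))
    return sum((m - mu) ** 2 * p for m, p in zip(m_vals, probs))

# Conditional standard deviations: larger spread for old computers than new.
print(round(cond_var(pr_m_old) ** 0.5, 2))  # 0.99
print(round(cond_var(pr_m_new) ** 0.5, 2))  # 0.47
```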

Covariance and Correlation Covariance. Prey = ylX = x) = Prey ~ y) (independence of X and V). then the covariance is negative.!LX)( Y . and vice versa).24) To interpret this formula. Finally. and wben X is less than its mean (so that X -!LX < 0). so the covariance is positive. Y~ y) ~ Pr(X = x)Pr(Y~ y).23) That is. (2. Specifically.!Ly is positive).22) Substituting Equation (2.!Ly)] = L: L:(Xj-!LX)(Yi-!Ly)Pr(X=xj. If X and Yare independent.!LY)). tben the covariance is given by the formula cov(X. .!Lx) X (Y . One measure of the extent to which two random variables move together is their covariance. The covariance is denoted cov(X. That is. i=1 j=l k I Y=y.!Ly) tends to be positive.22) into Equation (2. then the covariance is zero (see Exercise 2.17) gives an alternative expression for independent random variables in terms of their joint distribution. then Y tends be greater than its mean (so that Y . the joint distribution of two independent random variables is tbe product of their marginal distributions. X and Yare independently distributed if.19). suppose that when X is greater than its mean (so that X -!Lx is positive). X and Yare independent if the conditional distribution of Y given X equals the marginal distribution of Y. Y) or tr Xy. Y) = <Txy = E[(X . where !Lx is the mean of X and !Ly is the mean of y.2. If X can take on I values and Y can take on k values. the product (X .!Ly < 0). for aUvalues of x and y.!Lx)(Y . In contrast. if X and Yare independent. then Pr(X =x. then Y tends to be less than its mean (so that Y . In both cases.). (2. if knowing the value of one of the variables provides no information about the otber. (2. or independent. if X and Y tend to move in opposite directions (so that X is large when Y is small. The covariance between X and Y is the expected value E[(X .3 TwoRandomVariables 31 Independence Two random variables X and Yare independently distributed.

The Mean and Variance of Sums of Random Variables The mean of the sum of two random variables.X) =0. the units cancel and the corrclation is unities . as pr vcn in Appcndix 2.!Joy)(X .. Y) . is the Sum of their means: E(X+ Y) = E(X) + E(Y) = !Jox+ !Joy. that is. t .0 . "units" problem can rna ke numen 'cal values of the covarian c diffi ult t 1I11crpre!.!Jox)] = E( YX). however.. h . .2 ). Y) Vvar(X) var(Y) (fXY = (fXUy' (2. first subtract off their means. so cov(Y. fY . then Y and X are uncorrelated. alternative measure of dependcnce between. . . X) = E[(Y . (2. then the conditional mean of Y given X does not depend on X. Correlation. .then cov( Y. Equation (2. X) = 0 and corr( Y. if E( Y I X) = !Joy. · .25) Because the units of the numerator in Equati n (2. between X an d Y· IS tl covariance between X and Y divided b their slandard . An example is given in Exercise 2. pccifically.20)].O. th correlation s . deviated from ecause . It is not necessarily true. e units .26) If the conditi nal mean of Y docs not depend on X. The rand m variables X and Yare said to be uncorrelated if corr(X.' k 'dl the unit of X multiplied by.23. . . First suppose that Y and X have mean zero that cov(Y. aid differently. that if X and Yare unc rrelat d.This their means Its units are. E( YX) = E[ E( Y XI X)] = ElE( YI ) J = because E(YIX)=O. Y) cov(X. The correlation always is between -1 and I. X) = 0 into the definition of correlation in Equati n (2. -1 :5 corr(X. (2. Y) :5 '[ (correlation inequality}. it i possible for the conditional mean of Y to be a function of X but frY and n netheless to be uncorrelated. (2.32 CHAPTER Review Probability 2 of the covariance is the product of nnd Y.28) . X and Y.le c deviations: corr(X. That is. . X) Correlation and conditional mean. If Y and X do not have mean zero. By thc law f iterated expectations [Equation (2. B that solves t h e " uruits" problem of the covariance.27) follows by substituting cov( Y. 
The Mean and Variance of Sums of Random Variables

The mean of the sum of two random variables, X and Y, is the sum of their means:

E(X + Y) = E(X) + E(Y) = mu_X + mu_Y. (2.28)

The Distribution of Earnings in the United States in 2008

Some parents tell their children that they will be able to get a better, higher-paying job if they get a college degree than if they skip higher education. Are these parents right? Does the distribution of earnings differ between workers who are college graduates and workers who have only a high school diploma, and, if so, how? Among workers with a similar education, does the distribution of earnings for men and women differ? For example, do the best-paid college-educated women earn as much as the best-paid college-educated men?

One way to answer these questions is to examine the distribution of earnings of full-time workers, conditional on the highest educational degree achieved (high school diploma or bachelor's degree) and on gender. These four conditional distributions are shown in Figure 2.4, and the mean, standard deviation, and some percentiles of the conditional distributions are presented in Table 2.4.

[Figure 2.4: Conditional Distribution of Average Hourly Earnings of U.S. Full-Time Workers in 2008, Given Education Level and Gender. The four distributions of earnings are for women and men, for those with only a high school diploma (panels a and c) and those whose highest degree is from a four-year college (panels b and d). Each panel plots a density over average hourly earnings from $0 to $80.]

For example, the conditional mean of earnings for women whose highest degree is a high school diploma, that is, E(Earnings | Highest degree = high school diploma, Gender = female), is $14.73 per hour.

The distribution of average hourly earnings for female college graduates (Figure 2.4b) is shifted to the right of the distribution for women with only a high school diploma (Figure 2.4a); the same shift can be seen for the two groups of men (Figure 2.4d and Figure 2.4c). For both men and women, mean earnings are higher for those with a college degree (Table 2.4, first numeric column). Interestingly, the spread of the distribution of earnings, as measured by the standard deviation, is greater for those with a college degree than for those with a high school diploma. In addition, for both men and women, the 90th percentile of earnings is much higher for workers with a college degree than for workers with only a high school diploma. This final comparison is consistent with the parental admonition that a college degree opens doors that remain closed to individuals with only a high school diploma.

Another feature of these distributions is that the distribution of earnings for men is shifted to the right of the distribution of earnings for women. This "gender gap" in earnings is an important, and to many troubling, aspect of the distribution of earnings. We return to this topic in later chapters.

Table 2.4: Summaries of the Conditional Distribution of Average Hourly Earnings of U.S. Full-Time Workers in 2008, Given Education Level and Gender

[For each group, (a) women with a high school diploma, (b) women with a four-year college degree, (c) men with a high school diploma, and (d) men with a four-year college degree, the table reports the mean, the standard deviation, and the 25th, 50th (median), 75th, and 90th percentiles of average hourly earnings.]

Average hourly earnings are the sum of annual pretax wages, salaries, tips, and bonuses divided by the number of hours worked annually. The distributions were estimated using data from the March 2009 Current Population Survey, which is discussed in more detail in Appendix 3.1.

The variance of the sum of X and Y is the sum of their variances plus two times their covariance:

var(X + Y) = var(X) + var(Y) + 2cov(X, Y) = sigma_X^2 + sigma_Y^2 + 2sigma_XY. (2.36)

If X and Y are independent, then the covariance is zero and the variance of their sum is the sum of their variances:

var(X + Y) = var(X) + var(Y) = sigma_X^2 + sigma_Y^2  (if X and Y are independent). (2.37)

Useful expressions for means, variances, and covariances involving weighted sums of random variables are collected in Key Concept 2.3. The results in Key Concept 2.3 are derived in Appendix 2.1.

Key Concept 2.3: Means, Variances, and Covariances of Sums of Random Variables

Let X, Y, and V be random variables; let mu_X and sigma_X^2 be the mean and variance of X; let sigma_XY be the covariance between X and Y (and so forth for the other variables); and let a, b, and c be constants. Equations (2.29) through (2.35) follow from the definitions of the mean, variance, and covariance:

E(a + bX + cY) = a + b mu_X + c mu_Y, (2.29)
var(a + bY) = b^2 sigma_Y^2, (2.30)
var(aX + bY) = a^2 sigma_X^2 + 2ab sigma_XY + b^2 sigma_Y^2, (2.31)
E(Y^2) = sigma_Y^2 + mu_Y^2, (2.32)
cov(a + bX + cV, Y) = b sigma_XY + c sigma_VY, (2.33)
E(XY) = sigma_XY + mu_X mu_Y, (2.34)
|corr(X, Y)| <= 1 and |sigma_XY| <= sqrt(sigma_X^2 sigma_Y^2)  (correlation inequality). (2.35)
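Because the identities in Key Concept 2.3 are exact, they can be spot-checked by brute-force enumeration over any small joint distribution. A sketch with hypothetical probabilities (the numbers are ours, not the book's):

```python
# Spot-check of var(aX + bY) = a^2 var(X) + 2ab cov(X,Y) + b^2 var(Y)
# (Equation (2.31)) and E(XY) = cov(X,Y) + mu_X mu_Y (Equation (2.34))
# by exact enumeration over a small, hypothetical joint distribution.

joint = {(0, 1): 0.2, (0, 3): 0.3, (2, 1): 0.4, (2, 3): 0.1}  # Pr(X=x, Y=y)

def E(f):
    """Expectation of f(X, Y) under the joint distribution above."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)
cov_xy = E(lambda x, y: (x - mu_x) * (y - mu_y))

a, b = 2.0, -3.0
lhs = E(lambda x, y: (a * x + b * y - (a * mu_x + b * mu_y)) ** 2)
rhs = a * a * var_x + 2 * a * b * cov_xy + b * b * var_y
print(abs(lhs - rhs) < 1e-12)  # True

print(abs(E(lambda x, y: x * y) - (cov_xy + mu_x * mu_y)) < 1e-12)  # True
```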

2.4 The Normal, Chi-Squared, Student t, and F Distributions

The probability distributions most often encountered in econometrics are the normal, chi-squared, Student t, and F distributions.

The Normal Distribution

A continuous random variable with a normal distribution has the familiar bell-shaped probability density shown in Figure 2.5. The function defining the normal probability density is given in Appendix 17.1. As Figure 2.5 shows, the normal density with mean mu and variance sigma^2 is symmetric around its mean and has 95% of its probability between mu - 1.96 sigma and mu + 1.96 sigma.

[Figure 2.5: The normal probability density function with mean mu and variance sigma^2 is a bell-shaped curve, centered at mu. The area under the normal p.d.f. between mu - 1.96 sigma and mu + 1.96 sigma is 0.95. The normal distribution is denoted N(mu, sigma^2).]

Some special notation and terminology have been developed for the normal distribution. The normal distribution with mean mu and variance sigma^2 is expressed concisely as "N(mu, sigma^2)." The standard normal distribution is the normal distribution with mean mu = 0 and variance sigma^2 = 1 and is denoted N(0, 1). Random variables that have a N(0, 1) distribution are often denoted Z, and the standard normal cumulative distribution function is denoted by the Greek letter Phi; accordingly, Pr(Z <= c) = Phi(c), where c is a constant. Values of the standard normal cumulative distribution function are tabulated in Appendix Table 1.

To look up probabilities for a normal variable with a general mean and variance, we must standardize the variable by first subtracting the mean, then by dividing the result by the standard deviation.

Key Concept 2.4: Computing Probabilities Involving Normal Random Variables

Suppose Y is normally distributed with mean mu and variance sigma^2; in other words, Y is distributed N(mu, sigma^2). Then Y is standardized by subtracting its mean and dividing by its standard deviation, that is, by computing Z = (Y - mu)/sigma.

Let c1 and c2 denote two numbers with c1 < c2, and let d1 = (c1 - mu)/sigma and d2 = (c2 - mu)/sigma. Then

Pr(Y <= c2) = Pr(Z <= d2) = Phi(d2), (2.38)
Pr(Y >= c1) = Pr(Z >= d1) = 1 - Phi(d1), (2.39)
Pr(c1 <= Y <= c2) = Pr(d1 <= Z <= d2) = Phi(d2) - Phi(d1). (2.40)

The normal cumulative distribution function Phi is tabulated in Appendix Table 1.

For example, suppose Y is distributed N(1, 4); that is, Y is normally distributed with a mean of 1 and a variance of 4. What is the probability that Y <= 2, that is, what is the shaded area in Figure 2.6a? The standardized version of Y is Y minus its mean, divided by its standard deviation: (Y - 1)/sqrt(4) = (1/2)(Y - 1). Accordingly, the random variable (1/2)(Y - 1) is normally distributed with mean zero and variance one (see Exercise 2.8); it has the standard normal distribution shown in Figure 2.6b. Now Y <= 2 is equivalent to (1/2)(Y - 1) <= (1/2)(2 - 1), that is, (1/2)(Y - 1) <= 1/2. Thus

Pr(Y <= 2) = Pr[(1/2)(Y - 1) <= 1/2] = Pr(Z <= 1/2) = Phi(0.5) = 0.691, (2.41)

where the value 0.691 is taken from Appendix Table 1.

The same approach can be applied to compute the probability that a normally distributed random variable exceeds some value or that it falls in a certain range. These steps are summarized in Key Concept 2.4. The box "A Bad Day on Wall Street" presents an unusual application of the cumulative normal distribution.

The normal distribution is symmetric, so its skewness is zero. The kurtosis of the normal distribution is 3.
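The standardization recipe in Key Concept 2.4 is easy to carry out without tables, because Phi can be written exactly in terms of the error function in Python's standard library: Phi(z) = (1/2)[1 + erf(z/sqrt(2))]. A sketch reproducing Equation (2.41):

```python
# Key Concept 2.4 in code: standardize, then evaluate the standard normal CDF.
# Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) is an exact identity, so no tables needed.
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_cdf(c, mu, sigma):
    """Pr(Y <= c) for Y ~ N(mu, sigma^2), via Equation (2.38)."""
    return phi((c - mu) / sigma)

# Y ~ N(1, 4), so mu = 1 and sigma = 2.  Reproduces Equation (2.41):
print(round(normal_cdf(2.0, 1.0, 2.0), 3))  # 0.691

# Pr(c1 <= Y <= c2) = Phi(d2) - Phi(d1), Equation (2.40), with c1 = 0, c2 = 2:
print(round(normal_cdf(2.0, 1.0, 2.0) - normal_cdf(0.0, 1.0, 2.0), 3))  # 0.383
```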

[Figure 2.6: Calculating the Probability That Y <= 2 When Y Is Distributed N(1, 4). To calculate Pr(Y <= 2), standardize Y and then use the standard normal distribution table. Y is standardized by subtracting its mean (mu = 1) and dividing by its standard deviation (sigma = 2). The probability that Y <= 2 is shown in panel (a) for the N(1, 4) distribution, and the corresponding probability after standardizing Y is shown in panel (b) for the N(0, 1) distribution. Because the standardized random variable, (Y - 1)/2, is a standard normal (Z) variable, Pr(Y <= 2) = Pr[(Y - 1)/2 <= (2 - 1)/2] = Pr(Z <= 0.5). From Appendix Table 1, Pr(Z <= 0.5) = Phi(0.5) = 0.691.]

The multivariate normal distribution. The normal distribution can be generalized to describe the joint distribution of a set of random variables. In this case, the distribution is called the multivariate normal distribution or, if only two variables are being considered, the bivariate normal distribution. The formula for the bivariate normal p.d.f. is given in Appendix 17.1, and the formula for the general multivariate normal p.d.f. is given in Appendix 18.1.

The multivariate normal distribution has four important properties. If X and Y have a bivariate normal distribution with covariance sigma_XY and if a and b are two constants, then aX + bY has the normal distribution:

aX + bY is distributed N(a mu_X + b mu_Y, a^2 sigma_X^2 + b^2 sigma_Y^2 + 2ab sigma_XY)  (X, Y bivariate normal). (2.42)

A Bad Day on Wall Street

On a typical day the overall value of stocks traded on the U.S. stock market can rise or fall by 1% or even more. This is a lot, but nothing compared to what happened on Monday, October 19, 1987. On "Black Monday," the Dow Jones Industrial Average (an average of 30 large industrial stocks) fell by 22.6%! From January 1, 1980, to December 31, 1987, the standard deviation of daily percentage price changes on the Dow was 1.13%, so the drop of 22.6% was a negative return of 20 (= 22.6/1.13) standard deviations. The enormity of this drop can be seen in Figure 2.7, a plot of the daily returns on the Dow during the 1980s.

If daily percentage price changes are normally distributed, then the probability of a change of at least 20 standard deviations is Pr(|Z| >= 20) = 2 x Phi(-20). You will not find this value in Appendix Table 1, but you can calculate it using a computer (try it!). This probability is 5.5 x 10^-89, that is, 0.000...00055, where there are a total of 88 zeros!

[Figure 2.7: Daily Percentage Changes in the Dow Jones Industrial Average in the 1980s. During the 1980s, the average percentage daily change of "the Dow" index was 0.05% and its standard deviation was 1.13%. On October 19, 1987, "Black Monday," the index fell 22.6%, or roughly 20 standard deviations. The plot shows the daily percent change (vertical axis, -25 to 10) over the years 1980 to 1990.]

How small is 5.5 x 10^-89? Consider the following:

- The world population is about 7 billion, so the probability of winning a random lottery among all living people is about one in 7 billion, or 1.4 x 10^-10.
- The universe is believed to have existed for 14 billion years, or about 5 x 10^17 seconds, so the probability of choosing a particular second at random from all the seconds since the beginning of time is 2 x 10^-18.
- There are approximately 10^43 molecules of gas in the first kilometer above the earth's surface. The probability of choosing one at random is 10^-43.

Although Wall Street did have a bad day, the fact that it happened at all suggests its probability was more than 5.5 x 10^-89. In fact, there have been many days, good and bad, with stock price changes too large to be consistent with a normal distribution with a constant variance. Table 2.5 lists the ten largest daily percentage price changes in the Dow Jones Industrial Average between January 1, 1980, and December 31, 2009, along with the standardized change computed using the mean and variance over this period. All ten standardized changes exceed 6 standard deviations in absolute value, an extremely rare event if stock prices are normally distributed.

Clearly, stock price percentage changes have a distribution with heavier tails than the normal distribution, an idea popularized in Nassim Taleb's 2007 book, The Black Swan. For this reason, finance professionals use models of stock price changes in which the variance evolves over time, so that periods like October 1987 and the financial crisis in the fall of 2008 have higher volatility than others. These models with time-varying variances are more consistent with the very bad, and very good, days we actually see on Wall Street, and they are discussed in Chapter 16.

Table 2.5: The Ten Largest Daily Percentage Changes in the Dow Jones Industrial Average, 1980-2009, and the Normal Probability of a Change at Least as Large

Date                   Percentage Change (x)
October 19, 1987       -22.6
October 13, 2008        11.1
October 28, 2008        10.9
October 21, 1987        10.1
October 26, 1987        -8.0
October 15, 2008        -7.9
December 1, 2008        -7.7
October 9, 2008         -7.3
October 27, 1997        -7.2
September 17, 2001      -7.1

[The table also reports, for each date, the standardized change z = (x - mu)/sigma and the normal probability of a change at least as large, Pr(|Z| >= |z|) = 2 Phi(-|z|).]
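The box's invitation, "you can calculate it using a computer (try it!)," can be taken up with the complementary error function, which keeps its precision deep in the tail where 1 - Phi(z) would round to zero:

```python
# "Try it!": the normal probability of a move of at least 20 standard deviations.
# Identity: Pr(|Z| >= z) = 2 * Phi(-z) = erfc(z / sqrt(2)); erfc stays accurate
# far in the tail, unlike computing 1 - erf(...) directly.
from math import erfc, sqrt

def two_sided_tail(z):
    """Pr(|Z| >= z) for a standard normal Z."""
    return erfc(z / sqrt(2.0))

p = two_sided_tail(20.0)
print(p)  # about 5.5e-89, matching the box
```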

More generally, if n random variables have a multivariate normal distribution, then the marginal distribution of each of the variables is normal [this follows from Equation (2.42) by setting a = 1 and b = 0].

Second, if X and Y have a bivariate normal distribution and sigma_XY = 0, then X and Y are independent. In Section 2.3 it was stated that if X and Y are independent, then, regardless of their joint distribution, sigma_XY = 0. If X and Y are jointly normally distributed, then the converse is also true. This result, that zero covariance implies independence, is a special property of the multivariate normal distribution that is not true in general. More generally, if variables with a multivariate normal distribution have covariances that equal zero, then the variables are independent.

Third, if a set of variables has a multivariate normal distribution, then any linear combination of these variables (such as their sum) is normally distributed.

Fourth, if X and Y have a bivariate normal distribution, then the conditional expectation of Y given X is linear in X; that is, E(Y | X = x) = a + bx, where a and b are constants (Exercise 17.11). Joint normality implies linearity of conditional expectations, but linearity of conditional expectations does not imply joint normality.

The Chi-Squared Distribution

The chi-squared distribution is used when testing certain types of hypotheses in statistics and econometrics. The chi-squared distribution is the distribution of the sum of m squared independent standard normal random variables. This distribution depends on m, which is called the degrees of freedom of the chi-squared distribution. The name for this distribution derives from the Greek letter used to denote it: a chi-squared distribution with m degrees of freedom is denoted chi-squared(m).

For example, let Z1, Z2, and Z3 be independent standard normal random variables. Then Z1^2 + Z2^2 + Z3^2 has a chi-squared distribution with 3 degrees of freedom. Selected percentiles of the chi-squared(m) distribution are given in Appendix Table 3. For example, Appendix Table 3 shows that the 95th percentile of the chi-squared(3) distribution is 7.81, so Pr(Z1^2 + Z2^2 + Z3^2 <= 7.81) = 0.95.

The Student t Distribution

The Student t distribution with m degrees of freedom is defined to be the distribution of the ratio of a standard normal random variable, divided by the square root of an independently distributed chi-squared random variable with m degrees of freedom divided by m. That is, let Z be a standard normal random variable, let W be a random variable with a chi-squared distribution with m degrees of freedom, and let Z and W be independently distributed.

Then the random variable Z/sqrt(W/m) has a Student t distribution (also called the t distribution) with m degrees of freedom. This distribution is denoted t(m). Selected percentiles of the Student t distribution are given in Appendix Table 2.

The Student t distribution depends on the degrees of freedom m; thus the 95th percentile of the t(m) distribution depends on m. The Student t distribution has a bell shape similar to that of the normal distribution, but when m is small (20 or less), it has more mass in the tails; that is, it is a "fatter" bell shape than the normal. When m is 30 or more, the Student t distribution is well approximated by the standard normal distribution, and the t(infinity) distribution equals the standard normal distribution.

The F Distribution

The F distribution with m and n degrees of freedom, denoted F(m, n), is defined to be the distribution of the ratio of a chi-squared random variable with degrees of freedom m, divided by m, to an independently distributed chi-squared random variable with degrees of freedom n, divided by n. To state this mathematically, let W be a chi-squared random variable with m degrees of freedom and let V be a chi-squared random variable with n degrees of freedom, where W and V are independently distributed. Then (W/m)/(V/n) has an F(m, n) distribution, that is, an F distribution with numerator degrees of freedom m and denominator degrees of freedom n.

The 90th, 95th, and 99th percentiles of the F(m, n) distribution are given in Appendix Table 5 for selected values of m and n. For example, the 95th percentile of the F(3, 30) distribution is 2.92, and the 95th percentile of the F(3, 90) distribution is 2.71.

In statistics and econometrics, an important special case of the F distribution arises when the denominator degrees of freedom is large enough that the F(m, n) distribution can be approximated by the F(m, infinity) distribution. In this limiting case, the denominator random variable V/n is the mean of infinitely many squared standard normal random variables, and that mean is 1 because the mean of a squared standard normal random variable is 1 (see Exercise 2.24). Thus the F(m, infinity) distribution is the distribution of a chi-squared random variable with m degrees of freedom, divided by m: W/m is distributed F(m, infinity). For example, from Appendix Table 4, the 95th percentile of the F(3, infinity) distribution is 2.60, which is the same as the 95th percentile of the chi-squared(3) distribution, 7.81 (from Appendix Table 3), divided by 3 (7.81/3 = 2.60). As the denominator degrees of freedom n increases, the F(m, n) distribution tends to the F(m, infinity) distribution.
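The relationship between the F(m, infinity) and chi-squared distributions can be checked by simulation: draws of a chi-squared(3) random variable divided by 3 should have a 95th percentile near 2.60. A seeded Monte Carlo sketch (the sample size and seed are arbitrary choices):

```python
# Monte Carlo sketch of the chi-squared / F relationship: the 95th percentile
# of the F(3, infinity) distribution equals the 95th percentile of the
# chi-squared(3) distribution divided by 3 (7.81 / 3 = 2.60).
import random

random.seed(0)
N = 200_000
draws = []
for _ in range(N):
    chi2_3 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(3))
    draws.append(chi2_3 / 3.0)   # a draw from the F(3, infinity) distribution

draws.sort()
pct95 = draws[int(0.95 * N)]
print(round(pct95, 2))  # close to 2.60
```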

2.5 Random Sampling and the Distribution of the Sample Average

Almost all the statistical and econometric procedures used in this book involve averages or weighted averages of a sample of data. Characterizing the distribution of sample averages therefore is an essential step toward understanding the performance of econometric procedures.

This section introduces some basic concepts about random sampling and the distributions of averages that are used throughout the book. We begin by discussing random sampling. The act of random sampling, that is, randomly drawing a sample from a larger population, has the effect of making the sample average itself a random variable. Because the sample average is a random variable, it has a probability distribution, which is called its sampling distribution.

Random Sampling

Simple random sampling. Suppose our commuting student from Section 2.1 aspires to be a statistician and decides to record her commuting time on various days. She selects these days at random from the school year, and her daily commuting time has the cumulative distribution function in Figure 2.2a. Because these days were selected at random, knowing the value of the commuting time on one of these randomly selected days provides no information about the commuting time on another of the days; that is, because the days were selected at random, the values of the commuting time on each of the different days are independently distributed random variables.

The situation described in the previous paragraph is an example of the simplest sampling scheme used in statistics, called simple random sampling, in which n objects are selected at random from a population (the population of commuting days) and each member of the population (each day) is equally likely to be included in the sample.

The n observations in the sample are denoted Y_1, ..., Y_n, where Y_1 is the first observation, Y_2 is the second observation, and so forth.
In the commuting example, Y_1 is the commuting time on the first of her n randomly selected days, and Y_i is the commuting time on the i-th of her randomly selected days. Because the members of the population included in the sample are selected at random, the values of the observations Y_1, ..., Y_n are themselves random. The remainder of this section develops some properties of the sampling distribution of the sample average.

If different members of the population are chosen, their values of Y will differ. Thus the act of random sampling means that, before they are sampled, Y_1, ..., Y_n can be treated as random variables; after they are sampled, a specific value is recorded for each observation.

i.i.d. draws. Because Y_1, ..., Y_n are randomly drawn from the same population, the marginal distribution of Y_i is the same for each i = 1, ..., n; this marginal distribution is the distribution of Y in the population being sampled. When Y_i has the same marginal distribution for i = 1, ..., n, then Y_1, ..., Y_n are said to be identically distributed.

Under simple random sampling, knowing the value of Y_1 provides no information about Y_2, so the conditional distribution of Y_2 given Y_1 is the same as the marginal distribution of Y_2. In other words, under simple random sampling, Y_1 is distributed independently of Y_2, ..., Y_n.

When Y_1, ..., Y_n are drawn from the same distribution and are independently distributed, they are said to be independently and identically distributed (or i.i.d.). Simple random sampling and i.i.d. draws are summarized in Key Concept 2.5.

Key Concept 2.5: Simple Random Sampling and i.i.d. Random Variables

In a simple random sample, n objects are drawn at random from a population, and each object is equally likely to be drawn. The value of the random variable Y for the i-th randomly drawn object is denoted Y_i. Because each object is equally likely to be drawn and the distribution of Y_i is the same for all i, the random variables Y_1, ..., Y_n are independently and identically distributed (i.i.d.); that is, the distribution of Y_i is the same for all i = 1, ..., n, and Y_1 is distributed independently of Y_2, ..., Y_n, and so forth.

The Sampling Distribution of the Sample Average

The sample average or sample mean, Ybar, of the n observations Y_1, ..., Y_n is

Ybar = (1/n)(Y_1 + Y_2 + ... + Y_n) = (1/n) sum_{i=1}^{n} Y_i. (2.43)

An essential concept is that the act of drawing a random sample has the effect of making the sample average Ybar a random variable. Because the sample was drawn at random, the value of each Y_i is random. Because Y_1, ..., Y_n are random, their average is random. Had a different sample been drawn, then the observations and their sample average would have been different: The value of Ybar differs from one randomly drawn sample to the next.

For example, suppose our student commuter selected five days at random to record her commute times and then computed the average of those five times. Had she chosen five different days, she would have recorded five different times and thus would have computed a different value of the sample average.

Because Ybar is random, it has a probability distribution. The distribution of Ybar is called the sampling distribution of Ybar because it is the probability distribution associated with the possible values of Ybar that could be computed for different possible samples Y_1, ..., Y_n.

The sampling distribution of averages and weighted averages plays a central role in statistics and econometrics. We start our discussion of the sampling distribution of Ybar by computing its mean and variance under general conditions on the population distribution of Y.

Mean and variance of Ybar. Suppose that the observations Y_1, ..., Y_n are i.i.d., and let mu_Y and sigma_Y^2 denote the mean and variance of Y_i (because the observations are i.i.d., the mean and variance are the same for all i = 1, ..., n). When n = 2, the mean of the sum Y_1 + Y_2 is given by applying Equation (2.28): E(Y_1 + Y_2) = mu_Y + mu_Y = 2 mu_Y. Thus the mean of the sample average is E[(1/2)(Y_1 + Y_2)] = (1/2) x 2 mu_Y = mu_Y. In general,

E(Ybar) = (1/n) sum_{i=1}^{n} E(Y_i) = mu_Y. (2.44)

The variance of Ybar is found by applying Equation (2.31). For n = 2, var(Y_1 + Y_2) = 2 sigma_Y^2 [by applying Equation (2.31) with a = b = 1 and cov(Y_1, Y_2) = 0], so var(Ybar) = var[(1/2)(Y_1 + Y_2)] = (1/4) x 2 sigma_Y^2 = sigma_Y^2 / 2. For general n, because Y_1, ..., Y_n are i.i.d., Y_i and Y_j are independently distributed for i != j, so cov(Y_i, Y_j) = 0. Thus,

var(Ybar) = var[(1/n) sum_{i=1}^{n} Y_i]
          = (1/n^2) sum_{i=1}^{n} var(Y_i) + (1/n^2) sum_{i=1}^{n} sum_{j=1, j!=i}^{n} cov(Y_i, Y_j)
          = sigma_Y^2 / n. (2.45)

The standard deviation of Ybar is the square root of the variance, sigma_Y / sqrt(n).
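Equations (2.44) and (2.45) can be illustrated by simulating many samples and examining the spread of the resulting sample averages. The population below (uniform on [0, 10]) is an arbitrary choice for illustration, not from the text:

```python
# Simulation sketch of Equations (2.44) and (2.45): across many random samples,
# the sample averages have mean near mu_Y and variance near sigma_Y^2 / n.
import random

random.seed(1)
n, reps = 25, 40_000
mu_y, var_y = 5.0, 100.0 / 12.0   # mean and variance of Uniform(0, 10)

ybars = [sum(random.uniform(0.0, 10.0) for _ in range(n)) / n for _ in range(reps)]

mean_of_ybar = sum(ybars) / reps
var_of_ybar = sum((y - mean_of_ybar) ** 2 for y in ybars) / reps

print(round(mean_of_ybar, 2))   # close to mu_Y = 5.0
print(round(var_of_ybar, 3))    # close to var_Y / n = 0.333
```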

In summary, the mean, the variance, and the standard deviation of Ybar are

E(Ybar) = mu_Y, (2.46)
var(Ybar) = sigma_Ybar^2 = sigma_Y^2 / n, and (2.47)
std.dev(Ybar) = sigma_Ybar = sigma_Y / sqrt(n). (2.48)

These results hold whatever the distribution of Y_i is; that is, the distribution of Y_i does not need to take on a specific form, such as the normal distribution, for Equations (2.46) through (2.48) to hold.

The notation sigma_Ybar^2 denotes the variance of the sampling distribution of the sample average Ybar. In contrast, sigma_Y^2 is the variance of each individual Y_i, that is, the variance of the population distribution from which the observation is drawn. Similarly, sigma_Ybar denotes the standard deviation of the sampling distribution of Ybar.

Sampling distribution of Ybar when Y is normally distributed. Suppose that Y_1, ..., Y_n are i.i.d. draws from the N(mu_Y, sigma_Y^2) distribution. As stated following Equation (2.42), the sum of n normally distributed random variables is itself normally distributed. Because the mean of Ybar is mu_Y and the variance of Ybar is sigma_Y^2 / n, Ybar is distributed N(mu_Y, sigma_Y^2 / n).

Financial Diversification and Portfolios

The principle of diversification says that you can reduce your risk by holding small investments in multiple assets, compared to putting all your money into one asset. That is, you shouldn't put all your eggs in one basket.

The math of diversification follows from Equation (2.47). Suppose you divide $1 equally among n assets. Let Y_i represent the payout in 1 year of $1 invested in the i-th asset. Because you invested 1/n dollars in each asset, the actual payoff of your portfolio after 1 year is (Y_1 + Y_2 + ... + Y_n)/n = Ybar. To keep things simple, suppose that each asset has the same expected payout, mu_Y, the same variance, sigma^2, and the same positive correlation rho across assets [so that cov(Y_i, Y_j) = rho sigma^2]. Then the expected payout is E(Ybar) = mu_Y, and, for large n, the variance of the portfolio payout is var(Ybar) = rho sigma^2 (Exercise 2.26). Putting all your money into one asset or spreading it equally across all n assets has the same expected payout, but diversifying reduces the variance from sigma^2 to rho sigma^2.

The math of diversification has led to financial products such as stock mutual funds, in which the fund holds many stocks and an individual owns a share of the fund, thereby owning a small amount of many stocks. But diversification has its limits: for many assets, payouts are positively correlated, so var(Ybar) remains positive even if n is large. In the case of stocks, risk is reduced by holding a portfolio, but that portfolio remains subject to the unpredictable fluctuations of the overall stock market.
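The variance claimed in the box follows from expanding var(Ybar) when every asset has variance sigma^2 and every pair has covariance rho sigma^2: var(Ybar) = sigma^2/n + [(n - 1)/n] rho sigma^2, which starts at sigma^2 for n = 1 and approaches rho sigma^2 as n grows (this derivation is the content of Exercise 2.26). A sketch tabulating the formula for illustrative parameter values (sigma^2 = 1 and rho = 0.25 are our choices):

```python
# Equal-weight portfolio variance when each asset has variance sigma2 and each
# pair of assets has covariance rho * sigma2:
#   var(Ybar) = sigma2 / n + ((n - 1) / n) * rho * sigma2.
# The parameter values are illustrative.

def portfolio_var(n, sigma2=1.0, rho=0.25):
    return sigma2 / n + (n - 1) / n * rho * sigma2

for n in [1, 2, 10, 100, 10_000]:
    print(n, round(portfolio_var(n), 4))
# n = 1 gives 1.0 (no diversification); large n approaches rho * sigma2 = 0.25,
# the floor on risk that diversification cannot remove.
```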

2.6 Large-Sample Approximations to Sampling Distributions

Sampling distributions play a central role in the development of statistical and econometric procedures, so it is important to know, in a mathematical sense, what the sampling distribution of Ybar is. There are two approaches to characterizing sampling distributions: an "exact" approach and an "approximate" approach.

The "exact" approach entails deriving a formula for the sampling distribution that holds exactly for any value of n. The sampling distribution that exactly describes the distribution of Ybar for any n is called the exact distribution or finite-sample distribution of Ybar. For example, if Y is normally distributed and Y_1, ..., Y_n are i.i.d. draws from the N(mu_Y, sigma_Y^2) distribution, then (as discussed in Section 2.5) the exact distribution of Ybar is normal with mean mu_Y and variance sigma_Y^2 / n. Unfortunately, if the distribution of Y is not normal, then in general the exact sampling distribution of Ybar is very complicated and depends on the distribution of Y.

The "approximate" approach uses approximations to the sampling distribution that rely on the sample size being large. The large-sample approximation to the sampling distribution is often called the asymptotic distribution, "asymptotic" because the approximation becomes exact in the limit that n tends to infinity. As we see in this section, these approximations can be very accurate even if the sample size is only n = 30 observations. Because sample sizes used in practice in econometrics typically number in the hundreds or thousands, these asymptotic distributions can be counted on to provide very good approximations to the exact sampling distribution.

This section presents the two key tools used to approximate sampling distributions when the sample size is large: the law of large numbers and the central limit theorem. The law of large numbers says that, when the sample size is large, Ybar will be close to mu_Y with very high probability. The central limit theorem says that, when the sample size is large, the sampling distribution of the standardized sample average, (Ybar - mu_Y)/sigma_Ybar, is approximately normal.

Although exact sampling distributions are complicated and depend on the distribution of Y, the asymptotic distributions are simple. Moreover, remarkably, the asymptotic normal distribution of (Ybar - mu_Y)/sigma_Ybar does not depend on the distribution of Y. This normal approximate distribution provides enormous simplifications and underlies the theory of regression used throughout this book.


Key Concept 2.6: Convergence in Probability, Consistency, and the Law of Large Numbers

The sample average Ybar converges in probability to mu_Y (or, equivalently, Ybar is consistent for mu_Y) if the probability that Ybar is in the range mu_Y - c to mu_Y + c becomes arbitrarily close to 1 as n increases for any constant c > 0. The convergence of Ybar to mu_Y in probability is written Ybar ->p mu_Y.

The law of large numbers says that if Y_i, i = 1, ..., n, are independently and identically distributed with E(Y_i) = mu_Y and if large outliers are unlikely (technically, if var(Y_i) = sigma_Y^2 < infinity), then Ybar ->p mu_Y.

The Law of Large Numbers and Consistency
The law of large numbers states that, under general conditions, Ȳ will be near μ_Y with very high probability when n is large. This is sometimes called the "law of averages." When a large number of random variables with the same mean are averaged together, the large values balance the small values and their sample average is close to their common mean.

For example, consider a simplified version of our student commuter's experiment in which she simply records whether her commute was short (less than 20 minutes) or long. Let Y_i equal 1 if her commute was short on the ith randomly selected day and equal 0 if it was long. Because she used simple random sampling, Y_1, …, Y_n are i.i.d. Thus Y_i, i = 1, …, n, are i.i.d. draws of a Bernoulli random variable, where (from Table 2.2) the probability that Y_i = 1 is 0.78. Because the expectation of a Bernoulli random variable is its success probability, E(Y_i) = μ_Y = 0.78. The sample average Ȳ is the fraction of days in her sample in which her commute was short.

Figure 2.8 shows the sampling distribution of Ȳ for various sample sizes n. When n = 2 (Figure 2.8a), Ȳ can take on only three values: 0, 1/2, and 1 (neither commute was short, one was short, and both were short), none of which is particularly close to the true proportion in the population, 0.78. As n increases, however (Figures 2.8b–d), Ȳ takes on more values and the sampling distribution becomes tightly centered on μ_Y.

The property that Ȳ is near μ_Y with increasing probability as n increases is called convergence in probability or, more concisely, consistency (see Key Concept 2.6). The law of large numbers states that, under certain conditions, Ȳ converges in probability to μ_Y or, equivalently, that Ȳ is consistent for μ_Y.
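The commuter example lends itself to simulation (an illustrative sketch, not from the text): drawing i.i.d. Bernoulli(0.78) days and averaging them shows Ȳ settling near μ_Y = 0.78 as n grows, exactly as the law of large numbers predicts.

```python
import random

def sample_average(n, p=0.78, rng=None):
    # Draw n i.i.d. Bernoulli(p) commute indicators (1 = short commute)
    # and return the sample average Ybar, the fraction of short commutes.
    rng = rng or random.Random()
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n

rng = random.Random(42)
for n in (2, 100, 10_000):
    # As n grows, the realized Ybar drifts toward mu_Y = 0.78.
    print(n, sample_average(n, rng=rng))
```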


Figure 2.8  Sampling Distribution of the Sample Average of n Bernoulli Random Variables

[Four panels plot probability against the value of the sample average for (a) n = 2, (b) n = 5, (c) n = 25, and (d) n = 100.]

The distributions are the sampling distributions of Ȳ, the sample average of n independent Bernoulli random variables with p = Pr(Y_i = 1) = 0.78 (the probability of a short commute is 78%). The variance of the sampling distribution of Ȳ decreases as n gets larger, so the sampling distribution becomes more tightly concentrated around its mean μ = 0.78 as the sample size n increases.
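The concentration visible in Figure 2.8 can be checked directly (illustrative code, not part of the text), because for Bernoulli draws the exact sampling distribution of Ȳ is known: nȲ is Binomial(n, p). The sketch below computes Pr(|Ȳ − p| ≤ 0.05) for the four panel sizes.

```python
import math

def prob_mean_within(p, n, eps):
    # Pr(|Ybar - p| <= eps), computed exactly from the Binomial(n, p)
    # sampling distribution of n*Ybar by summing the qualifying pmf terms.
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) <= eps + 1e-12:
            total += math.comb(n, k) * p**k * (1 - p)**(n - k)
    return total

for n in (2, 5, 25, 100):
    print(n, round(prob_mean_within(0.78, n, 0.05), 4))
```

For n = 2 no attainable value of Ȳ lies within 0.05 of 0.78, while for large n the probability is high, reflecting the tightening of the sampling distribution.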


The conditions for the law of large numbers that we will use in this book are that Y_i, i = 1, …, n, are i.i.d. and that the variance of Y_i, σ_Y², is finite. The mathematical role of these conditions is made clear in Section 17.2, where the law of large numbers is proven. If the data are collected by simple random sampling, then the i.i.d. assumption holds. The assumption that the variance is finite says that extremely large values of Y_i (that is, outliers) are unlikely and observed infrequently; otherwise, these large values could dominate Ȳ and the sample average would be unreliable. This assumption is plausible for the applications in this book. For example, because there is an upper limit to our student's commuting time (she could park and walk if the traffic is dreadful), the variance of the distribution of commuting times is finite.
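The role of the finite-variance condition can be seen by simulation (an illustrative sketch, not from the text): running averages of a finite-variance population settle down, while running averages of draws from a Cauchy distribution, which has no finite variance and regularly produces extreme outliers, keep lurching even for large n.

```python
import math, random

def running_average(draws):
    # Running sample averages Ybar_1, Ybar_2, ... of a sequence of draws.
    out, total = [], 0.0
    for i, y in enumerate(draws, start=1):
        total += y
        out.append(total / i)
    return out

rng = random.Random(7)
n = 5000
finite_var = [rng.expovariate(1.0) for _ in range(n)]     # mean 1, finite variance
# Cauchy draws via inverse-CDF sampling: infinite variance, huge outliers.
cauchy = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

print("exponential(1) running average ends near",
      round(running_average(finite_var)[-1], 3))
print("Cauchy running average ends at",
      round(running_average(cauchy)[-1], 3))
```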

The Central Limit Theorem
The central limit theorem says that, under general conditions, the distribution of Ȳ is well approximated by a normal distribution when n is large. Recall that the mean of Ȳ is μ_Y and its variance is σ_Ȳ² = σ_Y²/n. According to the central limit theorem, when n is large, the distribution of Ȳ is approximately N(μ_Y, σ_Ȳ²). As discussed at the end of Section 2.5, the distribution of Ȳ is exactly N(μ_Y, σ_Ȳ²) when the sample is drawn from a population with the normal distribution N(μ_Y, σ_Y²). The central limit theorem says that this same result is approximately true when n is large even if Y_1, …, Y_n are not themselves normally distributed.

The convergence of the distribution of Ȳ to the bell-shaped normal approximation can be seen (a bit) in Figure 2.8. However, because the distribution gets quite tight for large n, this requires some squinting. It would be easier to see the shape of the distribution of Ȳ if you used a magnifying glass or had some other way to zoom in or to expand the horizontal axis of the figure. One way to do this is to standardize Ȳ by subtracting its mean and dividing by its standard deviation so that it has a mean of 0 and a variance of 1. This process leads to examining the distribution of the standardized version of Ȳ, (Ȳ − μ_Y)/σ_Ȳ. According to the central limit theorem, this distribution should be well approximated by a N(0, 1) distribution when n is large. The distribution of the standardized average (Ȳ − μ_Y)/σ_Ȳ is plotted in Figure 2.9 for the distributions in Figure 2.8; the distributions in Figure 2.9 are exactly the same as in Figure 2.8, except that the scale of the horizontal axis is changed so that the standardized variable has a mean of 0 and a variance of 1. After this change of scale, it is easy to see that, if n is large enough, the distribution of Ȳ is well approximated by a normal distribution.

One might ask, how large is "large enough"? That is, how large must n be for the distribution of Ȳ to be approximately normal? The answer is, "It depends."


Figure 2.9  Distribution of the Standardized Sample Average of n Bernoulli Random Variables with p = 0.78

[Four panels plot the distribution of the standardized sample average for (a) n = 2, (b) n = 5, (c) n = 25, and (d) n = 100.]

The sampling distribution of Ȳ in Figure 2.8 is plotted here after standardizing Ȳ. This plot centers the distributions in Figure 2.8 and magnifies the scale on the horizontal axis by a factor of √n. When the sample size is large, the sampling distributions are increasingly well approximated by the normal distribution (the solid line), as predicted by the central limit theorem. The normal distribution is scaled so that the height of the distributions is approximately the same in all figures.

The quality of the normal approximation depends on the distribution of the underlying Y_i that make up the average. At one extreme, if the Y_i are themselves normally distributed, then Ȳ is exactly normally distributed for all n. In contrast, when the underlying Y_i themselves have a distribution that is far from normal, then this approximation can require n = 30 or even more.

This point is illustrated in Figure 2.10 for a population distribution, shown in Figure 2.10a, that is quite different from the Bernoulli distribution. This distribution has a long right tail (it is "skewed" to the right). The sampling distribution of Ȳ, after centering and scaling, is shown in Figures 2.10b–d for n = 5, 25, and 100, respectively. Although the sampling distribution is approaching the bell shape for n = 25, the normal approximation still has noticeable imperfections. By n = 100, however, the normal approximation is quite good. In fact, for n ≥ 100, the normal approximation to the distribution of Ȳ typically is very good for a wide variety of population distributions.

The central limit theorem is a remarkable result. While the "small n" distributions of the standardized averages in Figures 2.9 and 2.10 are complicated and quite different from each other, the "large n" distributions are simple and, amazingly, have a similar shape. Because the distribution of Ȳ approaches the normal as n grows large, Ȳ is said to have an asymptotic normal distribution. The convenience of the normal approximation, combined with its wide applicability because of the central limit theorem, makes it a key underpinning of modern applied econometrics. The central limit theorem is summarized in Key Concept 2.7.

KEY CONCEPT 2.7
The Central Limit Theorem

Suppose that Y_1, …, Y_n are i.i.d. with E(Y_i) = μ_Y and var(Y_i) = σ_Y², where 0 < σ_Y² < ∞. As n → ∞, the distribution of (Ȳ − μ_Y)/σ_Ȳ (where σ_Ȳ² = σ_Y²/n) becomes arbitrarily well approximated by the standard normal distribution.

Figure 2.10  Distribution of the Standardized Sample Average of n Draws from a Skewed Distribution

[Four panels: (a) the population distribution, and the standardized sample average for (b) n = 5, (c) n = 25, and (d) n = 100.]

The figures show the sampling distribution of the standardized sample average of n draws from the skewed (asymmetric) population distribution shown in Figure 2.10a. When n is small (n = 5), the sampling distribution, like the population distribution, is skewed. But when n is large (n = 100), the sampling distribution is well approximated by a standard normal distribution (solid line), as predicted by the central limit theorem. The normal distribution is scaled so that the height of the distributions is approximately the same in all figures.
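The behavior in Figure 2.10 can be reproduced with a short simulation (an illustrative sketch, not from the text, using an exponential population as one convenient right-skewed choice): the skewness of the standardized sample average shrinks toward 0, the value for a normal distribution, as n grows.

```python
import math, random

def skewness(xs):
    # Sample skewness: mean cubed deviation divided by the cubed
    # standard deviation.
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / var ** 1.5

def standardized_mean_draws(n, reps, seed=0):
    # Draws of (Ybar - mu)/(sigma/sqrt(n)) for samples from a right-skewed
    # exponential(1) population (mu = sigma = 1, population skewness = 2).
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        ybar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        draws.append((ybar - 1.0) / (1.0 / math.sqrt(n)))
    return draws

for n in (5, 25, 100):
    print(n, round(skewness(standardized_mean_draws(n, reps=4000)), 3))
```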

Summary

1. The probabilities with which a random variable takes on different values are summarized by the cumulative distribution function, the probability distribution (for discrete random variables), and the probability density function (for continuous random variables).
2. The expected value of a random variable Y (also called its mean, μ_Y), denoted E(Y), is its probability-weighted average value. The variance of Y is σ_Y² = E[(Y − μ_Y)²], and the standard deviation of Y is the square root of its variance.
3. The joint probabilities for two random variables X and Y are summarized by their joint probability distribution. The conditional probability distribution of Y given X = x is the probability distribution of Y, conditional on X taking on the value x.
4. A normally distributed random variable has the bell-shaped probability density in Figure 2.5. To calculate a probability associated with a normal random variable, first standardize the variable and then use the standard normal cumulative distribution tabulated in Appendix Table 1.
5. Simple random sampling produces n random observations Y_1, …, Y_n that are independently and identically distributed (i.i.d.).
6. The sample average, Ȳ, varies from one randomly chosen sample to the next and thus is a random variable with a sampling distribution. If Y_1, …, Y_n are i.i.d., then:
a. the sampling distribution of Ȳ has mean μ_Y and variance σ_Ȳ² = σ_Y²/n;
b. the law of large numbers says that Ȳ converges in probability to μ_Y; and
c. the central limit theorem says that the standardized version of Ȳ, (Ȳ − μ_Y)/σ_Ȳ, has a standard normal distribution [N(0, 1) distribution] when n is large.

Key Terms

outcomes; probability; sample space; event; discrete random variable; continuous random variable; probability distribution; cumulative probability distribution; cumulative distribution function (c.d.f.); Bernoulli random variable; Bernoulli distribution; probability density function; density function; density; expected value; expectation; mean; variance; standard deviation; moments of a distribution; skewness; kurtosis; outlier; leptokurtic; rth moment; joint probability distribution; marginal probability distribution; conditional distribution; conditional expectation; conditional mean; law of iterated expectations; conditional variance; independently distributed; independent; covariance; correlation; uncorrelated; normal distribution; standard normal distribution; standardize a variable; multivariate normal distribution; bivariate normal distribution; chi-squared distribution; Student t distribution; F distribution; simple random sampling; population; identically distributed; independently and identically distributed (i.i.d.); sample average; sample mean; sampling distribution; exact (finite-sample) distribution; asymptotic distribution; law of large numbers; convergence in probability; consistency; central limit theorem; asymptotic normal distribution

Review the Concepts

2.1 Examples of random variables used in this chapter included (a) the gender of the next person you meet, (b) the number of times a computer crashes, (c) the time it takes to commute to school, (d) whether the computer you are assigned in the library is new or old, and (e) whether it is raining or not. Explain why each can be thought of as random.

2.2 Suppose that the random variables X and Y are independent and you know their distributions. Explain why knowing the value of X tells you nothing about the value of Y.

2.3 Suppose that X denotes the amount of rainfall in your hometown during a given month and Y denotes the number of children born in Los Angeles during the same month. Are X and Y independent? Explain.

2.4 An econometrics class has 80 students, and the mean student weight is 145 lb. A random sample of four students is selected from the class, and their average weight is calculated. Will the average weight of the students in the sample equal 145 lb? Why or why not? Use this example to explain why the sample average, Ȳ, is a random variable.

2.5 Suppose that Y is a random variable with μ_Y = 0, σ_Y = 1, skewness = 0, and kurtosis = 100. Sketch a hypothetical probability distribution of Y. Explain why n random variables drawn from this distribution might have some large outliers.

2.6 Suppose that Y_1, …, Y_n are i.i.d. random variables with a N(1, 4) distribution. Sketch the probability density of Ȳ when n = 2. Repeat this for n = 10 and n = 100. In words, describe how the densities differ. What is the relationship between your answer and the law of large numbers?

2.7 Suppose that Y_1, …, Y_n are i.i.d. random variables with the probability distribution given in Figure 2.10a. You want to calculate Pr(Ȳ ≤ c) for a constant c. Would it be reasonable to use the normal approximation if n = 5? What about n = 25 or n = 100? Explain.

Exercises

2.1 Let Y denote the number of "heads" that occur when two coins are tossed.
a. Derive the probability distribution of Y.
b. Derive the cumulative probability distribution of Y.
c. Derive the mean and variance of Y.

2.2 Use the probability distribution given in Table 2.2 to compute (a) E(Y) and E(X); (b) σ_X² and σ_Y²; and (c) σ_XY and corr(X, Y).

2.3 Using the random variables X and Y from Table 2.2, consider two new random variables W = 3 + 6X and V = 20 − 7Y. Compute (a) E(W) and E(V); (b) σ_W² and σ_V²; and (c) σ_WV and corr(W, V).

2.4 Suppose X is a Bernoulli random variable with P(X = 1) = p.
a. Show E(X³) = p.
b. Show E(X^k) = p for k > 0.
c. Suppose that p = 0.3. Compute the mean, variance, skewness, and kurtosis of X. (Hint: You might find it helpful to use the formulas given in Exercise 2.21.)

2.5 In September, Seattle's daily high temperature has a mean of 70°F and a standard deviation of 7°F. What are the mean, standard deviation, and variance in °C?

2.6 The following table gives the joint probability distribution between employment status and college graduation among those either employed or looking for work (unemployed) in the working-age U.S. population for 2008.

Joint Distribution of Employment Status and College Graduation in the U.S. Population Aged 25 and Greater, 2008

                              Unemployed (Y = 0)   Employed (Y = 1)   Total
Non-college grads (X = 0)          0.037                0.622         0.659
College grads (X = 1)              0.009                0.332         0.341
Total                              0.046                0.954         1.000

a. Compute E(Y).
b. The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by 1 − E(Y).
c. Calculate E(Y | X = 1) and E(Y | X = 0).
d. Calculate the unemployment rate for (i) college graduates and (ii) non-college graduates.
e. A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate? A non-college graduate?
f. Are educational achievement and employment status independent? Explain.

2.7 In a given population of two-earner male/female couples, male earnings have a mean of $40,000 per year and a standard deviation of $12,000. Female earnings have a mean of $45,000 per year and a standard deviation of $18,000. The correlation between male and female earnings for a couple is 0.80. Let C denote the combined earnings for a randomly selected couple.
a. What is the mean of C?
b. What is the covariance between male and female earnings?
c. What is the standard deviation of C?
d. Convert the answers to (a) through (c) from U.S. dollars ($) to euros (€).

2.8 The random variable Y has a mean of 1 and a variance of 4. Let Z = ½(Y − 1). Show that μ_Z = 0 and σ_Z² = 1.

2.9 X and Y are discrete random variables with the following joint distribution:

                         Value of Y
           14      22      30      40      65
X = 1     0.02    0.05    0.10    0.03    0.01
X = 5     0.17    0.15    0.05    0.02    0.01
X = 8     0.02    0.03    0.15    0.10    0.09

That is, Pr(X = 1, Y = 14) = 0.02, and so forth.
a. Calculate the probability distribution, mean, and variance of Y.
b. Calculate the probability distribution, mean, and variance of Y given X = 8.
c. Calculate the covariance and correlation between X and Y.

2.10 Compute the following probabilities:
a. If Y is distributed N(1, 4), find Pr(Y ≤ 3).
b. If Y is distributed N(3, 9), find Pr(Y > 0).
c. If Y is distributed N(50, 25), find Pr(40 ≤ Y ≤ 52).
d. If Y is distributed N(5, 2), find Pr(6 ≤ Y ≤ 8).

2.11 Compute the following probabilities:
a. If Y is distributed χ²₄, find Pr(Y ≤ 7.78).
b. If Y is distributed χ²₁₀, find Pr(Y > 18.31).
c. If Y is distributed F₁₀,∞, find Pr(Y > 1.83).
d. Why are the answers to (b) and (c) the same?
e. If Y is distributed χ²₁, find Pr(Y ≤ 1.0). (Hint: Use the definition of the χ²₁ distribution.)

2.12 Compute the following probabilities:
a. If Y is distributed t₁₅, find Pr(Y > 1.75).
b. If Y is distributed t₉₀, find Pr(−1.99 ≤ Y ≤ 1.99).
c. If Y is distributed N(0, 1), find Pr(−1.99 ≤ Y ≤ 1.99).
d. Why are the answers to (b) and (c) approximately the same?
e. If Y is distributed F₇,₄, find Pr(Y > 4.12).
f. If Y is distributed F₇,₁₂₀, find Pr(Y > 2.79).

2.13 X is a Bernoulli random variable with Pr(X = 1) = 0.99, Y is distributed N(0, 1), W is distributed N(0, 100), and X, Y, and W are independent. Let S = XY + (1 − X)W. (That is, S = Y when X = 1, and S = W when X = 0.)
a. Show that E(Y²) = 1 and E(W²) = 100.
b. Show that E(Y³) = 0 and E(W³) = 0. (Hint: What is the skewness for a symmetric distribution?)
c. Show that E(Y⁴) = 3 and E(W⁴) = 3 × 100². (Hint: Use the fact that the kurtosis is 3 for a normal distribution.)
d. Derive E(S), E(S²), E(S³), and E(S⁴). (Hint: Use the law of iterated expectations, conditioning on X = 0 and X = 1.)
e. Derive the skewness and kurtosis for S.

2.14 In a population, μ_Y = 100 and σ_Y² = 43. Use the central limit theorem to answer the following questions:
a. In a random sample of size n = 100, find Pr(Ȳ ≤ 101).
b. In a random sample of size n = 165, find Pr(Ȳ > 98).
c. In a random sample of size n = 64, find Pr(101 ≤ Ȳ ≤ 103).

2.15 Suppose Y_i, i = 1, …, n, are i.i.d. random variables, each distributed N(10, 4).
a. Compute Pr(9.6 ≤ Ȳ ≤ 10.4) when (i) n = 20, (ii) n = 100, and (iii) n = 1,000.
b. Suppose c is a positive number. Show that Pr(10 − c ≤ Ȳ ≤ 10 + c) becomes close to 1.0 as n grows large.
c. Use your answer in (b) to argue that Ȳ converges in probability to 10.

2.16 Y is distributed N(5, 100) and you want to calculate Pr(Y < 3.6). Unfortunately, you do not have your textbook and do not have access to a normal probability table like Appendix Table 1. However, you do have your computer and a computer program that can generate i.i.d. draws from the N(5, 100) distribution. Explain how you can use your computer to compute an accurate approximation for Pr(Y < 3.6).

2.17 Y_i, i = 1, …, n, are i.i.d. Bernoulli random variables with p = 0.4. Let Ȳ denote the sample mean.
a. Use the central limit theorem to compute approximations for
i. Pr(Ȳ ≥ 0.43) when n = 100.
ii. Pr(Ȳ ≤ 0.37) when n = 400.
b. How large would n need to be to ensure that Pr(0.39 ≤ Ȳ ≤ 0.41) ≥ 0.95? (Use the central limit theorem to compute an approximate answer.)
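Exercise 2.16's question has a direct answer by simulation (a sketch with illustrative parameter choices, not a supplied solution): estimate Pr(Y < 3.6) by the fraction of simulated N(5, 100) draws below 3.6, which the law of large numbers drives toward the true probability; the closed-form value Φ((3.6 − 5)/10) is computed alongside for comparison.

```python
import math, random

def normal_cdf(z):
    # Standard normal CDF, Phi(z), via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mc_prob_below(cutoff, mu, sigma, reps, seed=0):
    # Fraction of i.i.d. N(mu, sigma^2) draws falling below `cutoff`;
    # by the law of large numbers this converges to Pr(Y < cutoff).
    rng = random.Random(seed)
    hits = sum(rng.gauss(mu, sigma) < cutoff for _ in range(reps))
    return hits / reps

exact = normal_cdf((3.6 - 5.0) / 10.0)        # Y ~ N(5, 100), so sd = 10
approx = mc_prob_below(3.6, 5.0, 10.0, reps=100_000)
print(round(exact, 4), round(approx, 4))
```

The same `normal_cdf` helper can be used to check the normal-distribution parts of Exercises 2.10 and 2.12 against Appendix Table 1.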

2.18 In any year, the weather can inflict storm damage to a home. From year to year, the damage is random. Let Y denote the dollar value of damage in any given year. Suppose that in 95% of the years Y = $0, but in 5% of the years Y = $20,000.
a. What are the mean and standard deviation of the damage in any year?
b. Consider an "insurance pool" of 100 people whose homes are sufficiently dispersed so that, in any year, the damage to different homes can be viewed as independently distributed random variables. Let Ȳ denote the average damage to these 100 homes in a year. (i) What is the expected value of the average damage Ȳ? (ii) What is the probability that Ȳ exceeds $2000?

2.19 Consider two random variables X and Y. Suppose that Y takes on k values y_1, …, y_k and that X takes on l values x_1, …, x_l.
a. Show that Pr(Y = y_j) = Σ_{i=1}^{l} Pr(Y = y_j | X = x_i) Pr(X = x_i). [Hint: Use the definition of Pr(Y = y_j | X = x_i).]
b. Use your answer to (a) to verify Equation (2.19).
c. Suppose that X and Y are independent. Show that σ_XY = 0 and corr(X, Y) = 0.

2.20 Consider three random variables X, Y, and Z. Suppose that Y takes on k values y_1, …, y_k, that X takes on l values x_1, …, x_l, and that Z takes on m values z_1, …, z_m. The joint probability distribution of X, Y, and Z is Pr(X = x, Y = y, Z = z), and the conditional probability distribution of Y given X and Z is Pr(Y = y | X = x, Z = z) = Pr(Y = y, X = x, Z = z)/Pr(X = x, Z = z).
a. Explain how the marginal probability that Y = y can be calculated from the joint probability distribution. [Hint: This is a generalization of Equation (2.16).]
b. Show that E(Y) = E[E(Y | X, Z)]. [Hint: This is a generalization of Equations (2.19) and (2.20).]

2.21 X is a random variable with moments E(X), E(X²), E(X³), and so forth.
a. Show E(X − μ)³ = E(X³) − 3[E(X²)][E(X)] + 2[E(X)]³.
b. Show E(X − μ)⁴ = E(X⁴) − 4[E(X)][E(X³)] + 6[E(X)]²[E(X²)] − 3[E(X)]⁴.

2.22 Suppose you have some money to invest, for simplicity $1, and you are planning to put a fraction w into a stock market mutual fund and the rest, 1 − w, into a bond mutual fund. Suppose that $1 invested in a stock fund yields R_s after 1 year and that $1 invested in a bond fund yields R_b. Suppose that R_s is random with mean 0.08 (8%) and standard deviation 0.07, and suppose that R_b is random with mean 0.05 (5%) and standard deviation 0.04. The correlation between R_s and R_b is 0.25. If you place a fraction w of your money in the stock fund and the rest, 1 − w, in the bond fund, then the return on your investment is R = wR_s + (1 − w)R_b.
a. Suppose that w = 0.5. Compute the mean and standard deviation of R.
b. Suppose that w = 0.75. Compute the mean and standard deviation of R.
c. What value of w makes the mean of R as large as possible? What is the standard deviation of R for this value of w?
d. (Harder) What is the value of w that minimizes the standard deviation of R? (Show using a graph, algebra, or calculus.)

2.23 This exercise provides an example of a pair of random variables X and Y for which the conditional mean of Y given X depends on X but corr(X, Y) = 0. Let X and Z be two independently distributed standard normal random variables, and let Y = X² + Z.
a. Show that E(Y | X) = X².
b. Show that μ_Y = 1.
c. Show that E(XY) = 0. (Hint: Use the fact that the odd moments of a standard normal random variable are all zero.)
d. Show that cov(X, Y) = 0 and thus corr(X, Y) = 0.

2.24 Suppose Y_i is distributed i.i.d. N(0, σ²) for i = 1, 2, …, n.
a. Show that E(Y_i²/σ²) = 1.
b. Show that W = (1/σ²)Σ_{i=1}^{n} Y_i² is distributed χ²_n.
c. Show that E(W) = n. [Hint: Use your answer to (a).]
d. Show that V = Y_1/√(Σ_{i=2}^{n} Y_i²/(n − 1)) is distributed t_{n−1}.

2.25 (Review of summation notation) Let x_1, …, x_n denote a sequence of numbers, y_1, …, y_n denote another sequence of numbers, and a, b, and c denote three constants. Show that:
a j=l = na n II II II d. 2. b.E(Xj Z).X a.3.26 Suppose that l'l. j=! II = a LX. 11. - a. show that var( Y) '" pUy. are random variables with a common mean t-v.E(a + bY l]'J = El[b(Y .30).l'y1]'} ~ b'E[(Y .=1 tl II n b. . To derive Equation (2. Show that cov( Y.) = i=l n ?Xi 1=1 + ?Yi 1=1 c.XiYi . so that V = [X ..1 )/n ]pUy. and let W ~ X .. + ey. - = /"y and var(Y) = ~af + ~paf. [Hint: Let h(Z) g(Z) . Show that E( V') "" E(W'). Lax. When n is very large.h(Z). + 2ab ~Xi + 2ae ~Yi + 1=1 1-1 I-I 2be 2..E(XI Z)] .27 X and Z are two jointly distributed random variables.I'v)'] b'u9. Show that E( WZ) ~ O.) b. lj) = perf for i '" j. i=1 i(a + n bx. . a common variance O"~l and the same correlation p (so that the correlation between Y. (Xi + y.62 CHAPTER2 Review of Probability II a. ..=1 + e' 2.)' = no' + b' 2.=1 2.y. Show tbat E(W) = O. Derive E(V'). 2. Suppose that n = 2. Show tbat E(Y) Co For n "" 2. Suppose the value of Z. 2 2 var(Y) = av/n + [(n . .3 This appendix derives the equations in Key Concept 2. but not the value of X.1'. and lj is equal to p for all pairs i and j. . 2. Equation (2. .1 Derivation of Results in Key Concept 2.] APPENDIX 2.x. show that E(Y) = /"y and d.29) follows from the definition of the expectation.(Hint: use the law of iterated expectations. c. and V =X - X = denote its error. you know a guess of denote tbe X ~ E(XI Z) denote the value of X using the information on Z. use the definition of the variance to write yare a = + bY) = El[a + bY . where i '" j). Let error associated with this guess. Let X = g(Z) denote anotber guess of X using Z.

use the definition of the variance to write 63 var(aX + bY) = E{[ (aX + bY) . and the fourth equality follows by the definition of the variance and To derive Equation (2. Because u} + 0'9 + 2( -uxy/u})aXY (2.E(a + bX + cV)][Y -I"Y]} =E ([b(X-l"x) + c(V -l"v)][Y -I"y]j -I"y]j -I"y]j = E {[b(X -l"x)][Y = buXY + E ([c(V -l"v)][Y + caVY. inequality implies that cr}y/(crlcr'f.a}yja}.33). covariance. the third equality follows by expand.51) it must be that uf . To derive Equation (2.51) var(aX + Y) is a variance. .3 To derive Equation (2. it cannot be negative.Applying Equation (2.34). we have that var(aX+ Y) =a2ul+uf+2auXY ~ (-uxy/u})' = u~.49) +E[b'(Y-I'Y)'] = a'var(X) + 2abcov(X. We now prove the correlation inequality in Equation (2.(al"x + bI"Y)]') = E{[a(X -I"x) + b(Y -I"Y)]'} = E[ a'(X -I"x)'] + 2E[ ab(X . equivalently.u'j-yja} ~ O. -I"Y) + l"yJ'} = E[(Y -I"Y)'] + To derive Equation (2.31).):5 !crxy!(uxu). where the second equality follows by collecting terms.I"Y) + I"y]j = oS Y -I"Y)] + I"xE(Y -I"Y) + l"yE(X -I"x) + I"x I"Y = aXY + I"x I"Y' 1. Y-I"Y) +I"~= u9 + 1"9 because E(Y-I'Y) = E{[(Y = O. write E(Y') 21"YEI.50) which is Equation (2.52) lor. ~ a'a} + 2aba xy + b'a}. ing the quadratic. so from the final line of Equa- tion (2.35).32). use the definition of the covariance to write cov(a + bX + cV. (2. that is.Rearranging this inequality yields s uh The covariance :5 a}u~ (covariance inequality). I carr (X.33).)1 inequality. which (using the definition of the correlation) proves the correlation I carr (X Y)I -s 1.I"x) + I"x][( Y .Derivationof Results in KeyConcept 2. Y)I Let a = -UXY/ a} and b = 1.31). (2. 1.I"x)(Y -I"Y) ] Y) + b' var( Y) (2. write E(XY) E[(X -I"x)( = E {[(X . Y) ~ E{[a + bX + cV .

we can use this sample to reach tentative conclusions-to draw statistical inferences-about characteristics of the full population. and compiling and analyzing the data takes ten years.1. if so. to estimate an interval or range for an unknown population characteristic.3 review estimation. Estimation entails computing a "best guess" numerical value for an unknown characteristic testing entails formulating a specific hypothesis about the population. Despite this extraordinary commitment. Hypothesis then using sample evidence to decide whether it is true. hypothesis testing. population. we might survey. however. of a population distribution. measuring the earnings of each worker and thus finding the population distribution of earnings.CHAPTER 3 Review of Statistics populations of interest. and confidence Intervals In the context of statisticalinference about an unknown mean. Confidence intervals use a set of data. Sections 3. what is the mean of the distribution S tatistics is the science of using data to learn about the world around us. The 2000 l. and 3. The key insight of statisticsis that one can learn about a population distribution selected by selecting a random sample from that population. Thus a difJerent.say. 1000 members of the population.' . managing and conducting the surveys. 3. 64 population .S. more practicalapproach is needed. One way to answer these questions would be to perform an exhaustive survey of the population of workers. In practice. hypothesis testing. such as its mean. Statistical of earnings tools help us answer questionsabout unknown characteristics of distributions in of recent college graduates' Do mean earnings differ for rnen and women. The only comprehensive survey of the U. and confidence intervals.2. at random by simple random sampling. such a comprehensive survey would be extremely expensive. and. by how much' These questions relate to the distribution of earnings in the population of workers. from a sample of data. 
Census cost $10 billion. many members of the population slip through the cracks and are not surveyed. . Using statistical methods.S.s. Rather than survey the entire U. For example.l. The process of designing the census forms. and the 2010 Censuscould cost $15 billion or more. population is the decennial census. Three types of statistical methods are used throughout econometrics: estimation.

Most of the interesting questions in economics involve relationships between two or more variables or comparisons between different populations. For example, is there a gap between the mean earnings for male and female recent college graduates? In Section 3.4, the methods for learning about the mean of a single population in Sections 3.1 through 3.3 are extended to compare means in two different populations. Section 3.5 discusses how the methods for comparing the means of two populations can be used to estimate causal effects in experiments. Sections 3.2 through 3.5 focus on the use of the normal distribution for performing hypothesis tests and for constructing confidence intervals when the sample size is large. In some special circumstances, hypothesis tests and confidence intervals can be based on the Student t distribution instead of the normal distribution; these special circumstances are discussed in Section 3.6. The chapter concludes with a discussion of the sample correlation and scatterplots in Section 3.7.

3.1 Estimation of the Population Mean

Suppose you want to know the mean value of Y (that is, the population mean μ_Y), such as the mean earnings of women recently graduated from college. A natural way to estimate this mean is to compute the sample average Ȳ from a sample of n independently and identically distributed (i.i.d.) observations Y1, ..., Yn (recall that Y1, ..., Yn are i.i.d. if they are collected by simple random sampling). This section discusses estimation of μ_Y and the properties of Ȳ as an estimator of μ_Y.

Estimators and Their Properties

Estimators. The sample average Ȳ is a natural way to estimate μ_Y, but it is not the only way. For example, another way to estimate μ_Y is simply to use the first observation, Y1. Both Ȳ and Y1 are functions of the data that are designed to estimate μ_Y; using the terminology in Key Concept 3.1, both are estimators of μ_Y. When evaluated in repeated samples, Ȳ and Y1 take on different values (they produce different estimates) from one sample to the next; thus the estimators Ȳ and Y1 both have sampling distributions. There are, in fact, many estimators of μ_Y, of which Ȳ and Y1 are two examples.

Given so many possible estimators, what makes one estimator "better" than another? Because estimators are random variables, this question can be phrased more precisely: What are desirable characteristics of the sampling distribution of an estimator? In general, we would like an estimator that gets as close as possible to the unknown true value, at least in some average sense.

In other words, we would like the sampling distribution of an estimator to be as tightly centered on the unknown value as possible. This observation leads to three specific desirable characteristics of an estimator: unbiasedness (a lack of bias), consistency, and efficiency.

Unbiasedness. Suppose you evaluate an estimator many times over repeated randomly drawn samples. It is reasonable to hope that, on average, you would get the right answer. Thus a desirable property of an estimator is that the mean of its sampling distribution equals μ_Y; if so, the estimator is said to be unbiased. To state this concept mathematically, let μ̂_Y denote some estimator of μ_Y. The estimator μ̂_Y is unbiased if E(μ̂_Y) = μ_Y, where E(μ̂_Y) is the mean of the sampling distribution of μ̂_Y; otherwise, μ̂_Y is biased.

Consistency. Another desirable property of an estimator μ̂_Y is that, when the sample size is large, the uncertainty about the value of μ_Y arising from random variations in the sample is very small. Stated more precisely, a desirable property of μ̂_Y is that the probability that it is within a small interval of the true value μ_Y approaches 1 as the sample size increases; that is, μ̂_Y is consistent for μ_Y (Key Concept 2.6).

Variance and efficiency. Suppose you have two candidate estimators, μ̂_Y and μ̃_Y, both of which are unbiased. How might you choose between them? One way to do so is to choose the estimator with the tightest sampling distribution. This suggests choosing between μ̂_Y and μ̃_Y by picking the estimator with the smallest variance. If μ̂_Y has a smaller variance than μ̃_Y, then μ̂_Y is said to be more efficient than μ̃_Y. The terminology "efficiency" stems from the notion that if μ̂_Y has a smaller variance than μ̃_Y, then it uses the information in the data more efficiently than does μ̃_Y.

KEY CONCEPT 3.1 Estimators and Estimates

An estimator is a function of a sample of data to be drawn randomly from a population. An estimate is the numerical value of the estimator when it is actually computed using data from a specific sample. An estimator is a random variable because of randomness in selecting the sample, while an estimate is a nonrandom number.

KEY CONCEPT 3.2 Bias, Consistency, and Efficiency

Let μ̂_Y be an estimator of μ_Y. Then:
• The bias of μ̂_Y is E(μ̂_Y) − μ_Y.
• μ̂_Y is an unbiased estimator of μ_Y if E(μ̂_Y) = μ_Y.
• μ̂_Y is a consistent estimator of μ_Y if μ̂_Y converges in probability to μ_Y.
• Let μ̃_Y be another estimator of μ_Y, and suppose that both μ̂_Y and μ̃_Y are unbiased. Then μ̂_Y is said to be more efficient than μ̃_Y if var(μ̂_Y) < var(μ̃_Y).

Properties of Ȳ

How does Ȳ fare as an estimator of μ_Y when judged by the three criteria of bias, consistency, and efficiency?

Bias and consistency. The sampling distribution of Ȳ has already been examined in Sections 2.5 and 2.6. As shown in Section 2.5, E(Ȳ) = μ_Y, so Ȳ is an unbiased estimator of μ_Y. Similarly, the law of large numbers (Key Concept 2.6) states that Ȳ converges in probability to μ_Y; that is, Ȳ is consistent.

Efficiency. What can be said about the efficiency of Ȳ? Because efficiency entails a comparison of estimators, we need to specify the estimator or estimators to which Ȳ is to be compared. We start by comparing the efficiency of Ȳ to the estimator Y1. Because Y1, ..., Yn are i.i.d., the mean of the sampling distribution of Y1 is E(Y1) = μ_Y; thus Y1 is an unbiased estimator of μ_Y. Its variance is var(Y1) = σ_Y². From Section 2.5, the variance of Ȳ is σ_Y²/n. Thus, for n ≥ 2, the variance of Ȳ is less than the variance of Y1; that is, Ȳ is a more efficient estimator than Y1, so, according to the criterion of efficiency, Ȳ should be used instead of Y1. The estimator Y1 might strike you as an obviously poor estimator (why would you go to the trouble of collecting a sample of n observations only to throw away all but the first?), and the concept of efficiency provides a formal way to show that Ȳ is a more desirable estimator than Y1.
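These properties are easy to check by simulation. The following sketch (not from the text; the normal population, parameter values, and number of replications are illustrative assumptions) draws many random samples and compares the sampling distributions of Ȳ and Y1: both are centered near μ_Y, so both are unbiased, but Ȳ has a far smaller variance, so it is more efficient.

```python
# A Monte Carlo sketch (illustrative assumptions, not from the text): draw
# many samples of size n and compare the estimators Ybar and Y1.
import numpy as np

rng = np.random.default_rng(0)
mu_Y, sigma_Y, n, reps = 20.0, 5.0, 100, 20_000

samples = rng.normal(mu_Y, sigma_Y, size=(reps, n))
ybar = samples.mean(axis=1)    # the estimator Ybar, computed once per sample
y1 = samples[:, 0]             # the estimator Y1, computed once per sample

# Both means are near mu_Y = 20 (unbiased), but var(Ybar) is about
# sigma_Y^2/n = 0.25 while var(Y1) is about sigma_Y^2 = 25.
print(f"mean(Ybar) = {ybar.mean():.2f}, mean(Y1) = {y1.mean():.2f}")
print(f"var(Ybar)  = {ybar.var():.3f}, var(Y1)  = {y1.var():.2f}")
```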

What about a less obviously poor estimator? Consider the weighted average in which the observations are alternately weighted by 1/2 and 3/2:

Ỹ = (1/n)[(1/2)Y1 + (3/2)Y2 + (1/2)Y3 + (3/2)Y4 + ... + (1/2)Y(n−1) + (3/2)Yn],    (3.1)

where the number of observations n is assumed to be even for convenience. The mean of Ỹ is μ_Y and its variance is var(Ỹ) = 1.25σ_Y²/n (Exercise 3.11). Thus Ỹ is unbiased and, because var(Ỹ) → 0 as n → ∞, Ỹ is consistent. However, Ỹ has a larger variance than Ȳ. Thus Ȳ is more efficient than Ỹ.

The estimators Ȳ, Y1, and Ỹ have a common mathematical structure: They are weighted averages of Y1, ..., Yn. The comparisons in the previous two paragraphs show that the weighted averages Y1 and Ỹ have larger variances than Ȳ. In fact, these conclusions reflect a more general result: Ȳ is the most efficient estimator of all unbiased estimators that are weighted averages of Y1, ..., Yn. Said differently, Ȳ is the Best Linear Unbiased Estimator (BLUE); that is, it is the most efficient (best) estimator among all estimators that are unbiased and are linear functions of Y1, ..., Yn. This result is stated in Key Concept 3.3 and is proven in Chapter 5.

KEY CONCEPT 3.3 Efficiency of Ȳ: Ȳ Is BLUE

Let μ̂_Y be an estimator of μ_Y that is a weighted average of Y1, ..., Yn, that is, μ̂_Y = (1/n)(a1·Y1 + ... + an·Yn), where a1, ..., an are nonrandom constants. If μ̂_Y is unbiased, then var(Ȳ) < var(μ̂_Y) unless μ̂_Y = Ȳ. Thus Ȳ is the Best Linear Unbiased Estimator (BLUE); that is, Ȳ is the most efficient estimator of μ_Y among all unbiased estimators that are weighted averages of Y1, ..., Yn.

Ȳ is the least squares estimator of μ_Y. The sample average Ȳ provides the best fit to the data in the sense that the average squared differences between the observations and Ȳ are the smallest of all possible estimators. Consider the problem of finding the estimator m that minimizes

∑(i=1 to n) (Yi − m)²,    (3.2)

which is a measure of the total squared gap or distance between the estimator and the sample points. Because m is an estimator of E(Y), you can think of it as a prediction of the value of Yi, so the gap Yi − m can be thought of as a prediction mistake. The sum of squared gaps in Expression (3.2) can thus be thought of as the sum of squared prediction mistakes.

The estimator m that minimizes the sum of squared gaps in Expression (3.2) is called the least squares estimator. One can imagine using trial and error to solve the least squares problem: Try many values of m until you are satisfied that you have the value that makes Expression (3.2) as small as possible. Alternatively, as is done in Appendix 3.2, you can use algebra or calculus to show that choosing m = Ȳ minimizes the sum of squared gaps in Expression (3.2), so that Ȳ is the least squares estimator of μ_Y.
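The trial-and-error approach described above can be carried out in a few lines. In this sketch (the small data set is an illustrative assumption, not from the text), a fine grid of trial values of m confirms that the minimizer of the sum of squared gaps in Expression (3.2) is the sample average.

```python
# Grid-search ("trial and error") illustration of the least squares problem:
# the m that minimizes sum_i (Y_i - m)^2 turns out to be the sample average.
Y = [19.0, 21.5, 24.0, 18.5, 22.0]       # illustrative data, not from the text

def sum_squared_gaps(m, ys):
    return sum((yi - m) ** 2 for yi in ys)

# Try many values of m on a fine grid and keep the best one.
grid = [15.0 + k / 100 for k in range(1301)]      # 15.00, 15.01, ..., 28.00
m_star = min(grid, key=lambda m: sum_squared_gaps(m, Y))

ybar = sum(Y) / len(Y)
print(m_star, ybar)   # both are 21.0
```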

The Importance of Random Sampling

We have assumed that Y1, ..., Yn are i.i.d. draws, such as those that would be obtained from simple random sampling. This assumption is important because nonrandom sampling can result in Ȳ being biased. Suppose that, to estimate the monthly national unemployment rate, a statistical agency adopts a sampling scheme in which interviewers survey working-age adults sitting in city parks at 10:00 A.M. on the second Wednesday of the month. Because most employed people are at work at that hour (not sitting in the park!), the unemployed are overly represented in the sample, and an estimate of the unemployment rate based on this sampling plan would be biased. This bias arises because this sampling scheme overrepresents, or oversamples, the unemployed members of the population. This example is fictitious, but the "Landon Wins!" box gives a real-world example of biases introduced by sampling that is not entirely random.

Landon Wins!

Shortly before the 1936 U.S. presidential election, the Literary Gazette published a poll indicating that Alf M. Landon would defeat the incumbent, Franklin D. Roosevelt, by a landslide: 57% to 43%. The Gazette was right that the election was a landslide, but it was wrong about the winner: Roosevelt won, by 59% to 41%! How could the Gazette have made such a big mistake? The Gazette's sample was chosen from telephone records and automobile registration files. But in 1936 many households did not have cars or telephones, and those that did tended to be richer and were also more likely to be Republican. Because the telephone survey did not sample randomly from the population but instead undersampled Democrats, the estimator was biased and the Gazette made an embarrassing mistake. Do you think surveys conducted over the Internet might have a similar problem with bias?

It is important to design sample selection schemes in a way that minimizes bias. Appendix 3.1 includes a discussion of what the Bureau of Labor Statistics actually does when it conducts the U.S. Current Population Survey (CPS), the survey it uses to estimate the monthly U.S. unemployment rate.

3.2 Hypothesis Tests Concerning the Population Mean

Many hypotheses about the world around us can be phrased as yes/no questions. Do the mean hourly earnings of recent U.S. college graduates equal $20 per hour? Are mean earnings the same for male and female college graduates? Both these questions embody specific hypotheses about the population distribution of earnings. The statistical challenge is to answer these questions based on a sample of evidence. This section describes hypothesis tests concerning the population mean (Does the population mean of hourly earnings equal $20?). Hypothesis tests involving two populations (Are mean earnings the same for men and women?) are taken up in Section 3.4.

Null and Alternative Hypotheses

The starting point of statistical hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Hypothesis testing entails using data to compare the null hypothesis to a second hypothesis, called the alternative hypothesis, that holds if the null does not.

The null hypothesis is that the population mean, E(Y), takes on a specific value, denoted μ_Y,0. The null hypothesis is denoted H0 and thus is

H0: E(Y) = μ_Y,0.    (3.3)

For example, the conjecture that, on average in the population, college graduates earn $20 per hour constitutes a null hypothesis about the population distribution of hourly earnings. Stated mathematically, if Y is the hourly earning of a randomly selected recent college graduate, then the null hypothesis is that E(Y) = 20; that is, μ_Y,0 = 20 in Equation (3.3).

The alternative hypothesis specifies what is true if the null hypothesis is not. The most general alternative hypothesis is that E(Y) ≠ μ_Y,0, which is called a two-sided alternative hypothesis because it allows E(Y) to be either less than or greater than μ_Y,0. The two-sided alternative is written as

H1: E(Y) ≠ μ_Y,0 (two-sided alternative).    (3.4)

One-sided alternatives are also possible, and these are discussed later in this section.

The problem facing the statistician is to use the evidence in a randomly selected sample of data to decide whether to accept the null hypothesis H0 or to reject it in favor of the alternative hypothesis H1. If the null hypothesis is "accepted," this does not mean that the statistician declares it to be true; rather, it is accepted tentatively with the recognition that it might be rejected later based on additional evidence. For this reason, statistical hypothesis testing can be posed as either rejecting the null hypothesis or failing to do so.

The p-Value

In any given sample, the sample average Ȳ will rarely be exactly equal to the hypothesized value μ_Y,0. Differences between Ȳ and μ_Y,0 can arise because the true mean in fact does not equal μ_Y,0 (the null hypothesis is false) or because the true mean equals μ_Y,0 (the null hypothesis is true) but Ȳ differs from μ_Y,0 because of random sampling. It is impossible to distinguish between these two possibilities with certainty. Although a sample of data cannot provide conclusive evidence about the null hypothesis, it is possible to do a probabilistic calculation that permits testing the null hypothesis in a way that accounts for sampling uncertainty. This calculation involves using the data to compute the p-value of the null hypothesis.

The p-value, also called the significance probability, is the probability of drawing a statistic at least as adverse to the null hypothesis as the one you actually computed in your sample, assuming the null hypothesis is correct. In the case at hand, the p-value is the probability of drawing Ȳ at least as far in the tails of its distribution under the null hypothesis as the sample average you actually computed.

For example, suppose that, in your sample of recent college graduates, the average wage is $22.64. The p-value is the probability of observing a value of Ȳ at least as different from $20 (the population mean under the null) as the observed value of $22.64, assuming that the null hypothesis is true. If this p-value is small, say 0.5%, then it is very unlikely that this sample would have been drawn if the null hypothesis is true; thus it is reasonable to conclude that the null hypothesis is not true. By contrast, if this p-value is large, say 40%, then it is quite likely that the observed sample average of $22.64 could have arisen just by random sampling variation if the null hypothesis is true, and it is reasonable not to reject the null hypothesis; in this case, the evidence against the null hypothesis is weak in this probabilistic sense.

To state the definition of the p-value mathematically, let Ȳ_act denote the value of the sample average actually computed in the data set at hand and let Pr_H0

denote the probability computed under the null hypothesis (that is, assuming that E(Yi) = μ_Y,0). The p-value is

p-value = Pr_H0[ |Ȳ − μ_Y,0| > |Ȳ_act − μ_Y,0| ].    (3.5)

That is, the p-value is the area in the tails of the distribution of Ȳ under the null hypothesis beyond |Ȳ_act − μ_Y,0|. If the p-value is large, then the observed value Ȳ_act is consistent with the null hypothesis, but if the p-value is small, it is not.

To compute the p-value, it is necessary to know the sampling distribution of Ȳ under the null hypothesis. As discussed in Section 2.6, when the sample size is small this distribution is complicated. However, according to the central limit theorem, when the sample size is large the sampling distribution of Ȳ is well approximated by a normal distribution. Under the null hypothesis the mean of this normal distribution is μ_Y,0, so under the null hypothesis Ȳ is distributed N(μ_Y,0, σ_Ȳ²), where σ_Ȳ² = σ_Y²/n. This large-sample normal approximation makes it possible to compute the p-value without needing to know the population distribution of Y, as long as the sample size is large. [An exception is when Yi is binary, so that its distribution is Bernoulli, in which case the variance is determined by the null hypothesis.] The details of the calculation, however, depend on whether σ_Y² is known.

Calculating the p-Value When σ_Y Is Known

The calculation of the p-value when σ_Y is known is summarized in Figure 3.1. If the sample size is large, then under the null hypothesis the sampling distribution of Ȳ is N(μ_Y,0, σ_Ȳ²), where σ_Ȳ² = σ_Y²/n. Thus, under the null hypothesis, the standardized version of Ȳ, (Ȳ − μ_Y,0)/σ_Ȳ, has a standard normal distribution. The p-value is the probability of obtaining a value of Ȳ farther from μ_Y,0 than Ȳ_act under the null hypothesis or, equivalently, the probability of obtaining (Ȳ − μ_Y,0)/σ_Ȳ greater than (Ȳ_act − μ_Y,0)/σ_Ȳ in absolute value. This probability is the shaded area shown in Figure 3.1. Written mathematically, the shaded tail probability (that is, the p-value) is

p-value = Pr_H0( |(Ȳ − μ_Y,0)/σ_Ȳ| > |(Ȳ_act − μ_Y,0)/σ_Ȳ| ) = 2Φ(−|(Ȳ_act − μ_Y,0)/σ_Ȳ|),    (3.6)

where Φ is the standard normal cumulative distribution function. That is, the p-value is the area in the tails of a standard normal distribution outside ±|(Ȳ_act − μ_Y,0)/σ_Ȳ|.

The formula for the p-value in Equation (3.6) depends on the variance of the population distribution, σ_Y². In practice, this variance is typically unknown.
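As a numerical sketch of Equation (3.6), the following computes the two-sided p-value when σ_Y is known; the values μ_Y,0 = 20, σ_Y = 10, n = 100, and Ȳ_act = 22 are illustrative assumptions, not from the text.

```python
# Two-sided p-value when sigma_Y is known, per Equation (3.6).
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_value_known_sigma(ybar_act, mu_0, sigma_Y, n):
    sigma_ybar = sigma_Y / sqrt(n)        # std. dev. of Ybar under the null
    z = (ybar_act - mu_0) / sigma_ybar    # standardized sample average
    return 2.0 * phi(-abs(z))             # area in both tails

p = p_value_known_sigma(ybar_act=22.0, mu_0=20.0, sigma_Y=10.0, n=100)
print(round(p, 4))   # the standardized average is 2.0, so P(|Z| > 2) ~ 0.0455
```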

Figure 3.1 Calculating a p-Value. The p-value is the probability of drawing a value of Ȳ that differs from μ_Y,0 by at least as much as Ȳ_act. In large samples, Ȳ is distributed N(μ_Y,0, σ_Ȳ²) under the null hypothesis, so (Ȳ − μ_Y,0)/σ_Ȳ is distributed N(0, 1). Thus the p-value is the shaded standard normal tail probability outside ±|(Ȳ_act − μ_Y,0)/σ_Ȳ|.

Because in general σ_Y² must be estimated before the p-value can be computed, we now turn to the problem of estimating σ_Y².

The Sample Variance, Sample Standard Deviation, and Standard Error

The sample variance s_Y² is an estimator of the population variance σ_Y²; the sample standard deviation s_Y is an estimator of the population standard deviation σ_Y; and the standard error of the sample average Ȳ is an estimator of the standard deviation of the sampling distribution of Ȳ.

The sample variance and standard deviation. The population variance, σ_Y², is the average value of (Y − μ_Y)² in the population distribution. Similarly, the sample variance is the sample average of (Yi − μ_Y)², i = 1, ..., n, with two modifications: First, μ_Y is replaced by Ȳ, and second, the average uses the divisor n − 1 instead of n. The formula for the sample variance, s_Y², is

s_Y² = [1/(n − 1)] ∑(i=1 to n) (Yi − Ȳ)².    (3.7)

The sample standard deviation, s_Y, is the square root of the sample variance.

KEY CONCEPT 3.4 The Standard Error of Ȳ

The standard error of Ȳ is an estimator of the standard deviation of Ȳ. The standard error of Ȳ is denoted SE(Ȳ) or σ̂_Ȳ (the caret "^" over the symbol means that it is an estimator of σ_Ȳ). When Y1, ..., Yn are i.i.d.,

SE(Ȳ) = σ̂_Ȳ = s_Y/√n.    (3.8)

The reason for the first modification (replacing μ_Y by Ȳ) is that μ_Y is unknown and thus must be estimated; the natural estimator of μ_Y is Ȳ. The reason for the second modification (dividing by n − 1 instead of by n) is that estimating μ_Y by Ȳ introduces a small downward bias in (Yi − Ȳ)². Specifically, as shown in Exercise 3.18, E[∑(i=1 to n)(Yi − Ȳ)²] = (n − 1)σ_Y², so E[(1/n)∑(Yi − Ȳ)²] = [(n − 1)/n]σ_Y². Dividing by n − 1 in Equation (3.7) instead of n corrects for this small downward bias, and as a result s_Y² is unbiased.

Dividing by n − 1 instead of by n is called a degrees of freedom correction: Estimating the mean uses up some of the information in the data (that is, it uses up 1 "degree of freedom"), so that only n − 1 degrees of freedom remain.

Consistency of the sample variance. The sample variance is a consistent estimator of the population variance:

s_Y² →p σ_Y².    (3.9)

In other words, the sample variance is close to the population variance with high probability when n is large. The result in Equation (3.9) is proven in Appendix 3.3 under the assumptions that Y1, ..., Yn are i.i.d. and Yi has a finite fourth moment; that is, E(Yi⁴) < ∞. Intuitively, the reason that s_Y² is consistent is that it is a sample average, so s_Y² obeys the law of large numbers. But for s_Y² to obey the law of large numbers in Key Concept 2.6, (Yi − μ_Y)² must have finite variance, which in turn means that E(Yi⁴) must be finite; in other words, Yi must have a finite fourth moment.

The standard error of Ȳ. Because the standard deviation of the sampling distribution of Ȳ is σ_Ȳ = σ_Y/√n, Equation (3.9) justifies using s_Y/√n as an estimator of σ_Ȳ. The estimator of σ_Ȳ, s_Y/√n, is called the standard error of Ȳ and is denoted SE(Ȳ) or σ̂_Ȳ. The standard error of Ȳ is summarized as Key Concept 3.4.
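Equation (3.7) and the standard error in Key Concept 3.4 can be sketched directly; the small data set below is an illustrative assumption, not from the text.

```python
# Sample variance with the n-1 divisor, sample standard deviation,
# and the standard error SE(Ybar) = s_Y / sqrt(n).
from math import sqrt

def sample_variance(y):
    n = len(y)
    ybar = sum(y) / n
    return sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # degrees-of-freedom correction

def standard_error(y):
    return sqrt(sample_variance(y)) / sqrt(len(y))       # SE(Ybar) = s_Y / sqrt(n)

y = [18.0, 22.0, 20.0, 24.0, 16.0]    # illustrative data; mean is 20
print(sample_variance(y), standard_error(y))   # 10.0 and sqrt(2) ~ 1.414
```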

When Y1, ..., Yn are i.i.d. draws from a Bernoulli distribution with success probability p, the formula for the variance of Ȳ simplifies to p(1 − p)/n (see Exercise 3.2), and the formula for the standard error also takes on a simple form that depends only on Ȳ and n:

SE(Ȳ) = √(Ȳ(1 − Ȳ)/n).    (3.10)

Calculating the p-Value When σ_Y Is Unknown

Because s_Y² is a consistent estimator of σ_Y², the p-value can be computed by replacing σ_Ȳ in Equation (3.6) by the standard error, SE(Ȳ) = σ̂_Ȳ. That is, when σ_Y is unknown and Y1, ..., Yn are i.i.d., the p-value is calculated using the formula

p-value = 2Φ(−|(Ȳ_act − μ_Y,0)/SE(Ȳ)|).    (3.11)

The t-Statistic

The standardized sample average (Ȳ − μ_Y,0)/SE(Ȳ) plays a central role in testing statistical hypotheses and has a special name, the t-statistic or t-ratio:

t = (Ȳ − μ_Y,0)/SE(Ȳ).    (3.12)

In general, a test statistic is a statistic used to perform a hypothesis test. The t-statistic is an important example of a test statistic.

Large-sample distribution of the t-statistic. When n is large, s_Y² is close to σ_Y² with high probability. Thus the distribution of the t-statistic is approximately the same as the distribution of (Ȳ − μ_Y,0)/σ_Ȳ, which in turn is well approximated by the standard normal distribution when n is large because of the central limit theorem. Accordingly, under the null hypothesis, t is approximately distributed N(0, 1) for large n.

The formula for the p-value in Equation (3.11) can be rewritten in terms of the t-statistic. Let t_act denote the value of the t-statistic actually computed:

t_act = (Ȳ_act − μ_Y,0)/SE(Ȳ).    (3.13)
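For Bernoulli data, Equation (3.10) can be evaluated directly from the sample proportion and n. A brief sketch (the counts below are illustrative assumptions, not from the text):

```python
# Standard error of the sample proportion for i.i.d. Bernoulli data,
# SE(Ybar) = sqrt(Ybar * (1 - Ybar) / n), per Equation (3.10).
from math import sqrt

def bernoulli_se(ybar: float, n: int) -> float:
    return sqrt(ybar * (1.0 - ybar) / n)

# Illustrative values (assumptions): 62 successes in a sample of n = 200.
n, successes = 200, 62
ybar = successes / n              # sample proportion, 0.31
print(round(bernoulli_se(ybar, n), 4))
```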

Accordingly, when n is large, the p-value can be calculated using

p-value = 2Φ(−|t_act|).    (3.14)

As a hypothetical example, suppose that a sample of n = 200 recent college graduates is used to test the null hypothesis that the mean wage, E(Y), is $20 per hour. The sample average wage is Ȳ_act = $22.64, and the sample standard deviation is s_Y = $18.14. Then the standard error of Ȳ is s_Y/√n = 18.14/√200 = 1.28. The value of the t-statistic is t_act = (22.64 − 20)/1.28 = 2.06. From Appendix Table 1, the p-value is 2Φ(−2.06) = 0.039, or 3.9%. That is, assuming the null hypothesis to be true, the probability of obtaining a sample average at least as different from the null as the one actually computed is 3.9%.

Hypothesis Testing with a Prespecified Significance Level

When you undertake a statistical hypothesis test, you can make two types of mistakes: You can incorrectly reject the null hypothesis when it is true, or you can fail to reject the null hypothesis when it is false. Hypothesis tests can be performed without computing the p-value if you are willing to specify in advance the probability you are willing to tolerate of making the first kind of mistake, that is, of incorrectly rejecting the null hypothesis when it is true. If you choose a prespecified probability of rejecting the null hypothesis when it is true (for example, 5%), then you will reject the null hypothesis if and only if the p-value is less than 0.05. This approach gives preferential treatment to the null hypothesis, but in many practical situations this preferential treatment is appropriate.

Hypothesis tests using a fixed significance level. Suppose it has been decided that the hypothesis will be rejected if the p-value is less than 5%. If n is large enough, then under the null hypothesis the t-statistic has a N(0, 1) distribution, and the area under the tails of the normal distribution outside ±1.96 is 5%. This gives a simple rule:

Reject H0 if |t_act| > 1.96.    (3.15)

That is, reject if the absolute value of the t-statistic computed from the sample is greater than 1.96. Thus the probability of erroneously rejecting the null hypothesis (rejecting the null hypothesis when it is in fact true) is 5%.
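The worked example above can be reproduced in a few lines; small discrepancies with the text's 0.039 reflect the rounding of intermediate values there.

```python
# Reproducing the chapter's numerical example: n = 200, Ybar_act = 22.64,
# s_Y = 18.14, and the null mu_{Y,0} = 20. Expect t ~ 2.06 and p ~ 0.039-0.040.
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, ybar_act, s_Y, mu_0 = 200, 22.64, 18.14, 20.0
se = s_Y / sqrt(n)                  # standard error of Ybar, about 1.28
t_act = (ybar_act - mu_0) / se      # about 2.06
p_value = 2.0 * phi(-abs(t_act))    # Equation (3.14)

print(f"SE = {se:.2f}, t = {t_act:.2f}, p-value = {p_value:.3f}")
```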

This framework for testing statistical hypotheses has some specialized terminology, summarized in Key Concept 3.5. The significance level of the test in Equation (3.15) is 5%, the critical value of this two-sided test is 1.96, and the rejection region is the set of values of the t-statistic outside ±1.96. If the test rejects at the 5% significance level, the population mean μ_Y is said to be statistically significantly different from μ_Y,0 at the 5% significance level.

KEY CONCEPT 3.5 The Terminology of Hypothesis Testing

A statistical hypothesis test can make two types of mistakes: a type I error, in which the null hypothesis is rejected when in fact it is true, and a type II error, in which the null hypothesis is not rejected when in fact it is false. The prespecified rejection probability of a statistical hypothesis test when the null hypothesis is true (that is, the prespecified probability of a type I error) is the significance level of the test. The critical value of the test statistic is the value of the statistic for which the test just rejects the null hypothesis at the given significance level. The set of values of the test statistic for which the test rejects the null hypothesis is the rejection region, and the set of values for which it does not reject the null hypothesis is the acceptance region. The probability that the test actually incorrectly rejects the null hypothesis when it is true is the size of the test, and the probability that the test correctly rejects the null hypothesis when the alternative is true is the power of the test.

The p-value is the probability of obtaining a test statistic, by random sampling variation, at least as adverse to the null hypothesis value as is the statistic actually observed, assuming that the null hypothesis is correct. Equivalently, the p-value is the smallest significance level at which you can reject the null hypothesis.

In the previous example of testing the hypothesis that the mean earnings of recent college graduates is $20 per hour, the t-statistic was 2.06. This value exceeds 1.96, so the hypothesis is rejected at the 5% level. Testing hypotheses using a prespecified significance level does not require computing p-values. Although performing the test with a 5% significance level is easy, reporting only whether the null hypothesis is rejected at a prespecified significance level conveys less information than reporting the p-value.

What significance level should you use in practice? In many cases, statisticians and econometricians use a 5% significance level.

If you were to test many statistical hypotheses at the 5% level, you would incorrectly reject the null on average once in 20 cases. Sometimes a more conservative significance level might be in order. For example, legal cases sometimes involve statistical evidence, and the null hypothesis could be that the defendant is not guilty; then one would want to be quite sure that a rejection of the null (conclusion of guilt) is not just a result of random sample variation. In some legal settings, the significance level used is 1%, or even 0.1%, to avoid this sort of mistake. Similarly, if a government agency is considering permitting the sale of a new drug, a very conservative standard might be in order so that consumers can be sure that the drugs available in the market actually work.

Being conservative, in the sense of using a very low significance level, has a cost: The smaller the significance level, the larger the critical value and the more difficult it becomes to reject the null when the null is false. In fact, the most conservative thing to do is never to reject the null hypothesis, but if that is your view, then you never need to look at any statistical evidence, for you will never change your mind! The lower the significance level, the lower the power of the test. Many economic and policy applications can call for less conservatism than a legal case, so a 5% significance level is often considered to be a reasonable compromise.

Key Concept 3.6 summarizes hypothesis tests for the population mean against the two-sided alternative.

KEY CONCEPT 3.6 Testing the Hypothesis E(Y) = μ_Y,0 Against the Alternative E(Y) ≠ μ_Y,0

1. Compute the standard error of Ȳ, SE(Ȳ) [Equation (3.8)].
2. Compute the t-statistic [Equation (3.13)].
3. Compute the p-value [Equation (3.14)]. Reject the hypothesis at the 5% significance level if the p-value is less than 0.05 (equivalently, if |t_act| > 1.96).

One-Sided Alternatives

In some circumstances, the alternative hypothesis might be that the mean exceeds μ_Y,0. For example, one hopes that education helps in the labor market, so the relevant alternative to the null hypothesis that earnings are the same for college graduates and nongraduates is not just that their earnings differ, but rather
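The three steps of Key Concept 3.6 can be collected into a single function. This is a sketch (the data passed in are illustrative assumptions, and the normal approximation presumes a large sample):

```python
# Two-sided test of H0: E(Y) = mu_0, following the steps of Key Concept 3.6.
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sided_test(y, mu_0, alpha=0.05):
    n = len(y)
    ybar = sum(y) / n
    s_Y = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    se = s_Y / sqrt(n)                       # Step 1: standard error, Eq. (3.8)
    t_act = (ybar - mu_0) / se               # Step 2: t-statistic, Eq. (3.13)
    p_value = 2.0 * phi(-abs(t_act))         # Step 3: p-value, Eq. (3.14)
    return t_act, p_value, p_value < alpha   # reject H0 at level alpha?

# Illustrative data whose mean (25) is far from the null value of 20.
y = [24.0, 25.0, 26.0, 25.0, 24.0, 26.0, 25.0, 25.0, 26.0, 24.0]
t_act, p, reject = two_sided_test(y, mu_0=20.0)
print(f"t = {t_act:.2f}, p = {p:.4f}, reject at 5%: {reject}")
```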

Specifically. Now pick another arbitrary ___________ iiiiiiiiiiiiiiiiii====~~.o is not rejected at the that u .. to test the one-sided hypothesis in Equation (3.96. is p-value The N(O. called a confidence interval.3 Confidence Intervalsfor the Population Mean 79 that graduates earn mare than non graduates.17) J) critical value for a one-sided test with a 5% significance level is 1.16). Test the null value !Lv. construct the I-statistic in Equation (3.64.a· The one-sided of the previous the 5% rejection IF instead the alternative hypothesis is that E(Y) < !LY. Such a set is in sible to use data from a random the true population called a confidence the possible sample to construct a set of values that contains probability that !Ly is contained mean t-v with a certain prespecified set.3 Confidence Intervals for the Population Mean Because oC random sampling error. That is.e general approach > !LyO (one-sided alternative).64. The confidence set for !Ly turns out to be all values of the mean between a lower and an upper Limit. r-siatistic. based on the N(O.64.l. region consists of values of the I-statistic less than -1.16) testing is the to computing same for one-sided alternatives as it is for two-sided alternatives. This is called a one-sided alternative hypothesis and can be written /-I. However. region For this test is all values of the z-statistic exceeding 1.13). and write down tbis non rejected value !Ly. then the discussion paragraph applies except that the signs are switched.16) concerns values of!LY exceeding The rejection !Ly. = !Lyo against the alternative that!LY '" !Ly.: E(Y) 11. call it !LY. so that the a 95% confidence set for the population mean.q)(I'''').a. it is impossible to learn the exact value of the population mean of Y using only the information in a sample. J) approximation = PrHo(Z > I'''') = 1. for example.. Here is one way to construct Begin by picking some arbitrary hypothesis r-statistic: value for the mean. (3.3. 3. the p-value.L. 
and the prespecified this set is called the confidence level.O by computing if it is less than 1. the confidence set is an interval.. this hypothesized 5% level. it is posprobability. "~_ .o. p-values and to hypothesis (3. The p-value is the to the distribution of the area under the standard normal distribution to the right of the calculated I-statistic. with the modifi- cation that only large positive values of the I-statistic reject the null hypothesis rather than values that are large in absolute value.a. hypothesis in Equation (3.
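The one-sided calculation in Equation (3.17) can be sketched in a few lines; the value of t_act below is an illustrative assumption, not from the text.

```python
# One-sided p-value, 1 - Phi(t_act), with 5% critical value 1.64, per Eq. (3.17).
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def one_sided_p_value(t_act: float) -> float:
    return 1.0 - phi(t_act)        # Pr(Z > t_act) under the null

t_act = 2.06                       # illustrative value of the computed t-statistic
print(f"one-sided p-value = {one_sided_p_value(t_act):.4f}, "
      f"reject at 5%: {t_act > 1.64}")
```

Note that for the same t-statistic the one-sided p-value is half the two-sided p-value, since only the right tail counts as evidence against the null.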

y.u .y.y 3. Y .96SE(Y)).5.5. This method of constructing a confidence set is impractical. Do this again and again.7 summarizes a Y. you can tell him whether shows that this set of values has a remarkable The clever reasoning his hypothesis is rejected or not simply by looking up his number on your handy list. your list will contain the true value of t-v.96 standard errors away from ues within ±1. 90%. Thus the set of values of My that are not rejected at the 5% level consists of those valfor My is Ythis approach. a trial value of iJ. for it requires you to test all possible values of iJ. According to the formula for the r-statistic in Equation (3. Suppose u-v is 21. the probability of rejecting the null hypothesis iJ.y= IY interval for iJ.. A bit of clever reasoning property: The probability the true value of that it contains the true value of the population mean is 95%.5 on 21. (although we do not know this). in particular you tested the true value. That is.y as null hypotheses.965£(1'). Then Y has a normal distribution centered and the z-statistic testing the null hypothesis u v = 21.80 CHAPTER 3 Reviewof Statistics ems. goes like this.y.58S£(Y)!.965£(1') 0. Fortunately. Continuing this process yields the set of all values of the population at the 5% level by a two-sided hypothesis test. do so for all possible values of the population mean. 1) distribution.96SE(Y) of Y.y= interval for intervals for iJ. if you cannot reject it. if n is large. 95%.y = IY ± 2. IY ± 1. a 95% confidence interval + 1.Thus the values on your list constitute 95% confidence set for iJ.5. = 21.13). mean that cannot be rejected This list is useful because it summarizes the set of hypotheses you can and cannot reject (at the 5% level) based on your data: If someone walks up to you with a specific number in mind.5 has a N(O.o is rejected at the 5 % level if it is more than 1.y is an interval constructed in 95% of all possible random samples. 
and 99% confidence 95% confidence 90% confidence 99% confidence interval for iJ. this means that in 95% of all samples. value of iJ.y = 21.o and test it.y are ± 1. so that it When the sample size n is large.7 interval for iJ. write this value down on your list.64S£(Y)). there is a much easier approach. mean Thus.mUD Confidence Intervals for the Population Mean A 95% two-sided confidence contains the true value of iJ.. indeed. you will correctly accept 21. 1. In 95% of all samples.5 at the 5% level is 5%. But because you tested all possible values of the population in constructing your set. Key Concept 3. iJ. My 0.
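The coverage property in Key Concept 3.7 can be illustrated by simulation. The sketch below is ours, not the text's: it draws repeated samples from a hypothetical N(21.5, 10²) population (the mean 21.5 echoes the example above), forms Ȳ ± 1.96SE(Ȳ) in each sample, and reports the fraction of intervals that contain the true mean, which should be close to 95%.

```python
# Simulated coverage of the large-sample 95% confidence interval
# Y-bar +/- 1.96*SE(Y-bar) from Key Concept 3.7. The N(21.5, 10^2)
# population is a hypothetical illustration.
import math
import random
import statistics

def ci_95(sample):
    """Large-sample 95% confidence interval for the population mean."""
    ybar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # SE(Y-bar)
    return ybar - 1.96 * se, ybar + 1.96 * se

def coverage(mu=21.5, sigma=10.0, n=100, reps=2000, seed=1):
    """Fraction of simulated samples whose interval contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = ci_95(sample)
        hits += lo <= mu <= hi
    return hits / reps
```

Running `coverage()` returns a fraction close to 0.95, matching the clever-reasoning argument above.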

As an example, consider the problem of constructing a 95% confidence interval for the mean hourly earnings of recent college graduates using a hypothetical random sample of 200 recent college graduates where Ȳ = $22.64 and SE(Ȳ) = 1.28. The 95% confidence interval for mean hourly earnings is 22.64 ± 1.96 × 1.28 = 22.64 ± 2.51 = [$20.13, $25.15].

This discussion so far has focused on two-sided confidence intervals. One could instead construct a one-sided confidence interval as the set of values of μY that cannot be rejected by a one-sided hypothesis test. Although one-sided confidence intervals have applications in some branches of statistics, they are uncommon in applied econometric analysis.

Coverage probabilities. The coverage probability of a confidence interval for the population mean is the probability, computed over all possible random samples, that it contains the true population mean.

3.4 Comparing Means from Different Populations

Do recent male and female college graduates earn the same amount on average? This question involves comparing the means of two different population distributions. This section summarizes how to test hypotheses and how to construct confidence intervals for the difference in the means from two different populations.

Hypothesis Tests for the Difference Between Two Means

To illustrate a test for the difference between two means, let μw be the mean hourly earnings in the population of women recently graduated from college and let μm be the population mean for recently graduated men. Consider the null hypothesis that mean earnings for these two populations differ by a certain amount, say d0. Then the null hypothesis and the two-sided alternative hypothesis are

H0: μm − μw = d0  vs.  H1: μm − μw ≠ d0.  (3.18)

The null hypothesis that men and women in these populations have the same mean earnings corresponds to H0 in Equation (3.18) with d0 = 0.

Because these population means are unknown, they must be estimated from samples of men and women. Suppose we have samples of nm men and nw women drawn at random from their populations. Let the sample average annual earnings be Ȳm for men and Ȳw for women. Then an estimator of μm − μw is Ȳm − Ȳw.

To learn about the distribution of this estimator, recall from Section 2.4 that a weighted average of two normal random variables is itself normally distributed. Because Ȳm and Ȳw are constructed from different randomly selected samples, they are independent random variables. Thus, according to the central limit theorem, when nm and nw are large, Ȳm − Ȳw is approximately distributed N(μm − μw, σm²/nm + σw²/nw), where σm² is the population variance of earnings for men and σw² is the population variance of earnings for women.

If σm² and σw² are known, then this approximate normal distribution can be used to compute p-values for the test of the null hypothesis that μm − μw = d0. In practice, however, these population variances are typically unknown, so they must be estimated. As before, they can be estimated using the sample variances, sm² and sw², where sm² is defined as in Equation (3.7), except that the statistic is computed only for the men in the sample, and sw² is defined similarly for the women. Thus the standard error of Ȳm − Ȳw is

SE(Ȳm − Ȳw) = √(sm²/nm + sw²/nw).  (3.19)

For a simplified version of Equation (3.19) when Y is a Bernoulli random variable, see Exercise 3.15.

The t-statistic for testing the null hypothesis is constructed analogously to the t-statistic for testing a hypothesis about a single population mean: by subtracting the null hypothesized value of μm − μw from the estimator Ȳm − Ȳw and dividing the result by the standard error of Ȳm − Ȳw:

t = (Ȳm − Ȳw − d0) / SE(Ȳm − Ȳw)  (t-statistic for comparing two means).  (3.20)

If both nm and nw are large, then this t-statistic has a standard normal distribution under the null hypothesis.
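Equations (3.19) and (3.20) translate directly into code. The sketch below uses our own illustrative names; Python's `statistics.variance` divides by n − 1, matching the sample variance of Equation (3.7).

```python
# Standard error of Y-bar_m - Y-bar_w, Equation (3.19), and the
# t-statistic for comparing two means, Equation (3.20).
import math
import statistics

def se_diff(sample_m, sample_w):
    """SE(Y-bar_m - Y-bar_w) = sqrt(s_m^2/n_m + s_w^2/n_w)."""
    return math.sqrt(statistics.variance(sample_m) / len(sample_m)
                     + statistics.variance(sample_w) / len(sample_w))

def t_two_means(sample_m, sample_w, d0=0.0):
    """t = (Y-bar_m - Y-bar_w - d0) / SE(Y-bar_m - Y-bar_w)."""
    diff = statistics.fmean(sample_m) - statistics.fmean(sample_w)
    return (diff - d0) / se_diff(sample_m, sample_w)
```

With nm and nw large, the null hypothesis μm − μw = d0 is rejected at the 5% level when |t| exceeds 1.96.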

Because the t-statistic in Equation (3.20) has a standard normal distribution under the null hypothesis when nm and nw are large, the p-value of the two-sided test is computed exactly as it was in the case of a single population; that is, the p-value is computed using Equation (3.14).

To conduct a test with a prespecified significance level, simply calculate the t-statistic in Equation (3.20) and compare it to the appropriate critical value. For example, the null hypothesis is rejected at the 5% significance level if the absolute value of the t-statistic exceeds 1.96. If the alternative is one-sided rather than two-sided (that is, if the alternative is that μm − μw > d0), then the test is modified as outlined in Section 3.2: The p-value is computed using Equation (3.17), and a test with a 5% significance level rejects when t > 1.64.

Confidence Intervals for the Difference Between Two Population Means

The method for constructing confidence intervals summarized in Section 3.3 extends to constructing a confidence interval for the difference between the means, d = μm − μw. Because the hypothesized value d0 is rejected at the 5% level if |t| > 1.96, d0 will be in the confidence set if |t| ≤ 1.96. But |t| ≤ 1.96 means that the estimated difference, Ȳm − Ȳw, is less than 1.96 standard errors away from d0. Thus the 95% two-sided confidence interval for d consists of those values of d within ±1.96 standard errors of Ȳm − Ȳw:

95% confidence interval for d = μm − μw is (Ȳm − Ȳw) ± 1.96SE(Ȳm − Ȳw).  (3.21)

With these formulas in hand, the box "The Gender Gap of Earnings of College Graduates in the United States" contains an empirical investigation of gender differences in earnings of U.S. college graduates.

3.5 Differences-of-Means Estimation of Causal Effects Using Experimental Data

Recall from Section 1.2 that a randomized controlled experiment randomly selects subjects (individuals or, more generally, entities) from a population of interest and then randomly assigns them either to a treatment group, which receives the experimental treatment, or to a control group, which does not receive the treatment. The difference between the sample means of the treatment and control groups is an estimator of the causal effect of the treatment.

The Causal Effect as a Difference of Conditional Expectations

The causal effect of a treatment is the expected effect on the outcome of interest of the treatment as measured in an ideal randomized controlled experiment. This effect can be expressed as the difference of two conditional expectations. Specifically, the causal effect on Y of treatment level x is the difference in the conditional expectations, E(Y|X = x) − E(Y|X = 0), where E(Y|X = x) is the expected value of Y for the treatment group (which receives treatment level X = x) in an ideal randomized controlled experiment and E(Y|X = 0) is the expected value of Y for the control group (which receives treatment level X = 0). In the context of experiments, the causal effect is also called the treatment effect. If there are only two treatment levels (that is, if the treatment is binary), then we can let X = 0 denote the control group and X = 1 denote the treatment group. If the treatment is binary, then the causal effect (that is, the treatment effect) is E(Y|X = 1) − E(Y|X = 0) in an ideal randomized controlled experiment.

Estimation of the Causal Effect Using Differences of Means

If the treatment in a randomized controlled experiment is binary, then the causal effect can be estimated by the difference in the sample average outcomes between the treatment and control groups. The hypothesis that the treatment is ineffective is equivalent to the hypothesis that the two means are the same, which can be tested using the t-statistic for comparing two means, given in Equation (3.20). A 95% confidence interval for the difference in the means of the two groups is a 95% confidence interval for the causal effect, so a 95% confidence interval for the causal effect can be constructed using Equation (3.21).

A well-designed, well-run experiment can provide a compelling estimate of a causal effect. For this reason, randomized controlled experiments are commonly conducted in some fields, such as medicine. In economics, however, experiments tend to be expensive, difficult to administer, and, in some cases, ethically questionable, so they remain rare. For this reason, econometricians sometimes study "natural experiments," also called quasi-experiments, in which some event unrelated to the treatment or subject characteristics has the effect of assigning different treatments to different subjects as if they had been part of a randomized controlled experiment. The box "A Novel Way to Boost Retirement Savings" provides an example of such a quasi-experiment that yielded some surprising conclusions.
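As a sketch of this estimator, the hypothetical simulation below (our illustration; the effect size of 2.0 and all other numbers are invented) assigns a binary treatment that shifts the outcome mean and then estimates the causal effect and its 95% confidence interval using Equations (3.20) and (3.21).

```python
# Differences-of-means estimate of a binary treatment effect, with the
# 95% confidence interval of Equation (3.21). Simulated data; the true
# causal effect is 2.0 by construction.
import math
import random
import statistics

def diff_means_ci(treated, control):
    """Return the estimated effect and its 95% confidence interval."""
    est = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return est, (est - 1.96 * se, est + 1.96 * se)

rng = random.Random(7)
control = [rng.gauss(10.0, 1.0) for _ in range(500)]   # no treatment
treated = [rng.gauss(12.0, 1.0) for _ in range(500)]   # treatment adds 2.0
effect, (lo, hi) = diff_means_ci(treated, control)
```

With 500 observations per group, the estimate lands close to the true effect of 2.0 and the interval is narrow.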

The Gender Gap of Earnings of College Graduates in the United States

The box in Chapter 2, "The Distribution of Earnings in the United States in 2008," shows that, on average, male college graduates earn more than female college graduates. What are the recent trends in this "gender gap" in earnings? Social norms and laws governing gender discrimination in the workplace have changed substantially in the United States. Is the gender gap in earnings of college graduates stable, or has it diminished over time?

Table 3.1 gives estimates of hourly earnings for college-educated full-time workers aged 25-34 in the United States in 1992, 1996, 2000, 2004, and 2008, using data collected by the Current Population Survey. Earnings for 1992, 1996, 2000, and 2004 were adjusted for inflation by putting them in 2008 dollars using the Consumer Price Index (CPI).¹ In 2008, the average hourly earnings of the 1838 men surveyed was $24.98, and the standard deviation of earnings for men was $11.78. The average hourly earnings in 2008 of the 1871 women surveyed was $20.87, and the standard deviation of earnings was $9.66. Thus the estimate of the gender gap in earnings for 2008 is $4.11 (= $24.98 − $20.87), with a standard error of $0.35 (= √(11.78²/1838 + 9.66²/1871)). The 95% confidence interval for the gender gap in earnings in 2008 is 4.11 ± 1.96 × 0.35 = ($3.41, $4.80).

Table 3.1 Trends in Hourly Earnings in the United States of Working College Graduates, Ages 25-34, 1992 to 2008, in 2008 Dollars

Year | Ȳm | sm | nm | Ȳw | sw | nw | Ȳm − Ȳw | SE(Ȳm − Ȳw) | 95% CI for d
1992 | $23.27 | ... | 1594 | $20.05 | ... | 1368 | $3.22** | ... | ...
1996 | ... | ... | 1379 | ... | ... | 1230 | ... | ... | ...
2000 | ... | ... | 1303 | ... | ... | 1181 | ... | ... | ...
2004 | ... | ... | 1894 | ... | ... | 1735 | ... | ... | ...
2008 | $24.98 | $11.78 | 1838 | $20.87 | $9.66 | 1871 | $4.11** | 0.35 | 3.41-4.80

These estimates are computed using data on all full-time workers aged 25-34 surveyed in the Current Population Survey conducted in March of the next year (for example, the data for 2008 were collected in March 2009). **The difference is significantly different from zero at the 1% significance level.

The results in Table 3.1 suggest four conclusions. First, the gender gap is large. An hourly gap of $4.11 might not sound like much, but over a year it adds up to $8220, assuming a 40-hour workweek and 50 paid weeks per year. Second, from 1992 to 2008, the estimated gender gap increased by $0.89 per hour in real terms, from $3.22 per hour to $4.11 per hour; however, this increase is not statistically significant at the 5% significance level (Exercise 3.17).
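The 2008 calculation in the text can be verified directly from the reported numbers; the snippet below just reproduces that arithmetic. (The lower endpoint comes out as 3.42 rather than the text's 3.41 because the means are rounded to cents before differencing.)

```python
# Gender gap in 2008: estimate, standard error per Equation (3.19),
# and 95% confidence interval, from the numbers reported in the text.
import math

gap = 24.98 - 20.87                                  # $4.11 per hour
se = math.sqrt(11.78**2 / 1838 + 9.66**2 / 1871)     # approx. $0.35
ci = (gap - 1.96 * se, gap + 1.96 * se)              # approx. ($3.42, $4.80)
```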

Third, the gap is large if it is measured instead in percentage terms: According to the estimates in Table 3.1, in 2008 women earned 16% less per hour than men did ($4.11/$24.98), slightly more than the gap of 14% seen in 1992 ($3.22/$23.27). Fourth, the gender gap is smaller for young college graduates (the group analyzed in Table 3.1) than it is for all college graduates (analyzed in Table 2.4): As reported in Table 2.4, the mean earnings for all college-educated women working full-time in 2008 was $23.93, while for men this mean was $30.97, which corresponds to a gender gap of 23% [= (30.97 − 23.93)/30.97] among all full-time college-educated workers.

This empirical analysis documents that the "gender gap" in hourly earnings is large and has been fairly stable (or perhaps increased slightly) over the recent past. The analysis does not, however, tell us why this gap exists. Does it arise from gender discrimination in the labor market? Does it reflect differences in skills, experience, or education between men and women? Does it reflect differences in choice of jobs? Or is there some other cause? We return to these questions once we have in hand the tools of multiple regression analysis, the topic of Part II.

¹Because of inflation, a dollar in 1992 was worth more than a dollar in 2008, in the sense that a dollar in 1992 could buy more goods and services than a dollar in 2008 could. Thus earnings in 1992 cannot be directly compared to earnings in 2008 without adjusting for inflation. One way to make this adjustment is to use the CPI, a measure of the price of a "market basket" of consumer goods and services constructed by the Bureau of Labor Statistics. Over the 16 years from 1992 to 2008, the price of the CPI market basket rose by 53.4%; in other words, the CPI basket of goods and services that cost $100 in 1992 cost $153.40 in 2008. To make earnings in 1992 and 2008 comparable in Table 3.1, 1992 earnings are inflated by the amount of overall CPI price inflation, that is, by multiplying 1992 earnings by 1.534 to put them into "2008 dollars."

3.6 Using the t-Statistic When the Sample Size Is Small

In Sections 3.2 through 3.5, the t-statistic is used in conjunction with critical values from the standard normal distribution for hypothesis testing and for the construction of confidence intervals. The use of the standard normal distribution is justified by the central limit theorem, which applies when the sample size is large. When the sample size is small, the standard normal distribution can provide a poor approximation to the distribution of the t-statistic. If, however, the population distribution is itself normally distributed, then the exact distribution (that is, the finite-sample distribution; see Section 2.6) of the t-statistic testing the mean of a single population is the Student t distribution with n − 1 degrees of freedom, and critical values can be taken from the Student t distribution.

The t-Statistic and the Student t Distribution

The t-statistic testing the mean. Consider the t-statistic used to test the hypothesis that the mean of Y is μY,0, using data Y1, ..., Yn. The formula for this statistic

is given by Equation (3.10), where the standard error of Ȳ is given by Equation (3.8). Substitution of the latter expression into the former yields the formula for the t-statistic:

t = (Ȳ − μY,0) / √(sY²/n),  (3.22)

where sY² is given in Equation (3.7).

As discussed in Section 3.2, under general conditions the t-statistic has a standard normal distribution if the sample size is large and the null hypothesis is true [see Equation (3.12)]. Although the standard normal approximation to the t-statistic is reliable for a wide range of distributions of Y if n is large, it can be unreliable if n is small. The exact distribution of the t-statistic depends on the distribution of Y, and it can be very complicated. There is, however, one special case in which the exact distribution of the t-statistic is relatively simple: If Y is normally distributed, then the t-statistic in Equation (3.22) has a Student t distribution with n − 1 degrees of freedom.

To verify this result, recall from Section 2.4 that the Student t distribution with n − 1 degrees of freedom is defined to be the distribution of Z/√(W/(n − 1)), where Z is a random variable with a standard normal distribution, W is a random variable with a chi-squared distribution with n − 1 degrees of freedom, and Z and W are independently distributed. When Y1, ..., Yn are i.i.d. and the population distribution of Y is N(μY, σY²), the t-statistic can be written as such a ratio. Specifically, let Z = (Ȳ − μY,0)/√(σY²/n) and let W = (n − 1)sY²/σY²; then some algebra² shows that the t-statistic in Equation (3.22) can be written as t = Z/√(W/(n − 1)). Recall from Section 2.4 that if Y1, ..., Yn are i.i.d. and the population distribution of Y is N(μY, σY²), then the sampling distribution of Ȳ is exactly N(μY, σY²/n) for all n; thus, if the null hypothesis μY = μY,0 is correct, then Z has a standard normal distribution for all n. In addition, W has a chi-squared distribution with n − 1 degrees of freedom for all n, and Ȳ and sY² are independently distributed. It follows that, if the population distribution of Y is normal, then under the null hypothesis the t-statistic given in Equation (3.22) has an exact Student t distribution with n − 1 degrees of freedom.

If the population distribution is normally distributed, then critical values from the Student t distribution can be used to perform hypothesis tests and to construct confidence intervals.

²The desired expression is obtained by multiplying and dividing by √(σY²/n) and collecting terms:

t = (Ȳ − μY,0)/√(sY²/n) = [(Ȳ − μY,0)/√(σY²/n)] / √(sY²/σY²) = Z / √[((n − 1)sY²/σY²)/(n − 1)] = Z/√(W/(n − 1)).
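The algebra in footnote 2 can be checked numerically: the t-statistic of Equation (3.22) equals Z/√(W/(n − 1)) exactly, for any sample and any positive σ², because σ² cancels. The sample and the value σ² = 4.0 below are arbitrary illustrative choices.

```python
# Numerical check that t = Z / sqrt(W/(n-1)), with
# Z = (Y-bar - mu_0)/sqrt(sigma^2/n) and W = (n-1)*s^2/sigma^2.
import math
import statistics

sample = [2.1, 3.4, 1.9, 4.2, 2.8, 3.0]   # arbitrary illustrative data
mu0, sigma2 = 2.5, 4.0                     # hypothesized mean; any sigma^2 > 0
n = len(sample)
ybar = statistics.fmean(sample)
s2 = statistics.variance(sample)

t_direct = (ybar - mu0) / math.sqrt(s2 / n)     # Equation (3.22)
z = (ybar - mu0) / math.sqrt(sigma2 / n)
w = (n - 1) * s2 / sigma2
t_ratio = z / math.sqrt(w / (n - 1))            # footnote-2 form
```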

As an example, consider a hypothetical problem in which t_act = 2.15 and n = 20, so that the degrees of freedom is n − 1 = 19. From Appendix Table 2, the 5% two-sided critical value for the t19 distribution is 2.09. Because the t-statistic is larger in absolute value than the critical value (2.15 > 2.09), the null hypothesis would be rejected at the 5% significance level against the two-sided alternative. The 95% confidence interval for μY, constructed using the t19 distribution, would be Ȳ ± 2.09SE(Ȳ). This confidence interval is somewhat wider than the confidence interval constructed using the standard normal critical value of 1.96.

The t-statistic testing differences of means. The t-statistic testing the difference of two means, given in Equation (3.20), does not have a Student t distribution, even if the population distribution of Y is normal. (The Student t distribution does not apply here because the variance estimator used to compute the standard error in Equation (3.19) does not produce a denominator in the t-statistic with a chi-squared distribution.)

A modified version of the differences-of-means t-statistic, based on a different standard error formula (the "pooled" standard error formula), has an exact Student t distribution when Y is normally distributed; however, the pooled standard error formula applies only in the special case that the two groups have the same variance or that each group has the same number of observations (Exercise 3.21). Adopt the notation of Equation (3.19), so that the two groups are denoted m and w. The pooled variance estimator is

s²pooled = [1/(nm + nw − 2)] [ Σ over group m of (Yi − Ȳm)² + Σ over group w of (Yi − Ȳw)² ],  (3.23)

where the first summation is for the observations in group m and the second summation is for the observations in group w. The pooled standard error of the difference in means is SEpooled(Ȳm − Ȳw) = spooled × √(1/nm + 1/nw), and the pooled t-statistic is computed using Equation (3.20), where the standard error is the pooled standard error.

If the population distribution of Y in group m is N(μm, σm²), if the population distribution of Y in group w is N(μw, σw²), and if the two group variances are the same (that is, σm² = σw²), then under the null hypothesis the t-statistic computed using the pooled standard error has a Student t distribution with nm + nw − 2 degrees of freedom.
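A sketch of the pooled calculation (our names, not the text's). A useful check on the "same number of observations" special case noted above: when nm = nw, the pooled standard error coincides with the unpooled standard error of Equation (3.19), since both reduce to √((sm² + sw²)/n).

```python
# Pooled variance, Equation (3.23), and the pooled standard error
# SE_pooled = s_pooled * sqrt(1/n_m + 1/n_w).
import math
import statistics

def se_pooled(sample_m, sample_w):
    nm, nw = len(sample_m), len(sample_w)
    mbar, wbar = statistics.fmean(sample_m), statistics.fmean(sample_w)
    ss = (sum((y - mbar) ** 2 for y in sample_m)
          + sum((y - wbar) ** 2 for y in sample_w))
    s2_pooled = ss / (nm + nw - 2)                    # Equation (3.23)
    return math.sqrt(s2_pooled) * math.sqrt(1 / nm + 1 / nw)

def se_unpooled(sample_m, sample_w):
    """Equation (3.19), which allows different group variances."""
    return math.sqrt(statistics.variance(sample_m) / len(sample_m)
                     + statistics.variance(sample_w) / len(sample_w))
```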

A Novel Way to Boost Retirement Savings

Many economists think that people do not save enough for retirement. Conventional methods for encouraging retirement savings focus on financial incentives, but there also has been an upsurge in interest in unconventional ways to encourage saving for retirement. In an important study published in 2001, Brigitte Madrian and Dennis Shea considered one such unconventional method for stimulating retirement savings.

Many firms offer retirement savings plans in which the firm matches, in full or in part, savings taken out of the paycheck of participating employees. Enrollment in such plans, called 401(k) plans after the applicable section of the U.S. tax code, is always optional. However, at some firms employees are automatically enrolled in the plan, although they can opt out; at other firms, employees are enrolled only if they choose to opt in. According to conventional economic models of behavior, the method of enrollment (opt out or opt in) should not matter: The rational worker computes the optimal action, then takes it. But, Madrian and Shea wondered, could conventional economics be wrong? Could the method of enrollment in a savings plan directly affect its enrollment rate?

To measure the effect of the method of enrollment, Madrian and Shea studied a large firm that changed the default option for its 401(k) plan from nonparticipation to participation. They compared two groups of workers: those hired the year before the change, who were not automatically enrolled (but could opt in), and those hired in the year after the change, who were automatically enrolled (but could opt out). The financial aspects of the plan remained the same, and Madrian and Shea found no systematic differences between the workers hired before and after the change. Thus, from an econometrician's perspective, the change was like a randomly assigned treatment, and the causal effect of the change could be estimated by the difference in means between the two groups.

Madrian and Shea found that the default enrollment rule made a huge difference: The enrollment rate for the "opt-in" (control) group was 37.4% (n = 4249), whereas the enrollment rate for the "opt-out" (treatment) group was 85.9% (n = 5801). The estimate of the treatment effect is 48.5% (= 85.9% − 37.4%). Because their sample is large, the 95% confidence interval for the treatment effect is tight, 46.8% to 50.2% (computed in Exercise 3.15).

How could the default choice matter so much? Maybe workers found these financial choices too confusing, or maybe they just didn't want to think about growing old. Neither explanation is economically rational, but both are consistent with the predictions of the growing field of "behavioral economics," and both could lead to accepting the default enrollment option.

This research had an important practical impact. In August 2006, Congress passed the Pension Protection Act that (among other things) encouraged firms to offer 401(k) plans in which enrollment is the default. The econometric findings of Madrian and Shea and others featured prominently in testimony on this part of the legislation.

To learn more about behavioral economics and the design of retirement savings plans, see Benartzi and Thaler (2007) and Beshears, Choi, Laibson, and Madrian (2008).
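The interval reported in the box can be reproduced from its enrollment rates. For a Bernoulli outcome such as enrollment, the sample variance is p̂(1 − p̂) up to a negligible n/(n − 1) factor, so Equation (3.19) specializes to the standard error below (the simplification treated in Exercise 3.15).

```python
# 95% confidence interval for the 401(k) treatment effect from the
# enrollment rates reported by Madrian and Shea.
import math

p1, n1 = 0.859, 5801    # opt-out (treatment) group
p0, n0 = 0.374, 4249    # opt-in (control) group

effect = p1 - p0                                         # 0.485
se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)  # approx. 0.0087
ci = (effect - 1.96 * se, effect + 1.96 * se)            # approx. (0.468, 0.502)
```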

The drawback of using the pooled variance estimator s²pooled is that it applies only if the two population variances are the same (assuming nm ≠ nw). If the population variances are different, the pooled variance estimator is biased and inconsistent. If the population variances are different but the pooled variance formula is used, the null distribution of the pooled t-statistic is not a Student t distribution, even if the data are normally distributed; in fact, it does not even have a standard normal distribution in large samples. Therefore, the pooled standard error and the pooled t-statistic should not be used unless you have a good reason to believe that the population variances are the same.

Use of the Student t Distribution in Practice

For the problem of testing the mean of Y, the Student t distribution is applicable if the underlying population distribution of Y is normal. For economic variables, however, normal distributions are the exception (for example, see the boxes in Chapter 2, "The Distribution of Earnings in the United States in 2008" and "A Bad Day on Wall Street"). Even if the underlying data are not normally distributed, the normal approximation to the distribution of the t-statistic is valid if the sample size is large. Therefore, inferences (hypothesis tests and confidence intervals) about the mean of a distribution should be based on the large-sample normal approximation.

Even though the Student t distribution is rarely applicable in economics, some software uses the Student t distribution to compute p-values and confidence intervals. In practice, this does not pose a problem because the difference between the Student t distribution and the standard normal distribution is negligible if the sample size is large. For n > 15, the difference in the p-values computed using the Student t and standard normal distributions never exceeds 0.01; for n > 80, the difference never exceeds 0.002. In most modern applications, and in all applications in this textbook, the sample sizes are in the hundreds or thousands, large enough for the difference between the Student t distribution and the standard normal distribution to be negligible.

When comparing two means, any economic reason for the two groups having different means typically implies that the two groups also could have different variances. Accordingly, the pooled standard error formula is inappropriate, and the correct standard error formula, which allows for different group variances, is as given in Equation (3.19). Even if the population distributions are normal, the t-statistic computed using the standard error formula in Equation (3.19) does not have a Student t distribution. In practice, therefore, inferences about differences in means should be based on Equation (3.19), used in conjunction with the large-sample standard normal approximation.

3.7 Scatterplots, the Sample Covariance, and the Sample Correlation

What is the relationship between age and earnings? This question, like many others, relates one variable, X (age), to another, Y (earnings). This section reviews three ways to summarize the relationship between variables: the scatterplot, the sample covariance, and the sample correlation coefficient.

Scatterplots

A scatterplot is a plot of n observations on Xi and Yi, in which each observation is represented by the point (Xi, Yi). For example, Figure 3.2 is a scatterplot of age (X) and hourly earnings (Y) for a sample of 200 computer and information systems managers from the March 2009 CPS. Each dot in Figure 3.2 corresponds to an (X, Y) pair for one of the observations. For example, one of the workers in this sample is 40 years old and earns $35.78 per hour; this worker's age and earnings are indicated by the highlighted dot in Figure 3.2. The scatterplot shows a positive relationship between age and earnings in this sample: Older workers tend to earn more than younger workers. This relationship is not exact, however, and earnings could not be predicted perfectly using only a person's age.

Sample Covariance and Correlation

The covariance and correlation were introduced in Section 2.3 as two properties of the joint probability distribution of the random variables X and Y. Because the population distribution is unknown in practice, the population covariance and correlation can be estimated by taking a random sample of n members of the population and collecting the data (Xi, Yi), i = 1, ..., n.

The sample covariance and correlation are estimators of the population covariance and correlation. Like the estimators discussed previously in this chapter, they are computed by replacing a population mean (the expectation) with a sample mean. The sample covariance, denoted sXY, is

sXY = [1/(n − 1)] Σ from i = 1 to n of (Xi − X̄)(Yi − Ȳ).  (3.24)

Like the sample variance, the average in Equation (3.24) is computed by dividing by n − 1 instead of n; here, too, this difference stems from using X̄ and Ȳ to estimate the respective population means. When n is large, it makes little difference whether division is by n or by n − 1.
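The sample covariance of Equation (3.24), and the sample correlation rXY = sXY/(sX sY) of Equation (3.25), can be sketched as follows (our illustrative names). Because the numerator and denominator scale identically, rescaling one variable, as in the text's cents-versus-dollars example, leaves the correlation unchanged.

```python
# Sample covariance, Equation (3.24), and sample correlation,
# Equation (3.25): r_XY = s_XY / (s_X * s_Y).
import statistics

def sample_cov(x, y):
    n = len(x)
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Unitless; always lies between -1 and 1."""
    return sample_cov(x, y) / (statistics.stdev(x) * statistics.stdev(y))
```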

Figure 3.2 Scatterplot of Average Hourly Earnings vs. Age. [The figure plots average hourly earnings, $0 to $100 on the vertical axis, against age, 20 to 65 years on the horizontal axis.] Each point in the plot represents the age and average earnings of one of the 200 workers in the sample. The highlighted dot corresponds to a 40-year-old worker who earns $35.78 per hour. The data are for computer and information systems managers from the March 2009 CPS.

The sample correlation coefficient, or sample correlation, is denoted rXY and is the ratio of the sample covariance to the sample standard deviations:

rXY = sXY / (sX sY).  (3.25)

The sample correlation measures the strength of the linear association between X and Y in a sample of n observations. Like the population correlation, the sample correlation is unitless and lies between −1 and 1: |rXY| ≤ 1.

The sample correlation equals 1 if Xi = Yi for all i and equals −1 if Xi = −Yi for all i. More generally, the correlation is ±1 if the scatterplot is a straight line.
If the line slopes upward, then there is a positive relationship between X and Y and the correlation is 1; if the line slopes down, then there is a negative relationship and the correlation is −1. The closer the scatterplot is to a straight line, the closer is the correlation to ±1. A high correlation coefficient does not necessarily mean that the line has a steep slope; rather, it means that the points in the scatterplot fall very close to a straight line.

Consistency of the sample covariance and correlation. Like the sample variance, the sample covariance is consistent. That is,

s_XY →p σ_XY.  (3.26)

In other words, in large samples the sample covariance is close to the population covariance with high probability. The proof of the result in Equation (3.26) under the assumption that (X_i, Y_i) are i.i.d. and that X_i and Y_i have finite fourth moments is similar to the proof in Appendix 3.3 that the sample variance is consistent and is left as an exercise (Exercise 3.20). Because the sample variance and sample covariance are consistent, the sample correlation coefficient is consistent; that is, r_XY →p corr(X_i, Y_i).

Example. As an example, consider the data on age and earnings in Figure 3.2. For these 200 workers, the sample standard deviation of age is s_A = 9.07 years and the sample standard deviation of earnings is s_E = $14.37 per hour. The sample covariance between age and earnings is s_AE = 33.16 (the units are years × dollars per hour, not readily interpretable). Thus the correlation coefficient is r_AE = 33.16/(9.07 × 14.37) = 0.25, or 25%. The correlation of 0.25 means that there is a positive relationship between age and earnings, but as is evident in the scatterplot, this relationship is far from perfect.

To verify that the correlation does not depend on the units of measurement, suppose that earnings had been reported in cents, in which case the sample standard deviation of earnings is 1437¢ per hour and the covariance between age and earnings is 3316 (units are years × cents per hour); then the correlation is 3316/(9.07 × 1437) = 0.25, or 25%.

Figure 3.3 gives additional examples of scatterplots and correlation. Figure 3.3a shows a strong positive linear relationship between the variables; the sample correlation is 0.9. Figure 3.3b shows a strong negative relationship, with a sample correlation of −0.8. Figure 3.3c shows a scatterplot with no evident relationship, and the sample correlation is zero. Figure 3.3d shows a clear relationship: As X increases, Y initially increases but then decreases. Despite this discernable relationship between X and Y, the sample correlation is zero; the reason is that, for these data, small values of Y are associated with both large and small values of X.
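The point that the correlation does not depend on the units of measurement can be checked directly: rescaling Y (here, dollars to cents) rescales the covariance and the standard deviation of Y by the same factor, leaving r_XY unchanged. A minimal sketch with made-up data (not the CPS sample):

```python
def corr(x, y):
    """Sample correlation per Equations (3.24) and (3.25)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
    return sxy / (sx * sy)

ages = [28, 34, 41, 47, 55]
dollars = [15.0, 21.0, 19.5, 30.0, 27.5]
cents = [100 * d for d in dollars]  # same data, measured in cents instead
assert abs(corr(ages, dollars) - corr(ages, cents)) < 1e-12
```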

[Figure 3.3: Scatterplots for Four Hypothetical Data Sets. (a) Correlation = +0.9; (b) Correlation = −0.8; (c) Correlation = 0.0; (d) Correlation = 0.0 (quadratic).]

The scatterplots in Figures 3.3a and 3.3b show strong linear relationships between X and Y. In Figure 3.3c, X is independent of Y and the two variables are uncorrelated. In Figure 3.3d, the two variables also are uncorrelated even though they are related nonlinearly. This final example emphasizes an important point: The correlation is a measure of linear association. There is a relationship in Figure 3.3d, but it is not linear.
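The Figure 3.3d phenomenon, a clear but nonlinear relationship with a sample correlation of exactly zero, can be reproduced with symmetric data, since the covariance terms cancel when small values of Y pair with both large and small values of X:

```python
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # an exact quadratic relationship between X and Y

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
print(s_xy)  # prints 0.0: X and Y are perfectly related, yet uncorrelated
```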

Summary

1. The sample average, Ȳ, is an estimator of the population mean, μ_Y. When Y_1, ..., Y_n are i.i.d.:
a. the sampling distribution of Ȳ has mean μ_Y and variance σ²_Ȳ = σ²_Y/n;
b. Ȳ is unbiased;
c. by the law of large numbers, Ȳ is consistent; and
d. by the central limit theorem, Ȳ has an approximately normal sampling distribution when the sample size is large.

2. The t-statistic is used to test the null hypothesis that the population mean takes on a particular value. If n is large, the t-statistic has a standard normal sampling distribution when the null hypothesis is true.

3. The t-statistic can be used to calculate the p-value associated with the null hypothesis. A small p-value is evidence that the null hypothesis is false.

4. A 95% confidence interval for μ_Y is an interval constructed so that it contains the true value of μ_Y in 95% of all possible samples.

5. Hypothesis tests and confidence intervals for the difference in the means of two populations are conceptually similar to tests and intervals for the mean of a single population.

6. The sample correlation coefficient is an estimator of the population correlation coefficient and measures the linear relationship between two variables, that is, how well their scatterplot is approximated by a straight line.

Key Terms

estimator (66); estimate (66); bias, consistency, and efficiency (68); BLUE (Best Linear Unbiased Estimator) (69); least squares estimator (70); hypothesis tests (70); null hypothesis (70); alternative hypothesis (70); two-sided alternative hypothesis (71); p-value (significance probability) (71); sample variance (73); sample standard deviation (73); degrees of freedom (73); standard error of Ȳ (74); t-statistic (t-ratio) (75); test statistic (75); type I error (77); type II error (77); significance level (77); critical value (77); rejection region (77); acceptance region (77); size of a test (77);

power of a test (77); one-sided alternative hypothesis (79); confidence set (79); confidence level (79); confidence interval (79); coverage probability (81); test for the difference between two means (81); causal effect (84); treatment effect (84); scatterplot (91); sample covariance (91); sample correlation coefficient (sample correlation) (92).

Review the Concepts

3.1 Explain the difference between the sample average Ȳ and the population mean.

3.2 Explain the difference between an estimator and an estimate. Provide an example of each.

3.3 A population distribution has a mean of 10 and a variance of 16. Determine the mean and variance of Ȳ from an i.i.d. sample from this population for (a) n = 10; (b) n = 100; and (c) n = 1000. Relate your answer to the law of large numbers.

3.4 What role does the central limit theorem play in statistical hypothesis testing? In the construction of confidence intervals?

3.5 What is the difference between a null and alternative hypothesis? Among size, significance level, and power? Between a one-sided alternative hypothesis and a two-sided alternative hypothesis?

3.6 Why does a confidence interval contain more information than the result of a single hypothesis test?

3.7 Explain why the differences-of-means estimator, applied to data from a randomized controlled experiment, is an estimator of the treatment effect.

3.8 Sketch a hypothetical scatterplot for a sample of size 10 for two random variables with a population correlation of (a) 1.0; (b) −1.0; (c) 0.9; (d) −0.5; (e) 0.0.

Exercises

3.1 In a population, μ_Y = 100 and σ²_Y = 43. Use the central limit theorem to answer the following questions:

a. In a random sample of size n = 100, find Pr(Ȳ < 101).

b. In a random sample of size n = 165, find Pr(Ȳ > 98).

c. In a random sample of size n = 64, find Pr(101 ≤ Ȳ ≤ 103).

3.2 Let Y be a Bernoulli random variable with success probability Pr(Y = 1) = p, and let Y_1, ..., Y_n be i.i.d. draws from this distribution. Let p̂ be the fraction of successes (1s) in this sample.

a. Show that p̂ = Ȳ.

b. Show that p̂ is an unbiased estimator of p.

c. Show that var(p̂) = p(1 − p)/n.

3.3 In a survey of 400 likely voters, 215 responded that they would vote for the incumbent and 185 responded that they would vote for the challenger. Let p denote the fraction of all likely voters who preferred the incumbent at the time of the survey, and let p̂ be the fraction of survey respondents who preferred the incumbent.

a. Use the survey results to estimate p.

b. Use the estimator of the variance of p̂, p̂(1 − p̂)/n, to calculate the standard error of your estimator.

c. What is the p-value for the test H₀: p = 0.5 vs. H₁: p ≠ 0.5?

d. What is the p-value for the test H₀: p = 0.5 vs. H₁: p > 0.5?

e. Why do the results from (c) and (d) differ?

f. Did the survey contain statistically significant evidence that the incumbent was ahead of the challenger at the time of the survey? Explain.

3.4 Using the data in Exercise 3.3:

a. Construct a 95% confidence interval for p.

b. Construct a 99% confidence interval for p.

c. Why is the interval in (b) wider than the interval in (a)?

d. Without doing any additional calculations, test the hypothesis H₀: p = 0.50 vs. H₁: p ≠ 0.50 at the 5% significance level.

3.5 A survey of 1055 registered voters is conducted, and the voters are asked to choose between candidate A and candidate B. Let p denote the fraction of voters in the population who prefer candidate A, and let p̂ denote the fraction of voters in the sample who prefer candidate A.

a. You are interested in the competing hypotheses H₀: p = 0.5 vs. H₁: p ≠ 0.5. Suppose that you decide to reject H₀ if |p̂ − 0.5| > 0.02.

i. What is the size of this test?

ii. Compute the power of this test if p = 0.53.

b. In the survey, p̂ = 0.54.

i. Test H₀: p = 0.5 vs. H₁: p ≠ 0.5 using a 5% significance level.

ii. Test H₀: p = 0.5 vs. H₁: p > 0.5 using a 5% significance level.

iii. Construct a 95% confidence interval for p.

iv. Construct a 99% confidence interval for p.

v. Construct a 50% confidence interval for p.

c. Suppose that the survey is carried out 20 times, using independently selected voters in each survey. For each of these 20 surveys, a 95% confidence interval for p is constructed.

i. What is the probability that the true value of p is contained in all 20 of these confidence intervals?

ii. How many of these confidence intervals do you expect to contain the true value of p?

d. In survey jargon, the "margin of error" is 1.96 × SE(p̂); that is, it is half the length of the 95% confidence interval. Suppose you wanted to design a survey that had a margin of error of at most 1%; that is, you wanted Pr(|p̂ − p| > 0.01) ≤ 0.05. How large should n be if the survey uses simple random sampling?

3.6 Let Y_1, ..., Y_n be i.i.d. draws from a distribution with mean μ. A test of H₀: μ = 5 versus H₁: μ ≠ 5 using the usual t-statistic yields a p-value of 0.03.

a. Does the 95% confidence interval contain μ = 5? Explain.

b. Can you determine if μ = 6 is contained in the 95% confidence interval? Explain.

3.7 In a given population, 11% of the likely voters are African American. A survey using a simple random sample of 600 landline telephone numbers finds 8% African Americans. Is there evidence that the survey is biased? Explain.

3.8 A new version of the SAT test is given to 1000 randomly selected high school seniors. The sample mean test score is 1110, and the sample standard deviation is 123. Construct a 95% confidence interval for the population mean test score for high school seniors.

3.9 Suppose a lightbulb manufacturing plant produces bulbs with a mean life of 2000 hours and a standard deviation of 200 hours. An inventor claims to have developed an improved process that produces bulbs with a longer mean life and the same standard deviation. The plant manager randomly selects 100 bulbs produced by the process. She says that she will believe the inventor's claim if the sample mean life of the bulbs is greater than 2100 hours; otherwise, she will conclude that the new process is no better than the old process. Let μ denote the mean of the new process. Consider the null and alternative hypotheses H₀: μ = 2000 vs. H₁: μ > 2000.

a. What is the size of the plant manager's testing procedure?

b. Suppose the new process is in fact better and has a mean bulb life of 2150 hours. What is the power of the plant manager's testing procedure?

c. What testing procedure should the plant manager use if she wants the size of her test to be 5%?

3.10 Suppose a new standardized test is given to 100 randomly selected third-grade students in New Jersey. The sample average score Ȳ on the test is 58 points, and the sample standard deviation, s_Y, is 8 points.

a. The authors plan to administer the test to all third-grade students in New Jersey. Construct a 95% confidence interval for the mean score of all New Jersey third graders.

b. Suppose the same test is given to 200 randomly selected third graders from Iowa, producing a sample average of 62 points and sample standard deviation of 11 points. Construct a 90% confidence interval for the difference in mean scores between Iowa and New Jersey.

c. Can you conclude with a high degree of confidence that the population means for Iowa and New Jersey students are different? (What is the standard error of the difference in the two sample means? What is the p-value of the test of no difference in means versus some difference?)

3.11 Consider the estimator Ỹ, defined in Equation (3.1). Show that (a) E(Ỹ) = μ_Y and (b) var(Ỹ) = 1.25σ²_Y/n.

3.12 To investigate possible gender discrimination in a firm, a sample of 100 men and 64 women with similar job descriptions are selected at random. A summary of the resulting monthly salaries follows:

         Average Salary (Ȳ)   Standard Deviation (s_Y)   n
Men      $3100                $200                       100
Women    $2900                $320                        64

a. What do these data suggest about wage differences in the firm? Do they represent statistically significant evidence that average wages of men and women are different? (To answer this question, first state the null and alternative hypotheses; second, compute the relevant t-statistic; third, compute the p-value associated with the t-statistic; and finally, use the p-value to answer the question.)

b. Do these data suggest that the firm is guilty of gender discrimination in its compensation policies? Explain.

3.13 Data on fifth-grade test scores (reading and mathematics) for 420 school districts in California yield Ȳ = 646.2 and standard deviation s_Y = 19.5.

a. Construct a 95% confidence interval for the mean test score in the population.

b. When the districts were divided into districts with small classes (< 20 students per teacher) and large classes (≥ 20 students per teacher), the following results were found:

Class Size   Average Score (Ȳ)   Standard Deviation (s_Y)   n
Small        657.4               19.4                       238
Large        650.0               17.9                       182

Is there statistically significant evidence that the districts with smaller classes have higher average test scores? Explain.

3.14 Values of height in inches (X) and weight in pounds (Y) are recorded from a sample of 300 male college students. The resulting summary statistics are X̄ = 70.5 in., Ȳ = 158 lb., s_X = 1.8 in., s_Y = 14.2 lb., s_XY = 21.73 in. × lb., and r_XY = 0.85. Convert these statistics to the metric system (meters and kilograms).

3.15 Let Y_a and Y_b denote Bernoulli random variables from two different populations, denoted a and b. Suppose that E(Y_a) = p_a and E(Y_b) = p_b. A random sample of size n_a is chosen from population a, with sample average denoted p̂_a, and a random sample of size n_b is chosen from population b,

with sample average denoted p̂_b. Suppose the sample from population a is independent of the sample from population b.

a. Show that E(p̂_a) = p_a and var(p̂_a) = p_a(1 − p_a)/n_a. Show that E(p̂_b) = p_b and var(p̂_b) = p_b(1 − p_b)/n_b.

b. Show that var(p̂_a − p̂_b) = p_a(1 − p_a)/n_a + p_b(1 − p_b)/n_b. (Hint: Remember that the samples are independent.)

c. Suppose that n_a and n_b are large. Show that a 95% confidence interval for p_a − p_b is given by

(p̂_a − p̂_b) ± 1.96 √[ p̂_a(1 − p̂_a)/n_a + p̂_b(1 − p̂_b)/n_b ].

How would you construct a 90% confidence interval for p_a − p_b?

d. Read the box "A Novel Way to Boost Retirement Savings" in Section 3.5. Let population a denote the "opt-out" (treatment) group and population b denote the "opt-in" (control) group. Construct a 95% confidence interval for the treatment effect, p_a − p_b.
3.16 Grades dents

on a standardized in the United

test are known to have a mean of 1000 for stuto 453 randomly

States. The test is administered

selected students in Florida; in this sample, the mean is 1013 and the standard deviation (s) is 108.
3.

Construct a 95% confidence Florida students.

interval for the average test score for

b. Is there statistically differently

significant evidence that Florida students perform

than other students in the United States? are selected at random from Florida. They are deviation of 95.

c. Another 503 students

given a 3-hour preparation

course before the test is administered.

Their average test score is 1019 with a standard \. Construct a 95% confidence

interval for the change in average

test score associated
I\.

with the prep course. significant evidence that the prep course

Is there statistically helped?

d. The original 453 students

are given the prep course and then are change in their test of the change is 60

asked to take the test a second time. The average scores is 9 points, and the standard deviation points. \. Construct a 95% confidence rest scores.

interval for the change in average

102

CHAPTER 3

Review Statistics of
II.

Is there statistically significant evidence that students will perform better on their second attempt after taking the prep curse?

iii. Students may have performed better in their sec nd attempt because of the prep course or because they gained test-taking experience in their first attempt. Describe an experiment that would quantify these two effects. 3.17 Read the box "The Gender Gap of Earnings United States" in Section 3.5. a. Construct a 95% confidence interval for the change in men' hourly earnings between 1992 and 2008. b. Construct a 95% confidence interval f r the change in w men's average hourly earnings between '1992 and 2008. c. Construct a 95% confidence interval for the change in the gender gap in average hourly earnings between 1992 and 200 . (Hirn:
"Y,1l.1992 Yw.l992

of

allege Graduates

in the

average

is independent of

~1I,2008 - ~v,2008')

3.18 This exercise shows that the sample variance i an unbiased
the population ance
(J~.

estimator

of

variance when

Y1"",

Y,', arc i.i.d. with mean J.Ly and vari-

a. Use Equation (2.31) to show that E[( Y, - y)2] = var( Y,) - 2cov( Y" Y) var(Y). b. Use Equation (2.33) to show that cov(Y, Y,)

+

= uNII.

c. Use the results in (a) and (b) to show that £(s~) 3.19 a. Y is an unbiased estimator of !ky. Is b. Vis a consistent estimator of t-v- Is

=

IT~.

y2 an un iased e .timator of Jk~?

y2

a c nsi tent cstimat

r of !k~?

3,20 Suppose that (Xi, Y,) are i.i.d. with finite fourth m mcnts, Prove that the sample covariance is a consistent estimator of the populati n covariance, that is, SXY ---L.. a xy, where SXY is defined in Equati n (3.24). (Ill/II: Use the strategy of Appendix 3.3 and the auchy chwartz inequality.) 3.21 Show that the pooled standard errol' [S£,wol"I(Y,,, - Y,,)] given following Equation (3.23) equals the usual standard error for the difference in means in Equation (3.19) when the two group sizes are the same (11m = 11".).

Empirical Exercise

103

Empirical Exercise
E3.1 On the text Web site http://www.pearsonhighered.com/stock_watson/You will find a data file CPS92_08 that contains an extended version of tbe dataset used in Table 3.1 of the text for the years 1992 and 2008. It contains data on full-time, full-year workers, age 25-34, with a high scbool diploma or B.A.lB.S. as their highest degree. A detailed description is given in CPS92_08_Description, available on the Web site. Use these data to answer the followingquestions. a. Compute the sample mean for average hourly earnings (AHE) in 1992 and in 2008.Construct a 95% confidence interval for the population means of ARE in 1992 and 2008and the change between 1992and 2008. b. In 2008,the value of the Consumer Price Index (CPI) was 215.2. In 1992, the value of the CPI was 140.3.Repeat (a) but use ARE measured in real 2008 dollars ($2008);that is, adjust the 1992 data for the price inflation that occurred between 1992 and 2008. c. If you were interested in the change in workers' purchasing power from 1992 to 2008, would you use the results from (a) or from (b)? Explain. d. Use the 2008 data to construct a 95% confidence interval for the mean of ARE for high school graduates. Construct a 95% confidence interval for the mean of ARE for workers witb a college degree. Construct a 95% confidence interval for the difference between the two means. e. Repeat (d) using the 1992 data expressed in $2008. 1'. Did real (inflation-adjusted) wages of high school graduates increase from 1992 to 2008? Explain. Did real wages of college graduates increase? Did the gap between earnings of college and high school graduates increase? Explain, using appropriate estimates, confidence intervals, and test statistics. g. Table 3.1 presents information on the gender gap for college graduates. Prepare a similar table for higb school graduates using tbe 1992 and 2008 data. Are there any notable differences between the results for high school and college graduates?

i~
104 CHAPTER 3 Review

of Statistics

APPENDIX

3.1

The U.s. Current Population Survey
Each month, the Bureau of Labor Statistics in the U.S, Department
population, including the level of employment, unemployment,

of Labor conducts

the

Current Population Survey (CPS), which provides data on labor force characteristics 50,000 U.S. households arc surveyed each mont h. The sample is chosen by randomly ing addresses from a database of addresses from the most recent decennial mented with data on new housing units constructed

of the select_

and earnings. More than

census aug-

after the last census. The exact random

sampling scheme is rather complicated (first, small geographical arcus arc randomly

selected, then housing units within these areas arc randomly selected): details can be found in the Handbook of Labor Sratisrics and on the Bureau of Labor Statistics Web site (www .bls.gov), The survey conducted each March is more detailed than in other m nths and asks

questions about earnings during the previous year.The statistics in Tables 2.4 and 3.1 were computed using the March surveys.The CPS earnings data are for full-time workers, defined to be somebody employed ous year. more than 35 hours per week for at least 48 weeks in the previ-

APPENDIX

3.2

Two Proofs That Yis the Least Squares Estimator of fLy
This appendix provides two proofs. one using calculus and aile not, that Y minimizes sum of squared prediction
estimator or

the

mistakes in Equation

(3.2)-thnl

is, that

Y

is the least squares

E( Y),

Calculus Proof
To minimize the sum of squared prediction mistakes. take its derivative and set it to zero: d
dm

I-I

:? (Y; -III)' ~ -2 2:(l'i
1=1

II

1/

II

-III)

= -22: Y;+ 2/1111 0, =
;=1 'C'IJ ~i=1

(3.27)

Solving for the final equation for

//'1

shows that

(V'i-II'/

)2' IS

••. nurunuze d

I W1 ell

m=Y.

by setting m = Y -so that Y is the least APPENDIX 3. must be zero.1')' + lid'.!LY)'.~l( tj .9). . .- in Equat ion (3. from which it follows that Y is the =Y _ so that m=V-d. d = O-that is. and E( Yf) < !LY) . This is done by setting squares estimator of E( Y). Because both terms in the final line of Equation (3. as small as possible. add and subtract uv to write (Y.Y) = O.(1' . . Then (Y.28) where the second equality uses the fact that L.!LY)] and by collecting terms.-V)'+ 2d( Y.d. .!Ly)J' = (Y.~l(}j .~I (11- 11I)2 is minimized by choosing d to make the second term.1")' + 2d2.-VJ+d)'=(Y.[Equation (3.-[V-dj)'=([Y.2)J is II H n ~ (Y.!LY)+ (1' . when Y. . .1') + nd' ~ L(Y. Let d and Y In. 1'.fLy) = fI( 1" .(Y.V) II + d'.A ProofThat the Sample Variance Is Consistent 105 Noncalculus Proof The strategy is to show that the difference between the least squares estimator least squares estimator. .28) are non negative and because the first term does not depend on d. 1=1 1"'1 i=l (3.. Substituting this expression [or (Y. - where the final equality follows from the definition ofY [which implies that L.1')2 into the definition of sl. -m)' 1=1 = L(Y. we have that First. as stated 1')' = [( Y.l1lL1s the sum of squared prediction mistakes [Equation (3. .7)].!Ly)2 2( Y. . ~.!Ll')( 1" . nd2./ are i. ..3 A Proof That the Sample Variance Is Consistent This appendix uses the law of large numbers to prove that the sample variance sistent esti maror of the population variance Sf is a con- a$.i.-m)'=(Y.

.. . Y" are i. .d.i.IJ-y)' . so the second term converges in probability to zero. E(W. -I'y)'] < < 00.d.P. . ance). and var("'[) W satisfies the conditions for the law of large E(W. Because 17 -"-> I'Y.. the random variables DO = E[( Y. (Y .50 p because.) = u~ (by the definition of the vari. by assumption. Wn are i..i.) = (f9.. . Combining these results yields s~~ uf..29). - -"-> (f9. Because the random variables are i. W 00. so the first term in in probability to uf..106 CHAPTER 3 Review of Statistics The law of large numbers can now be applied to the two terms in the final line of Equation (3. z and ~:~l (Y.i... . But W= - E( yl) < -I'Y) " "'I. 0.) Thus f1. Also.. so (l(n) Equation (3.-I'y)'. In addition..6 and W ~ I'y)' ('<''' l(n)"'i~I(Y. Now E( W.. ltYj.d. .29) converges 2.. = (Y. Define W. numbers in Key Concept E( W.). 11/(n -1) I. .

cmmD Linear Regression 4 with One Regressor highway fatalities? A school district cuts the size of its elementary school classes: What is the effect on its students' standardized test scores? You successfully complete one more year of college classes: What is the effect on your future earnings? All three of these questions are about the unknown effect of changing one variable. Y A state implements tough new penalties on drunk drivers: What is the effect on <Y being highway deaths. The slope and the intercept of different school districts. to another. class size. This model postulates a linear relationship between X and Y. which is not to the liking of those paying the bill! So she asks you: If she cuts class sizes. she will Parents want smaller classes so that their children can receive attention. using data on class sizes and test scores from class sizes by. student test scores. This chapter introduces the linear regression model relating one variable. X. or years of schooling). But hiring more teachers means spending more reduce the number of students per teacher (the student-teacher faces a trade-off. one student per class. what will the effect be on student performance? 107 . the slope of the X and Y is an unknown characteristic of the population joint distribution of is. X and Y. For instance. to estimate the a sample of data on these two variables.1 The Linear Regression Model The superintendent additional of an elementary school district must decide whether to hire ratio) by two. X (X being penalties for drunk driving. the slope of the line relating of line relating X and Y is the effect of a one-unit change in X on Y. The econometric problem is to estimate this slope-that effect on Y of a unit change in X-using of data on of reducing This chapter describes methods for estimating this slope using a random sample X and Y. Y. 4. Just as the mean Y is an unknown characteristic of the population distribution of Y. say. Tf she hires the teachers. 
we show how to estimate the expected effect on test scores the line relating X and Y can be estimated by a method called ordinary least squares (OLS). She teachers and she wants your advice. on another variable. or earnings). more individualized money.

as before. This straight line can be written line relating TestScore = where {3o is the intercept According to Equation able to determine f30 + {3cltlssSize X ClassSize. that is. which concerned rearrange Equation per class.. We therefore is measured by standardized can depend in part on how sharpen the superintendent's question: [f she reduces the average class size by two students. and the job status or pay of some administrators well their students do on these tests.6) X (-2) = 1. of the test that test scores would rise by 1. Suppose that f3C1""Si" ~ -0. f3CI""Slu is the by the change in the test score that results from changing the class size divided If you were lucky enough to know f3C1""Si. To do so. where the subscript she expect the change in standardized test scores to be? We can write this as a mathusing the Greek letter ClosiSir» distinguishes the effect of changing the class size from other effects. A (delta) stands for "change is.3). If the superintendent ematical relationship a quantitative statement about changes the class size by a certain amount. you would be able to tell the superintendent that decreasing class size by one student would change districtwide {3CfassSize' test scores by You could also answer the superintendent's changing class size by two students actual ques- tion..108 CHAPTER 4 Linear Regression with One Regressor In many school districts. (4.Siu X A ClassSize. if you knew f30 and {3CI".2. what will the effect be on standardized test scores in her district? A precise answer to this question requires changes. {3C1a. Then a reduction you would predict in class size of two students (4. change in TestScore change in ClassSire A TestScore A Classsize' in. .. (4..2) per class would yield a predicted change in test scores of (-0.1) so that A TesiScore ~ (3CI". 111us.6. (4. but you also would be able to predict the average test score itself for a given class size. f3C1""SI".'Si".Siu is the slope. 
Equation (4."That f3C/assSize (4.3) of this straight line and. not only would you be the change in test scores at a district associated with a change in class size. what would beta.1) is the definition of the slope of a straight scores and class size.2 points as a result reduction in class sizes by two students per class.1) where the Greek letter change in the class size. student performance tests.

4.1 The Linear Regression Model

When you propose Equation (4.3) to the superintendent, she tells you that something is wrong with this formulation. She points out that class size is just one of many facets of elementary education and that two districts with the same class sizes will have different test scores for many reasons. One district might have better teachers or it might use better textbooks. Two districts with comparable class sizes, teachers, and textbooks still might have very different student populations; perhaps one district has more immigrants (and thus fewer native English speakers) or wealthier families. Finally, she points out that even if two districts are the same in all these ways, they might have different test scores for essentially random reasons having to do with the performance of the individual students on the day of the test. She is right, of course; for all these reasons, Equation (4.3) will not hold exactly for all districts. Instead, it should be viewed as a statement about a relationship that holds on average across the population of districts.

A version of this linear relationship that holds for each district must incorporate these other factors influencing test scores, including each district's unique characteristics (for example, quality of their teachers, background of their students, how lucky the students were on test day). One approach would be to list the most important factors and to introduce them explicitly into Equation (4.3) (an idea we return to in Chapter 6). For now, however, we simply lump all these "other factors" together and write the relationship for a given district as

TestScore = β0 + βClassSize × ClassSize + other factors.   (4.4)

Thus the test score for the district is written in terms of one component, β0 + βClassSize × ClassSize, that represents the average effect of class size on scores in the population of school districts, and a second component that represents all other factors.

Although this discussion has focused on test scores and class size, the idea expressed in Equation (4.4) is much more general, so it is useful to introduce more general notation. Suppose you have a sample of n districts. Let Yi be the average test score in the ith district, let Xi be the average class size in the ith district, and let ui denote the other factors influencing the test score in the ith district. Then Equation (4.4) can be written more generally as

Yi = β0 + β1Xi + ui,   (4.5)

for each district (that is, i = 1, …, n), where β0 is the intercept of this line and β1 is the slope. [The general notation β1 is used for the slope in Equation (4.5) instead of βClassSize because this equation is written in terms of a general variable Xi.]

CHAPTER 4 Linear Regression with One Regressor

Equation (4.5) is the linear regression model with a single regressor, in which Y is the dependent variable and X is the independent variable or the regressor.

The first part of Equation (4.5), β0 + β1X, is the population regression line or the population regression function. This is the relationship that holds between Y and X on average over the population. Thus, if you knew the value of X, according to this population regression line you would predict that the value of the dependent variable, Y, is β0 + β1X.

The intercept β0 and the slope β1 are the coefficients of the population regression line, also known as the parameters of the population regression line. The slope β1 is the change in Y associated with a unit change in X. The intercept is the value of the population regression line when X = 0; it is the point at which the population regression line intersects the Y axis. In some econometric applications, the intercept has a meaningful economic interpretation. In other applications, the intercept has no real-world meaning; for example, when X is the class size, strictly speaking the intercept is the predicted value of test scores when there are no students in the class! When the real-world meaning of the intercept is nonsensical, it is best to think of it mathematically as the coefficient that determines the level of the regression line.

The term ui in Equation (4.5) is the error term. The error term incorporates all of the factors responsible for the difference between the ith district's average test score and the value predicted by the population regression line. This error term contains all the other factors besides X that determine the value of the dependent variable, Yi, for a specific observation, i. In the class size example, these other factors include all the unique features of the ith district that affect the performance of its students on the test, including teacher quality, student economic background, luck, and even any mistakes in grading the test.

The linear regression model and its terminology are summarized in Key Concept 4.1.

Figure 4.1 summarizes the linear regression model with a single regressor for seven hypothetical observations on test scores (Y) and class size (X). The population regression line is the straight line β0 + β1X. The population regression line slopes down (β1 < 0), which means that districts with lower student-teacher ratios (smaller classes) tend to have higher test scores. Because of the other factors that determine test performance, the hypothetical observations in Figure 4.1 do not fall exactly on the population regression line. For example, the value of Y for district #1, Y1, is above the population regression line. This means that test scores in district #1 were better than predicted by the population regression line, so the error term for that district, u1, is positive.

KEY CONCEPT 4.1
Terminology for the Linear Regression Model with a Single Regressor

The linear regression model is

Yi = β0 + β1Xi + ui,

where

- the subscript i runs over observations, i = 1, …, n;
- Yi is the dependent variable, the regressand, or simply the left-hand variable;
- Xi is the independent variable, the regressor, or simply the right-hand variable;
- β0 + β1X is the population regression line or the population regression function;
- β0 is the intercept of the population regression line;
- β1 is the slope of the population regression line; and
- ui is the error term.

FIGURE 4.1 Scatterplot of Test Score vs. Student-Teacher Ratio (Hypothetical Data)
[Figure: test score (Y) plotted against student-teacher ratio (X) for seven hypothetical school districts, with the population regression line β0 + β1X drawn through them. The vertical distance from the ith point (Xi, Yi) to the population regression line is Yi − (β0 + β1Xi), which is the population error term ui for the ith observation.]

In contrast, Y2 is below the population regression line, so test scores for that district were worse than predicted, and u2 < 0.

Now return to your problem as advisor to the superintendent: What is the expected effect on test scores of reducing the student-teacher ratio by two students per teacher? The answer is easy: The expected change is (−2) × βClassSize. But what is the value of βClassSize?

4.2 Estimating the Coefficients of the Linear Regression Model

In a practical situation such as the application to class size and test scores, the intercept β0 and slope β1 of the population regression line are unknown. Therefore, we must use data to estimate the unknown slope and intercept of the population regression line.

This estimation problem is similar to others you have faced in statistics. For example, suppose you want to compare the mean earnings of men and women who recently graduated from college. Although the population mean earnings are unknown, we can estimate the population means using a random sample of male and female college graduates. Then the natural estimator of the unknown population mean earnings for women, for example, is the average earnings of the female college graduates in the sample.

The same idea extends to the linear regression model. We do not know the population value of βClassSize, the slope of the unknown population regression line relating X (class size) and Y (test scores). But just as it was possible to learn about the population mean using a sample of data drawn from that population, so is it possible to learn about the population slope βClassSize using a sample of data.

The data we analyze here consist of test scores and class sizes in 1999 in 420 California school districts that serve kindergarten through eighth grade. The test score is the districtwide average of reading and math scores for fifth graders. Class size can be measured in various ways. The measure used here is one of the broadest, which is the number of students in the district divided by the number of teachers, that is, the districtwide student-teacher ratio. These data are described in more detail in Appendix 4.1.

Table 4.1 summarizes the distributions of test scores and class size for this sample. The average student-teacher ratio is 19.6 students per teacher, and the standard deviation is 1.9 students per teacher. The 10th percentile of the distribution of the student-teacher ratio is 17.3 students per teacher (that is, only 10% of districts have student-teacher ratios below 17.3), while the district at the 90th percentile has a student-teacher ratio of 21.9.

FIGURE 4.2 Scatterplot of Test Score vs. Student-Teacher Ratio (California School District Data)
[Figure: data from 420 California school districts. There is a weak negative relationship between the student-teacher ratio and test scores; the sample correlation is −0.23.]

A scatterplot of these 420 observations on test scores and the student-teacher ratio is shown in Figure 4.2. The sample correlation is −0.23, indicating a weak negative relationship between the two variables. Although larger classes in this sample tend to have lower test scores, there are other determinants of test scores that keep the observations from falling perfectly along a straight line. Despite this low correlation, if one could somehow draw a straight line through these data, then the slope of this line would be an estimate of βClassSize

E( Y).Xj• Thus the the i'"observation is Y.114 CHAPTER4 Linear Regressionwith One Regressor based on these data.(bo + b. should you choose among the many possible lines? By far the 1U0st the "least squares" (OLS) estimator. Y. is. In fact.. i=l " (4.6) are called the ordinary of the intercept and slope that minimize the sum of squared mistakes in Expression f30 and f3.X. 2:(Y. least squares estimaestimation ion (3. where cI seness is mea- sured by the sum of the squared mistakes made in predicting Y given X.6) and the two problems minimizes the Expression mistakes I'or the problem estimating the mean in Expression (3. The predicted value of Y.6)]. and different common way is to choose data-that How. to use the ordinary least squares The Ordinary Least Squares Estimator The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed As discussed in Section 3.x.6). . While this meth d is easy. - mY among all possible estimators [sec Exprc The OLS estimator and b. bo in Expression Y.2). so the value or Y. mistakes data..X.~. be some estimators extends this idea to the linear reg res ion m del. fit 10 these unscientific.l1. The regression linc based on the e estiusing this line is bo mators is bo + b.Xi) The sum of these squared prediction mistakes over all » observati = Y.-bo-b. that (3.2)]. .predicted mistake made in predicting + b.6) The sum of tbe squared sion (4. Y minimizes the total squared 2:. and the OLS estimator or f3. that is. that minimize Expression (4. . if there is no regressor.The is denoted also called the sample regression liue or sample regression function.e 0 LS regressiou line. it is very people will create different the line that produce estimated lines.bo ns i b. using the OLS estimators: ffio + OLS h~s its own special notation and terminology. One way to draw the line would be 10 take out a pencil and a ruler and to "eyeball" the best line you could.(y.. 
are identical except for the (4.lation mean.so is there a unique pair (4. Just as there is a [1'17 in Expression (3.1.2). f3o..2). least squares (OLS) OLS estimator estima· of f30 f30 and f3. then. the sample average. Let bo of f30 and f3. is the straight line constructed ffi. is the 111 tor of the popu.j2.X. of the sum of the squared (4. of estimators of The estimators tors of mistakes for the linear regression model in Expresof then b. is denoted ffi.6) is the extension does not enter Expression different notation unique estimator.
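To make the idea of minimizing Expression (4.6) concrete, the sketch below scans a coarse grid of candidate (b0, b1) pairs over invented data (five made-up districts, not the California data) and keeps the pair with the smallest sum of squared mistakes:

```python
def sum_squared_mistakes(b0, b1, x, y):
    """Expression (4.6): sum over i of (Y_i - b0 - b1*X_i)**2."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Invented data for five hypothetical districts (class size, test score)
x = [15.0, 17.0, 19.0, 20.0, 23.0]
y = [680.0, 672.0, 660.0, 655.0, 640.0]

# Coarse grid of candidate intercepts (600..800) and slopes (-10.0..0.0)
candidates = [(b0, b1 / 10)
              for b0 in range(600, 801, 5)
              for b1 in range(-100, 1)]
best = min(candidates, key=lambda c: sum_squared_mistakes(c[0], c[1], x, y))
print(best[1] < 0)  # the best-fitting slope is negative -> True
```

This brute-force search is exactly the "trying different values repeatedly" approach; the closed-form formulas make it unnecessary in practice.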

The predicted value of Yi given Xi, based on the OLS regression line, is Ŷi = β̂0 + β̂1Xi. The residual for the ith observation is the difference between Yi and its predicted value: ûi = Yi − Ŷi.

The OLS estimators, β̂0 and β̂1, are sample counterparts of the population coefficients, β0 and β1. Similarly, the OLS regression line β̂0 + β̂1X is the sample counterpart of the population regression line β0 + β1X, and the OLS residuals ûi are sample counterparts of the population errors ui.

You could compute the OLS estimators β̂0 and β̂1 by trying different values repeatedly until you find those that minimize the total squared mistakes in Expression (4.6); they are the least squares estimates. This method would be quite tedious, however. Fortunately, there are formulas, derived by minimizing Expression (4.6) using calculus, that streamline the calculation of the OLS estimators. The OLS formulas and terminology are collected in Key Concept 4.2. These formulas are implemented in virtually all statistical and spreadsheet programs and are derived in Appendix 4.2.

KEY CONCEPT 4.2
The OLS Estimator, Predicted Values, and Residuals

The OLS estimators of the slope β1 and the intercept β0 are

β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²,   (4.7)

β̂0 = Ȳ − β̂1X̄.   (4.8)

The OLS predicted values Ŷi and residuals ûi are

Ŷi = β̂0 + β̂1Xi, i = 1, …, n,   (4.9)

ûi = Yi − Ŷi, i = 1, …, n.   (4.10)

The estimated intercept (β̂0), slope (β̂1), and residual (ûi) are computed from a sample of n observations of Xi and Yi, i = 1, …, n. These are estimates of the unknown true population intercept (β0), slope (β1), and error term (ui).
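For concreteness, the formulas in Key Concept 4.2 translate directly into a few lines of code. The sketch below uses invented numbers for five hypothetical districts (not the California data), just to show the mechanics:

```python
def ols(x, y):
    """OLS estimates for Y_i = beta0 + beta1*X_i + u_i (Key Concept 4.2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Equation (4.7): cross deviations over squared deviations of X
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar                        # Equation (4.8)
    y_hat = [b0 + b1 * xi for xi in x]             # Equation (4.9): predicted values
    u_hat = [yi - yh for yi, yh in zip(y, y_hat)]  # Equation (4.10): residuals
    return b0, b1, y_hat, u_hat

# Invented data: (class size, test score) for five hypothetical districts
x = [15.0, 17.0, 19.0, 20.0, 23.0]
y = [680.0, 672.0, 660.0, 655.0, 640.0]
b0, b1, y_hat, u_hat = ols(x, y)
print(b1 < 0)                  # larger classes, lower scores -> True
print(abs(sum(u_hat)) < 1e-9)  # OLS residuals sum to zero -> True
```

Note that the residuals summing to zero is a mechanical property of any OLS fit that includes an intercept.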

OLS Estimates of the Relationship Between Test Scores and the Student-Teacher Ratio

When OLS is used to estimate a line relating the student-teacher ratio to test scores using the 420 observations in Figure 4.2, the estimated slope is −2.28 and the estimated intercept is 698.9. Accordingly, the OLS regression line for these 420 observations is

TestScore^ = 698.9 − 2.28 × STR,   (4.11)

where TestScore is the average test score in the district and STR is the student-teacher ratio. The "^" over TestScore in Equation (4.11) indicates that it is the predicted value based on the OLS regression line. Figure 4.3 plots this OLS regression line superimposed over the scatterplot of the data previously shown in Figure 4.2.

The slope of −2.28 means that an increase in the student-teacher ratio by one student per class is, on average, associated with a decline in districtwide test scores by 2.28 points on the test. A decrease in the student-teacher ratio by two students per class is, on average, associated with an increase in test scores of 4.56 points [= −2 × (−2.28)]. The negative slope indicates that more students per teacher (larger classes) is associated with poorer performance on the test.

FIGURE 4.3 The Estimated Regression Line for the California Data
[Figure: the estimated regression line TestScore^ = 698.9 − 2.28 × STR superimposed on the scatterplot of test scores against the student-teacher ratio. The line shows a negative relationship: if class sizes fall by one student, the estimated regression predicts that test scores will increase by 2.28 points.]
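Mechanically, Equation (4.11) turns a district's student-teacher ratio into a predicted test score. A trivial sketch, using the coefficients reported above:

```python
def predicted_test_score(str_ratio):
    """Predicted test score from Equation (4.11): 698.9 - 2.28 * STR."""
    return 698.9 - 2.28 * str_ratio

# A two-student reduction in class size raises the prediction by
# (-2) x (-2.28) = 4.56 points, whatever the starting class size.
change = predicted_test_score(18) - predicted_test_score(20)
print(round(change, 2))  # -> 4.56
```

Because the line is straight, the predicted effect of a given change in STR is the same at every starting value of STR.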

It is now possible to predict the districtwide test score given a value of the student-teacher ratio. For example, for a district with 20 students per teacher, the predicted test score is 698.9 − 2.28 × 20 = 653.3. Of course, this prediction will not be exactly right because of the other factors that determine a district's performance. But the regression line does give a prediction (the OLS prediction) of what test scores would be for that district, based on its student-teacher ratio, absent those other factors.

Is this estimate of the slope large or small? To answer this, we return to the superintendent's problem. Recall that she is contemplating hiring enough teachers to reduce the student-teacher ratio by 2. Suppose her district is at the median of the California districts. From Table 4.1, the median student-teacher ratio is 19.7 and the median test score is 654.5. A reduction of two students per class, from 19.7 to 17.7, would move her student-teacher ratio from the 50th percentile to very near the 10th percentile. This is a big change, and she would need to hire many new teachers. How would it affect test scores?

According to these estimates, cutting the student-teacher ratio by 2 is predicted to increase test scores by approximately 4.6 points; if her district's test scores are at the median, 654.5, they are predicted to increase to 659.1. Is this improvement large or small? According to Table 4.1, this improvement would move her district from the median to just short of the 60th percentile. Thus a decrease in class size that would place her district close to the 10% of districts with the smallest classes would move her test scores from the 50th to the 60th percentile. According to these estimates, at least, cutting the student-teacher ratio by a large amount (two students per teacher) would help and might be worth doing depending on her budgetary situation, but it would not be a panacea.

What if the superintendent were contemplating a far more radical change, such as reducing the student-teacher ratio from 20 students per teacher to 5? Unfortunately, the estimates in Equation (4.11) would not be very useful to her. This regression was estimated using the data in Figure 4.2, and, as the figure shows, the smallest student-teacher ratio in these data is 14. These data contain no information on how districts with extremely small classes perform, so these data alone are not a reliable basis for predicting the effect of a radical move to such an extremely low student-teacher ratio.
Why Use the OLS Estimator?

There are both practical and theoretical reasons to use the OLS estimators β̂0 and β̂1. Because OLS is the dominant method used in practice, it has become the common language for regression analysis throughout economics, finance (see "The 'Beta' of a Stock" box), and the social sciences more generally. Presenting results using OLS (or its variants discussed later in this book) means that you are "speaking the same language" as other economists and statisticians. The OLS formulas are built into virtually all spreadsheet and statistical software packages, making OLS easy to use.

0 1.ut(dl. In contrast.05. however. The OLS formulas software packages ' making and statistical . or Rf. According excess return to the CAPM. R . Company staples like Kellogg have stocks with low betas.S.11ms the expected excess risk-free. riskier stocks The capital asset pricing model (CAPM) formalizes this idea.3 0. stocks. The table below gives estimated betas for seven should be measured by its variance. a stock bought on January 1 for $100. would have a return of R ~ [($105 . should be positive. by diversifying your financial holdings. the risk-free return is often taken to be the rate of interest short-term U. For example. 0. Said differently.50 dividend during the year and sold on December 31 for $1. ~sing OLS (or its variants discussed later in this book) means that you are "speakmg the same language" are built Into virtually OLS easy to use.6 1. This means that the right way to measure the risk of a stock is not by its variance but rather by its covariance with the market. Tn practice. return. which then paid a $2.Rf. Low-risk producers of consumer have high betas. as other economists all spreadsheet and statisticians.5 0.6 0. That is.$100) + $2. Much of that risk. The "beta" of a stock has become a workhorse of the investment industry.4 CAPM. the CAPM says that (412) where Rm is the expected return on the market portfolio and f3 is the coefficient in the population regreson sion of R . government debt. like owning firm Web sites. "Thereturn o~ ~ll investment is the change in its price plus an~ p~y.ccm.can "porttolio't-ciu be reduced by holding other stocks in a other words. on a risky investment. stock in a company.Rf. a stock with a {3 < 1 has less risk than the market portfolio and therefore has a lower expected excess return than the market portfolio.vldend) from the investment as a percentage of Its initial ~nce.3 2.S.o.Rf on Rill . U. investment. 
and you can obtain estiby mated betas for hundreds of stocks on investment return.118 CHAPTER 4 LinearRegressionwith One Regressor _f---------A functamental investor idea of modern finance is that an a financial incentive to take a a stock with a f3 > 1 is riskier than the market portexcess needs folio and thus commands a higher expected risk.50]/$ tOO~ 7. must exceed the return on a safe.5%. Those betas typically are estimated At first it might seem like the risk of a stock OLS regression of the actual excess return on the stock against the actual excess return on a broad market index. R. the expected return! on a risky investment. the expected to the on an asset is proportional Estimated {J expected excess return on a portfolio of all available assets (the "market portfolio"). According to the Wal-Mart (discount retailer) Kellogg(breakfast cereal) Waste Management (waste disposal) Verizon (telecommunications) Microsoft (software] Best Buy (electronic equipment retailer} Bankof America (bank) Source: Smartbdoncy.
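To illustrate how a beta can be estimated by OLS, the sketch below simulates excess returns for a hypothetical stock whose true beta is 1.5 and recovers the slope. The data are invented purely for the mechanics, so the estimate only approximates 1.5:

```python
import random

random.seed(0)

# Invented monthly excess returns: a market index, and a hypothetical stock
# whose true beta is 1.5 plus idiosyncratic (diversifiable) noise.
market = [random.gauss(0.005, 0.04) for _ in range(240)]
stock = [1.5 * m + random.gauss(0.0, 0.03) for m in market]

m_bar = sum(market) / len(market)
s_bar = sum(stock) / len(stock)
# Estimated beta = OLS slope of stock excess return on market excess return
beta_hat = sum((m - m_bar) * (s - s_bar) for m, s in zip(market, stock)) / \
           sum((m - m_bar) ** 2 for m in market)
print(beta_hat)  # close to the true beta of 1.5
```

Real-world betas are computed the same way, with actual stock and index returns in place of the simulated series.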

The OLS estimators also have desirable theoretical properties. They are analogous to the desirable properties, studied in Section 3.1, of Ȳ as an estimator of the population mean. Under the assumptions introduced in Section 4.4, the OLS estimator is unbiased and consistent. The OLS estimator is also efficient among a certain class of unbiased estimators; however, this efficiency result holds under some additional special conditions, and further discussion of this result is deferred until Section 5.5.

4.3 Measures of Fit

Having estimated a linear regression, you might wonder how well that regression line describes the data. Does the regressor account for much or for little of the variation in the dependent variable? Are the observations tightly clustered around the regression line, or are they spread out?

The R² and the standard error of the regression measure how well the OLS regression line fits the data. The R² ranges between 0 and 1 and measures the fraction of the variance of Yi that is explained by Xi. The standard error of the regression measures how far Yi typically is from its predicted value.

The R²

The regression R² is the fraction of the sample variance of Yi explained by (or predicted by) Xi. The definitions of the predicted value and the residual (see Key Concept 4.2) allow us to write the dependent variable Yi as the sum of the predicted value, Ŷi, plus the residual ûi:

Yi = Ŷi + ûi.   (4.13)

In this notation, the R² is the ratio of the sample variance of Ŷi to the sample variance of Yi.

Mathematically, the R² can be written as the ratio of the explained sum of squares to the total sum of squares. The explained sum of squares (ESS) is the sum of squared deviations of the predicted values of Yi, Ŷi, from their average, and the total sum of squares (TSS) is the sum of squared deviations of Yi from its average:

ESS = Σ_{i=1}^n (Ŷi − Ȳ)²,   (4.14)

TSS = Σ_{i=1}^n (Yi − Ȳ)².   (4.15)

Equation (4.14) uses the fact that the sample average of the OLS predicted values equals Ȳ (proven in Appendix 4.3).

The R² is the ratio of the explained sum of squares to the total sum of squares:

R² = ESS / TSS.   (4.16)

Alternatively, the R² can be written in terms of the fraction of the variance of Yi not explained by Xi. The sum of squared residuals, or SSR, is the sum of the squared OLS residuals:

SSR = Σ_{i=1}^n ûi².   (4.17)

It is shown in Appendix 4.3 that TSS = ESS + SSR. Thus the R² also can be expressed as 1 minus the ratio of the sum of squared residuals to the total sum of squares:

R² = 1 − SSR / TSS.   (4.18)

Finally, the R² of the regression of Y on the single regressor X is the square of the correlation coefficient between Y and X.

The R² ranges between 0 and 1. If β̂1 = 0, then Xi explains none of the variation of Yi, and the predicted value of Yi based on the regression is just the sample average of Yi. In this case, the explained sum of squares is zero and the sum of squared residuals equals the total sum of squares; thus the R² is zero. In contrast, if Xi explains all of the variation of Yi, then Yi = Ŷi for all i and every residual is zero (that is, ûi = 0), so that ESS = TSS and R² = 1. In general, the R² does not take on the extreme values of 0 or 1 but falls somewhere in between. An R² near 1 indicates that the regressor is good at predicting Yi, while an R² near 0 indicates that the regressor is not very good at predicting Yi.

The Standard Error of the Regression

The standard error of the regression (SER) is an estimator of the standard deviation of the regression error ui. The units of ui and Yi are the same, so the SER is a measure of the spread of the observations around the regression line, measured in the units of the dependent variable.

For example, if the units of the dependent variable are dollars, then the SER measures the magnitude of a typical deviation from the regression line, that is, the magnitude of a typical regression error, in dollars.

Because the regression errors u1, …, un are unobserved, the SER is computed using their sample counterparts, the OLS residuals û1, …, ûn. The formula for the SER is

SER = sû, where sû² = (1/(n − 2)) Σ_{i=1}^n ûi² = SSR / (n − 2),   (4.19)

where the formula for sû² uses the fact (proven in Appendix 4.3) that the sample average of the OLS residuals is zero.

The formula for the SER in Equation (4.19) is similar to the formula for the sample standard deviation of Y given in Equation (3.7) in Section 3.2, except that Yi − Ȳ in Equation (3.7) is replaced by ûi, and the divisor in Equation (3.7) is n − 1, whereas here it is n − 2. The reason for using the divisor n − 2 here (instead of n) is the same as the reason for using the divisor n − 1 in Equation (3.7): It corrects for a slight downward bias introduced because two regression coefficients were estimated. This is called a "degrees of freedom" correction; because two coefficients were estimated (β0 and β1), two "degrees of freedom" of the data were lost, so the divisor in this factor is n − 2. (The mathematics behind this is discussed in Section 5.6.) When n is large, the difference among dividing by n, by n − 1, or by n − 2 is negligible.

Application to the Test Score Data

Equation (4.11) reports the regression line, estimated using the California test score data, relating the standardized test score (TestScore) to the student-teacher ratio (STR). The R² of this regression is 0.051, or 5.1%, and the SER is 18.6.

The R² of 0.051 means that the regressor STR explains 5.1% of the variance of the dependent variable TestScore. Figure 4.3 superimposes this regression line on the scatterplot of the TestScore and STR data. As the scatterplot shows, the student-teacher ratio explains some of the variation in test scores, but much variation remains unaccounted for.

The SER of 18.6 means that the standard deviation of the regression residuals is 18.6, where the units are points on the standardized test. Because the standard deviation is a measure of spread, the SER of 18.6 means that there is a large spread of the scatterplot in Figure 4.3 around the regression line as measured in points on the test. This large spread means that predictions of test scores made using only the student-teacher ratio for that district will often be wrong by a large amount.

What should we make of this low R² and large SER? The fact that the R² of this regression is low (and the SER is large) does not, by itself, imply that this regression is either "good" or "bad." What the low R² does tell us is that other important factors influence test scores. These factors could include differences in the student body across districts, differences in school quality unrelated to the student-teacher ratio, or luck on the test. The low R² and high SER do not tell us what these factors are, but they do indicate that the student-teacher ratio alone explains only a small part of the variation in test scores in these data.

4.4 The Least Squares Assumptions

This section presents a set of three assumptions on the linear regression model and the sampling scheme under which OLS provides an appropriate estimator of the unknown regression coefficients, β0 and β1. Initially, these assumptions might appear abstract. They do, however, have natural interpretations, and understanding these assumptions is essential for understanding when OLS will, and will not, give useful estimates of the regression coefficients.

Assumption #1: The Conditional Distribution of ui Given Xi Has a Mean of Zero

The first of the three least squares assumptions is that the conditional distribution of ui given Xi has a mean of zero. This assumption is a formal mathematical statement about the "other factors" contained in ui and asserts that these other factors are unrelated to Xi in the sense that, given a value of Xi, the mean of the distribution of these other factors is zero.

This assumption is illustrated in Figure 4.4. The population regression is the relationship that holds on average between class size and test scores in the population, and the error term ui represents the other factors that lead test scores at a given district to differ from the prediction based on the population regression line. As shown in Figure 4.4, at a given value of class size, say 20 students per class, sometimes these other factors lead to better performance than predicted (ui > 0) and sometimes to worse performance than predicted (ui < 0), but on average over the population the prediction is right. In other words, the distribution of ui, conditional on Xi = 20, has a mean of zero. In Figure 4.4, this is shown as the distribution of ui being centered on the population regression line at Xi = 20 and, more generally, at other values x of Xi as well. Said differently, the distribution of ui, conditional on Xi = x, has a mean of zero; stated mathematically, E(ui | Xi = x) = 0 or, in somewhat simpler notation, E(ui | Xi) = 0.

As shown in Figure 4.4, the assumption that E(ui | Xi) = 0 is equivalent to assuming that the population regression line is the conditional mean of Yi given Xi (a mathematical proof of this is left as Exercise 4.6).

FIGURE 4.4 The Conditional Probability Distributions and the Population Regression Line
[Figure: the conditional probability distributions of test scores for districts with class sizes of 15, 20, and 25 students, plotted against the student-teacher ratio. At a given value of X, Y is distributed around the regression line, and the error, u = Y − (β0 + β1X), has a conditional mean of zero for all values of X. The mean of the conditional distribution of test scores, given the student-teacher ratio, E(Y | X), is the population regression line β0 + β1X.]

The conditional mean of u in a randomized controlled experiment. In a randomized controlled experiment, subjects are randomly assigned to the treatment group (X = 1) or to the control group (X = 0). The random assignment typically is done using a computer program that uses no information about the subject, ensuring that X is distributed independently of all personal characteristics of the subject. Random assignment makes X and u independent, which in turn implies that the conditional mean of u given X is zero.

In observational data, X is not randomly assigned in an experiment. Instead, the best that can be hoped for is that X is as if randomly assigned, in the precise sense that E(ui | Xi) = 0. Whether this assumption holds in a given empirical application with observational data requires careful thought and judgment, and we return to this issue repeatedly.
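Assumption #1 can also be illustrated by simulation. In the sketch below, the error u is drawn with mean zero at every value of X (the population coefficients and distributions are invented for the illustration), and the OLS slope, averaged across many simulated samples, comes out close to the population slope:

```python
import random

random.seed(42)
beta0, beta1 = 700.0, -2.0   # invented population coefficients
n, reps = 100, 1000

slopes = []
for _ in range(reps):
    x = [random.uniform(14.0, 25.0) for _ in range(n)]   # class sizes
    # E(u | X) = 0 by construction: u is drawn independently of x, mean zero
    u = [random.gauss(0.0, 10.0) for _ in range(n)]
    y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    slopes.append(b1)

avg_slope = sum(slopes) / reps
print(abs(avg_slope - beta1) < 0.1)  # average slope near the true slope -> True
```

If u were instead generated to depend on x (violating the assumption), the average estimated slope would drift away from the population value.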

Correlation and conditional mean. Recall from Section 2.3 that if the conditional mean of one random variable given another is zero, then the two random variables have zero covariance and thus are uncorrelated [Equation (2.27)]. Thus the conditional mean assumption E(ui | Xi) = 0 implies that Xi and ui are uncorrelated, or corr(Xi, ui) = 0. Because correlation is a measure of linear association, this implication does not go the other way; even if Xi and ui are uncorrelated, the conditional mean of ui given Xi might be nonzero. However, if Xi and ui are correlated, then it must be the case that E(ui | Xi) is nonzero. It is therefore often convenient to discuss the conditional mean assumption in terms of possible correlation between Xi and ui. If Xi and ui are correlated, then the conditional mean assumption is violated.

Assumption #2: (Xi, Yi), i = 1, …, n, Are Independently and Identically Distributed

The second least squares assumption is that (Xi, Yi), i = 1, …, n, are independently and identically distributed (i.i.d.) across observations. As discussed in Section 2.5 (Key Concept 2.5), this assumption is a statement about how the sample is drawn. If the observations are drawn by simple random sampling from a single large population, then (Xi, Yi), i = 1, …, n, are i.i.d. For example, let X be the age of a worker and Y be his or her earnings, and imagine drawing a person at random from the population of workers. That randomly drawn person will have a certain age and earnings (that is, X and Y will take on some values). If a sample of n workers is drawn from this population, then (Xi, Yi), i = 1, …, n, necessarily have the same distribution. If they are drawn at random, they are also distributed independently from one observation to the next; that is, they are i.i.d.

The i.i.d. assumption is a reasonable one for many data collection schemes. For example, survey data from a randomly chosen subset of the population typically can be treated as i.i.d.

Not all sampling schemes produce i.i.d. observations on (Xi, Yi), however. One example is when the values of X are not drawn from a random sample of the population but rather are set by a researcher as part of an experiment. For example, suppose a horticulturalist wants to study the effects of different organic weeding methods (X) on tomato production (Y) and accordingly grows different plots of tomatoes using different organic weeding techniques. If she picks the techniques (the level of X) to be used on the ith plot and applies the same technique to the ith plot in all repetitions of the experiment, then the value of Xi does not change from one sample to the next. Thus Xi is nonrandom (although the outcome Yi is random), so the sampling scheme is not i.i.d. The results presented in this chapter developed

Y. sampling is when observations refer to the same unit of observation Over time.i.3 showing that s ~ is consistent.or both that are far outside the usual range of the data-are unlikely. We encountered this assumption in Chapter 3 when discussing the consistency of the sample variance. . they are likely to be low next quarter.. the assumption that large outliers are unlikely is made mathematically precise by assuming that X and Y have nonzero finite fourth moments: o < E(Xt) < 00 and 0 < E(YI) < 00.lf}J. Another example of non-i. Another way to state this assumption is that X and Y have finite kurtosis. is finite. where these data are collected over time from a specificfirm. This is an example of time series data.4. and a key feature of time series data is that observations falling close to each other in time are not independent bnt rather tend to be correlated with each other. the level of X is random and (Xi.9) states tliat the sample variance s~ is a consistent estimator ofthe population variance a} (s~ .. The case of a nonrandom regressor is.. assumption. . for example. if interest rates are low now.i. Specifically.i. This potential sensitivity of OLS to extreme outliers is illustrated in Figure 4. This pattern of correlation violates the "independence" part of the i. quite special.!!.. In this book. they might be recorded four times a year (quarterly) for 30 years.2:i~l("Y. and the fourth moment of Y.i. then the law of large numbers in Key 1 n . . regressors are also true if the regressors are nonrandom.however.For example. For example.6 applies to the average.a key step in t I proo f In ie Appendix 3.. When this modern experimental protocol is used.4 TheLeastSquaresAssumptions 125 for i. Concept 2. are i. 1'. Large outliers can make OLS regression results misleading.i. Imagine collecting .y )2 . The assumption of finite kurtosis is used in the mathematics that justify the large-sample approximations to the distributions of the OLS test statistics. 
observations with values of Xi. we might have data on inventory levels (Y) at a firm and the interest rate at which the firm can borrow (X).. such as a typographical error or incorrectly using different units for different observations. Equation (3.d.d. -jJ. a}). Assumption #3: Large Outliers Are Unlikely The third least squares assumption is that large outliers-that is.. thereby circumventing any possible bias by the horticulturalist (she might use her favorite weeding method for the tomatoes in the sunniest plot).d. One source of large outliers is data entry errors.d. Y.d.. Time series data introduce a set of complications that are best handled after developing the basic tools of regression analysis.5 using hypothetical data..) are i.. modern experimental protocols would have the horticulturalist assign the level of X to the different plots using a computerized random number generator.
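The sensitivity to outliers described above is easy to reproduce. The following sketch uses made-up data (all numbers are hypothetical): Y is generated with no relationship to X, yet a single extreme observation, such as the units error just described, produces a strongly positive OLS slope.

```python
import numpy as np

# Hypothetical illustration of the point in Figure 4.5: one large outlier
# can change an OLS slope estimate dramatically. The data are invented.
rng = np.random.default_rng(1)

def ols_slope(x, y):
    """OLS slope: sum of (Xi - Xbar)(Yi - Ybar) over sum of (Xi - Xbar)^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

x = rng.uniform(30, 70, size=50)
y = 500 + rng.normal(0, 100, size=50)  # Y unrelated to X: true slope is 0

slope_clean = ols_slope(x, y)

# Add one extreme observation (e.g., a data entry error)
x_out = np.append(x, 68)
y_out = np.append(y, 2000)
slope_outlier = ols_slope(x_out, y_out)

print(slope_clean, slope_outlier)  # the single outlier pulls the slope up sharply
```

Plotting the data, as the text recommends, would reveal this observation immediately.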

Figure 4.5  The Sensitivity of OLS to Large Outliers

This hypothetical data set has one outlier. The OLS regression line estimated with the outlier shows a strong positive relationship between X and Y, but the OLS regression line estimated without the outlier shows no relationship.

One way to find outliers is to plot your data. If you decide that an outlier is due to a data entry error, then you can either correct the error or, if that is impossible, drop the observation from your data set.

Data entry errors aside, the assumption of finite kurtosis is a plausible one in many applications with economic data. Class size is capped by the physical capacity of a classroom; the best you can do on a standardized test is to get all the questions right, and the worst you can do is to get all the questions wrong. Because class size and test scores have a finite range, they necessarily have finite kurtosis. More generally, commonly used distributions such as the normal distribution have finite fourth moments. Still, as a mathematical matter, some distributions have infinite fourth moments, and this assumption rules out those distributions. If the assumption holds, then it is unlikely that statistical inferences using OLS will be dominated by a few observations.

Use of the Least Squares Assumptions

The three least squares assumptions for the linear regression model are summarized in Key Concept 4.3. The least squares assumptions play twin roles, and we return to them repeatedly throughout this textbook. Their first role is mathematical: If these assumptions hold, then, as shown in the next section, in large samples the OLS estimators have sampling distributions that are normal. In turn, this large-sample normal distribution lets us develop methods for hypothesis testing and constructing confidence intervals using the OLS estimators. Their second role is to organize the circumstances that pose difficulties for OLS regression. As we will see, the first least squares assumption is the most important to consider in practice.

Key Concept 4.3  The Least Squares Assumptions

Yi = β0 + β1Xi + ui, i = 1, ..., n, where

1. The error term ui has conditional mean zero given Xi: E(ui|Xi) = 0;
2. (Xi, Yi), i = 1, ..., n, are independent and identically distributed (i.i.d.) draws from their joint distribution; and
3. Large outliers are unlikely: Xi and Yi have nonzero finite fourth moments.

One reason why the first least squares assumption might not hold in practice is discussed in Chapter 6, and additional reasons are discussed in Section 9.2. It is also important to consider whether the second assumption holds in an application. Although it plausibly holds in many cross-sectional data sets, the independence assumption is inappropriate for time series data; therefore, the regression methods developed under assumption 2 require modification for some applications with time series data. The third assumption serves as a reminder that OLS, just like the sample mean, can be sensitive to large outliers. If your data set contains large outliers, you should examine those outliers carefully to make sure those observations are correctly recorded and belong in the data set.

4.5 Sampling Distribution of the OLS Estimators

Because the OLS estimators β̂0 and β̂1 are computed from a randomly drawn sample, the estimators themselves are random variables with a probability distribution (the sampling distribution) that describes the values they could take over different possible random samples. This section presents these sampling distributions.

In small samples, these distributions are complicated, but in large samples, they are approximately normal because of the central limit theorem.

The Sampling Distribution of the OLS Estimators

Review of the sampling distribution of Y̅.  Recall the discussion in Sections 2.5 and 2.6 about the sampling distribution of the sample average, Y̅, an estimator of the unknown population mean of Y, μY. Because Y̅ is calculated using a randomly drawn sample, Y̅ is a random variable that takes on different values from one sample to the next; the probability of these different values is summarized in its sampling distribution. Although the sampling distribution of Y̅ can be complicated when the sample size is small, it is possible to make certain statements about it that hold for all n. In particular, the mean of the sampling distribution is μY, that is, E(Y̅) = μY, so Y̅ is an unbiased estimator of μY. If n is large, then more can be said about the sampling distribution. In particular, the central limit theorem (Section 2.6) states that this distribution is approximately normal.

The sampling distribution of β̂0 and β̂1.  These ideas carry over to the OLS estimators β̂0 and β̂1 of the unknown intercept β0 and slope β1 of the population regression line. Because the OLS estimators are calculated using a random sample, β̂0 and β̂1 are random variables that take on different values from one sample to the next; the probability of these different values is summarized in their sampling distributions. Although the sampling distribution of β̂0 and β̂1 can be complicated when the sample size is small, it is possible to make certain statements about it that hold for all n. In particular, the means of the sampling distributions of β̂0 and β̂1 are β0 and β1. In other words, under the least squares assumptions in Key Concept 4.3,

E(β̂0) = β0 and E(β̂1) = β1;  (4.20)

that is, β̂0 and β̂1 are unbiased estimators of β0 and β1. The proof that β̂1 is unbiased is given in Appendix 4.3, and the proof that β̂0 is unbiased is left as Exercise 4.7.

If the sample is sufficiently large, then by the central limit theorem the sampling distribution of β̂0 and β̂1 is well approximated by the bivariate normal distribution (Section 2.4). This implies that the marginal distributions of β̂0 and β̂1 are normal in large samples. This argument invokes the central limit theorem. Technically, the central limit theorem concerns the distribution of averages (like Y̅). If you examine the numerator in Equation (4.7) for β̂1, you will see that it, too, is a type of average: not a simple average, like Y̅, but an average of the product (Yi − Y̅)(Xi − X̅). As discussed further in Appendix 4.3, the central limit theorem applies to this average, so that, like the simpler average Y̅, it is normally distributed in large samples.
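The two claims above, unbiasedness for any n and a shrinking spread as n grows, can be checked with a small Monte Carlo sketch. This simulation is illustrative only; the model and all parameter values are invented.

```python
import numpy as np

# Hypothetical Monte Carlo check: across many random samples, the OLS slope
# estimates center on the true beta1 (unbiasedness), and their spread falls
# as the sample size n grows.
rng = np.random.default_rng(2)
beta0, beta1 = 5.0, 2.0

def ols_slope(x, y):
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

def simulate(n, reps=2000):
    """Draw `reps` samples of size n and return the OLS slope from each."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0, 1, size=n)
        y = beta0 + beta1 * x + rng.normal(0, 1, size=n)
        slopes[r] = ols_slope(x, y)
    return slopes

small, large = simulate(50), simulate(200)
print(small.mean(), large.mean())  # both close to beta1 = 2.0
print(small.std(), large.std())    # spread roughly halves when n quadruples
```

A histogram of either set of slopes would also look approximately normal, as the central limit theorem argument predicts.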

Key Concept 4.4  Large-Sample Distributions of β̂0 and β̂1

If the least squares assumptions in Key Concept 4.3 hold, then in large samples β̂0 and β̂1 have a jointly normal sampling distribution. The large-sample normal distribution of β̂1 is N(β1, σ²β̂1), where the variance of this distribution, σ²β̂1, is

σ²β̂1 = (1/n) · var[(Xi − μX)ui] / [var(Xi)]².  (4.21)

The large-sample normal distribution of β̂0 is N(β0, σ²β̂0), where

σ²β̂0 = (1/n) · var(Hiui) / [E(Hi²)]², where Hi = 1 − [μX / E(Xi²)]Xi.  (4.22)

The normal approximation to the distribution of the OLS estimators in large samples is summarized in Key Concept 4.4. (Appendix 4.3 summarizes the derivation of these formulas.) A relevant question in practice is how large n must be for these approximations to be reliable. In Section 2.6, we suggested that n = 100 is sufficiently large for the sampling distribution of Y̅ to be well approximated by a normal distribution, and sometimes smaller n suffices. This criterion carries over to the more complicated averages appearing in regression analysis. In virtually all modern econometric applications, n > 100, so we will treat the normal approximations to the distributions of the OLS estimators as reliable unless there are good reasons to think otherwise.

The results in Key Concept 4.4 imply that the OLS estimators are consistent; that is, when the sample size is large, β̂0 and β̂1 will be close to the true population coefficients β0 and β1 with high probability. This is because the variances σ²β̂0 and σ²β̂1 of the estimators decrease to zero as n increases (n appears in the denominator of the formulas for the variances), so the distribution of the OLS estimators will be tightly concentrated around their means, β0 and β1, when n is large.

Another implication of the distributions in Key Concept 4.4 is that, in general, the larger is the variance of Xi, the smaller is the variance σ²β̂1 of β̂1. Mathematically, this implication arises because the variance of β̂1 in Equation (4.21) is inversely proportional to the square of the variance of Xi: The larger is var(Xi), the larger is the denominator in Equation (4.21), so the smaller is σ²β̂1. To get a better sense of why this is so, look at Figure 4.6, which presents a scatterplot of 150 artificial data points on X and Y.

Figure 4.6  The Variance of β̂1 and the Variance of X

The colored dots represent a set of Xi's with a small variance. The black dots represent a set of Xi's with a large variance. The regression line can be estimated more accurately with the black dots than with the colored dots.

The data points indicated by the colored dots are the 75 observations closest to X̅. Suppose you were asked to draw a line as accurately as possible through either the colored or the black dots. Which would you choose? It would be easier to draw a precise line through the black dots, which have a larger variance than the colored dots. Similarly, the larger the variance of X, the more precise is β̂1.

The distributions in Key Concept 4.4 also imply that the smaller is the variance of the error ui, the smaller is the variance of β̂1. This can be seen mathematically in Equation (4.21) because ui enters the numerator, but not the denominator, of Equation (4.21): If all the ui were smaller by a factor of one-half but the X's did not change, then σβ̂1 would be smaller by a factor of one-half and σ²β̂1 would be smaller by a factor of one-fourth (Exercise 4.13). Stated less mathematically, if the errors are smaller (holding the X's fixed), then the data will have a tighter scatter around the population regression line, so its slope will be estimated more precisely.

The normal approximation to the sampling distribution of β̂0 and β̂1 is a powerful tool. With this approximation in hand, we are able to develop methods for making inferences about the true population values of the regression coefficients using only a sample of data.
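Both comparative statics, more precision when var(X) is larger and when the errors are smaller, can be verified numerically. The sketch below is a hypothetical simulation with invented parameter values, not an example from the text.

```python
import numpy as np

# Hypothetical check of the two implications above: the OLS slope is
# estimated more precisely when var(X) is larger and when the errors u
# are smaller. All numbers are made up for the demonstration.
rng = np.random.default_rng(3)

def slope_sd(x_sd, u_sd, n=100, reps=3000):
    """Monte Carlo standard deviation of the OLS slope estimator."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0, x_sd, size=n)
        u = rng.normal(0, u_sd, size=n)
        y = 1.0 + 2.0 * x + u
        slopes[r] = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return slopes.std()

base = slope_sd(x_sd=1.0, u_sd=1.0)
wide_x = slope_sd(x_sd=2.0, u_sd=1.0)   # doubling sd(X): slope sd roughly halves
small_u = slope_sd(x_sd=1.0, u_sd=0.5)  # halving sd(u): slope sd roughly halves
print(base, wide_x, small_u)
```

Both ratios come out near one-half, matching what Equation (4.21) predicts for these changes.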

4.6 Conclusion

This chapter has focused on the use of ordinary least squares to estimate the intercept and slope of a population regression line using a sample of n observations on a dependent variable, Y, and a single regressor, X. There are many ways to draw a straight line through a scatterplot, but doing so using OLS has several virtues. If the least squares assumptions hold, then the OLS estimators of the slope and intercept are unbiased, are consistent, and have a sampling distribution with a variance that is inversely proportional to the sample size n. Moreover, if n is large, then the sampling distribution of the OLS estimator is normal.

These important properties of the sampling distribution of the OLS estimator hold under the three least squares assumptions. The first assumption is that the error term in the linear regression model has a conditional mean of zero, given the regressor X. This assumption implies that the OLS estimator is unbiased. The second assumption is that (Xi, Yi) are i.i.d., as is the case if the data are collected by simple random sampling. This assumption yields the formula, presented in Key Concept 4.4, for the variance of the sampling distribution of the OLS estimator. The third assumption is that large outliers are unlikely. Stated more formally, X and Y have finite fourth moments (finite kurtosis). The reason for this assumption is that OLS can be unreliable if there are large outliers. Taken together, the three least squares assumptions imply that the OLS estimator is normally distributed in large samples as described in Key Concept 4.4.

The results in this chapter describe the sampling distribution of the OLS estimator. By themselves, however, these results are not sufficient to test a hypothesis about the value of β1 or to construct a confidence interval for β1. Doing so requires an estimator of the standard deviation of the sampling distribution, that is, the standard error of the OLS estimator. This step (moving from the sampling distribution of β̂1 to its standard error, hypothesis tests, and confidence intervals) is taken in the next chapter.

Summary

1. The population regression line, β0 + β1X, is the mean of Y as a function of the value of X. The slope, β1, is the expected change in Y associated with a one-unit change in X. The intercept, β0, determines the level (or height) of the regression line. Key Concept 4.1 summarizes the terminology of the population linear regression model.

2. The population regression line can be estimated using sample observations (Yi, Xi), i = 1, ..., n, by ordinary least squares (OLS). The OLS estimators of the regression intercept and slope are denoted β̂0 and β̂1.

3. The R² and standard error of the regression (SER) are measures of how close the values of Yi are to the estimated regression line. The R² is between 0 and 1, with a larger value indicating that the Yi's are closer to the line. The standard error of the regression is an estimator of the standard deviation of the regression error.

4. There are three key assumptions for the linear regression model: (1) The regression errors, ui, have a mean of zero conditional on the regressors Xi; (2) the sample observations are i.i.d. random draws from the population; and (3) large outliers are unlikely. If these assumptions hold, the OLS estimators β̂0 and β̂1 are (1) unbiased, (2) consistent, and (3) normally distributed when the sample is large.

Key Terms

linear regression model with a single regressor, dependent variable, independent variable, regressor, population regression line, population regression function, population intercept, population slope, population coefficients, parameters, error term, ordinary least squares (OLS) estimators, OLS regression line, sample regression line, sample regression function, predicted value, residual, regression R², explained sum of squares (ESS), total sum of squares (TSS), sum of squared residuals (SSR), standard error of the regression (SER), least squares assumptions

Review the Concepts

4.1 Explain the difference between β̂1 and β1; between the residual ûi and the regression error ui; and between the OLS predicted value Ŷi and E(Yi|Xi).

4.2 For each least squares assumption, provide an example in which the assumption is valid, and then provide an example in which the assumption fails.

4.3 Sketch a hypothetical scatterplot of data for an estimated regression with R² = 0.9. Sketch a hypothetical scatterplot of data for a regression with R² = 0.5.

Exercises

4.1 Suppose that a researcher, using data on class size (CS) and average test scores from 100 third-grade classes, estimates the OLS regression

TestScore-hat = 520.4 − 5.82 × CS, R² = 0.08, SER = 11.5.

a. A classroom has 22 students. What is the regression's prediction for that classroom's average test score?
b. Last year a classroom had 19 students, and this year it has 23 students. What is the regression's prediction for the change in the classroom average test score?
c. The sample average class size across the 100 classrooms is 21.4. What is the sample average of the test scores across the 100 classrooms? (Hint: Review the formulas for the OLS estimators.)
d. What is the sample standard deviation of test scores across the 100 classrooms? (Hint: Review the formulas for the R² and SER.)

4.2 Suppose that a random sample of 200 twenty-year-old men is selected from a population and that these men's height and weight are recorded. A regression of weight on height yields

Weight-hat = −99.41 + 3.94 × Height, R² = 0.81, SER = 10.2,

where Weight is measured in pounds and Height is measured in inches.

a. What is the regression's weight prediction for someone who is 70 in. tall? 65 in. tall? 74 in. tall?
b. A man has a late growth spurt and grows 1.5 in. over the course of a year. What is the regression's prediction for the increase in this man's weight?
c. Suppose that instead of measuring weight and height in pounds and inches, these variables are measured in centimeters and kilograms. What are the regression estimates from this new centimeter-kilogram regression? (Give all results: estimated coefficients, R², and SER.)
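Predictions from an estimated line like the one in Exercise 4.1 are plain arithmetic: plug the class size into the fitted equation. The sketch below uses the reported estimates (520.4 and −5.82) and the sample mean class size (21.4) from that exercise; it shows the mechanics rather than a unique answer key.

```python
# Arithmetic on the estimated line TestScore-hat = 520.4 - 5.82 x CS
# (values taken from Exercise 4.1).
b0, b1 = 520.4, -5.82

def predict(cs):
    """Predicted average test score for a class of size cs."""
    return b0 + b1 * cs

# Prediction for a class of 22 students
print(predict(22))                # ≈ 392.36

# Predicted change when class size rises from 19 to 23 students
print(predict(23) - predict(19))  # ≈ -23.28

# The OLS line passes through the sample means (Key Concept 4.2), so the
# sample mean of test scores is the prediction at the mean class size
print(predict(21.4))              # ≈ 395.85
```

The same pattern applies to the weight-on-height regression in Exercise 4.2.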

4.3 A regression of average weekly earnings (AWE, measured in dollars) on age (measured in years) using a random sample of college-educated full-time workers aged 25-65 yields the following:

AWE-hat = 696.7 + 9.6 × Age, R² = 0.023, SER = 624.1.

a. Explain what the coefficient values 696.7 and 9.6 mean.
b. The standard error of the regression (SER) is 624.1. What are the units of measurement for the SER? (Dollars? Years? Or is SER unit-free?)
c. The regression R² is 0.023. What are the units of measurement for the R²? (Dollars? Years? Or is R² unit-free?)
d. What is the regression's predicted earnings for a 25-year-old worker? A 45-year-old worker?
e. Will the regression give reliable predictions for a 99-year-old worker? Why or why not?
f. Given what you know about the distribution of earnings, do you think it is plausible that the distribution of errors in the regression is normal? (Hint: Do you think that the distribution is symmetric or skewed? What is the smallest value of earnings, and is it consistent with a normal distribution?)
g. The average age in this sample is 41.6 years. What is the average value of AWE in the sample? (Hint: Review Key Concept 4.2.)

4.4 Read the box "The 'Beta' of a Stock" in Section 4.2.

a. Suppose that the value of β is greater than 1 for a particular stock. Show that the variance of (R − Rf) for this stock is greater than the variance of (Rm − Rf).
b. Suppose that the value of β is less than 1 for a particular stock. Is it possible that the variance of (R − Rf) for this stock is greater than the variance of (Rm − Rf)? (Hint: Don't forget the regression error.)
c. In a given year, the rate of return on 3-month Treasury bills is 3.5% and the rate of return on a large diversified portfolio of stocks (the S&P 500) is 7.3%. For each company listed in the table in the box, use the estimated value of β to estimate the stock's expected rate of return.

4.5 A professor decides to run an experiment to measure the effect of time pressure on final exam scores. He gives each of the 400 students in his course the same final exam, but some students have 90 minutes to complete the exam while others have 120 minutes. Each student is randomly assigned

one of the examination times based on the flip of a coin. Let Yi denote the number of points scored on the exam by the ith student (0 ≤ Yi ≤ 100), let Xi denote the amount of time that the student has to complete the exam (Xi = 90 or 120), and consider the regression model Yi = β0 + β1Xi + ui.

a. Explain what the term ui represents. Why will different students have different values of ui?
b. Explain why E(ui|Xi) = 0 for this regression model.
c. Are the other assumptions in Key Concept 4.3 satisfied? Explain.
d. The estimated regression is Ŷi = 49 + 0.24Xi.
   i. Compute the estimated regression's prediction for the average score of students given 90 minutes to complete the exam. Repeat for 120 minutes and 150 minutes.
   ii. Compute the estimated gain in score for a student who is given an additional 10 minutes on the exam.

4.6 Show that the first least squares assumption, E(ui|Xi) = 0, implies that E(Yi|Xi) = β0 + β1Xi.

4.7 Show that β̂0 is an unbiased estimator of β0. (Hint: Use the fact that β̂1 is unbiased, which is shown in Appendix 4.3.)

4.8 Suppose that all of the regression assumptions in Key Concept 4.3 are satisfied except that the first assumption is replaced with E(ui|Xi) = 2. Which parts of Key Concept 4.4 continue to hold? Which change? Why? (Is β̂1 normally distributed in large samples with mean and variance given in Key Concept 4.4? What about β̂0?)

4.9 a. A linear regression yields β̂1 = 0. Show that R² = 0.
    b. A linear regression yields R² = 0. Does this imply that β̂1 = 0?

4.10 Suppose that Yi = β0 + β1Xi + ui, where (Xi, ui) are i.i.d. and Xi is a Bernoulli random variable with Pr(X = 1) = 0.20. When X = 1, ui is N(0, 4); when X = 0, ui is N(0, 1).

a. Show that the regression assumptions in Key Concept 4.3 are satisfied.
b. Derive an expression for the large-sample variance of β̂1. [Hint: Evaluate the terms in Equation (4.21).]

4.11 Consider the regression model Yi = β0 + β1Xi + ui.

a. Suppose you know that β0 = 0. Derive a formula for the least squares estimator of β1.

b. Suppose you know that β0 = 4. Derive a formula for the least squares estimator of β1.

4.12 a. Show that the regression R² in the regression of Y on X is the squared value of the sample correlation between X and Y; that is, show that R² = r²XY.
     b. Show that the R² from the regression of Y on X is the same as the R² from the regression of X on Y.
     c. Show that β̂1 = rXY(sY/sX), where rXY is the sample correlation between X and Y, and sY and sX are the sample standard deviations of X and Y.

4.13 Suppose that Yi = β0 + β1Xi + κui, where κ is a non-zero constant and (Yi, Xi) satisfy the three least squares assumptions. Show that the large-sample variance of β̂1 is given by σ²β̂1 = κ² · (1/n) var[(Xi − μX)ui] / [var(Xi)]². [Hint: This equation is the variance given in Equation (4.21) multiplied by κ².]

4.14 Show that the sample regression line passes through the point (X̅, Y̅).

Empirical Exercises

E4.1 On the text Web site http://www.pearsonhighered.com/stock_watson/, you will find a data file CPS08 that contains an extended version of the data set used in Table 3.1 for 2008. It contains data for full-time, full-year workers, age 25-34, with a high school diploma or B.A./B.S. as their highest degree. A detailed description is given in CPS08_Description, also available on the Web site. (These are the same data as in CPS92_08 but are limited to the year 2008.) In this exercise, you will investigate the relationship between a worker's age and earnings. (Generally, older workers have more job experience, leading to higher productivity and earnings.)

a. Run a regression of average hourly earnings (AHE) on age (Age). What is the estimated intercept? What is the estimated slope? Use the estimated regression to answer this question: How much do earnings increase as workers age by 1 year?
b. Bob is a 26-year-old worker. Predict Bob's earnings using the estimated regression. Alexis is a 30-year-old worker. Predict Alexis's earnings using the estimated regression.
c. Does age account for a large fraction of the variance in earnings across individuals? Explain.
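The identities asked for in Exercise 4.12 can also be confirmed numerically before proving them algebraically. The sketch below uses made-up data; the checks hold up to floating-point rounding for any data set.

```python
import numpy as np

# Numerical check (on invented data) of the identities in Exercise 4.12:
# R^2 equals the squared sample correlation, and the OLS slope equals
# r_XY * (s_Y / s_X).
rng = np.random.default_rng(4)
n = 500
x = rng.normal(10, 3, size=n)
y = 4.0 + 1.5 * x + rng.normal(0, 2, size=n)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

r2 = ((y_hat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()  # ESS / TSS
r_xy = np.corrcoef(x, y)[0, 1]
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

print(abs(r2 - r_xy ** 2))         # ~0: R^2 = r_XY^2 (part a)
print(abs(b1 - r_xy * s_y / s_x))  # ~0: slope = r_XY * s_Y / s_X (part c)
```

Part (b) follows from (a), since r_XY is symmetric in X and Y.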

E4.2 On the text Web site http://www.pearsonhighered.com/stock_watson/, you will find a data file TeachingRatings that contains data on course evaluations, course characteristics, and professor characteristics for 463 courses at the University of Texas at Austin.¹ A detailed description is given in TeachingRatings_Description, also available on the Web site. One of the characteristics is an index of the professor's "beauty" as rated by a panel of six judges. In this exercise, you will investigate how course evaluations are related to the professor's beauty.

a. Construct a scatterplot of average course evaluations (Course_Eval) on the professor's beauty (Beauty). Does there appear to be a relationship between the variables?
b. Run a regression of average course evaluations (Course_Eval) on the professor's beauty (Beauty). What is the estimated intercept? What is the estimated slope? Explain why the estimated intercept is equal to the sample mean of Course_Eval. (Hint: What is the sample mean of Beauty?)
c. Professor Watson has an average value of Beauty, while Professor Stock's value of Beauty is one standard deviation above the average. Predict Professor Stock's and Professor Watson's course evaluations.
d. Comment on the size of the regression's slope. Is the estimated effect of Beauty on Course_Eval large or small? Explain what you mean by "large" and "small."
e. Does Beauty explain a large fraction of the variance in evaluations across courses? Explain.

¹These data were provided by Professor Daniel Hamermesh of the University of Texas at Austin and were used in his paper with Amy Parker, "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity," Economics of Education Review, August 2005, 24(4): 369-376.

E4.3 On the text Web site http://www.pearsonhighered.com/stock_watson/, you will find a data file CollegeDistance that contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. In this exercise, you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student's high school to the nearest four-year college. (Proximity to college lowers the cost of education, so students who live closer to a four-year college should, on average, complete

mance and the Sources or Growth"• Ioumol 01 //JOII EeonOl1l1es. . mated regression to answer this question: How does the average value of years of completed scho ling change when colleges are built close to where students go to high school? I . 2 ge . D es Malta look like an outlier? e.4 On the text Web site http://www. also available on the Web site. April 1995.58: 261-300.) What is the estimated intercept? What IS the estimated lope? Use the esu. One country. Construct a scanerplot of average annual gr wth rate (Grow/h) on the average trade share (TradeSltare). How would the prediction change if Bob lived 10 miles [rom the nearest college? c. Or something else)? E4. grams..s a. In this exerci e. cents. b. yza. years. Does distance to college explain a large I'racti n of the variance in educational attainment across individuals? xplain. "L rversrcn'. A detailed deseripti n is given in Growth_Description. 12(2): 217-224. Using all observations. d. Find Malta on the scatterplot. ? 10 .com/stoek_watsonl.pearsonhighercd.you will find a data file Growth that contains data on average growth rates from 1960 through 1995 for 65 countries along with variable that are potentially related to growth. Distance_ D esenp I l ' a. essor ecrna Rouse of Princeton University And were use In paper Democratlzatlon or D' . has a trade share much larger than the other countries. e Effect of Community alleges on Educational Att31n~nThent. run a regression of Growth on TradeShare.138 CHAPTER 4 Linear Regression with One Regressor f hi hereducation. Malta. . er w'Ll 11 ( B k r essor ass Levine of Brown University nnd were used 111 ISpap h J 1 lOTS en ec and Norman Loa "Fi F' cial '2000 . Bob's high school was 20 miles [rom the ncare t c liege. Docs there appear to be a relationship between the variables? b. Run a regression of years of completed ~ducati n (ED) on distance to the neares t college (Dis/) where Dist IS measured 111 tens of miles (For example. What is the estimated slope? 
What is the estimated intercept? Use the "These data were provided by Prof C ili d' her " '.oumat of BlIsil.) A detailed description is given in Colle more years 0 tg . ese data were provided by P of R' .less find Economic Sllllislics. dollars.you will investigate the relationship between growth and trade. Predict Bob's years of completed education using the estimated regression. t'on also available on the Web sue. What is the value of the standard error of the regression? What are the units for the standard error (meters. Dis! = 2 means that thc distance i 20 miles.
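A rough sketch of the workflow in an exercise like E4.4 is given below. It uses a small made-up data set (the real Growth file is on the text Web site; every number here is invented purely for illustration) to show how the OLS fit can change when an extreme observation is dropped.

```python
# Sketch of the E4.4 workflow on made-up data: fit OLS with and without
# an outlier-like observation and compare the slopes.
def ols(x, y):
    """Return (intercept, slope) computed from the OLS formulas."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
            sum((xi - xbar) ** 2 for xi in x)
    return ybar - slope * xbar, slope

# Hypothetical trade shares and growth rates; the last observation
# mimics a Malta-like outlier with a very large trade share.
trade_share = [0.3, 0.5, 0.6, 0.8, 1.0, 1.1, 2.0]
growth = [1.0, 1.8, 2.1, 2.4, 3.1, 3.3, 6.5]

b0_all, b1_all = ols(trade_share, growth)
b0_drop, b1_drop = ols(trade_share[:-1], growth[:-1])

print(f"slope, all observations: {b1_all:.2f}")
print(f"slope, outlier excluded: {b1_drop:.2f}")
print(f"predicted growth at trade share 0.5: {b0_all + 0.5 * b1_all:.2f}")
```

With the actual data file, the same comparison answers parts c and d of the exercise.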

d. Estimate the same regression excluding the data from Malta. Answer the same questions in c.

e. Where is Malta? Why is the Malta trade share so large? Should Malta be included or excluded from the analysis?

APPENDIX 4.1 The California Test Score Data Set

The California Standardized Testing and Reporting data set contains data on test performance, school characteristics, and student demographic backgrounds. The data used here are from all 420 K-6 and K-8 districts in California with data available for 1999. Test scores are the average of the reading and math scores on the Stanford 9 Achievement Test, a standardized test administered to fifth-grade students. School characteristics (averaged across the district) include enrollment, number of teachers (measured as "full-time equivalents"), number of computers per classroom, and expenditures per student. The student-teacher ratio used here is the number of students in the district divided by the number of full-time equivalent teachers. Demographic variables for the students also are averaged across the district. The demographic variables include the percentage of students who are in the public assistance program CalWorks (formerly AFDC), the percentage of students who qualify for a reduced price lunch, and the percentage of students who are English learners (that is, students for whom English is a second language). All of these data were obtained from the California Department of Education (www.cde.ca.gov).

APPENDIX 4.2 Derivation of the OLS Estimators

This appendix uses calculus to derive the formulas for the OLS estimators given in Key Concept 4.2. To minimize the sum of squared prediction mistakes Σᵢ(Yᵢ − b₀ − b₁Xᵢ)² [Equation (4.6)], first take the partial derivatives with respect to b₀ and b₁ (all sums run over i = 1, ..., n):

∂/∂b₀ Σ(Yᵢ − b₀ − b₁Xᵢ)² = −2Σ(Yᵢ − b₀ − b₁Xᵢ) and  (4.23)

∂/∂b₁ Σ(Yᵢ − b₀ − b₁Xᵢ)² = −2ΣXᵢ(Yᵢ − b₀ − b₁Xᵢ).  (4.24)

The OLS estimators, β̂₀ and β̂₁, are the values of b₀ and b₁ that minimize Σ(Yᵢ − b₀ − b₁Xᵢ)²; equivalently, they are the values of b₀ and b₁ for which the derivatives in Equations (4.23) and (4.24) equal zero. Accordingly, setting these derivatives equal to zero, collecting terms, and dividing by n shows that the OLS estimators must satisfy the two equations

Ȳ − β̂₀ − β̂₁X̄ = 0 and  (4.25)

(1/n)ΣXᵢYᵢ − β̂₀X̄ − β̂₁(1/n)ΣXᵢ² = 0.  (4.26)

Solving this pair of equations for β̂₀ and β̂₁ yields

β̂₁ = [(1/n)ΣXᵢYᵢ − X̄Ȳ] / [(1/n)ΣXᵢ² − (X̄)²] = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and  (4.27)

β̂₀ = Ȳ − β̂₁X̄.  (4.28)

Equations (4.27) and (4.28) are the formulas for β̂₀ and β̂₁ given in Key Concept 4.2; the formula β̂₁ = s_XY/s²_X is obtained by dividing the numerator and denominator in Equation (4.27) by n − 1.

APPENDIX 4.3 Sampling Distribution of the OLS Estimator

In this appendix, we show that the OLS estimator β̂₁ is unbiased and, in large samples, has the normal sampling distribution given in Key Concept 4.4.

Representation of β̂₁ in Terms of the Regressors and Errors

We start by providing an expression for β̂₁ in terms of the regressors and errors. Because Yᵢ = β₀ + β₁Xᵢ + uᵢ, Yᵢ − Ȳ = β₁(Xᵢ − X̄) + uᵢ − ū, so the numerator of the formula for β̂₁ in Equation (4.27) is

Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = Σ(Xᵢ − X̄)[β₁(Xᵢ − X̄) + (uᵢ − ū)] = β₁Σ(Xᵢ − X̄)² + Σ(Xᵢ − X̄)(uᵢ − ū).  (4.29)
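The formulas in Equations (4.27) and (4.28) translate directly into code. The following is a minimal sketch (the function and variable names are our own) that checks the formulas on data generated exactly on a known line, which OLS recovers exactly.

```python
# Direct implementation of Equations (4.27) and (4.28): the OLS slope
# and the intercept beta0 = Ybar - beta1 * Xbar.
def ols_estimators(X, Y):
    n = len(X)
    xbar = sum(X) / n
    ybar = sum(Y) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
    sxx = sum((x - xbar) ** 2 for x in X)
    beta1 = sxy / sxx            # Equation (4.27)
    beta0 = ybar - beta1 * xbar  # Equation (4.28)
    return beta0, beta1

# Data generated exactly on the line Y = 2 + 3X recover the line exactly:
X = [1.0, 2.0, 3.0, 4.0]
Y = [2 + 3 * x for x in X]
beta0, beta1 = ols_estimators(X, Y)
print(beta0, beta1)  # 2.0 3.0
```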

Now Σ(Xᵢ − X̄)(uᵢ − ū) = Σ(Xᵢ − X̄)uᵢ − ūΣ(Xᵢ − X̄) = Σ(Xᵢ − X̄)uᵢ, where the final equality follows from the definition of X̄, which implies that Σ(Xᵢ − X̄) = ΣXᵢ − nX̄ = 0. Substituting Σ(Xᵢ − X̄)(uᵢ − ū) = Σ(Xᵢ − X̄)uᵢ into the final expression in Equation (4.29) yields Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = β₁Σ(Xᵢ − X̄)² + Σ(Xᵢ − X̄)uᵢ. Substituting this expression in turn into the formula for β̂₁ in Equation (4.27) yields

β̂₁ = β₁ + [(1/n)Σ(Xᵢ − X̄)uᵢ] / [(1/n)Σ(Xᵢ − X̄)²].  (4.30)

Proof That β̂₁ Is Unbiased

The expectation of β̂₁ is obtained by taking the expectation of both sides of Equation (4.30). Thus,

E(β̂₁) = β₁ + E{[(1/n)Σ(Xᵢ − X̄)uᵢ] / [(1/n)Σ(Xᵢ − X̄)²]}
      = β₁ + E{E[(1/n)Σ(Xᵢ − X̄)uᵢ | X₁, ..., Xₙ] / [(1/n)Σ(Xᵢ − X̄)²]} = β₁,  (4.31)

where the second equality follows by using the law of iterated expectations (Section 2.3). By the second least squares assumption, uᵢ is distributed independently of X for all observations other than i, so E(uᵢ | X₁, ..., Xₙ) = E(uᵢ | Xᵢ). By the first least squares assumption, however, E(uᵢ | Xᵢ) = 0. It follows that the conditional expectation in large brackets in the second line of Equation (4.31) is zero, so that E(β̂₁ − β₁ | X₁, ..., Xₙ) = 0. Equivalently, E(β̂₁ | X₁, ..., Xₙ) = β₁; that is, β̂₁ is conditionally unbiased, given X₁, ..., Xₙ. By the law of iterated expectations, E(β̂₁ − β₁) = E[E(β̂₁ − β₁ | X₁, ..., Xₙ)] = 0, so that E(β̂₁) = β₁; that is, β̂₁ is unbiased.

Large-Sample Normal Distribution of the OLS Estimator

The large-sample normal approximation to the limiting distribution of β̂₁ (Key Concept 4.4) is obtained by considering the behavior of the final term in Equation (4.30).

First consider the numerator of this term. Because X̄ is consistent, if the sample size is large, X̄ is nearly equal to μ_X. Thus, to a close approximation, the term in the numerator of Equation (4.30) is the sample average v̄, where vᵢ = (Xᵢ − μ_X)uᵢ. By the first least squares assumption, vᵢ has a mean of zero. By the second least squares assumption, vᵢ is i.i.d. The variance of vᵢ is σ²_v = var[(Xᵢ − μ_X)uᵢ], which, by the third least squares assumption, is nonzero and finite. Therefore, v̄ satisfies all the requirements of the central limit theorem (Key Concept 2.7). Thus v̄/σ_v̄ is, in large samples, distributed N(0, 1), where σ²_v̄ = σ²_v/n. Thus the distribution of v̄ is well approximated by the N(0, σ²_v/n) distribution.

Next consider the expression in the denominator in Equation (4.30); this is the sample variance of X (except dividing by n rather than n − 1, which is inconsequential if n is large). As discussed in Section 3.2 [Equation (3.8)], the sample variance is a consistent estimator of the population variance, so in large samples it is arbitrarily close to the population variance of X.

Combining these two results, we have that, in large samples, β̂₁ − β₁ ≅ v̄/var(Xᵢ), so that the sampling distribution of β̂₁ is, in large samples, N(β₁, σ²_β̂₁), where σ²_β̂₁ = var(vᵢ)/{n[var(Xᵢ)]²} = var[(Xᵢ − μ_X)uᵢ]/{n[var(Xᵢ)]²}, which is the expression given in Equation (4.21).

Some Additional Algebraic Facts About OLS

The OLS residuals and predicted values satisfy

(1/n)Σûᵢ = 0,  (4.32)

(1/n)ΣŶᵢ = Ȳ,  (4.33)

ΣûᵢXᵢ = 0 and s_ûX = 0, and  (4.34)

TSS = SSR + ESS.  (4.35)

Equations (4.32) through (4.35) say that the sample average of the OLS residuals is zero; the sample average of the OLS predicted values equals Ȳ; the sample covariance s_ûX between the OLS residuals and the regressors is zero; and the total sum of squares is the sum of the sum of squared residuals and the explained sum of squares [the ESS, TSS, and SSR are defined in Equations (4.14), (4.15), and (4.17)].

To verify Equation (4.32), note that the definition of β̂₀ lets us write the OLS residuals as ûᵢ = Yᵢ − β̂₀ − β̂₁Xᵢ = (Yᵢ − Ȳ) − β̂₁(Xᵢ − X̄); thus Σûᵢ = Σ(Yᵢ − Ȳ) − β̂₁Σ(Xᵢ − X̄). But the definitions of Ȳ and X̄ imply that Σ(Yᵢ − Ȳ) = 0 and Σ(Xᵢ − X̄) = 0, so Σûᵢ = 0.
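The unbiasedness result can be illustrated with a small Monte Carlo sketch: draw many samples from a population satisfying the least squares assumptions, estimate the slope by OLS in each, and average the estimates. All parameter values and sample sizes below are arbitrary choices for illustration, not values from the text.

```python
# Monte Carlo sketch of unbiasedness: generate many samples from
# Y = beta0 + beta1*X + u with E(u|X) = 0, estimate beta1 by OLS in
# each sample, and check that the estimates average out to beta1.
import random

random.seed(0)
beta0_true, beta1_true, n, reps = 1.0, 2.0, 100, 2000

def ols_slope(X, Y):
    xbar, ybar = sum(X) / len(X), sum(Y) / len(Y)
    return sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
           sum((x - xbar) ** 2 for x in X)

estimates = []
for _ in range(reps):
    X = [random.gauss(0, 1) for _ in range(n)]
    Y = [beta0_true + beta1_true * x + random.gauss(0, 1) for x in X]
    estimates.append(ols_slope(X, Y))

mean_est = sum(estimates) / reps
print(f"average of {reps} OLS slope estimates: {mean_est:.3f}")  # close to 2
```

A histogram of the 2,000 estimates would also show the approximately normal shape predicted by the central limit theorem argument above.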

To verify Equation (4.33), note that Yᵢ = Ŷᵢ + ûᵢ, so ΣYᵢ = ΣŶᵢ + Σûᵢ = ΣŶᵢ, where the second equality is a consequence of Equation (4.32).

To verify Equation (4.34), note that Σûᵢ = 0 implies ΣûᵢXᵢ = Σûᵢ(Xᵢ − X̄), so

ΣûᵢXᵢ = Σ[(Yᵢ − Ȳ) − β̂₁(Xᵢ − X̄)](Xᵢ − X̄) = Σ(Yᵢ − Ȳ)(Xᵢ − X̄) − β̂₁Σ(Xᵢ − X̄)² = 0,  (4.36)

where the final equality in Equation (4.36) is obtained by using the formula for β̂₁ in Equation (4.27). This result, combined with the preceding results, implies that s_ûX = 0.

Equation (4.35) follows from the previous results and some algebra:

TSS = Σ(Yᵢ − Ȳ)² = Σ(Yᵢ − Ŷᵢ + Ŷᵢ − Ȳ)² = Σ(Yᵢ − Ŷᵢ)² + Σ(Ŷᵢ − Ȳ)² + 2Σ(Yᵢ − Ŷᵢ)(Ŷᵢ − Ȳ) = SSR + ESS + 2Σûᵢ(Ŷᵢ − Ȳ) = SSR + ESS,  (4.37)

where the final equality follows from Σûᵢ(Ŷᵢ − Ȳ) = Σûᵢ(β̂₀ + β̂₁Xᵢ) − ȲΣûᵢ = β̂₀Σûᵢ + β̂₁ΣûᵢXᵢ − ȲΣûᵢ = 0 by the previous results.
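Because Equations (4.32) through (4.35) are exact algebraic identities, they can be confirmed numerically on any data set. The sketch below uses an arbitrary made-up data set.

```python
# Numerical check of the identities (4.32)-(4.35): residuals average to
# zero, predicted values average to Ybar, residuals are orthogonal to X,
# and TSS = SSR + ESS. The data are arbitrary.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 2.9, 4.2, 4.8, 6.3]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * x for x in X]
uhat = [y - yh for y, yh in zip(Y, yhat)]

tss = sum((y - ybar) ** 2 for y in Y)
ssr = sum(u ** 2 for u in uhat)
ess = sum((yh - ybar) ** 2 for yh in yhat)

print(abs(sum(uhat)) < 1e-9)                            # Equation (4.32)
print(abs(sum(yhat) / n - ybar) < 1e-9)                 # Equation (4.33)
print(abs(sum(u * x for u, x in zip(uhat, X))) < 1e-9)  # Equation (4.34)
print(abs(tss - (ssr + ess)) < 1e-9)                    # Equation (4.35)
```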

CHAPTER 5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

This chapter continues the treatment of linear regression with a single regressor. Chapter 4 explained how the OLS estimator β̂₁ of the slope coefficient β₁ differs from one sample to the next; that is, β̂₁ has a sampling distribution. In this chapter, we show how knowledge of this sampling distribution can be used to make statements about β₁ that accurately summarize the sampling uncertainty. The starting point is the standard error of the OLS estimator, which measures the spread of the sampling distribution of β̂₁. Section 5.1 provides an expression for this standard error (and for the standard error of the OLS estimator of the intercept), then shows how to use β̂₁ and its standard error to test hypotheses. Section 5.2 explains how to construct confidence intervals for β₁. Section 5.3 takes up the special case of a binary regressor.

Sections 5.1 through 5.3 assume that the three least squares assumptions of Chapter 4 hold. If, in addition, some stronger conditions hold, then some stronger results can be derived regarding the distribution of the OLS estimator. One of these stronger conditions is that the errors are homoskedastic, a concept introduced in Section 5.4. Section 5.5 presents the Gauss-Markov theorem, which states that, under certain conditions, OLS is efficient (has the smallest variance) among a certain class of estimators. Section 5.6 discusses the distribution of the OLS estimator when the population distribution of the regression errors is normal.

5.1 Testing Hypotheses About One of the Regression Coefficients

Your client, the superintendent, calls you with a problem. She has an angry taxpayer in her office who asserts that cutting class size will not help boost test scores, so reducing class size is a waste of money. Class size, the taxpayer claims, has no effect on test scores.

The taxpayer's claim can be rephrased in the language of regression analysis. Because the effect on test scores of a unit change in class size is β_ClassSize, the taxpayer is asserting that the population regression line is flat; that is, that the slope β_ClassSize of the population regression line is zero.

Is there evidence in your sample of 420 observations on California school districts that this slope is nonzero? Can you reject the taxpayer's hypothesis that β_ClassSize = 0, or should you accept it, at least tentatively pending further new evidence? This section discusses tests of hypotheses about the slope β₁ or the intercept β₀ of the population regression line. We start by discussing two-sided tests of the slope β₁ in detail, then turn to one-sided tests and to tests of hypotheses regarding the intercept β₀.

Two-Sided Hypotheses Concerning β₁

The general approach to testing hypotheses about the coefficient β₁ is the same as the approach to testing hypotheses about the population mean, so we begin with a brief review.

General form of the t-statistic. In general, the t-statistic has the form

t = (estimator − hypothesized value) / (standard error of the estimator).  (5.1)

Testing hypotheses about the population mean. Recall from Section 3.2 that the null hypothesis that the mean of Y is a specific value μ_Y,0 can be written as H₀: E(Y) = μ_Y,0, and the two-sided alternative is H₁: E(Y) ≠ μ_Y,0.

The test of the null hypothesis H₀ against the two-sided alternative proceeds as in the three steps summarized in Key Concept 3.6. The first is to compute the standard error of Ȳ, SE(Ȳ), which is an estimator of the standard deviation of the sampling distribution of Ȳ. The second step is to compute the t-statistic, which has the general form given in Equation (5.1); applied here, the t-statistic is t = (Ȳ − μ_Y,0)/SE(Ȳ).

The third step is to compute the p-value, which is the smallest significance level at which the null hypothesis could be rejected, based on the test statistic actually observed. Equivalently, the p-value is the probability of obtaining a statistic, by random sampling variation, at least as different from the null hypothesis value as is the statistic actually observed, assuming that the null hypothesis is correct (Key Concept 3.5). Because the t-statistic has a standard normal distribution in large samples under the null hypothesis, the p-value for a two-sided hypothesis test is 2Φ(−|tᵃᶜᵗ|), where tᵃᶜᵗ is the value of the t-statistic actually computed and Φ is the cumulative standard normal distribution tabulated in Appendix Table 1.
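The three steps just described can be sketched in code. The data and hypothesized mean below are made up for illustration; the p-value uses the large-sample normal approximation 2Φ(−|tᵃᶜᵗ|).

```python
# Sketch of the three steps for testing H0: E(Y) = mu0 on made-up data.
import math

def normal_cdf(z):
    # Standard normal cumulative distribution via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

Y = [3.1, 2.7, 3.6, 3.3, 2.9, 3.4, 3.0, 3.5]
mu0 = 3.0  # hypothesized mean

n = len(Y)
ybar = sum(Y) / n
s_y = math.sqrt(sum((y - ybar) ** 2 for y in Y) / (n - 1))
se_ybar = s_y / math.sqrt(n)        # step 1: SE(Ybar)
t = (ybar - mu0) / se_ybar          # step 2: t-statistic
p_value = 2 * normal_cdf(-abs(t))   # step 3: two-sided p-value
print(f"t = {t:.2f}, p-value = {p_value:.3f}")
```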

Alternatively, the third step can be replaced by simply comparing the t-statistic to the critical value appropriate for the test with the desired significance level. For example, a two-sided test with a 5% significance level would reject the null hypothesis if |tᵃᶜᵗ| > 1.96. In this case, the population mean is said to be statistically significantly different from the hypothesized value at the 5% significance level.

Testing hypotheses about the slope β₁. At a theoretical level, the critical feature justifying the foregoing testing procedure for the population mean is that, in large samples, the sampling distribution of Ȳ is approximately normal. Because β̂₁ also has a normal sampling distribution in large samples, hypotheses about the true value of the slope β₁ can be tested using the same general approach.

The null and alternative hypotheses need to be stated precisely before they can be tested. The angry taxpayer's hypothesis is that β_ClassSize = 0. More generally, under the null hypothesis the true population slope β₁ takes on some specific value, β₁,₀. Under the two-sided alternative, β₁ does not equal β₁,₀. That is, the null hypothesis and the two-sided alternative hypothesis are

H₀: β₁ = β₁,₀ vs. H₁: β₁ ≠ β₁,₀ (two-sided alternative).  (5.2)

To test the null hypothesis H₀, we follow the same three steps as for the population mean.

The first step is to compute the standard error of β̂₁, SE(β̂₁). The standard error of β̂₁ is an estimator of σ_β̂₁, the standard deviation of the sampling distribution of β̂₁. Specifically,

SE(β̂₁) = √(σ̂²_β̂₁),  (5.3)

where

σ̂²_β̂₁ = (1/n) × [(1/(n − 2))Σ(Xᵢ − X̄)²ûᵢ²] / [(1/n)Σ(Xᵢ − X̄)²]².  (5.4)

The estimator of the variance in Equation (5.4) is discussed in Appendix 5.1. Although the formula for σ̂²_β̂₁ is complicated, in applications the standard error is computed by regression software, so it is easy to use in practice.

The second step is to compute the t-statistic,

t = (β̂₁ − β₁,₀)/SE(β̂₁).  (5.5)

The third step is to compute the p-value, the probability of observing a value of β̂₁ at least as different from β₁,₀ as the estimate actually computed (β̂₁ᵃᶜᵗ), assuming that the null hypothesis is correct. Stated mathematically,

p-value = Pr_H₀[|β̂₁ − β₁,₀| > |β̂₁ᵃᶜᵗ − β₁,₀|] = Pr_H₀[|(β̂₁ − β₁,₀)/SE(β̂₁)| > |(β̂₁ᵃᶜᵗ − β₁,₀)/SE(β̂₁)|] = Pr_H₀(|t| > |tᵃᶜᵗ|),  (5.6)

where Pr_H₀ denotes the probability computed under the null hypothesis, the second equality follows by dividing by SE(β̂₁), and tᵃᶜᵗ is the value of the t-statistic actually computed. Because β̂₁ is approximately normally distributed in large samples, under the null hypothesis the t-statistic is approximately distributed as a standard normal random variable, so in large samples,

p-value = Pr(|Z| > |tᵃᶜᵗ|) = 2Φ(−|tᵃᶜᵗ|).  (5.7)

A p-value of less than 5% provides evidence against the null hypothesis in the sense that, under the null hypothesis, the probability of obtaining a value of β̂₁ at least as far from the null as that actually observed is less than 5%. If so, the null hypothesis is rejected at the 5% significance level.

Alternatively, the hypothesis can be tested at the 5% significance level simply by comparing the value of the t-statistic to ±1.96, the critical value for a two-sided test, and rejecting the null hypothesis at the 5% level if |tᵃᶜᵗ| > 1.96.

These steps are summarized in Key Concept 5.2.

KEY CONCEPT 5.2 Testing the Hypothesis β₁ = β₁,₀ Against the Alternative β₁ ≠ β₁,₀

1. Compute the standard error of β̂₁, SE(β̂₁) [Equation (5.3)].

2. Compute the t-statistic [Equation (5.5)].

3. Compute the p-value [Equation (5.7)]. Reject the hypothesis at the 5% significance level if the p-value is less than 0.05 or, equivalently, if |tᵃᶜᵗ| > 1.96.

The standard error and (typically) the t-statistic and p-value testing β₁ = 0 are computed automatically by regression software.
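As a sketch, the steps in Key Concept 5.2 can be checked numerically using the California test score estimates discussed in this chapter, β̂₁ = −2.28 with SE(β̂₁) = 0.52, testing H₀: β₁ = 0.

```python
# Steps 2 and 3 of Key Concept 5.2 with the test score estimates
# (the standard error, step 1, is taken as given from the text).
import math

beta1_hat, se_beta1, beta1_null = -2.28, 0.52, 0.0
t_act = (beta1_hat - beta1_null) / se_beta1                     # Eq. (5.5)
p_value = 2 * 0.5 * (1 + math.erf(-abs(t_act) / math.sqrt(2)))  # Eq. (5.7)
print(f"t = {t_act:.2f}")          # about -4.38
print(f"p-value = {p_value:.6f}")  # roughly 0.00001
```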

Reporting regression equations and application to test scores. The OLS regression of the test score against the student-teacher ratio, reported in Equation (4.11), yielded β̂₀ = 698.9 and β̂₁ = −2.28. The standard errors of these estimates are SE(β̂₀) = 10.4 and SE(β̂₁) = 0.52.

Because of the importance of the standard errors, by convention they are included when reporting the estimated OLS coefficients. One compact way to report the standard errors is to place them in parentheses below the respective coefficients of the OLS regression line:

TestScore = 698.9 − 2.28 × STR, R² = 0.051, SER = 18.6.  (5.8)
            (10.4)  (0.52)

Equation (5.8) provides the estimated regression line, estimates of the sampling uncertainty of the slope and the intercept (the standard errors), and two measures of the fit of this regression line (the R² and the SER). This is a common format for reporting a single regression equation, and it will be used throughout the rest of this book.

Suppose you wish to test the null hypothesis that the slope β₁ is zero in the population counterpart of Equation (5.8) at the 5% significance level. To do so, construct the t-statistic and compare it to 1.96, the 5% (two-sided) critical value taken from the standard normal distribution. The t-statistic is constructed by substituting the hypothesized value of β₁ under the null hypothesis (zero), the estimated slope, and its standard error from Equation (5.8) into the general formula in Equation (5.5). The result is tᵃᶜᵗ = (−2.28 − 0)/0.52 = −4.38. This t-statistic exceeds (in absolute value) the 5% two-sided critical value of 1.96, so the null hypothesis is rejected in favor of the two-sided alternative at the 5% significance level.

Alternatively, we can compute the p-value associated with tᵃᶜᵗ = −4.38. This probability is the area in the tails of the standard normal distribution, as shown in Figure 5.1. This probability is extremely small, approximately 0.00001, or 0.001%. Because this event is so unlikely, it is reasonable to conclude that the null hypothesis is false.

One-Sided Hypotheses Concerning β₁

The discussion so far has focused on testing the hypothesis that β₁ = β₁,₀ against the hypothesis that β₁ ≠ β₁,₀. This is a two-sided hypothesis test, because under the alternative β₁ could be either larger or smaller than β₁,₀. Sometimes, however, it is appropriate to use a one-sided hypothesis test.

[Figure 5.1 Calculating the p-Value of a Two-Sided Test When tᵃᶜᵗ = −4.38. The p-value of a two-sided test is the probability that |Z| > |tᵃᶜᵗ|, where Z is a standard normal random variable and tᵃᶜᵗ is the value of the t-statistic calculated from the sample. When tᵃᶜᵗ = −4.38, the p-value is only 0.00001: the area under the standard normal density to the left of −4.38 plus the area to the right of +4.38.]

For example, in the student-teacher ratio/test score problem, many people think that smaller classes provide a better learning environment. Under that hypothesis, β₁ is negative: Smaller classes lead to higher scores. It might make sense, therefore, to test the null hypothesis that β₁ = 0 (no effect) against the one-sided alternative that β₁ < 0.

For a one-sided test, the null hypothesis and the one-sided alternative hypothesis are

H₀: β₁ = β₁,₀ vs. H₁: β₁ < β₁,₀ (one-sided alternative),  (5.9)

where β₁,₀ is the value of β₁ under the null (0 in the student-teacher ratio example) and the alternative is that β₁ is less than β₁,₀. If the alternative is that β₁ is greater than β₁,₀, the inequality in Equation (5.9) is reversed.

Because the null hypothesis is the same for a one- and a two-sided hypothesis test, the construction of the t-statistic is the same. The only difference between a one- and a two-sided hypothesis test is how you interpret the t-statistic. For the one-sided alternative in Equation (5.9), the null hypothesis is rejected against the one-sided alternative for large negative, but not large positive, values of the t-statistic: Instead of rejecting if |tᵃᶜᵗ| > 1.96, the hypothesis is rejected at the 5% significance level if tᵃᶜᵗ < −1.645.
The p-value for a one-sided test is obtained from the cumulative standard normal distribution as

p-value = Pr(Z < tᵃᶜᵗ) = Φ(tᵃᶜᵗ) (p-value, one-sided left-tail test).  (5.10)

If the alternative hypothesis is that β₁ is greater than β₁,₀, the inequalities in Equations (5.9) and (5.10) are reversed, so the p-value is the right-tail probability, Pr(Z > tᵃᶜᵗ).

When should a one-sided test be used? In practice, one-sided alternative hypotheses should be used only when there is a clear reason for doing so. This reason could come from economic theory, prior empirical evidence, or both. However, even if it initially seems that the relevant alternative is one-sided, upon reflection this might not necessarily be so. A newly formulated drug undergoing clinical trials actually could prove harmful because of previously unrecognized side effects. In the class size example, we are reminded of the graduation joke that a university's secret of success is to admit talented students and then make sure that the faculty stays out of their way and does as little damage as possible. In practice, such ambiguity often leads econometricians to use two-sided tests.

Application to test scores. The t-statistic testing the hypothesis that there is no effect of class size on test scores [so β₁,₀ = 0 in Equation (5.9)] is tᵃᶜᵗ = −4.38. This value is less than −2.33 (the critical value for a one-sided test with a 1% significance level), so the null hypothesis is rejected against the one-sided alternative at the 1% level. In fact, the p-value is less than 0.0006%. Based on these data, you can reject the angry taxpayer's assertion that the negative estimate of the slope arose purely because of random sampling variation at the 1% significance level.

Testing Hypotheses About the Intercept β₀

This discussion has focused on testing hypotheses about the slope, β₁. Occasionally, however, the hypothesis concerns the intercept β₀. The null hypothesis concerning the intercept and the two-sided alternative are

H₀: β₀ = β₀,₀ vs. H₁: β₀ ≠ β₀,₀ (two-sided alternative).  (5.11)

The general approach to testing this null hypothesis consists of the three steps in Key Concept 5.2, applied to β₀ (the formula for the standard error of β̂₀ is given in Appendix 5.1). If the alternative is one-sided, this approach is modified as was discussed in the previous subsection for hypotheses about the slope.

5.2 Confidence Intervals for a Regression Coefficient

Hypothesis tests are useful if you have a specific null hypothesis in mind (as did our angry taxpayer). Being able to accept or to reject this null hypothesis based on the statistical evidence provides a powerful tool for coping with the uncertainty inherent in using a sample to learn about the population. Yet there are many times that no single hypothesis about a regression coefficient is dominant, and instead one would like to know a range of values of the coefficient that are consistent with the data. This calls for constructing a confidence interval.

Because any statistical estimate of the slope β₁ necessarily has sampling uncertainty, we cannot determine the true value of β₁ exactly from a sample of data. It is possible, however, to use the OLS estimator and its standard error to construct a confidence interval for the slope β₁ or for the intercept β₀.

Confidence interval for β₁. Recall that a 95% confidence interval for β₁ has two equivalent definitions. First, it is the set of values that cannot be rejected using a two-sided hypothesis test with a 5% significance level. Second, it is an interval that has a 95% probability of containing the true value of β₁; that is, in 95% of possible samples that might be drawn, the confidence interval will contain the true value of β₁. Because this interval contains the true value in 95% of all samples, it is said to have a confidence level of 95%.

The reason these two definitions are equivalent is as follows. A hypothesis test with a 5% significance level will, by definition, reject the true value of β₁ in only 5% of all possible samples; that is, in 95% of all possible samples, the true value of β₁ will not be rejected. Because the 95% confidence interval (as defined in the first definition) is the set of all values of β₁ that are not rejected at the 5% significance level, it follows that the true value of β₁ will be contained in the confidence interval in 95% of all possible samples.

As in the case of a confidence interval for the population mean (Section 3.3), in principle a 95% confidence interval can be computed by testing all possible values of β₁ (that is, testing the null hypothesis β₁ = β₁,₀ for all values of β₁,₀) at the 5% significance level using the t-statistic. The 95% confidence interval is then the collection of all the values of β₁ that are not rejected. But constructing the t-statistic for all values of β₁ would take forever. An easier way to construct the confidence interval is to note that the t-statistic will reject the hypothesized value β₁,₀ whenever β₁,₀ is outside the range β̂₁ ± 1.96SE(β̂₁). That is, the 95% confidence interval for β₁ is the interval [β̂₁ − 1.96SE(β̂₁), β̂₁ + 1.96SE(β̂₁)]. This argument parallels the argument used to develop a confidence interval for the population mean. The construction of a confidence interval for β₁ is summarized as Key Concept 5.3.

KEY CONCEPT 5.3 Confidence Interval for β₁

A 95% two-sided confidence interval for β₁ is an interval that contains the true value of β₁ with a 95% probability; that is, it contains the true value of β₁ in 95% of all possible randomly drawn samples. Equivalently, it is the set of values of β₁ that cannot be rejected by a 5% two-sided hypothesis test. When the sample size is large, it is constructed as

95% confidence interval for β₁ = [β̂₁ − 1.96SE(β̂₁), β̂₁ + 1.96SE(β̂₁)].  (5.12)

Confidence interval for β₀. A 95% confidence interval for β₀ is constructed as in Key Concept 5.3, with β̂₀ and SE(β̂₀) replacing β̂₁ and SE(β̂₁).

Application to test scores. The OLS regression of the test score against the student-teacher ratio, reported in Equation (5.8), yielded β̂₁ = −2.28 and SE(β̂₁) = 0.52. The 95% two-sided confidence interval for β₁ is {−2.28 ± 1.96 × 0.52}, or −3.30 ≤ β₁ ≤ −1.26. The value β₁ = 0 is not contained in this confidence interval, so (as we knew already from Section 5.1) the hypothesis β₁ = 0 can be rejected at the 5% significance level.

Confidence intervals for predicted effects of changing X. The 95% confidence interval for β₁ can be used to construct a 95% confidence interval for the predicted effect of a general change in X.

Consider changing X by a given amount, Δx. The predicted change in Y associated with this change in X is β₁Δx. The population slope β₁ is unknown, but because we can construct a confidence interval for β₁, we can construct a confidence interval for the predicted effect β₁Δx. Because one end of a 95% confidence interval for β₁ is β̂₁ − 1.96SE(β̂₁), the predicted effect of the change Δx using this estimate of β₁ is [β̂₁ − 1.96SE(β̂₁)] × Δx. The other end of the confidence interval is β̂₁ + 1.96SE(β̂₁), and the predicted effect of the change using that estimate is [β̂₁ + 1.96SE(β̂₁)] × Δx.
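The interval in Equation (5.12) is a one-line computation. The sketch below reproduces it for the test score regression, using β̂₁ = −2.28 and SE(β̂₁) = 0.52 from Equation (5.8).

```python
# 95% confidence interval for beta1, Equation (5.12), for the
# test score regression.
beta1_hat, se_beta1 = -2.28, 0.52
lower = beta1_hat - 1.96 * se_beta1
upper = beta1_hat + 1.96 * se_beta1
print(f"95% confidence interval for beta1: [{lower:.2f}, {upper:.2f}]")
print("zero inside interval:", lower <= 0 <= upper)  # False: reject beta1 = 0
```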

Thus a 95% confidence interval for the effect of changing X by the amount Δx can be expressed as

95% confidence interval for β₁Δx = [β̂₁Δx − 1.96SE(β̂₁) × Δx, β̂₁Δx + 1.96SE(β̂₁) × Δx].  (5.13)

For example, our hypothetical superintendent is contemplating reducing the student-teacher ratio by 2. Because the 95% confidence interval for β₁ is [−3.30, −1.26], the effect of reducing the student-teacher ratio by 2 could be as great as −3.30 × (−2) = 6.60 or as little as −1.26 × (−2) = 2.52. Thus decreasing the student-teacher ratio by 2 is predicted to increase test scores by between 2.52 and 6.60 points, with a 95% confidence level.

5.3 Regression When X Is a Binary Variable

The discussion so far has focused on the case that the regressor is a continuous variable. Regression analysis can also be used when the regressor is binary, that is, when it takes on only two values, 0 or 1. For example, X might be a worker's gender (= 1 if female, = 0 if male), whether a school district is urban or rural (= 1 if urban, = 0 if rural), or whether the district's class size is small or large (= 1 if small, = 0 if large). A binary variable is also called an indicator variable or sometimes a dummy variable.

Interpretation of the Regression Coefficients

The mechanics of regression with a binary regressor are the same as if it is continuous. The interpretation of β₁, however, is different, and it turns out that regression with a binary variable is equivalent to performing a difference of means analysis, as described in Section 3.4.

To see this, suppose you have a variable Dᵢ that equals either 0 or 1, depending on whether the student-teacher ratio is less than 20:

Dᵢ = 1 if the student-teacher ratio in ith district < 20; Dᵢ = 0 if the student-teacher ratio in ith district ≥ 20.  (5.14)

The population regression model with Dᵢ as the regressor is

Yᵢ = β₀ + β₁Dᵢ + uᵢ, i = 1, ..., n.  (5.15)
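Equation (5.13) applied to the superintendent's proposed change (Δx = −2, using β̂₁ = −2.28 and SE(β̂₁) = 0.52) can be verified numerically:

```python
# 95% confidence interval for the predicted effect beta1 * delta_x,
# Equation (5.13), with delta_x = -2 (a reduction of 2 in the ratio).
beta1_hat, se_beta1, delta_x = -2.28, 0.52, -2
end1 = (beta1_hat - 1.96 * se_beta1) * delta_x
end2 = (beta1_hat + 1.96 * se_beta1) * delta_x
lo, hi = min(end1, end2), max(end1, end2)
print(f"95% CI for the predicted effect: [{lo:.2f}, {hi:.2f}]")
```

Note that because Δx is negative, the two ends of the interval in Equation (5.13) trade places, which is why the code sorts them before printing.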

CHAPTER 5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Because D_i is not continuous, it is not useful to think of β1 as a slope; indeed, because D_i can take on only two values, there is no "line," so it makes no sense to talk about a slope. Thus we will not refer to β1 as the slope in Equation (5.15); instead we will simply refer to β1 as the coefficient multiplying D_i in this regression or, more compactly, the coefficient on D_i.

If β1 in Equation (5.15) is not a slope, what is it? The best way to interpret β0 and β1 in a regression with a binary regressor is to consider, one at a time, the two possible cases, D_i = 0 and D_i = 1. If the student-teacher ratio is high, then D_i = 0 and Equation (5.15) becomes

Y_i = β0 + u_i   (D_i = 0).   (5.16)

Because E(u_i | D_i) = 0, the conditional expectation of Y_i when D_i = 0 is E(Y_i | D_i = 0) = β0; that is, β0 is the population mean value of test scores when the student-teacher ratio is high. Similarly, when D_i = 1,

Y_i = β0 + β1 + u_i   (D_i = 1).   (5.17)

Thus, when D_i = 1, E(Y_i | D_i = 1) = β0 + β1; that is, β0 + β1 is the population mean value of test scores when the student-teacher ratio is low.

Because β0 + β1 is the population mean of Y_i when D_i = 1 and β0 is the population mean of Y_i when D_i = 0, the difference (β0 + β1) − β0 = β1 is the difference between these two means. In other words, β1 is the difference between the conditional expectation of Y_i when D_i = 1 and when D_i = 0, or β1 = E(Y_i | D_i = 1) − E(Y_i | D_i = 0). In the test score example, β1 is the difference between the mean test score in districts with low student-teacher ratios and the mean test score in districts with high student-teacher ratios. Because β1 is the difference in the population means of the two groups, it makes sense that the OLS estimator β̂1 is the difference between the sample averages of Y_i in the two groups, and, in fact, this is the case.

Hypothesis tests and confidence intervals. Because β1 is the difference in the population means, the null hypothesis that the two population means are the same can be tested against the alternative hypothesis that they differ by testing the null hypothesis β1 = 0 against the alternative β1 ≠ 0. This hypothesis can be tested using the procedure outlined in Section 5.1. Specifically, the null hypothesis can be rejected at the 5% level against the two-sided alternative when the OLS t-statistic t = β̂1/SE(β̂1) exceeds 1.96 in absolute value. Similarly, a 95% confidence interval for β1, constructed as β̂1 ± 1.96 SE(β̂1) as described in Section 5.2, provides a 95% confidence interval for the difference between the two population means.
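This equivalence, that the OLS coefficient on a binary regressor equals the difference in the two group sample means, is easy to verify numerically. Below is a minimal sketch on simulated data (the numbers are hypothetical, not the California data set):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the D = 1 group has a higher population mean
d = rng.integers(0, 2, size=400)                # binary regressor
y = 650.0 + 7.0 * d + rng.normal(0, 19, 400)    # outcome

# OLS slope: beta1_hat = sample cov(D, Y) / sample var(D)
beta1_hat = np.cov(d, y, ddof=1)[0, 1] / np.var(d, ddof=1)

# Difference in sample means between the D = 1 and D = 0 groups
diff_means = y[d == 1].mean() - y[d == 0].mean()

print(beta1_hat, diff_means)  # the two numbers agree (up to floating point)
```

The agreement is exact algebraically, not just approximate: for a 0/1 regressor, the OLS slope formula reduces to the difference of group averages.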

. conditional on X. This value exceeds 1.9). so that (as we know from the previous paragraph) the hypothesis 131 ~ 0 can be rejected at the 5 % significance level. the coefficient on the student-teacher ratio binary variable D.4. SER ~ 18.9.0.8) ~ 0. Is the difference in the population mean test scores in the two groups statistically significantly different from zero at the 5% level? To find out.96 in absolute value.3) + 7.4D. This confidence interval excludes f31 ~ 0. provides a 95% confideuce interval for the difference between the two population means.18) where the standard errors of the OLS estimates of the coefficients f30 and f31 are given in parentheses below the OLS estimates. for which D ~ 0) is 650. the variance of this conditional distribution does not depend on X. Thus the average test score for the subsample with student-teacher ratios greater than or equal to 20 (that is.4/1. Similarly. the simplified formulas for the standard errors of the OLS estimators that arise if the errors are homoskedastic.4 ± 1.2. This section discusses homoskedasticity.96 X 1. constructed as ± 1. If. This is the OLS estimate of f31. furthermore. As an example.8 ~ 4. its theoretical implications.14) estimated by 0 LS using the 420 observations in Figure 4. The difference between the sample average test scores for the two groups is 7. a 95% confidence interval for f31.7.4 Heteroskedasticity and Homoskedasticity 155 alternative when the OLS r-statisric I ~ P 1/ SE(Pl) exceeds 1. 10.96 in absolute value. is that it has a mean of zero (the first least squares assumption).This is 7.0 (1. and the risks you run if you use these simplified formulas in practice. a regression of the test score against the student-teacher ratio binary variable D defined in Equation (5. . so the hypothesis that the population mean test scores in districts with high and low student-teacher ratios is the same can be rejected at the 5% significance level.96SE(iJd as described in Section 5. 
construct the z-statistic on f31: I ~ 7.037. (5. 5.2 yields TeSIScore ~ 650.0 + 7. then the errors are said to be homoskedastic.8 ~ (3.4 Heteroskedasticity and Homoskedasticity Our only assumption about the distribution of u.5. and the average test score for the subsample with studentteacher ratios lessthan 20 (so D ~ 1) is 650. R2 (1.04.4 ~ 657.4. PI Application to test scores. The OLS estimator and its standard error can be used to construct a 95% confidence interval for the true difference in means.
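The test and interval here are simple arithmetic on the reported estimate (7.4) and standard error (1.8). A quick sketch (note that the textbook value t = 4.04 is computed from unrounded estimates; the rounded values shown give about 4.1, leading to the same conclusion):

```python
beta1_hat, se = 7.4, 1.8      # estimate and standard error reported above

t = beta1_hat / se            # t-statistic for H0: beta1 = 0
ci = (beta1_hat - 1.96 * se, beta1_hat + 1.96 * se)  # 95% confidence interval

print(round(t, 2))                       # |t| > 1.96, so reject at the 5% level
print(round(ci[0], 1), round(ci[1], 1))  # the interval excludes 0
```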

What Are Heteroskedasticity and Homoskedasticity?

Definitions of heteroskedasticity and homoskedasticity. The error term u_i is homoskedastic if the variance of the conditional distribution of u_i given X_i, var(u_i | X_i = x), is constant for i = 1, …, n and in particular does not depend on x. Otherwise, the error term is heteroskedastic.

As an illustration, return to Figure 4.4. The distribution of the errors u_i is shown for various values of x. Because this distribution applies specifically for the indicated value of x, this is the conditional distribution of u_i given X_i = x. As drawn in that figure, the variance of these distributions is the same for the various values of x; that is, the conditional variance of u_i given X_i = x does not depend on x, so the errors illustrated in Figure 4.4 are homoskedastic.

In contrast, Figure 5.2 illustrates a case in which the conditional distribution of u_i spreads out as x increases. For small values of x, this distribution is tight, but for larger values of x, it has a greater spread. Thus in Figure 5.2 the variance of u_i given X_i = x increases with x, so the errors in Figure 5.2 are heteroskedastic.

The definitions of heteroskedasticity and homoskedasticity are summarized in Key Concept 5.4.

[Figure 5.2: An Example of Heteroskedasticity. Like Figure 4.4, this figure shows the conditional distribution of test scores for three different class sizes; unlike Figure 4.4, these distributions become more spread out (have a larger variance) for larger class sizes. Vertical axis: test score (600 to 720); horizontal axis: student-teacher ratio (15 to 30).]
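The definition can be made concrete with a small simulation (a hypothetical data-generating process, not the test score data). In the heteroskedastic design below, the spread of u given X = x grows with x; in the homoskedastic design it does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(10, 30, n)                 # e.g., student-teacher ratios

u_homo = rng.normal(0, 10, n)              # sd(u | X = x) constant
u_hetero = rng.normal(0, 1, n) * (x / 2)   # sd(u | X = x) = x/2 grows with x

def sd_in_bin(u, lo, hi):
    """Sample standard deviation of u for observations with lo <= x < hi."""
    return u[(x >= lo) & (x < hi)].std()

# Homoskedastic: roughly the same spread in every x bin
print(sd_in_bin(u_homo, 10, 15), sd_in_bin(u_homo, 25, 30))
# Heteroskedastic: clearly larger spread for larger x
print(sd_in_bin(u_hetero, 10, 15), sd_in_bin(u_hetero, 25, 30))
```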

"The Gender Gap in Earnings of College Graduates in the United States.4 Example. if not.19) for i = 1.. 5. It follows that the statement. Earnings. (5. + U. These terms are a moutbful. Otherwise. . The binary variable regression model relating a college graduate's earnings to his or her gender is Earnings...20) (5. In this regard. the error is hornoskedastic. Deciding whether the variance of u. be a binary variable that equals 1 for male college graduates and equals 0 for female graduates. is homoskedastic if the variance of the conditional distribution of til given Xi. Here the regressor is MALE." Let MALE. . It. it is heteroskedastic. for women. n and in particular does not depend on x. The definition of homoskedasticity states that the variance of does not depend on the regressor. so at issue is whether the variance of the error term depends on MALE." is eqnivalent to the .19) as two separate equations. depends on MA LEI requires thinking hard about what the error term actually is. f31 is the difference in the population means of the two groups-in tbis case. is constant for i = 1. . tbe difference in mean earnings between men and women who graduated from college.). is the deviation of the it" woman's earnings from the population mean earnings for women (f3o). and the definitions might seem abstract. u. "the variance of UI does not depend on MALE. (5. var(u. . Because the regressor is binary.IXi = x).SA Heteroskedasticily and Homoskedasticily 157 Heteroskedasticity and Homoskeda5ticity ~ The error term U. and for men.21) = f30 + f31 + UI Thus. To help clarify them with an example.. (women) and (men). = f30 + It.. n. Earnings. = f30 + f31MALE. one for men and one for women.. is the variance of the error term the same for men and for women? If so. the error term is beteroskedastic. In other words. 
we digress from the student-teacher ratio/test score problem and instead return to the example of earnings of male versus female college graduates considered in the box in Chapter 3. it is useful to write Equation (5. UI is the deviation of the it" man's earnings from the population mean earnings for men (f3o + f3.

In other words, in this example the error term is homoskedastic if the variance of the population distribution of earnings is the same for men and women; if these variances differ, the error term is heteroskedastic.

Mathematical Implications of Homoskedasticity

The OLS estimators remain unbiased and asymptotically normal. Because the least squares assumptions in Key Concept 4.3 place no restrictions on the conditional variance, they apply to both the general case of heteroskedasticity and the special case of homoskedasticity. Therefore, the OLS estimators remain unbiased and consistent even if the errors are heteroskedastic. In addition, the OLS estimators have sampling distributions that are normal in large samples even if the errors are heteroskedastic. Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent, and asymptotically normal.

Efficiency of the OLS estimator when the errors are homoskedastic. If the least squares assumptions in Key Concept 4.3 hold and the errors are homoskedastic, then the OLS estimators β̂0 and β̂1 are efficient among all estimators that are linear in Y_1, …, Y_n and are unbiased, conditional on X_1, …, X_n. This result, which is called the Gauss-Markov theorem, is discussed in Section 5.5.

Homoskedasticity-only variance formula. If the error term is homoskedastic, then the formulas for the variances of β̂0 and β̂1 in Key Concept 4.4 simplify. Consequently, if the errors are homoskedastic, then there is a specialized formula that can be used for the standard errors of β̂0 and β̂1. The homoskedasticity-only standard error of β̂1, derived in Appendix 5.1, is SE(β̂1) = √(σ̃²_β̂1), where σ̃²_β̂1 is the homoskedasticity-only estimator of the variance of β̂1:

σ̃²_β̂1 = s²_û / Σ_{i=1}^n (X_i − X̄)²   (homoskedasticity-only),   (5.22)

where s²_û is given in Equation (4.19). The homoskedasticity-only formula for the standard error of β̂0 is given in Appendix 5.1. In the special case that X is a binary variable, the estimator of the variance of β̂1 under homoskedasticity (that is, the square of the standard error of β̂1 under homoskedasticity) is the so-called pooled variance formula for the difference in means, given in Equation (3.23).

Because these alternative formulas are derived for the special case that the errors are homoskedastic, they do not apply if the errors are heteroskedastic.

These formulas will be referred to as the "homoskedasticity-only" formulas for the variance and standard error of the OLS estimators. As the name suggests, if the errors are heteroskedastic, then the homoskedasticity-only standard errors are inappropriate. Specifically, if the errors are heteroskedastic, then the t-statistic computed using the homoskedasticity-only standard error does not have a standard normal distribution, even in large samples. In fact, the correct critical values to use for this homoskedasticity-only t-statistic depend on the precise nature of the heteroskedasticity, so those critical values cannot be tabulated. Similarly, if the errors are heteroskedastic but a confidence interval is constructed as ±1.96 homoskedasticity-only standard errors, in general the probability that this interval contains the true value of the coefficient is not 95%, even in large samples.

In contrast, because homoskedasticity is a special case of heteroskedasticity, the estimators σ̂²_β̂1 and σ̂²_β̂0 of the variances of β̂1 and β̂0 given in Equations (5.4) and (5.26) produce valid statistical inferences whether the errors are heteroskedastic or homoskedastic. Thus hypothesis tests and confidence intervals based on those standard errors are valid whether or not the errors are heteroskedastic. Because the standard errors we have used so far [that is, those based on Equations (5.4) and (5.26)] lead to statistical inferences that are valid whether or not the errors are heteroskedastic, they are called heteroskedasticity-robust standard errors. Because such formulas were proposed by Eicker (1967), Huber (1967), and White (1980), they are also referred to as Eicker-Huber-White standard errors.

What Does This Mean in Practice?

Which is more realistic, heteroskedasticity or homoskedasticity? The answer to this question depends on the application. However, the issues can be clarified by returning to the example of the gender gap in earnings among college graduates. Familiarity with how people are paid in the world around us gives some clues as to which assumption is more sensible. For many years, and to a lesser extent today, women were not found in the top-paying jobs: There have always been poorly paid men, but there have rarely been highly paid women. This suggests that the distribution of earnings among women is tighter than among men (see the box in Chapter 3, "The Gender Gap in Earnings of College Graduates in the United States"). Thus the presence of a "glass ceiling" for women's jobs and pay suggests that the error term in the binary variable regression model in Equation (5.19) is heteroskedastic. Unless there are compelling reasons to the contrary, and we can think of none, it makes sense to treat the error term in this example as heteroskedastic.
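For the single-regressor case, the heteroskedasticity-robust variance of Equation (5.4) can be computed directly. The sketch below uses simulated data and includes an n/(n − 2) degrees-of-freedom factor; real software implementations differ in exactly which small-sample corrections they apply, so treat the constant as an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(0, 10, n)
y = 2.0 + 1.0 * x + rng.normal(0, 1, n) * (0.2 * x ** 2)  # sd(u|x) grows with x

xd = x - x.mean()
beta1 = (xd * (y - y.mean())).sum() / (xd ** 2).sum()
resid = y - (y.mean() + beta1 * xd)        # OLS residuals

# Heteroskedasticity-robust (Eicker-Huber-White) variance of beta1_hat:
# [sum xd_i^2 u_i^2] / [sum xd_i^2]^2, with an n/(n-2) correction
var_robust = (n / (n - 2)) * (xd ** 2 * resid ** 2).sum() / ((xd ** 2).sum()) ** 2
se_robust = np.sqrt(var_robust)

# Homoskedasticity-only SE for comparison (invalid under heteroskedasticity)
se_homo = np.sqrt(((resid ** 2).sum() / (n - 2)) / (xd ** 2).sum())
print(se_robust, se_homo)   # the robust SE is noticeably larger in this design
```

In this design the high-variance observations also have extreme values of x, so the homoskedasticity-only formula understates the true sampling variability of β̂1.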

The Economic Value of a Year of Education: Homoskedasticity or Heteroskedasticity?

On average, workers with more education have higher earnings than workers with less education. But if the best-paying jobs mainly go to the college educated, it might also be that the spread of the distribution of earnings is greater for workers with more education. Does the distribution of earnings spread out as education increases? This is an empirical question, so answering it requires analyzing data. Figure 5.3 is a scatterplot of the hourly earnings and the number of years of education for a sample of 2,989 full-time, 29- to 30-year-old workers in the United States in 2008, with between 6 and 18 years of education. The data come from the March 2009 Current Population Survey, which is described in Appendix 3.1.

[Figure 5.3: Scatterplot of Hourly Earnings and Years of Education for 29- to 30-Year-Olds in the United States in 2008. Hourly earnings are plotted against years of education, along with the OLS fitted values. Vertical axis: average hourly earnings; horizontal axis: years of education.]

Figure 5.3 has two striking features. The first is that the mean of the distribution of earnings increases with the number of years of education. This increase is summarized by the OLS regression line plotted in the figure [Equation (5.23)]. The coefficient of 1.76 on years of education means that, on average, hourly earnings increase by $1.76 for each additional year of education. The 95% confidence interval for this coefficient, 1.76 ± 1.96 × 0.08, is $1.60 to $1.91.

The second striking feature of Figure 5.3 is that the spread of the distribution of earnings increases with the years of education. While some workers with many years of education have low-paying jobs, very few workers with low levels of education have high-paying jobs. In real-world terms, some college graduates will be earning $50 per hour by the time they are 29, but workers with only ten years of education have no shot at those jobs. This spread can be quantified by looking at the spread of the residuals around the OLS regression line: the standard deviation of the residuals is smallest for workers with ten years of education, larger for workers with a high school diploma, and larger still for workers with a college degree. Because the standard deviations of the residuals differ for different values of the regressor (the years of education), the variance of the errors in the regression of Equation (5.23) depends on the value of the regressor; in other words, the regression errors are heteroskedastic.

As this example of modeling earnings illustrates, heteroskedasticity arises in many econometric applications. At a general level, economic theory rarely gives any reason to believe that the errors are homoskedastic. It therefore is prudent to assume that the errors might be heteroskedastic unless you have compelling reasons to believe otherwise.

Practical implications. The main issue of practical relevance in this discussion is whether one should use heteroskedasticity-robust or homoskedasticity-only standard errors. In this regard, it is useful to imagine computing both, then choosing between them. If the homoskedasticity-only and heteroskedasticity-robust standard errors are the same, nothing is lost by using the heteroskedasticity-robust standard errors; if they differ, however, then you should use the more reliable ones that allow for heteroskedasticity. The simplest thing, then, is always to use the heteroskedasticity-robust standard errors.

For historical reasons, many software programs report homoskedasticity-only standard errors as their default setting, so it is up to the user to specify the option of heteroskedasticity-robust standard errors. The details of how to implement heteroskedasticity-robust standard errors depend on the software package you use. All of the empirical examples in this book employ heteroskedasticity-robust standard errors unless explicitly stated otherwise.*

*In case this book is used in conjunction with other texts, it might be helpful to note that some textbooks add homoskedasticity to the list of least squares assumptions. As just discussed, however, this additional assumption is not needed for the validity of OLS regression analysis as long as heteroskedasticity-robust standard errors are used.

*5.5 The Theoretical Foundations of Ordinary Least Squares

As discussed in Section 4.5, the OLS estimator is unbiased, is consistent, has a variance that is inversely proportional to n, and has a normal sampling distribution when the sample size is large. In addition, under certain conditions the OLS estimator is more efficient than some other candidate estimators. Specifically, if the least squares assumptions hold and if the errors are homoskedastic, then the OLS estimator has the smallest variance of all conditionally unbiased estimators that are linear functions of Y_1, …, Y_n. This section explains and discusses this result, which is a consequence of the Gauss-Markov theorem. The section concludes with a discussion of alternative estimators that are more efficient than OLS when the conditions of the Gauss-Markov theorem do not hold.*

*This section is optional and is not used in later chapters.

Linear Conditionally Unbiased Estimators and the Gauss-Markov Theorem

If the three least squares assumptions (Key Concept 4.3) hold and if the error is homoskedastic, then the OLS estimator has the smallest variance, conditional on X_1, …, X_n, among all estimators in the class of linear conditionally unbiased estimators. In other words, the OLS estimator is the Best Linear conditionally Unbiased Estimator; that is, it is BLUE. This result is an extension of the result, summarized in Key Concept 3.3, that the sample average Ȳ is the most efficient estimator of the population mean among the class of all estimators that are unbiased and are linear functions (weighted averages) of Y_1, …, Y_n.

Linear conditionally unbiased estimators. The class of linear conditionally unbiased estimators consists of all estimators of β1 that are linear functions of Y_1, …, Y_n and that are unbiased, conditional on X_1, …, X_n. That is, if β̃1 is a linear estimator, then it can be written as

β̃1 = Σ_{i=1}^n a_i Y_i   (β̃1 is linear),   (5.24)

where the weights a_1, …, a_n can depend on X_1, …, X_n but not on Y_1, …, Y_n. The estimator β̃1 is conditionally unbiased if the mean of its conditional sampling distribution, given X_1, …, X_n, is β1. That is, the estimator β̃1 is conditionally unbiased if

E(β̃1 | X_1, …, X_n) = β1   (β̃1 is conditionally unbiased).   (5.25)

The estimator β̃1 is a linear conditionally unbiased estimator if it can be written in the form of Equation (5.24) (it is linear) and if Equation (5.25) holds (it is conditionally unbiased). It is shown in Appendix 5.2 that the OLS estimator is linear and conditionally unbiased.

The Gauss-Markov theorem. The Gauss-Markov theorem states that, under a set of conditions known as the Gauss-Markov conditions, the OLS estimator β̂1 has the smallest conditional variance, given X_1, …, X_n, of all linear conditionally unbiased estimators of β1; that is, the OLS estimator is BLUE. The Gauss-Markov conditions, which are stated in Appendix 5.2, are implied by the three least squares assumptions plus the assumption that the errors are homoskedastic.
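The efficiency claim can be illustrated by Monte Carlo: compare OLS with another linear conditionally unbiased estimator. Here the comparison estimator is a "grouping" estimator that differences the mean of Y over the high-X and low-X halves of the sample; this comparison estimator is a hypothetical choice for illustration, not one from the text. With homoskedastic errors, the OLS slope estimates vary less across repeated samples, as the Gauss-Markov theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(4)

def slopes_one_sample(n=100):
    x = rng.uniform(0, 1, n)
    y = 1.0 + 2.0 * x + rng.normal(0, 1, n)     # homoskedastic errors
    xd = x - x.mean()
    b_ols = (xd * (y - y.mean())).sum() / (xd ** 2).sum()
    # Grouping estimator: a linear function of Y with weights that depend
    # only on X, and conditionally unbiased given X
    hi = x > np.median(x)
    b_group = (y[hi].mean() - y[~hi].mean()) / (x[hi].mean() - x[~hi].mean())
    return b_ols, b_group

draws = np.array([slopes_one_sample() for _ in range(2000)])
print(draws.mean(axis=0))   # both estimators center near the true slope, 2.0
print(draws.std(axis=0))    # the OLS spread is smaller
```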

The Gauss-Markov theorem is stated in Key Concept 5.5 and proven in Appendix 5.2.

Key Concept 5.5
The Gauss-Markov Theorem for β̂1
If the three least squares assumptions in Key Concept 4.3 hold and if errors are homoskedastic, then the OLS estimator β̂1 is the Best (most efficient) Linear conditionally Unbiased Estimator (is BLUE).

Limitations of the Gauss-Markov theorem. The Gauss-Markov theorem provides a theoretical justification for using OLS. However, the theorem has two important limitations. The first is that its conditions might not hold in practice. In particular, if the error term is heteroskedastic, as it often is in economic applications, then the OLS estimator is no longer BLUE. As discussed in Section 5.4, the presence of heteroskedasticity does not pose a threat to inference based on heteroskedasticity-robust standard errors, but it does mean that OLS is no longer the efficient linear conditionally unbiased estimator. An alternative to OLS when there is heteroskedasticity of a known form, called the weighted least squares estimator, is discussed below.

The second limitation of the Gauss-Markov theorem is that even if the conditions of the theorem hold, there are other candidate estimators that are not linear and conditionally unbiased; under some conditions, these other estimators are more efficient than OLS.

Regression Estimators Other Than OLS

Under certain conditions, some regression estimators are more efficient than OLS.

The weighted least squares estimator. If the errors are heteroskedastic, then OLS is no longer BLUE. If the nature of the heteroskedasticity is known, specifically, if the conditional variance of u_i given X_i is known up to a constant factor of proportionality, then it is possible to construct an estimator that has a smaller variance than the OLS estimator. This method, called weighted least squares (WLS), weights the ith observation by the inverse of the square root of the conditional variance of u_i given X_i. Because of this weighting, the errors in this weighted regression are homoskedastic, so OLS, when applied to the weighted data, is BLUE.
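The WLS idea can be sketched on a hypothetical design in which the standard deviation of u given X = x is proportional to x, so the appropriate weight is 1/x: dividing every variable through by x makes the transformed errors homoskedastic, and across repeated samples the WLS slope estimates vary less than the OLS ones:

```python
import numpy as np

rng = np.random.default_rng(5)

def one_sample(n=200):
    x = rng.uniform(1, 10, n)
    y = 2.0 + 1.0 * x + rng.normal(0, 1, n) * x   # sd(u | X = x) proportional to x
    # OLS slope
    xd = x - x.mean()
    b_ols = (xd * (y - y.mean())).sum() / (xd ** 2).sum()
    # WLS: divide each observation by x, then run OLS on the transformed data:
    # y/x = beta0 * (1/x) + beta1 * 1 + u/x, where u/x is homoskedastic
    X = np.column_stack([1.0 / x, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, y / x, rcond=None)
    return b_ols, coef[1]                          # coef[1] estimates beta1

draws = np.array([one_sample() for _ in range(1000)])
print(draws.std(axis=0))   # the WLS slope estimates are less dispersed than OLS
```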

Although theoretically elegant, the practical problem with weighted least squares is that you must know how the conditional variance of u_i depends on X_i, something that is rarely known in econometric applications. Weighted least squares is therefore used far less frequently than OLS, and further discussion of WLS is deferred to Chapter 17.

The least absolute deviations estimator. As discussed in Section 4.3, the OLS estimator can be sensitive to outliers. If extreme outliers are not rare, then other estimators can be more efficient than OLS and can produce inferences that are more reliable. One such estimator is the least absolute deviations (LAD) estimator, in which the regression coefficients β0 and β1 are obtained by solving a minimization like that in Equation (4.6), except that the absolute value of the prediction "mistake" is used instead of its square. That is, the LAD estimators of β0 and β1 are the values of b0 and b1 that minimize Σ_{i=1}^n |Y_i − b0 − b1X_i|. The LAD estimator is less sensitive to large outliers in u than is OLS.

In many economic data sets, severe outliers in u are rare, so use of the LAD estimator, or other estimators with reduced sensitivity to outliers, is uncommon in applications. Thus the treatment of linear regression throughout the remainder of this text focuses exclusively on least squares methods.

*5.6 Using the t-Statistic in Regression When the Sample Size Is Small

When the sample size is small, the exact distribution of the t-statistic is complicated and depends on the unknown population distribution of the data. If, however, the three least squares assumptions hold, the regression errors are homoskedastic, and the regression errors are normally distributed, then the OLS estimator is normally distributed and the homoskedasticity-only t-statistic has a Student t distribution. These five assumptions (the three least squares assumptions, that the errors are homoskedastic, and that the errors are normally distributed) are collectively called the homoskedastic normal regression assumptions.

*This section is optional and is not used in later chapters.

The t-Statistic and the Student t Distribution

Recall from Section 2.4 that the Student t distribution with m degrees of freedom is defined to be the distribution of Z/√(W/m), where Z is a random variable with a standard normal distribution, W is a random variable with a chi-squared distribution

with m degrees of freedom, and Z and W are independently distributed.

Under the null hypothesis, the t-statistic testing β1 = β1,0, computed using the homoskedasticity-only standard error, can be written in this form. Under the homoskedastic normal regression assumptions, Y has a normal distribution, conditional on X_1, …, X_n. As discussed in Section 5.5, the OLS estimator is a weighted average of Y_1, …, Y_n, where the weights depend on X_1, …, X_n [see Equation (5.32) in Appendix 5.2]. Because a weighted average of independent normal random variables is normally distributed, β̂1 has a normal distribution, conditional on X_1, …, X_n. Thus (β̂1 − β1,0) has a normal distribution under the null hypothesis, conditional on X_1, …, X_n. In addition, the (normalized) homoskedasticity-only variance estimator has a chi-squared distribution with n − 2 degrees of freedom, divided by n − 2, and σ̃²_β̂1 and β̂1 are independently distributed. Consequently, the homoskedasticity-only t-statistic has a Student t distribution with n − 2 degrees of freedom.

This result is closely related to a result discussed in Section 3.5 in the context of testing for the equality of the means in two samples. In that problem, if the two population distributions are normal with the same variance and if the t-statistic is constructed using the pooled standard error formula [Equation (3.23)], then the (pooled) t-statistic has a Student t distribution. When X is binary, the homoskedasticity-only standard error for β̂1 simplifies to the pooled standard error formula for the difference of means. It follows that the result of Section 3.5 is a special case of the result that, if the homoskedastic normal regression assumptions hold, then the homoskedasticity-only regression t-statistic has a Student t distribution (see Exercise 5.10).

Use of the Student t Distribution in Practice

If the regression errors are homoskedastic and normally distributed and if the homoskedasticity-only t-statistic is used, then critical values should be taken from the Student t distribution (Appendix Table 2) instead of the standard normal distribution. Because the difference between the Student t distribution and the normal distribution is negligible if n is moderate or large, this distinction is relevant only if the sample size is small.

In econometric applications, there is rarely a reason to believe that the errors are homoskedastic and normally distributed. Because sample sizes typically are large, however, inference can proceed as described in Sections 5.1 and 5.2; that is, by first computing heteroskedasticity-robust standard errors and then by using the standard normal distribution to compute p-values, hypothesis tests, and confidence intervals.
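The small-sample point can be checked by simulating the definition of the Student t distribution given above, Z/√(W/m): for small m its two-sided 5% critical value is well above the normal value of 1.96, so using normal critical values in a small sample would reject too often.

```python
import numpy as np

rng = np.random.default_rng(6)
m = 5                                    # small degrees of freedom
reps = 500_000

z = rng.standard_normal(reps)            # Z ~ N(0, 1)
w = rng.chisquare(m, size=reps)          # W ~ chi-squared with m d.o.f.
t = z / np.sqrt(w / m)                   # Student t with m degrees of freedom

crit = np.quantile(np.abs(t), 0.95)      # two-sided 5% critical value
print(crit)                              # about 2.57, versus 1.96 for the normal
```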

5.7 Conclusion

Return for a moment to the problem that started Chapter 4: the superintendent who is considering hiring additional teachers to cut the student-teacher ratio. What have we learned that she might find useful?

Our regression analysis, based on the 420 observations for 1998 in the California test score data set, showed that there was a negative relationship between the student-teacher ratio and test scores: Districts with smaller classes have higher test scores. The coefficient is moderately large, in a practical sense: Districts with two fewer students per teacher have, on average, test scores that are 4.6 points higher. This corresponds to moving a district at the 50th percentile of the distribution of test scores to approximately the 60th percentile.

The coefficient on the student-teacher ratio is statistically significantly different from 0 at the 5% significance level. The population coefficient might be 0, and we might simply have estimated our negative coefficient by random sampling variation. However, the probability of doing so (and of obtaining a t-statistic on β1 as large as we did) purely by random variation over potential samples is exceedingly small, approximately 0.001%. A 95% confidence interval for β1 is −3.30 ≤ β1 ≤ −1.26.

This result represents considerable progress toward answering the superintendent's question, yet a nagging concern remains. There is a negative relationship between the student-teacher ratio and test scores, but is this relationship necessarily the causal one that the superintendent needs to make her decision? Districts with lower student-teacher ratios have, on average, higher test scores. But does this mean that reducing the student-teacher ratio will, in fact, increase scores? There is, in fact, reason to worry that it might not. Hiring more teachers, after all, costs money, so wealthier school districts can better afford smaller classes. But students at wealthier schools also have other advantages over their poorer neighbors, including better facilities, newer books, and better-paid teachers. Moreover, students at wealthier schools tend themselves to come from more affluent families and thus have other advantages not directly associated with their school. For example, California has a large immigrant community; these immigrants tend to be poorer than the overall population, and, in many cases, their children are not native English speakers. It thus might be that our negative estimated relationship between test scores and the student-teacher ratio is a consequence of large classes being found in conjunction with many other factors that are, in fact, the real cause of the lower test scores.

These other factors, or "omitted variables," could mean that the OLS analysis done so far has little value to the superintendent. Indeed, it could be misleading.

3. Homoskedasticityonly standard errors do not produce valid statistical inferences when the errors are heteroskedastic. if the regression errors are homoskedastic. the error u.Ix. 4. When X is binary. If the three least squares assumptions hold. the OLS estimator is BLUE. then the OLS t-statistic computed using homoskedasticity-only standard errors has a Student t distribution when the null hypothesis is true. Hypothesis testing for regression coefficients is analogous to hypothesis testing for the popnlation mean: Use the t-statistic to calculate the p-values and either accept or reject the null hypothesis. If the three least squares assumption hold and if the regression errors are homoskedastic. = x) is constant.Ix. In general. = x) depends on x.96 standard errors. Key Terms null hypothesis (146) two-sided alternative hypothesis (146) standard error of ~ I (146) z-statistic (146) p-value (147) confidence interval for /31 (151) confidence level (151) indicator variable (153) dummy variable (153) . and if the regression errors are normally distributed. A special case is when the error is homoskedastic. Summary 1. is heteroskedastic-that is.var(". To address this problem. but heteroskedasticity-robust standard errors do. The difference between the Student t distribution and the normal distribution is negligible if the sample size is moderate or large. the variance of u at a given value of x" var(". the regression model can be used to estimate and test hypotheses about the difference between the population means of the "X = 0" group and the "X = I" group. holding these other factors constant. then. a 95% confidence interval for a regression coefficient is computed as the estimator ±1.that is. Like a confidence interval for the population mean. That method is multiple regression analysis. 2. we need a method that willallow us to isolate the effect on test Scores of changing the student-teacher ratio. 5. as a result of the Gauss-Markov theorem. 
the topic of Chapter 6 and 7.Key Terms 167 Changing the student-teacher ratio alone would not change these other factors that determine a child's performance at school.
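The inference recipe in summary points 1 and 3 (estimate, robust standard error, t-statistic, confidence interval) can be sketched numerically. The following Python simulation is illustrative only and is not part of the text: the data are invented, and the variance estimator follows the heteroskedasticity-robust form of Equation (5.4).

```python
import numpy as np

def ols_with_robust_se(x, y):
    """OLS intercept/slope plus a heteroskedasticity-robust SE for the slope.

    The variance estimator mirrors Equation (5.4): the sample analog of
    var[(X - mu_X)u] / (n * var(X)^2), with an n - 2 degrees-of-freedom
    adjustment in the numerator.
    """
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    u_hat = y - b0 - b1 * x
    num = np.sum((x - xbar) ** 2 * u_hat ** 2) / (n - 2)
    den = (np.sum((x - xbar) ** 2) / n) ** 2
    return b0, b1, np.sqrt(num / den / n)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
u = (1 + 0.5 * np.abs(x)) * rng.normal(size=500)  # heteroskedastic errors
y = 2.0 + 3.0 * x + u

b0, b1, se = ols_with_robust_se(x, y)
t_stat = b1 / se                        # t-statistic for H0: beta_1 = 0
ci = (b1 - 1.96 * se, b1 + 1.96 * se)   # 95% confidence interval
```

Because the simulated errors are heteroskedastic, the homoskedasticity-only formula would misstate the sampling uncertainty here, which is why the robust form is used.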

coefficient multiplying Di (154)
coefficient on Di (154)
heteroskedasticity and homoskedasticity (156)
homoskedasticity-only standard errors (158)
heteroskedasticity-robust standard error (159)
Gauss-Markov theorem (162)
best linear unbiased estimator (BLUE) (163)
weighted least squares (163)
homoskedastic normal regression assumptions (164)
Gauss-Markov conditions (175)

Review the Concepts

5.1 Outline the procedures for computing the p-value of a two-sided test of H0: μY = 0 using an i.i.d. set of observations Yi, i = 1, ..., n. Outline the procedures for computing the p-value of a two-sided test of H0: β1 = 0 in a regression model using an i.i.d. set of observations (Yi, Xi), i = 1, ..., n.

5.2 Explain how you could use a regression model to estimate the wage gender gap using the data on earnings of men and women. What are the dependent and independent variables?

5.3 Define homoskedasticity and heteroskedasticity. Provide a hypothetical empirical example in which you think the errors would be heteroskedastic and explain your reasoning.

Exercises

5.1 Suppose that a researcher, using data on class size (CS) and average test scores from 100 third-grade classes, estimates the OLS regression

TestScore^ = 520.4 − 5.82 × CS, R² = 0.08, SER = 11.5.
             (20.4)  (2.21)

a. Construct a 95% confidence interval for β1, the regression slope coefficient.
b. Calculate the p-value for the two-sided test of the null hypothesis H0: β1 = 0. Do you reject the null hypothesis at the 5% level? At the 1% level?
c. Calculate the p-value for the two-sided test of the null hypothesis H0: β1 = −5.6. Without doing any additional calculations, determine whether −5.6 is contained in the 95% confidence interval for β1.
d. Construct a 99% confidence interval for β0.

5.2 Suppose that a researcher, using wage data on 250 randomly selected male workers and 280 female workers, estimates the OLS regression

Wage^ = 12.52 + 2.12 × Male, R² = 0.06, SER = 4.2,
        (0.23)  (0.36)

where Wage is measured in dollars per hour and Male is a binary variable that is equal to 1 if the person is a male and 0 if the person is a female. Define the wage gender gap as the difference in mean earnings between men and women.

a. What is the estimated gender gap?
b. Is the estimated gender gap significantly different from zero? (Compute the p-value for testing the null hypothesis that there is no gender gap.)
c. Construct a 95% confidence interval for the gender gap.
d. In the sample, what is the mean wage of women? Of men?
e. Another researcher uses these same data but regresses Wages on Female, a variable that is equal to 1 if the person is female and 0 if the person is a male. What are the regression estimates calculated from this regression?

Wage^ = ___ + ___ × Female, R² = ___, SER = ___.

5.3 Suppose that a random sample of 200 twenty-year-old men is selected from a population and their heights and weights are recorded. A regression of weight on height yields

Weight^ = −99.41 + 3.94 × Height, R² = 0.81, SER = 10.2,
          (2.15)   (0.31)

where Weight is measured in pounds and Height is measured in inches. A man has a late growth spurt and grows 1.5 inches over the course of a year. Construct a 99% confidence interval for the person's weight gain.
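Exercises like 5.1 ask for confidence intervals and p-values computed from a reported coefficient and its standard error. As an illustrative check (not part of the text), the arithmetic for Exercise 5.1(a)-(b) can be done in a few lines of Python using the large-sample normal approximation:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Reported values from Exercise 5.1: slope -5.82, standard error 2.21
beta1_hat, se_beta1 = -5.82, 2.21

# 95% confidence interval: estimate +/- 1.96 standard errors
ci = (beta1_hat - 1.96 * se_beta1, beta1_hat + 1.96 * se_beta1)

# Two-sided p-value for H0: beta_1 = 0
t_act = beta1_hat / se_beta1
p_value = 2.0 * norm_cdf(-abs(t_act))
```

With these numbers the t-statistic is about −2.63 and the p-value is below 0.01, so the null of a zero slope is rejected at both the 5% and 1% levels.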

.0 I.6.170 CHAPTER 5 .4 Read the box "The Economic Value of a Year of Education: Homoskedas_ ticity or Heteroskedasticity?" in Section 5.23) to answer the following. in the population.5) R2 = 0. . b. A regression of TestScore on Small Ie. Let mollCloss denote a binary variable equal 10 1 if the student is assigned t a small class and equal to 0 otherwise. Confidence Intervals 5.5 In the 1980s. Construct a 99% confidence interval for the effect of mall Class on 5.) was computed USing Equation (5.6) (2. What is the worker's expected average hourly earnings? b.6 Refer to the regression described in Exercise 5. Is the estimated effect of class size on test score cant? Carry out a test at the 5% level. A high school graduate (12 years of educati n) is contemplating going to a community college for a 2-year degree.0 + 13.5.5(c)? Explain. Do small classes improve test scores? By how much? Is the effect large? Explain. (1. A high school counselor tells a student that. a. a. Do you think that the regressi n errors plausibly are homoskedastic? Explain. test score. b. A randomly selected 30-year-old worker reports an educati n level of 16 years. and given standardized tests at the end of the year.) Suppose that. iatistically signifi- c. the standardized tests have a mean Score of 925 points and a standard deviation f 75 points. on average. college graduates earn $10 per hour more than high school graduates. (Regular cia ses contained approximately 24 students.Tennessee conducted an experiment in which kindergarten students were randomly assigned to "regular" and "small" classes.SS yields Yes/Score = 918. ER = 74. Use the regression repOrted in Equation (5.4. a. How much is this worker's average hourly earnings expected 10 increase? c. and small classe contained approximately 15st udents.9 X SmaltClass. S' I R ressor:HypothesisTests and Regressionwith a Ing e eg . ). SE(~. Is this statement consistent with the regression evidence? 
What range of values is consistent with the regression evidence? 5.Suppose that the regressron errors were homoskedastic: Would this affect the validity of the confidence interval constructed in Exercise 5.

5.7 Suppose that (Yi, Xi) satisfy the assumptions in Key Concept 4.3. A random sample of size n = 250 is drawn and yields

Ŷ = 5.4 + 3.2X, R² = 0.26, SER = 6.2.
    (3.1) (1.5)

a. Test H0: β1 = 0 vs. H1: β1 ≠ 0 at the 5% level.
b. Construct a 95% confidence interval for β1.
c. Suppose you learned that Yi and Xi were independent. Would you be surprised? Explain.
d. Suppose that Yi and Xi are independent and many samples of size n = 250 are drawn, regressions estimated, and (a) and (b) answered. In what fraction of the samples would H0 from (a) be rejected? In what fraction of samples would the value β1 = 0 be included in the confidence interval from (b)?

5.8 Suppose that (Yi, Xi) satisfy the assumptions in Key Concept 4.3 and, in addition, ui is N(0, σ²u) and is independent of Xi. A sample of size n = 30 yields

Ŷ = 43.2 + 61.5X, R² = 0.54, SER = 1.52,
    (10.2)  (7.4)

where the numbers in parentheses are the homoskedastic-only standard errors for the regression coefficients.

a. Construct a 95% confidence interval for β0.
b. Test H0: β1 = 55 vs. H1: β1 ≠ 55 at the 5% level.
c. Test H0: β1 = 55 vs. H1: β1 > 55 at the 5% level.

5.9 Consider the regression model

Yi = βXi + ui,

where ui and Xi satisfy the assumptions in Key Concept 4.3. Let β̄ denote an estimator of β that is constructed as β̄ = Ȳ/X̄, where Ȳ and X̄ are the sample means of Yi and Xi, respectively.

a. Show that β̄ is a linear function of Y1, ..., Yn.
b. Show that β̄ is conditionally unbiased.

5.10 Let Xi denote a binary variable and consider the regression Yi = β0 + β1Xi + ui. Let Ȳ0 denote the sample mean for observations with X = 0 and Ȳ1 denote the sample mean for observations with X = 1. Show that β̂0 = Ȳ0, β̂0 + β̂1 = Ȳ1, and β̂1 = Ȳ1 − Ȳ0.

5.11 A random sample of workers contains nm = 120 men and nw = 131 women. The sample average of men's weekly earnings [Ȳm = (1/nm) Σ Ym,i] is $523.10, and the sample standard deviation [sm = sqrt((1/(nm − 1)) Σ (Ym,i − Ȳm)²)] is $68.10. The corresponding values for women are Ȳw = $485.10 and sw = $51.10. Let Women denote an indicator variable that is equal to 1 for women and 0 for men and suppose that all 251 observations are used in the regression Yi = β0 + β1 Women_i + ui. Find the OLS estimates of β0 and β1 and their corresponding standard errors.

5.12 Starting from Equation (4.22), derive the variance of β̂0 under homoskedasticity given in Equation (5.28) in Appendix 5.1.

5.13 Suppose that (Yi, Xi) satisfy the assumptions in Key Concept 4.3 and, in addition, ui is N(0, σ²u) and is independent of Xi.

a. Is β̂1 conditionally unbiased?
b. Is β̂1 the best linear conditionally unbiased estimator of β1?
c. How would your answers to (a) and (b) change if you assumed only that (Yi, Xi) satisfied the assumptions in Key Concept 4.3 and var(ui|Xi = x) is constant?
d. How would your answers to (a) and (b) change if you assumed only that (Yi, Xi) satisfied the assumptions in Key Concept 4.3?

5.14 Suppose that Yi = βXi + ui, where (ui, Xi) satisfy the Gauss-Markov conditions given in Equation (5.31).

a. Derive the least squares estimator of β and show that it is a linear function of Y1, ..., Yn.
b. Show that the estimator is conditionally unbiased.
c. Derive the conditional variance of the estimator.
d. Prove that the estimator is BLUE.

5.15 A researcher has two independent samples of observations on (Yi, Xi). To be specific, suppose that Yi denotes earnings, Xi denotes years of schooling, and the independent samples are for men and women. Write the regression for men as Ym,i = βm,0 + βm,1 Xm,i + um,i and the regression for women as Yw,i = βw,0 + βw,1 Xw,i + uw,i. Let β̂m,1 denote the OLS estimator constructed using the sample of men, let β̂w,1 denote the OLS estimator constructed from the sample of women, and let SE(β̂m,1) and SE(β̂w,1) denote the corresponding standard errors. Show that the standard error of β̂m,1 − β̂w,1 is given by

SE(β̂m,1 − β̂w,1) = sqrt{ [SE(β̂m,1)]² + [SE(β̂w,1)]² }.
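The algebra in Exercise 5.10 (with a binary regressor, the OLS intercept and slope reproduce the two group means) can be verified numerically. The sketch below is not part of the text; the data and numbers are invented:

```python
import numpy as np

# Simulated data with a binary regressor
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=200).astype(float)
y = 10.0 + 4.0 * x + rng.normal(size=200)

# OLS slope and intercept computed directly
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Group means of Y for the X = 0 and X = 1 observations
y0 = y[x == 0].mean()
y1 = y[x == 1].mean()

# With a binary regressor, OLS reproduces the group means exactly:
assert np.isclose(b0, y0)        # intercept = mean of the X = 0 group
assert np.isclose(b1, y1 - y0)   # slope = difference in group means
```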

Empirical Exercises

E5.1 Using the data set CPS08 described in Empirical Exercise E4.1, run a regression of average hourly earnings (AHE) on Age and carry out the following exercises.

a. Is the estimated regression slope coefficient statistically significant? That is, can you reject the null hypothesis H0: β1 = 0 versus a two-sided alternative at the 10%, 5%, or 1% significance level? What is the p-value associated with the coefficient's t-statistic?
b. Construct a 95% confidence interval for the slope coefficient.
c. Repeat (a) using only the data for high school graduates.
d. Repeat (a) using only the data for college graduates.
e. Is the effect of age on earnings different for high school graduates than for college graduates? Explain. (Hint: See Exercise 5.15.)

E5.2 Using the data set TeachingRatings described in Empirical Exercise E4.2, run a regression of Course_Eval on Beauty. Is the estimated regression slope coefficient statistically significant? That is, can you reject the null hypothesis H0: β1 = 0 versus a two-sided alternative at the 10%, 5%, or 1% significance level? What is the p-value associated with the coefficient's t-statistic?

E5.3 Using the data set CollegeDistance described in Empirical Exercise E4.3, run a regression of years of completed education (ED) on distance to the nearest college (Dist) and carry out the following exercises.

a. Is the estimated regression slope coefficient statistically significant? That is, can you reject the null hypothesis H0: β1 = 0 versus a two-sided alternative at the 10%, 5%, or 1% significance level? What is the p-value associated with the coefficient's t-statistic?
b. Construct a 95% confidence interval for the slope coefficient.
c. Run the regression using data only on females and repeat (b).
d. Run the regression using data only on males and repeat (b).
e. Is the effect of distance on completed years of education different for men than for women? (Hint: See Exercise 5.15.)

APPENDIX 5.1  Formulas for OLS Standard Errors

This appendix discusses the formulas for OLS standard errors. These are first presented under the least squares assumptions in Key Concept 4.3, which allow for heteroskedasticity; these are the "heteroskedasticity-robust" standard errors. Formulas for the variance of the OLS estimators and the associated standard errors are then given for the special case of homoskedasticity.

Heteroskedasticity-Robust Standard Errors

The estimator σ̂²β̂1 defined in Equation (5.4) is obtained by replacing the population variances in Equation (4.21) by the corresponding sample variances, with a modification. The variance in the numerator of Equation (4.21) is estimated by (1/(n − 2)) Σ (Xi − X̄)² û²i, where the divisor n − 2 (instead of n) incorporates a degrees-of-freedom adjustment to correct for downward bias, analogously to the degrees-of-freedom adjustment used in the definition of the SER in Section 4.3. The variance in the denominator is estimated by (1/n) Σ (Xi − X̄)². Replacing var[(Xi − μX)ui] and var(Xi) in Equation (4.21) by these two estimators yields σ̂²β̂1 in Equation (5.4). The consistency of heteroskedasticity-robust standard errors is discussed in Section 17.3. The reasoning behind the estimator σ̂²β̂1 is the same as behind σ̂²Ȳ and stems from replacing population expectations with sample averages.

The estimator of the variance of β̂0 is

σ̂²β̂0 = (1/n) × [ (1/(n − 2)) Σ Ĥ²i û²i ] / [ (1/n) Σ Ĥ²i ]²,  where Ĥi = 1 − [ X̄ / ((1/n) Σ X²j) ] Xi.    (5.26)

The standard error of β̂0 is SE(β̂0) = sqrt(σ̂²β̂0).

Homoskedasticity-Only Variances

Under homoskedasticity, the conditional variance of ui given Xi is a constant: var(ui|Xi) = σ²u. If the errors are homoskedastic, the formulas in Key Concept 4.4 simplify to

σ²β̂1 = σ²u / (n σ²X)    (5.27)

σ²β̂0 = E(X²i) σ²u / (n σ²X).    (5.28)
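As an illustrative check (not part of the original appendix), the homoskedasticity-only variance formula (5.27) can be verified by simulation; the simulation design below is invented:

```python
import numpy as np

# Monte Carlo check of the homoskedasticity-only formula (5.27):
# var(beta1_hat) should be close to sigma_u^2 / (n * sigma_X^2).
rng = np.random.default_rng(2)
n, reps = 100, 5000
sigma_u, sigma_x = 2.0, 1.5

slopes = np.empty(reps)
for r in range(reps):
    x = rng.normal(scale=sigma_x, size=n)
    u = rng.normal(scale=sigma_u, size=n)   # homoskedastic by construction
    y = 1.0 + 0.5 * x + u
    slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

theoretical = sigma_u ** 2 / (n * sigma_x ** 2)   # Equation (5.27)
empirical = slopes.var()                          # across simulated samples
```

The empirical variance of the simulated slopes lands within a few percent of the value implied by Equation (5.27).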

To derive Equation (5.27), write the numerator in Equation (4.21) as var[(Xi − μX)ui] = E({(Xi − μX)ui − E[(Xi − μX)ui]}²) = E{[(Xi − μX)ui]²} = E[(Xi − μX)² u²i] = E[(Xi − μX)² var(ui|Xi)], where the second equality follows because E[(Xi − μX)ui] = 0 (by the first least squares assumption) and where the final equality follows from the law of iterated expectations (Section 2.3). If ui is homoskedastic, then var(ui|Xi) = σ²u, so E[(Xi − μX)² var(ui|Xi)] = σ²u E[(Xi − μX)²] = σ²u σ²X. The result in Equation (5.27) follows by substituting this expression into the numerator of Equation (4.21) and simplifying. A similar calculation yields Equation (5.28).

The homoskedasticity-only standard errors are obtained by substituting sample means and variances for the population means and variances in Equations (5.27) and (5.28) and by estimating the variance of ui by the square of the SER. The homoskedasticity-only estimators of these variances are

σ̃²β̂1 = s²û / Σ (Xi − X̄)²    (homoskedasticity-only)    (5.29)

σ̃²β̂0 = [ (1/n) Σ X²i ] s²û / Σ (Xi − X̄)²    (homoskedasticity-only)    (5.30)

where s²û is given in Equation (4.19). The homoskedasticity-only standard errors are the square roots of σ̃²β̂0 and σ̃²β̂1.

APPENDIX 5.2  The Gauss-Markov Conditions and a Proof of the Gauss-Markov Theorem

As discussed in Section 5.5, the Gauss-Markov theorem states that if the Gauss-Markov conditions hold, then the OLS estimator is the best (most efficient) conditionally linear unbiased estimator (is BLUE). This appendix begins by stating the Gauss-Markov conditions and showing that they are implied by the three least squares conditions plus homoskedasticity.

The Gauss-Markov Conditions

The three Gauss-Markov conditions are

(i) E(ui | X1, ..., Xn) = 0
(ii) var(ui | X1, ..., Xn) = σ²u, 0 < σ²u < ∞
(iii) E(ui uj | X1, ..., Xn) = 0, i ≠ j,    (5.31)

where the conditions hold for i, j = 1, ..., n. The three conditions, respectively, state that ui has mean zero, that ui has a constant variance, and that the errors are uncorrelated for different observations, where all these statements hold conditionally on all observed X's (X1, ..., Xn).

The Gauss-Markov conditions are implied by the three least squares assumptions (Key Concept 4.3), plus the additional assumption that the errors are homoskedastic. Because the observations are i.i.d. (Assumption 2), E(ui | X1, ..., Xn) = E(ui | Xi), and by Assumption 1, E(ui | Xi) = 0; thus condition (i) holds. Similarly, by Assumption 2, var(ui | X1, ..., Xn) = var(ui | Xi), and because the errors are assumed to be homoskedastic, var(ui | Xi) = σ²u, which is constant. Assumption 3 (nonzero finite fourth moments) ensures that 0 < σ²u < ∞, so condition (ii) holds. To show that condition (iii) is implied by the least squares assumptions, note that E(ui uj | X1, ..., Xn) = E(ui uj | Xi, Xj) because (Xi, Yi) are i.i.d. by Assumption 2. Assumption 2 also implies that E(ui uj | Xi, Xj) = E(ui | Xi) E(uj | Xj) for i ≠ j; because E(ui | Xi) = 0 for all i, it follows that E(ui uj | X1, ..., Xn) = 0 for all i ≠ j, so condition (iii) holds. Thus the least squares assumptions in Key Concept 4.3, plus homoskedasticity of the errors, imply the Gauss-Markov conditions in Equation (5.31).

The OLS Estimator β̂1 Is a Linear Conditionally Unbiased Estimator

To show that β̂1 is linear, first note that, because Σ (Xi − X̄) = 0 (by the definition of X̄), Σ (Xi − X̄)(Yi − Ȳ) = Σ (Xi − X̄)Yi − Ȳ Σ (Xi − X̄) = Σ (Xi − X̄)Yi. Substituting this result into the formula for β̂1 in Equation (4.7) yields

β̂1 = Σ (Xi − X̄)Yi / Σ (Xj − X̄)² = Σ âi Yi, where âi = (Xi − X̄) / Σ (Xj − X̄)².    (5.32)

Because the weights âi (i = 1, ..., n) in Equation (5.32) depend on X1, ..., Xn but not on Y1, ..., Yn, the OLS estimator β̂1 is a linear estimator.

Under the Gauss-Markov conditions, β̂1 is conditionally unbiased, and the variance of the conditional distribution of β̂1, given X1, ..., Xn, is

var(β̂1 | X1, ..., Xn) = σ²u / Σ (Xi − X̄)².    (5.33)

The result that β̂1 is conditionally unbiased was previously shown in Appendix 4.3.

Proof of the Gauss-Markov Theorem

We start by deriving some facts that hold for all linear conditionally unbiased estimators; that is, for all estimators β̃1 satisfying β̃1 = Σ ai Yi and E(β̃1 | X1, ..., Xn) = β1, where the weights ai (i = 1, ..., n) can depend on X1, ..., Xn but not on Y1, ..., Yn. Substituting Yi = β0 + β1 Xi + ui into β̃1 = Σ ai Yi and collecting terms, we have that

β̃1 = β0 (Σ ai) + β1 (Σ ai Xi) + Σ ai ui.    (5.34)

By the first Gauss-Markov condition, E(Σ ai ui | X1, ..., Xn) = Σ ai E(ui | X1, ..., Xn) = 0; thus taking conditional expectations of both sides of Equation (5.34) yields E(β̃1 | X1, ..., Xn) = β0 (Σ ai) + β1 (Σ ai Xi). Because β̃1 is conditionally unbiased by assumption, it must be that β0 (Σ ai) + β1 (Σ ai Xi) = β1, but for this equality to hold for all values of β0 and β1, it must be the case that, for β̃1 to be conditionally unbiased,

Σ ai = 0 and Σ ai Xi = 1.    (5.35)

Under the Gauss-Markov conditions, the variance of β̃1, conditional on X1, ..., Xn, is var(β̃1 | X1, ..., Xn) = var(Σ ai ui | X1, ..., Xn) = Σi Σj ai aj cov(ui, uj | X1, ..., Xn). Applying the second and third Gauss-Markov conditions, the cross terms in the double summation vanish and the expression for the conditional variance simplifies to

var(β̃1 | X1, ..., Xn) = σ²u Σ a²i.    (5.36)

Note that Equations (5.35) and (5.36) apply to β̂1 with weights ai = âi, given in Equation (5.32).

We now show that the two restrictions in Equation (5.35) and the expression for the conditional variance in Equation (5.36) imply that the conditional variance of β̃1 exceeds the conditional variance of β̂1 unless β̃1 = β̂1. Let ai = âi + di, so Σ a²i = Σ (âi + di)² = Σ â²i + 2 Σ âi di + Σ d²i. Using the definition of âi, we have that

Σ âi di = Σ (Xi − X̄) di / Σ (Xj − X̄)² = [ Σ di Xi − X̄ Σ di ] / Σ (Xj − X̄)² = 0,

where the final equality follows because Σ di = Σ ai − Σ âi = 0 and Σ di Xi = Σ ai Xi − Σ âi Xi = 1 − 1 = 0 [by Equation (5.35), which holds for both ai and âi]. Thus σ²u Σ a²i = σ²u Σ â²i + σ²u Σ d²i = var(β̂1 | X1, ..., Xn) + σ²u Σ d²i; substituting this into Equation (5.36) yields

var(β̃1 | X1, ..., Xn) = var(β̂1 | X1, ..., Xn) + σ²u Σ d²i.    (5.37)

Thus β̃1 has a greater conditional variance than β̂1 if di is nonzero for any i. But if di = 0 for all i, then ai = âi and β̃1 = β̂1, which proves that OLS is BLUE.

The Gauss-Markov Theorem When X Is Nonrandom

With a minor change in interpretation, the Gauss-Markov theorem also applies to nonrandom regressors; that is, it applies to regressors that do not change their values over repeated samples. Specifically, if the second least squares assumption is replaced by the assumption that X1, ..., Xn are nonrandom (fixed over repeated samples) and u1, ..., un are i.i.d., then the foregoing statement and proof of the Gauss-Markov theorem apply directly, except that all of the "conditional on X1, ..., Xn" statements are unnecessary because X1, ..., Xn take on the same values from one sample to the next.

The Sample Average Is the Efficient Linear Estimator of E(Y)

An implication of the Gauss-Markov theorem is that the sample average, Ȳ, is the most efficient linear estimator of E(Yi) when Y1, ..., Yn are i.i.d. To see this, consider the case of regression without an "X," so that the only regressor is the constant regressor X0i = 1. Then the OLS estimator is β̂0 = Ȳ. It follows that, under the Gauss-Markov assumptions, Ȳ is BLUE. Note that the Gauss-Markov requirement that the error be homoskedastic is automatically satisfied in this case because there is no regressor, so it follows that Ȳ is BLUE if Y1, ..., Yn are i.i.d. This result was stated previously in Key Concept 3.3.
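The two restrictions in Equation (5.35) and the variance decomposition in Equation (5.37) can be illustrated numerically. The following Python sketch is not part of the text (the data are invented): it constructs the OLS weights and a perturbation d satisfying the unbiasedness restrictions, and checks that the perturbed weights only add variance.

```python
import numpy as np

# Numerical check of the Gauss-Markov argument with invented data
rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)

# OLS weights a_hat_i from Equation (5.32)
a = (x - x.mean()) / np.sum((x - x.mean()) ** 2)
assert np.isclose(a.sum(), 0.0) and np.isclose(a @ x, 1.0)  # Equation (5.35)

# Build a perturbation d with sum(d) = 0 and sum(d * x) = 0 by projecting
# random noise off an orthonormal basis for span{1, x}.
q1 = np.ones(n) / np.sqrt(n)
q2 = x - (x @ q1) * q1
q2 /= np.linalg.norm(q2)
d = rng.normal(size=n)
d -= (d @ q1) * q1 + (d @ q2) * q2
assert np.isclose(d.sum(), 0.0) and np.isclose(d @ x, 0.0)

# Under homoskedasticity var = sigma_u^2 * sum(weights^2), so the
# alternative weights a + d have strictly larger variance (Equation 5.37).
assert np.sum((a + d) ** 2) > np.sum(a ** 2)
```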

CHAPTER 6

Linear Regression with Multiple Regressors

Chapter 5 ended on a worried note. Although school districts with lower student-teacher ratios tend to have higher test scores in the California data set, perhaps students from districts with small classes have other advantages that help them perform well on standardized tests. Could this have produced misleading results and, if so, what can be done?

Omitted factors, such as student characteristics, can, in fact, make the ordinary least squares (OLS) estimator of the effect of class size on test scores misleading or, more precisely, biased. This chapter explains this "omitted variable bias" and introduces multiple regression, a method that can eliminate omitted variable bias. The key idea of multiple regression is that, if we have data on these omitted variables, then we can include them as additional regressors and thereby estimate the effect of one regressor (the student-teacher ratio) while holding constant the other variables (such as student characteristics).

This chapter explains how to estimate the coefficients of the multiple linear regression model. Many aspects of multiple regression parallel those of regression with a single regressor, studied in Chapters 4 and 5. The coefficients of the multiple regression model can be estimated from data using OLS; the OLS estimators in multiple regression are random variables because they depend on data from a random sample; and in large samples the sampling distributions of the OLS estimators are approximately normal.

6.1 Omitted Variable Bias

By focusing only on the student-teacher ratio, the empirical analysis in Chapters 4 and 5 ignored some potentially important determinants of test scores by collecting their influences in the regression error term. These omitted factors include school characteristics, such as teacher quality and computer usage, and student characteristics, such as family background. We begin by considering an omitted student characteristic that is particularly relevant in California because of its large immigrant population: the prevalence in the school district of students who are still learning English.

By ignoring the percentage of English learners in the district, the OLS estimator of the slope in the regression of test scores on the student-teacher ratio could be biased; that is, the mean of the sampling distribution of the OLS estimator might not equal the true effect on test scores of a unit change in the student-teacher ratio. Here is the reasoning. Students who are still learning English might perform worse on standardized tests than native English speakers. If districts with large classes also have many students still learning English, then the OLS regression of test scores on the student-teacher ratio could erroneously find a correlation and produce a large estimated coefficient, when in fact the true causal effect of cutting class sizes on test scores is small, perhaps even zero. Accordingly, based on the analysis of Chapters 4 and 5, the superintendent might hire enough new teachers to reduce the student-teacher ratio by 2, but her hoped-for improvement in test scores will fail to materialize if the true coefficient is small or zero.

A look at the California data lends credence to this concern. The correlation between the student-teacher ratio and the percentage of English learners (students who are not native English speakers and who have not yet mastered English) in the district is 0.19. This small but positive correlation suggests that districts with more English learners tend to have a higher student-teacher ratio (larger classes). If the student-teacher ratio were unrelated to the percentage of English learners, then it would be safe to ignore English proficiency in the regression of test scores on the student-teacher ratio. But because the student-teacher ratio and the percentage of English learners are correlated, it is possible that the OLS coefficient in the regression of test scores on the student-teacher ratio reflects that influence.

Definition of Omitted Variable Bias

If the regressor (the student-teacher ratio) is correlated with a variable that has been omitted from the analysis (the percentage of English learners) and that determines, in part, the dependent variable (test scores), then the OLS estimator will have omitted variable bias.

Omitted variable bias occurs when two conditions are true: (1) the omitted variable is correlated with the included regressor and (2) the omitted variable is a determinant of the dependent variable. To illustrate these conditions, consider three examples of variables that are omitted from the regression of test scores on the student-teacher ratio.

Example #1: Percentage of English learners. Because the percentage of English learners is correlated with the student-teacher ratio, the first condition for omitted variable bias holds. It is plausible that students who are still learning English will do worse on standardized tests than native English speakers, in which case the percentage of English learners is a determinant of test scores and the second condition for omitted variable bias holds. Thus the OLS estimator in the regression of test scores on the student-teacher ratio could incorrectly reflect the influence of the omitted variable, the percentage of English learners. That is, omitting the percentage of English learners may introduce omitted variable bias.

Example #2: Time of day of the test. Another variable omitted from the analysis is the time of day that the test was administered. For this omitted variable, it is plausible that the first condition for omitted variable bias does not hold but that the second condition does. For example, if the time of day of the test varies from one district to the next in a way that is unrelated to class size, then the time of day and class size would be uncorrelated, so the first condition does not hold. Conversely, the time of day of the test could affect scores (alertness varies through the school day), so the second condition would be satisfied. However, because in this example the time of day the test is administered is uncorrelated with the student-teacher ratio, the student-teacher ratio could not be incorrectly picking up the "time of day" effect. Thus omitting the time of day of the test does not result in omitted variable bias.

Example #3: Parking lot space per pupil. Another omitted variable is parking lot space per pupil (the area of the teacher parking lot divided by the number of students). This variable satisfies the first but not the second condition for omitted variable bias. Specifically, schools with more teachers per pupil probably have more teacher parking space, so the first condition would be satisfied. However, under the assumption that learning takes place in the classroom, not in the parking lot, parking lot space per pupil is not a determinant of test scores: Parking lot space has no direct effect on learning, so the second condition does not hold. Because parking lot space per pupil is not a determinant of test scores, omitting it from the analysis does not lead to omitted variable bias.

Omitted variable bias is summarized in Key Concept 6.1.

Omitted variable bias and the first least squares assumption. Omitted variable bias means that the first least squares assumption—that E(ui | Xi) = 0, as listed in Key Concept 4.3—is incorrect. To see why, recall that the error term ui in the linear regression model with a single regressor represents all factors, other than Xi, that are determinants of Yi. If one of these other factors is correlated with Xi, this means that the error term (which contains this factor) is correlated with Xi. In other words, if an omitted variable is a determinant of Yi and if it is correlated with Xi, then it is in the error term.

Because ui and Xi are correlated, the conditional mean of ui given Xi is nonzero. This correlation therefore violates the first least squares assumption, and the consequence is serious: The OLS estimator is biased. This bias does not vanish even in very large samples, and the OLS estimator is inconsistent.

KEY CONCEPT 6.1
Omitted Variable Bias in Regression with a Single Regressor

Omitted variable bias is the bias in the OLS estimator that arises when the regressor, X, is correlated with an omitted variable. For omitted variable bias to occur, two conditions must be true:

1. X is correlated with the omitted variable.
2. The omitted variable is a determinant of the dependent variable, Y.

A Formula for Omitted Variable Bias

The discussion of the previous section about omitted variable bias can be summarized mathematically by a formula for this bias. Let the correlation between Xi and ui be corr(Xi, ui) = ρXu. Suppose that the second and third least squares assumptions hold, but the first does not because ρXu is nonzero. Then the OLS estimator has the limit (derived in Appendix 6.1)

β̂1 →p β1 + ρXu (σu / σX).    (6.1)

That is, as the sample size increases, β̂1 is close to β1 + ρXu (σu / σX) with increasingly high probability.

The formula in Equation (6.1) summarizes several of the ideas discussed above about omitted variable bias:

1. Omitted variable bias is a problem whether the sample size is large or small. Because β̂1 does not converge in probability to the true value β1, β̂1 is biased and inconsistent; omitted variable bias persists even in very large samples.
2. Whether this bias is large or small in practice depends on the correlation ρXu between the regressor and the error term. The larger |ρXu| is, the larger the bias.
3. The direction of the bias in β̂1 depends on whether X and u are positively or negatively correlated.
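Equation (6.1) can be illustrated with a small Monte Carlo exercise. The sketch below is not part of the text, and all parameter values are invented: it builds an error term with corr(X, u) = 0.5 and shows the OLS slope settling near the biased probability limit rather than the true β1.

```python
import numpy as np

# Monte Carlo illustration of Equation (6.1): when corr(X, u) = rho_Xu is
# nonzero, the OLS slope converges to beta_1 + rho_Xu * (sigma_u / sigma_X).
rng = np.random.default_rng(4)
n = 200_000
beta1, rho, sigma_x, sigma_u = 1.0, 0.5, 2.0, 3.0

x = rng.normal(scale=sigma_x, size=n)
# u built so that corr(x, u) = rho and sd(u) = sigma_u
u = sigma_u * (rho * x / sigma_x + np.sqrt(1 - rho ** 2) * rng.normal(size=n))
y = beta1 * x + u

b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
plim = beta1 + rho * sigma_u / sigma_x   # Equation (6.1): here 1.75, not 1.0
```

Even with 200,000 observations, b1_hat sits near 1.75 rather than the true slope of 1.0, illustrating point 1 above: more data do not cure omitted variable bias.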

The Mozart Effect: Omitted Variable Bias?

A study published in Nature in 1993 (Rauscher, Shaw, and Ky, 1993) suggested that listening to Mozart for 10 to 15 minutes could temporarily raise your IQ by 8 or 9 points. That study made big news, and politicians and parents saw an easy way to make their children smarter. For a while, the state of Georgia even distributed classical music CDs to all infants in the state.

What is the evidence for the "Mozart effect"? A review of dozens of studies found that students who take optional music or arts courses in high school do, in fact, have higher English and math test scores than those who don't.¹ A closer look at these studies, however, suggests that the real reason for the better test performance has little to do with those courses. Instead, the authors of the review suggested that the correlation between testing well and taking art or music could arise from any number of things. For example, the academically better students might have more time to take optional music courses or more interest in doing so, or those schools with a deeper music curriculum might just be better schools across the board.

In the terminology of regression, the estimated relationship between test scores and taking optional music courses appears to have omitted variable bias. By omitting factors such as the student's innate ability or the overall quality of the school, studying music appears to have an effect on test scores when in fact it has none.

So is there a Mozart effect? One way to find out is to do a randomized controlled experiment. (As discussed in Chapter 4, randomized controlled experiments eliminate omitted variable bias by randomly assigning participants to "treatment" and "control" groups.) Taken together, the many controlled experiments on the Mozart effect fail to show that listening to Mozart improves IQ or general test performance. For reasons not fully understood, however, it seems that listening to classical music does help temporarily in one narrow area: folding paper and visualizing shapes. So the next time you cram for an origami exam, try to fit in a little Mozart.

¹See the fall/winter 2000 issue of Journal of Aesthetic Education 34, especially the article by Ellen Winner and Monica Cooper (pp. 11-76) and the one by Lois Hetland (pp. 105-148).

In the test score example, we speculated that the percentage of students learning English has a negative effect on district test scores (students still learning English have lower scores), so that the percentage of English learners enters the error term with a negative sign. In our data, the fraction of English learners is positively correlated with the student-teacher ratio.

TABLE 6.1  Differences in Test Scores for California School Districts with Low and High Student-Teacher Ratios, Broken Down by the Percentage of English Learners in the District

                                Student-Teacher Ratio < 20   Student-Teacher Ratio >= 20   Difference in Test Scores, Low vs. High STR
                                Average Test Score      n    Average Test Score       n    Difference   t-statistic
All districts                         657.4           238          650.0            182        7.4          4.04
Percentage of English learners:
  < 1.9%                              664.5            76          665.4             27       -0.9         -0.30
  1.9-8.8%                            665.2            64          661.8             44        3.3          1.13
  8.8-23.0%                           654.9            54          649.7             50        5.2          1.72
  > 23.0%                             636.7            44          634.8             61        1.9          0.68

Addressing Omitted Variable Bias by Dividing the Data into Groups

What can you do about omitted variable bias? Our superintendent is considering increasing the number of teachers in her district, but she has no control over the fraction of immigrants in her community. As a result, she is interested in the effect of the student-teacher ratio on test scores, holding constant other factors, including the percentage of English learners. This new way of posing her question suggests that, instead of using data for all districts, perhaps we should focus on districts with percentages of English learners comparable to hers. Among this subset of districts, do those with smaller classes do better on standardized tests?

Table 6.1 reports evidence on the relationship between class size and test scores within districts with comparable percentages of English learners. Districts are divided into eight groups. First, the districts are broken into four categories

that correspond to the quartiles of the distribution of the percentage of English learners across districts. Second, within each of these four categories, districts are further broken down into two groups, depending on whether the student-teacher ratio is small (STR < 20) or large (STR >= 20).

The first row in Table 6.1 reports the overall difference in average test scores between districts with low and high student-teacher ratios, that is, the difference in test scores between these two groups without breaking them down further into the quartiles of the percentage of English learners. (Recall that this difference was previously reported in regression form in Equation (5.18) as the OLS estimate of the coefficient on D_i in the regression of TestScore on D_i, where D_i is a binary regressor that equals 1 if STR_i < 20 and equals 0 otherwise.) Over the full sample of 420 districts, the average test score is 7.4 points higher in districts with a low student-teacher ratio than a high one; the t-statistic is 4.04, so the null hypothesis that the mean test score is the same in the two groups is rejected at the 1% significance level.

The final four rows in Table 6.1 report the difference in test scores between districts with low and high student-teacher ratios, broken down by the quartile of the percentage of English learners. This evidence presents a different picture. Of the districts with the fewest English learners (< 1.9%), the average test score for the 76 districts with low student-teacher ratios is 664.5 and the average for the 27 with high student-teacher ratios is 665.4. Thus, for the districts with the fewest English learners, test scores were on average 0.9 points lower in the districts with low student-teacher ratios! In the second quartile, districts with low student-teacher ratios had test scores that averaged 3.3 points higher than those with high student-teacher ratios; this gap was 5.2 points for the third quartile and only 1.9 points for the quartile of districts with the most English learners. Once we hold the percentage of English learners constant, the difference in performance between districts with high and low student-teacher ratios is perhaps half (or less) of the overall estimate of 7.4 points.

At first this finding might seem puzzling. How can the overall effect on test scores be twice the effect within any quartile? The answer is that the districts with the most English learners tend to have both the highest student-teacher ratios and the lowest test scores. The difference in the average test score between districts in the lowest and highest quartile of the percentage of English learners is large, approximately 30 points. The districts with few English learners also tend to have lower student-teacher ratios: 74% (76 of 103) of the districts in the first quartile have small classes (STR < 20), while only 42% (44 of 105) of the districts in the quartile with the most English learners have small classes. So, the districts with the most English learners have both lower test scores and higher student-teacher ratios than the other districts.
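The entries in Table 6.1 are differences in group means together with their t-statistics. As a sketch of the mechanics (in Python, using made-up illustrative scores rather than the California data set), the difference and its t-statistic can be computed as in Section 3.4:

```python
import numpy as np

def diff_of_means(low_scores, high_scores):
    """Difference in mean scores and its t-statistic (unequal variances)."""
    d = low_scores.mean() - high_scores.mean()
    se = np.sqrt(low_scores.var(ddof=1) / len(low_scores)
                 + high_scores.var(ddof=1) / len(high_scores))
    return d, d / se

rng = np.random.default_rng(1)
# Hypothetical districts: the low-STR group scores higher on average.
low_str = rng.normal(657.0, 19.0, size=238)
high_str = rng.normal(650.0, 18.0, size=182)

d, t = diff_of_means(low_str, high_str)
print(f"difference = {d:.1f} points, t-statistic = {t:.2f}")
```

Applying the same function within each quartile of the percentage of English learners reproduces the within-group comparisons in the last four rows of the table.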

This analysis reinforces the superintendent's worry that omitted variable bias is present in the regression of test scores against the student-teacher ratio. By looking within quartiles of the percentage of English learners, the test score differences in the second part of Table 6.1 improve on the simple difference-of-means analysis in the first line of Table 6.1. Still, this analysis does not yet provide the superintendent with a useful estimate of the effect on test scores of changing class size, holding constant the fraction of English learners. Such an estimate can be provided, however, using the method of multiple regression.

6.2 The Multiple Regression Model

The multiple regression model extends the single variable regression model of Chapters 4 and 5 to include additional variables as regressors. This model permits estimating the effect on Y_i of changing one variable (X_1i) while holding the other regressors (X_2i, X_3i, and so forth) constant. In the class size problem, the multiple regression model provides a way to isolate the effect on test scores (Y_i) of the student-teacher ratio (X_1i) while holding constant the percentage of students in the district who are English learners (X_2i).

The Population Regression Line

Suppose for the moment that there are only two independent variables, X_1i and X_2i. In the linear multiple regression model, the average relationship between these two independent variables and the dependent variable, Y, is given by the linear function

    E(Y_i | X_1i = x_1, X_2i = x_2) = β₀ + β₁x₁ + β₂x₂,    (6.2)

where E(Y_i | X_1i = x_1, X_2i = x_2) is the conditional expectation of Y_i given that X_1i = x_1 and X_2i = x_2. That is, if the student-teacher ratio in the i-th district (X_1i) equals some value x_1 and the percentage of English learners in the i-th district (X_2i) equals x_2, then the expected value of Y_i, given the student-teacher ratio and the percentage of English learners, is given by Equation (6.2).

Equation (6.2) is the population regression line or population regression function in the multiple regression model. The coefficient β₀ is the intercept; the coefficient β₁ is the slope coefficient of X_1i or, more simply, the coefficient on X_1i; and the coefficient β₂ is the slope coefficient of X_2i or, more simply, the coefficient on X_2i. One or more of the independent variables in the multiple regression model are sometimes referred to as control variables.

The interpretation of the coefficient β₁ in Equation (6.2) is different than it was when X_1i was the only regressor: In Equation (6.2), β₁ is the effect on Y (the expected change in Y) of a unit change in X₁, holding X₂ fixed or, equivalently, holding X₂ constant or controlling for X₂.

This interpretation of β₁ follows from the definition that the expected effect on Y of a change in X₁, ΔX₁, holding X₂ constant, is the difference between the expected value of Y when the independent variables take on the values X₁ + ΔX₁ and X₂ and the expected value of Y when the independent variables take on the values X₁ and X₂. Accordingly, write the population regression function in Equation (6.2) as Y = β₀ + β₁X₁ + β₂X₂, and imagine changing X₁ by the amount ΔX₁ while not changing X₂, that is, while holding X₂ constant. Because X₁ has changed, Y will change by some amount, say ΔY. After this change, the new value of Y, Y + ΔY, is

    Y + ΔY = β₀ + β₁(X₁ + ΔX₁) + β₂X₂.    (6.3)

An equation for ΔY in terms of ΔX₁ is obtained by subtracting the equation Y = β₀ + β₁X₁ + β₂X₂ from Equation (6.3), yielding ΔY = β₁ΔX₁. That is,

    β₁ = ΔY/ΔX₁, holding X₂ constant.    (6.4)

The coefficient β₁ is the effect on Y of a unit change in X₁, holding X₂ fixed. Another phrase used to describe β₁ is the partial effect on Y of X₁, holding X₂ fixed. Similarly, β₂ is the effect on Y of a unit change in X₂, holding X₁ fixed.

The interpretation of the intercept in the multiple regression model, β₀, is similar to the interpretation of the intercept in the single-regressor model: It is the expected value of Y_i when X_1i and X_2i are zero. Simply put, the intercept β₀ determines how far up the Y axis the population regression line starts.

The Population Multiple Regression Model

The population regression line in Equation (6.2) is the relationship between Y and X₁ and X₂ that holds on average in the population. Just as in the case of regression with a single regressor, however, this relationship does not hold exactly, because many other factors influence the dependent variable. In addition to the student-teacher ratio and the fraction of students still learning English, test scores are influenced by school characteristics, other student characteristics, and luck. Thus the population regression function in Equation (6.2) needs to be augmented to incorporate these additional factors.

Accordingly, the population regression function is augmented with an "error" term u_i. This error term is the deviation of a particular observation (test scores in the i-th district in our example) from the average population relationship. Accordingly, we have

    Y_i = β₀ + β₁X_1i + β₂X_2i + u_i,  i = 1, ..., n,    (6.5)

where the subscript i indicates the i-th of the n observations (districts) in the sample. Equation (6.5) is the population multiple regression model when there are two regressors, X_1i and X_2i.

In regression with binary regressors, it can be useful to treat β₀ as the coefficient on a regressor that always equals 1; that is, think of β₀ as the coefficient on X_0i, where X_0i = 1 for i = 1, ..., n. Accordingly, the population multiple regression model in Equation (6.5) can alternatively be written as

    Y_i = β₀X_0i + β₁X_1i + β₂X_2i + u_i, where X_0i = 1, i = 1, ..., n.    (6.6)

The variable X_0i is sometimes called the constant regressor because it takes on the same value, 1, for all observations. Similarly, the intercept, β₀, is sometimes called the constant term in the regression. The two ways of writing the population regression model, Equations (6.5) and (6.6), are equivalent.

The discussion so far has focused on the case of a single additional variable, X₂. In practice, however, there might be multiple factors omitted from the single-regressor model. For example, ignoring the students' economic background might result in omitted variable bias, just as ignoring the fraction of English learners did. This reasoning leads us to consider a model with three regressors or, more generally, a model that includes k regressors. The multiple regression model with k regressors, X_1i, X_2i, ..., X_ki, is summarized as Key Concept 6.2.

The definitions of homoskedasticity and heteroskedasticity in the multiple regression model are extensions of their definitions in the single-regressor model. The error term u_i in the multiple regression model is homoskedastic if the variance of the conditional distribution of u_i given X_1i, ..., X_ki, var(u_i | X_1i, ..., X_ki), is constant for i = 1, ..., n and thus does not depend on the values of X_1i, ..., X_ki. Otherwise, the error term is heteroskedastic.

The multiple regression model holds out the promise of providing just what the superintendent wants to know: the effect of changing the student-teacher ratio, holding constant other factors that are beyond her control. These factors include not just the percentage of English learners, but other measurable factors that might affect test performance, including the economic background of the students. To be

of practical help to the superintendent, however, we need to provide her with estimates of the unknown population coefficients β₀, ..., β_k of the population regression model calculated using a sample of data. Fortunately, these coefficients can be estimated using ordinary least squares.

KEY CONCEPT 6.2: The Multiple Regression Model

The multiple regression model is

    Y_i = β₀ + β₁X_1i + β₂X_2i + ... + β_kX_ki + u_i,  i = 1, ..., n,    (6.7)

where

• Y_i is the i-th observation on the dependent variable; X_1i, X_2i, ..., X_ki are the i-th observations on each of the k regressors; and u_i is the error term.

• The population regression line is the relationship that holds between Y and the X's on average in the population: E(Y | X_1i = x_1, X_2i = x_2, ..., X_ki = x_k) = β₀ + β₁x₁ + β₂x₂ + ... + β_kx_k.

• β₁ is the slope coefficient on X₁: the expected change in Y_i resulting from changing X_1i by one unit, holding constant X_2i, ..., X_ki. The coefficients on the other X's are interpreted similarly.

• The intercept β₀ is the expected value of Y when all the X's equal 0. The intercept can be thought of as the coefficient on a regressor, X_0i, that equals 1 for all i.

6.3 The OLS Estimator in Multiple Regression

This section describes how the coefficients of the multiple regression model can be estimated using OLS.

The OLS Estimator

Section 4.2 shows how to estimate the intercept and slope coefficients in the single-regressor model by applying OLS to a sample of observations of Y and X. The key idea is that these coefficients can be estimated by minimizing the sum of squared prediction mistakes, that is, by choosing the estimators b₀ and b₁ so as to minimize Σⁿᵢ₌₁(Y_i − b₀ − b₁X_i)². The estimators that do so are the OLS estimators, β̂₀ and β̂₁.

The method of OLS also can be used to estimate the coefficients β₀, β₁, ..., β_k in the multiple regression model. Let b₀, b₁, ..., b_k be estimators of β₀, β₁, ..., β_k. The predicted value of Y_i, calculated using these estimators, is b₀ + b₁X_1i + ... + b_kX_ki, and the mistake in predicting Y_i is Y_i − (b₀ + b₁X_1i + ... + b_kX_ki) = Y_i − b₀ − b₁X_1i − ... − b_kX_ki. The sum of these squared prediction mistakes over all n observations thus is

    Σⁿᵢ₌₁ (Y_i − b₀ − b₁X_1i − ... − b_kX_ki)².    (6.8)

The sum of the squared mistakes for the linear regression model in Expression (6.8) is the extension of the sum of the squared mistakes given in Equation (4.6) for the linear regression model with a single regressor.

The estimators of the coefficients β₀, β₁, ..., β_k that minimize the sum of squared mistakes in Expression (6.8) are called the ordinary least squares (OLS) estimators of β₀, β₁, ..., β_k, and they are denoted β̂₀, β̂₁, ..., β̂_k.

The terminology of OLS in the linear multiple regression model is the same as in the linear regression model with a single regressor. The OLS regression line is the straight line constructed using the OLS estimators: β̂₀ + β̂₁X₁ + ... + β̂_kX_k. The predicted value of Y_i given X_1i, ..., X_ki, based on the OLS regression line, is Ŷ_i = β̂₀ + β̂₁X_1i + ... + β̂_kX_ki. The OLS residual for the i-th observation is the difference between Y_i and its OLS predicted value; that is, the OLS residual is û_i = Y_i − Ŷ_i.

The OLS estimators could be computed by trial and error, repeatedly trying different values of b₀, ..., b_k until you are satisfied that you have minimized the total sum of squares in Expression (6.8). It is far easier, however, to use explicit formulas for the OLS estimators that are derived using calculus. The formulas for the OLS estimators in the multiple regression model are similar to those in Key Concept 4.2 for the single-regressor model. These formulas are incorporated into modern statistical software. In the multiple regression model, the formulas are best expressed and discussed using matrix notation, so their presentation is deferred to Section 18.1.
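Although the general matrix formulas are deferred to Section 18.1, the calculus solution to minimizing Expression (6.8) can be sketched directly: the OLS estimates solve the so-called normal equations. The Python sketch below (simulated data; the coefficient values are hypothetical and not from the California data set) computes the estimates this way and checks that perturbing any coefficient away from the OLS value raises the sum of squared prediction mistakes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(20.0, 2.0, n)        # e.g., a student-teacher ratio
x2 = rng.normal(15.0, 10.0, n)       # e.g., a percentage of English learners
u = rng.normal(0.0, 10.0, n)
y = 700.0 - 1.0 * x1 - 0.6 * x2 + u  # hypothetical population coefficients

# Design matrix including the constant regressor X0 = 1.
X = np.column_stack([np.ones(n), x1, x2])

# OLS: solve the normal equations (X'X) b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def ssr(b):
    """Sum of squared prediction mistakes for coefficient vector b."""
    resid = y - X @ b
    return resid @ resid

print(beta_hat)
# The OLS estimate minimizes the SSR, so any perturbation does worse.
print(ssr(beta_hat) < ssr(beta_hat + np.array([0.0, 0.1, 0.0])))  # True
```

The trial-and-error search described in the text would eventually find the same minimizer; the normal equations simply jump to it in one step.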

KEY CONCEPT 6.3: The OLS Estimators, Predicted Values, and Residuals in the Multiple Regression Model

The OLS estimators β̂₀, β̂₁, ..., β̂_k are the values of b₀, b₁, ..., b_k that minimize the sum of squared prediction mistakes Σⁿᵢ₌₁(Y_i − b₀ − b₁X_1i − ... − b_kX_ki)². The OLS predicted values Ŷ_i and residuals û_i are

    Ŷ_i = β̂₀ + β̂₁X_1i + ... + β̂_kX_ki,  i = 1, ..., n,    (6.9)

and

    û_i = Y_i − Ŷ_i,  i = 1, ..., n.    (6.10)

The OLS estimators β̂₀, β̂₁, ..., β̂_k and residuals û_i are computed from a sample of n observations of (X_1i, ..., X_ki, Y_i), i = 1, ..., n. These are estimators of the unknown true population coefficients β₀, β₁, ..., β_k and error term u_i.

The definitions and terminology of OLS in multiple regression are summarized in Key Concept 6.3.

Application to Test Scores and the Student-Teacher Ratio

In Section 4.2, we used OLS to estimate the intercept and slope coefficient of the regression relating test scores (TestScore) to the student-teacher ratio (STR), using our 420 observations for California school districts. The estimated OLS regression line, reported in Equation (4.11), is

    TestScore = 698.9 − 2.28 × STR.    (6.11)

Our concern has been that this relationship is misleading because the student-teacher ratio might be picking up the effect of having many English learners in districts with large classes; that is, it is possible that the OLS estimator is subject to omitted variable bias.

We are now in a position to address this concern by using OLS to estimate a multiple regression in which the dependent variable is the test score (Y_i) and there are two regressors: the student-teacher ratio (X_1i) and the percentage of English
learners in the district (X_2i). The OLS regression line, estimated using our 420 districts, is

    TestScore = 686.0 − 1.10 × STR − 0.650 × PctEL.    (6.12)

The estimated effect on test scores of a change in the student-teacher ratio, holding constant the percentage of English learners, is roughly half as large as in the single-regressor regression of Equation (6.11): Holding PctEL constant, a reduction of one in the student-teacher ratio is estimated to increase test scores by 1.10 points rather than by 2.28 points. This difference reflects the omitted variable bias in Equation (6.11): Districts with many English learners tend to have both high student-teacher ratios and low test scores, so omitting the percentage of English learners overstated how much test scores rise when class size falls.

6.4 Measures of Fit in Multiple Regression

Three commonly used summary statistics in multiple regression are the standard error of the regression, the regression R², and the adjusted R² (also known as R̄²). All three statistics measure how well the OLS estimate of the multiple regression line describes, or "fits," the data.

The Standard Error of the Regression (SER)

The standard error of the regression (SER) estimates the standard deviation of the error term u_i. Thus the SER is a measure of the spread of the distribution of Y around the regression line. In multiple regression, the SER is

    SER = s_û, where s²_û = (1/(n − k − 1)) Σⁿᵢ₌₁ û²_i = SSR/(n − k − 1),    (6.13)

and where SSR is the sum of squared residuals, SSR = Σⁿᵢ₌₁ û²_i.

The only difference between the definition in Equation (6.13) and the definition of the SER in Section 4.3 for the single-regressor model is that here the divisor is n − k − 1 rather than n − 2. In Section 4.3, the divisor n − 2 (rather than n) adjusts for the downward bias introduced by estimating two coefficients (the slope and intercept of the regression line). Here, the divisor n − k − 1 adjusts for the downward bias introduced by estimating k + 1 coefficients (the k slope coefficients plus the intercept). As in Section 4.3, using n − k − 1 rather than n is called a degrees-of-freedom adjustment. If there is a single regressor, then k = 1, so the formula in Section 4.3 is the same as in Equation (6.13). When n is large, the effect of the degrees-of-freedom adjustment is negligible.

The R²

The regression R² is the fraction of the sample variance of Y_i explained by (or predicted by) the regressors. Equivalently, the R² is 1 minus the fraction of the variance of Y_i not explained by the regressors. The mathematical definition of the R² is the same as for regression with a single regressor:

    R² = ESS/TSS = 1 − SSR/TSS,    (6.14)

where the explained sum of squares is ESS = Σⁿᵢ₌₁(Ŷ_i − Ȳ)² and the total sum of squares is TSS = Σⁿᵢ₌₁(Y_i − Ȳ)².
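Equations (6.13) and (6.14) can be computed directly from the OLS residuals. The Python sketch below (simulated data; the true error standard deviation is set to 1, so the SER should be near 1) also verifies the identity TSS = ESS + SSR, which holds whenever the regression includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([5.0, 1.0, -2.0]) + rng.normal(size=n)  # true sigma_u = 1

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
u_hat = y - y_hat

SSR = np.sum(u_hat ** 2)               # sum of squared residuals
ESS = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
TSS = np.sum((y - y.mean()) ** 2)      # total sum of squares

SER = np.sqrt(SSR / (n - k - 1))       # Equation (6.13)
R2 = 1.0 - SSR / TSS                   # Equation (6.14)

# With an intercept among the regressors, TSS = ESS + SSR.
print(abs(TSS - (ESS + SSR)) < 1e-6 * TSS)  # True
print(f"SER = {SER:.2f}, R2 = {R2:.3f}")
```

Because n is large relative to k here, dividing by n − k − 1 rather than n barely changes the SER, illustrating the remark about the degrees-of-freedom adjustment.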

In multiple regression, the R² increases whenever a regressor is added, unless the estimated coefficient on the added regressor is exactly zero. To see this, think about starting with one regressor and then adding a second. When you use OLS to estimate the model with both regressors, OLS finds the values of the coefficients that minimize the sum of squared residuals. If OLS happens to choose the coefficient on the new regressor to be exactly zero, then the SSR will be the same whether or not the second variable is included in the regression. But if OLS chooses any value other than zero, then it must be that this value reduced the SSR relative to the regression that excludes this regressor. In practice, it is extremely unusual for an estimated coefficient to be exactly zero, so in general the SSR will decrease when a new regressor is added. But this means that the R² generally increases (and never decreases) when a new regressor is added.

The "Adjusted R²"

Because the R² increases when a new variable is added, an increase in the R² does not mean that adding a variable actually improves the fit of the model. In this sense, the R² gives an inflated estimate of how well the regression fits the data. One way to correct for this is to deflate or reduce the R² by some factor, and this is what the adjusted R², or R̄², does.

The adjusted R², or R̄², is a modified version of the R² that does not necessarily increase when a new regressor is added. The R̄² is

    R̄² = 1 − ((n − 1)/(n − k − 1)) (SSR/TSS) = 1 − s²_û/s²_Y.    (6.15)

The difference between this formula and the second definition of the R² in Equation (6.14) is that the ratio of the sum of squared residuals to the total sum of squares is multiplied by the factor (n − 1)/(n − k − 1). As the second expression in Equation (6.15) shows, this means that the adjusted R² is 1 minus the ratio of the sample variance of the OLS residuals [with the degrees-of-freedom correction in Equation (6.13)] to the sample variance of Y.

There are three useful things to know about the R̄². First, (n − 1)/(n − k − 1) is always greater than 1, so R̄² is always less than R². Second, adding a regressor has two opposite effects on the R̄². On the one hand, the SSR falls, which increases the R̄². On the other hand, the factor (n − 1)/(n − k − 1) increases. Whether the R̄² increases or decreases depends on which of these two effects is stronger.
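This tug-of-war can be seen numerically. The Python sketch below (simulated data, not from the text) adds a pure-noise regressor to a simple regression: the R² of Equation (6.14) cannot fall, while the adjusted R̄² of Equation (6.15) is penalized by the factor (n − 1)/(n − k − 1) and may fall.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
pure_noise = rng.normal(size=n)  # irrelevant regressor, unrelated to y

def fit_r2(cols):
    """R-squared and adjusted R-squared for a regression of y on cols."""
    X = np.column_stack([np.ones(n)] + cols)
    k = len(cols)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ b) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ssr / tss                           # Equation (6.14)
    adj = 1.0 - (n - 1) / (n - k - 1) * ssr / tss  # Equation (6.15)
    return r2, adj

r2_1, adj_1 = fit_r2([x])
r2_2, adj_2 = fit_r2([x, pure_noise])

print(r2_2 >= r2_1)  # True: the R2 never falls when a regressor is added
print(adj_2 < r2_2)  # True: the factor (n-1)/(n-k-1) exceeds 1
```

Whether adj_2 ends up above or below adj_1 depends on whether the noise regressor happens to reduce the SSR by enough to offset the larger penalty factor, which is exactly the trade-off described above.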

Third, the R̄² can be negative. This happens when the regressors, taken together, reduce the sum of squared residuals by such a small amount that this reduction fails to offset the factor (n − 1)/(n − k − 1).

Application to Test Scores

Equation (6.12) reports the estimated regression line for the multiple regression relating test scores (TestScore) to the student-teacher ratio (STR) and the percentage of English learners (PctEL). The R² for this regression line is R² = 0.426, the adjusted R² is R̄² = 0.424, and the standard error of the regression is SER = 14.5.

Comparing these measures of fit with those for the regression in which PctEL is excluded [Equation (6.11)] shows that including PctEL in the regression increased the R² from 0.051 to 0.426. When the only regressor is STR, only a small fraction of the variation in TestScore is explained; however, when PctEL is added to the regression, more than two-fifths (42.6%) of the variation in test scores is explained. In this sense, including the percentage of English learners substantially improves the fit of the regression. Because n is large and only two regressors appear in Equation (6.12), the difference between R² and adjusted R² is very small (R² = 0.426 versus R̄² = 0.424).

The SER for the regression excluding PctEL is 18.6; this value falls to 14.5 when PctEL is included as a second regressor. The units of the SER are points on the standardized test. The reduction in the SER tells us that predictions about standardized test scores are substantially more precise if they are made using the regression with both STR and PctEL than if they are made using the regression with only STR as a regressor.

Using the R² and adjusted R². The R² is useful because it quantifies the extent to which the regressors account for, or explain, the variation in the dependent variable. Nevertheless, heavy reliance on the R² (or R̄²) can be a trap. In applications, "maximize the R̄²" is rarely the answer to any economically or statistically meaningful question. Instead, the decision about whether to include a variable in a multiple regression should be based on whether including that variable allows you better to estimate the causal effect of interest. We return to the issue of how to decide which variables to include, and which to exclude, in Chapter 7.

First, however, we need to develop methods for quantifying the sampling uncertainty of the OLS estimator. The starting point for doing so is extending the least squares assumptions of Chapter 4 to the case of multiple regressors.

6.5 The Least Squares Assumptions in Multiple Regression

There are four least squares assumptions in the multiple regression model. The first three are those of Section 4.4 for the single-regressor model, extended to allow for multiple regressors; these are discussed only briefly. The fourth assumption is new.

Assumption #1: The Conditional Distribution of u_i Given X_1i, X_2i, ..., X_ki Has a Mean of Zero

The first assumption is that the conditional distribution of u_i given X_1i, ..., X_ki has a mean of zero. This assumption extends the first least squares assumption with a single regressor to multiple regressors. It means that, whatever the values of the regressors, the other factors captured in the error term average out to zero: On average over the population, Y_i falls on the population regression line. As is the case for regression with a single regressor, this is the key assumption that makes the OLS estimators unbiased; we return to omitted variable bias in multiple regression in Chapter 7.

Assumption #2: (X_1i, X_2i, ..., X_ki, Y_i), i = 1, ..., n, Are i.i.d.

The second assumption is that (X_1i, ..., X_ki, Y_i), i = 1, ..., n, are independently and identically distributed (i.i.d.) random variables. This assumption holds automatically if the data are collected by simple random sampling. The comments on this assumption appearing in Section 4.4 for a single regressor also apply to multiple regressors.

Assumption #3: Large Outliers Are Unlikely

The third least squares assumption is that large outliers, that is, observations with values far outside the usual range of the data, are unlikely. This assumption serves as a reminder that, as in the single-regressor case, the OLS estimator of the coefficients in the multiple regression model can be sensitive to large outliers. Mathematically, the assumption is that X_1i, ..., X_ki and Y_i have nonzero finite fourth moments; in other words, the dependent variable and the regressors have finite

kurtosis. This assumption is used to derive the properties of OLS regression statistics in large samples.

Assumption #4: No Perfect Multicollinearity

The fourth least squares assumption is new to the multiple regression model. It rules out an inconvenient situation, called perfect multicollinearity, in which it is impossible to compute the OLS estimator. The regressors are said to exhibit perfect multicollinearity (or to be perfectly multicollinear) if one of the regressors is a perfect linear function of the other regressors. The fourth least squares assumption is that the regressors are not perfectly multicollinear.

Why does perfect multicollinearity make it impossible to compute the OLS estimator? Suppose you want to estimate the coefficient on STR in a regression of TestScore_i on STR_i and PctEL_i, except that you make a typographical error and accidentally type in STR_i a second time instead of PctEL_i; that is, you regress TestScore_i on STR_i and STR_i. This is a case of perfect multicollinearity because one of the regressors (the first occurrence of STR) is a perfect linear function of another regressor (the second occurrence of STR). Depending on how your software package handles perfect multicollinearity, if you try to estimate this regression the software will do one of two things: Either it will drop one of the occurrences of STR or it will refuse to calculate the OLS estimates and give an error message. The mathematical reason for this failure is that perfect multicollinearity produces division by zero in the OLS formulas.

At an intuitive level, perfect multicollinearity is a problem because you are asking the regression to answer an illogical question. In multiple regression, the coefficient on one of the regressors is the effect of a change in that regressor, holding the other regressors constant. In the hypothetical regression of TestScore on STR and STR, the coefficient on the first occurrence of STR is the effect on test scores of a change in STR, holding constant STR. This makes no sense, and OLS cannot estimate this nonsensical partial effect.

The solution to perfect multicollinearity in this hypothetical regression is simply to correct the typo and to replace one of the occurrences of STR with the variable you originally wanted to include. This example is typical: When perfect multicollinearity occurs, it often reflects a logical mistake in choosing the regressors or some previously unrecognized feature of the data set. In general, the solution to perfect multicollinearity is to modify the regressors to eliminate the problem. Additional examples of perfect multicollinearity are given in Section 6.7, which also defines and discusses imperfect multicollinearity.
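The division-by-zero problem can be made concrete with a little matrix algebra: with a duplicated regressor, the matrix X'X appearing in the OLS formulas is singular, so the normal equations have no unique solution. A Python sketch of the typo described above (the data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
stud_teach = rng.normal(20.0, 2.0, n)  # hypothetical student-teacher ratios
pct_el = rng.normal(15.0, 8.0, n)      # hypothetical percentages of English learners

# Intended design matrix: constant regressor, STR, PctEL.
X_ok = np.column_stack([np.ones(n), stud_teach, pct_el])

# The typo from the text: STR entered twice, so one column is a
# perfect linear function of another.
X_bad = np.column_stack([np.ones(n), stud_teach, stud_teach])

print(np.linalg.matrix_rank(X_ok))   # full column rank: 3
print(np.linalg.matrix_rank(X_bad))  # rank deficient: 2

# X'X is singular for the multicollinear design: its smallest
# singular value is (numerically) zero.
print(np.linalg.svd(X_bad.T @ X_bad, compute_uv=False))
```

Statistical packages detect exactly this rank deficiency, which is why they either drop one of the duplicated columns or report an error instead of returning estimates.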

KEY CONCEPT 6.4: The Least Squares Assumptions in the Multiple Regression Model

    Y_i = β₀ + β₁X_1i + β₂X_2i + ... + β_kX_ki + u_i,  i = 1, ..., n, where

1. u_i has conditional mean zero given X_1i, X_2i, ..., X_ki; that is, E(u_i | X_1i, X_2i, ..., X_ki) = 0.
2. (X_1i, X_2i, ..., X_ki, Y_i), i = 1, ..., n, are independently and identically distributed (i.i.d.) draws from their joint distribution.
3. Large outliers are unlikely: X_1i, ..., X_ki and Y_i have nonzero finite fourth moments.
4. There is no perfect multicollinearity.

6.6 The Distribution of the OLS Estimators in Multiple Regression

Because the data differ from one sample to the next, different samples produce different values of the OLS estimators. This variation across possible samples gives rise to the uncertainty associated with the OLS estimators of the population regression coefficients, β₀, β₁, ..., β_k. Just as in the case of regression with a single regressor, this variation is summarized in the sampling distribution of the OLS estimators.

Although the algebra is more complicated when there are multiple regressors, the central limit theorem applies to the OLS estimators in the multiple regression model for the same reason that it applies to Ȳ and to the OLS estimators when there is a single regressor: The OLS estimators β̂₀, β̂₁, ..., β̂_k are averages of the randomly sampled data, and if the sample size is sufficiently large, the sampling distribution of those averages becomes normal. Because the multivariate normal distribution is best handled mathematically using matrix algebra, the expressions for the joint distribution of the OLS estimators are deferred to Chapter 18.

Key Concept 6.5 summarizes the result that, in large samples, the distribution of the OLS estimators in multiple regression is approximately jointly normal. In general, the OLS estimators are correlated; this correlation arises from the correlation between the regressors. The joint sampling distribution of the OLS estimators is discussed in more detail for the case that there are two regressors and homoskedastic errors in Appendix 6.2, and the general case is discussed in Section 18.2.

KEY CONCEPT 6.5: Large-Sample Distribution of β̂₀, β̂₁, ..., β̂_k

If the least squares assumptions (Key Concept 6.4) hold, then in large samples the OLS estimators β̂₀, β̂₁, ..., β̂_k are jointly normally distributed, and each β̂_j is distributed N(β_j, σ²_β̂j), j = 0, ..., k. The joint distribution is well approximated by a multivariate normal distribution, which is the extension of the bivariate normal distribution to the general case of two or more jointly normal random variables (Section 2.4).
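Both features of Key Concept 6.5, the approximate normality and the correlation between the estimators, can be illustrated by simulation. The Python sketch below (hypothetical coefficients; the two regressors are built with correlation 0.7) draws many samples and records the two slope estimates from each:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 200, 2000
beta = np.array([1.0, 2.0, -1.0])  # hypothetical true coefficients
rho = 0.7                          # correlation between the regressors

b1 = np.empty(reps)
b2 = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal((n, 2))
    x1 = z[:, 0]
    x2 = rho * z[:, 0] + np.sqrt(1.0 - rho**2) * z[:, 1]
    X = np.column_stack([np.ones(n), x1, x2])
    y = X @ beta + rng.standard_normal(n)
    est = np.linalg.solve(X.T @ X, X.T @ y)
    b1[r], b2[r] = est[1], est[2]

# Across samples the estimators center on the true values ...
print(b1.mean(), b2.mean())
# ... and are correlated with each other because the regressors are.
print(np.corrcoef(b1, b2)[0, 1])
```

Because the regressors are positively correlated, the two slope estimators turn out to be negatively correlated across samples: Overstating one coefficient tends to go hand in hand with understating the other.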
the expressions for the joint distribution of tlie OLS estimators are deferred to Chapter 18. the sampling the multivariate distribution is best handled mathematically using matrix algebra.. However.5 OLS estimators fJo. ffi. . this correlation arises from the correlation joint sampling distribution of tbe OLS estimators is discussed in more detail for the case that there are two regressors and homoskedastic errors in Appendix 6.

Examples of Perfect Multicollinearity

We continue the discussion of perfect multicollinearity from Section 6.5 by examining three additional hypothetical regressions. In each, a third regressor is added to the regression of TestScorei on STRi and PctELi in Equation (6.12).

Example #1: Fraction of English learners. Let FracELi be the fraction of English learners in the ith district, which varies between 0 and 1. If FracELi were included as a third regressor in addition to STRi and PctELi, the regressors would be perfectly multicollinear. The reason is that PctEL is the percentage of English learners, so that PctELi = 100 × FracELi for every district. Thus one of the regressors (PctELi) can be written as a perfect linear function of another regressor (FracELi). Because of this perfect multicollinearity, it is impossible to compute the OLS estimates of the regression of TestScorei on STRi, PctELi, and FracELi. At an intuitive level, OLS fails because you are asking, What is the effect of a unit change in the percentage of English learners, holding constant the fraction of English learners? Because the percentage of English learners and the fraction of English learners move together in a perfect linear relationship, this question makes no sense, and OLS cannot answer it.

Example #2: "Not very small" classes. Let NVSi be a binary variable that equals 1 if the student-teacher ratio in the ith district is "not very small"; specifically, NVSi equals 1 if STRi ≥ 12 and equals 0 otherwise. This regression also exhibits perfect multicollinearity, but for a more subtle reason than the regression in the previous example. The smallest value of STR in our data set is 14, as you can see in the scatterplot in Figure 4.2, so there are no districts in our data set with STRi < 12. Thus NVSi = 1 for all observations. Now recall that the linear regression model with an intercept can equivalently be thought of as including a regressor, X0i, that equals 1 for all i, as shown in Equation (6.6). Thus we can write NVSi = 1 × X0i for all the observations in our data set; that is, NVSi can be written as a perfect linear combination of the regressors; specifically, it equals X0i.

This illustrates two important points about perfect multicollinearity. First, when the regression includes an intercept, one of the regressors that can be implicated in perfect multicollinearity is the constant regressor X0i. Second, perfect multicollinearity is a statement about the data set you have on hand. While it is possible to imagine a school district with fewer than 12 students per teacher, there are no such districts in our data set, so we cannot analyze them in our regression.

Example #3: Percentage of English speakers. Let PctESi be the percentage of "English speakers" in the ith district, defined to be the percentage of students who are not English learners. Again the regressors will be perfectly multicollinear.

Like the previous example, the perfect linear relationship among the regressors involves the constant regressor X0i: For every district, PctESi = 100 - PctELi = 100 × X0i - PctELi, where X0i denotes the constant regressor introduced in Equation (6.6). Thus PctESi can be written as a perfect linear combination of the other regressors, and including it produces perfect multicollinearity.

The dummy variable trap. Another possible source of perfect multicollinearity arises when multiple binary, or dummy, variables are used as regressors. For example, suppose you have partitioned the school districts into three categories: rural, suburban, and urban. Each district falls into one (and only one) category. Let these binary variables be Rurali, which equals 1 for a rural district and equals 0 otherwise; Suburbani; and Urbani. If you include all three binary variables in the regression along with a constant, the regressors will be perfectly multicollinear: Because each district belongs to one and only one category, Rurali + Suburbani + Urbani = 1 = X0i for all the observations in the data set, where X0i denotes the constant regressor. Thus, to estimate the regression, you must exclude one of these four variables, either one of the binary indicators or the constant term. By convention, the constant term is retained, in which case one of the binary indicators is excluded. For example, if Rurali were excluded, then the coefficient on Suburbani would be the average difference between test scores in suburban and rural districts, holding constant the other variables in the regression.

In general, if there are G binary variables, if each observation falls into one and only one category, if there is an intercept in the regression, and if all G binary variables are included as regressors, then the regression will fail because of perfect multicollinearity. This situation is called the dummy variable trap. The usual way to avoid the dummy variable trap is to exclude one of the binary variables from the multiple regression, so only G - 1 of the G binary variables are included as regressors. In this case, the coefficients on the included binary variables represent the incremental effect of being in that category, relative to the base case of the omitted category, holding constant the other regressors. Alternatively, all G binary regressors can be included if the intercept is omitted from the regression.

Solutions to perfect multicollinearity. Perfect multicollinearity typically arises when a mistake has been made in specifying the regression. Sometimes the mistake is easy to spot (as in the first example), but sometimes it is not (as in the second example). In one way or another, your software will let you know if you make such a mistake because it cannot compute the OLS estimator if you have perfect multicollinearity.
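The rank deficiency behind these examples is easy to see directly. The following sketch is not from the text; it uses numpy with made-up data (the variable names and parameter values are ours) to show that the design matrix of Example #1, and the design matrix of the dummy variable trap, have fewer linearly independent columns than regressors, which is why OLS has no unique solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120
str_ = rng.uniform(14, 26, n)                 # student-teacher ratio
frac_el = rng.uniform(0.0, 0.5, n)            # fraction of English learners
pct_el = 100 * frac_el                        # Example #1: PctEL = 100 * FracEL

# Intercept, STR, PctEL, FracEL: four columns, but only three are independent
X1 = np.column_stack([np.ones(n), str_, pct_el, frac_el])
print(np.linalg.matrix_rank(X1))              # 3: perfect multicollinearity

# Dummy variable trap: Rural + Suburban + Urban = 1 = the constant regressor
category = np.arange(n) % 3                   # every district in exactly one category
rural, suburban, urban = [(category == j).astype(float) for j in range(3)]
X_trap = np.column_stack([np.ones(n), rural, suburban, urban])
print(np.linalg.matrix_rank(X_trap))          # 3, not 4: OLS has no unique solution

# Conventional fix: keep the intercept and drop one dummy (the base category)
X_ok = np.column_stack([np.ones(n), suburban, urban])
print(np.linalg.matrix_rank(X_ok))            # full column rank restored
```

Dropping one of the collinear columns (here the base-category dummy) restores full column rank, which is exactly the textbook prescription for escaping the dummy variable trap.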

When your software lets you know that you have perfect multicollinearity, it is important that you modify your regression to eliminate it. Some software is unreliable when there is perfect multicollinearity, and at a minimum you will be ceding control over your choice of regressors to your computer if your regressors are perfectly multicollinear.

Imperfect Multicollinearity

Despite its similar name, imperfect multicollinearity is conceptually quite different from perfect multicollinearity. Imperfect multicollinearity means that two or more of the regressors are highly correlated, in the sense that there is a linear function of the regressors that is highly correlated with another regressor. Imperfect multicollinearity does not pose any problems for the theory of the OLS estimators; indeed, a purpose of OLS is to sort out the independent influences of the various regressors when these regressors are potentially correlated.

If the regressors are imperfectly multicollinear, then the coefficients on at least one individual regressor will be imprecisely estimated. For example, consider the regression of TestScore on STR and PctEL. Suppose we were to add a third regressor, the percentage of the district's residents who are first-generation immigrants. First-generation immigrants often speak English as a second language, so the variables PctEL and percentage immigrants will be highly correlated: Districts with many recent immigrants will tend to have many students who are still learning English. Because these two variables are highly correlated, it would be difficult to use these data to estimate the partial effect on test scores of an increase in PctEL, holding constant the percentage of immigrants. In other words, the data set provides little information about what happens to test scores when the percentage of English learners is low but the fraction of immigrants is high, or vice versa. If the least squares assumptions hold, then the OLS estimator of the coefficient on PctEL in this regression will be unbiased; however, it will have a larger variance than if the regressors PctEL and percentage immigrants were uncorrelated.

The effect of imperfect multicollinearity on the variance of the OLS estimators can be seen mathematically by inspecting Equation (6.17) in Appendix 6.2, which gives the variance of β̂1 in a multiple regression with two regressors (X1 and X2) for the special case of a homoskedastic error. In this case, the variance of β̂1 is inversely proportional to 1 - ρ²X1,X2, where ρX1,X2 is the correlation between X1 and X2. The larger the correlation between the two regressors, the closer this term is to zero and the larger is the variance of β̂1. More generally, when multiple regressors are imperfectly multicollinear, the coefficients on one or more of these regressors will be imprecisely estimated; that is, they will have a large sampling variance.
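A small Monte Carlo sketch (ours, not from the text; numpy is assumed, and the sample size, coefficients, and seed are arbitrary) illustrates the variance inflation described above. With correlation ρ = 0.9 between the regressors, Equation (6.17) in Appendix 6.2 predicts the variance of β̂1 to be 1/(1 - 0.9²) ≈ 5.3 times its value under ρ = 0, i.e., a standard deviation about 2.3 times larger:

```python
import numpy as np

def mc_sd_beta1(rho, n=200, reps=2000, seed=42):
    """Monte Carlo standard deviation of the OLS estimate of beta1
    when the two regressors have population correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(reps):
        X12 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        u = rng.normal(size=n)                            # homoskedastic error
        y = 1.0 + 2.0 * X12[:, 0] + 3.0 * X12[:, 1] + u   # true beta1 = 2
        X = np.column_stack([np.ones(n), X12])
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta_hat[1])
    return np.std(estimates)

ratio = mc_sd_beta1(rho=0.9) / mc_sd_beta1(rho=0.0)
print(ratio)   # should be near sqrt(1 / (1 - 0.9**2)), about 2.3
```

Both estimators are unbiased; only the spread of the sampling distribution changes, which is the sense in which imperfect multicollinearity makes the partial effect hard to estimate precisely.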

Perfect multicollinearity is a problem that often signals the presence of a logical error. In contrast, imperfect multicollinearity is not necessarily an error, but rather just a feature of OLS, your data, and the question you are trying to answer. If the variables in your regression are the ones you meant to include, the ones you chose to address the potential for omitted variable bias, then imperfect multicollinearity implies that it will be difficult to estimate precisely one or more of the partial effects using the data at hand.

6.8 Conclusion

Regression with a single regressor is vulnerable to omitted variable bias: If an omitted variable is a determinant of the dependent variable and is correlated with the regressor, then the OLS estimator of the slope coefficient will be biased and will reflect both the effect of the regressor and the effect of the omitted variable. Multiple regression makes it possible to mitigate omitted variable bias by including the omitted variable in the regression. The coefficient on a regressor, X1, in multiple regression is the partial effect of a change in X1, holding constant the other included regressors. In the test score example, including the percentage of English learners as a regressor made it possible to estimate the effect on test scores of a change in the student-teacher ratio, holding constant the percentage of English learners. Doing so reduced by half the estimated effect on test scores of a change in the student-teacher ratio.

The statistical theory of multiple regression builds on the statistical theory of regression with a single regressor. The least squares assumptions for multiple regression are extensions of the three least squares assumptions for regression with a single regressor, plus a fourth assumption ruling out perfect multicollinearity. Because the regression coefficients are estimated using a single sample, the OLS estimators have a joint sampling distribution and therefore have sampling uncertainty. This sampling uncertainty must be quantified as part of an empirical study, and the ways to do so in the multiple regression model are the topic of the next chapter.

Summary

1. Omitted variable bias occurs when an omitted variable (1) is correlated with an included regressor and (2) is a determinant of Y.
2. The multiple regression model is a linear regression model that includes multiple regressors, X1, X2, ..., Xk. Associated with each regressor is a regression coefficient, β1, β2, ..., βk. The coefficient β1 is the expected change in Y

associated with a 1-unit change in X1, holding the other regressors constant. The other regression coefficients have an analogous interpretation.
3. The coefficients in multiple regression can be estimated by OLS. When the four least squares assumptions in Key Concept 6.4 are satisfied, the OLS estimators are unbiased, consistent, and normally distributed in large samples.
4. Perfect multicollinearity, which occurs when one regressor is an exact linear function of the other regressors, usually arises from a mistake in choosing which regressors to include in a multiple regression. Solving perfect multicollinearity requires changing the set of regressors.
5. The standard error of the regression, the R², and the R̄² are measures of fit for the multiple regression model.

Key Terms

omitted variable bias (180), multiple regression model (186), population regression line (186), population regression function (186), intercept (186), slope coefficient of X1i (186), coefficient on X1i (186), slope coefficient of X2i (186), coefficient on X2i (186), holding X2 constant (187), controlling for X2 (187), partial effect (187), population multiple regression model (188), constant regressor (188), constant term (188), homoskedastic (188), heteroskedastic (188), ordinary least squares (OLS) estimators of β0, β1, ..., βk (190), OLS regression line (190), predicted value (190), OLS residual (190), R² (193), adjusted R² (R̄²) (194), perfect multicollinearity (197), dummy variable trap (201), imperfect multicollinearity (202)

Review the Concepts

6.1 A researcher is interested in the effect on test scores of computer usage. Using school district data like that used in this chapter, she regresses district average test scores on the number of computers per student. Will β̂1 be an unbiased estimator of the effect on test scores of increasing the number of computers per student? Why or why not? If you think β̂1 is biased, is it biased up or down? Why?

6.2 A multiple regression includes two regressors: Yi = β0 + β1X1i + β2X2i + ui. What is the expected change in Y if X1 increases by 3 units and X2 is unchanged? What is the expected change in Y if X2 decreases by 5 units and X1 is unchanged? What is the expected change in Y if X1 increases by 3 units and X2 decreases by 5 units?

6.3 Explain why two perfectly multicollinear regressors cannot be included in a linear multiple regression. Give two examples of a pair of perfectly multicollinear regressors.

6.4 Explain why it is difficult to estimate precisely the partial effect of X1, holding X2 constant, if X1 and X2 are highly correlated.

Exercises

The first four exercises refer to the table of estimated regressions on page 206, computed using data for 1998 from the CPS. The data set consists of information on 4000 full-time full-year workers. The highest educational achievement for each worker was either a high school diploma or a bachelor's degree. The workers' ages ranged from 25 to 34 years. The data set also contained information on the region of the country where the person lived, marital status, and number of children. For the purposes of these exercises, let

AHE = average hourly earnings (in 1998 dollars)
College = binary variable (1 if college, 0 if high school)
Female = binary variable (1 if female, 0 if male)
Age = age (in years)
Ntheast = binary variable (1 if Region = Northeast, 0 otherwise)
Midwest = binary variable (1 if Region = Midwest, 0 otherwise)
South = binary variable (1 if Region = South, 0 otherwise)
West = binary variable (1 if Region = West, 0 otherwise)

6.1 Compute R̄² for each of the regressions.

6.2 Using the regression results in column (1):
a. Do workers with college degrees earn more, on average, than workers with only high school degrees? How much more?
b. Do men earn more than women on average? How much more?

6.3 Using the regression results in column (2):
a. Is age an important determinant of earnings? Explain.
b. Sally is a 29-year-old female college graduate. Betsy is a 34-year-old female college graduate. Predict Sally's and Betsy's earnings.

6.4 Using the regression results in column (3):
a. Do there appear to be important regional differences?
b. Why is the regressor West omitted from the regression? What would happen if it were included?
c. Juanita is a 28-year-old female college graduate from the South. Jennifer is a 28-year-old female college graduate from the Midwest. Calculate the expected difference in earnings between Juanita and Jennifer.

Results of Regressions of Average Hourly Earnings on Gender and Education Binary Variables and Other Characteristics Using 1998 Data from the Current Population Survey

Dependent variable: average hourly earnings (AHE).

Regressor            (1)       (2)       (3)
College (X1)         5.46      5.48      5.44
Female (X2)         -2.64     -2.62     -2.62
Age (X3)             --        0.29      0.29
Northeast (X4)       --        --        0.69
Midwest (X5)         --        --        0.60
South (X6)           --        --       -0.27
Intercept           12.69      4.40      3.75

Summary Statistics
SER                  6.27      6.22      6.21
R²                   0.176     0.190     0.194
n                    4000      4000      4000

6.5 Data were collected from a random sample of 220 home sales from a community in 2003. Let Price denote the selling price (in $1000), BDR denote the number of bedrooms, Bath denote the number of bathrooms, Hsize denote the size of the house (in square feet), and Lsize denote the lot size
S. counties. Explain why this regression is likely to suffer from omitted variable bias. Include a discussion of any additional data that need to be collected and the appropriate statistical techniques for analyzing the data. What is the loss in val ue if a homeowner lets his house run down so that its condition becomes "poor"? d." An estimated regression yields Pric-':~ 119. The researcher then plans to conduct a "difference in means" test to determine whether the average salary for women is significantly less than the average salary for men. Compute the R' for the regression.8Poor.156Hsize + 0. which increases the size of the house by 100 square feet.?) 6. Which variables would you add to the regression to control for important omitted variables? b.090Age . SER ~ 41. a. Your critique should explain any problems with tbe proposed research and describe how the research plan might be improved.Exercises 207 (in square feet). To determine potential bias. Use your answer to (a) and the expression for omitted variable bias given in Equation (6.or underestimate the effect of police on the crime rate. iIP _ . or ~l < (3. What is the expected increase in the value of the house? c. What is the expected increase in the value of the house? b. the researcher collects salary and gender information for all of the firm's engineers.5. Suppose that a homeowner converts part of an existing family room in her house into a new bathroom. Age denote the age of the house (in years). (That is. He plans to regress the county's crime rate on the (per capita) size of the county's police force.48.72. and Poor denote a binary variable that is equal to 1 if the condition of the house is reported as "poor.6 A researcher plans to study the causal effect of police on crime using data from a random sample of U. 6. 7?' ~ 0.485BDR + 23. a. Suppose that a homeowner adds a new bathroom to her house. 
(in square feet). Let Age denote the age of the house (in years) and Poor denote a binary variable that is equal to 1 if the condition of the house is reported as "poor." An estimated regression yields

Price-hat = 119.2 + 0.485BDR + 23.4Bath + 0.156Hsize + 0.002Lsize + 0.090Age - 48.8Poor,  R̄² = 0.72, SER = 41.5.

a. Suppose that a homeowner converts part of an existing family room in her house into a new bathroom. What is the expected increase in the value of the house?
b. Suppose that a homeowner adds a new bathroom to her house, which increases the size of the house by 100 square feet. What is the expected increase in the value of the house?
c. What is the loss in value if a homeowner lets his house run down so that its condition becomes "poor"?
d. Compute the R² for the regression.

6.6 A researcher plans to study the causal effect of police on crime using data from a random sample of U.S. counties. He plans to regress the county's crime rate on the (per capita) size of the county's police force.
a. Explain why this regression is likely to suffer from omitted variable bias. Which variables would you add to the regression to control for important omitted variables?
b. Use your answer to (a) and the expression for omitted variable bias given in Equation (6.1) to determine whether the regression will likely over- or underestimate the effect of police on the crime rate. (That is, do you think that β̂1 > β1 or β̂1 < β1?)

6.7 Critique each of the following proposed research plans. Your critique should explain any problems with the proposed research and describe how the research plan might be improved. Include a discussion of any additional data that need to be collected and the appropriate statistical techniques for analyzing the data.
a. A researcher is interested in determining whether a large aerospace firm is guilty of gender bias in setting wages. To determine potential bias, the researcher collects salary and gender information for all of the firm's engineers. The researcher then plans to conduct a "difference in means" test to determine whether the average salary for women is significantly less than the average salary for men.

b. A researcher is interested in determining whether time spent in prison has a permanent effect on a person's wage rate. He collects data on a random sample of people who have been out of prison for at least 15 years. He collects similar data on a random sample of people who have never served time in prison. The data set includes information on each person's current wage, education, age, ethnicity, gender, tenure (time in current job), occupation, and union status, as well as whether the person was ever incarcerated. The researcher plans to estimate the effect of incarceration on wages by regressing wages on an indicator variable for incarceration, including in the regression the other potential determinants of wages (education, tenure, union status, and so on).

6.8 A recent study found that the death rate for people who sleep 6 to 7 hours per night is lower than the death rate for people who sleep 8 or more hours. The 1.1 million observations used for this study came from a random survey of Americans aged 30 to 102. Each survey respondent was tracked for 4 years. The death rate for people sleeping 7 hours was calculated as the ratio of the number of deaths over the span of the study among people sleeping 7 hours to the total number of survey respondents who slept 7 hours. This calculation was then repeated for people sleeping 6 hours, and so on. Based on this summary, would you recommend that Americans who sleep 9 hours per night consider reducing their sleep to 6 or 7 hours if they want to prolong their lives? Why or why not? Explain.

6.9 (Yi, X1i, X2i) satisfy the assumptions in Key Concept 6.4. You are interested in β1, the causal effect of X1 on Y. Suppose that X1 and X2 are uncorrelated. You estimate β1 by regressing Y onto X1 (so that X2 is not included in the regression). Does this estimator suffer from omitted variable bias? Explain.

6.10 (Yi, X1i, X2i) satisfy the assumptions in Key Concept 6.4; in addition, var(ui | X1i, X2i) = 4 and var(X1i) = 6. A random sample of size n = 400 is drawn from the population.
a. Assume that X1 and X2 are uncorrelated. Compute the variance of β̂1. [Hint: Look at Equation (6.17) in Appendix 6.2.]
b. Assume that cor(X1, X2) = 0.5. Compute the variance of β̂1.
c. Comment on the following statements: "When X1 and X2 are correlated, the variance of β̂1 is larger than it would be if X1 and X2 were uncorrelated. Thus, if you are interested in β1, it is best to leave X2 out of the regression if it is correlated with X1."

6.11 (Requires calculus) Consider the regression model

Yi = β1X1i + β2X2i + ui

for i = 1, ..., n. (Notice that there is no constant term in the regression.) Following analysis like that used in Appendix 4.2:
a. Specify the least squares function that is minimized by OLS.
b. Compute the partial derivatives of the objective function with respect to b1 and b2.
c. Suppose that ΣX1iX2i = 0, where Σ denotes the sum over i = 1, ..., n. Show that β̂1 = ΣX1iYi / ΣX1i².
d. Suppose that ΣX1iX2i ≠ 0. Derive an expression for β̂1 as a function of the data (Yi, X1i, X2i), i = 1, ..., n.
e. Suppose that the model includes an intercept: Yi = β0 + β1X1i + β2X2i + ui. Show that the least squares estimators satisfy β̂0 = Ȳ - β̂1X̄1 - β̂2X̄2.
f. As in (e), suppose that the model contains an intercept. Also suppose that Σ(X1i - X̄1)(X2i - X̄2) = 0. Show that β̂1 = Σ(X1i - X̄1)(Yi - Ȳ) / Σ(X1i - X̄1)². How does this compare to the OLS estimator of β1 from the regression that omits X2?

Empirical Exercises

E6.1 Using the data set TeachingRatings described in Empirical Exercise 4.2, carry out the following exercises.
a. Run a regression of CourseEval on Beauty. What is the estimated slope?
b. Run a regression of CourseEval on Beauty, including some additional variables to control for the type of course and professor characteristics. In particular, include as additional regressors Intro, OneCredit, Female, Minority, and NNEnglish. What is the estimated effect of Beauty on CourseEval? Does the regression in (a) suffer from important omitted variable bias?
c. Estimate the coefficient on Beauty for the multiple regression model in (b) using the three-step process in Appendix 6.3 (the Frisch-Waugh theorem). Verify that the three-step process yields the same estimated coefficient for Beauty as that obtained in (b).
d. Professor Smith is a black male with average beauty and is a native English speaker. He teaches a three-credit upper-division course. Predict Professor Smith's course evaluation.

E6.2 Using the data set CollegeDistance described in Empirical Exercise 4.3, carry out the following exercises.
a. Run a regression of years of completed education (ED) on distance to the nearest college (Dist). What is the estimated slope?
b. Run a regression of ED on Dist, but include some additional regressors to control for characteristics of the student, the student's family, and the local labor market. In particular, include as additional regressors Bytest, Female, Black, Hispanic, Incomehi, Ownhome, DadColl, Cue80, and Stwmfg80. What is the estimated effect of Dist on ED?
c. Is the estimated effect of Dist on ED in the regression in (b) substantively different from the regression in (a)? Based on this, does the regression in (a) seem to suffer from important omitted variable bias?
d. Compare the fit of the regressions in (a) and (b) using the regression standard errors, R², and R̄². Why are the R² and R̄² so similar in regression (b)?
e. The value of the coefficient on DadColl is positive. What does this coefficient measure?
f. Explain why Cue80 and Stwmfg80 appear in the regression. Are the signs of their estimated coefficients (+ or -) what you would have believed? Interpret the magnitudes of these coefficients.
g. Bob is a black male. His high school was 20 miles from the nearest college. His base-year composite test score (Bytest) was 58. His family income in 1980 was $26,000, and his family owned a home. His mother attended college, but his father did not. The unemployment rate in his county was 7.5%, and the state average manufacturing hourly wage was $9.75. Predict Bob's years of completed schooling using the regression in (b).

h. Jim has the same characteristics as Bob except that his high school was 40 miles from the nearest college. Predict Jim's years of completed schooling using the regression in (b).

E6.3 Using the data set Growth described in Empirical Exercise 4.4, but excluding the data for Malta, carry out the following exercises.
a. Construct a table that shows the sample mean, standard deviation, and minimum and maximum values for the series Growth, TradeShare, YearsSchool, Oil, Rev_Coups, Assassinations, and RGDP60. Include the appropriate units for all entries.
b. Run a regression of Growth on TradeShare, YearsSchool, Rev_Coups, Assassinations, and RGDP60. What is the value of the coefficient on Rev_Coups? Interpret the value of this coefficient. Is it large or small in a real-world sense?
c. Use the regression to predict the average annual growth rate for a country that has average values for all regressors.
d. Repeat (c), but now assume that the country's value for TradeShare is one standard deviation above the mean.
e. Why is Oil omitted from the regression? What would happen if it were included?

APPENDIX 6.1 Derivation of Equation (6.1)

This appendix presents a derivation of the formula for omitted variable bias in Equation (6.1). Equation (4.30) in Appendix 4.3 states that

β̂1 = β1 + [(1/n) Σ(Xi - X̄)ui] / [(1/n) Σ(Xi - X̄)²],   (6.16)

where Σ denotes the sum over i = 1, ..., n. Under the last two assumptions in Key Concept 4.3, (1/n) Σ(Xi - X̄)² converges in probability to σ²X, and (1/n) Σ(Xi - X̄)ui converges in probability to cov(ui, Xi) = ρXu σu σX. Substitution of these limits into Equation (6.16) yields Equation (6.1).
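The limit in this derivation can be checked by simulation. The sketch below is ours, not from the text; it assumes numpy, and the parameter values are arbitrary. It draws X and u with correlation ρXu = 0.5 and confirms that the OLS slope from the short regression converges to β1 + ρXu(σu/σX), as Equation (6.1) states:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
beta1, rho_Xu, sigma_u, sigma_X = 1.0, 0.5, 2.0, 1.0

# Draw (X, u) with corr(X, u) = rho_Xu, sd(X) = sigma_X, sd(u) = sigma_u
X = rng.normal(0.0, sigma_X, n)
u = rho_Xu * (sigma_u / sigma_X) * X \
    + rng.normal(0.0, sigma_u * np.sqrt(1 - rho_Xu**2), n)
y = beta1 * X + u

# OLS slope of the regression of y on X; u is correlated with X,
# so the first least squares assumption fails and the slope is biased
beta1_hat = np.cov(X, y)[0, 1] / np.var(X, ddof=1)

bias_limit = beta1 + rho_Xu * sigma_u / sigma_X   # Equation (6.1): 1 + 0.5*2 = 2
print(beta1_hat, bias_limit)                      # both close to 2.0
```

With n this large the simulated slope sits essentially on the probability limit, which is one unit above the true β1, exactly the omitted variable bias term ρXu(σu/σX).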

APPENDIX 6.2 Distribution of the OLS Estimators When There Are Two Regressors and Homoskedastic Errors

Although the general formula for the variance of the OLS estimators in multiple regression is complicated, if there are two regressors (k = 2) and the errors are homoskedastic, then the formula simplifies enough to provide some insights into the distribution of the OLS estimators.

Consider the multiple regression model in Equation (6.7) with two regressors, X1i and X2i. Because the errors are homoskedastic, the conditional variance of ui can be written as var(ui | X1i, X2i) = σ²u. When the least squares assumptions hold, in large samples the sampling distribution of β̂1 is N(β1, σ²β̂1), where the variance of this distribution, σ²β̂1, is

σ²β̂1 = (1/n) [1 / (1 - ρ²X1,X2)] (σ²u / σ²X1),   (6.17)

where ρX1,X2 is the population correlation between the two regressors X1 and X2 and σ²X1 is the population variance of X1.

The variance σ²β̂1 of the sampling distribution of β̂1 depends on the squared correlation between the regressors. If X1 and X2 are highly correlated, either positively or negatively, then ρ²X1,X2 is close to 1, so the term 1 - ρ²X1,X2 in the denominator of Equation (6.17) is small and the variance of β̂1 is larger than it would be if ρX1,X2 were close to 0.

Another feature of the joint normal large-sample distribution of the OLS estimators is that β̂1 and β̂2 are in general correlated. When the errors are homoskedastic, the correlation between the OLS estimators β̂1 and β̂2 is the negative of the correlation between the two regressors:

corr(β̂1, β̂2) = -ρX1,X2.   (6.18)
the third regression estimates ing (controlling Exercise 18. . X3. (5. Regress XI 2.17) pertains to the model with k = 2 regressors).Xl denote the residuals from this regression. of XI onto X (recall that 2 Equation (6. . Xk. Regress 3. 2 (6. x 2 is the adjusted R2 from the regression of ~(J'XI' onto X .17.Waugh theorem states that the OLS coefficient in step 3 equals the OLS coefficient sion model (6.. The Frisch~Waugh suggests how Equation (6.27). Regress On X2.X ~ 2 P 2 PX .17) can be derived .27) I ~ 1 is the OLS regression coefficient from the regression of that the homoskedasticity-only of "" Xi· 1-'1 Y onto Xl . controlling For the other X's: Because the first two associated with the other regressions (steps 1 and 2) remove from Y and Xl their variation X'5. where the regressions include a constant term (intercept).17) follows from 2 P 2 -R2 Xl. Yon Yon X't. Xk. h _ . and Jet Y denote the residuals from this regression. This result provides a mathematical cient on XI in the multiple regres- statement of how the multiple regression coeffi- iJ I estimates the effect On Y of Xi. X3.where u~ is the Because 2 "" Xi is the residual from the regression (6.· ".15) implies that Xl 2 1 s~ = (1 x Equation I • Rx' h x 2)sx I' where Rx' SXI I.X I 2 2 an d sX I ~ P (J'x .The Frisch-Waugh Theorem 213 1. and Jet .Equation nuXI variance of f31 is a} = u{~.7). Xi. This theorem Because suggests variance Equation the effect on Y of XL using what is left over after removtheorem is proven in for) the effect of the other X's. from Equation 2 (5. The Frisch..
