CHAPTER

1

Economic Questions and Data

A

sk a half dozen econometricians half dozen different

what econometrics

is and you could get a is the is

answers. One might tell you that econometrics

science of testing economic

theories. A second might tell you that econometrics

the set of tools used for forecasting future values of economic variables, such as a firm's sales, the overall growth of the economy, or stock prices. Another might say

that econometrics is the process of fitting mathematical

economic models to real-

world data. A fourth might tell you that it is the science and art of using historical data to make numerical, and business. In fact, all these answers are right. At a broad level, econometrics and art of using economic is the science or quantitative, policy recommendations in government

theory and statistical techniques to analyze economic

data. Econometric methods are used in many branches of economics, including

finance, labor economics, macroeconomics, microeconomics, marketing, and
economic policy. Econometric sciences, including methods are also commonly used in other social political science and sociology. you to the core set of methods used by econometricians. to answer a variety of specific, quantitative questions

This book introduces We will use these methods

taken from the world of business and government

policy. This chapter poses four of to

those questions and discusses, in general terms, the econometric approach

answering them. The chapter concludes with a survey 01 the main types of data available to econometricians questions. for answering these and other quantitative economic

1.1

Economic Questions We Examine
Many decisions in economics, business, and government hinge on understanding
relationships among variables in the world around us. These decisions require

quantitative answers to quantitative questions.
This book examines several quantitative questions taken from current issues

in economics. Four of these questions concern education policy, racial bias in mortgage lending, cigarette consumption, and macroeconomic forecasting.

2

CHAPTER

1

Economic Questions and Data

. ""-1' Does Reducing Class Size Improve Question -rr' • . ? Elementary School Education.
Proposals for reform

,

a

f tl

ie

US public cducation .,

system generate
.

heateddeb'l
t

.1 cern the youngest Many of the proposa s con " ducarion has vanous h 00 I e Elementary sc c , • skills, but f or man Y parents and educators .,;

,

studcnts.those III elementarysrbool " ' ' objectives. such as developlOgsOc~
. . • .

.' academic 'I eaJ11Jng. . J -cading writing ,-.' . ,OVIIing basic learning is to reduce posa I for IIllP! '-' With fewer students in the classroom.

thc most Important obJectIVe isblll1 ". and baSIC mathematics. One prommentplO' . class sizes at elemeJllaryschoo

" 11

the argument

goes, each student getsmare

of the teacher's attention,
and grades improve. But what, precisely, class size? Reducing contemplating quantitative hiring if the school is already

there are fewer class disruptions, learning isenhan~~,
is the effect on elementary building must

school education ofreducinl hiring more teachersan~

class size costs money: at capacity, marc teachers however. of the likely

It requires weigh

more classrooms.

A decision maker
must haveaprec~e e[fectonba>k

these costs against the benefitt maker Is the beneficial

To weigh costs and benefits, understanding

the decision benefits.

learning of smaller classes large or srna II? Is it possible ally has 110 effect on basic learning? Although learning quantitative of reducing common answer sense and everyday
are fewer students.

that smaller class sizeactu. may suggest thatmort sense cannot provide!

experience common

occurs when there

to the question

of what exactly

is the effect on basic leama

class size. To provide

such an answer. we must examine empirical

evidencethat is, evidence elementary schools. In this book, we examine using data gathered data, students dardizedtests in districts than students

based on data - relating the relationship between

class size to basic lemnillgin

class size and basiclearllin~ in 1999.111 the Califcm better on sIan·

from 420 California in districts

school

districts

with small class sizes produce

tend to perform

with larger classes. While this [actisron, better test scores, itmightsimpll

sistent with the idea that smaller classes

reflect many other advantages that students in districts with small classeshal'e over their counterparts' d' t " . I . . 'h 111 IS nets Wit 1 large classes. For example, districts wl small class sizes tend to have I I',' '" . e wea uue r residems than districts With largecla~f\ so students III small-class districts ld h ' . c cou ave more opportunities for learnmgoulSide the classroom It could b 1 ' , . e tnese extra learlllng opportunities that leadlO higher test Scores not S n II I ' , '. ' I a er c ass srzes. In Pan II, we use multiple regression analySIS to Isolate the effect I' I ' " o c langes 111 class size from changes in olherfactor~ such as the economic background of tl d re stu ents.

1.1

Economic Questions We Examine

3

Question #2: Is There Racial Discrimination in the Market for Home Loans?
Most people buy their homes with tbe belp of a mortgage, a large loan secured by institutions cannot take race into Applicants who the value of the home. By law, U.S. lending are identical gage lending. In contrast Bank of Boston cants are denied these data indicate how large is it? The fact that more black than white applicants Fed data does not by itself provide concluding evidence lenders because their race. Before the black and white applicants that there more closely are denied in the Boston by mortgage market, these in the probeconometric notably of discrimination to this theoretical mortgages, conclusion, researchers at the Federal Reserve are denied. Do lending? If so, found (using data from the early 1990s) that 28% of black appliwhile only 9% of white applicants that, in practice, there is racial bias in mortgage

account when deciding to grant or deny a request for a mortgage: gage applications

in all ways but their race should be equally likely to have their mortapproved. In theory, then, there should be no racial bias in mort-

differ in many ways other than

is bias in the mortgage

data must be examined difference methods obtaining

to see if there is a difference 11 we introduce

ability of being denied for otherwise identical applicants is large or small. To do so, in Chapter that make it possible a mortgage, holding to repay the loan. to quantify
COr/stan,

and, if so, whether this

the effect of race on the chance of characteristics,

other applicant

their ability

Question #3: How Much Do Cigarette Taxes Reduce Smoking?
Cigarette smoking is a major public health concern worldwide. costs to nonsmokers who prefer Many of the costs not to breathe secof smoking, such as the medical expenses ing and the less quantifiable ondhand cigarette costs are borne intervention smoke, are borne of caring for those made sick by smokof society. Because these

by other members

by people other than the smoker, there is a role for government One of the most flexible tools for will go down. will the in the quantity is to increase taxes on cigarettes. says that if cigarette sold decrease? from a 1. % increase prices go up, consumption The percentage change

in reducing cigarette consumption.

cutting consumption Basic economics quantity demanded of cigarettes resulting

But by how much? If the sales price goes up by l %, by what percentage

in price is the price elasticity of demand.

4

CHAPTER 1

Economic Questions and Data

I we wan

ki by a certain amount. ( 'Iasticity to calculate we need to know the price e ( , " sumption But what
f
t

to reduce srno IIlg con

achieve this reduction

III

sa 20%. hy raising taxcs then . . thc pnce mcrcase nccc sary to ", .. IS thc prtcc elastIcIty of demanct
I

for cigarettes? 'I ' theory provides us WIth th concepts uuu hclp us answer Although econorruc , , " d lot tell LISthc numerical value of thc pnce elasticity of this questlOIl, It oes I .' , , ' demand. -r- Iearn the elasticity , we must examine empirical evidence abOlltlhe ,0 behavior of smokers and potential smokers; in other \ ords, we need to analYze data on cigarette consumption and prices,
The data we examine are U.S, states in the 1980s and I990s, In these data,

,

,
pcr~onHIII1COme

cigarette sales, priccs.rnxcs.and

Ior

IUt·S with low tuxe •. an l tbus low cigarette prices, have high smoking rate, and states with high prices have low smoking rates, However, the analysis of these data IS .ornpli atcd because causality runs both ways: Low taxes lead to high dcman I. hut if there arc many Smokers in the state, then local politicians might try t kccp cigarcue In\es low 10 satisfy their smoking constituents, In Chapter 12, we study methods Ior handling this "simultaneous causality" and use those methods to est imme the pnce elasticity of cigarette demand,

Question #4: What Will the Rate of Inflation Be Next Year?
f the future. What will sales be next year at a firm considering investing in new equipment? Will the vtock market go up next month and, if so, by how much? Will city tux receipts next year cover planned expenditures on city services? Will YOUI' microeconomics cxnrn next week focus on externalities or monopolies? Will aturday be a nice day to go to the beach? One aspect of the future in which macr economists lind financial ccon mists are particularly interested is the rate of overall price inflation during the next year. A financial professional might advise a client whether to make a loan 01' to take one out at a given rate of interest, depending on her be 'I guc • of the rate of inflation over the coming year. Economists at ce 1111'<' I banks like the Federal Reserve Board in Washington, D,C.,and the uropean cntral Bank in Frankfurt, Germany, are responsible for keeping the rate of price inflation under comrol. so their decisions about how to set interest rates rely on the outl ok for inflation over the next year If they think til' I" [' " ' t ' e I ate 0 III lallon wtllmcrcase by a per enlage pOln , then they might increase' t ' t ( 111 etes rates by more than th(lt to low down an econ~ omy that, in their view risks a -h' f ' 'ng , ' ve, eatlllg, I they guess \ rong, they nsk caUSI either an unnecessary recess'.. . ' Ion 01 an undeSIrable jump in the rate of mfialtOn,
It seems that people always want a sneak preview

an action is said to cause an outcome if the outcome is the direct result. the first three questions in Section 1. holding other things constant. the mainstay of econometrics. provides a mathematical way to quantify how a change in one variable affects another variable. Quantitative Questions. the conceptual framework for the analysis needs to provide both a numerical answer to the question and a measure of how precise the answer is. Quantitative Answers Each of these four questions requires a numerical answer. An important empirical relationship in macroeconomic data is the "Phillips curve. and econometricians tical techniques to quantify relationships job is to predict the future do this by using economic theory and statisin historical data. questions.2 Causal Effects and Idealized Experiments 5 Professional economists who rely on precise numerical forecasts use econometric models to make those forecasts. Because we use data to answer quantitative our answers always have some uncertainty: A different set of data would produce a different numerical answer. what effect does a change in class size have on test scores. For example.2 Causal Effects and Idealized Experiments Like many questions encountered in econometrics. . the actual value of the number data. or consequence." in which a currently low value of the unemployment forecasts we develop and evaluate in Chapter rate is associated with an increase in the rate of inflation over the next year. The conceptual framework used in this book is the multiple regression model. Therefore.1 concern causal relationships among variables.1. by analyzing consumption ought to go down when the must be learned empirical1y. holding constant or controlling for student characteristics family income) that a school district administrator (such as cannot control? What effect does your race have on your chances of having a mortgage application granted. One of the inflation 14 is based on the Phillips curve. Economic theory provides clues about that answer-cigarette price goes up-but that is. In common usage. This model. A forecaster's using the past. holding constant the income of smokers and potential smokers? The multiple regression model and its extensions provide a framework for answering these questions for quantifying the uncertainty associated with those answers. using data and 1. The data we use to forecast inflation are the rates of inflation and unemployment in the United States. introduced in Part II. holding constant other factors such as your ability to repay the loan? What effect does a 1% increase in the price of cigarettes have on cigarette consumption.

6

CHAPTER1 Economic Questions and Data
HI r Call Ses of ~h: :~'~~S~~~':sty; ~:'Iting air in your lire, cnu,~' rnern 10 mO.lle ru IIin fer_ yo tilizer on your tomato plants cau e Ihom to produ e 01 re rom " I c aUSalily means that a specific action (applying ferliIiJ~r) Icnds 10 ,\ P ~IIJ mllllSurable consequence (more tomatoes).

.

hin a hal stove causes you I g~1burned unn 10

Estimation of Causal Effects
How best might we measure the causal effect on tom.no I Id (01 'I unaJ in kilo_ grams) of applying a certain amount of Icnilizcr, s IUU n 01' 01 I 1IIliLer per square meter? One way 10 measure this causal effect is 10 condu t uo ' penmen]. In that experiment, a horticultural researcher plants 11100 plot-, OIIIlI1l.ln· I .ich prot is tended identically, with one excepti n: orne plol~ g~t IlXIl\nil1l, nt I ·lllIiJ.:Cr per square meier, while the rest gel none. M rcovcr, \ hethcr II ph>! 'rllllllld or not is determined randomly by a computer, ensuring thllt n other dl!fllrcnces between the plots are unrelated to whether they receive r 'rllh/~r \t the en l of the growing season, the horriculturalist wcighs the harv "Itrom eu h pint. The difference between the average yield per square meter uf Ih ' It ',lieu nd urnrcated plots is the effect on rornato production f the fCrliliJ 'r II iuuncnt This is an example of a randomized controlled xperill1e.lt It" .orurouco in the sense that there are both a control grollp that rcccive-, no tr '1101 rm (n fertilizer) and a treatment group that receives the treatment (100 111 ltcrlilizer). It is randomized in the sense that Ihe treatment is assigned randoml . n\l~ rand m assignment eliminates the possibility of a sy temlllic reJtlliOll,hlp b'l\ Cllll. for example, how sunny the plot is and whether it receive fertiliJer '0 thtll the nil' systematic difference between the treatment and COntr I group, "Ih ' treatment. If this experiment is properly implemented on a large enough S IIle. thcn it will yield an estimate of the causal effect on the Ollie me f inlcre>! (11111101 rroduction) of the treatment (applying 100 g/m2 of fertilizer).

"r

In this book, the clIllsal ell'ect is defined to he Ihe cffe'l on un ulcOl11e of a given action or treatment as measured in an ideal randomiled conlrollcd cxperiment. In such an experiment, the only

sy tematic rca on for clift 'rene.:>~ in out-

COmes between the treatment and control groups is the IreOll11Cnllh '11. lt is possible to imagine an ideal randomiJed controllcd e\pcnl1l 'nt I an wer each of the fmt three questions in Section 1.1, Por cxample. 10 ~tud class size ~ne can Imagme randomly assigning "treatments·' of differenl cia, ,i/<: - 10 different groups of students. If the experiment is designed and e\C UI .t1 sO thaI the only systematic difference between the groups of studenl' is their c1a~~ .i./;e. then

1.3

Data: Sources and Types

7

in theory this experiment would estimate the effect on test scores of reducing class size, holding all else constant. The concept of an ideal randomized controlled experiment is useful because it gives a definition of a causal effect. In practice, however, it is not possible to perform ideal experiments. In fact, experiments are rare in econometrics because

often they are unethical, impossible to execute satisfactorily, or prohibitively expensive. The concept of the ideal randomized controlled experiment does, however, provide a theoretical benchmark for an econometric analysis of causal effects using actual data.

Forecasting and Causality
Although the first three questions in Section 1.1 concern causal effects, the fourth-forecasting inflation-does not. You do not need to know a causal relationship to make a good forecast. A good w~y to "forecast" if it is raining is to observe whether pedestrians are using umbrellas, but the act of using an umbrella
does not cause it to rain.

Even though forecasting need not involve causal relationships, economic theory suggests patterns and relationships that might be useful for forecasting. As we see in Chapter 14, multiple regression analysis allows us to quantify historical relationships suggested by economic theory, to check whether those relationships have been stable over time, to make quantitative forecasts about the future, and to assess the accuracy of those forecasts.

1.3 Data: Sources and Types
In econometrics, data come from one of two sources: experiments or nonexperi-

mental observations of the world. This book examines both experimental and nonexperimental data sets.

Experimental Versus Observational

Data

Experimental data come from experiments designed to evaluate a treatment or

policy or to investigate a causal effect. For example, the state of Tennessee financed a large randomized controlled experiment examining class size in the 1980s.In that experiment, which we examine in Chapter 13, thousands of students were randomly assigned to classes of different sizes for several years and were given annual standardized tests.

8

CHAPTER 1

EconomicQuestions and Data The Tennessee class size experiment cost milhon dill' t nu tequired th . . or 'nl ind t uch r e ongoing cooperation of many adrnini unrors, p C er cveral B ause real world experiments with human sub) I r doth uh I adrnin years. ec id I d zed ru II . ister and to control, they have flaw relaii e to I eo r n nu C I'll ed expel'. iments. Moreover, in some circum lance experiments arc n toni 'pen ive and , ' '11(would it be thl ill In 011 I' I' nd rnly difficult to administer but a Iso une thirear, selected teenagers inexpensive cigarettes to ee hall' man th 'Iou ) Because of these financial, practical, and ethical pr blcms, experiment to ec nOmic. are rare. Instead, most economic data are obtained by observin rc l-workl b h vi r. Data obtained by observing actual behavior outside tn pcnrn 'ntal Ctting are called observational data. Observational data ore collc led USll1 'U,vcy, such as a telephone survey of consumer, and administrative re (lrd vu 10", hi t rical records on mortgage applications maintained by lending insutuuun Observational data pose major challenges to cconornern II to estimate causal effects, and the tools of econometrics t tackle th .se hnllcnec In the real world, levels of "treatment" (the amount of fertilizer in th I m to ample, the student-teacher ratio in the class size example) arc n to .." ned It r mdorn, s it is difficult to sort out the effect of the "treatment" fr mother r tcvunr I rs. Much of econometrics, and much of this b k, is devoted to m ihods f r me una the challenges encountered when real-world data are used t cstimare IU'lIl 'ffects. Whether the data are experimental or observati nul. doto 't~ m in three main types: cross-sectional data, time series data, and panel dmu, In uns b k, you will encounter all three types.

,nrl,

Cross- Sectional Data
Data on different entities-workers,consumers, firms,g vcrnmentat unit • and so forth-for a single time period are called cross-secnonnt duta .. r c ample, the data on test scores in California school districts are cross c tiOna I.Th data are for 420 entities (scbool districts) for a single time peri d (1999). In g >neral, the number of entities on which we have observations is denOl d 1/; •. for example, in the California data set, n ; 420. The California test score data set contains measurement ofeveral different variables for eacb district. Some of these data are tabulated in '!bbl, 1.1. ach row lists data for a different district. For example, the overage le.t or' f I' the first dlstnct ("district #1") is 690.8; this is the average o[ the moth and s ien test scores for all fifth graders in that district in 1999 on a standardized te t (the tanford AchIevement Test). The average student-teacher ratio in that distri t i 17.89: that IS, the number of students ill district #1 divided by the number f cia room

1.3 Data: Sources and Types

9

~
Observation (District) Number

Selected Observations on Test Scores and Other Variables for California School Districtsin 1999
District Average Test Score (fifth grade) Student-Teacher Ratio Expenditure per Pupil ($) Percentage of Students Learning English

1 2 3 4 5

690.8 661.2 643.6 647.7 640.8

17.89 21.52 18.70 17.36 18.67

$6385 5099 5502 7102 5236

0.0% 4.6 30.0 0.0 13.9

418 419 420

645.0 672.2 655.8

21.89 20.20 19.04

4403 4776 5993

24.3 3.0 5.0

Note: The California test score data set is described in Appendix 4.1.

teachers in district #1 is 17.89. Average expenditure per pupil in district #1 is $6385. The percentage of students in that district still learning English-that is, the percentage of students for whom English is a second language and who are not yet proficient in English-is 0%. The remaining rows present data for other districts. The order of the rows is arbitrary, and the number of the district, which is called the observation number, is an arbitrarily assigned number that organizes the data. As you can see in the table, all the variables listed vary considerably. With cross-sectional data, we can learn about relationships among variables by studying differences across people, firms, or other economic entities during a single time period.

Time Series Data
Time series data are data for a single entity (person, firm, country) collected at multiple time periods. Our data set on the rates of inflation and unemployment in the United States is an example of a time series data set. The data set contains observations on two variables (the rates of inflation and unemployment) for a

In the se 1959. The number of in a time series data set is den ted . May. uarten rov T.4 1. and June. the rat' of unemployment was 5.S. for example.The daW in each correspond to a different time period (year and quarter). which is den the fourth quarter of 2004 (2004:fV). evolution of variables over time and to forecast fUlure valu of those . In the sec nd quarter C 1 59. lind end ir b 'r . the rate nd quarler 01 was 0.2. the overall price level (as mea ureo b the n urner Pric' Index. and n h~ r au n Ihal in thi: March in thi! led 19 9:11.7% 2.1 2 181 182 183 2004:11 2004:lJl 2004:IV 4. %.6 3. 'Deh 11111 P 'nod data set is a quarter of a year (the fir t quarter is Januar the second quarter is April. Febr u I ic .Ii i . In the third quaJ'ter of 19 Cl'I inflation was 2. if inflation had continued f r 12 months al it rate during the sec' and quarter of 1959. the rate of price inflation In other words.7% per year III an annual rate 01 By tracking a single entity Over time. data set begin in the second quarter periods) of . CPI) would have increased by 0.1 5.quarterl 1959:11 an annual r 0.7%.1% 2 3 4 5 1959:llJ 1959:[V [960:1 1960:11 .and so f rth).1%.1959. single entity (the United States) for 183 time peri d .5 Note: The U. Becau e there Dr' I from 1959:II to 2004:JV. Some observations in this data set are listed in nble 1.I. and the rate of unemployment was 5. . 5.4 ~.1 %.1 % or the labor f r e reported that they die not have a job but were lOoking [or work.3 ~6 ~.inflation and unemployment data set is described in Appendix 14. that is. this data set contains T= 183 bservlllion. time series data can be u ed t iudy the variables. tim. 1959 7004 and Unemployment In the ru e Observation Number Date ~ (PI Inflation Rac (010 per y ar lit u f ) Iyeer.10 PTER EconomicOuestionsand Data 1 CHA tes of Consumer Poe Ind (PII 11111 Selected ObselVatio~s on l~e ~ad States: Quarterly Data.

080 0. we have observations on n ~ 48 continental U. Some data from the cigarette consumption data set are listed in Table 1.states (entities) for T = 11 years (time periods) from 1985 to 1995.362 .240 0.4 117.5 104. 47 48 49 West Virginia Wyoming Alabama 1985 1985 1986 112.089 0.3. In the cigarette data set.360 Note: The cigarette consumption data set is described in Appendix 12.135 0.333 0. States.5 128.8 115. organized .007 1.3 Data: Sources and Types 11 ~ Selected Observations on Cigarette Sales. are data for multiple entities in which each entity is observed at two or more time periods. The number of entities in a panel data set is denoted n. Thus there is a total of n X T = 48 X 11 = 528 observations.8 129.S. Panel Data Panel data.015 1.8 1.1.S.3.2 1. and Taxes.335 528 Wyoming 1995 112.022 1.585 0. by State and Year for U. and the number of time periods is denoted T.5 $1. 1985-1995 Average Price Observation Number State Year Cigarette Sales (packs per capita) per Pack (including taxes) Total Taxes (cigarette excise tax + sales tax) 1 2 3 Alabama Arkansas Arizona 1985 1985 1985 116. and selected variables and observations in that data set are listed in Table 1. The first block of 48 observations lists the data for each state in 1985. Prices.382 0.935 1.2 1.370 0.240 0.334 96 97 Wyoming Alabama 1986 1987 127. Our data on cigarette consumption and prices are an example of a panel data set. also called longitndinal data.086 $0.1.

For example. provides tools for estimating (nonexperimental) causal effects data or data from real-world.12 CHAPTER 1 Economic Questions and Data ~ Cross-Sectional. entities. time series data. equals 128. lime senes data are gathered by observing a single entity at multiple points in lime. each of which is observed at multiple points in time. Time Series.1 • Cross-sectional period. . through 1995. in 1985. time periods. of which 37¢ went to federal. state. or too expensive.1. 3. Many decisions in business and economics require quantitative how a change in one variable affects another variable. and Panel Data 1. impractical. where alphabetically from Alabama to Wyoming. fr m the expeover Panel data can be used to learn about economic relationships riences of the many different entities in the data set and from the evolution time of the variables for each entity. data are gathered by observing multiple entities pomt in lime. data consist of multiple entities observed at a single time • Time series data consist of a single entity observed at multiple • Panel data (also known as longitudinal data) consist of multiple each entity is observed at two or more time periods. Econometrics observatIOnal expenments.5). cigarette sold in Arkansas in 1985 divided by the total population of Arkan in A rkan lists sales in Arkansas were 128. data. and panel data are Summary 1. but performing such experiments in econ cations is usually unethical. was $1. Cross~sectional.015. The next block of 4 observations the data for 1986.5 packs per capita (the total number of packs of cigarettes as in 1985 as in 19 5. nd local taxes. The average price of a pack of cigarettes including tax. The definitions of cross-sectional summarized in Key Concept 1. 2. Conceptually. and panel data are gathered by ob erving multiple entities. controlled experiment. estimates of the way to estimate a causal effect is in an ideal randomized rnic appli- using either imperfect at a single 4. and so forth.

ugge tome impediments to implementing this experiment in practice. an observati nat time series data set for studying this effect. an observati d. an observational eros -sectional data et with which you could study this effect.Review the Concepts 13 Key Terms randomized controlled experiment (6) control group (6) treatment group (6) causal effect (6) experimental data (7) observational data (8) cross-sectional data (8) observation number (9) time series data (9) panel data (11) longitudinal data (11) Review the Concepts 1.1 Design a hyp thetical ideal randomized controlled experiment to study the effect of hour pent studying on performance on microeconomics exams. Design a hypothetical ideal randomized controlled experiment to study the effect on highway traffic deaths of wearing seat belts. You are a ked to study the casual effect of hours spent on employee training (mea ured in hours per worker per week) in a manufacturing plant on the productivity of its workers ( ut] ut per worker per hour). J. b.2 J. Describe: a.3 . and nal panel data et (or tudying this effect. uggc t some impediments to implementing this experiment in practice. c. an ideal randomized controlled experiment to measure this causal effect.

complicated. the sample average has a probability as its sampling distribution distribution. When the I 14 . had you done so. Because you chose the sample at random. If you feel confident with the material. you would have observed ten different earnings and you would have computed different sample average. We assume that you have of probability is stale. This sampling distribution is. mean. Because the average earnings vary from one randomly chosen sample to the next. squared. the sample average is itself a random variable. points (or "observations"). Section 2.3 introduces the basic elements of probability theory for two random variables. and F distributions. For example. suppose you survey ten recent college graduates selected at random.1 reviews probability distributions for a single random variable.CHAPTER 2 Review of Probability taken an introductory course in probability and statistics. you still should skim the chapter and the terms and concepts at the end to make sure you are familiar with the ideas and notation. Most of the interesting problems in economics involve more than one variable. you should refresh it by reading this chapter. and compute the average earnings using these ten data have chosen ten different graduates by pure random chance. and variance of a single random variable. in general. The this theory of probability provides mathematical tools for quantifying and describing randomness. "observe") their earnings.2 covers the mathematical expectation. Section 2. Section 2. which is referred to describes the different because this distribution possible values of the sample average that might have occurred had a different sample been drawn. chi. If your knowledge T his chapter reviews the core ideas of the theory of probability that are needed to understand regression analysisand econometrics.5 discusses random sampling and the sampling distribution of the sample average.4 discusses three special probability distributions that playa central role in statistics and econometrics: the normal. The final two sections of this chapter focus on a specific source of randomness of central importance in econometrics: the randomness that arises record (or you could a by randomly drawing a sample of data from a larger population. and Section 2. and Section 2. Most aspects of the world around us have an element of randomness. Therefore.

In each of these example there i mething n t yet known that is eventually revealed.6. and Random Variables The gender of the next new per n you meet. so it is a random ariable. The number of time your computer crashe while you are writing a term paper is random and take on a numerical value. If the probability of your computer n t era hing while you are v riting a term paper is O%.1.1 RandomVariablesand Probability Distributions 15 sample size is sufficientlylarge. and the outc mes need not be equally likely. a di crete random variable takes on only a discrete set of values. . and the number of times your computer will crash while you are writing a term paper all have an element of chance or randomness. The event "my computer will era h n m re than once" is the et nsisting of two utc mes: "no era he "and "one crash. your grade n an exam.e set of all possible outcomes is called the sample pace. me rand rn variables are di crete and some are continuous. Probabilities and outcomes. TIle Ilrobubllit of an utc me is the proporti n f the time that the outc me ccur in the I ng run.. For example." A random variable is a numerical summary of a rand rn ut orne. the Sample Space. . An event i a subset f the sample space. it might crash twi e. The sample space and events. 2. A their names sugge I. which is discussed in Section 2.an event i a set of ne I'm re outc mes.1 Random Variables and Probability Distributions Probabilities. Only one of these outcomes will actually OCcur (the outc mes arc mutually exclusive). TI. and s n. a result known as the central limittheorem. the sampling distribution of the sample average is approximately normal. your c mputer might never crash. . The mutually exclusive p tential result of a random pr ce s are called the outcomes.. however. whereas a continuous random variable takes on a continuum of po sible value Random variables. like 0. 2.then over the ourse f writing many term papers y u will complete 80% without a crash. that is. it might era h once.2.

OD!II Probability of Your Computer Crashing M Times Outcome (number of crashes) 0 Probability distribution Cumulative probability distribution 1 2 3 4 0. 3%. value. is 90%. is the probability that each while you of the random The probability crashes.03 0. the probability of the event of one or two outc meso That is. tively. of no crashes is 80%. crashes is the sum of the probabilities Pr(M = 1. Cumulative probability distribution.10 0.) is the probability of a single computer crash. The probability of an event can be c mputed of the constituent = from the For example.06 0. For example. The probability distribution of a discrete random variable is the list of all possible values of the variable and the probability value will occur. 6%.99 om 1. The last row of Table 2. let M be tbe number of times your computer crashes are writing a term paper. The probability distribution the list of probabilities denoted Pr(M of each possible outcome: = 0). Pr(M variable that M M is = 0.80 0. An example of a probability distribution for M is given in the second your computer crashes row of Table 2.06 = 0.1 gives the cumulative the random variable the probability Pr(M :s.1.16 CHAPTER 2 Review 01 Probability Probability Distribution of a Discrete Random Variable Probability distribution.80 0. For example. you will quit and write the paper of two. in this distribution.10 + 0. of no crashes (80%) and that the random variable M.00 . These probabilities Probabilities of events. the probability is. probability distribution. which is the sum of the probabilities of one crash (10%). if of no computer = 1.96 0. respecdistrisum to 100%. and the probability bution is plotted in Figure 2. or four crashes by hand. and so forth. three. and 1 %. is the probability The cumulative probability distribution is less than or equal to a particular probability distribution of of at mo t one crash. or M = 2) = Pr(M = 1) + Pr(M = 2) 0.90 0. These probabilities sum to l. This probability four times. According to this distribution.1. the probability of one crash is 10%. 1).16. r 16%.

The 0. and so fonh for the other bars. where = 0 indicates that the person is male and G = I indicates that she i female.1 Random Variables and Probability Distributions 17 cmmIJD Probability Distribution of the Number of Computer Crashes Probability The height of each bar is the probabilitythat the computer crashesthe indicated number of times. A cumulative to as a cumulative dis- An important special case of a discrete random variable is when the rand m variable is binary.The probability distribution in Equation (2.8. the outcomes are 0 or I.The ut ornes of G and their probabilitie thu are J with probability { 0 with probability p 1 .1. The Bernoulli distribution. binary random variable is called a Bernoulli random variable (in honor f the evententhentury wis mathematician and cienti t J cob Bernoulli).so the probabilityof 1 computer crash is 10%. and its probability di tributi n is called the Bernoulli distribution.2. 0.1) is the Bernoulli distribution. let be the gender of the next new per on you meet.0 o 2 3 4 Number of crashes pr bability distribution is also referred Iribution function. For example. or a cumulative distribution. G = (2. that is.5 height of the firstbar is 0.d.7 0.8 0. The height of the second bar is 0.1 ) where pis the probability of the next new person you meet being a \ ornan. . so the probabilityof 0 computer crashes is 80%. a c.1 0.p.6 0.L.

The probability the probability the probability density function. between 15 minutes and 20 minutes. and Variance The Expected Value of a Random Variable Expected value. on a continuum Because a continuous random variable can take of possible values. tbe probability distribution u ed f r discrete m variable. random variable is computed weights are t~e probabilities of that outcome. because it depends dom factors such as the weather and traffic conditions. which is 0.2b plots the probability density function of commuting sponding to the cumulative distribution in Figure 2. Figure 2. or simply a density. That is the value. this probability be seen on the cumulative distribution in Figure 2. the cumulative probability distribution of a continuous rand distribution variable. The expected value of Y i also called the expectation of Yor the mean of Y and is denoted J. The expected value of a random variable value of a discrete Y.2a as the difference between the probability that the commute is less than 20 minutes (78%) and the pr bability that it is less tban 15 minutes (20%).. Probability density function. Equivalently. wbich lists the probability of each possible value of the rand is not suitable for continuous variables.58. the probability continuous random variable.d. Mean. denoted E(Y).f.2a.2a plots a hypothetical cumulative that the commute For example. 2.2 Expected Values.d. 7 . it is natural tion of commuting times. less m variable probability that the random variable is less than or equal to a particular commuting time can take on a continuum of values and. . The expected weighted average of the possible outcomes variable over many repeated of that random variable. For example. Thus the probability density function and the cumulative probability distribution show the same information in different formats. Instead. The cumulative probability for a continuous variable is defined just as it is for a discrete random is. Figure 2. is trials or as a where the the long-run average value of the random occurrences.18 CHAPTER 2 Review of Probability Probability Distribution of a Continuous Random Variable Cumulative probability distribution. a density times correthat the comcan mute takes between 15 and 20 minutes is given by the area under the p. n randistributake. 1111 tudent s to treat it as a than 15 minutes is 20% and the probability that it takes less than 20 minutes is 78%. A probability density function that the random variable fall between is also called a p. The area under the probability those two points. consider a student who drives from home to school. by functi n i summarized density variables. between any two points is the probability function.Ly. or 58%.f.

and Ihe probability that it is less than 20 minutes is 0.2 Cumulative Prcbnblfiry Expected Values..2.58 (58%> and is given by Ihe area under Ihe curve between 15 and 20 mhunes.20 0..8 F--. Probabilities are given by areas under the p.dJ.umulauvc durribuuon fun lion of cormuuriug time Probability denslry IUS Pr (Commuting time 5 15)..d..2 I---"'C-f 0.The probability thai a commuting time is between 15 and 20 minutes is 0.0.06 .i -'---__ --'- '-.58 o.78 <78%).dJJ of commuling times...20 0.-/ 0.. .78 0.D of commuling times. The probability that a commuting time is lessthan 15 minutes is 0.__ -' 20 25 3 omrntlting 35 40 time (minutes) (n) .0 Pr (Commuting time 5 20) = 0.12 Pr (15 < Commuting time 5 20) = 0. Mean.ooLG3l1!i:!~'-_-.4 Pr (Commuting time 5 15) = 0.. 25 .I __ 10 . and Variance Time 19 Distribution and Probability Density Functions of Commuting 1.20 (or 20'lbl.58 0.Q3 Pr (Commuting time > 20) = 0.2a shows the cumulative probability orstribution (or c.6 0.09 0.22 0.0 '---'---'10 15 __ -.2b shows the probability density function (or p..f-----/ 0..:= 30 ornrnucing --L __ --J 15 20 35 40 time (minutes) (b) Prcbabihry densuy fun non of conunuring time Figure 2... Figure 2.

and s forth.l to I I k.1 Suppose the random variahle Y takes on k possible values. (2.03 + 4 X 0.2) That is. 11.35 times while writing a particular num- 0.1. weighted a crash of a given size occurs.1 uses "summation notation. As a second example. suppose you loan a friend $lOO at 10% interest. consider the number of computer eras he probability distribution number of crashes over many term papers. denotes thefirst value. by the frequency M with the with which given in Table 2. i=l (2.20 CHAPTER 2 Review of Probability For example. Thus the amount equals $0 with probability 0.e expected value of M is the average E(M) = 0 X 0. YI. but 1% of the time you would gel nothing." which is reviewed in Exercise 2.1. denotes the second value.0 I = 0.99 and you repaid. would be repaid $110 X 0.90. Of course.y. is p" the probability The expected value of Y.ty. the calculation in Equation (2.3) Lk1=1 y·p·means "th e sum 0 f Yi Pi for l"running from .10 + 2 X 0.35. f Yanc 0 an is IS . 99% of the time you would on average = $108. but there au are repaid is a random variable that equals $110 with probability be paid back $110. the expected crashes while writing a term paper is 0.+Y:2P2+'" where the notation +YkPk= LYiPi. Over many such loans.Yk> where YI that Y takes on Y2is Pz. .80 + 1 X 0..25. denoted E(Y).' . 'If the loan is is a ri k 0. The formula for the expected value of a discrete random variable take on k different values is given as Key Concept 2.Q1 of I% that your friend will default and you will get nothing at all.) Expected Value and the Mean 2.06 number of computer + 3 X 0. and so f rth. is k E(Y)=YIP. you get $110 (the principal of $100 plus interest of $10). it makes no sense to say that the computer crashed term paper! Rather. Thus the expected value of your repayment (or the "mean repayment") is $108.wfhe expected value of Y is also called the mean of Y Or the expectaf denot d Ion e !. the actual number of crashes must always be an integer.01.99 + $0 X 0.90.35. (Key oncept Y that can 2.35. Accordingly.2) means that the average ber of crashes over many such term papers is 0. and that the probability that Y takes on Y.

The variance of a random variable Y. and Variance 21 Expected value of a Bernoulli random variable. The unit of the standard deviation are the same as the unit of Y.and Standard Deviation 11.2. the units of the variance are the unit f the square of Y. which i the quare rtf the variance and is denoted oy111e standard deviation ha the ame units as Y. ~ i 2. the probability that it lakes n the value" I.e expected value of Gis E( G) = I x P + 0 X (1 - p) = p.y)2].2 (2. is the cxpe ted value f the square of the deviation of Y fr rn its mean: var(Y) = E[(Y-p. the square root of the variance. denoted k IT~. which make the variance awkward to interpret. denoted var(Y). the formal mathematical definition of its expectation involves calculus and its definiti n is given in Appendix 17.These definition are urnmarized in Key oncept 2.5) (T~=var(Y)= E[(Y-p." Expected value of a continuous random variable. (2. It is thcref rc c mm n t mea ure the spread by the standard deviation.1. The Standard Deviation and Variance The varian e and tandard deviation measure the dispersion or the "spread" of a probability di tribution. Mean.e tandard deviation of Y is lTy.1):n.2.1 is the mean of a Bernoulli random variable. .y)2J = L(Yi-/Lyfpi' i-\ 11. An important special case of the general formula in Key oncept 2. The expected value of a continu us random variable is also the probability-weighted average of the possible OUIC mes f the random variable. Because a continuous random variable can lake n a continuum of po ible values.4) 11lUSthe expected value of a Bern ulli random variable is p.2 Expected Values. Varian . Because Ihe variance involves the quare of Y.e variance of the di crete rand rn variable Y. Let G be the Bernoulli random variable with the probability distribution in quati n (2.

Y .35: var(M) = (0 . so its variance is var(G)=u8 distribution 11. Thus the standard deviation of a Bernoulli random variable is UG = Mean and Variance of a Linear Function of a Random Variable This section discusses random variables (say.06 + (3 .22 CHAPTER 2 Review Probability of For example.9) The variance of after-tax earnings is the expected value of (Y _ J. her earnings are SO% of the original pre-tax plus $2000.35? X 0.6) of M is the square r 01 of the variance.10 + (2 . variable with earnings earnings. For example. plus $2000.l.SX.J.35)2 X 0. 475.l. variable G with probability (2. 5j2 X 0.(2000 + O.x).0.x)= O. function.e mean of the Bern ulli random in Equation (2.0.35)2 X 0.0.7) p( I .y)2.SX .y= 2000 + O. (2. so uM The standard deviation = VO.6475 '" O.x.1-x and variance (T~.J.80 + (1 . the .Sx.SJ. so are after-tax deviations of her after-tax earnings.p).35)2 X 0.l.0. consider an income taxed at a rate of 20% on his or her earnings X and Y) that are related by a linear tax scheme under which a worker and then given a (tax-fr Yare related I pre-lax earnings is e) grant of $2000. Because Y = 2000 + O. after-lax earning X by the equation Y = 2000 + O. (2.4)].l. 0.S(X .O. Under this tax scheme.S) That is.l. Because pre-tax earnings are random.y = 2000 + 0. (2. What are the mean and standard under this tax? After taxes.01 = O. M is the probability-weighted average of the squared difference between M and Its mean.8J. Thus the expected value of her after-tax earnings is E( Y) = J.03 + (4 . Suppose an individual's pre-lax earnings next year are a random mean J. Thus .1) is Ji-G =P -pl. vananc e of' the number of c mputer crashes .SO.l. Variance of a Bernoulli random variable. ( quation =(0-p)2X (l-p) + (1 _p)2Xp=p(1 (2. after-tax earnings Y is 80% of pre-tax earnings X.

14) .12) and (2.2. (2. (2.9) and (2.64var(X).8(X . skewness. taking the square root of the variance.3c.e expressions in Equations (2.10) are applications of the more general formulas in Equations (2.Mean. two which are symmetric (Figures 2. which measures the lack of symmetry of a distribution.3a and 2. 111isanalysis can be generalized so that Y depends on X with an intercept a (instead of $2000) and a slope b (instead of 0.64E[(X .13) and the standard deviation of Y is O"y = btr x. Then the mean and variance of Yare !Ly = a + b/l-x and (2. which measures how thick.3d appears to deviate more from symmetry than does the distribution in Figure 2. Skewness.so.10) That is.8) so that Y=a+bX. and the kurtosis.3 plots four distributions.2 ExpectedValues. the distribution in Figure 2.1 3) with a = 2000 and b = O•. and kurtosis are all based on what are called the moments of a distribution. The mean. The skewness of a distribution provides a mathematical way to describe how much a distribution deviates from symmetry.3c and 2.3b) and two which are not (Figures 2. It follows that var( Y) = 0. the standard deviation of Y is O"y = 0. variance.and Variance 23 E[( Y . Or "heavy. Visually." are its tails.80" X. This section discus es measures of two other features of a distribution: the skewne s. the standard deviation of the distribution of her after-tax earnings is 80% of the standard deviation of the distribution of pre-tax earnings. Figure 2.11) Other Measures of the Shape of a Distribution The mean and standard deviation measure two important features of a distribution: its center (the mean) and its spread (the standard deviation).!LyJ2] = EI[0.12) (2.11.!Lx )fl = 0.3d).!Lx )2]. The skewness of the distribution of a random variable Y is Skewness (2.

kurtosis = 20 0. The distributions (c with skewness of 0 (a and b) are sym metric.7 0.6 0.8 0.3 0. If a .0 -4 -3 -2 -I a = 5 0.!"y)3] = 0.0 0.2 0.1.9 0.3 0.6. the skewness of a symmetric distributi n is zero.2 0.5 1.0 -4 -3 -2 == -I 0 2 3 4 -4 -3 -2 -I 0 2 3 4 (a) Skewness 0. then positive values of (Y . For a symmetric distribution. for a ymmetric distribution. the distributions with nonzero skewness 3 (b-d) have heavy tails.4 0.1 0.2 0. and d) are not symmetric. The distributions with kurtosis exceeding where Uy is the standard deviation of Y.4 0.3 0.6 0.4 0.24 CHAPTER 2 Review Probability of Kurtosis mmtm Four Distributions with Different Skewness and 05 0.1 0.1 0. Thus.0 0.1 0.0 2 3 4 -4 -3 -2 -I 0 2 3 4 (c) Skewness == -0. E[(Y . a value of Ya given amount above its mean is just as likely as a value of Y the same amount below its mean. kurtosis = 5 All of these distributions have a mean of a and a variance of 1.5 0. If so.5 0. kurtosis == 3 (b) Skewness = 0.3 0.!"y)3 will be offset on average (in expectation) by equally likely negative values.4 0.2 0. kurtosis (d) Skewness = 0.

. the kuno i will be large. O'y (2. Becau e (Y . so the skewness is nonzero for a di rributi quati free. E(Y). The skewness is a function of the first. and Variance 2S then a p sitive value of by (Y . the kurt TIle kurtosis able \ ith kurt variable. has m re mass in it tails than a normal random r. Moments. . so changing the unit doc not change its kurtosis.so a rand m variis ex eeding ution with kurtosis exceeding 3 is called leptokurtic simpl .15) If a distribution tures of has a large amount of mas in its tails.2. ike skewness. If a di tribution skewne s i negative. If a distrihas Bel w each f the four distributions n has a I ng right tail.I long left tail.l-'y)3 generally is not ffset on average by an equally likely negative value. and the kurtosis is a function of the first through fourth moments of Y. Dividing 0) in the denominator of n (2. and the expe ted value of the f Y. in Figure 2. is called the sec nd m merit of Y. of (Y . is also called the first moment quare f f Y. and the skewness is positive. its Kurtosis. the r'h m ment of Y i E( V').l-'y)4 cann t be negative. lind third m ments of Y. m re f a normally distributed rand m variable is . of distri si COnn t e negative. the kurtosis is unit free. E(y1).l-'y)4. b-d are heavy-tailed. theref i fa distributi The kurtosis re. Expected Values. TIle kurtosis of a distribution is a measure of how much mass is in it tails and. the more likely are outliers. n average (in expectati n).14) cancels the units f y3 in the numerator. An extreme value of Y is called an outlier. Below eo h of the four distributi ns in Figure 2. in buti n that i not symmetric. Mean. p sitive values of (Y . That is. In general.2 distribution is not symmetric.3 is its kurto is. Thus f I' a distribution with a large amount of rna s in its tails. . TIle greater the kurto- Kurto is= E[(Y-/y)"l.3 is its skewness.The distribu- ti ns in -rgures 2. changing the units of Y does not change its kewness. f the distribution of Y is extreme values. and the e very large values will lead 10 large values.so the skewness is unit ther words. i a measure of how much of the variance of Y arises fr m n.l-'y)3 are not fully offset by neg- ative value. the expected value Y' i called the r1h moment of the rand m variable Y. then same extreme depar- Y fr rn its mean arc likely. The mean of Y. ccnd.he vy-railed .

commuting time of the student commuter dom variable that equals 1 if the commute equals ° otherwise and let Xbe a binary random variable that equals 0 if it is rainh rt ing and 1 if not. over many commutes.orPr(X=0.ccond). Y=1)=0.30 0. = 0.22 0. mar- Joint and Marginal Joint distribution. Y = 1). Pr(X = 0. Y = 1). that is. Y = 0). Y = 0). weather conditions-whether not it is raining-affect in Section 2.15. (education in th~ tribution of income for women compare to that for mcn? n. According to this distribution.2. An example of a joint distribution of these two variables is given in Table 2. of the concepts of joint. y) combias the function the and nations sum to 1. ¥ = 0). The joint probability is the frequency which each of these four outcomes occurs over many repeated commute.26 CHAPTER 2 Review0.15 0. The probabilities of all possible (x. no rain and long commute (X = J. there are four possible outcomes: it rains and the commute is long (X = 0. say simultaneously x and y.78 l.1. ¥= y).3 Two Random Variables Most of the interesting questions in economics involve two or m re variables. in the first example.00 - Total . Also. considered income together w ~ es the disns concern and . is the probability that the rand take on certain values. Let Y be a binary ranis short (less than 20 minutes) For example.07 0. The joint probability distribution Pr(X=x.70 0. rainy COm- Y=O) = 0.15 0.63 0. Are H college graduates more likely to have a job than nongrad~atcs? the distribution employment of two random variables. Between these two random variables.ese qucsu status and gender Answering such questions requires an understanding ginal. ~ Joint Distribution of Weather Conditions and Commuting Rain <X =: Times Total 0) No Rain LY =: l) Long commute (Y Short commute (Y 0) 1) 0. and conditional probability distributions. Probability 2. rain and commute with (X = 0. (X have rain Pr(X= I. say X and Y. 15% of the day and a long commute muteisI5%.15. and n distribution rain and shan commute (X = 1. Distributions of two discrete rn variables can be written 01' The joint probability distribution random vari- ables. the probability of a long.

Y=y) Pr(X=x) . the marginal probability of rain is 30%.2. conditional on it being rainy (X = 0). as shown in the final I' w of Table 2. that i .50.17) . the c nditional distribution f Y given X = x is ___ Pr(Y-yIX-x)- Pr(X=x. In general. over many commutes it rains 30% of the time.07. This term is used t distinguish the di tribution of Y alone (the marginal distribution) from the joint distribution of Y and another random variable. so the probability of a long e mmute rainy or no I) i 22%. the marginal probability that it will rain i 30%.These four possible outcomes are mutually exclu ive and con titute the sample space so the four probabilities sum to 1. Of thi 30% of commutes.e conditional pr bability that Y takes on the value y when X takes on the value x is written Pre Y = ylX = x). or Pre Y = 01 = 0) = 0. (2.30).16) example. The marginal probability distribution of a rand rn variable Y is ju t another name for its probability distribution. The marginal distribution of commuting times is given in the final c lumn of Table 2. the joint probability f a rainy short c rnmute is 1 % and the j int probability of a rainy I ng commute is 15%. I' example. If X can take on I different values XIo"" XI.ll.2. what is the probability of a long commute (Y = 0) if you know it is raining (X = OJ? rom Table 2. I' Conditional Distributions Conditional distribution. 1-' Y= y). then the marginal probability that Y takes on the value y is I Pr(Y= y) = LPr(X=x. in Table 2.3 Two RandomVariables 27 y = 0) = 0.Thu the probability of a long commute (Y = 0). The marginal distribution of Y can be computed from the joint distribution of X and Y by adding up the pr babilities of all pos ible outcomes for which Y takes n a specified value.2.2. Marginal probability distribution. is 50%. (2.. Similarly. the probability of a long rainy commute is 15% and the pr babitiiy f a I ng commute with no rain is 7%. Y = 1) = 0. quivalently.2. so if it is raining a I ng commute and a hort commute are equally likely.and Pr(X = 1.63. The distribution of a random variable Y conditi naJ on an ther random ariable X taking on a specific value i called the conditional distribution of Y given X.15/0. 50% f the time the commute i long (0.

That is. Conditional 0 At-I M = 2 M = 3 M=4 Tota' 0. In contrast. given the age of the computer.28 CHAPTER Review Probability 2 of Joint and Conditional Distributions of Computer Crashes (Ml and Computer Age (Al -~ ~ A.00 - Distributions of M given A M= 0 M = 1 M=2 M = 3 At = 4 Total 1. because half the c mputers probability of no crashes.07 010 0.3. consider a modification librarian randomly assigns you a computer the age of the computer you use. is the mean of the conditional Y given X.35 0.06 0.03 0.00 0. the conditional is Pr(Y= probability 0.02 0. Joint Distribution M Old computer (A New computer (A Total B. of the rand 111 = n is a random variable.01 1. the newer computers are less likely to crash than the old one.05 0. Suppose you use a computer in the library to type your term paper from those available. distributi f c mputer are or crashes. also called distribution of is the expected value of the conditional mean of Y given X.005 0. but 1% with Conditional expectation. For exam- M = 0 and A = 0 is 0.50 0) 1) 0.50 0.30 0IX= 0) = Pr(X= O)/Pr(X= = 0.05 0.01 0. of no crashes given that you are in Part for assigned a new computer is 90%.00 Pr(MIA -0) Pr(MIA ~1) 0.00 1.50 = 0.00 For example. As a second example.45 0.025 0. Y= of a long commute given that it is rainy 0) = 0.80 0. the conditional puter.70. examand the f which are 0 if it is old). A of the crashing computer ple.02 0. the joint probability old. A = O)/Pr(A = 0) = 0.Q35 010 0. Suppose the joint distribution variables M and A is given in Part A of Table 2. The conditional expectation of Y given X.065 0. Because you are randomly a signed t a c mputer.50. of three crashes is 5% with an old computer B of Table 2. Then the conditional ple.70 0.90 013 0.01 0. is given in Part B of the table. is 70%. (= 1 if the computer is new.01 0. the conditional expectation Y. According to the conditional distributions example. computed f .35/0. the probability a new computer. given that you are u ing an the conditional probability Id com- Pr(M = OIA = 0) = Pr(M = 0.15/0.35.3. half new and half of which are old.

The expected number of computer crashes.70 + 1 X 0. if X takes on the I values XI>' .14.2.3. Stated differently. the expectation of Y is the expectation of the conditional expectation of Y given X. the mean number of crashes is 0. the expected number of computer crashes.56.10 + 3 X 0. The mean of Y is the weighted average of the conditional expectation of Y given X. the mean number of crashes is 0.13 + 2 X 0. The conditional expectation of Y given X = x is just the mean value of Y when X = x. is E(MIA = 1) = 0.)Pr(X=x.20) is computed using the conditional distribution of Y given X and the outer expectation is computed using the marginal distribution of X. .05 + 4 X 0. weighted by the probability distribution of X. so the conditional expectation of Y given that the computer is old is 0. based on the conditional distributions in Table 2. In the example of Table 2. the conditional expectation of Y given that the computer is new is 0. given that the computer is old.56 for old com- puters.). Equation (2. x. among new computers.19) follows from Equations (2.19).20) where the inner expectation on the right-hand side of Equation (2.56. that is. the mean number of crashes M is the weighted average of the conditional expectation of M given that it is old and the conditional expectation . i=l (2.18) and (2.17) (see Exercise 2. (2.If then the conditional mean of Y given X = x is k Y takes on k values Y""" Yk> E(YIX=x) = LYiPr(Y= 1=1 YiIX=x). then I E(Y) = LE(YIX=x.02 = 0.20) is known as the law of iterated expectations.14.18) For example. The law of iterated expectations..19) Equation (2. is E(MIA = 0) = Ox 0. Similarly. the mean height of adults is the weighted average of the mean height of men and the mean height of women. weighted by the proportions of men and women. Stated mathematically. given that the computer is new.3. For example. For example.14.3 Two Random Variables 29 using the conditional distribution of Y given X. E(Y) = E[E(Y IX)]. (2. less than for the old computers.

(2.14 X 0. This is the mean of the marginal disnal mean of Y consequence of tribution of M. expectations random variables.2). This is an immediate Equation (2. let X..56? X 0.99..0.99).-E(YIX=x)]2pr(Y=YdX=x). The expected number of crashes overall.99. E(M). The tandard that =a i thus ###BOT_TEXT###quot;0.02 .O. Then the law of iterated that E(Y) = E[E(Y IX. the conditional vari- k var(YIX=x)= ~[Y.13 A + (2 . conditional distribution ance of Y given X is P denote the number of programs installed on the computer. n I expectations The variance of Y conditional on X i the variance of the of Y given X. then E(M I A. Y.56)2 X 0. Z)J.35.47) than for old (0.3.3.21) For example. .50 =0) xPr(A =0) + (MIA =l)X = 1) + 0. .56)2 X 0.10 + (3 .99 = 0. . as calculated in Equation (2. weighted by the proportion of computers Exercise 2. .56)' X 0.20 provides some additional with multiple variables.WhiIeh IS022 I so lle stani . For example. the mean of Y must be zero.70 conditional distribution of M given given that the + (1 .0. Conditional variance. average of these conditional means is zero.56). For example.20): if E(Y IX) ently.0.56 X 0. aid difFer- Xis zero. I . 47 . where E(Y let I X.56)2 X 0. 0. P) with age A that ha P programs average of f pr grams with age A and number f conditi installed. in the computer is the expected number of crashes for a computer the expected number of crashes for a computer P.0. and the spread of the distributi standard deviation ' n of the is smaller number of crasbes. The conditional variance of M given that A = 1 is the variance of the distribution in tbe second row of Panel B of Table 23 . and Z be random expect n variables that are jointly distributed.30 CHAPTER 2 Review Probability of ofMgiven Pr(A that it is new. Z) is the conditional crash illustrati expectation Y given both X and Z. that is.so E(M) = E(MIA = 0. . Stated mathematically. The law of iterated expectations implies that if the conditi given X is zero.50 = 0.05 + (4 . the expected number of crashes for new computers (0. dard deviation of M for new computers is . then it must be that the probability-weighted also applies to expectations that are contions says of f'Table 2.22 -. Fa r th e con diruona I di1tributions in Table 2. if the mean of Y given The law of iterated ditional on multiple = 0. the conditional variance of the number of crashe computer deviation is old of the is var(MIA = 0) = (0 . is the weighted with that value properties f b th A and P.14) va a IS less than that for old computers (0. as measured by the conditional for new computers (0. then E(Y) = E[ E( Y IX)] = E[O] = O. then the mean of Y is zero.

!LX)( Y . (2.17) gives an alternative expression for independent random variables in terms of their joint distribution. (2. and vice versa).22) Substituting Equation (2. the joint distribution of two independent random variables is tbe product of their marginal distributions. Finally.!Ly) tends to be positive. That is. One measure of the extent to which two random variables move together is their covariance. so the covariance is positive.2.24) To interpret this formula.19).!Ly < 0). and wben X is less than its mean (so that X -!LX < 0). tben the covariance is given by the formula cov(X. (2. X and Yare independently distributed if. In both cases. then Y tends to be less than its mean (so that Y . The covariance is denoted cov(X.3 TwoRandomVariables 31 Independence Two random variables X and Yare independently distributed. If X can take on I values and Y can take on k values. then Pr(X =x. Covariance and Correlation Covariance. or independent.!Lx) X (Y . then the covariance is negative. In contrast. where !Lx is the mean of X and !Ly is the mean of y. If X and Yare independent.!Lx)(Y . if X and Y tend to move in opposite directions (so that X is large when Y is small. Prey = ylX = x) = Prey ~ y) (independence of X and V). i=1 j=l k I Y=y. for aUvalues of x and y. Y) = <Txy = E[(X . Specifically. the product (X .22) into Equation (2.!Ly is positive).23) That is. The covariance between X and Y is the expected value E[(X .!Ly)] = L: L:(Xj-!LX)(Yi-!Ly)Pr(X=xj. Y) or tr Xy.). then Y tends be greater than its mean (so that Y .!LY)). then the covariance is zero (see Exercise 2. Y~ y) ~ Pr(X = x)Pr(Y~ y). if X and Yare independent. . X and Yare independent if the conditional distribution of Y given X equals the marginal distribution of Y. if knowing the value of one of the variables provides no information about the otber. suppose that when X is greater than its mean (so that X -!Lx is positive).

O. Y) cov(X. (2. .X) =0. alternative measure of dependcnce between.27) follows by substituting cov( Y. That is. Y) Vvar(X) var(Y) (fXY = (fXUy' (2. X and Y.!Jox)] = E( YX). If Y and X do not have mean zero.32 CHAPTER Review Probability 2 of the covariance is the product of nnd Y. t . E( YX) = E[ E( Y XI X)] = ElE( YI ) J = because E(YIX)=O.20)]. The Mean and Variance of Sums of Random Variables The mean of the sum of two random variables. the units cancel and the corrclation is unities .2 )..25) arc thc snme as those of the denominator. aw war y. The rand m variables X and Yare said to be uncorrelated if corr(X. "units" problem can rna ke numen 'cal values of the covarian c diffi ult t 1I11crpre!. h . as pr vcn in Appcndix 2. . . deviated from ecause . fY . B that solves t h e " uruits" problem of the covariance. .28) . then Y and X are uncorrelated. first subtract off their means. aid differently.le c deviations: corr(X.1.' k 'dl the unit of X multiplied by. however.. = O. The correlation always is between -1 and I. (2. First suppose that Y and X have mean zero that cov(Y. X) Correlation and conditional mean. . between X an d Y· IS tl covariance between X and Y divided b their slandard . that if X and Yare unc rrelat d. · . is the Sum of their means: E(X+ Y) = E(X) + E(Y) = !Jox+ !Joy. if E( Y I X) = !Joy. X) = 0 into the definition of correlation in Equati n (2. By thc law f iterated expectations [Equation (2. Y) . Equation (2. so cov(Y. then the conditional mean of Y given X does not depend on X. It is not necessarily true. then the preceding pr f applies.26) If the conditi nal mean of Y docs not depend on X. X) = 0 and corr( Y. that is.then cov( Y. X) = E[(Y . Correlation. Y) :5 '[ (correlation inequality}.!Joy)(X . it i possible for the conditional mean of Y to be a function of X but frY and n netheless to be uncorrelated. e units . -1 :5 corr(X.This their means Its units are.25) Because the units of the numerator in Equati n (2. An example is given in Exercise 2.0 . X and y The corre Ia non IS an . th correlation s .27) We now show this result. pccifically.23. . (2. .

0.01 O. right? Does the distribution of One way to answer these questions is to examine the distribution of earnings of full-time workers.02 0-01 'a 0. does the distribution of earnings for men and women differ? For example. ~~ o 10 20 30 40 SO 60 70 80 Dollars (c) Men with a high school diploma continued .05 'a 0.06 0.08 0.08 0. for those with only a high school diploma (a and c) and those whose highest degree is from a four-year college (b and d). and some percentiles of the conditional distributions are Conditional Distribution of Average Hourly Earnings of U.04 • 00.2.3 Two Random Variables 33 Th!!_Distribution of Earnings in the United States in 2008 S ome parents tell their children that they will be higher-paying job if they get paid college-educated women earn as much as the able to get a better.06 0.04 • e.l"-C:---:':.05 0-02 0. o 10 20 30 40 50 60 70 80 Dollars (a) Women with a high school diploma • 'a 0.07 0.06 e. and the mean.07 0.o..~C:-C::-::-=:~. do the best- gender. o 10 20 30 40 50 60 70 80 Dollars (d) Men with a college degree ClO.04 00.S.OO.Q3 0..06 e.0.OO~""'C""-:O:-~~-"""'~-::O::-::'.. Full-Time Workers in 2008.07 0.04 e.00'iL-~~~~:::.03 0.08 0. standard deviation. con- earnings differ between workers who are college graduates and workers who have only a high school ditional on the highest educational degree achieved (high school diploma or bachelors' degree) and on diploma. best-paid college-educated men? a college degree Are these parents than if they skip higher education..Q3 0.0.02 0.01 O. These four conditional distributions are shown in Figure 2.02 0.4.07 0. and. Given Education Level and Gender The four distributions of earnings are for women and men.Q3 0.00iL-""'C""-:. a 10 20 30 40 50 60 70 80 Dollars (b) Women with a college degree 0.01 0. if so.08 0. 0.' .os '2 0.05 • c 0.'C-::o:~c:?~""":'::"""':'.---::. how? Among workers with a similar education.

for both men and women.4c).62 15.77 12. the spread of the distribution of earnings. n rs that the $14.1.45 S17. We of the distribuu n of corning high school degree (Figure 2.4.64 30.4b) is shifted to the right of the distribution for women with only a I sed to individu- als with only Another distribution high sch I diplom feature of rhese distributi of earnings for men is shifted 1 the right of earnings for w men. = high school diploma.amIll9S Full-T'me 0 ers In Per< nrne Standard Mean of U S 50% 25% $9. The distributions were computed from the March 2009 urrent POpUlll1l011 urvey.85 24. In presented in Table 2.50 28.vldcdby the num~r of h u worked annually.48 28.38 12.salaries. which is discussed in more detail in Appendix 11. is greater for those with a college degree than for those with a high school diploma.4. first numeric column). This "genmuny.().to return to this topic in later chapters.This final c mparis sistent with the parental admonition degree pens doors 'hot remain It n is con. the conditional mean of earnings for women whose highest degree is a high school diploma-that degree 90th percentile of earnings is much hi her for workers with a college degree than for workers w uh only is. The distribution of average hourly earnings for female college graduates (Figure 2.19 21.l 39. 4 S23 ll(\ 3~42 32.59 10. and bonusc di. Interestingly.4a). the of the distribution troubling-ospect der gap" in earnings is an importam-c-und. IThe distributions were estimated using dot from the March 2009 Current Population Survey. mean earnings are higher for those with a college degree (Table 2.69 Average hourly earnings are the sum of annual pretax wages. which I de rl~ll Appendix 3. E(EarnillgslHigheSf a high school diplcm u.93 19.34 CHAPTER 2 Reviewof Probability GlI!JD summariews ofrkthe con2doioti80~i~~r~s~:~~~~i'~~fL:::r:~~ ~~~~~. as measured by the standard deviation.21 16.08 $13.50 19. .23 (median) Deviation 7S% 90'

0 (a) Women with high school diploma (b) Women with four-year college degree (c) Men with high school diploma (d) Men with four-year college degree $14.4d and Figure 2.73 per hour.73 23.97 $7.1 For example. Gender = lema/e)-is ihut n ollege . tips. the same shift can be seen for the two groups of men (Figure 2. For both men and women.63 17. In addition.

29) (2. TI.29) through (2.35) follow from the definitions of the mean.e results in Key Concept 2. (2. Equations (2. variance.3 Two Random Variables 35 Means. (2. and V be random variables. Y)I s 1and l(Txyl :5 v'(T}(T9 (correlation inequality). The variance of the sum of X and Y is the sum of their variances plus two times their covariance: var(X+ Y) =var(X) +var(Y) +2cov(X. and covariances involving weighted sums of random variables are collected in Key Concept 2.xt-v. . b. Y) =(T}+(T9+2(Txy. and Covariances of Sums of Random Variables Let X. and covariance: ~ 2. and Jet a.let /. (2.Lx and (T) be the mean and variance of X.35) E(XY) = (TXY + !J.37) Useful expressions for means. Y) = b(Txy + WVY. var(a + bY) = b2(T9.y.3. (2. then the covariance is zero and the variance of their sum is the sum of their variances: var(X + Y) = var(X) + var( Y) = (T) + (T9 (if X and Yare independent). Icorr(X. and c be constants. variances. Y.33) (2.36) If X and Yare independent.30) COV(ll + bX + cV.1.34) (2.3 are derived in Appendix 2. Variances.x + C!J.let a xv be the covariance between X and Y (and so forth for the other variables).2.3 E(a + bX + cy) = a + b!J.

between Jl . and variance u2 i expressed 2 concisely as "N(I-'. Chi-Squared.96u.e probability distributions most often enc untered in econ metri mal.96a II II + \ W><r y 2. have been devel p d for the normal distribution. and F dist ributi ns.4 The Normal.f. and the standard normal cumulative distribution function is denoted by the Greek letter (1). where e is a constant.36 The normal probability density function with mean 1l and variance (J2 is a bell-shaped curve. TI. arc the nor- The Normal Distribution A continuous random variable with a normal distribution has th familiar hellshaped probability density shown in Figure 2.1.2).96u and I' + 1. I). The normal distribution is denoted N(p. Values of the standard normal cumulative distribution function are tabulated in Appendix Table l. we must standardize the variable by first subtracting the mean. Student t. Rand m variables that have a N(O.96" is 0. Student I. The area under the normal p.95. accordingly."The standard normal distribution is the normal distribution with mean I-' ~ 0 and variance u2 ~ J and is denoted N(O.96(J and I-' + J .d. centered at JJ. The normal distribution with mean JJ. then by dividing Some special notation and terminology .5 show. . 1) distribution are of len denoted Z. To look up probabilities for a normal variable with a general mean and variance.. and F Distributions 11.5.1. 1/-1. chi-squared. ( ). As Figure 2.e function defining th normal probability density is given in Appendix 17. Pr(Z :5 e) ~ <pte).. the n rmal density with mean I-' and variance (J2is symmetric around its mean and has 95% of it probability between I-' -1.

it has version of Y is Y minus its mean.691 is taken from Appendix The same approach distributed random variable Table l. For example. suppose Y is distributed of 4.Thus.1) is one (see Exercise 2.6a? the random variable ~(Y . Y is normally distributed with a mean of 1 and a variance What is the probability The standardized normally to ~(Y -1) tion. what is the shaded area in Figure 2. The box "A Bad Dayan normal distribution..39) The normal Table 1. c2 is distributed ing by its standard Let c\ and d2 = (C2 -/")/u. C2) = Pr(Z s.8). C2 CI Y is normally distributed with mean /" and variance N(/". is symmetric. so its skewness is zero.1)/ distributed that Y s.' s. 2. of the cumulative These steps are summarized Street" presents an unusual The normal distribution the normal distribution is 3.. d2) = <I>(~). The kurtosis . the standard normal distribution s.691. Then Y is standardized deviation. (2. 4)-that is. and F Distributions 37 Computing Probabilities Involving Normal Random Variables Suppose ~ 2. Y by subtracting its mean and dividZ = (Y -/")/u. ~) = <1>(0. devia- N(I.38) Cl) = Pr(Z .2. s. 2-that is.4. (2. (Y .4 The Normal.4 u2.~(Y -1) s. Accordingly. in other words. (2. Chi-Squared. (2). 2 is equivalent Pr(Y s. that is. cumulative distribution function <I> is tabulated in Appendix the result by the standard deviation. divided by its standard V4 = ~(Y - 1). with mean zero and variance is.5)= 0.. Student t. the probability that a normally Wall of can be applied to compute in Key Concept application exceeds some value or that it faUs in a certain range. by computing with denote Then two numbers < and let dl = (CI -/")/" and Pr(Y Pr(Y. Now Y s.41) where the value 0. 2) ~ Pr[~ (Y -1) ~l Pr(Z = s. ~(2 -I)-that shown in Figure 2. dJl = 1- <I>(dl). ~. that is.6b.

4) dl51nbut~n (Y . is given in Appendix 18.5 The normal distribution can be generalized to describe the joint distribution of a set of random variables. is given in Appendix 17. or. then aX + bY has the normal distribution: aX The multivariate normal distribution.38 CHAPTER 2 Review Probability of Ci13!ll1JD . 1) dl'tnbutlon 0.0 (b)N(O. 4) Calculatingthe Probability That To calculate Pr( Y :::. the distribution is called the multivariate normal distribution. 2 = r Y-I "T" a Pr(Z -s 0.691. + bY is distributed N( a/Lx + b/Ly.5) ~ <1>(05) ~ 0. y s 2 When Y Is Distributed NO.f.42) (X. a2(T} + b2(T~ + Tabo X1') (2.1 )/2.4) y Pr(Z 505) N(O.0 2. The multivariate normal distribution has four important pr perties.0 (a) N(I. Because the standardized random variable. Y IS standardized by subtracting its mean (J. If X and Y have a bivariate normal distribution with covariance (J KY and if a and b are two constants.d. The probability that Y:::..1. and the formula for the general multivariate normal p.6b. Pr( Y:::. use the standard normal distribution table. In this case. Pr(Y51) N(I. and the corresponding probability after standardizing Y is shown in Figure 2. Y bivariate normal) . is a standard normal (Z) variable.d. I) 0.5).6a. if nly two variabies are being considered.-1)= random 1.2). 2 is shown in Figure 2. ) p( <L=. Pr(Z" 0. the bivariate normal distributionThe formula for the bivariate normal p. From Appendix Table 1.L ::: 1) and dividing by its standard deviation (c-> 2). then ..f.1. standardize Y.

so the drop of 22. October 19. 1980.I December 31. If daily percentage price changes are normally distributed. or more than 22 'landerd devianons: 10 5 a -5 -10 -15 October 9. Percent change Ihe average percentage daily change of "the Dow' index was 0.6%." the Dow Jones Industrial Average (an average of 30 large industrial stocks) fell hv 22. Chi-Squared. a plot of the daily returns on the Dow during the 1980s.05% and its standard deviation was 1.5 X 10-89. The enormity of this drop can be seen in Figure 2.6%! Fr m January I.stock market can rise or fall on Monday.. where there are a total of 88 zeros! Daily Percentage Changes in the Dow Jones Industrial Average in the 1980s Duringthe 19805.. This is a 101-but nothing compared to what happened 1987.On "Black Monday. You will not find this value in Appendix Table I. then the probability of a change of at least 20 standard deviations is Pr(IZI c: 20) = by 1% or even more. and F Distributions 39 A Bad Day on Wall Street O n a typical day the overall value of stocks traded on the U.J3) pric changes on the Dow was 1. the standard deviation of daily percentage return of 20(= 22. . that is.7.2.4 The Normal.6% was a negative 2 x <1>( -20). 2009.16%.6/J.1987 1 -20 -25 L_-'-_--'-:---:':-:-------:'::-:-_-!=_-:-::-:--:-:::::--:::::----:-::::--:: 1980 1981 1982 1983 1984 t985 1986 1987 1988 1989 1990 Year continued . This probability is 5.000 . 19a7-'Black Monday" -the indexrell25. 0.l3%. standard deviations. n O October 19. but you can calculate it using a computer (try it!). 00055. Student t.

1987 October 13.2008 December 01.0 -7.1 -8.9 -6. there have been many large to be consistent with a normal distribution with a constant variance.1 10.IXI012 6.an exiremel prices are normally distributed lcarly.0 X 10 to - f . or 1.! and very good-days we actually sen 011 Ir ret.01'9 = (x . 1987 October 26.200'1. varying variances ore discussed In models abandon more than 5. Thl 1.4 X 10-11 2. tlongwilh hnng eX eed 6. 8/" k ·.9 10. All len standard deviati ns.5 X 10-89.st kpriceper cntage r re vent hnn Ir \lock have existed for 14bil- lion years. F r this reason.5 x 1O-89? Consider the following: • The world population is about 7 billion. In fact.4 chonge u 109 the me n ond vori- the srandardized ance over this period.7 Change z Nom. Daily Percentage Changes in the Dow Jones Industnal and the Normal Probability of a Change at Least as Larg Standardized x Percentage Change (x) 22.2 -7.5X 10 12 2.. Til... October 21.4 X 10• The universe is believed 10 10 Industrial January Average in the 7571 tr dlnp tit ~ bell"een 1. 1987 October 15.40 CHAPTER 2 Review of Probability How small is 5.1 9. or about 5 X 1017 seconds.2008 OClober 28. so the probability of choosing a particular second at random from all the seconds since the beginning of time is 2 x 10-18 • There are approximately 1043 molecules of gas in the first kilometer above the earth's surface.19 O. Date 1980-2009.3 X 10-11 5.9 X 10-11 -- 7. 2008 October 09. so period.6 11.0 7. X 10 I~ - 7.( tober wilh 11Ine16).5 2<1>1 10 !<'I z) October 19.finnn e pr()(c~'lonnls u models of stock price hnnge n stock price change as normally dhtnbuled ancc that evolves over lime.8 9.. 2008 October 27.- -604 -604 lAX 10-1• 2.9 7. Other In in the f II of 21Xl/\ hove (mlldc" hnpter higher volatiliry than Iher.. 1987 and the Iinanci I eris: u h m I\J l trcats with like. 2008 .so the probability of winning a random lottery among allliving people is about one in 7 billion.2001 -20. These / models are more consistent with the very but. 0ll!JD The Ten Largest ~ Ind .1997 September 17.1 .5 lists the ten largest daily percentage price changes in the Dow Jones the norrnal di<tribulion fnvor r distributions with heavier toils.111.2 7.5 I. I Probob Illy of • Chang a' L.JJJ/fT p~lz '" z) 5. The probability of choosing one at random is 10-43• Although Wall Street did have a bad day.0 -6. Table 2.6 9. the fact that it happened at all suggests its probability days-good and bad-with stock price changes was 100 s hav a disother vllri- tributi n with heavierwilli th n rhe normal lil -tnbuti n.3 7..andDe ember 1.on ideo papulnnlcd in Nassim Tnleb's 2007 book.1 X 10-11 6.

Selected percentiles of the X. Second.2. Then Zt + Z~ + Z~ has a chi-squared distribution with 3 degrees of freedom. ern = O. TIle chi-squared distribution is the distribution of the sum of m squared independent standard normal random variables. The Chi-Squared Distribution The chi-squared distribution is used when testing certain types of hypotheses in statistic and econometrics. then the marginal distribution of each of the variables is normal [this follows from Equation (2. If X and Yare jointly normally distributed.Joint normality innplies linearity of conditional expectations.81) = 0. The Student t Distribution TIle Student I distribution with m degrees of freedom is defined to be the distribution of the ratio of a standard normal random variable. let Z be a standard normal random variable.11). and 2:l be independent standard normal random variables.95. distribution are given in Appendix Table 3. Student I. if a set of variables has a multivariate normal distribution. then.. This distribution depends on m..let W be a random variable with a chi-squared distribution with m degrees of freedom. Chi-Squared.. if variables with a multivariate normal distribution have covariances that equal zero. let ZI> Zz. if X and Y have a bivariate normal distribution and a xy = 0. For example.if n random variableshave a multivariate normal distribution. then the variables are independent. For example. . where a and b are constants (Exercise 17.. Appendix Table 3 shows that the 95th percentile of the X~ distribution is 7. which is called the degrees of freedom of the chi-squared distribution. that is. 7. regardless of their joint distribution. if X and Y have a bivariate normal distribution. but linearity of conditional expectations does not imply joint normality. In Section 2.81. The name for this distribution derives from the Greek letter used to denote it: A chi-squared distribution with m degrees of freedom is denoted X. and FDistributions 41 More generally. then the converse is also true.. then the conditional expectation of Y given X is linear in X.3 it was stated that if X and Yare independent. so Pr(Zt + Z~ + Z~ :s.42) by setting a = 1 and b = OJ. then X and Yare independent. divided by the square root of an independently distributed chi-squared random variable with m degrees of freedom divided by m. Third. That is. Fourth. This result-that zero covariance implies independence-is a special property of the multivariate normal distribution that is not true in general.E( YI X = x) = a + bx. then any linear combination of these variables (such as their sum) is normally distributed.4 The Normal. Thus.

. divided by 111: Wlm is distributed F. where Wand V are independently distributed. In thi limiting case. To state this rnathcmati ally. x3 .00 distribution is 2. divided by the degrees of freedom. to an independently distributed chi-squared rand III variable with degrees of freedom 1/. the 95110 percentile of the 1'3. denoted 0"JI' is defined 10 be the distribution of tbe ratio of a chi-s [uare I random variable with d grccs of freedom m.. 95'h. which is the same as the 95110 pel' entile of the distribution. elected percentiles f the Student I distribution are given in Appendix Table 2.. distribution depends on the degrees of [Ieedo. In statistics and econometrics.60. The Student I distribution depends on the degrees f freed m IIl. the tudent I distribution is well approximated by the standard normal distribution and the I distributi n equals the standard normal distribution. distribution tends to the 1'. Then the random variable and let Z an d W b e 111 • •• • Z/VW/m has a Student I distribution (also called the I dISlrobllhO~) with //I degrees of freedom.Then has an distribution-thai is.Thus the 95'h percentile of the I". 7.". and that mean is 1 because the mean f a squared standard nOImal random variable is 1 (see Exercise 2.24)..81/3 = 2.60.. JI 95 .00 limit of 2. The 90110..m "~.." distribution can be approximated by the 0". but when m is small (20 or less). For example.. an F distrihuti n with numerator degrees of freedom 1/1 and dcnominat rdcgrce of freedom II. let W be a chi-squared random variable with 111 degrees r freed m and let be a chisquared random variable with 1/ degrees of freed m.. and the 951h percentile of the F.42 CHAPTER 2 Review of Probability 'dependently distributed. distribution. Thus the F. r.l1lC tudcnr I distribution has a bell shape similar to that of the normal distriburi n.30 distribution is 2...81 (from Appendix Table 2). (rom AppendixTable 4..". distribution arc given in Appendix Table 5 for selected values of I'll and n.. divided by m.7 J. distribuii n is the distribution of a chi-squared random variable with 11/ degree of fr cdom.oo. As the denominator degrees of freedom n increases.. This distribution is denoted I". The F Distribution The F distributiou with 1'1/ and 1/ degrees f freedom.92. the 95'10 percentile of the 1'.the denominator random variable V is the mean of infinitely many chi-squared randam variables. which is 3 (7. it has more mass in the tails-that i il is a "fauer" bell shape than the normal. When 111 is 30 or more. 90 distributi n is 2.. and 991h percentiles of the F. o/f. an important pccial asc f the F di trihuti n arises when the denominator degrees of freedom is large enough that the f.. the 1h percentile (the F. divided by II.60). For example.

kn \ ing the value of the commuting time n ne of the e rnnd 1111 ele ted day provides n inf rrnati n about the commuting lime nan th r f the do . l'2 is the sec nd b ervati n. where >'I is the first b servatlon. . and her daily mmutin lime ha the umulati e di tribution function in igure 2.5 Random Sampling and the Distribution of the Sample Average 1m l all the statisti al and econometric procedures used in this book involve aver ges or weighted average of a sample of data.2a.. . \ hich is called it sampling distribution.1h of her rand ml elected day Be u e the membe f the population included in the sample are elected at rand m. .5 RandomSamplingand the Distributionof the SampleAverage 43 2.rand mly drawing a sample fr m larger popul ti n-ha the effect of making the sample average itself a rand m variable. Random Sampling uppo e our c mmuting student fr m ecti n 2. is the mmuting time 0 the . Because these d \ ere sete ted at rand m. Be use the ample average is a rand m variable.2. If Simpl random sampling. and forth. TIli se tion introduces some basic concepts bout random sampling and the di: tributi n of average that arc used through ut the bo k.called simple random sllmpling. The situ ti n de ribed in the previ u paragraph is an example of the simple t ampling heme u ed in stati nics.. the v rlue f the commuting time n each f the different day are independently di .. that is.because the days were elected at rand m. it has a probabllit distributi n.the value of the b er ati n >'I. Iri uted r nd m variables. This ecti n concludes with m pr pertie f the sampling di tributi n of the sample average. >'I i the commuting time n the first of her /I rand mly elected da sand Y. haracterizing the distributi n f sample aver ges theref re is an es ential step toward understanding the perf rman e of e n mel ric pr edures. Y". In the commuting example. . We begin by discussing random sampling. The /I b erv ti n in the ample are den ted >'I •.1 aspire l be a slati ti ian nd decides t rec rd her commuting time n various do he el ct the e da at rand m from the ch I year. y" are themselve random. The a t f random sampling-that i . in which /I bje tare el ted at rand m fr m a population (the p pulati n f cornmutan do and e h memb r f the p pulatlon (each day) is equally likely t be in luded in the ample..

....Y. Thus the act of random sampling means that ables. . Before they are sampled._1 Y - Y. When I the marginal distribution tribution is the distribution distributed. this marginal of Yin Ihe population /1.. . .... + 2>. of the n observations Yl" . .i. oncept 2. . Random Variables In a simple random sample.) = n 1 nO.. values. + Y..i..).i. is the same for each i = 1. of Yz..). they are said to be indepeudently and identically distributed Simple random sampling and i.. Y" are said to be identically Under simple random sampling.44 CHAPTER 2 Review of Probability 4mi.d.11.. so the conditional distribution marginal distribution When of distributed independently the value of of l'1 provides no infor- Yz given l'1 is the same as the 12..43) An essential concept is th~ the act of drawing a random sample has the effect of making tbe sample average Ya random variable.5.. Y. 1 II Y. are independently and so fortb..... is distributed 12. r. knowing mation about )1. then ~.. can take on many possible for each observation. . draws. in Key distributed... nand r..d. + 1'. . under simple random sampling.Y" are drawn from the same distribution and are independently (or i.d.d. . the distriindependcntly bution of Y. that is. their values of Y will differ. Y" different members of the population are chosen. . Yi is r. draws are summarized The Sampling Distribution of the Sample Average The sample average or sample mean.. n objects are drawn at random from a popl~lation each object is equally likely to be drawn.. .Y.. ..m3m Simple Random Sampling and i. can be treated as rand m vari- l'1.' i=l n (2. of Y.'. Because each object is equally to be drawn and the distribution of and 2..i. . and identically (i. Because r. is the same for all i = 1. is ..5 Y for likcly Y. di being sampled. is the same for all i.. a specific value is recorded i. . .Y" are randomly drawn from the same population.d... . ... The value of the random variable the ph randomly drawn object is denoted Y.i.. . the rand distributed m variables of Yo11 .. .. Y. has the same marginal distribution for i = 1. Y. In other words. after they are sampled. Because the sample was drawn .

so cov( y" 1j) = O. .. For example. = 1. average is random.44) n = 2.Ly = 2f. Had she chosen five different days.d..var(Y) = 1<T$· For general n. the mean and variance is the same for ail .. because r.and Y. are random. .".31) with a = b ~ 1 and cov(r. n).Ly = f.37). Mean and variance ofY.28):£(r. . Thus the mean of the sample average is E[1( l'J + 11)] = x 2f. + 11 is given by applying Equation (2.11". it has a probability distribution.uy and <T~denote tbe mean and variance of Y. Random Sampling and the Distribution of the Sample Average 45 thei.(because the observations are i.1j) i=l1'=l. for + 11) = 2<T~. When n = 2. <Ty/Vii. In general..Thus. .i.. = var (1" Ii var(Y) 2: Y. 11) = OJ. .5 at random.r the value of each }j is random.) i=l ~ f. .. Suppose that tbe observations r.Ly.2.. ..d. ... . Y is found by applying Equation (2. The variance of are independently distributed for i '" j..45) =n' The standard deviation of <T$ Y is the square root of the variance. 1 _ £(Y) 1 II = Ii~ E(Y. the mean of tbe sum r. she would have recorded five different times-and thus would have computed a different value of the sample average. . The distribution of Y is called the sampling distribution of Y because it is the probability distribution associated with possible values ofY that could be computed for different possible samples r..i.uy +f.. Y. + 11) = . We start our discussion of the sampling distribution of Y by computing its mean and variance under general conditions on the population distribution of Y.11" are i. The sampling distribution of averages and weighted averages plays a central role in statistics and econometrics.Ly. Had a different sample been drawn. Because l'l.d.) 1 +2' n 1""'[ 1 =2' n 2:var(Y. For example..) i=l n 2: 2: cov(Y. then the observations and their sample average would have been different: The value of Y differs from one randomly drawn sample to the next. ... j#-i n n (2. and let. Because Vis random.Ly..Y" are i. so [by applying Equation (2.i. var(r. Suppose our student commuter selected five days at random to record her commute times. then computed the average of those five times. (2.Y.

Suppose you divide $1 equally among n assets. so portfolio after 1 year is (1'[ + Y.45). + Y. the distribution of Y. that is. and. in which the fund holds many Slacks and an individual share of the fund.46) and (Ty - 2 up = . the same variance.)/n = 1'. Sampling distribution of Y when Y is normally distributed . does not need to take on a specific form.dev( Y) =(Ty =-. the actual payoff of your products such as stock mutual funds.48) to hold.46 CHAPTER Reviewof Probability 2 Financial Diversification and Portfolios T he principle of diversification says that you can reduce your risk by holding small investments in EY = J. thereby owning a small amount of many stocks. But diversification has its limits: For many assets. y. " are i. In the case of stocks. but that portfolio remains subject to the unpredictable fluctuations of the overall stock market.46) through (2. The math of diversification follows from assets has the same expected pay ut. + . Then the expected payout is In summary. that is the variance of the population distribution from which the observation is '~rawn. the sum of n normally distributed random variables i itself . Vii These results hold whatever the distribution of 1'1 is. That is. payouts are positively correlated. but In (T diversiFying reduces the variance fr The math of diversification 2 to pu2. draws from the N(IJ-y.42). the variance. ' The not~ion (T~ denotes the variance of the sampling distribution of the sample average Y.' imilarly. compared to putting all your money into one asset. (T9 is the variance of each individual y.d.i. Putting all your money into one asset or spreading it equally acrossall 11. S uppose t h at Y.47) (2. and tbe same positive correlation p across assets [so that cov( y" lj) ~ pu'].. As stated following quation (2.lhe variance of the portfolio payout is varCY) = per' (Exercise 2. multiple assets.26). Yaref') remains positive even if II is large. for largen. In contrast.. the mean. such a the normal di tribution for Equations (2. fLy.you shouldn't put all your eggs in one basket. suppose that each asset has the same expected payout.48) std. . 2 To keep things simple. ~""'. a .n' (2. has led to financial owns a Equation (2. (Tf denotes the standard deviation of the sampling distribution of Y.. var( Y) = (TY - (2. risk is reduced by holding a portfolio. (T9) distribution. and the standard deviation of Yare E(Y) = IJ-y.'Y. Because you invested lin dollars in each asset. Let }j represents the payout in 1 year of $1 invested in the i1h asset.

There are two approaches to characterizing samphng di tribuli ns: an "exa t" approach and an "approximate" appr ach.ptotie distribution-"a ymptotic" because the appr im Ii n bc m exa t in Ihe limit Ihalll ./Ly)/u). if l'I. o}). . t ampling distributions are complicatcd and depend n the di tobuli n f . then (a dis ussed in c lion 2. what the sampling di rribut! n f is... then in gcueral rhe exa I mpling distribution f l' is very complicated and dcp nd n the dlstributi n f . (ry/n). When til ~nmple il i large. The cenlral limit theorem ays Ihat.lhe a mpl ti distributions are imple. l} wil) bc I to /Ly with e high probability. when the sample size is large.r Ih u and these nsympt Ii di tributi n can be counled n to pr vide vc good appr ximati n t the exact sampling di tribution. 2.. Thi n rmal appr ximate distribution provide enormous impliIi lJ n and underlie the theory of regre i n u ed Ibr ughout this book. .d. The sampling distribution that exactly des ribe: the distributi n f f r ny 1/ is called the e net distribution or finitevnmple dhlribution f .d. nfortunately. this mean th~t. then l' is distributcd (/Ly.y)/u does not depend on Ihe di lribuli n f Y.Y" are i. r example.Y" are i. if the distribution of Yin t n rmal.i.As we see in thi secti n. i pproximalely normal. in a rnathernatica! en e. .5) the exa t di tribution of l' i normal v Ith mean /Ly and vori n e U~/II. draws from the N(/Ly. lthougb cx. Be ause the mean of l' is /Ly and the variance of l' is uNn.6 Large-Sample Approximations to Sampling Distributions ampling dislributi n playa central r le in the development of tatistical and c on metrt pr edure so it is imp rtaru to know. The" ppr lmme" ippr a h u es approximati n to the ampling distributi n thot rcl n the mple ize cing Iarge. Moreover-remarkablyIhc mpl Ii normal di lributi n f ( -p.2. Thi e Ii n pres '0\ tbe t\ 0 ke I I u ed I appr imalc sampling di tribUll n ben Ihe nmple ize i I rge: the law of large number and the central Itmit Ih rem.The large-s mple approximation to the amplmg di lri uti n i fl n allcd the OS)'Il. the ampling di tributi n of lhe londardized sample avcrage..i.. ( . . if Y is n rmally distributed and l'I. Ih "e ppr imali n n b very a curale even if the ample ize is onl 1/ = 0 01> erv Ii n Bccnu c snmpl ilC u ed in pro tice in ec n melri typically number in th hundr d.6 Large-Sample Approximationso SamplingDistributions t 47 normall distributed. Thc "e a I" appr ach entails deriving a f rmula for the sampling distributi n rhat h Ids exactly f r any value f n.. TIle la\ f large numbers ay that.

48

CHAPTER 2

Review Probability of

Convergence
~

in Probability, Consistency,

and the

law

of Large Numbers

2.6

11,e sample average Y converges in p~obability to /ky (or, equivalently, Y i COn_ sistent for /kY) if the probability that Y is in the range /ky - c to /ky + C beeoll1,= arbitrarily close to 1 as n. increases for any constant c > 0.111e convergence of y to in probability is written,}I ...L... /ky, . The law of large numbers says that if Y;, i = 1, ... ,11 arc indc] endcntly and identically distributed with E( Y,) = /ky and if large outliers arc unlikely (tcehniI"Y =

cally ifvar(Y;)

ol

< (0), then Y -'-> t-v-

-

p

The Law of Large Numbers and Consistency
1110law of large numbers states that, under general conditions, Y will be ncar My with very high probability when /I is large. Thi is sometimes called the "law f averages."When a large number of random variables with the same mean arc averaged together, the large values balance the small values and their sample average is close to their COlll1110n mean. For example, consider a simplified version of our student ommuter's experiment in which she simply records whether her commute was short (Ie s than 20 minutes) or long. Let Y; equal I if her commute was short on the i1h rand mly selected day and equal 0 if it was long. Because she used simple random saml lin , l'\, ... , Y" are i.i.d. Thus Y'j, i = 1, ... II are i.i.d. draws of a Bernoulli rand 11'\ variable, where (from Table 2.2) the probability that Y; = .l is 0.78. Because the expectation of a Bernoulli random variable is its success probability, E( Y,) = /ky = 0.78.111e sample average }lis the fraction of days in her sample in which her commute was short. Figure 2.8 shows the sampling distribution of Y for vari u sample sizes ,.1. When n = 2 (Figure 2.8a), Y can take on only three values:O, ~,and 1 (neither commute was short, one was short, and both were short), none of which is particularly close to the true proportion in the population, 0.78. As 11 increases, however (Figures 2.8b-d), Y takes on more values and the sampling distribution bec mes tightly
I

centered on

My.

11,e property that Y is near I"Y with increasing probability as 11 increases is called convergence in probability or, more coocisely, consistency (see Key oncept 2.6). The law of large numbers states that, under certain conditions, Y converges in probability to /ky or, equivalently, that Y is consistent for MY,

r

2.6

Large-Sample Approximations to Sampling Distributions

49

Sampling Distribution of the Sample Average of n Bernoulli Random Variables

Probability 0.7 0.6

Probability 0.5 0.4

0.5
0.4

u = 0.78
0.3

u = 0.78

0.3 . 0.2 0.1 0.0 0.0 (a)

0.2

0.1

0.25

0.50 0.75 1.00 Value of sample average
(b)

0.25

,,= 2

0.50 0.75 1.00 Value of sample average

,,= 5

Probability 0.25 0.20

Probability 0.125 0.100

p = 0.78

0.15 0.075 0.10 0.05 0.00 e.,,...,.,-,.,,-.-,,."I'" 0.0 0.25 0.50 0.75 1.00 Value of sample average
(c) ,,=25

0.050 0.025 0.00 'r-r-r-r-r-r-r-r-r-r-f--r-r-rf 0.0 0.25 0.50 0.75 1.00 Value of sample average (d) ,,=100

The distributions are the sampling distributions of with p
:=

Y, the

sample average of n independent

Bernoulli random variables

Pr( Y,' ee 1)

=

0.78 <the probability of a short commute is 78%), The variance of the sampling distribution of
J.1.

Y

decreases as n gets larger, so the sampling distribution becomes more tightly concentrated around its mean the sample size n increases.

=

0.78 as

50

CHAPTER 2

Review of

Probability

· . f tl e law of large numbers thai we will usc in this b k are The con d mons 'or 1 _ 2'" that Y;,i = 1, ... ,1'1 are i.i.d. and that the variance of 11, Uy,lS finite. The mathemat_ ical role of these conditions is made clear in ection 17.2, where the.law of large num, bers is proven. If the data are collected by simple random sampling, then the i.i.d. assumption bolds. The assumption that the variance is finite says that extremely large values of Y;-that is, outliers-are unlikely and observed Infrequently; therwise, these large values could dominate Yand the sample average w uld be unreliable. This assumption is plausible for the applications in this b k. F r example, because there is an upper limit to our student's commuting time (she could park and walk if the traffic is dreadful), the variance of the distribution of c mmuting times is finite.

The Central Limit Theorem
The central limit theorem says that.under general conditions, the di uri ution ofY is well approximated by a normal distribution when 11 is large. Recall that the mean ofYis I"Y and its variance is u~ = up/no According to Ihe central limit the rem, when 11 is large, the distribution ofY is approximately N(I"Y, u~). As discus ed at the end of Section 2.5, the distribution ofYis exactly N(I"Y, u~) when the sample is drawn from a population with the normal distributi n N(I-'Y, up). The central limit theorem says that this same result is approximately true when 11 i large even if y" ... ,Y;, are not themselves normally distributed. The convergence of the distribution of Y to the bell-shaped, n rrnal appr ximation can be seen (a bit) in Figure 2.8. However, because the distribution gets quite tight for large 1'1, this requires some squinting. It would be easier t sec the shape of the distribution ofY if you used a magnifying glass r had me other way to zoom in or to expand the horizontal axis of the figure. One way to do this is to standardize Y by subtracting its mean and dividing by its standard deviation so that it has a mean of 0 and a variance f l.This process leads to examining the distribution of the standardized vcrsi n f Y, (Y - I-'I)/uy. According to the central limit theorem, this di tribution should be w II approximated by a N(O,1) distribution when n is large. The distribution of the standardized average (Y - I-'Y)/U1' is plotted in Figure 2.9 for the distributions in Figure 2.8; the distributions in Figure 2.9 arc exactly the same as in Figure 2.8, except that the scale of the horizontal axis i changed so that the standardized variable has a mean of 0 and a variance of l.After this change of ~cale, it is easy to see that, if n is large enough, the distribution of Y is well approximated by a normal distribution. One might as".:,how large is "large enough"? That is, how large must 1/ be for the distribution of Y to be approximately nonnal?The answer is, "I t depends."",e

t

2.6

Large-Sample Approximationsto SamplingDistributions

51

Distribution of the Standardized Sample Average of n Bernoulli Random Variables with p = 0.78
Probability 0.7 0.6 0.5
0.4 0.3

Probability 0.5
0.4

0.3 0.2

0?:

0.1 0.0 ~h,..-,---lL,-r-.,J"'--r---r-~ -3.0 -2.0 -1.0 0.0 1.0 2.0

0.1 0.0 ", ...... -, -3.0 -2.0 ...... -,---!''-r--.''---,-,---;:"'; -1.0 0.0 1.0 2.0 3.0 Standardized value of sample average (b)

3.0

Standardized value of sample average (a)

,,= 2

,,= 5

Probability 0.25

Probability
0.12

-2.0

-1.0

0.0

1.0

2.0

3.0

-2.0

-1.0

0.0

1.0

2.0

3.0

Standardized value of sample average (c),,=25 The sampling distribution of (d) ,,=100

Standardized value of sample average

Y in

Figure 2.8 is plotted here after standardizing

Figure 2.8 and magnifies the scale on the horizontal axis by a tactor of

Y. This plot centers the distributions in Vii. When the sample size is large, the sampling

distributions are increasingly well approximated by the normal distribution (the solid line), as predicted by the central limit theorem. The normal distribution is scaled so that the height of the distributions is approximately the same in all figures.

I 1': are i. after centering and scaling. Y.2 < 00. n f the underlv. Y.. that is quite different from the Bern lion has a long right tail (it is "skewed" to the right). the d. The central limit theorem is summarized in Key Summary 1.d. have a similar shape. 0' y where 2 o < . the pI' bability I function (for continuous random variables).. The probabilities with which a random variable takes on different variables).10 for a population Figure 2.' = iJ-l' - and var( 1'. Becau e the distri ution The convenience of the normal approximation. While the "small in Figures 2.stflbutlon of (Y - iJ-Y)/" 2 _ (where .7 S uppose th a t v l] I •. distributed. themselves have a distribution that is far from n rmal.e respectively. then this di tributi ampling n. n rmally when the ing Y. Although the sampling distribution ulli distribution. amazingly. then Y is exactly normally distributed underlying I'.7. i approaching imperfect] ~ the bell ns. quality of the normal approximation depends on the distributi if the for all II...".10 are complicated 1/" and quite differd are simple aches the and. ability hecause of the central limit theorem.the normal to the distribution ofY typically is very g d f I' a wide variety II" population distributions. sh wn in This di tribudistribution of approximation can require n = 30 or even more. is shown in Figures 2.. of is quite go d. By nape for II n ~ 25.I I' II IDO.. and 100.10a.) 11 . the normal approximation however. 25.. makes it a key underpinning applied econometrics. The central limit theorem is a remarkable ent from each other.9 and 2. In Iact.O'y/n) becom:s arbitrarily well approximated by the standard n rmal distribution. the normal approximation approximation still has noticeable = 100.52 CHAPTER 2 Review of Probability ~ The Central Limit Theorem 2.. normal as 1/ grows large.\ f Yappr distribu- tions of Yin parts band c ofFigures 2. that make up the average.IOb-d f I'll = 5. Y is said to have an asymptotic normal distribution. and the probability values are distridensity summarized by the cumulative distribution bution function (for discrete random functi n.) = a}. At one extreme. . 00. As n . the "large distributions re ult. with E( Y.i. are thcrnselve In contrast. This point is illustrated in Figure 2.9d and 2. combined wi th its wide applicof modern nccpt 2.

0 -1. like the population distribution. as predicted by the central limit theorem..06 0.--3.lOa. When n is small tn = 5).0 1.0 0.0 2.0 20 3.50 0.10 0. the sampling distribution is well approximated by a standard normal distribution (solid line).-.oo~.0 2.00 ~.0 3. the sampling distribution.-"'" -3.0 0.12 Probability 0.0 -2.30 0.0 3.0 3.0 0.~---fI -3.0 -2.09 0.~100 1/=25 The figures show the sampling distribution of the standardized sample average of rid population distribution n draws from the skewed (asyrnrnet- shown in Figure 2.0 1.03 o.Summary Distribution of the Standardized Sample Average of n Draws from a Skewed Distribution 53 Probability 0.0 1.0 0.0 -2.0 Standardized value of sample average (e) Standardized value of sample average (d) .0 Standardized value of sample average (a) 1/ Standardized value of sample average (b) 1/ ~ =1 5 Probability 0.0 0.09 0.0 0. But when n is large tn = 10m.0 2. .0 1.12 0.-. is skewed.03 -1.20 0. The normal distribution is scaled so that the height of the distributions is approximately the same in all figures.06 0.12 0.40 Probability 0.-.0 -2.00"'='.0 -1.0 -1.

··· . The conditional pr bability distribu_ tion of Y given X = x is the probability distribution of Y. Y.d.l'. Key Terms outcomes (15) probability (15) sample space (15) event (15) discrete random variable (15) continuous random variable (15) probability distribution (16) cumulative probability distribution (16) cumulative distribution function (c.. S. that are independently and identically distributed (i.i. .1'. the law of large numbers says that u 2 fT~/n..). 111esample average. The expected value of a random variable Y (also called its mean. 3.Ld.L) (17) Bernoulli random variable (17) Bernoulli distribution (17) probability density Iuncti ( 18) density function (I ) density (18) expected value (18) expectation (I ) mean () variance (21) standard deviation (2\) moments of a distribution skewness (23) kurtosis (25) outlier (25) n (p..f. are i.. Simple random sampling produces n. then: a.d. varies from one randomly chosen sample t the next and thus is a random variable with a sampling distributi n. The joint probabilities [or two random variablcs X and Yare summarized by their joint probflbility distribution. has a standard normal distribution [N(O.tribution] when fI is large.d. ILy). To calculate a probability associated with a normal random variable.5.. the central limit theorem says that the standardized veri n f Y. I) di . . 4. random observations lative distribution

Related Interests

I. If 11. Yc nverges in probability to /-'y. 6. is its probability-weighted averflge value.) (23) . conditional nX taking on the value x. the sampling distribution ofY has rnean uv and variance b. (Y -/-'l')/Uy. and the standard deviation of Y is the square root of its variance. denoted E(Y).54 CHAPTER 2 Reviewof probability 2. first standardize the variable. A normally distributed random variable has the bcll-shaped probability density in Figure 2. and c. then use the standard normal cumutabulated in Appendix Table I.y)']. The variance of y is (Jp = E[(Y _/-.

and (e) whether it is raining or not. you the computer of the next person you meet. (b) the number of times a computer (c) the time it takes to commute to school. weight of the students A random sample average weight is calculated.) (44) (44) (45) distribution (47) (48) (47) sampling (43) distribution expectation mean (28) Jaw of iterated expectations variance (30) distributed (31) independently distributed sample average sampling asymptotic sample mean (44) distribution distribution exact (finite-sample) (36) (36) (38) (32) distribution law of large numbers (48) convergence consistency (38) asymptotic in probability (48) normal distribution (52) normal distribution a variable (36) normal distribution central limit theorem (50) normal distribution Review the Concepts 2.i. 2. and their in the Will the average variable.1 Examples of random variables used in this chapter included (a) the gender crashes. during a given month and Y denotes born in Los Angeles during the same month. and you know why knowing the value of X tells you nothing 2. and the mean student weight is 145 lb. sample equal J 45 lb? Why or why not? Use this example to explain why the Y. of four students is selected from the class. (d) whether Explain why each can be thought of as random.d. is a random .2 Suppose that the random variables X and Yare independent their distributions. sample average. Are X and Y independent? 2.4 An econometrics class has 80 students.3 Suppose that X denotes the amount of rainfall in your hometown the number of children Explain. are assigned in the library is new or old.Reviewthe Concepts leptokurtic rth moment (25) (25) distribution (27) (28) (29) (31) (26) (27) distribution chi-squared Student distribution (41) 55 t distribution (41) joint probability marginal conditional conditional conditional conditional independently independent covariance correlation uncorrelated normal standard standardize multivariate bivariate (31) (32) probability t distribution (42) F distribution (42) simple random population identically (43) distributed (44) and identically (i. Explain about the value of y.

3 Using the random variables X and Y from Table 2. and (c) "'wv and corr(W.. Suppose that p ~ OJ. 2. skewness = 0.. consider (b). What IS the rela- tionship between your answer and the law of large numbers? 2. (Hint: You might find it helpful to use the formulas Exercise 2.S. population Or lookfor 2008.Repeat this for n _ and n . two new ran- 2..5 Suppose that 1>""" a . Derive the probability distribution of Y. describe how the densities differ.. and kurtosis 100. Would if n = 5? What Y is a random variable with !-Ly = 0. You want to calculate Pr( Y it be reasonable to use the normal approximation n ~ 25 or n ~ 100? Explain. ance in DC? 2. and (c) a xy and corr(X.1 Let Y denote the number of "heads" that occur when two coins are tossed. 2. . a. re i i d random variables with a N(I.1 and. V). 0.2 to compute (a) E(Y) and E(X). Compute tbe mean. Sketch a hypothetical probability distribution of Y.6 The [allowing table gives the joint probability distribution between employment status and college graduation among those either employed Ing [or work (unemployed) in the working age U. = why n ran- Exercises 2.10a.21..~.10 lion. variance.. 4) distribu2.1). Derive the mean and variance of Y. are i i d. Sketch the probability density of Y when n ~ 2.a.. Explain dom variables drawn from this distribution might have some large outliers. In war dS. Show E(Xk) ~ ~ p. What are the mean. . random variables with the probahility • _ disabout tribution given in Figure 2. In' Y.. a. dom variables W ~ 3 + 6X and V ~ 20 . standard deviation.4 Suppose X is a Bernoulli random variable with P(X = 1) = p.5 In September. Compute (a) E(W) and E(V). V).6 S uppose th a t Y. (b) . b. Seattle's daily high temperature has a mean of 70°F and a and vari- standard deviation of7°F. Derive the cumulative probability distribution of Y.11 . c.2 Use the probability distribution given in Table 2. skewness.2..7 s.~. ...) 2. "'y ~ 1. 2.56 CHAPTER 2 Review of Probability Y. given in c. and kurtosis of X. and . p for k > O. Y.100 .7Y. Show E(X3) b..

I).8 The random variable Y has a mean of 1 and a variance of 4. The correlati ct n between male and female earnings for a earnings for a randomly c uple is O.659 0.000 = I) a. dollars ($) to euros (€).000. n of two-earner male/female status independent? In a given p pulati have a mean tion couples.Exercises Joint Distribution of Employment Status and College Graduation in the U.954 0. O.0 0 per year and a standard devia[$1 .S. h w that the unemployment c. Population Aged 25 and Greater. Female earnings have a mean of $45. What is the tandard deviation of C? d.341 1. Let Z = ~(Y . A rand mly selected member ot this p pulation reports being unernpi yed. alculaie the unemployment rate [or (i) college graduates and (ii) n n-collcge graduates.s. is the covariance between male and female earning? c.622 0. how that Ikz 2. Wha. d. elected couple. b. \ hat is the pr bability that this worker i a college graduate? Anon.009 0.000.332 0.000 per year and a tandard deviation of $12. lculat E(YIX=I)andE(YIX=O). rare is the fracuon of the labor force that is rate is given by 1.7 Are educational achievement and crnploymeru xplain.9 = 0 and u~= l. ompule' (Y). X and Yare di crete rand m variables with the foLJowingjoint distribution: . male earnings [$40. onvert the an \ er to (a) through (c) from U.ollege graduate? r. 2008 Unemployed Employed 57 CY = 0) W=1) Total Non-cotlege grads(X ollegegrads(X To'ot = 0) 0.037 0. denote the combined u.E(Y). What i the mean of b. e. TIle unemployment unemployed. 2. 2.046 0.

99).03 30 0.99).78).01 0. If Y is distributed Ro.4.58 CHAPTER 2 Review of Probability Value of y 14 22 0. Use the definition 2. find Pr( Y:5 7. If Y is distributed F. If Y is distributed N(5.02 0. 2). Why are the answers to (b) and (c) the same? e..) xl xl. Pr(X = 1. Why are the answers to (b) and (c) approximately e.79). mean.31). If Y is distributed b.oo. Y:5 1.10 65 0. 9). find Pr( -1. and variance mean.15 40 0.t7 0. find Pr(40 :5 Y:5 52). 1). of c. > 1.02 8 That is. If Y is distributed the distribution. '90. c. Y = 14) = 0. .4). If Y is distributed 'l5. [f Y is distributed xlo.02.03 0.10 Compute the foUowing probabilities: a. Calculate the covariance and correlation 2. find Pr( Y d. find Pr(Y 3).find Pr(Y > 18. > 0). b.05 0.01 0. If Y is distributed N(O.83).99:5 c.75). find Pr(6 :5 Y :5 8). if Y is distributed N(50. Calculate the probability distribution.99 :5 Y:5 1.09 J Value of X 5 0. find Pr( Y > 1. If Y is distributed FWD. and variance of Y. f. 2.02 0. between X and Y. If Y is distributed N(3. X=8. find Pr(Y :5 1.05 0. Calculate the probability distribution. b. find Pr( -1.12 Compute the following probabilities: a. the same? find Pr(Y > 2. and so forth. find Pr( Y > 4. If Y is distributed N(I. c.10 0. find Pr(Y:5 b. d. 25). u Y is distributed xl.12).11 Compute the folJowing probabilities: a. of Y given a. d.0).15 0. (Hint.

find Pre J 0 I :s Y :s 103). you do n t ha e your rcxibo k and do not have acce to a n rmal probabilit table like ppendix Table I. iHint: Use the law of iterated expectations conditioning on X = 0 and X = I.43) whcn /I /I = 100. /I = 20. a. /I.. b. find Prey > 98). Y. (ii) /I = 100.. (I-lill/: What is the skewness for a symmetric di tribution?) the kurt = 3 and E(W4) = 3 X 1002.100) distributi n. . In a random sample of size /I = 64. each distributed mpute Pr(9. Pre Y :s 0.) ) 4 c.99. Show that Pre I 0 .0 as /I grows large. E( 2). find Pr(Y:S In a random ample h. uppo c c is a positive number. 2.d.14 In a populau nlJ. (Hil/I: Use the fact that i is 3 for a normal distribution. and X. and b.i.6). are i.6).X) W.. Let S = Y + (I ...y= lOOanduf=43.4).i. f size /I = 165. 2. are i. Derive the skewness and kurtosis for S. II. ii. . I). Shov that E( y d.16 Y i distributed (5. 2.c :s Y :s 10+ c) be omcs ct C I 1. Pre Y i!: = 0. draws from the I ( .. Y is distributed (0.6 :s Y:s lOA) when (i) (iii) /I 1.. nfortunately. In a random sample of size /I = 100. Let Y ns for 0. and S = W when X = 0.37) when . Derive E( ). W is di tribuicd N(O.d. random variable. (That is.4. you do have your ompurer and a mpurcr program tha: can generate i. argue that Y converge in probability to 10. £(S3) and £(S4). se the central limit to compute approximati i. an wer the f II wing que tions: II. how that E( y3) = 0 and E( W3) = O. how that £(y2) = I and E(W2) = 100. (10.000. i = I. sc our answer in (b) 10 c. and Ware independent. c.i. Howe er.) e. Use the central limit theorem to 101).d.) B. xplain how you can u e your computer to compute an a urate appro irnation for Prey < 3.17 y.Exercises S9 2.1 uppose y" i-I. Bernoulli random variables with denote the ample mean.100) and you want 10 calculate Prey < 3.13 X is a Bernoulli random variable with Pr(X = J) = 0. = 400. 100). 2. /I. p 2.S = Y when X = 1.

Suppose that Y takes on k values Y" .] (2. . Show E(X _1-')4 = E(X ) 4 and so forth. Y) = O. Use your answer to (a) to verify Equation c.95? (Use the central limit theorem to compute an approximate answer. Z = z)..).41) 0. Consider an "insurance pool" of 100 people whose homes are sufficiently dispersed so that. E(X'). What are the mean and standard deviation of the damage in any year? b./Y~y. Y..19). .] that Y = Y can be calculated [Hin/:This is a generalization of b.20).) Pr(X = x. XI' a. and Z. and that Z takes on of X. Z)].).. .60 CHAPTER2 Review 01 Probability b. the damage is random. Suppose that X and Yare independent. corr( X. Suppose that in 95% of the years Y = $0. Z is of Y values z" . a. Y = Y. Let Y denote the dollar value of damage in any given year.x~x Z-z) . Suppose that Y takes on k values y" 11/ . Yk. (i) What is Y? (ii) What is the proba- Y exceeds $2000? 2. - 3[E(X2)][E(X)] 4[E(X)J[E(X3)] + 2[E(X)]3 + 6[E(X)f[E(X')]- 3[E(X)]' .000 . and the conditional probability distribution givenXandZisPr(Y=yIX=x Z=z) =P.19) and (2. . of Equation (2. the weather can inflict storm damage to a home. Zm' The joint probability distribution Pr(X = x.20 Consider three random variables X. Pr(X Xj. a. that X takes on I values x" .19 Consider two random variables X and Y. I]'Xy b.Z z) . [Hint: Use the definition of Pr( Y = YilX = x. in any year. How large would n need to be to ensure that Pr(0. Explain how the marginal probability from the joint probability distribution. a. Show that = 0 and 2. the damage to different can be viewed as independently distributed random variables.. E(X3). [Hin/:This is a generalization Equations (2. Show that E(Y) = E[E(YIX.] 2. the expected value of the average damage bility that homes Let Y denote the average damage to these 100 homes in a year.. but in 5% of the years Y = $20.39 -s: 2: Y '" 0.21 X is a random variable with moments E(X).) 2.~lPr( Y ~ yil X = x.16). From year to year..18 In any year. Y. X. Yk and that X takes on I values x]. Show E(X _1-')3 = E(X3) b. how that Pre Y = Yi) = L:.

after 1 year and that $1 invested sup0. Show that !-'y = = X2 1. [Hint: Use your answer to (a)..22 Suppose you have some money to invest-for simplicity.5. What value of w makes the mean of R as large as possible? What is the standard deviation of R for this value of w? d.Xn denote a sequence of num- of summation bers. Show that E(W) = n. (Harder) What is the value of w that minimizes the standard deviation of R? (Show using a graph. Show that E(Yl/o.. in the bond fund. (1/(72) L::~. . 1 . b yields R. Y) = 0 and thus corr(X. standard Y) = O.w. is random fund.] d. Suppose that w b.75. = O. = 1. Compute the mean and standard deviation of R. b.. X~. = O. Suppose that = 0. Show that W = i. and let Y = X2 + Z. and c denote three constants.Exercises 61 2. N(O.i. Yl>' . Compute w = 0.w.. . b a. c. $l-and you are planning to put a fraction w into a stock market mutual fund and the rest.d. normal random a. The correlation between R. Show that E(yIX) b. n.) Let Xl. Show that V = Yj 2. into a bond mutual pose that R. Show that cov(X. + (1 .05 (5%) and standard devia- tion 0.23 This exercise provides an example of a pair of random variables X and Y for which the conditional mean ofY given X depends distributed on Xbut corr(X. is distributed Yl n- c..Yn denote another sequence of numbers. (72) for i = 1.) Y) d. 1 . . .08 (8%) and standard deviation and Suppose that lib is random with mean 0. the mean and standard deviation of R.2) b. If you place a fraction w of your money ill the stock fund and the rest. algebra.25 (Review / L::~2Y?IS d· ib ute d l.w)R . Let X and Z be two independently variables. Show that E(XY) of a standard normal random variable are all zero. then the return on your investment is R = w R. 2. .-J1 istn notation.. and Rb is 0.24 Suppose 1) is distributed a. with mean 0.07.2. and a. or calculus. Show that . Suppose that $1 invested in a stock fund in a bond fund yields R .25. (Hint: Use the fact that the odd moments c.04.) 2.

Let error associated with this guess. Derive E(V').=1 2.1 Derivation of Results in Key Concept 2. Suppose the value of Z. are random variables with a common mean t-v.=1 tl II n b.(Hint: use the law of iterated expectations.] APPENDIX 2.26 Suppose that l'l..XiYi . you know a guess of denote tbe X ~ E(XI Z) denote the value of X using the information on Z. To derive Equation (2.3. show that E(Y) = /"y and d. + ey. but not the value of X.h(Z).a j=l = na n II II II d.3 This appendix derives the equations in Key Concept 2.x. . 2. Let X = g(Z) denote anotber guess of X using Z. Show that cov( Y. When n is very large. - a.y. Suppose that n = 2. show that var( Y) '" pUy. Show that E( WZ) ~ O.. Lax. [Hint: Let h(Z) g(Z) .E(a + bY l]'J = El[b(Y . where i '" j). + 2ab ~Xi + 2ae ~Yi + 1=1 1-1 I-I 2be 2.X a. c.) b. use the definition of the variance to write yare a = + bY) = El[a + bY . . and V =X - X = denote its error. b. 11. Show tbat E(W) = O. and let W ~ X . lj) = perf for i '" j.62 CHAPTER2 Review of Probability II a. - = /"y and var(Y) = ~af + ~paf.=1 + e' 2. . i=1 i(a + n bx. and lj is equal to p for all pairs i and j. so that V = [X .I'v)'] b'u9. . a common variance O"~l and the same correlation p (so that the correlation between Y. j=! II = a LX. Equation (2.) = i=l n ?Xi 1=1 + ?Yi 1=1 c. Show that E( V') "" E(W').1'.. 2.E(XI Z)] . 2 2 var(Y) = av/n + [(n . 2.29) follows from the definition of the expectation. (Xi + y. .27 X and Z are two jointly distributed random variables.30)..l'y1]'} ~ b'E[(Y .1 )/n ]pUy. Show tbat E(Y) Co For n "" 2. .E(Xj Z).)' = no' + b' 2.

.34). I carr (X.Derivationof Results in KeyConcept 2. ing the quadratic.33).I"x) + I"x][( Y .E(a + bX + cV)][Y -I"Y]} =E ([b(X-l"x) + c(V -l"v)][Y -I"y]j -I"y]j -I"y]j = E {[b(X -l"x)][Y = buXY + E ([c(V -l"v)][Y + caVY.32). inequality implies that cr}y/(crlcr'f.):5 !crxy!(uxu). which (using the definition of the correlation) proves the correlation I carr (X Y)I -s 1. -I"Y) + l"yJ'} = E[(Y -I"Y)'] + To derive Equation (2. use the definition of the covariance to write cov(a + bX + cV.a}yja}.52) lor.31). We now prove the correlation inequality in Equation (2. Y)I Let a = -UXY/ a} and b = 1. Because u} + 0'9 + 2( -uxy/u})aXY (2.I"Y) + I"y]j = oS Y -I"Y)] + I"xE(Y -I"Y) + l"yE(X -I"x) + I"x I"Y = aXY + I"x I"Y' 1.Rearranging this inequality yields s uh The covariance :5 a}u~ (covariance inequality).33).50) which is Equation (2.u'j-yja} ~ O. (2.(al"x + bI"Y)]') = E{[a(X -I"x) + b(Y -I"Y)]'} = E[ a'(X -I"x)'] + 2E[ ab(X . it cannot be negative. that is.Applying Equation (2.3 To derive Equation (2. write E(XY) E[(X -I"x)( = E {[(X .49) +E[b'(Y-I'Y)'] = a'var(X) + 2abcov(X. Y) ~ E{[a + bX + cV . the third equality follows by expand.51) it must be that uf . and the fourth equality follows by the definition of the variance and To derive Equation (2. equivalently. (2. To derive Equation (2. where the second equality follows by collecting terms. covariance. we have that var(aX+ Y) =a2ul+uf+2auXY ~ (-uxy/u})' = u~.I"x)(Y -I"Y) ] Y) + b' var( Y) (2. write E(Y') 21"YEI.51) var(aX + Y) is a variance.35).)1 inequality.31). Y-I"Y) +I"~= u9 + 1"9 because E(Y-I'Y) = E{[(Y = O. ~ a'a} + 2aba xy + b'a}. 1. use the definition of the variance to write 63 var(aX + bY) = E{[ (aX + bY) . so from the final line of Equa- tion (2.

1. if so. more practicalapproach is needed.S. what is the mean of the distribution S tatistics is the science of using data to learn about the world around us. by how much' These questions relate to the distribution of earnings in the population of workers. . hypothesis testing. managing and conducting the surveys. measuring the earnings of each worker and thus finding the population distribution of earnings. 3. For example. we can use this sample to reach tentative conclusions-to draw statistical inferences-about characteristics of the full population.say. and 3. 1000 members of the population. and confidence Intervals In the context of statisticalinference about an unknown mean.l. and confidence intervals.3 review estimation. Three types of statistical methods are used throughout econometrics: estimation. The key insight of statisticsis that one can learn about a population distribution selected by selecting a random sample from that population. Confidence intervals use a set of data. The 2000 l. Estimation entails computing a "best guess" numerical value for an unknown characteristic testing entails formulating a specific hypothesis about the population. of a population distribution. and the 2010 Censuscould cost $15 billion or more. population. Census cost $10 billion. from a sample of data. population is the decennial census.s. Statistical of earnings tools help us answer questionsabout unknown characteristics of distributions in of recent college graduates' Do mean earnings differ for rnen and women. In practice. at random by simple random sampling. Thus a difJerent. The only comprehensive survey of the U. One way to answer these questions would be to perform an exhaustive survey of the population of workers.S. and compiling and analyzing the data takes ten years.CHAPTER 3 Review of Statistics populations of interest. such a comprehensive survey would be extremely expensive. 64 population . to estimate an interval or range for an unknown population characteristic. Hypothesis then using sample evidence to decide whether it is true. such as its mean. many members of the population slip through the cracks and are not surveyed.2. Despite this extraordinary commitment. and.' . we might survey. Sections 3. The process of designing the census forms. Rather than survey the entire U. however. Using statistical methods. hypothesis testing.

lj. For example.d.1 Estimation of the Population Mean Suppose estimate pendently you want to know the mean valne of Y (that is. Both Yand 11 are functions of the data that are designed to esti3. so what makes one estimator "better" are random variables. Yand mate !Ly.3 are extended to compare means in two different populations. A natural way to Y from a sample of n inde- (i. 1'. be based on the Student I distribution instead of the normal distribution. another observation. both are estimators of !Ly. but it is not !Ly is simply to use the first the only way.. than another? phrased bution Because estimators There are many possible estimators.. of from one sample to the next. In some special circumstances. at least in some average sense. if they are collected by simple random estimation of !Ly and tbe properties of Y as an estimator Estimators and Their Properties Estimators.1 Estimationof the Population Mean 6S Most of the interesting questions in economics involve relationships between two or more variables or comparisons between different populations. which Y and lJ are two examples. this question can be of the sampling distrithat gets as close more precisely: What are desirable characteristics of an estimator? to the unknown In general. 3.i.6. this mean is to compute and identically the sample average such as the mean earnings of women recently graduated distributed from college.5 focus on the use of the normal distribution for performing hypothesis tests and for constructing confidence intervals when the sample size is hypothesis tests and confidence intervals can large. The chapter concludes with a discussion of the sample correlation and scatterplots in Section 3.2 through 3. Sections 3..1 through 3. Y and lJ take on different values (they proof !Ly. . we would like an estimator as possible true value. The sample average Y is a natural way to estimate !Ly. For example..5 discusses how the methods for comparing the means of two populations can be used to estimate causal effects in experiments.!LY) in a population. these special circumstances are discussed in Section 3.d. in fact. the methods for learning about the mean of a single in Sections 3. Thus the estimators There are.4. Section 3.·· .i. This section of !Ly.1. lJ. (recall that sampling). in other . way to estimate lJ. Y" discusses are i.) observations.7. is there a gap between the mean earnings for male and female recent college graduates? population In Section 3. . using the terminology When evaluated duce different in repeated estimates) in Key Concept samples.> 3. many estimators l[ both have sampling distributions.

fj.y is that.if so. [ky and ILy. a desirable of fj. while an estimate is a nonrandom words.y is that the probability approaches Concept 2.y has a smaller variance than ILy.y. that is. An estimate is the numerical value of the estimator computed using data from a specific sample.y has a smaller variance than ILY. 1 as the sample size increases. How might you choose between them? One way to do so is to choose the estimator with the tightest sampling distribution. An estimator because of randomness number. of an estimator to be a lightly cenleads to three specific value as possible. let fj. the uncertainty about the value of /'oy arising from random variations in the sample is very small. when it is actually is a random in selecting the sample.y and ILy by picking the estimator with the smallest ance. Thus a desirable properly of an estimator sampling distribution as equals /'oy. This sugvarithan gests choosing between fj. where E([ky) i the mean of fj..y) = /'oy. (Key Variance and efficiency.y is said to be more efficient ILy. This observation of an estimator: unbiased ness (a lack of bias). It is reasonable to hope that. such Y or l'I. To slate this concept mathematically.l' is biased. Another desirable property fj. variable ulation. consistency.6). . both of which are unbiased. The terminology "efficiency" stems from the notion that if fj.y denote some est ima t r of /ky. The estimator fj. we would like the sampling distribution tered on the unknown desirable characteristics and efficiency. otherwise. Unbiosedness. the estimator is said to be unbiased. fj.c 66 CHAPTER 3 Reviewof Statistics ~ Estimators and Estimates 3. Stated more precisely. on average. Suppose you evaluate an estimator many times over repeated is that the mean of its randomly drawn samples.y is consistent that it is within a small interval of the true value /ky for /k. If fj. Suppose you have two candidate estimators.1 An estimator is a function of a sample of data to be drawn randomly from a pop. of an estimator of the sampling distribution Consistency. when the property sample size is large. then fj. then it uses the information in the data more efficiently than does fiy. you would get the right answer.y is unbiased if E(fj.

consistency. the variance of Y is o}/n.. we need to specify the estimator or estimators to which Y is to be compared. and Efficiency CilDmim Let /LY be an estimator of l-'y.2. is an unbiased estimator of I-'Y' Its variance is var( l'i) = o-f. estimator estimator • Let /iy be another estimator of I-'y and suppose that both /Ly and /iy are unbiased. Thus.6. Consistency.. Y is a more efficient estimator than Y" so. that is. We start by comparing the efficiency of V to the estimator l'i. • --=--==~_lI..I . the law of large numbers (Key Concept 2. From Section 2. Y should be used instead of Y. and efficiency are summarized in Key Concept 3. of I-'y if E(fly) of j.5 and 2. Bias. .1IlIIIIIIIIiI. /-Ly. The estimator Y. Properties of Y How does Y fare as an estimator of I-'Y when judged by the three criteria of bias.2 = I-'Y.i. and efficiency? The sampling distnbution of Y has already been examined in Sections 2. consistency.. the variance of Y is less than the variance of l'i. Becanse Y" . I-'y. .I-'Y.Then /Ly is said to be more efficient than /iy if var(/Ly) < var(/iy). E(V) = I-'Y..6) states that Y ---L.d.1 Estimation the Population of Mean 67 Bias.!I.5. according to the criterion of efficiency. Bios and consistency. Y" are i. thus y..5.en: • The bias of fly is E(fly) • fly is an unbiased • (Ly is a consistent . so V is an unbiased estimator of I-'Y' Similarly. for n 2: 2. that is.ty if My ~ 3. . the mean of the sampling distribution of l'i is E( l'i) = I-'y. As shown in Section 2..TI. might strike you as an obviously poor estimator-why would you go to the trouble of collecting a sample of n observations only to throwaway all but the first?-and the concept of efficiency provides a formal way to show that V is a more desirable estimator than Y.Y is consistent. What can be said about tbe efficiency of Y? Because efficiency entails a comparison of estimators.3. Efficiency.

. . 1'.=J I'i> varey) < var(p. Y is consistent. If iJ..25a~/11 (Exercise 3. that is..I.y is unbiased. that is.. functions 5.. 11. Because m is an estimator of E(Y).. show thai the weighted of all unbiased estimators averages 11 and Y have these conclusions reflect a more general result: Y is the most efficient estimator that are weighted averages of 11. Thus j? is unbiased and.1'. jiy =::: ". . Said and are linear differently..3 Efficiency of Y: Y Is BLUE Let My be an estimator of J. . However. .. The comparisons in the previous two paragraphs larger variances than Y.. then .1 aY where a. structure: They are Y.y = Y Thus Y is the Best Linear Unbiased Estimator (BLUE).68 CHAPTER 3 Reviewof Statistics m:mmm 3.This result is stated in Key Concept that is. Y is the most efficient estimator of My among all unbiased estimators weighted averages of }]. Y"..y) unless p.. a are nonrandom constants.-mf. II (l/n) ":::'. ..lllUs The estimators weighted averages of Y is more efficient than Y. m that minimizes (3. because var( Y) ---> 0 as n ---> 00. Y is the Best Linear Unbiased Estimator (BLUE). The mean of Y is /"y and its variance is var( Y) = 1.1' that are What about a less obviously poor estimator? in which the observations are alternately weighted Consider the weighted by ~ and~: average (3. it is the most efficient (best) estimator among all estimators that are unbiased 3.\.2) rn ~(Y.3 and is proven in Chapter Y is the least squares estimator of /"y. . i=l which is a measure of the total squared gap or distance between the estimator and the sample points.." . ot r].ty that is a weighted average of>. . The sample average Y provides the best the obser- fit to the data in the sense that the average squared Consider the problem of finding the estimator n differences between vations and Yare the smallest of all possible estimators.. . j? has a larger variance than Y...11). 1'. . and Y have a common mathematical r.. 1'. In fact. you can think of it as a .. .1) where the number of observations 11is assumed to be even for convenience.

such as those that would be sampling. .2) is called the least squares estimator. Landon would defeat the incumbent.1 Estimationof the Population Mean 69 _f--------------: S won hortly before the 1936 U. you have the value that makes Expression solve the least squares problem: Try many values of m until you are satisfied that as is done in Appendix 3. and an estimate of the unemployment or oversarnples.2) as small as possible. . This assumption is important because the from simple random national unemployment sampling can result in Y being biased.i. to estimate rate. . example biases introduced in the sample. that }]. Franklin D.3. Alternatively.S. the unemployed represented this sampling overrepresents. Because the telephone survey did not sample randomly from the population but instead undersampled Democrats. Because ple are at work at that hour (not sitting in the park!). Suppose agency that. Y is the least The Importance of Random Sampling We have assumed obtained nonrandom monthly scheme 10:00 A.2. but the "Landon by sampling that is not entirely random. the Literary Gazette records and automobile registration files.. Do you think surveys conducted over the Internet might have a similar problem with bias? prediction of the value of Y. . you can use algebra or calculus to show that choosing m = Y minimizes squares estimator the Sum of squared gaps in Expression of J.d. the estimator was biased and the Gazette made an embarrassing mistake.Ly.m in Expression One can imagine using trial and error to (3. 11. but it was wrong about the winner: Roosevelt by 59% t041%1 Bow could the Gazette have made such a big ntistake?111e Gazette's sample was chosen from telephone were also more be Republican. presidential election. a. so the gap Y. � iiiiiiiiiiiiii ===""""""'_' _ . published a poll111dlcatmg to 43%. a statistical adopts a sampling most employed peoare overly rate based on in which interviewers survey working-age adults sitting in city parks at on the second Wednesday of the month. (3..2) so that (3.e estimator 111 that minimizes the sum of squared gaps Y. This bias arises because this sampling scheme of the population. the unemployed members plan would be biased. The sum of squared gaps in Expression (3..M. Roosevelt. But in 1936 many households did not have cars Or telephones. by a landslide-57% The Gazette was right that the election was a landslide. Y" are i. draws.2) can be thought of as the sum of squared prediction mistakes. This example of Wins!" box gives a real-world is fictitious.nd those that did tended to be richer-and likely to that AlI M.m can be thought of as a prediction mistake.

unemployment 3. The two-sided alternative H. Current the monthly U.O (two-sided alternative).S. called the null hypothesis. specifies what is true if the null hypothesis hypothesis is that E(Y) is written The alternative two-sided alternative hypothesis hypothesis The most general alternative * is not. a !-'Y. called the alternative mean. college graduates a null hypothesis distribution = 20.3) that. that of hourly earnings. denoted !-'Y. which is called because it allows E(Y) to be either as less than or greater than !-'y.: E(Y) * !-'Y. Hypothesis pare the null hypothesis that holds if the null does not. 111is section describes hypothesis tests concerning the population mean of hourly earnings equal $20?). The null hypothesis is that the population to a second hypothesis.4.O' about the population is that E(Y) (3. the conjecture earn $20 per hour constitutes = !-'Y. Stated mathematically. takes on a specific Ho and thus is value.S. !-'Y.o. The starting point of statistical hypotheses testing is specifying the hypothesis be tested.S.o. then the null hypothesis is. the Sur- bias. (3. on average in the population. Null and Alternative Hypotheses to testing entails using data to comhypothesis. if Y is the hourly earning of a randomly selected recent college graduate. Appendix 3. Hypothesis tests involving two populations mean earnings the same for men and women?) are taken up in Section 3.2 Hypothesis Tests Concerning the Population Mean Many hypotheses about the world around us can be phrased as yes/no questions. college graduates embody specific hypotheses equal $20 per hour? Are mean earnings the same for male and female college graduates? Both these questions about the population distribution of earnings. E( V).Q 70 CHAPTER3 Reviewof Statistics 11 is important to design sample selection schemes Population in a way that minimizes Statistics Survey (CPS).4) . Do the mean hourly earnings of recent U.o = 20 in Equation (3. The statismean (Does the population (Are tical challenge is to answer these questions based on a sample of evidence.O' The null hypothesis is denoted Ho: E(Y) For example.1 includes vey it uses to estimate a discussion of what the Bureau of Labor rate.3). actually does when it conducts the U.

and it is reasonable not to reject the null hypothesis. suppose that.o because of random sampling. if this p-value is large.o (the null hypothesis is true) but Y differs from !'-y. the sample average Y will rarely be exactly equal to the hypothesized value t-v» Differences between Y and t-v» can arise because the true mean in [act does not equal!. The p-Value In any given sample. alternatives are also possible. is the probability of drawing a statistic at least as adverse to the null hypothesis as the one you actually computed in your sample. then it is quite likely that the observed sample average of $22. In the case at hand. and these are discussed later in this The problem [acing the statistician is to use the evidence in a randomly selected sample ot data to decide whether to accept the null hypothesis H or to o reject it in favor o[ the alternative hypothesis Hl.64 by pure random sampling variation. the average wage is $22. let ya" denote the value of the sample average actually computed in the data set at hand and let PrHo . To state the definition of the p-vaJue mathematically. say 0.2 Hypothesis TestsConcerning the Population Mean 71 One-sided section. assuming the null hypothesis is correct.3. For example. it is accepted tentatively with the recognition that it might be rejected later based on additional evidence. statistical hypothesis testing can be posed as either rejecting the null hypothesis or failing to do so. accordingly. Although a sample of data cannot provide conclusive evidence about the null hypothesis. It is impossible to distinguish between these two possibilities with certainty. the p-value is the probability of drawing Y at least as far in the tails of its distribution under the null hypothesis as the sample average you actually computed. in your sample of recent college graduates. assuming that the null hypothesis is true. If the null hypothesis is "accepted. the evidence against the null hypothesis is weak in this probabilistic sense.-y. The p-value.64.The p-value is the probability of observing a value of Y at least as different from $20 (the population mean under the nuU) as the observed value of $22. For this reason. also called the signilicance probability. If this p-value is small.5%. then it is very unlikely that this sample would have been drawn if the null hypothesis is true.64 could have arisen just by random sampling variation if the null hypothesis is true. By contrast. thus it is reasonable to conclude that the null hypothesis is not true." this does not mean that the statistician declares it to be true. say 40%.o (the null hypothesis is false) or because the true mean equals !'-y. This calculation involves using the data to compute the p-value of the null hypothesis. rather. it is possible to do a probabilistic calculation that permits testing the null hypothesis in a way that accounts for sampling uncertainty.

Written mathematically. in which case the .O > (Ty I ly acl u:Y..o)/Ul" has a standard normal distribution.O than y " under the nuU hypothesis or.!Ly. 11. when the sample size is large the sampling distribution ofY is well approximated by a normal distribution. the standardized version of Y.o)/Ul' in absolute value. this variance is typically unknown.o.The pa value is the probability of obtaining a value ofY farther from /l1'.o - I) = 2(1' (ly - acl .ol.Y.e calculation of the p-value when Uy is known is summarized in Figure 3.o )/u. Thus. then under the null hypothesis the sampling distribution of Y is N(/ly.6. the p-value) is p-value = PI'".1./ly./ly. (3. is binary so that its distribution is Bernoulli. the p-value is the area in the tails of the distribution of Y under the null hypothesis beyond I Y'''' . how- utl.!Ly.1.!LY.y.o. [An exception is when Y..!LY. The details of the calculation. equivalently. the shaded tail probability in Figure 3. when the sample size is small this distribution is complicated.6) where <l> is the standardnormal cumulative distribution function.6) depends on the variance of the population distribution.o)/ul' greater than (Y'''' .72 CHAPTER Reviewof Statistics 3 denote the probability assuming that E( Y. To compute the p-value..o).e formula for the p-value in Equation (3. computed p-value is = !Ly.Y. As discussed in Section 2. However. In practice. (Y .5) That is. Under the null hypothesis the mean of this normal distribution is !LY.This probability is the shaded area shown in Figure 3.ol]. the p-value is the area in the tails of a standard normal distribution outside ± (yacl -" . as long as the sample size is large..o. I) (3. it i not.That is. it is necessary to know the sampling distribution of Y under the null hypothesis. according to the central limit theorem. () (I Y . where = ut/n . but if the p-value is small. where u~ ~ ut/n. u~). so under the null hypothesis Y is distributed N(/ly.) computed under the null hypothesis (that is. is consistent with the null hypothesis. ut ever.1 (that is.. Tf the sample size is large. The = p-value Prl/o[IY .lf the p-value is large. is the probability of Obtaining (Y . then the observed value yac.ol > !ya" - /ly. depend on whether CT~ is known.This large-sample normal approximati n makes it possible to compute the p-value without needing to know the population distribution of Y. ut.a. Calculating the p-Value When O'y Is Known 11. under the null hypothesis.

so (Y . s~. is 2 Sy- n -1 ". Thus the p-value is the shaded standard normal tail probability outside _ N(O. n. The sample variance and standard deviation.11 Y. Sy.!Ly)2 in the population distribution. and second. . The formula for the sample variance is mucb like the formula for tbe population variance. The sample variance. Y is distributed N(IJ-y.u. . and the standard error of the sample average Y is an estimator of tbe standard deviation of the sampling distribution of Y. In large samples.o by at least as much as yoct.O I I variance is determined by the null bypothesis. and Standard Error The sample variance s? is an estimator of tbe population variance O"~. . is the square root of the sample variance.7) and Exercise 3. . 1)..!Ly f.7) The sample standard deviation.1 instead of n.3.o)!uy is distributed N(O. is replaced by Y.2 ~ Calculating Hypothesis Tests Concerning the Population Mean a p-value 73 The p-value is the probability of drawing a value of Y that differs from My. The population variance. see Equation (2.. witb two modifications: First. The Sample Variance..O I o y"" . the sample standard deviation Sy is an estimator of the population standard deviation ay.o. E(Y .!LYf. (3..2. iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii===~~. Similarly.IJ-y.. .] Because in general O"} must be estimated before the p-value can be computed. i = 1.the sample variance is tbe sample average of (Y.(1) i=1 1 "" _-2 Y). Sample Standard Deviation. we now turn to the problem of estimating u~. u~) under the null hypothesis.o)!uyl· I Y""-Il O"y Y. the average uses the divisor n . is the average value of (Y . n z ±I( y"" -IJ-y.

The reason (or the first modification-replacing unknown and thus must be estimated.1)//1. and as a result Sy IS un bi d . The result in Equation that 1"[" .](T~. esti1 "degree of freedom" -in f freedom Consistency of the sample variance./Vii.7) instead of n is called a degrees of freedom is..4.:'\ (Y. and Y. deviation of the sampling dis- tribution ofY is (T1' = (Ty/Vii.4 Y is an Y.-1)(T~.-1 /-.18.V)'] = [(/1. has a finite fourth moment.1 degrees Dividing by /1. . Y" SE(Y) = 0-1' = s. uses up remain.1 in Equation (3. the natural son for the second modification-dividing mating /"y by by /1.. - Vf] = (/1. lase correction: Estimating the mean uses up some of the information-that the data. is called the standard error of as Key Concept 3.7) instead " reets for this small downward bias.8) 3. Dividing by n . Y and is 0-1' (the caret "'" over the symbol means that it i an estimator error of V is summarized of (T1')'The standard . that so s~ in Key obeys the law of large numbers. .Thus EL.74 CHAPTER 3 Review of Statistics KEY CONCEPT The Standard Error of The standard error of Y estimator of the standard deviation of are i. that is. Because the standard tor of (T1" The estimator denoted SE(Y) or of (T1'. .. must have a finite fourth moment.y that /-.. (3. the sample variance is close to the population probability when /1.is large. which in turn means E( yt) Y. so that only /1. E( yl) < that s~ is consistent is that it is a sample average.9) variance with high In other words. S9 to obey the law of large numbers Concept 2. Equation (3.6. . (3.d. the reason (3. a small downward bias in (Y.1 in Equation (3. Y" are i.. Intuitively..i. of 1/ shown in Exercise E[( Y. The reaas is instead of by I/-is that esti- Y introduces 3.d. . The standard error 0('1.y by Y -is /-. (1. The stan_ dard error ofY is denoted SE(Y) or 0-1" When r.Y? = Cor- IlE[(Y. - vf Specifically. mator of the population variance: SY The sample variance is a consistent '-+(T' y.. But (or must be finite..9) justifies using sy/Vii as an estirna- sy/Vii.i.3 under the assumptions 00../-'y)' must have finite variance.y is estimator of is Y..9) is proven in Appendix 3. in other words.

.d./ are i. the p-value can be computed by replac- ing cry in Equation (3. ..."Y.o 1= SE(Y)' a hypothesis (3.. . ..d.p)/n (see Exer- cise 3. the p-value is calculated using the formula p-value = 2(t>(_ll'aa - SEC Y) .o)/SE(l') plays a central role in testing hypotheses and has a special name. Y. are i. That is. (3."Y.... the formula for the variance ofl'simplifies distribution with Success to p(l .LY. Thus the distribution same as the distribution the standard normal distribution when 11 of' the r-statistic is approximately (l' .6) by the standard error. draws from a Bernoulli p..3. The formula for the standard error also takes on a simple form that depends only on l'and n./LY. .===~ . 1) for large fl.. (3. sf is close to cr9 the by with high probability...13) __________ iiiiiiiiiiiiii.."Y. when cry is unknown and }j.. SE(l') = <Ty. I is approximately The formula the I-statistic.O I - SEC Y) . a test statistic is a statistic used to perform r-statistic is an important example of a test statistic.o)/cry.. the (-statistic or z-ratio: l' . is large because of the centrallimit the- orem (Key Concept 2. Accordingly...11) In general. SE(l') = \!Y(l.10) can be rewritten in terms of la" denote the value of' the t-statistic actually computed: yaci - act _ - j.Y)/n.. The Large-sample distribution of the t-stotistk: of When n is large. Calculating the p-Value When Uv Is Unknown Because Sf is a consistent estimator of a}..10) The t-Statistic The standardized statistical sample average (l' . 1.i.2 When probability Hypothesis TestsConcerning the PopulationMean 75 1[..12) for the p-value in Equation (3.i. test.ol) .. which in turn is well approximated under the null hypothesis.. Let distributed N(O.. (3.7).2).

28 = 2. but in many practical situations this preferential treatment is appropriate.14.20)/1. the p-value is 2<1>(2. That is. Hypothesis tests using a fixed significance level. Suppose it has been decided that the hypothesis will be rejected if the p-value is less than 5%.96.e value of the z-statistic is I"" = (22.when n is large. and the sample standard deviation is Sy = $18.28. E(Y). then under the null hypothesis the z-statistic has a N(O.64.15) That is. Then the standard error of Y is sy/Vii = 18.14) As a hypothetical example. of incorrectly rejecting the null hypothesis when it is true. 5%). Thus the probability of erroneously rejecting the null hypothesis (rejecting the null hypothesis when it is in fact true) is 5%. you can make two types of mistakes: You can incorrectly reject the null hypothesis when it is true. .9%.06) = 0.reject if the absolute value of the I-statistic computed from the sample is greater than 1.039. the p-value can be calculated using p-value = 2(1)( -11""1)· (3. Hypothesis Testing with a Prespecified Significance Level When you undertake a statistical hypothesis test. From Appendix Table 1. Because the area under the tails of the normal distribution outside ±1. the probability of obtaining a sample average at least as different from the null as the one actually computed is 3.05. If n is large enough. 1) distribution. or 3.06.76 CHAPTER 3 Review of Statistics Accordingly. The sample average wage is ¥"" = $22. Hypothesis tests can be performed without computing the p-value if you are willing to specify in advance the probability you are willing to tolerate of making the first kind of mistake-that is. (3.9%. or you can fail to reject the null hypothesis when it is false. TI. If you choose a prespecified probability of rejecting the null hypothesis when it is true (for example. suppose that a sample of n = 200 recent college graduates is used to test the null hypothesis that the mean wage.96 is 5%.64 .14/v'200 = 1. then you willreject the null hypothesis if and only if the p-value is less than 0. is $20 per hour. this gives a simple rule: Reject Ho if 1/""1 > 1. assuming the null hypothesis to be true.96. This approach gives preferential treatment to the null hypothesis.

the r-statistic was 2. the critical value ofthi two. and the probability that the test corrc tly "ejects the null hypothe is when the alternative is true is the power of the test. TI.15) is 5%. In the previou example of re ting the hypothesis that the mean earnings f recent college graduates is $20 per hour.0 at the 5% significance level. reporting only whether the null hypothesis is rejected at a prespecified significance level conveys less information than reporting the p-value.lf you were to test many statistical . The p-value is the probability of obtaining a te t statistic.96. Although performing the test' ith a 5% significance level is easy. the p-value is the smallest ignifi ance level at which you can reject the null hypothesis. the prespecified pr bability of a type I error-i the significance level of the test.e critical vulue of the test statistic is the value of the statistic for which the test just rejects the null hypothesis at the given significance level. Equivalently. assuming that the null hypothe is is correct.e pr bability that the test actually incorrectly rejects the null hyp thesis when it is true is the size of the test. This value exceeds 1. TI. summarized in Key oncept 3.e prespeeified rejection probability of a statistical hypothesis test when the null hypothesis is truethat is. The set of values of the test statistic for which the test rejects the null hypothesis is the rejection region.3.96. by random sampling variation. Testing hypotheses u ing a prespecified significance level does not require computing p-values.2 Hypothesis Tests(oncerning the Population Mean 77 The Terminology of Hypothesis Testing ~ A statistical hypothesi te t can make two types of mi takes: a type I error.ided test is 1. so the hypothesis is rejected at the 5% level.96. the populati n mean !iy is said to be statistically significantly different from 1-'1'.06. in which the null hypothesis is not rejected when in fact it is false. What significance level should you use in practice? and econometricians In many cases. statisticians use a 5% significance leveJ.5. If the test rejects at the 5% significance level. and a type U error. 3. The significance level of the test in Equation (3. at least as adverse to the null hyp thesis value as is the statistic actually observed. and the values of the lest stati tic f r which it docs not reject the null hypothesi is the acceptance region.5 111i framework f r te ting statistical hypotheses has some specialized terminology. and the rejeclion regi n i the values of the r-statistic out ide ±J . in which the null hypothesis is rejected when in fact it is true.ll.

if a government agency is considering permitting the sale of a new drug. For example. then you never need to look at any statistical change yonr mind! The lower the significance test.1 %.78 CHAPTER 3 Review of Statistics KEY CONCEPT 3. Compute the t-statistic 3. the most conevidence for you will never servative thing to do is never to reject the null hypothesis-but view. tbe alternative hypothesis might he that the mean exceedS so the relfJ-y. Sometimes a more conservative significance level might be in order. if It""1 > 1. so a 5% significance level. Many economic compromise.O * 1. Compute the p-value error ofY.6 summarizes hypothesis tests for the population and policy applications a legal case. evant alternative to the null hypothesis that earnings are the same for college graduates and non-college graduates is not just that their earnings differ. but rather . in the sense of using a very low significance level. In fact. the significance level used is 1%. Compute the standard 2.o· For example.o Against the Alternative E(Y) fLY. has a cost The smaller the significance difficult it becomes level. you would incorrectly reject the null on average once in 20 cases. SECY) [Equation [Equation (3.8)]. One-Sided Alternatives In some circumstances. [Equation (3. the hypothesis at the 5% significance level if the p-value is less than 0. hypotheses at the 5% level. Being conservative.6 Testing the Hypothesis E( Y) = fLy. and the null hypothesis could be that the defendant is not guilty.13)]. Key Concept 3. Reject (3. legal cases sometimes involve statistical evidence. one hopes that education helps in the labor market. In some legal settings. the lower the power can call for less conservatism level is often considered to be a reasonable mean against the two-sided alternative. to avoid this sort of mistake.or even 0. Similarly. the larger the critical value and the more if that is your of the than to reject the null when the null is false. a very conservative standard might: be in order so that consumers can be sure that the drugs available in the market actually work.05 (equivalently.14)].96). then one would want to be quite sure that a rejection of the null (conclusion of guilt) is not just a result of random sample variation.

64. to test the one-sided hypothesis in Equation (3. and the prespecified this set is called the confidence level. Test the null value !Lv.3 Confidence Intervalsfor the Population Mean 79 that graduates earn mare than non graduates. called a confidence interval.: E(Y) 11.16). region For this test is all values of the z-statistic exceeding 1.. = !Lyo against the alternative that!LY '" !Ly. with the modifi- cation that only large positive values of the I-statistic reject the null hypothesis rather than values that are large in absolute value. That is. Such a set is in sible to use data from a random the true population called a confidence the possible sample to construct a set of values that contains probability that !Ly is contained mean t-v with a certain prespecified set.a.96. this hypothesized 5% level. for example. is p-value The N(O.L.o is not rejected at the that u .. Now pick another arbitrary ___________ iiiiiiiiiiiiiiiiii====~~. p-values and to hypothesis (3.a· The one-sided of the previous the 5% rejection IF instead the alternative hypothesis is that E(Y) < !LY.l. call it !LY.16) concerns values of!LY exceeding The rejection !Ly. The confidence set for !Ly turns out to be all values of the mean between a lower and an upper Limit. The p-value is the to the distribution of the area under the standard normal distribution to the right of the calculated I-statistic..o.64.13). 3. However. "~_ . it is posprobability. r-siatistic. the confidence set is an interval.64. and write down tbis non rejected value !Ly. This is called a one-sided alternative hypothesis and can be written /-I. J) approximation = PrHo(Z > I'''') = 1. the p-value. Specifically.3. hypothesis in Equation (3. region consists of values of the I-statistic less than -1.3 Confidence Intervals for the Population Mean Because oC random sampling error.16) testing is the to computing same for one-sided alternatives as it is for two-sided alternatives. it is impossible to learn the exact value of the population mean of Y using only the information in a sample. based on the N(O.a.17) J) critical value for a one-sided test with a 5% significance level is 1. construct the I-statistic in Equation (3. then the discussion paragraph applies except that the signs are switched.e general approach > !LyO (one-sided alternative). (3. Here is one way to construct Begin by picking some arbitrary hypothesis r-statistic: value for the mean.O by computing if it is less than 1. so that the a 95% confidence set for the population mean.q)(I'''').

1) distribution. you will correctly accept 21.58S£(Y)!. write this value down on your list.64S£(Y)). In 95% of all samples. a trial value of iJ. Suppose u-v is 21.Thus the values on your list constitute 95% confidence set for iJ.y is an interval constructed in 95% of all possible random samples. mean that cannot be rejected This list is useful because it summarizes the set of hypotheses you can and cannot reject (at the 5% level) based on your data: If someone walks up to you with a specific number in mind. A bit of clever reasoning property: The probability the true value of that it contains the true value of the population mean is 95%.y = 21. value of iJ.7 interval for iJ.5.5. Fortunately. mean Thus.96SE(Y) of Y. your list will contain the true value of t-v.96SE(Y))..y= IY interval for iJ.5 has a N(O. if you cannot reject it.y 3. so that it When the sample size n is large.y. Do this again and again. goes like this. Thus the set of values of My that are not rejected at the 5% level consists of those valfor My is Ythis approach.5. Y .. Then Y has a normal distribution centered and the z-statistic testing the null hypothesis u v = 21. 95%. for it requires you to test all possible values of iJ. the probability of rejecting the null hypothesis iJ. This method of constructing a confidence set is impractical. But because you tested all possible values of the population in constructing your set.y are ± 1.5 on 21. do so for all possible values of the population mean.5 at the 5% level is 5%. iJ.80 CHAPTER 3 Reviewof Statistics ems. a 95% confidence interval + 1. in particular you tested the true value.mUD Confidence Intervals for the Population Mean A 95% two-sided confidence contains the true value of iJ. Continuing this process yields the set of all values of the population at the 5% level by a two-sided hypothesis test.y. and 99% confidence 95% confidence 90% confidence 99% confidence interval for iJ. 1.96 standard errors away from ues within ±1.965£(1'). Key Concept 3.u . you can tell him whether shows that this set of values has a remarkable The clever reasoning his hypothesis is rejected or not simply by looking up his number on your handy list. there is a much easier approach. (although we do not know this). if n is large.y as null hypotheses.y= interval for intervals for iJ.o is rejected at the 5 % level if it is more than 1. this means that in 95% of all samples. According to the formula for the r-statistic in Equation (3.13). That is. indeed.y.o and test it. 90%. = 21.7 summarizes a Y. My 0.y = IY ± 2. IY ± 1.965£(1') 0.

13.28.e coverage probability of a confidence interval for the population mean is the probability. One could instead construct a one-sided confidence interval as the set of values of J. Coverage probabilities. Consider the null hypothesis that mean earnings for these two populations differ by a certain amount.51 ~ [$20.4 Comparing Means from Different Populations Do recent male and female college graduates earn the same amount on average? This question involves comparing the means of two different population distributions.64 ± 1. Although one-sided confidence intervals have applications in some branches of statistics.L". $25.3.15]. . 3.Ly that cannot be rejected by a one-sided hypothesis test. that it contains the true population mean. 111 section summarizes how to test hypotbeses and how to conis struct confidence intervals for tbe difference in the means from two different populations.18) The null bypothesis that men and women in these populations have tbe same mean earnings corresponds to Ho i11Equation (3. be the population mean for recently graduated men. consider the problem of constructing a 95% confidence interval for the mean hourly earnings of recent college graduates using a hypothetical random sample of 200 recent coLlegegraduates where Y ~ $22.96 x 1. Then the nuLl hypotbesis and the two-sided alternative hypothesis are (3. let IJ-" be the mean hourly earning in the population of women recently graduated from coLlegeand let J.64 ± 2.4 ComparingMeansfrom DifferentPopulations 81 As an example.28 ~ 22. say do. computed over all possible random samples.18) with do ~ O. ~'ypothesis Tests for the Difference Between Two Means To iLlustrate a test for the difference between two means. The 95% confidence interval for mean hourly earnings is 22.64 and SECY) ~ 1. This discussion so far has focused on two-sided confidence intervals. they are uncommon in applied econometric analysis. TI.

is distributed can be = do· In N[Jio". . then this approximate normal distribution fJ-w used to compute p-values for the test of the null hypothesis that fJ-". practice.82 CHAPTER 3 Reviewof Statistics Because these population means are unknown./I1". s.. is the population ance of earnings for men.Jio". however./ and (T~. Thus the 1.4 that a weigbted average of two normal random variables is itself normally distributed."1. the p·value of the tWO-SIde d . according to the central distributed limit theovari- distributed eI.Y.. 1.. they are independent random variables.. III ·IV and dividing t=("1./ nm). where s. .) = (3... and 1. where eI'. .Jio". by subtracting Y. . except that the statistic is is computed only for the men in the sample..- Y./1 - ~v 5£(1. these population variances are typically unknown be estimated. (320) If both fIlii and nw are large. (J~ t:v is approximately constructed N( Mill! a." are from different randomly Thus elected samples. The r-statistic for testing the null hypothesis is constructed analog variable.-y ..y". two means). = do using Y. is defined similarly for the women. Suppose we have samples of be"'Y. then this r-statistic has a standard normal distribution... we need to know "1.:. As before." .:/I1".19) when Y is a Bernoulli random see Exercise 3. If (J"I. Because the r-statistic in Equation (3. "1. from the estimator the result by the standard error of YIII .. Let the sample average annual yoIV for women. and s.)]..). is the population variance of earnings for women. .~ is defined standard error of as in Equation (3. usly to the I-statistic for testing a hypothesis about a single population the null hypothesized value of 1'-". Also. 11/ 5£( Y _ V) lw (r-statistic for comparing .) + (eI.Y. they must be estimated 11.Jio"" (eI. for men and III the distribution of rem.. approximately where men and nw women earnings ~I" ~Il - drawn at random from their populations..15...19) For a simplified version of Equation (3. -11" mean.-Yw)-dO ..7/I1"." ." is...20) has a standard normal distribution under the null hypothesis whe n fIlii an d· Ilw ale large.. recall from Sec- tion 2. Then an estimator o[!-LI/I -""'1 is Recall that N(Jio"" To test the null hypothesis that Jio". Similarly. are known." . Because Y. they can be estimated so they must using the sample variances.. . and s.7)./11 from samples of men and women.

. But III '" 1..3 the that of to constructing a confidence interval for the difference set if means.2 that a randomized subjects (individuals 01'.96. and control groups is imental treatment.L11I - 95% confidence interval for d = fJ.5 Dlfferences-of-Means Estimation of Causal Effects Using Experimental Data 83 test is computed exactly as it was in the case of a single population. of gender dif- 3.. For example.Y. college graduates.96SE(Y.Y...17). . which receives the experbetween the sample means of the treatment of the causal effect of the treatment. (3.64. which does not receive the treatment.).. the nul1 hypothesis is rejected at the 5% significance value of the r-statisric exceeds 1.: f. d = 1-'".) ± 1.. If the alternative is one-sided rather than two-sided (tbat is.96.1-'".21) With these formulas in hand..w is (Y. and a test with a 5% significance level Confidence Intervals for the Difference Between Two Population Means The method extends if for constructing confidence intervals summarized in Section between 3. more general1y. simply calculate the level if the absolute 1- (3. or to a control group.Y". controlled experiment randomly selects of interest... Because the bypothesized value do is rejected at the 5% level III > 1. is less than 1.3. To conduct a test with a prespecified statistic in Equation significance level. .20) and compare it to the appropriate critical value...96 means of those values the estimated Y. The pvalue is computed using Equation rejects when I > 1.5 Differences-of-Means Estimation of Causal Effects Using Experimental Data Recal1 from Section 1..2.e difference an estimator assigns them either to a treatment group..96. . . if the alternative is that 1-'". do wi11be in the confidence difference.S. (3. Thus the 95% two-sided confidence d within ± 1. .. III '" 1..96 standard errors of Y. the pvalue is computed using Equation (3. -1-'". the box "TIle Gender Gap of Earnings of College Graduates in the United States" contains an empirical investigation ferences in earnings of U. entities) from a population then randomly 'TI.Y. then the test is modified as outlined in Section 3. > do). .14).. That is.96 standard interval for d consists errors away from do.

21). For this reason. In economics) however.84 CHAPTER 3 Review of Statistics The Causal Effect as a Difference of Conditional Expectations The causal effect of a treatment of the treatment as measured effect can be expressed is the expected effect on the outcome of interest This in an ideal randomized controlled experiment. econometricians also called quasi-experiments. E(yIX value of Y for the treatment ideal randomized controlled level x is the difference where E(YIX treatment in the cond]. as the difference of two conditional Specij. then we can let X = 0 group.20). A for the of by the difference in the sample average outcomes that the treatment the treatment and control groups. study "natural experiments. the causal effect is also called the treatment denote the control group and X = 1 denote the treatment (that . ethically sometimes in which some event questionable. then the causal between is ineffective (3. or subject characteristics has the effect of assigning Savings. the treatment controlled experiment. randomized iments tend to be expensive. for the causal effect. well-run experiment interval 95% confidence interval for the difference in the means of the two groups is a 95% causal effect can be constructed can provide a compelling controlled experiments estimate a causal effect.E(ylx=O) in an ideal randomized is binary). ized controlled experiment. For this reason. "A Novel Way to Boost Retirement of such a quasi-experiment that yielded some surprising . so they remain rare. In the context effect. if the treatment then the causal effect level X = 0). E(Ylx=l) levels (that is. which can be tested using the r-statistic for comparing two means. given in Equation confidence interval A well-designed. experand." to different subjects as if they had been part of a randomThe box.E(YIX = 0). such as medicine. difficult to administer. Estimation of the Causal Effect Using Differences of Means If the treatment in a randomized effect can be estimated controlled experiment is binary. If the treatment effect) is is." unrelated to the treatment different treatments provides an example conclusions. The hypothesis is equivalent to the hypothesis that the two means are the same. the causal effect on Y of treatment tional expectations. expectations. so a 95% confidence using Equation (3. are com- monly conducted in some fields. If there are of experiments. experiment and = x) group (which receives level X = x) in an E( Y Ix = 0) is the expected of Y for the control group (which receives treatment only two treatment is binary treatment. in some cases. is the expected value = x) . ically.

' In 2008.74 21.10 11.87. Is the gender gap in earnings of college graduates stable.11 might' not sound like much. $4. on earn more than Socialnorms and for men was $11.50** 4. this increase is not statistically significant college-educated full-time workers aged 25-34 in the United States in 1992. Second.01 11.87).97 3. in 2008 Dollars Men Women Differencer Men V5.60 12.1"1per hour. or has it diminished Table 3. 1992 to 2008.$20. and 2008.98 .]1 ** 0. continued .11 (= $24. and 2004 were Population Sur- adjusted for inflation by putting them in 2008 dollars using the Consumer Price Index (CPI).35 2.19 3.98 20.88 2.5 Differences-oF-Means Estimation of Causal Effects Using Experimental Data 85 The Gender Gap of Earnings of College Graduates in the United States T he box in Chapter 2. the data for 2008 were collected til March 2009).41.02 20.66 1368 1230 ll81 1735 .17 10. 1996. from 1992 to 2008.87 7.66. assuming a 40-hour work week and 50 paid weeks per year.12 24. the average hourly earnings of the '1838 men surveyed ~rC"ds in Hourly Earnings in the United States of Working College Graduates.Yw) for d 1992 1996 2000 2004 2008 to.32-4.42 0.66'/1871)."14** 4.ey.80).80 These estimates are computed using data on all full-time workers aged 25-34 surve~ed in the Current POI~ulfltionS~rv. Ym- Yw SE(Ym .96 X 0.1871 3. with a standard error of $0. but over a year it adds up to $8220. Women 95% Confidence Interval $m IJm Year ¥m 23.35 = ($3.41-4.11 ± 1." shows that. The average hourly earnings in 2008 of the 1871 women surveyed was $20.87 per hour in real terms. Til us the estimate of the gender gap in earnings for 2008 is $4.95 9.27 22. and the standard deviation of earnings was $9.22** 3.22 per hour to $4.80-4.2000..The95% con- States. from $3.98 y. Earnings [or 1992.87 $.2004.2000. "The Distribution of Earn- was $24. the gender gap is large. using data collected by the Current vey. the estimated gender gap increased by $0.35 (=Y1I. n.78.98.40-4. conducted in March of the next year (for example. however. What arc the recent' trends laws governing gender discrimination in the workin the United over lime? + 9.1 suggest four conclusions.33 0. An hourly gap of $4.782/1838 average. Ages 25-34.36 9.35 0.78 1594 1379 1303 1894 1838 7.36 0. and the standard deviation of earnings ings in the United States in 2008.88 25.36 9.'10** 4.1996.58-3.80 3..3. 111e results in Table 3. male college graduates in this "gender gap" in earnings? place have changed substantially female college graduates. 2005 18.1 gives estimates of hourly earnings for fidence interval for the gender gap in earnings in 2008 is 4. First. The difference ISsignificantly dIfferent from zero at the U I % significance level.48 24.

the topic of Part It working full-time in 2008 was $23.1.93)/30. The use of the standard with critical valis testing and for the connormal distribution justified by the central limit theorem. To make earnings in 1992 and 2008 comparable in Table 3. Y".17). the CPI basket of goods and services that cost $100 in . that the "genanalysis documents der gap" in hourly earnings is large and has been fairly stable (or perhaps increased slightly) over the recent past. using data yt.40 in 2008. the gap is Jarge if it is measured instead in percent- age terms: According to the estimates in Table 3. while for men this mean was $30.5. however. which corresponds to a gender gap of23% [= (30. by multiplying 1992earnings by 1. The analysis does not." 3. experience.1992 cost $153. When the sample size is small. which applies when the sample size is large.11/$24. the standard normal distribution can provide approximation to the distribution of the z-statistic.4):As reported in Table 2.1) than it is for all college graduates (analyzed in Tahle 2. I and criti- cal values can be taken from the Student distribution. the population a poor dis- tribution is itself normally distributed.4. the r-statistic is used in conjunction normal distribution for hypothesis intervals.1. Does it arise from gender 'Because of inflation. .If.6) of the z-statistic testing the mean of a single I population is the Student distribution with n . however. Over the 16 years from '1992 to 2008.6 Using the t-Statistic When the Sample Size Is Small In Sections 3. tell us diswhy this gap exists. see Section 2. then the exact distribution (that is.1992 earnings arc inflated by the amount of overall CPI price inflation.86 CHAPTER3 Reviewof Statistics crimination in the labor market? Does it reflect dif.97 . the price of the CPI market basket rose by 53.97. a dollar in 1992 was worth more than a dollar in 2008.4%. . Third.. or education between at the 5% significance level (Exercise 3.2 through ues from the standard struction of confidence 3. The t-Statistic and the Student t Distribution I The t-stotistk: testing the mean. in 2008 women earned 16% less per hour than men did men and women? Does it reflect differences in choice of jobs? Or is there some other cause? We return to ($4. the mean earnings for all college-educated women these questions once we have in hand the tools of multiple regression analysis. that is.27). I Consider the z-statistic used to test the hypoth- esis that the mean of Y is /Lv. slightly more than the gap of 14% seen in 1992 ($3. in other words. One way 10 make this adjustment is to lise the CPI. in the sense that a dollar in 1992 could buy more goods and services than a dollar in 2008 could.534 to put them into "2008 dollars.22/$23.o. the gender gap is smaller for young college graduates (the group aaalyzed in Tahle 3..98). the finitesample distribution. a measure of the. ferences in skills. pnce of a "market basket" of consumer goods and services Constructed by the Bureau of Labor Statistics.1 degrees of freedom. The formula for this statistic .23. Fourth. Thus earnings in 1992 cannot be directly compared to earnings in 2008 without adjusting for inflation.97] among all fulltime college-educated This empirical workers.93.

however.lT?. Y" are i. Specifically. and Z and Ware independently distributed./-ty.d.4 that if 11..7). and it can be very complicated.1 the r-statistic given in Equation degrees of freedom. then under the null hypothesis distribution with n ..i. (3.o v. . lTh the z-statistic can be written as such a ratio. X~-I distribution fat all n.0).22) can be written as I = Z/VW/(n -1).8).22) has an exact Student is normally distributed. Substitution the r-statistic: (3.1 degrees of freedom. It follows that if the population distribution of Y is normal. When 1']. . In addition. with a chi-squared W is a random with n . under general conditions the standard normal has a to if the sample size is large and the null hypothesis of Y if n is large. Yu9.o)/V IT?/n and let W= (n -l)s?.22) has a Student I distribution case in which the exact distribution of the r-statistic distributed. .1 degrees of freedom.2. The exact distribution of Y.r. and the population distribution of Y is N(/-ty. then some algebra! shows that the z-statistic in Equation (3. and the population of1'is distribution IT?/n) for all n. let Z = (1' .d. .. recall [rom Section 2. then Z = (1' . it can As discussed is true [see Equation the r-statistic be unreliable distribution normally tribution (3. thus. W = (n -l)s?/lT? has a of Y is N(}J-y. one special is relatively simple: If Y is I diswith (3. Although is reliable for a wide range of distributions if n is small. if the null hypothesis uv = }J-y. where Z is a random variable with a standard normal distribution.. where the standard error of l' is given by Equation for of the latter expression into the former yields the formula 1'-/-ty..o)/V IT?/n has a standard normal distribution for all fl.n 11-1 .- and dividing by ~ and collecting terms: VsNn VU9!11 \j~ Is! (1'-/"0) (n-l)s9l<T9~Z.-VW/(n_l).22) where standard s? is given in Equation in Section normal distribution (3..3. of the z-statistic depends on the There is.6 Using the I-Statistic When the Sample Size Is Small 87 is given by Equation (3. the sampling distribution exactly N(/-ty. of freedom is defined distribution To verify this result. the r-statistic approximation (3.4 that the Student n -1 degrees variable to be the distribution of Z/VW/(n -1)./-ty. and lTh then l' and sr are independently I distributed. 3.10). then the z-statistic in Equation with n . If the population distribution the Student I distribution then critical values from tests and to construct can be used to perform hypothesis "The desired expression is obtained by multiplying {~1'-I-"'O (1'-1-'". .i.o is correct. Recall from Section 2.. Y" are i.12)]..

error has a Student I distribution with /I. would be for /ky.15 null hypothesis would be rejected at the 5% significance sided alternative.09). If the population in group w. + II" - 2 .. consider t"" = 2. interval constructed The t-stotistit: testing differences of means. distribution of Y is normal.y ) I III 'III· of Y in group /'11.. distribution is N(/k"" (T. and if the two group variances are the I . is 2. . A modified version of the differences-of-means standard error formula-the dent I distribution "pooled" standard when Y is normally distributed. based on a different an exact Stustandard 3. does not have a does not apply here because the variance estimator chi-squared distribution.88 CHAPTER 3 Review of Statistics a hypothetical is n . The (Exercise as Adopt the notation of Equation (3.20).) .Jooleri X in group 11/ and the second sumerror of the diferror is the mation is for the observations I-statistic is computed pooled standard error . given in Equation (3.19) so that the two groups are denoted and pooled variance estimator is S2 pooe Id= 1 I'lm+nw-2 [ ~ (Y. (J".~v) V1/1I11/ + If"'wl and tile pooled .15 and n = 20 so that the degrees of freedom t9 ' = J 9.. The 95% confidence interval distribution. .19) does not produce a denominator in the r-statistic r-statistic.". tion.)2] I (3.. .1=1 v"i+ ~ i=l (Y.20). III error formula-has however.(T". even if the population error ill Equation The I-statistic testing tudent the differI distribu- ence of two means. This confidence wider than the confidence value of 1.- V.09 SE(Y)..~). Because dix Table 2. The Student used to compute I distribution the standard with a (3.23) group III group w where the first summation is for the observations = S. 2 _ 2) same ( t rat IS. As an example. From Appe-.then under the null hypothesis the r-statistic computed using the pooled standard degrees of freedom.09. U".1l . using Equation (3. the level against the twn. where the standard SEpootec I( -y. if the population distribution of Y in group w is N(/k". using the 119 is somewhat normal critical Y± 2.I problem in which confidence intervals.21). The pooled standard ference in means is SEpooled(Y. constructed interval using the standard > 2. the pooled of observations error formula applies only in the special case that the two groups have the same variance or that each group has the same number w.96. the 5% two-sided critical value for the distribution the r-sratistic is larger in absolute value than the critical value (2.

lax code. Madrian and Shea studied changed the default option firms to offer 40"1 plans in which enrollment (k) default.6 Using the r-Sratist« When the Sample Size Is Small 89 A Novel Way to Boost Retirement Savings M any economists think that people do not save Conventions! methods savings focus on finan- Madrian between and Shea found the workers 110 systematic before differences the enough for retirement. Thus. (treatof savings. their sample is large. This research had an important then takes it But. study published Madrian and Shea found that the default enrollment rule made a huge difference-The enrollment rate for the "opt-in" (control) group was 37. Laibson. To learn more about the design of retirement' behavioral economics and change and not automatically enrolled (but could opt in). Madrian (2008).4% (n whereas the enrollment Madrinn and Dennis Shea considered unconventional method for stimulating = 4249). and savings plans. cial incentives.S. although they (computed in Exercise3. be wrong? Docs the of the growing field of "behavioural both could lead to accepting option. in such plans. Brigitte one such retirement and the causal effect of the change could be estimated by the difference in means between the two groups. the Pension Pro- In August 2006. but there also has been an upsurge in assigned treatment interest in unconventional for retirement. How could the default choice matter so much? firms. The estimate taken out of the paycheck of participating employ.They groups of workers: those hired prominently in testimony the year before on this part of the legislation. .8% to 50.2%. Congress tection Act that (among a large firm that passed otber things) encouraged is the and ment.3. However.5% (= 85. called 401(k) plans effect is 48. and those hired in the year after the change and automatically enrolled (but could opt out). hired and after for encouraging retirement change.9% . savings ment) group was 85. see Benartzi Choi. Enrollment savings plans in rate for the "opt-out" in full or in part. and Shea wondered.37. the 95% confidence for the treatment effect is after the applicable section of the U. Neither explanation is economically only if they choose to opt in. 46. ees. The econometric Shea and others featured findings of Madrian for its 40I(k) plan from compared two the nonparticipation 10participation. The financial aspects of the plan remained the same. from an econometrician's the change was like a randomly perspective.15) tight. or maybe they just didn't want to think about growing rationalold.9% (/7 the treatment Because = 5801). Many firms offer retirement which the firm matches. According tional economic models enrollment-opt of behavior. employees arc enrolled to convertof Maybe workers found these financial choices too confusing. at other at some firms employees in the plan. is always optional. and and Thaler (2007) and Beshears." and The rational worker the optimal action. Madrian could conventional economics the default enrollment method of enrotlment in a savings plan directly affect its enrollment rate? To measure the effect of the method of enroll- practical impact. In an important ways to encourage saving in 2001. are automatically enrolled can opt out. the method out or opt in-should computes not matter: but both are consistent with the predictions economics.4%).

distribution in large samples.19) does not inferences about differences with the large- bave a Student t distribution..0 I. in means should be based on Equation (3. and the standard normal distribution normal is negligible between interthe if the sam- > 15. . however.Theretests and confidence intervals-about the mean of a Street"). in tbe United States in 2008" and "A Bad Day on Wall of the {-statistic is valid if the sample size is large. see the boxes in Chapter Distribution of Earnings imation to the distribution fore. in fact. For economic variables. which allows for different group variances. therefore.. are normal. If the population variances the null distribution * II". is used. difference never exceeds 0. the pooled standard error and the pOoled Use of the Student t Distribution in Practice For the problem of testing the mean of Y.. the difference in the p-values computed using the Studistributions never exceeds 0.90 CHAPTER 3 Reviewof Statistics The drawback of using the pooled variance estimator S~ool'd is if the two population variances are the same (assuming 11. distributions correct standard error formula. it does not even have a standard t-statistic should not be used unless you have a good reason to believe population variances are the same. this does not pose a problem because the difference Student t distribution ple size is large. inferences-hypothesis if the 2. any economic reason for two groups having different means typically implies that the two groups also could have different ances.). variis as the and the using the standard error formula in Equation (3. Even though the Stndent {distribution is rarely applicable in econom ics.002. the pooled standard error formula is inappropriate. for n > 80. the applications. the pooled variance estimator is biased and inconsistent. In most modern enough for the difference between the Student t distribution mal distribution to be negligible. are different but the pooled variance formula distribution.In practice. If the population I that it applies only variances are different. Even if the underlying data are not normally distributed.19).19). some software uses the Student t distribution to compute p-values and confidence vals. Therefore. When comparing two means. used in conjunction sample standard normal approximation. and in all applicalarge norand the standard the sample sizes are in the hundreds or thousands. the Student {distribution is applicable normal distributions are the exception (for example. "TIle underlying population distribution of Y is normal. the normal approx- distribution should be based on the large-sample normal approximation. Accordingly. even if normal that the of the pooled t-statistic is not a Student the data are normally distributed. For n dent { and standard tions in this textbook. Even if tbe population In practice. given in Equation t-statistic computed (3.

how- of the joint probability distribution or correlation.-X)(Y. and 1). . The population covariance and correlation ever. and the Sample Correlation What is the relationship ers. this worker's by the highlighted dot in Figure 3. Y (earnings). The sample covariance.. however.1 instead of 11. .3.2 is a scatterplot (Y) for a sample of 200 managers of age (rom the March 2009 CPS. Like the estimators discussed a population SXY. like many othThis section reviews the one variable. to another. X (age). n. they are computed by replacing mean (the expectation) is _ sample mean. in which each observation in the information is by the point (X" 1).=1 V). Sample Covariance and Correlation TIle covariance population covariance and correlation were introduced in Section 2.. this difference stems from using X and Y to estimate .3 as two properties distribution is unknown.here. i = 1. . denoted 1 n !1 _ _ SXy=---=-y2:(X. (3. 100. and the sample correlation coefficient.and the Sample Correlation 91 3.24) is computed by dividing by n.. one of the workers in this samage and earnings shows a positive rela- ple is 40 years old and earns $35.This relationship is not exact.7 Scatterplots.2. and earnings could not be predicted perfectly using only a person's age. Figure 3.2 corresponds to an (X. previously in this chapwith a ter. are Y) pair for one of the observations. the Sample Covariance. For example. the Sample Covariance. The sample covariance and correlation of the population and are estimators of the population covariance and correlation. the average in Equation (3. indicated tionship For example. relates between age and earnings? This question.). three ways to summarize the relationship between variables: the scatterplot. of the random variables X and Y Because the in practice we do not know the population can.7 Scatterplots. observations represented industry (X) and hourly earnings on X. be estimated by taking a random sample of n members collecting the data (X" Y. Each dot in Figure 3.24) Like the sample variance.78 per hour. TIle scatterplot between age and earnings in this sample: Older workers tend to earn more than younger workers. Scatterplots A scat1erplot is a plot of n. sample covariance.

correlation is unitless and lies between -1 and 1: I :s. More generally. the correlation is ±1 if the scallerplot .• . Like the population correlation. = . The data are for computer and intorrna- tion systems managers from the March 2009 CPS. The hiqhlighted dot corresponds to a 40-year-old worker who earns $35. .78 per hour. The sample correlation coefficient.25) The sample correlation measures the strength of the linear association between X and Y in a sample of n observations.:• e. or sample correlation. = Y. Age Average hourly earnings • 80 70 60 50 40 30 20 ###BOT_TEXT### • • • • • r- r. · •• • • 40 45 35 • 25 30 50 55 60 65 Age Eachpoint in the plot represents the age and average earnings of one of the 200 workers in the sample.Y. If the all i. is denoted is the ratio of the sample covariance to the sample standard deviations: whether 'Xyand (3.1. it makes little difference division is by n or n . When n is large. the respective population means.• • 0 20 • • •• •• • • ••• ••• • • • • • • • •• • • • ••• •• • • • ••••• • • • •• •• • •I • • • :• • • • • • . for all i and equals -1 if is a straight X. for line. 1. the sample 'xyl The sample correlation equals 1 if X.92 CHAPTER 3 Reviewof Statistics t:ml:!DD 100 90 Scallerplot of Average Hourly Earnings vs. • •I • • • • • • • • ••• • • •• • • • : : • •• • • •••• • •• • • •••••• •• • •••• •• • • • • • • ••• • • • •• • • • • ••• • • ••• •• • • • • • ··.

then the correlation is 3316/(9. the sample standard deviation of age is SA = 9. and the sam3.20). rather. The covariance between age and earnings is SAE = 33. have finite fourth moments is similar to the proof in Appendix 3.8.3c shows a scatterplot with no evident . lj).3a relationship between tbese variables. examples of scatterplots and correlation.3.07 X 1437) = 0. it means that the points in the scatterplot fall very close to a straight line.26) In other words.9.l'Y ---'--+ o-Xy· Like the sample vari- (3..07 years and the sample standard deviation of earnings is SE = $14. The closer the scatterplot is to a straight line. the sam~ ple correlation coefficient is consistent..25 or 25%. SampleCovariance.i.26) under the assumption that (X. the SampleCorrelation the and 93 line slopes upward.25 or 25%. and Y.25 means that there is a positive relationship between age and earnings.3b shows a strong negative relationship with Figure 3. then there is a positive relationship between X and Y and the correlation is L If the line slopes down. Because the sample variance and sample covariance are consistent. in which case the sample standard deviations of earnings is 1437¢ per hour and the covariance between age and earnings is 3316 (units are years X cents per hour). lj) are i. To verify that the correlation does not depend on the units of measurement. P s. The correlation of 0. Suppose that earnings had been reported in cents.3 that the sample covariance is consistent and is left as an exercise (Exercise 3.16/(9. tbe closer is the correlation to ±L A high correlation coefficient does not necessarily mean that tbe line has a steep slope.3 gives additional shows a strong positive linear ple correlation is 0.7 Scatterplots. The proof of the result in Equation (3. corr(X. this relationship is far from perfect.d. and that X.16 (the units are years X dollars per hour.37) = 0. Thus the correlation coefficient is "IE = 33.07 X 14.37 per hour. rXY Example. That is. Figure 3. in large samples the sample covariance is close to the population covariance with high probability. For these 200 workers. ance. Figure 3. that is. then there is a negative relationship and the correlation is-1. but as is evident in the scatterplot. not readily interpretable). consider the data on age and earnings in Figure 3. Figure a sample correlation of -0. the sample covariance is consistent.2. As an example. Consistency of the sample covariance and correlation.

.3b show strong linear relationships between X and Y In Figure 3. '.' . :oi fe... the sample correlation is zero..94 ~ CHAPTER 3 Review of Statistics Scatterplots for Four Hypothetical Data Sets in Figures The scatterplots 3.. . • '0. . ' . 0: . This final example emphasizes all important point:The correlation is a measnre of linear association.3d.. In Figure 3. °0' . .. : ... ... to:' . .. 0° : 60 50 40 30 20 10 0 70 • :' to· "'~'.. . .:I. . There is a relationship not linear. \:..' : '.. • .... • °0... cernable relationship is zero.3a and 3.0 (quadratic) -~ relationship. .' '.. ...3d. :.. .. . . :. for these data.:'.. . 80 90 100 110 120 130 x t10 120 130 x 0 70 (a) Correlation y = (b) Correlation = -0.." . . . 80 90 (c) Correlation = 0.. .. and the sample correlation tionship: As X increases.rI'!. 0° ".. • 0° '.3d shows a clear relathis dis- Y initially increases bnt then decreases.>fo. the two y 70 60 50 40 30 20 10 0 70 80 90 100 +0. X is independent of Yand the two variables are uncorrelated. "0. '..." • -:'. 100 110 120 130 x . • ::###BOT_TEXT###0. " . Despite between X and Y . •\ . ..0° .' I' ..0 (d) Correlation = 0.~ . .. but it is I=~=:... .: • .9 y 70 variables also are uncorrelated even though they are related nonlinearly. ~. 60 50 40 30 20 10 i.=~~ .:. " " ..8 Y 70 60 50 40 30 20 10 0 70 80 90 100 110 120 130 x 70 .3(.. '. 0". small values of Yare associated values of X.' . '. ..". Figure 3. the reason with both large and small coefficient is that. in Figure 3.

Ly and variance u~ = a'P/n. Y is unbiased.d.. 2.::=~=~ . 4... Hypothesis tests and confidence intervals for the difference two populations are conceptually of a single population. The sample average.... b.. and measures the linear relationship between two vari- is. iiiiiiiiiiiiiiiiiiiiiiiiiiii. by the central limit theorem.. The r-statistic can be used to calculate the p-value associated with the null so that it con- hypothesis.. 3. Ylj"" Y" are i.. the sampling distribution of Y has mean J. 6. d. is an estimator of the population mean.i. pling distribution when the sample size is large.Ly in 95% of all possible samples.. 5. Y is consistent. how well their scatterplot is approximated Key Terms (66) (66) and efficiency (68) estimator tests (70) (70) hypothesis probability) (70) (71) hypothesis (70) alternative (significance variance (73) (69) (67) (Best Linear Unbiased sample standard standard error of deviation (73) (74) degrees of freedom estimator estimate BLUE bias. consistency.When a. The sample correlation lation coefficient ables-that coefficient in the means of similar to tests and intervals for the mean is an estimator of the population by a straight correline.. Estimator) least squares hypothesis alternative two-sided p-value sample null hypothesis Y (74) I-statistic (r-ratio) (75) test statistic (75) type 1 error (77) type II error (77) significance level (77) critical value (77) rejection region (77) acceptance region (77) size of a test (77) ______ .. The r-statisric is used to test the null hypothesis that the population takes on a particular value. If n is large. t-v. and Y has an approximately normal sammean normal c.. Y... the /-statistic has a standard sampling distribution when the null hypothesis is true..Key Terms 9S Summary 1. A small p-vaJue is evidence that the null hypothesis is false. A 95% confidence interval for !"y is an interval constructed tains the true value of J. by the law of large numbers...

5 of confidence play in statistical hypothesis test- intervals? is? Among hypoth- What is the difference between a null and alternative hypothe size. (b) -1. Exercises 3. Relate your answer to the law of large numbers. is an estimator of the treatment Sketch a hypothetical scatterplot for a sample of size 10 for two random variables with a population correlation of (a) 1.96 CHAPTER 3 Reviewof Statistics power of a test (77) one-sided alternative confidence set (79) confidence level (79) confidence interval (79) coverage probability test for tbe difference means (81) causal effect (84) hypothesis (79) treatment scatterplot effect (84) (91) sample covariance (91) sample correia lion coefficient (sample correlation) (92) (81) between two Review the Concepts 3. (d) -0. significance level. Determine the mean and variance ofY from an i. 3. and (c) n = 1000.9.0. Provide an Y and the population 3.6 Why does a confidence interval contain of a single hypothesis test? Explain why the differences-of-means randomized 3. and an estimate.0.3 A population distribution has a mean of 10 and a variance of l6.!"y = 100 and answer the following questions: O'f = 43. (b) rt = 100. (e) 0.2 Explain the difference between an estimator example of each.i.8 controlled experiment.d. 3. more information than the result 3.7 estimator. applied to data from a effect.5. and power? Between a one-sided alternative esis and a two-sided alternative hypothesis? 3.1 Explain the difference between the sample average mean. Use the central limit theorem to .4 What role does the central limit theorem ing? In the construction 3.0. (c) 0.1 In a population. sample from this populati n for (a) n = 10.

What is the p-value for the test Ho: p ~ 0. draws from this distribution. to calculate the standard error of your estimator. In a random sample of size n = 100.3 .5? for the test Ho: P = 0. and let p be the fraction of survey respondents preferred the incumbent.p lin. that they would vote for the Let at who the incumbent In a survey of 400 likely voters. 98). HI: p > 0. 215 responded incumbent and 185 responded p denote the fraction that they would vote for the challenger.50 vs. Why do the results from (c) and (d) differ? f. b.d. Hl: p '" 0. evidence that the at the time of the survey? c. . Show that var( 3. Y" be i.Exercises 97 a.i. Y = 1) =p .5 calculations.Plin. test the hypothesis level. In a random sample of size n = 64. of all likely voters who preferred the time of the survey. find PrCY < Y> 101).. b.5 vs. Construct a 99% confidence interval for p.5? e. c.2 < Y < 103). c. Use the estimator of the variance of p. p is an unbiased p) = p(l estimator of p.3: a. find Pr(l01 c. and let A. a. II. What is the p-value d. HI: p '" 0. Why is the interval in (b) wider than the interval in (a)? d. A and candidate who prefer candidate A survey of JOSS registered to choose between candidate of voters in the population and the voters are asked A. Let Y be a Bernoulli random variable with success probability Pre and let l'[.50 at the 5% significance voters is conducted. find Pre 3. of successes (Is) in this sample. interval for p. . Did the survey contain statistically significant incumbent was ahead of the challenger Explain. Letp be the fraction Show that b. Let p denote the fraction p denote the fraction of voters in the sample who prefer Candidate . 3. Construct a 95% confidence b.5 vs.. Without doing any additional Ho: p 3.p(l . B. Show that p = Y. = 0. In a random sample of size n = 165.4 Using the data in Exercise 3. Use the survey results to estimate p.

What is the probability that the true value of p is contained in aU 20 of these confidence intervals? n. * 5 usingthe usual r-statisuc yields a p-vaJue of 0.and the sample standard deviation is 123.HI: P > O. Y" be i. I.pi> 0. Ill.i. draws from a distribution with mean u. . Test Ho: p = 0. val? Explain. it is half the length of 9S% confidence interval.96 X SEt p). v. c. a 9S% confidence interval for p is constructed.p I. = 0. Construct a 95% confidence interval for p. .s] > 0. Suppose that you decide to reject flo if * = O.01) :5 O. b. Test Ho: P = 0. Suppose that the survey is carried out 20 times. Construct a 95% confidence interval for the population mean test score for high school seniors. = 6 is contained a. 11.5 vs. In the survey.98 CHAPTER3 Reviewof Statistics a. The sample mean test score is 1110. A survey using a simple random sample of 600 landline telephone numbers finds 8% African Americans. Does the 9S% confidence interval contain u: = 5? Explain. Construct a 99% confidence interval for p. Can you determine if jJ. = S versus Hv. .8 A new version of the SAT test is given to 1000 randomly selected high school seniors. You are interested in the competing hypotheses flo: P fll: p 0. that is. fll: P * O. 11% of the likely voters are African American..Susing a S% significance level. IV. Construct a SO%confidence interval for p.Q2. In survey jargon. the "margin of error" is 1. Compute the power of this test if p = 0. i.Svs. What is the size of this test? II.53.03.Susing a S% significance level. Ip .54. Suppose you wanted to design a survey that had a margin of error of at most 1%. jJ. b. using independently selected voters in each survey. you wanted Pr( p .5 vs. 3.6 Let flo: l'I. A test of jJ. Is there evidence that the survey is biased? Explain. How many of these confidence intervals do you expect to contain the true value of p? d. That is. in the 9S% confidence inter- 3.5.o.For each of these 20 surveys.OS.d.7 In a given population. ow large should n be if the surH vey uses simple random sampling? I 3..

H(!L > 2000. she will conclude that the new process is no better than the old process.11 Consider the estimator Y. Show that 3. Consider the null and alternative hypothesis HO:!L = 2000 vs. . Can you conclude with a high degree of confidence that the population means for Iowa and New Jersey students are different? (What is the standard error of the difference in the two sample means? What is the p-value of the test of no difference in means versus some difference?) 3. Construct a 95% confidence interval for the mean score of all New Jersey third graders. and the sample standard deviation.25o}/n.10 Suppose a new standardized test is given to 100 randomly selected thirdgrade students in New Jersey.Exercises 99 3. Sy. What testing procedure should the plant manager use if she wants the size of her test to be 5%? 3.~.12 To investigate possible gender discrimination in a firm. An inventor claims to have developed an improved process that produces bulbs with a longer mean life and the same standard deviation. b. Suppose the new process is in fact better and has a mean bulb life of 2150 hours. Construct a 90% confidence interval for the difference in mean scores between Iowa and New Jersey. What is the size of the plant manager's testing procedure? b. The sample average score Yon the test is 58 points. a. otherwise. c. The authors plan to administer the test to all third-grade students in New Jersey. She says that she wiil believe the inventor's claim if the sample mean life of the bulbs is greater than 2100 hours.1). defined in Equation (a) E(Y) =!Ly and (b) var(Y) = 1. The plant manager randomly selects 100 bulbs produced by the process. What is the power of the plant manager's testing procedure? c. producing a sample average of 62 points and sample standard deviation of 11 points. (3. is 8 points.9 Suppose that a lightbulb manufacturing plant produces bulbs with a mean life of 2000 hours and a standard deviation of 200 hours. Let 11 denote the mean of the new process. A summary of the resulting monthly salaries follows: n iiiiiiiiiiiiiiiiiiiiiiii======--'. a sample of 100 men and 64 women with similar job descriptions are selected at random. Suppose the same test is given to 200 randomly selected third graders from Iowa. a.

b.14 Values 01 height in inches (X) and weight in pounds (Y) are recorded a sample of 300 male college students.15 Let Y" and y" denote Bernoulli random variables [rom two different popa and b. Convert these statistics to the metric system (meters grams).. Y= 158 Ib. hypothesis. TI. ~ n the following results were found: .) [gender discrimination fir t state the r-statistic. with sample average ulations.5 rXY = in.13 Data on fifth-grade districts in California •. 3..8 in. Do these data suggest that the firm is guilty in its compensation 3.5 .85. . Construct population.100 CHAPTER 3 Reviewof Statistics Average Men Women Salary (Y) Standard Deviation (Sy) n $3100 $2900 $200 $320 tOO 64 of •. policies? Explain. and a random sam pic of size lib is chosen from population b.2 and standard = 19.2 lb. null and alternative third. SXl' = 21.0 significant evidence 19.9 238 182 Is there statistically that the districts with smaller classes have higher average test scores? Explain..4 650. When the districts were divided into districts with small cia scs « students per teacher) and large classes 20 (2: 20 students per teacher). s« = 1. compute use the p-value the p-value associated to answer the question.-----------------Class Size Small Large Average Score (Y) Standard Deviation (Sy) 6S7. Sy = 14.4 t7.e resulting summary statistics from are X = 70. What do these data suggest about wage dif[erences in the firm? Do they represent statistically significant evidence that average wage men and women are different? (To answer this question.) = Pb' A random sample of size II" is chosen from population a. 3. X lb. test scores (reading yield and mathematics) deviation for 420 school Sy Y = 646. compute the relevant with the I-statistic. Suppose that E( y") = P« and E( Yt. and finally. a 95% confidence interval for the mean test score in the b. and and kilo- 0. denoted denoted p". second.73 in.

Exercises with sample average denoted jJb' Suppose the sample from population independent of the sample from population b. a. Show that E(p,)
= Pa and var(Pa) = Pa(1-

101 a is

Pa)/n". Show that

E(p,,)

= p"

and

var(Pb)
1I

= Pb(l - p")/n,,.

b. Show that

var(p

- P'b ) = Pa(1 nn Pal + p,,(1 tlb. Pb) (H"tnt. R emem ber -, I" er

that the samples are independent.) c. Suppose that n, and nb are large. Show that a 95% confidence interval for Pa - p" is given by

(Pa - p,,) ± 1.96

cP,,""( 1 a",,) - "-c,---.'-'P + /)" (1n» P b ) . n{/
interval for Pa - p,,? Savings" in Section group and

How would you construct d.

a 90% confidence

Read the box "A Novel Way to Boost Retirement 3.5. Let population population confidence b denote

a denote the "opt-out"
the "opt-in" (control)

(treatment)

group. Construct a 95%

interval for the treatment

effect,p, - p".

3.16 Grades dents

on a standardized in the United

test are known to have a mean of 1000 for stuto 453 randomly

States. The test is administered

selected students in Florida; in this sample, the mean is 1013 and the standard deviation (s) is 108.
3.

Construct a 95% confidence Florida students.

interval for the average test score for

b. Is there statistically differently

significant evidence that Florida students perform

than other students in the United States? are selected at random from Florida. They are deviation of 95.

c. Another 503 students

given a 3-hour preparation

course before the test is administered.

Their average test score is 1019 with a standard \. Construct a 95% confidence

interval for the change in average

test score associated
I\.

with the prep course. significant evidence that the prep course

Is there statistically helped?

d. The original 453 students

are given the prep course and then are change in their test of the change is 60

asked to take the test a second time. The average scores is 9 points, and the standard deviation points. \. Construct a 95% confidence rest scores.

interval for the change in average

102

CHAPTER 3

Review Statistics of
II.

Is there statistically significant evidence that students will perform better on their second attempt after taking the prep curse?

iii. Students may have performed better in their sec nd attempt because of the prep course or because they gained test-taking experience in their first attempt. Describe an experiment that would quantify these two effects. 3.17 Read the box "The Gender Gap of Earnings United States" in Section 3.5. a. Construct a 95% confidence interval for the change in men' hourly earnings between 1992 and 2008. b. Construct a 95% confidence interval f r the change in w men's average hourly earnings between '1992 and 2008. c. Construct a 95% confidence interval for the change in the gender gap in average hourly earnings between 1992 and 200 . (Hirn:
"Y,1l.1992 Yw.l992

of

allege Graduates

in the

average

is independent of

~1I,2008 - ~v,2008')

3.18 This exercise shows that the sample variance i an unbiased
the population ance
(J~.

estimator

of

variance when

Y1"",

Y,', arc i.i.d. with mean J.Ly and vari-

a. Use Equation (2.31) to show that E[( Y, - y)2] = var( Y,) - 2cov( Y" Y) var(Y). b. Use Equation (2.33) to show that cov(Y, Y,)

+

= uNII.

c. Use the results in (a) and (b) to show that £(s~) 3.19 a. Y is an unbiased estimator of !ky. Is b. Vis a consistent estimator of t-v- Is

=

IT~.

y2 an un iased e .timator of Jk~?

y2

a c nsi tent cstimat

r of !k~?

3,20 Suppose that (Xi, Y,) are i.i.d. with finite fourth m mcnts, Prove that the sample covariance is a consistent estimator of the populati n covariance, that is, SXY ---L.. a xy, where SXY is defined in Equati n (3.24). (Ill/II: Use the strategy of Appendix 3.3 and the auchy chwartz inequality.) 3.21 Show that the pooled standard errol' [S£,wol"I(Y,,, - Y,,)] given following Equation (3.23) equals the usual standard error for the difference in means in Equation (3.19) when the two group sizes are the same (11m = 11".).

Empirical Exercise

103

Empirical Exercise
E3.1 On the text Web site http://www.pearsonhighered.com/stock_watson/You will find a data file CPS92_08 that contains an extended version of tbe dataset used in Table 3.1 of the text for the years 1992 and 2008. It contains data on full-time, full-year workers, age 25-34, with a high scbool diploma or B.A.lB.S. as their highest degree. A detailed description is given in CPS92_08_Description, available on the Web site. Use these data to answer the followingquestions. a. Compute the sample mean for average hourly earnings (AHE) in 1992 and in 2008.Construct a 95% confidence interval for the population means of ARE in 1992 and 2008and the change between 1992and 2008. b. In 2008,the value of the Consumer Price Index (CPI) was 215.2. In 1992, the value of the CPI was 140.3.Repeat (a) but use ARE measured in real 2008 dollars ($2008);that is, adjust the 1992 data for the price inflation that occurred between 1992 and 2008. c. If you were interested in the change in workers' purchasing power from 1992 to 2008, would you use the results from (a) or from (b)? Explain. d. Use the 2008 data to construct a 95% confidence interval for the mean of ARE for high school graduates. Construct a 95% confidence interval for the mean of ARE for workers witb a college degree. Construct a 95% confidence interval for the difference between the two means. e. Repeat (d) using the 1992 data expressed in $2008. 1'. Did real (inflation-adjusted) wages of high school graduates increase from 1992 to 2008? Explain. Did real wages of college graduates increase? Did the gap between earnings of college and high school graduates increase? Explain, using appropriate estimates, confidence intervals, and test statistics. g. Table 3.1 presents information on the gender gap for college graduates. Prepare a similar table for higb school graduates using tbe 1992 and 2008 data. Are there any notable differences between the results for high school and college graduates?

i~
104 CHAPTER 3 Review

of Statistics

APPENDIX

3.1

The U.s. Current Population Survey
Each month, the Bureau of Labor Statistics in the U.S, Department
population, including the level of employment, unemployment,

of Labor conducts

the

Current Population Survey (CPS), which provides data on labor force characteristics 50,000 U.S. households arc surveyed each mont h. The sample is chosen by randomly ing addresses from a database of addresses from the most recent decennial mented with data on new housing units constructed

of the select_

and earnings. More than

census aug-

after the last census. The exact random

sampling scheme is rather complicated (first, small geographical arcus arc randomly

selected, then housing units within these areas arc randomly selected): details can be found in the Handbook of Labor Sratisrics and on the Bureau of Labor Statistics Web site (www .bls.gov), The survey conducted each March is more detailed than in other m nths and asks

questions about earnings during the previous year.The statistics in Tables 2.4 and 3.1 were computed using the March surveys.The CPS earnings data are for full-time workers, defined to be somebody employed ous year. more than 35 hours per week for at least 48 weeks in the previ-

APPENDIX

3.2

Two Proofs That Yis the Least Squares Estimator of fLy
This appendix provides two proofs. one using calculus and aile not, that Y minimizes sum of squared prediction
estimator or

the

mistakes in Equation

(3.2)-thnl

is, that

Y

is the least squares

E( Y),

Calculus Proof
To minimize the sum of squared prediction mistakes. take its derivative and set it to zero: d
dm

I-I

:? (Y; -III)' ~ -2 2:(l'i
1=1

II

1/

II

-III)

= -22: Y;+ 2/1111 0, =
;=1 'C'IJ ~i=1

(3.27)

Solving for the final equation for

//'1

shows that

(V'i-II'/

)2' IS

••. nurunuze d

I W1 ell

m=Y.

from which it follows that Y is the =Y _ so that m=V-d.by setting m = Y -so that Y is the least APPENDIX 3. as stated 1')' = [( Y.[Equation (3. ~.2)J is II H n ~ (Y. as small as possible. .Y) = O.. when Y. .3 A Proof That the Sample Variance Is Consistent This appendix uses the law of large numbers to prove that the sample variance sistent esti maror of the population variance Sf is a con- a$.~I (11- 11I)2 is minimized by choosing d to make the second term. .i.!Ly)J' = (Y.9). Because both terms in the final line of Equation (3. - where the final equality follows from the definition ofY [which implies that L.d.1')2 into the definition of sl.28) where the second equality uses the fact that L.-V)'+ 2d( Y. Let d and Y In.!LY)+ (1' . Then (Y.(Y.!LY)] and by collecting terms. 1'.!Ly)2 2( Y..A ProofThat the Sample Variance Is Consistent 105 Noncalculus Proof The strategy is to show that the difference between the least squares estimator least squares estimator. d = O-that is. add and subtract uv to write (Y.28) are non negative and because the first term does not depend on d.-VJ+d)'=(Y.-m)'=(Y. . This is done by setting squares estimator of E( Y).7)].!LY)'.fLy) = fI( 1" .!Ll')( 1" .l1lL1s the sum of squared prediction mistakes [Equation (3. nd2./ are i.1')' + lid'. .-[V-dj)'=([Y. 1=1 1"'1 i=l (3. . -m)' 1=1 = L(Y.- in Equat ion (3. must be zero.~l( tj . .1")' + 2d2.1') + nd' ~ L(Y. we have that First.~l(}j . .(1' . and E( Yf) < !LY) . .V) II + d'. Substituting this expression [or (Y. .

ltYj..) = u~ (by the definition of the vari. Now E( W. But W= - E( yl) < -I'Y) " "'I..). W 00. so the second term converges in probability to zero. Combining these results yields s~~ uf.P.29).d. Also. .. ance). ..-I'y)'. .) = (f9. = (Y..i.. . .. Define W. In addition. . the random variables DO = E[( Y.29) converges 2. -I'y)'] < < 00. Wn are i. so the first term in in probability to uf. . by assumption.d..106 CHAPTER 3 Review of Statistics The law of large numbers can now be applied to the two terms in the final line of Equation (3.i. so (l(n) Equation (3. 11/(n -1) I.50 p because. Because 17 -"-> I'Y.. - -"-> (f9.i.IJ-y)' .. E(W.d.. Y" are i. Because the random variables are i. and var("'[) W satisfies the conditions for the law of large E(W. numbers in Key Concept E( W. (Y ..) Thus f1.. 0. z and ~:~l (Y.6 and W ~ I'y)' ('<''' l(n)"'i~I(Y.

X. say. we show how to estimate the expected effect on test scores the line relating X and Y can be estimated by a method called ordinary least squares (OLS). Y. to another. to estimate the a sample of data on these two variables. the slope of the line relating of line relating X and Y is the effect of a one-unit change in X on Y. one student per class. the slope of the X and Y is an unknown characteristic of the population joint distribution of is. student test scores. using data on class sizes and test scores from class sizes by. She teachers and she wants your advice. The slope and the intercept of different school districts. This chapter introduces the linear regression model relating one variable. But hiring more teachers means spending more reduce the number of students per teacher (the student-teacher faces a trade-off. For instance. she will Parents want smaller classes so that their children can receive attention. Y A state implements tough new penalties on drunk drivers: What is the effect on <Y being highway deaths. class size. or years of schooling). which is not to the liking of those paying the bill! So she asks you: If she cuts class sizes. Tf she hires the teachers. or earnings). X (X being penalties for drunk driving.1 The Linear Regression Model The superintendent additional of an elementary school district must decide whether to hire ratio) by two. on another variable. This model postulates a linear relationship between X and Y. X and Y. more individualized money. Just as the mean Y is an unknown characteristic of the population distribution of Y. The econometric problem is to estimate this slope-that effect on Y of a unit change in X-using of data on of reducing This chapter describes methods for estimating this slope using a random sample X and Y.cmmD Linear Regression 4 with One Regressor highway fatalities? A school district cuts the size of its elementary school classes: What is the effect on its students' standardized test scores? You successfully complete one more year of college classes: What is the effect on your future earnings? All three of these questions are about the unknown effect of changing one variable. what will the effect be on student performance? 107 . 4.

1) so that A TesiScore ~ (3CI". Then a reduction you would predict in class size of two students (4. (4. (4. Equation (4. not only would you be the change in test scores at a district associated with a change in class size. you would be able to tell the superintendent that decreasing class size by one student would change districtwide {3CfassSize' test scores by You could also answer the superintendent's changing class size by two students actual ques- tion.108 CHAPTER 4 Linear Regression with One Regressor In many school districts. 111us. We therefore is measured by standardized can depend in part on how sharpen the superintendent's question: [f she reduces the average class size by two students. To do so. A (delta) stands for "change is..'Si". of the test that test scores would rise by 1.2 points as a result reduction in class sizes by two students per class. what would beta. This straight line can be written line relating TestScore = where {3o is the intercept According to Equation able to determine f30 + {3cltlssSize X ClassSize."That f3C/assSize (4. (4.. .6) X (-2) = 1. as before. change in TestScore change in ClassSire A TestScore A Classsize' in.Siu X A ClassSize. {3C1a. where the subscript she expect the change in standardized test scores to be? We can write this as a mathusing the Greek letter ClosiSir» distinguishes the effect of changing the class size from other effects.3) of this straight line and. which concerned rearrange Equation per class.1) is the definition of the slope of a straight scores and class size. if you knew f30 and {3CI". Suppose that f3C1""Si" ~ -0.3)..2) per class would yield a predicted change in test scores of (-0. If the superintendent ematical relationship a quantitative statement about changes the class size by a certain amount. f3C1""SI".1) where the Greek letter change in the class size. f3CI""Slu is the by the change in the test score that results from changing the class size divided If you were lucky enough to know f3C1""Si. what will the effect be on standardized test scores in her district? A precise answer to this question requires changes..6. that is.Siu is the slope. student performance tests.2. and the job status or pay of some administrators well their students do on these tests. but you also would be able to predict the average test score itself for a given class size.

3) to the superintendent. and let u..4) Thus the test Score for the district is written in terms of one component. denote the other factors influencing the test score in the il" district. let Xi be the average class size in the i th district. Equation (4.] . including each district's unique characteristics (Ior example. A version of this linear relationship that holds for each district must incorporate these other factors influencing test scores. she tells you that something is wrong with this formulation. i = 1. of course.. = /30 + f3. be the average test score in the i Ih district. She is right.4. n). teachers. One approach would be to list the most important factors and to introduce them explicitly into Equation (4.5) instead of /3 C/tI.Xi + /I.. (4. where f30 is the intercept of this line and /31 is the slope.5) for each district (that is. Let Y.4) is much more general.3) (an idea we return to in Chapter 6). Two districts with comparable class sizes. however.. fnstead. Suppose you have a sample of n districts. . we simply lump all these "other factors" together and write the relationship for a given district as TesrScore = /30 + /3C1""Si" X ClassSize + other factors.1 The LinearRegression Model 109 When you propose Equation (4. . Finally. quality of their teachers. so it is useful to introduce more general notation. She points out that class size isjust one of many facets of elementary education and that two districts with the same class sizes will have different test Scores for many reasons. Then Equation (4. it should be viewed as a statement about a relationship that holds 011 average across the population of districts. she points out that even if two districts are the same in all these ways they might have different test scores for essentially random reasons having to do with the performance of the individual students on the day of the test. and textbooks still might have very different student populations.uSIz. /30 + /3ClassSize X ClasxSize.e because this equation is written in terms of a general variable Xi. . that represents the average effect of class size on scores in the population of school districts and a second component that represents all other factors. Although this discussion has focused on test scores and class size. One district might have better teachers or it might use better textbooks. for all these reasons. the idea expressed in Equation (4. For now. how lucky the students were on test day). perhaps one district has more immigrants (and thus fewer native English speakers) or wealthier families.4) can be written more generally as Y.3) will not hold exactly for all districts. [111egeneral notation /31 is used for the slope in Equation (4. (4. background of their students.

1'1. i. when X is the class size. strictly of the intercept is nonsensical. Y. a single regressor regression observations on test scores (Y) and class size (X). for example. f30 variable or the regressor that holds between The first part of Equation the popnlation regression ing to this population + f3. acconj.X. if you knew the value of X. including teacher quality. The linear regression cept 4. in Equation (4.TIle popu- lation regression line is the straight line f30 + f3. but.X The population < 0).1. line Or Y is the dependent variable and X is the independent (4. also known as the parameters of the population regression line. Because of the other factors that determine test performance. for a specific observation. Tn the class size example. as mentioned earlier. This error term the value of the dependent all of the factors responsible test score and the value predicted by the population contains all the other factors besides X that determine variable. In other applications.110 CHAPTER 4 Linear Regression with One Regressor Equation (4. is above the population line.e slope is the value the the with a unit change in X The intercept the Y axis.. The errol' term incorporates for the difference between the i'h district's average regression line. is f30 + f3.5). by the regression cal observations in Figure 4. and even any mistakes in grading the test. ematicalmeaning the linear regression model with model and its terminology are summarized in Key Confor line background. the hypothetiregression line. In some econometric line when X = 0. 111e term /I. which means that districts with lower student-teacher by the population regres- ratios (smaller classes) tend to have higher test scores. it is the point at which the populaapplications. meaning.X The intercept f30 and the slope f31 is the change in Yassociated of the population regression tion regression line intersects intercept has a meaningful intercept has no real-world speaking the intercept f3. economic interpretation.1 do not fall exactly on the population For example. student economic luck. This is the relationship line you would regression predict that the value of the regresdependent variable. Thus. it has no real-world meaning in this example. in whict. The intercept f30 has a mathas the value of the Y axis intersected sion line.1 summarizes seven hypothetical slopes down (f3.. function. Figure 4. This means that test scores in district # I were better than predicted .5) is the linear regression model with a single regressor. is the population regression y and X on average over the population. are the coefficients of the population sian line. that determines it is the predicted value of test scores when there are no stuas the coefficient the level of dents in the class! When the real-world meaning is best to think of it mathematically the regression line.5) is the error term. Y. TI. these other factors include all the unique features of the i Ih district that affect the performance of its students on the test. the value of Y for district #1.

.1 The Linear Regression Model 111 Terminology for the Linear Regression Model with a Single Regressor The linear regression model is cmmtm) 4. Ai f30 is the independent variable. n. Student-Teacher (Hypothetical Datal The scatterplot shows hypothetical observations for seven school districts. . Yj) 680 vertical distance from the point to the population regression line is ". or simply the right-hand variable.4.. Y.1 where the subscript i runs over observations. 660 640 • Yi. The population regressian line is f30 + {31X. ui for the which is the population error term ith observation.. 620 600 10 15 20 25 30 ratio (X) Student-teacher . the regressor. is the dependent variable.({3o + f31Xi). The jth Ratio Test score (Y) 700 . or simply the left-hand variable. the regressand. Scattenplot of Test Score vs.(X. . and f30 is the intercept of the population f3L is the slope of the population Ui is the error term. regression line. + f31X is the population regression line or the population regression function. i = 1. regression line.

is positive .112 CHAPTER 4 Linear Regression with One Regressor population regression line. This estimation problem is similar to others you have faced in statistics. We do not know the population value of f3CI". Y.2 Estimating the Coefficients of the Linear Regression Model In a practicalsituation such as the application to class size and test scores. But just as it was possible to learn about the population mean using a sample of data drawn from that population. The same idea extends to the linear regression model. Table4. we can estimate the population means using a random sample of male and female college graduates. for example.1summarizes the distributions of test scores and class size for this sample. The data we analyze here consist of test scores and class sizes in 1999 in 420 California school districts that serve kindergarten through eighth grade. so is it possible to learn about the population slope f3ClassSi" using a sample of data. of the population regression line are unknown.'Si". The average student-teacher ratio is 19. For example. so the error term for that district. so test scores for that district were worse than predicted.In con.6 students per teacher. Then the natural estimator of the unknown population mean earnings for women. The 10'h percentile of the distribution of . Class size can be measured in various ways. These data are described in more detail in Appendix 4. But what is the value of {3Clas. The test score is the districtwide average of reading and math scores for fifth graders.I'Size? 4. and '" < O.9 students per teacher. is below the population regression line..e answer is easy: The expected change is (-2) X f3Cltu. and the standard deviation is 1.. trast.'Si'" the slope of the unknown population regression line relating X (class size) and Y (test scores). we must use data to estimate the unknown slope and intercept of the population regression line. which is the number of students in the district divided by the number of teachersthat is. is the average earnings of the female college graduates in the sample.the districtwide student-teacher ratio. Now return to your problem as advisor to the superintendent: What is the expected effect on test scores of reducing the student-teacher ratio by two students per teacher? 11.suppose you want to compare the mean earnings of men and women who recently graduated from college. Although the population mean earnings are unknown. "I.Therefore. the intercept Po and slope f3.The measure used here isone of the broadest.1.

:.3 (that is..' " '.. •. Student-Teacher Test score Ratio (California School District Data) California school distrios. The sample correlation is -0. •... .-5 ~~~~~-..: .... /..'.: .9. . .~.. .23..-f .. .!t. .~~""0... there are other determinants keep the observations Despite through from falling perfectly along a straight if one could somehow this low correlation. ·.\ I..c~..:r ..' .-:. • . ?. . .--"-~~~311o --' Student-teacher ratio ... There is a weak negative relationship between the studentteacher ratio and test scores:The sample correlation is -0...!. nOr 700 680 660 640 620 .. . these data. • '· ..4.. then the slope of this line would be an estimate of Cim!DD Data from 420 Scatterplot of Test Score vs.:.. .•. . . while the district at the 90th percentile ratio of 21.. .. Although ple tend to have lower test scores. ..2 Estimating the Coefficientsof the Linear Regression Model 113 the student-teacher ratio is 17. .: • .t ........ ...... . -s:... '. draw a straight line {3Cla"Si" ratio is shown in Figure 4. .c\. •• • .....:...20. only 10% of districts have student-teacber has a student-teacher ratios below 17.23.3)...:111·· •.. -.. • • ·ro. 1. 6001'~0~-~~~-:1.." . ..2..:y " ...: ••• .. . .. .. indicating a weak negbetween the two variables. . ...-~~~~-2~5.:. A scatterplot ative relationship of these 420 observations on test scores and the student-teacher larger classes in this samof test scores that line.. -: ..

that minimize Expression (4. is. Y minimizes the total squared 2:. least squares (OLS) OLS estimator estima· of f30 f30 and f3. so the value or Y. are identical except for the (4. and the OLS estimator or f3. . the sample average.(bo + b. then. that (3. where cI seness is mea- sured by the sum of the squared mistakes made in predicting Y given X. Y. mistakes data..j2. Just as there is a [1'17 in Expression (3.2).~.1.x..X.e 0 LS regressiou line. The regression linc based on the e estiusing this line is bo mators is bo + b. is the straight line constructed ffi.predicted mistake made in predicting + b.6) The sum of tbe squared sion (4. f3o.. The predicted value of Y. if there is no regressor.(y. One way to draw the line would be 10 take out a pencil and a ruler and to "eyeball" the best line you could. it is very people will create different the line that produce estimated lines. should you choose among the many possible lines? By far the 1U0st the "least squares" (OLS) estimator.6) and the two problems minimizes the Expression mistakes I'or the problem estimating the mean in Expression (3.lation mean. and different common way is to choose data-that How. In fact.6) is the extension does not enter Expression different notation unique estimator.6).bo ns i b.X.114 CHAPTER4 Linear Regressionwith One Regressor based on these data.2)]. Let bo of f30 and f3. fit 10 these unscientific.6)]. 2:(Y. . - mY among all possible estimators [sec Exprc The OLS estimator and b.-bo-b. to use the ordinary least squares The Ordinary Least Squares Estimator The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed As discussed in Section 3.. E( Y).2).so is there a unique pair (4. of the sum of the squared (4.The is denoted also called the sample regression liue or sample regression function.2).Xj• Thus the the i'"observation is Y. that is.l1. least squares estimaestimation ion (3.6) are called the ordinary of the intercept and slope that minimize the sum of squared mistakes in Expression f30 and f3. bo in Expression Y. i=l " (4. is the 111 tor of the popu. .Xi) The sum of these squared prediction mistakes over all » observati = Y.X. using the OLS estimators: ffio + OLS h~s its own special notation and terminology. While this meth d is easy. be some estimators extends this idea to the linear reg res ion m del. of estimators of The estimators tors of mistakes for the linear regression model in Expresof then b. is denoted ffi.

The residual for tbe i th observation is the difference coefficients. repeatedly takes in Expression be quite estimators.2. and its predicted value: Ui = Y. . of the population The OLS estimators. of slope (pd. = Po + 13.8 ..11. . that streamline are collected in Key Concept 4. they are the least squares estimates. however. Xi.'~. - y. These all statistical and spreadsheet 4. counterpart are sample of 1)0 between Y. and residual X.f3. and the intercept ~ (30 are 4.6) using calculus. and Y.: 1. i = 1"". and Residuals The OLS estimators of the slope (3.).7) L(X.2 (4. j=j " Y) L: (X. " ..-X)(Y. the OLS regression line + is the sample of the population regression line (30 + . programs.8. based on the OLS regression line. i . slope (. and £tj = residuals iti are (49) (4.. Similarly. Fortunately.10) (Ui) are computed from a of tbe (Ui)' Y. Po and 13" are sample counterparts ..X. . and the OLS residuals Ui counterparts of the population errors Ui' Po PjX You could compute tbe OLS estimators and b. This method would the calculation tedious. (4. (4. n }J .6). . n.80).2 Estimating the Coefficientsof the Linear Regression Model 115 The OLS Estimator. and error term given. . i = 1. The estimated intercept sample of n observations unknown true population (Po). These are estimates intercept (.80 and (3. Predicted Values. The OLS formulas and terminology formulas are implemented in virtually These formulas are derived in Appendix ing Expression 130 and PI by trying different values the total squared misderived by minirnizof the OLS until you find those that minimize there are formulas.xj2 i=l The OLS predicted values n (48) Y.4.. is .2.

-~~~~-=----'-'~~~-L~~-~.·:-··.. • ....•••• .11) STI? is the that it Tes/Score is the average test score in the district student-teacher ratio. Aecordingly. . .' .. ·. .9. (larger classes) is associated with poorer performance ~ The Estimated Regression Line for the California Data Test score 720 700 680 660 640 620 The estimated regression line shows a negative relationship between test scores and the student-teacher ratio. the estimated regression predictsthat test scores will increase by 2. on average. .28 X STR 6 .9 ..28 where X STII. . The slope of -2.. the estimated slope is -2. The "~.2.:'~<~. . over Test Score in Equation over the scatterplot (4.~I":s::t:... ~ . •.. • ·'1 •• .11) indicates is the predicted value based on the OLS regression regression line superimposed line.9 ... Figure 4.1-: . the OLS regression line For these 420 Tesiscore = 698. associated with an increase in test scores of 4.••":....2.. A decrease in the student-teacher [= -2 X (-2. .116 CHAPTER 4 Linear Regression with One Regressor OLS Estimates of the Relationship Between Test Scores and the Student-Teacher Ratio When OLS is used to estimate scores using the 420 observations the estimated intercept observations is a line relating the student-teaeher ratio to test and in Figure 4.3 plots this OLS of the data previously shown in Figure 4... associated with a decline in districtwide by 2.2. .-J' 10 t5 20 ... .28)]..28 means that an increase in the student-teacher student per class is. ..' z 25 Student-teacher .. . .. .::. .1 . . on average. ratio by one test scores ratio by two st udents per teacher per class is....'.'..~:. '.~"tscore =•• 98\.. 30 ratio 600o~~~~7...56 points The negative slope indicates that more st udents on the test.. .. .28 is 698.."-{ --...\ ~~ .....28 points.C~'~ It {~.... and (4.2. : . If class sizes fall by one student.. . .28 points all the test. . .

] I). Of course.7 From Table 4. and. but it would not be a panacea. this. we return to the hiring enough teachratio is 19. ratio in these data is 14.5.5. Thus a decrease in class size that w uld place her district close to the 10% with the smallest classes would move her test scores from the 50'h to the 60'h percentile. if her district's test scores to 659. as the figure shows.9 . u ing the data in Figure 4. per teacher to 5? quaiion (4. from 19. ba is for predicting low student-teacher Why Use the OLS Estimator? There arc both pra tical and theoretical ~I' Because • reasons to usc the OLS estimators throughout economics. would move her student-teacher ratio from the 50'h percentile to very near the J O'h percentile. and the social sciences more generally. What if the superintendent the estimate was estimated in were contemplating such as reducing Unfortunately. Recall that she is contcmplating district.11) would not be very useful to her.H w would it affect test scores? According to Equati n (4. the median student-teacher and the mcdian test core is 654. This regression the student-teacher ratio from 20 students on her budgetary a far more radical change. the effect of a radical so these data alone move to such an the srna lie t student-teacher marion on how district arc not a reliable extremely with extremely ratio. cutting the student-teacher to increase ratio by 2 is pre- dieted to increase test sc res by approximately arc at the median. this improvement district from the median to just short of the 60'h percentile.2. 654.1. But the of the other factors that determine a district's Iinc does give a prediction (the OLS prediction) of what test scores would ratio.7. at least.4. they are predicted mcnt large or small? According 4. absent those other factors. and she would need to hire many new teacher . mates. based on their student-teacher Is this estimate uperimcnderu's ers to reduce of thc alifornia problem.6 points. it has become tbe comfinance (sec "The results mon language 'Beta' of a analysis t ck ' box).2. This is a big change.2 Estimating the Coefficientsof the Linear Regression Model 11 7 It is now possible to predict the districtwide test Score given a value of the student-teacher right because regression ratio. Is this improvewould move her to Table 4. this prediction will not be exactly performance. Presenting . the student-teacher ratio by 2. A reduction of two students to 17.28 x 20 = 653.3. For example. These data contain no infersmall classes perform. 130 and i the dominant (or regression method used in practice. for a district with 20 students per teacher. cutting the student-teacher pCI' According to these esti(two students ratio by a large amount teacher) IV uld help and might be worth doing depending situation.7 of the slope large or small? To answer be for that district.1. Suppose her district is at the median per class. the predicted test score is 698.1.

Said differently.11ms the expected excess risk-free. The "beta" of a stock has become a workhorse of the investment industry.Rf. must exceed the return on a safe. R .ccm. R. which then paid a $2. For example. 0.Rf.05.3 2. riskier stocks The capital asset pricing model (CAPM) formalizes this idea. like owning firm Web sites. a stock with a {3 < 1 has less risk than the market portfolio and therefore has a lower expected excess return than the market portfolio.can "porttolio't-ciu be reduced by holding other stocks in a other words. the expected to the on an asset is proportional Estimated {J expected excess return on a portfolio of all available assets (the "market portfolio").50 dividend during the year and sold on December 31 for $1. should be positive.4 CAPM. The table below gives estimated betas for seven should be measured by its variance. Tn practice. would have a return of R ~ [($105 . "Thereturn o~ ~ll investment is the change in its price plus an~ p~y. In contrast. return. a stock bought on January 1 for $100.50]/$ tOO~ 7. According excess return to the CAPM. the expected return! on a risky investment.3 0. by diversifying your financial holdings.vldend) from the investment as a percentage of Its initial ~nce. Company staples like Kellogg have stocks with low betas.0 1. stock in a company. the risk-free return is often taken to be the rate of interest short-term U.6 1.5%.S. as other economists all spreadsheet and statisticians.Rf on Rill . the CAPM says that (412) where Rm is the expected return on the market portfolio and f3 is the coefficient in the population regreson sion of R . This means that the right way to measure the risk of a stock is not by its variance but rather by its covariance with the market.o.6 0. The OLS formulas software packages ' making and statistical . however. or Rf. That is.$100) + $2. Much of that risk. government debt.S. According to the Wal-Mart (discount retailer) Kellogg(breakfast cereal) Waste Management (waste disposal) Verizon (telecommunications) Microsoft (software] Best Buy (electronic equipment retailer} Bankof America (bank) Source: Smartbdoncy.ut(dl. and you can obtain estiby mated betas for hundreds of stocks on investment return. investment.118 CHAPTER 4 LinearRegressionwith One Regressor _f---------A functamental investor idea of modern finance is that an a financial incentive to take a a stock with a f3 > 1 is riskier than the market portexcess needs folio and thus commands a higher expected risk.5 0. Those betas typically are estimated At first it might seem like the risk of a stock OLS regression of the actual excess return on the stock against the actual excess return on a broad market index. on a risky investment. stocks. ~sing OLS (or its variants discussed later in this book) means that you are "speakmg the same language" are built Into virtually OLS easy to use. U. Low-risk producers of consumer have high betas.

3 Measuresof Fit 119 The OLS estimators also have desirable theoretical properties.1. and the total Sum of squares (TSS) is the Slim of squared deviations of 1) from its average: ESS = ~ (Y. TIle OLS estimator is also efficient among a certain class of unbiased estimators. from their average.13) In this n tation. r are they pread out? The R and the standard error of the regression measure bow well the OLS regres ion line fits the data. jE'1 -. The R2 ranges between 0 and 1 and measures the fraction of the variance of Y. Does the regressor account for much or for lillie of the variation in the dependent variable? Are the observations tightly clustered around the regrc sion line.. (4. of Y as an estimator of the population mean. 2 The R2 The regression R' is the fracti n of the sample variance of 1) explained by (or predicted by) X.-Y). (4. typically is from its predicted value.. this efficiency result holds under some additional special conditions.1')2 i-I " . that i explained by X.15) " TSS= ~ -.4. plus the residual iiI' B. They are analogous to the desirable properties.3 Measures of Fit Having estimated a linear regression. the OLS estimator is unbia ed and consistent. however. you might wonder how well that regression line describes the data. 4. the R2 can be written as the ratio of the explained sum of square to the total sum of square. studied in Section 3. 1'1.14) (4.5.-(Y. the R2 is the ratio of the sample variance of ance of 1'1. and further discussion of this result is deferred until Section 5..4.. TIle explained su~u of squares (ESS) is the sum of quared deviations f the predicted values of Y. Under the assumptions introduced in Section 4. B to the sample vari- Mathematically. . The definitions f the predicted value and the residual (see Key Concept 4. The standard error of the regression mea ures how far Y.2) allow us to write the dependent variable 1) as the sum of the predicted value.

= (that is. The Standard Error of the Regression The standard error of the regression (SER) is an estimator of the standard tion of the regression measure ot the spread error Ui. is the sum of the SSR = "" -z LJ"i' i=[ " (4. Tn this case. then Xi explains none of the varia- value of 1) based on the regression is just the sample sum of squares is zero and the sum of squared residuals equals the total Sum of squares.16) Alternatively. U ~ I tion of 1) and the predicted average of Y. while an 1/' near 0 indicates the regressor is not very good at predicting Y. devia- The units of u. then Y. or SSR.. the does not take on that the extreme values of 0 or I but (ails somewhere in between. An R2 near 1 indi- cates that the regressor is good at predicting Y. In general.14) uses the fact that the sample average OLS predicted value equals Y (proven in Appendix 4.. and }j are the same.3 that TSS = ESS + SSR. thus the R2 is zero. i( the units of the dependent . not explained by Xi' The sum of squared squared OLS residuals: residuals.17) It is shown in Appendix 4. 1110 R' is the ratio of the explained sum of squares to the total sum of squares: (4. In contrast.3). the R2 of the regression of Yon the single regressor X is the square of the correlation coefficient between Yand X. 11. Thus the R2 also can be expressed as 1 minus the ratio of the sum of squared residuals to the total Sum of squares: R2 = 1_ SSR TSS' (4.. if Xi explains all of the variation of 1).18) Finally.120 CHAPTER 4 Linear Regression with One Regressor Equation (4. Ui = 0). so that ESS = TSS and Y. For example. the explained = 0. the R' can be written in terms of the fraction of the variance of Y. for all i and every R2 residual is zero R2 = 1. so the S ER is a around the regression line. measured of the observations in the units of the dependent variable.e R' ranges between 0 and 1.

by n . (The mathematics behind this is discussed in Section 5.051. The formula for the SER in Equation (4.051 means that the regressor STR explains 5. Figure 4. .:...6.6.3 superimposes this regression line On the scatterplot of the Test Score and STR data. .1 %. but much variation remains unaccounted for.. the SER is computed using their sample counterparts. the difference between dividing by 17. Because tbe standard deviation is a measure of spread..1.2' (4.Y in Equation (3.19) is similar to the formula for the sample standard deviation of Y given in Equation (3.3 around tbe regression line as measured in points on the test.. relating the standardized test score (TestScore) to the student-teacher ratio (STR). two "degrees of freedom" of the data were lost. As the scatterplot shows. The R2 of this regression is 0.2. whereas here it is n . Because the regression errors ttl. .4..6. the OLS residuals Ul>. This is called a "degrees of freedom" correction because two coefficients were estimated ({3o and {3.1 in Equation (3.=:======::::::=:=~lil "-- . --LU? n . the SER of 18.except that y.) is the same as the reason for using the divisor 17.19) where the formula for s~ uses the fact (proven in Appendix 4. .2 i='] I SSR =-_ n .3) that the sample average of the OLS residuals is zero. The reason for using the divisor 17. U/1 are unobserved.6means that tbere is a large spread of the scatterplot in Figure 4.6 means that standard deviation of the regression residuals is 18. and tbe SER is 18. ____ .1. the student-teacher ratio explains some of the variation in test scores. The R2 of 0.11) reports the regression line. the magnitude of a typical regression error-in dollars. Application to the TestScoreData Equation (4.2.1 % of the variance of the dependent variable TesrScore. or by n .2 here (instead of 17..or 5..then the SER measures the magnitude of a typicaldeviation from the regression Line-that is.3 Measuresof Fit 121 variable are dollars. __ iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii.7): It corrects for a light downward bias introduced because two regression coefficients were estimated.). . The formula for the SERis l un- SER -- SUI wheresi/ 2 = 1".7) is replaced by Uj and the divisor in Equation (3.7) is 17. The SER of ]8.2.7) in Section 3. estimated using the California test score data..2 is negligible.. so the divisor in this factor is I'l .) When /I is large. . This large spread means that predictions of test scores made using only the student-teacher ratio for that district will often be wrong by a large amount.. where the units are points on the standardized test.

" What the low SER is large) does no. being at other centered on the population = 20 and. or luck on the test. by itself. in the sense that. valuesx of X. represents the other factors that lead test scores at a given district to differ from the prediction based on the population sometimes these other factors lead to better performance and sometimes to worse performance tion the prediction bution of is right. They do. at a given value of class size. or. but they do indicate ratio. on thc linear regression Initially.does tell us IS that other in in school quality unrelated that the student-teacher to the important factors influence test scores. the mean of the distri- 1. = x. In other words. E(u.4. and the error term u. As shown in Figure 4. have natural interpretations. > 0) iu. the distribution has a mean of zero. these assumptions when OL will-and model of might will and the sampling scheme under which OLS provides an appropriate the unknown regression ing these assumptions not-give estimator f30 and f3. 4. is essential for understanding of the regression coefficients. is zero. pler notation. regression line at X. but on average over the popula- = 20. (II.This assumption is a formal mathematical statement about the "other factors" contained in Iii and asserts that these other factors are unrelated to X. < 0). useful estimates and understand- Assumption #1 : The Conditional Distribution of u.IX. stated mathematically. regression line. this is shown as the distribution 01 u. The population regression is the relationship that holds on average between class size and test scores in the population. as well. more generally. In Figure 4. sim- E( u.[X. These factors could include dIfferences the student body across districts. Said differently. however..) = o. of It"conditional on X..4 The Least Squares Assumptions This section presents a set of three assumptions coefficients. = x) = 0. Given X Has a Mean of Zero The first of the three least squares assumptions is that the conditional distribution of given Xi has a mean of zero. given X. imply that this R. This assumption is illustrated in Figure 4. in somewhat . appear abstract.4. say 20 students than predicted per class. The low R2 and high SER do not tell us ratio alone explains only a small part of the variation in test scores in these data.122 CHAPTER 4 Linear Regression with One Regressor What should we make of this low this regression is low (and the regression is either R2 and large SER? The fact that the R2 of "good" or "bad. given a value of X" the mean of the distribuHi tion of these other factors is zero. differences student-teacher what these factors are.4./.

) = 0 is equivalent to that the population line is the conditional mean of Y.Whether cation return with observational to this issue repeatedly.IX.4 The Least Squares Assumptions 123 GBDt Test score The Conditional Probability Distributions and the Population Regression Line 720 700 680 660 Distribution of Y when X == 15 / E(Ylx = 15) Distribution of Y when X == 20 / E(Ylx = 20) E(YIX = 25) 20 Distribution of Y when X == 25 / 25 30 640 620 600 .) = O. proof of this is left as Exercise 4.6). (X = 1) or to the control group (X using a computer Random program that X is distributed assignment independently = 0).-. The mean of the conditional distribution of test scores. in Figure 4. X is not randomly assigned in an experiment.4. assigned. subjects are randomly experiment.X).-~~~~--=-~~~~-----l. given the studentteacher ratio.. and we the best that can be hoped for is that X is as if randomly sense that E(u. has a conditional mean of zero for all values of X. in the precise and judgment. E(YI X). U = Y .-~~~~. that the conditional In observational mean of u given X is zero. The random that uses no information of all personal makes X and u independent.4. data requires careful thought this assumption holds in a given empirical appli- . is the population regression line f30 + f31X.IX. and 25 students. Y is distributed around the regression line and the error. given X. data. As shown assuming (a mathematical The conditional domized group is done ensuring subject. the assumption regression that E(u. mean of u in a randomized controlled experiment. In a ran- controlled assigned to the treatment assignment typically about the subject.(/30 + f3. 20. characteristics of the which in turn implies Instead.~~~~-----i 10 IS Student-teacher ratio The figure shows the conditional probability of test scores for districts with class sizes of 15. At a given value of X.

ii. Because correlation is a measure of linear association. are independently and identically distributed (i. .i. i = 1" . then the two random variables have zero covariance and thus are uncorrelated [Equation (2. survey data from a randomly chosen subset of the population typically can be treated as i..)..27)]. The i.d. then it must be the case that E(u.i. If they are drawn at random they are also distributed independently from one observation to the next. is random)..d.i. ii. are Correlated.i. given X. If a sample of n workers is drawn from this population. If she picks the techniques (the level of X) to be used on the ilh plot and applies the same technique to the i th plot in all repetitions of tbe experiment.~ Y. the conditional mean of u. For example. then (X. .) is nonzero. n.) = O. Y. does not change from one sample to the next. tion is violated.) across observations.. u. this implication does not go the other way. is nonrandom (although the outcome Y.). Assumption #2: (X. and imagine drawing a person at random from the population of workers. might be nonzero. assumption is a reasonable one for many data collection schemes. i = 1"". i = 1.d. If the observations are drawn by simple random sampling from a single large population.3 that if the condi_ tional mean of one random variable given another is zero. X and Y willtake on some values).).i. . One example is when the values of X are not drawn from a random sample of the population but ratber are set by a researcher as part of an experiment. i = 1. suppose a horticulturalist wants to study the effects of different organic weeding methods (X) on tomato production (Y) and accordingly grows different plots of tomatoes using different organic weeding techniques.. For example. Recall from Section 2. Thus X. That randomly drawn person will have a certain age and earnings (that is. If X. n. As discussed in Section 2. necessarily bave tbe same distribution. if X. It is therefore often COnvenient to discuss the conditional mean assumption in terms of possible correlation between X. then the value of X. Thus the conditional mean assumption E(u. and u. For example. let X be the age of a worker and Y be his or her earnings.d. then (X" Y. and a. or carr (X.).5).IX. and are un correlated. n. Not all sampling schemes produce i. and are correlated.i. they are i. and u..) = 0 implies that X.i. are uncOrre_ lated.IX. even if X. then the conditional mean assump. .d.5 (Key Concept 2. . The results presented in this chapter developed . this assumption is a statement about how the sample is drawn. n. so the sampling scheme is not i. however. observations on (X" Y.124 CHAPTER 4 Linear Regression with One Regressor Correlation and conditional mean... Are Independently and Identically Distributed The second least squares assumption is that (X" Y.d.).d. However. that is. are i.

. sampling is when observations refer to the same unit of observation Over time.For example. This is an example of time series data. Imagine collecting . Large outliers can make OLS regression results misleading.d.. thereby circumventing any possible bias by the horticulturalist (she might use her favorite weeding method for the tomatoes in the sunniest plot).d.6 applies to the average. Assumption #3: Large Outliers Are Unlikely The third least squares assumption is that large outliers-that is. . then the law of large numbers in Key 1 n . are i. quite special.i. Another way to state this assumption is that X and Y have finite kurtosis.) are i.i.y )2 .4. Concept 2. This potential sensitivity of OLS to extreme outliers is illustrated in Figure 4. Y. Time series data introduce a set of complications that are best handled after developing the basic tools of regression analysis. observations with values of Xi. and a key feature of time series data is that observations falling close to each other in time are not independent bnt rather tend to be correlated with each other.3 showing that s ~ is consistent. we might have data on inventory levels (Y) at a firm and the interest rate at which the firm can borrow (X). The case of a nonrandom regressor is.2:i~l("Y.9) states tliat the sample variance s~ is a consistent estimator ofthe population variance a} (s~ ..i. Y.a key step in t I proo f In ie Appendix 3. the level of X is random and (Xi.. they might be recorded four times a year (quarterly) for 30 years...d.however.. 1'.lf}J. This pattern of correlation violates the "independence" part of the i. modern experimental protocols would have the horticulturalist assign the level of X to the different plots using a computerized random number generator. for example.d. a}). such as a typographical error or incorrectly using different units for different observations. regressors are also true if the regressors are nonrandom. -jJ. One source of large outliers is data entry errors.or both that are far outside the usual range of the data-are unlikely.. In this book.5 using hypothetical data. We encountered this assumption in Chapter 3 when discussing the consistency of the sample variance. if interest rates are low now. Specifically.4 TheLeastSquaresAssumptions 125 for i. Equation (3.!!.i. is finite. they are likely to be low next quarter.i. the assumption that large outliers are unlikely is made mathematically precise by assuming that X and Y have nonzero finite fourth moments: o < E(Xt) < 00 and 0 < E(YI) < 00. . When this modern experimental protocol is used.d. Another example of non-i. . and the fourth moment of Y... where these data are collected over time from a specificfirm. assumption. The assumption of finite kurtosis is used in the mathematics that justify the large-sample approximations to the distributions of the OLS test statistics. For example.

drop the observation Data entry errors aside. but inadvertently height in centimeters recording one student's correct the instead... used distributions such as the normal distribution rules out those distributions. have four of matter.3. then it is unlikely that statistical inferences using OLS by a few observations. the best you can do on a standardized size and test scores have a finite range. ... some distributions IHlVe infinite moments.. they necessarily generally... 2000 • 1700 1400 1100 800 500 200 OLS regression line including outlier L---.. __ ----' 40 excluding outlier L-~::--_L__ __::' oL 30 50 60 X 70 data on the height of students in meters. Still. then you can either error or.:::O~L-S-r-eg-r-c'. The least squares throughout assumptions play twin roles..:... and we this textbook.---::_ . and this assumption finite fourth moments will be dominated If the assumption holds. the assumption of finite kurtosis is a plausible one in many applications with economic data. One way to find outliers is to plot your data..::. but the OLS regression line estimated without the outlier shows no relationship.. Use of the Least Squares Assumptions The three least squares rized ill Key Concept return to them repeatedly assumptions for the linear regression model are summa4. commonly moments.126 CHAPTER 4 Linear Regression with One Regressor tmImIID The Sensitivity of OLS to Large Outliers y This hypothetical data set has one outlier.. If you (rom your data set. as a mathematical test is to get all the quesclass More fourth tions right and the worst you can do is to get all the questions wrong..-. if that is impossible... Class size is capped by the physical capacity of a classroom. decide that an outlier is due to a data entry error."si~o-n-li-n-e-:/T..The OLS regression line estimated with the outlier shows a strong positive relationship between X and V. Because have finite kurtosis.=:::::::::::.

where 4. then. you those outliers carefully to make sure those observations and belong in the data set. Their second OLS regression. The error term it. The third assumption should examine rectly recorded modification serves as a reminder that OLS. n. have nonzero finite fourth moments. i = 1.2. the first least squares to consider in practice. as is shown have sampling distribulets us develop intervals using the in the next section. . in large samples the OLS estimators methods for hypothesis OLS estimators.lx. has conditional mean zero given and identically X. One reason why the first least squares assumpis discussed in Chapter 6. tion might not hold in practice are discussed in Section 9. . Therefore.5 Sampling Distribution of the OLS Estimators Because the OLS estimators ~o and ~ I are computed from a randomly drawn sample. the estimators the sampling possible random themselves are random variables with a probability distributionthat describes the values they could take over different presents tbese sampling distributions .. Y.) draws 2. iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiO====~:::::::::::==- . Tbis section distribution- .). are independent from their joint distribution.: E(u.3 1. 4.. (Xi. pendence to consider holds in an the regres- Although it plausibly holds in many cross-sectional 2 require data sets. + uil i :::::1.i. If your data set contains large outliers. Their first role is mathematical: 1f these assumptions hold. samples. just like the sample are cor- mean..4. It is also important application. 3. .. this large-sample normal distribution confidence role is to organize the circumstances that pose difficulties for assumption is the most reasons As we will see.) = 0.. In turn. important testing and constructing tions that are normal. distributed (i. I n..5 Sampling Distribution of the OLS Estimators 1 27 The Least Squares Assumptions }j = /30 + [3 Lx. can be sensitive to large outliers. Large outliers are unlikely: and Xi and Y. the indefor some assumption is inappropriate sion methods developed under assumption applications with time series data.d. and additional whether the second assumption for time series data.

the probability of these diffcrent values is summarized _ I piing distribution. it is possible to make certain tatcmcnt bout it that hold for all 11.not a simple average. but in large In small samp Ies. Rccallthc discuss~n in Sections 2.7) for f3"you will see that it. If you examine the numerator 10 Equation 0. (11 _ Y)( X. 1'-1" Bccause Yis calculated using a randomly drawn sample. This argument invokes the central limit theorem. then m re can be said about tbe sampling distribution. are unbiased estimators of . an be rnplicated when the sample size is small. The pro f that ~.Y an be Complicated when the sample size is small.7. ~o and ~. of the unknown interccpt. .3.8o and sl pe.8.128 CHAPTER 4 Linear RegreSSion WI . In particular.is a type of average. but an average of the pr duct. If the samp!e is sU~iciently large. like Y. under the least qua res assurnpti ns in Key onccpt 4. In particular. Although the sampling distributi n f. ver to the L estimators ~o and~. are approxl1nate y norma The Sampling Distribution of the OLS Estimators Review of the sampling distribution of Y. E(Y) = I'-Y.3. and the proof that ~o is unbiased is left as xercise 4. _ X).Technically. so Y is an unbiased estimator of 1'-1" If 11 is large.ibution of averages (like Y). of thc population regression line. an CStimator of the unknown population mean of Y.5 and 2. . the probability of the e different values is ummnrized in their sampling distributions. is unbiased is given in Appendix 4.TI..80 and . I because of the centralllll1ltthcorcJ11.4. the central limit theorem concerns the dist. Because the OL e timat r are calculated using a random sample. 'Fis a random variable thattakcs on differcnt value from One samin its sam. ~o and are random variables that take on different value fr rn ne sample to the next. the mean of the sampling di tribution is t-v. Y. As discussed . are normal in large samples.the centrr Ilimit the rem (Section2. This implies that the marginal distributi n of ~o and ~.::plc average. inn with One Regressor amples th ' ey these distributions are complicated. by the central limit the rem the sampling distribution of f30 and f3.20) that is.C c idcas carr (4.)..6) states that this distribution is approximately normal. is well approximated by the bivariate normal distribution (Section 2. .8. Although the sampling distribution of ~o and~. '0 . ~j Thesampling distribution of ~o and ~ r. pe to the next: .6 about the sampling distribution of thc a.ln other words. are f30 and f3. the mean of the sampling distributi n of ~o and ~. it is possible to make certam statements about it that hold for all n_ In particular. I . that is.

.4 and f31 have ~ jointly normal sampling distribution..1 (421) The large-sample normal distribution of &0 is N(f3o. the central limit theorem applies to this average so that. the OLS estimators are consistentthat is. where the variance of this distribution (T? is . u x ].21) so the smaller is (Tl. f30 and f3h when n is large. The large-sample normal distribution of f31 is N(f31. of f31 in Equation "' of &. . Mathemati(4. in general. larger is the denominator in Equation (4.3 hold. like the simpler average Y. (422) further in Appendix 4. ). modern econometric to be reliable. so the distribution around their means. so we will treat the normal approxi- mations to the distributions of the OLS estimators as reliable unless there are good reasons to think otherwise. when the sample size is large.. The normal approximation samples is summarized to the distribution of the OLS estimators in large in Key Concept 4.4. (Appendix 4. with high probability. In virtually all applications.4 is that.21) is the to the square of the variance of Xi: the larger is var(Xi). This criterion carries over averages appearing in regression analysis..) A relevant question in practice is how large n must be for these approximations sufficiently normal distribution. f30 and f31 will be close to the true population coefficients f30 and f3.6. the larger is the variance of Xi. (T~o)' where 2 _ 1var(Hiui) (T~o . The results in Key Concept 4.3 summarizes the deriva- tion of these formulas.. large for the sampling distribution of Y to be well approximated to the more complicated n > 100..3. This is because the variances (T~o and to zero as n increases (Tl of the estimators decrease will be tightly concentrated Anotber (n appears in the denominaof the OLS estimators tor of the formulas for the variances).: least squares assumptions Po and PI &0 . then in large samples 4..ply tha. this implication inversely proportional arises because the variance .To get a better sense .4 iU. we suggested that n = 100 is by a and sometimes smaller n suffices. In Section 2.7.n [E(H1JF' where H. (T7. {31' Cllm:mm in Key Concept 4.4. the smaller is the variance cally. it is normally distributed in large samples. implication of the distributions in Key Concept 4. = 1 - [ E(X1) X.5 Sampling Distribution of the OLS Estimators 129 Large-Sample Distributions of Ifth.

the more precise is f31' The distributions in Key Concept 4.the larger the variance of X. . ..__ . Thi can be seen mathemat- ically in Equation (4.21) because 1/... 0 0 • 196 f- • o o 194 L__ --'---J. The y 206. .--...:e. the smaller is the variance of ffit.130 CHAPTER 4 Linear Regression . 202 - 200 f- . . The data points indicated by the colored dots are the 75 observations closest to X. if thc errors are smaller (holding the X's fixed). 198 f- o . but not dcnominator. .. •• • . were smaller by a factor of one-half but the X's did not change. ~ .of all 1/. • •• • • o 01 .. Similarly.. The black dots represent a set of Xis with a large variance.. and the Varianceof X The colored dots represent a set of Xis with a small variance.-.6. would be smaller by a factor of one-half and would be smaller by a factor of one-fourth (Exercise 4.4 also imply that the smaller is the variance of the error U. The normal approximation to the sampling distributi n f ~o and ~1 is a powerful tooL With this approximation in hand. . .... :~ . which have a larger varia~ce than the colored dots.-. then the data will have a tighter scatter around the population regression line so its slope will be estimated more precisely._ 101 102 103 X 100 of why this is so.- 204 f- regression line can be estimated more accurately with the black dots than with the colored dots. 0. allf cr' .__ __ 97 98 99 ___l' __ ~:_:_-----l'_::_-____. look at Figure 4... enters the numerator.13).. we are able to develop methods for making inferences about the true population values of the regression coefficients using only a sample of data. Stated less mathematically. : :a: : . WI 'th One RegressOr CiBID The Variance of P.. ... 0 . which presents a catterplot of 150 artificial data points on X and Y.. Suppose you were asked to draw a line as accurately as possible through either the colored or the black dots-which would you choose? ft would be easier to draw a precise line through the black dots. . . then <rj.... • 1. .

X. determines the level (or height) of the regression line. Key Concept 4.that is. f3" is the expected change in Yassociated with a one-unit change in X. X and Y have finite fourth moments (finite kurtosis). The first assumption is that the error term in the linear regression model has a conditional mean of zero. to its standard error. Stated more formally. There are many ways to draw a straight line through a scatterplot. Moreover.) are i. The results in this chapter describe the sampling distribution of the OLS estimator. By themselves.Summary 131 4.This assumption yields the formula. These important properties of the sampling distribution of the OLS estimator hold under the three least squares assumptions. Y. The intercept. The second assumption is that (X" Y. then the OLS estimators of the slope and intercept are unbiased.. f30 + f3. hypothesis tests. If the least squares assumptions hold. f3o. Summary 1. given the regressor X.X. as is the case if the data are collected by simple random sampling. however. or to construct a confidence interval for f3. and a single regressor. if n is large.i. The third assumption is that large outliers are unlikely. are consistent.1 summarizes the terminology of the population linear regression model. these results are not sufficient to test a hypothesis about the value of f3. The reason for this assumption is that OLS can be unreliable if there are large outliers. and confidence intervalsis taken in the next chapter. This step-moving from the sampling distribution of ~.. is the mean of Y as a function of the value of x''Ille slope. The population regression line.4.4.for the variance of the sampling distribution of the OLS estimator.6 Conclusion This chapter has focused on the use of ordinary least squares to estimate the intercept and slope of a population regression line using a sample of n observations on a dependent variable. presented in Key Concept 4. Taken together. then the sampling distribution of the OLS estimator is normal. the standard error of the OLS estimator. and have a sampling distribution with a variance that is inversely proportional to the sample size n. but doing so using OLS has several virtues. .d. Doing so requires an estimator of the standard deviation of the sampling distribution . This assumption implies that the OLS estimator is unbiased. the three least squares assumptions imply that the OLS estimator is normally distributed in large samples as described in Key Concept 4.

The R' and standard error of the regression (SER)are meas~res of how close the values of Y.1 Explain the difference between hi and 13.). (2) consistent. model: (1) The have a mean of zero conditional on the regressors Xi. between the residual iii and ihe regression error IIi. Key Terms linear regression model with a single regressor (110) dependent variable (110) independent variable (110) regressor (110) population regression line (11. If these assumptions hold. 4.. Xl i = 1. n by ordinary least squares (O.LS). and between the OLS predicted value Y. with a larger value indicating that the l)'s are closer to the line. .the OLS estimators ho and are (1) unbiased. The popu Ia t IOnr .. .d. assumption is valid .2 For each. Ui. are to the estimated regression line.132 CHAPTER 4 Linear Regression with One Regressor · egression line can be estimated using sample observations 2. least square s assumptIOn. and E(Y.and (3) large outliers are unlikely. random draws from the population. (2) the sample observations are i.IX.i.. The R is between o and 1. (). provide an example '.0) population regression function (110) population intercept (110) population slope (110) population coefficients (110) parameters (110) error term (110) ordinary least squares (OLS) estimators (114) OLS regression line (114) sample regression line (114) sample reg res ion function (114) predicted value (114) residual (l15) regression R2 (119) explained sum of squares (ESS) (119) total sum of squares (TSS) (119) sum of squared residuals (SSR) (120) standard error of the regression (SER) (120) least squares assumptions (122) Review the Concepts 4.Th: OLS estimators of th~ r:g~ession intercept and slope are denoted 130 and 13" 3. then p roviid e an example in which the assumption f aJ'1 S . 4. and (3) normally distributed when hl the sample is large. in which tbe '. There are three key assumptions for the linear regression regression errors.The standard error ol the regression is an estimator of the standard deviation of the regression error.

) d. SER = 10. prediction for the change in the classroom c. estimates .41 where Weight is measured a. Last year a classroom What is the regression's average test score? had 19 students. What is the regression's prediction for the increase in this man's weight? c.81. What is the sample standard deviation of test scores across the 100 classrooms? 4.82 the OLS regression test Testscore = 520.94 X Height. R2 = 0. = -99.5.. scores from 100 third-grade using data on class size (CS) and average classes.1 Suppose that a researcher. and SER.4.5. = 0. prediction for a.g. What is the regression's + 3.2 (Hint: Review the formulas for the R2 and SER. A classroom has 22 students.) . What is the sample average of the test scores across the 100 classrooms? (Hint: Review the formulas for the OLS estimators.5.3 Sketch a hypothetical scatterplot of data for an estimated scatterplot regression with with R 2 R2 = 0.08. R2 ~ 0. tall? 65 in. tall? b..Exercises 1 33 4. weight prediction for someone who is 70 in. estimates from this new centimeter-kilogram (Give all results. What is the regression's that classroom's average test score? b. estimated R2.) men is selected from Suppose that a random sample 01'200 twenty-year-old a population and that these men's height and weight are recorded.2. in pounds and Height is measured in inches.. tall? 74 in. over the course of a year.9. and this year it has 23 students. The sample average class size across the 100 classrooms is 21.4 xes. A man has a late growth spurt and grows 1.5 in. A regression of weight on height yields w. and kilograms. Suppose that instead of measuring weight and height in pounds and inches these variables What are the regression regression? are measured in centimeters coefficients. SER = 11. Sketch a hypothetical of data for a regression Exercises 4.

age (measure in ye . ] 20 rrunutes.). .3%.2.6 X Age. Given what you know about the distribution of earnings. Suppose that the value of f3 is less than 1 for a particular stock. for the worker? do you think it is plausible that the distribution of errors in the regression is normal? (Hint: Do you think that the distribution is symmetric or skewed? What is the smallest value of earnings. Each student is randomly assIgne I . In a given year.use the estimated value of f3 to estimate the slack' expected rate of return. . What are the units of measurement for the SER? (Dollars? Years? Or is SER unit-free?) c. a. d . Is it possible that variance of (1/ . For each company listed in the table in the box.7 + 9. b.023.1. . R2 = 0. . 4. b.) 4.2. The standard error of the regression (SER) is 624. measured in dollars) On A regressIOn 0 av d. a.134 CHAPTER 4 Linear Regression with One RegresSor 43 .) c. The regression R2 is 0.Rf) for this stock is greater than the variance of (R". f erage weekly earnings (AWE. and is it consistent with a normal distribution?) g.7 and 9. . What is the regression's predicted earnings for a 25-year-old A 45-year-old worker? e.023.6 mean.5 A professor decides to run an experiment to mea ure the effect of time pressure on final exam scores. He gives each of the 400 students in his course the same final exam. The average age in this sample is 41.R f) for this stock is greater than the variance of (1/". but some students have 90 minutes to complete the exam while other Slave. kers aeed 25-65 yields the followmg: nme war co AWE = 696. Suppose that the value of f3 is greater than I for a particular stock. ars) using a random sample of college-educated fUll.R.)? (Hint: Don't forget the regression error. Explain what the coefficient values 696. Will the regression give reliable predictions for a 99-year-old worker? Why or why not? f.4 Read the box "The 'Beta' of a Stock" in Section 4.6 years.5% and the rate of return on a large diversified portfolio of stocks (the S&P 500) is 7.1. the rate of return on3-monlh Treasury bills is 3. Show that the variance of (R . What are the units of measurement R2? (Dollars? Years? Or is 1/2 unit-free?) d. SER = 624. . What is the average value of AWE in the sample? (Hin: Review Key oncept 4.I/.

. = f30 + f3. 1).20. Why will different students different values of Ui? b. A linear regression 4. Compute the estimated gain in score for a student who is given an additional 10 minutes on the exam.. let Xi denote the amount of time that the student has to complete the exam (Xi = 90 Or 120). II.] uate the terms in Equation 4. Suppose you know that f30 = O.IX.4 continue to hold? Which change? Why? (Is f3.8 4. regression the estimated Y. where (Xi.) = O. A linear regression b. 4.Xi + u.6 Show that the first least squares assumption.Xi + u. = f30 + f3. and consider the regression model 1.9 a. represents.. implies that Show that &0 is an unbiased estimator of f30. 4). = 49 + 0. [Hint: Eval- for the large-sample (4. denote ith 135 the number of points scored on the exam by the student (0 so 1..) Suppose that all of the regression assumptions isfied except that tbe first assumption is replaced parts of Key Concept in Key Concept 4..24 Xi' regression's prediction for the average the exam. E(UilX. Let 1.i.d.) 4. which is shown in Appendix 4. U. is a is N(O.Derive a formula for the least squares estimator of f3. Ui) are i. Explain why E(UiIXi) = 0 for this regression in Key Concept is model.3 are satisfied. is N(O.X.Does this imply that &. = f30 + f3. have a. 4..Exercises one of the examination times based on the flip of a coin. and X.7 = f30 + e. Show that R2 yields R2 = O. Explain what the term u.11 Consider the regression model 1. = O. Derive an expression assumptions in Key Concept 4. c.3. = O.4? What about &o?) 4. when X = 0. The estimated I. Are the other assumptions d. Repeat Compute Score of students given 90 mioutes to complete for 120 minutes and 150 minutes. a. normally distributed in large samples with mean and variance given in Key Concept 4. + "i. is unbiased.21). Whit. so 100).3 are satwith E(UilX. = O? U. a. Show that tbe regression b. variance of &.h 4. Bernoulli random variable with Pr(X = 1) = 0.) = 2. . When X = 1.10 Suppose that yields &.x.. (Hint: Use the fact that &.3 satisfied? Explain. 1. E(1.

Show that the large sample -.-.1 [1'1' 'Th' '. where K is a non-zero constant and ( y" Xi) satisfy the three least squares assumptions.)". IS equation IS the (31 variance given in equation (4.) a. R2~rh· b. as their highest degree.13 Suppose that Y. Does age account for a large traction of the variance in earnings across individuals? Explain.14 Show that the sample regression line passes through the point (X. (These are the same data as in CPS92_08 but are limited to the year 2008. c.the squared Sho X a v. Show that ~I ~ rXY(syjsx). lder workers have more job experience. where I'XY is the sample correlation between X and Y. (Generally.pearsonhighered. Predict Alexis's earnings using the estimated regression. and Sy and Sx are the sample standard deviations of X and Y. Empirical Exercises E4. Predict Bob's earnings using the estimated regression. 2 2Iv"[IX.1 for 2008.you will find a data file CPS08 that contains an extended ver ion of the data set used in Table 3. variance of (3 I IS given by (J. full-year workers. show thai value of the sample correlation bel ween an . A detailed description is given in CPS08_Description..com/slock_watsonl. Derive a formula for the least square s w that the regression R2 in the regression of Yon X is. also available on the Web site. Bob is a 26-year-old worker. Suppose you now estimator of (31' 4.rt rat IS. .136 CHAPTER 4 Linear Regression with One Regressor k that b. What is the estimated intercept? What is the estimated slope? Use the estimated regression to answer this que tion: H w much do earnings increase as workers age by 1 year? b..12a.V). leading to higher productivity and earnings. ~ K ./B. ]' • tnt. you will investigate the relationship between a worker's age and earnings. (3 0 ~ 4.A. age 25-34.) In this exercise. . Run a regression of average hourly earnings (A H £) on age (Age).21) multiplied by <2. Alexis is a 30-year-old worker. c. ~ (30 + (31 Xi + KUi.. It contains data for full-time. [vnr('\i).] 4. 4. Show that the R2 from the regression of Yon X is the same as the R2 from the regression of X on Y. with a high school diploma or B.1 On the text Web site http://www.

(Hinl: What is the sample mean of Beaufy?) c. 24(4): 369-376.Empirical Exercises 137 E4. "Beauty in the Classroom: Instructors' Pulchritude and Putalive Pedagogical Productivity. while Professor Stock's value of Beauty is one standard deviation above the average." e. also available on the Web site. a. Construct a scatterplot of average course evaluations (Course_Eval) on the professor's beauty (Beamy). What is the estimated intercept? What is the estimated slope? Explain why the estimated intercept is equal to the sample mean of CourseEval. Professor Watson has an average value of Beauty. In this exercise. course characteristics. Run a regression of average course evaluations (Course_Eval) on the professor's beauty (Beauty). Comment on the size of the regression's slope.2 On the text Web site hltp:/Iwww.you will find a data file TeachingRatings that contains data on course evaluations. Is the estimated effect of Beauty on Course_Evallarge or small? Explain what you mean by "large" and "small.pearsonhighered. d. and professor characteristics for 463 courses at the University of Texas at Austin. complete I These data were provided by Professor Daniel Hamerrnesh of the University of Texas at Austin and were used in his paper with Amy Parker. E4.pearsonhighered. Does Beauty explain a large fraction of the variance in evaluations across courses? Explain.com/stock_watsonl. In this exercise.you will find a data file CollegeDistance that contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986.' A detailed description is given in TeachingRatings_Description. One of the characteristics is an index of the professor's "beauty" as rated by a panel of six judges. . on average. so that students who live closer to a four-year college should. Does there appear to be a relationship between the variables? b. (Proximity to college lowers the cost of education. you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student's high school to the nearest four-year college.3 On the text Web site hltp:llwww. you will investigate how course evaluations are related to the professor's beauty." Economics of Education Review. August 2005. Predict Professor Stock's and Professor Watson's course evaluatious.com/stock_watson/.

mance and the Sources or Growth"• Ioumol 01 //JOII EeonOl1l1es. . D es Malta look like an outlier? e.) What is the estimated intercept? What IS the estimated lope? Use the esu.com/stoek_watsonl. essor ecrna Rouse of Princeton University And were use In paper Democratlzatlon or D' .you will investigate the relationship between growth and trade.s a.58: 261-300. .138 CHAPTER 4 Linear Regression with One Regressor f hi hereducation. April 1995. run a regression of Growth on TradeShare.less find Economic Sllllislics. "L rversrcn'. has a trade share much larger than the other countries.you will find a data file Growth that contains data on average growth rates from 1960 through 1995 for 65 countries along with variable that are potentially related to growth. One country. cents. yza. t'on also available on the Web sue. Using all observations. Find Malta on the scatterplot. A detailed deseripti n is given in Growth_Description. What is the value of the standard error of the regression? What are the units for the standard error (meters.4 On the text Web site http://www. b. years. What is the estimated slope? What is the estimated intercept? Use the "These data were provided by Prof C ili d' her " '. mated regression to answer this question: How does the average value of years of completed scho ling change when colleges are built close to where students go to high school? I . dollars. Docs there appear to be a relationship between the variables? b. d. Malta. Construct a scanerplot of average annual gr wth rate (Grow/h) on the average trade share (TradeSltare).oumat of BlIsil.. 2 ge . er w'Ll 11 ( B k r essor ass Levine of Brown University nnd were used 111 ISpap h J 1 lOTS en ec and Norman Loa "Fi F' cial '2000 .pearsonhighercd.) A detailed description is given in Colle more years 0 tg . also available on the Web site. Does distance to college explain a large I'racti n of the variance in educational attainment across individuals? xplain. ? 10 . e Effect of Community alleges on Educational Att31n~nThent. In this exerci e. grams. 12(2): 217-224. Predict Bob's years of completed education using the estimated regression. Bob's high school was 20 miles [rom the ncare t c liege. Dis! = 2 means that thc distance i 20 miles. Or something else)? E4. Distance_ D esenp I l ' a. Run a regression of years of completed ~ducati n (ED) on distance to the neares t college (Dis/) where Dist IS measured 111 tens of miles (For example. ese data were provided by P of R' . How would the prediction change if Bob lived 10 miles [rom the nearest college? c.

ca. and the percentage of students who are English learners (that is. a standardized test administered to fifth-grade students. The student-teacher ratio used here is the number of students in the district divided by the number of full-time equivalent teachers..23) (4. e. Where is Malta? Why is the Malta trade share so large? Should Malta be included or excluded from the analysis? APPENDIX ------- _ 4. students for whom English is a second language). TIle demographic variables include the percentage of students who are in the public assiuance program Cal Works (formerly AFDe) . APPENDIX 4.24) .1 The California Test Score Data Set The California Standardized Testing and Reporting data set contains data on test performance. the percentage of students who qualify for a reduced price lunch.)2 P [Equation (4.bo .~1 i .0. Estimate the same regression excluding the data from Malta.Derivation of the OLS Estimators 139 regression to predict the growth rate for a country with a trade share of 0.6)]. The data used here are [rom all 420 K-6 and K-8 districts in California with data available for 1999. number of teachers (measured as "full-time equivalents"). and student demographic backgrounds.2. Answer the same questions in c. number of computers per classroom.2 Derivation of the OLS Estimators This appendix uses calculus to derive the formulas for the OLS estimators given in Key Concept 4. and expenditures per student. To minimize the sum of squared prediction mistakes 2:.5 and with a trade share equal to 1.bIX. Demographic variables for the students also are averaged across the district.Test scores are the average of the reading and math scores on the Stanford 9 Achievement Test. d.cde. All of these data were obtained from the California Department of Education (www. school characteristics. School characteristics (averaged across the district) include enrollment.gov). first take the partial derivatives with respect to bo and b( (4.

I b xi d ~ are the values of bo and b. Bec~use Ii. l ill i-I *±XHW i=1 ~o=y-~. Xl j"'l (427) (428) given in Key Concept 42.iI).24) equa zero.has the normal sampling distribution given in Key Concept 4.settmg these derivatives equal tozer . . the and denominator in Equa- Equations (4.X " 2: (Xi. in large samples.. we show that the OLS estimator ~I is unbiased and. . eqUIvalent y. "th One Regressor CHAPTER 4 Linear Regression WI LS The O II _ estimators.3 Sampling Distribution of the OLS Estimator In this appendix.140 ..1') i.27) by I. Equauons '. - (429) = il. I the values of bo and b] or W ic 1 the denvatives' or.-X)(Y.so the numerator of the formula for ~Iin Y == f31(Xi - X) + u. APPENDIX ------------------- ---- 4. Lj:I(Y.27) and (4. and dlVldlOo Y .X){Lli .28) are the formulas for ~o and~. (423) and (4. ." 0..Y. formula ffil ..-iloX 11i=l "I II " - _ il'n£'" i=l " 1 Xf ~ =0 . Representation of ~ t in Terms of the Regressors and Errors We start by providing an expression for }'j = 130 + (3\Xi + Uh 1i Equation (4. (426) Solving this pair of equations for ~oand ~ I yields l±xy-xy n I I " 2:(X. _ bu J I" Accordingly.::0 sxyJsl 11 - is obtained by dividing the numerator tion (4. mustsatisfv collecting terms. f30 and (31.". 2: (Xi 1"'\ II X)' + 2: (Xi i=l II . .-. that minimize (30 an ###BOT_TEXT###gt; f h' I .27) is ~I in terms of the regressors and errors. . 0 b 11 shows that the OLS estimators.4. " In A the two equations (425) -2:X.

lx" .) = E(u.~l(X. . Because (4./.(X.1 I Li=IXi . . .~.:::::::=====--". E(~l - f3rl ~ E[E(~. X. . SUbstituting in - L. is unbiased.X)Uj into the final Y) =f3Il'='l (X-X)'+ 2:" I X)Ui' SUbstituting (4.-:¥)u..). that unbiased. so that E(~. ••• . the term in the numerator ~ iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii~~. is ohtained by taking the expectation of both sides of Equation (4..~I (~. (4.31) where the second equality in Equation (4.-X)Uh where . It follows that the conditional is'~1 is conditionally expectation in large brackets in the second line or Equation (4. First consider the numerator large... - - -u) = 2:. other is distributed independently than i. j. to a close approximation.) = O. Thus. X. E(UiIX.. of ~.f3J!X" .Lx.) = f31. By the law of iterated expectations. if the sample size is of X is nearly equal to f. Equivalently. I X.29) 2:'~ (X-X)(Y- . X is consistent.f31IxJ. .. By the first least however.30) Proof That f31 Is Unbiased The expectation Thus.'".(X.(X."" Ar..31.27) yields 2: =1 (.)] ~ 0..u) = yields '1-1 I L.nX]u = O. Xn) = f31. that Large-Sample Normal Distribution of the OLS Estimator The large-sample 4.<..X)(Ui .4) is obtained normal approximation by considering to the limiting distribution of ffi j (Key Concept the behavior of the final term in Equation of this term.II Equation (4.- 2::~. ..-X)u= .3).) = O. the final equality follows from the definition of X which implies that c 2:.31) is zero. given Xl. so E(u.Sampling Distribution of the OLS Estimator N ow 11 [ 141 2:1/ ..30). X. .~J(Xi I expression "." (X _ X)u = '::"/=..) follows by using the law of iterated Ui expecta- tions (Section 2.30).-X)(u.. Ecfillx isj.. so that E(~. .t.. of X for all observations squares assumption. i this expression in turn into the formula for ffil in Equation (4. By the second least squares assumption.

142 CHAPTER 4 Linear Regression with One Regressor Equation (4. in large samples.~. and the total sum of squares is the 1 sum. 15).in large samples. the sample variance is a consistent population variance. Thus the distribution of v is well approximated by the N(O. .(X. this is the sample Van.thus /I 11 " of 130lets us write the OLS residu- ~ili= ~(Y. . distribution of ~1 - 13\ Oil y/var(xi): /31 is._ L. N(13I. But the definitions 'C" ". v satisfies all the requirements (Key Concept 2.17)].Y) .- . at. I). is. by the third least squares of the central distributed N(O.i. Y. of squared residuals an d tlte exp Iained sum of squares [the ESS TSS and SSR are . X). which is inconsequential if" is large).so O. f30 f3IXi~(Y. . .30) is the sample average V. is (T~ is i. Next consider the expression in the denominator in Equation (4. The = var[(Xi- !-LX)Ui]._X)=O.30). the sample covariance s::x rete between the OLS resod ua Is an d t Ite regressors IS zero.(lJ-Y)=O . ° II defined In Equations (4. which.8)]. so in large samples it is arbitrarily close to the population Combining these two results. and (4. so that the sampling var(v)/[ var( Xi)]' (4. we I~ave that.-Y)-f31(Xi-X).2 [Equation (3. where assumption. variance of Vi Vi vi = (X j- J. limit the arem nonzero and finite . ±(X. in large samples. = if:/II.' .IS . Equations (432) through (435) say t Itat the sample average of the . the sample average of the OLS pre di d values equals '9.35) L residuals is zero. verify Equation (432) n~e t hat the deflni " ~t t ie d~nllion . TSS = SSR + E:SS.r=llli - of Y and X . .34) (4.d.32) (4. ance of X (except dividing by II rather than 1'/ - I.33) " LUiXi 1=1 = 0 and $. which is the cxpressi Some Additional Algebraic Facts About OLS The OLS residuals and predicted values satisfy: (4. where a? == n in Equa~:on = var] (Xi -I'x )11/]/ {II[ var( Xi) ]'}.jx = 0. To = "_" _ ~ t als as U. and I'.14).7). As estimator of the variance of X discussed in Section 3.mpyI that L:~. (4.LX)lIi· By the first least squares VI has a mean of zero. u~/lI) distribution. Thus v/ if. where assulllption .and (4. . Therefore .-.21). By the second least squares assumption.ul).

u. so 2:7"'1}j = where the second equality is a consequence of Equation (4. SSR + ESS. note that j L.Y) 1=1 (4.liY.I.(1. .36) is obtained using the formula for (4. where ~l the L.~Ii/'iX:' final equality follows from = 0 by the previous results. This result. "A Lio=ll.:"lUi(X X).36) wbere the final equality in Equation (4. -:~')J(X.27). so i=l ±u.~ltJ'i= 0 implies 2:7".~l B. To verify Equation (4. _ j 2:.J' + 2. -:y) .X.~lU'J = L.I. Equation (4.)(Y. .=1 /I A Y)'= 2.=1 ... i==l Y)' + 2 2:(y.0=.~. 1/ = II"" Li=lltj(f30 " + f31Xi ) _ " .1 II A 1. .(X. r/ II = 2: (I. = .. (4....::::::====::' . UiXi = L.Y)(X.-:y) ±[(I.~li1+ L. implies that Some ffil in Equation sux = O.I "-+ ~ iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii~~.32). note that tj= Y. ~I ~l - X)' ~ 0.(1.Sampling Distribution of the OLS Estimator 143 To verify Equation (4.= i"'l " .~l2:(X.y)' II" A _ ~ 2:(y.{30£"i=lUi ". + Uj. combined with the preceding results.33).l.Y).34).35) follows from the previous results and II '" algebra: /I A _ TSS~ 2:(1.37) i=J = SSR + ESS+Z2.+ 1.

4. Sections 5. some stronger conditions hold. Because the effect on test scores a f a unit change 111 class size is /3CI 5". the superintendent 144 asks. and its standard error to test hypotheses." of the population regression line is zero. The for this Section 5.3 assume that the three least squares assumptions results can be derived regarding the distribution of the OLS estimator. C sertmg t at the population regression line is flat-that i . which measures the spread of the sampling distribution of ~" Section 5. we show how knowledge of this sampling distribution can be used to uncertainty. under certain conditions. a concept Section 5. Section 5. the superintendent. claims. Is there.. She ha an angry tax. payer IS as . the raxa uni .2 /31 that accurately summarize the sampling starting point is the standard error of the OLS estimator. then some stronger One of these in stronger conditions is that the errors are homoskedastic. lass ize. in addition.5 presents the Gauss-Markov introduced theorem.the slope /3C1""S.rased in the language f regres ion analysis. In differs from one sample to the next-that make statements about /31 has a sampling distribution. no effect on test scores. /3" Section 5. explains how to construct confidence intervals for standard error (and for the standard error of the OLS estimator of the intercept). this chapter. 5. Section 5. has payer in her office who assertsthat cutting cla s size will not help boost test scores. calls you with a problem. Chapter 4 hold. .1 through 5.1 provides an expression then shows how to use ~. If.CHAPTER 5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals T his chapter continues the treatment oflinear is. h' oss I . OLS is efficient (has the smallest variance) among a certain classof estimators. . how r:gression with a single regresSor.6 discusses the distribution of the OLS estimator when the population distribution of the regression errors is normal. so reducing them further is a waste of money. the taxpayer The taxpayer's claim can be reph.1 Testing Hypotheses About One of the Regression Coefficients Your client. which states that.3 takes up the of special case of a binary regressor. Chapter 4 explained how the OLS eStimator ~ I of the slope coefficient /3.

at least tentatively pending further new evidence? This section discusses tests of hypotheses about the slope f3.6. the t-statistic has the form 5.5ize = 0. and the two-sided alternative is fl. lact the p-value for a two-sided hypothesis 2 <I> ( -ltac/I). Two-Sided Hypotheses Concerning /3. .Alternatively.5). dom sampling 3.D' as The test of the null bypothesis as in the three steps summarized standard sampling 1= alternative proceeds the in Key Concept 3. the statistic Concept samples equivalently.n.hypotheSized value standard error of the estimator' evidence in your sample of 420 observations on California school districts that this slope is nonzero? Can you reject the taxpayer's hypothesis that f3Ch". by ranvariation. The first is to compute second step is to compute 5. The the r-statistic. which has here. We start by discussing two-sided tests of the slope tests and to tests of hypotheses regarding fJ. is the same as to testing hypotheses about the population mean. in detail. based on the test statistic actually observed. Testing hypotheses about the population mean. The third step is to compute the p-value.5.1 (51) estimator . at least as different from the null hypothesis value as is assuming that the null hypothesis normal is correct (Key in large test is distribution actually observed. which is an estimator of the standard deviation of the distribution of Y. so we begin with a brief review. which is the smallest significance level at which the null bypothesis could be rejected.1. the z-statistic is the general form given in Key Concept (Y .2 that that the mean of Y is a specific value !Ly.o can be written = !LY. The general approach to testing hypotheses about the coefficient f3. applied error ofY. then turn to one-sided intercept f3o.1 Testing Hypotheses About One of the Regression Coefficients 14S General Form of the t-Statistic ~ In general. Because the z-statistic has a standard under the null hypothesis. SE(Y).o)/SE(Y).: E(Y) Ho against the two-sided * !LY./-LY. the null hypothesis flo: E(Y) Recall from Section 3. where cumulative is the value of the I-statistic actually computed and <I> is the standard normal distribution tabulated in Appendix Table l. or intercept of the population f30 the regression line. or should you accept it. the p-value is the probability of obtaining a statistic.

can be tested using the same general approach.under the null hypothesis the true population slope (31 takes on some specific value. The second step IS to compute the r-statistic . the critical feature justifying the foregoing testing procedure for the popula tion mean is that.2) To test the null hypothesis Ho. t46 CHAPTER 5 . hypothe es about the true value of the slope 13. . a two-sided test with a 5% significance level would reject the null hypothesis if 11"'1 > 1. t= 13.96. the population mean i~said to be stati tically significanily different from the hypothesized value at the 5 Yostgntflcance level. The angry taxpayer's hypothesis is that 13 I"" I" = 0. tian mean.. is an estimator of (J~" the standard deviation of the sampling distribution of (31' Specifically. '" 131.0 (two-si Ie I alternative).0 SE(ffit) (55) . " ' computed by regression software so that it is easy to use in practice. does not equal (3 t. tn applicati ns the standard error IS h '1 ' .o-That is.1. in large samples. the null hypothesis and the two-sided alternative hypothesis are Testinghypotheses about the slope f3t- Ho: 131 = (31. The null and alternative hypotheses need to be stated precisely before they can be tested. 131. (3. In this case.n.13'. AItough the formula for (J~IIS camp I'icated. Under the two-sided alternative. we follow the same three steps a for the populaThe prst step is to compute the standard error of E(ffil)' The standard err?r of 13.0 vs. For example. (53) where ih (SA) The estimator of the vari 'E ' anance 111 quauon (SA) is discussed in Appendix 5. also has a normal sampling distribution in large samples. More generally. the sampling distribution of'Yis approximately normal. Because~. h S' I Regressor' ypothesisTestsand Confidence lntervals H . At a thcoreticallevel. HI: 13. RegressionWit a In9 e the third step can be replaced by simply comparing the l-stati tic to the critical value appropriate for the test with the desired significance level. (5.

... p-value = Pr(IZI > 1t''''I) = 2<I>(-lla"I).. Compute the standard 2.. is approximately normally distributed in large samples.. _____ iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii~. (5. by comparing test. _ 'TIle third step is to compute the p-va]ue. Stated mathematically.1 Testing Hypotheses About One of the Regression Coefficients 147 Testing the Hypothesis Against the Alternative 1.96. Compute the p-vaJue f3..3)]...::::". 'f.f31. equivalently. [Equation (5. in Key Concept 5..ol > = Pr" o 1~1" - f31. error of ~t. the probability of observing a value of f31 at least as different from f3t.96.ol] = Pr" 0 [1 ~1-f3I...7)]. p-value = Pr"o[I~1 ..Oas the estimate actually computed (~1Cf). SE(~I) [Equation (5..oll~1"-f3t'OI] _ > _ SE(f3I) SE(f3I) (1t1>11""I).. the critical value for a two-sided of obtaining a value of f31 at least as far from the null as that actually observed is less than 5%.. so in large samples.2. Compute the z-statistic 3.. if It ""I > 1. as a standard under the null hypothesis the I-statistic is approximately distributed normal random variable..96.. the hypothesis can be tested at the 5% significance level simply at the 5% level if Ila"l the value of the r-statistic to ±1. under the null hypothesis. = f31 0 f3. the probability hypothesis is rejected at the 5% significance level.7) A p-value of less than 5% provides evidence against the null hypothesis i~ the sense that. If so... Reject the hypothesis at the 5% sig- nificance level if the p-value is less than 0. the null Alternatively...2 (5.5....6) where Pr"o denotes the probability computed under the null hypothesis.05 or.-=-== . The standard error and (typically) the z-statistic and p-value testing f3'1= 0 are computed automatically by regression software. the second equality follows by dividing by SE(~ I)' and la" is the value of the r-statistic actuBecause ~J ally computed. assuming that the nul] hypothesis is correct..".. (5..f3'"o [Equation KEY CONCEPT 5..5)].. and rejecting the null hypothesis These steps are summarized > 1.

2.0 against the hyp~thesls that f3.11 . . . Because of the importance of the standard errors. as shown in Figure 5.. by onvenuon they are included when reporting the estimated OLS coelf. The OlS Reportmg regression eq '.e null hypothesis f3Ch. . e against the student-teacher ratto. One compact way to I t d d errors is to place them in parentheses below the respective report t re S an ar coefficients of the OLS regression line: TestScore _ 698. itis . This is a common format f I' reporting a single regression equation. it is reasonable to conclude that the null hypothesis is false.11. 111is probability is the area in the tails of standard normal distribution.y.8) at the 5% significance level. TIllis quation (5. .8) into the general f rrnula in Equation (5.3 . 11.8) also reports the regression R2 and the standard error f the regression (SER) following the estimated regression line.5). construct tbe z-statistic and compare it to 1. Thi r-statistic exceeds (in absolute value) the 5% two-sided critical value f 1. uations and application to test scores. = f31. less than 0.38. That is. as far [rom the null as the value we actually brained is extremely small.1. is zero in the population counterpart of Equation (5.52 = -4. " . approximately 0.e standard error f these estinon (4. and its standard error from Equation (5. we can compute the p-value associated with I'''' = -4.9 . Id d f3' ~ 698 9 and f3.o. estimates of the sampling uncertainty of the slope and tbe intercept (tbe standard errors). the estimated slope. To do so. the 5% (two-sided) critical value taken from the standard normal distribution. Suppose yon wish to test the null hypothesis that the sl pe /3.6. This probability is extremely small. the result is t'" = (-2.52) = 0. ometimes. reported In Equa regresslOO of the test scar .00001.28. S E R = 18. HypothesisTests and Confidence Intervals 148 CHAPTER 5 Regression with a Single Regressor. and two measures of the fit of this regressian line (tbe R2 and the SER).isis a two-sided hypothesis te t. however. The discussion so far has focused on testing the hypothe i that /3..001'Yo.clent . Because this event is so unlikely.28 . or 0. .9 . ) provides the estimated regression line.) = 0. and it will be used throughout the rest f this book. One-Sided Hypotheses Concerning {3. because under the alternative /3. under the null hyp thesis (zero).28 x STR.8) Equation (5. TI. mates are SE(~o) = lOA and SE(f3.0)/0. Alternatively. (5.051 . ify. the probability of obtaining a value of /3.s the null hypothesis is rejected in favor of the two-sided alternative at the 5% significance level.e r-statistic is constructed by substituting the hypothesized value of f3. ). '" l3"o. = -2.52.001 %.96.ssSiu = 0 is true. could be either larger or smaller than f3.e eo' . R2 (lOA) (0. .

00001 a The p-value is the area to the left of -4.0 vs. therefore.645. It might make sense. the construction sided alternative sided alternative of the I-statistic is the same.9) is reversed. The only difference between (5. but not large positive. (5. to higher Scores.o. appropriate ratio/test learning to use a one-sided hypothesis test. f3 j is negative: Smaller classes lead to test the null hypothesis that environment.and two-sided hypothesis in Equation for large negative.9) ratio exam- where f31.9).5. If the alternative is that f3. For the oneis rejected against the one- test. values of the (-statistic: Instead of rejecting if level if tact < -1. the hypothesis is rejected at the 5% significance . is less than than f31.96.and a two-sided hypothesis a test is how you interpret the (-statistic. the null hypothesis and the one-sided alternative hypoth- esis are Ho: f31 = f31.0 is the value of f31 under the null (0 in the student-teacher pie) and the alternative greater Because is that f3.38. HI: f31 < f31. It'IC'1 > 1. When r"> -4.o (one-sided alternative).38 The p-value of a two-sided test is the probability that !Zl > ItOC'1 where Z is a standard normal random variable and tOcr is the value of the r-sratistk calculated from the sample. For a one-sided test. For example.38 + z the area to the right of +4. f3l.0' the inequality the null hypothesis in Equation (5. the p-value is only0. is is the same for a one. f31 = 0 (no effect) against the one-sided alternative that f3j < O. the null hypothesis one. many people think that smaller classes provide a better Under that hypothesis. in the student-teacher Score problem.1 Testing Hypotheses About One of the Regression Coefficients 149 cmmn:DI Calculating the p-Value of a Twa-Sided Test When /"ct = -4.38.

B "u on lh~.uion j ke thai a university's secret of ucce s is 10 admit mlemed . the IIlcquUlities in Equations (5. If the alternative i one.lO) If the alternative hYPolhesi is that fJ. upOn reflection this might not nccessaril be newt ormul H 'd dru Undergoing clinical trials actually could prove harmful he u e of pr \ II U I unr 'cognized side effects.llIp" {JI ca ionally.lI) 1lte general approach t testing thi null h pothell " n\l ts of Ihe Ihree stepS Key Concept S.th .150 CHAPTER 5 . In the class size example. rd ncr. so the IJ-V lu Pr(Z > (""').lgOlll't Ih· n '"J u Iltcrtlotive at the 1% level. How_ ever.m ###BOT_TEXT### then make sure that the faculty stays out of their way and eoe llute dum I' .33 (Ihe critical valuc fro onc 'Id 'd I "1 \ Itil I 1% igni(icance level).' d 10.2 applied 10 fJo (Ihe f nnula for Ihe lanu. When should a one-sided test be used? In I rh n hI I III ('Ir bability. Testing Hypotheses About the Int rc pt 130 This discussion has focused on lesting h pot he".(l()(1Il .o (t wll'Sld 'd lit rn 1\ 1\ c). .. 8. however.. This value is less than -2.o vs.. The r·Slalhlic resrln the h pothe I Ih II there is no effect of class size on lest seore [ fJt. pn r empin tl evid -nc or h th. we or' remind 'd 01 th r ulu. d d Iternalive hypotheses should be use I nly when ther . even if it initially seems that the rdev nl nit rn 11

Related Interests

I un vrdcd.. I I Ihl . id~d./lt: fJo 'F fJo. so the null hypOthesis is r~j' I'd .lI 0 in 'qu lion ( . as Pr ( Z < I "rn <P(r"") 1= (p.10) are rever ed.Iud . you can reject the angry taxpayer's os erti n IhOtlhc n' lItI

Related Interests

"IIIlUll tlf Ihe slope arose purely because of rand III samplin lIritll'On tth I • I 11Ihcan e level. 111 • null hI pothesis concerning the intercept and the Iw ·sided allematlve arc Ho: /30 = fJo.ngleRegressor.tru 'ntlr f ~o is given 111 Appendix 5.I n practice. the p. The p-value for a one-sided mal distribution p-value = Inl I te t i: bwined from thc cumulauv e nd. Hypolhesls Tes15 and Confi Regressionwith a S. such ambiguity oflen lends ec n mcirl ion 10 1I1' 1"11 'Ided I ·. (S. Ihon O. on ·. pracn e. '1)11'1 4.value i Ie. i grcnler th n (JI o.. In (act. one-vid 'dlcfl-lillll I). (S.9) and (5. the hypothesis c ncern the inter ept (3u.1 Application to test scores. Imul th • .111is reason could come from economic the ry. approa h" mlldificd as was discussed 10 the previous subsecti n for h p the es OOOul the I r' III .vnluc. n for d in' .i le~r T~ .1).

First. there are many times that no single hypothesis about a regression coefficient is dominant. Being able to accept or to reject this null hypothesis based on the statistical evidence provides a powerful tool for coping with the uncertainty inherent in using a sample to learn about the population.0 is outside the range ~ iiiiiiiiiiiiiiilliii ""'-". the true value of {3. Because the 95% confidence interval (as defined in the first definition) is the set of all values of f31 that are not rejected at the 5% significance level.testing the null hypothesis f3. in 95% of all possible samples.. in 95% of possible samples that might be drawn. reject the true value of {31 in only 5% of all possible samples.3). it follows that the true value of f31 will be contained in the confidence interval in 95 % of all possible samples. The reason these two definitions are equivalent is as follows. however. 5. and instead one would like to know a range of values of the coefficient that are consistent with the data.The 95% confidence interval is then the collection of all the values of {31 that are not rejected. in principle a 95% confidence interval can be computed by testing all possible values of {3J (that is. it is the set of values that cannot be rejected using a two-sided hypothesis test with a 5% significance level. It is possible. Second. A hypothesis test with a 5% significance level will. An easier way to construct the confidence interval is to note that the t-statistic will reject the hypothesized value {31. that is. Confidence interval for {3. Yet.2 Confidence Intervals for a Regression Coefficient Because any statistical estimate of the slope {31 necessarily has sampling uncertainty.0 for all values of {3J. we cannot determine the true value of {31 exactly from a sample of data. the confidence interval will contain the true value of f31' Because this interval contains the true value in 95% of all samples. = {31."" _ .5.2 Confidence Intervals for a Regression Coefficient 151 Hypothesis tests are useful if you have a specific null hypothesis in mind (as did Our angry taxpayer). Recall that a 95% confidence interval for {31 has two equivalent definitions.o) at the 5% significance level using the t-statistic. by definition. it is an interval that has a 95% probability of containing the true value of f31. it is said to have a confidence level of 95%. will not be rejected.O whenever {31. to use the OLS estimator and its standard error to construct a confidence interval for the slope f31 or for the intercept f3o. This calls for constructing a confidence interval. As in the case of a confidence interval for the population mean (Section 3. But constructing the z-statistic for all values of f31 would take forever. that is...

11le ther end of the confidence intervnl e. = is not contained in this confidence interval. The OL regression stud~nt-teacher ratio.it contains the true value of /3.96x 0. ± 1. + L96S£(~ . ted by a 5% two-sided hypoth SIStest. Becauseone en I of a 95% con(idence interval for /3 is [3 . of a confidence interval [3.12) + 1. t he argument used for to develop a confidence interval for the population mean.)J. that is'. it is constructed as 95% confidence interval for [~.96SE(~I)' That is. . reported in quation f the te tore against the -2. . 10 ibl 01 all POSSI e ran do Inly drawn samples. .8). yielded ffil SE(~. Hypothesis Tests and ConfidenceIntervals ~ Confidence Interval for f31 5.) = 0.52. h a 95°'(0 probability'. . so (as we knew already from ection 5. The value [3. ~. The 95% two-sided confidence int rval for f3. the 95% confidence interval for /31 is the interval [~j-1. in 95"' valueof [3 lwlt ( .e 95 % confidence interval for /3.96SE(~.96 E(~. is surnmurizcd as Key Confidence interval for [3n A 95% confidence interval for f3n is constructed as in Key Concept 5.I 96S£([3') I .52}. ~. /31 = (5. ". The p pulati Yasso- n slope f31 is unknown. or -3. but because we can construct a confidence interval for (3" we can construct a confidenceinterval for. .the predictedeffect /316x. can be usedto construct a 95% confiden e interval for the predicted effect of a general change in X. is {-2.96SE([3.30 :5 [3.28 and (5. When the sample size that cannot b e rejec is large. " .152 CHAPTER 5 Regression with a Single Regressor. Consider changing X by a given amount. EqUivalently. t ie predicted effect of the change Ax uSing estunateof IS[/31 .) ] X 6x.3.1) the hypothesis [3.3 id d fide ice interval for [31 is an interval that contains the tru A 95% tWO-SI e con I I . I.1965E(~I)j.3.)]. II IS the set of values of"~I .28 ± 1. :5 -126.. 111i argument parallel.. with ~o and SE(ffio) replacing ~I and E(~I)' Application to test scores. tllis .6x. a Confidence intervals for predicted effects of changing X 11. ~ 0 can be rejected at the 5% significance level. The construction Concept 5.' .1.11le predicted change in ciated with this change in X is /3. e . 6x.). .

The interpretation .13) reducing for the superintendent by 2. dependratio is less than 20: ratio in ilh district < 20 ratio in ith district > 20· is (5.D.52. + 1. = . if female. .60 or as little as -1. i = 1. ~.4.3 RegressionWhen X Is a Binary Variable t 53 is}.96SE(~1) X D... = The population regression (5.)] X D.96SE(~ 1) X D.8. = 0 if male). + 1. Because the 95% confidence interval . = 0 if rural). . Thus decreasing the student-teacher ratio by 2 is predicted to increase test scores by between 2.26].X For example.X . suppose you have a variable D.8 1D. gender so far has focused on the case that the regressor is a continuous Regression analysis can also be used when the regressor is binary-that is. = 0 if large).80 + . is different. whether a school district is urban or rural (= 1.8.1. when it takes on only two values.x = ing x by the amount D.26 X (-2) = 2. as described ill Section 3..14) model with D. that equals either 0 or 1. with a 95% confidence level.D. X might be a worker's (= 1. For example.3 Regression When X Is a Binary Variable The discussion variable.30. 5.30 X (-2) = 6. or whether the district's class size is small or large (= 1 if variable or some- small.52 and 6.8.x. A binary variable is also called an indicator times a dummy variable. n. ing on whether the student-teacher To see this. as the regressor y.96SE(. Interpretation The mechanics of the Regression Coefficients of regression of with a binary regressor are the same as if it is cona difference of means tinuous. our hypothetical student-teacher [-3. 0 or 1.60 points.5. ratio + 1. is -1. and it turns out that regresto performing sion with a binary variable is equivalent analysis.x. and the predicted effect of the change using that estimate is [. the effect of reducing the student-teacher ratio by 2 could be as great as -3. is contemplating (5.96SE(~Jl.x]. I __ lIiIiiiiiiiiiiliiilll _ .x can be expressed as 95% confidence [~. Thus a 95% confidence interval for the effect of changinterval for . + 1Ii.8.15) 1 if the student-teacher { 0 if the student-teacher D. if urban. D.8" however.

ID. E( Y. io the two groups. Y. ges of Y. when D. ically. Thus we will not lies there IS no me.=O). it makes sense that the OLS estimator f3j is the difference between the sample aver. as the refer to f3i as the s ope In . (516) Because E(Il.15) is not a slope. It' I'D· in this regresSIOn or. . = 0. when D. when D. then f3.f30 = f3. Because {31 is the difference in the population means. = I. D i . " ' . what is it? The best way to interpret ~o ' regression with a binary regressor 15 to consider. I on ~~ f31 in Equation (5. .) . that is. can take on only two val· not useful to t 10 0 1 .. f31 is the difference between mean test score in districts with low student-teacher ratios and the mean test score in districts with high student-teacher ratios. If the two population means are the same. and f30 i the pop' ulation mean of Y.ID.ID.' the binary variable D. and. wben D. ThIShypothesis can be tested using the procedure outlined in Seeti n 5. . = E( Y. In the test score example. 154 CHAPTER 5 ' H pothesis Testsand Confidence Intervals y Regression with a Single Regressor. in fact.1 5).15) becomes Y.15) is zero. In other words. Specif.) = 0. that is.1. = I. + III (D. r f3. = 0.. = 1)E(Y. = 1 If the student-teacher rauo IS 11Igh. = 0) = f3o.ID. sion model with the continuous regressor X. when D. = 1.17) Thus. = f30 + f3. in Equation (5. • . one at a lime. Thus the null hypothesis that the two population means are the same can be tested against the alternative hypothesis that they differ by testing the null hypothesis f3. . the conditional expectation f l'/ when D. f30 + f31 is the population mean value of test scores when the student-teacher ratio is low. Similarly. Because f30 + f3. 1I1stead we will simply refer to f3. (D.ID. ible cases. is the population mean of Y. IS not eOnlmuous it is that now the regresSOl IS _ ' hi k f f3 as a slope' indeed. so 1 ~ . the coefl1cient coefflcient mu Ip ymg . is the difference between these two means. I' Equation (5.''''1 makes no sense to talk about a slope. = 1 and when D. because D. f3.=f30+U. then D. this is the case. the null bypothesis can be rejected at the 5% level against the two-sided . = 0). = 0 is E(Y. = 0 POSSI I • I and Equation (5. Hypothesis tests ond confidence intervals. = 0 against the alternative f31 # 0. is the difference between the conditional expectation of Y. except This is the same as t he regres. = 1) = f30 + f3. the two an d f3 1108 • . = I). (5. f30 is the population mean value of test scores when the student-teacher ratio is high. Because D. more compactly. "1.0 and D.. the difference (f30 + f3.

9). The difference between the sample average test scores for the two groups is 7.4 Heteroskedasticity and Homoskedasticity 155 alternative when the OLS r-statisric I ~ P 1/ SE(Pl) exceeds 1.96 X 1. furthermore..0.4/1.2 yields TeSIScore ~ 650. 5. then the errors are said to be homoskedastic.0 (1.4 Heteroskedasticity and Homoskedasticity Our only assumption about the distribution of u.0 + 7. 10. PI Application to test scores. The OLS estimator and its standard error can be used to construct a 95% confidence interval for the true difference in means.96 in absolute value. and the risks you run if you use these simplified formulas in practice.4 ~ 657. constructed as ± 1. (5. provides a 95% confideuce interval for the difference between the two population means. As an example.04.18) where the standard errors of the OLS estimates of the coefficients f30 and f31 are given in parentheses below the OLS estimates.4 ± 1.4D. so that (as we know from the previous paragraph) the hypothesis 131 ~ 0 can be rejected at the 5 % significance level. Is the difference in the population mean test scores in the two groups statistically significantly different from zero at the 5% level? To find out. construct the z-statistic on f31: I ~ 7.8 ~ 4. so the hypothesis that the population mean test scores in districts with high and low student-teacher ratios is the same can be rejected at the 5% significance level. the simplified formulas for the standard errors of the OLS estimators that arise if the errors are homoskedastic.8) ~ 0.3) + 7. . and the average test score for the subsample with studentteacher ratios lessthan 20 (so D ~ 1) is 650. SER ~ 18.96SE(iJd as described in Section 5. for which D ~ 0) is 650.2. Similarly. a regression of the test score against the student-teacher ratio binary variable D defined in Equation (5. This confidence interval excludes f31 ~ 0. its theoretical implications.8 ~ (3. This section discusses homoskedasticity.4.037. This is the OLS estimate of f31.4.96 in absolute value. the variance of this conditional distribution does not depend on X. is that it has a mean of zero (the first least squares assumption).5. conditional on X. If. a 95% confidence interval for f31. This value exceeds 1.7. R2 (1. the coefficient on the student-teacher ratio binary variable D. Thus the average test score for the subsample with student-teacher ratios greater than or equal to 20 (that is.9.This is 7.14) estimated by 0 LS using the 420 observations in Figure 4.

in Figure 4. so that the err r in Figure 5.. this distribution is tight. the error term is heteroskedastic. Unlike Figure 4." butlo n ot '7"_ 7'~" uno n e! u glven)( var(uIX). = x. the conditional variance of III given X. return to Figure 4..2 illustrates a case in which the conditional distribu.4. this is the conditional distribution of "I given X. = x does not depend all x. ~~~~~---t5~~~~~. Because this distribution appJie specifically for the indicated value of x. cisely. more pre.. given Xi is con. tion of u.4 are homoskedastic. Otherwise. As an illustration. . . and homosked sri ity arc summarized Cim!Il'Im distribution of test An Example of Heleroskedasticity Test score Like Figure 4.. it has a greater spread.-_0. depends on X. As drawn in that figure.4. stant for i = 1.4. 156 CHAPTER 5 . That is. = x increases with x.JL~~ 15 20 25 30 Student-teacher ratio .2 are heteroskedastic. TIle definitions of heteroskedasticity in Key Concept SA. For small values of x . The error term "i is homoskedastic if the variance of the conditional distribution of u. Regression WIt a Ing e What Are Heteroskedasticity and Homoskedasticity? Definitions of heteroskedasticity and homoskedasticity. h S· I Regressor'HypothesisTests and Confidence Intervals . III contrast. Because the variance of the distribution of .. so the errors illustrated in Figure 4. these distributions become more spread out (have a larger variance) for larger class sizes. the variance of these distributions is the same (or the various values of r. The distribution of the errors "i is shown for various values of x.2 the variance of Ui given X. u is 10 hetercskedastic. Thus in Figure 5. Figure 5. n and in particular does not depend on X. spreads out as x increases. but for larger values o( x. all these conditional distributions have the same spread.-~~_~-./-. this shows the conditional scores for three differ- 720 700 680 660 640 620 600 Distribution of Y when X = 15 ent class sizes..4.

. . Earnings.21) = f30 + f31 + UI Thus. In this regard.IXi = x). for women. is the variance of the error term the same for men and for women? If so. u. depends on MA LEI requires thinking hard about what the error term actually is. var(u.. (5.19) as two separate equations. so at issue is whether the variance of the error term depends on MALE. f31 is the difference in the population means of the two groups-in tbis case. The binary variable regression model relating a college graduate's earnings to his or her gender is Earnings. (5. it is useful to write Equation (5." is eqnivalent to the . is constant for i = 1... is homoskedastic if the variance of the conditional distribution of til given Xi. . It follows that the statement. be a binary variable that equals 1 for male college graduates and equals 0 for female graduates. we digress from the student-teacher ratio/test score problem and instead return to the example of earnings of male versus female college graduates considered in the box in Chapter 3. n. The definition of homoskedasticity states that the variance of does not depend on the regressor. is the deviation of the it" woman's earnings from the population mean earnings for women (f3o). and the definitions might seem abstract.19) for i = 1. it is heteroskedastic. 5. one for men and one for women. if not. UI is the deviation of the it" man's earnings from the population mean earnings for men (f3o + f3.20) (5. Here the regressor is MALE. . and for men. n and in particular does not depend on x. Because the regressor is binary. To help clarify them with an example. "the variance of UI does not depend on MALE. It. (women) and (men). Otherwise. = f30 + It. . Deciding whether the variance of u. tbe difference in mean earnings between men and women who graduated from college. In other words.SA Heteroskedasticily and Homoskedasticily 157 Heteroskedasticity and Homoskeda5ticity ~ The error term U. the error term is beteroskedastic. These terms are a moutbful.. Earnings. the error is hornoskedastic.. = f30 + f31MALE." Let MALE.4 Example.). + U. "The Gender Gap in Earnings of College Graduates in the United States.

Mathematical Implications of Homoskedasticity The OLS estimators remain unbiased and asymptotically norrnal. ' tIlOy ' e an where s' is giiven ~n 'E f. e 'd' t ibution of earnings is the same for men and women. then there is a specialized formula that can be used fOl}he standard errors 01' ~o and ~ I' 11..3 hold and thc errors are hornoskedas. homoskedasticity-only estimator of the variance of ~I: (hornoskedasticity(522) ' quauon (4. they apply to both the general case of hetcroskcdasticity and the special case of homoskedasticity. if these va ' popu Ianon IS n I'. ditional variance. then the formulas for the variances of ~o and ~ I in Key oncept4. consistent. the OLS estima· tors have sampling distributions that are normal in large samples even if the errors are homoskedastic. the OLS estimators remain unbiased and consistent even if the errors are homoskedastic. n. is the variance formula. then the OLS estimators Po and ~I are efficient among all estimators that are linear in 1'\. . In the special case that X is a binary variable.4 simplify. is discussed in Secti n 5.the square of the standard error 01' /3.. ances differ. Homoskedasticity-only If the err r term is homoskedastic. " h t e V31lanc " e of earnings is the same for men as it is for Wonlen" I .I' 111is result.. ."/ and are unbiased.1.1. under homoskedasticity) i the so-called pooled vanance formula for the difference in means.3 place no restrictions on thecon. and asympt tically normal.. the error term is heteroskedastlc. [f the least squares assumptions in Key Concept 4. 158 CHAPTER 5 ' with a S' Reqressron statement.. the OLS estimator is unbiased. In addition. Because the least squares assumptions in Key Concept 4.. Therefore. . given in Equation (3. tic. if the errors are homoskedastic. .5. Because these alternative formulas are derived for the special case that the errors are homoskedastic add 0 not ' apply If rhe errors are heteroskedaslic.The homoskedasticity-only f rmula for the standard error of /30 is given in Appendix 5. 1119e I R ressor HypothesisTests and eg . Confidence Intervals n ' thi xample the error term is homoskedastlc If the vallance ofth other wor d 5. In lIS e c . conditional 11 X\l .19 ) . which is called the Gauss-Markov theorem. Whether the errors are homo kedastic or heteroskedastic. . Efficiency of the OLS estimator when the errors are homoskedastic. '. I X.e homoskedastieity-only standard error of /3" derived in Appendix 5. Consequently. is SE(~I) = where 0'6.!fiJ. the estimator of the v~riance of ~I under h III skedasticity (that is. .l'. nly).23).

19) is heteroskedastic. todaywomen were not found in the top-paying jobs:There have always been poorly paid men. In other words . heteroskedasticity or homoskedasticity? The answer to this question depends on the application. For many years-and. Because the standard errors we have used so far [that is.those based on Equations (5. the estimators and of the variances of ~I and ~o given in Equations (5. Because such formulas were proposed by Eicker (1967). if the errors are heteros ked astic. "The Gender Gap in Earnings of College Graduates in the United States"). the variance of the error term in Equation (5. the issues can be clarified by returning to the example of the gender gap in earnings among college graduates.4) and (5. Thus the presence of a "glass ceiling" for women's jobs and pay suggests that the error term in the binary variable regression model in Equation (5. ai. in general the probabun« that this interval contains tbe true value of the coefficient is not 95%. In fact. Familiarity with how people are paid in the world around us gives Someclues as to which assumption is more sensible. and White (1980).4 Heteroskedasticity and Homoskedasticity 159 will be referred to as the "homoskedasticity_only" formnlas for the variance and standard error of the OLS estimators. if the errors are heteroskedastic but a confidence interval is constructed as ±1.4) and (5. because homoskedasticity is a special case of heteroskedasticity.jjnn hypothesis tests and confidence intervals based on those standard errors are valid whether or not the errors are heteroskedastic. they are also referred to as Eicker-Huber-Wllite standard errors. In contrast. the correct critical values to use for this homoskedasticity-only I-statistic depend on the precise nature of the heteroskedasticity.26) produce valid statistical inferences whether the errors are heteroskedastic or homoskedastic. even in large samples. then the z-statistic computed using the homoskedasticity-only standard error does not have a standard normal distribution. Specifically. Unless there are compelling reasons to the contrary-and we can think of noneit makes sense to treat the error term in this example as heteroskedastic.21) for men. to a lesser extent. However. ai. As the name suggests. but there have rarely been highly paid women. even in large samples. so those critical values cannot be tabulated. What Does This Mean in Practice? Which is more realistic. if the errors are heteroskedastic. Huber (1967). . This suggests that the distribution of earnings among women is tighter than among men (See the box in Chapter 3.96 homoskedasticity-only standard errors. they are called heteroskedasticity-robust standard errors. then the homoskedasticity_only standard errors are inappropriate.26)] lead to statistical inferences that are valid whether or not the errors are heteroskedastic.5. Similarly.20) for women is plausibly less than the variance of the error term in Equation (5.

i I I 0 . this standard tion is $7. SER Education.23) earning $50 per hour by the time they are 29. it might also be that the spread 01 the of earnings is greater for workers with of earnings 0 . u II I--r] 10 : -20 15 Years of education . But if the best-paying jobs mainly go to the college educated.159.3.1. but some will. (5. Earnings with many years of cducarion have low~payingjobs.3 has two striking features. The spread around the regression line increases with the years of education. for workers with a high seh 01 diploma.91.to 30-Year Olds in the United States in 2008 Average hourly earnings 80 Hourly earnings are plotted against years of education for 2.Or distribution 1.to 3D-year-old workers. indicating that the regression errors are heteroskedastic.05) (0. and devia- r r workers with a college degree. Figure 5.76 Years (1.This increase is summarized by the OLS regression line. • 60 40 20 Fiucd values . hourly earnings increase by $1.96 tion. this standard deviation increases these standard of education. RegreSSion Wit The Economic Value of a Year of Education: Homoskedasticity or Heteroskedasticity? average. and workers with only len years of education have no shot at those job This line is plotted in Figure 5.25. n errors He hetcroskedastic. Figure 5. so answering it requires analyzing data.76 in the OLS regression line means that on Gm!IiIm scatterplot of Hourty Earnings and Years of Education for 29. workers with more education have higher earnings than workers with less educa- additional year f education. For workers with ten years of education.50.h a S· gle Regressor' Hypothesis Tests and Confidence Intervals In . This can bc quantified at the spread of the residuals around sion line. The data come from the March 2009 Current Population Survey. the rcgressi real-world terms.08) R' = 0. Does the distribution spread out as education increases? This is an empirical question. to $12.I .76 for each O n average.76 ± 1. In all college graduates = 9. f- I . the varinnce (the years 11 t regression of Equation (5.30. The coefficient of 1.38 + 1.08 .34. The 95% confidence X interval for this coefficicnt is 1. The second striking feature of Figure 5. which is described in Appendix 3. i !. Because levels in the deviati l1S differ for different f the residuals f cducauon). While more education.. 160 CHAPTER 5 . The first is that the mean of the distribution of earnings increases with the number of years of education. on the value in other will be = .989 full-time.5.3 is that the spread of the distribution with the years of cducnti of earnings increases some workers n. the standard deviation of the residuals is $4.60 to 1. 29.3 is a scarterplot of the hourly earnings and the number of years of education for a sample of 2989 full-time workers in the United States in 2008. with between 6 and 18 years of education. very few workers with low levels f education have by looking the OLS regres- high-paying jobs. ages 29 and 30.23) depends of the regressor words.

many software programs report homoskedasticity-only standard errors as their default setting. if the least squares assumptions hold and if the errors are homoskedastic.. so it is up to the user to specify the option of heteroskedasticity-robust standard errors. As just discussed. which is a consequence of the Gauss-Markov theorem.. then the OLS estimator has the smallest variance of all conditionally unbiased estimators that are linear functions of r.. and has a normal sampling distribution when the sample size is large. is always to use the heteroskedasticity-robust standard errors.. All of the empirical examples in this book employ heteroskedasticity-robust standard errors unless explicitly stated otherwise.The details of how to implement heteroskedasticity-robust standard errors depend on the software package you use. 1'. economic theory rarely gives any reason to believe that the errors are homoskedastic. then..5. then you should use the more reliable ones that allow for heteroskedasticity. is consistent. If the homoskedasticity-only and heteroskedasticity-robust standard errors are the same.. it might be helpful to note that some textbooks add hornoskedasticity 1'0 the list of least squares assumptions. In addition. In this regard. it is useful to imagine computing both. . For historical reasons. has a variance that is inversely proportional to n. iiiiiiiiiiiiiio ...5 The Theoretical Foundations of Ordinary Least Squares As discussed in Section 4.:::==----1 . under certain conditions the OLS estimator is more efficient than some other candidate estimators. ____ iiiiiiiiiiiiiiiiiii ... This section explains and discusses this result. then choosing between them. It therefore is prudent to assume that the errors might be heteroskedastic unless you have compelling reasons to believe otherwise. The main issue of practical relevance in this discussion is whether one should use heteroskedasticity_robust or homoskedasticity-only standard errors. nothing is lost by using the heteroskedasticity-robust standard errors. the OLS estimator is unbiased. At a general level. "This section is optional and is not used in later chapters. Specifically. Practical implications..5 The Theoretical Foundations of Ordinary LeastSquares 161 As this example of modeling earnings illustrates. if they differ. The section concludes with 'Jn case this book is used in conjunction with other texts. however. The simplest thing. however.! *5. ...5. heteroskedasticity arises in many econometric applications. this additional assumption is not needed for the validity of OLS regression analysis as long as heteroskedasticity-robust standard errors are used.

= ~aiY. ' f It ' rive estimators that are more efficient than OLS when th a dISCUSSion0 a et na 1 e conditions of the Gauss-Markov theorem do not hold. thai arc linear functions of YJ... if {31 is a linear estimator.. given X].. that is... summarized in Key Concept 3. (5. .." ..3) hold and if the error is homoskedastic. is conditionally unbiased if the mean of its conditional sampling distribution. the estimator 13t is conditionally unbiased if E(13...2.2 that the OL estimator is linear and conditionally unbiased. the OLS estimat r is the De t Linear conditionally Unbiased Estimator-that is.. In other words. The Gauss-Markov a set of conditions known as the Gauss-Markov has ~he smal. which are stated in Appendix 5. Jt is shown in Appendix 5. .. x" but not on Yj.1 and that are unbiased.) = {3.. are implied by the three least squares . x. then it can be written as " ~.. .. Linear Conditionally Unbiased Estimators and the Gauss-Markov Theorem If the three least sqnares assDlllptions (Key Concept 4. )(". then the OLS estimator has the smallest variance. is conditionally unbiased). conditional On among all estimators in the class of linear conditionally unbiased esu.24) (it is linear) and if Equation (5.. That is.3. given X". it is BLUE. III Linear conditionally unbiased estimators.1'..lest conditional variance. . . (ii. can depend on Xii' .1'. (13. This result is an extension of the result. is (3" That is.. The clas of linear conditionally unbiased estimators consists of all estimators of {3. i=1 a/I 1 I (5. und~r conditions the OLS estimator~.. of ~II linear conditionally unbiased estimators of {3j. theorem states that. . I x. The Gauss-Markov theorem. 162 CHAPTER 5 ' 'I R r: HypothesisTests and Confidence Intervals RegreSSionwith a SIng e egresso. that the sample average Y is the most efficient estimator of the population mean among the class of all estimators that are unbiased and are linear functions (weighted averages) of 1']. is linear). X. }j" The estimator 13. . X b . The Gauss-MarkOv conditions. X mators.lx" ..25) The estimator ~j is a linear conditionally unbiased estimator if it can be written in the form of Equation (5. .24) where the weights a" . conditional on Xlt···..the OLS estimator is BLUE. .25) holds (it is conditionally unbiased).

==:-r .5 and proven in Appendix 5. if tbe error term is heteroskedastic-as it often is ineconomic applications-then tbe OLS estimator is no longer BLUE.:. tbe presence of beteroskedasticity does not pose a threat to inferencebased on heteroskedasticity-robust standard errors.then the OLS estimator ~l is the Best (most efficient) Linear conditionallyUnbiased Estimator (is BLUE). If the errors are heteroskedastic. its conditions might not hold in practice. called weighted least squares (WLS). weights the i''' observation by the inverse oftbe square root of tbe conditional variance of u.5. is known up to a constant factor of proportionality-then it is possible to construct an estimator tbat has a smaller variance than the OLS estimator. but it does mean that OLS is no longer the efficient linear conditionally unbiased estimator. Regression Estimators Other Than OLS Under certain conditions. An alternative to OLS when there is heteroskedasticity of a known form. tbese other estimators are more efficient than OLS.5 The Theoretical oundations OrdinaryLeastSquares F of 163 TheGauss-Markov Theorem for ffi.However. the errors in this weighted regression are homoskedastic. tben OLS is BLUE. The weighted least squares estimator. there are other candidate estimators that are not linear and conditionally unbiased. is discussed below.2. The Gauss-Markov theorem is stated in Key Concept 5. The Gauss-Markov theorem provides a theoretical justification for using OLS. Because of tbis weighting. is BLUE. This method. under some conditions. called the weighted least squares estimator. some regression estimators are more efficient than OLS. if the conditional variance of u. so OLS. then OLS is no longer BLUE. If the nature of tbe heteroskedasticity is known-specifically. given X. ________ iiiiliiiiiiiiiiiiiiitliiiiiiiiiiiiiiiil --. Consequently. when applied to the weigbted data. As discussed in Section 5. 5. First.4. given X. In particular..3 hold and if errors are homoskedastic. ~ If the three least squares assumptions in Key Concept 4.5 assumptions plus the assumption that the errors are homoskedastic. Limitations of the Gauss-Markov theorem. the theorem has two important limitations. if the three least squares assumptions bold and the errors are homoskedastic. The second limitation of tbe Gauss-Markov tbeorem is that even if the conditions of the theorem hold.

3.These five assurnptions-. As discussed in ection 4. IS therefore used far less frequently than OLS. arc obtained by solving a minimization like that in Equation (4. that the errors are hornoskedasric. Thus the treatment of linear regression throughout Ihe remainder of this text focuses exclusively on least squares method.is uncommon in applications. the exact distribution f the r-statistic is complicated and depends on the unknown population distributi n of the data. and the regression errors are normally distributed. I k wn in econometnc applications. .164 CHAPTER 5 Regression with a Single _ R egresso.-.If.6) except that the absolute value of the prediction "mistake" is used instead of its square. The t-Statistic and the Student t Distribution Recall from Section 2.where Z is a random variable Wit h a .bo .b. r: Hypothesis Tests and Confidence Intervals .:.I .II I gant the practical problem with weighted least squa Although theoretlca ye eza __ . that minimize L.~. The least absolute deviations estimator. the OLS estimator can be sensitive to outliers. Weighted least squares' thing that IS rare y no . the LAD estimators of f30 and f3.6 Using the t-Statistic in Regression When the Sample Size Is Small When the sample size is small. or other estimators with reduced sensitivity to outl ier . then the OLS estimator is normally distributed and the hornoskedasticity-only r-statistic has a Student t distribution. One such estimator is the least absolute dcviati n (LAD) estima_ tor.. the three least squares assumptions hold. depends on X sOme IS that you must now . the three least q uares assumption. *5.That is. t-k how the conditional vanance of II.. In many economic data sets.4 that the Student t distribution with III degrees of freedom is defined to be th e diistnibuti of. however. and that the errors are n rmally distriblltedare collectively called the homoskedastic normal regression assumptions. are the values of bo and b.. The LAD 1'1 estimator is less sensitive to large outliers in u than is OL . the regression errors are hornoskedastic. deferred to Chapter ]7. . W is a random variable with a chi-squared dislributioa l 'This section is optional <lad is not used in later chapters.res . usc f the LAD estimator.: . If extreme utliers are not rare.. then other estimators can be more efficient than OLS and can produ e inferences that are more reliable.Xi!. in which the regression coefficients f30 and f3.-. severe outliers in II arc rare. uuon W/m standard normal distribution. and further discussion of WLS is I!.Z I'V r.

are independently distributed. Under the homoskedastic normal regres. 1.O) has a normal distribution under the null hypothesis. In econometric applications. are homoskedastic and normally there is rarely a reason to believe that the errors distributed. It follows that the result of Section 3. conditional tion. and confi- large. r-statistic has This result is closely related to a result discussed in Section 3.~" (5.13 L."" a weighted uted. . then the hornoskedasticity-only regression r-statistic has a Student t distribution (see Exercise 5. average of independent 131 has a normal distribution.. Xn' As dison XL. inference can proceed as described in Sections 5. the OLS estimator is a weighted average of l]. conditional On XI.23)].5.x".2 degrees of freedom." where the weights depend on XL... the (normalized) distribution homoskedasticily_only x". and ffi. errors are homoskedastic (Appendix and normally distributed and if the homoskedasticity-only z-statistic is used. 111 Under the null hypothstandard error can using the homoskedasticity_only z-statisue testing 131 = 131. by first computing heteroskedasticity-robust the standard normal distribution dence intervals. Because sample sizes typically are standard errors and then by using tests. and Z and Ware independent.1 and 5.2]. Thus (ffil . and iii.5 is a special case of the result that if the homoskedastic normal regression assumptions hold. Use of the Student t Distribution in Practice If the regression the Student tribution. the z-statisrn. In addi- variance estimator has a chi-squared with n .2-that is. this distinction is relevant only if the sample size is small.131o)/(. conditional cussed in Section 5. .5.2. computed be written in this form. Consequently. . Th~~. then the (pooled) z-statistic has a Student ( distribution.32) in Appendix 5. then critical values should be taken from Table 2) instead of the standard normal dist t distribution Because the difference between the Student distribution and the nor- mal distribution is negligible if n is moderate or large.IS defined sion assumptions.5 in the context of testing for the equality of the means in two samples.0 is 'i = (ffi] .22). . .6 Using the r-Starisric in Regression When the Sample Size Is Small 165 with m degrees of [reedom. if the two population distributions are normal with the same variance and if the r-statistic is constructed using the pooled standard error formula [Equation (3. the homoskedasticity-only standard error for ffi I simplifies to the pooled standard error formula for the differ- ence of means... divided by n .omoskedasticitY_Only where u~. esis. the homoskedasticity-only a Student t distribution with n . x" [see Equation (5.2 degrees of freedom.'" Equation Y has a normal distribution. When X is binary. In that problem. Because normal random variables is normally distribon X" .. however. to compute p-values. hypothesis ____ iiiiiiiiIII _ .10).

" could mean that the OLS analysis one so far has little value to the superintendent. The coefficient is moderately large. This corresponds to moving a district at the 50'h percentile of the distribution of test scores to approximately the 60'h percentile. after all. It thus might be that our negative estimated relationship between test scores and the student-teacher ratio is a con eq uence of large classes being found in conjunction with many other factors that are. their children are not native English speakers. the real cause of the lower test scores. But students at wealthier schools also have other advantages over their poorer neighbors. on average. in many cases. Indeed. it could be misleadtng. in fact. but is this relationship necessarily the causal one that the superintendent needs to make her decision? Districts with lower student-teacher ratios have. on average. the probability of doing so (and of obtaining a r-stati ti n fJ\ as largeas we did) purely by random variation over potential amples is exceedingly small. ideri hi ing additional teachers to cut the student-teacher ratio Wil t who IS consi enng in ' a have we learned that she might find useful? . This result represents considerable progress toward answering the superintendent's question yet a nagging concern remain . reasnn to worry that it might not. in fact. Moreover. students at wealthier schools tend themselves to come from more affluent families and thus have other advantages not directly associated with their school.6 points higher.The population coefficient might be 0. and better-paid teachers. The coefficient on the student-teacher ratio is statistically significantly different from 0 at the 5% significance level. including better facilities. Our regression analysis. However. But does this mean that reducing the student-teacher rati will.30 :S fJ\ :5 -1. .based on the 420 observations for 1998 m the California d t et showed that there was a negative relationship between the stu testscore a as. higher test scores. approximately 0. and we might simply have estimated our negative coefficient by rand 10 ampling variation. c - dent-teacher ratio and test scores: Districts with smaller classes have higher test scores. and. so wealthier school districts can better afford smaller classes.There i a negative relationship between the student-teacher ratio and test scar . Hiring more teachers. in fact. newer books. 5. or "omitted variables.001%. increase scores? There is. test scores that arc 4.26. costs money. For example.166 CHAPTER 5 . d These other factors. ' R r: HypothesisTestsand Confidence Intervals RegressionWitha 51ngle egresso.A 95% confidence interval for f3\ is -3. the e immigrants tend to be poorer than the overall population.7 Conclusion he problem that started hapter 4: the superintende t Return for a momen t to t n . in a practical sense: Districts with two fewer students per teacher have. . California has a large immigrant community.

If the three least squares assumption hold and if the regression errors are homoskedastic. 3. 2. = x) is constant. the OLS estimator is BLUE. then the OLS t-statistic computed using homoskedasticity-only standard errors has a Student t distribution when the null hypothesis is true. the error u. a 95% confidence interval for a regression coefficient is computed as the estimator ±1.96 standard errors. Homoskedasticityonly standard errors do not produce valid statistical inferences when the errors are heteroskedastic. 5. then. holding these other factors constant.var(". Summary 1. is heteroskedastic-that is. the variance of u at a given value of x" var(". Hypothesis testing for regression coefficients is analogous to hypothesis testing for the popnlation mean: Use the t-statistic to calculate the p-values and either accept or reject the null hypothesis.Ix. The difference between the Student t distribution and the normal distribution is negligible if the sample size is moderate or large.that is. the regression model can be used to estimate and test hypotheses about the difference between the population means of the "X = 0" group and the "X = I" group. If the three least squares assumptions hold.Ix. = x) depends on x. we need a method that willallow us to isolate the effect on test Scores of changing the student-teacher ratio. Like a confidence interval for the population mean. That method is multiple regression analysis. When X is binary. A special case is when the error is homoskedastic. the topic of Chapter 6 and 7. as a result of the Gauss-Markov theorem. To address this problem. and if the regression errors are normally distributed. In general. if the regression errors are homoskedastic.Key Terms 167 Changing the student-teacher ratio alone would not change these other factors that determine a child's performance at school. Key Terms null hypothesis (146) two-sided alternative hypothesis (146) standard error of ~ I (146) z-statistic (146) p-value (147) confidence interval for /31 (151) confidence level (151) indicator variable (153) dummy variable (153) . 4. but heteroskedasticity-robust standard errors do.

lhe ficient.. Calculate the p-value for the two-sided rest of the null hypothe Ho: fJ.. = O.:e = 520. hut are the dependent and independent variables? 5.4) (2. I I•. . Do you reject the null hypothe is at the 5% level? 1% level? . coefficient multiplying D.1 Suppose that a researcher.08. (154) heteroskedasticityand homoskedasticity (156) homoskedasticity-only standard errors (158) heteroskedasticity-robuSI error (159) standard auss-M rko theorem (162) be t linear unbia cd esurnato(BL )(16) weighted least square. a..5.d.5.ressi n test TestSco.). I R ressor Hypothesis Tests and Confidence Intervals Regressionwith a Sing e eg .2 Explain how you could use a regression model to c 110101 the wage gender gap using the data on earnings f rn 'n and w m ·n. II. .i. .4 . Construct a 95% confidence interval f r PI. set of ob crvati ns~. i I. a 5. using data n etas 'i. scores from lOa third-grade classes.: PI = a in a regression model using an i. (154) coefficient on D..168 CHAPTER 5 . (163) h moskcda lie normal regression assumptions 164) nuss-Marko onditions (17 ) Review the Concepts 5.. ER 11.1 Outline the procedures f r computing the /I-vl\lue fa IWO sided test of flo: t-v = using an i. PrOVIde it hyp thetical empirical example in which y u think Ihe errors would be hctcr kcdastic and explain your reasoning.d.82 x (2Q. regression sl pe coefis I the b.. estimates lite ( ) and average rc ." Outline the procedures for computing the /I-value of n two-sid d I 'Sl <I( II. Exercises 5. R2 0. s t of obscrvarlons ( /•• \.21) .i.3 Define homoskedasticity and heteroskednsticit .

2. Construct a 99% confidence interval for 130' 5. ~ -5. using wage data On 250 randomly selected male workers and 280 female workers. In the sample.) c. R' ~ 0. Construct a 95% confidence interval for the gender gap. Another researcher uses these same data but regresses Wages on Female. d. SER ~ 10.94 X Height. determine whether -5. Without doing any additional calculations. what is the mean wage of women? Of men? e. Define the wage gender gap as the difference in mean earnings between men and women. What is the estimated gender gap? b. Calculate the p-value for the two-sided test of the null hypothesis Ho: 13.2 SUppose that a researcher.6 is contained in the 95% confidence interval for 13" d.5 inches over the Courseof a year. . R' = 0.6.3 Suppose that a random sample of200 twenty-year-old men is selected from a population and their heights and weights are recorded.23) (0.15) (0.52 + 2.31) where Weight is measured in pounds and Height is measured in inches.06.12 x Male. Construct a 99% confidence interval for the person's weight gain. Is the estimated gender gap significantly different from zero? (Compute the p-value for testing the null hypothesis that there is no gender gap. a variable that is equal to 1 if the person is female and 0 if the person a male.2. SER ~ 5.Exercises 169 c. a. A regression of weight on height yields Weight ~ -99. A man has a late growth spurt and grows 1. (. What are the regression estimates calculated from this regression? - Wage ~ __ + __ X Female.41 + 3. SER ~ 4. estimates the OLS regression Wage ~ 12.36) where Wage is measured in dollars per hour and Male is a binary variable that is equal to 1 if the person is a male and 0 if the person is a female. (2. R 2 ~ .81.

6) (2. How much is this worker's average hourly earnings expected 10 increase? c. test score. A randomly selected 30-year-old worker reports an educati n level of 16 years. Do you think that the regressi n errors plausibly are homoskedastic? Explain.0 + 13. and small classe contained approximately 15st udents. A high school graduate (12 years of educati n) is contemplating going to a community college for a 2-year degree. Is the estimated effect of class size on test score cant? Carry out a test at the 5% level. Is this statement consistent with the regression evidence? What range of values is consistent with the regression evidence? 5. a. in the population.4. college graduates earn $10 per hour more than high school graduates.0 I.5) R2 = 0. A high school counselor tells a student that.Suppose that the regressron errors were homoskedastic: Would this affect the validity of the confidence interval constructed in Exercise 5.) was computed USing Equation (5.170 CHAPTER 5 . and given standardized tests at the end of the year.5 In the 1980s.) Suppose that. S' I R ressor:HypothesisTests and Regressionwith a Ing e eg .Tennessee conducted an experiment in which kindergarten students were randomly assigned to "regular" and "small" classes.23) to answer the following. (1. Do small classes improve test scores? By how much? Is the effect large? Explain.4 Read the box "The Economic Value of a Year of Education: Homoskedas_ ticity or Heteroskedasticity?" in Section 5.9 X SmaltClass. (Regular cia ses contained approximately 24 students. What is the worker's expected average hourly earnings? b. the standardized tests have a mean Score of 925 points and a standard deviation f 75 points.5. a. a. Let mollCloss denote a binary variable equal 10 1 if the student is assigned t a small class and equal to 0 otherwise. SE(~.SS yields Yes/Score = 918. Construct a 99% confidence interval for the effect of mall Class on 5. .5(c)? Explain.6 Refer to the regression described in Exercise 5.6. ER = 74. . b. A regression of TestScore on Small Ie. iatistically signifi- c. Use the regression repOrted in Equation (5. b. on average. ). Confidence Intervals 5.

: f3.5) = 0. (7. and (a) and (b) answered. A sample of size n = 30 Y = 43. HI: f31 '" 0 at the 5 % level. where Y and X 73 is a linear function of Yj.2 (10. and /h Y.3 and. .. Xi) satisfy the assumptions in Key Concept 4.26. d. and Xi.2X.. Test Ho: f31 = 55 vs. where lI.4) standard where the numbers in parentheses are the homoskedastic-only errors for the regression coefficients. II. Let 73 denote are the an estimator of f3 that is constructed as sample means of Y.1) (1. Would you be and many samples of size In Xi are independent n = 250 are drawn. and Xi satisfy the assumptions in Key Concept 4.3. interval for b. yields that (Y. In what fraction of the samples would Ho from (a) be rejected? what fraction of samples would the value f3 j = 0 be included in the confidence interval from (b)? 5. a. Suppose you learned that surprised? Explain.9 Consider the regression model > 55 at the 5% level. Suppose that Y. a. b. R' = 0.54.2) + 61. SER = 6. Xi) satisfy the assumptions in Key Concept 4.2. Y. R' (3.. HJ' f31 '" 55 at the 5% level.8 Suppose addition.. SER = 1. O"~) and is independent of Xi.5X. Test Ho: f31 = 55 vs. Show that 73 = YIX. Construct a 95% confidence c. regressions estimated. in is N(O. Test Ho: f31 = 0 VS. Construct a 95% confidence interval for f3o. 73 is conditionally unbiased. 5. Show that b. A random sample of size n = 250 is drawn and yields y = 5. + u.3. a. and Xi were independent.. c. Y" . = f3X.7 Suppose that (Y. respectively. H.52.4 + 3. Y".Exercises 1 71 5. .

How would your answers to (a) and (b) change if you assumed only that (Y" X) satisfied the assumptions in Key oncept 4.' n 1. LtV denote the sample mean for observation With X'" 0 a d f3 1tUrelo X .3 and var(u. O'~) and is independent I' X. + 1/.22). b. + I/i' Find the OL estimates and their corresponding standard errors.. is N(O. a.y"J2] is $68.13 Suppose that (Y. . = '\)". Derive the conditional variance of the estimator. derive the variance of f30 under homoskedasticity given in Equation (5.an ~r o~servatlons With X = 1.i = {3w.. . suppose that 1. = 131 WOmen.lx = x) is constant? d. To be specific. how that ~o = 'YoJo + ~. X).l' au 's-Markov con- of {3 an sh W that it is a linear unbiased.".31).:. denotes years of schooling.denotes earnings..3 and. 172 CHAPTER 5 .::I L. How would your answers to (a) and (b) change if you assumed only that (Y" X.i + lim. of f30 and ~I J:.11 A random sample of workers contains II". 5. = y" and {31 = y.10 and Sw = $51. Let Women denote an indicator variable that is equal to 1 for women and 0 for men and suppose that all 251 observations are used in the regression 1. d.. 5." = (1/11. Y.i .o = (3/11. u. = $485.i' Let ~m. 5..0 + f3m. Show that the estimator is conditionally c. denote a binary variable and consider the r~gre sian Y..15 A researcher has two independent samples of observations on (1.l denote the OLS estimator constructed ~1I. . Women. Prove that the estimator is BLUE. X. conditionally unbiased? h.10. and the sample standard deviation [s"..) satisfied the assumptions in Key oncept 4. where (1/" X.. Is ~I oncept 4.. The corresponding values for women are Y. denote the sample m. Is ~.. = f3X.) Y.tX. . Derive the least squares estimator function of 1\.10 Let X. .10. '" f30 + · + ..1..) satisfy the ditions given in Equation (5. a..14 Suppose that Y.1 (1'. Regression WIth a Sing e 5.ll.3'1 5.1.. I R egre ssor: HypothesisTests and Confidence Intervals . = 120 1l1~ and II". and the Independent samples are for men and women. 5.I~

Related Interests

J + uw.. Write the regression for men as ~1J.1 and the regression for women as + f31. = f30 + f3. in the best linear conditionally unbia e I estimator of f31? c.12 Starting from Equation (4. X) satisfy the assumptions in Key addition.. The sample average of men's weekly earnings [1'.2 ) in Appendix 5.] is $523.Yo.

2. or 1% significance level? What is the p-value associated with coefficient's I-statistic? b. using data only on males and repeat (b). Repeat (a) using only the data for high school graduates.5%.1.l) denote the corresponding s.15. different for e.) ~ V[sE(ffi""Jl)2+ [sE(ffiw. (Hint: See Exercise 5.3. Construct a 95% confidence interval for the slope coeftlcient..) .. can you reject the null hypothesis Ho: [31 ~ 0 versus a twosided alternative at the 10%.Empirical Exercises 173 ffi". and SE(ffi"u) and SE(ffiw. Repeat (a) using only the data for college graduates. Is the effect of distance on completed years of education men than for women? tHint: See Exercise 5. or 1 % significance level? What is the p-value associated with coefficient's I-statistic? b. can you reject the null hypothesis H : [31 ~ 0 versus a two- o sided alternative at the 10%. or 1% significance level? What is the p-value associated with coefficient's I-statistic? E5. Is the estimated regression slope coefficient statistically significant? That is.l)f using the sample of men.15. e.1 _ ffiw" is given by SE([3"". Rnn the regression interval for the slope coefficient. regression of average hourly earnings (A HE) on Age and carry out the fol- a. 5%.2 Explain. Show that the standard errol' of ffi".[3w. Is the estimated regression slope significant? That is. can you reject the null hypothesis Ho: [3.l denote the OLS estimator constructed from the sample of women. Is the effect of age on earnings different for high school graduates than for college graduates? E5. ~ 0 versus a two-sided alternative at the 10%.) described in Empirical Exercise E4. Construct a 95% confidence c.andar~ errors. Run the regression d. Is the estimated regression slope coefficient statistically significant? That is..1 Using the data set CPS08 described in Empirical Exercise E4. run a regression of years of completed education (ED) on distance to the nearest college (Disl) and carry out the following exercises. run a lOWing exercises . Empirical Exercises E5. using data only on females and repeat (b). a.5%. c.2 Using the data set CollegeDistance described in Empirical Exercise E4. Using the data set TeachingRatings coefficient statistically run a regression of Course_Eval on Beauty. d.

uti and stems from Homoskedasticity-Only Variances var] to (5. Replacing var[(X. ity. TIle variance Li' I(X/ adjustment tIT. simplify =--and (T~ 1'/U'1. a7 =-x (1'" ')' ' L. th e formulas In Key Conccpl4. APPENDIX --------------------. The estimator of the variance of ffiois --'" . with a modificHlion. 111CSC 5.174 CHAPTER 5 Regression Wit . is a constant: It the errors are hornoskedasric . 1 /I .'~l X~)Xi.3.'\!) ::= O"~. = 1 . os t3IJ 2 = --'~<T' 110'2 . these are the "heteroskedasticity-robust" the OLS estimators of homoskedasticity..21) by the corresponding variance in the numerator of Equation (4.. the conditional variance of u. which allow for hcteroskedastic_ standard errors. where to correct by a degrees-of-freedom adjustment used in the definiis estimated in the denominator X)'. .. The consistency of hClcroskcdasticity-robust standard errors is discussed in Section 17. R rmulas for the variance of and the associated standard errors are then given r r the special case Heteroskedasticity-Robust ances in Equation Standard Errors the population X)2 varisample variances. for downward bias.1 Formulas for OLS Standard Errors are first presented under the least squares assumptions in Key Concept 4.!. analogously to the degrees-of-freedom in Section 4.4 ormu as i 2 (T~ I Ii.26) iJo n i-I where if.-I Nl".2l) by these two estimators yields .3.N.(Xi) in quati n (4. given X.h a S· I R gressor· Hypothesis Tests and Confidence Intervals Ing e e ..4)..111e standard error of ffiois E(ffio) = al.] and va.Y II' E(X') (5.21) is estimated by the divisor n .Pine reasoning replacing population behind the estimator uJuis the same as behind expectations with sample averages./Lx)n.27) ~nder hornoskedasticity.(X/t L. This appendix discusses the formulas for OLS standard errors.28) .2 (instead of II) incorporates tion of the SER (l/n)2::~I(X. .JI in Equation (5. 1 1'/-2~ Ii I I (5.3.4) is obtained by replacing (4. The n ~ 2 The estimator UPI defined in Equation (5. . .

21) and simplifying. and by TIle hornoskedaSlicity_only standard errors are obtained by substjtuting sample means and estimating the variance of tors of these variances are by the square of the SER. -!Lx)'var(u. and where the final equality follows from the law of then var(u. .3). where E[(X.] = .]}') ~ E{[(X. The homoskedasticity_only square roots of (fpo and tTl standard errors are the APPENDIX - _ 5.21) as var[(X._.!Lx)u. "" (X. If u.IX.28).E[(X. A sim- this expression into the numerator of Equation ilar calculation yields Equation (5.Ix.--:'_". so 2. -!Lx)u. Standard Errors means and variances in Equations (5.]') follows Theorem 175 (5. (homoskedasticity_only) and (5. then the OLS estimator is the best (most efficient) unbiased estimator (is BLUE).19).-1 __ PI) n i=1 1" ) ( -~ Xf s~ n~ /I I 2: (X.27) and (5.!Lx)".2 s.X) '" Ii - ..)] substituting iterated (Section =iT. write the numerator .J].27) follows by (4.28).~ i=l .E[(X. the second equality because =0 (by the first least squares assumption) expectations ar E[(A'. (5. . Homoskedasticity-Only variances for the population It. (4. The hornoskedaSlicity_only estima- Util . This appendix begins by stating the Gauss-Markov condi- tions and showing that they are implied by the three least squares condition plus .2 The Gauss-Markov Conditions and a Proof of the Gauss-Markov Theorem As discussed conditions in Section 5. -:

Related Interests

)' (homoskedastici ty-oul y).The Gauss-Markov To derive Equation E({(X.] ~ E[(X... _ !LxJ'ulJ E[(X. -!Lx)'] ~ <T..5. the Gauss-Markov theorem states that if the GaUSS-Markov conditionally linear hold. .<TJc. is homoskedastic. The result in Equation (5. Conditions and a Proof of the Gauss-Markov in Equation .!Lx)'v (lI.) = iT. -!Lx)"..!LxJu.27).29) &9 .lx.30) where s~ is given in Equation (4.

i. 2).. bomoskedasucity .(Xi. Thus the least squares assumptions onccpt 4. (nonzero W £(II/IX.. . I t n ) (5..(X. (3. "/) a 1::(11. L:~I(X. squares assumptions. irnilarlv.. E .1"1. Xn) = £(/I. £(1111. and rhnt the err h lei coudln nrc uncorrclatcd fordif~ where all these statements nnlly on nil SqUMCS bserved X's assumptions and by 2. because E(II. .X)I.II)X" .-" --I (

Related Interests