
by Jeffrey Zax


*Introductory Econometrics: Intuition, Proof, and Practice* attempts to distill econometrics into a form that preserves its essence, but that is acceptable—and even appealing—to the student's intellectual palate. This book insists on rigor when it is essential, but it emphasizes intuition and seizes upon entertainment wherever possible.

*Introductory Econometrics* is motivated by three beliefs. First, students are, perhaps despite themselves, interested in questions that only econometrics can answer. Second, through these answers, they can come to understand, appreciate, and even enjoy the enterprise of econometrics. Third, this text, which presents select innovations in presentation and practice, can provoke readers' interest and encourage the responsible and insightful application of econometric techniques.

In particular, author Jeffrey S. Zax gives readers many opportunities to practice proofs—which are challenging, but which he has found to improve student comprehension. Learning from proofs gives readers an organic understanding of the message behind the numbers, a message that will benefit them as they come across statistics in their daily lives.

An ideal core text for foundational econometrics courses, this book is appropriate for any student with a solid understanding of basic algebra—and a willingness to use that tool to investigate complicated issues.

Publisher: Stanford Economics and Finance · Released: Mar 31, 2011 · ISBN: 9780804777209 · Format: book



**1 Basic Objectives **

**2 Innovations **

**3 Math **

**4 Statistics **

**5 Conclusion **

The purpose of this book is to teach sound fundamentals to students, at any level, who are being introduced to econometrics for the first time and whose preparation or enthusiasm may be limited. This book is a teaching tool rather than a reference work. The way that a typical student needs to approach this material initially is different from the way that a successful student will return to it. This book engages novices by starting where they are comfortable beginning. It then stimulates their interests, reinforces their technical abilities, and trains their instincts.

For these purposes, the first objective of this book is the same as that of a medical doctor: Do no harm. Armed with only some vocabulary from a conventional textbook and a standard computer software package, anyone can run a regression, regardless of his or her preparation. The consequences of econometric work that lacks precision, care, and understanding can range from inept student research to bad public policy and disrepute for the econometric enterprise. These outcomes are best prevented at the source, by ensuring that potential perpetrators are exposed to the requirements, as well as the power, of econometric techniques.

Many of the students who take this course will actually run regressions in their subsequent professional life. Few will take another econometrics course. For most, this is a one-shot opportunity. It is crucial that we—students and teacher—make the best of it. At minimum, this book is designed so that those who finish it are aware that econometric tools shouldn’t be used thoughtlessly.

This book’s additional objectives form a hierarchy of increasing optimism. Beyond basic awareness of the power of econometrics, we want students to understand and use responsibly the techniques that this book has taught them. Better still, we want students to recognize when they have an econometric challenge that goes beyond these techniques. Best of all, we want students to have enough preparation, respect, and perhaps even affection for econometrics that they’ll continue to explore its power, formally or otherwise.

In pursuit of these objectives, this text makes five distinctive choices:

1. It emphasizes depth rather than breadth.

2. It is curiosity driven.

3. It discusses violations of the standard assumptions immediately after the presentation of the two-variable model, where they can be handled insightfully with ordinary algebra.

4. The tone is conversational whenever possible, in the hope of encouraging engagement with the formalities. It is precise whenever necessary, in order to eliminate confusion.

5. The text is designed to evolve as students learn.

This book engages only those subjects that it is prepared to address and that students can be expected to learn *in depth*. The formal presentations are preceded, complemented, and illuminated by extensive discussions of intuition. The formal presentations themselves are usually complete, so that there is no uncertainty regarding the required steps. End-of-chapter exercises provide students with opportunities to experiment on their own—both with derivations and their interpretations. Consequently, this book devotes more pages to the topics that it does cover than do other books. The hope is that, in compensation, lectures will be more productive, office hours less congested, and examinations more satisfying.

Equivalently, this book does not expose students to advanced topics, in order to avoid providing them with impressions that would be unavoidably superficial and untrustworthy. The text includes only one chapter that goes beyond the classical regression model and its conventional elaborations. This chapter introduces limited dependent variables, a topic that students who go on to practice econometrics without further training are especially likely to encounter.

A typical course should complete this entire text, with the possible exception of **chapter 15**, in a single semester. Students who do so will then be ready, should they choose, for one of the many texts that offer more breadth and sophistication.

Motivation is the second innovation in this book. The sequence of topics is driven by curiosity regarding the results, rather than by deduction from first principles. The empirical theme throughout is the relationship between education and earnings, which ought to be of at least some interest to most students.

The formal presentation begins in **chapter 3**, which confronts students with data regarding these two variables and simply asks, What can be made of them? Initial answers to this question naturally raise further questions: The covariance indicates the direction of the association, but not its strength. The correlation indicates its strength, but not its magnitude. The desire to know each of these successive attributes leads, inexorably, to line fitting in **chapter 4**. Finally, the desire to generalize beyond the observed sample invokes the concept of the population.
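The progression described here, from covariance to correlation to a fitted line, can be sketched in a few lines of code. The following is my own illustration with invented numbers, not material from the book:

```python
# Hypothetical illustration (invented data, not the book's): the sample
# covariance, sample correlation, and fitted line for a toy sample of
# education (years of schooling) and hourly earnings.
education = [10, 12, 12, 16, 18]
earnings = [12.0, 15.0, 14.0, 22.0, 25.0]
n = len(education)

mean_educ = sum(education) / n
mean_earn = sum(earnings) / n

# Sample covariance: the direction of the association, but not its strength.
cov = sum((x - mean_educ) * (y - mean_earn)
          for x, y in zip(education, earnings)) / (n - 1)

# Sample correlation: the strength, but not the magnitude (it is unit-free).
var_x = sum((x - mean_educ) ** 2 for x in education) / (n - 1)
var_y = sum((y - mean_earn) ** 2 for y in earnings) / (n - 1)
corr = cov / (var_x ** 0.5 * var_y ** 0.5)

# Fitted line: the magnitude -- the change in earnings per added year
# of education.
slope = cov / var_x
intercept = mean_earn - slope * mean_educ
```

On this toy sample the covariance is positive, the correlation is close to one, and the slope is read as the change in earnings associated with one more year of education.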

This contrasts with the typical presentation, which begins with the population regression. That approach is philosophically appealing. Pedagogically, it doesn’t work. First-time students are not inclined to philosophic rigor. Worse, many first-time students never really get the concept of the population. It’s too abstract. If the course starts with it, students begin with a confusion that many never successfully dispel.

For this reason, this book refers only obliquely to the contemporary approach to regression analysis in terms of conditional expectations. This approach is deeply insightful to those with a thorough understanding of statistical fundamentals. The many undergraduates who struggle with the summations of **chapter 2** do not truly understand expectations, much less those that are conditional. We cannot expect them to appreciate any approach that is based first on population relationships.

The sample is a much more accessible concept. It’s concrete. In several places in this book, it actually appears on the page. The question of how to deal with the sample provokes curiosity regarding its relationship with the underlying population. This gives the student both foundation and motivation to confront the population concept in **chapter 5**.

Subsequent chapters repeat this pattern. Each begins by raising questions about the conclusion of a previous chapter. In each, the answers raise the level of econometric sophistication. The text aspires to engage students to the point that they are actually eager to pursue this development.

The third innovation places the discussions of inference, heteroscedasticity, correlated disturbances, and endogeneity in **chapters 6 through 10**, immediately after the presentation of the model with one explanatory variable. In the case of inference, students may have seen the basic results in the univariate context before. If they did, they probably didn’t understand them very well. Bivariate regression provides the most convenient and accessible context for review. Moreover, it is all that is necessary for most of the results.

Heteroscedasticity, correlated disturbances, and endogeneity all concern violations regarding the ordinary assumptions about the disturbance terms. The methods of diagnosis and treatment do not vary in any important way with the number of explanatory variables. The conventional formulas are often especially intuitive if only one explanatory variable is present.

This arrangement relieves the presentation of the multivariate model of the burden of conveying these ancillary issues. Accordingly, **chapter 11** concentrates on what is truly novel in the multivariate model: the effects of omitted variables, the implications of including an irrelevant variable, the consequences of correlations among the explanatory variables, and statistical tests of joint significance.

In addition, this arrangement ensures that shorter courses, such as those that take place within a quarter rather than a semester, or courses in which progress is slower, can still address these alternative assumptions. If the students in such courses must omit some basic material, they are better served by exposure to the problems to which regression may be subject than by an expansion of the basic results regarding the estimation of slopes.

This innovation, like the second, is a departure from the ordinary order of presentation. These discussions usually appear after the multivariate model. However, the presence of additional explanatory variables adds nothing, apart from additional notation. With an audience that is wary of notation to begin with, this is counterproductive.

As its fourth innovation, this book adopts a conversational tone wherever possible. Appropriate respect for the power of statistical analysis requires some familiarity with formal derivations. However, formal discussions reinforce the prejudice that this material is incompatible with the typical student sensibility. This prejudice defeats the purpose of the derivations.

Their real point is that they are almost always intuitive, usually insightful, and occasionally surprising. The real objective in their presentation is to develop the intuitive faculty, in the same way that repeated weight-lifting develops the muscles. The book pursues this objective by placing even greater emphasis on the revelations in the formulas than on their derivations.

At the same time, the book is meticulous about formal terminology. The names of well-defined concepts are consistent throughout. In particular, the text is rigorous in distinguishing between population and sample concepts.

This rigor is distinctive. The word *mean* provides an egregious example of ambiguity in common usage. As a noun, this term appears as a synonym for both the average in the sample and the expected value in the population. With an audience who can never confidently distinguish between the sample and the population in the first place, this ambiguity can be lethal. Here, *mean* appears only as a verb.

This ambiguity is also rampant in discussions of variances, standard deviations, covariances, and correlations. The same nouns are typically employed, regardless of whether samples or populations are at issue. This text distinguishes carefully between the two. **Chapter 3** qualifies all variances, standard deviations, covariances, and correlations as sample statistics. The common Greek symbols for variances, standard deviations, and correlations appear only after **chapter 5** explains the distinction between sample statistics and population parameters.

Finally, this book is designed to evolve with the students’ understanding. Each chapter begins with the section “What We Need to Know When We Finish This Chapter,” which sets attainable and appropriate goals as the material is first introduced. The text of each chapter elaborates on these goals so as to make them accessible to all. Once mastery is attained, the “What We Need to Know . . .” sections, the numbered equations, and the tables and figures serve as a concise but complete summary of the material and a convenient source for review. A companion Web site, www.sup.org/econometrics, provides these review materials independent of the book. It also provides instructors with password-protected solutions to the end-of-chapter exercises. Instructors can release these to students as appropriate so that they can explore the material further on their own.

Many students would probably prefer a treatment of econometrics that is entirely math-free. This book acknowledges the mathematical hesitations and limitations of average students without indulging them. It carefully avoids any mathematical development that is not essential.

However, derivation can’t be disregarded, no matter how ill-prepared the student. Proof is how we know what it is that we know. Students have to have some comfort with the purpose and process of formal derivation in order to be well educated in introductory econometrics. Some respect for the formal properties of regression is also a prerequisite for responsible use.

This book accommodates the skills of the typical student by developing all basic results regarding the classical linear regression model and the elaborations associated with heteroscedasticity, correlated disturbances, and endogeneity in the language of ordinary algebra. Virtually all of the mathematical operations consist entirely of addition, subtraction, multiplication, and division. There is no reference to linear algebra. The only sophistication that the book requires is a facility with summation signs. The second chapter provides a thorough review.
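As a concrete illustration of how far ordinary algebra goes, here is a sketch of my own (invented data, not the book's code) of the ordinary least squares slope and intercept written entirely with sums:

```python
# The OLS slope and intercept require nothing beyond addition,
# subtraction, multiplication, division, and summation signs.
x = [10, 12, 12, 16, 18]            # education (invented values)
y = [12.0, 15.0, 14.0, 22.0, 25.0]  # earnings (invented values)
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_xx = sum(xi * xi for xi in x)

# b = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n  # intercept, from the same four sums

# The fitted residuals sum to zero -- a summation identity that the
# book's derivations use repeatedly.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```

Every quantity above is built from the four sums in the middle, which is exactly the level of mathematics the text asks of its readers.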

In general, individual steps in the algebraic derivations are simple. The text explains all, or nearly all, of these steps. This level of detail replicates what, in my experience, instructors end up saying in class when students ask for explanations of what’s in the book. With these explanations, the derivations, as a whole, should be manageable. They also provide opportunities to emphasize emerging insights. Such insights are especially striking in **chapter 11**, where ordinary algebra informs intuitions that would be obscured in the matrix presentation.

**Chapters 11 through 14** present the three-variable model entirely with ordinary algebra. By the time students reach this point, they are ready to handle the complexity. This treatment has two advantages. First, students don’t need matrices as a prerequisite, and faculty don’t have to find time to teach, or at least review, them. Second, the algebraic formulas often reveal intuitions that are obscure or unavailable in matrix formulations.

Similarly, this book contains very few derivatives. Only five are inescapable. Two occur when minimizing the sum of squared errors in the two-variable model of **chapter 4**. Three appear when executing the same task in the three-variable model of **chapter 11**. All five are presented in their entirety, so that the students are not responsible for their derivation. All other derivatives are optional, in the appendices to **chapters 4 and 7**, in **chapter 15**, and in several exercises.

In sum, this book challenges students where they are most fearful, by presenting all essential derivations, proofs, and results in the language with which they are most familiar. Students who have used this book uniformly acknowledge that they have an improved appreciation for proof. This will be to their lasting benefit, well beyond any regressions they might later run.

This text assumes either that students have not had a prior course in introductory statistics or that they don’t remember very much of it. It does not derive the essential concepts, but reviews all of them.

In contrast to other texts, this review does not appear as a discrete chapter. This conventional treatment suggests a distinction between statistics and whatever else is going on in econometrics. It reinforces the common student suspicion, or even hope, that statistics can be ignored at some affordable cost.

Instead, the statistics review here is consistent with the curiosity-driven theme. The text presents statistical results at the moment when they are needed. As examples, it derives the sample covariance and sample correlation from first principles in **chapter 3**, in response to the question of how to make any sense at all out of a list of values for earnings and education. The text defines expectations and population variances in section 5.3, as it introduces the disturbances.

Results regarding the expectation of summations first appear in section 5.4, where they are necessary to derive the expected value of the dependent variable. Variances of summations are not required until the derivation of the variance of the ordinary least squares slope estimator. Accordingly, the relevant formulas do not appear until section 5.8.
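The summation results in question can be illustrated concretely. The following check is my own, not the text's: enumerating two independent fair dice confirms exactly that expected values of sums add, and that variances of sums of independent variables add as well.

```python
# Exact check: for independent random variables, E[X + Y] = E[X] + E[Y]
# and Var(X + Y) = Var(X) + Var(Y). Two fair dice, enumerated in full.
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
e_x = sum(faces) / 6.0
var_x = sum((f - e_x) ** 2 for f in faces) / 6.0

# All 36 equally likely outcomes of the pair (X, Y).
totals = [a + b for a, b in product(faces, faces)]
e_sum = sum(totals) / 36.0
var_sum = sum((t - e_sum) ** 2 for t in totals) / 36.0
```

Because every outcome is enumerated, the two identities hold exactly rather than approximately.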

Similarly, a review of confidence intervals and hypothesis tests immediately precedes the formal discussion of inference regarding regression estimates. This review is extensive both because inference is essential to the responsible interpretation of regression results and because students are rarely comfortable with this material when they begin an econometrics course. Consequently, it appears in the self-contained **chapter 6**. The application to the bivariate regression context appears in **chapter 7**.
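For readers who want a preview of that review, here is a minimal sketch of my own, with invented data, using the normal critical value 1.96 for simplicity rather than the t distribution the chapter itself would use:

```python
import math

# Hypothetical sample (invented numbers, not the book's data).
sample = [12.0, 15.0, 14.0, 22.0, 25.0]
n = len(sample)
avg = sum(sample) / n
s2 = sum((v - avg) ** 2 for v in sample) / (n - 1)  # sample variance
se = math.sqrt(s2 / n)                              # standard error

# 95% confidence interval around the sample average.
ci = (avg - 1.96 * se, avg + 1.96 * se)

# Hypothesis test of H0: the population mean is zero. Reject when the
# test statistic exceeds the critical value in absolute value.
t_stat = (avg - 0.0) / se
reject_h0 = abs(t_stat) > 1.96
```

The confidence interval and the test are two views of the same calculation: the interval excludes zero exactly when the test rejects the null hypothesis.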

This book tries to strike a new balance between the capacities and enthusiasms of instructors and students. The former typically have a lot more of both than the latter. This book aspires to locate a common understanding between the two. It attempts to distill the instructor’s knowledge into a form that preserves its essence but is acceptable and even appealing to the student’s intellectual palate. This book insists on rigor where it is essential. It emphasizes intuition wherever it is available. It seizes upon entertainment.

This book is motivated by three beliefs. First, that students are, perhaps despite themselves, interested in questions that only econometrics can answer. Second, through these answers they can come to understand, appreciate, and even enjoy the enterprise of econometrics. Third, this text, with its innovations in presentation and practice, can provoke this interest and encourage responsible and insightful application. With that, let’s go!

**1.0 What We Need to Know When We Finish This Chapter **

**1.1 Why Are We Doing This? **

**1.2 Education and Earnings **

**1.3 What Does a Regression Look Like? **

**1.4 Where Do We Begin? **

**1.5 Where’s the Explanation? **

**1.6 What Do We Look for in This Explanation? **

**1.7 How Do We Interpret the Explanation? **

**1.8 How Do We Evaluate the Explanation? **

**1.9 R² and the F-statistic **

**1.10 Have We Put This Together in a Responsible Way? **

**1.11 Do Regressions Always Look Like This? **

**1.12 How to Read This Book **

**1.13 Conclusion **

**Exercises **

This chapter explains what a regression is and how to interpret it. Here are the essentials.

1. **Section 1.4: **The *dependent *or *endogenous *variable measures the behavior that we want to explain with regression analysis.

2. **Section 1.5: **The *explanatory*, *independent*, or *exogenous *variables measure things that we think might determine the behavior that we want to explain. We usually think of them as *predetermined*.

3. **Section 1.5: **The *slope *estimates the effect of a change in the explanatory variable on the value of the dependent variable.

4. **Section 1.5: **The *t*-statistic indicates whether the explanatory variable has a discernible association with the dependent variable. The association is discernible if the *p-value *associated with the *t*-statistic is .05 or less. In this case, we say that the slope is *statistically significant*. This generally corresponds to an absolute value of approximately two or greater for the *t*-statistic itself. If the *t*-statistic has a *p*-value that is greater than .05, the associated slope coefficient is *insignificant*. This means that the explanatory variable has no discernible effect.

5. **Section 1.6: **The *intercept *is usually uninteresting. It represents what everyone has in common, rather than characteristics that might cause individuals to be different.

6. **Section 1.6: **We usually interpret only the slopes that are statistically significant. We usually think of them as indicating the effect of their associated explanatory variables on the dependent variable *ceteris paribus*, or *holding constant all other characteristics that are included in the regression*.

7. **Section 1.6: ***Continuous variables *take on a wide range of values. Their slopes indicate the change that would be expected in the dependent variable if the value of the associated explanatory variable increased by one unit.

8. **Section 1.6: ***Discrete variables*, sometimes called *categorical variables*, indicate the presence or absence of a particular characteristic. Their slopes indicate the change that would occur in the dependent variable if an individual who did not have that characteristic were given it.

9. **Section 1.7: **Regression interpretation requires three steps. The first is to identify the discernible effects. The second is to understand their magnitudes. The third is to use this understanding to verify or modify the behavioral understanding that motivated the regression in the first place.

10. **Section 1.7: **Statistical significance is *necessary *in order to have interesting results, but not *sufficient*. Important slopes are those that are both statistically significant and substantively large. Slopes that are statistically significant but substantively small indicate that the effects of the associated explanatory variable can be reliably understood as unimportant.

11. **Section 1.7: **A *proxy *is a variable that is related to, but is not exactly, the variable we really want. We use proxies when the variables we really want aren’t available. Sometimes this makes interpretation difficult.

12. **Section 1.8: **If the *p-value *associated with the *F-statistic *is .05 or less, the collective effect of the ensemble of explanatory variables on the dependent variable is statistically significant.

13. **Section 1.8: ***Observations *are the individual examples of the behavior under examination. All of the observations together constitute the *sample *on which the regression is based.

14. **Section 1.8: **The *R*², or *coefficient of determination*, represents the proportion of the variation in the dependent variable that is explained by the explanatory variables. The adjusted *R*² modifies the *R*² in order to take account of the numbers of explanatory variables and observations. However, neither measures statistical significance directly.

15. **Section 1.9: ***F*-statistics can be used to evaluate the contribution of a subset of explanatory variables, as well as the collective statistical significance of all explanatory variables. In both cases, the *F*-statistic is a transformation of *R*² values.

16. **Section 1.10: **Regression results are useful only to the extent that the choices of variables in the regression, variable construction, and sample design are appropriate.

17. **Section 1.11: **Regression results may be presented in one of several different formats. However, they all have to contain the same substantive information.
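The significance rule in points 4 and 12 above can be expressed mechanically. The helper below is a hypothetical illustration of mine, using a normal approximation for the p-value; it is not code from the book:

```python
import math

def p_value(t):
    """Two-sided p-value for a t-statistic (normal approximation)."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))

def verdict(t):
    """The reading rule: significant when the p-value is .05 or less,
    which corresponds roughly to |t| of two or more."""
    return "statistically significant" if p_value(t) <= 0.05 else "insignificant"
```

For example, `verdict(6.5)` returns "statistically significant" while `verdict(1.1)` returns "insignificant", matching the rule of thumb that an absolute t-statistic of roughly two marks the .05 threshold.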

The fundamental question that underlies most of science is, how does one thing affect another? This is the sort of question that we ask ourselves all the time. Whenever we wonder whether our grade will go up if we study more, whether we’re more likely to get into graduate school if our grades are better, or whether we’ll get a better job if we go to graduate school, we are asking questions that econometrics can answer with elegance and precision.

Of course, we probably think we have answers to these questions already. We almost surely do. However, they’re casual and even sloppy. Moreover, our confidence in them is almost certainly exaggerated.

Econometrics is a collection of powerful statistical tools that are devoted to helping provide answers to the question of how one thing affects another. Econometrics not only teaches us how to answer questions like this more accurately but also helps us understand what is necessary in order to obtain an answer that we can legitimately treat as accurate.

We begin in this chapter with a primer on how to interpret regression results. This will allow us to read work based on regression and even to begin to perform our own analyses. We might think that this would be enough.

However, this chapter will not explain why the interpretations it presents are valid. That requires a much more thorough investigation. We prepare for this investigation in **chapter 2**. There, we review the summation sign, the most important mathematical tool for the purposes of this book.

We actually embark on this investigation in **chapter 3**, where we consider the precursors to regression: the covariance and the correlation. These are basic statistics that measure the association between two variables, without regard to causation. We might have seen them before. We return to them in detail because they are the mathematical building blocks from which regressions are constructed.

Our primary focus, however, will be on the fundamentals of regression analysis. Regression is the principal tool that economists use to assess the responsiveness of some outcome to changes in its determinants. We might have had an introduction to regression before as well. Here, we devote **chapters 4, 5, and 7 through 14** to a thorough discussion.

**Chapter 6** intervenes with a discussion of confidence intervals and hypothesis tests. This material is relevant to all of statistics, rather than specific to econometrics. We introduce it here to help us complete the link between the regression calculations of **chapter 4** and the behavior that we hope they represent, discussed in **chapter 5**.

**Chapter 15** discusses what we can do in a common situation where we would like to use regression, but where the available information isn’t exactly appropriate for it. This discussion will introduce us to probit analysis, an important relative of regression. More generally, it will give us some insight as to how we might proceed when faced with other situations of this sort.
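To preview the idea behind probit, here is a deliberately crude sketch of my own (invented data, and a grid search instead of the proper maximum likelihood routine):

```python
import math

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_likelihood(a, b, xs, ys):
    """Probit log-likelihood: P(y = 1 | x) = Phi(a + b*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(phi_cdf(a + b * x), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Invented data: a 0/1 outcome (say, employed or not) by years of education.
educ = [8, 10, 12, 12, 14, 16, 18, 20]
work = [0, 0, 0, 1, 1, 1, 1, 1]

# Crude grid search over the slope, holding the intercept fixed at -3.0.
best_ll, best_b = max((log_likelihood(-3.0, b, educ, work), b)
                      for b in [0.0, 0.1, 0.2, 0.3, 0.4])
```

A real probit routine maximizes this same likelihood properly, for all parameters at once; the grid search only illustrates the principle that the parameters are chosen to make the observed 0/1 outcomes most likely.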

As we learn about regression, we will occasionally need concepts from basic statistics. Some of us may have already been exposed to them. For those of us in this category, **chapters 3 and 6** may seem familiar, and perhaps even **chapter 4**. For those of us who haven’t studied statistics before, this book introduces and reviews each of the relevant concepts when our discussion of regression requires them.¹

Few of us will be interested in econometrics purely for its theoretical beauty. In fact, this book is based on the premise that what will interest us most is how econometrics can help us organize the quantitative information that we observe all around us. Obviously, we’ll need examples.

There are two ways to approach the selection of examples. Econometric analysis has probably been applied to virtually all aspects of human behavior. This means that there is something for everyone. Why not provide it?

Well, this strategy would involve a lot of examples. Most readers wouldn’t need that many to get the hang of things, and they probably wouldn’t be interested in a lot of them. In addition, they could make the book a lot bigger, which might make it seem intimidating.

The alternative is to focus principally on one example that may have relatively broad appeal and develop it throughout the book. That’s the choice here. We will still sample a variety of applications over the course of the entire text. However, our running example returns, in a larger sense, to the question of section 1.1: Why are we doing this? Except now, let’s talk about college, not this course.

Presumably, at least some of the answer to that question is that we believe college prepares us in an important way for adulthood. Part of that preparation is for jobs and careers. In other words, we probably believe that education has some important effect on our ability to support ourselves.

This is the example that we’ll pursue throughout this book. In the rest of this chapter, we’ll interpret a somewhat complicated regression that represents the idea that earnings are affected by several determinants, with education among them. In **chapter 3**, we’ll return to the basics and simply ask whether there’s an association between education and earnings. Starting in **chapter 4**, we’ll assume that education affects earnings and ask: by how much? In **chapter 10**, we’ll examine whether the assumption that education causes earnings is acceptable, and what can be done if it’s not.

As we can see, we’ll ask this question with increasing sophistication as we proceed through this book.² The answers will demonstrate the power of econometric tools to address important quantitative questions. They will also serve as illustrations for applications to other questions. Finally, we can hope that they will confirm our commitment to higher education.

**Figure 1.1** Our first regression

**Figure 1.1** is one way to present a regression.

Does that answer the question?

Superficially, yes. But what does it all mean?

This question can be answered on two different levels. In this chapter, we’ll talk about how to interpret the information in **figure 1.1**. This should put us in a position to read and understand other work based on regression analysis. It should also allow us to interpret regressions of our own.

In the rest of the book, we’ll talk about why the interpretations we offer here are valid. We’ll also talk about the circumstances under which these interpretations may have to be modified or may even be untrustworthy. There will be a lot to say about these matters. But, for the moment, it will be enough to work through the mystery of what **figure 1.1** could possibly be trying to reveal.

The first thing to understand about regression is the very first thing in **figure 1.1**. The word *earnings* identifies the *dependent variable* in the regression. The dependent variable is also referred to as the *endogenous variable*.

The dependent variable is the primary characteristic of the entities whose behavior we are trying to understand. These entities might be people, companies, governments, countries, or any other choice-making unit whose behavior might be interesting. In the case of **figure 1.1**, we might wonder if *earnings* implies that we’re trying to understand company profits. However, *earnings* here refers to the payments that individuals get in return for their labor. So the entities of interest here are workers or individuals who might potentially be workers.

*Dependent* and *endogenous* indicate the role that *earnings* plays in the regression of **figure 1.1**. We want to explain how it gets determined. *Dependent* suggests that earnings depends on other things. *Endogenous* implies the same thing, though it may be less familiar. It means that the value of earnings is determined by other, related pieces of information.

The question of what it means to *explain* something statistically can actually be quite subtle. We will have some things to say about this in **chapters 4 and 10. Initially, we can proceed as if we believe that the things that we use to explain earnings actually cause earnings. **

Most of the rest of **figure 1.1, from the equality sign to (–.199), presents our explanation of earnings. The equality sign indicates that we’re going to represent this explanation in the form of an equation. On the right side of the equation, we’re going to combine a number of things algebraically. Because of the equality, it looks as though the result of these mathematical operations will be earnings. Actually, as we’ll learn in chapter 4, it will be more accurate to call this result predicted earnings. **

The material to the right of the equality sign in **figure 1.1 is organized into terms. The terms are separated by signs for either addition (+) or subtraction (–). Each term consists of a number followed by the sign for multiplication (×), a word or group of words, and a second number in parentheses below the first. **

In each term, the word or group of words identifies an *explanatory variable. *An explanatory variable is a characteristic of the entities in question, which we think may cause, or help to create, the value that we observe for the dependent variable.

Explanatory variables are also referred to as *independent variables*. This indicates that they are not dependent. For our present purposes, this means that they do not depend on the value of the dependent variable. Their values arise without regard to the value of the dependent variable.

Explanatory variables are also referred to as *exogenous variables*. This indicates that they are not endogenous. Their values are assigned by economic, social, or natural processes that are not under study and not affected by the process that is. The variables listed in the terms to the right of the equality can be thought of as causing the dependent variable, but not the other way around. We often summarize this assumption as *causality runs in only one direction*.

This same idea is sometimes conveyed by the assertion that the explanatory variables are *predetermined*. This means that their values are already known at the moment when the value of the dependent variable is determined. They have been established at an earlier point in time. The point is that, as a first approximation, behavior that occurs later, historically, can’t influence behavior that preceded it.**³ **

This proposition is easy to accept in the case of the regression of **figure 1.1. Earnings accrue during work. Work, or at least a career, typically starts a couple of decades into life. Racial or ethnic identity and sex are usually established long before then. Age accrues automatically, starting at birth. Schooling is usually over before earnings begin as well. Therefore, it would be hard to make an argument at this stage that the dependent variable, earnings, causes any of the explanatory variables.⁴ **

In each term of the regression, the number that multiplies the explanatory variable is its *slope*. The reason for this name will become apparent in **chapter 4.⁵ The slope estimates the magnitude of the effect that the explanatory variable has on the dependent variable. **

Finally, the number in parentheses measures the precision of the slope. In **figure 1.1, these numbers are t-statistics. What usually matters most with regard to interpretation of a t-statistic is its p-value. However, p-values don’t appear in figure 1.1, because they don’t appear in the most common presentations of regression results, which is what our discussion of figure 1.1 is preparing us for. **

We’ll offer an initial explanation of *p*-values and their interpretations in section 1.8. We’ll present the *p*-values for **figure 1.1 in table 1.3 of section 1.11. Finally, we’ll explore the calculation and interpretation of t-statistics at much greater length in chapter 6. **

In the presentation of **figure 1.1, what matters to us most is the absolute value of the t-statistic. If it is approximately two or greater, we can be pretty sure that the associated explanatory variable has a discernible effect on the dependent variable. In this case, we usually refer to the associated slope as being statistically significant, or just significant. It is our best guess of how big this effect is. **

If the absolute value of the *t*-statistic is less than approximately two, regression has not been able to discern an effect of the explanatory variable on the dependent variable, according to conventional standards. There just isn’t enough evidence to support the claim that the explanatory variable actually affects the dependent variable. In this case, we often refer to the associated slope as *statistically insignificant*, or just *insignificant*.

As we’ll see in **chapter 6, this is not an absolute judgment. t-statistics that are less than two in absolute value, but not by much, indicate that regression has identified an effect worth noting by conventional standards. In contrast, t-statistics that have absolute values of less than, say, one, indicate that there’s hardly a hint of any discernible relationship.⁶ Nevertheless, for reasons that will become clearer in chapter 6, we usually take two to be the approximate threshold distinguishing explanatory variables that have effects worth discussing from those that don’t. **

As we can see in **figure 1.1, regression calculates a value for the slope regardless of the value of the associated t-statistic. This discussion demonstrates, however, that not all of these slopes have the same claim on our attention. If a t-statistic is less than two in absolute value, and especially if it’s a lot less than two, it’s best to assume, for practical purposes, that the associated explanatory variable has no important effect on the dependent variable. **
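To make this screening rule concrete, here is a minimal Python sketch. The t-statistic of –.199 is the one visible in figure 1.1; the other value is hypothetical, included only for contrast.

```python
# Conventional rule of thumb: a slope is statistically significant
# when the absolute value of its t-statistic is at least two.
def is_significant(t_statistic, threshold=2.0):
    """Return True when |t| clears the conventional threshold."""
    return abs(t_statistic) >= threshold

# -.199 is quoted from figure 1.1; 10.5 is a hypothetical contrast.
for name, t in [("hypothetical slope", 10.5), ("last slope in figure 1.1", -0.199)]:
    verdict = "significant" if is_significant(t) else "insignificant"
    print(f"{name}: t = {t}, {verdict}")
```

Slopes that fail this screen still appear in the output, just as regression still reports them; the point is only that we shouldn’t lean on them.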

The regression in **figure 1.1 contains nine terms. Eight contain true explanatory variables and are potentially interesting. **

The first term, which **figure 1.1 calls the intercept, does not contain a true explanatory variable. As we’ll see in chapter 4 and, more important, in chapter 7, we need it to predict values of the dependent variable. **

Otherwise, the intercept is ordinarily uninteresting.**⁷ It measures a part of the dependent variable that is common to all entities under examination. In other words, it measures a part that doesn’t depend on the other characteristics of these entities that are included as explanatory variables in the regression. **

This isn’t usually interesting because we’re typically concerned with explaining why different people or organizations are different from each other. The intercept usually tells us only about what they share. Consequently, it isn’t informative about the relevant question.

In the case of **figure 1.1, the interesting question is why some people have higher earnings than others. The intercept in the regression there tells us that everyone starts out with –$19,427 in annual earnings, regardless of who they are. **

This can’t be literally true. It’s probably best to take the intercept as simply a mechanical device. As we’ll see in **chapter 4, its purpose is just to provide an appropriate starting point from which to gauge the effects of the genuine explanatory variables. **

The eight true explanatory variables are potentially interesting because they measure specific characteristics of each person. Regression can attempt to estimate their contributions to earnings because they appear explicitly in the regression. The first question that must be asked with regard to any of them is whether the regression contains any evidence that they actually affect the dependent variable.

As we learned in the last section, the *t*-statistic answers this question. Therefore, the first number to look at with regard to any of the explanatory variables in **figure 1.1 is the number in parentheses. If it has an absolute value that is greater than two, then the regression estimates a discernible effect of the associated explanatory variable on the dependent variable. These variables deserve further attention. **

In **figure 1.1, t-statistics indicate that four explanatory variables are statistically significant, or have discernible effects: years of schooling, age, female, and black. With regard to these four, the next question is, how big are these effects? As we said in the last section, the answer is in the slopes. **

Two of these explanatory variables, years of schooling and age, are *continuous*. This means that they can take on a wide range of values. This regression is based on individuals whose years of schooling range from 0 to 21. Age varies from 18 to 65.**⁸ **

For these variables, the simplest interpretation of their slopes is that they estimate how earnings would change if years of schooling or age increased by a year. For example, the slope for years of schooling is 3,624.3. This indicates that earnings could be expected to increase by $3,624.30 for each additional year devoted to study. Similarly, the slope for age is 378.60. Earnings would increase by $378.60 annually simply as an individual grows older.

This interpretation is based on the image of following individuals as their lives evolve. This sort of image will usually be helpful and not grossly inappropriate. However, it’s not exactly what’s going on in **figure 1.1. That regression is not comparing different moments in the life of the same individual. Instead, it’s comparing many different individuals, of different ages, all observed at the same moment, to each other. **

This suggests a more correct, though more subtle interpretation of, for example, the slope associated with years of schooling. It actually compares the earnings of two individuals who have the same values for all of the other explanatory variables, but who differ by one year in their schooling. In other words, the regression of **figure 1.1 tells us that if we had two individuals who had the same racial or ethnic identification, were of the same sex and age, but differed by one year in schooling attainment, we would expect the individual with greater schooling to have annual earnings that exceeded those of the other individual by $3,624.30. **
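As a minimal sketch, the ceteris paribus comparison can be written as a small Python function. It uses only the intercept and the four statistically significant slopes quoted in this chapter; the insignificant slopes are omitted, so this is an illustration rather than the full regression of figure 1.1.

```python
# Prediction rule built from the intercept and the four significant
# slopes quoted in the text; the insignificant indicators are omitted,
# so this is only a sketch of the full regression in figure 1.1.
def predicted_earnings(schooling, age, female, black):
    return (-19427
            + 3624.30 * schooling
            + 378.60 * age
            - 17847 * female
            - 10130 * black)

# Two people identical except for one year of schooling:
less_schooled = predicted_earnings(schooling=12, age=40, female=0, black=0)
more_schooled = predicted_earnings(schooling=13, age=40, female=0, black=0)
print(more_schooled - less_schooled)  # the schooling slope, about $3,624.30
```

Because every other argument is held fixed, the difference in predicted earnings is exactly the schooling slope; that is all the ceteris paribus interpretation says.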

**Chapter 11 will demonstrate formally why this interpretation is appropriate. Until then, it’s enough to summarize the interpretation of the preceding paragraph as follows: Any slope estimates the effect of the associated explanatory variable on the dependent variable, holding constant all other independent variables. **

This interpretation is often conveyed by the Latin phrase *ceteris paribus*.**⁹ It’s important to remember that, regardless of the language in which we state this interpretation, it means that we are holding constant only the other variables that actually appear as explanatory in the regression.¹⁰ We will often summarize this condition by stating that we are comparing individuals or entities that are otherwise similar, except for the explanatory variable whose slope is under discussion. **

The ceteris paribus interpretation of the slope for the age variable is analogous to that of years of schooling. Again, if we compared two individuals who had identical racial or ethnic identities, the same sex and level of education, but who differed by one year in age, the older’s earnings would exceed those of the younger by $378.60.

The two remaining statistically significant explanatory variables are female and black. Both are *discrete variables*, or *categorical variables*, meaning that they identify the presence of a characteristic that we ordinarily think of as indivisible. In this case, the variable female distinguishes women from men, and the variable black distinguishes individuals who reported themselves as at least partially black or African American from those who did not.**¹¹ **

For this reason, the interpretation of the slopes associated with discrete explanatory variables differs somewhat from that of the slopes associated with continuous explanatory variables. In the latter case, the slope indicates the effect of a *marginal *change in the explanatory variable. In the former case, the slope indicates the effect of changing from one category to another.

At the same time, the interpretation of slopes associated with discrete variables is ceteris paribus, as is that of slopes associated with continuous variables. In other words, if we compare a man and a woman who have the same age, the same amount of schooling, and the same racial or ethnic identities, we expect their incomes to differ by the amount indicated by the slope associated with the female variable.

In **figure 1.1, the slope indicates that the woman would have annual earnings that are $17,847 less than those of the man. Similarly, the slope for blacks indicates that, if we compare two individuals of the same age, schooling, and sex, annual earnings for the black person will be $10,130 less than those of the otherwise similar white person. **

These must seem like very large differences. We’ll talk about this in the next section. We conclude this section by noting that the slopes of the variables identifying individuals who are Hispanic, American Indian or Alaskan Native, and Asian are all statistically insignificant. Nevertheless, they also seem to be large. Even the smallest of them, that for Hispanics, indicates that their annual earnings might be $2,309.90 less than those of otherwise similar whites.

Although the magnitudes of the slopes for these variables might seem large and even alarming, it’s inappropriate to take them seriously. Not only are their *t*-statistics less than two, they’re a lot less than two. This means that, even though regression has calculated slopes for these variables, it really can’t pin down their effects, if any, with any precision. In later chapters, we’ll discuss what we might do if we wanted to identify them more clearly.

Regression interpretation proceeds through three steps. We’ve already taken the first. It was to identify the explanatory variables that have statistically significant slopes. For the most part, regression is only informative about these explanatory variables. They are the only variables for which regression can estimate reliable effects.

We’re also halfway through the second step, which is to interpret the magnitude of these effects. Effects that are both statistically significant and substantively large are the ones that really catch our attention.

The coefficient on the categorical variable for females is an example. Not only is it estimated very reliably, it indicates that women have annual earnings that are almost $18,000 less than those of otherwise similar men. In economic terms, this difference seems huge.**¹² **

The slope associated with blacks is similar. Its *t*-statistic is smaller, as is its magnitude. However, the *t*-statistic is still large enough to indicate statistical significance. The magnitude is still big enough to be shocking.

It takes a little more work to evaluate the effect of years of schooling. Its *t*-statistic indicates that it is reasonably precise. Its magnitude, however, is markedly smaller than that of the slopes for females and blacks.

Nevertheless, this slope indicates that a worker with one more year of schooling than an otherwise similar worker will have earnings that are greater by $3,624.30 in every year that he or she is of working age. If a typical working career lasts for, perhaps, 40 years, this advantage accumulates to something quite substantial.

Another way to think of this is to calculate the earnings advantage conferred by completing an additional level of schooling. People with college degrees have approximately four more years of schooling than those who end their formal education with high school graduation. Relative to these people, an individual with a college degree will get the $3,624.30 annual earnings premium for each of his or her four additional years of schooling.

This amounts to a total annual earnings premium of $14,497.20. This premium is again quite large. It explains why so many people continue on to college after high school and why there is so much concern regarding the career prospects for those who don’t.

It also presents an interesting comparison to the slopes for women and blacks. The slope for women is larger than the earnings premium for four years of schooling. This suggests that, in order for women to have the same annual earnings as otherwise similar men, they would have to have more than four additional years of schooling. Similarly, blacks would have to have nearly three additional years of schooling in order to attain the same earnings level as otherwise similar whites.

The remaining explanatory variable with a statistically significant effect is age. Its slope is about one-tenth of that associated with years of schooling, so its effect is substantively much smaller. Two otherwise similar individuals would have to differ in age by about 27 years in order to have an earnings differential similar to that between an otherwise similar black and white individual of the same age.**¹³ **
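The back-of-envelope comparisons in this section are just ratios of the reported slopes. A quick Python check, taking the slopes quoted from figure 1.1 as given:

```python
# Slopes quoted from figure 1.1, in dollars of annual earnings.
schooling_slope = 3624.30
age_slope = 378.60
female_slope = -17847
black_slope = -10130

# A college degree adds roughly four years of schooling.
college_premium = 4 * schooling_slope
print(college_premium)                   # about $14,497.20

# Years of schooling needed to offset each earnings gap.
print(-female_slope / schooling_slope)   # more than four years
print(-black_slope / schooling_slope)    # nearly three years

# Age difference that matches the black-white earnings gap.
print(-black_slope / age_slope)          # about 27 years
```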

This raises a very interesting point. It is possible for an explanatory variable to have an effect that is statistically significant, but economically, or *substantively*, unimportant. In this case, regression precisely identifies an effect that is small. While this may be of moderate interest, it can’t be nearly as intriguing as an effect that is both precise and large.

In other words, statistical significance is *necessary *in order to have interesting results, but not *sufficient*. Any analysis that aspires to be interesting therefore has to go beyond identifying statistical significance to consider the behavioral implications of the significant effects. If these implications all turn out to be substantively trivial, their significance will be of limited value.

This second interpretive step reveals another useful insight. The slope for females in **figure 1.1 is given as –$17,847. The computer program that calculated this slope added several digits to the right of the decimal point. However, the most important question that we’ve asked of this slope is whether it is substantively large or small. Our answer, just above, was that it is huge. Which of the digits in this slope provide this answer? **

It can’t be the digits to the right of the decimal point. They aren’t even presented in **figure 1.1. It also can’t be the digits in the ones or tens place in the slope as it is presented there. If this slope had been –$17,807 instead of –$17,847, would we have concluded that it wasn’t huge, after all? Hardly. In fact, this effect would have arguably looked huge regardless of what number was in the hundreds place. **

In other words, the substantive interpretation that we applied to this variable really depended almost entirely on the first two digits. The rest of the digits did not convey any really useful information, except for holding their places.

At the same time, they didn’t do much harm. They wouldn’t, unless they’re presented in such profusion that they distract us from what’s important. Unfortunately, that happens a lot. We should be careful not to invest too much effort into either interpreting these digits in other people’s work or presenting them in our own.**¹⁴ **

The third interpretive step is to formulate a plausible explanation of the slope magnitudes, based on our understanding of economic and social behavior. In fact, this is something we should have done already. Why did we construct the regression in **figure 1.1 in the first place? Presumably, because we had reasons to believe that the explanatory variables were important influences on earnings. **

It’s now time to revisit those reasons. We compare them to the actual regression results. Where they are consistent, our original beliefs are confirmed and strengthened. Where they are inconsistent, we have to consider revising our original beliefs. This is the step in which we consolidate what we have learned from our regression. Without it, the first two steps aren’t of much value.

We begin this step by simply asking, Why?

For example, why does education have such a large positive effect on income? It seems reasonable to believe that people with additional education might be more adept at more sophisticated tasks. It seems reasonable to believe that more sophisticated tasks might be more profitable for employers. Therefore, it may be reasonable to expect that employers will offer higher levels of pay to workers with higher levels of education.

The regression in **figure 1.1 confirms these expectations. This, in itself, is valuable. Moreover, our expectations had very little to say about exactly how much employers would be willing to pay for an additional year of schooling. The slope estimates this quantity for us, in this case with a relatively high degree of precision. **

Of course, there was no guarantee that the regression would be so congenial. How would we have responded to a slope for years of schooling that was inconsistent with our expectations, either too low or too high?**¹⁵ What would we have done if the data, representing actual experience, were inconsistent with our vision about what that experience should be like? **

A contradiction of this sort raises two possibilities. Either our expectations were wrong, or something was wrong with the regression that we constructed in order to represent them. It would be our obligation to review both in order to reconcile our expectations and experience.

Ordinarily, we would begin with the issue where we were least confident in our initial choices. If we were deeply committed to our expectations, we would suspect the regression. If we believed that the regression was appropriate, we would wonder first what was faulty in our expectations.

In the case of years of schooling, its estimated effect probably leaves most of us in the following position: We were already fairly certain that more education would increase earnings. Our certainty is now confirmed: We have an estimate of this effect that is generally consistent with our expectations. In addition, we have something much more concrete than our own intuition to point to for support when someone else takes a contrary position.

The slope for age presents a different explanatory problem. Why might we expect that earnings would change with age? We can certainly hope that workers become more productive as they learn more about their work. This would suggest that earnings should be greater for older workers.

At the same time, at some point in the aging process, workers become less vigorous, both physically and mentally. This should reduce their productivity and therefore their wages.

This might be an explanation for why the slope for age is relatively small in magnitude. Perhaps it combines an increase in earnings that comes from greater work experience and a reduction in earnings that comes from lower levels of activity. The first effect might be a little stronger than the second, so that the net effect is positive but modest.

This is superficially plausible. However, a little more thought suggests that it is problematic. For example, is it plausible that these two effects should cancel each other to the same degree, regardless of worker age?

It may seem more reasonable to expect that the effects of experience should be particularly strong when the worker has very little, at the beginning of his or her career. At the ages when most people begin to work, it’s hard to believe that vigor noticeably deteriorates from one year to the next. If so, then productivity and, therefore, earnings, should increase rapidly with age for young workers.

Conversely, older workers may have little more to learn about the work that they do. At the same time, the effects of aging on physical and mental vigor may be increasingly noticeable. This implies that, on net, productivity and earnings might decline with age among older workers.

**Figure 1.2 illustrates these ideas. As shown in the figure, a more thorough understanding of the underlying behavior suggests that the effects of increasing age on earnings should depend on what age we’re at. We’ll learn how to incorporate this understanding into a regression analysis in chapter 13. For the moment, it suggests that we should be cautious about relying on the slope for age in the regression of figure 1.1. It’s estimated reliably, but it’s not clear what it represents. **

**Figure 1.2 **Potential effects of age on earnings

This difficulty is actually a symptom of a deeper issue. The confusion arises because, according to our explanation, the single explanatory variable for age is being forced to do two different jobs. The first is to serve as a rough approximation, or *proxy*, for work experience. The second is to serve, again as only a proxy, for effort.

In other words, the two variables that matter, according to our explanation, aren’t in the regression. Why? Because they aren’t readily available. The one variable that is included in the regression, age, has the advantage that it is available. Unfortunately, it doesn’t reproduce either of the variables that we care about exactly. It struggles to do a good job of representing both simultaneously. We’ll return to this set of issues in **chapters 11 and 12. **

What about the large negative effects of being female or black? We might be tempted to explain them by surmising that women and blacks have lower earnings than do white males because they have less education, but that would be wrong. Education is entered as an explanatory variable in its own right. This means that, as we said in section 1.6, it’s already held constant. The slope for females compares women to males who are otherwise similar, including having the same years of schooling. In the same way, the slope for blacks compares blacks to whites who are otherwise similar, again with the same years of schooling.

The explanation must lie elsewhere. The most disturbing explanation is that women and blacks suffer from discrimination in the labor market. A second possibility is that women and blacks differ from white males in some other way that is important for productivity, but not included in the regression of **figure 1.1. We will have to leave the exploration of both of these possibilities to some other time and context. **

A third possibility, however, is that the quality of the education or work experience that women and blacks get is different from that of white males. We’ll talk about how we might address this in **chapters 13 and 14. **

At this point, we know what explanatory variables seem to be important in the determination of earnings, and we have explanations for why they might be so. Can we say anything about how good these explanations are as a whole?

The answer, perhaps not surprisingly, is yes. The remaining information in **figure 1.1 allows us to address this question from a statistical perspective. The most important piece of additional information in figure 1.1 is the p-value associated with the F-statistic. The F-statistic tests whether the whole ensemble of explanatory variables has a discernible collective effect on the dependent variable. In other words, the F-statistic essentially answers the question of whether the regression has anything at all useful to say about the dependent variable. **

**Chapter 12 will explain how it does this in detail. For the moment, the answer is most clearly articulated by the p-value, rather than by the F-statistic with which it is associated. If the p-value is .05 or less, then the joint effect of all explanatory variables on the dependent variable is statistically significant.¹⁶ **

If the *p*-value is larger than .05, the ensemble of explanatory variables does not have a jointly reliable effect on the dependent variable. We’ll explore the implications of this in **chapter 12. It could be that a subgroup of explanatory variables really does have a reliable effect that is being obscured by the subgroup of all other explanatory variables. But it could also be that the regression just doesn’t tell us anything useful. **

In the case of **figure 1.1, the p-value associated with the F-statistic is so small that the computer doesn’t calculate a precise value. It simply tells us that the p-value is less than .0001. Further precision isn’t necessary, because this information alone indicates that the true p-value is not even one five-hundredth of the threshold value of .05. This means that there can be almost no doubt that the joint effect of the collection of explanatory variables is statistically significant. **

What’s left in **figure 1.1** are two *R*² measures. The first, *R*², is sometimes written as the "R-square" or "R-squared" value. It is sometimes referred to as the *coefficient of determination*. The *R*² represents the proportion of the variation in the dependent variable that is explained by the explanatory variables. The natural interpretation is that if this proportion is larger, the explanatory variables have a more dominant influence on the dependent variable. So bigger is generally better.

The question of how big *R*² should be is difficult. First, the value of *R*² depends heavily on the context. For example, the *R*² value in **figure 1.1 is approximately .17. This implies that the eight explanatory variables explain a little less than 17% of the variation in annual earnings. **

This may not seem like much. However, experience shows that this is more or less typical for regressions that are comparing incomes of different individuals. Other kinds of comparisons can yield much higher *R*² values, or even lower values.

The second reason why the magnitude of *R*² is difficult to evaluate is that it depends on how many explanatory variables the regression contains and how many individuals it is comparing. If the first is big and the second is small, *R*² can seem large, even if the regression doesn’t provide a very good explanation of the dependent variable.**¹⁷ **

The adjusted *R*² is an attempt to correct for the possibility that *R*² is distorted in this way. Chapter 4 gives the formula for this correction, which naturally depends on the number of explanatory variables and the number of individuals compared, and explains it in detail. The adjusted *R*² is always less than *R*². If it’s a lot less, this suggests that *R*² is misleading because it is, in a sense, trying to identify a relatively large number of effects from a relatively small number of examples.
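The standard form of this correction, sketched below in Python, makes the point concrete; the small-sample example at the end is hypothetical, chosen only to show how the gap between *R*² and adjusted *R*² can widen.

```python
# Standard adjusted R-squared: penalizes the number of explanatory
# variables (k) relative to the number of observations (n).
def adjusted_r_squared(r_squared, n, k):
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# The regression of figure 1.1: R-squared of about .17,
# 1,000 observations, and 8 explanatory variables.
print(adjusted_r_squared(0.17, 1000, 8))  # about .163, barely below .17

# Hypothetical small sample with the same R-squared: the correction
# is severe, and the adjusted value can even turn negative.
print(adjusted_r_squared(0.17, 20, 8))
```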

The number of individuals involved in the regression is given in **figure 1.1 as the number of observations. Observation is a generic term for a single example or instance of the entities or behavior under study in a regression analysis. In the case of figure 1.1, each individual represents an observation. All of the observations together constitute the sample on which the regression is based. **

According to **figure 1.1, the regression there is based on 1,000 observations. That is, it compares the value of earnings to the values of the eight explanatory variables for 1,000 different individuals. This is big enough, and the number of explanatory variables is small enough, that the R² should not be appreciably distorted. As figure 1.1 reports, the adjusted R² correction doesn’t reduce R² by much. **

We’ve examined *R*² in some detail because it gets a lot of attention. The reason for its popularity is that it seems to be easy to interpret. However, nothing in its interpretation addresses the question of whether the ensemble of explanatory variables has any discernible collective effect on the dependent variable. For this reason, the attention that *R*² gets is misplaced. It has real value, not because it answers this question directly, but because it is the essential ingredient in the answer.
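The connection can be made concrete with the standard textbook formula that builds the *F*-statistic from *R*², sketched here using the values reported for the regression of figure 1.1; this is an illustration of the relationship, which chapter 12 derives properly.

```python
# Standard textbook link between R-squared and the F-statistic:
#   F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
# using the values reported for the regression of figure 1.1.
r_squared = 0.17   # approximate R-squared
n = 1000           # observations
k = 8              # explanatory variables, excluding the intercept

F = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
print(F)  # roughly 25
```

With eight explanatory variables and nearly a thousand observations, the 5 percent critical value is just under two, so an *F*-statistic near 25 leaves no doubt that the joint effect is significant. This is consistent with the tiny *p*-value reported in figure 1.1.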

The *F*-statistic, which we discussed in the last section, addresses directly the question of whether there is a discernible collective effect. That’s why we’ll emphasize it in preference to *R*². Ironically, as **chapter 12 will prove, the F-statistic is just a transformation of R². It’s R² dressed up, so to speak, so as to **
