You are on page 1of 648
Jos] [igen SISATVNV NOISSAYOTa REGRESSION ANALYSIS Concepts and Applications With SAS or Minitab Lab Manual Franklin A. Graybill Hariharan K. Iyer Assumptions (A) for Multiple Linear Regression Notation The (k + U-variable population (WX)... XQ} is the study population (Population) Assumption 1 ‘The mean yotsi... a) of the subpopulation of ¥ values with X;=.0),....X) = mis Hilti oa) = Ba + Buty + ++ + Bix where By. Bis... Bi are unknown parameters and sie... X: belong to the set of allowable values (sometimes called the domain) of the predictor variables. (Population) Assumption 2 The standard deviation of the ¥ values in the subpopulations with Xi = xy, . Xe = x, does not depend on the values 1), .. . , 11 (ie. the standard deviations are the same for each subpopulation determined by specified values of the predictor variables Xi, ..... X,). This common standard deviation of all the subpopulations is denoted by oy sy...» When there is no possibility of confusion, we use the simpler notation «instead of the more com plete notation oyjx,.. x (Population) Assumption 3 Each subpopulation of ¥ values, deter- mined by specified values of Xi, .... Xi is a Gaussian population, (Sample) Assumption 4 The sample data are obtained by simple random sampling or by sampling with preselected values of X\, X,, discussed in Section 2.3. The number of items in the sample is (Sample) Assumption 5 All. sample values yc ti... .t for 1 = 1... 1 are observed without error (but read Section 3.10). Assumptions (B) for Multiple Linear Regression Population) Assumption 1 The study population {(¥, Xi, .. . .Xa)} is 1 (k + 1)-variable Gaussian population. (Sample) Assumption 2 The sample data are obtained by simple random sampling described in Section 2.3; i.e., a simple random ‘sample of n items is selected from the population and the values of the variables ¥, Xi, ... , Xs are observed. (Sample) Assumption 3 The sample values yi, X1,.-., Xa for i= 1,...,mare measured without error. ———— ee Simple Random Sampling ‘Sample data are obtained by selecting a simple random sample of n items from the entire population of N items and recording the values for the response variable ¥ and the predictor variables X1, . . .. X, for each item in the sample, Refer to Section 1.6. Random Sampling with Preselected X values Specific values of the predictor variables Xi, . . ., Xs are preselected by the investigator, and each of these preselected sets of values determines a subpopulation of ¥ values. A simple random sample of one or more Y values is selected from each of these subpopul ‘observations to be sampled from each subpopulat by the investigator. Concepts and. Regression Analysis: Applications Our Purpose Our Approach Our Emphasis Preface Tarts ineie o vw ne sone coe ned rg sion” that focuses on concepts and applications to be taught to juniors, seniors, or first-year graduate students. Its prerequisites are: (I) one course in statistical ‘methods, (2) proficiency in high school mathematics, and (3) some familiarity with computers. Of course, it is always useful for students to have more mathe- ‘matics, statistics, or computing than the minimal prerequisites. ‘There are two kinds of statistics books—theory books and methods books. Theory books generally start with a discussion of probability and populations (probability distributions) and then discuss sampling, whereas books on statisti- ccal methods tend to start directly with data and do not discuss populations thoroughly. We believe that the theoretical approach is sound and that high school mathematics is sufficient to understand populations as they relate tomany applications. Consequently we first introduce populations and then discuss sam- pling, focusing all the while on the concepts underlying statistical inference. We have endeavored to give an accurate and clear account of linear regression with an emphasis on prediction. We have deemphasized some traditional con- cepts such as correlation and tests of hypotheses and significance. We stress standard deviations of prediction errors (when using unbiased predictors). rather 1 Preface than correlations, to judge the adequacy of regression functions, and we use confidence intervals in place of tests. This change in emphasis is more than superficial because it influences the way practical questions are formulated and answered. Use of Finite Population Models ‘The Organization Computing Although most of the mathematical theory of regression is derived using infinite population models, we feel that it is easier for students to understand many concepts—such as random sampling, unbiasedness, sampling distributions, and the relative frequency interpretation of confidence intervals and tests—by using finite population models. Moreover, most populations we come actoss in real ;problems happen to be finite populations. For this reason, we attempt to explain ‘important concepts using finite populations. On the other hand, the theory based on infinite models is essentially valid for topics covered in this book because real populations under study usually consist of a large number of items, and so infinite population models serve as good approximations. Chapter 1 isa review of what is normally covered in a first course in statistical ‘methods plus some material about matrices, functional notation, and the multi- ‘variate Gaussian (normal) population. We suggest that Chapter I not be omitted even if students have had one or more courses in statistics because it includes some notation, terminology, and specific prerequisites for the rest of the book. Chapter 2 introduces many of the fundamental concepts underlying regression. Chapter 3 contains a detailed discussion of straight line regression. Chapter 4 zzives an in-depth study of multiple linear regression. Chapter 5 deals with diagnostic procedures in regression analysis. Chapters 6 and 7 treat special topics that cover several important practical applications of linear regression. Chapter 8 discusses some procedures, including a distribution-free method, that are valid when traditional assumptions for straight line regression are not appropriate. Finally, Chapter 9 gives a brief discussion of nonlinear regression, ‘The frst four chapters form the core of the book. Thereafter, any or all ofthe sections in Chapters 5 through 9 can be studied in any order since each section in these chapters depends on only the material in Chapters 1 through 4. It is difficult, and practically speaking impossible, to do all of the computing required in multiple regression without the use of some software computing package, yet we did not want details about computing software or hardware to interfere with the flow of the material. Thus we decided to avoid discussion of ‘The Data Sets Preface ti computing issues in the textbook and to make laboratory manuals available ‘where computing is discussed. The textbook is self-contained and is not depen- dent on knowledge of any computing package. We include relevant outputs from. MINITAB and/or SAS whenever the required computing is, to the say the least, tedious. As a result, students can learn the material without any statistical com- puting package, although we strongly advise using a suitable one. We have written two laboratory manuals—one for MINITAB and one for SAS, We chose MINITAB because itis an easy statistical package for students to leam without detracting from the main subject. We also chose SAS because ‘many graduate students know and use this package or want to learn it because itis widely used in government, industry, and business, The laboratory manuals contain instructions for using MINITAB and SAS to solve problems and exer- cises in the textbook. This arrangement offers students the following choi 1 For those who wish to use MINITAB, a laboratory manual explaining MINITAB commands needed for regression is available with the textbook. 2. For those who wish to use SAS, a laboratory manual explaining basic SAS commands needed for regression is similarly available with the textbook. 3. For those who do not wish to use computers, the textbook is self-sufficient and they need not concern themselves with the laboratory assignments con- tained in the laboratory manuals. Practically every data set used in the book appears “real,” and many are based (on real studies. The problem descriptions associated with the data sets in the book do not require knowledge of any particular field, such as physics, chem- istry, biology, or engineering. The data sets involve common and well-known topics such as insurance premiums, grade point averages, final and midterm exams, electric bills, grocery costs, strength of plastic containers, professors’ salaries, and atmospheric pollutants. Examples, Problems, and Exercises In the examples, problems, and exercises we have aimed to ask and answer ‘questions that investigators may encounter in the course of an investigation and rot questions that are of interest to statisticians only. In particular, we have tried to explain how to formulate practical questions about population parameters in the statistical model being considered. ‘There are more than 100 examples (over half of them worked out in detail), and there are 14 tasks and more than 85 data sets throughout the book to illustrate, ‘each new procedure as itis presented. Problems at the end of each section help students determine whether they understand the material covered in that section. Exercises at the end of each chapter cover the material in the entire chapter. Answers to selected problems and exercises are given in an appendix. ‘The Mathematics ‘The Notation Bi Preface ‘We believe that students can lear, understand, and use the techniques of regres- sion correctly and effectively in their applied work without a great deal of ‘mathematics and theoretical statistics. In many statistics courses, students prove ‘theorems without really understanding the statistical concepts the theorems im- ply, and in some cases this can be an impediment to developing their intuition ‘about statistical methods. Consequently this book requires very little mathemat- ics and no theorems are proven. For many procedures, we give a heuristic explanation to help students understand the underlying ideas. Instructors who wish to teach theoretical (mathematical) regression techniques can supplement this book with appropriate mathematical procedures. This should be much easier than using a mathematical text and supplementing it with appropriate method- ological procedures that include data sets, computing techniques, etc. ‘Some students may find the notation difficult and troublesome, but adequate notation is required to understand many of the concepts and procedures. We have simplified the necessary notation as much as possible, and we believe that it will actually help clarify many of the concepts if students study it carefully Recommended Coverage ‘One way to cover the material inthis book for a one-semester course that meets three hours per week is as follows: 1. The material in Chapter 1 can be covered rather quickly but deliberately because itis supposed to be a review. 2. Chapter 2, which is an introduction to the concept of regression, can also be covered rather quickly. 3. Chapters 3 and 4 should be covered in depth. However, the instructor may elect to spend very litle time on regression analysis when there are measure- ‘ment errors in Section 3.10 and lack-of-fit analysis in Section 4.11. 4 After Chapters 1 through 4 have been completed, instructors can teach any section in the remainder of the book. Some instructors may elect to cover the topics in one of the ways described below: Selection of variables, growth curves, and nonlinear regression, along with other topics. Selection of variables, tolerance intervals, and prediction intervals, etc. Selection of variables, nonlinear regression, spline regression, etc. Preface Ii Noteworthy Features Some additional noteworthy features of the book are: 1 An exceptional range of applications is presented in Chapters 6 and 7—toler- ance intervals, calibration, regulation, intersection of lines, variable sclection, growth curves, maximum and minimum of a quadratic, spline regression, and ‘many others. 2. Foreach and every technique discussed, we stress the assumptions needed for that technique to be valid. 3 We point out that no valid point estimates or confidence intervals exist for certain population parameters unless sampling is carried out by a prescribed ‘method. This relationship between the sampling method and the availability of valid inference procedures is often ignored in many textbooks. 4 We downplay tests of hypotheses and tests of significance but emphasize the use of confidence intervals for making practical decisions from data because confidence intervals are more informative than tests for making these deci- sions. Tests are used by investigators to assess “statistical significance,” but confidence intervals can be used to help assess “practical importance.” 5 We use word problems, called tasks, to help students sce how regression can be used to answer practical questions. 6 We provide an altemative approach for assessing lack-of-fit of prediction ‘models using confidence intervals rather than traditional tests. 7 We stress the fact that correlation must be used with caution in real problems and that it is not an adequate measure of regression functions. 8 We use standard deviations, rather than variances, because standard devia- tions are easier to interpret in applied problems. 9 Al the data sets used in the book are available on a data disk that accompa~ nies each laboratory manual. 10 In the laboratory manuals we show how MINITAB and SAS can be used to analyze the data sets and solve problems and exercises in the textbook. For every topic in regression that is discussed in the book, we discuss a computer program that can be used for computations. 11 For problems that cannot be solved using the built-in commands in MINITAB or SAS, we have written programs for MINITAB and SAS. These programs are called macros, and the laboratory manuals explain how to use them, These macros are available on the data disk. They are intended for only the problems discussed in this book, and they are not necessarily valid for ‘other, more complex, problems. 12 We make recommendations as to which procedures should be used in applied problems. 13 We have included several conversations between a statistician and an inves- tigator to help clarify certain concepts. TW Acknowledgments Itis our hope that this book will serve the needs of students of statistics and researchers who wish to have an in-depth understanding of the concepts underly- ing regression analysis and alternative ways of formulating questions of interest in practical applications. To Jeanne, our children and grandchildren for the good times. F.G. To my grandparents, my parents, all my teachers, and to Pam, Matthew, Kristin, Kevin, and Geoffrey. H. 1. Far better an approximate answer to the right question, ‘which is often vague, than an exact answer to the ‘wrong question, which can always be made precise.” John W. Tukey, Ann, Math, Stat, vol. 33, p. 13, 1962. Regression Analysis: Concepts and Applications Franklin A. Graybill Colorado State Univesity Hariharan K. Iyer Colorado State University curren l Contents Preface ix Review of Basic Statistical Concepts and Matrices 1 Ll 12 13 14 1s 16 17 18 19 110 Overview 1 Basic Ingredients for Statistical Inference 3 Population 5 Model 10 Parameters (Summary Numbers) 14 ‘Samples and Inferences 20 Functional Notation 47 Matrices and Vectors 50 Multivariate Gaussian Populations 62 Exercises 70 Regression and Prediction 73 24 22 23 24 Overview 73 Prediction 73 Regression Analysis 82 Exercises 97 Contents warren d: Straight Line Regression 99 : 31 32 33 34 35 36 37 38 39 3.10 BAL Baz Overview 99 ‘An Example of Straight Line Regression 100 Straight Line Regression Model—Assumptions (A) and (B) 109 Point Estimation 112 Checking Assumptions 132 Confidence Intervals 161 Tests 171 Analysis of Variance 178 Coefficient of Determination and Coefficient of Correlation 181 Regression Analysis When There Are Measurement Errors 194 Regression Through the Origin 210 Exercises 214 Multiple Linear Regression 219 4a 42 43 4a 45 46 47 48 49 4.10 4a 42 Overview 219 Notation and Definitions 220 Assumptions for Multiple Linear Regression 232 Point Estimation 235 Residual Analysis 251 Confidence Intervals 262 Tests 278 Analysis of Variance 283, Comparison of Two Regression Functions (Nested Case) and Coefficients of Determination 291 ‘Comparing Two Multiple Regressions Models (Nonnested Case) 309 Lack-of-Fit Analysis 318 Exercises 335 Diagnostic Procedures 351 Sa 52 Overview’ 351 Outliers 352 cunrren OD cuneren | 53 54 55 56 Contents Leverages or Hat Values 365 Influential Observations—Cook's Distance and DFFITS 371 Ul-Conditioning and Multicollinearity 392 Exercises 399 Applications of Regression I 403 61 62 63 64 65 66 67 68 69 Overview 403 Prediction Intervals 403 Tolerance Intervals 416 Calibration and Regulation for Straight Line Regression 425 Comparison of Several Straight Line Regressions—Identical, Parallel, and Intersecting Lines 436 Intersection of Two Straight Line Regression Functions 450 ‘Maximum or Minimum of a Quadratic Regression Model 456 Linear Splines 465 Exercises 476 Applications of Regression II 501 1 12 13 14 15 16 Overview 501 ‘Subset Analysis and Variable Selection 501 All-Subsets Regression 504 Alternative Methods for Subset Selection 520 Growth Curves $51 Exercises 567 Alternate Assumptions for Regression 571 Bl 82 83 84 Overview 571 Straight Line Regression with Unequal Subpopulation Standard Deviations 571 ‘Straight Line Regression—Theil’s Method 584 Exercises 592 wii Contents cu artes § Nonlinear Regression 599 9.1 Overview 599 9.2 Some Commonly Used Families of Nonlinear Regression Functions 599 9.3. Statistical Assumptions and Inferences for Nonlinear Regression 605 9.4 Linearizable Models 615 9.5 Exercises 622 Appendix A: Answers to Selected Problems and Exercises 627 Appendix B: Bibliography 645 Appendix D: Data Sets 647 Table D-1: Car Data 649 Table D-2: Car2 Data 657 Table D-3: Grades Data 659 Table D-4: Plastic Data 669 Appendix T: Tables 679 ‘Table T- 1: Percentiles of a Standard Gaussian Population 681 : Percentiles of a Student's ¢ Population 682 Percentiles of a Chi-Square Population 683 ‘Table T-4: Student's t for m Simultaneous Confidence Intervals for m = 2,3,4,5,6 684 Table T-5: Percentiles of Snedecor’s F Population 689 Table T-6: Table for Obtaining Confidence Bounds for aoBo + 4:8; Using Theil’'s Method 692 ‘Table T-7: Charts for Confidence Bounds for the Simple Correlation Coefficient 693 Table T-8: Selected Percentiles of the Noncentral 1 695, Index 697 ll Overview Review of Basic Statistical Concepts and Matrices ‘The material in this chapter is generally taught in a first course in statistical meth- ods, which is prerequisite for studying this book. So, depending on the reader's background, this chapter may be treated as a quick review or may be studied in depth. Whatts Statistics? Statistics, in a narrow sense, is a branch of science that deals with making inferences ‘about populations based on samples. In a broader sense, statistics encompasses col- lection, organization, and summarization of data; presentation of data in tabular and ‘graphical form; developing models for the purpose of understanding random and nonrandom phenomena; use of models for prediction; mathematical approaches to decision making and evaluation of risks; and so forth. Regression (and correlation) analysis is an area of statistics that deals with methods for investigating the existence of associations and, if present, the nature ofthe associations, among various obsery- able quantities. For instance, one might be interested in knowing whether there is any association between age and blood pressure and, if so, what the nature of this association is. One might also be interested in expressing associations in the form of ‘mathematical equations. Regression analysis is one method of investigating the pres- ence of associations if appropriate data are available. The discovery of associations and the ability to express such associations in a precise mathematical form may en- able one to predict the unobservable value of a variable based on the observed value ‘of one or more associated or related variables. They may also help determine how ‘one might control the values of one variable by manipulating the values of a related variable, For instance, one might be able to manipulate or control one’s blood pres- sure by controlling one’s det ifthe nature of the association between blood pressure tnd the dietary components is understood. We do not make any statements regarding that elusive and controversial concept of eause and effect. Rather, we simply say that it may be possible to control the values of one variable by manipulating the 2 Chapter 1: Review of Basic Statistical Concepts and Matrices values of a related variable. This can be confirmed only by controlled experimental investigations. In this book we present the fundamental ideas that form the core of regression (and correlation) analysis, and we give numerous practical applications of the meth- ‘ods developed. Prorequisites We assume that you are familiar with the material that is usually taught in an in- troductory course in statistical methods, which includes concepts of populations, samples, point estimation, confidence interval estimation, tests, and Gaussian di tribution (also called normal distribution) models. These are reviewed briefly in Sections 1.2 through 1.6. We further assume that you are acquainted with functions and functional notation and elementary topics in matrix arithmetic. These topics are reviewed in Sections 1.7 and 1.8, respectively. ‘A brief introduction to the multivariate Gaussian distribution model (also re- ferred to as the multivariate normal distribution) appears in Section 1.9. This topic {snot usually taught in an introductory statistics course, but we feel that it is useful background material for studying regression analysis. Remarks on Computing Regression analysis requires a great deal of computing, and it will be helpful for Yyou to learn to use computers and statistical computing packages. However, since ‘ome of you will not have access to a computer or a suitable computing package, ‘we present most of the basic computations and require you to do only computations that can be done easily on a hand-held calculator. Thus you are not required to have knowledge of a computing language or a statistical computing package to read and ‘understand the material in this book. Nevertheless, we strongly urge that you use a computing package if one is available Practically every statistical computing package contains programs for regres- sion analyses. Some of the popular statistical packages are MINITAB, SAS, SPSS, BMDP, S-PLUS, etc. This textbook is accompanied by a laboratory manual that ex- plains how to use one of the statistical packages—MINITAB or SAS—to perform the calculations needed for regression analysis. The reason that we wrote a labora~ tory manual using MINITAB is that this system is very easy to learn and does not detract from the main topic, regression. Also, many colleges and universities use this package as a tool for teaching statistics. The reason that we wrote a laboratory ‘manual using SAS is that SAS is widely used by data analysts and researchers in colleges, universities, govemment, and industry. Also, many graduate students who study regression are already familiar with SAS and would like to analyze their own research using SAS. If computations are needed for the procedures discussed in any section of this book, the corresponding sections of the accompanying laboratory manual explain how to do the calculations in MINITAB or SAS. For example, in Section 3.6 we discuss confidence intervals for simple linear regression, and in Sec- tion 3.6 ofthe laboratory manual we show how to use MINITAB or SAS to perform 1.2 Basic Ingredients for Statistical Inference 1 the calculations needed for confidence intervals. If MINITAB or SAS is not avail- able, you can either ignore the laboratory manual or, if you have another statistical package available, you can work through the assignments in the laboratory manual using that package. Nothing that follows in this book requires knowledge of or the use of computers and statistical computing packages. 12 Basic Ingredients for Statistical Inference One of the aims of science is to relate, describe, and predict events in the world in which we live, These activities are also important in business and our everyday affairs. In almost every aspect of human endeavor, itis useful to be able to predict future events based on present and pas information. ‘To describe situations of interest, we must define them precisely and decide what ‘we want to determine or predict. To illustrate, we first use a very simple example. ‘You are handed a box that contains 10,000 marbles that are indistinguishable except for color—some are white and some are green—and you are asked to determine the proportion of green marbles in the box. You are not allowed to look in the box, but ‘you are allowed to select 20 marbles at random from the box and note their colors. (On the basis of this information (the colors of these 20 marbles), you must estimate ‘what proportion of the 10,000 marbles are green. Of course you do not expect that by examining only 20 marbles you will be able to determine exactly the proportion Of the entire 10,000 marbles that are green, but you would like to use scientific reasoning to come as close to the true answer as possible. The science of statistics and probability can be useful for making a decision in this situation and for attaching ‘a measure of uncertainty (or certainty) to your result. ‘The preceding example may seem trivial, but it contains the basic ingredients of the most complicated of problems. Mother Nature has her secrets “locked up in a ‘box” and, in order to make intelligent decisions about everyday activities, we may want to describe or predict the contents of that box. It may be possible to observe «a part of Mother Nature’s box (observe the colors of the 20 marbles) and make an intelligent decision as to its contents. ‘The Four Basic Ingredients From this example we abstract four fundamental concepts that will be useful in ex- mining more complicated situations: poputations, models, parameters, and sam- ples and inferences. These are summarized in Table 1.2.1. ‘We now consider a more realistic illustration, EXAMPLE 1.2.1 ‘Suppose a company that owns a chain of department stores is considering opening store in city A. There are many things the company wants to know about city A so that it can decide whether it will be profitable to build a store there. One item of interest to the company is last year’s average annual income per houschold in 4 Chapter 1: Review of Basic Statistical Concepts and Matrices raBre 1.2.1 Foor Fundamental Conceps Concept Description Population A population of items (sometimes also called a population of unit) i a specified set of items or units in which an investigator is interested. The population may be real ort may ‘be conceptual. I may bea population that existed in the past, or exists now, or wil exist in the future, or exists ony in one's imagination. A real population exists now end the entre popultion or any part of the population is available now and may be examined. A conceptual population is a population that snot available now and cannot be examined. ‘Typically, a conceptual population is one thet wil exist in the future or exists only in one's {magination. A population may remain constant, or it may change slowly with time, or it may change rapidly with time In the preceding lusration, the st of 10,000 marbles is {he population of tems. Is areal population since itis available fr study now. Also its «population that will remain constant, i, the colors ofthe marbles (white or gree) wil ot change with ime. Mode ‘A population model isa description of the quantities (numbers or atributes) of interest associated with each item in the population. In the illustration above, the attribute of interest is the color of the marbles and the description is that the marbles are indistinguishable except for color, wih some being white and others green. Parameter Parameters are summary numbers that characterize various aspects ofthe population and are usually the quantities of interest tothe investigator. In the preceding illustration, the ‘proportion p of marbles that are green, characterizes one aspect ofthe population and is the quantity of interest in the investigation Sample ‘A sample is a set of items selected from the population, and observations are made on this set of items. On the basis of this sample a decision is made about the values of the ‘parameters of interest, and this decision is usually accompanied by a measure of the ‘uncertainty inthe answer. In the preceding illustration, the set of 20 marbles isa sample ‘rom the population of 10,000 marbles. The proportion of green marbles inthe sample ‘may be used as an estimate of the proportion of all 10,000 marbles inthe box that are ‘green. Thus one uses the information contained inthe sample to make inferences about population characteristics of interest. the city. The four fundamental concepts outlined in Table 1.2.1 as they apply to this investigation are as follows: 1 The population of interest isthe set of houscholds in city A last year. A house- ‘hold must be precisely defined (which may not be an easy task!). For all practical ‘purposes, we can regard this population as being constant although families may ‘have moved in or out of the city during the course of the year. Consequently this isa real population that can be studied now. 2. The population model may be that the set of annual incomes of the households last year is a Gaussian population with mean x and standard deviation «both of which are unknown. 1a Population 13 Population § 3 The parameter of interest is y1, the average income of the households in the population, 4 Arandom sample of some specified number of households can be selected and last year's annwal income of each of the selected households recorded. Based ‘on these data, a decision can be made about the value of 11, the parameter of interest. = In the next four sections we discuss in detail each of the four basic ingredients: Population = Model = Parameters ‘Samples and inferences ‘A population of items is defined to be any set of items that one wants to study (refer to Table 1.2.1. In Example 1.3.1 we describe several populations of items. EXAMPLE 1.3.1 ‘The following are some examples of populations of items, ‘a The sctof all US. citizens who paid federal income tax last year. 1b The setof all automobiles that will be made by manufacturer A next year. € Thesetof all US. citizens who were diagnosed as having lung cancer three years ‘ago and are still alive today. ‘The set of all trees on a specified tre farm on July 1 two years hence. ‘The set of all farms inthe states of lowa and Nebraska last year. The set of all grocery stores in the U.S. next year. The set of all human beings who will be bom in Los Angeles, California, in the year 2000. 1h The set of all ball bearings that will be made in a specified production plant next month. 1 The set of all plastic containers that a specified manufacturer may make next ‘year, using every possible process temperature between 300° F and 400° F. J The set of all automobile tires that will be manufactured by a certain company next year using a newly developed tread design. k The set of all daily record sheets that will be maintained by a particular power plant, containing information including the amount of sulfur-dioxide emitted and the amount of electrical power generated for each day that the power plant will be in operation next year. maoe 8 Chapter 1: Review of Basi Statistical Concepts and Matrices 1 The set of all daily data records that will be kept by a particular monitoring station, containing information about the total daly flow values at a given mon- itoring point, and the total daily precipitation recorded at a certain gauge, for a specified river, for next year. 1m The set ofall measured values of the length ofa single bolt ‘The set of all measured values ofthe breaking strength of a given metal rod. The set of items to be studied, called the population of items, must be precisely defined. The number of items in a population can be finite or infinite; if this number is finite we denote it by N. This number NV is generally unknown, but in most ‘practical problems it is large. Notice thatthe dates are specified for each of the populations of items described in Example 1.3.1; otherwise the set of items would not be precisely defined. For instance, in Example 1.3.1(@) what was a farm last year may not be a farm this year, in (fa store that isa grocery store next year may not be a grocery store two years hence, ec. Note InExample 1.3.1 the populations in (a)-(h) and ())-() ae all finite, whereas in ), (a), and (n) they may be infinite. The populations described in (a), (c), and (e) are real populations, and we can study these populations now. The populations described in (b), (@, (B, (@)s (), (). (0, and (D are conceptual populations since they are future populations. They do not exist at the present time and cannot be observed ‘now. The populations described in (i), (m), and (n) are also conceptual populations. In fact, they are imagined populations; the entire population will never become available. In the population described in (), it is impossible for the manufacturer to actually manufacture plastic containers using every possible process temperature between 300° F and 400° F. But we can certainly imagine this population and ask questions about it. For instance, we may want to know what process temperature ‘would lead to production of plastic containers having the required strength. For the population described in (m), it is possible to examine part ofthe population, viz., the next several measured values, but the entire population can only be imagined and will never become available. The population described in (n) is also an imagined ‘Population. The very act of measurement will destroy the given metal rod and so only one value can be observed from this imagined population. Nevertheless, we ‘may want to make inferences about these conceptual populations so that decisions ‘can be made or actions can be taken now. Each item in a population of items possesses one or more characteristics that ‘may be of interest in an investigation, as you can see in Example 1.3.2. EXAMPLE 1.3.1 In Example 1.3.1(2) we may be interested in the age of each person, the I. of each person, the weight of each person, the sex of each person, the politcal afliation of each person, the marital status of each person, etc. In Example 1.3.1(b) we may be interested in studying the miles per gallon each automobile will get, the first-year 413 Population 1 ‘maintenance cost for each automobile, the number of miles each car will be driven the first year after its purchase, etc. In Example 1.3.1(d) we may be interested in the age of each tree, the diameter of each tree measured 4.5 feet from the ground, the height of each tree, etc. In Example 1.3.1(g) we may be interested in the number of years each person will live, the number of dollars each person will spend on ‘education, the number of siblings each person will have, etc. = Associated with each item ina population are one or more numbers (age, height, income, etc.) or attributes (sex, marital status, political affiliation, etc.) of interest. Inthis book we are mainly concemed with numerical quantities associated with each ‘population item. This set of numbers is called a population of numbers and will be referred to as the population. Ifa single number is associated with each item in the population, then we say that this population of numbers is a univariate population. If more than one number is associated with each item in the population, then we say that this population of numbers is a multivariate population. A bivariate pop- ulation is a special case of a multivariate population, where each population item has two numbers of interest associated with it. A trivariate population isa special case of a multivariate population, where each population item has three numbers of interest associated with it. More generally, if each item has k numbers of interest as- sociated with it, we refer to the population of numbers as a k-variate ora k-variable poputstion. In Example 1.3.3 we list specific sets of numbers (populations) that can be associated with each population of items listed in Example 1.3.1. EXAMPLE 1.3.3 ‘The following are populations (of numbers): The amount of money in dollars that each U.S. citizen earned as intrest income last year b_ The maintenance cost in dollars and the number of miles each car will be driven during its first year The age of each person and the average number of cigaretes each one smoked per day forthe five yeas prior to diagnosis of lung cancer. The eight, the diameter measured at 45 feet above ground level, and the dollar value ofeach tree, on July 1 two years hence, ‘The sizeof each farm in acres and the profit each farm made. ‘The profit in dollars, that each grocery store will make. The number of dollars each person will spend on health care and the number of years each person will live. 1h The diameter of each ball bearing. 1 The strength of each plastic container and the temperature at which itis made. The set ofall wead-depths remaining atthe end of 50,000 mils of driving ofall the automobile tres that will be manufactured by the company next year 8 Chapter 1: Review of Basic Statistical Concepts and Matrices Kk The set of al total daily sulfur dioxide emission values and the amount of elec- trical power generated for each day of next year. 1 The set ofall total daily low values and corresponding daily precipitation values ‘that will be recorded by the monitoring station next year. 1m The set of all measured values of the length of the bolt. The set of all measured values of the breaking strength of the rod. = Note that in Example 1.3.3 (a), (0, (h),(),(m), and (a) are examples of univariate ‘populations, (6), (€) (€),(@).(i (K), and (Dare examples of two-variable (bivariate) populations, and (d) is an example of a 3-variable (trvariate) population. ing of an investigation, a target population of items is defined and one ‘or more numbers associated with each item are identified as being of interest. This resulting population (of numbers) is what an investigator is interested in If a population is a univariate population consisting of N’ numbers, this popu- lation of numbers is written symbolically as {Y,, ¥>,-..» Yh where Y; represents the /th number (the number associated with the Ith item) inthe population. Alterna- tively, we may write (Y} without specifying the number of items in the population. If the population is a bivariate population consisting of NV pairs of numbers, this population is written symbolically as {(Yy_ X,)s (Yas Xq)s.--» yy» Xy)} (or simply as {(Y, X)}) where ¥, refers tothe first quantity and X; refers to the second quantity associated with the th item. ‘Sometimes a double subscript notation is used, asin {(Y,. ¥ya)s (Fas Yaq)s +--+ yy» Yya))- Here Yj, stands for the Jth quantity of interest associated with the [th popllation item. When working with multivariate populations, the double subscript notation is almost always used. For instance, a k-variable population may be denoted Dy (pas Vazy ones Yyds eves Cys Faas --++ Fad) oF simply by ((f,,Yq.---¥,)} Other symbols such as X or Z may be used instead of ¥ when convenient. ‘The population of numbers that is of interest to the investigator is called the target population. In some situations, the target population isa real population and is available for study. This is the case for the populations (a), (c), and (e) in Exam- ple 1.3.3. But, in other situations, the target population of interest is a conceptual ‘population and is not available for study, as in the case of populations (b), (), (f), ©), @, @, (0, ©, (a), and (a) of Example 1.3.3. Even when the target popula- tion isa real population, it may sometimes be impractical or too expensive to study this population, and thus itis unavailable for study. When the target population is ‘unavailable for study, one is sometimes able to find another population that resem- bies the target population and isa real population tha is available for study now. ‘Such a population is called a study population. We discuss these in some detail next. 1.3 Population § Target Population and Study Population ‘Target Population ‘The target population is the population that is the target of a study. Conclusions from an investigation are to be applied to this population. As stated earlier, the target population is often unavailable for study. In many investigations, the numbers in ‘the target population are future values, but we want to make decisions about them ‘now. Example 1.3.3(d) is a good illustration. Suppose we are interested in what the dollar value of each tree will be on July 1 two years hence, but we must make this determination now. The target population in this case consists of the dollar values Of each of the trees on July 1 two years hence, 60 this population is unavailable now. Therefore it is not possible to study the target population now. This is also the situation in Example 1.3.3(b), where we want to predict the average first-year ‘maintenance cost of cars that will be produced by manufacturer A next year. We ‘must make the prediction atthe beginning of the year so we can plan a maintenance ‘budget. But the target population is a future population and is not available for study now. In fact it is quite common for a target population, which is the population of interest, to be unavailable for study, either because it consists of future values or because it is impractical, inconvenient, or too expensive to study this population. ‘Thus we are led to consider another population, the study population. ‘Study Population ‘When the target population is unavailable for study, we are sometimes able to find ‘another population that resembles the target population and is available for study ‘now. Such a population is called a study population. Thus the study population is the population thet is actually studied during an investigation. In those situations ‘where the target population itself is available for study, the target population and the study population are one and the same. When the target population is unavailable for study, the study population should be chosen to resemble the target population as closely as possible. ‘To illustrate, consider Example 1.3.3(d). The target population is in the future, 60 itis not available for study now. The study population can be defined as the set of dollar values of all trees as of the present date. This is a real population that is available for study now. In Example 1.3.3(b) we can use as the study population the first-year maintenance costs for the same make of automobiles last year. In Example 1.3.36), no suitable study population may be available. In this case the investigator can produce plastic containers in a pilot plant under several different ‘process temperatures between 300° F and 400° F and measure the strengths of these containers. The data thus generated are all that is available and thi collection of pairs of numbers, ie, the strength of a plastic container and the temperature at which it ‘was made, can be considered the study population. 14 Model W Chapter 1: Review of Basic Statistical Concepts and Matrices Methods are available for making valid statistical inferences about the study population. When the study population is different from the target population, ‘generalizations from the study population to the target population are subject- ‘matter considerations, and we have to rely on the investigator's judgments. All statistical inference procedures discussed in this book refer to the study population. A univariate population to be studied consists of NV numbers, where N is generally quite large. In Example 1.3.3(@), for instance, the N’ numbers are the amounts in dollars, eared as interest income by each of the NV citizens; in Example 1.3.3(0, the 'N numbers are the profits, in dollar, each grocery store will make next year. To study a population of A’ numbers one must have some way of organizing these N’ numbers. One useful approach is to organize them into a probability his- togram and find @ mathematical function that approximates this histogram. Such a ‘mathematical function is usually called a probability density function. Theoretical statisticians use a variety of probability density functions to study different types of populations, the most important among them being Gaussian probability den- sity functions. They are the subject of our discussion in this section. The Gaussian population model for multivariate populations is discussed in Section 1.9. Gaussian Populations AA theoretical Gaussian population (Y} is completely specified by its mean, denoted by iy, and its standard deviation, denoted by oy. The distribution of this population is described by the probability density function b (a 1). Then there are k ‘quantities associated with each population item, resulting in Nk numbers in all. Each Of the k quantities associated with the population items gives rise to a univariate population. Let the & quantities for item 7 be denoted by Xj1,...»X,. Then the numbers X,),...,Xyy form a univariate population with mean j2, and standard de- viation 0, the numbers X;9,......Xyq form another univariate population with mean 4 and standard deviation o, etc. Thus 1, .., sy, and 0}, ..., 0, are parameters associated with the k-variate population. The Nk numbers may be schematically represented as in Table 1.5.1. Cc 15 Parameters (Summary Numbers) 1 tase l.§.] ‘Schematic Representation ofa &-Variate Population of Size N Tome 2 1 Nd Xu | Xya | oe | Xue La wy | oe | | me Standard deviaion | o, | c, | | For an illustration, consider Example 1.3.3(@). Let X;,, jp, and X75 denote the height, the diameter at 4.5 feet above ground level, and the dollar value of the th tree in the population. We thus have a three-variable population, and 2. #1, and + represent the mean height, the mean diameter, and the mean dollar value of the trees in the population. Likewise, 0}, da, and o represent the standard deviations of the heights the diameters, and the dollar valves, respectively, of the trees in the population. Cosficien of Correlation When dealing with multivariate populations there is often a need to summarize associations among various quantities measured on the same item. For instance, how can we summarize the association that may exist between the height of a tree and the diameter of that tree at 4.5 feet above ground level? How can we summarize the association or relationship between the number of miles a car is driven per year and the yearly maintenance cost for the car? One summary measure that is sometimes used for this purpose is the cvefficient of correlation between two variables (also called the Pearson correlation or the product moment correlation). This measure of association is denoted by py x (p is the Greek letter rho) when summarizing the relationship between the variabies ¥ and X. It is defined by (15a) fora bivariate population {(Y, X)} of V items. Note that (1.5.1) implies py x = py yi i.e. the coefficient of correlation between ¥ and X is the same as the coéfficient of correlation between X and ¥. Itcan be shown that py x is anumber between —1 and Chapter 1: Review of Basic Statistical Concepts and Matrices +1. It is equal to +1 when the population values (¥;, X,) all lie on a straight line ‘that has a positive slope, and itis equal to —1 when all the population values lie on a straight line with a negative slope. Figures 1.5.1-1.5.3 are scatterplots of bivariate populations {(Y, X)} consisting of $00 items, each with a different value of py, x. Figure 1.81 prx=0 Observe that a positive value of the coefficient of correlation between ¥ and X indicates that, generally speaking, larger values of ¥ are associated with larger values. of X, and smaller values of Y are associated with smaller values of X. Likewise, when the correlation coefficient is negative, we find that larger values of ¥ are associated with smaller values of X, and smaller values of Y are associated with larger values of X. Notice that a general lack of linear association is indicated when ‘the magnitude of the correlation coefficient is close to zero, Whereas correlation coefcients may have useful interpretations for some problems (particularly when Y and X are approximately linearly related), they may provide no useful interpretation and, in fact, be a misleading summary quantity in other problems. In summary, when dealing with a k-variate population, say {(X),.-..X,)}+ the basic summary quantities or parameters that are often used are the means, My.-+++ gs and standard deviations, o1,...,,, of the & univariate populations, along with the (5) = k(t—1)/2 coefficients of comelation, py x. yx, Px, x,» Between the k(k — 1)/2 pairs of variables (X;. Xp). (Xy.X3)s-o++ 1.5 Parameters (Summary Numbers) 9 Fioure 1.52 prx=09 1h UW Chapter 1: Review of Basic Statistical Concepts and Matrices — FIGurReE 1.5.9 Samples and Inferences AAs stated earlier, an investigator must first define the study population. Whenever possible the target population itself should be the study population. If this is not possible, then the study population should resemble the target population as closely 1s possible. After the study population is defined, it is described with a model, and parameters that are needed in order to make decisions are identified. The next step is to determine the values of these parameters. Since the population is never completely known in a real situation, an investigator can never know the parameter values exactly. A commonly used procedure isto select a subset of the items, referred to as a sample, from the study population and to use the measurements associated with these sample items to infer the values of population parameters of interest. If the sample is selected from the population using one of several random sampling procedures, then itis possible to assign a measure of uncertainty to the conclusions

You might also like