COURSE GUIDEBOOK

Business Statistics
Part II

Professor George T. Geis
University of California at Los Angeles

The Teaching Company
1-800-TEACH-12

Table of Contents

Business Statistics Part II
Professor Biography
Purpose of Course
Lecture Nine: Sampling Distributions and Estimators
Lecture Ten: The Central Limit Theorem
Lecture Eleven: Confidence Intervals
Lecture Twelve: Confidence Intervals for Other Parameters
Lecture Thirteen: Hypothesis Testing
Lecture Fourteen: Simple Linear Regression
Lecture Fifteen: The Validity and Usefulness of a Regression
Lecture Sixteen: Introduction to Multiple Regression
Answers
Bibliography

©1997 The Teaching Company Limited Partnership

George T. Geis, Ph.D.
Anderson Graduate School of Management
University of California at Los Angeles

George T. Geis was born in Chicago, IL in 1944. He received a B.S. "summa cum laude" with "Honors in Mathematics" from Purdue University in 1966. Dr. Geis earned his Ph.D. in 1977 at the University of Southern California and his MBA from University of California, Los Angeles in 1981. Dr. Geis was a National Science Foundation and Woodrow Wilson Honorary Fellow. In the field of Finance, he has been honored with the Financial Executives Institute Award for outstanding achievement. During his teaching career as an Adjunct Professor at the Anderson Graduate School of Management at UCLA, Professor Geis has been voted outstanding teacher three times. His academic experiences include serving as Research Coordinator at the Center for Human Resource Management.
Presently, he is serving as a member of the faculty advisory board for the Entrepreneurial Studies program at UCLA. Geis is also an author. He has published dozens of professional articles and five books. His books include: Desktop Computing and the Essence of Management (Prentice-Hall, 1990) and Micromanaging (Prentice-Hall, 1987). Currently, he is researching the application of computer technology to visually represent dynamics in converging technology and communication markets and the use of interactive media in illustrating statistical analysis.

He has extensive consulting experience and is a frequent lecturer on emerging trends in the computer, communications and media markets. In his spare time Professor Geis plays three-on-three basketball, struggles to lower his golf handicap, and paints his seven-color Victorian-style home in Pasadena, California.

Business Statistics

Purpose of the Course

In our tightly wired world, business executives make decisions under pressure. Almost always, these decisions must be made with less than complete information. This course is about how to effectively use data that is currently available (or can be obtained within a reasonable time frame and cost) to improve business decision making.

We will use business examples from functional areas such as finance, marketing, human resources, and operations to illustrate the role of data analysis in decision making. This course is not designed to be a dry sleepy-time set of abstract, mathematical lectures. My goal is to make statistics come alive in the context of life and in the context of real business problems demanding solution.

Quantitative methods such as statistical analysis must not be viewed as the be-all and end-all of decision making. The vital role that seasoned business intuition plays in effective decision making cannot be overemphasized. Nevertheless, analytical techniques are a central part of many decisions.
In fact, we illustrate in this course how statistics and probability can effectively work together with managerial intuition in business problem solving.

The advent of personal computer statistical software that readily generates visual representations of data and performs sophisticated analyses enables a manager to concentrate on the meaning of data. The burden of computation has largely been eliminated, and business people are now free to focus on probing issues and searching for creative solutions. In this course, we illustrate the use of computer-generated output that promotes visualization of data.

Students tell me that statistics was obscure and inaccessible for them as undergraduates. On the first day of class, they enter my MBA course on Statistics and Data Analysis prepared for the worst. Fortunately, I am often able to help them build intuition for statistics, appreciate how the content can be applied and actually enjoy the experience.

Whatever previous experience you have had with statistics (if any), our main objective will be to make the content useful to you in business decision-making and relevant to decisions we all make in everyday life.

In addition to questions at the end of each lecture, problems have been provided where relevant. For your convenience, answers are available at the end of this outline.

Statistical Software Credits

For further information on Crystal Ball Software please contact
Decisioneering Inc.
1.800.289.2550
1.303.337.3560 (f)

JMP-IN 3 for Windows is available from
Duxbury Press, An International Thomson Publishing Company
Belmont, CA
1-800-876-2350

Images Copyrighted New Visions Technologies Inc.

All rights reserved. No part of this book may be reproduced in any manner whatsoever without written permission except in the case of brief quotations embodied in critical articles and reviews.
For information, send complete description of intended use to The Teaching Company/Rights and Permissions, 7405 Alban Ct. Suite B-215, Springfield, VA 22150, USA.

Lecture Nine
Sampling Distributions and Estimators

Scope: What are the benefits of random sampling in business analysis and decision-making? What is a sampling distribution and why is it important? What is a simple random sample and how do you select one? The issue of whether or not a sample is representative of the population is a central problem addressed by statistics. Given the data contained in the sample, what conclusions can be drawn about the population?

Outline

I. Inferences about the population at large are drawn from information found in the sample. In order to draw inferences from the sample to the population, the sample should be random. A random sample is a sample drawn so that each member of the population has an equal and independent chance of being included in the sample. Since most of the time it is impossible to survey the entire population, there are many benefits of random sampling.
   A. Random sampling ensures a representative sample.
   B. Random sampling can describe how your results differ from the population's.
II. Statistics versus parameters
   A. A sample statistic is computed from your sample data. It is random and known.
   B. A population parameter is computed for the population as a whole. It is fixed and usually unknown.
III. An unbiased estimator is a sample statistic that is neither systematically too high nor too low compared to the population parameter. If many samples are randomly selected from a population, and a sample mean \(\bar{X}\) calculated for each sample, on average \(\bar{X}\) will equal \(\mu\). It is important to avoid bias in our estimation.
IV.
A sampling distribution lists, for each value of the statistic, the proportion of all possible samples with this value.
   A. For example, we survey Lafayette, Indiana and find that 21% say that they will try our new concept product, Virginia Chicken. Our sample statistic is 21%.
   B. Each time we survey a group in the population, our sample statistic would vary.
   C. Depicting a sampling distribution as a histogram of statistics.
V. A simple random sample must meet both of the following criteria:
   A. Each unit has an equal probability of being chosen. This may mean that a telephone survey has some bias toward those who have a telephone and answer the phone. Is this group of people representative of the city as a whole? Or do they tend to be different?
   B. Units are chosen independently of each other. For example, zip codes may be chosen at random. But by selecting people based on their zip code, people who share a certain socioeconomic status may be overrepresented.
VI. Selecting a simple random sample can be done by creating a frame (a list of every member of the population). Then, by using either a random number chart or a random number generator in a spreadsheet, individuals may be randomly and independently selected.

Questions for Lecture Nine
1. What are two major benefits you derive from taking a random sample?
2. Define population parameter.
3. Define sample statistic.
4. True or False. The population parameter is fixed and known.
5. True or False. The sample statistic is random and unknown.
6. Put in your own words what we mean by sampling distribution.
7. In order to obtain a simple random sample, is it enough that all units have an equal chance of being chosen? Explain.

Essential Reading for Lecture Nine
Aczel, Complete Business Statistics, Chapter 5, Irwin, Third Edition, 1996.

Recommended Reading for Lecture Nine
Cochran, Sampling Techniques, Wiley, 1973.
Hanke and Reitsch, Understanding Business Statistics, Chapter 7, Irwin, 1994.

Lecture Ten
The Central Limit Theorem

Scope: The central limit theorem provides us with one of the most important results in statistics. What is the central limit theorem, and how is it useful in business analysis? How does it help us work with sampling distributions for statistics such as the sample mean and sample proportion?

Outline

I. The normal distribution is a mathematical model which can be used to represent collected data. The major value of the normal distribution lies in its ability to serve as a reasonably good model of many phenomena.
   A. Many data sets in business follow a normal pattern.
   B. Even when a distribution is not normal, the distribution of an average or a sum of numbers from this distribution will be close to normal if \(n\) is large enough.
II. The central limit theorem states that as the sample size increases, the shape of the sampling distribution of means becomes increasingly like the normal one. This means that even if the data set we are working with is not normally distributed, the distributions of the means or sums of the data will be approximately normal if our sample is large enough. Refer to the Central Limit Theorem section below.
III. The standard deviation of the sampling distribution of the statistic is known as the standard error of the statistic.
   A. For example, the standard deviation of the scatter in the mean is called the standard error of the mean.
   B. Distinguish variability in elementary units from variability in summary numbers (statistics).
   C. The central limit theorem helps us understand the population parameter, \(\mu\), whether \(\mu\) refers to average customer age or salary, et cetera.
IV. A sample is drawn randomly from a population, and the sample mean, \(\bar{X}\), is calculated. Or we can draw a sample mean from a sampling distribution of means.
This sample mean is interpreted as an estimate of the true population mean, \(\mu\). Nevertheless, the estimate may be in error, since the estimate is based on sample data. The magnitude of the error in estimating \(\mu\) from \(\bar{X}\) depends on how much sample means differ from one another on repeated sampling from the population. This is the standard error of the mean, \(\sigma_{\bar{X}}\). The smaller the standard error, the smaller is the error likely to arise in estimating \(\mu\) from \(\bar{X}\).
   A. When \(n = 1\), the sampling distribution of the sample mean is identical to the population distribution. When \(n > 1\), the standard error of the mean gets smaller.
   B. As the sample size gets large, the standard error of the mean gets smaller. This is another application of the central limit theorem.
   C. Take information from a sample group: \(n = 100\), \(\bar{X} = \$20\), \(\sigma = \$8.00\). Then
      \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{8.00}{\sqrt{100}} = 0.80\)
      This allows us to say something about \(\mu\). I'm using $20 to estimate \(\mu\). My standard error of the mean is $0.80. This is a much tighter distribution than the standard deviation of the raw data, $8.00.
V. The sampling distribution for the sample proportion is related to the binomial distribution and, by application of the central limit theorem, approaches the normal distribution.
   A. The sampling distribution for the sample proportion is related to the binomial distribution. The number of successes \(x\) follows the binomial distribution with parameters \(n\) and \(p\), where \(n\) is the sample size and \(p\) is the population proportion, and the sample proportion is \(\hat{p} = x/n\).
   B. As the sample size increases, the central limit theorem applies. So the sampling distribution of the sample proportion approaches a normal distribution as \(n\) gets large. As a rule of thumb, we can use the normal approximation if \(np(1-p) > 5\).
VI. Business applications using the central limit theorem
   A. You survey some of your customers to determine if sales will go up if you cut prices. Here \(\hat{p} = x/n\) with \(n = 35\), so \(\hat{p} = x/35\). Since \(x\) follows a binomial distribution, the sampling distribution of \(\hat{p}\) approaches a normal distribution with mean \(p\) and standard deviation \(\sqrt{p(1-p)/n}\).
   B. You survey 400 voters on an upcoming ballot initiative. You assume \(p = 0.5\). You survey and find out that \(\hat{p} = 0.425\). The distribution of \(\hat{p}\) will be a normal curve with a mean of 0.5 and a standard deviation of \(\sqrt{(0.5)(0.5)/400} = 0.025\). This means that I am three standard errors below my proposed \(p\). So this would be evidence that the population parameter is probably not 0.5.

Central Limit Theorem
* Central limit theorem: for a data set of n independent observations of a random variable representing a population
  - for both the average and the sum, the distribution becomes more and more normal as n gets large
  - the mean and standard deviation of the ...

III. Construct the confidence interval for the sample proportion: another real estate example. In this case the confidence interval is again given by the point estimate ± the z multiplier times the standard error estimate.
   A. What percent of my client base has previously owned a home? In a survey of 100 clients, 60 have been previous homeowners. Construct a 99% confidence interval.
   B. The point estimate is \(\hat{p} = 0.60\). The standard error estimate is
      \(s_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{(0.6)(0.4)}{100}} = \sqrt{\frac{0.24}{100}} \approx 0.049\)
   C. Using 2.576 as the z multiplier, a 99% confidence interval of 0.60 ± 0.126 can be established for the proportion having previously owned a home. This means that we can be 99% confident that 47.4% to 72.6% of our clients have been previous homeowners.
IV. Confidence intervals are valid only if certain requirements are observed.
   A. Be sure the data set is a random sample from the population of interest. For example, it is impossible to sample the future.
   B. Be sure the quantity being measured is normally distributed. This is not a rigid requirement, since the central limit theorem tells us that means and other summary measures are approximately normally distributed when the sample is large enough.

Questions for Lecture Twelve
1. Explain what is meant by \(1 - \alpha\), the level of confidence.
2. For a large scale sample, what does \(1 - \alpha\) typically depict in relation to a normal curve?
3. For a large scale sample, what does \(\alpha/2\) typically depict in relation to a normal curve?
4. In reporting on an election poll, a newswoman states that 52% ± 3% of the electorate say they will vote for a given candidate. Is this a confidence interval, and if so, what parameter is being estimated?
5. True or False. In order to construct a valid confidence interval, the data set utilized must be a random sample from the population of interest.

Problems for Lecture Twelve
A new pizza topping is being tested in your supermarket. A sample of 500 shoppers try the product and 240 say that they like it.
1. What is the sample statistic for the proportion of shoppers that like the topping?
2. Construct a 90% confidence interval for the percentage of shoppers that like the topping.
3. Construct a 95% confidence interval for this percent.
4. Interpret in your own words what the 95% confidence interval means.

Essential Reading for Lecture Twelve
Aczel, Complete Business Statistics, Chapter 6, Irwin, Third Edition, 1996.

Recommended Reading for Lecture Twelve
Hanke and Reitsch, Understanding Business Statistics, Chapter 8, Irwin, 1994.

Lecture Thirteen
Hypothesis Testing

Scope: In this lecture we explore the use of hypothesis testing in business. In a business situation our data is limited to a sample of reality. Statistical techniques can test how large a part chance plays in the results reflected by the designated sample. In designing a hypothesis test, we intend to determine whether or not a claim, such as response rate from an advertising campaign, should be allowed to stand. We will examine the steps in conducting a hypothesis test.

Outline

I. Assume that the experimental results reflect only the random variation caused by chance. This assumption is called the null hypothesis.
The object of our research is to be able to reject or fail to reject the null hypothesis.
II. Stating the null and alternative hypotheses
   A. The null hypothesis can be viewed as the status quo; it is valid until proven otherwise. It is usually denoted by \(H_0\).
   B. The alternative hypothesis is the competing theory which you are trying to establish. The alternative hypothesis bears the burden of proof. It is usually denoted by \(H_1\).
   The task of hypothesis testing is to reject the null hypothesis or fail to reject the null hypothesis.
III. Errors in hypothesis testing
   A. Type I error: rejecting the null hypothesis when it is true, also known as an alpha error.
   B. Type II error: failing to reject the null hypothesis when it is false, also known as a beta error.
   C. Examples: the ding letters and true love
      1. \(H_0\): You should be hired. \(H_1\): You should be dinged.

                              Correct decision
         Company decision  |  hire           |  ding
         hire              |  correct        |  Type II error
         ding              |  Type I error   |  correct

      2. \(H_0\): You should pursue this romantic relationship. \(H_1\): You should not pursue this romantic relationship.

                                 Truth: what you should do
         What you decide to do |  pursue                                         |  not pursue
         pursue                |  correct: True love                             |  Type II error: Looking for love in all the wrong places
         not pursue            |  Type I error: Golden chances pass me by        |  correct: Thank God for unanswered prayer

IV. A two-tailed test is used when the difference between the population parameter and a sample statistic is non-directional. The statistic could be very large or very small. When the direction of difference between the population mean and a particular value is specified, the alternative hypothesis is directional, or one-tailed. In a one-tailed test, consider under what circumstances to take action. This will determine the alternative hypothesis.
   A. Use a right-hand-tailed test to take action if a parameter is greater than some value, since the alternative hypothesis will state that the parameter is greater than some value.
   B.
Use a left-hand-tailed test to take action if a parameter is less than some value, since the alternative hypothesis will state that the parameter is less than some value.
V. The steps involved in hypothesis testing
   A. Set up the null and alternative hypotheses.
   B. Choose \(\alpha\), the level of significance.
   C. Define the test statistic, for example \(z\).
   D. Define a rejection region. In this region, the value of the test statistic results in rejecting the null hypothesis.
   E. Calculate the value of the test statistic and carry out the test.
   F. State a conclusion for the original question.
VI. A hypothesis test can be used to test product quality claims. Suppose you produce a professorial punching bag with the claim that it's good for 400 punches. Check out the claim using hypothesis testing as outlined above.
   A. \(H_0: \mu = 400\); \(H_1: \mu \neq 400\). Test 100 punching bags: \(n = 100\), \(\bar{X} = 420\), \(s = 50\).
   B. \(\alpha = 0.05\)
   C. \(z = \frac{\bar{X} - \mu_0}{s_{\bar{X}}} = \frac{\bar{X} - 400}{s/\sqrt{n}} = \frac{\bar{X} - 400}{5}\)
   D. Reject \(H_0\) if \(z > 1.96\) or \(z < -1.96\).
   E. \(z = \frac{420 - 400}{5} = 4\)
   F. Since the z-value is so extreme, we reject the null hypothesis. If the null hypothesis were true, the chance of a result this extreme would be less than 5%.

Questions for Lecture Thirteen
1. What is the null hypothesis of a test?
2. How does the alternative hypothesis relate to the null?
3. Explain what is meant by Type I error.
4. What is a Type II error?
5. When would you use a hypothesis test as opposed to simply constructing a confidence interval?

Problems for Lecture Thirteen
Suppose you manufacture small packages of tissue paper and want to know how many tissues should be put in your package. You decide to test the industry wisdom that the average person uses 40 tissues during a cold. You conduct a random sample of 100 customers with a cold and find the average customer uses 35 tissues with a standard deviation of 25. You set \(\alpha\) at 5%.
1. Write the null and alternative hypotheses for your test.
2. What is the test statistic you will use?
3.
Define the rejection region for the null hypothesis.
4. Calculate the value of the test statistic.
5. Should the null hypothesis be rejected? Explain.

Essential Reading for Lecture Thirteen
Aczel, Complete Business Statistics, Chapter 7, Irwin, Third Edition, 1996.

Recommended Reading for Lecture Thirteen
Hanke and Reitsch, Understanding Business Statistics, Chapter 9, Irwin, 1994.

Lecture Fourteen
Simple Linear Regression

Scope: Linear regression is a method for modeling the relationship between two variables, such as advertising and sales or training and job performance. Regression is a widely used technique and often provides a useful mathematical formulation of a real-world situation. This lecture will explore the basics of simple linear regression.

Outline

I. Regression and modeling
   A. Simple linear regression involves two variables, x (independent) and y (dependent), assumed to have a straight-line relationship.
   B. Linear regression is one of the most widely used statistical techniques in describing the relationship between two variables such as advertising and sales, training and job performance.
   C. A good model captures and extracts the systematic behavior of the data, leaving out factors that are nonsystematic and cannot be foreseen, namely random error.
II. The purpose of simple linear regression is to provide a best model for a straight-line relationship between two variables.
   A. Simple linear regression assumes an intercept parameter and a slope parameter: \(y = \beta_0 + \beta_1 x + \varepsilon\), where \(\beta_0\) is the intercept parameter, \(\beta_1\) is the slope parameter, and \(\varepsilon\) represents random error.
      1. The intercept parameter provides the value of the dependent variable when the independent variable is equal to 0.
      2. A positive slope parameter will occur when increasing values of the independent variable are associated with increasing values of the dependent variable.
      3.
A negative slope parameter will occur when increasing values of the independent variable are associated with decreasing values of the dependent variable.
   B. The method used to estimate the regression parameters is called least squares. This technique minimizes the sum of the squared errors.
   C. The MSE (Mean Square Error) is used in estimating error variance. The smaller the error variance, the closer the points are to the line. If the error variance is too large when using simple linear regression, then it is more difficult to make accurate and meaningful forecast predictions. Error variance can be estimated as
      \(s^2 = MSE = \frac{SSE}{n - 2}\)
   D. Consider the following example concerning the relationship between housing square footage and sales price.

         x (ft²)    y (price)
         1500       25K
         200        230K
         1800       290K
         3000       340K

      [Figure: scatter plot of sales price against square footage]

      Possible regression line: \(y = \$50{,}000 + 80x\). The intercept is $50,000 and the slope is 80.
III. Correlation must be distinguished from regression.
   A. When we do correlation analysis, we assume that both x and y are random variables. With regression, we assume that x is fixed. The correlation between x and y is a measure of the degree of linear association between the two variables.
   B. The sample correlation is denoted by r and can take values from -1 to +1. With 0 correlation, there will be little if any linear association between the two variables, for example shoe size and eye color. \(R^2\), the coefficient of determination, is the square of the correlation for simple linear regression and has a special meaning in regression analysis.
   C. Correlation is a measure of how closely two variables stick together in a straight-line relationship. Neither variable is designated as dependent on the other. In regression analysis, one variable is independent and one is dependent.
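
The least squares fit, the MSE, and the sample correlation described in this lecture can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own software output: the function name fit_line and the four data points are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
import math

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1*x.

    Returns (b0, b1, r, mse): intercept estimate, slope estimate,
    sample correlation, and mean square error SSE / (n - 2).
    Assumes at least three points and some variation in both x and y.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                    # slope that minimizes squared error
    b0 = y_bar - b1 * x_bar           # line passes through (x_bar, y_bar)
    r = sxy / math.sqrt(sxx * syy)    # sample correlation, between -1 and +1
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)               # estimate of the error variance
    return b0, b1, r, mse

# Hypothetical data, NOT taken from the guidebook's housing table.
square_feet = [1500, 2000, 2500, 3000]
price = [175_000, 205_000, 255_000, 285_000]

b0, b1, r, mse = fit_line(square_feet, price)
print(f"intercept = {b0:.0f}, slope = {b1:.1f}")
print(f"r = {r:.3f}, R^2 = {r * r:.3f}, root MSE = {math.sqrt(mse):.0f}")
```

In this sketch the square of r is the coefficient of determination, and the root MSE is in the same units as y (dollars), which is why it is convenient for judging how far actual prices tend to fall from the line.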
Questions for Lecture Fourteen
1. Describe in your own words the purpose of simple linear regression.
2. True or False. A good statistical model will often explain all of the systematic behavior of the data, eliminating all of the random error.
3. What information does the intercept parameter in simple linear regression provide?
4. Give an example of when the slope parameter in linear regression would be negative.
5. True or False. There is one line that minimizes the squares of the error from the points to that line. That line is the regression line.
6. True or False. The Root Mean Square Error is used in constructing confidence curves for the regression line.
7. State a major difference between regression and correlation.
8. True or False. Correlation ranges from -1 to +1.
9. Give an example of two variables that have correlation of around 0.

Essential Reading for Lecture Fourteen
Aczel, Complete Business Statistics, Chapter 10, Irwin, Third Edition, 1996.

Recommended Reading for Lecture Fourteen
Hanke and Reitsch, Understanding Business Statistics, Chapter 14, Irwin, 1994.
Mendenhall and Sincich, A Second Course in Business Statistics: Regression Analysis, Chapter 2, Dellen, 1993.

Lecture Fifteen
The Validity and Usefulness of a Regression

Scope: Running a regression does not guarantee that it is useful or valid. A regression may be valid only for a small range of values. In this lecture, we explain how to determine whether or not the regression equation is meaningful for business analysis. We also discuss what conditions must be met in order for a regression to be valid. The goal of regression is not just to fit a line to a set of data points, but to be able to use the line to forecast and predict.

Outline

I. \(y = \beta_0 + \beta_1 x + \varepsilon\)
   A. When there is no linear relationship between x and y, the population regression slope, \(\beta_1\), is equal to 0.
Therefore the most important statistical test in simple linear regression is whether or not the slope parameter is 0. In every other situation a linear relationship exists, either positive or negative.
      1. The slope parameter may be 0 when y is a constant value.
      2. As x increases there is no systematic influence on y. They are completely independent and the data points are randomly distributed.
   B. The statistical test for a linear relationship between x and y.
      1. Use hypothesis testing. Set the null and alternative hypotheses, \(H_0: \beta_1 = 0\), \(H_1: \beta_1 \neq 0\). The test statistic is the t ratio: the estimate of \(\beta_1\) divided by the standard error of the estimate. If we can reject the null hypothesis then we can conclude there is a linear relationship between the two variables.
      2. Enter the data into a statistical software package, which will calculate the regression line and all the parameters. Suppose that the output is:

         Testing for a linear relationship in the regression line \(\hat{y} = 90{,}000 + 80x\)

         Variable     Estimate   Standard error   t ratio   p value
         Intercept    90,000     25,000
         Slope        80         30               2.7       .006

      3. If the t ratio is high enough we can reject the null hypothesis and assume a linear relationship exists. Generally speaking, a linear relationship exists when the absolute value of t is larger than two.
      4. The p value is the value of \(\alpha\) at which the hypothesis test would change conclusions. Since our \(\alpha\) is generally .05, any p value less than .05 (.006 is less than .05) allows us to reject the null hypothesis.
II. The usefulness of a regression can be measured and quantified.
   A. The mean square error (MSE) is an estimate of regression error, measuring the variation of the data about the regression line. MSE, however, depends on the nature of the data.
   B. \(R^2\) is a relative measure that compares the variation of y about the regression line with the variation without the regression line. The coefficient of determination (\(R^2\)) is the proportion of the variation in y that is explained by the regression relationship of y with x. \(R^2\) ranges from 0 to +1.
   C. The regression line always goes through the mean \((\bar{X}, \bar{Y})\).
$R^2$ tells you how much work the regression line is doing as x moves away from $\bar{x}$ and y moves away from $\bar{y}$. $R^2 = 0$ means that the regression line does not explain the movement away from the mean. $R^2 = 1$ means that the line is a perfect fit.

Residual analysis of a regression checks for equality of error variance, tests for missing variables in the regression, and helps detect whether there is a possible curvilinear relationship.
  A. If the residuals are plotted, a pattern known as heteroscedasticity may emerge, in which the residuals get larger as x gets larger (a funnel shape). This implies that the error variance is not equal and thus brings into question the validity of the regression. The desired outcome is homoscedasticity, in which the residuals are scattered randomly.
  B. Sometimes when the residuals are plotted the points form a linear pattern, which often indicates that another variable should be included in the model. It may also indicate a curvilinear relationship.

Constructing a prediction interval: $\hat{y} \pm \text{interval}$
  A. The width of the prediction interval depends on the distance of x from the mean.
  B. For example, there is a significant linear relationship between January stock prices and how stocks perform for the year. However, the root mean square error is so large that the regression line is of little or no use in predicting stock prices.

Questions for Lecture Fifteen

1. Describe in your own words the test for determining whether or not there is a regression relationship between x and y.
2. True or False. MSE (Mean Square Error) is a relative measure of how good the regression fits.
3. True or False. $R^2$ essentially tells you what percentage of the variation in y is explained by the regression line.
4. Explain how residual analysis is used to check the validity of the regression.
5. True or False. If the plot of the residuals against x yields an upside-down U-shaped curve, the linear regression is confirmed.
6. You determine that there is a valid regression relationship between movement in January stock prices and the stock price movement for the entire year. Nevertheless, you determine that your prediction interval is not useful. How can this be?
7. What is heteroscedasticity?
8. True or False. A prediction interval consists of two lines parallel to the regression line.

Problems for Lecture Fifteen

Problems 1 through 3 relate to the following situation. Suppose that a regression line for ice cream sales at a ball park has been developed using historical data. The regression equation is $y = 12{,}000 + 200x$, where y represents sales in dollars and x represents average temperature in degrees Fahrenheit.

1. Does the slope of the regression line appear to be in the direction you would expect? Explain.
2. What is the expected difference in ice cream sales at the park between a day when the average temperature is 60° and a day when the average temperature is 70°?
3. Would you expect temperature to explain most of the variation in ice cream sales at the park? Explain.

Essential Reading for Lecture Fifteen
Aczel, Complete Business Statistics, Chapter 10, Irwin, Third Edition, 1996.

Recommended Reading for Lecture Fifteen
Hanke and Reitsch, Understanding Business Statistics, Chapter 14, Irwin, 1994.
Mendenhall and Sincich, A Second Course in Business Statistics: Regression Analysis, Chapter 3, Dellen, 1993.

Lecture Sixteen
Introduction to Multiple Regression

Scope: In this lecture we will provide an introduction to multiple regression. Multiple regression is an extension of simple linear regression in that more than one independent variable is used in attempting to explain variation in the dependent variable. We also explore the use of dummy variables in regression models. Nevertheless, just because a model can be built, it does not necessarily follow that the model will be good for prediction. In business situations, statistical modeling is generally not an end in itself, but when analytical and statistical modeling are combined with business experience and intuition, more effective decision making will often be the result.

Outline

I. When two or more independent variables are included in a regression model, we are using multiple regression.

II. Parsimony is important in building regression models.
  A. Given n points, we can find an (n-1) dimensional surface that will fit the data perfectly. It is possible to overfit the data by introducing too many variables.
  B. $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k$
  C. Utilize the minimum number of independent variables to get the job done.

III. The Analysis of Variance (ANOVA) test using data from residential real estate sales as an example.
  A. The ANOVA test asks: is there a regression relationship between y and any of the independent variables? Consider the following data in our example:

     Residential property | Sales price | Square feet | Lot size
     1 | [illegible] | [illegible] | [illegible]
     2 | $300,000 | 2,200 | 12,000
     [illegible] | [illegible] | [illegible] | 18,000

     The statistical test, or overall test, is as follows:
     $H_0: \beta_1 = \beta_2 = 0$ or $H_1$: not all the $\beta_i$ are equal to 0.
     If all the $\beta_i$ are equal to zero, then the mean of the data set is doing all the work and the regression is not helping us.
  B. ANOVA is included in most statistical or spreadsheet software applications. The statistical package runs the regression and calculations once you've entered the data. The resulting ANOVA table includes source of variation, degrees of freedom (k relates to the number of independent variables in the regression), sums of the squares (SSR), mean square from the regression (MSR), F-ratio and p-value.

     Source | df | SS | MS | F ratio | p value
     Regression | k | SSR | MSR = SSR/k | MSR/MSE | 0.010
     Error | n-(k+1) | SSE | MSE = SSE/(n-(k+1)) | |
     Total | n-1 | | | |
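The arithmetic behind the ANOVA table can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `anova_summary` and the input numbers are hypothetical, and the p-value is left out because it comes from the F distribution, which a statistical package would normally supply.

```python
def anova_summary(ssr: float, sse: float, n: int, k: int) -> dict:
    """Fill in the ANOVA table entries from the two sums of squares.

    ssr: regression sum of squares (SSR)
    sse: error sum of squares (SSE)
    n:   number of observations
    k:   number of independent variables
    """
    df_regression = k
    df_error = n - (k + 1)
    if df_regression <= 0 or df_error <= 0:
        raise ValueError("need k >= 1 and n > k + 1")

    msr = ssr / df_regression   # mean square from the regression
    mse = sse / df_error        # mean square error
    return {
        "df_regression": df_regression,
        "df_error": df_error,
        "df_total": n - 1,
        "msr": msr,
        "mse": mse,
        "f_ratio": msr / mse,
    }


# Hypothetical sums of squares, chosen only to keep the arithmetic easy to follow.
table = anova_summary(ssr=600.0, sse=200.0, n=23, k=2)
print(table)
```

With these made-up inputs the F ratio is MSR/MSE = 300/10 = 30. The degrees of freedom for regression and error always add up to the total, n-1.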
     1. The F-ratio test indicates whether or not there is a regression relationship between y and any of the independent variables. The higher the F value, the more likely that the regression has explanatory and predictive power. A rough rule of thumb for larger sample sizes is that an F ratio greater than five indicates that there is a regression relationship between the dependent variable and at least one of the independent variables. It should also be remembered that the p-value needs to be less than 0.05 to indicate a regression relationship. For example, in the ANOVA table above, if the p-value were 0.10 you would conclude that there was not a regression relationship.
     2. ANOVA is important because a series of t tests to compare pairs of means would not be independent of each other. This is especially true when there are three or more independent variables, because one variable may be robbing another variable of its predictive power. Thus, the ANOVA test is done first in situations involving multiple regression.
  C. Note that we still need separate tests to determine which of the slope parameters are different from 0. In this case t tests have been used:

     Variables | Estimate | Standard Error of Estimate | t value | p value
     Constant | 36,000 | | |
     $X_1$ | 70 | 12 | 5.8 | <0.001
     $X_2$ | 7 | 3.4 | 2.1 | 0.047

     1. Since the model passed the overall F test, there is a relationship between the variables.
     2. Both of the independent variables, $X_1$ and $X_2$, should be included in the model since p < 0.05. The model would be $\hat{y} = 36{,}000 + 70x_1 + 7x_2$.
     3. To predict the price of a piece of residential real estate with a 2,000 square foot house and a 10,000 square foot lot, substitute $X_1 = 2{,}000$ and $X_2 = 10{,}000$. The regression model equation calculates the sales price as follows: 36,000 + 70(2,000) + 7(10,000) = $246,000.

IV. The usefulness and accuracy of the multiple regression is indicated by the root mean square error and the $R^2$ value.
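A minimal sketch of how the two measures named above might be computed for the lecture's fitted model, 36,000 + 70(square feet) + 7(lot size). The helper names `predict_price` and `fit_summary` and the five houses are hypothetical, introduced only for illustration. The predictions come from the lecture's equation rather than from a fresh least-squares fit to these five points, so $R^2$ is computed here directly as 1 - SSE/SST.

```python
import math


def predict_price(square_feet: float, lot_size: float) -> float:
    """Fitted model from the lecture: 36,000 + 70*x1 + 7*x2."""
    return 36_000 + 70 * square_feet + 7 * lot_size


def fit_summary(actual, predicted, k):
    """Return MSE, root mean square error, and R^2 for k independent variables."""
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # error sum of squares
    sst = sum((y - mean_y) ** 2 for y in actual)                # total sum of squares
    mse = sse / (n - (k + 1))                                   # same df as the ANOVA error row
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - sse / sst}


# Hypothetical houses: (square feet, lot size, observed sales price in dollars)
houses = [
    (2_000, 10_000, 250_000),
    (2_200, 12_000, 300_000),
    (1_500, 8_000, 190_000),
    (3_000, 18_000, 370_000),
    (1_800, 9_000, 230_000),
]

actual = [price for _, _, price in houses]
predicted = [predict_price(sqft, lot) for sqft, lot, _ in houses]
summary = fit_summary(actual, predicted, k=2)
print(summary)
```

The first house reproduces the lecture's worked prediction of $246,000. A large root mean square error relative to typical prices would signal wide prediction intervals even when $R^2$ looks high.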
  A. The mean square error (MSE) estimates the population square error. The root mean square error (SE) is $\sqrt{MSE}$. The SE is generally used as a multiplier in the prediction interval.
  B. $R^2$, which corresponds to the multiple coefficient of determination, measures the proportion of variation explained by the regression model. $R^2$ tends to go up as more variables are included.

V. Dummy variables are also used in a regression. In a dummy variable the "switch" is either on or off; the value is either 0 or 1.
  A. A dummy or indicator variable expresses levels of a quality, such as whether the house is on a golf course, type of coffee, or genre.
  B. Use of a dummy variable in regression analysis is straightforward. Simply code the indicator variable to 1 if the level is obtained or to 0 if the level is not obtained.
  C. Consider the regression equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$. Let $x_3$ represent whether or not the house is on a golf course. If the house is on the golf course, $x_3 = 1$. Thus, in the regression equation $y = 40{,}000 + 85x_1 + 10x_2 + 50{,}000x_3$ (in dollars), the dummy variable $x_3$ adds $50,000 to the sales price if the house is located on the golf course.

Questions for Lecture Sixteen

1. Explain what is meant by parsimony in building a multiple regression model.
2. True or False. The maximum number of independent variables that should be used in multiple regression is three.
3. Multiple regression often provides a more adequate way of modeling complex business situations than simple linear regression. Explain this statement.
4. True or False. The Analysis of Variance (ANOVA) table is used to determine which of the independent variables have a regression relationship with the dependent variable.
5. Assume you are attempting to build a multiple regression model to explain the price of properties in a real estate development located near a golf course. What are some of the independent variables you might use?
6. True or False.
Unlike simple linear regression, where $R^2$ must be less than 1, in multiple regression it is possible for $R^2$ to be greater than 1.
7. What are dummy variables and why are they coded as 0 or 1?
8. Suppose you are attempting to build a regression model to explain box office sales for upcoming movies. Which of the following are dummy variables: production cost budget, advertising budget, whether or not a major star is in the film, whether or not the film is a sequel?

Essential Reading for Lecture Sixteen
Aczel, Complete Business Statistics, Chapter 11, Irwin, Third Edition, 1996.

Recommended Reading for Lecture Sixteen
Hanke and Reitsch, Understanding Business Statistics, Chapter 15, Irwin, 1994.
Mendenhall and Sincich, A Second Course in Business Statistics: Regression Analysis, Chapter 4, Dellen, 1993.

Answers

Answers to Questions for Lecture Nine

1. A random sample provides a "representative" sample; using a random sample, you can often describe how your results differ from those of the population.
2. A parameter is a number computed for the entire population. A statistic is a number computed from your sample data.
3. False
4. False
5. A sampling distribution lists, for each possible value of the statistic, the fraction of all possible samples with a given value.
6. No. The units must also be chosen independently.

Answers to Questions for Lecture Ten

1. Many data sets we work with in business will be normally distributed. Other data sets will not be normally distributed. However, given the central limit theorem, the distributions of means or sums of the data will be approximately normal if our sample is large enough.
2. False
3. When sampling from a population, the distribution of means will tend toward a normal distribution as the sample size gets large.
4. True
5. True
6. Because of the central limit theorem
7. No

Answers to Problems for Lecture Ten

1. Yes.
You can use the normal distribution to approximate the binomial, since np(1-p) is large (greater than 5).
2. About 90.1% (using a normal distribution table)
3. About 21.5% (using a normal distribution table)

Answers to Questions for Lecture Eleven

1. An interval of numbers within which we expect the true value of the population parameter to lie
2. True
3. The sample size is large enough so that the central limit theorem can be applied
4. A wider confidence interval
5. True
6. False. Given the possibility of very remote events, a 100% confidence interval (if obtainable) is too large to be useful.

Answers to Problems for Lecture Eleven

1. From $24.85 to $25.15.
2. From $24.88 to $25.12. Note that going to a 95% confidence interval does not "cost" you much in interval width, given the large sample size.
3. People who send in for rebates may not be a random sample of your customers.

Answers to Questions for Lecture Twelve

1. This is the fraction of all confidence intervals that would include the true value of the population parameter
2. The area under the curve that excludes the tails
3. The area in one tail of the distribution
4. Yes, this is a confidence interval to estimate p, the population proportion.
5. True

Answers to Problems for Lecture Twelve

1. 48%
2. 44.3% to 51.7%
3. 43.6% to 52.4%
4. We are 95% sure that between 43.6% and 52.4% of our customers like the new pizza topping.

Answers to Questions for Lecture Thirteen

1. What is claimed to be correct: the status quo
2. The alternative hypothesis competes with the null
3. The chances of rejecting the null hypothesis when it is indeed true
4. Failing to reject the null hypothesis when it is false
5. When you are testing a specific claim for a population parameter

Answers to Problems for Lecture Thirteen

1. Null hypothesis: mean = 35; alternative hypothesis: mean $\neq$ 35
2. The z-statistic
3. Rejection region: z < -1.96 or z > 1.96
4. 20
5. Yes.
Since z* falls in the rejection region, we conclude that there is evidence that the average number of tissues used is not 40.

Answers to Questions for Lecture Fourteen

1. The purpose of linear regression is to provide a "best model" for a straight line relationship between two variables.
2. False
3. The value of the dependent variable when the independent variable is equal to 0.
4. This will occur when increasing values of the independent variable are associated with decreasing values of the dependent variable. For example, using age to predict the time that it takes adults to run a 100 yard dash may produce a negative slope parameter estimate.
5. True
6. True
7. With correlation, we assume that both x and y are random variables, whereas with regression we assume that x is not random.
8. True
9. With 0 correlation, there will be little if any association between the two variables. An example might be height and intelligence of company CEOs.

Answers to Questions for Lecture Fifteen

1. The test is a t-test that examines whether or not the slope parameter is equal to 0.
2. False
3. True
4. As x increases, check the residuals to see if the error variance is staying approximately constant.
5. False
6. The root mean square error may be large, and the prediction interval may be too large to be useful.
7. Unequal error variance
8. False

Answers to Problems for Lecture Fifteen

1. Yes. It makes sense for sales to go up as temperature rises.
2. $2,000
3. Not necessarily. Other factors such as attendance may be very important.

Answers to Questions for Lecture Sixteen

1. Building a good regression model with the minimum number of independent variables
2. False
3. Many business variables (such as sales) are complex and are better explained by using more than one independent variable.
4. False
5. Lot size, interior square footage, number of bedrooms, and whether or not the property is on the golf course are some examples.
6. False
7. Dummy variables are used to indicate whether or not a quality is present. A value of 0 means that the quality is not present, and a value of 1 means the quality is present.
8. Whether or not a major star is in the film, whether or not the film is a sequel

Bibliography

Aczel, Complete Business Statistics, Irwin, Third Edition, 1996.
Clemen, Making Hard Decisions, PWS-Kent, 1991.
Cochran, Sampling Techniques, Wiley, 1973.
Crystal Ball Users Manual, Decisioneering, 1995.
Deming, "On Probability as a Basis for Action," American Statistician, Vol. 29, 1975, 146-152.
Derman, Gleser, and Olkin, A Guide to Probability Theory and Applications, Holt, Rinehart and Winston, 1973.
Hanke and Reitsch, Understanding Business Statistics, Irwin, 1994.
Mendenhall and Sincich, A Second Course in Business Statistics: Regression Analysis, Chapter 4, Dellen, 1993.
Schleifer and Bell, Data Analysis, Regression, and Forecasting, Chapter 2, Course Technology, 1995.
Winston, Simulation Modeling using @Risk, Duxbury, 1995.
