
Demand Forecasting with Multiple Regression

William Swart, Ph.D.

Developed exclusively for IEEE eLearning Library


Sponsored by: IEEE Educational Activities

Course Presenter's Biography

William Swart is professor of Marketing and Supply Chain Management at East Carolina University. He holds a Ph.D. in Operations Research and an M.S. in Industrial and Systems Engineering from the Georgia Institute of Technology and a B.S. in Industrial Engineering with Honors from Clemson University. Dr. Swart's experience is diversified between industry and academia. In academia, he served as Provost and Vice Chancellor for Academic Affairs at East Carolina University, Dean of Engineering at New Jersey Institute of Technology and Old Dominion University, Associate Dean of Business and Economics at California State University, and Chairman of the Department of Industrial Engineering and Management Systems at the University of Central Florida. In industry, he served as Vice President for Operations Systems and Vice President for Management Information Systems at Burger King Corporation. As Dean of Engineering and Technology at New Jersey Institute of Technology, he supervised the development of a number of green engineering initiatives, including the establishment of the Multi-lifecycle Engineering Research Center. Dr. Swart remains active as a strategic consultant for industry. His professional achievements have been honored by the Institute of Industrial Engineers with the 1994 Operations Research Practice Award. He was awarded the Achievement in Operations Research Medal from the Institute for Operations Research and the Management Sciences (INFORMS) and has been named an Edelman Laureate for twice having been a finalist in the prestigious Edelman Competition for the best Operations Research application in the world. Professor Swart has been professionally active in Latin America, Europe, Asia, and the Middle East and is fluent in four languages. He has over 100 publications and has been Principal Investigator for grants and contracts in excess of $10 million.

IEEE eLearning Library Biometrics for Recognition at a Distance Transcript pg. 2 / 26

Course Outline

An accurate forecast of future demand is an absolute requirement for planning production without creating wasteful overages or shortages and hence constitutes a cornerstone of successful green engineering. This module introduces multiple regression from a user's perspective and shows step by step how you can create a statistically robust forecasting formula based on variables that you believe play a role in determining the demand for your product. The appendix of the module shows, hands-on and step by step, how someone with limited statistical and computer spreadsheet knowledge can implement the steps shown using Microsoft Excel and have a working forecasting system.


Course Summary / Key Points



Reviews how to create a demand forecast using multiple regression
Illustrates the steps involved in creating a forecast using the Excel Solver add-in and Microsoft Excel


Course Transcript

Background
We developed this module partly because a recent Eye for Transport survey indicated that forecasting is an area where most respondents still feel that they have room for improvement. Only 22% of retail and consumer product supply chain executives rated their forecasting capabilities as either good or excellent. At the other end of the scale, 30% rated their forecasting as less than satisfactory or very poor. Put another way, 78% of respondents would not rate their forecasting capabilities as anything better than merely satisfactory.

Forecasting and Green Engineering


Forecasting is very important to green engineering in general and to green production in manufacturing systems in particular. Paul Anastas, a professor of chemical engineering and an EPA administrator, has developed 12 principles of green engineering. We're not going to go over each of those, but there are two principles that appear to be particularly relevant to explaining why forecasting is important to green engineering.

Forecasting and Green Engineering (cont.)


Principle five states that a green engineering system must be output-pulled versus input-pushed. Manufacturing systems that adhere to the just-in-time philosophy are pull systems: the customer requests the product and pulls it through the supply chain. When it is not practical to wait until the customer requests the product, accurate forecasts are used instead of actual customer orders. Lean manufacturing describes the process of getting rid of the waste associated with overproduction, delays, inventory, over-processing and error, and it leads to a more continuous production flow based almost entirely on customer pull.

Module Rationale
Consequently, the rationale for this module being part of the IEEE green production management series is that an accurate prediction of future demand is a requirement in order to plan production without creating wasteful overages or shortages and hence constitutes a cornerstone of successful green engineering.


Independent vs. Dependent Item Forecasts


To understand what we are talking about in this module, it is important to understand the difference between independent and dependent demand items for forecasting. Independent demand items are essentially finished products, those that have a use all by themselves. Dependent demand items are essentially the components that go into the final product. The demand for the finished product, the independent demand item, can be estimated with forecasting techniques such as those discussed in this module. The demand for components is not forecasted independently; it is calculated from the forecast of the finished product by techniques such as Material Requirements Planning (MRP).

Learning Objectives
The learning objectives of this module are basically that you will learn to use multiple regression for forecasting. As part of that you will learn to postulate causal or independent variables, you will learn to use the output from multiple regression software products to develop a robust forecasting model and finally, you will learn how to make point and interval forecasts.

Two Quantitative Approaches to Forecasting


There are two quantitative approaches to forecasting, and we stress the word quantitative because there are also a number of qualitative approaches to forecasting, which are used when there is no hard data available. For quantitative forecasting we typically use either multiple regression or time series analysis. Multiple regression develops an equation that predicts the value of a dependent variable from one or more independent variables, which we sometimes call causal variables. Time series analysis is different in that it does not necessarily try to bring in causal variables; it simply examines data that has been collected over time and quantifies the patterns it exhibits, such as trend and seasonal behavior, in order to project those into the future for the purposes of making a forecast.

What is a Good Forecast?


If the question is, what is a good forecast? I think we would all agree that a good forecast is a prediction that comes as close as possible to the future it tries to predict, but unfortunately, we can only determine that in retrospect. Consequently, to assess how good a forecast might be, we select a recent known past period as a test period and forecast for that period as if it were unknown. Once we have made that forecast, if we have several forecasts that were produced by, let's say, different approaches, then we apply various measures to the forecasts to quantify the goodness of each. Of course, we would then want to select the best.

Measure of Goodness
The most common measure of forecast goodness is the sum of the squared differences between the actual and forecasted demand (the residuals) over the test period. Take a look at the example shown. The forecast with the smallest sum is preferred; thus, finding the best forecast is a mathematical optimization problem. For causal forecasting, minimizing the sum of squares is accomplished via regression. For time series forecasting, other measures of goodness and methods of optimization may be used.

Example 1a
Here we have an example. Suppose that we have data over 19 time periods. Now, if we wanted to see which of two methods would be the better forecasting method, we would first select a test period. That test period would encompass a number of periods, typically the number of periods ahead that you would want to forecast. Then we would pretend that we did not have the actual demand for the test period, and we would use the first 14 periods, in this case, to come up with a forecast for periods 15, 16, 17, 18 and 19. If we had two forecasting methods, then let's say that these are the results that we would have.

Example 1b
These results we could show on a graph and we can see that the blue line is the actual data; we can see that the red line is the forecast obtained by method one and the green line is the forecast obtained from method two during this particular test period.

Example 1c
In order to determine whether method one or method two is the better forecasting method, we quantify the difference between the forecasts obtained by each method and the actual demand. So we calculate the method one residual, which is the difference between the demand in period 15 and the method one forecast for period 15. We also calculate the residual for period 15 for method two, which is the demand in period 15 minus the method two forecast for period 15. Then we square those residuals, and we repeat that calculation for each of the periods in our test period, which runs from period 15 through 19. We then add those squared residuals. We find that method one gives us a sum of squared residuals of 26,091.66. Method two gives us a sum of squared residuals of 132,302.95.


Consequently, we would draw the conclusion that method one is the best forecasting method for the data in this example.
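The comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the demand and forecast numbers below are made up for five test periods and are not the values behind the sums quoted above, and the function name `sum_squared_residuals` is our own.

```python
def sum_squared_residuals(actual, forecast):
    """Sum of squared residuals (actual minus forecast) over the test period."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must cover the same periods")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast))


# Hypothetical demand and forecasts for test periods 15 through 19.
# These are illustrative numbers only, not the data from the module's example.
actual = [520.0, 540.0, 510.0, 560.0, 580.0]
method_one = [510.0, 545.0, 520.0, 550.0, 570.0]
method_two = [480.0, 500.0, 560.0, 600.0, 530.0]

sse = {
    "method one": sum_squared_residuals(actual, method_one),
    "method two": sum_squared_residuals(actual, method_two),
}

# The method with the smaller sum of squared residuals is preferred.
best = min(sse, key=sse.get)
print(sse, "->", best)
```

With these made-up numbers, method one has the smaller sum of squared residuals, so it would be the one selected.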

Multiple Regression
If we next focus our attention on the main topic of this module, multiple regression, we note that multiple regression requires a sample of data that includes the values of a dependent variable, which we'll always refer to as $Y$, and a number of corresponding independent variables. I'll say that number is $M$, an arbitrary number, so the independent variables are $X_1, X_2, \dots, X_M$. $M$ could be as small as one; if you have only one independent variable, then you have what is usually referred to as simple regression. If you have more than one independent variable, then you have what is normally referred to as multiple regression. The independent variables, incidentally, are variables that you have reason to believe would impact the value of your dependent variable $Y$. Once you have selected your independent variables and the dependent variable, you postulate that you will be able to obtain an equation by which to forecast the value of $Y$. The forecasted value of $Y$ is usually not the same as the actual value of $Y$, and consequently we refer to $Y^*$ as the forecasted value of $Y$. We hypothesize that $Y^*$ is going to be equal to an intercept, which we call $\beta_0$, plus a slope for each of the independent variables multiplied by that independent variable: $Y^* = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_M X_M$. Now, the $\beta$s, the betas, we refer to as the regression coefficients. Again, we have picked $Y$ and the $X$s, so we have values for those. The unknowns in that equation are going to be the $\beta$s, and we are going to use multiple regression to find the values of $\beta_0, \beta_1, \beta_2, \dots, \beta_M$ that give us values for $Y^*$ that minimize the sum of the squared residuals. In other words, we want to minimize the sum of $(Y - Y^*)^2$ over the data. In multiple regression, the test period is not a small period; we use the entire data set as the test period.
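To make the idea of finding the coefficients that minimize the sum of squared residuals concrete, here is a minimal sketch in Python. It assumes the NumPy library rather than the spreadsheet tools the module itself uses, and the six observations are invented and noise-free so that the recovered intercept and slopes are easy to check.

```python
import numpy as np

# Hypothetical sample: six observations of two independent variables.
# (Invented for illustration; not data from the module.)
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 5.0],
])

# Dependent variable built exactly from intercept 1, slopes 2 and 3,
# so the fit should recover those three numbers.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Least squares: the coefficients that minimize the sum of (Y - Y*)^2.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

y_star = A @ beta                        # forecasted values Y*
sse = float(np.sum((y - y_star) ** 2))   # sum of squared residuals

print("coefficients:", beta)
print("sum of squared residuals:", sse)
```

Real data will not fit exactly, so the sum of squared residuals will be greater than zero; the regression simply makes it as small as possible.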

A Numerical Example
To illustrate how one goes about doing multiple regression, we are going to take a numerical example, which is a modification of an example given in the book by Makridakis and Wheelwright that is listed in the references. In our particular example, we are going to say that we have 14 years of data and we are trying to predict the sales for a company that we call the Carolina Plate Glass Company. This particular company is fictitious, but we assume that the executives of that company have gotten together and have found that their principal customers are the automobile producers and the builders. So they have reason to believe that if they have the production of automobiles for a particular year and the building contracts that are awarded in that year, then they should be able to predict their sales fairly well.

Regression Model for the Example


This is a hypothesis, and the way that they state this hypothesis is: we believe that our sales can be predicted by an equation that is a constant $\beta_0$, plus $\beta_1$ times our automobile production, plus $\beta_2$ times our building contracts awarded, that is, $Y^* = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. Now $\beta_0$ is the intercept. Again, the intercept simply means the predicted value of $Y$ when $X_1$ is zero and $X_2$ is zero. When $X_1$ is zero we mean there is no automobile production; when $X_2$ is zero we mean there are no building contracts awarded; so presumably $\beta_0$ simply means the sales that would be accrued from sources other than automobiles and building. Now $\beta_1$ is the slope with respect to automobile production, and it simply indicates the change in the predicted sales for every unit change in $X_1$. Similarly, $\beta_2$ is the slope with respect to building contracts. In other words, if building contracts, $X_2$, changes by one unit, then $\beta_2$ reflects by how much the predicted sales will change. Now we are going to use multiple regression to find the values of $\beta_0$, $\beta_1$ and $\beta_2$.

Multiple Regression Software


Now, there are a number of different software products that can be used to accomplish this. One that is perhaps as available as any other, if not more so, is the regression capability that's associated with Microsoft's Excel spreadsheet. It's a data analysis add-in. In our appendix we show how to use that particular add-in and go through all of the steps that illustrate its use. But there are many other very good software systems, perhaps for the heavy-duty user of regression, and those include Minitab, SAS and SPSS; those are all names or acronyms of regression packages. Many times they include much more than just regression; they are part of overall statistical computational systems.

MR Software - Differences and Commonalities


Now, there are differences and commonalities in the various multiple regression software products that exist. First of all, multiple regression usually involves the manipulation of large volumes of data, and most multiple regression software packages differ in the manner in which data is entered and in the types of statistical analysis they provide above and beyond basic multiple regression. They differ in the amount and type of output analysis, including graphics, that they provide. However, all multiple regression software packages provide similar basic multiple regression output, and we'll refer to that as BMR, basic multiple regression output.

BMR Output
Now, the basic multiple regression output that is provided by all software packages encompasses four things: the regression statistics; something that is typically referred to as ANOVA, or analysis of variance, output; information regarding the actual values and the predicted values, in other words, residual output; and the correlation matrix.

Using BMR Output


In order to use basic multiple regression output, we have to understand that multiple regression is a statistical procedure that considers your data to be a sample from a larger population. A variety of statistical tests must be conducted before a regression model is considered to yield statistically significant estimates of the population parameters it produces. Thus the basic multiple regression output will be submitted to five tests in sequence. Only when all five tests are satisfied will the resulting regression model be considered ready to produce statistically reliable forecasts. A model that passes all five tests is referred to as a robust model.

BMR Output for Example


To give you an idea of what a basic multiple regression output might look like, here we have the output that Excel's data analysis add-in produces. All regression software products produce something that we refer to as the regression statistics. We will not explain these here; we will explain them as we go on.


BMR Output for Example (cont.)


They produce a residual output, which is shown in this particular table and is simply a listing of the predicted values and the residuals, which are the differences between the actual values and the predicted values.

BMR Output for Example (cont.)


And then all basic multiple regression outputs also provide a correlation matrix that describes the relationship of the independent variables with each other, as well as of each independent variable with the dependent variable. What we are talking about there is the degree of association between those variables, and of course, the degree of association is referred to as the correlation between two variables.

Regression Tests for Robustness


We have indicated that before we can use the basic multiple regression output for forecasting, we have to subject that information to a sequence of five tests that ultimately will lead to a statistically reliable, or robust, model that we will be able to use for forecasting. The first test, test one, checks the adjusted $R^2$ to see the extent to which the independent variables explain the dependent variable. The adjusted $R^2$ gives the percentage of the overall variability of the dependent variable that is explained by the independent variables. Different statisticians use different criteria. The recommended criterion for successfully passing this test is that your adjusted $R^2$ is greater than or equal to 0.6. In other words, I would personally like to see models for which the independent variables explain at least 60% of the overall variability of the dependent variable. What do we do if that test is not met? Well, if we have not explained enough of the variability of the dependent variable, then perhaps we would want to look for additional independent variables. Can we think of any other independent variables that we have not yet included in the model that might help us to further explain the value of the dependent variable? If the answer to that is negative, then we simply say, okay, that's the best we can do, and we'll go ahead and proceed, wishing that we had variables that would explain a greater percentage of the variability. Again, this is no reason to stop; we simply continue, understanding that we just wish that we could do better.

Applying Test 1
Applying the test is very simple. We go to the first part of the output, the regression statistics, and we look for where it says adjusted $R^2$. We look at that number, and that number is greater than 0.6. Consequently, we say that test one is okay. We always recommend that, somewhere in the regression output, we type out the results of each of the tests that we perform.
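Regression software reports the adjusted $R^2$ directly, so you never have to compute it yourself, but a small sketch may help show what the number means. The Python below uses the standard textbook formula for a model that includes an intercept, $1 - (1 - R^2)(n - 1)/(n - M - 1)$, with $n$ observations and $M$ independent variables. The actual and predicted values are invented for illustration, and the function name is our own.

```python
def adjusted_r_squared(actual, predicted, num_independent_vars):
    """Adjusted R^2 for a regression model that includes an intercept."""
    n = len(actual)
    mean_y = sum(actual) / n
    # Variability left unexplained by the model (sum of squared residuals).
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Overall variability of the dependent variable around its mean.
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    r_squared = 1.0 - ss_res / ss_tot
    # Penalize for the number of independent variables used.
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - num_independent_vars - 1)


# Hypothetical actual and predicted values (illustration only).
actual = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
predicted = [11.0, 12.0, 14.0, 20.0, 23.0, 30.0]

adj_r2 = adjusted_r_squared(actual, predicted, num_independent_vars=2)

# Test 1 criterion from the module: adjusted R^2 of at least 0.6.
passes_test_one = adj_r2 >= 0.6
print(round(adj_r2, 4), passes_test_one)
```

Note that when the intercept is forced to zero, software packages may compute $R^2$ on a different basis, so this sketch should not be expected to reproduce their numbers in that case.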

Regression Tests for Robustness


The second test for robustness is to check the F statistic. We check the F statistic to see whether any of the independent variables explain the dependent variable. Currently my criterion for successfully passing this test is that the F statistic is greater than or equal to five. Again, different authors of different texts have different preferences, but you should be happy if the F statistic is greater than or equal to five, because then we have at least one useful variable in our model. If we do not get an F statistic that is greater than or equal to five, then unfortunately we have to throw all of the independent variables away, start from scratch, and see whether we can find independent variables that do explain the dependent variable.

Applying Test 2
Applying test two, just like applying test one is a simple process. We now go to the ANOVA, the analysis of variance tables that are given to us. I have marked in red where the F statistic is located. In our case, the F statistic has a value of 60.41, which definitely is greater than five. Consequently, we pronounce test two as being satisfied.
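The F statistic is read straight from the ANOVA table, but for a model with an intercept it is tied to $R^2$ by a standard identity: $F = (R^2 / M) / ((1 - R^2)/(n - M - 1))$, with $n$ observations and $M$ independent variables. The Python sketch below applies that identity. The $R^2$ of 0.9 is an assumed value chosen for illustration, not a number from the module's output; only the 14 observations and two independent variables mirror the example.

```python
def f_statistic(r_squared, num_observations, num_independent_vars):
    """Overall F statistic for a regression model that includes an intercept."""
    explained = r_squared / num_independent_vars
    unexplained = (1.0 - r_squared) / (num_observations - num_independent_vars - 1)
    return explained / unexplained


# Assumed R^2 for illustration; 14 observations and 2 independent variables
# match the shape of the module's example.
f_value = f_statistic(r_squared=0.9, num_observations=14, num_independent_vars=2)

# Test 2 criterion from the module: F statistic of at least five.
passes_test_two = f_value >= 5.0
print(round(f_value, 2), passes_test_two)
```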

Regression Tests for Robustness


The third test examines the correlation matrix to determine if multicollinearity exists between pairs of independent variables. Multicollinearity exists when two independent variables are strongly correlated with each other, meaning that the absolute value of their correlation coefficient is greater than 0.70. If two variables previously thought to be independent are correlated, then one of the two is not only redundant in predicting the dependent variable, but it also compounds any error associated with those variables. Test 3 is passed if ALL the correlation coefficients between the INDEPENDENT variables have an absolute value equal to or less than 0.70. If there is only one pair of independent variables whose correlation coefficient is greater than 0.70 in absolute value, then we discard from the regression model the independent variable of that pair that has the smaller correlation coefficient, in absolute value, with the dependent variable.

Regression Tests for Robustness


If more than a single pair of independent variables have a correlation coefficient between them that is greater than 0.70 in absolute value, then we select the pair that has the largest correlation coefficient in absolute value. From that pair, we discard from the regression model the independent variable that has the smaller correlation coefficient, in absolute value, with the dependent variable. We continue to eliminate independent variables from the regression model using the above actions until all remaining independent variables have correlation coefficients with each other that are less than or equal to 0.70 in absolute value. Something that is important to remember is that after each independent variable is removed from the regression model according to the above rules, the remaining data must be processed by the regression software again and all tests repeated.
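The elimination rule for test three can be sketched as a small Python function. This is a sketch under stated assumptions, not the module's spreadsheet procedure: it assumes the NumPy library, the data are invented, and the function name `variable_to_drop` is our own. Each call picks at most one variable to discard; in practice the regression would be rerun, and all tests repeated, before the function is applied again.

```python
import numpy as np


def variable_to_drop(X, y, names, threshold=0.70):
    """Return the independent variable Test 3 says to discard, or None."""
    k = X.shape[1]
    # Correlation matrix of the independent variables plus y (last column).
    corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)

    # Find the pair of independent variables with the largest |correlation|
    # above the threshold.
    worst_pair, worst_value = None, threshold
    for i in range(k):
        for j in range(i + 1, k):
            if abs(corr[i, j]) > worst_value:
                worst_pair, worst_value = (i, j), abs(corr[i, j])
    if worst_pair is None:
        return None  # no multicollinearity: Test 3 passes

    # From that pair, discard the variable less correlated with y.
    i, j = worst_pair
    return names[i] if abs(corr[i, k]) < abs(corr[j, k]) else names[j]


# Hypothetical data (illustration only): x2 moves closely with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [3.0, 1.0, 4.0, 6.0, 5.0, 8.0]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
y = np.array([3.2, 5.9, 9.1, 11.8, 15.3, 17.9])
X = np.column_stack([x1, x2, x3])

first = variable_to_drop(X, y, ["x1", "x2", "x3"])
# After removing x2 (and, in practice, rerunning the regression), check again.
second = variable_to_drop(X[:, [0, 2]], y, ["x1", "x3"])
print(first, second)
```

In this made-up data only the pair x1 and x2 is correlated above 0.70, and x2 is the weaker predictor of y, so x2 is the one discarded; the remaining pair then passes.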

Preparing to Conduct Test 3


Clearly before we can apply test three, we have to obtain the correlation matrix from the multiple regression software that you are using.

Applying Test 3
For our example, your correlation output would be what you see at the top left-hand corner of this particular spreadsheet. Below that, we explain the meaning of all of that information, because the key to what we have said is that the test involves checking the correlation coefficients between independent variables. We only have two independent variables, $X_1$ and $X_2$, and the correlation coefficient between $X_1$ and $X_2$ is given to us in green. In this case it's 0.030414. The other information is used if we have to remove a variable, which we don't have to do in this particular case. You see here that the red number is the correlation coefficient between variable $X_1$, automobile production, and sales, and the black number is the correlation coefficient between building contracts awarded and sales. It is important to remember that when we want to use independent variables to predict a dependent variable, it is good to have high correlation between each independent variable and the dependent variable, but not between the independent variables.

Applying Test 3 (Cont.)


So having explained the meaning of the entries in the correlation matrix, we can now apply test three by checking whether there is multicollinearity. In order to pass test 3, all the correlation coefficients between the independent variables must be less than or equal to 0.70 in absolute value. That particular correlation coefficient, the green number, has a value of 0.030414, which is less than 0.70; consequently we conclude that there is no multicollinearity, and test 3 passes.

Regression Tests for Robustness


Test four for robustness involves performing what is referred to as a T test. We do that on each of the regression coefficients in order to determine if it is statistically significantly different from zero. My criterion for passing test four is that all regression coefficients must have the absolute value of their corresponding T statistic greater than or equal to 2.0. This corresponds to approximately a 95% confidence level, depending on whether you use a T or Z statistic. We use a 95% confidence level in all of our discussions in this module. If we find that there is a T statistic that is less than two in absolute value, then we must select, if there's more than one, the regression coefficient that has the smallest absolute value of the T statistic and discard from the regression model the independent variable that is associated with that regression coefficient. Note that after each independent variable is removed from the regression model, the remaining data must be processed by the regression software again and all tests repeated. We never simply erase or delete a variable from the regression model without rerunning the entire model with the remaining data.
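Regression packages print the T statistics for you, but as a rough sketch of where they come from, the Python below computes each coefficient divided by its standard error using the usual ordinary least squares formulas. It assumes the NumPy library; the eight observations are invented so that one variable is clearly useful and the other is not, and the function name `t_statistics` is our own.

```python
import numpy as np


def t_statistics(A, y):
    """Return (coefficients, T statistics) for a least squares fit of y on A."""
    n, p = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    # Residual variance, with n - p degrees of freedom.
    s2 = residuals @ residuals / (n - p)
    # Standard errors from the diagonal of s2 * (A'A)^-1.
    se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    return beta, beta / se


# Hypothetical data (illustration only): x1 drives y strongly, x2 barely.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([6.0, 4.0, 4.0, 6.0, 6.0, 4.0, 4.0, 6.0])
noise = 0.5 * np.array([1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0])
y = 10.0 + 2.0 * x1 + 0.1 * x2 + noise

A = np.column_stack([np.ones(len(y)), x1, x2])
names = ["intercept", "x1", "x2"]
beta, t_stats = t_statistics(A, y)

# Test 4 criterion from the module: every |T| must be at least 2.0.
# If some fail, the one with the smallest |T| is the one to remove.
failing = [(abs(t), name) for t, name in zip(t_stats, names) if abs(t) < 2.0]
weakest = min(failing)[1] if failing else None
print(dict(zip(names, np.round(t_stats, 3))), "remove:", weakest)
```

Here x2 has a T statistic well below two in absolute value, so it would be removed and the regression rerun with the remaining data before any test is repeated.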


Applying Test 4
In applying test four, remember that in our example we already applied test one, test two and test three, and they were all satisfied. Now we are applying test four. The T statistic is indicated in the column in red. We find that there's a T statistic that is less than two in absolute value, but it's associated with the intercept. In other words, the intercept is not significantly different from zero, and of course, if something is not significantly different from zero, then we might as well call it zero. Test four is not okay because of that. Consequently, we must remove the intercept from the model. Note that there is no independent variable associated with the intercept; nevertheless, we must remove the intercept.

Removing the Intercept


In order to do that, most multiple regression software allows the user to specify that they wish the intercept, sometimes called the constant, to be zero. So to remove the intercept we have to select that option and rerun the multiple regression. Again, do not simply eliminate the constant from the equation: when the intercept is forced to be zero, all other output is impacted.
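A small sketch shows why the regression has to be rerun rather than the constant simply being deleted. The Python below assumes the NumPy library and uses invented, noise-free data; it fits the same data once with an intercept column and once without, and the slopes come out different.

```python
import numpy as np

# Hypothetical data (illustration only), generated with intercept 1
# and slopes 2 and 3.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 5.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Fit WITH an intercept: a column of ones is included.
A_with = np.column_stack([np.ones(len(X)), X])
beta_with, *_ = np.linalg.lstsq(A_with, y, rcond=None)

# Fit WITHOUT an intercept: the column of ones is left out, which forces
# the fitted equation through the origin.
beta_without, *_ = np.linalg.lstsq(X, y, rcond=None)

print("with intercept:   ", np.round(beta_with, 4))
print("without intercept:", np.round(beta_without, 4))
```

The slopes from the second fit are not the slopes from the first fit with the constant dropped, which is why all the output, and all the tests, must be revisited after the intercept is forced to zero.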

Removing the Intercept (Cont.)


Now, in removing the intercept, if we go to the multiple regression software and check that we want the intercept to be removed, then for our example we have the new regression results shown here. We perform test one again and see that it is okay. We perform test two again and see that it is okay. Test number three never has to be performed again once it is okay, so since test three was okay last time, meaning there was no multicollinearity, it continues to be grandfathered in from here on. Consequently, we now check the T statistics. The intercept is zero, and we now have slightly different values for $\beta_1$, automobile production, and for $\beta_2$, building contracts awarded. The associated T statistics are all greater than or equal to two in absolute value. Consequently, test four is now okay and we can move on to the next test.

Regression Tests for Robustness


The next test, test five, is to determine if the distribution of residuals appears to be Gaussian, or bell-shaped. The criterion for successfully passing this test is that the histogram of the residuals, which is given as part of the regression output, should appear to be Gaussian. If it is not, then it means that there are other independent variables that have not been identified or that the data exhibits other correlation. When this happens, you may want to consult someone who has considerable statistical expertise regarding how to resolve this problem. However, if all of the other tests are passed and you are stuck and you can't go anywhere else, then we suggest that you proceed with caution, understanding that there is still something out there that is not random and that is impacting the value of your sales.

Obtaining the Distribution of Residuals


To obtain the distribution of residuals, remember that the residuals are almost always provided by any basic multiple regression program. Most multiple regression programs will give you histograms, but some of them don't. If you do not have the option of getting the histogram of residuals, then you have to build it yourself. You might remember that we build a histogram by first deciding how many class intervals to use, based on how many observations we have. We have fewer than 25 observations, and consequently we need either five or six class intervals. These rules are given in any of the statistics books listed in the references. Once you have decided how many class intervals you want, you have to determine the width of the class interval, and that is simply the difference between the maximum value and the minimum value in the data, divided by your number of class intervals. Your class interval limits are then the prior limit plus the class interval width, where the first limit is the minimum value of your data set. Again, here you have that information given, and you simply tally how many values fall between -82 and -56, how many between -56 and -29, and so on.
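If your software does not draw the histogram, the class limits and tallies described above are easy to compute. The Python sketch below follows those steps; the 14 residuals are invented for illustration (they are not the residuals from the module's example), the function names are our own, and the convention that a value sitting exactly on a limit is counted in the lower class is an assumption.

```python
def class_limits(values, num_classes):
    """Class limits: start at the minimum and add the class width each time."""
    low, high = min(values), max(values)
    width = (high - low) / num_classes
    return [low + i * width for i in range(num_classes + 1)]


def tally(values, limits):
    """Count how many values fall in each class interval."""
    counts = [0] * (len(limits) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            # A value on a limit goes to the lower class (assumed convention);
            # the final class also catches the maximum despite rounding.
            if v <= limits[i + 1] or i == last:
                counts[i] += 1
                break
    return counts


# Hypothetical residuals (illustration only), fewer than 25 observations,
# so we use six class intervals.
residuals = [-82, -60, -45, -40, -35, -10, -5, 0, 8, 15, 22, 40, 55, 68]

limits = class_limits(residuals, num_classes=6)
counts = tally(residuals, limits)

for lower, upper, count in zip(limits, limits[1:], counts):
    print(f"{lower:7.1f} to {upper:7.1f}: {'*' * count}")
```

The printed rows of asterisks give a rough text histogram, which is enough to judge by eye whether the shape looks bell-shaped.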

Applying Test 5
What we have found here is that the distribution of residuals almost seems to follow a Gaussian, bell-shaped, or normal distribution, but it seems to be bimodal. The second class seems to be higher than the classes before and after it. A bimodal distribution is one that we are always concerned about: it's not bell-shaped, and it is usually an indication that you may have some other correlation, something else in play. To remove that, so that all your residuals are random and not due to some other cause, you may want to consult someone who has greater statistical expertise, but we don't have that here. We look at this and say, well gosh, we know there's something going on here, but we'll just go on and proceed with our analysis.

Graph of Actual vs. Predicted Y


It's recommended that once you have the regression output, you obtain a graph of the actual versus the predicted values. We'll do this for our example. The reason for doing that is that a picture is worth 1,000 words: with this graph of actual versus predicted, you can get a gut feel for whether you have a good model or not. Clearly, the closer the actual values are to the predicted ones, the happier you would tend to feel. You can always build this graph separately, or perhaps your multiple regression package provides it for you.

Graph Of Actual vs. Predicted Y for Example


Here, we have downloaded the information for you and we have shown you what our actual, the red, looks like in comparison to the blue, the predicted. We can see that there is a fair correspondence between the blue line and the red line and this is what we are looking for. Keep in mind that forecasting is imprecise; the red line is never going to be exactly superimposed on the blue line. There are always going to be discrepancies.

Making a Forecast
To make a forecast into the future, we have to remember that we hypothesized a relationship that says our forecasted value is an intercept, β0, plus the slope for automobile production, β1, times automobile production, X1, plus the slope for building contracts awarded, β2, times building contracts awarded, X2. We've indicated before that the best values for β0, β1, and β2 are obtained by regression, so we look at our last regression output, which in our case was the regression, no INT output.

Making a Forecast (Cont.)


When we look at our last regression output, in our case the regression, no INT output, you find a column, down below where we did test three, that says Coefficients. These are in fact the values of β0, β1, and β2. So our forecasting equation becomes: Y*, our estimate of sales, is zero (the intercept), plus 39.776 and some times X1, plus 10.6886 and some times X2. This is the equation that will give us a forecast.
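As a minimal sketch, the forecasting equation can be written as a small Python function. The two slopes are the truncated values quoted in the transcript, so results are approximate; the function name and the input values are hypothetical and chosen only to show the arithmetic.

```python
# Coefficients as quoted in the transcript (both slopes are truncated there).
B0 = 0.0       # intercept, zero in the no-intercept regression
B1 = 39.776    # slope for X1, automobile production
B2 = 10.6886   # slope for X2, building contracts awarded


def expected_sales(x1, x2):
    """Y* = B0 + B1 * X1 + B2 * X2."""
    return B0 + B1 * x1 + B2 * x2


# Hypothetical future values of X1 and X2, for illustration only.
y_star = expected_sales(5.0, 30.0)
print(round(y_star, 3))
```

In practice the inputs would be the externally published forecasts of automobile production and building contracts awarded for the year being forecast.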

Making a Forecast (Cont.)


The regression equation was obtained from a sample of data; hence, we have to think of Y* as the expected estimate of sales. The actual value of sales is a random variable whose mean is Y* and which has a variability, a standard error. That standard error is found right under the adjusted R Square in the regression output.

Making a Forecast (Cont.)


Keeping that in mind, we go back to our example. Remember that we have a 14-year period of history. We want a forecast, and let's assume that we want it for years 15, 16, 17, 18 and 19. We had independent variable data together with our sales data for the historical period, but we are now stuck with the fact that if we want to forecast our dependent variable, we have to have values for X1 and X2 into the future. So the question is, where do we get them? Typically, we have to select independent variables for which there are estimated future values. In our case, automobile production is something that is typically forecasted by industry and governmental groups, as is building activity. So we need to look at the forecasts generated by the automobile industry, by builders, or by government economic agencies and see what they predict over these next several years. We have to use those figures as the values of the independent variables from which we calculate our dependent variable.

Making a Forecast (Cont.)


So let us now say that we have gone to those data sources and we have gotten estimates for what future automobile production is going to be, what future building contracts awarded is going to be and now we want to forecast sales.

Making a Forecast (Cont.)


When we make a forecast we should provide three pieces of information: first the expected value, then the upper confidence limit, which we refer to as UCL, and the lower confidence limit, LCL. Statistically, we are talking about a confidence interval about the mean, which is the expected forecast. The lower and upper confidence limits bound the region about the expected value that contains 95% of all possible values that the true value of future sales can take.

Making a Forecast (Cont.)


Using the figures from our example, we would calculate the expected sales, Y*, by simply plugging the values of X1 and X2 for the future years into this formula. The upper confidence limit is the expected sales, Y*, plus 1.96 times the standard error. The lower confidence limit is the expected sales, Y*, minus 1.96 times the standard error. Now, these figures are approximations; I would call them engineering approximations. If you go to multiple regression textbooks, they have some other terms in there to account for going farther into the future. I think that these particular values are good enough for basic purposes, understanding that we're going to be wrong anyway.

Making a Forecast (Cont.)


Applying those formulas, we find that for year 15 our expected value is 587.01. The lower confidence limit is that value minus 1.96 times the standard error, and the upper confidence limit is the expected value plus 1.96 times the standard error. So here again, we have the values of the independent variables that we got from industry groups, and the expected value is the best single value for a forecast that we can have. But we are better off saying, for example, that in year 19 we are 95% confident that our sales are going to be between 539.76 million and 683.58 million, with the expected number being 611.67.
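The confidence limit arithmetic can be checked with a short Python sketch. The transcript does not quote the standard error itself, so here it is backed out from the year 19 limits; that derived value (roughly 36.69) is an inference from the quoted figures, not a number read off the regression output, and the function names are hypothetical.

```python
Z_95 = 1.96  # multiplier used in the transcript for a 95% interval


def forecast_interval(y_star, std_error, z=Z_95):
    """Return (LCL, UCL) = Y* -/+ z * standard error."""
    half_width = z * std_error
    return y_star - half_width, y_star + half_width


def implied_std_error(lcl, ucl, z=Z_95):
    """Back out the standard error from a quoted pair of limits."""
    return (ucl - lcl) / (2 * z)


# Year 19 figures quoted in the transcript.
se = implied_std_error(539.76, 683.58)     # roughly 36.69 (derived)
lcl, ucl = forecast_interval(611.67, se)   # reproduces the quoted limits
print(round(se, 2), round(lcl, 2), round(ucl, 2))
```

Because this simple formula uses the same standard error for every year, the interval has the same width for years 15 through 19; the additional textbook terms mentioned above would widen it further into the future.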

Appendix
This module provides an appendix with instructions for doing everything that we have shown in this module using the Excel data analysis add-in for multiple regression. This is not a recommendation that you should choose this specific tool for multiple regression. There are many fine products, but we thought it would be useful to illustrate what might be involved in regression by providing this appendix.

Multiple Regression with Excel


First of all, for what we show in this appendix, you must have Excel 2007 installed on your computer. Once you have verified that, you can follow these steps to get your data analysis link. Of course, you can also click the help button for specific details.

Obtaining the Data Analysis Link


And when you do, then the Microsoft Excel help shows you how to load the analysis tool, which is what I call a data analysis add-in and here it gives you again in more specific details, exactly how to get this data analysis installed.

The Regression Dialog Box


When you want to run a regression, you go to data analysis, you ask for regression, and this particular dialog box pops up.

Filling in the Regression Dialog Box


Once you ask for the dialog box and you get it, you fill it in the way that we have shown here. Your Input Y Range is the range of cells where you find your dependent variable Y, and that is from cell C8 to cell C22. Your Input X Range refers to the matrix of all of your independent variables, and they have to be contiguous to each other in your data set. In this case, you can see that our two independent variables are X1 and X2; the first cell in that matrix is D8 and it goes to the lower right-hand corner, E22. So those are the addresses of both our dependent and independent variables. Since the cell ranges we've entered include the column headings, Y for the dependent variable and X1 and X2 for the independent variables, we check the box where it says Labels. We then indicate that we want a new worksheet. What that means is we want the output of the regression to come out in a separate worksheet, and then we have to name that worksheet. We call it regression all, meaning we have not taken any variables out of it yet. Then finally, we do want to get the residuals that we have talked about, so we check the box that says Residuals. So we've entered the cell addresses for the dependent variable and for the independent variables, we've indicated that we have labels in those addresses, we've given a name to the sheet where the results will appear, and we've indicated that we want to get the residuals. That then allows you to go to the next step.

Explanation of Dialog Box


Now, we've summarized these steps in this particular slide, and once we have completed the box as we showed in the prior slide, you can click OK.

BMR Output for Example


And after we say OK you would get the information that we have shown in the main body of this particular module, namely your summary output, your ANOVA output, as well as your residual output.

Example for Getting Correlation Matrix


In order to conduct test three, we had to get a correlation matrix. We obtained it by going to the data analysis menu that comes up when you ask for the add-in. There we look for the entry that says Correlation, which we show with the arrow labeled A, and we select it. When we select it we get a box that is titled Correlation, and we fill it out, beginning with the input range. The input range is the matrix where all of your dependent and independent variables are located. The initial cell for that matrix is cell C6, and the bottom right-hand corner, the end of that matrix, is cell E20. As before, we have included the labels of the columns Y, X1 and X2 in our input range, so we check Labels to indicate that. Next we indicate that we want a new worksheet, and we're going to call it correlation all. Once we have filled in this information, we can click OK.

Instructions for Getting Correlation Matrix


Shown here is a summary of the information we just covered for reference purposes.

Example Correlation Output


If you say OK, you get this information, which is the correlation matrix that we showed in the main body of the module.
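For readers working outside Excel, a correlation matrix for Y, X1 and X2 can be sketched in plain Python using the standard Pearson formula. The data values are made up for illustration and the function names are hypothetical; only the column names follow the transcript.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Correlation of every column with every other column, keyed by name."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Hypothetical data, for illustration only (not the course data).
data = {
    "Y": [10.0, 12.0, 15.0, 19.0, 24.0],
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [8.0, 7.0, 9.0, 6.0, 10.0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

A high correlation between X1 and X2 in such a matrix is the signal of multicollinearity defined in the glossary.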

Removing the Intercept


In order to remove the intercept in Excel, we go back to the regression dialog box that we showed you at the beginning of the appendix. When you bring it back up, you will still see the entries that you provided earlier; leave those alone. All you have to do to remove the intercept is check the box that says Constant is Zero. Remember that constant and intercept have the same meaning. Once you check that box, you have to rerun the regression, but before you say OK you have to change the name of the output worksheet. You want to put the new results into a worksheet called regression, no INT, so you can differentiate it from the prior regression, which was called regression all.

Example of no int Regression Output


So when you do that, then you get the result that is shown here and that was used in the main body of the module.
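What forcing the constant to zero does can be sketched in plain Python: with no intercept and two independent variables, least squares reduces to a pair of normal equations that can be solved directly. This is an illustration of the technique, not a description of Excel's internal method; the data are made up and the function name is hypothetical.

```python
def fit_no_intercept(x1, x2, y):
    """Least-squares slopes (b1, b2) for y = b1*x1 + b2*x2 with no intercept."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("X1 and X2 are perfectly collinear")
    # Cramer's rule on the 2x2 normal equations.
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Hypothetical data built so that y = 2*x1 + 3*x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 1.0, 4.0, 3.0]
y = [8.0, 7.0, 18.0, 17.0]

b1, b2 = fit_no_intercept(x1, x2, y)
predicted = [b1 * a + b2 * b for a, b in zip(x1, x2)]
residuals = [t - p for t, p in zip(y, predicted)]
print(b1, b2)
```

The residuals computed at the end are the actual values minus the predicted values, the same quantity the Residuals option writes to the output worksheet.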

Instructions for Getting Histogram with Excel


In order to construct a histogram with Excel, we have to do some manual work, because Excel arbitrarily selects the number of class intervals that it will give you, when in fact there are some statistical rules of thumb that should be observed when you construct a histogram. These rules of thumb can be found in most statistics books, and certainly in the ones listed in the reference section. Typically they say that if you have fewer than 25 observations, you should have five or six class intervals. If you have between 25 and 50 observations (data points, or periods), you should have between seven and 14 classes. If you have more than 50 data points, you should have between 15 and 20 class intervals. We have selected five class intervals, and given that, we need to determine the class width. The class width is the range of the data divided by the number of class intervals, and the range, you may recall, is simply the highest value minus the lowest value, which in our case is 133.6056. Consequently, the class width is the range divided by five, which is 26.72. Now that we have the class width, our first class upper limit is the minimum residual value plus the class width. The minimum residual value, which you may check in your data table, is -82.73; to that we add 26.72 and we get -56 and some for our first class upper limit. The next class upper limit is the previous class upper limit plus the class width. We show that information in the upper right of this slide. In class number one we have an upper limit of -56 and some, in our second class an upper limit of -29 and some, and so on, all the way up to the five intervals or classes that we wanted.
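The rule of thumb and the class limits just described can be reproduced with a short Python sketch, using the minimum residual (-82.73) and range (133.6056) quoted in the transcript. The function names are hypothetical, and because the quoted minimum is itself rounded, the limits are approximate.

```python
def suggested_class_range(n_observations):
    """Rule of thumb from the transcript: (fewest, most) class intervals."""
    if n_observations < 25:
        return (5, 6)
    if n_observations <= 50:
        return (7, 14)
    return (15, 20)


def upper_limits(minimum, data_range, n_classes):
    """Upper limit of each class: previous limit plus the class width."""
    width = data_range / n_classes
    return [round(minimum + width * (i + 1), 2) for i in range(n_classes)]


print(suggested_class_range(14))            # 14 years of history
limits = upper_limits(-82.73, 133.6056, 5)  # figures quoted in the transcript
print(limits)
```

The first two limits, -56.01 and -29.29, correspond to the "-56 and some" and "-29 and some" mentioned above.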

Example of Getting Histogram with Excel (Cont.)


Now, to use this information in Excel, we go to the Excel data analysis menu, ask for a histogram, and then enter the information in the histogram box. The input range is all the cell addresses for the residuals, starting at cell C3 and going to cell C17. Your second entry is your bin range. Bins and class intervals are the same thing in Excel, so a bin is simply a class interval, and what Excel wants are the upper limits. You put in your list of upper limits, and you leave in the label, Limit, so the bin range goes from B21, which is where the label is, all the way to B26. We've decided that we want the output to be located in an arbitrary location on this worksheet, so we're going to pick cell F3 as the place to see the output. Then, very importantly, you want to be sure that you ask for the chart output, so we have checked the Chart Output box, shown under G.

Instructions for Getting Histogram with Excel (Cont.)


Shown here is a summary of the information we just covered for reference purposes.

Excel Histogram Output


We will get this output information from the Excel histogram function. Remember, we calculated this and this was the data that we had and then the histogram output gives us this information, the frequencies and the histogram, the graph of these frequencies.

Instructions for Generating Y vs Predicted Y Graph


The last function in Excel that we performed was to generate the Y versus predicted Y graph, that is, the graph where we showed the actual and the forecasted values together. We have to obtain the predicted Y data from the regression, no INT output and the Y data from the original data set. We copy both of those, paste them in a new area of the spreadsheet, and highlight both columns.

Then from the Excel Insert tab we click Line and then click one of the options; I usually click the first option. Doing this creates the actual versus predicted Y graph. The purpose of this graph is to give a visual of how the estimates for Y obtained from the regression compare to the actual values.

Example of Actual vs. Predicted Plot


Here we have collected the predicted Y from the regression, no INT output. We've gotten the Y from the original data, put the two side by side on this spreadsheet, and highlighted them. Now we go to the Insert tab, which is next to the Home tab at the top of the Excel spreadsheet, and look for the line graph icon that we show here. Once we click that, lo and behold, we can easily get this particular graph. This then allows you to get a visual of how well your regression predicted your actuals.

Course Summary
In this tutorial we reviewed the importance of demand forecasting and the steps involved in using multiple regression to create a demand forecast. We used the data analysis add-in for Microsoft Excel to demonstrate how to create a demand forecast using multiple regression.

Glossary

Residual
The difference between actual demand and forecasted demand.

Sum of Squares
The sum of the squared residuals over all data periods. It is the measure of forecast goodness in multiple regression.

Multiple Regression
A statistical technique that expresses a dependent variable, such as demand, as an intercept plus a weighted sum of two or more independent variables, with the weights (the regression coefficients) chosen to minimize the sum of squares of the residuals.

Regression Coefficients
The coefficients associated with each term of a regression model. They are obtained from the regression analysis and minimize the sum of squared residuals.

Robust Model
A regression model that is statistically significant (passes all 5 tests for robustness).

Adjusted R Square
Indicates the percentage of variability of the dependent variable explained by the regression model.

Multicollinearity
The existence of significant correlation between previously hypothesized independent variables.

T test
A test used in multiple regression to determine if the regression coefficients are significantly different from zero.

Expected forecast
The value of the dependent variable as computed from the regression equation.

Upper Confidence Limit of a forecast
Together with the lower confidence limit, it forms an interval within which the true value of the forecast will lie with a predetermined probability (95% for our example).

References

M. S. Nixon, T. Tan, R. Chellappa, "Human Identification Based on Gait," Springer, 2006, ISBN 978-0-387-24424-2.
M. S. Nixon, J. N. Carter, "Automatic Recognition by Gait," Proceedings of the IEEE, vol. 94, no. 11, pp. 2013-2024, Nov. 2006.
A. Kale, A. Sundaresan, A. N. Rajagopalan, N. P. Cuntoor, A. K. Roy-Chowdhury, V. Kruger, R. Chellappa, "Identification of humans using gait," IEEE Transactions on Image Processing, vol. 13, no. 9, pp. 1163-1173, Sept. 2004.
J. Han, B. Bhanu, "Individual recognition using gait energy image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 2, pp. 316-322, Feb. 20.
S. Sarkar, Z. Liu, "Gait Recognition," in Handbook of Biometrics, Springer, 2008.
Z. Liu and S. Sarkar, "Improved Gait Recognition by Gait Dynamics Normalization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 6, pp. 863-876, June 2006.
Z. Liu and S. Sarkar, "Effect of Silhouette Quality on Hard Problems in Gait Recognition," IEEE Transactions on Systems, Man, and Cybernetics-Part B, vol. 35, no. 2, pp. 170-183, Apr. 2005.
S. Sarkar, P. Jonathon Phillips, Z. Liu, I. Robledo, P. Grother, K. Bowyer, "The Human ID Gait Challenge Problem: Data Sets, Performance, and Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 2, pp. 162-177, Feb. 2005.
H. Vajaria, T. Islam, P. Mohanty, S. Sarkar, R. Sankar, R. Kasturi, "Evaluation and analysis of a face and voice outdoor multi-biometric system," Pattern Recognition Letters, vol. 28, no. 12, pp. 1572-1580, Sept. 2007.
Z. Liu and S. Sarkar, "Outdoor recognition at a distance by fusing gait and face," Image and Vision Computing, vol. 25, no. 6, pp. 817-832, June 2007.
