
1 Simple Linear Regression I: Least Squares Estimation

Textbook Sections: 18.1-18.3

Previously, we have worked with a random variable x that comes from a population that is normally distributed with mean μ and variance σ². We have seen that we can write x in terms of μ and a random error component ε, that is, x = μ + ε. For the time being, we are going to change our notation for our random variable from x to y. So, we now write y = μ + ε. We will now find it useful to call the random variable y a dependent or response variable. Many times, the response variable of interest may be related to the value(s) of one or more known or controllable independent or predictor variables. Consider the following situations:

LR1 A college recruiter would like to be able to predict a potential incoming student's first-year GPA (y) based on known information concerning high school GPA (x1) and college entrance examination score (x2). She feels that the student's first-year GPA will be related to the values of these two known variables.

LR2 A marketer is interested in the effect of changing shelf height (x1) and shelf width (x2) on the weekly sales (y) of her brand of laundry detergent in a grocery store.

LR3 A psychologist is interested in testing whether the amount of time to become proficient in a foreign language (y) is related to the child's age (x).

In each case we have at least one variable that is known (in some cases it is controllable), and a response variable that is a random variable. We would like to fit a model that relates the response to the known or controllable variable(s). The main reasons that scientists and social researchers use linear regression are the following:

1. Prediction: To predict a future response based on known values of the predictor variables and past data related to the process.

2. Description: To measure the effect of changing a controllable variable on the mean value of the response variable.

3. Control: To confirm that a process is providing responses (results) that we expect under the present operating conditions (measured by the level(s) of the predictor variable(s)).

1.1 A Linear Deterministic Model

Suppose you are a vendor who sells a product that is in high demand (e.g. cold beer on the beach, cable television in Gainesville, or life jackets on the Titanic, to name a few). If you begin your day with 100 items, have a profit of $10 per item, and an overhead of $30 per day, you know exactly how much profit you will make that day, namely 100(10) − 30 = $970. Similarly, if you begin the day with 50 items, you can also state your profits with certainty. In fact, for any number of items you begin the day with (x), you can state what the day's profits (y) will be. That is, y = 10x − 30.
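The vendor's profit rule can be evaluated directly; the function below is just a sketch of the deterministic model above (the function name is ours, chosen for illustration).

```python
def daily_profit(items, unit_profit=10.0, overhead=30.0):
    """Deterministic linear model y = beta1*x + beta0, here beta1 = 10 and beta0 = -30."""
    return unit_profit * items - overhead

print(daily_profit(100))  # 970.0
print(daily_profit(50))   # 470.0
```

For any starting inventory x, the profit is known with certainty; there is no random component yet.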

This is called a deterministic model. In general, we can write the equation for a straight line as

y = β0 + β1x,

where β0 is called the y-intercept and β1 is called the slope. β0 is the value of y when x = 0, and β1 is the change in y when x increases by 1 unit. In many real-world situations, the response of interest (in this example it is profit) cannot be explained perfectly by a deterministic model. In this case, we make an adjustment for random variation in the process.

1.2 A Linear Probabilistic Model
The adjustment people make is to write the mean response as a linear function of the predictor variable. This way, we allow for variation in individual responses (y), while associating the mean linearly with the predictor x. The model we fit is as follows:

E(y|x) = β0 + β1x,

and we write the individual responses as

y = β0 + β1x + ε.

We can think of y as being broken into a systematic component and a random component:

y = (β0 + β1x) + ε,

where β0 + β1x is the systematic part and ε is the random part. Here x is the level of the predictor variable corresponding to the response, β0 and β1 are unknown parameters, and ε is the random error component corresponding to the response, whose distribution we assume is N(0, σ), as before. Further, we assume the error terms are independent from one another; we discuss this in more detail in a later chapter. Note that β0 can be interpreted as the mean response when x = 0, and β1 can be interpreted as the change in the mean response when x is increased by 1 unit. Under this model, we are saying that y|x ~ N(β0 + β1x, σ). Consider the following example.

Example 1.1: Coffee Sales and Shelf Space

A marketer is interested in the relation between the width of the shelf space for her brand of coffee (x) and weekly sales (y) of the product in a suburban supermarket (assume the height is always at eye level). Marketers are well aware of the concept of compulsive purchases, and know that the more shelf space their product takes up, the higher the frequency of such purchases. She believes that in the range of 3 to 9 feet, the mean weekly sales will be linearly related to the width of the shelf space. Further, among weeks with the same shelf space, she believes that sales will be normally distributed with unknown standard deviation σ (that is, σ measures how variable weekly sales are at a given amount of shelf space). Thus, she would like to fit a model relating weekly sales y to the amount of shelf space x her product receives that week. That is, she is fitting the model:

y = β0 + β1x + ε,   so that   y|x ~ N(β0 + β1x, σ).

One limitation of linear regression is that we must restrict our interpretation of the model to the range of values of the predictor variables that we observe in our data. We cannot assume this linear relation continues outside the range of our sample data. We often refer to β0 + β1x as the systematic component of y and ε as the random component.

1.3 Least Squares Estimation of β0 and β1
We now have the problem of using sample data to compute estimates of the parameters β0 and β1. First, we take a sample of n subjects, observing values y of the response variable and x of the predictor variable. We would like to choose as estimates for β0 and β1 the values b0 and b1 that best fit the sample data. Consider the coffee example mentioned earlier. Suppose the marketer conducted the experiment over a twelve-week period (4 weeks with 3' of shelf space, 4 weeks with 6', and 4 weeks with 9'), and observed the sample data in Table 1.

   Shelf Space (x)   Weekly Sales (y)     Shelf Space (x)   Weekly Sales (y)
   6                 526                  6                 434
   3                 421                  3                 443
   6                 581                  9                 590
   9                 630                  6                 570
   3                 412                  3                 346
   9                 560                  9                 672

Table 1: Coffee sales data for n = 12 weeks


Figure 1: Plot of coffee sales vs amount of shelf space

Now, look at Figure 1. Note that while there is some variation among the weekly sales at 3', 6', and 9', respectively, there is a trend for the mean sales to increase as shelf space increases. If we define the fitted equation to be

ŷ = b0 + b1x,

we can choose the estimates b0 and b1 to be the values that minimize the distances of the data points to the fitted line. Now, for each observed response yi, with a corresponding predictor variable xi, we obtain a fitted value ŷi = b0 + b1xi. So, we would like to minimize the sum of the squared distances of each observed response to its fitted value. That is, we want to minimize the error sum of squares, SSE, where:

SSE = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} (yi − (b0 + b1xi))².

A little bit of calculus can be used to obtain the estimates:
b1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = SSxy / SSxx,

and

b0 = Σ_{i=1}^{n} yi / n − b1 Σ_{i=1}^{n} xi / n = ȳ − b1x̄.

An alternative formula, but exactly the same mathematically, is to compute the sample covariance of x and y, as well as the sample variance of x, then take the ratio. This is the approach your book uses, but it is extra work compared with the formula above:

cov(x, y) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / (n − 1) = SSxy / (n − 1),

s²x = Σ_{i=1}^{n} (xi − x̄)² / (n − 1) = SSxx / (n − 1),

b1 = cov(x, y) / s²x.
Some shortcut equations, known as the corrected sums of squares and cross-products, that while not very intuitive are very useful in computing these and other estimates are:

SSxx = Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)² / n

SSxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = Σ_{i=1}^{n} xiyi − (Σ_{i=1}^{n} xi)(Σ_{i=1}^{n} yi) / n

SSyy = Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} yi² − (Σ_{i=1}^{n} yi)² / n
Example 1.1 (Continued): Coffee Sales and Shelf Space

For the coffee data, we observe the summary calculations in Table 2.

   Week   Space (x)   Sales (y)   x²     xy      y²
   1      6           526         36     3156    276676
   2      3           421         9      1263    177241
   3      6           581         36     3486    337561
   4      9           630         81     5670    396900
   5      3           412         9      1236    169744
   6      9           560         81     5040    313600
   7      6           434         36     2604    188356
   8      3           443         9      1329    196249
   9      9           590         81     5310    348100
   10     6           570         36     3420    324900
   11     3           346         9      1038    119716
   12     9           672         81     6048    451584
   Sum    72          6185        504    39600   3300627

Table 2: Summary calculations for the coffee sales data

SSxx = Σ(x − x̄)² = Σx² − (Σx)²/n = 504 − (72)²/12 = 504 − 432 = 72

SSxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n = 39600 − (72)(6185)/12 = 39600 − 37110 = 2490

SSyy = Σ(y − ȳ)² = Σy² − (Σy)²/n = 3300627 − (6185)²/12 = 3300627 − 3187852.1 = 112774.9

From these, we obtain the least squares estimates of the true linear regression relation (β0 + β1x):

b1 = SSxy / SSxx = 2490 / 72 = 34.5833

b0 = ȳ − b1x̄ = 6185/12 − 34.5833(72/12) = 515.4167 − 207.5000 = 307.9167

ŷ = b0 + b1x = 307.9167 + 34.5833x
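These least squares calculations can be reproduced directly from the raw data; the script below is a minimal sketch using only the corrected sums of squares, with the data transcribed from Table 2.

```python
# Least squares estimates for the coffee data (Example 1.1),
# computed from the corrected sums of squares.
x = [6, 3, 6, 9, 3, 9, 6, 3, 9, 6, 3, 9]
y = [526, 421, 581, 630, 412, 560, 434, 443, 590, 570, 346, 672]
n = len(x)

Sx, Sy = sum(x), sum(y)
SSxx = sum(xi * xi for xi in x) - Sx ** 2 / n
SSxy = sum(xi * yi for xi, yi in zip(x, y)) - Sx * Sy / n

b1 = SSxy / SSxx            # slope
b0 = Sy / n - b1 * Sx / n   # intercept

print(round(SSxx, 1), round(SSxy, 1))  # 72.0 2490.0
print(round(b1, 4), round(b0, 4))      # 34.5833 307.9167
```

The same two-pass pattern (sums first, then the ratio) works for any simple regression fit.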

So the fitted equation, estimating the mean weekly sales when the product has x feet of shelf space, is ŷ = b0 + b1x = 307.9167 + 34.5833x. Our interpretation for b1 is: the estimated increase in mean weekly sales due to increasing shelf space by 1 foot is 34.5833 bags of coffee. Note that this should only be interpreted within the range of x values that we have observed in the experiment, namely x = 3 to 9 feet.

Example 1.2: Computation of a Stock Beta

A widely used measure of a company's performance is its beta. This is a measure of the firm's stock price volatility relative to the overall market's volatility. One common use of beta is in the capital asset pricing model (CAPM) in finance, but you will hear betas quoted on many business news shows as well. It is computed as (Value Line):

    The beta factor is derived from a least squares regression analysis between weekly percent changes in the price of a stock and weekly percent changes in the price of all stocks in the survey over a period of five years. In the case of shorter price histories, a smaller period is used, but never less than two years.

In this example, we will compute the stock beta over a 28-week period for Coca-Cola and Anheuser-Busch, using the S&P500 as the market for comparison. Note that this period is only about 10% of the period used by Value Line. Note: while there are 28 weeks of data, there are only n = 27 weekly changes. Table 3 provides the dates and weekly closing prices of the S&P500, Coca-Cola, and Anheuser-Busch; the weekly percent changes are computed from consecutive closes. The following summary calculations are also provided, with x representing the S&P500, yC representing Coca-Cola, and yA representing Anheuser-Busch. All calculations should be based on 4 decimal places. Figure 2 gives the plot and least squares regression line for Anheuser-Busch, and Figure 3 gives the plot and least squares regression line for Coca-Cola.

Σx = 15.5200   ΣyC = 2.4882   ΣyA = 2.4281   Σx² = 124.6354   Σy²C = 461.7296   Σy²A = 195.4900   ΣxyC = 161.4408   ΣxyA = 84.7527

a) Compute SSxx, SSxyC, and SSxyA.

b) Compute the stock betas for Coca-Cola and Anheuser-Busch.

   Closing Date   S&P Price   A-B Price   C-C Price
   05/20/97       829.75      43.00       66.88
   05/27/97       847.03      42.88       68.13
   06/02/97       848.28      42.88       68.50
   06/09/97       858.01      41.50       67.75
   06/16/97       893.27      43.00       71.88
   06/23/97       898.70      43.38       71.38
   06/30/97       887.30      42.44       71.00
   07/07/97       916.92      43.69       70.75
   07/14/97       916.68      43.75       69.81
   07/21/97       915.30      45.50       69.25
   07/28/97       938.79      43.56       70.13
   08/04/97       947.14      43.19       68.63
   08/11/97       933.54      43.50       62.69
   08/18/97       900.81      42.06       58.75
   08/25/97       923.55      43.38       60.69
   09/01/97       899.47      42.63       57.31
   09/08/97       929.05      44.31       59.88
   09/15/97       923.91      44.00       57.06
   09/22/97       950.51      45.81       59.19
   09/29/97       945.22      45.13       61.94
   10/06/97       965.03      44.75       62.38
   10/13/97       966.98      43.63       61.69
   10/20/97       944.16      42.25       58.50
   10/27/97       941.64      40.69       55.50
   11/03/97       914.62      39.94       56.63
   11/10/97       927.51      40.81       57.00
   11/17/97       928.35      42.56       57.56
   11/24/97       963.09      43.63       63.75

Table 3: Weekly closing stock prices: S&P 500, Anheuser-Busch, Coca-Cola
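A beta computation of this kind can be sketched as follows. The closing prices below are a short hypothetical series, not the Table 3 data; the beta is simply the least squares slope of the stock's weekly percent changes regressed on the market's.

```python
# Sketch: a stock beta as the least squares slope of stock weekly % changes
# on market weekly % changes. Prices here are hypothetical, for illustration.
def pct_changes(prices):
    """Weekly percent changes from consecutive closing prices."""
    return [100.0 * (b - a) / a for a, b in zip(prices, prices[1:])]

def beta(stock_closes, market_closes):
    x = pct_changes(market_closes)  # market % changes (predictor)
    y = pct_changes(stock_closes)   # stock % changes (response)
    n = len(x)
    ssxx = sum(v * v for v in x) - sum(x) ** 2 / n
    ssxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    return ssxy / ssxx              # b1 = SSxy / SSxx

market = [100.0, 102.0, 101.0, 104.0, 103.0]
stock = [50.0, 51.5, 50.8, 53.0, 52.2]
print(round(beta(stock, market), 2))
```

A beta above 1 indicates the stock amplifies market movements; below 1, it dampens them.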

The following (approximate) data were published by Joel Dean in the 1941 article "Statistical Cost Functions of a Hosiery Mill" (Studies in Business Administration, vol. 14, no. 3).

Figure 2: Plot of weekly percent stock price changes for Anheuser-Busch versus S&P 500 and least squares regression line


Figure 3: Plot of weekly percent stock price changes for Coca-Cola versus S&P 500 and least squares regression line

y: Monthly total production cost (in $1000s).
x: Monthly output (in thousands of dozens produced).

A sample of n = 48 months of data were used, with xi and yi being measured for each month. The parameter β1 represents the change in mean cost per unit increase in output (unit variable cost), and β0 represents the true mean cost when the output is 0, without shutting down the plant (fixed cost). The data are given in Table 4 (the order is arbitrary, as the data are printed in table form, and were obtained from visual inspection/approximation of a plot).

   i    xi      yi       i    xi      yi       i    xi      yi
   1    46.75   92.64    17   36.54   91.56    33   32.26   66.71
   2    42.18   88.81    18   37.03   84.12    34   30.97   64.37
   3    41.86   86.44    19   36.60   81.22    35   28.20   56.09
   4    43.29   88.80    20   37.58   83.35    36   24.58   50.25
   5    42.12   86.38    21   36.48   82.29    37   20.25   43.65
   6    41.78   89.87    22   38.25   80.92    38   17.09   38.01
   7    41.47   88.53    23   37.26   76.92    39   14.35   31.40
   8    42.21   91.11    24   38.59   78.35    40   13.11   29.45
   9    41.03   81.22    25   40.89   74.57    41   9.50    29.02
   10   39.84   83.72    26   37.66   71.60    42   9.74    19.05
   11   39.15   84.54    27   38.79   65.64    43   9.34    20.36
   12   39.20   85.66    28   38.78   62.09    44   7.51    17.68
   13   39.52   85.87    29   36.70   61.66    45   8.35    19.23
   14   38.05   85.23    30   35.10   77.14    46   6.25    14.92
   15   39.16   87.75    31   33.75   75.47    47   5.45    11.44
   16   38.59   92.62    32   34.29   70.37    48   3.79    12.69

Table 4: Production costs and output, Dean (1941)

This dataset has n = 48 observations, with the following summary calculations:

Σxi = 1491.23   Σxi² = 54067.42   Σyi = 3140.78   Σyi² = 238424.46   Σxiyi = 113095.80

From these quantities, we get:

SSxx = Σxi² − (Σxi)²/n = 54067.42 − (1491.23)²/48 = 54067.42 − 46328.48 = 7738.94

SSxy = Σxiyi − (Σxi)(Σyi)/n = 113095.80 − (1491.23)(3140.78)/48 = 113095.80 − 97575.53 = 15520.27

SSyy = Σyi² − (Σyi)²/n = 238424.46 − (3140.78)²/48 = 238424.46 − 205510.40 = 32914.06

b1 = SSxy / SSxx = 15520.27 / 7738.94 = 2.0055

b0 = ȳ − b1x̄ = 65.4329 − (2.0055)(31.0673) = 65.4329 − 62.3055 = 3.1274

Example 1.3: Estimating Cost Functions of a Hosiery Mill

The fitted values and residuals are:

ŷi = b0 + b1xi = 3.1274 + 2.0055xi,   ei = yi − ŷi = yi − (3.1274 + 2.0055xi),   i = 1, ..., 48.
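The hosiery-mill estimates can be recovered from the summary sums alone; the snippet below is a sketch using only the totals reported above (no raw data needed).

```python
# Hosiery-mill least squares estimates (Example 1.3) from summary sums.
n = 48
Sx, Sy = 1491.23, 3140.78               # sums of x and y
Sxx_raw, Sxy_raw = 54067.42, 113095.80  # sums of x^2 and x*y

SSxx = Sxx_raw - Sx ** 2 / n
SSxy = Sxy_raw - Sx * Sy / n
b1 = SSxy / SSxx
b0 = Sy / n - b1 * Sx / n

print(round(SSxx, 2), round(SSxy, 2))  # 7738.94 15520.27
print(round(b1, 4), round(b0, 2))      # 2.0055 3.13
```

Note the intercept agrees with the hand calculation only to about two decimals, since the hand calculation uses pre-rounded values of ȳ, x̄, and b1.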

Table 5 gives the raw data, their fitted values, and residuals. A plot of the data and regression line is given in Figure 4.

Figure 4: Estimated cost function for hosiery mill (Dean, 1941)

We have now seen how to estimate β0 and β1. Next we can obtain an estimate of the variance of the responses at a given value of x. Recall from your previous statistics course that you estimated the variance by taking the average squared deviation of each measurement from the sample (estimated) mean. That is, you calculated

s² = Σ_{i=1}^{n} (yi − ȳ)² / (n − 1).

Now that we fit the regression model, we no longer use ȳ to estimate the mean for each yi, but rather ŷi = b0 + b1xi. The estimate we use now looks similar to the previous estimate, except we replace ȳ with ŷi and we replace n − 1 with n − 2, since we have estimated 2 parameters, β0 and β1. The new estimate (which we will refer to as the estimated error variance) is:

s²e = MSE = SSE / (n − 2) = Σ_{i=1}^{n} (yi − ŷi)² / (n − 2) = [SSyy − (SSxy)²/SSxx] / (n − 2).

Table 5: Approximated monthly outputs xi, total costs yi, fitted values ŷi = 3.1274 + 2.0055xi, and residuals ei = yi − ŷi for the 48 months, Dean (1941). (The xi and yi values are those listed in Table 4.)

This estimated error variance s²e can be thought of as the "average" squared distance from each observed response to the fitted line. The word "average" is in quotes since we divide by n − 2 and not n. The closer the observed responses fall to the line, the smaller s²e is and the better our predicted values will be.

Example 1.1 (Continued): Coffee Sales and Shelf Space

For the coffee data,

s²e = [112774.9 − (2490)²/72] / (12 − 2) = (112774.9 − 86112.5) / 10 = 26662.4 / 10 = 2666.24,

and the estimated residual standard error (deviation) is se = √2666.24 = 51.64. We now have estimates for all of the parameters of the regression equation relating the mean weekly sales to the amount of shelf space the coffee gets in the store. Figure 5 shows the 12 observed responses and the estimated (fitted) regression equation.


Figure 5: Plot of coffee data and fitted equation
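The error-variance computation for the coffee data can be checked in a few lines, using the shortcut form of SSE:

```python
# Estimated error variance for the coffee data via
# SSE = SSyy - (SSxy)^2 / SSxx, divided by n - 2.
import math

n = 12
SSxx, SSxy, SSyy = 72.0, 2490.0, 112774.9

SSE = SSyy - SSxy ** 2 / SSxx
mse = SSE / (n - 2)        # s_e^2
se = math.sqrt(mse)        # residual standard deviation
print(round(SSE, 1), round(mse, 2), round(se, 2))  # 26662.4 2666.24 51.64
```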

Example 1.3 (Continued): Estimating Cost Functions of a Hosiery Mill

For the cost function data:

SSE = Σ(yi − ŷi)² = SSyy − (SSxy)²/SSxx = 32914.06 − (15520.27)²/7738.94 = 32914.06 − 31125.55 = 1788.51

s²e = MSE = SSE / (n − 2) = 1788.51 / (48 − 2) = 38.88,   se = √38.88 = 6.24.

2 Simple Regression II: Inferences Concerning β1

Textbook Section: 18.5 (and some supplementary material)

Recall that in our regression model, we are stating that E(y|x) = β0 + β1x. In this model, β1 represents the change in the mean of our response variable y as the predictor variable x increases by 1 unit. Note that if β1 = 0, we have E(y|x) = β0 + β1x = β0 + 0x = β0, which implies that the mean of our response variable is the same at all values of x. In the context of the coffee sales example, this would imply that mean sales are the same regardless of the amount of shelf space, so a marketer has no reason to purchase extra shelf space. This is like saying that knowing the level of the predictor variable does not help us predict the response variable.

Under the assumptions stated previously, namely that y ~ N(β0 + β1x, σ), our estimator b1 has a sampling distribution that is normal with mean β1 (the true value of the parameter) and standard error σ/√(Σ(xi − x̄)²). That is:

b1 ~ N(β1, σ/√SSxx)

We can now make inferences concerning β1.

2.1 A Confidence Interval for β1

First, we obtain the estimated standard error of b1 (this is the standard deviation of its sampling distribution):
sb1 = se / √SSxx = se / √((n − 1)s²x).

The interval can be written:

b1 ± t_{α/2, n−2} · se/√SSxx.

Note that se/√SSxx is the estimated standard error of b1, since we use se = √MSE to estimate σ. Also, we have n − 2 degrees of freedom instead of n − 1, since the estimate s²e has 2 estimated parameters used in it (refer back to how we calculated it above).

Example 2.1: Coffee Sales and Shelf Space

For the coffee sales example, we have the following results:

b1 = 34.5833,   SSxx = 72,   se = 51.64,   n = 12.

So a 95% confidence interval for the parameter β1 is:

34.5833 ± t_{.025,12−2} · 51.64/√72 = 34.5833 ± 2.228(6.086) = 34.5833 ± 13.560,
which gives us the range (21.024, 48.143). We are 95% confident that the true mean sales increase by between 21.024 and 48.143 bags of coffee per week for each extra foot of shelf space the brand gets (within the range of 3 to 9 feet). Note that the entire interval is positive (above 0), so we are confident that in fact β1 > 0, and the marketer is justified in pursuing extra shelf space.
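This interval can be verified numerically; the critical value t_{.025,10} = 2.228 is taken from the t table, as in the text.

```python
# 95% CI for beta1 in the coffee example, using the tabled t_{.025,10} = 2.228.
import math

b1, se, SSxx, n = 34.5833, 51.64, 72.0, 12
t_crit = 2.228                      # t_{alpha/2, n-2} from the t table

sb1 = se / math.sqrt(SSxx)          # estimated standard error of b1
lo, hi = b1 - t_crit * sb1, b1 + t_crit * sb1
print(round(sb1, 3), round(lo, 3), round(hi, 3))  # 6.086 21.024 48.143
```

Swapping in a different t quantile gives intervals at other confidence levels with no other changes.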

Example 2.2: Hosiery Mill Cost Function

b1 = 2.0055,   SSxx = 7738.94,   se = 6.24,   n = 48.

For the hosiery mill cost function analysis, we obtain a 95% confidence interval for average unit variable costs (β1). Note that t_{.025,48−2} = t_{.025,46} ≈ 2.015, since t_{.025,40} = 2.021 and t_{.025,60} = 2.000 (we could approximate this with z_{.025} = 1.96 as well).

2.0055 ± t_{.025,46} · 6.24/√7738.94 = 2.0055 ± 2.015(0.0709) = 2.0055 ± 0.1429 = (1.8626, 2.1484)

We are 95% confident that the true average unit variable costs are between $1.86 and $2.15 (this is the incremental cost of increasing production by one unit, assuming that the production process is in place).

2.2 Hypothesis Tests Concerning β1
Similar to the idea of the confidence interval, we can set up a test of hypothesis concerning β1. Since the confidence interval gives us the range of believable values for β1, it is more useful than a test of hypothesis. However, here is the procedure to test whether β1 is equal to some value, say β1⁰.

H0: β1 = β1⁰ (β1⁰ specified, usually 0)

(1) Ha: β1 ≠ β1⁰   (2) Ha: β1 > β1⁰   (3) Ha: β1 < β1⁰

T.S.: tobs = (b1 − β1⁰) / sb1,   where sb1 = se/√SSxx

(1) R.R.: |tobs| ≥ t_{α/2, n−2}   (2) R.R.: tobs ≥ t_{α, n−2}   (3) R.R.: tobs ≤ −t_{α, n−2}

(1) P-value: 2P(t ≥ |tobs|)   (2) P-value: P(t ≥ tobs)   (3) P-value: P(t ≤ tobs)

Using tables, we can only place bounds on these p-values.

Example 2.1 (Continued): Coffee Sales and Shelf Space

Suppose in our coffee example the marketer gets a set amount of space (say 6') for free, and she must pay extra for any more space. For the extra space to be profitable (over the long run), the mean weekly sales must increase by more than 20 bags; otherwise the expense outweighs the increase in sales. She wants to test whether it is worth it to buy more space. She works under the assumption that it is not worth it, and will only purchase more if she can show that it is worth it. She sets α = .05.

1. H0: β1 = 20   HA: β1 > 20

2. T.S.: tobs = (34.5833 − 20) / (51.64/√72) = 14.5833 / 6.086 = 2.396

3. R.R.: tobs > t_{.05,10} = 1.812

4. P-value: P(T > 2.396) < P(T > 2.228) = .025 and P(T > 2.396) > P(T > 2.764) = .010, so .01 < p-value < .025.

So, she has concluded that β1 > 20, and she will purchase the shelf space. Note also that the entire confidence interval was above 20, so we already knew this.

Example 2.2 (Continued): Hosiery Mill Cost Function

Suppose we want to test whether average monthly production costs increase with monthly production output. This is testing whether unit variable costs are positive (α = 0.05).

H0: β1 = 0 (mean monthly production cost is not associated with output)

HA: β1 > 0 (mean monthly production cost increases with output)

T.S.: tobs = (2.0055 − 0) / (6.24/√7738.94) = 2.0055 / 0.0709 = 28.29

R.R.: tobs > t_{0.05,46} ≈ 1.680 (or use z_{0.05} = 1.645)

P-value: P(T > 28.29) ≈ 0

We have overwhelming evidence of positive unit variable costs.

2.3 The Analysis of Variance Approach to Regression
Consider the deviations of the individual responses, yi, from their overall mean ȳ. We would like to break these deviations into two parts: the deviation of the observed value from its fitted value, ŷi = b0 + b1xi, and the deviation of the fitted value from the overall mean. See Figure 6, corresponding to the coffee sales example. That is, we'd like to write:

yi − ȳ = (yi − ŷi) + (ŷi − ȳ).

Note that all we are doing is adding and subtracting the fitted value. It so happens that, algebraically, we can show the same equality holds once we've squared each side of the equation and summed it over the n observed and fitted values. That is,

Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (yi − ŷi)² + Σ_{i=1}^{n} (ŷi − ȳ)².

These three pieces are called the total, error, and model sums of squares, respectively. We denote them as SSyy, SSE, and SSR, respectively. We have already seen that SSyy represents the total variation in the observed responses, and that SSE represents the variation in the observed responses around the fitted regression equation. That leaves SSR as the amount of the total variation that is accounted for by taking into account the predictor variable x. We can use this decomposition to test the hypothesis H0: β1 = 0 vs HA: β1 ≠ 0. We will also find this decomposition useful in subsequent sections, when we have more than one predictor variable. We first set up the Analysis of Variance (ANOVA) table in Table 6. Note that we will have to make minimal calculations to set this up, since we have already computed SSyy and SSE in the regression analysis.

   Source of Variation   Sum of Squares       Degrees of Freedom   Mean Square         F
   MODEL                 SSR = Σ(ŷi − ȳ)²     1                    MSR = SSR/1         F = MSR/MSE
   ERROR                 SSE = Σ(yi − ŷi)²    n − 2                MSE = SSE/(n − 2)
   TOTAL                 SSyy = Σ(yi − ȳ)²    n − 1

Table 6: The Analysis of Variance table for simple regression

The procedure of testing for a linear association between the response and predictor variables using the analysis of variance involves the F-distribution, which is given in Table 6 (pp. B-11 to B-16) of your text book. This is the same distribution we used in the chapter on the 1-Way ANOVA. The testing procedure is as follows:

1. H0: β1 = 0   HA: β1 ≠ 0 (this will always be a 2-sided test)

2. T.S.: Fobs = MSR/MSE

3. R.R.: Fobs > F_{1, n−2, α}

4. P-value: P(F > Fobs) (you can only get bounds on this from the tables, but computer outputs report it exactly)

Note that we already have a procedure for testing this hypothesis (see the section on inferences concerning β1), but this is an important lead-in to multiple regression.

Example 2.1 (Continued): Coffee Sales and Shelf Space

Referring back to the coffee sales data, we have already made the following calculations:
SSyy = 112774.9,   SSE = 26662.4,   n = 12.

We then also have that SSR = SSyy − SSE = 86112.5. The Analysis of Variance is given in Table 7.

   Source of Variation   Sum of Squares     Degrees of Freedom   Mean Square                  F
   MODEL                 SSR = 86112.5      1                    MSR = 86112.5/1 = 86112.5    F = 86112.5/2666.24 = 32.30
   ERROR                 SSE = 26662.4      12 − 2 = 10          MSE = 26662.4/10 = 2666.24
   TOTAL                 SSyy = 112774.9    12 − 1 = 11

Table 7: The Analysis of Variance table for the coffee data example
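The ANOVA quantities in Table 7 follow directly from SSyy and SSE; a minimal check:

```python
# ANOVA decomposition for the coffee data: SSR = SSyy - SSE, F = MSR / MSE.
n = 12
SSyy, SSE = 112774.9, 26662.4

SSR = SSyy - SSE       # model sum of squares, 1 df
MSR = SSR / 1
MSE = SSE / (n - 2)    # error mean square, n - 2 df
F = MSR / MSE
print(round(SSR, 1), round(MSE, 2), round(F, 2))  # 86112.5 2666.24 32.3
```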

To test the hypothesis of no linear association between amount of shelf space and mean weekly coffee sales, we can use the F-test described above. Note that the null hypothesis is that there is no effect on mean sales from increasing the amount of shelf space. We will use α = .01.

1. H0: β1 = 0   HA: β1 ≠ 0

2. T.S.: Fobs = MSR/MSE = 86112.5/2666.24 = 32.30

3. R.R.: Fobs > F_{1, n−2, α} = F_{1,10,.01} = 10.04

4. P-value: P(F > Fobs) = P(F > 32.30) ≈ 0

We reject the null hypothesis and conclude that β1 ≠ 0. There is an effect on mean weekly sales when we increase the shelf space.

Example 2.2 (Continued): Hosiery Mill Cost Function

For the hosiery mill data, the sums of squares for each source of variation in monthly production costs and their corresponding degrees of freedom are (from previous calculations):

Total SS: SSyy = Σ(yi − ȳ)² = 32914.06,   dfTotal = n − 1 = 47

Error SS: SSE = Σ(yi − ŷi)² = 1788.51,   dfE = n − 2 = 46

Model SS: SSR = Σ(ŷi − ȳ)² = SSyy − SSE = 32914.06 − 1788.51 = 31125.55,   dfR = 1

   Source of Variation   Sum of Squares     Degrees of Freedom   Mean Square                     F
   MODEL                 SSR = 31125.55     1                    MSR = 31125.55/1 = 31125.55     F = 31125.55/38.88 = 800.55
   ERROR                 SSE = 1788.51      48 − 2 = 46          MSE = 1788.51/46 = 38.88
   TOTAL                 SSyy = 32914.06    48 − 1 = 47

Table 8: The Analysis of Variance table for the hosiery mill cost example

The Analysis of Variance is given in Table 8. To test whether there is a linear association between mean monthly costs and monthly production output, we conduct the F-test (α = 0.05).

1. H0: β1 = 0   HA: β1 ≠ 0

2. T.S.: Fobs = MSR/MSE = 31125.55/38.88 = 800.55

3. R.R.: Fobs > F_{1, n−2, α} = F_{1,46,.05} ≈ 4.06

4. P-value: P(F > Fobs) = P(F > 800.55) ≈ 0

We reject the null hypothesis and conclude that β1 ≠ 0.

2.3.1 Coefficient of Determination

A measure of association that has a clear physical interpretation is R², the coefficient of determination. This measure is always between 0 and 1, so it does not reflect whether y and x are positively or negatively associated, and it represents the proportion of the total variation in the response variable that is accounted for by fitting the regression on x. The formula for R² is:

R² = (R)² = 1 − SSE/SSyy = SSR/SSyy = [cov(x, y)]² / (s²x s²y).

Note that SSyy = Σ(yi − ȳ)² represents the total variation in the response variable, while SSE = Σ(yi − ŷi)² represents the variation in the observed responses about the fitted equation (after taking into account x). This is why we sometimes say that R² is the proportion of the variation in y that is explained by x.

Example 2.1 (Continued): Coffee Sales and Shelf Space

For the coffee data, we can calculate R² using the values of SSxy, SSxx, SSyy, and SSE we have previously obtained.
R² = 1 − 26662.4/112774.9 = 86112.5/112774.9 = .7636

Thus, over 3/4 of the variation in sales is explained by the model using shelf space to predict sales.

For the hosiery mill data, the model (regression) sum of squares is SSR = 31125.55 and the total sum of squares is SSyy = 32914.06. To get the coefficient of determination:

R² = 31125.55/32914.06 = 0.9457
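Both coefficients of determination can be checked in one place:

```python
# R^2 = 1 - SSE/SSyy for the coffee and hosiery-mill examples.
def r_squared(ss_total, ss_error):
    return 1 - ss_error / ss_total

print(round(r_squared(112774.9, 26662.4), 4))  # 0.7636 (coffee)
print(round(r_squared(32914.06, 1788.51), 4))  # 0.9457 (hosiery mill)
```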

Almost 95% of the variation in monthly production costs is explained bythe monthly production output. 3SimpleRegressionIII EstimatingtheMeanandPredictionataParticularLevelofx,Correlation Textbook Sections: 18.7,18.8 We sometimes are interested in estimating the mean response at a particular level of the predictor variable, say x = xg. That is, wed like to estimate E(y|xg)= 0+ 1xg. The actual estimate y 9point prediction)is just = b0+ b1xg, which is simply where the .tted

line crosses x = xg. Under the previously stated normality assumptions, the estimator
.

y0is normally distributed with (xgx)mean 0+ 1xg and standard error of estimate 1+ .n(xi2x). That is: 2 i = 1

n 2 1(xg x) y0 N(0+ 1xg,


+ .n n

(xi x)2

).

i=1

Note that the standard error of the estimate is smallest at xg = x, that is at the mean of the sampled levels of the predictor variable. The standard error increases as the value xg goes away from this mean. For instance, our marketer maywish to estimate the mean sales when she has 6 of shelf space, or 7 ,or 4 . She may also wish to obtain a con.dence interval for the mean at these levels of x. 3.1 A Con.dence Interval for E(y|xg)=0+1xg Using the ideas described in the previous section, we can write out the general form for a (1)100% con.dence interval for the mean response when xg. 2 1(xg x) (b0+ b1xg)t/2,n2Se
n SSxx . . .

Example 3.1: Coffee Sales and Shelf Space

Suppose our marketer wants to compute 95% confidence intervals for the mean weekly sales at x = 4, 6, and 7 feet, respectively (these are not simultaneous confidence intervals, as were computed based on Bonferroni's method previously). Each of these intervals will depend on t_{α/2, n−2} = t_{.025,10} = 2.228 and x̄ = 6. These intervals are:

(307.9167 + 34.5833(4)) ± 2.228(51.64)√(1/12 + (4 − 6)²/72) = 446.25 ± 115.05√.1389 = 446.25 ± 42.88 = (403.37, 489.13)

(307.9167 + 34.5833(6)) ± 2.228(51.64)√(1/12 + (6 − 6)²/72) = 515.42 ± 115.05√.0833 = 515.42 ± 33.21 = (482.21, 548.63)

(307.9167 + 34.5833(7)) ± 2.228(51.64)√(1/12 + (7 − 6)²/72) = 550.00 ± 115.05√.0972 = 550.00 ± 35.87 = (514.13, 585.87)

Notice that the interval is narrowest at xg = 6. Figure 7 is a computer-generated plot of the data, the fitted equation, and the confidence limits for the mean weekly coffee sales at each value of x.

Figure 7: Plot of coffee data, fitted equation, and 95% confidence limits for the mean

Example 3.2: Hosiery Mill Cost Function

Suppose the plant manager is interested in mean costs among months where output is 30,000 items produced (xg = 30). She wants a 95% confidence interval for this true unknown mean. Recall:

b0 = 3.1274,   b1 = 2.0055,   se = 6.24,   n = 48,   x̄ = 31.0673,   SSxx = 7738.94,   t_{.025,46} ≈ 2.015.

Then the interval is obtained as:

3.1274 + 2.0055(30) ± 2.015(6.24)√(1/48 + (30 − 31.0673)²/7738.94) = 63.29 ± 2.015(6.24)√.0210 = 63.29 ± 1.82 = (61.47, 65.11)

We can be 95% confident that the mean production costs among months where 30,000 items are produced is between $61,470 and $65,110 (recall that units were thousands for x and thousands of dollars for y). A plot of the data, regression line, and 95% confidence bands for mean costs is given in Figure 8.
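The same interval can be computed with a small helper; the inputs below are the hosiery-mill estimates just listed, and the helper name is ours.

```python
# 95% CI for the mean response at x_g:
# (b0 + b1*x_g) +/- t * s_e * sqrt(1/n + (x_g - xbar)^2 / SSxx)
import math

def mean_response_ci(xg, b0, b1, se, n, xbar, SSxx, t_crit):
    fit = b0 + b1 * xg
    half = t_crit * se * math.sqrt(1.0 / n + (xg - xbar) ** 2 / SSxx)
    return fit - half, fit + half

lo, hi = mean_response_ci(30, 3.1274, 2.0055, 6.24, 48, 31.0673, 7738.94, 2.015)
print(round(lo, 2), round(hi, 2))  # 61.47 65.11
```

Calling the helper at several xg values traces out the confidence bands shown in Figure 8.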

In many situations, a researcher would like to predict the outcome of the response variable at a specific level of the predictor variable. In the previous section we estimated the mean response; in this section we are interested in predicting a single outcome. In the context of the coffee sales example, this would be like trying to predict next week's sales given that we know we will have 6' of shelf space.

Figure 8: Plot of hosiery mill cost data, fitted equation, and 95% confidence limits for the mean

First, suppose you know the parameters β0 and β1. Then you know that the response variable, for a fixed level of the predictor variable (x = xg), is normally distributed with mean E(y|xg) = β0 + β1·xg and standard deviation σ. We know from previous work with the normal distribution that approximately 95% of the measurements lie within 2 standard deviations of the mean. So if we knew β0, β1, and σ, we would be very confident that our response would lie between (β0 + β1·xg) − 2σ and (β0 + β1·xg) + 2σ. Figure 9 represents this idea.

Figure 9: Distribution of response variable with known β0, β1, and σ

We rarely, if ever, know these parameters, and we must estimate them as we have in previous sections. There is uncertainty in what the mean response is at the specified level xg of the predictor variable. We do, however, know how to obtain an interval that we are very confident contains the true mean β0 + β1·xg. If we apply the method of the previous paragraph to all believable values of this mean, we can obtain a prediction interval that we are very confident will contain our future response. Since σ is being estimated as well, instead of 2 standard deviations we must use t_{α/2,n−2} estimated standard deviations. Figure 10 portrays this idea.

Figure 10: Distribution of response variable with estimated β0, β1, and σ

Note that all we really need are the two extreme distributions from the confidence interval for the mean response. If we use the method from the last paragraph on each of these two distributions, we can obtain the prediction interval by choosing the left-hand point of the lower distribution and the right-hand point of the upper distribution. This is displayed in Figure 11.

Figure 11: Upper and lower prediction limits when we have estimated the mean

The general formula for a (1−α)100% prediction interval for a future response is similar to the confidence interval for the mean at xg, except that it is wider, to reflect the variation in individual responses. The formula is:

(b0 + b1·xg) ± t_{α/2,n−2}·se·sqrt(1 + 1/n + (xg − x̄)²/SSxx)

Example 3.1 (Continued) Coffee Sales and Shelf Space

For the coffee example, suppose the marketer wishes to predict next week's sales when the coffee will have 5' of shelf space. She would like a 95% prediction interval for the number of bags to be sold. First, we observe that t_{.025,10} = 2.228; all other relevant numbers can be found in the previous example. The prediction interval is then:

(307.967 + 34.5833(5)) ± 2.228(51.63)·sqrt(1 + 1/12 + (5−6)²/72)
  = 480.883 ± 115.032·sqrt(1.0972) = 480.883 ± 120.494 ≡ (360.389, 601.377)

This interval is relatively wide, reflecting the large variation in weekly sales at each level of x. Note that just as the width of the confidence interval for the mean response depends on the distance between xg and x̄, so does the width of the prediction interval. This should be no surprise, considering the way we set up the prediction interval (see Figure 10 and Figure 11). Figure 12 shows the fitted equation and 95% prediction limits for this example. It must be noted that a prediction interval for a future response is only valid if conditions when the response occurs are similar to those when the data were collected. For instance, if the store is boycotted by a group of animal rights activists for selling meat next week, our prediction interval will not be valid.
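The prediction-interval formula can be checked against the Lower95%/Upper95% Predict columns of the SAS listing given with this example. A minimal sketch, using the same quantities as the confidence-interval computation:

```python
import math

# Quantities from the coffee sales example
b0, b1 = 307.967, 34.5833
se = 51.63
n, xbar, SSxx = 12, 6.0, 72.0
t_crit = 2.228  # t_{.025,10}

def pred_interval(xg):
    """95% prediction interval for a single future response at x = xg."""
    yhat = b0 + b1 * xg
    margin = t_crit * se * math.sqrt(1 + 1/n + (xg - xbar)**2 / SSxx)
    return yhat - margin, yhat + margin

lo, hi = pred_interval(6)   # compare with SAS: roughly (395.7, 635.2)
print(f"x = 6: ({lo:.1f}, {hi:.1f})")
```

The leading "1 +" under the square root is what makes the prediction interval so much wider than the interval for the mean.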

Figure 12: Plot of coffee data, fitted equation, and 95% prediction limits for a single response

Example 3.2 (Continued) Hosiery Mill Cost Function

Suppose the plant manager knows, based on purchase orders, that this month her plant will produce 30,000 items (xg = 30.0). She would like to predict what the plant's production costs will be, and obtains a 95% prediction interval for this month's costs:

3.1274 + 2.0055(30) ± 2.015(6.24)·sqrt(1 + 1/48 + (30 − 31.0673)²/7738.94)
  = 63.29 ± 2.015(6.24)·sqrt(1.0210) = 63.29 ± 12.70 ≡ (50.59, 75.99)

She predicts that the costs for this month will be between $50,590 and $75,990. This interval is much wider than the interval for the mean, since it includes the random variation in monthly costs around the mean. A plot of the 95% prediction bands is given in Figure 13.

3.3 Coefficient of Correlation

In many situations, we would like a measure of the strength of the linear association between the variables y and x. One measure of this association that is reported in research journals from many fields is the Pearson product moment coefficient of correlation. This measure, denoted by r, is a number that can range from −1 to +1. A value of r close to 0 implies very little association between the two variables (y tends neither to increase nor to decrease as x increases). A positive value of r means there is a positive association between y and x (y tends to increase as x increases). Similarly, a negative value means there is a negative association (y tends to decrease as x increases). If r is either +1 or −1, the data fall on a straight line (SSE = 0) with either a positive or negative slope, depending on the sign of r. The formula for calculating r is:

r = SSxy / sqrt(SSxx·SSyy) = cov(x,y) / (sx·sy)

Note that the sign of r is always the same as the sign of b1. We can test whether a population coefficient of correlation is 0, but since that test is mathematically equivalent to testing whether β1 = 0, we won't cover it.


Figure 13: Plot of hosiery mill cost data, fitted equation, and 95% prediction limits for an individual outcome

Example 3.1 (Continued) Coffee Sales and Shelf Space

For the coffee data, we can calculate r using the values of SSxy, SSxx, and SSyy we have previously obtained:

r = 2490 / sqrt((72)(112774.9)) = 2490/2849.5 = .8738

Example 3.2 (Continued) Hosiery Mill Cost Function

For the hosiery mill cost function data, we have:

r = 15520.27 / sqrt((7738.94)(32914.06)) = 15520.27/15959.95 = .9725

Computer Output for Coffee Sales Example (SAS System)

Dependent Variable: SALES

                     Analysis of Variance
                       Sum of          Mean
Source        DF      Squares        Square     F Value   Prob>F
Model          1  86112.50000   86112.50000     32.297    0.0002
Error         10  26662.41667    2666.24167
C Total       11 112774.91667

Root MSE    51.63566    R-square   0.7636
Dep Mean   515.41667    Adj R-sq   0.7399

                     Parameter Estimates
                Parameter      Standard    T for H0:
Variable   DF    Estimate         Error  Parameter=0   Prob>|T|
INTERCEP    1  307.916667   39.43738884        7.808     0.0001
SPACE       1   34.583333    6.08532121        5.683     0.0002

       Dep Var  Predict  Std Err  Lower95%  Upper95%  Lower95%  Upper95%
Obs      SALES    Value  Predict      Mean      Mean   Predict   Predict  Residual
  1      421.0    411.7   23.568     359.2     464.2     285.2     538.1    9.3333
  2      412.0    411.7   23.568     359.2     464.2     285.2     538.1    0.3333
  3      443.0    411.7   23.568     359.2     464.2     285.2     538.1   31.3333
  4      346.0    411.7   23.568     359.2     464.2     285.2     538.1  -65.6667
  5      526.0    515.4   14.906     482.2     548.6     395.7     635.2   10.5833
  6      581.0    515.4   14.906     482.2     548.6     395.7     635.2   65.5833
  7      434.0    515.4   14.906     482.2     548.6     395.7     635.2  -81.4167
  8      570.0    515.4   14.906     482.2     548.6     395.7     635.2   54.5833
  9      630.0    619.2   23.568     566.7     671.7     492.7     745.6   10.8333
 10      560.0    619.2   23.568     566.7     671.7     492.7     745.6  -59.1667
 11      590.0    619.2   23.568     566.7     671.7     492.7     745.6  -29.1667
 12      672.0    619.2   23.568     566.7     671.7     492.7     745.6   52.8333
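The summary statistics behind this output can be reproduced from the raw data. In the sketch below, the shelf-space levels x = 3, 6, 9 feet (four stores at each level) are an assumption inferred from the three repeated predicted values and from x̄ = 6, SSxx = 72 used earlier; the sales figures are the Dep Var column above.

```python
import math

# Coffee sales data from the SAS listing.
# Assumed design: shelf space 3, 6, 9 ft, four stores at each level
# (this reproduces xbar = 6 and SSxx = 72 used in the text).
x = [3]*4 + [6]*4 + [9]*4
y = [421, 412, 443, 346, 526, 581, 434, 570, 630, 560, 590, 672]

n = len(x)
xbar, ybar = sum(x)/n, sum(y)/n
SSxx = sum((xi - xbar)**2 for xi in x)
SSyy = sum((yi - ybar)**2 for yi in y)
SSxy = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y))

r = SSxy / math.sqrt(SSxx * SSyy)   # Pearson correlation
b1 = SSxy / SSxx                    # least squares slope
b0 = ybar - b1 * xbar               # least squares intercept
print(f"r = {r:.4f}, b1 = {b1:.4f}, b0 = {b0:.4f}")
```

The computed r, b1, and b0 agree with the values quoted in the examples and in the SAS parameter estimates.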

4 Logistic Regression

Often the outcome is nominal (or binary), and we wish to relate the probability that an outcome has the characteristic of interest to an interval-scale predictor variable. For instance, a local service provider may be interested in the probability that a customer will redeem a coupon that is mailed to him/her, as a function of the amount of the coupon. We would expect that as the value of the coupon increases, so does the proportion of coupons redeemed. An experiment could be conducted as follows.

• Choose a range of reasonable coupon values (say x = $0 (flyer only), $1, $2, $5, $10)
• Identify a sample of customers (say 200 households)
• Randomly assign customers to coupon values (say 40 per coupon value level)
• Send out coupons, and determine whether each coupon was redeemed by the expiration date (y = 1 if yes, 0 if no)
• Tabulate results and fit the estimated regression equation.

Note that probabilities are bounded by 0 and 1, so we cannot fit a linear regression, since it would provide fitted values outside this range (unless b0 is between 0 and 1 and b1 is 0). We consider the following model, which does force fitted probabilities to lie between 0 and 1:

p(x) = e^(β0+β1·x) / (1 + e^(β0+β1·x))        e = 2.71828...

Unfortunately, unlike the case of simple linear regression, where there are closed-form equations for the least squares estimates of β0 and β1, computer software must be used to obtain maximum likelihood estimates of β0 and β1, as well as their standard errors. Fortunately, many software packages (e.g. SAS, SPSS, Statview) offer procedures to obtain the estimates, standard errors, and tests. We will give estimates and standard errors in this section, obtained from one of these packages. Once the estimates of β0 and β1 are obtained, which we will label b0 and b1 respectively, we obtain the fitted equation:

p̂(x) = e^(b0+b1·x) / (1 + e^(b0+b1·x))        e = 2.71828...

Example 4.1 Viagra Clinical Trial

In a clinical trial for Viagra, patients suffering from erectile dysfunction were randomly assigned to one of four daily doses (0 mg, 25 mg, 50 mg, and 100 mg). One measure obtained from the patients was whether the patient had improved erections after 24 weeks of treatment (y = 1 if yes, y = 0 if no). Table 9 gives the number of subjects with y = 1 and y = 0 for each dose level. Source: I. Goldstein, et al (1998), Oral Sildenafil in the Treatment of Erectile Dysfunction, NEJM, 338:1397-1404.

Based on an analysis using SAS software, we obtain the following estimates and standard errors for the logistic regression model:

                # Responding
Dose (x)     n    y=1    y=0
   0       199     50    149
  25        96     54     42
  50       105     81     24
 100       101     85     16

Table 9: Patients showing improvement (y = 1) and not showing improvement (y = 0) by Viagra dose (x)

A plot of the fitted equation (line) and the sample proportions at each dose (dots) is given in Figure 14.

Figure 14: Plot of estimated logistic regression equation for the Viagra data
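Software is needed for the maximum likelihood fit, but for a single predictor the Newton-Raphson iteration behind it is short enough to sketch. The following illustration (pure Python, not the SAS analysis used in the text) fits the grouped data of Table 9; the resulting slope and standard error should land close to the values quoted in Section 4.1 (b1 = .0313, s_b1 = .0034).

```python
import math

# Grouped data from Table 9: dose, number of patients, number improved
doses = [0.0, 25.0, 50.0, 100.0]
n = [199, 96, 105, 101]
yes = [50, 54, 81, 85]

b0, b1 = 0.0, 0.0
for _ in range(25):                      # Newton-Raphson iterations
    g0 = g1 = 0.0                        # gradient of the log-likelihood
    h00 = h01 = h11 = 0.0                # information matrix (negative Hessian)
    for x, ni, yi in zip(doses, n, yes):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = ni * p * (1 - p)
        g0 += yi - ni * p
        g1 += (yi - ni * p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det    # solve H * step = g for the 2x2 system
    b1 += (-h01 * g0 + h00 * g1) / det

se_b1 = math.sqrt(h00 / det)             # sqrt of [H^-1]_{22}
print(f"b1 = {b1:.4f}, SE(b1) = {se_b1:.4f}")
```

Minor differences from the SAS values are expected from rounding in the published counts and in the reported estimates.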

4.1 Testing for Association between Outcome Probabilities and x

Consider the logistic regression model:

p(x) = e^(β0+β1·x) / (1 + e^(β0+β1·x))        e = 2.71828...

Note that if β1 = 0, the equation becomes p(x) = e^β0 / (1 + e^β0). That is, the probability that the outcome is the characteristic of interest is not related to x, the predictor variable. In terms of the Viagra example, this would mean that the probability a patient shows improvement is independent of dose. This is what we would expect if the drug were not effective (still allowing for a placebo effect). Further, note that if β1 > 0, the probability of the characteristic of interest occurring increases in x, and if β1 < 0, the probability decreases in x. We can test whether β1 = 0 as follows:

• H0: β1 = 0 (Probability of outcome is independent of x)
• HA: β1 ≠ 0 (Probability of outcome is associated with x)
• Test Statistic: X²obs = [b1/s_b1]²
• Rejection Region: X²obs ≥ χ²_{α,1} (= 3.841 for α = 0.05)
• P-value: Area in the χ²_1 distribution above X²obs

Note that if we reject H0, we determine the direction of the association (positive/negative) by the sign of b1.

Example 4.1 (Continued) Viagra Clinical Trial

For this data, we can test whether the probability of showing improvement is associated with dose as follows:

• H0: β1 = 0 (Probability of improvement is independent of dose)
• HA: β1 ≠ 0 (Probability of improvement is associated with dose)
• Test Statistic: X²obs = [b1/s_b1]² = [.0313/.0034]² = (9.2059)² = 84.75
• Rejection Region: X²obs ≥ χ²_{α,1} (= 3.841 for α = 0.05)
• P-value: Area in the χ²_1 distribution above X²obs (virtually 0)
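The test statistic is simply the square of the estimate-to-standard-error ratio, so the arithmetic above can be checked in a few lines (the critical value 3.841 is taken from the text):

```python
# Wald-type chi-square test for H0: beta1 = 0, using the reported estimates
b1, se_b1 = 0.0313, 0.0034   # estimate and standard error from the text
wald = (b1 / se_b1) ** 2
chi2_crit = 3.841            # chi-square(1) critical value for alpha = 0.05
print(f"X2 = {wald:.2f}, reject H0: {wald >= chi2_crit}")
```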

Thus, we have strong evidence of a positive association (since b1 > 0 and we reject H0) between the probability of improvement and dose.

5 Multiple Linear Regression I

Textbook Sections: 19.1-19.3 and Supplement

In most situations, we have more than one independent variable. While the amount of math can become overwhelming and involves matrix algebra, many computer packages exist that will provide the analysis for you. In this chapter, we will analyze the data by interpreting the results of a computer program. It should be noted that simple regression is a special case of multiple regression, so most concepts we have already seen apply here.

5.1 The Multiple Regression Model and Least Squares Estimates

In general, if we have k predictor variables, we can write our response variable as:

y = β0 + β1·x1 + ... + βk·xk + ε

Again, y is broken into a systematic and a random component:

y = β0 + β1·x1 + ... + βk·xk  +  ε
      (systematic)              (random)

We make the same assumptions as before in terms of ε, specifically that the errors are independent and normally distributed with mean 0 and standard deviation σ. That is, we are assuming that y, at a given set of levels of the k independent variables (x1, ..., xk), is normal with mean

E[y | x1, ..., xk] = β0 + β1·x1 + ... + βk·xk

and standard deviation σ. Just as before, β0, β1, ..., βk, and σ are unknown parameters that must be estimated from the sample data. The parameter βi represents the change in the mean response when the i-th predictor variable changes by 1 unit and all other predictor variables are held constant. In this model:

• y: Random outcome of the dependent variable
• β0: Regression constant (E(y | x1 = ... = xk = 0), if appropriate)
• βi: Partial regression coefficient for variable xi (change in E(y) when xi increases by 1 unit and all other x's are held constant)
• ε: Random error term, assumed (as before) to be N(0, σ)
• k: The number of independent variables

By the method of least squares (choosing the bi values that minimize SSE = Σ_{i=1}^{n} (yi − ŷi)²), we obtain the fitted equation:

ŷ = b0 + b1·x1 + b2·x2 + ... + bk·xk

and our estimate of σ:

se = sqrt(Σ(y − ŷ)²/(n−k−1)) = sqrt(SSE/(n−k−1))

The Analysis of Variance table will be very similar to what we used previously, with the only adjustments being in the degrees of freedom. Table 10 shows the values for the general case when there are k predictor variables. We will rely on computer outputs to obtain the Analysis of Variance and the estimates b0, b1, ..., bk.

                               ANOVA
Source of    Sum of               Degrees of   Mean
Variation    Squares              Freedom      Square              F
MODEL        SSR = Σ(ŷi − ȳ)²     k            MSR = SSR/k         F = MSR/MSE
ERROR        SSE = Σ(yi − ŷi)²    n−k−1        MSE = SSE/(n−k−1)
TOTAL        SSyy = Σ(yi − ȳ)²    n−1

Table 10: The Analysis of Variance Table for multiple regression

5.2 Testing for Association Between the Response and the Full Set of Predictor Variables

To see if the set of predictor variables is useful in predicting the response variable, we will test H0: β1 = β2 = ... = βk = 0. Note that if H0 is true, then the mean response does not depend on the levels of the predictor variables. We interpret this to mean that there is no association between the response variable and the set of predictor variables. To test this hypothesis, we use the following method:

1. H0: β1 = β2 = ... = βk = 0
2. HA: Not every βi = 0
3. T.S.: Fobs = MSR/MSE
4. R.R.: Fobs > F_{α,k,n−k−1}
5. p-value: P(F > Fobs) (You can only get bounds on this from tables, but computer outputs report it exactly)

The computer automatically performs this test and provides you with the p-value, so in practice you do not really need to obtain the rejection region explicitly to reach the appropriate conclusion. However, we will do so in this course to help reinforce the relationship between the test's decision rule and the p-value. Recall that we reject the null hypothesis if the p-value is less than α.

5.3 Testing Whether Individual Predictor Variables Help Predict the Response

If we reject the previous null hypothesis and conclude that not all of the βi are zero, we may wish to test whether individual βi are zero. Note that if we fail to reject the null hypothesis that βi is zero, we can drop the predictor xi from our model, thus simplifying the model. Note that this test asks whether xi is useful given that we are already fitting a model containing the remaining k−1 predictor variables; that is, does this variable contribute anything once we've taken the other predictor variables into account? These tests are t-tests, where we compute tobs = bi/s_bi just as we did in the section on making inferences concerning β1 in simple regression. The procedure for testing whether βi = 0 (the i-th predictor variable does not contribute to predicting the response, given the other k−1 predictor variables in the model) is as follows:

• H0: βi = 0 (y is not associated with xi after controlling for all other independent variables)
• HA: (1) βi ≠ 0   (2) βi > 0   (3) βi < 0
• T.S.: tobs = bi/s_bi
• R.R.: (1) |tobs| > t_{α/2,n−k−1}   (2) tobs > t_{α,n−k−1}   (3) tobs < −t_{α,n−k−1}
• p-value: (1) 2P(T > |tobs|)   (2) P(T > tobs)   (3) P(T < tobs)

Computer packages print the test statistic and the p-value based on the two-sided test, so conducting this test is simply a matter of interpreting the computer output.

5.4 Testing for an Association Between a Subset of Predictor Variables and the Response

We have seen the two extreme cases: testing whether all regression coefficients are simultaneously 0 (the F-test), and testing whether a single regression coefficient is 0, controlling for all other predictors (the t-test). We can also test whether a subset of the k regression coefficients are 0, controlling for all other predictors. Note that the two extreme cases can be handled by this very general procedure. To keep the notation as simple as possible, suppose our model contains k predictor variables, of which we would like to test whether q (q ≤ k) are simultaneously not associated with y, after controlling for the remaining k−q predictor variables. Further assume that the k−q remaining predictors are labelled x1, x2, ..., x_{k−q} and that the q predictors of interest are labelled x_{k−q+1}, x_{k−q+2}, ..., xk. This test is of the form:

H0: β_{k−q+1} = β_{k−q+2} = ... = βk = 0
HA: β_{k−q+1} ≠ 0 and/or β_{k−q+2} ≠ 0 and/or ... and/or βk ≠ 0

The procedure for obtaining the numeric elements of the test is as follows:

1. Fit the model under the null hypothesis (β_{k−q+1} = β_{k−q+2} = ... = βk = 0). It will include only the first k−q predictor variables. This is referred to as the Reduced model. Obtain the error sum of squares SSE(R) and the error degrees of freedom dfE(R) = n − (k−q) − 1.

2. Fit the model with all k predictors. This is referred to as the Complete or Full model (and was used for the F-test for all regression coefficients). Obtain the error sum of squares SSE(F) and the error degrees of freedom dfE(F) = n − k − 1.

By definition of the least squares criterion, we know that SSE(R) ≥ SSE(F). We now obtain the test statistic:

T.S.: Fobs = [(SSE(R) − SSE(F))/q] / MSE(F),   where q = dfE(R) − dfE(F) = [n − (k−q) − 1] − [n − k − 1]

and our rejection region is values of Fobs ≥ F_{α,q,n−k−1}.

Example 5.1 Texas Weather Data

In this example, we will use regression in the context of predicting an outcome. A construction company is making a bid on a project in a remote area of Texas. A certain component of the project will take place in December, and is very sensitive to the daily high temperatures. They would like to estimate what the average high temperature will be at the location in December. They believe that the temperature at a location will depend on its latitude (a measure of distance from the equator) and its elevation. That is, they believe that the response variable (mean daily high temperature in December at a particular location) can be written as:

y = β0 + β1·x1 + β2·x2 + β3·x3 + ε,

where x1 is the latitude of the location, x2 is the longitude, and x3 is its elevation (in feet). As before, we assume that ε ~ N(0, σ). Note that higher latitudes mean farther north and higher longitudes mean farther west. To estimate the parameters β0, β1, β2, β3, and σ, they gather data for a sample of n = 16 counties and fit the model described above. The data, including one other variable (income), are given in Table 11.
COUNTY        LATITUDE   LONGITUDE    ELEV   TEMP   INCOME
HARRIS          29.767      95.367      41     56    24322
DALLAS          32.850      96.850     440     48    21870
KENNEDY         26.933      97.800      25     60    11384
MIDLAND         31.950     102.183    2851     46    24322
DEAF SMITH      34.800     102.467    3840     38    16375
KNOX            33.450      99.633    1461     46    14595
MAVERICK        28.700     100.483     815     53    10623
NOLAN           32.450     100.533    2380     46    16486
EL PASO         31.800     106.400    3918     44    15366
COLLINGTON      34.850     100.217    2040     41    13765
PECOS           30.867     102.900    3000     47    17717
SHERMAN         36.350     102.083    3693     36    19036
TRAVIS          30.300      97.700     597     52    20514
ZAPATA          26.900      99.283     315     60    11523
LA SALLE        28.450      99.217     459     56    10563
CAMERON         25.900      97.433      19     62    12931

Table 11: Data corresponding to 16 counties in Texas

The results of the Analysis of Variance are given in Table 12, and the parameter estimates, estimated standard errors, t-statistics, and p-values are given in Table 13. Full computer programs and printouts are given as well. We see from the Analysis of Variance that at least one of the variables latitude, longitude, and elevation is related to the response variable temperature. This can be seen by setting up the test H0: β1 = β2 = β3 = 0 as
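The least squares fit behind Tables 12 and 13 can be reproduced without statistical software by solving the normal equations (X'X)b = X'y. The sketch below does this with a small Gaussian-elimination routine on the Table 11 data; the slope estimates and SSE can then be compared against the reported values.

```python
# Least squares fit of TEMP on LATITUDE, LONGITUDE, ELEVATION (Table 11),
# solving the normal equations (X'X)b = X'y by Gaussian elimination.
lat = [29.767, 32.850, 26.933, 31.950, 34.800, 33.450, 28.700, 32.450,
       31.800, 34.850, 30.867, 36.350, 30.300, 26.900, 28.450, 25.900]
lon = [95.367, 96.850, 97.800, 102.183, 102.467, 99.633, 100.483, 100.533,
       106.400, 100.217, 102.900, 102.083, 97.700, 99.283, 99.217, 97.433]
elev = [41, 440, 25, 2851, 3840, 1461, 815, 2380,
        3918, 2040, 3000, 3693, 597, 315, 459, 19]
temp = [56, 48, 60, 46, 38, 46, 53, 46, 44, 41, 47, 36, 52, 60, 56, 62]

X = [[1.0, a, b, c] for a, b, c in zip(lat, lon, elev)]
p = len(X[0])
XtX = [[sum(row[i]*row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i]*yi for row, yi in zip(X, temp)) for i in range(p)]

def solve(A, v):
    """Solve A x = v by Gaussian elimination with partial pivoting."""
    A = [row[:] + [vi] for row, vi in zip(A, v)]
    m = len(A)
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (A[r][m] - sum(A[r][c]*x[c] for c in range(r + 1, m))) / A[r][r]
    return x

b = solve(XtX, Xty)
SSE = sum((yi - sum(bi*xi for bi, xi in zip(b, row)))**2
          for row, yi in zip(X, temp))
print("b =", [round(v, 5) for v in b], " SSE =", round(SSE, 3))
```

Only the slopes and SSE are checked here; for a handful of predictors this direct approach is fine, though real software uses more numerically stable decompositions.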
described previously. The elements of this test, provided by the computer output, are detailed below, assuming α = .05.

1. H0: β1 = β2 = β3 = 0

                               ANOVA
Source of    Sum of           Degrees of       Mean
Variation    Squares          Freedom          Square                F                    p-value
MODEL        SSR = 934.328    k = 3            MSR = 934.328/3       F = 311.443/0.634    .0001
                                                   = 311.443             = 491.235
ERROR        SSE = 7.609      n−k−1 =          MSE = 7.609/12
                              16−3−1 = 12          = 0.634
TOTAL        SSyy = 941.938   n−1 = 15

Table 12: The Analysis of Variance Table for Texas data

PARAMETER        ESTIMATE         STANDARD ERROR   t FOR H0: βi=0   P-VALUE
INTERCEPT (β0)   b0 = 109.25887      2.97857           36.68         .0001
LATITUDE (β1)    b1 = −1.99323       0.13639          −14.61         .0001
LONGITUDE (β2)   b2 = −0.38471       0.22858           −1.68         .1182
ELEVATION (β3)   b3 = −0.00096       0.00057           −1.68         .1181

Table 13: Parameter estimates and tests of hypotheses for individual parameters

2. HA: Not all βi = 0

3. T.S.: Fobs = MSR/MSE = 311.443/0.634 = 491.235

4. R.R.: Fobs > F_{.05,3,12} = 3.49 (This is not provided on the output; the p-value takes its place.)

5. p-value: P(F > 491.235) = .0001 (Actually it is less than .0001, but this is the smallest p-value the computer will print.)

We conclude that at least one of these three variables is related to the response variable temperature. We also see from the individual t-tests that latitude is useful in predicting temperature, even after taking into account the other predictor variables. The formal test (based on the α = 0.05 significance level) for determining whether temperature is associated with latitude, after controlling for longitude and elevation, is given here:

• H0: β1 = 0 (TEMP (y) is not associated with LAT (x1) after controlling for LONG (x2) and ELEV (x3))
• HA: β1 ≠ 0 (TEMP is associated with LAT after controlling for LONG and ELEV)
• T.S.: tobs = b1/s_b1 = −1.99323/0.13639 = −14.614
• R.R.: |tobs| > t_{α/2,n−k−1} = t_{.025,12} = 2.179
• p-value: 2P(T > |tobs|) = 2P(T > 14.614) < .0001

Thus, we can conclude that there is an association between temperature and latitude, controlling for longitude and elevation. Note that the coefficient is negative, so we conclude that temperature decreases as latitude increases (for a given level of longitude and elevation). Note from Table 13 that neither the coefficient for LONGITUDE (x2) nor that for ELEVATION (x3) is significant at the α = 0.05 significance level (p-values are .1182 and .1181, respectively). Recall that these tests ask whether each term is 0 controlling for LATITUDE and the other term. Before concluding that neither LONGITUDE (x2) nor ELEVATION (x3) is a useful predictor, controlling for LATITUDE, we will test whether they
are both simultaneously 0, that is:

H0: β2 = β3 = 0   vs   HA: β2 ≠ 0 and/or β3 ≠ 0

First, note that we have:

n = 16   k = 3   q = 2   SSE(F) = 7.609   dfE(F) = 16−3−1 = 12   MSE(F) = 0.634
dfE(R) = 16−(3−2)−1 = 14   F_{.05,2,12} = 3.89

Next, we fit the model with only LATITUDE (x1), obtain the error sum of squares SSE(R) = 60.935, and get the following test statistic:

T.S.: Fobs = [(SSE(R) − SSE(F))/q] / MSE(F) = [(60.935 − 7.609)/2] / 0.634 = 26.663/0.634 = 42.055

Since 42.055 >> 3.89, we reject H0, and conclude that LONGITUDE (x2) and/or ELEVATION (x3) are associated with TEMPERATURE (y), after controlling for LATITUDE (x1).
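The partial F statistic is simple enough to verify directly from the two error sums of squares quoted above:

```python
# Partial (subset) F-test for H0: beta2 = beta3 = 0 in the Texas model
SSE_R, SSE_F = 60.935, 7.609   # reduced (latitude only) and full models
q, MSE_F = 2, 0.634            # number of tested coefficients; full-model MSE
F_obs = ((SSE_R - SSE_F) / q) / MSE_F
print(f"F = {F_obs:.3f}")      # compare with F_{.05,2,12} = 3.89
```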

The reason we failed to reject H0: β2 = 0 and H0: β3 = 0 individually based on the t-tests is that ELEVATION and LONGITUDE are highly correlated (elevations rise as you go farther west in the state). So, once you control for LONGITUDE, we observe little ELEVATION effect, and vice versa. We will discuss why this is the case later. In theory, we have little reason to believe that temperatures naturally increase or decrease with LONGITUDE, but we may reasonably expect that as ELEVATION increases, TEMPERATURE decreases. We refit the more parsimonious (simpler) model that uses ELEVATION (x1) and LATITUDE (x2) to predict TEMPERATURE (y). Note the new symbols for ELEVATION and LATITUDE: that is to show you that they are merely symbols. The results are given in Table 14 and Table 15.

                               ANOVA
Source of    Sum of           Degrees of       Mean
Variation    Squares          Freedom          Square                F                    p-value
MODEL        SSR = 932.532    k = 2            MSR = 932.532/2       F = 466.266/0.724    .0001
                                                   = 466.266             = 644.0
ERROR        SSE = 9.406      n−k−1 =          MSE = 9.406/13
                              16−2−1 = 13          = 0.724
TOTAL        SSyy = 941.938   n−1 = 15

Table 14: The Analysis of Variance Table for Texas data without LONGITUDE

We see that both predictors are now significant: the t-statistic for testing H0: β1 = 0 (no elevation effect on temperature) is −8.41, corresponding to a p-value of .0001, and the t-statistic for testing H0: β2 = 0 (no latitude effect) is −17.65, also corresponding to a p-value of .0001. Further note that both estimates are negative, reflecting that as elevation and latitude increase, temperature decreases. That should not come as any big surprise.

PARAMETER        ESTIMATE         STANDARD ERROR   t FOR H0: βi=0   P-VALUE
INTERCEPT (β0)   b0 = 63.45485       0.48750           36.68         .0001
ELEVATION (β1)   b1 = −0.00185       0.00022           −8.41         .0001
LATITUDE (β2)    b2 = −1.83216       0.10380          −17.65         .0001

Table 15: Parameter estimates and tests of hypotheses for individual parameters, without LONGITUDE

The magnitudes of the estimated coefficients are quite different, which may make you believe that one predictor variable is more important than the other. This is not necessarily true, because the ranges of their levels are quite different (a 1 unit change in latitude represents a change of approximately 69 miles, while a unit change in elevation is 1 foot), and recall that βi represents the change in the mean response when variable xi is increased by 1 unit. The data corresponding to the 16 locations in the sample are plotted in Figure 15, and the fitted equation for the model that does not include LONGITUDE is plotted in Figure 16. The fitted equation is a plane in three dimensions.

Figure 15: Plot of temperature data in 3 dimensions

Example 5.2 Mortgage Financing Cost Variation (By City)

A study in the mid 1960s reported regional differences in mortgage costs for new homes. The sampling units were n = 18 metro areas (SMSAs) in the U.S. The dependent variable (y) is the average yield (in percent) on a new home mortgage for the SMSA. The independent variables (xi) are given below. Source: Schaaf, A.H. (1966), Regional Differences in Mortgage Financing Costs, Journal of Finance, 21:85-94.

x1: Average Loan Value / Mortgage Value Ratio (higher x1 means lower down payment and higher risk to the lender).

SMSA                       y      x1     x2      x3      x4     x5    x6    ŷ     e = y − ŷ
Los Angeles-Long Beach    6.17   78.1   3042    91.3   1738.1  45.5  33.1  6.19    −0.02
Denver                    6.06   77.0   1997    84.1   1110.4  51.8  21.9  6.04     0.02
San Francisco-Oakland     6.04   75.7   3162   129.3   1738.1  24.0  46.0  6.05    −0.01
Dallas-Fort Worth         6.04   77.4   1821    41.2    778.4  45.7  51.3  6.05    −0.01
Miami                     6.02   77.4   1542   119.1   1136.7  88.9  18.7  6.04    −0.02
Atlanta                   6.02   73.6   1074    32.3    582.9  39.9  26.6  5.92     0.10
Houston                   5.99   76.3   1856    45.2    778.4  54.1  35.7  6.02    −0.03
Seattle                   5.91   72.5   3024   109.7   1186.0  31.1  17.0  5.91     0.00
New York                  5.89   77.3   2163    64.3   2582.4  11.9   7.3  5.82     0.07
Memphis                   5.87   77.4   1350   111.0    613.6  27.4  11.3  5.86     0.01
New Orleans               5.85   72.4   1544    81.0    636.1  27.3   8.1  5.81     0.04
Cleveland                 5.75   67.0    631   202.7   1346.0  24.6  10.0  5.64     0.11
Chicago                   5.73   68.9    972   290.1   1626.8  20.1   9.4  5.60     0.13
Detroit                   5.66   70.7    699   223.4   1049.6  24.7  31.7  5.63     0.03
Minneapolis-St Paul       5.66   69.8   1377   138.4   1289.3  28.8  19.7  5.81    −0.15
Baltimore                 5.63   72.9    399   125.4    836.3  22.9   8.6  5.77    −0.14
Philadelphia              5.57   68.7   3042    59.5   1315.3  18.3  18.7  5.57     0.00
Boston                    5.28   67.8      0   428.2   2081.0   7.5   2.0  5.41    −0.13

Table 16: Data and fitted values for mortgage rate multiple regression example.

                               ANOVA
Source of    Sum of           Degrees of       Mean
Variation    Squares          Freedom          Square                 F                      p-value
MODEL        SSR = 0.73877    k = 6            MSR = 0.73877/6        F = 0.12313/0.00998    .0003
                                                   = 0.12313              = 12.33
ERROR        SSE = 0.10980    n−k−1 =          MSE = 0.10980/11
                              18−6−1 = 11          = 0.00998
TOTAL        SSyy = 0.84858   n−1 = 17

Table 17: The Analysis of Variance Table for the mortgage rate regression analysis

PARAMETER        ESTIMATE          STANDARD ERROR   t-statistic   P-value
INTERCEPT (β0)   b0 = 4.28524         0.66825          6.41        .0001
x1 (β1)          b1 = 0.02033         0.00931          2.18        .0515
x2 (β2)          b2 = 0.000014        0.000047         0.29        .7775
x3 (β3)          b3 = −0.00158        0.000753        −2.10        .0593
x4 (β4)          b4 = 0.000202        0.000112         1.79        .1002
x5 (β5)          b5 = 0.00128         0.00177          0.73        .4826
x6 (β6)          b6 = 0.000236        0.00230          0.10        .9203

Table 18: Parameter estimates and tests of hypotheses for individual parameters, mortgage rate regression analysis

                               ANOVA
Source of    Sum of           Degrees of         Mean
Variation    Squares          Freedom            Square                 F                      p-value
MODEL        SSR = 0.73265    k−q = 3            MSR = 0.73265/3        F = 0.24422/0.00828    .0001
                                                     = 0.24422              = 29.49
ERROR        SSE = 0.11593    n−(k−q)−1 =        MSE = 0.11593/14
                              18−3−1 = 14            = 0.00828
TOTAL        SSyy = 0.84858   n−1 = 17

Table 19: The Analysis of Variance Table for the mortgage rate regression analysis (Reduced Model)

PARAMETER        ESTIMATE          STANDARD ERROR   t-statistic   P-value
INTERCEPT (β0)   b0 = 4.22260         0.58139          7.26        .0001
x1 (β1)          b1 = 0.02229         0.00792          2.81        .0138
x3 (β3)          b3 = −0.00186        0.00041778      −4.46        .0005
x4 (β4)          b4 = 0.000225        0.000074         3.03        .0091

Table 20: Parameter estimates and tests of hypotheses for individual parameters, mortgage rate regression analysis (Reduced Model)

Next, we fit the reduced model, with β2 = β5 = β6 = 0. We get the Analysis of Variance in Table 19 and the parameter estimates in Table 20. Note first that all three regression coefficients are significant now at the α = 0.05 significance level. Also, our residual standard error, se = sqrt(MSE), has decreased only slightly (from 0.09991 to 0.09100). This implies we have lost very little predictive ability by dropping x2, x5, and x6 from the model. Now we formally test whether these three regression coefficients are simultaneously 0 (with α = 0.05):

• H0: β2 = β5 = β6 = 0
• HA: β2 ≠ 0 and/or β5 ≠ 0 and/or β6 ≠ 0
• T.S.: Fobs = [(SSE(R) − SSE(F))/q] / MSE(F) = [(0.11593 − 0.10980)/3] / 0.00998 = 0.00204/0.00998 = 0.205
• R.R.: Fobs ≥ F_{0.05,3,11} = 3.59

We fail to reject H0, and conclude that none of x2, x5, or x6 is associated with mortgage rate, after controlling for x1, x3, and x4.

Example 5.3 Store Location Characteristics and Sales

A study proposed using linear regression to describe sales at retail stores based on location characteristics. As a case study, the authors modelled sales at n = 16 liquor stores in Charlotte, N.C. Note that in North Carolina, all stores are state run, and do not practice promotion as liquor stores in Florida do. The response was SALES volume (for the individual stores) in the fiscal year 7/1/1979-6/30/1980. The independent variables were: POP (number of people living within 1.5 miles of the store), MHI (mean household income among households within 1.5 miles of the store), DIS (distance to the nearest store), TFL (daily traffic volume on the street where the store was located), and EMP (the amount of employment within 1.5 miles of the store). The regression coefficients and standard errors are given in Table 21.

Source: Lord, J.D. and C.D. Lynds (1981), The Use of Regression Models in Store Location Research: A Review and Case Study, AkronBusinessandEconomicReview, Summer, 1319.

Variable   Estimate   Std Error
POP        0.09460    0.01819
MHI        0.06129    0.02057
DIS        4.88524    1.72623
TFL        2.59040    1.22768
EMP        0.00245    0.00454

Table 21: Regression coefficients and standard errors for liquor store sales study
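For part (a) of the exercise below, each coefficient's t-statistic is simply estimate/standard error, with n - k - 1 = 16 - 5 - 1 = 10 error degrees of freedom. A quick sketch using the Table 21 values (the two-sided cutoff t.025,10 = 2.228 is taken from a t table):

```python
# Estimates and standard errors from Table 21
coefs = {
    "POP": (0.09460, 0.01819),
    "MHI": (0.06129, 0.02057),
    "DIS": (4.88524, 1.72623),
    "TFL": (2.59040, 1.22768),
    "EMP": (0.00245, 0.00454),
}
t_cut = 2.228  # t_{.025,10}, from a t table

t_stats = {name: b / se for name, (b, se) in coefs.items()}
for name, t in t_stats.items():
    verdict = "significant" if abs(t) > t_cut else "not significant"
    print(f"{name}: t = {t:.2f}  ({verdict})")
```

TFL and EMP fail to reach significance after controlling for the other predictors.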

a) Do any of these variables fail to be associated with store sales after controlling for the others?
b) Consider the signs of the significant regression coefficients. What do they imply?

5.5 R² and Adjusted R²

As was discussed in the previous chapter, the coefficient of multiple determination represents the proportion of the variation in the dependent variable (y) that is explained by the regression on the collection of independent variables (x1, ..., xk). R² is computed exactly as before:

R² = SSR/SSyy = 1 - SSE/SSyy

One problem with R² is that when we continually add independent variables to a regression model, it continually increases (or at least, never decreases), even when the new variable(s) add little or no predictive power. Since we are trying to fit the simplest (most parsimonious) model that explains the relationship between the set of independent variables and the dependent variable, we need a measure that penalizes models that contain useless or redundant independent variables. This penalization takes into account that by including useless or redundant predictors, we are decreasing error degrees of freedom (dfE = n - k - 1). A second measure, which does not carry the proportion-of-variation-explained interpretation but is useful for comparing models of varying degrees of complexity, is Adjusted R²:

Adjusted R² = 1 - [SSE/(n - k - 1)] / [SSyy/(n - 1)] = 1 - [(n - 1)/(n - k - 1)] (SSE/SSyy)

Example 5.1 (Continued): Texas Weather Data

Consider the two models we have fit:

Full Model I.V.'s: LATITUDE, LONGITUDE, ELEVATION
Reduced Model I.V.'s: LATITUDE, ELEVATION

For the Full Model, we have:

n = 16    k = 3    SSE = 7.609    SSyy = 941.938

and we obtain R²_F and Adj R²_F:

R²_F = 1 - 7.609/941.938 = 1 - .008 = 0.992
Adj R²_F = 1 - (15/12)(7.609/941.938) = 1 - 1.25(.008) = 0.9900

For the Reduced Model, we have:

n = 16    k = 2    SSE = 9.406    SSyy = 941.938

and we obtain R²_R and Adj R²_R:

R²_R = 1 - 9.406/941.938 = 1 - .010 = 0.990
Adj R²_R = 1 - (15/13)(9.406/941.938) = 1 - 1.15(.010) = 0.9885

Thus, by both measures the Full Model wins, but it should be added that both appear to fit the data very well!

Example 5.2 (Continued): Mortgage Financing Costs

For the mortgage data (with total sum of squares SSyy = 0.84858 and n = 18), when we include all 6 independent variables in the full model, we obtain the following results:

SSR = 0.73877    SSE = 0.10980    k = 6

From this full model, we compute R²_F and Adj R²_F:

R²_F = SSR_F/SSyy = 0.73877/0.84858 = 0.8706
Adj R²_F = 1 - [(n - 1)/(n - k - 1)](SSE_F/SSyy) = 1 - (17/11)(0.10980/0.84858) = 0.8000
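Both measures are one-line computations from the sums of squares. A minimal sketch reproducing the Texas weather and mortgage figures above (the helper name is ours):

```python
def r2_and_adj(sse, ssyy, n, k):
    """Return (R^2, adjusted R^2) from error and total sums of squares."""
    r2 = 1 - sse / ssyy
    adj = 1 - (n - 1) / (n - k - 1) * (sse / ssyy)
    return r2, adj

# Texas weather: full (k = 3) vs reduced (k = 2) models, n = 16
print(r2_and_adj(7.609, 941.938, 16, 3))
print(r2_and_adj(9.406, 941.938, 16, 2))
# Mortgage data: full model with all 6 predictors, n = 18
print(r2_and_adj(0.10980, 0.84858, 18, 6))
```

The adjusted measure penalizes the extra predictor in the full weather model only slightly, since that predictor still carries real information.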

Example 5.3 (Continued): Store Location Characteristics and Sales

In this study, the authors reported that R² = 0.69. Note that although we are not given the Analysis of Variance, we can still conduct the F test for the overall model:

F = MSR/MSE = [SSR/k] / [SSE/(n - k - 1)] = [R²/k] / [(1 - R²)/(n - k - 1)]

(dividing numerator and denominator by SSyy cancels it). For the liquor store example, there were n = 16 stores and k = 5 variables in the full model. To test:

H0: β1 = β2 = β3 = β4 = β5 = 0  vs  HA: Not all βi = 0

we get the following test statistic and rejection region (α = 0.05):

TS: Fobs = [0.69/5] / [(1 - 0.69)/(16 - 5 - 1)] = 0.138/0.031 = 4.45
RR: Fobs ≥ F_{α,k,n-k-1} = F_{0.05,5,10} = 3.33

Thus, at least one of these variables is associated with store sales. What is Adjusted R² for this analysis?

5.6 Multicollinearity

Textbook: Section 19.4, Supplement

Multicollinearity refers to the situation where independent variables are highly correlated among themselves. This can cause problems mathematically and creates problems in interpreting regression coefficients. Some of the problems that arise include:

- Difficult-to-interpret regression coefficient estimates
- Inflated standard errors of estimates (and thus small t-statistics)
- Signs of coefficients may not be what is expected
- However, predicted values are not adversely affected
It can be thought that the independent variables are explaining the same variation in y, and it is difficult for the model to attribute the variation explained (recall partial regression coefficients).

Variance Inflation Factors provide a means of detecting whether a given independent variable is causing multicollinearity. They are calculated (for each independent variable) as:

VIF_i = 1 / (1 - R_i²)

where R_i² is the coefficient of multiple determination when x_i is regressed on the k - 1 other independent variables. One rule of thumb suggests that severe multicollinearity is present if VIF_i > 10 (R_i² > .90).

Example 5.1 (Continued)

First, we run a regression with ELEVATION as the dependent variable and LATITUDE and LONGITUDE as the independent variables. We then repeat the process with LATITUDE as the dependent variable, and finally with LONGITUDE as the dependent variable. Table 22 gives R² and VIF for each model.

Variable     R²      VIF
ELEVATION   .9393   16.47
LATITUDE    .7635    4.23
LONGITUDE   .8940    9.43

Table 22: Variance Inflation Factors for Texas weather data

Note how large the factor is for ELEVATION. Texas elevation increases as you go West and as you go North. The Western rise is the more pronounced of the two (the simple correlation between ELEVATION and LONGITUDE is .89). Consider the effects on the coefficients in Table 23 and Table 24 (these are subsets of previously shown tables). Compare the estimate and estimated standard error for the coefficients of ELEVATION and LATITUDE in the two models. In particular, the ELEVATION coefficient doubles in absolute value and its standard error decreases by a factor of almost 3. The LATITUDE coefficient and standard error do not change very much. We choose to keep ELEVATION, as opposed to LONGITUDE, in the model due to theoretical considerations with respect to weather and climate.

PARAMETER         ESTIMATE        STANDARD ERROR OF ESTIMATE
INTERCEPT (β0)    b0 = 109.25887  2.97857
LATITUDE (β1)     b1 = 1.99323    0.13639
LONGITUDE (β2)    b2 = 0.38471    0.22858
ELEVATION (β3)    b3 = 0.00096    0.00057

Table 23: Parameter estimates and standard errors for the full model
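The VIFs in Table 22 follow directly from the definition VIF = 1/(1 - R²); a minimal sketch:

```python
def vif(r2):
    """Variance inflation factor from the R^2 of regressing x_i on the other predictors."""
    return 1.0 / (1.0 - r2)

# R^2 values from Table 22 (Texas weather data)
for name, r2 in [("ELEVATION", .9393), ("LATITUDE", .7635), ("LONGITUDE", .8940)]:
    flag = "  <- severe (VIF > 10)" if vif(r2) > 10 else ""
    print(f"{name}: VIF = {vif(r2):.2f}{flag}")
```

Only ELEVATION crosses the rule-of-thumb threshold of 10.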

PARAMETER         ESTIMATE        STANDARD ERROR OF ESTIMATE
INTERCEPT (β0)    b0 = 63.45485   0.48750
ELEVATION (β1)    b1 = 0.00185    0.00022
LATITUDE (β2)     b2 = 1.83216    0.10380

Table 24: Parameter estimates and standard errors for the reduced model

5.7 Autocorrelation

Textbook Section: 19.5

Recall a key assumption in regression: error terms are independent. When data are collected over time, the errors are often serially correlated (autocorrelated). Under first-order autocorrelation, consecutive error terms are linearly related:

εt = ρ εt-1 + νt

where ρ is the correlation between consecutive error terms, and νt is a normally distributed independent error term. When errors display a positive correlation, ρ > 0 (consecutive error terms are associated). We can test this relation as follows; note that when ρ = 0, error terms are independent (which is the assumption in the derivation of the tests in the chapters on linear regression).

Durbin-Watson Test for Autocorrelation

H0: ρ = 0 (no autocorrelation)    Ha: ρ > 0 (positive autocorrelation)

D = [Σ_{t=2}^{n} (et - et-1)²] / [Σ_{t=1}^{n} et²]

Decision rule: D ≥ dU, don't reject H0; D ≤ dL, reject H0; dL ≤ D ≤ dU, withhold judgement. Values of dL and dU (indexed by n and k, the number of predictor variables) are given in Table 11(a), p. B-22.

When autocorrelation is present, remedies include:

- Additional independent variable(s): a variable may be missing from the model that will eliminate the autocorrelation.
- Transform the variables: take first differences (yt+1 - yt) and (xt+1 - xt) and run the regression with the transformed y and x.

Example 5.4: Spirits Sales and Income and Prices in Britain

A study was conducted relating annual spirits (liquor) sales (y) in Britain to per capita income (x1) and prices (x2), where all monetary values were in constant (adjusted for inflation) dollars, for the years 1870-1938. The following output gives the results from the regression analysis and the Durbin-Watson statistic. Note that there are n = 69 observations and k = 2 predictors, and the approximate lower and upper bounds for the rejection region are dL = 1.55 and dU = 1.67 for an α = 0.05 level test. Since the test statistic is d = 0.247 (see output below), we reject the null hypothesis of no autocorrelation among the residuals, and conclude that they are positively correlated. See Figure 17 for a plot of the residuals versus year.

Source: Durbin, J., and Watson, G.S. (1950), "Testing for Serial Correlation in Least Squares Regression, I," Biometrika, 37:409-428.

The REG Procedure
Model: MODEL1
Dependent Variable: consume

                     Analysis of Variance
                            Sum of       Mean
Source             DF      Squares     Square    F Value    Pr > F
Model               2      4.80557    2.40278     712.27    <.0001
Error              66      0.22264    0.00337
Corrected Total    68      5.02821

Root MSE          0.05808    R-Square    0.9557
Dependent Mean    1.76999    Adj R-Sq    0.9544
Coeff Var         3.28143

                     Parameter Estimates
                    Parameter    Standard
Variable      DF     Estimate       Error    t Value    Pr > |t|
Intercept      1      4.61171     0.15262      30.22      <.0001
income         1     -0.11846     0.10885      -1.09      0.2804
price          1     -1.23174     0.05024     -24.52      <.0001

Durbin-Watson D                 0.247
Number of Observations             69
1st Order Autocorrelation       0.852
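The Durbin-Watson statistic itself is straightforward to compute from a residual series. A minimal sketch on short made-up residual sequences (not the spirits data); values near 2 indicate no autocorrelation, while small values indicate positive autocorrelation:

```python
def durbin_watson(e):
    """D = sum((e_t - e_{t-1})^2, t = 2..n) / sum(e_t^2, t = 1..n)."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

# Smoothly trending residuals (positively autocorrelated) give a small D
print(durbin_watson([1.0, 2.0, 3.0, 4.0]))    # 3/30 = 0.1
# Rapidly alternating residuals push D above 2
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 12/4 = 3.0
```

The spirits residuals behave like the first case, which is why d = 0.247 falls below dL.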


[Data listing omitted: the 69 annual observations with columns Obs, consume, income, price, yhat (fitted value), and e (residual).]

Figure 17: Plot of the residuals versus year for British spirits data

6 Special Cases of Multiple Regression

Textbook Sections: 20.2, 20.3

In this section, we will look at three special cases that are frequently used methods of multiple regression. The ideas, such as the Analysis of Variance, tests of hypotheses, and parameter estimates, are exactly the same as before, and we will concentrate on their interpretation through specific examples. The three special cases are:

1. Polynomial Regression
2. Regression Models with Nominal (Dummy) Variables
3. Regression Models Containing Interaction Terms

6.1 Polynomial Regression


While certainly not restricted to this case, it is best to describe polynomial regression in the case of a model with only one predictor variable. In many real-world settings relationships will not be linear, but will demonstrate nonlinear associations. In economics, a widely described phenomenon is "diminishing marginal returns." In this case, y may increase with x, but the rate of increase decreases over the range of x. By adding quadratic terms, we can test if this is the case. Other situations may show that the rate of increase in y is increasing in x.

Example 6.1: Health Club Demand

y = β0 + β1x + β2x² + ε

Again, we assume that ε ~ N(0, σ). In this model, the number of people attending on a day when there are x machines is normally distributed with mean β0 + β1x + β2x² and standard deviation σ. Note that we are no longer saying that the mean is linearly related to x, but rather that it is approximately quadratically related to x (curved). Suppose the club owner leases varying numbers of machines over a period of n = 12 Wednesdays (always advertising how many machines will be there on the following Wednesday), observes the number of people attending the club each day, and obtains the data in Table 25.

Week   # Machines (x)   Attendance (y)
 1          3               555
 2          6               776
 3          1               267
 4          2               431
 5          5               722
 6          4               635
 7          1               218
 8          5               692
 9          3               534
10          2               459
11          6               810
12          4               671

Table 25: Data for health club example

In this case, we would like to fit the multiple regression model y = β0 + β1x + β2x² + ε, which is just like our previous model except instead of a second predictor variable x2, we are using the variable x²; the effect is that the fitted equation ŷ will be a curve in 2 dimensions, not a plane in 3 dimensions as we saw in the weather example. First we will run the regression on the computer, obtaining the Analysis of Variance and the parameter estimates, then plot the data and fitted equation. Table 26 gives the Analysis of Variance for this example and Table 27 gives the parameter estimates and their standard errors. Note that even though we have only one predictor variable, it is being used twice and could in effect be treated as two different predictor variables, so k = 2.

                              ANOVA
Source of    Sum of              Degrees of     Mean
Variation    Squares             Freedom        Square                            F        p-value
MODEL        SSR = 393933.12     k = 2          MSR = 393933.12/2 = 196966.56     253.80   .0001
ERROR        SSE = 6984.55       n-k-1 = 9      MSE = 6984.55/9 = 776.06
TOTAL        SSyy = 400917.67    n-1 = 11

Table 26: The Analysis of Variance Table for health club data

PARAMETER           ESTIMATE         STANDARD ERROR    t FOR H0: βi=0    P-VALUE
INTERCEPT (β0)      b0 = 72.0500     35.2377            2.04             .0712
MACHINES (β1)       b1 = 199.7625    23.0535            8.67             .0001
MACHINES SQ (β2)    b2 = -13.6518     3.2239           -4.23             .0022

Table 27: Parameter estimates and tests of hypotheses for individual parameters

The first test of hypothesis is whether the attendance is associated with the number of machines. This is a test of H0: β1 = β2 = 0. If the null hypothesis is true, that implies mean daily attendance is unrelated to the number of machines, and thus the club owner would purchase very few (if any) of the machines. As before, this test is the F-test from the Analysis of Variance table, which we conduct here at α = .05.

1. H0: β1 = β2 = 0
2. HA: Not both βi = 0
3. T.S.: Fobs = MSR/MSE = 196966.56/776.06 = 253.80
4. R.R.: Fobs > F_{2,9,.05} = 4.26 (This is not provided on the output; the p-value takes the place of it.)
5. p-value: P(F > 253.80) = .0001 (Actually it is less than .0001, but this is the smallest p-value the computer will print.)

Another test with an interesting interpretation is H0: β2 = 0. This is testing the hypothesis that the mean increases linearly with x (since if β2 = 0 this becomes the simple regression model; refer back to the coffee data example). The t-test in Table 27 for this hypothesis has a test statistic tobs = -4.23, which corresponds to a p-value of .0022. Since this is below .05, we reject H0 and conclude β2 ≠ 0. Since b2 is negative, we will conclude that β2 is negative, which is in agreement with the owner's theory that once you get to a certain number of machines, it does not help to keep adding new machines. This is the idea of diminishing returns. Figure 18 shows the actual data and the fitted equation ŷ = 72.0500 + 199.7625x - 13.6518x².

Figure 18: Plot of the data and fitted equation for health club example
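The diminishing-returns interpretation can be quantified from the fitted quadratic: the parabola peaks where its derivative is zero. A quick sketch (note the peak near x = 7.3 lies beyond the observed range of 1-6 machines, so it is an extrapolation):

```python
# Fitted coefficients from Table 27
b0, b1, b2 = 72.0500, 199.7625, -13.6518

def yhat(x):
    """Fitted mean attendance when x machines are leased."""
    return b0 + b1 * x + b2 * x ** 2

x_peak = -b1 / (2 * b2)  # vertex of the downward-opening parabola
print(round(x_peak, 2))  # attendance stops increasing around here

# Marginal gain from one more machine shrinks as x grows
print(round(yhat(2) - yhat(1), 1))  # gain from a 2nd machine
print(round(yhat(6) - yhat(5), 1))  # much smaller gain from a 6th machine
```

Each added machine helps less than the one before, which is exactly what the negative b2 encodes.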

6.2 Regression Models With Nominal (Dummy) Variables

All of the predictor variables we have used so far were numeric, or what are often called quantitative variables. Other variables can also be used, called qualitative variables. Qualitative variables measure characteristics that cannot be described numerically, such as a person's sex, race, religion, or blood type; a city's region or mayor's political affiliation; the list of possibilities is endless. In this case, we frequently have some numeric predictor variable(s) that we believe is (are) related to the response variable, but we believe this relationship may be different for different levels of some qualitative variable of interest.

If a qualitative variable has m levels, we create m - 1 indicator or dummy variables. Consider an example where we are interested in health care expenditures as related to age for men and women, separately. In this case, the response variable is health care expenditures, one predictor variable is age, and we need to create a variable representing sex. This can be done by creating a variable x2 that takes on the value 1 if a person is female and 0 if the person is male. In this case we can write the mean response as before:

E[y|x1, x2] = β0 + β1x1 + β2x2.

Note that for women of age x1, the mean expenditure is E[y|x1, 1] = β0 + β1x1 + β2(1) = (β0 + β2) + β1x1, while for men of age x1, the mean expenditure is E[y|x1, 0] = β0 + β1x1 + β2(0) = β0 + β1x1. This model allows for different means for men and women, but requires them to have the same slope (we will see a more general case in the next section). In this case the interpretation of β2 = 0 is that the means are the same for both sexes; this is a hypothesis a health care professional may wish to test in a study. In this example the variable sex had two levels, so we had to create 2 - 1 = 1 dummy variable. Now consider a second example.

Example 6.2

We would like to see if annual per capita clothing expenditures is related to annual per capita income in cities across the U.S.
Further, we would like to see if there are any differences in the means across the 4 regions (Northeast, South, Midwest, and West). Since the variable region has 4 levels, we will create 3 dummy variables x2, x3, and x4 as follows (we leave x1 to represent the predictor variable per capita income):

x2 = 1 if region = South, 0 otherwise
x3 = 1 if region = Midwest, 0 otherwise
x4 = 1 if region = West, 0 otherwise

Note that cities in the Northeast have x2 = x3 = x4 = 0, while cities in other regions will have exactly one of x2, x3, or x4 equal to 1 (Northeast cities act like males did in the previous example). The data are given in Table 28. The Analysis of Variance is given in Table 29, and the parameter estimates and standard errors are given in Table 30.

Table 28: Clothes expenditures and income example

                  ANOVA
Source of    Sum of       Degrees of    Mean
Variation    Squares      Freedom       Square      F      p-value
MODEL        1116419.0     4            279104.7    2.93   .0562
ERROR        1426640.2    15             95109.3
TOTAL        2543059.2    19

Table 29: The Analysis of Variance Table for clothes expenditure data

Table 30: Parameter estimates and tests of hypotheses for individual parameters

Note that we would fail to reject H0: β1 = β2 = β3 = β4 = 0 at the α = .05 significance level if we looked only at the F-statistic and its p-value (Fobs = 2.93, p-value = .0562). This would lead us to conclude that there is no association between the predictor variables income and region and the response variable clothing expenditures. This is where you need to be careful when using multiple regression with many predictor variables. Look at the test of H0: β1 = 0, based on the t-test in Table 30. Here we observe tobs = 3.11, with a p-value of .0071. We thus conclude β1 ≠ 0: clothing expenditures is related to income, as we would expect. However, we do fail to reject H0: β2 = 0, H0: β3 = 0, and H0: β4 = 0, so we fail to observe any differences among the regions in terms of clothing expenditures after adjusting for the variable income. Figure 19 and Figure 20 show the original data, using region as the plotting symbol, and the 4 fitted equations corresponding to the 4 regions. Recall that the fitted equation is ŷ = -657.428 + 0.113x1 + 237.494x2 + 21.691x3 + 254.992x4, and each of the regions has a different set of levels of variables x2, x3, and x4.

Figure 19: Plot of clothing data, with plotting symbol region (M = Midwest, N = Northeast, S = South, W = West)
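The m - 1 dummy coding described above is mechanical. A short sketch of how the region dummies x2, x3, x4 could be built, with Northeast as the baseline (the helper name is ours):

```python
def region_dummies(region):
    """Return (x2, x3, x4) for the 4-level region variable; Northeast is the baseline."""
    return (int(region == "South"),
            int(region == "Midwest"),
            int(region == "West"))

for r in ["Northeast", "South", "Midwest", "West"]:
    print(r, region_dummies(r))
```

The baseline level gets all zeros; every other level turns on exactly one dummy, so the m - 1 coefficients measure differences from the baseline mean.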


Figure 20: Plot of fitted equations for each region

6.3 Regression Models With Interactions


In some situations, two or more predictor variables may interact in terms of their effects on the mean response. That is, the effect on the mean response of changing the level of one predictor variable depends on the level of another predictor variable. This idea is easiest understood in the case where one of the variables is qualitative.

Example 6.3: Truck and SUV Safety Ratings

Several years ago, The Wall Street Journal reported safety scores on 33 models of SUVs and trucks. Safety scores (y) were reported, as well as the vehicle's weight (x1) and an indicator of whether the vehicle has side air bags (x2 = 1 if it does, 0 if not). We fit a model relating safety scores to weight, presence of side airbags, and an interaction term that allows the effect of weight to depend on whether side airbags are present:

y = β0 + β1x1 + β2x2 + β3x1x2 + ε.

We can write the equations for the two side airbag types as follows:

Side airbags:     y = β0 + β1x1 + β2(1) + β3x1(1) + ε = (β0 + β2) + (β1 + β3)x1 + ε
No side airbags:  y = β0 + β1x1 + β2(0) + β3x1(0) + ε = β0 + β1x1 + ε

The data for the 33 models are given in Table 31. The Analysis of Variance table for this example is given in Table 32. Note that R² = .5518. Table 33 provides the parameter estimates, standard errors, and individual t-tests. Note that the F-test for testing H0: β1 = β2 = β3 = 0 rejects the null hypothesis (F = 11.90, p-value = .0001), but none of the individual t-tests are significant (all p-values exceed 0.05). This can happen due to the nature of the partial regression coefficients. It can be seen that weight is a very good predictor, and that the presence of side airbags and the interaction term do not contribute much to the model (SSE for a model containing only Weight (x1) is 3493.7; use this to test H0: β2 = β3 = 0). For vehicles with side airbags the fitted equation is:

ŷ_airbags = (b0 + b2) + (b1 + b3)x1 = 44.18 + 0.02162x1,

while for vehicles without airbags, the fitted equation is:

ŷ_noairbags = b0 + b1x1 = 76.09 + 0.01262x1.

Figure 21 shows the two fitted equations for the safety data.

Figure 21: Plot of fitted equations for each vehicle type
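The two fitted lines follow from combining the coefficient estimates in Table 33. A minimal sketch, which also locates where the lines cross:

```python
# Estimates from Table 33 (safety data)
b0, b1, b2, b3 = 76.09, 0.01262, -31.91, 0.0090

def safety(x1, airbags):
    """Fitted safety score for weight x1; airbags is 0 or 1."""
    return b0 + b1 * x1 + b2 * airbags + b3 * x1 * airbags

# Intercept and slope for the side-airbag group: (b0 + b2), (b1 + b3)
print(round(b0 + b2, 2), round(b1 + b3, 5))
# The two lines cross where b2 + b3*x1 = 0
print(round(-b2 / b3))
```

Above roughly x1 = 3546 (in the data's weight units) the airbag line is the higher of the two, though neither the airbag nor the interaction coefficient is individually significant.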

Table 31: Safety ratings for trucks and SUVs

                 ANOVA
Source of    Sum of     Degrees of    Mean
Variation    Squares    Freedom       Square     F       p-value
MODEL        3838.03     3            1279.34    11.90   .0001
ERROR        3116.88    29             107.48
TOTAL        6954.91    32

Table 32: The Analysis of Variance Table for truck/SUV safety data

PARAMETER         ESTIMATE    STANDARD ERROR    t FOR H0: βi=0    P-VALUE
INTERCEPT (β0)     76.09      27.04              2.81             .0087
x1 (β1)             0.01262    0.0065            1.93             .0629
x2 (β2)           -31.91      31.78             -1.00             .3236
x1x2 (β3)           0.0090     0.0076            1.18             .2487

Table 33: Parameter estimates and tests of hypotheses for individual parameters, safety data

7 Introduction to Time Series and Forecasting

Textbook Sections: 21.1-21.6

In the remainder of the course, we consider data that are collected over time. Many economic and financial models are based on time series. First, we will describe means of smoothing series, then some simple ways to decompose a series, then we will describe some simple methods used to predict future outcomes based on past values.

7.1 Time Series Components

Textbook Section: 21.2

Time series can be broken into five components: level, long-term trend, cyclical variation, seasonal variation, and random variation. A brief description of each is given below:

Level: Horizontal sales history in the absence of other sources of variation (long-run average).
Trend: Continuing pattern of increasing/decreasing values in the form of a line or curve.
Cyclical: Wave-like patterns that represent business cycles over multiple periods, such as economic expansions and recessions.
Seasonal: Patterns that occur over repetitive calendar periods, such as quarters, months, weeks, or times of the day.
Random: Short-term irregularities that cause variation in individual outcomes, above and beyond the other sources of variation.

Example 7.1: U.S. Cotton Production 1978-2001

Figure 22 represents a plot of U.S. cotton production from 1978 to 2001 (Source: Cotton association web site). We can see that there has been a trend to higher production over time, with cyclical patterns arising as well along the way. Since the data are annual production, we cannot observe seasonal patterns.

Example 7.2: Texas In-state Finance/Insurance/Real Estate Sales 1989-2002

Figure 22: Plot of U.S. cotton production 1978-2001

Table 34 gives in-state gross sales for the Finance, Insurance, and Real Estate (FIRE) industry for the state of Texas for the 4 quarters of years 1989-2002, in hundreds of millions of dollars (Source: State of Texas web site). A plot of the data (with vertical lines delineating years) is shown in Figure 23. There is a clear positive trend in the series, and the fourth quarter tends to have much larger sales than the other three quarters. We will use the variables in the last two columns in a subsequent section.

Figure 23: Plot of quarterly Texas in-state FIRE gross sales 1989-2002

7.2 Smoothing Techniques

Textbook Section: 21.3

Moving Averages are averages of values at a particular time period and values that are near it in time. We will focus on odd-numbered moving averages, as they are simpler to describe and implement (the textbook also covers even-numbered MAs as well). A 3-period moving average involves averaging the value directly prior to the current time point, the current value, and the value directly after the current time point. There will not be values for either the first or last periods of the series. Similarly, a 5-period moving average will include the current time point, the two prior time points, and the two subsequent time points.

Example 7.3: U.S. Internet Retail Sales 1999q4-2003q1

The data in Table 35 give the U.S. e-commerce sales for n = 14 quarters (quarter 1 is the 4th quarter of 1999 and quarter 14 is preliminary reported sales for the 1st quarter of 2003) in millions of dollars (Source: U.S. Census Bureau).

Quarter   Sales (yt)   MA(3)    ES(0.1)   ES(0.5)
  1         5393          .       5393      5393
  2         5722         5788     5426      5558
  3         6250         6350     5508      5904
  4         7079         7526     5665      6491
  5         9248         8112     6024      7870
  6         8009         8387     6222      7939
  7         7904         7936     6390      7922
  8         7894         8862     6541      7908
  9        10788         9384     6965      9348
 10         9470        10006     7216      9409
 11         9761         9899     7470      9585
 12        10465        11332     7770     10025
 13        13770        12052     8370     11897
 14        11921          .       8725     11909

Table 35: Quarterly e-commerce sales and smoothed values for U.S. 1999q4-2003q1

To obtain the three-period moving average (MA(3)) for the second quarter, we average the first, second, and third period sales:

(5393 + 5722 + 6250)/3 = 17365/3 = 5788.3 ≈ 5788

We can similarly obtain the three-period moving averages for quarters 3-13. The data and three-period moving averages are given in Figure 24. The moving average is the dashed line, while the original series is the solid line.

Exponential Smoothing is an alternative means of smoothing the series. It makes use of all prior time points, with higher weights on more recent time points and exponentially decaying weights on more distant time points. One advantage is that we have smoothed values for all time points. One drawback is that we must select a tuning parameter (although we would also have to choose the length of a moving average as well, for that method). One widely used convention is to set the first period's smoothed value to the first observation, then make subsequent smoothed values a weighted average of the current observation and the previous value of the smoothed series. We use the notation St for the smoothed value at time t.

Figure 24: Plot of quarterly U.S. internet retail sales and 3-period moving average
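Both smoothers take only a few lines of code. A sketch that reproduces the MA(3) and exponentially smoothed columns of Table 35 (helper names are ours):

```python
# Quarterly e-commerce sales from Table 35, quarters 1-14
sales = [5393, 5722, 6250, 7079, 9248, 8009, 7904,
         7894, 10788, 9470, 9761, 10465, 13770, 11921]

def ma3(y):
    """Centered 3-period moving averages; the endpoints are undefined."""
    mid = [(y[t - 1] + y[t] + y[t + 1]) / 3 for t in range(1, len(y) - 1)]
    return [None] + mid + [None]

def exp_smooth(y, w):
    """S1 = y1; St = w*yt + (1 - w)*S_{t-1} for t >= 2."""
    s = [y[0]]
    for yt in y[1:]:
        s.append(w * yt + (1 - w) * s[-1])
    return s

print(round(ma3(sales)[1], 1))              # quarter 2 MA(3), as computed in the text
print(round(exp_smooth(sales, 0.1)[1], 1))  # quarter 2 with w = 0.1
print(round(exp_smooth(sales, 0.5)[1], 1))  # quarter 2 with w = 0.5
```

Larger w tracks the raw series more closely; smaller w produces a smoother, more sluggish curve.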

S1 = y1        St = w·yt + (1 - w)·St-1,  t ≥ 2

Example 7.3 (Continued)

Thus, for quarter 4 of 1999, we set S1 = y1 = 5393. In Table 35, we include smoothed values based on w = 0.1 and w = 0.5, respectively:

w = 0.1: S2 = 0.1·y2 + 0.9·S1 = 0.1(5722) + 0.9(5393) = 572.2 + 4853.7 = 5425.9 ≈ 5426
w = 0.5: S2 = 0.5·y2 + 0.5·S1 = 0.5(5722) + 0.5(5393) = 2861.0 + 2696.5 = 5557.5 ≈ 5558

The smoothed values are given in Table 35, as well as in Figure 25. The solid line is the original series, the smoothest line is w = 0.1, and the intermediate line is w = 0.5.

Figure 25: Plot of quarterly U.S. internet retail sales and exponentially smoothed series

7.3 Estimating Trend and Seasonal Effects

Textbook Section: 21.4

While the cyclical patterns are difficult to predict and estimate, we can estimate linear trend and seasonal indexes fairly simply. Further, there is no added difficulty if the trend is nonlinear (quadratic), but we will consider only the linear case here. First, we must identify seasons; these can be weeks, months, or quarters (or even times of the day or days of the week). Then we fit a linear trend for the entire series. This is followed by taking the ratio of the actual to the fitted value (from the regression equation) for each period. Next, we average these ratios for each season, and adjust so that the averages sum to 1.0.

Example 7.2 (Continued)

Consider the Texas gross (instate) sales for the FIRE industry. The seasons are the four quarters. Fitting a simple linear regression, relating sales to time period, we get: yt = b0+ b1t =1.6376 + 0.08753t The .tted values (as well as the observed values) have been shown previously in Table 34. Also for each outcome, we obtain the ratio of the observed to .tted value, also given in the table. Consider the .rst and last cases: t =1 : y1=1.567 y1=1.6376 + 0.08753(1) =1.725 =0.908
y1 1.567

y11.725 t =56 : y56=8.522 y1=1.6376 + 0.08753(56) =6.539 =1.303


y1 8.522

y16.539 Next, we take the mean of the observedto.tted ratio for each quarter. There are 14 years of data: 0.908+1.016+0.763+0.884+0.853+0.849+0.785+0.798+0.728+0.83 0+0.829+0.824+0.870+0.860 Q1:=0.843 14 The means for the remaining three quarters are: Q2: 0.906 Q3: 0.875 Q4: 1.398 The means sum to 4.022, and have a mean of 4.022/4=1.0055. If we divide each mean by 1.0055, the indexes will sum to 1: Q1: 0.838 Q2: 0.901 Q3: 0.870 Q4: 1.390 The seasonally adjusted time series is given bydividing each observed value by its seasonal index. This way, we can determine when there are

real changes in the series, beyond seasonal fluctuations. Table 36 contains all components as well as the seasonally adjusted values.

7.4 Introduction to Forecasting

Textbook Section: 21.5

There is an unlimited number of ways of forecasting future outcomes, so we need means of comparing the various methods. First, we introduce some notation:

• yt — actual (random) outcome at time t, unknown prior to t
• Ft — forecast of yt, made prior to t
• et — forecast error: et = yt − Ft (the book does not use this notation)

Two commonly used measures for comparing forecasting methods are given below, where n is the number of forecasts:

Mean Absolute Deviation (MAD):  MAD = Σt=1..n |yt − Ft| / n = Σ|et| / n

Sum of Squared Errors (SSE):  SSE = Σt=1..n (yt − Ft)² = Σ et²

When comparing forecasting methods, we wish to minimize one or both of these quantities.

7.5 Simple Forecasting Techniques

Textbook Section: 21.6 and Supplement

In this section, we describe some simple methods of using past data to predict future outcomes. Most forecasts you hear reported are generally complex hybrids of these techniques.
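The two error measures above can be computed directly. A minimal Python sketch (the function names and the outcome/forecast values are my own, made up for illustration, not taken from the tables):

```python
def mad(y, f):
    """Mean Absolute Deviation: average of |yt - Ft| over the n forecasts."""
    return sum(abs(yt - ft) for yt, ft in zip(y, f)) / len(f)

def sse(y, f):
    """Sum of Squared Errors: sum of (yt - Ft)^2 over the n forecasts."""
    return sum((yt - ft) ** 2 for yt, ft in zip(y, f))

y = [5.3, 4.2, 3.9]   # hypothetical outcomes
f = [5.0, 4.5, 4.0]   # hypothetical forecasts
print(mad(y, f))      # (0.3 + 0.3 + 0.1)/3 ≈ 0.233
print(sse(y, f))      # 0.09 + 0.09 + 0.01 ≈ 0.19
```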

7.5.1 Moving Averages

This method, which is not included in the text, is a slight adjustment to the centered moving averages in the smoothing section. At time point t, we use the mean of the previous k observations to forecast yt:

Ft = (yt−1 + yt−2 + ··· + yt−k) / k

Problem: How to choose k?

Example 7.4 — Anheuser-Busch Annual Dividend Yields 1952–1995

Table 37 gives average dividend yields for Anheuser-Busch for the years 1952–1995 (Source: Value Line), along with forecasts and errors based on moving averages with lags of 1, 2, and 3. Note that we don't have early-year forecasts, and the longer the lag, the longer we must wait until we get our first forecast. Here we compute the moving averages for year 1963:

1-Year: F1963 = y1962 = 3.2
2-Year: F1963 = (y1962 + y1961)/2 = (3.2 + 2.8)/2 = 3.0
3-Year: F1963 = (y1962 + y1961 + y1960)/3 = (3.2 + 2.8 + 4.4)/3 = 3.47
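The three moving-average forecasts for 1963 can be reproduced with a short function. A minimal Python sketch (the function name is my own; the yields for 1960–1962 are taken from Table 37):

```python
def moving_average_forecast(y, k):
    """Forecast the next value as the mean of the last k observations."""
    if len(y) < k:
        raise ValueError("need at least k observations")
    return sum(y[-k:]) / k

history = [4.4, 2.8, 3.2]  # dividend yields for 1960, 1961, 1962
print(moving_average_forecast(history, 1))  # 3.2
print(moving_average_forecast(history, 2))  # 3.0
print(moving_average_forecast(history, 3))  # ≈ 3.47
```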

Figure 26 displays the raw data and moving-average forecasts.

[Table 37, which lists for each year 1952–1995 the observed yield yt together with the forecast and error pairs (F1,t, e1,t), (F2,t, e2,t), and (F3,t, e3,t) for the 1-, 2-, and 3-year moving averages, is not reproduced here.]

Table 37: Dividend yields, forecasts, and errors for 1-, 2-, and 3-year moving averages


Figure 26: Plot of the data and moving-average forecasts for Anheuser-Busch dividend data

7.5.2 Exponential Smoothing

Exponential smoothing is a method of forecasting that weights data from previous time periods with exponentially decreasing magnitudes. Forecasts can be written as follows, where the forecast for period 2 is traditionally (but not always) simply the outcome from period 1:

Ft+1 = St = w·yt + (1 − w)·St−1 = w·yt + (1 − w)·Ft

where:

• Ft+1 is the forecast for period t + 1
• yt is the outcome at time t
• St is the smoothed value for period t (St−1 = Ft)
• w is the smoothing constant (0 ≤ w ≤ 1)
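Unrolling the recursion shows that past observations receive exponentially decaying weights: St = w·yt + w(1 − w)·yt−1 + ··· + (1 − w)^(t−1)·y1 when S1 = y1. A minimal Python sketch verifying that the recursive and explicit forms agree (function names are my own; the series values are the first five yields from Example 7.4):

```python
def smooth_recursive(y, w):
    """St = w*yt + (1-w)*S(t-1), with S1 = y1."""
    s = y[0]
    for yt in y[1:]:
        s = w * yt + (1 - w) * s
    return s

def smooth_explicit(y, w):
    """Same St written as a weighted sum with exponentially decaying weights."""
    n = len(y)
    # weight on y1 is (1-w)^(n-1); weight on the k-th later value decays as (1-w)^(n-1-k)
    total = (1 - w) ** (n - 1) * y[0]
    for k in range(1, n):
        total += w * (1 - w) ** (n - 1 - k) * y[k]
    return total

y = [5.3, 4.2, 3.9, 5.2, 5.8]
for w in (0.2, 0.5, 0.8):
    assert abs(smooth_recursive(y, w) - smooth_explicit(y, w)) < 1e-12
```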

Forecasts are smoother than the raw data, and the weights of previous observations decline exponentially with time.

Example 7.4 (Continued) We use 3 smoothing constants (allowing decreasing amounts of smoothness) for illustration:

• w = 0.2: Ft+1 = 0.2yt + 0.8St−1 = 0.2yt + 0.8Ft
• w = 0.5: Ft+1 = 0.5yt + 0.5St−1 = 0.5yt + 0.5Ft
• w = 0.8: Ft+1 = 0.8yt + 0.2St−1 = 0.8yt + 0.2Ft

For year 2 (1953), set F1953 = y1952, then cycle from there.

[Table 38, which lists for each year 1952–1995 the observed yield yt together with the forecast and error pairs for w = 0.2, 0.5, and 0.8, is not reproduced here.]

Table 38: Dividend yields, forecasts, and errors based on exponential smoothing with w = 0.2, 0.5, 0.8

Table 38 gives average dividend yields for Anheuser-Busch for the

years 1952–1995 (Source: Value Line), along with forecasts and errors based on exponential smoothing with w = 0.2, 0.5, and 0.8. Here we obtain forecasts based on exponential smoothing, beginning with year 2 (1953):

1953: Fw=.2,1953 = y1952 = 5.30    Fw=.5,1953 = y1952 = 5.30    Fw=.8,1953 = y1952 = 5.30
1954 (w = 0.2): Fw=.2,1954 = .2y1953 + .8Fw=.2,1953 = .2(4.20) + .8(5.30) = 5.08
1954 (w = 0.5): Fw=.5,1954 = .5y1953 + .5Fw=.5,1953 = .5(4.20) + .5(5.30) = 4.75
1954 (w = 0.8): Fw=.8,1954 = .8y1953 + .2Fw=.8,1953 = .8(4.20) + .2(5.30) = 4.42

Which level of w appears to be discounting more distant observations at a quicker rate? What would happen if w = 1? If w = 0? Figure 27 gives the raw data and exponential smoothing forecasts.
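These forecasts can be reproduced with a short function. A minimal Python sketch (the function name is my own; the 1952–1953 yields are from the table). It also answers the rhetorical questions: w = 1 gives the naive forecast Ft+1 = yt, while w = 0 never updates from the initial value.

```python
def exp_smooth_forecasts(y, w):
    """Return forecasts F2..F(n+1): F2 = y1, then F(t+1) = w*yt + (1-w)*Ft."""
    f = [y[0]]  # the forecast for period 2 is the period-1 outcome
    for yt in y[1:]:
        f.append(w * yt + (1 - w) * f[-1])
    return f

y = [5.30, 4.20]  # 1952 and 1953 dividend yields
print(exp_smooth_forecasts(y, 0.2)[-1])  # F1954 ≈ 5.08
print(exp_smooth_forecasts(y, 0.5)[-1])  # F1954 ≈ 4.75
print(exp_smooth_forecasts(y, 0.8)[-1])  # F1954 ≈ 4.42
assert exp_smooth_forecasts(y, 1.0)[-1] == y[-1]  # w = 1: naive forecast
assert exp_smooth_forecasts(y, 0.0)[-1] == y[0]   # w = 0: never updates
```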


Figure 27: Plot of the data and exponential smoothing forecasts for Anheuser-Busch dividend data

Table 39 gives measures of forecast errors for the three moving average and three exponential smoothing methods.

                      Moving Average              Exponential Smoothing
Measure    1-Period   2-Period   3-Period    w = 0.2   w = 0.5   w = 0.8
MAD          0.43       0.53       0.62        0.82      0.58      0.47
MSE          0.30       0.43       0.57        0.97      0.48      0.34

Table 39: Relative performances of 6 forecasting methods, Anheuser-Busch data

Note that MSE is SSE/n, where n is the number of forecasts.

7.6 Forecasting with Seasonal Indexes
After trend and seasonal indexes have been estimated, future outcomes can be forecast by the equation:

Ft = [b0 + b1t] · SIt

where b0 + b1t is the linear trend and SIt is the seasonal index for period t.

Example 7.2 (Continued) For the Texas FIRE gross sales data, we have:

b0 = 1.6376    b1 = 0.08753    SIQ1 = 0.838    SIQ2 = 0.901    SIQ3 = 0.870    SIQ4 = 1.390

Thus, for the 4 quarters of 2003 (t = 57, 58, 59, 60), we have:

Q1: F57 = [1.6376 + 0.08753(57)](0.838) = 5.553
Q2: F58 = [1.6376 + 0.08753(58)](0.901) = 6.050
Q3: F59 = [1.6376 + 0.08753(59)](0.870) = 5.918
Q4: F60 = [1.6376 + 0.08753(60)](1.390) = 9.576

7.6.1 Autoregression

Sometimes a regression is run on past or lagged values of the dependent

variable (and possibly other variables). An autoregressive model with independent variables corresponding to k lagged periods can be written as follows:

yt = b0 + b1yt−1 + b2yt−2 + ··· + bkyt−k

Note that the regression cannot be run for the first k responses in the series. Technically, forecasts can be given only for periods after the regression has been fit, since the regression model depends on all periods used to fit it.

Example 7.4 (Continued) From computer software, autoregressions based on lags of 1, 2, and 3 periods are fit:

1-Period: ŷt = 0.29 + 0.88yt−1
2-Period: ŷt = 0.29 + 1.18yt−1 − 0.29yt−2
3-Period: ŷt = 0.28 + 1.21yt−1 − 0.37yt−2 + 0.05yt−3

Table 40 gives the raw data and forecasts based on the three autoregression models. Figure 28 displays the actual outcomes and predictions.

Table 40: Average dividend yields and forecasts/errors based on autoregression with lags of 1, 2, and 3 periods
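Returning to the trend-plus-seasonal forecasts of Section 7.6, those four 2003 values can be reproduced numerically. A minimal Python sketch (the function name is my own; the estimates are the ones from Example 7.2):

```python
def seasonal_forecast(t, b0, b1, si):
    """Forecast for period t: (linear trend at t) times (seasonal index)."""
    return (b0 + b1 * t) * si

b0, b1 = 1.6376, 0.08753
indexes = {1: 0.838, 2: 0.901, 3: 0.870, 4: 1.390}  # Q1..Q4 from Example 7.2

# The four quarters of 2003 correspond to t = 57..60
for q, t in zip(range(1, 5), range(57, 61)):
    f = seasonal_forecast(t, b0, b1, indexes[q])
    print(f"Q{q}: F{t} = {f:.3f}")  # 5.553, 6.050, 5.918, 9.576
```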

Figure 28: Plot of the data and autoregressive forecasts for Anheuser-Busch dividend data
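A fitted autoregression is turned into a forecast by plugging the most recent observed values into the equation. A minimal Python sketch using the 1-period model from Example 7.4 (the function name is my own; 2.80 is the final dividend yield listed in the series):

```python
def ar_forecast(coefs, lags):
    """Forecast from an autoregression: b0 + b1*y(t-1) + ... + bk*y(t-k).

    coefs = [b0, b1, ..., bk]; lags = [y(t-1), ..., y(t-k)].
    """
    return coefs[0] + sum(b * y for b, y in zip(coefs[1:], lags))

# 1-period model: yhat_t = 0.29 + 0.88*y(t-1), applied to the last observed yield
f = ar_forecast([0.29, 0.88], [2.80])
print(round(f, 3))  # 0.29 + 0.88(2.80) = 2.754
```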