
Lecture Notes for 36-707 Linear Regression

Fall 2010. Revised from Larry Wasserman's 707 notes.

1 Prediction

Regression analysis is used to answer questions about how one variable depends on the level of one or more other variables. Does diet correlate with cholesterol level, and does this relationship depend on other factors, such as age, smoking status, and level of exercise? We start by studying linear regression. Virtually all other methods for studying dependence among variables are variations on the idea of linear regression; for this reason the textbook focuses on it. The notes start with linear regression and then go on to cover other modern regression-based techniques, such as nonparametric regression, generalized linear regression, tree-based methods, and classification.

In the simplest scenario we have one response variable (Y) and one predictor variable (X). For example, we might predict a son's height from his father's height (Figure 1), or a cat's heart weight from its total body weight (Figure 2).

Suppose (X, Y) have a joint distribution f(x, y). You observe X = x. What is your best prediction of Y? Let g(x) be any prediction function, for instance a linear one. The prediction error (or risk) is

    R(g) = E(Y − g(X))²

where E is the expected value with respect to the joint distribution f(x, y). Condition on X = x and let

    r(x) = E(Y | X = x) = ∫ y f(y|x) dy

be the regression function. Let ε = Y − r(X). Then

    E(ε) = E[ E(Y − r(X) | X) ] = 0

and we can write

    Y = r(X) + ε.    (1)

Key result: for any g, the regression function minimizes the prediction error: R(r) ≤ R(g).

We don't know r(x), so we estimate it from the data. This is the fundamental problem in regression analysis.

1.1 Some Terminology

Given data (X₁, Y₁), …, (Xₙ, Yₙ) we have two goals:

- Estimation: find an estimate r̂(x) of the regression function r(x).
- Prediction: given a new X, predict Y; we use Ŷ = r̂(X) as the prediction.

At first we assume that Yᵢ ∈ R. Later in the course we consider other cases, such as Yᵢ ∈ {0, 1}.

                r linear                       r arbitrary
    X scalar    r(x) = β₀ + β₁x                r(x) is some smooth function
                simple linear regression       nonparametric regression
    X vector    r(x) = β₀ + Σⱼ βⱼxⱼ            r(x₁, …, x_p) is some smooth function
                multiple linear regression     multiple nonparametric regression


Figure 1: Galton data. Predict the son's height from the father's height.

2 Simple Linear Regression: X scalar and r(x) linear

Suppose that Yᵢ ∈ R and Xᵢ ∈ R, and that

    r(x) = β₀ + β₁x.    (2)

Figure 2: Cat data: heart weight (g) versus body weight (kg).

This model is only an approximation to the truth, but often it is close enough to correct that it is worth seeing what we can learn with a simple model. Later on we'll learn that we needn't assume that r is linear. I use the h symbol to alert you to model-based statements. We can write

    Yᵢ = β₀ + β₁Xᵢ + εᵢ    (3)

where E(εᵢ) = 0 and ε₁, …, εₙ are independent. We also assume that V(εᵢ) = σ² does not depend on x (homoskedasticity). The unknown parameters are β₀, β₁, σ². Define the residual sum of squares
    RSS(β₀, β₁) = Σ_{i=1}^n ( Yᵢ − (β₀ + β₁Xᵢ) )².    (4)

The least squares (LS) estimators minimize RSS(β₀, β₁).

2.1 Theorem. The LS estimators are

    β̂₁ = Σ_{i=1}^n (Xᵢ − X̄)(Yᵢ − Ȳ) / Σ_{i=1}^n (Xᵢ − X̄)²    (5)

    β̂₀ = Ȳ − β̂₁X̄    (6)

where X̄ = n⁻¹ Σ_{i=1}^n Xᵢ and Ȳ = n⁻¹ Σ_{i=1}^n Yᵢ. For details, see Weisberg, p. 273.
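As a quick numerical check (a sketch on simulated data, not part of the original notes), the closed-form estimators in (5) and (6) can be computed directly and compared against R's built-in lm():

```r
# Sketch: verify the closed-form least squares estimators (5)-(6) against lm().
# The data below are simulated for illustration; any (x, y) pairs would do.
set.seed(1)
x = rnorm(50, mean = 3, sd = 0.5)
y = -0.4 + 4 * x + rnorm(50, sd = 1.5)

beta1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # (5)
beta0.hat = mean(y) - beta1.hat * mean(x)                              # (6)

fit = lm(y ~ x)
print(c(beta0.hat, beta1.hat))
print(unname(coef(fit)))   # agrees with the hand computation
```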

Compare RSS to the risk: the former is an empirical version of the latter, evaluated at g(X) = β₀ + β₁X. We define:

- The fitted line: r̂(x) = β̂₀ + β̂₁x
- The predicted or fitted values: Ŷᵢ = r̂(Xᵢ) = β̂₀ + β̂₁Xᵢ
- The residuals: ε̂ᵢ = Yᵢ − Ŷᵢ
- The residual sum of squares: RSS = Σ_{i=1}^n ε̂ᵢ²

An unbiased estimate of σ² is

    σ̂² = RSS / (n − 2).    (7)

The estimators are random variables and have the following properties (conditional on X₁, …, Xₙ):

    E(β̂₀) = β₀,    E(β̂₁) = β₁,    V(β̂₁) = σ² / (n s²ₓ),

where s²ₓ = n⁻¹ Σ_{i=1}^n (Xᵢ − X̄)². Let's derive some of these facts. Let

    dᵢ = (Xᵢ − X̄) / Σ_{j=1}^n (Xⱼ − X̄)².

Note that Σᵢ dᵢ = 0 and Σᵢ dᵢ(Xᵢ − X̄) = 1. Then

    E(β̂₁) = E[ Σᵢ dᵢ(Yᵢ − Ȳ) ]
           = Σᵢ dᵢ E(Yᵢ) − E(Ȳ) Σᵢ dᵢ
           = Σᵢ dᵢ [β₀ + β₁X̄ + β₁(Xᵢ − X̄)]
           = β₁.

Also,

    V(β̂₁) = Σᵢ dᵢ² V(Yᵢ) = σ² Σᵢ dᵢ² = σ² / (n s²ₓ),

and

    E(β̂₀) = E(Ȳ − β̂₁X̄) = (1/n) Σᵢ (β₀ + β₁Xᵢ) − X̄ E(β̂₁) = β₀ + β₁X̄ − β₁X̄ = β₀.

Also, E(σ̂²) = σ². The standard error of β̂₁ is

    se(β̂₁) = σ̂ / (√n sₓ).

Both β̂₀ and β̂₁ are linear combinations of Y₁, …, Yₙ, so it follows from the Central Limit Theorem that they are approximately normally distributed.

h Approximate Normality:

    β̂₀ ≈ N(β₀, se²(β̂₀)),    β̂₁ ≈ N(β₁, se²(β̂₁)).    (8)

If εᵢ ~ N(0, σ²) then:

1. Equation (8) is exact.
2. The least squares estimators are the maximum likelihood estimators.
3. The variance estimator satisfies

    σ̂² ~ σ² χ²_{n−2} / (n − 2),  and  E[σ̂²] = σ²(n − 2)/(n − 2) = σ².

Note: to verify these results, again assume the calculations are performed conditional on X₁, …, Xₙ. Then Yᵢ ~ N(β₀ + β₁xᵢ, σ²). The likelihood is

    f(y₁, …, yₙ; β₀, β₁, σ²) = Π_{i=1}^n f(yᵢ; β₀, β₁, σ²).

If we write out the likelihood of the normal model, the result follows directly.
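The claim that least squares coincides with maximum likelihood under the normal model can be checked numerically. A sketch (simulated data, generic numerical optimization rather than the closed form):

```r
# Sketch: maximize the normal log-likelihood numerically; compare with lm().
set.seed(2)
x = runif(40)
y = 1 + 2 * x + rnorm(40, sd = 0.3)

negloglik = function(theta) {
  # theta = (beta0, beta1, log(sigma)); log-parameterize sigma to keep it positive
  mu = theta[1] + theta[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(theta[3]), log = TRUE))
}
mle = optim(c(0, 0, 0), negloglik)$par
print(mle[1:2])                 # MLE of (beta0, beta1)
print(unname(coef(lm(y ~ x)))) # least squares: essentially the same values
```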

2.1 Inference

It follows from (8) that an approximate 1 − α confidence interval for β₁ is

    β̂₁ ± z_{α/2} se(β̂₁)    (9)

where z_{α/2} is the upper α/2 quantile of a standard Normal: P(Z > z_{α/2}) = α/2, where Z ~ N(0, 1). For α = .05, z_{α/2} = 1.96 ≈ 2, so an approximate 95 percent confidence interval for β₁ is

    β̂₁ ± 2 se(β̂₁).    (10)

2.2 Remark. If the residuals are Normal, then an exact 1 − α confidence interval for β₁ is

    β̂₁ ± t_{α/2,n−2} se(β̂₁)    (11)

where t_{α/2,n−2} is the upper α/2 quantile of a t distribution with n − 2 degrees of freedom. Although technically correct, and used in practice, this interval is bogus. If n is large, t_{α/2,n−2} ≈ z_{α/2}, so one may as well just use the Normal interval. If n is so small that t_{α/2,n−2} is much different from z_{α/2}, then n is too small to be doing statistical inference. (Do you really believe that the residuals are exactly Normal?)

To test

    H₀: β₁ = 0  versus  H₁: β₁ ≠ 0    (12)

use the test statistic

    z = (β̂₁ − 0) / se(β̂₁).    (13)

Under H₀, z ≈ N(0, 1). The p-value is

    p-value = P(|Z| > |z|) = 2Φ(−|z|)    (14)

where Z ~ N(0, 1) and Φ is the standard Normal CDF. Reject H₀ if the p-value is small.
2.2 R Statistical Package

R is a flexible, powerful statistical programming language. This flexibility is its biggest strength, but also its greatest challenge: like the English language, there are many ways to say the same thing. Throughout these notes, examples are provided that show at least one way to accomplish the task at hand. To find out more about any function in R, use help(functionname). To search for a function whose name you don't know, use help.search(keyword). Refer to the R documents on the class web page.

2.2.1 Matrices

> satdata = matrix(c(61, 70, 63, 72, 66, 1100, 1320, 1500, 1230, 1400),
+   ncol = 2, dimnames = list(rep(" ", 5), c("Height", "SAT")))
> satdata
  Height  SAT
      61 1100
      70 1320
      63 1500
      72 1230
      66 1400
> satdata[,1]
61 70 63 72 66
> satdata[4,]
Height    SAT
    72   1230
> satdata[3,2]
[1] 1500

> score = satdata[,2]
> height = satdata[,1]
> which(score >= 1300)
2 3 5
> any(score > 1510)
[1] FALSE

More commands: max(height), min(score), var(score), sd(height), median(score), sum(height)

2.2.2 Basic Graphs

> plot(satdata, xlab = "Height (inches)", ylab = "SAT Score",
+   main = "Scatterplot of Height vs. SAT Score", pch = 19, col = "red")
> abline(h = mean(satdata[,2]))
> legend(66, 1150, "Mean SAT Score", lty = 1)

Figure 3: Scatterplot of Height vs. SAT Score, with the mean SAT score drawn as a horizontal line.

2.2.3 Extras

heightcm = height*2.54   # scalar operations apply to the entire vector
samplescores = rnorm(10, mean = 1200, sd = 100)
ls()                     # list the objects in the workspace
rm(heightcm)             # remove an object
regressionline = lm(score ~ height)
summary(regressionline)
read.table("filename.txt", ...)  # reads data from a file called filename.txt and creates an R object

2.2.4 More on Plotting

These commands can help you to make attractive graphs with more meaningful labels, etc.
Commands to use before plot: par(mfrow = c(numrow, numcol)): Sets the number of graphs per window. For example, par(mfrow = c(3, 2)) will make a window for 3 rows and 2 columns of graphs. You can return your graphing window to its default with par(mfrow = c(1, 1)).

Graphics parameters for plot commands:

Labels:
  xlab: set the x-axis label
  ylab: set the y-axis label
  main: set the main title
e.g. plot(x, y, xlab = "Height (cm)", ylab = "Weight (kg)", main = "Scatterplot of Height v. Weight")

Points:
  pch: set the shape of plotted points
  cex: set the size of plotted points; 1 is the default
  col: set the color of plotted points
Each of these can be a vector. For example, if you have four points, you could set each color as follows: plot(x, y, col = c("red", "orange", "cyan", "sandybrown")).

Window modifiers: xlim: set the x-range of the window. e.g. plot(..., xlim = c(10, 30)) will only show points in the x range of 10 to 30. ylim: set the y-range of the window.

Some other graph types: hist(x): plots a histogram pie(y): pie graph for categorical variables boxplot(x): plots a boxplot barplot(x): bar plot for categorical variables

Plotting Additional Lines: abline(intercept, slope) plots a line on the previous graph. You can draw vertical lines with abline(v = xvalue) and horizontal lines with abline(h = yvalue). Additionally, you can use this command to plot regression lines on graphs, e.g.:

regline = lm(y ~ x)
plot(x, y)
abline(regline)

You can modify the line's width and type: lwd sets the line width (1 is the default); lty sets the line type. For example, abline(regline, lwd = 2.5, lty = 2, col = "purple") plots a thick dashed purple line on the graph.

Plotting Additional Points: points(x, y): plot additional points. Useful if you have different vectors for different groups. e.g. plot(x1, y1, col = "green") points(x2, y2, col = "orange") Adding a legend: legend(xposition, yposition, c("label1", "label2", ...), ...) Add a legend to your plot. For example: legend(10, 2, c("Freshmen", "Sophomores", "Juniors", "Seniors"), col = c("yellow", "orange", "red", "blue"), pch = 19)

2.3 Example. Here is an example based on the relationship between a cat's body weight and heart weight (Figure 2). To get the data we load the library MASS, the bundle of functions and datasets supporting Venables and Ripley, Modern Applied Statistics with S.

### Cat Example ###
> library(MASS)   # load library containing the data
> attach(cats)
> dim(cats)
[1] 144   3       # there are 144 cats in the study
> help(cats)

The help file displays basic information about a data set. It is very important to learn how the variables in the data set are recorded before proceeding with any analysis. For example, when dealing with population data, some variables might be recorded as per-capita measurements, while others might be raw counts.

Description: The heart and body weights of samples of male and female cats used for digitalis experiments. The cats were all adult, over 2 kg body weight.

Format: This data frame contains the following columns:
  Sex: sex factor; levels "F" and "M".
  Bwt: body weight in kg.
  Hwt: heart weight in g.

> names(cats)
[1] "Sex" "Bwt" "Hwt"
> summary(cats)
 Sex      Bwt             Hwt
 F:47   Min.   :2.000   Min.   : 6.30
 M:97   1st Qu.:2.300   1st Qu.: 8.95
        Median :2.700   Median :10.10
        Mean   :2.724   Mean   :10.63
        3rd Qu.:3.025   3rd Qu.:12.12
        Max.   :3.900   Max.   :20.50

> par(mfrow = c(2,2))   # configure output window for 4 plots, 2x2
> boxplot(Bwt, Hwt, names = c("Body Weight (kg)", "Heart Weight (g)"),
+   main = "Boxplot of Body and Heart Weights")
> regline = lm(Hwt ~ Bwt)
> summary(regline)

Call:
lm(formula = Hwt ~ Bwt)

Residuals:
     Min       1Q   Median       3Q      Max
-3.56937 -0.96341 -0.09212  1.04255  5.12382

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3567     0.6923  -0.515    0.607
Bwt           4.0341     0.2503  16.119   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.452 on 142 degrees of freedom
Multiple R-squared: 0.6466,  Adjusted R-squared: 0.6441
F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16

The most important section of the R summary output is "Coefficients:". Here we see that the fitted regression line is

    Ĥwt = −0.3567 + 4.0341 · Bwt.

The final column gives symbols indicating the significance of each coefficient. For example, the three stars (***) for Bwt indicate that the p-value for the Bwt coefficient is between 0 and 0.001. That is, if Bwt and Hwt were independent, the probability of observing a Bwt coefficient as large in magnitude as 4.0341 would be less than 0.001. In fact, we see that the associated p-value is < 2e−16. Generally we are not concerned with the significance of the intercept coefficient.

> plot(Bwt, Hwt, xlab = "Body Weight (kg)", ylab = "Heart Weight (g)",
+   main = "Scatterplot of Body Weight v. Heart Weight", pch = 19)
> abline(regline, lwd = 2)
> names(regline)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
> r = regline$residuals
> plot(Bwt, r, pch = 19, xlab = "Body Weight (kg)", ylab = "Residuals",
+   main = "Plot of Body Weight v. Residuals")
> abline(h = 0, col = "red", lwd = 2)
> r = scale(r)
> qqnorm(r)
> qqline(r)

Plots are shown in Figure 4. A Q-Q plot displays the sample (observed) quantiles against theoretical quantiles. Here the sample quantiles are the scaled residuals; the theoretical quantiles are drawn from the standard normal distribution. An ideal Q-Q plot has points falling more or less on the diagonal line, indicating that the residuals are approximately normally distributed. If the points fall far from the line, a transformation may improve the reliability of the inferences (more later).

2.4 Example (Election 2000). Background: in 2000, Bush and Gore were the main candidates for President. Buchanan, a strongly conservative candidate, was also on the ballot. In the state of Florida, Bush and Gore essentially tied, so the counts were examined carefully county by county. Palm Beach County exhibited strange results: even though the people in this county are not conservative, many votes were cast for Buchanan. Examination of the voting ballot revealed that it was easy to mistakenly vote for Buchanan when intending to vote for Gore. Let's look at the count of votes by county. Figure 5 shows the plot of votes for Buchanan (Y) versus votes for Bush (X) in Florida. The least squares estimates (omitting Palm Beach County) and the standard errors are

    β̂₀ = 66.0991    se(β̂₀) = 17.2926
    β̂₁ = 0.0035     se(β̂₁) = 0.0002.

The fitted line is

    Buchanan = 66.0991 + 0.0035 · Bush.

Figure 5 also shows the residuals. The inferences from linear regression are most accurate when the residuals behave like random normal numbers; based on the residual plot, this appears not to be the case in this example. If we repeat the analysis replacing votes with log(votes) we get

    β̂₀ = −2.3298    se(β̂₀) = 0.3529
    β̂₁ = 0.7303     se(β̂₁) = 0.0358.

Figure 4: Diagnostic plots for the cat example: boxplots, scatterplot with fitted line, residual plot, and normal Q-Q plot.

Figure 5: Voting data for Election 2000. Top row, left: Bush versus Buchanan (vertical); top row, right: Bush versus residuals. The bottom row replaces votes with log votes.

This gives the fit

    log(Buchanan) = −2.3298 + 0.7303 · log(Bush).

The residuals look much healthier. Later, we shall address the following question: how do we see whether Palm Beach County has a statistically plausible outcome? The statistic for testing H₀: β₁ = 0 versus H₁: β₁ ≠ 0 is |z| = |0.7303 − 0| / 0.0358 = 20.40, with a p-value of P(|Z| > 20.40) ≈ 0. This is strong evidence that the true slope is not 0.


2.3 h ANOVA and R²

In the olden days, statisticians were obsessed with summarizing things in analysis of variance (ANOVA) tables. The entries are called sums of squares, and the sum of squared deviations between the observed and fitted values is called the residual sum of squares (RSS). It works like this. We can write

    Σ_{i=1}^n (Yᵢ − Ȳ)² = Σ_{i=1}^n (Yᵢ − Ŷᵢ)² + Σ_{i=1}^n (Ŷᵢ − Ȳ)²

that is, SStotal = RSS + SSreg. Then we create this table:

    Source      df     SS        MS           F
    Regression  1      SSreg     SSreg/1      MSreg/MSE
    Residual    n−2    RSS       RSS/(n−2)
    Total       n−1    SStotal

Under H₀: β₁ = 0, F ~ F_{1,n−2}. This is just another (equivalent) way to test this hypothesis. The coefficient of determination is

    R² = SSreg / SStot = 1 − RSS / SStot,    (15)

the amount of variability in Y explained by X. Also, R² = r², where

    r = Σ_{i=1}^n (Yᵢ − Ȳ)(Xᵢ − X̄) / √( Σ_{i=1}^n (Xᵢ − X̄)² · Σ_{i=1}^n (Yᵢ − Ȳ)² )

is the sample correlation. This is an estimate of the correlation

    ρ = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y).

Note that −1 ≤ ρ ≤ 1.
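The decomposition SStotal = RSS + SSreg and the identity R² = r² can both be verified numerically; a sketch on simulated data:

```r
# Sketch: check the ANOVA decomposition and R^2 = r^2.
set.seed(4)
x = rnorm(60)
y = 2 + 3 * x + rnorm(60)
fit = lm(y ~ x)

SStot = sum((y - mean(y))^2)
RSS   = sum(resid(fit)^2)
SSreg = sum((fitted(fit) - mean(y))^2)
print(SStot - (RSS + SSreg))         # essentially zero: the decomposition holds

R2 = SSreg / SStot
print(c(R2 = R2, r2 = cor(x, y)^2))  # R^2 equals the squared sample correlation
```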

2.4 Prediction Intervals

Given a new value X⋆, we want to predict

    Y⋆ = β₀ + β₁X⋆ + ε.    (16)

The prediction is Ŷ⋆ = β̂₀ + β̂₁X⋆. The standard error of the estimated regression line at X⋆ is

    se_line(Ŷ⋆) = σ̂ √( 1/n + (X⋆ − X̄)² / Σ_{i=1}^n (Xᵢ − X̄)² ).    (17)

The variance of a predicted value at X⋆ is σ² plus the variance of the estimated regression line at X⋆. Hence

    se_pred(Ŷ⋆) = σ̂ √( 1 + 1/n + (X⋆ − X̄)² / Σ_{i=1}^n (Xᵢ − X̄)² ).    (18)

A confidence interval for Y⋆ is

    Ŷ⋆ ± z_{α/2} se_pred(Ŷ⋆).

2.5 Remark. This is not really the standard error of the quantity Ŷ⋆; it is the standard error of β̂₀ + β̂₁X⋆ + ε. Note that se_pred(Ŷ⋆) does not go to 0 as n → ∞. Why?

2.6 Example (Election Data Revisited). On the log scale, our linear regression gives the prediction equation

    log(Buchanan) = −2.3298 + 0.7303 · log(Bush).

In Palm Beach, Bush had 152,954 votes and Buchanan had 3,467 votes. On the log scale these are 11.93789 and 8.151045. How likely is this outcome, assuming our regression model is appropriate? Our prediction for log Buchanan votes is −2.3298 + 0.7303(11.93789) = 6.388441. Now, 8.151045 is bigger than 6.388441, but is it significantly bigger? Let us compute a prediction interval. We find that se_pred = 0.093775 and the approximate 95 percent prediction interval is (6.200, 6.578), which clearly excludes 8.151. Indeed, 8.151 is nearly 20 standard errors from Ŷ⋆. Going back to the vote scale by exponentiating, the interval is (493, 717), compared to the actual count of 3,467 votes.
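In R, predict() with interval = "prediction" computes the interval based on (18) (with a t rather than a z quantile). A sketch comparing it to the hand computation, on simulated data:

```r
# Sketch: prediction interval (18) by hand versus predict(..., interval = "prediction").
set.seed(5)
x = rnorm(80, mean = 10)
y = 5 + 0.7 * x + rnorm(80, sd = 0.5)
fit = lm(y ~ x)
xstar = 11
n = length(x)

sigma.hat = summary(fit)$sigma
se.pred = sigma.hat * sqrt(1 + 1/n + (xstar - mean(x))^2 / sum((x - mean(x))^2))
yhat = unname(predict(fit, newdata = data.frame(x = xstar)))
hand = yhat + c(-1, 1) * qt(0.975, df = n - 2) * se.pred

print(hand)
print(predict(fit, newdata = data.frame(x = xstar), interval = "prediction"))
```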

2.5 Confidence Bands

2.7 Theorem (Scheffé, 1959). Let

    I(x) = ( r̂(x) − c, r̂(x) + c )    (19)

where

    r̂(x) = β̂₀ + β̂₁x,
    c = σ̂ √(2 F_{α,2,n−2}) √( 1/n + (x − x̄)² / Σᵢ (xᵢ − x̄)² ).

Then

    P( r(x) ∈ I(x) for all x ) ≥ 1 − α.    (20)

2.8 Example. Let us return to the cat example. The R code is:

library(MASS)
attach(cats)
plot(Bwt, Hwt, xlab = "Body Weight (kg)", ylab = "Heart Weight (g)",
     main = "Body Weight vs. Heart Weight in Cats")
regression.line = lm(Hwt ~ Bwt)
abline(regression.line, lwd = 3)
r = regression.line$residuals
n = length(Bwt)

x = seq(min(Bwt), max(Bwt), length = 1000)  # a sequence of 1000 numbers equally
                                            # spaced between the smallest and
                                            # largest body weights
d = qf(.95, 2, n-2)  # critical value (.95 quantile) of an F distribution with
                     # 2 and n-2 degrees of freedom; all major distributions
                     # have a q function, e.g. qt, qnorm, qbinom, etc.
beta = regression.line$coeff
xbar = mean(Bwt)
ssx = sum((Bwt - xbar)^2)
sigma.hat = sqrt(sum(r^2)/(n - 2))
stuff = sqrt(2*d) * sqrt((1/n) + (x - xbar)^2/ssx) * sigma.hat
### Important: these are all scalars except x, which is a vector.
r.hat = beta[1] + beta[2]*x
upper = r.hat + stuff
lower = r.hat - stuff
lines(x, upper, lty = 2, col = 2, lwd = 3)
lines(x, lower, lty = 2, col = 2, lwd = 3)

The bands are shown in Figure 6.


Figure 6: Confidence band for the cat example.

2.6 Why Are We Doing This If The Model is Wrong?

The model Y = β₀ + β₁x + ε is certainly false: there is no reason why r(x) should be exactly linear. Nonetheless, the linear assumption might be adequate. But how do we assess whether it is? There are three ways:

1. We can do a goodness-of-fit test.
2. We can do a nonparametric regression that does not assume linearity.
3. We can take a purely predictive point of view and treat β̂₀ + β̂₁x as an estimate of the best linear predictor, not as an estimate of the true regression function.

We will return to these points later.

3 Association Versus Causation

There is much confusion about the difference between causation and association. Roughly speaking, the statement "X causes Y" means that changing the value of X will change the distribution of Y. When X causes Y, X and Y will be associated, but the reverse is not, in general, true: association does not necessarily imply causation.

For example, there is a strong linear relationship between the death rate due to breast cancer and fat intake:

    RISK OF DEATH = β₀ + β₁ · FAT + ε    (21)

where β₁ > 0. Does that mean that fat causes breast cancer? Consider two interpretations of (21).

ASSOCIATION (or correlation). Fat intake and breast cancer are associated. Therefore, if I observe someone's fat intake, I can use equation (21) to predict their chance of dying from breast cancer.

CAUSATION. Fat intake causes breast cancer. Therefore, if I observe someone's fat intake, I can use equation (21) to predict their chance of dying from breast cancer. Moreover, if I change someone's fat intake by one unit, their risk of death from breast cancer changes by β₁.

If the data are from a randomized study (X is randomly assigned) then the causal interpretation is correct. If the data are from an observational study (X is not randomly assigned) then only the association interpretation is correct. To see why the causal interpretation is wrong in the observational study, suppose that the people with high fat intake are the rich people; suppose, for the sake of the example, that rich people smoke a lot; and suppose that smoking does cause cancer. Then it will be true that high fat intake predicts a high cancer rate, but changing someone's fat intake will not change their cancer risk.

How can we make these ideas precise? The answer is to use either counterfactuals or directed acyclic graphs. Look at the top left plot in Figure 7. These are observed data on vitamin C (X) and colds (Y). You conclude that increasing vitamin C decreases colds. You tell everyone to take more vitamin C, but the prevalence of colds stays the same. Why? Look at the second plot. The dotted lines show the counterfactuals: the counterfactual yᵢ(x) is the value of Y person i would have had if they had taken dose X = x.
Note that

    Yᵢ = yᵢ(Xᵢ).    (22)

In other words, Yᵢ is the function yᵢ(·) evaluated at Xᵢ. The causal regression is the average of the counterfactual curves yᵢ(x):

Figure 7: Causation. Panels (top to bottom): data, counterfactuals, and the causal regression function c(x), for two examples (left and right columns).

    c(x) = E(yᵢ(x)).    (23)

The average is over the population. In other words, fix a value of x, then average yᵢ(x) over all individuals. In general,

    r(x) ≠ c(x):  association does not equal causation.    (24)

In this example, changing everyone's dose does not change the outcome. The causal regression curve c(x) is shown in the third plot. In the second example (right side of Figure 7) it is worse: you tell everyone to take more vitamin C and the prevalence of colds increases.

Suppose that we randomly assign the dose X. Then Xᵢ is independent of the counterfactuals {yᵢ(x) : x ∈ R}. In that case:

    c(x) = E(y(x))                                            (25)
         = E(y(x) | X = x)   since X ⫫ {y(x) : x ∈ R}        (26)
         = E(Y | X = x)      since Y = y(X)                   (27)
         = r(x).                                              (28)

Thus, if X is randomly assigned, then association is equal to causation. In an observational (non-randomized) study, the best we can do is try to measure confounding variables: variables that affect both X and Y. If we can find all the confounding variables Z, then {y(x) : x ∈ R} is independent of X given Z. In other words, given Z the problem is like a randomized experiment. Consider the breast cancer scenario. Suppose Z = 0 for the poor and Z = 1 for the rich, and that smoking (not fat) causes cancer. If the rich smoke and the poor do not, then c(x) = β₀ + 0 · x + β₂z. Formally,

    c(x) = E(y(x))                                                        (29)
         = ∫ E(y(x) | Z = z) f(z) dz                                      (30)
         = ∫ E(y(x) | Z = z, X = x) f(z) dz   since X ⫫ {yᵢ(x)} given Z  (31)
         = ∫ E(Y | X = x, Z = z) f(z) dz                                  (32)
         = ∫ (β₀ + β₁x + β₂z) f(z) dz   if linear                         (33)
         = β₀ + β₁x + β₂ E(Z).                                            (34)

This is called adjusting for confounders. Specifically, we regress Y on X and Z, obtaining β̂₀ + β̂₁x + β̂₂z, which approximates c(x). Of course, we can never be sure we have included all confounders; this is why observational studies have to be treated with caution. Note the following difference:

    c(x) = ∫ E(Y | Z = z, X = x) f(z) dz    (35)
    E(Y | X = x) = ∫ E(Y | Z = z, X = x) f(z|x) dz.    (36)

In the former, f(z) smoothes over the marginal distribution of the confounding variable. In the latter, if X and Z are correlated, f(z|x) does not smooth over the likely spectrum of values. This enhances the impression that X causes Y, when in fact it might be that Z causes Y.
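A small simulation (not from the notes; all numbers invented) makes the point concrete: Z drives both X and Y, the causal effect of X on Y is zero, yet the unadjusted regression shows a strong slope, and adjusting for Z removes it:

```r
# Sketch: confounding. The true causal effect of x on y is zero; z affects both.
set.seed(6)
n = 2000
z = rnorm(n)          # confounder (think: wealth)
x = 2 * z + rnorm(n)  # "fat intake", driven by z
y = 3 * z + rnorm(n)  # outcome, driven by z only -- not by x

b.unadj = coef(lm(y ~ x))["x"]      # strong association despite no causal effect
b.adj   = coef(lm(y ~ x + z))["x"]  # adjusted for the confounder: near zero
print(c(unadjusted = unname(b.unadj), adjusted = unname(b.adj)))
```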

4 Review of Linear Algebra

Before starting multiple regression, we will briefly review some linear algebra. Read pages 278–287 of Weisberg. The inner product of two vectors x and y is

    ⟨x, y⟩ = xᵀy = Σⱼ xⱼyⱼ.

Two vectors are orthogonal if ⟨x, y⟩ = 0; we then write x ⊥ y. The norm of a vector is

    ||x|| = √⟨x, x⟩ = √( Σⱼ xⱼ² ).

If A is a matrix, denote its inverse by A⁻¹ and its transpose by Aᵀ. The trace of a square matrix A, denoted tr(A), is the sum of its diagonal elements.

PROJECTIONS. We will make extensive use of projections. Let us start with a simple example. Let e₁ = (1, 0), e₂ = (0, 1) and note that R² is the linear span of e₁ and e₂: any vector (a, b) ∈ R² is a linear combination of e₁ and e₂. Let

    L = {a e₁ : a ∈ R}

be the set of vectors of the form (a, 0). Note that L is a linear subspace of R². Given a vector x = (a, b) ∈ R², the projection x̂ of x onto L is the vector in L that is closest to x. In other words, x̂ minimizes ||x − x̂|| among all vectors in L. Another characterization of x̂ is this: it is the unique vector such that (i) x̂ ∈ L and (ii) x − x̂ ⊥ y for all y ∈ L. It is easy to see, in our simple example, that the projection of x = (a, b) is just (a, 0). Note that we can write x̂ = Px where

    P = [ 1  0 ]
        [ 0  0 ].

This is the projection matrix. In general, given a vector space V and a linear subspace L, there is a projection matrix P that maps any vector v into its projection Pv. The projection matrix satisfies these properties:

- Pv exists and is unique.
- P is linear: if a and b are scalars then P(ax + by) = aPx + bPy.
- P is symmetric.
- P is idempotent: P² = P.
- If x ∈ L then Px = x.

Now let X be some n × q matrix and suppose that XᵀX is invertible. The column space is the space L of all vectors that can be obtained by taking linear combinations of the columns of X. It can be shown that the projection matrix for the column space is

    P = X(XᵀX)⁻¹Xᵀ.

Exercise: check that P is idempotent and that if x ∈ L then Px = x.

Recall that

    E[ Σ_{i=1}^n aᵢYᵢ ] = Σ_{i=1}^n aᵢ E[Yᵢ]
    V[ Σ_{i=1}^n aᵢYᵢ ] = Σ_{i=1}^n aᵢ² V[Yᵢ] + Σ_{i<j} 2aᵢaⱼ Cov[Yᵢ, Yⱼ].
RANDOM VECTORS. Let Y be a random vector with mean vector μ and covariance matrix Σ = V(Y) (also written Cov(Y)). If a is a vector then

    E(aᵀY) = aᵀμ,    V(aᵀY) = aᵀΣa.    (37)

If A is a matrix then

    E(AY) = Aμ,    V(AY) = AΣAᵀ.    (38)

5 Multiple Linear Regression

5.1 Fitting the model

If Y depends on several variables, then we can extend our simple linear regression model to include more X's. For example, we might predict the height of a child from the height of the father, the height of the mother, and the sex of the child. The multiple linear regression model is

    Y = β₀ + β₁X₁ + ⋯ + β_p X_p + ε = βᵀX + ε    (39)

where β = (β₀, …, β_p)ᵀ and X = (1, X₁, …, X_p)ᵀ. The value of the jth covariate for the ith subject is denoted by X_{ij}; thus

    Yᵢ = β₀ + β₁X_{i1} + ⋯ + β_p X_{ip} + εᵢ.    (40)

At this point, it is convenient to use matrix notation. Let

        [ 1  X₁₁  X₁₂  …  X₁p ]
    X = [ 1  X₂₁  X₂₂  …  X₂p ]    (n × q)
        [ ⋮   ⋮    ⋮        ⋮  ]
        [ 1  Xₙ₁  Xₙ₂  …  Xₙp ]

Each subject corresponds to one row. The number of columns of X is the number of features plus one for the intercept: q = p + 1. Now define

    Y = (Y₁, …, Yₙ)ᵀ,    β = (β₀, …, β_p)ᵀ,    ε = (ε₁, …, εₙ)ᵀ.    (41)

We can then rewrite (39) as

    Y = Xβ + ε.    (42)

Notational conventions. Following the conventions used in Hastie et al., we denote a feature by the symbol X; if X is a vector, its components are accessed by subscripts, Xⱼ. An output, or response variable, is denoted by Y. We use uppercase letters such as X and Y when referring to the variables; observed values are written in lowercase, for example, the ith observation of X is xᵢ. Matrices are represented using bold font, for example, a set of n input q-vectors xᵢ, i = 1, …, n, is represented by the n × q matrix X. In general, vectors are not bold, except when they have n components; this convention distinguishes a q-vector of inputs xᵢ for the ith observation from the n-vector xⱼ consisting of all the observations on the variable Xⱼ. Since all vectors are assumed to be column vectors, the ith row of X is xᵢᵀ, the transpose of xᵢ.

5.1 Theorem. The least squares estimator is

    β̂ = SY,  where S = (XᵀX)⁻¹Xᵀ,    (43)

assuming that XᵀX is invertible. The fitted values are

    Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY,    (44)

where H is the projection matrix that maps Y onto L, the set of vectors that can be written as Xa (where a is a column vector). The residuals are ε̂ = Y − Ŷ. Of course Ŷ ∈ L and ε̂ is orthogonal to L. Thus

    RSS = ||ε̂||² = ε̂ᵀε̂.

The variance is estimated by

    σ̂² = RSS / (n − p − 1) = RSS / (n − q).    (45)
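The matrix formula (43) can be checked directly against lm(); a sketch with two simulated predictors:

```r
# Sketch: beta-hat = (X^T X)^{-1} X^T Y from (43), compared with lm().
set.seed(8)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X = cbind(1, x1, x2)  # design matrix with an intercept column
beta.hat = solve(t(X) %*% X) %*% t(X) %*% y
print(drop(beta.hat))
print(unname(coef(lm(y ~ x1 + x2))))  # agrees with the matrix computation

# sigma-hat^2 from (45), with q = ncol(X) = 3
rss = sum((y - X %*% beta.hat)^2)
sigma2.hat = rss / (n - ncol(X))
print(sigma2.hat)
```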

5.2 Theorem. The estimators satisfy the following properties:

1. E(β̂) = β.
2. V(β̂) = σ²(XᵀX)⁻¹.
3. β̂ ≈ MVN(β, σ²(XᵀX)⁻¹).
4. An approximate 1 − α confidence interval for βⱼ is

    β̂ⱼ ± z_{α/2} se(β̂ⱼ)    (46)

where se(β̂ⱼ) is the square root of the appropriate diagonal element of the matrix σ̂²(XᵀX)⁻¹.

Let's prove the first two assertions. Note that

    E(β̂) = E(SY) = S E(Y) = S Xβ = (XᵀX)⁻¹XᵀXβ = β.

Also, since by assumption V(Y) = σ²I, where I is the identity matrix,

    V(β̂) = V(SY) = S V(Y) Sᵀ = σ² SSᵀ = σ²(XᵀX)⁻¹Xᵀ X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹.

The ANOVA table is

    Source      df               SS        MS             F
    Regression  q − 1 = p        SSreg     SSreg/p        MSreg/MSE
    Residual    n − q = n−p−1    RSS       RSS/(n−p−1)
    Total       n − 1            SStotal

where SSreg = Σᵢ (Ŷᵢ − Ȳ)² and SStotal = Σᵢ (Yᵢ − Ȳ)². We often loosely refer to "the degrees of freedom," but we should indicate whether we mean the model df (p) or the error df (n − p − 1). The F statistic F = MSreg/MSE is distributed F_{p,n−p−1} under

    H₀: β₁ = ⋯ = β_p = 0.

Testing this hypothesis is of limited value. More frequently we test H₀: βⱼ = 0. Based on an assumption of asymptotic normality, one typically performs a t-test. The test statistic is of the form

    T = β̂ⱼ / se(β̂ⱼ).

Reject H₀ if |T| is large relative to a t distribution with n − q degrees of freedom.

5.3 Example (SAT data; Sleuth text). Reading in the data:

> data = read.table("CASE1201.ASC", header = TRUE)
> data[1:4,]
        state  sat takers income years public expend rank
1        Iowa 1088      3    326 16.79   87.8  25.60 89.7
2 SouthDakota 1075      2    264 16.07   86.2  19.95 90.6
3 NorthDakota 1068      3    317 16.57   88.3  20.62 89.8
4      Kansas 1045      5    338 16.30   83.9  27.14 86.3
> dim(data)
[1] 50 8

Description of Data In 1982, average SAT scores were published with breakdowns of state-by-state performance in the United States. The average SAT scores varied considerably by state, with mean scores falling between 790 (South Carolina) to 1088 (Iowa). Two researchers examined compositional and demographic variables to examine to what extent these characteristics were tied to SAT scores. The variables in the data set were: state: state name sat: mean SAT score (verbal and quantitative combined) takers: percentage of total eligible students (high school seniors) in the state who took the exam income: median income of families of test takers, in hundreds of dollars. years: average number of years that test takers had in social sciences, natural sciences, and humanities (combined) public: percentage of test takers who attended public schools expend: state expenditure on secondary schools, in hundreds of dollars per student rank: median percentile of ranking of test takers within their secondary school classes. Possible values range from 0-99, with 99th percentile students being the highest achieving. Notice that the states with high average SATs had low percentages of takers. One reason is that these are mostly midwestern states that administer other tests to students bound for college in-state. Only their best students planning to attend college out of state take the SAT exams. As the percentage of takers increases for other states, so does the likelihood that the takers include lower-qualied students. Research Question: After accounting for the percentage of students who took the test and the median class rank of the test takers (to adjust, somewhat, for the selection bias in the samples from each state), which variables are associated with state SAT scores? After accounting for the percentage of takers and the median class rank of the takers, how do the states rank? Which states perform best for the amount of money they spend? Exploratory Data Analysis


> par(mfrow = c(2, 4))
> hist(sat, main = "Histogram of SAT Scores", xlab = "Mean SAT Score", col = 1)
> hist(takers, main = "Histogram of Takers", xlab = "Percentage of students tested", col = 2)
> hist(income, main = "Histogram of Income", xlab = "Mean Household Income ($100s)", col = 3)
> hist(years, main = "Histogram of Years", xlab = "Mean Years of Sciences and Humanities", col = 4)
> hist(public, main = "Public Schools Percentage", xlab = "Percentage of Students in Public Schools", col = 5)
> hist(expend, main = "Histogram of Expenditures", xlab = "Schooling Expenditures per Student ($100s)", col = 6)
> hist(rank, main = "Histogram of Class Rank", xlab = "Median Class Ranking Percentile", col = 7)

Exploratory data analysis allows us to look at the variables contained in the data set before beginning any formal analysis. First we examine the variables individually through histograms (Fig. 8). Here we can see the general range of the data, the shape of each distribution (skewed, gapped, symmetric, etc.), as well as any other trends. For example, we note that one state has almost double the secondary-schooling expenditure of all the other states. We may be interested in determining which state this is, and can do so in one line of code:
> data[which(expend == max(expend)), ]
    state sat takers income years public expend rank
29 Alaska 923     31    401 15.32   96.5   50.1 79.6

Next we look at the variables together.

> par(mfrow = c(1, 1))
> plot(data[,-1])           # scatterplot matrix of data, ignoring the first column
> round(cor(data[,-1]), 2)
         sat takers income years public expend  rank
sat     1.00  -0.86   0.58  0.33  -0.08  -0.06  0.88
takers -0.86   1.00  -0.66 -0.10   0.12   0.28 -0.94
income  0.58  -0.66   1.00  0.13  -0.31   0.13  0.53
years   0.33  -0.10   0.13  1.00  -0.42   0.06  0.07
public -0.08   0.12  -0.31 -0.42   1.00   0.28  0.05
expend -0.06   0.28   0.13  0.06   0.28   1.00 -0.26
rank    0.88  -0.94   0.53  0.07   0.05  -0.26  1.00

The scatterplot matrix shows the relationships between the variables at a glance. Generally we are looking for trends here. Does the value of one variable tend to affect the value of another? If so, is that relationship linear? These types of questions help us think of what type of interaction and higher-order terms we might want to include in the regression model.

Figure 8: Histogram of SAT data


Here we can confirm some of the observations of the problem statement. The scatterplot matrix shows clear relationships between sat, takers, and rank (Fig. 9). Interestingly, we can also note Alaska's features, since we know it's the state with the very high expend value. We can see that Alaska has a rather average sat score despite its very high levels of spending. For now we will leave Alaska in the data set, but a more complete analysis would seek to remove outliers and high-influence points (to be discussed in later sections of the notes). In fact, this data set contains two rather obvious outliers.

One feature visible in both the scatterplot and the histogram is the gap in the distribution of takers. When there is such a distinct gap in a variable's distribution, it is sometimes a good idea to consider a transformation from a continuous variable to an indicator variable.

Since subtle trends are often difficult to spot in scatterplot matrices, a correlation matrix can be useful, as seen above. Correlation matrices usually print 8-10 significant digits, so the round command tends to make the output more easily readable. We note that both the income and the years variables have moderately strong positive correlations with the response variable (sat). The respective correlations of 0.58 and 0.33 indicate that higher levels of income and more years of education in sciences and humanities are generally associated with higher mean sat scores. However, this does not imply causation, and each of these trends may be nullified or even reversed when accounting for the other variables in the data set!

A variable such as years may be of particular interest to researchers. Although neither science nor humanities is directly tested on the SAT, researchers may be interested in whether an increase in the number of years of such classes is associated with a significant increase in SAT score. This may help them make recommendations to schools as to how to plan their curricula.
Full Regression Line
#Fit a full regression line
> attach(data)
> regression.line = lm(sat ~ takers + income + years + public + expend + rank)
> summary(regression.line)

Call:
lm(formula = sat ~ takers + income + years + public + expend + rank)

Residuals:
    Min      1Q  Median      3Q     Max
-60.046  -6.768   0.972  13.947  46.332

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -94.659109 211.509584  -0.448 0.656731
takers       -0.480080   0.693711  -0.692 0.492628
income       -0.008195   0.152358  -0.054 0.957353
years        22.610082   6.314577   3.581 0.000866 ***
public       -0.464152   0.579104  -0.802 0.427249
expend        2.212005   0.845972   2.615 0.012263 *
rank          8.476217   2.107807   4.021 0.000230 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.34 on 43 degrees of freedom
Multiple R-squared: 0.8787, Adjusted R-squared: 0.8618
F-statistic: 51.91 on 6 and 43 DF, p-value: < 2.2e-16

> resid = regression.line$residuals
> qqnorm(scale(resid))
> qqline(scale(resid))

Figure 9: Scatterplot of SAT data

The q-q plot indicates that the residuals in our regression model have heavy tails (Fig. 10). On both the negative and positive side, observed (sample) quantiles are much larger in magnitude than theoretical quantiles.

5.2 Testing Subsets of Coefficients

Suppose you want to test whether a set of coefficients is 0. Use

    F = [(RSS_small - RSS_big)/(df_small - df_big)] / [RSS_big/df_big]        (47)

which has an F_{a,b} distribution under H0, where df means degrees of freedom for error, a = df_small - df_big and b = df_big. (Note: we often say "reduced" for the small model and "full" for the big model.)

5.4 Example. In this example we'll use the anova command, rather than summary. This gives sequential sums of squares. The order matters: it gives the SS explained by the first variable, then the second variable conditional on including the first, then the third variable conditional on the first and second, and so forth. For the SAT data, let's try dropping income, years, public and expend.
> regression.line = lm(sat ~ takers + income + years + public + expend + rank)
> anova(regression.line)
Analysis of Variance Table

Response: sat
          Df Sum Sq Mean Sq  F value    Pr(>F)
takers     1 181024  181024 260.8380 < 2.2e-16 ***
income     1    121     121   0.1749 0.6778321
years      1  14661   14661  21.1253 3.753e-05 ***
public     1   5155    5155   7.4272 0.0092545 **
expend     1   3984    3984   5.7409 0.0209970 *
rank       1  11223   11223  16.1712 0.0002295 ***
Residuals 43  29842     694
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 10: qqplot of residuals for SAT analysis

> reduced.line = lm(sat ~ takers + rank)
> anova(reduced.line)
Analysis of Variance Table

Response: sat
          Df Sum Sq Mean Sq  F value    Pr(>F)
takers     1 181024  181024 158.2095 < 2.2e-16 ***
rank       1  11209   11209   9.7964  0.003003 **
Residuals 47  53778    1144
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> top = (53778 - 29842)/(47 - 43)
> bottom = 29842/43
> f = top/bottom
> f
[1] 8.622478
> p = 1-pf(f, 4, 43)
> p
[1] 3.348509e-05

Since p is small, we conclude that the set of variables not included in the reduced model collectively contains valuable information about the relationship with SAT score. We don't yet know which of them are important, but the p-value indicates that removing them all would be unwise.

5.5 Example. Using Residuals to Create Better Rankings for the SAT Data. In Display 12.1, the states are ranked based on raw SAT scores, which doesn't seem reasonable. Some state universities require the SAT and some require a competing exam (the ACT). States with a high proportion of takers probably have in-state requirements for the SAT. In states without this requirement, only the more elite students will take the SAT, causing a bias. In Display 12.2, the states are ranked based on SAT scores, corrected for percent taking the exam and median class rank. Let's explore this thinking further.

To address the research question of how the states rank after accounting for the percentage of takers and median class rank, we use our reduced model (reduced.line above). Instead of ranking by actual SAT score, we can rank the states by how far they fall above or below their fitted regression line value. A residual is defined as the difference between the observed value and the predicted value. For example, we have the reduced regression model:
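The F computation in this example is just arithmetic on the two anova tables. Here is a quick sketch in Python rather than the notes' R (numbers copied from the output above; only the F statistic is computed, since the p-value needs an F distribution as in R's pf):

```python
def partial_f(rss_small, df_small, rss_big, df_big):
    """F statistic of equation (47) for testing that the coefficients
    dropped from the big model are all zero."""
    numerator = (rss_small - rss_big) / (df_small - df_big)
    denominator = rss_big / df_big
    return numerator / denominator

# SAT example: the reduced model keeps only takers and rank
f = partial_f(rss_small=53778, df_small=47, rss_big=29842, df_big=43)
print(round(f, 6))  # 8.622478, matching the R session
```

In R, 1 - pf(f, 4, 43) then gives the p-value for this statistic.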
> summary(reduced.line)




Call:
lm(formula = sat ~ takers + rank)

Residuals:
   Min     1Q Median     3Q    Max
-98.49 -22.31   5.46  21.40  53.89

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 412.8554   194.2227   2.126  0.03881 *
takers       -0.8170     0.6584  -1.241  0.22082
rank          6.9574     2.2229   3.130  0.00300 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.83 on 47 degrees of freedom
Multiple R-squared: 0.7814, Adjusted R-squared: 0.7721
F-statistic: 84 on 2 and 47 DF, p-value: 3.032e-16

To manually find the fitted value for Iowa,

  state  sat takers income years public expend rank
1  Iowa 1088      3    326 16.79   87.8  25.60 89.7

we have the equation:

    fitted = 412.85 - 0.8170*3 + 6.9574*89.7 = 1034.48

Additionally, we can verify this value by typing reduced.line$fit to get a vector of the fitted values for all 50 states. The residual would then be:

    residual = observed - fitted = 1088 - 1034.48 = 53.52

> order.vec = order(reduced.line$res, decreasing = TRUE)
> states = factor(data[order.vec, 1])
> newtable = data.frame(State = states,
+     Residual = as.numeric(round(reduced.line$res[order.vec], 1)),
+     oldrank = (1:50)[order.vec])
> newtable
            State Residual oldrank
1     Connecticut     53.9      35
2            Iowa     53.5       1
3    NewHampshire     45.8      28
4   Massachusetts     41.9      41
5         NewYork     40.9      36
6       Minnesota     40.6       7
7          Kansas     35.8       4
8     SouthDakota     33.4       2
9     NorthDakota     32.8       3
10       Illinois     28.0      21
11        Montana     25.6       6


12      NewJersey     22.8      44
13       Delaware     21.7      34
14      Wisconsin     20.5      10
15       Nebraska     20.5       5
16       Maryland     19.5      39
17    RhodeIsland     15.6      43
18           Utah     14.8       8
19       Colorado     14.1      18
20       Virginia     13.9      40
21      Tennessee     13.3      13
22       Missouri      9.6      23
23      NewMexico      8.3      14
24        Vermont      7.9      32
25     Washington      5.8      19
26           Ohio      5.1      27
27   Pennsylvania      2.3      42
28        Wyoming     -0.5       9
29       Michigan     -0.8      24
30       Oklahoma     -3.3      11
31         Hawaii     -3.8      47
32          Maine     -4.3      37
33        Arizona     -9.4      20
34          Idaho     -9.8      15
35      Louisiana    -10.5      22
36        Florida    -11.0      38
37         Alaska    -18.3      29
38     California    -23.6      33
39         Oregon    -23.9      31
40       Kentucky    -24.1      17
41        Alabama    -27.7      26
42        Indiana    -29.2      46
43       Arkansas    -31.2      12
44   WestVirginia    -38.9      25
45         Nevada    -45.4      30
46    Mississippi    -49.3      16
47          Texas    -50.3      45
48        Georgia    -63.0      49
49  NorthCarolina    -71.3      48
50  SouthCarolina    -98.5      50
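The Iowa arithmetic above is easy to double-check; a few lines of Python (coefficients copied from the reduced-model summary):

```python
# Coefficients from summary(reduced.line)
b0, b_takers, b_rank = 412.8554, -0.8170, 6.9574

# Iowa: takers = 3, rank = 89.7, observed sat = 1088
fitted = b0 + b_takers * 3 + b_rank * 89.7
residual = 1088 - fitted
print(round(fitted, 2), round(residual, 2))  # 1034.48 53.52
```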

Above, the order command is used to sort the vectors by residual value. Saving this ordering into the vector order.vec, we were able to sort the state names and the old rankings by the same ordering. Note how dramatically the rankings shift once we control for the variables takers and rank. Connecticut, for example, shifts from 35th in raw score (896) to 1st in residuals. Similar shifts happened in the reverse direction, with Arkansas sliding from 12th to 43rd. We could further analyze the ranks by accounting for such things as expenditures, to get a sense of which states appear to make efficient use of their spending.

One of the assumptions of the basic regression model is that the magnitude of the residuals is relatively constant at all levels of the response. It is important to check that this assumption is upheld here.

Figure 13: Residual plots for SAT data analysis
> res = reduced.line$res
> par(mfrow = c(1,3))
> plot(sat, res, xlab = "SAT Score", ylab = "Residuals", pch = 19)
> abline(h = 0)
> plot(takers, res, xlab = "Percent of Students Tested",
+     ylab = "Residuals", pch = 19)
> abline(h = 0)
> plot(rank, res, xlab = "Median Class Ranking Percentile",
+     ylab = "Residuals", pch = 19)
> abline(h = 0)

Often residuals will fan out (increase in magnitude) as the value of a variable increases. This is called nonconstant variance. Sometimes there will be a pattern in the residual plots, such as a U shape or a curve; this is due to nonlinearity. Patterns are generally an indication that a variable transformation is needed. Ideally, a residual plot will look like a rectangular blob, with no clear pattern. In the attached residual plots, the first and third plots (SAT and Rank) appear to fit the ideal rectangular-blob shape (Fig. 13); however, Takers (percentage of students tested) had high residuals on the edges and low residuals in the center. This is a product of nonlinearity. Fig. 14 shows the scatterplots of Takers vs. SAT before and after a transformation.


Figure 14: Relationship before and after transformation

5.3 The Hat Matrix

Recall that

    Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y = HY                                     (48)

where

    H = X(X^T X)^{-1} X^T                                                  (49)

is called the hat matrix. The hat matrix is the projector onto the column space of X. The residuals are

    ε̂ = Y - Ŷ = Y - HY = (I - H)Y.                                         (50)

The hat matrix will play an important role in all that follows.

5.6 Theorem. The hat matrix has the following properties.
1. HX = X.
2. H is symmetric and idempotent: H² = H.
3. H projects Y onto the column space of X.

4. rank(X) = tr(H).

5.7 Theorem. Properties of residuals:
1. True residuals: E(ε) = 0, V(ε) = σ²I.
2. Estimated residuals: E(ε̂) = 0, V(ε̂) = σ²(I - H).
3. Σᵢ ε̂ᵢ = 0.
4. V(ε̂ᵢ) = σ²(1 - hᵢᵢ), where hᵢᵢ is the i-th diagonal element of H.

Let's prove a few of these. First,

    E(ε̂) = (I - H)E(Y)
         = (I - H)Xβ
         = Xβ - HXβ
         = Xβ - Xβ          since HX = X
         = 0.

Next,

    V(ε̂) = (I - H)V(Y)(I - H)^T
         = σ²(I - H)(I - H)^T
         = σ²(I - H)(I - H)
         = σ²(I - H - H + H²)
         = σ²(I - H - H + H)          since H² = H
         = σ²(I - H).

To see that the sum of the residuals is 0, note that Σᵢ ε̂ᵢ = ⟨ε̂, 1⟩, where 1 denotes a vector of ones. Now 1 ∈ L, Ŷ is the projection of Y onto L, and ε̂ = Y - Ŷ. By the properties of the projection, ε̂ is perpendicular to every vector in L. Hence Σᵢ ε̂ᵢ = ⟨ε̂, 1⟩ = 0.

5.8 Example. Suppose Yᵢ = β₀ + εᵢ. Then X = (1, 1, ..., 1)^T and H = (1/n)J, where J is the n × n matrix of ones. The column space is V = {(a, a, ..., a)^T : a ∈ R} and HY = (Ȳ, Ȳ, ..., Ȳ)^T.

5.9 Example. Suppose that the X matrix has two columns. Denote these columns by x₁ and x₂. The column space is V = {a₁x₁ + a₂x₂ : a₁, a₂ ∈ R}. The hat matrix projects Y ∈ Rⁿ onto V (see Figure 15).

Figure 15: Projection
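The hat matrix properties are easy to confirm numerically. A NumPy sketch on a random design matrix (an illustrative check in Python, not part of the notes' R session):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
# Design matrix with an intercept column and two random covariates
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T      # the hat matrix, equation (49)

assert np.allclose(H @ X, X)              # 1. HX = X
assert np.allclose(H, H.T)                # 2. H is symmetric...
assert np.allclose(H @ H, H)              #    ...and idempotent
assert np.isclose(np.trace(H), np.linalg.matrix_rank(X))  # 4. tr(H) = rank(X)

# With an intercept in the model, the residuals sum to zero:
Y = rng.normal(size=n)
resid = (np.eye(n) - H) @ Y
assert abs(resid.sum()) < 1e-10
print("hat matrix properties verified")
```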

5.4 Weighted Least Squares

So far we have assumed that the εᵢ's are independent and have the same variance. What happens if the variance is not constant?

For example, Sheather's text gives a simple example about a cleaning company. The building maintenance company keeps track of how many crews it has working (X) and the number of rooms cleaned (Y). The number of crews varied from 2 to 16, and for each level of X, at least 4 observations of Y are available. A plot of X versus Y reveals that the relationship is linear, but that the variance grows as X increases. Because we have several observations for each level of X, we can estimate σ² as a function of X. (Of course, we don't usually have multiple measures of Y for each level of X, so we will need more subtle ways of handling this problem.)

For another example, suppose Dᵢ is the number of diseased individuals in a population of size mᵢ and Yᵢ = Dᵢ/mᵢ. Under certain assumptions, it might be reasonable to assume that Dᵢ is binomial, in which case V[Yᵢ] would be proportional to 1/mᵢ. If the disease is contagious, the binomial assumption would not be correct. Nevertheless, provided mᵢ is large for each i, it might be reasonable to assume that Yᵢ is approximately normal with mean β₀ + β₁xᵢ and variance σ²/mᵢ. In this case the variance is a function of mᵢ, and we could model this variance as described below.

Suppose that Y = Xβ + ε where V(ε) = Σ. Suppose we use the usual least squares estimator β̂. Then,

    E(β̂) = E((X^T X)^{-1} X^T Y)

         = (X^T X)^{-1} X^T E(Y)
         = (X^T X)^{-1} X^T Xβ
         = β.

So β̂ is still unbiased. Also, under weak conditions, it can be shown that β̂ is consistent (converges to β as we get more data). So the usual estimator has reasonable properties. However, there are two problems.

First, with constant variance, the usual least squares estimator is not just unbiased; it is optimal in the sense that it is the minimum-variance linear unbiased estimator. This is no longer true with non-constant variance.

Second, and more importantly, the formula for the standard error of β̂ is wrong. To see this, recall that V(AY) = AV(Y)A^T. Hence,

    V(β̂) = V((X^T X)^{-1} X^T Y)
         = (X^T X)^{-1} X^T V(Y) X (X^T X)^{-1}
         = (X^T X)^{-1} X^T Σ X (X^T X)^{-1},

which is different from the usual formula. It can be shown that the minimum-variance linear unbiased estimator is obtained by minimizing
    RSS(β) = (Y - Xβ)^T Σ^{-1} (Y - Xβ).                                   (51)

The solution is

    β̂ = SY                                                                 (52)

where S = (X^T Σ^{-1} X)^{-1} X^T Σ^{-1}. This is unbiased with variance V(β̂) = (X^T Σ^{-1} X)^{-1}. This is called weighted least squares.

Let B denote the square root of Σ. Thus, B is a symmetric matrix that satisfies B^T B = BB^T = Σ. It can be shown that B^{-1} is the square root of Σ^{-1}. Let Z = B^{-1}Y. Then we have

    Z = B^{-1}Y = B^{-1}(Xβ + ε) = B^{-1}Xβ + B^{-1}ε = Mβ + δ

where M = B^{-1}X and δ = B^{-1}ε. Moreover,

    V(δ) = B^{-1}V(ε)B^{-1} = B^{-1}ΣB^{-1} = B^{-1}BBB^{-1} = I.

Thus we can simply regress Z on M and do ordinary regression.


Let us look more closely at a special case. If the residuals are uncorrelated, then

    Σ = diag(σ²/w₁, σ²/w₂, ..., σ²/wₙ).

In this case,

    RSS(β) = (Y - Xβ)^T Σ^{-1} (Y - Xβ) = (1/σ²) Σᵢ₌₁ⁿ wᵢ(Yᵢ - xᵢ^T β)².

Thus, in weighted least squares we are simply giving lower weight to the more variable (less precise) observations.

Now we have to address the following question: where do we get the weights? Equivalently, how do we estimate σᵢ² = V(εᵢ)? There are four approaches.

(1) Do a transformation to make the variances approximately equal. Then we don't need to do a weighted regression.

(2) Use external information. There are some cases where other information (besides the current data) will allow you to know (or estimate) σᵢ². These cases are rare, but they do occur. For example, σᵢ² could be the measurement error of the instrument.

(3) Use replications. If there are several Y values corresponding to each x value, we can use the sample variance of those Y values to estimate σᵢ². However, it is rare that you would have so many replications.

(4) Estimate σ(x) as a function of x. Just as we can estimate the regression line, we can also estimate the variance, thinking of it as a function of x. We could assume a simple model like σ(xᵢ) = α₀ + α₁xᵢ, for example. Then we could try to find a way to estimate the parameters α₀ and α₁ from the data. In fact, we will do something more ambitious: we will estimate σ(x) assuming only that it is a smooth function of x. We will do this later in the course when we discuss nonparametric regression.

In R we simply include weights in the lm command: lm(Y ~ X, weights = 1/StdDev^2), where StdDev is simply an estimate of the standard deviation of Y given X.
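The whitening argument above (regress Z = B⁻¹Y on M = B⁻¹X) can be checked against the direct weighted least squares solution. A NumPy sketch for the diagonal case Σ = diag(σ²/wᵢ) (illustrative Python, with the weights taken as known):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
w = 1.0 / x                                  # Var(eps_i) proportional to 1/w_i
Y = 2 + 3 * x + rng.normal(scale=1 / np.sqrt(w))

# Direct weighted least squares: minimize sum_i w_i (Y_i - x_i^T beta)^2
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Whitening: B^{-1} = diag(sqrt(w_i)), then ordinary regression of Z on M
B_inv = np.diag(np.sqrt(w))
Z, M = B_inv @ Y, B_inv @ X
beta_ols = np.linalg.lstsq(M, Z, rcond=None)[0]

assert np.allclose(beta_wls, beta_ols)       # the two routes agree
print(beta_wls)
```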

6 Diagnostics

Figure 16 shows a famous example: four different data sets with the same fit. The moral: looking at the fit is not enough; we should also use some diagnostics. Generally, we diagnose problems by looking at the residuals. When we do this, we are looking for: (1) outliers, (2) influential points, (3) nonconstant variance, (4) nonlinearity, (5) nonnormality. The remedies are:

Figure 16: The Anscombe example

Problem                    Remedy
1. Outliers                Non-influential: don't worry about it.
                           Influential: remove, or use robust regression.
2. Influential points      Fit the regression with and without the point and report both analyses.
3. Nonconstant variance    Use a transformation or nonparametric methods.
                           (Note: doesn't affect the fit too much; mainly an issue for confidence intervals.)
4. Nonlinearity            Use a transformation or nonparametric methods.
5. Nonnormality            Large samples: not a problem. Small samples: use transformations.


Three types of residuals (assume the lm output is in tmp):

Name                     Formula                                R command
residual                 ε̂ᵢ = Yᵢ - Ŷᵢ                           resid(tmp)
standardized residual    (Yᵢ - Ŷᵢ) / (σ̂ sqrt(1 - hᵢᵢ))          rstandard(tmp)
studentized residual     (Yᵢ - Ŷᵢ) / (σ̂(i) sqrt(1 - hᵢᵢ))       rstudent(tmp)
6.1 Outliers

Outliers can be found (i) graphically or (ii) by testing. Let us write

    Yⱼ = β^T Xⱼ + εⱼ            j ≠ i
    Yⱼ = β^T Xⱼ + Δ + εⱼ        j = i.

Test H0: case i is not an outlier (Δ = 0) versus H1: case i is an outlier. Do the following:

(1) Delete case i.
(2) Compute β̂(i) and σ̂(i).
(3) Predict the deleted case: Ŷᵢ = Xᵢ^T β̂(i).
(4) Compute

        tᵢ = (Yᵢ - Ŷᵢ) / se.

(5) Reject H0 if the p-value is less than α/n.

Note that

    V(Yᵢ - Ŷᵢ) = V(Yᵢ) + V(Ŷᵢ) = σ² + σ² Xᵢ^T (X(i)^T X(i))^{-1} Xᵢ.

So,

    se(Yᵢ - Ŷᵢ) = σ̂(i) sqrt(1 + Xᵢ^T (X(i)^T X(i))^{-1} Xᵢ).

How do the residuals come into this? Internally studentized residuals:

    rᵢ = ε̂ᵢ / (σ̂ sqrt(1 - hᵢᵢ)).

Externally studentized residuals:

    r(i) = ε̂ᵢ / (σ̂(i) sqrt(1 - hᵢᵢ)).

6.1 Theorem.

    tᵢ = rᵢ sqrt((n - p - 2) / (n - p - 1 - rᵢ²)) = r(i).
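The relationship in Theorem 6.1 can be verified by brute force: delete a case, refit, and compare the externally studentized residual with the closed-form expression. A NumPy sketch (illustrative Python; here p counts the covariates excluding the intercept, matching the n - p - 2 convention):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 25, 2                       # p covariates plus an intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
k = p + 1                          # number of fitted parameters

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = Y - H @ Y
sigma2 = e @ e / (n - k)
r = e / np.sqrt(sigma2 * (1 - h))              # internally studentized

i = 7                                          # any case will do
keep = np.arange(n) != i
beta_i = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
resid_i = Y[keep] - X[keep] @ beta_i
sigma2_i = resid_i @ resid_i / (n - 1 - k)     # sigma-hat_(i)^2
r_ext = e[i] / np.sqrt(sigma2_i * (1 - h[i]))  # externally studentized

t_i = r[i] * np.sqrt((n - p - 2) / (n - p - 1 - r[i] ** 2))
assert np.isclose(t_i, r_ext)                  # Theorem 6.1
print("theorem 6.1 verified for case", i)
```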


6.2 Influence

Cook's distance is

    Dᵢ = (Ŷ(i) - Ŷ)^T (Ŷ(i) - Ŷ) / (q σ̂²) = (1/q) rᵢ² hᵢᵢ / (1 - hᵢᵢ)

where q is the number of parameters, Ŷ = Xβ̂ and Ŷ(i) = Xβ̂(i). Points with Dᵢ ≥ 1 might be influential. Points near the edge are typically the influential points.

6.2 Example (Rats).

> data = c(176,6.5,.88,.42,
+          176,9.5,.88,.25,
+          190,9.0,1.00,.56,
+          176,8.9,.88,.23,
           ...
+          149,6.4,.75,.46)
> data = matrix(data,ncol=4,byrow=T)
> bwt = data[,1]
> lwt = data[,2]
> dose = data[,3]
> y = data[,4]
> dim(data)
[1] 19  4
> n = length(y)

The four variables:

bwt: the rat's body weight
lwt: the rat's lung weight
dose: the dosage given to the rat
y: the amount of the dose that reached the rat's liver

> data2 = cbind(bwt, lwt, dose, y)
> datam = as.data.frame(data2)
> pch.vec = c(1, 1, 19, rep(1, 16))
> plot(datam, pch = pch.vec)

To produce a scatterplot matrix, the data must be formatted in R as a data frame. In the scatterplot matrix, I have colored in black the observation that is ultimately removed as our high-influence point (Fig. 17). This observation is pretty typical as far as high-influence points go, and we can learn a lot just by looking at these graphs. How does this observation differ from the other 18?


Figure 17: Rat Data


Most obviously, it has an unusually large value for y. Furthermore, it is on the edge of the data (high values of bwt, lwt, and dose). Observations on the edge of a data space have higher influence by nature. The plot of body weight versus dosage is particularly interesting. The relationship between weight and dosage is nearly perfectly linear, with the exception of our high-influence point. Perhaps since one rat was given an abnormally large dosage for its weight, we see an abnormally large amount of the dose ending up in the liver. We note that in the plots on the bottom row, none of the three predictors appears to have an obvious relationship with the response variable y.

> out = lm(y ~ bwt + lwt + dose, qr=TRUE)
> summary(out)

Call:
lm(formula = y ~ bwt + lwt + dose, qr = TRUE)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.265922   0.194585   1.367   0.1919
bwt         -0.021246   0.007974  -2.664   0.0177 *
lwt          0.014298   0.017217   0.830   0.4193
dose         4.178111   1.522625   2.744   0.0151 *

> r = rstandard(out)   ### standardized
> r = rstudent(out)    ### studentized
> plot(fitted(out), rstudent(out), pch=pch.vec,
+     xlab = "Fitted Value", ylab = "Studentized Residual")
> abline(h=0)

This first model appears to have some explanatory power, with the coefficients for both bwt and dose appearing significant. However, consider the scatterplot matrix again. These two variables were nearly perfectly correlated, with the exception of one observation. So what's happening here? In short, they're cancelling each other out for most observations, while still accounting for our high-influence point. For every observation (except for our high-influence point), the expression β̂₁·bwt + β̂₃·dose evaluates to nearly 0! For our high-influence point, however, it evaluates to 0.141. The model has been heavily influenced by one point, creating artificial significance to account for the unusually high dosage (with respect to body weight) given to one rat. To formally quantify this influence, we look at the Cook's distance for each observation.
> I = influence.measures(out)
> names(I)
[1] "infmat" "is.inf" "call"


Figure 18: Rat Data

> I$infmat[1:5,]
       dfb.1_      dfb.bwt    dfb.lwt   dfb.dose      dffit     cov.r     cook.d
1 -0.03835128  0.31491627 -0.7043633 -0.2437488  0.8920451 0.6310012 0.16882682
2  0.14256373 -0.09773917 -0.4817784  0.1256122 -0.6087606 1.0164073 0.08854024
3 -0.23100202 -1.66770314  0.3045718  1.7471972  1.9047699 7.4008047 0.92961596
4  0.12503004 -0.12685888 -0.3036512  0.1400908 -0.4943610 0.8599033 0.05718456
5  0.52160605 -0.39626771  0.5500161  0.2747418 -0.9094531 1.5241607 0.20291617

> cook = I$infmat[,7]
> plot(cook, type="h", lwd=3, col="red", ylab = "Cook's Distance")

In the Cook's distance plot, we see that our high-influence point (the third observation) has a much larger Cook's distance than any of the others (Fig. 18). This generally indicates that the observation should be removed from the analysis.

> y = y[-3]
> bwt = bwt[-3]
> lwt = lwt[-3]
> dose = dose[-3]
> out = lm(y ~ bwt + lwt + dose)
> summary(out)

Call:
lm(formula = y ~ bwt + lwt + dose)

Residuals:
      Min        1Q    Median        3Q       Max
-0.102154 -0.056486  0.002838  0.046519  0.137059

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.311427   0.205094   1.518    0.151
bwt         -0.007783   0.018717  -0.416    0.684
lwt          0.008989   0.018659   0.482    0.637
dose         1.484877   3.713064   0.400    0.695

Residual standard error: 0.07825 on 14 degrees of freedom
Multiple R-squared: 0.02106, Adjusted R-squared: -0.1887
F-statistic: 0.1004 on 3 and 14 DF, p-value: 0.9585

After removing the high-influence point, we refit the original model. Now we find a regression relationship with nearly no significance (p = 0.9585). This seems consistent with what we observed in the original scatterplot matrix.

6.3 Tweaking the Regression

If the residual plots indicate some problem, we need to apply some remedies (see Figure 6.3, p. 132 of Weisberg). Possible remedies are:

Transformation
Robust regression
Nonparametric regression

Examples of transformations: sqrt(Y), log(Y), log(Y + c), 1/Y. These can be applied to Y or to x. We transform to make the assumptions valid, not to chase statistical significance.

6.3 Example (Bacteria). This example is from Chatterjee and Price (1991, p. 36). Bacteria were exposed to radiation. Figure 19 shows the number of surviving bacteria versus time of exposure to radiation. The program and output look like this.


Figure 19: Bacteria data (survivors vs. time with the fitted line; residuals vs. fitted values; normal Q-Q plot; Cook's distance plot)


> time = 1:15
> survivors = c(355,211,197,166,142,106,104,60,56,38,36,32,21,19,15)
> plot(time, survivors)
> out = lm(survivors ~ time)
> abline(out)
> plot(out, which=c(1,2,4))
> print(summary(out))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   259.58      22.73  11.420 3.78e-08 ***
time          -19.46       2.50  -7.786 3.01e-06 ***
---
Residual standard error: 41.83 on 13 degrees of freedom
Multiple R-Squared: 0.8234, Adjusted R-squared: 0.8098
F-statistic: 60.62 on 1 and 13 DF, p-value: 3.006e-06

The residual plot suggests a problem. Consider the following transformation.

> logsurv = log(survivors)
> plot(time, logsurv)
> out = lm(logsurv ~ time)
> abline(out)
> plot(out, which=c(1,2,4))
> print(summary(out))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.973160   0.059778   99.92  < 2e-16 ***
time        -0.218425   0.006575  -33.22 5.86e-14 ***
---
Residual standard error: 0.11 on 13 degrees of freedom
Multiple R-Squared: 0.9884, Adjusted R-squared: 0.9875
F-statistic: 1104 on 1 and 13 DF, p-value: 5.86e-14

Check out Figure 20. Much better. In fact, theory predicts Nₜ = N₀e^(-δt), where Nₜ is the number of survivors at exposure t and N₀ is the number of bacteria before exposure. So the fact that the log transformation is useful here is not surprising.
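Using the data listed above, the log-linear fit is easy to reproduce outside R. A Python sketch (np.polyfit does the same least squares as lm on the transformed response):

```python
import numpy as np

time = np.arange(1, 16)
survivors = np.array([355, 211, 197, 166, 142, 106, 104,
                      60, 56, 38, 36, 32, 21, 19, 15])

# Fit log(N_t) = log(N_0) - delta * t by least squares
slope, intercept = np.polyfit(time, np.log(survivors), 1)
print(round(intercept, 4), round(slope, 4))  # ~5.9732 and ~-0.2184, as in the R output

# Implied decay model: N_t = N_0 * exp(-delta * t)
N0, delta = np.exp(intercept), -slope
```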

7 Misc Topics in Multiple Regression

7.1 Qualitative Variables

If X ∈ {0, 1}, then it is called a dummy variable. More generally, if X takes discrete values, it is called a qualitative variable or a factor. Let D be a dummy variable. Consider

    E(Y) = β₀ + β₁X + β₂D.

Then:

    coefficient    intercept    slope
    d = 0          β₀           β₁
    d = 1          β₀ + β₂      β₁

These are parallel lines. Now consider this model:

    E(Y) = β₀ + β₁X + β₂D + β₃X·D.

Then:

    coefficient    intercept    slope
    d = 0          β₀           β₁
    d = 1          β₀ + β₂      β₁ + β₃

These are nonparallel lines.

To include a discrete variable with k levels, use k - 1 dummy variables. For example, if z ∈ {1, 2, 3}, do this:

    z    d1    d2
    1     1     0
    2     0     1
    3     0     0

In the model Y = β₀ + β₁D₁ + β₂D₂ + β₃X + ε we see

    E(Y | z = 1) = β₀ + β₁ + β₃X
    E(Y | z = 2) = β₀ + β₂ + β₃X
    E(Y | z = 3) = β₀ + β₃X.

You should not create k dummy variables, because they will not be linearly independent. Then X^T X is not invertible.
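The warning about using k dummy variables is easy to demonstrate: with an intercept plus one dummy per level, the dummies sum to the intercept column, so the design matrix is rank deficient and X^T X is singular. A NumPy sketch with a hypothetical three-level factor:

```python
import numpy as np

z = np.array([1, 2, 3, 1, 2, 3, 1, 2])     # a factor with k = 3 levels
ones = np.ones(len(z))
d = {lev: (z == lev).astype(float) for lev in (1, 2, 3)}

# k dummies plus intercept: d1 + d2 + d3 = 1, so only rank 3 from 4 columns
X_bad = np.column_stack([ones, d[1], d[2], d[3]])
assert np.linalg.matrix_rank(X_bad) == 3

# k - 1 dummies plus intercept: full column rank, X^T X invertible
X_ok = np.column_stack([ones, d[1], d[2]])
assert np.linalg.matrix_rank(X_ok) == 3
np.linalg.inv(X_ok.T @ X_ok)               # succeeds; would fail for X_bad
print("k-1 dummy coding gives a full-rank design")
```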

7.1 Example. Salary data from Chatterjee and Price, p. 96.

Figure 20: Bacteria data after the log transformation (logsurv vs. time with the fitted line; residuals vs. fitted values; normal Q-Q plot; Cook's distance plot)


## salary example, p. 97, Chatterjee and Price
sdata = read.table("salaries.dat", skip=1)
names(sdata) = c("salary","experience","education","management")
attach(sdata)
n = length(salary)
d1 = rep(0,n)
d1[education==1] = 1
d2 = rep(0,n)
d2[education==2] = 1
out1 = lm(salary ~ experience + d1 + d2 + management)
summary(out1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11031.81     383.22  28.787  < 2e-16 ***
experience    546.18      30.52  17.896  < 2e-16 ***
d1          -2996.21     411.75  -7.277 6.72e-09 ***
d2            147.82     387.66   0.381    0.705
management   6883.53     313.92  21.928  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1027 on 41 degrees of freedom
Multiple R-Squared: 0.9568, Adjusted R-squared: 0.9525
F-statistic: 226.8 on 4 and 41 DF, p-value: < 2.2e-16

Interpretation: Each year of experience increases our prediction by 546 dollars. The increment for a management position is 6883 dollars. Compare bachelor's to high school. For high school, d1 = 1 and d2 = 0, so:

    E(Y) = β₀ + β₁·experience - 2996 + β₄·management.

For bachelor's, d1 = 0 and d2 = 1, so:

    E(Y) = β₀ + β₁·experience + 147 + β₄·management.

So E_bach(Y) - E_high(Y) = 3144.

### another way
ed = as.factor(education)


out2 = lm(salary ~ experience + ed + management)
summary(out2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8035.60     386.69  20.781  < 2e-16 ***
experience    546.18      30.52  17.896  < 2e-16 ***
ed2          3144.04     361.97   8.686 7.73e-11 ***
ed3          2996.21     411.75   7.277 6.72e-09 ***
management   6883.53     313.92  21.928  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1027 on 41 degrees of freedom
Multiple R-Squared: 0.9568, Adjusted R-squared: 0.9525
F-statistic: 226.8 on 4 and 41 DF, p-value: < 2.2e-16

Apparently, R codes the dummy variables differently:

    level          mean    d1   d2   ed2   ed3
    high-school    8036     1    0     0     0
    BS            11179     0    1     1     0
    advanced      11032     0    0     0     1

You can change the way R does this; see help(C) and help(contr.treatment).

7.2 Collinearity

If one of the predictor variables is a linear combination of the others, then we say that the variables are collinear. The result is that X^T X is not invertible. Formally, this means that the standard error of β̂ is infinite, and the standard error for predictions is infinite. For example, suppose that x₁ᵢ = 2 for every i, and suppose we include an intercept. Then the X matrix is

    | 1  2 |
    | 1  2 |
    | .  . |
    | 1  2 |

and so

    X^T X = n | 1  2 |
              | 2  4 |

which is not invertible. The implied model in this example is

    Yᵢ = β₀ + β₁x₁ᵢ + εᵢ = β₀ + 2β₁ + εᵢ = γ₀ + εᵢ

where γ₀ = β₀ + 2β₁. We can estimate γ₀ using Ȳ, but there is no way to separate this into estimates for β₀ and β₁.

Sometimes the variables are close to collinear. The result is that it may be difficult to invert X^T X. However, the bigger problem is that the standard errors will be huge. The solution is easy: don't use all the variables; use variable selection (stay tuned...). Multicollinearity is just an extreme example of the bias-variance tradeoff we face whenever we do regression. If we include too many variables, we get poor predictions due to increased variance (more later).
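Near-collinearity and its effect on standard errors can be seen directly: as a second column gets closer to being a linear function of the first, the diagonal of (X^T X)⁻¹, which scales the coefficient variances, blows up. A NumPy sketch (hypothetical simulated predictors):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)

def coef_variance_scale(noise_sd):
    """Diagonal of (X^T X)^{-1} when x2 = x1 + noise of the given size."""
    x2 = x1 + noise_sd * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

loose = coef_variance_scale(1.0)     # x1 and x2 only mildly related
tight = coef_variance_scale(0.01)    # x1 and x2 nearly collinear

# The variance scale for the x1 coefficient explodes under near-collinearity:
assert tight[1] > 100 * loose[1]
print(loose[1], tight[1])
```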

7.3 Case Study

This example is drawn from the Sleuth text. When men and women of the same size and drinking history consume equal amounts of alcohol, the women tend to maintain a higher blood alcohol concentration. To explain this, researchers conjectured that enzymes in the stomach are more active in men than in women. In this study we examine the level of gastric enzyme as a predictor of first pass metabolism. These two variables are known to be positively related. The question is: does this relationship differ between men and women? 18 women and 14 men were in the study. Of the 32 subjects, 8 were considered alcoholics. All subjects were given a dose of alcohol and then the researchers measured their first pass metabolism. The higher this quantity, the more rapidly they were processing the alcohol. Here are the data:

subject metabol gastric female alcohol
      1     0.6     1.0      1       1
      2     0.6     1.6      1       1
      3     1.5     1.5      1       1
      4     0.4     2.2      1       0
      5     0.1     1.1      1       0
      6     0.2     1.2      1       0
      7     0.3     0.9      1       0
      8     0.3     0.8      1       0
      9     0.4     1.5      1       0
     10     1.0     0.9      1       0
     11     1.1     1.6      1       0
     12     1.2     1.7      1       0
     13     1.3     1.7      1       0
     14     1.6     2.2      1       0
     15     1.8     0.8      1       0
     16     2.0     2.0      1       0
     17     2.5     3.0      1       0
     18     2.9     2.2      1       0
     19     1.5     1.3      0       1
     20     1.9     1.2      0       1
     21     2.7     1.4      0       1
     22     3.0     1.3      0       1
     23     3.7     2.7      0       1
     24     0.3     1.1      0       0
     25     2.5     2.3      0       0
     26     2.7     2.7      0       0
     27     3.0     1.4      0       0
     28     4.0     2.2      0       0
     29     4.5     2.0      0       0
     30     6.1     2.8      0       0
     31     9.5     5.2      0       0
     32    12.3     4.1      0       0

In Figure 21 you can see the relationship between gastric enzyme and metabolism and how it relates to sex and alcoholism. Consider the full model, including all interactions:

> out = lm(metabol ~ (gastric + female + alcohol)^3, data = dat)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             -1.6597     0.9996  -1.660   0.1099
gastric                  2.5142     0.3434   7.322 1.46e-07 ***
female                   1.4657     1.3326   1.100   0.2823
alcohol                  2.5521     1.9460   1.311   0.2021
gastric:female          -1.6734     0.6202  -2.698   0.0126 *
gastric:alcohol         -1.4587     1.0529  -1.386   0.1786
female:alcohol          -2.2517     4.3937  -0.512   0.6130
gastric:female:alcohol   1.1987     2.9978   0.400   0.6928
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

> plot(fitted(out), resid(out), xlab="Fitted value", ylab="Residual")

From the residual plot we see that two observations have very high fitted values and large residuals. These are likely to be high influence observations that require careful consideration. First we look to see if these observations are affecting our inferences. If we drop observations 31 and 32 from the full model (above), only gastric is significant (results not shown). Consequently we believe these observations have high influence on our analysis. This makes sense because both of these males have very high gastric activity; all other subjects have activity between 0.5 and 3. Perhaps these males are athletes, or extremely large people, or they differ in some fundamental way from the remaining individuals. If we restrict our inferences to people with gastric activity less than 3, we can feel more confident about our inferences. Next we need to simplify the model before proceeding, because we have a small number of observations for the complexity of our original model. There is no indication of a detectable effect

Figure 21: Alcohol metabolism. Metabolism plotted against gastric activity, points marked M/F; legend: Female,nonAlc; Male,nonAlc; Female,Alc; Male,Alcoholic.

Figure 22: Alcohol metabolism. Residuals plotted against fitted values for the full model.

due to alcoholism, so we drop alcohol from the model and then do a formal investigation of the influence of observations 31 and 32. We calculate the influence measures and look at Cook's distance. Below we show it for these two observations; all other observations have small distances. Hence we remove these two observations.

> outsimple = lm(metabol ~ (gastric + female)^2, data = dat)
> I = influence.measures(outsimple)

Cook's distance for records 31 and 32:

> I$infmat[31:32, 7]
      31       32
0.960698 1.167255

> datclean = dat[-c(31,32), ]
> attach(datclean)
> final = lm(metabol ~ (gastric + female)^2, data = datclean)
> summary(final)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.06952    0.80195   0.087 0.931580
gastric         1.56543    0.40739   3.843 0.000704 ***
female         -0.26679    0.99324  -0.269 0.790352
gastric:female -0.72849    0.53937  -1.351 0.188455
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Whereas, with 31 and 32 included, we had a significant interaction between gastric and female:

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1.1858     0.7117  -1.666   0.1068
gastric          2.3439     0.2801   8.367 4.22e-09 ***
female           0.9885     1.0724   0.922   0.3645
gastric:female  -1.5069     0.5591  -2.695   0.0118 *

From this output it appears metabolism does not depend on sex, but the effect of sex was clearly visible in the plot of the original data. Perhaps the model is still overparameterized. We try a model with no intercept for males or females, since metabolism is known to be approximately 0 when gastric activity is 0. This model forces the line through the origin. (Note: I have found that forcing no intercept is often a good modeling choice if the fitted line appears to go through the origin.) The model we are fitting for males is

Y = β1 X + ε,

and for females is

Y = (β1 + β2) X + ε.

> femgastric = female * gastric
> outnoint = lm(metabol ~ gastric + femgastric - 1)
> summary(outnoint)

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
gastric      1.5989     0.1249  12.800 3.20e-13 ***
femgastric  -0.8732     0.1740  -5.019 2.63e-05 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 0.8518 on 28 degrees of freedom
Multiple R-squared: 0.877, Adjusted R-squared: 0.8683
F-statistic: 99.87 on 2 and 28 DF, p-value: 1.804e-13

A plot of residuals reveals that this model provides a good fit (results not shown). We consider this our best model.

Conclusions: as expected, metabolism increases with gastric activity; however, mean first pass metabolism is higher for males than for females, even if we account for gastric activity (p-value < 0.0001). Specifically, it is β̂1/(β̂1 + β̂2) ≈ 2.2 times higher for males than for females. The experiment supports our hypothesis that males process alcohol more quickly than females, even when we account for gastric enzyme levels. Although it was not mentioned, I believe that gastric enzyme levels are higher for larger people, so including this variable in the model controls for the size of the subject. It would be interesting to look at this study in more depth.


8 Bias-Variance Decomposition and Model Selection

8.1 The Predictive Viewpoint

The main motivation for studying regression is prediction. Suppose we observe X and then predict Y with g(X). Recall that the prediction error, or prediction risk, is R(g) = E(Y − g(X))², and this is minimized by taking g(x) = r(x) where r(x) = E(Y |X = x). Consider the set of linear predictors

L = { ℓ(x) = xᵀβ : β ∈ Rᵖ }.

(We usually assume that x1 = 1.) The best linear predictor, or linear oracle, is ℓ*(x) = xᵀβ* where

R(β*) = min_{ℓ ∈ L} R(β).

In other words, ℓ*(x) = xᵀβ* gives the smallest prediction error of all linear predictors. Note that β* is well-defined even without assuming that the true regression function is linear. One way to think about linear regression is as follows: when we use least squares, we are trying to estimate the linear oracle, not the true regression function.

Let us make the connection between the best linear predictor and least squares more explicit. (Notation note: remember that the column vector of features X is a row vector in the matrix X. Consequently, in the following text, every time we move between X and X, it seems like we've made a transpositional error, but we have not.) We have

R(β) = E(Y²) − 2E(Y Xᵀβ) + βᵀ E(XXᵀ) β = E(Y²) − 2E(Y Xᵀβ) + βᵀΣβ

where Σ = E(XXᵀ). By differentiating R(β) with respect to β and setting the derivative equal to 0, we see that the best value of β is

β* = Σ⁻¹C                                                 (53)

where C is the p × 1 vector whose jth element is E(Y Xj). We can estimate Σ with the matrix

Σ̂n = (1/n) XᵀX

and we can estimate C with n⁻¹XᵀY. An estimate of the oracle is thus β̂ = (XᵀX)⁻¹XᵀY, which is the least squares estimator.

Summary

1. The linear oracle, or best linear predictor at x, is xᵀβ* where β* = Σ⁻¹C. An estimate of β* is β̂ = (XᵀX)⁻¹XᵀY.

2. The least squares estimator is β̂ = (XᵀX)⁻¹XᵀY. We can regard β̂ as an estimate of the linear oracle. If the regression function r(x) is actually linear, so that r(x) = xᵀβ, then the least squares estimator is unbiased and has variance matrix σ²(XᵀX)⁻¹.

3. The predicted values are Ŷ = Xβ̂ = HY, where H = X(XᵀX)⁻¹Xᵀ is the hat matrix, which projects Y onto the column space of X.
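The claims in the summary box are easy to check numerically. The sketch below (Python with simulated data; the notes use R, and the true coefficients here are arbitrary) verifies that H is a symmetric, idempotent projection whose trace equals the number of parameters, and that the residuals are orthogonal to the column space of X:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # least squares estimate
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
Y_hat = H @ Y

sym = np.allclose(H, H.T)                      # projection: symmetric...
idem = np.allclose(H @ H, H)                   # ...and idempotent
tr = np.trace(H)                               # trace = p = 3 parameters
orth = np.allclose(X.T @ (Y - Y_hat), 0)       # residuals orthogonal to col(X)
```

The trace identity trace(H) = p is what later justifies calling trace of the smoother matrix the "effective degrees of freedom".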

8.2 The Bias-Variance Decomposition

If X and Y are random variables, recall the rule of iterated expectations: E[g(X, Y)] = E[E(g(X, Y)|X = x)], where the inner expectation is taken with respect to Y|X and the outer one with respect to the marginal distribution of X. Throughout the following section we use this rule, conditioning on X to obtain the risk function R(x) at X = x.

Let r̂(x) be any predictor based on i = 1, . . . , n observations (Xi, Yi). As a function of random variables, r̂(x) is a random variable itself, calculated at a fixed value of x. In the calculations below, think of (X, Y) as new input and output variables, independent of (X1, Y1), . . . , (Xn, Yn). Then define the risk

R = E(Y − r̂(X))² = ∫ R(x) f(x) dx

where R(x) = E((Y − r̂(x))² | X = x). Let

r̄(x) = E(r̂(x)),   V(x) = V(r̂(x)),   σ²(x) = V(Y | X = x).

Now

R(x) = E((Y − r̂(X))² | X = x)
     = E( ((Y − r(x)) + (r(x) − r̄(x)) + (r̄(x) − r̂(x)))² | X = x )
     = σ²(x) + (r(x) − r̄(x))² + V(x)                      (54)
       [irreducible error + bias squared + variance].

We call (54) the bias-variance decomposition. Note: the irreducible error is the error due to unmodeled variation, such as instrument error and population variability around the model. The

bias is the lack of fit between the assumed model and the true relationship between Y and X. This will be zero if the assumed model r̂(x) includes the truth E[Y|X = x]. The variance is the statistical variability in the estimation procedure; as n → ∞ this quantity goes to zero. Finally, all the cross-product terms have expectation 0 because Y is independent of r̂(x).

If we combine the last two terms, we can also write

R(x) = σ²(x) + MSEn(x)

where MSEn(x) = E((r̂(X) − r(X))² | X = x) is the conditional mean squared error of r̂(x). Now

R = ∫ R(x) f(x) dx ≈ (1/n) Σᵢ₌₁ⁿ R(xᵢ) ≡ Rav

and Rav is called the average prediction risk. It averages over the observed X's as an approximation to the theoretical average over the marginal distribution of the X's. We have

Rav = (1/n) Σᵢ₌₁ⁿ R(xᵢ) = (1/n) Σᵢ₌₁ⁿ σ²(xᵢ) + (1/n) Σᵢ₌₁ⁿ (r(xᵢ) − r̄(xᵢ))² + (1/n) Σᵢ₌₁ⁿ V(xᵢ).

To summarize, we wish to know R, the prediction risk. Rav provides an excellent approximation, but Rav is not a quantity that we can readily calculate empirically because we do not know R(xᵢ). Let us explore why it is challenging to estimate R. Let Ŷᵢ = r̂(xᵢ), the fitted value of the regression at xᵢ. Define the training error

Rtraining = (1/n) Σᵢ₌₁ⁿ (Ŷᵢ − Yᵢ)².

We might guess that Rtraining estimates the prediction error R well, but this is not true. The reason is that we used the observed pairs (xᵢ, Yᵢ) to obtain Ŷᵢ = r̂(xᵢ). As a consequence, Ŷᵢ and Yᵢ are correlated. Typically Ŷᵢ predicts Yᵢ better than it predicts a new Y at the same xᵢ. Let us explore this formally. Let r̄ᵢ = E(r̂(xᵢ)) and compute

E(Yᵢ − Ŷᵢ)² = E( ((Yᵢ − r(xᵢ)) + (r(xᵢ) − r̄ᵢ) + (r̄ᵢ − Ŷᵢ))² )
            = σ² + (r(xᵢ) − r̄ᵢ)² + V(r̂(xᵢ)) − 2Cov(Yᵢ, Ŷᵢ).

Note: this time the cross-product involving the first and third terms is not 0, because Cov(Yᵢ, Ŷᵢ) ≠ 0. This is because Yᵢ is one of the observations from which we calculated Ŷᵢ, hence the two terms are correlated. This introduces a bias into the estimate of risk:

E(Rtraining) = E(Rav) − (2/n) Σᵢ₌₁ⁿ Cov(Yᵢ, Ŷᵢ).           (55)

Typically Cov(Yᵢ, Ŷᵢ) > 0, and so Rtraining underestimates the risk. Later, we shall see how to estimate the prediction risk.
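A small simulation makes this optimism concrete. Under a linear model with Gaussian noise, the training error has expectation σ²(n − p)/n while the risk at the observed design points is σ²(n + p)/n; the gap is the covariance penalty in (55). A sketch (Python, simulated data; design and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 30, 5, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
H = X @ np.linalg.solve(X.T @ X, X.T)              # hat matrix (fixed design)

train_err, test_err = [], []
for _ in range(2000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    Ynew = X @ beta + rng.normal(scale=sigma, size=n)  # fresh Y's at the same x's
    Yhat = H @ Y
    train_err.append(np.mean((Y - Yhat) ** 2))
    test_err.append(np.mean((Ynew - Yhat) ** 2))

avg_train = np.mean(train_err)   # close to sigma^2 (n - p)/n = 25/30
avg_test = np.mean(test_err)     # close to sigma^2 (n + p)/n = 35/30
```

The fitted values track the training responses they were built from, so the training error systematically understates the error on fresh data, exactly as (55) predicts.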

8.3 Variable Selection

When selecting variables for a model, one needs to consider the research hypothesis as well as any potential confounding variables to control for. For example, in most medical studies, age and gender are always included in the model since they are common confounders; researchers are looking for the effect of other predictors on the response once age and gender have been accounted for. If your research hypothesis specifically addresses the effect of a variable, say expenditure, you need to either include it in your model or show explicitly in your analysis why the variable does not belong. Furthermore, one needs to consider the purpose of the analysis. If the purpose is simply to come up with accurate predictions for the response, researchers tend to look for variables that are easily obtained and that account for a high degree of variation in the response.

However we choose to select our variables, we should always be wary of overinterpretation of the model in a multiple regression setting. Here's why:

1) The selected variables are not necessarily special. Variable selection methods are highly influenced by correlations between variables. Particularly when two predictors are highly correlated (R² > .8), usually one will be omitted despite the fact that the other may be a good predictor on its own. The problem is that since the two variables contain so much overlapping information, once you include one, the second variable accounts for very little additional variability in the response.

2) Interpretation of coefficients. If we have a regression coefficient of 0.2 for variable A, the interpretation is as follows: while holding the values of all other predictors constant, a 1-unit increase in the value of A is associated with an increase of 0.2 in the expected value of the response.

3) Lastly, for observational studies, causality is rarely implied.

If the dimension p of the covariate X is large, then we might get better predictions by omitting some covariates. Models with many covariates have low bias but high variance; models with few covariates have high bias but low variance. The best predictions come from balancing these two extremes. This is called the bias-variance tradeoff. To reiterate:

including many covariates leads to low bias and high variance;

including few covariates leads to high bias and low variance.

The problem of deciding which variables to include in the regression model to achieve a good tradeoff is called model selection or variable selection. It is convenient in model selection to first standardize all the variables by subtracting off the mean and dividing by the standard deviation. For example, we replace xij with (xij − x̄j)/sj, where x̄j = n⁻¹ Σᵢ₌₁ⁿ xij is the mean of covariate xj and sj is its standard deviation. The R function scale will do this for you. Thus, we assume throughout this section that

(1/n) Σᵢ₌₁ⁿ yᵢ = 0,    (1/n) Σᵢ₌₁ⁿ yᵢ² = 1,               (56)

(1/n) Σᵢ₌₁ⁿ xᵢⱼ = 0,   (1/n) Σᵢ₌₁ⁿ xᵢⱼ² = 1,   j = 1, . . . , p.   (57)

Given S ⊆ {1, . . . , p}, let (Xj : j ∈ S) denote a subset of the covariates. There are 2ᵖ such subsets. Let β(S) = (βj : j ∈ S) denote the coefficients of the corresponding set of covariates and let β̂(S) = (X_SᵀX_S)⁻¹X_SᵀY denote the least squares estimate of β(S), where X_S denotes the design matrix for this subset of covariates. Thus, β̂(S) is the least squares estimate of β(S) from the submodel Y = X_S β(S) + ε. The vector of predicted values from model S is Ŷ(S) = X_S β̂(S). For the null model S = ∅, Ŷ is defined to be a vector of 0's. Let r̂_S(x) = Σ_{j∈S} β̂j(S) xj denote the estimated regression function for the submodel. We measure the predictive quality of the model via the prediction risk.

The prediction risk of the submodel S is defined to be

R(S) = (1/n) Σᵢ₌₁ⁿ E(Ŷᵢ(S) − Yᵢ*)²                         (58)

where Yᵢ* = r(xᵢ) + εᵢ* denotes the value of a future observation of Y at covariate value xᵢ.

Ideally, we want to select a submodel S to make R(S) as small as possible. We face two problems:

1. estimating R(S), and

2. searching through all the submodels S.

8.4 The Bias-Variance Tradeoff

All results in this subsection are calculated conditionally on X1, X2, . . . , Xn. Before discussing the estimation of the prediction risk, we recall an important result.

Bias-Variance Decomposition of the Prediction Risk

Rav(S) = σ² + (1/n) Σᵢ₌₁ⁿ bᵢ² + (1/n) Σᵢ₌₁ⁿ vᵢ             (59)
         [unavoidable error + squared bias + variance]

where bᵢ = E(r̂_S(Xᵢ)|Xᵢ = xᵢ) − r(xᵢ) is the bias and vᵢ = V(r̂_S(Xᵢ)|Xᵢ = xᵢ) is the variance.

Let us look at the bias-variance tradeoff in some simpler settings. In both settings we induce a bias by choosing an estimator that is closer to zero than the observed data. Surprisingly, this bias can yield an estimator with smaller MSE than the unbiased estimator.

8.1 Example. Suppose that we observe a single observation Y ∼ N(θ, σ²). The minimum variance unbiased estimator of θ is Y. Now consider the estimator θ̂ = bY, where 0 ≤ b ≤ 1. The bias is E(θ̂) − θ = θ(b − 1), the variance is b²σ², and the mean squared error is

MSE = bias² + variance = (1 − b)²θ² + b²σ².               (60)

Notice that the bias increases and the variance decreases as b → 0. Conversely, the bias decreases and the variance increases as b → 1. The optimal estimator is obtained by taking b = θ²/(θ² + σ²). The simple estimator Y did not produce the minimum MSE.

8.2 Example. Consider the following model:

Yᵢ ∼ N(θᵢ, σ²),  i = 1, . . . , p.                         (61)

We want to estimate θ = (θ₁, . . . , θp)ᵀ. Fix 1 ≤ k ≤ p and let

θ̂ᵢ = Yᵢ if i ≤ k,  and  θ̂ᵢ = 0 if i > k.                  (62)

Because the first k terms have no bias and the last p − k terms have no variance, the MSE is

MSE = Σᵢ₌ₖ₊₁ᵖ θᵢ² + kσ².                                  (63)

As k increases, the bias term decreases and the variance term increases. Using the fact that E(Yᵢ² − σ²) = θᵢ², which matches the bias term above, we can form an unbiased estimate of the risk, namely,

M̂SE = Σᵢ₌ₖ₊₁ᵖ (Yᵢ² − σ²) + kσ² = Σᵢ₌ₖ₊₁ᵖ Yᵢ² + 2kσ² − pσ².

(Note RSS = Σᵢ₌ₖ₊₁ᵖ Yᵢ².) Assuming σ² is known, we can estimate the optimal choice of k by minimizing

RSS + 2kσ²                                                (64)

over k. While admittedly weird, this example shows that the estimated risk equals the observed error (RSS) plus a term that increases with model complexity (k).
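Example 8.1 can be checked directly from formula (60): the MSE curve over b attains its minimum at b = θ²/(θ² + σ²), below the MSE of the unbiased estimator. A sketch (Python; the values of θ and σ are arbitrary):

```python
import numpy as np

theta, sigma = 2.0, 1.5

b = np.linspace(0, 1, 101)
mse = (1 - b) ** 2 * theta ** 2 + b ** 2 * sigma ** 2  # bias^2 + variance, eq. (60)

b_star = theta ** 2 / (theta ** 2 + sigma ** 2)        # analytic minimizer = 0.64
mse_unbiased = sigma ** 2                              # b = 1: the estimator Y itself
mse_opt = (1 - b_star) ** 2 * theta ** 2 + b_star ** 2 * sigma ** 2

# the shrunk estimator beats the unbiased one whenever theta is finite
better = mse_opt < mse_unbiased
```

At the optimum the MSE simplifies to θ²σ²/(θ² + σ²), which is always strictly smaller than σ²; this is the payoff for accepting a little bias.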

8.5 Risk Estimation and Model Scoring

An obvious candidate to estimate R(S) is the training error

R̂tr(S) = (1/n) Σᵢ₌₁ⁿ (Ŷᵢ(S) − Yᵢ)².                       (65)

For the null model S = ∅ we have Ŷᵢ = 0, i = 1, . . . , n, and R̂tr(S) is an unbiased estimator of R(S); this is the risk estimator we will use for that model. But in general this is a poor estimator of R(S) because it is very biased. Indeed, if we add more and more covariates to the model, we can track the data better and better and make R̂tr(S) smaller and smaller. For example, in the previous example R̂tr(S) = 0 if k = p. Thus if we used R̂tr(S) for model selection, we would be led to include every covariate in the model.

8.3 Theorem. The training error is a downward-biased estimate of the prediction risk, meaning that E(R̂tr(S)) < R(S). In fact,

bias(R̂tr(S)) = E(R̂tr(S)) − R(S) = −(2/n) Σᵢ₌₁ⁿ Cov(Ŷᵢ, Yᵢ).   (66)

Now we discuss some better estimates of risk. For each one we obtain an estimate of risk that can be approximately expressed in the form R̂tr(S) + penalty(S). One picks the model that yields the minimum value; the first term decreases while the second term increases with model complexity. A challenge for most estimators of risk is that they require an estimate of σ².

Mallows Cp

Mallows Cp statistic is defined by

R̂(S) = R̂tr(S) + 2|S|σ̂²/n                                 (67)

where |S| denotes the number of terms in S and σ̂² is the estimate of σ² obtained from the full model (with all covariates in the model). This is simply the training error plus a bias correction.

This estimate is named in honor of Colin Mallows, who invented it. The first term in (67) measures the fit of the model while the second measures its complexity. Think of the Cp statistic as:

lack of fit + complexity penalty.                         (68)

The disadvantage of Cp is that we need to supply an estimate of σ.

Leave-one-out cross-validation

Another method for estimating risk is leave-one-out cross-validation.

The leave-one-out cross-validation (CV) estimator of risk is

R̂CV(S) = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷ(ᵢ))²                         (69)

where Ŷ(ᵢ) is the prediction for Yᵢ obtained by fitting the model with the ith observation omitted. It can be shown that

R̂CV(S) = (1/n) Σᵢ₌₁ⁿ ( (Yᵢ − Ŷᵢ(S)) / (1 − Hᵢᵢ(S)) )²      (70)

where Hᵢᵢ(S) is the ith diagonal element of the hat matrix

H(S) = X_S (X_SᵀX_S)⁻¹ X_Sᵀ.                              (71)

From equation (70) it follows that we can compute the leave-one-out cross-validation estimator without actually dropping out each observation and refitting the model. An important advantage of cross-validation is that it does not require an estimate of σ.

We can relate CV to Cp as follows. First, approximate each Hᵢᵢ(S) by the average value n⁻¹ Σᵢ₌₁ⁿ Hᵢᵢ(S) = trace(H(S))/n = |S|/n. This yields

R̂CV(S) ≈ RSS(S) / ( n (1 − |S|/n)² ).                     (72)

The right-hand side of (72) is called the generalized cross-validation (GCV) score and will come up again later. Next, use the fact that 1/(1 − x)² ≈ 1 + 2x and conclude that

R̂CV(S) ≈ R̂tr(S) + 2σ̂²|S|/n                               (73)

where σ̂² = RSS(S)/n. This is identical to Cp except that the estimator of σ² is different.
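The shortcut formula (70) can be verified against the definition (69) by actually refitting with each observation deleted. A sketch (Python with simulated data; in R the Hᵢᵢ are available via hatvalues):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)

# Shortcut (70): one fit, rescale residuals by 1 - H_ii
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = Y - H @ Y
cv_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Definition (69): drop each observation and refit
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    errs.append((Y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

# GCV replaces each H_ii by the average |S|/n, as in (72)
gcv = np.mean(resid ** 2) / (1 - p / n) ** 2
```

The two CV numbers agree to machine precision, confirming that leave-one-out needs only a single fit.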

Akaike Information Criterion

Another criterion for model selection is AIC (Akaike Information Criterion). The idea is to choose S to maximize

ℓS − |S|,                                                 (74)

or minimize −2ℓS + 2|S|, where ℓS = ℓS(β̂S, σ̂²) is the log-likelihood (assuming Normal errors) of the model evaluated at the MLE. This can be thought of as goodness of fit minus complexity. Assuming Normal errors,

ℓ(β, σ²) = constant − (n/2) log σ² − ||Y − Xβ||² / (2σ²).

Define RSS(S) as the residual sum of squares in model S. Inserting β̂ yields

ℓ(β̂, σ²) = constant − (n/2) log σ² − RSS(S) / (2σ²).

In this expression we can ignore (n/2) log σ² because it does not include any terms that depend on the fit of model S. Thus, up to a constant, we can write

AIC(S) = RSS(S)/σ² + 2|S|.                                (75)

Equivalently, AIC finds the model that minimizes

RSS(S)/n + 2|S|σ̂²/n.                                     (76)

If we estimate σ using the error from the largest model, then minimizing AIC is equivalent to minimizing Mallows Cp.

Bayesian information criterion

Yet another criterion for model selection is BIC (Bayesian information criterion). Here we choose a model to maximize

BIC(S) = ℓS − (|S|/2) log n = −(n/2) log σ̂² − RSS(S)/(2σ̂²) − (|S|/2) log n.   (77)

The BIC score has a Bayesian interpretation. Let S = {S1, . . . , Sm}, where m = 2ᵖ, denote all the models. Suppose we assign the prior P(Sj) = 1/m over the models. Also, assume we put a smooth prior on the parameters within each model. It can be shown that the posterior probability for a model is approximately

P(Sj | data) ≈ e^{BIC(Sj)} / Σᵣ e^{BIC(Sr)}.              (78)

Hence, choosing the model with the highest BIC is like choosing the model with the highest posterior probability. But this interpretation is poor unless n is large relative to p. The BIC score also has an information-theoretic interpretation in terms of something called minimum description length. The BIC score is identical to AIC except that it puts a more severe penalty on complexity; it thus leads one to choose a smaller model than the other methods.

Summary

Cp:    R̂(S) = R̂tr(S) + 2|S|σ̂²_full / n
CV:    R̂(S) ≈ R̂tr(S) + 2|S|σ̂²_S / n
AIC:   −2ℓ(S) + 2|S|
BIC:   −2ℓ(S) + |S| log n

Note: the key term in each score is (or behaves like) R̂tr(S), so each of these methods has a similar form. They vary in how they estimate σ² and in how substantial a penalty is paid for model complexity.
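All of these scores can be computed from the RSS path of a nested family of models. The sketch below (Python; polynomial degree stands in for |S|, data simulated from a degree-1 truth, and the AIC/BIC expressions are written up to constants) shows the generic shapes: RSS always decreases with complexity, while the penalized scores turn back up:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
Y = 1 + 2 * x + rng.normal(size=n)            # true model: degree 1

def rss_for_degree(d):
    X = np.column_stack([x ** k for k in range(d + 1)])
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    return np.sum((Y - X @ b) ** 2)

sizes = np.arange(1, 7)                       # |S| = degree + 1
rss = np.array([rss_for_degree(d) for d in range(6)])
sigma2_full = rss[-1] / (n - sizes[-1])       # sigma^2 estimated from the largest model

cp = rss / n + 2 * sizes * sigma2_full / n    # Mallows Cp, as in (67)
aic = n * np.log(rss / n) + 2 * sizes         # up to additive constants
bic = n * np.log(rss / n) + sizes * np.log(n)
```

Every criterion drops sharply when the true slope term is added; after that, the RSS keeps shrinking but the penalties dominate, which is the whole point of the correction.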

8.6 Model Search

Once we choose a model selection criterion, such as cross-validation or AIC, we then need to search through all 2ᵖ models, assign a score to each one, and choose the model with the best score. We will consider four methods for searching through the space of models:

1. Fit all submodels.
2. Forward stepwise regression.
3. Ridge regression.
4. The lasso.

Fitting All Submodels. If p is not too large we can do a complete search over all the models.

8.4 Example. Consider the SAT data, but let us only consider three variables: Public, Expenditure, and Rank. There are 8 possible submodels. Here, x is a matrix of explanatory variables. (Do not include a column of 1's.) You can also use the nbest option, for example,

out = leaps(x, y, method="Cp", nbest=10)

This will report only the best 10 subsets of each size model. The output is a list with several components. In particular, out$which shows which variables are in the model, out$size shows how many parameters are in the model, and out$Cp shows the Cp statistic.

> library(leaps)
> x = cbind(expend, public, rank)
> out = leaps(x, sat, method = "Cp")
> out
$which
      1     2     3
1 FALSE FALSE  TRUE
1 FALSE  TRUE FALSE
1  TRUE FALSE FALSE
2  TRUE FALSE  TRUE
2 FALSE  TRUE  TRUE
2  TRUE  TRUE FALSE
3  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1" "2" "3"

$size
[1] 2 2 2 3 3 3 4

$Cp
[1]  19.36971 241.68429 242.40923  12.34101  16.84098 243.17995   4.00000

The best model is the one with the lowest Mallows Cp. Here, that model is the one containing all three covariates.

Stepwise. When p is large, searching through all 2ᵖ models is infeasible. In that case we need to search over a subset of all the models. One common method is stepwise regression. Stepwise regression can be run forward, backward, or in both directions. In forward stepwise regression, we start with no covariates in the model. We then add the one variable that leads to the best score. We continue adding variables one at a time this way. See Figure 23. Backward stepwise regression is the same except that we start with the biggest model and drop one variable at a time. Both are greedy searches; neither is guaranteed to find the model with the best score. Backward selection is infeasible when p is larger than n, since β̂ will not be defined for the largest model. Hence, forward selection is preferred when p is large.

8.5 Example. Figure 24 shows forward stepwise regression on a data set with 13 correlated predictors. The x-axis shows the order in which the variables entered. The y-axis is the cross-validation

Forward Stepwise Regression

1. For j = 1, . . . , p, regress Y on the jth covariate Xj and let R̂j be the estimated risk. Set ĵ = argminⱼ R̂j and let S = {ĵ}.

2. For each j ∈ Sᶜ, fit the regression model Y = βj Xj + Σ_{s∈S} βs Xs + ε and let R̂j be the estimated risk. Set ĵ = argmin_{j∈Sᶜ} R̂j and update S ← S ∪ {ĵ}.

3. Repeat the previous step until all variables are in S or until it is not possible to fit the regression.

4. Choose the final model to be the one with the smallest estimated risk.

Figure 23: Forward stepwise regression.

score. We start with a null model and we find that adding x4 reduces the cross-validation score the most. Next we try adding each of the remaining variables to the model and find that x13 leads to the most improvement. We continue this way until all the variables have been added. The sequence of models chosen by the algorithm is

Y = β̂4 X4                               S = {4}
Y = β̂4 X4 + β̂13 X13                     S = {4, 13}       (79)
Y = β̂4 X4 + β̂13 X13 + β̂3 X3             S = {4, 13, 3}
...

The best overall model we find is the model with five variables x4, x13, x3, x1, x11, although the model with seven variables is essentially just as good.
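The greedy search of Figure 23 is only a few lines of code. This sketch (Python, simulated data in which only two of ten predictors matter) scores candidates by RSS for simplicity; a faithful implementation would score each candidate by an estimated risk such as CV or Cp, as the figure specifies:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 10
X = rng.normal(size=(n, p))
Y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=n)   # only columns 0 and 4 matter

def rss(cols):
    if not cols:
        return float(np.sum(Y ** 2))
    b = np.linalg.lstsq(X[:, cols], Y, rcond=None)[0]
    return float(np.sum((Y - X[:, cols] @ b) ** 2))

selected, remaining, order = [], list(range(p)), []
while remaining:
    # greedily add the variable that most reduces the residual sum of squares
    scores = [rss(selected + [j]) for j in remaining]
    j_best = remaining[int(np.argmin(scores))]
    selected.append(j_best)
    remaining.remove(j_best)
    order.append(j_best)

first_two = set(order[:2])   # the two true signal variables should enter first
```

Because the search is greedy, the entry order is a path through model space, not a guarantee of optimality, which is exactly the caveat made for stepwise methods above.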

8.6.1 Regularization: Ridge Regression and the Lasso

Another way to deal with variable selection is to use regularization or penalization. Specifically, we define β̂λ to minimize the penalized sums of squares

Q(β) = Σᵢ₌₁ⁿ (yᵢ − xᵢᵀβ)² + λ pen(β)

where pen(β) is a penalty and λ ≥ 0 is a tuning parameter. The bigger λ, the bigger the penalty for model complexity. We consider three choices for the penalty:

L0 penalty: ||β||₀ = #{j : βj ≠ 0}

Figure 24: Forward stepwise regression on the 13 variable data (cross-validation score versus number of variables, labeled by the variable added at each step).

L1 penalty: ||β||₁ = Σⱼ₌₁ᵖ |βⱼ|

L2 penalty: ||β||₂² = Σⱼ₌₁ᵖ βⱼ².

The L0 penalty would force us to choose estimates which make many of the βⱼ's equal to 0, but there is no way to minimize Q(β) without searching through all the submodels. The L2 penalty is easy to implement. The estimate that minimizes

Σᵢ₌₁ⁿ (Yᵢ − Σⱼ₌₁ᵖ βⱼXᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ²

is called the ridge estimator. It can be shown that the minimizer of this penalized sum of squares (assuming the features are standardized) is

β̂λ = (XᵀX + λI)⁻¹XᵀY,

where I is the identity matrix. When λ = 0 we get the least squares estimate (low bias, high variance). When λ → ∞ we get β̂λ = 0 (high bias, low variance). Ridge regression produces a linear estimator: β̂λ = SY where S = (XᵀX + λI)⁻¹Xᵀ, and Ŷ = HλY where Hλ = X(XᵀX + λI)⁻¹Xᵀ.

For regression we keep track of two types of degrees of freedom: model degrees of freedom (p, the number of covariates) and error degrees of freedom (n − p − 1). As the model incorporates more covariates, it becomes more complex, fitting the data better and better and, eventually, over-fitting. For the remainder of the notes, when we say effective degrees of freedom, we always mean an analog of the model degrees of freedom. For regularized regression, the effective degrees of freedom is defined to be df(λ) = trace(Hλ). When λ = 0 we have df(λ) = p (maximum complexity), and when λ → ∞, df(λ) → 0 (minimum complexity).

How do we choose λ? Recall that r̂(ᵢ)(xᵢ) = Ŷ(ᵢ), the leave-one-out fitted value, and the cross-validation estimate of predictive risk is

CV = Σᵢ₌₁ⁿ (yᵢ − r̂(ᵢ)(xᵢ))².

It can be shown that

CV = Σᵢ₌₁ⁿ ( (yᵢ − r̂(xᵢ)) / (1 − Hᵢᵢ) )².

Thus we can choose λ to minimize CV. An alternative criterion that is sometimes used is generalized cross-validation, or GCV. This is just an approximation to CV in which each Hᵢᵢ is replaced by its average b = n⁻¹ Σᵢ₌₁ⁿ Hᵢᵢ = df(λ)/n:

GCV = Σᵢ₌₁ⁿ ( (Yᵢ − r̂(xᵢ)) / (1 − b) )².

The problem with ridge regression is that we really haven't done variable selection, because we haven't forced any of the β̂ⱼ's to be 0. This is where the L1 penalty comes in.
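The closed form for the ridge estimator and the behavior of df(λ) can both be checked in a few lines. A sketch (Python, simulated standardized design; in R one might use lm.ridge from the MASS package):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize, as the notes assume
Y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def df(lam):
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(H)

# lambda = 0 recovers ordinary least squares
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
same_at_zero = np.allclose(ridge(0.0), beta_ols)

# effective degrees of freedom shrink from p toward 0 as lambda grows
dfs = [df(lam) for lam in (0.0, 1.0, 10.0, 1000.0)]
```

The df values interpolate smoothly between p and 0, which is what lets a single continuous knob λ play the role of model size.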

The lasso estimator β̂(λ) is the value of β that solves

min_{β ∈ Rᵖ} Σᵢ₌₁ⁿ (yᵢ − xᵢᵀβ)² + λ||β||₁                  (80)

where λ > 0 and ||β||₁ = Σⱼ₌₁ᵖ |βⱼ| is the L1 norm of the vector β.

The lasso is called basis pursuit in the signal processing literature. Equation (80) defines a convex optimization problem with a unique solution β̂(λ) that depends on λ. Typically, it turns out that many of the β̂ⱼ(λ)'s are zero. (See Figure 25 for intuition on this process.) Thus, the lasso performs estimation and model selection simultaneously. The selected model, for a given λ, is

Ŝ(λ) = {j : β̂ⱼ(λ) ≠ 0}.                                   (81)

The constant λ can be chosen by cross-validation. The estimator has to be computed numerically, but this is a convex optimization and so can be solved quickly. To see the difference in shrinkage and selection of terms in ridge versus lasso regression, compare Figures 26 and 27, respectively. For the chosen tuning parameters shown (selected by cross-validation), only three of the variables are included in the final lasso model (svi, lweight and cavol). In contrast, because ridge regression is not a model selection procedure, all of the terms are in the ridge model. The three chosen by the lasso do have the largest coefficients in the ridge model.

What is special about the L1 penalty? First, it is the closest penalty to the L0 penalty that makes Q(β) convex. Moreover, the L1 penalty captures sparsity.

Figure 25: from Hastie et al. 2001

Figure 26: from Hastie et al. 2001

Figure 27: from Hastie et al. 2001
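The selection behavior of (80) can be made concrete in the special case of an orthonormal design (XᵀX = I), where the lasso has a closed form: each OLS coefficient is soft-thresholded at λ/2. A sketch (Python, simulated data; this is the textbook orthonormal special case, not a general lasso solver):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(8)
n, p = 100, 6
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))       # exactly orthonormal columns
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0])
Y = Q @ beta_true + 0.1 * rng.normal(size=n)

beta_ols = Q.T @ Y                                 # OLS when X^T X = I
lam = 1.0
beta_lasso = soft_threshold(beta_ols, lam / 2)     # coordinatewise minimizer of (80)

# check against brute-force minimization of b^2 - 2 b z + lam |b| per coordinate
grid = np.linspace(-5, 5, 20001)
def argmin_1d(z):
    return grid[np.argmin(grid ** 2 - 2 * grid * z + lam * np.abs(grid))]
brute = np.array([argmin_1d(z) for z in beta_ols])

zeroed = beta_lasso == 0                           # the weak coordinates drop out
```

Coefficients whose OLS estimates fall below the threshold are set exactly to zero, not merely shrunk; that is the selection property that ridge's smooth L2 penalty cannot produce.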

Digression on Sparsity. We would like our estimator to be sparse, meaning that most j s are zero (or close to zero). Consider the following two vectors, each of length p: u = (1, 0, . . . , 0) v = (1/ p, 1/ p, . . . , 1/ p). Intuitively, u is sparse while v is not. Let us now compute the norms: ||u||1 = 1 ||u||2 = 1 ||v ||1 = p ||v ||2 = 1. So the L1 norm correctly captures sparseness. Two related variable selection methods are forward stagewise regression and lars. In forward stagewise regression we rst set Y = (0, . . . , 0)T and we choose a small, positive constant . Now we build the predicted values incrementally. Let Y denote the current vector of predicted values. Find the current correlations c = c(Y ) = XT (Y Y ) and set j = argmaxj |cj |. Finally, we update Y by the following equation: Y Y + sign(cj )xj . (83) (82)

This is like forward stepwise regression except that we only take small, incremental steps towards the next variable and we do not go back and refit the previous variables by least squares.

A modification of forward stagewise regression is called least angle regression (lars). We begin with all coefficients set to 0 and then find the predictor x_j most correlated with Y. We increase β̂_j in the direction of the sign of its correlation with Y and set the residual ε = Y − Ŷ. When some other predictor x_k has as much correlation with ε as x_j has, we increase (β̂_j, β̂_k) in their joint least squares direction, until some other predictor x_m has as much correlation with the residual ε. Continue until all predictors are in the model. A formal description is in Figure 28. lars can be easily modified to produce the lasso estimator: if a non-zero coefficient ever hits zero, remove it from the active set A of predictors and recompute the joint direction. This is why the lars function in R is used to compute the lasso estimator. You need to download the lars package first.


lars

1. Set Ŷ = 0, k = 0, A = ∅. Now repeat steps 2-3 until A^c = ∅.

2. Compute the following quantities:

   c = X^T (Y − Ŷ)    (84)
   C = max_j {|c_j|}
   A = {j : |c_j| = C}
   s_j = sign(c_j), j ∈ A
   X_A = (s_j x_j : j ∈ A)
   G = X_A^T X_A
   B = (1^T G^{-1} 1)^{-1/2}
   w = B G^{-1} 1
   u = X_A w
   a = X^T u

   where 1 is a vector of 1s of length |A|.

3. Set Ŷ ← Ŷ + γ̂ u    (85)

   where

   γ̂ = min+_{j ∈ A^c} { (C − c_j)/(B − a_j), (C + c_j)/(B + a_j) }.    (86)

Here, min+ means that the minimum is only over positive components.

Figure 28: A formal description of lars.
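The boxed algorithm can be implemented in a few dozen lines. The sketch below (illustrative Python, not from the notes; it assumes standardized predictors and a centered response, and omits the lasso modification) follows the steps of Figure 28, taking a final step of length C/B so that the last fit equals the full least squares fit:

```python
import numpy as np

def lars(X, y):
    # Least angle regression: at each stage move along the equiangular
    # direction of the active set until another predictor becomes
    # equally correlated with the residual.
    n, p = X.shape
    yhat = np.zeros(n)
    A = []
    steps = []
    while len(A) < p:
        c = X.T @ (y - yhat)                   # current correlations
        C = np.max(np.abs(c))
        A = [j for j in range(p) if abs(C - abs(c[j])) <= 1e-8 * C]
        s = np.sign(c[A])
        XA = X[:, A] * s                       # sign-adjusted active columns
        Ginv = np.linalg.inv(XA.T @ XA)
        one = np.ones(len(A))
        B = (one @ Ginv @ one) ** -0.5
        u = XA @ (B * (Ginv @ one))            # equiangular direction
        a = X.T @ u
        # smallest positive step at which an inactive predictor joins
        gammas = [g
                  for j in range(p) if j not in A
                  for g in ((C - c[j]) / (B - a[j]), (C + c[j]) / (B + a[j]))
                  if g > 1e-12]
        gamma = min(gammas) if gammas else C / B   # last step: to the OLS fit
        yhat = yhat + gamma * u
        steps.append((sorted(A), yhat.copy()))
    return steps
```

Each entry of the returned list records the active set and the fitted values after one step; predictors enter one at a time (generically), and the final fitted values coincide with ordinary least squares.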


Summary

1. The prediction risk R(S) = n^{-1} Σ_{i=1}^n (Ŷi(S) − Yi)² can be decomposed into unavoidable error, bias and variance.

2. Large models have low bias and high variance. Small models have high bias and low variance. This is the bias-variance tradeoff.

3. Model selection methods aim to find a model which balances bias and variance, yielding a small risk.

4. Cp or cross-validation are used to estimate the risk.

5. Search methods look through a subset of models and find the one with the smallest value of the estimated risk R̂(S).

6. The lasso estimates β with the penalized residual sum of squares Σ_{i=1}^n (Yi − Xi^T β)² + λ||β||1. Some of the estimates will be 0, which corresponds to omitting those variables from the model. lars is an efficient algorithm for computing the lasso estimates.

8.6.2 Model selection on SAT data

Forward Stepwise Regression
> step(lm(sat ~ log(takers)+rank), scope = list(lower = sat ~ log(takers)+rank,
+   upper = sat ~ log(takers)+rank+expend+years+income+public), direction = "forward")
Start:  AIC=346.7
sat ~ log(takers) + rank

         Df Sum of Sq   RSS AIC
+ expend  1     13149 32380 332
+ years   1      9827 35703 337
<none>              45530 347
+ income  1      1305 44224 347
+ public  1        16 45514 349

Step:  AIC=331.66
sat ~ log(takers) + rank + expend

         Df Sum of Sq   RSS AIC
+ years   1      5744 26637 324
<none>              32380 332
+ public  1       421 31959 333
+ income  1       317 32063 333

Step:  AIC=323.9
sat ~ log(takers) + rank + expend + years

         Df Sum of Sq     RSS   AIC
<none>                 26636.8 323.9
+ income  1      26.6  26610.2 325.9
+ public  1       4.6  26632.2 325.9

Call:
lm(formula = sat ~ log(takers) + rank + expend + years)

Coefficients:
(Intercept)  log(takers)         rank       expend        years
    388.425      -38.015        4.004        2.423       17.857

With forward stepwise regression, we can define the scope of the model to ensure that any variables we wish to always include will be in the model, regardless of which variables would otherwise have been selected. Above, we start with the model sat ~ log(takers)+rank, and the algorithm picks variables to add one by one. At any given step, forward stepwise regression selects the variable that gives the greatest decrease in the AIC criterion. If no variable decreases the AIC (or we have reached the upper scope of the model), stepwise regression stops. Above, we started with an AIC of 346.7 for our lower-scope model sat ~ log(takers) + rank (Fig. 29). Adding the variable expend dropped the AIC to 331.66. Further adding years lowered the AIC to 323.9. Given the presence of these four variables, neither the addition of income nor of public would have resulted in a reduction in AIC. As such, the algorithm stopped.
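The greedy loop that step() performs can be sketched directly (illustrative Python, not the notes' R code; here AIC is computed with the Gaussian-model formula n·log(RSS/n) plus 2 per fitted parameter, which gives the same ranking R uses):

```python
import numpy as np

def aic_linear(X, y):
    # AIC for a Gaussian linear model, up to an additive constant:
    # n*log(RSS/n) + 2*(number of parameters, counting sigma).
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * (p + 1)

def forward_select(X, y, start=()):
    # Greedy forward selection: repeatedly add the column that most
    # lowers AIC; stop when no addition improves it.
    n, p = X.shape
    selected = list(start)
    ones = np.ones((n, 1))
    def design(cols):
        return np.hstack([ones] + [X[:, [j]] for j in cols])
    current = aic_linear(design(selected), y)
    improved = True
    while improved:
        improved = False
        best_j, best_aic = None, current
        for j in range(p):
            if j in selected:
                continue
            a = aic_linear(design(selected + [j]), y)
            if a < best_aic:
                best_j, best_aic = j, a
        if best_j is not None:
            selected.append(best_j)
            current = best_aic
            improved = True
    return selected, current
```

The start argument plays the role of the lower scope: variables listed there are always kept in the model.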
par(mfrow = c(1,1))
AIC = c(346.7, 331.66, 323.9)
ommitted = c(325.85, 327.8)
plot(1:3, AIC, xlim = c(1,5), type = "l", xaxt = "n", xlab = " ",
+    main = "Forward Stepwise AIC Plot")
points(1:3, AIC, pch = 19)
points(4:5, ommitted)
axis(1, at = 1:5, labels = c("log(takers)+rank", "expend",
+    "years", "income", "public"))
abline(h = 323.9, lty = 2)
#Descriptions of most graphics functions used above can be found in
#help(par) and help(axis)

As seen in the graph, the addition of income or public would have increased the AIC criterion.

Backward Stepwise Regression

Figure 29: AIC

Backward Stepwise Regression works like forward stepwise regression, but in reverse. We generally start with a full model containing all possible variables and remove one at a time until the AIC is minimized. With simple data sets, forward and backward regression often find the same model. One drawback of backward regression is that n must be larger than p. For example, we cannot start with a model with 25 variables and only 20 observations.
> full = lm(sat ~ log(takers)+rank+expend+years+income+public)
> minimum = lm(sat ~ log(takers)+rank)
> step(full, scope = list(lower=minimum, upper = full), direction = "backward")
Start:  AIC=327.8
sat ~ log(takers) + rank + expend + years + income + public

         Df Sum of Sq   RSS AIC
- public  1        25 26610 326
- income  1        47 26632 326
<none>              26585 328
- years   1      4589 31174 334
- expend  1      6264 32850 336

Step:  AIC=325.85
sat ~ log(takers) + rank + expend + years + income

         Df Sum of Sq   RSS AIC
- income  1        27 26637 324
<none>              26610 326
- years   1      5453 32063 333
- expend  1      7430 34040 336

Step:  AIC=323.9
sat ~ log(takers) + rank + expend + years

         Df Sum of Sq   RSS AIC
<none>              26637 324
- years   1      5744 32380 332
- expend  1      9066 35703 337

Call:
lm(formula = sat ~ log(takers) + rank + expend + years)

Coefficients:
(Intercept)  log(takers)         rank       expend        years
    388.425      -38.015        4.004        2.423       17.857

Both Stepwise Regression

While forward and backward stepwise regression are both greedy algorithms, a third option is to run stepwise regression in both directions. It essentially toggles between: 1) one step of forward selection, and 2) one step of backward selection. As before, a step is only performed if it lowers the AIC; otherwise it is skipped. The algorithm stops when two consecutive steps are skipped. Note that although both-directions stepwise regression is less greedy, we are still not guaranteed to find the model with the lowest AIC. It could be that the best model would require exchanging sets of multiple variables, but stepwise can only move one step at a time. In a sense, we could find a local minimum, but not the global minimum.
step(minimum, scope = list(lower=minimum, upper = full), direction = "both")
step(full, scope = list(lower=minimum, upper = full), direction = "both")

For this particular data set, both-directions stepwise regression finds the same model whether we start with the minimum or the full model. In fact, it is the same model found by both forward and backward stepwise regression.

Ridge Regression

> library(MASS)
> ltakers = log(takers)
> predictors = cbind(ltakers, income, years, public, expend, rank)
> predictors = scale(predictors)
> sat.scaled = scale(sat)

The function for ridge regression (lm.ridge) is contained in the MASS library. For simplicity, I column-bind the predictors into a single matrix. Also, ridge regression requires the variables to be scaled.

> lambda = seq(0, 10, by=0.25)
> length(lambda)
[1] 41
> out = lm.ridge(sat.scaled ~ predictors, lambda = lambda)
> round(out$GCV, 5)
   0.00    0.25    0.50    0.75    1.00    1.25    1.50    1.75    2.00    2.25
0.00274 0.00271 0.00269 0.00268 0.00267 0.00266 0.00266 0.00265 0.00265 0.00265
   2.50    2.75    3.00    3.25    3.50    3.75    4.00    4.25    4.50    4.75
0.00265 0.00265 0.00265 0.00266 0.00266 0.00266 0.00267 0.00267 0.00268 0.00268
   5.00    5.25    5.50    5.75    6.00    6.25    6.50    6.75    7.00    7.25
0.00269 0.00269 0.00270 0.00270 0.00271 0.00271 0.00272 0.00273 0.00274 0.00274
   7.50    7.75    8.00    8.25    8.50    8.75    9.00    9.25    9.50    9.75
0.00275 0.00276 0.00276 0.00277 0.00278 0.00279 0.00280 0.00280 0.00281 0.00282
  10.00
0.00283
> which(out$GCV == min(out$GCV))
2.25
  10

A primary feature of the lm.ridge function is its ability to accept a vector of lambda penalization values. Above, the lambda vector is created as a sequence from 0 to 10 in increments of 0.25, totalling 41 elements. We will choose the model whose lambda value minimizes the generalized cross-validation (GCV) criterion. We see above that the lambda value which minimizes GCV is 2.25, our 10th element (Fig. 30).
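What lm.ridge computes can be sketched directly (illustrative Python, not the notes' R code; it assumes standardized predictors and a centered response): the ridge estimate is β̂_λ = (X^T X + λI)^{-1} X^T y, and the GCV score is RSS/[n(1 − tr(H)/n)²] where H is the ridge hat matrix.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    # Ridge estimate and GCV score for each penalty value.
    n, p = X.shape
    out = []
    for lam in lambdas:
        A = X.T @ X + lam * np.eye(p)
        beta = np.linalg.solve(A, X.T @ y)
        H = X @ np.linalg.solve(A, X.T)      # hat matrix at this lambda
        resid = y - X @ beta
        df = np.trace(H)                     # effective degrees of freedom
        gcv = np.sum(resid ** 2) / (n * (1 - df / n) ** 2)
        out.append((lam, beta, gcv))
    return out
```

At λ = 0 this reproduces ordinary least squares, and the coefficient vector shrinks monotonically toward zero as λ grows, which is the behavior plotted in Figure 31.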
> dim(out$coef)
[1]  6 41
> round(out$coef[,10], 4)
predictorsltakers  predictorsincome   predictorsyears  predictorspublic
          -0.4771            0.0223            0.1796           -0.0028
 predictorsexpend    predictorsrank
           0.1808            0.4195

In these models we have 6 predictors and 41 values of lambda. The model coefficients are stored in a 6-by-41 matrix, one column for each value of lambda. Above we found 2.25, the 10th element of the lambda vector, to minimize GCV, so we simply pull the 10th column from the coefficient matrix. Ridge regression is unique in that it doesn't actually perform model selection. Instead, variables of lesser predictive importance get coefficients that are smaller in magnitude (which is why scaling was needed). Above, we see that income and public have coefficients that are close to 0 in magnitude, indicating that they are relatively unimportant given the presence of the other variables. Log(takers) and rank, the variables that we have controlled for, have the largest coefficients. Lastly, years and expend have moderately large coefficients. These four variables were selected in each of the stepwise regression models.

Figure 30: GCV for Ridge regression


In the following plot we see how the value of each regression coefficient changes with lambda (Fig. 31).
par(mfrow = c(1,1))
plot(lambda, out$coef[1,], type = "l", col = 1, xlab = "Lambda",
+    ylab = "Coefficients",
+    main = "Plot of Regression Coefficients vs. Lambda Penalty\nRidge Regression",
+    ylim = c(min(out$coef), max(out$coef)))
abline(h = 0, lty = 2, lwd = 2)
abline(v = 2.25, lty = 2)
for(i in 2:6) points(lambda, out$coef[i,], type = "l", col = i)

Lasso
> library(lars)
> object = lars(x = predictors, y = sat.scaled)
> object

Call:
lars(x = predictors, y = sat.scaled)
R-squared: 0.892
Sequence of LASSO moves:
     ltakers rank years expend income public
Var        1    6     3      5      2      4
Step       1    2     3      4      5      6
> plot(object)
> object$Cp
         0          1          2          3          4          5          6
349.908084 103.404397  46.890645  35.639307   3.101503   5.089719   7.000000
attr(,"sigma2")
        6
0.1231440
attr(,"n")
[1] 50
> plot.lars(object, xvar="df", plottype="Cp")

In lasso regression the coefficients vary as a function of the tuning parameter. When the penalty is heaviest, the coefficients are all set to zero. Even as terms enter the model they are attenuated (shrunk towards zero), until the penalty disappears entirely (Fig. 32). We seek the model that minimizes the Mallows Cp criterion. The order in which the variables are added is listed above after "Sequence of LASSO moves": ltakers was added first, then rank, then years, etc. By examining object$Cp, we can see the Cp after each of the steps; see also Figure 33. The lowest Cp is found after 4 steps, that is, after ltakers, rank, years, and expend have entered. This is consistent with the models found previously (in stepwise and ridge regression).

Figure 31: ridge regression coefficients


Figure 32: lasso coefficients


Figure 33: Mallows Cp versus degrees of freedom


8.7 Variable Selection versus Hypothesis Testing

The difference between variable selection and hypothesis testing can be confusing. Consider a simple example. Let Y1, . . . , Yn ~ N(μ, 1). We want to compare two models: M0: N(0, 1) and M1: N(μ, 1).

Hypothesis Testing. We test H0: μ = 0 versus H1: μ ≠ 0. The test statistic is

Z = (Ȳ − 0)/√V(Ȳ) = √n Ȳ.

We reject H0 if |z| > z_{α/2}. For α = 0.05, we reject H0 if |z| > 2, i.e., if

|ȳ| > 2/√n.

AIC. The likelihood is proportional to

L(μ) = ∏_{i=1}^n e^{−(yi − μ)²/2} = e^{−n(ȳ − μ)²/2} e^{−ns²/2}

where s² = n^{-1} Σ_i (yi − ȳ)². Hence,

ℓ(μ) = −n(ȳ − μ)²/2 − ns²/2.

Recall that AIC = ℓ(μ̂) − |S|. The AIC scores of the two models are

AIC0 = ℓ(0) − 0 = −nȳ²/2 − ns²/2

and

AIC1 = ℓ(μ̂) − 1 = −ns²/2 − 1

since μ̂ = ȳ. We choose model 1 if AIC1 > AIC0, that is, if

−ns²/2 − 1 > −nȳ²/2 − ns²/2,

or

|ȳ| > √(2/n).

This is similar to, but not the same as, the hypothesis test.

BIC. The BIC scores are

BIC0 = ℓ(0) − (0/2) log n = −nȳ²/2 − ns²/2

and

BIC1 = ℓ(μ̂) − (1/2) log n = −ns²/2 − (1/2) log n.

We choose model 1 if BIC1 > BIC0, that is, if

|ȳ| > √(log n / n).

Summary

Hypothesis testing controls type I errors.
AIC/CV/Cp finds the most predictive model.
BIC finds the true model (with high probability).
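The three rules derived above differ only in their cutoff for |ȳ|, which a short sketch makes concrete (illustrative Python, not from the notes):

```python
import math

def reject_hypothesis(ybar, n, z=2.0):
    # Hypothesis test: choose M1 when |ybar| > z/sqrt(n)
    return abs(ybar) > z / math.sqrt(n)

def prefer_m1_aic(ybar, n):
    # AIC: choose M1 when |ybar| > sqrt(2/n)
    return abs(ybar) > math.sqrt(2.0 / n)

def prefer_m1_bic(ybar, n):
    # BIC: choose M1 when |ybar| > sqrt(log(n)/n)
    return abs(ybar) > math.sqrt(math.log(n) / n)
```

For n = 100 the cutoffs are 0.200 (test with z = 2), 0.141 (AIC) and 0.215 (BIC): AIC admits the mean term most readily, and once n exceeds e⁴ ≈ 55 BIC is the most conservative of the three.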


9 Nonlinear Regression

We can fit regression models when the regression function is nonlinear:

Yi = r(Xi; β) + εi

where the regression function r(x; β) is a known function except for some parameters β = (β1, . . . , βk).

9.1 Example. Figure 34 shows the weight of a patient on a weight rehabilitation program as a function of the number of days in the program. The data are from Venables and Ripley (1994). It is hypothesized that Yi = r(xi; β) + εi, where

r(x; β) = β0 + β1 2^{−x/β2}.

Since

lim_{x→∞} r(x; β) = β0,

we see that β0 is the ideal stable lean weight. Also, r(0; β) − r(∞; β) = β1, so β1 is the amount of weight to be lost. Finally, the expected remaining weight r(x; β) − β0 is one-half the starting remaining weight r(0; β) − β0 when x = β2. So β2 is the half-life, i.e., the time to lose half the remaining weight. The parameter estimate β̂ is found by minimizing

RSS(β) = Σ_{i=1}^n (yi − r(xi; β))².

Generally, this must be done numerically. The algorithms are iterative and you must supply starting values for the parameters. Here is how to fit the example in R.
> library(MASS)
> attach(wtloss)
> help(wtloss)
Description

The data frame gives the weight, in kilograms, of an obese patient at 52 time points over an 8 month period of a weight rehabilitation programme.

Format


This data frame contains the following columns:

Days    Time in days since the start of the programme.
Weight  Weight in kilograms of the patient.

> plot(Days,Weight,pch=19)
> out = nls(Weight ~ b0 + b1*2^(-Days/b2), data=wtloss,
+           start=list(b0=90,b1=95,b2=120))
> info = summary(out)
> print(info)
Formula: Weight ~ b0 + b1 * 2^(-Days/b2)

Parameters:
   Estimate Std. Error t value Pr(>|t|)
b0   81.374      2.269   35.86   <2e-16 ***
b1  102.684      2.083   49.30   <2e-16 ***
b2  141.911      5.295   26.80   <2e-16 ***
---

Residual standard error: 0.8949 on 49 degrees of freedom

Correlation of Parameter Estimates:
        b0      b1
b1 -0.9891
b2 -0.9857  0.9561

> b = info$parameters[,1]
> grid = seq(0,250,length=1000)
> fit = b[1] + b[2]*2^(-grid/b[3])
> lines(grid,fit,lty=1,lwd=3,col=2)
> plot(Days,info$residuals)
> lines(Days,rep(0,length(Days)))
> dev.off()
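nls fits this model with an iterative Gauss-Newton-type algorithm. A bare-bones version can be sketched as follows (illustrative Python with a damped Gauss-Newton loop; the function names are hypothetical, and nls itself is the tool to use in practice):

```python
import numpy as np

def model(x, b):
    # Weight-loss curve r(x; b) = b0 + b1 * 2^(-x/b2)
    b0, b1, b2 = b
    return b0 + b1 * 2.0 ** (-x / b2)

def jacobian(x, b):
    # Partial derivatives of r with respect to (b0, b1, b2)
    b0, b1, b2 = b
    g = 2.0 ** (-x / b2)
    return np.column_stack([
        np.ones_like(x),
        g,
        b1 * g * np.log(2.0) * x / b2 ** 2,
    ])

def gauss_newton(x, y, b, iters=100):
    # Minimize sum (y - r(x; b))^2 by iterated linearization,
    # halving the step whenever it fails to reduce the RSS.
    for _ in range(iters):
        r = y - model(x, b)
        J = jacobian(x, b)
        step, *_ = np.linalg.lstsq(J, r, rcond=None)
        t, rss = 1.0, r @ r
        while t > 1e-10 and np.sum((y - model(x, b + t * step)) ** 2) > rss:
            t /= 2.0
        b = b + t * step
    return b
```

Starting from (90, 95, 120), as in the nls call above, the iterations converge to the minimizing parameter values.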

The fit and residuals are shown in Figure 34.


Figure 34: Weight Loss Data

10 Logistic Regression

10.1 Example. Our first example concerns the probability of extinction as a function of island size (Sleuth case study 21.1). The data give the island size, the number of bird species present in 1949 and the number of these extinct by 1959.

island            area  atrisk extinctions
Ulkokrunni      185.80      75           5
Maakrunni       105.80      67           3
Ristikari        30.70      66          10
Isonkivenletto    8.50      51           6
Hietakraasukka    4.80      28           3
Kraasukka         4.50      20           4
Lansiletto        4.30      43           8
Pihlajakari       3.60      31           3
Tyni              2.60      28           5
Tasasenletto      1.70      32           6
Raiska            1.20      30           8
Pohjanletto       0.70      20           2
Toro              0.70      31           9
Luusiletto        0.60      16           5
Vatunginletto     0.40      15           7
Vatunginnokka     0.30      33           8
Tiirakari         0.20      40          13
Ristikarenletto   0.07       6           3

Let ni be the number of species at risk and Yi be the number of extinctions out of ni. We assume Yi ~ Binomial(ni, pi) where pi is a function of the area. Define xi = log(areai). Plotting p̂i = Yi/ni as a function of xi we see an s-shaped decline in the response variable, but log[p̂i/(1 − p̂i)] declines linearly with xi (Display 21.2). This example motivates logistic regression.

Logistic regression is a generalization of regression that is used when the outcome Y is binary or binomial. We start with binary data (ni = 1), which is most common in practice. In a later section we revisit the binomial case with ni > 1. Suppose that Yi ∈ {0, 1} and we want to relate Y to some covariate x. The usual regression model is not appropriate since it does not constrain Y to be binary. With the logistic regression model we assume that

E(Yi|Xi) = P(Yi = 1|Xi) = e^{β0+β1 Xi} / (1 + e^{β0+β1 Xi}).

(Note that since Yi is binary, E(Yi|Xi) = P(Yi = 1|Xi).) The parameter β1 controls the steepness of the curve; the parameter β0 controls its horizontal shift (see Fig. 36).


Figure 36: The logistic function p = e^x/(1 + e^x).

Figure 37: From Sleuth


Define the logit function

logit(z) = log( z/(1 − z) ).

Also, define πi = P(Yi = 1|Xi). Then we can rewrite the logistic model as

logit(πi) = β0 + β1 Xi.

The extension to several covariates is straightforward:

logit(πi) = β0 + Σ_{j=1}^p βj xij = xi^T β.

The logit is the log odds function. Exponentiating the logit yields the odds, so the odds of a Y = 1 response at X = x are e^{β0+β1 x}. Thus the ratio of the odds at X = A to the odds at X = B is the odds ratio, which simplifies to e^{β1(A−B)}. For instance, if A = B + 1 then the odds ratio is e^{β1}. In Fig. 37 the nonlinearity of the probability is contrasted with the linear log odds. In this plot η = β0 + β1 x. Increasing η by one unit anywhere on the horizontal axis changes the log odds by the same quantity. In contrast, the probability tends toward an asymptote of zero or one as η becomes very small or very large, respectively.

How do we estimate the parameters of the logistic regression function? Usually we use maximum likelihood. Let's review the basics of maximum likelihood. Let Y ∈ {0, 1} denote the outcome of a coin toss. We call Y a Bernoulli random variable. Let π = P(Y = 1) and 1 − π = P(Y = 0). The probability function is

f(y; π) = π^y (1 − π)^{1−y}.

The probability function for n independent tosses, Y1, . . . , Yn, is

f(y1, . . . , yn; π) = ∏_{i=1}^n f(yi; π) = ∏_{i=1}^n π^{yi} (1 − π)^{1−yi}.

The likelihood function is just the probability function regarded as a function of the parameter, treating the data as fixed:

L(π) = ∏_{i=1}^n π^{yi} (1 − π)^{1−yi}.

The maximum likelihood estimator, or MLE, is the value π̂ that maximizes L(π). Maximizing the likelihood is equivalent to maximizing the loglikelihood function

ℓ(π) = log L(π) = Σ_{i=1}^n [ yi log π + (1 − yi) log(1 − π) ].

Setting the derivative of ℓ(π) to zero yields

π̂ = n^{-1} Σ_{i=1}^n Yi.

Thus the MLE for π is easily obtained. The MLE for β in the logistic regression model is not so readily obtained. Recall that the Fisher information is defined to be

I(π) = −E[ ∂²ℓ(π)/∂π² ].

The approximate standard error is

se(π̂) = 1/√I(π̂) = √( π̂(1 − π̂)/n ).

Notice that the standard error of π̂ is a function of the mean π̂.

Returning to logistic regression, the likelihood function is

L(β) = ∏_{i=1}^n f(yi|Xi; β) = ∏_{i=1}^n πi^{Yi} (1 − πi)^{1−Yi}

where

πi = e^{Xi^T β} / (1 + e^{Xi^T β}).

The maximum likelihood estimator has to be found numerically. The usual algorithm is called reweighted least squares and works as follows. First set starting values β̂^{(0)}. Now, for k = 1, 2, . . ., do the following steps until convergence:

1. Compute fitted values

   π̂i = e^{Xi^T β̂^{(k)}} / (1 + e^{Xi^T β̂^{(k)}}),   i = 1, . . . , n.

2. Define an n × n diagonal weight matrix W whose ith diagonal element is π̂i(1 − π̂i).

3. Define the adjusted response vector

   Z = X β̂^{(k)} + W^{-1}(Y − π̂)

   where π̂^T = (π̂1, . . . , π̂n).

4. Take

   β̂^{(k+1)} = (X^T W X)^{-1} X^T W Z,

   which is the weighted linear regression of Z on X.

Iterating this until convergence yields β̂, the MLE.
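The four steps translate almost line for line into code. Here is a minimal sketch (illustrative Python, not from the notes; the intercept is handled by including a column of 1s in X):

```python
import numpy as np

def logistic_irls(X, y, iters=25):
    # Iteratively reweighted least squares for logistic regression.
    # X: n x p design matrix (include a column of 1s for an intercept).
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))              # step 1: fitted values
        w = np.clip(pi * (1.0 - pi), 1e-10, None)    # step 2: diagonal of W
        z = eta + (y - pi) / w                       # step 3: adjusted response
        # step 4: weighted least squares of z on X, (X^T W X)^{-1} X^T W z
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return beta
```

At convergence the score equations X^T(Y − π̂) = 0 are satisfied, which is a convenient check on the implementation.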

The standard errors are given by V̂(β̂) = (X^T W X)^{-1}.

10.2 Example. This example is drawn from the Sleuth text. An association was noted between keeping pet birds and increased risk of lung cancer. To study this further, a case-control study was conducted. Among patients under 65 years of age, 49 cases with lung cancer were identified. From the general population, 98 controls of similar age were selected. The investigators recorded Sex (1=F, 0=M), Age, Socioeconomic status (SS = high or low), years of smoking, average rate of smoking, and whether a pet bird was kept for at least 6 months, 5 to 14 years prior to diagnosis (BK=1) or examination. We call those with lung cancer cases and the others controls. Age and smoking history are known risk factors for cancer. The question is whether BK is an additional risk factor. Figure 38 shows the number of years a subject smoked versus their age. The plotting symbols show BK=1 (triangles) or BK=0 (circles). Symbols are filled if the subject is a case. In this figure it is obvious that smoking is associated with cancer. To see the relationship with BK, look at the distribution of triangles over horizontal stripes. For instance, among the non-smokers, the only lung cancer case was a bird keeper.

The first step of the analysis is to find a good model for the relationship between lung cancer and the other covariates (excluding BK). To visualize this, bin smoking by decade (0, 1-20, ..., 41-50) and calculate the proportion of cases among the subjects in each bin. An empirical estimate of the logit versus binned years of smoking shows that the logit increases as years of smoking increases (plot not presented). Using the available covariates, and including potential interactions and quadratic terms, we explore the model's prediction of case/control status with logistic regression. From this class of models we chose a simpler model that includes sex, age, SS and years of smoking. The final step is to include birdkeeping in the model.
Including BK leads to a drop in deviance of 11.29 (with one df), which is clearly significant (p-value .0008). The estimated coefficient of birdkeeping is β̂ = 1.33. The odds of lung cancer are estimated to be e^{1.33} = 3.8 times higher for those who keep birds than for those who do not. With a 95% CI for the coefficient of (0.533, 2.14), the odds of lung cancer are estimated to be between 1.70 and 8.47 times greater.

Scope of inference. These inferences apply to the Netherlands in 1985. Because this is an observational study, no causal inferences can be drawn, but these analyses do control for the effects of smoking and age. In the publication they cite medical rationale supporting the statistical findings.

Note: This case/control study is also known as a retrospective study. When the response variable is a rare outcome, like lung cancer, it is common to sample the study subjects retrospectively. In this way we can oversample the population of people who have the rare outcome. In a random sample virtually everyone would not have lung cancer. Using a retrospective study we cannot estimate the probability of lung cancer, but we can estimate the log odds ratio.

10.3 Example. The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1)


or absence (Y = 0) of coronary heart disease. There are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. A logistic regression yields the following estimates and Wald statistics Wj for the coefficients:

Covariate    β̂j      se      Wj   p-value
Intercept  -6.145   1.300  -4.738   0.000
sbp         0.007   0.006   1.138   0.255
tobacco     0.079   0.027   2.991   0.003
ldl         0.174   0.059   2.925   0.003
adiposity   0.019   0.029   0.637   0.524
famhist     0.925   0.227   4.078   0.000
typea       0.040   0.012   3.233   0.001
obesity    -0.063   0.044  -1.427   0.153
alcohol     0.000   0.004   0.027   0.979
age         0.045   0.012   3.754   0.000

Are you surprised that systolic blood pressure is not significant, or by the minus sign on the obesity coefficient? If so, then you are confusing association and causation. The fact that blood pressure is not significant does not mean that blood pressure is not an important cause of heart disease. It means that it is not an important predictor of heart disease relative to the other variables in the model. Model selection can be done using AIC or BIC:

AIC_S = −2ℓ(β̂_S) + 2|S|

where S is a subset of the covariates. When ni = 1 it is not possible to examine residuals to evaluate the fit of our regression model. To fit this model in R we use the glm command, which stands for generalized linear model.

> attach(sa.data)
> out = glm(chd ~ ., family=binomial, data=sa.data)
> print(summary(out))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.1482113  1.2977108  -4.738 2.16e-06 ***
sbp          0.0065039  0.0057129   1.138 0.254928
tobacco      0.0793674  0.0265321   2.991 0.002777 **
ldl          0.1738948  0.0594451   2.925 0.003441 **
adiposity    0.0185806  0.0291616   0.637 0.524020
famhist      0.9252043  0.2268939   4.078 4.55e-05 ***
typea        0.0395805  0.0122417   3.233 0.001224 **
obesity     -0.0629112  0.0440721  -1.427 0.153447
alcohol      0.0001196  0.0044703   0.027 0.978655
age          0.0452028  0.0120398   3.754 0.000174 ***
---

> out2 = step(out)
Start:  AIC= 492.14
chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + alcohol + age

etc.

Step:  AIC= 487.69
chd ~ tobacco + ldl + famhist + typea + age

          Df Deviance    AIC
<none>         475.69 487.69
- ldl      1   484.71 494.71
- typea    1   485.44 495.44
- tobacco  1   486.03 496.03
- famhist  1   492.09 502.09
- age      1   502.38 512.38

> p = out2$fitted.values
> names(p) = NULL
> n = nrow(sa.data)
> predict = rep(0,n)
> predict[p > .5] = 1
> print(table(chd,predict))
   predict
chd   0   1
  0 256  46
  1  73  87
> error = sum( ((chd==1)&(predict==0)) | ((chd==0)&(predict==1)) )/n
> print(error)
[1] 0.2575758

From these results we see that the model predicts the wrong outcome just over 25% of the time in these data. If we use this model to predict the outcome in new data, we will find the predictions slightly less accurate (more later).

10.1 More About Logistic Regression

Just when you thought you understood logistic regression... Suppose we have a binary outcome Yi and a continuous covariate Xi. To examine the relationship between x and Y we used the logistic model

P(Y = 1|x) = e^{β0+β1 x} / (1 + e^{β0+β1 x}).

To formally test whether there is a relationship between x and Y we test

H0: β1 = 0 versus H1: β1 ≠ 0.

When the Xi's are random (so I am writing them with a capital letter) there is another way to think about this, and it is instructive to do so. Suppose, for example, that X is the amount of exposure to a chemical and Y is the presence or absence of disease. Instead of regressing Y on X, you might simply compare the distribution of X among the sick (Y = 1) and among the healthy (Y = 0). Let's consider both methods for analyzing the data.

Method 1: Logistic Regression (Y|X). The first plot in Figure 39 shows Y versus x and the fitted logistic model. The results of the regression are:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2785     0.5422  -4.202 2.64e-05 ***
x             2.1933     0.4567   4.802 1.57e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.629  on 99  degrees of freedom
Residual deviance:  72.549  on 98  degrees of freedom
AIC: 76.55

The test for H0: β1 = 0 is highly significant, and we conclude that there is a strong relationship between Y and X.

Method 2: Comparing Two Distributions (X|Y). Think of X as the outcome and Y as a group indicator. Examine the boxplots and the histograms in the figure. To test whether these distributions (or at least the means of the distributions) are the same, we can do a standard t-test of H0: E(X|Y = 1) = E(X|Y = 0) versus H1: E(X|Y = 1) ≠ E(X|Y = 0).

> x0 = x[y==0]
> x1 = x[y==1]
> print(t.test(x0,x1))

        Welch Two Sample t-test

data:  x0 and x1
t = -9.3604, df = 97.782, p-value = 3.016e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.341486 -1.522313
sample estimates:
mean of x mean of y
0.1148648 2.0467645

Again we conclude that there is a difference.

What's the connection? Let f0 and f1 be the probability density functions of X for the two groups. By Bayes' theorem, letting π = P(Y = 1),

P(Y = 1|X = x) = π f(x|Y = 1) / [ π f(x|Y = 1) + (1 − π) f(x|Y = 0) ]
               = π f1(x) / [ π f1(x) + (1 − π) f0(x) ].

Now suppose that X|Y = 0 ~ N(μ0, σ²) and that X|Y = 1 ~ N(μ1, σ²). Then the last equation becomes

P(Y = 1|X = x) = e^{β0+β1 x} / (1 + e^{β0+β1 x})

Figure 39: Logistic Regression

where

β0 = log( π/(1 − π) ) + (μ0² − μ1²)/(2σ²)    (87)

and

β1 = (μ1 − μ0)/σ².    (88)

This is exactly the logistic regression model! Moreover, β1 = 0 if and only if μ0 = μ1. Thus, the two approaches are testing the same thing. In fact, here is how I generated the data for the example. I took P(Y = 1) = 1/2, f0 = N(0, 1) and f1 = N(2, 1). Plugging into (87) and (88) we see that β0 = −2 and β1 = 2. Indeed, the fit gave β̂0 = −2.3 and β̂1 = 2.2. These are two different ways of answering the same question.
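This equivalence is easy to check numerically: the sketch below (illustrative Python, not from the notes) plugs the two normal class densities into Bayes' theorem and compares the result with the logistic curve whose coefficients come from (87) and (88).

```python
import math

def bayes_posterior(x, mu0=0.0, mu1=2.0, sigma=1.0, pi=0.5):
    # P(Y=1 | X=x) via Bayes' theorem with normal class densities
    def phi(v, mu):
        return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return pi * phi(x, mu1) / (pi * phi(x, mu1) + (1 - pi) * phi(x, mu0))

def logistic(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Coefficients implied by equations (87) and (88) with
# pi = 1/2, mu0 = 0, mu1 = 2, sigma = 1:
b0 = math.log(0.5 / 0.5) + (0.0 ** 2 - 2.0 ** 2) / 2.0   # -2
b1 = (2.0 - 0.0) / 1.0 ** 2                               #  2
```

For every x the two computations agree, confirming that the two-normals model induces exactly the logistic regression model.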

10.2 Logistic Regression With Replication

When there are replications (ni > 1), we can say more about diagnostics. Suppose there is one covariate taking values x1, . . . , xk and there are ni observations at each Xi. An example was given for extinctions as a function of island size. Now we let Yi denote the number of successes at Xi. Hence, Yi ~ Binomial(ni, πi). We can fit the logistic regression as before:

logit(πi) = Xi^T β

and now we define the Pearson residuals

ri = (Yi − ni π̂i) / √( ni π̂i(1 − π̂i) )

and the deviance residuals

di = sign(Yi − Ŷi) √( 2Yi log(Yi/Ŷi) + 2(ni − Yi) log[ (ni − Yi)/(ni − Ŷi) ] )

where Ŷi = ni π̂i, and 0 log 0 = 0. In glm models, deviances play the role that sums of squares play in a linear model. Pearson's residuals follow directly from a normal approximation to the binomial. The deviance residuals are the signed square root of the loglikelihood ratio comparing the saturated model (Ŷi = Yi) to the fitted model (Ŷi = ni π̂i). In practice these two diagnostics are approximately the same. The residuals will behave like N(0, 1) random variables when the model is correct. We can also form standardized versions. Let

H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}

where W is diagonal with ith element ni π̂i(1 − π̂i). The standardized Pearson residuals are

r̃i = ri / √(1 − Hii),

which should behave like N(0, 1) random variables if the model is correct. Similarly, define standardized deviance residuals by

d̃i = di / √(1 − Hii).

Goodness-of-Fit Test. Now we ask: is the model right? The Pearson χ² statistic

χ² = Σ_i ri²

and the deviance

D = Σ_i di²

both have, approximately, a χ²_{n−p} distribution if the model is correct. Large values are indicative of a problem.
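These formulas can be computed directly for grouped binomial data; here is an illustrative Python sketch (the function name and interface are hypothetical):

```python
import numpy as np

def binomial_residuals(y, n, pi_hat):
    # Pearson and deviance residuals for grouped binomial data.
    # y: successes, n: trials, pi_hat: fitted probabilities per group.
    yhat = n * pi_hat
    pearson = (y - yhat) / np.sqrt(n * pi_hat * (1 - pi_hat))

    def xlogx_ratio(a, b):
        # a * log(a/b), honouring the 0*log(0) = 0 convention
        out = np.zeros_like(a, dtype=float)
        nz = a > 0
        out[nz] = a[nz] * np.log(a[nz] / b[nz])
        return out

    dev2 = 2 * xlogx_ratio(y, yhat) + 2 * xlogx_ratio(n - y, n - yhat)
    deviance = np.sign(y - yhat) * np.sqrt(dev2)
    return pearson, deviance
```

Summing the squared deviance residuals reproduces the residual deviance reported by glm, so the function doubles as a check on the goodness-of-fit statistic.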

Let us now discuss the use of residuals. We'll do this in the context of an example. Here are the data:

y = c( 2,  7,  9, 14, 23, 29, 29, 29)
n = c(29, 30, 28, 27, 30, 31, 30, 29)
x = c(49.06, 52.99, 56.91, 60.84, 64.76, 68.69, 72.61, 76.54)

The data, from Strand (1930) and Collett (1991), are the numbers of flour beetles killed by carbon disulphide (CS2). The covariate is the dose of CS2 in mg/l. There are two ways to run the regression. In the first, the response goes in as two columns, (y, n − y). In this manner R knows how many trials (n) there are for each binomial observation. This is the preferred approach.

> out = glm(cbind(y,n-y) ~ x, family=binomial)
> print(summary(out))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.7312     1.8300  -8.050 8.28e-16 ***
x             0.2478     0.0303   8.179 2.87e-16 ***


    Null deviance: 137.7204  on 7  degrees of freedom
Residual deviance:   2.6558  on 6  degrees of freedom

> b = out$coef
> grid = seq(min(x),max(x),length=1000)
> l = b[1] + b[2]*grid
> fit = exp(l)/(1+exp(l))
> plot(x,y/n)
> lines(grid,fit,lwd=3)

The second way to input the data enters each trial as a distinct row.

> Y = c(rep(1,sum(y)),rep(0,sum(n)-sum(y)))
> X = c(rep(x,y),rep(x,n-y))
> out2 = glm(Y ~ X, family=binomial)
> print(summary(out2))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.73081    1.82170  -8.086 6.15e-16 ***
X             0.24784    0.03016   8.218  < 2e-16 ***

    Null deviance: 313.63  on 233  degrees of freedom
Residual deviance: 178.56  on 232  degrees of freedom

Notice that the outcome is the same whether the binomials are entered as counts or as distinct Bernoulli trials, except for the deviance. The correct deviance, which is useful as a goodness-of-fit test, is obtained from the first method. Going back to our original approach, to test goodness of fit:

> print(out$dev)
[1] 2.655771
> pvalue = 1-pchisq(out$dev,out$df.residual)
> print(pvalue)
[1] 0.8506433

We conclude that the fit is good. Still, we should look at the residuals.

> r = resid(out,type="deviance")
> p = out$linear.predictors
> plot(p,r,pch=19,xlab="linear predictor",ylab="deviance residuals")

[Figure: fitted dose-response curve (y/n vs. x), deviance residuals vs. linear predictor, and a normal QQ plot of the standardized deviance residuals.]

Figure 40: Beetles

Note that

> print(sum(r^2))
[1] 2.655771

gives back the deviance test. Now let's create standardized residuals.

> r = rstandard(out)
> plot(x,r,xlab="linear predictor",ylab="standardized deviance residuals")

10.3 Deviance Tests

For linear models, if we wish to compare the fit of a full model with a reduced model, we examine the difference in residual sums of squares via an F test:

    F = [(RSS_red − RSS_full)/df1] / [RSS_full/df2] ~ F(df1, df2),

where df1 = p_full − p_red and df2 = n − (p_full + 1). Assuming normality, RSS_full ~ σ² χ²(df2), so E[RSS_full/df2] = σ², and RSS_red − RSS_full ~ σ² χ²(df1). Consequently, σ² cancels out of the equation in the F test. The test assesses whether the difference in RSS between the full and reduced models is bigger than expected, with degrees of freedom equal to df1. The denominator of the F test is included simply as an estimate of σ².

For GLMs, a deviance test plays the same role in comparing the fit of a full model with a reduced model. If the log-likelihood is ℓ(μ) = Σi log f(yi | μ), then the deviance is defined as 2[ℓ(Y) − ℓ(Ŷ)], where ℓ(Y) is the log-likelihood of the saturated model. For a linear model the deviance reduces to RSS/σ². For Poisson and binomial models the mean determines the variance, so there is no unknown σ² to be estimated. To test a full vs. reduced model for a GLM we look at Dev_red − Dev_full rather than an F statistic. Under the null hypothesis, the difference in deviances is distributed χ²(df1). If df1 = 1, an alternative is Wald's test, β̂j / se(β̂j), the natural analog of the t-test in linear models. The deviance test and Wald's test give similar, but not identical, results.

11 Generalized Linear Models

We can write the logistic regression model as

    Yi ~ Bernoulli(πi),    g(πi) = Xi^T β,

where g(z) = logit(z). The function g is an example of a link function and the Bernoulli is an example of an exponential family, which we explain below. Any model in which Y has a distribution in the exponential family, and a function of its mean is linear in a set of predictors, is called a generalized linear model.

A probability function (or probability density function) is said to be in the exponential family if there are functions η(θ), B(θ), T(y) and h(y) such that

    f(y; θ) = h(y) e^{η(θ) T(y) − B(θ)}.

11.1 Example. Let Y ~ Poisson(λ). Then

    f(y; λ) = λ^y e^{−λ} / y! = (1/y!) e^{y log λ − λ}

and hence this is an exponential family with η(λ) = log λ, B(λ) = λ, T(y) = y, h(y) = 1/y!.

11.2 Example. Let Y ~ Binomial(n, π). Then

    f(y; π) = (n choose y) π^y (1 − π)^{n−y} = (n choose y) exp{ y log(π/(1−π)) + n log(1 − π) }.

In this case, η(π) = log(π/(1−π)), B(π) = −n log(1 − π), T(y) = y, and h(y) = (n choose y).
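The Poisson factorization in Example 11.1 can be checked numerically. Here is a small Python sketch (illustrative, not part of the notes) comparing the pmf written directly with the same pmf written in exponential-family form:

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability mass function, written directly."""
    return lam**y * math.exp(-lam) / math.factorial(y)

def poisson_expfam(y, lam):
    """Same pmf in exponential-family form h(y) exp(eta*T(y) - B), with
    eta = log(lam), B = lam, T(y) = y, h(y) = 1/y! as in Example 11.1."""
    eta, B, T, h = math.log(lam), lam, y, 1 / math.factorial(y)
    return h * math.exp(eta * T - B)
```

The two functions agree for every y and λ, which is exactly the claim of the example.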

If θ = (θ1, ..., θk) is a vector, then we say that f(y; θ) has exponential family form if

    f(y; θ) = h(y) exp{ Σ_{j=1}^k ηj(θ) Tj(y) − B(θ) }.

11.3 Example. Consider the Normal family with θ = (μ, σ). Now,

    f(y; θ) = exp{ (μ/σ²) y − y²/(2σ²) − [ μ²/(2σ²) + (1/2) log(2πσ²) ] }.

This is exponential with

    η1(θ) = μ/σ²,  T1(y) = y,  η2(θ) = −1/(2σ²),  T2(y) = y²,
    B(θ) = μ²/(2σ²) + (1/2) log(2πσ²),  h(y) = 1.

Now consider independent random variables Y1, ..., Yn, each from the same exponential family distribution. Let μi = E(Yi) and suppose that g(μi) = Xi^T β. This is a generalized linear model with link g. Notice that the regression equation E(Yi) = g^{−1}(Xi^T β) is based on the inverse of the link function.

11.4 Example (Normal Regression). Here, Yi ~ N(μi, σ²) and the link g(μi) = μi is the identity function.

11.5 Example (Logistic Regression). Here, Yi ~ Bernoulli(πi) and g(πi) = logit(πi).

11.6 Example (Poisson Regression). This is often used when the outcomes are counts. Here, Yi ~ Poisson(λi) and the usual link function is g(λi) = log(λi).


Although many link functions could be used, there are default link functions that are standard for each family. Here they are (from Table 12.5 in Weisberg):

Distribution   Link                       Inverse Link (Regression Function)
Normal         Identity: g(μ) = μ         μ = x^T β
Poisson        Log: g(μ) = log(μ)         μ = e^{x^T β}
Bernoulli      Logit: g(μ) = logit(μ)     μ = e^{x^T β} / (1 + e^{x^T β})
Gamma          Inverse: g(μ) = 1/μ        μ = 1/(x^T β)

In R you type:

glm(y ~ x, family=xxxx)

where xxxx is gaussian, binomial, poisson, etc. R will assume the default link.

11.7 Example. This is a famous data set collected by Sir Richard Doll in the 1950s. I am following Example 9.2.1 in Dobson. The data are on smoking and the number of deaths due to coronary heart disease. Here are the data:

           Smokers                Non-smokers
Age     Deaths  Person-years   Deaths  Person-years
35-44     32      52407           2      18790
45-54    104      43248          12      10673
55-64    206      28612          28       5710
65-74    186      12663          28       2585
75-84    102       5317          31       1462
Poisson regression is also appropriate for rate data, where a rate is a count of events occurring to a particular unit of observation, divided by some measure of that unit's exposure. For example, biologists might count the number of tree species in a forest; the rate would be the number of species per square kilometre. Demographers may model death rates in geographic areas as the count of deaths divided by person-years. Event rates can be calculated as events per unit of varying size. For instance, in these data the person-years vary: as people get older there are fewer at risk, and since at the time the data were collected most people smoked, there are fewer person-years for non-smokers. In Poisson regression, exposure is handled as an offset: the exposure variable enters on the right-hand side of the equation, but with its parameter estimate (for log(exposure)) constrained to 1.

    log(E[Y | X = x]) = x^T β + log(exposure).

This makes sense because we believe that log(E[Y | X = x] / exposure) = x^T β; we simply pull log(exposure) over to the right-hand side of the equation and fit a Poisson regression model. The offset option allows us to include log(exposure) in the model without estimating a coefficient for it.

A plot of deaths by age exhibits an obvious increasing relationship with age, with some hint of nonlinearity. The increase may differ between smokers and non-smokers, so we will include an interaction term. We took the midpoint of each age group as the age.
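The offset identity above is just algebra on the log link; a tiny Python illustration (the coefficients and exposure below are made-up numbers, not the fit from the text):

```python
import math

def expected_count(beta0, beta1, x, exposure):
    """Expected COUNT under a log-link Poisson model with an offset:
    log E[Y] = beta0 + beta1*x + log(exposure), offset coefficient fixed at 1."""
    return math.exp(beta0 + beta1 * x + math.log(exposure))

def expected_rate(beta0, beta1, x):
    """Expected RATE (events per unit of exposure): exp(beta0 + beta1*x)."""
    return math.exp(beta0 + beta1 * x)
```

By construction, expected_count = expected_rate × exposure, which is exactly why the offset lets a count model describe rates.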
> ### page 155 dobson
> deaths = c(32,104,206,186,102,2,12,28,28,31)
> age = c(40,50,60,70,80,40,50,60,70,80)
> py = c(52407,43248,28612,12663,5317,
+        18790,10673,5710,2585,1462)
> smoke = c(1,1,1,1,1,0,0,0,0,0)
> agesq = age*age
> sm.age = smoke*age
>
> # Notice the use of the offset for person-years below
>
> out = glm(deaths ~ smoke+age+agesq+sm.age, offset=log(py), family=poisson)
> summary(out)

Deviance Residuals:
       1        2        3        4        5
 0.43820 -0.27329 -0.15265  0.23393 -0.05700
       6        7        8        9       10
-0.83049  0.13404  0.64107     -0.  -0.01275

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.970e+01  1.253e+00 -15.717  < 2e-16 ***
smoke        2.364e+00  6.562e-01   3.602 0.000316 ***
age          3.563e-01  3.632e-02   9.810  < 2e-16 ***
agesq       -1.977e-03  2.737e-04  -7.223 5.08e-13 ***
sm.age      -3.075e-02  9.704e-03  -3.169 0.001528 **
---

    Null deviance: 935.0673  on 9  degrees of freedom
Residual deviance:   1.6354  on 5  degrees of freedom

AIC: 66.703

Based on the p-values from the Wald tests above, smoking appears to be quite important (but keep the usual causal caveats in mind). Suppose we want to compare smokers to non-smokers for 40 year olds. The estimated model is

    E(Y | x) = exp{ β0 + β1 smoke + β2 age + β3 age² + β4 smoke·age + log PY }

and hence

    E(Y | smoker, age = 40) / E(Y | non-smoker, age = 40)
      = exp{β0 + β1 + 40β2 + 1600β3 + 40β4 + log(52407)} / exp{β0 + 40β2 + 1600β3 + log(18790)}
      = e^{β1 + 40β4 + log(52407) − log(18790)}.

This gives us the ratio of the expected number of deaths in the two populations. But we are interested in the rate parameter, so we now want to drop the person-years terms. The estimated ratio of rates is

    e^{β̂1 + 40 β̂4} = 3.1,

suggesting that smokers in this group have a death rate due to coronary heart disease that is 3.1 times higher than non-smokers. Let's get a confidence interval for this. First, set

    γ = β1 + 40β4 = ℓ^T β   and   γ̂ = β̂1 + 40β̂4 = ℓ^T β̂,

where ℓ = (0, 1, 0, 0, 40)^T. Then

    V(γ̂) = ℓ^T V ℓ

where V = V(β̂). An approximate 95 percent confidence interval for γ is

    (a, b) = ( γ̂ − 2 sqrt(V(γ̂)),  γ̂ + 2 sqrt(V(γ̂)) ).

We are interested in ψ = e^γ. The confidence interval for ψ is (e^a, e^b). In R:

> summ = summary(out)
> v = summ$dispersion * summ$cov.unscaled
> # summ$dispersion is 1 unless we allow "over dispersion"
> # relative to the model. This is a topic I skipped over.
> print(v)
              (Intercept)         smoke           age         agesq        sm.age
(Intercept)  1.5711934366 -4.351992e-01 -4.392812e-02  2.998070e-04  6.445856e-03
smoke       -0.4351992084  4.306356e-01  7.424636e-03 -1.601373e-05 -6.280941e-03
age         -0.0439281178  7.424636e-03  1.318857e-03 -9.633853e-06 -1.144205e-04
agesq        0.0002998070 -1.601373e-05 -9.633853e-06  7.489759e-08  2.700594e-07
sm.age       0.0064458558 -6.280941e-03 -1.144205e-04  2.700594e-07  9.416984e-05
> ell = c(0,1,0,0,40)
> gam = sum(ell*out$coef)
> print(exp(gam))
[1] 3.106274
> se = sqrt(ell %*% v %*% ell)
> ci = exp(c(gam - 2*se, gam + 2*se))
> print(round(ci,2))
[1] 1.77 5.45

The result is that the rate is 2 to 5 times higher for smokers than non-smokers at age 40. Since heart disease is much more common than lung cancer, smoking has a bigger impact on public health through heart disease than through lung cancer.

There is a formal way to test the model for goodness of fit. As with logistic regression, we can compute the deviances. Recall that the log-likelihood for a Poisson is of the form ℓ(λ) = y log(λ) − λ. The deviance residuals are defined as

    di = sign(Yi − Ŷi) sqrt( 2[ℓ(Yi) − ℓ(Ŷi)] ) = sign(Yi − Ŷi) sqrt( 2[ Yi log(Yi/Ŷi) − (Yi − Ŷi) ] ).

The deviance is defined as

    D = Σi di².

This statistic is approximately distributed χ²(n − p − 1), where p is the number of covariates. If D is larger than expected (i.e., the p-value is small), this means that the Poisson model with the covariates included is not sufficient to explain the data. For these data the model appears to fit well.

> print(1-pchisq(out$deviance,df=5))
[1] 0.8969393
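The Poisson deviance residual formula above is easy to compute directly. A small Python sketch (illustrative, with made-up data, not the fit from the example):

```python
import math

def poisson_deviance_residuals(y, mu):
    """Deviance residuals d_i = sign(y_i - mu_i) *
    sqrt( 2 [ y_i log(y_i/mu_i) - (y_i - mu_i) ] ), with 0*log(0) = 0."""
    d = []
    for yi, mui in zip(y, mu):
        term = yi * math.log(yi / mui) if yi > 0 else 0.0
        d.append(math.copysign(math.sqrt(2 * (term - (yi - mui))), yi - mui))
    return d

def poisson_deviance(y, mu):
    """The deviance D = sum of squared deviance residuals."""
    return sum(di**2 for di in poisson_deviance_residuals(y, mu))
```

When every fitted mean equals its observation (the saturated model), the deviance is exactly 0, matching the definition of the deviance as a comparison against the saturated fit.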

Figure 41: Regression with measurement error. X is not observed. W is a noisy version of X. If you regress Y on W, you will get an inconsistent estimate of β1.

12 Measurement Error

Suppose we are interested in regressing the outcome Y on a covariate X, but we cannot observe X directly. Rather, we observe X plus noise U. The observed data are (Y1, W1), ..., (Yn, Wn), where

    Yi = β0 + β1 Xi + εi
    Wi = Xi + Ui

and E(Ui) = 0. This is called a measurement error problem or an errors-in-variables problem. The model is illustrated by the directed graph in Figure 41.

It is tempting to ignore the error and just regress Y on W. If the goal is just to predict Y from W then there is no problem. But if the goal is to estimate β1, regressing Y on W leads to inconsistent estimates. Let σx² = V(X), and assume that ε is independent of X, with mean 0 and variance σε². Also assume that U is independent of X, with mean 0 and variance σu². Let β̂1 be the least squares estimator of β1 obtained by regressing the Yi's on the Wi's. It can be shown that

    β̂1 →as λ β1                                      (89)

where

    λ = σx² / (σx² + σu²) < 1.                        (90)

Thus, the effect of the measurement error is to bias the estimated slope towards 0, an effect that is usually called attenuation bias.

Let us give a heuristic explanation of why (89) is true. For simplicity, assume that β0 = 0 and that E(X) = 0. So Ȳ ≈ 0, W̄ ≈ 0 and

    β̂1 = Σi (Yi − Ȳ)(Wi − W̄) / Σi (Wi − W̄)²  ≈  (n⁻¹ Σi Yi Wi) / (n⁻¹ Σi Wi²).

Now,

    n⁻¹ Σi Yi Wi = n⁻¹ Σi (β1 Xi + εi)(Xi + Ui)
                 = β1 n⁻¹ Σi Xi² + β1 n⁻¹ Σi Xi Ui + n⁻¹ Σi εi Xi + n⁻¹ Σi εi Ui
                 ≈ β1 σx².

(Note: Xi, Ui and εi are all uncorrelated, so by the law of large numbers the last three sums are approximately zero for large n.) Also,

    n⁻¹ Σi Wi² = n⁻¹ Σi (Xi + Ui)² = n⁻¹ Σi Xi² + n⁻¹ Σi Ui² + 2 n⁻¹ Σi Xi Ui ≈ σx² + σu²,

which yields (89).

If there are several observed values of W for each X then σu² can be estimated. Otherwise, σu² must be estimated by external means, such as through background knowledge of the noise mechanism. For our purposes, we will assume that σu² is known. Since σw² = σx² + σu², we can estimate σx² by

    σ̂x² = σ̂w² − σu²                                  (91)

where σ̂w² is the sample variance of the Wi's. Plugging these estimates into (90), we get an estimate λ̂ = (σ̂w² − σu²)/σ̂w² of λ. An estimate of β1 is

    β̃1 = β̂1 σ̂w² / (σ̂w² − σu²).                      (92)

This is called the method of moments estimator. This estimator makes little sense if σ̂w² − σu² ≤ 0. In such cases, one might reasonably conclude that the sample size is simply not large enough to estimate β1.

Another method for correcting the attenuation bias is SIMEX, which stands for simulation extrapolation and is due to Cook and Stefanski. Recall that the least squares estimate β̂1 is a consistent estimate of

    β1 σx² / (σx² + σu²).

Generate new random variables

    W̃i = Wi + √λ Ui*

where Ui* ~ N(0, σu²). The least squares estimate obtained by regressing the Yi's on the W̃i's is a consistent estimate of

    Λ(λ) = β1 σx² / (σx² + (1 + λ) σu²).              (93)

Repeat this process B times (where B is large) and denote the resulting estimators by β̂1,1(λ), ..., β̂1,B(λ). Then define

    Λ̂(λ) = B⁻¹ Σ_{b=1}^B β̂1,b(λ).
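The attenuation bias (89)-(90) and the method of moments correction (92) can be seen in a small simulation. A Python sketch (illustrative; all numbers below are assumptions, not from the notes):

```python
import random

# Simulate Y = beta1*X + eps and a noisy covariate W = X + U, then show
# that regressing Y on W gives roughly lambda*beta1 rather than beta1,
# and that the method of moments correction (92) approximately fixes it.
random.seed(1)
n, beta1, sx, su = 50000, 2.0, 1.0, 1.0

X = [random.gauss(0, sx) for _ in range(n)]
Y = [beta1 * x + random.gauss(0, 0.5) for x in X]   # beta0 = 0
W = [x + random.gauss(0, su) for x in X]            # noisy version of X

def ls_slope(w, y):
    wbar, ybar = sum(w) / len(w), sum(y) / len(y)
    num = sum((a - wbar) * (b - ybar) for a, b in zip(w, y))
    return num / sum((a - wbar) ** 2 for a in w)

lam = sx**2 / (sx**2 + su**2)        # = 0.5 here
naive = ls_slope(W, Y)               # consistent for lam*beta1, not beta1

wbar = sum(W) / n
sw2 = sum((a - wbar) ** 2 for a in W) / (n - 1)
mom = naive * sw2 / (sw2 - su**2)    # method of moments correction (92)
```

With σx = σu, the naive slope is attenuated by a factor of about 1/2, while the corrected estimate is close to the true β1.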

[Figure: Λ̂(λ) plotted against λ, showing the uncorrected least squares estimate at λ = 0 and the SIMEX estimate obtained by extrapolating to λ = −1.]

Figure 42: In the SIMEX method we extrapolate Λ̂(λ) back to λ = −1.

Now comes some clever sleight of hand. Setting λ = −1 in (93), we see that Λ(−1) = β1, which is the quantity we want to estimate. The idea is to compute Λ̂(λ) for a range of values of λ, such as 0, 0.5, 1.0, 1.5, 2.0, and then extrapolate the curve Λ̂(λ) back to λ = −1; see Figure 42. To do the extrapolation, we fit the values Λ̂(λj) to the curve

    G(λ; γ1, γ2, γ3) = γ1 + γ2 / (γ3 + λ)             (94)

using standard nonlinear regression. Once we have estimates of the γ's, we take

    β̂1 = G(−1; γ̂1, γ̂2, γ̂3)                           (95)

as our corrected estimate of β1. Fitting the nonlinear regression (94) is inconvenient; it often suffices to approximate G(λ) with a quadratic. Thus, we fit the Λ̂(λj)'s to the curve

    Q(λ; γ1, γ2, γ3) = γ1 + γ2 λ + γ3 λ²

and the corrected estimate of β1 is β̂1 = Q(−1; γ̂1, γ̂2, γ̂3) = γ̂1 − γ̂2 + γ̂3. An advantage of SIMEX is that it extends readily to nonlinear and nonparametric regression.
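The whole SIMEX recipe can be sketched end to end. A minimal Python illustration (the data, grid λ ∈ {0, 1, 2}, and exact quadratic interpolation below are simplifying assumptions, not the notes' procedure):

```python
import random, math

# Add extra noise of variance lambda*sigma_u^2 for lambda in {0, 1, 2},
# average B replicate slopes, pass a quadratic in lambda through the
# three points, and extrapolate it to lambda = -1.
random.seed(2)
n, B, beta1, sx, su = 5000, 10, 2.0, 1.0, 1.0

X = [random.gauss(0, sx) for _ in range(n)]
Y = [beta1 * x + random.gauss(0, 0.5) for x in X]
W = [x + random.gauss(0, su) for x in X]

def ls_slope(w, y):
    wbar, ybar = sum(w) / len(w), sum(y) / len(y)
    num = sum((a - wbar) * (b - ybar) for a, b in zip(w, y))
    return num / sum((a - wbar) ** 2 for a in w)

def lam_hat(lam):
    if lam == 0:
        return ls_slope(W, Y)
    reps = [ls_slope([w + math.sqrt(lam) * random.gauss(0, su) for w in W], Y)
            for _ in range(B)]
    return sum(reps) / B

f0, f1, f2 = lam_hat(0), lam_hat(1), lam_hat(2)
# quadratic through (0, f0), (1, f1), (2, f2), evaluated at lambda = -1
simex = 3 * f0 - 3 * f1 + f2
```

The extra noise makes the attenuation worse as λ grows (f2 < f1 < f0), and the quadratic extrapolation to λ = −1 moves the estimate back toward the true β1, though a quadratic only partially removes the bias here.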


Figure 43: CMB data. The horizontal axis is the multipole moment, essentially the frequency of fluctuations in the temperature field of the CMB. The vertical axis is the power, or strength, of the fluctuations at each frequency. The top plot shows the full data set. The bottom plot shows the first 400 data points. The first peak, around x ≈ 200, is obvious. There may be a second and third peak further to the right.

13 Nonparametric Regression

Now we will study nonparametric regression, also known as "learning a function" in the jargon of machine learning. We are given n pairs of observations (X1, Y1), ..., (Xn, Yn), where

    Yi = r(Xi) + εi,   i = 1, ..., n,                 (96)

and

    r(x) = E(Y | X = x).                              (97)

13.1 Example (CMB data). Figure 43 shows data on the cosmic microwave background (CMB). The first plot shows 899 data points over the whole range while the second plot shows the first 400 data points. We have noisy measurements Yi of r(Xi), so the data are of the form (96). Our goal is to estimate r. It is believed that r may have three peaks over the range of the data. The first peak is obvious from the second plot. The presence of a second or third peak is much less obvious; careful inferences are required to assess the significance of these peaks.

The simplest nonparametric estimator is the regressogram. Suppose the Xi's are in the interval [a, b]. Divide the interval into m bins of equal length; thus each bin has length h = (b − a)/m. Denote the bins by B1, ..., Bm. Let kj be the number of observations in bin Bj and let Ȳj be the mean of the Yi's in bin Bj. Define

    r̂n(x) = (1/kj) Σ_{i: Xi ∈ Bj} Yi = Ȳj   for x ∈ Bj.   (98)

We can rewrite the estimator as

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi

where ℓi(x) = 1/kj if x, Xi ∈ Bj and ℓi(x) = 0 otherwise. Thus,

    ℓ(x) = ( 0, 0, ..., 0, 1/kj, ..., 1/kj, 0, ..., 0 )^T.

In other words, the estimate r̂n is a step function obtained by averaging the Yi's over each bin.

13.2 Example (LIDAR). These are data from a light detection and ranging (LIDAR) experiment. LIDAR is used to monitor pollutants. Figure 44 shows 221 observations. The response is the log of the ratio of light received from two lasers. The frequency of one laser is the resonance frequency of mercury while the second has a different frequency. The estimates shown there are regressograms. The smoothing parameter h is the width of the bins. As the bin size h decreases, the estimated regression function r̂n goes from oversmoothing to undersmoothing.

Let us now compute the bias and variance of the estimator. For simplicity, suppose that [a, b] = [0, 1] and further suppose that the Xi's are equally spaced, so that each bin has k = n/m observations. Let us focus on r̂n(0). The mean (conditional on the Xi's) is

    E(r̂n(0)) = (1/k) Σ_{i ∈ B1} E(Yi) = (1/k) Σ_{i ∈ B1} r(Xi).

By Taylor's theorem, r(Xi) ≈ r(0) + Xi r′(0). So,

    E(r̂n(0)) ≈ r(0) + r′(0) (1/k) Σ_{i ∈ B1} Xi.

The largest that Xi can be in bin B1 is the length of the bin, h = 1/m. So the absolute value of the bias is approximately

    |r′(0)| (1/k) Σ_{i ∈ B1} Xi ≤ h |r′(0)|.

The variance is

    σ²/k = σ² m / n = σ² / (nh).
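The regressogram itself is only a few lines of code. A Python sketch (`regressogram` is an assumed helper name, not from the notes):

```python
# Partition [a, b] into m equal-width bins and estimate r(x) by the mean
# of the Y's whose X falls in the same bin as x.
def regressogram(xs, ys, a, b, m):
    h = (b - a) / m
    sums, counts = [0.0] * m, [0] * m
    for x, y in zip(xs, ys):
        j = min(int((x - a) / h), m - 1)   # clamp x == b into the last bin
        sums[j] += y
        counts[j] += 1
    means = [s / c if c > 0 else float("nan") for s, c in zip(sums, counts)]
    return lambda x: means[min(int((x - a) / h), m - 1)]
```

For example, with points 0.1, 0.2 in the first of two bins on [0, 1] and 0.6, 0.7 in the second, the estimate is the step function taking the two bin means.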


Figure 44: The LIDAR data from Example 13.2. The estimates are regressograms, obtained by averaging the Yi's over bins. As we decrease the binwidth h, the estimator becomes less smooth.


The mean squared error is the squared bias plus the variance:

    MSE = h² (r′(0))² + σ²/(nh).

Large bins cause large bias. Small bins cause large variance. The MSE is minimized at

    h = ( σ² / (2 (r′(0))² n) )^{1/3} = c n^{−1/3}

for some c. With this optimal value of h, the risk (or MSE) is of order n^{−2/3}.

Another simple estimator is the local average defined by

    r̂n(x) = (1/kx) Σ_{i: |Xi − x| ≤ h} Yi.            (99)

The smoothing parameter is h. We can rewrite the estimator as

    r̂n(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{i=1}^n K((x − Xi)/h)      (100)

where K(z) = 1 if |z| ≤ 1 and K(z) = 0 if |z| > 1. We can further rewrite the estimator as

    r̂n(x) = Σ_{i=1}^n Yi ℓi(x),   where   ℓi(x) = K((x − Xi)/h) / Σ_{t=1}^n K((x − Xt)/h).

We shall see later that this estimator has risk n^{−4/5}, which is better than n^{−2/3}.

Notice that both estimators so far have the form r̂n(x) = Σ_{i=1}^n ℓi(x) Yi. In fact, most of the estimators we consider have this form. An estimator r̂n of r is a linear smoother if, for each x, there exists a vector ℓ(x) = (ℓ1(x), ..., ℓn(x))^T such that

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi.                       (101)

Define the vector of fitted values

    Ŷ = (r̂n(x1), ..., r̂n(xn))^T                       (102)

where Y = (Y1, ..., Yn)^T. It then follows that

    Ŷ = L Y                                           (103)

where L is an n × n matrix whose ith row is ℓ(Xi)^T; thus, Lij = ℓj(Xi). The entries of the ith row show the weights given to each Yi in forming the estimate r̂n(Xi). The matrix L is called the smoothing matrix or the hat matrix. The ith row of L is called the effective kernel for estimating r(Xi). We define the effective degrees of freedom by

    ν = tr(L).                                        (104)

Compare with linear regression, where ν = p. The larger ν, the more complex the model; a smaller ν yields a smoother regression function.

13.3 Example (Regressogram). Recall that for x ∈ Bj, ℓi(x) = 1/kj if Xi ∈ Bj and ℓi(x) = 0 otherwise. Thus, r̂n(x) = Σ_{i=1}^n Yi ℓi(x). The vector of weights ℓ(x) looks like this:

    ℓ(x)^T = ( 0, 0, ..., 0, 1/kj, ..., 1/kj, 0, ..., 0 ).

In general, it is easy to see that there are ν = tr(L) = m effective degrees of freedom. The binwidth h = (b − a)/m controls how smooth the estimate is.

To see what the smoothing matrix L looks like, suppose that n = 9, m = 3 and k1 = k2 = k3 = 3. Then,

    L = (1/3) ×
        | 1 1 1 0 0 0 0 0 0 |
        | 1 1 1 0 0 0 0 0 0 |
        | 1 1 1 0 0 0 0 0 0 |
        | 0 0 0 1 1 1 0 0 0 |
        | 0 0 0 1 1 1 0 0 0 |
        | 0 0 0 1 1 1 0 0 0 |
        | 0 0 0 0 0 0 1 1 1 |
        | 0 0 0 0 0 0 1 1 1 |
        | 0 0 0 0 0 0 1 1 1 |

13.4 Example (Local averages). The local average estimator of r(x) is a special case of the kernel estimator discussed shortly. In this case, r̂n(x) = Σ_{i=1}^n Yi ℓi(x) where ℓi(x) = 1/kx if |Xi − x| ≤ h and ℓi(x) = 0 otherwise. As a simple example, suppose that n = 9, Xi = i/9 and h = 1/9. Then,

    L = | 1/2 1/2  0   0   0   0   0   0   0  |
        | 1/3 1/3 1/3  0   0   0   0   0   0  |
        |  0  1/3 1/3 1/3  0   0   0   0   0  |
        |  0   0  1/3 1/3 1/3  0   0   0   0  |
        |  0   0   0  1/3 1/3 1/3  0   0   0  |
        |  0   0   0   0  1/3 1/3 1/3  0   0  |
        |  0   0   0   0   0  1/3 1/3 1/3  0  |
        |  0   0   0   0   0   0  1/3 1/3 1/3 |
        |  0   0   0   0   0   0   0  1/2 1/2 |
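The smoothing matrix of Example 13.4 can be built programmatically and its effective degrees of freedom checked. A Python sketch (an illustration, not from the notes), using exact fractions so the trace comes out exactly:

```python
from fractions import Fraction

# Build the local-average smoothing matrix L for n = 9, X_i = i/9,
# h = 1/9, and compute the effective degrees of freedom nu = tr(L).
n = 9
X = [Fraction(i, 9) for i in range(1, n + 1)]
h = Fraction(1, 9)

L = []
for i in range(n):
    nbrs = [j for j in range(n) if abs(X[j] - X[i]) <= h]
    L.append([Fraction(1, len(nbrs)) if j in nbrs else Fraction(0)
              for j in range(n)])

nu = sum(L[i][i] for i in range(n))   # tr(L) = 2*(1/2) + 7*(1/3) = 10/3
```

Each row of L is a probability vector (the weights average the neighboring Y's), so every row sums to 1; the trace 10/3 is the ν of equation (104).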

13.5 Example (Linear Regression). We have Ŷ = HY where H = X(X^T X)^{−1} X^T. We can write

    r̂(x) = x^T β̂ = x^T (X^T X)^{−1} X^T Y = Σi ℓi(x) Yi.

13.1 Choosing the Smoothing Parameter

The smoothers depend on some smoothing parameter h and we will need some way of choosing it. Recall from our discussion of variable selection that the predictive risk is

    E(Y − r̂n(X))² = σ² + E(r(X) − r̂n(X))² = σ² + MSE,

where MSE means mean squared error. Also,

    MSE = ∫ bias²(x) p(x) dx + ∫ var(x) p(x) dx,

where bias(x) = E(r̂n(x)) − r(x) is the bias of r̂n(x) and var(x) = V(r̂n(x)) is the variance. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed the opposite is true; see Figure 45. This is called the bias-variance tradeoff. Minimizing risk corresponds to balancing bias and variance.

Ideally, we would like to choose h to minimize the risk R(h), but R(h) depends on the unknown function r(x). Instead, we will minimize an estimate R̂(h) of R(h). As a first guess, we might use the average residual sum of squares, also called the training error,

    (1/n) Σ_{i=1}^n (Yi − r̂n(Xi))²                    (105)

to estimate R(h). This turns out to be a poor estimate of R(h): it is biased downwards and typically leads to undersmoothing (overfitting). The reason is that we are using the data twice: to estimate the function and to estimate the risk. The function estimate is chosen to make Σ_{i=1}^n (Yi − r̂n(Xi))² small, so this will tend to underestimate the risk. We will instead estimate the risk using the leave-one-out cross-validation score, which is defined by

    CV = R̂(h) = (1/n) Σ_{i=1}^n (Yi − r̂(−i)(Xi))²     (106)

where r̂(−i) is the estimator obtained by omitting the ith pair (Xi, Yi).


Figure 45: The bias-variance tradeoff. The bias increases and the variance decreases with the amount of smoothing. The optimal amount of smoothing, indicated by the vertical line, minimizes the risk = bias² + variance.

The intuition for cross-validation is as follows. Note that

    E(Yi − r̂(−i)(Xi))² = E(Yi − r(Xi) + r(Xi) − r̂(−i)(Xi))²
                       = σ² + E(r(Xi) − r̂(−i)(Xi))²
                       ≈ σ² + E(r(Xi) − r̂n(Xi))²

and hence

    E(R̂) ≈ predictive risk.                           (107)

Thus the cross-validation score is a nearly unbiased estimate of the risk. There is a shortcut formula for computing R̂, just like in linear regression.

13.6 Theorem. Let r̂n be a linear smoother. Then the leave-one-out cross-validation score R̂(h) can be written as

    R̂(h) = (1/n) Σ_{i=1}^n ( (Yi − r̂n(Xi)) / (1 − Lii) )²      (108)

where Lii = ℓi(Xi) is the ith diagonal element of the smoothing matrix L.

The smoothing parameter h can then be chosen by minimizing R̂(h). Rather than minimizing the cross-validation score, an alternative is to use generalized cross-validation, in which each Lii in equation (108) is replaced with its average n⁻¹ Σ_{i=1}^n Lii = ν/n, where ν = tr(L) is the effective degrees of freedom. Thus, we would minimize

    GCV(h) = (1/n) Σ_{i=1}^n ( (Yi − r̂n(Xi)) / (1 − ν/n) )² = R̂training / (1 − ν/n)².      (109)
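The shortcut (108) can be verified numerically. A Python sketch (assumed toy data; the local-average smoother with renormalized weights is one smoother for which the identity holds exactly):

```python
# Compare brute-force leave-one-out CV with the shortcut formula (108)
# for a local average smoother.
def weights(x, xs, h, skip=None):
    ks = [j for j, xj in enumerate(xs) if abs(xj - x) <= h and j != skip]
    return {j: 1.0 / len(ks) for j in ks}

xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
ys = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0]
h = 0.15

# brute-force leave-one-out cross-validation
cv = 0.0
for i, (xi, yi) in enumerate(zip(xs, ys)):
    w = weights(xi, xs, h, skip=i)            # refit without point i
    pred = sum(wj * ys[j] for j, wj in w.items())
    cv += (yi - pred) ** 2
cv /= len(xs)

# shortcut using only the full fit and the diagonal of L
shortcut = 0.0
for i, (xi, yi) in enumerate(zip(xs, ys)):
    w = weights(xi, xs, h)
    fit = sum(wj * ys[j] for j, wj in w.items())
    shortcut += ((yi - fit) / (1 - w[i])) ** 2
shortcut /= len(xs)
```

The two quantities agree to machine precision, so the n refits of the brute-force loop are never needed for a linear smoother of this kind.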

Usually, the bandwidth that minimizes the generalized cross-validation score is close to the bandwidth that minimizes the cross-validation score. Using the approximation (1 − x)⁻² ≈ 1 + 2x, we see that

    GCV(h) ≈ (1/n) Σ_{i=1}^n (Yi − r̂n(Xi))² + 2νσ̂²/n ≡ Cp      (110)

where σ̂² = n⁻¹ Σ_{i=1}^n (Yi − r̂n(Xi))². Equation (110) is just like the Cp statistic.

13.2 Kernel Regression

We will often use the word kernel. For our purposes, the word kernel refers to any smooth function K such that K(x) ≥ 0 and

    ∫ K(x) dx = 1,   ∫ x K(x) dx = 0,   σK² ≡ ∫ x² K(x) dx > 0.      (111)

Some commonly used kernels are the following:

    the boxcar kernel:       K(x) = (1/2) I(x),
    the Gaussian kernel:     K(x) = (1/√(2π)) e^{−x²/2},
    the Epanechnikov kernel: K(x) = (3/4)(1 − x²) I(x),
    the tricube kernel:      K(x) = (70/81)(1 − |x|³)³ I(x),

where

    I(x) = 1 if |x| ≤ 1,   I(x) = 0 if |x| > 1.

These kernels are plotted in Figure 46.


Figure 46: Examples of kernels: boxcar (top left), Gaussian (top right), Epanechnikov (bottom left), and tricube (bottom right).


Let h > 0 be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi                        (112)

where K is a kernel and the weights ℓi(x) are given by

    ℓi(x) = K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h).   (113)

13.7 Remark. The local average estimator in Example 13.4 is a kernel estimator based on the boxcar kernel.
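Equations (112)-(113) translate directly into code. A Python sketch with a Gaussian kernel (`nadaraya_watson` is an assumed helper name; the notes use R's locfit/loess for this):

```python
import math

# Nadaraya-Watson estimator: a locally weighted average of the Y's,
# with weights l_i(x) = K((x - X_i)/h) / sum_j K((x - X_j)/h).
def nadaraya_watson(xs, ys, h):
    def K(z):
        return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    def rhat(x):
        w = [K((x - xi) / h) for xi in xs]
        return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    return rhat
```

Because the weights ℓi(x) sum to one, the estimate always lies between the minimum and maximum of the Yi's; in particular, if all Yi equal a constant, the estimate is exactly that constant.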

R-code. In R, I suggest using the loess command or the locfit library. (You need to download locfit.) For loess:

plot(x,y)
out = loess(y ~ x, span=.25, degree=0)
lines(x,fitted(out))

The span option is the bandwidth. To compute GCV, you will need the effective number of parameters. You get this by typing:

out$enp

The command for kernel regression in locfit is:

out = locfit(y ~ x, deg=0, alpha=c(0,h))

where h is the bandwidth you want to use. The alpha=c(0,h) part looks strange. There are two ways to specify the smoothing parameter. The first way is as a percentage of the data; for example, alpha=c(.25,0) makes the bandwidth big enough so that one quarter of the data falls in the kernel. To smooth with a specific value for the bandwidth (as we are doing) we use alpha=c(0,h). The meaning of deg=0 will be explained later. Now try

names(out)
print(out)
summary(out)
plot(out)
plot(x,fitted(out))
plot(x,residuals(out))
help(locfit)

To do cross-validation, create a vector of bandwidths h = (h1, ..., hk). alpha then needs to be a matrix.

h = c( ... put your values here ... )
k = length(h)
zero = rep(0,k)
H = cbind(zero,h)
out = gcvplot(y ~ x, deg=0, alpha=H)
plot(out$df,out$values)

13.8 Example (CMB data). Recall the CMB data from Figure 43. Figure 47 shows four different kernel regression fits (using just the first 400 data points) based on increasing bandwidths. The top two plots are based on small bandwidths and the fits are too rough. The bottom right plot is based on a large bandwidth and the fit is too smooth. The bottom left plot is just right. The bottom right plot also shows the presence of bias near the boundaries. As we shall see, this is a general feature of kernel regression. The bottom plot in Figure 48 shows a kernel fit to all the data points. The bandwidth was chosen by cross-validation.

The choice of kernel K is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. What matters much more is the choice of bandwidth h, which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. In general, we will let the bandwidth depend on the sample size, so we sometimes write hn. The following theorem shows how the bandwidth affects the estimator. To state these results we need to make some assumption about the behavior of x1, ..., xn as n increases. For the purposes of the theorem, we will assume that these are random draws from some density f.

13.9 Theorem. The risk (using integrated squared error loss) of the Nadaraya–Watson kernel estimator is

    R(hn) = (hn⁴/4) ( ∫ x² K(x) dx )² ∫ ( r″(x) + 2 r′(x) f′(x)/f(x) )² dx
            + (σ² ∫ K²(x) dx / (n hn)) ∫ dx/f(x) + o((n hn)⁻¹) + o(hn⁴)      (114)

as hn → 0 and n hn → ∞.

The first term in (114) is the squared bias and the second term is the variance. What is especially notable is the presence of the term

    2 r′(x) f′(x)/f(x)                                (115)

in the bias. We call (115) the design bias since it depends on the design, that is, the distribution of the Xi's. This means that the bias is sensitive to the position of the Xi's. Furthermore, it can be
200

400

200

400

200

400

200

400

Figure 47: Four kernel regressions for the CMB data using just the first 400 data points. The bandwidths used were h = 1 (top left), h = 10 (top right), h = 50 (bottom left), h = 200 (bottom right). As the bandwidth h increases, the estimated function goes from being too rough to too smooth.


shown that kernel estimators also have high bias near the boundaries. This is known as boundary bias. We will see that we can reduce these biases by using a refinement called local polynomial regression.

If we differentiate (114) and set the result equal to 0, we find that the optimal bandwidth is

    h* = n^{−1/5} ( σ² ∫ K²(x) dx ∫ dx/f(x)
                    / [ ( ∫ x² K(x) dx )² ∫ ( r″(x) + 2 r′(x) f′(x)/f(x) )² dx ] )^{1/5}.      (116)

Thus, h* = O(n^{−1/5}). Plugging h* back into (114), we see that the risk decreases at rate O(n^{−4/5}). In (most) parametric models, the risk of the maximum likelihood estimator decreases to 0 at rate 1/n. The slower rate n^{−4/5} is the price of using nonparametric methods. In practice, we cannot use the bandwidth given in (116) since h* depends on the unknown function r. Instead, we use leave-one-out cross-validation, as described in Theorem 13.6.

13.10 Example. Figure 48 shows the cross-validation score for the CMB example as a function of the effective degrees of freedom. The optimal smoothing parameter was chosen to minimize this score. The resulting fit is also shown in the figure. Note that the fit gets quite variable to the right.

13.3 Local Polynomials

Kernel estimators suffer from boundary bias and design bias. These problems can be alleviated by using a generalization of kernel regression called local polynomial regression.

To motivate this estimator, first consider choosing a constant estimator r̂n(x) ≡ a to minimize the sum of squares Σ_{i=1}^n (Yi − a)². The solution is the constant function r̂n(x) = Ȳ, which is obviously not a good estimator of r(x). Now define the weight function wi(x) = K((Xi − x)/h) and choose a ≡ r̂n(x) to minimize the weighted sum of squares

    Σ_{i=1}^n wi(x) (Yi − a)².                        (117)

From elementary calculus, we see that the solution is

    r̂n(x) = Σ_{i=1}^n wi(x) Yi / Σ_{i=1}^n wi(x)

which is exactly the kernel regression estimator. This gives us an interesting interpretation of the kernel estimator: it is a locally constant estimator, obtained from locally weighted least squares. This suggests that we might improve the estimator by using a local polynomial of degree p instead of a local constant.

Let x be some fixed value at which we want to estimate r(x). For values u in a neighborhood of x, define the polynomial

    Px(u; a) = a0 + a1 (u − x) + (a2/2!) (u − x)² + ... + (ap/p!) (u − x)^p.      (118)


Figure 48: Top: the cross-validation (CV) score as a function of the effective degrees of freedom. Bottom: the kernel fit using the bandwidth that minimizes the cross-validation score.


We estimate a = (a0 , . . . , ap )T by choosing a = (a0 , . . . , ap )T to minimize the locally weighted sums of squares
n i=1

We can approximate a smooth regression function r(u) in a neighborhood of the target value x by the polynomial: r(u) Px (u; a). (119)

wi (x) (Yi Px (Xi ; a))2 .

(120)

The estimator a depends on the target value x so we write a(x) = (a0 (x), . . . , ap (x))T if we want to make this dependence explicit. The local estimate of r is rn (u) = Px (u; a). In particular, at the target value u = x we have rn (x) = Px (x; a) = a0 (x). (121)

Warning! Although \hat{r}_n(x) only depends on \hat{a}_0(x), this is not equivalent to simply fitting a local constant. Setting p = 0 gives back the kernel estimator. The special case p = 1 is called local linear regression, and this is the version we recommend as a default choice. As we shall see, local polynomial estimators, and in particular local linear estimators, have some remarkable properties. To find \hat{a}(x), it is helpful to re-express the problem in vector notation. Let

X_x = \begin{pmatrix} 1 & X_1 - x & \cdots & \frac{(X_1 - x)^p}{p!} \\ 1 & X_2 - x & \cdots & \frac{(X_2 - x)^p}{p!} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \cdots & \frac{(X_n - x)^p}{p!} \end{pmatrix} \qquad (122)

and let W_x be the n \times n diagonal matrix whose (i, i) component is w_i(x). We can rewrite (120) as

(Y - X_x a)^T W_x (Y - X_x a). \qquad (123)

Minimizing (123) gives the weighted least squares estimator

\hat{a}(x) = (X_x^T W_x X_x)^{-1} X_x^T W_x Y. \qquad (124)

In particular, \hat{r}_n(x) = \hat{a}_0(x) is the inner product of the first row of (X_x^T W_x X_x)^{-1} X_x^T W_x with Y. Thus we have: the local polynomial regression estimate is

\hat{r}_n(x) = \sum_{i=1}^n \ell_i(x) Y_i \qquad (125)


where \ell(x)^T = (\ell_1(x), \ldots, \ell_n(x)),

\ell(x)^T = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x,

e_1 = (1, 0, \ldots, 0)^T, and X_x and W_x are defined in (122). Once again, our estimate is a linear smoother and we can choose the bandwidth by minimizing the cross-validation formula given in Theorem 13.6.

R-code. The R code is the same except we use degree 1 for local linear, degree 2 for local quadratic, etc. Thus, for local linear regression:

loess(y ~ x, degree = 1, span = h)
locfit(y ~ x, deg = 1, alpha = c(0, h))

13.11 Example (LIDAR). These data were introduced in Example 13.2. Figure 49 shows the 221 observations. The top left plot shows the data and the fitted function using local linear regression. The cross-validation curve (not shown) has a well-defined minimum at h ≈ 37, corresponding to 9 effective degrees of freedom. The fitted function uses this bandwidth. The top right plot shows the residuals. There is clear heteroscedasticity (nonconstant variance). The bottom left plot shows the estimate of \sigma(x) using the method described later. Next we compute 95 percent confidence bands (explained later). The resulting bands are shown in the lower right plot. As expected, there is much greater uncertainty for larger values of the covariate.

Local Linear Smoothing
13.12 Theorem. When p = 1,

\hat{r}_n(x) = \sum_{i=1}^n \ell_i(x) Y_i

where

\ell_i(x) = \frac{b_i(x)}{\sum_{j=1}^n b_j(x)}, \qquad (126)

b_i(x) = K\left( \frac{X_i - x}{h} \right) \left( S_{n,2}(x) - (X_i - x) S_{n,1}(x) \right)

and

S_{n,j}(x) = \sum_{i=1}^n K\left( \frac{X_i - x}{h} \right) (X_i - x)^j, \qquad j = 1, 2.
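Equation (124) is plain weighted least squares, so the estimator can be sketched in a few lines (a Python illustration, not the notes' R code; the Gaussian kernel and simulated data are assumptions):

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear fit at x0: solve the weighted least squares problem
    (124) with the p = 1 design matrix (122) and return a_hat_0 = r_hat(x0)."""
    Xx = np.column_stack([np.ones_like(X), X - x0])   # columns: 1, (X_i - x0)
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)            # diagonal of W_x
    WXx = w[:, None] * Xx
    a_hat = np.linalg.solve(Xx.T @ WXx, WXx.T @ Y)    # (Xx' Wx Xx)^{-1} Xx' Wx Y
    return a_hat[0]                                   # r_hat(x0), as in (121)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 300))
Y = X ** 2 + 0.05 * rng.normal(size=300)
estimate = local_linear(0.5, X, Y, 0.1)               # should be near 0.25
```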

13.13 Example. Figure 50 shows the local regression for the CMB data for p = 0 and p = 1. The bottom plots zoom in on the left boundary. Note that for p = 0 (the kernel estimator), the fit is poor near the boundaries due to boundary bias.

Figure 49: The LIDAR data from Example 13.11. Top left: data and the fitted function using local linear regression with h ≈ 37 (chosen by cross-validation). Top right: the residuals. Bottom left: estimate of \sigma(x). Bottom right: 95 percent confidence bands.

Figure 50: Locally weighted regressions using local polynomials of order p = 0 (top left) and p = 1 (top right). The bottom plots show the left boundary in more detail (p = 0 bottom left and p = 1 bottom right). Notice that the boundary bias is reduced by using local linear estimation (p = 1).

Figure 51: The Doppler function estimated by local linear regression. The function (top left), the data (top right), the cross-validation score versus effective degrees of freedom (bottom left), and the fitted function (bottom right).

13.14 Example (Doppler function). Let

r(x) = \sqrt{x(1 - x)} \, \sin\left( \frac{2.1 \pi}{x + 0.05} \right), \qquad 0 \le x \le 1, \qquad (127)

which is called the Doppler function. This function is difficult to estimate and provides a good test case for nonparametric regression methods. The function is spatially inhomogeneous, which means that its smoothness (second derivative) varies over x. The function is plotted in the top left plot of Figure 51. The top right plot shows 1000 data points simulated from Y_i = r(i/n) + \sigma \epsilon_i with \sigma = 0.1 and \epsilon_i \sim N(0, 1). The bottom left plot shows the cross-validation score versus the effective degrees of freedom using local linear regression. The minimum occurred at 166 degrees of freedom, corresponding to a bandwidth of 0.005. The fitted function is shown in the bottom right plot. The fit has high effective degrees of freedom and hence the fitted function is very wiggly. This is because the estimate is trying to fit the rapid fluctuations of the function near x = 0. If we used more smoothing, the right-hand side of the fit would look better, at the cost of missing the structure near x = 0. This is always a problem when estimating spatially inhomogeneous functions. We'll discuss that further later.

The following theorem gives the large sample behavior of the risk of the local linear estimator

and shows why local linear regression is better than kernel regression.

13.15 Theorem. Let Y_i = r(X_i) + \sigma(X_i) \epsilon_i for i = 1, \ldots, n and a \le X_i \le b. Assume that X_1, \ldots, X_n are a sample from a distribution with density f and that (i) f(x) > 0, (ii) f, r'' and \sigma^2 are continuous in a neighborhood of x, and (iii) h_n \to 0 and n h_n \to \infty. Let x \in (a, b). Given X_1, \ldots, X_n, we have the following: the local linear estimator and the kernel estimator both have variance

\frac{\sigma^2(x)}{f(x) n h_n} \int K^2(u) \, du + o_P\left( \frac{1}{n h_n} \right). \qquad (128)

The Nadaraya–Watson kernel estimator has bias

h_n^2 \left( \frac{1}{2} r''(x) + \frac{r'(x) f'(x)}{f(x)} \right) \int u^2 K(u) \, du + o_P(h_n^2) \qquad (129)

whereas the local linear estimator has asymptotic bias

\frac{1}{2} h_n^2 \, r''(x) \int u^2 K(u) \, du + o_P(h_n^2). \qquad (130)

Thus, the local linear estimator is free from design bias. At the boundary points a and b, the Nadaraya–Watson kernel estimator has asymptotic bias of order h_n while the local linear estimator has bias of order h_n^2. In this sense, local linear estimation eliminates boundary bias.

13.16 Remark. The above result holds more generally for local polynomials of order p. Generally, taking p odd reduces design bias and boundary bias without increasing variance.

An alternative to locfit is loess:

out = loess(y ~ x, span = .1, degree = 1)
plot(x, fitted(out))
out$trace.hat   ### effective degrees of freedom

13.4 Penalized Regression, Regularization and Splines


Before introducing splines, consider polynomial regression:

Y = \sum_{j=0}^p \beta_j x^j + \epsilon, \qquad \text{or} \qquad r(x) = \sum_{j=0}^p \beta_j x^j.

In other words, we have design matrix

X = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^p \\ 1 & x_2 & x_2^2 & \cdots & x_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^p \end{pmatrix}.

Least squares minimizes (Y - X\beta)^T (Y - X\beta), which implies \hat{\beta} = (X^T X)^{-1} X^T Y. If we introduce a ridge regression penalty and aim to minimize (Y - X\beta)^T (Y - X\beta) + \lambda \beta^T \beta, then \hat{\beta} = (X^T X + \lambda I)^{-1} X^T Y. Spline regression follows a similar pattern, except that we replace X with the B-spline basis matrix B (see below).

Consider once again the regression model

Y_i = r(X_i) + \epsilon_i

and suppose we estimate r by choosing \hat{r}_n(x) to minimize the sum of squares

\sum_{i=1}^n (Y_i - \hat{r}_n(X_i))^2

over a class of functions. Minimizing over all linear functions (i.e., functions of the form \beta_0 + \beta_1 x) yields the least squares estimator. Minimizing over all functions yields a function that interpolates the data. In the previous section we avoided these two extreme solutions by replacing the sum of squares with a locally weighted sum of squares. An alternative way to get solutions between these extremes is to minimize the penalized sum of squares

M(\lambda) = \sum_i (Y_i - \hat{r}_n(X_i))^2 + \lambda J(r) \qquad (131)

where

J(r) = \int (r''(x))^2 \, dx \qquad (132)

is a roughness penalty. This penalty leads to a solution that favors smoother functions. Adding a penalty term to the criterion we are optimizing is sometimes called regularization. The parameter \lambda controls the trade-off between fit (the first term of (131)) and the penalty. Let \hat{r}_n denote the function that minimizes M(\lambda). When \lambda = 0, the solution is the interpolating function. When \lambda \to \infty, \hat{r}_n converges to the least squares line. The parameter \lambda controls the amount of smoothing. What does \hat{r}_n look like for 0 < \lambda < \infty? To answer this question, we need to define splines.

A spline is a special piecewise polynomial. The most commonly used splines are piecewise cubic splines. Let \xi_1 < \xi_2 < \cdots < \xi_k be a set of ordered points, called knots, contained in some

Figure 52: Cubic B-spline basis using nine equally spaced knots on (0, 1).

interval (a, b). A cubic spline is a continuous function r such that (i) r is a cubic polynomial over (\xi_1, \xi_2), \ldots, and (ii) r has continuous first and second derivatives at the knots. A spline that is linear beyond the boundary knots is called a natural spline. Cubic splines are the most common splines used in practice. They arise naturally in the penalized regression framework, as the following theorem shows.

13.17 Theorem. The function \hat{r}_n(x) that minimizes M(\lambda) with penalty (132) is a natural cubic spline with knots at the data points. The estimator \hat{r}_n is called a smoothing spline.

In other words, for a fitted vector \hat{Y}, the penalty term is minimized by a cubic spline that goes through the points \hat{Y}. The theorem above does not give an explicit form for \hat{r}_n. To do so, we will construct a basis for the set of splines. The most commonly used basis for splines is the cubic B-spline. Rather than write out a bunch of polynomials to show their form, I suggest that you explore B-spline bases in R. Figure 52 shows the cubic B-spline basis using nine equally spaced knots on (0, 1). B-spline basis functions have compact support, which makes it possible to speed up calculations. Without a penalty, using the B-spline basis one can interpolate the data so as to provide a perfect fit. Alternatively, with a penalty, one can produce a nice smooth curve that is useful for prediction. We are now in a position to describe the spline estimator in more detail. According to Theorem 13.17, \hat{r}_n(x) is a natural cubic spline. Hence, we can write
\hat{r}_n(x) = \sum_{j=1}^N \hat{\beta}_j B_j(x) \qquad (133)

To find our regression estimator we only need to find the coefficients \hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_N)^T. By expanding r in the basis and calculating the second derivatives, we can rewrite the minimization as follows:

minimize: \quad (Y - B\beta)^T (Y - B\beta) + \lambda \beta^T \Omega \beta \qquad (134)

where B_{ij} = B_j(X_i) and \Omega_{jk} = \int B_j''(x) B_k''(x) \, dx. Following the pattern we saw for ridge regression, the value of \beta that minimizes (134) is

\hat{\beta} = (B^T B + \lambda \Omega)^{-1} B^T Y.

Splines are another example of linear smoothers: L = B (B^T B + \lambda \Omega)^{-1} B^T, so \hat{Y} = LY. If we had done ordinary linear regression of Y with basis B, the hat matrix would be L = B (B^T B)^{-1} B^T and the fitted values would interpolate the observed data. The effect of the \lambda \Omega term in the penalty is to shrink the regression coefficients towards a subspace, which results in a smoother fit. As before, we define the effective degrees of freedom by \nu = tr(L) and we choose the smoothing parameter \lambda by minimizing either the cross-validation score (108) or the generalized cross-validation score (109). In R:
out = smooth.spline(x, y, df = 10, cv = TRUE)  ### df is the effective degrees of freedom
plot(x, y)
lines(x, out$y)  ### NOTE: the fitted values are in out$y NOT out$fit!!
out$cv           ### print the cross-validation score

where B_j(x), j = 1, \ldots, N, are the basis vectors for the B-spline, with N = n + 4. (Note that the basis is determined by the observed values of the x_j.) We follow the pattern of polynomial regression, but replace X with B, where

B = \begin{pmatrix} B_1(x_1) & B_2(x_1) & \cdots & B_N(x_1) \\ B_1(x_2) & B_2(x_2) & \cdots & B_N(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ B_1(x_n) & B_2(x_n) & \cdots & B_N(x_n) \end{pmatrix}. \qquad (135)

You need to do a loop to try many values of df and then use cross-validation to choose df. df must be between 2 and n. For example:

cv = rep(0, 50)
df = seq(2, n, length = 50)
for(i in 1:50){ cv[i] = smooth.spline(x, y, df = df[i], cv = TRUE)$cv }
plot(df, cv, type = "l")
df[cv == min(cv)]
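The penalized criterion (134) itself can be sketched in a few lines. Instead of the B-spline basis, the sketch below (Python) uses the cosine basis of Section 13.5, for which the roughness matrix \Omega is diagonal; the basis choice, \lambda, and simulated data are illustrative assumptions, not the smooth.spline implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, J, lam = 500, 30, 1e-4
X = np.sort(rng.uniform(0, 1, n))
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=n)

# Basis phi_j(x) = sqrt(2) cos(j pi x); its roughness penalty matrix
# Omega_jk = int phi_j'' phi_k'' dx is diagonal with entries (j pi)^4.
j = np.arange(1, J + 1)
U = np.sqrt(2) * np.cos(np.pi * np.outer(X, j))
Omega = np.diag((np.pi * j) ** 4)

A = U.T @ U + lam * Omega
beta_hat = np.linalg.solve(A, U.T @ Y)     # (B'B + lam*Omega)^{-1} B'Y, as in (134)
L = U @ np.linalg.solve(A, U.T)            # smoothing matrix: Y_hat = L Y
df = np.trace(L)                           # effective degrees of freedom
fitted = U @ beta_hat
```

Increasing lam drives df down toward a very smooth fit; lam = 0 recovers ordinary least squares on the basis.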

Figure 53: Smoothing spline for the CMB data. The smoothing parameter was chosen by cross-validation.

13.18 Example. Figure 53 shows the smoothing spline with cross-validation for the CMB data. The effective number of degrees of freedom is 8.8. The fit is smoother than the local regression estimator. This is certainly visually more appealing, but the difference between the two fits is small compared to the width of the confidence bands that we will compute later.

Spline estimates \hat{r}_n(x) are approximately kernel estimates in the sense that

\ell_i(x) \approx \frac{1}{f(X_i) h(X_i)} K\left( \frac{X_i - x}{h(X_i)} \right)

where f(x) is the density of the covariate (treated here as random),

h(x) = \left( \frac{\lambda}{n f(x)} \right)^{1/4}

and

K(t) = \frac{1}{2} \exp\left( -\frac{|t|}{\sqrt{2}} \right) \sin\left( \frac{|t|}{\sqrt{2}} + \frac{\pi}{4} \right).

Another nonparametric method that uses splines is called the regression spline method. Rather than placing a knot at each data point, we instead use fewer knots. We then do ordinary linear regression on the basis matrix B with no regularization. The fitted values for this estimator are \hat{Y} = LY with L = B (B^T B)^{-1} B^T. The difference between this estimate and smoothing splines is that the basis matrix B is based on fewer knots and there is no shrinkage factor \lambda. The amount of smoothing is instead controlled by the choice of the number (and placement) of the knots. By using fewer knots, one can save computation time.

13.5 Smoothing Using Orthogonal Functions


Let L_2(a, b) denote all functions defined on the interval [a, b] such that \int_a^b f(x)^2 \, dx < \infty:

L_2(a, b) = \left\{ f : [a, b] \to \mathbb{R}, \ \int_a^b f(x)^2 \, dx < \infty \right\}. \qquad (136)

We sometimes write L_2 instead of L_2(a, b). The inner product between two functions f, g \in L_2 is defined by \int f(x) g(x) \, dx. The norm of f is

||f|| = \sqrt{ \int f(x)^2 \, dx }. \qquad (137)

Two functions are orthogonal if \int f(x) g(x) \, dx = 0. A sequence of functions \phi_1, \phi_2, \phi_3, \phi_4, \ldots is orthonormal if \int \phi_j^2(x) \, dx = 1 for each j and \int \phi_i(x) \phi_j(x) \, dx = 0 for i \neq j. An orthonormal sequence is complete if the only function that is orthogonal to each \phi_j is the zero function. A complete orthonormal set is called an orthonormal basis.

Any f \in L_2 can be written as

f(x) = \sum_{j=1}^\infty \beta_j \phi_j(x), \qquad \text{where} \quad \beta_j = \int_a^b f(x) \phi_j(x) \, dx. \qquad (138)

Also, we have Parseval's relation:

||f||^2 \equiv \int f^2(x) \, dx = \sum_{j=1}^\infty \beta_j^2 \equiv ||\beta||^2 \qquad (139)

where \beta = (\beta_1, \beta_2, \ldots).

Note: The equality in (138) means that \int (f(x) - f_n(x))^2 \, dx \to 0, where f_n(x) = \sum_{j=1}^n \beta_j \phi_j(x).

13.19 Example. An example of an orthonormal basis for L_2(0, 1) is the cosine basis, defined as follows. Let \phi_0(x) = 1 and for j \ge 1 define

\phi_j(x) = \sqrt{2} \cos(j \pi x). \qquad (140)

Figure 54: Approximating the Doppler function with its expansion in the cosine basis. The function f (top left) and its approximation f_J(x) = \sum_{j=1}^J \beta_j \phi_j(x) with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). The coefficients \beta_j = \int_0^1 f(x) \phi_j(x) \, dx were computed numerically.

13.20 Example. Let

f(x) = \sqrt{x(1 - x)} \, \sin\left( \frac{2.1 \pi}{x + 0.05} \right),

which is called the Doppler function. Figure 54 shows f (top left) and its approximation

f_J(x) = \sum_{j=1}^J \beta_j \phi_j(x)

with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). As J increases, we see that f_J(x) gets closer to f(x). The coefficients \beta_j = \int_0^1 f(x) \phi_j(x) \, dx were computed numerically.

13.21 Example. The Legendre polynomials on [-1, 1] are defined by

P_j(x) = \frac{1}{2^j j!} \frac{d^j}{dx^j} (x^2 - 1)^j, \qquad j = 0, 1, 2, \ldots \qquad (141)

It can be shown that these functions are complete and orthogonal and that

\int_{-1}^{1} P_j^2(x) \, dx = \frac{2}{2j + 1}. \qquad (142)

It follows that the functions \phi_j(x) = \sqrt{(2j + 1)/2} \, P_j(x), j = 0, 1, \ldots, form an orthonormal basis for L_2(-1, 1). The first few Legendre polynomials are:

P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = \frac{3x^2 - 1}{2}, \quad P_3(x) = \frac{5x^3 - 3x}{2}, \ldots

These polynomials may be constructed explicitly using the following recursive relation:

P_{j+1}(x) = \frac{(2j + 1) x P_j(x) - j P_{j-1}(x)}{j + 1}. \qquad (143)
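The recursion (143) is easy to check against the closed forms above (a short Python verification):

```python
import numpy as np

x = np.linspace(-1, 1, 101)
P = [np.ones_like(x), x.copy()]            # P_0 and P_1
for j in range(1, 4):                      # build P_2, P_3, P_4 via (143)
    P.append(((2 * j + 1) * x * P[j] - j * P[j - 1]) / (j + 1))

assert np.allclose(P[2], (3 * x ** 2 - 1) / 2)
assert np.allclose(P[3], (5 * x ** 3 - 3 * x) / 2)
```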

The coefficients \beta_1, \beta_2, \ldots are related to the smoothness of the function f. To see why, note that if f is smooth, then its derivatives will be finite. Thus we expect that, for some k, \int_0^1 (f^{(k)}(x))^2 \, dx < \infty, where f^{(k)} is the k-th derivative of f. Now consider the cosine basis (140) and let f(x) = \sum_{j=0}^\infty \beta_j \phi_j(x). Then,

\int_0^1 (f^{(k)}(x))^2 \, dx = \sum_{j=1}^\infty \beta_j^2 (\pi j)^{2k}.

The only way that \sum_{j=1}^\infty \beta_j^2 (\pi j)^{2k} can be finite is if the \beta_j get small when j gets large. To summarize:

If the function f is smooth, then the coefficients \beta_j will be small when j is large.

Return to the regression model Y_i = r(X_i) + \epsilon_i, i = 1, \ldots, n. Now we write

r(x) = \sum_{j=1}^\infty \beta_j \phi_j(x). \qquad (144)

We will approximate r by

r_J(x) = \sum_{j=1}^J \beta_j \phi_j(x).

The number of terms J will be our smoothing parameter. Our estimate is

\hat{r}(x) = \sum_{j=1}^J \hat{\beta}_j \phi_j(x).

To find \hat{r}_n, let U denote the matrix whose columns are the basis functions evaluated at the data:

U = \begin{pmatrix} \phi_1(X_1) & \phi_2(X_1) & \cdots & \phi_J(X_1) \\ \phi_1(X_2) & \phi_2(X_2) & \cdots & \phi_J(X_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(X_n) & \phi_2(X_n) & \cdots & \phi_J(X_n) \end{pmatrix}.

Then \hat{\beta} = (U^T U)^{-1} U^T Y and \hat{Y} = SY, where S = U (U^T U)^{-1} U^T is the hat matrix. The matrix S projects into the space spanned by the first J basis functions. We can choose J by cross-validation. Note that trace(S) = J, so the GCV score takes the following simple form:

GCV(J) = \frac{RSS}{n} \, \frac{1}{(1 - J/n)^2}.

Figure 55: Data from the Doppler test function and the estimated function. See Example 13.22.

13.22 Example. Figure 55 shows the Doppler function f and n = 2048 observations generated from the model

Y_i = r(X_i) + \epsilon_i

where X_i = i/n and \epsilon_i \sim N(0, (.1)^2). The figure shows the data and the estimated function. The estimate was based on J = 234 terms.

Here is another example: the fit is in Figure 56 and the smoothing matrix is in Figure 57. Notice that the rows of the smoothing matrix look like kernels. In fact, smoothing with a series is approximately the same as kernel regression with the kernel K(x, y) = \sum_{j=1}^J \phi_j(x) \phi_j(y).

Cosine basis smoothers have boundary bias. This can be fixed by adding the functions t and t^2 to the basis. In other words, use the design matrix

U = \begin{pmatrix} 1 & X_1 & X_1^2 & \phi_2(X_1) & \cdots & \phi_J(X_1) \\ 1 & X_2 & X_2^2 & \phi_2(X_2) & \cdots & \phi_J(X_2) \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_n & X_n^2 & \phi_2(X_n) & \cdots & \phi_J(X_n) \end{pmatrix}.

This is called the polynomial-cosine basis.
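The whole series estimator, with J chosen by the GCV formula above, fits in a few lines (a Python sketch on simulated Doppler data; the search range for J is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 512
X = np.arange(1, n + 1) / n
r = lambda x: np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))  # Doppler
Y = r(X) + 0.1 * rng.normal(size=n)

def series_fit(J):
    """Least squares on the first J cosine basis functions: U beta_hat."""
    U = np.sqrt(2) * np.cos(np.pi * np.outer(X, np.arange(1, J + 1)))
    beta_hat = np.linalg.lstsq(U, Y, rcond=None)[0]   # (U'U)^{-1} U'Y
    return U @ beta_hat

def gcv(J):
    """GCV(J) = (RSS/n) / (1 - J/n)^2, using trace(S) = J."""
    rss = np.sum((Y - series_fit(J)) ** 2)
    return (rss / n) / (1 - J / n) ** 2

J_best = min(range(2, 200), key=gcv)
```

As in the text, the spatially inhomogeneous Doppler function pushes GCV toward a fairly large J.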

Figure 56: Cosine Regression

Figure 57: Rows of the smoothing matrix for the cosine regression.


13.6 Summary

A linear smoother has the form \hat{r}_n(x) = \sum_i \ell_i(x) Y_i. The fitted values are \hat{Y} = LY, and the effective degrees of freedom equal trace(L).

For a kernel smoother,

\ell_i(x) = \frac{K((x - X_i)/h)}{\sum_i K((x - X_i)/h)}.

The kernel smoother is a weighted average of the Y_i's in the neighborhood of x. A local polynomial is a slight variation of the kernel smoother that fits a weighted polynomial rather than a weighted average of the Y_i's in the neighborhood of x. The key choice is the smoothing parameter h.

A smoothing spline is like linear regression with the X_{ij}'s replaced by the basis functions B_j(X_i):

\hat{r}_n(x) = \sum_j \hat{\beta}_j B_j(x).

The basis functions are obtained using B-splines determined by the observed X_i's. The key choice is the smoothing parameter \lambda, which penalizes the function for excess curvature. \hat{\beta} is obtained via least squares (with the penalty).

An alternative to smoothing splines is regression splines with fewer basis functions. In this setting there is no penalty function. The key choice is the placement and number of B-splines.

Orthogonal function smoothing. The estimator is constructed using an orthonormal basis:

\hat{r}_n(x) = \sum_{j=1}^J \hat{\beta}_j \phi_j(x).

The key choice is J, the number of terms in the expansion. \hat{\beta} is obtained via least squares.


13.7 Variance Estimation

Next we consider several methods for estimating \sigma^2. For linear smoothers, there is a simple, nearly unbiased estimate of \sigma^2.

13.23 Theorem. Let \hat{r}_n(x) be a linear smoother. Let

\hat{\sigma}^2 = \frac{\sum_{i=1}^n (Y_i - \hat{r}(X_i))^2}{n - 2\nu + \tilde{\nu}} \qquad (145)

where \nu = tr(L) and \tilde{\nu} = tr(L^T L) = \sum_{i=1}^n ||\ell(X_i)||^2. If r is sufficiently smooth, \nu = o(n) and \tilde{\nu} = o(n), then \hat{\sigma}^2 is a consistent estimator of \sigma^2.

We will now outline the proof of this result. Recall that if Y is a random vector and Q is a symmetric matrix, then Y^T Q Y is called a quadratic form and it is well known that

E(Y^T Q Y) = tr(Q V) + \mu^T Q \mu

where V = \mathbb{V}(Y) is the covariance matrix of Y and \mu = E(Y) is the mean vector. Now, Y - \hat{Y} = Y - LY = (I - L)Y and so

\hat{\sigma}^2 = \frac{Y^T \Lambda Y}{tr(\Lambda)} \qquad (146)

where \Lambda = (I - L)^T (I - L). Hence,

E(\hat{\sigma}^2) = \frac{E(Y^T \Lambda Y)}{tr(\Lambda)} = \sigma^2 + \frac{r^T \Lambda r}{n - 2\nu + \tilde{\nu}}. \qquad (147)

Assuming that and do not grow too quickly, and that r is smooth, the last term is small for large n and hence E( 2 ) 2 . Similarly, one can show that V( 2 ) 0. Here is another estimator. Suppose that the Xi s are ordered. Dene 2 = 1 2(n 1)
n1

i=1

(Yi+1 Yi )2 .

(148)

The motivation for this estimator is as follows. Assuming r(x) is smooth, we have r(x_{i+1}) - r(x_i) \approx 0 and hence

Y_{i+1} - Y_i = [r(x_{i+1}) + \epsilon_{i+1}] - [r(x_i) + \epsilon_i] \approx \epsilon_{i+1} - \epsilon_i

and hence

(Y_{i+1} - Y_i)^2 \approx \epsilon_{i+1}^2 + \epsilon_i^2 - 2 \epsilon_{i+1} \epsilon_i.

Therefore,

E(Y_{i+1} - Y_i)^2 \approx E(\epsilon_{i+1}^2) + E(\epsilon_i^2) - 2 E(\epsilon_{i+1}) E(\epsilon_i) = 2 \sigma^2. \qquad (149)

Thus, E(\hat{\sigma}^2) \approx \sigma^2. A variation of this estimator is

\hat{\sigma}^2 = \frac{1}{n - 2} \sum_{i=2}^{n-1} c_i^2 \delta_i^2 \qquad (150)

where

\delta_i = a_i Y_{i-1} + b_i Y_{i+1} - Y_i, \quad a_i = \frac{x_{i+1} - x_i}{x_{i+1} - x_{i-1}}, \quad b_i = \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}}, \quad c_i^2 = (a_i^2 + b_i^2 + 1)^{-1}.
The intuition of this estimator is that it is the average of the residuals that result from fitting a line to the first and third point of each consecutive triple of design points.

13.24 Example. The variance looks roughly constant for the first 400 observations of the CMB data. Using a local linear fit, we applied the two variance estimators. Equation (145) yields \hat{\sigma}^2 = 408.29 while equation (148) yields \hat{\sigma}^2 = 394.55.

So far we have assumed homoscedasticity, meaning that \sigma^2 = \mathbb{V}(\epsilon_i) does not vary with x. In the CMB example this is blatantly false: clearly \sigma^2 increases with x, so the data are heteroscedastic. The function estimate \hat{r}_n(x) is relatively insensitive to heteroscedasticity. However, when it comes to making confidence bands for r(x), we must take into account the nonconstant variance. We will take the following approach. Suppose that

Y_i = r(X_i) + \sigma(X_i) \epsilon_i. \qquad (151)

Let Z_i = \log(Y_i - r(X_i))^2 and \delta_i = \log \epsilon_i^2. Then,

Z_i = \log(\sigma^2(X_i)) + \delta_i. \qquad (152)

This suggests estimating \log \sigma^2(x) by regressing the log squared residuals on x. We proceed as follows.

Variance Function Estimation

1. Estimate r(x) with any nonparametric method to get an estimate \hat{r}_n(x).

2. Define Z_i = \log(Y_i - \hat{r}_n(X_i))^2.

3. Regress the Z_i's on the X_i's (again using any nonparametric method) to get an estimate \hat{q}(x) of \log \sigma^2(x), and let

\hat{\sigma}^2(x) = e^{\hat{q}(x)}. \qquad (153)

Figure 58: The dots are the log squared residuals. The solid line shows the log of the estimated variance \hat{\sigma}^2(x) as a function of x. The dotted line shows the log of the true \sigma^2(x), which is known (to reasonable accuracy) through prior knowledge.

13.25 Example. The solid line in Figure 58 shows the log of \hat{\sigma}^2(x) for the CMB example. I used local linear estimation and I used cross-validation to choose the bandwidth. The estimated optimal bandwidth for \hat{r}_n was h = 42 while the estimated optimal bandwidth for the log variance was h = 160. In this example, there turns out to be an independent estimate of \sigma(x). Specifically, because the physics of the measurement process is well understood, physicists can compute a reasonably accurate approximation to \sigma^2(x). The log of this function is the dotted line on the plot.

A drawback of this approach is that the log of a very small residual will be a large outlier. An alternative is to directly smooth the squared residuals on x.


13.8 Confidence Bands

In this section we will construct confidence bands for r(x). Typically these bands are of the form

\hat{r}_n(x) \pm c \, \hat{se}(x) \qquad (154)

where \hat{se}(x) is an estimate of the standard deviation of \hat{r}_n(x) and c > 0 is some constant. Before we proceed, we discuss a pernicious problem that arises whenever we do smoothing, namely, the bias problem.

THE BIAS PROBLEM. Confidence bands like those in (154) are not really confidence bands for r(x); rather, they are confidence bands for \bar{r}_n(x) = E(\hat{r}_n(x)), which you can think of as a smoothed version of r(x). Getting a confidence set for the true function r(x) is complicated, for reasons we now explain. Denote the mean and standard deviation of \hat{r}_n(x) by \bar{r}_n(x) and s_n(x). Then,

\frac{\hat{r}_n(x) - r(x)}{s_n(x)} = \frac{\hat{r}_n(x) - \bar{r}_n(x)}{s_n(x)} + \frac{\bar{r}_n(x) - r(x)}{s_n(x)} = Z_n(x) + \frac{\mathrm{bias}(\hat{r}_n(x))}{\sqrt{\mathrm{variance}(\hat{r}_n(x))}}

where Z_n(x) = (\hat{r}_n(x) - \bar{r}_n(x))/s_n(x). Typically, the first term Z_n(x) converges to a standard Normal, from which one derives confidence bands. The second term is the bias divided by the standard deviation. In parametric inference, the bias is usually smaller than the standard deviation of the estimator, so this term goes to zero as the sample size increases. In nonparametric inference, we have seen that optimal smoothing corresponds to balancing the bias and the standard deviation, so the second term does not vanish even with large sample sizes. The presence of this second, nonvanishing term introduces a bias into the Normal limit. The result is that the confidence interval will not be centered around the true function r due to the smoothing bias \bar{r}_n(x) - r(x).

There are several things we can do about this problem. The first is: live with it. In other words, just accept the fact that the confidence band is for \bar{r}_n, not r. There is nothing wrong with this as long as we are careful when we report the results to make it clear that the inferences are for \bar{r}_n, not r. A second approach is to estimate the bias function \bar{r}_n(x) - r(x). This is difficult to do. Indeed, the leading term of the bias is r''(x), and estimating the second derivative of r is much harder than estimating r. This requires introducing extra smoothness conditions which then bring into question the original estimator that did not use this extra smoothness. This has a certain unpleasant circularity to it.

13.26 Example.
To understand the implications of estimating \bar{r}_n instead of r, consider the following example. Let

r(x) = \phi(x; 2, 1) + \phi(x; 4, 0.5) + \phi(x; 6, 0.1) + \phi(x; 8, 0.05)

Figure 59: The true function (top left), an estimate \hat{r}_n (top right) based on 100 observations, the function \bar{r}_n(x) = E(\hat{r}_n(x)) (bottom left), and the difference r(x) - \bar{r}_n(x) (bottom right).

where \phi(x; m, s) denotes a Normal density function with mean m and variance s^2. Figure 59 shows the true function (top left), a locally linear estimate \hat{r}_n (top right) based on 100 observations Y_i = r(i/10) + 0.2 N(0, 1), i = 1, \ldots, 100, with bandwidth h = 0.27, the function \bar{r}_n(x) = E(\hat{r}_n(x)) (bottom left), and the difference r(x) - \bar{r}_n(x) (bottom right). We see that \bar{r}_n (dashed line) smooths out the peaks. Comparing the top right and bottom left plots, it is clear that \hat{r}_n(x) is actually estimating \bar{r}_n(x), not r(x). Overall, \bar{r}_n is quite similar to r(x) except that \bar{r}_n omits some of the fine details of r.

CONSTRUCTING CONFIDENCE BANDS. Assume that \hat{r}_n(x) is a linear smoother, so that \hat{r}_n(x) = \sum_{i=1}^n \ell_i(x) Y_i. Then,

\bar{r}(x) = E(\hat{r}_n(x)) = \sum_{i=1}^n \ell_i(x) r(X_i).

Also, because we condition on the x_i's,

\mathbb{V}(\hat{r}_n(x)) = \mathbb{V}\left( \sum_{i=1}^n \ell_i(x) Y_i \right) = \sum_{i=1}^n \ell_i^2(x) \mathbb{V}(Y_i | X_i) = \sum_{i=1}^n \ell_i^2(x) \sigma^2(X_i).

When \sigma^2(x) = \sigma^2 this simplifies to \mathbb{V}(\hat{r}_n(x)) = \sigma^2 ||\ell(x)||^2. Notice that the variance depends on x and on the sampling of the data. For x near many X_i's, the contribution from each of the neighboring \ell_i(X_i) will be small, so the variance is small. If x is far from most X_i's, then a small number of terms will contribute, each with a bigger weight, and consequently the variance will be bigger. We will consider a confidence band for \bar{r}_n(x) of the form

I(x) = \left( \hat{r}_n(x) - c \, s(x), \ \hat{r}_n(x) + c \, s(x) \right) \qquad (155)

for some c > 0, where

s(x) = \sqrt{ \sum_{i=1}^n \hat{\sigma}^2(X_i) \, \ell_i^2(x) }.

At one fixed value of x we can just take \hat{r}_n(x) \pm z_{\alpha/2} s(x). If we want a band over an interval a \le x \le b, we need a constant c larger than z_{\alpha/2} to account for the fact that we are trying to get coverage at many points. To guarantee coverage at all the X_i's we can use the Bonferroni correction and take \hat{r}_n(x) \pm z_{\alpha/(2n)} s(x). There is a more refined approach, which is used in locfit.

R Code. In locfit you can get confidence bands as follows.
out = locfit(y ~ x, alpha = c(0, h))   ### fit the regression
crit(out) = kappa0(out, cov = .95)     ### make locfit find kappa0 and c
plot(out, band = "local")              ### plots the fit and the bands

To actually extract the bands, proceed as follows:

tmp     = preplot.locfit(out, band = "local", where = "data")
r.hat   = tmp$fit
critval = tmp$critval$crit.val
se      = tmp$se.fit
upper   = r.hat + critval * se
lower   = r.hat - critval * se
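The same computation can be written out directly for a kernel smoother (a Python sketch; c = 1.96 gives only a pointwise band, a simultaneous band needs a larger c, and per the bias discussion this is a band for \bar{r}_n, not r):

```python
import numpy as np

rng = np.random.default_rng(6)
n, h, sigma_hat = 400, 0.05, 0.1
X = np.sort(rng.uniform(0, 1, n))
Y = np.sin(2 * np.pi * X) + sigma_hat * rng.normal(size=n)

def ell(x0):
    """Weight vector of the kernel smoother at x0: l_i = K_i / sum_j K_j."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return w / w.sum()

grid = np.linspace(0.05, 0.95, 50)
r_hat = np.array([ell(x) @ Y for x in grid])
# s(x) = sqrt(sum_i sigma_hat^2 l_i(x)^2), the standard error of r_hat(x)
s = sigma_hat * np.sqrt(np.array([np.sum(ell(x) ** 2) for x in grid]))
c = 1.96
lower, upper = r_hat - c * s, r_hat + c * s
```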

Figure 60: Local linear fit with simultaneous 95 percent confidence bands. The band in the top plot assumes constant variance \sigma^2. The band in the bottom plot allows for nonconstant variance \sigma^2(x).

Now suppose that \sigma(x) is a function of x. Then we use \hat{r}_n(x) \pm c \, s(x).

13.27 Example. Figure 60 shows simultaneous 95 percent confidence bands for the CMB data using a local linear fit. The bandwidth was chosen using cross-validation. We find that \kappa_0 = 38.85 and c = 3.33. In the top plot, we assumed a constant variance when constructing the band. In the bottom plot, we did not assume a constant variance when constructing the band. We see that if we do not take into account the nonconstant variance, we overestimate the uncertainty for small x and we underestimate the uncertainty for large x.

It seems like a good time to summarize the steps needed to construct the estimate \hat{r}_n and a confidence band.

Summary of Linear Smoothing

1. Choose a smoothing method such as local polynomial, spline, etc. This amounts to choosing the form of the weights \ell(x) = (\ell_1(x), \ldots, \ell_n(x))^T. A good default choice is local linear smoothing as described in Theorem 13.12.

2. Choose the bandwidth h by cross-validation using (108).

3. Estimate the variance function \sigma^2(x) as described in Section 13.7.

4. An approximate 1 - \alpha confidence band for \bar{r}_n = E(\hat{r}_n(x)) is

\hat{r}_n(x) \pm c \, s(x). \qquad (156)

13.28 Example (LIDAR). Recall the LIDAR data from Example 13.2 and Example 13.11. We find that \kappa_0 \approx 30 and c \approx 3.25. The resulting bands are shown in the lower right plot of Figure 49. As expected, there is much greater uncertainty for larger values of the covariate.

13.9 Local Likelihood and Exponential Families

If Y is not real valued or is not Gaussian, then the basic regression model we have been using might not be appropriate. For example, if Y \in \{0, 1\}, then it seems natural to use a Bernoulli model. In this section we discuss nonparametric regression for more general models. Before proceeding, we should point out that the basic model often does work well even in cases where Y is not real valued or is not Gaussian. This is because the asymptotic theory does not really depend on \epsilon being Gaussian. Thus, at least for large samples, it is worth considering using the tools we have already developed for these cases.

13.29 Example. The BPD data. The outcome Y is presence or absence of BPD and the covariate is x = birth weight. The estimated logistic regression function (solid line) \hat{r}(x; \hat{\beta}_0, \hat{\beta}_1) together with the data are shown in Figure 61. Also shown are two nonparametric estimates. The dashed line is the local likelihood estimator. The dotted line is the local linear estimator, which ignores the binary nature of the Y_i's. Again we see that there is not a dramatic difference between the local logistic model and the local linear model.

13.10 Multiple Nonparametric Regression

Suppose now that the covariate is d-dimensional, X_i = (X_{i1}, \ldots, X_{id})^T. The regression equation takes the form

Y = r(X_1, \ldots, X_d) + \epsilon. \qquad (157)

In principle, all the methods we have discussed carry over to this case easily. Unfortunately, the risk of a nonparametric regression estimator increases rapidly with the dimension d. This is called the curse of dimensionality. The risk of a nonparametric estimator behaves like n^{-4/5} if r is

Figure 61: The BPD data. The data are shown with small vertical lines. The estimates are from logistic regression (solid line), local likelihood (dashed line) and local linear regression (dotted line). assumed to have an integrable second derivative. In d dimensions the risk behaves like n4/(4+d) . To make the risk equal to a small number we have = which implies that n= Thus: To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension d. So you might need n = 30000 points when d = 5 to get the same accuracy as n = 300 when d = 1. To get some intuition into why this is true, suppose the data fall into a d-dimensional unit cube. Let x be a point in the cube and let Nh be a cubical neighborhood around x where the cube has sides of length h. Suppose we want to choose h = h( ) so that a fraction of the data falls into Nh . The expected fraction of points in Nh is hd . Setting hd = we see that h( ) = 1/d = e(1/d) log . Thus h( ) 1 as d grows. In high dimensions, we need huge neighborhoods to capture any reasonable fraction of the data. 162 1
(d+4)/4

1 n4/(4+d)

(158)

(159)
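The neighborhood calculation is easy to check numerically. This sketch (illustrative values only, not from the notes) prints the side length h(ε) = ε^{1/d} needed for a cubical neighborhood in the unit cube to capture 1% of uniformly scattered data:

```python
import math

# Side length h(eps) = eps**(1/d) of a cubical neighborhood in [0,1]^d
# whose expected fraction of uniformly scattered points is eps.
eps = 0.01  # target: capture 1% of the data

for d in (1, 2, 5, 10, 20):
    h = eps ** (1 / d)
    print(f"d = {d:2d}: h = {h:.3f}")
```

Even at d = 10 the neighborhood must span more than half of each axis.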

With this warning in mind, let us press on and see how we might estimate the regression function.

LOCAL REGRESSION. Consider local linear regression. The kernel function K is now a function of d variables. Given a nonsingular positive definite d × d bandwidth matrix H, we define

K_H(x) = |H|^{-1/2} K(H^{-1/2} x).

Often, one scales each covariate to have the same mean and variance and then uses the kernel h^{-d} K(||x||/h) where K is any one-dimensional kernel. Then there is a single bandwidth parameter h. This is equivalent to using a bandwidth matrix of the form H = h²I. At a target value x = (x1, ..., xd)^T, the local sum of squares is given by

Σ_{i=1}^n w_i(x) ( Y_i − a_0 − Σ_{j=1}^d a_j (X_{ij} − x_j) )²    (160)

where w_i(x) = K(||X_i − x||/h). The estimator is

r̂_n(x) = â_0    (161)

where â = (â_0, ..., â_d)^T is the value of a = (a_0, ..., a_d)^T that minimizes the weighted sum of squares. The solution is

â = (X_x^T W_x X_x)^{-1} X_x^T W_x Y    (162)

where

X_x = [ 1  X_{11} − x_1  ···  X_{1d} − x_d
        1  X_{21} − x_1  ···  X_{2d} − x_d
        ⋮        ⋮        ⋱        ⋮
        1  X_{n1} − x_1  ···  X_{nd} − x_d ]

and W_x is the diagonal matrix whose (i, i) element is w_i(x). This is what locfit does. In other words, if you type locfit(y ~ x1 + x2 + x3) then locfit fits Y = r(x1, x2, x3) + ε using one bandwidth. So it is important to rescale your variables.
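A minimal sketch of (160)-(162), assuming a Gaussian kernel and H = h²I; the toy data and the small Gaussian-elimination solver are illustrative additions, not part of the notes:

```python
import math

def gauss_solve(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        a[r] = (M[r][n] - sum(M[r][k] * a[k] for k in range(r + 1, n))) / M[r][r]
    return a

def local_linear(x, X, Y, h):
    """Multivariate local linear estimate r_n(x) = a0, Gaussian kernel, H = h^2 I."""
    d = len(x)
    # design rows (1, Xi1 - x1, ..., Xid - xd) and kernel weights w_i(x)
    rows = [[1.0] + [Xi[j] - x[j] for j in range(d)] for Xi in X]
    w = [math.exp(-0.5 * sum((Xi[j] - x[j]) ** 2 for j in range(d)) / h ** 2)
         for Xi in X]
    p = d + 1
    # normal equations (Xx^T Wx Xx) a = Xx^T Wx Y
    A = [[sum(wi * ri[s] * ri[t] for wi, ri in zip(w, rows)) for t in range(p)]
         for s in range(p)]
    b = [sum(wi * ri[s] * yi for wi, ri, yi in zip(w, rows, Y)) for s in range(p)]
    return gauss_solve(A, b)[0]  # a0 is the fitted value at x

# Toy check: when the truth r(x1, x2) = 1 + 2*x1 - x2 is exactly linear,
# weighted least squares recovers it exactly, so the estimate at (.5, .5) is 1.5.
X = [(i / 10, j / 10) for i in range(11) for j in range(11)]
Y = [1 + 2 * x1 - x2 for (x1, x2) in X]
print(round(local_linear((0.5, 0.5), X, Y, h=0.3), 6))  # prints 1.5
```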


13.30 Theorem (Ruppert and Wand, 1994). Let r̂_n be the multivariate local linear estimator with bandwidth matrix H. The (asymptotic) bias of r̂_n(x) is

(1/2) μ₂(K) trace(H 𝓗)    (163)

where 𝓗 is the matrix of second partial derivatives of r evaluated at x and μ₂(K) is the scalar defined by the equation ∫ u u^T K(u) du = μ₂(K) I. The (asymptotic) variance of r̂_n(x) is

σ²(x) ∫ K(u)² du / ( n |H|^{1/2} f(x) ).    (164)

Also, the bias at the boundary is the same order as in the interior. Thus we see that in higher dimensions, local linear regression still avoids excessive boundary bias and design bias.

Suppose that H = h²I. Then, using the above result, the MSE is

(h⁴/4) μ₂(K)² ( Σ_{j=1}^d r_{jj}(x) )² + σ²(x) ∫ K(u)² du / ( n h^d f(x) ) = c₁h⁴ + c₂/(n h^d)    (165)

which is minimized at h = c n^{−1/(d+4)}, giving MSE of size n^{−4/(4+d)}.

ADDITIVE MODELS. Interpreting and visualizing a high-dimensional fit is difficult. As the number of covariates increases, the computational burden becomes prohibitive. Sometimes, a more fruitful approach is to use an additive model. An additive model is a model of the form

Y = α + Σ_{j=1}^d r_j(X_j) + ε    (166)

where r₁, ..., r_d are smooth functions. The model (166) is not identifiable since we can add any constant to α and subtract the same constant from one of the r_j's without changing the regression function. This problem can be fixed in a number of ways, perhaps the easiest being to set α̂ = Ȳ and then regard the r̂_j's as deviations from Ȳ. In this case we require that Σ_{i=1}^n r̂_j(X_i) = 0 for each j.

The additive model is clearly not as general as fitting r(x₁, ..., x_d) but it is much simpler to compute and to interpret, so it is often a good starting point. There is a simple algorithm for turning any one-dimensional regression smoother into a method for fitting additive models. It is called backfitting.

The Backfitting Algorithm

Initialization: set α̂ = Ȳ and set initial guesses for r̂₁, ..., r̂_d.
Iterate until convergence: for j = 1, ..., d:
  Compute Ỹ_i = Y_i − α̂ − Σ_{k≠j} r̂_k(X_{ik}), i = 1, ..., n.
  Apply a smoother to Ỹ_i on x_j to obtain r̂_j.
  Set r̂_j(x) equal to r̂_j(x) − n^{-1} Σ_{i=1}^n r̂_j(X_{ij}).
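The algorithm can be sketched as follows. Assumed details, not from the notes: a Nadaraya-Watson kernel smoother as the one-dimensional smoother, simulated data, and a fixed number of passes in place of a convergence test.

```python
import math, random

def ksmooth(xs, ys, h):
    """Return a Gaussian-kernel (Nadaraya-Watson) smoother fitted to (xs, ys)."""
    def r(x):
        w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
        return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    return r

def backfit(X, Y, h=0.3, passes=10):
    n, d = len(Y), len(X[0])
    alpha = sum(Y) / n                 # alpha-hat = Ybar
    r = [lambda x: 0.0] * d            # initial guesses r_j = 0
    for _ in range(passes):
        for j in range(d):
            # partial residuals: remove alpha-hat and the other fitted components
            resid = [Y[i] - alpha - sum(r[k](X[i][k]) for k in range(d) if k != j)
                     for i in range(n)]
            rj = ksmooth([X[i][j] for i in range(n)], resid, h)
            mj = sum(rj(X[i][j]) for i in range(n)) / n
            r[j] = (lambda f, m: lambda x: f(x) - m)(rj, mj)  # center r_j
    return alpha, r

random.seed(1)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
Y = [x1 ** 2 + x2 + random.gauss(0, 0.1) for (x1, x2) in X]
alpha, r = backfit(X, Y)
print(round(alpha + r[0](0.5) + r[1](0.5), 2))  # true r(0.5, 0.5) is 0.75
```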

The idea of backfitting is that, for each feature, you fit a regression model to the residuals after fitting all the other features. The algorithm applies to any type of smoother: kernels, splines, etc. You can write your own function to fit an additive model. R has a preprogrammed backfitting function called gam (for generalized additive models). This function fits a kernel regression for each feature.

13.31 Example. Here is an example involving three covariates and one response variable. The data are plotted in Figure 62. The data are 48 rock samples from a petroleum reservoir; the response is permeability (in milli-Darcies) and the covariates are: the area of pores (in pixels out of 256 by 256), perimeter in pixels, and shape (perimeter/√area). The goal is to predict permeability from the three covariates. First we fit the additive model

permeability = r₁(area) + r₂(perimeter) + r₃(shape) + ε.

We scale each covariate to have the same variance and then use a common bandwidth for each covariate. The estimates of r₁, r₂ and r₃ are shown in Figure 62 (bottom). Ȳ was added to each function before plotting it. Next consider a three-dimensional local linear fit (161). After scaling each covariate to have mean 0 and variance 1, we found that the bandwidth h ≈ 3.2 minimized the cross-validation score. The residuals from the additive model and the full three-dimensional local linear fit are shown in Figure 63. Apparently, the fitted values are quite similar, suggesting that the generalized additive model is adequate.

REGRESSION TREES. A regression tree is a model of the form

r(x) = Σ_{m=1}^M c_m I(x ∈ R_m)    (167)

where c₁, ..., c_M are constants and R₁, ..., R_M are disjoint rectangles that partition the space of covariates. The model is fitted in a recursive manner that can be represented as a tree; hence the name.

Denote a generic covariate value by x = (x₁, ..., x_j, ..., x_d). The covariate for the ith observation is X_i = (X_{i1}, ..., X_{ij}, ..., X_{id}). Given a covariate j and a split point s we define the rectangles R₁ = R₁(j, s) = {x : x_j ≤ s} and R₂ = R₂(j, s) = {x : x_j > s} where, in this expression, x_j refers to the jth covariate, not the jth observation. Then we take ĉ₁ to be the average of all the Y_i's such that X_i ∈ R₁ and ĉ₂ to be the average of all the Y_i's such that X_i ∈ R₂. Notice that ĉ₁ and ĉ₂ minimize the sums of squares Σ_{X_i∈R₁}(Y_i − c₁)² and Σ_{X_i∈R₂}(Y_i − c₂)².

Figure 62: top: The rock data. bottom: The plots show r̂₁, r̂₂, and r̂₃ for the additive model Y = r₁(x₁) + r₂(x₂) + r₃(x₃) + ε.

Figure 63: The residuals for the rock data. Top left: residuals from the additive model. Top right: qq-plot of the residuals from the additive model. Bottom left: residuals from the multivariate local linear model. Bottom right: residuals from the two fits plotted against each other.


Figure 64: A regression tree for two covariates x₁ and x₂. The function estimate is r̂(x) = c₁I(x ∈ R₁) + c₂I(x ∈ R₂) + c₃I(x ∈ R₃), where R₁, R₂ and R₃ are the rectangles shown in the lower plot.

The choice of which covariate x_j to split on and which split point s to use is based on minimizing the residual sum of squares. The splitting process is then repeated on each rectangle R₁ and R₂. Figure 64 shows a simple example of a regression tree; also shown are the corresponding rectangles. The function estimate r̂ is constant over the rectangles.

Generally one grows a very large tree, then the tree is pruned to form a subtree by collapsing regions together. The size of the tree is chosen by cross-validation. Usually we use ten-fold cross-validation, since leave-one-out is too expensive. Thus we divide the data into ten blocks, remove each block one at a time, fit the model on the remaining blocks, and compute the prediction error for the observations in the left-out block. This is repeated for each block and the prediction error is averaged over the ten replications. Here are the R commands:

library(tree)                   ### load the library
out = tree(y ~ x1 + x2 + x3)    ### fit the tree
plot(out)                       ### plot the tree
text(out)                       ### add labels to plot

Figure 65: Regression tree for the rock data.

print(out)                           ### print the tree
cv = cv.tree(out)                    ### prune the tree and compute
                                     ### the cross-validation score
plot(cv$size, cv$dev)                ### plot the CV score versus tree size
m = cv$size[cv$dev == min(cv$dev)]   ### find the best size tree
new = prune.tree(out, best = m)      ### fit the best size tree
plot(new)
text(new)

13.32 Example. Figure 65 shows a tree for the rock data. Notice that the variable shape does not appear in the tree. This means that the shape variable was never the optimal covariate to split on in the algorithm. The result is that the tree only depends on area and peri. This illustrates an important feature of tree regression: it automatically performs variable selection, in the sense that a covariate x_j will not appear in the tree if the algorithm finds that the variable is not important.
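The split search that the tree algorithm repeats at each node can be sketched as follows (toy data; a real implementation such as R's tree also recurses and prunes):

```python
# For each covariate j and candidate split point s, fit the two-mean model
# and keep the (j, s) minimizing the residual sum of squares.
def best_split(X, Y):
    best = None  # (rss, j, s)
    d = len(X[0])
    for j in range(d):
        for s in sorted(set(x[j] for x in X))[:-1]:  # candidate split points
            left = [y for x, y in zip(X, Y) if x[j] <= s]
            right = [y for x, y in zip(X, Y) if x[j] > s]
            c1, c2 = sum(left) / len(left), sum(right) / len(right)
            rss = (sum((y - c1) ** 2 for y in left)
                   + sum((y - c2) ** 2 for y in right))
            if best is None or rss < best[0]:
                best = (rss, j, s)
    return best

# Data with a jump in covariate 0 at 0.5; covariate 1 is irrelevant noise.
X = [(i / 10, (i * 7) % 10 / 10) for i in range(10)]
Y = [0.0 if x[0] <= 0.5 else 1.0 for x in X]
rss, j, s = best_split(X, Y)
print(j, s, round(rss, 6))  # splits on covariate 0 at s = 0.5 with RSS 0
```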


14 Density Estimation

A problem closely related to nonparametric regression is nonparametric density estimation. Let

X₁, ..., X_n ~ f

where f is some probability density. We want to estimate f.

14.1 Example (Bart Simpson). The top left plot in Figure 66 shows the density

f(x) = (1/2) φ(x; 0, 1) + (1/10) Σ_{j=0}^4 φ(x; (j/2) − 1, 1/10)    (168)

where φ(x; μ, σ) denotes a Normal density with mean μ and standard deviation σ. Based on 1000 draws from f, I computed a kernel density estimator, described later. The top right plot is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more reasonable density estimate.

We will evaluate the quality of an estimator f̂_n with the risk, or integrated mean squared error, R = E(L) where

L = ∫ ( f̂_n(x) − f(x) )² dx

is the integrated squared error loss function. The estimators will depend on some smoothing parameter h and we will choose h to minimize an estimate of the risk. The usual method for estimating risk is leave-one-out cross-validation. The details are different for density estimation than for regression. In the regression case, the cross-validation score was defined as Σ_{i=1}^n (Y_i − r̂_{(i)}(X_i))², but in density estimation there is no response variable Y. Instead, we proceed as follows.

The loss function, which we now write as a function of h (since f̂_n will depend on some smoothing parameter h), is

L(h) = ∫ ( f̂_n(x) − f(x) )² dx = ∫ f̂_n²(x) dx − 2 ∫ f̂_n(x) f(x) dx + ∫ f²(x) dx.

The last term does not depend on h so minimizing the loss is equivalent to minimizing the expected value of

J(h) = ∫ f̂_n²(x) dx − 2 ∫ f̂_n(x) f(x) dx.    (169)

We shall refer to E(J(h)) as the risk, although it differs from the true risk by the constant term ∫ f²(x) dx. The cross-validation estimator of risk is

Ĵ(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)    (170)

Figure 66: The Bart Simpson density from Example 14.1. Top left: true density. The other plots are kernel estimators based on n = 1000 draws. Bottom left: bandwidth h = 0.05 chosen by leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.

where f̂_{(−i)} is the density estimator obtained after removing the ith observation. We refer to Ĵ(h) as the cross-validation score or estimated risk.

Perhaps the simplest nonparametric density estimator is the histogram. Suppose f has its support on some interval which, without loss of generality, we take to be [0, 1]. Let m be an integer and define bins

B₁ = [0, 1/m), B₂ = [1/m, 2/m), ..., B_m = [(m−1)/m, 1].    (171)

Define the binwidth h = 1/m, let Y_j be the number of observations in B_j, let p̂_j = Y_j/n and let p_j = ∫_{B_j} f(u) du. The histogram estimator is defined by

f̂_n(x) = Σ_{j=1}^m (p̂_j / h) I(x ∈ B_j).    (172)

To understand the motivation for this estimator, note that, for x ∈ B_j and h small,

E( f̂_n(x) ) = E(p̂_j)/h = p_j/h = ∫_{B_j} f(u) du / h ≈ f(x) h / h = f(x).

14.2 Example. Figure 67 shows three different histograms based on n = 1,266 data points from an astronomical sky survey. Each data point represents a redshift, roughly speaking, the distance from us to a galaxy. Choosing the right number of bins involves finding a good tradeoff between bias and variance. We shall see later that the top left histogram has too many bins, resulting in undersmoothing and too much variability. The bottom left histogram has too few bins, resulting in oversmoothing and too much bias. The top right histogram is based on 308 bins (chosen by cross-validation). The histogram reveals the presence of clusters of galaxies.

Consider fixed x and fixed m, and let B_j be the bin containing x. Then,

E( f̂_n(x) ) = p_j / h   and   V( f̂_n(x) ) = p_j (1 − p_j) / (n h²).    (173)

The risk satisfies

R( f̂_n, f ) ≈ (h²/12) ∫ (f′(u))² du + 1/(n h).    (174)

The value h* that minimizes (174) is

h* = (1/n^{1/3}) ( 6 / ∫ (f′(u))² du )^{1/3}.    (175)

With this choice of binwidth,

R( f̂_n, f ) ≈ C / n^{2/3}.    (176)

We see that with an optimally chosen binwidth, the risk decreases to 0 at rate n^{−2/3}. We will see shortly that kernel estimators converge at the faster rate n^{−4/5}.
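The histogram estimator (172) takes only a few lines; the data here are hypothetical:

```python
# Histogram estimator on [0, 1]: f_hat(x) = p_hat_j / h for the bin B_j
# containing x.
def hist_estimator(data, m):
    h = 1.0 / m
    counts = [0] * m
    for x in data:
        counts[min(int(x / h), m - 1)] += 1  # clamp x = 1 into the last bin
    phat = [c / len(data) for c in counts]
    return lambda x: phat[min(int(x / h), m - 1)] / h

data = [0.05, 0.15, 0.25, 0.33, 0.41, 0.52, 0.60, 0.78, 0.81, 0.95]
fhat = hist_estimator(data, m=5)
print(fhat(0.1))  # bin [0, 0.2): 2 of 10 points, so phat = 0.2 and fhat = 1.0
```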

Figure 67: Three versions of a histogram for the astronomy data. The top left histogram has too many bins. The bottom left histogram has too few bins. The top right histogram uses 308 bins (chosen by cross-validation). The lower right plot shows the estimated risk versus the number of bins.

14.3 Theorem. The following identity holds:

Ĵ(h) = 2 / ( h(n−1) ) − ( (n+1) / ( h(n−1) ) ) Σ_{j=1}^m p̂_j².    (177)

14.4 Example. We used cross-validation in the astronomy example. We find that m = 308 is an approximate minimizer. The histogram in the top right plot in Figure 67 was constructed using m = 308 bins. The bottom right plot shows the estimated risk, or more precisely, Ĵ, plotted versus the number of bins.

Histograms are not smooth. Now we discuss kernel density estimators, which are smoother and which converge to the true density faster. Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be

f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K( (x − X_i)/h ).    (178)

This amounts to placing a smoothed-out lump of mass of size 1/n over each data point X_i; see Figure 68. In R use: density(x, bw = h) where h is the bandwidth. As with kernel regression, the choice of kernel K is not crucial, but the choice of bandwidth h is important. Figure 69 shows density estimates with several different bandwidths. Look also at Figure 66. We see how sensitive the estimate f̂_n is to the choice of h. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. In general we will let the bandwidth depend on the sample size so we write h_n.

Here are some properties of f̂_n. The risk is

R ≈ (1/4) σ_K⁴ h_n⁴ ∫ (f″(x))² dx + ∫ K²(x) dx / (n h_n)    (179)

where σ_K² = ∫ x² K(x) dx. If we differentiate (179) with respect to h and set it equal to 0, we see that the asymptotically optimal bandwidth is

h* = ( c₂ / ( c₁² A(f) n ) )^{1/5}    (180)

where c₁ = ∫ x² K(x) dx, c₂ = ∫ K(x)² dx and A(f) = ∫ (f″(x))² dx. This is informative because it tells us that the best bandwidth decreases at rate n^{−1/5}. Plugging h* into (179), we see that if the optimal bandwidth is used then R = O(n^{−4/5}). As we saw, histograms converge at rate O(n^{−2/3}), showing that kernel estimators are superior in rate to histograms.

In practice, the bandwidth can be chosen by cross-validation, but first we describe another method which is sometimes used when f is thought to be very smooth. Specifically, we compute h* from (180) under the idealized assumption that f is Normal. This yields h* = 1.06 σ n^{−1/5}. Usually, σ is estimated by min{s, Q/1.34} where s is the sample standard deviation and Q is the

Figure 68: A kernel density estimator f̂_n. At each point x, f̂_n(x) is the average of the kernels centered over the data points X_i. The data points are indicated by short vertical bars. The kernels are not drawn to scale.
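A sketch of the estimator (178) with a Gaussian kernel; the kernel choice and the toy data are assumptions for illustration:

```python
import math

def kde(data, h):
    """Gaussian-kernel density estimator (178)."""
    n = len(data)
    c = n * h * math.sqrt(2 * math.pi)
    return lambda x: sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / c

data = [-1.2, -0.8, -0.1, 0.0, 0.3, 0.9, 1.1]
fhat = kde(data, h=0.5)

# the estimate integrates to 1 (checked crudely on a grid)
area = sum(fhat(-5 + i * 0.01) * 0.01 for i in range(1001))
print(round(area, 2))
```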

Figure 69: Kernel density estimators and estimated risk for the astronomy data. Top left: oversmoothed. Top right: just right (bandwidth chosen by cross-validation). Bottom left: undersmoothed. Bottom right: cross-validation curve as a function of bandwidth h. The bandwidth was chosen to be the value of h where the curve is a minimum.

interquartile range.¹ This choice of h works well if the true density is very smooth and is called the Normal reference rule.

The Normal Reference Rule. For smooth densities and a Normal kernel, use the bandwidth

h_n = 1.06 σ̂ / n^{1/5}   where   σ̂ = min{ s, Q/1.34 }.

Since we don't want to necessarily assume that f is very smooth, it is usually better to estimate h using cross-validation. Recall that the cross-validation score is

Ĵ(h) = ∫ f̂²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)    (181)

where f̂_{(−i)} denotes the kernel estimator obtained by omitting X_i.

R code: use the bw.ucv function to do cross-validation:

h = bw.ucv(x)
plot(density(x, bw = h))

The bandwidth for the density estimator in the upper right panel of Figure 69 is based on cross-validation. In this case it worked well, but of course there are lots of examples where there are problems. Do not assume that, if the estimator f̂ is wiggly, then cross-validation has let you down. The eye is not a good judge of risk.

Constructing confidence bands for kernel density estimators is similar to regression. Note that f̂_n(x) is just a sample average: f̂_n(x) = n^{-1} Σ_{i=1}^n Z_i(x) where

Z_i(x) = (1/h) K( (x − X_i)/h ).

So the standard error is ŝe(x) = s(x)/√n where s(x) is the standard deviation of the Z_i(x)'s:

s(x) = sqrt( (1/n) Σ_{i=1}^n ( Z_i(x) − f̂_n(x) )² ).

Then we use

f̂_n(x) ± z_{α/(2n)} ŝe(x).    (182)
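A sketch of the band computation, assuming a Gaussian kernel and, for simplicity, the fixed quantile 1.96 rather than z_{α/(2n)}; the data are simulated:

```python
import math, random

def band(data, h, x, z=1.96):
    """Pointwise band at x: f_hat(x) is the sample average of the
    Z_i(x) = K((x - X_i)/h)/h, so se(x) = s(x)/sqrt(n)."""
    n = len(data)
    Z = [math.exp(-0.5 * ((x - xi) / h) ** 2) / (h * math.sqrt(2 * math.pi))
         for xi in data]
    fhat = sum(Z) / n
    s = math.sqrt(sum((zi - fhat) ** 2 for zi in Z) / n)
    se = s / math.sqrt(n)
    return fhat - z * se, fhat + z * se

random.seed(2)
data = [random.gauss(0, 1) for _ in range(400)]
lo, hi = band(data, h=0.3, x=0.0)
print(round(lo, 3), round(hi, 3))  # interval around fhat(0)
```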

Figure 14: Kernel density estimates with bands for the two datasets in Example 14.5, chosen by cross-validation (panels labeled CV) and by the Normal reference rule (panels labeled reference rule).

14.5 Example. Figure 14 shows two examples. The first is data from N(0, 1) and the second from (1/2)N(−1, .1) + (1/2)N(1, .1). In both cases, n = 1000. We show the estimates using cross-validation and the Normal reference rule, together with bands. The true curve is also shown. That's the curve outside the bands in the last plot.

Suppose now that the data are d-dimensional so that X_i = (X_{i1}, ..., X_{id}). The kernel estimator can easily be generalized to d dimensions. Most often, we use the product kernel

f̂_n(x) = ( 1/(n h₁ ··· h_d) ) Σ_{i=1}^n Π_{j=1}^d K( (x_j − X_{ij})/h_j ).    (183)

To further simplify, we can rescale the variables to have the same variance and then use only one bandwidth.

A LINK BETWEEN REGRESSION AND DENSITY ESTIMATION. Consider regression again. Recall that

r(x) = E(Y | X = x) = ∫ y f(y|x) dy = ∫ y f(x, y) dy / f(x)    (184)
     = ∫ y f(x, y) dy / ∫ f(x, y) dy.    (185)

Suppose we compute a bivariate kernel density estimator

f̂(x, y) = (1/n) Σ_{i=1}^n (1/h₁) K( (x − X_i)/h₁ ) (1/h₂) K( (y − Y_i)/h₂ )    (186)

and we insert this into (185). Assuming that ∫ u K(u) du = 0, we see that

∫ y (1/h₂) K( (y − Y_i)/h₂ ) dy = ∫ (h₂u + Y_i) K(u) du    (187)
  = h₂ ∫ u K(u) du + Y_i ∫ K(u) du    (188)
  = Y_i.    (189)

Hence,

∫ y f̂(x, y) dy = (1/n) Σ_{i=1}^n ∫ y (1/h₁) K( (x − X_i)/h₁ ) (1/h₂) K( (y − Y_i)/h₂ ) dy    (190)
  = (1/n) Σ_{i=1}^n (1/h₁) K( (x − X_i)/h₁ ) ∫ y (1/h₂) K( (y − Y_i)/h₂ ) dy    (191)
  = (1/n) Σ_{i=1}^n Y_i (1/h₁) K( (x − X_i)/h₁ ).    (192)

¹ Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a N(μ, σ²).

Also,

∫ f̂(x, y) dy = (1/n) Σ_{i=1}^n (1/h₁) K( (x − X_i)/h₁ ) ∫ (1/h₂) K( (y − Y_i)/h₂ ) dy    (193)
  = (1/n) Σ_{i=1}^n (1/h₁) K( (x − X_i)/h₁ ).    (194)

Therefore,

r̂(x) = ∫ y f̂(x, y) dy / ∫ f̂(x, y) dy    (195)
  = Σ_{i=1}^n Y_i (1/h₁) K( (x − X_i)/h₁ ) / Σ_{i=1}^n (1/h₁) K( (x − X_i)/h₁ )    (196)
  = Σ_{i=1}^n Y_i K( (x − X_i)/h₁ ) / Σ_{i=1}^n K( (x − X_i)/h₁ )    (197)

which is the kernel regression estimator. In other words, the kernel regression estimator can be derived from kernel density estimation.
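The estimator (197) is easy to implement directly; this sketch assumes a Gaussian kernel and noiseless toy data:

```python
import math

def kernel_regression(X, Y, h):
    """Kernel (Nadaraya-Watson) regression estimator (197)."""
    def r(x):
        w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in X]
        return sum(wi * yi for wi, yi in zip(w, Y)) / sum(w)
    return r

X = [i / 50 for i in range(51)]   # grid on [0, 1]
Y = [x ** 2 for x in X]           # noiseless r(x) = x^2
rhat = kernel_regression(X, Y, h=0.05)
print(round(rhat(0.5), 3))        # close to 0.25, with small smoothing bias
```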

15 Classification

REFERENCES:
1. Hastie, Tibshirani and Friedman (2001). The Elements of Statistical Learning.
2. Devroye, Györfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
3. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning.

The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination, or pattern recognition. Consider IID data (X₁, Y₁), ..., (X_n, Y_n) where X_i = (X_{i1}, ..., X_{id})^T ∈ 𝒳 ⊂ R^d is a d-dimensional vector and Y_i takes values in {0, 1}. Often, the covariates X are also called features. The goal is to predict Y given a new X. This is the same as binary regression except that the focus is on good prediction rather than on estimating the regression function.

A classification rule is a function h : 𝒳 → {0, 1}. When we observe a new X, we predict Y to be h(X). The classification risk (or error rate) of h is

R(h) = P( Y ≠ h(X) ).    (198)

EXAMPLES:

1. The Coronary Risk-Factor Study (CORIS) data. There are 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. The goal is to predict Y from all the covariates.

2. Predict if a stock will go up or down based on past performance. Here X is the past price and Y indicates whether the future price goes up or down.

3. Predict if an email message is spam or real.

4. Identify whether glass fragments in a criminal investigation are from a window or not, based on chemical composition.

Figure 70: Zip code data.

Figure 71: Two covariates and a linear decision boundary. Triangles mean Y = 1; squares mean Y = 0. These two groups are perfectly separated by the linear decision boundary.

5. Identify handwritten digits from images. Each Y is a digit from 0 to 9. There are 256 covariates x₁, ..., x₂₅₆ corresponding to the intensity values from the pixels of the 16 × 16 image. See Figure 70.

15.1 Example. Figure 71 shows 100 data points. The covariate X = (X₁, X₂) is 2-dimensional and the outcome Y ∈ 𝒴 = {0, 1}. The Y values are indicated on the plot with the triangles representing Y = 1 and the squares representing Y = 0. Also shown is a linear classification rule represented by the solid line. This is a rule of the form

h(x) = 1 if a + b₁x₁ + b₂x₂ > 0, and 0 otherwise.

Everything above the line is classified as a 0 and everything below the line is classified as a 1.

15.1 Error Rates, The Bayes Classifier and Regression

The true error rate (or classification risk) of a classifier h is

R(h) = P( { h(X) ≠ Y } )    (199)

and the empirical error rate or training error rate is

R̂_n(h) = (1/n) Σ_{i=1}^n I( h(X_i) ≠ Y_i ).    (200)

The rule h that minimizes R(h) is h (x) = 1 if r(x) > 1 2 0 otherwise. (201)

The rule h is called the Bayes rule. The risk R = R(h ) of the Bayes rule is called the Bayes risk. The set D(h) = {x : r(x) = 1/2} is called the decision boundary. PROOF. We will show that R(h) R(h ) 0. Note that R(h) = P({Y = h(X )}) = It sufces to show that P({Y = h(X )|X = x} P({Y = h (X )|X = x} 0 for all x. Now, P({Y = h(X )|X = x} = 1 P({Y = h(X )|X = x} 184 (203) P({Y = h(X )|X = x}f (x)dx. (202)

= 1 P({Y = 1, h(X ) = 1|X = x} + P({Y = 0, h(X ) = 0|X = x} = 1 I (h(x) = 1)P({Y = 1|X = x} + I (h(x) = 0)P({Y = 0|X = x} = 1 I (h(x) = 1)r(x) + I (h(x) = 0)(1 r(x)) = 1 I (x)r(x) + (1 I (x))(1 r(x)) where I (x) = I (h(x) = 1). Hence, P({Y = h(X )|X = x} P({Y = h (X )|X = x} = I (x)r(x) + (1 I (x))(1 r(x)) I (x)r(x) + (1 I (x))(1 r(x)) = (2r(x) 1)(I (x) I (x)) 1 (I (x) I (x)). = 2 r ( x) 2 When r(x) 1/2, h (x) = 1 so (204) is non-negative. When r(x) < 1/2, h (x) = 0 so so both terms are nonposiitve and hence (204) is again non-negative. This proves (203). To summarize, if h is any classier, then R(h) R .

15.2

Classication is Easier Than Regression

Let r (x) = E(Y |X = x) be the true regression function and let h (x) denote the corresponding Bayes rule. Let r(x) be an estimate of r (x) and dene the plug-in rule: h(x) = In the previous proof we showed that P(Y = h(X )|X = x) P(Y = h (X )|X = x) = (2r(x) 1)(Ih (x)=1 Ib h(x)=1 ) = |2r(x) 1|Ih (x)=b h(x) = 2|r(x) 1/2|Ih (x)=b h(x) 1 if r(x) > 1 2 0 otherwise. (204)

Now, when h (x) = h(x) we have that |r(x) r (x)| |r(x) 1/2|. Therefore, P(h(X ) = Y ) P(h (X ) = Y ) = 2 2 185 |r(x) 1/2|Ih (x)=b h(x) f (x)dx |r(x) r (x)|Ih (x)=b h(x) f (x)dx

|r(x) r (x)|f (x)dx

= 2E|r(X ) r (X )|. This means that if r(x) is close to r (x) then the classication risk will be close to the Bayes risk. The converse is not true. It is possible for r to be far from r (x) and still lead to a good classier. As long as r(x) and r (x) are on the same side of 1/2 they yield the same classier.

15.3

The Bayes Rule and the Class Densities


r(x) = P(Y = 1|X = x) f (x|Y = 1)P(Y = 1) = f (x|Y = 1)P(Y = 1) + f (x|Y = 0)P(Y = 0) f1 (x) = f1 (x) + (1 )f0 (x)

We can rewrite h in a different way. From Bayes theorem we have that

(205)

where f0 (x) = f (x|Y = 0) f1 (x) = f (x|Y = 1) = P(Y = 1). We call f0 and f1 the class densities. Thus we have: The Bayes rule can be written as: h (x) =

1 if

f1 (x) f0 (x)

>

(1 )

(206)

0 otherwise.

15.4 How to Find a Good Classifier

The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation to the Bayes rule. There are three main approaches:

1. Empirical Risk Minimization. Choose a set of classifiers H and find ĥ ∈ H that minimizes some estimate of L(h).

2. Regression (Plug-in Classifiers). Find an estimate r̂ of the regression function r and define ĥ(x) = 1 if r̂(x) > 1/2 and 0 otherwise.

3. Density Estimation. Estimate f₀ from the X_i's for which Y_i = 0, estimate f₁ from the X_i's for which Y_i = 1, and let π̂ = n^{−1} Σ_{i=1}^n Y_i. Define

r̂(x) = P̂(Y = 1 | X = x) = π̂ f̂₁(x) / [ π̂ f̂₁(x) + (1 − π̂) f̂₀(x) ]

and ĥ(x) = 1 if r̂(x) > 1/2 and 0 otherwise.
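Approach 3 can be sketched in one dimension with Normal class densities; the class parameters here are hypothetical, chosen only for illustration:

```python
import math

def phi(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def make_classifier(pi, mu0, mu1, sigma):
    """Plug-in rule h(x) = 1 iff r(x) = pi f1 / (pi f1 + (1-pi) f0) > 1/2."""
    def h(x):
        f1, f0 = phi(x, mu1, sigma), phi(x, mu0, sigma)
        r = pi * f1 / (pi * f1 + (1 - pi) * f0)
        return 1 if r > 0.5 else 0
    return h

h = make_classifier(pi=0.5, mu0=0.0, mu1=2.0, sigma=1.0)
print(h(0.2), h(1.8))  # prints 0 1: the boundary is at x = 1
```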

15.5 Empirical Risk Minimization: The Finite Case

Let H be a finite set of classifiers. Empirical risk minimization means choosing the classifier ĥ ∈ H to minimize the training error R̂_n(h), also called the empirical risk. Thus,

ĥ = argmin_{h∈H} R̂_n(h) = argmin_{h∈H} (1/n) Σ_i I( h(X_i) ≠ Y_i ).    (207)

Let h* be the best classifier in H, that is, R(h*) = min_{h∈H} R(h). How good is ĥ compared to h*? We know that R(h*) ≤ R(ĥ). We will now show that, with high probability, R(ĥ) ≤ R(h*) + ε for some small ε > 0.

Our main tool for this analysis is Hoeffding's inequality. This inequality is very fundamental and is used in many places in statistics and machine learning.

Hoeffding's Inequality. If X₁, ..., X_n ~ Bernoulli(p), then, for any ε > 0,

P( |p̂ − p| > ε ) ≤ 2 e^{−2nε²}    (208)

where p̂ = n^{−1} Σ_{i=1}^n X_i.

Another basic fact we need is the union bound: if Z₁, ..., Z_m are random variables then

P( max_j Z_j > c ) ≤ Σ_j P( Z_j > c ).

This follows since

P( max_j Z_j > c ) = P( {Z₁ > c} or {Z₂ > c} or ··· or {Z_m > c} ) ≤ Σ_j P( Z_j > c ).

Recall that H = {h₁, ..., h_m} consists of finitely many classifiers. Now we see that

P( max_{h∈H} |R̂_n(h) − R(h)| > ε ) ≤ Σ_{h∈H} P( |R̂_n(h) − R(h)| > ε ) ≤ 2m e^{−2nε²}.

Fix α and let

ε_n = sqrt( (2/n) log(2m/α) ).

Then

P( max_{h∈H} |R̂_n(h) − R(h)| > ε_n ) ≤ α.

Hence, with probability at least 1 − α, the following is true:

R(ĥ) ≤ R̂_n(ĥ) + ε_n ≤ R̂_n(h*) + ε_n ≤ R(h*) + 2ε_n.

Summarizing:

P( R(ĥ) > R(h*) + sqrt( (8/n) log(2m/α) ) ) ≤ α.

We might extend our analysis to infinite H later.
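The bound is easy to evaluate numerically. This sketch (illustrative m and α, not from the notes) shows how the margin 2ε_n shrinks with n:

```python
import math

def erm_margin(n, m, alpha):
    """2 * eps_n with eps_n = sqrt((2/n) * log(2m/alpha))."""
    return 2 * math.sqrt(2 / n * math.log(2 * m / alpha))

for n in (100, 1000, 10000):
    print(n, round(erm_margin(n, m=50, alpha=0.05), 3))
```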

15.6 Parametric Methods I: Linear and Logistic Regression

One approach to classification is to estimate the regression function r(x) = E(Y | X = x) = P(Y = 1 | X = x) and, once we have an estimate r̂, use the classification rule

ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.    (209)

The linear regression model

Y = r(x) + ε = β₀ + Σ_{j=1}^d β_j X_j + ε    (210)

can't be correct since it does not force Y = 0 or 1. Nonetheless, it can sometimes lead to a good classifier. An alternative is to use logistic regression:

r(x) = P(Y = 1 | X = x) = e^{β₀ + Σ_j β_j x_j} / ( 1 + e^{β₀ + Σ_j β_j x_j} ).    (211)

15.2 Example. Let us return to the South African heart disease data.

> print(names(sa.data))
[1] "sbp"       "tobacco"   "ldl"       "adiposity" "famhist"   "typea"
[7] "obesity"   "alcohol"   "age"       "chd"
> n = nrow(sa.data)
>
> ### linear
> out = lm(chd ~ . , data = sa.data)

> tmp = predict(out)
> yhat = rep(0, n)
> yhat[tmp > .5] = 1
> print(table(chd, yhat))
   yhat
chd   0   1
  0 260  42
  1  76  84
> print(sum(chd != yhat)/n)
[1] 0.2554113
>
> ### logistic
> out = glm(chd ~ . , data = sa.data, family = binomial)
> tmp = predict(out, type = "response")
> yhat = rep(0, n)
> yhat[tmp > .5] = 1
> print(table(chd, yhat))
   yhat
chd   0   1
  0 256  46
  1  77  83
> print(sum(chd != yhat)/n)
[1] 0.2662338

15.3 Example. For the digits example, let's restrict ourselves only to Y = 0 and Y = 1. Here is what we get:
> ### linear
> out = lm(ytrain ~ ., data = as.data.frame(xtrain))
> tmp = predict(out)
> n = length(ytrain)
> yhat = rep(0, n)
> yhat[tmp > .5] = 1
> b = table(ytrain, yhat)
> print(b)
      yhat
ytrain   0   1
     0 600   0
     1   0 500
> print((b[1,2]+b[2,1])/sum(b))  ### training error
[1] 0
> tmp = predict(out, newdata = as.data.frame(xtest))
Warning message:
prediction from a rank-deficient fit may be misleading in:
  predict.lm(out, newdata = as.data.frame(xtest))
> n = length(ytest)
> yhat = rep(0, n)
> yhat[tmp > .5] = 1
> b = table(ytest, yhat)
> print(b)
     yhat
ytest   0   1
    0 590   4
    1   0 505
> print((b[1,2]+b[2,1])/sum(b))  ### testing error
[1] 0.003639672

15.7 Parametric Methods II: Gaussian and Linear Classifiers

Suppose that f₀(x) = f(x | Y = 0) and f₁(x) = f(x | Y = 1) are both multivariate Gaussians:

f_k(x) = (2π)^{−d/2} |Σ_k|^{−1/2} exp{ −(1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) },  k = 0, 1.

Thus, X | Y = 0 ~ N(μ₀, Σ₀) and X | Y = 1 ~ N(μ₁, Σ₁).

15.4 Theorem. If X | Y = 0 ~ N(μ₀, Σ₀) and X | Y = 1 ~ N(μ₁, Σ₁), then the Bayes rule is

h*(x) = 1 if r₁² < r₀² + 2 log(π₁/π₀) + log( |Σ₀|/|Σ₁| ), and 0 otherwise    (212)

where

r_i² = (x − μ_i)^T Σ_i^{−1} (x − μ_i), i = 0, 1    (213)

is the Mahalanobis distance. An equivalent way of expressing the Bayes rule is

h*(x) = argmax_{k∈{0,1}} δ_k(x)

where

δ_k(x) = −(1/2) log|Σ_k| − (1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) + log π_k    (214)

and |A| denotes the determinant of a matrix A.

The decision boundary of the above classifier is quadratic, so this procedure is called quadratic discriminant analysis (QDA). In practice, we use sample estimates of π₀, π₁, μ₀, μ₁, Σ₀, Σ₁ in place of the true values, namely:

π̂₀ = (1/n) Σ_{i=1}^n (1 − Y_i),   π̂₁ = (1/n) Σ_{i=1}^n Y_i,

μ̂₀ = (1/n₀) Σ_{i: Y_i=0} X_i,   μ̂₁ = (1/n₁) Σ_{i: Y_i=1} X_i,

S₀ = (1/n₀) Σ_{i: Y_i=0} (X_i − μ̂₀)(X_i − μ̂₀)^T,   S₁ = (1/n₁) Σ_{i: Y_i=1} (X_i − μ̂₁)(X_i − μ̂₁)^T

where n₀ = Σ_i (1 − Y_i) and n₁ = Σ_i Y_i.

A simplification occurs if we assume that Σ₀ = Σ₁ = Σ. In that case, the Bayes rule is

h*(x) = argmax_k δ_k(x)    (215)

where now

δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k.    (216)

The parameters are estimated as before, except that the MLE of Σ is

S = ( n₀S₀ + n₁S₁ ) / ( n₀ + n₁ ).

The classification rule is

ĥ(x) = 1 if δ̂₁(x) > δ̂₀(x), and 0 otherwise    (217)

1 j (x) = xT S 1 j T S 1 j + log j 2 j is called the discriminant function. The decision boundary {x : 0 (x) = 1 (x)} is linear so this method is called linear discrimination analysis (LDA). 15.5 Example. Let us return to the South African heart disease data. In R use: out = lda(x,y) ### or qda for quadratic yhat = predict(out)$class The error rate of LDA is .25. For QDA we get .24. In this example, there is little advantage to QDA over LDA. Now we generalize to the case where Y takes on more than two values. 15.6 Theorem. Suppose that Y {1, . . . , K }. If fk (x) = f (x|Y = k ) is Gaussian, the Bayes rule is h(x) = argmaxk k (x) 1 1 1 k (x) = log |k | (x k )T k (x k ) + log k . 2 2 If the variances of the Gaussians are equal, then 1 k (x) = xT 1 k T 1 + log k . 2 k where (218)

(219)
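The LDA rule in (215)-(217) is simple enough to compute by hand. Below is a minimal sketch on simulated two-class Gaussian data; all object names are our own illustrative choices, and in practice one would use lda in the MASS package.

```r
## Hand-rolled two-class LDA following (215)-(217); illustrative only.
set.seed(1)
n0 = 100; n1 = 100; n = n0 + n1
x = rbind(matrix(rnorm(2*n0), ncol = 2),            # class 0: N(0, I)
          matrix(rnorm(2*n1, mean = 2), ncol = 2))  # class 1: N((2,2), I)
y = c(rep(0, n0), rep(1, n1))
mu0 = colMeans(x[y == 0, ]); mu1 = colMeans(x[y == 1, ])
S0 = cov(x[y == 0, ]) * (n0 - 1)/n0                 # MLE covariance, class 0
S1 = cov(x[y == 1, ]) * (n1 - 1)/n1                 # MLE covariance, class 1
S  = (n0*S0 + n1*S1)/(n0 + n1)                      # pooled estimate
Sinv = solve(S)
pi0 = n0/n; pi1 = n1/n
## discriminant function delta_j(x) from (216)
delta = function(xx, mu, pihat)
  xx %*% Sinv %*% mu - 0.5*sum(mu * (Sinv %*% mu)) + log(pihat)
yhat = as.numeric(delta(x, mu1, pi1) > delta(x, mu0, pi0))
mean(yhat != y)                                     # training error rate
```

With the two class means separated by (2, 2), the training error rate should be small.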


Figure 72: from Hastie et al. 2001


Figure 73: from Hastie et al. 2001


In Figure 72 the left panel shows LDA applied to data that happen to represent 3 classes. The linear decision boundaries separate the 3 groups fairly well using the two variables plotted on the horizontal and vertical axes (X1, X2). In the right panel LDA is applied to 5 variables, obtained by adding the quadratic and interaction terms (X1, X2, X1X2, X1^2, X2^2). The linear boundaries from the 5-dimensional space correspond to curves in the 2-dimensional space. In Figure 73 the right panel shows QDA applied to the same data. QDA nearly matches LDA applied with the quadratic terms. There is another version of linear discriminant analysis due to Fisher. The idea is to first reduce the dimension of the covariates to one dimension by projecting the data onto a line. Algebraically, this means replacing the covariate X = (X1, . . . , Xd) with a linear combination U = w^T X = Σ_{j=1}^d w_j X_j. The goal is to choose the vector w = (w1, . . . , wd) that best separates the data. Then we perform classification with the one-dimensional covariate U instead of X. Fisher's rule is the same as the Bayes linear classifier in equation (216) when the class proportions are equal (π0 = π1 = 1/2).

15.8

Relationship Between Logistic Regression and LDA

LDA and logistic regression are almost the same thing. If we assume that each group is Gaussian with the same covariance matrix, then we saw earlier that

\[ \log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=0|X=x)} = \log\left(\frac{\pi_1}{\pi_0}\right) - \frac{1}{2}(\mu_0+\mu_1)^T\Sigma^{-1}(\mu_1-\mu_0) + x^T\Sigma^{-1}(\mu_1-\mu_0) \equiv \alpha_0 + \alpha^T x. \]

On the other hand, the logistic model is, by assumption,

\[ \log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=0|X=x)} = \beta_0 + \beta^T x. \]

These are the same model since they both lead to classification rules that are linear in x. The difference is in how we estimate the parameters. The joint density of a single observation is f(x, y) = f(x|y)f(y) = f(y|x)f(x). In LDA we estimate the whole joint distribution by maximizing the likelihood

\[ \prod_i f(X_i, y_i) = \underbrace{\prod_i f(X_i|y_i)}_{\text{Gaussian}} \; \underbrace{\prod_i f(y_i)}_{\text{Bernoulli}}. \tag{220} \]

In logistic regression we maximized the conditional likelihood ∏_i f(y_i|X_i) but we ignored the second term f(X_i):

\[ \prod_i f(X_i, y_i) = \underbrace{\prod_i f(y_i|X_i)}_{\text{logistic}} \; \underbrace{\prod_i f(X_i)}_{\text{ignored}}. \tag{221} \]


Since classification only requires knowing f(y|x), we don't really need to estimate the whole joint distribution. Logistic regression leaves the marginal distribution f(x) unspecified, so it is more nonparametric than LDA. This is an advantage of the logistic regression approach over LDA. To summarize: LDA and logistic regression both lead to a linear classification rule. In LDA we estimate the entire joint distribution f(x, y) = f(x|y)f(y). In logistic regression we only estimate f(y|x) and we don't bother estimating f(x).
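Since both methods produce linear rules, they typically classify points almost identically when the equal-covariance Gaussian assumption roughly holds. A quick sketch on simulated data (MASS ships with R; the simulation setup is our own illustration):

```r
## Compare LDA and logistic regression predictions on Gaussian data.
library(MASS)
set.seed(2)
n = 200
x1 = c(rnorm(n), rnorm(n, mean = 1.5))
x2 = c(rnorm(n), rnorm(n, mean = 1.5))
y  = rep(0:1, each = n)
d  = data.frame(x1, x2, y)
fit.lda = lda(y ~ x1 + x2, data = d)
fit.log = glm(y ~ x1 + x2, family = binomial, data = d)
yhat.lda = as.numeric(predict(fit.lda)$class) - 1   # factor -> 0/1
yhat.log = as.numeric(fitted(fit.log) > 0.5)
mean(yhat.lda != yhat.log)   # disagreement rate: typically very small
```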

15.9

Training and Testing Data: Model Validation

How do we choose a good classifier? We would like to have a classifier h with a low prediction error rate. Usually, we can't use the training error rate as an estimate of the true error rate because it is biased downward.

15.7 Example. Consider the heart disease data again. Suppose we fit a sequence of logistic regression models. In the first model we include one covariate. In the second model we include two covariates, and so on. The ninth model includes all the covariates. We can go even further. Let's also fit a tenth model that includes all nine covariates plus the first covariate squared. Then we fit an eleventh model that includes all nine covariates plus the first covariate squared and the second covariate squared. Continuing this way we will get a sequence of 18 classifiers of increasing complexity. The solid line in Figure 74 shows the observed classification error, which steadily decreases as we make the model more complex. If we keep going, we can make a model with zero observed classification error. The dotted line shows the 10-fold cross-validation estimate of the error rate (to be explained shortly), which is a better estimate of the true error rate than the observed classification error. The estimated error decreases for a while then increases. This is essentially the bias-variance tradeoff phenomenon we have seen before.

How can we learn if our model is good at classifying new data? The answer involves a trick we've used previously: cross-validation. No analysis of prediction models is complete without evaluating the performance of the model using this technique.

Cross-Validation. The basic idea of cross-validation, which we have already encountered in curve estimation, is to leave out some of the data when fitting a model. The simplest version of cross-validation involves randomly splitting the data into two pieces: the training set T and the validation set V. Often, about 10 per cent of the data might be set aside as the validation set. The classifier h is constructed from the training set. We then estimate the error by

\[ \hat{L}(h) = \frac{1}{m}\sum_{X_i \in V} I\big(h(X_i) \neq Y_i\big) \tag{222} \]

where m is the size of the validation set. See Figure 75.

In an ideal world we would have so much data that we could split the data into two portions, use the first portion to select a model (i.e., training) and then the second portion to test our model. In this way we could obtain an unbiased estimate of how well we predict future data. In reality we seldom have enough data to spare any of it. As a compromise we use K-fold cross-validation (KCV) to evaluate our approach. K-fold cross-validation is obtained from the following algorithm.

Figure 74: Error rate (vertical axis) versus number of terms in the model (horizontal axis). The solid line is the observed error rate and the dashed line is the cross-validation estimate of the true error rate.


Figure 75: Cross-validation. The data are divided into two groups: the training data and the validation data. The training data are used to produce an estimated classier h. Then, h is applied to the validation data to obtain an estimate L of the error rate of h.


K-fold cross-validation.

1. Randomly divide the data into K chunks of approximately equal size. A common choice is K = 10.

2. For k = 1 to K, do the following:

   (a) Delete chunk k from the data.

   (b) Compute the classifier ĥ(k) from the rest of the data.

   (c) Use ĥ(k) to predict the data in chunk k. Let L̂(k) denote the observed error rate.

3. Let

\[ \hat{L}(h) = \frac{1}{K}\sum_{k=1}^K \hat{L}^{(k)}. \tag{223} \]

If tuning parameters are chosen using cross-validation, then KCV still underestimates the error. Nevertheless, KCV helps us to evaluate model performance much better than the training error rate.
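The algorithm above is easy to implement directly. The following is a minimal sketch for a logistic classifier; the function and data names are our own illustrative choices, not from any package.

```r
## Hand-rolled K-fold cross-validation for a logistic classifier.
kfold.cv = function(d, K = 10) {
  n = nrow(d)
  fold = sample(rep(1:K, length.out = n))       # step 1: random chunks
  err = numeric(K)
  for (k in 1:K) {                              # step 2: leave out chunk k
    fit = glm(y ~ ., family = binomial, data = d[fold != k, ])
    p = predict(fit, newdata = d[fold == k, ], type = "response")
    err[k] = mean((p > 0.5) != d$y[fold == k])  # observed error rate
  }
  mean(err)                                     # step 3: average over folds
}
set.seed(3)
d = data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y = as.numeric(d$x1 + d$x2 + rnorm(200) > 0)
kfold.cv(d, K = 10)
```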

15.8 Example. Simple CV and KCV in GLM. This simple example shows how to use the cv.glm function to do leave-one-out cross-validation and K-fold cross-validation when fitting a generalized linear model. First we'll try the simplest scenario: leave-one-out and 6-fold cross-validation prediction error for data that are appropriate for a linear model.

> library(boot)
> data(mammals, package="MASS")
> mammals.glm = glm(log(brain) ~ log(body), data=mammals)
> cv.err = cv.glm(mammals, mammals.glm)
> cv.err$delta
[1] 0.4918650 0.4916571

This is the leave-one-out cross-validation. Delta reports the computed error for whatever cost function was chosen. By default the cost is average squared error. (The 2nd delta entry adjusts for the bias in KCV versus leave-one-out.)


# Try 6-fold cross-validation
> cv.err.6 = cv.glm(mammals, mammals.glm, K=6)
> cv.err.6$delta
[1] 0.5000575 0.4969159

Notice that using 6-fold cross-validation yields a similar, but not identical, estimate of error. Neither is clearly preferable in terms of performance; 6-fold is computationally faster.

As this is a linear model we could have calculated the leave-one-out cross-validation estimate without any extra model-fitting using the diagonals of the hat matrix. The function glm.diag gives the diagonals of H.

> muhat = mammals.glm$fitted
> mammals.diag = glm.diag(mammals.glm)   # to get diagonals of H
> cv.err = mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2)
> cv.err
[1] 0.491865

Notice it matches the leave-one-out CV entry above.

Next we try a logistic model to obtain leave-one-out and 11-fold cross-validation prediction error for the nodal data set. Since the response is a binary variable we don't want to use the default cost function, which is squared error. First we need to define a function, which we call cost. An appropriate cost function is our usual fraction of misclassified subjects in the 2x2 confusion matrix. The function below computes this quantity, where y is the binary outcome and pi is the fitted value.

> cost = function(y, pi){
+   err = mean(abs(y - pi) > 0.5)
+   return(err)
+ }
> nodal.glm = glm(r ~ stage + xray + acid, binomial, data=nodal)
> # for leave-one-out CV
> cv.err = cv.glm(nodal, nodal.glm, cost=cost, K=nrow(nodal))$delta
> cv.err
[1] 0.1886792 0.1886792
> # for 11-fold CV
> cv.11.err <- cv.glm(nodal, nodal.glm, cost=cost, K=11)$delta
> cv.11.err
[1] 0.2264151 0.2192951


There are CV forms for each of the methods of prediction; they vary in convenience. For regression trees, in the library called tree, there is a function called cv.tree. For the method we cover next, called kth nearest neighbors, in the library called class there is a function called knn.cv. For LDA analysis the library needs to be downloaded manually and it is called lda.cv.

15.9 Example. Diabetes in Pima Indians. A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes. These data frames contain the following columns:

npreg = number of pregnancies.
glu = plasma glucose concentration in an oral glucose tolerance test.
bp = diastolic blood pressure (mm Hg).
skin = triceps skin fold thickness (mm).
bmi = body mass index (weight in kg/(height in m)^2).
ped = diabetes pedigree function.
age = age in years.
type = Yes or No, for diabetic according to WHO criteria.

> library(MASS)
> data(Pima.tr)
> data(Pima.te)
> Pima <- rbind(Pima.tr, Pima.te)
> Pima$type <- ifelse(Pima$type == "Yes", 1, 0)
> library(boot)

We try 4 models of various complexity and compare their error rate using KCV. M2 is the model chosen by AIC using stepwise regression.

> M1 <- glm(type ~ npreg + glu + bp + skin + bmi + age,
+           data = Pima, family = binomial)
> M2 <- glm(type ~ npreg + glu + bmi + age,
+           data = Pima, family = binomial)
> M3 <- glm(type ~ 1, data = Pima, family = binomial)
> M4 <- glm(type ~ (npreg + glu + bp + skin + bmi + age)^2,
+           data = Pima, family = binomial)
> F1 <- cv.glm(data = Pima, cost=cost, M1)$delta[2]
> F2 <- cv.glm(data = Pima, cost=cost, M2)$delta[2]
> F3 <- cv.glm(data = Pima, cost=cost, M3)$delta[2]
> F4 <- cv.glm(data = Pima, cost=cost, M4)$delta[2]
> F <- c(F1 = F1, F2 = F2, F3 = F3, F4 = F4)
> F
     F1.1      F2.1      F3.1      F4.1
0.2247442 0.2245428 0.3327068 0.2401068

Based on this analysis, we conclude that Model M2 has slightly better predictive power.

15.10

Nearest Neighbors
The k-nearest neighbor rule is

\[ h(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^n w_i(x)\, I(Y_i=1) > \sum_{i=1}^n w_i(x)\, I(Y_i=0) \\ 0 & \text{otherwise} \end{cases} \tag{224} \]

where wi(x) = 1 if Xi is one of the k nearest neighbors of x, and wi(x) = 0 otherwise. "Nearest" depends on how you define the distance. Often we use Euclidean distance ||Xi − Xj||. In that case you should standardize the variables first.

15.10 Example. Digits again.

> ### knn
> library(class)
> yhat = knn(train = xtrain, cl = ytrain, test = xtest, k = 1)
> b = table(ytest,yhat)
> print(b)
     yhat
ytest   0   1
    0 594   0
    1   0 505
> print((b[1,2]+b[2,1])/sum(b))
[1] 0
> yhat = knn.cv(train = xtrain, cl = ytrain, k = 1)
> b = table(ytrain,yhat)
> print(b)
      yhat
ytrain   0   1
     0 599   1
     1   0 500
> print((b[1,2]+b[2,1])/sum(b))
[1] 0.0009090909

An important part of this method is to choose a good value of k. For this we can use cross-validation.

15.11 Example. South African heart disease data again.

library(class)
m = 50
error = rep(0,m)
for(i in 1:m){
  out = knn.cv(train=x, cl=y, k=i)
  error[i] = sum(y != out)/n
}
postscript("knn.sa.ps")
plot(1:m, error, type="l", lwd=3, xlab="k", ylab="error")

See Figure 76.

15.12 Example. Figure 77 compares the decision boundaries in a two-dimensional example. The boundaries are from (i) linear regression, (ii) quadratic regression, (iii) k-nearest neighbors (k = 1), (iv) k-nearest neighbors (k = 50), and (v) k-nearest neighbors (k = 200). The logistic classifier (not shown) also yields a linear boundary.

Some Theoretical Properties. Let h1 be the nearest neighbor classifier with k = 1. Cover and Hart (1967) showed that, under very weak assumptions,

\[ R^* \le \lim_{n\to\infty} R(h_1) \le 2R^* \tag{225} \]

where R* is the Bayes risk. For k > 1 we have

\[ R^* \le \lim_{n\to\infty} R(h_k) \le R^* + \frac{1}{\sqrt{ke}}. \tag{226} \]
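The rule (224) can be implemented directly in a few lines. The sketch below uses Euclidean distance and illustrative names of our own; in practice use knn in the class package.

```r
## Direct implementation of the k-nearest-neighbor rule (224).
knn.predict = function(xtrain, ytrain, xnew, k = 5) {
  apply(xnew, 1, function(z) {
    d = sqrt(colSums((t(xtrain) - z)^2))   # distances to all training points
    nbr = order(d)[1:k]                    # indices of the k nearest neighbors
    as.numeric(mean(ytrain[nbr]) > 0.5)    # majority vote over labels 0/1
  })
}
set.seed(4)
xtrain = matrix(rnorm(200), ncol = 2)
ytrain = as.numeric(xtrain[, 1] > 0)       # label = sign of first covariate
yhat = knn.predict(xtrain, ytrain, xtrain, k = 1)
mean(yhat != ytrain)   # 0: with k = 1 each training point is its own neighbor
```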

15.11

Density Estimation and Naive Bayes


The Bayes rule can be written as

\[ h^*(x) = \begin{cases} 1 & \text{if } \dfrac{f_1(x)}{f_0(x)} > \dfrac{1-\pi}{\pi} \\ 0 & \text{otherwise.} \end{cases} \tag{227} \]

We can estimate π by

\[ \hat\pi = \frac{1}{n}\sum_{i=1}^n Y_i. \]

Figure 76: knn for South African heart disease data: cross-validation error (vertical axis) versus k (horizontal axis).

Figure 77: Comparison of decision boundaries. Panels: data, linear, quadratic, knn k = 1, knn k = 50, knn k = 200.

We can estimate f0 and f1 using density estimation. For example, we could apply kernel density estimation to D0 = {Xi : Yi = 0} to get f̂0 and to D1 = {Xi : Yi = 1} to get f̂1. Then we estimate h with

\[ \hat{h}(x) = \begin{cases} 1 & \text{if } \dfrac{\hat{f}_1(x)}{\hat{f}_0(x)} > \dfrac{1-\hat\pi}{\hat\pi} \\ 0 & \text{otherwise.} \end{cases} \tag{228} \]

But if x = (x1, . . . , xd) is high-dimensional, nonparametric density estimation is not very reliable. This problem is ameliorated if we assume that X1, . . . , Xd are independent, for then

\[ f_0(x_1,\ldots,x_d) = \prod_{j=1}^d f_{0j}(x_j) \tag{229} \]

\[ f_1(x_1,\ldots,x_d) = \prod_{j=1}^d f_{1j}(x_j). \tag{230} \]

We can then use one-dimensional density estimators and multiply them:

\[ \hat{f}_0(x_1,\ldots,x_d) = \prod_{j=1}^d \hat{f}_{0j}(x_j) \tag{231} \]

\[ \hat{f}_1(x_1,\ldots,x_d) = \prod_{j=1}^d \hat{f}_{1j}(x_j). \tag{232} \]
The resulting classier is called the naive Bayes classier. The assumption that the components of X are independent is usually wrong yet the resulting classier might still be accurate. Here is a summary of the steps in the naive Bayes classier:

The Naive Bayes Classifier

1. For each group k = 0, 1, compute an estimate f̂kj of the density fkj for Xj, using the data for which Yi = k.

2. Let
\[ \hat{f}_k(x) = \hat{f}_k(x_1,\ldots,x_d) = \prod_{j=1}^d \hat{f}_{kj}(x_j). \]

3. Let
\[ \hat\pi = \frac{1}{n}\sum_{i=1}^n Y_i. \]

4. Define ĥ as in (228).
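These four steps can be sketched directly using one-dimensional kernel density estimates (density in base R). All the function names below are our own illustrative choices; for real work see, e.g., the naiveBayes function in the e1071 package.

```r
## Minimal naive Bayes classifier with kernel density estimates.
nb.fit = function(x, y) {
  list(pihat = mean(y),                                            # step 3
       d0 = lapply(1:ncol(x), function(j) density(x[y == 0, j])),  # step 1
       d1 = lapply(1:ncol(x), function(j) density(x[y == 1, j])))
}
dens.at = function(d, v) approx(d$x, d$y, xout = v, rule = 2)$y
nb.predict = function(fit, xnew) {
  apply(xnew, 1, function(z) {
    f0 = prod(mapply(dens.at, fit$d0, z))   # step 2: product of 1-d estimates
    f1 = prod(mapply(dens.at, fit$d1, z))
    as.numeric(fit$pihat * f1 > (1 - fit$pihat) * f0)  # step 4, rule (228)
  })
}
set.seed(5)
x = rbind(matrix(rnorm(200), ncol = 2),
          matrix(rnorm(200, mean = 2), ncol = 2))
y = rep(0:1, each = 100)
fit = nb.fit(x, y)
mean(nb.predict(fit, x) != y)   # training error rate
```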


Naive Bayes is closely related to generalized additive models. Under the naive Bayes model,

\[ \log \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \log\frac{\pi f_1(X)}{(1-\pi) f_0(X)} \tag{233} \]

\[ = \log\frac{\pi}{1-\pi} + \log\prod_{j=1}^d \frac{f_{1j}(X_j)}{f_{0j}(X_j)} \tag{234} \]

\[ = \log\frac{\pi}{1-\pi} + \sum_{j=1}^d \log\frac{f_{1j}(X_j)}{f_{0j}(X_j)} \tag{235} \]

\[ = \beta_0 + \sum_{j=1}^d g_j(X_j) \tag{236} \]
which has the form of a generalized additive model. Thus we expect similar performance using naive Bayes or generalized additive models.

15.13 Example. For the SA data (note the use of the gam package):

n = nrow(sa.data)
y = chd
x = sa.data[,1:9]
library(gam)
out = gam(y ~ lo(sbp, span=.25, degree=1) + lo(tobacco, span=.25, degree=1) +
            lo(ldl, span=.25, degree=1) + lo(adiposity, span=.25, degree=1) +
            famhist + lo(typea, span=.25, degree=1) +
            lo(obesity, span=.25, degree=1) + lo(alcohol, span=.25, degree=1) +
            lo(age, span=.25, degree=1))
tmp = fitted(out)
yhat = rep(0,n)
yhat[tmp > .5] = 1
print(table(y,yhat))
   yhat
y     0   1
  0 256  46
  1  77  83
print(mean(y != yhat))
[1] 0.2662338

15.14 Example. Figure 78 (top) shows an artificial data set with two covariates x1 and x2. Figure 78 (middle) shows kernel density estimators f̂1(x1), f̂1(x2), f̂0(x1), f̂0(x2). The top left plot shows the resulting naive Bayes decision boundary. The bottom left plot shows the predictions from a gam model. Clearly, this is similar to the naive Bayes model. The gam model has an error rate of 0.03. In contrast, a linear model yields a classifier with an error rate of 0.78.

Figure 78: Top: artificial data. Middle: kernel density estimates. Bottom: naive Bayes and GAM classifiers.

Figure 79: A simple classification tree.

15.12

Trees

Trees are classification methods that partition the covariate space X into disjoint pieces and then classify the observations according to which partition element they fall in. As the name implies, the classifier can be represented as a tree. For illustration, suppose there are two covariates, X1 = age and X2 = blood pressure. Figure 79 shows a classification tree using these variables. The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 80 shows the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that y ∈ Y = {0, 1} and that there is only a single covariate X. We choose a split point t that divides the real line into two sets A1 = (−∞, t] and A2 = (t, ∞). Let p̂s(j) be the proportion of observations in As such that Yi = j:

\[ \hat{p}_s(j) = \frac{\sum_{i=1}^n I(Y_i=j,\, X_i\in A_s)}{\sum_{i=1}^n I(X_i\in A_s)} \tag{237} \]

for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be

\[ I(t) = \sum_{s=1}^2 \gamma_s \tag{238} \]

where

\[ \gamma_s = 1 - \sum_{j=0}^1 \hat{p}_s(j)^2. \tag{239} \]

Figure 80: Partition representation of the classification tree.

This particular measure of impurity is known as the Gini index. If a partition element As contains all 0s or all 1s, then γs = 0. Otherwise, γs > 0. We choose the split point t to minimize the impurity. (Other indices of impurity can be used besides the Gini index.) When there are several covariates, we choose whichever covariate and split leads to the lowest impurity. This process is continued until some stopping criterion is met. For example, we might stop when every partition element has fewer than n0 data points, where n0 is some fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition element. This procedure is easily generalized to the case where Y ∈ {1, . . . , K}. We simply define the impurity by

\[ \gamma_s = 1 - \sum_{j=1}^K \hat{p}_s(j)^2 \tag{240} \]

where p̂s(j) is the proportion of observations in the partition element for which Y = j.

15.15 Example. Heart disease data.


X = scan("sa.data", skip=1, sep=",")
> Read 5082 items
X = matrix(X, ncol=11, byrow=T)
chd = X[,11]
n = length(chd)
X = X[,-c(1,11)]
names = c("sbp","tobacco","ldl","adiposity","famhist","typea",
          "obesity","alcohol","age")
for(i in 1:9){ assign(names[i], X[,i]) }
famhist = as.factor(famhist)
formula = paste(names, sep="", collapse="+")
formula = paste("chd ~", formula)
formula = as.formula(formula)
print(formula)
> chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity +
>     alcohol + age
chd = as.factor(chd)
d = data.frame(chd,sbp,tobacco,ldl,adiposity,famhist,typea,
               obesity,alcohol,age)
library(tree)
postscript("south.africa.tree.plot1.ps")
out = tree(formula, data=d)
print(summary(out))
> Classification tree:
> tree(formula = formula, data = d)
> Variables actually used in tree construction:
> [1] "age"       "tobacco"   "alcohol"   "typea"     "famhist"   "adiposity"
> [7] "ldl"
> Number of terminal nodes:  15
> Residual mean deviance:  0.8733 = 390.3 / 447
> Misclassification error rate: 0.2078 = 96 / 462
plot(out, type="u", lwd=3)
text(out)
cv = cv.tree(out, method="misclass")
plot(cv, lwd=3)
newtree = prune.tree(out, best=6, method="misclass")
print(summary(newtree))

Figure 81: Tree.

> Classification tree:
> snip.tree(tree = out, nodes = c(2, 28, 29, 15))
> Variables actually used in tree construction:
> [1] "age"     "typea"   "famhist" "tobacco"
> Number of terminal nodes:  6
> Residual mean deviance:  1.042 = 475.2 / 456
> Misclassification error rate: 0.2294 = 106 / 462

plot(newtree, lwd=3)
text(newtree, cex=2)

See Figures 81, 82, 83.
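The Gini impurity (239) and the impurity of a split (238) are simple to compute directly. A small sketch with illustrative names of our own:

```r
## Gini impurity (239) and the impurity of a split (238).
gini = function(y) {
  p = table(y) / length(y)   # class proportions in this partition element
  1 - sum(p^2)
}
split.impurity = function(x, y, t) {
  gini(y[x <= t]) + gini(y[x > t])   # sum over the two sides of the split
}
y = c(0, 0, 0, 1, 1, 1)
x = c(1, 2, 3, 4, 5, 6)
split.impurity(x, y, 3.5)   # perfect split: impurity 0
split.impurity(x, y, 1.5)   # impure right side: 0 + 0.48
```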

Figure 82: Tree (cross-validation misclassification versus size).

Figure 83: Tree (pruned).

15.16 Example. CV for Trees.

To illustrate the concept of ten-fold cross-validation in the context of tree regression, we begin with a dataset of 40 observations (Figure 84). Within this dataset there are two groups: 26 observations from Group 1 (blue circles), and 14 observations from Group 2 (orange squares). The 40 observations are partitioned via tree regression into three sections.

Figure 84: Tree regression splits for the 40 observations.

In Figure 85, we see the proportion of points that are blue circles within each region. For classification, points falling within a region with p_blue > 0.5 will be classified as belonging to Group 1, and points falling in a region with p_blue < 0.5 will be classified as belonging to Group 2. Consequently, points in the left and bottom-right regions would be classified as belonging to Group 1.

Figure 85: Proportion of blue circles in each region (0.79, 0.27, 0.80).

For ten-fold cross-validation, 10% of the data is removed as testing data, and a regression model is created using the remaining 90%. Prediction is made for the testing data based on the model created by the training data, and prediction rates are computed. This process is repeated for each 10% chunk of the data, and the prediction rates across the 10 chunks are averaged to form the ten-fold cross-validation score.

In Figure 86, we illustrate this process for a single 10% chunk from this dataset. 4 of the 40 observations are removed as a testing dataset, and the remaining 36 observations are used to create a tree regression model.

Figure 86: The 36 training observations for one fold, after removing 4 test observations.

The new model lines are shown in Figure 87. The four removed points will be classified using the newly created tree regression model. The two orange squares that were removed now fall into the new left bin, and will be classified incorrectly. The two blue points that were removed still lie in predominantly blue regions, and are classified correctly. Thus, the prediction rate for these four observations is 50%. To find the ten-fold cross-validation score, this process would be repeated 9 more times, each time on a different group of 4 observations.

Figure 87: Old and new splits after refitting without the held-out points.

15.13

Support Vector Machines

In this section we consider a class of linear classifiers called support vector machines. It will be convenient to label the outcomes as −1 and +1 instead of 0 and 1. A linear classifier can then be written as h(x) = sign(H(x)) where x = (x1, . . . , xd),

\[ H(x) = a_0 + \sum_{i=1}^d a_i x_i \]

and

\[ \operatorname{sign}(z) = \begin{cases} -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \\ 1 & \text{if } z > 0. \end{cases} \]

Note that:

classifier correct ⟹ Yi H(Xi) ≥ 0
classifier incorrect ⟹ Yi H(Xi) ≤ 0.

The classification risk is R = P(Y ≠ h(X)) = P(Y H(X) ≤ 0) = E(L(Y H(X))) where the loss function L is L(a) = 1 if a < 0 and L(a) = 0 if a ≥ 0.

Suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly separates the two classes. How can we find a separating hyperplane? LDA is not guaranteed to find it. A separating hyperplane will minimize

\[ -\sum_{i \in \mathcal{M}} Y_i H(X_i) \]

where M is the set of misclassified points. Rosenblatt's perceptron algorithm takes starting values and updates them for each misclassified point:

\[ a_0 \leftarrow a_0 + \rho Y_i, \qquad a \leftarrow a + \rho Y_i X_i \]

where ρ is a learning rate. However, there are many separating hyperplanes. The particular separating hyperplane that this algorithm converges to depends on the starting values. Intuitively, it seems reasonable to choose the hyperplane furthest from the data in the sense that it separates the +1s and −1s and maximizes the distance to the closest point. This hyperplane is called the maximum margin hyperplane. The margin is the distance from the hyperplane to the nearest point. Points on the boundary of the margin are called support vectors. See Figure 88.

15.17 Lemma. The data can be separated by some hyperplane if and only if there exists a hyperplane H(x) = a0 + Σ_{i=1}^d ai xi such that

\[ Y_i H(X_i) \ge 1, \qquad i = 1, \ldots, n. \tag{241} \]

PROOF. Suppose the data can be separated by a hyperplane W(x) = b0 + Σ_{i=1}^d bi xi. It follows that there exists some constant c such that Yi = 1 implies W(Xi) ≥ c and Yi = −1 implies W(Xi) ≤ −c. Therefore, Yi W(Xi) ≥ c for all i. Let H(x) = a0 + Σ_{i=1}^d ai xi where aj = bj/c. Then Yi H(Xi) ≥ 1 for all i. The reverse direction is straightforward.

The goal, then, is to maximize the margin, subject to (241). Given two vectors a and b let ⟨a, b⟩ = a^T b = Σj aj bj denote the inner product of a and b.

15.18 Theorem. Let Ĥ(x) = â0 + Σ_{i=1}^d âi xi denote the optimal (largest margin) hyperplane. Then, for j = 1, . . . , d,

\[ \hat{a}_j = \sum_{i=1}^n \hat\alpha_i Y_i X_j(i) \]


Figure 88: The hyperplane H (x) has the largest margin of all hyperplanes that separate the two classes.


where Xj(i) is the value of the covariate Xj for the ith data point, and α̂ = (α̂1, . . . , α̂n) is the vector that maximizes

\[ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{k=1}^n \alpha_i\alpha_k Y_i Y_k \langle X_i, X_k\rangle \tag{242} \]

subject to αi ≥ 0 and

\[ 0 = \sum_i \alpha_i Y_i. \]

The points Xi for which α̂i ≠ 0 are called support vectors. â0 can be found by solving

\[ \hat\alpha_i\left[ Y_i (X_i^T \hat{a} + \hat{a}_0) - 1 \right] = 0 \]

for any support point Xi. Ĥ may be written as

\[ \hat{H}(x) = \hat{a}_0 + \sum_{i=1}^n \hat\alpha_i Y_i \langle x, X_i\rangle. \]

There are many software packages that will solve this problem quickly. If there is no perfect linear classifier, then one allows overlap between the groups by replacing the condition (241) with

\[ Y_i H(X_i) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, n. \tag{243} \]

The variables ξ1, . . . , ξn are called slack variables. We now maximize (242) subject to 0 ≤ αi ≤ c, i = 1, . . . , n, and

\[ \sum_{i=1}^n \alpha_i Y_i = 0. \]
The constant c is a tuning parameter that controls the amount of overlap. In R we can use the package e1071. 15.19 Example. The iris data. library(e1071) data(iris) x = iris[51:150,] a = x[,5] x = x[,-5] 219

attributes(a) $levels [1] "setosa" $class [1] "factor"

"versicolor" "virginica"

n = length(a) y = rep(0,n) y[a == "versicolor"] = 1 y = as.factor(y) out = svm(x, y) print(out) Call: svm.default(x = x, y = y) Parameters: SVM-Type: SVM-Kernel: cost: gamma:

C-classification radial 1 0.25 33

Number of Support Vectors: summary(out) Call: svm.default(x = x, y = y) Parameters: SVM-Type: SVM-Kernel: cost: gamma:

C-classification radial 1 0.25 33

Number of Support Vectors: ( 17 16 )

Number of Classes:

220

Levels: 0 1

## test with train data pred = predict(out, x) table(pred, y) y pred 0 1 0 49 2 1 1 48 Lets have a look at what is happening with these data and svm. In order to make a 2 dimensional plot of the 4 dimensional data, we will plot the rst 2 principal components of the distance matrix. The supporting vectors are circled. On the whole the two species of iris are separated and the supporting vectors mostly fall near the dividing line. Some dont because these are 4dimensional data shown in 2-dimensions. M = cmdscale(dist(x)) plot(M,col = as.integer(y)+1,pch = as.integer(y)+1) ## support vectors I = 1:n %in% out$index points(M[I,],lwd=2) See Figure 89. Here is another (easier) way to think about the SVM. The SVM hyperplan H (x) = 0 + xT x can be obtained by minimizing
n

i=1

(1 Yi H (Xi ))+ + || ||2 .

Figure 90 compares the svm loss, squared loss, classication error and logistic loss log(1 + eyH (x) ).

221

0.5

q q q qq q q q q q q q q q q q q q q

q q q q q q

M[,2]

0.0

0.5

0 M[,1]

Figure 89:

222

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0 y H(x)

Figure 90: Hinge

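The losses compared in Figure 90 are easy to evaluate numerically. A small sketch, with function names of our own, written in terms of the margin m = yH(x):

```r
## Compare losses as functions of the margin m = y*H(x).
hinge    = function(m) pmax(1 - m, 0)        # SVM (hinge) loss
zero.one = function(m) as.numeric(m < 0)     # classification error
logistic = function(m) log(1 + exp(-m))      # logistic loss
squared  = function(m) (1 - m)^2             # squared loss
m = c(-1, 0, 1, 2)
hinge(m)      # 2 1 0 0 -- zero once the margin exceeds 1
zero.one(m)   # 1 0 0 0
```

Note that the hinge loss upper-bounds the zero-one loss, which is what makes it a useful convex surrogate.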

15.13.1

Kernelization

There is a trick called kernelization for improving a computationally simple classifier h. The idea is to map the covariate X, which takes values in X, into a higher-dimensional space Z and apply the classifier in the bigger space Z. This can yield a more flexible classifier while retaining computational simplicity. The standard example of this idea is illustrated in Figure 91. The covariate is x = (x1, x2). The Yi's can be separated into two groups using an ellipse. Define a mapping φ by

\[ z = (z_1, z_2, z_3) = \phi(x) = \big(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\big). \]

Thus, φ maps X = R^2 into Z = R^3. In the higher-dimensional space Z, the Yi's are separable by a linear decision boundary. In other words, a linear classifier in a higher-dimensional space corresponds to a non-linear classifier in the original space. The point is that to get a richer set of classifiers we do not need to give up the convenience of linear classifiers. We simply map the covariates to a higher-dimensional space. This is akin to making linear regression more flexible by using polynomials.
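The feature map above can be verified numerically: since z1 + z3 = x1^2 + x2^2, a circular boundary in x-space becomes the hyperplane z1 + z3 = r^2 in z-space. A small sketch (simulation setup is our own illustration):

```r
## The map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) turns a circular
## boundary into a linear one: x1^2 + x2^2 = r^2 becomes z1 + z3 = r^2.
phi = function(x) c(x[1]^2, sqrt(2)*x[1]*x[2], x[2]^2)
set.seed(6)
x = matrix(rnorm(400), ncol = 2)
y = as.numeric(rowSums(x^2) < 1)       # label: inside the unit circle?
z = t(apply(x, 1, phi))
## in z-space the rule is linear: z1 + z3 < 1
yhat = as.numeric(z[, 1] + z[, 3] < 1)
mean(yhat != y)   # 0: the linear rule in z-space reproduces the circle
```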

15.14

Other Classiers

There are many other classifiers and space precludes a full discussion of all of them. Let us briefly mention a few.

Bagging is a method for reducing the variability of a classifier. It is most helpful for highly nonlinear classifiers such as trees. We draw B bootstrap samples from the data. The bth bootstrap sample yields a classifier hb. The final classifier is

\[ \hat{h}(x) = \begin{cases} 1 & \text{if } \frac{1}{B}\sum_{b=1}^B h_b(x) \ge \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \]
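Bagging a tree classifier takes only a few lines. The sketch below uses rpart (which ships with R) rather than the tree package used elsewhere in these notes; all function names are our own illustrative choices.

```r
## A minimal bagged tree classifier.
library(rpart)
bagged.fit = function(d, B = 25) {
  lapply(1:B, function(b) {
    idx = sample(nrow(d), replace = TRUE)            # bootstrap sample b
    rpart(y ~ ., data = d[idx, ], method = "class")  # classifier h_b
  })
}
bagged.predict = function(fits, dnew) {
  votes = sapply(fits, function(f)
    as.numeric(predict(f, newdata = dnew, type = "class")) - 1)
  as.numeric(rowMeans(votes) >= 0.5)                 # majority vote
}
set.seed(7)
d = data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y = factor(as.numeric(d$x1 * d$x2 > 0))            # nonlinear boundary
fits = bagged.fit(d, B = 25)
mean(bagged.predict(fits, d) != as.numeric(d$y) - 1) # training error rate
```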

Boosting is a method for starting with a simple classifier and gradually improving it by refitting the data, giving higher weight to misclassified samples. Suppose that H is a collection of classifiers, for example, trees with only one split. Assume that Yi ∈ {−1, 1} and that each h is such that h(x) ∈ {−1, 1}. We usually give equal weight to all data points in the methods we have discussed, but one can incorporate unequal weights quite easily in most algorithms. For example, in constructing a tree, we could replace the impurity measure with a weighted impurity measure. The original version of boosting, called AdaBoost, is as follows.



Figure 91: Kernelization. Mapping the covariates into a higher-dimensional space can make a complicated decision boundary into a simpler decision boundary.


1. Set the weights wi = 1/n, i = 1, . . . , n.

2. For j = 1, . . . , J, do the following steps:

   (a) Construct a classifier hj from the data using the weights w1, . . . , wn.

   (b) Compute the weighted error estimate:
   \[ \hat{L}_j = \frac{\sum_{i=1}^n w_i I(Y_i \neq h_j(X_i))}{\sum_{i=1}^n w_i}. \]

   (c) Let αj = log((1 − L̂j)/L̂j).

   (d) Update the weights:
   \[ w_i \leftarrow w_i\, e^{\alpha_j I(Y_i \neq h_j(X_i))}. \]

3. The final classifier is
\[ h(x) = \operatorname{sign}\left( \sum_{j=1}^J \alpha_j h_j(x) \right). \]
There is now an enormous literature trying to explain and improve on boosting. Whereas bagging is a variance reduction technique, boosting can be thought of as a bias reduction technique. We start with a simple, and hence highly-biased, classifier and gradually reduce the bias. The disadvantage of boosting is that the final classifier is quite complicated. To understand what boosting is doing, consider the following modified algorithm (Friedman, Hastie and Tibshirani (2000), Annals of Statistics, pp. 337-407):


1. Set the weights wi = 1/n, i = 1, . . . , n.

2. For j = 1, . . . , J, do the following steps:

   (a) Construct a weighted binary regression estimate p̂j(x) of P(Y = 1|X = x).

   (b) Let
   \[ f_j(x) = \frac{1}{2}\log\left(\frac{\hat{p}_j(x)}{1-\hat{p}_j(x)}\right). \]

   (c) Set wi ← wi e^{−Yi fj(Xi)}, then normalize the weights to sum to one.

3. The final classifier is
\[ h(x) = \operatorname{sign}\left( \sum_{j=1}^J f_j(x) \right). \]

Consider the risk function J(F) = E(e^{−Y F(X)}). This is minimized by

\[ F(x) = \frac{1}{2}\log\left(\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=-1|X=x)}\right). \]

Thus,

\[ \mathbb{P}(Y=1|X=x) = \frac{e^{2F(x)}}{1+e^{2F(x)}}. \]

Friedman, Hastie and Tibshirani show that stagewise regression, applied to the loss J(F) = E(e^{−Y F(X)}), yields the boosting algorithm. Moreover, this is essentially logistic regression. To see this, let Ỹ = (Y + 1)/2 so that Ỹ ∈ {0, 1}. The logistic log-likelihood is

\[ \ell = \tilde{Y}\log p(x) + (1-\tilde{Y})\log(1-p(x)). \]

Insert Ỹ = (Y + 1)/2 and p = e^{2F}/(1 + e^{2F}) and then ℓ(F) = −log(1 + e^{−2Y F(X)}). Now do a second-order Taylor series expansion around F = 0 to conclude that ℓ(F) ≈ −J(F) + constant. Hence, boosting is essentially stagewise logistic regression.


p

Y = 0 +
j =1

j (0 + T X )

where is a smooth function, often taken to be (v ) = ev /(1 + ev ). This is really nothing more than a nonlinear regression model. Neural nets were fashionable for some time but they pose great computational difculties. In particular, one often encounters multiple minima when trying to nd the least squares estimates of the parameters. Also, the number of terms p is essentially a smoothing parameter and there is the usual problem of trying to choose p to nd a good balance between bias and variance.

This is the simplest version of a neural net. There are more complex versions of the model.

228