
Abdelkader BENHARI

Regression Analysis
Regression analysis is a statistical tool for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another: the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate. To explore such issues, the investigator assembles data on the underlying variables of interest and employs regression to estimate the quantitative effect of the causal variables upon the variable that they influence. The investigator also typically assesses the statistical significance of the estimated relationships, that is, the degree of confidence that the true relationship is close to the estimated relationship.

Contents

1. Prediction
2. Some Terminology
3. Simple Linear Regression: X scalar and r(x) linear
4. Inference
5. ANOVA and R2
6. Prediction Intervals
7. Why Are We Doing This If The Model is Wrong?
8. Association versus Causation
9. Confidence Bands
10. Review of Linear Algebra
11. Multiple Linear Regression
12. Example: Crime Data
13. Testing Subsets of Coefficients
14. The Hat Matrix
15. Weighted Least Squares
16. The Predictive Viewpoint
17. The Bias-Variance Decomposition
18. Diagnostics
19. Outliers
20. Influence
21. Tweaking the Regression
22. Qualitative Variables
23. Variable Selection
24. The Bias-Variance Tradeoff
25. Variable Selection versus Hypothesis Testing
26. Collinearity
27. Robust Regression
28. Nonlinear Regression
29. Logistic Regression
30. More About Logistic Regression
31. Logistic Regression With Replication
32. Generalized Linear Models
33. Measurement Error
34. Nonparametric Regression
35. Choosing the Smoothing Parameter
36. Kernel Regression
37. Local Polynomials
38. Penalized Regression, Regularization and Splines
39. Smoothing Using Orthogonal Functions
40. Variance Estimation
41. Confidence Bands
42. Testing the Fit of a Linear Model
43. Local Likelihood and Exponential Families
44. Multiple Nonparametric Regression
45. Density Estimation
46. Classification
47. Graphical Models
48. Directed Graphs
49. Estimation for DAGs
Homework
Appendix: Clustering
Bibliography

1 Prediction
Suppose (X, Y) have a joint distribution f(x, y). You observe X = x. What is your best prediction of Y? Let g(x) be any prediction function. The prediction error (or risk) is $R(g) = \mathbb{E}(Y - g(X))^2$. Let

$$r(x) = \mathbb{E}(Y \mid X = x) = \int y f(y \mid x)\,dy,$$

the regression function. Key result: for any g, $R(r) \le R(g)$.

Let $\epsilon = Y - r(X)$. Then $\mathbb{E}(\epsilon) = 0$ and we can write

$$Y = r(X) + \epsilon. \qquad (1)$$

But we don't know r(x), so we estimate it from the data.
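As a quick numerical illustration (a sketch added here, not from the original notes; the joint distribution below is made up), we can check in R that the regression function has smaller prediction risk than another predictor:

set.seed(1)
n = 100000
x = runif(n)                      # X ~ Uniform(0,1)
y = x^2 + rnorm(n, sd = 0.1)      # so r(x) = E(Y|X=x) = x^2
r = function(x) x^2               # the regression function
g = function(x) x                 # some other prediction function
mean((y - r(x))^2)                # approx 0.01, the risk R(r)
mean((y - g(x))^2)                # larger, as the key result R(r) <= R(g) predicts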

2 Some Terminology
Given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ we have two goals:

estimation: find an estimate $\hat r(x)$ of the regression function $r(x)$.
prediction: given a new X, predict Y; we use $\hat Y = \hat r(X)$ as the prediction.

At first we assume that $Y_i \in \mathbb{R}$. Later in the course, we consider other cases such as $Y_i \in \{0, 1\}$.

              X scalar                                  X vector
r linear      $r(x) = \beta_0 + \beta_1 x$              $r(x) = \beta_0 + \sum_j \beta_j x_j$
              (simple linear regression)                (multiple linear regression)
r arbitrary   $r(x)$ is some smooth function            $r(x_1, \ldots, x_p)$ is some smooth function
              (nonparametric regression)                (multiple nonparametric regression)

3 Simple Linear Regression: X scalar and r(x) linear


Suppose that $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}$ and that

$$r(x) = \beta_0 + \beta_1 x. \qquad (2)$$

This model is wrong. There is no reason to assume that r is linear. We make this assumption tentatively but we will drop it later. I use a special symbol to alert you to model-based statements.

Figure 1: Cat example (Y = Heart Weight versus X = Body Weight).

We can write

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \qquad (3)$$

where $\mathbb{E}(\epsilon_i) = 0$ and $\epsilon_1, \ldots, \epsilon_n$ are independent. We also assume that $\mathbb{V}(\epsilon_i) = \sigma^2$ does not depend on x (homoskedasticity). The unknown parameters are $\beta_0$, $\beta_1$, $\sigma^2$. Define the residual sum of squares

$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n \bigl(Y_i - (\beta_0 + \beta_1 X_i)\bigr)^2. \qquad (4)$$

The least squares (LS) estimators minimize $\mathrm{RSS}(\beta_0, \beta_1)$.

3.1 Theorem. The LS estimators are

$$\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^n (X_i - \overline{X})^2} \qquad (5)$$

$$\hat\beta_0 = \overline{Y} - \hat\beta_1 \overline{X} \qquad (6)$$

where $\overline{X} = n^{-1}\sum_{i=1}^n X_i$ and $\overline{Y} = n^{-1}\sum_{i=1}^n Y_i$.

We define:
The fitted line: $\hat r(x) = \hat\beta_0 + \hat\beta_1 x$
The predicted or fitted values: $\hat Y_i = \hat r(X_i) = \hat\beta_0 + \hat\beta_1 X_i$
The residuals: $\hat\epsilon_i = Y_i - \hat Y_i$
The residual sum of squares: $\mathrm{RSS} = \sum_{i=1}^n \hat\epsilon_i^2$

An unbiased estimate of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2}. \qquad (7)$$
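A minimal check of formulas (5)-(7) in R (simulated data of my own, not from the notes), comparing the hand-computed estimates with lm:

set.seed(2)
n = 50
X = rnorm(n)
Y = 1 + 2*X + rnorm(n)
b1 = sum((X - mean(X))*(Y - mean(Y))) / sum((X - mean(X))^2)   # equation (5)
b0 = mean(Y) - b1*mean(X)                                      # equation (6)
rss = sum((Y - (b0 + b1*X))^2)
sigma2.hat = rss/(n - 2)                                       # equation (7)
c(b0, b1, sigma2.hat)
coef(lm(Y ~ X))                                                # matches b0, b1
summary(lm(Y ~ X))$sigma^2                                     # matches sigma2.hat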

The estimators are random variables and have the following properties (conditional on $X_1, \ldots, X_n$):

$$\mathbb{E}(\hat\beta_0) = \beta_0, \qquad \mathbb{E}(\hat\beta_1) = \beta_1, \qquad \mathbb{V}(\hat\beta_1) = \frac{\sigma^2}{n\,s_x^2}$$

where $s_x^2 = n^{-1}\sum_{i=1}^n (X_i - \overline{X})^2$. Also, $\mathbb{E}(\hat\sigma^2) = \sigma^2$. The standard error and its estimate are

$$\mathrm{se}(\hat\beta_1) = \frac{\sigma}{\sqrt{n}\,s_x}, \qquad \widehat{\mathrm{se}}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{n}\,s_x}.$$

Approximate Normality:

$$\hat\beta_0 \approx N\bigl(\beta_0, \widehat{\mathrm{se}}^2(\hat\beta_0)\bigr), \qquad \hat\beta_1 \approx N\bigl(\beta_1, \widehat{\mathrm{se}}^2(\hat\beta_1)\bigr). \qquad (8)$$

If $\epsilon_i \sim N(0, \sigma^2)$ then:

1. Equation (8) is exact.
2. The least squares estimators are the maximum likelihood estimators.
3. The variance estimator satisfies $\hat\sigma^2 \sim \frac{\sigma^2}{n-2}\,\chi^2_{n-2}$.

4 Inference
It follows from (8) that an approximate $1-\alpha$ confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1) \qquad (9)$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of a standard Normal: $\mathbb{P}(Z > z_{\alpha/2}) = \alpha/2$, where $Z \sim N(0,1)$. For $\alpha = .05$, $z_{\alpha/2} = 1.96 \approx 2$, so an approximate 95 per cent confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm 2\,\widehat{\mathrm{se}}(\hat\beta_1). \qquad (10)$$

4.1 Remark. If the residuals are Normal, then an exact $1-\alpha$ confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm t_{\alpha/2, n-2}\,\widehat{\mathrm{se}}(\hat\beta_1) \qquad (11)$$

where $t_{\alpha/2, n-2}$ is the upper $\alpha/2$ quantile of a t distribution with n-2 degrees of freedom. This interval is bogus. If n is large, $t_{\alpha/2, n-2} \approx z_{\alpha/2}$, so just use the Normal interval. If n is so small that $t_{\alpha/2, n-2}$ is much different from $z_{\alpha/2}$, then n is too small to be doing statistical inference. (Do you really believe that the residuals are exactly Normal anyway?)

To test

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0 \qquad (12)$$

use the test statistic

$$z = \frac{\hat\beta_1 - 0}{\widehat{\mathrm{se}}(\hat\beta_1)}. \qquad (13)$$

Under $H_0$, $z \approx N(0,1)$. The p-value is

$$\text{p-value} = \mathbb{P}(|Z| > |z|) = 2\Phi(-|z|) \qquad (14)$$

where $Z \sim N(0,1)$. Reject $H_0$ if the p-value is small.

4.2 Example. Here is an example. The plots are shown in Figure 2.

### Cat example ###
library(MASS)
attach(cats); help(cats)
names(cats)
[1] "Sex" "Bwt" "Hwt"
summary(cats)
 Sex        Bwt             Hwt
 F:47   Min.   :2.000   Min.   : 6.30
 M:97   1st Qu.:2.300   1st Qu.: 8.95
        Median :2.700   Median :10.10
        Mean   :2.724   Mean   :10.63
        3rd Qu.:3.025   3rd Qu.:12.12
        Max.   :3.900   Max.   :20.50

postscript("cat.ps",horizontal=F)
par(mfrow=c(2,2))
boxplot(cats[,2:3])
plot(Bwt,Hwt)
out = lm(Hwt ~ Bwt, data = cats)
summary(out)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3567     0.6923  -0.515    0.607
Bwt           4.0341     0.2503  16.119   <2e-16 ***
---
Residual standard error: 1.452 on 142 degrees of freedom
Multiple R-Squared: 0.6466,  Adjusted R-squared: 0.6441
F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16

abline(out,lwd=3)
names(out)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
r = out$residuals
plot(Bwt,r,pch=19)
lines(Bwt,rep(0,length(Bwt)),lty=3,col=2,lwd=3)
qqnorm(r)
dev.off()

How a qq-plot works. If you are not familiar with qq-plots, read this. Order the data: $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$. Let $z_j = \Phi^{-1}(j/n)$. (Actually, we don't quite use j/n, but never mind.) Plot $X_{(j)}$ versus $z_j$. If $X_i \sim N(\mu, \sigma^2)$ then this plot should be a straight line with slope $\sigma$ and intercept $\mu$. Why? Let

$$\hat F(x) = \frac{\#\{\text{observations} \le x\}}{n}.$$

Then $\hat F(x) \approx F(x) = \mathbb{P}(X \le x)$. Note that $\hat F(X_{(j)}) = j/n$ and so

$$X_{(j)} = \hat F^{-1}\!\left(\frac{j}{n}\right) \approx F^{-1}\!\left(\frac{j}{n}\right) = \sigma\,\Phi^{-1}\!\left(\frac{j}{n}\right) + \mu = \sigma z_j + \mu.$$

We used the fact that $F^{-1}(q) = \sigma\Phi^{-1}(q) + \mu$. Here is a proof of this fact. Let $x_q = F^{-1}(q)$ be the qth quantile. Then

$$q = F(x_q) = \mathbb{P}(X \le x_q) = \mathbb{P}\!\left(\frac{X-\mu}{\sigma} \le \frac{x_q-\mu}{\sigma}\right) = \mathbb{P}\!\left(Z \le \frac{x_q-\mu}{\sigma}\right) = \Phi\!\left(\frac{x_q-\mu}{\sigma}\right)$$

and hence $F^{-1}(q) = x_q = \sigma\Phi^{-1}(q) + \mu$.
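To make the construction concrete, here is a small sketch (my own illustration, with simulated data) building a qq-plot by hand and comparing it with qqnorm:

set.seed(3)
mu = 5; sigma = 2
X = rnorm(100, mean = mu, sd = sigma)
n = length(X)
z = qnorm((1:n)/(n + 1))          # theoretical quantiles (a common variant of j/n)
plot(z, sort(X))                  # roughly a line with slope sigma, intercept mu
abline(a = mu, b = sigma, lty = 2)
qqnorm(X)                         # R's built-in version (uses slightly different plotting points)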

4.3 Example (Election 2000). Figure 3 shows the plot of votes for Buchanan (Y) versus votes for Bush (X) in Florida. The least squares estimates (omitting Palm Beach County) and the standard errors are

$$\hat\beta_0 = 66.0991, \quad \widehat{\mathrm{se}}(\hat\beta_0) = 17.2926; \qquad \hat\beta_1 = 0.0035, \quad \widehat{\mathrm{se}}(\hat\beta_1) = 0.0002.$$

The fitted line is

Buchanan = 66.0991 + 0.0035 Bush.

Figure 3 also shows the residuals. The inferences from linear regression are most accurate when the residuals behave like random normal numbers. Based on the residual plot, this is not the case in this example. If we repeat the analysis replacing votes with log(votes) we get

$$\hat\beta_0 = 2.3298, \quad \widehat{\mathrm{se}}(\hat\beta_0) = 0.3529; \qquad \hat\beta_1 = 0.7303, \quad \widehat{\mathrm{se}}(\hat\beta_1) = 0.0358.$$

This gives the fit

log(Buchanan) = 2.3298 + 0.7303 log(Bush).

The residuals look much healthier. Later, we shall address the following question: how do we see whether Palm Beach County has a statistically plausible outcome? On the log scale, a 95 per cent confidence interval for $\beta_1$ is $.7303 \pm 2(.0358) = (.66, .80)$. The statistic for testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$ is $|z| = |.7303 - 0|/.0358 = 20.40$ with a p-value of $\mathbb{P}(|Z| > 20.40) \approx 0$. This is strong evidence that the true slope is not 0.

Figure 2: Cat example (boxplots of Bwt and Hwt; Hwt versus Bwt with fitted line; residuals versus Bwt; Normal QQ plot of the residuals).

Figure 3: Voting data for Election 2000 (Buchanan versus Bush votes with residuals, on the raw and log scales).

5 ANOVA and R2

In the olden days, statisticians were obsessed with summarizing things in analysis of variance (ANOVA) tables. It works like this. We can write

$$\sum_{i=1}^n (Y_i - \overline{Y})^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \overline{Y})^2$$
$$\mathrm{SS}_{\text{total}} = \mathrm{RSS} + \mathrm{SS}_{\text{reg}}.$$

Then we create this table:

Source       df     SS        MS           F
Regression   1      SSreg     SSreg/1      MSreg/MSE
Residual     n-2    RSS       RSS/(n-2)
Total        n-1    SStotal

Under $H_0: \beta_1 = 0$, $F \sim F_{1, n-2}$. This is just another (equivalent) way to test this hypothesis.

The coefficient of determination is

$$R^2 = \frac{\mathrm{SS}_{\text{reg}}}{\mathrm{SS}_{\text{tot}}} = 1 - \frac{\mathrm{RSS}}{\mathrm{SS}_{\text{tot}}}. \qquad (15)$$

This is the amount of variability in Y explained by X. Also, $R^2 = r^2$ where

$$r = \frac{\sum_{i=1}^n (Y_i - \overline{Y})(X_i - \overline{X})}{\sqrt{\sum_{i=1}^n (Y_i - \overline{Y})^2 \sum_{i=1}^n (X_i - \overline{X})^2}}$$

is the sample correlation. This is an estimate of the correlation

$$\rho = \frac{\mathbb{E}\bigl[(X - \mu_X)(Y - \mu_Y)\bigr]}{\sigma_X\sigma_Y}.$$

Note that $-1 \le \rho \le 1$.

5.1 Remark. What happens to $R^2$ if we move the minimum $x_i$ further to the left and move the maximum $x_i$ further to the right?

6 Prediction Intervals
Given a new value $X_*$, we want to predict

$$Y_* = \beta_0 + \beta_1 X_* + \epsilon. \qquad (16)$$

The prediction is $\hat Y_* = \hat\beta_0 + \hat\beta_1 X_*$. Define

$$\widehat{\mathrm{se}}_{\text{pred}}(\hat Y_*) = \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(X_* - \overline{X})^2}{\sum_{i=1}^n (X_i - \overline{X})^2}}. \qquad (17)$$

A confidence interval for $Y_*$ is

$$\hat Y_* \pm z_{\alpha/2}\,\widehat{\mathrm{se}}_{\text{pred}}(\hat Y_*).$$

6.1 Remark. This is not really the standard error of the quantity $\hat Y_*$. It is the standard error of $\beta_0 + \beta_1 X_* + \epsilon$. Note that $\widehat{\mathrm{se}}_{\text{pred}}(\hat Y_*)$ does not go to 0 as $n \to \infty$. Why?

6.2 Example (Election Data Revisited). On the log scale, our linear regression gives the following prediction equation: log(Buchanan) = 2.3298 + 0.7303 log(Bush). In Palm Beach, Bush had 152,954 votes and Buchanan had 3,467 votes. On the log scale this is 11.93789 and 8.151045. How likely is this outcome, assuming our regression model is appropriate? Our prediction for log Buchanan votes is 2.3298 + .7303 x 11.93789 = 6.388441. Now, 8.151045 is bigger than 6.388441, but is it significantly bigger? Let us compute a confidence interval. We find that se_pred = .093775 and the approximate 95 per cent confidence interval is (6.200, 6.578), which clearly excludes 8.151. Indeed, 8.151 is nearly 20 standard errors from $\hat Y_*$. Going back to the vote scale by exponentiating, the confidence interval is (493, 717), compared to the actual number of votes, which is 3,467.
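In R, predict() with interval="prediction" computes an interval of this general form (it uses a t quantile and the fitted lm object). A minimal sketch of my own on the cats data, also checking it against formula (17) with a Normal quantile; this is an added illustration, not a reproduction of the election numbers:

library(MASS)
out = lm(Hwt ~ Bwt, data = cats)
new = data.frame(Bwt = 3.0)
predict(out, newdata = new, interval = "prediction", level = 0.95)
# by hand, using equation (17):
s = summary(out)$sigma
se.pred = s*sqrt(1 + 1/nrow(cats) +
                 (3.0 - mean(cats$Bwt))^2/sum((cats$Bwt - mean(cats$Bwt))^2))
predict(out, newdata = new) + c(-1,1)*qnorm(.975)*se.pred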

7 Why Are We Doing This If The Model is Wrong?


The model $Y = \beta_0 + \beta_1 x + \epsilon$ is certainly false: there is no reason why r(x) should be exactly linear. Nonetheless, the linear assumption might be adequate. But how do we assess whether the linear assumption is adequate? There are three ways.

1. We can do a goodness-of-fit test.
2. We can do a nonparametric regression that does not assume linearity.
3. We can take a purely predictive point of view and regard $\hat\beta_0 + \hat\beta_1 x$ as an estimate of the best linear predictor, not as an estimate of the true regression function.

We will return to these points later.

8 Association Versus Causation


There is much confusion about the difference between causation and association. Roughly speaking, the statement "X causes Y" means that changing the value of X will change the distribution of Y. When X causes Y, X and Y will be associated, but the reverse is not, in general, true. Association does not necessarily imply causation. For example, there is a strong linear relationship between death rate due to breast cancer and fat intake. So,

$$\text{RISK OF DEATH} = \beta_0 + \beta_1\,\text{FAT} + \epsilon \qquad (18)$$

where $\beta_1 > 0$. Does that mean that fat causes breast cancer? Consider two interpretations of (18).

ASSOCIATION (or correlation). Fat intake and breast cancer are associated. Therefore, if I observe someone's fat intake, I can use equation (18) to predict their chance of dying from breast cancer.

CAUSATION. Fat intake causes breast cancer. Therefore, if I observe someone's fat intake, I can use equation (18) to predict their chance of dying from breast cancer. Moreover, if I change someone's fat intake by one unit, their risk of death from breast cancer changes by $\beta_1$.

If the data are from a randomized study (X is randomly assigned) then the causal interpretation is correct. If the data are from an observational study (X is not randomly assigned) then the association interpretation is correct. To see why the causal interpretation is wrong in the observational study, suppose that people with high fat intake are the rich people, and suppose, for the sake of the example, that rich people smoke a lot. Further, suppose that smoking does cause cancer. Then it will be true that high fat intake predicts a high cancer rate. But changing someone's fat intake will not change their cancer risk.

How can we make these ideas precise? The answer is to use either counterfactuals or directed acyclic graphs. Look at the top left plot in Figure 4. These are observed data on vitamin C (X) and colds (Y). You conclude that increasing vitamin C decreases colds. You tell everyone to take more vitamin C, but the prevalence of colds stays the same. Why? Look at the second plot. The dotted lines show the counterfactuals. The counterfactual $y_i(x)$ is the value person i would have had if they had taken dose $X = x$. Note that

$$Y_i = y_i(X_i). \qquad (19)$$

In other words: $Y_i$ is the function $y_i(\cdot)$ evaluated at $X_i$. The causal regression is the average of the counterfactual curves $y_i(x)$:

Figure 4: Causation (observed data, counterfactual curves, and the causal regression function c(x) for two scenarios).

$$c(x) = \mathbb{E}(y_i(x)). \qquad (20)$$

The average is over the population. In other words, fix a value of x and then average $y_i(x)$ over all individuals. In general,

$$r(x) \neq c(x): \quad \text{association does not equal causation.} \qquad (21)$$

In this example, changing everyone's dose does not change the outcome. The causal regression curve c(x) is shown in the third plot. In the second example (right side of Figure 4) it is worse: you tell everyone to take more vitamin C, but the prevalence of colds increases.

Suppose that we randomly assign the dose X. Then $X_i$ is independent of the counterfactuals $\{y_i(x): x \in \mathbb{R}\}$. In that case:

$$c(x) = \mathbb{E}(y(x)) \qquad (22)$$
$$= \mathbb{E}(y(x) \mid X = x) \quad \text{since } X \text{ is independent of } \{y(x): x \in \mathbb{R}\} \qquad (23)$$
$$= \mathbb{E}(Y \mid X = x) \quad \text{since } Y = y(X) \qquad (24)$$
$$= r(x). \qquad (25)$$

Thus, if X is randomly assigned then association is equal to causation. In an observational (non-randomized) study, the best we can do is try to measure confounding variables. These are variables that affect both X and Y. If we can find all the confounding variables Z, then $\{y(x): x \in \mathbb{R}\}$ is independent of X given Z. Hence,

$$c(x) = \mathbb{E}(y(x)) \qquad (26)$$
$$= \int \mathbb{E}(y(x) \mid Z = z)\,f(z)\,dz \qquad (27)$$
$$= \int \mathbb{E}(y(x) \mid Z = z, X = x)\,f(z)\,dz \quad \text{since } X \text{ is independent of } \{y_i(x): x \in \mathbb{R}\} \text{ given } Z \qquad (28)$$
$$= \int \mathbb{E}(Y \mid X = x, Z = z)\,f(z)\,dz \qquad (29)$$
$$= \int (\beta_1 x + \beta_2 z)\,f(z)\,dz \quad \text{if linear} \qquad (30)$$
$$= \beta_1 x + \beta_2\,\mathbb{E}(Z). \qquad (31)$$

This is called adjusting for confounders. Of course, we can never be sure we have included all confounders. This is why observational studies have to be treated with caution. Note the following difference:

$$c(x) = \int \mathbb{E}(Y \mid Z = z, X = x)\,f(z)\,dz \qquad (32)$$
$$\mathbb{E}(Y \mid X = x) = \int \mathbb{E}(Y \mid Z = z, X = x)\,f(z \mid x)\,dz. \qquad (33)$$

9 Confidence Bands

9.1 Theorem (Scheffe, 1959). Let

$$I(x) = \bigl(\hat r(x) - c,\; \hat r(x) + c\bigr) \qquad (34)$$

where $\hat r(x) = \hat\beta_0 + \hat\beta_1 x$ and

$$c = \hat\sigma\sqrt{2F_{\alpha,2,n-2}}\sqrt{\frac{1}{n} + \frac{(x - \overline{x})^2}{\sum_i (x_i - \overline{x})^2}}.$$

Then

$$\mathbb{P}\bigl(r(x) \in I(x) \text{ for all } x\bigr) \ge 1 - \alpha. \qquad (35)$$

9.2 Example. Let us return to the cat example. The R code is:

library(MASS)
attach(cats)
plot(Bwt,Hwt)
out = lm(Hwt ~ Bwt, data = cats)
abline(out,lwd=3)
r = out$residuals
n = length(Bwt)
x = seq(min(Bwt),max(Bwt),length=1000)
d = qf(.95,2,n-2)
beta = out$coeff
xbar = mean(Bwt)
ssx = sum( (Bwt-xbar)^2 )
sigma.hat = sqrt(sum(r^2)/(n-2))
stuff = sqrt(2*d)*sqrt( (1/n) + ((x-xbar)^2/ssx) )*sigma.hat
### Important: Note that these are all scalars except that x is a vector.
r.hat = beta[1] + beta[2]*x
upper = r.hat + stuff
lower = r.hat - stuff
lines(x,upper,lty=2,col=2,lwd=3)
lines(x,lower,lty=2,col=2,lwd=3)

Figure 5: Confidence band for the cat example (Hwt versus Bwt with fitted line and Scheffe band).

The bands are shown in Figure 5.


10 Review of Linear Algebra


Before starting multiple regression, we will briefly review some linear algebra. Read pages 278-287 of Weisberg.

The inner product of two vectors x and y is

$$\langle x, y\rangle = x^T y = \sum_j x_j y_j.$$

Two vectors are orthogonal if $\langle x, y\rangle = 0$; we then write $x \perp y$. The norm of a vector is

$$\|x\| = \sqrt{\langle x, x\rangle} = \sqrt{\sum_j x_j^2}.$$

If A is a matrix, denote its inverse by $A^{-1}$ and its transpose by $A^T$. The trace of a square matrix, denoted tr(A), is the sum of its diagonal elements.

PROJECTIONS. We will make extensive use of projections. Let us start with a simple example. Let $e_1 = (1, 0)$, $e_2 = (0, 1)$ and note that $\mathbb{R}^2$ is the linear span of $e_1$ and $e_2$: any vector $(a, b) \in \mathbb{R}^2$ is a linear combination of $e_1$ and $e_2$. Let $L = \{ae_1 : a \in \mathbb{R}\}$ be the set of vectors of the form (a, 0). Note that L is a linear subspace of $\mathbb{R}^2$. Given a vector $x = (a, b) \in \mathbb{R}^2$, the projection $\hat x$ of x onto L is the vector in L that is closest to x. In other words, $\hat x$ minimizes $\|x - \hat x\|$ among all vectors in L. Another characterization of $\hat x$ is this: it is the unique vector such that (i) $\hat x \in L$ and (ii) $x - \hat x \perp y$ for all $y \in L$. It is easy to see, in our simple example, that the projection of $x = (a, b)$ is just (a, 0). Note that we can write $\hat x = Px$ where

$$P = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.$$

This is the projection matrix. In general, given a vector space V and a linear subspace L there is a projection matrix P that maps any vector v into its projection Pv. The projection matrix satisfies these properties:

- Pv exists and is unique.
- P is linear: if a and b are scalars then $P(ax + by) = aPx + bPy$.
- P is symmetric.
- P is idempotent: $P^2 = P$.
- If $x \in L$ then $Px = x$.

Now let $\mathbb{X}$ be some $n \times q$ matrix and suppose that $\mathbb{X}^T\mathbb{X}$ is invertible. The column space is the space L of all vectors that can be obtained by taking linear combinations of the columns of $\mathbb{X}$. It can be shown that the projection matrix for the column space is

$$P = \mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T.$$

Exercise: check that P is idempotent and that if $x \in L$ then $Px = x$.
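As a quick numerical check (a sketch I added; the matrix below is arbitrary), the column-space projection matrix is symmetric, idempotent, and fixes vectors in the column space:

set.seed(4)
X = cbind(1, rnorm(10), runif(10))        # an arbitrary 10 x 3 matrix with full column rank
P = X %*% solve(t(X) %*% X) %*% t(X)      # projection onto the column space of X
max(abs(P - t(P)))                        # ~ 0: symmetric
max(abs(P %*% P - P))                     # ~ 0: idempotent
v = X %*% c(1, -2, 3)                     # a vector in the column space
max(abs(P %*% v - v))                     # ~ 0: P v = v
sum(diag(P))                              # trace = 3 = rank of X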

RANDOM VECTORS. Let Y be a random vector. Denote the mean vector by $\mu$ and the covariance matrix by $\mathbb{V}(Y)$ or $\mathrm{Cov}(Y) = \Sigma$. If a is a vector then

$$\mathbb{E}(a^T Y) = a^T\mu, \qquad \mathbb{V}(a^T Y) = a^T\Sigma a. \qquad (36)$$

If A is a matrix then

$$\mathbb{E}(AY) = A\mu, \qquad \mathbb{V}(AY) = A\Sigma A^T. \qquad (37)$$

11 Multiple Linear Regression


The multiple linear regression model is

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon = \beta^T X + \epsilon \qquad (38)$$

where $\beta = (\beta_0, \ldots, \beta_p)^T$ and $X = (1, X_1, \ldots, X_p)^T$. The value of the jth covariate for the ith subject is denoted by $X_{ij}$. Thus

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i. \qquad (39)$$

At this point, it is convenient to use matrix notation. Let

$$\mathbb{X}_{n\times q} = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}.$$

Each subject corresponds to one row. The number of columns of $\mathbb{X}$ will be denoted by q. Now define

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}. \qquad (40)$$

We can then rewrite (38) as

$$Y = \mathbb{X}\beta + \epsilon. \qquad (41)$$

Note that $Y_i = X_i^T\beta + \epsilon_i$ where $X_i^T$ is the ith row of $\mathbb{X}$. The RSS is given by

$$\mathrm{RSS}(\beta) = \sum_i (Y_i - X_i^T\beta)^2 = (Y - \mathbb{X}\beta)^T(Y - \mathbb{X}\beta). \qquad (42)$$

11.1 Theorem. The least squares estimator is

$$\hat\beta = SY \qquad (43)$$

where

$$S = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T \qquad (44)$$

assuming that $\mathbb{X}^T\mathbb{X}$ is invertible.


The fitted values are $\hat Y = \mathbb{X}\hat\beta$ and the residuals are $\hat\epsilon = Y - \hat Y$. Thus, $\mathrm{RSS} = \|\hat\epsilon\|^2 = \sum_i \hat\epsilon_i^2$. The variance is estimated by

$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - p - 1} = \frac{\mathrm{RSS}}{n - q}. \qquad (45)$$

11.2 Theorem. The estimators satisfy the following properties.

1. $\mathbb{E}(\hat\beta) = \beta$.
2. $\mathbb{V}(\hat\beta) = \sigma^2(\mathbb{X}^T\mathbb{X})^{-1}$.
3. $\hat\beta \approx MN(\beta, \sigma^2(\mathbb{X}^T\mathbb{X})^{-1})$ (multivariate Normal).
4. An approximate $1-\alpha$ confidence interval for $\beta_j$ is

$$\hat\beta_j \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_j) \qquad (46)$$

where $\widehat{\mathrm{se}}(\hat\beta_j)$ is the square root of the appropriate diagonal element of the matrix $\hat\sigma^2(\mathbb{X}^T\mathbb{X})^{-1}$.

Let's prove the first two assertions. Note that

$$\mathbb{E}(\hat\beta) = \mathbb{E}(SY) = S\,\mathbb{E}(Y) = S\mathbb{X}\beta = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{X}\beta = \beta.$$

Also, since $\mathbb{V}(Y) = \sigma^2 I$, where I is the identity matrix,

$$\mathbb{V}(\hat\beta) = \mathbb{V}(SY) = S\,\mathbb{V}(Y)S^T = \sigma^2 SS^T = \sigma^2(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1} = \sigma^2(\mathbb{X}^T\mathbb{X})^{-1}.$$
The ANOVA table is

Source       df     SS        MS            F
Regression   q-1    SSreg     SSreg/p       MSreg/MSE
Residual     n-q    RSS       RSS/(n-p-1)
Total        n-1    SStotal

The F test has $F \sim F_{p, n-p-1}$. This is testing the hypothesis $H_0: \beta_1 = \cdots = \beta_p = 0$. Testing this hypothesis is of limited value.

12 Example: Crime Data


### multiple linear regression crime data
x = scan("/=classes/=stat707/=data/crime.dat",skip=1)
x = matrix(x,ncol=14,byrow=T)
names = c("Crime","Age","Southern","Education",
          "Expenditure","Ex1","Labor","Males",
          "pop","NW","U1","U2","Wealth","X")
crime.dat = as.data.frame(x)
names(crime.dat) = names

postscript("crime.ps",horizontal=F)
boxplot(crime.dat)
out = lm(Crime ~ Age + Southern + Education +
         Expenditure + Labor + Males +
         pop + U1 + U2 + Wealth, data=crime.dat)
print(summary(out))
r = out$residuals
qqnorm(r)
dev.off()

Read 658 items

Call:
lm(formula = Crime ~ Age + Southern + Education + Expenditure +
    Labor + Males + pop + U1 + U2 + Wealth, data = crime.dat)

Residuals:
     Min       1Q   Median       3Q      Max
-43.6447 -13.3772   0.5895  12.1430  55.4624

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -589.39985  167.59057  -3.517 0.001201 **
Age            1.04058    0.44639   2.331 0.025465 *
Southern      11.29464   13.24510   0.853 0.399441
Education      1.17794    0.68181   1.728 0.092620 .
Expenditure    0.96364    0.24955   3.862 0.000451 ***
Labor          0.10604    0.15327   0.692 0.493467
Males          0.30353    0.22269   1.363 0.181344
pop            0.09042    0.13866   0.652 0.518494
U1            -0.68179    0.48079  -1.418 0.164774
U2             2.15028    0.95078   2.262 0.029859 *
Wealth        -0.08309    0.09099  -0.913 0.367229
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 24.56 on 36 degrees of freedom
Multiple R-Squared: 0.6845,  Adjusted R-squared: 0.5968
F-statistic: 7.809 on 10 and 36 DF,  p-value: 1.704e-06

13 Testing Subsets of Coefficients


Suppose you want to test whether a set of coefficients is 0. Use

$$F = \frac{(\mathrm{RSS}_{\text{small}} - \mathrm{RSS}_{\text{big}})/(\mathrm{df}_{\text{small}} - \mathrm{df}_{\text{big}})}{\mathrm{RSS}_{\text{big}}/\mathrm{df}_{\text{big}}} \qquad (47)$$

which has an $F_{a,b}$ distribution under $H_0$, where $a = \mathrm{df}_{\text{small}} - \mathrm{df}_{\text{big}}$ and $b = \mathrm{df}_{\text{big}}$.

13.1 Example. Let's try dropping the unemployment and labor variables.

## Fit full model
out = lm(Crime ~ Age + Southern + Education +
         Expenditure + Labor + Males +
         pop + U1 + U2 + Wealth, data=crime.dat)
anova(out)

>Analysis of Variance Table
>
>Response: Crime
>            Df  Sum Sq Mean Sq F value    Pr(>F)
>Age          1   550.8   550.8  0.9133 0.3456065
>Southern     1   153.7   153.7  0.2548 0.6167591
>Education    1  9056.7  9056.7 15.0166 0.0004333 ***
>Expenditure  1 30760.3 30760.3 51.0027 2.142e-08 ***
>Labor        1  1207.0  1207.0  2.0012 0.1657635
>Males        1  1381.5  1381.5  2.2906 0.1388888
>pop          1   528.8   528.8  0.8768 0.3553240
>U1           1   198.7   198.7  0.3295 0.5695451
>U2           1  2756.8  2756.8  4.5710 0.0393779 *
>Wealth       1   502.9   502.9  0.8339 0.3672287
>Residuals   36 21712.0   603.1

## drop Labor and U1 and U2
out = lm(Crime ~ Age + Southern + Education +
         Expenditure + Males + pop + Wealth, data=crime.dat)

Figure 6: Crime data example (boxplots of the variables and Normal QQ plot of the residuals).

anova(out)

>Analysis of Variance Table
>
>Response: Crime
>            Df  Sum Sq Mean Sq F value    Pr(>F)
>Age          1   550.8   550.8  0.8493 0.3624211
>Southern     1   153.7   153.7  0.2370 0.6291234
>Education    1  9056.7  9056.7 13.9636 0.0005963 ***
>Expenditure  1 30760.3 30760.3 47.4262 3.067e-08 ***
>Males        1  2092.7  2092.7  3.2265 0.0802032 .
>pop          1   667.6   667.6  1.0294 0.3165618
>Wealth       1   232.2   232.2  0.3581 0.5530417
>Residuals   39 25295.1   648.6

top = (25295.1-21712)/(39-36)
bottom = 21712/36
f = top/bottom
pvalue = 1-pf(f,3,36)
print(f)
> 1.980343
print(pvalue)
> 0.1343155

We conclude that these variables are not important in the regression. However, we should only do this test if there is some a priori reason to test those variables. This is not a variable selection strategy.

14 The Hat Matrix


Recall that

$$\hat Y = \mathbb{X}\hat\beta = \mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T Y = HY \qquad (48)$$

where

$$H = \mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T \qquad (49)$$

is called the hat matrix. The hat matrix is the projector onto the column space of $\mathbb{X}$. The residuals are

$$\hat\epsilon = Y - \hat Y = Y - HY = (I - H)Y. \qquad (50)$$

The hat matrix will play an important role in all that follows.

14.1 Theorem. The hat matrix has the following properties.

1. $H\mathbb{X} = \mathbb{X}$.
2. H is symmetric and idempotent: $H^2 = H$.
3. H projects Y onto the column space of $\mathbb{X}$.
4. $\mathrm{rank}(\mathbb{X}) = \mathrm{tr}(H)$.

14.2 Theorem. Properties of residuals:

1. True residuals: $\mathbb{E}(\epsilon) = 0$, $\mathbb{V}(\epsilon) = \sigma^2 I$.
2. Estimated residuals: $\mathbb{E}(\hat\epsilon) = 0$, $\mathbb{V}(\hat\epsilon) = \sigma^2(I - H)$.
3. $\sum_i \hat\epsilon_i = 0$.
4. $\mathbb{V}(\hat\epsilon_i) = \sigma^2(1 - h_{ii})$ where $h_{ii}$ is the ith diagonal element of H.

Let's prove a few of these. First,

$$\mathbb{E}(\hat\epsilon) = (I - H)\mathbb{E}(Y) = (I - H)\mathbb{X}\beta = \mathbb{X}\beta - H\mathbb{X}\beta = \mathbb{X}\beta - \mathbb{X}\beta = 0 \quad \text{since } H\mathbb{X} = \mathbb{X}.$$

Next,

$$\mathbb{V}(\hat\epsilon) = (I - H)\mathbb{V}(Y)(I - H)^T = \sigma^2(I - H)(I - H) = \sigma^2(I - H - H + H^2) = \sigma^2(I - H) \quad \text{since } H^2 = H.$$

To see that the sum of the residuals is 0, note that $\sum_i \hat\epsilon_i = \langle 1, \hat\epsilon\rangle$, where 1 denotes a vector of ones. Now $1 \in L$, $\hat Y$ is the projection of Y onto L, and $\hat\epsilon = Y - \hat Y$. By the properties of the projection, $\hat\epsilon$ is perpendicular to every vector in L. Hence, $\sum_i \hat\epsilon_i = \langle 1, \hat\epsilon\rangle = 0$.

14.3 Example. Let

$$\mathbb{X} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. \qquad \text{Then} \qquad H = \frac{1}{n}\begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}.$$

The column space is

$$V = \left\{ a\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} : a \in \mathbb{R} \right\} \qquad \text{and} \qquad HY = \begin{pmatrix} \overline{Y} \\ \overline{Y} \\ \vdots \\ \overline{Y} \end{pmatrix}.$$
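These properties are easy to verify numerically; here is a sketch of my own, using the cats data and hatvalues() from base R:

library(MASS)
out = lm(Hwt ~ Bwt, data = cats)
h = hatvalues(out)            # diagonal of the hat matrix H
sum(h)                        # = trace(H) = 2, the number of columns of X
sum(resid(out))               # ~ 0: residuals sum to zero
# V(eps_i.hat) = sigma^2 (1 - h_ii): standardized residuals divide by sigma*sqrt(1 - h_ii)
all.equal(unname(rstandard(out)),
          unname(resid(out)/(summary(out)$sigma*sqrt(1 - h))))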

14.4 Example. Suppose that the $\mathbb{X}$ matrix has two columns. Denote these columns by $x_1$ and $x_2$. The column space is $V = \{a_1 x_1 + a_2 x_2 : a_1, a_2 \in \mathbb{R}\}$. The hat matrix projects $Y \in \mathbb{R}^n$ onto V. See Figure 7.

Figure 7: Projection of Y onto the plane spanned by x1 and x2.

15 Weighted Least Squares


So far we have assumed that the $\epsilon_i$'s are independent and have the same variance. What happens if this is wrong? Suppose that $Y = \mathbb{X}\beta + \epsilon$ where $\mathbb{V}(\epsilon) = \Sigma$. Suppose we use the usual least squares estimator $\hat\beta$. Then,

$$\mathbb{E}(\hat\beta) = \mathbb{E}((\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T Y) = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{E}(Y) = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{X}\beta = \beta.$$

So $\hat\beta$ is still unbiased. Also, under weak conditions, it can be shown that $\hat\beta$ is consistent (converges to $\beta$ as we get more data). So the usual estimator has reasonable properties. However, there are two problems. First, with constant variance, the usual least squares estimator is not just unbiased; it is optimal in the sense that it is the minimum variance, linear, unbiased estimator. This is no longer true with non-constant variance. Second, and more importantly, the formula for the standard error of $\hat\beta$ is wrong. To see this, recall that $\mathbb{V}(AY) = A\mathbb{V}(Y)A^T$. Hence,

$$\mathbb{V}(\hat\beta) = \mathbb{V}((\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T Y) = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\Sigma\mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}$$

which is different from the usual formula.

It can be shown that the minimum variance, linear, unbiased estimator is obtained by minimizing

$$\mathrm{RSS}(\beta) = (Y - \mathbb{X}\beta)^T\Sigma^{-1}(Y - \mathbb{X}\beta). \qquad (51)$$

The solution is

$$\hat\beta = SY \qquad (52)$$

where $S = (\mathbb{X}^T\Sigma^{-1}\mathbb{X})^{-1}\mathbb{X}^T\Sigma^{-1}$. This is unbiased with variance $\mathbb{V}(\hat\beta) = (\mathbb{X}^T\Sigma^{-1}\mathbb{X})^{-1}$. This is called weighted least squares.

Let B denote the square root of $\Sigma$. Thus, B is a symmetric matrix that satisfies $B^TB = BB^T = \Sigma$. It can be shown that $B^{-1}$ is the square root of $\Sigma^{-1}$. Let $Z = B^{-1}Y$. Then we have

$$Z = B^{-1}Y = B^{-1}(\mathbb{X}\beta + \epsilon) = B^{-1}\mathbb{X}\beta + B^{-1}\epsilon = M\beta + \delta$$

where $M = B^{-1}\mathbb{X}$ and $\delta = B^{-1}\epsilon$. Moreover,

$$\mathbb{V}(\delta) = B^{-1}\mathbb{V}(\epsilon)B^{-1} = B^{-1}\Sigma B^{-1} = B^{-1}BBB^{-1} = I.$$

Thus we can simply regress Z on M and do ordinary regression.

Let us look more closely at a special case. If the residuals are uncorrelated then

$$\Sigma = \begin{pmatrix} \sigma^2/w_1 & 0 & \cdots & 0 \\ 0 & \sigma^2/w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2/w_n \end{pmatrix}.$$

In this case,

$$\mathrm{RSS}(\beta) = (Y - \mathbb{X}\beta)^T\Sigma^{-1}(Y - \mathbb{X}\beta) \propto \sum_{i=1}^n w_i(Y_i - x_i^T\beta)^2.$$

Thus, in weighted least squares we are simply giving lower weight to the more variable (less precise) observations. Now we have to address the following question: where do we get the weights? Or equivalently, how do we estimate $\sigma_i^2 = \mathbb{V}(\epsilon_i)$? There are four approaches.

(1) Do a transformation to make the variances approximately equal. Then we don't need to do a weighted regression.
(2) Use external information. There are some cases where other information (besides the current data) will allow you to know (or estimate) $\sigma_i$. These cases are rare but they do occur. I am working on such a problem right now, in fact; it is a problem from physics and the $\sigma_i$ come from instrument error, which is known to a good approximation.
(3) Use replications. If there are several Y values corresponding to each x value, we can use the sample variance of those Y values to estimate $\sigma_i^2$. However, it is rare that you would have so many replications.
(4) Estimate $\sigma(x)$ as a function of x. Just as we can estimate the regression line, we can also estimate the variance, thinking of it as a function of x. We could assume a simple model like $\sigma(x_i) = \alpha_0 + \alpha_1 x_i$, for example, and try to estimate the parameters $\alpha_0$ and $\alpha_1$ from the data. In fact, we will do something more ambitious: we will estimate $\sigma(x)$ assuming only that it is a smooth function of x. We will do this later in the course when we discuss nonparametric regression.
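In R, lm() implements exactly this weighted criterion via its weights argument. A minimal sketch with made-up heteroskedastic data (the variance function here is assumed known only for illustration):

set.seed(6)
n = 100
x = runif(n, 1, 10)
sigma.i = 0.5*x                      # suppose the error sd grows with x (assumed known here)
y = 2 + 3*x + rnorm(n, sd = sigma.i)
w = 1/sigma.i^2                      # weights w_i = 1/sigma_i^2
ols = lm(y ~ x)                      # unbiased but inefficient; its standard errors are wrong
wls = lm(y ~ x, weights = w)         # minimizes sum w_i (y_i - b0 - b1 x_i)^2
summary(ols)$coef
summary(wls)$coef                    # typically a smaller standard error for the slope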


16 The Predictive Viewpoint


The main motivation for studying regression is prediction. Suppose we observe X and then predict Y with g(X). Recall that the prediction error, or prediction risk, is $R(g) = \mathbb{E}(Y - g(X))^2$, and this is minimized by taking $g(x) = r(x)$ where $r(x) = \mathbb{E}(Y \mid X = x)$. Consider the set of linear predictors

$$\mathcal{L} = \{\ell(x) = x^T\beta : \beta \in \mathbb{R}^p\}.$$

(We usually assume that $x_1 = 1$.) The best linear predictor, or linear oracle, is $\ell_*(x) = x^T\beta_*$ where

$$R(\ell_*) = \min_{\ell\in\mathcal{L}} R(\ell).$$

In other words, $\ell_*(x) = x^T\beta_*$ gives the smallest prediction error of all linear predictors. Note that $\ell_*$ is well-defined even without assuming that the true regression function is linear. One way to think about linear regression is as follows: when we are using least squares, we are trying to estimate the linear oracle, not the true regression function.

Let us make the connection between the best linear predictor and least squares more explicit. We have

$$R(\beta) = \mathbb{E}(Y^2) - 2\mathbb{E}(YX^T\beta) + \beta^T\mathbb{E}(XX^T)\beta = \mathbb{E}(Y^2) - 2\mathbb{E}(YX^T\beta) + \beta^T\Sigma\beta$$

where $\Sigma = \mathbb{E}(XX^T)$. By differentiating $R(\beta)$ with respect to $\beta$ and setting the derivative equal to 0, we see that the best value of $\beta$ is

$$\beta_* = \Sigma^{-1}C \qquad (53)$$

where C is the $p \times 1$ vector whose jth element is $\mathbb{E}(YX_j)$. We can estimate $\Sigma$ with the matrix $\hat\Sigma = n^{-1}\mathbb{X}^T\mathbb{X}$ and we can estimate C with $n^{-1}\mathbb{X}^TY$. An estimate of the oracle is thus $\hat\beta = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY$, which is the least squares estimator.

17 The Bias-Variance Decomposition


Let $\hat r(x)$ be any predictor. Then

$$R = \mathbb{E}(Y - \hat r(X))^2 = \int R(x)f(x)\,dx$$

where $R(x) = \mathbb{E}\bigl((Y - \hat r(X))^2 \mid X = x\bigr)$. Let $\overline{r}(x) = \mathbb{E}(\hat r(x))$, $V(x) = \mathbb{V}(\hat r(x))$ and $\sigma^2(x) = \mathbb{V}(Y \mid X = x)$. Now

$$R(x) = \mathbb{E}\Bigl(\bigl((Y - r(x)) + (r(x) - \overline{r}(x)) + (\overline{r}(x) - \hat r(x))\bigr)^2 \,\Big|\, X = x\Bigr) = \underbrace{\sigma^2(x)}_{\text{irreducible error}} + \underbrace{(r(x) - \overline{r}(x))^2}_{\text{bias squared}} + \underbrace{V(x)}_{\text{variance}}. \qquad (54)$$

We call (54) the bias-variance decomposition. If we combine the last two terms, we can also write $R(x) = \sigma^2(x) + \mathrm{MSE}_n(x)$ where $\mathrm{MSE}_n(x) = \mathbb{E}\bigl((\hat r(X) - r(X))^2 \mid X = x\bigr)$ is the conditional mean squared error of $\hat r(x)$. Now

$$R = \int R(x)f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n R(X_i) \equiv R_{\text{av}}$$

and $R_{\text{av}}$ is called the average prediction risk. We have

$$R_{\text{av}} = \frac{1}{n}\sum_{i=1}^n R(X_i) = \frac{1}{n}\sum_{i=1}^n \sigma^2(X_i) + \frac{1}{n}\sum_{i=1}^n (r(X_i) - \overline{r}(X_i))^2 + \frac{1}{n}\sum_{i=1}^n V(X_i).$$

Finally, define the training error

$$R_{\text{training}} = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_i)^2$$

where $\hat Y_i = \hat r(X_i)$. We might guess that $R_{\text{training}}$ estimates the prediction error well, but this is not true. To see this, let $\overline{r}_i = \mathbb{E}(\hat r(X_i))$ and compute

$$\mathbb{E}(Y_i - \hat Y_i)^2 = \mathbb{E}\bigl(Y_i - r(X_i) + r(X_i) - \overline{r}(X_i) + \overline{r}(X_i) - \hat Y_i\bigr)^2 = \sigma^2 + \mathbb{E}(r(X_i) - \overline{r}(X_i))^2 + \mathbb{V}(\hat r(X_i)) - 2\,\mathrm{Cov}(Y_i, \hat Y_i).$$

Hence,

$$\mathbb{E}(R_{\text{training}}) = \mathbb{E}(R_{\text{av}}) - \frac{2}{n}\sum_i \mathrm{Cov}(Y_i, \hat Y_i). \qquad (55)$$

Typically $\mathrm{Cov}(Y_i, \hat Y_i) > 0$ and so $R_{\text{training}}$ underestimates the risk. Later, we shall see how to estimate the prediction risk.

Another Property of Least Squares. Let $\mathcal{L}$ denote all linear functions of the form $\sum_j \beta_j x_j$. For any vector v define the norm $\|v\|_n^2 = (1/n)\sum_{i=1}^n v_i^2$ and for any function h write $\|h\|_n^2 = (1/n)\sum_{i=1}^n h^2(X_i)$. Then:

17.1 Theorem.
$$\mathbb{E}\bigl(\|\hat r - r\|_n^2 \mid X_1, \ldots, X_n\bigr) \le \frac{q\sigma^2}{n} + \min_{f\in\mathcal{L}} \|f - r\|_n^2.$$

Thus, $\hat r$ is a close linear approximation to r.

Summary

1. The linear oracle, or best linear predictor, is $x^T\beta_*$ where $\beta_* = \Sigma^{-1}C$. An estimate of $\beta_*$ is $\hat\beta = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY$.
2. The least squares estimator is $\hat\beta = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^TY$. We can regard $\hat\beta$ as an estimate of the linear oracle. If the regression function r(x) is actually linear so that $r(x) = x^T\beta$, then the least squares estimator is unbiased and has variance matrix $\sigma^2(\mathbb{X}^T\mathbb{X})^{-1}$.
3. The predicted values are $\hat Y = \mathbb{X}\hat\beta = HY$ where $H = \mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T$ is the hat matrix, which projects Y onto the column space of $\mathbb{X}$.
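A small simulation (my own sketch, with a made-up nonlinear truth) illustrating (55): the training error of a fitted line sits below the average prediction risk computed from fresh responses at the same design points.

set.seed(7)
n = 30; nsim = 2000
x = seq(0, 1, length = n)
r.true = function(x) sin(2*pi*x)               # nonlinear truth, so the linear fit is biased
train = pred = numeric(nsim)
for(s in 1:nsim){
  y = r.true(x) + rnorm(n, sd = 0.3)
  fit = lm(y ~ x)
  ystar = r.true(x) + rnorm(n, sd = 0.3)       # future observations at the same X's
  train[s] = mean(resid(fit)^2)
  pred[s]  = mean((ystar - fitted(fit))^2)
}
mean(train)    # smaller ...
mean(pred)     # ... than the average prediction risk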

Figure 8: The Anscombe example: four different data sets with the same fitted line.

18 Diagnostics
Figure 8 shows a famous example: four different data sets with the same fit. The moral: looking at the fit is not enough. We should also use some diagnostics. Generally, we diagnose problems by looking at the residuals. When we do this, we are looking for: (1) outliers, (2) influential points, (3) nonconstant variance, (4) nonlinearity, (5) nonnormality. The remedies are:

Problem                   Remedy
1. Outliers               Non-influential: don't worry about it. Influential: remove or use robust regression.
2. Influential points     Fit the regression with and without the point and report both analyses.
3. Nonconstant variance   Use a transformation or nonparametric methods. Note: doesn't affect the fit too much; mainly an issue for confidence intervals.
4. Nonlinearity           Use a transformation or nonparametric methods.
5. Nonnormality           Large samples: not a problem. Small samples: use transformations.

Three types of residuals:

Name                    Formula                                                          R command (assume lm output is in tmp)
residual                $\hat\epsilon_i = Y_i - \hat Y_i$                                resid(tmp)
standardized residual   $(Y_i - \hat Y_i)/(\hat\sigma\sqrt{1 - h_{ii}})$                 rstandard(tmp)
studentized residual    $(Y_i - \hat Y_i)/(\hat\sigma_{(i)}\sqrt{1 - h_{ii}})$           rstudent(tmp)

19 Outliers
Outliers can be found (i) graphically or (ii) by testing. Let us write

$$Y_j = \begin{cases} \beta^T X_j + \epsilon_j & j \neq i \\ \beta^T X_j + \Delta + \epsilon_j & j = i. \end{cases}$$

Test

$$H_0: \text{case } i \text{ is an outlier} \quad \text{versus} \quad H_1: \text{case } i \text{ is not an outlier}.$$

Do the following:

(1) Delete case i.
(2) Compute $\hat\beta_{(i)}$ and $\hat\sigma_{(i)}$.
(3) Predict the deleted case: $\hat Y_i = X_i^T\hat\beta_{(i)}$.
(4) Compute
$$t_i = \frac{Y_i - \hat Y_i}{\widehat{\mathrm{se}}}.$$
(5) Reject $H_0$ if the p-value is less than $\alpha/n$.

Note that

$$\mathbb{V}(Y_i - \hat Y_i) = \mathbb{V}(Y_i) + \mathbb{V}(\hat Y_i) = \sigma^2 + \sigma^2 x_i^T(\mathbb{X}_{(i)}^T\mathbb{X}_{(i)})^{-1}x_i.$$

So,

$$\widehat{\mathrm{se}}(Y_i - \hat Y_i) = \hat\sigma_{(i)}\sqrt{1 + x_i^T(\mathbb{X}_{(i)}^T\mathbb{X}_{(i)})^{-1}x_i}.$$

How do the residuals come into this?

Internally studentized residuals: $r_i = \dfrac{\hat\epsilon_i}{\hat\sigma\sqrt{1 - h_{ii}}}$.

Externally studentized residuals: $r_{(i)} = \dfrac{\hat\epsilon_i}{\hat\sigma_{(i)}\sqrt{1 - h_{ii}}}$.

19.1 Theorem.
$$t_i = r_i\sqrt{\frac{n - p - 2}{n - p - 1 - r_i^2}} = r_{(i)}.$$

20 Influence
Cook's distance is

$$D_i = \frac{(\hat Y_{(i)} - \hat Y)^T(\hat Y_{(i)} - \hat Y)}{q\hat\sigma^2} = \frac{r_i^2}{q}\left(\frac{h_{ii}}{1 - h_{ii}}\right)$$

where $\hat Y = \mathbb{X}\hat\beta$ and $\hat Y_{(i)} = \mathbb{X}\hat\beta_{(i)}$. Points with $D_i \ge 1$ might be influential. Points near the edge of the covariate space are typically the influential points.


20.1 Example (Rats).

data = c(176,6.5,.88,.42,176,9.5,.88,.25,190,9.0,1.00,.56,
         .... )
data = matrix(data,ncol=4,byrow=T)
bwt  = data[,1]
lwt  = data[,2]
dose = data[,3]
y    = data[,4]
n    = length(y)

out = lm(y ~ bwt + lwt + dose, qr=TRUE)
summary(out)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.265922   0.194585   1.367   0.1919
bwt         -0.021246   0.007974  -2.664   0.0177 *
lwt          0.014298   0.017217   0.830   0.4193
dose         4.178111   1.522625   2.744   0.0151 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 0.07729 on 15 degrees of freedom
Multiple R-Squared: 0.3639,  Adjusted R-squared: 0.2367
F-statistic: 2.86 on 3 and 15 DF,  p-value: 0.07197

diagnostics = ls.diag(out)
names(diagnostics)
 "std.dev" "hat" "std.res" "stud.res" "cooks" "dfits"
 "correlation" "std.err" "cov.scaled" "cov.unscaled"
plot(bwt,diagnostics$stud.res,pch=19);abline(h=0)
plot(lwt,diagnostics$stud.res,pch=19);abline(h=0)
plot(dose,diagnostics$stud.res,pch=19);abline(h=0)
qqnorm(diagnostics$stud.res,pch=19);abline(a=0,b=1)

### Another way to get the residuals
r = rstandard(out)   ### standardized
r = rstudent(out)    ### studentized
plot(fitted(out),rstudent(out),pch=19);abline(h=0)

### More diagnostics
I = influence.measures(out)
names(I)
 "infmat" "is.inf" "call"
I$infmat[1:5,]
       dfb.1_     dfb.bwt    dfb.lwt   dfb.dose      dffit     cov.r     cook.d       hat
1 -0.03835128  0.31491627 -0.7043633 -0.2437488  0.8920451 0.6310012 0.16882682 0.1779827
2  0.14256373 -0.09773917 -0.4817784  0.1256122 -0.6087606 1.0164073 0.08854024 0.1793410
3 -0.23100202 -1.66770314  0.3045718  1.7471972  1.9047699 7.4008047 0.92961596 0.8509146
4  0.12503004 -0.12685888 -0.3036512  0.1400908 -0.4943610 0.8599033 0.05718456 0.1076158
5  0.52160605 -0.39626771  0.5500161  0.2747418 -0.9094531 1.5241607 0.20291617 0.3915382

cook = I$infmat[,7]
plot(cook,type="h",lwd=3,col="red")

### remove third case
y    = y[-3]
bwt  = bwt[-3]
lwt  = lwt[-3]
dose = dose[-3]
out = lm(y ~ bwt + lwt + dose)
summary(out)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.311427   0.205094   1.518    0.151
bwt         -0.007783   0.018717  -0.416    0.684
lwt          0.008989   0.018659   0.482    0.637
dose         1.484877   3.713064   0.400    0.695

Residual standard error: 0.07825 on 14 degrees of freedom
Multiple R-Squared: 0.02106,  Adjusted R-squared: -0.1887
F-statistic: 0.1004 on 3 and 14 DF,  p-value: 0.9585

Warning! Notice the command out = lm(y ~ bwt + lwt + dose, qr=TRUE). You need the qr=TRUE option if you want to use ls.diag.

21 Tweaking the Regression


If residual plots indicate some problem, we need to apply some remedies. Look at Figure 6.3, p. 132 of Weisberg. Possible remedies are: transformation, robust regression, nonparametric regression. Examples of transformations: $\sqrt{Y}$, $\log(Y)$, $\log(Y + c)$, $1/Y$. These can be applied to Y or x. We transform to make the assumptions valid, not to chase statistical significance.

21.1 Example (Bacteria). This example is from Chatterjee and Price (1991, p. 36). Bacteria were exposed to radiation. Figure 10 shows the number of surviving bacteria versus time of exposure to radiation. The program and output look like this.

> time = 1:15
> survivors = c(355,211,197,166,142,106,104,60,56,38,36,32,21,19,15)
> plot(time,survivors)
> out = lm(survivors ~ time)
> abline(out)
> plot(out,which=c(1,2,4))
> print(summary(out))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   259.58      22.73  11.420 3.78e-08 ***
time          -19.46       2.50  -7.786 3.01e-06 ***
---
Residual standard error: 41.83 on 13 degrees of freedom
Multiple R-Squared: 0.8234,  Adjusted R-squared: 0.8098
F-statistic: 60.62 on 1 and 13 DF,  p-value: 3.006e-06

The residual plot suggests a problem. Consider the following transformation.

> logsurv = log(survivors)
> plot(time,logsurv)
> out = lm(logsurv ~ time)
> abline(out)
> plot(out,which=c(1,2,4))
> print(summary(out))

Figure 9: Rat data diagnostics (studentized residuals versus bwt, lwt, dose, and fitted values; Normal QQ plot; Cook's distances).

Figure 10: Bacteria data (survivors versus time with fitted line; residuals versus fitted values; Normal QQ plot; Cook's distance plot).

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.973160   0.059778   99.92  < 2e-16 ***
time        -0.218425   0.006575  -33.22 5.86e-14 ***
---
Residual standard error: 0.11 on 13 degrees of freedom
Multiple R-Squared: 0.9884,  Adjusted R-squared: 0.9875
F-statistic: 1104 on 1 and 13 DF,  p-value: 5.86e-14

Check out Figure 11. Much better. In fact, theory predicts $N_t = N_0 e^{-\beta t}$, where $N_t$ is the number of survivors at exposure time t and $N_0$ is the number of bacteria before exposure. So the fact that the log transformation is useful here is not surprising.


Figure 11: Bacteria data after the log transformation (log survivors versus time with fitted line; residuals versus fitted values; Normal QQ plot; Cook's distance plot).

22 Qualitative Variables
If $X_i \in \{0, 1\}$, then it is called a dummy variable. More generally, if x takes discrete values, it is called a qualitative variable or a factor. Let d be a dummy variable. Consider

$$\mathbb{E}(Y) = \beta_0 + \beta_1 x + \beta_2 d.$$

Then:

coefficient   d=0          d=1
intercept     $\beta_0$    $\beta_0 + \beta_2$
slope         $\beta_1$    $\beta_1$

These are parallel lines. Now consider this model:

$$\mathbb{E}(Y) = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 xd.$$

Then:

coefficient   d=0          d=1
intercept     $\beta_0$    $\beta_0 + \beta_2$
slope         $\beta_1$    $\beta_1 + \beta_3$

These are nonparallel lines. To include a discrete variable with k levels, use k-1 dummy variables. For example, if $z \in \{1, 2, 3\}$, do this:

z   d1  d2
1   1   0
2   0   1
3   0   0

In the model $Y = \beta_0 + \beta_1 d_1 + \beta_2 d_2 + \beta_3 x + \epsilon$ we see

$$\mathbb{E}(Y \mid z = 1) = \beta_0 + \beta_1 + \beta_3 x$$
$$\mathbb{E}(Y \mid z = 2) = \beta_0 + \beta_2 + \beta_3 x$$
$$\mathbb{E}(Y \mid z = 3) = \beta_0 + \beta_3 x.$$

You should not create k dummy variables because they will not be linearly independent. If we added $d_3$ above we would have $d_3 = 1 - d_1 - d_2 + d_1 d_2 = 1 - d_1 - d_2$ (since $d_1 d_2 = 0$). Then $\mathbb{X}^T\mathbb{X}$ is not invertible.

22.1 Example. Salary data from Chatterjee and Price, p. 96.

## salary example p 97 chatterjee and price
sdata = read.table("salaries.dat",skip=1)
names(sdata) = c("salary","experience","education","management")
attach(sdata)
n = length(salary)
d1 = rep(0,n)
d1[education==1] = 1
d2 = rep(0,n)
d2[education==2] = 1
out1 = lm(salary ~ experience + d1 + d2 + management)
summary(out1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  11031.81     383.22  28.787  < 2e-16 ***
experience     546.18      30.52  17.896  < 2e-16 ***
d1           -2996.21     411.75  -7.277 6.72e-09 ***
d2             147.82     387.66   0.381    0.705
management    6883.53     313.92  21.928  < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 1027 on 41 degrees of freedom
Multiple R-Squared: 0.9568,  Adjusted R-squared: 0.9525
F-statistic: 226.8 on 4 and 41 DF,  p-value: < 2.2e-16

Interpretation: each year of experience increases our prediction by 546 dollars. The increment for a management position is 6883 dollars. Compare bachelors to high school. For high school, $d_1 = 1$ and $d_2 = 0$, so

$$\mathbb{E}(Y) = \beta_0 + \beta_1\,\text{experience} - 2996 + \beta_4\,\text{management}.$$

For bachelors, $d_1 = 0$ and $d_2 = 1$, so

$$\mathbb{E}(Y) = \beta_0 + \beta_1\,\text{experience} + 147 + \beta_4\,\text{management}.$$

So $\mathbb{E}_{\text{bach}}(Y) - \mathbb{E}_{\text{high}}(Y) = 3144$.

### another way
ed = as.factor(education)
out2 = lm(salary ~ experience + ed + management)
summary(out2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   8035.60     386.69  20.781  < 2e-16 ***
experience     546.18      30.52  17.896  < 2e-16 ***
ed2           3144.04     361.97   8.686 7.73e-11 ***
ed3           2996.21     411.75   7.277 6.72e-09 ***
management    6883.53     313.92  21.928  < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 1027 on 41 degrees of freedom
Multiple R-Squared: 0.9568,  Adjusted R-squared: 0.9525
F-statistic: 226.8 on 4 and 41 DF,  p-value: < 2.2e-16

Apparently, R codes the dummy variables differently.

level        mean    d1  d2  ed2  ed3
high-school  8036    1   0   0    0
BS           11179   0   1   1    0
advanced     11032   0   0   0    1

You can change the way R does this. Do help(C) and help(contr.treatment).

23 Variable Selection
If the dimension p of the covariate X is large, then we might get better predictions by omitting some covariates. Models with many covariates have low bias but high variance; models with few covariates have high bias but low variance. The best predictions come from balancing these two extremes. This is called the bias-variance tradeoff. To reiterate:

- including many covariates leads to low bias and high variance;
- including few covariates leads to high bias and low variance.

The problem of deciding which variables to include in the regression model to achieve a good tradeoff is called model selection or variable selection. It is convenient in model selection to first standardize all the variables by subtracting off the mean and dividing by the standard deviation. For example, we replace $X_{ij}$ with $(X_{ij} - \overline{X}_j)/s_j$ where $\overline{X}_j = n^{-1}\sum_{i=1}^n X_{ij}$ is the mean of covariate $X_j$ and $s_j$ is the standard deviation. The R function scale will do this for you. Thus, we assume throughout this section that

$$\frac{1}{n}\sum_{i=1}^n Y_i = 0, \qquad \frac{1}{n}\sum_{i=1}^n Y_i^2 = 1, \qquad (56)$$
$$\frac{1}{n}\sum_{i=1}^n X_{ij} = 0, \qquad \frac{1}{n}\sum_{i=1}^n X_{ij}^2 = 1, \qquad j = 1, \ldots, p. \qquad (57)$$

Given $S \subset \{1, \ldots, p\}$, let $(X_j : j \in S)$ denote a subset of the covariates. There are $2^p$ such subsets. Let $\beta(S) = (\beta_j : j \in S)$ denote the coefficients of the corresponding set of covariates and let

$$\hat\beta(S) = (\mathbb{X}_S^T\mathbb{X}_S)^{-1}\mathbb{X}_S^T Y$$

denote the least squares estimate of $\beta(S)$, where $\mathbb{X}_S$ denotes the design matrix for this subset of covariates. Thus, $\hat\beta(S)$ is the least squares estimate of $\beta(S)$ from the submodel $Y = \mathbb{X}_S\beta(S) + \epsilon$. The vector of predicted values from model S is $\hat Y(S) = \mathbb{X}_S\hat\beta(S)$. For the null model $S = \emptyset$, $\hat Y$ is defined to be a vector of 0s. Let $\hat r_S(x) = \sum_{j\in S}\hat\beta_j(S)x_j$ denote the estimated regression function for the submodel. We measure the predictive quality of the model via the prediction risk.

The prediction risk of the submodel S is defined to be

$$R(S) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\bigl(\hat Y_i(S) - Y_i^*\bigr)^2 \qquad (58)$$

where $Y_i^* = r(X_i) + \epsilon_i^*$ denotes the value of a future observation of Y at covariate value $X_i$.

Ideally, we want to select a submodel S to make R(S) as small as possible. We face two problems. The first is estimating R(S) and the second is searching through all the submodels S.

24 The Bias-Variance Tradeoff


Before discussing the estimation of the prediction risk, we recall an important result.


Bias-Variance Decomposition of the Prediction Risk:

$$R(S) = \underbrace{\sigma^2}_{\text{unavoidable error}} + \underbrace{\frac{1}{n}\sum_{i=1}^n b_i^2}_{\text{squared bias}} + \underbrace{\frac{1}{n}\sum_{i=1}^n v_i}_{\text{variance}} \qquad (59)$$

where $b_i = \mathbb{E}(\hat r_S(X_i) \mid X_i) - r(X_i)$ is the bias and $v_i = \mathbb{V}(\hat r_S(X_i) \mid X_i)$ is the variance. Let us look at the bias-variance tradeoff in some simpler settings.

24.1 Example. Suppose that $Y \sim N(\theta, \sigma^2)$. The minimum variance unbiased estimator of $\theta$ is Y. Now consider the estimator $\hat\theta = \alpha Y$ where $0 \le \alpha \le 1$. The bias is $\mathbb{E}(\hat\theta) - \theta = (\alpha - 1)\theta$, the variance is $\alpha^2\sigma^2$, and the mean squared error is

$$\text{bias}^2 + \text{variance} = (1 - \alpha)^2\theta^2 + \alpha^2\sigma^2. \qquad (60)$$

Notice that the bias increases and the variance decreases as $\alpha \to 0$. Conversely, the bias decreases and the variance increases as $\alpha \to 1$. The optimal estimator is obtained by taking $\alpha = \theta^2/(\theta^2 + \sigma^2)$.
i=k+1

(61)

Yi 0

ik i > k.
p

(62)

2 + k 2 . i

(63)

As k increases the bias term decreases and the variance term increases. Since E(Y i2 2 ) = 2 , we can form an i unbiased estimate of the risk, namely,
p

R=
i=k+1

(Yi2 2 ) + k 2 = RSS + 2k 2 n 2 = RSS + 2k 2 + constant

(64)

where RSS =

n i=1 (Yi

i )2 . We can estimate the optimal choice of k by minimizing RSS + 2k 2 (65)

over k.

24.1 Risk Estimation and Model Scoring


An obvious candidate to estimate R(S) is the training error

$$\hat R_{\text{tr}}(S) = \frac{1}{n}\sum_{i=1}^n (\hat Y_i(S) - Y_i)^2. \qquad (66)$$

For the null model $S = \emptyset$, $\hat Y_i = 0$, $i = 1, \ldots, n$, and $\hat R_{\text{tr}}(S)$ is an unbiased estimator of R(S); this is the risk estimator we will use for that model. But in general, this is a poor estimator of R(S) because it is very biased. Indeed, if we add more and more covariates to the model, we can track the data better and better and make $\hat R_{\text{tr}}(S)$ smaller and smaller. Thus if we used $\hat R_{\text{tr}}(S)$ for model selection we would be led to include every covariate in the model.

24.3 Theorem. The training error is a downward-biased estimate of the prediction risk, meaning that $\mathbb{E}(\hat R_{\text{tr}}(S)) < R(S)$. In fact,

$$\text{bias}(\hat R_{\text{tr}}(S)) = \mathbb{E}(\hat R_{\text{tr}}(S)) - R(S) = -\frac{2}{n}\sum_{i=1}^n \mathrm{Cov}(\hat Y_i, Y_i). \qquad (67)$$

Now we discuss some better estimates of risk. Mallows' $C_p$ statistic is defined by

$$\hat R(S) = \hat R_{\text{tr}}(S) + \frac{2|S|\hat\sigma^2}{n} \qquad (68)$$

where |S| denotes the number of terms in S and $\hat\sigma^2$ is the estimate of $\sigma^2$ obtained from the full model (with all covariates in the model). This is simply the training error plus a bias correction. This estimate is named in honor of Colin Mallows, who invented it. The first term in (68) measures the fit of the model while the second measures the complexity of the model. Think of the $C_p$ statistic as

$$\text{lack of fit} + \text{complexity penalty}. \qquad (69)$$

The disadvantage of $C_p$ is that we need to supply an estimate of $\sigma$.

Another method for estimating risk is leave-one-out cross-validation. The leave-one-out cross-validation (CV) estimator of risk is

$$\hat R_{CV}(S) = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_{(i)})^2 \qquad (70)$$

where $\hat Y_{(i)}$ is the prediction for $Y_i$ obtained by fitting the model with $Y_i$ omitted. It can be shown that

$$\hat R_{CV}(S) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat Y_i(S)}{1 - H_{ii}(S)}\right)^2 \qquad (71)$$

where $H_{ii}(S)$ is the ith diagonal element of the hat matrix

$$H(S) = \mathbb{X}_S(\mathbb{X}_S^T\mathbb{X}_S)^{-1}\mathbb{X}_S^T. \qquad (72)$$

From equation (71) it follows that we can compute the leave-one-out cross-validation estimator without actually dropping out each observation and refitting the model. An important advantage of cross-validation is that it does not require an estimate of $\sigma$.

We can relate CV to $C_p$ as follows. First, approximate each $H_{ii}(S)$ with the average value $n^{-1}\sum_{i=1}^n H_{ii}(S) = \mathrm{trace}(H(S))/n = |S|/n$. This yields

$$\hat R_{CV}(S) \approx \frac{1}{n}\,\frac{\mathrm{RSS}(S)}{\bigl(1 - |S|/n\bigr)^2}. \qquad (73)$$

The right hand side of (73) is called the generalized cross validation (GCV) score and will come up again later. Next, use the fact that $1/(1-x)^2 \approx 1 + 2x$ and conclude that

$$\hat R_{CV}(S) \approx \hat R_{\text{tr}}(S) + \frac{2\tilde\sigma^2|S|}{n} \qquad (74)$$

where $\tilde\sigma^2 = \mathrm{RSS}(S)/n$. This is identical to $C_p$ except that the estimator of $\sigma^2$ is different.

Another criterion for model selection is AIC (Akaike Information Criterion). The idea is to choose S to maximize

$$AIC(S) = \ell_S - |S| \qquad (75)$$

where $\ell_S = \ell_S(\hat\beta_S, \hat\sigma)$ is the log-likelihood of the model (assuming Normal errors) evaluated at the MLE.(1) This can be thought of as goodness of fit minus complexity. Assuming Normal errors,

$$AIC(S) = -\frac{n}{2}\log\hat\sigma^2 - \frac{\mathrm{RSS}(S)}{2\hat\sigma^2} - |S| \qquad (76)$$

where RSS(S) is the residual sum of squares of model S. If instead we take $\sigma$ equal to its estimate from the largest model, then maximizing AIC is equivalent to minimizing Mallows' $C_p$. Some texts define AIC as

$$AIC(S) = -2\ell_S + 2|S| \approx n\log\left(\frac{\mathrm{RSS}(S)}{n}\right) + 2|S| \qquad (77)$$

in which case we minimize AIC instead of maximizing.

To see where AIC comes from, note that, assuming Gaussian errors,

$$\ell(\beta, \sigma^2) = \text{constant} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\|Y - \mathbb{X}\beta\|^2.$$

Inserting $\hat\beta$ yields

$$\ell(\hat\beta, \sigma^2) = \text{constant} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\mathrm{RSS}.$$

In this case, up to a constant,

$$-2\ell + 2|S| = \frac{\mathrm{RSS}}{\sigma^2} + 2|S|$$

which is essentially $C_p$. But $\sigma^2$ is unknown too. Inserting $\hat\sigma^2$ yields, up to a constant,

$$-2\ell(\hat\beta, \hat\sigma^2) + 2|S| = n\log\hat\sigma^2 + 2|S| = n\log\left(\frac{\mathrm{RSS}}{n}\right) + 2|S|.$$

This looks different from $C_p$. But let us compare the AIC of one model to the full model. We will use the fact that $\log x \le x - 1$. Let $m = |S|$ and let q be the number of terms in the full model. Also, let $\hat\sigma^2 = \mathrm{RSS}_{\text{full}}/n$, which is the MLE of $\sigma^2$ under the full model. Then,

$$AIC - AIC_{\text{full}} = n\log\left(\frac{\mathrm{RSS}}{n\hat\sigma^2}\right) + 2(m - q) \approx n\left(\frac{\mathrm{RSS}}{n\hat\sigma^2} - 1\right) + 2(m - q) = \frac{\mathrm{RSS}}{\hat\sigma^2} - n + 2(m - q)$$

which corresponds to $C_p$.

Yet another criterion for model selection is BIC (Bayesian information criterion). Here we choose a model to maximize

$$BIC(S) = \ell_S - \frac{|S|}{2}\log n = -\frac{n}{2}\log\left(\frac{\mathrm{RSS}(S)}{n}\right) - \frac{|S|}{2}\log n. \qquad (78)$$

The BIC score has a Bayesian interpretation. Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ where $m = 2^p$ denote all the models. Suppose we assign the prior $\mathbb{P}(S_j) = 1/m$ over the models. Also, assume we put a smooth prior on the parameters within each model. It can be shown that the posterior probability for a model is approximately

$$\mathbb{P}(S_j \mid \text{data}) \approx \frac{e^{BIC(S_j)}}{\sum_r e^{BIC(S_r)}}. \qquad (79)$$

Hence, choosing the model with highest BIC is like choosing the model with highest posterior probability. The BIC score also has an information-theoretic interpretation in terms of something called minimum description length. The BIC score is identical to Mallows' $C_p$ except that it puts a more severe penalty for complexity. It thus leads one to choose a smaller model than the other methods.

(1) Some texts use a slightly different definition of AIC which involves multiplying the definition here by 2 or -2. This has no effect on which model is selected.
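The shortcut (71) is easy to verify; here is a sketch of my own comparing it with brute-force leave-one-out on the cats data:

library(MASS)
out = lm(Hwt ~ Bwt, data = cats)
n = nrow(cats)
cv.short = mean((resid(out)/(1 - hatvalues(out)))^2)   # equation (71), no refitting needed
e = numeric(n)
for(i in 1:n){                                         # brute force: drop, refit, predict
  fit.i = lm(Hwt ~ Bwt, data = cats[-i,])
  e[i] = cats$Hwt[i] - predict(fit.i, newdata = cats[i,])
}
cv.brute = mean(e^2)
c(cv.short, cv.brute)                                  # identical up to rounding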

Model                            CV Score
Null Model (Yi = 0)              1.000
Age                              0.991
Expenditure                      0.568
Population                       0.947
Age + Expenditure                0.486
Age + Population                 0.973
Expenditure + Population         0.635
Age + Expenditure + Population   0.537

Table 1: Cross-validation scores for all 8 models in Example 24.4.

24.2 Model Search


Once we choose a model selection criterion, such as cross-validation or AIC, we then need to search through all $2^p$ models, assign a score to each one, and choose the model with the best score. We will consider four methods for searching through the space of models:

1. Fit all submodels.
2. Forward stepwise regression.
3. Ridge regression.
4. The lasso.

Fitting All Submodels. If p is not too large we can do a complete search over all the models.

24.4 Example. Consider the crime data but let us only consider three variables: Age, Expenditure and Population. There are 8 possible submodels. Their CV scores are shown in Table 1. The best model has two variables, Age and Expenditure.

To run all-subsets regressions in R you need the leaps library. This is installed in the Statistics department. You can also download it for free from the R web site. In R you type:

library(leaps)
out = leaps(x,y,method="Cp")

Here, x is a matrix of explanatory variables. (Do not include a column of 1s.) You can also use the nbest= option, for example,

out = leaps(x,y,method="Cp",nbest=10)

This will report only the best 10 subsets of each size model. The output is a list with several components. In particular, out$which shows which variables are in the model, out$size shows how many parameters are in the model, and out$Cp shows the Cp statistic. Here is a small example with three variables.

library(leaps)
x = cbind(x1,x2,x3)
out = leaps(x,y,method="Cp")
print(out)
> out
$which

      1     2     3
1  TRUE FALSE FALSE
1 FALSE FALSE  TRUE
1 FALSE  TRUE FALSE
2  TRUE  TRUE FALSE
2  TRUE FALSE  TRUE
2 FALSE  TRUE  TRUE
3  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1" "2" "3"

$size
[1] 2 2 2 3 3 3 4

$Cp
[1]   0.2834636 104.5794484 105.0097649   2.0972299   2.1932414 106.5772075
[7]   4.0000000

So the best model is the first, which has only x1 in it.

Stepwise. When p is large, searching through all $2^p$ models is infeasible. In that case we need to search over a subset of all the models. Two common methods are forward and backward stepwise regression. In forward stepwise regression, we start with no covariates in the model. We then add the one variable that leads to the best score. We continue adding variables one at a time this way; see Figure 12. Backward stepwise regression is the same except that we start with the biggest model and drop one variable at a time. Both are greedy searches; neither is guaranteed to find the model with the best score. Backward selection is infeasible when p is larger than n since $\hat\beta$ will not be defined for the largest model. Hence, forward selection is preferred when p is large.

Forward Stepwise Regression (Figure 12):

1. For $j = 1, \ldots, p$, regress Y on the jth covariate $X_j$ and let $\hat R_j$ be the estimated risk. Set $\hat j = \mathrm{argmin}_j \hat R_j$ and let $S = \{\hat j\}$.
2. For each $j \in S^c$, fit the regression model $Y = \beta_j X_j + \sum_{s\in S}\beta_s X_s + \epsilon$ and let $\hat R_j$ be the estimated risk. Set $\hat j = \mathrm{argmin}_{j\in S^c}\hat R_j$ and update $S \leftarrow S \cup \{\hat j\}$.
3. Repeat the previous step until all variables are in S or until it is not possible to fit the regression.
4. Choose the final model to be the one with the smallest estimated risk.

24.5 Example. Figure 13 shows forward stepwise regression on the crime data. The x-axis shows the order in which the variables entered. The y-axis is the cross-validation score. We start with a null model and find that adding x4 reduces the cross-validation score the most. Next we try adding each of the remaining variables to the model and find that x13 leads to the most improvement. We continue this way until all the variables have been added.

Figure 13: Forward stepwise regression on the crime data (cross-validation score versus number of variables; labels show the order in which the variables entered).

The sequence of models chosen by the algorithm is

$$\hat Y = \hat\beta_4 X_4, \qquad S = \{4\}$$
$$\hat Y = \hat\beta_4 X_4 + \hat\beta_{13}X_{13}, \qquad S = \{4, 13\}$$
$$\hat Y = \hat\beta_4 X_4 + \hat\beta_{13}X_{13} + \hat\beta_3 X_3, \qquad S = \{4, 13, 3\}$$
$$\vdots \qquad (80)$$

The best overall model we find is the model with five variables, $x_4, x_{13}, x_3, x_1, x_{11}$, although the model with seven variables is essentially just as good.

Regularization: Ridge Regression and the Lasso. Another way to deal with variable selection is to use regularization or penalization. Specifically, we define $\hat\beta_\lambda$ to minimize the penalized sum of squares

$$Q(\beta) = \sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\,\mathrm{pen}(\beta)$$

where $\mathrm{pen}(\beta)$ is a penalty and $\lambda \ge 0$. We consider three choices for the penalty:

$$L_0 \text{ penalty}: \quad \|\beta\|_0 = \#\{j : \beta_j \neq 0\}$$
$$L_1 \text{ penalty}: \quad \|\beta\|_1 = \sum_{j=1}^p |\beta_j|$$
$$L_2 \text{ penalty}: \quad \|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2.$$

The $L_0$ penalty would force us to choose estimates which make many of the $\beta_j$'s equal to 0. But there is no way to minimize $Q(\beta)$ without searching through all the submodels.

The L2 penalty is easy to implement. The estimate that minimizes

Σ_{i=1}^n (Yi - Σ_{j=1}^p βj Xij)² + λ Σ_{j=1}^p βj²

is called the ridge estimator. It can be shown that the estimator minimizing this penalized sum of squares (assuming the Xij's are standardized) is

β̂ = (X^T X + λI)^{-1} X^T Y,

where I is the identity matrix. When λ = 0 we get the least squares estimate (low bias, high variance). When λ → ∞ we get β̂ = 0 (high bias, low variance).

24.6 Example (Crime Data Revisited). Here is a re-analysis of the crime data using ridge regression. The estimates are plotted as a function of λ in Figure 14, top left. Notice that ridge regression produces a linear estimator: β̂ = SY where S = (X^T X + λI)^{-1} X^T, and Ŷ = HY where H = X(X^T X + λI)^{-1} X^T.

The effective degrees of freedom is defined to be df(λ) = trace(H). When λ = 0 we have df(λ) = p and when λ → ∞, df(λ) → 0. See Figure 14, top right. How do we choose λ? Recall that the cross-validation estimate of predictive risk is

CV = Σ_{i=1}^n (Yi - r̂(-i)(Xi))².

It can be shown that

CV = Σ_{i=1}^n ( (Yi - r̂(xi)) / (1 - Hii) )².

Thus we can choose λ to minimize CV. See the middle left plot in Figure 14. An alternative criterion that is sometimes used is generalized cross-validation, or GCV. This is just an approximation to CV in which each Hii is replaced with its average n^{-1} Σ_{i=1}^n Hii. Thus,

GCV = Σ_{i=1}^n ( (Yi - r̂(xi)) / (1 - b) )²   where   b = (1/n) Σ_{i=1}^n Hii = df(λ)/n.

The function lm.ridge, which is part of the MASS library, does ridge regression and computes GCV. The middle right and bottom left plots are from this function. Here is the R code:

Trace = function(X){ sum(diag(X)) }



Figure 14: Ridge regression of crime data.


ridge.fun = function(X,y,lambda){
  n = length(y)
  p = ncol(X)
  k = length(lambda)
  I = diag(rep(1,p))
  beta = matrix(0,p,k)
  df = rep(0,k)
  cv = rep(0,k)
  for(i in 1:k){
    S = solve(t(X) %*% X + (lambda[i]*I)) %*% t(X)
    beta[,i] = S %*% y
    H = X %*% S
    y.hat = H %*% y
    df[i] = Trace(H)                           # effective degrees of freedom
    cv[i] = sum( ((y-y.hat)/(1-diag(H)))^2 )   # leave-one-out CV shortcut
  }
  return(list(beta=beta,df=df,cv=cv))          # return the results as a list
}

p = ncol(crime.dat)
crime.dat = scale(crime.dat)   ### scale the variables
Crime = scale(Crime)

lambda = seq(0,25,length=100)
out1   = ridge.fun(as.matrix(crime.dat),Crime,lambda)

postscript("ridge.ps",horizontal=F) par(mfrow=c(3,2)) matplot(lambda,t(out1$beta),type="l",lty=1,xlab="lambda",ylab="beta") lines(lambda,rep(0,length(lambda))) matplot(out1$df,t(out1$beta),type="l",lty=1,xlab="df",ylab="beta") lines(out1$df,rep(0,length(lambda))) plot(out1$df, out1$cv,type="l")

library(MASS)
out2 = lm.ridge(Crime ~ Age + Southern + Education + Expenditure + Labor +
                Males + pop + U1 + U2 + Wealth, data=crime.dat, lambda=lambda)
print(summary(out2))
matplot(lambda,t(out2$coef),type="l",lty=1)
plot(out1$df,out2$GCV,type="l")
dev.off()

The problem with ridge regression is that we really haven't done variable selection because we haven't forced any of the βj's to be 0. This is where the L1 penalty comes in.


The lasso estimator β̂(λ) is the value of β that solves

min_{β ∈ R^p} Σ_{i=1}^n (Yi - Xi^T β)² + λ||β||_1                    (81)

where λ > 0 and ||β||_1 = Σ_{j=1}^p |βj| is the L1 norm of the vector β.

The lasso is called basis pursuit in the signal processing literature. Equation (81) defines a convex optimization problem with a unique solution β̂(λ) that depends on λ. Typically, it turns out that many of the β̂j(λ)'s are zero. Thus, the lasso performs estimation and model selection simultaneously. The selected model, for a given λ, is

S(λ) = { j : β̂j(λ) ≠ 0 }.                                           (82)

The constant λ can be chosen by cross-validation. The estimator has to be computed numerically, but this is a convex optimization and so can be solved quickly. What is special about the L1 penalty? First, it is the closest penalty to the L0 penalty that keeps Q(β) convex. Moreover, the L1 penalty captures sparsity.

Digression on Sparsity. We would like our estimator to be sparse, meaning that most βj's are zero (or close to zero). Consider the following two vectors, each of length p: u = (1, 0, . . . , 0) and v = (1/√p, 1/√p, . . . , 1/√p). Intuitively, u is sparse while v is not. Let us now compute the norms: ||u||_1 = 1, ||u||_2 = 1, ||v||_1 = √p, ||v||_2 = 1. So the L1 norm correctly captures sparseness.

24.7 Example. Figure 15 shows the traces of β̂j(λ) as a function of the number of steps in the algorithm when the lasso is run on the crime data. The value of Cp is shown in Figure 16. The best model includes variables x7, x1, x13, x6, x3, x11, x12, x10. This is similar to the model selected by forward stepwise regression, but notice that x4 (expenditures) is chosen first by forward stepwise and is chosen last by the lasso.

Two related variable selection methods are forward stagewise regression and lars. In forward stagewise regression we first set Ŷ = (0, . . . , 0)^T and we choose a small, positive constant ε. Now we build the predicted values incrementally. Let Ŷ denote the current vector of predicted values. Find the current correlations

c = c(Ŷ) = X^T (Y - Ŷ)   and set   ĵ = argmax_j |cj|.                (83)

Finally, we update Ŷ by the following equation:

Ŷ ← Ŷ + ε sign(c_ĵ) x_ĵ.                                             (84)

This is like forward stepwise regression except that we only take small, incremental steps towards the next variable and we do not go back and refit the previous variables by least squares. A modification of forward stagewise regression is called least angle regression (lars). We begin with all coefficients set to 0 and then find the predictor xj most correlated with Y. Then we increase β̂j in the direction of the sign of its correlation with Y and set the residual r = Y - Ŷ. When some other predictor xk has as much correlation with r as xj has, we increase (β̂j, β̂k) in their joint least squares direction, until some other predictor xm has as much correlation with the residual r. Continue until all predictors are in the model. A formal description is in Figure 17. lars can be easily modified to produce the lasso estimator: if a nonzero coefficient ever hits zero, remove it from the active set A of predictors and recompute the joint direction. This is why the lars function in R is used to compute the lasso estimator. You need to download the lars package first.
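Here is a minimal sketch of how the lasso path might be computed with the lars package (my own illustration, assuming the package is installed; x and y are the design matrix and response as before, and cv.lars is the package's cross-validation helper):

library(lars)
out = lars(x, y, type="lasso")   # the entire lasso path
plot(out)                        # coefficient trace plot, as in Figure 15
summary(out)                     # Df, RSS and Cp at each step, as in Figure 16
cv.lars(x, y, K=10)              # 10-fold cross-validation along the path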


Figure 15: Coefficient trace plot for the lasso with the crime data.


Figure 16: Cp plot for the lasso with the crime data.


lars

1. Set Ŷ = 0, k = 0, A = ∅. Now repeat steps 2-3 until A^c = ∅.

2. Compute the following quantities:

   c = X^T (Y - Ŷ)                                        (85)
   C = max_j {|cj|}
   A = {j : |cj| = C}
   sj = sign(cj), j ∈ A
   XA = (sj xj : j ∈ A)
   G = XA^T XA                                            (86)
   B = (1^T G^{-1} 1)^{-1/2}
   w = B G^{-1} 1
   u = XA w
   a = X^T u

   where 1 is a vector of 1s of length |A|.

3. Set Ŷ ← Ŷ + γ̂ u where

   γ̂ = min+_{j ∈ A^c} { (C - cj)/(B - aj), (C + cj)/(B + aj) }.      (87)

   Here, min+ means that the minimum is only over positive components.

Figure 17: A formal description of lars.

Summary

1. The prediction risk R(S) = n^{-1} Σ_{i=1}^n (Ŷi(S) - Yi)² can be decomposed into unavoidable error, bias and variance.

2. Large models have low bias and high variance. Small models have high bias and low variance. This is the bias-variance tradeoff.

3. Model selection methods aim to find a model which balances bias and variance, yielding a small risk.

4. Cp or cross-validation are used to estimate the risk.

5. Search methods look through a subset of models and find the one with the smallest value of the estimated risk R̂(S).

6. The lasso estimates β with the penalized residual sum of squares Σ_{i=1}^n (Yi - Xi^T β)² + λ||β||_1. Some of the estimated coefficients will be 0, which corresponds to omitting those variables from the model. lars is an efficient algorithm for computing the lasso estimates.

25 Variable Selection versus Hypothesis Testing


The difference between variable selection and hypothesis testing can be confusing. Look at a simple example. Let Y1, . . . , Yn ~ N(μ, 1).

We want to compare two models: M0 : N(0, 1) and M1 : N(μ, 1).

Hypothesis Testing. We test H0 : μ = 0 versus H1 : μ ≠ 0. The test statistic is

Z = (Ȳ - 0)/√V(Ȳ) = √n Ȳ.

We reject H0 if |Z| > z_{α/2}. For α = 0.05, we reject H0 if |Z| > 2, i.e., if |Ȳ| > 2/√n.

AIC. The likelihood is proportional to

L(μ) = Π_{i=1}^n e^{-(Yi - μ)²/2} = e^{-n(Ȳ - μ)²/2} e^{-nS²/2}

where S² = n^{-1} Σ_i (Yi - Ȳ)². Hence, the log-likelihood is

ℓ(μ) = -n(Ȳ - μ)²/2 - nS²/2.

Recall that AIC = ℓ(μ̂) - |S|, where |S| is the number of parameters in the model. The AIC scores are

AIC0 = ℓ(0) - 0 = -nȲ²/2 - nS²/2

and

AIC1 = ℓ(μ̂) - 1 = -nS²/2 - 1

since μ̂ = Ȳ. We choose model 1 if AIC1 > AIC0, that is, if

-nS²/2 - 1 > -nȲ²/2 - nS²/2

or

|Ȳ| > √2/√n.

This is similar to, but not the same as, the hypothesis test.

BIC. The BIC scores are

BIC0 = ℓ(0) - (0/2) log n = -nȲ²/2 - nS²/2

and

BIC1 = ℓ(μ̂) - (1/2) log n = -nS²/2 - (1/2) log n.

We choose model 1 if BIC1 > BIC0, that is, if

|Ȳ| > √(log n)/√n.

Hypothesis testing     controls type I errors
AIC/CV/Cp              finds the most predictive model
BIC                    finds the true model (with high probability)
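As a quick numerical illustration of how the three cutoffs for |Ȳ| compare (my own example, not from the text), the thresholds 2/√n (testing), √(2/n) (AIC) and √(log n / n) (BIC) can be tabulated directly:

n = c(10, 100, 1000, 10000)
cbind(n, test = 2/sqrt(n), aic = sqrt(2/n), bic = sqrt(log(n)/n))

The BIC cutoff shrinks more slowly with n, so BIC demands stronger evidence before including μ in the model as the sample size grows.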

26 Collinearity
If one of the predictor variables is a linear combination of the others, then we say that the variables are collinear. The result is that X^T X is not invertible. Formally, this means that the standard error of β̂ is infinite and the standard error for predictions is infinite. For example, suppose that xi1 = 2 for every i and suppose we include an intercept. Then the X matrix is

1 2
1 2
. .
1 2

and so

X^T X = n ( 1  2
            2  4 )

which is not invertible. The implied model in this example is

Yi = β0 + β1 xi1 + εi = β0 + 2β1 + εi = θ0 + εi

where θ0 = β0 + 2β1. We can estimate θ0 using Ȳ but there is no way to separate this into estimates for β0 and β1. Sometimes the variables are close to collinear. The result is that it may be difficult to invert X^T X. However, the bigger problem is that the standard errors will be huge. The solution is easy: don't use all the variables; use variable selection. Multicollinearity is just an extreme example of the bias-variance tradeoff we face whenever we do regression. If we include too many variables, we get poor predictions due to increased variance.
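A two-line simulation (hypothetical data, not from the text) shows how R reacts to this situation: with a constant covariate the design matrix is rank deficient, and lm reports NA for the coefficient it cannot separate from the intercept.

y = rnorm(20)
x1 = rep(2, 20)   # x1 is collinear with the intercept
lm(y ~ x1)        # the coefficient of x1 is reported as NA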


27 Robust Regression
So far we have dealt with outliers by looking for them and deleting them. A more systematic way to deal with outliers is through robust regression. Recall that in least squares, we estimate β by minimizing the RSS. The idea of robust regression is to replace the RSS with a different criterion. Let's start with a simpler setting. Suppose Yi = μ + εi and we want to estimate μ. If we choose μ̂ to minimize

RSS = Σ_i (Yi - μ)²

we get μ̂ = Ȳ. Now, μ̂ = Ȳ is not a robust estimator: its value will change drastically if we move one observation. An alternative estimator is the median. The median is very robust; moving one observation will have little or no effect on it. How does the median compare to the mean as an estimator? If the data are Normal, it can be shown that

V(Ȳ) ≈ .64 V(median).

The implication is that the median is less efficient than the mean (which is the MLE). However, this is only true if the data are exactly Normal. If the data are non-Normal, and in particular if there are occasional outliers, the median is preferable because of its robustness. This is a general idea: we can give up a bit of efficiency in favor of gaining some robustness. It can be shown that the median is obtained by minimizing Σ_i |Yi - μ|.

More generally, we can estimate μ by minimizing

Σ_i ρ(Yi - μ)

for some function ρ. Huber's estimator corresponds to

ρ(x) = x²              if |x| ≤ c
       c(2|x| - c)     if |x| > c.

As c → ∞ we get the mean. As c → 0 we get the median. A common choice is c = 1.345. This gives 95 percent efficiency at the Normal. Actually, we need to be a bit more sophisticated about this estimator. The choice of cutoff c must be relative to the scale of Y. So in fact we minimize

Σ_i ρ( (Yi - μ)/s )

where s is itself a robust estimate of σ. An example is the MAD (median absolute deviation):

s = median_i |Yi - median_j Yj| / .6745.

The reason we divide by .6745 is because this ensures that s converges to σ as n increases (otherwise, it would converge to .6745 σ). How can we transfer this idea to regression? We choose β̂ to minimize

Σ_{i=1}^n ρ( (Yi - Xi^T β)/s )

where ρ is the Huber function (or some other ρ function) and s is an estimate of σ. The resulting estimator is called an M-estimator. (Choosing ρ = -log f for some density function f corresponds to maximum likelihood estimation.)

27.1 Example. We will create a synthetic example.

postscript("robust.ps",horizontal=F)
library(MASS)
n = 100
x = (1:n)/n
eps = rnorm(n,0,.1)
y = 2 + 3*x + eps
y[90] = 2          ### create an outlier
plot(x,y)
out1 = lm(y ~ x)
out2 = rlm(y ~ x)
print(summary(out1))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.05480    0.05704   36.02   <2e-16 ***
x            2.82522    0.09806   28.81   <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.2831 on 98 degrees of freedom
Multiple R-Squared: 0.8944, Adjusted R-squared: 0.8933
F-statistic: 830.1 on 1 and 98 DF, p-value: < 2.2e-16

> print(summary(out2))

Call: rlm.formula(formula = y ~ x)
Residuals:
      Min        1Q    Median        3Q       Max
-2.678044 -0.064177  0.005511  0.062456  0.181200

Coefficients:
            Value  Std. Error t value
(Intercept) 2.0197 0.0195     103.5637
x           2.9537 0.0335      88.1012

Residual standard error: 0.09395 on 98 degrees of freedom

> print(sqrt(sum(out1$res^2)/(n-2)))
[1] 0.283053
> print(out2$s)
[1] 0.09394536


28 Nonlinear Regression
We can fit regression models when the regression is nonlinear:

Yi = r(Xi; β) + εi

where the regression function r(x; β) is a known function except for some parameters β = (β1, . . . , βk).

28.1 Example. Figure 18 shows the weight of a patient on a weight rehabilitation program as a function of the number of days in the program. The data are from Venables and Ripley (1994). It is hypothesized that Yi = r(xi; β) + εi, where

r(x; β) = β0 + β1 2^{-x/β2}.

Since lim_{x→∞} r(x; β) = β0, we see that β0 is the ideal stable lean weight. Also, r(0; β) - r(∞; β) = β1, so β1 is the amount of weight to be lost. Finally, the expected remaining weight r(x; β) - β0 is one-half of the starting remaining weight r(0; β) - β0 when x = β2. So β2 is the half-life, i.e., the time to lose half the remaining weight. The parameter estimate β̂ is found by minimizing

RSS = Σ_{i=1}^n (Yi - r(Xi; β))².

Generally, this must be done numerically. The algorithms are iterative and you must supply starting values for the parameters. Here is how to fit the example in R.

library(MASS)
attach(wtloss)
plot(Days,Weight,pch=19)
out = nls(Weight ~ b0 + b1*2^(-Days/b2), data=wtloss,
          start=list(b0=90,b1=95,b2=120))
info = summary(out)
print(info)

Formula: Weight ~ b0 + b1 * 2^(-Days/b2)

Parameters:
   Estimate Std. Error t value Pr(>|t|)
b0   81.374      2.269   35.86   <2e-16 ***
b1  102.684      2.083   49.30   <2e-16 ***
b2  141.911      5.295   26.80   <2e-16 ***
---
Residual standard error: 0.8949 on 49 degrees of freedom



Figure 18: Weight Loss Data.

Correlation of Parameter Estimates:
        b0      b1
b1 -0.9891
b2 -0.9857  0.9561

b = info$parameters[,1]
grid = seq(0,250,length=1000)
fit = b[1] + b[2]*2^(-grid/b[3])
lines(grid,fit,lty=1,lwd=3,col=2)
plot(Days,info$residuals)
lines(Days,rep(0,length(Days)))
dev.off()

The fit and residuals are shown in Figure 18.


29 Logistic Regression
Logistic regression is a generalization of regression that is used when the outcome Y is binary. Suppose that Yi ∈ {0, 1} and we want to relate Y to some covariate x. The usual regression model is not appropriate since it does not constrain Y to be binary. With the logistic regression model we assume that

E(Yi|Xi) = P(Yi = 1|Xi) = e^{β0 + β1 Xi} / (1 + e^{β0 + β1 Xi}).

Note that since Yi is binary, E(Yi|Xi) = P(Yi = 1|Xi). Figure 19 shows the logistic function e^{β0 + β1 x}/(1 + e^{β0 + β1 x}). The parameter β1 controls the steepness of the curve. The parameter β0 controls the horizontal shift of the curve. Define the logit function

logit(z) = log( z/(1 - z) ).

Also, define πi = P(Yi = 1|Xi). Then we can rewrite the logistic model as

logit(πi) = β0 + β1 Xi.

The extension to several covariates is straightforward:

logit(πi) = β0 + Σ_{j=1}^p βj xij = Xi^T β.

How do we estimate the parameters? Usually we use maximum likelihood. Let's review the basics of maximum likelihood. Let Y ∈ {0, 1} denote the outcome of a coin toss. We call Y a Bernoulli random variable. Let π = P(Y = 1) and 1 - π = P(Y = 0). The probability function is

f(y; π) = π^y (1 - π)^{1-y}.

The probability function for n independent tosses, Y1, . . . , Yn, is

f(y1, . . . , yn; π) = Π_{i=1}^n f(yi; π) = Π_{i=1}^n π^{yi} (1 - π)^{1-yi}.

The likelihood function is just the probability function regarded as a function of the parameter, treating the data as fixed:

L(π) = Π_{i=1}^n π^{yi} (1 - π)^{1-yi}.

The maximum likelihood estimator, or MLE, is the value π̂ that maximizes L(π). Maximizing the likelihood is equivalent to maximizing the log-likelihood function

ℓ(π) = log L(π) = Σ_{i=1}^n [ yi log π + (1 - yi) log(1 - π) ].

Setting the derivative of ℓ(π) to zero yields

π̂ = (1/n) Σ_{i=1}^n Yi.

Figure 19: The logistic function p = e^x/(1 + e^x).

Recall that the Fisher information is defined to be

I(π) = -E( ∂²ℓ(π)/∂π² ).

The approximate standard error is

se(π̂) = 1/√I(π̂) = √( π̂(1 - π̂)/n ).

Returning to logistic regression, the likelihood function is

L(β) = Π_{i=1}^n f(yi|Xi; β) = Π_{i=1}^n πi^{Yi} (1 - πi)^{1-Yi}

where

πi = e^{β^T Xi} / (1 + e^{β^T Xi}).

The maximum likelihood estimator has to be found numerically. The usual algorithm is called iteratively reweighted least squares and works as follows. First set starting values β̂(0). Now, for k = 1, 2, . . . do the following steps until convergence:

1. Compute the fitted values

   π̂i = e^{β̂(k)^T Xi} / (1 + e^{β̂(k)^T Xi}),   i = 1, . . . , n.

2. Define an n × n diagonal weight matrix W whose ith diagonal element is π̂i(1 - π̂i).

3. Define the adjusted response vector

   Z = X β̂(k) + W^{-1} (Y - π̂)

   where π̂^T = (π̂1, . . . , π̂n).

4. Take

   β̂(k+1) = (X^T W X)^{-1} X^T W Z,

   which is the weighted linear regression of Z on X.

The standard errors are given by V(β̂) = (X^T W X)^{-1}.
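In practice one would call glm(y ~ x, family=binomial), but to make the four steps concrete here is a minimal R sketch of the iteration described above (my own illustration; the function name irls.logit and its arguments are not from the text):

irls.logit = function(X, y, tol=1e-8, maxit=100){
  # X: n x p design matrix (include a column of 1s for the intercept); y: 0/1 response
  beta = rep(0, ncol(X))                                     # starting values beta^(0)
  for(k in 1:maxit){
    eta = X %*% beta
    pi.hat = as.vector(exp(eta)/(1+exp(eta)))                # step 1: fitted values
    W = diag(pi.hat*(1-pi.hat))                              # step 2: weight matrix
    Z = eta + solve(W) %*% (y - pi.hat)                      # step 3: adjusted response
    beta.new = solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% Z  # step 4: weighted least squares
    if(max(abs(beta.new - beta)) < tol){ beta = beta.new; break }
    beta = beta.new
  }
  se = sqrt(diag(solve(t(X) %*% W %*% X)))                   # standard errors from (X'WX)^(-1)
  list(beta = as.vector(beta), se = se)
}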

29.1 Example. The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease. There are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. A logistic regression yields the following estimates and Wald statistics Wj for the coefficients:

Covariate     β̂j      se       Wj      p-value
Intercept   -6.145    1.300   -4.738    0.000
sbp          0.007    0.006    1.138    0.255
tobacco      0.079    0.027    2.991    0.003
ldl          0.174    0.059    2.925    0.003
adiposity    0.019    0.029    0.637    0.524
famhist      0.925    0.227    4.078    0.000
typea        0.040    0.012    3.233    0.001
obesity     -0.063    0.044   -1.427    0.153
alcohol      0.000    0.004    0.027    0.979
age          0.045    0.012    3.754    0.000

Are you surprised by the fact that systolic blood pressure is not significant, or by the minus sign for the obesity coefficient? If yes, then you are confusing association and causation. The fact that blood pressure is not significant does not mean that blood pressure is not an important cause of heart disease. It means that it is not an important predictor of heart disease relative to the other variables in the model. Model selection can be done using AIC or BIC:

AIC_S = -2 ℓ(β̂_S) + 2|S|

where S is a subset of the covariates. There are two different types of residuals: the Pearson or χ² residuals

(Yi - π̂i) / √( π̂i(1 - π̂i) )

and the deviance residuals

sign(Yi - π̂i) √( 2 [ Yi log(Yi/π̂i) + (1 - Yi) log( (1 - Yi)/(1 - π̂i) ) ] )

where we interpret 0 log 0 = 0. These are approximately the same. When there are replications at each x value, the residuals will behave like N(0,1) random variables when the model is correct. Without replication, the residuals are useless. To fit this model in R we use the glm command, which stands for generalized linear model.

> attach(sa.data)
> out = glm(chd ~ ., family=binomial, data=sa.data)
> print(summary(out))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.1482113  1.2977108  -4.738 2.16e-06 ***
sbp          0.0065039  0.0057129   1.138 0.254928
tobacco      0.0793674  0.0265321   2.991 0.002777 **
ldl          0.1738948  0.0594451   2.925 0.003441 **
adiposity    0.0185806  0.0291616   0.637 0.524020
famhist      0.9252043  0.2268939   4.078 4.55e-05 ***
typea        0.0395805  0.0122417   3.233 0.001224 **
obesity     -0.0629112  0.0440721  -1.427 0.153447
alcohol      0.0001196  0.0044703   0.027 0.978655
age          0.0452028  0.0120398   3.754 0.000174 ***
---

> out2 = step(out)
Start: AIC= 492.14
chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + alcohol + age

etc.

Step: AIC= 487.69
chd ~ tobacco + ldl + famhist + typea + age

           Df Deviance    AIC
<none>          475.69 487.69
- ldl       1   484.71 494.71
- typea     1   485.44 495.44
- tobacco   1   486.03 496.03
- famhist   1   492.09 502.09
- age       1   502.38 512.38

> p = out2$fitted.values
> names(p) = NULL
> n = nrow(sa.data)
> predict = rep(0,n)
> predict[p > .5] = 1
> print(table(chd,predict))
   predict
chd   0   1
  0 256  46
  1  73  87
> error = sum( ((chd==1)&(predict==0)) | ((chd==0)&(predict==1)) )/n
> print(error)
[1] 0.2575758


30 More About Logistic Regression


Just when you thought you understood logistic regression... Suppose we have a binary outcome Yi and a continuous covariate Xi. To examine the relationship between x and Y we used the logistic model

P(Y = 1|x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x}).

To formally test if there is a relationship between x and Y we test H0 : β1 = 0 versus H1 : β1 ≠ 0. When the Xi's are random (so I am writing them with a capital letter) there is another way to think about this and it is instructive to do so. Suppose, for example, that X is the amount of exposure to a chemical and Y is presence or absence of disease. Instead of regressing Y on X, you might simply compare the distribution of X among the sick (Y = 1) and among the healthy (Y = 0). Let's consider both methods for analyzing the data.

Method 1: Logistic Regression (Y | X). The first plot in Figure 20 shows Y versus x and the fitted logistic model. The results of the regression are:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2785     0.5422  -4.202 2.64e-05 ***
x             2.1933     0.4567   4.802 1.57e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 138.629 on 99 degrees of freedom
Residual deviance: 72.549 on 98 degrees of freedom
AIC: 76.55

The test for H0 : β1 = 0 is highly significant and we conclude that there is a strong relationship between Y and X.

Method 2: Comparing Two Distributions (X | Y). Think of X as the outcome and Y as a group indicator. Examine the boxplots and the histograms in the figure. To test whether these distributions (or at least the means of the distributions) are the same, we can do a standard t-test of H0 : E(X|Y = 1) = E(X|Y = 0) versus H1 : E(X|Y = 1) ≠ E(X|Y = 0).

x0 = x[y==0]
x1 = x[y==1]
> print(t.test(x0,x1))

Welch Two Sample t-test


Figure 20: Logistic Regression?


data:  x0 and x1
t = -9.3604, df = 97.782, p-value = 3.016e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.341486 -1.522313
sample estimates:
mean of x mean of y
0.1148648 2.0467645

Again we conclude that there is a difference.

What's the connection? Let f0 and f1 be the probability density functions of X in the two groups. By Bayes' theorem, and letting π = P(Y = 1),

P(Y = 1|X = x) = π f(x|Y = 1) / [ π f(x|Y = 1) + (1 - π) f(x|Y = 0) ]
               = π f1(x) / [ π f1(x) + (1 - π) f0(x) ].

Now suppose that X|Y = 0 ~ N(μ0, σ²) and that X|Y = 1 ~ N(μ1, σ²). Also, let π = P(Y = 1). Then the last equation becomes

P(Y = 1|X = x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x})

where

β0 = log( π/(1 - π) ) + (μ0² - μ1²)/(2σ²)                    (88)

and

β1 = (μ1 - μ0)/σ².                                           (89)

This is exactly the logistic regression model! Moreover, β1 = 0 if and only if μ0 = μ1. Thus, the two approaches are testing the same thing. In fact, here is how I generated the data for the example. I took P(Y = 1) = 1/2, f0 = N(0, 1) and f1 = N(2, 1). Plugging into (88) and (89) we see that β0 = -2 and β1 = 2. Indeed, we see that β̂0 = -2.3 and β̂1 = 2.2. These are two different ways of answering the same question.
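The connection is easy to check by simulation. The following sketch (my own code, mirroring how the example data were described as being generated) draws X from N(0,1) or N(2,1) according to Y and fits the logistic regression; the estimates should be close to β0 = -2 and β1 = 2:

set.seed(1)
n = 100
y = rbinom(n, 1, 1/2)                # P(Y = 1) = 1/2
x = rnorm(n, mean = 2*y, sd = 1)     # X|Y=0 ~ N(0,1), X|Y=1 ~ N(2,1)
coef(glm(y ~ x, family = binomial))  # roughly (-2, 2), as predicted by (88) and (89)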

31 Logistic Regression With Replication


When there are replications, we can say more about diagnostics. Suppose there is one covariate taking values x1, . . . , xk and suppose there are ni observations at each Xi. Now we let Yi denote the number of successes at Xi. Hence, Yi ~ Binomial(ni, πi). We can fit the logistic regression as before:

logit(πi) = Xi^T β

and now we define the Pearson residuals

ri = (Yi - ni π̂i) / √( ni π̂i (1 - π̂i) )

and the deviance residuals

di = sign(Yi - Ŷi) √( 2 Yi log(Yi/Ŷi) + 2 (ni - Yi) log( (ni - Yi)/(ni - Ŷi) ) )

where Ŷi = ni π̂i. We can also form standardized versions of these. Let

H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}

where W is diagonal with ith element ni π̂i (1 - π̂i). The standardized Pearson residuals are

r̃i = ri / √(1 - Hii)

which should behave like N(0,1) random variables if the model is correct. Similarly, define the standardized deviance residuals by

d̃i = di / √(1 - Hii).

Goodness-of-Fit Test. The Pearson statistic

χ² = Σ_i ri²

and the deviance

D = Σ_i di²

both have, approximately, a χ²_{n-p} distribution if the model is correct. Large values are indicative of a problem.

Let us now discuss the use of residuals. We'll do this in the context of an example. Here are the data:

y = c(2, 7, 9,14,23,29,29,29)
n = c(29,30,28,27,30,31,30,29)
x = c(49.06,52.99,56.91,60.84,64.76,68.69,72.61,76.54)

The data, from Strand (1930) and Collett (1991), are the numbers of flour beetles killed by carbon disulphide (CS2). The covariate is the dose of CS2 in mg/l. There are two ways to run the regression:

> out = glm(cbind(y,n-y) ~ x, family=binomial)
> print(summary(out))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.7312     1.8300  -8.050 8.28e-16 ***
x             0.2478     0.0303   8.179 2.87e-16 ***

Null deviance: 137.7204  on 7  degrees of freedom
Residual deviance: 2.6558  on 6  degrees of freedom

> b = out$coef
> grid = seq(min(x),max(x),length=1000)
> l = b[1] + b[2]*grid
> fit = exp(l)/(1+exp(l))
> lines(grid,fit,lwd=3)
> Y = c(rep(1,sum(y)),rep(0,sum(n)-sum(y)))
> X = c(rep(x,y),rep(x,n-y))
> out2 = glm(Y ~ X, family=binomial)
> print(summary(out2))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.73081    1.82170  -8.086 6.15e-16 ***
X             0.24784    0.03016   8.218  < 2e-16 ***

Null deviance: 313.63  on 233  degrees of freedom
Residual deviance: 178.56  on 232  degrees of freedom

The outcome is the same except for the deviance. The correct deviance is from the first method. The test of it is:

> print(out$dev)
[1] 2.655771
> pvalue = 1-pchisq(out$dev,out$df.residual)
> print(pvalue)
[1] 0.8506433

So far so good. Still, we should look at the residuals.

> r = resid(out,type="deviance")
> p = out$linear.predictors
> plot(p,r,pch=19,xlab="linear predictor",ylab="deviance residuals")

Note that

> print(sum(r^2))
[1] 2.655771

gives back the deviance test. Now let's create standardized residuals.

r = rstandard(out)
plot(x,r)

You could do this by hand:

w = out$weights
W = diag(w)
WW = diag(sqrt(w))

X = cbind(rep(1,8),x)
H = WW %*% X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% WW
h = diag(H)
rr = r/sqrt(1-h)
plot(p,rr,pch=19,xlab="linear predictor",ylab="standardized deviance residuals")
qqnorm(rr)

Some people find it easier to look at the half-Normal plot, in which we plot the sorted absolute residuals versus their expected values, which are

Φ^{-1}( (i + k - 1/8)/(2k + 1/2) ).

For a regular (not half) probability plot we used

Φ^{-1}( (i - 3/8)/(k + 1/4) ).
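The half-Normal plot can be drawn directly from the standardized residuals rr computed above, using the quantile formula just given (a small sketch of my own):

k = length(rr)
q = qnorm((1:k + k - 1/8)/(2*k + 1/2))    # expected values of the sorted |residuals|
plot(q, sort(abs(rr)), xlab="expected", ylab="sorted |residual|")
abline(0, 1)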


32 Generalized Linear Models


We can write the logistic regression model as

Yi ~ Bernoulli(πi),    g(πi) = Xi^T β

where g(z) = logit(z). The function g is an example of a link function and the Bernoulli is an example of an exponential family, which we explain below. Any model in which Y has a distribution in the exponential family and some function of its mean is linear in a set of predictors is called a generalized linear model. A probability function (or probability density function) is said to be in the exponential family if there are functions η(θ), B(θ), T(x) and h(x) such that

f(x; θ) = h(x) e^{η(θ) T(x) - B(θ)}.

32.1 Example. Let X ~ Poisson(λ). Then

f(x; λ) = λ^x e^{-λ} / x! = (1/x!) e^{x log λ - λ}

and hence this is an exponential family with η(λ) = log λ, B(λ) = λ, T(x) = x, h(x) = 1/x!.

32.2 Example. Let X ~ Binomial(n, θ). Then

f(x; θ) = (n choose x) θ^x (1 - θ)^{n-x} = (n choose x) exp{ x log( θ/(1 - θ) ) + n log(1 - θ) }.

In this case,

η(θ) = log( θ/(1 - θ) ),   B(θ) = -n log(1 - θ),   T(x) = x,   h(x) = (n choose x).

If θ = (θ1, . . . , θk) is a vector, then we say that f(x; θ) has exponential family form if

f(x; θ) = h(x) exp{ Σ_{j=1}^k ηj(θ) Tj(x) - B(θ) }.

32.3 Example. Consider the Normal family with θ = (μ, σ). Now,

f(x; θ) = exp{ μx/σ² - x²/(2σ²) - (1/2)( μ²/σ² + log(2πσ²) ) }.

This is in exponential family form with

η1(θ) = μ/σ²,       T1(x) = x
η2(θ) = -1/(2σ²),   T2(x) = x²
B(θ) = (1/2)( μ²/σ² + log(2πσ²) ),   h(x) = 1.


Now consider independent random variables Y1, . . . , Yn, each from the same exponential family distribution. Let μi = E(Yi) and suppose that g(μi) = Xi^T β. This is a generalized linear model with link g. Notice that the regression equation E(Yi) = g^{-1}(Xi^T β) is based on the inverse of the link function.

32.4 Example (Normal Regression). Here, Yi ~ N(μi, σ²) and the link g(μi) = μi is the identity function.

32.5 Example (Logistic Regression). Here, Yi ~ Bernoulli(πi) and g(πi) = logit(πi).

32.6 Example (Poisson Regression). This is often used when the outcomes are counts. Here, Yi ~ Poisson(μi) and the usual link function is g(μi) = log(μi).

Although many link functions could be used, there are default link functions that are standard for each family. Here they are (from Table 12.5 in Weisberg):

Distribution   Link                        Inverse Link (Regression Function)
Normal         Identity: g(μ) = μ          μ = x^T β
Poisson        Log: g(μ) = log(μ)          μ = e^{x^T β}
Bernoulli      Logit: g(μ) = logit(μ)      μ = e^{x^T β}/(1 + e^{x^T β})
Gamma          Inverse: g(μ) = 1/μ         μ = 1/(x^T β)

In R you type:

glm(y ~ x, family= xxxx)

where xxxx is gaussian, binomial, poisson, etc. R will assume the default link.

32.7 Example. This is a famous data set collected by Sir Richard Doll in the 1950s. I am following Example 9.2.1 in Dobson. The data are on smoking and number of deaths due to coronary heart disease. Here are the data:
             Smokers                  Non-smokers
Age      Deaths  Person-years     Deaths  Person-years
35-44        32         52407         2         18790
45-54       104         43248        12         10673
55-64       206         28612        28          5710
65-74       186         12663        28          2585
75-84       102          5317        31          1462

There is an obvious increasing relationship with age which shows some hint of nonlinearity. The increase may differ between smokers and non-smokers so we will include an interaction term. We took the midpoint of each age


group as the age. When we fit the model, we can examine the fit by comparing the observed and predicted values. There are two types of residuals. The Pearson residuals are

ri = (Yi - Ŷi)/√Ŷi

and the deviance residuals are

di = sign(Yi - Ŷi) √( 2( Yi log(Yi/Ŷi) - (Yi - Ŷi) ) ).

Both should look like N(0,1) random variables if the model is correct. There is also a formal way to test the model for goodness of fit. In fact, there are two test statistics: the Pearson statistic

χ² = Σ_{i=1}^n ri²

and the deviance

D = Σ_i di².

Both should have, approximately, a χ²_{n-p} distribution, where p is the number of parameters. Let's look at the results.

> ### page 155, Dobson
> deaths = c(32,104,206,186,102,2,12,28,28,31)
> age    = c(40,50,60,70,80,40,50,60,70,80)
> py     = c(52407,43248,28612,12663,5317,
+            18790,10673,5710,2585,1462)
> rate   = deaths/py
> smoke  = c(1,1,1,1,1,0,0,0,0,0)
> agesq  = age*age
> sm.age = smoke*age

postscript("poisson.ps",horizontal=F) plot(age,rate*100000,xlab="Age",ylab="Death Rate (times 100,000)",pch=c(rep(19,6)))

out = glm(deaths smoke + age + agesq + sm.age + py,family=poisson) print(summary(out))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.195e+01  1.667e+00  -7.164 7.84e-13 ***
smoke        3.314e+00  1.737e+00   1.908   0.0564 .
age          4.398e-01  4.348e-02  10.113  < 2e-16 ***
agesq       -3.120e-03  3.032e-04 -10.290  < 2e-16 ***
sm.age      -2.491e-02  2.183e-02  -1.141   0.2540
py           8.989e-06  2.026e-05   0.444   0.6574
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Null deviance: 644.2690  on 9  degrees of freedom
Residual deviance: 3.6264  on 4  degrees of freedom
AIC: 70.694

pred = predict(out,type="response") ##as opposed to type ="link" pearson =(deaths - pred)/sqrt(pred) dev = sign(deaths - pred)*sqrt(2*(deaths*log(deaths/pred) - (deaths - pred))) cbind(age,smoke,deaths,round(pred),round(pearson,2),round(dev,2)) age smoke deaths 1 40 1 32 31 0.13 0.13 2 50 1 104 110 -0.57 -0.58 3 60 1 206 198 0.60 0.60 4 70 1 186 188 -0.13 -0.13 5 80 1 102 103 -0.13 -0.13 6 40 0 2 2 -0.18 -0.19 7 50 0 12 10 0.51 0.49 8 60 0 28 26 0.38 0.38 9 70 0 28 36 -1.28 -1.33 10 80 0 31 27 0.85 0.83 > > ch2 = sum(pearson2) > print(ch2);print(1-pchisq(ch2,10-6)) [1] 3.543416 [1] 0.4713077 > D = sum(dev2) > print(D);print(1-pchisq(D,10-6)) [1] 3.626378 [1] 0.4589242 The model appears to t well. Smoking appears to be quite important (but keep the usual causal caveats in mind). Suppose we want to compare smokers to non-smokers for 40 years olds. The estimated model is E(Y |x) = exp{0 + 1 smoke + 2 age + 3 age2 + 4 smoke age + 5 PY} and hence E(Y |smoker, age = 40) E(Y |non smoke, age = 40) = exp{0 + 1 + 402 + 16003 + 404 + 524075} exp{0 + 402 b1 +404 +336175 b b + 16003 + 187905}

= e = 13.73

suggesting that smokers in this group have a death rate due to coronary heart disease that is 13.73 times higher than non-smokers. Let's get a confidence interval for this. First, set

γ = β1 + 40β4 + 33617β5 = ℓ^T β   and   γ̂ = β̂1 + 40β̂4 + 33617β̂5 = ℓ^T β̂

where

ℓ^T = (0, 1, 0, 0, 40, 33617).

Then,

V(γ̂) = ℓ^T V ℓ

where V = V(β̂). An approximate 95 percent confidence interval for γ is

(a, b) = ( γ̂ - 2√V(γ̂), γ̂ + 2√V(γ̂) ).

We are interested in θ = e^γ. The confidence interval is (e^a, e^b). In R:

summ = summary(out)
v = summ$dispersion * summ$cov.unscaled
print(v)
ell = c(0,1,0,0,40,py[1]-py[6])
gam = sum(ell*out$coef)
print(exp(gam))
se = sqrt(ell %*% v %*% ell)
ci = exp(c(gam - 2*se, gam + 2*se))
print(ci)

The result is (7, 27).


Figure 21: Regression with measurement error. X is not observed. W is a noisy version of X. If you regress Y on W, you will get an inconsistent estimate of β1.

33 Measurement Error
Suppose we are interested in regressing the outcome Y on a covariate X but we cannot observe X directly. Rather, we observe X plus noise U. The observed data are (Y1, W1), . . . , (Yn, Wn) where

Yi = β0 + β1 Xi + εi
Wi = Xi + Ui

and E(Ui) = 0. This is called a measurement error problem or an errors-in-variables problem. The model is illustrated by the directed graph in Figure 21. It is tempting to ignore the error and just regress Y on W. If the goal is just to predict Y from W then there is no problem. But if the goal is to estimate β1, regressing Y on W leads to inconsistent estimates.

Let σx² = V(X), and assume that ε is independent of X, with mean 0 and variance σε². Also assume that U is independent of X, with mean 0 and variance σu². Let β̂1 be the least squares estimator of β1 obtained by regressing the Yi's on the Wi's. It can be shown that

β̂1 → λ β1   almost surely                                   (90)

where

λ = σx² / (σx² + σu²) < 1.                                   (91)

Thus, the effect of the measurement error is to bias the estimated slope towards 0, an effect that is usually called attenuation bias. Let us give a heuristic explanation of why (90) is true. For simplicity, assume that β0 = 0 and that E(X) = 0. So Ȳ ≈ 0, W̄ ≈ 0 and

β̂1 = Σi (Yi - Ȳ)(Wi - W̄) / Σi (Wi - W̄)²  ≈  ( (1/n) Σi Yi Wi ) / ( (1/n) Σi Wi² ).

Now,

(1/n) Σi Yi Wi = (1/n) Σi (β1 Xi + εi)(Xi + Ui)
              = β1 (1/n) Σi Xi² + β1 (1/n) Σi Xi Ui + (1/n) Σi εi Xi + (1/n) Σi εi Ui
              ≈ β1 σx².

Also,

(1/n) Σi Wi² = (1/n) Σi (Xi + Ui)²
             = (1/n) Σi Xi² + (1/n) Σi Ui² + (2/n) Σi Xi Ui
             ≈ σx² + σu²

which yields (90).

If there are several observed values of W for each X then σu² can be estimated. Otherwise, σu² must be estimated by external means such as through background knowledge of the noise mechanism. For our purposes, we will assume that σu² is known. Since σw² = σx² + σu², we can estimate σx² by

σ̂x² = σ̂w² - σ̂u²                                            (92)

where σ̂w² is the sample variance of the Wi's. This is called the method of moments estimator. Plugging these estimates into (91), we get an estimate λ̂ = (σ̂w² - σ̂u²)/σ̂w² of λ. The corresponding estimate of β1 is

β̂1/λ̂ = β̂1 σ̂w²/(σ̂w² - σ̂u²).                              (93)

This estimator makes little sense if σ̂w² - σ̂u² ≤ 0. In such cases, one might reasonably conclude that the sample size is simply not large enough to estimate β1.

Another method for correcting the attenuation bias is SIMEX, which stands for simulation extrapolation and is due to Cook and Stefanski. Recall that the least squares estimate β̂1 is a consistent estimate of β1 σx²/(σx² + σu²). Generate new random variables

Wi* = Wi + √θ σu Ui*

where Ui* ~ N(0, 1). The least squares estimate obtained by regressing the Yi's on the Wi*'s is a consistent estimate of

Ω(θ) = β1 σx² / ( σx² + (1 + θ)σu² ).                         (94)

Repeat this process B times (where B is large) and denote the resulting estimators by β̂1,1(θ), . . . , β̂1,B(θ). Then define

Ω̂(θ) = (1/B) Σ_{b=1}^B β̂1,b(θ).

Now comes some clever sleight of hand. Setting θ = -1 in (94) we see that Ω(-1) = β1, which is the quantity we want to estimate. The idea is to compute Ω̂(θ) for a range of values of θ such as 0, 0.5, 1.0, 1.5, 2.0. We then extrapolate the curve Ω̂(θ) back to θ = -1; see Figure 22. To do the extrapolation, we fit the values Ω̂(θj) to the curve

G(θ; γ1, γ2, γ3) = γ1 + γ2/(γ3 + θ)                           (95)

using standard nonlinear regression. Once we have estimates of the γ's, we take

β̂1 = G(-1; γ̂1, γ̂2, γ̂3)                                     (96)

as our corrected estimate of β1. Fitting the nonlinear regression (95) is inconvenient; it often suffices to approximate G(θ) with a quadratic. Thus, we fit the Ω̂(θj)'s to the curve

Q(θ; γ1, γ2, γ3) = γ1 + γ2 θ + γ3 θ²

and the corrected estimate of β1 is

β̂1 = Q(-1; γ̂1, γ̂2, γ̂3) = γ̂1 - γ̂2 + γ̂3.

An advantage of SIMEX is that it extends readily to nonlinear and nonparametric regression.
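Here is a minimal R sketch of SIMEX with the quadratic extrapolant (my own illustration under the assumptions above, with σu known; the function name simex.slope and its defaults are not from the text):

simex.slope = function(y, w, sigma.u, theta = c(0,.5,1,1.5,2), B = 200){
  omega = rep(0, length(theta))
  for(j in 1:length(theta)){
    b = rep(0, B)
    for(k in 1:B){
      wstar = w + sqrt(theta[j])*sigma.u*rnorm(length(w))   # add extra measurement error
      b[k] = coef(lm(y ~ wstar))[2]                         # refit the slope
    }
    omega[j] = mean(b)                                      # average over the B simulations
  }
  quad = lm(omega ~ theta + I(theta^2))                     # quadratic extrapolant Q
  sum(coef(quad) * c(1, -1, 1))                             # evaluate Q at theta = -1
}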


Figure 22: In the SIMEX method we extrapolate Ω̂(θ) back to θ = -1.


34 Nonparametric Regression
Now we will study nonparametric regression, also known as "learning a function" in the jargon of machine learning. We are given n pairs of observations (X1, Y1), . . . , (Xn, Yn) where

Yi = r(Xi) + εi,   i = 1, . . . , n                           (97)

and

r(x) = E(Y | X = x).                                          (98)


Figure 23: CMB data. The horizontal axis is the multipole moment, essentially the frequency of fluctuations in the temperature field of the CMB. The vertical axis is the power, or strength, of the fluctuations at each frequency. The top plot shows the full data set. The bottom plot shows the first 400 data points. The first peak, around x ≈ 200, is obvious. There may be a second and third peak further to the right.

34.1 Example (CMB data). Figure 23 shows data on the cosmic microwave background (CMB). The first plot shows 899 data points over the whole range while the second plot shows the first 400 data points. We have noisy measurements Yi of r(Xi), so the data are of the form (97). Our goal is to estimate r. It is believed that r may have three peaks over the range of the data. The first peak is obvious from the second plot. The presence of a second or third peak is much less obvious; careful inferences are required to assess the significance of these peaks.


The simplest nonparametric estimator is the regressogram. Suppose the Xi's are in the interval [a, b]. Divide the interval into m bins of equal length. Thus each bin has length h = (b - a)/m. Denote the bins by B1, . . . , Bm. Let kj be the number of observations in bin Bj and let Ȳj be the mean of the Yi's in bin Bj. Define

r̂n(x) = Ȳj   for x ∈ Bj.                                     (99)

We can rewrite the estimator as

r̂n(x) = Σ_{i=1}^n ℓi(x) Yi

where ℓi(x) = 1/kj if x, Xi ∈ Bj and ℓi(x) = 0 otherwise. Thus,

ℓ(x)^T = ( 0, 0, . . . , 0, 1/kj, . . . , 1/kj, 0, . . . , 0 ).
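Here is a minimal R sketch of the regressogram in equation (99) (my own illustration; the function name and the choice of endpoints are assumptions):

regressogram = function(x, y, a, b, m){
  breaks = seq(a, b, length=m+1)              # m bins of equal width h = (b-a)/m
  bin = cut(x, breaks, include.lowest=TRUE)   # which bin each Xi falls in
  means = tapply(y, bin, mean)                # Ybar_j for each bin
  function(x0) means[cut(x0, breaks, include.lowest=TRUE)]
}
# r.hat = regressogram(x, y, min(x), max(x), m = 10); r.hat(0.5)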

34.2 Example (LIDAR). These are data from a light detection and ranging (LIDAR) experiment. LIDAR is used to monitor pollutants. Figure 24 shows 221 observations. The response is the log of the ratio of light received from two lasers. The frequency of one laser is the resonance frequency of mercury while the second has a different frequency. The estimates shown here are regressograms. The smoothing parameter h is the width of the bins. As the binsize h decreases, the estimated regression function rn goes from oversmoothing to undersmoothing.


Figure 24: The LIDAR data from Example 34.2. The estimates are regressograms, obtained by averaging the Yi's over bins. As we decrease the binwidth h, the estimator becomes less smooth.

Let us now compute the bias and variance of the estimator. For simplicity, suppose that [a, b] = [0, 1] and further suppose that the Xi's are equally spaced so that each bin contains k = n/m points. Let us focus on r̂n(0). The mean (conditional

on the Xi s) is E(rn (0)) =

1 k

E(Yi ) =
iB1

1 k

r(Xi ).
iB1

By Taylors theorem r(Xi ) r(0) + Xi r (0). So, E(rn (0)) r(0) +

r (0) k

Xi .
iB1

The largest Xi can be in bin B1 is the length of the bin h = 1/m. So the absolute value of the bias is |r (0)| k The variance is Xi h|r (0)|.

iB1

2 m 2 2 = = . k n nh The mean squared error is the squared bias plus the variance: 2 . nh Large bins cause large bias. Small bins cause large variance. The MSE is minimized at MSE = h2 (r (0))2 + h= 2 2(r (0))2 n
1/3

c n1/3

for some c. With this optimal value of h, the riks (or MSE) is of the order n2/3 . Another simple estimator is the local average dened by rn (x) = 1 kx Yi .
i: |Xi x|h

(100)

The smoothing parameter is h. We can rewrite the estimtor as rn (x) =


n i=1 Yi K((x Xi )/h) n i=1 K((x Xi )/h)

(101)
n

where K(z) = 1 if |z| 1 and K(z) = 0 if |z| > 1. We can further rewrite the estimator as r n (x) = i=1 Yi i (x) where K((x Xi )/h)/ n K((x Xt )/h). We shall see later that his estimator has risk n4/5 which is better t=1 than n2/3 . n Notice that both estimators so far have the form rn (x) = i=1 i (x)Yi . In fact, most of the estimators we consider have this form. 34.3 Denition. An estimator rn of r is a linear smoother if, for each x, there exists a vector (x) = ( 1 (x), . . . , n (x))T such that
n

rn (x) =
i=1

i (x)Yi .

(102)

Dene the vector of tted values Y = (rn (x1 ), . . . , rn (xn ))T where Y = (Y1 , . . . , Yn ) . It then follows that Y = LY where L is an n n matrix whose i row is (Xi ) ; thus, Lij = given to each Yi in forming the estimate rn (Xi ).
th T j (Xi ). T

(103) (104) The entries of the i


th

row show the weights

76

34.4 Denition. The matrix L is called the smoothing matrix or the hat matrix. The ith row of L is called the effective kernel for estimating r(Xi ). We dene the effective degrees of freedom by = tr(L). (105)

34.5 Example (Regressogram). Recall that we divide (a, b) into m equally spaced bins denoted by B 1 , B2 , . . . , Bm . Dene rn (x) by 1 rn (x) = Yi , for x Bj (106) kj
i:Xi Bj

where kj is the number of points in Bj . In other words, the estimate rn is a step function obtained by averaging the Yi s over each bin. This estimate is called the regressogram. An example is given in Figure 24. For x B j dene n i (x) = 1/kj if Xi Bj and i (x) = 0 otherwise. Thus, rn (x) = i=1 Yi i (x). The vector of weights (x) looks like this: 1 1 (x)T = 0, 0, . . . , 0, , . . . , , 0, . . . , 0 . kj kj To see what the smoothing matrix L looks like, suppose that n = 9, m = 3 and k 1 = k2 = k3 = 3. Then, 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 L = 0 0 0 1 1 1 0 0 0 . 3 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1

In general, it is easy to see that there are = tr(L) = m effective degrees of freedom. The binwidth h = (b a)/m controls how smooth the estimate is. 34.6 Example (Local averages). Fix h > 0 and let Bx = {i : |Xi x| h}. Let nx be the number of points in Bx . For any x for which nx > 0 dene 1 Yi . rn (x) = nx
iBx

34.7 Example. Linear Regression. We have Y = HY where H = X(X T X)1 X T . We can write r(x) = xT xT (X T X)1 X T Y = i i (x)Yi . 77

This is the local average estimator of r(x), a special case of the kernel estimator discussed shortly. In this case, n rn (x) = i=1 Yi i (x) where i (x) = 1/nx if |Xi x| h and i (x) = 0 otherwise. As a simple example, suppose that n = 9, Xi = i/9 and h = 1/9. Then, 1/2 1/2 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 . L= 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/2 1/2

35 Choosing the Smoothing Parameter


The smoothers depend on some smoothing parameter h and we will need some way of choosing h. Recall from our discussion of variable selection that the predictive risk is E(Y rn (X))2 = 2 + E(r(X) rn (X))2 = 2 + MSE where MSE means mean-squared-error. Also, MSE = where bias(x) = E(rn (x)) r(x) is the bias of rn (x) and var(x) = Variance(rn (x)) is the variance. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed the opposite is true; see Figure 25. This is called the biasvariance tradeoff. Minimizing risk corresponds to balancing bias and variance. bias2 (x)p(x)dx + var(x)p(x)dx

Risk Bias squared

Less smoothing

Optimal smoothing

Figure 25: The biasvariance tradeoff. The bias increases and the variance decreases with the amount of smoothing. The optimal amount of smoothing, indicated by the vertical line, minimizes the risk = bias 2 + variance. Ideally, we would like to choose h to minimize R(h) but R(h) depends on the unknown function r(x). Instead, we will minimize an estimate R(h) of R(h). As a rst guess, we might use the average residual sums of squares, also called the training error n 1 (Yi rn (Xi ))2 (107) n i=1 to estimate R(h). This turns out to be a poor estimate of R(h): it is biased downwards and typically leads to undersmoothing (overtting). The reason is that we are using the data twice: to estimate the function and to estimate the 78

Variance

More smoothing

risk. The function estimate is chosen to make i=1 (Yi rn (Xi ))2 small so this will tend to underestimate the risk. We will estimate the risk using the leave-one-out cross-validation score which is dened as follows. 35.1 Denition. The leave-one-out cross-validation score is dened by
CV

= R(h) =

1 n

n i=1

(Yi r(i) (Xi ))2

(108)

where r(i) is the estimator obtained by omitting the ith pair (Xi , Yi ). The intuition for cross-validation is as follows. Note that E(Yi r(i) (Xi ))2 = E(Yi r(Xi ) + r(Xi ) r(i) (Xi ))2 = 2 + E(r(Xi ) r(i) (Xi ))2 2 + E(r(Xi ) rn (Xi ))2

and hence, E(R) predictive risk. (109) Thus the cross-validation score is a nearly unbiased estimate of the risk. There is a shortcut formula for computing R just like in linear regression. 35.2 Theorem. Let rn be a linear smoother. Then the leave-one-out cross-validation score R(h) can be written as n 2 1 Yi rn (Xi ) R(h) = (110) n i=1 1 Lii where Lii =
i (Xi )

is the ith diagonal element of the smoothing matrix L.

The smoothing parameter h can then be chosen by minimizing R(h). Rather than minimizing the cross-validation score, an alternative is to use generalized cross-validation in which each Lii in equation (110) is replaced with its average n1 n Lii = /n where = tr(L) is the effective degrees i=1 of freedom. Thus, we would minimize GCV(h) = 1 n
n i=1

Yi rn (Xi ) 1 /n

Rtraining . (1 /n)2

(111)

Usually, the bandwidth that minimizes the generalized cross-validation score is close to the bandwidth that minimizes the cross-validation score. Using the approximation (1 x)2 1 + 2x we see that GCV(h) where 2 = n1
n i=1 (Yi

1 n

n i=1

(Yi rn (Xi ))2 +

2 2 Cp n

(112)

rn (Xi ))2 . Equation (112) is just like the Cp statistic

36 Kernel Regression
We will often use the word kernel. For our purposes, the word kernel refers to any smooth function K such that K(x) 0 and K(x) dx = 1,
2 xK(x)dx = 0 and K

x2 K(x)dx > 0.

(113)

79

Some commonly used kernels are the following: the boxcar kernel : the Gaussian kernel : the Epanechnikov kernel : the tricube kernel : where I(x) = These kernels are plotted in Figure 26. 1 I(x), 2 2 1 K(x) = ex /2 , 2 3 K(x) = (1 x2 )I(x) 4 70 K(x) = (1 |x|3 )3 I(x) 81 K(x) =

1 if |x| 1 0 if |x| > 1.

Figure 26: Examples of kernels: boxcar (top left), Gaussian (top right), Epanechnikov (bottom left), and tricube (bottom right).

80

36.1 Denition. Let h > 0 be a positive number, called the bandwidth. The NadarayaWatson kernel estimator is dened by
n

rn (x) =
i=1

i (x)Yi

(114)

where K is a kernel and the weights

i (x)

are given by
i (x)

xXi h xxj n j=1 K h

(115)

36.2 Remark. The local average estimator in Example 34.6 is a kernel estimator based on the boxcar kernel.

R-code. In R, I suggest using the loess command or using the locfit library. (You need to download loct.) For loess: plot(x,y) out = loess(y x,span=.25,degree=0) lines(x,fitted(out)) The span option is the bandwidth. To compute GCV, you will need the effective number of parameters. You get this by typing: out$enp The command for kernel regression in loct is: out = locfit(y x,deg=0,alpha=c(0,h)) where h is the bandwidth you want to use. The alpha=c(0,h) part looks strange. There are two ways to specify the smoothing parameter. The rst way is as a percentage of the data, for example, alpha=c(.25,0) makes the bandwidth big enough so that one quarter of the data falls in the kernel. To smooth with a specic value for the bandwidth (as we are doing) we use alpha=c(0,h). The meaning of deg=0 will be explained later. Now try names(out) print(out) summary(out) plot(out) plot(x,fitted(out)) plot(x,residuals(out)) help(locfit) To do cross-validation, create a vector bandwidths h = (h1 , . . . , hk ). alpha then needs to be a matrix. h = c( ... put your values here ... ) k = length(h) zero = rep(0,k) H = cbind(zero,h) out = gcvplot(yx,deg=0,alpha=H) plot(out$df,out$values)

81

36.3 Example (CMB data). Recall the CMB data from Figure 23. Figure 27 shows four different kernel regression ts (using just the rst 400 data points) based on increasing bandwidths. The top two plots are based on small bandwidths and the ts are too rough. The bottom right plot is based on large bandwidth and the t is too smooth. The bottom left plot is just right. The bottom right plot also shows the presence of bias near the boundaries. As we shall see, this is a general feature of kernel regression. The bottom plot in Figure 28 shows a kernel t to all the data points. The bandwidth was chosen by cross-validation.

power

200

400

power
0

200

400

multipole

multipole

power

200

400

power
0

200

400

multipole

multipole

Figure 27: Four kernel regressions for the CMB data using just the rst 400 data points. The bandwidths used were h = 1 (top left), h = 10 (top right), h = 50 (bottom left), h = 200 (bottom right). As the bandwidth h increases, the estimated function goes from being too rough to too smooth. The choice of kernel K is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is conrmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. What does matter much more is the choice of bandwidth h which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. In general, we will let the bandwidth depend on the sample size so we sometimes write h n . The following theorem shows how the bandwidth affects the estimator. To state these results we need to make some assumption about the behavior of x1 , . . . , xn as n increases. For the purposes of the theorem, we will assume that these are random draws from some density f . 36.4 Theorem. The risk (using integrated squared error loss) of the NadarayaWatson kernel estimator is R(rn , r) = h4 n 4 + 2
2

x2 K(x)dx K 2 (x)dx nhn 82

r (x) + 2r (x)

f (x) f (x)

dx (116)

1 dx + o(nh1 ) + o(h4 ) n n f (x)

as hn 0 and nhn . The rst term in (116) is the squared bias and the second term is the variance. What is especially notable is the presence of the term f (x) (117) 2r (x) f (x) in the bias. We call (117) the design bias since it depends on the design, that is, the distribution of the X i s. This means that the bias is sensitive to the position of the Xi s. Furthermore, it can be shown that kernel estimators also have high bias near the boundaries. This is known as boundary bias. We will see that we can reduce these biases by using a renement called local polynomial regression. If we differentiate (116) and set the result equal to 0, we nd that the optimal bandwidth h is h = 1 n
1/5

K (x)dx

dx/f (x)
(x) (x) f (x) f 2

( x2 K 2 (x)dx)2

r (x) + 2r

dx

1/5

(118)

Thus, h = O(n1/5 ). Plugging h back into (116) we see that the risk decreases at rate O(n4/5 ). In (most) parametric models, the risk of the maximum likelihood estimator decreases to 0 at rate 1/n. The slower rate n 4/5 is the price of using nonparametric methods. In practice, we cannot use the bandwidth given in (118) since h depends on the unknown function r. Instead, we use leave-one-out cross-validation as described in Theorem 35.2. 36.5 Example. Figure 28 shows the cross-validation score for the CMB example as a function of the effective degrees of freedom. The optimal smoothing parameter was chosen to minimize this score. The resulting t is also shown in the gure. Note that the t gets quite variable to the right.

37 Local Polynomials
Kernel estimators suffer from boundary bias and design bias. These problems can be alleviated by using a generalization of kernel regression called local polynomial regression. To motivate this estimator, rst consider choosing an estimator a r n (x) to minimize the sums of squares n 2 i=1 (Yi a) . The solution is the constant function rn (x) = Y which is obviously not a good estimator of r(x). Now dene the weight function wi (x) = K((Xi x)/h) and choose a rn (x) to minimize the weighted sums of squares
n i=1

wi (x)(Yi a)2 .

(119)

From elementary calculus, we see that the solution is rn (x)


n i=1 wi (x)Yi n i=1 wi (x)

which is exactly the kernel regression estimator. This gives us an interesting interpretation of the kernel estimator: it is a locally constant estimator, obtained from locally weighted least squares. This suggests that we might improve the estimator by using a local polynomial of degree p instead of a local constant. Let x be some xed value at which we want to estimate r(x). For values u in a neighborhood of x, dene the polynomial ap a2 (120) Px (u; a) = a0 + a1 (u x) + (u x)2 + + (u x)p . 2! p! We can approximate a smooth regression function r(u) in a neighborhood of the target value x by the polynomial: r(u) Px (u; a). 83 (121)

CV score

24

26

28

30

Effective degrees of freedom

Power

1000

3000

5000

400

800

Multipole Figure 28: Top: The cross-validation (CV) score as a function of the effective degrees of freedom. Bottom: the kernel t using the bandwidth that minimizes the cross-validation score. We estimate a = (a0 , . . . , ap )T by choosing a = (a0 , . . . , ap )T to minimize the locally weighted sums of squares
n i=1

wi (x) (Yi Px (Xi ; a))2 .

(122)

The estimator a depends on the target value x so we write a(x) = (a 0 (x), . . . , ap (x))T if we want to make this dependence explicit. The local estimate of r is rn (u) = Px (u; a). In particular, at the target value u = x we have rn (x) = Px (x; a) = a0 (x). (123)

Warning! Although \hat r_n(x) only depends on \hat a_0(x), this is not equivalent to simply fitting a local constant.

Setting p = 0 gives back the kernel estimator. The special case where p = 1 is called local linear regression and this is the version we recommend as a default choice. As we shall see, local polynomial estimators, and in particular local linear estimators, have some remarkable properties.

To find \hat a(x), it is helpful to re-express the problem in vector notation. Let

X_x = \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p / p! \\ 1 & X_2 - x & \cdots & (X_2 - x)^p / p! \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p / p! \end{pmatrix}   (124)

and let W_x be the n × n diagonal matrix whose (i, i) component is w_i(x). We can rewrite (122) as

(Y - X_x a)^T W_x (Y - X_x a).   (125)

Minimizing (125) gives the weighted least squares estimator

\hat a(x) = (X_x^T W_x X_x)^{-1} X_x^T W_x Y.   (126)

In particular, \hat r_n(x) = \hat a_0(x) is the inner product of the first row of (X_x^T W_x X_x)^{-1} X_x^T W_x with Y. Thus we have: the local polynomial regression estimate is

\hat r_n(x) = \sum_{i=1}^n \ell_i(x) Y_i   (127)

where \ell(x)^T = (\ell_1(x), ..., \ell_n(x)) is given by

\ell(x)^T = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x,

e_1 = (1, 0, ..., 0)^T, and X_x and W_x are defined in (124). Once again, our estimate is a linear smoother and we can choose the bandwidth by minimizing the cross-validation formula given in Theorem 35.2.

R-code. The R-code is the same except we use deg = 1 for local linear, deg = 2 for local quadratic, etc. Thus, for local linear regression:

loess(y ~ x, deg=1, span=h)
locfit(y ~ x, deg=1, alpha=c(0,h))

37.1 Example (LIDAR). These data were introduced in Example 34.2. Figure 29 shows the 221 observations. The top left plot shows the data and the fitted function using local linear regression. The cross-validation curve (not shown) has a well-defined minimum at h ≈ 37, corresponding to 9 effective degrees of freedom. The fitted function uses this bandwidth. The top right plot shows the residuals. There is clear heteroscedasticity (nonconstant variance). The bottom left plot shows the estimate of σ(x) using the method described later. Next we compute 95 percent confidence bands (explained later). The resulting bands are shown in the lower right plot. As expected, there is much greater uncertainty for larger values of the covariate.
Figure 29: The LIDAR data from Example 37.1. Top left: data and the fitted function using local linear regression with h ≈ 37 (chosen by cross-validation). Top right: the residuals. Bottom left: estimate of σ(x). Bottom right: 95 percent confidence bands.

37.2 Theorem (Local Linear Smoothing). When p = 1,

\hat r_n(x) = \sum_{i=1}^n \ell_i(x) Y_i

where

\ell_i(x) = \frac{b_i(x)}{\sum_{j=1}^n b_j(x)},   b_i(x) = K\left(\frac{X_i - x}{h}\right) \big( S_{n,2}(x) - (X_i - x) S_{n,1}(x) \big)   (128)

and

S_{n,j}(x) = \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) (X_i - x)^j,   j = 1, 2.
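To make (128) concrete, here is a minimal R sketch (my own, not from the notes) that computes the local linear estimate directly from these weights; the Gaussian kernel and the function name loclin are my choices, not part of the original material.

## Local linear regression via the weights in (128); Gaussian kernel assumed
loclin <- function(x, y, xgrid, h) {
  sapply(xgrid, function(x0) {
    k   <- dnorm((x - x0) / h)          # kernel weights K((X_i - x)/h)
    Sn1 <- sum(k * (x - x0))            # S_{n,1}(x)
    Sn2 <- sum(k * (x - x0)^2)          # S_{n,2}(x)
    b   <- k * (Sn2 - (x - x0) * Sn1)   # b_i(x)
    sum(b * y) / sum(b)                 # r-hat(x) = sum_i l_i(x) Y_i
  })
}

## Example usage on simulated data
set.seed(1)
x <- runif(200); y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
xg <- seq(0, 1, length = 100)
plot(x, y); lines(xg, loclin(x, y, xg, h = 0.1), lwd = 2)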
37.3 Example. Figure 30 shows the local regression for the CMB data for p = 0 and p = 1. The bottom plots zoom in on the left boundary. Note that for p = 0 (the kernel estimator), the fit is poor near the boundaries due to boundary bias.

Figure 30: Locally weighted regressions using local polynomials of order p = 0 (top left) and p = 1 (top right). The bottom plots show the left boundary in more detail (p = 0 bottom left and p = 1 bottom right). Notice that the boundary bias is reduced by using local linear estimation (p = 1).

37.4 Example (Doppler function). Let

r(x) = \sqrt{x(1-x)} \sin\left(\frac{2.1\pi}{x + 0.05}\right),   0 ≤ x ≤ 1,   (129)

which is called the Doppler function. This function is difficult to estimate and provides a good test case for nonparametric regression methods. The function is spatially inhomogeneous, which means that its smoothness (second derivative) varies over x. The function is plotted in the top left plot of Figure 31. The top right plot shows 1000 data points simulated from Y_i = r(i/n) + σ ε_i with σ = .1 and ε_i ~ N(0, 1). The bottom left plot shows the cross-validation score versus the effective degrees of freedom using local linear regression. The minimum occurred at 166 degrees of freedom corresponding to a bandwidth of .005. The fitted function is shown in the bottom right plot. The fit has high effective degrees of freedom and hence the fitted function is very wiggly. This is because the estimate is trying to fit the rapid fluctuations of the function near x = 0. If we used more smoothing, the right-hand side of the fit would look better at the cost of missing the structure near x = 0. This is always a problem when estimating spatially inhomogeneous functions. We'll discuss that further later.

The following theorem gives the large sample behavior of the risk of the local linear estimator and shows why local linear regression is better than kernel regression.

37.5 Theorem. Let Y_i = r(X_i) + σ(X_i) ε_i for i = 1, ..., n and a ≤ X_i ≤ b. Assume that X_1, ..., X_n are a sample from a distribution with density f and that (i) f(x) > 0, (ii) f, r'' and σ^2 are continuous in a neighborhood of x, and (iii) h_n → 0 and n h_n → ∞. Let x ∈ (a, b). Given X_1, ..., X_n, we have the following: the local linear estimator and the kernel estimator both have variance

\frac{σ^2(x)}{f(x)\, n h_n} \int K^2(u)\, du + o_P\left(\frac{1}{n h_n}\right).   (130)

Figure 31: The Doppler function estimated by local linear regression. The function (top left), the data (top right), the cross-validation score versus effective degrees of freedom (bottom left), and the fitted function (bottom right).

The Nadaraya–Watson kernel estimator has bias

h_n^2 \left( \frac{1}{2} r''(x) + \frac{r'(x) f'(x)}{f(x)} \right) \int u^2 K(u)\, du + o_P(h_n^2)   (131)

whereas the local linear estimator has asymptotic bias

\frac{1}{2} h_n^2 r''(x) \int u^2 K(u)\, du + o_P(h_n^2).   (132)

Thus, the local linear estimator is free from design bias. At the boundary points a and b, the Nadaraya–Watson kernel estimator has asymptotic bias of order h_n while the local linear estimator has bias of order h_n^2. In this sense, local linear estimation eliminates boundary bias.

37.6 Remark. The above result holds more generally for local polynomials of order p. Generally, taking p odd reduces design bias and boundary bias without increasing variance.

An alternative to locfit is loess:

out = loess(y ~ x, span=.1, degree=1)
plot(x, fitted(out))
out$trace.hat    ### effective degrees of freedom

38 Penalized Regression, Regularization and Splines


Consider once again the regression model

Y_i = r(X_i) + ε_i

and suppose we estimate r by choosing \hat r_n(x) to minimize the sum of squares

\sum_{i=1}^n (Y_i - \hat r_n(X_i))^2.

Minimizing over all linear functions (i.e., functions of the form β_0 + β_1 x) yields the least squares estimator. Minimizing over all functions yields a function that interpolates the data. In the previous section we avoided these two extreme solutions by replacing the sum of squares with a locally weighted sum of squares. An alternative way to get solutions in between these extremes is to minimize the penalized sum of squares

M(λ) = \sum_i (Y_i - \hat r_n(X_i))^2 + λ J(r)   (133)

where

J(r) = \int (r''(x))^2\, dx   (134)

is a roughness penalty. Adding a penalty term to the criterion we are optimizing is sometimes called regularization. The parameter λ controls the trade-off between fit (the first term of (133)) and the penalty. Let \hat r_n denote the function that minimizes M(λ). When λ = 0, the solution is the interpolating function. When λ → ∞, \hat r_n converges to the least squares line. The parameter λ controls the amount of smoothing. What does \hat r_n look like for 0 < λ < ∞? To answer this question, we need to define splines.

A spline is a special piecewise polynomial. The most commonly used splines are piecewise cubic splines.

38.1 Definition. Let ξ_1 < ξ_2 < ... < ξ_k be a set of ordered points, called knots, contained in some interval (a, b). A cubic spline is a continuous function r such that (i) r is a cubic polynomial over (ξ_1, ξ_2), ..., and (ii) r has continuous first and second derivatives at the knots. More generally, an M-th order spline is a piecewise M - 1 degree polynomial with M - 2 continuous derivatives at the knots. A spline that is linear beyond the boundary knots is called a natural spline.

Cubic splines (M = 4) are the most common splines used in practice. They arise naturally in the penalized regression framework as the following theorem shows.

38.2 Theorem. The function \hat r_n(x) that minimizes M(λ) with penalty (134) is a natural cubic spline with knots at the data points. The estimator \hat r_n is called a smoothing spline.

The theorem above does not give an explicit form for \hat r_n. To do so, we will construct a basis for the set of splines. Let ξ_0 = a and ξ_{k+1} = b. Define new knots τ_1, ..., τ_{k+2M} such that

τ_1 ≤ τ_2 ≤ ... ≤ τ_M ≤ ξ_0,   τ_{j+M} = ξ_j for j = 1, ..., k,   and   ξ_{k+1} ≤ τ_{k+M+1} ≤ ... ≤ τ_{k+2M}.

Figure 32: Cubic B-spline basis using nine equally spaced knots on (0,1).

The choice of extra knots is arbitrary; usually one takes τ_1 = ... = τ_M = ξ_0 and ξ_{k+1} = τ_{k+M+1} = ... = τ_{k+2M}. We define the basis functions recursively as follows. First we define

B_{i,1}(x) = 1 if τ_i ≤ x < τ_{i+1}, and 0 otherwise,

for i = 1, ..., k + 2M - 1. Next, for m ≤ M we define

B_{i,m}(x) = \frac{x - τ_i}{τ_{i+m-1} - τ_i} B_{i,m-1}(x) + \frac{τ_{i+m} - x}{τ_{i+m} - τ_{i+1}} B_{i+1,m-1}(x)

for i = 1, ..., k + 2M - m. It is understood that if the denominator is 0, then the function is defined to be 0.

38.3 Theorem. The functions {B_{i,4}, i = 1, ..., k + 4} are a basis for the set of cubic splines. They are called the B-spline basis functions.

Hence, any cubic spline f(x) can be written as f(x) = \sum_{j=1}^{k+4} β_j B_j(x). B-spline basis functions have compact support, which makes it possible to speed up calculations. Figure 32 shows the cubic B-spline basis using nine equally spaced knots on (0,1).

We are now in a position to describe the spline estimator in more detail. According to Theorem 38.2, \hat r is a natural cubic spline. Hence, we can write

\hat r_n(x) = \sum_{j=1}^N \hat β_j B_j(x)   (135)

where N = n + 4. We only need to find the coefficients \hat β = (\hat β_1, ..., \hat β_N)^T. By expanding \hat r in the basis we can rewrite the minimization as follows:

minimize:   (Y - Bβ)^T (Y - Bβ) + λ β^T Ω β   (136)

where B_{ij} = B_j(X_i) and Ω_{jk} = \int B_j''(x) B_k''(x)\, dx.
38.4 Theorem. The value of β that minimizes (136) is²

\hat β = (B^T B + λΩ)^{-1} B^T Y.   (137)

Splines are another example of linear smoothers.

38.5 Theorem. The smoothing spline \hat r_n(x) is a linear smoother; that is, there exist weights \ell(x) such that \hat r_n(x) = \sum_{i=1}^n Y_i \ell_i(x). In particular, the smoothing matrix L is

L = B(B^T B + λΩ)^{-1} B^T   (138)

and the vector \hat Y of fitted values is given by

\hat Y = LY.   (139)

If we had done ordinary linear regression of Y on B, the hat matrix would be L = B(B^T B)^{-1} B^T and the fitted values would interpolate the observed data. The effect of the term λΩ in (138) is to shrink the regression coefficients towards a subspace, which results in a smoother fit. As before, we define the effective degrees of freedom by ν = tr(L) and we choose the smoothing parameter λ by minimizing either the cross-validation score (110) or the generalized cross-validation score (111).

In R:

out = smooth.spline(x,y,df=10,cv=TRUE)   ### df is the effective degrees of freedom
plot(x,y)
lines(out$x,out$y)    ### NOTE: the fitted values are in out$y NOT out$fit!!
out$cv.crit           ### print the cross-validation score

You need to do a loop to try many values of df and then use cross-validation to choose df. df must be between 2 and n. For example:

cv = rep(0,50)
df = seq(2,n,length=50)
for(i in 1:50){cv[i] = smooth.spline(x,y,df=df[i],cv=TRUE)$cv.crit}
plot(df,cv,type="l")
df[cv == min(cv)]

38.6 Example. Figure 33 shows the smoothing spline with cross-validation for the CMB data. The effective number of degrees of freedom is 8.8. The fit is smoother than the local regression estimator. This is certainly visually more appealing, but the difference between the two fits is small compared to the width of the confidence bands that we will compute later.

Spline estimates \hat r_n(x) are approximately kernel estimates in the sense that

\ell_i(x) ≈ \frac{1}{f(X_i) h(X_i)} K\left(\frac{X_i - x}{h(X_i)}\right)

where f(x) is the density of the covariate (treated here as random),

h(x) = \left(\frac{λ}{n f(x)}\right)^{1/4}

and

K(t) = \frac{1}{2} \exp\left(-\frac{|t|}{\sqrt{2}}\right) \sin\left(\frac{|t|}{\sqrt{2}} + \frac{\pi}{4}\right).

² You will recognize this as being similar to ridge regression.

Figure 33: Smoothing spline for the CMB data. The smoothing parameter was chosen by cross-validation.

Another nonparametric method that uses splines is called the regression spline method. Rather than placing a knot at each data point, we instead use fewer knots. We then do ordinary linear regression on the basis matrix B with no regularization. The fitted values for this estimator are \hat Y = LY with L = B(B^T B)^{-1} B^T. The difference between this estimate and (138) is that the basis matrix B is based on fewer knots and there is no shrinkage factor λ. The amount of smoothing is instead controlled by the choice of the number (and placement) of the knots. By using fewer knots, one can save computation time.
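As a quick illustration of the regression spline idea (my own sketch, not code from the notes), the bs() function in the splines package builds a cubic B-spline basis with a modest number of interior knots, and ordinary least squares does the rest; the number and placement of knots control the amount of smoothing. The simulated data are only for illustration.

library(splines)

## Regression spline: a few knots, ordinary least squares, no penalty
set.seed(1)
x <- sort(runif(200))
y <- sin(2 * pi * x^2) + rnorm(200, sd = 0.2)

knots <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))   # 9 interior knots
B <- bs(x, knots = knots, degree = 3)                   # cubic B-spline basis matrix
fit <- lm(y ~ B)                                        # ordinary regression on the basis

plot(x, y)
lines(x, fitted(fit), lwd = 2)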

39 Smoothing Using Orthogonal Functions


Let L_2(a, b) denote all functions defined on the interval [a, b] such that \int_a^b f(x)^2\, dx < ∞:

L_2(a, b) = \left\{ f : [a, b] → R,\ \int_a^b f(x)^2\, dx < ∞ \right\}.   (140)

We sometimes write L_2 instead of L_2(a, b). The inner product between two functions f, g ∈ L_2 is defined by \int f(x) g(x)\, dx. The norm of f is

\|f\| = \sqrt{\int f(x)^2\, dx}.   (141)

Two functions are orthogonal if \int f(x) g(x)\, dx = 0. A sequence of functions φ_1, φ_2, φ_3, ... is orthonormal if \int φ_j^2(x)\, dx = 1 for each j and \int φ_i(x) φ_j(x)\, dx = 0 for i ≠ j. An orthonormal sequence is complete if the only function that is orthogonal to each φ_j is the zero function. A complete orthonormal set is called an orthonormal basis.

Any f ∈ L_2 can be written as

f(x) = \sum_{j=1}^∞ β_j φ_j(x),   where   β_j = \int_a^b f(x) φ_j(x)\, dx.   (142)

Also, we have Parseval's relation:

\|f\|^2 ≡ \int f^2(x)\, dx = \sum_{j=1}^∞ β_j^2 ≡ \|β\|^2   (143)

where β = (β_1, β_2, ...). Note: the equality in the displayed equation means that \int (f(x) - f_n(x))^2\, dx → 0 where f_n(x) = \sum_{j=1}^n β_j φ_j(x).

39.1 Example. An example of an orthonormal basis for L_2(0, 1) is the cosine basis defined as follows. Let φ_0(x) = 1 and for j ≥ 1 define

φ_j(x) = \sqrt{2} \cos(j\pi x).   (144)

39.2 Example. Let

f(x) = \sqrt{x(1-x)} \sin\left(\frac{2.1\pi}{x + 0.05}\right)

which is called the Doppler function. Figure 34 shows f (top left) and its approximation

f_J(x) = \sum_{j=1}^J β_j φ_j(x)

with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). As J increases we see that f_J(x) gets closer to f(x). The coefficients β_j = \int_0^1 f(x) φ_j(x)\, dx were computed numerically.

Figure 34: Approximating the Doppler function with its expansion in the cosine basis. The function f (top left) and its approximation f_J(x) = \sum_{j=1}^J β_j φ_j(x) with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). The coefficients β_j = \int_0^1 f(x) φ_j(x)\, dx were computed numerically.
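A small sketch (mine, not from the notes) of how such an approximation can be computed: the coefficients β_j = ∫ f φ_j are approximated by a Riemann sum on a fine grid and f_J is assembled from them; note that the constant term φ_0 is included here, which is a choice of mine.

## Cosine-basis approximation of the Doppler function
doppler <- function(x) sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
phi     <- function(x, j) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(j * pi * x)

J    <- 50
xx   <- seq(1/2000, 1 - 1/2000, length = 2000)                  # fine grid on (0,1)
beta <- sapply(0:J, function(j) mean(doppler(xx) * phi(xx, j)))  # Riemann-sum approximation of the integrals

fJ <- function(x) drop(sapply(0:J, function(j) phi(x, j)) %*% beta)

xg <- seq(0.001, 0.999, length = 500)
plot(xg, doppler(xg), type = "l", col = "grey")   # true function
lines(xg, fJ(xg), lwd = 2)                        # approximation with J = 50 terms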

39.3 Example. The Legendre polynomials on [-1, 1] are defined by

P_j(x) = \frac{1}{2^j j!} \frac{d^j}{dx^j} (x^2 - 1)^j,   j = 0, 1, 2, ...   (145)

It can be shown that these functions are complete and orthogonal and that

\int_{-1}^{1} P_j^2(x)\, dx = \frac{2}{2j + 1}.   (146)

It follows that the functions φ_j(x) = \sqrt{(2j+1)/2}\, P_j(x), j = 0, 1, ..., form an orthonormal basis for L_2(-1, 1). The first few Legendre polynomials are:

P_0(x) = 1,   P_1(x) = x,   P_2(x) = \frac{3x^2 - 1}{2},   P_3(x) = \frac{5x^3 - 3x}{2},   ...

These polynomials may be constructed explicitly using the following recursive relation:

P_{j+1}(x) = \frac{(2j + 1) x P_j(x) - j P_{j-1}(x)}{j + 1}.   (147)
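The recursion (147) is easy to code; this sketch (mine) evaluates P_0, ..., P_5 on a grid and checks the normalization (146) with a crude Riemann sum.

## Evaluate Legendre polynomials P_0..P_J on a grid via the recursion (147)
legendre <- function(x, J) {
  P <- matrix(0, nrow = length(x), ncol = J + 1)
  P[, 1] <- 1                      # P_0
  if (J >= 1) P[, 2] <- x          # P_1
  if (J >= 2) for (j in 1:(J - 1))
    P[, j + 2] <- ((2 * j + 1) * x * P[, j + 1] - j * P[, j]) / (j + 1)
  P
}

x <- seq(-1, 1, length = 2001)
P <- legendre(x, 5)
## Check (146): the integral of P_j^2 over [-1,1] should be 2/(2j+1)
colSums(P^2) * diff(x)[1]          # compare with 2/(2*(0:5) + 1)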

The coefficients β_1, β_2, ... are related to the smoothness of the function f. To see why, note that if f is smooth, then its derivatives will be finite. Thus we expect that, for some k, \int_0^1 (f^{(k)}(x))^2\, dx < ∞ where f^{(k)} is the k-th derivative of f. Now consider the cosine basis (144) and let f(x) = \sum_{j=0}^∞ β_j φ_j(x). Then,

\int_0^1 (f^{(k)}(x))^2\, dx = 2 \sum_{j=1}^∞ β_j^2 (\pi j)^{2k}.

The only way that \sum_{j=1}^∞ β_j^2 (\pi j)^{2k} can be finite is if the β_j's get small when j gets large. To summarize: if the function f is smooth, then the coefficients β_j will be small when j is large.

Return to the regression model

Y_i = r(X_i) + ε_i,   i = 1, ..., n.   (148)

Now we write r(x) = \sum_{j=1}^∞ β_j φ_j(x). We will approximate r by

r_J(x) = \sum_{j=1}^J β_j φ_j(x).

The number of terms J will be our smoothing parameter. Our estimate is

\hat r(x) = \sum_{j=1}^J \hat β_j φ_j(x).

To find \hat r_n, let U denote the n × J matrix

U = \begin{pmatrix} φ_1(X_1) & φ_2(X_1) & \cdots & φ_J(X_1) \\ φ_1(X_2) & φ_2(X_2) & \cdots & φ_J(X_2) \\ \vdots & \vdots & & \vdots \\ φ_1(X_n) & φ_2(X_n) & \cdots & φ_J(X_n) \end{pmatrix}.

Then

\hat β = (U^T U)^{-1} U^T Y   and   \hat Y = SY

where S = U(U^T U)^{-1} U^T is the hat matrix. The matrix S projects into the space spanned by the first J basis functions. We can choose J by cross-validation. Note that trace(S) = J, so the GCV score takes the following simple form:

GCV(J) = \frac{RSS}{n} \frac{1}{(1 - J/n)^2}.

39.4 Example. Figure 37 shows the Doppler function f and n = 2,048 observations generated from the model Y_i = r(X_i) + ε_i where X_i = i/n and ε_i ~ N(0, (.1)^2). The figure shows the data and the estimated function. The estimate was based on J = 234 terms.

Here is another example: the fit is in Figure 35 and the smoothing matrix is in Figure 36. Notice that the rows of the smoothing matrix look like kernels. In fact, smoothing with a series is approximately the same as kernel regression with the kernel K(x, y) = \sum_{j=1}^J φ_j(x) φ_j(y).

Cosine basis smoothers have boundary bias. This can be fixed by adding the functions t and t^2 to the basis. In other words, use the design matrix

U = \begin{pmatrix} 1 & X_1 & X_1^2 & φ_2(X_1) & \cdots & φ_J(X_1) \\ 1 & X_2 & X_2^2 & φ_2(X_2) & \cdots & φ_J(X_2) \\ \vdots & & & & & \vdots \\ 1 & X_n & X_n^2 & φ_2(X_n) & \cdots & φ_J(X_n) \end{pmatrix}.

This is called the polynomial-cosine basis.
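Since the estimator is just least squares on the basis matrix U, the whole procedure is a few lines of R. The sketch below is my own illustration (with a constant column added to the cosine basis, and simulated Doppler data); it fits the series estimator and chooses J by the GCV formula above.

## Cosine-series regression with J chosen by GCV
set.seed(1)
n <- 1024
x <- (1:n) / n
doppler <- function(x) sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
y <- doppler(x) + rnorm(n, sd = 0.1)

phi <- function(x, j) sqrt(2) * cos(j * pi * x)          # cosine basis

gcv <- rep(NA, 200)
for (J in 1:200) {
  U <- cbind(1, sapply(1:J, function(j) phi(x, j)))      # basis matrix (with constant term)
  fit <- lm.fit(U, y)
  rss <- sum(fit$residuals^2)
  p   <- ncol(U)                                         # trace of the hat matrix = number of columns
  gcv[J] <- (rss / n) / (1 - p / n)^2
}
J.hat <- which.min(gcv)
U <- cbind(1, sapply(1:J.hat, function(j) phi(x, j)))
plot(x, y, col = "grey")
lines(x, U %*% lm.fit(U, y)$coefficients, lwd = 2)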

Figure 35: Cosine Regression.

Figure 36: Cosine Regression.

Figure 37: Data from the doppler test function and the estimated function. See Example 39.4.

40 Variance Estimation
Next we consider several methods for estimating σ^2. For linear smoothers, there is a simple, nearly unbiased estimate of σ^2.

40.1 Theorem. Let \hat r_n(x) be a linear smoother. Let

\hat σ^2 = \frac{\sum_{i=1}^n (Y_i - \hat r(X_i))^2}{n - 2ν + \tilde ν}   (149)

where

ν = tr(L),   \tilde ν = tr(L^T L) = \sum_{i=1}^n \|\ell(X_i)\|^2.

If r is sufficiently smooth, ν = o(n) and \tilde ν = o(n), then \hat σ^2 is a consistent estimator of σ^2.

We will now outline the proof of this result. Recall that if Y is a random vector and Q is a symmetric matrix then Y^T Q Y is called a quadratic form and it is well known that

E(Y^T Q Y) = tr(QV) + μ^T Q μ

where V = V(Y) is the covariance matrix of Y and μ = E(Y) is the mean vector. Now, Y - \hat Y = Y - LY = (I - L)Y and so

\hat σ^2 = \frac{Y^T Λ Y}{tr(Λ)}   (150)

where Λ = (I - L)^T (I - L). Hence,

E(\hat σ^2) = \frac{E(Y^T Λ Y)}{tr(Λ)} = σ^2 + \frac{r^T Λ r}{n - 2ν + \tilde ν}.   (151)

Assuming that ν and \tilde ν do not grow too quickly, and that r is smooth, the last term is small for large n and hence E(\hat σ^2) ≈ σ^2. Similarly, one can show that V(\hat σ^2) → 0.

Here is another estimator. Suppose that the X_i's are ordered. Define

\hat σ^2 = \frac{1}{2(n-1)} \sum_{i=1}^{n-1} (Y_{i+1} - Y_i)^2.   (152)

The motivation for this estimator is as follows. Assuming r(x) is smooth, we have r(x_{i+1}) - r(x_i) ≈ 0 and hence

Y_{i+1} - Y_i = [r(x_{i+1}) + ε_{i+1}] - [r(x_i) + ε_i] ≈ ε_{i+1} - ε_i

and hence (Y_{i+1} - Y_i)^2 ≈ ε_{i+1}^2 + ε_i^2 - 2 ε_{i+1} ε_i. Therefore,

E(Y_{i+1} - Y_i)^2 ≈ E(ε_{i+1}^2) + E(ε_i^2) - 2 E(ε_{i+1}) E(ε_i) = 2σ^2.   (153)

Thus, E(\hat σ^2) ≈ σ^2. A variation of this estimator is

\hat σ^2 = \frac{1}{n-2} \sum_{i=2}^{n-1} c_i^2 δ_i^2   (154)

where

δ_i = a_i Y_{i-1} + b_i Y_{i+1} - Y_i,   a_i = \frac{x_{i+1} - x_i}{x_{i+1} - x_{i-1}},   b_i = \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}},   c_i^2 = (a_i^2 + b_i^2 + 1)^{-1}.

The intuition of this estimator is that it is the average of the residuals that result from fitting a line to the first and third point of each consecutive triple of design points.

40.2 Example. The variance looks roughly constant for the first 400 observations of the CMB data. Using a local linear fit, we applied the two variance estimators. Equation (149) yields \hat σ^2 = 408.29 while equation (152) yields \hat σ^2 = 394.55.

So far we have assumed homoscedasticity, meaning that σ^2 = V(ε_i) does not vary with x. In the CMB example this is blatantly false. Clearly, σ^2 increases with x so the data are heteroscedastic. The function estimate \hat r_n(x) is relatively insensitive to heteroscedasticity. However, when it comes to making confidence bands for r(x), we must take into account the nonconstant variance.

We will take the following approach. Suppose that

Y_i = r(X_i) + σ(X_i) ε_i.   (155)

Let Z_i = \log(Y_i - r(X_i))^2 and δ_i = \log ε_i^2. Then,

Z_i = \log σ^2(X_i) + δ_i.   (156)

This suggests estimating \log σ^2(x) by regressing the log squared residuals on x. We proceed as follows.

Variance Function Estimation

1. Estimate r(x) with any nonparametric method to get an estimate \hat r_n(x).
2. Define Z_i = \log(Y_i - \hat r_n(X_i))^2.
3. Regress the Z_i's on the X_i's (again using any nonparametric method) to get an estimate \hat q(x) of \log σ^2(x) and let

\hat σ^2(x) = e^{\hat q(x)}.   (157)
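A minimal sketch of these estimators (my own, using loess for both smooths; the span values are arbitrary choices and x is assumed sorted):

## (assumes x, y are the data, with x sorted)
n <- length(y)
sigma2.diff <- sum(diff(y)^2) / (2 * (n - 1))     # difference-based estimate (152)

## Variance function estimation by smoothing the log squared residuals
fit.r <- loess(y ~ x, degree = 1, span = 0.3)     # step 1: estimate r(x)
z     <- log((y - fitted(fit.r))^2)               # step 2: log squared residuals
fit.q <- loess(z ~ x, degree = 1, span = 0.5)     # step 3: estimate q(x) = log sigma^2(x)
sigma2.hat <- exp(fitted(fit.q))                  # sigma^2-hat(x) = exp(q-hat(x))

plot(x, z)
lines(x, fitted(fit.q), lwd = 2)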

Figure 38: The dots are the log squared residuals. The solid line shows the log of the estimated variance \hat σ^2(x) as a function of x. The dotted line shows the log of the true σ^2(x), which is known (to reasonable accuracy) through prior knowledge.

40.3 Example. The solid line in Figure 38 shows the log of \hat σ^2(x) for the CMB example. I used local linear estimation and I used cross-validation to choose the bandwidth. The estimated optimal bandwidth for \hat r_n was h = 42 while the estimated optimal bandwidth for the log variance was h = 160. In this example, there turns out to be an independent estimate of σ(x). Specifically, because the physics of the measurement process is well understood, physicists can compute a reasonably accurate approximation to σ^2(x). The log of this function is the dotted line on the plot.

A drawback of this approach is that the log of a very small residual will be a large outlier. An alternative is to directly smooth the squared residuals on x.

41 Confidence Bands

In this section we will construct confidence bands for r(x). Typically these bands are of the form

\hat r_n(x) ± c\, \hat{se}(x)   (158)

where \hat{se}(x) is an estimate of the standard deviation of \hat r_n(x) and c > 0 is some constant. Before we proceed, we discuss a pernicious problem that arises whenever we do smoothing, namely, the bias problem.

THE BIAS PROBLEM. Confidence bands like those in (158) are not really confidence bands for r(x); rather, they are confidence bands for \bar r_n(x) = E(\hat r_n(x)), which you can think of as a smoothed version of r(x). Getting a confidence set for the true function r(x) is complicated for reasons we now explain. Denote the mean and standard deviation of \hat r_n(x) by \bar r_n(x) and s_n(x). Then,

\frac{\hat r_n(x) - r(x)}{s_n(x)} = \frac{\hat r_n(x) - \bar r_n(x)}{s_n(x)} + \frac{\bar r_n(x) - r(x)}{s_n(x)} = Z_n(x) + \frac{\mathrm{bias}(\hat r_n(x))}{\sqrt{\mathrm{variance}(\hat r_n(x))}}

where Z_n(x) = (\hat r_n(x) - \bar r_n(x))/s_n(x). Typically, the first term Z_n(x) converges to a standard Normal from which one derives confidence bands. The second term is the bias divided by the standard deviation. In parametric inference, the bias is usually smaller than the standard deviation of the estimator so this term goes to zero as the sample size increases. In nonparametric inference, we have seen that optimal smoothing corresponds to balancing the bias and the standard deviation. The second term does not vanish even with large sample sizes. The presence of this second, nonvanishing term introduces a bias into the Normal limit. The result is that the confidence interval will not be centered around the true function r due to the smoothing bias \bar r_n(x) - r(x).

There are several things we can do about this problem. The first is: live with it. In other words, just accept the fact that the confidence band is for \bar r_n, not r. There is nothing wrong with this as long as we are careful when we report the results to make it clear that the inferences are for \bar r_n, not r. A second approach is to estimate the bias function \bar r_n(x) - r(x). This is difficult to do. Indeed, the leading term of the bias is r''(x) and estimating the second derivative of r is much harder than estimating r. This requires introducing extra smoothness conditions which then bring into question the original estimator that did not use this extra smoothness. This has a certain unpleasant circularity to it.

41.1 Example. To understand the implications of estimating \bar r_n instead of r, consider the following example. Let

r(x) = φ(x; 2, 1) + φ(x; 4, 0.5) + φ(x; 6, 0.1) + φ(x; 8, 0.05)

where φ(x; m, s) denotes a Normal density function with mean m and variance s^2. Figure 39 shows the true function (top left), a locally linear estimate \hat r_n (top right) based on 100 observations Y_i = r(i/10) + .2 N(0, 1), i = 1, ..., 100, with bandwidth h = 0.27, the function \bar r_n(x) = E(\hat r_n(x)) (bottom left) and the difference r(x) - \bar r_n(x) (bottom right). We see that \bar r_n (dashed line) smooths out the peaks. Comparing the top right and bottom left plot, it is clear that \hat r_n(x) is actually estimating \bar r_n, not r(x). Overall, \bar r_n is quite similar to r(x) except that \bar r_n omits some of the fine details of r.

CONSTRUCTING CONFIDENCE BANDS. Assume that \hat r_n(x) is a linear smoother, so that \hat r_n(x) = \sum_{i=1}^n Y_i \ell_i(x). Then,

\bar r(x) = E(\hat r_n(x)) = \sum_{i=1}^n \ell_i(x) r(X_i).

Figure 39: The true function (top left), an estimate \hat r_n (top right) based on 100 observations, the function \bar r_n(x) = E(\hat r_n(x)) (bottom left), and the difference r(x) - \bar r_n(x) (bottom right).

Also,

V(\hat r_n(x)) = \sum_{i=1}^n σ^2(X_i) \ell_i^2(x).

When σ(x) = σ this simplifies to V(\hat r_n(x)) = σ^2 \|\ell(x)\|^2. We will consider a confidence band for \bar r_n(x) of the form

I(x) = \big( \hat r_n(x) - c\, s(x),\ \hat r_n(x) + c\, s(x) \big)   (159)

for some c > 0 where

s(x) = \sqrt{\sum_{i=1}^n \hat σ^2(X_i) \ell_i^2(x)}.

At one fixed value of x we can just take \hat r_n(x) ± z_{α/2}\, s(x). If we want a band over an interval a ≤ x ≤ b we need a constant c larger than z_{α/2} to account for the fact that we are trying to get coverage at many points. To guarantee coverage at all the X_i's we can use the Bonferroni correction and take \hat r_n(x) ± z_{α/(2n)}\, s(x). There is a more refined approach which is used in locfit.

R Code. In locfit you can get confidence bands as follows:

out = locfit(y ~ x, alpha=c(0,h))    ### fit the regression
crit(out) = kappa0(out, cov=.95)     ### make locfit find kappa0 and c
plot(out, band="local")              ### plot the fit and the bands

To actually extract the bands, proceed as follows:

tmp     = preplot.locfit(out, band="local", where="data")
r.hat   = tmp$fit
critval = tmp$critval$crit.val
se      = tmp$se.fit
upper   = r.hat + critval*se
lower   = r.hat - critval*se

Now suppose that σ(x) is a function of x. Then, we use \hat r_n(x) ± c\, s(x).

41.2 Example. Figure 40 shows simultaneous 95 percent confidence bands for the CMB data using a local linear fit. The bandwidth was chosen using cross-validation. We find that κ_0 = 38.85 and c = 3.33. In the top plot, we assumed a constant variance when constructing the band. In the bottom plot, we did not assume a constant variance when constructing the band. We see that if we do not take into account the nonconstant variance, we overestimate the uncertainty for small x and we underestimate the uncertainty for large x.

Figure 40: Local linear fit with simultaneous 95 percent confidence bands. The band in the top plot assumes constant variance σ^2. The band in the bottom plot allows for nonconstant variance σ^2(x).

It seems like a good time to summarize the steps needed to construct the estimate \hat r_n and a confidence band.

Summary of Linear Smoothing

1. Choose a smoothing method such as local polynomial, spline, etc. This amounts to choosing the form of the weights \ell(x) = (\ell_1(x), ..., \ell_n(x))^T. A good default choice is local linear smoothing as described in Theorem 37.2.
2. Choose the bandwidth h by cross-validation using (110).
3. Estimate the variance function σ^2(x) as described in Section 40.
4. An approximate 1 - α confidence band for \bar r_n = E(\hat r_n(x)) is

\hat r_n(x) ± c\, s(x).   (160)

41.3 Example (LIDAR). Recall the LIDAR data from Example 34.2 and Example 37.1. We find that κ_0 ≈ 30 and c ≈ 3.25. The resulting bands are shown in the lower right plot of Figure 29. As expected, there is much greater uncertainty for larger values of the covariate.

42 Testing the Fit of a Linear Model


A nonparametric estimator \hat r_n can be used to construct a test to see whether a linear fit is adequate. Consider testing

H_0: r(x) is linear   versus   H_1: r(x) is not linear.

Denote the hat matrix from fitting the linear model by H and the smoothing matrix from fitting the nonparametric regression by L. Let

T = \frac{\|LY - HY\|^2 / ν}{\hat σ^2}

where ν = tr((L - H)^T (L - H)) and \hat σ^2 is defined by (149). We can approximate the distribution of T under H_0 using the bootstrap. Also, under H_0, the F-distribution with ν and n - 2ν_1 + ν_2 degrees of freedom (where ν_1 = tr(L) and ν_2 = tr(L^T L), as in (149)) provides a rough approximation to the distribution of T. Thus we would reject H_0 at level α if T > F_{ν, n - 2ν_1 + ν_2, α}.

As with any test, the failure to reject H_0 should not be regarded as proof that H_0 is true. Rather, it indicates that the data are not powerful enough to detect deviations from H_0. In such cases, a linear fit might be considered a reasonable tentative model. Of course, making such decisions based solely on the basis of a test can be dangerous.

In the unlikely case that there are replications, we can test fit without using a nonparametric fit. Denote the unique values of x as {x_1, ..., x_k}. Do this:

1. Create k - 1 dummy variables Z_1, ..., Z_{k-1} for the k groups.
2. Fit: Y = Xβ + \sum_{r=1}^{k-1} δ_r Z_r + ε.   (161)
3. Fit: Y = Xβ + ε.   (162)
4. Test H_0: δ_1 = ... = δ_{k-1} = 0 with an F-test by comparing the two models.

Example:
> x = c(1,1,1,2,3,3,4,4,4,4)
> y = c(2.55,2.75,2.57,2.40,4.19,4.70,
+ 3.81,4.87,2.93,4.52)
> out = lm(y ~ x)
> anova(out)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value  Pr(>F)
x          1 4.5693  4.5693   8.669 0.01859 *
Residuals  8 4.2166  0.5271
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> z1 = c(1,1,1,rep(0,7))
> z2 = c(0,0,0,1,rep(0,6))
> z3 = c(0,0,0,0,1,1,0,0,0,0)
> out2 = lm(y ~ x + z1 + z2 + z3)
> anova(out2)
Analysis of Variance Table

Response: y
          Df    Sum Sq   Mean Sq   F value  Pr(>F)
x          1    4.5693    4.5693   11.6247 0.01433 *
z1         1 7.963e-09 7.963e-09 2.026e-08 0.99989
z2         1    1.8582    1.8582    4.7276 0.07263 .
Residuals  6    2.3584    0.3931
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> f = ((4.2166-2.3584)/(8-6))/(2.3584/6)
> print(f)
[1] 2.363721
> p = 1-pf(2.363721,2,6)
> print(p)
[1] 0.1749707

43 Local Likelihood and Exponential Families


If Y is not real valued or is not Gaussian, then the basic regression model we have been using might not be appropriate. For example, if Y ∈ {0, 1} then it seems natural to use a Bernoulli model. In this section we discuss nonparametric regression for more general models. Before proceeding, we should point out that the basic model often does work well even in cases where Y is not real valued or is not Gaussian. This is because the asymptotic theory does not really depend on ε being Gaussian. Thus, at least for large samples, it is worth considering using the tools we have already developed for these cases.

Recall that Y has an exponential family distribution, given x, if

f(y|x) = \exp\left\{ \frac{y\,θ(x) - b(θ(x))}{a(φ)} + c(y, φ) \right\}   (163)

for some functions a(·), b(·) and c(·,·). Here, θ(·) is called the canonical parameter and φ is called the dispersion parameter. It then follows that

r(x) ≡ E(Y|X = x) = b'(θ(x)),   σ^2(x) ≡ V(Y|X = x) = a(φ)\, b''(θ(x)).

The usual parametric form of this model is g(r(x)) = x^T β for some known function g called the link function. The model

Y|X = x ~ f(y|x),   g(E(Y|X = x)) = x^T β

is called a generalized linear model. For example, if Y given X = x is Binomial(m, r(x)) then

f(y|x) = \binom{m}{y} r(x)^y (1 - r(x))^{m - y}   (164)

which has the form (163) with

θ(x) = \log\left(\frac{r(x)}{1 - r(x)}\right),   b(θ) = m \log(1 + e^θ)

and a(φ) ≡ 1. Taking g(t) = \log(t/(m - t)) yields the logistic regression model. The parameters are usually estimated by maximum likelihood.

Let's consider a nonparametric version of logistic regression. For simplicity, we focus on local linear estimation. The data are (x_1, Y_1), ..., (x_n, Y_n) where Y_i ∈ {0, 1}. We assume that Y_i ~ Bernoulli(r(X_i)) for some smooth function r(x) for which 0 ≤ r(x) ≤ 1. Thus, P(Y_i = 1 | X_i = x_i) = r(X_i) and P(Y_i = 0 | X_i = x_i) = 1 - r(X_i). The likelihood function is

\prod_{i=1}^n r(X_i)^{Y_i} (1 - r(X_i))^{1 - Y_i}

so, with ξ(x) = \log(r(x)/(1 - r(x))), the log-likelihood is

\ell(r) = \sum_{i=1}^n \ell(Y_i, ξ(X_i))   (165)

where

\ell(y, ξ) = \log\left[ \left(\frac{e^ξ}{1 + e^ξ}\right)^y \left(\frac{1}{1 + e^ξ}\right)^{1 - y} \right] = yξ - \log(1 + e^ξ).   (166)
To estimate the regression function at x we approximate the regression function r(u) for u near x by the local logistic function ea0 +a1 (ux) . r(u) 1 + ea0 +a1 (ux) Equivalently, we approximate log(r(u)/(1 r(u))) with a0 + a1 (x u). Now dene the local log-likelihood
n x (a)

=
i=1 n

K K
i=1

x Xi h x Xi h

(Yi , a0 + a1 (Xi x)) Yi (a0 + a1 (Xi x)) log(1 + ea0 +a1 (Xi x) ) .

Let a(x) = (a0 (x), a1 (x)) maximize x which can be found by any convenient optimization routine such as Newton Raphson. The nonparametric estimate of r(x) is rn (x) =
a eb0 (x) . a 1 + eb0 (x)

(167)

The bandwidth can be chosen by using the leave-one-out log-likelihood cross-validation


n

CV =
i=1

(Yi , (i) (Xi ))

(168)

where (i) (x) is the estimator obtained by leaving out (Xi , Yi ). Unfortunately, there is no identity as in Theorem 35.2. There is, however, the following approximation. Recall the denition of (x, ) from (166) and let (y, ) and (y, ) denote rst and second derivatives of (y, ) with respect to . Thus, (y, ) = y p() (y, ) = p()(1 p()) where p() = e /(1 + e ). Dene matrices Xx and Wx as in (124) and let Vx be a diagonal matrix with j th diagonal entry equal to (Yi , a0 + a1 (xj Xi )). Then,
n

CV

x (a)

+
i=1

m(Xi ) (Yi , a0 )

(169)

where
T m(x) = K(0)eT (Xx Wx Vx Xx )1 e1 1

(170)

and e1 = (1, 0, . . . , 0) . The effective degrees of freedom is


n

=
i=1

m(Xi )E( (Yi , a0 )).

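The local log-likelihood can be maximized numerically at each target point; here is a rough sketch (my own, not locfit's implementation) using optim() with a Gaussian kernel. The simulation mimics the regression function of Example 43.1 below, though the sample size and covariate range are my own choices.

## Local linear logistic regression by direct maximization of the local log-likelihood
loclogit <- function(x, y, xgrid, h) {
  sapply(xgrid, function(x0) {
    w <- dnorm((x - x0) / h)                        # kernel weights
    negll <- function(a) {                          # minus the local log-likelihood
      eta <- a[1] + a[2] * (x - x0)
      -sum(w * (y * eta - log(1 + exp(eta))))
    }
    a.hat <- optim(c(0, 0), negll)$par              # Nelder-Mead is good enough here
    plogis(a.hat[1])                                # r-hat(x0) = e^{a0}/(1 + e^{a0})
  })
}

## Toy simulation with r(x) = e^{3 sin(x)} / (1 + e^{3 sin(x)})
set.seed(1)
x <- runif(300, -3, 3)
y <- rbinom(300, 1, plogis(3 * sin(x)))
xg <- seq(-3, 3, length = 100)
plot(xg, plogis(3 * sin(xg)), type = "l")           # true r(x)
lines(xg, loclogit(x, y, xg, h = 0.5), lty = 2)     # local logistic estimate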
43.1 Example. Figure 41 shows the local linear logistic regression estimator for an example generated from the model Y_i ~ Bernoulli(r(X_i)) with r(x) = e^{3\sin(x)}/(1 + e^{3\sin(x)}). The solid line is the true function r(x). The dashed line is the local linear logistic regression estimator. We also computed the local linear regression estimator which ignores the fact that the data are Bernoulli.³ The dotted line is the resulting local linear regression estimator. Cross-validation was used to select the bandwidth in both cases.

Figure 41: Local linear logistic regression. The solid line is the true regression function r(x) = P(Y = 1|X = x). The dashed line is the local linear logistic regression estimator. The dotted line is the local linear regression estimator.

43.2 Example. The BPD data. The outcome Y is presence or absence of BPD and the covariate is x = birth weight. The estimated logistic regression function (solid line) r(x; \hat β_0, \hat β_1) together with the data are shown in Figure 42. Also shown are two nonparametric estimates. The dashed line is the local likelihood estimator. The dotted line is the local linear estimator which ignores the binary nature of the Y_i's. Again we see that there is not a dramatic difference between the local logistic model and the local linear model.
Figure 42: The BPD data (presence or absence of Bronchopulmonary Dysplasia versus birth weight in grams). The data are shown with small vertical lines. The estimates are from logistic regression (solid line), local likelihood (dashed line) and local linear regression (dotted line).

³ It might be appropriate to use a weighted fit since the variance of the Bernoulli is a function of the mean.
44 Multiple Nonparametric Regression


Suppose now that the covariate is d-dimensional, X_i = (X_{i1}, ..., X_{id})^T. The regression equation takes the form

Y = r(X_1, ..., X_d) + ε.   (171)

In principle, all the methods we have discussed carry over to this case easily. Unfortunately, the risk of a nonparametric regression estimator increases rapidly with the dimension d. This is called the curse of dimensionality. The risk of a nonparametric estimator behaves like n^{-4/5} if r is assumed to have an integrable second derivative. In d dimensions the risk behaves like n^{-4/(4+d)}. To make the risk equal to a small number ε we have

\frac{1}{n^{4/(4+d)}} = ε   (172)

which implies that

n = \left(\frac{1}{ε}\right)^{(d+4)/4}.   (173)

Thus: to maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension d. So you might need n = 30000 points when d = 5 to get the same accuracy as n = 300 when d = 1.

To get some intuition into why this is true, suppose the data fall into a d-dimensional unit cube. Let x be a point in the cube and let N_h be a cubical neighborhood around x where the cube has sides of length h. Suppose we want to choose h = h(ξ) so that a fraction ξ of the data falls into N_h. The expected fraction of points in N_h is h^d. Setting h^d = ξ we see that h(ξ) = ξ^{1/d} = e^{(1/d)\log ξ}. Thus h(ξ) → 1 as d grows. In high dimensions, we need huge neighborhoods to capture any reasonable fraction of the data. With this warning in mind, let us press on and see how we might estimate the regression function.

Often, one scales each covariate to have the same mean and variance and then we use the kernel hd K(||x||/h) where K is any one-dimensional kernel. Then there is a single bandwidth parameter h. This is equivalent to using a bandwidth matrix of the form H = h2 I. At a target value x = (x1 , . . . , xd )T , the local sum of squares is given by
n i=1 d 2

wi (x) Yi a0

j=1

aj (Xij xj )

(174)

where wi (x) = K(||Xi x||/h). The estimator is rn (x) = a0 109 (175)

where a = (a0 , . . . , ad )T is the value of a = (a0 , . . . , ad )T that minimizes the weighted sums of squares. The solution a is a = (XT Wx Xx )1 XT Wx Y (176) x x where 1 1 . . . X11 x1 X21 x1 . . . .. . X1d xd X2d xd . . .

and Wx is the diagonal matrix whose (i, i) element is wi (x). This is what locfit does. In other words, if you type locfit(y x1 + x2 + x3) then locfit ts Y = r(x1 , x2 , x3 ) + using one bandwidth. So it is important to rescale your variables. 44.1 Theorem (Ruppert and Wand, 1994). Let rn be the multivariate local linear estimator with bandwidth matrix H. The (asymptotic) bias of rn (x) is 1 2 (K)trace(HH) (177) 2 where H is the matrix of second partial derivatives of r evaluated at x and 2 (K) is the scalar dened by the equation uuT K(u)du = 2 (K)I. The (asymptotic) variance of rn (x) is 2 (x) K(u)2 du n|H|1/2 f (x) Also, the bias at the boundary is the same order as in the interior. Thus we see that in higher dimensions, local linear regression still avoids excessive boundary bias and design bias. Suppose that H = h2 I. Then, using the above result, the MSE is 2 d 2 (x) K(u)2 du h4 c2 (179) = c 1 h4 + 2 rjj (x) + 4 nhd f (x) nhd j=1 (178)

Xx =

1 Xn1 x1

Xnd xd

which is minimized at h = cn1/(d+4) giving MSE of size n4/(4+d) .

S PLINES. If we take a spline approach, we need to dene splines in higher dimensions. For d = 2 we minimize (Yi rn (xi1 , xi2 ))2 + J(r) 2 r(x) x1 x2 2 r(x) x2 2

where J(r) =

2 r(x) x2 1

+2

dx1 dx2 .

The minimizer rn is called a thin-plate spline. It is hard to describe and even harder (but certainly not impossible) to t.

O RTHOGONAL BASIS F UNCTIONS . For s = 1, . . . , d, let s = s1 (xs ), . . . , sJ (xs )

110

be a set of basis functions for xs . Dene = 1,j1 (x1 )2,j2 (x2 ) d,jd (xd ) : 1 js J, s = 1, . . . , d .

That is, we choose one function from 1 , one function from 2 , and so on, and multiply them together. Then is the set of all J d such functions. We call the tensor product basis. Now we approximate r as r(x1 , . . . , xd ) where j1 ,...,jd (x1 , . . . , xd ) = 1,j1 (x1 )2,j2 (x2 ) d,jd (xd ). If we let A = {j1 , . . . , jd } denote a subset of A= then we can write this more compactly as r(x1 , . . . , xd ) A A (x1 , . . . , xd ).
AA

j1 ,...,jd j1 ,...,jd (x1 , . . . , xd )


1j1 ,...,jd J

(j1 , . . . , jd ) : 1 j1 , . . . , jd J

(180)

We can t this model by collecting all the basis functions into a design matrix and then use least squares. We use cross-validation to choose J. A variation is to use a different J for each covariate. It is instructive to rewrite the model as follows. Suppose that the rst basis function is s1 (xs ) = 1 for each s. Then (180) can be written as rst (xs , xt ) + rs (xs ) + r(x) = 0 +
s s,t

If we truncate the sum after the rst order terms, we get an additive model which we now discuss.

A DDITIVE M ODELS . Interpreting and visualizing a high-dimensional t is difcult. As the number of covariates increases, the computational burden becomes prohibitive. Sometimes, a more fruitful approach is to use an additive model. An additive model is a model of the form
d

Y =+
j=1

rj (Xj ) +

(181)

where r1 , . . . , rd are smooth functions. The model (181) is not identiable since we can add any constant to and subtract the same constant from one of the rj s without changing the regression function. This problem can be xed in a number of ways, perhaps the easiest being to set = Y and then regard the r j s as deviations from Y . In this n case we require that i=1 rj (Xi ) = 0 for each j. The additive model is clearly not as general as tting r(x1 , . . . , xd ) but it is much simpler to compute and to interpret and so it is often a good starting point. This is a simple algorithm for turning any one-dimensional regression smoother into a method for tting additive models. It is called backtting.

111

The Backtting Algorithm Initialization: set = Y and set initial guesses for r1 , . . . , rd . Iterate until convergence: for j = 1, . . . , d: Compute Yi = Yi
k=j

rk (Xi ), i = 1, . . . , n.

Apply a smoother to Yi on xj to obtain rj . Set rj (x) equal to rj (x) n1


n i=1 rj (Xi ).

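Here is a bare-bones backfitting sketch (my own illustration, using loess as the one-dimensional smoother and a fixed number of sweeps rather than a formal convergence test); X is an n × d matrix of covariates.

## Backfitting for an additive model, using loess as the 1-d smoother
backfit <- function(X, y, span = 0.5, sweeps = 10) {
  n <- nrow(X); d <- ncol(X)
  alpha <- mean(y)
  rhat  <- matrix(0, n, d)                   # rhat[, j] holds r_j evaluated at the data
  for (s in 1:sweeps) {
    for (j in 1:d) {
      ytilde <- y - alpha - rowSums(rhat[, -j, drop = FALSE])
      fit    <- loess(ytilde ~ X[, j], degree = 1, span = span)
      rhat[, j] <- fitted(fit) - mean(fitted(fit))   # center so each r_j sums to zero
    }
  }
  list(alpha = alpha, rhat = rhat, fitted = alpha + rowSums(rhat))
}

## e.g. with the rock data shipped with R (columns area, peri, shape, perm):
## out = backfit(scale(as.matrix(rock[, c("area","peri","shape")])), log(rock$perm))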
You can (and should) write your own function to fit an additive model.

44.2 Example. Here is an example involving three covariates and one response variable. The data are plotted in Figure 43. The data are 48 rock samples from a petroleum reservoir, the response is permeability (in milli-Darcies) and the covariates are: the area of pores (in pixels out of 256 by 256), perimeter in pixels, and shape (perimeter/\sqrt{area}). The goal is to predict permeability from the three covariates. First we fit the additive model

permeability = r_1(area) + r_2(perimeter) + r_3(shape) + ε.

We scale each covariate to have the same variance and then use a common bandwidth for each covariate. The estimates of \hat r_1, \hat r_2 and \hat r_3 are shown in Figure 44. \bar Y was added to each function before plotting it. Next consider a three-dimensional local linear fit (175). After scaling each covariate to have mean 0 and variance 1, we found that the bandwidth h ≈ 3.2 minimized the cross-validation score. The residuals from the additive model and the full three-dimensional local linear fit are shown in Figure 45. Apparently, the fitted values are quite similar, suggesting that the generalized additive model is adequate.

REGRESSION TREES. A regression tree is a model of the form

r(x) = \sum_{m=1}^M c_m I(x ∈ R_m)   (182)

where c_1, ..., c_M are constants and R_1, ..., R_M are disjoint rectangles that partition the space of covariates. The model is fitted in a recursive manner that can be represented as a tree; hence the name.

Denote a generic covariate value by x = (x_1, ..., x_j, ..., x_d). The covariate for the i-th observation is X_i = (x_{i1}, ..., x_{ij}, ..., x_{id}). Given a covariate j and a split point s we define the rectangles R_1 = R_1(j, s) = \{x : x_j ≤ s\} and R_2 = R_2(j, s) = \{x : x_j > s\} where, in this expression, x_j refers to the j-th covariate, not the j-th observation. Then we take \hat c_1 to be the average of all the Y_i's such that X_i ∈ R_1 and \hat c_2 to be the average of all the Y_i's such that X_i ∈ R_2. Notice that \hat c_1 and \hat c_2 minimize the sums of squares \sum_{X_i ∈ R_1} (Y_i - c_1)^2 and \sum_{X_i ∈ R_2} (Y_i - c_2)^2. The choice of which covariate x_j to split on and which split point s to use is based on minimizing the residual sum of squares. The splitting process is then repeated on each rectangle R_1 and R_2.

Figure 46 shows a simple example of a regression tree; also shown are the corresponding rectangles. The function estimate \hat r is constant over the rectangles.

Generally one grows a very large tree, then the tree is pruned to form a subtree by collapsing regions together. The size of the tree is chosen by cross-validation. Usually, we use ten-fold cross-validation since leave-one-out is too expensive. Thus we divide the data into ten blocks, remove each block one at a time, fit the model on the remaining blocks, and compute the prediction error for the observations in the left-out block. This is repeated for each block and the prediction error is averaged over the ten replications. Here are the R commands:

library(tree)   ### load the library

Figure 43: The rock data.

out = tree(y ~ x1 + x2 + x3)          ### fit the tree
plot(out)                             ### plot the tree
text(out)                             ### add labels to plot
print(out)                            ### print the tree
cv = cv.tree(out)                     ### prune the tree and compute the cross-validation score
plot(cv$size, cv$dev)                 ### plot the CV score versus tree size
m = cv$size[cv$dev == min(cv$dev)]    ### find the best size tree
new = prune.tree(out, best=m)         ### fit the best size tree
plot(new)
text(new)

44.3 Example. Figure 47 shows a tree for the rock data. Notice that the variable shape does not appear in the tree. This means that the shape variable was never the optimal covariate to split on in the algorithm. The result is that the tree only depends on area and peri. This illustrates an important feature of tree regression: it automatically performs variable selection in the sense that a covariate x_j will not appear in the tree if the algorithm finds that the variable is not important.
Figure 44: The rock data. The plots show \hat r_1, \hat r_2, and \hat r_3 for the additive model Y = r_1(x_1) + r_2(x_2) + r_3(x_3) + ε.

0.5

residuals

0.0

0.5

8 predicted values

0.2

0.0

0.2

0.4

0
Normal quantiles

0.5

residuals

residuals

0.0

0.5

8 predicted values

0.5

0.0

0.5

0.5

0.0 predicted values

0.5

Figure 45: The residuals for the rock data. Top left: residuals from the additive model. Top right: qq-plot of the residuals from the additive model. Bottom left: residuals from the multivariate local linear model. Bottom right: residuals from the two ts plotted against each other. x1 50 < 50 x2 < 100 c1 100 c2 c3

R2 110 x2 R1 R3

50

x1

Figure 46: A regression tree for two covariates x1 and x2 . The function estimate is r(x) = c1 I(x R1 ) + c2 I(x R2 ) + c3 I(x R3 ) where R1 , R2 and R3 are the rectangles shown in the lower plot. 115

area < 1403

area < 1068 area < 3967 peri < .1991 7.746 8.407 8.678

area < 3967 peri < .1949

8.893 8.985

8.099

8.339

Figure 47: Regression tree for the rock data.

116

45 Density Estimation
A problem closely related to nonparametric regression is nonparametric density estimation. Let X_1, ..., X_n ~ f where f is some probability density. We want to estimate f.

45.1 Example (Bart Simpson). The top left plot in Figure 48 shows the density

f(x) = \frac{1}{2} φ(x; 0, 1) + \frac{1}{10} \sum_{j=0}^{4} φ(x; (j/2) - 1, 1/10)   (183)

where φ(x; μ, σ) denotes a Normal density with mean μ and standard deviation σ. Based on 1000 draws from f, I computed a kernel density estimator, described later. The top right plot is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more reasonable density estimate.

Figure 48: The Bart Simpson density from Example 45.1. Top left: true density. The other plots are kernel estimators based on n = 1000 draws. Bottom left: bandwidth h = 0.05 chosen by leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.

We will evaluate the quality of an estimator \hat f_n with the risk, or integrated mean squared error, R = E(L) where

L = \int (\hat f_n(x) - f(x))^2\, dx

is the integrated squared error loss function. The estimators will depend on some smoothing parameter h and we will choose h to minimize an estimate of the risk. The usual method for estimating risk is leave-one-out cross-validation. The details are different for density estimation than for regression. In the regression case, the cross-validation score was defined as \sum_{i=1}^n (Y_i - \hat r_{(-i)}(X_i))^2, but in density estimation there is no response variable Y. Instead, we proceed as follows. The loss function, which we now write as a function of h (since \hat f_n will depend on some smoothing parameter h), is

L(h) = \int (\hat f_n(x) - f(x))^2\, dx = \int \hat f_n^2(x)\, dx - 2 \int \hat f_n(x) f(x)\, dx + \int f^2(x)\, dx.

The last term does not depend on h so minimizing the loss is equivalent to minimizing the expected value of

J(h) = \int \hat f_n^2(x)\, dx - 2 \int \hat f_n(x) f(x)\, dx.   (184)

We shall refer to E(J(h)) as the risk, although it differs from the true risk by the constant term \int f^2(x)\, dx.

45.2 Definition. The cross-validation estimator of risk is

\hat J(h) = \int \big( \hat f_n(x) \big)^2\, dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i)   (185)

where \hat f_{(-i)} is the density estimator obtained after removing the i-th observation. We refer to \hat J(h) as the cross-validation score or estimated risk.

Perhaps the simplest nonparametric density estimator is the histogram. Suppose f has its support on some interval which, without loss of generality, we take to be [0, 1]. Let m be an integer and define bins

B_1 = \left[0, \frac{1}{m}\right),\ B_2 = \left[\frac{1}{m}, \frac{2}{m}\right),\ ...,\ B_m = \left[\frac{m-1}{m}, 1\right].   (186)

Define the binwidth h = 1/m, let Y_j be the number of observations in B_j, let \hat p_j = Y_j/n and let p_j = \int_{B_j} f(u)\, du. The histogram estimator is defined by

\hat f_n(x) = \sum_{j=1}^m \frac{\hat p_j}{h} I(x ∈ B_j).   (187)

To understand the motivation for this estimator, note that, for x ∈ B_j and h small,

E(\hat f_n(x)) = \frac{E(\hat p_j)}{h} = \frac{p_j}{h} = \frac{\int_{B_j} f(u)\, du}{h} ≈ \frac{f(x) h}{h} = f(x).
45.3 Example. Figure 49 shows three different histograms based on n = 1,266 data points from an astronomical sky survey. Each data point represents a redshift, roughly speaking, the distance from us to a galaxy. Choosing the right number of bins involves finding a good tradeoff between bias and variance. We shall see later that the top left histogram has too many bins, resulting in undersmoothing and too much variance. The bottom left histogram has too few bins, resulting in oversmoothing and too much bias. The top right histogram is based on 308 bins (chosen by cross-validation). The histogram reveals the presence of clusters of galaxies.

Figure 49: Three versions of a histogram for the astronomy data. The top left histogram has too many bins. The bottom left histogram has too few bins. The top right histogram uses 308 bins (chosen by cross-validation). The lower right plot shows the estimated risk versus the number of bins.

Consider fixed x and fixed m, and let B_j be the bin containing x. Then,

E(\hat f_n(x)) = \frac{p_j}{h}   and   V(\hat f_n(x)) = \frac{p_j(1 - p_j)}{n h^2}.   (188)

The risk satisfies

R(\hat f_n, f) ≈ \frac{h^2}{12} \int (f'(u))^2\, du + \frac{1}{nh}.   (189)

The value h^* that minimizes (189) is

h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2\, du} \right)^{1/3}.   (190)

With this choice of binwidth,

R(\hat f_n, f) ≈ \frac{C}{n^{2/3}}.   (191)

We see that with an optimally chosen binwidth, the risk decreases to 0 at rate n^{-2/3}. We will see shortly that kernel estimators converge at the faster rate n^{-4/5}.

45.4 Theorem. The following identity holds:

\hat J(h) = \frac{2}{h(n-1)} - \frac{n+1}{h(n-1)} \sum_{j=1}^m \hat p_j^2.   (192)
45.5 Example. We used cross-validation in the astronomy example. We find that m = 308 is an approximate minimizer. The histogram in the top right plot in Figure 49 was constructed using m = 308 bins. The bottom right plot shows the estimated risk, or more precisely, \hat J, plotted versus the number of bins.

Histograms are not smooth. Now we discuss kernel density estimators, which are smoother and which converge to the true density faster.

45.6 Definition. Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be

\hat f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\left(\frac{x - X_i}{h}\right).   (193)

This amounts to placing a smoothed out lump of mass of size 1/n over each data point X_i; see Figure 50.

Figure 50: A kernel density estimator \hat f_n. At each point x, \hat f_n(x) is the average of the kernels centered over the data points X_i. The data points are indicated by short vertical bars. The kernels are not drawn to scale.

In R use: density(x, bw=h) where h is the bandwidth.

As with kernel regression, the choice of kernel K is not crucial, but the choice of bandwidth h is important. Figure 51 shows density estimates with several different bandwidths. Look also at Figure 48. We see how sensitive the estimate \hat f_n is to the choice of h. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. In general we will let the bandwidth depend on the sample size, so we write h_n.

Here are some properties of \hat f_n. The risk is

R ≈ \frac{1}{4} σ_K^4 h^4 \int (f''(x))^2\, dx + \frac{\int K^2(x)\, dx}{nh}   (194)

where σ_K^2 = \int x^2 K(x)\, dx.

Figure 51: Kernel density estimators and estimated risk for the astronomy data. Top left: oversmoothed. Top right: just right (bandwidth chosen by cross-validation). Bottom left: undersmoothed. Bottom right: cross-validation curve as a function of bandwidth h. The bandwidth was chosen to be the value of h where the curve is a minimum.

If we differentiate (194) with respect to h and set it equal to 0, we see that the asymptotically optimal bandwidth is

h_* = \left( \frac{c_2}{c_1^2 A(f) n} \right)^{1/5}   (195)

where c_1 = \int x^2 K(x)\, dx, c_2 = \int K(x)^2\, dx and A(f) = \int (f''(x))^2\, dx. This is informative because it tells us that the best bandwidth decreases at rate n^{-1/5}. Plugging h_* into (194), we see that if the optimal bandwidth is used then R = O(n^{-4/5}). As we saw, histograms converge at rate O(n^{-2/3}), showing that kernel estimators are superior in rate to histograms.

In practice, the bandwidth can be chosen by cross-validation, but first we describe another method which is sometimes used when f is thought to be very smooth. Specifically, we compute h_* from (195) under the idealized assumption that f is Normal. This yields h_* = 1.06 σ n^{-1/5}. Usually, σ is estimated by \min\{s, Q/1.34\} where s is the sample standard deviation and Q is the interquartile range.⁴ This choice of h_* works well if the true density is very smooth and is called the Normal reference rule.

⁴ Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a N(μ, σ²).
The Normal Reference Rule. For smooth densities and a Normal kernel, use the bandwidth

h_n = \frac{1.06\, \hat σ}{n^{1/5}}   where   \hat σ = \min\left\{ s, \frac{Q}{1.34} \right\}.

Since we don't want to necessarily assume that f is very smooth, it is usually better to estimate h using cross-validation. Recall that the cross-validation score is

\hat J(h) = \int \hat f^2(x)\, dx - \frac{2}{n} \sum_{i=1}^n \hat f_{-i}(X_i)   (196)

where \hat f_{-i} denotes the kernel estimator obtained by omitting X_i.

R code. Use the bw.ucv function to do cross-validation:

h = bw.ucv(x)
plot(density(x,bw=h))

The bandwidth for the density estimator in the upper right panel of Figure 51 is based on cross-validation. In this case it worked well, but of course there are lots of examples where there are problems. Do not assume that, if the estimator \hat f is wiggly, then cross-validation has let you down. The eye is not a good judge of risk.

Constructing confidence bands for kernel density estimators is similar to regression. Note that \hat f_n(x) is just a sample average: \hat f_n(x) = n^{-1} \sum_{i=1}^n Z_i(x) where

Z_i(x) = \frac{1}{h} K\left(\frac{x - X_i}{h}\right).

So the standard error is \hat{se}(x) = s(x)/\sqrt{n} where s(x) is the standard deviation of the Z_i(x)'s:

s(x) = \sqrt{ \frac{1}{n} \sum_{i=1}^n (Z_i(x) - \hat f_n(x))^2 }.   (197)

Then we use \hat f_n(x) ± z_{α/(2n)}\, \hat{se}(x).

45.7 Example. The figure below shows two examples. The first is data from N(0, 1) and the second from (1/2)N(-1, .1) + (1/2)N(1, .1). In both cases, n = 1000. We show the estimates using cross-validation and the Normal reference rule together with bands. The true curve is also shown. That's the curve outside the bands in the last plot.

Suppose now that the data are d-dimensional so that X_i = (X_{i1}, ..., X_{id}). The kernel estimator can easily be generalized to d dimensions. Most often, we use the product kernel

\hat f_n(x) = \frac{1}{n h_1 \cdots h_d} \sum_{i=1}^n \prod_{j=1}^d K\left(\frac{x_j - X_{ij}}{h_j}\right).   (198)

To further simplify, we can rescale the variables to have the same variance and then use only one bandwidth.
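The Normal reference rule is one line of R; the following sketch (mine, on toy bimodal data with my own choice of component standard deviation) computes it and compares it with the cross-validation bandwidth from bw.ucv.

## Normal reference rule versus unbiased cross-validation
set.seed(1)
x <- c(rnorm(500, -1, sqrt(0.1)), rnorm(500, 1, sqrt(0.1)))   # bimodal toy data

sigma.hat <- min(sd(x), IQR(x)/1.34)
h.nrr <- 1.06 * sigma.hat * length(x)^(-1/5)    # Normal reference rule
h.cv  <- bw.ucv(x)                              # cross-validation bandwidth
c(h.nrr, h.cv)                                  # compare the two bandwidths

plot(density(x, bw = h.cv), main = "")          # cross-validation fit
lines(density(x, bw = h.nrr), lty = 2)          # reference-rule fit (dashed)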

Figure (unnumbered): density estimates for the two simulated examples in Example 45.7; panels show the cross-validation ("CV") and Normal reference rule fits.

A LINK BETWEEN REGRESSION AND DENSITY ESTIMATION. Consider regression again. Recall that

r(x) = E(Y|X = x) = \int y f(y|x)\, dy   (199)

     = \frac{\int y f(x, y)\, dy}{f(x)} = \frac{\int y f(x, y)\, dy}{\int f(x, y)\, dy}.   (200)

Suppose we compute a bivariate kernel density estimator

\hat f(x, y) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1} K\left(\frac{x - X_i}{h_1}\right) \frac{1}{h_2} K\left(\frac{y - Y_i}{h_2}\right)   (201)

and we insert this into (200). Assuming that \int u K(u)\, du = 0, we see that

\int y\, \frac{1}{h_2} K\left(\frac{y - Y_i}{h_2}\right) dy = \int (h_2 u + Y_i) K(u)\, du   (202)

  = h_2 \int u K(u)\, du + Y_i \int K(u)\, du   (203)

  = Y_i.   (204)

Hence,

\int y \hat f(x, y)\, dy = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1} K\left(\frac{x - X_i}{h_1}\right) \int y\, \frac{1}{h_2} K\left(\frac{y - Y_i}{h_2}\right) dy = \frac{1}{n} \sum_{i=1}^n Y_i\, \frac{1}{h_1} K\left(\frac{x - X_i}{h_1}\right).   (205)–(207)

Also,

\int \hat f(x, y)\, dy = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1} K\left(\frac{x - X_i}{h_1}\right) \int \frac{1}{h_2} K\left(\frac{y - Y_i}{h_2}\right) dy = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1} K\left(\frac{x - X_i}{h_1}\right).   (208)–(209)

Therefore,

\hat r(x) = \frac{\int y \hat f(x, y)\, dy}{\int \hat f(x, y)\, dy} = \frac{\frac{1}{n} \sum_{i=1}^n Y_i\, \frac{1}{h_1} K\left(\frac{x - X_i}{h_1}\right)}{\frac{1}{n} \sum_{i=1}^n \frac{1}{h_1} K\left(\frac{x - X_i}{h_1}\right)} = \frac{\sum_{i=1}^n Y_i K\left(\frac{x - X_i}{h_1}\right)}{\sum_{i=1}^n K\left(\frac{x - X_i}{h_1}\right)}   (210)–(212)

which is the kernel regression estimator. In other words, the kernel regression estimator can be derived from kernel density estimation.
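This identity is easy to verify numerically. In the sketch below (my own check, with Gaussian kernels and simulated data), the Nadaraya–Watson estimate at a single point x_0 is compared with the ratio of the two integrals computed from the bivariate density estimate.

## Numerical check: Nadaraya-Watson equals the ratio of bivariate-KDE integrals
set.seed(1)
n <- 200
x <- runif(n); y <- sin(2*pi*x) + rnorm(n, sd = 0.3)
h1 <- 0.1; h2 <- 0.2
x0 <- 0.5

## Kernel regression estimator directly
w <- dnorm((x0 - x)/h1)
sum(w * y) / sum(w)

## Via the bivariate density: integrate y*fhat(x0,y) dy over integrate fhat(x0,y) dy
fhat <- function(yv) sapply(yv, function(y0)
  mean(dnorm((x0 - x)/h1)/h1 * dnorm((y0 - y)/h2)/h2))
num <- integrate(function(yv) yv * fhat(yv), -10, 10)$value
den <- integrate(fhat, -10, 10)$value
num/den     # agrees with the direct estimator above (up to integration error)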

46 Classification

REFERENCES:

1. Hastie, Tibshirani and Friedman (2001). The Elements of Statistical Learning.
2. Devroye, Györfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
3. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning.

The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination, or pattern recognition. Consider IID data (X_1, Y_1), ..., (X_n, Y_n) where X_i = (X_{i1}, ..., X_{id})^T ∈ \mathcal{X} ⊂ R^d is a d-dimensional vector and Y_i takes values in {0, 1}. Often, the covariates X are also called features. The goal is to predict Y given a new X. This is the same as binary regression except that the focus is on good prediction rather than estimating the regression function. A classification rule is a function h : \mathcal{X} → {0, 1}. When we observe a new X, we predict Y to be h(X). The classification risk (or error rate) of h is

R(h) = P(Y ≠ h(X)).   (213)

EXAMPLES:
1. The Coronary Risk-Factor Study (CORIS) data. There are 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. The goal is to predict Y from the covariates.
2. Predict whether a stock will go up or down based on past performance. Here X is the past price and Y is the future price.
3. Predict whether an email message is spam or real.
4. Identify whether glass fragments in a criminal investigation are from a window or not, based on chemical composition.
5. Identify handwritten digits from images. Each Y is a digit from 0 to 9. There are 256 covariates x_1, ..., x_256 corresponding to the intensity values of the pixels of a 16 × 16 image. See Figure 52.

46.1 Example. Figure 53 shows 100 data points. The covariate X = (X_1, X_2) is 2-dimensional and the outcome Y ∈ Y = {0, 1}. The Y values are indicated on the plot, with triangles representing Y = 1 and squares representing Y = 0. Also shown is a linear classification rule represented by the solid line. This is a rule of the form

h(x) = 1 if a + b_1 x_1 + b_2 x_2 > 0, and 0 otherwise.

Everything above the line is classified as a 0 and everything below the line is classified as a 1.


Figure 52: Zip code data.


Figure 53: Two covariates and a linear decision boundary. Triangles: Y = 1; squares: Y = 0. The two groups are perfectly separated by the linear decision boundary.

46.1 Error Rates, The Bayes Classifier and Regression


The true error rate (or classification risk) of a classifier h is

R(h) = P(h(X) ≠ Y)    (214)

and the empirical error rate or training error rate is

R̂_n(h) = (1/n) Σ_{i=1}^n I(h(X_i) ≠ Y_i).    (215)

Now we relate classification to regression. Let r(x) = E(Y|X = x) = P(Y = 1|X = x) denote the regression function. We have the following important result.

The rule h* that minimizes R(h) is

h*(x) = 1 if r(x) > 1/2, and 0 otherwise.    (216)

The rule h* is called the Bayes rule. The risk R* = R(h*) of the Bayes rule is called the Bayes risk. The set

D(h) = {x : r(x) = 1/2}    (217)

is called the decision boundary.

PROOF. We will show that R(h) − R(h*) ≥ 0. Note that

R(h) = P(Y ≠ h(X)) = ∫ P(Y ≠ h(X)|X = x) f(x) dx.


It suffices to show that P(Y ≠ h(X)|X = x) − P(Y ≠ h*(X)|X = x) ≥ 0 for all x. Now,

P(Y ≠ h(X)|X = x) = 1 − P(Y = h(X)|X = x)
  = 1 − [ P(Y = 1, h(X) = 1|X = x) + P(Y = 0, h(X) = 0|X = x) ]
  = 1 − [ I(h(x) = 1) P(Y = 1|X = x) + I(h(x) = 0) P(Y = 0|X = x) ]
  = 1 − [ I(h(x) = 1) r(x) + I(h(x) = 0)(1 − r(x)) ]
  = 1 − [ I(x) r(x) + (1 − I(x))(1 − r(x)) ]    (218)

where I(x) = I(h(x) = 1). Hence,

P(Y ≠ h(X)|X = x) − P(Y ≠ h*(X)|X = x)
  = [ I*(x) r(x) + (1 − I*(x))(1 − r(x)) ] − [ I(x) r(x) + (1 − I(x))(1 − r(x)) ]
  = (2r(x) − 1)(I*(x) − I(x)) = 2( r(x) − 1/2 )(I*(x) − I(x)).    (219)

When r(x) ≥ 1/2, h*(x) = 1, so (219) is non-negative. When r(x) < 1/2, h*(x) = 0, so both terms are non-positive and hence (219) is again non-negative. This proves (218). To summarize: if h is any classifier, then R(h) ≥ R*.

46.2 Classification is Easier Than Regression


Let r*(x) = E(Y|X = x) be the true regression function and let h*(x) denote the corresponding Bayes rule. Let r̂(x) be an estimate of r*(x) and define the plug-in rule:

ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

In the previous proof we showed that

P(Y ≠ ĥ(X)|X = x) − P(Y ≠ h*(X)|X = x) = (2r*(x) − 1)( I(h*(x) = 1) − I(ĥ(x) = 1) )
  = |2r*(x) − 1| I(h*(x) ≠ ĥ(x)) = 2|r*(x) − 1/2| I(h*(x) ≠ ĥ(x)).

Now, when h*(x) ≠ ĥ(x) we have |r̂(x) − r*(x)| ≥ |r*(x) − 1/2|. Therefore,

P(ĥ(X) ≠ Y) − P(h*(X) ≠ Y) = 2 ∫ |r*(x) − 1/2| I(h*(x) ≠ ĥ(x)) f(x) dx
  ≤ 2 ∫ |r̂(x) − r*(x)| I(h*(x) ≠ ĥ(x)) f(x) dx
  ≤ 2 ∫ |r̂(x) − r*(x)| f(x) dx
  = 2E|r̂(X) − r*(X)|.

This means that if r̂(x) is close to r*(x) then the classification risk will be close to the Bayes risk. The converse is not true. It is possible for r̂ to be far from r*(x) and still lead to a good classifier. As long as r̂(x) and r*(x) are on the same side of 1/2 they yield the same classifier.
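The bound above is easy to see numerically. The following R sketch (an illustration on assumed simulated data, not part of the original notes) compares the excess classification risk of a plug-in rule with 2E|r̂(X) − r*(X)| when r*(x) is known.

# Plug-in classification vs. regression error: a small simulation.
set.seed(3)
rstar <- function(x) plogis(3 * x)           # true regression function r*(x)
n <- 5000
x <- runif(n, -2, 2)
y <- rbinom(n, 1, rstar(x))
# a deliberately biased estimate of r*, still mostly on the right side of 1/2
rhat  <- function(x) pmin(pmax(rstar(x) + 0.15, 0), 1)
hstar <- as.numeric(rstar(x) > 1/2)
hhat  <- as.numeric(rhat(x)  > 1/2)
excess <- mean(y != hhat) - mean(y != hstar)  # estimated R(hhat) - Bayes risk
bound  <- 2 * mean(abs(rhat(x) - rstar(x)))   # 2 E|rhat - r*|
c(excess = excess, bound = bound)             # excess is much smaller than the bound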

46.3 The Bayes Rule and the Class Densities


We can rewrite h* in a different way. From Bayes' theorem we have

r(x) = P(Y = 1|X = x) = f(x|Y = 1) P(Y = 1) / [ f(x|Y = 1) P(Y = 1) + f(x|Y = 0) P(Y = 0) ]
     = π f_1(x) / [ π f_1(x) + (1 − π) f_0(x) ]    (220)

where π = P(Y = 1), f_0(x) = f(x|Y = 0) and f_1(x) = f(x|Y = 1). We call f_0 and f_1 the class densities. Thus we have: the Bayes rule can be written as

h*(x) = 1 if f_1(x)/f_0(x) > (1 − π)/π, and 0 otherwise.    (221)

46.4 How to Find a Good Classifier


The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation to it. There are three main approaches:

1. Empirical Risk Minimization. Choose a set of classifiers H and find ĥ ∈ H that minimizes some estimate of L(h).

2. Regression (Plug-in Classifiers). Find an estimate r̂ of the regression function r and define

   ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

3. Density Estimation. Estimate f_0 from the X_i's for which Y_i = 0, estimate f_1 from the X_i's for which Y_i = 1, and let π̂ = n^{-1} Σ_{i=1}^n Y_i. Define

   r̂(x) = P̂(Y = 1|X = x) = π̂ f̂_1(x) / [ π̂ f̂_1(x) + (1 − π̂) f̂_0(x) ]

   and

   ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

46.5 Empirical Risk Minimization: The Finite Case


Let H be a finite set of classifiers. Empirical risk minimization means choosing the classifier ĥ ∈ H to minimize the training error R̂_n(h), also called the empirical risk. Thus,

ĥ = argmin_{h∈H} R̂_n(h) = argmin_{h∈H} (1/n) Σ_i I(h(X_i) ≠ Y_i).    (222)

Let h* be the best classifier in H, that is, R(h*) = min_{h∈H} R(h). How good is ĥ compared to h*? We know that R(h*) ≤ R(ĥ). We will now show that, with high probability, R(ĥ) ≤ R(h*) + ε for some small ε > 0.

Our main tool for this analysis is Hoeffding's inequality. This inequality is very fundamental and is used in many places in statistics and machine learning.

Hoeffding's Inequality. If X_1, ..., X_n ~ Bernoulli(p), then, for any ε > 0,

P(|p̂ − p| > ε) ≤ 2e^{−2nε²}    (223)

where p̂ = n^{−1} Σ_{i=1}^n X_i.

Another basic fact we need is the union bound: if Z_1, ..., Z_m are random variables then

P(max_j Z_j ≥ c) ≤ Σ_j P(Z_j ≥ c).

This follows since

P(max_j Z_j ≥ c) = P({Z_1 ≥ c} or {Z_2 ≥ c} or ... or {Z_m ≥ c}) ≤ Σ_j P(Z_j ≥ c).

Recall that H = {h_1, ..., h_m} consists of finitely many classifiers. Now we see that

P( max_{h∈H} |R̂_n(h) − R(h)| > ε ) ≤ Σ_{h∈H} P( |R̂_n(h) − R(h)| > ε ) ≤ 2m e^{−2nε²}.

Fix δ and let

ε_n = √( (2/n) log(2m/δ) ).

Then

P( max_{h∈H} |R̂_n(h) − R(h)| > ε_n ) ≤ δ.

Hence, with probability at least 1 − δ, the following is true:

R(ĥ) ≤ R̂_n(ĥ) + ε_n ≤ R̂_n(h*) + ε_n ≤ R(h*) + 2ε_n.

Summarizing:

P( R(ĥ) > R(h*) + √( (8/n) log(2m/δ) ) ) ≤ δ.

We might extend our analysis to infinite H later.
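For intuition, here is a small R simulation (hypothetical, not from the notes) of empirical risk minimization over a finite class of threshold classifiers; it compares the training error of ĥ with an estimate of its true error and with the Hoeffding/union-bound guarantee.

# ERM over a finite class H of threshold rules h_t(x) = I(x > t).
set.seed(4)
gen <- function(n) { x <- runif(n); y <- rbinom(n, 1, ifelse(x > 0.6, 0.9, 0.1)); list(x = x, y = y) }
tr <- gen(200); te <- gen(100000)            # training set and a large "truth" set
ts <- seq(0.05, 0.95, by = 0.05)             # m = 19 classifiers in H
train.err <- sapply(ts, function(t) mean((tr$x > t) != tr$y))
true.err  <- sapply(ts, function(t) mean((te$x > t) != te$y))
eps <- sqrt(2 / 200 * log(2 * length(ts) / 0.05))   # epsilon_n with delta = 0.05
c(train = min(train.err),                    # training error of h-hat (optimistic)
  true  = true.err[which.min(train.err)],    # its (approximate) true error
  bound = min(true.err) + 2 * eps)           # the R(h*) + 2*epsilon_n guarantee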

46.6 Parametric Methods I: Linear and Logistic Regression


One approach to classification is to estimate the regression function r(x) = E(Y|X = x) = P(Y = 1|X = x) and, once we have an estimate r̂, use the classification rule

ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.    (224)

The linear regression model

Y = r(x) + ε = β_0 + Σ_{j=1}^d β_j X_j + ε    (225)

can't be correct since it does not force Y to be 0 or 1. Nonetheless, it can sometimes lead to a good classifier. An alternative is to use logistic regression:

r(x) = P(Y = 1|X = x) = e^{β_0 + Σ_j β_j x_j} / (1 + e^{β_0 + Σ_j β_j x_j}).    (226)

46.2 Example. Let us return to the South African heart disease data.

> print(names(sa.data))
[1] "sbp"       "tobacco"   "ldl"       "adiposity" "famhist"   "typea"
[7] "obesity"   "alcohol"   "age"       "chd"
> n = nrow(sa.data)
>
> ### linear
> out = lm(chd ~ ., data = sa.data)
> tmp = predict(out)
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> print(table(chd,yhat))
   yhat
chd   0   1
  0 260  42
  1  76  84
> print(sum(chd != yhat)/n)
[1] 0.2554113
>
> ### logistic
> out = glm(chd ~ ., data = sa.data, family = binomial)
> tmp = predict(out, type = "response")
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> print(table(chd,yhat))
   yhat
chd   0   1
  0 256  46
  1  77  83
> print(sum(chd != yhat)/n)
[1] 0.2662338

46.3 Example. For the digits example, let's restrict ourselves to Y = 0 and Y = 1. Here is what we get:


> ### linear
> out = lm(ytrain ~ ., data = as.data.frame(xtrain))
> tmp = predict(out)
> n = length(ytrain)
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> b = table(ytrain,yhat)
> print(b)
      yhat
ytrain   0   1
     0 600   0
     1   0 500
> print((b[1,2]+b[2,1])/sum(b))   ### training error
[1] 0
> tmp = predict(out, newdata = as.data.frame(xtest))
Warning message:
prediction from a rank-deficient fit may be misleading in: predict.lm(out, newdata = as.dat
> n = length(ytest)
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> b = table(ytest,yhat)
> print(b)
     yhat
ytest   0   1
    0 590   4
    1   0 505
> print((b[1,2]+b[2,1])/sum(b))   ### testing error
[1] 0.003639672

46.7 Parametric Methods II: Gaussian and Linear Classifiers


Suppose that f_0(x) = f(x|Y = 0) and f_1(x) = f(x|Y = 1) are both multivariate Gaussians:

f_k(x) = (2π)^{−d/2} |Σ_k|^{−1/2} exp( −(1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) ),   k = 0, 1.

Thus, X|Y = 0 ~ N(μ_0, Σ_0) and X|Y = 1 ~ N(μ_1, Σ_1).

46.4 Theorem. If X|Y = 0 ~ N(μ_0, Σ_0) and X|Y = 1 ~ N(μ_1, Σ_1), then the Bayes rule is

h*(x) = 1 if r_1^2 < r_0^2 + 2 log(π_1/π_0) + log(|Σ_0|/|Σ_1|), and 0 otherwise,    (227)

where

r_i^2 = (x − μ_i)^T Σ_i^{−1} (x − μ_i),   i = 0, 1,    (228)

is the Mahalanobis distance. An equivalent way of expressing the Bayes rule is

h*(x) = argmax_{k∈{0,1}} δ_k(x)    (229)

where

δ_k(x) = −(1/2) log |Σ_k| − (1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) + log π_k

and |A| denotes the determinant of a matrix A.

The decision boundary of the above classifier is quadratic, so this procedure is called quadratic discriminant analysis (QDA). In practice, we use sample estimates of π_0, π_1, μ_0, μ_1, Σ_0, Σ_1 in place of the true values, namely:

π̂_0 = (1/n) Σ_{i=1}^n (1 − Y_i),   π̂_1 = (1/n) Σ_{i=1}^n Y_i,
μ̂_0 = (1/n_0) Σ_{i: Y_i=0} X_i,   μ̂_1 = (1/n_1) Σ_{i: Y_i=1} X_i,
S_0 = (1/n_0) Σ_{i: Y_i=0} (X_i − μ̂_0)(X_i − μ̂_0)^T,   S_1 = (1/n_1) Σ_{i: Y_i=1} (X_i − μ̂_1)(X_i − μ̂_1)^T,

where n_0 = Σ_i (1 − Y_i) and n_1 = Σ_i Y_i.

A simplification occurs if we assume that Σ_0 = Σ_1 = Σ. In that case, the Bayes rule is

h*(x) = argmax_k δ_k(x)    (230)

where now

δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k.

The parameters are estimated as before, except that the MLE of Σ is

S = (n_0 S_0 + n_1 S_1)/(n_0 + n_1).

The classification rule is

ĥ(x) = 1 if δ̂_1(x) > δ̂_0(x), and 0 otherwise,    (231)

where

δ̂_j(x) = x^T S^{−1} μ̂_j − (1/2) μ̂_j^T S^{−1} μ̂_j + log π̂_j    (232)

is called the discriminant function. The decision boundary {x : δ̂_0(x) = δ̂_1(x)} is linear, so this method is called linear discriminant analysis (LDA).

46.5 Example. Let us return to the South African heart disease data. In R use:

out = lda(x,y)   ### or qda for quadratic
yhat = predict(out)$class

The error rate of LDA is .25. For QDA we get .24. In this example, there is little advantage to QDA over LDA.

Now we generalize to the case where Y takes on more than two values.

46.6 Theorem. Suppose that Y ∈ {1, ..., K}. If f_k(x) = f(x|Y = k) is Gaussian, the Bayes rule is

h(x) = argmax_k δ_k(x)    (233)

where

δ_k(x) = −(1/2) log |Σ_k| − (1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) + log π_k.

If the variances of the Gaussians are equal, then

δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k.    (234)
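The discriminant functions in (230)-(232) are easy to compute by hand. The following R sketch (an illustration on simulated data; all names are placeholders) implements LDA directly from the sample means, the pooled covariance, and the class proportions, and compares it with MASS::lda.

# LDA "by hand" using the discriminant functions delta_j(x) in (232).
library(MASS)
set.seed(5)
n0 <- 100; n1 <- 100
X  <- rbind(mvrnorm(n0, c(0, 0), diag(2)), mvrnorm(n1, c(2, 1), diag(2)))
y  <- c(rep(0, n0), rep(1, n1))
mu0 <- colMeans(X[y == 0, ]); mu1 <- colMeans(X[y == 1, ])
S0  <- cov(X[y == 0, ]);      S1  <- cov(X[y == 1, ])
S   <- ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)   # pooled covariance
Si  <- solve(S)
pi0 <- n0 / (n0 + n1); pi1 <- 1 - pi0
delta <- function(x, mu, p) sum(x * (Si %*% mu)) - 0.5 * t(mu) %*% Si %*% mu + log(p)
yhat  <- apply(X, 1, function(x) as.numeric(delta(x, mu1, pi1) > delta(x, mu0, pi0)))
mean(yhat != y)                                  # training error of hand-rolled LDA
mean(predict(lda(X, y))$class != y)              # should be essentially the same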

There is another version of linear discriminant analysis due to Fisher. The idea is to first reduce the dimension of the covariates to one by projecting the data onto a line. Algebraically, this means replacing the covariate X = (X_1, ..., X_d) with a linear combination U = w^T X = Σ_{j=1}^d w_j X_j. The goal is to choose the vector w = (w_1, ..., w_d) that best separates the data. Then we perform classification with the one-dimensional covariate U instead of X.

We need to define what we mean by separation of the groups. We would like the two groups to have means that are far apart relative to their spread. Let μ_j denote the mean of X for Y = j and let Σ be the variance matrix of X. Then E(U|Y = j) = E(w^T X|Y = j) = w^T μ_j and V(U) = w^T Σ w. Define the separation by

J(w) = (E(U|Y = 0) − E(U|Y = 1))^2 / (w^T Σ w)
     = (w^T μ_0 − w^T μ_1)^2 / (w^T Σ w)
     = w^T (μ_0 − μ_1)(μ_0 − μ_1)^T w / (w^T Σ w).

(The quantity J arises in physics, where it is called the Rayleigh coefficient.) We estimate J as follows. Let n_j = Σ_{i=1}^n I(Y_i = j) be the number of observations in group j, let X̄_j be the sample mean vector of the X's for group j, and let S_j be the sample covariance matrix in group j. Define

Ĵ(w) = w^T S_B w / (w^T S_W w)    (235)

where

S_B = (X̄_0 − X̄_1)(X̄_0 − X̄_1)^T    (236)
S_W = [ (n_0 − 1)S_0 + (n_1 − 1)S_1 ] / [ (n_0 − 1) + (n_1 − 1) ].    (237)

46.7 Theorem. The vector

w = S_W^{−1}(X̄_0 − X̄_1)

is a maximizer of Ĵ(w). We call

U = w^T X = (X̄_0 − X̄_1)^T S_W^{−1} X

the Fisher linear discriminant function. The midpoint m between X̄_0 and X̄_1 is

m = (1/2) w^T (X̄_0 + X̄_1) = (1/2)(X̄_0 − X̄_1)^T S_W^{−1} (X̄_0 + X̄_1).    (238)

Fisher's classification rule is

h(x) = 0 if w^T x ≥ m, and 1 if w^T x < m.

Fisher's rule is the same as the Bayes linear classifier in equation (231) when π̂ = 1/2.

46.8 Relationship Between Logistic Regression and LDA


LDA and logistic regression are almost the same thing. If we assume that each group is Gaussian with the same covariance matrix, then we saw earlier that

log[ P(Y = 1|X = x) / P(Y = 0|X = x) ] = log(π_1/π_0) − (1/2)(μ_0 + μ_1)^T Σ^{−1} (μ_1 − μ_0) + x^T Σ^{−1} (μ_1 − μ_0)
                                       ≡ α_0 + α^T x.

On the other hand, the logistic model is, by assumption,

log[ P(Y = 1|X = x) / P(Y = 0|X = x) ] = β_0 + β^T x.

These are the same model since they both lead to classification rules that are linear in x. The difference is in how we estimate the parameters. The joint density of a single observation is f(x, y) = f(x|y)f(y) = f(y|x)f(x). In LDA we estimated the whole joint distribution by maximizing the likelihood

Π_i f(X_i, y_i) = Π_i f(X_i|y_i) × Π_i f(y_i)    (239)
                  (Gaussian)        (Bernoulli)

In logistic regression we maximized the conditional likelihood Π_i f(y_i|X_i) but we ignored the second term f(X_i):

Π_i f(X_i, y_i) = Π_i f(y_i|X_i) × Π_i f(X_i).    (240)
                  (logistic)       (ignored)

Since classification only requires knowing f(y|x), we don't really need to estimate the whole joint distribution. Logistic regression leaves the marginal distribution f(x) unspecified, so it is more nonparametric than LDA. This is an advantage of the logistic regression approach over LDA.

To summarize: LDA and logistic regression both lead to a linear classification rule. In LDA we estimate the entire joint distribution f(x, y) = f(x|y)f(y). In logistic regression we only estimate f(y|x) and we don't bother estimating f(x).


46.9 Nearest Neighbors


The k-nearest neighbor rule is

ĥ(x) = 1 if Σ_{i=1}^n w_i(x) I(Y_i = 1) > Σ_{i=1}^n w_i(x) I(Y_i = 0), and 0 otherwise,    (241)

where w_i(x) = 1 if X_i is one of the k nearest neighbors of x and w_i(x) = 0 otherwise. "Nearest" depends on how you define the distance. Often we use Euclidean distance ||X_i − X_j||; in that case you should standardize the variables first.

46.8 Example. Digits again.

> ### knn
> library(class)
> yhat = knn(train = xtrain, cl = ytrain, test = xtest, k = 1)
> b = table(ytest,yhat)
> print(b)
     yhat
ytest   0   1
    0 594   0
    1   0 505
> print((b[1,2]+b[2,1])/sum(b))
[1] 0
>
> yhat = knn.cv(train = xtrain, cl = ytrain, k = 1)
> b = table(ytrain,yhat)
> print(b)
      yhat
ytrain   0   1
     0 599   1
     1   0 500
> print((b[1,2]+b[2,1])/sum(b))
[1] 0.0009090909

An important part of this method is to choose a good value of k. For this we can use cross-validation.

46.9 Example. South African heart disease data again.

library(class)
m = 50
error = rep(0,m)
for(i in 1:m){
  out = knn.cv(train=x, cl=y, k=i)
  error[i] = sum(y != out)/n
}
postscript("knn.sa.ps")
plot(1:m, error, type="l", lwd=3, xlab="k", ylab="error")

See Figure 54.

46.10 Example. Figure 55 compares the decision boundaries in a two-dimensional example. The boundaries are from (i) linear regression, (ii) quadratic regression, (iii) k-nearest neighbors (k = 1), (iv) k-nearest neighbors (k = 50), and (v) k-nearest neighbors (k = 200). The logistic boundary (not shown) is also linear.

Figure 54: knn for the South African heart disease data.

Figure 55: Comparison of decision boundaries: data, linear, quadratic, knn k = 1, knn k = 50, knn k = 200.

Some Theoretical Properties. Let h_1 be the nearest neighbor classifier with k = 1. Cover and Hart (1967) showed that, under very weak assumptions,

R* ≤ lim_{n→∞} R(h_1) ≤ 2R*    (242)

where R* is the Bayes risk. For k > 1 we have

R* ≤ lim_{n→∞} R(h_k) ≤ R* + 1/√(ke).    (243)
46.10 Density Estimation and Naive Bayes


The Bayes rule can be written as

h*(x) = 1 if f_1(x)/f_0(x) > (1 − π)/π, and 0 otherwise.    (244)

We can estimate π by

π̂ = (1/n) Σ_{i=1}^n Y_i.

We can estimate f_0 and f_1 using density estimation. For example, we could apply kernel density estimation to D_0 = {X_i : Y_i = 0} to get f̂_0 and to D_1 = {X_i : Y_i = 1} to get f̂_1. Then we estimate h with

ĥ(x) = 1 if f̂_1(x)/f̂_0(x) > (1 − π̂)/π̂, and 0 otherwise.    (245)

But if x = (x_1, ..., x_d) is high-dimensional, nonparametric density estimation is not very reliable. This problem is ameliorated if we assume that X_1, ..., X_d are independent, for then

f_0(x_1, ..., x_d) = Π_{j=1}^d f_{0j}(x_j)    (246)
f_1(x_1, ..., x_d) = Π_{j=1}^d f_{1j}(x_j).    (247)

We can then use one-dimensional density estimators and multiply them:

f̂_0(x_1, ..., x_d) = Π_{j=1}^d f̂_{0j}(x_j)    (248)
f̂_1(x_1, ..., x_d) = Π_{j=1}^d f̂_{1j}(x_j).    (249)

The resulting classifier is called the naive Bayes classifier. The assumption that the components of X are independent is usually wrong, yet the resulting classifier may still be accurate. Here is a summary of the steps in the naive Bayes classifier:

The Naive Bayes Classifier

1. For each group k = 0, 1, compute an estimate f̂_{kj} of the density f_{kj} for X_j, using the data for which Y_i = k.
2. Let f̂_k(x) = f̂_k(x_1, ..., x_d) = Π_{j=1}^d f̂_{kj}(x_j).
3. Let π̂ = (1/n) Σ_{i=1}^n Y_i.
4. Define ĥ as in (245).
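A minimal R sketch of these four steps, using one-dimensional Gaussian kernel density estimates (simulated data; all object names are illustrative placeholders, not from the notes):

# Naive Bayes with univariate kernel density estimates.
set.seed(6)
n <- 400; d <- 2
y <- rbinom(n, 1, 0.5)
X <- matrix(rnorm(n * d, mean = 2 * y), n, d)        # class-dependent means
pihat <- mean(y)
# step 1: one density estimate per class and per coordinate
dens <- function(v) approxfun(density(v), rule = 2)   # returns f-hat_kj as a function
f0 <- lapply(1:d, function(j) dens(X[y == 0, j]))
f1 <- lapply(1:d, function(j) dens(X[y == 1, j]))
# steps 2 and 4: multiply the coordinate densities and apply rule (245)
nb.classify <- function(x) {
  p0 <- prod(sapply(1:d, function(j) f0[[j]](x[j])))
  p1 <- prod(sapply(1:d, function(j) f1[[j]](x[j])))
  as.numeric(p1 / p0 > (1 - pihat) / pihat)
}
yhat <- apply(X, 1, nb.classify)
mean(yhat != y)                                       # training error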
Naive Bayes is closely related to generalized additive models. Under the naive Bayes model,

log[ P(Y = 1|X) / P(Y = 0|X) ]
  = log[ π f_1(X) / ((1 − π) f_0(X)) ]    (250)
  = log[ π Π_{j=1}^d f_{1j}(X_j) / ((1 − π) Π_{j=1}^d f_{0j}(X_j)) ]    (251)
  = log(π/(1 − π)) + Σ_{j=1}^d log( f_{1j}(X_j)/f_{0j}(X_j) )    (252)
  = β_0 + Σ_{j=1}^d g_j(X_j)    (253)

which has the form of a generalized additive model. Thus we expect similar performance using naive Bayes or generalized additive models.

46.11 Example. For the SA data (note the use of the gam package):

n = nrow(sa.data)
y = chd
x = sa.data[,1:9]
library(gam)
s = .25
out = gam(y ~ lo(sbp,span=.25,degree=1) + lo(tobacco,span=.25,degree=1) +
              lo(ldl,span=.25,degree=1) + lo(adiposity,span=.25,degree=1) +
              famhist + lo(typea,span=.25,degree=1) + lo(obesity,span=.25,degree=1) +
              lo(alcohol,span=.25,degree=1) + lo(age,span=.25,degree=1))
tmp = fitted(out)
yhat = rep(0,n)
yhat[tmp > .5] = 1
print(table(y,yhat))
   yhat
y     0   1
  0 256  46
  1  77  83
print(mean(y != yhat))
[1] 0.2662338


Figure 56: Artificial data.

46.12 Example. Figure 56 shows an artificial data set with two covariates x_1 and x_2. Figure 57 shows kernel density estimates f̂_1(x_1), f̂_1(x_2), f̂_0(x_1), f̂_0(x_2). The top left plot of Figure 58 shows the resulting naive Bayes decision boundary. The bottom left plot shows the predictions from a gam model. Clearly, this is similar to the naive Bayes model. The gam model has an error rate of 0.03. In contrast, a linear model yields a classifier with an error rate of 0.78.

46.11 Trees
Trees are classification methods that partition the covariate space X into disjoint pieces and then classify the observations according to which partition element they fall in. As the name implies, the classifier can be represented as a tree. For illustration, suppose there are two covariates, X_1 = age and X_2 = blood pressure. Figure 59 shows a classification tree using these variables. The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 60 shows the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that y ∈ Y = {0, 1} and that there is only a single covariate X. We choose a split point t that divides the real line into two sets A_1 = (−∞, t] and A_2 = (t, ∞). Let p̂_s(j) be the


Figure 57: Density estimates.

Figure 58: Naive Bayes and GAM classifiers.

Figure 59: A simple classification tree.

Figure 60: Partition representation of the classification tree.

proportion of observations in A_s such that Y_i = j:

p̂_s(j) = Σ_{i=1}^n I(Y_i = j, X_i ∈ A_s) / Σ_{i=1}^n I(X_i ∈ A_s)    (254)

for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be

I(t) = Σ_{s=1}^2 γ_s    (255)

where

γ_s = 1 − Σ_{j=0}^1 p̂_s(j)^2.    (256)

This particular measure of impurity is known as the Gini index. If a partition element A_s contains all 0s or all 1s, then γ_s = 0. Otherwise, γ_s > 0. We choose the split point t to minimize the impurity. (Other indices of impurity can be used besides the Gini index.) When there are several covariates, we choose whichever covariate and split lead to the lowest impurity. This process is continued until some stopping criterion is met. For example, we might stop when every partition element has fewer than n_0 data points, where n_0 is some fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition element. This procedure is easily generalized to the case where Y ∈ {1, ..., K}. We simply define the impurity by

γ_s = 1 − Σ_{j=1}^K p̂_s(j)^2    (257)

where p̂_s(j) is the proportion of observations in the partition element for which Y = j.
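For concreteness, here is a small R sketch (an illustration, not from the notes) that computes the Gini impurity (255)-(256) for a candidate split and finds the best split point for a single covariate by a grid search over the observed values.

# Gini impurity of splitting a single covariate x at threshold t.
gini.split <- function(t, x, y) {
  imp <- function(yy) { p <- mean(yy == 1); 1 - p^2 - (1 - p)^2 }   # gamma_s
  left <- y[x <= t]; right <- y[x > t]
  if (length(left) == 0 || length(right) == 0) return(Inf)
  imp(left) + imp(right)                                            # I(t) in (255)
}
# find the best split on simulated data
set.seed(7)
x <- runif(300)
y <- rbinom(300, 1, ifelse(x > 0.4, 0.85, 0.15))
cand <- sort(unique(x))
best <- cand[which.min(sapply(cand, gini.split, x = x, y = y))]
best    # should be close to 0.4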

46.13 Example. Heart disease data.

X = scan("sa.data", skip=1, sep=",")
> Read 5082 items
X = matrix(X, ncol=11, byrow=T)
chd = X[,11]
n = length(chd)
X = X[,-c(1,11)]
names = c("sbp","tobacco","ldl","adiposity","famhist","typea","obesity","alcohol","age")
for(i in 1:9){
  assign(names[i], X[,i])
}
famhist = as.factor(famhist)
formula = paste(names, sep="", collapse="+")
formula = paste("chd ~ ", formula)
formula = as.formula(formula)
print(formula)
> chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity +
>     alcohol + age
chd = as.factor(chd)
d = data.frame(chd, sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age)

library(tree)
postscript("south.africa.tree.plot1.ps")
out = tree(formula, data=d)
print(summary(out))
> Classification tree:
> tree(formula = formula, data = d)
> Variables actually used in tree construction:
> [1] "age"       "tobacco"   "alcohol"   "typea"     "famhist"   "adiposity"
> [7] "ldl"
> Number of terminal nodes:  15
> Residual mean deviance:  0.8733 = 390.3 / 447
> Misclassification error rate: 0.2078 = 96 / 462
plot(out, type="u", lwd=3)
text(out)
cv = cv.tree(out, method="misclass")
plot(cv, lwd=3)
newtree = prune.tree(out, best=6, method="misclass")
print(summary(newtree))
> Classification tree:
> snip.tree(tree = out, nodes = c(2, 28, 29, 15))
> Variables actually used in tree construction:
> [1] "age"      "typea"    "famhist"  "tobacco"
> Number of terminal nodes:  6
> Residual mean deviance:  1.042 = 475.2 / 456
> Misclassification error rate: 0.2294 = 106 / 462
plot(newtree, lwd=3)
text(newtree, cex=2)

See Figures 61, 62, 63.

"famhist"

"adiposity"

age < 31.5 |

tobacco < 0.51

age < 50.5

alcohol < 11.105 typea < 68.5 0


146

famhist:a

tobacco < 7.605 0 0 0 1

ldl < 6.705

typea < 42.5

adiposity < 28.955

tobacco < 4.15

adiposity < 24.435 0 1 1

adiposity < 28 1

typea < 48 0 0 0

12 165

10

Inf

misclass 145 150

155

160

8 size

10

12

14

Figure 62: Tree

147

age < 31.5 |

0 typea < 68.5 0 1

age < 50.5

famhist:a tobacco < 7.605 0 1 1

Figure 63: Tree

148

46.12 Perceptrons and Support Vector Machines


In this section we consider a class of linear classifiers called support vector machines. It will be convenient to label the outcomes as −1 and +1 instead of 0 and 1. A linear classifier can then be written as

h(x) = sign(H(x))

where x = (x_1, ..., x_d),

H(x) = a_0 + Σ_{i=1}^d a_i x_i

and

sign(z) = −1 if z < 0, 0 if z = 0, and 1 if z > 0.

Note that:

classifier correct   ⟺  Y_i H(X_i) ≥ 0
classifier incorrect ⟺  Y_i H(X_i) ≤ 0.

The classification risk is

R = P(Y ≠ h(X)) = P(Y H(X) ≤ 0) = E(L(Y H(X)))

where the loss function L is L(a) = 1 if a < 0 and L(a) = 0 if a ≥ 0.

Suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly separates the two classes. How can we find a separating hyperplane? LDA is not guaranteed to find it. A separating hyperplane will minimize

− Σ_{misclassified} Y_i H(X_i).

Rosenblatt's perceptron algorithm takes starting values and updates them at misclassified points:

a_0 ← a_0 + Y_i,   a ← a + Y_i X_i.

However, there are many separating hyperplanes. The particular separating hyperplane that this algorithm converges to depends on the starting values. Intuitively, it seems reasonable to choose the hyperplane furthest from the data in the sense that it separates the +1s and −1s and maximizes the distance to the closest point. This hyperplane is called the maximum margin hyperplane. The margin is the distance from the hyperplane to the nearest point. Points on the boundary of the margin are called support vectors. See Figure 64.

46.14 Lemma. The data can be separated by some hyperplane if and only if there exists a hyperplane H(x) = a_0 + Σ_{i=1}^d a_i x_i such that

Y_i H(X_i) ≥ 1,   i = 1, ..., n.    (258)

PROOF. Suppose the data can be separated by a hyperplane W(x) = b_0 + Σ_{i=1}^d b_i x_i. It follows that there exists some constant c such that Y_i = 1 implies W(X_i) ≥ c and Y_i = −1 implies W(X_i) ≤ −c. Therefore, Y_i W(X_i) ≥ c for all i. Let H(x) = a_0 + Σ_{i=1}^d a_i x_i where a_j = b_j/c. Then Y_i H(X_i) ≥ 1 for all i. The reverse direction is straightforward.

The goal, then, is to maximize the margin, subject to (258). Given two vectors a and b, let ⟨a, b⟩ = a^T b = Σ_j a_j b_j denote the inner product of a and b.
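The perceptron update is only a few lines of R. The sketch below (simulated, linearly separable data; all names are hypothetical and not from the notes) repeatedly sweeps through the data and updates (a_0, a) at a misclassified point until no point is misclassified.

# Rosenblatt's perceptron on linearly separable data.
set.seed(8)
n <- 100
X <- matrix(runif(2 * n, -1, 1), n, 2)
Y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)       # separable by the line x1 + x2 = 0
a0 <- 0; a <- c(0, 0)
repeat {
  H   <- a0 + X %*% a
  bad <- which(Y * H <= 0)                    # misclassified (or on the boundary)
  if (length(bad) == 0) break
  i  <- bad[1]
  a0 <- a0 + Y[i]                             # update using one misclassified point
  a  <- a + Y[i] * X[i, ]
}
c(a0 = a0, a = a)
mean(sign(a0 + X %*% a) != Y)                 # training error: should be 0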

Figure 64: The hyperplane H(x) = a_0 + a^T x = 0 has the largest margin of all hyperplanes that separate the two classes.

46.15 Theorem. Let H(x) = a_0 + Σ_{i=1}^d a_i x_i denote the optimal (largest margin) hyperplane. Then, for j = 1, ..., d,

a_j = Σ_{i=1}^n α̂_i Y_i X_j(i)

where X_j(i) is the value of the covariate X_j for the ith data point, and α̂ = (α̂_1, ..., α̂_n) is the vector that maximizes

Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n α_i α_k Y_i Y_k ⟨X_i, X_k⟩    (259)

subject to α_i ≥ 0 and 0 = Σ_i α_i Y_i.

The points X_i for which α̂_i ≠ 0 are called support vectors. a_0 can be found by solving Y_i(X_i^T a + a_0) = 1 for any support point X_i. H may be written as

H(x) = a_0 + Σ_{i=1}^n α̂_i Y_i ⟨x, X_i⟩.

There are many software packages that will solve this problem quickly. If there is no perfect linear classifier, then one allows overlap between the groups by replacing the condition (258) with

Y_i H(X_i) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, ..., n.    (260)

The variables ξ_1, ..., ξ_n are called slack variables.

We now maximize (259) subject to

0 ≤ α_i ≤ c,   i = 1, ..., n,   and   Σ_{i=1}^n α_i Y_i = 0.

The constant c is a tuning parameter that controls the amount of overlap. In R we can use the package e1071.

46.16 Example. The iris data.

library(e1071)
data(iris)
x = iris[51:150,]
a = x[,5]
x = x[,-5]
attributes(a)
$levels
[1] "setosa"     "versicolor" "virginica"
$class
[1] "factor"
n = length(a)
y = rep(0,n)
y[a == "versicolor"] = 1
y = as.factor(y)
out = svm(x, y)
print(out)

Call:
svm.default(x = x, y = y)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  33

summary(out)

Call:
svm.default(x = x, y = y)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  33
 ( 17 16 )

Number of Classes:  2
Levels:
 0 1

Figure 65: Iris data plotted in two multidimensional-scaling coordinates; the circled points are the support vectors.

## test with train data
pred = predict(out, x)
table(pred, y)
    y
pred  0  1
   0 49  2
   1  1 48
M = cmdscale(dist(x))
plot(M, col = as.integer(y)+1, pch = as.integer(y)+1)
## support vectors
I = 1:n %in% out$index
points(M[I,], lwd=2)

See Figure 65.

Here is another (easier) way to think about the SVM. The SVM hyperplane H(x) = β_0 + x^T β can be obtained by minimizing

Σ_{i=1}^n (1 − Y_i H(X_i))_+ + λ ||β||^2.

Figure 66 compares the SVM (hinge) loss with the squared loss, the classification error, and the logistic loss log(1 + e^{−yH(x)}).

Figure 66: The hinge loss and related loss functions, plotted against y H(x).

Figure 67: Kernelization. Mapping the covariates into a higher-dimensional space can make a complicated decision boundary into a simpler decision boundary.

46.13 Kernelization

There is a trick called kernelization for improving a computationally simple classifier h. The idea is to map the covariate X, which takes values in X, into a higher-dimensional space Z and apply the classifier in the bigger space Z. This can yield a more flexible classifier while retaining computational simplicity.

The standard example of this idea is illustrated in Figure 67. The covariate is x = (x_1, x_2). The Y_i's can be separated into two groups using an ellipse. Define a mapping φ by

z = (z_1, z_2, z_3) = φ(x) = (x_1^2, √2 x_1 x_2, x_2^2).

Thus, φ maps X = R^2 into Z = R^3. In the higher-dimensional space Z, the Y_i's are separable by a linear decision boundary. In other words, a linear classifier in a higher-dimensional space corresponds to a non-linear classifier in the original space. The point is that to get a richer set of classifiers we do not need to give up the convenience of linear classifiers. We simply map the covariates to a higher-dimensional space. This is akin to making linear regression more flexible by using polynomials.

There is a potential drawback. If we significantly expand the dimension of the problem, we might increase the computational burden. For example, if x has dimension d = 256 and we wanted to use all fourth-order terms, then z = φ(x) has dimension 183,181,376. We are spared this computational nightmare by the following two facts. First, many classifiers do not require that we know the values of the individual points but, rather, just the inner product between pairs of points. Second, notice in our example that the inner product in Z can be written

⟨z, z̃⟩ = ⟨φ(x), φ(x̃)⟩ = x_1^2 x̃_1^2 + 2 x_1 x̃_1 x_2 x̃_2 + x_2^2 x̃_2^2 = (⟨x, x̃⟩)^2 ≡ K(x, x̃).

Thus, we can compute ⟨z, z̃⟩ without ever computing Z_i = φ(X_i).
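This identity is easy to check numerically. A minimal R sketch (illustrative only; the explicit map phi and the test points are made up):

# Verify that <phi(x), phi(xt)> equals the polynomial kernel (<x, xt>)^2.
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
K   <- function(x, xt) sum(x * xt)^2
x  <- c(1.3, -0.7)
xt <- c(0.4,  2.1)
sum(phi(x) * phi(xt))   # inner product in the feature space Z
K(x, xt)                # kernel evaluated in the original space: same number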

To summarize, kernelization involves finding a mapping φ : X → Z and a classifier such that:

1. Z has higher dimension than X and so leads to a richer set of classifiers.
2. The classifier only requires computing inner products.
3. There is a function K, called a kernel, such that ⟨φ(x), φ(x̃)⟩ = K(x, x̃).
4. Everywhere the term ⟨x, x̃⟩ appears in the algorithm, replace it with K(x, x̃).

In fact, we never need to construct the mapping φ at all. We only need to specify a kernel K(x, x̃) that corresponds to ⟨φ(x), φ(x̃)⟩ for some φ. This raises an interesting question: given a function of two variables K(x, y), does there exist a function φ(x) such that K(x, y) = ⟨φ(x), φ(y)⟩? The answer is provided by Mercer's theorem, which says, roughly, that if K is positive definite, meaning that ∫∫ K(x, y) f(x) f(y) dx dy ≥ 0 for square integrable functions f, then such a φ exists. Examples of commonly used kernels are:

polynomial   K(x, x̃) = (⟨x, x̃⟩ + a)^r
sigmoid      K(x, x̃) = tanh(a⟨x, x̃⟩ + b)
Gaussian     K(x, x̃) = exp( −||x − x̃||^2/(2σ^2) )

Let us now see how we can use this trick in LDA and in support vector machines. Recall that the Fisher linear discriminant method replaces X with U = w^T X where w is chosen to maximize the Rayleigh coefficient

J(w) = w^T S_B w / (w^T S_W w),

S_B = (X̄_0 − X̄_1)(X̄_0 − X̄_1)^T   and   S_W = [ (n_0 − 1)S_0 + (n_1 − 1)S_1 ] / [ (n_0 − 1) + (n_1 − 1) ].

In the kernelized version, we replace X_i with Z_i = φ(X_i) and we find w to maximize

J(w) = w^T S̃_B w / (w^T S̃_W w)    (261)

where

S̃_B = (Z̄_0 − Z̄_1)(Z̄_0 − Z̄_1)^T   and   S̃_W = [ (n_0 − 1)S̃_0 + (n_1 − 1)S̃_1 ] / [ (n_0 − 1) + (n_1 − 1) ].

Here, S̃_j is the sample covariance of the Z_i's for which Y = j. However, to take advantage of kernelization, we need to re-express this in terms of inner products and then replace the inner products with kernels. It can be shown that the maximizing vector w is a linear combination of the Z_i's. Hence we can write

w = Σ_{i=1}^n α_i Z_i.

Also,

Z̄_j = (1/n_j) Σ_{i=1}^n φ(X_i) I(Y_i = j).

Therefore,

w^T Z̄_j = ( Σ_{i=1}^n α_i Z_i )^T (1/n_j) Σ_{s=1}^n φ(X_s) I(Y_s = j)
        = (1/n_j) Σ_{i=1}^n Σ_{s=1}^n α_i I(Y_s = j) φ(X_i)^T φ(X_s)
        = (1/n_j) Σ_{i=1}^n α_i Σ_{s=1}^n I(Y_s = j) K(X_i, X_s)
        = α^T M_j

where M_j is a vector whose ith component is

M_j(i) = (1/n_j) Σ_{s=1}^n K(X_i, X_s) I(Y_s = j).

It follows that w^T S̃_B w = α^T M α where M = (M_0 − M_1)(M_0 − M_1)^T. By similar calculations, we can write w^T S̃_W w = α^T N α where

N = K_0 (I − (1/n_0) 1 1^T) K_0^T + K_1 (I − (1/n_1) 1 1^T) K_1^T,

I is the identity matrix, 1 is a matrix of all ones, and K_j is the n × n_j matrix with entries (K_j)_{rs} = K(x_r, x_s) with x_s varying over the observations in group j. Hence, we now find α to maximize

J(α) = α^T M α / (α^T N α).

All the quantities are expressed in terms of the kernel. Formally, the solution is α = N^{−1}(M_0 − M_1). However, N might be non-invertible. In this case one replaces N by N + bI, for some constant b. Finally, the projection onto the new subspace can be written as

U = w^T φ(x) = Σ_{i=1}^n α_i K(X_i, x).

The support vector machine can similarly be kernelized. We simply replace ⟨X_i, X_k⟩ with K(X_i, X_k). For example, instead of maximizing (259), we now maximize

Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n α_i α_k Y_i Y_k K(X_i, X_k).    (262)

The hyperplane can be written as

H(x) = a_0 + Σ_{i=1}^n α̂_i Y_i K(x, X_i).

46.14 Other Classifiers

There are many other classifiers and space precludes a full discussion of all of them. Let us briefly mention a few.

Bagging is a method for reducing the variability of a classifier. It is most helpful for highly nonlinear classifiers such as trees. We draw B bootstrap samples from the data. The bth bootstrap sample yields a classifier ĥ_b. The final classifier is

ĥ(x) = 1 if (1/B) Σ_{b=1}^B ĥ_b(x) ≥ 1/2, and 0 otherwise.

Boosting is a method for starting with a simple classifier and gradually improving it by refitting the data, giving higher weight to misclassified samples. Suppose that H is a collection of classifiers, for example, trees with only one split. Assume that Y_i ∈ {−1, 1} and that each h is such that h(x) ∈ {−1, 1}. We usually give equal weight to all data points in the methods we have discussed, but one can incorporate unequal weights quite easily in most algorithms. For example, in constructing a tree, we could replace the impurity measure with a weighted impurity measure. The original version of boosting, called AdaBoost, is as follows.

1. Set the weights w_i = 1/n, i = 1, ..., n.

2. For j = 1, ..., J, do the following steps:

   (a) Construct a classifier h_j from the data using the weights w_1, ..., w_n.

   (b) Compute the weighted error estimate:

       L̂_j = Σ_{i=1}^n w_i I(Y_i ≠ h_j(X_i)) / Σ_{i=1}^n w_i.

   (c) Let α_j = log((1 − L̂_j)/L̂_j).

   (d) Update the weights:

       w_i ← w_i e^{α_j I(Y_i ≠ h_j(X_i))}.

3. The final classifier is

   h(x) = sign( Σ_{j=1}^J α_j h_j(x) ).
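Here is a compact R sketch of AdaBoost using depth-one trees (stumps) from the rpart package as the weak classifiers; it is an illustration on simulated data rather than the notes' own code, and all names are placeholders.

# AdaBoost with decision stumps (rpart trees of depth 1).
library(rpart)
set.seed(9)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- ifelse(x1^2 + x2^2 < 0.7, 1, -1)           # labels in {-1, +1}
dat <- data.frame(x1 = x1, x2 = x2, y = factor(y))
J <- 50; w <- rep(1/n, n); alpha <- numeric(J)
F <- rep(0, n)                                   # running score sum_j alpha_j h_j(x)
for (j in 1:J) {
  fit <- rpart(y ~ x1 + x2, data = dat, weights = w,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  hj  <- ifelse(predict(fit, dat, type = "class") == "1", 1, -1)
  Lj  <- sum(w * (hj != y)) / sum(w)             # weighted error, step (b)
  alpha[j] <- log((1 - Lj) / Lj)                 # step (c)
  w   <- w * exp(alpha[j] * (hj != y))           # reweight misclassified points, step (d)
  F   <- F + alpha[j] * hj
}
mean(sign(F) != y)                               # training error of the boosted classifier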

There is now an enormous literature trying to explain and improve on boosting. Whereas bagging is a variance-reduction technique, boosting can be thought of as a bias-reduction technique. We start with a simple, and hence highly biased, classifier and gradually reduce the bias. The disadvantage of boosting is that the final classifier is quite complicated. To understand what boosting is doing, consider the following modified algorithm (Friedman, Hastie and Tibshirani (2000), Annals of Statistics, pp. 337-407):

1. Set the weights w_i = 1/n, i = 1, ..., n.

2. For j = 1, ..., J, do the following steps:

   (a) Fit a weighted binary regression to get p̂_j(x) = P̂(Y = 1|X = x).

   (b) Let f_j(x) = (1/2) log( p̂_j(x)/(1 − p̂_j(x)) ).

   (c) Set w_i ← w_i e^{−Y_i f_j(X_i)}, then normalize the weights to sum to one.

3. The final classifier is

   h(x) = sign( Σ_{j=1}^J f_j(x) ).

Consider the risk function J(F) = E(e^{−Y F(X)}). This is minimized by

F(x) = (1/2) log( P(Y = 1|X = x)/P(Y = −1|X = x) ).

Thus,

P(Y = 1|X = x) = e^{2F(x)}/(1 + e^{2F(x)}).

Friedman, Hastie and Tibshirani show that stagewise regression, applied to the loss J(F) = E(e^{−Y F(X)}), yields the boosting algorithm. Moreover, this is essentially logistic regression. To see this, let Ỹ = (Y + 1)/2 so that Ỹ ∈ {0, 1}. The logistic log-likelihood is

ℓ = Ỹ log p(x) + (1 − Ỹ) log(1 − p(x)).

Insert Ỹ = (Y + 1)/2 and p = e^{2F}/(1 + e^{2F}) and then

ℓ(F) = −log(1 + e^{−2Y F(X)}).

Now do a second-order Taylor series expansion around F = 0 to conclude that −ℓ(F) ≈ J(F) + constant. Hence, boosting is essentially stagewise logistic regression.

Neural Networks are regression models of the form

Y = β_0 + Σ_{j=1}^p β_j σ(α_{0j} + α_j^T X)

where σ is a smooth function, often taken to be σ(v) = e^v/(1 + e^v). (This is the simplest version of a neural net; there are more complex versions of the model.) This is really nothing more than a nonlinear regression model. Neural nets were fashionable for some time but they pose great computational difficulties. In particular, one often encounters multiple minima when trying to find the least squares estimates of the parameters. Also, the number of terms p is essentially a smoothing parameter and there is the usual problem of trying to choose p to find a good balance between bias and variance.

46.15 Assessing Error Rates and Choosing a Good Classifier

How do we choose a good classifier? We would like a classifier h with a low true error rate L(h). Usually, we can't use the training error rate L̂_n(h) as an estimate of the true error rate because it is biased downward.

46.17 Example. Consider the heart disease data again. Suppose we fit a sequence of logistic regression models. In the first model we include one covariate. In the second model we include two covariates, and so on. The ninth model includes all the covariates. We can go even further. Let's also fit a tenth model that includes all nine covariates plus the first covariate squared. Then we fit an eleventh model that includes all nine covariates plus the first covariate squared and the second covariate squared. Continuing this way we get a sequence of 18 classifiers of increasing complexity. The solid line in Figure 68 shows the observed classification error, which steadily decreases as we make the model more complex. If we keep going, we can make a model with zero observed classification error. The dotted line shows the 10-fold cross-validation estimate of the error rate (to be explained shortly), which is a better estimate of the true error rate than the observed classification error. The estimated error decreases for a while, then increases. This is essentially the bias-variance tradeoff phenomenon we have seen before.


Figure 68: The solid line is the observed error rate and the dashed line is the cross-validation estimate of the true error rate.

There are many ways to estimate the error rate. We'll consider two: cross-validation and probability inequalities.

CROSS-VALIDATION. The basic idea of cross-validation, which we have already encountered in curve estimation, is to leave out some of the data when fitting a model. The simplest version of cross-validation involves randomly splitting the data into two pieces: the training set T and the validation set V. Often, about 10 percent of the data is set aside as the validation set. The classifier ĥ is constructed from the training set. We then estimate the error by

L̂(h) = (1/m) Σ_{X_i ∈ V} I(ĥ(X_i) ≠ Y_i)    (263)

where m is the size of the validation set. See Figure 69.

Another approach to cross-validation is K-fold cross-validation, which is obtained from the following algorithm.

K-fold cross-validation.

1. Randomly divide the data into K chunks of approximately equal size. A common choice is K = 10.

2. For k = 1 to K, do the following:

Figure 69: Cross-validation. The data are divided into two groups: the training data T and the validation data V. The training data are used to produce an estimated classifier ĥ. Then, ĥ is applied to the validation data to obtain an estimate L̂ of the error rate of ĥ.

   (a) Delete chunk k from the data.

   (b) Compute the classifier ĥ_{(k)} from the rest of the data.

   (c) Use ĥ_{(k)} to predict the data in chunk k. Let L̂_{(k)} denote the observed error rate.

3. Let

   L̂(h) = (1/K) Σ_{k=1}^K L̂_{(k)}.    (264)

46.18 Example. We applied 10-fold cross-validation to the heart disease data. That's what cv.tree does.
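For classifiers without a built-in helper like cv.tree, K-fold cross-validation is easy to code directly. A minimal R sketch (logistic regression on simulated data; the fold assignment and model are illustrative assumptions, not from the notes):

# K-fold cross-validation estimate of classification error.
set.seed(10)
n <- 400
x <- matrix(rnorm(n * 3), n, 3)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))
d <- data.frame(y = y, x)
K <- 10
fold <- sample(rep(1:K, length.out = n))        # random fold labels
err <- numeric(K)
for (k in 1:K) {
  fit <- glm(y ~ ., data = d[fold != k, ], family = binomial)   # leave chunk k out
  p   <- predict(fit, newdata = d[fold == k, ], type = "response")
  err[k] <- mean((p > 0.5) != d$y[fold == k])
}
mean(err)                                       # the K-fold CV error estimate (264)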

46.16 Optional Reading: Some Theory on Estimating Classification Error

(OPTIONAL. READ ONLY IF YOU ARE INTERESTED.) Another approach to estimating the error rate is to find a confidence interval for L_n(ĥ) using probability inequalities. This method is useful in the context of empirical risk minimization. Let H be a set of classifiers, for example, all linear classifiers. Empirical risk minimization means choosing the classifier ĥ ∈ H to minimize the training error L̂_n(h), also called the empirical risk. Thus,

ĥ = argmin_{h∈H} L̂_n(h) = argmin_{h∈H} (1/n) Σ_i I(h(X_i) ≠ Y_i).    (265)

Typically, L̂_n(ĥ) underestimates the true error rate L(ĥ) because ĥ was chosen to make L̂_n(ĥ) small. Our goal is to assess how much underestimation is taking place. Our main tool for this analysis is Hoeffding's inequality: if X_1, ..., X_n ~ Bernoulli(p), then, for any ε > 0,

P(|p̂ − p| > ε) ≤ 2e^{−2nε²}    (266)

where p̂ = n^{−1} Σ_{i=1}^n X_i.

First, suppose that H = {h_1, ..., h_m} consists of finitely many classifiers. For any fixed h, L̂_n(h) converges almost surely to L(h) by the law of large numbers. We will now establish a stronger result.

46.19 Theorem (Uniform Convergence). Assume H is finite and has m elements. Then,

P( max_{h∈H} |L̂_n(h) − L(h)| > ε ) ≤ 2me^{−2nε²}.
160

PROOF. We will use Hoeffding's inequality and we will also use the fact that if A_1, ..., A_m is a set of events then P(∪_{i=1}^m A_i) ≤ Σ_{i=1}^m P(A_i). Now,

P( max_{h∈H} |L̂_n(h) − L(h)| > ε ) = P( ∪_{h∈H} { |L̂_n(h) − L(h)| > ε } )
  ≤ Σ_{h∈H} P( |L̂_n(h) − L(h)| > ε ) ≤ Σ_{h∈H} 2e^{−2nε²} = 2me^{−2nε²}.

46.20 Theorem. Let

ε = √( (2/n) log(2m/α) ).

Then L̂_n(ĥ) ± ε is a 1 − α confidence interval for L(ĥ).

PROOF. This follows from the fact that

P( |L̂_n(ĥ) − L(ĥ)| > ε ) ≤ P( max_{h∈H} |L̂_n(h) − L(h)| > ε ) ≤ 2me^{−2nε²} = α.

When H is large, the confidence interval for L(ĥ) is large. The more functions there are in H, the more likely it is that we have overfit, which we compensate for by having a larger confidence interval. In practice we usually use sets H that are infinite, such as the set of linear classifiers. To extend our analysis to these cases we want to be able to say something like

P( sup_{h∈H} |L̂_n(h) − L(h)| > ε ) ≤ something not too big.

One way to develop such a generalization is by way of the Vapnik-Chervonenkis or VC dimension. Let A be a class of sets. Given a finite set F = {x_1, ..., x_n}, let

N_A(F) = #{ F ∩ A : A ∈ A }    (267)

be the number of subsets of F picked out by A. Here #(B) denotes the number of elements of a set B. The shatter coefficient is defined by

s(A, n) = max_{F∈F_n} N_A(F)    (268)

where F_n consists of all finite sets of size n. Now let X_1, ..., X_n ~ P and let

P_n(A) = (1/n) Σ_i I(X_i ∈ A)

denote the empirical probability measure. The following remarkable theorem bounds the distance between P and P_n.

46.21 Theorem (Vapnik and Chervonenkis (1971)). For any P, n and ε > 0,

P( sup_{A∈A} |P_n(A) − P(A)| > ε ) ≤ 8 s(A, n) e^{−nε²/32}.    (269)

The proof, though very elegant, is long and we omit it. If H is a set of classifiers, define A to be the class of sets of the form {x : h(x) = 1}. We then define s(H, n) = s(A, n).

46.22 Theorem.

P( sup_{h∈H} |L̂_n(h) − L(h)| > ε ) ≤ 8 s(H, n) e^{−nε²/32}.

A 1 − α confidence interval for L(ĥ) is L̂_n(ĥ) ± ε_n where

ε_n^2 = (32/n) log( 8 s(H, n)/α ).

These theorems are only useful if the shatter coefficients do not grow too quickly with n. This is where VC dimension enters.

The VC (Vapnik-Chervonenkis) dimension of a class of sets A is defined as follows. If s(A, n) = 2^n for all n, set VC(A) = ∞. Otherwise, define VC(A) to be the largest k for which s(A, k) = 2^k. Thus, the VC-dimension is the size of the largest finite set F that can be shattered by A, meaning that A picks out each subset of F. If H is a set of classifiers we define VC(H) = VC(A) where A is the class of sets of the form {x : h(x) = 1} as h varies in H. The following theorem shows that if A has finite VC-dimension, then the shatter coefficients grow as a polynomial in n.

46.23 Theorem. If A has finite VC-dimension v, then s(A, n) ≤ n^v + 1.

46.24 Example. Let A = {(−∞, a] : a ∈ R}. Then A shatters every 1-point set {x} but it shatters no set of the form {x, y}. Therefore, VC(A) = 1.

46.25 Example. Let A be the set of closed intervals on the real line. Then A shatters S = {x, y} but it cannot shatter sets with 3 points. Consider S = {x, y, z} where x < y < z. One cannot find an interval A such that A ∩ S = {x, z}. So, VC(A) = 2.

46.26 Example. Let A be all linear half-spaces on the plane. Any 3-point set (not all on a line) can be shattered. No 4-point set can be shattered. Consider, for example, 4 points forming a diamond. Let T be the left- and rightmost points. This can't be picked out. Other configurations can also be seen to be unshatterable. So VC(A) = 3. In general, half-spaces in R^d have VC dimension d + 1.

46.27 Example. Let A be all rectangles on the plane with sides parallel to the axes. Any 4-point set can be shattered. Let S be a 5-point set. There is one point that is not leftmost, rightmost, uppermost, or lowermost. Let T be all points in S except this point. Then T can't be picked out. So VC(A) = 4.

46.28 Theorem. Let x have dimension d and let H be the set of linear classifiers. The VC-dimension of H is d + 1. Hence, a 1 − α confidence interval for the true error rate is L̂(ĥ) ± ε_n where

ε_n^2 = (32/n) log( 8 (n^{d+1} + 1)/α ).
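To get a feel for how conservative this bound is, the following small R helper (a sketch of the formula above, not from the notes) computes the half-width ε_n of the VC confidence interval for linear classifiers in dimension d.

# Half-width of the 1 - alpha VC confidence interval for linear classifiers.
vc.epsilon <- function(n, d, alpha = 0.05) {
  sqrt((32 / n) * log(8 * (n^(d + 1) + 1) / alpha))
}
vc.epsilon(n = 1000, d = 2)    # quite wide
vc.epsilon(n = 1e6,  d = 2)    # shrinks slowly with n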

Figure 70: A graph with vertices V = {X, Y, Z}. The edge set is E = {(X, Y), (Y, Z)}.

Figure 71: A regression model.

47 Graphical Models
Regression models are a special case of a more general class of models called graphical models. There are two main types of graphical models: undirected and directed.

47.1 Undirected Graphs


Undirected graphs are a method for representing independence relations. An undirected graph G = (V, E), also called a Markov random field, has a finite set V of vertices (or nodes) and a set E of edges (or arcs) consisting of pairs of vertices. The vertices correspond to random variables X, Y, Z, ... and edges are written as unordered pairs. For example, (X, Y) ∈ E means that X and Y are joined by an edge. Examples of graphs are in Figure 70 and Figure 71.

Two vertices are adjacent, written X ~ Y, if there is an edge between them. In Figure 70, X and Y are adjacent but X and Z are not adjacent. A sequence X_0, ..., X_n is called a path if X_{i−1} ~ X_i for each i. In Figure 70, X, Y, Z is a path. A graph is complete if there is an edge between every pair of vertices. A subset U ⊂ V of vertices together with their edges is called a subgraph. If A, B and C are three distinct subsets of V, we say that C separates A and B if every path from a variable in A to a variable in B intersects a variable in C. In Figure 72, {Y, W} and {Z} are separated by {X}. Also, W and Z are separated by {X, Y}.

Now we relate graphs to probability distributions. Recall that X_1 and X_2 are independent, written X_1 ⊥ X_2, if p(x_1, x_2) = p(x_1)p(x_2). We say that X and Y are conditionally independent given Z, written X ⊥ Y | Z, if p(x, y|z) = p(x|z)p(y|z). Let V be a set of random variables with distribution P. Construct a graph with one vertex for each random variable

Figure 72: {Y, W} and {Z} are separated by {X}. Also, W and Z are separated by {X, Y}.

in V. Omit the edge between a pair of variables if they are independent given the rest of the variables:

no edge between X and Y  ⟺  X ⊥ Y | rest

where "rest" refers to all the other variables besides X and Y. The resulting graph is called a pairwise Markov graph. Some examples are shown in Figures 73, 74, 75, and 76. The graph encodes a set of pairwise conditional independence relations. These relations imply other conditional independence relations. How can we figure out what they are? Fortunately, we can read these other conditional independence relations directly from the graph as well, as explained in the next theorem.

47.1 Theorem. Let G = (V, E) be a pairwise Markov graph for a distribution P. Let A, B and C be distinct subsets of V such that C separates A and B. Then A ⊥ B | C.

47.2 Remark. If A and B are not connected (i.e., there is no path from A to B) then we may regard A and B as being separated by the empty set. Then Theorem 47.1 implies that A ⊥ B.

The independence condition in Theorem 47.1 is called the global Markov property. We thus see that the pairwise and global Markov properties are equivalent. Let us state this more precisely. Given a graph G, let M_pair(G) be the set of distributions which satisfy the pairwise Markov property: thus P ∈ M_pair(G) if, under P, X ⊥ Y | rest if and only if there is no edge between X and Y. Let M_global(G) be the set of distributions which satisfy the global Markov property: thus P ∈ M_global(G) if, under P, A ⊥ B | C if and only if C separates A and B.

47.3 Theorem. Let G be a graph. Then, M_pair(G) = M_global(G).

Theorem 47.3 allows us to construct graphs using the simpler pairwise property, and then we can deduce other independence relations using the global Markov property. Think how hard this would be to do algebraically. Returning to Figure 76, we now see that X ⊥ Z | Y and Y ⊥ W | Z.

Figure 73: X ⊥ Z | Y.

Figure 74: No implied independence relations.

Figure 75: X ⊥ Z | {Y, W} and Y ⊥ W | {X, Z}.

Figure 76: Pairwise independence implies that X ⊥ Z | {Y, W}. But is X ⊥ Z | Y?

Figure 77: X ⊥ W | (Y, Z) and X ⊥ Z | Y.

Figure 78: X ⊥ Y, X ⊥ Z and X ⊥ (Y, Z).

47.4 Example. Figure 77 implies that X ⊥ W | (Y, Z) and X ⊥ Z | Y.

47.5 Example. Figure 78 implies that X ⊥ Y, X ⊥ Z and X ⊥ (Y, Z).

A clique is a set of variables in a graph that are all adjacent to each other. A set of variables is a maximal clique if it is a clique and if it is not possible to include another variable and still be a clique. A potential is any positive function. It can be shown that P is Markov to G if and only if its probability function f can be written as

f(x) = Π_{C∈C} ψ_C(x_C) / Z    (270)

where C is the set of maximal cliques and

Z = Σ_x Π_{C∈C} ψ_C(x_C)

is called the normalizing constant or the partition function. This is called the Hammersley-Clifford theorem. If we define the energy function H_C(x_C) = −log ψ_C(x_C) then we can write

f(x) ∝ e^{−Σ_C H_C(x_C)}.

47.6 Example. The maximal cliques for the graph in Figure 70 are C_1 = {X, Y} and C_2 = {Y, Z}. Hence, if P is Markov to the graph, then its probability function can be written f(x, y, z) ∝ ψ_1(x, y) ψ_2(y, z) for some positive functions ψ_1 and ψ_2.

47.7 Example. The maximal cliques for the graph in Figure 79 are {X_1, X_2}, {X_1, X_3}, {X_2, X_4}, {X_3, X_5}, {X_2, X_5, X_6}. Thus we can write the probability function as

f(x_1, ..., x_6) ∝ ψ_{12}(x_1, x_2) ψ_{13}(x_1, x_3) ψ_{24}(x_2, x_4) ψ_{35}(x_3, x_5) ψ_{256}(x_2, x_5, x_6).

Figure 79: The maximal cliques of this graph are {X_1, X_2}, {X_1, X_3}, {X_2, X_4}, {X_3, X_5}, {X_2, X_5, X_6}.

47.8 Example (Images and the Ising Model). Consider an image with pixels taking values Y_i ∈ {−1, 1}. Can we construct a probability distribution for images? A common example of such a distribution is the Ising model

p(x) ∝ exp( Σ_{i∼j} x_i x_j )

where the sum is over neighboring pixels. Similar models are used to describe materials that experience phase transitions, such as magnets.
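Sampling from an Ising model is a standard use of Gibbs sampling, since each pixel's conditional distribution given its neighbors is available in closed form. The following R sketch (an illustration with an added inverse-temperature parameter beta, which is an assumption not in the text) simulates a small Ising image on a grid.

# Gibbs sampler for an Ising model on an m x m grid: p(x) proportional to exp(beta * sum_{i~j} x_i x_j).
set.seed(11)
m <- 32; beta <- 0.6
x <- matrix(sample(c(-1, 1), m * m, replace = TRUE), m, m)
neighbors.sum <- function(x, i, j) {
  s <- 0
  if (i > 1) s <- s + x[i - 1, j]
  if (i < m) s <- s + x[i + 1, j]
  if (j > 1) s <- s + x[i, j - 1]
  if (j < m) s <- s + x[i, j + 1]
  s
}
for (sweep in 1:200) {
  for (i in 1:m) for (j in 1:m) {
    s <- neighbors.sum(x, i, j)
    p <- 1 / (1 + exp(-2 * beta * s))      # P(x_ij = +1 | neighbors)
    x[i, j] <- ifelse(runif(1) < p, 1, -1)
  }
}
image(x)                                    # clumpy patterns appear for larger beta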

47.2 Fitting Undirected Graphs


We will consider these cases:

           Continuous                       Discrete
Small      SIN (Drton-Perlman)              Loglinear model
Large      Lasso (Meinshausen-Bühlmann)     Loglinear Lasso

References:

Drton, M. and Perlman, M. (2004). A SINful Approach to Model Selection for Gaussian Concentration Graphs. Biometrika.
Meinshausen, N. and Bühlmann, P. (2005). High Dimensional Graphs and Variable Selection. The Annals of Statistics.

Small Graphs, Continuous Variables. Let X = (X_1, ..., X_d)^T and suppose that X has a N_d(μ, Σ) distribution. Thus,

f(x) = (2π)^{−d/2} |Σ|^{−1/2} exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) ).    (271)

Let σ_ij denote the (i, j) element of Σ and let ω^{ij} denote the (i, j) element of Σ^{−1}. Let i, j ∈ V and let R_ij = V \ {i, j}. Define the partial correlation

ρ_ij = [ E(X_i X_j | R_ij) − E(X_i | R_ij) E(X_j | R_ij) ] / √( V(X_i | R_ij) V(X_j | R_ij) ).    (272)

RESULT. A formula for the partial correlation is

ρ_ij = −ω^{ij} / √( ω^{ii} ω^{jj} ).    (273)

Furthermore,

ρ_ij = 0  ⟺  ω^{ij} = 0  ⟺  X_i ⊥ X_j | X_{R_ij}.    (274)

Now let V = {1, ..., d} where vertex j ∈ V corresponds to X_j. Let G = (V, E) be a graph where E = {e_ij} denotes the edge set of the graph G: e_ij = 1 if vertices i and j are connected and e_ij = 0 if vertices i and j are not connected. Define

M(G) = { N(μ, Σ) : e_ij = 0 ⟹ ρ_ij = 0 }.    (275)

A distribution P ∈ M(G) that satisfies ρ_ij = 0 but e_ij = 1 is called unfaithful. Let

F(G) = { N(μ, Σ) ∈ M(G) : e_ij = 0 ⟺ ρ_ij = 0 }.    (276)

Then F(G) ⊂ M(G) are the faithful distributions.
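A quick R sketch of (272)-(274): compute the matrix of partial correlations from the inverse covariance and confirm that a zero pattern in the precision matrix corresponds to a near-zero partial correlation (simulated Gaussian data; all names are illustrative placeholders).

# Partial correlations from the precision matrix Omega = Sigma^{-1}.
library(MASS)
set.seed(12)
Omega <- matrix(c(2,   0.8, 0,
                  0.8, 2,   0.8,
                  0,   0.8, 2), 3, 3)        # X1 and X3 not connected given X2
Sigma <- solve(Omega)
X <- mvrnorm(5000, mu = rep(0, 3), Sigma = Sigma)
W <- solve(cov(X))                           # estimated precision matrix
partial.cor <- -W / sqrt(outer(diag(W), diag(W)))
diag(partial.cor) <- 1
round(partial.cor, 2)                        # the (1,3) entry should be near 0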

Figure 80: An unfaithful distribution: ρ(X_2, X_3) = 0 even though b ≠ 0.

47.9 Example (Unfaithfulness). Let ε_1, ε_2, ε_3 ~ N(0, 1) be independent. Define

X_1 = ε_1    (277)
X_2 = aX_1 + ε_2    (278)
X_3 = bX_2 + cX_1 + ε_3    (279)

where a, b, c are nonzero. See Figure 80. Now suppose that

c = −b(a^2 + 1)/a.    (280)

Then,

Cov(X_2, X_3) = E(X_2 X_3) − E(X_2)E(X_3) = E(X_2 X_3)
  = E( (aX_1 + ε_2)(bX_2 + cX_1 + ε_3) )
  = E( (aε_1 + ε_2)(b(aε_1 + ε_2) + cε_1 + ε_3) )
  = (a^2 b + ac)E(ε_1^2) + bE(ε_2^2) = a^2 b + ac + b = 0.    (281)-(286)

Thus, X_2 ⊥ X_3. We would like to drop the edge between X_2 and X_3. But this would imply that X_2 ⊥ X_3 | X_1, which is not true. The unfaithful distributions are a nonlinear subspace of M(G); see Figure 81.

Given n random vectors X (1) , . . . , X (n) N (, ) dene the sample covariance matrix S= The sample partial correlation is sij rij = sii sjj where {sij } are the elements of S 1 . Next, dene z ij = 1 log 2 169 1 + rij 1 rij . (289) (288) 1 n
n i=1

(X (i) X)(X (i) X)T .

(287)

M (G)

unfaithful distributions

Figure 81: The unfaithful distrubutions are a nonlinear subspace of M (G). Then7 z ij N where ij = and m = n d 1. Also, Now dene 1 log 2 ij ,

1 m

(290)

1 + ij 1 ij ij = 0.

(291)

ij = 0

(292) (293)

b Iij = z ij m

where b = z/(2k) and k = d(d 1)/2. Then, P( ij Iij , for all 1 i < j d) 1 = 1 (294) (295) (296) (297) (298) = 1 P( ij Iij , for all i, h) /
i,j

P( ij Iij ) / d(d 1)

i,j

= 1 .

Hence, {Iij } are simultaneous condence intervals for { ij }. Thus, if 0 Iij then the data are compatible with H0 : ij = 0 and we set eij = 0. That is, eij = Let E0 = {ij : eij = 0}. Then It then follows that P(G G) 1
7 More

0 if |z ij | m1/2 b 1 if |z ij | > m1/2 b.

(299)

P(eij = 0 for all ij E0 ) 1 .

(300) (301)

precisely,

d n(z ij ij ) N (0, 1).

170

where Ĝ is the estimated graph. If the distribution is faithful, we further have that

lim inf_{n→∞} P( Ĝ = G ) ≥ 1 − α.    (302)

To see this, let E_1 = {ij : e_ij = 1} and let δ = min{ |ζ_ij| : ij ∈ E_1 }; faithfulness implies that δ > 0. Using the fact that √m(z_ij − ζ_ij) is approximately N(0, 1),

P( ê_ij = 1 for all ij ∈ E_1 )
  = P( |z_ij| > m^{−1/2} b for all ij ∈ E_1 )
  = 1 − P( |z_ij| ≤ m^{−1/2} b for some ij ∈ E_1 )
  ≥ 1 − Σ_{ij∈E_1} P( |z_ij| ≤ m^{−1/2} b )
  = 1 − Σ_{ij∈E_1} P( |√m(z_ij − ζ_ij) + √m ζ_ij| ≤ b )
  ≥ 1 − |E_1| P( |√m δ + N(0, 1)| ≤ b )
  → 1,    (303)-(310)

since √m δ → ∞ while b stays fixed. Combining this with (301) gives (302).

47.10 Example. We can do this in R using the SIN package.

library(SIN)
data(fowlbones); attach(fowlbones)
out = sinUG(fowlbones$corr, fowlbones$n)
plotUGpvalues(out)
E = getgraph(out, .05); print(E)
               skull length skull breadth humerus ulna femur tibia
skull length              0             1       0    0     0     0
skull breadth             1             0       1    0     0     0
humerus                   0             1       0    1     1     0
ulna                      0             0       1    0     0     1
femur                     0             0       1    0     0     1
tibia                     0             0       0    1     1     0

The output for this and for another example (HIV data) are shown in Figures 82 and 83.
Figure 82: Fowlbones.

Figure 83: HIV.
56

Large Graphs, Continuous Variables. Estimating Σ^{−1} is not feasible if d is large. In fact, S will not be invertible if d > n. Instead, we take a different approach. Define the neighborhood N(a) of node a to be the smallest subset of V − {a} such that X_a ⊥ rest | X_{N(a)}, where X_{N(a)} = (X_b : b ∈ N(a)) and rest denotes all the other variables besides X_a and X_{N(a)}.

Consider predicting X_a from all the other variables. The best linear predictor corresponds to choosing a vector β^a ∈ R^d to minimize

E( X_a − Σ_b β^a_b X_b )²   (311)

subject to β^a_a = 0. It can be shown that Ω_ab = 0 implies that β^a_b = 0. Hence,

N(a) = { b : β^a_b ≠ 0 }.   (312)

Therefore, finding the neighborhood N(a) corresponds to doing variable selection when regressing X_a on the other variables. Let β̂^{a,λ} be the lasso estimator that minimizes

(1/n) Σ_i (X_{ia} − X_{(a),i} β)² + λ Σ_b |β_b|   (313)

where X_{(a)} is the design matrix omitting X_a. Define

N̂_λ(a) = { b : β̂^{a,λ}_b ≠ 0 }.   (314)

Cross-validation does not lead to a good choice of λ in this case. We have the following result: if d = n^γ and we have sparsity:

max_a |N(a)| ≤ n^κ   (315)

for some 0 ≤ κ < 1, and if λ_n is chosen at an appropriate rate, then, for all a,

P(N̂(a) = N(a)) → 1, as n → ∞.   (316)

In practice, we can use

λ = (2 σ̂_a / √n) Φ^{−1}(1 − α/(2d²)).   (317)

In this case, we have the following property. Let C(a) denote all nodes connected to a by some path. Then,

P(Ĉ(a) ⊂ C(a) for all a) ≥ 1 − α.   (318)

In practice, I use a two-stage lasso: stage 1 uses Cp to choose a set of variables; at stage 2 I do least squares on these variables and test them at level α/p² and retain the significant ones. Finally, to estimate the graph we take

Ê = { (a, b) : a ∈ N̂(b) and b ∈ N̂(a) }.   (319)

#### Example
n = 50
p = 60
X = matrix(rnorm(n*p),n,p)
for(i in 1:(p/2)){
  X[,2*i] = 5*X[,2*i-1] + rnorm(n,0,.1)
}

Figure 84: A Big Graph.

See Figure 84.

Small Graphs, Discrete Variables. There are many tools for this case. We shall use log-linear models. Let X = (X1, ..., Xd) be a discrete random vector with probability function f(x) = P(X = x) = P(X1 = x1, ..., Xd = xd) where x = (x1, ..., xd). Let rj be the number of values that Xj takes. Without loss of generality, we can assume that Xj ∈ {0, 1, ..., rj − 1}. Suppose now that we have n such random vectors. We can think of the data as a sample from a Multinomial with N = r1 × r2 × ... × rd categories. The data can be represented as counts in an r1 × r2 × ... × rd table. Let p = (p1, ..., pN) denote the multinomial parameter. Let V = {1, ..., d}. Given a vector x = (x1, ..., xd) and a subset A ⊂ V, let xA = (xj : j ∈ A). For example, if A = {1, 3} then xA = (x1, x3). Recall that we can write f in terms of its maximal cliques:

f(x) ∝ ∏_A ψ_A(xA).   (320)

The log-linear representation is obtained by taking the logarithm.

47.11 Theorem. The joint probability function f(x) of a single random vector X = (X1, ..., Xd) can be written as

log f(x) = Σ_{A⊂V} ψ_A(x)   (321)

where the sum is over all subsets A of V = {1, ..., d} and the ψ's satisfy the following conditions:
1. ψ_∅(x) is a constant;
2. For every A ⊂ V, ψ_A(x) is only a function of xA and not the rest of the xj's.
3. If i ∈ A and xi = 0, then ψ_A(x) = 0.

Figure 85: Graph for Example 47.14 (vertices X1, X2, X3, X4, X5).

Each ψ_A(x) may depend on some unknown parameters β_A. Let β = (β_A : A ⊂ V) be the set of all these parameters. We will write f(x) = f(x; β) when we want to emphasize the dependence on the unknown parameters β. This is easiest to understand if all the Xi's are binary. In that case

log f(x) = β_0 + Σ_j β_j x_j + Σ_{j<k} β_{jk} x_j x_k + Σ_{j<k<l} β_{jkl} x_j x_k x_l + ...   (322)

so that

ψ_A(x) = β_A ∏_{i∈A} x_i.   (323)

The next theorem gives an easy way to check for conditional independence in a log-linear model.

47.12 Theorem. Let (Xa, Xb, Xc) be a partition of the vector (X1, ..., Xd). Then Xb ⊥ Xc | Xa if and only if all the ψ-terms in the log-linear expansion that have at least one coordinate in b and one coordinate in c are 0.

47.13 Example. In the model

log f(x) = β_0 + β_1 x1 + β_2 x2 + β_{12} x1 x2   (324)

we have that X1 ⊥ X2 if and only if β_{12} = 0. In the model

log f(x) = β_0 + β_1 x1 + β_2 x2 + β_3 x3 + β_{12} x1 x2 + β_{13} x1 x3   (325)

we have that X2 ⊥ X3 | X1 since x2 and x3 never appear together, that is, β_{23} = β_{123} = 0.   (326)

Let log f(x) = Σ_{A∈S} ψ_A(x) be a log-linear model. Then f is graphical if all ψ-terms are nonzero except for any pair of coordinates not in the edge set for some graph G. In other words, ψ_A(x) = 0 if and only if {i, j} ⊂ A and (i, j) is not an edge. Here is a way to think about the definition above: if you can add a term to the model and the graph does not change, then the model is not graphical.

47.14 Example. Consider the graph in Figure 85. The graphical log-linear model that corresponds to this graph is

log f(x) = ψ_∅(x) + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_4(x) + ψ_5(x) + ψ_{12}(x) + ψ_{23}(x) + ψ_{25}(x) + ψ_{34}(x) + ψ_{35}(x) + ψ_{45}(x) + ψ_{235}(x) + ψ_{345}(x).

Figure 86: Graph for Example 47.16 (vertices X1, X2, X3).

Let's see why this model is graphical. The edge (1, 5) is missing in the graph. Hence any term containing that pair of indices is omitted from the model. For example, ψ_{15}, ψ_{125}, ψ_{135}, ψ_{145}, ψ_{1235}, ψ_{1245}, ψ_{1345}, ψ_{12345} are all omitted. Similarly, the edge (2, 4) is missing and hence ψ_{24}, ψ_{124}, ψ_{234}, ψ_{245}, ψ_{1234}, ψ_{1245}, ψ_{2345}, ψ_{12345} are all omitted. There are other missing edges as well. You can check that the model omits all the corresponding terms.

Now consider the model

log f(x) = ψ_∅(x) + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_4(x) + ψ_5(x) + ψ_{12}(x) + ψ_{23}(x) + ψ_{25}(x) + ψ_{34}(x) + ψ_{35}(x) + ψ_{45}(x).

This is the same model except that the three-way interactions were removed. If we draw a graph for this model, we will get the same graph. For example, no terms contain (1, 5) so we omit the edge between X1 and X5. But this is not graphical since it has extra terms omitted. The independencies and graphs for the two models are the same, but the latter model has other constraints besides conditional independence constraints. This is not a bad thing. It just means that if we are only concerned about presence or absence of conditional independences, then we need not consider such a model. The presence of the three-way interaction ψ_{235} means that the strength of association between X2 and X3 varies as a function of X5. Its absence indicates that this is not so.

There is a set of log-linear models that is larger than the set of graphical models and that is used quite a bit. These are the hierarchical log-linear models. A log-linear model is hierarchical if ψ_A = 0 and A ⊂ B implies that ψ_B = 0.

47.15 Lemma. A graphical model is hierarchical but the reverse need not be true.

47.16 Example. Let log f(x) = ψ_∅(x) + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_{12}(x) + ψ_{13}(x). The model is hierarchical; its graph is given in Figure 86. The model is graphical because all terms involving (2, 3) are omitted. It is also hierarchical.

47.17 Example. Let log f(x) = ψ_∅(x) + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_{12}(x) + ψ_{13}(x) + ψ_{23}(x). The model is hierarchical. It is not graphical. The graph corresponding to this model is complete; see Figure 87. It is not graphical because ψ_{123}(x) = 0, which does not correspond to any pairwise conditional independence.

Figure 87: The graph is complete. The model is hierarchical but not graphical.

Figure 88: The model for this graph is not hierarchical.

47.18 Example. Let log f(x) = ψ_∅(x) + ψ_3(x) + ψ_{12}(x). The corresponding graph is in Figure 88. This model is not hierarchical since ψ_2 = 0 but ψ_{12} is not. Since it is not hierarchical, it is not graphical either.

A summary of the models is in Figure 89.

Hierarchical models can be written succinctly using generators. This is most easily explained by example. Suppose that X = (X1, X2, X3). Then M = 1.2 + 1.3 stands for

log f = ψ_∅ + ψ_1 + ψ_2 + ψ_3 + ψ_{12} + ψ_{13}.

The formula M = 1.2 + 1.3 says: include ψ_{12} and ψ_{13}. We have to also include the lower order terms or it won't be hierarchical. The generator M = 1.2.3 is the saturated model

log f = ψ_∅ + ψ_1 + ψ_2 + ψ_3 + ψ_{12} + ψ_{13} + ψ_{23} + ψ_{123}.

The saturated model corresponds to fitting an unconstrained multinomial. Consider M = 1 + 2 + 3, which means log f = ψ_∅ + ψ_1 + ψ_2 + ψ_3. This is the mutual independence model. Finally, consider M = 1.2 which has log-linear expansion log f = ψ_∅ + ψ_1 + ψ_2 + ψ_{12}. This model makes X3 | X2 = x2, X1 = x1 a uniform distribution.

Let β denote all the parameters in a log-linear model M. The log-likelihood for β is

ℓ(β) = Σ_{i=1}^n log f(X^i; β)

Figure 89: Log-linear models, hierarchical models, and graphical models (graphical models are a subset of hierarchical models, which are a subset of log-linear models).

where f(X^i; β) is the probability function for the ith random vector X^i = (X^i_1, ..., X^i_d) as given by equation (321). The MLE β̂ generally has to be found numerically. The Fisher information matrix is also found numerically and we can then get the estimated standard errors from the inverse Fisher information matrix.

When fitting log-linear models, one has to address the following model selection problem: which terms should we include in the model? This is essentially the same as the model selection problem in linear regression. One approach is to use AIC or BIC. Let M denote some log-linear model. Different models correspond to setting different terms to 0. Now we choose the model M which maximizes

AIC(M) = ℓ(M) − |M|   or   BIC(M) = ℓ(M) − (|M|/2) log n   (327)

where |M| is the number of parameters in model M and ℓ(M) is the value of the log-likelihood evaluated at the MLE for that model. Usually the model search is restricted to hierarchical models. This reduces the search space. After finding a best model this way we can draw the corresponding graph.

A different approach is based on hypothesis testing. The model that includes all possible ψ-terms is called the saturated model and we denote it by Msat. Now for each M we test the hypothesis H0: the true model is M versus H1: the true model is Msat. The likelihood ratio test for this hypothesis is called the deviance. For any submodel M, define the deviance dev(M) by

dev(M) = 2(ℓ_sat − ℓ_M)

where ℓ_sat is the log-likelihood of the saturated model evaluated at its MLE and ℓ_M is the log-likelihood of the model M evaluated at its MLE.

47.19 Theorem. The deviance is the likelihood ratio test statistic for H0: the model is M versus H1: the model is Msat. Under H0, dev(M) ⇝ χ²_ν with degrees of freedom ν equal to the difference in the number of parameters between the saturated model and M.

One way to find a good model is to use the deviance to test every sub-model. Every model that is not rejected by this test is then considered a plausible model. However, this is not a good strategy for two reasons. First, we will end up doing many tests, which means that there is ample opportunity for making Type I and Type II errors. Second, we will end up using models where we failed to reject H0. But we might fail to reject H0 due to low power. The result is that we end up with a bad model just due to low power.

47.20 Example. Let's create a model of the form

log f(x) = β_{12} x1 x2 + β_{23} x2 x3 + β_{34} x3 x4 + β_{45} x4 x5.   (328)

x1 = c(rep(0,16),rep(1,16))
x2 = c(rep(0,8),rep(1,8),rep(0,8),rep(1,8))
x3 = c(rep(0,4),rep(1,4),rep(0,4),rep(1,4),rep(0,4),rep(1,4),rep(0,4),rep(1,4))
x4 = rep(c(0,0,1,1),8)
x5 = rep(c(0,1),16)
beta = 1
f = beta*x1*x2 + beta*x2*x3 + beta*x3*x4 + beta*x4*x5
f = exp(f)
f = f/sum(f)
n = 1000
y = sample(1:32,size=n,prob=f,replace=TRUE)
X1 = X2 = X3 = X4 = X5 = NULL
for(i in 1:n){
  X1 = c(X1,x1[y[i]])
  X2 = c(X2,x2[y[i]])
  X3 = c(X3,x3[y[i]])
  X4 = c(X4,x4[y[i]])
  X5 = c(X5,x5[y[i]])
}

out = glm(table(y) ~ x1*x2*x3*x4*x5, family=poisson)
summary(out)
tmp = step(out, k=log(n))
summary(tmp)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.79711    0.18430   9.751  < 2e-16 ***
x1           -0.05609    0.13677  -0.410    0.682
x2           -0.07969    0.17486  -0.456    0.649
x3           -0.03039    0.19359  -0.157    0.875
x4            0.09482    0.17731   0.535    0.593
x5            0.06062    0.14220   0.426    0.670
x1:x2         1.05860    0.15870   6.671 2.55e-11 ***
x2:x3         0.95163    0.17633   5.397 6.78e-08 ***
x3:x4         1.02836    0.17916   5.740 9.47e-09 ***
x4:x5         0.84435    0.16218   5.206 1.93e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Nailed it!
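To connect this with the deviance defined earlier: for a Poisson log-linear fit, the residual deviance reported by glm is dev(M), so a quick check of the selected model against the saturated model might look like the following sketch (it reuses the object tmp from the code above).

# Deviance test of the selected model M against the saturated model.
dev.M <- tmp$deviance            # 2*(loglik_sat - loglik_M)
df.M  <- tmp$df.residual         # difference in number of parameters
pval  <- pchisq(dev.M, df = df.M, lower.tail = FALSE)
pval                             # a large p-value means M is compatible with the data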


Figure 90: A directed graph with vertices V = {X, Y, Z} and edges E = {(Y, X), (Y, Z)}.

Figure 91: DAG for Example 48.1 (vertices: overweight, smoking, heart disease, cough).

48 Directed Graphs
A directed graph consists of a set of nodes with arrows between some nodes. An example is shown in Figure 90. Formally, a directed graph G consists of a set of vertices V and an edge set E of ordered pairs of vertices. For our purposes, each vertex will correspond to a random variable. If (X, Y) ∈ E then there is an arrow pointing from X to Y. See Figure 90. If an arrow connects two variables X and Y (in either direction) we say that X and Y are adjacent. If there is an arrow from X to Y then X is a parent of Y and Y is a child of X. The set of all parents of X is denoted by π_X or π(X). A directed path between two variables is a set of arrows all pointing in the same direction linking one variable to the other, such as X → Y → Z. A sequence of adjacent vertices starting with X and ending with Y but ignoring the direction of the arrows is called an undirected path. The sequence {X, Y, Z} in Figure 90 is an undirected path. X is an ancestor of Y if there is a directed path from X to Y (or X = Y). We also say that Y is a descendant of X. A configuration of the form:

X → Y ← Z

is called a collider at Y. A configuration not of that form is called a non-collider, for example,

X → Y → Z   or   X ← Y → Z.

Figure 92: Another DAG.

The collider property is path dependent. In Figure 96, Y is a collider on the path {X, Y, Z} but it is a non-collider on the path {X, Y, W }. When the variables pointing into the collider are not adjacent, we say that the collider is unshielded. A directed path that starts and ends at the same variable is called a cycle. A directed graph is acyclic if it has no cycles. In this case we say that the graph is a directed acyclic graph or DAG. From now on, we only deal with acyclic graphs. Let G be a DAG with vertices V = (X1 , . . . , Xk ). If P is a distribution for V with probability function f , we say that P is Markov to G, or that G represents P, if
f(v) = ∏_{i=1}^k f(Xi | πi)   (329)
where πi are the parents of Xi. The set of distributions represented by G is denoted by M(G).

48.1 Example. Figure 91 shows a DAG with four variables. The probability function for this example factors as

f(overweight, smoking, heart disease, cough) = f(overweight) f(smoking) f(heart disease | overweight, smoking) f(cough | smoking).

48.2 Example. For the DAG in Figure 92, P ∈ M(G) if and only if its probability function f has the form f(x, y, z, w) = f(x) f(y) f(z | x, y) f(w | z).

Figure 93: Yet another DAG (vertices A, B, C, D, E).

The following theorem says that P ∈ M(G) if and only if the Markov Condition holds. Roughly speaking, the Markov Condition means that every variable W is independent of the past given its parents.

48.3 Theorem. A distribution P ∈ M(G) if and only if the following Markov Condition holds: for every variable W,

W ⊥ W̃ | πW   (330)

where W̃ denotes all the other variables except the parents and descendants of W.

48.4 Example. In Figure 92, the Markov Condition implies that X ⊥ Y and W ⊥ {X, Y} | Z.

48.5 Example. Consider the DAG in Figure 93. In this case the probability function must factor like

f(a, b, c, d, e) = f(a) f(b|a) f(c|a) f(d|b, c) f(e|d).

The Markov Condition implies the following independence relations:

D ⊥ A | {B, C},   E ⊥ {A, B, C} | D   and   B ⊥ C | A.

The Markov Condition allows us to list some independence relations implied by a DAG. These relations might imply other independence relations. Consider the DAG in Figure 94. The Markov Condition implies:

X1 ⊥ X2,   X2 ⊥ {X1, X4},   X3 ⊥ X4 | {X1, X2},   X4 ⊥ {X2, X3} | X1,   X5 ⊥ {X1, X2} | {X3, X4}.

It turns out (but it is not obvious) that these conditions imply that

{X4, X5} ⊥ X2 | {X1, X3}.

How do we find these extra independence relations? The answer is d-separation, which means directed separation. d-separation can be summarized by three rules. Consider the four DAGs in Figure 95 and the DAG in Figure 96. The first three DAGs in Figure 95 have no colliders. The DAG in the lower right of Figure 95 has a collider. The DAG in Figure 96 has a collider with a descendant.

Figure 94: And yet another DAG (vertices X1, X2, X3, X4, X5).

Figure 95: The first three DAGs have no colliders. The fourth DAG in the lower right corner has a collider at Y.

Figure 96: A collider with a descendant.


Figure 97: d-separation explained.

The Rules of d-Separation

Consider the DAGs in Figures 95 and 96.

1. When Y is not a collider, X and Z are d-connected, but they are d-separated given Y.
2. If X and Z collide at Y, then X and Z are d-separated, but they are d-connected given Y.
3. Conditioning on the descendant of a collider has the same effect as conditioning on the collider. Thus in Figure 96, X and Z are d-separated but they are d-connected given W.

Here is a more formal definition of d-separation. Let X and Y be distinct vertices and let W be a set of vertices not containing X or Y. Then X and Y are d-separated given W if there exists no undirected path U between X and Y such that (i) every collider on U has a descendant in W, and (ii) no other vertex on U is in W. If A, B, and W are distinct sets of vertices and A and B are not empty, then A and B are d-separated given W if for every X ∈ A and Y ∈ B, X and Y are d-separated given W. Sets of vertices that are not d-separated are said to be d-connected.

48.6 Example. Consider the DAG in Figure 97. From the d-separation rules we conclude that: X and Y are d-separated (given the empty set); X and Y are d-connected given {S1, S2}; X and Y are d-separated given {S1, S2, V}.

48.7 Theorem.^8 Let A, B, and C be disjoint sets of vertices. Then A ⊥ B | C if and only if A and B are d-separated by C.

48.8 Example. The fact that conditioning on a collider creates dependence might not seem intuitive. Here is a whimsical example from Jordan (2003) that makes this idea more palatable. Your friend appears to be late for a meeting with you. There are two explanations: she was abducted by aliens or you forgot to set your watch ahead one hour for daylight savings time. (See Figure 98.) Aliens and Watch are blocked by a collider, which implies they are marginally independent. This seems reasonable since before we know anything about your friend being late we would expect these variables to be independent. We would also expect that P(Aliens = yes | Late = yes) > P(Aliens = yes); learning that your friend is late certainly increases the probability that she was abducted. But when we learn that you forgot to set your watch properly, we would lower the chance that your friend was abducted. Hence, P(Aliens = yes | Late = yes) ≠ P(Aliens = yes | Late = yes, Watch = no). Thus, Aliens and Watch are dependent given Late.

48.9 Example. Consider the DAG in Figure 91. In this example, overweight and smoking are marginally independent but they are dependent given heart disease.

Graphs that look different may actually imply the same independence relations. If G is a DAG, we let I(G) denote all the independence statements implied by G. Two DAGs G1 and G2 for the same variables V are Markov equivalent if I(G1) = I(G2). Given a DAG G, let skeleton(G) denote the undirected graph obtained by replacing the arrows with undirected edges.

^8 We implicitly assume that P is faithful to G, which means that P has no extra independence relations other than those logically implied by the Markov Condition.
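A small simulation makes the collider effect concrete. The setup below (two independent causes and a common effect, with an arbitrary noise level) is illustrative and is not taken from the text.

# X and Z are independent causes of the collider Y.
set.seed(1)
n <- 5000
X <- rnorm(n)
Z <- rnorm(n)
Y <- X + Z + rnorm(n, 0, .5)
cor(X, Z)                      # approximately 0: X and Z are marginally independent
coef(lm(Z ~ X))["X"]           # approximately 0
coef(lm(Z ~ X + Y))["X"]       # clearly nonzero: conditioning on the collider Y
                               # creates dependence between X and Z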

Figure 98: Jordan's alien example (Example 48.8). Was your friend kidnapped by aliens or did you forget to set your watch?

48.10 Theorem. Two DAGs G1 and G2 are Markov equivalent if and only if (i) skeleton(G1) = skeleton(G2) and (ii) G1 and G2 have the same unshielded colliders.

48.11 Example. The first three DAGs in Figure 95 are Markov equivalent. The DAG in the lower right of the figure is not Markov equivalent to the others.


49 Estimation for DAGs


Estimating a DAG structure is harder than estimating an undirected graph. For sparse graphs, the PC algorithm due to Spirtes, Glymour and Scheines is the fastest algorithm I know of. Here, we will consider the simpler case where there is a known ordering ≺ on the variables. An example is time order. Without loss of generality, assume that V = {1, ..., d} has been ordered according to ≺, so that i ≺ j if and only if i < j. For continuous, Gaussian variables and small graphs we can use SIN. For every i < j we test

H0: θij = 0 versus H1: θij ≠ 0   (331)

where θij denotes the partial correlation of Xi and Xj given {1, ..., j} − {i, j}.

49.1 Example. This example, from Spirtes et al (2000), involves data on publishing productivity. The variables are: sex, ability, GPQ (graduate program quality), preprod (preliminary measure of productivity), QFJ (quality of first job), pubs (publication rates), and cites (citation rates). These are essentially time ordered.

> library(SIN)
> postscript("dagsin.ps",horizontal=FALSE)
> data(pubprod)
> attach(pubprod)
> m = pubprod$cor
> print(dimnames(m)[[1]])
[1] "ability" "GPQ"     "preprod" "QFJ"     "sex"     "cites"   "pubs"
> n = pubprod$n
> o = c(5,1,2,3,4,7,6)
> ### sex < ability < GPQ < pre < QFJ < pubs < cites
> m = m[o,o]
> p = sinDAG(1:7,m,n)
> plotDAGpvalues(p)

> G = getgraph(p,.05,type="DAG")
> print(G)
        sex ability GPQ preprod QFJ pubs cites
sex       0       0   0       0   0    1     0
ability   0       0   1       1   0    0     0
GPQ       0       0   0       0   1    0     0
preprod   0       0   0       0   0    0     1
QFJ       0       0   0       0   0    1     0
pubs      0       0   0       0   0    0     1
cites     0       0   0       0   0    0     0

The p-values are plotted in Figure 99. An alternative is simply to do the following: regress X2 on X1; regress X3 on X1, X2; regress X4 on X1, X2, X3; and so on. One could test for significant effects or, when d is large, use the lasso.
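Here is a minimal sketch of this alternative in R. The simulated data, the true edges, and the cutoff alpha are illustrative assumptions, and plain least squares with t-tests stands in for whatever testing or lasso step one prefers.

# Estimate a DAG for time-ordered variables X1,...,Xd by regressing each
# variable on its predecessors and keeping the significant coefficients.
set.seed(1)
n <- 200; d <- 5
X <- matrix(rnorm(n * d), n, d)
X[, 3] <- X[, 1] + rnorm(n, 0, .5)          # a couple of true edges, for illustration
X[, 5] <- X[, 3] + X[, 4] + rnorm(n, 0, .5)
alpha <- 0.05
G <- matrix(0, d, d)                         # G[i,j] = 1 means an edge i -> j
for (j in 2:d) {
  fit <- summary(lm(X[, j] ~ X[, 1:(j - 1), drop = FALSE]))
  pv <- fit$coefficients[-1, 4]              # p-values, intercept dropped
  G[which(pv < alpha), j] <- 1
}
print(G)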

Figure 99: p-values for the DAG example (one p-value for each potential edge among sex, ability, GPQ, preprod, QFJ, pubs, and cites).

Homework 1

1. Download the R handouts from the web site and practice R. Do not turn this question in.

2. Show that R(r) ≤ R(g) as claimed in page 1 of the notes.

3. Problem 2.1. Use R for all calculations.

4. Problem 2.2.

5. Problem 2.3.

6. Problem 2.10.

7. A simulation experiment. In R, type:

par(mfrow=c(1,1),bg="pink",fg="blue")
x = c(-3,13)
y = 5 + 10 * x
plot(x,y,type="l",cex.lab=2)
lines(x,y,lwd=3,col="black")
sd = 10
for(i in 1:50){
  x = runif(100, 0, 10)
  y = 5 + 10*x + rnorm(100, 0, sd)
  abline(lm(y~x), lty=2, lwd=0.5, col=3)
}

(a) Comment every line of this little program (say what each line is doing). (b) Record your conclusions.

8. Association Versus Causation. Suppose that we measure lung capacity Y on individuals in a population. We are interested in whether exposure to pollution X reduces lung capacity Y. Suppose, for the sake of this question, that there is no effect. Let the lung capacity of person i be αi. Thus, each person has a different average lung capacity αi but αi does not depend on X. In terms of counterfactuals, yi(x) = αi. Show that the causal regression c(x) does not depend on x. In particular, the slope of c(x) is 0. Now we observe n people and we get the data (X1, Y1), ..., (Xn, Yn) where Yi = yi(Xi) + εi and εi ∼ N(0, 1). Suppose that Xi = 25 − αi. Assume that n = 25 and that αi = i. Produce a plot showing yi(x) for i = 1, ..., 25. Generate data from the model. Plot the data and fit a regression model. Explain why you get β̂1 ≈ −1 even though the slope of c(x) is zero.

Note: All the data sets from Weisberg's book are available on the course web site.

Homework 2

(1) Suppose we fit the no-intercept model Yi = βXi + εi. (a) Find the least squares estimator of β. (b) Find an expression for the hat matrix. (c) Describe the column space. (d) Find the mean and variance of β̂ assuming the model is correct. (e) Find the mean and variance of β̂ assuming the model is not correct.

(2) If θ̂ is an estimate of a parameter θ, show that E(θ̂ − θ)² = bias² + variance.

(3) Let Y ∼ N(θ, 1). Let θ̂ = aY where a is a constant. (a) Find the bias, variance and mean-squared error of θ̂. (b) What value of a minimizes the mean squared error? (c) Suppose we predict a new observation Y* ∼ N(θ, 1) and we use Ŷ* = θ̂ as the prediction. Find the prediction error.

(4) Problem 3.1 parts 1 and 3.

(5) Problem 3.5 parts 1 and 2.

(6) 4.3.

(7) 4.5. (see page 89).

(8) 4.8.

(9) 5.1.

(10) 5.3. You need to use the bootstrap, which is fully described in the question. You also need to perform a nonparametric fit (or smoother) which we haven't yet discussed. However, all you need to do is:


fitted(loess(y ~ x)). Thus, the quantity G is obtained this way:

linfit = lm(y ~ x)
sigma = summary(linfit)$sigma
nonpar = loess(y ~ x)
G = sum((fitted(linfit) - fitted(nonpar))^2)/sigma^2


Homework 3

(1) In class we said that the minimizer of Q(θ) = (Y − θ)² + λ|θ| is given by the soft thresholding estimator θ̂ = sign(Y)(|Y| − λ/2)+ where

(a)+ = a if a > 0, and 0 if a ≤ 0.

Show that this is true. We'll do this in steps. We could try taking the derivative of Q with respect to θ and setting it equal to 0. But Q(θ) is not differentiable.

(a) First, show that Q is convex in θ. That is, show that Q(tθ1 + (1 − t)θ2) ≤ tQ(θ1) + (1 − t)Q(θ2). Since Q is convex, for any θ0 there is a line passing through (θ0, Q(θ0)) that never goes above Q. The slope of such a line is called a subderivative. Formally, z is a subderivative of Q at θ0 if

Q(θ) − Q(θ0) ≥ z(θ − θ0).

The set of subderivatives is called the subdifferential.

(b) Show that the subdifferential of Q is ∂Q(θ) = −2(Y − θ) + λ z(θ) where z(θ) = sign(θ) if θ ≠ 0 and z(θ) ∈ [−1, 1] if θ = 0. Note that ∂Q(θ) is a number if θ ≠ 0 but ∂Q(θ) is a set of numbers if θ = 0. A point θ is a minimum of Q if and only if zero is contained in the subdifferential. Now show that θ̂ = sign(Y)(|Y| − λ/2)+.

(2) Problem 7.1.

(3) (a) Let A be an m × n matrix and let B be an n × m matrix. Show that trace(AB) = trace(BA). (b) Let H be the hat matrix. Show that trace(H) = q.

(4) Problem 8.3 (but ignore anything about the score test).

(5) Problem 9.3.

(6) Problem 9.4.


(7) Problem 9.8.

(8) Get the highway accident data highway.txt. Do model selection using: forward stepwise, ridge regression, the lasso. Summarize your results.

(9) Suppose that Yi ∼ N(θi, σ²), i = 1, ..., p. Assume that σ² is known. Define θ̂ = (θ̂1, ..., θ̂p) where θ̂ is obtained by minimizing

Σ_{i=1}^p (Yi − θi)² + λ J(θ).

(a) Find θ̂ for the following three cases:

J(θ) = ||θ||_0 = #{i : θi ≠ 0},   J(θ) = ||θ||_1 = Σ_i |θi|,   J(θ) = ||θ||²_2 = Σ_i θi².

(b) For the case J(θ) = Σ_i θi², find the bias, variance and risk of θ̂ as a function of λ. Find the value of λ that minimizes the risk.

(c) The elastic net estimator minimizes

Σ_{i=1}^p (Yi − θi)² + λ1 Σ_i |θi| + λ2 Σ_i θi².

Find an expression for the estimator.


Homework 4

(1) Use the data for question 12.1. Fit a logistic regression. Use stepwise regression to choose a submodel. Summarize your findings.

(2) Question 12.3.

(3) Question 12.5.

(4) Generate data as follows:

Xi ∼ N(0, 1),   Yi = 3 + 4Xi + εi,   Wi = Xi + Ui

where i = 1, ..., n, n = 1000, εi ∼ N(0, σ²), σ = 2 and Ui ∼ N(0, 1). (a) Regress the Yi's on the Xi's. Plot the data with the regression line. (b) Regress the Yi's on the Wi's. Plot the data with the regression line. How do the two regressions compare? (c) Use the method we discussed in class for correcting the regression in (b). Compare the fitted line to the true regression line and to the fitted line from (a).

(5) Question 11.2. For the third part, just use the bootstrap (as in the example in section 11.3).

(6) Download the forestry data (ufcwc.txt) which is used in section 7.1.2 of the book. (a) Fit a kernel regression (Y = Height and X = Dbh). Show several different fits using different bandwidths and different kernels. (b) Plot the cross-validation score. Find the best bandwidth and plot the corresponding fit. (c) Fit a local linear regression. Use cross-validation (or generalized cross-validation) to find the best bandwidth. Plot the cross-validation score versus bandwidth. Plot the fit based on the best bandwidth. Plot the residuals. (d) Repeat (c) using splines. (e) Repeat (c) using orthogonal functions.


Homework 5

(1) Get the data set fuel2001.txt. The variables are: Drivers (number of licensed drivers), FuelC (fuel consumption in thousands of gallons), Income (per capita income), Miles (miles of highway in state), MPC (miles driven per person), Pop (number of people 16 and older), Tax (tax rate on gas). Ignore the variable State. The goal is to predict fuel consumption from the other variables. Do a complete analysis using all the tools we have learned. This should include: plots, linear fits, nonparametric methods, multivariate nonparametric methods. Provide a brief summary of what you did.

(2) Use the fuel data again. Let Y = log(fuel consumption). Estimate the density of Y using a histogram and using kernel density estimator. Use cross-validation to choose the amount of smoothing.


Homework 6

(1) Suppose that X ∈ R, that X ∼ Uniform(0, 10), and that r(x) = P(Y = 1 | X = x) = .2 for .1 < x < .9 and r(x) = .9 otherwise. (a) Find the Bayes classification rule h*. (b) Find the Bayes risk R* = R(h*). (c) Let H be the set of linear classifiers of the form h(x) = 1 if β0 + β1 x ≥ 0 and h(x) = 0 if β0 + β1 x < 0. What is the smallest risk over all such classifiers in this problem?

(2) Let f0(x) = f(x | Y = 0) and f1(x) = f(x | Y = 1). Show that

R* = 1/2 − (1/4) ∫ |f1(x) − f0(x)| dx.

Interpret this result. Find f1 and f0 in problem 1 and apply the above formula for R* and confirm that you get the same answer as before.

(3) Get the glass fragment data:

library(MASS)
data(fgl)
help(fgl)

The goal is to predict the variable type from the others. Note that type takes 6 different values. (a) First combine the first three classes (different types of window glass) and the last three classes so that type now only has two values. Classify the data (window or not window) using (i) linear regression, (ii) logistic regression, (iii) nearest neighbors, (iv) lda, (v) qda, (vi) trees. Compare the results. (b) Now let type have 6 different levels. To do this with regression methods requires you to be inventive.



Homework 7

(1) Create a simulated dataset as follows. Create a 20 by 20 matrix E. The diagonal elements should be zero. Make each upper-diagonal element a 1 or 0 by generating a random Bernoulli with success probability .05. Now let b = 40 and define Σ^{−1} using this code:

B = b*E
a = apply(B,1,sum)
a = max(a) + 1
a = rep(a,nrow(E))
Sinv = B + diag(a)

Draw the graph. The following code might be helpful for drawing a graph:

draw = function(E){
  par(pty="s")
  n = nrow(E)
  angle = seq(0,2*pi,length=n+1)
  angle = angle[1:n]
  x = cos(angle)
  y = sin(angle)
  plot(x,y,pch=20,lwd=3,xlab="",ylab="",
       xaxt="n",yaxt="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
  for(i in 1:n){
    text(1.2*cos(angle[i]),1.2*sin(angle[i]),paste("X",i),lwd=3,font=2)
  }
  for(i in 1:(n-1)){
    for(j in (i+1):n){
      if(E[i,j] == 1)lines(c(x[i],x[j]),c(y[i],y[j]),lwd=3)
    }
  }
  return(NULL)
}

Now generate 100 random vectors from a N(0, Σ). To do this, use the following fact: if Z ∼ N(0, I) then X = Σ^{1/2} Z ∼ N(0, Σ). To compute the square root of a matrix use:

e = eigen(A)
V = e$vectors
s = V %*% sqrt(diag(e$values)) %*% t(V)

Estimate the graph from your data. Use the SIN method and the lasso method and compare your answers. To assess the variability of the estimator draw 10 bootstrap samples and draw the graphs from these bootstrap samples.


(2) Consider random variables (X1, X2, X3). In each of the following cases, draw a graph with the fewest possible number of edges that has the given independence relations. (a) X1 ⊥ X3 | X2. (b) X1 ⊥ X2 | X3 and X1 ⊥ X3 | X2. (c) X1 ⊥ X2 | X3 and X1 ⊥ X3 | X2 and X2 ⊥ X3 | X1.

(3) Consider random variables (X1, X2, X3, X4). In each of the following cases, draw a graph with the fewest possible number of edges that has the given independence relations. (a) X1 ⊥ X3 | X2, X4 and X1 ⊥ X4 | X2, X3 and X2 ⊥ X4 | X1, X3. (b) X1 ⊥ X2 | X3, X4 and X1 ⊥ X3 | X2, X4 and X2 ⊥ X3 | X1, X4. (c) X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3.

(4) Construct a distribution on three variables that cannot be represented by an undirected graph. Construct a distribution on four variables that cannot be represented by a directed graph.

(5) Get the undirected graph data from the course website. There are 5 binary variables. Fit an undirected graph using loglinear models. Approximate the distribution of the data with a Normal. Now estimate the graph using SIN. (In other words, just use the sample covariance matrix.)

(6) Write down the conditional independencies from Figures 1-4.


Figure 1: (a graph on X1, X2, X3, X4)

Figure 2: (a graph on X1, X2, X3, X4)

Figure 3: (a graph on X1, X2, X3, X4)

Figure 4: (a graph on X1, X2, X3, X4, X5, X6)

Solution 1

Solution 2

Solution 3

Solution 4

Solution 6

Solution 7

Appendix

Clustering
Clustering is really a type of dimension reduction. Clustering methods aim to nd small sets with a high concentration of data. Dimension reduction methods aim to nd low dimensional structures that preserve properties of the full data set. These methods are sometimes called unsupervised learning. Some of these methods take the following form. We try to nd a low dimensional set C belonging to a class of sets C to minimize some quantity such as the projection error E||X C (X)||2 where C is the projector onto C, that is, C (X) is the point in C closest to X. The solution to this problem depends on the class C. For example: C Singletons Sets with k points Lines Smooth curves Sets of k lines Lines in feature space Method The mean k-means clustering Principal components Principal curves k-lines clustering Kernel principal components

Some of the methods can be cast in terms of the probability density function p(x). One assumesexplicitly or implicitly that p(x) can be written as p(x) = p0 (x) + pj (x)
j

where pj is very concentrated around a set Cj and C1 , C2 , . . . are disjoint sets. The sets could be points, lines, linear subspaces, manifolds, curves, blobs, ridges and so on. The rst term p0 (x) represents mass not in any of the sets Cj . The goal of clustering is to nd disjoint sets C1 , . . . , Ck such that each Cj is has a high concentration of data. Figure 6 shows some synthetic examples. The examples are simple but illustrate some of the challenges in clustering. Let us also introduce some real examples. EXAMPLE. 6830 genese on 64 people. Find groups of people. EXAMPLE. Image analysis. 1

q q q qq q q qq q q q q q q qq q q q q q qq q q q q q q q qqq q q q q q q qq q q q q qq q

q qq q qqqq q q qqq q q q q q q qq q q q q qq q qq qq q q q q q q qq q q q q q q q q

Figure 1: Synthetic examples of clusters.

qqqq qqq qq qq q q q q q q q q q qqq q q q q q qq q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq qqqq q q q q q q q q q qq qq qqq qqqq

q q q q qq q q q q q qqqq q q q qqq q q q qq q qq qq q q q q q q q q qq q q q q q qq

q q q qq qq q q q qqq qq qq q qq qq qq q q qqqq qqq q qq qqq q q qq

q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

q qq qqqq qq qq q qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

q q q q q q q q q q q q q q q q q q q q q q q qq q q q qq q q q q qq q q q q qq q q q q qq q qq q qq q qqq q q q q q q qq q qq q q q q q q q q q q qq qq q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q qq q q q qq q q q q q q qq q q q qq q q q q q q q q q q q q q qq q q q q q q q q q qq q q q q q q q q q q q q qqq q q q q q q q q q q q q q

q q q q q q q q q q q q q q q q

q q qq

q q

q q q q q qq q q qq q q q q qq qq q q qq q qqqqq qq q q q q qq qq qq qqq q q q qq qq q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q qq q qq q q q q qq q q q qq q q qq qq qq q qqq q q q q q q q qq q q q qq qq q q qqqq qq q qq qqq q q qqq q qqqqq qqqq q q q q qq qqq q qqqqq qqq q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

q q q q q q q q q q q q q q q qq q q q qq q q q qqqqq q q qq qq qq qq q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q qq q q q q q q q q q q q q q q q qqq q q q qqq q q q q q q q qq q qqqq q qq q q qq q q q q q q qq q q qq q q q q q q qqqq q q qq q q q q q q q q q q q q q q qq q q qq q q q q q q q qq q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q qq q q

0.1

k-means

In k-means clustering we try to nd k points c1 , . . . , ck such that most of the data are tightly concentrated around one of the cj . Let X be a d-dimensional random vector and let Ck denote all sets of the form C = {c1 , . . . , ck } where each cj Rd . Dene the risk R(C) = E||X C (X)||2 = ||x C (x)||2 dP (x)

where C (x) is the projection of X onto C: C (x) = argmin1jk ||x cj ||2 . In other words, C (x) is the point in C closest to x. The population k-means set is C = arminCCk R(C). We think of C = {c1 , . . . , ck } as a set of cluster centers. We partition Rd into k sets T1 , . . . , Tk where x Tj if and only if ||x cj || ||x cs || for all s = j.

The sets T1 , . . . , Tk are called the k-means partition. To estimate the centers from the data, we take Ck = arminCCk R(C) where R(C) = 1 ||x C (x)|| dPn (x) = n
2 n

||Xi C (X)i ||2 .


i=1

Once we nd C = {c1 , . . . , ck } we partition the sample points into disjoint sets T1 , . . . , Tk where Xi Tj if and only if ||Xi cj || ||Xi cs || for all s = j. The sets T1 , . . . , Tk are the estimated clusters. The usual algorithm to nd C is the k-means clustering algorithm in Figure 2. The algorithm is not guaranteed to nd a global minimum so it is often a good idea to rerun the algorithm a few times with random starting values and then take the best solution. Now we discuss the theoretical properties of the k-means method. 3

1. Choose k centers c1, ..., ck at random from the data.
2. Form the clusters T1, ..., Tk where Xi is in Tj if cj is the center closest to Xi.
3. Let nj denote the number of points in Tj and set cj ← (1/nj) Σ_{i: Xi∈Tj} Xi.
4. Repeat steps 2 and 3 until convergence.

Figure 2: The k-means clustering algorithm.

THEOREM (Bartlett, Linder and Lugosi 1997). Suppose that P(||X||²/d ≤ 1) = 1 and that: n ≥ k^{4/d}, dk^{1−2/d} log n ≥ 15, kd ≥ 8, n ≥ 8d and n/log n ≥ dk^{1+2/d}. Then,

R(Ĉ) − R(C*) ≤ 32 √( d³ k^{1−2/d} log n / n ).

Also, if k ≥ 3 and n ≥ 16k/(2Φ²(−2)), then, for any method Ĉ that selects k centers, there exists P such that

R(Ĉ) − R(C*) ≥ c0 √( d k^{1−4/d} / n )

where c0 = Φ⁴(−2) 2^{−12} / √6. It follows that the method is consistent in the sense that R(Ĉ) − R(C*) → 0, as long as k = o(n/(d³ log n)). Moreover, the lower bound implies that we cannot find any other method that improves much over the k-means approach, at least with respect to this loss function.

An important practical question is how to choose a good value for k. There are numerous approaches to answering this question. The reason there are so many answers is that the question is vague. What does it mean to choose a good k? Rather than get into a prolonged discussion of the various methods, let us instead discuss a simple method that is tied to the risk function R(C).

To indicate the dependence on the number of clusters k, write C_k for the optimal clustering and let R_k = R(C_k). It is easy to see that R_k is a nonincreasing function of k so minimizing R_k does not make sense. Instead, we can look for the first k such that the improvement R_k − R_{k+1} is small, sometimes called an elbow. For example, fix a small number α > 0 and define

k* = min{ k : (R_k − R_{k+1}) / σ² ≤ α }

where σ² = E(||X − μ||²) and μ = E(X). An estimate of k* is

k̂ = min{ k : (R̂_k − R̂_{k+1}) / σ̂² ≤ α }

where σ̂² = (1/n) Σ_{i=1}^n ||Xi − X̄||².
0.2

Hierarchical Clustering

Agglomerative: start with each point in a separate cluster; merge the two closest clusters; continue. This requires defining the distance between clusters. Example (single linkage): d(C1, C2) = min{ d(xi, xj) : xi ∈ C1, xj ∈ C2 }. Divisive: start with one cluster; divide recursively.
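A minimal R sketch (the two-cluster data are illustrative; hclust and cutree are the standard tools):

# Agglomerative clustering: build the dendrogram, then cut it.
X <- rbind(matrix(rnorm(60, 0, .3), ncol = 2),
           matrix(rnorm(60, 3, .3), ncol = 2))
hc.single   <- hclust(dist(X), method = "single")    # d(C1,C2) = min distance
hc.complete <- hclust(dist(X), method = "complete")  # d(C1,C2) = max distance
plot(hc.complete)                                    # dendrogram
clusters <- cutree(hc.complete, k = 2)               # cut into 2 clusters
table(clusters)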

0.3 Level Set Clustering

The density function p(x) can also be used to define clusters. For a fixed non-negative number λ define the level set L(λ) = {x : p(x) > λ}. Suppose that L(λ) can be decomposed into a finite collection of bounded, connected, disjoint sets:

L(λ) = ∪_{j=1}^k Cj.

We assume that this decomposition is minimal in the sense that this is the fewest number of sets for such a decomposition. We then call C = {C1, ..., Ck} the density clusters.

Figure 3: Original image.

Figure 4: Compressed image.

Figure 5: Synthetic examples of clusters, with the within-cluster sum of squares plotted against the number of clusters.

The set C0 = {x : p(x) ≤ λ} is called the background and points Xi in C0 are called background clutter. We can estimate these clusters by first estimating the density with, say, the kernel density estimator

p̂(x) = (1/n) Σ_{i=1}^n Kh(x − Xi).

An estimator of L(λ) is L̂(λ) = {x : p̂(x) > λ}. One can then decompose L̂(λ) into estimated clusters Ĉ1, ..., Ĉk. The following theorem gives the rate of convergence of this estimator. Let B(x, ε) = {y : ||x − y|| ≤ ε} and for any set A define

A^ε = ∪_{x∈A} B(x, ε).

The Hausdorff distance between two sets is defined by

dH(A, B) = inf{ ε : A ⊂ B^ε and B ⊂ A^ε }.   (1)

A set S is standard if for every λ > 0 there exists a δ ∈ (0, 1) such that

μ(B(x, ε) ∩ S) ≥ δ μ(B(x, ε))

for all x ∈ S and all 0 < ε ≤ λ, where μ denotes Lebesgue measure.

THEOREM (Cuevas and Fraiman 1996). Suppose that the following assumptions hold.
1. The kernel K is a bounded density, uniformly Lipschitz, supported on a compact set, ||t||^d K(t) is bounded, there exist c, r such that c I(x ∈ B(0, r)) ≤ K(x), and K(t) is decreasing in ||t||.
2. The bandwidth hn satisfies hn → 0 and n hn^d / log n → ∞.

3. p is bounded and L = L(λ) = {x : p(x) > λ} is a compact, standard set.

Let L̂ = {x : p̂(x) > λ + cn} where cn > 0 and cn → 0. Then

εn dH(L, L̂) → 0  a.s.   (2)

where εn is any sequence satisfying εn → ∞ and εn hn → 0. In particular, we can choose hn ≍ (log² n / n)^{1/d} and εn ≍ (n / log³ n)^{1/d}. Faster rates of convergence are possible under stronger conditions; see Rigollet (2007).

To implement this method one must choose λ. One possibility is to fix a small number α and then choose λ = λ_α where λ_α = sup{ λ : P(L(λ)) ≥ 1 − α }. We estimate λ_α by

λ̂_α = sup{ λ : (1/n) #{ Xi ∈ L̂(λ) } ≥ 1 − α }.

A difficult computational problem is how to decompose L̂(λ) into disjoint clusters. Two points Xi and Xj are in the same cluster if and only if there exists a path between Xi and Xj such that p̂(x) > λ for all x along the path. This is a very difficult condition to check. Cuevas, Febrero and Fraiman (2000) suggest the following algorithm. Fix an ε > 0. Now we approximate L̂ with

L̃ = ∪_{i∈S} B(Xi, ε)

where S = {i : p̂(Xi) > λ} and B(x, ε) = {y : ||x − y|| ≤ ε}. The connected components of L̃ can be found as follows. Draw observations Z1, Z2, ... from p̂. Keep only those observations for which p̂(Zi) ≥ λ. Suppose there are N such observations where N is much larger than n. Now find the connected components as explained in Figure ??. One can do this using the original sample, but using the generated sample makes the procedure more accurate. One possible choice for ε is

ε = max_i min_{j≠i} ||Xi − Xj|| / 2

which is the smallest value that connects every Xi to its nearest neighbor. Other choices are discussed in Cuevas, Febrero and Fraiman (2000).
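Here is a minimal R sketch of the whole procedure. It uses a product Gaussian kernel estimate evaluated at the data points and the fact that the connected components of the union of ε-balls are exactly the single-linkage clusters cut at height 2ε; the data, bandwidth h, and level λ are illustrative assumptions.

# Level set clustering: keep high-density points, then split them into
# connected components of the epsilon-ball union via single linkage.
set.seed(1)
X <- rbind(matrix(rnorm(100, 0, .3), ncol = 2),
           matrix(rnorm(100, 3, .3), ncol = 2))
n <- nrow(X)
h <- 0.3
phat <- sapply(1:n, function(i)                      # KDE at the data points
  mean(dnorm(X[, 1] - X[i, 1], sd = h) * dnorm(X[, 2] - X[i, 2], sd = h)))
lambda <- 0.05
S <- which(phat > lambda)                            # points in the estimated level set
eps <- max(apply(as.matrix(dist(X)) + diag(Inf, n), 1, min)) / 2
hc <- hclust(dist(X[S, ]), method = "single")
clusters <- cutree(hc, h = 2 * eps)                  # balls of radius eps overlap iff distance <= 2*eps
table(clusters)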

Figure 6: Synthetic examples of clusters: hierarchical clusters, the cluster dendrogram (from hclust with complete linkage), and density clusters.

0.4 Modal Clustering

Suppose that the density p on R^d has finitely many modes m1, ..., mk. Next, define a partition T1, ..., Tk as follows. A point x belongs to Tj if and only if the steepest ascent path beginning at x leads to mj. Finally, the data are clustered according to which element Tj they fall in. The steepest ascent path starting at x is defined to be the curve t : [0, ∞) → R^d such that t(0) = x and t′(s) = V(t(s)) where V(x) = ∇p(x). In other words, the steepest ascent path is the integral curve defined by the vector field V. A sufficient condition for the existence and uniqueness of the steepest ascent paths is that there exists c such that ||V(x) − V(y)|| ≤ c ||x − y|| for all x and y. As described so far, mode clustering is not very useful since, for example, p may not even have a finite number of modes. A refinement that is used in practice is the following. Given h > 0 define

p_h(x) = (1/h^d) ∫ K(||u − x|| / h) p(u) du.

Thus, p_h is the density of the random variable X + Z where X ∼ P and Z ∼ N(0, h²I). We can think of p_h as the smoothed out version of p using a Gaussian kernel.

THEOREM. For each h, p_h has finitely many modes. The number of modes is nonincreasing in h.

We estimate p_h with the kernel density estimator. To find the modes and cluster the data we use the mean shift algorithm; see Figure 7. This algorithm not only finds the modes but it shows what mode each Xi belongs to.

pdf("cat1.pdf")
library(adimpro)
x = read.image("cat.jpeg")
y = extract.image(x)
y = matrix(y,130,89)
g = gray(seq(0,1,length=1024))
image(y,col=g)
tmp = kmeans(c(y),centers=5)

1. Choose a grid of points x1, ..., xN. This grid can be identical to the data {X1, ..., Xn} but in general it can be different. Set t = 0.
2. Let t = t + 1. For j = 1, ..., N set

x_j ← Σ_{i=1}^n Xi Kh(x_j − Xi) / Σ_{i=1}^n Kh(x_j − Xi).

3. Repeat until convergence.

Figure 7: The Mean Shift Algorithm.

Y = tmp$centers[tmp$cluster,1]
Y = matrix(Y,130,89)
dev.off()
pdf("cat2.pdf")
image(Y,col=g)
dev.off()

###### blobs
x1 = c(rnorm(50,-.8,.2),rnorm(50,.8,.2),rnorm(50,.8,.2))
x2 = c(rnorm(50,-.8,.2),rnorm(50,.8,.2),rnorm(50,-.8,.2))
plot(x1,x2,col="black",xlab="",ylab="",xaxt="n",yaxt="n",pch=K,lwd=3,
     xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
X = cbind(x1,x2)
n = nrow(X)
ss = rep(0,10)
ss[1] = n*sum(diag(var(X)))
### kmeans
for(i in 2:10){
  out = kmeans(X,centers=i)
  ss[i] = sum(out$withinss)
}
plot(ss,type="h",lwd=2,xlab="Number of Clusters",yaxt="n",ylab="")
out = kmeans(X,centers=3)

plot(x1,x2,xlab="",ylab="",xaxt="n",yaxt="n",lwd=3, xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),pch=out$cluster,col=out$cluster)

###### circles
a = seq(0,2*pi,length=60)
x1 = cos(a); x2 = sin(a)
x1 = c(x1,2*x1)
x2 = c(x2,2*x2)
X = cbind(x1,x2)
plot(x1,x2,col="black",xlab="",ylab="",xaxt="n",yaxt="n",pch=K,lwd=3,
     xlim=c(-3,3),ylim=c(-3,3))
n = nrow(X)
ss = rep(0,10)
ss[1] = n*sum(diag(var(X)))
## kmeans
for(i in 2:10){
  out = kmeans(X,centers=i)
  ss[i] = sum(out$withinss)
}
plot(ss,type="h",lwd=2,xlab="Number of Clusters",yaxt="n",ylab="")
out = kmeans(X,centers=4)
plot(x1,x2,xlab="",ylab="",xaxt="n",yaxt="n",lwd=3,
     xlim=c(-3,3),ylim=c(-3,3),pch=out$cluster,col=out$cluster)

par(mfrow=c(2,2),pty="s",pin=c(2,2))
plot(x1,x2,col="black",xlab="",ylab="",xaxt="n",yaxt="n",pch=K,lwd=3,
     xlim=c(-3,3),ylim=c(-3,3))
### hierarchical clustering
out = hclust(dist(X))
plot(out)
tmp = cutree(out,k=2)
plot(x1,x2,xlab="",ylab="",xaxt="n",yaxt="n",lwd=3,
     xlim=c(-3,3),ylim=c(-3,3),pch=tmp,col=tmp,
     main="hierarchical clusters")


### density clusters
h = .1
y1 = seq(-3,3,length=50)
y2 = seq(-3,3,length=50)
y1 = rep(y1,rep(50,50))
y2 = rep(y2,50)
f = rep(0,50*50)
for(i in 1:n){
  f = f + dnorm(y1,x1[i],h)*dnorm(y2,x2[i],h)/n
}
g = gray(seq(1,0,length=10))
image(matrix(f,50,50),col=g,main="density clusters",xaxt="n",yaxt="n")

### genes
m = read.table("nci.data")
out = kmeans(t(m),centers=3)
true=c("CNS","CNS","CNS","RENAL","BREAST","CNS","CNS",
"BREAST","NSCLC","NSCLC","RENAL","RENAL","RENAL","RENAL",
"RENAL","RENAL","RENAL","BREAST","NSCLC","RENAL","UNKNOWN",
"OVARIAN","MELANOMA","PROSTATE","OVARIAN","OVARIAN","OVARIAN",
"OVARIAN","OVARIAN","PROSTATE","NSCLC","NSCLC","NSCLC",
"LEUKEMIA","K562B-repro","K562A-repro","LEUKEMIA","LEUKEMIA",
"LEUKEMIA","LEUKEMIA","LEUKEMIA","COLON","COLON","COLON",
"COLON","COLON","COLON","COLON","MCF7A-repro","BREAST","MCF7D-repro",
"BREAST","NSCLC","NSCLC","NSCLC","MELANOMA","BREAST","BREAST",
"MELANOMA","MELANOMA","MELANOMA","MELANOMA","MELANOMA","MELANOMA")
table(true,out$cluster)

             cluster
true            1  2  3
  BREAST        0  4  3
  CNS           0  0  5
  COLON         0  7  0
  K562A-repro   1  0  0
  K562B-repro   1  0  0
  LEUKEMIA      5  1  0
  MCF7A-repro   0  1  0
  MCF7D-repro   0  1  0
  MELANOMA      0  7  1
  NSCLC         0  3  6
  OVARIAN       0  2  4
  PROSTATE      0  1  1
  RENAL         0  0  9
  UNKNOWN       0  0  1
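Finally, as a companion to the mean shift algorithm in Figure 7, here is a minimal stand-alone implementation for two-dimensional data with a Gaussian kernel; the data, bandwidth, tolerance, and the final grouping step are illustrative choices.

# Mean shift: move each point uphill to its mode by iterating the
# weighted-average update from Figure 7.
meanshift <- function(X, h, tol = 1e-6, maxit = 500){
  x <- X                                     # start the trajectories at the data
  for(it in 1:maxit){
    xold <- x
    for(j in 1:nrow(x)){
      w <- dnorm(X[,1] - x[j,1], sd = h) * dnorm(X[,2] - x[j,2], sd = h)
      x[j,] <- colSums(w * X) / sum(w)       # x_j <- sum_i X_i K_h(x_j - X_i) / sum_i K_h(x_j - X_i)
    }
    if(max(abs(x - xold)) < tol) break
  }
  x                                          # each row has converged to (nearly) a mode
}
set.seed(1)
X <- rbind(matrix(rnorm(100, 0, .3), ncol = 2),
           matrix(rnorm(100, 3, .3), ncol = 2))
modes <- meanshift(X, h = .4)
cluster <- cutree(hclust(dist(modes), method = "single"), h = .1)  # group points with the same limit
table(cluster)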

Bibliography

[1] Cohen, J. and Cohen, P. (1975). Applied Multiple Regression and Correlation Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
[2] Cook, R. D. and Weisberg, S. (1994). An Introduction to Regression Graphics. New York: Wiley.
[3] Fox, J. (1991). Regression Diagnostics: An Introduction. Newbury Park, CA: Sage.
[4] Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods. Newbury Park, CA: Sage.
[5] Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.
[6] Chatterjee, S. and Price, B. (1977). Regression Analysis by Example. New York: Wiley.
