
Regression Analysis

Regression analysis is a statistical tool for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another—the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate. To explore such issues, the investigator assembles data on the underlying variables of interest and employs regression to estimate the quantitative effect of the causal variables upon the variable that they influence. The investigator also typically assesses the "statistical significance" of the estimated relationships, that is, the degree of confidence that the true relationship is close to the estimated relationship.

Contents

1. Prediction
2. Some Terminology
3. Simple Linear Regression: X scalar and r(x) linear
4. Inference
5. ANOVA and R²
6. Prediction Intervals
7. Why Are We Doing This If The Model is Wrong?
8. Association versus Causation
9. Confidence Bands
10. Review of Linear Algebra
11. Multiple Linear Regression
12. Example: Crime Data
13. Testing Subsets of Coefficients
14. The Hat Matrix
15. Weighted Least Squares
16. The Predictive Viewpoint
17. The Bias-Variance Decomposition
18. Diagnostics
19. Outliers
20. Influence
21. Tweaking the Regression
22. Quantitative Variables
23. Variable Selection
24. The Bias-Variance Tradeoff
25. Variable Selection versus Hypothesis Testing
26. Collinearity
27. Robust Regression
28. Nonlinear Regression
29. Logistic Regression
30. More About Logistic Regression
31. Logistic Regression With Replication
32. Generalized Linear Models
33. Measurement Error
34. Nonparametric Regression
35. Choosing the Smoothing Parameter
36. Kernel Regression
37. Local Polynomials
38. Penalized Regression, Regularization and Splines
39. Smoothing Using Orthogonal Functions
40. Variance Estimation
41. Confidence Bands
42. Testing the Fit of a Linear Model
43. Local Likelihood and Exponential Families
44. Multiple Nonparametric Regression
45. Density Estimation
46. Classification
47. Graphical Models
48. Directed Graphs
49. Estimation for DAGS
• Homework
• Appendix: Clustering
• Bibliography

1 Prediction

Suppose (X, Y) have a joint distribution f(x, y). You observe X = x. What is your best prediction of Y? Let g(x) be any prediction function. The prediction error (or risk) is R(g) = E(Y − g(X))². Let

    r(x) = E(Y | X = x) = ∫ y f(y|x) dy

be the regression function. Key result: for any g, R(r) ≤ R(g).

Let ε = Y − r(X). Then E(ε) = 0 and we can write

    Y = r(X) + ε.    (1)

But we don't know r(x), so we estimate it from the data.
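The key result — that the regression function has the smallest risk of any predictor — can be checked by a small simulation. (This example is not from the notes; the model Y = X² + ε is made up purely for the illustration.)

```r
## Monte Carlo check that r(x) = E(Y|X=x) minimizes the risk R(g) = E(Y - g(X))^2.
set.seed(1)
n = 100000
x = runif(n)
y = x^2 + rnorm(n, sd = .1)            # here r(x) = x^2 and sigma^2 = .01
risk = function(g) mean((y - g(x))^2)  # Monte Carlo estimate of R(g)
risk(function(u) u^2)                  # close to the irreducible error .01
risk(function(u) u)                    # any other g has larger risk
```

Any competing predictor g pays the irreducible error σ² plus the average squared distance between g and r.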

2 Some Terminology

Given data (X1, Y1), ..., (Xn, Yn) we have two goals:

estimation: find an estimate r̂(x) of the regression function r(x).
prediction: given a new X, predict Y; we use Ŷ = r̂(X) as the prediction.

At first we assume that Yi ∈ R. Later in the course, we consider other cases such as Yi ∈ {0, 1}.

             r linear                          r arbitrary
X scalar     r(x) = β0 + β1 x                  r(x) is some smooth function
             (simple linear regression)        (nonparametric regression)
X vector     r(x) = β0 + Σj βj xj              r(x1, ..., xp) is some smooth function
             (multiple linear regression)      (multiple nonparametric regression)

3 Simple Linear Regression: X scalar and r(x) linear

Suppose that Yi ∈ R, Xi ∈ R and that

h   r(x) = β0 + β1 x.    (2)

This model is wrong. There is no reason to assume that r is linear. We make this assumption tentatively but we will drop it later. I use the h symbol to alert you to model-based statements.

Figure 1: Cat example (X = Body Weight, Y = Heart Weight).

We can write

    Yi = β0 + β1 Xi + εi    (3)

where E(εi) = 0 and ε1, ..., εn are independent. We also assume that V(εi) = σ² does not depend on x. (Homoskedasticity.) The unknown parameters are: β0, β1, σ². Define the residual sums of squares

    RSS(β0, β1) = Σ_{i=1}^n ( Yi − (β0 + β1 Xi) )².    (4)

The least squares estimators (LS) minimize RSS(β0, β1).

3.1 Theorem. The LS estimators are

    β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²    (5)
    β̂0 = Ȳ − β̂1 X̄    (6)

where X̄ = n⁻¹ Σ_{i=1}^n Xi and Ȳ = n⁻¹ Σ_{i=1}^n Yi.
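These formulas are easy to verify in R. The sketch below uses R's built-in cars data (speed as X, stopping distance as Y) rather than the cat data, so that it runs without loading any package:

```r
## Compute the LS estimators of Theorem 3.1 by hand and compare with lm().
x = cars$speed
y = cars$dist
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 = mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(dist ~ speed, data = cars))   # same two numbers
```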

We define:

The fitted line:                   r̂(x) = β̂0 + β̂1 x
The predicted or fitted values:    Ŷi = r̂(Xi) = β̂0 + β̂1 Xi
The residuals:                     ε̂i = Yi − Ŷi
The residual sums of squares:      RSS = Σ_{i=1}^n ε̂i²

An unbiased estimate of σ² is

h   σ̂² = RSS / (n − 2).    (7)

The estimators are random variables and have the following properties (conditional on X1, ..., Xn):

    E(β̂0) = β0,   E(β̂1) = β1,   V(β̂1) = (σ²/n)(1/s_x²)

where s_x² = n⁻¹ Σ_{i=1}^n (Xi − X̄)². Also, E(σ̂²) = σ². The standard error and its estimate are

    se(β̂1) = σ / (√n s_x),   ŝe(β̂1) = σ̂ / (√n s_x).
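As a quick numerical check of the standard-error formula against the Std. Error column reported by lm() — an illustration using the built-in cars data, not part of the notes:

```r
## se(beta1.hat) = sigma.hat/(sqrt(n) s_x) matches summary(lm())'s Std. Error.
x = cars$speed
y = cars$dist
n = length(x)
out = lm(y ~ x)
sigma.hat = sqrt(sum(residuals(out)^2) / (n - 2))  # equation (7)
s.x = sqrt(mean((x - mean(x))^2))                  # s_x^2 = n^{-1} sum (x_i - xbar)^2
se.b1 = sigma.hat / (sqrt(n) * s.x)
summary(out)$coefficients["x", "Std. Error"]       # same number as se.b1
```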

Approximate Normality

h   β̂0 ≈ N( β0, se²(β̂0) ),   β̂1 ≈ N( β1, se²(β̂1) ).    (8)

h   If εi ∼ N(0, σ²) then:

1. Equation (8) is exact.
2. The least squares estimators are the maximum likelihood estimators.
3. The variance estimator satisfies: σ̂² ∼ σ² χ²_{n−2} / (n − 2).

4 Inference

It follows from (8) that an approximate 1 − α confidence interval for β1 is

h   β̂1 ± z_{α/2} ŝe(β̂1)    (9)

where z_{α/2} is the upper α/2 quantile of a standard Normal:

    P(Z > z_{α/2}) = α/2,   where Z ∼ N(0, 1).    (10)

For α = .05, z_{α/2} = 1.96 ≈ 2, so an approximate 95 per cent confidence interval for β1 is β̂1 ± 2 ŝe(β̂1).

4.1 Remark. If the residuals are Normal, then an exact 1 − α confidence interval for β1 is

    β̂1 ± t_{α/2,n−2} ŝe(β̂1)    (11)

where t_{α/2,n−2} is the upper α/2 quantile of a t with n − 2 degrees of freedom. This interval is bogus. If n is large, t_{α/2,n−2} ≈ z_{α/2} so just use the Normal interval. If n is so small that t_{α/2,n−2} is much different than z_{α/2}, then n is too small to be doing statistical inference. (Do you really believe that the residuals are exactly Normal anyway?)

To test

h   H0: β1 = 0 versus H1: β1 ≠ 0    (12)

use the test statistic

    z = (β̂1 − 0) / ŝe(β̂1).    (13)

Under H0, z ≈ N(0, 1). The p-value is

    p-value = P(|Z| > |z|) = 2Φ(−|z|)    (14)

where Z ∼ N(0, 1). Reject H0 if the p-value is small.

4.2 Example. Here is an example.

### Cat example ###
library(MASS)
attach(cats); help(cats)
names(cats)
[1] "Sex" "Bwt" "Hwt"
summary(cats)
 Sex        Bwt             Hwt
 F:47  Min.   :2.000  Min.   : 6.30
 M:97  1st Qu.:2.300  1st Qu.: 8.95
       Median :2.700  Median :10.10
       Mean   :2.724  Mean   :10.63
       3rd Qu.:3.025  3rd Qu.:12.12
       Max.   :3.900  Max.   :20.50
postscript("cat.ps", horizontal=F)
par(mfrow=c(2,2))
boxplot(cats[,2:3])
plot(Bwt, Hwt)
out = lm(Hwt ~ Bwt, data = cats)
summary(out)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3567     0.6923  -0.515    0.607
Bwt           4.0341     0.2503  16.119   <2e-16 ***
Residual standard error: 1.452 on 142 degrees of freedom
Multiple R-Squared: 0.6466, Adjusted R-squared: 0.6441
F-statistic: 259.8 on 1 and 142 DF, p-value: < 2.2e-16
abline(out, lwd=3)
names(out)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
r = out$residuals
plot(Bwt, r, pch=19)
lines(Bwt, rep(0, length(Bwt)), lty=3, col=2, lwd=3)
qqnorm(r)
dev.off()

The plots are shown in Figure 2.

How a qq-plot works. If you are not familiar with qq-plots, read this. It works like this. Order the data: X(1) ≤ X(2) ≤ ··· ≤ X(n). Let zj = Φ⁻¹(j/n). (Actually, we don't quite use j/n but never mind.) Plot X(j) versus zj. If Xi ∼ N(µ, σ²) then this plot should be a straight line with slope σ and intercept µ. Why? Let

    F̂(x) = (number of observations ≤ x) / n.

So F̂(x) ≈ F(x) = P(X ≤ x). Note that F̂(X(j)) = j/n and so

    X(j) = F̂⁻¹(j/n) ≈ F⁻¹(j/n) ≈ σ Φ⁻¹(j/n) + µ = σ zj + µ.

We used the fact that F⁻¹(q) = σ Φ⁻¹(q) + µ. Here is a proof of this fact. Let xq = F⁻¹(q) be the qth quantile. Then

    F(xq) = P(X ≤ xq) = P( (X − µ)/σ ≤ (xq − µ)/σ ) = P( Z ≤ (xq − µ)/σ ) = Φ( (xq − µ)/σ )

and hence F⁻¹(q) = xq = σ Φ⁻¹(q) + µ.

4.3 Example (Election 2000). Figure 3 shows the plot of votes for Buchanan (Y) versus votes for Bush (X) in Florida. The least squares estimates (omitting Palm Beach County) and the standard errors are

    β̂0 = 66.0991,   ŝe(β̂0) = 17.2926
    β̂1 = 0.0035,    ŝe(β̂1) = 0.0002.

The fitted line is

    Buchanan = 66.0991 + 0.0035 Bush.

Figure 3 also shows the residuals. The inferences from linear regression are most accurate when the residuals behave like random normal numbers. Based on the residual plot, this is not the case in this example. If we repeat the analysis replacing votes with log(votes) we get

    β̂0 = −2.3298,   ŝe(β̂0) = 0.3529
    β̂1 = 0.730300,  ŝe(β̂1) = 0.0358.

This gives the fit

    log(Buchanan) = −2.3298 + 0.7303 log(Bush).

The residuals look much healthier. Later, we shall address the following question: how do we see if Palm Beach County has a statistically plausible outcome? On the log scale, a 95 per cent confidence interval is .7303 ± 2(.0358) = (.66, .80). The statistic for testing H0: β1 = 0 versus H1: β1 ≠ 0 is |z| = |.7303 − 0|/.0358 = 20.40, with a p-value of P(|Z| > 20.40) ≈ 0. This is strong evidence that the true slope is not 0.
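The election-example test is just arithmetic; here it is in R, with the numbers taken from above:

```r
## z test and approximate 95 per cent interval for the log-scale slope.
b1 = 0.7303
se1 = 0.0358
z = (b1 - 0) / se1            # about 20.4
p.value = 2 * pnorm(-abs(z))  # essentially 0
b1 + c(-2, 2) * se1           # about (.66, .80)
```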

Figure 2: Cat example (Bwt versus Hwt with fitted line; Normal Q-Q plot of the residuals).

Figure 3: Voting Data for Election 2000.

5 h ANOVA and R²

In the olden days, statisticians were obsessed with summarizing things in analysis of variance (ANOVA) tables. It works like this. We can write

    Σ_{i=1}^n (Yi − Ȳ)² = Σ_{i=1}^n (Yi − Ŷi)² + Σ_{i=1}^n (Ŷi − Ȳ)²
    SStotal = RSS + SSreg.

Then we create this table:

Source       df    SS        MS           F
Regression   1     SSreg     SSreg/1      MSreg/MSE
Residual     n-2   RSS       RSS/(n-2)
Total        n-1   SStotal

Under H0: β1 = 0, F ∼ F_{1,n−2}. This is just another (equivalent) way to test this hypothesis.
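In simple linear regression the F statistic carries exactly the same information as the t statistic for β̂1, since F = t². A quick check (using R's built-in cars data, an illustration not in the notes):

```r
## The ANOVA F statistic equals the squared t statistic for the slope.
out = lm(dist ~ speed, data = cars)
F.stat = anova(out)["speed", "F value"]
t.stat = summary(out)$coefficients["speed", "t value"]
c(F.stat, t.stat^2)   # identical
```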

The coefficient of determination is

    R² = SSreg/SStot = 1 − RSS/SStot.    (15)

This is the amount of variability in Y explained by X. Note that R² = r², where

    r = Σ_{i=1}^n (Yi − Ȳ)(Xi − X̄) / √( Σ_{i=1}^n (Yi − Ȳ)² Σ_{i=1}^n (Xi − X̄)² )

is the sample correlation. This is an estimate of the correlation

    ρ = E[ (X − µX)(Y − µY) ] / (σX σY).

Note that −1 ≤ ρ ≤ 1.

5.1 Remark. What happens to R² if we move the minimum xi further to the left and we move the maximum xi further to the right?

6 Prediction Intervals

Given a new value X∗, we want to predict

    Y∗ = β0 + β1 X∗ + ε.    (16)

The prediction is Ŷ∗ = β̂0 + β̂1 X∗. Define

    ŝe_pred(Ŷ∗) = σ̂ √( 1 + 1/n + (X∗ − X̄)² / Σ_{i=1}^n (Xi − X̄)² ).    (17)

A confidence interval for Y∗ is

h   Ŷ∗ ± z_{α/2} ŝe_pred(Ŷ∗).

6.1 Remark. This is not really the standard error of the quantity Ŷ∗. It is the standard error of β̂0 + β̂1 X∗ + ε. Note that ŝe_pred(Ŷ∗) does not go to 0 as n → ∞. Why?

6.2 Example (Election Data Revisited). On the log-scale, our linear regression gives the following prediction equation: log(Buchanan) = −2.3298 + 0.7303 log(Bush). In Palm Beach, Bush had 152,954 votes and Buchanan had 3,467 votes. On the log scale this is 11.93789 and 8.151045. How likely is this outcome, assuming our regression model is appropriate? Our prediction for log Buchanan votes is −2.3298 + .7303(11.93789) = 6.388441. Now, 8.151045 is bigger than 6.388441, but is it "significantly" bigger? Let us compute a confidence interval. We find that ŝe_pred = .093775 and the approximate 95 per cent confidence interval is (6.200, 6.578), which clearly excludes 8.151. Indeed, 8.151 is nearly 20 standard errors from Ŷ∗. Going back to the vote scale by exponentiating, the confidence interval is (493, 717), compared to the actual number of votes, which is 3,467.
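The formula for ŝe_pred can be checked against R's predict(); note that predict() uses the t quantile rather than z_{α/2}. An illustration with the built-in cars data (not part of the notes):

```r
## Hand computation of the prediction standard error (17) versus predict().
x = cars$speed
y = cars$dist
n = length(x)
out = lm(y ~ x)
sigma.hat = summary(out)$sigma
x.star = 21
se.pred = sigma.hat * sqrt(1 + 1/n + (x.star - mean(x))^2 / sum((x - mean(x))^2))
pr = predict(out, newdata = data.frame(x = x.star), interval = "prediction")
(pr[1, "upr"] - pr[1, "fit"]) / qt(.975, n - 2)   # equals se.pred
```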

7 Why Are We Doing This If The Model is Wrong?

The model Y = β0 + β1 x + ε is certainly false. There is no reason why r(x) should be exactly linear. Nonetheless, the linear assumption might be adequate. But how do we assess whether the linear assumption is adequate? There are three ways.

1. We can do a goodness-of-fit test.
2. We can do a nonparametric regression that does not assume linearity.
3. We can take a purely predictive point of view and regard β̂0 + β̂1 x as an estimate of the best linear predictor, not as an estimate of the true regression function.

We will return to these points later.

8 Association Versus Causation

There is much confusion about the difference between causation and association. Roughly speaking, the statement "X causes Y" means that changing the value of X will change the distribution of Y. When X causes Y, X and Y will be associated, but the reverse is not, in general, true. Association does not necessarily imply causation. For example, suppose there is a strong linear relationship between death rate due to breast cancer and fat intake:

    RISK OF DEATH = β0 + β1 FAT + ε    (18)

where β1 > 0. Does that mean that FAT causes breast cancer? Consider two interpretations of (18).

ASSOCIATION (or correlation). Fat intake and breast cancer are associated. Therefore, if I observe someone's fat intake, I can use equation (18) to predict their chance of dying from breast cancer.

CAUSATION. Fat intake causes breast cancer. Therefore, if I change someone's fat intake by one unit, their risk of death from breast cancer changes by β1.

If the data are from a randomized study (X is randomly assigned) then the causal interpretation is correct. If the data are from an observational study (X is not randomly assigned) then the association interpretation is correct. To see why the causal interpretation is wrong in the observational study, suppose that people with high fat intake are the rich people. And suppose, for the sake of the example, that smoking does cause cancer and that rich people smoke a lot. Then it will be true that high fat intake predicts high cancer rate. But changing someone's fat intake will not change their cancer risk.

How can we make these ideas precise? The answer is to use either counterfactuals or directed acyclic graphs. Look at the top left plot in Figure 4. These are observed data on vitamin C (X) and colds (Y). You conclude that increasing vitamin C decreases colds. You tell everyone to take more vitamin C, but the prevalence of colds stays the same. Why? Look at the second plot. The dotted lines show the counterfactuals. The counterfactual yi(x) is the value Y person i would have had if they had taken dose X = x. Note that

    Yi = yi(Xi).    (19)

In other words: Yi is the function yi(·) evaluated at Xi. The causal regression is the average of the counterfactual curves yi(x):

    c(x) = E(yi(x)).    (20)

The average is over the population; that is, fix a value of x and then average yi(x) over all individuals. In general,

    r(x) ≠ c(x):   association does not equal causation.    (21)

In this example, changing everyone's dose does not change the outcome; the causal regression curve c(x) is shown in the third plot. In the second example (right side of Figure 4) it is worse: you tell everyone to take more vitamin C but the prevalence of colds increases.

Figure 4: Causation (data, counterfactuals, and the causal regression function).

Suppose now that we randomly assign dose X. Then Xi is independent of the counterfactuals {yi(x) : x ∈ R}. In that case:

    c(x) = E(y(x))
         = E(y(x) | X = x)   since X is indep of {y(x) : x ∈ R}    (22)(23)
         = E(Y | X = x)      since Y = y(X)    (24)
         = r(x).    (25)

Thus, if X is randomly assigned then association is equal to causation. In an observational (non randomized) study, the best we can do is try to measure confounding variables. These are variables that affect both X and Y. If we can find all the confounding variables Z, then {y(x) : x ∈ R} is independent of X given Z. Hence,

    c(x) = E(y(x)) = ∫ E(y(x) | Z = z) f(z) dz    (26)(27)
         = ∫ E(y(x) | Z = z, X = x) f(z) dz   since X is indep of {yi(x) : x ∈ R} given Z    (28)

         = ∫ E(Y | Z = z, X = x) f(z) dz.    (29)

This is called adjusting for confounders. If the regression is linear in x and z, this becomes

    c(x) = ∫ (β1 x + β2 z) f(z) dz = β1 x + β2 E(Z).    (30)(31)

Note the following difference:

    c(x) = ∫ E(Y | Z = z, X = x) f(z) dz    (32)
    E(Y | X = x) = ∫ E(Y | Z = z, X = x) f(z|x) dz.    (33)

However, we can never be sure we have included all confounders. This is why observational studies have to be treated with caution.

9 Confidence Bands

9.1 Theorem (Scheffé, 1959). Let

    I(x) = ( r̂(x) − c σ̂ √( 1/n + (x − x̄)²/Σ_i(xi − x̄)² ),  r̂(x) + c σ̂ √( 1/n + (x − x̄)²/Σ_i(xi − x̄)² ) )    (34)

where r̂(x) = β̂0 + β̂1 x and c = √(2 F_{α,2,n−2}). Then

h   P( r(x) ∈ I(x) for all x ) ≥ 1 − α.    (35)

9.2 Example. Let us return to the cat example. The R code is:

library(MASS)
attach(cats)
plot(Bwt, Hwt)
out = lm(Hwt ~ Bwt, data = cats)
abline(out, lwd=3)
r = out$residuals
n = length(Bwt)
x = seq(min(Bwt), max(Bwt), length=1000)
d = qf(.95, 2, n-2)
beta = out$coeff
xbar = mean(Bwt)
ssx = sum( (Bwt-xbar)^2 )
sigma.hat = sqrt(sum(r^2)/(n-2))
stuff = sqrt(2*d)*sqrt( (1/n) + ((x-xbar)^2/ssx) )*sigma.hat
### Important: Note that these are all scalars except that x is a vector.

r.hat = beta[1] + beta[2]*x
upper = r.hat + stuff
lower = r.hat - stuff
lines(x, upper, lty=2, col=2, lwd=3)
lines(x, lower, lty=2, col=2, lwd=3)

The bands are shown in Figure 5.

Figure 5: Confidence Band for Cat Example.

10 Review of Linear Algebra

Before starting multiple regression, we will briefly review some linear algebra. Read pages 278-287 of Weisberg.

The inner product of two vectors x and y is ⟨x, y⟩ = xᵀy = Σ_j xj yj. The norm of a vector x is ||x|| = √⟨x, x⟩ = √(Σ_j xj²). Two vectors are orthogonal if ⟨x, y⟩ = 0; we then write x ⊥ y. If A is a matrix, denote its inverse by A⁻¹ and its transpose by Aᵀ. The trace of a square matrix A, denoted by tr(A), is the sum of its diagonal elements.

PROJECTIONS. We will make extensive use of projections. Let us start with a simple example. Let e1 = (1, 0), e2 = (0, 1) and note that R² is the linear span of e1 and e2: any vector (a, b) ∈ R² is a linear combination of e1 and e2. Let L = {a e1 : a ∈ R} be the set of vectors of the form (a, 0). Note that L is a linear subspace of R². Given a vector x = (a, b) ∈ R², the projection x̂ of x onto L is the vector in L that is closest to x; that is, x̂ minimizes ||x − x̂|| among all vectors in L. It is easy to see that, in our simple example, the projection of x = (a, b) is just (a, 0). Note that we can write x̂ = Px where

    P = ( 1 0
          0 0 ).

This is the projection matrix. In general, given a vector space V and a linear subspace L, there is a projection matrix P that maps any vector v into its projection Pv. The projection matrix satisfies these properties:

• Pv exists and is unique.
• P is linear: if a and b are scalars then P(ax + by) = aPx + bPy.
• P is symmetric.
• P is idempotent: P² = P.
• If x ∈ L then Px = x.

Another characterization of x̂ is this: it is the unique vector such that (i) x̂ ∈ L and (ii) x − x̂ ⊥ y for all y ∈ L.

Now let X be some n × q matrix and suppose that XᵀX is invertible. The column space is the space L of all vectors that can be obtained by taking linear combinations of the columns of X. It can be shown that the projection matrix for the column space is

    P = X(XᵀX)⁻¹Xᵀ.

Exercise: check that P is idempotent and that if x ∈ L then Px = x.
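These properties are easy to verify numerically. A sketch for a small design matrix (chosen arbitrarily for the example):

```r
## Numerical check of the projection matrix P = X (X^T X)^{-1} X^T.
X = cbind(1, c(1, 2, 3, 4), c(1, 4, 9, 16))   # n = 4, q = 3, full column rank
P = X %*% solve(t(X) %*% X) %*% t(X)
all.equal(P, t(P))      # symmetric
all.equal(P %*% P, P)   # idempotent
all.equal(P %*% X, X)   # P fixes the column space
sum(diag(P))            # tr(P) = rank = 3
```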

RANDOM VECTORS. Let Y be a random vector. Denote the mean vector by µ and the covariance matrix by V(Y) or Cov(Y), here written Σ. If a is a vector then

    E(aᵀY) = aᵀµ,   V(aᵀY) = aᵀΣa.    (36)

If A is a matrix then

    E(AY) = Aµ,   V(AY) = AΣAᵀ.    (37)

11 Multiple Linear Regression

The multiple linear regression model is

h   Y = β0 + β1 X1 + ··· + βp Xp + ε = βᵀX + ε    (38)

where β = (β0, ..., βp)ᵀ and X = (1, X1, ..., Xp)ᵀ. The value of the jth covariate for the ith subject is denoted by Xij. Thus

    Yi = β0 + β1 Xi1 + ··· + βp Xip + εi.    (39)

At this point, it is convenient to use matrix notation. Let

    X = ( 1  X11  X12  ...  X1p
          1  X21  X22  ...  X2p
          ...
          1  Xn1  Xn2  ...  Xnp )    (n × q).

Each subject corresponds to one row. The number of columns of X will be denoted by q. Now define

    Y = (Y1, ..., Yn)ᵀ,   β = (β0, ..., βp)ᵀ,   ε = (ε1, ..., εn)ᵀ.

We can then rewrite (38) as

    Y = Xβ + ε.    (40)

Note that Yi = Xiᵀβ + εi, where Xiᵀ is the ith row of X. The RSS is given by

    RSS(β) = Σ_i (Yi − Xiᵀβ)²    (41)
           = (Y − Xβ)ᵀ(Y − Xβ).    (42)

11.1 Theorem. The least squares estimator is

    β̂ = SY    (43)

where

    S = (XᵀX)⁻¹Xᵀ,    (44)

assuming that (XᵀX) is invertible.
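Theorem 11.1 can be checked directly in R; this sketch uses the built-in mtcars data (not the crime data of the next section):

```r
## beta.hat = (X^T X)^{-1} X^T Y computed by hand agrees with lm().
y = mtcars$mpg
X = cbind(1, mtcars$wt, mtcars$hp)
beta.hat = solve(t(X) %*% X, t(X) %*% y)
as.vector(beta.hat)
coef(lm(mpg ~ wt + hp, data = mtcars))   # same three numbers
```

Using solve(A, b) rather than forming the explicit inverse is the numerically preferred way to evaluate (XᵀX)⁻¹XᵀY.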

The fitted values are Ŷ = Xβ̂ and the residuals are ε̂ = Y − Ŷ. The RSS is RSS = ||ε̂||² = ε̂ᵀε̂. An unbiased estimate of σ² is

    σ̂² = RSS/(n − q) = RSS/(n − p − 1).    (45)

11.2 Theorem. h The estimators satisfy the following properties:

1. E(β̂) = β.
2. V(β̂) = σ²(XᵀX)⁻¹ ≡ Σ.
3. β̂ ≈ MN( β, σ²(XᵀX)⁻¹ ) (multivariate Normal).
4. An approximate 1 − α confidence interval for βj is

    β̂j ± z_{α/2} ŝe(β̂j)    (46)

where ŝe(β̂j) is the square root of the appropriate diagonal element of the matrix σ̂²(XᵀX)⁻¹.

Let's prove the first two assertions. Note that

    E(β̂) = E(SY) = SE(Y) = SXβ = (XᵀX)⁻¹XᵀXβ = β.

Also, since V(Y) = σ²I, where I is the identity matrix,

    V(β̂) = V(SY) = SV(Y)Sᵀ = σ²SSᵀ = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹ = σ²(XᵀX)⁻¹.

The ANOVA table is

Source       df    SS        MS             F
Regression   q-1   SSreg     SSreg/p        MSreg/MSE
Residual     n-q   RSS       RSS/(n-p-1)
Total        n-1   SStotal

The F test has F ∼ F_{p,n−p−1}. This is testing the hypothesis

    H0: β1 = ··· = βp = 0.

Testing this hypothesis is of limited value.

12 Example: Crime Data

### multiple linear regression crime data
x = scan("~/=classes/=stat707/=data/crime.dat", skip=1)
x = matrix(x, ncol=14, byrow=T)
names = c("Crime","Age","Southern","Education","Expenditure","Labor","Males",
          "NW","U1","U2","Wealth","Ex1","pop","X")

crime.dat = as.data.frame(x)
names(crime.dat) = names
postscript("crime.ps", horizontal=F)
boxplot(crime.dat)
out = lm(Crime ~ Age + Southern + Education + Expenditure + Labor + Males +
         pop + U1 + U2 + Wealth, data=crime.dat)
print(summary(out))

Call:
lm(formula = Crime ~ Age + Southern + Education + Expenditure +
   Labor + Males + pop + U1 + U2 + Wealth, data = crime.dat)

Residuals:
     Min       1Q   Median       3Q      Max
-43.6447 -13.5895   0.4624  12.1430  55.3772

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -589.39985  167.59057  -3.517 0.001201 **
Age            1.04058    0.44639   2.331 0.025465 *
Southern      11.29464   13.24955   0.852 0.399441
Education      1.17794    0.68181   1.728 0.092620 .
Expenditure    0.96364    0.24955   3.862 0.000451 ***
Labor          0.10604    0.15327   0.692 0.493467
Males          0.30353    0.22269   1.363 0.181344
pop            0.09042    0.13866   0.652 0.518494
U1            -0.68179    0.48079  -1.418 0.164774
U2             2.15028    0.95078   2.262 0.029859 *

Wealth        -0.08309    0.09099  -0.913 0.367229
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.56 on 36 degrees of freedom
Multiple R-Squared: 0.6845, Adjusted R-squared: 0.5968
F-statistic: 7.809 on 10 and 36 DF, p-value: 1.704e-06

r = out$residuals
qqnorm(r)
dev.off()

13 Testing Subsets of Coefficients

Suppose you want to test if a set of coefficients is 0. Use

    F = ( (RSS_small − RSS_big)/(df_small − df_big) ) / ( RSS_big/df_big )    (47)

which has a F_{a,b} distribution under H0, where a = df_small − df_big and b = df_big.

13.1 Example. Let's try dropping the unemployment and labor variables.

## Fit full model
out = lm(Crime ~ Age + Southern + Education + Expenditure + Labor + Males +
         pop + U1 + U2 + Wealth, data=crime.dat)
anova(out)
> Analysis of Variance Table
>
> Response: Crime
>             Df  Sum Sq Mean Sq F value    Pr(>F)
> Age          1   550.8   550.8  0.9133 0.3456065
> Southern     1   153.7   153.7  0.2548 0.6167591
> Education    1  9056.7  9056.7 15.0166 0.0004333 ***
> Expenditure  1 30760.3 30760.3 51.0012 2.142e-08 ***
> Labor        1  1207.0  1207.0  2.0013 0.1657635
> Males        1  1381.8  1381.8  2.2906 0.1388888
> pop          1   528.8   528.8  0.8768 0.3553240
> U1           1   198.7   198.7  0.3295 0.5695451
> U2           1  2756.8  2756.8  4.5710 0.0393779 *
> Wealth       1   502.8   502.8  0.8339 0.3672287
> Residuals   36 21712.1   603.1

## drop Labor and U1 and U2
out = lm(Crime ~ Age + Southern + Education + Expenditure + Males +
         pop + Wealth, data=crime.dat)

Figure 6: Crime example (boxplots of the variables and a Normal Q-Q plot of the residuals).

9636 0. H is symetric and idempotent: H 2 = H 3. 14.8 0.3624211 >Southern 1 153. rank(X) = tr(H). 2.2370 0. The hat matrix will play an important role in all that follows.3165618 >Wealth 1 232. 4.2 0. HX = X.3 30760.2 232.6291234 >Education 1 9056. However.980343 print(pvalue) > 0.6 top = (25295.8 550. 14 The Hat Matrix Recall that Y = Xβ = X(XT X)−1 XT Y = HY where (48) H = Xβ = X(XT X)−1 XT (49) is called the hat matrix.7 3.6 1.3.7 9056.0294 0.1 648.anova(out) >Analysis of Variance Table > >Response: Crime > Df Sum Sq Mean Sq F value Pr(>F) >Age 1 550.6 667.7 2092.2265 0. H projects Y onto the column space of X.3581 0.5530417 >Residuals 39 25295.1-21712)/(39-36) bottom = 21712/36 f = top/bottom pvalue = 1-pf(f.7 13.4262 3.1 Theorem. >pop 1 667.36) print(f) > 1.7 0. 1.3 47.0005963 *** >Expenditure 1 30760. This is not a variable selection strategy.8493 0. The hat matrix is the projector onto the column space of X.1343155 We conclude that these variables are not important in the regression. The residuals are = Y − Y = Y − HY = (I − H)Y.067e-08 *** >Males 1 2092. we should only do this test if there is some a priori reason to test those variables. (50) 19 .0802032 . The hat matrix has the following properties.7 153.

14.2 Theorem. Properties of the residuals:

1. True residuals: E(ε) = 0, V(ε) = σ²I.
2. Estimated residuals: E(ε̂) = 0, V(ε̂) = σ²(I − H).
3. Σ_i ε̂i = 0 (when the columns of X include the vector of ones).
4. V(ε̂i) = σ²(1 − hii), where hii is the ith diagonal element of H.

Let's prove a few of these. First,

    E(ε̂) = (I − H)E(Y) = (I − H)Xβ = Xβ − HXβ = Xβ − Xβ = 0,   since HX = X.

Next,

    V(ε̂) = (I − H)V(Y)(I − H)ᵀ = σ²(I − H)(I − H)
          = σ²(I − H − H + H²) = σ²(I − H),   since H² = H.

To see that the sum of the residuals is 0, note that ε̂ = Y − Ŷ and, by the properties of the projection, ε̂ is perpendicular to every vector in the column space L. Let 1 denote the vector of ones. Since 1 ∈ L, we get Σ_i ε̂i = ⟨ε̂, 1⟩ = 0.

14.3 Example. Let X = (1, 1, ..., 1)ᵀ. Then

    H = (1/n) ( 1 1 ··· 1
                ...
                1 1 ··· 1 ).

The column space is V = {(a, a, ..., a)ᵀ : a ∈ R}, HY = (Ȳ, Ȳ, ..., Ȳ)ᵀ is the projection onto L, and ε̂ = Y − Ŷ.
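A numerical check of the residual properties (built-in cars data; illustration only):

```r
## Residuals via the hat matrix: they sum to 0 (the design has a column of ones)
## and the leverages h_ii lie between 1/n and 1.
X = cbind(1, cars$speed)
H = X %*% solve(t(X) %*% X) %*% t(X)
e = (diag(nrow(X)) - H) %*% cars$dist   # epsilon.hat = (I - H) Y
sum(e)                                   # numerically 0
h = diag(H)
range(h)
```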

14.4 Example. Suppose that the X matrix has two columns. Denote these columns by x1 and x2. The column space is V = {a1 x1 + a2 x2 : a1, a2 ∈ R}. The hat matrix projects Y ∈ Rⁿ onto V. See Figure 7.

Figure 7: Projection.

15 Weighted Least Squares

So far we have assumed that the εi's are independent and have the same variance. What happens if this is wrong? Suppose that Y = Xβ + ε where V(ε) = Σ, and suppose we use the usual least squares estimator β̂. Then

    E(β̂) = E((XᵀX)⁻¹XᵀY) = (XᵀX)⁻¹XᵀE(Y) = (XᵀX)⁻¹XᵀXβ = β.

So β̂ is still unbiased. Also, under weak conditions, it can be shown that β̂ is consistent (converges to β as we get more data). Thus the usual estimator has reasonable properties. However, there are two problems. First, with constant variance, the usual least squares estimator is not just unbiased: it is optimal in the sense that it is the minimum variance, linear, unbiased estimator. This is no longer true with non-constant variance. Second, and more importantly, the formula for the standard error of β̂ is wrong. To see this, recall that V(AY) = AV(Y)Aᵀ. Hence,

    V(β̂) = V((XᵀX)⁻¹XᵀY) = (XᵀX)⁻¹XᵀV(Y)X(XᵀX)⁻¹ = (XᵀX)⁻¹XᵀΣX(XᵀX)⁻¹

which is different than the usual formula.

It can be shown that the minimum variance, linear, unbiased estimator is obtained by minimizing

    RSS(β) = (Y − Xβ)ᵀΣ⁻¹(Y − Xβ).    (51)

The solution is

    β̂ = SY   where   S = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹.    (52)

This is called weighted least squares. This estimator is unbiased with variance V(β̂) = (XᵀΣ⁻¹X)⁻¹.

Here is another way to think about it. Let B denote the square root of Σ: B is a symmetric matrix that satisfies BᵀB = BBᵀ = Σ. It can be shown that B⁻¹ is the square root of Σ⁻¹. Let Z = B⁻¹Y. Then

    Z = B⁻¹Y = B⁻¹(Xβ + ε) = B⁻¹Xβ + B⁻¹ε = Mβ + δ

where M = B⁻¹X and δ = B⁻¹ε. Moreover,

    V(δ) = B⁻¹V(ε)B⁻¹ = B⁻¹ΣB⁻¹ = B⁻¹BBB⁻¹ = I.

Thus we can simply regress Z on M and do ordinary regression.

Let us look more closely at a special case. If the residuals are uncorrelated then

    Σ = σ² diag( 1/w1, 1/w2, ..., 1/wn ).

In this case,

    RSS(β) = (Y − Xβ)ᵀΣ⁻¹(Y − Xβ) ∝ Σ_{i=1}^n wi (Yi − xiᵀβ)².

Thus, in weighted least squares we are simply giving lower weight to the more variable (less precise) observations. Now we have to address the following question: where do we get the weights? Or equivalently, how do we estimate σi² = V(εi)? There are four approaches.

(1) Do a transformation to make the variances approximately equal. Then we don't need to do a weighted regression.

(2) Use external information. There are some cases where other information (besides the current data) will allow you to know (or estimate) σi. These cases are rare but they do occur. I am working on such a problem right now. It is a problem from physics and the σi are from instrument error, which is known to a good approximation.

(3) Use replications. If there are several Y values corresponding to each x value, we can use the sample variance of those Y values to estimate σi². However, it is rare that you would have so many replications.

(4) Estimate σ(x) as a function of x. Just as we can estimate the regression line, we can also estimate the variance, thinking of it as a function of x. We could assume a simple model like σ(xi) = α0 + α1 xi, for example, and then try to find a way to estimate the parameters α0 and α1 from the data. In fact, we will do something more ambitious: we will estimate σ(x) assuming only that it is a smooth function of x. We will do this later in the course when we discuss nonparametric regression.
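Weighted least squares is available in lm() through the weights argument. The following sketch (simulated data with known weights, made up for the illustration) checks it against the formula β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹Y with Σ = diag(1/w):

```r
## WLS by hand versus lm(..., weights = w).
set.seed(2)
n = 50
x = runif(n)
w = runif(n, .5, 2)                       # known weights: V(eps_i) = 1/w_i
y = 1 + 2 * x + rnorm(n, sd = 1/sqrt(w))
X = cbind(1, x)
beta.wls = solve(t(X) %*% (w * X), t(X) %*% (w * y))  # w * X scales the rows
as.vector(beta.wls)
coef(lm(y ~ x, weights = w))              # same numbers
```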

16 The Predictive Viewpoint

The main motivation for studying regression is prediction. Suppose we observe X and then predict Y with g(X). Recall that the prediction error, or prediction risk, is R(g) = E(Y − g(X))², and this is minimized by taking g(x) = r(x) where r(x) = E(Y|X = x).

One way to think about linear regression is as follows. Consider the set of linear predictors

    L = { ℓ(x) = xᵀβ : β ∈ Rᵖ }.

(We assume usually that x1 = 1.) The best linear predictor, or linear oracle, is ℓ∗(x) = xᵀβ∗ where

    R(ℓ∗) = min_{ℓ ∈ L} R(ℓ).

In other words, ℓ∗(x) = xᵀβ∗ gives the smallest prediction error of all linear predictors. Note that ℓ∗ is well-defined even without assuming that the true regression function is linear. When we are using least squares, we are trying to estimate the linear oracle, not the true regression function.

Let us make the connection between the best linear predictor and least squares more explicit. We have

    R(β) = E(Y²) − 2E(YXᵀβ) + βᵀE(XXᵀ)β = E(Y²) − 2E(YXᵀβ) + βᵀΣβ

where Σ = E(XXᵀ). By differentiating R(β) with respect to β and setting the derivative equal to 0, we see that the best value of β is

    β∗ = Σ⁻¹C    (53)

where C is the p × 1 vector whose jth element is E(YXj). We can estimate Σ with the matrix Σ̂n = n⁻¹XᵀX and we can estimate C with n⁻¹XᵀY. An estimate of the oracle is thus β̂ = (XᵀX)⁻¹XᵀY, which is the least squares estimator.

17 The Bias-Variance Decomposition

Let r̂(x) be any predictor. Then R = E(Y − r̂(X))² = ∫ R(x) f(x) dx where R(x) = E((Y − r̂(X))² | X = x). Let r̄(x) = E(r̂(x)), V(x) = V(r̂(x)) and σ²(x) = V(Y|X = x). Now

    R(x) = E( ((Y − r(x)) + (r(x) − r̄(x)) + (r̄(x) − r̂(x)))² | X = x )
         = σ²(x) + (r(x) − r̄(x))² + V(x)
           [irreducible error + bias squared + variance].    (54)

We call (54) the bias-variance decomposition. If we combine the last two terms, we can also write

$$R(x) = \sigma^2(x) + \mathrm{MSE}_n(x)$$

where $\mathrm{MSE}_n(x) = \mathbb{E}\big((\hat r(X) - r(X))^2 \,\big|\, X = x\big)$ is the conditional mean squared error of $\hat r(x)$. Now

$$R = \int R(x) f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n R(X_i) \equiv R_{\mathrm{av}}$$

and $R_{\mathrm{av}}$ is called the average prediction risk. We have

$$R_{\mathrm{av}} = \frac{1}{n}\sum_{i=1}^n R(X_i) = \frac{1}{n}\sum_{i=1}^n \sigma^2(X_i) + \frac{1}{n}\sum_{i=1}^n (\bar r(X_i) - r(X_i))^2 + \frac{1}{n}\sum_{i=1}^n V(X_i).$$

Another Property of Least Squares. Let $\mathcal{L}$ denote all linear functions of the form $\sum_j \beta_j x_j$. For any vector $v$ define the norm $\|v\|_n^2 = (1/n)\sum_{i=1}^n v_i^2$ and for any function $h$ write $\|h\|_n^2 = (1/n)\sum_{i=1}^n h^2(X_i)$.

17.1 Theorem.

$$\mathbb{E}\Big(\|\hat r - r\|_n^2 \,\Big|\, X_1,\dots,X_n\Big) \le \frac{q\sigma^2}{n} + \min_{f \in \mathcal{L}} \|f - r\|_n^2$$

where $q$ is the number of parameters. Thus, $\hat r$ is a close linear approximation to $r$. If the regression function $r(x)$ is actually linear, so that $r(x) = x^T\beta$, then the least squares estimator is unbiased and has variance matrix $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$.

Define the training error $\hat R_{\mathrm{training}} = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_i)^2$ where $\hat Y_i = \hat r(X_i)$. We might guess that $\hat R_{\mathrm{training}}$ estimates the prediction error well, but this is not true. To see this, let $\bar r_i = \mathbb{E}(\hat r(X_i))$ and compute

$$\mathbb{E}(Y_i - \hat Y_i)^2 = \mathbb{E}\big(Y_i - r(X_i) + r(X_i) - \bar r(X_i) + \bar r(X_i) - \hat Y_i\big)^2 = \sigma^2 + \mathbb{E}(r(X_i) - \bar r(X_i))^2 + \mathbb{V}(\hat r(X_i)) - 2\,\mathrm{Cov}(Y_i, \hat Y_i).$$

Hence,

$$\mathbb{E}(\hat R_{\mathrm{training}}) = \mathbb{E}(R_{\mathrm{av}}) - \frac{2}{n}\sum_{i=1}^n \mathrm{Cov}(Y_i, \hat Y_i). \qquad (55)$$

Typically, $\mathrm{Cov}(Y_i, \hat Y_i) > 0$ and so $\hat R_{\mathrm{training}}$ underestimates the risk.

Summary

1. The linear oracle, or best linear predictor, is $x^T\beta_*$ where $\beta_* = \Sigma^{-1} C$.

2. An estimate of $\beta_*$ is the least squares estimator $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Y$. We can regard $\hat\beta$ as an estimate of the linear oracle. If the regression function is actually linear, then $\hat\beta$ is unbiased with variance matrix $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$.

3. The predicted values are $\hat Y = \mathbf{X}\hat\beta = HY$ where $H = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix, which projects $Y$ onto the column space of $\mathbf{X}$.
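Equation (55) says the training error is biased downward by $(2/n)\sum_i \mathrm{Cov}(Y_i,\hat Y_i)$; for least squares this sum equals $\sigma^2 \mathrm{trace}(H) = p\sigma^2$, so the gap is $2p\sigma^2/n$. A small Monte Carlo sketch (Python, simulated linear model; an illustration, not part of the notes' R code):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 30, 5, 500
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta = np.array([1.0, 2.0, -1.0, 0.5, 0.0])
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix

train_mse, fresh_mse = [], []
for _ in range(reps):
    y = X @ beta + rng.standard_normal(n)       # sigma^2 = 1
    y_hat = H @ y
    y_star = X @ beta + rng.standard_normal(n)  # fresh responses, same design
    train_mse.append(np.mean((y - y_hat)**2))   # training error
    fresh_mse.append(np.mean((y_star - y_hat)**2))  # honest prediction error
```

On average the training error sits below the prediction error by roughly $2p\sigma^2/n$.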

[Figure 8: The Anscombe example. Four scatterplots of y against x, with the same fitted line in each panel.]

18 Diagnostics

Figure 8 shows a famous example: four different data sets with the same fit. The moral: looking at the fit is not enough. We should also use some diagnostics. Generally, we diagnose problems by looking at the residuals. When we do this, we are looking for: (1) outliers, (2) influential points, (3) nonconstant variance, (4) nonlinearity, (5) nonnormality. The remedies are:

Problem                    Remedy
1. Outliers                Non-influential: don't worry about it.
                           Influential: remove or use robust regression.
2. Influential points      Remove or use robust regression. Fit the regression
                           with and without the point and report both analyses.
3. Nonconstant variance    Use a transformation or nonparametric methods.
                           Note: doesn't affect the fit too much.
4. Nonlinearity            Use a transformation or nonparametric methods.
5. Nonnormality            Mainly an issue for confidence intervals.
                           Large samples: not a problem.
                           Small samples: use transformations.

Three types of residuals:

Name                    Formula                                                    R command (assume lm output is in tmp)
residual                $\hat\epsilon_i = Y_i - \hat Y_i$                          resid(tmp)
standardized residual   $(Y_i - \hat Y_i)/(\hat\sigma\sqrt{1-h_{ii}})$             rstandard(tmp)
studentized residual    $(Y_i - \hat Y_i)/(\hat\sigma_{(i)}\sqrt{1-h_{ii}})$       rstudent(tmp)

19 Outliers

Outliers can be found (i) graphically or (ii) by testing. Let us write

$$Y_j = \begin{cases} x_j^T\beta + \epsilon_j & j \ne i \\ x_j^T\beta + \delta + \epsilon_j & j = i. \end{cases}$$

Test $H_0$: case $i$ is not an outlier ($\delta = 0$) versus $H_1$: case $i$ is an outlier ($\delta \ne 0$). Do the following:

(1) Delete case $i$.
(2) Compute $\hat\beta_{(i)}$ and $\hat\sigma_{(i)}$.
(3) Predict the deleted case: $\hat Y_i = x_i^T \hat\beta_{(i)}$.
(4) Compute $t_i = \dfrac{Y_i - \hat Y_i}{\widehat{\mathrm{se}}(Y_i - \hat Y_i)}$.
(5) Reject $H_0$ if the p-value is less than $\alpha/n$.

Note that

$$\mathbb{V}(Y_i - \hat Y_i) = \mathbb{V}(Y_i) + \mathbb{V}(\hat Y_i) = \sigma^2 + \sigma^2 x_i^T(\mathbf{X}_{(i)}^T\mathbf{X}_{(i)})^{-1}x_i.$$

So,

$$\widehat{\mathrm{se}}(Y_i - \hat Y_i) = \hat\sigma\sqrt{1 + x_i^T(\mathbf{X}_{(i)}^T\mathbf{X}_{(i)})^{-1}x_i}.$$

How do the residuals come into this?

Internally studentized residuals: $r_i = \dfrac{\hat\epsilon_i}{\hat\sigma\sqrt{1-h_{ii}}}$.

Externally studentized residuals: $r_{(i)} = \dfrac{\hat\epsilon_i}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}}$.

19.1 Theorem.

$$t_i = r_i \sqrt{\frac{n-p-2}{n-p-1-r_i^2}} = r_{(i)}.$$

20 Influence

Cook's distance:

$$D_i = \frac{(\hat Y_{(i)} - \hat Y)^T(\hat Y_{(i)} - \hat Y)}{q\hat\sigma^2} = \frac{r_i^2}{q}\,\frac{h_{ii}}{1-h_{ii}}$$

where $\hat Y = \mathbf{X}\hat\beta$ and $\hat Y_{(i)} = \mathbf{X}\hat\beta_{(i)}$. Points with $D_i \ge 1$ might be influential. Points near the edge of the covariate space are typically the influential points.
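The two expressions for Cook's distance can be verified against each other: the closed form $D_i = (r_i^2/q)\, h_{ii}/(1-h_{ii})$ should match the definition obtained by actually deleting case $i$ and refitting. A sketch in Python (the notes use R; simulated data, $q$ = number of parameters):

```python
import numpy as np

def regression_diagnostics(X, y):
    """Hat diagonal, internally studentized residuals, and Cook's distance
    from the closed-form identities."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix
    h = np.diag(H)
    resid = y - H @ y
    sigma2 = resid @ resid / (n - p)            # full-model sigma^2 estimate
    r = resid / np.sqrt(sigma2 * (1 - h))       # internally studentized
    D = (r**2 / p) * h / (1 - h)                # Cook's distance
    return h, r, D

def cooks_distance_by_deletion(X, y, i):
    """Direct definition: refit without case i, compare fitted values."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    mask = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    d = X @ beta - X @ beta_i
    sigma2 = np.sum((y - X @ beta)**2) / (n - p)
    return d @ d / (p * sigma2)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.standard_normal((30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(30)
h, r, D = regression_diagnostics(X, y)
```

The agreement is exact (up to floating point), which is why software never needs to refit $n$ times to report Cook's distances.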

5.1 Example (Rats).

    data = c(176, 6.5, 0.88, 0.42,
             176, 9.5, 0.88, 0.25,
             190, 9.0, 1.00, 0.56,
             ...)
    data = matrix(data, ncol=4, byrow=T)
    bwt  = data[,1]
    lwt  = data[,2]
    dose = data[,3]
    y    = data[,4]
    n    = length(y)
    out  = lm(y ~ bwt + lwt + dose, qr=TRUE)
    summary(out)

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.265922   0.194585   1.367   0.1919
    bwt         -0.021246   0.007974  -2.664   0.0177 *
    lwt          0.014298   0.017217   0.830   0.4193
    dose         4.178111   1.522625   2.744   0.0151 *
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.07729 on 15 degrees of freedom
    Multiple R-Squared: 0.3639,    Adjusted R-squared: 0.2367
    F-statistic: 2.86 on 3 and 15 DF,  p-value: 0.07197

    diagnostics = ls.diag(out)
    names(diagnostics)
    "std.dev"      "hat"          "std.res"      "stud.res"     "cooks"
    "dfits"        "correlation"  "std.err"      "cov.scaled"   "cov.unscaled"

    plot(fitted(out), diagnostics$stud.res, pch=19); abline(h=0)
    plot(bwt,  diagnostics$stud.res, pch=19); abline(h=0)
    plot(lwt,  diagnostics$stud.res, pch=19); abline(h=0)
    plot(dose, diagnostics$stud.res, pch=19); abline(h=0)
    qqnorm(diagnostics$stud.res, pch=19); abline(a=0, b=1)

    ### Another way to get the residuals
    r = rstandard(out)   ### standardized
    r = rstudent(out)    ### studentized
    plot(fitted(out), rstudent(out), pch=19); abline(h=0)

    ### More diagnostics
    I = influence.measures(out)
    names(I)
    "infmat"  "is.inf"  "call"
    ### columns of I$infmat:
    ### dfb.1_  dfb.bwt  dfb.lwt  dfb.dose  dffit  cov.r  cook.d  hat

    I$infmat[1:5,]     ### influence measures for the first five cases,
                       ### one row per case with the columns listed above

    cook = I$infmat[,7]
    plot(cook, type="h", lwd=3, col="red")

    ### remove third case
    y    = y[-3]
    bwt  = bwt[-3]
    lwt  = lwt[-3]
    dose = dose[-3]
    out  = lm(y ~ bwt + lwt + dose)
    summary(out)

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.311427   0.205094   1.518    0.151
    bwt         -0.007783   0.018717  -0.416    0.684
    lwt          0.008989   0.018659   0.482    0.637
    dose         1.484877   3.713064   0.400    0.695

    Residual standard error: 0.07825 on 14 degrees of freedom
    Multiple R-Squared: 0.02106,    Adjusted R-squared: -0.1887
    F-statistic: 0.1004 on 3 and 14 DF,  p-value: 0.9585

Warning! Notice the command:

    out = lm(y ~ bwt + lwt + dose, qr=TRUE)

You need the qr=TRUE option if you want to use ls.diag.

21 Tweaking the Regression

If residual plots indicate some problem, we need to apply some remedies. Possible remedies are:

- Transformation
- Robust regression
- Nonparametric regression

Examples of transformations: $\sqrt{Y}$, $\log(Y)$, $\log(Y + c)$, $1/Y$. These can be applied to $Y$ or to $x$. We transform to make the assumptions valid, not to chase statistical significance.

21.1 Example (Bacteria). This example is from Chatterjee and Price (1991, p 36). Bacteria were exposed to radiation. Figure 10 shows the number of surviving bacteria versus time of exposure to radiation. The program and output look like this.

    > time = 1:15
    > survivors = c(355,211,197,166,142,106,104,60,56,38,36,32,21,19,15)
    > plot(time,survivors)
    > out = lm(survivors ~ time)
    > abline(out)
    > plot(out,which=c(1,2,4))
    > print(summary(out))

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   259.58      22.73   11.42 3.78e-08 ***
    time          -19.46       2.50   -7.79 3.01e-06 ***

    Residual standard error: 41.83 on 13 degrees of freedom
    Multiple R-Squared: 0.8234,    Adjusted R-squared: 0.8098
    F-statistic: 60.62 on 1 and 13 DF,  p-value: 3.006e-06

The residual plot suggests a problem. Consider the following transformation. (Look at Figure 6.3, p 132 of Weisberg.)

    > logsurv = log(survivors)
    > plot(time,logsurv)
    > out = lm(logsurv ~ time)
    > abline(out)
    > plot(out,which=c(1,2,4))
    > print(summary(out))

[Figure 9: Rat Data. Studentized residuals plotted against the fitted values, bwt, lwt and dose; normal Q-Q plot of the studentized residuals; rstudent values and Cook's distances by case number.]

[Figure 10: Bacteria Data. Survivors versus time with the fitted line; residuals vs fitted values; normal Q-Q plot; Cook's distance plot.]

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  5.973160   0.059778   99.92  < 2e-16 ***
    time        -0.218425   0.006575  -33.22 5.86e-14 ***
    ---
    Residual standard error: 0.11 on 13 degrees of freedom
    Multiple R-Squared: 0.9884,    Adjusted R-squared: 0.9875
    F-statistic: 1104 on 1 and 13 DF,  p-value: 5.86e-14

Much better. Check out Figure 11. In fact, theory predicts that

$$N_t = N_0 e^{\beta t}$$

where $N_t$ is the number of survivors at exposure $t$ and $N_0$ is the number of bacteria before exposure. So the fact that the log transformation is useful here is not surprising.
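The same log-linear fit can be reproduced outside R; a quick sketch in Python using the survivor counts listed above (an illustrative translation of the R session, not part of the original notes):

```python
import numpy as np

time = np.arange(1, 16)
survivors = np.array([355, 211, 197, 166, 142, 106, 104, 60,
                      56, 38, 36, 32, 21, 19, 15])

# Least squares fit of log(survivors) on time.
slope, intercept = np.polyfit(time, np.log(survivors), 1)
beta_hat = slope                 # estimated decay rate beta
n0_hat = np.exp(intercept)       # estimated N_0
```

The slope and intercept match the R output (-0.218425 and 5.973160), and exponentiating the intercept recovers the implied initial count $N_0$.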

[Figure 11: Bacteria Data, after the log transformation. log survivors versus time with the fitted line; residuals vs fitted values; normal Q-Q plot; Cook's distance plot.]

22 Qualitative Variables

If $X_i \in \{0, 1\}$, then it is called a dummy variable. More generally, if $x$ takes discrete values, it is called a qualitative variable or a factor.

Let $d$ be a dummy variable. Consider

$$\mathbb{E}(Y) = \beta_0 + \beta_1 x + \beta_2 d.$$

Then:

coefficient   d = 0     d = 1
intercept     β0        β0 + β2
slope         β1        β1

These are parallel lines. Now consider this model:

$$\mathbb{E}(Y) = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 x d.$$

Then:

coefficient   d = 0     d = 1
intercept     β0        β0 + β2
slope         β1        β1 + β3

These are nonparallel lines.

To include a discrete variable with $k$ levels, use $k - 1$ dummy variables. For example, if $z \in \{1, 2, 3\}$, do this:

z    d1   d2
1     1    0
2     0    1
3     0    0

In the model $Y = \beta_0 + \beta_1 d_1 + \beta_2 d_2 + \beta_3 x + \epsilon$ we see

$$\mathbb{E}(Y|z=1) = \beta_0 + \beta_1 + \beta_3 x$$
$$\mathbb{E}(Y|z=2) = \beta_0 + \beta_2 + \beta_3 x$$
$$\mathbb{E}(Y|z=3) = \beta_0 + \beta_3 x.$$

You should not create $k$ dummy variables because they will not be linearly independent: if we added $d_3$ above we would have $d_3 = 1 - d_1 - d_2$, and then $\mathbf{X}^T\mathbf{X}$ is not invertible.

22.1 Example. Salary data from Chatterjee and Price, p 96.

    ## salary example, p 97 of Chatterjee and Price
    sdata = read.table("salaries.dat", skip=1)
    names(sdata) = c("salary","experience","education","management")
    attach(sdata)
    n = length(salary)
    d1 = rep(0,n)
    d1[education==1] = 1
    d2 = rep(0,n)
    d2[education==2] = 1
    out1 = lm(salary ~ experience + d1 + d2 + management)
    summary(out1)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 11031.81     383.22  28.787  < 2e-16 ***
    experience    546.18      30.52  17.896  < 2e-16 ***
    d1          -2996.21     411.75  -7.277 6.72e-09 ***
    d2            147.82     387.66   0.381    0.705
    management   6883.53     313.92  21.928  < 2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1027 on 41 degrees of freedom
    Multiple R-Squared: 0.9568,    Adjusted R-squared: 0.9525
    F-statistic: 226.8 on 4 and 41 DF,  p-value: < 2.2e-16

Interpretation: each year of experience increases our prediction by 546 dollars. The increment for a management position is 6883 dollars. Compare bachelors to high school. For high school, $d_1 = 1$ and $d_2 = 0$, so

$$\mathbb{E}(Y) = \beta_0 + \beta_1\,\text{experience} - 2996 + \beta_4\,\text{management}.$$

For bachelors, $d_1 = 0$ and $d_2 = 1$, so

$$\mathbb{E}(Y) = \beta_0 + \beta_1\,\text{experience} + 147 + \beta_4\,\text{management}.$$

So $\mathbb{E}_{\text{bach}}(Y) - \mathbb{E}_{\text{high}}(Y) = 3144$.

    ### another way
    ed = as.factor(education)
    out2 = lm(salary ~ experience + ed + management)
    summary(out2)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8035.60     386.69  20.781  < 2e-16 ***
    experience    546.18      30.52  17.896  < 2e-16 ***
    ed2          3144.04     361.97   8.686 7.73e-11 ***
    ed3          2996.21     411.75   7.277 6.72e-09 ***
    management   6883.53     313.92  21.928  < 2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1027 on 41 degrees of freedom
    Multiple R-Squared: 0.9568,    Adjusted R-squared: 0.9525
    F-statistic: 226.8 on 4 and 41 DF,  p-value: < 2.2e-16

Apparently, R codes the dummy variables differently:

    level         mean   d1  d2  ed2  ed3
    high-school   8036    1   0    0    0
    BS           11179    0   1    1    0
    advanced     11032    0   0    0    1

You can change the way R does this. Do help(C) and help(contr.treatment).
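The "use $k-1$ dummies" rule is a rank condition, easy to check numerically. A sketch in Python with hypothetical data (not the salary data; the 3-level factor and sample size are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 90
z = rng.integers(1, 4, n)            # a factor with levels 1, 2, 3
x = rng.uniform(0, 10, n)
d1 = (z == 1).astype(float)
d2 = (z == 2).astype(float)
d3 = (z == 3).astype(float)

# k-1 = 2 dummies plus intercept and x: full column rank.
X_good = np.column_stack([np.ones(n), d1, d2, x])

# All k = 3 dummies: the columns satisfy 1 = d1 + d2 + d3,
# so the design matrix is rank-deficient and X^T X is singular.
X_bad = np.column_stack([np.ones(n), d1, d2, d3, x])
```

The rank of `X_bad` is one less than its number of columns, which is exactly why `lm` drops one level automatically.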

23 Variable Selection

If the dimension $p$ of the covariate $X$ is large, then we might get better predictions by omitting some covariates. Models with many covariates have low bias but high variance; models with few covariates have high bias but low variance. The best predictions come from balancing these two extremes. This is called the bias-variance tradeoff. To reiterate:

including many covariates leads to low bias and high variance;
including few covariates leads to high bias and low variance.

The problem of deciding which variables to include in the regression model to achieve a good tradeoff is called model selection or variable selection.

It is convenient in model selection to first standardize all the variables by subtracting off the mean and dividing by the standard deviation. Thus, we replace $X_{ij}$ with $(X_{ij} - \overline X_j)/s_j$ where $\overline X_j = n^{-1}\sum_{i=1}^n X_{ij}$ is the mean of covariate $X_j$ and $s_j$ is its standard deviation. The R function scale will do this for you. Thus, we assume throughout this section that

$$\frac{1}{n}\sum_{i=1}^n Y_i = 0, \quad \frac{1}{n}\sum_{i=1}^n Y_i^2 = 1, \qquad (56)$$
$$\frac{1}{n}\sum_{i=1}^n X_{ij} = 0, \quad \frac{1}{n}\sum_{i=1}^n X_{ij}^2 = 1, \quad j = 1,\dots,p. \qquad (57)$$

Given $S \subset \{1,\dots,p\}$, let $(X_j : j \in S)$ denote a subset of the covariates. There are $2^p$ such subsets. Let $\beta(S) = (\beta_j : j \in S)$ denote the coefficients of the corresponding set of covariates and let

$$\hat\beta(S) = (\mathbf{X}_S^T\mathbf{X}_S)^{-1}\mathbf{X}_S^T Y$$

denote the least squares estimate of $\beta(S)$, where $\mathbf{X}_S$ denotes the design matrix for this subset of covariates. Thus, $\hat\beta(S)$ is the least squares estimate of $\beta(S)$ from the submodel $Y = \mathbf{X}_S\beta(S) + \epsilon$. The vector of predicted values from model $S$ is $\hat Y(S) = \mathbf{X}_S\hat\beta(S)$. For the null model $S = \emptyset$, $\hat Y$ is defined to be a vector of 0's. Let $\hat r_S(x) = \sum_{j\in S}\hat\beta_j(S)x_j$ denote the estimated regression function for the submodel.

We measure the predictive quality of the model via the prediction risk. The prediction risk of the submodel $S$ is defined to be

$$R(S) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}(\hat Y_i(S) - Y_i^*)^2 \qquad (58)$$

where $Y_i^* = r(X_i) + \epsilon_i^*$ denotes the value of a future observation of $Y$ at covariate value $X_i$. Ideally, we want to select a submodel $S$ to make $R(S)$ as small as possible. We face two problems. The first is estimating $R(S)$ and the second is searching through all the submodels $S$.

24 The Bias-Variance Tradeoff

Before discussing the estimation of the prediction risk, we recall an important result.

Bias-Variance Decomposition of the Prediction Risk

$$R(S) = \underbrace{\sigma^2}_{\text{unavoidable error}} + \underbrace{\frac{1}{n}\sum_{i=1}^n b_i^2}_{\text{squared bias}} + \underbrace{\frac{1}{n}\sum_{i=1}^n v_i}_{\text{variance}} \qquad (59)$$

where $b_i = \mathbb{E}(\hat r_S(X_i)|X_i) - r(X_i)$ is the bias and $v_i = \mathbb{V}(\hat r_S(X_i)|X_i)$ is the variance.

Let us look at the bias-variance tradeoff in some simpler settings.

24.1 Example. Suppose that $Y \sim N(\mu, \sigma^2)$. The minimum variance unbiased estimator of $\mu$ is $Y$. Now consider the estimator $\hat\mu = \alpha Y$ where $0 \le \alpha \le 1$. The bias is $\mathbb{E}(\hat\mu) - \mu = (\alpha - 1)\mu$, the variance is $\sigma^2\alpha^2$, and the mean squared error is

$$\text{bias}^2 + \text{variance} = (1-\alpha)^2\mu^2 + \sigma^2\alpha^2. \qquad (60)$$

Notice that the bias increases and the variance decreases as $\alpha \to 0$. Conversely, the bias decreases and the variance increases as $\alpha \to 1$. The optimal estimator is obtained by taking $\alpha = \mu^2/(\sigma^2 + \mu^2)$.

24.2 Example. We want to estimate $\mu = (\mu_1,\dots,\mu_p)^T$. Consider the following model:

$$Y_i \sim N(\mu_i, \sigma^2), \quad i = 1,\dots,p. \qquad (61)$$

Fix $1 \le k \le p$ and let

$$\hat\mu_i = \begin{cases} Y_i & i \le k \\ 0 & i > k. \end{cases} \qquad (62)$$

The MSE is

$$R = \text{bias}^2 + \text{variance} = \sum_{i=k+1}^p \mu_i^2 + k\sigma^2. \qquad (63)$$

As $k$ increases, the bias term decreases and the variance term increases. Since $\mathbb{E}(Y_i^2 - \sigma^2) = \mu_i^2$, we can form an unbiased estimate of the risk, namely,

$$\hat R = \sum_{i=k+1}^p (Y_i^2 - \sigma^2) + k\sigma^2 = \mathrm{RSS} + 2k\sigma^2 - p\sigma^2 = \mathrm{RSS} + 2k\sigma^2 + \text{constant} \qquad (64)$$

where $\mathrm{RSS} = \sum_{i=1}^p (Y_i - \hat\mu_i)^2$. We can estimate the optimal choice of $k$ by minimizing

$$\mathrm{RSS} + 2k\sigma^2 \qquad (65)$$

over $k$.

24.1 Risk Estimation and Model Scoring

An obvious candidate to estimate $R(S)$ is the training error

$$\hat R_{\mathrm{tr}}(S) = \frac{1}{n}\sum_{i=1}^n (\hat Y_i(S) - Y_i)^2. \qquad (66)$$

For the null model $S = \emptyset$, $\hat Y_i = 0$, $i = 1,\dots,n$, and $\hat R_{\mathrm{tr}}(S)$ is an unbiased estimator of $R(S)$; this is the risk estimator we will use for this model. But in general, this is a poor estimator of $R(S)$ because it is very biased. Indeed, if we add more and more covariates to the model, we can track the data better and better and make $\hat R_{\mathrm{tr}}(S)$ smaller and smaller. Thus, if we used $\hat R_{\mathrm{tr}}(S)$ for model selection we would be led to include every covariate in the model.
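The optimal shrinkage level in Example 24.1 can be confirmed by brute force: minimize the MSE curve $(1-\alpha)^2\mu^2 + \sigma^2\alpha^2$ over a grid and compare with the formula $\alpha = \mu^2/(\sigma^2+\mu^2)$. A sketch (Python; the specific $\mu$ and $\sigma$ values are arbitrary):

```python
import numpy as np

mu, sigma = 2.0, 1.0
alpha = np.linspace(0, 1, 10001)
mse = (1 - alpha)**2 * mu**2 + sigma**2 * alpha**2   # bias^2 + variance

alpha_opt_numeric = alpha[np.argmin(mse)]
alpha_opt_formula = mu**2 / (sigma**2 + mu**2)       # = 0.8 for these values
```

At $\alpha = 0$ the estimator is all bias; at $\alpha = 1$ it is all variance; the minimum sits strictly between.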

24.3 Theorem. The training error is a downward-biased estimate of the prediction risk, meaning that $\mathbb{E}(\hat R_{\mathrm{tr}}(S)) < R(S)$. In fact,

$$\mathrm{bias}(\hat R_{\mathrm{tr}}(S)) = \mathbb{E}(\hat R_{\mathrm{tr}}(S)) - R(S) = -\frac{2}{n}\sum_{i=1}^n \mathrm{Cov}(\hat Y_i, Y_i). \qquad (67)$$

Now we discuss some better estimates of risk.

Mallows' Cp statistic is defined by

$$\hat R(S) = \hat R_{\mathrm{tr}}(S) + \frac{2|S|\hat\sigma^2}{n} \qquad (68)$$

where $|S|$ denotes the number of terms in $S$ and $\hat\sigma^2$ is the estimate of $\sigma^2$ obtained from the full model (with all covariates in the model). This is simply the training error plus a bias correction. This estimate is named in honor of Colin Mallows, who invented it. The first term in (68) measures the fit of the model while the second measures the complexity of the model. Think of the Cp statistic as:

$$\text{lack of fit} + \text{complexity penalty}. \qquad (69)$$

The disadvantage of Cp is that we need to supply an estimate of $\sigma$.

Another method for estimating risk is leave-one-out cross-validation. The leave-one-out cross-validation (CV) estimator of risk is

$$\hat R_{\mathrm{CV}}(S) = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_{(i)})^2 \qquad (70)$$

where $\hat Y_{(i)}$ is the prediction for $Y_i$ obtained by fitting the model with $Y_i$ omitted. It can be shown that

$$\hat R_{\mathrm{CV}}(S) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat Y_i(S)}{1 - H_{ii}(S)}\right)^2 \qquad (71)$$

where $H_{ii}(S)$ is the $i$th diagonal element of the hat matrix

$$H(S) = \mathbf{X}_S(\mathbf{X}_S^T\mathbf{X}_S)^{-1}\mathbf{X}_S^T. \qquad (72)$$

From equation (71) it follows that we can compute the leave-one-out cross-validation estimator without actually dropping out each observation and refitting the model. An important advantage of cross-validation is that it does not require an estimate of $\sigma$.

We can relate CV to Cp as follows. First, approximate each $H_{ii}(S)$ with the average value $n^{-1}\sum_{i=1}^n H_{ii}(S) = \mathrm{trace}(H(S))/n = |S|/n$. This yields

$$\hat R_{\mathrm{CV}}(S) \approx \frac{1}{n}\,\frac{\mathrm{RSS}(S)}{\left(1 - \frac{|S|}{n}\right)^2}. \qquad (73)$$

The right hand side of (73) is called the generalized cross-validation (GCV) score and will come up again later. Next, use the fact that $1/(1-x)^2 \approx 1 + 2x$ and conclude that

$$\hat R_{\mathrm{CV}}(S) \approx \hat R_{\mathrm{tr}}(S) + \frac{2\hat\sigma^2|S|}{n} \qquad (74)$$

where $\hat\sigma^2 = \mathrm{RSS}(S)/n$. This is identical to Cp except that the estimator of $\sigma^2$ is different.

Another criterion for model selection is AIC (Akaike Information Criterion). The idea is to choose $S$ to maximize

$$\mathrm{AIC}(S) = \hat\ell_S - |S| \qquad (75)$$
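The shortcut (71) is worth seeing in code: the leave-one-out CV score computed by refitting $n$ times agrees exactly with the hat-matrix formula from a single fit. A Python sketch with simulated data (illustrative; the notes use R):

```python
import numpy as np

def loocv_refit(X, y):
    """Leave-one-out CV risk by actually refitting n times (definition (70))."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        errs.append((y[i] - X[i] @ beta)**2)
    return np.mean(errs)

def loocv_shortcut(X, y):
    """Same quantity via the hat-matrix identity (71): one fit, no deletion."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    resid = y - H @ y
    return np.mean((resid / (1 - np.diag(H)))**2)

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(40), rng.standard_normal((40, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.standard_normal(40)
```

The identity is exact for linear smoothers, which is what makes leave-one-out CV cheap for linear regression.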

where $\hat\ell_S$ is the log-likelihood (assuming Normal errors) of the model evaluated at the MLE. This can be thought of as "goodness of fit" minus "complexity."¹ Assuming Normal errors,

$$\ell(\beta, \sigma^2) = \text{constant} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\|Y - \mathbf{X}\beta\|^2.$$

Inserting $\hat\beta$ yields

$$\ell(\hat\beta, \sigma^2) = \text{constant} - \frac{n}{2}\log\sigma^2 - \frac{\mathrm{RSS}}{2\sigma^2}.$$

But $\sigma^2$ is unknown too. Inserting the MLE $\hat\sigma^2 = \mathrm{RSS}(S)/n$ yields, up to a constant,

$$\mathrm{AIC}(S) = -\frac{n}{2}\log\left(\frac{\mathrm{RSS}(S)}{n}\right) - |S|. \qquad (76)$$

If instead we take $\sigma$ equal to its estimate from the largest model, then maximizing AIC is equivalent to minimizing Mallows' Cp. To see this, note that in this case

$$\ell(\hat\beta, \sigma^2) = \text{constant} - \frac{\mathrm{RSS}}{2\sigma^2},$$

so maximizing $\hat\ell_S - |S|$ is the same as minimizing $\mathrm{RSS}/\hat\sigma^2 + 2|S|$, which is essentially Cp.

Also, let us compare the AIC of one model to the full model. Let $m = |S|$, let $q$ be the number of terms in the full model, and let $\hat\sigma^2 = \mathrm{RSS}_{\mathrm{full}}/n$, which is the MLE of $\sigma^2$ under the full model. We will use the fact that $\log x \approx x - 1$. Then

$$\mathrm{AIC} - \mathrm{AIC}_{\mathrm{full}} = n\log\frac{\mathrm{RSS}}{n\hat\sigma^2} + 2(m-q) \approx n\left(\frac{\mathrm{RSS}}{n\hat\sigma^2} - 1\right) + 2(m-q) = \frac{\mathrm{RSS}}{\hat\sigma^2} - n + 2(m-q)$$

which corresponds to Cp.

Yet another criterion for model selection is BIC (Bayesian information criterion). Here we choose a model to maximize

$$\mathrm{BIC}(S) = \hat\ell_S - \frac{|S|}{2}\log n = -\frac{n}{2}\log\left(\frac{\mathrm{RSS}(S)}{n}\right) - \frac{|S|}{2}\log n. \qquad (78)$$

The BIC score has a Bayesian interpretation. Let $\mathcal{S} = \{S_1,\dots,S_m\}$, where $m = 2^p$, denote all the models. Suppose we assign the prior $\mathbb{P}(S_j) = 1/m$ over the models. Also, assume we put a smooth prior on the parameters within each model. It can be shown that the posterior probability for a model is approximately

$$\mathbb{P}(S_j \,|\, \text{data}) \approx \frac{e^{\mathrm{BIC}(S_j)}}{\sum_r e^{\mathrm{BIC}(S_r)}}. \qquad (79)$$

Hence, choosing the model with highest BIC is like choosing the model with highest posterior probability. The BIC score also has an information-theoretic interpretation in terms of something called minimum description length. The BIC score is similar to Mallows' Cp except that it puts a more severe penalty on complexity. It thus leads one to choose a smaller model than the other methods.

¹ Some texts use a slightly different definition of AIC which involves multiplying the definition here by 2 or -2. This has no effect on which model is selected. Some texts define AIC as

$$\mathrm{AIC}(S) = -2\hat\ell_S + 2|S| = n\log\left(\frac{\mathrm{RSS}(S)}{n}\right) + 2|S| \qquad (77)$$

in which case we minimize AIC instead of maximizing.

    Model                            CV Score
    Null Model (Yi = 0)              1.000
    Age                              0.973
    Expenditure                      0.635
    Population                       0.991
    Age + Expenditure                0.486
    Age + Population                 0.947
    Expenditure + Population         0.568
    Age + Expenditure + Population   0.537

Table 1: Cross-validation scores for all 8 models in Example 24.4.

24.4 Example. Consider the crime data but let us only consider three variables: Age, Expenditure and Population. There are 8 possible submodels. Their CV scores are shown in Table 1. The best model has two variables, Age and Expenditure.

24.2 Model Search

Once we choose a model selection criterion, such as cross-validation or AIC, we then need to search through all $2^p$ models, assign a score to each one, and choose the model with the best score. We will consider 4 methods for searching through the space of models:

1. Fitting all submodels.
2. Forward stepwise regression.
3. Ridge regression.
4. The lasso.

Fitting All Submodels. If $p$ is not too large we can do a complete search over all the models. To run all-subsets regression in R you need the leaps library. This is installed in the Statistics department. You can also download it for free from the R web site. In R you type:

    library(leaps)
    out = leaps(x,y,method="Cp")

Here, x is a matrix of explanatory variables. (Do not include a column of 1's.) The output is a list with several components. In particular, out$which shows which variables are in the model, out$size shows how many parameters are in the model and out$Cp shows the Cp statistic. You can also use the "nbest=" option, for example,

    out = leaps(x,y,method="Cp",nbest=10)

This will report only the best 10 subsets of each size model. Here is a sample example with three variables:

    library(leaps)
    x = cbind(x1,x2,x3)
    out = leaps(x,y,method="Cp")
    print(out)

    > out
    $which
          1     2     3
    1  TRUE FALSE FALSE
    1 FALSE FALSE  TRUE
    1 FALSE  TRUE FALSE
    2  TRUE  TRUE FALSE
    2  TRUE FALSE  TRUE
    2 FALSE  TRUE  TRUE
    3  TRUE  TRUE  TRUE

    $label
    [1] "(Intercept)" "1" "2" "3"

    $size
    [1] 2 2 2 3 3 3 4

    $Cp
    [1]   0.0097649 104.5772075 106.5794484   2.0972299   2.2834636 105.1932414
    [7]   4.0000000

So the best model is the first, which has only x1 in it.

When p is large, searching through all $2^p$ models is infeasible. In that case we need to search over a subset of all the models. Two common methods are forward and backward stepwise regression. In forward stepwise regression, we start with no covariates in the model. We then add the one variable that leads to the best score. We continue adding variables one at a time this way. Backwards stepwise regression is the same except that we start with the biggest model and drop one variable at a time. Both are greedy searches; neither is guaranteed to find the model with the best score. Backward selection is infeasible when $p$ is larger than $n$ since $\hat\beta$ will not be defined for the largest model. Hence, forward selection is preferred when $p$ is large.

Forward Stepwise Regression

1. For $j = 1,\dots,p$, regress $Y$ on the $j$th covariate $X_j$ and let $\hat R_j$ be the estimated risk. Set $\hat j = \mathrm{argmin}_j \hat R_j$ and let $S = \{\hat j\}$.
2. For each $j \in S^c$, fit the regression model $Y = \sum_{s\in S}\beta_s X_s + \beta_j X_j + \epsilon$ and let $\hat R_j$ be the estimated risk. Set $\hat j = \mathrm{argmin}_{j\in S^c} \hat R_j$ and update $S \leftarrow S \cup \{\hat j\}$.
3. Repeat the previous step until all variables are in $S$ or until it is not possible to fit the regression.
4. Choose the final model to be the one with the smallest estimated risk.

Figure 12: Forward stepwise regression.

24.5 Example. Figure 13 shows forward stepwise regression on the crime data. The y-axis is the cross-validation score. The x-axis shows the order in which the variables entered. We start with a null model and we find that adding x4 reduces the cross-validation score the most. Next we try adding each of the remaining variables to the model and find that x13 leads to the most improvement. We continue this way until all the variables have been added. The

sequence of models chosen by the algorithm is

$$\hat Y = \hat\beta_4 X_4, \qquad S = \{4\}$$
$$\hat Y = \hat\beta_4 X_4 + \hat\beta_{13} X_{13}, \qquad S = \{4, 13\}$$
$$\hat Y = \hat\beta_4 X_4 + \hat\beta_{13} X_{13} + \hat\beta_3 X_3, \qquad S = \{4, 13, 3\}$$
$$\vdots \qquad\qquad\qquad \vdots \qquad (80)$$

The best overall model we find is the model with five variables x4, x13, x3, x1, x11, although the model with seven variables is essentially just as good.

Regularization: Ridge Regression and the Lasso. Another way to deal with variable selection is to use regularization or penalization. Specifically, we define $\hat\beta$ to minimize the penalized sums of squares

$$Q(\beta) = \sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\,\mathrm{pen}(\beta)$$

where $\mathrm{pen}(\beta)$ is a penalty and $\lambda \ge 0$. We consider three choices for the penalty:

$$\text{L0 penalty:}\quad \|\beta\|_0 = \#\{j : \beta_j \ne 0\}$$
$$\text{L1 penalty:}\quad \|\beta\|_1 = \sum_{j=1}^p |\beta_j|$$
$$\text{L2 penalty:}\quad \|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2.$$

The L0 penalty would force us to choose estimates which make many of the $\beta_j$'s equal to 0. But there is no way to minimize $Q(\beta)$ without searching through all the submodels.

[Figure 13: Forward stepwise regression on the crime data. Cross-validation score (y-axis) against the number of variables (x-axis), labeled by the variable entering at each step.]
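The greedy search of Figure 12, scored by leave-one-out CV, fits in a few lines. A Python sketch with simulated data (illustrative only; the notes run this in R on the crime data):

```python
import numpy as np

def loocv_risk(Xs, y):
    """Leave-one-out CV risk via the hat-matrix shortcut (71)."""
    H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
    r = (y - H @ y) / (1 - np.diag(H))
    return np.mean(r**2)

def forward_stepwise(X, y, risk):
    """Greedy forward selection: at each step add the covariate that most
    reduces the estimated risk; return the best model seen along the path."""
    p = X.shape[1]
    S, path = [], []
    while len(S) < p:
        score, j = min((risk(X[:, S + [j]], y), j)
                       for j in range(p) if j not in S)
        S = S + [j]
        path.append((score, list(S)))
    return min(path)          # (risk, subset) with the smallest estimated risk

rng = np.random.default_rng(5)
X = rng.standard_normal((80, 6))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.standard_normal(80)
best_risk, best_S = forward_stepwise(X, y, loocv_risk)
```

With a strong two-variable signal, the selected subset contains the two true covariates; being greedy, the procedure carries no guarantee of finding the globally best subset.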

The L2 penalty is easy to implement. The estimate $\hat\beta$ that minimizes

$$\sum_{i=1}^n \Big(Y_i - \sum_{j=1}^p \beta_j X_{ij}\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2$$

is called the ridge estimator. It can be shown that the estimator that minimizes the penalized sums of squares (assuming the $X_{ij}$'s are standardized) is

$$\hat\beta = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T Y,$$

where $I$ is the identity. When $\lambda = 0$ we get the least squares estimate (low bias, high variance). When $\lambda \to \infty$ we get $\hat\beta = 0$ (high bias, low variance).

Notice that ridge regression produces a linear estimator: $\hat\beta = SY$ where $S = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T$, and $\hat Y = HY$ where $H = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T$. The effective degrees of freedom is defined to be $\mathrm{df}(\lambda) = \mathrm{trace}(H)$. When $\lambda = 0$ we have $\mathrm{df}(\lambda) = p$, and when $\lambda \to \infty$, $\mathrm{df}(\lambda) \to 0$.

How do we choose $\lambda$? Recall that the cross-validation estimate of predictive risk is

$$\mathrm{CV} = \sum_{i=1}^n (Y_i - \hat r_{(-i)}(X_i))^2.$$

It can be shown that

$$\mathrm{CV} = \sum_{i=1}^n \left(\frac{Y_i - \hat r(x_i)}{1 - H_{ii}}\right)^2.$$

Thus we can choose $\lambda$ to minimize CV. An alternative criterion that is sometimes used is generalized cross-validation, or GCV:

$$\mathrm{GCV} = \sum_{i=1}^n \left(\frac{Y_i - \hat r(x_i)}{1 - b}\right)^2 \quad\text{where}\quad b = \frac{1}{n}\sum_{i=1}^n H_{ii} = \frac{\mathrm{df}(\lambda)}{n}.$$

This is just an approximation to CV where $H_{ii}$ is replaced with its average $n^{-1}\sum_{i=1}^n H_{ii}$.

The function lm.ridge, which is part of the MASS library, does ridge regression and computes GCV.

24.6 Example (Crime Data Revisited). Here is a re-analysis of the crime data using ridge regression. The estimates are plotted as a function of $\lambda$ (top left) and as a function of $\mathrm{df}(\lambda)$ (top right) in Figure 14. See the middle plots for the CV score; the middle right and bottom left plots are from lm.ridge. Here is the R code:

    Trace = function(X){ sum(diag(X)) }
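The ridge closed form, its effective degrees of freedom, and the GCV score can be sketched directly. In Python with simulated data (illustrative; the notes' R version follows below):

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge estimate, effective df = trace(H), and GCV for each lambda."""
    n, p = X.shape
    out = []
    for lam in lambdas:
        S = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
        beta = S @ y                    # (X'X + lam I)^{-1} X'y
        H = X @ S                       # smoother matrix
        df = np.trace(H)                # effective degrees of freedom
        resid = y - H @ y
        gcv = np.mean((resid / (1 - df / n))**2)
        out.append((lam, beta, df, gcv))
    return out

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 4))
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + 0.5 * rng.standard_normal(50)
path = ridge_path(X, y, [0.0, 1.0, 10.0, 100.0])
```

At $\lambda = 0$ this reproduces least squares with $\mathrm{df} = p$; as $\lambda$ grows, both the coefficient norm and the effective degrees of freedom shrink monotonically.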

[Figure 14: Ridge regression of crime data. Coefficient estimates as a function of lambda and of the effective degrees of freedom; CV and GCV scores against lambda and df.]

    ridge.fun = function(X,y,lambda){
        n = length(y)
        p = ncol(X)
        k = length(lambda)
        I = diag(rep(1,p))
        beta = matrix(0,p,k)
        df = rep(0,k)
        cv = rep(0,k)
        for(i in 1:k){
            S = solve(t(X) %*% X + (lambda[i]*I)) %*% t(X)
            beta[,i] = S %*% y
            H = X %*% S
            y.hat = H %*% y
            df[i] = Trace(H)
            cv[i] = sum( ((y-y.hat)/(1-diag(H)))^2 )
        }
        return(list(beta=beta,df=df,cv=cv))
    }

    ### scale the variables
    crime.dat = scale(crime.dat)
    Crime = scale(Crime)

    lambda = seq(0,25,length=100)
    out1 = ridge.fun(as.matrix(crime.dat),Crime,lambda)

    postscript("ridge.ps",horizontal=F)
    par(mfrow=c(3,2))
    matplot(lambda,t(out1$beta),type="l",lty=1,xlab="lambda",ylab="beta")
    matplot(out1$df,t(out1$beta),type="l",lty=1,xlab="df",ylab="beta")
    plot(lambda,out1$cv,type="l")
    plot(out1$df,out1$cv,type="l")

    library(MASS)
    out2 = lm.ridge(Crime ~ Age + Southern + Education + Expenditure + Labor +
                    Males + pop + U1 + U2 + Wealth,
                    data=crime.dat, lambda=lambda)
    print(summary(out2))
    matplot(lambda,t(out2$coef),type="l",lty=1)
    plot(out1$df,out2$GCV,type="l")
    dev.off()

The problem with ridge regression is that we really haven't done variable selection because we haven't forced any $\beta_j$'s to be 0. This is where the L1 penalty comes in.

The lasso estimator $\hat\beta(\lambda)$ is the value of $\beta$ that solves

$$\min_{\beta\in\mathbb{R}^p} \sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\|\beta\|_1 \qquad (81)$$

where $\lambda > 0$ and $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm of the vector $\beta$. Equation (81) defines a convex optimization problem with a unique solution $\hat\beta(\lambda)$ that depends on $\lambda$. Typically, it turns out that many of the $\hat\beta_j(\lambda)$'s are zero. Thus, for a given $\lambda$, the lasso performs estimation and model selection simultaneously. The selected model is

$$\hat S(\lambda) = \big\{j : \hat\beta_j(\lambda) \ne 0\big\}. \qquad (82)$$

The constant $\lambda$ can be chosen by cross-validation. The estimator has to be computed numerically, but this is a convex optimization and so can be solved quickly. The lasso is called basis pursuit in the signal processing literature.

Digression on Sparsity. What is special about the L1 penalty? First, this is the closest penalty to the L0 penalty that makes $Q(\beta)$ convex. Moreover, the L1 penalty captures sparsity. We would like our estimator $\hat\beta$ to be sparse, meaning that most $\hat\beta_j$'s are zero (or close to zero). Consider the following two vectors, each of length $p$:

$$u = (1, 0, \dots, 0), \qquad v = (1/\sqrt{p}, 1/\sqrt{p}, \dots, 1/\sqrt{p}).$$

Intuitively, $u$ is sparse while $v$ is not. Let us now compute the norms:

$$\|u\|_1 = 1, \quad \|u\|_2 = 1, \qquad \|v\|_1 = \sqrt{p}, \quad \|v\|_2 = 1.$$

The L2 norm does not distinguish $u$ from $v$, but the L1 norm correctly captures sparseness.

Two related variable selection methods are forward stagewise regression and lars. In forward stagewise regression we first set $\hat Y = (0, 0, \dots, 0)^T$ and we choose a small, positive constant $\epsilon$. Now we build the predicted values incrementally. Find the current correlations $c = c(\hat Y) = \mathbf{X}^T(Y - \hat Y)$ and set $j = \mathrm{argmax}_j |c_j|$. $\qquad (83)$ Then we update $\hat Y$ by the following equation:

$$\hat Y \leftarrow \hat Y + \epsilon\,\mathrm{sign}(c_j)\,x_j. \qquad (84)$$

This is like forward stepwise regression except that we only take small, incremental steps towards the next variable, and we do not go back and refit the previous variables by least squares.

A modification of forward stagewise regression is called least angle regression (lars). We begin with all coefficients set to 0 and then find the predictor $x_j$ most correlated with $Y$. Then we increase $\beta_j$ in the direction of the sign of its correlation with $Y$ and set $\epsilon = Y - \hat Y$. When some other predictor $x_k$ has as much correlation with $\epsilon$ as $x_j$ has, we increase $(\beta_j, \beta_k)$ in their joint least squares direction, until some other predictor $x_m$ has as much correlation with the residual $\epsilon$. If a non-zero coefficient ever hits zero, remove it from the active set $\mathcal{A}$ of predictors and recompute the joint direction. Continue until all predictors are in the model. A formal description is in Figure 17. It turns out that lars can be easily modified to produce the lasso estimator. This is why the lars function in R is used to compute the lasso estimator. You need to download the lars package first.

24.7 Example. Figure 15 shows the traces of $\hat\beta_j(\lambda)$ as a function of the number of steps in the algorithm when the lasso is run on the crime data. The value of Cp is shown in Figure 16. The best model includes variables x1, x3, x6, x7, x10, x11, x12, x13. This is similar to the model selected by forward stepwise regression, but notice that x4 (expenditures) is chosen first by forward stepwise but is chosen last by the lasso.
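The lasso solution can also be computed by coordinate descent: cycling through the coordinates, each update is an exact univariate minimization given by soft-thresholding. This is not the lars algorithm of the notes, just an alternative numerical sketch (Python, simulated data):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b sum(y - Xb)^2 + lam * ||b||_1.
    For coordinate j the exact minimizer is
    b_j = soft_threshold(x_j' r_j, lam/2) / ||x_j||^2,
    where r_j is the partial residual leaving out x_j."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(X[:, j] @ r, lam / 2) / col_sq[j]
    return beta

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)
b = lasso_cd(X, y, lam=50.0)
```

Note the signature property: the irrelevant coefficients are not merely small, they are exactly zero, so the fit and the selected model (82) come out together.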

Figure 15: Coefficient trace plot for the lasso with the crime data (standardized coefficients versus |beta|/max|beta|).

Figure 16: Cp plot for the lasso with the crime data (Cp versus Df).

lars

1. Set Y = 0, k = 0, A = ∅.
2. Compute the following quantities:

    c = X^T(Y − Y),  C = max_j {|cj|},  A = {j : |cj| = C},  sj = sign(cj), j ∈ A,
    X_A = (sj xj : j ∈ A),  G = X_A^T X_A,  B = (1^T G^{−1} 1)^{−1/2},
    w = B G^{−1} 1,  u = X_A w,  a = X^T u,

   where 1 is a vector of 1's of length |A|.
3. Set

    Y ←− Y + γ u    (85)

   where

    γ = min+_{j ∈ A^c} { (C − cj)/(B − aj), (C + cj)/(B + aj) }.    (86)

   Here, min+ means that the minimum is only over positive components.
4. Repeat steps 2–3 until A^c = ∅.

Figure 17: A formal description of lars.

Summary

1. The lasso estimates β with the penalized residual sum of squares

    Σ_{i=1}^n (Yi − Xi^T β)^2 + λ||β||1.    (87)

   Some of the estimates will be 0, and this corresponds to omitting those variables from the model.
2. lars is an efficient algorithm for computing the lasso estimates.
3. The prediction risk R(S) = n^{−1} Σ_{i=1}^n E(Yi(S) − Yi*)^2 can be decomposed into unavoidable error, bias and variance. Small models have high bias and low variance. Large models have low bias and high variance. This is the bias-variance tradeoff. Model selection methods aim to find a model which balances bias and variance, yielding a small risk.
4. Search methods look through a subset of models and find the one with the smallest value of the estimated risk R(S). Cp or cross-validation are used to estimate the risk.

25 Variable Selection versus Hypothesis Testing

The difference between variable selection and hypothesis testing can be confusing. Let us look at a simple example. Let Y1, . . . , Yn ∼ N(µ, 1).

We want to compare two models: M0 : N(0, 1) and M1 : N(µ, 1). The likelihood is proportional to

    L(µ) = Π_{i=1}^n e^{−(Yi−µ)^2/2} = e^{−n(Ȳ−µ)^2/2} e^{−nS^2/2}

where S^2 = n^{−1} Σ_i (Yi − Ȳ)^2, so the log-likelihood is

    ℓ(µ) = − nS^2/2 − n(Ȳ − µ)^2/2.

Hypothesis Testing. We test H0 : µ = 0 versus H1 : µ ≠ 0. The test statistic is

    Z = (Ȳ − 0)/√V(Ȳ) = √n Ȳ.

We reject H0 if |Z| > z_{α/2}. For α = 0.05, we reject H0 if |Z| > 2, i.e., if

    |Ȳ| > 2/√n.

AIC. Recall that AIC = ℓ − |S|. The AIC scores are

    AIC0 = ℓ(0) − 0 = − nS^2/2 − nȲ^2/2

and

    AIC1 = ℓ(µ̂) − 1 = − nS^2/2 − 1

since µ̂ = Ȳ. We choose model 1 if AIC1 > AIC0, that is, if

    − nS^2/2 − 1 > − nS^2/2 − nȲ^2/2,

or

    |Ȳ| > √2/√n.

This is similar to, but not the same as, the hypothesis test.

BIC. The BIC scores are

    BIC0 = ℓ(0) − (0/2) log n = − nS^2/2 − nȲ^2/2

and

    BIC1 = ℓ(µ̂) − (1/2) log n = − nS^2/2 − (1/2) log n.

We choose model 1 if BIC1 > BIC0, that is, if

    |Ȳ| > √(log n / n).

To summarize:

    Hypothesis testing    controls type I errors
    AIC/CV/Cp             finds the most predictive model
    BIC                   finds the true model (with high probability)

26 Collinearity

If one of the predictor variables is a linear combination of the others, then we say that the variables are collinear. The result is that X^T X is not invertible. Formally, this means that the standard error of β is infinite and the standard error for predictions is infinite.

For example, suppose that x1i = 2 for every i and suppose we include an intercept. Then the X matrix is

    [ 1  2 ]
    [ 1  2 ]
    [ .  . ]
    [ 1  2 ]

and so

    X^T X = n [ 1  2 ]
              [ 2  4 ]

which is not invertible. The implied model in this example is

    Yi = β0 + β1 xi1 + εi = β0 + 2β1 + εi ≡ β0* + εi

where β0* = β0 + 2β1. We can estimate β0* using Ȳ but there is no way to separate this into estimates for β0 and β1.

Sometimes the variables are close to collinear. The result is that it may be difficult to invert X^T X. However, the bigger problem is that the standard errors will be huge. If we include too many variables, we get poor predictions due to increased variance. The solution is easy: don't use all the variables; use variable selection. Multicollinearity is just an extreme example of the bias-variance tradeoff we face whenever we do regression.
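The singularity in the example above can be checked directly: for a design with an intercept column and x1i = 2, the 2×2 matrix X^T X has determinant zero. A quick Python check (illustrative):

```python
n = 10
X = [[1.0, 2.0] for _ in range(n)]  # intercept column and the constant x1 = 2

# Compute X^T X by hand (it is 2x2).
xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
# xtx = [[n, 2n], [2n, 4n]]
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
# det = n*4n - 2n*2n = 0, so X^T X is not invertible and the least
# squares estimate (X^T X)^{-1} X^T Y does not exist.
```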

27 Robust Regression

So far we have dealt with outliers by looking for them and deleting them. A more systematic way to deal with outliers is through robust regression. Recall that in least squares, we estimate β by minimizing the RSS. The idea of robust regression is to replace the RSS with a different criterion.

Let's start with a simpler setting. Suppose Yi = µ + εi and we want to estimate µ. If we choose µ to minimize RSS = Σ_i (Yi − µ)^2 we get µ̂ = Ȳ. Now, µ̂ = Ȳ is not a robust estimator: its value will change drastically if we move one observation. An alternative estimator is the median µ̃. It can be shown that the median is obtained by minimizing Σ_i |Yi − µ|. The median is very robust: moving one observation will have little or no effect on it.

How does the median compare to the mean as an estimator? If the data are Normal, it can be shown that V(Ȳ) ≈ 0.64 V(µ̃). The implication is that the median is less efficient than the mean (which is the MLE). However, this is only true if the data are exactly Normal. If the data are non-Normal, and in particular if there are occasional outliers, the median is preferable because of its robustness. We can give up a bit of efficiency in favor of gaining some robustness. This is a general idea.

More generally, we can estimate µ by minimizing

    Σ_i ρ(Yi − µ)

for some function ρ. Huber's estimator corresponds to

    ρ(x) = x^2            |x| ≤ c
           c(2|x| − c)    |x| ≥ c.

As c → ∞ we get the mean. As c → 0 we get the median. The choice of cutoff c must be relative to the scale σ of Y. So in fact we minimize

    Σ_i ρ( (Yi − µ)/s )

where s is an estimate of σ. Now s also has to be a robust estimate. An example is the MAD (median absolute deviation):

    s = median_i |Yi − median_j Yj| / 0.6745.

The reason we divide by 0.6745 is because this ensures that s converges to σ as n increases when the data are Normal; otherwise the median absolute deviation would converge to 0.6745σ. A common choice of cutoff is c = 1.345. This gives 95 per cent efficiency at the Normal.

How can we transfer this idea to regression? We choose β to minimize

    Σ_{i=1}^n ρ( (Yi − Xi^T β)/s )
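The robustness of the median and the MAD scale estimate are easy to see numerically. A small Python sketch (illustrative; the constant 0.6745 is the Normal quartile that makes the MAD consistent for σ):

```python
def median(v):
    s = sorted(v)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def mad_scale(v):
    """MAD estimate of scale: median absolute deviation divided by 0.6745."""
    m = median(v)
    return median([abs(x - m) for x in v]) / 0.6745

y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_out = [1.0, 2.0, 3.0, 4.0, 500.0]  # move one observation far away

mean_shift = abs(sum(y_out) / 5 - sum(y) / 5)    # the mean moves by 99
median_shift = abs(median(y_out) - median(y))    # the median does not move
```

Moving a single observation arbitrarily far drags the mean with it but leaves the median (and the MAD) unchanged, which is exactly the breakdown behavior discussed above.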

where ρ is the Huber function (or some other function) and s is an estimate of σ. The resulting estimator is called an M-estimator. (Choosing ρ = −log f for some density function f corresponds to maximum likelihood estimation.)

27.1 Example. We will create a synthetic example.

postscript("robust.ps", horizontal=F)
library(MASS)
n = 100
x = (1:n)/n
eps = rnorm(n, 0, .1)
y = 2 + 3*x + eps
y[90] = 2    ### create an outlier
plot(x, y)
out1 = lm(y ~ x)
out2 = rlm(y ~ x)
print(summary(out1))
print(summary(out2))
print(out2$s)
print(sqrt(sum(out1$res^2)/(n-2)))

The least squares fit (out1) has residual standard error 0.2831 on 98 degrees of freedom (Multiple R-Squared: 0.8944, Adjusted R-squared: 0.8933, F-statistic: 830.1 on 1 and 98 DF, p-value < 2.2e-16). The robust fit (out2) has residual standard error 0.09395 on 98 degrees of freedom, with out2$s = 0.09394536. Both fits give an intercept near 2.02–2.03 and a slope near 2.95, so the single outlier has little effect on the coefficients here, but it badly inflates the least squares estimate of σ; the robust scale estimate stays close to the true value 0.1.

28 Nonlinear Regression

We can fit regression models when the regression function is nonlinear:

    Yi = r(Xi, β) + εi

where the regression function r(x, β) is a known function except for some parameters β = (β1, . . . , βk). The parameter estimate β is found by minimizing

    RSS = Σ_{i=1}^n (Yi − r(Xi, β))^2.

Generally, this must be done numerically. The algorithms are iterative and you must supply starting values for the parameters.

28.1 Example. Figure 18 shows the weight of a patient on a weight rehabilitation program as a function of the number of days in the program. The data are from Venables and Ripley (1994). It is hypothesized that

    Yi = r(x, β) + εi,  where  r(x, β) = β0 + β1 2^{−x/β2}.

Since lim_{x→∞} r(x, β) = β0, we see that β0 is the ideal stable lean weight. Also, r(0, β) − r(∞, β) = β1, so β1 is the amount of weight to be lost. Finally, we see that the expected remaining weight r(x, β) − β0 is one-half the starting remaining weight r(0, β) − β0 when x = β2. So β2 is the half-life, i.e., the time to lose half the remaining weight.

Here is how to fit the example in R:

library(MASS)
attach(wtloss)
plot(Days, Weight, pch=19)
out = nls(Weight ~ b0 + b1*2^(-Days/b2), data=wtloss, start=list(b0=90, b1=95, b2=120))
info = summary(out)
print(info)

Formula: Weight ~ b0 + b1 * 2^(-Days/b2)
Parameters:
   Estimate Std. Error t value Pr(>|t|)
b0   81.374      2.269   35.86   <2e-16 ***
b1  102.684      2.083   49.30   <2e-16 ***
b2  141.911      5.295   26.80   <2e-16 ***
---
Residual standard error: 0.8949 on 49 degrees of freedom
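The half-life interpretation of β2 can be verified directly from the model formula. A quick Python check (illustrative), using the fitted values from the nls output:

```python
def r(x, b0, b1, b2):
    """Weight-loss model: r(x) = b0 + b1 * 2^(-x/b2)."""
    return b0 + b1 * 2.0 ** (-x / b2)

# Fitted values from the nls output above.
b0, b1, b2 = 81.374, 102.684, 141.911

start_remaining = r(0, b0, b1, b2) - b0          # = b1, the weight to be lost
remaining_at_halflife = r(b2, b0, b1, b2) - b0   # = b1/2: half is left at x = b2
```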

Correlation of Parameter Estimates:
        b0      b1
b1 -0.9891
b2 -0.9857  0.9561

b = info$parameters[,1]
grid = seq(0, 250, length=1000)
fit = b[1] + b[2]*2^(-grid/b[3])
lines(grid, fit, lwd=3, lty=1, col=2)
plot(Days, info$residuals)
lines(Days, rep(0, length(Days)))
dev.off()

The fit and residuals are shown in Figure 18.

Figure 18: Weight Loss Data. Top: weight versus days with the fitted curve. Bottom: residuals versus days.

29 Logistic Regression

Logistic regression is a generalization of regression that is used when the outcome Y is binary. Let Y ∈ {0, 1} and suppose we want to relate Y to some covariate x. The usual regression model is not appropriate since it does not constrain Y to be binary. Note that since Yi is binary,

    E(Yi|Xi) = P(Yi = 1|Xi).

Define πi = P(Yi = 1|Xi). With the logistic regression model we assume that

    E(Yi|Xi) = P(Yi = 1|Xi) = e^{β0 + β1 Xi} / (1 + e^{β0 + β1 Xi}).

Figure 19 shows the logistic function e^{β0+β1 x}/(1 + e^{β0+β1 x}). The parameter β0 controls the horizontal shift of the curve. The parameter β1 controls the steepness of the curve.

Define the logit function

    logit(z) = log( z/(1 − z) ).

Then we can rewrite the logistic model as

    logit(πi) = β0 + β1 Xi.

The extension to several covariates is straightforward:

    logit(πi) = β0 + Σ_{j=1}^p βj xij = Xi^T β.

How do we estimate the parameters? Usually we use maximum likelihood. Let's review the basics of maximum likelihood. Let Y ∈ {0, 1} denote the outcome of a coin toss. We call Y a Bernoulli random variable. Let π = P(Y = 1) and 1 − π = P(Y = 0). The probability function is

    f(y; π) = π^y (1 − π)^{1−y}.

The probability function for n independent tosses Y1, . . . , Yn is

    f(y1, . . . , yn; π) = Π_{i=1}^n f(yi; π) = Π_{i=1}^n π^{yi} (1 − π)^{1−yi}.

The likelihood function is just the probability function regarded as a function of the parameter π, treating the data as fixed:

    L(π) = Π_{i=1}^n π^{yi} (1 − π)^{1−yi}.

The maximum likelihood estimator, or MLE, is the value π̂ that maximizes L(π). Maximizing the likelihood is equivalent to maximizing the log-likelihood function

    ℓ(π) = log L(π) = Σ_{i=1}^n [ yi log π + (1 − yi) log(1 − π) ].

Setting the derivative of ℓ(π) to zero yields

    π̂ = Σ_{i=1}^n Yi / n.

Figure 19: The logistic function p = e^x/(1 + e^x).

Recall that the Fisher information is defined to be

    I(π) = −E[ ∂²ℓ(π)/∂π² ] = n / ( π(1 − π) ).

The approximate standard error is

    se(π̂) = 1/√I(π̂) = √( π̂(1 − π̂)/n ).

Returning to logistic regression, the likelihood function is

    L(β) = Π_{i=1}^n f(yi|Xi; β) = Π_{i=1}^n πi^{Yi} (1 − πi)^{1−Yi}

where

    πi = e^{Xi^T β} / (1 + e^{Xi^T β}).

The maximum likelihood estimator β has to be found numerically. The usual algorithm is called iteratively reweighted least squares and works as follows. First set starting values β^(0). Now, for k = 1, 2, . . ., do the following steps until convergence:

1. Compute fitted values

    πi = e^{Xi^T β^(k)} / (1 + e^{Xi^T β^(k)}),  i = 1, . . . , n.

2. Define an n × n weight matrix W whose ith diagonal element is πi(1 − πi).

3. Define the adjusted response vector

    Z = X β^(k) + W^{−1}(Y − π)

   where π^T = (π1, . . . , πn).

4. Take

    β^(k+1) = (X^T W X)^{−1} X^T W Z,

   which is the weighted linear regression of Z on X.

The standard errors are given by V(β) = (X^T W X)^{−1}.
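The reweighted least squares iteration can be written out in a few lines. Here is a Python sketch for a single covariate plus intercept (illustrative only; the data are made up, and production code would add step-size safeguards and a convergence check):

```python
import math

def irls_logistic(x, y, iters=25):
    """Fit logit(pi_i) = b0 + b1*x_i by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Step 1: fitted values pi_i.
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Step 2: weights w_i = pi_i (1 - pi_i).
        w = [pi * (1.0 - pi) for pi in p]
        # Steps 3-4: the update (X^T W X)^{-1} X^T W Z with
        # Z = X b + W^{-1}(Y - pi) simplifies algebraically to
        # b_new = b + (X^T W X)^{-1} X^T (Y - pi).
        a11 = sum(w)                                   # X^T W X for X = [1, x]
        a12 = sum(wi * xi for wi, xi in zip(w, x))
        a22 = sum(wi * xi * xi for wi, xi in zip(w, x))
        g1 = sum(yi - pi for yi, pi in zip(y, p))      # X^T (Y - pi)
        g2 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        det = a11 * a22 - a12 * a12
        b0 += ( a22 * g1 - a12 * g2) / det
        b1 += (-a12 * g1 + a11 * g2) / det
    return b0, b1

x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
b0, b1 = irls_logistic(x, y)

# At the MLE the score X^T (Y - pi) is (numerically) zero.
p_hat = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
score0 = sum(yi - pi for yi, pi in zip(y, p_hat))
score1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p_hat, x))
```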

29.1 Example. The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease. There are 9 covariates: systolic blood pressure (sbp), cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. A logistic regression yields the following estimates and Wald statistics Wj for the coefficients:

Covariate    βj      se     Wj      p-value
Intercept  -6.145  1.300  -4.738   0.000
sbp         0.007  0.006   1.138   0.255
tobacco     0.079  0.027   2.991   0.003
ldl         0.174  0.059   2.925   0.003
adiposity   0.019  0.029   0.637   0.524
famhist     0.925  0.227   4.078   0.000
typea       0.040  0.012   3.233   0.001
obesity    -0.063  0.044  -1.427   0.153
alcohol     0.000  0.004   0.027   0.979
age         0.045  0.012   3.754   0.000

Are you surprised by the fact that systolic blood pressure is not significant, or by the minus sign for the obesity coefficient? If yes, then you are confusing association and causation. The fact that blood pressure is not significant does not mean that blood pressure is not an important cause of heart disease. It means that it is not an important predictor of heart disease relative to the other variables in the model.

Model selection can be done using AIC or BIC:

    AIC_S = −2 ℓ(β_S) + 2|S|

where S is a subset of the covariates.

There are two different types of residuals: the Pearson (or χ²) residuals

    (Yi − π̂i)/√( π̂i(1 − π̂i) )

and the deviance residuals

    sign(Yi − π̂i) √( 2 [ Yi log(Yi/π̂i) + (1 − Yi) log((1 − Yi)/(1 − π̂i)) ] )

where we interpret 0 log 0 = 0. These are approximately the same. When there are replications at each x value, the residuals will behave like N(0, 1) random variables when the model is correct. Without replication, the residuals are useless.

To fit this model in R we use the glm command, which stands for generalized linear model.

> attach(sa.data)
> out = glm(chd ~ ., family=binomial, data=sa.data)
> print(summary(out))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)

             Estimate  Std. Error z value Pr(>|z|)
(Intercept) -6.1482113  1.2977108  -4.738 2.16e-06 ***
sbp          0.0065039  0.0057129   1.138 0.2540
tobacco      0.0793674  0.0265321   2.991 0.002777 **
ldl          0.1738948  0.0594451   2.925 0.003441 **
adiposity    0.0185806  0.0291616   0.637 0.524
famhist      0.9252043  0.2268939   4.078 4.55e-05 ***
typea        0.0395805  0.0122417   3.233 0.001224 **
obesity     -0.0629112  0.0440721  -1.427 0.153
alcohol      0.0001196  0.0044703   0.027 0.979
age          0.0452028  0.0120398   3.754 0.000174 ***

> out2 = step(out)
Start: AIC= 492.14
chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + alcohol + age
etc.
Step: AIC= 487.69
chd ~ tobacco + ldl + famhist + typea + age

> p = out2$fitted.values
> names(p) = NULL
> n = nrow(sa.data)
> predict = rep(0, n)
> predict[p > .5] = 1
> print(table(chd, predict))
     predict
chd     0   1
  0   256  46
  1    73  87
> error = sum( ((chd==1)&(predict==0)) | ((chd==0)&(predict==1)) )/n
> print(error)
[1] 0.2575758

30 More About Logistic Regression

Just when you thought you understood logistic regression, it turns out there is another way to think about it. Suppose we have a binary outcome Yi and a continuous covariate Xi. Suppose, for example, that X is the amount of exposure to a chemical and Y is presence or absence of disease. When the Xi's are random (so I am writing them with a capital letter) there is another way to think about the problem, and it is instructive to do so. Let's consider both methods for analyzing the data.

Method 1: Logistic Regression (Y |X). To examine the relationship between x and Y we use the logistic model

    P(Y = 1|x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x}).

To formally test if there is a relationship between x and Y we test H0 : β1 = 0 versus H1 : β1 ≠ 0. The first plot in Figure 20 shows Y versus x and the fitted logistic model. The results of the regression are:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2785     0.5422  -4.202 2.64e-05 ***
x             2.1933     0.4567   4.802 1.57e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.629 on 99 degrees of freedom
Residual deviance: 72.549 on 98 degrees of freedom
AIC: 76.55

The test for H0 : β1 = 0 is highly significant and we conclude that there is a strong relationship between Y and X.

Method 2: Comparing Two Distributions (X|Y). Instead of regressing Y on X, you might simply compare the distribution of X among the sick (Y = 1) and among the healthy (Y = 0). Think of X as the outcome and Y as a group indicator. Examine the boxplots and the histograms in the figure. To test whether these distributions (or at least the means of the distributions) are the same, we can do a standard t-test for H0 : E(X|Y = 1) = E(X|Y = 0) versus H1 : E(X|Y = 1) ≠ E(X|Y = 0).

x0 = x[y==0]
x1 = x[y==1]
> print(t.test(x0, x1))
Welch Two Sample t-test

Figure 20: Logistic Regression? Top: Y versus x with the fitted logistic model. Bottom: histograms of x0 (the x's with y = 0) and x1 (the x's with y = 1).

data: x0 and x1
t = -9.3604, df = 97.782, p-value = 3.016e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.341486 -1.522313
sample estimates:
mean of x  mean of y
0.1148648  2.0467645

Again we conclude that there is a difference. So there are two different ways of answering the same question. What's the connection? Let f0 and f1 be the probability density functions of X for the two groups and let π = P(Y = 1). By Bayes' theorem,

    P(Y = 1|X = x) = f(x|Y = 1)π / ( f(x|Y = 1)π + f(x|Y = 0)(1 − π) )
                   = f1(x)π / ( f1(x)π + f0(x)(1 − π) )
                   = 1 / ( 1 + f0(x)(1 − π)/(f1(x)π) ).

Now suppose that X|Y = 0 ∼ N(µ0, σ²) and that X|Y = 1 ∼ N(µ1, σ²). Then the last equation becomes

    P(Y = 1|X = x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x})

where

    β0 = log( π/(1 − π) ) + (µ0² − µ1²)/(2σ²)    (88)

and

    β1 = (µ1 − µ0)/σ².    (89)

This is exactly the logistic regression model! Moreover, β1 = 0 if and only if µ0 = µ1. Thus, the two approaches are testing the same thing. Indeed, here is how I generated the data for the example: f0 = N(0, 1), f1 = N(2, 1), and I took P(Y = 1) = 1/2. Plugging into (88) and (89) we see that β0 = −2 and β1 = 2, close to the fitted values −2.3 and 2.2.

31 Logistic Regression With Replication

When there are replications, we can say more about diagnostics. Suppose there is one covariate taking values x1, . . . , xk and suppose there are ni observations at each xi. Now we let Yi denote the number of successes at xi. Thus,

    Yi ∼ Binomial(ni, πi).

We can fit the logistic regression as before:

    logit(πi) = Xi^T β
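Equations (88) and (89) can be checked numerically for the values used to generate the example. A small Python verification (illustrative):

```python
import math

def logistic_coeffs(mu0, mu1, sigma, pi):
    """beta0 and beta1 implied by X|Y=0 ~ N(mu0, sigma^2), X|Y=1 ~ N(mu1, sigma^2)
    and P(Y=1) = pi, via equations (88) and (89)."""
    beta0 = math.log(pi / (1.0 - pi)) + (mu0**2 - mu1**2) / (2.0 * sigma**2)
    beta1 = (mu1 - mu0) / sigma**2
    return beta0, beta1

# The data in the example were generated with f0 = N(0,1), f1 = N(2,1), pi = 1/2.
beta0, beta1 = logistic_coeffs(0.0, 2.0, 1.0, 0.5)
# beta0 = -2 and beta1 = 2, close to the fitted values -2.3 and 2.2.
```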

and now we define the Pearson residuals

    ri = (Yi − ni π̂i) / √( ni π̂i(1 − π̂i) )

and the deviance residuals

    di = sign(Yi − Ŷi) √( 2Yi log(Yi/Ŷi) + 2(ni − Yi) log( (ni − Yi)/(ni − Ŷi) ) )

where Ŷi = ni π̂i. We can also form standardized versions of these. Let

    H = W^{1/2} X (X^T W X)^{−1} X^T W^{1/2}

where W is diagonal with ith element ni π̂i(1 − π̂i). The standardized Pearson residuals are

    r̃i = ri / √(1 − Hii),

which should behave like N(0,1) random variables if the model is correct. Similarly, define standardized deviance residuals by

    d̃i = di / √(1 − Hii).

Goodness-of-Fit Test. The Pearson χ²

    χ² = Σ_i ri²

and the deviance

    D = Σ_i di²

both have, approximately, a χ²_{n−p} distribution if the model is correct. Large values are indicative of a problem.

Let us now discuss the use of residuals in the context of an example. Here are the data:

y = c(2, 7, 9, 14, 23, 29, 29, 29)
n = c(29, 30, 28, 27, 30, 31, 30, 29)
x = c(49.06, 52.99, 56.91, 60.84, 64.76, 68.69, 72.61, 76.54)

The data, from Strand (1930) and Collett (1991), are the numbers of flour beetles killed by carbon disulphide (CS2). The covariate is the dose of CS2 in mg/l. There are two ways to run the regression:

> out = glm(cbind(y, n-y) ~ x, family=binomial)
> print(summary(out))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.7312     1.8300  -8.050 8.28e-16 ***
x             0.2478     0.0303   8.179 2.87e-16 ***

Null deviance: 137.7204 on 7 degrees of freedom

Residual deviance: 2.6558 on 6 degrees of freedom

The deviance provides a test of fit:

> print(out$dev)
[1] 2.655771
> pvalue = 1-pchisq(out$dev, out$df.residual)
> print(pvalue)
[1] 0.8506433

So far so good. The second way to run the regression is to expand the data into individual 0-1 outcomes:

Y = c(rep(1, sum(y)), rep(0, sum(n)-sum(y)))
X = c(rep(x, y), rep(x, n-y))
out2 = glm(Y ~ X, family=binomial)
print(summary(out2))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.73081    1.82170  -8.086 6.15e-16 ***
X             0.24784    0.03016   8.218  < 2e-16 ***

Null deviance: 313.63 on 233 degrees of freedom
Residual deviance: 178.56 on 232 degrees of freedom

The outcome is the same except for the deviance. The correct deviance is from the first method. Still, we should look at the residuals:

b = out$coef
grid = seq(min(x), max(x), length=1000)
l = b[1] + b[2]*grid
fit = exp(l)/(1+exp(l))
lines(grid, fit, lwd=3)

> r = resid(out, type="deviance")
> p = out$linear.predictors
> plot(p, r, pch=19, xlab="linear predictor", ylab="deviance residuals")

Note that

> print(sum(r^2))
[1] 2.655771

gives back the deviance test. Now let's create standardized residuals:

r = rstandard(out)
plot(x, r)

You could do this by hand:

w = out$weights
W = diag(w)
WW = diag(sqrt(w))

X = cbind(rep(1,8), x)
H = WW %*% X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% WW
h = diag(H)
rr = r/sqrt(1-h)
plot(p, rr, pch=19, xlab="linear predictor", ylab="standardized deviance residuals")
qqnorm(rr)

Some people find it easier to look at the "half-Normal plot," in which we plot the sorted absolute residuals versus their expected values. For a regular (not half) probability plot we used

    Φ^{−1}( (i − (3/8)) / (k + (1/4)) );

for the half-Normal plot this becomes

    Φ^{−1}( (i + k − (1/8)) / (2k + (1/2)) ).

32 Generalized Linear Models

We can write the logistic regression model as

    Yi ∼ Bernoulli(µi),  g(µi) = Xi^T β

where g(z) = logit(z). The function g is an example of a link function and the Bernoulli is an example of an exponential family, which we explain below.

A probability function (or probability density function) is said to be in the exponential family if there are functions η(θ), B(θ), T(x) and h(x) such that

    f(x; θ) = h(x) e^{η(θ)T(x) − B(θ)}.

32.1 Example. Let X ∼ Poisson(θ). Then

    f(x; θ) = θ^x e^{−θ}/x! = (1/x!) e^{x log θ − θ}

and hence this is an exponential family with η(θ) = log θ, T(x) = x, B(θ) = θ, h(x) = 1/x!.

32.2 Example. Let X ∼ Binomial(n, θ). Then

    f(x; θ) = C(n,x) θ^x (1 − θ)^{n−x} = C(n,x) exp{ x log( θ/(1 − θ) ) + n log(1 − θ) }.

In this case,

    η(θ) = log( θ/(1 − θ) ),  T(x) = x,  B(θ) = −n log(1 − θ),  h(x) = C(n,x).

If θ = (θ1, . . . , θk) is a vector, then we say that f(x; θ) has exponential family form if

    f(x; θ) = h(x) exp{ Σ_{j=1}^k ηj(θ)Tj(x) − B(θ) }.

32.3 Example. Consider the Normal family with θ = (µ, σ). Then

    f(x; θ) = exp{ (µ/σ²) x − x²/(2σ²) − (1/2)( µ²/σ² + log(2πσ²) ) }.

This is exponential with

    η1(θ) = µ/σ²,  T1(x) = x,  η2(θ) = −1/(2σ²),  T2(x) = x²,
    B(θ) = (1/2)( µ²/σ² + log(2πσ²) ),  h(x) = 1.

Any model in which Y has a distribution that is in the exponential family, and some function of its mean is linear in a set of predictors, is called a generalized linear model.
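The Poisson factorization in Example 32.1 can be checked numerically: h(x) exp{η(θ)T(x) − B(θ)} reproduces the Poisson probability function exactly. A short Python check (illustrative):

```python
import math

def poisson_pmf(x, theta):
    return theta**x * math.exp(-theta) / math.factorial(x)

def exp_family_form(x, theta):
    # h(x) = 1/x!, eta(theta) = log(theta), T(x) = x, B(theta) = theta.
    h = 1.0 / math.factorial(x)
    return h * math.exp(math.log(theta) * x - theta)

theta = 3.7
max_diff = max(abs(poisson_pmf(x, theta) - exp_family_form(x, theta))
               for x in range(15))
# The two expressions agree up to floating-point rounding.
```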

Now consider independent random variables Y1, . . . , Yn, each from the same exponential family distribution. Let µi = E(Yi) and suppose that

    g(µi) = Xi^T β.

This is a generalized linear model with link g. Notice that the regression equation E(Yi) = g^{−1}(Xi^T β) is based on the inverse of the link function.

32.4 Example (Normal Regression). Here, Yi ∼ N(µi, σ²) and the link g(µi) = µi is the identity function.

32.5 Example (Logistic Regression). Here, Yi ∼ Bernoulli(µi) and g(µi) = logit(µi).

32.6 Example (Poisson Regression). Here,

    Yi ∼ Poisson(µi)

and the usual link function is g(µi) = log(µi). This is often used when the outcomes are counts.

Although many link functions could be used, there are default link functions that are standard for each family. Here they are (from Table 12.5 in Weisberg):

Distribution  Link                      Inverse Link (Regression Function)
Normal        Identity: g(µ) = µ        µ = x^T β
Poisson       Log: g(µ) = log(µ)        µ = e^{x^T β}
Bernoulli     Logit: g(µ) = logit(µ)    µ = e^{x^T β}/(1 + e^{x^T β})
Gamma         Inverse: g(µ) = 1/µ       µ = 1/(x^T β)

In R you type: glm(y ~ x, family=xxxx) where xxxx is one of gaussian, binomial, poisson, etc. R will assume the default link.

32.7 Example. This is a famous data set collected by Sir Richard Doll in the 1950's. The data are on smoking and the number of deaths due to coronary heart disease. I am following example 9.2.1 in Dobson. Here are the data:

        Smokers                 Non-smokers
Age     Deaths  Person-years    Deaths  Person-years
35-44     32      52407            2      18790
45-54    104      43248           12      10673
55-64    206      28612           28       5710
65-74    186      12663           28       2585
75-84    102       5317           31       1462

There is an obvious increasing relationship with age which shows some hint of nonlinearity. The increase may differ between smokers and non-smokers, so we will include an interaction term. We took the midpoint of each age

05 ‘. These are Pearson residuals: ri = Yi − Yi Yi and deviance residuals di = sign(Yi − Yi ) 2(Yi log(Yi /Yi ) − (Yi − Yi )).195e+01 1.491e-02 2. Let’s look at the results.40.908 0.026e-05 0.1.290 < 2e-16 *** sm.348e-02 10.84e-13 *** smoke 3.183e-02 -1.group as the age.1) random variables if the model is correct. Both should look like N(0. In fact there are two test statistics: the χ 2 is n χ2 = i=1 2 ri and the deviance D = d2 .6574 --Signif. Error z value Pr(>|z|) (Intercept) -1.141 0.444 0.667e+00 -7.186.0.50.10673.0.ps".28.314e+00 1.70.xlab="Age".5710. codes: 0 ‘***’ 0.1 ‘ ’ 1 Null deviance: 644.1.’ 0.6))) out = glm(deaths ˜ smoke + age + agesq + sm.28. we can examine the ﬁt by comparing the observed and predicted values. n−p > > > > + > > > > > > > > > > > ### page deaths = age = py = rate smoke agesq sm.120e-03 3.5317.032e-04 -10.60.102.age = = = = 155 dobson c(32.1.age -2.28612.1462) deaths/py c(1.12.113 < 2e-16 *** agesq -3. There is also a formal way to test the model for goodness of ﬁt.rate*100000.737e+00 1. i Both should have approximately and χ2 distribution where p is the number of parameters.0.398e-01 4. When we ﬁt the model.000)".164 7.horizontal=F) plot(age.0.60.001 ‘**’ 0.family=poisson) print(summary(out)) Coefficients: Estimate Std.2.01 ‘*’ 0. 18790. There are two types of residuals.206.1.80.989e-06 2.70.12663.50.104.2540 py 8.31) c(40.ylab="Death Rate (times 100.0) age*age smoke*age postscript("poisson.pch=c(rep(19. age 4.0564 .2585.age + py.2690 on 9 degrees of freedom 68 .80) c(52407.43248.

Residual deviance: 3.6264 on 4 degrees of freedom
AIC: 70.694

When we fit the model, we can examine the fit by comparing the observed and predicted values:

> pred = predict(out, type="response")  ## as opposed to type="link"
> pearson = (deaths - pred)/sqrt(pred)
> dev = sign(deaths - pred)*sqrt(2*(deaths*log(deaths/pred) - (deaths - pred)))
> cbind(age, smoke, deaths, round(pred), round(pearson,2), round(dev,2))
   age smoke deaths
1   40     1     32  31  0.13  0.13
2   50     1    104 110 -0.57 -0.58
3   60     1    206 198  0.60  0.60
4   70     1    186 188 -0.13 -0.13
5   80     1    102 103 -0.13 -0.13
6   40     0      2   2 -0.18 -0.19
7   50     0     12  10  0.51  0.49
8   60     0     28  26  0.38  0.38
9   70     0     28  36 -1.33 -1.38
10  80     0     31  27  0.85  0.83

> ch2 = sum(pearson^2)
> print(ch2); print(1-pchisq(ch2, 10-6))
[1] 3.543416
[1] 0.4713077
> D = sum(dev^2)
> print(D); print(1-pchisq(D, 10-6))
[1] 3.626378
[1] 0.4589242

The model appears to fit well. Smoking appears to be quite important (but keep the usual causal caveats in mind). The estimated model is

    E(Y|x) = exp{ β0 + β1 smoke + β2 age + β3 age² + β4 smoke·age + β5 PY }

and hence

    E(Y | smoker, age = 40) / E(Y | non-smoker, age = 40)
      = exp{β0 + β1 + 40β2 + 1600β3 + 40β4 + 52407β5} / exp{β0 + 40β2 + 1600β3 + 18790β5}
      = e^{β1 + 40β4 + 33617β5} = 13.73,

suggesting that smokers in this group have a death rate due to coronary heart disease that is 13.73 times higher than non-smokers. Let's get a confidence interval for this. First, set

    γ = β1 + 40β4 + 33617β5 = ℓ^T β  and  γ̂ = ℓ^T β̂,

where ℓ^T = (0, 1, 0, 0, 40, 33617).

Then V(γ̂) = ℓ^T V ℓ, where V = V(β̂). An approximate 95 percent confidence interval for γ is

    (a, b) = ( γ̂ − 2√V(γ̂), γ̂ + 2√V(γ̂) ).

We are interested in ψ = e^γ. The confidence interval is (e^a, e^b). In R:

summ = summary(out)
v = summ$dispersion * summ$cov.unscaled
print(v)
ell = c(0, 1, 0, 0, 40, py[1]-py[6])
gam = sum(ell*out$coef)
print(exp(gam))
se = sqrt(ell %*% v %*% ell)
ci = exp(c(gam - 2*se, gam + 2*se))
print(ci)

The result is (7, 27).

33 Measurement Error

Suppose we are interested in regressing the outcome Y on a covariate X but we cannot observe X directly. Rather, we observe X plus noise U. The observed data are (Y1, W1), . . . , (Yn, Wn), where

    Yi = β0 + β1 Xi + εi
    Wi = Xi + Ui

where ε has mean 0 and variance σ², E(Ui) = 0, V(Ui) = σu², and U is independent of X and ε. Thus, W is a noisy version of X. This is called a measurement error problem or an errors-in-variables problem. The model is illustrated by the directed graph in Figure 21.

Figure 21: Regression with measurement error.

It is tempting to ignore the error and just regress Y on W. If the goal is just to predict Y from W then there is no problem. But if the goal is to estimate β1, regressing Y on W leads to inconsistent estimates. Let β̂1 be the least squares estimator of β1 obtained by regressing the Yi's on the Wi's, and let σx² = V(X). It can be shown that

    β̂1 −→ λβ1    (90)

where

    λ = σx²/(σx² + σu²) < 1.    (91)

Thus, if you regress Y on W, you will get an inconsistent estimate of β1: the effect of the measurement error is to bias the estimated slope towards 0, an effect that is usually called attenuation bias.

Let us give a heuristic explanation of why (90) is true. For simplicity, assume that β0 = 0 and that E(X) = 0. Then Ȳ ≈ 0, W̄ ≈ 0 and

    β̂1 = Σi (Yi − Ȳ)(Wi − W̄) / Σi (Wi − W̄)² ≈ (n^{−1} Σi Yi Wi) / (n^{−1} Σi Wi²).

Now,

    n^{−1} Σi Yi Wi = n^{−1} Σi (β1 Xi + εi)(Xi + Ui)
                    = β1 n^{−1} Σi Xi² + (β1/n) Σi Xi Ui + n^{−1} Σi Xi εi + n^{−1} Σi Ui εi
                    ≈ β1 σx².

Also,

    n^{−1} Σi Wi² = n^{−1} Σi (Xi + Ui)² = n^{−1} Σi Xi² + n^{−1} Σi Ui² + (2/n) Σi Xi Ui ≈ σx² + σu²
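The attenuation factor λ = σx²/(σx² + σu²) is easy to see in simulation. A Python sketch (illustrative; seeded for reproducibility, with parameter values chosen arbitrarily):

```python
import random

random.seed(0)
n = 100000
beta1 = 2.0
sigma_x, sigma_u, sigma_eps = 1.0, 1.0, 0.5

X = [random.gauss(0.0, sigma_x) for _ in range(n)]
Y = [beta1 * xi + random.gauss(0.0, sigma_eps) for xi in X]
W = [xi + random.gauss(0.0, sigma_u) for xi in X]  # noisy version of X

# Least squares slope from regressing Y on W (means are ~0 by construction).
slope = sum(yi * wi for yi, wi in zip(Y, W)) / sum(wi * wi for wi in W)

lam = sigma_x**2 / (sigma_x**2 + sigma_u**2)  # = 0.5 here
# slope is close to lam * beta1 = 1, not to beta1 = 2: attenuation bias.
```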

γ2 . Once we have estimates of the γ’s. γ1 .0. 0. (93) σw − σ u λ Generate new random variables Wi = W i + √ ρ Ui where Ui ∼ N (0. To do the extrapolation. 1). Recall that the least squares estimate β1 is a consistent estimate of 2 β1 σx . 2 2 σx + (1 + ρ)σu (94) Repeat this process B times (where B is large) and denote the resulting estimators by β1. Ω(ρ) = B b=1 Now comes some clever sleight of hand. . β1. γ3 ) = γ1 + γ2 γ3 + ρ (95) using standard nonlinear regression.B (ρ). γ3 ) (96) as our corrected estimate of β1 . Since. 2 2 If there are several observed values of W for each X then σu can be estimated. γ1 . γ3 ) = γ1 − γ2 + γ3 . 1. 72 .5.which yields (90). An advantage of SIMEX is that it extends readily to nonlinear and nonparametric regression. An estimate of β1 is β1 σ2 β1 = = 2 w 2 β1 . it often sufﬁces to approximate G(ρ) with a quadratic. we take β1 = G(−1. Plugging these estimates into (91). we will assume 2 2 2 2 2 that σu is known. . . γ2 . γ1 . In such cases. Then deﬁne B 1 β1. γ2 .0. The idea is to compute Ω(ρ) for a range of values of ρ such as 0. we ﬁt the values Ω(ρj ) to the curve G(ρ. . σw = σx + σu . We then extrapolate the curve Ω(ρ) back to ρ = −1. σu must be estimated by external means such as through background knowledge of the noise mechanism. This estimator makes little sense if σ w − σu ≤ 0. 2. Fitting the nonlinear regression (95) is inconvenient. Another method for correcting the attenuation bias is SIMEX which stands for simulation extrapolation and is due to Cook and Stefanski. Otherwise.1 (ρ). see Figure 22. For our purposes. γ1 . The least squares estimate obtained by regressing the Yi s on the Wi s is a consistent estimate of Ω(ρ) = 2 β1 σx . 2 2 σx + σ u 2 2 2 2 where σw is the sample variance of the Wi s. Thus. we ﬁt the Ω(ρj ) s to the curve Q(ρ.5.b (ρ). γ3 ) = γ1 + γ2 ρ + γ3 ρ2 and the corrected estimate of β1 is β1 = Q(−1. Setting ρ = −1 in (94) we see that Ω(−1) = β 1 which is the quantity we want to estimate. 1. γ2 . 
one might reasonable conclude that the sample size is simply not large enough to estimate β 1 . we can estimate σx by 2 2 2 σx = σ w − σ u (92) 2 2 This is called the method of moments estimator. we get an estimate λ = (σw − σu )/σw of λ.
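The attenuation bias and the quadratic SIMEX extrapolation are easy to see in simulation. The following is a minimal Python/NumPy sketch (the notes use R; the sample size, variances and number of replicates here are illustrative choices, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated errors-in-variables data: we observe W = X + U, not X itself.
n = 5000
beta1 = 2.0
sigma_x, sigma_u = 1.0, 0.5
x = rng.normal(0, sigma_x, n)
y = beta1 * x + rng.normal(0, 0.2, n)
w = x + rng.normal(0, sigma_u, n)

def ols_slope(w, y):
    w_c = w - w.mean()
    return np.sum(w_c * (y - y.mean())) / np.sum(w_c ** 2)

# Naive regression of Y on W is biased towards 0 by
# lambda = sigma_x^2 / (sigma_x^2 + sigma_u^2) = 0.8 here.
naive = ols_slope(w, y)

# SIMEX: add extra noise of variance rho * sigma_u^2, average over B replicates,
# then extrapolate a quadratic fit of Omega(rho) back to rho = -1.
rhos = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 50
omega = []
for rho in rhos:
    est = [ols_slope(w + np.sqrt(rho) * rng.normal(0, sigma_u, n), y)
           for _ in range(B)]
    omega.append(np.mean(est))

# polyfit returns (g3, g2, g1) for g1 + g2*rho + g3*rho^2; evaluate at rho = -1.
g3, g2, g1 = np.polyfit(rhos, omega, 2)
simex = g1 - g2 + g3
```

With σx² = 1 and σu² = 0.25, λ = 0.8, so the naive slope concentrates near 1.6 while the SIMEX extrapolation moves the estimate back towards the true β1 = 2.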

Figure 22: In the SIMEX method we extrapolate Ω̂(ρ) back to ρ = −1. The curve passes through the uncorrected least squares estimate at ρ = 0; its extrapolated value at ρ = −1 is the SIMEX estimate of β1.

34 Nonparametric Regression

Now we will study nonparametric regression, also known as "learning a function" in the jargon of machine learning. We are given n pairs of observations

    (X1, Y1), ..., (Xn, Yn)    (97)

where

    Yi = r(Xi) + εi,  i = 1, ..., n,    (98)

and r(x) = E(Y | X = x). Our goal is to estimate r.

34.1 Example (CMB data). Figure 23 shows data on the cosmic microwave background (CMB). The first plot shows 899 data points over the whole range while the second plot shows the first 400 data points. The horizontal axis is the multipole moment, essentially the frequency of fluctuations in the temperature field of the CMB. The vertical axis is the power or strength of the fluctuations at each frequency. We have noisy measurements Yi of r(Xi), so the data are of the form (97). It is believed that r may have three peaks over the range of the data. The first peak, around x ≈ 200, is obvious from the second plot. There may be a second and third peak further to the right, but their presence is much less obvious; careful inferences are required to assess the significance of these peaks.

Figure 23: CMB data. The top plot shows the full data set. The bottom plot shows the first 400 data points.

34.2 Example (LIDAR). These are data from a light detection and ranging (LIDAR) experiment; LIDAR is used to monitor pollutants. Figure 24 shows 221 observations. The response is the log of the ratio of light received from two lasers. The frequency of one laser is the resonance frequency of mercury while the second has a different frequency. The estimates shown in the figure are regressograms, obtained by averaging the Yi's over bins.

Figure 24: The LIDAR data from Example 34.2. The estimates are regressograms. As we decrease the binwidth h, the estimated regression function r̂n goes from oversmoothing to undersmoothing.

The simplest nonparametric estimator is the regressogram. Suppose the Xi's are in the interval [a, b]. Divide the interval into m bins of equal length, denoted B1, ..., Bm; thus each bin has length h = (b − a)/m. The smoothing parameter h is the width of the bins; as the binsize h decreases, the estimator becomes less smooth. Let kj be the number of observations in bin Bj and let Ȳj be the mean of the Yi's in bin Bj. Define

    r̂n(x) = Ȳj  for x ∈ Bj.

We can rewrite the estimator as

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi    (99)

where ℓi(x) = 1/kj if x, Xi ∈ Bj and ℓi(x) = 0 otherwise.

Let us now compute the bias and variance of this estimator. For simplicity, suppose that [a, b] = [0, 1] and that the Xi's are equally spaced so that each bin has k = n/m observations. Let us focus on r̂n(0). The mean (conditional on the Xi's) is

    E(r̂n(0)) = (1/k) Σ_{i∈B1} E(Yi) = (1/k) Σ_{i∈B1} r(Xi).

By Taylor's theorem, r(Xi) ≈ r(0) + Xi r′(0). So,

    E(r̂n(0)) ≈ r(0) + r′(0) (1/k) Σ_{i∈B1} Xi.

The largest that Xi can be in bin B1 is the length of the bin, h = 1/m. So the absolute value of the bias is

    | r′(0) (1/k) Σ_{i∈B1} Xi | ≤ h |r′(0)|.

The variance is

    σ²/k = mσ²/n = σ²/(nh).

Large bins cause large bias; small bins cause large variance. The mean squared error is the squared bias plus the variance:

    MSE = h² (r′(0))² + σ²/(nh).

The MSE is minimized at

    h = ( σ² / (2 (r′(0))² n) )^{1/3} = c/n^{1/3}

for some c. With this optimal value of h, the risk (or MSE) is of the order n^{−2/3}.

Another simple estimator is the local average, defined by

    r̂n(x) = (1/kx) Σ_{i: |Xi − x| ≤ h} Yi.    (100)

The smoothing parameter is h. This is a special case of the kernel estimator discussed shortly. We can rewrite the estimator as

    r̂n(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{i=1}^n K((x − Xi)/h)    (101)

where K(z) = 1 if |z| ≤ 1 and K(z) = 0 if |z| > 1. We shall see later that this estimator has risk of order n^{−4/5}, which is better than n^{−2/3}. We can further rewrite the estimator as r̂n(x) = Σ_{i=1}^n Yi ℓi(x) where ℓi(x) = K((x − Xi)/h) / Σ_{t=1}^n K((x − Xt)/h).

Notice that both estimators so far have the form r̂n(x) = Σ_{i=1}^n ℓi(x) Yi. In fact, most of the estimators we consider have this form.

34.3 Definition. An estimator r̂n of r is a linear smoother if, for each x, there exists a vector ℓ(x) = (ℓ1(x), ..., ℓn(x))ᵀ such that

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi.    (102)

Define the vector of fitted values

    Ŷ = (r̂n(x1), ..., r̂n(xn))ᵀ    (103)

where Y = (Y1, ..., Yn)ᵀ. It then follows that

    Ŷ = LY    (104)

where L is an n × n matrix whose ith row is ℓ(Xi)ᵀ; thus, Lij = ℓj(Xi). The entries of the ith row show the weights given to each Yi in forming the estimate r̂n(Xi).
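As a concrete illustration of the linear-smoother form (99), here is a small Python/NumPy sketch of the regressogram weights (the notes use R; the bin count and evaluation point below are arbitrary illustrative choices):

```python
import numpy as np

def regressogram_weights(x, x0, m, a=0.0, b=1.0):
    """Weights l_i(x0) of the regressogram with m equal-width bins on [a, b]."""
    h = (b - a) / m
    bin_of = lambda t: np.minimum(((t - a) / h).astype(int), m - 1)
    in_bin = bin_of(x) == bin_of(np.array([x0]))[0]
    return in_bin / in_bin.sum()        # 1/k_j inside the bin of x0, 0 elsewhere

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 400)
y = x ** 2 + rng.normal(0, 0.05, 400)

ell = regressogram_weights(x, 0.55, m=10)
fit = ell @ y                           # r_hat(0.55) = sum_i l_i(0.55) Y_i
```

The weights are nonnegative, sum to 1, and the fit is exactly the mean of the Yi's whose Xi fall in the same bin as the evaluation point.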

. Bm . .7 Example. We deﬁne the effective degrees of freedom by ν = tr(L). . B2 . n rn (x) = i=1 Yi i (x) where i (x) = 1/nx if |Xi − x| ≤ h and i (x) = 0 otherwise. 3 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 In general. 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 L = × 0 0 0 1 1 1 0 0 0 . . Fix h > 0 and let Bx = {i : |Xi − x| ≤ h}. a special case of the kernel estimator discussed shortly. 0. We have Y = HY where H = X(X T X)−1 X T . For any x for which nx > 0 deﬁne 1 Yi . . .6 Example (Local averages). 1/2 1/2 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 . . rn (x) = nx i∈Bx 34. . L= 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 1/2 1/2 . . The vector of weights (x) looks like this: 1 1 (x)T = 0. Linear Regression. 0. b) into m equally spaced bins denoted by B 1 . suppose that n = 9. 34.34. . An example is given in Figure 24. suppose that n = 9. Then. In other words. Deﬁne rn (x) by 1 rn (x) = Yi . Let nx be the number of points in Bx . We can write r(x) = xT βxT (X T X)−1 X T Y = i i (x)Yi . (105) 34. . For x ∈ B j deﬁne n i (x) = 1/kj if Xi ∈ Bj and i (x) = 0 otherwise. Recall that we divide (a. 0. . The matrix L is called the smoothing matrix or the hat matrix. the estimate rn is a step function obtained by averaging the Yi s over each bin. The ith row of L is called the effective kernel for estimating r(Xi ). Then. . Xi = i/9 and h = 1/9. . As a simple example. This estimate is called the regressogram. 77 This is the local average estimator of r(x). it is easy to see that there are ν = tr(L) = m effective degrees of freedom.4 Deﬁnition. m = 3 and k 1 = k2 = k3 = 3. rn (x) = i=1 Yi i (x). Thus. The binwidth h = (b − a)/m controls how smooth the estimate is. . 0 . kj kj To see what the smoothing matrix L looks like. . In this case. . .5 Example (Regressogram). 
for x ∈ Bj (106) kj i:Xi ∈Bj where kj is the number of points in Bj .
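The smoothing matrix of Example 34.5 can be rebuilt directly. This Python sketch (illustrative, not part of the original R-based notes) reconstructs the n = 9, m = 3 case and checks that ν = tr(L) = m:

```python
import numpy as np

# Smoothing matrix of the regressogram with n = 9 equally spaced points, m = 3 bins.
n, m = 9, 3
x = (np.arange(n) + 0.5) / n            # equally spaced design points in (0, 1)
bins = (x * m).astype(int)              # bin index of each X_i
L = np.zeros((n, n))
for i in range(n):
    same = bins == bins[i]
    L[i, same] = 1.0 / same.sum()       # row i holds the weights l_j(X_i)

nu = np.trace(L)                        # effective degrees of freedom
```

Each row of L sums to 1 (the estimator is a weighted average), and the trace equals the number of bins.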

35 Choosing the Smoothing Parameter

The smoothers depend on some smoothing parameter h and we will need some way of choosing h. Recall from our discussion of variable selection that the predictive risk is

    E(Y − r̂n(X))² = σ² + E(r(X) − r̂n(X))² = σ² + MSE

where MSE means mean squared error:

    MSE = ∫ bias²(x) p(x) dx + ∫ var(x) p(x) dx,

where bias(x) = E(r̂n(x)) − r(x) is the bias of r̂n(x) and var(x) = Variance(r̂n(x)) is the variance. The bias increases and the variance decreases with the amount of smoothing. When the data are oversmoothed, the bias term is large and the variance is small; when the data are undersmoothed the opposite is true. This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and variance; see Figure 25.

Figure 25: The bias–variance tradeoff. The squared bias increases and the variance decreases with the amount of smoothing. The optimal amount of smoothing, indicated by the vertical line, minimizes the risk = bias² + variance.

Ideally, we would like to choose h to minimize the risk R(h), but R(h) depends on the unknown function r(x). Instead, we will minimize an estimate R̂(h) of R(h). As a first guess, we might use the average residual sum of squares, also called the training error,

    (1/n) Σ_{i=1}^n (Yi − r̂n(Xi))²    (107)

to estimate R(h). This turns out to be a poor estimate of R(h): it is biased downwards and typically leads to undersmoothing (overfitting). The reason is that we are using the data twice: to estimate the function and to estimate the risk. The function estimate is chosen to make Σ_{i=1}^n (Yi − r̂n(Xi))² small, so this will tend to underestimate the risk.

We will estimate the risk using the leave-one-out cross-validation score, defined as follows.

35.1 Definition. The leave-one-out cross-validation score is defined by

    CV = R̂(h) = (1/n) Σ_{i=1}^n (Yi − r̂(−i)(Xi))²    (108)

where r̂(−i) is the estimator obtained by omitting the ith pair (Xi, Yi).

The intuition for cross-validation is as follows. Note that

    E(Yi − r̂(−i)(Xi))² = E(Yi − r(Xi) + r(Xi) − r̂(−i)(Xi))²
                        = σ² + E(r(Xi) − r̂(−i)(Xi))²
                        ≈ σ² + E(r(Xi) − r̂n(Xi))²

and hence

    E(R̂) ≈ predictive risk.    (109)

Thus the cross-validation score is a nearly unbiased estimate of the risk, and the smoothing parameter h can be chosen by minimizing R̂(h).

There is a shortcut formula for computing R̂, just like in linear regression.

35.2 Theorem. Let r̂n be a linear smoother. Then the leave-one-out cross-validation score R̂(h) can be written as

    R̂(h) = (1/n) Σ_{i=1}^n ( (Yi − r̂n(Xi)) / (1 − Lii) )²    (110)

where Lii = ℓi(Xi) is the ith diagonal element of the smoothing matrix L.

Rather than minimizing the cross-validation score, an alternative is to use generalized cross-validation, in which each Lii in equation (110) is replaced with its average n⁻¹ Σ_{i=1}^n Lii = ν/n, where ν = tr(L) is the effective degrees of freedom. Thus, we would minimize

    GCV(h) = (1/n) Σ_{i=1}^n ( (Yi − r̂n(Xi)) / (1 − ν/n) )² = R̂training / (1 − ν/n)².    (111)

Usually, the bandwidth that minimizes the generalized cross-validation score is close to the bandwidth that minimizes the cross-validation score. Using the approximation (1 − x)⁻² ≈ 1 + 2x we see that

    GCV(h) ≈ (1/n) Σ_{i=1}^n (Yi − r̂n(Xi))² + 2νσ̂²/n ≡ Cp    (112)

where σ̂² = n⁻¹ Σ_{i=1}^n (Yi − r̂n(Xi))². Equation (112) is just like the Cp statistic.

36 Kernel Regression

We will often use the word "kernel." For our purposes, the word kernel refers to any smooth function K such that K(x) ≥ 0 and

    ∫ K(x) dx = 1,  ∫ x K(x) dx = 0  and  σK² ≡ ∫ x² K(x) dx > 0.    (113)
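The shortcut in Theorem 35.2 can be verified numerically for any linear smoother whose weights renormalize after deleting a point. Below is a Python sketch using a Gaussian-kernel smoother (the data and bandwidth are arbitrary illustrative choices, not from the notes); for this smoother the identity (110) holds exactly:

```python
import numpy as np

def nw_weights(x0, x, h):
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # Gaussian kernel
    return k / k.sum()

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(4 * x) + rng.normal(0, 0.1, 60)
h = 0.1

W = np.array([nw_weights(xi, x, h) for xi in x])   # smoothing matrix L
fitted = W @ y

# Shortcut formula (110)
cv_shortcut = np.mean(((y - fitted) / (1 - np.diag(W))) ** 2)

# Brute-force leave-one-out
resid = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    w = nw_weights(x[i], x[mask], h)
    resid.append(y[i] - w @ y[mask])
cv_brute = np.mean(np.square(resid))
```

The two scores agree to machine precision, which is why the shortcut is used in practice: it needs one fit instead of n.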

Some commonly used kernels are the following:

    the boxcar kernel:       K(x) = (1/2) I(x),
    the Gaussian kernel:     K(x) = (1/√(2π)) e^{−x²/2},
    the Epanechnikov kernel: K(x) = (3/4)(1 − x²) I(x),
    the tricube kernel:      K(x) = (70/81)(1 − |x|³)³ I(x),

where

    I(x) = 1 if |x| ≤ 1,  I(x) = 0 if |x| > 1.

These kernels are plotted in Figure 26.

Figure 26: Examples of kernels: boxcar (top left), Gaussian (top right), Epanechnikov (bottom left), and tricube (bottom right).
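A quick numerical check that these four functions really are kernels in the sense of (113) — nonnegative and integrating to 1 — can be done with a Riemann sum (a Python sketch, not part of the original notes):

```python
import numpy as np

def boxcar(x):       return 0.5 * (np.abs(x) <= 1)
def gaussian(x):     return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
def epanechnikov(x): return 0.75 * (1 - x ** 2) * (np.abs(x) <= 1)
def tricube(x):      return (70 / 81) * (1 - np.abs(x) ** 3) ** 3 * (np.abs(x) <= 1)

grid = np.linspace(-5, 5, 200001)
dx = grid[1] - grid[0]
# Riemann-sum approximation of the integral of each kernel
integrals = {f.__name__: f(grid).sum() * dx
             for f in (boxcar, gaussian, epanechnikov, tricube)}
```

Each integral comes out to 1 up to discretization error, confirming the normalizing constants 1/2, 1/√(2π), 3/4 and 70/81.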

36.1 Definition. Let h > 0 be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi    (114)

where K is a kernel and the weights ℓi(x) are given by

    ℓi(x) = K((x − Xi)/h) / Σ_{j=1}^n K((x − xj)/h).    (115)

36.2 Remark. The local average estimator in Example 34.6 is a kernel estimator based on the boxcar kernel.

R-code. In R, I suggest using the loess command or using the locfit library. (You need to download locfit.) For loess:

    out = loess(y ~ x, span=.25, degree=0)

The span option is the bandwidth. The meaning of deg=0 will be explained later. Now try

    names(out)
    print(out)
    summary(out)
    plot(x,y)
    lines(x,fitted(out))
    plot(x,residuals(out))

The command for kernel regression in locfit is:

    out = locfit(y ~ x, alpha=c(0.25,0), deg=0)

There are two ways to specify the smoothing parameter. The first way is as a percentage of the data: for example, alpha=c(.25,0) makes the bandwidth big enough so that one quarter of the data falls in the kernel. To smooth with a specific value for the bandwidth (as we are doing) we use alpha=c(0,h) where h is the bandwidth you want to use. The alpha=c(0,h) part looks strange; see help(locfit) for details. To compute GCV, you will need the effective number of parameters, which you get by typing:

    out$enp

To do cross-validation, create a vector of bandwidths h = (h1, ..., hk); alpha then needs to be a matrix. For example:

    h = c( ... put your values here ... )
    k = length(h)
    zero = rep(0,k)
    H = cbind(zero,h)
    out = gcvplot(y~x, deg=0, alpha=H)
    plot(out$df, out$values)
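To mirror the gcvplot loop above in a language-neutral way, one can compute the GCV score (111) of the Nadaraya–Watson smoother over a grid of bandwidths. A Python sketch with made-up data (the bandwidth grid and test function are illustrative assumptions):

```python
import numpy as np

def nw_smooth_matrix(x, h):
    d = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * d ** 2)                 # Gaussian kernel
    return K / K.sum(axis=1, keepdims=True)   # rows are the weights l_j(X_i)

def gcv(y, L):
    n = len(y)
    nu = np.trace(L)                          # effective degrees of freedom
    resid = y - L @ y
    return np.mean(resid ** 2) / (1 - nu / n) ** 2

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 150))
y = np.sin(6 * x) + rng.normal(0, 0.2, 150)

bandwidths = [0.01, 0.03, 0.05, 0.1, 0.2, 0.5]
scores = {h: gcv(y, nw_smooth_matrix(x, h)) for h in bandwidths}
best_h = min(scores, key=scores.get)
```

Very large bandwidths are heavily penalized by the residual term (bias), very small ones by the (1 − ν/n)⁻² factor (variance), so the minimizer lies in between.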

36.3 Example (CMB data). Recall the CMB data from Figure 23. Figure 27 shows four different kernel regression fits (using just the first 400 data points) based on increasing bandwidths. The top two plots are based on small bandwidths and the fits are too rough. The bottom right plot is based on a large bandwidth and the fit is too smooth; the bottom left plot is just right. The bottom right plot also shows the presence of bias near the boundaries; as we shall see, this is a general feature of kernel regression. The bottom plot in Figure 28 shows a kernel fit to all the data points, with the bandwidth chosen by cross-validation.

Figure 27: Four kernel regressions for the CMB data using just the first 400 data points. The bandwidths used were h = 1 (top left), h = 10 (top right), h = 50 (bottom left), h = 200 (bottom right). As the bandwidth h increases, the estimated function goes from being too rough to too smooth.

The choice of kernel K is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. What matters much more is the choice of bandwidth h, which controls the amount of smoothing. In general, small bandwidths give very rough estimates while larger bandwidths give smoother estimates. The following theorem shows how the bandwidth affects the estimator. To state the result we need some assumption about the behavior of x1, ..., xn as n increases; for the purposes of the theorem, we will assume that these are random draws from some density f. Also, we will let the bandwidth depend on the sample size, so we sometimes write hn.

36.4 Theorem. The risk (using integrated squared error loss) of the Nadaraya–Watson kernel estimator is

    R(r̂n, r) = (hn⁴/4) ( ∫ x²K(x)dx )² ∫ ( r″(x) + 2 r′(x) f′(x)/f(x) )² dx
              + (σ²/(n hn)) ∫ K²(x)dx ∫ (1/f(x)) dx + o((n hn)⁻¹) + o(hn⁴)    (116)

as hn → 0 and n hn → ∞.

The first term in (116) is the squared bias and the second term is the variance. What is especially notable is the presence of the term

    2 r′(x) f′(x)/f(x)    (117)

in the bias. This means that the bias is sensitive to the position of the Xi's; we call (117) the design bias since it depends on the design, that is, the distribution of the Xi's. Furthermore, it can be shown that kernel estimators also have high bias near the boundaries; this is known as boundary bias. We will see that we can reduce these biases by using a refinement called local polynomial regression.

If we differentiate (116) and set the result equal to 0, we find that the optimal bandwidth h* is

    h* = (1/n)^{1/5} ( σ² ∫K²(x)dx ∫ (1/f(x)) dx
         / ( (∫x²K(x)dx)² ∫ ( r″(x) + 2r′(x) f′(x)/f(x) )² dx ) )^{1/5}.    (118)

Thus, h* = O(n^{−1/5}). Plugging h* back into (116) we see that the risk decreases at rate O(n^{−4/5}). In (most) parametric models, the risk of the maximum likelihood estimator decreases to 0 at rate 1/n; the slower rate n^{−4/5} is the price of using nonparametric methods. In practice, we cannot use the bandwidth given in (118) since h* depends on the unknown function r. Instead, we use leave-one-out cross-validation as described in Theorem 35.2.

36.5 Example. Figure 28 shows the cross-validation score for the CMB example as a function of the effective degrees of freedom. The optimal smoothing parameter was chosen to minimize this score. The resulting fit is also shown in the figure. Note that the fit gets quite variable to the right.

37 Local Polynomials

Kernel estimators suffer from boundary bias and design bias. These problems can be alleviated by using a generalization of kernel regression called local polynomial regression.

To motivate this estimator, let x be some fixed value at which we want to estimate r(x), and first consider choosing an estimator a ≡ r̂n(x) to minimize the sum of squares Σ_{i=1}^n (Yi − a)². The solution is the constant function r̂n(x) = Ȳ, which is obviously not a good estimator of r(x). Now define the weight function wi(x) = K((Xi − x)/h) and choose a ≡ r̂n(x) to minimize the weighted sum of squares

    Σ_{i=1}^n wi(x)(Yi − a)².    (119)

From elementary calculus, we see that the solution is

    r̂n(x) ≡ Σ_{i=1}^n wi(x) Yi / Σ_{i=1}^n wi(x),

which is exactly the kernel regression estimator. This gives us an interesting interpretation of the kernel estimator: it is a locally constant estimator, obtained from locally weighted least squares. This suggests that we might improve the estimator by using a local polynomial of degree p instead of a local constant.

For values u in a neighborhood of the target value x, define the polynomial

    Px(u; a) = a0 + a1(u − x) + (a2/2!)(u − x)² + ⋯ + (ap/p!)(u − x)^p.    (120)

We can approximate a smooth regression function r(u) in a neighborhood of the target value x by the polynomial:

    r(u) ≈ Px(u; a).    (121)
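The locally constant interpretation can be checked directly: the minimizer of the weighted sum of squares (119) coincides with the Nadaraya–Watson estimate. A Python sketch (the grid search is only a numerical cross-check; data and bandwidth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 50)
y = np.cos(3 * x) + rng.normal(0, 0.1, 50)

x0, h = 0.4, 0.15
w = np.exp(-0.5 * ((x - x0) / h) ** 2)       # kernel weights w_i(x0)

# Closed-form minimizer of sum_i w_i (Y_i - a)^2: the weighted mean
a_hat = np.sum(w * y) / np.sum(w)

# Numerical minimization over a grid of candidate constants a
grid = np.linspace(y.min(), y.max(), 2001)
loss = [np.sum(w * (y - a) ** 2) for a in grid]
a_grid = grid[int(np.argmin(loss))]
```

The grid minimizer agrees with the weighted mean up to the grid spacing, which is exactly the Nadaraya–Watson estimate at x0.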

Figure 28: Top: the cross-validation (CV) score as a function of the effective degrees of freedom. Bottom: the kernel fit using the bandwidth that minimizes the cross-validation score.

We estimate a = (a0, ..., ap)ᵀ by choosing â = (â0, ..., âp)ᵀ to minimize the locally weighted sum of squares

    Σ_{i=1}^n wi(x) (Yi − Px(Xi; a))².    (122)

The estimator â depends on the target value x, so we write â(x) = (â0(x), ..., âp(x))ᵀ if we want to make this dependence explicit. The local estimate of r is r̂n(u) = Px(u; â). In particular, at the target value u = x we have

    r̂n(x) = Px(x; â) = â0(x).    (123)

Warning! Although r̂n(x) only depends on â0(x), this is not equivalent to simply fitting a local constant. Setting p = 0 gives back the kernel estimator. The special case where p = 1 is called local linear regression and this is the version we recommend as a default choice. As we shall see, local polynomial estimators, and in particular local linear estimators, have some remarkable properties.

To find â(x), it is helpful to re-express the problem in vector notation. Let

         [ 1  x1 − x  ⋯  (x1 − x)^p / p! ]
         [ 1  x2 − x  ⋯  (x2 − x)^p / p! ]
    Xx = [ ⋮      ⋮              ⋮        ]    (124)
         [ 1  xn − x  ⋯  (xn − x)^p / p! ]

and let Wx be the n × n diagonal matrix whose (i, i) component is wi(x). We can rewrite (122) as

    (Y − Xx a)ᵀ Wx (Y − Xx a).    (125)

Minimizing (125) gives the weighted least squares estimator

    â(x) = (Xxᵀ Wx Xx)⁻¹ Xxᵀ Wx Y.    (126)

In particular, r̂n(x) = â0(x) is the inner product of the first row of (Xxᵀ Wx Xx)⁻¹ Xxᵀ Wx with Y. Thus we have: the local polynomial regression estimate is

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi    (127)

where

    ℓ(x)ᵀ = ( ℓ1(x), ..., ℓn(x) ) = e1ᵀ (Xxᵀ Wx Xx)⁻¹ Xxᵀ Wx,

e1 = (1, 0, ..., 0)ᵀ, and Xx and Wx are defined in (124). Once again, the estimator is a linear smoother and we can choose the bandwidth by minimizing the cross-validation formula given in Theorem 35.2.

R-code. The R-code is the same as before except that we use deg = 1 for local linear, deg = 2 for local quadratic, etc. Thus, for local linear regression:

    loess(y ~ x, span=h, degree=1)
    locfit(y ~ x, deg=1, alpha=c(0,h))

37.1 Example (LIDAR). These data were introduced in Example 34.2. Figure 29 shows the 221 observations. The top left plot shows the data and the fitted function using local linear regression. The cross-validation curve (not shown) has a well-defined minimum at h ≈ 37 corresponding to 9 effective degrees of freedom; the fitted function uses this bandwidth. The top right plot shows the residuals. There is clear heteroscedasticity (nonconstant variance). The bottom left plot shows the estimate of σ(x) using the method described later. As expected, there is much greater uncertainty for larger values of the covariate. Next we compute 95 percent confidence bands (explained later); the resulting bands are shown in the lower right plot.
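Formula (126) is just weighted least squares at each target point. The Python sketch below implements it for p = 1 and illustrates the boundary behavior discussed later in this section: on noiseless linear data the local linear fit is exact even at x = 0, while the kernel (p = 0) fit is biased there. (Data and bandwidth are illustrative assumptions.)

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear fit at x0: weighted least squares on (1, x - x0)."""
    Xd = np.column_stack([np.ones_like(x), x - x0])
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    A = Xd.T @ (w[:, None] * Xd)        # Xx^T Wx Xx
    b = Xd.T @ (w * y)                  # Xx^T Wx Y
    a = np.linalg.solve(A, b)
    return a[0]                         # r_hat(x0) = a0(x0)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 80))
y_lin = 1.0 + 2.0 * x                   # noiseless linear truth

fit_boundary = local_linear(0.0, x, y_lin, h=0.1)        # local linear at x = 0
w = np.exp(-0.5 * (x / 0.1) ** 2)
nw_boundary = np.sum(w * y_lin) / np.sum(w)              # kernel (p = 0) at x = 0
```

The local linear estimator reproduces the line exactly (fit ≈ 1 at x = 0); the kernel estimator averages points only to the right of the boundary and so overshoots.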

Figure 29: The LIDAR data from Example 37.1. Top left: data and the fitted function using local linear regression with h ≈ 37 (chosen by cross-validation). Top right: the residuals. Bottom left: estimate of σ(x). Bottom right: 95 percent confidence bands.

37.2 Theorem (Local Linear Smoothing). When p = 1,

    r̂n(x) = Σ_{i=1}^n ℓi(x) Yi

where

    ℓi(x) = bi(x) / Σ_{j=1}^n bj(x),
    bi(x) = K((Xi − x)/h) ( Sn,2(x) − (Xi − x) Sn,1(x) )    (128)

and

    Sn,j(x) = Σ_{i=1}^n K((Xi − x)/h)(Xi − x)^j,  j = 1, 2.

37.3 Example. Figure 30 shows the local regression for the CMB data for p = 0 and p = 1. The bottom plots zoom in on the left boundary. Note that for p = 0 (the kernel estimator), the fit is poor near the boundaries due to boundary bias.

37.4 Example (Doppler function). Let

    r(x) = √(x(1 − x)) sin( 2.1π / (x + .05) ),  0 ≤ x ≤ 1,    (129)

which is called the Doppler function.

The Doppler function is spatially inhomogeneous, which means that its smoothness (second derivative) varies over x. This function is difficult to estimate and provides a good test case for nonparametric regression methods. The function is plotted in the top left plot of Figure 31. The top right plot shows 1000 data points simulated from Yi = r(i/n) + σεi with σ = .1 and εi ∼ N(0, 1). The bottom left plot shows the cross-validation score versus the effective degrees of freedom using local linear regression. The minimum occurred at 166 degrees of freedom, corresponding to a bandwidth of .005. The fitted function is shown in the bottom right plot. The fit has high effective degrees of freedom and hence the fitted function is very wiggly. This is because the estimate is trying to fit the rapid fluctuations of the function near x = 0. If we used more smoothing, the right-hand side of the fit would look better, at the cost of missing the structure near x = 0. This is always a problem when estimating spatially inhomogeneous functions. We'll discuss that further later.

The following theorem gives the large sample behavior of the risk of the local linear estimator and shows why local linear regression is better than kernel regression.

37.5 Theorem. Let Yi = r(Xi) + σ(Xi)εi for i = 1, ..., n and a ≤ Xi ≤ b. Assume that X1, ..., Xn are a sample from a distribution with density f and that (i) f(x) > 0, (ii) f, r″ and σ² are continuous in a neighborhood of x, and (iii) hn → 0 and n hn → ∞. Let x ∈ (a, b). Given X1, ..., Xn, we have the following: the local linear estimator and the kernel estimator both have variance

    ( σ²(x) / (f(x) n hn) ) ∫ K²(u) du + oP( (n hn)⁻¹ ).    (130)

0 −1 0 1 0.5 1. the Nadaraya–Watson kernel estimator has asymptotic bias of order hn while the local linear estimator has bias of order h2 . The function (top left).0 0.0 Figure 31: The Doppler function estimated by local linear regression. taking p odd reduces design bias and boundary bias without increasing variance. and the ﬁtted function (bottom right). local n linear estimation eliminates boundary bias. the local linear estimator is free from design bias.0 100 150 200 −1 0 1 0.1 0 −1 0. The above result holds more generally for local polynomials of order p.0 0. the data (top right). 88 .6 Remark.5 1. Generally. In this sense. At the boundary points a and b. the cross-validation score versus effective degrees of freedom (bottom left).5 1. 37. The Nadaraya–Watson kernel estimator has bias h2 n 1 r (x)f (x) r (x) + 2 f (x) u2 K(u)du + oP (h2 ) (131) whereas the local linear estimator has asymptotic bias 1 h2 r (x) n 2 u2 K(u)du + oP (h2 ) (132) Thus.0 0.

An alternative to locfit is loess:

    out = loess(y ~ x, span=.25, degree=1)
    plot(x, fitted(out))
    out$trace.hat   ### effective degrees of freedom

38 Penalized Regression, Regularization and Splines

Consider once again the regression model Yi = r(Xi) + εi and suppose we estimate r by choosing r̂n(x) to minimize the sum of squares Σ_{i=1}^n (Yi − r̂n(Xi))². Minimizing over all linear functions (i.e., functions of the form β0 + β1x) yields the least squares estimator. Minimizing over all functions yields a function that interpolates the data. In the previous section we avoided these two extreme solutions by replacing the sum of squares with a locally weighted sum of squares. An alternative way to get solutions in between these extremes is to minimize the penalized sum of squares

    M(λ) = Σ_i (Yi − r̂n(Xi))² + λ J(r)    (133)

where

    J(r) = ∫ (r″(x))² dx    (134)

is a roughness penalty. Adding a penalty term to the criterion we are optimizing is sometimes called regularization. The parameter λ controls the trade-off between fit (the first term of (133)) and the penalty, and hence the amount of smoothing. Let r̂n denote the function that minimizes M(λ); the estimator r̂n is called a smoothing spline. When λ = 0, the solution is the interpolating function; when λ → ∞, r̂n converges to the least squares line. What does r̂n look like for 0 < λ < ∞? To answer this question, we need to define splines.

A spline is a special piecewise polynomial.

38.1 Definition. Let ξ1 < ξ2 < ⋯ < ξk be a set of ordered points—called knots—contained in some interval (a, b). A cubic spline is a continuous function r such that (i) r is a cubic polynomial over each interval (ξ1, ξ2), ..., and (ii) r has continuous first and second derivatives at the knots. More generally, an Mth-order spline is a piecewise M − 1 degree polynomial with M − 2 continuous derivatives at the knots. A spline that is linear beyond the boundary knots is called a natural spline.

The most commonly used splines are piecewise cubic splines (M = 4). They arise naturally in the penalized regression framework, as the following theorem shows.

38.2 Theorem. The function r̂n(x) that minimizes M(λ) with penalty (134) is a natural cubic spline with knots at the data points.

The theorem above does not give an explicit form for r̂n. To find one, we will construct a basis for the set of splines. Let ξ0 = a and ξk+1 = b. Define new knots τ1, ..., τM such that

    τ1 ≤ τ2 ≤ τ3 ≤ ⋯ ≤ τM ≤ ξ0,  τ(j+M) = ξj for j = 1, ..., k,  and  ξ(k+1) ≤ τ(k+M+1) ≤ ⋯ ≤ τ(k+2M).

The choice of extra knots is arbitrary; usually one takes τ1 = ⋯ = τM = ξ0 and ξ(k+1) = τ(k+M+1) = ⋯ = τ(k+2M). We define the basis functions recursively as follows. First we define

    Bi,1 = 1 if τi ≤ x < τ(i+1),  and 0 otherwise,

for i = 1, ..., k + 2M − 1. Next, for m ≤ M we define

    Bi,m = ( (x − τi) / (τ(i+m−1) − τi) ) Bi,m−1 + ( (τ(i+m) − x) / (τ(i+m) − τ(i+1)) ) B(i+1),m−1

for i = 1, ..., k + 2M − m. It is understood that if the denominator is 0, then the function is defined to be 0.

38.3 Theorem. The functions {Bi,4, i = 1, ..., k + 4} are a basis for the set of cubic splines. They are called the B-spline basis functions.

Hence, any cubic spline f(x) can be written as f(x) = Σ_{j=1}^N βj Bj(x). B-spline basis functions have compact support, which makes it possible to speed up calculations. Figure 32 shows the cubic B-spline basis using nine equally spaced knots on (0,1).

Figure 32: Cubic B-spline basis using nine equally spaced knots on (0,1).

We are now in a position to describe the spline estimator in more detail. According to Theorem 38.2, the function r̂n(x) that minimizes M(λ) is a natural cubic spline. Hence, we can write

    r̂n(x) = Σ_{j=1}^N βj Bj(x)    (135)

where N = n + 4. We only need to find the coefficients β = (β1, ..., βN)ᵀ. By expanding r in the basis we can rewrite the minimization as follows:

    minimize:  (Y − Bβ)ᵀ(Y − Bβ) + λ βᵀΩβ    (136)

where Bij = Bj(Xi) and Ωjk = ∫ Bj″(x) Bk″(x) dx.
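The recursion can be implemented directly (it is the Cox–de Boor form). The Python sketch below builds the cubic basis for a clamped knot sequence — an illustrative choice, not the notes' — and checks the partition-of-unity property Σ_i Bi,4(x) = 1 on the interior of the interval:

```python
import numpy as np

def bspline_basis(x, knots, m):
    """Cox-de Boor recursion for B-splines of order m (degree m - 1)."""
    t = np.asarray(knots, dtype=float)
    # order-1 (piecewise constant) B-splines on half-open intervals
    B = np.array([(t[i] <= x) & (x < t[i + 1]) for i in range(len(t) - 1)],
                 dtype=float)
    for k in range(2, m + 1):
        nb = len(t) - k
        Bk = np.zeros((nb, len(x)))
        for i in range(nb):
            # 0/0 is defined to be 0, handled by the guards below
            left = (x - t[i]) / (t[i + k - 1] - t[i]) if t[i + k - 1] > t[i] else 0.0
            right = (t[i + k] - x) / (t[i + k] - t[i + 1]) if t[i + k] > t[i + 1] else 0.0
            Bk[i] = left * B[i] + right * B[i + 1]
        B = Bk
    return B

# cubic (order 4) B-splines with repeated boundary knots on [0, 1]
interior = np.linspace(0, 1, 6)
knots = np.r_[[0, 0, 0], interior, [1, 1, 1]]   # k = 4 interior knots -> k + 4 = 8 basis functions
x = np.linspace(0, 0.999, 200)
B = bspline_basis(x, knots, 4)
unity = B.sum(axis=0)
```

With 4 interior knots there are k + 4 = 8 cubic basis functions, they are nonnegative, and they sum to 1 at every x, which is why the fitted spline is a local average of the coefficients.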

38.4 Theorem. The smoothing spline r̂n(x) is a linear smoother; that is, there exist weights ℓ(x) such that r̂n(x) = Σ_{i=1}^n Yi ℓi(x). The value of β that minimizes (136) is

    β̂ = (BᵀB + λΩ)⁻¹ BᵀY    (137)

(you will recognize this as being similar to ridge regression), the smoothing matrix L is

    L = B(BᵀB + λΩ)⁻¹ Bᵀ    (138)

and the vector Ŷ of fitted values is given by

    Ŷ = LY.    (139)

If we had done ordinary linear regression of Y on B, the hat matrix would be L = B(BᵀB)⁻¹Bᵀ and the fitted values would interpolate the observed data. The effect of the term λΩ in (138) is to shrink the regression coefficients towards a subspace, which results in a smoother fit. As before, we define the effective degrees of freedom by ν = tr(L) and we choose the smoothing parameter λ by minimizing either the cross-validation score (110) or the generalized cross-validation score (111).

38.5 Theorem. Spline estimates r̂n(x) are approximately kernel estimates in the sense that

    ℓi(x) ≈ (1 / (f(Xi) h(Xi))) K( (Xi − x) / h(Xi) )

where f(x) is the density of the covariate (treated here as random),

    h(x) = ( λ / (n f(x)) )^{1/4}

and

    K(t) = (1/2) exp( −|t|/√2 ) sin( |t|/√2 + π/4 ).

38.6 Example. Figure 33 shows the smoothing spline with cross-validation for the CMB data. The effective number of degrees of freedom is 8.8. The fit is smoother than the local regression estimator. This is certainly visually more appealing, but the difference between the two fits is small compared to the width of the confidence bands that we will compute later.

In R:

    out = smooth.spline(x, y, df=10, cv=TRUE)   ### df is the effective degrees of freedom
    plot(x, y)
    lines(x, out$y)   ### NOTE: the fitted values are in out$y NOT out$fit!!
    out$cv            ### print the cross-validation score

df must be between 2 and n. You need to do a loop to try many values of df and then use cross-validation to choose df. For example:

    cv = rep(0, 50)
    df = seq(2, n, length=50)
    for(i in 1:50){cv[i] = smooth.spline(x, y, df=df[i], cv=TRUE)$cv}
    plot(df, cv, type="l")
    df[cv == min(cv)]
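Equation (137) is a ridge-type solve. The Python sketch below uses a cubic polynomial basis as a stand-in for the B-spline basis (an illustrative simplification), with Ω = ∫ Bj″Bk″ computed in closed form on [0, 1]; as λ grows the curvature coefficients are shrunk to zero, leaving the least squares line, exactly as the text describes:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 100)

# Basis 1, x, x^2, x^3; second derivatives are 0, 0, 2, 6x, so on [0, 1]:
# Omega = [[0,0,0,0],[0,0,0,0],[0,0,4,6],[0,0,6,12]]
B = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
Omega = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 4, 6],
                  [0, 0, 6, 12]], dtype=float)

def penalized_fit(lam):
    return np.linalg.solve(B.T @ B + lam * Omega, B.T @ y)   # (B'B + lam*Omega)^{-1} B'y

beta_ols = penalized_fit(0.0)     # lam = 0: ordinary least squares on the basis
beta_big = penalized_fit(1e8)     # lam -> infinity: curvature terms shrink to 0
```

With λ = 0 the solve reproduces unpenalized least squares; with huge λ the coefficients of x² and x³ (the only terms Ω penalizes) collapse, so the fit converges to a straight line.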

Figure 33: Smoothing spline for the CMB data. The smoothing parameter was chosen by cross-validation.

Another nonparametric method that uses splines is called the regression spline method. Rather than placing a knot at each data point, we instead use fewer knots; by using fewer knots, one can save computation time. We then do ordinary linear regression on the basis matrix B with no regularization. The amount of smoothing is instead controlled by the choice of the number (and placement) of the knots. The fitted values for this estimator are Ŷ = LY with L = B(BᵀB)⁻¹Bᵀ. The difference between this estimate and (138) is that the basis matrix B is based on fewer knots and there is no shrinkage factor λΩ.
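A regression spline in this sense is just ordinary least squares on a spline basis. The Python sketch below uses a truncated-power cubic basis with three knots (an illustrative choice, not from the notes) and confirms that the hat matrix is a projection whose trace — the effective degrees of freedom — equals the number of basis functions:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, 120))

# Truncated-power cubic spline basis: 1, x, x^2, x^3, (x - t)_+^3 for each knot t.
knots = [0.25, 0.5, 0.75]
cols = [np.ones_like(x), x, x ** 2, x ** 3]
cols += [np.clip(x - t, 0, None) ** 3 for t in knots]
Bmat = np.column_stack(cols)

# Hat matrix of unpenalized least squares: L = B (B^T B)^{-1} B^T
L = Bmat @ np.linalg.solve(Bmat.T @ Bmat, Bmat.T)
nu = np.trace(L)
```

Because there is no shrinkage factor λΩ, L is idempotent (L² = L) and ν equals the basis dimension (here 7); smoothing is controlled entirely by how many knots we place.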

39 Smoothing Using Orthogonal Functions

Let L2(a, b) denote the set of all functions f defined on the interval [a, b] such that ∫_a^b f(x)² dx < ∞:

    L2(a, b) = { f : [a, b] → R : ∫_a^b f(x)² dx < ∞ }.    (140)

We sometimes write L2 instead of L2(a, b). The inner product between two functions f, g ∈ L2 is defined by ∫ f(x)g(x) dx. The norm of f is

    ||f|| = √( ∫ f(x)² dx ).    (141)

Two functions are orthogonal if ∫ f(x)g(x) dx = 0. A sequence of functions φ1, φ2, φ3, φ4, ... is orthonormal if ∫ φj²(x) dx = 1 for each j and ∫ φi(x)φj(x) dx = 0 for i ≠ j. An orthonormal sequence is complete if the only function that is orthogonal to each φj is the zero function. A complete orthonormal set is called an orthonormal basis.
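These inner-product properties are easy to confirm numerically for a concrete basis, for example the cosine basis of Example 39.1 below. A Python sketch using a Riemann sum (the grid size is an arbitrary choice):

```python
import numpy as np

# Numerical check that the cosine basis is orthonormal in L2(0, 1).
x = np.linspace(0, 1, 100001)
dx = x[1] - x[0]

def phi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2) * np.cos(j * np.pi * x)

inner = lambda f, g: np.sum(f * g) * dx    # Riemann-sum inner product
norms = [inner(phi(j, x), phi(j, x)) for j in range(4)]
cross = inner(phi(1, x), phi(2, x))
```

Each norm comes out to 1 and the cross inner product to 0, up to discretization error.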

Any f ∈ L2 can be written as

f(x) = Σ_{j=1}^∞ βj φj(x),  where  βj = ∫_a^b f(x) φj(x) dx.    (142)

Also, we have Parseval's relation:

||f||² ≡ ∫ f(x)² dx = Σ_{j=1}^∞ βj² ≡ ||β||²    (143)

where β = (β1, β2, ...). Note: the equality in (142) means that ∫ (f(x) − fn(x))² dx → 0 where fn(x) = Σ_{j=1}^n βj φj(x).

39.1 Example. An example of an orthonormal basis for L2(0, 1) is the cosine basis defined as follows. Let φ0(x) = 1 and for j ≥ 1 define

φj(x) = √2 cos(jπx).    (144)
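Orthonormality of (144) is easy to confirm numerically; a small Python sketch with midpoint-rule quadrature:

```python
import numpy as np

def phi(j, x):
    # cosine basis on (0, 1): phi_0 = 1, phi_j(x) = sqrt(2) cos(j pi x)
    return np.ones_like(x) if j == 0 else np.sqrt(2) * np.cos(j * np.pi * x)

x = (np.arange(10000) + 0.5) / 10000   # midpoint grid on [0, 1]
inner = lambda f, g: np.mean(f * g)    # approximates the integral over [0, 1]

norm_sq = inner(phi(3, x), phi(3, x))  # should be close to 1
cross   = inner(phi(2, x), phi(5, x))  # should be close to 0
print(norm_sq, cross)
```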

39.2 Example. Let

f(x) = √(x(1 − x)) sin( 2.1π / (x + 0.05) )

which is called the "doppler function." Figure 34 shows f (top left) and its approximation

fJ(x) = Σ_{j=1}^J βj φj(x)

with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). As J increases we see that fJ(x) gets closer to f(x). The coefficients βj = ∫_0^1 f(x) φj(x) dx were computed numerically.

Figure 34: Approximating the doppler function with its expansion in the cosine basis. The function f (top left) and its approximation fJ(x) = Σ_{j=1}^J βj φj(x) with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). The coefficients βj = ∫_0^1 f(x) φj(x) dx were computed numerically.

39.3 Example. The Legendre polynomials on [−1, 1] are defined by

Pj(x) = (1 / (2^j j!)) (d^j/dx^j) (x² − 1)^j,  j = 0, 1, 2, ...    (145)

It can be shown that these functions are complete and orthogonal and that

∫_{−1}^{1} Pj(x)² dx = 2 / (2j + 1).    (146)

It follows that the functions φj(x) = √((2j + 1)/2) Pj(x), j = 0, 1, ... form an orthonormal basis for L2(−1, 1). The first few Legendre polynomials are:

P0(x) = 1,  P1(x) = x,  P2(x) = (3x² − 1)/2,  P3(x) = (5x³ − 3x)/2, ...

These polynomials may be constructed explicitly using the following recursive relation:

P_{j+1}(x) = ( (2j + 1) x Pj(x) − j P_{j−1}(x) ) / (j + 1).    (147)
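The recursion (147) can be checked against the closed forms above; a small Python sketch using exact rational coefficients:

```python
from fractions import Fraction

def legendre(j):
    # coefficient list [c0, c1, ...] of P_j, built with recursion (147):
    # P_{j+1}(x) = ((2j+1) x P_j(x) - j P_{j-1}(x)) / (j+1)
    P = [[Fraction(1)], [Fraction(0), Fraction(1)]]   # P_0 = 1, P_1 = x
    for k in range(1, j):
        x_Pk = [Fraction(0)] + P[k]                   # multiply P_k(x) by x
        nxt = [(2 * k + 1) * c for c in x_Pk]
        for i, c in enumerate(P[k - 1]):
            nxt[i] -= k * c
        P.append([c / (k + 1) for c in nxt])
    return P[j]

print(legendre(2))   # coefficients of (3x^2 - 1)/2
print(legendre(3))   # coefficients of (5x^3 - 3x)/2
```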

The coefficients β1, β2, ... are related to the smoothness of the function f. To see why, note that if f is smooth, then its derivatives will be finite. Thus we expect that, for some k, ∫_0^1 (f^(k)(x))² dx < ∞ where f^(k) is the kth derivative of f. Now consider the cosine basis (144) and let f(x) = Σ_{j=0}^∞ βj φj(x). Then,

∫_0^1 (f^(k)(x))² dx = 2 Σ_{j=1}^∞ βj² (πj)^{2k}.

The only way that Σ_{j=1}^∞ βj² (πj)^{2k} can be finite is if the βj's get small when j gets large. To summarize:

If the function f is smooth, then the coefficients βj will be small when j is large.

Return to the regression model

Yi = r(Xi) + εi,  i = 1, ..., n.    (148)

Now we write r(x) = Σ_{j=1}^∞ βj φj(x). We will approximate r by

rJ(x) = Σ_{j=1}^J βj φj(x).

The number of terms J will be our smoothing parameter. Our estimate is

r(x) = Σ_{j=1}^J βj φj(x).

To find the βj, let U denote the matrix whose columns are the basis functions evaluated at the data:

U = [ φ1(X1)  φ2(X1)  ...  φJ(X1)
      φ1(X2)  φ2(X2)  ...  φJ(X2)
      ...
      φ1(Xn)  φ2(Xn)  ...  φJ(Xn) ].

Then

β = (U^T U)^{-1} U^T Y

and Y = SY where S = U(U^T U)^{-1} U^T is the hat matrix. The matrix S is projecting onto the space spanned by the first J basis functions. We can choose J by cross-validation. Note that trace(S) = J, so the GCV score takes the following simple form:

GCV(J) = (1/n) RSS / (1 − J/n)².
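The whole estimator — build U, regress, choose J by GCV — is a short computation. A Python sketch on simulated data (the true function is itself a basis function, so a small J should be selected):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = (np.arange(n) + 0.5) / n
y = np.sqrt(2) * np.cos(3 * np.pi * x) + 0.3 * rng.normal(size=n)  # r = phi_3

def phi(j, t):
    return np.sqrt(2) * np.cos(j * np.pi * t)

def gcv(J):
    U = np.column_stack([phi(j, x) for j in range(1, J + 1)])
    beta = np.linalg.lstsq(U, y, rcond=None)[0]  # beta = (U^T U)^{-1} U^T Y
    rss = np.sum((y - U @ beta) ** 2)
    return rss / n / (1 - J / n) ** 2            # GCV(J), using trace(S) = J

scores = {J: gcv(J) for J in range(1, 31)}
J_hat = min(scores, key=scores.get)
print(J_hat)   # should include the j = 3 component
```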

39.4 Example. Figure 37 shows the doppler function f and n = 2,048 observations generated from the model Yi = r(Xi) + εi, where Xi = i/n and εi ∼ N(0, (.1)²). The figure shows the data and the estimated function. The estimate was based on J = 234 terms.

Here is another example: the fit is in Figure 35 and the smoothing matrix is in Figure 36. Notice that the rows of the smoothing matrix look like kernels. In fact, smoothing with a series is approximately the same as kernel regression with the kernel K(x, y) = Σ_{j=1}^J φj(x) φj(y).

Cosine basis smoothers have boundary bias. This can be fixed by adding the functions t and t² to the basis. In other words, use the design matrix

U = [ 1  X1  X1²  φ2(X1)  ...  φJ(X1)
      1  X2  X2²  φ2(X2)  ...  φJ(X2)
      ...
      1  Xn  Xn²  φ2(Xn)  ...  φJ(Xn) ].

This is called the polynomial-cosine basis.

Figure 35: Cosine Regression. (The panels show the GCV score against J, the data and fitted values against x, and the residuals against x.)

Figure 36: Cosine Regression. (The panels show rows of the smoothing matrix plotted against x; they look like kernels.)

Figure 37: Data from the doppler test function and the estimated function.

40 Variance Estimation

Next we consider several methods for estimating σ². For linear smoothers, there is a simple, nearly unbiased estimate of σ².

40.1 Theorem. Let rn(x) be a linear smoother. Let

σ² = Σ_{i=1}^n (Yi − r(Xi))² / (n − 2ν + ν̃)    (149)

where ν = tr(L) and ν̃ = tr(L^T L) = Σ_{i=1}^n ||ℓ(Xi)||². If r is sufficiently smooth, ν = o(n) and ν̃ = o(n), then σ² is a consistent estimator of σ².

We will now outline the proof of this result. Recall that if Y is a random vector and Q is a symmetric matrix then Y^T Q Y is called a quadratic form and it is well known that

E(Y^T Q Y) = tr(QV) + μ^T Q μ

where V = V(Y) is the covariance matrix of Y and μ = E(Y) is the mean vector. Now,

Y − Y = Y − LY = (I − L)Y

and so

σ² = Y^T Λ Y / tr(Λ)    (150)

where Λ = (I − L)^T (I − L). Hence,

E(σ²) = E(Y^T Λ Y) / tr(Λ) = σ² + r^T Λ r / (n − 2ν + ν̃)    (151)

where r = (r(X1), ..., r(Xn))^T. Assuming that ν and ν̃ do not grow too quickly and that r is smooth, the last term is small for large n and hence E(σ²) ≈ σ². Similarly, one can show that V(σ²) → 0.

Here is another estimator. Suppose that the Xi's are ordered. Define

σ² = (1 / (2(n − 1))) Σ_{i=1}^{n−1} (Y_{i+1} − Yi)².    (152)

The motivation for this estimator is as follows. Assuming r(x) is smooth, we have r(x_{i+1}) − r(x_i) ≈ 0 and hence

Y_{i+1} − Yi = [r(x_{i+1}) + ε_{i+1}] − [r(x_i) + ε_i] ≈ ε_{i+1} − ε_i

and hence

(Y_{i+1} − Yi)² ≈ ε²_{i+1} + ε²_i − 2 ε_{i+1} ε_i.    (153)

Therefore,

E(Y_{i+1} − Yi)² ≈ E(ε²_{i+1}) + E(ε²_i) − 2 E(ε_{i+1}) E(ε_i) = 2σ².

Thus, E(σ²) ≈ σ². A variation of this estimator is

σ² = (1 / (n − 2)) Σ_{i=2}^{n−1} c_i² δ_i²    (154)

where δ_i = a_i Y_{i−1} + b_i Y_{i+1} − Yi, a_i = (x_{i+1} − x_i)/(x_{i+1} − x_{i−1}), b_i = (x_i − x_{i−1})/(x_{i+1} − x_{i−1}) and c_i² = (a_i² + b_i² + 1)^{-1}. The intuition of this estimator is that it is the average of the residuals that result from fitting a line to the first and third point of each consecutive triple of design points.

40.2 Example. The variance looks roughly constant for the first 400 observations of the CMB data. Using a local linear fit, we applied the two variance estimators. Equation (149) yields σ² = 408.29 while equation (152) yields σ² = 394.55.

So far we have assumed homoscedasticity, meaning that σ² = V(εi) does not vary with x. In the CMB example this is blatantly false. Clearly, σ² increases with x so the data are heteroscedastic. The function estimate rn(x) is relatively insensitive to heteroscedasticity. However, when it comes to making confidence bands for r(x), we must take into account the nonconstant variance. We will take the following approach. Suppose that

Yi = r(Xi) + σ(Xi) εi.    (155)

Let Zi = log(Yi − r(Xi))² and δi = log εi². Then,

Zi = log σ²(Xi) + δi.    (156)

This suggests estimating log σ²(x) by regressing the log squared residuals on x. We proceed as follows.

Variance Function Estimation
1. Estimate r(x) with any nonparametric method to get an estimate rn(x).
2. Define Zi = log(Yi − rn(Xi))².
3. Regress the Zi's on the Xi's (again using any nonparametric method) to get an estimate q(x) of log σ²(x) and let

σ²(x) = e^{q(x)}.    (157)

Figure 38: The dots are the log squared residuals. The solid line shows the log of the estimated variance σ²(x) as a function of x. The dotted line shows the log of the true σ²(x), which is known (to reasonable accuracy) through prior knowledge.

40.3 Example. The solid line in Figure 38 shows the log of σ²(x) for the CMB example. I used local linear estimation and I used cross-validation to choose the bandwidth. The estimated optimal bandwidth for rn was h = 42 while the estimated optimal bandwidth for the log variance was h = 160. In this example, there turns out to be an independent estimate of σ(x). Specifically, because the physics of the measurement process is well understood, physicists can compute a reasonably accurate approximation to σ²(x). The log of this function is the dotted line on the plot.

A drawback of this approach is that the log of a very small residual will be a large outlier. An alternative is to directly smooth the squared residuals on x.


41 Conﬁdence Bands

In this section we will construct confidence bands for r(x). Typically these bands are of the form

rn(x) ± c se(x)    (158)

where se(x) is an estimate of the standard deviation of rn(x) and c > 0 is some constant. Before we proceed, we discuss a pernicious problem that arises whenever we do smoothing, namely, the bias problem.

The Bias Problem. Confidence bands like those in (158) are not really confidence bands for r(x); rather, they are confidence bands for r̄n(x) = E(rn(x)), which you can think of as a smoothed version of r(x). Getting a confidence set for the true function r(x) is complicated for reasons we now explain. Denote the mean and standard deviation of rn(x) by r̄n(x) and sn(x). Then,

(rn(x) − r(x)) / sn(x) = (rn(x) − r̄n(x)) / sn(x) + (r̄n(x) − r(x)) / sn(x)
                       = Zn(x) + bias(rn(x)) / √(variance(rn(x)))

where Zn(x) = (rn(x) − r̄n(x)) / sn(x). Typically, the first term Zn(x) converges to a standard Normal from which one derives confidence bands. The second term is the bias divided by the standard deviation. In parametric inference, the bias is usually smaller than the standard deviation of the estimator, so this term goes to zero as the sample size increases. In nonparametric inference, we have seen that optimal smoothing corresponds to balancing the bias and the standard deviation. The second term does not vanish even with large sample sizes. The presence of this second, nonvanishing term introduces a bias into the Normal limit. The result is that the confidence interval will not be centered around the true function r due to the smoothing bias r̄n(x) − r(x).

There are several things we can do about this problem. The first is: live with it. In other words, just accept the fact that the confidence band is for r̄n, not r. There is nothing wrong with this as long as we are careful when we report the results to make it clear that the inferences are for r̄n, not r. A second approach is to estimate the bias function r̄n(x) − r(x). This is difficult to do. Indeed, the leading term of the bias is r''(x), and estimating the second derivative of r is much harder than estimating r. This requires introducing extra smoothness conditions which then bring into question the original estimator that did not use this extra smoothness. This has a certain unpleasant circularity to it.

41.1 Example. To understand the implications of estimating r̄n instead of r, consider the following example. Let

r(x) = φ(x; 2, 1) + φ(x; 4, 0.5) + φ(x; 6, 0.1) + φ(x; 8, 0.05)

where φ(x; m, s) denotes a Normal density function with mean m and variance s². Figure 39 shows the true function (top left), a locally linear estimate rn (top right) based on 100 observations Yi = r(i/10) + .2 N(0, 1), i = 1, ..., 100, with bandwidth h = 0.27, the function r̄n(x) = E(rn(x)) (bottom left) and the difference r(x) − r̄n(x) (bottom right). We see that r̄n (dashed line) smooths out the peaks. Comparing the top right and bottom left plots, it is clear that rn(x) is actually estimating r̄n, not r(x). Overall, r̄n is quite similar to r(x) except that r̄n omits some of the fine details of r.

Constructing Confidence Bands. Assume that rn(x) is a linear smoother, so that rn(x) = Σ_{i=1}^n Yi ℓi(x). Then,

r̄(x) = E(rn(x)) = Σ_{i=1}^n ℓi(x) r(Xi).

Figure 39: The true function (top left), an estimate rn (top right) based on 100 observations, the function r̄n(x) = E(rn(x)) (bottom left) and the difference r(x) − r̄n(x) (bottom right).

Also,

V(rn(x)) = Σ_{i=1}^n σ²(Xi) ℓi²(x).

When σ²(x) = σ² this simplifies to V(rn(x)) = σ² ||ℓ(x)||². We will consider a confidence band for r̄n(x) of the form

I(x) = ( rn(x) − c s(x), rn(x) + c s(x) )    (159)

for some c > 0 where

s(x) = √( Σ_{i=1}^n σ²(Xi) ℓi²(x) ).

At one fixed value of x we can just take rn(x) ± z_{α/2} s(x). If we want a band over an interval a ≤ x ≤ b we need a constant c larger than z_{α/2} to account for the fact that we are trying to get coverage at many points. To guarantee coverage at all the Xi's we can use the Bonferroni correction and take rn(x) ± z_{α/(2n)} s(x). There is a more refined approach which is used in locfit.

R Code. In locfit you can get confidence bands as follows.

out = locfit(y ~ x, alpha=c(0,h))   ### fit the regression
crit(out) = kappa0(out, cov=.95)    ### make locfit find kappa0 and c
plot(out, band="local")             ### plot the fit and the bands

To actually extract the bands, proceed as follows:

tmp     = preplot.locfit(out, band="local", where="data")
r.hat   = tmp$fit
critval = tmp$critval$crit.val
se      = tmp$se.fit
upper   = r.hat + critval*se
lower   = r.hat - critval*se

Now suppose that σ(x) is a function of x. Then, we use rn(x) ± c s(x).

41.2 Example. Figure 40 shows simultaneous 95 percent confidence bands for the CMB data using a local linear fit. The bandwidth was chosen using cross-validation. We find that κ0 = 38.85 and c = 3.33. In the top plot, we assumed a constant variance when constructing the band. In the bottom plot, we did not assume a constant variance when constructing the band. We see that if we do not take into account the nonconstant variance, we overestimate the uncertainty for small x and we underestimate the uncertainty for large x.
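For any linear smoother, the simpler Bonferroni band described above takes just a few lines. A Python sketch with a kernel smoother standing in for locfit (constant variance assumed; remember the band is for r̄n, not r):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n = 300
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

h = 0.05
def ell(x0):
    # Nadaraya-Watson weights: a linear smoother, r_hat(x0) = sum_i Y_i ell_i(x0)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return w / w.sum()

L = np.array([ell(x0) for x0 in x])   # smoothing matrix
r_hat = L @ y

nu, nu2 = np.trace(L), np.sum(L * L)
sigma2 = np.sum((y - r_hat) ** 2) / (n - 2 * nu + nu2)   # estimator (149)
s = np.sqrt(sigma2 * np.sum(L ** 2, axis=1))             # s(x) at each X_i

z = NormalDist().inv_cdf(1 - 0.05 / (2 * n))   # Bonferroni constant, alpha = .05
lower, upper = r_hat - z * s, r_hat + z * s
print(float(np.mean(upper - lower)))           # average band width
```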

Figure 40: Local linear fit with simultaneous 95 percent confidence bands. The band in the top plot assumes constant variance σ². The band in the bottom plot allows for nonconstant variance σ²(x).

It seems like a good time to summarize the steps needed to construct the estimate rn and a confidence band.


Summary of Linear Smoothing

1. Choose a smoothing method such as local polynomial, spline, etc. This amounts to choosing the form of the weights ℓ(x) = (ℓ1(x), ..., ℓn(x))^T. A good default choice is local linear smoothing as described in Theorem 37.2.
2. Choose the bandwidth h by cross-validation using (110).
3. Estimate the variance function σ²(x) as described in Section 40.
4. An approximate 1 − α confidence band for r̄n = E(rn(x)) is

rn(x) ± c s(x).    (160)

41.3 Example (LIDAR). Recall the LIDAR data from Example 34.2 and Example 37.1. We find that κ0 ≈ 30 and c ≈ 3.25. The resulting bands are shown in the lower right plot. As expected, there is much greater uncertainty for larger values of the covariate.

42 Testing the Fit of a Linear Model

A nonparametric estimator rn can be used to construct a test to see whether a linear fit is adequate. Consider testing

H0: r(x) is linear  versus  H1: r(x) is not linear.

Denote the hat matrix from fitting the linear model by H and the smoothing matrix from fitting the nonparametric regression by L. Let

T = ( ||LY − HY||² / λ ) / σ²

where λ = tr((L − H)^T (L − H)) and σ² is defined by (149). We can approximate the distribution of T under H0 using the bootstrap. Also, under H0, the F-distribution with ν and n − 2ν1 + ν2 degrees of freedom provides a rough approximation to the distribution of T. Thus we would reject H0 at level α if T > F_{ν, n−2ν1+ν2, α}. As with any test, the failure to reject H0 should not be regarded as proof that H0 is true. Rather, it indicates that the data are not powerful enough to detect deviations from H0. In such cases, a linear fit might be considered a reasonable tentative model. Of course, making such decisions solely on the basis of a test can be dangerous.

In the unlikely case that there are replications, we can test fit without using a nonparametric fit. Denote the unique values of x as {x1, ..., xk}. Do this:

1. Create k − 1 dummy variables Z1, ..., Zk−1 for the k groups.
2. Fit

Y = βX + Σ_{r=1}^{k−1} γr Zr + ε.    (161)

3. Fit

Y = βX + ε.    (162)

4. Test H0: γ1 = ··· = γk−1 = 0 with an F-test by comparing the two models.

Here is an example:


> x = c( ... )   ### ten observations, four unique x values
> y = c( ... )
> out = lm(y ~ x)
> anova(out)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value  Pr(>F)
x          1 4.5693  4.5693  8.6688 0.01859 *
Residuals  8 4.2166  0.5271
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> z1 = c(1,1,1,rep(0,7))    ### dummy variables for the groups
> z2 = c(0,0,0,1,rep(0,6))
> z3 = c( ... )             ### third group dummy
> out2 = lm(y ~ x + z1 + z2 + z3)
> anova(out2)
Analysis of Variance Table

Response: y
          Df    Sum Sq   Mean Sq   F value  Pr(>F)
x          1    4.5693    4.5693   11.6239 0.01433 *
z1         1 7.963e-09 7.963e-09 2.026e-08 0.99989
z2         1    1.8582    1.8582    4.7276 0.07263 .
Residuals  6    2.3584    0.3931
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> f = ((4.2166 - 2.3584)/(8-6))/(2.3584/6)
> print(f)
[1] 2.363721
> p = 1 - pf(2.363721, 2, 6)
> print(p)
[1] 0.1749707
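The same comparison of residual sums of squares can be sketched in Python with numpy least squares (the F quantile lookup is omitted; only the statistic is formed):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.repeat([1.0, 2.0, 3.0, 4.0], 5)            # replicated design, k = 4 groups
y = 2 + 0.5 * x + 0.3 * rng.normal(size=x.size)   # the truth is linear here
n, k = x.size, 4

def rss(X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ b) ** 2)

X0 = np.column_stack([np.ones(n), x])                         # model (162)
Z = (x[:, None] == np.unique(x)[None, :k - 1]).astype(float)  # k-1 dummies
X1 = np.column_stack([X0, Z])                                 # model (161)

# numerator df: rank(X1) - rank(X0) = k - 2 here, since x lies in the group space
F = ((rss(X0) - rss(X1)) / (k - 2)) / (rss(X1) / (n - k))
print(F)   # compare with an F_{k-2, n-k} quantile
```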

43 Local Likelihood and Exponential Families

If Y is not real valued or is not Gaussian, then the basic regression model we have been using might not be appropriate. For example, if Y ∈ {0, 1} then it seems natural to use a Bernoulli model. In this section we discuss nonparametric regression for more general models. Before proceeding, we should point out that the basic model often does work well, at least for large samples, even in cases where Y is not real valued or is not Gaussian. This is because the asymptotic theory does not really depend on ε being Gaussian. Thus, it is worth considering using the tools we have already developed for these cases.

Recall that Y has an exponential family distribution, given x, if

f(y|x) = exp{ (y θ(x) − b(θ(x))) / a(φ) + c(y, φ) }    (163)

for some functions a(·), b(·) and c(·, ·). Here, θ(·) is called the canonical parameter and φ is called the dispersion parameter. It then follows that

r(x) ≡ E(Y|X = x) = b'(θ(x)),    σ²(x) ≡ V(Y|X = x) = a(φ) b''(θ(x)).    (164)

The usual parametric form of this model is g(r(x)) = x^T β for some known function g called the link function. The model Y|X = x ∼ f(y|x), g(E(Y|X = x)) = x^T β is called a generalized linear model. For example, if Y given X = x is Binomial(m, r(x)) then

f(y|x) = C(m, y) r(x)^y (1 − r(x))^{m−y}

which has the form (163) with θ(x) = log(r(x)/(1 − r(x))), b(θ) = m log(1 + e^θ) and a(φ) ≡ 1. Taking g(t) = log(t/(m − t)) yields the logistic regression model. The parameters β are usually estimated by maximum likelihood.

Let's consider a nonparametric version of logistic regression. For simplicity, we focus on local linear estimation. The data are (x1, Y1), ..., (xn, Yn) where Yi ∈ {0, 1}. We assume that Yi ∼ Bernoulli(r(Xi)) for some smooth function r(x) for which 0 ≤ r(x) ≤ 1. Thus, P(Yi = 1|Xi = xi) = r(xi) and P(Yi = 0|Xi = xi) = 1 − r(xi). The likelihood function is Π_{i=1}^n r(Xi)^{Yi} (1 − r(Xi))^{1−Yi} so, with ξ(x) = log(r(x)/(1 − r(x))), the log-likelihood is

ℓ(r) = Σ_{i=1}^n ℓ(Yi, ξ(Xi))    (165)

where

ℓ(y, ξ) = log[ (e^ξ / (1 + e^ξ))^y (1 / (1 + e^ξ))^{1−y} ] = yξ − log(1 + e^ξ).    (166)

To estimate the regression function at x we approximate the regression function r(u) for u near x by the local logistic function

r(u) ≈ e^{a0 + a1(u−x)} / (1 + e^{a0 + a1(u−x)}).

Equivalently, we approximate log(r(u)/(1 − r(u))) with a0 + a1(x − u). Now define the local log-likelihood

ℓx(a) = Σ_{i=1}^n K((x − Xi)/h) ℓ(Yi, a0 + a1(Xi − x))
      = Σ_{i=1}^n K((x − Xi)/h) [ Yi (a0 + a1(Xi − x)) − log(1 + e^{a0 + a1(Xi − x)}) ].

Let a(x) = (a0(x), a1(x)) maximize ℓx, which can be found by any convenient optimization routine such as Newton–Raphson. The nonparametric estimate of r(x) is

rn(x) = e^{a0(x)} / (1 + e^{a0(x)}).    (167)

The bandwidth can be chosen by using the leave-one-out log-likelihood cross-validation

CV = Σ_{i=1}^n ℓ(Yi, ξ(−i)(Xi))    (168)

where ξ(−i)(x) is the estimator obtained by leaving out (Xi, Yi). Unfortunately, there is no identity as in Theorem 35.2. There is, however, the following approximation. Let ℓ̇(y, ξ) and ℓ̈(y, ξ) denote the first and second derivatives of ℓ(y, ξ) with respect to ξ:

ℓ̇(y, ξ) = y − p(ξ),    ℓ̈(y, ξ) = −p(ξ)(1 − p(ξ))

where p(ξ) = e^ξ / (1 + e^ξ). Define matrices Xx and Wx as in (124) and let Vx be a diagonal matrix with jth diagonal entry equal to −ℓ̈(Yj, a0 + a1(Xj − x)). Then,

CV ≈ ℓx(a) + Σ_{i=1}^n m(Xi) ℓ̇(Yi, a0)²    (169)

where

m(x) = K(0) e1^T (Xx^T Wx Vx Xx)^{−1} e1    (170)

and e1 = (1, 0, ..., 0)^T. The effective degrees of freedom is

ν = Σ_{i=1}^n m(Xi) E(−ℓ̈(Yi, a0)).

43.1 Example. Figure 41 shows the local linear logistic regression estimator for an example generated from the model Yi ∼ Bernoulli(r(Xi)) with r(x) = e^{3 sin(x)} / (1 + e^{3 sin(x)}). The solid line is the true function r(x). The dashed line is the local linear logistic regression estimator. We also computed the local linear regression estimator, which ignores the fact that the data are Bernoulli.³ The dotted line is the resulting local linear regression estimator. Cross-validation was used to select the bandwidth in both cases. We see that there is not a dramatic difference between the local logistic model and the local linear model.

Figure 41: Local linear logistic regression. The solid line is the true regression function r(x) = P(Y = 1|X = x). The dashed line is the local likelihood estimator. The dotted line is the local linear estimator which ignores the binary nature of the Yi's.

43.2 Example. The BPD data. The outcome Y is presence or absence of bronchopulmonary dysplasia (BPD) and the covariate is x = birth weight. The estimated logistic regression function (solid line) r(x; β0, β1) together with the data are shown in Figure 42. Also shown are two nonparametric estimates: the dashed line is the local likelihood estimator and the dotted line is the local linear estimator. Again we see that there is not a dramatic difference between the local logistic model and the local linear model.

Figure 42: The BPD data. The data are shown with small vertical lines. The estimates are from logistic regression (solid line), local likelihood (dashed line) and local linear regression (dotted line).

³ It might be appropriate to use a weighted fit since the variance of the Bernoulli is a function of the mean.
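Maximizing the local log-likelihood at one point by Newton–Raphson takes only a few lines. A Python sketch at x = 0 for the model of Example 43.1 (Gaussian kernel and a fixed bandwidth — illustrative choices, not tuned by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.sort(rng.uniform(-3, 3, n))
r = 1 / (1 + np.exp(-3 * np.sin(X)))        # true P(Y = 1 | X)
Y = (rng.uniform(size=n) < r).astype(float)

def local_logistic(x0, h=0.5, steps=20):
    # Newton-Raphson on the local log-likelihood l_x(a), a = (a0, a1)
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)
    D = np.column_stack([np.ones(n), X - x0])
    a = np.zeros(2)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(D @ a)))
        grad = D.T @ (K * (Y - p))                     # weighted score
        info = (D * (K * p * (1 - p))[:, None]).T @ D  # weighted information
        a = a + np.linalg.solve(info, grad)
    return 1 / (1 + np.exp(-a[0]))                     # r_hat(x0), as in (167)

est = local_logistic(0.0)
print(est)   # true r(0) = 0.5
```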

44 Multiple Nonparametric Regression

Suppose now that the covariate is d-dimensional, Xi = (Xi1, ..., Xid)^T. The regression equation takes the form

Y = r(X1, ..., Xd) + ε.    (171)

In principle, all the methods we have discussed carry over to this case easily. Unfortunately, the risk of a nonparametric regression estimator increases rapidly with the dimension d. This is called the curse of dimensionality. The risk of a nonparametric estimator behaves like n^{−4/5} if r is assumed to have an integrable second derivative. In d dimensions the risk behaves like n^{−4/(4+d)}. To make the risk equal to a small number δ we have

δ = 1 / n^{4/(4+d)}    (172)

which implies that

n = (1/δ)^{(d+4)/4}.    (173)

Thus: To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension d. So you might need n = 30000 points when d = 5 to get the same accuracy as n = 300 when d = 1.

To get some intuition into why this is true, suppose the data fall into a d-dimensional unit cube. Let x be a point in the cube and let Nh be a cubical neighborhood around x where the cube has sides of length h. The expected fraction of points in Nh is h^d. Suppose we want to choose h = h(π) so that a fraction π of the data falls into Nh. Setting h^d = π we see that h(π) = π^{1/d} = e^{−(1/d) log(1/π)}. Thus h(π) → 1 as d grows. In high dimensions, we need huge neighborhoods to capture any reasonable fraction of the data. With this warning in mind, let us press on and see how we might estimate the regression function.

Local Regression. Consider local linear regression. The kernel function K is now a function of d variables. Given a nonsingular positive definite d × d bandwidth matrix H, we define

K_H(x) = |H|^{−1/2} K(H^{−1/2} x).

Often, one scales each covariate to have the same mean and variance and then uses the kernel h^{−d} K(||x||/h) where K is any one-dimensional kernel. Then there is a single bandwidth parameter h. This is equivalent to using a bandwidth matrix of the form H = h² I. This is what locfit does: if you type locfit(y ~ x1 + x2 + x3) then locfit fits Y = r(x1, x2, x3) + ε using one bandwidth. So it is important to rescale your variables.

At a target value x = (x1, ..., xd)^T, the local sum of squares is given by

Σ_{i=1}^n wi(x) ( Yi − a0 − Σ_{j=1}^d aj (Xij − xj) )²    (174)

where wi(x) = K(||Xi − x||/h). The estimator is

rn(x) = a0    (175)

where a = (a0, ..., ad)^T is the value that minimizes the weighted sum of squares. The solution is

a = (Xx^T Wx Xx)^{−1} Xx^T Wx Y    (176)

where

Xx = [ 1  X11 − x1  ···  X1d − xd
       1  X21 − x1  ···  X2d − xd
       ...
       1  Xn1 − x1  ···  Xnd − xd ]

and Wx is the diagonal matrix whose (i, i) element is wi(x).

44.1 Theorem (Ruppert and Wand, 1994). Let rn be the multivariate local linear estimator with bandwidth matrix H. The (asymptotic) bias of rn(x) is

(1/2) μ2(K) trace(H 𝓗)    (177)

where 𝓗 is the matrix of second partial derivatives of r evaluated at x and μ2(K) is the scalar defined by ∫ u u^T K(u) du = μ2(K) I. The (asymptotic) variance of rn(x) is

σ²(x) ∫ K(u)² du / (n |H|^{1/2} f(x)).    (178)

Also, the bias at the boundary is the same order as in the interior. Thus we see that in higher dimensions, local linear regression still avoids excessive boundary bias and design bias.

Suppose that H = h² I. Then, using the above result, the MSE is

MSE = (h⁴/4) ( μ2 Σ_{j=1}^d rjj(x) )² + σ²(x) ∫ K(u)² du / (n h^d f(x)) = c1 h⁴ + c2 / (n h^d)    (179)

which is minimized at h = c n^{−1/(d+4)}, giving MSE of size n^{−4/(4+d)}.

Splines. If we take a spline approach, we need to define splines in higher dimensions. For d = 2 we minimize

Σi (Yi − rn(xi1, xi2))² + λ J(r)

where

J(r) = ∫∫ [ (∂²r(x)/∂x1²)² + 2 (∂²r(x)/∂x1∂x2)² + (∂²r(x)/∂x2²)² ] dx1 dx2.

The minimizer rn is called a thin-plate spline. It is hard to describe and even harder (but certainly not impossible) to fit.

Orthogonal Basis Functions. For s = 1, ..., d, let

Φs = { φs1(xs), ..., φsJ(xs) }

be a set of basis functions for xs. Suppose that the first basis function is φs1(xs) = 1 for each s. Define

Φ = { φ1,j1(x1) φ2,j2(x2) ··· φd,jd(xd) : 1 ≤ js ≤ J, s = 1, ..., d },

that is, we choose one function from Φ1, one function from Φ2, and so on, and multiply them together. Then Φ is the set of all J^d such functions. We call Φ the tensor product basis. Now we approximate r as

r(x1, ..., xd) ≈ Σ_{1 ≤ j1,...,jd ≤ J} β_{j1...jd} φ1,j1(x1) φ2,j2(x2) ··· φd,jd(xd).    (180)

If we let A = {j1, ..., jd} denote a subset of

A = { (j1, ..., jd) : 1 ≤ j1, ..., jd ≤ J }

then we can write this more compactly as

r(x1, ..., xd) ≈ Σ_{A ∈ A} βA φA(x1, ..., xd).

We can fit this model by collecting all the basis functions into a design matrix and then using least squares. We use cross-validation to choose J. A variation is to use a different J for each covariate. As the number of covariates increases, the computational burden becomes prohibitive. Interpreting and visualizing a high-dimensional fit is also difficult. Sometimes, a more fruitful approach is to use an additive model.

It is instructive to rewrite the model as follows. Since φs1(xs) = 1 for each s, (180) can be written as

r(x) = β0 + Σs rs(xs) + Σ_{s,t} rst(xs, xt) + ···.

If we truncate the sum after the first order terms, we get an additive model, which we now discuss.

Additive Models. An additive model is a model of the form

Y = α + Σ_{j=1}^d rj(Xj) + ε    (181)

where r1, ..., rd are smooth functions. The additive model is clearly not as general as fitting r(x1, ..., xd), but it is much simpler to compute and to interpret, and so it is often a good starting point. The model (181) is not identifiable since we can add any constant to α and subtract the same constant from one of the rj's without changing the regression function. This problem can be fixed in a number of ways, perhaps the easiest being to set α = Ȳ and then regard the rj's as deviations from Ȳ. In this case we require that Σ_{i=1}^n rj(Xi) = 0 for each j. There is a simple algorithm for turning any one-dimensional regression smoother into a method for fitting additive models. It is called backfitting.
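The backfitting idea can be written out directly. A Python sketch with a simple kernel smoother (illustrative bandwidth, simulated three-covariate data — not the rock example):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
X = rng.uniform(-1, 1, (n, 3))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, n)

def smooth(x, z, h=0.15):
    # one-dimensional kernel smoother of z on x, evaluated at the data points
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * z[None, :]).sum(axis=1) / w.sum(axis=1)

alpha = y.mean()
r = np.zeros((3, n))                 # r[j] holds r_j evaluated at the data
for _ in range(20):                  # iterate until (approximate) convergence
    for j in range(3):
        partial = y - alpha - r.sum(axis=0) + r[j]  # Y_i - alpha - sum_{k != j} r_k
        r[j] = smooth(X[:, j], partial)
        r[j] -= r[j].mean()                         # enforce sum_i r_j(X_i) = 0
fit = alpha + r.sum(axis=0)
print(float(np.mean((y - fit) ** 2)))   # residual mean squared error
```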

The Backfitting Algorithm

Initialization: set α = Ȳ and set initial guesses for r1, ..., rd. Iterate until convergence: for j = 1, ..., d:

• Compute Ỹi = Yi − α − Σ_{k≠j} rk(Xi).
• Apply a smoother to Ỹi on xj to obtain rj.
• Set rj(x) equal to rj(x) − n^{−1} Σ_{i=1}^n rj(Xi).

You can (and should) write your own function to fit an additive model.

44.2 Example. Here is an example involving three covariates and one response variable. The data are 48 rock samples from a petroleum reservoir. The response is permeability (in milli-Darcies) and the covariates are: the area of pores (in pixels out of 256 by 256), perimeter in pixels and shape (perimeter/√area). The goal is to predict permeability from the three covariates. The data are plotted in Figure 43. First we fit the additive model

permeability = r1(area) + r2(perimeter) + r3(shape) + ε.

We scale each covariate to have the same variance and then use a common bandwidth for each covariate. Usually, we use ten-fold cross-validation since leave-one-out is too expensive. Thus we divide the data into ten blocks, remove each block one at a time, fit the model on the remaining blocks, and compute the prediction error for the observations in the left-out block. This is repeated for each block and the prediction error is averaged over the ten replications. The estimates of r1, r2 and r3 are shown in Figure 44. Ȳ was added to each function before plotting it. Next consider a three-dimensional local linear fit (175). After scaling each covariate to have mean 0 and variance 1, we found that the bandwidth h ≈ 3.2 minimized the cross-validation score. The residuals from the additive model and the full three-dimensional local linear fit are shown in Figure 45. Apparently, the fitted values are quite similar, suggesting that the generalized additive model is adequate.

Regression Trees. A regression tree is a model of the form

r(x) = Σ_{m=1}^M cm I(x ∈ Rm)    (182)

where c1, ..., cM are constants and R1, ..., RM are disjoint rectangles that partition the space of covariates. The model is fitted in a recursive manner that can be represented as a tree; hence the name. Denote a generic covariate value by x = (x1, ..., xj, ..., xd). The covariate for the ith observation is Xi = (xi1, ..., xij, ..., xid). Given a covariate j and a split point s we define the rectangles R1 = R1(j, s) = {x : xj ≤ s} and R2 = R2(j, s) = {x : xj > s} where, in this expression, xj refers to the jth covariate, not the jth observation. Then we take c1 to be the average of all the Yi's such that Xi ∈ R1 and c2 to be the average of all the Yi's such that Xi ∈ R2. Notice that c1 and c2 minimize the sums of squares Σ_{Xi∈R1}(Yi − c1)² and Σ_{Xi∈R2}(Yi − c2)². The choice of which covariate xj to split on and which split point s to use is based on minimizing the residual sum of squares. The splitting process is then repeated on each rectangle R1 and R2. The function estimate r is constant over the rectangles. Figure 46 shows a simple example of a regression tree; also shown are the corresponding rectangles.

Generally one grows a very large tree, then the tree is pruned to form a subtree by collapsing regions together. The size of the tree is chosen by cross-validation. Here are the R commands:

library(tree)   ### load the library

out = tree(y ~ x1 + x2 + x3)         ### fit the tree
plot(out)                            ### plot the tree
text(out)                            ### add labels to plot
print(out)                           ### print the tree
cv = cv.tree(out)                    ### prune the tree and compute
                                     ### the cross-validation score
plot(cv$size, cv$dev)                ### plot the CV score versus tree size
k = cv$size[cv$dev == min(cv$dev)]   ### find the best size tree
new = prune.tree(out, best = k)      ### fit the best size tree
plot(new)
text(new)

Figure 43: The rock data (log permeability plotted against area, perimeter and shape).

Figure 47 shows the pruned tree for the rock data. Notice that the variable shape does not appear in the tree: it was never the optimal covariate to split on, so the fitted tree depends only on area and peri. This illustrates an important feature of tree regression: it automatically performs variable selection, in the sense that a covariate xj will not appear in the tree if the algorithm finds that the variable is not important.
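The ten-fold cross-validation scheme used above to choose the tree size (divide into ten blocks, hold each out in turn, average the held-out prediction error) is completely generic. Here is a minimal Python sketch under the assumption that we have some `fit`/`predict` pair; those two callables are hypothetical stand-ins, not the tree routines themselves.

```python
import numpy as np

def ten_fold_cv(X, y, fit, predict, folds=10):
    """Average held-out squared prediction error over `folds` blocks."""
    n = len(y)
    idx = np.arange(n)
    blocks = np.array_split(idx, folds)
    errors = []
    for block in blocks:
        train = np.setdiff1d(idx, block)
        model = fit(X[train], y[train])      # fit on the remaining blocks
        pred = predict(model, X[block])      # predict the left-out block
        errors.append(np.mean((y[block] - pred) ** 2))
    return float(np.mean(errors))

# toy model: predict the training mean everywhere
fit = lambda X, y: y.mean()
predict = lambda m, X: np.full(len(X), m)
X = np.arange(20.0).reshape(20, 1)
y = np.ones(20)
cv_err = ten_fold_cv(X, y, fit, predict)
```

For the constant response above the held-out error is exactly zero, which is a convenient sanity check on the fold bookkeeping.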

Figure 44: The rock data. The plots show r̂1, r̂2, and r̂3 for the additive model Y = r1(x1) + r2(x2) + r3(x3) + ε.

Figure 45: The residuals for the rock data. Top left: residuals from the additive model plotted against the predicted values. Top right: qq-plot of the residuals from the additive model against Normal quantiles. Bottom left: residuals from the multivariate local linear model. Bottom right: residuals from the two fits plotted against each other.

Figure 46: A regression tree for two covariates x1 and x2. The tree splits first on x1 at 50, then on x2 at 100. The function estimate is r̂(x) = c1 I(x ∈ R1) + c2 I(x ∈ R2) + c3 I(x ∈ R3), where R1, R2 and R3 are the rectangles shown in the lower plot.

Figure 47: Regression tree for the rock data. The splits are on area and peri; each leaf is labeled with its fitted value ĉm, the average log permeability over that rectangle.

45 Density Estimation

A problem closely related to nonparametric regression is nonparametric density estimation. Let X1, ..., Xn ∼ f where f is some probability density. We want to estimate f.

45.1 Example (Bart Simpson). The top left plot in Figure 48 shows the density

f(x) = (1/2) φ(x; 0, 1) + (1/10) Σ_{j=0}^4 φ(x; (j/2) − 1, 1/10)    (183)

where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. The other plots are kernel density estimators (described later) based on n = 1000 draws from f. The top right plot is based on a small bandwidth, which leads to undersmoothing. The bottom right plot is based on a large bandwidth, which leads to oversmoothing. The bottom left plot is based on the bandwidth h = 0.05, chosen to minimize estimated risk by leave-one-out cross-validation; this leads to a much more reasonable density estimate.

Figure 48: The Bart Simpson density from Example 45.1. Top left: true density. Top right: undersmoothed (bandwidth h/10). Bottom left: just right (bandwidth h). Bottom right: oversmoothed (bandwidth 10h).

We will evaluate the quality of an estimator f̂n with the risk, or integrated mean squared error, R = E(L) where

L = ∫ (f̂n(x) − f(x))² dx

is the integrated squared error loss function. The estimators will depend on some smoothing parameter h, and we will choose h to minimize an estimate of the risk. The loss function, which we now write as a function of h (since f̂n will depend on h), is

L(h) = ∫ (f̂n(x) − f(x))² dx = ∫ f̂n²(x) dx − 2 ∫ f̂n(x) f(x) dx + ∫ f²(x) dx.

The last term does not depend on h, so minimizing the loss is equivalent to minimizing the expected value of

J(h) = ∫ f̂n²(x) dx − 2 ∫ f̂n(x) f(x) dx.    (184)

We shall refer to E(J(h)) as the risk, although it differs from the true risk by the constant term ∫ f²(x) dx.

45.2 Definition. The cross-validation estimator of risk is

Ĵ(h) = ∫ f̂n²(x) dx − (2/n) Σ_{i=1}^n f̂(−i)(Xi)    (185)

where f̂(−i) is the density estimator obtained after removing the ith observation. We refer to Ĵ(h) as the cross-validation score or estimated risk.

The usual method for estimating risk is leave-one-out cross-validation, but the details are different for density estimation than for regression. In the regression case the cross-validation score was defined as Σᵢ (Yi − r̂(−i)(Xi))², but in density estimation there is no response variable Y; instead we use (185).

Perhaps the simplest nonparametric density estimator is the histogram. Suppose f has its support on some interval which, without loss of generality, we take to be [0, 1]. Let m be an integer and define bins B1 = [0, 1/m), B2 = [1/m, 2/m), ..., Bm = [(m−1)/m, 1]. Define the binwidth h = 1/m, let Yj be the number of observations in Bj, let p̂j = Yj/n and let

pj = ∫_{Bj} f(u) du.    (186)

The histogram estimator is defined by

f̂n(x) = Σ_{j=1}^m (p̂j / h) I(x ∈ Bj).    (187)

To understand the motivation for this estimator, note that, for x ∈ Bj and h small,

E(f̂n(x)) = E(p̂j)/h = pj/h = (1/h) ∫_{Bj} f(u) du ≈ f(x)h/h = f(x).

45.3 Example. Figure 49 shows three different histograms based on n = 1266 data points from an astronomical sky survey. Each data point represents a redshift, roughly speaking the distance from us to a galaxy. The histogram reveals the presence of clusters of galaxies. Choosing the right number of bins involves finding a good tradeoff between bias and variance: too many bins results in undersmoothing (too much variance) while too few bins results in oversmoothing (too much bias). The top right histogram, based on 308 bins chosen by cross-validation, is about right.

45.4 Theorem. Consider fixed x and fixed m, and let Bj be the bin containing x. Then

E(f̂n(x)) = pj/h  and  V(f̂n(x)) = pj(1 − pj)/(nh²).    (188)

The risk satisfies

R(f̂n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh).    (189)

The value h* that minimizes (189) is

h* = (1/n^{1/3}) ( 6 / ∫ (f′(u))² du )^{1/3}.    (190)

With this choice of binwidth,

R(f̂n, f) ∼ C / n^{2/3}.    (191)

We see that with an optimally chosen binwidth, the risk decreases to 0 at rate n^{−2/3}. We will see shortly that kernel estimators converge at the faster rate n^{−4/5}.

For histograms the cross-validation score (185) can be computed in closed form. The following identity holds:

Ĵ(h) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂j².    (192)

Figure 49: Three versions of a histogram for the astronomy data. The top left histogram has too many bins. The bottom left histogram has too few bins. The top right histogram uses 308 bins (chosen by cross-validation). The lower right plot shows the estimated risk plotted versus the number of bins.
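The identity (192) makes choosing the number of bins a one-line computation. A Python sketch (illustrative; it assumes the data have been scaled to [0, 1]):

```python
import numpy as np

def hist_cv_score(x, m):
    """Cross-validation score for a histogram with m bins on [0,1], via
    identity (192): J(h) = 2/((n-1)h) - (n+1)/((n-1)h) * sum_j phat_j^2."""
    n = len(x)
    h = 1.0 / m
    counts = np.histogram(x, bins=m, range=(0.0, 1.0))[0]
    phat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(phat ** 2)

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=500)                   # a smooth density on [0,1]
scores = {m: hist_cv_score(x, m) for m in range(1, 101)}
m_best = min(scores, key=scores.get)           # bin count minimizing the score
```

A useful check: with a single bin, p̂1 = 1 and h = 1, so the score collapses to (2 − (n+1))/(n − 1) = −1 regardless of the data.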

45.5 Example. The histogram in the top right plot in Figure 49 was constructed using m = 308 bins. The bottom right plot shows the estimated risk Ĵ, plotted versus the number of bins; we find that m = 308 is an approximate minimizer.

Histograms are not smooth. Now we discuss kernel density estimators, which are smoother and which converge to the true density faster.

45.6 Definition. Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be

f̂n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − Xi)/h).    (193)

This amounts to placing a smoothed-out lump of mass of size 1/n over each data point Xi: at each point x, f̂n(x) is the average of the kernels centered over the data points Xi. See Figure 50. In R use: plot(density(x, bw = h)) where h is the bandwidth.

As with kernel regression, the choice of kernel K is not crucial, but the choice of bandwidth h is important. In general we will let the bandwidth depend on the sample size, so we write hn. Figure 51 shows density estimates with several different bandwidths: small bandwidths give very rough estimates while larger bandwidths give smoother estimates. Look also at Figure 48; we see how sensitive the estimate f̂n is to the choice of h.

Here are some properties of f̂n. The risk is

R ≈ (1/4) σK⁴ h⁴ ∫ (f″(x))² dx + ∫ K²(x) dx / (nh)    (194)

where σK² = ∫ x² K(x) dx.

Figure 50: A kernel density estimator f̂n. The data points are indicated by short vertical bars; f̂n(x) is the average of the kernels centered over the data points. The kernels are not drawn to scale.
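Definition 45.6 is a one-liner in practice. A Python sketch with a Gaussian kernel (illustrative; the notes use R's density function for the same purpose):

```python
import numpy as np

def kde(x_grid, data, h):
    """Kernel density estimator: fhat(x) = (1/n) sum_i (1/h) K((x - X_i)/h)."""
    u = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # standard Normal kernel
    return K.mean(axis=1) / h

data = np.array([-1.0, 0.0, 1.0])
grid = np.linspace(-4, 4, 161)
f = kde(grid, data, h=0.5)
# the estimate is itself a density, so it should integrate to about one
area = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid)))
```

Integrating the estimate over a grid that covers the data comfortably returns essentially 1, which reflects the "lump of mass 1/n per point" interpretation.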

Figure 51: Kernel density estimators and estimated risk for the astronomy data. Top left: oversmoothed. Top right: just right (bandwidth chosen by cross-validation). Bottom left: undersmoothed. Bottom right: cross-validation curve as a function of bandwidth h; the bandwidth was chosen to be the value of h where the curve is a minimum.

If we differentiate (194) with respect to h and set the derivative equal to 0, we see that the asymptotically optimal bandwidth is

h* = ( c2 / (c1² A(f) n) )^{1/5}    (195)

where c1 = ∫ x² K(x) dx, c2 = ∫ K(x)² dx and A(f) = ∫ (f″(x))² dx. This is informative because it tells us that the best bandwidth decreases at rate n^{−1/5}. Plugging h* into (194), we see that if the optimal bandwidth is used then R = O(n^{−4/5}). As we saw, histograms converge at rate O(n^{−2/3}), showing that kernel estimators are superior in rate to histograms.

In practice the bandwidth can be chosen by cross-validation, but first we describe another method, sometimes used when f is thought to be very smooth: we compute h* from (195) under the idealized assumption that f is Normal. This yields h* = 1.06 σ n^{−1/5}. Usually, σ is estimated by min{s, Q/1.34} where s is the sample standard deviation and Q is the interquartile range (the 75th percentile minus the 25th percentile). The reason for dividing by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a N(µ, σ²). This choice of h* works well if the true density is very smooth and is called the Normal reference rule.

The Normal Reference Rule. For smooth densities and a Normal kernel, use the bandwidth

hn = 1.06 σ̂ n^{−1/5}  where  σ̂ = min{s, Q/1.34}.    (197)

Since we don't want to assume that f is very smooth, it is usually better to estimate h using cross-validation. Recall that the cross-validation score is

Ĵ(h) = ∫ f̂²(x) dx − (2/n) Σ_{i=1}^n f̂₋ᵢ(Xi)    (196)

where f̂₋ᵢ denotes the kernel estimator obtained by omitting Xi.

45.7 Example. The bandwidth for the density estimator in the upper right panel of Figure 51 is based on cross-validation. In R, use the bw.ucv function to do cross-validation:

h = bw.ucv(x)
plot(density(x, bw = h))

The kernel estimator can easily be generalized to d dimensions. Suppose the data are d-dimensional, so that Xi = (Xi1, ..., Xid). Most often, we use the product kernel

f̂n(x) = (1/(n h1 ··· hd)) Σ_{i=1}^n Π_{j=1}^d K((xj − Xij)/hj).    (198)

To simplify further, we can rescale the variables to have the same variance and then use only one bandwidth.

Constructing confidence bands for kernel density estimators is similar to regression. Note that f̂n(x) is just a sample average: f̂n(x) = n⁻¹ Σᵢ Zi(x) where Zi(x) = (1/h) K((x − Xi)/h). So the standard error is ŝe(x) = s(x)/√n where s²(x) = n⁻¹ Σᵢ (Zi(x) − f̂n(x))² is the sample variance of the Zi(x)'s. Then we use the bands f̂n(x) ± z_{α/(2n)} ŝe(x).

The accompanying figure shows two examples, each with n = 1000: data from N(0, 1) and data from (1/2)N(−1, .1) + (1/2)N(1, .1). In both cases we show the estimates using cross-validation and the Normal reference rule, together with bands; the true curve is also shown (it is the curve outside the bands in the last plot). The eye is not a good judge of risk: do not assume that, if the estimator f̂ is wiggly, then cross-validation has let you down. In this case it worked well, but of course there are examples where there are problems.
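The Normal reference rule is itself a one-line bandwidth formula. A Python sketch (illustrative translation; R's bw.nrd implements a closely related rule):

```python
import numpy as np

def normal_reference_bandwidth(x):
    """h = 1.06 * sigma_hat * n^(-1/5), with sigma_hat = min(s, Q/1.34)."""
    n = len(x)
    s = np.std(x, ddof=1)                     # sample standard deviation
    q75, q25 = np.percentile(x, [75, 25])
    sigma = min(s, (q75 - q25) / 1.34)        # robust scale estimate
    return 1.06 * sigma * n ** (-0.2)

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
h = normal_reference_bandwidth(x)
```

For N(0, 1) data with n = 1000 this gives h ≈ 1.06 · 1000^{−1/5} ≈ 0.27, illustrating the slow n^{−1/5} decay of the optimal bandwidth.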

Figure: kernel density estimates for the two examples, with confidence bands, using cross-validation (CV) and the Normal reference rule.

A Link Between Regression and Density Estimation. Consider regression again. Recall that

r(x) = E(Y | X = x) = ∫ y f(y|x) dy = ∫ y f(x, y) dy / f(x).    (199–200)

Suppose we compute a bivariate kernel density estimator

f̂(x, y) = (1/n) Σ_{i=1}^n (1/h1) K((x − Xi)/h1) (1/h2) K((y − Yi)/h2)    (201)

and insert it into (200). Assuming that ∫ u K(u) du = 0, and substituting u = (y − Yi)/h2, we see that

∫ y (1/h2) K((y − Yi)/h2) dy = ∫ (h2 u + Yi) K(u) du = h2 ∫ u K(u) du + Yi ∫ K(u) du = Yi.    (202–204)

Hence,

∫ y f̂(x, y) dy = (1/n) Σ_{i=1}^n Yi (1/h1) K((x − Xi)/h1).    (205–206)

124 . the kernel regression estimator can be derived from kernel density estimation. y)dy = = Therefore. y) f (x. y)dx n i=1 1 Y i h1 K x−Xi h1 x−Xi h1 (210) = 1 n n i=1 n 1 i=1 h1 K (211) = Yi K K x−Xi h1 x−Xi h1 n i=1 (212) which is the kernel regression estimator.= Also. (207) 1 n 1 n n i=1 n i=1 1 K h1 1 K h1 x − Xi h1 . f(x. 1 K h2 y − Yi h2 dy (208) (209) x − Xi h1 = 1 n y f(x. In other words. r(x) 1 n n Yi i=1 1 K h1 x − Xi h1 .

46 Classification

REFERENCES:
1. Hastie, Tibshirani and Friedman (2001). The Elements of Statistical Learning.
2. Devroye, Györfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
3. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning.

The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination, or pattern recognition. Consider IID data (X1, Y1), ..., (Xn, Yn) where Xi = (Xi1, ..., Xid)ᵀ ∈ X ⊂ R^d is a d-dimensional vector and Yi takes values in {0, 1}. Often, the covariates X are also called features. A classification rule is a function h : X → {0, 1}. When we observe a new X, we predict Y to be h(X). The classification risk (or error rate) of h is

R(h) = P(Y ≠ h(X)).    (213)

This is the same as binary regression except that the focus is on good prediction rather than on estimating the regression function.

EXAMPLES:
1. Identify handwritten digits from images. Each Y is a digit from 0 to 9. There are 256 covariates x1, ..., x256 corresponding to the intensity values from the pixels of a 16 × 16 image. See Figure 52.
2. Predict whether an email message is spam or real.
3. Identify whether glass fragments in a criminal investigation are from a window or not, based on chemical composition.
4. Predict whether a stock will go up or down based on past performance. Here X is past price and Y is the future price.
5. The Coronary Risk-Factor Study (CORIS) data. There are 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease, and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age.

46.1 Example. The covariate X = (X1, X2) is 2-dimensional and the outcome Y ∈ Y = {0, 1}. Figure 53 shows 100 data points: the triangles represent Y = 1 and the squares represent Y = 0. Also shown is a linear classification rule, represented by the solid line. This is a rule of the form

h(x) = 1 if a + b1 x1 + b2 x2 > 0, and 0 otherwise.

Everything above the line is classified as a 0 and everything below the line is classified as a 1. The goal is to predict Y given a new X.

126 .Figure 52: Zip code data.

Figure 53: Two covariates and a linear decision boundary. Triangles: Y = 1. Squares: Y = 0. These two groups are perfectly separated by the linear decision boundary.

Error Rates, the Bayes Classifier and Regression

The true error rate (or classification risk) of a classifier h is

R(h) = P({h(X) ≠ Y})    (214)

and the empirical error rate or training error rate is

R̂n(h) = (1/n) Σ_{i=1}^n I(h(Xi) ≠ Yi).    (215)

Now we will relate classification to regression. Let r(x) = E(Y | X = x) = P(Y = 1 | X = x) denote the regression function. We have the following important result.

Theorem. The rule h that minimizes R(h) is

h*(x) = 1 if r(x) > 1/2, and 0 otherwise.    (216)

The rule h* is called the Bayes' rule, and its risk R* = R(h*) is called the Bayes' risk. The set

D(h) = {x : r(x) = 1/2}    (217)

is called the decision boundary.

PROOF. We will show that R(h) − R(h*) ≥ 0. Note that

R(h) = P({Y ≠ h(X)}) = ∫ P({Y ≠ h(X)} | X = x) f(x) dx.

Let r̂(x) be an estimate of r*(x) and define the plug-in rule:

ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

In the previous proof we showed that

P(Y ≠ ĥ(X) | X = x) − P(Y ≠ h*(X) | X = x) = (2r*(x) − 1)(I_{h*(x)=1} − I_{ĥ(x)=1}) = |2r*(x) − 1| I_{h*(x)≠ĥ(x)} = 2|r*(x) − 1/2| I_{h*(x)≠ĥ(x)}.

Now, when h*(x) ≠ ĥ(x), the functions r̂(x) and r*(x) lie on opposite sides of 1/2, so |r̂(x) − r*(x)| ≥ |r*(x) − 1/2|. Therefore,

P(ĥ(X) ≠ Y) − P(h*(X) ≠ Y) = 2 ∫ |r*(x) − 1/2| I_{h*(x)≠ĥ(x)} f(x) dx    (219)

It is possible for r to be far from r ∗ (x) and still lead to a good classiﬁer. As long as r(x) and r ∗ (x) are on the same side of 1/2 they yield the same classiﬁer. 46. 3.≤ 2 ≤ 2 |r(x) − r∗ (x)|Ih∗ (x)=b f (x)dx h(x) |r(x) − r∗ (x)|f (x)dx = 2E|r(X) − r∗ (X)|. Deﬁne r(x) = P(Y = 1|X = x) = π f1 (x) π f1 (x) + (1 − π)f0 (x) 129 . 46.3 The Bayes’ Rule and the Class Densities We can rewrite h∗ in a different way. 2. Thus we have: The Bayes’ rule can be written as: h (x) = ∗ (220) where = f (x|Y = 0) = f (x|Y = 1) = P(Y = 1). Estimate f0 from the Xi ’s for which Yi = 0. This means that if r(x) is close to r ∗ (x) then the classiﬁcation risk will be close to the Bayes risk. There are three main approaches: 1. Empirical Risk Minimization. Density Estimation.4 How to Find a Good Classiﬁer The Bayes rule depends on unknown quantities so we need to use the data to ﬁnd some approximation to the Bayes rule. Find an estimate r of the regression function r and deﬁne h(x) = 1 1 if r(x) > 2 0 otherwise. From Bayes’ theorem we have that r(x) = P(Y = 1|X = x) f (x|Y = 1)P(Y = 1) = f (x|Y = 1)P(Y = 1) + f (x|Y = 0)P(Y = 0) πf1 (x) = πf1 (x) + (1 − π)f0 (x) f0 (x) f1 (x) π We call f0 and f1 the class densities. Choose a set of classiﬁers H and ﬁnd h ∈ H that minimizes some estimate of L(h). Regression (Plugin Classiﬁers). 1 if f1 (x) f0 (x) > (1−π) π (221) 0 otherwise. estimate f1 from the Xi ’s for which Yi = 1 n and let π = n−1 i=1 Yi . The converse is not true.

Hoeffding’s Inequality If X1 . j Recall that H = {h1 . . . h = argminh∈H Rn (h) = argminh∈H 1 n I(h(Xi ) = Yi ) . R(h) ≤ R(h∗ ) + for some small > 0. with high probability. . j This follows since P(max Zj ≥ c) = P({Z1 ≥ c} or {Z2 ≥ c} or · · · or {Zm ≥ c}) ≤ j P(Zj > c). hm } consists of ﬁnitely many classiﬁers. i (222) Let h∗ be the best classiﬁer in H. . . Our main tool for this analysis is Hoeffding’s inequality. 46. 2 Fix α and let n = 2 log n Then P max |Rn (h) − R(h)| > h∈H n Hence. Xn ∼ Bernoulli(p). for any > 0. ≤ 2me−2n . . . .5 Empirical Risk Minimization: The Finite Case Let H be a ﬁnite set of classiﬁers. . P (|p − p| > ) ≤ 2e−2n where p = n−1 n i=1 2 (223) Xi . also called the empirical risk. Empirical risk minimization means choosing the classiﬁer h ∈ H to minimize the training error Rn (h).and h(x) = 1 if r(x) > 1 2 0 otherwise. Another basic fact we need is the union bound: if Z1 . . . We will now show that. R(h∗ ) = minh∈H R(h). This inequality is very fundamental and is used in many places in statistics and machine learning. How good is h compared to h∗ ? We know that R(h∗ ) ≤ R(h). that is. the following is true: R(h) ≤ R(h) + Summarizing: P R(h) > R(h∗ ) + We might extend our analysis to inﬁnite H later. 130 8 log n 2m α ≤ α. . α ≤ α. Zm are random variables then P(max Zj ≥ c) ≤ j P(Zj > c). with probability at least 1 − α. then. Now we see that: P max |Rn (h) − R(h)| > h∈H ≤ H∈H P |Rn (h) − R(h)| > 2m . . n ≤ R(h∗ ) + n ≤ R(h∗ ) + 2 n . Thus.

let’s restrict ourselves only to Y = 0 and Y = 1.data=sa. it can sometimes lead to a good classiﬁer.2662338 "typea" 46.46.5] = 1 > print(table(chd.3 Example.n) > yhat[tmp > . Nonetheless. An alternative is to use logistic regression: r(x) = P(Y = 1|X = x) = eβ0 + 1+e P j β0 + βj xj P j βj xj .yhat)) yhat chd 0 1 0 260 42 1 76 84 > print(sum( chd != yhat)/n) [1] 0. For the digits example. Let us return to the South African heart disease data.data)) [1] "sbp" "tobacco" "ldl" "adiposity" "famhist" [7] "obesity" "alcohol" "age" "chd" > n = nrow(sa.data=sa. . once we have an estimate r.yhat)) yhat chd 0 1 0 256 46 1 77 83 > print(sum( chd != yhat)/n) [1] 0.n) > yhat[tmp > .5] = 1 > print(table(chd.data) > > ### linear > out = lm(chd ˜ .6 Parametric Methods I: Linear and Logistic Regression One approach to classiﬁcation is to estimate the regression function r(x) = E(Y |X = x) = P(Y = 1|X = x) and.type="response") > yhat = rep(0. (226) 46. .2 Example. (224) Y = r(x) + = β0 + j=1 βj Xj + (225) can’t be correct since it does not force Y = 0 or 1. > print(names(sa.data.2554113 > > ### logistic > out = glm(chd ˜.data) > tmp = predict(out) > yhat = rep(0.family=binomial) > tmp = predict(out. Here is what we get: 131 . use the classiﬁcation rule h(x) = The linear regression model d 1 1 if r(x) > 2 0 otherwise.

X|Y = 0 ∼ N (µ0 .frame(xtrain)) tmp = predict(out) n = length(ytrain) yhat = rep(0.data=as.1])/sum(b)) ###training error [1] 0 > tmp = predict(out.dat > n = length(ytest) > yhat = rep(0. 1.1])/sum(b)) ### testing error [1] 0. k 2 (2π)d/2 |Σk |1/2 Thus.frame(xtest)) Warning message: prediction from a rank-deficient fit may be misleading in: predict.data.newdata=as.yhat) print(b) yhat ytrain 0 1 0 600 0 1 0 500 > print((b[1.003639672 46.lm(out. then the Bayes rule is h∗ (x) = where 2 ri = (x − µi )T Σ−1 (x − µi ). 2 i 2 2 1 if r1 < r0 + 2 log 0 otherwise π1 π0 |Σ0 | |Σ1 | + log (227) (228) is the Manalahobis distance. newdata = as.2]+b[2.> > > > > > > > ### linear out = lm(ytrain ˜ .n) > yhat[tmp > .data.5] = 1 > b = table(ytest. i = 1.2]+b[2.yhat) > print(b) yhat ytest 0 1 0 590 4 1 0 505 > print((b[1. Σ0 ) and X|Y = 1 ∼ N (µ1 .4 Theorem.n) yhat[tmp > . Σ1 ). 46.5] = 1 b = table(ytrain.1} δk (x) 1 1 δk (x) = − log |Σk | − (x − µk )T Σ−1 (x − µk ) + log πk k 2 2 and |A| denotes the determinant of a matrix A. k = 0. Σ1 ). If X|Y = 0 ∼ N (µ0 .. 132 where (229) .7 Parametric Methods II: Gaussian and Linear Classiﬁers Suppose that f0 (x) = f (x|Y = 0) and f1 (x) = f (x|Y = 1) are both multivariate Gaussians: fk (x) = 1 1 exp − (x − µk )T Σ−1 (x − µk ) . Σ0 ) and X|Y = 1 ∼ N (µ1 . An equivalent way of expressing the Bayes’ rule is h∗ (x) = argmaxk∈{0.

µ2 . S1 = 1 n1 i: Yi =1 (Xi − µ1 )(Xi − µ1 )T where n0 = i (1 − Yi ) and n1 = i Yi . 46. except that the MLE of Σ is S= The classiﬁcation rule is h∗ (x) = where n0 S0 + n 1 S1 .24. In practice. . 2 k The parameters are estimated as before. π1 = Xi .5 Example. the Bayes rule is h(x) = argmaxk δk (x) 1 1 δk (x) = − log |Σk | − (x − µk )T Σ−1 (x − µk ) + log πk .y) ### or qda for quadratic yhat = predict(out)$class The error rate of LDA is . If fk (x) = f (x|Y = k) is Gaussian. . k 2 2 If the variances of the Gaussians are equal. 46. then 1 δk (x) = xT Σ−1 µk − µT Σ−1 + log πk . In this example. K}. In R use: out = lda(x. namely: π0 µ0 S0 = = = 1 n n i=1 (1 − Yi ). Σ1 in place of the true value. For QDA we get . Σ0 . Let us return to the South African heart disease data.25. . . the Bayes rule is h∗ (x) = argmaxk δk (x) 1 δk (x) = xT Σ−1 µk − µT Σ−1 + log πk . there is little advantage to QDA over LDA. µ1 . we use sample estimates of π. n0 + n 1 1 if δ1 (x) > δ0 (x) 0 otherwise (232) where now (230) (231) 1 δj (x) = xT S −1 µj − µT S −1 µj + log πj 2 j is called the discriminant function. A simpliﬁcation occurs if we assume that Σ0 = Σ0 = Σ. 2 k where (233) (234) 133 .The decision boundary of the above classiﬁer is quadratic so this procedure is called quadratic discriminant analysis (QDA). In that case. The decision boundary {x : δ0 (x) = δ1 (x)} is linear so this method is called linear discrimination analysis (LDA).6 Theorem. Suppose that Y ∈ {1. µ 1 = 1 n n Yi i=1 1 n0 1 n0 i: Yi =0 1 n1 Xi i: Yi =1 i: Yi =0 (Xi − µ0 )(Xi − µ0 )T . Now we generalize to the case where Y takes on more than two values.

wd ) that “best separates the data. Algebraically.7 Theorem. . The goal is to choose the vector w = (w1 . Let nj = i=1 I(Yi = j) be the number of observations in group j. 5 Deﬁne the separation by J(w) = = = n (E(U |Y = 0) − E(U |Y = 1))2 wT Σw T T (w µ0 − w µ1 )2 wT Σw T w (µ0 − µ1 )(µ0 − µ1 )T w . then we saw earlier that log P(Y = 1|X = x) P(Y = 0|X = x) = log π0 π1 1 − (µ0 + µ1 )T Σ−1 (µ1 − µ0 ) 2 ≡ α0 + αT x. (n0 − 1) + (n1 − 1) (236) (237) is a minimizer of J(w). . 1 1 −1 (X 0 + X 1 ) = (X 0 − X 1 )T SB (X 0 + X 1 ) 2 2 (238) Fisher’s rule is the same as the Bayes linear classiﬁer in equation (231) when π = 1/2. let X j be the sample mean vector of the X’s for group j. Deﬁne J(w) = where SB SW 46. this means replacing the covariate d X = (X1 . The idea is to ﬁrst reduce the dimension of covariates to one dimension by projecting the data onto a line. .8 Relationship Between Logistic Regression and LDA LDA and logistic regression are almost the same thing. and let Sj be the sample covariance matrix in group j. If we assume that each group is Gaussian with the same covariance matrix. . . Then E(U |Y = j) = E(w T X|Y = j) = wT µj and V(U ) = w T Σw. Xd ) with a linear combination U = w T X = j=1 wj Xj .” Then we perform classiﬁcation with the one-dimensional covariate Z instead of X. where it is called the Rayleigh coefﬁcient. wT Σw We estimate J as follows. 134 . 5 The + xT Σ−1 (µ1 − µ0 ) quantity J arises in physics. The vector −1 w = SW (X 0 − X 1 ) w T SB w w T SW w (235) = (X 0 − X 1 )(X 0 − X 1 )T (n0 − 1)S0 + (n1 − 1)S1 = . We need deﬁne what we mean by separation of the groups. The midpoint m between X 0 and X 1 is m= Fisher’s classiﬁcation rule is h(x) = 0 if wT X ≥ m 1 if wT X < m. We call −1 U = wT X = (X 0 − X 1 )T SW X the Fisher linear discriminant function.There is another version of linear discriminant analysis due to Fisher. Let µj denote the mean of X for Y = j and let Σ be the variance matrix of X. 46. . . . 
We would like the two groups to have means that are far apart relative to their spread.

The difference is in how we estimate the parameters. by assumption. To summarize: LDA and logistic regression both lead to a linear classiﬁcation rule. In LDA we estimated the whole joint distribution by maximizing the likelihood f (Xi . i Bernoulli (239) In logistic regression we maximized the conditional likelihood f (Xi . yi ) = i i f (Xi |yi ) Gaussian f (yi ) . yi ) = i i i f (yi |Xi ) but we ignored the second term f (Xi ): f (Xi ) . These are the same model since they both lead to classiﬁcation rules that are linear in x. the logistic model is. This is an advantage of the logistic regression approach over LDA. 135 . Logistic regression leaves the marginal distribution f (x) unspeciﬁed so it is more nonparametric than LDA. log P(Y = 1|X = x) P(Y = 0|X = x) = β0 + β T x. y) = f (x|y)f (y). we don’t really need to estimate the whole joint distribution. y) = f (x|y)f (y) = f (y|x)f (x). i ignored f (yi |Xi ) logistic (240) Since classiﬁcation only requires knowing f (y|x). In LDA we estimate the entire joint distribution f (x.On the other hand. In logistic regression we only estimate f (y|x) and we don’t bother estimating f (x). The joint density of a single observation is f (x.

9 Example. The logistic (not shown) also yields a linear boundary. 46. 136 . 46. South African heart disease data again. library(class) m = 50 error = rep(0. (ii) quadratic regression.46.lwd=3.type="l".yhat) print(b) yhat ytest 0 1 0 594 0 1 0 505 > print((b[1.2]+b[2.m) for(i in 1:m){ out = knn. In that case you should standardize the variables ﬁrst. The boundaries are from (i) linear regression.cl=y.10 Example. (iii) k-nearest neighbors (k = 1).sa.8 Example. cl = ytrain.xlab="k".cv(train = xtrain. 46. and (v) k-nearest neighbors (k = 200). k = 1) > b = table(ytrain. For this we can use cross-validation. Figure 55 compares the decision boundaries in a two-dimensinal example. Digits again. Often we use Euclidean distance ||Xi − Xj ||.ps") plot(1:m. k = 1) b = table(ytest.9 Nearest Neighbors The k-nearest neighbor rule is h(x) = 1 i=1 wi (x)I(Yi = 1) > 0 otherwise n n i=1 wi (x)I(Yi = 0) (241) where wi (x) = 1 if Xi is one of the k nearest neighbors of x. wi (x) = 0.ylab="error") See Figure 54.1])/sum(b)) [1] 0.yhat) > print(b) yhat ytrain 0 1 0 599 1 1 0 500 > print((b[1. (iv) k-nearest neighbors (k = 50). cl = ytrain. test = xtest.k=i) error[i] = sum(y != out)/n } postscript("knn.1])/sum(b)) [1] 0 > > > yhat = knn.2]+b[2.error. > > > > > ### knn library(class) yhat = knn(train = xtrain.cv(train=x. “Nearest” depends on how you deﬁne the distance.0009090909 An important part of this method is to choose a good value of k. otherwise.

32 0. 137 .40 0.34 0.36 0.42 0.error 0.38 0.44 0 10 20 k 30 40 50 Figure 54: knn for South Africn heart disease data.

0 0.4 0.6 0.6 0.8 1.8 1.8 1.8 1. 138 .2 0.0 0.8 1.2 0.8 1.4 0.4 0.0 Data linear 1.4 0.6 0.8 1.6 0.2 0.2 0.0 0.0 quadratic knn k =1 1.2 0.6 0.0 0.0 knn k =50 knn k =200 Figure 55: Comparison of decision boundaries.4 0.6 0.0 0.2 0.0 0.0 0.0 0.2 0.0 0.2 0.2 0.0 0.4 0.0 0.0 0.0 0.1.0 0.6 0.0 0.4 0.6 0.0 0.0 0.2 0.4 0.0 0.4 0.6 0.2 0.2 0.0 0.0 0.8 1.6 0.0 0.8 0.8 0.4 0.6 0.8 0.4 0.8 1.6 0.4 0.

Some Theoretical Properties. Let h1 be the nearest neighbor classifier with k = 1. Cover and Hart (1967) showed that, under very weak assumptions,

R* ≤ lim_{n→∞} R(h1) ≤ 2R*    (242)

where R* is the Bayes risk. For k > 1 we have

R* ≤ lim_{n→∞} R(hk) ≤ R* + 1/√(ke).    (243)

46.10 Density Estimation and Naive Bayes

The Bayes rule can be written as

h*(x) = 1 if f1(x)/f0(x) > (1 − π)/π, and 0 otherwise.    (244)

We can estimate π by π̂ = n⁻¹ Σ_{i=1}^n Yi, and we can estimate f0 and f1 using density estimation. For example, we could apply kernel density estimation to D0 = {Xi : Yi = 0} to get f̂0 and to D1 = {Xi : Yi = 1} to get f̂1. Then we estimate h* with

ĥ(x) = 1 if f̂1(x)/f̂0(x) > (1 − π̂)/π̂, and 0 otherwise.    (245)

But if x = (x1, ..., xd) is high-dimensional, nonparametric density estimation is not very reliable. This problem is ameliorated if we assume that X1, ..., Xd are independent, for then

f0(x1, ..., xd) = Π_{j=1}^d f0j(xj)    (246)
f1(x1, ..., xd) = Π_{j=1}^d f1j(xj).    (247)

We can then use one-dimensional density estimators and multiply them:

f̂0(x1, ..., xd) = Π_{j=1}^d f̂0j(xj)    (248)
f̂1(x1, ..., xd) = Π_{j=1}^d f̂1j(xj).    (249)

The resulting classiﬁer is called the naive Bayes classiﬁer. The assumption that the components of X are independent is usually wrong yet the resulting classiﬁer might still be accurate. Here is a summary of the steps in the naive Bayes classiﬁer:


The Naive Bayes Classifier

1. For each group k = 0, 1, compute an estimate f-hat_kj of the density f_kj for Xj, using the data for which Yi = k.
2. Let
       f-hat_k(x) = f-hat_k(x1, ..., xd) = prod_{j=1}^d f-hat_kj(xj).
3. Let
       pi-hat = (1/n) sum_{i=1}^n Yi.
4. Define h-hat as in (245).

Naive Bayes is closely related to generalized additive models. Under the naive Bayes model,

    logit P(Y = 1 | X) / P(Y = 0 | X)
      = log [ pi f1(X) / ((1 - pi) f0(X)) ]                                        (250)
      = log [ pi prod_{j=1}^d f1j(Xj) / ((1 - pi) prod_{j=1}^d f0j(Xj)) ]          (251)
      = log( pi/(1 - pi) ) + sum_{j=1}^d log( f1j(Xj)/f0j(Xj) )                    (252)
      = beta0 + sum_{j=1}^d gj(Xj)                                                 (253)

which has the form of a generalized additive model. Thus we expect similar performance using naive Bayes or generalized additive models.

46.11 Example. For the SA data. Note the use of the gam package.

n = nrow(sa.data)
y = chd
x = sa.data[,1:9]
library(gam)
out = gam(y ~ lo(sbp, span = .25, degree = 1) +
            lo(tobacco, span = .25, degree = 1) +
            lo(ldl, span = .25, degree = 1) +
            lo(adiposity, span = .25, degree = 1) +
            famhist +
            lo(typea, span = .25, degree = 1) +
            lo(obesity, span = .25, degree = 1) +
            lo(alcohol, span = .25, degree = 1) +
            lo(age, span = .25, degree = 1))
tmp = fitted(out)
yhat = rep(0, n)
yhat[tmp > .5] = 1
print(table(y, yhat))

   yhat
y     0   1
  0 256  46
  1  77  83

print(mean(y != yhat))
[1] 0.2662338

46.12 Example. Figure 56 shows an artificial data set with two covariates x1 and x2. Figure 57 shows kernel density estimators f1-hat(x1), f1-hat(x2), f0-hat(x1), f0-hat(x2). The top left plot shows the resulting naive Bayes decision boundary. The bottom left plot shows the predictions from a gam model. Clearly, this is similar to the naive Bayes model. The gam model has an error rate of 0.03. In contrast, a linear model yields a classifier with an error rate of 0.78.

Figure 56: Artificial data.

46.11 Trees

Trees are classification methods that partition the covariate space X into disjoint pieces and then classify the observations according to which partition element they fall in. As the name implies, the classifier can be represented as a tree. For illustration, suppose there are two covariates, X1 = age and X2 = blood pressure. Figure 59 shows a classification tree using these variables. The tree is used in the following way. If a subject has Age >= 50 then we classify him as Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 60 shows the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that y in Y = {0, 1} and that there is only a single covariate X. We choose a split point t that divides the real line into two sets A1 = (-infinity, t] and A2 = (t, infinity). Let p-hat_s(j) be the

Figure 57: Density estimates.

Figure 58: Naive Bayes and GAM classifiers.

Figure 59: A simple classification tree.

Figure 60: Partition representation of the classification tree.

proportion of observations in As such that Yi = j:

    p-hat_s(j) = sum_{i=1}^n I(Yi = j, Xi in As) / sum_{i=1}^n I(Xi in As)    (254)

for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be

    I(t) = sum_{s=1}^2 gamma_s    (255)

where

    gamma_s = 1 - sum_{j=0}^1 p-hat_s(j)^2.    (256)

This particular measure of impurity is known as the Gini index. If a partition element As contains all 0's or all 1's, then gamma_s = 0. Otherwise, gamma_s > 0. We choose the split point t to minimize the impurity. (Other indices of impurity besides the Gini index can be used.) When there are several covariates, we choose whichever covariate and split lead to the lowest impurity. This process is continued until some stopping criterion is met. For example, we might stop when every partition element has fewer than n0 data points, where n0 is some fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition element.

This procedure is easily generalized to the case where Y in {1, ..., K}. We simply define the impurity by

    gamma_s = 1 - sum_{j=1}^K p-hat_s(j)^2    (257)

where p-hat_s(j) is the proportion of observations in the partition element for which Y = j.

46.13 Example. Heart disease data.

X = scan("sa.data", skip = 1, sep = ",")
>Read 5082 items
X = matrix(X, ncol = 11, byrow = T)
chd = X[,11]
n = length(chd)
X = X[, -c(1, 11)]
names = c("sbp", "tobacco", "ldl", "adiposity", "famhist",
          "typea", "obesity", "alcohol", "age")
for(i in 1:9){
  assign(names[i], X[,i])
}
famhist = as.factor(famhist)
formula = paste(names, sep = "", collapse = "+")
formula = paste("chd ~ ", formula)
formula = as.formula(formula)
print(formula)
> chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity +
>     alcohol + age
chd = as.factor(chd)
d = data.frame(chd, sbp, tobacco, ldl, adiposity, famhist,
               typea, obesity, alcohol, age)
library(tree)
postscript("south.africa.tree.ps")
out = tree(formula, data = d)
print(summary(out))
>Classification tree:
>tree(formula = formula, data = d)
>Variables actually used in tree construction:
>[1] "age" "tobacco" "alcohol" "typea" "famhist" "adiposity" "ldl"
>Number of terminal nodes: 15
>Residual mean deviance: 0.8733 = 390.3 / 447
>Misclassification error rate: 0.2078 = 96 / 462
plot(out, lwd = 3)
text(out)
cv = cv.tree(out, method = "misclass")
plot(cv, lwd = 3)
newtree = prune.tree(out, best = 6, method = "misclass")
print(summary(newtree))
>Classification tree:
>snip.tree(tree = out, nodes = c(2, 29, 63, 28, 15))
>Variables actually used in tree construction:
>[1] "age" "typea" "famhist" "tobacco"
>Number of terminal nodes: 6
>Residual mean deviance: 1.042 = 475.2 / 456
>Misclassification error rate: 0.2294 = 106 / 462
plot(newtree, type = "u", lwd = 3)
text(newtree, cex = 2)

See Figures 61, 62, 63.

Figure 61: Tree.

Figure 62: Tree.

Figure 63: Tree.
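The split search described above can be sketched in a few lines of base R. The following searches one covariate for the split point minimizing the Gini impurity (255)-(256); the function name `gini_split` is illustrative, and this is not the tree-growing code used in the notes.

```r
# Search for the best split point of a single covariate using the
# Gini impurity I(t) = gamma_1 + gamma_2 from (255)-(256).
gini_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2  # midpoints as candidate t
  imp <- sapply(cuts, function(t) {
    g <- function(idx) {                         # Gini index of one side
      p <- mean(y[idx] == 1)
      1 - p^2 - (1 - p)^2
    }
    g(x <= t) + g(x > t)
  })
  cuts[which.min(imp)]                           # split minimizing I(t)
}

# Toy data: a clean gap between the classes at x between 3 and 10.
x <- c(1, 2, 3, 10, 11, 12)
y <- c(0, 0, 0, 1, 1, 1)
t_best <- gini_split(x, y)
```

Note that if either side of a candidate split is pure (all 0's or all 1's), its gamma is 0, matching the remark in the text; the pure split here attains impurity exactly 0.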

46.12 Perceptrons and Support Vector Machines

In this section we consider a class of linear classifiers called support vector machines. It will be convenient to label the outcomes as -1 and +1 instead of 0 and 1. A linear classifier can then be written as

    h(x) = sign(H(x))

where x = (x1, ..., xd),

    H(x) = a0 + sum_{i=1}^d ai xi

and sign(z) = -1 if z < 0, sign(z) = 0 if z = 0, and sign(z) = 1 if z > 0. Note that:

    classifier correct   ==>  Yi H(Xi) >= 0
    classifier incorrect ==>  Yi H(Xi) <= 0.

The classification risk is

    R = P(Y != h(X)) = P(Y H(X) <= 0) = E(L(Y H(X)))

where the loss function is L(a) = 1 if a < 0 and L(a) = 0 if a >= 0.

Suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly separates the two classes.

46.14 Lemma. The data can be separated by some hyperplane if and only if there exists a hyperplane H(x) = a0 + sum_{i=1}^d ai xi such that

    Yi H(Xi) >= 1,  i = 1, ..., n.    (258)

PROOF. Suppose the data can be separated by a hyperplane W(x) = b0 + sum_{i=1}^d bi xi. It follows that there exists some constant c such that Yi = 1 implies W(Xi) >= c and Yi = -1 implies W(Xi) <= -c. Therefore, Yi W(Xi) >= c for all i. Let H(x) = a0 + sum_{i=1}^d ai xi where aj = bj/c. Then Yi H(Xi) >= 1 for all i. The reverse direction is straightforward.

How can we find a separating hyperplane? LDA is not guaranteed to find it. Rosenblatt's perceptron algorithm takes starting values and updates them by cycling through the misclassified points:

    (beta, beta0) <- (beta, beta0) + rho (Yi Xi, Yi).

However, there are many separating hyperplanes, and the particular separating hyperplane that this algorithm converges to depends on the starting values.

Intuitively, it seems reasonable to choose the hyperplane "furthest" from the data in the sense that it separates the +1s and -1s and maximizes the distance to the closest point. This hyperplane is called the maximum margin hyperplane. The margin is the distance from the hyperplane to the nearest point. Points on the boundary of the margin are called support vectors. See Figure 64. The goal, then, is to maximize the margin, subject to (258). Given two vectors a and b, let <a, b> = a^T b = sum_j aj bj denote the inner product of a and b.

Figure 64: The hyperplane H(x) = a0 + a^T x = 0 has the largest margin of all hyperplanes that separate the two classes.

46.15 Theorem. Let H-hat(x) = a0-hat + sum_{i=1}^d ai-hat xi denote the optimal (largest margin) hyperplane. Then, for j = 1, ..., d,

    aj-hat = sum_{i=1}^n alpha_i-hat Yi Xj(i)

where Xj(i) is the value of the covariate Xj for the ith data point, and alpha-hat = (alpha_1-hat, ..., alpha_n-hat) is the vector that maximizes

    sum_{i=1}^n alpha_i - (1/2) sum_{i=1}^n sum_{k=1}^n alpha_i alpha_k Yi Yk <Xi, Xk>    (259)

subject to alpha_i >= 0 and 0 = sum_i alpha_i Yi.

The points Xi for which alpha_i-hat != 0 are called support vectors. a0-hat can be found by solving

    alpha_i-hat [ Yi (Xi^T a-hat + a0-hat) - 1 ] = 0

for any support point Xi. H-hat may be written as

    H-hat(x) = a0-hat + sum_{i=1}^n alpha_i-hat Yi <x, Xi>.

If there is no perfect linear classifier, then one allows overlap between the groups by replacing the condition (258) with

    Yi H(Xi) >= 1 - xi_i,  xi_i >= 0,  i = 1, ..., n.    (260)

The variables xi_1, ..., xi_n are called slack variables. We now maximize (259) subject to

    0 <= alpha_i <= c,  i = 1, ..., n,  and  sum_{i=1}^n alpha_i Yi = 0.

The constant c is a tuning parameter that controls the amount of overlap. There are many software packages that will solve this problem quickly.

46.16 Example. The iris data. In R we can use the package e1071.

library(e1071)
data(iris)
x = iris[51:150,]
a = x[,5]
x = x[,-5]
attributes(a)
$levels
[1] "setosa" "versicolor" "virginica"
$class
[1] "factor"
n = length(a)
y = rep(0, n)
y[a == "versicolor"] = 1
y = as.factor(y)
out = svm(x, y)
summary(out)
Call:
svm.default(x = x, y = y)
Parameters:
   SVM-Type: C-classification
 SVM-Kernel: radial
       cost: 1
      gamma: 0.25
Number of Support Vectors: 33 ( 17 16 )
Number of Classes: 2
Levels: 0 1

## test with train data
pred = predict(out, x)
table(pred, y)
     y
pred   0  1
   0  49  2
   1   1 48

M = cmdscale(dist(x))
plot(M, col = as.integer(y) + 1, pch = as.integer(y) + 1)
## support vectors
I = 1:n %in% out$index
points(M[I,], col = as.integer(y[I]) + 1, lwd = 2)

See Figure 65.

Figure 65: The iris data; support vectors are highlighted.

Here is another (easier) way to think about the SVM. The SVM hyperplane H(x) = beta0 + x^T beta can be obtained by minimizing

    sum_{i=1}^n (1 - Yi H(Xi))_+ + lambda ||beta||^2.

Figure 66 compares the svm (hinge) loss, squared loss, classification error, and logistic loss log(1 + e^{-yH(x)}).

Figure 66: Hinge loss and other loss functions as a function of y H(x).
we might increase the computational burden. φ maps X = R2 into Z = R3 . kernelization involves ﬁnding a mapping φ : X → Z and a classiﬁer such that: 154 . There is a potential drawback. φ(x) x2 x2 + 2x1 x1 x2 x2 + x2 x2 1 1 2 2 ( x.376. The idea is to map the covariate X — which takes values in X — into a higher dimensional space Z and apply the classiﬁer in the bigger space Z. z without ever computing Zi = φ(Xi ). notice in our example that the inner product in Z can be written z. If we signiﬁcantly expand the dimension of the problem. To summarize. a linear classiﬁer in a higher-dimensional space corresponds to a non-linear classiﬁer in the original space. x )2 ≡ K(x. x). We simply map the covariates to a higher-dimensional space. 2 1 Thus. 2x1 x2 . just the inner product between pairs of points. This is akin to making linear regression more ﬂexible by using polynomials. many classiﬁers do not require that we know the values of the individual points but. if x has dimension d = 256 and we wanted to use all fourth-order terms. Mapping the covariates into a higher-dimensional space can make a complicated decision boundary into a simpler decision boundary. The point is that to get a richer set of classiﬁers we do not need to give up the convenience of linear classiﬁers. x2 ). the Yi ’s are separable by a linear decision boundary. The standard example of this idea is illustrated in Figure 67. Deﬁne a mapping φ by √ z = (z1 . The covariate x = (x 1 . This can yield a more ﬂexible classiﬁer while retaining computationally simplicity. For example. rather. The Yi s can be separated into two groups using an ellipse. In other words.13 Kernelization There is a trick called kernelization for improving a computationally simple classiﬁer h. In the higher-dimensional space Z. then z = φ(x) has dimension 183. We are spared this computational nightmare by the following two facts. 
we can compute z.181.x2 φ z2 + + + + + + + + + + + + + + + + z1 + + x1 + + + + + z3 Figure 67: Kernelization. z3 ) = φ(x) = (x2 . x2 ). 46. z = = = φ(x). Second. Thus. z2 . First.

2. Examples of commonly used kernels are: r polynomial sigmoid Gaussian K(x. y) = φ(x). There is a function K. J(w) = T w SW w SB = (X 0 − X 1 )(X 0 − X 1 )T and SW = (n0 − 1)S0 (n0 − 1) + (n1 − 1) + (n1 − 1)S1 (n0 − 1) + (n1 − 1) . y).1. 1 Zj = nj n φ(Xi )I(Yi = j). x) that corresponds to φ(x). x) = = = x. x). that if K is positive deﬁnite — meaning that K(x. Also. roughly. It can be shown that the maximizing vector w is a linear combination of the Z i ’s. to take advantage of kernelization. Hence we can write n w= i=1 αi Z i . 4. does there exist a function φ(x) such that K(x. such that φ(x). x) K(x. 3. i=1 155 . Here. φ(x) for some φ. φ(y) ? The answer is provided by Mercer’s theorem which says. replace it with K(x. y)f (x)f (y)dxdy ≥ 0 for square integrable functions f — then such a φ exists. x) K(x. called a kernel. The classiﬁer only requires computing inner products. x + a tanh(a x. Sj is the sample of covariance of the Zi ’s for which Y = j. Z has higher dimension than X and so leads a richer set of classiﬁers. In fact. x appears in the algorithm. x + b) exp −||x − x||2 /(2σ 2 ) Let us now see how we can use this trick in LDA and in support vector machines. We only need to specify a kernel K(x. Everywhere the term x. we replace Xi with Zi = φ(Xi ) and we ﬁnd w to maximize J(w) = where SB = (Z 0 − Z 1 )(Z 0 − Z 1 )T and SW = (n0 − 1)S0 (n0 − 1) + (n1 − 1) + w T SB w w T SW w (261) (n1 − 1)S1 (n0 − 1) + (n1 − 1) . we need to re-express this in terms of inner products and then replace the inner products with kernels. In the kernelized version. This raises an interesting question: given a function of two variables K(x. we never need to construct the mapping φ at all. However. Recall that the Fisher linear discriminant method replaces X with U = w T X where w is chosen to maximize the Rayleigh coefﬁcient w T SB w . x). φ(x) = K(x.

instead of maximizing (259). for some constant b. However. x). 1 is a matrix of all one’s. Xj ). Xs )I(Yi = j). n T wT Z j = = = = = 1 nj 1 nj 1 nj αi Z i i=1 n n 1 nj n φ(Xi )I(Yi = j) i=1 T αi I(Ys = j)Zi φ(Xs ) i=1 s=1 n n αi i=1 n s=1 n I(Ys = j)φ(Xi )T φ(Xs ) I(Ys = j)K(Xi . Formally. 156 . Hence. By similar calculations. the projection onto the new subspace can be written as n U = wT φ(x) = i=1 αi K(Xi . the solution is α = N −1 (M0 − M1 ). s=1 I is the identity matrix. Xj with K(Xi . αT N α All the quantities are expressed in terms of the kernel. Xs ) s=1 αi i=1 α T Mj where Mj is a vector whose ith component is Mj (i) = It follows that w T SB w = α T M α where M = (M0 − M1 )(M0 − M1 )T . we now ﬁnd α to maximize J(α) = αT M α . i=1 k=1 n i=1 (262) The hyperplane can be written as H(x) = a0 + αi Yi K(X. N might be non-invertible. xs ) with xs varying over the observations in group j. and Kj is the n × nj matrix with entries (Kj )rs = K(xr . we can write w T SW w = α T N α where N = K0 I − 1 1 T T 1 K0 + K 1 I − 1 K1 . n0 n1 1 nj n K(Xi .Therefore. Xi ). The support vector machine can similarly be kernelized. Finally. we now maximize n i=1 αi − 1 2 n n αi αk Yi Yk K(Xi . Xj ). In this case one replaces N by N + bI. For example. We simply replace X i .

. . wn . We usually give equal weight to all data points in the methods we have discussed. . Annals of Statistics. The disadvantage of boosting is that the ﬁnal classiﬁer is quite complicated. Assume that Yi ∈ {−1. The ﬁnal classiﬁer is B 1 1 if B b=1 hb (x) ≥ 1 2 h(x) = 0 otherwise. The bth bootstrap sample yields a classiﬁer hb . But one can incorporate unequal weights quite easily in most algorithms. We starting with a simple — and hence highly-biased — classiﬁer. consider the following modifed algorithm: (Friedman. . n. boosting can be thought of as a bias reduction technique. trees with only one split. called AdaBoost. n i=1 wi (d) Update the weights: h(x) = sign j=1 αj hj (x) . Suppose that H is a collection of classiﬁers. . . we could replace the impurity measure with a weighted impurity measure. Boosting is a method for starting with a simple classiﬁer and gradually improving it by reﬁtting the data giving higher weight to misclassiﬁed samples. is as follows. 1} and that each h is such that h(x) ∈ {−1. 337–407): 157 . p. wi ←− wi eαj I(Yi =hj (Xi )) 3. 1. J. It is most helpful for highly nonlinear classiﬁers such as trees.14 Other Classiﬁers There are many other classiﬁers and space precludes a full discussion of all of them. The original version of boosting. We draw B bootstrap samples from the data. . do the following steps: (a) Constructing a classiﬁer hj from the data using the weights w1 . The ﬁnal classiﬁer is J n i=1 wi I(Yi = hj (Xi )) . for example. . . and we gradually reduce the bias. . . Hastie and Tibshirani (2000). (b) Compute the weighted error estimate: Lj = (c) Let αj = log((1 − Lj )/Lj ). 1}. 2. Whereas bagging is a variance reduction technique. Bagging is a method for reducing the variability of a classiﬁer. i = 1. in constructing a tree. There is now an enormous literature trying to explain and improve on boosting. . Let us brieﬂy mention a few.46. For example. Set the weights wi = 1/n. For j = 1. To understand what boosting is doing.
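The key identity behind the kernel trick — that the inner product in Z can be computed in the original space — can be checked numerically. Here is a base-R sketch verifying <phi(x), phi(y)> = (<x, y>)^2 for the mapping phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) used in the example above; the test vectors are illustrative.

```r
# Verify the kernel identity <phi(x), phi(y)> = (<x, y>)^2 for
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), the degree-2 polynomial kernel.
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
K   <- function(x, y) sum(x * y)^2   # kernel computed in the ORIGINAL space

x <- c(1, 2)
y <- c(3, -1)
lhs <- sum(phi(x) * phi(y))          # inner product in the feature space Z
rhs <- K(x, y)                       # same quantity, never forming phi
```

The two sides agree exactly, which is why the feature map phi never needs to be constructed explicitly.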

1 + e2F (x) Friedman. Hence. To see this.1. Moreover. J. . 2. . 3. 158 . The logistic log-likelihood is P(Y = 1|X = x) = Insert Y ∗ = (Y + 1)/2 and p = e2F /(1 + e2F ) and then = Y ∗ log p(x) + (1 − Y ∗ ) log(1 − p(x)). Neural Networks are regression models of the form 6 p Y = β0 + j=1 βj σ(α0 + αT X) where σ is a smooth function. this is essentially logistic regression. 1 log 2 P(Y = 1|X = x) P(Y = −1|X = x) . boosting is essentially stagewise logistic regression. This is really nothing more than a nonlinear regression model. Now do a second order Taylor series expansion around F = 0 to conclude that − (F ) ≈ J(F ) + constant. (c) Set wi ← wi e−Yi fj (Xi ) then normalize the weights to sum to one. Consider the risk function J(F ) = E(e−Y F (X) ). applied to loss J(F ) = E(e −Y F (X) ) yields the boosting algorithm. the number of terms p is essentially a smoothing parameter and there is the usual problem of trying to choose p to ﬁnd a good balance between bias and variance. 1}. let Y ∗ = (Y + 1)/2 so that Y ∗ ∈ {0. i = 1. The ﬁnal classiﬁer is h(x) = sign j=1 J fj (x) . often taken to be σ(v) = ev /(1 + ev ). n. This is minimized by F (x) = Thus. There are more complex versions of the model. In particular. 6 This is the simplest version of a neural net. e2F (x) . Neural nets were fashionable for some time but they pose great computational difﬁculties. . For j = 1. . Set the weights wi = 1/n. . . (F ) = − log(1 + e−2Y F (X) ). . do the following steps: (a) Constructing a weighted binary regression pj (x) = P(Y = 1|X = x). . one often encounters multiple minima when trying to ﬁnd the least squares estimates of the parameters. Also. (b) Let fj (x) = 1 log 2 pj (x) 1 − pj (x) . Hastie and Tibshirani show that stagewise regression.

Another approach to cross-validation is K-fold cross-validation which is obtained from the following algorithm. error rate 0. For k = 1 to K. is to leave out some of the data when ﬁtting a model. Randomly divide the data into K chunks of approximately equal size. C ROSS -VALIDATION . Consider the heart disease data again. we can’t use the training error rate Ln (h) as an estimate of the true error rate because it is biased downward. This is essentially the bias–variance tradeoff phenomenon we have seen before. do the following: 159 . The ninth model includes all the covariates. The classiﬁer h is constructed from the training set.15 Assessing Error Rates and Choosing a Good Classiﬁer How do we choose a good classiﬁer? We would like to have a classiﬁer h with a low true error rate L(h). See Figure 69. We’ll consider two: cross-validation and probability inequalities. we can make a model with zero observed classiﬁcation error. 1. There are many ways to estimate the error rate. and so on.17 Example. The basic idea of cross-validation. If we keep going. We then estimate the error by 1 I(h(Xi ) = YI ).34 0.30 0. 46. (263) L(h) = m Xi ∈V where m is the size of the validation set. Let’s also ﬁt a tenth model that includes all nine covariates plus the ﬁrst covariate squared.46. Usually. which we have already encountered in curve estimation. A common choice is K = 10. Often. The simplest version of cross-validation involves randomly splitting the data into two pieces: the training set T and the validation set V. 2. The solid line in Figure 68 shows the observed classiﬁcation error which steadily decreases as we make the model more complex. The estimated error decreases for a while then increases. K-fold cross-validation.26 5 number of terms in model 15 Figure 68: The solid line is the observed error rate and dashed line is the cross-validation estimate of true error rate. 
The dotted line shows the 10-fold cross-validation estimate of the error rate (to be explained shortly) which is a better estimate of the true error rate than the observed classiﬁcation error. In the ﬁrst model we include one covariate. In the second model we include two covariates. about 10 per cent of the data might be set aside as the validation set. We can go even further. Suppose we ﬁt a sequence of logistic regression models. Continuing this way we will get a sequence of 18 classiﬁers of increasing complexity. Then we ﬁt an eleventh model that includes all nine covariates plus the ﬁrst covariate squared and the second covariate squared.

We will now establish a stronger result. 46. That’s what cv. for example. First. Ln (h) converges in almost surely to L(h) by the law of large numbers. P max |Ln (h) − L(h)| > h∈H ≤ 2me−2n . . Assume H is ﬁnite and has m elements. (a) Delete chunk k from the data. (b) Compute the classiﬁer h(k) from the rest of the data.16 Optional Reading: Some Theory on Estimating Classiﬁcation Error (OPTIONAL. . Thus. 3. READ ONLY IF YOU ARE INTERESTED).tree does. h is applied to the validation data to obtain an estimate L of the error rate of h. then. . also called the empirical risk. Xn ∼ Bernoulli(p). . Our goal is to assess how much underestimation is taking place. for any > 0. The data are divided into two groups: the training data and the validation data. This method is useful in the context of empirical risk minimization. Empirical risk minimization means choosing the classiﬁer h ∈ H to minimize the training error Ln (h). The training data are used to produce an estimated classiﬁer h. (c) Use h(k) to the predict the data in chunk k. k=1 (264) 46. Our main tool for this analysis is Hoeffding’s inequality. Then. Let L(k) denote the observed error rate. Let L(h) = 1 K K L(k) . If X1 .18 Example. . . We applied 10-fold cross-validation to the heart disease data. For any ﬁxed h. i (265) Typically. all linear classiﬁers. P (|p − p| > ) ≤ 2e−2n n 2 (266) where p = n−1 i=1 Xi .19 Theorem (Uniform Convergence). Then. Ln (h) underestimates the true error rate L(h) because h was chosen to make Ln (h) small. . h = argminh∈H Ln (h) = argminh∈H 1 n I(h(Xi ) = Yi ) . hm } consists of ﬁnitely many classiﬁers.Training Data T Validation Data V h L Figure 69: Cross-validation. . suppose that H = {h1 . Let H be a set of classiﬁers. 46. 2 160 . Another approach to estimating the error rate is to ﬁnd a conﬁdence interval for Ln (h) using probability inequalities.

This follows from the fact that P(|Ln (h) − L(h)| > ) ≤ ≤ P max |Ln (h) − L(h)| > h∈H 2me−2n 2 = α. . . Now let X1 . The following remarkable theorem bounds the distance between P and Pn . α Then Ln (h) ± is a 1 − α conﬁdence interval for L(h). such as the set of linear classiﬁers. . . . n and > 0. When H is large the conﬁdence interval for L(h) is large. In practice we usually use sets H that are inﬁnite. . . Give a ﬁnite set F = {x1 . One way to develop such a generalization is by way of the Vapnik-Chervonenkis or VC dimension. . P ROOF. Here #(B) denotes the number of elements of a set B. P sup |Pn (A) − P(A)| > 161 ≤ 8s(A. Let = 2 log n 2m . 46. For any P. . n) = max NA (F ) (268) F ∈Fn where Fn consists of all ﬁnite sets of size n. Xn ∼ P and let Pn (A) = 1 n i I(Xi ∈ A) denote the empirical probability measure. The shatter coefﬁcient is deﬁned by s(A. Let A be a class of sets. i=1 Ai ) ≤ P max |Ln (h) − L(h)| > h∈H = ≤ ≤ P h∈H |Ln (h) − L(h)| > H∈H P |Ln (h) − L(h)| > 2e−2n 2 = 2me−2n . 2 H∈H 46. Am is a set of events then m m i=1 P(Ai ). The more functions there are in H the more likely it is we have “overﬁt” which we compensate for by having a larger conﬁdence interval.P( P ROOF. (269) A∈A . n)e−n 2 /32 . .21 Theorem (Vapnik and Chervonenkis (1971)). xn } let NA (F ) = # F A: A∈A (267) be the number of subsets of F “picked out” by A. To extend our analysis to these cases we want to be able to say something like P sup |Ln (h) − L(h)| > h∈H ≤ something not too big. . Now.20 Theorem. We will use Hoeffding’s inequality and we will also use the fact that if A1 . .

Thus. then the shatter coefﬁcients grow as a polynomial in n. 46. If A has ﬁnite VC-dimension v. Then T can’t be picked out. Consider. Let S be a 5 point set. deﬁne A to be the class of sets of the form {x : h(x) = 1}. This is where VC dimension enters. We then deﬁne s(H. for example. n) ≤ nv + 1. uppermost. Then A shatters S = {x. n). If H is a set of classiﬁers. a ∈ R}.The proof. So V C(A) = 3. n) α . Let T be the left and rightmost points. y}. halfspaces in Rd have VC dimension d + 1. One cannot ﬁnd an interval A such that A S = {x. 46. This can’t be picked out.26 Example. 46.28 Theorem. n) = 2n for all n. Let A be the set of closed intervals on the real line.27 Example. or lowermost. though very elegant. In general. So. 46. The VC (Vapnik-Chervonenkis) dimension of a class of sets A is deﬁned as follows. The following theorem shows that if A has ﬁnite VC-dimension. P h∈H sup |Ln (h) − L(h)| > n ≤ 8s(H. Let A be all rectangles on the plane with sides parallel to the axes. a 1 − α conﬁdence interval for the true error rate is L(h) ± where 2 n = 32 log n 8(nd+1 + 1) α . A 1 − α conﬁdence interval for L(h) is Ln (h) ± 2 n where 8s(H. V C(A) = 2. is long and we omit it. n) = s(A. 4 points forming a diamond. 46. Any 3-point set (not all on a line) can be shattered. 46.23 Theorem. Let A = {(−∞. y. If s(A. z}.24 Example. then s(A. rightmost. The VC-dimension of H is d + 1. = 32 log n These theorems are only useful if the shatter coefﬁcients do not grow too quickly with n. If H is a set of classiﬁers we deﬁne V C(H) = V C(A) where A is the class of sets of the form {x : h(x) = 1} as h varies in H. Let x have dimension d and let H be th set of linear classiﬁers. 46. Consider S = {x.25 Example.22 Theorem. 162 . Let T be all points in S except this point. deﬁne V C(A) to be the largest k for which s(A. No 4 point set can be shattered. Hence. So V C(A) = 4. Let A be all linear half-spaces on the plane. Therefore. Other conﬁgurations can also be seen to be unshatterable. 
z} where x < y < z. y} but it cannot shatter sets with 3 points. Any 4 point set can be shattered. the VC-dimension is the size of the largest ﬁnite set F that can be shattered by A meaning that A picks out each subset of F . There is one point that is not leftmost. V C(A) = 1. n)e−n 2 /32 . Otherwise. a]. The A shatters every 1-point set {x} but it shatters no set of the form {x. n) = 2k . set V C(A) = ∞.
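The finite-class confidence interval from Theorem 46.20 is a one-line computation. Here is a base-R sketch of its half-width, epsilon = sqrt(log(2m/alpha)/(2n)), which makes the "larger class, wider interval" point concrete; the function name is illustrative.

```r
# Half-width of the 1 - alpha confidence interval for L(h-hat) when H is
# finite with m classifiers: epsilon = sqrt(log(2m / alpha) / (2n)).
hoeffding_eps <- function(n, m, alpha = 0.05) {
  sqrt(log(2 * m / alpha) / (2 * n))
}

eps1 <- hoeffding_eps(n = 1000, m = 1)      # a single fixed classifier
eps2 <- hoeffding_eps(n = 1000, m = 10000)  # a large finite class
```

Note that the width grows only logarithmically in m, so even a very large (finite) class of classifiers inflates the interval modestly, while the 1/sqrt(n) rate in the sample size dominates.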

47 Graphical Models

Regression models are a special case of a more general class of models called "graphical models." There are two main types of graphical models: undirected and directed.

47.1 Undirected Graphs

Undirected graphs are a method for representing independence relations. An undirected graph G = (V, E) — also called a Markov random field — has a finite set V of vertices (or nodes) and a set E of edges (or arcs) consisting of pairs of vertices. The vertices correspond to random variables X, Y, Z, ..., and edges are written as unordered pairs. For example, (X, Y) in E means that X and Y are joined by an edge. Examples of graphs are in Figure 70 and Figure 71.

Figure 70: A graph with vertices V = {X, Y, Z}. The edge set is E = {(X, Y), (Y, Z)}.

Figure 71: A regression model.

Two vertices are adjacent, written X ~ Y, if there is an edge between them. In Figure 70, X and Y are adjacent but X and Z are not adjacent. A sequence X0, ..., Xn is called a path if X_{i-1} ~ X_i for each i. In Figure 70, X, Y, Z is a path. A graph is complete if there is an edge between every pair of vertices. A subset U of V of vertices together with their edges is called a subgraph. If A, B and C are three distinct subsets of V, we say that C separates A and B if every path from a variable in A to a variable in B intersects a variable in C. In Figure 72, {Y, W} and {Z} are separated by {X}. Also, W and Z are separated by {X, Y}.

Now we relate graphs to probability distributions. Let V be a set of random variables with distribution P. Recall that X1 and X2 are independent — written X1 ⊥ X2 — if p(x1, x2) = p(x1)p(x2). We say that X and Y are conditionally independent given Z — written X ⊥ Y | Z — if p(x, y|z) = p(x|z)p(y|z). Construct a graph with one vertex for each random variable

in V. Omit the edge between a pair of variables if they are independent given the rest of the variables:

    no edge between X and Y <==> X ⊥ Y | rest

where "rest" refers to all the other variables besides X and Y. The resulting graph is called a pairwise Markov graph. The graph encodes a set of pairwise conditional independence relations:

    X ⊥ Y | rest if and only if there is no edge between X and Y.

Figure 72: {Y, W} and {Z} are separated by {X}. Also, W and Z are separated by {X, Y}.

These relations imply other conditional independence relations. How can we figure out what they are? Fortunately, we can read these other conditional independence relations directly from the graph as well, as is explained in the next theorem.

47.1 Theorem. Let G = (V, E) be a pairwise Markov graph for a distribution P. Let A, B and C be distinct subsets of V such that C separates A and B. Then A ⊥ B | C.

If A and B are not connected (i.e., there is no path from A to B), then we may regard A and B as being separated by the empty set, and Theorem 47.1 implies that A ⊥ B.

47.2 Remark. The independence condition in Theorem 47.1 is called the global Markov property. Let us state this more precisely. Given a graph G, let Mpair(G) be the set of distributions which satisfy the pairwise Markov property: thus P in Mpair(G) if, under P, X ⊥ Y | rest if and only if there is no edge between X and Y. Let Mglobal(G) be the set of distributions which satisfy the global Markov property: thus P in Mglobal(G) if, under P, A ⊥ B | C if and only if C separates A and B.

47.3 Theorem. Let G be a graph. Then Mpair(G) = Mglobal(G).

We thus see that the pairwise and global Markov properties are equivalent. Theorem 47.3 allows us to construct graphs using the simpler pairwise property and then deduce other independence relations using the global Markov property. Think how hard this would be to do algebraically. Returning to Figure 76, we now see that X ⊥ Z | Y and Y ⊥ W | Z. Some examples are shown in Figures 73, 74, 75, and 76.

Figure 73: X ⊥ Z | Y.

Figure 74: No implied independence relations.

Figure 75: X ⊥ W | {Y, Z}.

Figure 76: Pairwise independence implies that X ⊥ Z | {Y, W} and Y ⊥ W | {X, Z}. But is X ⊥ Z | Y?

47.4 Example. Figure 77 implies that X ⊥ Y, X ⊥ Z and X ⊥ (Y, Z).

47.5 Example. Figure 78 implies that X ⊥ Z | Y.

Figure 77: X ⊥ Y, X ⊥ Z and X ⊥ (Y, Z).

Figure 78: X ⊥ Z | Y.

A clique is a set of variables in a graph that are all adjacent to each other. A set of variables is a maximal clique if it is a clique and if it is not possible to include another variable and still be a clique. A potential is any positive function. It can be shown that P is Markov to G if and only if its probability function f can be written as

f(x) = (1/Z) ∏_{C ∈ C} φ_C(x_C)    (270)

where C is the set of maximal cliques and

Z = Σ_x ∏_{C ∈ C} φ_C(x_C)

is called the normalizing constant or the partition function. This is called the Hammersley-Clifford theorem. If we define the energy function ψ_C(x_C) = − log φ_C(x_C) then we can write

f(x) ∝ e^{− Σ_C ψ_C(x_C)}.

47.6 Example. The maximal cliques for the graph in Figure 70 are C1 = {X, Y} and C2 = {Y, Z}. Hence, if P is Markov to the graph, then its probability function can be written f(x, y, z) ∝ φ1(x, y) φ2(y, z) for some positive functions φ1 and φ2.

47.7 Example. The maximal cliques for the graph in Figure 79 are {X1, X2}, {X1, X3}, {X2, X4}, {X3, X5} and {X2, X5, X6}. Thus we can write the probability function as

f(x1, . . . , x6) ∝ φ12(x1, x2) φ13(x1, x3) φ24(x2, x4) φ35(x3, x5) φ256(x2, x5, x6).
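The factorization (270) can be made concrete with a toy example. The potential values below are arbitrary positive numbers of my own choosing, not from the notes; the point is that any such choice for the chain of Figure 70 yields a distribution in which X ⊥ Z | Y:

```r
# Normalized f(x,y,z) from clique potentials on {X,Y} and {Y,Z} (binary states).
phi1 = matrix(c(2, 1, 1, 3), 2, 2)   # potential phi1(x, y)
phi2 = matrix(c(1, 4, 2, 1), 2, 2)   # potential phi2(y, z)
f = array(0, c(2, 2, 2))
for(x in 1:2) for(y in 1:2) for(z in 1:2) f[x, y, z] = phi1[x, y]*phi2[y, z]
Z.const = sum(f)                     # the partition function
f = f/Z.const
# Check the conditional independence X and Z given Y, e.g. at y = 1:
cond = f[, 1, ]/sum(f[, 1, ])
all.equal(cond, outer(rowSums(cond), colSums(cond)))  # TRUE: product form
```

Because f(x, z | y) is proportional to φ1(x, y) φ2(y, z), it always factors into a function of x times a function of z, which is exactly conditional independence given Y.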

Figure 79: The maximal cliques of this graph are {X1, X2}, {X1, X3}, {X2, X4}, {X3, X5} and {X2, X5, X6}.

47.8 Example (Images and the Ising Model). Consider an image with pixels taking values Xi ∈ {−1, 1}. Can we construct a probability distribution for images? A common example of such a distribution is the Ising model

p(x) ∝ exp( β Σ_{i ∼ j} x_i x_j )

where the sum is over neighboring pixels. Similar models are used to describe materials that experience phase transitions, such as magnets.
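One standard way to draw samples from an Ising model is Gibbs sampling; the sketch below is illustrative (the grid size, the value of β and the number of sweeps are my own arbitrary choices):

```r
# Gibbs sampler for a small Ising model on an m x m grid (illustrative sketch).
# Each pixel is +1/-1; we repeatedly resample each pixel from its full
# conditional, which depends only on the sum of its four neighbors.
ising.gibbs = function(m = 20, beta = 0.4, nsweeps = 100){
  x = matrix(sample(c(-1, 1), m*m, replace = TRUE), m, m)
  for(sweep in 1:nsweeps){
    for(i in 1:m){
      for(j in 1:m){
        s = 0
        if(i > 1) s = s + x[i-1, j]
        if(i < m) s = s + x[i+1, j]
        if(j > 1) s = s + x[i, j-1]
        if(j < m) s = s + x[i, j+1]
        p = exp(beta*s)/(exp(beta*s) + exp(-beta*s))  # P(x[i,j] = +1 | neighbors)
        x[i, j] = ifelse(runif(1) < p, 1, -1)
      }
    }
  }
  x
}
image(ising.gibbs(beta = 0.8))  # larger beta gives larger aligned patches
```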

47.2 Fitting Undirected Graphs

We will consider these cases:

          Continuous                       Discrete
Small     SIN (Drton-Perlman)              Loglinear models
Large     Lasso (Meinshausen-Buhlmann)     Loglinear lasso

References:

Drton, M. and Perlman, M. (2004). A SINful Approach to Model Selection for Gaussian Concentration Graphs. Biometrika.

Meinshausen, N. and Buhlmann, P. (2005). High Dimensional Graphs and Variable Selection. The Annals of Statistics.

Small Graphs – Continuous Variables. Let X = (X1, . . . , Xd)^T and suppose that X ∼ N_d(µ, Σ). Thus

f(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2)(x − µ)^T Σ^{-1} (x − µ) ).    (271)

Let σ_ij denote the (i,j) element of Σ and let σ^{ij} denote the (i,j) element of Σ^{-1}. Now let V = {1, . . . , d} where vertex j ∈ V corresponds to Xj. Let i, j ∈ V and let R_ij = V − {i, j}. Define the partial correlation

ρ_ij = ( E(Xi Xj | R_ij) − E(Xi | R_ij) E(Xj | R_ij) ) / sqrt( V(Xi | R_ij) V(Xj | R_ij) ).    (272)

A formula for the partial correlation is

ρ_ij = −σ^{ij} / sqrt(σ^{ii} σ^{jj}).    (273)

Furthermore, we have the following result:

σ^{ij} = 0  ⟺  ρ_ij = 0  ⟺  Xi ⊥ Xj | X_{R_ij}.    (274)

Let G = (V, E) be a graph where E = {e_ij} denotes the edge set of the graph G: e_ij = 1 if vertices i and j are connected and e_ij = 0 if vertices i and j are not connected. Define

M(G) = { N(µ, Σ) : e_ij = 0 ⟹ σ^{ij} = 0 }.    (275)

A distribution P ∈ M(G) that satisfies σ^{ij} = 0 but e_ij = 1 is called unfaithful. Let

F(G) = { N(µ, Σ) ∈ M(G) : e_ij = 0 ⟺ σ^{ij} = 0 }.    (276)

Then F(G) ⊂ M(G) are the faithful distributions.

47.9 Example (unfaithfulness). Let δ1, δ2, δ3 ∼ N(0, 1) be independent. Define

X1 = δ1    (277)
X2 = a X1 + δ2    (278)
X3 = b X2 + c X1 + δ3    (279)

where a, b, c are nonzero. Then

Cov(X2, X3) = E(X2 X3) − E(X2) E(X3) = E(X2 X3)
            = E( (a X1 + δ2)(b X2 + c X1 + δ3) )
            = E( (a δ1 + δ2)(b(a δ1 + δ2) + c δ1 + δ3) )
            = (a² b + a c) E(δ1²) + b E(δ2²)
            = a² b + a c + b.

Now suppose that

c = − b(a² + 1)/a.    (287)

Then Cov(X2, X3) = 0. We would like to drop the edge between X2 and X3. But this would imply that X2 ⊥ X3 | X1, which is not true. The unfaithful distributions are a nonlinear subspace of M(G); see Figure 81.

Figure 80: An unfaithful distribution: ρ(X2, X3) = 0 even though b ≠ 0.

Given n random vectors X(1), . . . , X(n) ∼ N(µ, Σ), define the sample covariance matrix

S = (1/n) Σ_{i=1}^n (X(i) − X̄)(X(i) − X̄)^T.    (288)

The sample partial correlation is

r_ij = −s^{ij} / sqrt(s^{ii} s^{jj})    (289)

where {s^{ij}} are the elements of S^{-1}. Next, define

z_ij = (1/2) log( (1 + r_ij)/(1 − r_ij) ).

Then

z_ij ≈ N( ξ_ij, 1/m )    (290)

where

ξ_ij = (1/2) log( (1 + ρ_ij)/(1 − ρ_ij) )    (291)

and m = n − d − 1. More precisely, √m (z_ij − ξ_ij) → N(0, 1) in distribution. Note that ξ_ij = 0 if and only if ρ_ij = 0.    (292)

Now define

I_ij = z_ij ± b/√m    (293)

where b = z_{α/(2k)} and k = d(d − 1)/2. Then {I_ij} are simultaneous confidence intervals for {ξ_ij}:

P( ξ_ij ∈ I_ij for all 1 ≤ i < j ≤ d ) ≥ 1 − Σ_{i<j} P( ξ_ij ∉ I_ij ) = 1 − k (α/k) = 1 − α.

Thus, if 0 ∈ I_ij then the data are compatible with H0: ρ_ij = 0 and we set ê_ij = 0. That is,

ê_ij = 0 if |z_ij| ≤ m^{-1/2} b,  and  ê_ij = 1 if |z_ij| > m^{-1/2} b.    (299)

Let E0 = {ij : e_ij = 0}. Then

P( ê_ij = 0 for all ij ∈ E0 ) ≥ 1 − α.    (300)

It then follows that

P( Ĝ ⊂ G ) ≥ 1 − α    (301)

where Ĝ is the estimated graph. If the distribution is faithful, we further have that

lim inf_{n → ∞} P( Ĝ = G ) ≥ 1 − α.    (302)

To see this, let E1 = {ij : e_ij = 1} and let Λ = min{ |ξ_ij| : ij ∈ E1 }; faithfulness implies that Λ > 0. Then, writing z_ij = ξ_ij + m^{-1/2} ε_ij with ε_ij ≈ N(0, 1),

P( G ⊂ Ĝ ) = P( ê_ij = 1 for all ij ∈ E1 )
           = 1 − P( |z_ij| ≤ m^{-1/2} b for some ij ∈ E1 )
           ≥ 1 − Σ_{ij ∈ E1} P( |z_ij| ≤ m^{-1/2} b )
           = 1 − Σ_{ij ∈ E1} P( |√m ξ_ij + ε_ij| ≤ b )
           ≥ 1 − |E1| P( |√m Λ + N(0, 1)| ≤ b ) → 1.

47.10 Example. We can do this in R using the SIN package.

library(SIN)
data(fowlbones)
attach(fowlbones)
out = sinUG(fowlbones$corr, fowlbones$n)
plotUGpvalues(out)
E = getgraph(out, 0.05)
print(E)

              skull.length skull.breadth humerus ulna femur tibia
skull.length             0             1       0    0     0     0
skull.breadth            1             0       1    0     0     0
humerus                  0             1       0    1     1     0
ulna                     0             0       1    0     0     1
femur                    0             0       1    0     0     1
tibia                    0             0       0    1     1     0

The output for this and for another example (HIV data) are shown in Figures 82 and 83.
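Under the hood, sinUG is essentially carrying out the z-test described above. A bare-bones version, without the refinements in the package and with the thresholding choices hard-coded, might look like:

```r
# Minimal sketch of the simultaneous edge test: sample partial correlations
# from S^{-1}, Fisher z-transform, Bonferroni cutoff b = z_{alpha/(2k)}.
estimate.graph = function(X, alpha = 0.05){
  n = nrow(X); d = ncol(X)
  S = cov(X) * (n - 1)/n                 # the 1/n version used in the notes
  Sinv = solve(S)
  r = -Sinv / sqrt(outer(diag(Sinv), diag(Sinv)))  # sample partial correlations
  z = 0.5 * log((1 + r)/(1 - r))
  m = n - d - 1
  k = d*(d - 1)/2
  b = qnorm(1 - alpha/(2*k))
  E = (abs(z) > b/sqrt(m)) * 1           # keep edge ij when |z_ij| > b/sqrt(m)
  diag(E) = 0
  E
}
```

This requires n reasonably large relative to d, since S must be invertible; the large-d case is treated next.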

Figure 82: SIN p-values for each edge of the fowlbones graph (variables: skull length, skull breadth, humerus, ulna, femur, tibia).

Figure 83: SIN p-values for each edge of the HIV graph (variables: immunoglobin G, immunoglobin A, lymphocyte B, platelet count, lymphocyte T4, T4/T8 lymphocyte ratio).

Large Graphs – Continuous Variables. Estimating Σ^{-1} is not feasible if d is large: S will not be invertible. Instead, we take a different approach. Define the neighborhood ν(a) of node a to be the smallest subset of V − {a} such that X_a ⊥ rest | X_{ν(a)}, where X_{ν(a)} = (X_b : b ∈ ν(a)) and "rest" denotes all the other variables besides X_a and X_{ν(a)}.

Consider predicting X_a from all the other variables. The best predictor corresponds to choosing a vector θ^a ∈ R^d to minimize

E( X_a − Σ_b θ^a_b X_b )²    (311)

subject to θ^a_a = 0. It can be shown that ρ_ab = 0 implies that θ^a_b = 0. Hence,

ν(a) = { b : θ^a_b ≠ 0 }.    (312)

Therefore, finding the neighborhood ν(a) corresponds to doing variable selection when regressing X_a on the other variables. Let θ̂^{a,λ} be the lasso estimator that minimizes

(1/n) Σ_i ( X_i^a − X_{(−a)} θ )² + λ Σ_b |θ_b|    (313)

where X_{(−a)} is the design matrix (omitting X_a). Then set

ν̂(a) = { b : θ̂^{a,λ}_b ≠ 0 }.    (314)

Cross-validation does not lead to a good choice of λ in this case. Instead, we can use

λ = (2 σ̂_a / √n) Φ^{-1}( 1 − α/(2 d²) ).    (315)

We have the following result: if d = n^γ and we have sparsity, max_a |ν(a)| ≤ n^κ for some 0 ≤ κ < 1,    (316)
and if λ ≈ n^{-β} for an appropriate β, then, for all a,

P( ν̂(a) = ν(a) ) → 1    (317)

as n → ∞. Let C(a) denote all nodes connected to a by some path. In this case, we also have

P( Ĉ(a) ⊂ C(a) for all a ) ≥ 1 − α.    (318)

In practice, I use a two-stage lasso: stage 1 uses Cp to choose a set of variables; at stage 2 I do least squares on these variables, test them at level α/p² and retain the significant ones. Finally, to estimate the graph we take

Ê = { (a, b) : a ∈ ν̂(b) and b ∈ ν̂(a) }.    (319)

#### Example
n = 50
p = 60
X = matrix(rnorm(n*p), n, p)
for(i in 1:(p/2)){
  X[,2*i] = 5*X[,2*i-1] + rnorm(n, 0, 1)
}

Figure 84: A big graph. See Figure 84.

Small Graphs – Discrete Variables. Let X = (X1, . . . , Xd) be a discrete random vector with probability function f(x) = P(X = x) = P(X1 = x1, . . . , Xd = xd) where x = (x1, . . . , xd). Let rj be the number of values that Xj takes. Without loss of generality, we can assume that Xj ∈ {0, 1, . . . , rj − 1}. Suppose now that we have n such random vectors. The data can be represented as counts in an r1 × r2 × · · · × rd table, so we can think of the data as a sample from a Multinomial with N = r1 × r2 × · · · × rd categories. Let p = (p1, . . . , pN) denote the multinomial parameter. There are many tools for this case. We shall use log-linear models.

Given a vector x = (x1, . . . , xd) and a subset A ⊂ V = {1, . . . , d}, let xA = (xj : j ∈ A). For example, if A = {1, 3} then xA = (x1, x3). Recall that we can write f in terms of its maximal cliques:

f(x) ∝ ∏_A φ_A(x_A).    (320)

The log-linear representation is obtained by taking the logarithm.

47.11 Theorem. The joint probability function f(x) of a single random vector X = (X1, . . . , Xd) can be written as

log f(x) = Σ_{A ⊂ V} ψ_A(x)    (321)

where the sum is over all subsets A of V = {1, . . . , d} and the ψ's satisfy the following conditions:

1. ψ_∅(x) is a constant.
2. For every A ⊂ V, ψ_A(x) is only a function of x_A and not the rest of the xj's.
3. If i ∈ A and Xi = 0, then ψ_A(x) = 0.

Each ψ_A(x) may depend on some unknown parameters β_A. Let β = (β_A : A ⊂ V) be the set of all these parameters. We will write f(x) = f(x; β) when we want to emphasize the dependence on the unknown parameters β. This is easiest to understand if all the Xi's are binary. In that case

log f(x) = β_0 + Σ_j β_j x_j + Σ_{j<k} β_{jk} x_j x_k + Σ_{j<k<l} β_{jkl} x_j x_k x_l + · · ·    (322)

so that ψ_A(x) = β_A ∏_{i ∈ A} x_i.

The next theorem gives an easy way to check for conditional independence in a log-linear model.

47.12 Theorem. Let (X_a, X_b, X_c) be a partition of the vector (X1, . . . , Xd). Then X_b ⊥ X_c | X_a if and only if all the ψ-terms in the log-linear expansion that have at least one coordinate in b and one coordinate in c are 0.

47.13 Example. In the model log f(x) = β_0 + β_1 x1 + β_2 x2 + β_12 x1 x2 we have that X1 ⊥ X2 if and only if β_12 = 0. In the model

log f(x) = β_0 + β_1 x1 + β_2 x2 + β_3 x3 + β_12 x1 x2 + β_13 x1 x3    (323)

we have that X2 ⊥ X3 | X1 since 2 and 3 never appear together, that is, β_23 = β_123 = 0.

Let log f(x) = Σ_{A ⊂ S} ψ_A(x) be a log-linear model. Then f is graphical if all ψ-terms are nonzero except for any pair of coordinates not in the edge set for some graph G. In other words, ψ_A(x) = 0 if and only if {i, j} ⊂ A and (i, j) is not an edge. Here is a way to think about the definition: if you can add a term to the model and the graph does not change, then the model is not graphical.

47.14 Example. Consider the graph in Figure 85. The graphical log-linear model that corresponds to this graph is

log f(x) = ψ_∅ + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_4(x) + ψ_5(x) + ψ_12(x) + ψ_23(x) + ψ_25(x) + ψ_34(x) + ψ_35(x) + ψ_45(x) + ψ_235(x) + ψ_345(x).

Figure 85: Graph for Example 47.14.

Let's see why this model is graphical. The edge (1, 5) is missing in the graph, so we omit the edge between X1 and X5: no ψ terms contain (1, 5), so ψ_15, ψ_125, ψ_135, ψ_145, ψ_1235, ψ_1245, ψ_1345, ψ_12345 are all omitted. Similarly, the edge (2, 4) is missing and hence ψ_24, ψ_124, ψ_234, ψ_245, ψ_1234, ψ_2345 are all omitted. There are other missing edges as well; you can check that the model omits all the corresponding ψ terms. The presence of the three-way interaction ψ_235 means that the strength of association between X2 and X3 varies as a function of X5; its absence indicates that this is not so.

Now consider the model

log f(x) = ψ_∅(x) + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_4(x) + ψ_5(x) + ψ_12(x) + ψ_23(x) + ψ_25(x) + ψ_34(x) + ψ_35(x) + ψ_45(x).

This is the same model except that the three-way interactions were removed. If we draw a graph for this model, we will get the same graph. But this model is not graphical since it has extra terms omitted. This is not a bad thing. It just means that if we are only concerned about presence or absence of conditional independences, we need not consider such a model.

There is a set of log-linear models that is larger than the set of graphical models and that is used quite a bit: the hierarchical log-linear models. A log-linear model is hierarchical if ψ_A = 0 and A ⊂ B implies that ψ_B = 0.

47.15 Lemma. A graphical model is hierarchical, but the reverse need not be true.

47.16 Example. Let log f(x) = ψ_∅(x) + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_12(x) + ψ_13(x). The model is hierarchical. The model is also graphical because all terms involving (2, 3) are omitted; its graph is given in Figure 86.

Figure 86: Graph for Example 47.16.

47.17 Example. Let log f(x) = ψ_∅(x) + ψ_1(x) + ψ_2(x) + ψ_3(x) + ψ_12(x) + ψ_13(x) + ψ_23(x). The model is hierarchical. It is not graphical because ψ_123(x) = 0, which does not correspond to any pairwise conditional independence: the graph corresponding to this model is complete. The independencies and graphs for the models in Examples 47.16 and 47.17 differ, but a non-graphical hierarchical model and its graphical counterpart share the same graph; the former simply has other constraints besides conditional independence constraints.
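The hierarchical property is mechanical enough to check by program. In the sketch below a model is represented as a list of index vectors, one per nonzero ψ term (this representation is my own convention, not from the notes):

```r
# A model is a list of integer vectors: the subsets A with psi_A != 0
# (the empty set is implicit). Hierarchical means every nonempty subset
# of an included term is also included; checking each term's maximal
# proper subsets suffices, since those terms get checked in turn.
is.hierarchical = function(terms){
  key = sapply(terms, function(t) paste(sort(t), collapse = "."))
  for(t in terms){
    if(length(t) > 1){
      for(i in seq_along(t)){
        sub = paste(sort(t[-i]), collapse = ".")
        if(!(sub %in% key)) return(FALSE)   # a required lower-order term is missing
      }
    }
  }
  TRUE
}
is.hierarchical(list(1, 2, 3, c(1,2), c(1,3)))  # TRUE  (Example 47.16)
is.hierarchical(list(3, c(1,2)))                # FALSE (psi_2 = 0 but psi_12 != 0)
```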

Hierarchical models can be written succinctly using generators. This is the mutual independence model.2 which has log-linear expansion log f = ψ∅ + ψ1 + ψ2 + ψ12 .X1 X2 X3 Figure 87: The graph is complete. Finally. The loglikelihood for β is n (β) = i=1 ¡ ¡ ¡ log f (X i . X2 . X3 ). β) 178 . Let β denote all the parameters in a log-linear model M .2 + 1. it is not graphical either. A summary of the models is in Figure 89.3 is the saturated model log f = ψ∅ + ψ1 + ψ2 + ψ3 + ψ12 + ψ13 + ψ23 + ψ123 . 47. The model is hierarchical but not graphical. Since it is not hierarchical. This model is not hierarchical since ψ 2 = 0 but ψ12 is not. Consider M = 1 + 2 + 3 which means log f = ψ∅ + ψ1 + ψ2 + ψ3 .3 says: “include ψ12 and ψ13 .18 Example. The formula M = 1. The saturated models corresponds to ﬁtting an unconstrained multinomial. Then. This model makes X3 |X2 = x2 . Suppose that X = (X1 . X1 = x1 a uniform distribution. This is most easily explained by example. X1 X2 X3 Figure 88: The model for this graph is not hierarchical. The graph corresponding is in Figure 88.2 + 1.2. M = 1. consider M = 1. Let log f (x) = ψ∅ (x) + ψ3 (x) + ψ12 (x).3 stands for log f = ψ∅ + ψ1 + ψ2 + ψ3 + ψ12 + ψ13 .” We have to also include the lower order terms or it won’t be hierarchical. The generator M = 1.

and graphical models.loglinear hierarchical graphical Figure 89: Loglinear models. 179 . hierarchical models.

4).4).16) beta = 1 f = beta*x1*x2 + beta*x2*x3 + beta*x3*x4 + beta*x4*x5 f = exp(f) f = f/sum(f) n = 1000 180 . However. Now we choose the model M which maximizes AIC(M ) = (M ) − |M | or BIC(M ) = (M ) − |M | log n 2 (327) where |M | is the number of parameters in model M and (M ) is the value of the log-likelihood evaluated at the MLE for that model.8).16). For any submodel M . Every model that is not rejected by this test is then considered a plausible model. When ﬁtting log-linear models.rep(1.rep(1.rep(1. Usually the model search is restricted to hierarchical models. dev(M ) → χ2 with ν degrees of freedom equal to the difference in the number of parameters between the ν saturated model and M .0. (328) d x1 = c(rep(0. .rep(1. The result is that we end up with a bad model just due to low power. First. Let’s create a model of the form log f (x) = β12 x1 x2 + β23 x2 x3 + β34 x3 x4 + β45 x4 x5 . . one has to address the following model selection problem: which ψ terms should we include in the model? This is essentially the same as the model selection problem in linear regression. Different models correspond to setting different ψ terms to 0.1).8).4).rep(0.rep(1.20 Example.1. Now for each M we test the hypothesis H0 : the true model is M versus H1 : the true model is Msat . The MLE β generally has to be found numerically. But we might fail to reject H0 due to low power. this is not a good strategy for two reasons. Under H0 . This reduces the search space. The model that includes all possible ψ-terms is called the saturated model and we denote it by Msat .8)) x3 = c(rep(0. .4).rep(1. 47. After ﬁnding a “best model” this way we can draw the corresponding graph. deﬁne the deviance dev(M ) by dev(M ) = 2( sat − M) M where sat is the log-likelihood of the saturated model evaluated at the MLE and M evaluated at its MLE.1). One way to ﬁnd a good model is to use the deviance to test every sub-model.rep(1.rep(0. Xd ) as give by equation (321). 47.i i where f (X i . 
where f(X^i; β) is the probability function for the ith random vector X^i = (X^i_1, . . . , X^i_d) as given by equation (321). Let M denote some log-linear model. The MLE β̂ generally has to be found numerically. The Fisher information matrix is also found numerically, and we can then get the estimated standard errors from the inverse Fisher information matrix.

When fitting log-linear models, one has to address the following model selection problem: which ψ terms should we include in the model? This is essentially the same as the model selection problem in linear regression. Different models correspond to setting different ψ terms to 0. Usually the model search is restricted to hierarchical models; this reduces the search space. The model that includes all possible ψ-terms is called the saturated model and we denote it by Msat.

One approach is to use AIC or BIC. We choose the model M which maximizes

AIC(M) = ℓ(M) − |M|   or   BIC(M) = ℓ(M) − (|M|/2) log n    (327)

where |M| is the number of parameters in model M and ℓ(M) is the value of the log-likelihood evaluated at the MLE for that model.

A different approach is based on hypothesis testing. For each M we test the hypothesis H0: the true model is M, versus H1: the true model is Msat. The likelihood ratio test statistic for this hypothesis is called the deviance. For any submodel M, define the deviance dev(M) by

dev(M) = 2( ℓ_sat − ℓ_M )    (328)

where ℓ_sat is the log-likelihood of the saturated model evaluated at its MLE and ℓ_M is the log-likelihood of the model M evaluated at its MLE.

47.19 Theorem. Under H0, dev(M) → χ²_ν with degrees of freedom ν equal to the difference in the number of parameters between the saturated model and M.

One way to find a good model is to use the deviance to test every sub-model; every model that is not rejected by this test is then considered a plausible model. However, this is not a good strategy for two reasons. First, we will end up doing many tests, which means that there is ample opportunity for making Type I and Type II errors. Second, we will end up using models where we failed to reject H0, and we might fail to reject H0 due to low power. The result is that we end up with a bad model just due to low power.

47.20 Example. Let's create a model of the form

log f(x) = β_12 x1 x2 + β_23 x2 x3 + β_34 x3 x4 + β_45 x4 x5.

x1 = rep(c(0,1), each=16)
x2 = rep(rep(c(0,1), each=8), 2)
x3 = rep(rep(c(0,1), each=4), 4)
x4 = rep(rep(c(0,1), each=2), 8)
x5 = rep(c(0,1), 16)
beta = 1
f = beta*x1*x2 + beta*x2*x3 + beta*x3*x4 + beta*x4*x5
f = exp(f)
f = f/sum(f)
n = 1000

y = sample(1:32, size=n, prob=f, replace=TRUE)
X1 = X2 = X3 = X4 = X5 = NULL
for(i in 1:n){
  X1 = c(X1, x1[y[i]])
  X2 = c(X2, x2[y[i]])
  X3 = c(X3, x3[y[i]])
  X4 = c(X4, x4[y[i]])
  X5 = c(X5, x5[y[i]])
}
out = glm(table(y) ~ x1*x2*x3*x4*x5, family=poisson)
summary(out)
tmp = step(out, k=log(n))
summary(tmp)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.79711    0.18430   9.751  < 2e-16 ***
x1          -0.09482    0.17731  -0.535    0.593
x2          -0.03039    0.19359  -0.157    0.875
x3          -0.07969    0.17486  -0.456    0.649
x4           0.05609    0.13677   0.410    0.682
x5           0.06062    0.14220   0.426    0.670
x1:x2        1.05860    0.15870   6.670 2.55e-11 ***
x2:x3        0.95163    0.17633   5.397 6.77e-08 ***
x3:x4        1.02836    0.17916   5.740 9.46e-09 ***
x4:x5        0.84435    0.16218   5.206 1.93e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Nailed it!

48 Directed Graphs

A directed graph consists of a set of nodes with arrows between some nodes. An example is shown in Figure 90. Formally, a directed graph G consists of a set of vertices V and an edge set E of ordered pairs of vertices. If (X, Y) ∈ E then there is an arrow pointing from X to Y. For our purposes, each vertex will correspond to a random variable.

Figure 90: A directed graph with vertices V = {X, Y, Z} and edges E = {(Y, X), (Y, Z)}.

If an arrow connects two variables X and Y (in either direction) we say that X and Y are adjacent. If there is an arrow from X to Y then X is a parent of Y and Y is a child of X. The set of all parents of X is denoted by π_X or π(X). A directed path between two variables is a set of arrows all pointing in the same direction linking one variable to the other, such as X → Y. A sequence of adjacent vertices starting with X and ending with Y, but ignoring the direction of the arrows, is called an undirected path. The sequence {X, Y, Z} in Figure 90 is an undirected path. X is an ancestor of Y if there is a directed path from X to Y (or X = Y); we also say that Y is a descendant of X.

Figure 91: DAG for Example 48.1.

A configuration of the form

X → Y ← Z

is called a collider at Y. A configuration not of that form, such as X → Y → Z or X ← Y → Z, is called a non-collider. The collider property is path dependent: in Figure 96, Y is a collider on the path {X, Y, Z} but it is a non-collider on the path {X, Y, W}. When the variables pointing into the collider are not adjacent, we say that the collider is unshielded.

A directed path that starts and ends at the same variable is called a cycle. A directed graph is acyclic if it has no cycles. In this case we say that the graph is a directed acyclic graph or DAG. From now on, we only deal with acyclic graphs.

Let G be a DAG with vertices V = (X1, . . . , Xk). If P is a distribution for V with probability function f, we say that P is Markov to G, or that G represents P, if

f(v) = ∏_{i=1}^k f(Xi | π_i)    (329)

where π_i are the parents of Xi. The set of distributions represented by G is denoted by M(G).

48.1 Example. Figure 91 shows a DAG with four variables. The probability function for this example factors as

f(overweight, smoking, heart disease, cough) = f(overweight) × f(smoking) × f(heart disease | overweight, smoking) × f(cough | smoking).

48.2 Example. For the DAG in Figure 92, P ∈ M(G) if and only if its probability function f has the form

f(x, y, z, w) = f(x) f(y) f(z | x, y) f(w | z).
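The factorization (329) is easy to compute with. In the sketch below the conditional probabilities for the DAG of Figure 91 are invented for illustration (the numbers are my assumptions, not from the notes); the structure of the product is the point:

```r
# Joint probability for (overweight, smoking, heart disease, cough), each 0/1,
# built from the DAG factorization of Example 48.1 with made-up conditionals.
p.over = 0.3; p.smoke = 0.25
p.heart = function(o, s) 0.05 + 0.3*o + 0.4*s   # P(heart = 1 | overweight, smoking)
p.cough = function(s) 0.1 + 0.6*s               # P(cough = 1 | smoking)
joint = function(o, s, h, co){
  f = function(p, x) ifelse(x == 1, p, 1 - p)
  f(p.over, o) * f(p.smoke, s) * f(p.heart(o, s), h) * f(p.cough(s), co)
}
g = expand.grid(o = 0:1, s = 0:1, h = 0:1, co = 0:1)
sum(mapply(joint, g$o, g$s, g$h, g$co))  # the 16 probabilities sum to 1
```

Only 1 + 1 + 4 + 2 conditional probabilities are needed here, versus 15 free parameters for an unconstrained joint on 16 cells; that economy is exactly what the factorization buys.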

Figure 92: Another DAG.

The following theorem says that P ∈ M(G) if and only if the Markov Condition holds. Roughly speaking, the Markov Condition means that every variable W is independent of the "past" given its parents.

48.3 Theorem. A distribution P ∈ M(G) if and only if the following Markov Condition holds: for every variable W,

W ⊥ W̃ | π_W    (330)

where W̃ denotes all the other variables except the parents and descendants of W.

In Figure 92, for example, the Markov Condition implies that X ⊥ Y and W ⊥ {X, Y} | Z.

48.4 Example. Consider the DAG in Figure 93. The Markov Condition implies the following independence relations:

D ⊥ A | {B, C},   E ⊥ {A, B, C} | D   and   B ⊥ C | A.

In this case the probability function must factor like

f(a, b, c, d, e) = f(a) f(b|a) f(c|a) f(d|b, c) f(e|d).

Figure 93: Yet another DAG.

48.5 Example. Consider the DAG in Figure 94. The Markov Condition implies, among other relations,

X3 ⊥ X4 | {X1, X2},   X4 ⊥ {X2, X3} | X1   and   X5 ⊥ {X1, X2} | {X3, X4}.

It turns out (but it is not obvious) that these conditions imply that {X4, X5} ⊥ X2 | {X1, X3}.

The Markov Condition allows us to list some independence relations implied by a DAG. These relations might imply other independence relations. How do we find these extra independence relations? The answer is "d-separation," which means "directed separation." d-separation can be summarized by three rules. Consider the four DAGs in Figure 95 and the DAG in Figure 96. The first three DAGs in Figure 95 have no colliders; the DAG in the lower right of Figure 95 has a collider. The DAG in Figure 96 has a collider with a descendant.

X Y Z X Y Z X Y Z X Y Z Figure 95: The ﬁrst three DAG’s have no colliders.X2 X3 X1 X5 X4 Figure 94: And yet another DAG. 187 . The fourth DAG in the lower right corner has a collider at Y . X Y Z W Figure 96: A collider with a descendant.

The Rules of d-Separation. Consider the DAGs in Figures 95 and 96.

1. When Y is not a collider, X and Z are d-connected, but they are d-separated given Y.

2. If X and Z collide at Y, then X and Z are d-separated, but they are d-connected given Y.

3. Conditioning on the descendant of a collider has the same effect as conditioning on the collider. Thus in Figure 96, X and Z are d-separated, but they are d-connected given W.

Here is a more formal definition of d-separation. Let X and Y be distinct vertices and let W be a set of vertices not containing X or Y. Then X and Y are d-separated given W if there exists no undirected path U between X and Y such that (i) every collider on U has a descendant in W, and (ii) no other vertex on U is in W. If A, B, and W are distinct sets of vertices and A and B are not empty, then A and B are d-separated given W if for every X ∈ A and Y ∈ B, X and Y are d-separated given W. Sets of vertices that are not d-separated are said to be d-connected.

48.6 Example. Consider the DAG in Figure 97. From the d-separation rules we conclude that: X and Y are d-separated (given the empty set); X and Y are d-connected given {S1, S2}; X and Y are d-separated given {S1, S2, V}.

Figure 97: d-separation explained.

48.7 Theorem. Let A, B, and C be disjoint sets of vertices. Then A ⊥ B | C if and only if A and B are d-separated by C.⁸

⁸ We implicitly assume that P is faithful to G, which means that P has no extra independence relations other than those logically implied by the Markov Condition.

48.8 Example. The fact that conditioning on a collider creates dependence might not seem intuitive. Here is a whimsical example from Jordan (2003) that makes this idea more palatable. Your friend appears to be late for a meeting with you. There are two explanations: she was abducted by aliens, or you forgot to set your watch ahead one hour for daylight savings time. (See Figure 98.) Aliens and Watch are blocked by a collider, which implies they are marginally independent. Learning that your friend is late certainly increases the probability that she was abducted: we would expect that

P(Aliens = yes | Late = yes) > P(Aliens = yes).

But when we learn that you forgot to set your watch properly, we would lower the chance that your friend was abducted. Thus Aliens and Watch are dependent given Late.

48.9 Example. Consider the DAG in Figure 91. In this example, overweight and smoking are marginally independent, but they are dependent given heart disease.

Graphs that look different may actually imply the same independence relations. Given a DAG G, we let I(G) denote all the independence statements implied by G. Two DAGs G1 and G2 for the same variables V are Markov equivalent if I(G1) = I(G2). Given a DAG G, let skeleton(G) denote the undirected graph obtained by replacing the arrows with undirected edges.
The fact that conditioning on a collider creates dependence might not seem intuitive. Your friend appears to be late for a meeting with you.X U V W Y S1 S2 Figure 97: d-separation explained. we would lower the chance that your friend was abducted. S2 }. In this example. then X and Z are d-separated. let skeleton(G) denote the undirected graph obtained by replacing the arrows with undirected edges. 48. B. 48. When Y is not a collider. Then A B | C if and only if A and B are d-separated 48. learning that your friend is late certainly increases the probability that she was abducted. From the d-separation rules we conclude that: X and Y are d-separated (given the empty set). 48. 188 .) Aliens and Watch are blocked by a collider which implies they are marginally independent. (See Figure 98. Aliens and Watch are dependent given Late. 1. Consider the DAG in Figure 91. we let I(G) denote all the independence statements implied by G. S2 . If X and Z collide at Y . Sets of vertices that are not d-separated are said to be d-connected. Let X and Y be distinct vertices and let W be a set of vertices not containing X or Y .

Two DAGs G1 and G2 are Markov equivalent if and only if (i) skeleton(G1 ) = skeleton(G2 ) and (ii) G1 and G2 have the same unshielded colliders. The DAG in the lower right of the Figure is not Markov equivalent to the others.11 Example. Was your friend kidnapped by aliens or did you forget to set your watch? 48. 48.8).10 Theorem.aliens watch late Figure 98: Jordan’s alien example (Example 48. The ﬁrst three DAGs in Figure 95 are Markov equivalent. 189 .

ability. i j if and only if i ≤ j. QFJ (quality of ﬁrst job). GPQ (graduate program quality). This example. Here. .o] > p = sinDAG(1:7. Without loss of generality. These are essentially time ordered. For the sparse graphs. . An example is time order.3. assume that V = {1. . . the PC algorithm due to Spirtes. .2. and cites (citation rates). Thus. Glymour and Scheines is the fastest algorithm I know of.1 Example. pubs (publication rates).49 Estimation for DAGs Estimating a DAG structure is harder than an undirected graph. For continuous.6) > ### sex < ability < GPQ < pre < QFJ < pubs < cites > m = m[o. we will consider the simpler case where there is a known ordering on the variables. preprod (preliminary measure of productivity). ← 49.ps". ρij denotes the partial correlation of Xi and Xj given {1. . j} − {i.1.horizontal=FALSE) > data(pubprod) > attach(pubprod) > m = pubprod$cor > print(dimnames(m)[[1]]) [1] "ability" "GPQ" "preprod" "QFJ" "sex" > n = pubprod$n > o = c(5.7. j}. and small graphs we can use SIN. . The variables are: sex.n) > plotDAGpvalues(p) (331) "cites" "pubs" 190 . d} has been ordered according to .4.m. Gaussian variables. from Spirtes et al (2000) involves data on publishing productivity. . > library(SIN) > postscript("dagsin. For every i < j we test H0 : ρij = 0 versus H1 : ρij = 0. ← ← Here.

use the lasso.type="DAG") > print(G) sex ability GPQ preprod QFJ pubs cites sex 0 0 0 0 0 1 0 ability 0 0 1 1 0 0 0 GPQ 0 0 0 0 1 0 0 preprod 0 0 0 0 0 0 1 QFJ 0 0 0 0 0 1 0 pubs 0 0 0 0 0 0 1 cites 0 0 0 0 0 0 0 The p-values are plotted in Figure 99. X3 etc. X2 regress X4 on X1 .05. An alternative is simply to do the following: regress X2 on X1 regress X3 on X1 .> G = getgraph(p. 191 .. X2 . when d is large. One could test for signiﬁcant effects or.

3 0.0 1 2 3 4 5 6 7 sex ability GPQ preprod QFJ pubs cites 0.1 0.2 0.6 P−value 0.0 1−>3 2−>3 1−>4 1−>2 2−>4 3−>4 3−>6 4−>6 5−>6 1−>7 2−>7 3−>7 4−>7 5−>7 2−>5 3−>5 4−>5 1−>6 Edge Figure 99: p-values for DAG example 192 2−>6 6−>7 1−>5 .5 0.8 0.1.9 0.7 0.4 0.

3.3.col=3) } (a) Comment every line of this little program (say what each line is doing). each person has a diﬀerent average lung capacity µi but µi does not depend on X. Problem 2. Suppose that we measure lung capacity Y on individuals in a population. Thus.cex. yi (x) = µi . 2. 7. We are interested in whether exposure to pollution X reduces lung capacity Y . Suppose. Show that R(r) ≤ R(g) as claimed in page 1 of the notes.y. A simulation experiment. 0.1). 0.col="black". In R. Problem 2. lwd=3. type: par(mfrow=c(1. sd) abline(lm(y~x). lty=2. 6. Problem 2. 8.bg="pink". Show 193 .type="l". Use R for all calculations. lwd=0.Homework 1 1. Association Versus Causation. 10) y = 5 + 10*x + rnorm(100. (b) Record your conclusions. Do not turn this question in.1.13) y = 5 + 10 * x plot(x. Let the lung capacity of person i be µi .5. that there is no eﬀect.lwd=3) sd = 10 for(i in 1:50){ x = runif(100. In terms of counterfactuals.lab=2) lines(x. Problem 2.y. for the sake of this question. Download the R handouts from the web site and practice R.fg="blue") x = c(-3.10. 4.2. 5.

that the causal regression c(x) does not depend on x. the slope of c(x) is 0. . Y1 ). Assume that n = 25 and that µi = i. . 25. 1). . Yn ) where Yi = yi (Xi ) + i where i ∼ N (0. . Now we observe n people and we get the data: (X1 . Note: All the data sets from Weisberg’s book are available on the course web 194 . Suppose that Xi = 25 − µi . Generate data from the model. . Plot the data and ﬁt a ˆ regression model. . . In particular. . Produce a plot showing yi (x) for i = 1. Explain why you get β ≈ −1 even though the slope of c(x) is zero. (Xn .

(a) Find the bias. (7) 4. (c) Describe the column space. (e) Find the mean and variance of β assuming the model is not correct.3. 1) and we use Y = θ as the prediction.5. (3) Let Y ∼ N (θ. 1). (c) Suppose we predict a new observation Y ∗ ∼ N (θ.1 parts 1 and 3.5 parts 1 and 2.3. (a) Find the least squares estimator of β. You also need to perform a nonparametric ﬁt (or smoother) which we haven’t yet discussed. (5) Problem 3. However. (6) 4. (b) Find an expression for the hat matrix. variance and mean-squared error of θ. all you need to do is: 195 . show that E(θ − θ)2 = bias2 + variance. (b) What value of a minimizes the mean squared error. (8) 4. Let θ = aY where a is a constant. (10) 5.1. You need to use the bootstrap. (d) Find the mean and variance of β assuming the model is correct.Homework 2 (1) Suppose we ﬁt the no-intercept model Yi = βXi + i . (4) Problem 3. (9) 5. which is fully described in the question. Find the prediction error.8. (2) If θ is an estimate of a parameter θ. (see page 89).

The quantity G is obtained this way:

linfit = lm(y ~ x)
sigma  = summary(linfit)$sigma
nonpar = loess(y ~ x)
G      = sum((fitted(linfit) - fitted(nonpar))^2)/sigma^2
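The identity in problem (2) can also be checked numerically; a sketch using the estimator θ̂ = aY from problem (3), with θ and a chosen arbitrarily:

```r
# Monte Carlo check that E(thetahat - theta)^2 = bias^2 + variance
# for thetahat = a*Y with Y ~ N(theta, 1).
set.seed(2)
theta = 3; a = 0.8
y = rnorm(1e5, theta, 1)
thetahat = a*y
mse   = mean((thetahat - theta)^2)
bias2 = (a*theta - theta)^2      # squared bias: (E[thetahat] - theta)^2
vr    = a^2                      # Var(a*Y) = a^2 * Var(Y) = a^2
c(mse = mse, bias2_plus_var = bias2 + vr)
```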

Homework 3

(1) In class we said that the minimizer of Q(β) = (Y − β)^2 + λ|β| is given by the soft thresholding estimator β̂ = sign(Y)(|Y| − λ/2)+ where (a)+ = a if a > 0 and 0 if a ≤ 0. Show that this is true. We could try taking the derivative of Q with respect to β and setting it equal to 0, but Q(β) is not differentiable. We'll do this in steps.

(a) First, show that Q is convex in β. That is, show that

Q(tβ1 + (1 − t)β2) ≤ tQ(β1) + (1 − t)Q(β2).

Since Q is convex, for any β there is a line passing through (β, Q(β)) that never goes above Q. The slope of such a line is called a subderivative. Formally, z is a subderivative of Q at β0 if

Q(β) − Q(β0) ≥ z(β − β0).

The set of subderivatives is called the subdifferential. A point β is a minimum of Q if and only if zero is contained in the subdifferential.

(b) Show that the subdifferential of Q is Q′(β) = −2(Y − β) + z(β)λ where z(β) = sign(β) if β ≠ 0 and z(β) ∈ [−1, 1] if β = 0. Note that Q′(β) is a number if β ≠ 0 but Q′(β) is a set of numbers if β = 0.

(c) Now show that β̂ = sign(Y)(|Y| − λ/2)+.

(2) Problem 7.4.

(3) (a) Let A be an m × n matrix and let B be an n × m matrix. Show that trace(AB) = trace(BA).
(b) Let H be the hat matrix. Show that trace(H) = q.

(4) Problem 8.3 (but ignore anything about the "score test").

(5) Problem 9.1.

(6) Problem 9.3.
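A quick numerical sanity check of the soft-thresholding formula in problem (1); the values of Y and λ below are arbitrary:

```r
# Verify that the minimizer of Q(beta) = (Y - beta)^2 + lambda*|beta|
# matches the soft-thresholding formula sign(Y)*(|Y| - lambda/2)_+.
Y = 1.3; lambda = 1
Q = function(b) (Y - b)^2 + lambda*abs(b)
bhat = optimize(Q, interval = c(-5, 5))$minimum
soft = sign(Y)*max(abs(Y) - lambda/2, 0)   # = 0.8 here
c(numeric = bhat, formula = soft)
```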

(7) Problem 9.8.

(8) Get the highway accident data highway.txt. Do model selection using: forward stepwise, ridge regression, and the lasso. Summarize your results.

(9) Suppose that Yi ∼ N(µi, σ^2), i = 1, ..., p. Assume that σ^2 is known. Define µ̂ = (µ̂1, ..., µ̂p) where µ̂ is obtained by minimizing

Σ_{i=1}^p (Yi − µi)^2 + λJ(µ).

(a) Find µ̂ for the following three cases:
J(µ) = ||µ||_0 = #{i : µi ≠ 0},
J(µ) = ||µ||_1 = Σ_i |µi|,
J(µ) = ||µ||_2^2 = Σ_i µi^2.

(b) For the case J(µ) = Σ_i µi^2, find the bias, variance and risk of µ̂ as a function of λ. Find the value of λ that minimizes the risk.

(c) The "elastic net" estimator minimizes

Σ_{i=1}^p (Yi − µi)^2 + λ1 Σ_i |µi| + λ2 Σ_i µi^2.

Find an expression for the estimator.
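For the case J(µ) = Σ µi^2 in problem (9), the problem separates across coordinates, and the coordinatewise solution is easy to confirm numerically; a sketch with arbitrary values:

```r
# For J(mu) = sum(mu_i^2), each coordinate solves
# min_m (y - m)^2 + lambda*m^2, whose minimizer is m = y/(1 + lambda).
lambda = 2; y = 1.5
f = function(m) (y - m)^2 + lambda*m^2
mhat = optimize(f, interval = c(-5, 5))$minimum
c(numeric = mhat, closed_form = y/(1 + lambda))
```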

Homework 4

(1) Use the data for question 12.5. Fit a logistic regression. Plot the data with the regression line. Use stepwise regression to choose a submodel. Plot the residuals.

(2) Question 12.1.

(3) Question 12.2.

(4) Generate data as follows:

Xi ∼ N(0, 1), εi ∼ N(0, σ^2), Ui ∼ N(0, 1),
Yi = 3 + 4Xi + εi, Wi = Xi + Ui, i = 1, ..., n,

with n = 1000 and σ = 2.

(a) Regress the Yi's on the Xi's. Plot the data with the regression line.
(b) Regress the Yi's on the Wi's. How do the two regressions compare?
(c) Use the method we discussed in class for correcting the regression in (b). Compare the fitted line to the true regression line and to the fitted line from (a). For the third part, just use the bootstrap (as in the example in section 11.3).

(5) Question 11.1.

(6) Download the forestry data (ufcwc.txt) which is used in section 7.2 of the book.
(a) Fit a kernel regression (Y = Height and X = Dbh). Show several different fits using different bandwidths and different kernels.
(b) Plot the cross-validation score. Use cross-validation (or generalized cross-validation) to find the best bandwidth. Plot the fit based on the best bandwidth.
(c) Fit a local linear regression. Plot the cross-validation score versus bandwidth. Find the best bandwidth and plot the corresponding fit.
(d) Repeat (c) using splines.
(e) Repeat (d) using orthogonal functions.
Summarize your findings.
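Problem (4) illustrates attenuation bias from measurement error; a hedged simulation sketch of parts (a) and (b), with variable names of our own:

```r
# Measurement error: regressing Y on the noisy W attenuates the slope.
# With Var(X) = Var(U) = 1, the W-slope is about 4 * 1/(1 + 1) = 2.
set.seed(4)
n = 1000
x   = rnorm(n)
eps = rnorm(n, 0, 2)
u   = rnorm(n)
y = 3 + 4*x + eps
w = x + u
b_x = coef(lm(y ~ x))[2]   # near the true slope 4
b_w = coef(lm(y ~ w))[2]   # attenuated, near 2
c(b_x = b_x, b_w = b_w)
```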

Homework 5

(1) Get the data set fuel2001.txt. The variables are:
• Drivers (number of licensed drivers)
• FuelC (fuel consumption in thousands of gallons)
• Income (per capita income)
• Miles (miles of highway in state)
• MPC (miles driven per person)
• Pop (number of people 16 and older)
• Tax (tax rate on gas)
Let Y = log(fuel consumption). Estimate the density of Y using a histogram and using a kernel density estimator. Use cross-validation to choose the amount of smoothing.

(2) Use the fuel data again. The goal is to predict fuel consumption from the other variables. Ignore the variable State. Do a complete analysis using all the tools we have learned. This should include: plots, linear fits, nonparametric methods, multivariate nonparametric methods. Provide a brief summary of what you did.

Homework 6

(1) Suppose that X ∈ R and that X ∼ Uniform(0, 10), and

r(x) = P(Y = 1|X = x) = .2 for .1 < x < .9 and r(x) = .9 otherwise.

(a) Find the Bayes classification rule h*.
(b) Find the Bayes risk R* = R(h*).
(c) Let H be the set of linear classifiers of the form h(x) = 1 if β0 + β1x ≥ 0 and h(x) = 0 if β0 + β1x < 0. What is the smallest risk over all such classifiers in this problem?

(2) Let f0(x) = f(x|Y = 0) and f1(x) = f(x|Y = 1). Show that

R* = 1/2 − (1/4) ∫ |f1(x) − f0(x)| dx.

Interpret this result. Find f1 and f0 in problem 1, apply the above formula for R*, and confirm that you get the same answer as before.

(3) Get the glass fragment data:

library(MASS)
data(fgl)
help(fgl)

The goal is to predict the variable type from the others. Note that type takes 6 different values.

(a) First combine the first three classes (different types of window glass) and the last three classes so that type now only has two values. Classify the data (window or not window) using (i) linear regression, (ii) logistic regression, (iii) nearest neighbors, (iv) lda, (v) qda, (vi) trees. Compare the results.

(b) Now let type have 6 different levels. To do this with regression methods requires you to be inventive.
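Since the Bayes risk satisfies R* = E[min(r(X), 1 − r(X))], problem 1(b) can be checked by Monte Carlo; a sketch using the r(x) as stated in problem (1):

```r
# Monte Carlo evaluation of the Bayes risk R* = E[min(r(X), 1 - r(X))]
# for the r(x) in problem (1), with X ~ Uniform(0, 10).
set.seed(6)
r = function(x) ifelse(x > .1 & x < .9, .2, .9)
x = runif(1e5, 0, 10)
R_star = mean(pmin(r(x), 1 - r(x)))
R_star   # about 0.108 = 0.08*0.2 + 0.92*0.1
```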


Homework 7

(1) Create a simulated dataset as follows. Create a 20 by 20 matrix E. The diagonal elements should be zero. Make each upper-diagonal element a 1 or 0 by generating a random Bernoulli with success probability .05. Now let b = 40 and define Σ^{-1} using this code:

B    = b*E
a    = apply(B, 1, sum)
a    = max(a) + 1
a    = rep(a, nrow(E))
Sinv = B + diag(a)

Draw the graph. The following code might be helpful for drawing a graph:

draw = function(E){
    par(pty="s")
    n = nrow(E)
    angle = seq(0, 2*pi, length=n+1)
    angle = angle[1:n]
    x = cos(angle)
    y = sin(angle)
    plot(x, y, pch=20, lwd=3, xlab="", ylab="",
         xaxt="n", yaxt="n", xlim=c(-1.5,1.5), ylim=c(-1.5,1.5))
    for(i in 1:n){
        text(1.2*cos(angle[i]), 1.2*sin(angle[i]), paste("X",i), lwd=3, font=2)
    }
    for(i in 1:(n-1)){
        for(j in (i+1):n){
            if(E[i,j] == 1) lines(c(x[i],x[j]), c(y[i],y[j]), lwd=3)
        }
    }
    return(NULL)
}

Now generate 100 random vectors from a N(0, Σ). To do this, use the following fact: if Z ∼ N(0, I) then X = Σ^{1/2} Z ∼ N(0, Σ). To compute the square root of a matrix use:

e = eigen(A)
V = e$vectors
s = V %*% sqrt(diag(e$values)) %*% t(V)

Estimate the graph from your data. Use the SIN method and the lasso method and compare your answers. To assess the variability of the estimator, draw 10 bootstrap samples and draw the graphs from these bootstrap samples.


(2) Consider random variables (X1 , X2 , X3 ). In each of the following cases, draw a graph with the fewest possible number of edges that has the given independence relations. (a) X1 ∐ X3 | X2 . (b) X1 ∐ X2 | X3 and X1 ∐ X3 | X2 . (c) X1 ∐ X2 | X3 and X1 ∐ X3 | X2 and X2 ∐ X3 | X1 .

(3) Consider random variables (X1 , X2 , X3 , X4 ). In each of the following cases, draw a graph with the fewest possible number of edges that has the given independence relations. (a) X1 ∐ X3 | X2 , X4 and X1 ∐ X4 | X2 , X3 and X2 ∐ X4 | X1 , X3 . (b) X1 ∐ X2 | X3 , X4 and X1 ∐ X3 | X2 , X4 and X2 ∐ X3 | X1 , X4 . (c) X1 ∐ X3 | X2 , X4 and X2 ∐ X4 | X1 , X3 .

(4) Construct a distribution on three variables that cannot be represented by an undirected graph. Construct a distribution on four variables that cannot be represented by a directed graph.

(5) Get the undirected graph data from the course website. There are 5 binary variables. Fit an undirected graph using loglinear models. Approximate the distribution of the data with a Normal. Now estimate the graph using SIN. (In other words, just use the sample covariance matrix.)

(6) Write down the conditional independencies from Figures 1-4.


[Figure 1: a graph on X1, X2, X3, X4]

[Figure 2: a graph on X1, X2, X3, X4]

[Figure 3: a graph on X1, X2, X3, X4]

[Figure 4: a graph on X1, X2, X3, X4, X5, X6]

Solution 1

Solution 2

Solution 3

Solution 4

Solution 6

Solution 7

Appendix: Clustering

Clustering is really a type of dimension reduction. Dimension reduction methods aim to find low dimensional structures that preserve properties of the full data set. The sets could be points, lines, curves, linear subspaces, manifolds, blobs, ridges and so on. These methods are sometimes called unsupervised learning.

Some of these methods take the following form. We try to find a low dimensional set C belonging to a class of sets C to minimize some quantity such as the projection error E||X − ΠC(X)||^2 where ΠC is the projector onto C; that is, ΠC(X) is the point in C closest to X. The solution to this problem depends on the class C. For example:

C                       Method
Singletons              The mean
Sets with k points      k-means clustering
Lines                   Principal components
Smooth curves           Principal curves
Sets of k lines         k-lines clustering
Lines in feature space  Kernel principal components

Clustering methods aim to find small sets with a high concentration of data. The goal of clustering is to find disjoint sets C1, ..., Ck such that each Cj has a high concentration of data. Some of the methods can be cast in terms of the probability density function p(x). One assumes, explicitly or implicitly, that p(x) can be written as

p(x) = p0(x) + Σ_j pj(x)

where each pj is very concentrated around a set Cj and C1, ..., Ck are disjoint sets. The first term p0(x) represents mass not in any of the sets Cj.

Figure 6 shows some synthetic examples. The examples are simple but illustrate some of the challenges in clustering. Let us also introduce some real examples.

EXAMPLE. Image analysis.

EXAMPLE. 6830 genes on 64 people. Find groups of people.

[Figure 1: Synthetic examples of clusters.]

1 k-means

In k-means clustering we try to find k points c1, ..., ck such that most of the data are tightly concentrated around one of the cj. We think of C = {c1, ..., ck} as a set of "cluster centers." We partition R^d into k sets T1, ..., Tk where x ∈ Tj if and only if ||x − cj|| ≤ ||x − cs|| for all s ≠ j.

Let X be a d-dimensional random vector and let Ck denote all sets of the form C = {c1, ..., ck} where each cj ∈ R^d. Define the risk

R(C) = E||X − ΠC(X)||^2 = ∫ ||x − ΠC(x)||^2 dP(x)

where ΠC(x) is the projection of x onto C: ΠC(x) = argmin_{1≤j≤k} ||x − cj||^2. The population k-means set is C* = argmin_{C∈Ck} R(C).

To estimate the centers from the data, we take Ĉk = argmin_{C∈Ck} R̂(C) where

R̂(C) = ∫ ||x − ΠC(x)||^2 dPn(x) = (1/n) Σ_{i=1}^n ||Xi − ΠC(Xi)||^2.

Once we find Ĉ = {ĉ1, ..., ĉk} we partition the sample points into disjoint sets T1, ..., Tk where Xi ∈ Tj if and only if ||Xi − ĉj|| ≤ ||Xi − ĉs|| for all s ≠ j. The sets T1, ..., Tk are the estimated clusters, called the k-means partition.

The usual algorithm to find Ĉ is the k-means clustering algorithm in Figure 2. The algorithm is not guaranteed to find a global minimum, so it is often a good idea to rerun the algorithm a few times with random starting values and then take the best solution.

The k-means clustering algorithm (Figure 2):

1. Choose k centers c1, ..., ck at random from the data.
2. Form the clusters T1, ..., Tk where Xi is in Tj if cj is the center closest to Xi.
3. Let nj denote the number of points in Tj and set cj ← (1/nj) Σ_{i: Xi ∈ Tj} Xi.
4. Repeat steps 2 and 3 until convergence.

Figure 2: The k-means clustering algorithm.

Now we discuss the theoretical properties of the k-means method.

THEOREM (Bartlett, Linder and Lugosi 1997). Suppose that P(||X||^2/d ≤ 1) = 1 and that n ≥ k^{4/d}, dk^{1−2/d} log n ≥ 15, kd ≥ 8, n ≥ 8d and n/log n ≥ dk^{1+2/d}. Then

R(Ĉ) − R(C*) ≤ 32 √( d^3 k^{1−2/d} log n / n ).

Also, if k ≥ 3 and n ≥ 16k/(2Φ^2(−2)), then for any method C̃ that selects k centers, there exists P such that

R(C̃) − R(C*) ≥ c0 √( d k^{1−4/d} / n )

where c0 = Φ^4(−2) 2^{−12}/√6.

It follows that the method is consistent in the sense that R(Ĉ) − R(C*) → 0, as long as k = o(n/(d^3 log n)). Moreover, the lower bound implies that we cannot find any other method that improves much over the k-means approach, at least with respect to this loss function.

An important practical question is: how do we choose a good value for k? There are numerous approaches to answering this question. The reason there are so many answers is that the question is vague. What does it mean to choose a good k? Rather than get into a prolonged discussion of the various methods, let us instead discuss a simple method that is tied to the risk function R(C).

To indicate the dependence on the number of clusters k, write C*_k for the optimal clustering and let Rk = R(C*_k). It is easy to see that Rk is a nonincreasing function of k, so minimizing Rk does not make sense. Instead, we can look for the first k such that the improvement Rk − Rk+1 is small, sometimes called an elbow. For example, fix a small number α > 0 and define

kα = min{ k : (Rk − Rk+1)/σ^2 ≤ α }

where σ^2 = E(||X − µ||^2) and µ = E(X). An estimate of kα is

k̂α = min{ k : (R̂k − R̂k+1)/σ̂^2 ≤ α }

where σ̂^2 = n^{-1} Σ_{i=1}^n ||Xi − X̄||^2.

2 Hierarchical Clustering

Agglomerative clustering: start with each point in a separate cluster, merge the two closest clusters, and continue. This requires us to define the distance between clusters. Example: d(C1, C2) = min{d(xi, xj) : xi ∈ C1, xj ∈ C2}. Divisive clustering: start with one cluster and divide recursively.

3 Level Set Clustering

The density function p(x) can also be used to define clusters. For a fixed non-negative number λ define the level set

L(λ) = {x : p(x) > λ}.

Suppose that L(λ) can be decomposed into a finite collection of bounded, connected, disjoint sets:

L(λ) = ∪_{j=1}^k Cj.

We assume that this decomposition is minimal in the sense that this is the fewest number of sets for such a decomposition. We then call C = {C1, ..., Ck} the density clusters.

[Figure 3: Original Image]

[Figure 4: Compressed Image]

[Figure 5: Synthetic examples of clusters, with total within-cluster sum of squares plotted against the number of clusters.]
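The elbow estimate k̂α can be computed directly from repeated runs of kmeans; a minimal sketch on synthetic blobs (the data and α = 0.1 are our own choices):

```r
# Elbow rule: first k where the drop (R_k - R_{k+1})/sigma2 falls below alpha.
set.seed(7)
X = rbind(matrix(rnorm(100, 0, .3), ncol = 2),
          matrix(rnorm(100, 3, .3), ncol = 2))   # two well-separated blobs
n = nrow(X)
xbar = colMeans(X)
sigma2 = mean(rowSums((X - matrix(xbar, n, 2, byrow = TRUE))^2))
Rk = rep(0, 6)
Rk[1] = sigma2                        # one cluster: risk = total variance
for(k in 2:6) Rk[k] = kmeans(X, centers = k, nstart = 10)$tot.withinss / n
alpha = 0.1
khat = min(which((Rk[1:5] - Rk[2:6]) / sigma2 <= alpha))
khat                                  # 2 for these blobs
```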

We can estimate the density clusters by first estimating the density with, say, the kernel density estimator

p̂(x) = (1/n) Σ_{i=1}^n Kh(x − Xi).

An estimator of L(λ) is L̂(λ) = {x : p̂(x) > λ}. One can then decompose L̂(λ) into estimated clusters Ĉ1, ..., Ĉk. The set C0 = {x : p(x) ≤ λ} is called the background, and points Xi in C0 are called background clutter.

The following theorem gives the rate of convergence of this estimator. First, some definitions. Let B(x, ε) = {y : ||x − y|| ≤ ε} and for any set A define A_ε = ∪_{x∈A} B(x, ε). A set S is standard if for every λ > 0 there exists a δ ∈ (0, 1) such that

µ(B(x, ε) ∩ S) ≥ δ µ(B(x, ε)) for all x ∈ S and all 0 < ε ≤ λ,   (1)

where µ denotes Lebesgue measure. The Hausdorff distance between two sets is defined by

dH(A, B) = inf{ε : A ⊂ B_ε and B ⊂ A_ε}.

THEOREM (Cuevas and Fraiman 1996). Suppose that the following assumptions hold.
1. The kernel K is a bounded density, uniformly Lipschitz, supported on a compact set, and K(t) is decreasing in ||t||. Also, ||t||^d K(t) is bounded and there exist c, r such that cI(x ∈ B(0, r)) ≤ K(x).
2. The bandwidth hn satisfies hn → 0 and n h_n^d / log n → ∞.

3. p is bounded and L = {x : p(x) ≥ λ} is a compact, standard set. (This is a very difficult condition to check.)

Let L̂ = {x : p̂(x) ≥ λ + cn} where cn > 0 and cn → 0. Then

βn dH(L̂, L) → 0 a.s.   (2)

where βn is any sequence satisfying βn → ∞ and βn hn → 0. Faster rates of convergence are possible under stronger conditions; see Rigollet (2007). In particular, we can choose hn ≍ (log^2 n / n)^{1/d} and βn ≍ (n / log^3 n)^{1/d}.

A difficult computational problem is how to decompose L̂(λ) into disjoint clusters. Two points Xi and Xj are in the same cluster if and only if there exists a path between Xi and Xj such that p̂(x) > λ for all x along the path. Cuevas, Febrero and Fraiman (2000) suggest the following algorithm. Draw observations Z1, Z2, ... from p̂ and keep only those observations for which p̂(Zi) ≥ λ. Suppose there are N such observations, where N is much larger than n. (One can do this using the original sample, but using the generated sample makes the procedure more accurate.) Fix an ε > 0. Now we approximate L̂ with

L̃ = ∪_{Xi ∈ S} B(Xi, ε)

where S = {i : p̂(Xi) > λ} and B(x, ε) = {y : ||x − y|| ≤ ε}. Now find the connected components as explained in Figure ??. One possible choice for ε is

ε = max_i min_{j≠i} ||Xi − Xj|| / 2,

which is the smallest value that connects every Xi to its nearest neighbor.

To implement this method one must choose λ. One possibility is to fix a small number α and then choose λ = λα where λα = sup{λ : P(L(λ)) ≥ 1 − α}. We estimate λα by

λ̂α = sup{ λ : #{Xi ∈ L̂(λ)} / n ≥ 1 − α }.

Other choices are discussed in Cuevas, Febrero and Fraiman (2000).
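In one dimension the decomposition of L̂(λ) is easy to carry out on the grid returned by density(); a minimal sketch (the data and λ are our own choices):

```r
# Level set clustering in 1-d: the clusters are the maximal runs of
# grid points where the kernel density estimate exceeds lambda.
set.seed(9)
x = c(rnorm(100, -2, .3), rnorm(100, 2, .3))
d = density(x)                 # kernel density estimate on a grid
lambda = 0.1
above = d$y > lambda           # grid points inside the level set
runs = rle(above)              # consecutive runs above / below lambda
k = sum(runs$values)           # number of connected components (clusters)
k                              # 2 for this bimodal sample
```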

[Figure 6: Synthetic examples of clusters; density clusters and a cluster dendrogram from hclust(dist(X)) with complete linkage.]
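Agglomerative clustering with the minimum inter-cluster distance given in the hierarchical clustering section is called single linkage; a minimal hclust sketch (the data are our own):

```r
# Single linkage agglomerative clustering: d(C1, C2) is the smallest
# pairwise distance between the two clusters.
set.seed(8)
X = rbind(matrix(rnorm(40, 0, .2), ncol = 2),
          matrix(rnorm(40, 2, .2), ncol = 2))
out = hclust(dist(X), method = "single")
labels = cutree(out, k = 2)    # cut the dendrogram into two clusters
table(labels)                  # cluster sizes
```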

4 Modal Clustering

Suppose that the density p on R^d has finitely many modes m1, ..., mk. Define a partition T1, ..., Tk as follows: a point x belongs to Tj if and only if the steepest ascent path beginning at x leads to mj. The steepest ascent path starting at x is defined to be the curve t : [0, ∞) → R^d such that t(0) = x and t′(s) = V(t(s)) where V(x) = ∇p(x). In other words, the steepest ascent path is the integral curve defined by the vector field V. A sufficient condition for the existence and uniqueness of the steepest ascent paths is that there exists c such that ||V(x) − V(y)|| ≤ c||x − y|| for all x and y. Finally, the data are clustered according to which element Tj they fall in.

As described so far, mode clustering is not very useful since, for example, p may not even have a finite number of modes. A refinement that is used in practice is the following. Given h > 0, define

p_h(x) = (1/h^d) ∫ K(||u − x||/h) p(u) du.

We can think of p_h as the smoothed out version of p using a Gaussian kernel; p_h is the density of the random variable X + Z where X ∼ P and Z ∼ N(0, h^2 I).

THEOREM. For each h, p_h has finitely many modes. The number of modes is nonincreasing in h.

We estimate p_h with the kernel density estimator. To find the modes and cluster the data we use the mean shift algorithm (see Figure ??). This algorithm not only finds the modes but it shows what mode each Xi belongs to.

R code for the image example:

pdf("cat1.pdf")
library(adimpro)
x = read.image("cat.jpeg")
y = extract.image(x)
y = matrix(y, 130, 89)
g = gray(seq(0, 1, length=1024))
image(y, col=g)
tmp = kmeans(c(y), centers=5)

Y = tmp$centers[tmp$cluster, 1]
Y = matrix(Y, 130, 89)
pdf("cat2.pdf")
image(Y, col=g)
dev.off()

The Mean Shift Algorithm:

1. Choose a grid of points x1, ..., xN. (This grid can be identical to the data {X1, ..., Xn} but in general it can be different.) Set t = 0.
2. For j = 1, ..., N set

   xj ← Σ_{i=1}^n Xi Kh(xj − Xi) / Σ_{i=1}^n Kh(xj − Xi).

3. Let t = t + 1. Repeat until convergence.

Figure 7: The Mean Shift Algorithm.

###### blobs
x1 = c(rnorm(50, -.8, .2), rnorm(50, .8, .2), rnorm(50, -.5, .2))
x2 = c(rnorm(50, -.8, .2), rnorm(50, .8, .2), rnorm(50, .5, .2))
X = cbind(x1, x2)
plot(x1, x2, lwd=2, xlab="", ylab="", xaxt="n", yaxt="n",
     xlim=c(-1.5, 1.5), ylim=c(-1.5, 1.5))
n = nrow(X)
ss = rep(0, 10)
ss[1] = n*sum(diag(var(X)))
### kmeans
for(i in 2:10){
    out = kmeans(X, centers=i)
    ss[i] = sum(out$withinss)
}
plot(ss, type="h", lwd=3, col="black", xlab="Number of Clusters", ylab="")
out = kmeans(X, centers=3)
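A minimal 1-d implementation of the mean shift iteration in Figure 7 (the bandwidth and data are our own choices):

```r
# Mean shift: repeatedly move every point to a kernel-weighted average of
# the data; points converge to the modes of the kernel density estimate.
set.seed(10)
X = c(rnorm(100, -1, .2), rnorm(100, 1, .2))
h = .2
Kh = function(u) dnorm(u, sd = h)
x = X                                   # use the data itself as the grid
for(t in 1:50){
    for(j in seq_along(x)){
        w = Kh(x[j] - X)
        x[j] = sum(w * X) / sum(w)
    }
}
modes = unique(round(x, 1))             # points collapse onto the modes
length(modes)                           # 2 modes, near -1 and 1
```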

###### circles
a = seq(0, 2*pi, length=60)
x1 = cos(a)
x2 = sin(a)
x1 = c(x1, 2*x1)
x2 = c(x2, 2*x2)
X = cbind(x1, x2)
plot(x1, x2, lwd=3, xlab="", ylab="", xaxt="n", yaxt="n",
     xlim=c(-3, 3), ylim=c(-3, 3))
n = nrow(X)
ss = rep(0, 10)
ss[1] = n*sum(diag(var(X)))
## kmeans
for(i in 2:10){
    out = kmeans(X, centers=i)
    ss[i] = sum(out$withinss)
}
plot(ss, type="h", lwd=3, col="black", xlab="Number of Clusters", ylab="")
out = kmeans(X, centers=4)
plot(x1, x2, pch=out$cluster, col=out$cluster, lwd=3, xlab="", ylab="",
     xaxt="n", yaxt="n", xlim=c(-3, 3), ylim=c(-3, 3))
### hierarchical clustering
out = hclust(dist(X))
plot(out, main="hierarchical clusters")
tmp = cutree(out, k=2)
plot(x1, x2, pch=tmp, col=tmp, lwd=3, xlab="", ylab="",
     xaxt="n", yaxt="n", xlim=c(-3, 3), ylim=c(-3, 3))

### density clusters
h = .1
y1 = seq(-3, 3, length=50)
y2 = seq(-3, 3, length=50)
y1 = rep(y1, 50)
y2 = rep(y2, rep(50, 50))
f = rep(0, 50*50)
for(i in 1:n){
    f = f + dnorm(y1, x1[i], h)*dnorm(y2, x2[i], h)/n
}
g = gray(seq(1, 0, length=10))
image(matrix(f, 50, 50), col=g, main="density clusters", xaxt="n", yaxt="n")

### genes
m = read.table("nci.data")
out = kmeans(t(m), centers=3)
true = c("CNS","CNS","CNS","RENAL","BREAST","CNS","CNS","BREAST","NSCLC","NSCLC",
         "RENAL","RENAL","RENAL","RENAL","RENAL","RENAL","RENAL","BREAST","NSCLC","RENAL",
         "UNKNOWN","OVARIAN","MELANOMA","PROSTATE","OVARIAN","OVARIAN","OVARIAN","OVARIAN",
         "OVARIAN","PROSTATE","NSCLC","NSCLC","NSCLC","LEUKEMIA","K562B-repro","K562A-repro",
         "LEUKEMIA","LEUKEMIA","LEUKEMIA","LEUKEMIA","LEUKEMIA","COLON","COLON","COLON",
         "COLON","COLON","COLON","COLON","MCF7A-repro","BREAST","MCF7D-repro","BREAST",
         "NSCLC","NSCLC","NSCLC","MELANOMA","BREAST","BREAST","MELANOMA","MELANOMA",
         "MELANOMA","MELANOMA","MELANOMA","MELANOMA")
table(true, out$cluster)

              cluster
true            1  2  3
  BREAST        0  4  3
  CNS           0  0  5
  COLON         0  7  0
  K562A-repro   1  0  0
  K562B-repro   1  0  0
  LEUKEMIA      5  1  0
  MCF7A-repro   0  1  0
  MCF7D-repro   0  1  0
  MELANOMA      0  7  1
  NSCLC         0  3  6
  OVARIAN       0  2  4
  PROSTATE      0  1  1
  RENAL         0  0  9
  UNKNOWN       0  0  1

Bibliography

[1] Cohen, J., and Cohen, P. (1975). Applied Multiple Regression and Correlation Analysis for the Behavioral Sciences. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
[2] Cook, R. D., and Weisberg, S. (1994). An Introduction to Regression Graphics. New York: Wiley.
[3] Fox, J. (1991). Regression Diagnostics: An Introduction. Newbury Park, CA: Sage.
[4] Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods. Newbury Park, CA: Sage.
[5] Chatterjee, S., and Price, B. (1977). Regression Analysis by Example. New York: Wiley.
[6] Mosteller, F., and Tukey, J. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.
