
Midterm Test for ST4240 Data Mining

(Please answer all the questions for full marks. Please send your answers to
staxyc@nus.edu.sg)

1. For data A (at http://www.stat.nus.edu.sg/~staxyc/DM07testdata1.dat), there are 5
predictors X1, … , X5 and response Y. A single-index model (SIM) is suggested:
Y = g(a1*X1 + … + a5*X5) + e

A. Estimate the model, plot the link function and its confidence band.
The estimated model is
Y = g(0.008595073X1 -0.740091476X2 + 0.034182754X3
-0.671579756X4 + 0.001703434X5)
The estimated function and its 95% confidence band are shown in
Figure 1
[Plot: y against xalpha, with the estimated link function and its 95% confidence band]

Figure 1

B. Which variables can be removed? Estimate the model again after removing the
variables
The estimated coefficients have standard errors (SE) of 0.02222250, 0.01297481,
0.02339194, 0.01451642 and 0.02307927 respectively. By checking the
“t-statistics” (estimate divided by SE), only X2 and X4 are significant
(|t| > 1.96), so X1, X3 and X5 can be removed
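The t-statistic check can be reproduced from the numbers reported above. This is a minimal sketch: the coefficients and SEs are copied from the answer, while the cut-off |t| < 1.96 (a two-sided 5% normal threshold) is an assumption about the rule used.

```r
# Coefficients and standard errors as reported in the answer above
alpha <- c(X1 = 0.008595073, X2 = -0.740091476, X3 = 0.034182754,
           X4 = -0.671579756, X5 = 0.001703434)
se <- c(0.02222250, 0.01297481, 0.02339194, 0.01451642, 0.02307927)

# t-statistic = estimate / standard error
tstat <- alpha / se
print(round(tstat, 2))

# Assumed rule: a predictor is removable when |t| is below the
# two-sided 5% normal cut-off (about 1.96)
removable <- names(alpha)[abs(tstat) < qnorm(0.975)]
print(removable)
```

Under this assumed rule the removable set is X1, X3 and X5, which agrees with the answer.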

C. For a new X (X1=0, X2=0, X3=0, X4=0, X5=0), predict the function value,
i.e. E(Y|new X), and calculate its 95% confidence interval.
The predicted value is 1.023846, and the 95% confidence interval is
[0.6610302, 1.386661]

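CODE
# read the data: columns 1 to 5 are X1, ..., X5 and column 6 is Y
xy = read.table("testdata1.dat")
x = data.matrix(xy[, 1:5])
y = data.matrix(xy[, 6])
source("sim.R")   # defines sim()

# A. fit the single-index model; coefficients and their SEs
out = sim(x, y)
out$alpha
out$se

# plot y against the estimated index, then add the fitted link
# function and the lower/upper confidence limits
xalpha = x %*% out$alpha
plot(xalpha, y)
I = order(xalpha)
lines(xalpha[I], out$predict[I])
lines(xalpha[I], out$Ln[I])
lines(xalpha[I], out$Un[I])

# C. prediction and confidence limits at the new point
out = sim(x, y, xnew = c(0, 0, 0, 0, 0))
out$predict
out$Ln
out$Un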
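The internals of sim() are not shown here, so the following is only a rough sketch of how a single-index fit can work, not the method in sim.R. It assumes Gaussian predictors, in which case the OLS slope vector is proportional to the index direction (a consequence of Stein's lemma), and it estimates the link by a Nadaraya-Watson smoother. The simulated data, the link exp(u), the bandwidth 0.2 and the helper nw() are all hypothetical.

```r
set.seed(1)
n <- 5000
p <- 5

# Hypothetical true direction, rescaled to unit length
a_true <- c(0, -0.74, 0, -0.67, 0)
a_true <- a_true / sqrt(sum(a_true^2))

x <- matrix(rnorm(n * p), n, p)      # Gaussian predictors (assumption)
g <- function(u) exp(u)              # hypothetical link function
y <- g(drop(x %*% a_true)) + rnorm(n, sd = 0.2)

# Step 1: index direction from OLS slopes, rescaled to unit length
b <- coef(lm(y ~ x))[-1]
a_hat <- unname(b / sqrt(sum(b^2)))

# Step 2: Nadaraya-Watson estimate of the link on the estimated index
nw <- function(u0, u, y, h) {
  w <- dnorm((u - u0) / h)           # Gaussian kernel weights
  sum(w * y) / sum(w)
}
xalpha <- drop(x %*% a_hat)
grid <- seq(-2, 2, by = 0.1)
g_hat <- sapply(grid, nw, u = xalpha, y = y, h = 0.2)

print(round(a_hat, 3))
```

Calling plot(xalpha, y) followed by lines(grid, g_hat) would draw a picture of the same kind as Figure 1, without the confidence band.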

2. For data B (at http://www.stat.nus.edu.sg/~staxyc/DM07testdata2.dat), there are 5
predictors X1, … , X4 and Z with response Y. To find how Z affects the relationships
between Y and X1, … , X4 we consider the varying coefficient model
Y = g0(Z) + g1(Z)*X1 + g2(Z)*X2 + g3(Z)*X3 + g4(Z)*X4 + e

A. Try different bandwidths 0.1, 0.2, 0.3. Which bandwidth do you prefer?
We try the model with the 3 bandwidths. The plots are shown below

bandwidth=0.1
[Five panels: out$coeff[I, 1] to out$coeff[I, 5], each plotted against z[I]]

bandwidth=0.2
[Five panels: out$coeff[I, 1] to out$coeff[I, 5], each plotted against z[I]]

bandwidth=0.3
[Five panels: out$coeff[I, 1] to out$coeff[I, 5], each plotted against z[I]]

Comparing the plots suggests that h=0.1 is overfitted (undersmoothed) and
h=0.3 is underfitted (oversmoothed). h=0.2 is suggested

B. Which coefficient functions among g0, g1, … , g4 are constant and which are
varying? [hint: using plot(… , ylim=c(specify, specify))]
g0(.), g2(.) and g3(.) are constant; g1(.) and g4(.) vary with Z

C. For a new observation with X1 = 1, X2 = 1, X3 = 0.5, X4 = 0.5 and Z = 0.5,
predict its Y
The predicted Y is 1.426397

CODE
# read the data: columns 1 to 4 are X1, ..., X4, column 5 is Z, column 6 is Y
xy = read.table("testdata2.dat")
x = data.matrix(xy[, 1:4])
z = data.matrix(xy[, 5])
y = data.matrix(xy[, 6])
source("vcm.R")   # defines vcm()
#######################
# fit with one bandwidth (repeat with 0.1 and 0.3 for part A)
out = vcm(x, z, y, bandwidth=0.2)
I = order(z)
par(mfrow = c(2, 3))
# one panel per coefficient function, with confidence limits in red
plot(z[I], out$coeff[I, 1], ylim=c(0, 1))
lines(z[I], out$Ln[I, 1], col="red")
lines(z[I], out$Un[I, 1], col="red")
plot(z[I], out$coeff[I, 2], ylim=c(0, 1))
lines(z[I], out$Ln[I, 2], col="red")
lines(z[I], out$Un[I, 2], col="red")
title("bandwidth=0.2")
plot(z[I], out$coeff[I, 3], ylim=c(0, 1))
lines(z[I], out$Ln[I, 3], col="red")
lines(z[I], out$Un[I, 3], col="red")
plot(z[I], out$coeff[I, 4], ylim=c(0, 1))
lines(z[I], out$Ln[I, 4], col="red")
lines(z[I], out$Un[I, 4], col="red")
plot(z[I], out$coeff[I, 5], ylim=c(0, 1))
lines(z[I], out$Ln[I, 5], col="red")
lines(z[I], out$Un[I, 5], col="red")

# C. coefficient functions at Z = 0.5, then g0 + g1*X1 + g2*X2 + g3*X3 + g4*X4
out = vcm(x, z, y, znew = 0.5, bandwidth=0.2)
predict = out$coeff[1] + out$coeff[2] + out$coeff[3] + 0.5*out$coeff[4] + 0.5*out$coeff[5]
predict
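How vcm() estimates the coefficient functions is not shown above. One common approach, sketched here purely as an assumption, is a local linear fit: at a target point z0, run weighted least squares with kernel weights centred at z0 and read off the level coefficients. The simulated data, the coefficient functions in gfun() and the helper vc_at() are hypothetical; only the bandwidth 0.2 and the new point are taken from the answer.

```r
set.seed(2)
n <- 2000
z <- runif(n)
x <- matrix(rnorm(n * 4), n, 4)

# Hypothetical coefficient functions: g0, g2, g3 constant; g1, g4 varying
gfun <- function(z) cbind(0.5, z, 0.3, 0.6, sin(2 * pi * z))
y <- rowSums(cbind(1, x) * gfun(z)) + rnorm(n, sd = 0.2)

# Local linear estimate of (g0, ..., g4) at z0 with a Gaussian kernel
vc_at <- function(z0, x, z, y, h) {
  w <- dnorm((z - z0) / h)            # kernel weights
  d <- cbind(1, x)
  D <- cbind(d, d * (z - z0))         # level terms, then slope terms
  fit <- lm.wfit(D, y, w)             # weighted least squares
  unname(fit$coefficients[seq_len(ncol(d))])
}

coef_hat <- vc_at(0.5, x, z, y, h = 0.2)

# Prediction for X1 = 1, X2 = 1, X3 = 0.5, X4 = 0.5 at Z = 0.5
xnew <- c(1, 1, 0.5, 0.5)
pred <- sum(c(1, xnew) * coef_hat)
print(round(coef_hat, 3))
print(pred)
```

The last step mirrors the prediction line in the CODE section: the intercept function plus each coefficient function at Z = 0.5 multiplied by the corresponding new X value.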

3. For data C (at http://www.stat.nus.edu.sg/~staxyc/DM07testdata3.dat), there are 3
predictors X1, X2, X3 and response Y.

A. Plot Y against each covariate X1, X2 and X3 respectively, and the regression
function m1(x) = E(Y|X1=x), m2(x) = E(Y|X2=x), m3(x) = E(Y|X3=x) and
corresponding 95% confidence bands
The estimated functions are shown below
[Three panels: out1$m against x[, 1], out2$m against x[, 2], out3$m against x[, 3]]

B. Suppose we need to construct a partially linear model, what is your suggestion?
i.e. which predictor would you select as the nonlinear part?
From the above plot, we see that the relation between Y and X1 is likely
linear, and that between Y and X2 is likely linear. We would choose X3 as the
nonlinear part in the partially linear model

C. For X1 = 0, X2 = 2, X3 = 0.5, predict its response Y, with X3 nonlinear.
The predicted value is -1.090471

CODE
# read the data: columns 1 to 3 are X1, X2, X3 and column 4 is Y
xy = read.table("testdata3.dat")
x = data.matrix(xy[, 1:3])
y = data.matrix(xy[, 4])
source("cvh.R")   # defines cvh(), bandwidth selection
source("ks.R")    # defines ks(), kernel smoother
par(mfrow = c(2, 2))

# A. for each predictor: choose h, smooth, plot the fit and its band
h = cvh(x[, 1], y)
out1 = ks(x[, 1], y, bandwidth = h)
plot(x[, 1], out1$m, ylim=c(-0.5, 2))
lines(x[, 1], out1$L, type="p", col="red")
lines(x[, 1], out1$U, type="p", col="red")
h = cvh(x[, 2], y)
out2 = ks(x[, 2], y, bandwidth = h)
plot(x[, 2], out2$m, ylim=c(-0.5, 2))
lines(x[, 2], out2$L, type="p", col="red")
lines(x[, 2], out2$U, type="p", col="red")
h = cvh(x[, 3], y)
out3 = ks(x[, 3], y, bandwidth = h)
plot(x[, 3], out3$m, ylim=c(0, 1.8))
lines(x[, 3], out3$L, type="p", col="red")
lines(x[, 3], out3$U, type="p", col="red")

# C. partially linear model: X1, X2 linear, X3 nonlinear
source("plr.R")   # defines plr()
out = plr(x[, 1:2], x[, 3], y, znew=0.5, bandwidth=h)
predict = out$beta[1]*0 + out$beta[2]*2 + out$g
predict
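The internals of plr() are not shown above. As a hedged illustration of one standard way to fit a partially linear model Y = b1*X1 + b2*X2 + g(X3) + e, the sketch below uses the double-residual idea: smooth Y and each linear predictor on X3, regress residual on residual to get the linear coefficients, then smooth the partial residuals to get g. The simulated data, the true coefficients, the fixed bandwidth 0.1 and the helper nw() are hypothetical and are not taken from data C.

```r
set.seed(3)
n <- 1000
x <- matrix(rnorm(n * 2), n, 2)       # linear part (hypothetical X1, X2)
z <- runif(n)                         # nonlinear part (hypothetical X3)
beta_true <- c(1, -1)
y <- drop(x %*% beta_true) + sin(pi * z) + rnorm(n, sd = 0.2)

# Nadaraya-Watson smoother of v on z, evaluated at each point of z0
nw <- function(z0, z, v, h) {
  sapply(z0, function(u) {
    w <- dnorm((z - u) / h)           # Gaussian kernel weights
    sum(w * v) / sum(w)
  })
}
h <- 0.1   # fixed for illustration; the CODE above chooses h with cvh()

# Step 1: remove the smooth effect of z from y and from each column of x
ry <- y - nw(z, z, y, h)
rx <- x - apply(x, 2, function(col) nw(z, z, col, h))

# Step 2: least squares of residual on residual gives the linear part
beta_hat <- unname(coef(lm(ry ~ rx - 1)))

# Step 3: smooth the partial residuals to estimate g at the new point
g_hat <- nw(0.5, z, y - drop(x %*% beta_hat), h)

# Prediction at X1 = 0, X2 = 2, X3 = 0.5
pred <- sum(c(0, 2) * beta_hat) + g_hat
print(round(beta_hat, 3))
print(pred)
```

The final line has the same form as the prediction in the CODE section: the linear coefficients times the new X1 and X2, plus the estimated nonlinear component at X3 = 0.5.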