
PROBLEM 1 ANSWERS.

1)

2) If we knew z^{(n)} for every x^{(n)}, the complete-data log-likelihood we would maximize is

\ell(\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln p(x^{(n)}, z^{(n)} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \Big[ \ln p(x^{(n)} \mid z^{(n)}; \mu, \Sigma) + \ln p(z^{(n)} \mid \pi) \Big]

We have been optimizing something similar for Gaussian Bayes classifiers.

We would get:

\mu_k = \frac{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]\, x^{(n)}}{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]}

\Sigma_k = \frac{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]\, (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^{T}}{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]}

\pi_k = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]
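For concreteness, these closed-form estimates are easy to compute once the labels are known. A minimal R sketch; the data matrix x, label vector z, and the function name are illustrative, not part of the original answer:

# Closed-form MLE for a Gaussian mixture when the component labels z are known.
# Assumes x is an N x d numeric matrix and z an integer vector with values in 1..K
# (both names are illustrative).
complete_data_mle <- function(x, z, K = max(z)) {
  N <- nrow(x)
  lapply(1:K, function(k) {
    xk <- x[z == k, , drop = FALSE]            # observations assigned to component k
    mu_k <- colMeans(xk)                       # component mean
    centered <- sweep(xk, 2, mu_k)             # subtract the mean from each row
    Sigma_k <- crossprod(centered) / nrow(xk)  # MLE covariance (divide by N_k)
    list(mu = mu_k, Sigma = Sigma_k, pi = nrow(xk) / N)
  })
}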

The maximum-likelihood estimates are then the parameter values q(y) for y \in \{1, \dots, k\} and q_j(x \mid y) for j \in \{1, \dots, d\}, y \in \{1, \dots, k\}, x \in \{-1, +1\} that maximize

L(\theta) = \sum_{i=1}^{n} \log q(y^{(i)}) + \sum_{i=1}^{n} \sum_{j=1}^{d} \log q_j(x_j^{(i)} \mid y^{(i)})

subject to the following constraints:

1. q(y) \geq 0 for all y \in \{1, \dots, k\}, and \sum_{y=1}^{k} q(y) = 1.
2. q_j(x \mid y) \geq 0 for all y, j, x, and for all y \in \{1, \dots, k\} and j \in \{1, \dots, d\}, \sum_{x \in \{-1, +1\}} q_j(x \mid y) = 1.

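For reference, solving this constrained maximization (e.g., with Lagrange multipliers) yields the usual count-based estimates; this step is not written out above, so the following is a supplementary note rather than part of the original answer:

q(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y^{(i)} = y]

q_j(x \mid y) = \frac{\sum_{i=1}^{n} \mathbf{1}[x_j^{(i)} = x,\; y^{(i)} = y]}{\sum_{i=1}^{n} \mathbf{1}[y^{(i)} = y]}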
PROBLEM 2 ANSWERS. 

1) Find a cubic polynomial

1a) f_1(x) = a_1 + b_1x + c_1x² + d_1x³

such that f(x) = f_1(x) for all x ≤ ξ. Express a_1, b_1, c_1, d_1 in terms of β_0, β_1, β_2, β_3, β_4.

For x ≤ ξ, we have

f(x) = β_0 + β_1x + β_2x² + β_3x³,

so we take a_1 = β_0, b_1 = β_1, c_1 = β_2 and d_1 = β_3.

1b) f_2(x) = a_2 + b_2x + c_2x² + d_2x³

such that f(x) = f_2(x) for all x > ξ. Express a_2, b_2, c_2, d_2 in terms of β_0, β_1, β_2, β_3, β_4. We have now established that f(x) is a piecewise polynomial.

For x > ξ, we have

f(x) = β_0 + β_1x + β_2x² + β_3x³ + β_4(x − ξ)³
     = (β_0 − β_4ξ³) + (β_1 + 3β_4ξ²)x + (β_2 − 3β_4ξ)x² + (β_3 + β_4)x³,

so we take a_2 = β_0 − β_4ξ³, b_2 = β_1 + 3β_4ξ², c_2 = β_2 − 3β_4ξ and d_2 = β_3 + β_4.

2) Show that f_1(ξ) = f_2(ξ). That is, f(x) is continuous at ξ.

We have

f_1(ξ) = β_0 + β_1ξ + β_2ξ² + β_3ξ³

and

f_2(ξ) = (β_0 − β_4ξ³) + (β_1 + 3β_4ξ²)ξ + (β_2 − 3β_4ξ)ξ² + (β_3 + β_4)ξ³ = β_0 + β_1ξ + β_2ξ² + β_3ξ³.

3) Show that f_1′(ξ) = f_2′(ξ). That is, f′(x) is continuous at ξ.

We have

f_1′(ξ) = β_1 + 2β_2ξ + 3β_3ξ²

and

f_2′(ξ) = (β_1 + 3β_4ξ²) + 2(β_2 − 3β_4ξ)ξ + 3(β_3 + β_4)ξ² = β_1 + 2β_2ξ + 3β_3ξ².

4) Show that f_1″(ξ) = f_2″(ξ). That is, f″(x) is continuous at ξ. Therefore, f(x) is indeed a cubic spline.

We have

f_1″(ξ) = 2β_2 + 6β_3ξ

and

f_2″(ξ) = 2(β_2 − 3β_4ξ) + 6(β_3 + β_4)ξ = 2β_2 + 6β_3ξ.
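As a sanity check, the three continuity conditions can also be verified numerically. A minimal R sketch, with arbitrary illustrative values for β_0, …, β_4 and ξ (these numbers are not from the original answer):

# Numerical check that f1 and f2, and their first two derivatives, agree at xi.
# The beta values and xi are arbitrary illustrative choices.
b  <- c(1, -2, 0.5, 3, -1.5)   # beta_0 ... beta_4
xi <- 2

f1 <- function(x) b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3
f2 <- function(x) (b[1] - b[5]*xi^3) + (b[2] + 3*b[5]*xi^2)*x +
  (b[3] - 3*b[5]*xi)*x^2 + (b[4] + b[5])*x^3

d1 <- function(g) function(x) (g(x + 1e-6) - g(x - 1e-6)) / 2e-6           # first derivative
d2 <- function(g) function(x) (g(x + 1e-4) - 2*g(x) + g(x - 1e-4)) / 1e-8  # second derivative

c(f1(xi) - f2(xi), d1(f1)(xi) - d1(f2)(xi), d2(f1)(xi) - d2(f2)(xi))       # all ~ 0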
PROBLEM 5 ANSWERS. 
1)
set.seed(1)
require(MASS); require(tidyverse); require(ggplot2); require(ggthemes)
require(broom); require(knitr); require(caret)
theme_set(theme_tufte(base_size = 14) + theme(legend.position = 'top'))
data('Boston')

# Cubic polynomial regression of nox on dis
model <- lm(nox ~ poly(dis, 3), data = Boston)
tidy(model) %>%
  kable(digits = 3)

term           estimate  std.error  statistic
(Intercept)       0.555      0.003    201.021
poly(dis, 3)1    -2.003      0.062    -32.271
poly(dis, 3)2     0.856      0.062     13.796
poly(dis, 3)3    -0.318      0.062     -5.124

Boston %>%
  mutate(pred = predict(model, Boston)) %>%
  ggplot() +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, pred, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
The model finds each polynomial term of dis to be statistically significant. On the plot, the fitted curve seems to describe the data well without overfitting.
2) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to
10), and report the associated residual sum of squares.
errors <- list()
models <- list()
pred_df <- data_frame(V1 = 1:506)

# Fit polynomials of degree 1 to 9, storing predictions and training RMSE
for (i in 1:9) {
  models[[i]] <- lm(nox ~ poly(dis, i), data = Boston)
  preds <- predict(models[[i]])
  pred_df[[i]] <- preds
  errors[[i]] <- sqrt(mean((Boston$nox - preds)^2))
}

errors <- unlist(errors)

names(pred_df) <- paste('Level', 1:9)


data_frame(RMSE = errors) %>%
  mutate(Poly = row_number()) %>%
  ggplot(aes(Poly, RMSE, fill = Poly == which.min(errors))) +
  geom_col() +
  guides(fill = FALSE) +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = c(min(errors), max(errors))) +
  labs(x = 'Polynomial Degree')
We see that, when the models are fitted and evaluated on the same data, the highest polynomial degree gives the lowest error.
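The exercise asks for the residual sum of squares, while the loop above records RMSE; since RSS = n · RMSE² with n = 506 observations, the two rankings agree. A small sketch to report RSS directly, reusing the models list fitted above:

# Training RSS for each polynomial degree (equivalent to 506 * RMSE^2)
rss <- sapply(models, function(m) sum(residuals(m)^2))
data_frame(Degree = 1:9, RSS = rss) %>%
  kable(digits = 4)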
Boston %>%
  cbind(pred_df) %>%
  gather(Polynomial, prediction, -(1:14)) %>%
  mutate(Polynomial = factor(Polynomial,
                             levels = unique(as.character(Polynomial)))) %>%
  ggplot() +
  ggtitle('Predicted Values for Each Level of Polynomial') +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, prediction, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  facet_wrap(~ Polynomial, nrow = 3)
3) Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
# 10-fold cross-validation over polynomial degrees 1 to 9
folds <- sample(1:10, 506, replace = TRUE)
errors <- matrix(NA, 10, 9)

for (k in 1:10) {
  for (i in 1:9) {
    model <- lm(nox ~ poly(dis, i), data = Boston[folds != k, ])
    pred <- predict(model, Boston[folds == k, ])
    errors[k, i] <- sqrt(mean((Boston$nox[folds == k] - pred)^2))
  }
}

errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(Poly = row_number()) %>%
  ggplot(aes(Poly, RMSE, fill = Poly == which.min(errors))) +
  geom_col() + theme_tufte() + guides(fill = FALSE) +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = range(errors))

4)
require(splines)
model <- lm(nox ~ bs(dis, df = 4), data = Boston)

kable(tidy(model), digits = 3)

term              estimate  std.error  statistic
(Intercept)          0.734      0.015     50.306
bs(dis, df = 4)1    -0.058      0.022     -2.658
bs(dis, df = 4)2    -0.464      0.024    -19.596
bs(dis, df = 4)3    -0.200      0.043     -4.634
bs(dis, df = 4)4    -0.389      0.046     -8.544

Boston %>%
  mutate(pred = predict(model)) %>%
  ggplot() +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, pred, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  theme_tufte(base_size = 13)

The model finds all of the spline basis functions to be statistically significant. The fitted curve seems to describe the data well without overfitting.
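A note on how the knots are chosen here: with df = 4 and a cubic basis, bs() selects df − degree = 1 interior knot, placed at a quantile of dis (the median by default). A quick way to inspect this, offered as a sketch rather than part of the original answer:

# Where did bs() put the interior knot for df = 4? (should match the median of dis)
attr(bs(Boston$dis, df = 4), 'knots')
median(Boston$dis)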
5)
errors <- list()
models <- list()
pred_df <- data_frame(V1 = 1:506)

# Fit regression splines with 1 to 9 degrees of freedom, recording training RMSE
for (i in 1:9) {
  models[[i]] <- lm(nox ~ bs(dis, df = i), data = Boston)
  preds <- predict(models[[i]])
  pred_df[[i]] <- preds
  errors[[i]] <- sqrt(mean((Boston$nox - preds)^2))
}

names(pred_df) <- paste(1:9, 'Degrees of Freedom')


errors <- unlist(errors)

data_frame(RMSE = errors) %>%
  mutate(df = row_number()) %>%
  ggplot(aes(df, RMSE, fill = df == which.min(errors))) +
  geom_col() + guides(fill = FALSE) + theme_tufte() +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = range(errors))
When fitted and evaluated on the same data, the most complex model again gives the lowest error.
Boston %>%
  cbind(pred_df) %>%
  gather(df, prediction, -(1:14)) %>%
  mutate(df = factor(df, levels = unique(as.character(df)))) %>%
  ggplot() + ggtitle('Predicted Values for Each Number of Degrees of Freedom') +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, prediction, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  facet_wrap(~ df, nrow = 3)

6)
# 10-fold cross-validation over spline degrees of freedom 1 to 9
folds <- sample(1:10, size = 506, replace = TRUE)
errors <- matrix(NA, 10, 9)

for (k in 1:10) {
  for (i in 1:9) {
    # note: the spline basis belongs on dis (the predictor), not on nox
    model <- lm(nox ~ bs(dis, df = i), data = Boston[folds != k, ])
    pred <- predict(model, Boston[folds == k, ])
    errors[k, i] <- sqrt(mean((Boston$nox[folds == k] - pred)^2))
  }
}

errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(df = row_number()) %>%
  ggplot(aes(df, RMSE, fill = df == which.min(errors))) +
  geom_col() + theme_tufte() + guides(fill = FALSE) +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = range(errors))

With validation on out-of-sample data, a less complex model is selected.
