
PROBLEM 1 ANSWERS.

1)

2) If we knew z^{(n)} for every x^{(n)}, the complete-data log-likelihood we would maximize is

\ell(\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln p(x^{(n)}, z^{(n)} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \Big[ \ln p(x^{(n)} \mid z^{(n)}; \mu, \Sigma) + \ln p(z^{(n)} \mid \pi) \Big]

We have been optimizing something similar for Gaussian Bayes classifiers.

We would get:

\mu_k = \frac{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]\, x^{(n)}}{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]}

\Sigma_k = \frac{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]\, (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^{T}}{\sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]}

\pi_k = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[z^{(n)} = k]
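For concreteness, these closed-form estimates are easy to compute once the labels are known. A minimal R sketch; the data matrix x, label vector z, and the function name are illustrative, not part of the original answer:

# Closed-form MLE for a Gaussian mixture when the component labels z are known.
# Assumes x is an N x d numeric matrix and z an integer vector with values in 1..K
# (both names are illustrative).
complete_data_mle <- function(x, z, K = max(z)) {
  N <- nrow(x)
  lapply(1:K, function(k) {
    xk <- x[z == k, , drop = FALSE]            # observations assigned to component k
    mu_k <- colMeans(xk)                       # component mean
    centered <- sweep(xk, 2, mu_k)             # subtract the mean from each row
    Sigma_k <- crossprod(centered) / nrow(xk)  # MLE covariance (divide by N_k)
    list(mu = mu_k, Sigma = Sigma_k, pi = nrow(xk) / N)
  })
}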

The maximum-likelihood estimates are then the parameter values q(y) for y \in \{1, \dots, k\} and q_j(x \mid y) for j \in \{1, \dots, d\}, y \in \{1, \dots, k\}, x \in \{-1, +1\} that maximize

L(\theta) = \sum_{i=1}^{n} \log q(y^{(i)}) + \sum_{i=1}^{n} \sum_{j=1}^{d} \log q_j(x_j^{(i)} \mid y^{(i)})

subject to the following constraints:

1. q(y) \geq 0 for all y \in \{1, \dots, k\}, and \sum_{y=1}^{k} q(y) = 1.
2. q_j(x \mid y) \geq 0 for all y, j, x, and for all y \in \{1, \dots, k\} and j \in \{1, \dots, d\}, \sum_{x \in \{-1, +1\}} q_j(x \mid y) = 1.

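For reference, solving this constrained maximization (e.g., with Lagrange multipliers) yields the usual count-based estimates; this step is not written out above, so the following is a supplementary note rather than part of the original answer:

q(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y^{(i)} = y]

q_j(x \mid y) = \frac{\sum_{i=1}^{n} \mathbf{1}[x_j^{(i)} = x,\; y^{(i)} = y]}{\sum_{i=1}^{n} \mathbf{1}[y^{(i)} = y]}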
PROBLEM 2 ANSWERS. 

1) Find a cubic polynomial

1a) f_1(x) = a_1 + b_1x + c_1x² + d_1x³

such that f(x) = f_1(x) for all x ≤ ξ. Express a_1, b_1, c_1, d_1 in terms of β_0, β_1, β_2, β_3, β_4.

For x ≤ ξ, we have

f(x) = β_0 + β_1x + β_2x² + β_3x³,

so we take a_1 = β_0, b_1 = β_1, c_1 = β_2 and d_1 = β_3.

1b) f_2(x) = a_2 + b_2x + c_2x² + d_2x³

such that f(x) = f_2(x) for all x > ξ. Express a_2, b_2, c_2, d_2 in terms of β_0, β_1, β_2, β_3, β_4. We have now established that f(x) is a piecewise polynomial.

For x > ξ, we have

f(x) = β_0 + β_1x + β_2x² + β_3x³ + β_4(x − ξ)³
     = (β_0 − β_4ξ³) + (β_1 + 3β_4ξ²)x + (β_2 − 3β_4ξ)x² + (β_3 + β_4)x³,

so we take a_2 = β_0 − β_4ξ³, b_2 = β_1 + 3β_4ξ², c_2 = β_2 − 3β_4ξ and d_2 = β_3 + β_4.

2) Show that f_1(ξ) = f_2(ξ). That is, f(x) is continuous at ξ.

We have

f_1(ξ) = β_0 + β_1ξ + β_2ξ² + β_3ξ³

and

f_2(ξ) = (β_0 − β_4ξ³) + (β_1 + 3β_4ξ²)ξ + (β_2 − 3β_4ξ)ξ² + (β_3 + β_4)ξ³ = β_0 + β_1ξ + β_2ξ² + β_3ξ³.

3) Show that f_1′(ξ) = f_2′(ξ). That is, f′(x) is continuous at ξ.

We have

f_1′(ξ) = β_1 + 2β_2ξ + 3β_3ξ²

and

f_2′(ξ) = (β_1 + 3β_4ξ²) + 2(β_2 − 3β_4ξ)ξ + 3(β_3 + β_4)ξ² = β_1 + 2β_2ξ + 3β_3ξ².

4) Show that f_1″(ξ) = f_2″(ξ). That is, f″(x) is continuous at ξ. Therefore, f(x) is indeed a cubic spline.

We have

f_1″(ξ) = 2β_2 + 6β_3ξ

and

f_2″(ξ) = 2(β_2 − 3β_4ξ) + 6(β_3 + β_4)ξ = 2β_2 + 6β_3ξ.
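As a sanity check, the three continuity conditions can also be verified numerically. A minimal R sketch, with arbitrary illustrative values for β_0, …, β_4 and ξ (these numbers are not from the original answer):

# Numerical check that f1 and f2, and their first two derivatives, agree at xi.
# The beta values and xi are arbitrary illustrative choices.
b  <- c(1, -2, 0.5, 3, -1.5)   # beta_0 ... beta_4
xi <- 2

f1 <- function(x) b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3
f2 <- function(x) (b[1] - b[5]*xi^3) + (b[2] + 3*b[5]*xi^2)*x +
  (b[3] - 3*b[5]*xi)*x^2 + (b[4] + b[5])*x^3

d1 <- function(g) function(x) (g(x + 1e-6) - g(x - 1e-6)) / 2e-6           # first derivative
d2 <- function(g) function(x) (g(x + 1e-4) - 2*g(x) + g(x - 1e-4)) / 1e-8  # second derivative

c(f1(xi) - f2(xi), d1(f1)(xi) - d1(f2)(xi), d2(f1)(xi) - d2(f2)(xi))       # all ~ 0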
PROBLEM 5 ANSWERS. 
1)
set.seed(1)
require(MASS); require(tidyverse); require(ggplot2); require(ggthemes)
require(broom); require(knitr); require(caret)
theme_set(theme_tufte(base_size = 14) + theme(legend.position = 'top'))
data('Boston')

# Cubic polynomial regression of nox on dis
model <- lm(nox ~ poly(dis, 3), data = Boston)
tidy(model) %>%
  kable(digits = 3)

term           estimate  std.error  statistic
(Intercept)       0.555      0.003    201.021
poly(dis, 3)1    -2.003      0.062    -32.271
poly(dis, 3)2     0.856      0.062     13.796
poly(dis, 3)3    -0.318      0.062     -5.124

Boston %>%
  mutate(pred = predict(model, Boston)) %>%
  ggplot() +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, pred, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
The model finds each polynomial term of dis to be statistically significant. On the plot, the fitted curve seems to describe the data well without overfitting.
2) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to
10), and report the associated residual sum of squares.
errors <- list()
models <- list()
pred_df <- data_frame(V1 = 1:506)

# Fit polynomials of degree 1 to 9, storing predictions and training RMSE
for (i in 1:9) {
  models[[i]] <- lm(nox ~ poly(dis, i), data = Boston)
  preds <- predict(models[[i]])
  pred_df[[i]] <- preds
  errors[[i]] <- sqrt(mean((Boston$nox - preds)^2))
}

errors <- unlist(errors)

names(pred_df) <- paste('Level', 1:9)


data_frame(RMSE = errors) %>%
  mutate(Poly = row_number()) %>%
  ggplot(aes(Poly, RMSE, fill = Poly == which.min(errors))) +
  geom_col() +
  guides(fill = FALSE) +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = c(min(errors), max(errors))) +
  labs(x = 'Polynomial Degree')
We see that, when the models are fitted and evaluated on the same data, the highest polynomial degree gives the lowest error.
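The exercise asks for the residual sum of squares, while the loop above records RMSE; since RSS = n · RMSE² with n = 506 observations, the two rankings agree. A small sketch to report RSS directly, reusing the models list fitted above:

# Training RSS for each polynomial degree (equivalent to 506 * RMSE^2)
rss <- sapply(models, function(m) sum(residuals(m)^2))
data_frame(Degree = 1:9, RSS = rss) %>%
  kable(digits = 4)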
Boston %>%
  cbind(pred_df) %>%
  gather(Polynomial, prediction, -(1:14)) %>%
  mutate(Polynomial = factor(Polynomial,
                             levels = unique(as.character(Polynomial)))) %>%
  ggplot() +
  ggtitle('Predicted Values for Each Level of Polynomial') +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, prediction, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  facet_wrap(~ Polynomial, nrow = 3)
3) Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
# 10-fold cross-validation over polynomial degrees 1 to 9
folds <- sample(1:10, 506, replace = TRUE)
errors <- matrix(NA, 10, 9)

for (k in 1:10) {
  for (i in 1:9) {
    model <- lm(nox ~ poly(dis, i), data = Boston[folds != k, ])
    pred <- predict(model, Boston[folds == k, ])
    errors[k, i] <- sqrt(mean((Boston$nox[folds == k] - pred)^2))
  }
}

errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(Poly = row_number()) %>%
  ggplot(aes(Poly, RMSE, fill = Poly == which.min(errors))) +
  geom_col() + theme_tufte() + guides(fill = FALSE) +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = range(errors))

4)
require(splines)
model <- lm(nox ~ bs(dis, df = 4), data = Boston)

kable(tidy(model), digits = 3)

term              estimate  std.error  statistic
(Intercept)          0.734      0.015     50.306
bs(dis, df = 4)1    -0.058      0.022     -2.658
bs(dis, df = 4)2    -0.464      0.024    -19.596
bs(dis, df = 4)3    -0.200      0.043     -4.634
bs(dis, df = 4)4    -0.389      0.046     -8.544

Boston %>%
  mutate(pred = predict(model)) %>%
  ggplot() +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, pred, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  theme_tufte(base_size = 13)

The model finds all of the spline basis functions to be statistically significant. The fitted curve seems to describe the data well without overfitting.
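A note on how the knots are chosen here: with df = 4 and a cubic basis, bs() selects df − degree = 1 interior knot, placed at a quantile of dis (the median by default). A quick way to inspect this, offered as a sketch rather than part of the original answer:

# Where did bs() put the interior knot for df = 4? (should match the median of dis)
attr(bs(Boston$dis, df = 4), 'knots')
median(Boston$dis)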
5)
errors <- list()
models <- list()
pred_df <- data_frame(V1 = 1:506)

# Fit regression splines with 1 to 9 degrees of freedom, recording training RMSE
for (i in 1:9) {
  models[[i]] <- lm(nox ~ bs(dis, df = i), data = Boston)
  preds <- predict(models[[i]])
  pred_df[[i]] <- preds
  errors[[i]] <- sqrt(mean((Boston$nox - preds)^2))
}

names(pred_df) <- paste(1:9, 'Degrees of Freedom')


errors <- unlist(errors)

data_frame(RMSE = errors) %>%
  mutate(df = row_number()) %>%
  ggplot(aes(df, RMSE, fill = df == which.min(errors))) +
  geom_col() + guides(fill = FALSE) + theme_tufte() +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = range(errors))
When fitted and evaluated on the same data, the most complex model again gives the lowest error.
Boston %>%
  cbind(pred_df) %>%
  gather(df, prediction, -(1:14)) %>%
  mutate(df = factor(df, levels = unique(as.character(df)))) %>%
  ggplot() + ggtitle('Predicted Values for Each Number of Degrees of Freedom') +
  geom_point(aes(dis, nox, col = '1')) +
  geom_line(aes(dis, prediction, col = '2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  facet_wrap(~ df, nrow = 3)

6)
# 10-fold cross-validation over spline degrees of freedom 1 to 9
folds <- sample(1:10, size = 506, replace = TRUE)
errors <- matrix(NA, 10, 9)

for (k in 1:10) {
  for (i in 1:9) {
    # note: the spline basis belongs on dis (the predictor), not on nox
    model <- lm(nox ~ bs(dis, df = i), data = Boston[folds != k, ])
    pred <- predict(model, Boston[folds == k, ])
    errors[k, i] <- sqrt(mean((Boston$nox[folds == k] - pred)^2))
  }
}

errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(df = row_number()) %>%
  ggplot(aes(df, RMSE, fill = df == which.min(errors))) +
  geom_col() + theme_tufte() + guides(fill = FALSE) +
  scale_x_continuous(breaks = 1:9) +
  coord_cartesian(ylim = range(errors))

With validation on out-of-sample data, a less complex model is selected.
