
Q1)

(a) The sample size n is extremely large, and the number of predictors p is small.

A flexible method would be better than an inflexible one. With a large sample size,
flexible methods have the capacity to learn more complex relationships in the data
without overfitting. Since the number of predictors is small, the risk of overfitting is
further reduced. An inflexible method, on the other hand, might not capture all the
underlying patterns present in the large sample.

(b) The number of predictors p is extremely large, and the number of observations n is
small.

A flexible method would be worse than an inflexible one. When the number of
predictors is high relative to the number of observations, flexible methods are likely to
overfit. Inflexible methods, in contrast, are more constrained and less likely to
overfit in these circumstances.
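
A minimal sketch of this in R, using simulated data (all numbers here are illustrative
assumptions, not part of the assignment): with p close to n, even ordinary least squares
can nearly interpolate pure noise.

set.seed(1)
n <- 30; p <- 25                 # many predictors relative to observations
X <- matrix(rnorm(n * p), n, p)  # predictors are pure noise
y <- rnorm(n)                    # response is unrelated to X
fit <- lm(y ~ X)
summary(fit)$r.squared           # very high on the training data despite no signal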

(c) The relationship between the predictors and response is highly non-linear.

A flexible method would be better than an inflexible one. Flexible methods can capture
non-linearities in the data because they have many degrees of freedom. If the true
relationship is highly non-linear, inflexible methods might not be able to capture it
effectively.
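
A minimal sketch in R with simulated data (the sine-shaped truth and all constants are
illustrative assumptions): a straight-line fit cannot follow a strongly non-linear signal,
while a flexible polynomial can.

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(2 * x) + rnorm(200, sd = 0.3)  # highly non-linear true relationship
fit_linear <- lm(y ~ x)                 # inflexible: a straight line
fit_poly <- lm(y ~ poly(x, 10))         # flexible: degree-10 polynomial
mean(resid(fit_linear)^2)               # large: the line cannot bend
mean(resid(fit_poly)^2)                 # much smaller: the curve is captured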

(d) The variance of the error terms, i.e., σ^2 = Var(ϵ), is extremely high.

A flexible method would be worse than an inflexible one. When the variance of the error
terms is high, there is a lot of noise in the data. Flexible methods might fit this
noise, mistaking it for a genuine pattern in the data (overfitting). Inflexible methods,
being more constrained, will not react as strongly to the noise, potentially leading to
more robust predictions in the presence of high-variance error terms.
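
A minimal sketch in R (simulated data; the linear truth and noise level are illustrative
assumptions): when Var(ϵ) is large, a flexible fit chases the noise and its held-out
error suffers.

set.seed(1)
n <- 100
x <- runif(n, -2, 2)
y <- x + rnorm(n, sd = 3)                            # true f is linear; Var(eps) is very high
train <- sample(n, n / 2)
fit_rigid <- lm(y ~ x, subset = train)               # inflexible
fit_flexible <- lm(y ~ poly(x, 15), subset = train)  # flexible
# held-out MSE: the flexible fit typically does worse because it fit the noise
mean((y - predict(fit_rigid, data.frame(x = x)))[-train]^2)
mean((y - predict(fit_flexible, data.frame(x = x)))[-train]^2)
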
Q3)
a)

b)
Squared Bias Curve: Less flexible models make stronger assumptions about the form of
the target function, which may not be accurate. As models become more flexible, they
can adapt better to the underlying data patterns, leading to a decrease in bias.

Variance Curve: With increasing flexibility, models become more sensitive to fluctuations
in training data, causing an increase in variance. A more flexible model can adjust more
to the noise of the training data, leading to high variance. Less flexible models can't
adjust as much to the training data, leading to low variance.

Training Error Curve: More flexible models fit the training data more closely, generally
resulting in a decrease in training error.

Test Error Curve: Initially, test error drops as bias decreases, but it rises again as models
overfit and variance increases, forming a U-shape. In the beginning, as the model
becomes more flexible, it starts to fit the underlying pattern better, reducing test error.
However, after a certain point, as the model becomes too flexible, it starts to fit the
noise in the training data, causing the test error to increase due to overfitting.

Bayes (Irreducible) Error Curve: This represents inherent noise in the system, remaining
constant regardless of model flexibility.
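
These curves are tied together by the bias-variance decomposition of the expected test
error at a point x0 (with σ^2 = Var(ϵ) as in Q1):

E[(y0 − f̂(x0))^2] = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + σ^2

The first two terms trace the variance and squared-bias curves above, and the constant
σ^2 is the Bayes error floor, which is why the test error curve can never drop below it.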
Q9)

a)
quantitative predictors: mpg, displacement, horsepower, weight,
acceleration, and year

qualitative predictors: cylinders, name, and origin.
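
One quick way to check this in R (a sketch, assuming the Auto data frame as loaded in
the code section below) is to inspect each column's class; note that origin is stored as
a number even though it encodes a category:

sapply(data, class)   # name is a factor; origin is coded 1/2/3 but is categorical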

b)

c)
Mean and standard deviation of each quantitative variable (values computed in the code
section below).

d)
e)

Based on the scatterplot matrix, we can see that more displacement comes with lower mpg.
It also indicates that more weight comes with lower mpg.
The plot also indicates that more horsepower comes with lower mpg.
Year has a weaker association with mpg.

f)
It seems all the quantitative predictors are correlated with gas mileage. The scatterplots
show a clear, roughly linear trend between mpg and each of the other quantitative
variables, with no unusual patterns. Displacement, horsepower, and weight have a strong
negative correlation with mpg. Year and acceleration have positive but weaker
correlations with mpg. So all of these quantitative variables could help predict mpg.
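
A quick numerical check of these visual impressions (a sketch, assuming the cleaned
data frame built in the code section below):

quant_vars <- c("mpg", "displacement", "horsepower", "weight",
                "acceleration", "year")
round(cor(data[, quant_vars]), 2)   # strongly negative for displacement, horsepower,
                                    # and weight; weaker positive for acceleration and year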
Code:
Q3)
setwd("/Users/Danchenxu/Desktop")
data <- read.csv("Auto.csv", header = T, na.strings = "?", stringsAsFactors = T)
data <- na.omit(data)
install.packages("ggplot2")
library(ggplot2)
df <- data.frame(
  flexibility = seq(-10, 10, by = 0.1),  # 201 grid points for the x-axis
  bayes_error = rep(0.2, 201)            # irreducible error is constant
)

ggplot(df, aes(x = flexibility, y = bayes_error)) +
  geom_line(color = "blue") +
  labs(title = "Bayes Error Curve", x = "Flexibility", y = "Error") +
  theme_minimal()

install.packages("ggplot2")
library(ggplot2)
flexibility <- seq(0, 1, length.out = 100)
bias_squared <- (1 - flexibility)^2
variance <- flexibility^2
training_error <- 0.2 - 0.1
test_error <- bias_squared + variance
bayes_error <- rep(0.2, length(flexibility))

df <- data.frame(flexibility, bias_squared, variance,
                 training_error, test_error, bayes_error)

ggplot(df, aes(x = flexibility)) +
  geom_line(aes(y = bias_squared, color = "Bias^2")) +
  geom_line(aes(y = variance, color = "Variance")) +
  geom_line(aes(y = training_error, color = "Training Error")) +
  geom_line(aes(y = test_error, color = "Test Error")) +
  geom_line(aes(y = bayes_error, color = "Bayes Error")) +
  labs(title = "Bias-Variance Tradeoff", y = "Error") +
  theme_minimal() +
  scale_color_manual(values = c("Bias^2" = "red",
                                "Variance" = "blue",
                                "Training Error" = "green",
                                "Test Error" = "purple",
                                "Bayes Error" = "orange"),
                     name = "Curves")
Q9)

setwd("/Users/Danchenxu/Desktop")
b)
data <- read.csv("Auto.csv", header = T, na.strings = "?", stringsAsFactors = T)
data <- na.omit(data)
range_data <- sapply(data[, 1:6], range)
data <- subset(data, select = -cylinders)

c)
mean1 <- sapply(data[, 1:6], mean)  # mean of each quantitative variable
sd1 <- sapply(data[, 1:6], sd)      # standard deviation of each quantitative variable
print(mean1)
print(sd1)
df_range <- as.data.frame(t(range_data))
colnames(df_range) <- c("Minimum", "Maximum")
print(df_range)

d)
data2 <- data[-(10:85), ]  # remove the 10th through 85th observations

range_values <- sapply(data2[, 1:6], range)
means <- sapply(data2[, 1:6], mean)
sds <- sapply(data2[, 1:6], sd)
print(data.frame(Min = range_values[1, ], Max = range_values[2, ],
                 Means = means, SDs = sds))

e)
selected_vars <- data[, c("mpg", "displacement", "horsepower", "weight",
                          "acceleration", "year")]
pairs(selected_vars)                        # scatterplot matrix of the quantitative variables
pairs(selected_vars, panel = panel.smooth)  # same, with smoothed trend lines
