
Linear Regression

Tianchen Xu

2023-10-16
#1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond.
Explain what conclusions you can draw based on these p-values. Your explanation
should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms
of the coefficients of the linear model.
Intercept: The null hypothesis for the intercept is that the expected value of sales is zero
when all three advertising budgets (TV, radio, and newspaper) are zero.
TV: The null hypothesis is that, holding radio and newspaper spending fixed, TV advertising
expenditure has no relationship with sales. In other words, changes in the TV budget have no
effect on sales.
Radio: The null hypothesis is that, holding TV and newspaper spending fixed, radio advertising
expenditure has no relationship with sales. In other words, changes in the radio budget have
no effect on sales.
Newspaper: The null hypothesis is that, holding TV and radio spending fixed, newspaper
advertising expenditure has no relationship with sales. In other words, changes in the
newspaper budget have no effect on sales.
Intercept: The p-value is less than 0.0001, which is typically considered statistically
significant. This means we can reject the null hypothesis for the intercept, but, as
mentioned, this isn’t typically of substantive interest.
TV: The p-value is less than 0.0001, which means that it is very unlikely that we would
observe such a relationship between TV advertising and sales due to random chance alone.
We can reject the null hypothesis and conclude that there is a statistically significant
relationship between TV advertising and sales.
Radio: Similarly, the p-value for radio is less than 0.0001, suggesting a statistically
significant relationship between radio advertising and sales. We can reject the null
hypothesis for radio.
Newspaper: The p-value for newspaper is 0.8599, which is not statistically significant at
common significance levels (such as 0.05). Therefore, we fail to reject the null hypothesis for
newspaper: there is no evidence of a relationship between newspaper advertising and sales
once TV and radio advertising are accounted for.
In conclusion, based on the p-values provided: TV advertising is significantly associated
with sales. Radio advertising is significantly associated with sales. Newspaper advertising
is not significantly associated with sales.
###3. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1
for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction
between GPA and Gender. The response is starting salary after graduation (in thousands of
dollars). Suppose we use least squares to fit the model, and get ˆβ0 = 50, ˆβ1 = 20, ˆβ2 = 0.07,
ˆβ3 = 35, ˆβ4 = 0.01, ˆβ5 = −10. The fitted model is
Y = ˆβ0 + ˆβ1×GPA + ˆβ2×IQ + ˆβ3×Gender + ˆβ4×(GPA×IQ) + ˆβ5×(GPA×Gender).
(a) Which answer is correct, and why? i. For a fixed value of IQ and GPA, males earn more
on average than females. ii. For a fixed value of IQ and GPA, females earn more on average
than males. iii. For a fixed value of IQ and GPA, males earn more on average than females
provided that the GPA is high enough. iv. For a fixed value of IQ and GPA, females earn more
on average than males provided that the GPA is high enough.
iii) is correct. For males (Gender = 0) the fitted model is
Y = 50 + 20×GPA + 0.07×IQ + 0.01×GPA×IQ.
For females (Gender = 1) the fitted model is
Y = 50 + 20×GPA + 35 + 0.07×IQ + 0.01×GPA×IQ − 10×GPA = 85 + 20×GPA + 0.07×IQ + 0.01×GPA×IQ − 10×GPA.
Apart from terms common to both (20×GPA, 0.07×IQ, 0.01×GPA×IQ), we are comparing
85 − 10×GPA for females with 50 for males. Once GPA > 3.5, 50 exceeds 85 − 10×GPA, so for a
fixed IQ, males earn more on average than females provided the GPA is high enough.
(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.
Y = 85 + 20×4 + 0.07×110 + 0.01×4×110 − 10×4 = 85 + 80 + 7.7 + 4.4 − 40 = 137.1, i.e. a
predicted starting salary of 137.1 thousand dollars.
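A quick way to double-check this arithmetic is to evaluate the fitted equation in R (a minimal sketch; the variable names are illustrative, not from the original exercise):
# coefficient estimates given in the question
b0 <- 50; b1 <- 20; b2 <- 0.07; b3 <- 35; b4 <- 0.01; b5 <- -10
# female (Gender = 1) with IQ = 110 and GPA = 4.0
gpa <- 4.0; iq <- 110; gender <- 1
b0 + b1*gpa + b2*iq + b3*gender + b4*gpa*iq + b5*gpa*gender  # 137.1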
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small,
there is very little evidence of an interaction effect. Justify your answer.
False. The magnitude of a coefficient depends on the scales of the variables, so a small
coefficient by itself says nothing about the strength of the evidence. Whether there is an
interaction effect should be judged from the t-statistic and p-value of the GPA/IQ interaction
term; a small p-value would indicate evidence of an interaction even though the coefficient is
numerically small.

#6. Using (3.4), argue that in the case of simple linear regression, the least squares line
always passes through the point (¯x, ¯y).

From (3.4), the least squares estimates in simple linear regression satisfy ˆβ0 = ¯y − ˆβ1×¯x,
and the fitted line is ˆy = ˆβ0 + ˆβ1×x.

Evaluating the fitted line at x = ¯x gives ˆy = ˆβ0 + ˆβ1×¯x = (¯y − ˆβ1×¯x) + ˆβ1×¯x = ¯y.
The fitted value at ¯x is therefore exactly ¯y, which shows that the least squares line always
passes through the point (¯x, ¯y).
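A quick numerical illustration (a sketch with simulated data, not part of the original exercise): the fitted value at the mean of x matches the mean of y up to floating-point error.
set.seed(1)
x0 <- rnorm(50)
y0 <- 1 + 2 * x0 + rnorm(50)
fit <- lm(y0 ~ x0)
# fitted value at mean(x0) minus mean(y0): essentially zero
predict(fit, data.frame(x0 = mean(x0))) - mean(y0)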
###8. This question involves the use of simple linear regression on the Auto data set. (a)
Use the lm() function to perform a simple linear regression with mpg as the response and
horsepower as the predictor. Use the summary() function to print the results. Comment on
the output. For example: i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response? iii. Is the
relationship between the predictor and the response positive or negative? iv. What is the
predicted mpg associated with a horsepower of 98? What are the associated 95%
confidence and prediction intervals?
setwd("/Users/tianchenxu/Desktop")
data<-read.csv("Auto.csv", header=T, na.strings="?")
Auto = na.omit(data)
summary(model<-lm(mpg~horsepower, data=Auto))

##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16

i) Under the null hypothesis that there is no relationship between horsepower and mpg, the
p-value (< 2e-16) is far below the 5% significance level, so we reject the null hypothesis
and conclude that there is a relationship between mpg and horsepower.
ii) The R-squared is 0.6059, meaning that about 60.6% of the variance in mpg is explained
by horsepower, which indicates a reasonably strong relationship.
iii) The slope estimate is −0.1578, so the relationship is negative: each additional unit of
horsepower is associated with a decrease of about 0.158 mpg.
iv)
predict (model, data.frame(horsepower=c(98)), interval="confidence")

## fit lwr upr
## 1 24.46708 23.97308 24.96108

predict(model, data.frame(horsepower=c(98)), interval="prediction")


## fit lwr upr
## 1 24.46708 14.8094 34.12476

#(b) Plot the response and the predictor. Use the abline() function to display the least
squares regression line.
plot(Auto$horsepower, Auto$mpg)
abline(model)

#(c) Use the plot () function to produce diagnostic plots of the least squares regression fit.
Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(model)
Residuals vs Fitted: Indicates potential non-linearity.
Normal Q-Q: Minor deviations suggest slight non-normality of residuals.
Scale-Location: Funnel shape suggests non-constant variance (heteroscedasticity).
Residuals vs Leverage: A few points suggest potential undue influence on the model.
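To back up these observations numerically, the leverage values and studentized residuals can be inspected directly (an optional sketch using base R helpers; model is the lm fit from part (a)):
lev <- hatvalues(model)     # leverage of each observation
stud <- rstudent(model)     # studentized residuals
head(sort(lev, decreasing = TRUE))        # observations with the highest leverage
head(sort(abs(stud), decreasing = TRUE))  # observations with the largest residuals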
11. In this problem we will investigate the t-statistic for the null hypothesis H0: β
= 0 in simple linear regression without an intercept. To begin, we generate a
predictor x and a response y as follows. (Bonus)
set.seed (1)
x=rnorm (100)
y=2*x+rnorm (100)

(a) Perform a simple linear regression of y onto x, without an intercept. Report the
coefficient estimate ˆβ, the standard error of this coefficient estimate, and the t-
statistic and p-value associated with the null hypothesis H0: β = 0. Comment on
these results. (You can perform regression without an intercept using the command
lm(y∼x+0).)
summary(lm(y~x+0))

##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16

The coefficient estimate is 1.9939, the standard error is 0.1065, the t-statistic is 18.73, and
the p-value is < 2.2e-16. Since the p-value is far below 0.05 (essentially 0), we reject the
null hypothesis H0: β = 0. The estimate is also close to the true value of 2 used to generate y.

#(b) Now perform a simple linear regression of x onto y without an intercept, and report
the coefficient estimate, its standard error, and the corresponding t-statistic and p-values
associated with the null hypothesis H0: β = 0. Comment on these results.
summary(lm(x~y+0))

##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16

The coefficient estimate is 0.39111, the standard error is 0.02089, the t-statistic is 18.73, and
the p-value is < 2.2e-16. Since the p-value is essentially 0, we reject the null hypothesis
H0: β = 0. Note that 0.39111/0.02089 ≈ 18.73, the same t-statistic obtained in (a).

#(c) What is the relationship between the results obtained in (a) and (b)?
The t-statistics in (a) and (b) are identical (18.73). Both regressions describe the same
underlying linear relationship between x and y, just with the roles of predictor and response
swapped: the fit in (a) gives y ≈ 1.99x, while the fit in (b) gives x ≈ 0.39y (not exactly
1/1.99 ≈ 0.5 because of the noise in the data).
#(d) For the regression of Y onto X without an intercept, the t-statistic for H0: β = 0 takes
the form ˆβ/SE(ˆβ), where ˆβ is given by (3.38).

(sqrt(length(x)-1) * sum(x*y)) / (sqrt(sum(x*x) * sum(y*y) - (sum(x*y))^2))

## [1] 18.72593
#e) Using the results from (d), argue that the t-statistic for the regression of y onto x is the
same as the t-statistic for the regression of x onto y.
The formula in (d) is symmetric in x and y: swapping x and y leaves both the numerator
sum(x*y) and the denominator sqrt(sum(x^2)*sum(y^2) - sum(x*y)^2) unchanged, so the
t-statistic for the regression of y onto x is the same as that for the regression of x onto y.
Numerically, 1.9939/0.1065 ≈ 18.73 in (a) and 0.39111/0.02089 ≈ 18.73 in (b): the strength
of the linear relationship is the same regardless of which variable is treated as the response.
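This symmetry can also be checked directly in R (a small sketch reusing the x and y generated above; t_stat is an illustrative helper, not a built-in function):
t_stat <- function(a, b) {  # formula from part (d)
  (sqrt(length(a) - 1) * sum(a * b)) / sqrt(sum(a^2) * sum(b^2) - sum(a * b)^2)
}
t_stat(x, y)  # t-statistic for regressing y onto x
t_stat(y, x)  # t-statistic for regressing x onto y: identical value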

#f) In R, show that when regression is performed with an intercept, the t-statistic for H0: β1
= 0 is the same for the regression of y onto x as it is for the regression of x onto y.
# Linear regression of y onto x
model1 <- lm(y ~ x)
summary1 <- summary(model1)
t_stat_1 <- coef(summary1)["x", "t value"]
t_stat_1

## [1] 18.5556

# Linear regression of x onto y


model2 <- lm(x ~ y)
summary2 <- summary(model2)
t_stat_2 <- coef(summary2)["y", "t value"]
t_stat_2

## [1] 18.5556

Both models give the same t-statistic, 18.56.

14) This problem focuses on the collinearity problem. Perform the following
commands in R:
set.seed (1)
x1=runif (100)
x2 =0.5* x1+rnorm (100) /10
y=2+2* x1 +0.3* x2+rnorm (100)

a) The last line corresponds to creating a linear model in which y is a function of x1 and x2.
Write out the form of the linear model. What are the regression coefficients?
Y = 2 + 2×X1 + 0.3×X2 + error term
β0 = 2, β1 = 2, β2 = 0.3

(b) What is the correlation between x1 and x2? Create a scatterplot displaying the
relationship between the variables
cor(x1, x2)

## [1] 0.8351212
plot(x1, x2)

The correlation between x1 and x2 is 0.8351, indicating strong collinearity between the two
predictors.

(c) Using this data, fit the least squares regression to predict y using x1 and x2. Describe
the results obtained. What are ˆ β0, ˆ β1, and ˆ β2? How do these relate to the true β0,
β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null
hypothesis H0: β2 = 0?
summary(lm(y~x1+x2))

##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The fitted model is Y = 2.1305 + 1.4396×x1 + 1.0097×x2, so ˆβ0 = 2.1305, ˆβ1 = 1.4396, and
ˆβ2 = 1.0097. ˆβ0 is close to the true β0 = 2, but because x1 and x2 are highly correlated,
ˆβ1 = 1.44 underestimates the true β1 = 2 and ˆβ2 = 1.01 overestimates the true β2 = 0.3, and
both standard errors are large.
For the null hypothesis H0: β1 = 0, we can (barely) reject it, since the p-value 0.0487 is just
below 0.05.
For the null hypothesis H0: β2 = 0, we fail to reject it, since the p-value is 0.3754.

d) Now fit the least squares regression to predict y using only x1. Comment on your
results. Can you reject the null hypothesis H0: β1 = 0?
summary(lm(y~x1))

##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06

The p-value for x1 is 2.66e-06, far below 5%, so we reject H0: β1 = 0; on its own, x1 is
strongly associated with y.

e) Now fit the least squares regression to predict y using only x2. Comment on your
results. Can you reject the null hypothesis H0: β1 = 0?
summary(lm(y~x2))

##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05

The p-value for x2 is 1.37e-05, essentially 0, so we also reject H0; on its own, x2 is strongly
associated with y.

f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.
No, the results do not contradict each other. x1 and x2 are strongly correlated (0.83), so when
both are included in the same model it is hard to separate their individual effects: the
standard errors are inflated and x2 (and nearly x1) appears insignificant. When each predictor
is used on its own, the collinearity is no longer an issue and each shows a clearly significant
relationship with y.
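This can be made concrete by comparing the standard error of the x1 coefficient with and without x2 in the model, and by computing a variance inflation factor by hand (an optional sketch; the numbers refer to the original 100 observations, before the extra point in (g) is added):
coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"]  # ~0.72: inflated by collinearity
coef(summary(lm(y ~ x1)))["x1", "Std. Error"]       # ~0.40: with x2 removed
1 / (1 - summary(lm(x1 ~ x2))$r.squared)            # VIF of x1, roughly 3.3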
g) Now suppose we obtain one additional observation, which was unfortunately
mismeasured.
x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y,6)

Re-fit the linear models from (c) to (e) using this new data. What effect does this new
observation have on each of the models? In each model, is this observation an outlier? A
high-leverage point? Both? Explain your answers.
summary(lm(y~x1+x2))

##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
model1<-lm(y~x1+x2)

summary(lm(y~x1))

##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05

model2<-lm(y~x1)

summary(lm(y~x2))

##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06

model3<-lm(y~x2)

par(mfrow=c(2,2))
plot(model1)
plot(model2)

plot(model3)
Based on the diagnostic plots of model1 (y ~ x1 + x2), the new observation (row 101) is a
high-leverage point: its combination of x1 = 0.1 and x2 = 0.8 breaks the collinearity pattern
of the other observations, and it pulls the fit enough that x2 becomes the significant predictor
while x1 no longer is. Its residual is not extreme, so it is not a clear outlier.
Based on the diagnostic plots of model2 (y ~ x1), x1 = 0.1 lies well within the range of the
other x1 values, so the point has low leverage, but it has by far the largest residual (about
3.57), so it is an outlier.
Based on the diagnostic plots of model3 (y ~ x2), x2 = 0.8 lies beyond the range of the other
x2 values, so the point has high leverage, but its residual is unremarkable, so it is not an
outlier.
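These claims can be checked numerically by pulling out the leverage and studentized residual of the new observation in each fit (an optional sketch; check_point is an illustrative helper):
check_point <- function(m, i = 101) {
  c(leverage = hatvalues(m)[[i]], rstudent = rstudent(m)[[i]])
}
rbind(model1 = check_point(model1),
      model2 = check_point(model2),
      model3 = check_point(model3))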
BONUS QUESTION
If you were to conduct the simulation in a) many times (with different draws for the
random variables) so that you had a large number of datasets, what would you expect the
average values (over all simulations) of the coefficients from the regression in c) to be?
Explain your answer.

If we repeated the simulation many times, generating a new dataset from the same setup each
time and fitting the regression from (c) to each one, the average of the estimated coefficients
across simulations would be approximately the true coefficients, because the least squares
estimates are unbiased; the collinearity between x1 and x2 inflates their variance from one
simulation to the next, but not their average.
The average estimated intercept would be close to 2.
The average estimated coefficient for x1 would be close to 2.
The average estimated coefficient for x2 would be close to 0.3.
These are the values used in the data-generating process, so with enough simulations the
average of the estimated coefficients should settle down around them.
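A short simulation sketch makes this concrete (the seed and the number of replications here are arbitrary choices):
set.seed(2)
coefs <- replicate(2000, {
  x1 <- runif(100)
  x2 <- 0.5 * x1 + rnorm(100) / 10
  y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
  coef(lm(y ~ x1 + x2))
})
rowMeans(coefs)  # should be close to the true values 2, 2, 0.3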
