2/5/2022
2148135
library(AER) # provides the CollegeDistance data set
## Warning: package 'zoo' was built under R version 4.1.2
##
## Attaching package: 'zoo'
data("CollegeDistance")
head(CollegeDistance)
## gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 male other 39.15 yes no yes yes 6.2 8.09 0.2
## 2 female other 48.87 no no yes yes 6.2 8.09 0.2
## 3 male other 48.74 no no yes yes 6.2 8.09 0.2
## 4 male afam 40.40 no no yes yes 6.2 8.09 0.2
## 5 female other 40.48 no no no yes 5.6 8.09 0.4
## 6 male other 54.71 no no yes yes 5.6 8.09 0.4
## tuition education income region
## 1 0.88915 12 high other
## 2 0.88915 12 low other
## 3 0.88915 12 low other
## 4 0.88915 12 low other
## 5 0.88915 13 low other
## 6 0.88915 12 low other
attach(CollegeDistance)
# Let us consider education as the dependent variable
y <- education
# and score, tuition and distance as the independent variables x1, x2 and x3.
x1 <- score
x2 <- tuition
x3 <- distance
The scatter plot matrix suggests that the dependent variable, education, has an
approximately linear relationship with each of the independent variables: score, tuition
and distance. It also does not appear that any of our predictor variables are strongly
correlated with one another.
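The call that produced the scatter plot matrix is not shown in this report. A minimal sketch of one way to draw it uses base R's pairs() and assumes the AER package, which ships CollegeDistance, is installed. The `vars` vector and the plotting options are our own illustrative choices.

```r
# Variables shown in the scatter plot matrix (response first, then predictors).
vars <- c("education", "score", "tuition", "distance")

# Assumption: the AER package is installed; the plot is skipped otherwise.
if (requireNamespace("AER", quietly = TRUE)) {
  data("CollegeDistance", package = "AER")
  # pch = "." keeps the several thousand points legible.
  pairs(CollegeDistance[vars], pch = ".",
        main = "Scatter plot matrix: education and predictors")
}
```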
To confirm that there is no strong correlation between the independent variables, we
obtain the correlation matrix.
# Obtaining correlation matrix
cor(CollegeDistance[c("education", "score", "tuition","distance")])
The pairwise correlations between our predictor variables are very low, so we conclude
that multicollinearity is not a concern in this model.
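Low pairwise correlations do not by themselves rule out multicollinearity that involves several predictors at once. A complementary check is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The helper `vif_manual` below is a hypothetical name and is not part of the original analysis; values close to 1 would support the conclusion above.

```r
# Variance inflation factor for each predictor, computed with base R only.
# `vif_manual` is an illustrative helper, not a function from any package.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on the remaining predictors.
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Assumption: the AER package is installed; the check is skipped otherwise.
if (requireNamespace("AER", quietly = TRUE)) {
  data("CollegeDistance", package = "AER")
  print(vif_manual(CollegeDistance, c("score", "tuition", "distance")))
}
```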
# Fitting the multiple linear regression.
model <- lm(y ~ x1 + x2 + x3)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8279 -1.1831 -0.2465 1.2225 5.3621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.156801 0.144677 63.291 < 2e-16 ***
## x1 0.095470 0.002664 35.839 < 2e-16 ***
## x2 -0.143689 0.068471 -2.099 0.0359 *
## x3 -0.050134 0.010057 -4.985 6.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.58 on 4735 degrees of freedom
## Multiple R-squared: 0.2209, Adjusted R-squared: 0.2204
## F-statistic: 447.6 on 3 and 4735 DF, p-value: < 2.2e-16
INTERPRETATIONS:
Here “education” (y) is the dependent variable, and “score”, “tuition” and “distance” are
the independent variables. We obtain the regression equation
education = 9.15680 + 0.09547 * score - 0.14369 * tuition - 0.05013 * distance.
For every 1-point increase in achievement test score, the predicted years of education of
a student increase by 0.0955 years, holding the average tuition for a 4-year college in the
student’s state and the distance from a 4-year college constant. For every 1-unit increase
in average tuition in the student’s state (in 1000’s of dollars), the predicted years of
education decrease by 0.144 years, holding achievement test score and distance from a
4-year college constant. For every 1-unit increase in the distance the student lives from a
4-year college (in 10’s of miles), the predicted years of education decrease by 0.05 years,
holding achievement test score and average tuition in the student’s state constant.
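As an illustration of these interpretations, the fitted equation can be evaluated directly. The sketch below uses the coefficient estimates from the summary output. The function name `predict_education` and the example student (score 50, average tuition of 0.9 thousand dollars, 10 miles from a 4-year college) are hypothetical.

```r
# Coefficient estimates copied from the summary(model) output.
coefs <- c(intercept = 9.156801, score = 0.095470,
           tuition = -0.143689, distance = -0.050134)

# Predicted years of education from the fitted regression equation.
# tuition is in 1000's of dollars, distance in 10's of miles.
predict_education <- function(score, tuition, distance) {
  unname(coefs["intercept"] + coefs["score"] * score +
           coefs["tuition"] * tuition + coefs["distance"] * distance)
}

base <- predict_education(score = 50, tuition = 0.9, distance = 1)
base

# Raising the score by one point, with tuition and distance held fixed,
# changes the prediction by exactly the score coefficient.
predict_education(score = 51, tuition = 0.9, distance = 1) - base
```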
The intercept is not meaningful to interpret. Even a student who scored zero on the
achievement test, faced an average state tuition for a 4-year college of zero dollars and
lived zero miles from a 4-year college would still have at least 12 years of education,
because students had to have graduated to be part of this data set, yet our intercept is
less than 12. We cannot use this model to determine anything about students who did not
graduate.
The adjusted coefficient of determination (adjusted R-squared) is 0.2204, which implies
that about 22% of the total variability in the response variable, “education”, is explained
by the regressors.
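The adjusted R-squared can be recovered from the multiple R-squared in the summary output using adj R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). A short check follows, taking n = 4739 from the 4735 residual degrees of freedom plus the 4 estimated parameters. The variable names are ours.

```r
r2 <- 0.2209   # Multiple R-squared from summary(model)
p  <- 3        # number of predictors (score, tuition, distance)
n  <- 4735 + p + 1  # residual degrees of freedom plus estimated parameters

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

With a sample this large relative to three predictors, the adjustment is tiny, which is why the two values in the summary output are nearly identical.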
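# Testing the significance of the regression parameters using the t-test.
# n = 4739 observations and 4 estimated parameters give 4739 - 4 = 4735 degrees of freedom.
# t_tab = t(4739-4, 0.025); t_tab = 1.96.
# t1 = |beta1|/SE(beta1)
t1=0.09547/0.002664;t1
## [1] 35.83709
# t2 = |beta2|/SE(beta2)
t2=0.14369/0.068471;t2
## [1] 2.098553
# t3 = |beta3|/SE(beta3)
t3=0.05013/0.010057;t3
## [1] 4.984588
# Each |t| exceeds t_tab, so every coefficient is significant at the 5% level.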
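Rather than reading the critical value from a table, it can be computed with qt(), and the two-sided p-values with pt(). The sketch below uses the t values reported by summary(model) and the 4735 residual degrees of freedom. The variable names are illustrative.

```r
df <- 4735                      # residual degrees of freedom from summary(model)
t_crit <- qt(0.975, df)         # two-sided critical value at the 5% level

# t values as reported in the coefficient table of summary(model).
t_stats <- c(x1 = 35.839, x2 = -2.099, x3 = -4.985)

# Reject H0: beta_j = 0 when |t| exceeds the critical value.
reject <- abs(t_stats) > t_crit

# Two-sided p-values from the t distribution.
p_values <- 2 * pt(abs(t_stats), df, lower.tail = FALSE)

data.frame(t = t_stats, reject = reject, p_value = p_values)
```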
CONCLUSIONS:
We fitted a multiple linear regression to assess how score, distance and tuition affect the
number of years of education a student had attained six years after graduation. We can
conclude that:
From the scatter plot matrix, it does not appear that any of our predictor variables are
strongly correlated with one another. The pairwise correlations in the correlation matrix
are also very low. Hence, we conclude that multicollinearity is not a concern in this model.
We found the regression equation to be:
education = 9.15680 + 0.09547 * score - 0.14369 * tuition - 0.05013 * distance.
Using the t-test, we tested the significance of the parameters. We found that each
independent variable (score, tuition and distance) has a statistically significant linear
relationship with the dependent variable, education, at the 5% level.