
Predicting the number of years of education attained six years after graduation using Multiple Linear Regression


MST 271 - Regression Analysis - Lab Assignment No. 3
2148135_KEERTHANA A

2/5/2022

About the dataset:


The dataset used here is CollegeDistance from the R package AER. It is a cross-sectional
data set from a survey conducted by the Department of Education in 1980, with a follow-up
in 1986. It contains 14 variables describing the students surveyed, their families and the
areas in which they live.
Objectives: To predict the number of years of education a student had attained six years
after graduation, using Multiple Linear Regression.
• Plot a matrix of scatter diagrams between the variables of interest, find the
matrix of correlation coefficients and interpret it. Are the regressors independent
of each other? Justify your answer.
• Fit a multiple linear regression model and interpret the estimated coefficients.
• Test the significance of the regression parameters using the t-test and interpret it.
Analysis:
Though the data set contains 14 variables, the models here use only four of them: score, the
achievement test score obtained during the student’s senior year of high school; distance, the
distance the student lives from a 4-year college (in 10’s of miles); tuition, the average 4-year
college tuition in the student’s state (in 1000’s of dollars); and education, the number of years
of education attained 6 years after high school graduation.
#install.packages("AER")
library(AER)

## Warning: package 'AER' was built under R version 4.1.2

## Loading required package: car

## Loading required package: carData

## Loading required package: lmtest

## Warning: package 'lmtest' was built under R version 4.1.2

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.1.2

##
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':


##
## as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 4.1.2

## Loading required package: survival

data("CollegeDistance")
head(CollegeDistance)

## gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 male other 39.15 yes no yes yes 6.2 8.09 0.2
## 2 female other 48.87 no no yes yes 6.2 8.09 0.2
## 3 male other 48.74 no no yes yes 6.2 8.09 0.2
## 4 male afam 40.40 no no yes yes 6.2 8.09 0.2
## 5 female other 40.48 no no no yes 5.6 8.09 0.4
## 6 male other 54.71 no no yes yes 5.6 8.09 0.4
## tuition education income region
## 1 0.88915 12 high other
## 2 0.88915 12 low other
## 3 0.88915 12 low other
## 4 0.88915 12 low other
## 5 0.88915 13 low other
## 6 0.88915 12 low other

attach(CollegeDistance)

# Let us consider
y<-education # y as the dependent variable.
x1<-score # x1, x2 and x3 as the independent variables.
x2<-tuition
x3<-distance

# To check the problem of Multicollinearity.


# Obtaining scatter plot matrix
pairs(CollegeDistance[c("education", "score",
"tuition","distance")],main="Pairwise Scatter plot")

The scatter plot matrix suggests that the dependent variable, education, has a linear
relationship with the independent variables score, tuition and distance.
It also does not appear that any of the predictor variables are highly correlated, or have a
strong linear relationship, with one another.
To confirm that there is no strong correlation between the independent variables, we
obtain the correlation matrix.
# Obtaining correlation matrix
cor(CollegeDistance[c("education", "score", "tuition","distance")])

##             education       score     tuition    distance
## education  1.00000000  0.46518719  0.03953361 -0.09318309
## score      0.46518719  1.00000000  0.12985848 -0.06797927
## tuition    0.03953361  0.12985848  1.00000000 -0.10098058
## distance  -0.09318309 -0.06797927 -0.10098058  1.00000000

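The pairwise correlations between the predictor variables are very low: the largest in
absolute value is 0.13, between score and tuition. The regressors are therefore not perfectly
independent of each other, since none of the correlations is exactly zero, but there is no
evidence of a multicollinearity problem in this model.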
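The multicollinearity check can also be expressed through variance inflation factors (VIFs). The following is a minimal sketch in base R, assuming only the predictor correlations printed above; the object names r_xx and vif_values are illustrative and not part of the original analysis. It relies on the fact that, for a model with an intercept, the VIF of the j-th regressor equals the j-th diagonal element of the inverse of the regressors' correlation matrix.

```r
# Correlations among the three regressors, copied from the matrix above
vars <- c("score", "tuition", "distance")
r_xx <- matrix(c(
   1.00000000,  0.12985848, -0.06797927,
   0.12985848,  1.00000000, -0.10098058,
  -0.06797927, -0.10098058,  1.00000000
), nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal element of the inverse correlation matrix
vif_values <- diag(solve(r_xx))
print(round(vif_values, 3))
```

All three values come out close to 1, which is consistent with the low pairwise correlations. With the fitted model in memory, car::vif(model) should give the same figures, since the car package is loaded together with AER.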
# Fitting multiple linear regression.
model<-lm(y~x1+x2+x3);summary(model)

##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##

## Residuals:
## Min 1Q Median 3Q Max
## -3.8279 -1.1831 -0.2465 1.2225 5.3621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.156801 0.144677 63.291 < 2e-16 ***
## x1 0.095470 0.002664 35.839 < 2e-16 ***
## x2 -0.143689 0.068471 -2.099 0.0359 *
## x3 -0.050134 0.010057 -4.985 6.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.58 on 4735 degrees of freedom
## Multiple R-squared: 0.2209, Adjusted R-squared: 0.2204
## F-statistic: 447.6 on 3 and 4735 DF, p-value: < 2.2e-16
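The same model can be fitted without attach() and without the intermediate names y, x1, x2 and x3, so that the coefficient table is labelled with the actual variable names. This is only a sketch of an alternative call, not the code used for the output above; the object name fit is an assumption, and the block skips the refit when the AER package is not installed.

```r
# Refit the model with a formula and a data argument instead of attach()
if (requireNamespace("AER", quietly = TRUE)) {
  data("CollegeDistance", package = "AER")
  fit <- lm(education ~ score + tuition + distance, data = CollegeDistance)
  # Coefficient table with rows named (Intercept), score, tuition, distance
  print(coef(summary(fit)))
} else {
  message("Package 'AER' is not installed; skipping the refit.")
}
```

The estimates should match the x1, x2 and x3 rows of the summary above, because the data and the model are the same.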

INTERPRETATIONS:
Here y, “education”, is the dependent variable, and “score”, “tuition” and “distance” are the
independent variables. We obtain the regression equation as,
education = 9.15680 + 0.09547 * score - 0.14369 * tuition - 0.05013 * distance.
For every 1-point increase in achievement test score, the predicted years of education of
a student increase by 0.0955 years, when the average 4-year college tuition in the student’s
state and the student’s distance from a 4-year college stay the same. For every 1-unit
increase in average tuition in the student’s state (in 1000’s of dollars), the predicted years
of education of a student decrease by 0.144 years, when the achievement test score and
distance from a 4-year college stay the same. For every 1-unit increase in the distance the
student lives from a 4-year college (in 10’s of miles), the predicted years of
education of a student decrease by 0.05 years, when the achievement test score and
average tuition in the student’s state stay the same.
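To make these interpretations concrete, the fitted equation can be evaluated for a single student. The sketch below uses the estimates from the model summary and the values of the first student shown in the head() output (score 39.15, tuition 0.88915, distance 0.2); the helper predict_education is a hypothetical name introduced only for illustration.

```r
# Estimated coefficients copied from the summary(model) output
coefs <- c(intercept = 9.156801, score = 0.095470,
           tuition = -0.143689, distance = -0.050134)

# Hypothetical helper: predicted years of education from the fitted equation
predict_education <- function(score, tuition, distance) {
  unname(coefs["intercept"] + coefs["score"] * score +
           coefs["tuition"] * tuition + coefs["distance"] * distance)
}

# First student in the data set, then the same student with one more score point
base     <- predict_education(score = 39.15, tuition = 0.88915, distance = 0.2)
plus_one <- predict_education(score = 40.15, tuition = 0.88915, distance = 0.2)

print(base)             # about 12.76 predicted years of education
print(plus_one - base)  # equals the score coefficient, 0.09547
```

The difference between the two predictions is exactly the score coefficient, which is what "holding tuition and distance the same" means in the interpretation above.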
The intercept does not have a meaningful interpretation. It is the predicted education of a
student who scored zero on the achievement test, lived in a state with an average 4-year
college tuition of zero dollars and lived zero miles from a 4-year college. Such a student
would still have at least 12 years of education, since every student in this data set had to
graduate from high school, yet the intercept (9.16) is less than 12. For the same reason, we
cannot use this model to determine anything about students who did not graduate.
The adjusted coefficient of determination (adjusted R-squared) is 0.2204, which implies that
about 22% of the total variability in the response variable, “education”, is explained by the
regressors.
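The adjusted R-squared and the overall F-statistic can both be recovered from the multiple R-squared reported in the summary. The following is a rough check in base R, assuming n = 4739 observations (4735 residual degrees of freedom plus 4 estimated coefficients) and p = 3 regressors; the variable names are illustrative. Because the reported R-squared is rounded to four decimals, the F value agrees with the summary only approximately.

```r
r2 <- 0.2209  # Multiple R-squared from summary(model)
n  <- 4739    # observations: 4735 residual df + 4 estimated coefficients
p  <- 3       # number of regressors

# Adjusted R-squared penalises R-squared for the number of regressors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic for H0: beta1 = beta2 = beta3 = 0
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 1))
```

With only three regressors and several thousand observations, the adjustment is very small, which is why the two R-squared values in the summary are nearly identical.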

# Testing the significance of the regression parameters using the t-test.

# The test is carried out separately for each coefficient βj (j = 1, 2, 3),
# with the other regressors kept in the model.
# Null hypothesis: H0: βj = 0, i.e. the regressor has no significant linear
# effect on the dependent variable, education.
# Alternative hypothesis: HA: βj ≠ 0, i.e. the regressor has a significant
# linear effect on the dependent variable, education.

# H0: β1 = 0, H0: β2 = 0, H0: β3 = 0
# HA: β1 ≠ 0, HA: β2 ≠ 0, HA: β3 ≠ 0

# t_tab = t(4739 - 4, 0.025) = t(4735, 0.025); t_tab = 1.96.
# t1 = |beta1| / SE(beta1), using the unrounded standard error from the summary
t1=0.095470/0.002664;t1

## [1] 35.83709

# t2 = |beta2| / SE(beta2)
t2=0.143689/0.068471;t2

## [1] 2.098538

# t3 = |beta3| / SE(beta3)
t3=0.050134/0.010057;t3

## [1] 4.984986

At the 5% level of significance (95% confidence), the tabulated two-sided critical value of t
is 1.96. The calculated values of t1, t2 and t3 are all greater than this tabulated value.
Hence, we reject the null hypothesis for each coefficient, which agrees with the p-values in
the model summary, all of which are below 0.05. We conclude that score, tuition and distance
each have a statistically significant linear relationship with the dependent variable,
education, given the other regressors in the model.
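The decision rule can be checked numerically rather than from a table. The following is a minimal sketch in base R, assuming only the estimates, standard errors and residual degrees of freedom printed in the model summary; the object names (est, se, df_resid and so on) are illustrative and not part of the original analysis.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(score = 0.095470, tuition = -0.143689, distance = -0.050134)
se  <- c(score = 0.002664, tuition = 0.068471, distance = 0.010057)
df_resid <- 4735  # residual degrees of freedom reported by summary(model)

t_stat <- est / se                        # t statistic for H0: beta_j = 0
t_crit <- qt(0.975, df_resid)             # two-sided 5% critical value
p_val  <- 2 * pt(-abs(t_stat), df_resid)  # two-sided p-values

# 95% confidence interval for each coefficient
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(t_crit, 3))
print(abs(t_stat) > t_crit)
print(signif(p_val, 3))
print(round(ci, 4))
```

A coefficient is significant at the 5% level exactly when its 95% confidence interval excludes zero, so the intervals give the same decision as comparing |t| with the critical value.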

CONCLUSIONS:
We fitted a multiple linear regression model to predict the number of years of education a
student had attained six years after graduation from the factors score, distance and
tuition. We can conclude that:
• Using the scatter plot matrix, it does not appear that any of our predictor variables
are highly correlated, or have a strong linear relationship, with one another. The
pairwise correlations between our predictor variables in the correlation matrix are very
low. Hence, there is no evidence of a multicollinearity issue in this model.
• We found the regression equation as:
education = 9.15680 + 0.09547 * score - 0.14369 * tuition - 0.05013 * distance.
• Using the t-test, we tested the significance of the regression parameters. We found that
each of the independent variables, i.e., score, tuition and distance, has a statistically
significant linear relationship with the dependent variable, education, at the 5% level.
