
Regression math

The following data come from 420 schools in the year 2020 and are used to evaluate school
performance. The key variables in the dataset include:

Average test score in exams (short name: TESTSCR) (unit in points),
Student-teacher ratio representing class size (short name: STR) (unit in %),
Average family income (short name: AVGINC) (unit in $1000’s), and
Percent of English learners (short name: EL_PCT) (unit in %).

The following statistical outputs from STATA are needed to answer the questions. If you use
any of these tables in an answer, please mention the number of the table you are referring
to.

Table-1: Descriptive/Summary Statistics


Variable Name   Mean   Standard Deviation   Minimum Value   Maximum Value   Number of observations
testscr         500    15                   400             600             420
str             20     2                    14              26              420
avginc          15     7                    5               55              420
el_pct          16     18                   0               85              420

Table- 2: Correlation Matrix


Variable Name   str     avginc   el_pct
str             1
avginc          -0.25   1
el_pct          0.19    -0.31    1

Table-3: Variance Inflation Factor (VIF)


Variable VIF 1/VIF
avginc 2 0.5
el_pct 2 0.5
str 1 1

Table-4: Breusch-Pagan Test


Ho: Constant variance
Chi2(1) = 0.57
Prob>chi2 = 0.45

Table-5: Simple linear regression


F (1, 418) = 20
P-value = 0.00
R-squared = 0.06
Adjusted R-squared = 0.05
Root MSE = 18
testscr    Coefficient   Standard Error   P-value
str        -2.5          0.48             0.00
constant   520           10               0.00

Table-6: Multiple linear regression


F (3, 416) = 300
P-value = 0.00
R-squared = 0.71
Adjusted R-squared = 0.70
Root MSE = 10
testscr    Coefficient   Standard Error   P-value
str        -0.07         0.28             0.80
avginc     1.5           0.08             0.00
el_pct     -0.49         0.03             0.00
constant   480           5                0.00

Requirements:
a) What type of dataset is it and why? What is the sample size? Which one is (are) the
dependent and independent variable(s)?
Ans:
Data Set Type     Time Period   Entity
Time Series       Multiple      Single
Cross Sectional   Single        Multiple
Panel             Multiple      Multiple

Entities: 420 schools (multiple)
Time period: the year 2020 (single)

Because many entities are observed in a single time period, this is cross-sectional data.


School Name   Testscr   STR   Avginc   EL_Pct
1             x         x     x        x
2             x         x     x        x
3             x         x     x        x
…..
420           x         x     x        x

Sample size: 420

Dependent variable: Testscr
Independent variables: STR, AvgInc, EL_Pct

b) Briefly explain the key aspects of all the variables based on the descriptive/summary
statistics.
Ans:

Testscr: On average, schools have a test score of 500 points. The standard deviation is 15
points, which is only 3% of the mean, so most schools score fairly close to 500 even though
the full range is wide. The lowest school average in the dataset is 400 points and the
highest is 600 points.
STR: The average student-teacher ratio is 20%, so there is on average 1 teacher for every 5
students. As the standard deviation of STR is 2, we can infer that there is little difference
among the schools in terms of STR. At the minimum (14%) there is 1 teacher for about 7
students. At the maximum (26%) there is 1 teacher for about 4 students.
AvgInc: On average, family income is $15,000. As the standard deviation is 7 ($7,000),
there is moderate deviation in family income. As the minimum family income is $5,000, some
schools serve poor families who may need financial assistance. Some families are rich, as the
maximum family income is $55,000. Because the mean lies much closer to the minimum than to
the maximum, most families appear to have lower incomes.
EL_Pct: On average, 16% of students are English learners. As the standard deviation of EL_Pct
is 18, which exceeds the mean, schools differ substantially in their share of English
learners. Some schools have only native speakers, as the minimum value is 0%. Some schools
have mostly non-native students, as the maximum value is 85%. On average, the share of
English learners among the schools is low.
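
One way to compare "large" and "small" deviations across variables with different units is the coefficient of variation (standard deviation divided by the mean). The sketch below computes it from the Table-1 values; the names `table1` and `coefficient_of_variation` are illustrative and not part of the STATA output.

```python
# (mean, standard deviation) copied from Table-1.
table1 = {
    "testscr": (500, 15),
    "str": (20, 2),
    "avginc": (15, 7),
    "el_pct": (16, 18),
}


def coefficient_of_variation(mean, sd):
    """Return the standard deviation as a fraction of the mean."""
    return sd / mean


cv = {name: coefficient_of_variation(m, s) for name, (m, s) in table1.items()}

for name, value in cv.items():
    print(f"{name}: {value:.1%}")
```

By this measure testscr varies least relative to its mean and el_pct varies most.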

c) What is multicollinearity? Is there any multicollinearity problem in this dataset? If
any, what are the consequences and what are the possible solutions?

Ans:
Multicollinearity is the problem that arises in a multiple linear regression model when two
or more independent variables are highly correlated.
No, there is no multicollinearity problem in this dataset, because all the correlations among
the independent variables in the correlation matrix in Table-2 are below 0.80 in absolute
value and the VIF values for the independent variables in Table-3 are all below 10.
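
The rule-of-thumb check above can be sketched in a few lines. The values are copied from Table-2 and Table-3, and the 0.80 and 10 thresholds are the ones used in the answer; all variable names are illustrative. The last step uses the identity VIF = 1 / (1 - R2_j), where R2_j is the R-squared from regressing variable j on the other independent variables.

```python
# Pairwise correlations among the independent variables (Table-2).
correlations = {
    ("str", "avginc"): -0.25,
    ("str", "el_pct"): 0.19,
    ("avginc", "el_pct"): -0.31,
}
# Variance inflation factors (Table-3).
vif = {"avginc": 2, "el_pct": 2, "str": 1}

CORR_LIMIT = 0.80  # threshold used in the answer
VIF_LIMIT = 10     # threshold used in the answer

# Flag any pair or variable that reaches a threshold.
high_corr = [pair for pair, r in correlations.items() if abs(r) >= CORR_LIMIT]
high_vif = [name for name, v in vif.items() if v >= VIF_LIMIT]
multicollinearity_flagged = bool(high_corr or high_vif)

# VIF = 1 / (1 - R2_j)  implies  R2_j = 1 - 1 / VIF.
aux_r2 = {name: 1 - 1 / v for name, v in vif.items()}

print(multicollinearity_flagged)
print(aux_r2)
```

A VIF of 2 therefore corresponds to an auxiliary R-squared of 0.5, well short of the level that a VIF of 10 (R-squared of 0.9) would indicate.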

Consequences and solution: See regression lecture slide.

d) What is heteroscedasticity? Is there any heteroscedasticity problem in this dataset? If
any, what are the consequences and how can this problem be solved?

Ans:
If the variance of the error term is not constant, then it is defined as heteroscedasticity.

In Table-4, the Breusch-Pagan test gives a P-value of 0.45 > 0.05, so we cannot reject the
null hypothesis of constant variance. There is no evidence of heteroscedasticity in the
dataset.
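
The P-value in Table-4 can be reproduced from the reported chi-square statistic. A minimal sketch using only the standard library: for 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)). The variable names and the 5% level stored in `alpha` are illustrative choices matching the answer above.

```python
import math

chi2_stat = 0.57  # Breusch-Pagan statistic from Table-4 (1 degree of freedom)
alpha = 0.05      # significance level used in the answer

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2_stat / 2))

# Reject H0 (constant variance) only if the p-value falls below alpha.
reject_constant_variance = p_value < alpha

print(round(p_value, 2), reject_constant_variance)
```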

Consequences and solution: See regression lecture slide.

e) Write down the simple population regression model considering only STR. Write
down the sample regression LINE considering only STR.
f) Based on requirement (e), interpret the economic and statistical significance of the
intercept and the slope of the simple linear regression line. Do you agree that class
size significantly influences students’ exam performance?
g) Based on requirement (e), if a school has a test score of 590 points and class size
(STR) of 18, what will be the predicted test score and the corresponding residual/error
of that school?
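
For requirement (g), the predicted score and the residual follow directly from the Table-5 coefficients. A minimal sketch, assuming the fitted line testscr = 520 - 2.5 * STR; the function name `predict_simple` is illustrative.

```python
# Coefficients from Table-5 (simple regression of testscr on str).
intercept = 520
slope_str = -2.5


def predict_simple(str_value):
    """Predicted test score from the simple regression line."""
    return intercept + slope_str * str_value


actual_score = 590                    # observed score given in requirement (g)
predicted_score = predict_simple(18)  # STR = 18
residual = actual_score - predicted_score  # residual = actual - predicted

print(predicted_score, residual)
```

With STR = 18 the fitted line gives 475 points, so the residual for a school scoring 590 points is 115 points.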
h) Based on requirement (e), comment on the goodness of fit of this simple linear
regression model.
i) Write down the multiple population regression model considering STR, AVGINC and
EL_PCT. Write down the sample regression LINE considering STR, AVGINC and
EL_PCT.
j) Based on requirement (i), interpret the economic and statistical significance of the
intercept and the slopes of the multiple linear regression line. Do you still agree that
class size significantly influences students’ exam performance?
Ans:
Constant:
Economic significance: If all the independent variables are equal to zero, then the predicted
Testscr will be 480 points.
Statistical significance: As the P-value of the constant is 0.00 < 0.05, the constant value is
significant at 5% significance level.

STR:
Economic significance: If STR increases by 1 percentage point, then Testscr will decrease by
0.07 points, holding all other independent variables constant.
Statistical significance: As the P-value of the STR is 0.80 > 0.05, the STR variable is
insignificant at 5% significance level.

AvgInc:
Economic significance: If AvgInc increases by $1000, then Testscr will increase by 1.5 points,
holding all other independent variables constant.
Statistical significance: As the P-value of the AvgInc is 0.00 < 0.05, the AvgInc variable is
significant at 5% significance level.

El_Pct:
Economic significance: If El_Pct increases by 1 percentage point, then Testscr will decrease
by 0.49 points, holding all other independent variables constant.
Statistical significance: As the P-value of the El_Pct is 0.00 < 0.05, the El_Pct variable is
significant at 5% significance level.

As the STR variable is insignificant at the 5% significance level once AvgInc and El_Pct are
controlled for, there is no statistical evidence that class size influences students’ exam
performance.
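
The significance statements above can be cross-checked from the coefficients and standard errors in Table-6. The sketch below computes t = coefficient / standard error and a two-sided P-value; it assumes the standard normal approximation, which is close to the t distribution with 416 degrees of freedom. The names `table6` and `two_sided_p_normal` are illustrative.

```python
import math

# (coefficient, standard error) copied from Table-6.
table6 = {
    "str": (-0.07, 0.28),
    "avginc": (1.5, 0.08),
    "el_pct": (-0.49, 0.03),
    "constant": (480, 5),
}


def two_sided_p_normal(t):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))


t_stats = {name: coef / se for name, (coef, se) in table6.items()}
p_values = {name: two_sided_p_normal(t) for name, t in t_stats.items()}

for name in table6:
    significant = p_values[name] < 0.05
    print(f"{name}: t = {t_stats[name]:.2f}, p = {p_values[name]:.2f}, "
          f"significant at 5%: {significant}")
```

Only STR has a t-statistic small in absolute value (about -0.25), which agrees with its reported P-value of 0.80.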

k) Based on requirement (i), if a school has a test score of 590 points, class size (STR) of
20, average income of $10 thousand and English learning students of 30%, what will
be the predicted test score and the corresponding residual/error of that school?
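
For requirement (k), the same arithmetic applies with the Table-6 coefficients. A minimal sketch, assuming the fitted line testscr = 480 - 0.07 * STR + 1.5 * AVGINC - 0.49 * EL_PCT; the dictionary names are illustrative.

```python
# Coefficients from Table-6 (multiple regression).
coefficients = {"constant": 480, "str": -0.07, "avginc": 1.5, "el_pct": -0.49}

# School described in requirement (k): avginc is in $1000's, el_pct in %.
school = {"str": 20, "avginc": 10, "el_pct": 30}
actual_score = 590

# Predicted score = intercept + sum of (slope * value) over the regressors.
predicted_score = coefficients["constant"] + sum(
    coefficients[name] * value for name, value in school.items()
)
residual = actual_score - predicted_score  # residual = actual - predicted

print(round(predicted_score, 1), round(residual, 1))
```

The fitted line gives about 478.9 points, so the residual for a school scoring 590 points is about 111.1 points.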
l) Based on requirement (i), comment on the goodness of fit of this multiple linear
regression model.
m) Based on requirements (h) and (l), does the multiple linear regression model fit better
than the simple linear regression model? Why or why not?
n) Comment on the validity of the whole/overall multiple regression model.

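Ans: In Table-6, the F-statistic of the multiple linear regression model is 300 with a P-value
of 0.00, which is less than 0.05. We therefore reject the null hypothesis that all slope
coefficients are zero, so the overall model is significant and valid.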

o) Based on requirement (i), suppose you are considering taking the natural logarithm of
some variables in the multiple regression model. How would you explain the slope if
you take natural logarithm of (1) AVGINC only, (2) TESTSCR only, and (3) both
AVGINC and TESTSCR.
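
The three cases in requirement (o) follow the standard reading of log models, illustrated numerically below. The slope values are hypothetical and chosen only for illustration; they are not estimated from this dataset.

```python
import math

# Hypothetical slopes for illustration only (NOT taken from the tables).
b_level_log = 10.0   # (1) testscr = a + b * ln(avginc)
b_log_level = 0.003  # (2) ln(testscr) = a + b * avginc
b_log_log = 0.05     # (3) ln(testscr) = a + b * ln(avginc)

# (1) A 1% rise in avginc changes testscr by about b / 100 points.
points_change = b_level_log * math.log(1.01)

# (2) A one-unit ($1000) rise in avginc changes testscr by about 100 * b percent.
pct_change_log_level = 100 * (math.exp(b_log_level) - 1)

# (3) A 1% rise in avginc changes testscr by about b percent (an elasticity).
pct_change_log_log = 100 * (1.01 ** b_log_log - 1)

print(round(points_change, 3))
print(round(pct_change_log_level, 3))
print(round(pct_change_log_log, 3))
```

In each case the exact change is close to the rule-of-thumb value (b / 100 points, 100 * b percent, and b percent respectively), which is why those shortcuts are used when interpreting the slope.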
