
Lab 9

Name: Maira Sulaimanova

All plots must be visualized. Use proper tables (no R screenshots). Do not show code unless
asked!

In this lab you will use lm() to estimate the regression model. The function lm() is part of
base R, while the CASchools data used below come from the AER package; attach that package using
library(). The attached RMD file explains how to run this lab using different data; for this lab
you will perform similar assessments using the following data:

1. A researcher wants to analyze the relationship between class size (measured by the
student-teacher ratio) and the average test score. Therefore, he measures both
variables in 10 different classes and ends up with the following results.
Class Size:   23   19   30   22   23   29   35   36   33   25
Test Score:  430  430  333  410  390  377  325  310  328  385
Instructions: Create the vectors cs (the class size) and ts (the test score), containing the
observations above.
● Draw a scatterplot of the results.
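A minimal R sketch of these two steps (axis labels are illustrative):

# Class size and test score observations from the table above
cs <- c(23, 19, 30, 22, 23, 29, 35, 36, 33, 25)
ts <- c(430, 430, 333, 410, 390, 377, 325, 310, 328, 385)

# Scatterplot of test score against class size
plot(cs, ts, xlab = "Class size (student-teacher ratio)", ylab = "Test score",
     main = "Test score vs. class size")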

● Compute the mean, the sample variance, and the sample standard deviation of
Test Scores.
The descriptive statistics for Test score are as follows: the mean is 371.8, the sample variance is
2022.178, and the sample standard deviation is 44.96863.
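These values can be reproduced with, for example:

mean(ts)   # 371.8
var(ts)    # sample variance: 2022.178
sd(ts)     # sample standard deviation: 44.96863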
● Compute the covariance and the correlation coefficient for Test Scores and Class
Size.
The covariance between class size and test scores is -254.2222, implying an inverse relationship:
as class size increases, test scores tend to decrease. This negative association is reinforced by
a correlation coefficient of -0.9533, signifying a strong negative linear relationship.
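A short sketch of the computation:

cov(cs, ts)   # -254.2222
cor(cs, ts)   # -0.953319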
Estimate the regression model using lm(). To find out more about this function
use help. Assign the result to mod.

TestScore_i = β_0 + β_1 ClassSize_i + u_i
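A minimal sketch of the estimation step; the summary output below suggests the vectors were also
stored under the names Test_Score and Class_Size, which is assumed here:

Class_Size <- cs   # assumed renaming to match the output below
Test_Score <- ts
mod <- lm(Test_Score ~ Class_Size)   # simple regression of test score on class size
summary(mod)                         # statistical summary reported in the next bullet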

● Obtain a statistical summary of the model. Report the findings here in a proper
table with explanation.
Call: lm(formula = Test_Score ~ Class_Size)

Residuals:
     Min       1Q   Median       3Q      Max
 -20.727   -4.665   -2.404    5.475   25.669

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   570.5994     22.7243   25.110  6.77e-09 ***
Class_Size     -7.2291      0.8096   -8.929  1.96e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.4 on 8 degrees of freedom
Multiple R-squared: 0.9088, Adjusted R-squared: 0.8974
F-statistic: 79.74 on 1 and 8 DF, p-value: 1.963e-05

The model estimates that, on average, each additional unit increase in class size is associated
with a decrease of approximately 7.23 points in the test score. Both the intercept and the
coefficient on Class_Size are statistically significant, supporting the model's validity. The
multiple R-squared of 0.9088 indicates that about 90.88% of the variability in test scores is
explained by class size (adjusted R-squared: 0.8974), reflecting a strong fit. Finally, the
F-statistic, with a p-value of 1.963e-05, confirms the overall significance of the model.

● Create a new scatterplot with the regression line from the first bullet. Use the
object mod to create the abline (abline()).
As can be seen, the scatterplot with the fitted line illustrates the strong negative relationship:
as class size increases (the student-teacher ratio becomes larger), average test scores tend to
decrease.
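A sketch of the plot, with the fitted line added from the mod object via abline():

plot(Class_Size, Test_Score, xlab = "Class size (student-teacher ratio)",
     ylab = "Test score", main = "Test score vs. class size with regression line")
abline(mod, col = "red")   # adds the estimated regression line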

2. Now, Using the CASchools Data you are going to regress income on to test scores.
● Write the regression model for this relationship.
income_i = β_0 + β_1 Test_Score_i + u_i
income - dependent variable
Test_Score - independent variable
β_0 - intercept
β_1 - slope
u_i - error term
mod1 <- lm(income ~ Test_Score, data = data_df)
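The construction of data_df is not shown here; a possible sketch using the AER CASchools data is
given below. Taking Test_Score as the average of the read and math scores is an assumption, and
the 8 residual degrees of freedom in the output below indicate the reported results are based on
only 10 observations rather than all 420 districts, so data_df may be a subset.

library(AER)
data("CASchools")
# Assumed construction: district income and the average of reading and math scores
data_df <- data.frame(income     = CASchools$income,
                      Test_Score = (CASchools$read + CASchools$math) / 2)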

● Provide the summary stats and correlation coefficient for these variables. Plot
the scatter plot. Describe your findings.
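A sketch of the code behind this bullet, assuming data_df as above; the model summary it produces
is shown next:

summary(mod1)                             # model summary (reported below)
cor(data_df$Test_Score, data_df$income)   # correlation coefficient (~0.608)
plot(data_df$Test_Score, data_df$income, xlab = "Test score", ylab = "Income",
     main = "Income vs. test score")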
Call: lm(formula = income ~ Test_Score, data = data_df)

Residuals:
     Min       1Q   Median       3Q      Max
 -4.1394  -2.0373  -0.2803   0.8577   8.7266

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -12.57491    10.65244   -1.180     0.272
Test_Score     0.06172     0.02846    2.168     0.062 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.84 on 8 degrees of freedom
Multiple R-squared: 0.3701, Adjusted R-squared: 0.2914
F-statistic: 4.701 on 1 and 8 DF, p-value: 0.06199

Correlation coefficient between income and Test_Score: 0.6083904

The linear regression analysis suggests a moderately positive relationship, with a correlation
coefficient of 0.6084. The estimated coefficient on test scores is 0.06172, indicating that, on
average, a one-unit increase in the test score is associated with a 0.06172 increase in income.
However, the intercept is not statistically significant, and the slope (and hence the overall
model) is only borderline significant (p-value: 0.062). The model explains 37.01% of the
variability in income, suggesting a moderate fit.

The scatterplot illustrates a moderate positive correlation between the test scores and income.
● Estimate the regression model. Store the information that is contained in the
output of summary(), assign the output of summary(mod) to the variable
inc_mod. Report the findings of the model.
The estimated regression model has an intercept of -12.57491 and a Test_Score coefficient
of 0.06172; hence, a one-unit increase in the test score is associated with a 0.06172 increase in
income. The residuals, reflecting the variability in the predictions, range from -4.1394 to 8.7266.
The model's explanatory power is moderate, with a multiple R-squared of 0.3701, indicating
that 37.01% of the variability in income is explained; the adjusted R-squared is 0.2914 after
accounting for the number of predictors. Finally, the F-statistic's p-value of 0.06199 is only
borderline significant.
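A minimal sketch of this step (the model object here is mod1 from above, although the instruction
text refers to mod):

inc_mod <- summary(mod1)   # store the full summary output
inc_mod$coefficients       # estimates, standard errors, t values, p-values
inc_mod$r.squared          # 0.3701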
● Calculate the Sum of Squared Residuals and the Total Sum of Squares. Now
compute the R2 manually. Discuss what the SSR and TSS are. Discuss what the R2
is. Report the R2 of this model.
The calculated SSR for this regression model is 117.9645, representing the variability in income
left unexplained by the model (the sum of the squared residuals). The TSS is 187.2865, measuring
the total variability of income around its mean, before the model is considered. Finally, the
coefficient of determination R2 is 0.37, signifying that the model explains approximately 37% of
the variability in income.
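A sketch of the manual computation, using mod1 and data_df from above:

ssr <- sum(residuals(mod1)^2)                           # sum of squared residuals (117.9645)
tss <- sum((data_df$income - mean(data_df$income))^2)   # total sum of squares (187.2865)
r2  <- 1 - ssr / tss                                    # R2 of the model (~0.37)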

● Report the SER and what it is.


The SER measures the typical deviation of the observed income values from the values predicted by
the model; here it is approximately 3.84. A lower SER suggests that the model's predictions are
closer to the actual observed values and therefore more precise, while a higher SER implies
greater variability in the residuals, indicating that the predictions may be less precise.
Consequently, the SER of 3.84 gives a concrete number for the average size of the differences
between the actual income values and the values predicted by the regression model.
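A sketch of the manual SER computation, using ssr from the previous step; with one regressor the
SER is sqrt(SSR / (n - 2)), and n = 10 here, per the 8 residual degrees of freedom reported above:

n   <- length(residuals(mod1))   # number of observations used in the fit
ser <- sqrt(ssr / (n - 2))       # ~3.84, matching the residual standard error above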

3. Explain the three assumptions of OLS in your own words.

Firstly, there is the assumption of linearity, asserting that the relationship between the
variables follows a straight line, indicating a consistent association between changes in one
variable and changes in another. Secondly, the assumption of independence means that
prediction errors are unrelated to each other, ensuring that the occurrence of one error
provides no information about another. Finally, the assumption of homoscedasticity requires
that the spread of prediction errors remains relatively constant across all levels of the
independent variable.
