
Neal Pania

Homework 1: Simple Linear Regression

The Boston data set records the median house value (medv) for 506 census tracts in Boston. There are 12 predictors for
medv, some of which are:
• CRIM - per capita crime rate by town
• INDUS - proportion of non-retail business acres per town
• NOX - nitric oxides concentration (parts per 10 million)
• RM - average number of rooms per dwelling

For each of the four predictors above, fit a simple linear regression model to predict the response using Excel. 
i) For each model, paste the line fit plot and write down the simple linear regression model. Is there a statistically
significant association between the predictor and the response?
Regression on CRIM
MEDV = 24.0331 - 0.4152 × CRIM
p-value < 0.05, which implies that CRIM has a statistically significant association with MEDV.

Regression on INDUS
MEDV = 29.7549 - 0.6485 × INDUS
p-value < 0.05, which implies that INDUS has a statistically significant association with MEDV.

Regression on NOX
MEDV = 41.3459 - 33.9161 × NOX
p-value < 0.05, which implies that NOX has a statistically significant association with MEDV.

Regression on RM
MEDV = -34.6706 + 9.1021 × RM
p-value < 0.05, which implies that RM has a statistically significant association with MEDV.
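
For reference, the least-squares calculation that Excel performs for each of these models can be sketched in Python. This is a minimal sketch, not the workflow used for the answers above: the function name `fit_simple_linear_regression` and the five toy data points are made up for illustration and are not taken from the Boston data. Only the last line uses numbers from this homework (the reported RM coefficients).

```python
import math

def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x.

    Returns the intercept, the slope, the standard error of the slope,
    and the t-statistic for testing slope = 0.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = y_mean - b1 * x_mean      # intercept
    # Residual standard error uses n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(sse / (n - 2))
    se_b1 = rse / math.sqrt(sxx)
    t_stat = b1 / se_b1
    return b0, b1, se_b1, t_stat

# Hypothetical toy values, NOT the Boston data.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
value = [10.0, 14.0, 21.0, 27.0, 38.0]
b0, b1, se_b1, t_stat = fit_simple_linear_regression(rooms, value)
print(f"y = {b0:.4f} + {b1:.4f} * x, t = {t_stat:.2f}")

# Using the RM coefficients reported above to predict MEDV when RM = 6.
predicted = -34.6706 + 9.1021 * 6.0
```

The p-value that Excel reports for the slope comes from comparing this t-statistic against a t distribution with n - 2 degrees of freedom; a large absolute t-statistic corresponds to a small p-value.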

ii) For each model, paste the residual plot. Comment on whether or not the observed plot is completely uniformly
random.
  Residual on CRIM Regression


Residuals for CRIM do not appear to be completely uniformly random: the residuals appear to increase as
CRIM gets larger. The majority of the data is also clustered at low CRIM values rather than being
uniformly distributed across the range.

Residual on INDUS Regression


Residuals for INDUS appear to be mostly random, with no overall trend. However, many data points share the
same INDUS value, which produces vertical lines on the residual plot. The residuals also appear to be more
uniformly random when INDUS is < 15. In addition, the positive residuals span a larger range than the
negative residuals.

Residual on NOX Regression


Residuals for NOX appear to be mostly random, with no overall trend. The larger residuals at the top of the
plot appear to follow an increasing trend, though these data points are few in number. In terms of the
distribution of residuals, the positive residuals span a larger range than the negative residuals.

Residual on RM Regression
Residuals for RM appear to be mostly random, with no overall trend. The few data points with larger
residual values show a negative trend; however, these data points are so few that the majority of the
data still appears to be random.
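
The visual judgments above can be backed by a rough numeric check. The sketch below computes residuals from a fitted line and compares their spread in the lower and upper halves of the predictor range; a large difference suggests the residuals are not uniformly random. The helper names (`fit_line`, `residuals`, `spread_by_half`) and the eight toy points are assumptions made for illustration, not the Boston data or an Excel feature.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1

def residuals(x, y, b0, b1):
    """Observed value minus fitted value for each data point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def sample_sd(values):
    """Sample standard deviation (divides by len - 1)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def spread_by_half(x, resid):
    """Residual standard deviation for the lower and upper half of x."""
    pairs = sorted(zip(x, resid))          # order points by predictor value
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return sample_sd(low), sample_sd(high)

# Hypothetical toy data whose noise grows with x (NOT the Boston data).
x = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [0.1, -0.1, 0.2, -0.2, 1.0, -1.5, 2.0, -2.5]
y = [2 + 3 * xi + ei for xi, ei in zip(x, noise)]

b0, b1 = fit_line(x, y)
resid = residuals(x, y, b0, b1)
sd_low, sd_high = spread_by_half(x, resid)
print(f"residual sd (low x) = {sd_low:.3f}, (high x) = {sd_high:.3f}")
```

In this toy example the spread is much larger for high x values, which is the kind of pattern described for the CRIM residuals. Splitting at the median is only one simple choice; it does not replace looking at the plot.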

iii) Which model best fits the data, and why?


I would select RM (average number of rooms per dwelling) as the model that best fits the data. Because all
four models had a statistically significant p-value, the choice came down to the residuals, R^2, and standard error.
Based on the residual plots, NOX, INDUS, and RM were the only models with approximately random residuals,
which eliminates CRIM.
Among these three models, RM had a much higher R^2 and a much lower standard error than both
INDUS and NOX. For these reasons, I would ultimately select the regression on RM as the model that best
fits the data.
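
The two statistics used in this comparison can be computed directly from the sums of squares. The following is a minimal sketch under stated assumptions: `fit_summary` is a hypothetical helper, and the two toy predictors are invented values, not the Boston data. The standard error here is the residual standard error with n - 2 degrees of freedom, which is what Excel lists as "Standard Error" in the regression statistics for a one-predictor model.

```python
import math

def fit_summary(x, y):
    """Return (R^2, residual standard error) for a simple linear regression."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - y_mean) ** 2 for yi in y)                      # total
    r_squared = 1 - sse / sst
    std_error = math.sqrt(sse / (n - 2))
    return r_squared, std_error

# Hypothetical toy response and two candidate predictors (NOT the Boston data).
response = [10.0, 14.0, 21.0, 27.0, 38.0]
predictor_a = [4.0, 5.0, 6.0, 7.0, 8.0]   # tracks the response closely
predictor_b = [3.0, 1.0, 4.0, 1.0, 5.0]   # tracks the response loosely

r2_a, se_a = fit_summary(predictor_a, response)
r2_b, se_b = fit_summary(predictor_b, response)
print(f"A: R^2 = {r2_a:.3f}, standard error = {se_a:.3f}")
print(f"B: R^2 = {r2_b:.3f}, standard error = {se_b:.3f}")
```

A higher R^2 means the predictor explains more of the variation in the response, and a lower standard error means the fitted values sit closer to the observations, so predictor A would be preferred in this toy comparison.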
