
Boston Housing

Case Analysis of Boston Housing Data


• The dataset was compiled by David Harrison of Harvard and Daniel
Rubinfeld of the University of Michigan, who in the late 1970s
investigated the relationship between housing values and the
willingness to pay for clean air.
• The hypothesis of the study is that environmental pollution
has a negative impact on house prices. The Boston Housing
dataset contains 506 observations on 14 variables (13 predictors
plus the response variable), which are listed below.

https://www.chegg.com/homework-help/questions-and-answers/case-analysis-boston-housing-data-history-data-dataset-compiled-david-harrison-harvard-dan-q22902185
Variables
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centers
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63) ^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. MEDV (Y) Median value of owner-occupied homes in $1000's (response variable)
Variables
• Dependent variable: medv, the median value of owner-occupied homes (in
thousands of dollars).
• Structural variables indicating the house characteristics: rm (average number of
rooms “in owner units”) and age (proportion of owner-occupied units built prior to
1940).
• Neighborhood variables: crim (crime rate), zn (proportion of residential areas), indus
(proportion of non-retail business area), chas (whether the tract bounds the Charles
River), tax (cost of public services in each community), ptratio (pupil-teacher ratio),
B = 1000(Bk – 0.63)^2 (where Bk is the black proportion of the population; low and high
values of B increase housing prices) and lstat (percent of lower status of the population).
• Accessibility variables: dis (distances to five Boston employment centers) and rad
(accessibility to radial highways; a larger index denotes better accessibility).
• Air pollution variable: nox, the annual concentration of nitrogen oxide (in parts per
ten million).
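As a worked illustration of the B transformation above, the sketch below evaluates B = 1000(Bk – 0.63)^2 for a few values of Bk. The function name `b_index` and the sample Bk values are assumptions chosen for illustration; they are not taken from the dataset. Because the formula is a parabola centered at Bk = 0.63, two different proportions can give the same B, and Bk = 0 gives B = 396.9.

```python
def b_index(bk: float) -> float:
    """Return B = 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1]."""
    return 1000.0 * (bk - 0.63) ** 2


# Illustrative proportions only (hypothetical, not rows of the dataset).
for bk in (0.0, 0.33, 0.63, 0.93, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_index(bk):.2f}")
```

Note that Bk = 0.33 and Bk = 0.93 are equally far from 0.63, so they map to the same B; the transformed variable alone cannot tell them apart.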
A quick summary of the data
stat        crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b     lstat      medv
min      0.00632   0.00000   0.46000   0.00000   0.38500   3.56100   2.90000   1.12960   1.00000 187.00000  12.60000   0.32000   1.73000   5.00000
1st Qu   0.08205   0.00000   5.19000   0.00000   0.44900   5.88550  45.02500   2.10018   4.00000 279.00000  17.40000 375.37750   6.95000  17.02500
median   0.25651   0.00000   9.69000   0.00000   0.53800   6.20850  77.50000   3.20745   5.00000 330.00000  19.05000 391.44000  11.36000  21.20000
3rd Qu   3.67708  12.50000  18.10000   0.00000   0.62400   6.62350  94.07500   5.18843  24.00000 666.00000  20.20000 396.22500  16.95500  25.00000
max     88.97620 100.00000  27.74000   1.00000   0.87100   8.78000 100.00000  12.12650  24.00000 711.00000  22.00000 396.90000  37.97000  50.00000
mean    3.613524 11.363636  11.13678   0.06917  0.554695  6.284634 68.574901  3.795043  9.549407 408.23715  18.45553 356.67403  12.65306  22.53281
std     8.601545 23.322453  6.860353  0.253994  0.115878  0.702617 28.148861   2.10571  8.707259 168.53712  2.164946 91.294864  7.141062  9.197104
var     73.98658 543.93681  47.06444  0.064513  0.013428  0.493671  792.3584  4.434015  75.81637 28404.759  4.686989 8334.7523  50.99476  84.58672
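One quick consistency check on the summary is that each value in the var row should equal the square of the matching value in the std row. The minimal sketch below copies a handful of std and var values from the table and compares them; the dictionary names and the tolerance are assumptions made for this illustration.

```python
import math

# (std, var) pairs copied from the summary table.
summary = {
    "crim": (8.601545, 73.98658),
    "nox": (0.115878, 0.013428),
    "rm": (0.702617, 0.493671),
    "tax": (168.53712, 28404.759),
    "lstat": (7.141062, 50.99476),
    "medv": (9.197104, 84.58672),
}

# The table values are rounded, so use a loose relative tolerance (an assumption).
consistent = {
    name: math.isclose(std ** 2, var, rel_tol=1e-3)
    for name, (std, var) in summary.items()
}

for name, ok in consistent.items():
    print(f"{name:>6}: std^2 matches var -> {ok}")
```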
Build a model to test
• The hypothesis: environmental pollution should have a negative
impact on house prices (medv).
• Is it true?
How can we test it?
• Choose the dependent variable and the independent variable or
variables.
• Justify whether you would expect the independent variable(s) to have
a positive or negative effect on the dependent variable.

• Independent variable: a variable that stands alone and isn't changed
by the other variables
• Dependent variable: a variable being affected by the other variables
The model
• To predict the value of houses MEDV, a multiple regression model will
be constructed with the following features (independent variables):
• RM (average number of rooms per dwelling)
• LSTAT (% lower status of the population)
• NOX (nitric oxides concentration, parts per 10 million)
• PTRATIO (pupil-teacher ratio by town)
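To make the idea of a multiple regression on these four features concrete, here is a minimal sketch that fits ordinary least squares through the normal equations. It runs on synthetic data, not on the Boston Housing data: the "true" coefficients, the feature ranges, the noise level and the helper name `fit_ols` are all assumptions invented for this illustration, so the printed numbers say nothing about the real dataset.

```python
import random


def fit_ols(features, target):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    An intercept column of ones is prepended to every feature row.
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + list(r) for r in features]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, target))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for idx in range(k):
            if idx != col:
                factor = A[idx][col] / A[col][col]
                A[idx] = [a - factor * b for a, b in zip(A[idx], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


random.seed(0)
# Hypothetical coefficients, used only to generate synthetic data.
TRUE = {"intercept": 20.0, "rm": 4.0, "lstat": -0.5, "nox": -10.0, "ptratio": -0.8}

X, y = [], []
for _ in range(200):
    rm = random.uniform(4.0, 8.0)
    lstat = random.uniform(2.0, 35.0)
    nox = random.uniform(0.4, 0.85)
    ptratio = random.uniform(13.0, 22.0)
    noise = random.gauss(0.0, 0.1)
    X.append([rm, lstat, nox, ptratio])
    y.append(
        TRUE["intercept"] + TRUE["rm"] * rm + TRUE["lstat"] * lstat
        + TRUE["nox"] * nox + TRUE["ptratio"] * ptratio + noise
    )

coef = dict(zip(["intercept", "rm", "lstat", "nox", "ptratio"], fit_ols(X, y)))
for name, value in coef.items():
    print(f"{name:>9}: {value:8.3f}")
```

The signs of the recovered coefficients are what the hypothesis test looks at: a negative coefficient on nox, holding the other features fixed, is the pattern the study expects.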
Justify that …
• Higher nitric oxide (NOX) concentrations are expected to lower
housing prices, so NOX should have a negative impact on MEDV.
• Larger houses (more rooms, RM) typically cost more, so RM should
have a positive impact on MEDV.
• With a higher LSTAT (a larger share of lower-status population), one
would expect to observe a lower MEDV, so LSTAT should have a negative
impact on MEDV.
• A higher pupil-teacher ratio (PTRATIO), that is, fewer teachers per
student, is related to lower student performance, which is more typical
of areas with lower housing costs. PTRATIO should therefore have a
negative impact on MEDV.
Scatterplot (using WEKA)
variable          correlation
lstat vs medv     -0.73766
nox vs medv       -0.42732
rm vs medv         0.69536
ptratio vs medv   -0.50779
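The correlations in this table are Pearson correlation coefficients. As a minimal sketch of how such a value is computed, the function below implements the textbook formula; the name `pearson` and the five toy data points are assumptions made for illustration and are not rows of the Boston Housing data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values only (hypothetical): more rooms, higher value.
rooms = [5.0, 5.5, 6.0, 6.5, 7.0]
value = [15.0, 18.0, 21.0, 27.0, 33.0]

r = pearson(rooms, value)
print(f"correlation(rooms, value) = {r:.4f}")
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.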
Using regression
• Y = Boston Housing Price (medv)
• X = All other features
• Predict Y = f(x1, x2, …) using regression
• Then create a scatter plot of predicted Y vs actual MEDV

WEKA: Classify
• Choose LinearRegression
• Select medv
• Click Start

variable            correlation
lstat vs medv       -0.73766
nox vs medv         -0.42732
rm vs medv           0.69536
ptratio vs medv     -0.50779
predicted vs medv    0.86057

Ideally, the points in the scatter plot of predicted vs actual values
would fall on a straight line. Because the model does not fit the data
perfectly, the points scatter around that line instead.
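One way to read these correlations is to square them. For a least-squares fit with an intercept, evaluated on the data it was fitted to, the squared correlation between predicted and actual values equals the share of variance explained (R²). Whether the 0.86057 reported here was computed that way or under cross-validation is not stated, so the sketch below is only a rough reading under that assumption; the dictionary names are chosen for illustration.

```python
# Correlations with medv, copied from the table.
correlations = {
    "lstat": -0.73766,
    "nox": -0.42732,
    "rm": 0.69536,
    "ptratio": -0.50779,
    "predicted": 0.86057,
}

# Squared correlation: approximate share of medv variance associated with each series.
r_squared = {name: r ** 2 for name, r in correlations.items()}

for name, value in r_squared.items():
    print(f"{name:>9}: r^2 = {value:.3f}")
```

Under that assumption the full model accounts for roughly three quarters of the variation in medv, more than any single predictor in the table does on its own.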

medv = -0.1084 * crim + 0.0458 * zn + 2.7187 * chas + (-17.376) * nox + 3.8016 * rm + (-1.4927) * dis +
0.2996 * rad + (-0.0118) * tax + (-0.9465) * ptratio + 0.0093 * b + (-0.5226) * lstat + 36.3411
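The fitted equation can be applied directly. The minimal sketch below copies its coefficients and evaluates it at the column means from the summary table; a least-squares line with an intercept passes (up to rounding) through the point of means, so the result should land close to the mean of medv, 22.53. It then raises nox by 0.1 with everything else held fixed. The names `predict_medv`, `MEANS` and `bumped` are assumptions introduced for this illustration.

```python
# Coefficients copied from the fitted equation.
INTERCEPT = 36.3411
COEF = {
    "crim": -0.1084, "zn": 0.0458, "chas": 2.7187, "nox": -17.376,
    "rm": 3.8016, "dis": -1.4927, "rad": 0.2996, "tax": -0.0118,
    "ptratio": -0.9465, "b": 0.0093, "lstat": -0.5226,
}

# Column means copied from the summary table.
MEANS = {
    "crim": 3.613524, "zn": 11.363636, "chas": 0.06917, "nox": 0.554695,
    "rm": 6.284634, "dis": 3.795043, "rad": 9.549407, "tax": 408.23715,
    "ptratio": 18.45553, "b": 356.67403, "lstat": 12.65306,
}


def predict_medv(tract):
    """Predicted median home value (in $1000's) for one tract."""
    return INTERCEPT + sum(COEF[name] * tract[name] for name in COEF)


at_means = predict_medv(MEANS)

# Same tract with nox raised by 0.1, everything else unchanged.
bumped = dict(MEANS, nox=MEANS["nox"] + 0.1)
delta = predict_medv(bumped) - at_means

print(f"prediction at the means: {at_means:.2f}")
print(f"change in medv for +0.1 nox: {delta:.4f}")
```

The negative change for higher nox is consistent with the hypothesis that pollution lowers house prices, and the coefficients on rm, lstat and ptratio also carry the expected signs.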
