
boston_house_pricing

August 3, 2020

0.1 Guilherme Marthe - Boston House Pricing Challenge


In this notebook we lay out the foundations for the regression task of predicting the price of houses
in the greater Boston area in the US. The dataset is well known in machine learning circles and has
already been used widely around the web in training material, tutorials, blog posts, benchmarks,
etc.
What you’ll find here is the basic process for building an ML model where the task revolves around
a simple prediction on a holdout set. This is not very representative of a production system for
many reasons, but a key one is that the validation process gets a bit simpler.
Here are the steps we took in the analysis:
• basic data profiling to detect potential problems in the data
• feature engineering
• data transformation pipelines
• exploration in categorical data encoding
• cross validation for pipeline construction and hyperparameter exploration
• benchmarking against naive, simple and more complex models
• a basic exploration of the key factors that the model looks at when predicting
Furthermore, we’ll comment on the code throughout the notebook.
[71]: import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
%matplotlib inline
from sklearn.pipeline import make_pipeline, make_union
from sklearn.compose import (ColumnTransformer, make_column_selector,
                             make_column_transformer, TransformedTargetRegressor)
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


from sklearn.linear_model import LinearRegression, LassoCV
from sklearn import metrics
from sklearn.feature_selection import SelectFromModel
from sklearn import set_config
from category_encoders import CountEncoder, TargetEncoder, CatBoostEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.dummy import DummyRegressor
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
set_config(display='diagram')
%load_ext autoreload
%autoreload 2
from utils import ZIPTransformer, GroupedMeanDiffTrasformer

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Here, after exploring the data, we saw some information encoded in the zeroes of a few variables:
the absence of that attribute says something important about the house itself. So rather
than trusting the ML algorithm to detect such importance in terms of price discrimination,
we created a few indicator variables ourselves.
[2]: df_raw = (pd.read_csv('house_sales.csv')
          .assign(zip = lambda x: x.zip.astype(str))
          .assign(condition = lambda x: x.condition.astype(str))
          .assign(has_renovated = lambda x: (x.renovation_date > 0).astype(float))
          .assign(has_basement = lambda x: (x.size_basement > 0).astype(float))
      )

[3]: report = ProfileReport(df_raw, title='Profile report')

[4]: report


When exploring the profiling report, the first thing that caught our attention was the skewness
of the response variable price. We will handle this in the modeling process, but only
through the TransformedTargetRegressor class, so that we don’t lose track of the real variable
and can keep using it in the explorations.
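To illustrate why a log transform helps with a right-skewed response, here is a minimal sketch on synthetic data. The log-normal sample is an assumption standing in for price; it is not the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for house prices (assumption, not the real data)
fake_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

skew_raw = fake_price.skew()          # clearly positive for log-normal data
skew_log = np.log(fake_price).skew()  # close to zero after the log transform

print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero on the log scale is what makes a log-transformed target easier for a linear model to fit.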

About data splitting: we first create a holdout test set so we can gauge our performance at the end.
[10]: X_raw = df_raw.drop(columns=['price', 'latitude', 'longitude'])
y_raw = df_raw.price

X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.33, random_state=42)

[7]: numeric_columns = ['num_bed',
'num_bath',
'size_house',
'size_lot',
'num_floors',
'is_waterfront',
'size_basement',
'year_built',
'renovation_date',
'avg_size_neighbor_houses',
'avg_size_neighbor_lot'
]

0.1.1 Feature Engineering - Grouped variable differences


One of the first ideas we had to add to the already good features was to
compare a house’s variable to the mean (or any other summary statistic) within a grouping
variable. In this example the grouping factor was the ZIP code, but it could have been anything,
such as similar house size or same year of construction.
For those variables it is important to have a proper pipeline in order to avoid data leakage
between the training and validation data, since those variables have to be re-created in each
cross-validation iteration. The transformer for this was written by us and lives in the utils.py module.
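Since utils.py is not shown here, the following is only a minimal sketch of what such a transformer could look like. The class name GroupMeanDiffSketch and the fallback to the global mean for unseen groups are assumptions, not the actual implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMeanDiffSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the notebook's GroupedMeanDiffTrasformer."""

    def __init__(self, group_col, value_col):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Learn the group means from the data passed to fit only.
        self.group_means_ = X.groupby(self.group_col)[self.value_col].mean()
        self.global_mean_ = X[self.value_col].mean()
        return self

    def transform(self, X):
        # Groups unseen during fit fall back to the global mean (assumption).
        means = X[self.group_col].map(self.group_means_).fillna(self.global_mean_)
        return (X[self.value_col] - means).to_numpy().reshape(-1, 1)


train = pd.DataFrame({"zip": ["a", "a", "b"], "num_bed": [2, 4, 3]})
test = pd.DataFrame({"zip": ["a", "c"], "num_bed": [5, 1]})

sketch = GroupMeanDiffSketch("zip", "num_bed").fit(train)
diffs = sketch.transform(test)
print(diffs)
```

Because the means are stored in fit and only looked up in transform, each cross-validation fold computes them from its own training part, which is what prevents the leakage described above.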
[12]: groupped_comparisions_transformer = make_pipeline(
    ColumnTransformer([
        ('grouped_var_num_bed', GroupedMeanDiffTrasformer('zip', 'num_bed'), ['zip', 'num_bed']),
        ('grouped_var_num_bath', GroupedMeanDiffTrasformer('zip', 'num_bath'), ['zip', 'num_bath']),
        ('grouped_var_size_house', GroupedMeanDiffTrasformer('zip', 'size_house'), ['zip', 'size_house']),
        ('grouped_var_size_lot', GroupedMeanDiffTrasformer('zip', 'size_lot'), ['zip', 'size_lot']),
        ('grouped_var_num_floors', GroupedMeanDiffTrasformer('zip', 'num_floors'), ['zip', 'num_floors']),
        ('grouped_var_year_built', GroupedMeanDiffTrasformer('zip', 'year_built'), ['zip', 'year_built'])
    ]),
    StandardScaler()
)

0.1.2 Pipelines for data preparation


For numerical variables we first create a pipeline that transforms the variables (which
all seemed right skewed) and then scales them. Scaling helps some algorithms stay
numerically stable.
The ZIP variable is quite important, so it gets its own pipeline for now. This pipeline lets us
tune how many digits of the ZIP code to use (possible only because ZIP codes are nested)
and which type of encoding to use for that variable.
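The ZIPTransformer itself lives in utils.py and is not shown, so the sketch below is a guess at its behavior: keep only the first `digits` characters of each ZIP string. The class name ZipPrefixSketch and the example ZIP codes are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ZipPrefixSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the notebook's ZIPTransformer."""

    def __init__(self, digits=5):
        self.digits = digits

    def fit(self, X, y=None):
        return self  # nothing to learn, truncation is stateless

    def transform(self, X):
        # Keep only the first `digits` characters of every ZIP string.
        return pd.DataFrame(X).astype(str).apply(lambda col: col.str[: self.digits])


zips = pd.DataFrame({"zip": ["98103", "98105", "98052"]})
prefixes = ZipPrefixSketch(digits=3).transform(zips)
print(prefixes["zip"].tolist())
```

Fewer digits means coarser areas with more houses each, which trades geographic detail for more stable encodings; exposing `digits` as a constructor argument is what lets a search tune it.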
[34]: numeric_pipeline = make_pipeline(
    FunctionTransformer(func=np.log1p, inverse_func=np.expm1),
    StandardScaler())

zip_pipeline_oh = make_pipeline(
    ZIPTransformer(digits=5),
    OneHotEncoder(handle_unknown='ignore'))

all_other_variables_transfomer = ColumnTransformer([
('std_scale_for_numeric', numeric_pipeline, numeric_columns),
('dummy_zipcode', zip_pipeline_oh, ['zip'])
])

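[35]: feature_engineering_union = make_union(
    groupped_comparisions_transformer,
    # assumed to be the `transfomer` of the original cell (that name is not defined above)
    all_other_variables_transfomer)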

0.1.3 Response transformation and feature selection


As mentioned earlier, we will use TransformedTargetRegressor to handle the response variable’s
transformation. We will also use a technique that selects variables with a regularized lasso regression.
[16]: transformed_linear_regression = TransformedTargetRegressor(
regressor=LinearRegression(),
func=lambda x: np.log(x),
inverse_func=lambda x: np.exp(x)
)

transformed_lassocv = TransformedTargetRegressor(
regressor=LassoCV(),
func=lambda x: np.log(x),
inverse_func=lambda x: np.exp(x)
)
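As a small illustration of what TransformedTargetRegressor does, the sketch below fits it on synthetic data whose target is exactly linear on the log scale. The data and the names X_demo and y_demo are assumptions made for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 1))
# Synthetic target that is linear on the log scale (assumption for illustration)
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0])

model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X_demo, y_demo)

# The inner regressor was fitted on log(y); predictions come back on the original scale.
inner_slope = model.regressor_.coef_[0]
preds = model.predict(X_demo)
print(inner_slope, preds[:3], y_demo[:3])
```

The wrapper applies `func` before fitting and `inverse_func` after predicting, so metrics computed on its predictions are in price units rather than log units.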

[50]: pre_model_pipe = make_pipeline(feature_engineering_union,
                               SelectFromModel(LassoCV(max_iter=5000)))

[51]: modeling_pipeline_base = make_pipeline(pre_model_pipe,
                                       transformed_linear_regression)
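The lasso-based selection step can be tried in isolation. This is a minimal sketch on synthetic data where only the first two columns matter; the data is made up, and how many noise columns survive depends on the penalty that LassoCV picks.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
# Only the first two columns drive the synthetic target (assumption for illustration)
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

selector = SelectFromModel(LassoCV(max_iter=5000)).fit(X_demo, y_demo)
kept = selector.get_support()            # boolean mask over the input columns
X_selected = selector.transform(X_demo)  # only the kept columns remain

print(kept, X_selected.shape)
```

Columns whose lasso coefficient is shrunk to (almost) zero are dropped, so the downstream regressor sees a smaller design matrix.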

0.1.4 Model pipeline structure
Below is the final pipeline structure. Some parts of it will be optimized in grid search, but the
structure should be the same.
[55]: modeling_pipeline_base.steps[-1]

[55]: ('transformedtargetregressor',
TransformedTargetRegressor(func=<function <lambda> at 0x7fe07be45290>,
inverse_func=<function <lambda> at 0x7fe07be45200>,
regressor=LinearRegression()))

0.1.5 Grid Search


For the grid search we create a few more pipelines to make it easier to optimize
the hyperparameters. Since we are also working with CV-based feature selection and several steps
in the search, we use RandomizedSearchCV and fit a random sample of configurations to explore
the hyperparameter space. With more compute power and time, this could be extended to a fuller
search.
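The long parameter names used in the search follow scikit-learn's `<step>__<substep>__<parameter>` convention. A quick way to discover them is get_params(); the toy pipeline below is an assumption that only mimics the nesting of the real one.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline with the same kind of nesting as the modeling pipeline (illustrative only)
toy_pipeline = make_pipeline(
    make_pipeline(StandardScaler()),
    TransformedTargetRegressor(regressor=LinearRegression()),
)

# Every tunable parameter is addressed as <step>__<substep>__<parameter>.
param_names = sorted(toy_pipeline.get_params().keys())
nested = [name for name in param_names if name.count("__") >= 2]
print(nested)
```

Step names generated by make_pipeline are the lowercased class names, which is why keys such as `transformedtargetregressor__regressor__n_estimators` appear in the search spaces.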
[63]: transformed_random_forest = TransformedTargetRegressor(
regressor=RandomForestRegressor(),
func=lambda x: np.log(x),
inverse_func=lambda x: np.exp(x)
)

modeling_pipeline_rf = make_pipeline(pre_model_pipe, transformed_random_forest)

params_rf = {
    # how to transform the numeric columns before scaling
    'pipeline__featureunion__columntransformer__std_scale_for_numeric__functiontransformer': [
        'passthrough',
        FunctionTransformer(func=np.log1p, inverse_func=np.expm1),
        FunctionTransformer(func=np.sqrt, inverse_func=lambda x: np.power(x, 2))],
    # how many leading ZIP digits to keep
    'pipeline__featureunion__columntransformer__dummy_zipcode__ziptransformer__digits': [3, 4, 5],
    # how to encode the (truncated) ZIP code
    'pipeline__featureunion__columntransformer__dummy_zipcode__onehotencoder': [
        OneHotEncoder(handle_unknown='ignore'),
        CountEncoder(min_group_size=100, handle_unknown=0, handle_missing='count'),
        CatBoostEncoder(sigma=0.01), CatBoostEncoder(sigma=0.1), CatBoostEncoder(sigma=1)],
    # random forest hyperparameters
    'transformedtargetregressor__regressor__n_estimators': [100, 500, 1000],
    'transformedtargetregressor__regressor__max_features': ['auto', 'sqrt'],
    'transformedtargetregressor__regressor__min_samples_split': [2, 5, 10],
    'transformedtargetregressor__regressor__max_depth': [None] + [int(x) for x in np.linspace(10, 110, num=11)],
}

gs_rf = (RandomizedSearchCV(modeling_pipeline_rf,
param_distributions=params_rf,
cv=10,
scoring='neg_mean_squared_error',
n_iter=100)
.fit(X_train, y_train)
)

[64]: params_base = {
    # how to transform the numeric columns before scaling
    'pipeline__featureunion__columntransformer__std_scale_for_numeric__functiontransformer': [
        'passthrough',
        FunctionTransformer(func=np.log1p, inverse_func=np.expm1),
        FunctionTransformer(func=np.sqrt, inverse_func=lambda x: np.power(x, 2))],
    # how many leading ZIP digits to keep
    'pipeline__featureunion__columntransformer__dummy_zipcode__ziptransformer__digits': [3, 4, 5],
    # how to encode the (truncated) ZIP code
    'pipeline__featureunion__columntransformer__dummy_zipcode__onehotencoder': [
        OneHotEncoder(handle_unknown='ignore'),
        CountEncoder(min_group_size=100, handle_unknown=0, handle_missing='count'),
        CatBoostEncoder(sigma=0.01), CatBoostEncoder(sigma=0.1), CatBoostEncoder(sigma=1)],
    # 'featureunion__pipeline__columntransformer__grouped_var_num_bed': ['passthrough', GroupedMeanDiffTrasformer('zip', 'num_bed')],
    # 'featureunion__pipeline__columntransformer__grouped_var_num_bath': ['passthrough', GroupedMeanDiffTrasformer('zip', 'num_bath')],
    # 'featureunion__pipeline__columntransformer__grouped_var_size_house': ['passthrough', GroupedMeanDiffTrasformer('zip', 'size_house')],
    # 'featureunion__pipeline__columntransformer__grouped_var_size_lot': ['passthrough', GroupedMeanDiffTrasformer('zip', 'size_lot')],
    # 'featureunion__pipeline__columntransformer__grouped_var_num_floors': ['passthrough', GroupedMeanDiffTrasformer('zip', 'num_floors')],
    # 'featureunion__pipeline__columntransformer__grouped_var_year_built': ['passthrough', GroupedMeanDiffTrasformer('zip', 'year_built')]
}

[65]: gs = (RandomizedSearchCV(modeling_pipeline_base,
param_distributions=params_base,
cv=5,
scoring='neg_mean_squared_error',
n_iter=50)
.fit(X_train, y_train)
)

0.1.6 Overview of the grid search process


Within the two grid search procedures we test a few parameters. We study the best way to
transform the numeric variables and how to encode the ZIP feature. Here are the top runs of each
grid search procedure, first for the random forest and then for the linear regression.
[127]: pd.DataFrame(gs_rf.cv_results_).sort_values(by='rank_test_score').head(10)

[127]: mean_fit_time std_fit_time mean_score_time std_score_time \


67 13.614857 1.048362 0.037702 0.000692
37 21.523406 0.992393 0.064881 0.002475
8 11.015670 2.345028 0.037906 0.003768
80 19.851658 1.247191 0.060765 0.001402
41 1.375265 0.052176 0.037508 0.000620
90 7.403681 0.580911 0.085051 0.002040
23 1.360066 0.235751 0.055067 0.021186
45 7.479487 0.201544 0.081320 0.001422
79 8.500575 0.556376 0.082103 0.004956
48 1.062942 0.016723 0.037258 0.000753

param_transformedtargetregressor__regressor__n_estimators \
67 100
37 500
8 100
80 500

41 100
90 1000
23 100
45 1000
79 1000
48 100

param_transformedtargetregressor__regressor__min_samples_split \
67 2
37 5
8 5
80 10
41 2
90 10
23 2
45 10
79 10
48 10

param_transformedtargetregressor__regressor__max_features \
67 auto
37 auto
8 auto
80 auto
41 auto
90 auto
23 auto
45 auto
79 auto
48 auto

param_transformedtargetregressor__regressor__max_depth \
67 None
37 60
8 20
80 20
41 90
90 70
23 50
45 100
79 50
48 30

param_pipeline__featureunion__columntransformer__std_scale_for_numeric__functiontransformer \
67 FunctionTransformer(func=<ufunc 'sqrt'>,\n …
37 FunctionTransformer(func=<ufunc 'log1p'>, inve…

8 passthrough
80 FunctionTransformer(func=<ufunc 'sqrt'>,\n …
41 FunctionTransformer(func=<ufunc 'sqrt'>,\n …
90 FunctionTransformer(func=<ufunc 'sqrt'>,\n …
23 FunctionTransformer(func=<ufunc 'sqrt'>,\n …
45 passthrough
79 passthrough
48 passthrough

param_pipeline__featureunion__columntransformer__dummy_zipcode__ziptransformer__digits \
67 5
37 5
8 5
80 5
41 4
90 5
23 3
45 3
79 4
48 3

… split3_test_score split4_test_score split5_test_score \


67 … -4.582389e+10 -3.612524e+10 -3.559393e+10
37 … -4.143761e+10 -3.772234e+10 -3.367208e+10
8 … -4.068402e+10 -3.979278e+10 -3.590792e+10
80 … -4.031044e+10 -4.400711e+10 -3.447770e+10
41 … -4.284610e+10 -4.231764e+10 -4.253006e+10
90 … -4.392433e+10 -4.645225e+10 -3.948918e+10
23 … -4.916391e+10 -4.155473e+10 -4.164786e+10
45 … -4.449559e+10 -4.730355e+10 -3.940687e+10
79 … -4.442070e+10 -4.939398e+10 -4.066170e+10
48 … -4.499722e+10 -5.403323e+10 -4.022156e+10

split6_test_score split7_test_score split8_test_score \


67 -1.492174e+10 -1.656685e+10 -5.023916e+10
37 -1.443371e+10 -1.666341e+10 -5.381811e+10
8 -1.533361e+10 -1.719224e+10 -5.716759e+10
80 -1.438865e+10 -1.723123e+10 -5.726620e+10
41 -1.427266e+10 -1.781704e+10 -4.544135e+10
90 -1.397616e+10 -1.778879e+10 -5.864278e+10
23 -1.479185e+10 -1.696597e+10 -6.138538e+10
45 -1.463821e+10 -1.711119e+10 -5.706345e+10
79 -1.417849e+10 -1.796129e+10 -5.660925e+10
48 -1.435718e+10 -1.784108e+10 -5.643760e+10

split9_test_score mean_test_score std_test_score rank_test_score

67 -3.122349e+10 -3.904618e+10 2.201742e+10 1
37 -3.148320e+10 -3.943127e+10 2.250668e+10 2
8 -3.175893e+10 -4.069156e+10 2.270068e+10 3
80 -3.037819e+10 -4.077850e+10 2.338606e+10 4
41 -3.234607e+10 -4.130591e+10 2.357536e+10 5
90 -3.130556e+10 -4.312794e+10 2.425819e+10 6
23 -3.341148e+10 -4.319919e+10 2.442334e+10 7
45 -3.107680e+10 -4.339634e+10 2.610328e+10 8
79 -3.058615e+10 -4.340451e+10 2.497479e+10 9
48 -2.811619e+10 -4.379189e+10 2.537706e+10 10

[10 rows x 25 columns]

[128]: pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score').head(10)

[128]: mean_fit_time std_fit_time mean_score_time std_score_time \


7 9.157713 0.846226 0.031030 0.000166
8 10.036359 1.636437 0.030933 0.000079
6 7.727077 2.697961 0.030863 0.000080
15 0.261782 0.009390 0.088180 0.000635
3 0.282754 0.021720 0.057643 0.000695
9 0.200941 0.025468 0.063943 0.008404
0 0.214271 0.026761 0.067542 0.016662
12 0.217424 0.006014 0.088709 0.000868
14 0.212531 0.004270 0.091474 0.006081
5 0.277733 0.021802 0.062236 0.001296

param_pipeline__featureunion__columntransformer__std_scale_for_numeric__functiontransformer \
7 FunctionTransformer(func=<ufunc 'log1p'>, inve…
8 FunctionTransformer(func=<ufunc 'sqrt'>,\n …
6 passthrough
15 passthrough
3 passthrough
9 passthrough
0 passthrough
12 passthrough
14 FunctionTransformer(func=<ufunc 'sqrt'>,\n …
5 FunctionTransformer(func=<ufunc 'sqrt'>,\n …

param_pipeline__featureunion__columntransformer__dummy_zipcode__ziptransformer__digits \
7 5
8 5
6 5
15 5
3 4

9 3
0 3
12 4
14 4
5 4

param_pipeline__featureunion__columntransformer__dummy_zipcode__onehotencoder \
7 OneHotEncoder(handle_unknown='ignore')
8 OneHotEncoder(handle_unknown='ignore')
6 OneHotEncoder(handle_unknown='ignore')
15 CountEncoder(combine_min_nan_groups=True, hand…
3 OneHotEncoder(handle_unknown='ignore')
9 CountEncoder(combine_min_nan_groups=True, hand…
0 OneHotEncoder(handle_unknown='ignore')
12 CountEncoder(combine_min_nan_groups=True, hand…
14 CountEncoder(combine_min_nan_groups=True, hand…
5 OneHotEncoder(handle_unknown='ignore')

params split0_test_score \
7 {'pipeline__featureunion__columntransformer__s… -2.954841e+10
8 {'pipeline__featureunion__columntransformer__s… -2.974013e+10
6 {'pipeline__featureunion__columntransformer__s… -2.473737e+10
15 {'pipeline__featureunion__columntransformer__s… -4.520121e+10
3 {'pipeline__featureunion__columntransformer__s… -3.682898e+10
9 {'pipeline__featureunion__columntransformer__s… -4.700729e+10
0 {'pipeline__featureunion__columntransformer__s… -4.700729e+10
12 {'pipeline__featureunion__columntransformer__s… -4.607334e+10
14 {'pipeline__featureunion__columntransformer__s… -8.888916e+10
5 {'pipeline__featureunion__columntransformer__s… -8.925942e+10

split1_test_score split2_test_score split3_test_score \


7 -1.598335e+10 -1.744806e+10 -1.018315e+10
8 -1.646233e+10 -1.826629e+10 -1.058345e+10
6 -1.745181e+10 -3.611450e+10 -1.185269e+10
15 -2.136673e+10 -3.770084e+10 -1.667300e+10
3 -1.846793e+10 -4.225844e+10 -1.661951e+10
9 -2.044684e+10 -3.602863e+10 -1.687564e+10
0 -2.044684e+10 -3.602863e+10 -1.690102e+10
12 -2.001135e+10 -3.538586e+10 -1.742555e+10
14 -2.580735e+10 -3.697956e+10 -1.915572e+10
5 -2.329066e+10 -4.221344e+10 -1.813822e+10

split4_test_score mean_test_score std_test_score rank_test_score


7 -3.659119e+10 -2.195083e+10 9.658440e+09 1
8 -3.871974e+10 -2.275439e+10 1.011492e+10 2
6 -5.476239e+10 -2.898375e+10 1.523073e+10 3

15 -5.195522e+10 -3.457940e+10 1.356244e+10 4
3 -5.999820e+10 -3.483461e+10 1.607454e+10 5
9 -5.612667e+10 -3.529701e+10 1.504271e+10 6
0 -5.612667e+10 -3.530209e+10 1.503649e+10 7
12 -7.098151e+10 -3.797552e+10 1.952787e+10 8
14 -3.571939e+10 -4.131024e+10 2.467764e+10 9
5 -3.512156e+10 -4.160466e+10 2.529606e+10 10
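The scores in both tables are negative mean squared errors, which are hard to read in price terms. The sketch below converts the best mean_test_score of each search (values copied from the tables above) into an approximate RMSE. The two searches used different fold counts (cv=10 and cv=5), and the root of an averaged MSE is not the average of per-fold RMSEs, so this is only a rough reading.

```python
import numpy as np

# Best mean_test_score values copied from the two tables above
best_neg_mse = {"random_forest": -3.904618e10, "linear_regression": -2.195083e10}

# scoring='neg_mean_squared_error' reports -MSE, so flip the sign before taking the root.
approx_rmse = {name: float(np.sqrt(-score)) for name, score in best_neg_mse.items()}

for name, rmse in approx_rmse.items():
    print(f"{name}: about {rmse:,.0f} in price units")
```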

0.1.7 Benchmarking with simple models


The target metrics we follow are:
• RMSE: root mean squared error
• R2: R squared
The choice boils down to using standard metrics for the regression task. A more diligent choice
than mean squared error could be made if we knew how the model would be used and how
the magnitude of the prediction error on a single observation could cause problems. Pricing tasks,
for example, usually use a relative measure of error, because squared error is so sensitive
to large errors.
R2 is in itself a problematic measure with many flaws. But since it is so widely used and
understood, we’ll use it here.
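To make the difference between these metrics concrete, here is a minimal sketch on made-up prices in which one expensive house is badly mispredicted. The numbers are invented for illustration, and the relative measure (mean absolute percentage error) is written out by hand.

```python
import numpy as np
from sklearn import metrics

# Made-up prices for illustration only
y_true = np.array([200_000.0, 350_000.0, 500_000.0, 3_000_000.0])
y_pred = np.array([210_000.0, 330_000.0, 520_000.0, 2_000_000.0])

rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
r2 = metrics.r2_score(y_true, y_pred)
# A relative measure: mean absolute percentage error, computed by hand
mape = np.mean(np.abs(y_pred - y_true) / y_true)

print(f"RMSE: {rmse:,.0f}  R2: {r2:.3f}  MAPE: {mape:.1%}")
```

The single large miss dominates the RMSE, while the relative measure weighs it in proportion to the house's price.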
[72]: linear_regression_no_feat_engineering = transformed_linear_regression.fit(X_train, y_train)

dummy_regression = DummyRegressor().fit(X_train, y_train)

linear_regression = gs.best_estimator_
rf_regression = gs_rf.best_estimator_

[75]: models = {'lr_no_feats': linear_regression_no_feat_engineering,
          'dummy': dummy_regression,
          'lr': linear_regression,
          'rf': rf_regression}

[76]: model_predictions = {'truth': y_test}

for name, model in models.items():
    model_predictions[name] = model.predict(X_test)

Here we gather all model predictions in one data frame, together with the truth.
[79]: test_predictions = pd.DataFrame(model_predictions)
test_predictions.head()

[79]: truth lr_no_feats dummy lr rf


367 365000 4.646097e+05 536783.906385 3.816964e+05 3.784975e+05
1293 170000 2.767648e+05 536783.906385 1.898924e+05 1.848740e+05
2460 3000000 1.639587e+06 536783.906385 1.971901e+06 1.385539e+06
2329 266000 3.677239e+05 536783.906385 2.501483e+05 2.917449e+05
521 485000 4.246798e+05 536783.906385 5.140322e+05 4.936743e+05

[94]: long_model_predictions = (
test_predictions
.set_index('truth', append=True)
.stack()
.to_frame('prediction')
.reset_index([1,2])
.rename(columns={'level_2':'model'})
.assign(residuals = lambda x: x.prediction - x.truth)
)

By inspecting the following plots, we can see that the linear regression with the feature engineering
we built outperforms all other models, in terms of both RMSE and R2.
[116]: (long_model_predictions
.groupby('model')
.apply(lambda x: np.sqrt(metrics.mean_squared_error(x.truth, x.prediction))
).plot.bar(title = 'Root mean squared error on test set'))

[116]: <AxesSubplot:title={'center':'Root mean squared error on test set'}, xlabel='model'>

[119]: (long_model_predictions
.groupby('model')
.apply(lambda x: metrics.r2_score(x.truth, x.prediction))
).plot.bar(title = 'R2 on test set')

[119]: <AxesSubplot:title={'center':'R2 on test set'}, xlabel='model'>

0.1.8 Next steps


Since the best model is a simple linear regression, more can be studied in terms of model
performance. We did not use any regularization in the final model, so exploring Lasso, Elastic Net
and Ridge regressions is a natural next step. Another avenue is to double down on feature
engineering: better grouping comparisons, KNN features and autoencoders come to mind.
The latitude and longitude features were not used due to time constraints. We could engineer
distance features based on the local area, such as distance to hospitals, schools, etc.
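One possible starting point for such distance features is the haversine (great-circle) distance between each house and a point of interest. The coordinates below are hypothetical, and the helper haversine_km is not part of the notebook.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))


# Hypothetical coordinates: three houses and one point of interest
house_lat = np.array([42.36, 42.40, 42.30])
house_lon = np.array([-71.06, -71.10, -71.00])
poi_lat, poi_lon = 42.36, -71.06

dist_to_poi_km = haversine_km(house_lat, house_lon, poi_lat, poi_lon)
print(dist_to_poi_km)
```

With a list of schools or hospitals, the same function could feed features such as the distance to the nearest one.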
Another aspect that needs improvement is the validation scheme. Here we basically do
out-of-sample validation, but we need to study better how the model will be used. Out-of-time and
out-of-group validation, and combinations of them, are further steps we can suggest to better
estimate performance.
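As a sketch of what out-of-group validation could look like, scikit-learn's GroupKFold keeps all rows of a group on the same side of each split. Grouping by ZIP code and the toy data are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 houses spread over 4 hypothetical ZIP codes
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)
zip_groups = np.array(["z1", "z1", "z2", "z2", "z3", "z3", "z4", "z4"])

splitter = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in splitter.split(X_demo, y_demo, groups=zip_groups):
    # Out-of-group validation: no ZIP code appears on both sides of a split.
    overlaps.append(set(zip_groups[train_idx]) & set(zip_groups[test_idx]))

print(overlaps)
```

For out-of-time validation, TimeSeriesSplit from the same module plays a similar role when rows are ordered by sale date.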

[135]: import cloudpickle

with open('final_model.pkl', 'wb') as fp:
    cloudpickle.dump(linear_regression, fp)
