August 3, 2020
from sklearn import set_config  # required for the set_config call below
from sklearn.dummy import DummyRegressor
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
set_config(display='diagram')
%load_ext autoreload
%autoreload 2
from utils import ZIPTransformer, GroupedMeanDiffTrasformer
[4]: report
<IPython.core.display.HTML object>
When exploring the profiling reports, the first thing that caught our attention was the skewness
of the response variable price. We handle this in the modeling process, but only through the
TransformedTargetRegressor class, so that we don’t lose track of the real variable or stop using
it in the explorations.
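To illustrate why the log transform helps here, a small sketch on synthetic data (a lognormal sample standing in for the actual price column, which is an assumption, not the real dataset): the sample skewness drops to roughly zero after taking logs.

```python
import numpy as np
from scipy.stats import skew

# Illustrative only: a right-skewed, price-like variable (lognormal),
# not the actual dataset. Log-transforming pulls the long tail in.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=1.0, size=10_000)

print(f"skew(price)      = {skew(prices):.2f}")          # strongly positive
print(f"skew(log(price)) = {skew(np.log(prices)):.2f}")  # close to 0
```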
For data splitting, we first create a hold-out test set so we can gauge our performance at the end.
[10]: X_raw = df_raw.drop(columns=['price', 'latitude', 'longitude'])
y_raw = df_raw.price
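The actual split call is not shown above; a minimal sketch of the hold-out split follows, where the 20% test fraction and `random_state` are assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame; in the notebook X_raw / y_raw come from df_raw.
df_demo = pd.DataFrame({"size_house": np.arange(100),
                        "price": np.arange(100) * 1000.0})
X = df_demo.drop(columns=["price"])
y = df_demo["price"]

# Assumed settings: 20% hold-out, fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```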
            ('grouped_var_size_house', GroupedMeanDiffTrasformer('zip', 'size_house'), ['zip', 'size_house']),
            ('grouped_var_num_floors', GroupedMeanDiffTrasformer('zip', 'num_floors'), ['zip', 'num_floors']),
            ('grouped_var_year_built', GroupedMeanDiffTrasformer('zip', 'year_built'), ['zip', 'year_built'])
        ]),
        StandardScaler()
)
zip_pipeline_oh = make_pipeline(ZIPTransformer(digits=5), OneHotEncoder(handle_unknown='ignore'))
all_other_variables_transfomer = ColumnTransformer([
('std_scale_for_numeric', numeric_pipeline, numeric_columns),
('dummy_zipcode', zip_pipeline_oh, ['zip'])
])
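`GroupedMeanDiffTrasformer` comes from the local `utils` module, whose source is not shown here. Based only on its name and how it is called (a group column plus a value column), a plausible minimal version might look like the following; `GroupedMeanDiffSketch` is a hypothetical stand-in, not the actual implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupedMeanDiffSketch(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: replace a value by its difference from the
    group mean learned on the training data."""

    def __init__(self, group_col, value_col):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        self.group_means_ = X.groupby(self.group_col)[self.value_col].mean()
        self.global_mean_ = X[self.value_col].mean()
        return self

    def transform(self, X):
        # Unseen groups fall back to the global training mean.
        means = X[self.group_col].map(self.group_means_).fillna(self.global_mean_)
        return (X[self.value_col] - means).to_frame(self.value_col)

demo = pd.DataFrame({"zip": ["98101", "98101", "98102"],
                     "size_house": [100.0, 200.0, 300.0]})
print(GroupedMeanDiffSketch("zip", "size_house").fit_transform(demo))
```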
transformed_lassocv = TransformedTargetRegressor(
regressor=LassoCV(),
func=lambda x: np.log(x),
inverse_func=lambda x: np.exp(x)
)
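A quick toy check of the pattern above (synthetic data, `LinearRegression` instead of `LassoCV` to keep it deterministic): the regressor is fit on log(y), but `predict()` returns values on the original price scale because the inverse transform is applied automatically.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(1, 6).reshape(-1, 1).astype(float)
y = np.exp(2.0 * X.ravel())  # exactly log-linear in X

ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
ttr.fit(X, y)
# The internal fit sees log(y) = 2x; the prediction comes back as exp(12).
print(ttr.predict([[6.0]]))
```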
0.1.4 Model pipeline structure
Below is the final pipeline structure. Some parts of it will be optimized in grid search, but the
structure should be the same.
[55]: modeling_pipeline_base.steps[-1]
[55]: ('transformedtargetregressor',
TransformedTargetRegressor(func=<function <lambda> at 0x7fe07be45290>,
inverse_func=<function <lambda> at 0x7fe07be45200>,
regressor=LinearRegression()))
params_rf = {
    'pipeline__featureunion__columntransformer__std_scale_for_numeric__functiontransformer':
        ['passthrough',
         FunctionTransformer(func=np.log1p, inverse_func=np.expm1),
         FunctionTransformer(func=np.sqrt, inverse_func=lambda x: np.power(x, 2))],
    'pipeline__featureunion__columntransformer__dummy_zipcode__ziptransformer__digits': [3, 4, 5],
    'pipeline__featureunion__columntransformer__dummy_zipcode__onehotencoder':
        [OneHotEncoder(handle_unknown='ignore'),
         CountEncoder(min_group_size=100, handle_unknown=0, handle_missing='count'),
         CatBoostEncoder(sigma=0.01), CatBoostEncoder(sigma=0.1), CatBoostEncoder(sigma=1)],
gs_rf = (RandomizedSearchCV(modeling_pipeline_rf,
param_distributions=params_rf,
cv=10,
scoring='neg_mean_squared_error',
n_iter=100)
.fit(X_train, y_train)
)
[64]: params_base = {
    'pipeline__featureunion__columntransformer__std_scale_for_numeric__functiontransformer':
        ['passthrough',
         FunctionTransformer(func=np.log1p, inverse_func=np.expm1),
         FunctionTransformer(func=np.sqrt, inverse_func=lambda x: np.power(x, 2))],
    'pipeline__featureunion__columntransformer__dummy_zipcode__ziptransformer__digits': [3, 4, 5],
    'pipeline__featureunion__columntransformer__dummy_zipcode__onehotencoder':
        [OneHotEncoder(handle_unknown='ignore'),
         CountEncoder(min_group_size=100, handle_unknown=0, handle_missing='count'),
         CatBoostEncoder(sigma=0.01), CatBoostEncoder(sigma=0.1), CatBoostEncoder(sigma=1)],
    # 'featureunion__pipeline__columntransformer__grouped_var_num_bed': ['passthrough', GroupedMeanDiffTrasformer('zip', 'num_bed')],
    # 'featureunion__pipeline__columntransformer__grouped_var_num_bath': ['passthrough', GroupedMeanDiffTrasformer('zip', 'num_bath')],
    # 'featureunion__pipeline__columntransformer__grouped_var_size_house': ['passthrough', GroupedMeanDiffTrasformer('zip', 'size_house')],
    # 'featureunion__pipeline__columntransformer__grouped_var_size_lot': ['passthrough', GroupedMeanDiffTrasformer('zip', 'size_lot')],
    # 'featureunion__pipeline__columntransformer__grouped_var_num_floors': ['passthrough', GroupedMeanDiffTrasformer('zip', 'num_floors')],
    # 'featureunion__pipeline__columntransformer__grouped_var_year_built': ['passthrough', GroupedMeanDiffTrasformer('zip', 'year_built')]
}
[65]: gs = (RandomizedSearchCV(modeling_pipeline_base,
param_distributions=params_base,
cv=5,
scoring='neg_mean_squared_error',
n_iter=50)
.fit(X_train, y_train)
)
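The long dictionary keys in `params_base` follow scikit-learn's nested-parameter convention: segments separated by `__` name each step of the composite estimator, and the final segment names that estimator's parameter. A minimal illustration of the same routing via `set_params`/`get_params`:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 'model__alpha' means: the 'alpha' parameter of the step named 'model'.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
pipe.set_params(model__alpha=10.0)
print(pipe.get_params()["model__alpha"])  # 10.0
```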
Top-10 configurations from the random-forest search, sorted by rank_test_score
(column names abbreviated from param_transformedtargetregressor__regressor__* and
param_pipeline__featureunion__columntransformer__*; intermediate per-split scores
were truncated in the export):

    n_estimators  min_samples_split  max_features  max_depth  num_transform  zip_digits  split9_test_score  mean_test_score  std_test_score  rank
67  100           2                  auto          None       sqrt           5           -3.122349e+10      -3.904618e+10    2.201742e+10    1
37  500           5                  auto          60         log1p          5           -3.148320e+10      -3.943127e+10    2.250668e+10    2
8   100           5                  auto          20         passthrough    5           -3.175893e+10      -4.069156e+10    2.270068e+10    3
80  500           10                 auto          20         sqrt           5           -3.037819e+10      -4.077850e+10    2.338606e+10    4
41  100           2                  auto          90         sqrt           4           -3.234607e+10      -4.130591e+10    2.357536e+10    5
90  1000          10                 auto          70         sqrt           5           -3.130556e+10      -4.312794e+10    2.425819e+10    6
23  100           2                  auto          50         sqrt           3           -3.341148e+10      -4.319919e+10    2.442334e+10    7
45  1000          10                 auto          100        passthrough    3           -3.107680e+10      -4.339634e+10    2.610328e+10    8
79  1000          10                 auto          50         passthrough    4           -3.058615e+10      -4.340451e+10    2.497479e+10    9
48  100           10                 auto          30         passthrough    3           -2.811619e+10      -4.379189e+10    2.537706e+10    10
[128]: pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score').head(10)

Top-10 configurations from the base-pipeline search, sorted by rank_test_score
(column names abbreviated from param_pipeline__featureunion__columntransformer__*;
intermediate per-split scores, and the score columns for the three top-ranked rows,
were truncated in the export):

    num_transform  zip_digits  zip_encoder                                   split0_test_score  split4_test_score  mean_test_score  std_test_score  rank
7   log1p          5           OneHotEncoder(handle_unknown='ignore')        -2.954841e+10      …                  …                …               1
8   sqrt           5           OneHotEncoder(handle_unknown='ignore')        -2.974013e+10      …                  …                …               2
6   passthrough    5           OneHotEncoder(handle_unknown='ignore')        -2.473737e+10      …                  …                …               3
15  passthrough    5           CountEncoder(combine_min_nan_groups=True, …)  -4.520121e+10      -5.195522e+10      -3.457940e+10    1.356244e+10    4
3   passthrough    4           OneHotEncoder(handle_unknown='ignore')        -3.682898e+10      -5.999820e+10      -3.483461e+10    1.607454e+10    5
9   passthrough    3           CountEncoder(combine_min_nan_groups=True, …)  -4.700729e+10      -5.612667e+10      -3.529701e+10    1.504271e+10    6
0   passthrough    3           OneHotEncoder(handle_unknown='ignore')        -4.700729e+10      -5.612667e+10      -3.530209e+10    1.503649e+10    7
12  passthrough    4           CountEncoder(combine_min_nan_groups=True, …)  -4.607334e+10      -7.098151e+10      -3.797552e+10    1.952787e+10    8
14  sqrt           4           CountEncoder(combine_min_nan_groups=True, …)  -8.888916e+10      -3.571939e+10      -4.131024e+10    2.467764e+10    9
5   sqrt           4           OneHotEncoder(handle_unknown='ignore')        -8.925942e+10      -3.512156e+10      -4.160466e+10    2.529606e+10    10
Here we gather all model predictions in one data frame, together with the truth.
[79]: test_predictions = pd.DataFrame(model_predictions)
test_predictions.head()
[94]: long_model_predictions = (
test_predictions
.set_index('truth', append=True)
.stack()
.to_frame('prediction')
.reset_index([1,2])
.rename(columns={'level_2':'model'})
.assign(residuals = lambda x: x.prediction - x.truth)
)
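The same wide-to-long reshape on a tiny frame (toy values, not the real predictions): one column per model becomes one row per (observation, model) pair, with residuals attached.

```python
import pandas as pd

wide = pd.DataFrame({"truth": [100.0, 200.0],
                     "lasso": [110.0, 190.0],
                     "rf":    [ 90.0, 210.0]})

long = (wide
        .set_index("truth", append=True)   # keep truth alongside the row index
        .stack()                           # model columns -> a third index level
        .to_frame("prediction")
        .reset_index([1, 2])               # truth and the unnamed model level
        .rename(columns={"level_2": "model"})
        .assign(residuals=lambda x: x.prediction - x.truth))
print(long)
```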
By inspecting the following plots, we can see that the linear regression with the feature engineering
we built outperforms all other algorithms, both in terms of RMSE and R2.
[116]: (long_model_predictions
.groupby('model')
.apply(lambda x: np.sqrt(metrics.mean_squared_error(x.truth, x.prediction))
).plot.bar(title = 'Root mean squared error on test set'))
[119]: (long_model_predictions
.groupby('model')
.apply(lambda x: metrics.r2_score(x.truth, x.prediction))
).plot.bar(title = 'R2 on test set')
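The two scores plotted above, computed directly on a toy example (made-up numbers): RMSE is in price units, while R2 is unitless, with 1.0 meaning a perfect fit.

```python
import numpy as np
from sklearn import metrics

truth = np.array([100.0, 200.0, 300.0])
pred  = np.array([110.0, 190.0, 310.0])

# RMSE: square root of the mean squared error.
rmse = np.sqrt(metrics.mean_squared_error(truth, pred))
# R2: fraction of the target's variance explained by the predictions.
r2 = metrics.r2_score(truth, pred)
print(f"RMSE = {rmse:.1f}, R2 = {r2:.3f}")
```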
[135]: import cloudpickle
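A likely reason for cloudpickle here: the TransformedTargetRegressor above uses lambdas for `func`/`inverse_func`, and the stdlib pickle cannot serialize lambdas, while cloudpickle serializes them by value. A round-trip sketch:

```python
import numpy as np
import cloudpickle

# A lambda like the ones used in the target transformer above.
func = lambda x: np.log(x)

blob = cloudpickle.dumps(func)      # stdlib pickle would raise here
restored = cloudpickle.loads(blob)
print(restored(np.e))               # ~1.0
```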