
STAT 210

Applied Statistics and Data Analysis


Problem List 10

import numpy as np
import scipy as sp
import pandas as pd
import statsmodels
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

Exercise 1
For this exercise we use the data set ais. We concentrate on six variables, bmi, lbm, ssf, ht, wcc, and hc.
(a) Use the function pairplot in the package seaborn to obtain a graph matrix for the variables. Use the
matshow function from matplotlib to draw a plot of the correlation coefficients for the six variables.
Comment.
(b) Fit a multiple regression model for bmi as a function of the other variables. Print the summary table
and discuss the results.
(c) Using a stepwise procedure, select a minimal adequate model. Use αcrit = 0.05.
(d) Using R2 as a selection criterion, select a minimal adequate model.
(e) Using BIC as a selection criterion, select a minimal adequate model.
(f) Fit also a lasso regression model. You can use smf.ols().fit_regularized(L1_wt = 1, alpha =
0.5) syntax to do it. Select a minimal adequate model out of all these procedures. Justify your answer.
(g) Draw the diagnostic plots for your final model and discuss them.

Solution
(a) Load the data
data_1 = pd.read_table("data/ais", delim_whitespace = True)
data_1.info()

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 202 entries, 1 to 202
## Data columns (total 13 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 rcc 202 non-null float64
## 1 wcc 202 non-null float64
## 2 hc 202 non-null float64

## 3 hg 202 non-null float64
## 4 ferr 202 non-null int64
## 5 bmi 202 non-null float64
## 6 ssf 202 non-null float64
## 7 pcBfat 202 non-null float64
## 8 lbm 202 non-null float64
## 9 ht 202 non-null float64
## 10 wt 202 non-null float64
## 11 sex 202 non-null object
## 12 sport 202 non-null object
## dtypes: float64(10), int64(1), object(2)
## memory usage: 22.1+ KB
data_1 = data_1[["bmi", "lbm", "ssf", "ht", "wcc", "hc"]]
data_1.info()

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 202 entries, 1 to 202
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 bmi 202 non-null float64
## 1 lbm 202 non-null float64
## 2 ssf 202 non-null float64
## 3 ht 202 non-null float64
## 4 wcc 202 non-null float64
## 5 hc 202 non-null float64
## dtypes: float64(6)
## memory usage: 11.0 KB
Pairwise scatterplots
sns.pairplot(data_1, kind = "reg", diag_kind = "kde", plot_kws = {"line_kws": {"color": "red"}});
plt.show()

[Figure: pairwise scatterplot matrix (regression lines, KDE diagonals) for bmi, lbm, ssf, ht, wcc, and hc]

Plots on the top row have bmi on the y-axis and are the relevant ones for variable selection. They show that bmi has a strong linear relation with lbm, while the linear relation with the other variables is less clear. There is also a strong linear relation between lbm and ht.
The correlation matrix
fig, ax = plt.subplots()
cax = ax.matshow(data_1.corr(), vmin = -1, vmax = 1, cmap = plt.cm.RdBu)
fig.colorbar(cax);
ax.set(xticks = range(data_1.shape[1]), yticks = range(data_1.shape[1]),
xticklabels = data_1.columns, yticklabels = data_1.columns);
plt.show()

[Figure: correlation matrix heatmap for bmi, lbm, ssf, ht, wcc, and hc]
The highest correlation corresponds to ht and lbm, with a value of around 0.8; lbm and bmi also have a
strong correlation.
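The numeric values behind the heatmap can also be printed directly; a quick check, assuming data_1 is the six-variable data frame defined above:
# Correlation matrix rounded to two decimals; the lbm-ht and lbm-bmi entries
# should match the strong relations seen in the pairplot.
print(data_1.corr().round(2))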
(b) To include all variables in the formula, we can use string manipulation in Python. We first join all the column names with a + separator in between, and then fit a model by appending the result to the formula. Observe that we need to drop bmi, since this variable is the response and not a regressor.
all_vars_1 = "+".join(data_1.columns.drop(["bmi"]))
model_10 = smf.ols("bmi ~ " + all_vars_1, data_1).fit()
model_10.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: bmi R-squared: 0.970
## Model: OLS Adj. R-squared: 0.969
## Method: Least Squares F-statistic: 1260.
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 7.23e-147
## Time: 09:32:02 Log-Likelihood: -145.10
## No. Observations: 202 AIC: 302.2
## Df Residuals: 196 BIC: 322.0
## Df Model: 5
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------

## Intercept 40.5999 1.074 37.817 0.000 38.483 42.717
## lbm 0.3219 0.005 60.579 0.000 0.311 0.332
## ssf 0.0508 0.001 40.202 0.000 0.048 0.053
## ht -0.2380 0.006 -37.985 0.000 -0.250 -0.226
## wcc 0.0090 0.021 0.437 0.663 -0.032 0.049
## hc 0.0176 0.014 1.301 0.195 -0.009 0.044
## ==============================================================================
## Omnibus: 12.113 Durbin-Watson: 1.223
## Prob(Omnibus): 0.002 Jarque-Bera (JB): 16.098
## Skew: -0.412 Prob(JB): 0.000319
## Kurtosis: 4.110 Cond. No. 6.32e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.32e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
The variables wcc and hc have large p-values and do not seem to be significant. All the other variables have
a p-value reported as zero.
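Given the large condition number flagged in the notes of the summary, it is also worth looking at variance inflation factors (the same check used in Exercise 2); a quick check, assuming model_10 is the full model fitted above:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# VIF for each column of the design matrix built by the formula (including the intercept)
for i, name in enumerate(model_10.model.exog_names):
    print(name, round(vif(model_10.model.exog, i), 3))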
(c) We use a critical value of αcrit = 0.05 for variable selection. We remove wcc which has the largest
p-value.
all_vars_1 = "+".join(data_1.columns.drop(["bmi", "wcc"]))
model_11 = smf.ols("bmi ~ " + all_vars_1, data_1).fit()
model_11.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: bmi R-squared: 0.970
## Model: OLS Adj. R-squared: 0.969
## Method: Least Squares F-statistic: 1581.
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.88e-148
## Time: 09:32:02 Log-Likelihood: -145.20
## No. Observations: 202 AIC: 300.4
## Df Residuals: 197 BIC: 316.9
## Df Model: 4
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 40.6066 1.071 37.905 0.000 38.494 42.719
## lbm 0.3219 0.005 60.707 0.000 0.311 0.332
## ssf 0.0510 0.001 41.522 0.000 0.049 0.053
## ht -0.2380 0.006 -38.068 0.000 -0.250 -0.226
## hc 0.0188 0.013 1.420 0.157 -0.007 0.045
## ==============================================================================
## Omnibus: 12.380 Durbin-Watson: 1.232
## Prob(Omnibus): 0.002 Jarque-Bera (JB): 16.472
## Skew: -0.420 Prob(JB): 0.000265
## Kurtosis: 4.118 Cond. No. 6.32e+03
## ==============================================================================

##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.32e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
We now remove hc.
all_vars_1 = "+".join(data_1.columns.drop(["bmi", "wcc", "hc"]))
model_12 = smf.ols("bmi ~ " + all_vars_1, data_1).fit()
model_12.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: bmi R-squared: 0.969
## Model: OLS Adj. R-squared: 0.969
## Method: Least Squares F-statistic: 2097.
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.03e-149
## Time: 09:32:02 Log-Likelihood: -146.22
## No. Observations: 202 AIC: 300.4
## Df Residuals: 198 BIC: 313.7
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 41.4719 0.883 46.942 0.000 39.730 43.214
## lbm 0.3254 0.005 69.112 0.000 0.316 0.335
## ssf 0.0503 0.001 44.440 0.000 0.048 0.053
## ht -0.2393 0.006 -38.597 0.000 -0.252 -0.227
## ==============================================================================
## Omnibus: 14.254 Durbin-Watson: 1.212
## Prob(Omnibus): 0.001 Jarque-Bera (JB): 20.207
## Skew: -0.452 Prob(JB): 4.09e-05
## Kurtosis: 4.259 Cond. No. 5.08e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 5.08e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
This is the final model in the stepwise procedure since all the remaining variables have p-values below αcrit .
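The same backward elimination can be automated by repeatedly dropping the regressor with the largest p-value until all remaining p-values are below αcrit. A minimal sketch, assuming data_1 as above; the helper name drop_by_pvalue is ours, not from the lectures:
def drop_by_pvalue(data, response, alpha_crit=0.05):
    # Backward elimination: start from the full model and drop the regressor
    # with the largest p-value until all remaining p-values are below alpha_crit.
    predictors = list(data.columns.drop(response))
    while predictors:
        model = smf.ols(response + " ~ " + "+".join(predictors), data).fit()
        pvals = model.pvalues.drop("Intercept")
        if pvals.max() < alpha_crit:
            break
        predictors.remove(pvals.idxmax())
    return model

model_step = drop_by_pvalue(data_1, "bmi")
print(model_step.params.index.tolist())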
(d) To select a model using R2 as a criterion (adjusted R2, so that adding variables is penalized), we use the function from the lectures.
from itertools import combinations

# Function to get the best model by adjusted R^2
def get_best_model_by_adj_r2(X, y):
    best_adj_r2 = float("-inf")
    best_combination = []
    for i in range(1, len(X.columns) + 1):
        for combo in combinations(X.columns, i):
            model = sm.OLS(y, sm.add_constant(X[list(combo)])).fit()
            if model.rsquared_adj > best_adj_r2:
                best_adj_r2 = model.rsquared_adj
                best_combination = combo
    return best_combination, best_adj_r2

We need to separate the regressors from the dependent variable and then we run the function:
# Drop bmi to get matrix of predictors
X = data_1.drop(columns=['bmi'])
y = data_1['bmi']
best_combo, best_adj_r2 = get_best_model_by_adj_r2(X, y)
print(f"Best combination by adjusted Rˆ2: {best_combo}")
print(f"Maximum adjusted Rˆ2: {best_adj_r2:.5f}")

## Best combination by adjusted R^2: ('lbm', 'ssf', 'ht', 'hc')


## Maximum adjusted R^2: 0.96918
This criterion selects model_11 with variables lbm, ssf, ht, and hc.
(e) For BIC we also use the function from the lectures.
def stepwise_selection_bic_recheck(X, y, initial_list=[], verbose=True):
    """
    Perform a forward-backward feature selection
    based on BIC from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features
    """
    included = list(initial_list)
    best_bic = float('inf')
    while True:
        changed = False
        # forward step
        excluded = list(set(X.columns) - set(included))
        new_bic = {}
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_bic[new_column] = model.bic
        best_new_bic = min(new_bic.values())
        if best_new_bic < best_bic:
            best_bic = best_new_bic
            best_feature = min(new_bic, key=new_bic.get)
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add {:30} with BIC {:.6}'.format(best_feature, best_bic))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        current_bic = model.bic
        bic_diff = {feature: current_bic
                    - sm.OLS(y, sm.add_constant(pd.DataFrame(X[list(set(included) - {feature})]))).fit().bic
                    for feature in included}
        if bic_diff and max(bic_diff.values()) > 0:
            drop = max(bic_diff, key=bic_diff.get)
            included.remove(drop)
            best_bic = current_bic - bic_diff[drop]
            changed = True
            if verbose:
                print('Drop {:30} with BIC {:.6}'.format(drop, best_bic))

        if not changed:
            break

    return included
# Stepwise selection based on BIC (re-evaluated)
result_bic_recheck = stepwise_selection_bic_recheck(X, y)

## Add lbm with BIC 864.023


## Add ssf with BIC 741.239
## Add ht with BIC 313.683
result_bic_recheck, sm.OLS(y, sm.add_constant(X[result_bic_recheck])).fit().bic

## (['lbm', 'ssf', 'ht'], 313.6828085768625)


The BIC criterion chooses the same model as the stepwise method, with variables lbm, ht, and ssf. This is
model_12.
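The conclusion can be double-checked by comparing the BIC of the two candidate models directly (the summary tables above report about 316.9 for model_11 and 313.7 for model_12):
# Lower BIC is better; model_12 (lbm, ssf, ht) should come out ahead.
print(round(model_11.bic, 1), round(model_12.bic, 1))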
(f) We fit a lasso model as follows
all_vars_1 = "+".join(data_1.columns.drop(["bmi"]))
model_13 = smf.ols("bmi ~ " + all_vars_1, data_1).fit_regularized(L1_wt = 1, alpha = 0.5)
print(model_13.params)

## Intercept 0.000000
## lbm 0.313154
## ssf 0.074056
## ht -0.023160
## wcc 0.000000
## hc 0.038315
## dtype: float64
The lasso regression sets some coefficients to zero, meaning that those variables are excluded from the model. We see that the procedure keeps the same set of variables as the first model in (c), except that the intercept is also shrunk to zero. Observe that between the two models in (c) the adjusted R2 only changes from 0.9692 to 0.969, and the difference in AIC is also very small, from 300.4 to 300.45. I would keep the simpler model, model_12.
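The lasso fit depends on the penalty weight alpha, which was fixed at 0.5 in the statement. A small sketch of how the coefficients shrink as alpha grows, assuming data_1 and all_vars_1 as above (the grid of alpha values is arbitrary):
# Refit the lasso for a few penalty values and collect the coefficient estimates.
alphas = [0.01, 0.1, 0.5, 1.0]
paths = pd.DataFrame(
    {a: smf.ols("bmi ~ " + all_vars_1, data_1).fit_regularized(L1_wt = 1, alpha = a).params
     for a in alphas})
print(paths.round(3))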
(g) The diagnostic plots for our final model
cls_12 = LinearRegDiagnostic(model_12)
vif, fig, ax = cls_12()

[Figure: diagnostic plots for model_12 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 97, 98, and 132 are labelled]
The quantile plot departs from the straight line at the lower end, the scale-location plot shows an increasing pattern, and in the residuals vs leverage plot we see points with large residuals.
We use tests for normality and homoscedasticity
swt = sp.stats.shapiro(model_12.resid)
print('p-value: ', round(swt[1],5))

## p-value: 0.00417
For homoscedasticity we use the Breusch-Pagan test (not reviewed in the lectures).
round(sm.stats.het_breuschpagan(model_12.resid, model_12.model.exog, False)[1],7)

## 1.7e-06
The normality test rejects the null hypothesis of a normal distribution, and the homoscedasticity assumption
is rejected too.
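Since homoscedasticity is rejected, one option (not required by the exercise) is to keep the same regressors but base the tests on heteroscedasticity-robust standard errors; a sketch, assuming data_1 as above:
# Same mean model as model_12, with HC3 robust covariance for the coefficient tests.
model_12_robust = smf.ols("bmi ~ lbm + ssf + ht", data_1).fit(cov_type = "HC3")
print(model_12_robust.summary())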
For comparison, we plot the diagnostic graphs for the other model. The differences are small.

cls_11 = LinearRegDiagnostic(model_11)
_, fig, ax = cls_11()

[Figure: diagnostic plots for model_11 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 97, 98, and 132 are labelled]
The diagnostic plots and the tests indicate that this model does not satisfy the assumptions for linear
regression.
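A remark on the plots above: they were produced with the LinearRegDiagnostic helper class (presumably provided in the lectures; it is not reproduced here). If it is not at hand, an equivalent set of plots can be built directly from a fitted model. A minimal sketch, assuming model_12 as above:
# Four standard diagnostic plots built by hand from the fitted model.
infl = model_12.get_influence()
std_resid = infl.resid_studentized_internal
fitted = model_12.fittedvalues

fig, axs = plt.subplots(2, 2, figsize = (10, 8))
axs[0, 0].scatter(fitted, model_12.resid)               # residuals vs fitted
axs[0, 0].axhline(0, color = "grey", ls = "--")
axs[0, 0].set(xlabel = "Fitted values", ylabel = "Residuals")
sm.qqplot(std_resid, line = "45", ax = axs[0, 1])       # normal Q-Q
axs[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)))   # scale-location
axs[1, 0].set(xlabel = "Fitted values", ylabel = "sqrt(|Standardized residuals|)")
axs[1, 1].scatter(infl.hat_matrix_diag, std_resid)      # residuals vs leverage
axs[1, 1].set(xlabel = "Leverage", ylabel = "Standardized residuals")
plt.tight_layout()
plt.show()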

Exercise 2
The data set data_q2 has six variables. Find a minimal adequate model for res in terms of the other variables
(without interactions). In your answer include exploratory analysis, variable selection and residual analysis.
In each step justify clearly the reason for your decision. Give a prediction of res using your model for
a subject with values (var1,var2,var3,var4,var5) = (16.1, 14.0, 66.8, 202, 45.4).

Solution
Load the data
data_2 = pd.read_table("data/data_q2.txt", delim_whitespace = True)
data_2 = data_2.reset_index(drop = True)
data_2.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 35 entries, 0 to 34
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 res 35 non-null float64
## 1 var1 35 non-null float64
## 2 var2 35 non-null float64
## 3 var3 35 non-null float64
## 4 var4 35 non-null float64
## 5 var5 35 non-null float64
## dtypes: float64(6)
## memory usage: 1.8 KB
Do a pairplot
sns.pairplot(data_2, kind = "reg", diag_kind = "kde");
plt.show()

[Figure: pairwise scatterplot matrix (regression lines, KDE diagonals) for res, var1, var2, var3, var4, and var5]

The top row shows that res increases with variables 1, 2, and 4, and decreases with 3 and 5; res and var2
show a strong linear relation. Variables 3 and 4 appear to have a high correlation.
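Again, the impressions from the pairplot can be checked against the numeric correlations; a quick look, assuming data_2 as above:
# Correlations of res with the regressors and of the regressors with each other.
print(data_2.corr().round(2))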
Fit a complete model
all_vars_2 = "+".join(data_2.columns.drop(["res"]))
model_20 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_20.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.966
## Model: OLS Adj. R-squared: 0.961
## Method: Least Squares F-statistic: 166.8

12
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.93e-20
## Time: 09:32:09 Log-Likelihood: -71.580
## No. Observations: 35 AIC: 155.2
## Df Residuals: 29 BIC: 164.5
## Df Model: 5
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept -1.5719 11.823 -0.133 0.895 -25.754 22.610
## var1 0.0893 0.120 0.744 0.463 -0.156 0.335
## var2 2.0243 0.844 2.398 0.023 0.297 3.751
## var3 -2.8994 0.321 -9.030 0.000 -3.556 -2.243
## var4 0.9629 0.163 5.892 0.000 0.629 1.297
## var5 -0.0010 0.209 -0.005 0.996 -0.429 0.427
## ==============================================================================
## Omnibus: 0.593 Durbin-Watson: 1.573
## Prob(Omnibus): 0.743 Jarque-Bera (JB): 0.568
## Skew: -0.279 Prob(JB): 0.753
## Kurtosis: 2.722 Cond. No. 7.98e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 7.98e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Variables 2, 3 and 4 appear significant, but we need to check for collinearity. We use the VIF from statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
for i, name in enumerate(model_20.model.exog_names):
    print(name, round(vif(model_20.model.exog, i),3))

## Intercept 1158.7
## var1 1.093
## var2 10.789
## var3 18.31
## var4 31.548
## var5 1.026
Also inspect the correlation matrix
fig, ax = plt.subplots()
cax = ax.matshow(data_2.corr(), vmin = -1, vmax = 1, cmap = plt.cm.RdBu)
fig.colorbar(cax);
ax.set(xticks = range(data_2.shape[1]), yticks = range(data_2.shape[1]),
xticklabels = data_2.columns, yticklabels = data_2.columns);
plt.show()

[Figure: correlation matrix heatmap for res, var1, var2, var3, var4, and var5]

We see that the variance inflation factors for variables 4, 3, and 2 are large, and the correlation matrix also shows large values. Since the VIF is largest for variable 4, we try dropping it from the model
all_vars_2 = "+".join(data_2.columns.drop(["res", "var4"]))
model_21 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_21.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.926
## Model: OLS Adj. R-squared: 0.916
## Method: Least Squares F-statistic: 94.07
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.58e-16
## Time: 09:32:10 Log-Likelihood: -85.354
## No. Observations: 35 AIC: 180.7
## Df Residuals: 30 BIC: 188.5
## Df Model: 4
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 10.0161 16.990 0.590 0.560 -24.682 44.715
## var1 0.0158 0.174 0.091 0.928 -0.339 0.371
## var2 6.7554 0.380 17.769 0.000 5.979 7.532
## var3 -1.0653 0.115 -9.296 0.000 -1.299 -0.831

## var5 -0.0397 0.305 -0.130 0.897 -0.662 0.582
## ==============================================================================
## Omnibus: 0.765 Durbin-Watson: 1.484
## Prob(Omnibus): 0.682 Jarque-Bera (JB): 0.203
## Skew: -0.156 Prob(JB): 0.903
## Kurtosis: 3.205 Cond. No. 3.06e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 3.06e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Check vif
for i, name in enumerate(model_21.model.exog_names):
    print(name, round(vif(model_21.model.exog, i),3))

## Intercept 1126.635
## var1 1.081
## var2 1.03
## var3 1.098
## var5 1.025
Now variables 2 and 3 are significant and the VIFs have decreased to normal levels. var1 has the largest p-value, so we drop it from the model
all_vars_2 = "+".join(data_2.columns.drop(["res", "var4", "var1"]))
model_22 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_22.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.926
## Model: OLS Adj. R-squared: 0.919
## Method: Least Squares F-statistic: 129.6
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.27e-17
## Time: 09:32:11 Log-Likelihood: -85.359
## No. Observations: 35 AIC: 178.7
## Df Residuals: 31 BIC: 184.9
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 10.4008 16.189 0.642 0.525 -22.616 43.418
## var2 6.7562 0.374 18.068 0.000 5.994 7.519
## var3 -1.0680 0.109 -9.821 0.000 -1.290 -0.846
## var5 -0.0371 0.298 -0.124 0.902 -0.645 0.571
## ==============================================================================
## Omnibus: 0.746 Durbin-Watson: 1.492
## Prob(Omnibus): 0.689 Jarque-Bera (JB): 0.187
## Skew: -0.146 Prob(JB): 0.911

## Kurtosis: 3.207 Cond. No. 2.90e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.9e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Finally, we drop var5, which is also non-significant.
all_vars_2 = "+".join(data_2.columns.drop(["res", "var4", "var1", "var5"]))
model_23 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_23.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.926
## Model: OLS Adj. R-squared: 0.921
## Method: Least Squares F-statistic: 200.5
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 7.91e-19
## Time: 09:32:11 Log-Likelihood: -85.367
## No. Observations: 35 AIC: 176.7
## Df Residuals: 32 BIC: 181.4
## Df Model: 2
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 8.7246 8.810 0.990 0.329 -9.221 26.670
## var2 6.7614 0.366 18.482 0.000 6.016 7.507
## var3 -1.0690 0.107 -10.010 0.000 -1.287 -0.851
## ==============================================================================
## Omnibus: 0.924 Durbin-Watson: 1.496
## Prob(Omnibus): 0.630 Jarque-Bera (JB): 0.264
## Skew: -0.160 Prob(JB): 0.876
## Kurtosis: 3.280 Cond. No. 1.38e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.38e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
This is the minimal adequate model. The equation for the model is

res = 8.7246 + 6.7614 × var2 − 1.069 × var3

We look now at the residual plots


cls_23 = LinearRegDiagnostic(model_23)
_, fig, ax = cls_23()

[Figure: diagnostic plots for model_23 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 3, 12, and 13 are labelled]
These graphs show that the usual hypotheses are satisfied and the fit is good. We run tests for normality and homogeneity of variance
swt = sp.stats.shapiro(model_23.resid)
print('p-value: ', round(swt[1],3))

## p-value: 0.727
bpt = sm.stats.het_breuschpagan(model_23.resid, model_23.model.exog, False)
print('p-value: ', round(bpt[1],3))

## p-value: 0.583
The tests confirm that the assumptions are satisfied.
We now assess the influence plots
fig, ax = plt.subplots()
sm.graphics.influence_plot(model_23, ax = ax);
plt.show()

[Figure: influence plot for model_23 (studentized residuals vs leverage, point size proportional to Cook's distance); observations 3, 12, and 13 stand out]
In the influence plot we see that no point simultaneously has high leverage, a large Cook's distance, and a large residual. A few of the standardized residuals are large (above 2 in absolute value). The worst points seem to be 13, 12, and 3.
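The influence measures behind the plot can also be inspected numerically; a quick check for the flagged observations, assuming model_23 as above (the labels 3, 12 and 13 are the ones reported in the plot):
# Leverage, Cook's distance and externally studentized residual for the flagged points.
infl_23 = model_23.get_influence()
infl_frame = infl_23.summary_frame()[["hat_diag", "cooks_d", "student_resid"]]
print(infl_frame.loc[[3, 12, 13]].round(3))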
The equation for the final model is

res = 8.7246 + 6.7614 × var2 − 1.069 × var3

To give a prediction we only need the values var2 = 14.0 and var3 = 66.8. We can use the equation for the model to obtain the predicted value
round(8.7246 + 6.7614*14.0 - 1.069*66.8,3)

## 31.975

8.7246 + 6.7614 × 14.0 − 1.069 × 66.8 = 31.975

or we can do it with the following commands:


new_data = pd.DataFrame({"var2": [14.0], "var3": [66.8]})
new_pred = model_23.get_prediction(new_data)
new_pred.summary_frame()

## mean mean_se ... obs_ci_lower obs_ci_upper


## 0 31.975899 1.163916 ... 25.609573 38.342226
##
## [1 rows x 6 columns]

Exercise 3
In this exercise we want to examine the relationship between body temperature and heart rate. Further, we
would like to use heart rate to predict the body temperature.

(a) Use the BodyTemperature.txt data set to build a simple linear regression model for body temperature
using heart rate as the predictor.
(b) Interpret the estimate of regression coefficient and examine its statistical significance.
(c) Find the 95% confidence interval for the regression coefficient.
(d) Find the value of R2 and show that it is equal to the square of the sample correlation coefficient.
(e) Create simple diagnostic plots for your model and identify possible outliers.
(f) If someone’s heart rate is 75, what would be your estimate of this person’s body temperature?
(g) We believe that gender might also be related to body temperature and could help us to predict its
unknown values. Use the BodyTemperature.txt data set to build a multiple linear regression model for
body temperature using heart rate and gender as predictors. For the tests in this section use α = 0.1.
(h) We answer the next four questions using the additive model with HeartRate and Gender as variables.
How much did the R2 increase compared to the simple linear regression model above?
(i) Explain the estimates of regression coefficients in plain language.
(j) Find the 95% confidence intervals for regression coefficients.
(k) If a woman’s heart rate is 75, what would be your estimate of her body temperature? What would be
your estimate of body temperature for a man whose heart rate is 75?

Solution
Load the data
data_3 = pd.read_table("data/BodyTemperature.txt", delim_whitespace = True)
data_3.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 100 entries, 0 to 99
## Data columns (total 4 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Gender 100 non-null object
## 1 Age 100 non-null int64
## 2 HeartRate 100 non-null int64
## 3 Temperature 100 non-null float64
## dtypes: float64(1), int64(2), object(1)
## memory usage: 3.2+ KB
(a) Fit Temperature against HeartRate
model_30 = smf.ols("Temperature ~ HeartRate", data_3).fit()
model_30.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Temperature R-squared: 0.200
## Model: OLS Adj. R-squared: 0.192
## Method: Least Squares F-statistic: 24.56
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 3.01e-06
## Time: 09:32:13 Log-Likelihood: -125.80
## No. Observations: 100 AIC: 255.6

## Df Residuals: 98 BIC: 260.8
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 92.3907 1.201 76.900 0.000 90.006 94.775
## HeartRate 0.0806 0.016 4.956 0.000 0.048 0.113
## ==============================================================================
## Omnibus: 1.376 Durbin-Watson: 1.804
## Prob(Omnibus): 0.503 Jarque-Bera (JB): 0.861
## Skew: -0.057 Prob(JB): 0.650
## Kurtosis: 3.440 Cond. No. 1.03e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.03e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Scatterplot
fig, ax = plt.subplots()
ax.scatter(data_3["HeartRate"], data_3["Temperature"]);
sm.graphics.abline_plot(model_results = model_30, color = "red", ax = ax);
plt.xlabel("Heart Rate")

## Text(0.5, 0, 'Heart Rate')


plt.ylabel("Temperature (ºF)")

## Text(0, 0.5, 'Temperature (ºF)')


plt.show()

[Figure: scatterplot of Temperature (ºF) against Heart Rate with the fitted regression line]

(b) Both the intercept and the slope have a reported p-value of 0, so both are significantly different from zero. The intercept, 92.3907, is the expected Temperature when the heart rate is equal to zero. The slope, 0.0806, is the increase in Temperature when the heart rate increases by one unit.
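Note that a heart rate of zero is far outside the observed range (roughly 60 to 85 in the scatterplot), so the intercept is an extrapolation. If a more interpretable intercept is wanted, the predictor can be centered with patsy's center() transform; a sketch, assuming data_3 as above:
# With a centered predictor the intercept is the expected Temperature at the
# average heart rate; the slope estimate is unchanged.
model_30c = smf.ols("Temperature ~ center(HeartRate)", data_3).fit()
print(model_30c.params)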
(c) The 95% confidence intervals for the regression coefficients are given by
model_30.conf_int()

## 0 1
## Intercept 90.006462 94.774900
## HeartRate 0.048347 0.112916
(d) The R2 can be found either in the summary table or by calling
round(model_30.rsquared,4)

## 0.2004
We see that the model only explains about 20% of the variability in the data. We calculate the squared correlation to verify that it is equal to the R2:
data_3[["HeartRate", "Temperature"]].corr() ** 2

## HeartRate Temperature
## HeartRate 1.000000 0.200418
## Temperature 0.200418 1.000000
(e) Diagnostic plots
cls_30 = LinearRegDiagnostic(model_30)
_, fig, ax = cls_30()

[Figure: diagnostic plots for model_30 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 5, 74, 78, and 85 are labelled]

Influence plot
fig, ax = plt.subplots()
sm.graphics.influence_plot(model_30, ax = ax);
plt.show()

[Figure: influence plot for model_30 (studentized residuals vs leverage, point size proportional to Cook's distance)]
Points 5 and 78 are flagged in all diagnostic graphs, and points 74 and 85 are flagged in some. They may be outliers and should be investigated further.
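One way to investigate further is to refit the model without the flagged observations and check how much the coefficients move; a sketch, assuming data_3 as above (the row labels 5, 74, 78 and 85 are the ones reported in the plots):
# Compare intercept and slope with and without the suspected outliers.
model_30_sub = smf.ols("Temperature ~ HeartRate", data_3.drop(index = [5, 74, 78, 85])).fit()
print(pd.concat({"all": model_30.params, "without": model_30_sub.params}, axis = 1))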
(f) Predicting Temperature for a heart rate of 75. Using get_prediction
new_data = pd.DataFrame({"HeartRate": [75]})
new_pred = model_30.get_prediction(new_data)
new_pred.summary_frame()

## mean mean_se ... obs_ci_lower obs_ci_upper


## 0 98.438046 0.088721 ... 96.722331 100.153761
##
## [1 rows x 6 columns]
Using the coefficients and the model formula:
round(model_30.params[0] + model_30.params[1] * 75,2)

## 98.44
(g) We first fit a model with an interaction between HeartRate and Gender, to check whether the interaction is needed
model_31 = smf.ols("Temperature ~ HeartRate * Gender", data_3).fit()
model_31.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Temperature R-squared: 0.226
## Model: OLS Adj. R-squared: 0.201
## Method: Least Squares F-statistic: 9.317
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.82e-05
## Time: 09:32:16 Log-Likelihood: -124.21
## No. Observations: 100 AIC: 256.4
## Df Residuals: 96 BIC: 266.8
## Df Model: 3
## Covariance Type: nonrobust

## =========================================================================================
## coef std err t P>|t| [0.025 0.975]
## -----------------------------------------------------------------------------------------
## Intercept 92.1833 1.855 49.698 0.000 88.501 95.865
## Gender[T.M] 0.1338 2.428 0.055 0.956 -4.686 4.953
## HeartRate 0.0855 0.025 3.389 0.001 0.035 0.136
## HeartRate:Gender[T.M] -0.0059 0.033 -0.179 0.858 -0.071 0.059
## ==============================================================================
## Omnibus: 2.438 Durbin-Watson: 1.815
## Prob(Omnibus): 0.296 Jarque-Bera (JB): 2.137
## Skew: 0.062 Prob(JB): 0.344
## Kurtosis: 3.705 Cond. No. 2.84e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.84e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
The p-value for the interaction is large, so the interaction is not significant and we drop it from the model.
model_32 = smf.ols("Temperature ~ HeartRate + Gender", data_3).fit()
model_32.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Temperature R-squared: 0.225
## Model: OLS Adj. R-squared: 0.209
## Method: Least Squares F-statistic: 14.10
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 4.21e-06
## Time: 09:32:16 Log-Likelihood: -124.23
## No. Observations: 100 AIC: 254.5
## Df Residuals: 97 BIC: 262.3
## Df Model: 2
## Covariance Type: nonrobust
## ===============================================================================
## coef std err t P>|t| [0.025 0.975]
## -------------------------------------------------------------------------------
## Intercept 92.4376 1.189 77.743 0.000 90.078 94.798
## Gender[T.M] -0.3004 0.170 -1.763 0.081 -0.639 0.038
## HeartRate 0.0820 0.016 5.088 0.000 0.050 0.114
## ==============================================================================
## Omnibus: 2.387 Durbin-Watson: 1.810
## Prob(Omnibus): 0.303 Jarque-Bera (JB): 2.072
## Skew: 0.056 Prob(JB): 0.355
## Kurtosis: 3.696 Cond. No. 1.03e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.03e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.

## """
In this model (see the summary), Gender is marginally significant. At the α = 0.05 level we would keep the simple regression model using only HeartRate, and the rest of the question would have the same answers as in the first part. However, since we are told to use α = 0.1, Gender is significant and the answers are different.
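The same comparison can be made with a partial F-test between the two nested models; a sketch, assuming model_30 and model_32 as above:
# anova_lm compares the simple model against the additive model; since only one
# parameter is added, the p-value matches the t-test for Gender above (about 0.08).
print(sm.stats.anova_lm(model_30, model_32))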
(h) The first value corresponds to the additive model, the second to the simple linear model; the R2 increases from 0.200 to 0.225, that is, by about 0.025.
(round(model_32.rsquared,3), round(model_30.rsquared,3))

## (0.225, 0.2)
(i) The additive model has a common slope of 0.08199 for both genders: the expected temperature increases by about 0.082 ºF per unit increase in heart rate. The intercepts differ by the Gender coefficient: for females the intercept is 92.4376, while for males it is 92.4376 − 0.3004 = 92.1372.
(j) The 95% confidence intervals for regression coefficients
model_32.conf_int()

## 0 1
## Intercept 90.077761 94.797512
## Gender[T.M] -0.638659 0.037776
## HeartRate 0.050009 0.113977
(k) Create a new data frame and make predictions
new_data = pd.DataFrame({"HeartRate": [75,75], "Gender": ["F","M"]})
new_pred = model_32.get_prediction(new_data)
new_pred.summary_frame()

## mean mean_se ... obs_ci_lower obs_ci_upper


## 0 98.587086 0.121868 ... 96.881046 100.293127
## 1 98.286645 0.122801 ... 96.580341 99.992949
##
## [2 rows x 6 columns]
