
STAT 210

Applied Statistics and Data Analysis


Problem List 10

import numpy as np
import scipy as sp
import pandas as pd
import statsmodels
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

Exercise 1
For this exercise we use the data set ais. We concentrate on six variables, bmi, lbm, ssf, ht, wcc, and hc.
(a) Use the function pairplot in the package seaborn to obtain a graph matrix for the variables. Use the
matshow function from matplotlib to draw a plot of the correlation coefficients for the six variables.
Comment.
(b) Fit a multiple regression model for bmi as a function of the other variables. Print the summary table
and discuss the results.
(c) Using a stepwise procedure, select a minimal adequate model. Use αcrit = 0.05.
(d) Using R2 as a selection criterion, select a minimal adequate model.
(e) Using BIC as a selection criterion, select a minimal adequate model.
(f) Fit also a lasso regression model. You can use smf.ols().fit_regularized(L1_wt = 1, alpha =
0.5) syntax to do it. Select a minimal adequate model out of all these procedures. Justify your answer.
(g) Draw the diagnostic plots for your final model and discuss them.

Solution
(a) Load the data
data_1 = pd.read_table("data/ais", delim_whitespace = True)
data_1.info()

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 202 entries, 1 to 202
## Data columns (total 13 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 rcc 202 non-null float64
## 1 wcc 202 non-null float64
## 2 hc 202 non-null float64

## 3 hg 202 non-null float64
## 4 ferr 202 non-null int64
## 5 bmi 202 non-null float64
## 6 ssf 202 non-null float64
## 7 pcBfat 202 non-null float64
## 8 lbm 202 non-null float64
## 9 ht 202 non-null float64
## 10 wt 202 non-null float64
## 11 sex 202 non-null object
## 12 sport 202 non-null object
## dtypes: float64(10), int64(1), object(2)
## memory usage: 22.1+ KB
data_1 = data_1[["bmi", "lbm", "ssf", "ht", "wcc", "hc"]]
data_1.info()

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 202 entries, 1 to 202
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 bmi 202 non-null float64
## 1 lbm 202 non-null float64
## 2 ssf 202 non-null float64
## 3 ht 202 non-null float64
## 4 wcc 202 non-null float64
## 5 hc 202 non-null float64
## dtypes: float64(6)
## memory usage: 11.0 KB
Pairwise scatterplots
sns.pairplot(data_1, kind = "reg", diag_kind = "kde", plot_kws = {"line_kws": {"color": "red"}});
plt.show()

[Figure: pairwise scatterplot matrix (regression lines, KDE diagonals) for bmi, lbm, ssf, ht, wcc, and hc]

Plots on the top row have bmi on the y-axis and are the relevant ones for variable selection. They show that bmi has a strong linear relation with lbm, while the linear relation with the other variables is less clear. There is also a strong linear relation between lbm and ht.
The correlation matrix
fig, ax = plt.subplots()
cax = ax.matshow(data_1.corr(), vmin = -1, vmax = 1, cmap = plt.cm.RdBu)
fig.colorbar(cax);
ax.set(xticks = range(data_1.shape[1]), yticks = range(data_1.shape[1]),
xticklabels = data_1.columns, yticklabels = data_1.columns);
plt.show()

[Figure: correlation matrix heatmap for bmi, lbm, ssf, ht, wcc, and hc]
The highest correlation corresponds to ht and lbm, with a value of around 0.8; lbm and bmi also have a
strong correlation.
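The numeric values behind the heatmap can also be printed directly; a quick check, assuming data_1 is the six-variable data frame defined above:
# Correlation matrix rounded to two decimals; the lbm-ht and lbm-bmi entries
# should match the strong relations seen in the pairplot.
print(data_1.corr().round(2))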
(b) To include all variables in the formula, we can use string manipulation in Python. We first join all the column names with a + separator in between, and then fit a model by appending the result to the formula. Observe that we need to drop bmi, since this variable is the response and not a regressor.
all_vars_1 = "+".join(data_1.columns.drop(["bmi"]))
model_10 = smf.ols("bmi ~ " + all_vars_1, data_1).fit()
model_10.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: bmi R-squared: 0.970
## Model: OLS Adj. R-squared: 0.969
## Method: Least Squares F-statistic: 1260.
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 7.23e-147
## Time: 09:32:02 Log-Likelihood: -145.10
## No. Observations: 202 AIC: 302.2
## Df Residuals: 196 BIC: 322.0
## Df Model: 5
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------

## Intercept 40.5999 1.074 37.817 0.000 38.483 42.717
## lbm 0.3219 0.005 60.579 0.000 0.311 0.332
## ssf 0.0508 0.001 40.202 0.000 0.048 0.053
## ht -0.2380 0.006 -37.985 0.000 -0.250 -0.226
## wcc 0.0090 0.021 0.437 0.663 -0.032 0.049
## hc 0.0176 0.014 1.301 0.195 -0.009 0.044
## ==============================================================================
## Omnibus: 12.113 Durbin-Watson: 1.223
## Prob(Omnibus): 0.002 Jarque-Bera (JB): 16.098
## Skew: -0.412 Prob(JB): 0.000319
## Kurtosis: 4.110 Cond. No. 6.32e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.32e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
The variables wcc and hc have large p-values and do not seem to be significant. All the other variables have
a p-value reported as zero.
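Given the large condition number flagged in the notes of the summary, it is also worth looking at variance inflation factors (the same check used in Exercise 2); a quick check, assuming model_10 is the full model fitted above:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# VIF for each column of the design matrix built by the formula (including the intercept)
for i, name in enumerate(model_10.model.exog_names):
    print(name, round(vif(model_10.model.exog, i), 3))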
(c) We use a critical value of αcrit = 0.05 for variable selection. We remove wcc which has the largest
p-value.
all_vars_1 = "+".join(data_1.columns.drop(["bmi", "wcc"]))
model_11 = smf.ols("bmi ~ " + all_vars_1, data_1).fit()
model_11.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: bmi R-squared: 0.970
## Model: OLS Adj. R-squared: 0.969
## Method: Least Squares F-statistic: 1581.
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.88e-148
## Time: 09:32:02 Log-Likelihood: -145.20
## No. Observations: 202 AIC: 300.4
## Df Residuals: 197 BIC: 316.9
## Df Model: 4
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 40.6066 1.071 37.905 0.000 38.494 42.719
## lbm 0.3219 0.005 60.707 0.000 0.311 0.332
## ssf 0.0510 0.001 41.522 0.000 0.049 0.053
## ht -0.2380 0.006 -38.068 0.000 -0.250 -0.226
## hc 0.0188 0.013 1.420 0.157 -0.007 0.045
## ==============================================================================
## Omnibus: 12.380 Durbin-Watson: 1.232
## Prob(Omnibus): 0.002 Jarque-Bera (JB): 16.472
## Skew: -0.420 Prob(JB): 0.000265
## Kurtosis: 4.118 Cond. No. 6.32e+03
## ==============================================================================

##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.32e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
We now remove hc.
all_vars_1 = "+".join(data_1.columns.drop(["bmi", "wcc", "hc"]))
model_12 = smf.ols("bmi ~ " + all_vars_1, data_1).fit()
model_12.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: bmi R-squared: 0.969
## Model: OLS Adj. R-squared: 0.969
## Method: Least Squares F-statistic: 2097.
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.03e-149
## Time: 09:32:02 Log-Likelihood: -146.22
## No. Observations: 202 AIC: 300.4
## Df Residuals: 198 BIC: 313.7
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 41.4719 0.883 46.942 0.000 39.730 43.214
## lbm 0.3254 0.005 69.112 0.000 0.316 0.335
## ssf 0.0503 0.001 44.440 0.000 0.048 0.053
## ht -0.2393 0.006 -38.597 0.000 -0.252 -0.227
## ==============================================================================
## Omnibus: 14.254 Durbin-Watson: 1.212
## Prob(Omnibus): 0.001 Jarque-Bera (JB): 20.207
## Skew: -0.452 Prob(JB): 4.09e-05
## Kurtosis: 4.259 Cond. No. 5.08e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 5.08e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
This is the final model in the stepwise procedure since all the remaining variables have p-values below αcrit .
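The same backward elimination can be automated by repeatedly dropping the regressor with the largest p-value until all remaining p-values are below αcrit. A minimal sketch, assuming data_1 as above; the helper name drop_by_pvalue is ours, not from the lectures:
def drop_by_pvalue(data, response, alpha_crit=0.05):
    # Backward elimination: start from the full model and drop the regressor
    # with the largest p-value until all remaining p-values are below alpha_crit.
    predictors = list(data.columns.drop(response))
    while predictors:
        model = smf.ols(response + " ~ " + "+".join(predictors), data).fit()
        pvals = model.pvalues.drop("Intercept")
        if pvals.max() < alpha_crit:
            break
        predictors.remove(pvals.idxmax())
    return model

model_step = drop_by_pvalue(data_1, "bmi")
print(model_step.params.index.tolist())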
(d) To select a model using R2 as a criterion (adjusted R2, so that adding variables is penalized), we use the function from the lectures.
from itertools import combinations

# Function to get the best model by adjusted R^2
def get_best_model_by_adj_r2(X, y):
    best_adj_r2 = float("-inf")
    best_combination = []
    for i in range(1, len(X.columns) + 1):
        for combo in combinations(X.columns, i):
            model = sm.OLS(y, sm.add_constant(X[list(combo)])).fit()
            if model.rsquared_adj > best_adj_r2:
                best_adj_r2 = model.rsquared_adj
                best_combination = combo
    return best_combination, best_adj_r2

We need to separate the regressors from the dependent variable and then we run the function:
# Drop bmi to get matrix of predictors
X = data_1.drop(columns=['bmi'])
y = data_1['bmi']
best_combo, best_adj_r2 = get_best_model_by_adj_r2(X, y)
print(f"Best combination by adjusted Rˆ2: {best_combo}")
print(f"Maximum adjusted Rˆ2: {best_adj_r2:.5f}")

## Best combination by adjusted R^2: ('lbm', 'ssf', 'ht', 'hc')


## Maximum adjusted R^2: 0.96918
This criterion selects model_11 with variables lbm, ssf, ht, and hc.
(e) For BIC we also use the function from the lectures.
def stepwise_selection_bic_recheck(X, y, initial_list=[], verbose=True):
    """
    Perform a forward-backward feature selection
    based on BIC from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features
    """
    included = list(initial_list)
    best_bic = float('inf')
    while True:
        changed = False
        # forward step
        excluded = list(set(X.columns) - set(included))
        new_bic = {}
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_bic[new_column] = model.bic
        best_new_bic = min(new_bic.values())
        if best_new_bic < best_bic:
            best_bic = best_new_bic
            best_feature = min(new_bic, key=new_bic.get)
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add {:30} with BIC {:.6}'.format(best_feature, best_bic))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        current_bic = model.bic
        bic_diff = {feature: current_bic
                    - sm.OLS(y, sm.add_constant(pd.DataFrame(X[list(set(included) - {feature})]))).fit().bic
                    for feature in included}
        if bic_diff and max(bic_diff.values()) > 0:
            drop = max(bic_diff, key=bic_diff.get)
            included.remove(drop)
            best_bic = current_bic - bic_diff[drop]
            changed = True
            if verbose:
                print('Drop {:30} with BIC {:.6}'.format(drop, best_bic))

        if not changed:
            break

    return included
# Stepwise selection based on BIC (re-evaluated)
result_bic_recheck = stepwise_selection_bic_recheck(X, y)

## Add lbm with BIC 864.023


## Add ssf with BIC 741.239
## Add ht with BIC 313.683
result_bic_recheck, sm.OLS(y, sm.add_constant(X[result_bic_recheck])).fit().bic

## (['lbm', 'ssf', 'ht'], 313.6828085768625)


The BIC criterion chooses the same model as the stepwise method, with variables lbm, ht, and ssf. This is
model_12.
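The conclusion can be double-checked by comparing the BIC of the two candidate models directly (the summary tables above report about 316.9 for model_11 and 313.7 for model_12):
# Lower BIC is better; model_12 (lbm, ssf, ht) should come out ahead.
print(round(model_11.bic, 1), round(model_12.bic, 1))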
(f) We fit a lasso model as follows
all_vars_1 = "+".join(data_1.columns.drop(["bmi"]))
model_13 = smf.ols("bmi ~ " + all_vars_1, data_1).fit_regularized(L1_wt = 1, alpha = 0.5)
print(model_13.params)

## Intercept 0.000000
## lbm 0.313154
## ssf 0.074056
## ht -0.023160
## wcc 0.000000
## hc 0.038315
## dtype: float64
The lasso regression sets some coefficients to zero, meaning that those variables are excluded from the model. We see that the procedure keeps the same set of variables as the first model in (c), except that the intercept is also shrunk to zero. Observe that between the two models in (c) the adjusted R2 only changes from 0.9692 to 0.969, and the difference in AIC is also very small, from 300.4 to 300.45. I would keep the simpler model, model_12.
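The lasso fit depends on the penalty weight alpha, which was fixed at 0.5 in the statement. A small sketch of how the coefficients shrink as alpha grows, assuming data_1 and all_vars_1 as above (the grid of alpha values is arbitrary):
# Refit the lasso for a few penalty values and collect the coefficient estimates.
alphas = [0.01, 0.1, 0.5, 1.0]
paths = pd.DataFrame(
    {a: smf.ols("bmi ~ " + all_vars_1, data_1).fit_regularized(L1_wt = 1, alpha = a).params
     for a in alphas})
print(paths.round(3))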
(g) The diagnostic plots for our final model
cls_12 = LinearRegDiagnostic(model_12)
vif, fig, ax = cls_12()

[Figure: diagnostic plots for model_12 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 97, 98, and 132 are labelled]
The quantile plot departs from the straight line at the lower end, the scale-location plot shows an increasing pattern, and in the residuals vs leverage plot we see points with large residuals.
We use tests for normality and homoscedasticity
swt = sp.stats.shapiro(model_12.resid)
print('p-value: ', round(swt[1],5))

## p-value: 0.00417
For homoscedasticity we use the Breusch-Pagan test (not reviewed in the lectures).
round(sm.stats.het_breuschpagan(model_12.resid, model_12.model.exog, False)[1],7)

## 1.7e-06
The normality test rejects the null hypothesis of a normal distribution, and the homoscedasticity assumption
is rejected too.
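Since homoscedasticity is rejected, one option (not required by the exercise) is to keep the same regressors but base the tests on heteroscedasticity-robust standard errors; a sketch, assuming data_1 as above:
# Same mean model as model_12, with HC3 robust covariance for the coefficient tests.
model_12_robust = smf.ols("bmi ~ lbm + ssf + ht", data_1).fit(cov_type = "HC3")
print(model_12_robust.summary())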
For comparison, we plot the diagnostic graphs for the other model. The differences are small.

cls_11 = LinearRegDiagnostic(model_11)
_, fig, ax = cls_11()

[Figure: diagnostic plots for model_11 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 97, 98, and 132 are labelled]
The diagnostic plots and the tests indicate that this model does not satisfy the assumptions for linear
regression.
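A remark on the plots above: they were produced with the LinearRegDiagnostic helper class (presumably provided in the lectures; it is not reproduced here). If it is not at hand, an equivalent set of plots can be built directly from a fitted model. A minimal sketch, assuming model_12 as above:
# Four standard diagnostic plots built by hand from the fitted model.
infl = model_12.get_influence()
std_resid = infl.resid_studentized_internal
fitted = model_12.fittedvalues

fig, axs = plt.subplots(2, 2, figsize = (10, 8))
axs[0, 0].scatter(fitted, model_12.resid)               # residuals vs fitted
axs[0, 0].axhline(0, color = "grey", ls = "--")
axs[0, 0].set(xlabel = "Fitted values", ylabel = "Residuals")
sm.qqplot(std_resid, line = "45", ax = axs[0, 1])       # normal Q-Q
axs[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)))   # scale-location
axs[1, 0].set(xlabel = "Fitted values", ylabel = "sqrt(|Standardized residuals|)")
axs[1, 1].scatter(infl.hat_matrix_diag, std_resid)      # residuals vs leverage
axs[1, 1].set(xlabel = "Leverage", ylabel = "Standardized residuals")
plt.tight_layout()
plt.show()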

Exercise 2
The data set data_q2 has six variables. Find a minimal adequate model for res in terms of the other variables
(without interactions). In your answer include exploratory analysis, variable selection and residual analysis.
In each step justify clearly the reason for your decision. Give a prediction of res using your model for
a subject with values (var1,var2,var3,var4,var5) = (16.1, 14.0, 66.8, 202, 45.4).

Solution
Load the data
data_2 = pd.read_table("data/data_q2.txt", delim_whitespace = True)
data_2 = data_2.reset_index(drop = True)
data_2.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 35 entries, 0 to 34
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 res 35 non-null float64
## 1 var1 35 non-null float64
## 2 var2 35 non-null float64
## 3 var3 35 non-null float64
## 4 var4 35 non-null float64
## 5 var5 35 non-null float64
## dtypes: float64(6)
## memory usage: 1.8 KB
Do a pairplot
sns.pairplot(data_2, kind = "reg", diag_kind = "kde");
plt.show()

[Figure: pairwise scatterplot matrix (regression lines, KDE diagonals) for res, var1, var2, var3, var4, and var5]

The top row shows that res increases with variables 1, 2, and 4, and decreases with 3 and 5; res and var2
show a strong linear relation. Variables 3 and 4 appear to have a high correlation.
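Again, the impressions from the pairplot can be checked against the numeric correlations; a quick look, assuming data_2 as above:
# Correlations of res with the regressors and of the regressors with each other.
print(data_2.corr().round(2))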
Fit a complete model
all_vars_2 = "+".join(data_2.columns.drop(["res"]))
model_20 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_20.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.966
## Model: OLS Adj. R-squared: 0.961
## Method: Least Squares F-statistic: 166.8

12
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.93e-20
## Time: 09:32:09 Log-Likelihood: -71.580
## No. Observations: 35 AIC: 155.2
## Df Residuals: 29 BIC: 164.5
## Df Model: 5
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept -1.5719 11.823 -0.133 0.895 -25.754 22.610
## var1 0.0893 0.120 0.744 0.463 -0.156 0.335
## var2 2.0243 0.844 2.398 0.023 0.297 3.751
## var3 -2.8994 0.321 -9.030 0.000 -3.556 -2.243
## var4 0.9629 0.163 5.892 0.000 0.629 1.297
## var5 -0.0010 0.209 -0.005 0.996 -0.429 0.427
## ==============================================================================
## Omnibus: 0.593 Durbin-Watson: 1.573
## Prob(Omnibus): 0.743 Jarque-Bera (JB): 0.568
## Skew: -0.279 Prob(JB): 0.753
## Kurtosis: 2.722 Cond. No. 7.98e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 7.98e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Variables 2, 3 and 4 appear significant, but we need to check for collinearity. We use the VIF from statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
for i, name in enumerate(model_20.model.exog_names):
    print(name, round(vif(model_20.model.exog, i),3))

## Intercept 1158.7
## var1 1.093
## var2 10.789
## var3 18.31
## var4 31.548
## var5 1.026
Also inspect the correlation matrix
fig, ax = plt.subplots()
cax = ax.matshow(data_2.corr(), vmin = -1, vmax = 1, cmap = plt.cm.RdBu)
fig.colorbar(cax);
ax.set(xticks = range(data_2.shape[1]), yticks = range(data_2.shape[1]),
xticklabels = data_2.columns, yticklabels = data_2.columns);
plt.show()

[Figure: correlation matrix heatmap for res, var1, var2, var3, var4, and var5]

We see that the variance inflation factors for variables 4, 3, and 2 are large, and the correlation matrix also shows large values. Since the VIF is largest for variable 4, we try dropping it from the model
all_vars_2 = "+".join(data_2.columns.drop(["res", "var4"]))
model_21 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_21.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.926
## Model: OLS Adj. R-squared: 0.916
## Method: Least Squares F-statistic: 94.07
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.58e-16
## Time: 09:32:10 Log-Likelihood: -85.354
## No. Observations: 35 AIC: 180.7
## Df Residuals: 30 BIC: 188.5
## Df Model: 4
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 10.0161 16.990 0.590 0.560 -24.682 44.715
## var1 0.0158 0.174 0.091 0.928 -0.339 0.371
## var2 6.7554 0.380 17.769 0.000 5.979 7.532
## var3 -1.0653 0.115 -9.296 0.000 -1.299 -0.831

## var5 -0.0397 0.305 -0.130 0.897 -0.662 0.582
## ==============================================================================
## Omnibus: 0.765 Durbin-Watson: 1.484
## Prob(Omnibus): 0.682 Jarque-Bera (JB): 0.203
## Skew: -0.156 Prob(JB): 0.903
## Kurtosis: 3.205 Cond. No. 3.06e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 3.06e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Check vif
for i, name in enumerate(model_21.model.exog_names):
    print(name, round(vif(model_21.model.exog, i),3))

## Intercept 1126.635
## var1 1.081
## var2 1.03
## var3 1.098
## var5 1.025
Now variables 2 and 3 are significant and the VIFs have decreased to normal levels. var1 has the largest p-value, so we drop it from the model
all_vars_2 = "+".join(data_2.columns.drop(["res", "var4", "var1"]))
model_22 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_22.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.926
## Model: OLS Adj. R-squared: 0.919
## Method: Least Squares F-statistic: 129.6
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.27e-17
## Time: 09:32:11 Log-Likelihood: -85.359
## No. Observations: 35 AIC: 178.7
## Df Residuals: 31 BIC: 184.9
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 10.4008 16.189 0.642 0.525 -22.616 43.418
## var2 6.7562 0.374 18.068 0.000 5.994 7.519
## var3 -1.0680 0.109 -9.821 0.000 -1.290 -0.846
## var5 -0.0371 0.298 -0.124 0.902 -0.645 0.571
## ==============================================================================
## Omnibus: 0.746 Durbin-Watson: 1.492
## Prob(Omnibus): 0.689 Jarque-Bera (JB): 0.187
## Skew: -0.146 Prob(JB): 0.911

## Kurtosis: 3.207 Cond. No. 2.90e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.9e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Finally, we drop var5, which is also non-significant.
all_vars_2 = "+".join(data_2.columns.drop(["res", "var4", "var1", "var5"]))
model_23 = smf.ols("res ~ " + all_vars_2, data_2).fit()
model_23.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: res R-squared: 0.926
## Model: OLS Adj. R-squared: 0.921
## Method: Least Squares F-statistic: 200.5
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 7.91e-19
## Time: 09:32:11 Log-Likelihood: -85.367
## No. Observations: 35 AIC: 176.7
## Df Residuals: 32 BIC: 181.4
## Df Model: 2
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 8.7246 8.810 0.990 0.329 -9.221 26.670
## var2 6.7614 0.366 18.482 0.000 6.016 7.507
## var3 -1.0690 0.107 -10.010 0.000 -1.287 -0.851
## ==============================================================================
## Omnibus: 0.924 Durbin-Watson: 1.496
## Prob(Omnibus): 0.630 Jarque-Bera (JB): 0.264
## Skew: -0.160 Prob(JB): 0.876
## Kurtosis: 3.280 Cond. No. 1.38e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.38e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
This is the minimal adequate model. The equation for the model is

res = 8.7246 + 6.7614 × var2 − 1.069 × var3

We look now at the residual plots


cls_23 = LinearRegDiagnostic(model_23)
_, fig, ax = cls_23()

[Figure: diagnostic plots for model_23 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 3, 12, and 13 are labelled]
These graphs show that the usual hypotheses are satisfied and the fit is good. We run tests for normality and homogeneity of variance
swt = sp.stats.shapiro(model_23.resid)
print('p-value: ', round(swt[1],3))

## p-value: 0.727
bpt = sm.stats.het_breuschpagan(model_23.resid, model_23.model.exog, False)
print('p-value: ', round(bpt[1],3))

## p-value: 0.583
The tests confirm that the assumptions are satisfied.
We now assess the influence plots
fig, ax = plt.subplots()
sm.graphics.influence_plot(model_23, ax = ax);
plt.show()

[Figure: influence plot for model_23 (studentized residuals vs leverage, point size proportional to Cook's distance); observations 3, 12, and 13 stand out]
In the influence plot we see that no point simultaneously has high leverage, a large Cook's distance, and a large residual. A few of the standardized residuals are large (above 2 in absolute value). The worst points seem to be 13, 12, and 3.
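The influence measures behind the plot can also be inspected numerically; a quick check for the flagged observations, assuming model_23 as above (the labels 3, 12 and 13 are the ones reported in the plot):
# Leverage, Cook's distance and externally studentized residual for the flagged points.
infl_23 = model_23.get_influence()
infl_frame = infl_23.summary_frame()[["hat_diag", "cooks_d", "student_resid"]]
print(infl_frame.loc[[3, 12, 13]].round(3))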
The equation for the final model is

res = 8.7246 + 6.7614 × var2 − 1.069 × var3

To give a prediction we only need the values var2 = 14.0 and var3 = 66.8. We can use the equation for the model to obtain the predicted value
round(8.7246 + 6.7614*14.0 - 1.069*66.8,3)

## 31.975

8.7246 + 6.7614 × 14.0 − 1.069 × 66.8 = 31.975

or we can do it with the following commands:


new_data = pd.DataFrame({"var2": [14.0], "var3": [66.8]})
new_pred = model_23.get_prediction(new_data)
new_pred.summary_frame()

## mean mean_se ... obs_ci_lower obs_ci_upper


## 0 31.975899 1.163916 ... 25.609573 38.342226
##
## [1 rows x 6 columns]

Exercise 3
In this exercise we want to examine the relationship between body temperature and heart rate. Further, we
would like to use heart rate to predict the body temperature.

(a) Use the BodyTemperature.txt data set to build a simple linear regression model for body temperature
using heart rate as the predictor.
(b) Interpret the estimate of regression coefficient and examine its statistical significance.
(c) Find the 95% confidence interval for the regression coefficient.
(d) Find the value of R2 and show that it is equal to the square of the sample correlation coefficient.
(e) Create simple diagnostic plots for your model and identify possible outliers.
(f) If someone’s heart rate is 75, what would be your estimate of this person’s body temperature?
(g) We believe that gender might also be related to body temperature and could help us to predict its
unknown values. Use the BodyTemperature.txt data set to build a multiple linear regression model for
body temperature using heart rate and gender as predictors. For the tests in this section use α = 0.1.
(h) We answer the next four questions using the additive model with HeartRate and Gender as variables.
How much did the R2 increase compared to the simple linear regression model above?
(i) Explain the estimates of regression coefficients in plain language.
(j) Find the 95% confidence intervals for regression coefficients.
(k) If a woman’s heart rate is 75, what would be your estimate of her body temperature? What would be
your estimate of body temperature for a man whose heart rate is 75?

Solution
Load the data
data_3 = pd.read_table("data/BodyTemperature.txt", delim_whitespace = True)
data_3.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 100 entries, 0 to 99
## Data columns (total 4 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Gender 100 non-null object
## 1 Age 100 non-null int64
## 2 HeartRate 100 non-null int64
## 3 Temperature 100 non-null float64
## dtypes: float64(1), int64(2), object(1)
## memory usage: 3.2+ KB
(a) Fit Temperature against HeartRate
model_30 = smf.ols("Temperature ~ HeartRate", data_3).fit()
model_30.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Temperature R-squared: 0.200
## Model: OLS Adj. R-squared: 0.192
## Method: Least Squares F-statistic: 24.56
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 3.01e-06
## Time: 09:32:13 Log-Likelihood: -125.80
## No. Observations: 100 AIC: 255.6

## Df Residuals: 98 BIC: 260.8
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 92.3907 1.201 76.900 0.000 90.006 94.775
## HeartRate 0.0806 0.016 4.956 0.000 0.048 0.113
## ==============================================================================
## Omnibus: 1.376 Durbin-Watson: 1.804
## Prob(Omnibus): 0.503 Jarque-Bera (JB): 0.861
## Skew: -0.057 Prob(JB): 0.650
## Kurtosis: 3.440 Cond. No. 1.03e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.03e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Scatterplot
fig, ax = plt.subplots()
ax.scatter(data_3["HeartRate"], data_3["Temperature"]);
sm.graphics.abline_plot(model_results = model_30, color = "red", ax = ax);
plt.xlabel("Heart Rate")

## Text(0.5, 0, 'Heart Rate')


plt.ylabel("Temperature (ºF)")

## Text(0, 0.5, 'Temperature (ºF)')


plt.show()

[Figure: scatterplot of Temperature (ºF) against Heart Rate with the fitted regression line]

(b) Both the intercept and the slope have a reported p-value of 0, so both are significantly different from zero. The intercept, 92.3907, is the expected Temperature when the heart rate is equal to zero. The slope, 0.0806, is the increase in Temperature when the heart rate increases by one unit.
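Note that a heart rate of zero is far outside the observed range (roughly 60 to 85 in the scatterplot), so the intercept is an extrapolation. If a more interpretable intercept is wanted, the predictor can be centered with patsy's center() transform; a sketch, assuming data_3 as above:
# With a centered predictor the intercept is the expected Temperature at the
# average heart rate; the slope estimate is unchanged.
model_30c = smf.ols("Temperature ~ center(HeartRate)", data_3).fit()
print(model_30c.params)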
(c) The 95% confidence intervals for the regression coefficients are given by
model_30.conf_int()

## 0 1
## Intercept 90.006462 94.774900
## HeartRate 0.048347 0.112916
(d) The R2 can be found either in the summary table or by calling
round(model_30.rsquared,4)

## 0.2004
We see that the model only explains about 20% of the variability in the data. We calculate the squared correlation to verify that it is equal to the R2:
data_3[["HeartRate", "Temperature"]].corr() ** 2

## HeartRate Temperature
## HeartRate 1.000000 0.200418
## Temperature 0.200418 1.000000
(e) Diagnostic plots
cls_30 = LinearRegDiagnostic(model_30)
_, fig, ax = cls_30()

[Figure: diagnostic plots for model_30 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 5, 74, 78, and 85 are labelled]

Influence plot
fig, ax = plt.subplots()
sm.graphics.influence_plot(model_30, ax = ax);
plt.show()

[Figure: influence plot for model_30 (studentized residuals vs leverage, point size proportional to Cook's distance)]
Points 5 and 78 are flagged in all diagnostic graphs, and points 74 and 85 are flagged in some. They may be outliers and should be investigated further.
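One way to investigate further is to refit the model without the flagged observations and check how much the coefficients move; a sketch, assuming data_3 as above (the row labels 5, 74, 78 and 85 are the ones reported in the plots):
# Compare intercept and slope with and without the suspected outliers.
model_30_sub = smf.ols("Temperature ~ HeartRate", data_3.drop(index = [5, 74, 78, 85])).fit()
print(pd.concat({"all": model_30.params, "without": model_30_sub.params}, axis = 1))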
(f) Predicting Temperature for a heart rate of 75. Using get_prediction
new_data = pd.DataFrame({"HeartRate": [75]})
new_pred = model_30.get_prediction(new_data)
new_pred.summary_frame()

## mean mean_se ... obs_ci_lower obs_ci_upper


## 0 98.438046 0.088721 ... 96.722331 100.153761
##
## [1 rows x 6 columns]
Using the coefficients and the model formula:
round(model_30.params[0] + model_30.params[1] * 75,2)

## 98.44
(g) We first fit a model with an interaction between HeartRate and Gender, to check whether the interaction is needed
model_31 = smf.ols("Temperature ~ HeartRate * Gender", data_3).fit()
model_31.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Temperature R-squared: 0.226
## Model: OLS Adj. R-squared: 0.201
## Method: Least Squares F-statistic: 9.317
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 1.82e-05
## Time: 09:32:16 Log-Likelihood: -124.21
## No. Observations: 100 AIC: 256.4
## Df Residuals: 96 BIC: 266.8
## Df Model: 3
## Covariance Type: nonrobust

## =========================================================================================
## coef std err t P>|t| [0.025 0.975]
## -----------------------------------------------------------------------------------------
## Intercept 92.1833 1.855 49.698 0.000 88.501 95.865
## Gender[T.M] 0.1338 2.428 0.055 0.956 -4.686 4.953
## HeartRate 0.0855 0.025 3.389 0.001 0.035 0.136
## HeartRate:Gender[T.M] -0.0059 0.033 -0.179 0.858 -0.071 0.059
## ==============================================================================
## Omnibus: 2.438 Durbin-Watson: 1.815
## Prob(Omnibus): 0.296 Jarque-Bera (JB): 2.137
## Skew: 0.062 Prob(JB): 0.344
## Kurtosis: 3.705 Cond. No. 2.84e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.84e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
The p-value for the interaction is large, so the interaction is not significant and we drop it from the model.
model_32 = smf.ols("Temperature ~ HeartRate + Gender", data_3).fit()
model_32.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Temperature R-squared: 0.225
## Model: OLS Adj. R-squared: 0.209
## Method: Least Squares F-statistic: 14.10
## Date: Fri, 27 Oct 2023 Prob (F-statistic): 4.21e-06
## Time: 09:32:16 Log-Likelihood: -124.23
## No. Observations: 100 AIC: 254.5
## Df Residuals: 97 BIC: 262.3
## Df Model: 2
## Covariance Type: nonrobust
## ===============================================================================
## coef std err t P>|t| [0.025 0.975]
## -------------------------------------------------------------------------------
## Intercept 92.4376 1.189 77.743 0.000 90.078 94.798
## Gender[T.M] -0.3004 0.170 -1.763 0.081 -0.639 0.038
## HeartRate 0.0820 0.016 5.088 0.000 0.050 0.114
## ==============================================================================
## Omnibus: 2.387 Durbin-Watson: 1.810
## Prob(Omnibus): 0.303 Jarque-Bera (JB): 2.072
## Skew: 0.056 Prob(JB): 0.355
## Kurtosis: 3.696 Cond. No. 1.03e+03
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.03e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.

## """
In this model (see the summary), Gender is marginally significant. At the α = 0.05 level we would keep the simple regression model using only HeartRate, and the rest of the question would have the same answers as in the first part. However, since we are told to use α = 0.1, Gender is significant and the answers are different.
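The same comparison can be made with a partial F-test between the two nested models; a sketch, assuming model_30 and model_32 as above:
# anova_lm compares the simple model against the additive model; since only one
# parameter is added, the p-value matches the t-test for Gender above (about 0.08).
print(sm.stats.anova_lm(model_30, model_32))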
(h) The first value corresponds to the additive model, the second to the simple linear model; the R2 increases from 0.200 to 0.225, that is, by about 0.025.
(round(model_32.rsquared,3), round(model_30.rsquared,3))

## (0.225, 0.2)
(i) The additive model has a common slope of 0.08199 for both genders: the expected temperature increases by about 0.082 ºF per unit increase in heart rate. The intercepts differ by the Gender coefficient: for females the intercept is 92.4376, while for males it is 92.4376 − 0.3004 = 92.1372.
(j) The 95% confidence intervals for regression coefficients
model_32.conf_int()

## 0 1
## Intercept 90.077761 94.797512
## Gender[T.M] -0.638659 0.037776
## HeartRate 0.050009 0.113977
(k) Create a new data frame and make predictions
new_data = pd.DataFrame({"HeartRate": [75,75], "Gender": ["F","M"]})
new_pred = model_32.get_prediction(new_data)
new_pred.summary_frame()

## mean mean_se ... obs_ci_lower obs_ci_upper


## 0 98.587086 0.121868 ... 96.881046 100.293127
## 1 98.286645 0.122801 ... 96.580341 99.992949
##
## [2 rows x 6 columns]
