
Problem 4

(a) Logistic Regression


(i) The estimated probability of experiencing CHD within ten years is

p̂ = exp(β) / (1 + exp(β)), where

β = −6.89 − 1.65 College − 1.92 HighSchoolGED − 1.59 SomeHighSchool − 1.72 SomeCollege
  + 0.51 bMale + 0.06 iAge − 0.08 bSmoker + 0.02 iCigarettesPerDay + 0.13 bBloodPressureMeds + 0.51 bStroke
  + 0.13 bHypertensive + 0.023 fChol + 0.051 bDiabetes + 0.018 fSysBP − 0.0045 fDiaBP + 0.017 fBMI
  − 0.0024 iHeartRate + 0.0079 iGlucose

Here we have converted the cSchooling variable into four binary variables,
College, High School/GED, Some High School, and Some College, using
one-hot encoding.
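For reference, a minimal sketch of this encoding step in pandas; pd.get_dummies is an alternative to the OneHotEncoder used in the appendix, and the 'schooling' prefix here is illustrative:

import pandas as pd

# Sketch: one-hot encode cSchooling, then replace it with the dummy columns.
df = pd.read_csv('framingham.csv')
schooling = pd.get_dummies(df['cSchooling'], prefix='schooling')
df = pd.concat([schooling, df.drop(columns=['cSchooling'])], axis=1)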
(ii) The risk factors, i.e. the features significant at p < 0.05, are bMale,
iAge, iCigarettesPerDay, fSysBP, and iGlucose.
(iii) Take iAge as an example. If age increases by 1, then the odds of developing
CHD within ten years are multiplied by exp(0.0566) ≈ 1.058.
(iv) Plugging the features into the equation in (i) gives p̂ = 0.832.
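As a quick numerical check, the logistic function from (i) maps the linear predictor β to p̂. A minimal sketch; the value β ≈ 1.60 is back-computed from p̂ = 0.832 (β = ln(0.832/0.168)), not re-derived from the raw features:

import numpy as np

# The logistic link from part (i): p-hat = exp(beta) / (1 + exp(beta)).
def chd_probability(beta):
    return np.exp(beta) / (1.0 + np.exp(beta))

# beta ~= ln(0.832 / 0.168) ~= 1.60 reproduces the reported p-hat = 0.832.
print(chd_probability(1.60))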
(b) Decision Tree
(i) The threshold value p* should be such that the health care provider is
indifferent between giving and not giving the preventive medication.
Mathematically, this corresponds to the following equation:

545(p/4) + 45(1 − p/4) = 500p

The left side simplifies to 125p + 45, so 45 = 375p and p* = 0.12.
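A one-line symbolic check of this algebra; using sympy here is an assumption, any computer algebra system would do:

from sympy import symbols, solve

p = symbols('p')
# Indifference condition from (b)(i); solve returns [3/25], i.e. p* = 0.12.
print(solve(545*(p/4) + 45*(1 - p/4) - 500*p, p))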


(ii) Yes, since p̂ > p* in this case.
(iii) Accuracy=0.59, TPR=0.78, FPR=0.44. Accuracy is the percentage of
correctly predicted cases. True positive rate (TPR) is the percentage of
positive cases that are correctly predicted as positive. False positive rate
(FPR) is the percentage of negative cases that are incorrectly predicted as
positive.
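All three metrics follow directly from the confusion-matrix counts. A minimal sketch, assuming y_true and y_pred are 0/1 arrays (the function name is illustrative):

import numpy as np

def classification_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    accuracy = np.mean(y_true == y_pred)
    tpr = tp / np.sum(y_true == 1)  # sensitivity
    fpr = fp / np.sum(y_true == 0)  # false alarm rate
    return accuracy, tpr, fpr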
(iv) With the naive way of calculating the economic costs, a patient prescribed
the preventive medication incurs a cost of 545,000 if he eventually develops
CHD, and 45,000 otherwise. Conversely, a patient not prescribed the
medication incurs a cost of 500,000 if he eventually develops CHD and 0
otherwise. The expected economic cost per patient calculated this way is
93,000.

This calculation is not reasonable because it does not take into account the
preventive effect of the medication. According to this calculation, not
prescribing the medication always incurs a lower cost, whether the patient
develops CHD or not!

To account for the preventive effect, the cost associated with prescribing
the medication to a patient who eventually developed CHD should be
545,000(p̂/4) + 45,000(1 − p̂/4), where p̂ is the probability of developing CHD
within ten years predicted by the logistic model. This calculation reflects the
counterfactual benefit of the medication: if the patient were given the
medication, his chance of eventually developing CHD would have been
lowered by 75%. The expected economic cost per patient calculated this
way is 42,000.
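A minimal sketch of this adjusted cost calculation, mirroring the loop in the appendix (costs in dollars, p_threshold = 0.12 from (b)(i); the function name is illustrative):

def expected_cost(y_true, y_prob, p_threshold=0.12):
    # Per-patient cost, accounting for the medication's preventive effect:
    # prescribing cuts the patient's CHD probability to p_hat / 4.
    total = 0.0
    for y, p_hat in zip(y_true, y_prob):
        if p_hat > p_threshold:  # prescribed
            if y == 1:
                p_reduced = p_hat / 4
                total += 545_000 * p_reduced + 45_000 * (1 - p_reduced)
            else:
                total += 45_000
        else:  # not prescribed
            total += 500_000 if y == 1 else 0
    return total / len(y_true)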
(v) For the baseline model, accuracy=0.858, TPR=0, FPR=0. The expected
economic cost per patient is 71,000, which is worse than the prescribing
strategy used in (iv).
(c) AUC=0.74 for the ROC curve.

The ROC curve is helpful for visualizing the sensitivity and false alarm
rate of different prescription policies: each threshold probability
corresponds to one (FPR, TPR) pair, which can then be used to calculate
the expected economic cost of the corresponding policy.

One interesting observation is that the ROC curve is steepest at low FPRs,
indicating that in this region there is much to gain in TPR while losing
little in terms of FPR.
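A minimal sketch of computing the curve and its AUC with scikit-learn, with y_test and y_prob_test as defined in the appendix:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_prob_test)
print('AUC:', auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()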

(d) One of the risk factors identified is bMale. If gender is actually a confounding
factor, as opposed to a cause, of CHD, then this model could end up prescribing
more medication to men than to women even when both genders have the same
chance of developing CHD, raising fairness issues. Possible remedies include
excluding the gender variable from the regression analysis.
Appendix: Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score, plot_roc_curve  # plot_roc_curve was removed in scikit-learn 1.2
import statsmodels.api as sm
import matplotlib.pyplot as plt

csv_filename = 'framingham.csv'

def split_data():
    df = pd.read_csv(csv_filename)
    n = df.shape[0]
    n_train = int(n * 0.7)
    n_test = n - n_train

    # convert schooling to one-hot encoding
    schooling_cats = df['cSchooling'].unique()
    ohe = OneHotEncoder(categories=[schooling_cats])
    cSchooling = ohe.fit_transform(df['cSchooling'].to_numpy().reshape(-1, 1)).todense()
    df_schooling = pd.DataFrame(cSchooling, columns=schooling_cats)

    df = df.drop(columns=['cSchooling'])
    df = pd.concat([df_schooling, df], axis=1)

    # randomly assign each row to the train or test group (70/30 split)
    np.random.seed(0)
    df['group'] = np.random.choice(np.repeat(['train', 'test'], (n_train, n_test)), n, replace=False)

    df_train = df[df['group'] == 'train']
    df_test = df[df['group'] == 'test']

    df_train = df_train.drop(columns=['group'])
    df_test = df_test.drop(columns=['group'])

    # keep the row index in the CSVs so check_split() can verify the groups
    # are disjoint; logistic_regression() reads them back with index_col=0
    df_train.to_csv('hw2_train.csv')
    df_test.to_csv('hw2_test.csv')
def check_split():
    # the saved index appears as 'Unnamed: 0' when read without index_col
    df_train = pd.read_csv('hw2_train.csv')
    df_test = pd.read_csv('hw2_test.csv')
    index_train = df_train['Unnamed: 0'].to_list()
    index_test = df_test['Unnamed: 0'].to_list()
    print(set(index_train).intersection(index_test))  # should print set()

def logistic_regression():
    df_train = pd.read_csv('hw2_train.csv', index_col=0)
    y = df_train['bTenYearCHD'].to_numpy()
    X = df_train.drop(columns=['bTenYearCHD'])

    # unpenalized logistic regression, as in part (a)
    reg = LogisticRegression(penalty='none', random_state=0, max_iter=10000)
    result = reg.fit(X, y)

    print(result.classes_)
    cols = X.columns.to_numpy()  # feature names, aligned with coef_
    coef = result.coef_[0]

    # col_coef_ranked = sorted(zip(cols, coef), key=lambda x: abs(x[1]), reverse=True)
    col_coef_ranked = zip(cols, coef)
    for col, co in col_coef_ranked:
        print('{}: {}'.format(col, co))
    print('intercept: ', result.intercept_)

    # refit with statsmodels to get the p-values for part (a)(ii)
    X_sm = sm.add_constant(X)
    logit_mod = sm.Logit(y, X_sm)
    logit_res = logit_mod.fit()
    print(logit_res.summary())

    # example patient for part (a)(iv); feature order matches X's columns
    X_example = np.array([1, 0, 0, 0, 0, 51, 1, 20, 0, 0, 1, 0, 220, 140, 100, 31, 59, 78]).reshape(1, -1)
    y_hat = result.predict_proba(X_example)
    print(y_hat)

    # evaluate on the test set with threshold p* = 0.12 (part (b))
    df_test = pd.read_csv('hw2_test.csv', index_col=0)
    n_test = len(df_test)
    y_test = df_test['bTenYearCHD'].to_numpy()
    X_test = df_test.drop(columns=['bTenYearCHD']).to_numpy()
    y_hat_test = result.predict(X_test)
    y_prob_test = result.predict_proba(X_test)
    y_prob_test = [v[1] for v in y_prob_test]
    y_hat_threshp_test = [1 if p > 0.12 else 0 for p in y_prob_test]
    print(y_prob_test)
    print('n_hat_negative: ', sum([1 for p in y_prob_test if p < 0.5]))

    diff_vec = y_test - y_hat_threshp_test

    n_positive = np.sum(y_test)
    n_negative = n_test - n_positive
    n_tp = sum([1 for c1, c2 in zip(y_test, y_hat_threshp_test) if c1 == 1 and c2 == 1])
    n_fp = sum([1 for c1, c2 in zip(y_test, y_hat_threshp_test) if c1 == 0 and c2 == 1])
    correct = sum([1 for comp in diff_vec if comp == 0])

    n_hat_positive = np.sum(y_hat_threshp_test)
    n_hat_negative = n_test - n_hat_positive

    print(f'n_test: {n_test}, n_positive={n_positive}, n_negative={n_negative}, n_tp={n_tp}, n_fp={n_fp}')
    print(f'n_hat_positive: {n_hat_positive}, n_hat_negative: {n_hat_negative}')
    print('Accuracy: {}, TPR: {}, FPR: {}'.format(correct/n_test, n_tp/n_positive, n_fp/n_negative))

    # baseline model predicts that no one develops CHD (part (b)(v))
    baseline_correct = sum([1 for comp in y_test if comp == 0])
    print('Baseline Accuracy: {}'.format(baseline_correct/n_test))

    # problem (b)(iv): naive expected cost (in thousands of dollars)
    n = len(df_test)
    costs = 0
    p_thresh = 0.12

    print('lens: ', len(y_test), len(y_hat_threshp_test))
    for y, y_hat in zip(y_test, y_prob_test):
        if y == 1 and y_hat > p_thresh:
            costs += 545
        elif y == 1 and y_hat < p_thresh:
            costs += 500
        elif y == 0 and y_hat > p_thresh:
            costs += 45
        elif y == 0 and y_hat < p_thresh:
            costs += 0
        else:
            print('Error!!!!!')
    print('Expected costs (naive): ', costs/n)

    # ====================
    # expected cost adjusted for the medication's preventive effect
    n = len(df_test)
    costs = 0
    p_thresh = 0.12

    print('lens: ', len(y_test), len(y_hat_threshp_test))
    for y, y_hat in zip(y_test, y_prob_test):
        if y == 1 and y_hat > p_thresh:
            pp = y_hat/4  # medication cuts the CHD probability to a quarter
            costs += 545*pp + 45*(1 - pp)
        elif y == 1 and y_hat < p_thresh:
            costs += 500
        elif y == 0 and y_hat > p_thresh:
            costs += 45
        elif y == 0 and y_hat < p_thresh:
            costs += 0
        else:
            print('Error!!!!!')
    print('Expected costs: ', costs/n)

    # ====================
    # baseline policy: never prescribe (part (b)(v))
    n = len(df_test)
    costs = 0
    for y in y_test:
        if y == 1:
            costs += 500
        elif y == 0:
            costs += 0
        else:
            print('Error!!!!!')
    print('Expected costs (baseline): ', costs/n)

    # ====================
    # plot roc curve (part (c))
    plot_roc_curve(reg, X_test, y_test)
    plt.plot([0, 1], [0, 1], color='orange', lw=2, linestyle='--')
    plt.show()

if __name__ == "__main__":
    # split_data()
    # check_split()
    logistic_regression()
