
Problem 4

(a) Logistic Regression


(i) The estimated probability of experiencing CHD within ten years is

p̂ = exp(β) / (1 + exp(β)), where

β = −6.89 − 1.65 College − 1.92 HighSchoolGED − 1.59 SomeHighSchool − 1.72 SomeCollege
  + 0.51 bMale + 0.06 iAge − 0.08 bSmoker + 0.02 iCigarettesPerDay + 0.13 bBloodPressureMeds + 0.51 bStroke
  + 0.13 bHypertensive + 0.023 fChol + 0.051 bDiabetes + 0.018 fSysBP − 0.0045 fDiaBP + 0.017 fBMI
  − 0.0024 iHeartRate + 0.0079 iGlucose

Here we have converted the cSchooling variable into four binary variables,
College, High School/GED, Some High School, and Some College, using
one-hot encoding.
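For reference, a minimal sketch of this encoding step in pandas; pd.get_dummies is an alternative to the OneHotEncoder used in the appendix, and the 'schooling' prefix here is illustrative:

import pandas as pd

# Sketch: one-hot encode cSchooling, then replace it with the dummy columns.
df = pd.read_csv('framingham.csv')
schooling = pd.get_dummies(df['cSchooling'], prefix='schooling')
df = pd.concat([schooling, df.drop(columns=['cSchooling'])], axis=1)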
(ii) The risk factors, i.e. the features significant at p < 0.05, are bMale,
iAge, iCigarettesPerDay, fSysBP, and iGlucose.
(iii) Take iAge as an example. If age increases by 1, then the odds of developing
CHD within ten years are multiplied by exp(0.0566) ≈ 1.058.
(iv) Plugging the features into the equation in (i) gives p̂ = 0.832.
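As a quick numerical check, the logistic function from (i) maps the linear predictor β to p̂. A minimal sketch; the value β ≈ 1.60 is back-computed from p̂ = 0.832 (β = ln(0.832/0.168)), not re-derived from the raw features:

import numpy as np

# The logistic link from part (i): p-hat = exp(beta) / (1 + exp(beta)).
def chd_probability(beta):
    return np.exp(beta) / (1.0 + np.exp(beta))

# beta ~= ln(0.832 / 0.168) ~= 1.60 reproduces the reported p-hat = 0.832.
print(chd_probability(1.60))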
(b) Decision Tree
(i) The threshold value p* should be such that the health care provider is
indifferent between giving and not giving the preventive medication.
Mathematically, this corresponds to the following equation:

545(p/4) + 45(1 − p/4) = 500p

The left side simplifies to 125p + 45, so 45 = 375p and p* = 0.12.
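A one-line symbolic check of this algebra; using sympy here is an assumption, any computer algebra system would do:

from sympy import symbols, solve

p = symbols('p')
# Indifference condition from (b)(i); solve returns [3/25], i.e. p* = 0.12.
print(solve(545*(p/4) + 45*(1 - p/4) - 500*p, p))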


(ii) Yes, since p̂ > p* in this case.
(iii) Accuracy=0.59, TPR=0.78, FPR=0.44. Accuracy is the percentage of
correctly predicted cases. True positive rate (TPR) is the percentage of
positive cases that are correctly predicted as positive. False positive rate
(FPR) is the percentage of negative cases that are incorrectly predicted as
positive.
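All three metrics follow directly from the confusion-matrix counts. A minimal sketch, assuming y_true and y_pred are 0/1 arrays (the function name is illustrative):

import numpy as np

def classification_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    accuracy = np.mean(y_true == y_pred)
    tpr = tp / np.sum(y_true == 1)  # sensitivity
    fpr = fp / np.sum(y_true == 0)  # false alarm rate
    return accuracy, tpr, fpr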
(iv) With the naive way of calculating the economic costs, a patient prescribed
the preventive medication incurs a cost of 545,000 if he eventually develops
CHD, and 45,000 otherwise. Conversely, a patient not prescribed the
medication incurs a cost of 500,000 if he eventually develops CHD and 0
otherwise. The expected economic cost per patient calculated this way is
93,000.

This calculation is not reasonable because it does not take into account the
preventive effect of the medication. According to this calculation, not
prescribing the medication always incurs a lower cost, whether the patient
develops CHD or not!

To account for the preventive effect, the cost associated with prescribing
the medication to a patient who eventually developed CHD should be
545,000(p̂/4) + 45,000(1 − p̂/4), where p̂ is the probability of developing CHD
within ten years predicted by the logistic model. This calculation reflects the
counterfactual benefit of the medication: if the patient were given the
medication, his chance of eventually developing CHD would have been
lowered by 75%. The expected economic cost per patient calculated this
way is 42,000.
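A minimal sketch of this adjusted cost calculation, mirroring the loop in the appendix (costs in dollars, p_threshold = 0.12 from (b)(i); the function name is illustrative):

def expected_cost(y_true, y_prob, p_threshold=0.12):
    # Per-patient cost, accounting for the medication's preventive effect:
    # prescribing cuts the patient's CHD probability to p_hat / 4.
    total = 0.0
    for y, p_hat in zip(y_true, y_prob):
        if p_hat > p_threshold:  # prescribed
            if y == 1:
                p_reduced = p_hat / 4
                total += 545_000 * p_reduced + 45_000 * (1 - p_reduced)
            else:
                total += 45_000
        else:  # not prescribed
            total += 500_000 if y == 1 else 0
    return total / len(y_true)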
(v) For the baseline model, accuracy=0.858, TPR=0, FPR=0. The expected
economic cost per patient is 71,000, which is worse than the prescribing
strategy used in (iv).
(c) AUC=0.74 for the ROC curve.

The ROC curve is helpful for visualizing the sensitivity and false alarm
rate of different prescription policies: each threshold probability
corresponds to one (FPR, TPR) pair, which can then be used to calculate
the expected economic cost of the corresponding policy.

One interesting observation is that the ROC curve is steepest at low FPRs,
indicating that in this region there is much to gain in TPR while losing
little in terms of FPR.
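A minimal sketch of computing the curve and its AUC with scikit-learn, with y_test and y_prob_test as defined in the appendix:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_prob_test)
print('AUC:', auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()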

(d) One of the risk factors identified is bMale. If gender is actually a confounding
factor, as opposed to a cause, of CHD, then this model could end up prescribing
more medication to men than to women even when both genders have the same
chance of developing CHD, raising fairness issues. Possible remedies include
excluding the gender variable from the regression analysis.
Appendix: Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score, plot_roc_curve  # plot_roc_curve was removed in scikit-learn 1.2
import statsmodels.api as sm
import matplotlib.pyplot as plt

csv_filename = 'framingham.csv'

def split_data():
    df = pd.read_csv(csv_filename)
    n = df.shape[0]
    n_train = int(n * 0.7)
    n_test = n - n_train

    # convert schooling to one-hot encoding
    schooling_cats = df['cSchooling'].unique()
    ohe = OneHotEncoder(categories=[schooling_cats])
    cSchooling = ohe.fit_transform(df['cSchooling'].to_numpy().reshape(-1, 1)).todense()
    df_schooling = pd.DataFrame(cSchooling, columns=schooling_cats)

    df = df.drop(columns=['cSchooling'])
    df = pd.concat([df_schooling, df], axis=1)

    # randomly assign each row to the train or test group (70/30 split)
    np.random.seed(0)
    df['group'] = np.random.choice(np.repeat(['train', 'test'], (n_train, n_test)), n, replace=False)

    df_train = df[df['group'] == 'train']
    df_test = df[df['group'] == 'test']

    df_train = df_train.drop(columns=['group'])
    df_test = df_test.drop(columns=['group'])

    # keep the row index in the CSVs so check_split() can verify the groups
    # are disjoint; logistic_regression() reads them back with index_col=0
    df_train.to_csv('hw2_train.csv')
    df_test.to_csv('hw2_test.csv')
def check_split():
    # the saved index appears as 'Unnamed: 0' when read without index_col
    df_train = pd.read_csv('hw2_train.csv')
    df_test = pd.read_csv('hw2_test.csv')
    index_train = df_train['Unnamed: 0'].to_list()
    index_test = df_test['Unnamed: 0'].to_list()
    print(set(index_train).intersection(index_test))  # should print set()

def logistic_regression():
    df_train = pd.read_csv('hw2_train.csv', index_col=0)
    y = df_train['bTenYearCHD'].to_numpy()
    X = df_train.drop(columns=['bTenYearCHD'])

    # unpenalized logistic regression, as in part (a)
    reg = LogisticRegression(penalty='none', random_state=0, max_iter=10000)
    result = reg.fit(X, y)

    print(result.classes_)
    cols = X.columns.to_numpy()  # feature names, aligned with coef_
    coef = result.coef_[0]

    # col_coef_ranked = sorted(zip(cols, coef), key=lambda x: abs(x[1]), reverse=True)
    col_coef_ranked = zip(cols, coef)
    for col, co in col_coef_ranked:
        print('{}: {}'.format(col, co))
    print('intercept: ', result.intercept_)

    # refit with statsmodels to get the p-values for part (a)(ii)
    X_sm = sm.add_constant(X)
    logit_mod = sm.Logit(y, X_sm)
    logit_res = logit_mod.fit()
    print(logit_res.summary())

    # example patient for part (a)(iv); feature order matches X's columns
    X_example = np.array([1, 0, 0, 0, 0, 51, 1, 20, 0, 0, 1, 0, 220, 140, 100, 31, 59, 78]).reshape(1, -1)
    y_hat = result.predict_proba(X_example)
    print(y_hat)

    # evaluate on the test set with threshold p* = 0.12 (part (b))
    df_test = pd.read_csv('hw2_test.csv', index_col=0)
    n_test = len(df_test)
    y_test = df_test['bTenYearCHD'].to_numpy()
    X_test = df_test.drop(columns=['bTenYearCHD']).to_numpy()
    y_hat_test = result.predict(X_test)
    y_prob_test = result.predict_proba(X_test)
    y_prob_test = [v[1] for v in y_prob_test]
    y_hat_threshp_test = [1 if p > 0.12 else 0 for p in y_prob_test]
    print(y_prob_test)
    print('n_hat_negative: ', sum([1 for p in y_prob_test if p < 0.5]))

    diff_vec = y_test - y_hat_threshp_test

    n_positive = np.sum(y_test)
    n_negative = n_test - n_positive
    n_tp = sum([1 for c1, c2 in zip(y_test, y_hat_threshp_test) if c1 == 1 and c2 == 1])
    n_fp = sum([1 for c1, c2 in zip(y_test, y_hat_threshp_test) if c1 == 0 and c2 == 1])
    correct = sum([1 for comp in diff_vec if comp == 0])

    n_hat_positive = np.sum(y_hat_threshp_test)
    n_hat_negative = n_test - n_hat_positive

    print(f'n_test: {n_test}, n_positive={n_positive}, n_negative={n_negative}, n_tp={n_tp}, n_fp={n_fp}')
    print(f'n_hat_positive: {n_hat_positive}, n_hat_negative: {n_hat_negative}')
    print('Accuracy: {}, TPR: {}, FPR: {}'.format(correct/n_test, n_tp/n_positive, n_fp/n_negative))

    # baseline model predicts that no one develops CHD (part (b)(v))
    baseline_correct = sum([1 for comp in y_test if comp == 0])
    print('Baseline Accuracy: {}'.format(baseline_correct/n_test))

    # problem (b)(iv): naive expected cost (in thousands of dollars)
    n = len(df_test)
    costs = 0
    p_thresh = 0.12

    print('lens: ', len(y_test), len(y_hat_threshp_test))
    for y, y_hat in zip(y_test, y_prob_test):
        if y == 1 and y_hat > p_thresh:
            costs += 545
        elif y == 1 and y_hat < p_thresh:
            costs += 500
        elif y == 0 and y_hat > p_thresh:
            costs += 45
        elif y == 0 and y_hat < p_thresh:
            costs += 0
        else:
            print('Error!!!!!')
    print('Expected costs (naive): ', costs/n)

    # ====================
    # expected cost adjusted for the medication's preventive effect
    n = len(df_test)
    costs = 0
    p_thresh = 0.12

    print('lens: ', len(y_test), len(y_hat_threshp_test))
    for y, y_hat in zip(y_test, y_prob_test):
        if y == 1 and y_hat > p_thresh:
            pp = y_hat/4  # medication cuts the CHD probability to a quarter
            costs += 545*pp + 45*(1 - pp)
        elif y == 1 and y_hat < p_thresh:
            costs += 500
        elif y == 0 and y_hat > p_thresh:
            costs += 45
        elif y == 0 and y_hat < p_thresh:
            costs += 0
        else:
            print('Error!!!!!')
    print('Expected costs: ', costs/n)

    # ====================
    # baseline policy: never prescribe (part (b)(v))
    n = len(df_test)
    costs = 0
    for y in y_test:
        if y == 1:
            costs += 500
        elif y == 0:
            costs += 0
        else:
            print('Error!!!!!')
    print('Expected costs (baseline): ', costs/n)

    # ====================
    # plot roc curve (part (c))
    plot_roc_curve(reg, X_test, y_test)
    plt.plot([0, 1], [0, 1], color='orange', lw=2, linestyle='--')
    plt.show()

if __name__ == "__main__":
    # split_data()
    # check_split()
    logistic_regression()
