This calculation is not reasonable because it does not take into
account the preventive effect of the medication: under it, not prescribing
the medication always incurs a lower cost whether or not the patient
develops CHD!
To account for the preventive effect, the cost associated with prescribing
the medication to a patient who eventually developed CHD should be
545(p̂/4) + 45(1 − p̂/4), where p̂ is the probability of developing CHD within ten
years predicted by the logistic model. This calculation reflects the
counterfactual benefit of the medication: if the patient were given the
medication, his chance of eventually developing CHD would have been
lowered by 75%, from p̂ to p̂/4. The expected economic cost per patient
calculated this way is 42,000.
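The cost rule above can be written as a short function (a sketch: the cost figures, in thousands of dollars, and the 75% risk reduction come from the text; the function name is my own):

```python
def expected_cost_treated(p_hat):
    """Expected cost (in $1,000s) of prescribing the medication to a patient
    whose predicted ten-year CHD risk without treatment is p_hat."""
    p_treated = p_hat / 4  # medication lowers the risk by 75%
    # 545 if CHD still develops (CHD cost plus medication), 45 (medication only) otherwise
    return 545 * p_treated + 45 * (1 - p_treated)
```

For a patient with no predicted risk the cost reduces to the medication cost alone, and it grows linearly in p̂.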
(v) For the baseline model, accuracy = 0.858, TPR = 0, FPR = 0. The expected
economic cost per patient is 71,000, which is worse than the
prescribing strategy used in (iv).
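The baseline figures (TPR = 0, FPR = 0, accuracy equal to the share of negatives) follow directly from a model that never prescribes; a minimal sketch, with a toy label vector standing in for the actual test set:

```python
import numpy as np

def baseline_metrics(y_true):
    """Metrics for a baseline that predicts 0 (never prescribe) for everyone."""
    y_true = np.asarray(y_true)
    tn = int(np.sum(y_true == 0))  # every negative is (correctly) predicted negative
    accuracy = tn / len(y_true)    # only the true negatives are classified correctly
    tpr = 0.0                      # no positives are ever flagged
    fpr = 0.0                      # no negatives are ever flagged
    return accuracy, tpr, fpr
```

On the real test set the same computation yields the accuracy of 0.858 quoted above.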
(c) AUC=0.74 for the ROC curve.
The ROC curve can be helpful to visualize the sensitivity and false alarm
rate of different prescription policies. We can see the sensitivity and false
alarm rate of different threshold probabilities, and use them to calculate
the expected economic costs for different policies.
One interesting observation is that the ROC curve is steepest
at low FPRs, indicating that in this region there is much to gain in TPR
while losing little in terms of FPR.
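The curve and its AUC can be produced with scikit-learn's roc_curve and roc_auc_score; this sketch uses toy labels and probabilities in place of the actual test-set predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted probabilities stand in for the real test-set output
y_test = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.05, 0.10, 0.80, 0.50, 0.60, 0.90, 0.20, 0.15, 0.40, 0.25])

# Each (fpr[i], tpr[i]) point corresponds to one threshold, so the curve shows
# the sensitivity / false-alarm trade-off of every prescription policy at once
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
print(auc)
```

Reading the cost of a policy off the curve amounts to picking the threshold, looking up its (FPR, TPR) point, and plugging those rates into the cost calculation.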
(d) One of the risk factors identified is bMale. If gender is actually a confounding
factor, rather than a cause, of CHD, then this model could end up prescribing
more medication to men than to women even when both have the same
chance of developing CHD, raising fairness concerns. Possible remedies include
excluding the gender variable from the regression.
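The remedy of excluding the gender variable amounts to dropping the column before fitting; a minimal sketch with a toy DataFrame (column names follow the text, the values are made up):

```python
import pandas as pd

# Toy data; column names follow the text, values are made up
df = pd.DataFrame({
    'bMale':       [1, 0, 1, 0],
    'age':         [51, 48, 63, 55],
    'bTenYearCHD': [1, 0, 1, 0],
})

# Drop the protected attribute so the fitted model cannot condition on it
X = df.drop(columns=['bMale', 'bTenYearCHD'])
y = df['bTenYearCHD']
```

Note that dropping the column removes only direct dependence on gender; other features correlated with gender can still act as proxies.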
Appendix: Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
import statsmodels.api as sm
import matplotlib.pyplot as plt
csv_filename = 'framingham.csv'
def split_data():
    df = pd.read_csv(csv_filename)
    n = df.shape[0]
    n_train = int(n * 0.7)
    n_test = n - n_train
    # One-hot encode the schooling category (assumed intent; pd.get_dummies
    # is used here as the simplest route), then replace the original column
    df_schooling = pd.get_dummies(df['cSchooling'], prefix='cSchooling')
    df = df.drop(columns=['cSchooling'])
    df = pd.concat([df_schooling, df], axis=1)
    np.random.seed(0)
    # Randomly assign n_train 'train' and n_test 'test' labels across the rows
    df['group'] = np.random.choice(np.repeat(['train', 'test'], (n_train, n_test)), n, replace=False)
    df_train = df[df['group'] == 'train'].drop(columns=['group'])
    df_test = df[df['group'] == 'test'].drop(columns=['group'])
    # Keep the row index in the CSVs so check_split() can verify disjointness
    df_train.to_csv('hw2_train.csv')
    df_test.to_csv('hw2_test.csv')
def check_split():
    df_train = pd.read_csv('hw2_train.csv')
    df_test = pd.read_csv('hw2_test.csv')
    # The saved row index round-trips as the 'Unnamed: 0' column;
    # an empty set here confirms the train and test rows are disjoint
    index_train = df_train['Unnamed: 0'].to_list()
    index_test = df_test['Unnamed: 0'].to_list()
    print(set(index_train).intersection(index_test))
def logistic_regression():
    df_train = pd.read_csv('hw2_train.csv')
    y = df_train['bTenYearCHD'].to_numpy()
    # Drop the target (and the saved index column, if present) from the features
    X = df_train.drop(columns=['Unnamed: 0', 'bTenYearCHD'], errors='ignore')
    result = LogisticRegression(max_iter=1000).fit(X, y)
    print(result.classes_)
    cols = X.columns.to_numpy()
    coef = result.coef_[0]
    print(dict(zip(cols, coef)))
    # statsmodels refit for the coefficient summary (standard errors, p-values)
    X = sm.add_constant(X)
    logit_mod = sm.Logit(y, X)
    logit_res = logit_mod.fit()
    print(logit_res.summary())
    # Example patient; the leading 1 is the intercept term for the statsmodels model
    X_example = np.array([1, 0, 0, 0, 0, 51, 1, 20, 0, 0, 1, 0,
                          220, 140, 100, 31, 59, 78]).reshape(1, -1)
    y_hat = logit_res.predict(X_example)
    print(y_hat)
    df_test = pd.read_csv('hw2_test.csv')
    n_test = len(df_test)
    y_test = df_test['bTenYearCHD'].to_numpy()
    X_test = df_test.drop(columns=['Unnamed: 0', 'bTenYearCHD'], errors='ignore').to_numpy()
    y_hat_test = result.predict(X_test)
    y_prob_test = result.predict_proba(X_test)
    y_prob_test = [v[1] for v in y_prob_test]
    # Prescribe when the predicted probability exceeds the 0.12 threshold
    y_hat_threshp_test = [1 if p > 0.12 else 0 for p in y_prob_test]
    print(y_prob_test)
    print('n_hat_negative: ', sum([1 for p in y_prob_test if p < 0.5]))
    n_hat_positive = np.sum(y_hat_threshp_test)
    n_hat_negative = n_test - n_hat_positive
    # problem (b) 4
    n = len(df_test)
    costs = 0
    p_thresh = 0.12