
Done By : Yasmeen Alhaj Yousef 0197638 , Mohammad Almajali 2191370

Project#1 : Heart_stroke Project


Heart_stroke Data Explanation :

Gender : Female/Male
age : Patient's age
education : uneducated/primaryschool/graduate/postgraduate
currentSmoker : 0(No)/1(Yes)
cigsPerDay : Number of cigarettes per day
BPMeds : Blood pressure meds: 0.0(No)/1.0(Yes)
prevalentStroke : Any history of a stroke? No/Yes
prevalentHyp : Prevalent hypertension: No/Yes
diabetes : Does the patient have diabetes? 0(No)/1(Yes)
totChol : Total cholesterol
sysBP : Systolic blood pressure
diaBP : Diastolic blood pressure
BMI : Patient's BMI
heartRate : Patient's heart rate
glucose : Glucose level
Heart_ stroke : Whether the patient is likely to have a heart stroke: No/yes

After analysing the problem, we found that it is a supervised, batch-learning classification task.

Steps to Analyse and Prepare Data :

Step #1 : Import Needed Libraries


In [53]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as ss
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from pandas.plotting import scatter_matrix
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn import tree
from sklearn.metrics import roc_curve
import matplotlib.patches as patches
from sklearn.metrics import roc_auc_score

Step #2 : Import Data from heart_stroke.csv file


In [54]: Path = "/Users/yasmeenalhajyousef/Desktop/ai/heart_stroke.csv"
Data = pd.read_csv(Path)

In [55]: Data

Out[55]:
      Gender  age      education  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP    BMI  heartRate  glucose  Heart_ stroke
   0    Male   39   postgraduate              0         0.0     0.0               no             0         0    195.0  106.0   70.0  26.97       80.0     77.0             No
   1  Female   46  primaryschool              0         0.0     0.0               no             0         0    250.0  121.0   81.0  28.73       95.0     76.0             No
   2    Male   48     uneducated              1        20.0     0.0               no             0         0    245.0  127.5   80.0  25.34       75.0     70.0             No
   3  Female   61       graduate              1        30.0     0.0               no             1         0    225.0  150.0   95.0  28.58       65.0    103.0            yes
   4  Female   46       graduate              1        23.0     0.0               no             0         0    285.0  130.0   84.0  23.10       85.0     85.0             No
 ...     ...  ...            ...            ...         ...     ...              ...           ...       ...      ...    ...    ...    ...        ...      ...            ...
4233    Male   50     uneducated              1         1.0     0.0               no             1         0    313.0  179.0   92.0  25.97       66.0     86.0            yes
4234    Male   51       graduate              1        43.0     0.0               no             0         0    207.0  126.5   80.0  19.71       65.0     68.0             No
4235  Female   48  primaryschool              1        20.0     NaN               no             0         0    248.0  131.0   72.0  22.00       84.0     86.0             No
4236  Female   44     uneducated              1        15.0     0.0               no             0         0    210.0  126.5   87.0  19.16       86.0      NaN             No
4237  Female   52  primaryschool              0         0.0     0.0               no             0         0    269.0  133.5   83.0  21.47       80.0    107.0             No

4238 rows × 16 columns

Step #3: Take a Look at the Data


Displaying the first 7 rows in the data :

In [56]: Data.head(7)

Out[56]:
      Gender  age      education  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP    BMI  heartRate  glucose  Heart_ stroke
   0    Male   39   postgraduate              0         0.0     0.0               no             0         0    195.0  106.0   70.0  26.97       80.0     77.0             No
   1  Female   46  primaryschool              0         0.0     0.0               no             0         0    250.0  121.0   81.0  28.73       95.0     76.0             No
   2    Male   48     uneducated              1        20.0     0.0               no             0         0    245.0  127.5   80.0  25.34       75.0     70.0             No
   3  Female   61       graduate              1        30.0     0.0               no             1         0    225.0  150.0   95.0  28.58       65.0    103.0            yes
   4  Female   46       graduate              1        23.0     0.0               no             0         0    285.0  130.0   84.0  23.10       85.0     85.0             No
   5  Female   43  primaryschool              0         0.0     0.0               no             1         0    228.0  180.0  110.0  30.30       77.0     99.0             No
   6  Female   63     uneducated              0         0.0     0.0               no             0         0    205.0  138.0   71.0  33.11       60.0     85.0            yes

To get a quick description of the data:

In [57]: Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 4238 non-null object
1 age 4238 non-null int64
2 education 4133 non-null object
3 currentSmoker 4238 non-null int64
4 cigsPerDay 4209 non-null float64
5 BPMeds 4185 non-null float64
6 prevalentStroke 4238 non-null object
7 prevalentHyp 4238 non-null int64
8 diabetes 4238 non-null int64
9 totChol 4188 non-null float64
10 sysBP 4238 non-null float64
11 diaBP 4238 non-null float64
12 BMI 4219 non-null float64
13 heartRate 4237 non-null float64
14 glucose 3850 non-null float64
15 Heart_ stroke 4238 non-null object
dtypes: float64(8), int64(4), object(4)
memory usage: 529.9+ KB

Using .info() helps us see how many null values are in each column. For example, the education column has 4133 non-null values out of 4238 entries, which means that it has 105 null values.
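
The missing-value counts can also be read off directly instead of subtracting by hand. The following is a minimal sketch on a small made-up frame; the name toy and its values are illustrative assumptions, not the project data.

```python
import numpy as np
import pandas as pd

# Made-up toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "education": ["graduate", np.nan, "uneducated", np.nan],
    "glucose": [77.0, 76.0, np.nan, 103.0],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Same arithmetic as in the text: total rows minus the non-null count.
by_subtraction = len(toy) - toy.count()
print(by_subtraction.equals(null_counts))
```

Applied to the project frame, the equivalent call would be Data.isnull().sum().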

We can use .value_counts() to find categories and repetitions of any column:

Example: To know how many females and males are in the problem:

In [58]: Data['Gender'].value_counts()

Out[58]:
Gender
Female    2419
Male      1819
Name: count, dtype: int64

The .describe() method shows a summary of the attributes. The summary of numerical attributes (count, mean, std, min, quartiles, max) differs from the summary of categorical attributes (count, unique, top, freq).

Using .describe() with numerical data :

In [59]: Data.describe()

Out[59]:
               age  currentSmoker   cigsPerDay       BPMeds  prevalentHyp     diabetes      totChol        sysBP        diaBP          BMI    heartRate      glucose
count  4238.000000    4238.000000  4209.000000  4185.000000   4238.000000  4238.000000  4188.000000  4238.000000  4238.000000  4219.000000  4237.000000  3850.000000
mean     49.584946       0.494101     9.003089     0.029630      0.310524     0.025720   236.721585   132.352407    82.893464    25.802008    75.878924    81.966753
std       8.572160       0.500024    11.920094     0.169584      0.462763     0.158316    44.590334    22.038097    11.910850     4.080111    12.026596    23.959998
min      32.000000       0.000000     0.000000     0.000000      0.000000     0.000000   107.000000    83.500000    48.000000    15.540000    44.000000    40.000000
25%      42.000000       0.000000     0.000000     0.000000      0.000000     0.000000   206.000000   117.000000    75.000000    23.070000    68.000000    71.000000
50%      49.000000       0.000000     0.000000     0.000000      0.000000     0.000000   234.000000   128.000000    82.000000    25.400000    75.000000    78.000000
75%      56.000000       1.000000    20.000000     0.000000      1.000000     0.000000   263.000000   144.000000    89.875000    28.040000    83.000000    87.000000
max      70.000000       1.000000    70.000000     1.000000      1.000000     1.000000   696.000000   295.000000   142.500000    56.800000   143.000000   394.000000

Using .describe() with categorical attributes :

In [60]: Data[['Gender','education','prevalentStroke','Heart_ stroke']].describe()

Out[60]:
        Gender   education  prevalentStroke  Heart_ stroke
count     4238        4133             4238           4238
unique       2           4                2              2
top     Female  uneducated               no             No
freq      2419        1720             4213           3594
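
Instead of listing the categorical columns by name, describe() can be asked to skip the numerical ones. This is a small sketch on made-up data; the frame toy and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up toy frame; column names mirror the dataset, values are illustrative.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Female"],
    "education": ["graduate", "uneducated", "uneducated", None],
    "age": [39, 46, 48, 61],
})

# Excluding numeric dtypes leaves only the categorical columns,
# which are summarised with count / unique / top / freq.
summary = toy.describe(exclude=[np.number])
print(summary)
```

The None entry in education is not counted, which mirrors how the education count (4133) is lower than the row count in the output above.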

We can use .unique() to find the distinct categories of a certain attribute:

Example : Finding the education levels:

In [61]: Data.education.unique()

Out[61]:
array(['postgraduate', 'primaryschool', 'uneducated', 'graduate', nan],
      dtype=object)

Let's look at the number of patients whose age is greater than 43:

In [62]: Data[Data['age']> 43].age.count()

Out[62]: 2979

Step #4: Data Visualisation


In [63]: Data.plot(kind="scatter", x="sysBP", y="diaBP", grid=True,
s="BMI", label="BMI",
c="age", cmap="tab20b", colorbar=True,
legend=True, sharex=False, figsize=(10, 7))
plt.show()

Above is a scatter plot of diastolic blood pressure (diaBP) against systolic blood pressure (sysBP), with the color of each point controlled by the age attribute and its size by the BMI attribute.

To show the distribution of the numerical attributes, look at the following plot:

In [64]: plt.rc('font', size=14)


plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

Data.hist(bins=50, figsize=(12, 10))


plt.show()

To show the distribution of a categorical attribute (ex: Gender):

In [65]: x = Data['Gender'].value_counts()
values = [x.Female,x.Male]
Ans= ['Female','Male']
plt.pie(values,labels = Ans,autopct = '%1.1f%%',startangle=90,explode =(0,.1))
plt.legend(loc='upper right')
plt.title('Gender')
plt.show()

- Looking for Correlation

In [66]: Data2 = Data.select_dtypes(np.number)


corr_matrix = Data2.corr()
corr_matrix["BMI"].sort_values(ascending = False)

Out[66]:
BMI              1.000000
diaBP            0.377588
sysBP            0.326981
prevalentHyp     0.301318
age              0.135800
totChol          0.115767
BPMeds           0.100668
glucose          0.087377
diabetes         0.087036
heartRate        0.067678
cigsPerDay      -0.092856
currentSmoker   -0.167650
Name: BMI, dtype: float64

Let us plot some of the correlations:

In [67]: attr = ["BMI", "age", "diaBP","sysBP"]


scatter_matrix(Data[attr], figsize=(12, 8))
plt.show()

The scatter matrix above plots every selected attribute against every other one. On the diagonal, where an attribute would be plotted against itself, pandas shows a histogram of that attribute instead. Off the diagonal, an upward-sloping cloud of points indicates a positive linear relationship, a downward-sloping cloud indicates a negative (inverse) relationship, and a shapeless cloud indicates little or no linear correlation between the two attributes.
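
To make the three cases concrete, the sketch below computes the Pearson correlation coefficient for synthetic series. All arrays and the helper pearson are made up for illustration. The last case is a reminder that a coefficient near zero only rules out a linear relationship.

```python
import numpy as np

x = np.arange(-4, 5, dtype=float)   # -4, -3, ..., 4
increasing = 2.0 * x + 1.0          # rises with x
decreasing = -3.0 * x + 5.0         # falls as x rises
symmetric = x ** 2                  # depends on x, but not linearly


def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_inc = pearson(x, increasing)   # close to +1
r_dec = pearson(x, decreasing)   # close to -1
r_sym = pearson(x, symmetric)    # close to 0 despite the clear dependence
print(r_inc, r_dec, r_sym)
```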

Step #5 : Prepare the Data


- Handle Missing Data, Deal with Categorical Data, and Scale Data

In [68]: # Numerical columns: fill missing values with the median, then standardise.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"),StandardScaler())
# Categorical (object) columns: fill missing values with the most frequent category, then one-hot encode.
cat_pipeline = make_pipeline( SimpleImputer(strategy="most_frequent"),OneHotEncoder(handle_unknown="ignore"))
# All object columns, including the target 'Heart_ stroke', go through cat_pipeline;
# the remaining (numerical) columns go through num_pipeline.
preprocessing = ColumnTransformer([("cat", cat_pipeline, make_column_selector(dtype_include=object))],remainder=num_pipeline)
org_Data_prepared = preprocessing.fit_transform(Data)
# Convert the result to a DataFrame
ready_data = pd.DataFrame(org_Data_prepared,columns =preprocessing.get_feature_names_out() )
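
The cell above fits the imputers and the scaler on all rows before the train/test split, and it also one-hot encodes the target. A common alternative, shown here only as a hedged sketch and not as the method used in this project, is to separate the target, split first, and fit the preprocessing on the training rows only. The toy frame, its values, and the explicit column lists are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data; column names mirror the dataset, values are illustrative.
toy = pd.DataFrame({
    "Gender": pd.Series(["Male", "Female", "Female", "Male",
                         "Female", "Male", "Female", "Male"], dtype=object),
    "age": [39, 46, 48, 61, 46, 43, 63, 45],
    "glucose": [77.0, 76.0, np.nan, 103.0, 85.0, 99.0, 85.0, np.nan],
    "Heart_ stroke": ["No", "No", "No", "yes", "No", "No", "yes", "No"],
})

# Separate the target first so it is never imputed, encoded, or scaled.
X = toy.drop(columns="Heart_ stroke")
y = (toy["Heart_ stroke"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))
preprocessing = ColumnTransformer(
    [("num", num_pipeline, ["age", "glucose"]),
     ("cat", cat_pipeline, ["Gender"])],
    sparse_threshold=0,  # always return a dense array
)

# Statistics (medians, means, categories) are learned from the training rows only...
X_train_prepared = preprocessing.fit_transform(X_train)
# ...and then reused unchanged on the test rows.
X_test_prepared = preprocessing.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

With this ordering the test rows do not influence the medians, means, or standard deviations used during training.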

In [69]: ready_data.shape

Out[69]: (4238, 22)

In [70]: preprocessing.get_feature_names_out()

Out[70]:
array(['cat__Gender_Female', 'cat__Gender_Male',
       'cat__education_graduate', 'cat__education_postgraduate',
       'cat__education_primaryschool', 'cat__education_uneducated',
       'cat__prevalentStroke_no', 'cat__prevalentStroke_yes',
       'cat__Heart_ stroke_No', 'cat__Heart_ stroke_yes',
       'remainder__age', 'remainder__currentSmoker',
       'remainder__cigsPerDay', 'remainder__BPMeds',
       'remainder__prevalentHyp', 'remainder__diabetes',
       'remainder__totChol', 'remainder__sysBP', 'remainder__diaBP',
       'remainder__BMI', 'remainder__heartRate', 'remainder__glucose'],
      dtype=object)

To check if there is any null value in ready_data:

In [71]: ready_data.isnull().any()

Out[71]:
cat__Gender_Female              False
cat__Gender_Male                False
cat__education_graduate         False
cat__education_postgraduate     False
cat__education_primaryschool    False
cat__education_uneducated       False
cat__prevalentStroke_no         False
cat__prevalentStroke_yes        False
cat__Heart_ stroke_No           False
cat__Heart_ stroke_yes          False
remainder__age                  False
remainder__currentSmoker        False
remainder__cigsPerDay           False
remainder__BPMeds               False
remainder__prevalentHyp         False
remainder__diabetes             False
remainder__totChol              False
remainder__sysBP                False
remainder__diaBP                False
remainder__BMI                  False
remainder__heartRate            False
remainder__glucose              False
dtype: bool

- Splitting Data into Test Set and Train Set


In this step data will be split into two sets: test set and train set

In [72]: split = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
# Each iteration overwrites the two sets, so only the last of the 10 splits is kept.
for train_index, test_index in split.split(ready_data, ready_data[["cat__Heart_ stroke_No","cat__Heart_ stroke_yes"]]):
    strat_train_set = ready_data.loc[train_index]
    strat_test_set = ready_data.loc[test_index]
print("train set size = "+str(len(strat_train_set)))
print("test set size = "+str(len(strat_test_set)))

train set size = 3390


test set size = 848

To make sure that the proportions of the two Heart_ stroke classes are almost the same in the test and train sets:

In [73]: strat_test_set[["cat__Heart_ stroke_No","cat__Heart_ stroke_yes"]].value_counts() / len(strat_test_set)

Out[73]:
cat__Heart_ stroke_No  cat__Heart_ stroke_yes
1.0                    0.0                       0.847877
0.0                    1.0                       0.152123
Name: count, dtype: float64

In [74]: strat_train_set[["cat__Heart_ stroke_No","cat__Heart_ stroke_yes"]].value_counts() / len(strat_train_set)

Out[74]:
cat__Heart_ stroke_No  cat__Heart_ stroke_yes
1.0                    0.0                       0.848083
0.0                    1.0                       0.151917
Name: count, dtype: float64
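
The effect of stratification can be checked on a small synthetic label vector. This is a minimal sketch; the 85/15 class mix and the names splitter, train_rate, and test_rate are assumptions chosen to resemble the proportions above, not the project data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels: 85 negatives and 15 positives.
y = np.array([0] * 85 + [1] * 15)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# A single stratified split is enough when only one train/test pair is needed.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Share of positives on each side of the split.
train_rate = y[train_idx].mean()
test_rate = y[test_idx].mean()
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Both sides keep the overall share of positives, which is the property checked in the two cells above.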

- Separate Labels From Original Data


In [75]: Data_train = strat_train_set.drop(["cat__Heart_ stroke_No","cat__Heart_ stroke_yes"],axis = 1)
Label_train = strat_train_set["cat__Heart_ stroke_yes"].copy()
Data_test = strat_test_set.drop(["cat__Heart_ stroke_No","cat__Heart_ stroke_yes"],axis = 1)
Label_test = strat_test_set["cat__Heart_ stroke_yes"].copy()

Classification

- Training and Selecting a model


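First, we try several classification algorithms and compare them using the accuracy metric. Note that cross_val_score refits a fresh copy of the model on each fold, so the "Test Score" lines below are 3-fold cross-validation scores computed inside the test set, not scores of the model that was fitted on the training set: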

1. Logistic Regression:

In [76]: log_clf = LogisticRegression(random_state=42,penalty='l1',solver='liblinear')


LR = log_clf.fit(Data_train, Label_train)
score_train = cross_val_score(LR,Data_train, Label_train, cv=3,scoring='accuracy')
score_test = cross_val_score(LR,Data_test, Label_test, cv=3,scoring='accuracy')
print("Train Score: "+ str(score_train))
print("Test Score: "+ str(score_test))

Train Score: [0.85221239 0.85309735 0.85486726]


Test Score: [0.85512367 0.83392226 0.84751773]
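
The difference between cross-validating and scoring a fitted model on held-out data can be sketched as follows. The data come from make_classification and are purely synthetic; the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)

# cross_val_score fits clones of the estimator on each fold;
# the estimator passed in is left unfitted.
cv_scores = cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy")
model_still_unfitted = not hasattr(model, "coef_")

# Held-out estimate: fit once on the training set, score once on the test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores, model_still_unfitted, test_accuracy)
```

In this pattern the test set is touched only once, after model selection is finished.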

2. SVC:

In [77]: svm_clf = SVC(random_state=42)


svc = svm_clf.fit(Data_train, Label_train)
score_train1 = cross_val_score(svc,Data_train, Label_train, cv=3,scoring='accuracy')
score_test1 = cross_val_score(svc,Data_test, Label_test, cv=3,scoring='accuracy')
print("Train Score: "+ str(score_train1))
print("Test Score: "+ str(score_test1))

Train Score: [0.84690265 0.84424779 0.84778761]


Test Score: [0.84805654 0.8409894 0.84751773]

3. SGD Classifier (trained as a binary classifier):

In [78]: Label_train_1 = (Label_train==1)


Label_test_1 = (Label_test==1)

In [79]: sgd_clf = SGDClassifier(random_state=42)


sgd_clf.fit(Data_train, Label_train_1)

Out[79]: SGDClassifier(random_state=42)

In [80]: some_data = Data_train.iloc[0]


sgd_clf.predict([some_data])

/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but SGDClassifier was fitted with feature names
  warnings.warn(
Out[80]: array([False])

In [81]: cross_val_score(sgd_clf,Data_train,Label_train_1, cv=3, scoring="accuracy")

Out[81]: array([0.84513274, 0.84336283, 0.83893805])

4. Decision Tree Classifier:

In [82]: clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2)


dtc =clf.fit(Data_train, Label_train_1)
tree.plot_tree(dtc)
score_train2 = cross_val_score(dtc,Data_train, Label_train_1, cv=3,scoring='accuracy')
score_test2 = cross_val_score(dtc,Data_test, Label_test_1, cv=3,scoring='accuracy')
print("Train Score: "+ str(score_train2))
print("Test Score: "+ str(score_test2))

Train Score: [0.83185841 0.8460177 0.84778761]


Test Score: [0.84805654 0.83392226 0.84397163]

5. Random Forest Classifier:

In [83]: forest_clf = RandomForestClassifier(random_state=42)


rfc =forest_clf.fit(Data_train, Label_train)
score_train3 = cross_val_score(rfc,Data_train, Label_train, cv=3,scoring='accuracy')
score_test3 = cross_val_score(rfc,Data_test, Label_test, cv=3,scoring='accuracy')
print("Train Score: "+ str(score_train3))
print("Test Score: "+ str(score_test3))

Train Score: [0.85132743 0.84336283 0.84778761]


Test Score: [0.85159011 0.83038869 0.84397163]

6. KNN:

In [84]: knn_clf = KNeighborsClassifier()


knn = knn_clf.fit(Data_train, Label_train)
score_train4 = cross_val_score(knn,Data_train, Label_train, cv=3,scoring='accuracy')
score_test4 = cross_val_score(knn,Data_test, Label_test, cv=3,scoring='accuracy')
print("Train Score: "+ str(score_train4))
print("Test Score: "+ str(score_test4))

Train Score: [0.83362832 0.83362832 0.83539823]


Test Score: [0.83038869 0.79858657 0.81914894]

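Below is a table that shows each model with its best cross-validated accuracy and the corresponding best parameters. The SVC grid includes C = 0, which is not a valid value (C must be strictly positive), so the 12 fits that use it (2 kernels × 2 degrees × 3 folds) fail and are scored as NaN, as the warnings in the output report: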

In [85]: p = {
    'logistic_regression': {'model': LogisticRegression(random_state=42,penalty='l1',solver='liblinear'),
                            'params': {'C': [1,5,10,3,7,2], 'tol':[.0001,.001,.00001,0.01],
                                       'max_iter':[50,100,200,800,1000]}},
    'svm': {'model': svm.SVC(random_state=42),
            'params': {'C': [0, 1,10,20],'kernel': ['poly','linear'],'degree':[2,3]}},
    'binary_classifier': {'model': SGDClassifier(random_state=42),
                          'params': {'max_iter':[100,200,800,1000],'tol':[.0001,.001,.00001]}},
    'decision_tree': {'model': DecisionTreeClassifier(max_depth=2, min_samples_split=2),
                      'params': {'max_depth': [1,2,3,4,5,6], 'min_samples_split': [2,3,6]}},
    'random_forest': {'model': RandomForestClassifier(random_state=42),
                      'params': {'n_estimators': [1,5,10]}},
    'knn': {'model': KNeighborsClassifier(),
            'params': {'n_neighbors': [1,10,20]}}
}
table = []
for name, mp in p.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=3)
    clf.fit(Data_train , Label_train)
    table.append({
        'model': name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

table = pd.DataFrame(table,columns=['model','best_score','best_params'])
table

/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning:
12 fits failed out of a total of 48.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/svm/_base.py", line 180, in fit
    self._validate_params()
  File "/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'C' parameter of SVC must be a float in the range (0.0, inf). Got 0 instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-finite: [ nan nan nan nan 0.84660767 0.84660767
 0.84247788 0.84660767 0.8460177 0.84660767 0.82861357 0.84660767
 0.8460177 0.84660767 0.82389381 0.84660767]
  warnings.warn(
/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/linear_model/_stochastic_gradient.py:702: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
Out[85]:
                 model  best_score  best_params
0  logistic_regression    0.854277  {'C': 3, 'max_iter': 50, 'tol': 0.001}
1                  svm    0.846608  {'C': 1, 'degree': 2, 'kernel': 'poly'}
2    binary_classifier    0.842478  {'max_iter': 100, 'tol': 0.0001}
3        decision_tree    0.848083  {'max_depth': 1, 'min_samples_split': 2}
4        random_forest    0.842773  {'n_estimators': 10}
5                  knn    0.848083  {'n_neighbors': 20}

In [86]: table.sort_values(by = 'best_score',inplace = True,ascending = False )


table

Out[86]:
                 model  best_score  best_params
0  logistic_regression    0.854277  {'C': 3, 'max_iter': 50, 'tol': 0.001}
3        decision_tree    0.848083  {'max_depth': 1, 'min_samples_split': 2}
5                  knn    0.848083  {'n_neighbors': 20}
1                  svm    0.846608  {'C': 1, 'degree': 2, 'kernel': 'poly'}
4        random_forest    0.842773  {'n_estimators': 10}
2    binary_classifier    0.842478  {'max_iter': 100, 'tol': 0.0001}

As we can see the model with the highest accuracy is :

In [87]: table.iloc[0]

Out[87]:
model                             logistic_regression
best_score                                   0.854277
best_params    {'C': 3, 'max_iter': 50, 'tol': 0.001}
Name: 0, dtype: object

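After using cross-validation to test the accuracy of each model, we can see that Logistic Regression has the highest accuracy (0.854). The margin is small, however: about 84.8% of the patients belong to the "No" class, so a model that always predicts "No" would already reach roughly 0.848.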
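
A majority-class baseline makes the accuracy figures easier to judge. The sketch below rebuilds only the label counts implied by the training-set size and class proportions reported earlier (3390 rows, about 84.8% "No", that is 2875 negatives and 515 positives); the features are placeholders, and the baseline itself is an addition for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Label counts implied by the training-set proportions (assumption stated above).
y = np.array([0] * 2875 + [1] * 515)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

# Always predict the most frequent class ("No").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)  # 2875 / 3390
baseline_recall = recall_score(y, pred)      # no positive is ever predicted
print(baseline_accuracy, baseline_recall)
```

Because this baseline never identifies a positive case, accuracy alone says little here, which is why the next section looks at precision and recall.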

Precision, Recall and F1 Score Using LogisticRegression


In [109… # assuming that threshold = -2:
threshold = -2
label_scores = cross_val_predict(log_clf, Data_train, Label_train_1, cv=3,method="decision_function")
label_scores_2 = (label_scores>threshold)
precisions, recalls, thresholds = precision_recall_curve(Label_train_1,label_scores)
print(precisions)
print(recalls)
print(thresholds)

[0.1519174 0.15166716 0.15171192 ... 0.5 1. 1. ]


[1. 0.99805825 0.99805825 ... 0.00194175 0.00194175 0. ]
[-4.1732576 -4.0940033 -4.01600949 ... 2.16973358 2.27776467
2.92478342]

In [110… ##plotting precision vs recall relation


threshold =-2
plt.figure(figsize=(8, 4))
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")

idx = (thresholds >= threshold).argmax() # first index ≥ threshold


plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([thresholds.min(),thresholds.max(),0 , 1.2])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="center right")

plt.show()

In [111… ## finding the lowest threshold that gives at least 90% precision:


idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
threshold_for_90_precision

Out[111]: 2.924783421727712

In [112… Label_train_pred_90 = (label_scores >= threshold_for_90_precision)


Label_train_pred_90

Out[112]: array([False, False, False, ..., False, False, False])

In [113… precision_score(Label_train_1, Label_train_pred_90)

Out[113]: 1.0

In [94]: recall_at_90_precision = recall_score(Label_train_1, Label_train_pred_90)


recall_at_90_precision

Out[94]: 0.001941747572815534
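
In the cells above, 90% precision is reached only at the highest threshold, where a single sample is predicted positive, which is why recall drops to about 0.2%. The sketch below repeats the same search on made-up scores and adds a guard: precision_recall_curve returns one more precision value than thresholds, so the search is limited to entries that have a matching threshold. The arrays and names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Made-up labels and decision scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# The final precision entry (1.0) has no threshold attached, so drop it
# before searching for the first threshold that meets the target.
meets_target = precisions[:-1] >= 0.90
if meets_target.any():
    chosen_threshold = thresholds[meets_target.argmax()]
    pred = scores >= chosen_threshold
    chosen_precision = precision_score(y_true, pred)
    chosen_recall = recall_score(y_true, pred)
else:
    chosen_threshold = chosen_precision = chosen_recall = None
print(chosen_threshold, chosen_precision, chosen_recall)
```

Here too, the target precision is met only by the single top-scored sample, so most positives are missed.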

ROC Curve

In [95]: fpr, tpr, thresholds = roc_curve(Label_train_1, label_scores)

In [119… idx_for_threshold_at_90 = (thresholds <= threshold_for_90_precision).argmax()


tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]

plt.figure(figsize=(6, 5)) # extra code – not needed, just formatting


plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")

# extra code – just beautifies the figure


plt.gca().add_patch(patches.FancyArrowPatch((0.50, 0.65), (0.2, 0.4), connectionstyle="arc3,rad=.4",arrowstyle="Simple, tail_width=1.5, head_width=8, head_le
plt.text(0.12, 0.65, "Higher\nthreshold", color="#333333")
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.grid()
plt.axis([-.25, 1.25, -.25, 1.25])
plt.legend(loc="lower right", fontsize=13)

plt.show()

In [120… roc_auc_score(Label_train_1, label_scores)

Out[120]: 0.7257887716336007
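
One way to read the ROC AUC is as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below checks this on made-up scores without ties; the arrays and names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and decision scores (no tied scores), for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
pairwise = (pos[:, None] > neg[None, :]).mean()
auc = roc_auc_score(y_true, scores)
print(pairwise, auc)  # both are 0.75 (12 of the 16 pairs are ranked correctly)
```

Under this reading, the value of about 0.73 above means that a randomly chosen positive patient is scored above a randomly chosen negative one roughly 73% of the time.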

Confusion Matrix

In [121… cm = confusion_matrix(Label_train_1, label_scores_2)
cm

Out[121]:
array([[1589, 1286],
       [ 122,  393]])

In [122… #Computing the Precision


precision_score(Label_train_1,label_scores_2)

Out[122]: 0.23406789755807028

In [123… #Computing the Recall


recall_score(Label_train_1,label_scores_2)

Out[123]: 0.7631067961165049

In [124… f1_score(Label_train_1,label_scores_2)

Out[124]: 0.35824977210574294
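
The three scores above follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four counts reported in Out[121]; only the variable names are new.

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predictions.
cm = np.array([[1589, 1286],
               [122, 393]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 393 / 1679
recall = tp / (tp + fn)      # 393 / 515
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(precision, recall, f1, accuracy)
```

At the threshold of -2 the classifier finds about 76% of the positive cases, but fewer than one in four of its positive predictions is correct.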

In [125… ##Precision and Recall Tradeoff:


x = log_clf.decision_function([some_data])
x

/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  warnings.warn(
Out[125]: array([-2.63750842])

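In [126… scaler = StandardScaler()
# Data_train already went through StandardScaler in Step #5, so this second pass is largely redundant.
Data_train_scaled = scaler.fit_transform(Data_train.astype("float64"))
cross_val_score(log_clf, Data_train_scaled, Label_train, cv=3, scoring="accuracy")
Label_train_pred = cross_val_predict(log_clf,Data_train_scaled,Label_train, cv=3)

plt.rc('font', size=9) # extra code – make the text smaller
# The matrix is drawn from label_scores_2 (the threshold = -2 predictions), not from Label_train_pred.
ConfusionMatrixDisplay.from_predictions(Label_train, label_scores_2)
plt.show()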

In [127… plt.rc('font', size=10)


ConfusionMatrixDisplay.from_predictions(Label_train, label_scores_2,normalize="true", values_format=".0%")
plt.show()

- Fine-Tune the Hyperparameters of the Best Model.


In [128… params = { 'logistic_regression' : {'model': LogisticRegression(penalty='l1',solver='liblinear'),
'params': {'C': [1,5,10,3,7,2] ,
'tol':[.0001,.001,.00001,0.01],
'max_iter':[50,100,200,800,1000]}}
}

In [106… items = params.items()
for name1, mp1 in items:
    gs = GridSearchCV(log_clf, mp1['params'],cv = 3)
    gs.fit(Data_train,Label_train)
    print("Model Name: "+name1)
    print("Best Score: "+ str(gs.best_score_))
    print("Best Parameters: "+ str(gs.best_params_))

Model Name: logistic_regression


Best Score: 0.8542772861356932
Best Parameters: {'C': 3, 'max_iter': 50, 'tol': 0.001}

In [107… end_mod = gs.best_estimator_

- Evaluating the best model on the test data and computing the accuracy of the model:

In [108… pred = end_mod.predict(Data_test)
accuracy_score(Label_test,pred)

Out[108]: 0.8525943396226415
