In [55]: Data
Out[55]: Gender age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose Heart_ stroke
0 Male 39 postgraduate 0 0.0 0.0 no 0 0 195.0 106.0 70.0 26.97 80.0 77.0 No
1 Female 46 primaryschool 0 0.0 0.0 no 0 0 250.0 121.0 81.0 28.73 95.0 76.0 No
2 Male 48 uneducated 1 20.0 0.0 no 0 0 245.0 127.5 80.0 25.34 75.0 70.0 No
3 Female 61 graduate 1 30.0 0.0 no 1 0 225.0 150.0 95.0 28.58 65.0 103.0 yes
4 Female 46 graduate 1 23.0 0.0 no 0 0 285.0 130.0 84.0 23.10 85.0 85.0 No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4233 Male 50 uneducated 1 1.0 0.0 no 1 0 313.0 179.0 92.0 25.97 66.0 86.0 yes
4234 Male 51 graduate 1 43.0 0.0 no 0 0 207.0 126.5 80.0 19.71 65.0 68.0 No
4235 Female 48 primaryschool 1 20.0 NaN no 0 0 248.0 131.0 72.0 22.00 84.0 86.0 No
4236 Female 44 uneducated 1 15.0 0.0 no 0 0 210.0 126.5 87.0 19.16 86.0 NaN No
4237 Female 52 primaryschool 0 0.0 0.0 no 0 0 269.0 133.5 83.0 21.47 80.0 107.0 No
In [56]: Data.head(7)
Out[56]: Gender age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose Heart_ stroke
0 Male 39 postgraduate 0 0.0 0.0 no 0 0 195.0 106.0 70.0 26.97 80.0 77.0 No
1 Female 46 primaryschool 0 0.0 0.0 no 0 0 250.0 121.0 81.0 28.73 95.0 76.0 No
2 Male 48 uneducated 1 20.0 0.0 no 0 0 245.0 127.5 80.0 25.34 75.0 70.0 No
3 Female 61 graduate 1 30.0 0.0 no 1 0 225.0 150.0 95.0 28.58 65.0 103.0 yes
4 Female 46 graduate 1 23.0 0.0 no 0 0 285.0 130.0 84.0 23.10 85.0 85.0 No
5 Female 43 primaryschool 0 0.0 0.0 no 1 0 228.0 180.0 110.0 30.30 77.0 99.0 No
6 Female 63 uneducated 0 0.0 0.0 no 0 0 205.0 138.0 71.0 33.11 60.0 85.0 yes
In [57]: Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 4238 non-null object
1 age 4238 non-null int64
2 education 4133 non-null object
3 currentSmoker 4238 non-null int64
4 cigsPerDay 4209 non-null float64
5 BPMeds 4185 non-null float64
6 prevalentStroke 4238 non-null object
7 prevalentHyp 4238 non-null int64
8 diabetes 4238 non-null int64
9 totChol 4188 non-null float64
10 sysBP 4238 non-null float64
11 diaBP 4238 non-null float64
12 BMI 4219 non-null float64
13 heartRate 4237 non-null float64
14 glucose 3850 non-null float64
15 Heart_ stroke 4238 non-null object
dtypes: float64(8), int64(4), object(4)
memory usage: 529.9+ KB
Using .info() tells us how many non-null values each column has, and therefore how many are missing. For example, the education column has 4133 non-null values out of 4238 entries, which means it has 105 null values.
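The same missing-value counts can be read off directly with .isnull().sum(), instead of subtracting non-null counts by hand. A minimal sketch on a small hypothetical frame (toy values, not the real dataset):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the real dataset (hypothetical values).
df = pd.DataFrame({
    "education": ["graduate", None, "uneducated", "graduate"],
    "glucose": [77.0, np.nan, 85.0, np.nan],
})

# isnull() marks missing entries; summing the booleans counts them per column.
null_counts = df.isnull().sum()
print(null_counts)  # education: 1, glucose: 2
```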
Example: to count how many females and males are in the dataset:
In [58]: Data['Gender'].value_counts()
Out[58]:
Gender
Female 2419
Male 1819
Name: count, dtype: int64
The .describe() method shows a summary of the attributes. The summary of numerical attributes (count, mean, std, quartiles, min, max) differs from the summary of categorical attributes (count, unique, top, freq).
In [59]: Data.describe()
Out[59]: age currentSmoker cigsPerDay BPMeds prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose
count 4238.000000 4238.000000 4209.000000 4185.000000 4238.000000 4238.000000 4188.000000 4238.000000 4238.000000 4219.000000 4237.000000 3850.000000
mean 49.584946 0.494101 9.003089 0.029630 0.310524 0.025720 236.721585 132.352407 82.893464 25.802008 75.878924 81.966753
std 8.572160 0.500024 11.920094 0.169584 0.462763 0.158316 44.590334 22.038097 11.910850 4.080111 12.026596 23.959998
min 32.000000 0.000000 0.000000 0.000000 0.000000 0.000000 107.000000 83.500000 48.000000 15.540000 44.000000 40.000000
25% 42.000000 0.000000 0.000000 0.000000 0.000000 0.000000 206.000000 117.000000 75.000000 23.070000 68.000000 71.000000
50% 49.000000 0.000000 0.000000 0.000000 0.000000 0.000000 234.000000 128.000000 82.000000 25.400000 75.000000 78.000000
75% 56.000000 1.000000 20.000000 0.000000 1.000000 0.000000 263.000000 144.000000 89.875000 28.040000 83.000000 87.000000
max 70.000000 1.000000 70.000000 1.000000 1.000000 1.000000 696.000000 295.000000 142.500000 56.800000 143.000000 394.000000
The categorical attributes have the following numbers of unique values: Gender 2, education 4, prevalentStroke 2, Heart_ stroke 2.
We can use .unique() to list the categories of a given attribute without repetition:
In [61]: Data.education.unique()
Let's look at the number of patients whose age is greater than 43:
Out[62]: 2979
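The input cell for this count did not survive the export; a sketch of the usual boolean-mask idiom, shown here on a hypothetical toy frame rather than the real Data, is:

```python
import pandas as pd

# Hypothetical ages; the real notebook filters Data['age'].
df = pd.DataFrame({"age": [39, 46, 48, 61, 44]})

# The comparison yields a boolean mask; summing it counts the
# rows where the condition holds.
older_than_43 = (df["age"] > 43).sum()
print(older_than_43)  # 4 of these 5 toy rows
```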
Above is a scatter plot showing the relationship between the diastolic blood pressure (diaBP) and the systolic blood pressure (sysBP), with point color controlled by the age attribute and point size by the BMI attribute.
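The plotting cell itself is missing from the export; a sketch of how such a scatter plot is typically drawn with matplotlib, using toy stand-in values for the real columns, could look like:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the real columns.
df = pd.DataFrame({
    "diaBP": [70.0, 81.0, 80.0, 95.0],
    "sysBP": [106.0, 121.0, 127.5, 150.0],
    "age": [39, 46, 48, 61],
    "BMI": [26.97, 28.73, 25.34, 28.58],
})

# Color each point by age (c=...) and scale its marker by BMI (s=...).
sc = plt.scatter(df["diaBP"], df["sysBP"], c=df["age"], s=df["BMI"] * 5)
plt.xlabel("diaBP")
plt.ylabel("sysBP")
plt.colorbar(sc, label="age")
plt.show()
```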
To show the distribution of the numerical attributes, look at the following plot:
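The distribution plot did not survive the export either; a common way to draw one histogram per numerical column is DataFrame.hist(), sketched here on toy data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numerical columns.
df = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "sysBP": [106.0, 121.0, 127.5, 150.0, 130.0],
})

# One histogram panel per numerical column, titled with the column name.
axes = df.hist(bins=10, figsize=(8, 4))
plt.tight_layout()
plt.show()
```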
In [65]: x = Data['Gender'].value_counts()
values = [x.Female,x.Male]
Ans= ['Female','Male']
plt.pie(values,labels = Ans,autopct = '%1.1f%%',startangle=90,explode =(0,.1))
plt.legend(loc='upper right')
plt.title('Gender')
plt.show()
Out[66]:
BMI 1.000000
diaBP 0.377588
sysBP 0.326981
prevalentHyp 0.301318
age 0.135800
totChol 0.115767
BPMeds 0.100668
glucose 0.087377
diabetes 0.087036
heartRate 0.067678
cigsPerDay -0.092856
currentSmoker -0.167650
Name: BMI, dtype: float64
The output above lists the correlation of each numerical column with BMI; a column's correlation with itself is always 1. In a scatter-matrix plot, when a column is plotted against itself the panel is drawn as a histogram, as shown. When two different columns are plotted against each other, the panel is a scatter plot whose trend can be increasing, decreasing, or absent: an increasing trend indicates a positive linear relationship, a decreasing trend means the two attributes are inversely proportional, and no trend means there is no correlation between those attributes.
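A sketch of how such a correlation ranking is typically computed (toy values here; the real notebook presumably calls .corr() on Data's numerical columns):

```python
import pandas as pd

# Hypothetical numerical columns.
df = pd.DataFrame({
    "BMI": [26.97, 28.73, 25.34, 28.58, 23.10],
    "diaBP": [70.0, 81.0, 80.0, 95.0, 84.0],
    "sysBP": [106.0, 121.0, 127.5, 150.0, 130.0],
})

# Pearson correlation of every numerical column with BMI, sorted
# descending; BMI's correlation with itself is 1.0 and comes first.
corr_with_bmi = df.corr(numeric_only=True)["BMI"].sort_values(ascending=False)
print(corr_with_bmi)
```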
In [69]: ready_data.shape
Out[69]: (4238, 22)
In [70]: preprocessing.get_feature_names_out()
Out[70]:
array(['cat__Gender_Female', 'cat__Gender_Male',
'cat__education_graduate', 'cat__education_postgraduate',
'cat__education_primaryschool', 'cat__education_uneducated',
'cat__prevalentStroke_no', 'cat__prevalentStroke_yes',
'cat__Heart_ stroke_No', 'cat__Heart_ stroke_yes',
'remainder__age', 'remainder__currentSmoker',
'remainder__cigsPerDay', 'remainder__BPMeds',
'remainder__prevalentHyp', 'remainder__diabetes',
'remainder__totChol', 'remainder__sysBP', 'remainder__diaBP',
'remainder__BMI', 'remainder__heartRate', 'remainder__glucose'],
dtype=object)
In [71]: ready_data.isnull().any()
Out[71]:
cat__Gender_Female False
cat__Gender_Male False
cat__education_graduate False
cat__education_postgraduate False
cat__education_primaryschool False
cat__education_uneducated False
cat__prevalentStroke_no False
cat__prevalentStroke_yes False
cat__Heart_ stroke_No False
cat__Heart_ stroke_yes False
remainder__age False
remainder__currentSmoker False
remainder__cigsPerDay False
remainder__BPMeds False
remainder__prevalentHyp False
remainder__diabetes False
remainder__totChol False
remainder__sysBP False
remainder__diaBP False
remainder__BMI False
remainder__heartRate False
remainder__glucose False
dtype: bool
To make sure that the data is split almost equally between the train and test sets:
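The splitting cell is not shown in the export. One way to keep the class proportions balanced between train and test, sketched on hypothetical labels, is scikit-learn's stratify option:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and labels with a 75/25 class ratio.
X = pd.DataFrame({"age": range(100)})
y = pd.Series([0] * 75 + [1] * 25)

# stratify=y preserves the 75/25 ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both close to 0.25
```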
Classification
1. Logistic Regression:
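The model cells did not survive the export. As an illustration only, fitting a logistic-regression classifier with the hyperparameters that the search later reports as best ({'C': 3, 'max_iter': 50, 'tol': 0.001}) might look like this, on hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features/labels standing in for the preprocessed data.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hyperparameters taken from the best_params reported below.
clf = LogisticRegression(C=3, max_iter=50, tol=0.001)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```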
2. SVC:
Out[79]: SGDClassifier(random_state=42)
6. KNN:
Below is a table that shows each model with its maximum accuracy and the corresponding best_params:
'binary_classifier':{'model': SGDClassifier(random_state=42),
'params' : {'max_iter':[100,200,800,1000],'tol':[.0001,.001,.00001]}},
table = pd.DataFrame(table,columns=['model','best_score','best_params'])
table
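Only fragments of the search loop survive in the export. A self-contained sketch of the GridSearchCV pattern those fragments imply, on toy data and with just two of the models, could look like:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical data; the real notebook searches over more models.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] > 0).astype(int)

models = {
    "logistic_regression": {
        "model": LogisticRegression(),
        "params": {"C": [1, 3]},
    },
    "binary_classifier": {
        "model": SGDClassifier(random_state=42),
        "params": {"max_iter": [100, 200, 800, 1000],
                   "tol": [.0001, .001, .00001]},
    },
}

# Run a grid search per model and collect the best score/params.
rows = []
for name, spec in models.items():
    gs = GridSearchCV(spec["model"], spec["params"], cv=4)
    gs.fit(X, y)
    rows.append({"model": name, "best_score": gs.best_score_,
                 "best_params": gs.best_params_})

table = pd.DataFrame(rows, columns=["model", "best_score", "best_params"])
print(table)
```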
/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning:
12 fits failed out of a total of 48.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
warnings.warn(some_fits_failed_message, FitFailedWarning)
/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-finite: [ nan nan nan nan 0.84660767 0.84660767
0.84247788 0.84660767 0.8460177 0.84660767 0.82861357 0.84660767
0.8460177 0.84660767 0.82389381 0.84660767]
warnings.warn(
/Users/yasmeenalhajyousef/anaconda3/envs/yasmeen/lib/python3.11/site-packages/sklearn/linear_model/_stochastic_gradient.py:702: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn(
Out[85]: model best_score best_params
In [87]: table.iloc[0]
Out[87]:
model logistic_regression
best_score 0.854277
best_params {'C': 3, 'max_iter': 50, 'tol': 0.001}
Name: 0, dtype: object
After using cross-validation to estimate the accuracy of each model, we can see that Logistic Regression has the highest accuracy.
plt.show()
Out[111]: 2.924783421727712
Out[113]: 1.0
Out[94]: 0.001941747572815534
ROC Curve
In [95]: fpr, tpr, thresholds = roc_curve(Label_train_1, label_scores)
plt.show()
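The plotting code between the roc_curve call and plt.show() is missing from the export; a typical ROC-curve plot, sketched on hypothetical labels and scores rather than the real Label_train_1 / label_scores, is:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

# roc_curve sweeps the decision threshold; plot TPR against FPR.
fpr, tpr, thresholds = roc_curve(y_true, scores)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_true, scores):.3f}")
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```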
In [120… roc_auc_score(Label_train_1, label_scores)
0.7257887716336007
Out[120]:
Confusion Matrix
In [121]: cm = confusion_matrix(Label_train_1, label_scores_2)
cm
Out[121]:
array([[1589, 1286],
[ 122, 393]])
Out[122]: 0.23406789755807028 (precision)
Out[123]: 0.7631067961165049 (recall)
In [124]: f1_score(Label_train_1,label_scores_2)
Out[124]: 0.35824977210574294
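The metric values in Out[122], Out[123], and Out[124] are consistent with the confusion matrix above: they equal the precision, recall, and F1 score computed from its entries, which the following sketch verifies:

```python
import numpy as np

# Confusion matrix from the notebook: rows = true class, cols = predicted.
cm = np.array([[1589, 1286],
               [122, 393]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)                          # 393 / 1679
recall = tp / (tp + fn)                             # 393 / 515
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)
```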
- Evaluating the best model on the test data and computing the accuracy of the model:
In [108]: pred = end_mod.predict(Data_test)
accuracy_score(Label_test,pred)
Out[108]: 0.8525943396226415