Siahaan V. Data Science Crash Course... With Python GUI 2ed 2023

DATA SCIENCE CRASH COURSE:
THYROID DISEASE
CLASSIFICATION AND PREDICTION
USING MACHINE LEARNING AND DEEP LEARNING
WITH PYTHON GUI
DATA SCIENCE CRASH COURSE:
THYROID DISEASE
CLASSIFICATION AND PREDICTION
USING MACHINE LEARNING AND DEEP LEARNING
WITH PYTHON GUI
Second Edition
VIVIAN SIAHAAN
RISMON HASIHOLAN SIANIPAR
Copyright © 2023 BALIGE Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor BALIGE Publishing or its
dealers and distributors, will be held liable for any damages caused or alleged to have
been caused directly or indirectly by this book. BALIGE Publishing has endeavored to
provide trademark information about all of the companies and products mentioned in this
book by the appropriate use of capitals. However, BALIGE Publishing cannot guarantee
the accuracy of this information.
Copyright © 2023 BALIGE Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor BALIGE Publishing or its
dealers and distributors, will be held liable for any damages caused or alleged to have
been caused directly or indirectly by this book. BALIGE Publishing has endeavored to
provide trademark information about all of the companies and products mentioned in this
book by the appropriate use of capitals. However, BALIGE Publishing cannot guarantee
the accuracy of this information.
Published: JULY 2023

Production reference: 21070223
Published by BALIGE Publishing Ltd.
BALIGE, North Sumatera
ABOUT THE AUTHOR
ABOUT THE AUTHOR
Vivian Siahaan is a highly motivated individual
with a passion for continuous learning and
exploring new areas. Born and raised in Hinalang
Bagasan, Balige, situated on the picturesque
banks of Lake Toba, she completed her high school
education at SMAN 1 Balige. Vivian's journey into the
world of programming began with a deep dive into
various languages such as Java, Android, JavaScript,
CSS, C++, Python, R, Visual Basic, Visual C#,
MATLAB, Mathematica, PHP, JSP, MySQL, SQL Server,
Oracle, Access, and more. Starting from scratch, Vivian
diligently studied programming, focusing on mastering
the fundamental syntax and logic. She honed her skills
by creating practical GUI applications, gradually
building her expertise. One particular area of interest
for Vivian is animation and game development, where
she aspires to make significant contributions. Alongside
her programming and mathematical pursuits, she also
finds joy in indulging in novels, nurturing her love for
literature. Vivian Siahaan's passion for programming
and her extensive knowledge are reflected in the
numerous ebooks she has authored. Her works,
published by Sparta Publisher, cover a wide range of
topics, including "Data Structure with Java," "Java
Programming: Cookbook," "C++ Programming:
Cookbook," "C Programming For High
Schools/Vocational Schools and Students," "Java
Programming for SMA/SMK," "Java Tutorial: GUI,
Graphics and Animation," "Visual Basic Programming:
From A to Z," "Java Programming for Animation and
Games," "C# Programming for SMA/SMK and
Students," "MATLAB For Students and Researchers,"
"Graphics in JavaScript: Quick Learning Series,"
"JavaScript Image Processing Methods: From A to Z,"
"Java GUI Case Study: AWT & Swing," "Basic CSS and
JavaScript," "PHP/MySQL Programming: Cookbook,"
"Visual Basic: Cookbook," "C++ Programming for High
Schools/Vocational Schools and Students," "Concepts
and Practices of C++," "PHP/MySQL For Students,"
"C# Programming: From A to Z," "Visual Basic for
SMA/SMK and Students," and "C# .NET and SQL
Server for High School/Vocational School and
Students." Furthermore, at the ANDI Yogyakarta
publisher, Vivian Siahaan has contributed to several
notable books, including "Python Programming Theory
and Practice," "Python GUI Programming," "Python
GUI and Database," "Build From Zero School Database
Management System In Python/MySQL," "Database
Management System in Python/MySQL,"
"Python/MySQL For Management Systems of Criminal
Track Record Database," "Java/MySQL For
Management Systems of Criminal Track Records
Database," "Database and Cryptography Using
Java/MySQL," and "Build From Zero School Database
Management System With Java/MySQL." Vivian's
diverse range of expertise in programming languages,
combined with her passion for exploring new horizons,
makes her a dynamic and versatile individual in the
field of technology. Her dedication to learning, coupled
with her strong analytical and problem-solving skills,
positions her as a valuable asset in any programming
endeavor. Vivian Siahaan's contributions to the world
of programming and literature continue to inspire and
empower aspiring programmers and readers alike.
Rismon Hasiholan Sianipar, born in Pematang Siantar in 1994,

is a distinguished researcher and expert in the field of electrical
engineering. After completing his education at SMAN 3 Pematang
Siantar, Rismon ventured to the city of Jogjakarta to pursue his
academic journey. He obtained his Bachelor of Engineering (S.T) and
Master of Engineering (M.T) degrees in Electrical Engineering from
Gadjah Mada University in 1998 and 2001, respectively, under the
guidance of esteemed professors, Dr. Adhi Soesanto and Dr. Thomas Sri
Widodo. During his studies, Rismon focused on researching non-stationary
signals and their energy analysis using time-frequency maps. He explored
the dynamic nature of signal energy distribution on time-frequency maps
and developed innovative techniques using discrete wavelet
transformations to design non-linear filters for data pattern analysis. His
research showcased the application of these techniques in various fields.
In recognition of his academic prowess, Rismon was awarded the
prestigious Monbukagakusho scholarship by the Japanese Government in
2003. He went on to pursue his Master of Engineering (M.Eng) and
Doctor of Engineering (Dr.Eng) degrees at Yamaguchi University,
supervised by Prof. Dr. Hidetoshi Miike. Rismon's master's and doctoral
theses revolved around combining the SR-FHN (Stochastic Resonance
Fitzhugh-Nagumo) filter strength with the cryptosystem ECC (elliptic
curve cryptography) 4096-bit. This innovative approach effectively
suppressed noise in digital images and videos while ensuring their
authenticity. Rismon's research findings have been published in renowned
international scientific journals, and his patents have been officially
registered in Japan. Notably, one of his patents, with registration number
2008-009549, gained recognition. He actively collaborates with several
universities and research institutions in Japan, specializing in
cryptography, cryptanalysis, and digital forensics, particularly in the
areas of audio, image, and video analysis. With a passion for knowledge
sharing, Rismon has authored numerous national and international
scientific articles and authored several national books. He has also
actively participated in workshops related to cryptography, cryptanalysis,
digital watermarking, and digital forensics. During these workshops,
Rismon has assisted Prof. Hidetoshi Miike in developing applications
related to digital image and video processing, steganography,
cryptography, watermarking, and more, which serve as valuable training
materials. Rismon's field of interest encompasses multimedia security,
signal processing, digital image and video analysis, cryptography, digital
communication, digital forensics, and data compression. He continues to
advance his research by developing applications using programming
languages such as Python, MATLAB, C++, C, VB.NET, C#.NET, R, and
Java. These applications serve both research and commercial purposes,
further contributing to the advancement of signal and image analysis.
Rismon Hasiholan Sianipar is a dedicated researcher and expert in the
field of electrical engineering, particularly in the areas of signal
processing, cryptography, and digital forensics. His academic
achievements, patented inventions, and extensive publications
demonstrate his commitment to advancing knowledge in these fields.
Rismon's contributions to academia and his collaborations with
prestigious institutions in Japan have solidified his position as a respected
figure in the scientific community. Through his ongoing research and
development of innovative applications, Rismon continues to make
significant contributions to the field of electrical engineering.
ABOUT THE BOOK

ABOUT THE BOOK
Thyroid disease is a prevalent condition that affects the

thyroid gland, leading to various health issues. In this
session of the Data Science Crash Course, we will
explore the classification and prediction of thyroid
disease using machine learning and deep learning
techniques, all implemented with the power of Python
and a user-friendly GUI built with PyQt.
We will start by conducting data exploration on a

comprehensive dataset containing relevant features
and thyroid disease labels. Through analysis and
pattern recognition, we will gain insights into the
underlying factors contributing to thyroid disease.
Next, we will delve into the machine learning phase,

where we will implement popular algorithms including
Support Vector, Logistic Regression, K-Nearest
Neighbors (KNN), Decision Tree, Random Forest,
Gradient Boosting, Light Gradient Boosting, Naive
Bayes, Adaboost, Extreme Gradient Boosting, and
Multi-Layer Perceptron. These models will be trained
using different preprocessing techniques, including raw
data, normalization, and standardization, to evaluate
their performance and accuracy. We train each model
on the training dataset and evaluate its performance
using appropriate metrics such as accuracy, precision,
recall, and F1-score. This helps us assess how well the
models can predict stroke based on the given features.
To optimize the models' performance, we perform
hyperparameter tuning using techniques like grid
search or randomized search. This involves
systematically exploring different combinations of
hyperparameters to find the best configuration for each
model. After training and tuning the models, we save
them to disk using joblib. This allows us to reuse the
trained models for future predictions without having to
train them again.
Moving beyond traditional machine learning, we will

build an artificial neural network (ANN) using
TensorFlow. This ANN will capture complex
relationships within the data and provide accurate
predictions of thyroid disease. To ensure the
effectiveness of our ANN, we will train it using a
curated dataset split into training and testing sets. This
will allow us to evaluate the model's performance and
its ability to generalize predictions.
To provide an interactive and user-friendly experience,

we will develop a Graphical User Interface (GUI) using
PyQt. The GUI will allow users to input data, select
prediction methods (machine learning or deep
learning), and visualize the results. Through the GUI,
users can explore different prediction methods,
compare performance, and gain insights into thyroid
disease classification. Visualizations of training and
validation loss, accuracy, and confusion matrices will
enhance understanding and model evaluation. Line
plots comparing true values and predicted values will
further aid interpretation and insights into
classification outcomes. Throughout the project, we will
emphasize the importance of preprocessing techniques,
feature selection, and model evaluation in building
reliable and effective thyroid disease classification and
prediction models.
By the end of the project, readers will have gained

practical knowledge in data exploration, machine
learning, deep learning, and GUI development. They
will be equipped to apply these techniques to other
domains and real-world challenges. The project’s
comprehensive approach, from data exploration to
model development and GUI implementation, ensures a
holistic understanding of thyroid disease classification
and prediction. It empowers readers to explore
applications of data science in healthcare and beyond.
The combination of machine learning and deep learning

techniques, coupled with the intuitive GUI, offers a
powerful framework for thyroid disease classification
and prediction. This project serves as a stepping stone
for readers to contribute to the field of medical data
science. Data-driven approaches in healthcare have the
potential to unlock valuable insights and improve
outcomes. The focus on thyroid disease classification
and prediction in this session showcases the
transformative impact of data science in the medical
field. Together, let us embark on this journey to
advance our understanding of thyroid disease and
make a difference in the lives of individuals affected by
this condition. Welcome to the Data Science Crash
Course on Thyroid Disease Classification and
Prediction!
CONTENT
CONTENT
EXPLORING DATASET AND FEATURES 1

DISTRIBUTION 1
Description 1
Exploring Dataset 9
Information of Dataset 11
Checking Unique Values 13
Checking Null Values 15
Converting Six Columns into Numerical 17
Deleting Irrelevant Columns 20
Statistical Description 21
Limiting Age 22
Distribution of Samples 23
Distribution of Target Variable 26
Distribution of All Features 28
Distribution of Age versus Target Variable 33
Distribution of TSH versus Target Variable 36
Distribution of T3 versus Target Variable 38
Distribution of TT4 versus Target Variable 40
Distribution of T4U versus Target Variable 42
Distribution of FTI versus Target Variable 43
Distribution of Sex Feature 49
Distribution of On Thyroxine Feature 52
Distribution of Sick Feature 54
Distribution of Tumor Feature 56
Distribution of TSH Measured Feature 58
Distribution of TT4 Measured Feature 60
Extracting Categorical and Numerical Features 62
Extracting Categorical and Numerical Features 65
Distribution of Categorical Features 68
Distribution of Nine Categorical Features versus 70
Target Variable 73
On Antithyroid Medication
76
Distribution of Nine Categorical Features versus
77
Thyroid Surgery
79
Lithium 80
TSH Measured 83
T3 Measured 86
TT4 Measured
93
97
T4U Measured
100
FTI Measured 103
I131 Measured 107
Query Hypothyroid 111
Percentage Distribution of On Thyroxine and On
Antithyroid Medication versus Target Variable
Percentage Distribution of Sick and Pregnant
versus Target Variable
114
Percentage Distribution of Lithium and Tumor
versus Target Variable 114
Age 119
Probability Density of Nine Categorical Features 122
versus Age 125
Distribution and Probability Density of Nine 126
Categorical Features versus TSH
130
Distribution and Probability Density of Nine
133
Categorical Features versus T3
135
Distribution and Probability Density of Nine
Categorical Features versus TT4 138
Categorical Features versus T4U 161
Categorical Features versus FTI
184
195
199
203
PREDICTING THYROID USING MACHINE
208
LEARNING
212
Converting Categorical Columns into Numerical
Feature Importance Using Random Forest
Feature Importance Using Extra Trees
Feature Importance Using Logistic Regression
229
Resampling and Splitting Data
229
Learning Curve
232
Real Values versus Predicted Values and
Confusion Matrix 233
ROC and Decision Boundaries 241
Training Model and Predicting Thyroid 244
Support Vector Classifier 245
Logistic Regression Classifier 247
K-Nearest Neighbors Classifier 249
Decision Tree Classifier 251
Random Forest Classifier
Gradient Boosting Classifier
Extreme Gradient Boosting Classifier
Multi-Layer Perceptron Classifier 255
Light Gradient Boosting Classifier 255
Source Code 260
269
275
281
PREDICTING THYROID USING DEEP 285
LEARNING 289
Reading Dataset and Preprocessing 292
Resampling, Splitting, and Scaling Data 308
Building, Compiling, and Training Model 316
Plotting Accuracy and Loss 319
Predicting Thyroid Using Test Data 324
Printing Accuracy and Classification Report 328
Confusion Matrix 331
True Values versus Predicted Values 334
Source Code 338
341
345
349
IMPLEMENTING GUI WITH PYQT 352
Designing GUI 356
Preprocessing Data and Populating Tables 360
Resampling and Splitting Data 369
Distribution of Target Variable
Distribution of TSH Measured
Distribution of T3 Measured
Distribution of TT4 Measured
Case and Probability Distribution
Helper Functions to Plot Model Performance
Training Model and Predicting Thyroid
Logistic Regression Classifier
Support Vector Classifier
K-Nearest Neighbors Classifier
Decision Tree Classifier
Naïve Bayes Classifier
Adaboost Classifier
Light Gradient Boosting Classifier
Multi-Layer Perceptron Classifier
ANN Classifier
Source Code
EXPLORING
DATASET
AND FEATURES DISTRIBUTION
EXPLORING
DATASET
AND FEATURES DISTRIBUTION
Description
This dataset was from Garavan Institute

Documentation as given by Ross Quinlan 6 databases
from the Garavan Institute in Sydney, Australia.
Approximately the following for each database: 2800
training (data) instances and 972 test instances. This
dataset contains plenty of missing data, while 29 or
so attributes, either Boolean or continuously-valued.
Exploring Dataset
Step Download dataset from

1 https://viviansiahaan.blogspot.com/2023/07/data-science-crash-
course-thyroid.html and save it to your working directory. Unzip the
file, hypothyroid.csv, and put it into working directory.
Step Open a new Python script and save it as thyroid.py.

2
Step Import all necessary libraries:

3
1 #thyroid.py
2 import numpy as np
3 import pandas as pd
4 import matplotlib
5 import matplotlib.pyplot as plt
6 import seaborn as sns
7 sns.set_style('darkgrid')
8 from sklearn.preprocessing import
9 LabelEncoder
10 import warnings
11 warnings.filterwarnings('ignore')
12 import os
13 import plotly.graph_objs as go
14 import joblib
15 import itertools
16 from sklearn.metrics import
roc_auc_score,roc_curve
17
from sklearn.model_selection import
18
train_test_split,
19 RandomizedSearchCV,
20 GridSearchCV,StratifiedKFold
22 StandardScaler, MinMaxScaler
23 from sklearn.linear_model import
LogisticRegression
24
from sklearn.naive_bayes import
25
GaussianNB
26 from sklearn.tree import
27 DecisionTreeClassifier
28 from sklearn.svm import SVC
29 from sklearn.ensemble import
RandomForestClassifier,
30
ExtraTreesClassifier
31
from sklearn.neighbors import
32 KNeighborsClassifier
34 AdaBoostClassifier,
35 GradientBoostingClassifier
36 from xgboost import XGBClassifier
37 from sklearn.neural_network import
MLPClassifier
38
from sklearn.linear_model import
39
SGDClassifier
40
from sklearn.preprocessing import
41 StandardScaler, \
LabelEncoder, OneHotEncoder
from sklearn.metrics import
confusion_matrix, accuracy_score,
recall_score, precision_score
classification_report, f1_score,
plot_confusion_matrix
from catboost import
CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import
SMOTE
learning_curve
from mlxtend.plotting import
plot_decision_regions
The code is a typical import statement in Python. It is used to import

various libraries and modules that are required for a data analysis or
machine learning project. Here is an explanation of each import
statement:
import numpy as np: It imports the NumPy
library and assigns it the alias np. NumPy is
a fundamental package for scientific
computing in Python, providing support for
large, multi-dimensional arrays and
matrices, along with a collection of
mathematical functions to operate on these
arrays.
import pandas as pd: It imports the Pandas
library and assigns it the alias pd. Pandas is
a powerful library for data manipulation
and analysis. It provides data structures
such as DataFrame for efficient handling of
structured data.
import matplotlib: It imports the Matplotlib
library, which is a plotting library for
creating static, animated, and interactive
visualizations in Python.
import matplotlib.pyplot as plt: It imports
the pyplot module from Matplotlib and
assigns it the alias plt. The pyplot module
provides a collection of functions for
creating and customizing plots.
import seaborn as sns: It imports the
Seaborn library, which is a data
visualization library built on top of
Matplotlib. Seaborn provides a high-level
interface for creating informative and
attractive statistical graphics.
sns.set_style('darkgrid'): It sets the default
style of Seaborn plots to 'darkgrid', which
displays a dark background grid.
LabelEncoder: It imports the LabelEncoder
class from the preprocessing module of the
scikit-learn library. The LabelEncoder is
used to encode categorical variables into
numerical values.
import warnings: It imports the warnings
module, which provides control over
warning messages in Python.
warnings.filterwarnings('ignore'): It sets
the filter mode of warnings to 'ignore',
suppressing warning messages from being
displayed.
import os: It imports the os module, which
provides a way to use operating system-
dependent functionality in Python.
import plotly.graph_objs as go: It imports
the graph_objs module from the Plotly
library and assigns it the alias go. Plotly is
a graphing library that allows interactive
plotting and data visualization.
import joblib: It imports the joblib module,
which provides utilities for saving and
loading Python objects (e.g., models) in a
serialized format.
import itertools: It imports the itertools
module, which provides functions for
creating iterators and combining them in
various ways.
roc_auc_score,roc_curve: It imports the
roc_auc_score and roc_curve functions
from the metrics module of scikit-learn.
These functions are used for evaluating the
performance of binary classification
models.
train_test_split, RandomizedSearchCV,
GridSearchCV, StratifiedKFold: It imports
the train_test_split, RandomizedSearchCV,
GridSearchCV, and StratifiedKFold
classes/functions from the model_selection
module of scikit-learn. These are used for
splitting the data into train and test sets,
performing hyperparameter tuning, and
creating stratified folds for cross-validation.
StandardScaler, MinMaxScaler: It imports
the StandardScaler and MinMaxScaler
classes from the preprocessing module of
scikit-learn. These classes are used for
standardizing and scaling numerical
features.
LogisticRegression: It imports the
LogisticRegression class from the
linear_model module of scikit-learn.
LogisticRegression is a class for logistic
regression classification models.
from sklearn.naive_bayes import
GaussianNB: It imports the GaussianNB
class from the naive_bayes module of scikit-
learn. GaussianNB is a class for Gaussian
Naive Bayes classification models.
from sklearn.tree import
DecisionTreeClassifier: It imports the
DecisionTreeClassifier class from the tree
module of scikit-learn.
DecisionTreeClassifier is a class for
decision tree classification models.
from sklearn.svm import SVC: It imports
the SVC class from the svm module of
scikit-learn. SVC is a class for Support
Vector Classifier models.
from sklearn.ensemble import
RandomForestClassifier,
ExtraTreesClassifier: It imports the
RandomForestClassifier and
ExtraTreesClassifier classes from the
ensemble module of scikit-learn. These are
classes for random forest and extra trees
classification models, respectively.
from sklearn.neighbors import
KNeighborsClassifier: It imports the
KNeighborsClassifier class from the
neighbors module of scikit-learn.
KNeighborsClassifier is a class for k-
nearest neighbors classification models.
AdaBoostClassifier,
GradientBoostingClassifier: It imports the
AdaBoostClassifier and
GradientBoostingClassifier classes from the
ensemble module of scikit-learn. These are
classes for AdaBoost and gradient boosting
classification models, respectively.
from xgboost import XGBClassifier: It
imports the XGBClassifier class from the
XGBoost library. XGBClassifier is a class for
XGBoost classification models.
from sklearn.neural_network import
MLPClassifier: It imports the MLPClassifier
class from the neural_network module of
scikit-learn. MLPClassifier is a class for
multi-layer perceptron classification
models.
SGDClassifier: It imports the SGDClassifier
class from the linear_model module of
scikit-learn. SGDClassifier is a class for
stochastic gradient descent classification
models.
StandardScaler, LabelEncoder,
OneHotEncoder: It imports the
StandardScaler, LabelEncoder, and
OneHotEncoder classes from the
preprocessing module of scikit-learn. These
classes are used for feature scaling, label
encoding, and one-hot encoding,
respectively.
recall_score, precision_score: It imports the
recall_score, and precision_score functions
from the metrics module of scikit-learn.
These functions are used for evaluating
classification model performance.
classification_report, f1_score,
plot_confusion_matrix: It imports the
classification_report, f1_score, and
plot_confusion_matrix functions from the
metrics module of scikit-learn. These
functions are used for generating
classification reports, calculating F1
scores, and plotting confusion matrices.
from catboost import CatBoostClassifier: It
imports the CatBoostClassifier class from
the CatBoost library. CatBoostClassifier is a
class for gradient boosting with categorical
features support.
from lightgbm import LGBMClassifier: It
imports the LGBMClassifier class from the
LightGBM library. LGBMClassifier is a
class for gradient boosting framework that
uses tree-based learning algorithms.
SMOTE: It imports the SMOTE class from
the imbalanced-learn library. SMOTE is a
class for synthetic minority oversampling
technique used to address class imbalance
in machine learning.
learning_curve: It imports the
learning_curve function from the
model_selection module of scikit-learn.
learning_curve is used to generate learning
curves to assess model performance.
plot_decision_regions: It imports the
plot_decision_regions function from the
plotting module of the mlxtend library.
plot_decision_regions is used to visualize
decision boundaries of classifiers.
These import statements allow the code to use functions, classes, and
modules from the respective libraries and modules in the subsequent
code.
Step Read dataset:

4
1 #Reads dataset
2 curr_path = os.getcwd()
3 df =
4 pd.read_csv(curr_path+"/hypothyroid.csv")
5 print(df.iloc[:,0:7].head().to_string())
print(df.iloc[:,21:31].head().to_string())
Output:
age sex on thyroxine query on thyroxine on antithyroid
medication sick pregnant
0 41 F f f
f f f
1 23 F f f
f f f
2 46 M f f
f f f
3 70 F t f
f f f
4 70 F f f
f f f
thyroid surgery I131 treatment query hypothyroid query

hyperthyroid lithium
0 f f f
f f
1 f f f
f f
2 f f f
f f
3 f f f
f f
4 f f f
f f
goitre tumor hypopituitary psych TSH measured TSH T3

measured T3 TT4 measured
0 f f f f t 1.3
t 2.5 t
1 f f f f t 4.1
t 2 t
2 f f f f t 0.98
f ? t
3 f f f f t 0.16
t 1.9 t
4 f f f f t 0.72
t 1.2 t
TT4 T4U measured T4U FTI measured FTI TBG measured TBG
referral source binaryClass
0 125 t 1.14 t 109 f ?
SVHC P
1 102 f ? f ? f ?
other P
2 109 t 0.91 t 120 f ?
other P
3 175 f ? f ? f ?
other P
4 61 t 0.87 t 70 f ?
SVI P
Here are the steps involved in the code:

1. curr_path = os.getcwd(): This line assigns
the current working directory to the
variable curr_path. The os.getcwd()
function returns the path of the current
working directory.
2. df =
pd.read_csv(curr_path+"/hypothyroid.csv"):
This line reads the CSV file named
"hypothyroid.csv" located in the current
working directory using the pd.read_csv()
function from the Pandas library. It assigns
the resulting DataFrame to the variable df.
3. print(df.iloc[:,0:7].head().to_string()): This
line prints the first 5 rows of columns 0 to 6
(inclusive) of the DataFrame df. The iloc
indexer is used to select rows and columns
by their integer positions. The head()
function returns the first 5 rows, and
to_string() is used to print the DataFrame
as a string.
4. print(df.iloc[:,7:12].head().to_string()): This
line prints the first 5 rows of columns 7 to
11 (inclusive) of the DataFrame df.
5. print(df.iloc[:,12:21].head().to_string()):
This line prints the first 5 rows of columns
12 to 20 (inclusive) of the DataFrame df.
6. print(df.iloc[:,21:31].head().to_string()):
This line prints the first 5 rows of columns
21 to 30 (inclusive) of the DataFrame df.
Overall, these steps read a CSV file named "hypothyroid.csv", located
in the current working directory, into a DataFrame. Then, it prints the
first 5 rows of different subsets of columns in the DataFrame for data
exploration or inspection purposes.
Step Check the shape of dataset:

5
1 #Checks shape
2 print(df.shape)
Output:
(3772, 30)
The code print(df.shape) is used to check the shape of the DataFrame

df.
The shape attribute of a DataFrame returns a tuple representing the

dimensions of the DataFrame, where the first element of the tuple
indicates the number of rows and the second element indicates the
number of columns.
By printing df.shape, you will see the output as a tuple containing the
number of rows and columns in the DataFrame, giving you an idea of
the size of the dataset.
The output (3772, 30) indicates that the DataFrame df has 3772 rows
and 30 columns.
This information helps you understand the size and structure of the
dataset. There are 3772 instances or observations in the dataset, and
each instance has 30 different features or variables associated with it.
Step Read every column in dataset:

6
1 #Reads columns
2 print("Data Columns -->
",df.columns)
Output:
Data Columns --> Index(['age', 'sex', 'on thyroxine',
'query on thyroxine', 'on antithyroid medication', 'sick',
'pregnant', 'thyroid surgery', 'I131 treatment', 'query
hypothyroid', 'query hyperthyroid', 'lithium', 'goitre',
'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured',
'T4U',
'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral
source',
'binaryClass'], dtype='object')
The code print("Data Columns --> ",df.columns) is used to print the
column names of the DataFrame df.
The columns attribute of a DataFrame returns a pandas Index object

that contains the names of all the columns in the DataFrame.
By printing df.columns, you will see the output as the column names of
the DataFrame, which provides a list of the feature names or variables
present in the dataset.
This output lists all the column names of the DataFrame df. Each
column name represents a specific feature or variable present in the
dataset.
Here is an explanation of each column in the "hypothyroid" dataset:
'age': Represents the age of the patient.
'sex': Indicates the sex of the patient (e.g.,
male or female).
'on thyroxine': Indicates whether the
patient is on thyroxine medication or not.
'query on thyroxine': Indicates whether
there is a query or doubt about the patient
being on thyroxine medication.
'on antithyroid medication': Indicates
whether the patient is on antithyroid
medication or not.
'sick': Indicates whether the patient is
currently sick or not.
'pregnant': Indicates whether the patient is
pregnant or not.
'thyroid surgery': Indicates whether the
patient has undergone thyroid surgery or
not.
'I131 treatment': Indicates whether the
patient has received I131 treatment or not.
'query hypothyroid': Indicates whether
having hypothyroidism.
'query hyperthyroid': Indicates whether
having hyperthyroidism.
'lithium': Indicates whether the patient has
taken lithium medication or not.
'goitre': Indicates whether the patient has
goitre (enlargement of the thyroid gland) or
not.
'tumor': Indicates whether the patient has a
thyroid tumor or not.
'hypopituitary': Indicates whether the
patient has hypopituitarism (reduced
hormone production by the pituitary gland)
or not.
'psych': Indicates whether the patient has a
psychological disorder or not.
'TSH measured': Indicates whether the
Thyroid Stimulating Hormone (TSH) level
has been measured or not.
'TSH': Represents the Thyroid Stimulating
Hormone (TSH) level.
'T3 measured': Indicates whether the
triiodothyronine (T3) level has been
measured or not.
'T3': Represents the triiodothyronine (T3)
level.
'TT4 measured': Indicates whether the total
thyroxine (TT4) level has been measured or
not.
'TT4': Represents the total thyroxine (TT4)
level.
'T4U measured': Indicates whether the
thyroxine uptake (T4U) has been measured
or not.
'T4U': Represents the thyroxine uptake
(T4U) level.
'FTI measured': Indicates whether the Free
Thyroxine Index (FTI) has been measured
or not.
'FTI': Represents the Free Thyroxine Index
(FTI) value.
'TBG measured': Indicates whether the
Thyroxine-Binding Globulin (TBG) level has
been measured or not.
'TBG': Represents the Thyroxine-Binding
Globulin (TBG) level.
'referral source': Indicates the source or
method of referral for the patient.
'binaryClass': Represents the binary
classification label indicating whether the
patient has hypothyroidism (1) or not (0).
These columns contain various features and attributes related to
patients and thyroid-related measurements, which can be used to
analyze and predict hypothyroidism.
Information of Dataset
Step Check the information of dataset:

1
1 #Checks dataset information

2 print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 30 columns):
# Column Non-Null
Count Dtype
--- ------ --------
------ -----
0 age 3772
non-null object
1 sex 3772
non-null object
2 on thyroxine 3772
non-null object
3 query on thyroxine 3772
non-null object
4 on antithyroid medication 3772
non-null object
5 sick 3772
non-null object
6 pregnant 3772
non-null object
7 thyroid surgery 3772
non-null object
8 I131 treatment 3772
non-null object
9 query hypothyroid 3772
non-null object
10 query hyperthyroid 3772
non-null object
11 lithium 3772
non-null object
12 goitre 3772
non-null object
13 tumor 3772
non-null object
14 hypopituitary 3772
non-null object
15 psych 3772
non-null object
16 TSH measured 3772
non-null object
17 TSH 3772
non-null object
18 T3 measured 3772
non-null object
19 T3 3772
non-null object
20 TT4 measured 3772
non-null object
21 TT4 3772
non-null object
22 T4U measured 3772
non-null object
23 T4U 3772
non-null object
24 FTI measured 3772
non-null object
25 FTI 3772
non-null object
26 TBG measured 3772
non-null object
27 TBG 3772
non-null object
28 referral source 3772
non-null object
29 binaryClass 3772
non-null object
dtypes: object(30)
memory usage: 884.2+ KB
None
The code print(df.info()) is used to display

information about the DataFrame df, including
the data types and memory usage.
The info() function provides a concise summary

of the DataFrame, showing the following
information for each column:
The column name
The number of non-null
values
The data type of the
column
The memory usage
By printing df.info(), you will see a summary of
the dataset, which includes the column names,
the number of non-null values in each column,
the data type of each column, and the overall
memory usage of the DataFrame. This
information is helpful for understanding the
structure and integrity of the dataset.
The output of print(df.info()) provides

information about the DataFrame df. Here's a
breakdown of the information displayed:
1. The DataFrame has a

RangeIndex with 3772
entries, ranging from 0 to
3771. This indicates that
the DataFrame has 3772
rows of data.
2. There are 30 columns in
total, each representing a
different feature or
attribute.
3. For each column, the
following information is
displayed:
The column index
number
The column name
The number of non-null
values present in the
column
The data type of the
column (object in this
case, which usually
represents strings or
mixed data types)
4. The memory usage of the
DataFrame is shown as
884.2+ KB.
In summary, the output indicates that the
DataFrame consists of 3772 rows and 30
columns, with all columns being of the object
data type. It's worth noting that the object data
type can include various types of data, such as
strings or a mix of different data types.
Checking Unique Values
Step Checks count, unique values, and frequency:

1
1 #Checks count, unique values, and

2 frequency
print(df.describe().T)
Output:
count unique
top freq
age 3772 94
59 95
sex 3772 3
F 2480
on thyroxine 3772 2
f 3308
query on thyroxine 3772 2
f 3722
on antithyroid medication 3772 2
f 3729
sick 3772 2
f 3625
pregnant 3772 2
f 3719
thyroid surgery 3772 2
f 3719
I131 treatment 3772 2
f 3713
query hypothyroid 3772 2
f 3538
query hyperthyroid 3772 2
f 3535
lithium 3772 2
f 3754
goitre 3772 2
f 3738
tumor 3772 2
f 3676
hypopituitary 3772 2
f 3771
psych 3772 2
f 3588
TSH measured 3772 2
t 3403
TSH 3772 288
? 369
T3 measured 3772 2
t 3003
T3 3772 70
? 769
TT4 measured 3772 2
t 3541
TT4 3772 242
? 231
T4U measured 3772 2
t 3385
T4U 3772 147
? 387
FTI measured 3772 2
t 3387
FTI 3772 235
? 385
TBG measured 3772 1
f 3772
TBG 3772 1
? 3772
referral source 3772 5
other 2201
binaryClass 3772 2
P 3481
The output shows the count, unique values,

most frequent value, and its frequency for each
column in the DataFrame. Here's an
explanation of the output:
'count': Represents the
number of non-null values
for each column.
'unique': Indicates the
count of unique values for
each column.
'top': Displays the most
frequent value in each
column.
'freq': Represents the
frequency of the most
column.
For example, let's consider the first few
columns in the output:
'age': There are 94 unique
values in the 'age' column.
The most frequent value is
'59', which appears 95
times.
'sex': There are 3 unique
values in the 'sex' column:
'F', 'M', and an unknown
value. The most frequent
value is 'F', which appears
2480 times.
'on thyroxine': There are 2
unique values: 'f' and 't'.
The most frequent value is
'f', which appears 3308
times.
'query on thyroxine': There
are 2 unique values: 'f' and
't'. The most frequent value
is 'f', which appears 3722
times.
The output provides insights into the unique
values and their frequencies in each column of
the DataFrame.
Analyzing the output of the unique values in

each column, we can draw the following
conclusions:
'age': There are 94 unique
age values in the dataset,
ranging from different
ages. The most frequent
age is '59', which appears
95 times.
'sex': The 'sex' column has
3 unique values: 'F', 'M',
and an unknown value. The
most frequent value is 'F'
(female), which appears
2480 times. This indicates
that the dataset contains a
majority of female patients.
Binary Columns (e.g., 'on
thyroxine', 'query on
thyroxine', etc.): These
columns have 2 unique
values ('f' and 't')
representing binary
attributes. The most
column is 'f' (false),
indicating that the majority
of patients do not exhibit
the corresponding
attribute.
Thyroid Hormone
Measurement Columns
(e.g., 'TSH measured', 'T3
measured', etc.): These
columns indicate whether
the corresponding hormone
measurement was taken or
not. They have 2 unique
values ('t' and 'f'). The most
frequent value for each
column is 't' (true),
indicating that the majority
of patients had their
thyroid hormone levels
measured.
Missing Values: Some
columns have the value '?'
which indicates missing
data. For example, the
'TSH' column has 288
unique values, but 369
instances have missing
values denoted by '?'.
Similarly, other columns
like 'TT4', 'T4U', and 'FTI'
also have missing values.
'TBG measured' and 'TBG'
Columns: These columns
have only one unique value,
indicating that all instances
have the same value. As a
result, these columns do
not provide any useful
information for analysis.
'referral source': This
column has 5 unique values
representing different
sources of patient referral.
The most frequent source is
'other', which appears 2201
times.
'binaryClass': This column
represents the binary
classification label,
indicating whether the
patient has hypothyroidism
('P') or not ('N'). The most
frequent value is 'P'
(positive), which appears
3481 times.
In conclusion, the analysis of the unique values
in each column provides insights into the
distribution, frequency, and characteristics of
the dataset. It helps us understand the
representation of categorical variables, the
prevalence of certain attributes, and the
presence of missing values. This information is
valuable for data preprocessing, handling
missing values, and gaining a better
understanding of the dataset before performing
further analysis or modeling.
Checking Null Values
Step Checks null values in each column:

1
1 #Checks null values

2 df=df.replace({"?":np.NAN})
3 print(df.isnull().sum())
4 print('Total number of null values:
', df.isnull().sum().sum())
Output:
age 1
sex 150
on thyroxine 0
query on thyroxine 0
on antithyroid medication 0
sick 0
pregnant 0
thyroid surgery 0
I131 treatment 0
query hypothyroid 0
query hyperthyroid 0
lithium 0
goitre 0
tumor 0
hypopituitary 0
psych 0
TSH measured 0
TSH 369
T3 measured 0
T3 769
TT4 measured 0
TT4 231
T4U measured 0
T4U 387
FTI measured 0
FTI 385
TBG measured 0
TBG 3772
referral source 0
binaryClass 0
dtype: int64
Total number of null values: 6064
Here are the steps involved in the code you

provided:
df=df.replace({"?":np.NAN}):
This line replaces all
occurrences of the string "?"
in the DataFrame df with
np.NAN, which represents a
missing value in NumPy.
This step is performed to
standardize missing values
across the dataset.
print(df.isnull().sum()): This
line calculates the sum of
null values for each column
in the DataFrame using the
isnull().sum() function. The
isnull() function identifies
missing values in the
DataFrame, and sum()
calculates the total count of
null values for each column.
print('Total number of null
values: ',
df.isnull().sum().sum()): This
line calculates the total
number of null values in the
entire DataFrame by taking
the sum of null values across
all columns using
df.isnull().sum().sum().
In summary, these steps check and handle null
values in the DataFrame. The code replaces the
string "?" with np.NAN to represent missing
values consistently. Then, it prints the count of
null values for each column and the total number
of null values in the entire DataFrame. This
information helps in understanding the extent of
missing data and can guide further data
preprocessing steps.
Converting Six Columns into Numerical
Step Convert six columns into numerical:

1
1 #Converts six columns into numerical

2 num_cols =
['age','FTI','TSH','T3','TT4','T4U']
3 for i in num_cols:
4 df[i] = pd.to_numeric(df[i])
6 print(df.info())
Output:
# Column Non-Null Count
Dtype
--- ------ --------------
-----
0 age 3771 non-null
float64
1 sex 3622 non-null
object
2 on thyroxine 3772 non-null
object
3 query on thyroxine 3772 non-null
object
4 on antithyroid medication 3772 non-null
object
5 sick 3772 non-null
object
6 pregnant 3772 non-null
object
7 thyroid surgery 3772 non-null
object
8 I131 treatment 3772 non-null
object
9 query hypothyroid 3772 non-null
object
10 query hyperthyroid 3772 non-null
object
11 lithium 3772 non-null
object
12 goitre 3772 non-null
object
13 tumor 3772 non-null
object
14 hypopituitary 3772 non-null
object
15 psych 3772 non-null
object
16 TSH measured 3772 non-null
object
17 TSH 3403 non-null
float64
18 T3 measured 3772 non-null
object
19 T3 3003 non-null
float64
20 TT4 measured 3772 non-null
object
21 TT4 3541 non-null
float64
22 T4U measured 3772 non-null
object
23 T4U 3385 non-null
float64
24 FTI measured 3772 non-null
object
25 FTI 3387 non-null
float64
26 TBG measured 3772 non-null
object
27 TBG 0 non-null
float64
28 referral source 3772 non-null
object
29 binaryClass 3772 non-null
object
dtypes: float64(7), object(23)
None
The code converts six columns, namely 'age', 'FTI',

'TSH', 'T3', 'TT4', and 'T4U', into numerical data types.
Here are the steps involved:
1. num_cols =
['age','FTI','TSH','T3','TT4','T4U']:
This line creates a list named
num_cols that contains the names
of the columns to be converted
into numerical data types.
2. for i in num_cols: df[i] =
pd.to_numeric(df[i]): This loop
iterates over each column name
in num_cols. Inside the loop, the
pd.to_numeric() function is used
to convert the values in each
column to numeric data types.
This conversion allows for
mathematical operations and
analysis on these columns.
3. print(df.info()): This line prints
the updated information about
the DataFrame df using the info()
function. This provides details
about the data types of each
column after the conversion.
Overall, these steps convert the specified columns into
numerical data types and then display the updated
information about the DataFrame. This conversion is
useful when you need to perform numerical
computations or analysis on the values in these
columns.
The output shows the updated information about the

DataFrame df after converting the specified columns to
numerical data types. Here's an explanation of the
output:
The DataFrame still has 30
columns, but now it contains 7
columns with a float64 data type
(representing numerical values)
and 23 columns with an object
data type (representing non-
numeric values).
The 'age' column has been
successfully converted to a
float64 data type, and it has 3771
non-null values. There is one
missing value in this column.
The 'TSH', 'T3', 'TT4', 'T4U', and
'FTI' columns have also been
converted to float64 data types.
However, they contain a different
number of non-null values,
indicating missing values in these
columns.
The 'TBG' column has all missing
values (NaN) and is not useful for
analysis since it contains no valid
data.
The 'sex' column still has an
object data type and has missing
values, indicated by a non-null
count less than the total number
of entries (3772).
The 'referral source' and
'binaryClass' columns remain as
object data types.
The memory usage of the
DataFrame remains the same as
before the conversion.
In summary, the output provides an updated view of the
DataFrame, reflecting the conversion of the specified
columns to numerical data types. It also shows the
presence of missing values in some columns, which may
need to be addressed during further data preprocessing
steps.
Deleting Irrelevant Columns
Step Delete irrelevant columns: TBG and referral

1 source. Then, handle missing values. Use mode
imputation for object features: sex and T4U
measured. Use mean value for age feature and use
SimpleImputer for TSH, T3, TT4, T4U, and FTI
features:
1 #Deletes irrelevant columns

2 df.drop(['TBG','referral
3 source'],axis=1,inplace=True)
4
5 #Handles missing values
6 #Uses mode imputation for all other
7 categorical features
8 def mode_imputation(feature):
9 mode=df[feature].mode()[0]
10
df[feature]=df[feature].fillna(mode)
11
12
for col in ['sex', 'T4U measured']:
13
mode_imputation(col)
14
15
df['age'].fillna(df['age'].mean(),
16
inplace=True)
17
18
from sklearn.impute import
19 SimpleImputer
20 imputer =
21 SimpleImputer(strategy='mean')
22 df['TSH'] =
23 imputer.fit_transform(df[['TSH']])
24 df['T3'] =
imputer.fit_transform(df[['T3']])
25
df['TT4'] =
26
imputer.fit_transform(df[['TT4']])
27
df['T4U'] =
28 imputer.fit_transform(df[['T4U']])
df['FTI'] =
imputer.fit_transform(df[['FTI']])
#Checks each column datatype

print(df.dtypes)
#Checks null values

print(df.isnull().sum())
print('Total number of null values:
', df.isnull().sum().sum())
Output:
age float64
sex object
on thyroxine object
query on thyroxine object
on antithyroid medication object
sick object
pregnant object
thyroid surgery object
I131 treatment object
query hypothyroid object
query hyperthyroid object
lithium object
goitre object
tumor object
hypopituitary object
psych object
TSH measured object
TSH float64
T3 measured object
T3 float64
TT4 measured object
TT4 float64
T4U measured object
T4U float64
FTI measured object
FTI float64
TBG measured object
binaryClass object
dtype: object
age 0
sex 0
on thyroxine 0
query on thyroxine 0
on antithyroid medication 0
sick 0
pregnant 0
thyroid surgery 0
I131 treatment 0
query hypothyroid 0
query hyperthyroid 0
lithium 0
goitre 0
tumor 0
hypopituitary 0
psych 0
TSH measured 0
TSH 0
T3 measured 0
T3 0
TT4 measured 0
TT4 0
T4U measured 0
T4U 0
FTI measured 0
FTI 0
TBG measured 0
binaryClass 0
dtype: int64
Total number of null values: 0
Here are the steps involved in the code:

1. df.drop(['TBG','referral
source'],axis=1,inplace=True):
This line drops the 'TBG' and
'referral source' columns from
the DataFrame df. The drop()
function is used to remove
these columns, and the
inplace=True parameter
ensures that the changes are
made to the DataFrame itself.
2. Missing value handling:
mode_imputation()
function: This function
performs mode imputation
for categorical features. It
takes a feature name as
input, finds the mode of
that feature in the
DataFrame, and replaces
missing values in that
feature with the mode
value.
Loop for mode imputation:
The loop iterates over the
'sex' and 'T4U measured'
columns, calling the
mode_imputation()
function to handle missing
values in each column.
'age' column: The missing
values in the 'age' column
are filled with the mean
value of the column using
the fillna() function.
SimpleImputer for
numerical columns: The
SimpleImputer class from
scikit-learn is used to
handle missing values in
the 'TSH', 'T3', 'TT4', 'T4U',
and 'FTI' columns. It
replaces the missing values
with the mean value of
each column using the
fit_transform() method.
3. print(df.dtypes): This line
prints the data types of each
column in the DataFrame
using the dtypes attribute. It
helps to verify the updated
data types after handling
missing values.
4. print(df.isnull().sum()): This
line calculates the sum of null
values for each column in the
DataFrame using the
isnull().sum() function. It
displays the count of null
values in each column.
5. print('Total number of null
values: ',
df.isnull().sum().sum()): This
line calculates the total
entire DataFrame by taking
the sum of null values across
all columns using
df.isnull().sum().sum().
In summary, these steps involve dropping irrelevant
columns, handling missing values through mode
imputation and mean imputation, checking the
updated data types, and verifying the count of null
values in the DataFrame. These actions are crucial
for data preprocessing and ensuring data quality
before further analysis or modeling.
The output shows the updated information about

the DataFrame df after dropping irrelevant columns
and handling missing values. Here's an explanation
of the output:
Data Types: The data types of
each column are displayed.
The 'age' column is of type
float64, while the remaining
columns are of type object.
Null Values: The count of null
values for each column is
shown. After handling missing
values, there are no null
values remaining in any
column.
Total Null Values: The total
entire DataFrame is 0.
In summary, the output confirms that the irrelevant
columns have been dropped successfully, missing
values have been handled through mode imputation
and mean imputation, and there are no null values
remaining in the DataFrame. This ensures that the
DataFrame is now ready for further analysis or
modeling.
Statistical Description
Step Look at statistical description of numerical

1 columns:
1 #Looks at statistical description of

2 data
print(df.describe().to_string())
Output:
age TSH
T3 TT4 T4U
FTI
count 3772.000000 3772.000000
3772.000000 3772.000000 3772.000000
3772.000000
mean 51.735879 5.086766
2.013500 108.319345 0.995000
110.469649
std 20.082295 23.290853
0.738262 34.496511 0.185156
31.355087
min 1.000000 0.005000
0.050000 2.000000 0.250000
2.000000
25% 36.000000 0.600000
1.700000 89.000000 0.890000
94.000000
50% 54.000000 1.600000
2.013500 106.000000 0.995000
110.000000
75% 67.000000 3.800000
2.200000 123.000000 1.070000
121.250000
max 455.000000 530.000000
10.600000 430.000000 2.320000
395.000000
The code print(df.describe().to_string()) is used

to generate a statistical description of the
DataFrame df and display it as a formatted
string.
The describe() function computes various

summary statistics for numerical columns in the
DataFrame, such as count, mean, standard
deviation, minimum value, quartiles, and
maximum value. By calling .to_string() on the
result, the output is formatted as a string for
better readability when printing.
Here's an explanation of the output:

The output provides a
statistical summary for
each numerical column in
the DataFrame.
Each column is represented
by its name at the leftmost
side of the output.
The statistical metrics
displayed include count,
mean, standard deviation
(std), minimum value
(min), quartiles (25%, 50%,
and 75%), and maximum
value (max).
Analyzing the output of the statistical
description, we can draw the following
conclusions:
'age': The age column
ranges from 1 to 455 years,
with a mean age of
approximately 51.74. The
age distribution seems
reasonable, with no
extreme outliers.
'TSH': The TSH (Thyroid-
Stimulating Hormone)
column ranges from 0.005
to 530. The mean TSH
value is approximately
5.09, with a relatively high
standard deviation of
23.29. This indicates a wide
range of TSH values,
including some potentially
extreme values.
'T3': The T3 column ranges
from 0.05 to 10.6. The
mean T3 value is
approximately 2.01, with a
standard deviation of 0.74.
The distribution seems to
be relatively tight, centered
around the mean.
'TT4': The TT4 (Total
Thyroxine) column ranges
from 2 to 430. The mean
TT4 value is approximately
108.32, with a standard
deviation of 34.50. The TT4
values appear to have a
moderate spread.
'T4U': The T4U (Total
Thyroxine Uptake) column
ranges from 0.25 to 2.32.
The mean T4U value is
approximately 0.995, with a
standard deviation of 0.19.
The T4U values seem to
have a narrow distribution.
'FTI': The FTI (Free
Thyroxine Index) column
ranges from 2 to 395. The
mean FTI value is
approximately 110.47, with
a standard deviation of
31.36. The FTI values
exhibit moderate
variability.
In conclusion, the statistical description
provides insights into the distribution and
spread of the numerical columns in the dataset.
It helps in understanding the central tendency,
variability, and potential outliers in these
variables. The analysis suggests that some
columns have a wide range of values (e.g.,
'TSH'), while others have a more concentrated
distribution (e.g., 'T3' and 'T4U'). These
observations can guide further data exploration
and analysis to understand the patterns and
relationships within the dataset.
Limiting Age
Step Look at the maximum number of age column. It

1 is impossible to have 455 value for age. Clean
the column:
1 #Cleans age column

2 for i in range(df.shape[0]):
3 if df.age.iloc[i] > 100.0:
4 df.age.iloc[i] = 100.0
5
6 df['age'].describe()
The code cleans the 'age' column in the

DataFrame df by setting a maximum value of
100 for any age that exceeds 100. Here's an
explanation of the code:
1. for i in range(df.shape[0])::
This loop iterates over the
indices of the DataFrame,
ranging from 0 to the
number of rows (shape[0]).
2. if df.age.iloc[i] > 100.0::
This condition checks if the
age value at index i is
greater than 100.
3. df.age.iloc[i] = 100.0: If the
condition is true, the age
value at index i is replaced
with the value 100. This
ensures that any age value
exceeding 100 is capped at
100.
4. df['age'].describe(): This
line calculates the
statistical description of the
'age' column after the
cleaning operation. It
provides summary statistics
such as count, mean,
standard deviation,
minimum, quartiles, and
maximum.
By running this code, you will obtain the
statistical description of the 'age' column after
cleaning. This will allow you to verify the effect
of the cleaning process and observe any
changes in the data distribution and summary
statistics.
Distribution of Samples
Step Print the total distribution:

1
1 #Prints the total distribution

2 print("Total Number of samples :
3 {}".format(df.shape[0]))
4 print("Total No.of Negative
Thyroid: {}".format(\
5
df[df.binaryClass ==
6
'N'].shape[0]))
print("Total No.of Positive Thyroid
: {}".format(\
df[df.binaryClass ==
'P'].shape[0]))
Output:
Total Number of samples : 3772
Total No.of Negative Thyroid: 291
Total No.of Positive Thyroid : 3481
The code prints the total distribution of the

samples in the DataFrame df based on the
'binaryClass' column. Here's an explanation of
the code:
1. print("Total Number of
samples :
{}".format(df.shape[0])):
This line prints the total
number of samples in the
DataFrame df. It uses the
shape[0] attribute to
retrieve the number of
rows in the DataFrame and
formats the output string
using the format() method.
2. print("Total No.of Negative
Thyroid:
{}".format(df[df.binaryClass
== 'N'].shape[0])): This
line prints the total number
of samples classified as
'Negative Thyroid'. It filters
the DataFrame using the
condition df.binaryClass
== 'N' to select rows where
the 'binaryClass' column
has a value of 'N'. It then
retrieves the number of
rows in the filtered
DataFrame using shape[0]
and formats the output
string.
3. print("Total No.of Positive
Thyroid :
{}".format(df[df.binaryClass
== 'P'].shape[0])): This line
prints the total number of
samples classified as
'Positive Thyroid'. It follows
the same approach as the
previous line but filters the
DataFrame based on the
condition df.binaryClass
== 'P' to select rows where
has a value of 'P'.
By running this code, you will obtain the total
distribution of samples in the DataFrame based
on the 'binaryClass' column. It provides the
total number of samples, the count of samples
classified as 'Negative Thyroid', and the count
of samples classified as 'Positive Thyroid'.
The output shows the total distribution of

samples in the DataFrame based on the
'binaryClass' column. Here's an explanation of
the output:
Total Number of samples:
The total number of
samples in the DataFrame
is 3772.
Total No.of Negative
Thyroid: The count of
'Negative Thyroid' is 291.
These are samples where
has a value of 'N'.
Total No.of Positive
Thyroid: The count of
'Positive Thyroid' is 3481.
These are samples where
has a value of 'P'.
In summary, there are 291 samples classified as
'Negative Thyroid' and 3481 samples classified
as 'Positive Thyroid' in the DataFrame. This
distribution provides an overview of the
imbalance between the two classes, which may
be important to consider when performing
further analysis or modeling tasks.
Step Plot the distribution of binaryClass (target variable) in dataset:

1
1 #Defines function to create pie chart and bar plot

2 as subplots
3 def plot_piechart(df, var, title=''):
4 plt.figure(figsize=(25, 10))
5 plt.subplot(121)
6 label_list =
list(df[var].value_counts().index)
7
8
9
df[var].value_counts().plot.pie(autopct="%1.1f%%",
10 \
11
12 colors=sns.color_palette("prism", 7), \
13 startangle=60,
14 labels=label_list, \
15 wedgeprops=
{"linewidth": 2, "edgecolor": "k"}, \
16
shadow=True,
17
textprops={'fontsize': 20}
18
)
19
plt.title("Distribution of " + var + "
20 variable " + title, fontsize=25)
21
22 value_counts = df[var].value_counts()
23 # Print percentage values
24 percentages = value_counts / len(df) * 100
25 print("Percentage values:")
26 print(percentages)
27
28 plt.subplot(122)
29 ax = df[var].value_counts().plot(kind="barh")
30
31 for i, j in
32 enumerate(df[var].value_counts().values):
33 ax.text(.7, i, j, weight="bold",
fontsize=20)
34
35
plt.title("Count of " + var + " cases " +
36
title, fontsize=25)
37
# Print count values
38
print("Count values:")
39
print(value_counts)
plt.show()
plot_piechart(df,'binaryClass')
Output:
P 3481
N 291
Name: binaryClass, dtype: int64
The result is shown in Figure 1.
Figure 1 The distribution of binaryClass (target variable)
The code defines a function named plot_piechart() that creates a pie chart and a
horizontal bar plot as subplots. Here are the steps involved in the code:
1. def plot_piechart(df, var, title=''):: This line defines
the function plot_piechart, which takes three
parameters: df (the DataFrame), var (the
variable/column to plot), and an optional title
parameter for the title of the plot.
2. plt.figure(figsize=(25, 10)): This line creates a new
figure with a specified size for the plot.
3. plt.subplot(121): This line creates the first subplot
in a 1x2 grid, selecting the first position.
4. label_list = list(df[var].value_counts().index): This
line retrieves the unique values of the var column
in the DataFrame and converts them to a list. It will
be used as labels for the pie chart.
5. df[var].value_counts().plot.pie(autopct="%1.1f%%",
colors=sns.color_palette("prism", 7),
startangle=60, labels=label_list, wedgeprops=
{"linewidth": 2, "edgecolor": "k"}, shadow=True,
textprops={'fontsize': 20}): This line plots the pie
chart using the plot.pie() function. It uses the
value_counts() method to count the occurrences of
each unique value in the var column. Other
parameters specify the autopct format for
percentage display, color palette, starting angle,
labels, wedge properties, shadow, and text font
size.
6. plt.title("Distribution of " + var + " variable " +
title, fontsize=25): This line sets the title of the pie
chart by concatenating the var and title
parameters.
7. value_counts = df[var].value_counts(): This line
calculates the count of each unique value in the var
column using value_counts() and assigns it to the
value_counts variable.
8. percentages = value_counts / len(df) * 100: This
line calculates the percentage values of each
unique value in the var column by dividing the
value_counts by the length of the DataFrame and
multiplying by 100.
9. print("Percentage values:") print(percentages):
This code prints the percentage values calculated
in the previous step.
10. plt.subplot(122): This line creates the second
subplot in the 1x2 grid, selecting the second
position.
11. ax = df[var].value_counts().plot(kind="barh"): This
line plots a horizontal bar plot using the
value_counts() method on the var column of the
DataFrame.
12. for i, j in enumerate(df[var].value_counts().values):
ax.text(.7, i, j, weight="bold", fontsize=20): This
code adds text annotations to the horizontal bar
plot, displaying the count values above each bar.
13. plt.title("Count of " + var + " cases " + title,
fontsize=25): This line sets the title of the
horizontal bar plot by concatenating the var and
title parameters.
14. print("Count values:") print(value_counts): This
code prints the count values calculated in step 7.
15. plt.show(): This line displays the plot with both
subplots.
Overall, the function plot_piechart() generates a pie chart to visualize the
distribution of a variable and a horizontal bar plot to display the count of each
category. It also prints the percentage values and count values for each category.
The function allows for customized titles and can be used to analyze the
distribution and count of different variables in the DataFrame.
Analyzing the output, we can draw the following conclusions:

The 'binaryClass' column represents the
classification of thyroid cases. The output shows
two categories: 'P' for Positive Thyroid and 'N' for
Negative Thyroid.
The count values indicate an imbalanced
distribution between the two classes. There are
3481 samples classified as 'Positive Thyroid' and
only 291 samples classified as 'Negative Thyroid'.
This suggests that the majority of cases in the
dataset are classified as 'Positive Thyroid'.
The significant difference in counts between the
two classes may have implications for modeling or
analysis tasks. Imbalanced classes can pose
challenges, such as bias in model predictions or
difficulties in detecting minority class patterns. It's
important to consider appropriate strategies to
address this class imbalance, such as resampling
techniques or using evaluation metrics that account
for imbalanced classes.
In summary, the analysis of the output reveals an imbalanced distribution in the
'binaryClass' column, with a majority of samples classified as 'Positive Thyroid'
and a minority classified as 'Negative Thyroid'. This understanding of class
distribution is essential when performing subsequent data analysis or building
classification models to ensure accurate representation and consideration of both
classes.
Distribution of All Features
Step Plot distribution of all features in the whole dataset:

1
1 # Looks at distribution of all

2 features in the whole original
dataset
3
columns = list(df.columns)
4
columns.remove('binaryClass')
5
plt.subplots(figsize=(35, 50))
6
length = len(columns)
7
color_palette =
8
sns.color_palette("Set3",
9 n_colors=length) # Define color
10 palette
11
12 for i, j in
13 itertools.zip_longest(columns,
range(length)):
14
plt.subplot((length // 2), 5, j
15
+ 1)
16
plt.subplots_adjust(wspace=0.2,
17 hspace=0.5)
18 ax = df[i].hist(bins=10,
19 edgecolor='black',
20 color=color_palette[j]) # Set
color for each histogram
21
for p in ax.patches:
22
ax.annotate(format(p.get_height(),
'.0f'), (p.get_x() + p.get_width()
/ 2., p.get_height()), ha='center',
va='center',
xytext=(0, 10), weight="bold",
fontsize=17, textcoords='offset
points')
plt.title(i, fontsize=30) #
Adjust title font size
plt.show()
Figure 2 The distribution of all features in the whole

dataset
Here's a step-by-step explanation of the code:

1. columns = list(df.columns): This
line creates a list of column
names from the DataFrame df.
2. columns.remove('binaryClass'):
This line removes the column
name 'binaryClass' from the
columns list since it is not
included in the plot.
3. plt.subplots(figsize=(35, 50)):
This line creates a figure with a
specified size (35 inches wide
and 50 inches tall) for the
subplots.
4. length = len(columns): This line
calculates the number of
columns/features in the
columns list.
5. color_palette =
n_colors=length): This line
creates a color palette using the
seaborn function
color_palette(). It uses the
"Set3" palette and sets the
number of colors to match the
length of the columns list.
6. for i, j in
itertools.zip_longest(columns,
range(length)):: This line
iterates over the columns list
and generates a corresponding
index value using
range(length). It uses
itertools.zip_longest() to ensure
that the iteration continues
until the longer of the two
iterables is exhausted.
7. plt.subplot((length // 2), 5, j +
1): This line creates a subplot
within the larger figure grid. It
calculates the number of rows
in the grid as (length // 2) and
sets the number of columns to
5. The j + 1 value represents
the current subplot position.
8. plt.subplots_adjust(wspace=0.2,
hspace=0.5): This line adjusts
the spacing between subplots
horizontally and vertically to
improve readability.
9. ax = df[i].hist(bins=10,
edgecolor='black',
color=color_palette[j]): This
line plots a histogram for the
current column (i) using the
hist() function on the
corresponding column in the
DataFrame df. It sets the
number of bins to 10, the edge
color to 'black', and the color of
the histogram bars to the
corresponding color from the
color_palette.
10. for p in ax.patches:: This line
iterates over the bars of the
histogram.
11. ax.annotate(...): This line adds
text annotations to each bar of
the histogram, displaying the
count value (p.get_height()) at
the center of the bar.
12. plt.title(i, fontsize=30): This
line sets the title of the current
subplot to the column name (i)
with a font size of 30.
13. plt.show(): This line displays
the plot with all the subplots.
The code generates a set of subplots, each
representing the distribution of a specific
feature/column from the original dataset. The
histograms are color-coded using the defined color
palette, and each subplot has a title with the
corresponding column name, set to a font size of 30.
The annotations on the histogram bars display the
count values.
Distribution of Age versus Target Variable
Step Define another_versus_label() method to plot the

1 distribution of a feature against label feature:
1 from tabulate import tabulate
2 # Looks at another feature
3 distribution by binaryClass feature
4 def another_versus_label(feat):
5 fig, (ax1, ax2) =
plt.subplots(nrows=2, ncols=1,
6
figsize=(25, 15))
7
plt.subplots_adjust(wspace=0.5,
8 hspace=0.25)
9
10 # Define color palette
11 colors =
12 sns.color_palette("Set2")
13
14 df[df['binaryClass'] == "N"]
15 [feat].plot(ax=ax1, kind='hist',
bins=10, edgecolor='black',
16
color=colors[0])
17
ax1.set_title('Negative
18 Thyroid', fontsize=25)
19 ax1.set_xlabel(feat,
20 fontsize=20)
21 ax1.set_ylabel('Count',
22 fontsize=20)
23 data1 = []
24 for p in ax1.patches:
25 x = p.get_x() +
p.get_width() / 2.
26
y = p.get_height()
27
ax1.annotate(format(y,
28
'.0f'), (x, y), ha='center',
29
va='center',
30 xytext=(0, 10),
31 weight="bold",
32 fontsize=20, textcoords='offset
33 points')
34 data1.append([x, y])
35
36 df[df['binaryClass'] == "P"]
[feat].plot(ax=ax2, kind='hist',
37
38 color=colors[1])
39 ax2.set_title('Positive
40 Thyroid', fontsize=25)
41
42 ax2.set_xlabel(feat,
43 fontsize=20)
44 ax2.set_ylabel('Count',
fontsize=20)
45
data2 = []
46
for p in ax2.patches:
47
x = p.get_x() +
p.get_width() / 2.
y = p.get_height()
ax2.annotate(format(y,
'.0f'), (x, y), ha='center',
va='center',
xytext=(0, 10),
weight="bold",
fontsize=20, textcoords='offset
points')
data2.append([x, y])
plt.show()
# Print x and y values using

tabulate
print("Negative Thyroid:")
print(tabulate(data1, headers=
[feat, "y"]))
print("\nPositive Thyroid:")
print(tabulate(data2, headers=
[feat, "y"]))

1. from tabulate import tabulate:
This line imports the tabulate
function from the tabulate
library, which will be used to
print the x and y values.
2. def another_versus_label(feat):
This line defines a function
named another_versus_label
that takes a feature name (feat)
as input.
3. fig, (ax1, ax2) =
plt.subplots(nrows=2, ncols=1,
figsize=(25, 15)): This line
creates a figure with two
subplots arranged vertically.
The subplots are assigned to
ax1 and ax2 variables. The size
of the figure is set to (25, 15).
4. plt.subplots_adjust(wspace=0.5,
hspace=0.25): This line adjusts
the spacing between subplots
horizontally and vertically.
5. colors =
sns.color_palette("Set2"): This
line defines a color palette
using the seaborn function
color_palette(). It uses the
"Set2" palette, which provides a
set of distinct colors.
6. df[df['binaryClass'] == "N"]
[feat].plot(ax=ax1, kind='hist',
color=colors[0]): This line plots
a histogram for the feature feat
in the subset of the DataFrame
where the 'binaryClass' is "N"
(Negative Thyroid). It uses ax1
as the subplot, sets the number
of bins to 10, and assigns the
color from the colors palette at
index 0.
7. ax1.set_title('Negative Thyroid',
fontsize=25): This line sets the
title of the first subplot to
'Negative Thyroid' with a font
size of 25.
8. ax1.set_xlabel(feat,
x-axis label of the first subplot
to the feature name (feat) with
a font size of 20.
9. ax1.set_ylabel('Count',
y-axis label of the first subplot
to 'Count' with a font size of 20.
10. for p in ax1.patches:: This line
iterates over the bars of the
histogram in the first subplot.
11. ax1.annotate(...): This line adds
text annotations to each bar of
the histogram in the first
subplot, displaying the count
value (y) at the center of the
bar.
12. data1 = []: This line initializes
an empty list data1 to store the
x and y values of the
annotations in the first subplot.
13. The code following the loop in
the first subplot repeats similar
steps for the second subplot
(ax2) representing the 'Positive
Thyroid' cases.
14. plt.show(): This line displays
the plot with the two subplots.
15. print("Negative Thyroid:"): This
line prints the header for the
section displaying x and y
values for 'Negative Thyroid'.
16. print(tabulate(data1, headers=
[feat, "y"])): This line prints the
x and y values in tabular format
using the tabulate function. It
displays the x and y values from
data1, with headers for the
feature (feat) and the y values.
17. The code following the first
tabulate statement repeats
similar steps for the 'Positive
Thyroid' cases (data2).
The code allows you to visualize the distribution of a
specific feature (feat) separately for the 'Negative
Thyroid' and 'Positive Thyroid' cases. It plots
histograms for both cases, displays the count values
on the bars, and prints the x and y values for each
case in a tabular format.
Figure 3 The age feature distribution by label

feature
Step Look at age feature distribution by binaryClass

2 feature:
1 #Looks at age feature distribution

2 by binaryClass feature
another_versus_label("age")
The result is shown in Figure 3. The resulting plots

show the distribution of the "age" feature categorized
by the "binaryClass" feature, which represents the
presence or absence of thyroid disease.
The plots are divided into two subplots: the top

subplot represents the "Negative Thyroid" cases, and
the bottom subplot represents the "Positive Thyroid"
cases.
For each subplot:

The x-axis represents the age
values.
The y-axis represents the count
of individuals within each age
group.
The histogram bars represent
the frequency of individuals
within each age group.
In the "Negative Thyroid" subplot:
The title is set to "Negative
Thyroid".
The x-axis label is set to "age".
The y-axis label is set to
"Count".
The bars are color-coded using
a specific color (such as blue)
to differentiate them from the
bars in the "Positive Thyroid"
subplot.
The count values are annotated
on top of each bar, indicating
the number of individuals
In the "Positive Thyroid" subplot:
The title is set to "Positive
Thyroid".
The x-axis label is set to "age".
"Count".
The bars are color-coded using
a specific color (such as
orange) to differentiate them
from the bars in the "Negative
Thyroid" subplot.
The count values are annotated
on top of each bar, indicating
the number of individuals
These plots provide a visual representation of the age
distribution within the "Negative Thyroid" and
"Positive Thyroid" groups, allowing for a comparison
between the two.
Output:
Negative Thyroid:
age y
----- ---
5.35 7
14.05 8
22.75 20
31.45 30
40.15 38
48.85 38
57.55 48
66.25 54
74.95 37
83.65 11
Positive Thyroid:
age y
----- ---
5.95 15
15.85 168
25.75 427
35.65 489
45.55 453
55.45 644
65.35 634
75.25 503
85.15 139
95.05 9
Based on the output:

For the "Negative Thyroid" cases:
The age values range from 5.35
to 83.65, indicating a wide
range of ages in this group.
The count values ("y") range
from 7 to 54, representing the
number of individuals in each
age group.
The distribution of ages
appears to be relatively evenly
spread, with no specific age
group dominating the count.
The age groups with higher
counts include 66.25, 57.55,
and 48.85, indicating a
relatively larger number of
individuals in these age ranges.
The age groups with lower
counts include 5.35 and 14.05,
indicating a smaller number of
For the "Positive Thyroid" cases:
The age values range from 5.95
to 95.05, indicating a wide
range of ages in this group as
well.
The count values ("y") range
from 9 to 644, representing the
number of individuals in each
age group.
The distribution of ages shows
variations, with certain age
groups having higher counts
compared to others.
The age group 55.45 has the
highest count of 644, indicating
a significant number of
individuals in this age range.
The age groups with lower
counts include 5.95 and 95.05,
indicating a smaller number of
In conclusion, analyzing the age distribution for both
"Negative Thyroid" and "Positive Thyroid" cases
reveals that there is a diverse range of ages in both
groups. However, there may be some differences in
the distribution patterns. Further analysis and
exploration of other features could provide more
insights into the relationship between age and the
occurrence of thyroid disease.
Distribution of TSH versus Target Variable
Step Look at TSH feature distribution by

1 binaryClass feature:
1 #Looks at TSH feature distribution

another_versus_label("TSH")
The result is shown in Figure 4. The resulting

plots show the distribution of the "TSH"
(Thyroid-Stimulating Hormone) feature
categorized by the "binaryClass" feature, which
represents the presence or absence of thyroid
disease.
The plots consist of two subplots: the top

subplot represents the "Negative Thyroid"
cases, and the bottom subplot represents the
"Positive Thyroid" cases.
For each subplot:

The x-axis represents the
values of the "TSH"
feature.
The y-axis represents the
count of individuals within
each "TSH" value range.
The histograms display the
frequency of individuals
within each "TSH" value
range.
In the "Negative Thyroid" subplot:
The title is set to "Negative
Thyroid".
The x-axis label is set to
"TSH".
"Count".
The bars are colored using
one color to differentiate
them from the bars in the
"Positive Thyroid" subplot.
The count values are
annotated on top of each
bar, indicating the number
of individuals within each
"TSH" value range.
Figure 4 The TSH feature distribution by

binaryClass feature
In the "Positive Thyroid" subplot:

The title is set to "Positive
Thyroid".
The x-axis label is set to
"TSH".
"Count".
The bars are colored using
a different color to
differentiate them from the
bars in the "Negative
Thyroid" subplot.
The count values are
annotated on top of each
bar, indicating the number
of individuals within each
"TSH" value range.
These plots provide a visual representation of
the distribution of the "TSH" feature within the
"Negative Thyroid" and "Positive Thyroid"
groups, allowing for a comparison between the
two.
Output:
Negative Thyroid:
TSH y
-------- ---
26.5143 243
79.5127 23
132.511 9
185.51 8
238.508 2
291.507 0
344.505 0
397.504 1
450.502 3
503.501 2
Positive Thyroid:
TSH y
--------- ----
7.25475 3453
21.7543 19
36.2538 3
50.7533 1
65.2528 1
79.7523 2
94.2518 0
108.751 1
123.251 0
137.75 1

The TSH levels range from
26.5143 to 503.501.
The count values ("y")
range from 0 to 243,
indicating the number of
individuals within each
TSH level range.
The most common TSH
level among the negative
thyroid cases is
approximately 26.5143,
with 243 individuals having
this TSH level.
As the TSH levels increase,
the count values generally
decrease, indicating a
potential inverse
relationship between TSH
levels and the occurrence
of negative thyroid cases.
The TSH levels range from
7.25475 to 137.75.
range from 0 to 3453,
representing the number of
individuals within each
TSH level range.
The most common TSH
level among the positive
thyroid cases is
approximately 7.25475,
with 3453 individuals
having this TSH level.
In contrast to the negative
thyroid cases, the count
values for positive thyroid
cases are generally higher
across a wide range of TSH
levels, indicating a
potential association
between higher TSH levels
and the occurrence of
positive thyroid cases.
In conclusion, analyzing the distribution of TSH
levels within the negative and positive thyroid
cases provides insights into the potential
relationship between TSH levels and thyroid
disease. The results suggest that lower TSH
levels may be associated with positive thyroid
cases, while higher TSH levels may be more
common in negative thyroid cases. However,
further analysis and consideration of other
factors are necessary to establish a definitive
conclusion about the relationship between TSH
levels and thyroid disease.
Distribution of T3 versus Target Variable
Step Look at T3 feature distribution by binaryClass

1 feature:
1 #Looks at T3 feature distribution

another_versus_label("T3")
Output:
Negative Thyroid:
T3 y
----- ---
0.395 39
0.785 29
1.175 34
1.565 51
1.955 91
2.345 36
2.735 6
3.125 1
3.515 2
3.905 2
Positive Thyroid:
T3 y
------- ----
0.5775 229
1.6325 2163
2.6875 924
3.7425 111
4.7975 35
5.8525 10
6.9075 6
7.9625 1
9.0175 1
10.0725 1
Figure 5 The T3 feature distribution by

binaryClass feature

The "T3" values range from
0.395 to 3.905.
range from 1 to 91.
The highest count is
observed at a "T3" value of
1.955, with 91 individuals.
The counts generally vary
across different "T3" value
ranges, with some ranges
having higher counts than
others.
The "T3" values range from
0.5775 to 10.0725.
range from 1 to 2163.
The highest count is
observed at a "T3" value of
1.6325, with 2163
individuals.
Similar to the negative
thyroid cases, the counts
vary across different "T3"
value ranges, with some
ranges having higher
counts.
These results provide insights into the
distribution of the "T3" feature within the
"Negative Thyroid" and "Positive Thyroid"
groups. It shows the variation in "T3" values
and their corresponding counts, indicating the
potential association between "T3" levels and
thyroid disease. However, further analysis and
consideration of other factors are necessary to
establish a definitive conclusion about the
relationship between "T3" levels and thyroid
disease.
Distribution of TT4 versus Target Variable
Step Look at TT4 feature distribution by

1 #Looks at TT4 feature distribution

another_versus_label("TT4")
Figure 6 The TT4 feature distribution by

binaryClass feature
Output:
Negative Thyroid:
TT4 y
----- ---
9.3 29
23.9 15
38.5 21
53.1 21
67.7 52
82.3 62
96.9 35
111.5 30
126.1 15
140.7 11
Positive Thyroid:
TT4 y
------ ----
39.55 68
80.65 1355
121.75 1616
162.85 329
203.95 80
245.05 26
286.15 4
327.25 0
368.35 1
409.45 2
Upon analyzing the distribution of the "TT4"

feature by the "binaryClass" feature, the
following observations can be made:

The "TT4" values range
from 9.3 to 140.7,
indicating a wide range of
thyroid hormone levels in
this group.
The count values ("y") vary
across different "TT4"
value ranges, with the
highest count observed at a
"TT4" value of 82.3.
The distribution appears to
be skewed towards lower
"TT4" values, as evidenced
by the lower counts for
higher "TT4" values.
The "TT4" values range
from 39.55 to 409.45,
indicating a wider range
compared to the "Negative
Thyroid" cases.
The count values ("y") also
vary across different "TT4"
"TT4" value of 121.75.
be more balanced, with
relatively higher counts for
a broader range of "TT4"
values.
Overall, these findings suggest that the "TT4"
feature is likely to be a significant indicator of
thyroid disease. The distribution of "TT4"
values differs between the "Negative Thyroid"
and "Positive Thyroid" cases, with potential
patterns and trends that could be explored
further. However, it is important to note that
additional analysis and consideration of other
factors are necessary to fully understand the
relationship between "TT4" levels and thyroid
disease.
Distribution of T4U versus Target Variable
Step Look at T4U feature distribution by

1 #Looks at T4U feature distribution

another_versus_label("T4U")

Figure 7 The T4U feature distribution by
binaryClass feature
Output:
Negative Thyroid:
T4U y
------ ---
0.6145 7
0.7235 15
0.8325 43
0.9415 81
1.0505 64
1.1595 48
1.2685 20
1.3775 8
1.4865 2
1.5955 3
Positive Thyroid:
T4U y
------ ----
0.3535 5
0.5605 59
0.7675 715
0.9745 1905
1.1815 602
1.3885 108
1.5955 54
1.8025 26
2.0095 5
2.2165 2
Upon analyzing the distribution of the "T4U"
feature by the "binaryClass" feature, the
following observations can be made:

The "T4U" values range
from 0.6145 to 1.5955.
The count values ("y") vary
across different "T4U"
"T4U" value of 0.9415.
be relatively balanced, with
no significant skewness
towards higher or lower
"T4U" values.
The "T4U" values range
from 0.3535 to 2.2165,
indicating a wider range
compared to the "Negative
Thyroid" cases.
The count values ("y") also
vary across different "T4U"
"T4U" value of 0.9745.
be slightly skewed towards
higher "T4U" values, as
evidenced by the
decreasing counts for lower
"T4U" values.
These findings suggest that the "T4U" feature
may have some relevance in distinguishing
between "Negative Thyroid" and "Positive
Thyroid" cases. However, further analysis and
consideration of other factors are necessary to
fully understand the relationship between
"T4U" levels and thyroid disease.
It is important to note that interpreting these

distributions alone may not provide a
comprehensive understanding of the feature's
significance. Additional statistical analysis and
exploration of other related features are
recommended to gain further insights.
Distribution of FTI versus Target Variable
Step Look at FTI feature distribution by

1 #Looks at FTI feature distribution

another_versus_label("FTI")
Figure 8 The FTI feature distribution by

binaryClass feature
Output:
Negative Thyroid:
FTI y
------ ---
9.55 30
24.65 9
39.75 18
54.85 28
69.95 48
85.05 43
100.15 51
115.25 53
130.35 9
145.45 2
Positive Thyroid:
FTI y
----- ----
35.9 17
73.7 639
111.5 2186
149.3 481
187.1 109
224.9 32
262.7 10
300.5 3
338.3 1
376.1 3
Upon analyzing the distribution of the "FTI"

feature by the "binaryClass" feature, we can
observe the following:
Negative Thyroid:
The "FTI" values for the
negative thyroid group
range from 9.55 to 145.45.
The majority of individuals
have "FTI" values between
54.85 and 115.25.
The highest count of
individuals (53) falls within
the "FTI" range of 115.25.
There are relatively fewer
individuals with extreme
"FTI" values below 24.65
and above 130.35 in the
negative thyroid group.
Positive Thyroid:
The "FTI" values for the
positive thyroid group
range from 35.9 to 376.1.
The majority of individuals
have "FTI" values between
73.7 and 149.3.
The highest count of
individuals (2186) falls
within the "FTI" range of
111.5.
There are relatively fewer
individuals with extreme
"FTI" values below 73.7
and above 262.7 in the
positive thyroid group.
Based on this analysis, it appears that higher
"FTI" values are more common in the positive
thyroid group compared to the negative thyroid
group. This suggests that the "FTI" feature may
be a useful indicator for distinguishing between
individuals with positive and negative thyroid
conditions. Further analysis and modeling can
be performed to assess the predictive power of
the "FTI" feature and its contribution to thyroid
disease diagnosis.
Distribution of Sex Feature
Step Define dist_percent_plot() method to plot the distribution

1 of binaryClass feature against another categorical feature
in stacked bar plot:
1 def put_label_stacked_bar(ax,fontsize):
2 #patches is everything inside of the chart
3 for rect in ax.patches:
4 # Find where everything is located
5 height = rect.get_height()
6 width = rect.get_width()
7 x = rect.get_x()
8 y = rect.get_y()
9
10 # The height of the bar is the data value and
11 can be used as the label
12 label_text = f'{height:.0f}' #
f'{height:.2f}' to format decimal values
13
14
# ax.text(x, y, text)
15
label_x = x + width / 2
16
label_y = y + height / 2
17
18
# plots only when height is greater than
19
specified value
20
if height > 0:
21
ax.text(label_x, label_y,
22 label_text, ha='center', va='center', weight =
23 "bold",fontsize=fontsize)
24
25 def dist_count_plot(df, cat):
26 fig = plt.figure(figsize=(25, 15))
27 ax1 = fig.add_subplot(111)
28
29 group_by_stat = df.groupby([cat,
30 'binaryClass']).size()
31 stacked_plot =
group_by_stat.unstack().plot(kind='bar',
32
stacked=True, ax=ax1, grid=True)
33
ax1.set_title('Stacked Bar Plot of ' + cat
34 + ' (number of cases)', fontsize=14)
35 ax1.set_ylabel('Number of Cases')
36 put_label_stacked_bar(ax1, 17)
37
38 # Collect the values of each stacked bar
39 data = []
40 headers = [''] +
41 df['binaryClass'].unique().tolist()
42 for container in stacked_plot.containers:
43 bar_values = []
44 for bar in container:
45 bar_value = bar.get_height()
46 bar_values.append(bar_value)
47 data.append(bar_values)
48
49 # Transpose the data for tabulate
50
51 data_transposed = list(map(list,
52 zip(*data)))
53
54 # Insert the values of `cat` as the first
column in the data
55
data_with_cat = [[value] + row for value,
56
row in
zip(group_by_stat.index.get_level_values(cat),
data_transposed)]
# Print the values in tabular form

print(tabulate(data_with_cat,
headers=headers))
plt.show()
The function put_label_stacked_bar() is used to annotate

the stacked bar chart with the counts of each category. It
iterates over each bar in the chart, retrieves its height
(count), and places the label at the center of the bar.
The function dist_count_plot() takes a DataFrame (df) and
a categorical feature (cat) as input. It generates a stacked
bar plot showing the distribution of the categories within
each class of the binaryClass feature. The height of each
bar represents the count of cases for each category. The
bars are stacked according to the classes (N and P) of the
binaryClass feature.
The stacked bar plot is annotated using the

put_label_stacked_bar function to display the count values
on each bar. The table of counts is also printed using the
tabulate function from the tabulate library.
This plot provides an overview of the distribution of each

category within the cat feature for both the negative (N)
and positive (P) thyroid classes. It allows for a visual
comparison of the distribution of categories between the
two classes and facilitates the identification of any patterns
or differences in the distribution.
To use this function, you can pass your DataFrame (df) and
the categorical feature you want to analyze (cat). The
stacked bar plot will be displayed, and the counts for each
category will be printed in tabular form.
Here are the step-by-step explanations of
each function:
put_label_stacked_bar(ax, fontsize): This

function is responsible for annotating the
stacked bar chart with count values. It
takes two parameters:
ax: The Axes object
representing the
stacked bar chart.
fontsize: The font size
of the count labels.
Inside the function:
1. It iterates over each
rectangle patch in the
chart using ax.patches.
2. For each patch, it
retrieves the height,
width, x-coordinate,
and y-coordinate.
3. It constructs the label
text using the height
value.
4. It calculates the
coordinates for placing
the label at the center
of the bar.
5. If the height is greater
than 0 (to avoid
labeling empty bars), it
uses ax.text() to place
the label at the
calculated coordinates.
dist_count_plot(df, cat): This function
generates a stacked bar plot and a table
of counts for a categorical feature within
each class of the binaryClass feature. It
takes two parameters:
df: The DataFrame
containing the data.
cat: The name of the
categorical feature to
be analyzed.
Inside the function:
1. It creates a new figure
and axes using
plt.subplots().
2. It groups the data by
the cat feature and
binaryClass and
calculates the
size/count of each
group.
3. It uses unstack() to
reshape the grouped
data into a format
suitable for a stacked
bar chart.
4. It plots the stacked bar
chart using
plot(kind='bar',
stacked=True) on the
axes object.
5. It sets the title and
ylabel of the plot.
6. It calls
put_label_stacked_bar()
to annotate the stacked
bars with count values.
7. It extracts the count
values for each bar and
organizes them in a
table format.
8. It uses tabulate to print
the table of counts with
appropriate headers.
9. Finally, it displays the
plot using plt.show().
These functions provide a convenient way
to visualize and analyze the distribution of
a categorical feature within each class of
a binary target variable. The stacked bar
chart allows for easy comparison of
category proportions, while the tabulated
count values provide a more detailed
overview.
Step 2 Plot the distribution of sex feature in pie char

and bar plot and plot the distribution o
binaryClass variable against sex variable in
stacked bar plot:
1 #Plots the distribution of sex

2 feature in pie chart and bar plot
3 plot_piechart(df,'sex')
4
5 #Plots binaryClass variable against
sex variable in stacked bar plots
dist_count_plot(df,'sex')
Figure 9 The distribution of sex feature in pie

chart and bar plot
Figure 10 The distribution of binaryClass
variable against sex variable in stacked bar plo
The results are shown in Figure 9 and Figure

10. The resulting plots provide insights into the
distribution of the "sex" feature and it
relationship with the "binaryClass" variable.
plot_piechart(df, 'sex'): This function generate

two subplots:
The first subplot is a pie
chart that visualizes the
distribution of the "sex"
feature. Each category o
"sex" is represented by a
pie slice, and the size o
each slice represents the
proportion of that category
in the dataset.
The second subplot is a
horizontal bar plot that
shows the count of each
category of "sex" in the
dataset. Each bar
represents a category, and
the length of the bar
corresponds to the count o
that category.
The function also prints the
percentage values and
count values for each
category of "sex". This
allows for a better
understanding of the
distribution and count of
each category.
dist_count_plot(df, 'sex'): This function
generates a stacked bar plot that compares the
distribution of the "sex" feature within each
class of the "binaryClass" variable.
The stacked bar plot has
two bars for each category
of "sex", one representing
the count of that category
in the "Negative Thyroid"
class and the other
representing the count in
the "Positive Thyroid"
class. The height of each
stacked bar corresponds to
the count of the respective
category within each class.
The function also prints a
table that presents the
count values for each
category of "sex" within
each class. This allows for a
more detailed analysis of
the distribution and count
of each category within
each class.
These plots provide visual and tabula
representations of the distribution of the "sex
feature and its relationship with the
"binaryClass" variable. They help in
understanding the proportion and count of each
category within the dataset and how it varie
across the classes of the target variable.
Output:
Percentage values:
F 69.724284
M 30.275716
Name: sex, dtype: float64
Count values:
F 2630
M 1142
Name: sex, dtype: int64
P N
-- --- ----
F 226 2404
F 65 1077
Based on the analysis of the resulting plots and

tables:
Pie Chart:
The pie chart shows that
the majority of the samples
(69.7%) are labeled as "F"
(female), while the
remaining 30.3% are
labeled as "M" (male).
This indicates that the
dataset is imbalanced in
terms of gender
representation, with more
female samples compared
to male samples.
Bar Plot:
The bar plot provides a
visual representation of the
count of each category of
the "sex" feature.
It shows that the category
"F" (female) has a higher
count (2630) compared to
the category "M" (male)
with a count of 1142.
Stacked Bar Plot:
The stacked bar plot
compares the distribution
of the "sex" feature within
each class of the
"binaryClass" variable.
In the "Negative Thyroid"
class, the majority of
samples are labeled as "F"
(female) with a count of
2404, while there are fewer
samples labeled as "M"
(male) with a count o
1077.
In the "Positive Thyroid"
class, there are fewer
samples overall, with 226
labeled as "F" (female) and
65 labeled as "M" (male).
These plots provide insights into the
distribution of the "sex" feature and it
relationship with the "binaryClass" variable
However, it's important to note that the datase
is imbalanced, particularly in terms of gende
representation. Further analysis and modeling
should take this into consideration to avoid
potential biases and ensure reliable predictions
Distribution of On Thyroxine Feature
Step Plot the distribution of on thyroxine feature in

1 pie chart and bar plot and plot the distribution
of binaryClass variable against on thyroxine
variable in stacked bar plot:
1 #Plots the distribution of on

2 thyroxine feature in pie chart and
bar plot
3
plot_piechart(df,'on thyroxine')
4
5
#Plots binaryClass variable against
on thyroxine variable in stacked
bar plots
dist_count_plot(df,'on thyroxine')

12.
Figure 11 The distribution of on thyroxine

feature in pie chart and bar plot

variable against on thyroxine variable in
stacked bar plot
Here is an explanation of each resulting plot

without generating the output:
Pie Chart:
The pie chart represents
the distribution of the "on
thyroxine" feature.
It visualizes the proportion
of samples that are labeled
as "t" (on thyroxine) and "f"
(not on thyroxine).
The slices of the pie chart
show the percentage of
each category out of the
total number of samples.
Bar Plot:
The bar plot displays the
count of each category of
the "on thyroxine" feature.
It presents a bar for each
category, where the height
of each bar corresponds to
the count of samples in that
category.
categories "t" and "f", and
the y-axis represents the
count of samples.
Stacked Bar Plot:
compares the distribution
of the "on thyroxine"
feature within each class of
the "binaryClass" variable.
It visualizes the count of
samples for each
combination of the "on
thyroxine" feature and the
"binaryClass" variable.
categories "t" and "f" of the
"on thyroxine" feature, and
the y-axis represents the
count of samples.
Each stacked bar
represents a specific
category of the "on
thyroxine" feature within
the "Negative Thyroid" or
"Positive Thyroid" class.
These plots provide visual representations of
the distribution and relationship between the
"on thyroxine" feature and the "binaryClass"
variable in the dataset. They help to understand
the proportions, counts, and distribution
patterns of the different categories within each
variable.
Output:
Percentage values:
f 87.698834
t 12.301166
Name: on thyroxine, dtype: float64
Count values:
f 3308
t 464
Name: on thyroxine, dtype: int64
P N
-- --- ----
f 282 3026
f 9 455
Based on the analysis of the "on thyroxine"

feature, we can draw the following conclusions:
Distribution of "on thyroxine" feature:

Approximately 87.7% of the
samples are labeled as "f"
(not on thyroxine).
Approximately 12.3% of the
samples are labeled as "t"
(on thyroxine).
Stacked bar plot analysis:
In the "Negative Thyroid"
samples (around 98.8%)
are labeled as "f" (not on
thyroxine), and only a small
portion (around 1.2%) are
labeled as "t" (on
thyroxine).
In the "Positive Thyroid"
samples (around 86.9%)
are labeled as "f" (not on
thyroxine), while a smaller
proportion (around 13.1%)
are labeled as "t" (on
thyroxine).
These findings suggest that being on thyroxine
may be more common in the "Positive Thyroid"
class compared to the "Negative Thyroid" class.
The "on thyroxine" feature appears to have
some correlation with the thyroid status,
indicating its potential importance in predicting
thyroid-related conditions.
Further analysis and modeling techniques can

be applied to explore the relationship between
the "on thyroxine" feature and the target
variable, as well as other potential factors that
may contribute to thyroid disorders.
Distribution of Sick Feature
Step Plot the distribution of sick feature in pie chart

1 and bar plot and plot the distribution of
binaryClass variable against sick variable in
stacked bar plot:
1 #Plots the distribution of sick

3 plot_piechart(df,'sick')
4
sick variable in stacked bar plots
dist_count_plot(df,'sick')

14.
Figure 13 The distribution of sick feature in
pie chart and bar plot
The code will plot the distribution of the "sick"

feature in a pie chart and a bar plot. It will also
create stacked bar plots of the "binaryClass"
variable against the "sick" variable.
The pie chart will show the percentage

distribution of the "sick" feature, indicating the
proportion of each category ("f" or "t") within
the dataset. The bar plot will display the count
of each category of the "sick" feature.
The stacked bar plots will show the count of

each category ("f" or "t") of the "sick" feature
for both the "Positive Thyroid" and "Negative
Thyroid" classes. This allows for visual
comparison of the distribution of the "sick"
feature between the two classes.
By analyzing these plots, you can gain insights

into the distribution of the "sick" feature and its
relationship with the target variable
("binaryClass").

variable against sick variable in stacked bar
plot
Output:
Percentage values:
f 96.102863
t 3.897137
Name: sick, dtype: float64
Count values:
f 3625
t 147
Name: sick, dtype: int64
P N
-- --- ----
f 280 3345
f 11 136
The pie chart shows that the majority of the

samples (96.1%) have the "sick" feature value
as "f" (indicating not sick), while only a small
portion (3.9%) have the value "t" (indicating
sick). The bar plot further illustrates the count
of each category, with "f" having a significantly
higher count (3625) compared to "t" (147).
The stacked bar plots demonstrate the
distribution of the "sick" feature for both the
"Positive Thyroid" (P) and "Negative Thyroid"
(N) classes. In the "Positive Thyroid" class,
there are 280 samples classified as "f" and 11
samples classified as "t". In the "Negative
Thyroid" class, the count is higher, with 3345
samples classified as "f" and 136 samples
classified as "t".
From these plots, we can observe that the

majority of the samples in the dataset are not
classified as sick (value "f"), and this trend is
consistent across both the entire dataset and
the individual classes.
Distribution of Tumor Feature
Step Plot the distribution of tumor feature in pie

1 chart and bar plot and plot the distribution of
binaryClass variable against tumor variable in
stacked bar plot:
1 #Plots the distribution of tumor
3 plot_piechart(df,'tumor')
4
tumor variable in stacked bar plots
dist_count_plot(df,'tumor')

16.
Figure 15 The distribution of tumor feature in

The function plot_piechart() is used to plot the

distribution of the "tumor" feature in a pie chart
and bar plot. It takes the DataFrame (df) and
the feature name ("tumor") as input.
The function dist_count_plot() is used to plot

the binaryClass variable against the "tumor"
variable in stacked bar plots. It takes the
DataFrame (df) and the feature name ("tumor")
as input.
By calling these functions, you will generate

visualizations that show the distribution and
count of the "tumor" feature, as well as how it
relates to the binaryClass variable (positive and
negative thyroid classes).
To interpret the plots, you can analyze the

percentage values, count values, and the
stacked bar plots for each category of the
"tumor" feature. This will provide insights into
the distribution and relationship between the
"tumor" feature and the binaryClass variable.
variable against tumor variable in stacked bar
plot
Output:
Percentage values:
f 97.454931
t 2.545069
Name: tumor, dtype: float64
Count values:
f 3676
t 96
Name: tumor, dtype: int64
P N
-- --- ----
f 283 3393
f 8 88
The "tumor" feature:

The pie chart shows that
the majority of samples
(97.5%) have no tumor (f),
while a small percentage
(2.5%) have a tumor (t).
The bar plot confirms the
distribution, with a
significantly higher count
for the "f" category (no
tumor) compared to the "t"
category (tumor).
Relationship between "tumor" and binaryClass:
reveals the distribution of
thyroid classification
(positive and negative)
based on the presence or
absence of a tumor.
Among samples with no
tumor (f), the majority
(97.5%) are classified as
negative thyroid (N), while
a small percentage (2.5%)
are classified as positive
thyroid (P).
Among samples with a
tumor (t), a higher
percentage (77.7%) are
classified as positive
thyroid (P), while a smaller
percentage (22.3%) are
classified as negative
thyroid (N).
In conclusion, the presence or absence of a
tumor (as indicated by the "tumor" feature)
appears to have a relationship with the binary
classification of thyroid status. The majority of
samples without a tumor are classified as
negative thyroid, while a higher proportion of
samples with a tumor are classified as positive
thyroid. However, further analysis and domain
knowledge are necessary to understand the
significance and implications of this
relationship in the context of thyroid conditions.
Distribution of TSH Measured Feature
Step Plot the distribution of TSH measured feature

1 in pie chart and bar plot and plot the
distribution of binaryClass variable against
TSH measured variable in stacked bar plot:
1 #Plots the distribution of TSH

2 measured feature in pie chart and
bar plot
3
plot_piechart(df,'TSH measured')
4
5
TSH measured variable in stacked
bar plots
dist_count_plot(df,'TSH measured')

18.
Figure 17 The distribution of TSH measured


variable against TSH measured variable in
stacked bar plot
Output:
Percentage values:
t 90.217391
f 9.782609
Name: TSH measured, dtype: float64
Count values:
t 3403
f 369
Name: TSH measured, dtype: int64
P N
-- --- ----
f 0 369
t 291 3112
The analysis of the output reveals the following

information:
Distribution of "TSH measured" feature:

The majority of samples
(90.2%) have the value "t"
(true) for the "TSH
measured" feature,
indicating that TSH levels
were measured.
A smaller proportion of
samples (9.8%) have the
value "f" (false), indicating
that TSH levels were not
measured.
Stacked bar plot of "TSH measured" versus
"binaryClass":
Among the samples where
TSH levels were not
measured (category "f"), all
cases belong to the
negative thyroid class
(N=369). There are no
positive thyroid cases in
this category.
Among the samples where
TSH levels were measured
(category "t"), there are
291 positive thyroid cases
(P=291) and 3112 negative
thyroid cases (N=3112).
Based on this analysis, it can be concluded that
TSH measurement is an important factor in
determining the presence of thyroid-related
conditions. The majority of samples where TSH
was measured belong to the negative thyroid
class, while a smaller subset belongs to the
positive thyroid class. However, it's important
to consider other features and conduct further
analysis to gain a comprehensive understanding
of the relationship between TSH levels, the
"TSH measured" feature, and the presence of
thyroid-related conditions.
Distribution of TT4 Measured Feature
Step Plot the distribution of TT4 measured feature

1 in pie chart and bar plot and plot the
distribution of binaryClass variable against
TT4 measured variable in stacked bar plot:
1 #Plots the distribution of TT4

2 measured feature in pie chart and
bar plot
3
plot_piechart(df,'TT4 measured')
4
5
TT4 measured variable in stacked
bar plots
dist_count_plot(df,'TT4 measured')

20.
Figure 19 The distribution of TT4 measured

variable against TT4 measured variable in
stacked bar plot
Output:
Percentage values:
t 93.875928
f 6.124072
Name: TT4 measured, dtype: float64
Count values:
t 3541
f 231
Name: TT4 measured, dtype: int64
P N
-- --- ----
f 5 226
f 286 3255

Pie Chart: The majority of
the samples (93.9%) have
the "TT4 measured" feature
marked as "t" (True),
indicating that the TT4
measurements were
recorded for most of the
samples. Only a small
percentage (6.1%) have it
marked as "f" (False),
indicating that TT4
measurements were not
taken for those samples.
Stacked Bar Plot: Among
the samples with "TT4
measured" marked as "t"
(True), the distribution
between the "N" (Negative
Thyroid) and "P" (Positive
Thyroid) classes is
imbalanced. The majority of
the samples (3255) with
"TT4 measured" as "t"
belong to the "N" class,
while a smaller number
(286) belong to the "P"
class. On the other hand,
among the samples with
"TT4 measured" marked as
"f" (False), the majority
(226) belong to the "P"
class, while only a few (5)
belong to the "N" class.
These plots suggest that the availability of TT4
measurements may have some correlation with
the thyroid class. However, further analysis and
statistical testing would be needed to establish
a stronger relationship and determine the
significance of this feature in predicting thyroid
class.
Extracting Categorical and Numerical Features
Step Extract categorical and numerical columns:

1

2 print(df.info())
3
4
5 #Extracts categorical and numerical
6 columns
7 cat_cols = [col for col in
df.columns if \
8
(df[col].dtype == 'object')]
9
num_cols = [col for col in
10
df.columns if \
11
(df[col].dtype != 'object')]
print(cat_cols)
print(num_cols)
Output:
# Column Non-Null
Count Dtype
--- ------ --------
------ -----
0 age 3772
non-null float64
1 sex 3772
non-null object
2 on thyroxine 3772
non-null object
non-null object
non-null object
5 sick 3772
non-null object
6 pregnant 3772
non-null object
7 thyroid surgery 3772
non-null object
8 I131 treatment 3772
non-null object
non-null object
non-null object
11 lithium 3772
non-null object
12 goitre 3772
non-null object
13 tumor 3772
non-null object
14 hypopituitary 3772
non-null object
15 psych 3772
non-null object
16 TSH measured 3772
non-null object
17 TSH 3772
non-null float64
18 T3 measured 3772
non-null object
19 T3 3772
non-null float64
20 TT4 measured 3772
non-null object
21 TT4 3772
non-null float64
22 T4U measured 3772
non-null object
23 T4U 3772
non-null float64
24 FTI measured 3772
non-null object
25 FTI 3772
non-null float64
26 TBG measured 3772
non-null object
27 binaryClass 3772
non-null object
dtypes: float64(6), object(22)
None
['sex', 'on thyroxine', 'query on

thyroxine', 'on antithyroid medication',
'sick', 'pregnant', 'thyroid surgery',
'I131 treatment', 'query hypothyroid',
'query hyperthyroid', 'lithium',
'goitre', 'tumor', 'hypopituitary',
'psych', 'TSH measured', 'T3 measured',
'TT4 measured', 'T4U measured', 'FTI
measured', 'TBG measured',
'binaryClass']
['age', 'TSH', 'T3', 'TT4', 'T4U',

'FTI']
Here are the steps in the code:
1. print(df.info()): This
statement prints the
information about the
dataset df. It provides a
summary of the dataset,
including the column
names, the count of non-
null values in each column,
and the data types of the
columns.
2. cat_cols = [col for col in
df.columns if (df[col].dtype
== 'object')]: This line of
code creates a list cat_cols
that contains the names of
categorical columns in the
dataset. It iterates over the
column names in
df.columns and checks if
the data type of each
column is 'object',
indicating a categorical
variable.
3. num_cols = [col for col in
df.columns if (df[col].dtype
!= 'object')]: This line of
code creates a list
num_cols that contains the
names of numerical
columns in the dataset. It
iterates over the column
names in df.columns and
checks if the data type of
each column is not 'object',
indicating a numerical
variable.
4. print(cat_cols): This
statement prints the list of
categorical columns
extracted in step 2. It
displays the names of
columns that contain
categorical variables.
5. print(num_cols): This
statement prints the list of
numerical columns
extracted in step 3. It
displays the names of
columns that contain
numerical variables.
By examining the dataset information and
extracting categorical and numerical columns,
we gain a better understanding of the data
types present in the dataset, which can be
useful for further analysis and processing.
Based on the provided output:

The dataset contains 3772
entries (rows) and 28
columns.
The column names and
their corresponding data
types are displayed.
There are 6 numerical
columns: 'age', 'TSH', 'T3',
'TT4', 'T4U', and 'FTI'.
There are 22 categorical
columns: 'sex', 'on
thyroxine', 'query on
thyroxine', 'on antithyroid
medication', 'sick',
'pregnant', 'thyroid
surgery', 'I131 treatment',
'query hypothyroid', 'query
hyperthyroid', 'lithium',
'goitre', 'tumor',
'hypopituitary', 'psych',
'TSH measured', 'T3
measured', 'TT4 measured',
'T4U measured', 'FTI
measured', 'TBG
measured', and
'binaryClass'.
The memory usage of the
dataset is approximately
825.2 KB.
By analyzing the dataset information, we gain
insights into the data types of each column,
which can be useful for data manipulation,
analysis, and modeling tasks. The numerical
columns contain continuous or discrete numeric
values, while the categorical columns represent
different categories or labels.
Extracting Categorical and Numerical Features
Step Check numerical features density distribution:

1
1 # Checks numerical features density

2 distribution
4 plotnumber = 1
5 color_palette =
sns.color_palette("husl",
6
n_colors=len(num_cols)) # Define
7 color palette
8
9 for column in num_cols:
10 if plotnumber <= 6:
11 ax = plt.subplot(2, 3,
12 plotnumber)
13 sns.distplot(df[column],
14 color=color_palette[plotnumber-1])
15 plt.xlabel(column,
fontsize=40)
16
17
18
19 ax.annotate(format(p.get_height(),
'.2f'), (p.get_x() + p.get_width()
20
21 va='center', xytext=(0, 10),
22 weight="bold", fontsize=30,
23 textcoords='offset points')
24
plotnumber += 1
fig.suptitle('The density of
numerical features', fontsize=50)
plt.tight_layout()
plt.show()
Figure 21 The numerical features density distribution
Here are the steps performed in the code:

1. Create a figure with a specific
size using plt.figure(figsize=(40,
30)).
2. Initialize the plot number
variable (plotnumber) to 1.
3. Define a color palette using the
"husl" color palette from seaborn
with the number of colors equal
to the length of the numerical
columns (color_palette =
sns.color_palette("husl",
n_colors=len(num_cols))).
4. Iterate over each numerical
column in num_cols using a for
loop.
5. If the plot number is less than or
equal to 6, create a subplot using
plt.subplot(2, 3, plotnumber).
This creates a grid of subplots
with 2 rows and 3 columns, and
the current plot number
determines the position of the
subplot.
6. Use sns.distplot to plot the
density distribution of the
current numerical column,
specifying the color from the
color palette
(color=color_palette[plotnumber-
1]).
7. Set the x-axis label of the subplot
to the current column name
using plt.xlabel(column,
fontsize=40).
8. Add annotations to the plot using
ax.annotate. This annotates each
bar in the plot with the
corresponding height, formatting
it to 2 decimal places. The
annotations are positioned at the
center of each bar and displayed
above the bar.
9. Increment the plot number by 1.
10. Set the super title of the figure
using fig.suptitle to provide an
overall title for the subplots.
11. Adjust the layout of the subplots
using plt.tight_layout() to ensure
proper spacing.
12. Display the plot using plt.show().
This code generates a figure with subplots, each
showing the density distribution of a numerical feature.
Each subplot is labeled with the feature name and
annotated with the height of each bar. The "husl" color
palette is used to provide visually appealing colors for
each subplot.
Density distribution plots are used to visualize the

distribution of data values and understand their
patterns and characteristics. They provide insights into
how the data is spread out and the likelihood of
different values occurring.
The density distribution shows the probability density

function (PDF) or an estimated smoothed
representation of the underlying probability
distribution of the data. It provides information about
the relative frequency or density of data values across
the range of the variable.
Density distributions are useful for various purposes:

Understanding the shape of the
data: Density distributions help
identify the shape of the data
distribution, such as whether it
follows a normal distribution, is
skewed, has multiple peaks, or
exhibits other patterns. This
information is essential for
understanding the central
tendency and variability of the
data.
Detecting outliers or anomalies:
Density distributions can reveal
outliers, which are data points
that deviate significantly from
the majority of the data. Outliers
may indicate errors, data quality
issues, or interesting phenomena
that require further
investigation.
Comparing distributions: Density
distributions allow for the
comparison of multiple variables
or groups. By overlaying or
comparing distributions, you can
identify similarities, differences,
or relationships between
different sets of data. This is
particularly useful for
exploratory data analysis and
hypothesis testing.
Assessing skewness or
asymmetry: Density distributions
provide insights into the
skewness of the data, indicating
whether it is positively skewed
(tail to the right), negatively
skewed (tail to the left), or
symmetrically distributed.
Skewness affects the
interpretation of central
tendency measures like mean
and median.
Estimating probability: Density
distributions represent the
relative likelihood of different
values occurring. The area under
the curve within a specific range
represents the probability of
observing values in that range.
This information is useful for
making probabilistic
assessments or conducting
statistical inference.
Overall, density distribution plots help visualize the
overall pattern and characteristics of the data,
providing a summary of its distributional properties.
They assist in data exploration, identifying trends,
outliers, and relationships between variables, and
support decision-making processes in various domains,
including statistics, data analysis, and machine
learning.
Distribution of Categorical Features
Step Check categorical features distribution:

1

2 plotnumber = 1
3 color_palette =
4 sns.color_palette("Set3",
n_colors=len(cat_cols)) # Define
5
color palette
6
7
for column, color in zip(cat_cols,
8 color_palette):
9 if plotnumber <= 20:
10 ax = plt.subplot(4, 5,
11 plotnumber)
12 ax.tick_params(axis='x',
13 labelsize=30)
14 ax.tick_params(axis='y',
labelsize=30)
15
sns.countplot(df[column],
16
color=color)
17
plt.xlabel(column,
18 fontsize=40)
19 for p in ax.patches:
20
21 ax.annotate(format(p.get_height(),
22 '.0f'), (p.get_x() + p.get_width()
23
va='center', xytext=(0, 10),
24 weight="bold", fontsize=30,
textcoords='offset points')
plotnumber += 1
fig.suptitle('The distribution of
categorical features', fontsize=50)
plt.tight_layout()
plt.show()
Figure 22 The categorical features distribution
Here are the steps to understand the code:

1. Create a new figure with a
size of 50x40 using
plt.figure(figsize=(50, 40)).
2. Initialize a counter
plotnumber to keep track
of the subplot position.
3. Define a color palette using
n_colors=len(cat_cols)),
which generates a color
palette with the "Set3"
palette name and the
number of colors equal to
the number of categorical
columns.
4. Iterate over each
categorical column and its
corresponding color using
zip(cat_cols, color_palette).
5. Check if the plotnumber is
within the first 20 plots to
limit the number of
subplots to 20.
6. Create a subplot using
plt.subplot(4, 5,
plotnumber) with 4 rows
and 5 columns, and set the
current axis to ax.
7. Customize the tick labels
on the x-axis and y-axis
using ax.tick_params().
8. Plot a countplot for the
current categorical column
using
sns.countplot(df[column],
color=color).
9. Set the x-axis label to the
column name using
plt.xlabel(column,
fontsize=40).
10. Add annotations to each
bar in the countplot using
ax.annotate(), displaying
the count value above each
bar.
11. Increment the plotnumber
by 1 for the next subplot.
12. Set the title of the overall
figure using fig.suptitle().
13. Adjust the spacing between
subplots using
plt.tight_layout().
14. Display the figure using
plt.show().
This code generates a set of countplots for each
categorical feature in the dataset. Each
countplot uses a different color from the "Set3"
color palette, and the count values are
displayed above each bar. The figure provides
an overview of the distribution of categorical
features in the dataset.
The resulting plots show

the distribution of each
categorical feature in the
dataset. Each plot
represents a different
categorical column.
Each plot displays a count
of occurrences for each
category in the respective
feature.
unique categories in the
feature, and the y-axis
represents the count of
occurrences.
Each bar in the plot
corresponds to a category,
and its height indicates the
frequency or count of that
category in the dataset.
Above each bar, the count
value is annotated to
provide a visual
representation of the
count.
By analyzing these plots, you can observe the
distribution and frequency of different
categories within each categorical feature. It
helps you understand the proportion and
variability of different categories in the dataset.
This information can be valuable for identifying
patterns, imbalances, or potential relationships
between categorical variables and the target
variable.
Distribution of Nine Categorical Features versus Target
Variable
Step Plot distribution of number of cases of nine categorical

1 features versus binaryClass:
1 def plot_one_versus_one_cat(feat):
2 categorical_features = ["FTI measured",
3 "sex", "on thyroxine", "sick", "pregnant",
"goitre", "tumor", "hypopituitary", "psych"]
4
num_plots = len(categorical_features)
5
6
fig, axes = plt.subplots(3, 3, figsize=
7
(40, 30), facecolor='#fbe7dd')
8
axes = axes.flatten()
9
color_palette =
10 sns.color_palette("Spectral_r",
11 n_colors=num_plots)
12
13 for i in range(num_plots):
14 ax = axes[i]
15 ax.tick_params(axis='x',
16 labelsize=30)
17 ax.tick_params(axis='y',
labelsize=30)
18
g =
19
sns.countplot(x=df[categorical_features[i]],
20 hue=df[feat], palette=color_palette[i:i+2],
21 ax=ax)
22
23 for p in g.patches:
24
25 g.annotate(format(p.get_height(), '.0f'),
(p.get_x() + p.get_width() / 2.,
26
p.get_height()),
27
ha='center',
28 va='center', xytext=(0, 10), weight="bold",
29 fontsize=20, textcoords='offset points')
30
31
32 g.set_xlabel(categorical_features[i],
fontsize=30)
g.set_ylabel("Number of Cases",
fontsize=30)
g.legend(fontsize=25)
plt.tight_layout()
plt.show()
plot_one_versus_one_cat("binaryClass")
The result is shown in Figure 23. Here's a step-by-step

1. The function
plot_one_versus_one_cat() takes a
categorical feature feat as input.
2. The list categorical_features
contains the names of the
categorical features to be plotted
against feat.
3. The variable num_plots stores the
number of plots to be created,
which is the length of
categorical_features.
4. A figure object fig and an array of
subplots axes are created using
plt.subplots with a 3x3 grid layout
and a specified figure size.
5. The axes array is flattened into a
1D array for easier iteration.
6. The color palette is generated
using sns.color_palette with the
"Spectral_r" colormap and the
number of colors equal to
num_plots.
7. A loop is used to iterate over each
categorical feature and create a
countplot for each one.
Figure 23 The distribution of number of cases of nine
categorical features versus binaryClass
8. Within the loop:

The current subplot ax is
selected.
The x-axis and y-axis tick
labels are set to a font size of
30.
The countplot g is created
using sns.countplot, with the
x-axis representing the current
categorical feature, the hue
representing feat, and the
color palette for the plot.
Annotations are added to each
bar in the countplot using a
loop over g.patches. The
height of each bar is formatted
and displayed at the center.
The x-label is set to the name
of the current categorical
feature with a font size of 30.
The y-label is set to "Number
of Cases" with a font size of
30.
The legend in the plot is set to
a font size of 25.
9. After the loop, the layout of
subplots is adjusted for better
spacing using plt.tight_layout().
10. Finally, the plot is displayed using
plt.show().
This code generates a grid of countplots, where each plot
represents the distribution of the specified categorical
feature against the input feat. The count of cases is
displayed on top of each bar, and the legends show the
different categories of feat with different colors.
This will generate a grid of countplots, where each plot

represents the distribution of the "binaryClass" feature
against the other categorical features in your dataset.
The count of cases is displayed on top of each bar, and
the legends show the different categories of
"binaryClass" with different colors.
Distribution of Nine Categorical Features versus On

Antithyroid Medication
Step Plot distribution of number of cases of nine

1 categorical features versus on antithyroid
medication:
1 plot_one_versus_one_cat("on
antithyroid medication")
The result is shown in Figure 24. Let's dive into

more detail about the resulting plots from the
plot_one_versus_one_cat("on antithyroid
medication") function.
FTI measured:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"FTI measured".
"FTI measured" categories:
"t" (FTI measured) and "f"
(FTI not measured).
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Figure 24 The distribution of number of cases

of nine categorical features versus on
antithyroid medication
Sex:
The plot shows the
"sex".
"sex" categories: "F"
(female) and "M" (male).
number of cases.
coded accordingly.
bar.
On thyroxine:
The plot shows the
"on thyroxine".
"on thyroxine" categories:
"f" (not on thyroxine) and
"t" (on thyroxine).
number of cases.
coded accordingly.
bar.
Sick:
The plot shows the
"sick".
"sick" categories: "f" (not
sick) and "t" (sick).
number of cases.
coded accordingly.
bar.
Pregnant:
The plot shows the
"pregnant".
"pregnant" categories: "f"
(not pregnant) and "t"
(pregnant).
number of cases.
coded accordingly.
bar.
Goitre:
The plot shows the
"goitre".
"goitre" categories: "f" (no
goitre) and "t" (goitre
present).
number of cases.
coded accordingly.
bar.
Tumor:
The plot shows the
"tumor".
"tumor" categories: "f" (no
tumor) and "t" (tumor
present).
number of cases.
coded accordingly.
bar.
Hypopituitary:
The plot shows the
"hypopituitary".
"hypopituitary" categories:
"f" (not hypopituitary) and
"t" (hypopituitary present).
number of cases.
coded accordingly.
bar.
Psych:
The plot shows the
"psych".
"psych" categories: "f" (not
psych) and "t" (psych
issue).
number of cases.
coded accordingly.
bar.
By examining these plots, you can analyze the
distribution of the "on antithyroid medication"
feature across different categories of each
corresponding feature. It allows you to observe
any patterns, trends, or relationships between
the presence of antithyroid medication and
other categorical features in the dataset.
Thyroid Surgery

1 categorical features versus thyroid surgery:
1 plot_one_versus_one_cat("thyroid
surgery")
The result is shown in Figure 25. The purpose

of the code plot_one_versus_one_cat("thyroid
surgery") is to generate a set of subplots that
illustrate the distribution of the "thyroid
surgery" feature in relation to other categorical
features in the dataset.
The code creates a 3x3 grid of subplots using
the subplots() function from Matplotlib. Each
subplot represents the distribution of the
"thyroid surgery" feature for a specific
categorical feature. The categorical features
include "FTI measured", "sex", "on thyroxine",
"sick", "pregnant", "goitre", "tumor",
"hypopituitary", and "psych".

of nine categorical features versus thyroid
surgery
For each subplot, a countplot is generated

using Seaborn's countplot() function. The
countplot displays the number of cases for each
category of the selected categorical feature,
grouped by the presence or absence of "thyroid
surgery". The bars in the countplot are color-
coded to differentiate the categories.
Additionally, annotations are added to each bar

to display the count of cases on top of the bars.
The x-axis and y-axis labels are set to indicate
the categorical feature being analyzed and the
number of cases, respectively. The legend is
included in each subplot to represent the
presence or absence of "thyroid surgery".
The resulting plots allow for a visual

examination of how the distribution of "thyroid
surgery" varies across different categories of
each categorical feature. This analysis can
provide insights into any potential relationships
or associations between "thyroid surgery" and
other categorical variables in the dataset.

Lithium

1 categorical features versus lithium:
1 plot_one_versus_one_cat("lithium")
The result is shown in Figure 26. The code

plot_one_versus_one_cat("lithium") generates a
set of subplots to examine the distribution of
the "lithium" feature in relation to other
categorical features in the dataset.
of nine categorical features versus lithium

"lithium" feature for a specific categorical
feature. The categorical features considered in
this case are "FTI measured", "sex", "on
thyroxine", "sick", "pregnant", "goitre", "tumor",
For each subplot, a countplot is created using

Seaborn's countplot() function. The countplot
visualizes the number of cases for each
categorized by the presence or absence of
"lithium". The bars in the countplot are colored
differently to distinguish the categories.
Annotations are added to each bar to display

the count of cases on top of the bars. The x-axis
and y-axis labels are set to indicate the
categorical feature being analyzed and the
presence or absence of "lithium".
By examining these plots, one can gain insights

into how the distribution of the "lithium"
feature varies across different categories of
each categorical feature. This analysis can help
identify any potential relationships or patterns
between the use of lithium and the other
categorical variables in the dataset.
Distribution of Nine Categorical Features versus TSH
Measured

1 categorical features versus TSH measured:
1 plot_one_versus_one_cat("TSH
measured")
The result is shown in Figure 27. The code

plot_one_versus_one_cat("TSH measured")
generates a set of subplots to examine the
distribution of the "TSH measured" feature in
relation to other categorical features in the
dataset.

subplot represents the distribution of the "TSH
measured" feature for a specific categorical

categorized by the presence or absence of "TSH
measured". The bars in the countplot are
colored differently to distinguish the categories.
Annotations are added to each bar to display

the count of cases on top of the bars. The x-axis
and y-axis labels are set to indicate the
categorical feature being analyzed and the
presence or absence of "TSH measured".

into how the distribution of the "TSH
measured" feature varies across different
categories of each categorical feature. This
analysis can help identify any potential
relationships or patterns between the
measurement of TSH and the other categorical
variables in the dataset.

of nine categorical features versus TSH
measured
Distribution of Nine Categorical Features versus T3

Measured

1 categorical features versus T3 measured:
1 plot_one_versus_one_cat("T3
measured")

of nine categorical features versus T3
measured
The purpose of the code

plot_one_versus_one_cat("T3 measured") is to
generate a set of subplots that depict the
distribution of the "T3 measured" feature in
dataset.

subplot represents the distribution of the "T3

categorized by the presence or absence of "T3
measured". Different colors are used to
differentiate between the categories.
Annotations are added to each bar in the

countplot to display the count of cases on top of
the bars. The x-axis and y-axis labels are set to
indicate the categorical feature being analyzed
and the number of cases, respectively. The
legend in each subplot represents the presence
or absence of "T3 measured".
into the distribution of the "T3 measured"
categorical feature. This analysis can help
identify any patterns or relationships between
the measurement of T3 and the other
Distribution of Nine Categorical Features versus TT4

Measured

1 categorical features versus TT4 measured:
1 plot_one_versus_one_cat("TT4
measured")

of the code plot_one_versus_one_cat("TT4
measured") is to generate a set of subplots that
depict the distribution of the "TT4 measured"
feature in relation to other categorical features
in the dataset.

subplot represents the distribution of the "TT4

categorized by the presence or absence of "TT4

or absence of "TT4 measured".

into the distribution of the "TT4 measured"
the measurement of TT4 and the other

of nine categorical features versus TT4
measured
Distribution of Nine Categorical Features versus T4U

Measured

1 categorical features versus T4U measured:
1 plot_one_versus_one_cat("T4U
measured")

of nine categorical features versus T4U
measured

plot_one_versus_one_cat("T4U measured") is to
distribution of the "T4U measured" feature in
dataset.

subplot represents the distribution of the "T4U

categorized by the presence or absence of "T4U

or absence of "T4U measured".
into the distribution of the "T4U measured"
the measurement of T4U and the other
Distribution of Nine Categorical Features versus FTI

Measured

1 categorical features versus FTI measured:
1 plot_one_versus_one_cat("FTI
measured")

of nine categorical features versus FTI
measured
plot_one_versus_one_cat("FTI measured") is to
distribution of the "FTI measured" feature in
dataset.
subplot represents the distribution of the "FTI
this case are "sex", "on thyroxine", "sick",
"pregnant", "goitre", "tumor", "hypopituitary",
and "psych".

categorized by the presence or absence of "FTI

or absence of "FTI measured".

into the distribution of the "FTI measured"
the measurement of FTI and the other
Distribution of Nine Categorical Features versus I131

Measured

1 categorical features versus I131 treatment:
1 plot_one_versus_one_cat("I131
treatment")

of the code plot_one_versus_one_cat("I131
treatment") is to generate a set of subplots that
depict the distribution of the "I131 treatment"
in the dataset.

subplot represents the distribution of the "I131
treatment" feature for a specific categorical

"I131 treatment". Different colors are used to

or absence of "I131 treatment".

into the distribution of the "I131 treatment"
the I131 treatment and the other categorical
variables in the dataset.
of nine categorical features versus I131
treatment

Query Hypothyroid

1 categorical features versus query
hypothyroid:
1 plot_one_versus_one_cat("query
hypothyroid")

of nine categorical features versus query
hypothyroid

plot_one_versus_one_cat("query hypothyroid")
is to generate a set of subplots that illustrate
the distribution of the "query hypothyroid"
in the dataset.

"query hypothyroid" feature for a specific
categorical feature. The categorical features
considered in this case are "FTI measured",
"sex", "on thyroxine", "sick", "pregnant",
"goitre", "tumor", "hypopituitary", and "psych".
"query hypothyroid". Different colors are used
to differentiate between the categories.

or absence of "query hypothyroid".

into the distribution of the "query hypothyroid"
the "query hypothyroid" and the other
Percentage Distribution of On Thyroxine and On

Antithyroid Medication versus Target Variable
Step Plot the percentage distribution of on thyroxine

1 and on antithyroid medication versus
binaryClass in pie chart:
1 def
2 plot_feat1_feat2_vs_target_pie(df,
target, col1, col2):
3
gs0 = df[df[target] == 'P']
4
[col1].value_counts()
5
gs1 = df[df[target] == 'N']
6 [col1].value_counts()
7 ss0 = df[df[target] == 'P']
9 ss1 = df[df[target] == 'N']
11
12 col1_labels = df[col1].unique()
13 col2_labels = df[col2].unique()
14
15 _, ax = plt.subplots(2, 2,
16 figsize=(20, 20),
facecolor='#f7f7f7')
17
18
cmap = plt.get_cmap('Pastel1')
19
# Define color map
20
21
ax[0][0].pie(gs0,
22 labels=col1_labels[:len(gs0)],
23 shadow=True, autopct='%1.1f%%',
24 explode=[0.03] * len(gs0),
25
colors=cmap(np.arange(len(gs0))),
26
textprops={'fontsize': 20})
27
ax[0][1].pie(gs1,
28 labels=col1_labels[:len(gs1)],
30 explode=[0.03] * len(gs1),
31
colors=cmap(np.arange(len(gs1))),
32
33
ax[1][0].pie(ss0,
34 labels=col2_labels[:len(ss0)],
36 explode=[0.04] *
37 len(ss0),
colors=cmap(np.arange(len(ss0))),
38
39
ax[1][1].pie(ss1,
40 labels=col2_labels[:len(ss1)],
42 explode=[0.04] *
43 len(ss1),
colors=cmap(np.arange(len(ss1))),
44
45
46
ax[0][0].set_title(f"
47 {target.capitalize()} = 0",
48 fontsize=30)
49 ax[0][1].set_title(f"
50 {target.capitalize()} = 1",
fontsize=30)
51
plt.show()
52
53
# Print each pie chart as a table
54
in tabular form for each stroke
55
print(f"\n{col1.capitalize()}:")
56
57 print(f"{target.capitalize()} =
58 0:")
59 gs0_table =
pd.DataFrame({'Categories':
60
col1_labels[:len(gs0)],
61 'Percentage': gs0 / gs0.sum() *
62 100, 'Count': gs0})
63 print(gs0_table)
64
65 print(f"\n{target.capitalize()} =
66 1:")
gs1_table =
col1_labels[:len(gs1)],
'Percentage': gs1 / gs1.sum() *
100, 'Count': gs1})
print(gs1_table)
print(f"{target.capitalize()} =
0:")
ss0_table =
col2_labels[:len(ss0)],
'Percentage': ss0 / ss0.sum() *
100, 'Count': ss0})
print(ss0_table)
print(f"\n{target.capitalize()} =
1:")
ss1_table =
col2_labels[:len(ss1)],
'Percentage': ss1 / ss1.sum() *
100, 'Count': ss1})
print(ss1_table)
plot_feat1_feat2_vs_target_pie(df,
"binaryClass", "on thyroxine", "on
antithyroid medication")
The result is shown in Figure 35. Here's a step-by-

step explanation of the code:
1. The function
plot_feat1_feat2_vs_target_pie
takes in a DataFrame df, a
target variable target, and
two feature variables col1 and
col2 as input.
2. The counts of the feature
values are computed based on
the target variable. For each
target value ('P' and 'N'), the
counts of feature values are
stored in gs0, gs1, ss0, and
ss1.
3. Unique labels for col1 and
col2 are obtained and stored
in col1_labels and col2_labels,
respectively.
4. A 2x2 subplot figure is
created with the specified
figsize and facecolor.
5. The color map 'Pastel1' is
obtained using
plt.get_cmap('Pastel1'). This
color map will be used to
assign colors to the pie chart
slices.
6. Pie charts are plotted on each
axis of the subplot figure. The
pie function is called for each
target and feature
combination, with the
corresponding counts, labels,
and other properties. The
colors argument is set to
cmap(np.arange(len(gs0))) to
assign colors from the color
map based on the number of
feature values.
7. Titles for the subplots are set
using the target values.
8. The pie charts and subplots
are displayed using
plt.show().
9. Tabular representations of
the pie charts are printed.
DataFrames gs0_table,
gs1_table, ss0_table, and
ss1_table are created to store
the categories, percentages,
and counts of each feature
value for the corresponding
target value.
10. The data in the DataFrames is
printed to show the
categories, percentages, and
counts for each target and
feature combination.
Each pie chart visualizes the distribution of feature
values for a specific target value ('P' or 'N'). The
percentages and counts provide additional insights
into the distribution of the feature values based on
the target variable.
The plot_feat1_feat2_vs_target_pie() function

generates a set of four pie charts that illustrate the
distribution of feature values for two given
features (col1 and col2) based on the target
variable (binaryClass).
In this specific case, the function is called with col1

set as "on thyroxine" and col2 set as "on
antithyroid medication". The resulting plots show
the distribution of these two features based on the
binary classification target variable.
Here's a breakdown of the resulting plots:

The top-left plot: This pie
chart represents the
distribution of "on thyroxine"
feature values for the target
value 'P' (positive class). The
labels on the pie chart
represent the unique feature
values ("f" and "t"), and the
percentages indicate the
proportion of each feature
value within the positive
class.
The top-right plot: This pie
chart shows the distribution
of "on thyroxine" feature
values for the target value 'N'
(negative class). Similarly,
the labels and percentages
represent the unique feature
values and their proportions
within the negative class.
The bottom-left plot: This pie
chart represents the
distribution of "on antithyroid
medication" feature values for
the target value 'P' (positive
class). The labels on the pie
chart represent the unique
feature values ("f" and "t"),
and the percentages indicate
the proportion of each feature
value within the positive
class.
The bottom-right plot: This
pie chart shows the
distribution of "on antithyroid
medication" feature values for
the target value 'N' (negative
class). Similarly, the labels
and percentages represent
the unique feature values and
their proportions within the
negative class.
The titles of each plot indicate the target value ('P'
or 'N') they represent.
Additionally, after the pie charts are displayed,
tabular representations of the data are printed.
These tables provide further information about the
categories, percentages, and counts of the feature
values for each target and feature combination.
Figure 35 The percentage distribution of on

thyroxine and on antithyroid medication versus
binaryClass in pie chart
Output:
On thyroxine:
Binaryclass = 0:
Categories Percentage Count
f f 86.929043 3026
t t 13.070957 455
Binaryclass = 1:
f f 96.907216 282
t t 3.092784 9
On antithyroid medication:
Binaryclass = 0:
f f 98.79345 3439
t t 1.20655 42
Binaryclass = 1:
f f 99.656357 290
t t 0.343643 1
Based on the output analysis:
For the feature "on thyroxine":

Among the cases where
Binaryclass = 0 (negative),
the majority (86.93%) of
individuals are not on
thyroxine medication
(category 'f'). Only a small
portion (13.07%) are on
(category 't').
In contrast, among the cases
where Binaryclass = 1
(positive), a higher
percentage (96.91%) of
(category 'f'), and only a very
small percentage (3.09%) are
on thyroxine medication
(category 't').
For the feature "on antithyroid medication":
In the cases where
Binaryclass = 0 (negative),
almost all individuals
(98.79%) are not on
(category 'f'). Only a very
small percentage (1.21%) are
on antithyroid medication
(category 't').
Similarly, in the cases where
Binaryclass = 1 (positive), the
majority (99.66%) of
(category 'f'), and only a
negligible percentage (0.34%)
are on antithyroid medication
(category 't').
From this analysis, we can conclude that being on
thyroxine medication or antithyroid medication
seems to have a higher prevalence among
individuals with Binaryclass = 0 (negative). On the
other hand, the majority of individuals with
Binaryclass = 1 (positive) are not on these
medications. These findings suggest a potential
correlation between the use of these medications
and the binary classification of the thyroid
condition.
Percentage Distribution of Sick and Pregnant versus

Target Variable
Step Plot the percentage distribution of sick and

1 pregnant versus binaryClass in pie chart:
2 "binaryClass", "sick", "pregnant")
Figure 36 The percentage distribution of sick

and pregnant versus binaryClass in pie chart
plot_feat1_feat2_vs_target_pie() is to visualize
and analyze the distribution of two categorical
features (sick and pregnant) with respect to a
binary target variable (binaryClass). It
generates a set of four pie charts to compare
the distribution of each feature's categories for
different values of the target variable.
The code calculates the count of each category

in the features sick and pregnant for each value
of the target variable. It then creates a 2x2 grid
of subplots to plot the pie charts. Each pie chart
represents a combination of the target variable
and one of the features.
The pie charts are color-coded using a
colormap, and the percentage and count of
each category are displayed inside the pie
slices. The titles of the subplots indicate the
value of the target variable.
After plotting the pie charts, the code also

prints the data in tabular form, showing the
percentage and count of each category for each
combination of the target variable and feature.
Overall, the purpose of the code is to provide a

visual and tabular analysis of how the
categories in the sick and pregnant features are
distributed among the different values of the
binaryClass target variable.
Output:
Sick:
Binaryclass = 0:
f f 96.093077 3345
t t 3.906923 136
Binaryclass = 1:
f f 96.219931 280
t t 3.780069 11
Pregnant:
Binaryclass = 0:
f f 98.477449 3428
t t 1.522551 53
Binaryclass = 1:
f f 100.0 291
The output shows the analysis of the

distribution of categories in the sick and
pregnant features with respect to the
binaryClass target variable.
For the sick feature:

When binaryClass = 0, the
majority category (f)
represents 96.1% of the
cases, with a count of 3345.
The minority category (t)
For the pregnant feature:

only category present (f)
represents 100% of the
Based on this analysis, we can observe the
following:
For both the sick and
pregnant features, the
majority category
dominates the distribution
for both values of
binaryClass.
The percentage distribution
of categories in the sick
feature is similar for both
values of binaryClass,
indicating that being sick
does not strongly
differentiate between the
two binary classes.
The pregnant feature
shows a higher percentage
of the majority category (f)
for both values of
binaryClass, indicating that
being pregnant is more
prevalent in the negative
class (binaryClass = 0).
These insights can help understand the
relationship between the features and the
target variable and may be useful in further
analysis or modeling tasks.
Percentage Distribution of Lithium and Tumor versus

Target Variable
Step Plot the percentage distribution of lithium and

1 tumor versus binaryClass in pie chart:
2 "binaryClass", "lithium", "tumor")
The result is shown in Figure 37. The resulting

plot from executing the code
plot_feat1_feat2_vs_target_pie(df,
"binaryClass", "lithium", "tumor") would display
a 2x2 grid of pie charts. Each pie chart
represents the distribution of categories within
a specific feature and is categorized by the
binary target variable ("binaryClass").
Here's an explanation of the resulting plot:
Top-Left Chart:
Title: "BinaryClass = 0"
Feature: "lithium"
This pie chart shows the
distribution of categories in
the "lithium" feature for the
cases where the binary
target variable is 0.
Each category is
represented by a slice in
the pie chart, and the size
of each slice corresponds to
the proportion of cases
belonging to that category.
The labels inside the pie
chart indicate the category
names, and the percentage
values represent the
proportion of cases for
each category.
The colors of the slices in
the pie chart are assigned
using the "Pastel1"
colormap.
Top-Right Chart:
Feature: "lithium"
This pie chart shows the
the "lithium" feature for the
Similar to the top-left
chart, the slices in the pie
chart represent the
categories, and the sizes of
the slices correspond to the
each category.
chart display the category
names and the percentage
values indicate the
each category.
The colors of the slices are
assigned using the
"Pastel1" colormap.
Bottom-Left Chart:
Feature: "tumor"
This pie chart represents
the distribution of
categories in the "tumor"
feature for the cases where
the binary target variable is
0.
Each slice in the pie chart
the size of the slice reflects
the proportion of cases for
that category.
chart indicate the category
values represent the
each category.
assigned using the
"Pastel1" colormap.
Bottom-Right Chart:
Feature: "tumor"
This pie chart displays the
the "tumor" feature for the
Similar to the other pie
charts, each slice
the size of the slice
corresponds to the
proportion of cases for that
category.
chart show the category
values indicate the
each category.
assigned using the
"Pastel1" colormap.
Overall, the plot provides an intuitive
visualization of how the categories in the
"lithium" and "tumor" features are distributed
across the two target values ("binaryClass = 0"
and "binaryClass = 1"). It allows for a quick
comparison of category proportions within each
feature based on the binary target variable.
Figure 37 The percentage distribution of
lithium and tumor versus binaryClass in pie
chart
Output:
Lithium:
Binaryclass = 0:
f f 99.511635 3464
t t 0.488365 17
Binaryclass = 1:
f f 99.656357 290
t t 0.343643 1
Tumor:
Binaryclass = 0:
f f 97.471991 3393
t t 2.528009 88
Binaryclass = 1:
f f 97.250859 283
t t 2.749141 8
Based on the output, we can analyze the

distribution of categories within the "lithium"
and "tumor" features based on the binary target
variable ("binaryClass"). Here are some
observations:
Lithium:
For cases where binaryClass = 0:
The category "f" (not on
lithium) accounts for 99.5%
of the cases, with a count
of 3464.
The category "t" (on
of 17.
The category "f" (not on
of 290.
The category "t" (on
of 1.
Tumor:
The category "f" (no tumor)
accounts for 97.5% of the
The category "t" (has
tumor) accounts for 2.5%
of 88.
The category "f" (no tumor)
accounts for 97.3% of the
The category "t" (has
tumor) accounts for 2.7%
of 8.
From these observations, we can conclude:
The majority of cases in
both binaryClass = 0 and
binaryClass = 1 are not on
lithium.
The majority of cases in
both binaryClass = 0 and
binaryClass = 1 do not
have a tumor.
The proportion of cases
with lithium or a tumor is
generally higher in
binaryClass = 1 compared
to binaryClass = 0,
indicating a potential
association between these
features and the binary
target variable.
These insights can help in understanding the
relationships between the "lithium" and "tumor"
features and the binary target variable,
providing valuable information for further
analysis or decision-making.
Distribution of Nine Categorical Features versus Age
Step Plot the distribution of nine categorical features versus age feature:
1
1 def feat_versus_other(feat,another,legend,ax0,label):
2 for s in ["right", "top"]:
3 ax0.spines[s].set_visible(False)
4
5 ax0_sns = sns.histplot(data=df,
6 x=feat,ax=ax0,zorder=2,kde=False,hue=another,multiple="stack",
shrink=.8
7
,linewidth=0.3,alpha=1)
8
9 put_label_stacked_bar(ax0_sns,15)
10 ax0_sns.set_xlabel('',fontsize=30, weight='bold')
11 ax0_sns.set_ylabel('',fontsize=30, weight='bold')
12 ax0_sns.grid(which='major', axis='x', zorder=0,
13 color='#EEEEEE', linewidth=0.4)
14 ax0_sns.grid(which='major', axis='y', zorder=0,
color='#EEEEEE', linewidth=0.4)
15
16
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
17
ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8',
18
fontsize=10, bbox_to_anchor=(1, 0.989), loc='upper right')
19
ax0_sns.set_xlabel(label)
20
plt.tight_layout()
21
22
label_bin = list(df["binaryClass"].value_counts().index)
23
label_sex = list(df["sex"].value_counts().index)
24
label_thyroxine = list(df["on
25 thyroxine"].value_counts().index)
27 label_pregnant = list(df["pregnant"].value_counts().index)
28 label_lithium = list(df["lithium"].value_counts().index)
29 label_goitre = list(df["goitre"].value_counts().index)
30 label_tumor = list(df["tumor"].value_counts().index)
31 label_tsh = list(df["TSH measured"].value_counts().index)
32 label_tt4 = list(df["TT4 measured"].value_counts().index)
33
34 label_dict = {
35 "binaryClass": label_bin,
36 "sex": label_sex,
37 "on thyroxine": label_thyroxine,
38 "pregnant": label_pregnant,
39 "lithium": label_lithium,
40 "goitre": label_goitre,
41 "tumor": label_tumor,
42 "TSH measured": label_tsh,
43 "TT4 measured": label_tt4
44 }
45
46 def hist_feat_versus_nine_cat(feat, label):
47 ax_list = []
48
49 fig, axes = plt.subplots(3, 3, figsize=(30, 20))
50 axes = axes.flatten()
51
52
53 for i, (cat_feature, label_var) in
54 enumerate(label_dict.items()):
55 ax = axes[i]
56 feat_versus_other(feat, df[cat_feature], label_var,
ax, f"{cat_feature} versus {label}")
57
ax_list.append(ax)
58
59
plt.tight_layout()
60
plt.show()
61
62
hist_feat_versus_nine_cat(df["age"],"age")
63
The result is shown in Figure 38. Here's a step-by-step explanation of the

code:
1. The feat_versus_other() function is defined. It
takes five parameters: feat (the main feature to
be plotted), another (the other categorical
feature for comparison), legend (a list of labels
for the legend), ax0 (the subplot axis), and label
(the label for the x-axis).
2. The function loops over the right and top spines
of ax0 and sets their visibility to False, effectively
removing them from the plot.
3. The ax0_sns variable is assigned the result of
sns.histplot function. It plots a histogram using
the data from the DataFrame df, with x as feat.
The ax0 parameter is used to specify the subplot
axis. The histogram is stacked based on the
another feature, and the hue parameter is used
to differentiate the categories. The kde, multiple,
and shrink parameters control the appearance of
the histogram. The linewidth and alpha
parameters are used to adjust the line width and
transparency of the histogram bars.
4. The put_label_stacked_bar() function is called
with ax0_sns as a parameter. This function adds
labels to the stacked bars in the histogram.
5. Various formatting settings are applied to
ax0_sns. The x and y labels are removed
(set_xlabel and set_ylabel with empty strings).
Grid lines are added with light gray color using
grid function. Tick parameters are adjusted
using tick_params. The legend is added using the
legend function, specifying the legend list for
labels, the number of columns, face color, font
size, and location. The x-label is set to label.
6. plt.tight_layout() ensures that the subplots are
properly spaced.
7. The label_dict dictionary is defined, containing
the labels for each categorical feature.
8. The hist_feat_versus_nine_cat() function is
defined. It takes feat (the main feature to be
plotted) and label (the label for the x-axis).
9. An empty list ax_list is created to store the
subplot axes.
10. A figure and subplots are created using
plt.subplots, specifying the number of rows,
columns, and the figure size. The subplots are
flattened to a 1D array using flatten().
11. The enumerate function is used to iterate over
the items of label_dict, which contains the
categorical features and their labels.
12. Inside the loop, an axis ax is assigned from the
flattened axes array.
13. The feat_versus_other function() is called with
the appropriate arguments, including feat,
df[cat_feature] (the corresponding feature values
from the DataFrame), label_var (the label list for
the legend), ax, and the combined label for the x-
axis.
14. The ax is appended to the ax_list.
15. After the loop, plt.tight_layout() ensures proper
spacing of the subplots.
16. The resulting plot is displayed using plt.show().
17. The hist_feat_versus_nine_cat() function is called
with df["age"] as the feat parameter and "age" as
the label parameter, to plot the feature "age"
against each of the nine categorical features.
This code generates a 3x3 grid of subplots, where each subplot represents the
distribution of the specified feat feature across different categories of each
categorical feature. The subplots use stacked histograms to visualize the
distribution, and each subplot has its own color scheme and legend.
The resulting plots show the distribution of the "age" feature in relation to
each of the nine categorical features in the dataset. Each subplot represents a
different categorical feature, and the stacked histograms illustrate the
distribution of age within each category.
The purpose of this code is to provide a visual representation of how the "age"
feature varies across different categories of the categorical features. It allows
for easy comparison and analysis of the age distribution within each category,
providing insights into potential relationships or patterns between age and the
categorical features.
By examining the resulting plots, we can gain insights into how the age
distribution is influenced by each categorical feature. We can observe the
distribution patterns and differences between categories, as well as any
notable trends or relationships that may emerge. This analysis can be useful
for understanding the impact of categorical variables on the distribution of
the "age" feature and exploring potential associations in the dataset.
Figure 38 The distribution of nine categorical features versus age feature
Probability Density of Nine Categorical Features versus Age
Step Plot the density of nine categorical features versus age feature:
1
1 def prob_feat_versus_other(feat,another,legend,ax0,label):
4
5 ax0_sns =
6 sns.kdeplot(x=feat,ax=ax0,hue=another,linewidth=0.3,fill=True,cbar='
7
8 ax0_sns.set_xlabel('',fontsize=20, weight='bold')
9 ax0_sns.set_ylabel('',fontsize=20, weight='bold')
10
11 ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE',
12 ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE',
13
14 ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
15 ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', fontsize=25,
loc='upper right')
16
17
plt.tight_layout()
18
19
def prob_feat_versus_nine_cat(feat, label):
20
fig, axes = plt.subplots(3, 3, figsize=(30, 20))
21
22
23
for i, (cat_feature, label_var) in enumerate(label_dict.items()):
24
ax = axes[i]
25
prob_feat_versus_other(feat, df[cat_feature], label_var, ax,
27
28
plt.tight_layout()
29
plt.show()
30
31
prob_feat_versus_nine_cat(df["age"],"age")
32
33
34
35
36
37

Figure 39 The density of nine categorical features vers
Here are the steps for the code:
1. The function prob_feat_versus_other() is defined,
parameters: feat (the main feature of interest), anot
compare with), legend (the legend labels for the
subplot axes), and label (the label for the plot).
2. The code sets the visibility of the right and top spine
removing the right and top borders of the plot.
3. It creates a KDE (Kernel Density Estimate) plot usin
represents the main feature (feat) and the hue repr
(another). The KDE plot is filled and stacked ba
categorical feature.
4. Various formatting settings are applied, such as setti
adjusting grid lines, tick parameters, and the legend a
5. The function prob_feat_versus_nine_cat() is defined,
feat and label as input.
6. It creates a figure with subplots using plt.subplots and
7. It iterates over the label_dict dictionary, which contai
their respective labels.
8. For each categorical feature, it retrieves the corresp
and calls the prob_feat_versus_other() function to crea
9. Finally, it adjusts the layout and displays the plo
plt.show().
In summary, the purpose of this code is to create KDE plots to compare the distribu
categories of each of the nine categorical features in the dataset. It allows for visua
across different categories and provides insights into the relationship between the m
The prob_feat_versus_nine_cat() function is used to generate a set of subplots, each s

plot of a numerical feature (df["age"] in this case) against nine different catego
resulting plots:
1. The resulting plots are arranged in a 3x3 grid, with
KDE plot of the numerical feature "age" against a spec
2. Each subplot shows the KDE plot with the probabilit
numerical feature ("age") on the x-axis. The KDE
estimate of the probability density function.
3. The hue parameter (another) is used to differentiat
values of the categorical feature being plotted aga
shades are used to represent different categories of th
4. The legend in each subplot indicates the categories o
feature and their corresponding color representation i
5. The subplots have gridlines on both the x-axis and
interpret the density estimation.
6. The x-axis and y-axis labels indicate the numerical fe
the probability density, respectively.
The purpose of these plots is to visualize the distribution of the numerical featu
categorical features. It allows for a quick comparison of the density estimates and
the distribution of "age" among the categories.
Distribution and Probability Density of Nine Categorical Features

versus TSH
Step Plot the distribution and its density of nine categorical

1 features versus TSH feature:
1 hist_feat_versus_nine_cat(df_dummy["TSH"],"TSH")
2 prob_feat_versus_nine_cat(df_dummy["TSH"],"TSH")
The results are shown in Figure 40 and Figure 41. The

functions hist_feat_versus_nine_cat() and
prob_feat_versus_nine_cat() are used to generate plots that
compare the distribution of the numerical feature "TSH"
against nine different categorical features. Here's an
explanation of the resulting plots:
hist_feat_versus_nine_cat(): This function generates a set of

subplots where each subplot shows a histogram of the
feature "TSH" for different categories of a categorical
feature.
1. Each subplot represents the
histogram of the "TSH" feature for a
specific categorical feature.
2. The x-axis represents the values of
the "TSH" feature, and the y-axis
represents the frequency or count of
occurrences.
3. The histograms for different
categories of the categorical feature
are stacked on top of each other,
allowing for visual comparison of the
distribution across categories.
4. The legend in each subplot indicates
the categories of the corresponding
categorical feature and their
corresponding color representation
in the histogram.
prob_feat_versus_nine_cat(): This function generates a set of
subplots where each subplot shows the kernel density
estimation (KDE) plot of the feature "TSH" against different
categories of a categorical feature.
1. Each subplot represents the KDE
plot of the "TSH" feature for a
specific categorical feature.
2. The x-axis represents the values of
the "TSH" feature, and the y-axis
represents the probability density
estimate.
3. The KDE plots for different
categories of the categorical feature
are overlaid on top of each other,
allowing for visual comparison of the
distribution across categories.
4. The legend in each subplot indicates
the categories of the corresponding
categorical feature and their
corresponding color representation
in the KDE plot.
Figure 40 The distribution of four categorical features
versus TSH feature
The purpose of these plots is to analyze and compare the

distribution of the numerical feature "TSH" across different
categories of the categorical features. The histograms
provide a visual representation of the frequency or count
distribution, while the KDE plots provide a smooth estimate
of the probability density distribution. These plots can help
identify any patterns, differences, or outliers in the
distribution of "TSH" among the categories, facilitating
exploratory data analysis and inference.
Figure 41 The density of four categorical features versus

TSH feature
Distribution and Probability Density of Nine Categorical

Features versus T3

1 features versus T3 feature:
1 hist_feat_versus_nine_cat(df_dummy["T3"],"T3")
2 prob_feat_versus_nine_cat(df_dummy["T3"],"T3")
The results are shown in Figure 42 and Figure 43.
hist_feat_versus_nine_cat():
The purpose of this plot is to
compare the distribution of the
numerical feature "T3" across
different categories of each
categorical feature. By visualizing
histograms for each category, we
can gain insights into how the
distribution of "T3" varies within
each category. This plot helps us
understand the range, spread, and
frequency of "T3" values within
different groups defined by the
categorical features. It allows us to
identify any notable differences or
similarities in the distribution of
"T3" across categories, which can
be useful for identifying patterns
or potential relationships between
"T3" and the categorical variables.
prob_feat_versus_nine_cat():
compare the probability density
estimation (KDE plot) of the
numerical feature "T3" across
categorical feature. KDE plots
provide a smooth estimate of the
underlying probability distribution,
allowing us to visualize the shape
and density of "T3" values within
analyze the relative likelihood of
different "T3" values occurring
within each category and observe
any differences or similarities in
the distribution patterns. It is
particularly useful for identifying
modes, peaks, or variations in the
density of "T3" values across
categories, providing insights into
the relationship between "T3" and
the categorical variables.

versus T3 feature
In summary, both plots aim to provide a visual

representation of how the numerical feature "T3" is
distributed across different categories of the categorical
features. They enable us to explore the variations,
patterns, and relationships between "T3" and the
categorical variables, facilitating data analysis, and
interpretation.
T3 feature

versus TT4

1 features versus TT4 feature:
1 hist_feat_versus_nine_cat(df_dummy["TT4"],"TT4")
2 prob_feat_versus_nine_cat(df_dummy["TT4"],"TT4")
numerical feature "TT4" across
can gain insights into how the
distribution of "TT4" varies within
understand the range, spread, and
frequency of "TT4" values within
different groups defined by the
categorical features. It allows us to
identify any notable differences or
similarities in the distribution of
"TT4" across categories, which can
be useful for identifying patterns or
potential relationships between
"TT4" and the categorical variables.
numerical feature "TT4" across
and density of "TT4" values within
different "TT4" values occurring
any differences or similarities in the
distribution patterns. It is
density of "TT4" values across
the relationship between "TT4" and

representation of how the numerical feature "TT4" is
features. They enable us to explore the variations, patterns,
and relationships between "TT4" and the categorical
variables, facilitating data analysis and interpretation.
versus TT4 feature

TT4 feature

versus T4U

1 features versus T4U feature:
1 hist_feat_versus_nine_cat(df_dummy["T4U"],"T4U")
2 prob_feat_versus_nine_cat(df_dummy["T4U"],"T4U")
numerical feature "T4U" across
can examine how the distribution of
"T4U" varies within each category.
This plot helps us understand the
range, spread, and frequency of
"T4U" values within different groups
defined by the categorical features.
It allows us to identify any notable
differences or similarities in the
distribution of "T4U" across
categories, which can provide
insights into potential relationships
between "T4U" and the categorical
variables.
numerical feature "T4U" across
and density of "T4U" values within
different "T4U" values occurring
any differences or similarities in the
distribution patterns. It is
density of "T4U" values across
the relationship between "T4U" and
representation of how the numerical feature "T4U" is
features. They enable us to explore the variations, patterns,
and relationships between "T4U" and the categorical
variables, facilitating data analysis and interpretation.

versus T4U feature

T4U feature
versus FTI

1 features versus FTI feature:
1 hist_feat_versus_nine_cat(df_dummy["FTI"],"FTI")
2 prob_feat_versus_nine_cat(df_dummy["FTI"],"FTI")
The results are shown in Figure 48 and Figure 49. The

purpose of each resulting plot can be explained as follows:
hist_feat_versus_nine_cat(df_dummy["FTI"],"FTI"):
This plot displays the distribution of
FTI (Free Thyroxine Index) values
across different categories of the
nine selected features.
Each subplot represents a different
categorical feature, and the
histogram bars show the frequency
or count of FTI values within each
category.
The plot allows us to observe how
the FTI values are distributed
among different categories and
identify any patterns or variations
across the selected features.
prob_feat_versus_nine_cat(df_dummy["FTI"],"FTI"):
This plot shows the estimated
probability density of FTI values
across different categories of the
nine selected features.
Each subplot corresponds to a
categorical feature, and the KDE
(Kernel Density Estimation) curves
represent the probability
distribution of FTI values within
each category.
The plot helps us assess the
likelihood of observing specific FTI
values within each category and
compare the density patterns across
the selected features.
These plots provide visual representations of the
relationship between the FTI values and the selected
categorical features. They allow us to explore how the FTI
values are distributed across different categories and gain
insights into any variations or patterns that may exist. The
histogram plot emphasizes the frequency or count of FTI
values, while the KDE plot focuses on the probability
density. Together, they provide a comprehensive
understanding of the relationship between the FTI values
and the categorical variables.

versus FTI feature

FTI feature
PREDICTING
THYROID
USING MACHINE LEARNING
PREDICTING
THYROID
USING MACHINE LEARNING
Converting Categorical Columns into Numerical
Step Convert binaryClass and sex features to {0,1},

1 replace t to 1 and f to 0, extract output and
input variables, and plot feature importance
using RandomForest Classifier:
1 #Converts binaryClass feature to

2 {0,1}
3 def map_binaryClass(n):
4 if n == "N":
5 return 0
6
7 else:
8 return 1
9 df['binaryClass'] =
df['binaryClass'].apply(lambda x:
10
map_binaryClass(x))
11
12
#Converts sex feature to {0,1}
13
def map_sex(n):
14
if n == "F":
15
return 0
16
17
else:
18
return 1
19
df['sex'] = df['sex'].apply(lambda
20 x: map_sex(x))
df=df.replace({"t":1,"f":0})
The code can be explained in the following

steps:
1. map_binaryClass()
function:
This function takes a
value n as input and
maps it to either 0 or 1
based on the condition.
If the input n is equal to
"N", it returns 0.
Otherwise, it returns 1.
This function is used to
convert the values in
the "binaryClass"
column to numerical
values (0 or 1)
representing the two
classes.
2. Applying
map_binaryClass() function
to "binaryClass" column:
The apply method is
used on the
"binaryClass" column of
the DataFrame (df).
It applies the
map_binaryClass()
function to each value
in the "binaryClass"
column, effectively
converting the values to
0 or 1 based on the
mapping.
3. map_sex() function:
This function takes a
value n as input and
maps it to either 0 or 1
based on the condition.
If the input n is equal to
"F", it returns 0.
Otherwise, it returns 1.
This function is used to
convert the values in
the "sex" column to
numerical values (0 or
1) representing the two
genders.
4. Applying map_sex()
function to "sex" column:
The apply method is
used on the "sex"
column of the
DataFrame (df).
It applies the map_sex()
function to each value
in the "sex" column,
effectively converting
the values to 0 or 1
based on the mapping.
5. Replacing values in the
DataFrame:
The replace method is
used on the DataFrame
(df) to replace specific
values.
In this case, the values
"t" and "f" are replaced
with 1 and 0,
respectively, throughout
the entire DataFrame.
This step is likely
performed to convert
other categorical
variables represented as
"t" and "f" to numerical
values (1 and 0) for
further analysis or
modeling purposes.
Overall, these steps involve mapping and
replacing values in the DataFrame to convert
categorical variables into numerical
representations, enabling easier data analysis
or modeling tasks.
Feature Importance Using Random Forest
Step Convert binaryClass and sex features to {0,1},

1 replace t to 1 and f to 0, extract output and input
variables, and plot feature importance using
RandomForest Classifier:
1 #Extracts output and input
2 variables
3 y = df['binaryClass'].values #
Target for the model
4
X = df.drop(['binaryClass'], axis =
5
1)
6
7
#Feature Importance using
8 RandomForest Classifier
9 names = X.columns
10 rf = RandomForestClassifier()
11 rf.fit(X, y)
12
13 result_rf = pd.DataFrame()
14 result_rf['Features'] = X.columns
15 result_rf ['Values'] =
16 rf.feature_importances_
17 result_rf.sort_values('Values',
18 inplace = True, ascending = False)
19
plt.figure(figsize=(11,11))
sns.set_color_codes("pastel")
sns.barplot(x = 'Values',y =
'Features', data=result_rf,
color="Blue")
plt.show()
The result is shown in Figure 50. The code can be

explained in the following steps:
1. Extracting output and input
variables:
The target variable for the
model is extracted from the
DataFrame df and assigned
to the variable y. In this case,
the target variable is stored
in the column "binaryClass".
The input variables are
extracted by dropping the
"binaryClass" column from
the DataFrame df. The
resulting DataFrame without
the target column is assigned
to the variable X.
2. Feature Importance using
Random Forest Classifier:
The variable names is
assigned the column names
of the DataFrame X.
An instance of the
RandomForestClassifier class
is created and assigned to
the variable rf.
The fit method is called on
the rf object, fitting the
model to the input variables
X and the target variable y.
3. Creating a DataFrame to store
feature importance results:
An empty DataFrame named
result_rf is created.
The "Features" column of
result_rf is assigned the
column names from the
DataFrame X.
The "Values" column of
result_rf is assigned the
feature importances obtained
from the fitted random forest
model.
The sort_values method is
called on result_rf to sort the
DataFrame based on the
"Values" column in
descending order.
4. Visualizing feature importances
using a bar plot:
A new figure is created with
a size of 11x11 inches using
plt.figure(figsize=(11,11)).
The color codes for the plot
are set to "pastel" using
sns.set_color_codes("pastel").
A bar plot is created using
sns.barplot with "Values" on
the x-axis and "Features" on
the y-axis. The data used for
the plot is taken from the
result_rf DataFrame. The
color of the bars is set to
blue.
Finally, plt.show() is called to
display the bar plot.
The purpose of this code is to calculate and visualize
the feature importances using a random forest
classifier. It helps identify the most important features
that contribute significantly to the target variable,
which can be useful for feature selection or
understanding the underlying patterns in the data. The
bar plot provides a visual representation of the feature
importances, allowing for easy interpretation and
comparison of the importance levels among different
features.
Figure 50 The feature importance using

RandomForest classifier
Output:
Feature Importance:
Features Values
17 TSH 0.606588
25 FTI 0.100281
21 TT4 0.098522
2 on thyroxine 0.063000
19 T3 0.035325
23 T4U 0.028387
0 age 0.025030
7 thyroid surgery 0.015311
1 sex 0.005475
9 query hypothyroid 0.004923
18 T3 measured 0.003115
16 TSH measured 0.002938
10 query hyperthyroid 0.001571
5 sick 0.001464
13 tumor 0.001361
22 T4U measured 0.001271
24 FTI measured 0.001214
4 on antithyroid medication 0.000871
15 psych 0.000820
3 query on thyroxine 0.000765
20 TT4 measured 0.000642
12 goitre 0.000369
8 I131 treatment 0.000322
11 lithium 0.000160
6 pregnant 0.000150
14 hypopituitary 0.000126
26 TBG measured 0.000000
The output shows the feature importance values

calculated using the random forest classifier. Each row
in the output corresponds to a feature, and the
"Values" column represents the importance level
assigned to each feature.
Here is the breakdown of the output:

The most important feature is
"TSH" with an importance value
of 0.606588.
The second most important
feature is "FTI" with an
importance value of 0.100281.
The third most important feature
is "TT4" with an importance
value of 0.098522.
The remaining features follow in
descending order of importance.
The feature importance values indicate the relative
contribution of each feature in predicting the target
variable. Higher values indicate stronger importance,
suggesting that those features have a greater impact
on the target variable. Lower values indicate lesser
importance or negligible influence.
This information is helpful for understanding the

relative importance of different features in the dataset
and can assist in feature selection, identifying
influential factors, and gaining insights into the
underlying relationships between features and the
target variable.
Feature Importance Using Extra Trees
Step Plot feature importance using

1 ExtraTreesClassifier:
1 #Feature Importance using

2 ExtraTreesClassifier
3 model = ExtraTreesClassifier()
4 model.fit(X, y)
5
6 result_et = pd.DataFrame()
7 result_et['Features'] = X.columns
8 result_et ['Values'] =
model.feature_importances_
9
result_et.sort_values('Values',
10
inplace=True, ascending =False)
11
12
13
14
sns.barplot(x = 'Values',y =
15 'Features', data=result_et,
16 color="red")
17 plt.xlabel('Feature Importance',
18 fontsize=30)
plt.ylabel('Feature Labels',
fontsize=30)
plt.tick_params(axis='x',
labelsize=20)
plt.tick_params(axis='y',
labelsize=20)
plt.show()
The result is shown in Figure 51. Here is a step-

by-step explanation of the code:
1. Create an instance of the
ExtraTreesClassifier model.
2. Fit the model using the
input variables (X) and the
target variable (y).
3. Create an empty
DataFrame called
"result_et" to store the
feature importance values.
4. Add the column "Features"
to the DataFrame and set it
as the column containing
the feature names.
5. Add the column "Values" to
the DataFrame and set it as
the column containing the
feature importance values
calculated by the
ExtraTreesClassifier model.
6. Sort the DataFrame in
descending order based on
the "Values" column to get
the features with the
highest importance at the
top.
7. Create a figure with a size
of 25x25 inches to plot the
feature importance.
8. Set the color codes for the
plot to "pastel".
9. Create a bar plot using
seaborn (sns) with the
"Values" as the x-axis and
the "Features" as the y-
axis. Set the color of the
bars to red.
10. Set the x-label to "Feature
Importance" with a font
size of 30.
11. Set the y-label to "Feature
Labels" with a font size of
30.
12. Adjust the tick label sizes
on the x-axis and y-axis to
have a font size of 20.
13. Show the plot.
The resulting plot will display the feature
importance values calculated by the
ExtraTreesClassifier model. The features are
shown on the y-axis, while the importance
values are shown on the x-axis. The bars in the
plot represent the feature importance, with
higher bars indicating more important features.
The color of the bars is set to red for visual
distinction. The plot provides an overview of the
relative importance of each feature in
predicting the target variable.
Figure 51 The feature importance using
ExtraTreesClassifier
Output:
Feature Importance:
Features Values
17 TSH 0.511161
25 FTI 0.121740
21 TT4 0.104991
19 T3 0.054661
0 age 0.054394
23 T4U 0.046756
2 on thyroxine 0.016755
16 TSH measured 0.011206
9 query hypothyroid 0.010994
1 sex 0.010496
7 thyroid surgery 0.008807
18 T3 measured 0.007312
5 sick 0.007273
10 query hyperthyroid 0.005463
13 tumor 0.005275
15 psych 0.004032
22 T4U measured 0.003265
24 FTI measured 0.003217
8 I131 treatment 0.003115
3 query on thyroxine 0.002658
20 TT4 measured 0.002111
4 on antithyroid medication 0.001334
11 lithium 0.001125
6 pregnant 0.001060
12 goitre 0.000706
14 hypopituitary 0.000091
26 TBG measured 0.000000
The output shows the feature importance

values calculated by the ExtraTreesClassifier
model. Here is an explanation of the results:
Features: This column lists
the names of the features.
Values: This column
represents the importance
values assigned to each
feature.
The higher the importance value, the more
significant the feature is in predicting the
target variable. Here are some key observations
from the feature importance results:
TSH (Thyroid-Stimulating
Hormone) has the highest
importance value of
0.511161, indicating it is
the most influential feature
in predicting the target
variable.
Other important features
include FTI (Free
Thyroxine Index) with an
importance value of
0.121740 and TT4 (Total
Thyroxine) with an
importance value of
0.104991.
Age, T3 (Triiodothyronine),
and T4U (Total T4 Uptake)
also have relatively high
importance values,
indicating they have a
significant impact on the
target variable.
Features such as I131
treatment, goitre, and
hypopituitary have lower
importance values,
suggesting they have less
influence on the target
variable.
The feature TBG measured
has an importance value of
0.000000, indicating it has
no predictive power for the
target variable in this
analysis.
The bar plot provides a visual representation of
the feature importance, with taller bars
indicating more important features. This allows
for a quick understanding of which features are
the most relevant for predicting the target
variable.
Feature Importance Using Logistic Regression
Step Plot feature importance using RFE:

1
1 #Feature Importance using RFE

2 from sklearn.feature_selection
3 import RFE
4 model = LogisticRegression()
5 # create the RFE model
6 rfe = RFE(model)
7 rfe = rfe.fit(X, y)
8
9 result_lg = pd.DataFrame()
10 result_lg['Features'] = X.columns
11 result_lg ['Ranking'] =
rfe.ranking_
12
result_lg.sort_values('Ranking',
13
inplace=True , ascending = False)
14
15
16
17 sns.set_color_codes("pastel")
18 sns.barplot(x = 'Ranking',y =
19 'Features', data=result_lg,
color="orange")
20
plt.ylabel('Feature Labels',
21
fontsize=30)
22
plt.tick_params(axis='x',
23 labelsize=20)
plt.tick_params(axis='y',
labelsize=20)
plt.show()
print("Feature Ranking:")
print(result_lg)
The result is shown in Figure 52. Here are the

step-by-step explanations for the code:
1. Import the necessary
libraries:
from
sklearn.feature_selection
import RFE: Import the
RFE class from scikit-
learn's feature_selection
module.
from
sklearn.linear_model
import
LogisticRegression:
Import the
LogisticRegression class
from scikit-learn's
linear_model module.
2. Create the logistic
regression model:
model =
LogisticRegression():
Create an instance of
the LogisticRegression
model.
3. Perform RFE (Recursive
Feature Elimination):
rfe = RFE(model):
Create an RFE object
with the logistic
regression model as the
estimator.
rfe = rfe.fit(X, y): Fit the
RFE model to the input
features X and the target
variable y.
Figure 52 The feature importance using RFE
4. Generate the results:

Create an empty
DataFrame result_lg to
store the feature
rankings.
Assign the feature
names to the 'Features'
column of result_lg.
Assign the rankings of
the features obtained
from RFE to the
'Ranking' column of
result_lg.
Sort result_lg based on
the ranking values in
descending order.
5. Visualize the feature
rankings:
Create a bar plot using
seaborn's barplot
function.
Set the 'Ranking' column
as the x-axis and the
'Features' column as the
y-axis of the bar plot.
Set the color of the bars
to orange using the color
parameter.
Customize the plot by
setting the y-label, tick
labels, and tick sizes.
6. Print the feature ranking:
Display the DataFrame
result_lg containing the
feature names and their
rankings.
The resulting bar plot shows the feature
rankings, where lower rankings indicate more
important features according to the RFE process
with logistic regression. Features with higher
rankings are considered less relevant or less
important for predicting the target variable.
Output:
Feature Ranking:
Features Ranking
26 TBG measured 15
0 age 14
21 TT4 13
25 FTI 12
15 psych 11
14 hypopituitary 10
11 lithium 9
22 T4U measured 8
5 sick 6
8 I131 treatment 5
24 FTI measured 4
23 T4U 3
18 T3 measured 2
17 TSH 1
20 TT4 measured 1
19 T3 1
2 on thyroxine 1
16 TSH measured 1
1 sex 1
12 goitre 1
7 thyroid surgery 1
6 pregnant 1
13 tumor 1
The output from the Recursive Feature Elimination

(RFE) process with logistic regression provides
insights into the importance and ranking of each
feature in predicting the target variable. Here's a
more specific analysis of the output:
Features with a ranking of
1: These features are
considered the most
important based on the RFE
process. They have the
highest relevance in
predicting the target
variable. In this case,
features such as TSH, TT4,
T3, on thyroxine, TSH
measured, and others have
been ranked as 1, indicating
their significance in the
classification task.
Features with rankings
greater than 1: These
features are relatively less
important compared to
those with a ranking of 1.
They still contribute to the
prediction, but their impact
may be lower. Some of these
features include age, FTI,
psych, hypopituitary, and
others.
Based on the output, it can be concluded that the
features with a ranking of 1, as determined by the
RFE process with logistic regression, play a crucial
role in predicting the target variable. These features
should be considered as the most informative when
building a predictive model. However, it's important
to note that the significance of features may vary
depending on the specific dataset and modeling
context.
Step Split dataset into train and test data with three
1 feature scaling: raw, normalization, and
standardization:
1 sm = SMOTE(random_state=42)
2 X,y = sm.fit_resample(X, y.ravel())
3
4 #Splits the data into training and
5 testing
6 X_train, X_test, y_train, y_test =
train_test_split(X, y, test_size =
7
0.2, random_state = 2021,
8 stratify=y)
9 X_train_raw = X_train.copy()
10 X_test_raw = X_test.copy()
11 y_train_raw = y_train.copy()
12 y_test_raw = y_test.copy()
13
14 X_train_norm = X_train.copy()
15 X_test_norm = X_test.copy()
16 y_train_norm = y_train.copy()
17 y_test_norm = y_test.copy()
18 norm = MinMaxScaler()
19 X_train_norm =
20 norm.fit_transform(X_train_norm)
21 X_test_norm =
22 norm.transform(X_test_norm)
23
24 X_train_stand = X_train.copy()
25 X_test_stand = X_test.copy()
26 y_train_stand = y_train.copy()
y_test_stand = y_test.copy()
scaler = StandardScaler()
X_train_stand =
scaler.fit_transform(X_train_stand)
X_test_stand =
scaler.transform(X_test_stand)
The code performs the following steps:

1. SMOTE Oversampling: The
SMOTE (Synthetic Minority
Over-sampling Technique)
algorithm is used to
address the class
imbalance in the target
variable. It generates
synthetic samples for the
minority class (in this case,
the positive class) to
balance the class
distribution. The SMOTE
function from the
imblearn.over_sampling
module is used for
oversampling. The feature
matrix X and target vector
y are passed to the
fit_resample method, which
returns the oversampled
data.
2. Data Splitting: The
oversampled data is then
split into training and
testing sets using the
train_test_split function
from the
sklearn.model_selection
module. The parameter
test_size specifies the
proportion of the data to be
allocated for testing (in this
case, 20% is used), and
random_state ensures
reproducibility of the split.
The stratify parameter is
set to y to ensure that the
class distribution is
maintained in both the
training and testing sets.
3. Data Copy: The original
training and testing sets,
along with their
corresponding labels, are
copied into separate
variables. This is done to
preserve the original data
before any transformations
are applied.
4. Data Normalization: The
MinMaxScaler from the
sklearn.preprocessing
module is used to perform
feature normalization on
the training and testing
sets. The fit_transform
method is applied to the
X_train_norm matrix to
compute the minimum and
maximum values of each
feature and perform
normalization. The
transform method is then
used to normalize the
X_test_norm matrix using
the parameters learned
from the training set.
5. Data Standardization: The
StandardScaler from the
sklearn.preprocessing
module is used to perform
feature standardization on
sets. Similar to
normalization, the
fit_transform method is
applied to the
X_train_stand matrix to
compute the mean and
standard deviation of each
feature and perform
standardization. The
transform method is used
to standardize the
X_test_stand matrix using
from the training set.
Overall, these steps are commonly used in
preprocessing and preparing the data for
machine learning models. SMOTE oversampling
addresses class imbalance, data splitting
ensures the availability of independent training
and testing sets, and normalization and
standardization normalize the features to a
common scale.
Learning Curve
Step Define plot_learning_curve() method to plot

1 learning curve of a certain classifier:
1 def plot_learning_curve(estimator,
2 title, X, y, axes=None, ylim=None,
cv=None, n_jobs=None,
3
train_sizes=np.linspace(.1, 1.0,
4 5)):
5 if axes is None:
6 _, axes = plt.subplots(1, 3,
7 figsize=(35, 10))
8
9 axes[0].set_title(title)
10 if ylim is not None:
11 axes[0].set_ylim(*ylim)
12 axes[0].set_xlabel("Training
13 examples")
14 axes[0].set_ylabel("Score")
15
16 train_sizes, train_scores,
test_scores, fit_times, _ = \
17
learning_curve(estimator, X,
18
y, cv=cv, n_jobs=n_jobs,
19
20 train_sizes=train_sizes,
21
22 return_times=True)
23 train_scores_mean =
24 np.mean(train_scores, axis=1)
25 train_scores_std =
26 np.std(train_scores, axis=1)
27 test_scores_mean =
np.mean(test_scores, axis=1)
28
test_scores_std =
29
np.std(test_scores, axis=1)
30
fit_times_mean =
31 np.mean(fit_times, axis=1)
32 fit_times_std =
33 np.std(fit_times, axis=1)
34
35 # Plot learning curve
36 axes[0].grid()
37
38 axes[0].fill_between(train_sizes, \
39 train_scores_mean -
train_scores_std,\
40
train_scores_mean +
41
train_scores_std, \
42
alpha=0.1, color="r")
43
44 axes[0].fill_between(train_sizes, \
45 test_scores_mean -
46 test_scores_std,\
47 test_scores_mean +
48 test_scores_std, \
49 alpha=0.1, color="g")
50 axes[0].plot(train_sizes,
train_scores_mean, 'o-', \
51
color="r", label="Training score")
52
axes[0].plot(train_sizes,
53
test_scores_mean, 'o-', \
54
color="g", label="Cross-validation
55 score")
56 axes[0].legend(loc="best")
57
58
59 # Plot n_samples vs fit_times
60 axes[1].grid()
axes[1].plot(train_sizes,
fit_times_mean, 'o-')
axes[1].fill_between(train_sizes, \
fit_times_mean - fit_times_std,\
fit_times_mean +
fit_times_std, alpha=0.1)
axes[1].set_xlabel("Training
examples")
axes[1].set_ylabel("fit_times")
axes[1].set_title("Scalability
of the model")
# Plot fit_time vs score

axes[2].grid()
axes[2].plot(fit_times_mean,
test_scores_mean, 'o-')
axes[2].fill_between(fit_times_mean,
\
test_scores_mean -
test_scores_std,\
test_scores_mean +
test_scores_std, alpha=0.1)
axes[2].set_xlabel("fit_times")
axes[2].set_ylabel("Score")
axes[2].set_title("Performance
of the model")
return plt
The code defines a function called

plot_learning_curve() that generates a plot to
visualize the learning curve and performance of
a machine learning model. Here's a step-by-step
1. The function takes several
parameters:
estimator: The machine
learning model or
estimator to evaluate.
title: The title of the
plot.
X: The input features of
the dataset.
y: The target variable of
the dataset.
axes: Optional
parameter for
specifying the axes of
the plot. If not provided,
a new figure with three
subplots will be created.
ylim: Optional
parameter for setting
the y-axis limits of the
first subplot.
cv: The cross-validation
strategy or number of
cross-validation folds to
use.
n_jobs: The number of
parallel jobs to run. If
set to -1, it will use all
available processors.
train_sizes: An array of
training set sizes to be
used in generating the
learning curve.
2. The function initializes the
first subplot (axes[0]) with
the provided title and y-axis
limit (if specified). It sets
the x-axis label as "Training
examples" and the y-axis
label as "Score".
3. The function uses the
learning_curve function to
calculate the training and
test scores, fit times, and
other metrics for the
learning curve. It passes
the provided estimator,
input features X, target
variable y, cross-validation
strategy cv, number of
parallel jobs n_jobs, and
training set sizes
train_sizes.
4. The function calculates the
mean and standard
deviation of the training
and test scores, fit times,
and other metrics.
5. The function plots the
learning curve on the first
subplot (axes[0]). It fills the
area between the mean
scores plus/minus the
standard deviation to show
the variance. It plots the
mean training scores, mean
test scores, and legends for
training and cross-
validation scores.
scalability of the model on
the second subplot
(axes[1]). It shows the
training set sizes versus
the fit times. It fills the
area between the mean fit
times plus/minus the
standard deviation.
performance of the model
on the third subplot
(axes[2]). It shows the fit
times versus the mean test
scores. It fills the area
between the mean test
scores plus/minus the
standard deviation.
8. The function returns the plt
object, which allows further
customization or display of
the plot outside the
function.
By calling this function with appropriate
parameters, you can generate a learning curve
plot that visualizes the performance and
scalability of a machine learning model as the
training set size increases. It helps in
understanding the bias-variance trade-off and
identifying the optimal training set size for the
model.
To observe and analyze each resulting plot

generated by the plot_learning_curve function,
you can follow these steps:
Learning Curve:
Look at the trend of the
training score and cross-
validation score as the
training set size increases.
If the training score and
cross-validation score are
both low, it suggests that
the model is underfitting
the data.
If the training score is high
and the cross-validation
score is significantly lower,
it indicates overfitting.
If both scores are high and
close to each other, it
suggests a good balance
between bias and variance.
Check the variance of the
scores by observing the
shaded areas around the
lines. A wider shaded area
indicates higher variance.
Compare the training and
cross-validation scores to
assess the model's
generalization ability.
Scalability of the Model:

Examine how the fit times
change with different
training set sizes.
If the fit times increase
linearly with the training
set size, it indicates good
scalability.
If the fit times increase
significantly or
exponentially with the
training set size, it
suggests scalability issues.
fit times by observing the
shaded area around the
line. A wider shaded area
Understanding the
scalability helps assess the
model's efficiency for
larger datasets.
Performance of the Model:
Analyze the relationship
between the fit times and
the mean test score.
Look for any patterns or
trends in the plot.
If the model's performance
(mean test score) improves
with longer fit times, it
indicates that more
training leads to better
results.
test scores by observing
the shaded area around the
line. A wider shaded area
This plot helps understand
the trade-off between
model performance and the
time required to train the
model.
By observing and analyzing each plot, you can
gain insights into the learning behavior,
scalability, and performance of the model.
These insights can guide you in understanding
the model's strengths, weaknesses, and
potential areas of improvement.
Real Values versus Predicted Values and Confusion Matrix
Step Define plot_real_pred_val() to plot true values versus

1 predicted values and plot_cm() method to plot confusion
matrix:
1 def plot_real_pred_val(Y_test, ypred, name):

2 plt.figure(figsize=(25,15))
3 acc=accuracy_score(Y_test,ypred)
4
5 plt.scatter(range(len(ypred)),ypred,color="blue",\
6 lw=5,label="Predicted")
7 plt.scatter(range(len(Y_test)), \
8 Y_test,color="red",label="Actual")
9 plt.title("Predicted Values vs True Values of
" + name, \
10
fontsize=10)
11
plt.xlabel("Accuracy: " +
12
str(round((acc*100),3)) + "%")
13
plt.legend()
14
plt.grid(True, alpha=0.75, lw=1, ls='-.')
15
plt.show()
16
17
def plot_cm(Y_test, ypred, name):
18 fig, ax = plt.subplots(figsize=(25, 15))
19 cm = confusion_matrix(Y_test, ypred)
20 sns.heatmap(cm, annot=True, linewidth=0.7,
21 linecolor='red', fmt='g', cmap="YlOrBr",
annot_kws={"size": 30})
22
plt.title(name + ' Confusion Matrix',
23
fontsize=30)
24
ax.xaxis.set_ticklabels(['Negative',
25 'Positive'], fontsize=20);
26 ax.yaxis.set_ticklabels(['Negative',
27 'Positive'], fontsize=20);
28 plt.xlabel('Y predict', fontsize=30)
plt.ylabel('Y test', fontsize=30)
plt.show()
return cm
Here are the steps for each function and their purposes:
plot_real_pred_val(Y_test, ypred, name):

1. Create a scatter plot to compare the
predicted values (ypred) with the true
values (Y_test) of a target variable.
2. The plot visualizes the predicted and
actual values, allowing you to observe
any discrepancies or patterns.
3. Calculate and display the accuracy of
the predictions (acc) as a percentage.
4. Set the plot title to indicate the target
variable being analyzed.
5. Set the x-axis label to show the
accuracy of the predictions.
6. Add a legend to differentiate between
the predicted and actual values.
7. Display the gridlines for better
visualization.
8. Show the plot.
plot_cm(Y_test, ypred, name):
1. Create a heatmap plot to visualize the
confusion matrix between the
predicted values (ypred) and the true
values (Y_test) of a target variable.
2. The confusion matrix displays the
counts or percentages of true positive,
true negative, false positive, and false
negative predictions.
3. The heatmap uses colors to represent
different values in the confusion
matrix, making it easier to interpret.
4. Set the plot title to indicate the target
variable's name.
5. Set the x-axis label as "Y predict" to
represent the predicted values.
6. Set the y-axis label as "Y test" to
represent the true values.
7. Show the plot.
8. Return the confusion matrix (cm).
These functions provide visualizations and metrics to assess the
performance of a model's predictions. The plot_real_pred_val()
function helps compare predicted and actual values, while the
plot_cm() function displays the confusion matrix to evaluate
prediction results.
To observe and analyze a confusion matrix, you can follow these
steps:
1. Understand the structure of the
confusion matrix: A confusion matrix is
a square matrix with dimensions
corresponding to the number of classes
or categories in your classification
problem. It consists of four main
components:
True Positive (TP): The number of
correctly predicted positive
instances.
True Negative (TN): The number of
correctly predicted negative
instances.
False Positive (FP): The number of
incorrectly predicted positive
instances (Type I error).
False Negative (FN): The number
of incorrectly predicted negative
instances (Type II error).
2. Analyze the values in the confusion
matrix:
Accuracy: It is calculated as (TP +
TN) / (TP + TN + FP + FN). It
represents the overall accuracy of
the model's predictions.
Precision: It is calculated as TP /
(TP + FP). It measures the model's
ability to correctly identify positive
instances among the predicted
positive instances.
Recall (Sensitivity or True Positive
Rate): It is calculated as TP / (TP +
FN). It measures the model's ability
to correctly identify positive
instances among the actual positive
instances.
Specificity (True Negative Rate): It
is calculated as TN / (TN + FP). It
measures the model's ability to
correctly identify negative
instances among the actual
negative instances.
F1 Score: It is the harmonic mean
of precision and recall and is
calculated as 2 * (Precision *
Recall) / (Precision + Recall). It
provides a balance between
precision and recall.
3. Interpret the confusion matrix:
TP and TN represent correct
predictions, indicating that the
model correctly identified positive
and negative instances,
respectively.
FP and FN represent incorrect
predictions, indicating that the
model misclassified instances as
positive or negative.
Pay attention to the imbalance
between FP and FN errors, as it
can vary based on the problem's
nature and the desired outcome.
Consider the trade-off between
precision and recall based on the
specific requirements of your
problem. If you prioritize
minimizing false positives, focus on
improving precision. If you
prioritize minimizing false
negatives, focus on improving
recall.
Compare the values in the
confusion matrix with your
problem's context and
requirements to assess the model's
performance. For example, in a
medical diagnosis scenario, a false
negative (FN) error might be more
critical than a false positive (FP)
error.
4. Use additional metrics or visualizations
to gain further insights:
Calculate metrics such as precision,
recall, specificity, and F1 score to
obtain more detailed evaluation
measures.
Visualize the confusion matrix
using heatmaps or other graphical
representations to get a clearer
understanding of the distribution of
predictions and errors.
Overall, the confusion matrix provides a comprehensive
overview of the model's performance, allowing you to assess its
accuracy, precision, recall, and other relevant metrics for your
specific classification problem.
ROC and Decision Boundaries
Step Define plot_roc() method to plot ROC and

1 plot_decision_boundary() method to plot
decision boundary of two chosen feature with
certain classifier:
1 #Plots ROC
2 def plot_roc(model,X_test, y_test,
3 title):
4 Y_pred_prob =
model.predict_proba(X_test)
5
Y_pred_prob = Y_pred_prob[:, 1]
6
7
fpr, tpr, thresholds =
8
roc_curve(y_test, Y_pred_prob)
9
10
plt.plot([0,1],[0,1],
11 color='navy', lw=5, linestyle='--')
12 plt.plot(fpr,tpr, label='ANN')
13 plt.xlabel('False Positive
14 Rate')
15 plt.ylabel('True Positive
16 Rate')
17 plt.title('ROC Curve of ' +
title)
18
plt.grid(True)
19 plt.show()
20
21 def
22 plot_decision_boundary(model,xtest,
ytest, name):
23
plt.figure(figsize=(25, 15))
24
#Trains model with two features
25
model.fit(xtest, ytest)
26
27
28
plot_decision_regions(xtest.values,
ytest.ravel(), clf=model, legend=2)
plt.title("Decision boundary
for " + name + " (Test)",
fontsize=30)
plt.xlabel("TSH", fontsize=25)
plt.ylabel("FTI", fontsize=25)
plt.legend(fontsize=25)
plt.show()
Here are the steps to understand and analyze

each function and its purpose:
plot_roc(model, X_test, y_test, title): This

function is used to plot the Receiver Operating
Characteristic (ROC) curve. The steps involved
are:
1. Predict the probabilities of
the positive class using the
trained model on the test
data.
2. Extract the predicted
probabilities for the
positive class.
3. Calculate the False Positive
Rate (FPR) and True
Positive Rate (TPR) using
the roc_curve function.
4. Plot the ROC curve,
including the diagonal line
(representing a random
classifier) and the curve for
5. Set the labels and title of
the plot.
plot_decision_boundary(model, xtest, ytest,
name): This function is used to plot the decision
boundary of a classification model. The steps
involved are:
1. Fit the model using the
features (xtest) and labels
(ytest).
2. Create a scatter plot of the
data points with decision
regions plotted based on
3. Set the title and labels for
the plot.
These functions are useful for visualizing and
analyzing the performance and decision
boundaries of classification models. The ROC
curve helps assess the model's classification
performance by examining the trade-off
between the true positive rate and the false
positive rate. The decision boundary plot shows
how the model separates different classes
based on the selected features.
By using these functions, you can gain insights

into the model's performance, evaluate its
ability to distinguish between classes, and
understand how it makes decisions based on
the chosen features.
To observe and analyze the ROC curve, you can

follow these steps:
1. Plot the ROC curve: Use
the plot_roc function to plot
the ROC curve for your
model. Provide the trained
model, the test features
(X_test), the test labels
(y_test), and a title for the
plot.
2. Interpret the ROC curve:
The ROC curve is a
graph that shows the
trade-off between the
True Positive Rate
(TPR) and the False
Positive Rate (FPR) as
the classification
threshold changes.
The TPR is the ratio of
correctly predicted
positive instances to the
total actual positive
instances. It represents
the model's ability to
correctly identify
positive samples.
The FPR is the ratio of
incorrectly predicted
negative instances to
the total actual negative
instances. It represents
the model's tendency to
incorrectly classify
negative samples as
positive.
The diagonal line in the
plot represents a
random classifier with
an equal chance of true
positives and false
positives. A better
classifier will have its
ROC curve above this
line.
3. Analyze the ROC curve:
The closer the ROC
curve is to the top-left
corner of the plot, the
better the model's
performance. This
indicates a higher TPR
for a lower FPR.
The area under the ROC
curve (AUC-ROC) is a
common metric used to
evaluate the overall
performance of the
model. A higher AUC-
ROC value (closer to 1)
suggests better
discrimination between
the positive and
negative classes.
If two models have
overlapping ROC
curves, you can
compare them by
looking at their AUC-
ROC values. The model
with a higher AUC-ROC
value is generally
considered better.
4. Determine the optimal
threshold: Depending on
your specific needs and the
nature of the problem, you
can choose a threshold that
balances the trade-off
between the TPR and FPR.
This threshold can be
adjusted to prioritize either
sensitivity or specificity
based on the requirements
of your application.
In summary, the ROC curve provides a visual
representation of the model's performance
across different classification thresholds. By
analyzing the curve's shape, proximity to the
diagonal line, and the AUC-ROC value, you can
assess the model's ability to discriminate
between positive and negative instances and
make informed decisions about the model's
performance.
Training Model and Predicting Stroke
Step Choose two features for decision boundary:

1
1 feat_boundary = ['TSH','FTI']
2 X_feature = X[feat_boundary]
3 X_train_feat, X_test_feat,
4 y_train_feat, y_test_feat = \
5 train_test_split(X_feature, y,
test_size = 0.2, \
random_state = 2021, stratify=y)
The code above performs the following steps:

1. Selects the features for
decision boundary plotting: The
variable feat_boundary is a list
that specifies the features to be
used for plotting the decision
boundary. In this case, the
features selected are 'TSH' and
FTI.
2. Extracts the selected features:
The features specified in
feat_boundary are extracted
from the original feature matrix
X and stored in X_feature.
3. Splits the data into training and
testing sets: The train_test_split
function is used to split the
extracted features (X_feature)
and the target variable (y) into
training and testing sets. The
testing set size is set to 20% of
the total data, and the random
state is set to 2021 for
reproducibility. The stratify
parameter ensures that the
class distribution is preserved
in the training and testing sets.
4. Stores the split data: The
resulting training and testing
feature matrices and target
variables are stored in the
variables X_train_feat,
X_test_feat, y_train_feat, and
y_test_feat, respectively. These
datasets will be used for
training and evaluating the
model for decision boundary
plotting.
By performing these steps, you have prepared the
necessary data for plotting the decision boundary
based on the selected features.
Step Define train_model() method to train model,

2 predict_model() method to get predicted values,
and run_model() method to perform training model,
predicting results, plotting confusion matrix, plotting
true values versus predicted values, plotting ROC,
plotting decision boundary, and plotting learning
curve:
1 def train_model(model, X, y):

2 model.fit(X, y)
3 return model
4
5 def predict_model(model, X, proba=False):
6 if ~proba:
7 y_pred = model.predict(X)
8 else:
9 y_pred_proba =
10 model.predict_proba(X)
11 y_pred = np.argmax(y_pred_proba,
axis=1)
12
13
return y_pred
14
15
list_scores = []
16
17
def run_model(name, model, X_train,
18
X_test, y_train, y_test, fc,
19 proba=False):
20 print(name)
21 print(fc)
22
23 model = train_model(model, X_train,
24 y_train)
25 y_pred = predict_model(model, X_test,
proba)
26
27
accuracy = accuracy_score(y_test,
28
y_pred)
29
recall = recall_score(y_test, y_pred)
30
precision = precision_score(y_test,
31 y_pred)
32 f1 = f1_score(y_test, y_pred)
33
34 print('accuracy: ', accuracy)
35 print('recall: ',recall)
36 print('precision: ', precision)
37 print('f1: ', f1)
38 print(classification_report(y_test,
39 y_pred))
40
41
42 plot_cm(y_test, y_pred, name)
43 plot_real_pred_val(y_test, y_pred,
44 name)
45 plot_roc(model, X_test, y_test, name)
plot_decision_boundary(model,X_test_feat,
y_test_feat, name)
plot_learning_curve(model, name,
X_train, y_train, cv=3);
plt.show()
list_scores.append({'Model Name':
name, \
'Feature Scaling':fc, 'Accuracy':
accuracy, \
'Recall': recall, 'Precision':
precision, 'F1':f1})
The code defines several functions and executes a

series of steps to train and evaluate a machine
learning model. Here's an explanation of each
function and the overall process:
1. train_model(model, X, y): This
function trains the specified
model on the features X and
target variable y using the fit()
method. It returns the trained
model.
2. predict_model(model, X,
proba=False): This function
makes predictions using the
trained model on the features
X. If proba is set to False, it
uses the predict() method to
obtain the predicted class
labels. Otherwise, it uses the
predict_proba() method to
obtain class probabilities and
then selects the class with the
highest probability using
argmax(). It returns the
predicted labels.
3. list_scores: This list will store
the evaluation scores for each
model.
4. run_model(name, model,
X_train, X_test, y_train, y_test,
fc, proba=False): This function
runs the model training,
evaluation, and plotting
process. It takes the following
parameters:
name: The name of the
model.
model: The machine
learning model to be trained
and evaluated.
X_train, X_test, y_train,
y_test: The training and
testing feature matrices and
target variables.
fc: The feature scaling
method ('StandardScaler',
'MinMaxScaler', or 'None').
proba: A flag indicating
whether to obtain class
probabilities instead of class
labels.
Inside the function, the following steps are
performed:
The model is trained using
the train_model() function.
Predictions are made on the
test set using the
predict_model() function.
Various evaluation metrics
(accuracy, recall, precision,
F1-score) are calculated
using scikit-learn functions.
The classification report is
printed to provide a detailed
evaluation.
The confusion matrix,
predicted vs. true values
plot, ROC curve, decision
boundary plot, and learning
curve plot are generated
and displayed using the
corresponding functions.
The evaluation scores are
stored in the list_scores list.
Overall, this function allows for training and
evaluating the model, generating visualizations, and
collecting evaluation scores for further analysis.
By using the run_model() function, you can train,

evaluate, and visualize different machine learning
models with different feature scaling methods. The
results are displayed for each model, including
accuracy, recall, precision, F1-score, and various
plots to assess model performance. The evaluation
scores are also stored in list_scores for further
analysis or comparison between models.
Step Run Support Vector Classifier (SVC) on three

1 feature scaling:
1 feature_scaling = {
2 'Raw':(X_train_raw, X_test_raw,
3 y_train_raw, y_test_raw),
4 'Normalization':(X_train_norm,
X_test_norm, \
5
y_train_norm, y_test_norm),
6
'Standardization':(X_train_stand,
7
X_test_stand, \
8
y_train_stand, y_test_stand),
9
}
10
11
model_svc =
12 SVC(random_state=2021,probability=True)
13 for fc_name, value in
feature_scaling.items():
X_train, X_test, y_train, y_test =
value
run_model('SVC', model_svc,
X_train, X_test, \
y_train, y_test, fc_name)
The code performs model evaluation using the

Support Vector Classifier (SVC) with different
feature scaling techniques. Here are the steps:
1. Define a dictionary
feature_scaling that holds the
different feature scaling
techniques as keys and their
corresponding training and
testing data as values. The
three feature scaling
techniques used are:
'Raw': Represents the raw
data without any scaling.
'Normalization':
Represents the data
scaled using Min-Max
normalization.
'Standardization':
Represents the data
scaled using
standardization.
2. Instantiate an SVC model
with the random_state
parameter set to 2021. The
probability parameter is also
set to True to enable
probability estimation.
3. Iterate over each feature
scaling technique in the
feature_scaling dictionary
using a for loop. At each
iteration, retrieve the training
and testing data for the
specific feature scaling
technique.
4. Call the run_model() function
to evaluate the SVC model
with the current feature
scaling technique. Pass the
model, training and testing
data, and the feature scaling
technique name as arguments
to the function.
5. Within the run_model()
function, the SVC model is
trained and evaluated using
the specified training and
testing data. The evaluation
metrics are printed, and
various plots (confusion
matrix, predicted vs. true
values, ROC curve, decision
boundary, learning curve) are
generated for visualization.
By running this code, you can compare the
performance of the SVC model with different
feature scaling techniques and assess the impact of
scaling on the model's predictive performance.
Output with Raw Scaling:

SVC
Raw
accuracy: 0.9705671213208902
recall: 0.9705671213208902
precision: 0.9706148686646138
f1: 0.9705664842498622
precision recall f1-score
support
0 0.98 0.97 0.97
697
1 0.97 0.98 0.97
696
accuracy 0.97
1393
macro avg 0.97 0.97 0.97
1393
weighted avg 0.97 0.97 0.97
1393
The results of using raw feature scaling are shown

in Figure 53 – 57. The output shows that the SVC
model with the "Raw" feature scaling technique
achieved high performance in terms of accuracy,
recall, precision, and F1-score. Here are some
specific observations and conclusions:
Accuracy: The model
achieved an accuracy of
0.9706, indicating that it
correctly predicted the
thyroid disease status for
97.06% of the instances in the
test set.
Recall: The recall score for
both classes (0 and 1) is
0.9706, indicating that the
model effectively captured
the true positive rate for
detecting both non-thyroid
disease and thyroid disease
cases.
Precision: The precision score
for both classes is also
model has a high ability to
correctly identify true
positive cases for both non-
thyroid disease and thyroid
disease.
F1-score: The F1-score is a
balanced metric that
considers both precision and
recall. The F1-score of 0.9706
indicates that the model
achieved a good balance
between precision and recall
for both classes, reflecting its
ability to provide accurate
predictions across the
dataset.
Overall, the SVC model with "Raw" feature scaling
demonstrates strong predictive performance for
classifying thyroid disease cases. The high
accuracy, recall, precision, and F1-score indicate
that the model is effective in distinguishing
between individuals with and without thyroid
disease based on the given features.
Figure 53 The confusion matrix of SVM model
with raw feature scaling
Figure 54 The true values versus predicted values

of SVM model with raw feature scaling
Figure 55 The decision boundary using two
chosen features with SVM model
Figure 56 The ROC of SVM model with raw

feature scaling
Figure 57 The learning curve of SVM model with
raw feature scaling
Output with Normalization Scaling:

SVC
Normalization
accuracy: 0.8406317300789663
recall: 0.8406317300789663
precision: 0.8431688328019513
f1: 0.840330558273782
support
0 0.81 0.88 0.85
697
1 0.87 0.80 0.83
696
accuracy 0.84
1393
macro avg 0.84 0.84 0.84
1393
1393
The results of using normalized feature scaling are

shown in Figure 58 – 61. The output shows that the
SVC model with the "Normalization" feature
scaling technique achieved moderate performance
in terms of accuracy, recall, precision, and F1-
score. Here are some specific observations and
conclusions:
Accuracy: The model
test set.
class 0 is 0.8406, indicating
that the model effectively
captured the true negative
rate for non-thyroid disease
cases. The recall score for
class 1 is also 0.8406,
indicating that the model
captured the true positive
rate for thyroid disease cases.
for class 0 is 0.8107,
indicating that the model has
a moderate ability to
negative cases for non-thyroid
disease. The precision score
a higher ability to correctly
identify true positive cases for
thyroid disease.
indicates a reasonable
balance between precision
and recall for both classes,
reflecting the model's ability
to provide accurate
predictions across the
dataset.
Overall, the SVC model with "Normalization"
feature scaling demonstrates moderate predictive
performance for classifying thyroid disease cases.
The accuracy, recall, precision, and F1-score
indicate that the model can distinguish between
individuals with and without thyroid disease to a
reasonable extent based on the given features.
with normalized feature scaling

of SVM model with normalized feature scaling
normalized feature scaling
Figure 62 The ROC of SVM model with

Output with Standardization Scaling:

SVC
Standardization
accuracy: 0.9109834888729361
recall: 0.9109834888729361
precision: 0.9112565204173689
f1: 0.910967796983287
support
0 0.90 0.92 0.91
697
1 0.92 0.90 0.91
696
accuracy 0.91
1393
macro avg 0.91 0.91 0.91
1393
1393
The results of using standardized feature scaling

are shown in Figure 63 – 66. The output shows that
the SVC model with the "Standardization" feature
scaling technique achieved good performance in
terms of accuracy, recall, precision, and F1-score.
Here are some specific observations and
conclusions:
Accuracy: The model
test set.
class 0 is 0.9110, indicating
that the model effectively
captured the true negative
rate for non-thyroid disease
cases. The recall score for
class 1 is also 0.9110,
captured the true positive
rate for thyroid disease cases.
a good ability to correctly
identify true negative cases
for non-thyroid disease. The
precision score for class 1 is
model has a higher ability to
positive cases for thyroid
disease.
indicates a good balance
between precision and recall
for both classes, reflecting
the model's ability to provide
accurate predictions across
the dataset.
Overall, the SVC model with "Standardization"
feature scaling demonstrates good predictive
performance for classifying thyroid disease cases.
The accuracy, recall, precision, and F1-score
indicate that the model can distinguish between
individuals with and without thyroid disease with a
high level of accuracy and precision based on the
given features.

with standardized feature scaling

of SVM model with standardized feature scaling
standardized feature scaling
Figure 66 The ROC of SVM model with

Step Comparing the three outputs of the SVC model

2 with different feature scaling techniques, we can
observe the following:
Raw Scaling:
Accuracy: 0.9706
Recall: 0.9706
Precision: 0.9706
F1-score: 0.9706
Normalization Scaling:
Accuracy: 0.8406
Recall: 0.8406
Precision: 0.8432
F1-score: 0.8403
Standardization Scaling:
Accuracy: 0.9110
Recall: 0.9110
Precision: 0.9113
F1-score: 0.9110
Analysis:
Raw Scaling: This scaling
technique uses the raw values
of the features without any
normalization or
standardization. The model
achieved high accuracy,
recall, precision, and F1-
score, indicating that it
performed well in
distinguishing between
individuals with and without
thyroid disease. The model
benefited from the raw
feature values and captured
the patterns effectively.
Normalization Scaling: This
scaling technique normalized
the feature values, resulting
in values between 0 and 1.
The accuracy, recall,
precision, and F1-score were
lower compared to the raw
scaling technique. This
suggests that the
normalization technique
might have affected the
model's ability to accurately
classify thyroid disease cases.
The normalization could have
reduced the separation
between the classes, leading
to a decrease in performance.
Standardization Scaling: This
scaling technique
standardized the feature
values, making them have
zero mean and unit variance.
The model achieved high
accuracy, recall, precision,
and F1-score, similar to the
raw scaling technique.
Standardization helped the
model to effectively capture
the patterns in the data and
make accurate predictions. It
maintained the separation
between the classes and
preserved the information in
the features.
Overall, the standardization scaling technique
yielded the best performance among the three. It
provided a balanced trade-off between accuracy,
recall, precision, and F1-score. The raw scaling
technique also performed well, indicating that the
features' original values contained valuable
information for the model. On the other hand, the
normalization scaling technique resulted in lower
performance, suggesting that it might not have
been suitable for this particular dataset and model.
Step Run Logistic Regression (LR) on three feature scaling:

1
1 logreg =
2 LogisticRegression(solver='lbfgs',
max_iter=5000, random_state=2021)
3
for fc_name, value in
4
5
X_train, X_test, y_train, y_test
6 = value
run_model('Logistic Regression',
logreg, X_train, \
X_test, y_train, y_test, fc_name,
proba=True)
The purpose of this code is to train and evaluate a
logistic regression model using different feature scaling
techniques. Here's a step-by-step explanation:
1. logreg =
LogisticRegression(solver='lbfgs',
max_iter=5000,
random_state=2021): This line
initializes a logistic regression
model with specific solver,
maximum iterations, and random
state settings. The solver is set to
'lbfgs', which is an optimization
algorithm for logistic regression.
The maximum number of
iterations is set to 5000 to ensure
convergence, and the random
state is set to 2021 for
reproducibility.
2. for fc_name, value in
feature_scaling.items():: This line
starts a loop over the
feature_scaling dictionary, which
contains different feature scaling
techniques as keys and
corresponding training and
testing data as values.
3. X_train, X_test, y_train, y_test =
value: This line unpacks the
training and testing data for the
current feature scaling technique
from the value variable into
separate variables: X_train
(training features), X_test
(testing features), y_train
(training target), and y_test
(testing target).
4. run_model('Logistic Regression',
logreg, X_train, X_test, y_train,
y_test, fc_name, proba=True):
This line calls the run_model
function to train and evaluate the
logistic regression model. It
passes the model (logreg),
training and testing data, feature
scaling technique name
(fc_name), and the parameter
proba=True to indicate that the
model should provide probability
estimates instead of class
predictions.
5. Inside the run_model() function,
the logistic regression model is
trained using the training data
(X_train, y_train), and then
predictions are made on the
testing data (X_test). The
function computes various
evaluation metrics such as
accuracy, recall, precision, and
F1-score. It also generates and
displays plots such as the
confusion matrix, predicted
values versus true values, ROC
curve, and decision boundary.
Finally, the learning curve is
plotted to visualize the model's
performance with different
training example sizes.
By running this code, you can compare the performance
of the logistic regression model using different feature
scaling techniques and assess their impact on the
model's predictions and generalization ability.

Logistic Regression
Raw
accuracy: 0.9885139985642498
recall: 0.9885139985642498
precision: 0.9885180503092059
f1: 0.9885139867257451
support
0 0.99 0.99 0.99
697
1 0.99 0.99 0.99
696
accuracy 0.99
1393
macro avg 0.99 0.99 0.99
1393
1393
The results of using raw feature scaling are shown in
Figure 67 – 71. The logistic regression model trained
with raw scaling achieved high accuracy, recall,
precision, and F1-score on the testing data. Here's a
breakdown of the performance metrics:
Accuracy: 0.9885, indicating that
the model correctly predicted
testing data.
Recall: 0.9885, which means the
model successfully identified
98.85% of the positive cases
(class 1) in the testing data.
Precision: 0.9885, showing that
when the model predicted a
positive case, it was correct
98.85% of the time.
F1-score: 0.9885, which is a
harmonic mean of precision and
recall, providing an overall
measure of model performance.
The classification report further confirms that the model
achieved high precision, recall, and F1-score for both
classes (0 and 1). These results suggest that the logistic
regression model with raw scaling performed
exceptionally well in accurately classifying instances in
the testing data.
Figure 67 The confusion matrix of LR model with raw
feature scaling
Figure 69 The true values versus predicted values of

LR model with raw feature scaling
Figure 68 The learning curve of LR model with raw
feature scaling
Figure 70 The ROC of LR model with raw feature

scaling
Figure 71 The decision boundary of LR model with raw
feature scaling
Output with Normalized Scaling:

Logistic Regression
Normalization
accuracy: 0.8341708542713567
recall: 0.8341708542713567
precision: 0.8348288891370841
f1: 0.834086034021766
support
0 0.82 0.86 0.84
697
1 0.85 0.81 0.83
696
accuracy 0.83
1393
macro avg 0.83 0.83 0.83
1393
1393

shown in Figure 72 – 75. The logistic regression model
trained with normalized scaling achieved moderate
performance on the testing data. Here's a breakdown of
the performance metrics:
testing data.
83.48% of the time.
achieved reasonable precision, recall, and F1-score for
both classes (0 and 1). However, compared to the raw
scaling, the performance of the logistic regression
model with normalized scaling is slightly lower.
Normalization scales the features to a specific range

(usually between 0 and 1). In this case, the normalized
scaling did not improve the model's performance
significantly, suggesting that the logistic regression
model may not be highly sensitive to feature scaling in
this dataset.
Figure 72 The confusion matrix of LR model with
Figure 73 The learning curve of LR model with

LR model with normalized feature scaling
Figure 75 The ROC of LR model with normalized

feature scaling
Output with Standardized Scaling:

Logistic Regression
Standardization
accuracy: 0.9641062455132807
recall: 0.9641062455132807
precision: 0.9643518739448818
f1: 0.9641018055614623
support
0 0.98 0.95 0.96
697
1 0.95 0.98 0.96
696
accuracy 0.96
1393
macro avg 0.96 0.96 0.96
1393
1393
The results of using standardized feature scaling are

shown in Figure 76 – 79. The logistic regression model
trained with standardized scaling achieved high
performance on the testing data. Here's a breakdown of
the performance metrics:
testing data.
96.44% of the time.
achieved excellent precision, recall, and F1-score for
both classes (0 and 1).
Standardization scales the features to have zero mean

and unit variance. In this case, the logistic regression
model with standardized scaling outperformed the
model with normalized scaling. Standardization might
have helped the logistic regression model by bringing
the features to a comparable scale and reducing the
impact of outliers, resulting in improved performance.
Figure 76 The confusion matrix of LR model with

LR model with standardized feature scaling
Figure 78 The learning curve of LR model with

Figure 79 The ROC of LR model with standardized
feature scaling
Step Comparing the three outputs of Logistic Regression with

2 different feature scalings, we can observe the following:
Raw Scaling:
Accuracy: 0.9885
Recall: 0.9885
Precision: 0.9885
F1-score: 0.9885
Normalization Scaling:
Accuracy: 0.8342
Recall: 0.8342
Precision: 0.8348
F1-score: 0.8341
Standardization Scaling:
Accuracy: 0.9641
Recall: 0.9641
Precision: 0.9644
F1-score: 0.9641
From these results, we can make the following
observations:
Raw Scaling: The model achieved
the highest accuracy, precision,
recall, and F1-score among the
three scalings. This indicates that
using the raw feature values
without any scaling had the best
performance.
Normalization Scaling: The
model's performance decreased
compared to the raw scaling. The
accuracy, precision, recall, and
F1-score were lower, suggesting
that normalizing the features to a
specific range negatively
impacted the model's predictive
ability.
Standardization Scaling: The
model performed well with
standardization scaling, although
slightly lower than the raw
scaling. It achieved high
accuracy, precision, recall, and
F1-score. Standardization helped
by standardizing the feature
values to have zero mean and
unit variance, which might have
improved the model's stability
and performance.
In conclusion, for this particular logistic regression
model and dataset, using the raw feature values without
any scaling resulted in the best performance. However,
it's important to note that the effectiveness of feature
scaling can vary depending on the dataset and the
model used. It is recommended to experiment with
different scalings and evaluate their impact on model
performance to find the most suitable approach.
Step Run K-Nearest Neighbors (KNN) on three feature

1 scaling:

2 feature_scaling.items():
3 scores_1 = []
value
5
6
for i in range(2,50):
7
knn =
8
KNeighborsClassifier(n_neighbors = i)
9
knn.fit(X_train, y_train)
10
11
12 scores_1.append(accuracy_score(y_test,
13 \
14 knn.predict(X_test)))
15
16 max_val = max(scores_1)
17 max_index = np.argmax(scores_1) +
18 2
19
20 knn =
KNeighborsClassifier(n_neighbors =
max_index)
run_model(f'KNeighbors Classifier
n_neighbors = {max_index}',\
knn, X_train, X_test, y_train,
y_test, \
fc_name, proba=True)
This code block performs the following steps:

1. Iterates over the
items using a for loop. Each
iteration assigns the key to
fc_name and the
corresponding value to
value.
2. Initializes an empty list
scores_1 to store the
accuracy scores.
3. Assigns the values of X_train,
X_test, y_train, and y_test
from value.
4. Iterates from 2 to 50
(exclusive) using a for loop.
Inside the loop, it:
Initializes a
KNeighborsClassifier
with the current value of i
as the number of
neighbors.
Fits the model on the
training data (X_train and
y_train).
Evaluates the accuracy
score on the test data
(X_test and y_test).
Appends the accuracy
score to the scores_1 list.
5. Finds the maximum accuracy
score (max_val) from the
scores_1 list and its
corresponding index
(max_index).
6. Initializes a new
KNeighborsClassifier with
the max_index as the number
of neighbors.
7. Fits the model on the
training data.
8. Calls the run_model()
function to evaluate and
visualize the performance of
the KNeighborsClassifier
with the chosen max_index
value of neighbors. It
provides the model name,
the classifier instance,
training, and testing data,
feature scaling type
(fc_name), and sets proba to
True.
9. Prints the evaluation metrics
and displays the confusion
matrix, predicted vs. true
values, ROC curve, and
decision boundary plots.
The purpose of this code is to determine the
optimal number of neighbors (n_neighbors) for
the K-Nearest Neighbors (KNN) classifier using
different feature scaling methods. It evaluates the
performance of the KNN classifier with varying
numbers of neighbors and selects the best
performing model based on accuracy.

KNeighbors Classifier n_neighbors = 3
Raw
accuracy: 0.9813352476669059
recall: 0.9813352476669059
precision: 0.9817320923673414
f1: 0.981331206860899
precision recall f1-
score support
0 0.97 1.00
0.98 697
1 1.00 0.97
0.98 696
accuracy
0.98 1393
macro avg 0.98 0.98
0.98 1393
weighted avg 0.98 0.98
0.98 1393
The results of using raw feature scaling are

shown in Figure 80 – 84. The KNeighbors
Classifier with n_neighbors = 3 and Raw Scaling
achieved an accuracy of 0.9813, indicating that it
correctly classified approximately 98.13% of the
instances.
Looking at the precision, recall, and F1-score, we

see that the model performs very well for both
classes (0 and 1). It achieved high precision,
indicating a low rate of false positives, and high
recall, indicating a low rate of false negatives.
The F1-score, which balances precision and
recall, is also high for both classes. This suggests
that the model can effectively distinguish between
the two classes and make accurate predictions.
The confusion matrix confirms the excellent

performance of the model, as it correctly
predicted all instances of class 0 and class 1.
There were no instances misclassified in this
case.
Overall, the KNeighbors Classifier with

n_neighbors = 3 and Raw Scaling demonstrated
strong predictive performance, achieving high
accuracy and correctly classifying the majority of
instances. It can be considered a reliable model
for this dataset and problem.
Figure 80 The confusion matrix of KNN model
Figure 81 The true values versus predicted

values of KNN model with raw feature scaling
Figure 82 The ROC of KNN model with raw
feature scaling
Figure 83 The learning curve of KNN model with

raw feature scaling
Figure 84 The decision boundary of KNN model

Normalization
accuracy: 0.9210337401292176
recall: 0.9210337401292176
precision: 0.9241770492787673
f1: 0.9208845108563785
score support
0 0.89 0.96
0.92 697
1 0.96 0.88
0.92 696
accuracy
0.92 1393
macro avg 0.92 0.92
0.92 1393
0.92 1393

shown in Figure 85 – 88. The KNeighbors
Classifier with n_neighbors = 3 and Normalized
Scaling achieved an accuracy of 0.9210,
indicating that it correctly classified
approximately 92.10% of the instances.
Analyzing the precision, recall, and F1-score, we

observe that the model performs well for both
classes (0 and 1). It achieved high precision and
recall for both classes, indicating a good balance
between true positives and false negatives. The
F1-score, which combines precision and recall, is
also high for both classes. This suggests that the
model can effectively classify instances from both
classes accurately.
Examining the confusion matrix, we can see that

the model correctly predicted the majority of
instances for both class 0 and class 1. However,
there is a slightly higher number of misclassified
instances compared to the model with Raw
Scaling.
In conclusion, the KNeighbors Classifier with

n_neighbors = 3 and Normalized Scaling
demonstrates good predictive performance,
achieving a high accuracy rate and balanced
precision and recall scores for both classes.
Although it has a slightly lower accuracy
compared to the Raw Scaling model, it is still a
reliable model for this dataset and problem.

values of KNN model with normalized feature
scaling
Figure 88 The ROC of KNN model with


Standardization
accuracy: 0.9203158650394831
recall: 0.9203158650394831
precision: 0.9225776268840425
f1: 0.9202068092421954
score support
0 0.89 0.96
0.92 697
1 0.95 0.88
0.92 696
accuracy
0.92 1393
macro avg 0.92 0.92
0.92 1393
0.92 1393

are shown in Figure 89 – 92. The KNeighbors
Classifier with n_neighbors = 3 and Standardized
Scaling achieved an accuracy of 0.9203,
indicating that it correctly classified
approximately 92.03% of the instances.
Analyzing the precision, recall, and F1-score, we

observe that the model performs well for both
classes (0 and 1). It achieved high precision and
recall for both classes, indicating a good balance
between true positives and false negatives. The
F1-score, which combines precision and recall, is
also high for both classes. This suggests that the
model can effectively classify instances from both
classes accurately.
Examining the confusion matrix, we can see that

the model correctly predicted the majority of
instances for both class 0 and class 1. However,
similar to the model with Normalized Scaling,
there is a slightly higher number of misclassified
instances compared to the model with Raw
Scaling.
In conclusion, the KNeighbors Classifier with

n_neighbors = 3 and Standardized Scaling
demonstrates good predictive performance,
achieving a high accuracy rate and balanced
precision and recall scores for both classes.
Although it has a slightly lower accuracy
compared to the Raw Scaling model, it is still a
reliable model for this dataset and problem.


values of KNN model with standardized feature
scaling
Figure 92 The ROC of KNN model with
Step When comparing the outputs of the KNeighbors

2 Classifier with different feature scaling methods,
we can make the following observations:
Raw Scaling:
Accuracy: 0.9813
Precision: 0.9817
Recall: 0.9813
F1-score: 0.9813
Normalized Scaling:
Accuracy: 0.9210
Precision: 0.9242
Recall: 0.9210
F1-score: 0.9209
Standardized Scaling:
Accuracy: 0.9203
Precision: 0.9226
Recall: 0.9203
F1-score: 0.9202
Overall, all three models achieved relatively high
accuracy rates and demonstrated good predictive
performance. However, there are some notable
differences:
The model with Raw Scaling
achieved the highest
accuracy, precision, recall,
and F1-score among the
three scaling methods. It
demonstrated excellent
performance in accurately
classifying instances from
both classes.
The models with Normalized
Scaling and Standardized
Scaling achieved slightly
lower accuracy rates
compared to the model with
Raw Scaling. However, they
still performed well, with
precision, recall, and F1-
scores above 0.92 for both
scaling methods.
There is a small difference in
performance between the
models with Normalized
Scaling and Standardized
Scaling. Both models have
similar accuracy, precision,
recall, and F1-scores,
indicating that they are
comparable in terms of
predictive performance.
In conclusion, the KNeighbors Classifier with Raw
Scaling showed the best overall performance in
terms of accuracy and predictive metrics.
However, the models with Normalized Scaling
and Standardized Scaling also demonstrated good
performance and can be considered reliable
alternatives. The choice of feature scaling method
may depend on the specific requirements of the
problem and the characteristics of the dataset.
Step Run Decision Tree (DT) classifier on three

1 feature scaling:

3 X_train, X_test, y_train,
y_test = value
4
5
dt = DecisionTreeClassifier()
6
7
parameters = {
8
'max_depth':np.arange(1,21,1),\
9
'random_state':[2021]}
10
searcher = GridSearchCV(dt,
11 parameters)
run_model('DecisionTree
Classifier', searcher, X_train, \
X_test, y_train, y_test,
fc_name,proba=True)
1. Iterates over the
items, which contain the
methods and their
corresponding data splits.
2. Assigns the training and
testing data from the
current feature scaling
method to X_train, X_test,
y_train, and y_test
variables.
3. Initializes a
DecisionTreeClassifier as
dt, which is the classifier
used for training and
prediction.
4. Defines the
hyperparameter grid for
the DecisionTreeClassifier.
In this case, it specifies the
range of max_depth values
to consider and sets the
random_state to 2021.
These hyperparameters will
be tuned using grid search.
5. Creates a GridSearchCV
object called searcher with
the DecisionTreeClassifier
and the defined parameter
grid. This object will
perform a grid search over
the specified
hyperparameters.
6. Calls the run_model()
function to train and
evaluate the
DecisionTreeClassifier with
the grid search on the
method. It passes the
classifier (searcher),
training and testing data,
feature scaling method
(fc_name), and enables
probability predictions
(proba=True).
The run_model() function will fit the
DecisionTreeClassifier with grid search on the
training data, make predictions on the testing
data, and evaluate the performance of the
model. It will also generate various plots and
metrics to analyze the results.
By iterating through the different feature

scaling methods, this code performs
hyperparameter tuning using grid search and
evaluates the DecisionTreeClassifier model for
each feature scaling method. This allows for a
comparison of the performance of the decision
tree model with different feature scaling
techniques.

DecisionTree Classifier
Raw
accuracy: 0.9978463747307968
recall: 0.9978463747307968
precision: 0.9978473987656798
f1: 0.9978463725110738
score support
0 1.00 1.00
1.00 697
1 1.00 1.00
1.00 696
accuracy
1.00 1393
macro avg 1.00 1.00
1.00 1393
1.00 1393

shown in Figure 93 – 97. The output shows the
performance of the DecisionTreeClassifier
model with grid search using raw scaling.
Here's an analysis of the output:
Accuracy: The model
correctly classified 99.8%
of the samples in the
testing set.
Recall: The model achieved
a recall score of 0.998,
which means it correctly
identified 99.8% of the
positive cases (class 1) in
the testing set.
Precision: The model
achieved a precision score
of 0.998, indicating that
among the samples it
predicted as positive,
99.8% were actually
positive.
F1-Score: The model
achieved an F1-score of
0.998, which is a balanced
measure of precision and
recall.
Support: The support
column shows the number
of samples for each class in
the testing set.
Overall, the DecisionTreeClassifier model with
grid search performed exceptionally well with
raw scaling, achieving near-perfect accuracy,
precision, recall, and F1-score. This suggests
that the decision tree model was able to
capture the underlying patterns in the raw-
scaled features and make accurate predictions
on the thyroid disease classification task.
Figure 93 The confusion matrix of DT model
Figure 94 The true values versus

predicted values of DT model with raw feature
scaling
Figure 95 The learning curve of DT model with
raw feature scaling
Figure 96 The ROC of DT model with raw

feature scaling
Figure 97 The decision boundary of DT model

Normalization
accuracy: 0.9978463747307968
recall: 0.9978463747307968
precision: 0.9978473987656798
f1: 0.9978463725110738
score support
0 1.00 1.00
1.00 697
1 1.00 1.00
1.00 696
accuracy
1.00 1393
macro avg 1.00 1.00
1.00 1393
1.00 1393
The results of using normalized feature scaling

are shown in Figure 98 – 101. The output shows
the performance of the DecisionTreeClassifier
model with grid search using normalized
scaling. Here's an analysis of the output:
Accuracy: The model
testing set.
the testing set.
99.8% were actually
positive.
F1-Score: The model
recall.
the testing set.
Similar to the raw scaling, the
DecisionTreeClassifier model with grid search
performed exceptionally well with normalized
scaling, achieving near-perfect accuracy,
capture the underlying patterns in the
normalized features and make accurate
predictions on the thyroid disease classification
task.
It's important to note that these performance

metrics are specific to the given dataset and
may vary when applied to different datasets.

values of DT model with normalized feature
scaling
Figure 101 The ROC of DT model with

Figure 100 The learning curve of DT model

Standardization
accuracy: 0.9978463747307968
recall: 0.9978463747307968
precision: 0.9978473987656798
f1: 0.9978463725110738
score support
0 1.00 1.00
1.00 697
1 1.00 1.00
1.00 696
accuracy
1.00 1393
macro avg 1.00 1.00
1.00 1393
1.00 1393
The results of using standardized feature

scaling are shown in Figure 102 – 105. The
output shows the performance of the
using standardized scaling. Here's an analysis
of the output:
Accuracy: The model
testing set.
the testing set.
99.8% were actually
positive.
F1-Score: The model
recall.
the testing set.
Similar to the raw and normalized scaling, the
performed exceptionally well with standardized
scaling, achieving near-perfect accuracy,
capture the underlying patterns in the
standardized features and make accurate
predictions on the thyroid disease classification
task.

Figure 103 The learning curve of DT model

values of DT model with standardized feature
scaling
Figure 105 The ROC of DT model with
Step The Decision Tree Classifier model achieved

2 outstanding performance across all three
feature scaling methods: raw, normalized, and
standardized. Here's a comparison and analysis
of the outputs:
Raw Scaling:
Accuracy: 99.8%
Recall: 99.8%
Precision: 99.8%
F1-Score: 99.8%
Normalized Scaling:
Accuracy: 99.8%
Recall: 99.8%
Precision: 99.8%
F1-Score: 99.8%
Standardized Scaling:
Accuracy: 99.8%
Recall: 99.8%
Precision: 99.8%
F1-Score: 99.8%
The results indicate that the Decision Tree
Classifier model performed consistently well
across all feature scaling methods, achieving
near-perfect scores in accuracy, recall,
precision, and F1-score. This suggests that the
decision tree model was able to effectively
capture the patterns and relationships in the
dataset, regardless of the scaling method used.
The high accuracy and balanced scores indicate

that the Decision Tree Classifier model is
capable of accurately classifying instances of
thyroid disease based on the given features. It
demonstrates the model's ability to make
accurate predictions and generalize well to
unseen data.
Overall, the Decision Tree Classifier model

shows promising results for the task of thyroid
disease classification, regardless of the feature
scaling method used. However, it's important to
note that the performance may vary depending
on the specific dataset and the choice of
hyperparameters for the decision tree model.
Step Run Random Forest (RF) classifier on three feature

1 scaling:
1 rf =
2 RandomForestClassifier(n_estimators=200,
3 \
4 max_depth=20, random_state=2021)
5
value
run_model('RandomForest Classifier',
rf, X_train, \
proba=True)
The purpose of the code is to train and evaluate a

RandomForest Classifier model using different
feature scaling techniques. It aims to compare the
performance of the model when different scaling
methods are applied to the input data.
By iterating over the feature_scaling dictionary,

which contains different feature scaling methods
and corresponding training and testing data, the
code allows for evaluating the model under
different scaling scenarios. The run_model function
is called for each feature scaling method, which
performs the following tasks:
1. Trains the RandomForest
Classifier model using the
provided training data.
2. Predicts the target variable
for the testing data.
3. Calculates evaluation metrics
such as accuracy, recall,
precision, and F1-score to
assess the model's
performance.
4. Plots the confusion matrix,
predicted values vs true
values, ROC curve, and
learning curve to visualize the
model's performance and
characteristics.
5. Displays the evaluation
metrics and plots to provide a
comprehensive analysis of the
model's performance under
methods.
The code helps in understanding how different
feature scaling techniques impact the effectiveness
of the RandomForest Classifier model. It allows for
comparison and identification of the most suitable
scaling method for improving the model's
performance and prediction accuracy.

Raw
accuracy: 0.9985642498205313
recall: 0.9985642498205313
precision: 0.9985642498205313
f1: 0.9985642498205313
support
0 1.00 1.00 1.00
697
1 1.00 1.00 1.00
696
accuracy 1.00
1393
macro avg 1.00 1.00 1.00
1393
1393
The results of using raw feature scaling are shown

in Figure 106 – 110. The RandomForest Classifier
model with Raw Scaling shows exceptional
performance on the thyroid disease dataset. It
achieves a perfect accuracy of 0.9985, indicating
that it correctly predicts the class labels for almost
all of the samples in the test dataset. The recall
and precision scores for both classes (0 and 1) are
also perfect, indicating that the model effectively
identifies positive and negative cases without any
false positives or false negatives.
The F1-score, which considers both precision and

recall, is also perfect for both classes, indicating a
balance between precision and recall. This implies
that the model not only accurately predicts positive
and negative cases but also maintains a good
balance between correctly identifying true
positives and minimizing false positives and false
negatives.
The classification report confirms that the model

performs consistently well across both classes,
with a macro average and weighted average of
precision, recall, and F1-score being perfect. This
indicates that the model is highly reliable and
robust in its predictions.
In conclusion, the RandomForest Classifier model

with Raw Scaling is highly accurate and reliable in
classifying thyroid disease. It demonstrates
exceptional performance on the given dataset,
making it a suitable choice for diagnosing thyroid
diseases based on the provided features.
Figure 106 The confusion matrix of RF model with

raw feature scaling
Figure 107 The true values versus

predicted values of RF model with raw feature
scaling
Figure 108 The learning curve of RF model with
raw feature scaling
Figure 109 The ROC of RF model with raw feature

scaling
Figure 110 The decision boundary of RF model

Normalization
accuracy: 0.9985642498205313
recall: 0.9985642498205313
precision: 0.9985642498205313
f1: 0.9985642498205313
support
0 1.00 1.00 1.00
697
1 1.00 1.00 1.00
696
accuracy 1.00
1393
macro avg 1.00 1.00 1.00
1393
1393

shown in Figure 111 – 114. The RandomForest
Classifier model with Normalized Scaling exhibits
exceptional performance on the thyroid disease
dataset. It achieves a perfect accuracy of 0.9986,
indicating that it accurately predicts the class
labels for almost all samples in the test dataset.
The recall and precision scores for both classes (0
and 1) are also perfect, signifying that the model
effectively identifies positive and negative cases
without any false positives or false negatives.
The F1-score, which combines precision and recall,

is also perfect for both classes. This indicates a
harmonious balance between precision and recall,
demonstrating that the model accurately identifies
true positives while minimizing false positives and
false negatives.
The classification report confirms the consistent

and outstanding performance of the model across
both classes, with both macro average and
weighted average precision, recall, and F1-score
being perfect. This highlights the model's
reliability and robustness in its predictions.

with Normalized Scaling is highly accurate and
dependable for classifying thyroid disease. It
demonstrates exceptional performance on the
given dataset, making it a suitable choice for
diagnosing thyroid diseases based on the provided
features.

values of RF model with normalized feature scaling
Figure 113 The ROC of RF model with normalized
feature scaling

Output with Standardized Scaling

Standardization
accuracy: 0.9985642498205313
recall: 0.9985642498205313
precision: 0.9985642498205313
f1: 0.9985642498205313
support
0 1.00 1.00 1.00
697
1 1.00 1.00 1.00
696
accuracy 1.00
1393
macro avg 1.00 1.00 1.00
1393
1393

are shown in Figure 115 – 118. The RandomForest
Classifier model with Standardized Scaling
performs exceptionally well on the thyroid disease
dataset. It achieves a perfect accuracy of 0.9986,
indicating that it accurately predicts the class
labels for almost all samples in the test dataset.
The recall and precision scores for both classes (0
and 1) are also perfect, indicating that the model
effectively identifies positive and negative cases
without any false positives or false negatives.
The F1-score, which combines precision and recall,

is also perfect for both classes. This demonstrates
a balanced trade-off between precision and recall,
indicating that the model accurately identifies true
positives while minimizing false positives and false
negatives.
The classification report further confirms the
consistent and outstanding performance of the
model across both classes, with both the macro
average and weighted average precision, recall,
and F1-score being perfect. This highlights the
model's reliability and robustness in its
predictions.

with Standardized Scaling is highly accurate and
reliable for classifying thyroid disease. It
demonstrates exceptional performance on the
given dataset, making it a suitable choice for
features.

values of RF model with standardized feature
scaling
Figure 117 The ROC of RF model with

Step The Random Forest Classifier models with

2 different feature scalings (Raw, Normalization, and
Standardization) all exhibit exceptional
performance on the thyroid disease dataset. In all
cases, the models achieve a perfect accuracy of
0.9986, indicating that they accurately predict the
class labels for almost all samples in the test
dataset.
The recall, precision, and F1-scores for both

classes (0 and 1) are also perfect across all feature
scalings, indicating that the models effectively
identify positive and negative cases without any
false positives or false negatives. This
demonstrates the robustness and consistency of
the Random Forest Classifier in classifying thyroid
disease.
The classification reports for all feature scalings
confirm the outstanding performance of the
models, with perfect precision, recall, and F1-
scores for both classes. The macro average and
weighted average metrics also indicate excellent
overall performance.
In conclusion, the Random Forest Classifier is a

highly accurate and reliable model for classifying
thyroid disease. Regardless of the feature scaling
method used, the model consistently achieves
perfect accuracy and demonstrates excellent
precision, recall, and F1-scores. This makes the
Random Forest Classifier a strong choice for
features.
Step Run Gradient Boosting (GB) classifier on three

1 feature scaling:
1 gbt =
2 GradientBoostingClassifier(n_estimators
= 200, \
3
max_depth=21, subsample=0.8,
4
max_features=0.2, \
5
random_state=2021)
6
7
value
run_model('GradientBoosting
Classifier', gbt, X_train, \
proba=True)
The purpose of the provided code is to train and

evaluate a Gradient Boosting Classifier (GBT)
model using different feature scaling methods
(Raw, Normalization, and Standardization) on the
thyroid disease dataset.
The code initializes a GBT model with specific

hyperparameters such as the number of
estimators, maximum depth, subsampling rate,
maximum features, and a random state for
reproducibility.
It then iterates over the feature scaling methods

and corresponding datasets. For each iteration, it
calls the run_model function to train and evaluate
the GBT model on the current feature scaling
method. The run_model function calculates
various evaluation metrics such as accuracy,
recall, precision, and F1-score. It also generates
visualizations such as confusion matrix, ROC
curve, decision boundary, and learning curve to
provide insights into the model's performance.
The output of the code includes the evaluation

metrics and visualizations for each feature scaling
method.
Output with Standardization Scaling:
GradientBoosting Classifier
Standardization
accuracy: 0.9985642498205313
recall: 0.9985642498205313
precision: 0.9985642498205313
f1: 0.9985642498205313
score support
0 1.00 1.00
1.00 697
1 1.00 1.00
1.00 696
accuracy
1.00 1393
macro avg 1.00 1.00
1.00 1393
1.00 1393

are shown in Figure 119 – 123. The output of the
GradientBoosting Classifier (GBT) model with
Standardization Scaling shows exceptional
performance on the thyroid disease dataset. Here
are some key points to analyze in more detail:
Accuracy: The model
achieves an accuracy of
correctly predicts almost all
instances in the dataset. This
high accuracy suggests that
the model is highly reliable
in classifying thyroid disease
cases.
Precision and Recall: The
precision and recall values
for both classes (0 and 1) are
1.00. This means that the
model correctly identifies all
positive cases (both true
positives and true negatives)
and avoids false positives
and false negatives. The
perfect precision and recall
scores indicate that the
model has a high level of
precision in classifying both
healthy and diseased
individuals.
F1-Score: The F1-score is
0.9986, which is a harmonic
mean of precision and recall.
This metric provides an
overall measure of the
model's accuracy in
classifying both classes. The
high F1-score indicates that
the model has a strong
balance between precision
and recall.
Confusion Matrix: The
confusion matrix shows that
the model correctly classified
all instances in both the
negative (0) and positive (1)
classes. There are no
instances of
misclassification, suggesting
that the model is robust and
accurate in predicting the
presence or absence of
thyroid disease.
Based on these observations, we can conclude
that the GradientBoosting Classifier model with
Standardization Scaling is highly accurate and
reliable for predicting thyroid disease. It
demonstrates perfect performance on the given
dataset, achieving high precision, recall, and F1-
score. This indicates that the model can be a
valuable tool for medical professionals in
diagnosing thyroid disease accurately.
Figure 119 The confusion matrix of GB model

values of GB model with standardized feature
scaling
Figure 121 The ROC of GB model with

Figure 122 The learning curve of GB model with
Figure 123 The decision boundary of GB model


Step Run XGBoost classifier on three feature scaling:
1

3 X_train, X_test, y_train,
y_test = value
4
xgb=XGBClassifier(n_estimators
5
= 200, max_depth=20, \
6
random_state=2021,
7 use_label_encoder=False, \
eval_metric='mlogloss')
run_model('XGBoost Classifier',
xgb, X_train, X_test, \
y_train, y_test, fc_name,
proba=True)

1. Iterate over the
using a for loop.
2. Extract the X_train, X_test,
y_train, and y_test data
corresponding to the
method.
3. Create an instance of the
XGBoost Classifier (xgb)
with specific parameters,
including the number of
estimators, maximum
depth, random state, use of
label encoder, and
evaluation metric.
4. Call the run_model()
evaluate the XGBoost
Classifier on the current
feature scaling method.
5. Pass the classifier (xgb),
training and testing data
(X_train, X_test, y_train,
y_test), feature scaling
method (fc_name), and set
proba=True to enable
probability prediction.
This code allows you to compare the
performance of the XGBoost Classifier using
different feature scaling methods. It provides
insights into how the model performs when the
input features are scaled using different
techniques.

XGBoost Classifier
Standardization
accuracy: 0.9978463747307968
recall: 0.9978463747307968
precision: 0.9978473987656798
f1: 0.9978463725110738
score support
0 1.00 1.00
1.00 697
1 1.00 1.00
1.00 696
accuracy
1.00 1393
macro avg 1.00 1.00
1.00 1393
1.00 1393

XGBoost Classifier performed exceptionally
well on the standardized scaled data with an
accuracy, recall, precision, and F1-score of
1.00. This indicates that the classifier achieved
perfect classification on the test set.
The precision, recall, and F1-score for both

classes (0 and 1) are also 1.00, indicating that
the classifier achieved perfect predictions for
both the negative and positive classes.
Overall, the XGBoost Classifier with

standardized scaling achieved excellent
performance on the test set, demonstrating its
ability to effectively classify the data.
Figure 124 The confusion matrix of XGBoost
model with standardized feature scaling

values of XGBoost model with standardized
feature scaling
Figure 126 The learning curve of XGBoost
Figure 127 The ROC of XGBoost model with

Figure 128 The decision boundary of XGBoost
Step Run MLP classifier on three feature scaling:

1
1 mlp =
2 MLPClassifier(random_state=2021)
4
X_train, X_test, y_train, y_test
5
= value
run_model('MLP Classifier', mlp,
X_train, X_test, \
y_train, y_test, fc_name,
proba=True)
The purpose of the code is to evaluate the

performance of an MLP Classifier model on
different feature scaling methods. It aims to
compare how different scaling techniques affect
the model's accuracy, recall, precision, and F1-
score.
The code achieves this by following these steps:

1. It initializes an MLP
Classifier model with a
fixed random state for
reproducibility.
2. It iterates over the
feature_scaling dictionary,
which contains different
feature scaling methods as
keys and their
corresponding data splits
as values.
3. For each feature scaling
method, it extracts the
training and testing data
splits.
4. It then calls the
run_model() function for
each feature scaling
method, passing the MLP
Classifier model, the
extracted data splits, and
other necessary
parameters.
5. Inside the run_model()
function, the MLP
Classifier model is trained
on the training data and
evaluated on the testing
data. Various performance
metrics such as accuracy,
recall, precision, and F1-
score are computed and
printed. Additionally,
visualizations like
confusion matrix, predicted
vs. actual values plot, ROC
curve plot, decision
boundary plot, and learning
curve plot are generated.
6. Finally, the performance
metrics for each feature
scaling method are stored
in the list_scores list for
further analysis and
comparison.
Overall, the purpose of the code is to assess
how different feature scaling techniques impact
the performance of the MLP Classifier model
and provide insights into the most effective
scaling approach for the given dataset.

MLP Classifier
Standardization
accuracy: 0.990667623833453
recall: 0.990667623833453
precision: 0.9906766982718167
f1: 0.9906675661202962
score support
0 0.99 0.99
0.99 697
1 0.99 0.99
0.99 696
accuracy
0.99 1393
macro avg 0.99 0.99
0.99 1393
0.99 1393

MLP Classifier model with Standardized scaling
achieved impressive results with an accuracy,
recall, precision, and F1-score of 99.07%. This
indicates that the model is highly effective in
correctly classifying both the negative (0) and
positive (1) cases in the dataset.
The precision and recall values for both classes

are also excellent, indicating a high level of
confidence in the model's predictions. The high
F1-score further validates the model's ability to
achieve a balance between precision and recall.
The overall performance of the MLP Classifier

model suggests that it can reliably classify
instances in the dataset with a high degree of
accuracy. This makes it a suitable choice for
tasks that require accurate classification, such
as identifying thyroid disease.
However, it is important to note that these
results are specific to the dataset and the
chosen hyperparameters for the MLP Classifier
model. It is always recommended to evaluate
and fine-tune models based on the specific
problem domain and dataset characteristics.
In conclusion, the MLP Classifier model with

Standardized scaling demonstrates exceptional
performance in classifying thyroid disease
cases. It shows high accuracy, precision, recall,
and F1-score, indicating its reliability in
predicting the presence or absence of thyroid
disease.
Figure 129 The confusion matrix of MLP

values of MLP model with standardized feature
scaling
Figure 131 The ROC of MLP model with

Figure 132 The learning curve of MLP model
Figure 133 The decision boundary of MLP

Step Run LGBM classifier on three feature scaling:

1
1 lgbm = LGBMClassifier(max_depth =
2 20, n_estimators=500,
subsample=0.8, random_state=2021)
3
4
5
X_train, X_test, y_train,
6 y_test = value
run_model('Lightgbm
Classifier', lgbm, X_train, \
proba=True)
The purpose of the code is to train and evaluate

the LightGBM Classifier model with different
feature scaling techniques (Raw,
Normalization, and Standardization) on the
given dataset. The model is created with
specified hyperparameters such as max_depth,
n_estimators, subsample, and random_state.
For each feature scaling method, the code splits

the dataset into training and testing sets, and
then trains the LightGBM Classifier model
using the training data. It subsequently
evaluates the model's performance on the
testing data by calculating various metrics such
as accuracy, recall, precision, and F1-score.
The code also generates visualizations such as a
confusion matrix, ROC curve, and learning
curve to further analyze the model's
performance.
The purpose of running the model with

different feature scaling techniques is to
observe how the choice of scaling method
affects the model's performance. By comparing
the results across different scaling techniques,
we can gain insights into which scaling method
works best for the LightGBM Classifier model
on the given dataset.
Overall, the code allows us to assess the

effectiveness of the LightGBM Classifier model
and determine the impact of different feature
scaling techniques on its performance.

Lightgbm Classifier
Standardization
accuracy: 0.9971284996410624
recall: 0.9971284996410624
precision: 0.997132592854707
f1: 0.9971284907621472
score support
0 1.00 1.00
1.00 697
1 1.00 1.00
1.00 696
accuracy
1.00 1393
macro avg 1.00 1.00
1.00 1393
1.00 1393

LightGBM Classifier model performed
exceptionally well with standardized scaling. It
achieved an accuracy of 99.71% on the test
data, indicating a high level of overall
correctness in its predictions. The model also
exhibited excellent performance in terms of
recall, precision, and F1-score, all of which
were at 99.71%.
The precision and recall scores of 1.00 for both
classes (0 and 1) indicate that the model
achieved perfect precision and recall,
suggesting that it made no false positives or
false negatives. The F1-score of 0.997 further
confirms the model's ability to accurately
classify instances from both classes.
In summary, the LightGBM Classifier model

with standardized scaling achieved exceptional
performance, demonstrating its effectiveness in
accurately classifying the given dataset. The
results suggest that standardizing the features
prior to training the model improved its
predictive capabilities and resulted in highly
accurate predictions.
Figure 134 The confusion matrix of LGBM

values of LGBM model with standardized
feature scaling
Figure 136 The ROC of LGBM model with

Figure 137 The learning curve of LGBM model
Figure 138 The decision boundary of LGBM

Following is the full version of thyroid.py:
#thyroid.py
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
import os
import plotly.graph_objs as go
import joblib
import itertools
roc_auc_score,roc_curve
train_test_split, RandomizedSearchCV,
GridSearchCV,StratifiedKFold
from sklearn.preprocessing import StandardScaler,
MinMaxScaler
LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier,
GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler,
\
LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix,
accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report,
f1_score, plot_confusion_matrix
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
#Reads dataset
curr_path = os.getcwd()
df = pd.read_csv(curr_path+"/hypothyroid.csv")
#Checks shape
print(df.shape)
#Reads columns
print("Data Columns --> ",df.columns)
#Checks dataset information

print(df.info())
#Checks count, unique values, and frequency

print(df.describe().T)
#Checks null values

df=df.replace({"?":np.NAN})
print('Total number of null values: ',
df.isnull().sum().sum())
#Converts six columns into numerical

num_cols = ['age', 'FTI','TSH','T3','TT4','T4U']
df[num_cols] = df[num_cols].apply(pd.to_numeric,
errors='coerce')
#Deletes irrelevant columns

df.drop(['TBG','referral
source'],axis=1,inplace=True)
#Handles missing values

#Uses mode imputation for all other categorical
features
def mode_imputation(feature):
mode=df[feature].mode()[0]

df['age'].fillna(df['age'].mean(), inplace=True)
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
df['TSH'] = imputer.fit_transform(df[['TSH']])
df['T3'] = imputer.fit_transform(df[['T3']])
df['TT4'] = imputer.fit_transform(df[['TT4']])
df['T4U'] = imputer.fit_transform(df[['T4U']])
df['FTI'] = imputer.fit_transform(df[['FTI']])
#Checks each column datatype

print(df.dtypes)
#Checks null values
print('Total number of null values: ',
df.isnull().sum().sum())
#Looks at statistical description of data

print(df.describe().to_string())
#Cleans age column

for i in range(df.shape[0]):
if df.age.iloc[i] > 100.0:
df.age.iloc[i] = 100.0
df['age'].describe()
#Prints the total distribution

print("Total Number of samples :
{}".format(df.shape[0]))
print("Total No.of Negative Thyroid:
{}".format(df[df.binaryClass == 'N'].shape[0]))
print("Total No.of Positive Thyroid :
{}".format(df[df.binaryClass == 'P'].shape[0]))
#Defines function to create pie chart and bar plot

as subplots
def plot_piechart(df, var, title=''):
plt.subplot(121)
label_list =
list(df[var].value_counts().index)
df[var].value_counts().plot.pie(autopct="%1.1f%%",
\
colors=sns.color_palette("prism", 7), \
startangle=60,
labels=label_list, \
wedgeprops=
{"linewidth": 2, "edgecolor": "k"}, \
shadow=True,
textprops={'fontsize': 20}
)
plt.title("Distribution of " + var + "
variable " + title, fontsize=25)
value_counts = df[var].value_counts()
# Print percentage values
percentages = value_counts / len(df) * 100
print("Percentage values:")
print(percentages)
plt.subplot(122)
ax = df[var].value_counts().plot(kind="barh")
for i, j in
enumerate(df[var].value_counts().values):
ax.text(.7, i, j, weight="bold",
fontsize=20)
plt.title("Count of " + var + " cases " +

title, fontsize=25)
# Print count values
print("Count values:")
print(value_counts)
plt.show()
plot_piechart(df,'binaryClass')
# Looks at distribution of all features in the
whole original dataset
columns = list(df.columns)
columns.remove('binaryClass')
plt.subplots(figsize=(35, 50))
length = len(columns)
color_palette = sns.color_palette("Set3",
n_colors=length) # Define color palette
for i, j in itertools.zip_longest(columns,
range(length)):
plt.subplot((length // 2), 5, j + 1)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
ax = df[i].hist(bins=10, edgecolor='black',
color=color_palette[j]) # Set color for each
histogram
ax.annotate(format(p.get_height(), '.0f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center',
weight="bold", fontsize=17, textcoords='offset
points')
plt.title(i, fontsize=30) # Adjust title font

size
plt.show()
from tabulate import tabulate

# Looks at another feature distribution by
binaryClass feature
def another_versus_label(feat):
fig, (ax1, ax2) = plt.subplots(nrows=2,
ncols=1, figsize=(25, 15))
plt.subplots_adjust(wspace=0.5, hspace=0.25)
# Define color palette
colors = sns.color_palette("Set2")
df[df['binaryClass'] == "N"]
[feat].plot(ax=ax1, kind='hist', bins=10,
edgecolor='black', color=colors[0])
ax1.set_title('Negative Thyroid', fontsize=25)
ax1.set_xlabel(feat, fontsize=20)
ax1.set_ylabel('Count', fontsize=20)
data1 = []
x = p.get_x() + p.get_width() / 2.
y = p.get_height()
ax1.annotate(format(y, '.0f'), (x, y),
ha='center',
weight="bold", fontsize=20,
df[df['binaryClass'] == "P"]
[feat].plot(ax=ax2, kind='hist', bins=10,
edgecolor='black', color=colors[1])
ax2.set_title('Positive Thyroid', fontsize=25)
ax2.set_xlabel(feat, fontsize=20)
ax2.set_ylabel('Count', fontsize=20)
data2 = []
x = p.get_x() + p.get_width() / 2.
y = p.get_height()
ax2.annotate(format(y, '.0f'), (x, y),
ha='center',
weight="bold", fontsize=20,
plt.show()
# Print x and y values using tabulate

print("Negative Thyroid:")
print(tabulate(data1, headers=[feat, "y"]))
print("\nPositive Thyroid:")
print(tabulate(data2, headers=[feat, "y"]))
#Looks at sex feature distribution by binaryClass

feature
another_versus_label("age")
#Looks at TSH feature distribution by binaryClass

feature
another_versus_label("TSH")
#Looks at T3 feature distribution by binaryClass

feature
another_versus_label("T3")
#Looks at TT4 feature distribution by binaryClass

feature
another_versus_label("TT4")
#Looks at T4U feature distribution by binaryClass

feature
another_versus_label("T4U")
#Looks at FTI feature distribution by binaryClass

feature
another_versus_label("FTI")
def put_label_stacked_bar(ax,fontsize):
#patches is everything inside of the chart
for rect in ax.patches:
# Find where everything is located
height = rect.get_height()
width = rect.get_width()
x = rect.get_x()
y = rect.get_y()
# The height of the bar is the data value and can

be used as the label
label_text = f'{height:.0f}' #
f'{height:.2f}' to format decimal values

specified value
if height > 0:
ax.text(label_x, label_y, label_text,
ha='center', va='center', weight =
"bold",fontsize=fontsize)
def dist_count_plot(df, cat):

fig = plt.figure(figsize=(25, 15))
ax1 = fig.add_subplot(111)
group_by_stat = df.groupby([cat,
'binaryClass']).size()
stacked_plot =
group_by_stat.unstack().plot(kind='bar',
stacked=True, ax=ax1, grid=True)
ax1.set_title('Stacked Bar Plot of ' + cat + '
(number of cases)', fontsize=14)
ax1.set_ylabel('Number of Cases')
put_label_stacked_bar(ax1, 17)
# Collect the values of each stacked bar

data = []
headers = [''] +
df['binaryClass'].unique().tolist()
for container in stacked_plot.containers:
bar_values = []
for bar in container:
bar_value = bar.get_height()
bar_values.append(bar_value)
data.append(bar_values)
# Transpose the data for tabulate

data_transposed = list(map(list, zip(*data)))
# Insert the values of `cat` as the first column

in the data
data_with_cat = [[value] + row for value, row
in zip(group_by_stat.index.get_level_values(cat),
data_transposed)]
# Print the values in tabular form

print(tabulate(data_with_cat, headers=headers))
plt.show()
#Plots the distribution of sex feature in pie

chart and bar plot
plot_piechart(df,'sex')
#Plots binaryClass variable against sex variable

in stacked bar plots
dist_count_plot(df,'sex')
#Plots the distribution of on thyroxine feature in
plot_piechart(df,'on thyroxine')
#Plots binaryClass variable against on thyroxine

variable in stacked bar plots
dist_count_plot(df,'on thyroxine')
#Plots the distribution of sick feature in pie

chart and bar plot
plot_piechart(df,'sick')
#Plots binaryClass variable against sick variable

dist_count_plot(df,'sick')
#Plots the distribution of tumor feature in pie

chart and bar plot
plot_piechart(df,'tumor')
#Plots binaryClass variable against tumor variable

dist_count_plot(df,'tumor')
#Plots the distribution of TSH measured feature in

plot_piechart(df,'TSH measured')
#Plots binaryClass variable against TSH measured

dist_count_plot(df,'TSH measured')
#Plots the distribution of TT4 measured feature in

plot_piechart(df,'TT4 measured')
#Plots binaryClass variable against TT4 measured

dist_count_plot(df,'TT4 measured')
#Checks dataset information

print(df.info())
#Extracts categorical and numerical columns
cat_cols = [col for col in df.columns if (df[col].dtype == 'object')]
num_cols = [col for col in df.columns if (df[col].dtype != 'object')]
print(cat_cols)
print(num_cols)
# Checks numerical features density distribution

plotnumber = 1
color_palette = sns.color_palette("husl", n_colors=len(num_cols)) # Define c
for column in num_cols:

if plotnumber <= 6:
ax = plt.subplot(2, 3, plotnumber)
sns.distplot(df[column], color=color_palette[plotnumber-1])
plt.xlabel(column, fontsize=40)
ax.annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_wid
ha='center', va='center', xytext=(0, 10), weight="bold", fontsize=30, textcoo
plotnumber += 1
fig.suptitle('The density of numerical features', fontsize=50)

plt.tight_layout()
plt.show()

plotnumber = 1
color_palette = sns.color_palette("Set3", n_colors=len(cat_cols)) # Define c
for column, color in zip(cat_cols, color_palette):

if plotnumber <= 20:
ax = plt.subplot(4, 5, plotnumber)
ax.tick_params(axis='x', labelsize=30)
ax.tick_params(axis='y', labelsize=30)
sns.countplot(df[column], color=color)
plt.xlabel(column, fontsize=40)
ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_wid
ha='center', va='center', xytext=(0, 10), weight="bold", fontsize=30, textcoo
plotnumber += 1
fig.suptitle('The distribution of categorical features', fontsize=50)

plt.tight_layout()
plt.show()
def plot_one_versus_one_cat(feat):
categorical_features = ["FTI measured", "sex", "on thyroxine", "sick", "p
"hypopituitary", "psych"]
num_plots = len(categorical_features)
fig, axes = plt.subplots(3, 3, figsize=(40, 30), facecolor='#fbe7dd')

color_palette = sns.color_palette("Spectral_r", n_colors=num_plots)
for i in range(num_plots):
ax = axes[i]
ax.tick_params(axis='x', labelsize=30)
ax.tick_params(axis='y', labelsize=30)
g = sns.countplot(x=df[categorical_features[i]], hue=df[feat], palett
ax=ax)
for p in g.patches:
g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_widt
ha='center', va='center', xytext=(0, 10), weight="bold
g.set_xlabel(categorical_features[i], fontsize=30)
g.set_ylabel("Number of Cases", fontsize=30)
g.legend(fontsize=25)
plt.tight_layout()
plt.show()
plot_one_versus_one_cat("binaryClass")
plot_one_versus_one_cat("on antithyroid medication")
plot_one_versus_one_cat("thyroid surgery")
plot_one_versus_one_cat("lithium")
plot_one_versus_one_cat("TSH measured")
plot_one_versus_one_cat("T3 measured")
plot_one_versus_one_cat("TT4 measured")
plot_one_versus_one_cat("T4U measured")
plot_one_versus_one_cat("FTI measured")
plot_one_versus_one_cat("I131 treatment")
plot_one_versus_one_cat("query hypothyroid")
def plot_feat1_feat2_vs_target_pie(df, target, col1, col2):

gs0 = df[df[target] == 'P'][col1].value_counts()
gs1 = df[df[target] == 'N'][col1].value_counts()
ss0 = df[df[target] == 'P'][col2].value_counts()
ss1 = df[df[target] == 'N'][col2].value_counts()
col1_labels = df[col1].unique()
col2_labels = df[col2].unique()
_, ax = plt.subplots(2, 2, figsize=(20, 20), facecolor='#f7f7f7')
cmap = plt.get_cmap('Pastel1') # Define color map
ax[0][0].pie(gs0, labels=col1_labels[:len(gs0)], shadow=True, autopct='%1

len(gs0),
colors=cmap(np.arange(len(gs0))), textprops={'fontsize': 20}
ax[0][1].pie(gs1, labels=col1_labels[:len(gs1)], shadow=True, autopct='%1
len(gs1),
colors=cmap(np.arange(len(gs1))), textprops={'fontsize': 20}
ax[1][0].pie(ss0, labels=col2_labels[:len(ss0)], shadow=True, autopct='%1
explode=[0.04] * len(ss0), colors=cmap(np.arange(len(ss0))),
ax[1][1].pie(ss1, labels=col2_labels[:len(ss1)], shadow=True, autopct='%1
explode=[0.04] * len(ss1), colors=cmap(np.arange(len(ss1))),
ax[0][0].set_title(f"{target.capitalize()} = 0", fontsize=30)

ax[0][1].set_title(f"{target.capitalize()} = 1", fontsize=30)
plt.show()
# Print each pie chart as a table in tabular form for each stroke
print(f"{target.capitalize()} = 0:")
gs0_table = pd.DataFrame({'Categories': col1_labels[:len(gs0)], 'Percenta
'Count': gs0})
print(gs0_table)
print(f"\n{target.capitalize()} = 1:")
gs1_table = pd.DataFrame({'Categories': col1_labels[:len(gs1)], 'Percenta
'Count': gs1})
print(gs1_table)
print(f"{target.capitalize()} = 0:")
ss0_table = pd.DataFrame({'Categories': col2_labels[:len(ss0)], 'Percenta
'Count': ss0})
print(ss0_table)
print(f"\n{target.capitalize()} = 1:")
ss1_table = pd.DataFrame({'Categories': col2_labels[:len(ss1)], 'Percenta
'Count': ss1})
print(ss1_table)
plot_feat1_feat2_vs_target_pie(df, "binaryClass", "on thyroxine", "on antithy
plot_feat1_feat2_vs_target_pie(df, "binaryClass", "sick", "pregnant")
plot_feat1_feat2_vs_target_pie(df, "binaryClass", "lithium", "tumor")
def feat_versus_other(feat,another,legend,ax0,label):
for s in ["right", "top"]:
ax0.spines[s].set_visible(False)
ax0_sns = sns.histplot(data=df, x=feat,ax=ax0,zorder=2,kde=False,hue=anot

shrink=.8
,linewidth=0.3,alpha=1)
put_label_stacked_bar(ax0_sns,15)
ax0_sns.set_xlabel('',fontsize=30, weight='bold')
ax0_sns.set_ylabel('',fontsize=30, weight='bold')
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidt
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidt

ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', fontsize=10, bbox_to_
loc='upper right')
plt.tight_layout()
label_bin = list(df["binaryClass"].value_counts().index)
label_sex = list(df["sex"].value_counts().index)
label_thyroxine = list(df["on thyroxine"].value_counts().index)
label_pregnant = list(df["pregnant"].value_counts().index)
label_lithium = list(df["lithium"].value_counts().index)
label_goitre = list(df["goitre"].value_counts().index)
label_tumor = list(df["tumor"].value_counts().index)
label_tsh = list(df["TSH measured"].value_counts().index)
label_tt4 = list(df["TT4 measured"].value_counts().index)
label_dict = {
"binaryClass": label_bin,
"sex": label_sex,
"on thyroxine": label_thyroxine,
"pregnant": label_pregnant,
"lithium": label_lithium,
"goitre": label_goitre,
"tumor": label_tumor,
"TSH measured": label_tsh,
"TT4 measured": label_tt4
}
def hist_feat_versus_nine_cat(feat, label):

ax_list = []


ax = axes[i]
feat_versus_other(feat, df[cat_feature], label_var, ax, f"{cat_featur
ax_list.append(ax)
plt.tight_layout()
plt.show()
hist_feat_versus_nine_cat(df["age"],"age")
def prob_feat_versus_other(feat,another,legend,ax0,label):
ax0_sns =
sns.kdeplot(x=feat,ax=ax0,hue=another,linewidth=0.3,fill=True,cbar='g',zorder
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidt

ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidt

ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', fontsize=25, bbox_to_
loc='upper right')
plt.tight_layout()
def prob_feat_versus_nine_cat(feat, label):


ax = axes[i]
prob_feat_versus_other(feat, df[cat_feature], label_var, ax, f"{cat_f
plt.tight_layout()
plt.show()
prob_feat_versus_nine_cat(df["age"],"age")
hist_feat_versus_nine_cat(df["TSH"],"TSH")
prob_feat_versus_nine_cat(df["TSH"],"TSH")
hist_feat_versus_nine_cat(df["T3"],"T3")
prob_feat_versus_nine_cat(df["T3"],"T3")
hist_feat_versus_nine_cat(df["TT4"],"TT4")
prob_feat_versus_nine_cat(df["TT4"],"TT4")
hist_feat_versus_nine_cat(df["T4U"],"T4U")
prob_feat_versus_nine_cat(df["T4U"],"T4U")
hist_feat_versus_nine_cat(df["FTI"],"FTI")
prob_feat_versus_nine_cat(df["FTI"],"FTI")
#Converts binaryClass feature to {0,1}

def map_binaryClass(n):
if n == "N":
return 0
else:
return 1
df['binaryClass'] = df['binaryClass'].apply(lambda x: map_binaryClass(x))

def map_sex(n):
if n == "F":
return 0
else:
return 1
df['sex'] = df['sex'].apply(lambda x: map_sex(x))
#Extracts output and input variables

y = df['binaryClass'].values # Target for the model
X = df.drop(['binaryClass'], axis = 1)
#Feature Importance using RandomForest Classifier

names = X.columns
rf = RandomForestClassifier()
rf.fit(X, y)
result_rf = pd.DataFrame()
result_rf['Features'] = X.columns
result_rf ['Values'] = rf.feature_importances_
result_rf.sort_values('Values', inplace = True, ascending = False)
sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue")
plt.xlabel('Feature Importance', fontsize=30)
plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
# Print the feature importance table

print("Feature Importance:")
print(result_rf)
#Feature Importance using ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
result_et = pd.DataFrame()
result_et['Features'] = X.columns
result_et ['Values'] = model.feature_importances_
result_et.sort_values('Values', inplace=True, ascending =False)
sns.barplot(x = 'Values',y = 'Features', data=result_et, color="red")
plt.xlabel('Feature Importance', fontsize=30)
plt.show()
# Print the feature importance table

print("Feature Importance:")
print(result_et)
#Feature Importance using RFE

from sklearn.feature_selection import RFE
model = LogisticRegression()
# create the RFE model
rfe = RFE(model)
rfe = rfe.fit(X, y)
result_lg = pd.DataFrame()
result_lg['Features'] = X.columns
result_lg ['Ranking'] = rfe.ranking_
result_lg.sort_values('Ranking', inplace=True , ascending = False)
sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange")
plt.show()
print("Feature Ranking:")
print(result_lg)
#Splits the data into training and testing

sm = SMOTE(random_state=42)
X,y = sm.fit_resample(X, y.ravel())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, ra

stratify=y)
X_train_raw = X_train.copy()
X_test_raw = X_test.copy()
y_train_raw = y_train.copy()
y_test_raw = y_test.copy()
X_train_norm = X_train.copy()
X_test_norm = X_test.copy()
y_train_norm = y_train.copy()
y_test_norm = y_test.copy()
norm = MinMaxScaler()
X_train_norm = norm.fit_transform(X_train_norm)
X_test_norm = norm.transform(X_test_norm)
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
y_train_stand = y_train.copy()
y_test_stand = y_test.copy()
X_train_stand = scaler.fit_transform(X_train_stand)
X_test_stand = scaler.transform(X_test_stand)
def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None

n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
if axes is None:
_, axes = plt.subplots(3, 1, figsize=(50, 50))
axes[0].set_title(title)
if ylim is not None:
axes[0].set_ylim(*ylim)
axes[0].set_xlabel("Training examples")
axes[0].set_ylabel("Score")
train_sizes, train_scores, test_scores, fit_times, _ = \

learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
# Plot learning curve

axes[0].grid()
axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score", lw=10)
axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score", lw=10)
axes[0].legend(loc="best")
axes[0].set_title('Learning Curve', fontsize=50)
axes[0].set_xlabel('Training Examples', fontsize=40)
axes[0].set_ylabel('Score', fontsize=40)
axes[0].tick_params(labelsize=30)
# Plot n_samples vs fit_times

axes[1].grid()
axes[1].plot(train_sizes, fit_times_mean, 'o-', lw=10)
axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
fit_times_mean + fit_times_std, alpha=0.1)
axes[1].set_xlabel("Training examples", fontsize=40)
axes[1].set_ylabel("fit_times", fontsize=40)
axes[1].set_title("Scalability of the model", fontsize=50)
axes[1].tick_params(labelsize=30)
# Plot fit_time vs score
axes[2].grid()
axes[2].plot(fit_times_mean, test_scores_mean, 'o-', lw=10)
axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1)
axes[2].set_xlabel("fit_times", fontsize=40)
axes[2].set_ylabel("Score", fontsize=40)
axes[2].set_title("Performance of the model", fontsize=50)
return plt
def plot_real_pred_val(Y_test, ypred, name):

acc=accuracy_score(Y_test,ypred)
plt.scatter(range(len(ypred)),ypred,color="blue",lw=5,label="Predicted")
plt.scatter(range(len(Y_test)), Y_test,color="red",label="Actual")
plt.title("Predicted Values vs True Values of " + name, fontsize=30)
plt.xlabel("Accuracy: " + str(round((acc*100),3)) + "%", fontsize=30)
plt.legend()
plt.show()
def plot_cm(Y_test, ypred, name):

fig, ax = plt.subplots(figsize=(25, 15))
cm = confusion_matrix(Y_test, ypred)
sns.heatmap(cm, annot=True, linewidth=0.7, linecolor='red', fmt='g', cmap
{"size": 30})
plt.title(name + ' Confusion Matrix', fontsize=30)
ax.xaxis.set_ticklabels(['Negative', 'Positive'], fontsize=20);
ax.yaxis.set_ticklabels(['Negative', 'Positive'], fontsize=20);
plt.xlabel('Y predict', fontsize=30)
plt.ylabel('Y test', fontsize=30)
plt.show()
return cm
#Plots ROC
def plot_roc(model,X_test, y_test, title):
Y_pred_prob = model.predict_proba(X_test)
Y_pred_prob = Y_pred_prob[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, Y_pred_prob)

plt.plot([0,1],[0,1], color='navy', lw=10, linestyle='--')
plt.plot(fpr,tpr, color='red', lw=10)
plt.xlabel('False Positive Rate', fontsize=30)
plt.ylabel('True Positive Rate', fontsize=30)
plt.title('ROC Curve of ' + title, fontsize=30)
plt.grid(True)
plt.show()
def plot_decision_boundary(model,xtest, ytest, name):

#Trains model with two features
model.fit(xtest, ytest)
plot_decision_regions(xtest.values, ytest.ravel(), clf=model, legend=2)

plt.title("Decision boundary for " + name + " (Test)", fontsize=30)
plt.xlabel("TSH", fontsize=25)
plt.ylabel("FTI", fontsize=25)
plt.show()
#Chooses two features for decision boundary

feat_boundary = ['TSH','FTI']
X_feature = X[feat_boundary]
X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(X_fea
random_state = 2021, stratify=y)
def train_model(model, X, y):

model.fit(X, y)
return model
def predict_model(model, X, proba=False):

if ~proba:
y_pred = model.predict(X)
else:
y_pred_proba = model.predict_proba(X)
y_pred = np.argmax(y_pred_proba, axis=1)
return y_pred
list_scores = []
def run_model(name, model, X_train, X_test, y_train, y_test, fc, proba=False)

print(name)
print(fc)
model = train_model(model, X_train, y_train)
y_pred = predict_model(model, X_test, proba)
accuracy = accuracy_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print('accuracy: ', accuracy)

print('recall: ',recall)
print('precision: ', precision)
print('f1: ', f1)
print(classification_report(y_test, y_pred))
plot_cm(y_test, y_pred, name)

plot_real_pred_val(y_test, y_pred, name)
plot_roc(model, X_test, y_test, name)
plot_decision_boundary(model,X_test_feat, y_test_feat, name)
plot_learning_curve(model, name, X_train, y_train, cv=3);
plt.show()
list_scores.append({'Model Name': name, 'Feature Scaling':fc, 'Accuracy':

'Precision': precision, 'F1':f1})
feature_scaling = {
#'Raw':(X_train_raw, X_test_raw, y_train_raw, y_test_raw),
#'Normalization':(X_train_norm, X_test_norm, y_train_norm, y_test_norm),
'Standardization':(X_train_stand, X_test_stand, y_train_stand, y_test_stand)
}
model_svc = SVC(random_state=2021,probability=True)
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value
run_model('SVC', model_svc, X_train, X_test, y_train, y_test, fc_name)
logreg = LogisticRegression(solver='lbfgs', max_iter=5000, random_state=2021)

run_model('Logistic Regression', logreg, X_train, X_test, y_train, y_test

scores_1 = []
for i in range(2,50):
knn = KNeighborsClassifier(n_neighbors = i)
scores_1.append(accuracy_score(y_test, knn.predict(X_test)))
max_val = max(scores_1)
max_index = np.argmax(scores_1) + 2
knn = KNeighborsClassifier(n_neighbors = max_index)

run_model(f'KNeighbors Classifier n_neighbors = {max_index}', knn, X_trai

fc_name, proba=True)
parameters = { 'max_depth':np.arange(1,21,1),'random_state':[2021]}
searcher = GridSearchCV(dt, parameters)
run_model('DecisionTree Classifier', searcher, X_train, X_test, y_train,
rf = RandomForestClassifier(n_estimators=200, max_depth=50, random_state=2021

run_model('Random Forest Classifier', rf, X_train, X_test, y_train, y_tes
gbt = GradientBoostingClassifier(n_estimators = 200, max_depth=20, subsample=

random_state=2021)
run_model('GradientBoosting Classifier', gbt, X_train, X_test, y_train, y_

xgb=XGBClassifier(n_estimators = 200, max_depth=20, random_state=2021, us
run_model('XGBoost Classifier', xgb, X_train, X_test, y_train, y_test, fc_
mlp = MLPClassifier(random_state=2021)
run_model('MLP Classifier', mlp, X_train, X_test, y_train, y_test, fc_nam
lgbm = LGBMClassifier(max_depth = 20, n_estimators=500, subsample=0.8, random_

run_model('Lightgbm Classifier', lgbm, X_train, X_test, y_train, y_test,
PREDICTING
THYROID
USING DEEP LEARNING
PREDICTING
THYROID
USING DEEP LEARNING
Reading Dataset and Preprocessing
Step Download dataset from

1 https://viviansiahaan.blogspot.com/2023/07/data-
science-crash-course-thyroid.html and save it to your
working directory. Unzip the file, hypothyroid.csv,
and put it into working directory.
Step Open a new Python script and save it as

2 thyroid_ann.py.
Step Import all necessary libraries:
3
1 # thyroid_ann.py
2 import numpy as np # linear algebra
3 import pandas as pd # data
4 processing, CSV file I/O
5 import os
6 import cv2
10 from matplotlib import pyplot as
11 plt
13 LabelEncoder
14 from sklearn.model_selection import
train_test_split
15
16 StandardScaler
17 import tensorflow as tf
confusion_matrix,
classification_report,
accuracy_score
SMOTE
SimpleImputer
Step Read dataset, Replace ? with NAN, convert six

4 columns into numerical, delete irrelevant columns,
handle missing values, convert binaryClass feature
to {0,1}, convert sex feature to {0,1}, replace t with 1
and f with 0, and extract outuput and input variables:
1 #Reads dataset
3 df =
4 pd.read_csv(curr_path+"/hypothyroid.csv")
5
6 #Replaces ? with NAN
8
9 #Converts six columns into numerical
10 num_cols = ['age',
11 'FTI','TSH','T3','TT4','T4U']
12 df[num_cols] =
df[num_cols].apply(pd.to_numeric,
13
errors='coerce')
14
15
16
18
21 categorical features
22 def mode_imputation(feature):
23 mode=df[feature].mode()[0]
24 df[feature]=df[feature].fillna(mode)
25
26 for col in ['sex', 'T4U measured']:
27 mode_imputation(col)
28
29 df['age'].fillna(df['age'].mean(),
30 inplace=True)
31
32 imputer = SimpleImputer(strategy='mean')
33 df['TSH'] =
imputer.fit_transform(df[['TSH']])
34
df['T3'] =
35
imputer.fit_transform(df[['T3']])
36
df['TT4'] =
37 imputer.fit_transform(df[['TT4']])
38 df['T4U'] =
40 df['FTI'] =
41 imputer.fit_transform(df[['FTI']])
42
47
48 df['age'].describe()
49
50 #Converts binaryClass feature to {0,1}
51 def map_binaryClass(n):
52 if n == "N":
53 return 0
54
55 else:
56 return 1
57 df['binaryClass'] =
58 df['binaryClass'].apply(lambda x:
map_binaryClass(x))
59
60
61
def map_sex(n):
62
if n == "F":
63
return 0
64
else:
return 1
df['sex'] = df['sex'].apply(lambda x:
map_sex(x))
#Replaces t with 1 and f with 0


y = df['binaryClass'].values # Target for
the model
Here are the steps explained:

1. Read the dataset: The code
reads the dataset from the file
"hypothyroid.csv" located in the
current working directory.
2. Replace missing values: The
code replaces the "?" values in
the dataset with NaN (Not a
Number) values.
3. Convert columns to numerical:
Six columns ('age', 'FTI', 'TSH',
'T3', 'TT4', 'T4U') are converted
to numeric data type using the
pd.to_numeric function. Any
non-numeric values in these
columns are converted to NaN.
4. Delete irrelevant columns: The
code removes the 'TBG' and
'referral source' columns from
the dataset as they are deemed
irrelevant for the analysis.
5. Handle missing values: The
code performs imputation to
handle missing values in the
dataset. Categorical features
('sex', 'T4U measured') are
imputed with the mode of their
respective columns, while the
'age' column is imputed with
the mean value of the column.
Numeric features ('TSH', 'T3',
'TT4', 'T4U', 'FTI') are imputed
using the mean strategy
through SimpleImputer.
6. Clean age column: The code
checks for any age values
greater than 100.0 and
replaces them with the value
100.0.
7. Convert binaryClass feature:
The code maps the
'binaryClass' feature to {0, 1},
where 'N' is mapped to 0 and
any other value is mapped to 1.
8. Convert sex feature: The code
maps the 'sex' feature to {0, 1},
where 'F' is mapped to 0 and
any other value is mapped to 1.
9. Replace 't' and 'f' values: The
code replaces 't' with 1 and 'f'
with 0 in the dataset.
10. Extract input and output
variables: The code assigns the
'binaryClass' column as the
target variable (y) and assigns
the remaining columns as the
input variables (X) for the
model.
These steps preprocess the dataset, handle missing
values, convert categorical features to numerical
values, and prepare the input and output variables
for further analysis or modeling.
Resampling, Splitting, and Scaling Data
Step Resample data using SMOTE, split the data into

1 training and testing, and transform them using
StandardScaler from SKLearn:
1 #Resamples data
3 X,y = sm.fit_resample(X, y.ravel())
4
6 testing
train_test_split(X, y,\
8
test_size = 0.2, random_state =
9
2021, stratify=y)
10
11
#Standar scaler
12
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

1. Resample data: The code
performs oversampling
using SMOTE (Synthetic
Minority Over-sampling
Technique) to balance the
classes in the dataset. It
creates synthetic samples
for the minority class to
match the number of
samples in the majority
class. The resampling is
done using the fit_resample
method of the SMOTE
object, with the input
features (X) and the target
variable (y) as arguments.
2. Split data into training and
testing: The code splits the
resampled data into
training and testing sets
using the train_test_split
function from scikit-learn.
It uses 80% of the data for
training and 20% for
testing. The random state
is set to 2021 for
reproducibility, and
stratified sampling is
performed to maintain the
class distribution in the
3. Standardize the data: The
code applies standard
scaling to the input
features. It uses the
StandardScaler object from
scikit-learn to standardize
data. The fit_transform
method is used on the
training data to compute
the mean and standard
deviation and then scale
the data. The transform
method is used on the
testing data to apply the
same scaling
transformation based on
from the training data.
These steps help in preprocessing the data by
addressing class imbalance through
oversampling, splitting the data into training
and testing sets, and applying standard scaling
to the input features.
Building, Compiling, and Training Model
Step Build, compile, and train model and save it and its history into files:
1
1 #Imports Tensorflow and create a

2 Sequential Model to add layer for the ANN
3 ann = tf.keras.models.Sequential()
4
5 #Input layer
6 ann.add(tf.keras.layers.Dense(units=1000,
7 input_dim=27,
8
kernel_initializer='uniform',
9
activation='relu'))
10
ann.add(tf.keras.layers.Dropout(0.5))
11
12
#Hidden layer 1
13
ann.add(tf.keras.layers.Dense(units=20,
14
15
16
activation='relu'))
17
18
19
#Output layer
20
21
22 kernel_initializer='uniform',
23 activation='sigmoid'))
24
25 print(ann.summary()) #for showing the
26 structure and parameters
27
28 #Compiles the ANN using ADAM optimizer.
29 ann.compile(optimizer = 'adam', loss =
'binary_crossentropy', metrics =
30
['accuracy'])
31
32
#Trains the ANN with 100 epochs.
33
history = ann.fit(X_train, y_train,
34 batch_size = 64, validation_split=0.20,
35 epochs = 100, shuffle=True)
36
#Saves model
ann.save('thyroid_model.h5')
#Saves history into npy file

np.save('thyroid_history.npy',
history.history)

1. Imports Tensorflow: The code imports the
TensorFlow library, which is a popular open-
source framework for building and training
deep learning models.
2. Create a Sequential Model: The code creates a
Sequential model using
tf.keras.models.Sequential(). This type of model
allows you to stack layers sequentially.
3. Input layer: The code adds an input layer to the
model using ann.add(). It specifies the number
of units (neurons) in the layer (1000), the input
dimension (27), the kernel initializer for weight
initialization, and the activation function
(ReLU). It also adds a dropout layer with a
dropout rate of 0.5, which helps prevent
overfitting by randomly dropping out a fraction
of input units during training.
4. Hidden layer 1: The code adds a hidden layer to
the model using ann.add(). It specifies the
number of units (20), the kernel initializer, and
the activation function (ReLU). It also adds a
dropout layer with a dropout rate of 0.5.
5. Output layer: The code adds an output layer to
the model using ann.add(). It specifies the
number of units (1), the kernel initializer, and
the activation function (sigmoid). This is a
binary classification problem, so the sigmoid
activation function is used to output
probabilities between 0 and 1.
6. Print model summary: The code prints a
summary of the model using ann.summary().
This provides information about the structure of
the model and the number of parameters.
7. Compile the model: The code compiles the
model using the compile() method. It specifies
the optimizer (Adam), the loss function (binary
cross-entropy), and the evaluation metric
(accuracy).
8. Train the model: The code trains the model
using the fit() method. It specifies the training
data (X_train and y_train), the batch size (64),
the validation split (20% of the training data for
validation), the number of epochs (100), and
the shuffle parameter (to shuffle the training
data before each epoch).
9. Save the model: The code saves the trained
model to a file named 'thyroid_model.h5' using
the save() method.
10. Save the history: The code saves the training
history (loss and accuracy values at each
epoch) to a NumPy file named
'thyroid_history.npy' using the np.save()
function.
These steps create and train an artificial neural network (ANN) model using
TensorFlow. The model consists of an input layer, a hidden layer, and an
output layer. It is compiled with the Adam optimizer and trained on the
training data for 100 epochs. The trained model and training history are
saved for future use.
Output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 100) 2800
_________________________________________________________________
dropout (Dropout) (None, 100) 0
_________________________________________________________________
dense_1 (Dense) (None, 20) 2020
_________________________________________________________________
dropout_1 (Dropout) (None, 20) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 21
=================================================================
Total params: 4,841
Trainable params: 4,841
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
70/70 [==============================] - 0s 2ms/step - loss:
0.6321 - accuracy: 0.7728 - val_loss: 0.4520 - val_accuracy:
0.8250
Epoch 2/100
70/70 [==============================] - 0s 994us/step - loss:
0.8357
Epoch 3/100
70/70 [==============================] - 0s 1ms/step - loss:
0.8555
Epoch 4/100
70/70 [==============================] - 0s 1ms/step - loss:
0.8734
Epoch 5/100
70/70 [==============================] - 0s 1ms/step - loss:
0.8815
Epoch 6/100
70/70 [==============================] - 0s 870us/step - loss:
0.9075
Epoch 7/100
70/70 [==============================] - 0s 869us/step - loss:
0.9264
Epoch 8/100
70/70 [==============================] - 0s 870us/step - loss:
0.9524
Epoch 9/100
70/70 [==============================] - 0s 899us/step - loss:
0.9587
Epoch 10/100
70/70 [==============================] - 0s 1ms/step - loss:
0.9659
Epoch 11/100
70/70 [==============================] - 0s 1ms/step - loss:
0.9749
Epoch 12/100
70/70 [==============================] - 0s 905us/step - loss:
0.9811
Epoch 13/100
70/70 [==============================] - 0s 855us/step - loss:
0.9803
Epoch 14/100
70/70 [==============================] - 0s 923us/step - loss:
0.9874
Epoch 15/100
70/70 [==============================] - 0s 859us/step - loss:
0.9829
Epoch 16/100
70/70 [==============================] - 0s 947us/step - loss:
0.9901
Epoch 17/100
70/70 [==============================] - 0s 899us/step - loss:
0.9883
Epoch 18/100
70/70 [==============================] - 0s 878us/step - loss:
0.9883
Epoch 19/100
70/70 [==============================] - 0s 884us/step - loss:
0.9892
Epoch 20/100
70/70 [==============================] - 0s 884us/step - loss:
0.9910
Epoch 21/100
70/70 [==============================] - 0s 870us/step - loss:
0.9910
Epoch 22/100
70/70 [==============================] - 0s 855us/step - loss:
0.9910
Epoch 23/100
70/70 [==============================] - 0s 899us/step - loss:
0.9910
Epoch 24/100
70/70 [==============================] - 0s 870us/step - loss:
0.9919
Epoch 25/100
70/70 [==============================] - 0s 928us/step - loss:
0.9901
Epoch 26/100
70/70 [==============================] - 0s 884us/step - loss:
0.9892
Epoch 27/100
70/70 [==============================] - 0s 870us/step - loss:
0.9901
Epoch 28/100
70/70 [==============================] - 0s 884us/step - loss:
0.9919
Epoch 29/100
70/70 [==============================] - 0s 899us/step - loss:
0.9928
Epoch 30/100
70/70 [==============================] - 0s 855us/step - loss:
0.9910
Epoch 31/100
70/70 [==============================] - 0s 870us/step - loss:
0.9928
Epoch 32/100
70/70 [==============================] - 0s 886us/step - loss:
0.9910
Epoch 33/100
70/70 [==============================] - 0s 870us/step - loss:
0.9892
Epoch 34/100
70/70 [==============================] - 0s 870us/step - loss:
0.9910
Epoch 35/100
70/70 [==============================] - 0s 870us/step - loss:
0.9919
Epoch 36/100
70/70 [==============================] - 0s 884us/step - loss:
0.9919
Epoch 37/100
70/70 [==============================] - 0s 889us/step - loss:
0.9919
Epoch 38/100
70/70 [==============================] - 0s 870us/step - loss:
0.9910
Epoch 39/100
70/70 [==============================] - 0s 841us/step - loss:
0.9928
Epoch 40/100
70/70 [==============================] - 0s 859us/step - loss:
0.9937
Epoch 41/100
70/70 [==============================] - 0s 875us/step - loss:
0.9901
Epoch 42/100
70/70 [==============================] - 0s 885us/step - loss:
0.9928
Epoch 43/100
70/70 [==============================] - 0s 888us/step - loss:
0.9919
Epoch 44/100
70/70 [==============================] - 0s 870us/step - loss:
0.9910
Epoch 45/100
70/70 [==============================] - 0s 891us/step - loss:
0.9901
Epoch 46/100
70/70 [==============================] - 0s 855us/step - loss:
0.9928
Epoch 47/100
70/70 [==============================] - 0s 855us/step - loss:
0.9883
Epoch 48/100
70/70 [==============================] - 0s 858us/step - loss:
0.9892
Epoch 49/100
70/70 [==============================] - 0s 884us/step - loss:
0.9892
Epoch 50/100
70/70 [==============================] - 0s 1ms/step - loss:
0.9928
Epoch 51/100
70/70 [==============================] - 0s 855us/step - loss:
0.9928
Epoch 52/100
70/70 [==============================] - 0s 869us/step - loss:
0.9901
Epoch 53/100
70/70 [==============================] - 0s 855us/step - loss:
0.9919
Epoch 54/100
70/70 [==============================] - 0s 870us/step - loss:
0.9910
Epoch 55/100
70/70 [==============================] - 0s 855us/step - loss:
0.9919
Epoch 56/100
70/70 [==============================] - 0s 870us/step - loss:
0.9919
Epoch 57/100
70/70 [==============================] - 0s 870us/step - loss:
0.9937
Epoch 58/100
70/70 [==============================] - 0s 855us/step - loss:
0.9910
Epoch 59/100
70/70 [==============================] - 0s 884us/step - loss:
0.9937
Epoch 60/100
70/70 [==============================] - 0s 860us/step - loss:
0.9937
Epoch 61/100
70/70 [==============================] - 0s 873us/step - loss:
0.9919
Epoch 62/100
70/70 [==============================] - 0s 870us/step - loss:
0.9910
Epoch 63/100
70/70 [==============================] - 0s 884us/step - loss:
0.9910
Epoch 64/100
70/70 [==============================] - 0s 884us/step - loss:
0.9910
Epoch 65/100
70/70 [==============================] - 0s 884us/step - loss:
0.9874
Epoch 66/100
70/70 [==============================] - 0s 873us/step - loss:
0.9928
Epoch 67/100
70/70 [==============================] - 0s 884us/step - loss:
0.9919
Epoch 68/100
70/70 [==============================] - 0s 870us/step - loss:
0.9919
Epoch 69/100
70/70 [==============================] - 0s 871us/step - loss:
0.9892
Epoch 70/100
70/70 [==============================] - 0s 870us/step - loss:
0.9919
Epoch 71/100
70/70 [==============================] - 0s 870us/step - loss:
0.9937
Epoch 72/100
70/70 [==============================] - 0s 957us/step - loss:
0.9937
Epoch 73/100
70/70 [==============================] - 0s 920us/step - loss:
0.9919
Epoch 74/100
70/70 [==============================] - 0s 870us/step - loss:
0.9919
Epoch 75/100
70/70 [==============================] - 0s 893us/step - loss:
0.9937
Epoch 76/100
70/70 [==============================] - 0s 859us/step - loss:
0.9901
Epoch 77/100
70/70 [==============================] - 0s 884us/step - loss:
0.9937
Epoch 78/100
70/70 [==============================] - 0s 882us/step - loss:
0.9928
Epoch 79/100
70/70 [==============================] - 0s 879us/step - loss:
0.9928
Epoch 80/100
70/70 [==============================] - 0s 884us/step - loss:
0.9946
Epoch 81/100
70/70 [==============================] - 0s 855us/step - loss:
0.9928
Epoch 82/100
70/70 [==============================] - 0s 864us/step - loss:
0.9928
Epoch 83/100
70/70 [==============================] - 0s 855us/step - loss:
0.9928
Epoch 84/100
70/70 [==============================] - 0s 863us/step - loss:
0.9919
Epoch 85/100
70/70 [==============================] - 0s 877us/step - loss:
0.9901
Epoch 86/100
70/70 [==============================] - 0s 870us/step - loss:
0.9946
Epoch 87/100
70/70 [==============================] - 0s 855us/step - loss:
0.9928
Epoch 88/100
70/70 [==============================] - 0s 884us/step - loss:
0.9928
Epoch 89/100
70/70 [==============================] - 0s 899us/step - loss:
0.9919
Epoch 90/100
70/70 [==============================] - 0s 841us/step - loss:
0.9937
Epoch 91/100
70/70 [==============================] - 0s 859us/step - loss:
0.9937
Epoch 92/100
70/70 [==============================] - 0s 855us/step - loss:
0.9946
Epoch 93/100
70/70 [==============================] - 0s 899us/step - loss:
0.9928
Epoch 94/100
70/70 [==============================] - 0s 872us/step - loss:
0.9928
Epoch 95/100
70/70 [==============================] - 0s 884us/step - loss:
0.9937
Epoch 96/100
70/70 [==============================] - 0s 884us/step - loss:
0.9937
Epoch 97/100
70/70 [==============================] - 0s 855us/step - loss:
0.9928
Epoch 98/100
70/70 [==============================] - 0s 870us/step - loss:
0.9919
Epoch 99/100
70/70 [==============================] - 0s 855us/step - loss:
0.9937
Epoch 100/100
70/70 [==============================] - 0s 870us/step - loss:
0.9937
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
The output represents the summary of the model structure and the number
of parameters in each layer:
Input Layer (dense): It has an output shape of
(None, 100), indicating that it outputs a tensor
of shape (batch_size, 100). The layer has 2,800
parameters.
Dropout Layer (dropout): It has an output
shape of (None, 100), same as the previous
layer. Dropout layers do not have any
parameters.
Hidden Layer 1 (dense_1): It has an output
shape of (None, 20), indicating that it outputs a
tensor of shape (batch_size, 20). The layer has
2,020 parameters.
Dropout Layer 1 (dropout_1): It has an output
shape of (None, 20), same as the previous
layer. Dropout layers do not have any
parameters.
Output Layer (dense_2): It has an output shape
of (None, 1), indicating that it outputs a tensor
of shape (batch_size, 1). The layer has 21
parameters.
The total number of trainable parameters in the
model is 4,841. These parameters are learned
during the training process to make predictions
on the given task.
This summary provides a concise overview of the model architecture, the
output shapes of each layer, and the total number of parameters.
The training process of the model shows a consistent improvement in both

training and validation metrics throughout the epochs. Here are some
observations from the epoch process:
The training loss consistently decreases,
indicating that the model is learning and
optimizing its predictions to minimize the loss
function.
The training accuracy increases steadily,
suggesting that the model is becoming more
accurate in predicting the training data.
The validation loss also decreases, indicating
that the model is generalizing well to unseen
data and is not overfitting.
The validation accuracy increases as well,
showing that the model performs well on the
validation data.
Overall, the model demonstrates a successful training process, with
decreasing loss and increasing accuracy metrics for both the training and
validation sets. This indicates that the model is learning the patterns and
features of the data effectively.
Based on this analysis, we can conclude that the model is well-suited for the
task at hand, which is predicting whether an individual has hypothyroidism
or not. The achieved high accuracy on both the training and validation data
suggests that the model can make accurate predictions on unseen data as
well.
Plotting Accuracy and Loss
Step Plot accuracy and loss versus epoch:

1
1 #Plots accuracy and loss

2 acc = history.history['accuracy']
3 val_acc =
4 history.history['val_accuracy']
5 loss = history.history['loss']
6 val_loss =
history.history['val_loss']
7
epochs = range(1, len(acc) + 1)
8
9
#accuracy
10
fig, ax = plt.subplots(figsize=(25,
11
15))
12
plt.plot(epochs, acc, 'r',
13 label='Training accuracy', lw=10)
14 plt.plot(epochs, val_acc, 'b--',
15 label='Validation accuracy', lw=10)
16 plt.title('Training and validation
17 accuracy', fontsize=35)
18 plt.legend(fontsize=25)
19 ax.set_xlabel("Epoch", fontsize=30)
20 ax.tick_params(labelsize=30)
21 plt.show()
22
23 #loss
24 fig, ax = plt.subplots(figsize=(25,
25 15))
26 plt.plot(epochs, loss, 'r',
label='Training loss', lw=10)
27
plt.plot(epochs, val_loss, 'b--',
label='Validation loss', lw=10)
plt.title('Training and validation
loss', fontsize=35)
ax.set_xlabel("Epoch", fontsize=30)
ax.tick_params(labelsize=30)
plt.show()
The results are shown in Figure 139 and 140.
Figure 139 Training and validation accuracy

1. The code defines the
variables acc, val_acc, loss,
val_loss, and epochs to
store the accuracy,
validation accuracy, loss,
validation loss, and the
number of epochs,
respectively.
2. It creates a figure and axis
object for the accuracy plot
using plt.subplots() and
sets the size of the figure
using figsize.
3. The training accuracy is
plotted using plt.plot() with
epochs on the x-axis and
acc on the y-axis. The line
is colored in red and
labeled as "Training
accuracy". The lw
parameter sets the line
width.
4. The validation accuracy is
plotted similarly, but with a
blue dashed line and
labeled as "Validation
accuracy".
5. The title of the plot is set
using plt.title(), and the
legend is displayed using
plt.legend().
6. The x-axis label is set using
ax.set_xlabel(), and the tick
parameters are adjusted
using ax.tick_params() to
increase the font size.
7. The accuracy plot is
displayed using plt.show().
8. Similarly, a figure and axis
object are created for the
loss plot.
Figure 140 Training and validation loss
9. The training loss is plotted

using plt.plot() with epochs
on the x-axis and loss on
the y-axis. The line is
colored in red and labeled
as "Training loss".
10. The validation loss is
plotted similarly, but with a
blue dashed line and
labeled as "Validation loss".
11. The title, legend, x-axis
label, and tick parameters
are set as in the accuracy
plot.
12. The loss plot is displayed
using plt.show().
By plotting the accuracy and loss over the
epochs, these visualizations provide insights
into the model's training progress, helping to
evaluate its performance and identify potential
issues such as overfitting or underfitting.
Predicting Thyroid Using Test Data
Step Predict result using test data:

1
1 #Sets the threshold for the
2 predictions.
3 #In this case, the threshold is 0.5
(this value can be modified).
4
#prediction on test set
5
y_pred = ann.predict(X_test)
6
y_pred = [int(p>=0.5) for p in
y_pred]
print(y_pred)
The code sets the threshold for the predictions

to 0.5. It applies this threshold to the predicted
probabilities (y_pred) generated by the model
on the test set. If a predicted probability is
greater than or equal to 0.5, it is classified as 1;
otherwise, it is classified as 0.
Here's a breakdown of the code:

1. The variable y_pred initially
stores the predicted
probabilities generated by
the model for each sample
in the test set.
2. The code then applies a list
comprehension to iterate
over each predicted
probability (p) in y_pred.
3. For each predicted
probability, the code
checks if it is greater than
or equal to 0.5. If it is, the
corresponding element in
y_pred is assigned the
integer value 1; otherwise,
it is assigned the integer
value 0.
4. Finally, the code prints the
updated y_pred, which now
contains the predicted
binary labels (0s and 1s)
based on the threshold of
0.5.
Output:
[1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0,
1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1,
0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1,
0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,
1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,
0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0,
1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,
1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,
1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1,
0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0,
1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1,
1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,
1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1,
1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,
1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1,
0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1,
1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,
0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1,
1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1,
1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0,
0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1,
1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0,
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,
0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0,
0, 0]
Printing Accuracy and Classification Report
Step Print accuracy and classification report:

1
1 #Performance Evaluation - Accuracy

2 and Classification Report
3 #Accuracy Score
4 print ('Accuracy Score : ',
accuracy_score(y_pred, y_test, \
5
normalize=True), '\n')
6
7
#precision, recall report
8
print ('Classification Report
:\n\n' ,\
classification_report(y_pred,
y_test))
Output:
Accuracy Score : 0.9928212491026561
Classification Report :
score support
0 0.99 0.99
0.99 697
1 0.99 0.99
0.99 696
accuracy
0.99 1393
macro avg 0.99 0.99
0.99 1393
0.99 1393
The code you provided calculates the accuracy

score and generates a classification report to
evaluate the performance of the model.
Accuracy Score:
The accuracy_score
function is used to
calculate the accuracy of
the model's predictions. It
compares the predicted
classes (y_pred) with the
actual classes (y_test) and
returns the accuracy score.
The normalize=True
parameter normalizes the
score to a value between 0
and 1. The accuracy score
represents the proportion
of correct predictions out
of the total number of
samples.
Classification Report:
The classification_report
function generates a
comprehensive report that
includes several metrics
such as precision, recall,
F1-score, and support for
each class. It compares the
predicted classes (y_pred)
with the actual classes
(y_test) and provides a
detailed analysis of the
model's performance for
each class. The report is
printed to the console.
The output of these statements will be the
accuracy score and the classification report.
The accuracy score will indicate how well the
model performed overall, while the
classification report will provide insights into
the performance of the model for each class,
including metrics like precision (the proportion
of true positive predictions out of the total
predicted positives), recall (the proportion of
true positive predictions out of the total actual
positives), F1-score (a balanced measure of
precision and recall), and support (the number
of samples in each class).
Based on the output you provided, the

performance evaluation results are as follows:
Accuracy Score: The model achieved an
accuracy score of 0.9928, which indicates that
it correctly predicted 99.28% of the samples.
Classification Report:
Precision: The precision for
both classes (0 and 1) is
0.99, which means that the
model has a high
proportion of true positive
predictions out of the total
predicted positives for both
classes.
Recall: The recall for both
classes is also 0.99,
has a high proportion of
true positive predictions
out of the total actual
positives for both classes.
F1-score: The F1-score for
both classes is 0.99, which
is a balanced measure of
precision and recall. It
indicates that the model
has a high harmonic mean
of precision and recall for
both classes.
refers to the number of
samples in each class. In
this case, both classes have
697 and 696 samples,
respectively.
Macro Avg: The macro-
averaged metrics consider
the average performance
across all classes. In this
case, the macro-averaged
precision, recall, and F1-
score are all 0.99.
Weighted Avg: The
weighted-averaged metrics
consider the average
performance weighted by
the support of each class.
In this case, the weighted-
averaged precision, recall,
and F1-score are all 0.99.
Overall, the classification report indicates that
the model performed very well, with high
precision, recall, and F1-score for both classes,
and a high accuracy score. This suggests that
the model is able to accurately predict the
classes for the given data.
Confusion Matrix
Step Plot confusion matrix:

1
1 #Confusion matrix:
2 conf_mat =
3 confusion_matrix(y_true=y_test,
y_pred = y_pred)
4
class_list = ['NORMAL', 'STROKE']
5
fig, ax = plt.subplots(figsize=(25,
6
15))
7
sns.heatmap(conf_mat, annot=True, ax
8 = ax, cmap='YlOrBr', fmt='g',
9 annot_kws={"size": 30})
10
11 ax.set_xlabel('Predicted labels',
12 fontsize=30)
ax.set_ylabel('True labels',
fontsize=30)
ax.set_title('Confusion Matrix',
fontsize=30)
ax.xaxis.set_ticklabels(class_list),
ax.yaxis.set_ticklabels(class_list)
The result is shown in Figure 141. Here's a

step-by-step explanation of the code:
1. Compute the confusion
matrix: The
confusion_matrix function
is used to calculate the
confusion matrix using the
true labels (y_test) and the
predicted labels (y_pred).
2. Define class labels: A list
called class_list is created
to store the class labels. In
this case, the class labels
are 'NORMAL' and
'STROKE', representing the
two classes in the
classification problem.
3. Create a heatmap plot: A
figure and axes (fig and ax)
are created to hold the
heatmap plot. The figsize
parameter sets the size of
the figure.
4. Plot the confusion matrix
heatmap: The sns.heatmap
function is used to create
the heatmap plot of the
confusion matrix. The
conf_mat is passed as the
data for the heatmap. The
annot=True parameter
enables the display of
annotations (values) in
each cell of the heatmap.
The ax parameter specifies
the axes to plot on. The
cmap parameter sets the
color map for the heatmap.
The fmt='g' parameter
specifies the format for the
cell values (as general
numbers). The annot_kws
parameter allows
customization of the
annotation properties, in
this case, setting the font
size to 30.
Figure 141 The confusion matrix
5. Set labels and title: The

ax.set_xlabel, ax.set_ylabel,
and ax.set_title functions
are used to set the x-axis
label, y-axis label, and title
for the plot, respectively.
The font size for these
labels and title is set to 30.
6. Set tick labels and tick
sizes: The
ax.xaxis.set_ticklabels and
ax.yaxis.set_ticklabels
functions are used to set
the tick labels for the x-axis
and y-axis, respectively.
The tick labels are set to
the class labels in
class_list. The
ax.tick_params function is
used to set the tick label
font size to 30.
7. Display the plot: The
plt.show() function is used
to display the plot.
The resulting plot is a heatmap visualization of
the confusion matrix, where the x-axis
represents the predicted labels and the y-axis
represents the true labels. The cells in the
heatmap show the number of samples that
belong to each class and were correctly or
incorrectly classified. The annotation values in
each cell provide additional information about
the counts. The plot helps in understanding the
performance of the model in terms of correctly
and incorrectly classified samples for each
class.
True Values versus Predicted Values
Step Plot true values versus predicted values:

1
1 def plot_real_pred_val(Y_test, ypred, name):

2 plt.figure(figsize=(25,15))
3 acc=accuracy_score(Y_test,ypred)
4
5 plt.scatter(range(len(ypred)),ypred,color="blue",lw=5,label="Predict
6 plt.scatter(range(len(Y_test)), Y_test,color="red",label="Actual
7 plt.title("Predicted Values vs True Values of " + name, fontsize
8 plt.xlabel("Accuracy: " + str(round((acc*100),3)) + "%")
9 plt.legend(fontsize=25)
10 plt.grid(True, alpha=0.75, lw=1, ls='-.')
11 plt.show()
12
13 plot_real_pred_val(y_test, y_pred, 'ANN')
14
15
16
The result is shown in Figure 142. The plot_real_pred_val() function is defined to cre
scatter plot comparing the predicted values (ypred) and the true values (Y_test)
given model (name). Here's a step-by-step explanation of the code:
1. Create a figure for the plot: The plt.figure functio
used to create a figure object with a specified
(figsize=(25,15)).
2. Calculate accuracy: The accuracy between the predic
values and the true values is calculated using
accuracy_score function and stored in the variable acc
3. Plot the predicted values: The predicted values (ypr
are plotted as scatter points using the plt.sca
function. The color="blue" parameter sets the colo
the points to blue. The lw=5 parameter sets the linew
of the points to 5. The label="Predicted" parameter
the label for the legend.
4. Plot the true values: The true values (Y_test) are plo
as scatter points using the plt.scatter function.
color="red" parameter sets the color of the points to
The label="Actual" parameter sets the label for
legend.
5. Set the title: The plt.title function is used to set the
for the plot. The title includes the model name (na
and is formatted with the fontsize=10 parameter.
6. Set the x-axis label: The plt.xlabel function is used to
the x-axis label. The label includes the accuracy va
(acc) formatted as a percentage us
str(round((acc*100),3)) + "%". It represents the accur
of the model in predicting the values.
7. Set the legend: The plt.legend function is used to disp
the legend on the plot. The fontsize=25 parameter
the font size of the legend.
8. Display the grid: The plt.grid function is used to displ
grid on the plot. The True parameter enables the g
while alpha=0.75 sets the transparency level, lw=1
the linewidth of the grid lines, and ls='-.' sets
linestyle.
9. Show the plot: The plt.show function is called to disp
the plot.
The resulting plot shows the predicted values as blue points and the true values a
points. The accuracy of the model is displayed as the x-axis label, and the le
differentiates between the predicted and actual values. The grid lines provide addi
visual guidance.
Following is the full version of thyroid_ann.py:
#thyroid_ann.py
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import os
import cv2
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report,
accuracy_score
#Reads dataset
df = pd.read_csv(curr_path+"/hypothyroid.csv")
#Replaces ? with NAN


df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

df.drop(['TBG','referral source'],axis=1,inplace=True)

#Uses mode imputation for all other categorical features
def mode_imputation(feature):

#Cleans age column

df['age'].describe()

def map_binaryClass(n):
if n == "N":
return 0
else:
return 1
df['binaryClass'] = df['binaryClass'].apply(lambda x:
map_binaryClass(x))

def map_sex(n):
if n == "F":
return 0
else:
return 1
df['sex'] = df['sex'].apply(lambda x: map_sex(x))


#Resamples data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =
0.2, random_state = 2021, stratify=y)
#Standar scaler
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Imports Tensorflow and create a Sequential Model to add layer for the
ANN
ann = tf.keras.models.Sequential()
#Input layer
input_dim=27,
activation='relu'))
#Hidden layer 1
activation='relu'))
#Output layer
activation='sigmoid'))
print(ann.summary()) #for showing the structure and parameters
#Compiles the ANN using ADAM optimizer.

ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics =
['accuracy'])

history = ann.fit(X_train, y_train, batch_size = 64,
validation_split=0.20, epochs = 100, shuffle=True)
#Saves model

np.save('thyroid_history.npy', history.history)
print (history.history.keys())
#Plots accuracy and loss

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#accuracy
plt.plot(epochs, acc, 'r', label='Training accuracy', lw=10)
plt.plot(epochs, val_acc, 'b--', label='Validation accuracy', lw=10)
plt.title('Training and validation accuracy', fontsize=35)
plt.show()
#loss
plt.plot(epochs, loss, 'r', label='Training loss', lw=10)
plt.plot(epochs, val_loss, 'b--', label='Validation loss', lw=10)
plt.title('Training and validation loss', fontsize=35)
plt.show()
#Sets the threshold for the predictions. In this case, the threshold is
0.5 (this value can be modified).
#prediction on test set
y_pred = ann.predict(X_test)
y_pred = [int(p>=0.5) for p in y_pred]
print(y_pred)
#Performance Evaluation - Accuracy and Classification Report

#Accuracy Score
print ('Accuracy Score : ', accuracy_score(y_pred, y_test,
normalize=True), '\n')

print ('Classification Report :\n\n' ,classification_report(y_pred,
y_test))
#Confusion matrix:
conf_mat = confusion_matrix(y_true=y_test, y_pred = y_pred)
class_list = ['NORMAL', 'STROKE']
sns.heatmap(conf_mat, annot=True, ax = ax, cmap='YlOrBr', fmt='g',
annot_kws={"size": 30})
ax.set_xlabel('Predicted labels', fontsize=30)
ax.set_ylabel('True labels', fontsize=30)
ax.set_title('Confusion Matrix', fontsize=30)
ax.xaxis.set_ticklabels(class_list), ax.yaxis.set_ticklabels(class_list)
def plot_real_pred_val(Y_test, ypred, name):

acc=accuracy_score(Y_test,ypred)
plt.scatter(range(len(ypred)),ypred,color="blue",lw=5,label="Predicted")
plt.scatter(range(len(Y_test)), Y_test,color="red",label="Actual")
plt.title("Predicted Values vs True Values of " + name, fontsize=10)
plt.xlabel("Accuracy: " + str(round((acc*100),3)) + "%")
plt.show()
plot_real_pred_val(y_test, y_pred, 'ANN')
IMPLEMENTING
GUI
WITH PYQT
IMPLEMENTING
GUI
WITH PYQT
Designing GUI
Step Now, you will create a GUI to implement how to classify

1 and predict thyroid disease using some machine learning
algorithms and CNN. Open Qt Designer and choose
Main Window template. Save the form as
gui_thyroid.ui.
Step Put three Push Button widgets onto form. Set their text
2 property as LOAD DATA, TRAIN ML MODEL, and
TRAIN DL MODEL. Set their objectName property as
pbLoad, pbTrainML, dan pbTrainDL.
Step Put two Table Widget onto form. Set their objectName
3 properties as twData1 and twData2.
Step Add two Label widgets onto form. Set their text
4 properties as Label 1 and Label 2 and set their
objectName properties as label1 and label2.
Step Put three Widget from Containers panel onto form and
5 set their ObjectName property as widgetPlot1,
widgetPlot2, and widgetPlot3.
Figure 143 The widgetPlot1, widgetPlot2, and

widgetPlot3 are now an object of plot_class
Step Right click on the three Widgets and choose Promote

6 to …. Set Promoted class name as plot_class. Click
Add and Promote button. In Object Inspector window,
you can see that widgetPlot1, widgetPlot2, and
widgetPlot3 are now an object of plot_class as shown
in Figure 143.
Step Write the definition of plot_class and save it as

7 plot_class.py as follows:
1 #plot_class.py
2 from PyQt5.QtWidgets import*
3 from matplotlib.backends.backend_qt5agg
4 import FigureCanvas
5 from matplotlib.figure import Figure
6
7 class plot_class(QWidget):
8 def __init__(self, parent = None):
9 QWidget.__init__(self, parent)
10 self.canvas = FigureCanvas(Figure())
11
12 vertical_layout = QVBoxLayout()
13
vertical_layout.addWidget(self.canvas)
14
15
self.canvas.axis1 =
16
self.canvas.figure.add_subplot(111)
17
self.canvas.axis1 =
18 self.canvas.figure.subplots_adjust(
19
20 top=0.936,
21
22 bottom=0.104,
23
left=0.047,
24
25
right=0.981,
hspace=0.2,
wspace=0.2
)
self.canvas.figure.set_facecolor("xkcd:ligh
t mauve")
self.setLayout(vertical_layout)
The code defines a class called plot_class that inherits
from the QWidget class in the PyQt5 library. This class is
designed to display a plot using Matplotlib within a Qt
application. Here's a step-by-step explanation of the
code:
1. Import the required modules: The
code starts by importing the
necessary modules from the
PyQt5 and Matplotlib libraries.
2. Define the plot_class class: The
plot_class class is defined, and it
inherits from the QWidget class.
3. Initialize the class: The __init__
method is defined to initialize the
class. It takes parent as an
optional parameter.
4. Create a FigureCanvas: Inside the
__init__ method, a FigureCanvas
object is created with an empty
Figure. This canvas will be used
to display the plot.
5. Create a vertical layout: A
QVBoxLayout object is created to
arrange the canvas vertically.
6. Add the canvas to the layout: The
canvas is added to the vertical
layout using the addWidget
method.
7. Add a subplot to the canvas: The
canvas is assigned a subplot using
the add_subplot method. The
argument (111) specifies a single
subplot occupying the entire
canvas.
8. Adjust subplot parameters: The
subplots_adjust method is called
to adjust the spacing and
positioning of the subplot within
the canvas. The provided
parameters control the top,
bottom, left, right, height, and
width ratios.
9. Set the facecolor of the figure:
The set_facecolor method of the
Figure object is called to set the
background color of the figure. In
this case, the color is set to
"xkcd:light mauve".
10. Set the layout for the widget: The
setLayout method is called to set
the vertical layout as the layout
for the widget.
The plot_class class provides a convenient way to display
a plot within a Qt application by embedding a Matplotlib
canvas in a QWidget.
Step Add a Combo Box widget and set its objectName

8 property as cbData. Let it empty. You will populate it
from the code.
Step Add another Combo Box widget and set its objectName
9 property as cbClassifier. Populate this widget with
fourteen items as shown in Figure 144.
Figure 144 Populating cbClassifier widget with

fourteen items
Step Add three radio buttons and set their text properties as
10 Raw, Norm, and Stand. Then, set their objectName as
rbRaw, rbNorm, and rbStand.
Step Add the last Combo Box widget and set its objectName
11 property as cbPredictionDL. Populate this widget with
two items as shown in Figure 145.
Figure 145 Populating cbPredictionDL widget with

two items
Figure 146 The form when it first runs
Step Write this Python script and save it as gui_thyroid.py:

12
1 #gui_thyroid.py
2 from PyQt5.QtWidgets import *
3 from PyQt5.uic import loadUi
4 from
5 matplotlib.backends.backend_qt5agg
import (NavigationToolbar2QT as
6
NavigationToolbar)
7
from matplotlib.colors import
8 ListedColormap
9
10 class DemoGUI_Thyroid(QMainWindow):
11 def __init__(self):
12 QMainWindow.__init__(self)
13
14 loadUi("gui_thyroid.ui",self)
15 self.setWindowTitle(\
16 "GUI Demo of Classifying and
17 Predicting Thyroid Disease")
18 self.addToolBar(NavigationToolbar(
19 \
20 self.widgetPlot1.canvas, self))
21
22 if __name__ == '__main__':
import sys
app = QApplication(sys.argv)
ex = DemoGUI_Thyroid()
ex.show()
sys.exit(app.exec_())
The code represents a GUI application for classifying and

predicting thyroid disease. It uses the PyQt5 library for
building the graphical user interface and incorporates
Matplotlib for displaying plots. Here's a step-by-step
1. Import the required modules: The
code begins by importing the
necessary modules from the
PyQt5 and Matplotlib libraries.
2. Define the DemoGUI_Thyroid
class: The DemoGUI_Thyroid
class is defined, and it inherits
from the QMainWindow class.
3. Initialize the class: The __init__()
method is defined to initialize the
class. It loads the UI from the
gui_thyroid.ui file using the
loadUi function. The window title
is set to "GUI Demo of Classifying
and Predicting Thyroid Disease".
4. Add a navigation toolbar: An
instance of the NavigationToolbar
class is added as a toolbar to the
main window. The
NavigationToolbar is associated
with the canvas object of the
widgetPlot1 widget.
5. Run the application: The if
__name__ == '__main__' block is
executed when the script is run
directly. It creates an instance of
the QApplication class, initializes
the DemoGUI_Thyroid class,
shows the main window, and
starts the application event loop.
The DemoGUI_Thyroid class represents the main window
of the GUI application and incorporates the UI defined in
the gui_thyroid.ui file. It also includes a navigation
toolbar for interacting with the Matplotlib plot displayed
on the widgetPlot1 widget.
Step Run gui_thyroid.py and click LOAD DATA button. You

13 will see form’s layout as shown in Figure 146.
Preprocessing Data and Populating Tables

Step Import all necessary modules:
1
1 import numpy as np
3 import matplotlib.pyplot as plt
6 import warnings
7 import mglearn
8 warnings.filterwarnings('ignore')
9 import os
10 import joblib
11 from numpy import save
12 from numpy import load
13 from os import path
15 roc_auc_score,roc_curve
16 from sklearn.model_selection import
train_test_split,
17
RandomizedSearchCV,
18 GridSearchCV,StratifiedKFold
20 StandardScaler, MinMaxScaler
22 LogisticRegression
23 from sklearn.naive_bayes import
GaussianNB
24
from sklearn.tree import
25
DecisionTreeClassifier
26
27
28 RandomForestClassifier,
29 ExtraTreesClassifier
30 from sklearn.neighbors import
31 KNeighborsClassifier
AdaBoostClassifier,
33
GradientBoostingClassifier
34
35
from sklearn.neural_network import
36 MLPClassifier
38 SGDClassifier
StandardScaler, LabelEncoder,
40
OneHotEncoder
41
42 confusion_matrix, accuracy_score,
43 recall_score, precision_score
45 classification_report, f1_score,
plot_confusion_matrix
from catboost import
CatBoostClassifier
SMOTE
learning_curve
plot_decision_regions
from sklearn.base import clone
from sklearn.decomposition import
PCA
from tensorflow.keras.models import
Sequential, Model, load_model
SimpleImputer
Step Define write_df_to_qtable() and populate_table() methods to populate

2 any table widget with some data:
1 # Takes a df and writes it to a qtable

2 provided. df headers become qtable
headers
3
@staticmethod
4
def write_df_to_qtable(df,table):
5
headers = list(df)
6
7 table.setRowCount(df.shape[0])
8 table.setColumnCount(df.shape[1])
9
10 table.setHorizontalHeaderLabels(headers)
11
12 # getting data from df is
computationally costly so convert it to
13
array first
14
df_array = df.values
15
for row in range(df.shape[0]):
16
for col in range(df.shape[1]):
17
table.setItem(row, col, \
18
QTableWidgetItem(str(df_array[row,col]))
19 )
20
21 def populate_table(self,data, table):
22 #Populates two tables
23 self.write_df_to_qtable(data,table)
24
25 table.setAlternatingRowColors(True)
table.setStyleSheet(\
"alternate-background-color:
#ffb07c;background-color:
#e6daa6;");
The code includes two functions for populating a QTableWidget with data
from a pandas DataFrame. Here's a step-by-step explanation of the code:
1. write_df_to_qtable() method: This static
method takes a DataFrame (df) and a
QTableWidget (table) as input parameters. It
is responsible for writing the DataFrame data
to the QTableWidget. Here are the steps
performed in this method:
a. Get the headers of the DataFrame using
list(df) and assign them to the headers
variable.
b. Set the number of rows and columns in the
QTableWidget using setRowCount and
setColumnCount, respectively, based on
the shape of the DataFrame.
c. Set the horizontal header labels of the
QTableWidget using
setHorizontalHeaderLabels and pass the
headers list.
d. Convert the DataFrame to a NumPy array
using df.values and assign it to the df_array
variable. This step is performed to improve
computational efficiency when retrieving
data from the DataFrame.
e. Iterate over each row and column in the
DataFrame using nested loops. Set the
QTableWidgetItem in the corresponding
position of the QTableWidget using
setItem. Convert the DataFrame values to
strings using str before assigning them to
the QTableWidgetItem.
2. populate_table() method: This method takes
two parameters - data (a DataFrame) and table
(a QTableWidget). It is responsible for
populating the QTableWidget with the data
from the DataFrame. Here are the steps
performed in this method:
a. Call the write_df_to_qtable() method and
pass the data and table as arguments to
populate the table with the DataFrame
data.
b. Set the alternating row colors of the
QTableWidget using
setAlternatingRowColors and pass True as
the argument.
c. Apply custom stylesheet to the
QTableWidget using setStyleSheet to set
the alternating background colors.
Step Define initial_state() method to disable some widgets when form initially
3 runs:
1 def initial_state(self, state):

2 self.pbTrainML.setEnabled(state)
3 self.cbData.setEnabled(state)
4 self.cbClassifier.setEnabled(state)
5
6 self.cbPredictionML.setEnabled(state
)
7
8
self.cbPredictionDL.setEnabled(state
9 )
10 self.pbTrainDL.setEnabled(state)
self.rbRaw.setEnabled(state)
self.rbNorm.setEnabled(state)
self.rbStand.setEnabled(state)
The initial_state() method takes a boolean parameter state and is

responsible for setting the initial state of various UI elements. Here's a
step-by-step explanation of the code:
self.pbTrainML.setEnabled(state): This line
sets the enabled state of a push button named
pbTrainML based on the value of the state
parameter. If state is True, the push button is
enabled; otherwise, it is disabled.
self.cbData.setEnabled(state): This line sets
the enabled state of a combo box named
cbData based on the value of the state
parameter. If state is True, the combo box is
self.cbClassifier.setEnabled(state): This line
sets the enabled state of a combo box named
cbClassifier based on the value of the state
parameter. If state is True, the combo box is
self.cbPredictionML.setEnabled(state): This
line sets the enabled state of a combo box
named cbPredictionML based on the value of
the state parameter. If state is True, the
combo box is enabled; otherwise, it is disabled.
self.cbPredictionDL.setEnabled(state): This
line sets the enabled state of a combo box
named cbPredictionDL based on the value of
the state parameter. If state is True, the
combo box is enabled; otherwise, it is disabled.
self.pbTrainDL.setEnabled(state): This line
sets the enabled state of a push button named
pbTrainDL based on the value of the state
parameter. If state is True, the push button is
self.rbRaw.setEnabled(state): This line sets the
enabled state of a radio button named rbRaw
based on the value of the state parameter. If
state is True, the radio button is enabled;
otherwise, it is disabled.
self.rbNorm.setEnabled(state): This line sets
the enabled state of a radio button named
rbNorm based on the value of the state
parameter. If state is True, the radio button is
self.rbStand.setEnabled(state): This line sets
the enabled state of a radio button named
rbStand based on the value of the state
parameter. If state is True, the radio button is
Overall, the initial_state() method is used to control the enabled/disabled
state of various UI elements based on the state parameter, allowing you to
set the initial state of these elements in your application's UI.
Step Read dataset, Replace ? with NAN, convert six columns into numerical,
4 delete irrelevant columns, handle missing values, copy dataframe as
dummy dataset for visualization, convert binaryClass feature to {0,1},
convert sex feature to {0,1}, and replace t with 1 and f with 0,:
1 def read_dataset(self, dir):

2 #Loads csv file
3 df = pd.read_csv(dir)
4
5 #Replaces ? with NAN
7
8 #Converts six columns into
9 numerical
10 num_cols = ['age',
'FTI','TSH','T3','TT4','T4U']
11
df[num_cols] =
12
df[num_cols].apply(pd.to_numeric, \
13
errors='coerce')
14
15
16
18
20 for col in ['sex', 'T4U measured']:
21 self.mode_imputation(df, col)
22
23 df['age'].fillna(df['age'].mean(),
24 inplace=True)
25 imputer =
SimpleImputer(strategy='mean')
26
df['TSH'] =
27
imputer.fit_transform(df[['TSH']])
28
df['T3'] =
29 imputer.fit_transform(df[['T3']])
30 df['TT4'] =
31 imputer.fit_transform(df[['TT4']])
32 df['T4U'] =
34 df['FTI'] =
35 imputer.fit_transform(df[['FTI']])
36
41
42 #Creates dummy dataset for
visualization
43
df_dummy=df.copy()
44
45
#Converts binaryClass feature to
46
{0,1}
47
df['binaryClass'] =
48 df['binaryClass'].apply(\
49 lambda x: self.map_binaryClass(x))
50
51 #Converts sex feature to {0,1}
52 df['sex'] =
53 df['sex'].apply(lambda x:
54 self.map_sex(x))
55
56 #Replaces t with 1 and f with 0
57 df=df.replace({"t":1,"f":0})
58
59 return df, df_dummy
60
categorical features
62
def mode_imputation(self, df,
63
feature):
64
#Converts binaryClass feature to

{0,1}
def map_binaryClass(self, n):
if n == "N":
return 0
else:
return 1

def map_sex(self, n):
if n == "F":
return 0
else:
return 1
The read_dataset() method is used to read and preprocess a dataset from a

given directory. Here's a step-by-step explanation of the code:
1. df = pd.read_csv(dir): This line reads a CSV
file from the specified directory and stores it in
a Pandas DataFrame called df.
2. df=df.replace({"?":np.NAN}): This line replaces
any occurrences of "?" in the DataFrame with
NaN values.
3. num_cols = ['age', 'FTI','TSH','T3','TT4','T4U']:
This line defines a list of column names that
need to be converted to numerical values.
4. df[num_cols] =
df[num_cols].apply(pd.to_numeric,
errors='coerce'): This line converts the
columns specified in num_cols to numeric
values. Any values that cannot be converted
are replaced with NaN.
5. df.drop(['TBG','referral
source'],axis=1,inplace=True): This line drops
the 'TBG' and 'referral source' columns from
the DataFrame, as they are deemed irrelevant.
6. for col in ['sex', 'T4U measured']:: This line
starts a loop to handle missing values in the
'sex' and 'T4U measured' columns.
7. self.mode_imputation(df, col): This line calls
the mode_imputation method to perform mode
imputation for the current column.
8. df['age'].fillna(df['age'].mean(), inplace=True):
This line fills missing values in the 'age'
column with the mean value of the column.
9. imputer = SimpleImputer(strategy='mean'):
This line creates an instance of the
SimpleImputer class with the strategy set to
'mean'.
10. df['TSH'] = imputer.fit_transform(df[['TSH']]):
This line fills missing values in the 'TSH'
column with the mean value of the column
using the imputer object.
11. Steps 10-12 are repeated for the 'T3', 'TT4',
'T4U', and 'FTI' columns to fill missing values
with their respective column means.
12. for i in range(df.shape[0]):: This line starts a
loop to clean up the 'age' column.
13. if df.age.iloc[i] > 100.0:: This line checks if the
current value in the 'age' column is greater
than 100.
14. df.age.iloc[i] = 100.0: This line sets the value
in the 'age' column to 100 if it is greater than
100.
15. df_dummy=df.copy(): This line creates a copy
of the DataFrame df called df_dummy for
visualization purposes.
16. df['binaryClass'] =
df['binaryClass'].apply(lambda x:
self.map_binaryClass(x)): This line converts
the 'binaryClass' column values to 0 or 1 using
the map_binaryClass method.
17. df['sex'] = df['sex'].apply(lambda x:
self.map_sex(x)): This line converts the 'sex'
column values to 0 or 1 using the map_sex
method.
18. df=df.replace({"t":1,"f":0}): This line replaces
't' with 1 and 'f' with 0 in the DataFrame.
19. Finally, the method returns the preprocessed
DataFrame df and the visualization DataFrame
df_dummy.
The mode_imputation, map_binaryClass, and map_sex methods are helper
methods used within the read_dataset method for specific operations, such
as mode imputation and mapping categorical values to numerical values.
Step Define populate_cbData() to populate cbData widget:

5
1 def populate_cbData(self):
2 self.cbData.addItems(self.df.iloc[:,1:]
3 )
4 self.cbData.addItems(["Features
Importance"])
5
self.cbData.addItems(["Correlation
Matrix", \
"Pairwise Relationship", "Features
Correlation"])
The populate_cbData() method is used to populate a combo box (cbData)

with various options. Here's an explanation of each step:
1. self.cbData.addItems(self.df.iloc[:,1:]): This
line adds items to the combo box based on the
columns of the DataFrame df. It adds all
columns except the first column, which is
assumed to be the index column.
2. self.cbData.addItems(["Features
Importance"]): This line adds the item
"Features Importance" to the combo box.
3. self.cbData.addItems(["Correlation Matrix",
"Pairwise Relationship", "Features
Correlation"]): This line adds multiple items to
the combo box: "Correlation Matrix", "Pairwise
Relationship", and "Features Correlation".
By executing these steps, the cbData combo box will be populated with the
column names of the DataFrame df, along with the additional options
"Features Importance", "Correlation Matrix", "Pairwise Relationship", and
"Features Correlation".
Step Define import_dataset() method to import dataset for machine learning

6 algorithms (df) and populate two table widgets with data and its
description:
1 def import_dataset(self):
3 dataset_dir = curr_path +
4 "/hypothyroid.csv"
5
6 #Loads csv file
7 self.df = self.read_dataset(dataset_dir)
8
9 #Populates tables with data
10 self.populate_table(self.df, self.twData1)
11 self.label1.setText('Thyroid Disease Data')
12
13 self.populate_table(self.df.describe(),
self.twData2)
14
self.twData2.setVerticalHeaderLabels(['Count'
15
, \
16
'Mean', 'Std', 'Min', '25%', '50%', '75%',
17 'Max'])
18 self.label2.setText('Data Desciption')
19
20 #Turns on pbTrainML widget
21 self.pbTrainML.setEnabled(True)
22 self.pbTrainDL.setEnabled(True)
23
24 #Turns off pbLoad
25 self.pbLoad.setEnabled(False)
#Populates cbData
self.populate_cbData()
The import_dataset() method is responsible for importing the dataset,

populating tables with data, and updating various widgets in the GUI.
Here's a breakdown of the steps performed in this method:
1. curr_path = os.getcwd() and dataset_dir =
curr_path + "/hypothyroid.csv": These lines
obtain the current working directory and
create the path to the dataset file
"hypothyroid.csv".
2. self.df = self.read_dataset(dataset_dir): This
line calls the read_dataset method to load and
preprocess the dataset, storing it in the self.df
attribute.
3. self.populate_table(self.df, self.twData1): This
line populates the first table (twData1) with
the loaded dataset (self.df).
4. self.label1.setText('Thyroid Disease Data'):
This line sets the text of label1 to "Thyroid
Disease Data".
5. self.populate_table(self.df.describe(),
self.twData2): This line populates the second
table (twData2) with the summary statistics of
the dataset obtained using the describe
method.
6. self.twData2.setVerticalHeaderLabels(['Count',
'Mean', 'Std', 'Min', '25%', '50%', '75%',
'Max']): This line sets the vertical header
labels of twData2 to the specified list of
strings.
7. self.label2.setText('Data Description'): This
line sets the text of label2 to "Data
Description".
8. self.pbTrainML.setEnabled(True) and
self.pbTrainDL.setEnabled(True): These lines
enable the training buttons (pbTrainML and
pbTrainDL).
9. self.pbLoad.setEnabled(False): This line
disables the load button (pbLoad).
10. self.populate_cbData(): This line calls the
populate_cbData method to populate the
combo box (cbData) with options based on the
loaded dataset.
By executing these steps, the GUI will be updated with the loaded dataset,
summary statistics, and enabled training buttons.
Step Connect clicked() event of pbLoad widget with import_dataset() and put
7 it inside __init__() method as shown in line 8 and invoke initial_state()
methode in line 9:
5 "GUI Demo of Classifying and Predicting Thyroid
6 Disease")
7 self.addToolBar(NavigationToolbar(\
9 self.pbLoad.clicked.connect(self.import_dataset
)
self.initial_state(False)
Figure 147 The initial state of form
Figure 148 When LOAD DATA button is clicked, the two tables will be
populated
Step Run gui_thyroid.py and you will see the other widgets are initially
8 disabled as shown in Figure 147. Then click LOAD DATA button. The two
tables will be populated as shown in Figure 148.
Step Define fit_dataset() method to resample data using SMOTE:

1
1 def fit_dataset(self, df):

2 #Extracts label feature as target
3 variable
4 y = df['binaryClass'].values #
Target for the model
5
6
#Drops label feature and set input
7
variable
8
X = df.drop('binaryClass', axis
9 = 1)
10
11 #Resamples data
X,y = sm.fit_resample(X,
y.ravel())
return X, y
The fit_dataset() method takes a DataFrame df as input and

performs the following steps:
1. Extracts the label feature 'binaryClass'
from the DataFrame and assigns it to
the variable y. This will be the target
variable for the model.
2. Drops the label feature from the
DataFrame using the drop method, with
axis=1. The resulting DataFrame,
without the label feature, is assigned to
the variable X. This will be the input
variables for the model.
3. Performs resampling of the data using
the Synthetic Minority Over-sampling
Technique (SMOTE). SMOTE is a
method used to address class imbalance
by creating synthetic samples of the
minority class. The SMOTE object is
created with random_state=2021 to
ensure reproducibility. The fit_resample
method is then called on X and y to
resample the data, and the resampled X
and y are assigned back to X and y,
respectively.
4. Finally, the method returns the
resampled X and y as the output.
By calling this method on a DataFrame, you can obtain the
resampled input features X and the corresponding target variable
y that are ready to be used for model training.
Step Define train_test() to split dataset into train and test data with
2 raw, normalized, and standardized feature scaling:
1 def train_test(self):
2 X, y = self.fit_dataset(self.df)
3
5 testing
train_test_split(X, y,\
7
test_size = 0.2, random_state = 2021,
8
stratify=y)
9
self.X_train_raw = X_train.copy()
10
self.X_test_raw = X_test.copy()
11
self.y_train_raw = y_train.copy()
12
self.y_test_raw = y_test.copy()
13
14
#Saves into npy files
15
save('X_train_raw.npy',
16 self.X_train_raw)
17 save('y_train_raw.npy',
18 self.y_train_raw)
19 save('X_test_raw.npy',
20 self.X_test_raw)
21 save('y_test_raw.npy',
self.y_test_raw)
22
23
self.X_train_norm = X_train.copy()
24
self.X_test_norm = X_test.copy()
25
26 self.y_train_norm = y_train.copy()
27 self.y_test_norm = y_test.copy()
28 norm = MinMaxScaler()
29 self.X_train_norm[inf_cols] = \
30 norm.fit_transform(self.X_train_norm)
31 self.X_test_norm[inf_cols] = \
32 norm.transform(self.X_test_norm)
33
34 #Saves into npy files
35 save('X_train_norm.npy',
36 self.X_train_norm)
37 save('y_train_norm.npy',
self.y_train_norm)
38
save('X_test_norm.npy',
39
self.X_test_norm)
40
save('y_test_norm.npy',
41 self.y_test_norm)
42
43 self.X_train_stand = X_train.copy()
44 self.X_test_stand = X_test.copy()
45 self.y_train_stand = y_train.copy()
46 self.y_test_stand = y_test.copy()
47 scaler = StandardScaler()
48 self.X_train_stand[inf_cols] = \
scaler.fit_transform(self.X_train_stand
)
self.X_test_stand[inf_cols] = \
scaler.transform(self.X_test_stand)

save('X_train_stand.npy',
self.X_train_stand)
save('y_train_stand.npy',
self.y_train_stand)
save('X_test_stand.npy',
self.X_test_stand)
save('y_test_stand.npy',
self.y_test_stand)
The train_test() method performs the following steps:

1. Calls the fit_dataset() method on self.df
to obtain the resampled input features
X and the corresponding target variable
y.
2. Splits the data into training and testing
sets using the train_test_split function
from scikit-learn. The resampled X and
y are passed to the function along with
test_size=0.2 to indicate that 20% of
the data should be used for testing. The
random state is set to 2021 for
reproducibility. The resulting training
and testing sets are assigned to X_train,
X_test, y_train, and y_test.
3. Creates copies of the raw training and
testing sets and assigns them to
self.X_train_raw, self.X_test_raw,
self.y_train_raw, and self.y_test_raw.
These copies are saved as .npy files
using the save function from numpy.
4. Creates copies of the normalized
training and testing sets and assigns
them to self.X_train_norm,
self.X_test_norm, self.y_train_norm, and
self.y_test_norm. The columns specified
in inf_cols are normalized using the
MinMaxScaler from scikit-learn. The
normalized training and testing sets are
saved as .npy files.
5. Creates copies of the standardized
training and testing sets and assigns
them to self.X_train_stand,
self.X_test_stand, self.y_train_stand,
and self.y_test_stand. The columns
specified in inf_cols are standardized
using the StandardScaler from scikit-
learn. The standardized training and
testing sets are saved as .npy files.
By calling this method, you will split the resampled data into
different sets (raw, normalized, standardized) and save them as
.npy files for later use in training and evaluating machine
learning models.
Step Define split_data_ML() method execute splitting dataset into

3 train and test data:
1 def split_data_ML(self):
2 if path.isfile('X_train_raw.npy'):
3 #Loads npy files
4 self.X_train_raw = \
5 np.load('X_train_raw.npy',allow_pickle=True)
6 self.y_train_raw = \
7 np.load('y_train_raw.npy',allow_pickle=True)
8 self.X_test_raw = \
9 np.load('X_test_raw.npy',allow_pickle=True)
10 self.y_test_raw = \
11 np.load('y_test_raw.npy',allow_pickle=True)
12
13 self.X_train_norm = \
14 np.load('X_train_norm.npy',allow_pickle=True)
15 self.y_train_norm = \
16 np.load('y_train_norm.npy',allow_pickle=True)
17 self.X_test_norm = \
18 np.load('X_test_norm.npy',allow_pickle=True)
19 self.y_test_norm = \
20 np.load('y_test_norm.npy',allow_pickle=True)
21
22 self.X_train_stand = \
23 np.load('X_train_stand.npy',allow_pickle=True
24 )
25 self.y_train_stand = \
26
27 np.load('y_train_stand.npy',allow_pickle=True
28 )
29 self.X_test_stand = \
30 np.load('X_test_stand.npy',allow_pickle=True)
31 self.y_test_stand = \
32 np.load('y_test_stand.npy',allow_pickle=True)
33
34 else:
35 self.train_test()
36
37 #Prints each shape
38 print('X train raw shape: ',
self.X_train_raw.shape)
39
print('Y train raw shape: ',
40
self.y_train_raw.shape)
41
print('X test raw shape: ',
42 self.X_test_raw.shape)
43 print('Y test raw shape: ',
44 self.y_test_raw.shape)
45
46 #Prints each shape
47 print('X train norm shape: ',
48 self.X_train_norm.shape)
49 print('Y train norm shape: ',
self.y_train_norm.shape)
50
print('X test norm shape: ',
self.X_test_norm.shape)
print('Y test norm shape: ',
self.y_test_norm.shape)
#Prints each shape

print('X train stand shape: ',
self.X_train_stand.shape)
print('Y train stand shape: ',
self.y_train_stand.shape)
print('X test stand shape: ',
self.X_test_stand.shape)
print('Y test stand shape: ',
self.y_test_stand.shape)
The split_data_ML() method performs the following steps:

1. Checks if the .npy files containing the
split data already exist by using the
path.isfile function from the os module.
2. If the files exist, it loads the data from
the .npy files using the np.load function
and assigns the loaded arrays to the
corresponding variables
(self.X_train_raw, self.y_train_raw,
self.X_test_raw, self.y_test_raw,
self.X_train_norm, self.y_train_norm,
self.X_test_norm, self.y_test_norm,
self.X_train_stand, self.y_train_stand,
self.X_test_stand, self.y_test_stand).
3. If the files do not exist, it calls the
train_test method to generate and save
the split data.
4. Prints the shape of each set of training
and testing data for the raw,
normalized, and standardized versions.
The shapes of the arrays are printed
using the shape attribute of numpy
arrays.
By calling this method, you can either load the split data from
existing .npy files or generate and save the split data if the files
do not exist. The shapes of the loaded or generated data are then
printed for each version (raw, normalized, standardized) of the
Step Define train_model_ML() method to invoke split_data_ML()

4 method:
1 def train_model_ML(self):
2 self.split_data_ML()
3
4 #Turns on three widgets
5 self.cbData.setEnabled(True)
6 self.cbClassifier.setEnabled(True)
7
8 self.cbPredictionML.setEnabled(True
)
9
10
#Turns off pbTrainML
self.pbTrainML.setEnabled(False)
The train_model_ML() method performs the following steps:

1. Calls the split_data_ML() method to
ensure that the training and testing
data are available.
2. Enables three widgets (self.cbData,
self.cbClassifier, self.cbPredictionML)
to allow user interaction for selecting
the data, classifier, and prediction
method.
3. Disables the pbTrainML widget to
prevent further training while the user
is selecting options.
By calling this method, the training and testing data are
prepared and made available for further interactions with the
GUI. The user can then select the desired data, classifier, and
prediction method using the enabled widgets.
Figure 149 The cbData, cbClassifier, and cbPredictionML

widgets are enabled user clicks TRAIN ML MODEL button
Step Connect clicked() event of pbTrainML widget with
5 train_model_ML() and put it inside __init__() method as shown
in line 10:
3 loadUi("gui_voice.ui",self)
5 "GUI Demo of Classifying and Predicting Voice
6 Based Gender")
9 self.pbLoad.clicked.connect(self.import_dataset)
10 self.initial_state(False)
self.pbTrainML.clicked.connect(self.train_model_ML
)
The line of code

self.pbTrainML.clicked.connect(self.train_model_ML) connects
the train_model_ML() method to the clicked signal of the
pbTrainML button.
This means that when the pbTrainML button is clicked by the

user, the train_model_ML() method will be executed. It allows
the user to initiate the training process by clicking the button.
Step Run gui_thyroid.py and you will see the other widgets are
6 initially disabled. Click LOAD DATA button. The two tables are
populated, LOAD DATA button is disabled, and TRAIN ML
MODEL and TRAIN DL MODEL buttons are enabled. Then,
click on TRAIN ML MODEL button. You will see that cbData,
cbClassifier, and cbPredictionML are enabled and pbTrainML
is disabled as shown in Figure 149. You also will find four
training dan test npy files for machine learning in your working
directory.
Step Define pie_cat() and bar_cat() method to plot distribution of a

1
categorical feature in pie and bar chart on a widget:
1 def pie_cat(self, df_target, var_target,

2 labels, widget):
3 df_target.value_counts().plot.pie(\
4 ax = widget.canvas.axis1,labels=labels,\
5 startangle=40,explode=
[0,0.15],shadow=True,\
6
colors=['#ff6666','#F5C7B8FF'],autopct =
7
'%1.1f%%',\
8
9
widget.canvas.axis1.set_title('The
10 distribution of ' + \
11 var_target + ' variable', fontweight
12 ="bold",fontsize=14)
13 widget.canvas.figure.tight_layout()
14 widget.canvas.draw()
15
16 def bar_cat(self,df,var, widget):
17 ax =
18 df[var].value_counts().plot(kind="barh",\
19 ax = widget.canvas.axis1)
20
21 for i,j in
enumerate(df[var].value_counts().values):
ax.text(.7,i,j,weight =
"bold",fontsize=10)
widget.canvas.axis1.set_title("Count
of "+ var +" cases")
widget.canvas.figure.tight_layout()
widget.canvas.draw()
The pie_cat() method plots a pie chart to visualize the distribution of a

categorical variable in the given DataFrame df_target. It takes the
following parameters:
df_target: The DataFrame containing the
target variable.
var_target: The name of the target variable.
labels: The labels to be displayed in the pie
chart.
widget: The widget that contains the
matplotlib figure for displaying the chart.
The method uses the value_counts() function to count the occurrences
of each category in the target variable and plots a pie chart using the
plot.pie() function. The chart is customized with attributes such as start
angle, explode, shadow, colors, and autopct (to display the percentage
values). The title of the chart is set using the set_title() method of the
matplotlib axis object.
The bar_cat() method plots a horizontal bar chart to visualize the count
of categories in a categorical variable in the given DataFrame df. It
takes the following parameters:
df: The DataFrame containing the variable to
be plotted.
var: The name of the categorical variable.
widget: The widget that contains the
matplotlib figure for displaying the chart.
The method uses the value_counts() function to count the occurrences
of each category in the variable and plots a horizontal bar chart using
the plot(kind='barh') function. The count values are displayed on the
bars using the text() function. The title of the chart is set using the
set_title() method of the matplotlib axis object.
Both methods update the matplotlib figure in the specified widget and
redraw the canvas to display the chart.
Step Define dist_percent_plot to plot distribution of label feature against

2 another categorical feature:
1 #Plots label with other variable

2 def
3 dist_percent_plot(self,df,cat,ax1,ax2):
4 cmap1=plt.cm.coolwarm_r
5
6 result = df.groupby(cat).apply (lambda
group: \
7
(group.binaryClass == 'N').sum() /
8
9 float(group.binaryClass.count())
10 ).to_frame('N')
11 result["P"] = 1 -result["N"]
12 g=result.plot(kind='bar', stacked =
13 True,\
14 colormap=cmap1, ax=ax1, grid=True)
15 self.put_label_stacked_bar(g,17)
16 ax1.set_xlabel(cat)
17 ax1.set_title('Stacked Bar Plot of '+
cat +' (in %)', \
18
fontsize=14)
19
ax1.set_ylabel('% Thyroid (Negative vs
20
Positive)')
21
22
group_by_stat = df.groupby([cat,
23 'binaryClass']).size()
24
25 g=group_by_stat.unstack().plot(kind='bar',
26 \
27 stacked=True,ax=ax2,grid=True)
29
30 ax2.set_title('Stacked Bar Plot of '+
cat +' (in %)', \
31
fontsize=14)
32
33
ax2.set_xlabel(cat)
34
plt.show()
35
36
def put_label_stacked_bar(self,
37
ax,fontsize):
38
#patches is everything inside of the
39 chart
40 for rect in ax.patches:
41 # Find where everything is located
42 height = rect.get_height()
43 width = rect.get_width()
44 x = rect.get_x()
45 y = rect.get_y()
46
47
48 # The height of the bar is the data value
49 and can be used as the label
label_text = f'{height:.0f}'

specified value
if height > 0:
ax.text(label_x, label_y,
label_text, \
ha='center', va='center', \
weight = "bold",fontsize=fontsize)
The dist_percent_plot() method is used to plot a stacked bar chart to

visualize the distribution of a categorical variable (cat) in relation to the
binary class (binaryClass) in the given DataFrame df. It takes the
following parameters:
df: The DataFrame containing the data.
cat: The name of the categorical variable to
be plotted.
ax1: The axis object to display the stacked
bar plot of percentages.
ax2: The axis object to display the stacked
bar plot of counts.
The method first calculates the percentage of "Negative" ('N') and
"Positive" ('P') values in the binaryClass column for each category of the
cat variable. It then plots a stacked bar chart using the plot(kind='bar',
stacked=True) function. The colormap is set to coolwarm_r using
colormap=cmap1. The put_label_stacked_bar method is called to add
labels to the bars.
The put_label_stacked_bar() method is a helper function used to place

labels on the stacked bar chart. It takes the matplotlib axis object (ax)
and the fontsize as parameters. It iterates through each bar in the plot
and adds the height value as a label at the center of the bar.
Both methods update the matplotlib figures in the specified axis objects
and display the charts using plt.show().
Step Define choose_plot() to read currentText property of cbData widget

3 and act accordingly:
1 def choose_plot(self):
2 strCB = self.cbData.currentText()
3
4 if strCB == "binaryClass":
5 #Plots distribution of binaryClass variable in pie
6 chart
7 self.widgetPlot1.canvas.figure.clf()
8 self.widgetPlot1.canvas.axis1 = \
9 self.widgetPlot1.canvas.figure.add_subplot(121,\
10 facecolor = '#fbe7dd')
11 label_class = \
12 list(self.df_dummy["binaryClass"].value_counts().index
)
13
self.pie_cat(self.df_dummy["binaryClass"],\
14
'binaryClass', label_class, self.widgetPlot1)
15
self.widgetPlot1.canvas.figure.tight_layout()
16
self.widgetPlot1.canvas.draw()
17
18
self.widgetPlot1.canvas.axis1 = \
19
self.widgetPlot1.canvas.figure.add_subplot(122,\
20
facecolor = '#fbe7dd')
21
self.bar_cat(self.df_dummy,"binaryClass",
22
self.widgetPlot1)
23
24
25
26
self.widgetPlot2.canvas.figure.clf()
27
28
29
30
31
32
33
34 self.dist_percent_plot(self.df_dummy,"sex",\
35 self.widgetPlot2.canvas.axis1,\
36 self.widgetPlot2.canvas.axis2)
37 self.widgetPlot2.canvas.figure.tight_layout()
38 self.widgetPlot2.canvas.draw()
39
44 g=sns.countplot(self.df_dummy["on thyroxine"],\
45 hue = self.df_dummy["binaryClass"], \
46 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
48 self.widgetPlot3.canvas.axis1.set_title(\
49 "on thyroxine versus binaryClass",\
50 fontweight ="bold",fontsize=14)
51
55 g=sns.countplot(self.df_dummy["TSH measured"],\
60 "TSH measured versus binaryClass",\
62
66 g=sns.countplot(self.df_dummy["TT4 measured"],\
71 "TT4 measured versus binaryClass",\
73
77 g=sns.countplot(self.df_dummy["T4U measured"],\
82 "T4U measured versus binaryClass",\
84
88 g=sns.countplot(self.df_dummy["T3 measured"],\
93 "T3 measured versus binaryClass",\
98 g=sns.countplot(self.df_dummy["query on
99 thyroxine"],\
104 "query on thyroxine versus binaryClass",\
109 g=sns.countplot(self.df_dummy["sick"],\
114 "sick versus binaryClass",\
119 g=sns.countplot(self.df_dummy["tumor"],\
124 "tumor versus binaryClass",\
126
127
128
129
g=sns.countplot(self.df_dummy["psych"],\
130
hue = self.df_dummy["binaryClass"], \
131
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
132
self.put_label_stacked_bar(g,17)
133
self.widgetPlot3.canvas.axis1.set_title(\
134
"psych versus binaryClass",\
135
fontweight ="bold",fontsize=14)
136
137
138
The choose_plot() method is called when a specific option is selected

from the cbData combobox. It determines the selected option using
self.cbData.currentText() and performs different plotting operations
based on the selected option.
If the selected option is "binaryClass", the method plots the distribution

of the "binaryClass" variable in a pie chart and a bar chart using the
pie_cat and bar_cat methods, respectively. It also plots the stacked bar
plots for the "sex" variable using the dist_percent_plot method.
Additionally, it plots multiple count plots for various variables such as
"on thyroxine", "TSH measured", "TT4 measured", "T4U measured", "T3
measured", "query on thyroxine", "sick", "tumor", and "psych".
The plotting operations update the figures in the specified axis objects
and redraw the canvas to display the charts.
Step Connect currentIndexChanged() event of cbData widget with

4 choose_plot() method and put it inside __init__() method as shown in
line 11:
6 Disease")
11 self.pbTrainML.clicked.connect(self.train_model_ML)
self.cbData.currentIndexChanged.connect(self.choose_plot
)
The line of code

self.cbData.currentIndexChanged.connect(self.choose_plot) connects
the currentIndexChanged signal of the cbData combobox to the
choose_plot() method. This means that whenever the selected item in
the combobox changes, the choose_plot() method will be called to
update the plots based on the newly selected item.
Step Run gui_thyroid.py and click LOAD DATA and TRAIN ML MODEL
5 buttons. Then, choose binaryClass item from cbData widget. You will
see the result as shown in Figure 150.
Figure 150 The distribution of binaryClass variable versus other
Distribution of TSH Measured
Step Add this code to the end of choose_plot() method:

1
1 if strCB == "TSH measured":

2 #Plots distribution of TSH measured variable in pie
3 chart
8 label_class = \
9 list(self.df_dummy["TSH
measured"].value_counts().index)
10
self.pie_cat(self.df_dummy["TSH measured"],\
11
12
13
14
15
16
17
18
19 self.bar_cat(self.df_dummy," TSH measured",
20 self.widgetPlot1)
23
31 self.dist_percent_plot(self.df_dummy,"TSH measured",\
36
41 g=sns.countplot(self.df_dummy["on
42 thyroxine"],\
43 hue = self.df_dummy["TSH measured"], \
44 palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
)
45
46
47
"on thyroxine versus TSH measured",\
48
49
50
51
52
53
g=sns.countplot(self.df_dummy["goitre"],\
54
hue = self.df_dummy["TSH measured"], \
55
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1
56
)
57
58
59
"goitre versus TSH measured",\
60
62
66 g=sns.countplot(self.df_dummy["TT4
67 measured"],\
)
70
71
72
"TT4 measured versus TSH measured",\
73
74
75
76
77
78
g=sns.countplot(self.df_dummy["T4U
79
measured"],\
80
81
82 )
85 "T4U measured versus TSH measured",\
87
91 g=sns.countplot(self.df_dummy["T3 measured"],\
94 )
97 "T3 measured versus TSH measured",\
103 thyroxine"],\
)
106
107
108
"query on thyroxine versus TSH measured",\
109
110
)
117
118
119
"sick versus TSH measured",\
120
121
)
128
129
130
"tumor versus TSH measured",\
131
132
)
"psych versus TSH measured",\
The purpose of this code block is to generate and update plots based
on the selected item "TSH measured" in the combobox. It specifically
focuses on visualizing the distribution and relationships of variables
related to "TSH measured".
Figure 151 The distribution of TSH measured variable versus other

The code performs the following tasks:

Clears the first subplot in widgetPlot1 and
adds a new subplot for a pie chart
representing the distribution of the "TSH
measured" variable.
Clears the second subplot in widgetPlot1
and adds a new subplot for a bar chart
representing the count of "TSH measured"
cases.
Clears the subplots in widgetPlot2 and
adds two new subplots. The first subplot
represents the stacked bar plot of "TSH
measured" variable, while the second
subplot shows the percentage distribution
of "TSH measured".
Clears the subplots in widgetPlot3 and
adds several subplots representing the
count of different variables grouped by
"TSH measured".
Overall, this code block updates the plots in widgetPlot1, widgetPlot2,
and widgetPlot3 based on the selected item "TSH measured" in the
combobox, providing insights into the distribution and relationships of
variables related to "TSH measured".
2 buttons. Then, choose TSH measured item from cbData widget. You
will see the result as shown in Figure 151.
Distribution of T3 Measured

1
1 if strCB == "T3 measured":

2 #Plots distribution of T3 measured variable in pie
3 chart
8 label_class = \
9 list(self.df_dummy["T3
10
self.pie_cat(self.df_dummy["T3 measured"],\
11
12
13
14
15
19 self.bar_cat(self.df_dummy,"T3 measured",
23
31 self.dist_percent_plot(self.df_dummy,"T3 measured",\
36
thyroxine"],\
42
hue = self.df_dummy["T3 measured"], \
43
44
)
45
46
47
"on thyroxine versus T3 measured",\
48
49
50
51
52
53
54
55 hue = self.df_dummy["T3 measured"], \
57 )
60 "goitre versus T3 measured",\
62
66 g=sns.countplot(self.df_dummy["TT4
measured"],\
67
68
69
)
70
71
72
"TT4 measured versus T3 measured",\
73
74
75
76
77
78
g=sns.countplot(self.df_dummy["T4U
79 measured"],\
82 )
85 "T4U measured versus T3 measured",\
87
91 g=sns.countplot(self.df_dummy["TSH
92 measured"],\
95 )
98 "TSH measured versus T3 measured",\
104 thyroxine"],\
)
107
108
109
"query on thyroxine versus T3 measured",\
110
111
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
g=sns.countplot(self.df_dummy["sick"],\
)
"sick versus T3 measured",\
g=sns.countplot(self.df_dummy["tumor"],\
)
"tumor versus T3 measured",\
)
"psych versus T3 measured",\
The resulting plots will provide visual representations of the

distribution and relationships of variables related to "T3
measured". Here is an overview of each plot:
widgetPlot1:
Subplot 1 (Pie Chart): Represents the
distribution of the "T3 measured"
variable in a pie chart. It shows the
percentage of cases labeled as "T3
measured" and "Not T3 measured".
Subplot 2 (Bar Chart): Represents the
count of "T3 measured" cases in a bar
chart. It displays the number of cases
labeled as "T3 measured" and "Not T3
measured".
widgetPlot2:
Subplot 1: Represents the stacked bar
plot of the "T3 measured" variable. It
shows the percentage distribution of
"T3 measured" cases across different
categories.
Subplot 2: Represents the percentage
distribution of "T3 measured" across
different categories.
widgetPlot3:
Subplots 1-9: Each subplot represents
the count of a specific variable (e.g.,
"on thyroxine", "goitre", "TT4
measured") grouped by the "T3
measured" variable. The plots display
the number of cases for each variable
category, differentiated by the "T3
measured" label.
These plots aim to provide visual insights into the distribution
and relationships of variables related to "T3 measured" in the
dataset. They help in understanding the patterns and potential
correlations between the "T3 measured" variable and other
variables.
Step 2 Run gui_thyroid.py and

and TRAIN ML MOD
choose T3 measured
widget. You will see the
Figure 152.
Figure 152 The di
measured variable vers
featur
Distribution of TT4 Measured

1
1 if strCB == "TT4 measured":

2 #Plots distribution of TT4 measured variable in pie
3 chart
8 label_class = \
9 list(self.df_dummy["TT4
10
self.pie_cat(self.df_dummy["TT4 measured"],\
11
12
13
14
15
16
17
18
self.bar_cat(self.df_dummy,"TT4 measured",
19
self.widgetPlot1)
20
21
22
23
24
25
26
27
31 self.dist_percent_plot(self.df_dummy,"TT4 measured",\
36
42 thyroxine"],\
43 hue = self.df_dummy["TT4 measured"], \
)
45
46
47
"on thyroxine versus TT4 measured",\
48
49
50
51
52
53
54
hue = self.df_dummy["TT4 measured"], \
55
56
)
57
58
59
"goitre versus TT4 measured",\
60
61
62
63
64
65
g=sns.countplot(self.df_dummy["T3 measured"],\
66
67
68 )
71 "T3 measured versus TT4 measured",\
73
77 g=sns.countplot(self.df_dummy["T4U
78 measured"],\
)
81
82
83
"T4U measured versus TT4 measured",\
84
85
86
87
88
89
g=sns.countplot(self.df_dummy["TSH
90
measured"],\
91
92
93 )
96 "TSH measured versus TT4 measured",\
98
99
100
101
g=sns.countplot(self.df_dummy["query on
102 thyroxine"],\
105 )
108 "query on thyroxine versus TT4 measured",\
116 )
119 "sick versus TT4 measured",\
127 )
130 "tumor versus TT4 measured",\
135 g=sns.countplot(self.df_dummy["psych"],\
)
"psych versus TT4 measured",\
Here's the purpose of each plot generated when strCB is equal to

"TT4 measured":
widgetPlot1:
Subplot 1 (Pie Chart): This plot shows the
distribution of the "TT4 measured"
variable. It helps visualize the proportion
of cases labeled as "TT4 measured" and
"Not TT4 measured" in the dataset.
Subplot 2 (Bar Chart): This plot provides a
count of cases for each category of the
"TT4 measured" variable. It gives an
overview of the number of cases in each
category.
widgetPlot2:
Subplot 1: This plot displays a stacked bar
plot of the "TT4 measured" variable against
another variable. It shows the percentage
distribution of "TT4 measured" cases
within each category of the selected
variable. This plot helps identify the
relationship between "TT4 measured" and
another variable.
Subplot 2: This plot presents the count of
cases for each category of the selected
variable. It provides a visual representation
of the distribution of cases within each
category.
widgetPlot3 (Subplots 1-9):
Each subplot represents the count of a
specific variable (e.g., "on thyroxine",
"goitre", "T3 measured") grouped by the
"TT4 measured" variable. These plots help
analyze the distribution of cases for each
variable category based on the "TT4
measured" label. By comparing the counts
of "TT4 measured" and "Not TT4
measured" cases within each category,
these plots reveal potential relationships
and patterns between the variables.
Overall, these plots aim to provide a comprehensive understanding of
the "TT4 measured" variable and its associations with other variables
in the dataset. They assist in identifying any significant trends,
dependencies, or patterns that may exist between "TT4 measured"
and other variables of interest.
2 buttons. Then, choose TT4 measured item from cbData widget. You
Figure 153 The distribution of TT4 measured variable versus other

Case and Probability Distribution
Step Define feat_versus_other() method to plot distribution (histogram) of

1 one feature versus another on a widget:
1 def feat_versus_other(self,
2 feat,another,legend,ax0,label='',title=''):
3 background_color = "#fbe7dd"
4 sns.set_palette(['#ff355d','#66b3ff'])
7
8 ax0.set_facecolor(background_color)
9 ax0_sns = sns.histplot(data=self.df, \
10 x=self.df[feat],ax=ax0,zorder=2,kde=False,
\
11
hue=another,multiple="stack", shrink=.8,\
12
linewidth=0.3,alpha=1)
13
14
self.put_label_stacked_bar(ax0_sns,12)
15
16
ax0_sns.set_xlabel('',fontsize=10,
17
weight='bold')
18
ax0_sns.set_ylabel('',fontsize=10,
19 weight='bold')
20
21
22 ax0_sns.grid(which='major', axis='x',
23 zorder=0, \
25 ax0_sns.grid(which='major', axis='y',
zorder=0, \
26
27
28
ax0_sns.tick_params(labelsize=10,
29
width=0.5, length=1.5)
30
ax0_sns.legend(legend, ncol=2,
facecolor='#D8D8D8', \
edgecolor=background_color, fontsize=14, \
bbox_to_anchor=(1, 0.989), loc='upper
right')
ax0.set_facecolor(background_color)
ax0_sns.set_xlabel(label,fontweight
="bold",fontsize=14)
ax0_sns.set_title(title,fontweight
The purpose of the feat_versus_other() function is to plot a histogram

of one feature (feat) against another feature (another) using a stacked
bar representation. Here's an explanation of the function code:
1. background_color: Specifies the
background color for the plot.
2. sns.set_palette: Sets the color palette for
the plot.
3. Removing spines: Removes the right and
top spines of the plot to improve visual
aesthetics.
4. Setting facecolor: Sets the background
color of the plot.
5. sns.histplot: Plots a histogram of the feat
feature with another as the hue. The
histogram is displayed as stacked bars, with
each bar representing a category of the feat
feature. The legend parameter specifies the
labels for the hue categories.
6. put_label_stacked_bar(): Places labels on
the stacked bars indicating the count or
value represented by each bar.
7. Setting axis labels: Sets the x-axis and y-
axis labels of the plot.
8. Grid lines: Adds grid lines to the plot for
both major axes (x-axis and y-axis).
9. tick_params: Specifies the style of tick
marks on the axes.
10. Legend: Adds a legend to the plot,
displaying the labels specified in legend.
11. Setting facecolor (again): Sets the
background color of the plot.
12. Setting x-label and title: Sets the x-label and
title of the plot using the label and title
parameters.
Overall, the feat_versus_other() function provides a convenient way to
visualize the relationship between two features by plotting them as
stacked bars in a histogram. It allows for easy comparison and analysis
of the distribution of one feature across different categories of another
feature.
Step Define prob_feat_versus_other() method to plot the density of one

2 feature versus another on a widget:
1 def prob_feat_versus_other(self,
2 feat,another,legend,ax0,label='',title=''):
3 background_color = "#fbe7dd"
4 sns.set_palette(['#ff355d','#66b3ff'])
7
8 ax0.set_facecolor(background_color)
9 ax0_sns =
sns.kdeplot(x=self.df[feat],ax=ax0,\
10
hue=another,linewidth=0.3,fill=True,cbar='g',
11
\
12
zorder=2,alpha=1,multiple='stack')
13
14
ax0_sns.set_xlabel('',fontsize=10,
15 weight='bold')
16 ax0_sns.set_ylabel('',fontsize=10,
17 weight='bold')
18
19
20 ax0_sns.grid(which='major', axis='x',
21 zorder=0, \
23 ax0_sns.grid(which='major', axis='y',
zorder=0, \
24
25
26
ax0_sns.tick_params(labelsize=10,
27
width=0.5, length=1.5)
ax0_sns.legend(legend, ncol=2,
facecolor='#D8D8D8', \
edgecolor=background_color, fontsize=14, \
bbox_to_anchor=(1, 0.989), loc='upper right')
ax0_sns.set_xlabel(label,fontweight
ax0_sns.set_title(title,fontweight
The prob_feat_versus_other() function is similar to the

feat_versus_other() function, but instead of plotting stacked bars, it
plots the probability density estimation (KDE) of one feature (feat)
against another feature (another) using filled curves. Here's an
explanation of the function code:
1. background_color: Specifies the
background color for the plot.
2. sns.set_palette: Sets the color palette for
the plot.
3. Removing spines: Removes the right and
top spines of the plot to improve visual
aesthetics.
4. Setting facecolor: Sets the background
color of the plot.
5. sns.kdeplot: Plots the probability density
estimation (KDE) of the feat feature with
another as the hue. The KDE is displayed as
filled curves, with each curve representing
a category of the feat feature. The legend
parameter specifies the labels for the hue
categories.
6. Setting axis labels: Sets the x-axis and y-
axis labels of the plot.
7. Grid lines: Adds grid lines to the plot for
both major axes (x-axis and y-axis).
8. tick_params: Specifies the style of tick
marks on the axes.
9. Legend: Adds a legend to the plot,
displaying the labels specified in legend.
10. Setting facecolor (again): Sets the
background color of the plot.
11. Setting x-label and title: Sets the x-label and
title of the plot using the label and title
parameters.
The prob_feat_versus_other() function provides a way to visualize the
probability distribution of one feature against another feature using
KDE curves. It allows for understanding the likelihood of certain
feature values occurring within different categories of another feature.
Step Define hist_num_versus_nine_cat() method to plot the distribution

3 (histogram) of one numerical feature versus nine categorical ones:
1 def hist_num_versus_nine_cat(self,feat):
2 self.label_bin = \
3 list(self.df_dummy["binaryClass"].value_counts().index
4 )
5 self.label_sex = \
6 list(self.df_dummy["sex"].value_counts().index)
7 self.label_thyroxine = \
8 list(self.df_dummy["on
thyroxine"].value_counts().index)
9
self.label_pregnant = \
10
list(self.df_dummy["pregnant"].value_counts().index)
11
self.label_lithium = \
12
list(self.df_dummy["lithium"].value_counts().index)
13
self.label_goitre = \
14
list(self.df_dummy["goitre"].value_counts().index)
15
self.label_tumor = \
16
list(self.df_dummy["tumor"].value_counts().index)
17
self.label_tsh = \
18
list(self.df_dummy["TSH
19
20
self.label_tt4 = \
21
list(self.df_dummy["TT4
22 measured"].value_counts().index)
23
28 print(self.df_dummy["binaryClass"].value_counts())
29 self.feat_versus_other(feat,\
30 self.df_dummy["binaryClass"],self.label_bin,\
31 self.widgetPlot3.canvas.axis1,label=feat,\
32 title='binaryClass versus ' + feat)
33
37 print(self.df_dummy["sex"].value_counts())
39 self.df_dummy["sex"],self.label_sex,\
41 title='sex versus ' + feat)
42
46 print(self.df_dummy["on thyroxine"].value_counts())
48 self.df_dummy["on thyroxine"],self.label_thyroxine,\
50 title='on thyroxine versus ' + feat)
51
55 print(self.df_dummy["pregnant"].value_counts())
57 self.df_dummy["pregnant"],self.label_pregnant,\
59 title='pregnant versus ' + feat)
60
64 print(self.df_dummy["lithium"].value_counts())
66 self.df_dummy["lithium"],self.label_lithium,\
68 title='lithium versus ' + feat)
69
73 print(self.df_dummy["goitre"].value_counts())
75 self.df_dummy["goitre"],self.label_goitre,\
77 title='goitre versus ' + feat)
78
82 print(self.df_dummy["tumor"].value_counts())
84 self.df_dummy["tumor"],self.label_tumor,\
86 title='tumor versus ' + feat)
87
91 print(self.df_dummy["TSH measured"].value_counts())
92 self.feat_versus_other(feat,self.df_dummy["TSH
93 measured"],\
94 self.label_tsh,self.widgetPlot3.canvas.axis1,\
95 label=feat,title='TSH measured versus ' + feat)
96
100 print(self.df_dummy["TT4 measured"].value_counts())
101 self.feat_versus_other(feat,self.df_dummy["TT4
measured"],\
102
self.label_tt4,self.widgetPlot3.canvas.axis1,\
label=feat,title='TT4 measured versus ' + feat)
The hist_num_versus_nine_cat() function generates a series of subplots
in the widgetPlot3 canvas to compare a numerical feature (feat)
against nine categorical features. Here's an explanation of the function
code:
1. Creating label lists: Create lists of labels for
each categorical feature based on their
value counts in the df_dummy DataFrame.
2. Clearing the figure: Clear the current figure
in the widgetPlot3 canvas.
3. Adding subplots: Add a subplot for each
categorical feature using the add_subplot()
method. Each subplot is assigned to
self.widgetPlot3.canvas.axis1,
self.widgetPlot3.canvas.axis2, etc.
4. Printing value counts: Print the value
counts of each categorical feature (for
debugging or information purposes).
5. Calling feat_versus_other(): Call the
feat_versus_other function for each
categorical feature to plot the numerical
feature (feat) against the categorical
feature.
6. Setting labels and titles: Set the x-label and
title for each subplot using the categorical
feature and feat.
7. Tight layout and drawing: Adjust the layout
and draw the figure on the widgetPlot3
canvas.
The purpose of this code is to generate multiple subplots to compare
the distribution of a numerical feature with different categories of nine
categorical features. It provides insights into how the numerical
feature varies across different categories of the categorical features.
Step Define prob_num_versus_two_cat() method to plot the density of one

4 numerical feature versus two categorical ones:
1 def prob_num_versus_two_cat(self,feat, feat_cat1,

2 feat_cat2, widget):
3 self.label_feat_cat1 = \
4
5 list(self.df_dummy[feat_cat1].value_counts().index
6 )
7 self.label_feat_cat2 = \
8 list(self.df_dummy[feat_cat2].value_counts().index
)
9
10
widget.canvas.figure.clf()
11
widget.canvas.axis1 = \
12
widget.canvas.figure.add_subplot(211,\
13
14
print(self.df_dummy[feat_cat2].value_counts())
15
self.prob_feat_versus_other(feat,\
16
self.df_dummy[feat_cat2],self.label_feat_cat2,\
17
widget.canvas.axis1,label=feat,\
18
title=feat_cat2 + ' versus ' + feat)
19
20
widget.canvas.axis1 = \
21
widget.canvas.figure.add_subplot(212,\
22
23
24
self.prob_feat_versus_other(feat,\
25
self.df_dummy[feat_cat1],self.label_feat_cat1,\
26
widget.canvas.axis1,label=feat,\
27
title=feat_cat1 + ' versus ' + feat)
28
The prob_num_versus_two_cat() function generates two subplots in the

specified widget canvas to compare the probability distribution of a
numerical feature (feat) across two categorical features (feat_cat1 and
feat_cat2). Here's an explanation of the function code:
1. Creating label lists: Create lists of labels for
each categorical feature based on their
value counts in the df_dummy DataFrame.
2. Clearing the figure: Clear the current figure
in the specified widget canvas.
3. Adding subplots: Add two subplots to the
widget canvas using the add_subplot()
method. The first subplot is assigned to
widget.canvas.axis1 and the second subplot
is assigned to widget.canvas.axis2.
4. Printing value counts: Print the value
counts of the second categorical feature (for
debugging or information purposes).
5. Calling prob_feat_versus_other(): Call the
prob_feat_versus_other() function to plot
the probability distribution of the numerical
feature (feat) across the second categorical
feature.
6. Setting labels and titles: Set the x-label and
title for each subplot using the numerical
feature and the corresponding categorical
feature.
7. Tight layout and drawing: Adjust the layout
and draw the figure on the widget canvas.
The purpose of this code is to visualize and compare the probability
distribution of a numerical feature across two different categories of
two categorical features. It provides insights into how the numerical
feature's distribution varies between the selected categories of the
categorical features.
Step Add this code to the end of choose_plot():

5
1 if strCB == 'age':
2 self.prob_num_versus_two_cat("age","binaryClass"
3 , \
4 "T4U measured" ,self.widgetPlot1)
5 self.hist_num_versus_nine_cat("age")
6 self.prob_num_versus_two_cat("age","FTI
measured", \
7
"T3 measured" ,self.widgetPlot2)
8
9
if strCB == 'TSH':
10
self.prob_num_versus_two_cat("TSH","binaryClass"
11
, \
12
"T4U measured" ,self.widgetPlot1)
13
self.hist_num_versus_nine_cat("TSH")
14
self.prob_num_versus_two_cat("TSH","FTI
15 measured", \
16 "T3 measured" ,self.widgetPlot2)
17
18 if strCB == 'T3':
19
20 self.prob_num_versus_two_cat("T3","binaryClass",
21 \
23 self.hist_num_versus_nine_cat("T3")
24 self.prob_num_versus_two_cat("T3","FTI
measured", \
25
26
27
if strCB == 'TT4':
28
self.prob_num_versus_two_cat("TT4","binaryClass"
29
, \
30
31
self.hist_num_versus_nine_cat("TT4")
32
self.prob_num_versus_two_cat("TT4","FTI
33 measured", \
34 "T3 measured" ,self.widgetPlot2)
35
36 if strCB == 'T4U':
37 self.prob_num_versus_two_cat("T4U","binaryClass"
38 , \
40 self.hist_num_versus_nine_cat("T4U")
41 self.prob_num_versus_two_cat("T4U","FTI
measured", \
if strCB == 'FTI':
self.prob_num_versus_two_cat("FTI","binaryClass"
, \
self.hist_num_versus_nine_cat("FTI")
self.prob_num_versus_two_cat("FTI","FTI
measured", \
This code segment contains conditional statements based on the value

of strCB. Depending on the value of strCB, different functions are
called to generate plots and visualize the data. Here's an explanation of
each condition:
1. if strCB == 'age': If the value of strCB is
'age', the following actions are performed:
prob_num_versus_two_cat: Call the
prob_num_versus_two_cat() function to
plot the probability distribution of the
"age" feature across the "binaryClass"
and "T4U measured" categorical features,
using the widgetPlot1 widget.
hist_num_versus_nine_cat: Call the
hist_num_versus_nine_cat() function to
generate histograms of the "age" feature
against nine different categorical
features.
prob_num_versus_two_cat: Call the
prob_num_versus_two_cat() function to
plot the probability distribution of the
"age" feature across the "FTI measured"
and "T3 measured" categorical features,
using the widgetPlot2 widget.
2. if strCB == 'TSH': If the value of strCB is
'TSH', similar actions are performed as in
the first condition, but with the "TSH"
feature instead of "age".
3. if strCB == 'T3': If the value of strCB is 'T3',
similar actions are performed as in the first
condition, but with the "T3" feature instead
of "age".
4. if strCB == 'TT4': If the value of strCB is
'TT4', similar actions are performed as in
the first condition, but with the "TT4"
5. if strCB == 'T4U': If the value of strCB is
'T4U', similar actions are performed as in
the first condition, but with the "T4U"
6. if strCB == 'FTI': If the value of strCB is
'FTI', similar actions are performed as in
the first condition, but with the "FTI"
Overall, these conditional statements generate various plots and
visualizations based on the selected feature (strCB) to provide insights
into the relationships between the numerical features and different
categorical features in the dataset.
6 buttons. Then, choose age item from cbData widget. You will see the
result as shown in Figure 154.
Figure 154 The distribution of age variable versus other features
7 buttons. Then, choose TSH item from cbData widget. You will see the
Figure 155 The distribution of TSH variable versus other features
8 buttons. Then, choose T3 item from cbData widget. You will see the
Figure 156 The distribution of T3 variable versus other features
9 buttons. Then, choose TT4 item from cbData widget. You will see the
Figure 157 The distribution of TT4 variable versus other features
10 buttons. Then, choose T4U item from cbData widget. You will see the
Figure 158 The distribution of T4U variable versus other features
11 buttons. Then, choose FTI item from cbData widget. You will see the
Figure 159 The distribution of FTI variable versus other features
Correlation Matrix and Feature Importance
Step Define plot_corr() method to plot correlation matrix on a widget:

1
1 def plot_corr(self, data, widget):

2 corrdata = data.corr()
3 sns.heatmap(corrdata, ax =
4 widget.canvas.axis1, \
5 lw=1, annot=True, cmap="Reds")
6
widget.canvas.axis1.set_title('Correlation
7
Matrix', \
8
The plot_corr() function is used to generate a correlation matrix heatmap.

Here's an explanation of the function:
1. corrdata = data.corr(): Calculate the correlation
coefficients between the columns of the
provided data DataFrame using the corr()
function. This will result in a correlation matrix.
2. sns.heatmap(corrdata, ax = widget.canvas.axis1,
lw=1, annot=True, cmap="Reds"): Create a
heatmap using Seaborn's heatmap function.
Pass the correlation matrix (corrdata) as the
data to be visualized. Specify the ax parameter
as widget.canvas.axis1 to indicate the subplot
where the heatmap should be plotted. Set lw=1
to adjust the width of the lines between cells,
annot=True to display the correlation
coefficients on the heatmap, and cmap="Reds"
to use the "Reds" color map for the heatmap.
3. widget.canvas.axis1.set_title('Correlation
Matrix', fontweight="bold", fontsize=20): Set
the title of the plot to "Correlation Matrix" using
the set_title function of the axis object
(widget.canvas.axis1). Additionally, set the font
weight to "bold" and the font size to 20.
4. widget.canvas.figure.tight_layout(): Adjust the
layout of the plot to optimize spacing and
alignment.
5. widget.canvas.draw(): Draw the plot on the
widget canvas to display the correlation matrix
heatmap.
Overall, the plot_corr() function helps visualize the correlations between the
columns of a given dataset using a heatmap, providing insights into the
relationships among the variables.

2
1 if strCB == 'Correlation Matrix':

4 self.widgetPlot3.canvas.figure.add_subplot(111
5 )
6 X,_ = self.fit_dataset(self.df)
self.plot_corr(X, self.widgetPlot3)
If strCB is equal to 'Correlation Matrix', the following code is executed:

1. self.widgetPlot3.canvas.figure.clf(): Clear the
current figure of widgetPlot3 to start with a
clean canvas.
2. self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(111):
Add a single subplot (111) to widgetPlot3 for
displaying the correlation matrix.
3. X,_ = self.fit_dataset(self.df): Preprocess the
dataset self.df using the fit_dataset() method,
which likely performs any necessary data
transformations or feature engineering. The
resulting preprocessed dataset is stored in X,
and the second variable _ is ignored.
4. self.plot_corr(X, self.widgetPlot3): Call the
plot_corr() function with the preprocessed
dataset X and the widgetPlot3 widget as
arguments. This function generates a
correlation matrix heatmap using the
preprocessed dataset and displays it on
widgetPlot3.
Overall, this code segment generates a correlation matrix heatmap for the
preprocessed dataset and displays it on the widgetPlot3 widget, allowing
visual exploration of the correlations between different variables in the
dataset.
Step Run gui_thyroid.py and click LOAD DATA and TRAIN ML MODEL buttons.
3 Then, choose Correlation Matrix item from cbData widget. You will see the
Figure 160 Correlation matrix
Step Define plot_importance() method to plot features importance on a widget:

4
1 def plot_importance(self, widget):

2 #Compares different feature importances
3
4 r =
5 ExtraTreesClassifier(random_state=0)
6 X,y = self.fit_dataset(self.df)
7 r.fit(X, y)
8 feature_importance_normalized = \
9 np.std([tree.feature_importances_ for
tree in \
10
r.estimators_], axis = 0)
11
12
13
sns.barplot(feature_importance_normalized,
14 \
15 X.columns, ax = widget.canvas.axis1)
16
17 widget.canvas.axis1.set_ylabel('Feature
18 Labels',\
20
widget.canvas.axis1.set_xlabel('Features
Importance',\
widget.canvas.axis1.set_title(\
'Comparison of different Features
Importance',\
The plot_importance() method generates a bar plot to compare the

importance of different features in a dataset. Here's a breakdown of the code:
1. r = ExtraTreesClassifier(random_state=0):
Initialize an instance of the ExtraTreesClassifier
algorithm. This algorithm is used to estimate the
feature importances.
2. X, y = self.fit_dataset(self.df): Preprocess the
dataset self.df using the fit_dataset method,
which likely performs any necessary data
transformations or feature engineering. The
preprocessed dataset is stored in X, and the
corresponding target variable is stored in y.
3. r.fit(X, y): Fit the ExtraTreesClassifier model on
the preprocessed dataset X and target variable y
to estimate the feature importances.
4. feature_importance_normalized =
np.std([tree.feature_importances_ for tree in
r.estimators_], axis=0): Calculate the
normalized feature importances by taking the
standard deviation across all decision trees in
the ExtraTreesClassifier model.
5. sns.barplot(feature_importance_normalized,
X.columns, ax=widget.canvas.axis1): Create a
bar plot using Seaborn's barplot function. The
feature importances are plotted on the y-axis,
and the feature labels are plotted on the x-axis.
The plot is drawn on the widget.canvas.axis1
subplot.
6. widget.canvas.axis1.set_ylabel('Feature Labels',
fontweight="bold", fontsize=15): Set the y-axis
label of the plot to "Feature Labels" with a bold
font and font size of 15.
7. widget.canvas.axis1.set_xlabel('Features
Importance', fontweight="bold", fontsize=15):
Set the x-axis label of the plot to "Features
Importance" with a bold font and font size of 15.
8. widget.canvas.axis1.set_title('Comparison of
different Features Importance',
fontweight="bold", fontsize=20): Set the title of
the plot to "Comparison of different Features
Importance" with a bold font and font size of 20.
9. widget.canvas.figure.tight_layout(): Adjust the
layout of the figure to optimize spacing between
subplots and labels.
10. widget.canvas.draw(): Redraw the canvas to
display the updated plot on the widget.
Overall, this method provides a visualization of the relative importance of
different features in the dataset, helping to identify which features have the
most significant impact on the target variable.

5
1 if strCB == 'Features Importance':

4 self.widgetPlot3.canvas.figure.add_subplot(111
5 )
self.plot_importance(self.widgetPlot3)
If strCB is equal to 'Features Importance', the code block is executed. Here's

an explanation of the code:
1. self.widgetPlot3.canvas.figure.clf(): Clear the
current figure in widgetPlot3 to prepare for a
new plot.
2. self.widgetPlot3.canvas.axis1 =
self.widgetPlot3.canvas.figure.add_subplot(111):
Create a new subplot axis1 on widgetPlot3 using
add_subplot(111).
3. self.plot_importance(self.widgetPlot3): Call the
plot_importance method, passing
self.widgetPlot3 as the argument. This method
generates a bar plot comparing the importance
of different features and draws it on
widgetPlot3.
The purpose of this code is to display a plot showing the importance of
different features in the dataset. It clears the existing plot on widgetPlot3,
creates a new subplot, and calls the plot_importance method to generate and
draw the feature importance plot on widgetPlot3.
Step Run gui_thyroid.py and click LOAD DATA and TRAIN ML MODEL buttons.
6 Then, choose Features Importance item from cbData widget. You will see
the result as shown in Figure 161.
Figure 161 The features importance

Helper Functions to Plot Model Performance
Step Define plot_real_pred_val() method to plot true values and

1 predicted values and plot_cm() method to calculate and plot
confusion matrix:
1 def plot_real_pred_val(self, Y_pred, Y_test,

2 widget, title):
3 #Calculate Metrics
4 acc=accuracy_score(Y_test,Y_pred)
5
6 #Output plot
7 widget.canvas.figure.clf()
8 widget.canvas.axis1 = \
9 widget.canvas.figure.add_subplot(111,\
10 facecolor='steelblue')
11
widget.canvas.axis1.scatter(range(len(Y_pred)),\
12
Y_pred,color="yellow",lw=5,label="Predictions")
13
14
widget.canvas.axis1.scatter(range(len(Y_test)),
15 \
16 Y_test,color="red",label="Actual")
17 widget.canvas.axis1.set_title(\
18 "Prediction Values vs Real Values of " + title,
19 \
20 fontsize=10)
21 widget.canvas.axis1.set_xlabel("Accuracy: " +
22 \
23 str(round((acc*100),3)) + "%")
24 widget.canvas.axis1.legend()
25 widget.canvas.axis1.grid(True, alpha=0.75,
lw=1, ls='-.')
26
27
28
def plot_cm(self, Y_pred, Y_test, widget,
29
title):
30
cm=confusion_matrix(Y_test,Y_pred)
31
32
widget.canvas.axis1 =
33 widget.canvas.figure.add_subplot(111)
34 class_label = ["Negative", "Positive"]
35 df_cm = pd.DataFrame(cm, \
36 index=class_label,columns=class_label)
sns.heatmap(df_cm, ax=widget.canvas.axis1, \
annot=True, cmap='plasma',linewidths=2,fmt='d')
widget.canvas.axis1.set_title("Confusion
Matrix of " + \
title, fontsize=10)
widget.canvas.axis1.set_xlabel("Predicted")
widget.canvas.axis1.set_ylabel("True")
The purpose of the code is to generate plots for evaluating

and visualizing the performance of a predictive model.
Specifically:
plot_real_pred_val():
Calculates the accuracy score by
comparing the predicted values
(Y_pred) with the actual values
(Y_test).
Clears the current figure and creates
a new subplot with a steel blue
background color.
Plots the predicted values as yellow
points and the actual values as red
points.
Sets the title of the plot to indicate
the type of values being compared.
Sets the x-axis label to display the
accuracy of the predictions.
Adds a legend to differentiate
between the predicted and actual
values.
Adds gridlines to the plot for better
readability.
plot_cm():
Computes the confusion matrix by
comparing the predicted values
(Y_pred) with the actual values
(Y_test).
Clears the current figure and creates
a new subplot.
Defines the class labels as "Negative"
and "Positive".
Creates a DataFrame (df_cm) to
store the confusion matrix values
with appropriate row and column
labels.
Generates a heatmap plot using
sns.heatmap, where each cell
represents the count of true
positives, true negatives, false
positives, and false negatives.
Sets the title of the plot to indicate
the type of values being analyzed.
Sets the x-axis and y-axis labels to
"Predicted" and "True", respectively.
Adds annotations (count values) to
each cell of the heatmap.
Applies a color map ('plasma') and
adjusts the linewidth of the heatmap
for better visualization.
These functions provide visual representations of the
predicted values compared to the actual values and the
confusion matrix, respectively, helping to evaluate the
performance and accuracy of the model.
Step Define plot_decision() method to plot decision boundaries:

2
1 def plot_decision(self, cla, feat1, feat2,

2 widget, title=""):
4 dataset_dir = curr_path +
"/hypothyroid.csv"
5
6
#Loads csv file
7
df, _ = self.read_dataset(dataset_dir)
8
9
#Plots decision boundary of two features
10
feat_boundary = [feat1, feat2]
11 X_feature = df[feat_boundary]
12 X_train_feature, X_test_feature,
13 y_train_feature, \
14 y_test_feature= train_test_split(X_feature,
\
15
df['binaryClass'], test_size=0.3,
16
random_state = 42)
17
cla.fit(X_train_feature,
18 y_train_feature)
19
20
21 plot_decision_regions(X_test_feature.values,
22 \
23 y_test_feature.ravel(), clf=cla, legend=2,
\
24
ax=widget.canvas.axis1)
widget.canvas.axis1.set_title(title, \
widget.canvas.axis1.set_xlabel(feat1)
widget.canvas.axis1.set_ylabel(feat2)

1. Get the current working directory:
The os.getcwd() function returns
the current working directory and
assigns it to curr_path variable.
2. Define the dataset directory:
The dataset directory is set to the
curr_path concatenated with
"/hypothyroid.csv".
3. Load the CSV file:
The read_dataset() method is
called with the dataset_dir to load
the CSV file.
The loaded dataset is assigned to
df, and the second return value is
ignored.
4. Select the features for decision
boundary plot:
The feat1 and feat2 are the names
of two features used for the
decision boundary plot.
A list feat_boundary is created
with feat1 and feat2.
The DataFrame df is indexed
using feat_boundary to extract the
selected features and assign them
to X_feature.
5. Split the dataset:
The train_test_split() function is
used to split the dataset into
X_feature and df['binaryClass']
are used as input features and
target variable, respectively.
The testing set size is set to 0.3
(30% of the data), and the random
state is set to 42.
The function returns four sets:
X_train_feature, X_test_feature,
y_train_feature, and
y_test_feature.
6. Fit the classifier:
The classifier object cla is trained
using the fit method.
The training features
X_train_feature and labels
y_train_feature are passed as
arguments.
7. Plot the decision boundary:
The plot_decision_regions()
function is used to plot the
decision boundary.
The testing feature values
(X_test_feature.values) and labels
(y_test_feature.ravel()) are passed
as arguments.
The clf parameter specifies the
trained classifier object.
The legend parameter is set to 2
to display the class legend in the
plot.
The ax parameter specifies the
axis on which the decision
boundary plot will be drawn
(widget.canvas.axis1).
8. Set plot attributes:
The plot's title is set to the
provided title.
The x-axis label is set to feat1,
and the y-axis label is set to feat2.
9. Adjust plot layout and draw:
The plot layout is adjusted to
ensure proper spacing using
tight_layout.
The plot is drawn on the canvas of
the widget object.
Overall, the plot_decision() method loads a dataset from a
CSV file, selects two specific features, splits the dataset into
training and testing sets, fits a classifier model on the
training data, and plots the decision boundary of the model
using the selected features. The resulting plot helps visualize
how the classifier separates different classes in the feature
space defined by feat1 and feat2.
Step Define plot_learning_curve() to plot learning curve of any

3 machine learning model:
1 def plot_learning_curve(self,estimator,
2 title, X, y, widget, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0,
3
5)):
4
widget.canvas.axis1.set_title(title)
5
6
widget.canvas.axis1.set_ylim(*ylim)
7
widget.canvas.axis1.set_xlabel("Training
8 examples")
9 widget.canvas.axis1.set_ylabel("Score")
10
11
12 train_sizes, train_scores, test_scores,
13 fit_times, _ = \
14 learning_curve(estimator, X, y,
cv=cv, n_jobs=n_jobs,
15
16
return_times=True)
17
train_scores_mean = np.mean(train_scores,
18 axis=1)
19
train_scores_std = np.std(train_scores,
20 axis=1)
21 test_scores_mean = np.mean(test_scores,
22 axis=1)
23 test_scores_std = np.std(test_scores,
24 axis=1)
25
26 # Plot learning curve
27 widget.canvas.axis1.grid()
28
widget.canvas.axis1.fill_between(train_sizes,
29
\
30
train_scores_mean - train_scores_std,\
31 train_scores_mean + train_scores_std,
alpha=0.1,color="r")
widget.canvas.axis1.fill_between(train_sizes,
\
test_scores_mean - test_scores_std,\
test_scores_mean + test_scores_std,
alpha=0.1, color="g")
widget.canvas.axis1.plot(train_sizes,
train_scores_mean, \
'o-', color="r", label="Training score")
widget.canvas.axis1.plot(train_sizes,
test_scores_mean, \
'o-', color="g", label="Cross-validation
score")
widget.canvas.axis1.legend(loc="best")
The plot_learning_curve() method generates a learning curve

for an estimator (machine learning model) to visualize the
model's performance on training and cross-validation sets as
the training set size increases. Here's a step-by-step
1. Set the title and axis labels:

The method sets the title of the
plot to the provided title
argument.
The x-axis label is set to "Training
examples", and the y-axis label is
set to "Score".
2. Set the y-axis limit (optional):
If the ylim parameter is provided,
the y-axis limit of the plot is set
accordingly.
3. Calculate learning curve data:
The learning_curve() function is
called to calculate the learning
curve data.
The function takes the estimator
(estimator), input features (X),
target variable (y), cross-
validation strategy (cv), number
of parallel jobs (n_jobs), and the
training set sizes (train_sizes).
The function returns the training
set sizes, training scores, test
scores, fit times, and sample
counts.
4. Calculate mean and standard
deviation of scores:
The mean and standard deviation
of the training and test scores are
calculated along the rows (across
different training set sizes).
train_scores_mean and
test_scores_mean store the mean
scores, while train_scores_std and
test_scores_std store the standard
deviations.
5. Plot the learning curve:
The plot's grid is enabled.
The fill_between function is used
to fill the area between the mean
scores plus/minus one standard
deviation for both the training and
test scores.
The training and test scores are
plotted against the training set
sizes using red circles ('o-') for
the training score and green
circles ('o-') for the cross-
validation score.
A legend is added to the plot to
differentiate the training and
cross-validation scores.
Overall, the plot_learning_curve() method helps visualize the
learning curve of a model by plotting the training and cross-
validation scores as a function of the training set size. This
can provide insights into the model's bias and variance, as
well as help determine if the model would benefit from
additional training data.
Step Define plot_scalability_curve() to plot the scalability of the

4 model and plot_performance_curve() method to plot
performance of the model on a widget:
1 def plot_scalability_curve(self,estimator,
2 title, X, y, widget, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0,
3
5)):
4
widget.canvas.axis1.set_title(title, \
5
6
7
8
9 examples")
11
13 fit_times, _ = \
14 learning_curve(estimator, X, y, cv=cv,
15 n_jobs=n_jobs,
16 train_sizes=train_sizes,
return_times=True)
17 fit_times_mean = np.mean(fit_times, axis=1)
18 fit_times_std = np.std(fit_times, axis=1)
19
20 # Plot n_samples vs fit_times
21 widget.canvas.axis1.grid()
22 widget.canvas.axis1.plot(train_sizes,
23 fit_times_mean, 'o-')
24
25 widget.canvas.axis1.fill_between(train_sizes, \
26 fit_times_mean - fit_times_std,\
27 fit_times_mean + fit_times_std,
alpha=0.1)
28
29 examples")
30 widget.canvas.axis1.set_ylabel("fit_times")
31
32 def plot_performance_curve(self,estimator,
33 title, X, y, \
34 widget, ylim=None, cv=None, n_jobs=None, \
35 train_sizes=np.linspace(.1, 1.0, 5)):
36
37 widget.canvas.axis1.set_title(title, \
39 if ylim is not None:
40 widget.canvas.axis1.set_ylim(*ylim)
41 widget.canvas.axis1.set_xlabel("Training
42 examples")
44
46 fit_times, _ = \
47 learning_curve(estimator, X, y, cv=cv,
n_jobs=n_jobs,
48
49
return_times=True)
50
test_scores_mean = np.mean(test_scores,
51 axis=1)
52 test_scores_std = np.std(test_scores,
53 axis=1)
54 fit_times_mean = np.mean(fit_times,
axis=1)

widget.canvas.axis1.grid()
widget.canvas.axis1.plot(fit_times_mean, \
test_scores_mean, 'o-')
widget.canvas.axis1.fill_between(fit_times_mean,
\
test_scores_mean - test_scores_std,\
test_scores_mean + test_scores_std,
alpha=0.1)
widget.canvas.axis1.set_xlabel("fit_times")
widget.canvas.axis1.set_ylabel("Score")
The code provided includes two methods:

plot_scalability_curve() and plot_performance_curve(). Here's
a step-by-step explanation of each method:
plot_scalability_curve():
argument.
set to "Score".
accordingly.
3. Calculate scalability curve data:
called to calculate the scalability
curve data.
counts.
deviation of fit times:
of the fit times are calculated
along the rows (across different
training set sizes).
fit_times_mean stores the mean fit
times, while fit_times_std stores
the standard deviations.
5. Plot the scalability curve:
The fit times are plotted against
the training set sizes using circles
('o-').
The area between the mean fit
times plus/minus one standard
deviation is filled with color.
The x-axis is labeled as "Training
examples", and the y-axis is
labeled as "fit_times".
plot_performance_curve():
argument.
set to "Score".
accordingly.
3. Calculate performance curve data:
called to calculate the
performance curve data.
counts.
deviation of test scores:
of the test scores are calculated
along the rows (across different
training set sizes).
test_scores_mean stores the mean
test scores, while test_scores_std
stores the standard deviations.
5. Calculate mean of fit times:
The mean of the fit times is
calculated along the rows.
6. Plot the performance curve:
The mean test scores are plotted
against the mean fit times using
circles ('o-').
The area between the mean test
scores plus/minus one standard
deviation is filled with color.
The x-axis is labeled as
"fit_times", and the y-axis is
labeled as "Score".
These methods are used to visualize the scalability and
performance of an estimator by plotting the fit times and
scores (such as accuracy) against the number of training
examples. They can provide insights into the efficiency and
effectiveness of a model as the dataset size increases.
Training Model and Predicting Thyroid
Step Define train_model() and predict_model() methods to train

1 any classifier and calculate the prediction:
1 def train_model(self, model, X, y):

2 model.fit(X, y)
3 return model
4
5 def predict_model(self, model, X,
6 proba=False):
7 if ~proba:
8 y_pred = model.predict(X)
9 else:
10 y_pred_proba =
model.predict_proba(X)
11
y_pred =
12
np.argmax(y_pred_proba, axis=1)
return y_pred
The train_model() function is used to train a machine learning

model. It takes as input the model object, the feature matrix
X, and the target variable y. The function fits the model to the
training data by calling the fit method on the model object.
Finally, it returns the trained model object.
The predict_model() function is used to make predictions

using a trained model. It takes as input the model object, the
feature matrix X, and an optional parameter proba which
indicates whether to return class labels or predicted
probabilities. If proba is set to False, the function calls the
predict method on the model object to obtain the predicted
class labels. If proba is set to True, the function calls the
predict_proba method on the model object to obtain the
predicted probabilities for each class, and then selects the
class with the highest probability using np.argmax. Finally,
the function returns the predicted values.
These functions provide a convenient and reusable way to

train models and make predictions, abstracting away the
underlying implementation details and allowing for easy
integration into the machine learning workflow.
Step Define run_model() method to calculate accuracy and

2 precision. It also invokes six methods to plot confusion
matrix, true values versus predicted values diagram, ROC,
learning curve, and decision boundaries:
1 def run_model(self, name, scaling, model,

2 X_train, X_test, y_train, y_test, train=True,
proba=True):
3
if train == True:
4
model = self.train_model(model, X_train,
5
y_train)
6
y_pred = self.predict_model(model, X_test,
7 proba)
8
9 accuracy = accuracy_score(y_test, y_pred)
10 recall = recall_score(y_test, y_pred,
11 average='weighted')
12 precision = precision_score(y_test, y_pred,
13 \
14 average='weighted')
15 f1 = f1_score(y_test, y_pred,
average='weighted')
16
17
18
19
20
print('f1: ', f1)
21
22
23
24
25
self.widgetPlot1.canvas.figure.add_subplot(111,
26
\
27
28
self.plot_cm(y_pred, y_test, self.widgetPlot1,
29 \
30 name + " -- " + scaling)
33
36 self.widgetPlot2.canvas.figure.add_subplot(111,
37 \
39 self.plot_real_pred_val(y_pred, y_test, \
40 self.widgetPlot2, name + " -- " + scaling)
43
47 \
49 self.plot_decision(model, 'TSH', 'FTI', \
50 self.widgetPlot3, title="The decision
boundaries of " + \
51
name + " -- " + scaling)
52
53
54
55
56
\
57
58
self.plot_learning_curve(model, \
59
'Learning Curve' + " -- " + scaling, X_train, \
60
y_train, self.widgetPlot3)
61
62
63
64
65 \
67 self.plot_scalability_curve(model, 'Scalability
68 of ' + \
69 name + " -- " + scaling, X_train, y_train, \
\
self.plot_performance_curve(model, \
'Performance of ' + name + " -- " + scaling, \
X_train, y_train, self.widgetPlot3)
The run_model() function is used to run a machine learning
model and evaluate its performance. It takes several input
parameters, including the name of the model, scaling method,
the model object, the training and test datasets (X_train,
X_test, y_train, y_test), and optional parameters for training
and predicting (train and proba).
The function first checks if train is set to True. If so, it trains

the model by calling the train_model() method with the
training data (X_train, y_train).
Next, it makes predictions on the test data by calling the
predict_model method with the trained model and the test
data (X_test). The proba parameter determines whether to
return class labels or predicted probabilities.
The function then calculates various evaluation metrics such

as accuracy, recall, precision, and F1-score using the
predicted values (y_pred) and the true labels (y_test). It
prints these metrics as well as the classification report.
The function proceeds to create and update plots using three

separate subplots (widgetPlot1, widgetPlot2, and
widgetPlot3). It first clears the existing figures and creates
new subplots with appropriate settings. It calls different plot
functions (plot_cm, plot_real_pred_val(), plot_decision(),
plot_learning_curve(), plot_scalability_curv()e, and
plot_performance_curve()) to generate and display the
corresponding visualizations.
Finally, it adjusts the layout and draws the figures in the

respective widgetPlot widgets.
Overall, the run_model() function provides a comprehensive

analysis of the model's performance and presents it through
various visualizations and metrics.
Step Define build_train_lr() method to build and train Logistic Regression

1 (LR) classifier using three feature scaling: Raw, Normalization, and
Standardization:
1 def build_train_lr(self):
2 if path.isfile('logregRaw.pkl'):
3 #Loads model
4 self.logregRaw =
5 joblib.load('logregRaw.pkl')
6 self.logregNorm =
joblib.load('logregNorm.pkl')
7
self.logregStand =
8
joblib.load('logregStand.pkl')
9
10
if self.rbRaw.isChecked():
11
self.run_model('Logistic
12 Regression', 'Raw', \
13 self.logregRaw, self.X_train_raw,
14 \
15 self.X_test_raw, self.y_train_raw,
16 \
17 self.y_test_raw)
18 if self.rbNorm.isChecked():
19 self.run_model('Logistic
20 Regression', \
21 'Normalization', self.logregNorm,
\
22
self.X_train_norm,
23
self.X_test_norm, \
24
self.y_train_norm,
26
27 if self.rbStand.isChecked():
28 self.run_model('Logistic
29 Regression', \
30 'Standardization',
31 self.logregStand, \
32 self.X_train_stand,
self.X_test_stand, \
33
self.y_train_stand,
34
self.y_test_stand)
35
36
else:
37
#Builds and trains Logistic
38 Regression
39 self.logregRaw =
41 \
42 max_iter=500, random_state=2021)
43 self.logregNorm =
LogisticRegression(solver='lbfgs',
44
\
45
46
47 self.logregStand =
\
49
50
51
52
53
Regression', 'Raw', \
54
self.logregRaw, self.X_train_raw,
55 \
\
self.y_test_raw)
if self.rbNorm.isChecked():
Regression', \
'Normalization', self.logregNorm,
\
self.X_train_norm,
self.X_test_norm, \
self.y_train_norm,
self.y_test_norm)
if self.rbStand.isChecked():
Regression', \
'Standardization',
self.logregStand, \
self.X_train_stand,
self.y_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.logregRaw,
'logregRaw.pkl')
joblib.dump(self.logregNorm,
'logregNorm.pkl')
joblib.dump(self.logregStand,
'logregStand.pkl')
The build_train_lr() function is responsible for building and training

Logistic Regression models. It first checks if the model files
(logregRaw.pkl, logregNorm.pkl, logregStand.pkl) exist. If they do, it
loads the pre-trained models from the files. Otherwise, it proceeds to
build and train the models.
If the "Raw" radio button (self.rbRaw) is selected, it calls the run_model

function with the appropriate arguments to run the Logistic Regression
model using the raw data (self.X_train_raw, self.X_test_raw,
self.y_train_raw, self.y_test_raw).
If the "Normalization" radio button (self.rbNorm) is selected, it calls the

run_model() function with the appropriate arguments to run the Logistic
Regression model using the normalized data (self.X_train_norm,
self.X_test_norm, self.y_train_norm, self.y_test_norm).
If the "Standardization" radio button (self.rbStand) is selected, it calls

the run_model function with the appropriate arguments to run the
Logistic Regression model using the standardized data
(self.X_train_stand, self.X_test_stand, self.y_train_stand,
self.y_test_stand).
After training the models, it saves them as pickle files (logregRaw.pkl,

logregNorm.pkl, logregStand.pkl) using the joblib.dump function.
Overall, the build_train_lr function provides the functionality to build,

train, and save Logistic Regression models based on different data
preprocessing techniques, and it also allows for reusing pre-trained
models if they exist.
Step Define choose_ML_model() method to read the selected item in

2 cbClassifier widget:
1 def choose_ML_model(self):
2 strCB =
3 self.cbClassifier.currentText()
4
5 if strCB == 'Logistic Regression':
self.build_train_lr()
The choose_ML_model() function is responsible for choosing and

running a machine learning model based on the selected option from the
cbClassifier combo box.
If the selected option is 'Logistic Regression', it calls the build_train_lr()

function, which builds and trains the Logistic Regression models based
on different data preprocessing techniques.
This function allows for the selection of different machine learning
models in the future by adding additional conditional statements for
each model option in the combo box. Each conditional statement would
call the corresponding function to build and train the selected model.
Overall, the choose_ML_model() function serves as a central point for

selecting and running different machine learning models based on
user's choice.
Step Connect currentIndexChanged() event of cbClassifier widget with

3 choose_ML_model() method and put it inside __init__() method as
shown in line 12-13:
6 Disease")
11 self.pbTrainML.clicked.connect(self.train_model_ML)
12 self.cbData.currentIndexChanged.connect(self.choose_plot
)
13
self.cbClassifier.currentIndexChanged.connect(\
self.choose_ML_model)
The line of code

self.cbClassifier.currentIndexChanged.connect(self.choose_ML_model)
establishes a connection between the
currentIndexChanged signal of the cbClassifier
combo box and the choose_ML_model() method.
This connection ensures that whenever the selected item in the combo
box changes, the choose_ML_model() method is called. This allows for
dynamic execution of the selected machine learning model based on the
user's choice.
In other words, when the user selects a different option in the combo
box, the choose_ML_model() method will be triggered, and the
corresponding machine learning model will be built and trained
accordingly.
4 buttons. Click on Raw radio button. Then, choose Logistic Regression
item from cbClassifier widget. Then, you will see the result as shown in
Figure 162.
Figure 162 The result using LR model with raw feature scaling
Click on Norm radio button. Then, choose Logistic Regression item

from cbClassifier widget. Then, you will see the result as shown in
Figure 163.
Click on Stand radio button. Then, choose Logistic Regression item

from cbClassifier widget. Then, you will see the result as shown in
Figure 164.
Figure 163 The result using LR model with normalization feature

scaling
Figure 164 The result using LR model with standardization feature
scaling
Step Define build_train_svm() method to build and

1 train Support Vector Machine (SVM) classifier
using three feature scaling: Raw,
Normalization, and Standardization:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31 def build_train_svm(self):
32 if path.isfile('SVMRaw.pkl'):
33 #Loads model
34 self.SVMRaw =
35 joblib.load('SVMRaw.pkl')
36 self.SVMNorm =
joblib.load('SVMNorm.pkl')
37
self.SVMStand =
38
joblib.load('SVMStand.pkl')
39
40
41
self.run_model('Support Vector
42 Machine', 'Raw', \
43 self.SVMRaw, self.X_train_raw,
44 self.X_test_raw, \
45 self.y_train_raw, self.y_test_raw)
47 self.run_model('Support Vector
48 Machine', \
49 'Normalization', self.SVMNorm, \
50 self.X_train_norm, self.X_test_norm, \
self.y_train_norm, self.y_test_norm)
Machine', \
'Standardization', self.SVMStand, \
self.X_train_stand, self.X_test_stand,
\
self.y_train_stand, self.y_test_stand)
else:
#Builds and trains Logistic Regression
self.SVMRaw =
SVC(random_state=2021,probability=True)
self.SVMNorm =
self.SVMStand =
Machine', 'Raw', \
self.SVMRaw, self.X_train_raw,
self.X_test_raw, \
self.y_train_raw, self.y_test_raw)
Machine', \
'Normalization', self.SVMNorm, \
self.X_train_norm, self.X_test_norm, \
Machine', \
'Standardization', self.SVMStand, \
\
#Saves model
joblib.dump(self.SVMRaw,
'SVMRaw.pkl')
joblib.dump(self.SVMNorm,
'SVMNorm.pkl')
joblib.dump(self.SVMStand,
'SVMStand.pkl')
Figure 165 The result using SVM model with raw

feature scaling
Here is a step-by-step explanation of the

build_train_svm() method:
1. Check if the saved SVM
models (SVMRaw.pkl,
SVMNorm.pkl,
SVMStand.pkl) exist.
If the models exist, load
them into the
corresponding variables
(self.SVMRaw,
self.SVMNorm,
self.SVMStand).
If the "Raw" radio button
(rbRaw) is checked,
execute the run_model
method to evaluate the
loaded SVM model on the
raw data (X_train_raw,
X_test_raw, y_train_raw,
y_test_raw).
If the "Normalization"
radio button (rbNorm) is
checked, execute the
run_model method to
evaluate the loaded SVM
model on the normalized
data (X_train_norm,
X_test_norm,
y_train_norm,
y_test_norm).
If the "Standardization"
radio button (rbStand) is
checked, execute the
run_model() method to
evaluate the loaded SVM
model on the standardized
data (X_train_stand,
X_test_stand,
y_train_stand,
y_test_stand).
2. If the saved models do not
exist, create new instances of
SVM models (SVC) with the
specified parameters.
If the "Raw" radio button
is checked, execute the
run_model() method to
train and evaluate the
SVM model on the raw
data.
If the "Normalization"
radio button is checked,
execute the run_model()
method to train and
evaluate the SVM model
on the normalized data.
If the "Standardization"
radio button is checked,
execute the run_model()
method to train and
evaluate the SVM model
on the standardized data.
3. Save the trained SVM models
(self.SVMRaw,
self.SVMNorm,
self.SVMStand) to disk using
the joblib.dump function.
The purpose of this code is to provide functionality
for building and training SVM models based on
different preprocessing techniques (raw,
normalized, standardized). It checks if the models
have been previously saved and loads them if
available, or creates new models if not. The models
are then trained and evaluated using the
run_model() method. Finally, the trained models
are saved for future use.
Step 2 Add this code to the

choose_ML_model() method:
1 if strCB == 'Support Vector

2 Machine':
self.build_train_svm()
Step 3 Run gui_thyroid.py and click LOAD D
TRAIN ML MODEL buttons. Click
radio button. Then, choose Suppor
Machine item from cbClassifier widg
you will see the result as shown in Fig
Figure 166 The result using SVM mo

normalization feature scaling
Figure 167 The result using SVM mo

standardization feature scalin
Click on Norm radio button. Then

Support Vector Machine ite
cbClassifier widget. Then, you wil
Click on Stand radio button. Then

Support Vector Machine ite
cbClassifier widget. Then, you wil

Step Define build_train_knn() method to build and
1 train K-Nearest Neighbor (KNN) classifier
using three feature scaling: Raw,
Normalization, and Standardization:
1 def build_train_knn(self):
2 if path.isfile('KNNRaw.pkl'):
3 #Loads model
4 self.KNNRaw =
5 joblib.load('KNNRaw.pkl')
6 self.KNNNorm =
joblib.load('KNNNorm.pkl')
7
self.KNNStand =
8
joblib.load('KNNStand.pkl')
9
10
11
self.run_model('K-Nearest
12 Neighbor', 'Raw', \
13 self.KNNRaw, self.X_train_raw,
17 self.run_model('K-Nearest
18 Neighbor', \
19 'Normalization', self.KNNNorm, \
20 self.X_train_norm,
21 self.X_test_norm, \
22 self.y_train_norm,
self.y_test_norm)
23
24
25
26
Neighbor', \
27
'Standardization', self.KNNStand,
28 \
30 self.X_test_stand, \
31 self.y_train_stand,
32 self.y_test_stand)
33
34 else:
35 #Builds and trains K-Nearest
Neighbor
36
self.KNNRaw =
37
38 20)
39 self.KNNNorm =
40 KNeighborsClassifier(n_neighbors =
20)
41
self.KNNStand =
42
43 20)
44
45 if self.rbRaw.isChecked():
46 self.run_model('K-Nearest
47 Neighbor', 'Raw', \
48 self.KNNRaw, self.X_train_raw, \
50 \
51 self.y_test_raw)
Neighbor', \
'Normalization', self.KNNNorm, \
self.X_train_norm,
self.X_test_norm, \
self.y_train_norm,
self.y_test_norm)
Neighbor', \
'Standardization', self.KNNStand,
\
self.X_train_stand,
self.y_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.KNNRaw,
'KNNRaw.pkl')
joblib.dump(self.KNNNorm,
'KNNNorm.pkl')
joblib.dump(self.KNNStand,
'KNNStand.pkl')
The build_train_knn method serves the purpose

of building, training, and evaluating K-Nearest
Neighbor (KNN) models for different
preprocessing techniques in a machine learning
application.
The code checks if the saved KNN models exist.
If they do, it loads the models into the
corresponding variables. It then checks which
radio button is selected to determine the
preprocessing technique to use. If the "Raw"
radio button is selected, it evaluates the loaded
KNN model on the raw data. If the
"Normalization" or "Standardization" radio
buttons are selected, it evaluates the loaded
KNN model on the respective preprocessed
data. This allows for reusing trained models
and comparing their performance under
different preprocessing scenarios.
Then, if the saved models do not exist, new

instances of the KNN models are created with
specified parameters. It again checks the
selected radio button to determine the
preprocessing technique, and trains and
evaluates the KNN models accordingly. This
allows for training new models if the saved
models are not available.
After training and evaluating the models, they

are saved using the joblib.dump function. This
enables the trained models to be reused in
future runs of the application, avoiding the
need to retrain the models every time. Saving
the models also allows for sharing and
deploying the trained models for production
use.
Overall, the purpose of the code is to provide a

streamlined workflow for building, training, and
evaluating KNN models with different
preprocessing techniques. It promotes code
reusability by checking for and loading saved
models, and it enables comparison and
selection of the best-performing model based
on different preprocessing strategies.
Additionally, it facilitates model persistence by
saving the trained models for future use.
Step Add this code to the end of

2 choose_ML_model() method:
1 if strCB == 'K-Nearest Neighbor':

2 self.build_train_knn()
Step Run gui_thyroid.py and click LOAD DATA and
3 TRAIN ML MODEL buttons. Click on Raw
radio button. Then, choose K-Nearest
Neighbor item from cbClassifier widget.
Then, you will see the result as shown in Figure
168.
Click on Norm radio button. Then, choose K-

Nearest Neighbor item from cbClassifier
widget. Then, you will see the result as shown
in Figure 169.
Click on Stand radio button. Then, choose K-

Nearest Neighbor item from cbClassifier
in Figure 170.
Figure 168 The result using KNN model with

raw feature scaling

standardization feature scaling
Step Define build_train_dt() method to build and

1 train Decision Tree (DT) classifier using three
feature scaling: Raw, Normalization, and
Standardization:
1 def build_train_dt(self):
2 if path.isfile('DTRaw.pkl'):
3 #Loads model
4 self.DTRaw =
5 joblib.load('DTRaw.pkl')
6 self.DTNorm =
joblib.load('DTNorm.pkl')
7
self.DTStand =
8
joblib.load('DTStand.pkl')
9
10
11
self.run_model('Decision Tree', \
12
'Raw', self.DTRaw,
13 self.X_train_raw, \
15 self.y_test_raw)
17 self.run_model('Decision Tree', \
18 'Normalization', self.DTNorm, \
24 self.run_model('Decision Tree', \
25 'Standardization', self.DTStand, \
self.y_test_stand)
29
30
else:
31
#Builds and trains Decision Tree
32
dt =
33
DecisionTreeClassifier()
34
parameters = {
35 'max_depth':np.arange(1,20,1),\
36 'random_state':[2021]}
37 self.DTRaw = GridSearchCV(dt,
38 parameters)
39 self.DTNorm = GridSearchCV(dt,
40 parameters)
41 self.DTStand = GridSearchCV(dt,
parameters)
42
43
44
45
'Raw', self.DTRaw,
46
self.X_train_raw, \
47
self.X_test_raw, self.y_train_raw,
48 \
49 self.y_test_raw)
50
51
52
'Normalization', self.DTNorm, \
53
self.X_train_norm,
self.y_test_norm)
'Standardization', self.DTStand, \
self.X_train_stand,
self.y_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.DTRaw,
'DTRaw.pkl')
joblib.dump(self.DTNorm,
'DTNorm.pkl')
joblib.dump(self.DTStand,
'DTStand.pkl')
Figure 171 The result using DT model with

raw feature scaling
The build_train_dt method is responsible for

creating, training, and evaluating Decision Tree
models in a machine learning application. The
purpose of this code can be summarized as
follows:
1. Model Loading and
Evaluation: The code
checks if the saved
Decision Tree models exist.
If they do, it loads the
models and evaluates their
performance based on the
selected preprocessing
technique. This allows for
reusing trained models and
comparing their
performance under
different preprocessing
scenarios.
2. Model Building and
Training: If the saved
models do not exist, the
code creates new instances
of the Decision Tree
classifier and performs grid
search to find the best
hyperparameters for each
preprocessing technique. It
then trains the Decision
Tree models using the
training data and evaluates
their performance on the
test data.
3. Model Saving: After
training and evaluation, the
trained Decision Tree
models are saved using the
joblib.dump function. This
enables the models to be
reused in future runs of the
application, eliminating the
need for retraining. Saving
the models also allows for
sharing and deploying them
for production use.
In summary, the purpose of the build_train_dt()
code is to provide a streamlined workflow for
building, training, and evaluating Decision Tree
models with different preprocessing
techniques. It supports model reusability,
hyperparameter tuning, and model persistence
for efficient and effective machine learning
model development.

1 if strCB == 'Decision Tree':

2 self.build_train_dt()
radio button. Then, choose Decision Tree item
from cbClassifier widget. Then, you will see
Click on Norm radio button. Then, choose

Decision Tree item from cbClassifier widget.
172.
Click on Stand radio button. Then, choose

Decision Tree item from cbClassifier widget.
173.


Step Define build_train_rf() method to build and train

1 Random Forest (RF) classifier using three feature
scaling: Raw, Normalization, and Standardization:
1 def build_train_rf(self):
2 if path.isfile('RFRaw.pkl'):
3 #Loads model
4 self.RFRaw = joblib.load('RFRaw.pkl')
5 self.RFNorm =
6 joblib.load('RFNorm.pkl')
7 self.RFStand =
joblib.load('RFStand.pkl')
8
9
10
self.run_model('Random Forest', 'Raw',
11
\
12
self.RFRaw, self.X_train_raw,
15
16
self.run_model('Random Forest', \
17
'Normalization', self.RFNorm, \
18
self.X_train_norm, self.X_test_norm, \
19
20
21
22
self.run_model('Random Forest', \
23
'Standardization', self.RFStand, \
24
25 \
26 self.y_train_stand, self.y_test_stand)
27
28 else:
29 #Builds and trains Random Forest
30 self.RFRaw =
31 RandomForestClassifier(n_estimators=200,
32 \
33 max_depth=20, random_state=2021)
34 self.RFNorm = RandomForestClassifier(\
35 n_estimators=100, max_depth=11,
random_state=2021)
36
self.RFStand =
37
RandomForestClassifier(\
38
n_estimators=100, max_depth=11,
39 random_state=2021)
40
42 self.run_model('Random Forest',
43 'Raw',\
44 self.RFRaw, self.X_train_raw, \
45 self.X_test_raw, self.y_train_raw, \
46 self.y_test_raw)
48 self.run_model('Random Forest', \
49 'Normalization', self.RFNorm, \
50 self.X_train_norm, self.X_test_norm, \
51 self.y_train_norm, self.y_test_norm)
52
54 self.run_model('Random Forest', \
55 'Standardization', self.RFStand, \
\
#Saves model
joblib.dump(self.RFRaw,
'RFRaw.pkl')
joblib.dump(self.RFNorm,
'RFNorm.pkl')
joblib.dump(self.RFStand,
'RFStand.pkl')
The build_train_rf() method is responsible for

creating, training, and evaluating Random Forest
models in a machine learning application. Here is
the purpose of this code:
Evaluation: The code checks if
the saved Random Forest
models exist. If they do, it
loads the models and
evaluates their performance
based on the selected
preprocessing technique. This
allows for reusing trained
models and comparing their
performance under different
preprocessing scenarios.
2. Model Building and Training:
If the saved models do not
exist, the code creates new
instances of the Random
Forest classifier with
predefined hyperparameters.
It then trains the Random
Forest models using the
their performance on the test
data.
3. Model Saving: After training
and evaluation, the trained
Random Forest models are
saved using the joblib.dump
function. This enables the
models to be reused in future
runs of the application,
eliminating the need for
retraining. Saving the models
also allows for sharing and
deploying them for production
use.
In summary, the purpose of the build_train_rf() code
is to provide a streamlined workflow for building,
training, and evaluating Random Forest models with
different preprocessing techniques. It supports
model reusability, predefined hyperparameters, and
model persistence for efficient and effective
machine learning model development.
Step Add this code to the end of choose_ML_model()

2 method:
1 if strCB == 'Random Forest':

2 self.build_train_rf()
3 TRAIN ML MODEL buttons. Click on Raw radio
button. Then, choose Random Forest item from
cbClassifier widget. Then, you will see the result as
shown in Figure 174.
Click on Norm radio button. Then, choose Random

Forest item from cbClassifier widget. Then, you
Click on Stand radio button. Then, choose Random

Forest item from cbClassifier widget. Then, you
Figure 174 The result using RF model with raw

feature scaling
Figure 175 The result using RF model with

Figure 176 The result using RF model with
Step Define build_train_gb() method to build and

1 train Gradient Boosting (GB) classifier using
three feature scaling: Raw, Normalization, and
Standardization:
1 def build_train_gb(self):
2 if path.isfile('GBRaw.pkl'):
3 #Loads model
4 self.GBRaw =
5 joblib.load('GBRaw.pkl')
6 self.GBNorm =
joblib.load('GBNorm.pkl')
7
self.GBStand =
8
joblib.load('GBStand.pkl')
9
10
11
self.run_model('Gradient
12 Boosting', 'Raw', \
13 self.GBRaw, self.X_train_raw, \
15 \
16 self.y_test_raw)
18 self.run_model('Gradient
19 Boosting', \
20 'Normalization', self.GBNorm, \
23
26
Boosting',\
29
'Standardization', self.GBStand, \
30
self.X_train_stand,
34
35 else:
36 #Builds and trains Gradient
37 Boosting
38 self.GBRaw =
39 GradientBoostingClassifier(\
40 n_estimators = 100, max_depth=20,
subsample=0.8, \
41
max_features=0.2,
43 self.GBNorm =
44 GradientBoostingClassifier(\
45 n_estimators = 100, max_depth=20,
46 subsample=0.8,\
47 max_features=0.2,
49 self.GBStand =
GradientBoostingClassifier(\
50
n_estimators = 100, max_depth=20,
51 subsample=0.8,\
52 max_features=0.2,
54
57 Boosting', 'Raw', \
58 self.GBRaw, self.X_train_raw, \
\
self.y_test_raw)
Boosting', \
'Normalization', self.GBNorm, \
self.X_train_norm,
self.X_test_norm, \
self.y_train_norm,
self.y_test_norm)
Boosting', \
'Standardization', self.GBStand, \
self.X_train_stand,
self.y_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.GBRaw,
'GBRaw.pkl')
joblib.dump(self.GBNorm,
'GBNorm.pkl')
joblib.dump(self.GBStand,
'GBStand.pkl')
The build_train_gb() method is responsible for

creating, training, and evaluating Gradient
Boosting models in a machine learning
application. Here is the purpose of this code:
checks if the saved
Gradient Boosting models
exist. If they do, it loads the
models and evaluates their
performance based on the
selected preprocessing
technique. This allows for
reusing trained models and
comparing their
performance under
scenarios.
of the Gradient Boosting
classifier with predefined
hyperparameters. It then
trains the Gradient
Boosting models using the
their performance on the
test data.
training and evaluation, the
trained Gradient Boosting
models are saved using the
joblib.dump function. This
enables the models to be
reused in future runs of the
application, eliminating the
need for retraining. Saving
the models also allows for
sharing and deploying them
for production use.
In summary, the purpose of the build_train_gb()
code is to provide a streamlined workflow for
building, training, and evaluating Gradient
Boosting models with different preprocessing
techniques. It supports model reusability,
predefined hyperparameters, and model
persistence for efficient and effective machine
learning model development.

1 if strCB == 'Gradient Boosting':

2 self.build_train_gb()

radio button. Then, choose Gradient Boosting
item from cbClassifier widget. Then, you will
Figure 177 The result of using GB model with

raw feature scaling

Gradient Boosting item from cbClassifier
in Figure 178.

Gradient Boosting item from cbClassifier
in Figure 179.

Naïve Bayes Classifier
Step Define build_train_nb() method to build and

1 train Naïve Bayes (NB) classifier using three
feature scaling: Raw, Normalization, and
Standardization:
1 def build_train_nb(self):
2 if path.isfile('NBRaw.pkl'):
3 #Loads model
4 self.NBRaw =
5 joblib.load('NBRaw.pkl')
6 self.NBNorm =
joblib.load('NBNorm.pkl')
7
self.NBStand =
8
joblib.load('NBStand.pkl')
9
10
11
self.run_model('Naive Bayes',
12 'Raw', \
13 self.NBRaw, self.X_train_raw, \
15 \
16 self.y_test_raw)
18 self.run_model('Naive Bayes',
19 'Normalization', \
20 self.NBNorm, self.X_train_norm, \
21 self.X_test_norm,
22 self.y_train_norm, \
24
26 self.run_model('Naive Bayes', \
27 'Standardization', self.NBStand, \
29 self.X_test_stand,\
self.y_test_stand)
31
32
else:
33
#Builds and trains Naive Bayes
34
self.NBRaw = GaussianNB()
35
self.NBNorm = GaussianNB()
36
self.NBStand = GaussianNB()
37
38
39
self.run_model('Naive Bayes',
40 'Raw', \
41 self.NBRaw, self.X_train_raw, \
43 \
44 self.y_test_raw)
46
self.run_model('Naive Bayes', \
47 'Normalization', self.NBNorm, \
52
self.run_model('Naive Bayes', \
'Standardization', self.NBStand, \
self.X_train_stand,
self.y_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.NBRaw,
'NBRaw.pkl')
joblib.dump(self.NBNorm,
'NBNorm.pkl')
joblib.dump(self.NBStand,
'NBStand.pkl')
The build_train_nb() code performs the

following steps:
checks if the saved Naive
Bayes models exist. If the
models are found
(path.isfile('NBRaw.pkl') is
True), it loads the models
using the joblib.load
function. This allows for
reusing trained models
without the need for
retraining. Then, it checks
the selected preprocessing
technique
(rbRaw.isChecked(),
rbNorm.isChecked(),
rbStand.isChecked()) to
determine which model to
evaluate. If the "Raw"
technique is selected, it
calls the run_model
function to evaluate the
loaded NBRaw model using
the corresponding raw data
(X_train_raw, X_test_raw,
y_train_raw, y_test_raw).
Similarly, it evaluates the
"Normalization" and
"Standardization" models if
the respective techniques
are selected.
of the GaussianNB
classifier (GaussianNB()),
representing the Naive
Bayes algorithm. It
initializes three separate
instances for each
preprocessing technique:
NBRaw, NBNorm, and
NBStand. These instances
will be used to train the
Naive Bayes models. Then,
based on the selected
preprocessing technique, it
calls the run_model
evaluate the corresponding
Naive Bayes model using
the appropriate
preprocessed data.
training and evaluation, if
the models were newly
built (not loaded from
files), the code saves the
trained models using the
joblib.dump function. It
saves the NBRaw,
NBNorm, and NBStand
models as "NBRaw.pkl",
"NBNorm.pkl", and
"NBStand.pkl",
respectively. This allows for
reusing the trained models
in future runs of the code,
avoiding the need for
retraining.
4. Overall Purpose: The
purpose of this code is to
facilitate the building,
training, and evaluation of
Naive Bayes models with
techniques. It provides
flexibility by allowing the
use of either pre-existing
models (if available) or
building new models. The
code supports different
preprocessing techniques,
such as raw data,
normalization, and
standardization, and
evaluates the models'
performance accordingly.
The ability to load and save
models enhances efficiency
by reusing trained models,
while also enabling model
persistence for future use.
1 if strCB == 'Naive Bayes':

2 self.build_train_nb()

radio button. Then, choose Naive Bayes item
from cbClassifier widget. Then, you will see

Naive Bayes item from cbClassifier widget.
181.
Figure 180 The result using Naïve Bayes
model with raw feature scaling

model with normalization feature scaling

Naive Bayes item from cbClassifier widget.
182.

model with standardization feature scaling
Adaboost Classifier
Step Define build_train_ada() method to build and

1 train Adaboost classifier using three feature
scaling: Raw, Normalization, and
Standardization:
1 def build_train_ada(self):
2 if path.isfile('ADARaw.pkl'):
3 #Loads model
4 self.ADARaw =
5 joblib.load('ADARaw.pkl')
6 self.ADANorm =
joblib.load('ADANorm.pkl')
7
self.ADAStand =
8
joblib.load('ADAStand.pkl')
9
10
11
self.run_model('Adaboost', 'Raw',
12 \
13 self.ADARaw, self.X_train_raw, \
15 \
16 self.y_test_raw)
18 self.run_model('Adaboost',
19 'Normalization',\
20 self.ADANorm, self.X_train_norm,
21 \
self.y_train_norm, \
23
self.y_test_norm)
24
25
26
self.run_model('Adaboost', \
27
'Standardization', self.ADAStand,
28
\
29
self.X_train_stand,
33
34 else:
35 #Builds and trains Adaboost
36 self.ADARaw = AdaBoostClassifier(\
37 n_estimators = 100,
38 learning_rate=0.01)
39 self.ADANorm =
AdaBoostClassifier(\
40
n_estimators = 100,
41
learning_rate=0.01)
42
self.ADAStand =
43 AdaBoostClassifier(\
44 n_estimators = 100,
45 learning_rate=0.01)
46
48 self.run_model('Adaboost', 'Raw',
49 self.ADARaw, \
50 self.X_train_raw, self.X_test_raw,
\
51
52
54 self.run_model('Adaboost',
'Normalization',\
55
self.ADANorm, self.X_train_norm,
\
self.X_test_norm,
self.y_test_norm)
if self.rbStand.isChecked(): \
self.run_model('Adaboost',
'Standardization', \
self.ADAStand, self.X_train_stand,
\
self.X_test_stand,
self.y_train_stand, \
self.y_test_stand)
#Saves model
joblib.dump(self.ADARaw,
'ADARaw.pkl')
joblib.dump(self.ADANorm,
'ADANorm.pkl')
joblib.dump(self.ADAStand,
'ADAStand.pkl')
Figure 183 The result using Adaboost model
The purpose of the build_train_ada() code is as

follows:
Evaluation: The code first
checks if the saved
Adaboost models exist by
using the path.isfile
function. If the models are
found
(path.isfile('ADARaw.pkl')
is True), it loads the models
function. The loaded
models are assigned to the
variables ADARaw,
ADANorm, and ADAStand.
Then, based on the selected
preprocessing technique
(rbRaw.isChecked(),
rbNorm.isChecked(),
rbStand.isChecked()), it
calls the run_model
loaded models. The
evaluation is performed
with the corresponding
raw, normalized, or
standardized data
y_train_raw, y_test_raw,
etc.).
of the AdaBoostClassifier
algorithm. Three instances
are created for each
ADARaw, ADANorm, and
ADAStand. These instances
are used to build and train
the Adaboost models. The
algorithm is configured
with a set number of
estimators and learning
rate (n_estimators = 100,
learning_rate=0.01).
Similar to the evaluation
step, based on the selected
preprocessing technique,
the code calls the
run_model function to train
and evaluate the
corresponding Adaboost
model using the
appropriate preprocessed
data.
saves the ADARaw,
ADANorm, and ADAStand
models as "ADARaw.pkl",
"ADANorm.pkl", and
"ADAStand.pkl",
in future runs of the code
retraining.
Overall, the purpose of this code is to facilitate
the building, training, and evaluation of
Adaboost models with different preprocessing
techniques. It provides the flexibility to either
load pre-existing models or build new models.
The code supports different preprocessing
techniques such as raw data, normalization,
and standardization, and evaluates the models'
performance accordingly. The ability to load
and save models enhances efficiency by reusing
trained models and allows for persistence of
models for future use.

1 if strCB == 'Adaboost':
2 self.build_train_ada()

radio button. Then, choose Adaboost item from
cbClassifier widget. Then, you will see the
with normalization feature scaling

with standardization feature scaling

Adaboost item from cbClassifier widget.
184.

Adaboost item from cbClassifier widget.
185.
Step Define build_train_xgb() method to build and

1 train XGB classifier using three feature scaling:
Raw, Normalization, and Standardization:
1 def build_train_xgb(self):
2 if path.isfile('XGBRaw.pkl'):
3 #Loads model
4 self.XGBRaw =
joblib.load('XGBRaw.pkl')
5 self.XGBNorm =
6 joblib.load('XGBNorm.pkl')
7 self.XGBStand =
joblib.load('XGBStand.pkl')
8
9
10
self.run_model('XGB', 'Raw',
11
self.XGBRaw, \
12
self.X_train_raw, self.X_test_raw,
13 \
15
16
self.run_model('XGB',
17 'Normalization', \
18 self.XGBNorm, self.X_train_norm, \
20 self.y_train_norm, \
22
24 self.run_model('XGB',
25 'Standardization', \
26 self.XGBStand, self.X_train_stand,
\
27
self.X_test_stand,
28
29
self.y_test_stand)
30
31
else:
32
#Builds and trains XGB classifier
33
self.XGBRaw =
34 XGBClassifier(n_estimators = 200, \
35 max_depth=20, random_state=2021, \
36 use_label_encoder=False,
37 eval_metric='mlogloss')
38 self.XGBNorm =
39 XGBClassifier(n_estimators = 200, \
40 max_depth=20, random_state=2021, \
41 use_label_encoder=False,
42
self.XGBStand =
43
XGBClassifier(n_estimators = 200, \
44
max_depth=20, random_state=2021, \
45
use_label_encoder=False,
46 eval_metric='mlogloss')
47
49 self.run_model('XGB', 'Raw',
50 self.XGBRaw, \
51 self.X_train_raw, self.X_test_raw,
\
52
53
55 self.run_model('XGB',
'Normalization', \
56
self.XGBNorm, self.X_train_norm, \
57
self.X_test_norm,
self.y_test_norm)
self.run_model('XGB',
'Standardization', \
self.XGBStand, self.X_train_stand,
\
self.X_test_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.XGBRaw,
'XGBRaw.pkl')
joblib.dump(self.XGBNorm,
'XGBNorm.pkl')
joblib.dump(self.XGBStand,
'XGBStand.pkl')
The purpose of the build_train_xgb() code is as

follows:
checks if the saved
XGBoost models exist by
found
(path.isfile('XGBRaw.pkl')
variables XGBRaw,
XGBNorm, and XGBStand.
(rbRaw.isChecked(),
rbNorm.isChecked(),
calls the run_model
loaded models. The
raw, normalized, or
standardized data
etc.).
of the XGBClassifier
XGBRaw, XGBNorm, and
XGBStand. These instances
the XGBoost models. The
with a set number of
estimators, maximum
depth, random state, and
evaluation metric
(n_estimators=200,
max_depth=20,
random_state=2021,
use_label_encoder=False,
eval_metric='mlogloss').
the code calls the
and evaluate the
corresponding XGBoost
model using the
data.
saves the XGBRaw,
XGBNorm, and XGBStand
models as "XGBRaw.pkl",
"XGBNorm.pkl", and
"XGBStand.pkl",
retraining.
XGBoost models with different preprocessing

1 if strCB == 'XGB Classifier':

2 self.build_train_xgb()

radio button. Then, choose XGB Classifier
Figure 186 The result using XGB model with

raw feature scaling
Click on Norm radio button. Then, choose XGB

Classifier item from cbClassifier widget.
187.
Click on Stand radio button. Then, choose XGB

188.

Step Define build_train_lgbm() method to build

1 and train LGBM classifier using three feature
Standardization:
1 def build_train_lgbm(self):
2 if path.isfile('LGBMRaw.pkl'):
3 #Loads model
4 self.LGBMRaw =
5 joblib.load('LGBMRaw.pkl')
6 self.LGBMNorm =
joblib.load('LGBMNorm.pkl')
7
self.LGBMStand =
8
joblib.load('LGBMStand.pkl')
9
10
11
12 self.run_model('LGBM Classifier',
13 'Raw', \
14 self.LGBMRaw, self.X_train_raw, \
15 self.X_test_raw,
self.y_train_raw,\
16
self.y_test_raw)
17
\
20
'Normalization', self.LGBMNorm, \
21
self.X_train_norm,
22
self.X_test_norm, \
23
self.y_train_norm,
25
28 \
29 'Standardization', self.LGBMStand,
30 \
32
self.y_train_stand,
33
self.y_test_stand)
34
35
else:
36
#Builds and trains LGBMClassifier
37 classifier
38 self.LGBMRaw =
39 LGBMClassifier(max_depth = 20, \
40 n_estimators=500, subsample=0.8,
42 self.LGBMNorm =
LGBMClassifier(max_depth = 20, \
43
n_estimators=500, subsample=0.8,
44
random_state=2021)
45
self.LGBMStand =
46 LGBMClassifier(max_depth = 20, \
47 n_estimators=500, subsample=0.8,
49
52 'Raw', \
53 self.LGBMRaw, self.X_train_raw, \
54
56 \
self.y_test_raw)
self.run_model('LGBM Classifier',
\
'Normalization', self.LGBMNorm, \
self.X_train_norm,
self.X_test_norm, \
self.y_train_norm,
self.y_test_norm)
self.run_model('LGBM Classifier',
\
'Standardization', self.LGBMStand,
\
self.X_train_stand,
self.y_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.LGBMRaw,
'LGBMRaw.pkl')
joblib.dump(self.LGBMNorm,
'LGBMNorm.pkl')
joblib.dump(self.LGBMStand,
'LGBMStand.pkl')
The purpose of the build_train_lgbm() code is

as follows:
checks if the saved
LightGBM models exist by
found
(path.isfile('LGBMRaw.pkl')
variables LGBMRaw,
LGBMNorm, and
LGBMStand. Then, based
on the selected
(rbRaw.isChecked(),
rbNorm.isChecked(),
calls the run_model
loaded models. The
raw, normalized, or
standardized data
etc.).
of the LGBMClassifier
LGBMRaw, LGBMNorm,
and LGBMStand. These
instances are used to build
and train the LightGBM
models. The algorithm is
configured with parameters
such as maximum depth,
number of estimators,
subsample ratio, and
random state
(max_depth=20,
n_estimators=500,
subsample=0.8,
random_state=2021).
the code calls the
and evaluate the
corresponding LightGBM
model using the
data.
saves the LGBMRaw,
LGBMNorm, and
LGBMStand models as
"LGBMRaw.pkl",
"LGBMNorm.pkl", and
"LGBMStand.pkl",
retraining.
LightGBM models with different preprocessing

1 if strCB == 'LGBM Classifier':

2 self.build_train_lgbm()

radio button. Then, choose LGBM Classifier
Figure 189 The result using LGBM model with

raw feature scaling

LGBM Classifier item from cbClassifier
in Figure 190.

LGBM Classifier item from cbClassifier
in Figure 191.

Step Define build_train_mlp() method to build and

1 train LGBM classifier using three feature
Standardization:
1 def build_train_mlp(self):
2 if path.isfile('MLPRaw.pkl'):
3 #Loads model
4 self.MLPRaw =
5 joblib.load('MLPRaw.pkl')
6 self.MLPNorm =
joblib.load('MLPNorm.pkl')
7
self.MLPStand =
8
joblib.load('MLPStand.pkl')
9
10
11
self.run_model('MLP Classifier',
12 'Raw', \
13 self.MLPRaw, self.X_train_raw, \
14
16 \
17 self.y_test_raw)
19 self.run_model('MLP Classifier', \
20 'Normalization', self.MLPNorm, \
self.y_test_norm)
24
25
26
self.run_model('MLP Classifier', \
27
'Standardization', self.MLPStand,
28
\
29
self.X_train_stand,
33 else:
34 #Builds and trains MLP classifier
35 self.MLPRaw =
36 MLPClassifier(random_state=2021)
37 self.MLPNorm =
MLPClassifier(random_state=2021)
38
self.MLPStand =
39
MLPClassifier(random_state=2021)
40
41
42
self.run_model('MLP Classifier',
43 'Raw', \
44 self.MLPRaw, self.X_train_raw,
48 self.run_model('MLP Classifier', \
49 'Normalization', self.MLPNorm, \
self.y_train_norm,
self.y_test_norm)
self.run_model('MLP Classifier', \
'Standardization', self.MLPStand,
\
self.X_train_stand,
self.y_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.MLPRaw,
'MLPRaw.pkl')
joblib.dump(self.MLPNorm,
'MLPNorm.pkl')
joblib.dump(self.MLPStand,
'MLPStand.pkl')
Figure 192 The result of using MLP model

The purpose of the build_train_mlp() code is as

follows:
checks if the saved MLP
(Multi-Layer Perceptron)
models exist by using the
path.isfile function. If the
models are found
(path.isfile('MLPRaw.pkl')
variables MLPRaw,
MLPNorm, and MLPStand.
(rbRaw.isChecked(),
rbNorm.isChecked(),
calls the run_model
loaded models. The
raw, normalized, or
standardized data
etc.).
of the MLPClassifier
MLPRaw, MLPNorm, and
MLPStand. These instances
the MLP models. The
with the random_state
parameter set to 2021 to
ensure reproducibility.
the code calls the
and evaluate the
corresponding MLP model
using the appropriate
preprocessed data.
saves the MLPRaw,
MLPNorm, and MLPStand
models as "MLPRaw.pkl",
"MLPNorm.pkl", and
"MLPStand.pkl",
retraining.
the building, training, and evaluation of MLP
models with different preprocessing

1 if strCB == 'MLP Classifier':

2 self.build_train_mlp()
radio button. Then, choose MLP Classifier

with normalization feature scaling

with standardization feature scaling
Click on Norm radio button. Then, choose MLP

193.
Click on Stand radio button. Then, choose MLP

194.
ANN Classifier
Step Define train_test_ANN() to split dataset into train and test data for
1 ANN:
1 def train_test_DL(self):
2 X, Y = self.fit_dataset(self.df)
3
4 #Splits dataframe into X_train,
5 X_test, y_train and y_test
6 X_train, X_test, y_train_DL,
y_test_DL = \
7
train_test_split(X, Y, test_size =
8
0.2, \
9
random_state = 2021)
10
11
#Splits dataframe into X_train,
12 X_test, y_train and y_test
13 X_train, X_test,
14 self.y_train_DL, self.y_test_DL =
15 \
16 train_test_split(X, Y, test_size =
0.2, \
17
random_state = 2021)
18
19
#Deletes any outliers in the data
20
using StandardScaler from SKLearn
21
22
self.X_train_DL =
23 sc.fit_transform(X_train)
self.X_test_DL =
sc.transform(X_test)
#Saves data
np.save('X_train_DL.npy',
X_train_DL)
np.save('X_test_DL.npy',
X_test_DL)
np.save('y_train_DL.npy',
y_train_DL)
np.save('y_test_DL.npy',
y_test_DL)
The purpose of the train_test_DL() code is as follows:

1. Dataset Preparation: The code prepares the
dataset for training and testing a deep
learning model. It calls the fit_dataset
function with the dataset self.df to obtain the
features X and the target variable Y.
2. Train-Test Split: The code performs a train-
test split on the dataset. It splits the features
(X) and the target variable (Y) into separate
sets for training and testing using the
train_test_split function from scikit-learn.
The split is performed with a test size of 0.2
(20% of the data) and a random state of
2021, ensuring consistent splits across
different runs.
3. Data Standardization: The code applies
standardization to the feature data (X) using
the StandardScaler from scikit-learn. It
creates an instance of StandardScaler
named sc and applies it to the training data
(X_train) using the fit_transform method.
The same scaling parameters obtained from
the training data are then used to transform
the test data (X_test) using the transform
method. This ensures that the feature data is
standardized to have zero mean and unit
variance, which can improve the
performance and convergence of the deep
learning model.
4. Data Saving: The code saves the
preprocessed training and testing data
(X_train_DL, X_test_DL, y_train_DL,
y_test_DL) as numpy arrays using the
np.save function. It saves the arrays as
"X_train_DL.npy", "X_test_DL.npy",
"y_train_DL.npy", and "y_test_DL.npy",
respectively. Saving the data allows for easy
access and loading of the preprocessed data
during the training and evaluation of the
deep learning model.
Overall, the purpose of this code is to prepare the dataset for deep
learning model training and testing. It performs a train-test split, applies
standardization to the feature data, and saves the preprocessed data for
future use. By performing these steps, the code sets up the necessary
data inputs for training and evaluating deep learning models using the
preprocessed dataset.
Step Define build_ANN() method to build ANN:

2
1 def build_ANN(self,X_train, y_train,

2 NBATCH, NEPOCH):
3 #Imports Tensorflow and create a
Sequential Model to add layer for the
4
ANN
5
6
7
#Input layer
8
10 input_dim=27,
11
13 activation='relu'))
14
15 ann.add(tf.keras.layers.Dropout(0.5))
16
17 #Hidden layer 1
18
19
20
21
activation='relu'))
22
23 ann.add(tf.keras.layers.Dropout(0.5))
24
25 #Output layer
26
28
30 activation='sigmoid'))
31
32 print(ann.summary()) #for showing the
structure and parameters
33
34
#Compiles the ANN using ADAM
35
optimizer.
36
ann.compile(optimizer = 'adam', \
37 loss = 'binary_crossentropy', metrics
= ['accuracy'])

history = ann.fit(X_train, y_train,
batch_size = 64, \
validation_split=0.20, epochs = 250,
shuffle=True)
#Saves model

history.history)
The purpose of the build_ANN() code is to build and train an Artificial

Neural Network (ANN) model using the provided training data. Here is
an explanation of each step:
1. Create Sequential Model: The code imports
TensorFlow library and creates a sequential
model ann using
tf.keras.models.Sequential(). The sequential
model allows us to add layers to the ANN in
a sequential manner.
2. Define Layers: The code adds layers to the
ANN model. It starts with an input layer
(tf.keras.layers.Dense) with 100 units, using
the 'relu' activation function. It also includes
a dropout layer (tf.keras.layers.Dropout)
with a dropout rate of 0.5 to prevent
overfitting. Next, a hidden layer with 20
units and 'relu' activation function is added,
followed by another dropout layer. Finally,
an output layer with a single unit and
'sigmoid' activation function is added.
3. Model Summary: The code prints a summary
of the ANN model (ann.summary()) to
display the structure and the number of
parameters in each layer.
4. Compile the Model: The code compiles the
ANN model using the Adam optimizer
(ann.compile(optimizer='adam')) and
specifies the loss function as binary cross-
entropy (loss='binary_crossentropy').
Additionally, the code specifies that the
accuracy metric should be calculated during
training (metrics=['accuracy']).
5. Train the Model: The code trains the ANN
model using the provided training data
(ann.fit(X_train, y_train)). It specifies the
batch size as 64, the validation split as 0.20
(20% of the training data is used for
validation), and the number of epochs as
250. The model is trained using the Adam
optimizer and the specified loss function.
6. Save the Model: The code saves the trained
ANN model as 'thyroid_model.h5' using
ann.save('thyroid_model.h5'). This allows the
model to be loaded and used for future
predictions.
7. Save the Training History: The code saves
the training history into an npy file named
'thyroid_history.npy' using
history.history). This includes the loss and
accuracy values recorded during training,
which can be useful for analyzing the
model's performance over epochs.
Overall, the build_ANN() code constructs and trains an ANN model for
binary classification, saves the trained model, and stores the training
history for later analysis.
Step Define train_ANN() method to execute the training:

3
1 def train_ANN(self):
2 if path.isfile('X_train_DL.npy'):
3 #Loads files
4 self.X_train_DL = \
5 np.load('X_train_DL.npy',allow_pickle=True
6 )
7 self.X_test_DL = \
8 np.load('X_test_DL.npy',allow_pickle=True)
9 self.y_train_DL = \
10 np.load('y_train_DL.npy',allow_pickle=True
)
11
self.y_test_DL = \
12
np.load('y_test_DL.npy',allow_pickle=True)
13
14
else:
15
self.train_test_ANN()
16
#Loads files
17
self.X_train_DL = \
18
np.load('X_train_DL.npy',allow_pickle=True
19
)
20
self.X_test_DL = \
21
np.load('X_test_DL.npy',allow_pickle=True)
22
self.y_train_DL = \
23
np.load('y_train_DL.npy',allow_pickle=True
24 )
25 self.y_test_DL = \
26 np.load('y_test_DL.npy',allow_pickle=True)
27
28 if path.isfile('thyroid_model.h5') ==
29 False:
30 self.build_ANN(self.X_train_DL,
31 self.y_train_DL, 32, 100)
32
#Turns on cbPredictionDL
self.cbPredictionDL.setEnabled(True)
#Turns off pbTrainDL

self.pbTrainDL.setEnabled(False)
The purpose of the train_ANN() code is to train the Artificial Neural

Network (ANN) model using the provided training data. Here is an
explanation of each step:
1. Check if Data Files Exist: The code checks if
the data files (X_train_DL.npy,
X_test_DL.npy, y_train_DL.npy,
y_test_DL.npy) exist using path.isfile(). If the
files exist, it loads the data into the
corresponding variables (self.X_train_DL,
self.X_test_DL, self.y_train_DL,
self.y_test_DL).
2. Data Preparation: If the data files do not
exist, the code calls the train_test_ANN
function (not provided) to generate and save
the training and testing data files.
3. Check if Model File Exists: The code checks
if the ANN model file (thyroid_model.h5)
exists using path.isfile(). If the file does not
exist, it calls the build_ANN function (not
provided) to build and train the ANN model
using the loaded training data
(self.X_train_DL, self.y_train_DL).
4. Enable Prediction Checkbox: The code
enables the prediction checkbox
(cbPredictionDL) to allow users to make
predictions using the trained ANN model.
5. Disable Train Button: The code disables the
train button (pbTrainDL) to prevent
duplicate training of the ANN model.
Overall, the train_ANN() code checks if the necessary data and model
files exist, loads the data and trains the ANN model if needed, and
updates the GUI elements to reflect the training status.
Step Define plot_loss_acc() method to plot loss and accuracy of the model:
4
1 def plot_loss_acc(self,train_loss, val_loss, train_acc,

2 val_acc, widget,strPlot):
3 widget.canvas.figure.clf()
5 widget.canvas.figure.add_subplot(211,facecolor='#fbe7dd'
)
6
widget.canvas.axis1.plot(train_loss, \
7
label='Training Loss',color='blue',
8
linewidth=3.0)
9
widget.canvas.axis1.plot(val_loss, 'b--', \
10
label='Validation Loss',color='red', linewidth=3.0)
11
widget.canvas.axis1.set_title('Loss',\
13 widget.canvas.axis1.set_xlabel('Epoch')
14 widget.canvas.axis1.grid(True, alpha=0.75, lw=1,
15 ls='-.')
16 widget.canvas.axis1.legend()
17
19 widget.canvas.figure.add_subplot(212,facecolor='#fbe7dd'
)
20
widget.canvas.axis1.plot(train_acc, \
21
label='Training Accuracy',color='blue',
22
linewidth=3.0)
23
widget.canvas.axis1.plot(val_acc, 'b--', \
24
label='Validation Accuracy',color='red', linewidth=3.0)
25
widget.canvas.axis1.set_title('Accuracy',\
26
27
widget.canvas.axis1.set_xlabel('Epoch')
widget.canvas.axis1.grid(True, alpha=0.75, lw=1,
ls='-.')
widget.canvas.axis1.legend()
The purpose of the plot_loss_acc() function is to plot the training and

validation loss as well as the training and validation accuracy of a
model. Here is an explanation of each step:
1. Clear Figure: The function clears the
existing figure in the canvas
(widget.canvas.figure.clf()) to start with a
clean plot.
2. Plot Training and Validation Loss: The
function creates a subplot
(widget.canvas.axis1) in the figure with a
facecolor of '#fbe7dd'. It plots the training
loss (train_loss) as a solid blue line with a
label 'Training Loss' and a linewidth of 3.0. It
also plots the validation loss (val_loss) as a
dashed red line with a label 'Validation Loss'
and a linewidth of 3.0. The subplot is titled
'Loss' and the x-axis is labeled as 'Epoch'.
Gridlines are added with an alpha of 0.75,
line width of 1, and a linestyle of '-.'. The
legend is displayed.
3. Plot Training and Validation Accuracy: The
function creates another subplot
(widget.canvas.axis1) in the figure with the
same facecolor. It plots the training accuracy
(train_acc) as a solid blue line with a label
'Training Accuracy' and a linewidth of 3.0. It
also plots the validation accuracy (val_acc)
as a dashed red line with a label 'Validation
Accuracy' and a linewidth of 3.0. The subplot
is titled 'Accuracy' and the x-axis is labeled
as 'Epoch'. Gridlines are added with the
same parameters as in the previous subplot.
The legend is displayed.
4. Draw the Plot: Finally, the
widget.canvas.draw() function is called to
display the updated plot on the canvas.
Overall, the plot_loss_acc() function takes the loss and accuracy values
for both training and validation data, creates a figure with two subplots,
and plots the loss and accuracy curves accordingly. The resulting plot
provides visual insights into the model's performance during training.
Step Define pred_ann() to calculate the predicted values:

5
1 def pred_ann(self, xtest, ytest):

2 self.train_ANN()
3 self.ann =
4 load_model('thyroid_model.h5')
5
6 prediction =
self.ann.predict(xtest)
7
label = [int(p>=0.5) for p in
8
prediction]
9
10
print("test_target:", ytest)
11
print("pred_val:", label)
12
13
#Performance Evaluation -
14 Accuracy,
15
16 #Classification Report &
17 Confusion Matrix
18 #Accuracy Score
19 print ('Accuracy Score : ', \
20 accuracy_score(label, ytest),
'\n')
21

print ('Classification Report
:\n\n' ,\
classification_report(label,
ytest))
return label
The purpose of the pred_ann() function is to make predictions using the

trained ANN model and evaluate its performance. Here is an
explanation of each step:
1. Train the ANN: The function calls the
train_ANN method to ensure that the ANN
model is trained and available for making
predictions.
2. Load the ANN model: The function loads the
trained ANN model from the saved file
'thyroid_model.h5' using the load_model
function.
3. Make predictions: The function uses the
loaded ANN model to make predictions on
the input test data (xtest). The predictions
are obtained by calling
self.ann.predict(xtest).
4. Convert predictions to labels: The function
converts the prediction probabilities into
labels by thresholding at 0.5. If the predicted
probability is greater than or equal to 0.5, it
is assigned the label 1; otherwise, it is
assigned the label 0. The labels are stored in
the label variable.
5. Print test targets and predicted values: The
function prints the actual target values
(ytest) and the predicted values (label) for
comparison and evaluation.
6. Evaluate performance: The function
performs performance evaluation by
calculating the accuracy score using
accuracy_score(label, ytest). It then prints
the accuracy score.
7. Generate classification report: The function
generates a classification report using
classification_report(label, ytest) and prints
it. The classification report provides
precision, recall, F1-score, and support for
each class.
8. Return predicted labels: Finally, the function
returns the predicted labels (label) for
further analysis or processing.
Overall, the pred_ann() function loads the trained ANN model, makes
predictions on the test data, evaluates the performance of the model,
and returns the predicted labels. It provides insights into the accuracy
and classification performance of the ANN model on the test data.
Step Define choose_prediction_ANN() to read selected item from

6 cbPredictionDL widget:
1
2
3
4
5
6
7
8
9
10
11
12 def choose_prediction_ANN(self):
13 strCB = self.cbPredictionDL.currentText()
14
15 if strCB == 'CNN 1D':
16 pred_val = self.pred_ann(self.X_test_DL,
17 self.y_test_DL)
18
19 #Plots true values versus predicted values
\
23
24
self.plot_real_pred_val(pred_val,
25
self.y_test_DL, \
26
self.widgetPlot2, 'CNN 1D')
27
28
29
30
#Plot confusion matrix
31
32
33
34 \
self.plot_cm(pred_val, self.y_test_DL, \
self.widgetPlot1, 'CNN 1D')
#Loads history
history =
np.load('thyroid_history.npy',\
allow_pickle=True).item()
train_loss = history['loss']
train_acc = history['accuracy']
val_acc = history['val_accuracy']
val_loss = history['val_loss']
self.plot_loss_acc(train_loss, val_loss,
train_acc, \
val_acc, self.widgetPlot3, "History of " +
strCB)
The purpose of the choose_prediction_ANN() function is to
select a specific prediction method for the ANN model and
visualize the results. Here is an explanation of each step:
1. Retrieve the selected prediction
method: The function obtains the
selected prediction method from the
combo box
(self.cbPredictionDL.currentText())
and stores it in the variable strCB.
2. Handle 'CNN 1D' prediction method:
If the selected prediction method is
'CNN 1D', the function calls the
pred_ann() method with the test data
(self.X_test_DL and self.y_test_DL) to
obtain the predicted values
(pred_val) using the ANN model.
3. Plot true values versus predicted
values: The function clears the figure
of self.widgetPlot2 and adds a new
subplot with a light background
color. It calls the plot_real_pred_val()
method to plot the true values versus
the predicted values (pred_val and
self.y_test_DL) on the
self.widgetPlot2 widget. This plot
visualizes the comparison between
the actual and predicted values.
4. Plot confusion matrix: The function
clears the figure of self.widgetPlot1
and adds a new subplot with a light
background color. It calls the plot_cm
method to plot the confusion matrix
based on the predicted values
(pred_val) and the actual values
(self.y_test_DL) on the
self.widgetPlot1 widget. The
confusion matrix provides insights
into the performance of the model in
terms of true positives, true
negatives, false positives, and false
negatives.
5. Load and plot history: The function
loads the history of the ANN model
from the file 'thyroid_history.npy'. It
retrieves the training loss, validation
loss, training accuracy, and
validation accuracy from the history
object. It then calls the plot_loss_acc
method to plot the training and
validation loss as well as the training
and validation accuracy on the
self.widgetPlot3 widget. This plot
visualizes the training progress and
model performance over epochs.
Overall, the choose_prediction_ANN() function allows the
user to select a prediction method for the ANN model,
generates the corresponding predictions, and visualizes the
results through plots of true values versus predicted values,
confusion matrix, and the training history of the model.
Step 7 Connect clicked() event

method as shown in line
cbPredictionDL widget w
shown in line 15-16:
2 QMainWindow.__in
3 loadUi("gui_thy
4 self.setWindowTitle
5 "GUI Demo of Classi
6 Disease")
7 self.addToolBar(Nav
8 self.widgetPlot1.ca
9 self.pbLoad.clicked
10 self.initial_state
11 self.pbTrainML.clic
12 self.cbData.current
)
13
self.cbClassifier.c
14
15 self.choose_ML_mode
16 self.pbTrainDL.clic
self.cbPredictionDL
self.choose_predict
The code sets up a connect

of the cbPredictionDL com
method. This connection en
combo box changes, the ch
to handle the selection
accordingly.
By connecting the
choose_prediction_ANN() m
the combo box wil
choose_prediction_ANN() m
the visualizations based on
Overall, this code establish

to changes in the selecte
appropriate actions are take
Figure 195 T
Step 8 Run gui_thyroid.py and c

buttons. Then, choose CN
Then, you will see the resul
Following is the full version of gui_thyroid.py:
#gui_thyroid.py
from PyQt5.QtWidgets import *
from PyQt5.uic import loadUi
from matplotlib.backends.backend_qt5agg import (NavigationToolbar2QT as Navig
from matplotlib.colors import ListedColormap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import mglearn
warnings.filterwarnings('ignore')
import os
import joblib
from numpy import save
from numpy import load
from os import path
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.model_selection import train_test_split, RandomizedSearchCV, Gri
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, p
from sklearn.metrics import classification_report, f1_score, plot_confusion_m
from catboost import CatBoostClassifier
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
from sklearn.base import clone
from sklearn.decomposition import PCA
from tensorflow.keras.models import Sequential, Model, load_model
class DemoGUI_Thyroid(QMainWindow):
def __init__(self):
QMainWindow.__init__(self)
loadUi("gui_thyroid.ui",self)
self.setWindowTitle(\
"GUI Demo of Classifying and Predicting Thyroid Disease")
self.addToolBar(NavigationToolbar(\
self.widgetPlot1.canvas, self))
self.pbLoad.clicked.connect(self.import_dataset)
self.initial_state(False)
self.pbTrainML.clicked.connect(self.train_model_ML)
self.cbData.currentIndexChanged.connect(self.choose_plot)
self.cbClassifier.currentIndexChanged.connect(self.choose_ML_model)
self.pbTrainDL.clicked.connect(self.train_ANN)
self.cbPredictionDL.currentIndexChanged.connect(self.choose_prediction_ANN)
# Takes a df and writes it to a qtable provided. df headers become qtable he

@staticmethod
def write_df_to_qtable(df,table):
headers = list(df)
table.setRowCount(df.shape[0])
table.setColumnCount(df.shape[1])
table.setHorizontalHeaderLabels(headers)
# getting data from df is computationally costly so convert it to array firs

df_array = df.values
for row in range(df.shape[0]):
for col in range(df.shape[1]):
table.setItem(row, col, QTableWidgetItem(str(df_array[row,col
def populate_table(self,data, table):

#Populates two tables
self.write_df_to_qtable(data,table)
table.setAlternatingRowColors(True)
table.setStyleSheet("alternate-background-color: #ffbacd;background-c
def initial_state(self, state):

self.pbTrainML.setEnabled(state)
self.cbData.setEnabled(state)
self.cbClassifier.setEnabled(state)
self.cbPredictionML.setEnabled(state)
self.cbPredictionDL.setEnabled(state)
self.pbTrainDL.setEnabled(state)
self.rbRaw.setEnabled(state)
self.rbNorm.setEnabled(state)
self.rbStand.setEnabled(state)
def read_dataset(self, dir):

#Loads csv file
df = pd.read_csv(dir)
#Replaces ? with NAN


df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

df.drop(['TBG','referral source'],axis=1,inplace=True)

self.mode_imputation(df, col)
#Cleans age column

#Creates dummy dataset for visualization

df_dummy=df.copy()

df['binaryClass'] = df['binaryClass'].apply(lambda x: self.map_binary

df['sex'] = df['sex'].apply(lambda x: self.map_sex(x))

return df, df_dummy
#Uses mode imputation for all other categorical features

def mode_imputation(self, df, feature):

def map_binaryClass(self, n):
if n == "N":
return 0
else:
return 1

def map_sex(self, n):
if n == "F":
return 0
else:
return 1
def import_dataset(self):
dataset_dir = curr_path + "/hypothyroid.csv"
#Loads csv file

self.df, self.df_dummy = self.read_dataset(dataset_dir)
#Populates tables with data

self.populate_table(self.df, self.twData1)
self.label1.setText('Thyroid Disease Data')
self.populate_table(self.df.describe(), self.twData2)
self.twData2.setVerticalHeaderLabels(['Count', 'Mean', 'Std', 'Min', '25%',
self.label2.setText('Data Desciption')
#Turns on pbTrainML widget

self.pbTrainML.setEnabled(True)
self.pbTrainDL.setEnabled(True)
#Turns off pbLoad

self.pbLoad.setEnabled(False)
#Populates cbData
self.populate_cbData()
def populate_cbData(self):
self.cbData.addItems(self.df)
self.cbData.addItems(["Features Importance"])
self.cbData.addItems(["Correlation Matrix", "Pairwise Relationship", "Featur
def fit_dataset(self, df):

#Extracts label feature as target variable
#Drops diagnosis_result feature and set input variable

X = df.drop('binaryClass', axis = 1)
#Resamples data
return X, y
def train_test(self):
X, y = self.fit_dataset(self.df)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =
self.X_train_raw = X_train.copy()
self.X_test_raw = X_test.copy()
self.y_train_raw = y_train.copy()
self.y_test_raw = y_test.copy()

save('X_train_raw.npy', self.X_train_raw)
save('y_train_raw.npy', self.y_train_raw)
save('X_test_raw.npy', self.X_test_raw)
save('y_test_raw.npy', self.y_test_raw)
self.X_train_norm = X_train.copy()
self.X_test_norm = X_test.copy()
self.y_train_norm = y_train.copy()
self.y_test_norm = y_test.copy()
norm = MinMaxScaler()
self.X_train_norm = norm.fit_transform(self.X_train_norm)
self.X_test_norm = norm.transform(self.X_test_norm)

save('X_train_norm.npy', self.X_train_norm)
save('y_train_norm.npy', self.y_train_norm)
save('X_test_norm.npy', self.X_test_norm)
save('y_test_norm.npy', self.y_test_norm)
self.X_train_stand = X_train.copy()
self.X_test_stand = X_test.copy()
self.y_train_stand = y_train.copy()
self.y_test_stand = y_test.copy()
self.X_train_stand = scaler.fit_transform(self.X_train_stand)
self.X_test_stand = scaler.transform(self.X_test_stand)

save('X_train_stand.npy', self.X_train_stand)
save('y_train_stand.npy', self.y_train_stand)
save('X_test_stand.npy', self.X_test_stand)
save('y_test_stand.npy', self.y_test_stand)
def split_data_ML(self):
if path.isfile('X_train_raw.npy'):
#Loads npy files
self.X_train_raw = np.load('X_train_raw.npy',allow_pickle=True)
self.y_train_raw = np.load('y_train_raw.npy',allow_pickle=True)
self.X_test_raw = np.load('X_test_raw.npy',allow_pickle=True)
self.y_test_raw = np.load('y_test_raw.npy',allow_pickle=True)
self.X_train_norm = np.load('X_train_norm.npy',allow_pickle=True)
self.y_train_norm = np.load('y_train_norm.npy',allow_pickle=True)
self.X_test_norm = np.load('X_test_norm.npy',allow_pickle=True)
self.y_test_norm = np.load('y_test_norm.npy',allow_pickle=True)
self.X_train_stand = np.load('X_train_stand.npy',allow_pickle=True)
self.y_train_stand = np.load('y_train_stand.npy',allow_pickle=True)
self.X_test_stand = np.load('X_test_stand.npy',allow_pickle=True)
self.y_test_stand = np.load('y_test_stand.npy',allow_pickle=True)
else:
self.train_test()
#Prints each shape

print('X train raw shape: ', self.X_train_raw.shape)
print('Y train raw shape: ', self.y_train_raw.shape)
print('X test raw shape: ', self.X_test_raw.shape)
print('Y test raw shape: ', self.y_test_raw.shape)
#Prints each shape

print('X train norm shape: ', self.X_train_norm.shape)
print('Y train norm shape: ', self.y_train_norm.shape)
print('X test norm shape: ', self.X_test_norm.shape)
print('Y test norm shape: ', self.y_test_norm.shape)
#Prints each shape

print('X train stand shape: ', self.X_train_stand.shape)
print('Y train stand shape: ', self.y_train_stand.shape)
print('X test stand shape: ', self.X_test_stand.shape)
print('Y test stand shape: ', self.y_test_stand.shape)
def train_model_ML(self):
self.split_data_ML()
#Turns on three widgets

self.cbData.setEnabled(True)
self.cbClassifier.setEnabled(True)
self.cbPredictionML.setEnabled(True)
#Turns off pbTrainML

self.pbTrainML.setEnabled(False)
#Turns on three radio buttons

self.rbRaw.setEnabled(True)
self.rbNorm.setEnabled(True)
self.rbStand.setEnabled(True)
self.rbRaw.setChecked(True)
def pie_cat(self, df, var_target, labels, widget):

df.value_counts().plot.pie(ax = widget.canvas.axis1,labels=labels,sta
10})
widget.canvas.axis1.set_title('The distribution of ' + var_target + '
def bar_cat(self,df,var, widget):
ax = df[var].value_counts().plot(kind="barh",ax = widget.canvas.axis1
for i,j in enumerate(df[var].value_counts().values):

ax.text(.7,i,j,weight = "bold",fontsize=10)
widget.canvas.axis1.set_title("Count of "+ var +" cases")

#Plots label with other variable

def dist_percent_plot(self,df,cat,ax1,ax2):
cmap1=plt.cm.coolwarm_r
result = df.groupby(cat).apply (lambda group: (group.binaryClass == '

).to_frame('N')
result["P"] = 1 -result["N"]
g=result.plot(kind='bar', stacked = True,colormap=cmap1, ax=ax1, grid
ax1.set_xlabel(cat)
ax1.set_title('Stacked Bar Plot of '+ cat +' (in %)', fontsize=14)
ax1.set_ylabel('% Thyroid (Negative vs Positive)')
group_by_stat = df.groupby([cat, 'binaryClass']).size()

g=group_by_stat.unstack().plot(kind='bar', stacked=True,ax=ax2,grid=T
ax2.set_title('Stacked Bar Plot of '+ cat +' (in %)', fontsize=14)

ax2.set_xlabel(cat)
plt.show()
def put_label_stacked_bar(self, ax,fontsize):

#patches is everything inside of the chart
for rect in ax.patches:
# Find where everything is located
height = rect.get_height()
width = rect.get_width()
x = rect.get_x()
y = rect.get_y()
# The height of the bar is the data value and can be used as the label
label_text = f'{height:.0f}'
# plots only when height is greater than specified value

if height > 0:
ax.text(label_x, label_y, label_text, ha='center', va='center
def choose_plot(self):
strCB = self.cbData.currentText()
if strCB == "binaryClass":
#Plots distribution of binaryClass variable in pie chart
self.widgetPlot1.canvas.axis1 = self.widgetPlot1.canvas.figure.add_subplot(1
label_class = list(self.df_dummy["binaryClass"].value_counts().in
self.pie_cat(self.df_dummy["binaryClass"],'binaryClass', label_class, self.w
self.bar_cat(self.df_dummy,"binaryClass", self.widgetPlot1)
self.dist_percent_plot(self.df_dummy,"sex",self.widgetPlot2.canvas.axis1,sel
g=sns.countplot(self.df_dummy["on thyroxine"],hue = self.df_dummy
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus binaryClass",fo
g=sns.countplot(self.df_dummy["TSH measured"],hue = self.df_dummy
self.widgetPlot3.canvas.axis1.set_title("TSH measured versus binaryClass",fo
g=sns.countplot(self.df_dummy["TT4 measured"],hue = self.df_dummy
self.widgetPlot3.canvas.axis1.set_title("TT4 measured versus binaryClass",fo
g=sns.countplot(self.df_dummy["T4U measured"],hue = self.df_dummy
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus binaryClass",fo
g=sns.countplot(self.df_dummy["T3 measured"],hue = self.df_dummy[
self.widgetPlot3.canvas.axis1.set_title("T3 measured versus binaryClass",fon
g=sns.countplot(self.df_dummy["query on thyroxine"],hue = self.df_
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus binaryCla
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["binary
self.widgetPlot3.canvas.axis1.set_title("sick versus binaryClass",fontweight
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["binar
self.widgetPlot3.canvas.axis1.set_title("tumor versus binaryClass",fontweigh
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["binar
self.widgetPlot3.canvas.axis1.set_title("psych versus binaryClass",fontweigh
if strCB == "TSH measured":

#Plots distribution of TSH measured variable in pie chart
label_class = list(self.df_dummy["TSH measured"].value_counts().i
self.pie_cat(self.df_dummy["TSH measured"],'TSH measured', label_class, self
self.bar_cat(self.df_dummy,"TSH measured", self.widgetPlot1)
self.dist_percent_plot(self.df_dummy,"TSH measured",self.widgetPlot2.canvas.
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus TSH measured",f
g=sns.countplot(self.df_dummy["goitre"],hue = self.df_dummy["TSH
self.widgetPlot3.canvas.axis1.set_title("goitre versus TSH measured",fontwei
self.widgetPlot3.canvas.axis1.set_title("TT4 measured versus TSH measured",f
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus TSH measured",f
self.widgetPlot3.canvas.axis1.set_title("T3 measured versus TSH measured",fo
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus TSH measu
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["TSH me
self.widgetPlot3.canvas.axis1.set_title("sick versus TSH measured",fontweigh
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["TSH m
self.widgetPlot3.canvas.axis1.set_title("tumor versus TSH measured",fontweig
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["TSH m
self.widgetPlot3.canvas.axis1.set_title("psych versus TSH measured",fontweig
if strCB == "T3 measured":

#Plots distribution of T3 measured variable in pie chart
label_class = list(self.df_dummy["T3 measured"].value_counts().in
self.pie_cat(self.df_dummy["T3 measured"],'T3 measured', label_class, self.w
self.bar_cat(self.df_dummy,"T3 measured", self.widgetPlot1)
self.dist_percent_plot(self.df_dummy,"T3 measured",self.widgetPlot2.canvas.a
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus T3 measured",fo
g=sns.countplot(self.df_dummy["goitre"],hue = self.df_dummy["T3 m
self.widgetPlot3.canvas.axis1.set_title("goitre versus T3 measured",fontweig
self.widgetPlot3.canvas.axis1.set_title("TT4 measured versus T3 measured",fo
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus T3 measured",fo
self.widgetPlot3.canvas.axis1.set_title("TSH measured versus T3 measured",fo
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus T3 measur
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["T3 mea
self.widgetPlot3.canvas.axis1.set_title("sick versus T3 measured",fontweight
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["T3 me
self.widgetPlot3.canvas.axis1.set_title("tumor versus T3 measured",fontweigh
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["T3 me
self.widgetPlot3.canvas.axis1.set_title("psych versus T3 measured",fontweigh
if strCB == "TT4 measured":

#Plots distribution of TT4 measured variable in pie chart
label_class = list(self.df_dummy["TT4 measured"].value_counts().i
self.pie_cat(self.df_dummy["TT4 measured"],'TT4 measured', label_class, self
self.bar_cat(self.df_dummy,"TT4 measured", self.widgetPlot1)
self.dist_percent_plot(self.df_dummy,"TT4 measured",self.widgetPlot2.canvas.
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus TT4 measured",f
g=sns.countplot(self.df_dummy["goitre"],hue = self.df_dummy["TT4
self.widgetPlot3.canvas.axis1.set_title("goitre versus TT4 measured",fontwei
self.widgetPlot3.canvas.axis1.set_title("T3 measured versus TT4 measured",fo
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus TT4 measured",f
self.widgetPlot3.canvas.axis1.set_title("TSH measured versus TT4 measured",f
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus TT4 measu
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["TT4 me
self.widgetPlot3.canvas.axis1.set_title("sick versus TT4 measured",fontweigh
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["TT4 m
self.widgetPlot3.canvas.axis1.set_title("tumor versus TT4 measured",fontweig
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["TT4 m
self.widgetPlot3.canvas.axis1.set_title("psych versus TT4 measured",fontweig
if strCB == 'age':
self.prob_num_versus_two_cat("age","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("age")
self.prob_num_versus_two_cat("age","FTI measured", "T3 measured" ,self.widge
if strCB == 'TSH':
self.prob_num_versus_two_cat("TSH","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("TSH")
self.prob_num_versus_two_cat("TSH","FTI measured", "T3 measured" ,self.widge
if strCB == 'T3':
self.prob_num_versus_two_cat("T3","binaryClass", "T4U measured" ,self.widget
self.hist_num_versus_nine_cat("T3")
self.prob_num_versus_two_cat("T3","FTI measured", "T3 measured" ,self.widget
if strCB == 'TT4':
self.prob_num_versus_two_cat("TT4","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("TT4")
self.prob_num_versus_two_cat("TT4","FTI measured", "T3 measured" ,self.widge
if strCB == 'T4U':
self.prob_num_versus_two_cat("T4U","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("T4U")
self.prob_num_versus_two_cat("T4U","FTI measured", "T3 measured" ,self.widge
if strCB == 'FTI':
self.prob_num_versus_two_cat("FTI","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("FTI")
self.prob_num_versus_two_cat("FTI","FTI measured", "T3 measured" ,self.widge
if strCB == 'Correlation Matrix':

X,_ = self.fit_dataset(self.df)
self.plot_corr(X, self.widgetPlot3)
if strCB == 'Features Importance':

self.plot_importance(self.widgetPlot3)
def feat_versus_other(self, feat,another,legend,ax0,label='',title=''):

background_color = "#fbe7dd"
sns.set_palette(['#ff355d','#66b3ff'])
ax0_sns = sns.histplot(data=self.df, x=self.df[feat],ax=ax0,zorder=2,
shrink=.8,linewidth=0.3,alpha=1)
self.put_label_stacked_bar(ax0_sns,12)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linew

ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linew

ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', edgecolor=backgro
loc='upper right')
ax0_sns.set_xlabel(label,fontweight ="bold",fontsize=14)
ax0_sns.set_title(title,fontweight ="bold",fontsize=16)
def prob_feat_versus_other(self,feat,another,legend,ax0,label='',title=''):
ax0_sns =
sns.kdeplot(x=self.df[feat],ax=ax0,hue=another,linewidth=0.3,fill=True,cbar='


loc='upper right')
def hist_num_versus_nine_cat(self,feat):
self.label_bin = list(self.df_dummy["binaryClass"].value_counts().index)
self.label_sex = list(self.df_dummy["sex"].value_counts().index)
self.label_thyroxine = list(self.df_dummy["on thyroxine"].value_counts().ind
self.label_pregnant = list(self.df_dummy["pregnant"].value_counts().index)
self.label_lithium = list(self.df_dummy["lithium"].value_counts().index)
self.label_goitre = list(self.df_dummy["goitre"].value_counts().index)
self.label_tumor = list(self.df_dummy["tumor"].value_counts().index)
self.label_tsh = list(self.df_dummy["TSH measured"].value_counts().index)
self.label_tt4 = list(self.df_dummy["TT4 measured"].value_counts().index)
print(self.df_dummy["binaryClass"].value_counts())
self.feat_versus_other(feat,self.df_dummy["binaryClass"],self.label_bin,self
s versus ' + feat)
print(self.df_dummy["sex"].value_counts())
self.feat_versus_other(feat,self.df_dummy["sex"],self.label_sex,self.widgetP
feat)
print(self.df_dummy["on thyroxine"].value_counts())
self.feat_versus_other(feat,self.df_dummy["on thyroxine"],self.label_thyroxi
thyroxine versus ' + feat)
print(self.df_dummy["pregnant"].value_counts())
self.feat_versus_other(feat,self.df_dummy["pregnant"],self.label_pregnant,se
versus ' + feat)
print(self.df_dummy["lithium"].value_counts())
self.feat_versus_other(feat,self.df_dummy["lithium"],self.label_lithium,self
versus ' + feat)
print(self.df_dummy["goitre"].value_counts())
self.feat_versus_other(feat,self.df_dummy["goitre"],self.label_goitre,self.w
versus ' + feat)
print(self.df_dummy["tumor"].value_counts())
self.feat_versus_other(feat,self.df_dummy["tumor"],self.label_tumor,self.wid
versus ' + feat)
print(self.df_dummy["TSH measured"].value_counts())
self.feat_versus_other(feat,self.df_dummy["TSH measured"],self.label_tsh,sel
measured versus ' + feat)
print(self.df_dummy["TT4 measured"].value_counts())
self.feat_versus_other(feat,self.df_dummy["TT4 measured"],self.label_tt4,sel
measured versus ' + feat)
def prob_feat_versus_other(self,feat,another,legend,ax0,label='',title=''):
ax0_sns =
sns.kdeplot(x=self.df[feat],ax=ax0,hue=another,linewidth=0.3,fill=True,cbar='


0.989), loc='upper right')
def prob_num_versus_two_cat(self,feat, feat_cat1, feat_cat2, widget):

self.label_feat_cat1 = list(self.df_dummy[feat_cat1].value_counts().index)
self.label_feat_cat2 = list(self.df_dummy[feat_cat2].value_counts().index)
widget.canvas.axis1 = widget.canvas.figure.add_subplot(211,facecolor
self.prob_feat_versus_other(feat,self.df_dummy[feat_cat2],self.label_feat_ca
2 + ' versus ' + feat)
widget.canvas.axis1 = widget.canvas.figure.add_subplot(212,facecolor
self.prob_feat_versus_other(feat,self.df_dummy[feat_cat1],self.label_feat_ca
1 + ' versus ' + feat)
def plot_corr(self, data, widget):

corrdata = data.corr()
sns.heatmap(corrdata, ax = widget.canvas.axis1, lw=1, annot=True, cma
widget.canvas.axis1.set_title('Correlation Matrix', fontweight ="bold
def plot_importance(self, widget):

#Compares different feature importances
r = ExtraTreesClassifier(random_state=0)
X,y = self.fit_dataset(self.df)
r.fit(X, y)
feature_importance_normalized = np.std([tree.feature_importances_ for
r.estimators_],
axis = 0)
sns.barplot(feature_importance_normalized, X.columns, ax = widget.can

widget.canvas.axis1.set_ylabel('Feature Labels',fontweight ="bold",fo
widget.canvas.axis1.set_xlabel('Features Importance',fontweight ="bol
widget.canvas.axis1.set_title('Comparison of different Features Impor
def plot_real_pred_val(self, Y_pred, Y_test, widget, title):
#Calculate Metrics
acc=accuracy_score(Y_test,Y_pred)
#Output plot
widget.canvas.axis1 = widget.canvas.figure.add_subplot(111,facecolor=
widget.canvas.axis1.scatter(range(len(Y_pred)),Y_pred,color="yellow",
widget.canvas.axis1.scatter(range(len(Y_test)), Y_test,color="red",la
widget.canvas.axis1.set_title("Prediction Values vs Real Values of "
widget.canvas.axis1.set_xlabel("Accuracy: " + str(round((acc*100),3))
widget.canvas.axis1.legend()
widget.canvas.axis1.grid(True, alpha=0.75, lw=1, ls='-.')
def plot_cm(self, Y_pred, Y_test, widget, title):

cm=confusion_matrix(Y_test,Y_pred)
widget.canvas.axis1 = widget.canvas.figure.add_subplot(111)
class_label = ['Negative', 'Positive']
df_cm = pd.DataFrame(cm, index=class_label,columns=class_label)
sns.heatmap(df_cm, ax=widget.canvas.axis1, annot=True, cmap='plasma',
widget.canvas.axis1.set_title("Confusion Matrix of " + title, fontwei
widget.canvas.axis1.set_xlabel("Predicted")
widget.canvas.axis1.set_ylabel("True")
def plot_roc(self, clf, xtest, ytest, title, widget):

pred_prob = clf.predict_proba(xtest)
pred_prob = pred_prob[:, 1]
fpr, tpr, thresholds = roc_curve(ytest, pred_prob)
widget.canvas.axis1.plot(fpr,tpr, label='ANN',color='crimson', linewi
widget.canvas.axis1.set_xlabel('False Positive Rate')
widget.canvas.axis1.set_ylabel('True Positive Rate')
widget.canvas.axis1.set_title('ROC Curve of ' + title, fontweight ="b
def plot_decision(self, cla, feat1, feat2, widget, title=""):

dataset_dir = curr_path + "/hypothyroid.csv"
#Loads csv file
df, _ = self.read_dataset(dataset_dir)
#Plots decision boundary of two features

feat_boundary = [feat1, feat2]
X_feature = df[feat_boundary]
X_train_feature, X_test_feature, y_train_feature, y_test_feature= tra
test_size=0.3, random_state = 42)
cla.fit(X_train_feature, y_train_feature)
plot_decision_regions(X_test_feature.values, y_test_feature.ravel(),
widget.canvas.axis1.set_title(title, fontweight ="bold",fontsize=15)
widget.canvas.axis1.set_xlabel(feat1)
widget.canvas.axis1.set_ylabel(feat2)
def plot_learning_curve(self,estimator, title, X, y, widget, ylim=None, cv=N

widget.canvas.axis1.set_xlabel("Training examples")

return_times=True)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
# Plot learning curve

widget.canvas.axis1.fill_between(train_sizes, train_scores_mean - tra
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
widget.canvas.axis1.fill_between(train_sizes, test_scores_mean - test_
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
widget.canvas.axis1.plot(train_sizes, train_scores_mean, 'o-', color=
label="Training score")
widget.canvas.axis1.plot(train_sizes, test_scores_mean, 'o-', color="
label="Cross-validation score")
widget.canvas.axis1.legend(loc="best")
def plot_scalability_curve(self,estimator, title, X, y, widget, ylim=None, c


return_times=True)
fit_times_std = np.std(fit_times, axis=1)

widget.canvas.axis1.plot(train_sizes, fit_times_mean, 'o-')
widget.canvas.axis1.fill_between(train_sizes, fit_times_mean - fit_ti
fit_times_mean + fit_times_std, alpha=0.1)
widget.canvas.axis1.set_ylabel("fit_times")
def plot_performance_curve(self,estimator, title, X, y, widget, ylim=None, c


return_times=True)

widget.canvas.axis1.plot(fit_times_mean, test_scores_mean, 'o-')
widget.canvas.axis1.fill_between(fit_times_mean, test_scores_mean - t
test_scores_mean + test_scores_std, alpha=0.1)
widget.canvas.axis1.set_xlabel("fit_times")
def train_model(self, model, X, y):

model.fit(X, y)
return model
def predict_model(self, model, X, proba=False):

if ~proba:
y_pred = model.predict(X)
else:
y_pred_proba = model.predict_proba(X)
y_pred = np.argmax(y_pred_proba, axis=1)
return y_pred
def run_model(self, name, scaling, model, X_train, X_test, y_train, y_test,

if train == True:
model = self.train_model(model, X_train, y_train)
y_pred = self.predict_model(model, X_test, proba)
accuracy = accuracy_score(y_test, y_pred)

recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print('f1: ', f1)
self.plot_cm(y_pred, y_test, self.widgetPlot1, name + " -- " + scaling)
self.plot_real_pred_val(y_pred, y_test, self.widgetPlot2, name + " -- " + sc
self.plot_decision(model, 'TSH', 'FTI', self.widgetPlot3, title="The decisio
self.plot_learning_curve(model, 'Learning Curve' + " -- " + scaling, X_train
self.plot_scalability_curve(model, 'Scalability of ' + name + " -- " + scali
self.plot_performance_curve(model, 'Performance of ' + name + " -- " + scali
def build_train_lr(self):
if path.isfile('logregRaw.pkl'):
#Loads model
self.logregRaw = joblib.load('logregRaw.pkl')
self.logregNorm = joblib.load('logregNorm.pkl')
self.logregStand = joblib.load('logregStand.pkl')
self.run_model('Logistic Regression', 'Raw', self.logregRaw, self.X_train_raw
self.y_test_raw)
self.run_model('Logistic Regression', 'Normalization', self.logregNorm, self
self.run_model('Logistic Regression', 'Standardization', self.logregStand, s
else:
self.logregRaw = LogisticRegression(solver='lbfgs', max_iter=500, random_sta
self.logregNorm = LogisticRegression(solver='lbfgs', max_iter=500, random_st
self.logregStand = LogisticRegression(solver='lbfgs', max_iter=500, random_s
self.run_model('Logistic Regression', 'Raw', self.logregRaw, self.X_train_raw
self.y_test_raw)
self.run_model('Logistic Regression', 'Normalization', self.logregNorm, self
self.run_model('Logistic Regression', 'Standardization', self.logregStand, s
#Saves model
joblib.dump(self.logregRaw, 'logregRaw.pkl')
joblib.dump(self.logregNorm, 'logregNorm.pkl')
joblib.dump(self.logregStand, 'logregStand.pkl')
def choose_ML_model(self):
strCB = self.cbClassifier.currentText()
if strCB == 'Logistic Regression':

self.build_train_lr()
if strCB == 'Support Vector Machine':

self.build_train_svm()
if strCB == 'K-Nearest Neighbor':

self.build_train_knn()
if strCB == 'Decision Tree':

self.build_train_dt()
if strCB == 'Random Forest':

self.build_train_rf()
if strCB == 'Gradient Boosting':
self.build_train_gb()
if strCB == 'Naive Bayes':

self.build_train_nb()
if strCB == 'Adaboost':
self.build_train_ada()
if strCB == 'XGB Classifier':

self.build_train_xgb()
if strCB == 'LGBM Classifier':

self.build_train_lgbm()
if strCB == 'MLP Classifier':

self.build_train_mlp()
def build_train_svm(self):
if path.isfile('SVMRaw.pkl'):
#Loads model
self.SVMRaw = joblib.load('SVMRaw.pkl')
self.SVMNorm = joblib.load('SVMNorm.pkl')
self.SVMStand = joblib.load('SVMStand.pkl')
self.run_model('Support Vector Machine', 'Raw', self.SVMRaw, self.X_train_raw
self.y_test_raw)
self.run_model('Support Vector Machine', 'Normalization', self.SVMNorm, self
self.run_model('Support Vector Machine', 'Standardization', self.SVMStand, s
else:
self.SVMRaw = SVC(random_state=2021,probability=True)
self.SVMNorm = SVC(random_state=2021,probability=True)
self.SVMStand = SVC(random_state=2021,probability=True)
self.run_model('Support Vector Machine', 'Raw', self.SVMRaw, self.X_train_raw
self.y_test_raw)
self.run_model('Support Vector Machine', 'Normalization', self.SVMNorm, self
self.run_model('Support Vector Machine', 'Standardization', self.SVMStand, s
#Saves model
joblib.dump(self.SVMRaw, 'SVMRaw.pkl')
joblib.dump(self.SVMNorm, 'SVMNorm.pkl')
joblib.dump(self.SVMStand, 'SVMStand.pkl')
def build_train_knn(self):
if path.isfile('KNNRaw.pkl'):
#Loads model
self.KNNRaw = joblib.load('KNNRaw.pkl')
self.KNNNorm = joblib.load('KNNNorm.pkl')
self.KNNStand = joblib.load('KNNStand.pkl')
self.run_model('K-Nearest Neighbor', 'Raw', self.KNNRaw, self.X_train_raw, s
self.y_test_raw)
self.run_model('K-Nearest Neighbor', 'Normalization', self.KNNNorm, self.X_t
self.y_test_norm)
self.run_model('K-Nearest Neighbor', 'Standardization', self.KNNStand, self.X
else:
#Builds and trains K-Nearest Neighbor
self.KNNRaw = KNeighborsClassifier(n_neighbors = 20)
self.KNNNorm = KNeighborsClassifier(n_neighbors = 20)
self.KNNStand = KNeighborsClassifier(n_neighbors = 20)
self.run_model('K-Nearest Neighbor', 'Raw', self.KNNRaw, self.X_train_raw, s
self.y_test_raw)
self.run_model('K-Nearest Neighbor', 'Normalization', self.KNNNorm, self.X_t
self.y_test_norm)
self.run_model('K-Nearest Neighbor', 'Standardization', self.KNNStand, self.X
#Saves model
joblib.dump(self.KNNRaw, 'KNNRaw.pkl')
joblib.dump(self.KNNNorm, 'KNNNorm.pkl')
joblib.dump(self.KNNStand, 'KNNStand.pkl')
def build_train_dt(self):
if path.isfile('DTRaw.pkl'):
#Loads model
self.DTRaw = joblib.load('DTRaw.pkl')
self.DTNorm = joblib.load('DTNorm.pkl')
self.DTStand = joblib.load('DTStand.pkl')
self.run_model('Decision Tree', 'Raw', self.DTRaw, self.X_train_raw, self.X_
self.run_model('Decision Tree', 'Normalization', self.DTNorm, self.X_train_n
self.y_test_norm)
self.run_model('Decision Tree', 'Standardization', self.DTStand, self.X_trai
self.y_test_stand)
else:
#Builds and trains Decision Tree
parameters = { 'max_depth':np.arange(1,20,1),'random_state':[2021
self.DTRaw = GridSearchCV(dt, parameters)
self.DTNorm = GridSearchCV(dt, parameters)
self.DTStand = GridSearchCV(dt, parameters)
self.run_model('Decision Tree', 'Raw', self.DTRaw, self.X_train_raw, self.X_
self.run_model('Decision Tree', 'Normalization', self.DTNorm, self.X_train_n
self.y_test_norm)
self.run_model('Decision Tree', 'Standardization', self.DTStand, self.X_trai
self.y_test_stand)
#Saves model
joblib.dump(self.DTRaw, 'DTRaw.pkl')
joblib.dump(self.DTNorm, 'DTNorm.pkl')
joblib.dump(self.DTStand, 'DTStand.pkl')
def build_train_rf(self):
if path.isfile('RFRaw.pkl'):
#Loads model
self.RFRaw = joblib.load('RFRaw.pkl')
self.RFNorm = joblib.load('RFNorm.pkl')
self.RFStand = joblib.load('RFStand.pkl')
self.run_model('Random Forest', 'Raw', self.RFRaw, self.X_train_raw, self.X_
self.run_model('Random Forest', 'Normalization', self.RFNorm, self.X_train_n
self.y_test_norm)
self.run_model('Random Forest', 'Standardization', self.RFStand, self.X_trai
self.y_test_stand)
else:
#Builds and trains Random Forest
self.RFRaw = RandomForestClassifier(n_estimators=200, max_depth=20, random_s
self.RFNorm = RandomForestClassifier(n_estimators=200, max_depth=20, random_
self.RFStand = RandomForestClassifier(n_estimators=200, max_depth=20, random_
self.run_model('Random Forest', 'Raw', self.RFRaw, self.X_train_raw, self.X_
self.run_model('Random Forest', 'Normalization', self.RFNorm, self.X_train_n
self.y_test_norm)
self.run_model('Random Forest', 'Standardization', self.RFStand, self.X_trai
self.y_test_stand)
#Saves model
joblib.dump(self.RFRaw, 'RFRaw.pkl')
joblib.dump(self.RFNorm, 'RFNorm.pkl')
joblib.dump(self.RFStand, 'RFStand.pkl')
def build_train_gb(self):
if path.isfile('GBRaw.pkl'):
#Loads model
self.GBRaw = joblib.load('GBRaw.pkl')
self.GBNorm = joblib.load('GBNorm.pkl')
self.GBStand = joblib.load('GBStand.pkl')
self.run_model('Gradient Boosting', 'Raw', self.GBRaw, self.X_train_raw, sel
self.run_model('Gradient Boosting', 'Normalization', self.GBNorm, self.X_tra
self.y_test_norm)
self.run_model('Gradient Boosting', 'Standardization', self.GBStand, self.X_
else:
#Builds and trains Gradient Boosting
self.GBRaw = GradientBoostingClassifier(n_estimators = 100, max_depth=10, su
random_state=2021)
self.GBNorm = GradientBoostingClassifier(n_estimators = 100, max_depth=10, s
random_state=2021)
self.GBStand = GradientBoostingClassifier(n_estimators = 100, max_depth=10,
random_state=2021)
self.run_model('Gradient Boosting', 'Raw', self.GBRaw, self.X_train_raw, sel
self.run_model('Gradient Boosting', 'Normalization', self.GBNorm, self.X_tra
self.y_test_norm)
self.run_model('Gradient Boosting', 'Standardization', self.GBStand, self.X_
#Saves model
joblib.dump(self.GBRaw, 'GBRaw.pkl')
joblib.dump(self.GBNorm, 'GBNorm.pkl')
joblib.dump(self.GBStand, 'GBStand.pkl')
def build_train_nb(self):
if path.isfile('NBRaw.pkl'):
#Loads model
self.NBRaw = joblib.load('NBRaw.pkl')
self.NBNorm = joblib.load('NBNorm.pkl')
self.NBStand = joblib.load('NBStand.pkl')
self.run_model('Naive Bayes', 'Raw', self.NBRaw, self.X_train_raw, self.X_te
self.run_model('Naive Bayes', 'Normalization', self.NBNorm, self.X_train_nor
self.y_test_norm)
self.run_model('Naive Bayes', 'Standardization', self.NBStand, self.X_train_
self.y_test_stand)
else:
#Builds and trains Naive Bayes
self.NBRaw = GaussianNB()
self.NBNorm = GaussianNB()
self.NBStand = GaussianNB()
self.run_model('Naive Bayes', 'Raw', self.NBRaw, self.X_train_raw, self.X_te
self.run_model('Naive Bayes', 'Normalization', self.NBNorm, self.X_train_nor
self.y_test_norm)
self.run_model('Naive Bayes', 'Standardization', self.NBStand, self.X_train_
self.y_test_stand)
#Saves model
joblib.dump(self.NBRaw, 'NBRaw.pkl')
joblib.dump(self.NBNorm, 'NBNorm.pkl')
joblib.dump(self.NBStand, 'NBStand.pkl')
def build_train_ada(self):
if path.isfile('ADARaw.pkl'):
#Loads model
self.ADARaw = joblib.load('ADARaw.pkl')
self.ADANorm = joblib.load('ADANorm.pkl')
self.ADAStand = joblib.load('ADAStand.pkl')
self.run_model('Adaboost', 'Raw', self.ADARaw, self.X_train_raw, self.X_test_
self.run_model('Adaboost', 'Normalization', self.ADANorm, self.X_train_norm,
self.y_test_norm)
self.run_model('Adaboost', 'Standardization', self.ADAStand, self.X_train_st
self.y_test_stand)
else:
#Builds and trains Adaboost
self.ADARaw = AdaBoostClassifier(n_estimators = 100, learning_rate=0.01)
self.ADANorm = AdaBoostClassifier(n_estimators = 100, learning_rate=0.01)
self.ADAStand = AdaBoostClassifier(n_estimators = 100, learning_rate=0.01)
self.run_model('Adaboost', 'Raw', self.ADARaw, self.X_train_raw, self.X_test_
self.run_model('Adaboost', 'Normalization', self.ADANorm, self.X_train_norm,
self.y_test_norm)
self.run_model('Adaboost', 'Standardization', self.ADAStand, self.X_train_st
self.y_test_stand)
#Saves model
joblib.dump(self.ADARaw, 'ADARaw.pkl')
joblib.dump(self.ADANorm, 'ADANorm.pkl')
joblib.dump(self.ADAStand, 'ADAStand.pkl')
def build_train_xgb(self):
if path.isfile('XGBRaw.pkl'):
#Loads model
self.XGBRaw = joblib.load('XGBRaw.pkl')
self.XGBNorm = joblib.load('XGBNorm.pkl')
self.XGBStand = joblib.load('XGBStand.pkl')
self.run_model('XGB', 'Raw', self.XGBRaw, self.X_train_raw, self.X_test_raw,
self.run_model('XGB', 'Normalization', self.XGBNorm, self.X_train_norm, self
self.y_test_norm)
self.run_model('XGB', 'Standardization', self.XGBStand, self.X_train_stand,
self.y_test_stand)
else:
#Builds and trains XGB classifier
self.XGBRaw = XGBClassifier(n_estimators = 100, max_depth=20, random_state=2
self.XGBNorm = XGBClassifier(n_estimators = 100, max_depth=20, random_state=
self.XGBStand = XGBClassifier(n_estimators = 100, max_depth=20, random_state
self.run_model('XGB', 'Raw', self.XGBRaw, self.X_train_raw, self.X_test_raw,
self.run_model('XGB', 'Normalization', self.XGBNorm, self.X_train_norm, self
self.y_test_norm)
self.run_model('XGB', 'Standardization', self.XGBStand, self.X_train_stand,
self.y_test_stand)
#Saves model
joblib.dump(self.XGBRaw, 'XGBRaw.pkl')
joblib.dump(self.XGBNorm, 'XGBNorm.pkl')
joblib.dump(self.XGBStand, 'XGBStand.pkl')
def build_train_lgbm(self):
if path.isfile('LGBMRaw.pkl'):
#Loads model
self.LGBMRaw = joblib.load('LGBMRaw.pkl')
self.LGBMNorm = joblib.load('LGBMNorm.pkl')
self.LGBMStand = joblib.load('LGBMStand.pkl')
self.run_model('LGBM Classifier', 'Raw', self.LGBMRaw, self.X_train_raw, sel
self.run_model('LGBM Classifier', 'Normalization', self.LGBMNorm, self.X_tra
self.y_test_norm)
self.run_model('LGBM Classifier', 'Standardization', self.LGBMStand, self.X_
else:
#Builds and trains LGBMClassifier classifier
self.LGBMRaw = LGBMClassifier(max_depth = 20, n_estimators=100, subsample=0.
self.LGBMNorm = LGBMClassifier(max_depth = 20, n_estimators=100, subsample=0
self.LGBMStand = LGBMClassifier(max_depth = 20, n_estimators=100, subsample=
self.run_model('LGBM Classifier', 'Raw', self.LGBMRaw, self.X_train_raw, sel
self.run_model('LGBM Classifier', 'Normalization', self.LGBMNorm, self.X_tra
self.y_test_norm)
self.run_model('LGBM Classifier', 'Standardization', self.LGBMStand, self.X_
#Saves model
joblib.dump(self.LGBMRaw, 'LGBMRaw.pkl')
joblib.dump(self.LGBMNorm, 'LGBMNorm.pkl')
joblib.dump(self.LGBMStand, 'LGBMStand.pkl')
def build_train_mlp(self):
if path.isfile('MLPRaw.pkl'):
#Loads model
self.MLPRaw = joblib.load('MLPRaw.pkl')
self.MLPNorm = joblib.load('MLPNorm.pkl')
self.MLPStand = joblib.load('MLPStand.pkl')
self.run_model('MLP Classifier', 'Raw', self.MLPRaw, self.X_train_raw, self.X
self.run_model('MLP Classifier', 'Normalization', self.MLPNorm, self.X_train_
self.y_test_norm)
self.run_model('MLP Classifier', 'Standardization', self.MLPStand, self.X_tr
else:
#Builds and trains MLP classifier
self.MLPRaw = MLPClassifier(random_state=2021)
self.MLPNorm = MLPClassifier(random_state=2021)
self.MLPStand = MLPClassifier(random_state=2021)
self.run_model('MLP Classifier', 'Raw', self.MLPRaw, self.X_train_raw, self.X
self.run_model('MLP Classifier', 'Normalization', self.MLPNorm, self.X_train_
self.y_test_norm)
self.run_model('MLP Classifier', 'Standardization', self.MLPStand, self.X_tr
#Saves model
joblib.dump(self.MLPRaw, 'MLPRaw.pkl')
joblib.dump(self.MLPNorm, 'MLPNorm.pkl')
joblib.dump(self.MLPStand, 'MLPStand.pkl')
def train_test_ANN(self):
X, y = self.fit_dataset(self.df)
#Resamples data
#Splits dataframe into X_train, X_test, y_train and y_test

X_train, X_test, y_train_DL, y_test_DL = train_test_split(X, y, test_
#Deletes any outliers in the data using StandardScaler from SKLearn

X_train_DL = sc.fit_transform(X_train)
X_test_DL = sc.transform(X_test)
#Saves data
np.save('X_train_DL.npy', X_train_DL)
np.save('X_test_DL.npy', X_test_DL)
np.save('y_train_DL.npy', y_train_DL)
np.save('y_test_DL.npy', y_test_DL)
def train_ANN(self):
if path.isfile('X_train_DL.npy'):
#Loads files
self.X_train_DL = np.load('X_train_DL.npy',allow_pickle=True)
self.X_test_DL = np.load('X_test_DL.npy',allow_pickle=True)
self.y_train_DL = np.load('y_train_DL.npy',allow_pickle=True)
self.y_test_DL = np.load('y_test_DL.npy',allow_pickle=True)
else:
self.train_test_ANN()
#Loads files
self.X_train_DL = np.load('X_train_DL.npy',allow_pickle=True)
self.X_test_DL = np.load('X_test_DL.npy',allow_pickle=True)
self.y_train_DL = np.load('y_train_DL.npy',allow_pickle=True)
self.y_test_DL = np.load('y_test_DL.npy',allow_pickle=True)
if path.isfile('thyroid_model.h5') == False:
self.build_ANN(self.X_train_DL, self.y_train_DL, 32, 250)
#Turns on cbPredictionDL
self.cbPredictionDL.setEnabled(True)
#Turns off pbTrainDL

self.pbTrainDL.setEnabled(False)
def build_ANN(self,X_train, y_train, NBATCH, NEPOCH):

#Imports Tensorflow and create a Sequential Model to add layer for the ANN
#Input layer
input_dim=27,
activation='relu'))
#Hidden layer 1
activation='relu'))
#Output layer
activation='sigmoid'))
print(ann.summary()) #for showing the structure and parameters
#Compiles the ANN using ADAM optimizer.

ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics

history = ann.fit(X_train, y_train, batch_size = 64, validation_split
#Saves model

np.save('thyroid_history.npy', history.history)
def choose_prediction_ANN(self):
strCB = self.cbPredictionDL.currentText()
if strCB == 'CNN 1D':

pred_val = self.pred_ann(self.X_test_DL, self.y_test_DL)
#Plots true values versus predicted values

self.plot_real_pred_val(pred_val, self.y_test_DL, self.widgetPlot2, 'CNN 1D'
#Plot confusion matrix

self.plot_cm(pred_val, self.y_test_DL, self.widgetPlot1, 'CNN 1D')
#Loads history
history = np.load('thyroid_history.npy',allow_pickle=True).item()
train_loss = history['loss']
train_acc = history['accuracy']
val_acc = history['val_accuracy']
val_loss = history['val_loss']
self.plot_loss_acc(train_loss, val_loss, train_acc, val_acc, self.widgetPlot
def plot_loss_acc(self,train_loss, val_loss, train_acc, val_acc, widget,strP

widget.canvas.axis1.plot(train_loss, \
label='Training Loss',color='blue', linew
widget.canvas.axis1.plot(val_loss, 'b--', label='Validation Loss',col
widget.canvas.axis1.set_title('Loss',fontweight ="bold",fontsize=20)
widget.canvas.axis1.legend(fontsize=16)
widget.canvas.axis1.plot(train_acc, \
label='Training Accuracy',color='blue',
widget.canvas.axis1.plot(val_acc, 'b--', label='Validation Accuracy',
widget.canvas.axis1.set_title('Accuracy',fontweight ="bold",fontsize=
widget.canvas.axis1.legend(fontsize=16)
def pred_ann(self, xtest, ytest):

self.train_ANN()
self.ann = load_model('thyroid_model.h5')
prediction = self.ann.predict(xtest)
label = [int(p>=0.5) for p in prediction]
print("test_target:", ytest)
print("pred_val:", label)
#Performance Evaluation - Accuracy, Classification Report & Confusion Matrix

#Accuracy Score
print ('Accuracy Score : ', accuracy_score(label, ytest), '\n')

print ('Classification Report :\n\n' ,classification_report(label, ytest))
return label
if __name__ == '__main__':
import sys
app = QApplication(sys.argv)
ex = DemoGUI_Thyroid()
ex.show()
sys.exit(app.exec_())

Siahaan V. Data Science Crash Course... With Python GUI 2ed 2023

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Siahaan V. Data Science Crash Course... With Python GUI 2ed 2023

Uploaded by

Copyright:

Available Formats

DATA SCIENCE CRASH COURSE:

Copyright © 2023 BALIGE Publishing

Published: JULY 2023

Rismon Hasiholan Sianipar, born in Pematang Siantar in 1994,

ABOUT THE BOOK

Thyroid disease is a prevalent condition that affects the

We will start by conducting data exploration on a

Next, we will delve into the machine learning phase,

Moving beyond traditional machine learning, we will

To provide an interactive and user-friendly experience,

By the end of the project, readers will have gained

The combination of machine learning and deep learning

EXPLORING DATASET AND FEATURES 1

This dataset was from Garavan Institute

Step Download dataset from

Step Open a new Python script and save it as thyroid.py.

Step Import all necessary libraries:

The code is a typical import statement in Python. It is used to import

Step Read dataset:

thyroid surgery I131 treatment query hypothyroid query

goitre tumor hypopituitary psych TSH measured TSH T3

Here are the steps involved in the code:

Step Check the shape of dataset:

The code print(df.shape) is used to check the shape of the DataFrame

The shape attribute of a DataFrame returns a tuple representing the

Step Read every column in dataset:

The columns attribute of a DataFrame returns a pandas Index object

Step Check the information of dataset:

1 #Checks dataset information

The code print(df.info()) is used to display

The info() function provides a concise summary

The output of print(df.info()) provides

1. The DataFrame has a

Checking Unique Values

Step Checks count, unique values, and frequency:

1 #Checks count, unique values, and

The output shows the count, unique values,

Analyzing the output of the unique values in

Checking Null Values

Step Checks null values in each column:

1 #Checks null values

Here are the steps involved in the code you

Converting Six Columns into Numerical

Step Convert six columns into numerical:

1 #Converts six columns into numerical

The code converts six columns, namely 'age', 'FTI',

The output shows the updated information about the

Deleting Irrelevant Columns

Step Delete irrelevant columns: TBG and referral

1 #Deletes irrelevant columns

#Checks each column datatype

#Checks null values

Here are the steps involved in the code:

The output shows the updated information about

Step Look at statistical description of numerical

1 #Looks at statistical description of

The code print(df.describe().to_string()) is used

The describe() function computes various

Here's an explanation of the output:

Step Look at the maximum number of age column. It

1 #Cleans age column

The code cleans the 'age' column in the

Step Print the total distribution: