
St. Xavier's College
Affiliated to Tribhuvan University
Maitighar, Kathmandu

FINAL YEAR PROJECT PROPOSAL


ON
A Comparative Analysis and Risk Prediction of Diabetes using ML
Classification Algorithms
[CSC-408]
For the partial fulfillment of the requirement for the degree of Bachelor of Science in Computer
Science and Information Technology awarded by Tribhuvan University

Under the supervision of


Rajan Karmacharya
Supervisor/Lecturer

Submitted by
Anom Maharjan (T.U. Exam Roll No. 10173/073)
Anuska Sthapit (T.U. Exam Roll No. 10174/073)

Submitted to
St. Xavier's College
Department of Computer Science
Maitighar, Kathmandu, Nepal
April, 2021
Final Year Project Report
On  
A Comparative Analysis and Risk Prediction of Diabetes using ML
Classification Algorithms
 
A final year project report submitted in partial fulfillment of the 
requirements for the degree of Bachelor of Science in Computer Science 
and Information Technology awarded by Tribhuvan University. 
 

Submitted By: 
Anom Maharjan (T.U. Exam Roll No. 10173/073) 
Anuska Sthapit (T.U. Exam Roll No. 10174/073)
 

Submitted To: 
ST. XAVIER’S COLLEGE 
Department of Computer Science 
Maitighar, Kathmandu, Nepal 
April, 2021 
CERTIFICATE OF APPROVAL

The undersigned certify that they have read and recommended to the Department of
Computer
Science for acceptance, a project proposal entitled “A Comparative Analysis and Risk
Prediction of Diabetes using ML Classification Algorithms.” submitted by Anom
Maharjan (TU ROLL NO.: 10173/073) and Anuska Sthapit (TU ROLL NO.: 10174/073)
for the partial fulfillment of the requirement for the degree of Bachelor of Science in
Computer Science and Information Technology awarded by Tribhuvan University.

…………………………..
Er. Rajan Karmacharya
Supervisor /Lecturer
St. Xavier’s College

…………………………..
External Examiner
Tribhuvan University

…………………………..
Ganesh Yogi
Head of the Department
Department of Computer Science
St. Xavier’s College
ACKNOWLEDGEMENT

We are greatly privileged to be students of Computer Science here at St. Xavier's
College, with a department full of experts in their respective fields who are highly
supportive of learners. We would like to express our sincere gratitude to Er. Rajan
Karmacharya, our supervisor, for creating a virtuous academic and sociable environment to
foster this project. We would also like to express our innermost thanks to him for
providing us with all the crucial advice, guidelines and resources for the accomplishment of
this project.

We are also grateful to the entire Computer Science Department of St. Xavier's College for
providing us a suitable environment in which we could work on this project. We are pleased
that the department helped us in every possible way. We would also like to
take this opportunity to express our gratitude to Mr. Ganesh Yogi for his continuous
encouragement and support throughout the completion of this project.

We would also like to express our heartfelt gratitude to Mr. Jeetendra Manandhar, Mr. Bal
Krishna Subedi, Er. Anil Shah, Er. Saugat Sigdel, Er. Nitin Malla, Er. Sansar Dewan,
Er. Sanjay Kumar Yadav, Mr. Ganesh Dhami and Mr. Ramesh Shahi for their constant
support and guidance. Furthermore, we are also grateful to all our colleagues,
seniors and relatives who have directly or indirectly been a part of this project.

Anom Maharjan (TU ROLL NO.: 10173/073)


Anuska Sthapit (TU ROLL NO.: 10174/073)
ABSTRACT

Diabetes is a major metabolic disorder that can adversely affect the entire body. Early
detection of diabetes is very important to maintain a healthy life. The conventional
identification process is tedious, requiring the patient to visit a diagnostic center and
consult a doctor. The motive of this project is to design a model that can predict the
likelihood of diabetes in patients with maximum accuracy, and to develop a web application
that implements the findings. Four machine learning classification algorithms, namely
Logistic Regression, KNN, Random Forest Classifier and Support Vector Machine, are used in
this project. Experiments are performed on the Pima Indians Diabetes Database (PIDD), which
is sourced from the UCI machine learning repository. The performances of all four algorithms
are evaluated on various measures like Precision, Accuracy, F-Measure, and Recall. The
results show that Logistic Regression outperforms the other algorithms with the highest
accuracy of 79.8%. These results are verified using Receiver Operating Characteristic (ROC)
curves in a systematic manner. After identifying the model with the highest accuracy, a web
application is developed using Python and Flask. This web application will help patients
assess their likelihood of diabetes without having to go to a diagnostic center and consult
a doctor for regular checkups.

Keywords: Diabetes, Logistic Regression, KNN, SVM, Random Forest, Accuracy, Machine Learning

TABLE OF CONTENTS
ACKNOWLEDGEMENT....................................................................................................................i
ABSTRACT.........................................................................................................................................ii
LIST OF FIGURES............................................................................................................................iv
LIST OF TABLES...............................................................................................................................v
LIST OF ABBREVIATIONS............................................................................................................vi
CHAPTER 1: INTRODUCTION.......................................................................................................1
1.1 Background............................................................................................................................1
1.2 Problem Statement:................................................................................................................1
1.3 Project Objectives and Scope:................................................................................................2
1.4 Significance of the Study:......................................................................................................2
1.5 Project Features.....................................................................................................................3
1.6 Requirement Analysis............................................................................................................3
1.6.1 Dataset...............................................................................................................................3
1.7 Feasibility Study....................................................................................................................4
1.8 System Requirement – Minimum Hardware and Software....................................................5
1.8.1 Platforms............................................................................................................................5
CHAPTER 2: LITERATURE REVIEW...........................................................................................6
2.1 Machine Learning..................................................................................................................6
2.2 Related Research...................................................................................................................7
2.3 Proposed Methodology........................................................................................................10
2.3.1 Dataset Description..........................................................................................................10
2.3.2 Dataset Preprocessing......................................................................................................11
2.3.3 Applying Machine Learning Models...............................................................................12
CHAPTER 3: SYSTEM DEVELOPMENT....................................................................................16
3.1 Project Management Strategy and Tools..............................................................................16
3.1.1 Project Workflow and schedule.......................................................................................17
3.1.2 Project Team....................................................................................................................17
3.1.3 Responsibilities................................................................................................................17
3.2 System Design.....................................................................................................................18
3.2.1 Data Preprocessing:.........................................................................................................18
3.2.2 Applying Machine Learning Models and Evaluating their Performance..........................25
3.2.3 Model comparison...........................................................................................................30
3.2.4 Feature Importance..........................................................................................................30
3.2.5 Save and Load Model......................................................................................................30
3.3 Web Application Architecture.............................................................................................31
3.4 Development Tools..............................................................................................................31
3.4.1 Visual Studio Code..........................................................................................................31
3.4.2 Python and Flask..............................................................................................................32
3.4.3 Heroku.............................................................................................................................32
3.4.4 Jupyter Notebook.............................................................................................................33
3.5 Project Schedule..................................................................................................................33
CHAPTER 4: RESULT ANALYSIS................................................................................................34
4.1 Experimental Results and Observations...............................................................................34
4.2 System Results.....................................................................................................................35
4.2.1 Screenshots and codes:....................................................................................................36
4.3 Critical Analysis..................................................................................................................39
4.4 Limitations and Future Enhancements.................................................................................40
4.4.1 Limitations.......................................................................................................................40
4.4.2 Future Enhancements.......................................................................................................41
4.5 Conclusion...........................................................................................................................41
REFERENCES..................................................................................................................................42
LIST OF FIGURES

Figure 1 Essential Learning Process to develop a predictive model.......................................................7


Figure 2 Bar Graph representing ratio of Diabetic and Non Diabetic Patient......................................11
Figure 3 Pie Chart representing ratio of Diabetic and Non Diabetic Patient........................................11
Figure 4 Data Preprocessing and Exploration......................................................................................19
Figure 5 Model training and selection..................................................................................................20
Figure 6 Histogram of all features (all patients)...................................................................................21
Figure 7 Histogram of all features (diabetic patients)..........................................................................21
Figure 8 Heat Map showing the correlation between the features........................................................22
Figure 9 Pair plot.................................................................................................................................23
Figure 10 Train test split......................................................................................................................24
Figure 11 Evaluating Performance of Logistic Regression..................................................................26
Figure 12 ROC Curve for Logistic Regression Model.........................................................................26
Figure 13 Evaluating Performance of Support Vector Machine..........................................................27
Figure 14 ROC Curve for Support Vector Machine.............................................................................27
Figure 15 Evaluating Performance of Random Forest Classifier.........................................................28
Figure 16 ROC Curve for Random Forest Classifier...........................................................................28
Figure 17 Evaluating Performance of K-Nearest Neighbors................................................................29
Figure 18 ROC Curve for KNNeighbors Classifier.............................................................................29
Figure 19 Comparison of different models..........................................................................................30
Figure 20 Importance of features.........................................................................................................30
Figure 21 Web application architecture...............................................................................................31
Figure 22 Gantt chart...........................................................................................................................33
Figure 23 Comparative analysis based on accuracy of algorithms.......................................................35
Figure 24 Screenshot showing the homepage......................................................................................37
Figure 25 Screenshot showing user entering data................................................................................38
Figure 26 Screenshot showing result page...........................................................................................39
LIST OF TABLES
Table 1 Statistical report of PIMA Indian Dataset.................................................................................4
Table 2 Team Resources and Roles.....................................................................................................16
Table 3 Results in terms of accuracy...................................................................................................33
Table 4 Results in term of Precision, Recall, F1-Score and AUC........................................................33
LIST OF ABBREVIATIONS

ML Machine Learning
KNN K-Nearest Neighbors
SVM Support Vector Machine
LR Logistic Regression
RFC Random Forest Classifier
ANN Artificial Neural Network
DT Decision Tree
UCI University of California, Irvine
ROC Receiver Operating Characteristic
PIDD Pima Indians Diabetes Database
FPG Fasting Plasma Glucose
IGT Impaired Glucose Tolerance
OGTT Oral Glucose Tolerance Test
WHO World Health Organization
RAM Random Access Memory
GB GigaByte
AI Artificial Intelligence
NDA Nepal Diabetes Association
NCD Non Communicable Disease
GP Genetic Programming
FBS Fasting Blood Sugar

CHAPTER 1: INTRODUCTION

1.1 Background
Prediabetes was first recognized as an intermediate diagnosis and an indication of a relatively
high risk for the future development of diabetes by the Expert Committee on Diagnosis and
Classification of Diabetes Mellitus in 1997, and it has been reported that approximately
5–10% of patients with untreated prediabetes subsequently develop diabetes. The definition of
prediabetes includes a fasting plasma glucose (FPG) level in the range of 100–125 mg/dL
(5.6–6.9 mmol/L), impaired glucose tolerance (IGT) (oral glucose tolerance test (OGTT) 2 h
measurement in the range of 140–199 mg/dL (7.8–11.0 mmol/L)), or an HbA1c level in the
range of 5.7–6.4% (39–46 mmol/mol). Early diagnosis and intervention for prediabetes
could prevent these complications, prevent or delay the transition to diabetes, and
be cost-effective. [ CITATION Ada14 \l 1033 ]
Machine learning is an area of artificial intelligence research, which uses statistical methods
for data classification. Several machine learning techniques have been applied in clinical
settings to predict disease and have shown higher accuracy for diagnosis than classical
methods. This project is a small example of how machine learning can be used in
prediction of prediabetes.
In this project, the authors aimed to develop and validate models to predict prediabetes
using Logistic Regression, KNN, Random Forest Classifier, and Support vector machines
(SVM), which could be effective as simple and accurate screening tools. The performances
of all the four algorithms are evaluated on various measures like Precision, Accuracy, F-
Measure, and Recall. Accuracy is measured over correctly and incorrectly classified
instances. The results are verified using Receiver Operating Characteristic (ROC) curves in
a proper and systematic manner.

1.2 Problem Statement:


According to the latest WHO data published in 2017, deaths from diabetes mellitus in Nepal
reached 6,482, or 3.97% of total deaths. The age-adjusted death rate of 33.25 per 100,000
population ranks Nepal 77th in the world. Regular treatment of the disease is vitally
important, but people often do not seem to care until something bad happens. Identifying
diabetes at an early stage is an important factor in which our medical institutions are
still lagging behind.
Considering the present situation of diabetes issues, the following problems have been
summarized:
 Lack of technological platforms enhancing medical needs.
 Passive use of modern technology to address healthcare issues.
 Insufficient knowledge among the public regarding diabetes.
1.3 Project Objectives and Scope:
Inadequate infrastructure and failure to recognize the disease in time have resulted in
neglected medical scenarios. To overcome these problems, the following objectives have been
set:
 To develop a system for early prediction of the disease
The current way of determining the disease is to go to the hospital frequently for
checkups. This may be an effective way, but it is time consuming. Since this project
uses machine learning, it helps in predicting the chances of being affected by
the disease from the first few medical reports.

 To provide a platform for raising public awareness through behavioral change


Deficient knowledge about the risk of diabetes has imposed a problem as severe as
the disease itself. This project intends to raise public awareness by providing
people with an effective tool to determine the stages of diabetes, act accordingly,
and possibly change their lifestyle through the information gained.

1.4 Significance of the Study:


Based on the objectives mentioned above, the significance of our study is as follows:
 Re-evaluation of traditional approaches in the medical sector
The main benefit of this study is the way in which modern technology is applied to
solve crucial issues. The proposed method provides a unique approach to overcome
traditional medical practices.

 Low cost and more accurate solution to a critical concern

With the help of this project, it is convenient to predict the disease in
advance. Not only is this method affordable, but it also provides accurate results
whose importance is not to be compromised.

 Mindful perception of the public towards diabetes

Once people are able to identify their chances of getting diabetes,
awareness is raised among them, which will help them act accordingly.
1.5 Project Features 
After the completion of this project, we can expect a valid prediction model with high
accuracy and a web application that allows users to:
 Enter specified clinical data into the application.
 View probability results.

1.6 Requirement Analysis

Requirements analysis is the first stage in the systems engineering and software development
processes. The main purpose of requirement analysis is to describe the functional and
non-functional requirements of the project. This project provides a web application,
developed and deployed using a validated model, to help patients predict prediabetes. It
reduces the manual process of having to visit diagnostic centers and doctors for follow-up
checkups by providing an automated and reliable application that uses a predictive model to
give results with high accuracy.

1.6.1 Dataset

A dataset of female patients of Pima Indian heritage, at least twenty-one years of age, has
been taken from the UCI machine learning repository. [ CITATION USD \l 1033 ] This dataset is
originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. The
dataset contains a total of 768 instances classified into two classes, diabetic and
non-diabetic, with eight different risk factors: number of times pregnant, plasma glucose
concentration two hours into an oral glucose tolerance test, diastolic blood pressure,
triceps skin fold thickness, two-hour serum insulin, body mass index, diabetes pedigree
function and age, as shown in Table 1.

Table 1 Statistical report of PIMA Indian Dataset


1.7 Feasibility Study

A feasibility study of the project is performed to determine whether the project is viable
under the given circumstances:

 Whether the application being created is the need of people. 

 To show the strengths and deficits before the project is planned. 

 Whether the time, budget and human resources will be sufficient. 

As the application is convenient and very easy to use, it is quite flexible and user friendly in
the  present environment. 

1.8 System Requirement – Minimum Hardware and Software


1.8.1 Platforms 
1.8.1.1 HARDWARE REQUIREMENTS 
 Computer system for executing programs: RAM 6 GB
 Internal memory: 500 GB
 Processor: Intel i5/i7

1.8.1.2 SOFTWARE REQUIREMENTS: 

 Code Editors: Visual Studio Code, Jupyter Notebook


 Libraries: NumPy, Pandas, matplotlib, Seaborn, Flask and pickle.
 Programming Language: Python 3

CHAPTER 2: LITERATURE REVIEW


2.1 Machine Learning
Machine Learning is concerned with the development of algorithms and techniques that
allow computers to learn and gain intelligence based on past experience. It is a
branch of Artificial Intelligence (AI) and is closely related to statistics. Learning means
that the system is able to identify and understand the input data, so that it can make
decisions and predictions based on it. There are two types of machine learning:
 Supervised Learning
 Unsupervised Learning
The learning process starts with the gathering of data by different means from various
resources. The next step is to prepare the data, that is, to pre-process it in order to fix
data-related issues and to reduce the dimensionality of the space by removing irrelevant
data (or selecting the data of interest). Since the amount of data used for learning
is large, it is difficult for the system to make decisions, so algorithms are designed using
logic, probability, statistics, control theory and so on to analyze the data and retrieve
knowledge from past experience. The next step is testing the model to calculate the accuracy
and performance of the system. Finally, the system is optimized, i.e. the model is improved
by using new rules or data sets. The techniques of machine learning are used for
classification, prediction and pattern recognition. Machine learning can be applied in
various areas such as search engines, web page ranking, email filtering, face tagging and
recognition, related advertisements, character recognition, gaming, robotics, disease
prediction and traffic management. [CITATION Har20 \l 1033 ] The essential learning process
used to develop a predictive model is given in Figure 1.

Figure 1 Essential Learning Process to develop a predictive model

2.2 Related Research


Nepal is a Himalayan country with a population of approximately 30 million. A
study reported the prevalence of pre-diabetes and diabetes in Nepal to be 19.5% and 9.5%,
respectively. The WHO South-East Asia Region has projected the prevalence of diabetes to
rise from 436,000 in 2000 to 1,328,000 in 2030. The Nepal Diabetes Association (NDA) reported
that among people aged 20 years and older living in urban areas, 15% are affected by this
disease; among people aged 40 years and older in urban areas, this number climbs to 19%.
[ CITATION Pou18 \l 1033 ]
A report from the World Health Organisation [ CITATION Gan \l 1033 ] addresses diabetes
and its complications, which affect individuals physically and place financial and economic
burdens on their families. The survey states that about 1.2 million deaths were caused by
uncontrolled stages of the disease, and about 2.2 million deaths occurred due to risk factors
associated with diabetes, such as cardiovascular and other diseases. Currently there are over
199 million women living with diabetes, a number projected to increase to 313 million by
2040. Diabetes is the ninth leading cause of death in women globally, causing 2.1 million
deaths per year. Up to 70% of cases of type 2 diabetes could be prevented through the
adoption of a healthy lifestyle.

In Nepal, obesity is the cause of diabetes among 16.6 per cent of the female population and
13.6 per cent of the male population, as stated by the World Health Organization. Likewise,
dullness is identified as another cause among 3.3 per cent of the population. Doctors have
pointed out that Nepal is at high risk of diabetes. According to the WHO, there is no exact
count of patients with diabetes in Nepal, but the 2016 Diabetes Profile has shown that 9.1
percent of the Nepali population is living with diabetes, including 10.5 percent of men and
7.9 percent of women. With the increasing population and changing lifestyles, the burden of
non-communicable diseases (NCDs) is very high, especially in urban areas. Basic treatment
services for diabetes are now available in many places across the country, but for major
treatment the patients have to come to the cities, and the flow of patients is high at the
central hospital. These diseases require prolonged treatment and impose an extra financial
burden on families in poor economic condition. The government has prioritized NCDs, including
diabetes, in the National Health Policy 2015 as well as the National Health Sector Strategy
2015-2020. [ CITATION Kri18 \l 1033 ]

Orabi et al. [ CITATION Sin18 \l 1033 ] designed a system for diabetes prediction, whose
main aim is to predict whether a candidate is suffering from diabetes at a particular age.
The proposed system is designed based on the concept of machine learning, by applying a
decision tree. The obtained results were satisfactory, as the designed system works well in
predicting diabetes incidents at a particular age, with higher accuracy using the Decision Tree.
Pradhan et al. in [ CITATION Pra \l 1033 ] used Genetic Programming (GP) for the training
and testing of a diabetes prediction model, employing the diabetes dataset sourced from the
UCI repository. The results achieved using Genetic Programming give optimal accuracy compared
to the other implemented techniques. A significant improvement in accuracy can be achieved
while taking less time for classifier generation, which makes the approach useful for
diabetes prediction at low cost.

Rashid et al. in [ CITATION Abd15 \l 1033 ] designed a prediction model with two sub-modules
to predict diabetes, a chronic disease. An Artificial Neural Network (ANN) is used in the
first module and Fasting Blood Sugar (FBS) is used in the second module. A Decision Tree
(DT) is used to detect the symptoms of diabetes in a patient's health.

Nongyao et al. in [ CITATION Dee \l 1033 ] applied algorithms that classify the risk of
diabetes mellitus. To fulfill the objective, the authors employed four renowned
machine learning classification methods, namely Decision Tree, Artificial Neural Networks,
Logistic Regression and Naive Bayes. To improve the robustness of the designed model,
Bagging and Boosting techniques were used. Experimental results show that the Random
Forest algorithm gives optimum results among all the algorithms employed.

Kandhasamy and Balamurali [ CITATION Jpr15 \l 1033 ] used multiple classifiers: SVM, J48,
K-Nearest Neighbors (KNN), and Random Forest. The classification was performed on a
dataset taken from the UCI repository. The results of the classifiers were compared based on
the values of accuracy, sensitivity, and specificity. The classification was done in two
cases, with and without pre-processing of the dataset, using 5-fold cross validation. The
authors did not explain the pre-processing step applied to the dataset; they only mentioned
that noise was removed from the data. They reported that the decision tree J48 classifier
has the highest accuracy rate of 73.82% without pre-processing, while the classifiers
KNN (k = 1) and Random Forest showed the highest accuracy rate of 100% after pre-processing
the data.

Moreover, Yuvaraj and Sripreethaa [ CITATION NYu19 \l 1033 ] presented an application for
diabetes prediction using three different ML algorithms: Random Forest, Decision Tree, and
Naïve Bayes. The Pima Indian Diabetes dataset (PID) was used after pre-processing it. The
authors did not mention how the data was pre-processed; however, they discussed the
Information Gain method used for feature selection to extract the relevant features. They
used only eight main attributes among 13 (see Table A4). In addition, they divided the
dataset into 70% for training and 30% for testing. The results showed that the Random Forest
algorithm had the highest accuracy rate of 94%.

Furthermore, Tafa et al. [ CITATION ZTa15 \l 1033 ] proposed a new integrated, improved
model of SVM and Naïve Bayes for predicting diabetes. The model was evaluated using
a dataset collected from three different locations in Kosovo. The dataset contains eight
attributes and 402 patients, of whom 80 had type 2 diabetes. Some attributes utilized in
this study (see Table A4) had not been investigated before, including regular diet,
physical activity, and family history of diabetes. The authors did not mention whether the
data was pre-processed or not. For the validation test, they split the dataset into 50%
each for the training and testing sets. The proposed combined algorithm improved the
prediction accuracy to 97.6%, compared with 95.52% and 94.52% achieved by SVM and Naïve
Bayes, respectively.

2.3 Proposed Methodology


The goal of this project is to investigate models that predict diabetes with better accuracy.
We experimented with different classification algorithms to predict diabetes. In the
following, we briefly discuss the phases.

2.3.1 Dataset Description


The data is gathered from the UCI repository and is named the Pima Indian Diabetes Dataset.
The dataset contains attributes of 768 patients. This dataset has been used widely to predict
whether a patient has diabetes based on the eight diagnostic measurements described below:
1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)
9. Outcome
The ninth attribute is the class variable of each data point. This class variable takes the
value 0 or 1, indicating whether the patient is negative or positive for diabetes. The
dataset is slightly imbalanced, with around 500 instances labeled 0 (non-diabetic) and 268
labeled 1 (diabetic).
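The class distribution shown in Figures 2 and 3 can be reproduced with a few lines of pandas; the sketch below assumes the dataset has been saved locally as diabetes.csv, which is an assumption about the file name rather than part of the original work.

# Minimal sketch: assumes a local copy of the PIDD saved as "diabetes.csv"
import pandas as pd

df = pd.read_csv("diabetes.csv")        # load the Pima Indian Diabetes dataset
print(df.shape)                         # expected: (768, 9)
print(df["Outcome"].value_counts())     # roughly 500 non-diabetic (0) and 268 diabetic (1)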

Figure 2 Bar Graph representing ratio of Diabetic and Non Diabetic Patient
Figure 3 Pie Chart representing ratio of Diabetic and Non Diabetic Patient

2.3.2 Dataset Preprocessing

Data preprocessing is a most important process. Healthcare-related data often contains
missing values and other impurities that can reduce the effectiveness of the data. Data
preprocessing is performed to improve the quality and effectiveness of the results obtained
after the mining process. To apply machine learning techniques effectively on the dataset,
this process is essential for accurate results and successful prediction. For the Pima
Indian diabetes dataset, we need to perform preprocessing in two steps, sketched after the
list below:

 Missing values removal - Remove all the instances that have zero (0) as a value, since
having zero as a value is not possible for these attributes; therefore, such instances are
eliminated. By eliminating irrelevant features/instances, we form a feature subset; this
process, called feature subset selection, reduces the dimensionality of the data and helps
the algorithms work faster.
 Splitting of data - After cleaning, the data is normalized and split for training and
testing the model. Once the data is split, we train the algorithm on the training set and
keep the test set aside. The training process produces a trained model based on the logic
of the algorithm and the feature values in the training data. The aim of normalization is
to bring all the attributes onto the same scale.
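As a rough illustration of the missing-value step above (column names follow the dataset description; replacing zeros with the column median is one common choice and is an assumption, not necessarily the exact treatment used in this project):

# Sketch only: treat physiologically impossible zeros as missing values
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")   # assumed local copy of the dataset

zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_invalid] = df[zero_invalid].replace(0, np.nan)                 # mark zeros as missing
df[zero_invalid] = df[zero_invalid].fillna(df[zero_invalid].median())  # one option: median imputation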

2.3.3 Applying Machine Learning Models

Once the data is ready, we apply machine learning techniques. [ CITATION Mit \l 1033 ]
Different classification techniques are used to predict diabetes. The main objective is to
apply machine learning techniques, analyze the performance of these methods, find their
accuracy, and figure out the important features that play a major role in prediction. The
techniques used are as follows:

2.3.3.1 Support Vector Machine

Support Vector Machine also known as SVM is a supervised machine-learning algorithm.


SVM is one of the most popular classification techniques. SVM creates a hyperplane that
separates two classes; it can create a hyperplane or a set of hyperplanes in a
high-dimensional space, and this hyperplane can be used for classification or regression.
SVM differentiates instances into specific classes and can also classify entities that are
not supported by the data. The separating hyperplane is chosen so as to maximize its
distance to the closest training point of any class.

Algorithm -
o Select the hyperplane that best divides the classes.
o To find the better hyperplane, calculate the distance between the plane and the
data points; this distance is called the margin.
o If the margin is low, the chance of misclassification is high, and vice versa.
Therefore, select the hyperplane with the highest margin.
o Margin = distance to the closest positive point + distance to the closest negative point.
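A minimal scikit-learn sketch of this classifier is given below; it assumes that X_train, X_test, y_train and y_test have already been prepared as described in Chapter 3, and the kernel shown is an illustrative default rather than the exact setting used in this project.

# Sketch only: assumes prepared X_train, X_test, y_train, y_test
from sklearn.svm import SVC

svm_model = SVC(kernel="rbf", probability=True)   # probability=True enables ROC analysis later
svm_model.fit(X_train, y_train)                   # learn the separating hyperplane
print("Test accuracy:", svm_model.score(X_test, y_test))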
2.3.3.2 K-Nearest Neighbor

KNN is also a supervised machine-learning algorithm. KNN helps to solve both classification
and regression problems. KNN is a lazy prediction technique; it assumes that similar things
are near to each other, and data points that are similar are often very close together. KNN
classifies a new data point based on a similarity measure: the algorithm stores all the
records and classifies new records according to their similarity to them (a tree-like
structure can be used to speed up the distance search). To make a prediction for a new data
point, the algorithm finds the closest data points in the training dataset, its nearest
neighbors. Here K is the number of nearby neighbors and is always a positive integer. The
predicted class is chosen from the classes of these neighbors. Closeness is mainly defined
in terms of Euclidean distance. The Euclidean distance between two points P and Q, i.e.
P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), is defined by the following equation:

d(P, Q) = √( Σᵢ₌₁ⁿ (pᵢ − qᵢ)² )                                                    (Equation 1)

Algorithm -
o Take a sample dataset of columns and rows, here the Pima Indian Diabetes dataset.
o Take a test dataset of attributes and rows.
o Find the Euclidean distance using the formula above.
o Decide a value of K, the number of nearest neighbors to consider.
o Using the smallest Euclidean distances, find the K nearest neighbors of the test point.
o Find the output (class) values of these neighbors.
o If the majority of these values indicate diabetes, the patient is classified as diabetic;
otherwise not.
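This procedure corresponds to what scikit-learn's KNeighborsClassifier does; a minimal sketch follows, where K = 5 is an illustrative choice and the train/test variables are assumed to exist as described in Chapter 3.

# Sketch only: K = 5 is illustrative, not necessarily the project's final value
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_model.fit(X_train, y_train)     # a lazy learner: "training" mostly stores the data
print("Test accuracy:", knn_model.score(X_test, y_test))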
2.3.3.3 Logistic Regression

Logistic Regression is also a supervised learning classification algorithm. It is used to
estimate the probability of a binary response based on one or more predictors, which can be
continuous or discrete. Logistic Regression is used when we want to classify or distinguish
data items into categories. It classifies the data in binary form, i.e. only into 0 and 1,
which here corresponds to classifying a patient as negative or positive for diabetes.

The main aim of Logistic Regression is to find the best fit that describes the relationship
between the target and predictor variables. Logistic Regression is based on the linear
regression model, and it uses the sigmoid function to predict the probability of the
positive and negative classes.

Sigmoid function: P = 1 / (1 + e^−(a + bx)), where P is the probability and a and b are the parameters of the model.
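For completeness, a minimal scikit-learn sketch of this classifier is shown below, assuming normalized training and test sets as described in Chapter 3; the max_iter value is an assumption.

# Sketch only: assumes scaled X_train, X_test and labels y_train, y_test
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=1000)      # sigmoid output gives the probability of class 1
lr_model.fit(X_train, y_train)
probs = lr_model.predict_proba(X_test)[:, 1]      # probability of being diabetic
print("Test accuracy:", lr_model.score(X_test, y_test))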

2.3.3.4 Random Forest

Random Forest is a type of ensemble learning method, used for both classification and
regression tasks. The accuracy it gives is often greater than that of other models, and the
method can easily handle large datasets. Random Forest was developed by Leo Breiman and is a
popular ensemble learning method that improves the performance of decision trees by reducing
variance. It operates by constructing a multitude of decision trees at training time and
outputs the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees.

Algorithm -
o The first step is to select R features from the total M features, where R << M.
o Among the R features, find the node using the best split point.
o Split the node into sub-nodes using the best split.
o Repeat the above steps until l nodes have been reached.
o Build the forest by repeating these steps to create n trees.
The random forest finds the best split using the Gini index cost function,
Gini = 1 − Σᵢ pᵢ², where pᵢ is the proportion of samples of class i at the node. To make a
prediction, the rules of each randomly created decision tree are applied to the test
features and each predicted outcome is stored; the votes for each predicted target are then
counted, and the target with the highest number of votes is taken as the final prediction
of the random forest algorithm. These properties allow Random Forest to give accurate
prediction results for a wide range of applications.
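A hedged scikit-learn sketch of this classifier is given below; it also exposes the Gini-based feature importances used later in Section 3.2.4. The number of trees and the random seed are assumptions.

# Sketch only: 100 trees and random_state=42 are assumed settings
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
rf_model.fit(X_train, y_train)
print("Test accuracy:", rf_model.score(X_test, y_test))
print("Feature importances:", rf_model.feature_importances_)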

CHAPTER 3: SYSTEM DEVELOPMENT

3.1 Project Management Strategy and Tools 


A project is a well-defined task, which is a collection of several operations done in order to
achieve a goal (for example, software development and delivery).[ CITATION Ski \l 1033 ]
A Project can be characterized as: 
  Every project has a unique and distinct goal.
  A project is not a routine activity or day-to-day operation.
  A project comes with a start time and an end time.
  A project ends when its goal is achieved; hence it is a temporary phase in the
lifetime of an organization.
  A project needs adequate resources in terms of time, workforce, finance, material
and knowledge bank.
Project management is the application of knowledge, skills and techniques to execute
projects effectively and efficiently. Project management is the discipline of planning,
organizing, and controlling resources to achieve specific goals. Project management has
been necessary and important in this project. The constraints for this project, as for most
projects, have been time, cost and quality. Project management is necessary to complete the
project under these constraints and utilize the resources properly. Project management tools
are aids to assist an individual or team to effectively organize work and manage projects and
tasks. The term usually refers to project management software you can purchase online or
even use for free. Despite its name, project management tools are not just for project
managers. Project management tools are made to be completely customizable so they can fit
the needs of teams of different sizes and with different goals. Project management tools are
usually defined by the different features offered. [ CITATION Eng12 \l 1033 ] They include,
but are not limited to:
 Planning/scheduling - Project management tools allow you to plan and delegate
work all in one place with tasks, subtasks, folders, templates, workflows, and
calendars.
 Collaboration - Email is no longer the only form of communication. Use project
management tools to assign tasks, add comments, organize dashboards, and for
proofing & approvals.
 Documentation - Avoid missing files with file management features: editing,
versioning, & storage of all files.
 Evaluation - Track and assess productivity and growth through resource
management & reporting.[ CITATION Wri \l 1033 ]
 
3.1.1 Project Workflow and schedule 
 Team Size: 2 
 Total project duration: 20 weeks 
 Effort Required per person: 40 hours per week 
3.1.2 Project Team 
Table 2 Team Resources and Roles

Team Resources Roles

Er. Rajan Karmacharya Supervisor

Anom Maharjan Developer/ Designer

Anuska Sthapit Developer/ Designer

3.1.3 Responsibilities 
Projects are initiated to solve a problem; if a project does not solve the problems it is
intended to, the project is of no use. Also, the developers should work according to their
plans; otherwise, the project might go off track. The developers should thoroughly
understand their responsibilities.
 
 Responsibilities of Supervisor 
o Schedule the project 
o Schedule Tracking 
o Share information 
o Documentation 

 Responsibilities of team member 


o Preliminary research regarding the project 
o Background reading 
o Design and Analysis 
o Development and Testing 
o Implementation and System Evolution 
o Draft report writing and submission 
o Final report writing and submission 
3.2 System Design
3.2.1 Data Preprocessing:
The dataset used in this study is the Pima Indian Diabetes (PID) dataset, which originally
came from the National Institute of Diabetes and Digestive and Kidney Diseases. This dataset
has been used widely to predict whether a patient has diabetes based on the eight diagnostic
measurements described below:

 Pregnancies: Number of times pregnant
 Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
 BloodPressure: Diastolic blood pressure (mm Hg)
 SkinThickness: Triceps skin fold thickness (mm)
 Insulin: 2-Hour serum insulin (mu U/ml)
 BMI: Body mass index (weight in kg/(height in m)^2)
 DiabetesPedigreeFunction: Diabetes pedigree function
 Age: Age (years)

3.2.1.1 Data Exploration and Cleaning


There are several factors to consider in the data cleaning process. 
 Duplicate or irrelevant observations.
 Bad labeling of data, same category occurring multiple times.
 Missing or null data points.
 Unexpected outliers.
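A brief pandas sketch of how these checks can be performed on the loaded DataFrame df (variable name carried over from the earlier sketch):

# Sketch only: quick data-quality checks
print(df.duplicated().sum())     # duplicate or irrelevant observations
print(df["Outcome"].unique())    # label values, to spot bad labeling
print(df.isnull().sum())         # missing or null data points
print(df.describe())             # min/max/std, useful for spotting unexpected outliers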
Figure 4 Data Preprocessing and Exploration
Figure 5 Model training and selection
3.2.1.2 Data Visualization

 Histograms

Figure 6 Histogram of all features (all patients)

Figure 7 Histogram of all features (diabetic patients)


3.2.1.3 Plotting Correlation Plot (Heat Map)
Pearson's correlation coefficient is a test statistic that measures the statistical
relationship, or association, between two continuous variables. It is regarded as one of the
best methods of measuring the association between variables of interest because it is based
on the method of covariance. It gives information about the magnitude of the association, or
correlation, as well as the direction of the relationship.

The value of Pearson's correlation coefficient lies between −1 and +1. A value close to +1 or
−1 means the variables are highly correlated (positively or negatively), and 0 means there
is no correlation.

A heat map is a two-dimensional representation of information with the help of colors. Heat
maps can help the user visualize simple or complex information.
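A correlation heat map like the one in Figure 8 can be produced with a short seaborn sketch; the figure size and color map below are our choices, not necessarily those used for the figure.

# Sketch only: plotting choices (size, colormap) are illustrative
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")   # Pearson correlation by default
plt.title("Correlation between the features")
plt.show()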

Figure 8 Heat Map showing the correlation between the features


3.2.1.4 Pairplot
A pairplot plots pairwise relationships in a dataset. The pairplot function creates a grid of
axes such that each variable in the data is shared on the y-axis across a single row and on
the x-axis across a single column.
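A plot like Figure 9 can be generated with a one-line seaborn sketch; coloring by the Outcome column is an assumption about how the figure was produced.

# Sketch only: hue="Outcome" colors diabetic and non-diabetic points differently
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, hue="Outcome")
plt.show()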

Figure 9 Pair plot


3.2.1.5 Segregating Feature & Target Variable

In this process, the data is divided into the feature matrix X and the target variable y, as shown in the sketch below:
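A minimal sketch of this division, using the DataFrame df and the Outcome column from the dataset description:

# Sketch only: X holds the eight diagnostic features, y the class label
X = df.drop(columns=["Outcome"])
y = df["Outcome"]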

3.2.1.6 Train Test Split


The data used is usually split into training data and test data. The training set contains a
known output and the model learns on this data in order to be generalized to other data later
on. The test dataset (or subset) is used to test the model’s prediction on this subset.

Figure 10 Train test split

Stratify property in train test split: the stratify parameter makes a split so that the
proportion of values in the sample produced is the same as the proportion of values provided
to the stratify parameter. For example, if variable y is a binary categorical variable with
values 0 and 1 and there are 25% zeros and 75% ones, stratify=y will make sure that the
random split has 25% of 0's and 75% of 1's.
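A hedged sketch of the split described above; the 80/20 ratio matches Chapter 4, while the random_state value is an assumption.

# Sketch only: 80/20 split, stratified on the class label
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)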

3.2.1.7 Data Normalization


The authors used Z-score normalization. Z-scores are linearly transformed data values having
a mean of zero and a standard deviation of 1. Z-scores are also known as standardized scores;
they are scores (or data values) that have been given a common standard. As the final step
before applying machine learning, we normalize our inputs. Machine learning models often
benefit substantially from input normalization, and it also makes it easier to understand
the importance of each feature later, when looking at the model weights. We normalize the
data so that each variable has a mean of 0 and a standard deviation of 1. The process of
data normalization rescales all our data so that all the features are on a similar scale.
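Z-score normalization as described above can be sketched with scikit-learn's StandardScaler; fitting on the training data only and reusing those statistics for the test data is the usual practice assumed here.

# Sketch only: z = (x - mean) / std, with statistics learned from the training set
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn mean and std, then transform
X_test = scaler.transform(X_test)         # reuse the training statistics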

3.2.2 Applying Machine Learning Models and Evaluating their Performance


Typically, the primary goal of learning algorithms is to maximize the prediction accuracy or
equivalently minimize the error rate. However, in the specific medical application problem we study,
the ultimate goal is to alert and assist patients and doctors in taking further actions to prevent
hospitalizations before they occur, whenever possible. Thus, our models and results should be
accessible and easily explainable to doctors and not only machine learning experts.
With that in mind, we examine our models from two aspects: prediction accuracy and interpretability.
The prediction accuracy is captured in two metrics: the false alarm rate (how many patients were
predicted to be in the positive class, i.e., diabetic, while they truly were not) and the detection rate
(how many patients were predicted to be diabetic while they truly were). In the medical literature, the
detection rate is often referred to as sensitivity and the term specificity is used for one minus the false
alarm rate. Two other terms that are commonly used are the recall rate, which is the same as the
detection rate, and the precision rate, which is defined as the ratio of true positives (diabetic patients)
over all the predicted positives (true and false). [ CITATION The18 \l 1033 ]
For a binary classification system, the evaluation of the performance is typically illustrated with the
Receiver Operating Characteristic (ROC) curve, which plots the detection rate versus the false alarm
rate at various threshold settings. To summarize the ROC curve and be able to compare different
methods using only one metric, we used the Area Under the ROC Curve (AUC). An ideal classifier
achieves an AUC equal to 1 (or 100%), while a classifier that makes random classification decisions
achieves an AUC equal to 0.5 (or 50%). Thus, the “best” (most accurate) classification method will be
the one that achieves the highest AUC.
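The metrics discussed above (precision, recall, F1-score, ROC and AUC) can be computed with scikit-learn as sketched below for any fitted classifier; the variable names are assumptions carried over from the earlier sketches.

# Sketch only: evaluate a fitted classifier `model` on the held-out test set
from sklearn.metrics import auc, classification_report, roc_curve

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))     # precision, recall, F1-score per class

y_score = model.predict_proba(X_test)[:, 1]      # probability of the diabetic class
fpr, tpr, _ = roc_curve(y_test, y_score)         # false alarm rate vs. detection rate
print("AUC:", auc(fpr, tpr))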

 Logistic Regression 

Figure 11 Evaluating Performance of Logistic Regression.

The accuracy of Logistic Regression was 79.8%.

Figure 12 ROC Curve for Logistic Regression Model


 Support Vector Machine

Figure 13 Evaluating Performance of Support Vector Machine.

The accuracy of Support Vector Machine was 75.3%.

Figure 14 ROC Curve for Support Vector Machine


 Random Forest Classifier

Figure 15 Evaluating Performance of Random Forest Classifier.

The accuracy of Random Forest Classifier was 74%.

Figure 16 ROC Curve for Random Forest Classifier


 K-Nearest Neighbors

Figure 17 Evaluating Performance of K-Nearest Neighbors.

The accuracy of K-Nearest Neighbors was 72%.

Figure 18 ROC Curve for KNNeighbors Classifier


3.2.3 Model comparison

Figure 19 Comparison of different models

The figure above shows the comparison between accuracy of different models. Among the 4
models, Logistic Regression has the highest accuracy of 79.8%.

3.2.4 Feature Importance

Figure 20 Importance of features

Findings from the above figure:

 Moving down from the top of the graph, the importance of the features decreases.
 The features shown in green indicate that they have a positive impact on our prediction.
 The features shown in white indicate that they have no effect on our prediction.
 The features shown in red indicate that they have a negative impact on our prediction.

3.2.5 Save and Load Model


After building and evaluating the models, Logistic Regression, with an accuracy of 79.8%,
was chosen as the model for the web application, so it was saved as a pickle file with
the .pkl extension.
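A minimal sketch of saving and re-loading the chosen model with pickle; the file name diabetes_model.pkl is illustrative, not necessarily the name used in the project.

# Sketch only: persist the trained Logistic Regression model for the web application
import pickle

with open("diabetes_model.pkl", "wb") as f:
    pickle.dump(lr_model, f)

with open("diabetes_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)     # loaded later by the Flask application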
3.3 Web Application Architecture

The diagram below shows the web application architecture. The user loads the homepage, enters the
data, and then submits it. The client side sends a request to the web server and the web server
responds to the client's request. The web server then fetches the trained model and returns the
prediction result to the user.

Figure 21 Web application architecture
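A hedged sketch of how such a Flask back end can serve this flow is shown below. The /predict route and the form field names "1" to "8" match the template in Section 4.2.1, while the pickle file name, the template names, the feature ordering and the absence of a scaling step are assumptions rather than the project's exact code.

# Sketch only: minimal Flask app that loads the saved model and serves predictions
import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)
model = pickle.load(open("diabetes_model.pkl", "rb"))    # assumed file name

@app.route("/")
def home():
    return render_template("index.html")                 # assumed template name

@app.route("/predict", methods=["POST"])
def predict():
    # the homepage form names its fields "1" .. "8"
    values = [float(request.form[str(i)]) for i in range(1, 9)]
    # if a scaler was fitted during training, it should be applied to `values` here as well
    prediction = model.predict(np.array(values).reshape(1, -1))[0]
    message = "Likely diabetic" if prediction == 1 else "Unlikely to be diabetic"
    return render_template("result.html", pred=message)  # result page renders {{ pred }}

if __name__ == "__main__":
    app.run(debug=True)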

3.4 Development Tools 

3.4.1 Visual Studio Code 


Visual Studio Code (VS Code) is a lightweight but powerful source code editor which runs
on your desktop and is available for Windows, macOS and Linux. It comes with built-in
support for JavaScript, TypeScript and Node.js and has a rich ecosystem of extensions for
other languages (such as C++, C#, Java, Python, PHP, Go) and runtimes (such as .NET and
Unity). Visual Studio Code is a source-code editor developed by Microsoft for Windows,
Linux and macOS. It includes support for debugging, embedded Git control and GitHub,
syntax highlighting, intelligent code completion, snippets, and code refactoring. It is highly
customizable, allowing users to change the theme, keyboard shortcuts, preferences, and
install extensions that add additional functionality. As a developer, the code editor is one of
the most important parts of the setup. Visual Studio Code combines the ease of use of a
classic lightweight text editor with more powerful IDE-type features and very minimal
configuration, which appealed to us quite a lot. That is why, without a second thought, this
text editor was our top choice. [ CITATION Vis \l 1033 ]

3.4.2 Python and Flask 


Python is an interpreted, high-level, general-purpose programming language. Created by
Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code
readability with its notable use of significant whitespace. Its language constructs and object-
oriented approach aim to help programmers write clear, logical code for small and large-
scale projects. Python is dynamically typed and garbage-collected. It supports multiple
programming paradigms, including procedural, object-oriented, and functional programming.
Python is often described as a "batteries included" language due to its comprehensive
standard library.[ CITATION Sam19 \l 1033 ]
Flask is a micro web framework written in Python. It is classified as a microframework
because it does not require particular tools or libraries. It has no database abstraction layer,
form validation, or any other components where pre-existing third-party libraries provide
common functions. However, Flask supports extensions that can add application features as if
they were implemented in Flask itself. Extensions exist for object-relational mappers, form
validation, and upload handling, various open authentication technologies and several
common framework related tools. Extensions are updated far more regularly than the core
Flask program. Applications that use the Flask framework include Pinterest, LinkedIn, and
the community web page for Flask itself.[ CITATION Wik \l 1033 ]
Flask does not require the use of any particular libraries or tools. It has no database
abstraction layer, form validation, or other such components, so one can use any compatible
components or libraries. In addition, Python and Flask are easy to learn and use. Python and
Flask are used for creating RESTful APIs which, once hosted online, can be easily accessed
through any device connected to the internet.

3.4.3 Heroku 
Heroku is a cloud platform as a service (PaaS) supporting several programming languages.
One of the first cloud platforms, Heroku has been in development since June 2007, when it
supported only the Ruby programming language, but now supports Java, Node.js, Scala,
Clojure, Python, PHP, and Go. For this reason, Heroku is said to be a polyglot platform as it
has features for a developer to build, run and scale applications in a similar manner across
most languages.[ CITATION Wik1 \l 1033 ]

3.4.4 Jupyter Notebook 


The Jupyter Notebook App is a server-client application that allows editing and running
notebook documents via a web browser. The Jupyter Notebook App can be executed on a
local desktop requiring no internet access (as described in this document) or can be installed
on a remote server and accessed through the internet. 
In addition to displaying/editing/running notebook documents, the Jupyter Notebook App has
a “Dashboard” (Notebook Dashboard), a “control panel” showing local files and allowing
opening notebook documents or shutting down their kernels. 
Notebook documents are documents produced by the Jupyter Notebook App, which contain
both computer code (e.g. python) and rich text elements (paragraph, equations, figures, links,
etc…). Notebook documents are both human-readable documents containing the analysis
description and the results (figures, tables, etc.) as well as executable documents which can
be run to perform data analysis.[ CITATION Jup \l 1033 ]

3.5 Project Schedule


A project proposal is a core document that helps you sell a potential project to sponsors and
stakeholders.

The Gantt chart below covers the project phases between August 2019 and January 2021:
Proposal Writing, Draft of Chapter 1, Data Collection, Training Dataset and Preprocessing,
Feature Extraction, Draft of Chapter 2, Training of Data, Testing, Web Application
Development, Draft of Chapters 3 and 4, and Final Draft.

Figure 22 Gantt chart


CHAPTER 4: RESULT ANALYSIS
4.1 Experimental Results and Observations
Table 3 Results in terms of accuracy

Algorithms                    Training Set    Testing Set    Accuracy
Logistic Regression           0.76            0.80           79.8%
Support Vector Machine        0.80            0.75           75.3%
Random Forest Classifier      1.00            0.73           74%
KNNeighbors Classifier        0.83            0.73           72.7%

In the experimental studies, the dataset has been partitioned 80%–20% for training and
testing purposes. Table 3 shows that Logistic Regression, being the simplest classifier, has
performed well with an accuracy of 79.8%. ROC curves are plotted for all the algorithms; the
more area covered, the better the classifier. These measurements are taken using the sklearn
tool on the Pima Indian Diabetes dataset from the UCI repository. The results are shown in
Table 4. The results may be improved by applying larger, updated datasets from a realistic
context. However, we need to apply other machine learning algorithms using real datasets
before generalizing the results.

Table 4 Results in terms of Precision, Recall, F1-Score and AUC

Algorithms                    Precision   Recall   F1-Score   AUC
Logistic Regression           0.80        0.78     0.79       0.87
Support Vector Machine        0.76        0.74     0.74       0.84
Random Forest Classifier      0.74        0.73     0.73       0.82
K-Nearest Neighbors           0.72        0.71     0.71       0.76
Figure 23 Comparative analysis based on accuracy of algorithms
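
A sketch of how the measurements in Tables 3 and 4 can be produced with scikit-learn is
given below. It follows the 80%–20% split described above; the file name, column names and
random seed are assumptions, and the project's exact preprocessing is not reproduced.

# Sketch of the evaluation procedure: 80/20 split plus accuracy, precision,
# recall, F1-score and AUC computed with scikit-learn (illustrative only).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

df = pd.read_csv("pima_diabetes.csv")              # file name is an assumption
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)          # 80% training, 20% testing

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]           # probabilities for the AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))

The same procedure, with the classifier swapped for SVC, RandomForestClassifier or
KNeighborsClassifier, yields the remaining rows of the tables.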

4.2 System Results


Diabetes Prediction is a web application based on Python and Flask with a user-friendly
interface, which helps users predict whether they may suffer from diabetes in the near future.
As more people enter professional jobs, they are busy with work all day; with such busy
schedules, they rely on fast food and do not get proper exercise, which has a negative impact
on their health.

As the saying goes, “Health is wealth”, so the main objective of this application is to predict
the chances of getting diabetes in the near future and to make users more aware of their
health and well-being. The user interface is simple and easily accessible: users input a few
values and get the result in an instant.
4.2.1 Screenshots and Code
 
Code to display homepage:

<form action="/predict" method="post" class="col s12">
  <div class="row">

    <div class="input-field col s4">
      <label for="glucose"><b>Glucose</b></label>
      <input id="glucose" name="1" placeholder="Glucose level in sugar"
             type="number" step="0.01" class="validate">
    </div>

    <div class="input-field col s4">
      <label for="age"><b>Age</b></label>
      <input id="age" name="2" placeholder="Age" type="number" class="validate">
    </div>

    <div class="input-field col s4">
      <label for="bmi"><b>BMI</b></label>
      <input id="bmi" name="3" placeholder="Body Mass Index"
             type="number" step="0.01" class="validate">
    </div>

    <div class="input-field col s4">
      <label for="pregnancies"><b>Pregnancies</b></label>
      <input id="pregnancies" name="4" placeholder="No. of Pregnancies"
             type="number" class="validate">
    </div>

    <div class="input-field col s4">
      <label for="insulin"><b>Insulin</b></label>
      <input id="insulin" name="5" placeholder="Insulin level"
             type="number" step="0.01" class="validate">
    </div>

    <div class="input-field col s4">
      <label for="skin_thickness"><b>Skin Thickness</b></label>
      <input id="skin_thickness" name="6" placeholder="Skin Thickness"
             type="number" step="0.01" class="validate">
    </div>

    <div class="input-field col s4">
      <label for="dpf"><b>Diabetes Pedigree Function</b></label>
      <input id="dpf" name="7" placeholder="Diabetes Pedigree Function"
             type="number" step="0.01" class="validate">
    </div>

    <div class="input-field col s4">
      <label for="blood_pressure"><b>Diastolic Blood Pressure</b></label>
      <input id="blood_pressure" name="8" placeholder="Diastolic Blood Pressure"
             type="number" class="validate">
    </div>

  </div>
  <!-- Note: the duplicated id="first_name" attributes of the original excerpt have been
       replaced with unique ids so that each label's "for" matches its input.
       The submit button of the full template is not shown in this excerpt. -->
</form>

Figure 24 Screenshot showing the homepage 


 

Figure 25 Screenshot showing user entering data

 
Code to display output page:

<nav class="light-blue lighten-1" role="navigation">
  <div class="nav-wrapper container">
    <a id="logo-container" href="/" class="brand-logo">Diabetes Prediction</a>
    <ul class="right hide-on-med-and-down">
      <li><a href="/">Home</a></li>
    </ul>
  </div>
</nav>

<div class="row" style="margin: 15% 0% 0% 10%">
  <h3>{{ pred }}</h3>
</div>
<br><br><br>
 
Figure 26 Screenshot showing result page
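
The two templates above are wired together by a Flask route that reads the eight form fields
(named "1" to "8" in the form), feeds them to the trained model and renders the result into the
{{pred}} placeholder. A minimal sketch is shown below; the template and model file names
are assumptions rather than the project's exact code.

# Minimal sketch of the route behind the form and result pages shown above.
# Template and model file names are assumptions.
import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:                  # trained classifier
    model = pickle.load(f)

@app.route("/")
def home():
    return render_template("index.html")            # page containing the input form

@app.route("/predict", methods=["POST"])
def predict():
    # The form inputs are named "1" .. "8" (Glucose, Age, BMI, ...).
    values = [float(request.form[str(i)]) for i in range(1, 9)]
    result = model.predict(np.array(values).reshape(1, -1))[0]
    message = ("You may be at risk of diabetes." if result == 1
               else "You are unlikely to be at risk of diabetes.")
    return render_template("result.html", pred=message)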

4.3 Critical Analysis


Although technology is advancing and being adopted across sectors all over the world, Nepal
still lags behind in many technological aspects. One such case is the medical sector, where
technology is rarely used. Healthcare institutions maintain large databases that may contain
structured, semi-structured or unstructured data. Big data analytics is the process of analysing
such huge data sets to reveal hidden information and patterns and to discover knowledge from
the data.

As we worked on the project, both of us came to many realizations, as there were plenty of
difficulties. The first and foremost problem arose while collecting data. Due to the current
situation, on-site data collection was not an option, so we had to contact hospitals and
research centers via telephone and email. Some of them responded while others did not, and
even those who responded did not agree to provide us data due to their policies. Using
Google Forms was an alternative, but we did not get an adequate amount of data, so we were
bound to use the PIMA Indian Diabetes Database for our project.

Selecting appropriate algorithms for the analysis was also a challenge, but in the end simple
and widely used algorithms were selected and compared based on their performance. For
ease of understanding, only default parameters were passed to the models; to increase their
performance, appropriate parameters need to be fitted to the models.

For the successful completion of the project, various tools and programming languages were
used. We chose Python as the programming language as it is easy to use, powerful, and
versatile, making it a great choice for beginners and experts alike. For model training and
selection, we used Jupyter Notebook, as it allows the analysis process to be illustrated step by
step by arranging code, images, text and output in order, and it helps document the thought
process while developing the analysis. The vast selection of libraries and modules in Python,
such as NumPy, Pandas and Matplotlib, allowed us to select and build models in a simple
way. For web application development, we used Flask, a web framework; as Flask provides
various tools, libraries and technologies, it allowed us to build the web application easily.

The overall system was designed to make users more conscious of their health and lifestyle.
Users can become aware of their health condition and act accordingly rather than facing the
problem in the future.

Although the project is complete, there are many areas where it can be improved. Due to the
present circumstances, more features could not be added and have therefore been deferred to
future enhancements.

4.4 Limitations and Future Enhancements


4.4.1 Limitations
No matter how much one tries to incorporate all desired services into an application, it
remains subject to change as new advanced features arrive or new demands arise. Because of
this, we tried to incorporate all the services thought of in our initial phases and as the project
advanced, and we have completed all of our objectives. However, there are certain
limitations to our project, which are listed here:
 The project relies on data, so the insufficient amount of data led to decreased accuracy.
 The PIMA Indian Diabetes Dataset was used, so the system is dependent on this dataset.
4.4.2 Future Enhancements
After the completion of the project, we have come up with ideas about how things could be
done better or differently so as to achieve higher accuracy of the application:
 Fitting the right parameters to the models to increase their performance (a brief sketch
of such tuning follows this list).
 Feature engineering to increase the accuracy of the models.
 Collecting data in the context of Nepal.
 Providing suggestions, based on the results, about what can be done to control the
disease.
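
As a pointer towards the first enhancement, hyperparameter tuning could be approached with
a cross-validated grid search. A minimal sketch is given below; the parameter grid, file name
and scoring choice are only examples, not a prescription.

# Sketch of hyperparameter tuning for Logistic Regression with GridSearchCV
# (illustrative only; the parameter grid and file name are assumptions).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("pima_diabetes.csv")               # file name is an assumption
X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "C": [0.01, 0.1, 1, 10],                        # inverse regularisation strength
    "solver": ["lbfgs", "liblinear"],
}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV AUC    :", round(search.best_score_, 3))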
CHAPTER 5: CONCLUSION
The models were trained using four different algorithms, namely Logistic Regression,
Support Vector Machine, Random Forest Classifier and K-Nearest Neighbors, which
achieved accuracies of 79.8%, 75.3%, 74% and 72.7% respectively.

In terms of precision, Logistic Regression, Support Vector Machine, Random Forest
Classifier and K-Nearest Neighbors scored 0.80, 0.76, 0.74 and 0.72 respectively.

In terms of recall, the algorithms scored 0.78, 0.74, 0.73 and 0.71 respectively, in the same
order.

In terms of F1-score, they scored 0.79, 0.74, 0.73 and 0.71 respectively.

In terms of AUC, they scored 0.87, 0.84, 0.82 and 0.76 respectively.

Comparing all of these metrics across the algorithms, we found that Logistic Regression
scored the highest. We therefore chose Logistic Regression for training our final model and
implementing it in our web application.

A lot of effort and research finally led to the completion of this project. The tremendous
support from our supervisor, teachers, friends and others has been an overwhelming
experience for us. From the initial brainstorming to the research on similar international
projects, it has been a thorough learning experience.
Along with the completion of this project, we gained essential experience in teamwork. We
learnt about and got the opportunity to explore many new and interesting tools, programming
and design concepts, their implementation and their usage. Without our teamwork, the
project would not have been possible.

Managing time to complete this project was a significant challenge for us. It would have been
very difficult without the support and consultation provided by our teachers, friends and
supervisor. Though we have tried our best to make this report efficient, effective and
purposeful, there may still be some drawbacks, and hence advice and suggestions are
welcome for the correction and improvement of this project.
REFERENCES
[1] C. H. W. R. E. J. B. M. K. Adam G. Tabák, "Prediabetes: A high-risk state for developing
diabetes," National Center for Biotechnology Information, 2014.
[2] "U.S. Department of Health and Human Services," [Online].
[3] V. K. Harleen Kaur, "Predictive modelling and analytics for diabetes using a machine learning
approach," Applied Computing and Informatics, 2020.
[4] R. R. Poudel, "Diabetes and Endocrinology in Nepal," US National Library of Medicine, 2018.
[5] N. S. T. Gangil, "Analysis of diabetes mellitus for early prediction using optimal features
selection," J Big Data.
[6] K. R. a. A. Parajuli, "Diabetes in Nepal," HERD international, 2018.
[7] D. D. SinghSisodiab, "Prediction of Diabetes using Classification Algorithms," Procedia
Computer Science, vol. 132, 2018.
[8] P. Pradhan, "A Genetic Programming Approach…".
[9] T. A. R. S. A. R. M. Abdullah, "An Intelligent Approach for Diabetes Classification, Prediction
and Description," 2015.
[10] D. S. S. Deepti Sisodia, " Prediction of Diabetes using Classification Algorithms," International
Conference on Computational Intelligence and Data Science.
[11] J. p. K. S. Balamurali, "Performance Analysis of Classifier Models to Predict Diabetes
Mellitus," Procedia Computer Science , 2015.
[12] K. R. S. N. Yuvaraj, "Diabetes prediction in healthcare systems using machine learning
algorithms on Hadoop cluster," Cluster Computing, no. 1, 2019.
[13] N. P. B. K. Z. Tafa, "An intelligent system for diabetes prediction," 4th Mediterranean
Conference on Embedded Computing (MECO), 2015.
[14] D. S. V. Mitushi Soni, "Diabetes Prediction using Machine Learning Techniques,"
International Journal of Engineering Research & Technology (IJERT), vol. 09, no. 09.
[15] "Skill Promise," [Online]. Available:
http://skillpromise.lexiconcpl.com/dev/node/program/978.
[16] R. L. |. B. A. Englund, "The complete project manager," Project Management Institute., 2012.
[17] "Wrike," [Online]. Available: https://www.wrike.com/project-management-guide/faq/what-
are-project-management-tools/.
[18] T. X. T. W. W. D. Theodora S. Brisimi, "Predicting Chronic Disease Hospitalizations from
Electronic Health Records: An Interpretable Classification Approach," 2018.
[19] "Visual Studio Code," [Online]. Available: https://code.visualstudio.com/docs.
[20] S. Owino, "Data Driven Inspector," 2019. [Online]. Available:
https://medium.datadriveninvestor.com/python-programming-language-ac762a3b5977.
[21] "Wikipedia," [Online]. Available:
https://en.wikipedia.org/wiki/Flask_(web_framework)#:~:text=Flask%20is%20a%20micro
%20web,require%20particular%20tools%20or%20libraries.&text=Extensions%20exist%20for
%20object%2Drelational,several%20common%20framework%20related%20tools..
[22] "Wikipedia," [Online]. Available: https://en.wikipedia.org/wiki/Heroku.
[23] "Jupyter Notebook/Quick Guides," [Online]. Available: https://jupyter-notebook-beginner-
guide.readthedocs.io/en/latest/what_is_jupyter.html.
[25] World Health Organization, "Global report on Diabetes," World Health Organization, France,
2016.
[26] World Health Organization, 2015. [Online]. Available: https://www.who.int/health-
topics/diabetes.
[27] L. Y. R. S. K. M. Y. J. R. G. Ono K, " The prevalence of type 2 diabetes mellitus and impaired
fasting glucose," 2007.
[28] Marra.F., A Deep Learning process for Iris Model Identification, 2017.
