DIABETES DISEASE PREDICTION USING MACHINE LEARNING

A project report submitted in partial fulfillment of the

Requirements for the award of the Degree of

BACHELOR OF TECHNOLOGY

In

COMPUTER SCIENCE AND ENGINEERING

By

A.YAMUNA (186M1A0508)

P.S.V.ANISH (186M1A0570)

K.HEMA DURGA SAI (186M1A0536)

G.SUNITHA (186M1A0525)

Under the esteemed guidance of


Mr. A.VENKATA RAJU M.Tech.,
PROFESSOR, Department of CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
B. V. C. COLLEGE OF ENGINEERING
(Affiliated to JNTUK)
RAJAMAHENDRAVARAM,
ANDHRA PRADESH.
(2018-2022)
B.V.C COLLEGE OF ENGINEERING
PALACHARLA-533102
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the project entitled “DIABETES DISEASE PREDICTION USING MACHINE LEARNING” that is being
submitted by A.YAMUNA (186M1A0508), P.S.V.ANISH (186M1A0570), K.HEMA DURGA SAI
(186M1A0536), G.SUNITHA (186M1A0525) in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in COMPUTER SCIENCE AND ENGINEERING of
B.V.C COLLEGE OF ENGINEERING, affiliated to JAWAHARLAL NEHRU TECHNOLOGICAL
UNIVERSITY KAKINADA, is a bonafide work carried out by them during the academic year 2022.

Internal Guide Head of the Department


Mr.A.VENKATA RAJU, M.Tech., Mr.R.N.V. VISHNU MURTHY,M.Tech.,

Head of the Department, Assistant professor,

Department of CSE, HOD, Department of CSE,


B.V.C. COLLEGE OF ENGINEERING, B.V.C. COLLEGE OF ENGINEERING,
Rajamahendravaram. Rajamahendravaram

External Examiner
DECLARATION BY THE CANDIDATE

We, Ms. A. YAMUNA, Mr. P.S.V. ANISH, Mr. K. HEMA DURGA SAI and Ms. G. SUNITHA, bearing hall ticket numbers
186M1A0508, 186M1A0570, 186M1A0536 and 186M1A0525, hereby declare
that the project report titled “DIABETES DISEASE PREDICTION USING MACHINE LEARNING”, carried out under the
guidance of Mr. A. VENKATA RAJU, M.Tech., is submitted in partial fulfillment of the requirements for the
award of the degree of Bachelor of Technology in Computer Science and Engineering.
This is a record of bonafide work carried out by us. The results embodied in this project report have not been
reproduced or copied from any source and have not been submitted to any other University or Institute for the
award of any other Degree.

PROJECTEES
A.YAMUNA (186M1A0508)

P.S.V.ANISH (186M1A0570)

K.HEMA DURGA SAI (186M1A0536)

G.SUNITHA (186M1A0525)

ACKNOWLEDGEMENT

First and foremost, we sincerely salute our esteemed institution B.V.C. COLLEGE OF ENGINEERING for
giving us this golden opportunity to fulfil our dreams of becoming Engineers.

We would like to express our sincere gratitude to our main project internal guide, Mr. A. VENKATA RAJU,
M.Tech., Asst. Professor, for his guidance, encouragement and continuing support throughout the course of this
work.

We are highly obliged to our Head of the Department, Mr. R.N.V.V. VISHNU MURTHY, M.Tech., Assistant
Professor, for his constant inspiration, extensive help and valuable support at every step.

We owe a great deal to our Principal Dr. G. RAVIKANTH, M.Tech., Ph.D., MISTE., MIETE., MIAENG, for
extending a helping hand at every juncture of need.

Finally, we are pleased to acknowledge our indebtedness to all those who devoted themselves directly or
indirectly to make this project work a total success.
PROJECTEES

A.YAMUNA (186M1A0508)

P.S.V.ANISH (186M1A0570)

K.HEMA DURGA SAI (186M1A0536)

G.SUNITHA (186M1A0525)

J.N.T.U.K CODE: 6M    SBTET CODE: 347    COUNSELLING CODE: BVCR

B V C COLLEGE OF ENGINEERING
(Approved by AICTE, New Delhi, Affiliated to JNTUK Kakinada & Govt. of AP)
PALACHARLA (V), RAJAHMUNDRY - 533102, E.G. Dt. (AP)
Website: www.Isvcce.org

INSTITUTE VISION / MISSION

VISION :

To become a model abode of learning with time-trusted academic values for serving the nation and the world.

MISSION :
To provide academic infrastructure and create incubation centers.
To augment industry-institute interaction through research and skill development.
To build lit CU di4al3l0nc e and provide teaming etiquette tor all round growth
To promote innovative ideas, consultancy and knowledge hubs.
To expand the knowledge of stakeholders by involving them in workshops and training.

PRINCIPAL
BVC COLLEGE OF ENGINEERING PALACHARLA 533102

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT VISION
To emerge as a center for developing high quality education in the country through academic excellence,
preparing the students for leadership in their fields in a caring and challenging environment.

DEPARTMENT MISSION
M1: To nurture high quality education with strong foundation of technologies
in computer science and engineering through continuous development of
infrastructure that enables the students to meet the challenges.
M2: To provide an environment that values and encourages knowledge
acquisition and academic freedom, making this a preferred institution for
knowledge seekers.
M3: In collaboration with industries, developing professionals with necessary
communication skills and state-of-the-art technologies, team spirit, leadership
capabilities and social responsibilities with professional ethics and human
values to meet the standards.

HEAD OF THE DEPARTMENT

PROGRAM EDUCATIONAL OUTCOME (PEO)


PEO1: Graduates will be in the computing profession as experts in solving hardware/software engineering problems through
their depth of understanding of core computing knowledge, or will be pursuing research leading to higher degrees.

PEO2: Graduates will demonstrate creativity in their engineering practices including entrepreneurial and
collaborative ventures with strategic thinking, planning and execution.

PEO3: Graduates will communicate effectively, recognize and incorporate societal needs and constraints in their
professional endeavors, and practice their profession with regard to legal and ethical responsibilities.

PROGRAM OUTCOMES

PO1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an
engineering specialization to the solution of complex engineering problems.

PO2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.

PO3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the public health
and safety, and the cultural, societal, and environmental considerations.

PO4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to provide
valid conclusions.

PO5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modeling to complex engineering activities with an understanding of the
limitations.

PO6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health,
safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering
practice.

PO7. Environment and sustainability: Understand the impact of the professional engineering solutions in societal
and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.

PO8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.

PO9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.

PO10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions.

PO11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage projects
and in multidisciplinary environments.

PO12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES (PSO):


PSO1: Ability to apply mathematical methodologies to solve computational tasks and to model real-world problems using
appropriate data structures and suitable algorithms in different domains.

HEAD OF THE DEPARTMENT
ABSTRACT

Diabetes is a chronic disease with the potential to cause a worldwide health care crisis. According to the
International Diabetes Federation, 382 million people are living with diabetes across the world, and by 2035
this number is projected to rise to 592 million. Diabetes is a disease caused by an increased level of blood glucose. This
high blood glucose produces the symptoms of frequent urination, increased thirst, and increased hunger. Diabetes
is one of the leading causes of blindness, kidney failure, amputations, heart failure and stroke. When we eat, our
body turns food into sugars, or glucose. At that point, our pancreas is supposed to release insulin. Insulin serves
as a key to open our cells, to allow the glucose to enter and allow us to use the glucose for energy. But with
diabetes, this system does not work. Type 1 and type 2 diabetes are the most common forms of the disease, but
there are also other kinds, such as gestational diabetes, which occurs during pregnancy.
Machine learning is an emerging scientific field in data science dealing with the ways in which machines learn
from experience. The aim of this project is to develop a system which can perform early prediction of diabetes
for a patient with higher accuracy by combining the results of different machine learning techniques. The
algorithms used are K-Nearest Neighbour, Logistic Regression, Random Forest, Support Vector Machine and Decision
Tree. The accuracy of the model using each of the algorithms is calculated and the best-fitting one is
chosen.
DIABETES DISEASE PREDICTION USING MACHINE LEARNING

INTRODUCTION
Diabetes is a fast-growing disease, even among young people. To understand diabetes and
how it develops, we first need to understand what happens in a body without diabetes. Sugar (glucose) comes from
the foods that we eat, specifically carbohydrate foods. Carbohydrate foods provide our body with its main energy
source; everybody, even people with diabetes, needs carbohydrate. Carbohydrate foods include bread,
cereal, pasta, rice, fruit, dairy products and vegetables (especially starchy vegetables). When we eat these foods,
the body breaks them down into glucose. The glucose moves around the body in the bloodstream. Some of the
glucose is taken to our brain to help us think clearly and function. The remainder of the glucose is taken to the
cells of our body for energy and also to our liver, where it is stored as energy that is used later by the body. In
order for the body to use glucose for energy, insulin is required. Insulin is a hormone that is produced by the beta
cells in the pancreas. Insulin works like a key to a door. Insulin attaches itself to doors on the cell, opening the
door to allow glucose to move from the blood stream, through the door, and into the cell. If the pancreas is not
able to produce enough insulin (insulin deficiency) or if the body cannot use the insulin it produces (insulin
resistance), glucose builds up in the bloodstream (hyperglycemia) and diabetes develops. Diabetes Mellitus
means high levels of sugar (glucose) in the blood stream and in the urine.

Types of Diabetes:
Type 1 diabetes: the immune system is compromised and the cells fail to produce insulin in sufficient
amounts. There are no conclusive studies that establish the causes of type 1 diabetes, and there are currently no known
methods of prevention.

Type 2 diabetes: the cells produce a low quantity of insulin or the body cannot use the insulin
correctly. This is the most common type of diabetes, affecting about 90% of people diagnosed with diabetes. It is
caused by both genetic factors and lifestyle.

Gestational diabetes: appears in pregnant women who suddenly develop high blood sugar. In two thirds of the
cases, it reappears during subsequent pregnancies. There is a high chance that type 1 or type 2 diabetes will
occur after a pregnancy affected by gestational diabetes.

DEPARTMENT OF CSE, BVC COLLEGE OF ENGINEERING Page 1



Symptoms of diabetes:
• Frequent urination
• Increased thirst
• Tiredness/sleepiness
• Weight loss
• Blurred vision
• Mood swings
• Confusion and difficulty in concentrating
• Frequent infections
• Delayed healing of wounds
• Extreme fatigue

Causes of Diabetes:
Genetic factors are a major cause of diabetes. It is associated with at least two mutant genes on chromosome 6, the
chromosome that affects the response of the body to various antigens. Viral infection may also influence the
occurrence of type 1 and type 2 diabetes. Studies have shown that infection with viruses such as rubella, mumps,
hepatitis B virus, and cytomegalovirus increases the risk of developing diabetes.


LITERATURE SURVEY
A literature survey is the most important step in the software development process. Before developing the tool, it is
necessary to determine the time factor, the economy and the company strength. Once these are satisfied, the next
step is to determine which operating system and language can be used for developing the tool. Once the
programmers start building the tool, they need a lot of external support. This support can be obtained
from senior programmers, from books or from websites. Before building the system, the above considerations are
taken into account for developing the proposed system.
Earlier work uses classification on diverse types of datasets to decide whether a person is diabetic or
not. The diabetic patients' data set is established by gathering data from a hospital warehouse and contains two
hundred instances with nine attributes.

MACHINE LEARNING:

Machine Learning is the scientific study of algorithms and statistical models that computer systems use to
perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a
subset of artificial intelligence. Machine Learning algorithms build a mathematical model based on sample data,
known as the “training dataset”, in order to make predictions or decisions without being explicitly programmed to
perform the task.
Types of Machine Learning:


SUPERVISED LEARNING
It consists of a given set of input variables (training data) which are pre-labeled with target data. Using the input
variables, it generates a mapping function to map inputs to the required outputs. The parameter adjustment procedure
continues until the system achieves a suitable level of accuracy on the training data.
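
As a minimal sketch of the parameter-adjustment idea described above, the following pure-Python perceptron keeps adjusting its weights until it classifies the labelled training data correctly. The toy data (logical AND), the learning rate and the function names are illustrative assumptions and are not taken from the project.

```python
# Illustrative sketch: a perceptron learns a mapping from labelled inputs to outputs.
# The toy data and all names below are hypothetical, not the project's dataset.

def predict(weights, bias, x):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, labels, learning_rate=0.1, max_epochs=100):
    """Adjust weights until every training sample is classified correctly."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error != 0:
                errors += 1
                # Move the weights towards the correct answer for this sample.
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if errors == 0:  # suitable accuracy on the training data reached
            break
    return weights, bias

# Pre-labelled training data: the target is the logical AND of two binary inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

The loop stops as soon as a full pass over the training data produces no errors, which mirrors the stopping condition in the text.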

UNSUPERVISED LEARNING
In this approach we have only input data, without corresponding outcome data; the input data is not previously labeled. It is
used for grouping data by recognizing existing patterns or clusters in the input datasets.

REINFORCEMENT LEARNING
In this approach the machine is trained to map situations to actions, and reward or feedback
signals are generated in response. The machine trains itself to find the most rewarding actions through reward and punishment,
using past experience.


SYSTEM REQUIREMENTS

FUNCTIONAL REQUIREMENTS
A functional requirement defines a function of a software system and how the system must behave when presented
with specific inputs or conditions. These may include calculations, data identification and processing, and other
specific functionality.

NON-FUNCTIONAL REQUIREMENTS
Non-functional requirements, as the name suggests, are those requirements that are not directly concerned with
the specific functions delivered by the system. They may relate to emergent system properties such as reliability,
response time and store occupancy. Alternatively, they may define constraints on the system such as the capability of
the input/output devices and the data representations used in system interfaces. Many non-functional requirements
relate to the system as a whole rather than to individual system features. This means they are often more critical than the
individual functional requirements.

HARDWARE REQUIREMENTS
System       : Pentium IV 2.4 GHz
Hard Disk    : 40 GB
Floppy Drive : 1.44 MB
Monitor      : 14" Color Monitor
Mouse        : Optical Mouse
RAM          : 512 MB

SOFTWARE REQUIREMENTS
Operating System : Windows 7 Ultimate
Coding Language  : Python
Front-End        : Python
Database         : MySQL


SOFTWARE ENVIRONMENT

FRONT END
The entire user interface is planned to be developed using Python. Python is both a programming and a scripting
language.
ABOUT PYTHON
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first
released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant
whitespace. It provides constructs that enable clear programming on both small and large scales. Van Rossum led
the language community until stepping down as leader in July 2018.

Python features a dynamic type system and automatic memory management. It supports multiple programming
paradigms, including object-oriented, imperative, functional and procedural.

FEATURES OF PYTHON
EASY TO CODE
Python is very easy to code. Compared to other popular languages like Java and C++, it is easier to code in
Python. The basic syntax can be learned quickly, although mastering Python requires learning
its advanced concepts, packages and modules, which takes time. Thus, it is programmer friendly.

EASY TO READ
Being a high-level language, Python code reads much like English. Looking at it, you can tell what the code is
supposed to do. Also, Python mandates indentation to mark code blocks, which aids readability.

PORTABILITY
Let’s assume you’ve written a Python program for your Windows machine. If you want to run it on a Mac, you
usually don’t need to make changes to it. In other words, you can take one program and run it on any machine;
there is no need to write different code for different machines. This makes Python a portable language.

INTERPRETED
If you are familiar with languages like C++ or Java, you know that you must first compile a program and then run it. In Python,
there is no separate compilation step. Internally, the source code is converted into an intermediate form called bytecode,
so all you need to do is run your Python code without worrying about linking to libraries and similar details.
By interpreted, we mean that the source code is executed statement by statement rather than compiled all at once. Because of this, it is
easier to debug your code. Interpretation generally makes Python slower than compiled languages such as Java, but that cost is
usually outweighed by the benefits it has to offer.

OBJECT-ORIENTED
A programming language that can model the real world is said to be object oriented. It focuses on objects, which
combine data and functions. In contrast, a procedure-oriented language revolves around functions, which are
blocks of code that can be reused. Python supports both procedure-oriented and object-oriented programming, which is one
of its key features. It also supports multiple inheritance, unlike Java. A class is a blueprint for an
object. It is an abstract data type, and holds no values.
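
The following minimal sketch illustrates the ideas above: a class as a blueprint, an object as an instance that holds values, and multiple inheritance. All class names and values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: classes as blueprints, objects as instances, multiple inheritance.
# All class names here are hypothetical examples.

class Patient:
    """Blueprint combining data (name) with behaviour (describe)."""
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"Patient {self.name}"

class Reportable:
    """A second base class that adds reporting behaviour."""
    def report(self):
        return f"Report for {self.describe()}"

class DiabetesPatient(Patient, Reportable):
    """Inherits from two base classes at once (multiple inheritance)."""
    def __init__(self, name, glucose):
        super().__init__(name)
        self.glucose = glucose

# The class holds no values itself; the object created from it does.
patient = DiabetesPatient("A", 148)
print(patient.report())
print([cls.__name__ for cls in DiabetesPatient.__mro__])
```

The second print shows the method resolution order Python uses to look up attributes across the two base classes.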

EXTENSIBLE
If needed, you can write some parts of your Python program in other languages like C++. This makes Python an
extensible language, meaning that it can be extended with code written in other languages.

EMBEDDABLE
We just saw that we can use code written in other languages from our Python source code. However, it is also possible to put
our Python code in the source code of a different language like C++. This allows us to integrate scripting
capabilities into a program written in the other language.

LARGE STANDARD LIBRARY


Python ships with a large standard library that you can use so you don’t have to write your own code for every single
thing. There are libraries for regular expressions, documentation generation, unit testing, web browsers,
threading, databases, CGI, email, image manipulation, and a lot of other functionality.
DYNAMICALLY TYPED
Python is dynamically typed. This means that the type of a value is determined at runtime, not in advance. This is
why we don’t need to specify the type of data when declaring a variable.
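
A short sketch of dynamic typing, using a hypothetical variable name: the same name is bound first to an integer and then to a string, and the type is only known at runtime.

```python
# Illustrative sketch: the same name can refer to values of different types at runtime.
value = 120                      # bound to an int; no type declaration is needed
first_type = type(value).__name__

value = "120 mg/dL"              # the same name is now bound to a str
second_type = type(value).__name__

print(first_type, second_type)
```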
WHAT CAN PYTHON DO
Python can be used on a server to create web applications. Python can be used alongside software to create
workflows. Python can connect to database systems. It can also read and modify files.


WHY PYTHON
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.).
Python has a simple syntax similar to the English language.
Python has a syntax that allows developers to write programs with fewer lines than some other programming
languages.
Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means
that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a functional way.

PYTHON SYNTAX COMPARED TO OTHER PROGRAMMING LANGUAGES


Python was designed for readability, and has some similarities to the English language with influence from
mathematics.
Python uses new lines to complete a command, as opposed to other programming languages which often use
semicolons or parentheses.
Python relies on indentation, using whitespace, to define scope, such as the scope of loops, functions and classes.
Other programming languages often use curly brackets for this purpose.

PYTHON LIBRARIES
A module is a file containing Python code, and a package is a directory of sub-packages and
modules. A Python library is a reusable chunk of code that you may want to include in your programs or projects.
Unlike in languages such as C or C++, the term 'library' has no specific technical meaning in Python. Here, a
'library' loosely describes a collection of modules. A package is a library that can be installed using a package
manager such as pip, much as RubyGems or npm do for other languages. The libraries used in
this project are TensorFlow, Pandas, NumPy and Matplotlib.


TENSORFLOW

TensorFlow is an open-source deep learning library that is developed and maintained by Google. It offers
dataflow programming which performs a range of machine learning tasks. It was built to run on multiple CPUs or
GPUs and even mobile operating systems, and it has wrappers in several languages like Python, C++, and
Java.

FEATURES OF TENSORFLOW
1. Faster debugging with Python tools.
2. Dynamic models with Python control flow.
3. Support for custom and higher-order gradients.
4. TensorFlow offers multiple levels of abstraction, which helps you to build and train models.
5. TensorFlow allows you to train and deploy your model quickly, no matter what language or platform you use.
6. TensorFlow provides flexibility and control with features like the Keras Functional API and Model Subclassing API.
7. Well documented, so easy to understand.
8. Probably the most popular deep learning library, and easy to use with Python.

PANDAS
Pandas is an open-source library that is made mainly for working with relational or labeled data both easily and
intuitively. It provides various data structures and operations for manipulating numerical data and time series.
This library is built on top of the NumPy library. Pandas is fast and it has high performance & productivity for
users.
FEATURES OF PANDAS
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time series functionality.
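
The sketch below illustrates a few of the features listed above on a small, hypothetical table. The column names and values are assumptions made for the example and are not the project's dataset.

```python
# Illustrative sketch with a small hypothetical table; column names are assumptions.
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148.0, 85.0, None, 183.0],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 0, 1],
})

# Integrated handling of missing data: fill the gap with the column mean.
df["Glucose"] = df["Glucose"].fillna(df["Glucose"].mean())

# Label-based slicing: rows labelled 0 to 2 (inclusive) and two named columns.
subset = df.loc[0:2, ["Glucose", "Outcome"]]

# Group by the outcome label and aggregate.
mean_bmi = df.groupby("Outcome")["BMI"].mean()

# Columns can be inserted and deleted.
df["HighGlucose"] = df["Glucose"] > 140
df = df.drop(columns=["BMI"])

print(subset)
print(mean_bmi)
```

Filling with the column mean is only one possible way of handling a missing value; it is used here because it keeps the example short.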


4.7.3 NUMPY
NumPy (Numerical Python) is an open-source core Python library for scientific computing. It adds support for
large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions
to operate on these arrays. It is a general-purpose array and matrix processing package.
FEATURES OF NUMPY

• High-performance N-dimensional array object
• It contains tools for integrating code from C/C++ and Fortran
• It contains a multidimensional container for generic data
• Additional linear algebra, Fourier transform, and random number capabilities
• It provides broadcasting functions
• It has data type definition capability to work with varied databases
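
As a brief sketch of the array, broadcasting and linear algebra features listed above, the example below centres a small hypothetical data matrix and computes the length of each row. The numbers are made up for illustration.

```python
# Illustrative sketch: N-dimensional arrays, broadcasting and linear algebra.
import numpy as np

# Rows are hypothetical samples, columns are two hypothetical features.
data = np.array([[148.0, 33.6],
                 [85.0, 26.6],
                 [183.0, 23.3]])

# Broadcasting: the per-column mean (shape (2,)) is subtracted from every row of the (3, 2) array.
centered = data - data.mean(axis=0)

# Linear algebra: Euclidean length of each centred row.
row_norms = np.linalg.norm(centered, axis=1)

print(centered.shape, row_norms.shape)
```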

4.7.4 MATPLOTLIB

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Matplotlib makes easy things easy and hard things possible.

• Create publication quality plots.
• Make interactive figures that can zoom, pan, update.
• Customize visual style and layout.
• Export to many file formats.
• Embed in JupyterLab and Graphical User Interfaces.
• Use a rich array of third-party packages built on Matplotlib.


5. SYSTEM ANALYSIS
The Systems Development life cycle (SDLC), or Software Development Life Cycle in systems engineering,
information system and software engineering, is the process of creating or altering systems, and the models and
methodologies that people use to develop these systems. In Software engineering the SDLC concept underpins
many kinds of software development methodologies.

EXISTING SYSTEM

DECISION TREE:
A decision tree is a non-parametric supervised learning algorithm for regression and classification tasks. A decision
tree can be seen as a model made up of a root node, branches, and leaf nodes. Each internal node
represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class
label. The topmost node in the tree is the root node. First, an attribute is selected and placed at the root node. Then,
a branch is made for each possible value. This splits the dataset into subgroups, one for every value of the attribute.
The process is repeated recursively for each branch, using only those cases that reach the branch. When all
cases at a node have the same classification, growth of that part of the tree can stop. Usually, entropy or classification
error is used to choose the best split.
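
To make the entropy criterion mentioned above concrete, the following sketch computes the entropy of a set of class labels and the information gain of a candidate split. The labels and splits are hypothetical toy values, and information gain is used here as one common way of applying entropy; the report does not state which exact criterion was implemented.

```python
# Illustrative sketch: entropy and information gain for one candidate split.
# The labels and the splits are hypothetical toy values.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 0, 0]
perfect_split = [[1, 1], [0, 0]]   # each subgroup contains a single class
useless_split = [[1, 0], [1, 0]]   # each subgroup is as mixed as the parent

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that produces pure subgroups has the highest gain, so an attribute producing such a split would be preferred when growing the tree.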


To predict the class of a given record, the algorithm starts from the root node of
the tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on
the comparison, follows the branch and jumps to the next node. At the next node, the algorithm again compares
the attribute value and moves further down. It continues the process until it reaches a leaf
node of the tree.
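
The traversal described above can be sketched with a small hand-built tree. The attributes, thresholds and labels below are hypothetical and are not learned from the project data; the dictionary layout is just one convenient representation.

```python
# Illustrative sketch: predicting with a hand-built tree by walking from root to leaf.
# The attributes, thresholds and labels are hypothetical, not learned from real data.

tree = {
    "attribute": "glucose", "threshold": 140,
    "left": {"label": "non-diabetic"},            # taken when glucose <= 140
    "right": {                                    # taken when glucose > 140
        "attribute": "bmi", "threshold": 30,
        "left": {"label": "non-diabetic"},
        "right": {"label": "diabetic"},
    },
}

def predict_with_tree(node, record):
    """Compare the record's attribute at each node and follow the matching branch."""
    while "label" not in node:                    # stop when a leaf node is reached
        if record[node["attribute"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

print(predict_with_tree(tree, {"glucose": 150, "bmi": 35}))
print(predict_with_tree(tree, {"glucose": 120, "bmi": 35}))
```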

DISADVANTAGES:
• A decision tree may contain many layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
• For more class labels, the computational complexity of the decision tree may increase.

RANDOM FOREST:
Random Forest is one of the most commonly used ensemble (classifier integration) methods. A Random Forest is made up of
numerous separate Decision Tree classifiers that vote on test samples.
The steps are as follows:
1. Extract some samples from the training set as a training subset using the bootstrap method.
2. Randomly pick a number of features from the feature set for the training subset as the basis for
splitting each node of the Decision Tree.
3. Repeat steps 1-2 to generate a large number of training subsets and Decision Trees, which are then
combined to build a Random Forest.
4. Feed the test set’s samples into the Random Forest, where each Decision Tree makes a choice based
on the data. The results are then combined using a voting technique to determine
the classification of each sample.
5. Repeat step 4 until all of the test samples have been classified.
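
The steps above can be sketched in pure Python using one-level trees (decision stumps) in place of full decision trees. This is a simplified illustration under stated assumptions: the toy data, the seed, the number of trees and the helper names are hypothetical, and a real Random Forest grows deeper trees and samples features at every node.

```python
# Illustrative sketch of bagging and voting with one-level trees (decision stumps).
# Toy data, seed and helper names are hypothetical; a real forest grows deeper trees.
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, feature):
    """Split on the mean of one feature and store the majority label on each side."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (drawn with replacement) from the training set.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Step 2: pick a random feature as the basis for the split.
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest  # Step 3: the collection of trees is the forest.

def predict_forest(forest, row):
    """Step 4: each tree votes and the majority vote is the forest's prediction."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

rows = [[90, 22], [100, 24], [110, 25], [160, 33], [170, 35], [180, 36]]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict_forest(forest, [95, 23]), predict_forest(forest, [175, 34]))
```

Individual stumps trained on an unlucky bootstrap sample can be wrong, which is why the final decision is taken by voting across many trees.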


DISADVANTAGES:
• Although random forests can be an improvement on single decision trees, more sophisticated techniques
are available. Prediction accuracy on complex problems is usually inferior to gradient-boosted trees.
• A forest is less interpretable than a single decision tree. Single trees may be visualized as a sequence of
decisions.
• A trained forest may require significant memory for storage, due to the need for retaining the information
from several hundred individual trees.

PROPOSED SYSTEM
K-NEAREST NEIGHBOUR:
The k-nearest neighbour algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning
classifier which uses proximity to make classifications or predictions about the grouping of an individual data
point. While it can be used for either regression or classification problems, it is typically used as a classification
algorithm, working on the assumption that similar points can be found near one another. K-NN is
one of the simplest machine learning algorithms. It
assumes similarity between the new case and the available cases and puts the new case into the category that
is most similar to the available categories. The algorithm stores all the available data and classifies a new data
point based on similarity, so when new data appears it can easily be assigned to a well-suited
category. Being non-parametric, K-NN does not
make any assumption about the underlying data. It is also called a lazy learner algorithm because it does not learn from
the training set immediately; instead it stores the dataset and, at the time of classification, performs the computation on
the dataset.
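
A minimal pure-Python sketch of the classification procedure described above follows. The training points, labels and the choice k=3 are hypothetical, and Euclidean distance is assumed as the proximity measure.

```python
# Illustrative sketch of k-nearest-neighbour classification in pure Python.
# The toy training points, labels and k are hypothetical.
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k closest training points."""
    # Lazy learning: no model is built; distances are computed at prediction time.
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [(85, 22.0), (90, 24.5), (100, 26.0), (160, 33.0), (170, 35.5), (180, 31.0)]
train_labels = ["non-diabetic"] * 3 + ["diabetic"] * 3

print(knn_predict(train_points, train_labels, (95, 25.0)))
print(knn_predict(train_points, train_labels, (165, 34.0)))
```

In practice the features would normally be scaled first, because a feature with a large numeric range otherwise dominates the distance.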

ADVANTAGES:
 Simple to implement and intuitive to understand.
 Can learn non-linear decision boundaries for both classification and regression. The flexibility of the
decision boundary is controlled by the value of K.
 No training time: the KNN algorithm has no explicit training step, and all the work happens during
prediction.
 Constantly evolves with new data: since there is no explicit training step, predictions adjust as new data is
added to the dataset, without having to retrain a model.
 Few hyperparameters: the main hyperparameter is the value of K, which makes hyperparameter tuning
easy.
 Choice of distance metric: there are many distance metrics to choose from. Popular choices include
Euclidean, Manhattan, Minkowski and Hamming distance.
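
As a rough illustration of the distance metrics named in the last point, the sketch below writes them as plain Python functions. The function names are chosen for this example only. Euclidean and Manhattan distance are expressed as the Minkowski distance of order 2 and order 1.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    # Minkowski distance with p = 2
    return minkowski(a, b, 2)

def manhattan(a, b):
    # Minkowski distance with p = 1
    return minkowski(a, b, 1)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = (0, 0), (3, 4)
print(euclidean(a, b))            # 5.0
print(manhattan(a, b))            # 7.0
print(hamming("10110", "10011"))  # 2
```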

FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general
plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system
is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis,
some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
 ECONOMICAL FEASIBILITY
 TECHNICAL FEASIBILITY
 SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount
of funds that the company can pour into the research and development of the system is limited. The expenditures
must be justified. The developed system is well within the budget, and this was achieved because most of
the technologies used are freely available. Only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any
system developed must not place a high demand on the available technical resources, since this would in turn
place high demands on the client. The developed system must have modest requirements, so that only minimal or
no changes are required to implement it.


SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of
training the user to use the system efficiently. The user must not feel threatened by the system, but must instead
accept it as a necessity. The level of acceptance by the users depends solely on the methods that are employed to
educate users about the system and to make them familiar with it. Their level of confidence must be raised so
that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.

SYSTEM DESIGN
System design is the transition from a user-oriented document to a document oriented to programmers or database
personnel. The design is a solution, a "how to" approach to the creation of a new system. It is composed of several
steps and provides the understanding and procedural details necessary for implementing the system recommended
in the feasibility study.

Designing goes through logical and physical stages of development. Logical design reviews the present physical
system, prepares the input and output specifications, details the implementation plan and prepares a logical design
walkthrough. Then, in the input and output screen design, the design should be made user friendly. The menu
should be precise and compact.

Systems design is the process of defining the elements of a system, such as modules, architecture, components and
their interfaces and data, based on the specified requirements. It is the process of defining, developing and
designing systems that satisfy the specific needs and requirements of a business or organization.

A systematic approach is required for a coherent and well-running system. A Bottom-Up or Top-Down approach is
required to take into account all related variables of the system. A designer uses modeling languages to express
the information and knowledge in a structure of the system that is defined by a consistent set of rules and
definitions. The designs can be defined in graphical or textual modeling languages.



SYSTEM ARCHITECTURE

DATASET
A dataset is a collection of data. The project uses a train dataset and a test dataset.
The detailed information about the dataset is as follows: number of categories: 5, with 400 files.
The dataset is validated with an accuracy of 93% to increase the performance of the system.

Fig -9 : System Architecture


The diabetes data was taken from the testing records of different hospitals and consists of people who were
tested for diabetes with different symptoms. The data has 768 readings. Each person's basic health
metrics and symptoms were checked in the process of testing whether the person had diabetes or not. The
attributes recorded for each person include Pregnancies, Glucose, BloodPressure, Body Mass Index, Diabetes
Pedigree Function (history of diabetes in the ancestors), Age, Skin Thickness and so on. The records were passed
to pre-processing, where the data was cleaned of redundancy, noise, garbage and null values. The data was then
split into training and testing data. The training data was given to a machine learning algorithm to learn the
hidden relationships in the data and determine whether a given person has diabetes or not. Decision Tree and
Random Forest were also tried, but they gave less accurate results. K-Nearest Neighbors yielded 79% accuracy,
so it was used as the primary algorithm to determine whether a person is suffering from diabetes or not. After
the training phase, the model was tested with labelled data, where the outcome is known, to cross-validate the
readings and outcomes. Once the model yielded satisfactory results, it was used as the trained model.
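
The flow described above (clean the records, split them, classify with K-NN and check the accuracy on held-out data) can be sketched as follows. This is only an illustration under stated assumptions: the helper names (clean, split, predict, accuracy), the two features per record and every data value are invented for the example, and the split is a simple shuffle rather than the stratified split used in the sample code.

```python
import math
import random

def clean(records):
    """Drop duplicate records and records that contain a missing (None) value."""
    seen, kept = set(), []
    for rec in records:
        if None in rec or rec in seen:
            continue
        seen.add(rec)
        kept.append(rec)
    return kept

def split(records, test_fraction=1/3, seed=42):
    """Shuffle the records and split them into training and testing lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict(train, features, k=3):
    """k-NN majority vote; each record is (glucose, bmi, outcome)."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[:2], features))[:k]
    return int(sum(rec[2] for rec in nearest) * 2 > k)

def accuracy(train, test, k=3):
    """Fraction of test records whose predicted outcome matches the known one."""
    hits = sum(predict(train, rec[:2], k) == rec[2] for rec in test)
    return hits / len(test)

# Made-up records: (glucose, bmi, outcome); one duplicate and one missing value
records = [
    (85, 22.0, 0), (90, 24.5, 0), (95, 23.1, 0), (88, 21.7, 0), (99, 25.0, 0), (92, 26.3, 0),
    (160, 33.0, 1), (170, 35.2, 1), (155, 31.8, 1), (165, 36.1, 1), (158, 34.4, 1), (172, 32.9, 1),
    (85, 22.0, 0), (None, 30.0, 1),
]
cleaned = clean(records)
train, test = split(cleaned)
print(len(cleaned), len(train), len(test))  # 12 8 4
print(accuracy(train, test))
```

The toy classes are far apart, so the accuracy here says nothing about real data; the 79% reported above comes from the project's own dataset.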


The data was taken as a CSV file containing all the readings. The input is a CSV file with the readings of
multiple persons, and the output is a graphical and statistical representation of how many people in that
data have diabetes.
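
The statistical part of that output could look roughly like the sketch below, which counts diabetic and non-diabetic rows in a CSV. It assumes a column named Outcome holding 1 or 0, as in the sample code; the function name summarize_outcomes and the inline sample rows are made up for illustration.

```python
import csv
import io

# Stand-in for the contents of the input CSV file (values are made up)
SAMPLE_CSV = """Glucose,BMI,Age,Outcome
148,33.6,50,1
85,26.6,31,0
183,23.3,32,1
89,28.1,21,0
"""

def summarize_outcomes(csv_file):
    """Count diabetic and non-diabetic rows in a CSV with an Outcome column."""
    counts = {"diabetic": 0, "non_diabetic": 0}
    for row in csv.DictReader(csv_file):
        key = "diabetic" if row["Outcome"] == "1" else "non_diabetic"
        counts[key] += 1
    total = counts["diabetic"] + counts["non_diabetic"]
    # Share of diabetic rows, guarded against an empty file
    counts["diabetic_percent"] = round(100 * counts["diabetic"] / total, 1) if total else 0.0
    return counts

summary = summarize_outcomes(io.StringIO(SAMPLE_CSV))
print(summary)
```

For a file on disk, the same function can be passed an object returned by open("diabetes.csv", newline="").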

SOFTWARE DESIGN
INPUT DESIGN

The input design is the link between the information system and the user. It comprises developing the
specifications and procedures for data preparation, that is, the steps necessary to put transaction data into a
usable form for processing. This can be achieved by instructing the computer to read data from a written or
printed document, or by having people key the data directly into the system. The design of input focuses
on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps and
keeping the process simple. The input is designed so that it provides security and ease of use while
retaining privacy. Input design considered the following things:
What data should be given as input?
How should the data be arranged or coded?
The dialog to guide the operating personnel in providing input.
Methods for preparing input validations and the steps to follow when errors occur.
OBJECTIVES
Input design is the process of converting a user-oriented description of the input into a computer-based
system. This design is important to avoid errors in the data input process and to show the management the
correct direction for getting correct information from the computerized system.
It is achieved by creating user-friendly screens for data entry that can handle large volumes of data. The goal
of designing input is to make data entry easier and free from errors. The data entry screen is designed in
such a way that all the data manipulations can be performed. It also provides record viewing facilities.
When data is entered, it is checked for validity. Data can be entered with the help of screens.
Appropriate messages are provided as and when needed so that the user is not left confused. Thus the
objective of input design is to create an input layout that is easy to follow.
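
One possible shape for the validity check and the messages shown to the user is sketched below. The field list, the allowed ranges in FIELD_RANGES and the function name validate_entry are assumptions made for this example; the ranges are illustrative only and are not clinical limits or values taken from the project.

```python
# Hypothetical allowed range for each input field (illustrative, not clinical limits)
FIELD_RANGES = {
    "Glucose": (1, 500),
    "BloodPressure": (1, 250),
    "BMI": (1, 100),
    "Age": (1, 120),
}

def validate_entry(entry):
    """Return (cleaned_values, error_messages) for one data-entry form."""
    cleaned, errors = {}, []
    for field, (low, high) in FIELD_RANGES.items():
        raw = entry.get(field, "").strip()
        if not raw:
            errors.append(f"{field}: value is required")
            continue
        try:
            value = float(raw)
        except ValueError:
            errors.append(f"{field}: '{raw}' is not a number")
            continue
        if not low <= value <= high:
            errors.append(f"{field}: {value:g} is outside {low}-{high}")
            continue
        cleaned[field] = value
    return cleaned, errors

ok, ok_errors = validate_entry({"Glucose": "148", "BloodPressure": "72", "BMI": "33.6", "Age": "50"})
bad, bad_errors = validate_entry({"Glucose": "abc", "BloodPressure": "", "BMI": "33.6", "Age": "500"})
print(ok_errors)   # []
print(bad_errors)
```

Returning all messages at once, instead of stopping at the first error, lets the screen point out every field that needs correction in a single pass.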


OUTPUT DESIGN

A quality output is one that meets the requirements of the end user and presents the information clearly. In
any system, the results of processing are communicated to the users and to other systems through outputs. In
output design it is determined how the information is to be displayed for immediate need, and also the hard copy
output. It is the most important and direct source of information to the user. Efficient and intelligent output
design improves the system's relationship with the user and helps decision-making.
Designing computer output should proceed in an organized, well thought out manner. The right output must be
developed while ensuring that each output element is designed so that people find the system easy and effective
to use. When analysts design computer output, they should:
Identify the specific output that is needed to meet the requirements.
Select methods for presenting information.
Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following objectives:
Convey information about past activities, current status or projections of the future.
Signal important events, opportunities, problems, or warnings.
Trigger an action.
Confirm an action.

UML DIAGRAMS
UML CONCEPTS
The Unified Modeling Language (UML) is a general-purpose, developmental modeling language in the field of
software engineering that is intended to provide a standard way to visualize the design of a system.
The Unified Modeling Language (UML) is a standard language for writing software blueprints.
The UML is a language for
 Visualizing
 Specifying
 Constructing
 Documenting the artifacts of a software-intensive system.
The UML is a language that provides a vocabulary and the rules for combining words in that vocabulary for
the purpose of communication. A modeling language is a language whose vocabulary and rules focus on the
conceptual and physical representation of a system. Modeling yields an understanding of a system.


BUILDING BLOCKS IN UML


The vocabulary of the UML encompasses three kinds of building blocks:
Things
Relationships
Diagrams
Things are the abstractions that are first-class citizens in a model; relationships tie these things together;
diagrams group interesting collections of things.

THINGS IN THE UML


There are four kinds of things in the UML:
Structural things
Behavioral things
Grouping things
Annotation things
Structural things are the nouns of UML models. The structural things used in the project design are: First, a class
is a description of a set of objects that share the same attributes, operations, relationships and semantics.
Second, a use case is a description of a set of sequences of actions that a system performs that yields an
observable result of value to a particular actor.

Fig -11 : Use Cases


Third, a node is a physical element that exists at runtime and represents a computational resource, generally
having at least some memory and often processing capability.
Behavioral things are the dynamic parts of UML models. The behavioral thing used is:
Interaction
An interaction is a behavior that comprises a set of messages exchanged among a set of objects within a particular
context to accomplish a specific purpose. An interaction involves a number of other elements, including messages,
action sequences (the behavior invoked by a message), and links (the connection between objects).


Fig-13 :Message
RELATIONSHIPS IN THE UML
There are four kinds of relationships in the UML:
Dependency
Association
Generalization
Realization
A dependency is a semantic relationship between two things in which a change to one thing may affect the
semantics of the other thing (the dependent thing).

Fig-14 : Dependency
An association is a structural relationship that describes a set of links, a link being a connection among
objects. Aggregation is a special kind of association, representing a structural relationship between a whole and
its parts.

Fig-15 : Association
A generalization is a specialization/ generalization relationship in which objects of the specialized element (the
child) are substitutable for objects of the generalized element (the parent).

Fig-16 : Generalization
A realization is a semantic relationship between classifiers, wherein one classifier specifies a contract that
another classifier guarantees to carry out.
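
To make the four relationships more concrete, the sketch below maps each one onto ordinary Python classes. All class names (Predictor, ThresholdPredictor, User, Patient, Clinic) and the glucose threshold are hypothetical and serve only this illustration; they are not taken from the project's class diagram.

```python
from abc import ABC, abstractmethod

class Predictor(ABC):
    """Contract: any predictor must be able to predict an outcome."""
    @abstractmethod
    def predict(self, reading):
        ...

class ThresholdPredictor(Predictor):
    """Realization: fulfils the Predictor contract (hypothetical glucose rule)."""
    def __init__(self, limit):
        self.limit = limit

    def predict(self, reading):
        return int(reading["Glucose"] >= self.limit)

class User:
    def __init__(self, name):
        self.name = name

class Patient(User):
    """Generalization: a Patient can be used wherever a User is expected."""
    def __init__(self, name, reading):
        super().__init__(name)
        self.reading = reading

class Clinic:
    """Association: a Clinic keeps links to its Patient objects."""
    def __init__(self):
        self.patients = []

    def admit(self, patient):
        self.patients.append(patient)

    def screen(self, predictor):
        # Dependency: Clinic only uses a Predictor here, so a change to the
        # Predictor interface may affect this method.
        return {p.name: predictor.predict(p.reading) for p in self.patients}

clinic = Clinic()
clinic.admit(Patient("Asha", {"Glucose": 165}))
clinic.admit(Patient("Ravi", {"Glucose": 92}))
result = clinic.screen(ThresholdPredictor(limit=140))
print(result)  # {'Asha': 1, 'Ravi': 0}
```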

UML DIAGRAMS


USECASE DIAGRAM:

CLASS DIAGRAM:


IMPLEMENTATION

The programming language used for the project is Python. To run the application we need to create a
main window, and to get access to the database the user must register and log in with those credentials.

We implemented the interface in Python. The user needs to register and then log in. The dataset is
divided into training and testing data using the train_test_split function. After dividing the data, we applied the
K-Nearest Neighbors classifier.
The user enters the values through the window; the K-Nearest Neighbors algorithm processes the data, and the
output is displayed.

SAMPLE CODE:

#from mlxtend.plotting import plot_decision_regions


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
diabetes = pd.read_csv("diabetes.csv")
diabetes
## Display all the columns of the dataframe
pd.set_option('display.max_columns', None)
diabetes.head()
diabetes.info()
diabetes.shape
diabetes.describe().T
diabetes.isnull().sum()
check=diabetes[['Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
check.isin([0]).any().any()

check=check.replace(0,np.nan)
check.head()
check.isnull()
## Step 1: list the features that have missing values
features_with_na = [feature for feature in check.columns if check[feature].isnull().sum() > 0]
## Step 2: print each feature name and its percentage of missing values
for feature in features_with_na:
    print(feature, np.round(check[feature].isnull().mean() * 100, 2), '% missing values')
sns.heatmap(check.isnull(), yticklabels=False, cbar=False, cmap='viridis')
# proportion of diabetes patients (about 35% having diabetes)
diabetes.Outcome.value_counts()[1] / diabetes.Outcome.count()
# To analyse feature-outcome distribution in visualisation
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
            'DiabetesPedigreeFunction', 'Age']
ROWS, COLS = 2, 4
fig, ax = plt.subplots(ROWS, COLS, figsize=(18,8))
for i, feature in enumerate(features):
    # Fill the 2 x 4 grid of subplots row by row
    row, col = divmod(i, COLS)
    # diabetes[feature].hist(bins=35, color='green', alpha=0.5, ax=ax[row, col]).set_title(feature)  # show all, comment off below 2 lines
    diabetes[diabetes.Outcome==0][feature].hist(bins=35, color='red', alpha=0.5, ax=ax[row, col]).set_title(feature)
    diabetes[diabetes.Outcome==1][feature].hist(bins=35, color='yellow', alpha=0.7, ax=ax[row, col])
plt.legend(['No Diabetes', 'Diabetes'])
fig.subplots_adjust(hspace=0.3)
sns.set_style('whitegrid')
print(diabetes.Outcome.value_counts())
sns.countplot(x='Outcome', data=diabetes).set_title('Diabetes Outcome')
list_diabetes=[268,500]
list_labels=['Diabetic','Healthy']
plt.axis('equal')

plt.pie(list_diabetes,labels=list_labels,radius=2,autopct="%0.1f%%",shadow=True)
sns.distplot(diabetes['Glucose'],kde=True,color='darkred',bins=40)
sns.set()
import gc
from sklearn.neighbors import KernelDensity

def plot_prob_density(diabetes_Glucose, diabetes_BloodPressure, x_start, x_end):
    # x_start and x_end (assumed to be supplied by the caller) mark the interval
    # of interest with dashed red lines
    plt.figure(figsize = (10, 7))
    unit = 1.5
    x = np.linspace(diabetes_Glucose.min() - unit, diabetes_Glucose.max() + unit, 1000)[:, np.newaxis]
    # Plot the data using a normalized histogram
    plt.hist(diabetes_Glucose, bins=10, density=True, label='Glucose', color='orange', alpha=0.2)
    plt.hist(diabetes_BloodPressure, bins=10, density=True, label='BloodPressure', color='navy', alpha=0.2)
    # Do kernel density estimation (KernelDensity expects a 2-D array)
    glucose_2d = np.asarray(diabetes_Glucose, dtype=float)[:, np.newaxis]
    blood_pressure_2d = np.asarray(diabetes_BloodPressure, dtype=float)[:, np.newaxis]
    kd_Glucose = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(glucose_2d)
    kd_BloodPressure = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(blood_pressure_2d)
    # Plot the estimated density
    kd_vals_Glucose = np.exp(kd_Glucose.score_samples(x))
    kd_vals_BloodPressure = np.exp(kd_BloodPressure.score_samples(x))
    plt.plot(x[:, 0], kd_vals_Glucose, color='orange')
    plt.plot(x[:, 0], kd_vals_BloodPressure, color='navy')
    plt.axvline(x=x_start,color='red',linestyle='dashed')
    plt.axvline(x=x_end,color='red',linestyle='dashed')
    # Show the plots
    plt.xlabel('Glucose, BloodPressure', fontsize=15)
    plt.ylabel('Probability Density', fontsize=15)
    plt.legend(fontsize=15)
    plt.show()
    gc.collect()
    return kd_Glucose
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
plt.figure(figsize = (10, 7))
sns.distplot(diabetes['Glucose'], label='Glucose')
sns.distplot(diabetes['BloodPressure'], label='Blood Pressure')
plt.xlabel('Glucose,BloodPressure', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.legend(fontsize=15)
plt.show()
sns.distplot(diabetes['SkinThickness'],kde=False,color='darkred',bins=40)
sns.distplot(diabetes['Insulin'],kde=False,color='darkred',bins=40)
sns.distplot(diabetes['BMI'],kde=False,color='darkred',bins=40)
sns.countplot(x='Pregnancies', data=diabetes)
diabetes['BloodPressure'].hist(color='green',bins=40,figsize=(8,4))
sns.distplot(diabetes['BloodPressure'],kde=False,color='darkred',bins=40)
diabetic=diabetes.loc[diabetes['Outcome']==1]
non_diabetic=diabetes.loc[diabetes['Outcome']==0]
plt.plot(diabetic['Insulin'],np.zeros_like(diabetic['Insulin']),'o')
plt.plot(non_diabetic['Insulin'],np.zeros_like(non_diabetic['Insulin']),'o')
plt.xlabel('Insulin')
plt.show()
sns.FacetGrid(diabetes,hue='Outcome',height=6).map(plt.scatter,'Glucose','Insulin').add_legend()
plt.show()
# to visualise pair plot
sns.pairplot(diabetes, hue='Outcome', plot_kws=dict(alpha=.3, edgecolor='none'), height=2, aspect=1.1)
plt.show()
# Pearson correlation coefficient
diabetes.corr()
# Boolean mask that hides the upper triangle of the symmetric correlation matrix
mask = np.zeros_like(diabetes.corr(), dtype=bool)
triangle_indices = np.triu_indices_from(mask)
mask[triangle_indices] = True
mask
plt.figure(figsize=(16,10))
sns.heatmap(diabetes.corr(), mask=mask, annot=True, annot_kws={"size" : 14})
sns.set_style('white')
plt.xticks(fontsize=10)

plt.yticks(fontsize=14)
plt.show()
diabetes['Glucose'].mean()
diabetes['Glucose'].median()
diabetes['Glucose'] = diabetes['Glucose'].replace(0, diabetes['Glucose'].median())
diabetes['BloodPressure'] = diabetes['BloodPressure'].replace(0, diabetes['BloodPressure'].median())
diabetes['SkinThickness'] = diabetes['SkinThickness'].replace(0, diabetes['SkinThickness'].median())
diabetes['Insulin'] = diabetes['Insulin'].replace(0, diabetes['Insulin'].median())
diabetes['BMI'] = diabetes['BMI'].replace(0, diabetes['BMI'].median())
diabetes
sns.heatmap(diabetes.isnull() , yticklabels=False , cbar=False , cmap='viridis')
diabetes.plot(kind='box',figsize=(20,10),color='Green',vert=False)
plt.show()
SkinThickness_Outliers = diabetes['SkinThickness'].to_list()
Insulin_outliers = diabetes['Insulin'].to_list()
def detect_outliers(data):
    # Collect the values whose z-score exceeds the threshold. The list is local,
    # so results from one call do not leak into the next call.
    outliers = []
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
outlier_pt = detect_outliers(SkinThickness_Outliers)
outlier_pt
outlier_pt = detect_outliers(Insulin_outliers)
outlier_pt
diabetes=diabetes[diabetes['SkinThickness']<80]
diabetes=diabetes[diabetes['Insulin']<=600]
print(diabetes.shape)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

X = pd.DataFrame(sc_X.fit_transform(diabetes.drop(["Outcome"], axis=1)),
                 columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                          'BMI', 'DiabetesPedigreeFunction', 'Age'])
X.head()
#X = diabetes.drop("Outcome",axis = 1)
y = diabetes.Outcome
#importing train_test_split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=1/3,random_state=42, stratify=y)
from sklearn.neighbors import KNeighborsClassifier
test_scores = []
train_scores = []
for i in range(1,15):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    train_scores.append(knn.score(X_train,y_train))
    test_scores.append(knn.score(X_test,y_test))
## score that comes from testing on the same datapoints that were used for training
max_train_score = max(train_scores)
train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]
print('Max train score {} % and k = {}'.format(max_train_score*100, [i + 1 for i in train_scores_ind]))
## score that comes from testing on the datapoints that were split in the beginning to be used for testing solely
max_test_score = max(test_scores)
test_scores_ind = [i for i, v in enumerate(test_scores) if v == max_test_score]
print('Max test score {} % and k = {}'.format(max_test_score*100, [i + 1 for i in test_scores_ind]))
plt.figure(figsize=(12,5))
# Recent seaborn versions require x and y to be passed as keyword arguments
p = sns.lineplot(x=list(range(1,15)), y=train_scores, marker='*', label='Train Score')
p = sns.lineplot(x=list(range(1,15)), y=test_scores, marker='o', label='Test Score')
#Setup a knn classifier with k neighbors
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train,y_train)
knn.score(X_test,y_test)
#import confusion_matrix

from sklearn.metrics import confusion_matrix
#let us get the predictions using the classifier we had fit above
y_pred = knn.predict(X_test)
confusion_matrix(y_test,y_pred)
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
y_pred
y_pred = knn.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
# accuracy
print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
#import classification_report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
from sklearn.metrics import roc_curve
y_pred_proba = knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr, label='Knn')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('Knn(n_neighbors=11) ROC curve')
plt.show()
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_proba)
#import GridSearchCV
from sklearn.model_selection import GridSearchCV
#In case of classifier like knn the parameter to be tuned is n_neighbors
param_grid = {'n_neighbors':np.arange(1,50)}

knn = KNeighborsClassifier()
knn_cv= GridSearchCV(knn,param_grid,cv=5)
knn_cv.fit(X,y)
print("Best Score:" + str(knn_cv.best_score_))
print("Best Parameters: " + str(knn_cv.best_params_))
# feature selection
feature_cols = ['Pregnancies', 'Insulin', 'BMI', 'Age', 'Glucose', 'BloodPressure', 'DiabetesPedigreeFunction']
x = diabetes[feature_cols]
y = diabetes.Outcome
# split data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.3, random_state=1)
X_train.shape
Y_train.shape
X_test.shape
Y_test.shape
# build model
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)
print(y_pred)
y_pred = classifier.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(Y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
# accuracy
print("Accuracy:", metrics.accuracy_score(Y_test,y_pred))
#import classification_report
from sklearn.metrics import classification_report
print(classification_report(Y_test,y_pred))
# sklearn.externals.six is no longer available in current scikit-learn; use the standard library
from io import StringIO

import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
%matplotlib inline
dot_data = StringIO()
export_graphviz(classifier, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
from sklearn.linear_model import LogisticRegression
regressor=LogisticRegression()
regressor.fit(X_train,Y_train)
y_pred=regressor.predict(X_test)
y_pred
# accuracy
print("Accuracy:", metrics.accuracy_score(Y_test,y_pred))
#import classification_report
from sklearn.metrics import classification_report
print(classification_report(Y_test,y_pred))
y_pred = regressor.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(Y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
from sklearn.metrics import confusion_matrix,classification_report,roc_curve,accuracy_score,auc
fpr,tpr,_=roc_curve(Y_test,y_pred)
#calculate AUC
roc_auc=auc(fpr,tpr)

print('ROC AUC: %0.2f' % roc_auc)
#plot of ROC curve for a specified class
plt.figure()
plt.plot(fpr,tpr,label='ROC curve (area = %.2f)' % roc_auc)
plt.plot([0,1],[0,1],'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_test,y_pred)
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
from sklearn.svm import SVC
model=SVC(kernel='rbf')
model.fit(X_train,Y_train)
y_pred=model.predict(X_test)
accuracy_score(Y_test,y_pred)
y_pred = model.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(Y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
print(classification_report(Y_test,y_pred))
fpr,tpr,_=roc_curve(Y_test,y_pred)
#calculate AUC

roc_auc=auc(fpr,tpr)
print('AUC: %0.2f' % roc_auc)
#plot of ROC curve for a specified class
plt.figure()
plt.plot(fpr,tpr,label='ROC curve (area = %.2f)' % roc_auc)
plt.plot([0,1],[0,1],'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_test,y_pred)
model=SVC(kernel='linear')
model.fit(X_train,Y_train)
y_pred=model.predict(X_test)
accuracy_score(Y_test,y_pred)
y_pred = model.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(Y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
print(classification_report(Y_test,y_pred))
fpr,tpr,_=roc_curve(Y_test,y_pred)
#calculate AUC
roc_auc=auc(fpr,tpr)
print('ROC AUC: %0.2f' % roc_auc)
#plot of ROC curve for a specified class

plt.figure()
plt.plot(fpr,tpr,label='ROC curve (area = %.2f)' % roc_auc)
plt.plot([0,1],[0,1],'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)
accuracy_score(Y_test,y_pred)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(Y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
print(classification_report(Y_test,y_pred))
fpr,tpr,_=roc_curve(Y_test,y_pred)
#calculate AUC
roc_auc=auc(fpr,tpr)
print('ROC AUC: %0.2f' % roc_auc)
#plot of ROC curve for a specified class
plt.figure()
plt.plot(fpr,tpr,label='ROC curve (area = %.2f)' % roc_auc)
plt.plot([0,1],[0,1],'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')

plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train,Y_train)
Y_pred=classifier.predict(X_test)
confusion_matrix(Y_test,Y_pred)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(Y_test, Y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
accuracy_score(Y_test,Y_pred)
print(classification_report(Y_test,Y_pred))
fpr,tpr,_=roc_curve(Y_test,Y_pred)
#calculate AUC
roc_auc=auc(fpr,tpr)
print('ROC AUC: %0.2f' % roc_auc)
#plot of ROC curve for a specified class
plt.figure()
plt.plot(fpr,tpr,label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0,1],[0,1],'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()

#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_test,y_pred)
# split data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.3, random_state=1)
# Import the classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
# Instantiate the classifiers and make a list
classifiers = [LogisticRegression(random_state=1),
               SVC(kernel='rbf', probability=True),
               RandomForestClassifier(random_state=1)]
# Train the models and record one result row per classifier
results = []
for cls in classifiers:
    model = cls.fit(X_train, Y_train)
    # Probability of the positive class, used to trace the full ROC curve
    yproba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(Y_test, yproba)
    # Named auc_value so the auc() function imported earlier is not shadowed
    auc_value = roc_auc_score(Y_test, yproba)
    results.append({'classifiers': cls.__class__.__name__,
                    'fpr': fpr,
                    'tpr': tpr,
                    'auc': auc_value})
# Build the result table from the collected rows
result_table = pd.DataFrame(results, columns=['classifiers', 'fpr', 'tpr', 'auc'])
# Set name of the classifiers as index labels
result_table.set_index('classifiers', inplace=True)
fig = plt.figure(figsize=(8,6))
# Plot one ROC curve per classifier
for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'],
             result_table.loc[i]['tpr'],
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))
# Diagonal reference line for a random classifier
plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)
plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')
plt.show()
fig.savefig('multiple_roc_curve.png')
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_test,yproba)
print(classification_report(Y_test,Y_pred))
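
The single-model listings above pass hard class labels (y_pred) to roc_curve, whereas the comparison loop passes predicted probabilities (yproba). The two generally give different AUC values, because hard labels reduce the curve to a single operating point. The pure-Python sketch below illustrates the effect. The labels, scores, 0.5 threshold and the helper name pairwise_auc are illustrative assumptions and are not taken from the project's data.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.60, 0.70]
# Hard labels obtained by thresholding the probabilities at 0.5
y_hard = [1 if s >= 0.5 else 0 for s in y_score]

auc_from_scores = pairwise_auc(y_true, y_score)
auc_from_labels = pairwise_auc(y_true, y_hard)
print("AUC from probabilities: %0.3f" % auc_from_scores)
print("AUC from hard labels:   %0.3f" % auc_from_labels)
```

On this toy data the two values differ (7/9 against 6/9), which suggests that AUC figures computed from hard labels and from probabilities should not be compared directly.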


SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or
weakness in a work product. It provides a way to check the functionality of components, sub-assemblies,
assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the
software system meets its requirements and user expectations and does not fail in an unacceptable manner. There
are various types of tests. Each test type addresses a specific testing requirement.

TYPES OF TESTS

UNIT TESTING

Unit testing involves the design of test cases that validate that the internal program logic is functioning properly,
and that program inputs produce valid outputs. All decision branches and internal code flow should be validated.
It is the testing of individual software units of the application. It is done after the completion of an individual
unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at component level and test a specific
business process, application, and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly defined inputs and expected
results.
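
As an illustration of unit testing at the component level, the sketch below tests one small unit in isolation with Python's unittest module. The function confusion_counts is a hypothetical helper written for this example and is not part of the project code. The test cases check that clearly defined inputs give the expected results and that invalid input is rejected.

```python
import unittest


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1. Hypothetical helper."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t not in (0, 1) or p not in (0, 1):
            raise ValueError("labels must be 0 or 1")
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


class ConfusionCountsTest(unittest.TestCase):
    def test_expected_counts(self):
        # One sample in each cell of the confusion matrix
        self.assertEqual(confusion_counts([0, 0, 1, 1], [0, 1, 0, 1]), (1, 1, 1, 1))

    def test_all_correct(self):
        self.assertEqual(confusion_counts([1, 0, 1], [1, 0, 1]), (1, 0, 0, 2))

    def test_length_mismatch_rejected(self):
        with self.assertRaises(ValueError):
            confusion_counts([0, 1], [0])

    def test_invalid_label_rejected(self):
        with self.assertRaises(ValueError):
            confusion_counts([0, 2], [0, 1])


def run_unit_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConfusionCountsTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_unit_tests()
```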

INTEGRATION TESTING
Integration tests are designed to test integrated software components to determine whether they actually run as one
program. Testing is event driven and is more concerned with the basic outcome of screens or fields.
Integration tests demonstrate that although the components were individually satisfactory, as shown by
successful unit testing, the combination of components is correct and consistent. Integration testing is
specifically aimed at exposing the problems that arise from the combination of components.


TOP DOWN INTEGRATION

This method is an incremental approach to the construction of program structure. Modules are integrated by
moving downward through the control hierarchy, beginning with the main program module. The modules
subordinate to the main program module are incorporated into the structure in either a depth-first or breadth-first
manner. In this method, the software is tested from the main module, and individual stubs are replaced as the test
proceeds downwards.
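
The top-down approach can be sketched as follows: the main module is tested first against stubs, and each stub is then replaced by the real subordinate module. All function names here (run_pipeline, load_stub, predict_stub, predict_real) and the glucose threshold of 140 are hypothetical, chosen only to illustrate the idea.

```python
def run_pipeline(load, predict):
    """Main module: loads records and returns one prediction per record."""
    records = load()
    return [predict(record) for record in records]


# Stubs: fixed, trivial stand-ins for modules that are not yet integrated
def load_stub():
    return [{"glucose": 150}, {"glucose": 90}]


def predict_stub(record):
    return 0  # always answers "non-diabetic"


# Real subordinate module (an illustrative rule, not a trained model)
def predict_real(record):
    return 1 if record["glucose"] >= 140 else 0


# Step 1: test the main module with every subordinate module stubbed
step1 = run_pipeline(load_stub, predict_stub)
# Step 2: replace the prediction stub with the real module and retest
step2 = run_pipeline(load_stub, predict_real)
print(step1, step2)
```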

BOTTOM UP INTEGRATION

This method begins construction and testing with the modules at the lowest level in the program structure.
Since the modules are integrated from the bottom up, the processing required for modules subordinate to a given
level is always available and the need for stubs is eliminated. The bottom-up integration strategy may be
implemented with the following steps:
The low-level modules are combined into clusters that perform a specific software sub-function.
A driver (i.e., the control program for testing) is written to coordinate test case input and output.
The cluster is tested.
Drivers are removed and clusters are combined moving upward in the program structure.
The bottom-up approach tests each module individually; each module is then integrated with a
main module and tested for functionality.
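
A minimal sketch of these steps: two low-level functions form a cluster, and a driver feeds test cases to the cluster and compares the outputs with the expected values. The functions bmi and bmi_category, the cut-off of 30 and the test cases are assumptions made for this illustration.

```python
def bmi(weight_kg, height_m):
    """Low-level module: body mass index."""
    return weight_kg / (height_m ** 2)


def bmi_category(value):
    """Low-level module: coarse category from a BMI value (assumed cut-off 30)."""
    return "obese" if value >= 30.0 else "not obese"


def cluster(weight_kg, height_m):
    """Cluster: combines the two low-level modules into one sub-function."""
    return bmi_category(bmi(weight_kg, height_m))


def driver(cases):
    """Driver: coordinates test case input and output for the cluster."""
    failures = []
    for args, expected in cases:
        actual = cluster(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Hypothetical test cases: ((weight in kg, height in m), expected category)
cases = [((95.0, 1.70), "obese"), ((60.0, 1.70), "not obese")]
failures = driver(cases)
print("failures:", failures)
```

Once the cluster passes, the driver would be discarded and the cluster called from the next module up.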
FUNCTIONAL TESTING
Functional tests provide systematic demonstrations that the functions tested are available as specified by the business
and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.


Organization and preparation of functional tests is focused on requirements, key functions, or special test cases.
In addition, systematic coverage of identified business process flows, data fields, predefined processes,
and successive processes must be considered for testing. Before functional testing is complete, additional tests are
identified and the effective value of current tests is determined.
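
The valid-input and invalid-input items can be illustrated with a small validator and a table of input classes. The field names (glucose, bmi, age) and the accepted ranges below are assumptions for this sketch; the report does not specify them.

```python
# Hypothetical field ranges (assumed for illustration only)
RANGES = {"glucose": (0, 500), "bmi": (0.0, 100.0), "age": (0, 120)}


def validate(record):
    """Return a list of error messages; an empty list means the record is accepted."""
    errors = []
    for field, (low, high) in RANGES.items():
        if field not in record:
            errors.append("missing field: " + field)
            continue
        value = record[field]
        # bool is a subclass of int, so it is excluded explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append("not a number: " + field)
        elif not (low <= value <= high):
            errors.append("out of range: " + field)
    return errors


# One class of valid input and three classes of invalid input
valid_class = [{"glucose": 120, "bmi": 27.5, "age": 45}]
invalid_class = [
    {"glucose": -5, "bmi": 27.5, "age": 45},     # out of range
    {"glucose": 120, "bmi": "high", "age": 45},  # wrong type
    {"glucose": 120, "bmi": 27.5},               # missing field
]

accepted = [r for r in valid_class if not validate(r)]
rejected = [r for r in invalid_class if validate(r)]
print(len(accepted), "accepted,", len(rejected), "rejected")
```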

SYSTEM TESTING
System testing ensures that the entire integrated software system meets requirements. It tests a configuration to
ensure known and predictable results. An example of system testing is the configuration-oriented system
integration test. System testing is based on process descriptions and flows, emphasizing pre-driven process links
and integration points.
METHODS OF TESTING

White Box Testing


White Box Testing is testing in which the software tester has knowledge of the inner workings, structure
and language of the software, or at least its purpose. It is used to test areas that cannot be reached
from a black box level.

Black Box Testing


Black Box Testing is testing the software without any knowledge of the inner workings, structure or language of
the module being tested. Black box tests, like most other kinds of tests, must be written from a definitive source
document, such as a specification or requirements document. It is
testing in which the software under test is treated as a black box: you cannot “see” into it. The test provides
inputs and responds to outputs without considering how the software works.
Test strategy and approach
Field testing will be performed manually and functional tests will be written in detail.
Test objectives:
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.


Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.

ACCEPTANCE TESTING
User Acceptance Testing is a critical phase of any project and requires significant participation by the end user. It
also ensures that the system meets the functional requirements.
TEST CASE


OUTPUT SCREENS


TESTING WITH K-NEAREST NEIGHBOURS

KNN ACCURACY

RANDOM FOREST:

RANDOM FOREST ACCURACY


DECISION TREES

DECISION TREE ACCURACY


CONCLUSION
One of the important real-world medical problems is the detection of diabetes at an early stage. In this study,
systematic efforts are made to design a system that predicts diabetes. During this work,
five machine learning classification algorithms are studied and evaluated on various measures. Experiments are
performed on the john Diabetes Database. The experimental results indicate the adequacy of the designed system, with
an achieved accuracy of 79% using the KNN algorithm.

FUTURE SCOPE
In future, the designed system, together with the machine learning classification algorithms used, can be applied to predict or
diagnose other diseases. The work can be extended and improved for the automation of diabetes analysis
by including other machine learning algorithms.


BIBLIOGRAPHY

International Journal of Scientific Research in Computer Science, Engineering and Information Technology,
ISSN: 2456-3307 (www.ijsrcseit.com). https://doi.org/10.32628/CSEIT206463

https://www.sciencedirect.com/science/article/pii/S1877050920300557

Arwatki Chen Lyngdoh, Nurul Amin Choudhury, Soumen Moulik. Diabetes Disease Prediction Using Machine Learning
Algorithms. 2020 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), Langkawi, Malaysia.
Publisher: IEEE. DOI: 10.1109/IECBES48179.2021.9398759
https://ieeexplore.ieee.org/document/9398759
https://www.researchgate.net/publication/346265894_Diabetes_Disease_Prediction_Using_Machine_Learning_Algorithms

Tawfik Beghriche, Mohamed Djerioui, Youcef Brik, Bilal Attallah, Samir Brahim Belhaouari. An Efficient Prediction
System for Diabetes Disease Based on Deep Neural Network.
https://www.hindawi.com/journals/complexity/2021/6053824/
