
A PROJECT REPORT ON

AN EXPERT SYSTEM FOR INSULIN DOSAGE PREDICTION


Submitted in partial fulfilment of the requirements for the award of the
degree of
BACHELOR OF TECHNOLOGY
In

COMPUTER SCIENCE & ENGINEERING


Submitted By

S.SWAROOP

(18JD1A0590)

Under the esteemed guidance of

K. VADDI KASULU M. Tech(CST), Ph.D.(CSE)

Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

ELURU COLLEGE OF ENGINEERING & TECHNOLOGY


DUGGIRALA(V), PEDAVEGI (M), ELURU-534004
(APPROVED BY AICTE-NEW DELHI & AFFILIATED TO JNTU-KAKINADA)

2018 –2022

ELURU COLLEGE OF ENGINEERING & TECHNOLOGY
DUGGIRALA(V), PEDAVEGI (M), ELURU-534004
(APPROVED BY AICTE-NEW DELHI & AFFILIATED TO JNTU-KAKINADA)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE
This is to certify that the Project Report entitled “AN EXPERT SYSTEM FOR INSULIN
DOSAGE PREDICTION” being submitted in partial fulfilment for the award of the degree
Bachelor of Technology in the Department of Computer Science & Engineering to the Jawaharlal
Nehru Technological University, Kakinada is a record of bonafide work carried out by
S.SWAROOP (18JD1A0590) under my guidance and supervision.

PROJECT GUIDE HEAD OF THE DEPARTMENT

EXTERNAL EXAMINER

ACKNOWLEDGEMENT

The present project work is the result of several days of study of the various aspects of project
development. During this effort, we received a great amount of help from our Chairman
Sri V. RAGHAVENDRA RAO and Secretary Sri V. RAMA KRISHNA RAO of Padmavathi Group of
Institutions, which we wish to acknowledge and for which we thank them from the depth of our hearts.

We are thankful to our Principal Dr. P. BALAKRISHNA PRASAD for permitting and
encouraging us in doing this project.

We are deeply indebted to Dr. G. GURU KESAVADAS, Professor & Head of the
Department, whose motivation and constant encouragement led us to pursue a project in
the field of software development.

We are deeply indebted to our project guide, K. VADDI KASULU, Assistant Professor, for
providing this opportunity and for the constant encouragement given by him during the course. We
are grateful for his valuable guidance and suggestions during our project work.

Our parents have put us ahead of themselves. Because of their hard work and
dedication, we have had opportunities beyond our wildest dreams.

Finally, we express our thanks to all the other faculty members, classmates, friends and
neighbors who helped us in the completion of our project; without their infinite love and
patience this would never have been possible.

S. Swaroop (18JD1A0590)

DECLARATION
I hereby declare that the Project report entitled “AN EXPERT SYSTEM FOR INSULIN
DOSAGE PREDICTION” is carried out by me under the guidance of K.VADDI KASULU,
Assistant Professor of CSE dept, during the IV year II semester in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology in Computer Science and
Engineering during 2021-2022.

Submitted by:
S. Swaroop(18JD1A0590)

ABSTRACT

Diabetes Mellitus is a chronic metabolic disorder. With proper adjustment of blood glucose
levels, diabetic patients can live a normal life without the risk of the serious complications that
normally develop in the long run. However, the blood glucose levels of most diabetic patients are
not well controlled, for many reasons. Although traditional prevention techniques such as eating
healthy food and regular physical exercise are important for diabetic patients to control their blood
glucose levels (BGLs), taking the proper amount of insulin plays the crucial role in the treatment
process. In this project we use a Gradient Boosting Classifier to predict whether a patient is
diabetic and a Linear Regression model to predict the insulin dosage for patients detected as
diabetic. To implement this project we use the PIMA diabetes dataset and the UCI insulin dosage
dataset. We train both algorithms with the above-mentioned datasets; after training, we upload a
test dataset with no class label, and Gradient Boosting predicts the presence of diabetes while
Linear Regression predicts the insulin dosage if diabetes is detected by the Gradient Boosting algorithm.

TABLE OF CONTENTS
1. INTRODUCTION 1
2. LITERATURE SURVEY 5
3. SYSTEM ANALYSIS 7
3.1 Existing system 7
3.2 Proposed System 7
4. SYSTEM REQUIREMENTS ANALYSIS 8
4.1 Functional Requirements 8
4.2 Non-Functional Requirements 8
4.3 Software Requirements 9
4.4 Hardware Requirements 10
4.5 System Study 10
5. SYSTEM DESIGN 12
5.1 System Architecture 12
5.2 Modules Description 12
5.3 Data Flow Diagram 13
5.4 UML Diagrams 14
6. IMPLEMENTATION 33
6.1 Python 33
6.2 Python Features 39
6.3 NumPy 39
6.4 Matplotlib 40
6.5 Scikit-learn 40
6.6 Modules 41
6.7 Algorithms 41
7. METHODOLOGY 45
7.1 Feature Scaling 45
7.2 Classification Algorithms 46
8. SOURCE CODE 49
9. TESTING 58
9.1 Test Data Set 58
9.2 Validation Data set 58
9.3 Selection of a valid data set 59
9.4 Testing Strategies 59
9.4.1 Unit Testing 59
9.4.2 Data Flow testing 59
9.4.3 Integration Testing 60
9.4.4 Big Bang Testing 60
9.5 Test Cases 60

10. OUTPUT SCREENSHOTS 63


11. CONCLUSION 70
12. REFERENCES 71
13. BIBLIOGRAPHY 72

LIST OF FIGURES
1. Figure 5.1: System Architecture 12

2. Figure 5.3: Data Flow Diagram 14


3. Figure 5.4.2.1 Class Diagram 17
4. Figure 5.4.3.1 Use case Diagram 20
5. Figure 5.4.4.1 Sequence Diagram 22
6. Figure 5.4.5.1 Activity Diagram 23
7. Figure 5.4.6.1 Collaboration Diagram 24
8. Figure 5.4.7.1 State Chart Diagram 27
9. Figure 5.4.8.1 Component Diagram 30
10. Figure 5.4.9.1 Deployment Diagram 32
11. Figure 6.7.1.1 Linear Regression in Machine Learning 42
12. Fig 10.1 Test Values 63
13. Fig 10.2 After Running the project screenshot 64
14. Fig 10.3 Uploading the Diabetic dataset 64
15. Fig 10.4 After Uploading the dataset 65
16. Fig 10.5 Graph for Diabetes Indicates 65
17. Fig 10.6 Preprocessing of dataset 66
18. Fig 10.7 Accuracy for Gradient Boosting Algorithm 66
19. Fig 10.8 Accuracy for Linear Regression 67
20. Fig 10.9 Accuracy Graph 67
21. Fig 10.10 Predicting Diabetes and Insulin Dosage 68
22. Fig 10.11 Insulin Dosage When Diabetes detected 68

1. INTRODUCTION
Purpose:
The prediction of glucose concentration could facilitate the appropriate patient reaction in
crucial situations such as hypoglycaemia. Thus, several recent studies have considered advanced
data-driven techniques for developing accurate predictive models of glucose metabolism. The fact that
the relationship between the input variables and glucose levels is non-linear, dynamic, interactive
and patient-specific necessitates the application of non-linear regression models such as
artificial neural networks, support vector regression and Gaussian processes.

Scope:

The aim of the project is to predict diabetes. In addition to the general guidelines that the patient
follows during their daily life, several diabetes management systems have been proposed to
further assist the patient in the self-management of the disease. One of the essential components of
a diabetes management system concerns the predictive modelling of the glucose metabolism.

Diabetes mellitus, a metabolic disease, is a crucial problem in modern health care, since it
currently affects more than 8% of the adult population (age 20–79 years). Recent predictions report
that this number may increase by 55% within two decades, which could also increase the mortality
rate and the incidence of further complications caused by the disease. This underlines the
importance of the treatment of diabetics and of finding new ways to support the diabetic lifestyle.

As the current state of the art in medical science does not provide a full cure of the disease, the
patients have to adopt a special lifestyle with a slightly different treatment for each type of
diabetes; for some patients, it is enough to pay attention to the food intake while others need
subcutaneous insulin injections to compensate the insufficient insulin production in their body.
The two main types of the disease are type 1 diabetes (T1D) characterized by absolute
deficiency of endogenous insulin production and type 2 diabetes (T2D) characterized by partial
deficiency of endogenous insulin production and insulin resistance. This work concentrates on
the daily life support of type 1 and type 2 diabetes outpatients treated with subcutaneous insulin
injections. Subcutaneous insulin products have long-acting (or basal) types with a day-long
effect, typically administered once a day in the morning or in the evening, and short-acting (or
bolus) types administered separately for each main meal in order to control blood glucose level
(BGL). Small mistakes in choosing the right dose of these injections can lead to a critically
low BGL, that is, hypoglycaemia, which may present an instant medical emergency, or a
sustained, excessively high BGL, that is, hyperglycaemia, which results in severe complications
(e.g., cardiovascular diseases, kidney disease, and diabetic retinopathy) in the long run.

Estimating insulin needs before every meal and other activities is a daily task for each diabetic.
These decisions are usually based on experience and conjecture, which is sometimes rather
inefficient in practice, resulting in high glycated haemoglobin (HbA1c) values. These
observations justify the idea of developing blood glucose prediction algorithms and lifestyle
support applications that help diabetics in their everyday life.

Poorly controlled glucose is both common and dangerous in hospitalized patients, reflecting
deficiencies in common standard practices in insulin dosing. Hyperglycaemia, defined as a blood
glucose >140 mg/dL, occurs in 22% to 46% of non–critically ill hospitalized patients,1 and can
lead to serious complications, including infections, cardiovascular events, and increased overall
mortality.2 The increased odds of mortality among patients with a blood glucose above 145
mg/dL is 1.3 to 3 times that of patients with normal glucose (70-110 mg/dL), independent of
illness severity.3 The treatment for hyperglycaemia in inpatients is insulin, a hormone essential
to enabling cells to uptake glucose from the blood for energy. However, insulin has a narrow
therapeutic window when given as a medication, and overtreatment can lead to dangerous
hypoglycaemia causing seizure, arrhythmia, or even death. As such, predicting an accurate
insulin dose is critical for clinical outcomes. The existing standard of care for estimating initial
insulin dose prediction is unfortunately highly variable, as it is typically driven mainly by
individual clinical judgment supplementing crude weight-based clinical calculators, often
leading to ineffective glucose control.4 Practice guidelines for inpatient insulin dosing primarily
revolve around weight-based clinical calculators that estimate the total daily dose (TDD) of insulin
required to be 0.4 to 0.6 units/kg among nonelderly patients with good kidney function.5,6 This
calculator results in a range of dosing that can vary by 50%, requiring prescribers to use variable
clinical experience to adjust for factors such as age, suspected insulin sensitivity, and renal
function. Even in optimal conditions, existing TDD guidelines are based on a dosing schema of
unclear provenance chosen in published studies7 that have not been clinically validated. The
current practice leads to significant dosing heterogeneity even within the same patient’s
hospitalization.
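
As a simple worked example (a hypothetical patient, not a case from the literature), under this guideline a 75 kg non-elderly inpatient with good kidney function would be started on a TDD between 0.4 × 75 = 30 units and 0.6 × 75 = 45 units per day; the upper bound is 50% higher than the lower bound, and the prescriber must rely on clinical judgment to choose within that range.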

For admitted patients, prescribing an initial insulin dose is often challenging, as there is limited
information available at this early stage, whereas titration is often a simpler problem because a
patient’s insulin sensitivity can be estimated from their response to previous insulin doses.
Specialized glucose management services have been established to help minimize both
hyperglycaemia and hypoglycaemia in the inpatient setting. These consult services have
improved blood glucose control and cost savings. However, the number of patients who need
consults often exceeds the capacity of these services.11 Decision support that assists in insulin
dosing could improve inpatient glycaemic control at scale. An alternate approach to formal

consult services is a remote glucose monitoring service in which a consulting endocrinologist
provides teams with insulin dosing suggestions based on chart review, without examining the
patient, which has shown success in reducing the proportion of patients with hyperglycaemia
and reducing hypoglycaemic events. The success of such remote glucose control programs
suggests that electronic health records may contain sufficient information for prescribing insulin,
and can be leveraged using automated machine learning methods.

Prediction models can screen pre-diabetes or people with an increased risk of developing
diabetes to help decide the best clinical management for patients. Numerous predictive equations
have been suggested to model the risk factors of incident diabetes; for instance, Heikes et al. studied
a tool to predict the risk of diabetes in the US using undiagnosed and pre-diabetes data, and
Razavian et al. developed Linear Regression-based prediction models for type 2 diabetes
occurrence. These models also help screen individuals to posit individuals who are at a high risk
of having diabetes. Zou et al. used machine learning methods to predict diabetes in Luzhou,
China, and five-fold cross-validation was used to validate the models. Nguyen et al. predicted the
onset of diabetes employing deep learning algorithms, suggesting that sophisticated methods may
improve the performance of models. In contrast, several other studies have shown that Linear
Regression performs at least as well as machine learning techniques for disease risk prediction.
Similarly, Anderson et al. used Linear Regression along with machine learning algorithms and
found a higher accuracy with the Linear Regression model. These models are mainly based on assessing
risk factors of diabetes, such as household and individual characteristics; however, the lack of an
objective and unbiased evaluation is still an issue. Additionally, there is growing concern that
those predictive models are poorly developed due to inappropriate selection of covariates,
missing data, small sample sizes, and wrongly specified statistical models. To this end, only a
few risk prediction models have been routinely used in clinical practice. The reliability and
quality of these predictive tools and equations show significant variation depending on
geography, available data, and ethnicity. Risk factors for one ethnic group may not be
generalized to others; for example, the prevalence of diabetes is reported to be higher among the
Pima Indian community. Therefore, this study uses the Pima Indian dataset to predict if an
individual is at risk of developing diabetes based on specific diagnostic factors.

We consider a combined approach of Linear Regression and machine learning to predict the risk
factors of type 2 diabetes mellitus. The Linear Regression compares several prediction models
for predicting diabetes. We used various selection criteria popular in the literature, such as AIC,
BIC, Mallows’ Cp, adjusted R2, and forward and backward selection to identify the significant
predictors. We then exploit the classification tree, a widely used machine learning technique
with considerable classification power to predict diabetes incidence in several

previous studies. Our paper also contributes to the broad literature on risk factors of diabetes.
We extend previous research by exploring traditional econometric models and simple machine
learning algorithms to build on the literature on diabetes prediction. Our analysis finds five main
predictors of diabetes: glucose, pregnancy, body mass index, age, and diabetes pedigree
function. These risk factors of diabetes identified by the Linear Regression were validated by the
decision tree and could help classify high-risk individuals and prevent, diagnose and manage
diabetes. A regression model containing the above five predictors yields a prediction accuracy of
77.73% and a cross-validation error rate of 22.65%. This study helps inform policymakers and
design the health policy to curb the prevalence of diabetes. The methods exhibiting the best
classification performance vary depending on the data structure; therefore, future studies should be
cautious about using a single model or approach for diabetes risk prediction.

2. LITERATURE SURVEY

2.1 Diagnosis of Diabetes Using Classification Mining Techniques


AUTHORS: Aiswarya Iyer, S. Jeyalatha and Ronak Sumbaly
ABSTRACT: Diabetes is a chronic disease in which the body cannot produce enough insulin (T1D,
Type 1 diabetes) or cannot effectively use the insulin it produces (T2D, Type 2 diabetes). The
patient is expected to manage their BG level within the normal range of 70–180 mg/dL.
High blood glucose, or hyperglycaemia, is the condition where BG > 180 mg/dL and can cause
several complications such as cardiovascular disease and retinopathy. Another condition that
occurs in diabetic patients is low blood sugar, or hypoglycaemia (BG < 70 mg/dL), which can
cause death. To monitor the blood glucose level, the diabetic patient can utilize a continuous glucose
monitoring (CGM) device. A CGM device is an advanced wearable medical device that can monitor
and record glucose concentration every 1–5 min and has been shown to improve diabetes
management. Several studies have demonstrated BG prediction models, so that future values of
BG can be obtained earlier. Blood glucose prediction models have been presented in previous
studies by considering the glucose level as a single input. Perez-Gandia et al. proposed a neural network
(NN) to predict future blood glucose by using the last 20 minutes of BG levels as input. Ben Ali
et al. proposed a neural network-based model to predict the BG level of Type 1 Diabetes (T1D) patients
using only CGM data as inputs.

2.2 Analysis of Various Data Mining Techniques to Predict Diabetes Mellitus.

AUTHORS: Devi, M. Renuka, and J. Maria Shyla

ABSTRACT: Data mining approaches help to diagnose patients' diseases. Diabetes Mellitus is a
chronic disease that affects various organs of the human body. Early prediction can save human life
and help take control over the disease. This paper explores the early prediction of diabetes using
various data mining techniques. The dataset comprises 768 instances taken from the PIMA Indian Dataset,
used to determine the accuracy of the data mining techniques in prediction. The analysis proves that
the Modified J48 Classifier provides higher accuracy than the other techniques.

2.3 Predicting Diabetes Mellitus using Data Mining Techniques

AUTHORS: J. Steffi, Dr. R. Balasubramanian, Mr. K. Aravind Kumar

ABSTRACT: Diabetes is a chronic disease caused by an increased level of sugar in
the blood. Various automated information systems have been designed using various classifiers to
predict and diagnose diabetes. Data mining approaches help to diagnose patients'
diseases. Diabetes Mellitus is a chronic disease that affects various organs of the human body.
Early prediction can save human life and help take control over the disease. Selecting legitimate
classifiers clearly increases the correctness and adeptness of the system. Due to its continuously
increasing rate, more and more families are affected by diabetes mellitus. Most diabetics know
little about the risk factors they face prior to diagnosis. This paper explores the early prediction of
diabetes using data mining techniques. The dataset comprises 768 instances taken from the PIMA Indian
Diabetes Dataset, used to determine the accuracy of the data mining techniques in prediction. We then
developed five predictive models using 9 input variables and one output variable from the dataset;
we evaluated the five models in terms of their accuracy, precision, sensitivity,
specificity and F1-score measures. The purpose of this study is to compare the performance
of Naïve Bayes, Linear Regression, Artificial Neural Network (ANN), C5.0 Decision
Tree and Support Vector Machine (SVM) models for predicting diabetes using common risk
factors. The decision tree model (C5.0) gave the best classification accuracy, followed by the
Linear Regression model, Naïve Bayes and ANN, while the SVM gave the lowest accuracy.

Index Terms—Data mining, Prediction, Naïve Bayes, Linear Regression, C5.0 Decision Tree,
Artificial Neural Networks (ANN) and Support Vector Machine (SVM).

2.4 Comparison Data Mining Techniques To Prediction Diabetes Mellitus


ABSTRACT: Diabetes is one of the chronic diseases caused by excess sugar in the blood.
Various automated algorithms have been proposed to predict and diagnose diabetes. A data mining
approach can help diagnose the patient's disease. Early prediction can save human life and allow
prevention to begin before the disease attacks the patient. Choosing a legitimate classifier clearly
improves the correctness and accuracy of the system as prevalence levels continue to increase.
Most diabetics know little about the risk factors they face before diagnosis. This work develops
five predictive models using 9 input variables and one output variable from the dataset.
The purpose of this study was to compare the performance of Naive Bayes, Decision Tree, SVM,
K-NN and ANN models in predicting diabetes mellitus.

3. SYSTEM ANALYSIS

3.1 Existing System:

In the existing system, insulin dosage is predicted using Decision Tree and Random Forest
algorithms, which give low accuracy and do not provide an early prediction. Insulin enables blood
glucose to enter cells, where the glucose is used for energy, so that blood glucose is kept within a
narrow range. Diabetes is a chronic disease with the potential to cause a worldwide health crisis.
However, early prediction of diabetes is quite a challenging task for medical practitioners due to
the complex interdependence of various factors.

3.1.1 Disadvantages:

Early prediction of diabetes is quite a challenging task for medical practitioners due to the complex
interdependence of various factors. The existing system uses Decision Tree and Random Forest,
which do not predict the insulin dosage accurately and do not provide an early prediction.

3.2 Proposed System:

In this project we use a Gradient Boosting Classifier to predict diabetes and then a Linear
Regression algorithm to predict the insulin dosage for patients detected as diabetic. To
implement this project we use the PIMA diabetes dataset and the UCI insulin dosage dataset. We
train both algorithms with the above-mentioned datasets; once training is complete, we upload a
test dataset with no class label, and Gradient Boosting predicts the presence of diabetes while
Linear Regression predicts the insulin dosage if diabetes is detected by Gradient Boosting.
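
A minimal sketch of this two-stage approach is given below, assuming pandas and scikit-learn are available; the file names and column names (an "Outcome" label in the PIMA file, a "Dosage" target in the insulin file, and a test file containing both feature sets) are placeholder assumptions, not the actual project files.

# Hedged sketch of the proposed two-stage pipeline; file and column names are
# placeholders and should be adapted to the actual PIMA and UCI dataset layouts.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression

# Stage 1: train the diabetes detector on the labelled PIMA dataset.
pima = pd.read_csv("pima_diabetes.csv")                 # assumed file name
X_clf, y_clf = pima.drop(columns=["Outcome"]), pima["Outcome"]
detector = GradientBoostingClassifier().fit(X_clf, y_clf)

# Stage 2: train the dosage regressor on the UCI insulin dosage dataset.
insulin = pd.read_csv("uci_insulin_dosage.csv")         # assumed file name
X_reg, y_reg = insulin.drop(columns=["Dosage"]), insulin["Dosage"]
regressor = LinearRegression().fit(X_reg, y_reg)

# Prediction on an unlabelled test file that contains both feature sets:
# dosage is only estimated for records the classifier flags as diabetic.
test = pd.read_csv("test_no_label.csv")                 # assumed file name
for i, flag in enumerate(detector.predict(test[X_clf.columns])):
    if flag == 1:
        dose = regressor.predict(test[X_reg.columns].iloc[[i]])[0]
        print(f"Record {i}: diabetes detected, predicted dosage = {dose:.1f} units")
    else:
        print(f"Record {i}: no diabetes detected")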

3.2.1 Advantages:

Individual blood glucose levels and insulin dosages are highly variable, and precise prediction of
them is likely impractical. Average blood glucose levels over 24 hours can be predicted more
reliably, and determining whether a patient's glucose level is going to be high is a more feasible
task. Using the Gradient Boosting classification algorithm together with the Linear Regression
algorithm, we can predict the insulin dosage more accurately.

4. SYSTEM REQUIREMENTS ANALYSIS

4.1 Functional Requirements:


The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics requirements,
design constraints and user documentation.
The appropriation of requirements and implementation constraints gives the general overview of
the project in regards to what the areas of strength and deficit are and how to tackle them.

1. Data collection

2. Data pre-processing

3. Training and testing

4. Modelling

5. Predicting

4.2 Non Functional Requirements :

Non-functional requirements specify the quality attributes of a software system. They judge the
software system based on responsiveness, usability, security, portability and other non-functional
standards that are critical for the success of the software system. An example of a non-functional
requirement is “How fast can a website load?” Failing to meet non-functional requirements can result
in systems that fail to satisfy user needs. Non-functional requirements also allow you to impose
restrictions on the design of the system across the various agile backlogs; for example, the site
should load within 3 seconds when the number of simultaneous users is greater than 10,000.
Describing the non-functional requirements is just as critical as describing the functional requirements.

1. Usability requirement

2. Serviceability requirement

3. Manageability requirement

4. Recoverability requirement

5. Security requirement

6. Data Integrity requirement

7. Capacity requirement

8. Availability requirement

9. Scalability requirement
10. Interoperability requirement

11. Reliability requirement

12. Maintainability requirement

13. Regulatory requirement

14. Environment requirement

4.2.1 EXAMPLES OF NON-FUNCTIONAL REQUIREMENTS:

Here, are some examples of non-functional requirement:

• The software should be portable. So, moving from one OS to other OS does not create any
problem.
• Privacy of information, the export of restricted technologies, intellectual property rights, etc.
should be audited.

4.2.2 ADVANTAGES OF NON-FUNCTIONAL REQUIREMENTS:


Benefits/pros of Non-functional testing are:

• The non-functional requirements ensure the software system follows legal and compliance
rules.
• They ensure the reliability, availability, and performance of the software system
• They ensure good user experience and ease of operating the software.
• They help in formulating security policy of the software system.

4.2.3 DISADVANTAGES OF NON-FUNCTIONAL REQUIREMENT:


Cons/drawbacks of non-function requirement are:

• Non-functional requirements may affect the various high-level software subsystems.
• They require special consideration during the software architecture/high-level design phase,
which increases costs.
• Their implementation does not usually map to a specific software sub-system.
• It is tough to modify non-functional requirements once you pass the architecture phase.

4.3 Software Requirements:


The overall description includes the product perspective and features, operating system and
operating environment, graphics requirements, design constraints and user documentation. The
appropriation of requirements and implementation constraints gives an overview of the project
in regard to the areas of strength and deficit and how to tackle them.

• Python IDLE 3.7 (or)
• Anaconda 3.7 (or)
• Jupyter Notebook (or)
• Google Colab

4.4 Hardware Requirements:

Minimum hardware requirements depend heavily on the particular software being developed
by a given Python user. Applications that need to store large arrays or objects in memory
require more RAM, whereas applications that need to perform numerous calculations
quickly require a faster processor.

Operating system: Windows or Linux

Processor : minimum i3

RAM : minimum 4 GB

Hard Disk : minimum 250 GB

4.5 SYSTEM STUDY:

4.5.1 FEASIBILITY STUDY:

The feasibility of the project is analysed in this phase and a business proposal is put forth with a
very general plan for the project and some cost estimates. During system analysis the feasibility
study of the proposed system is to be carried out. This is to ensure that the proposed system is
not a burden to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

• ECONOMICAL FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY

4.5.2 ECONOMICAL FEASIBILITY:

This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and development of
the system is limited. The expenditures must be justified. Thus the developed system is well
within the budget, and this was achieved because most of the technologies used are freely
available. Only the customized products had to be purchased.

4.5.3 TECHNICAL FEASIBILITY:

This study is carried out to check the technical feasibility, that is, the technical requirements of
the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system
must have modest requirements, as only minimal or no changes are required for implementing
this system.

4.5.4 SOCIAL FEASIBILITY:

This aspect of the study checks the level of acceptance of the system by the user. This includes
the process of training the user to use the system efficiently. The user must not feel threatened
by the system, but must instead accept it as a necessity. The level of acceptance by the users solely
depends on the methods that are employed to educate users about the system and to make
them familiar with it. Their level of confidence must be raised so that they are also able to offer
constructive criticism, which is welcomed, as they are the final users of the system.

5. SYSTEM DESIGN

5.1 SYSTEM ARCHITECTURE:

Prepare Datasets → Feature Scaling → Build a Machine Learning Model → Prediction → Check for Accuracy

Figure 5.1: System Architecture

5.2 Modules Description:

Prepare Datasets:

Use the Insulin Dosage dataset and separate it into independent and dependent variables.
Convert the dependent variable to binary numbers by using a one-hot encoder.

Apply Feature Scaling:

Divide the dataset into test set and training set. Apply feature scaling to the test set and training
set.
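
A minimal sketch of this step is given below, assuming scikit-learn is available and that the PIMA file uses an "Outcome" label column (both the file name and the column name are placeholder assumptions).

# Hedged sketch of the split-and-scale step; the file and label names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

pima = pd.read_csv("pima_diabetes.csv")          # assumed file name
X = pima.drop(columns=["Outcome"])               # independent variables
y = pima["Outcome"]                              # dependent variable

# Split first, then fit the scaler on the training set only and reuse the same
# transformation on the test set, so no information leaks from test to train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)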

Build A Machine Learning Model:

Use different classification algorithms like KNN, Naïve Bayes, SVM, Decision trees etc., and
test the data against these models. The model which gives fewer errors in the confusion matrix is
then chosen as the best machine learning model.

Prediction:

Check whether the predicted result is appropriate or not.

Check for accuracy:

Validate the result against the graph and the confusion matrix and check which algorithm is
giving the accurate result.
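
The model-comparison and accuracy-check steps described above can be sketched as follows. This continues from the feature-scaling snippet (so X_train, X_test, y_train and y_test are assumed to exist) and is only an illustrative comparison, not the project's final code.

# Hedged sketch: fit several candidate classifiers and keep the one with the
# fewest confusion-matrix errors (equivalently, the highest test accuracy).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

candidates = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))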

5.3 Data Flow Diagram:

1. The DFD is a bubble chart. It is a simple graphical formalism that can be used to represent a
system in terms of the input data to the system, the various processing carried out on that data, and
the output data generated by the system.
2. The DFD is one of the most important modelling tools. It is used to model the system
components. These components are the system process, the data used by the process, any external
entity that interacts with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a series of
transformations. It is a graphical technique that depicts information flow and the transformations
that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned
into levels that represent increasing information flow and functional detail.

Figure 5.3: Data Flow Diagram

5.4 UML DIAGRAMS

UML stands for Unified Modelling Language. UML is a standardized general purpose
modelling language in field of object-oriented software engineering. The standard is managed,
and was created by the Object Management Group.

The goal of UML is to become a common language for creating models of object-oriented
computer software. In its current form, UML comprises two major components: a meta-model
and a notation. In the future, some form of method or process may also be added to UML.

UML is a standard language for specifying, visualizing, constructing, and documenting the
artifacts of a software system, as well as for business modelling and other non-software systems.

UML represents a collection of best engineering practices that have proven successful in the
modelling of large and complex systems.

UML is a very important part of developing OOSE and the software development process. The
UML uses mostly graphical notations to express the design of software projects.

5.4.1 GOALS:

The primary goals in the design of UML are as follows:

• Provide users with a ready-to-use, expressive visual modelling language, so that they can
develop and exchange meaningful models
• Provide extensibility and specialization mechanisms to extend the core concepts
• Be independent of particular programming languages and development processes
• Provide a formal basis for understanding the modelling language
• Encourage the growth of the OO tools market
• Integrate best practices

5.4.2 Class Diagram:

A class diagram in the Unified Modelling Language (UML) is a type of static structure diagram
that describes the structure of a system by showing the system’s classes, their attributes,
operations (or methods), and the relationships among objects. A class defines the methods and
variables in an object, which is a specific entity in a program or the unit of code representing that
entity. Class diagrams are useful in all forms of object-oriented programming (OOP). The
concept is several years old but has been refined as OOP modelling paradigms have evolved.

It can be represented as follows

Class:

A class represents a relevant concept from the domain, a set of persons, objects, or ideas that are
depicted in the system.

Attribute:

An attribute of a class represents a characteristic of the class that is of interest to the user of the
IT system.

Generalization:

Generalization is a relationship between two classes: a general class and a special class.

Refer to Generalization, Specialization, and Inheritance.

Association:

An association represents a relationship between two classes

Multiplicity:

A multiplicity allows for statements about the number of objects that are involved in an
association.

Aggregation:

An aggregation is a special case of an association (see above) meaning “consists of”.

The diamond documents this meaning; a caption is unnecessary. Each end of an association can
be labelled by a set of integers indicating the number of links that may legitimately originate from
an instance of the class connected to that association end. This set of integers is called the
multiplicity of the association end.

Association:

An association represents a family of links. A binary association (with two ends) is normally
represented as a line. An association can link any number of classes. An association with three
links is called a ternary association.

Different types of association are:

• One-To-One Association
• One-To-Many Association
• Many-To-Many Association

Figure 5.4.2.1 : Class Diagram

5.4.3 Use Case Diagram Description:

In the Dynamic en-route filtering scheme the dataflow diagram describes the En-Route node
receives a report from the source node or the lower associated en-route node and check the
integrity of the received report by means of the MAC enclosed in the report. If the Verification
succeeds then forward the report otherwise drop the report.

Scenario Based Modelling:

Use-oriented techniques are widely used in software requirement analysis and design. Use cases
and usage scenarios facilitate system understanding and provide a common language for
communication. A Use Case Diagram in the Unified Modelling Language (UML) is a type of
behavioural diagram defined by and created from a Use-case analysis. Its purpose is to present a
graphical overview of the functionality provided by a system in terms of actors, their goals
(represented as use cases), and any dependencies between those use cases.

Use cases:

A use case describes a sequence of actions that provide something of measurable value to an
actor and is drawn as a horizontal ellipse.

Actors:

An actor is a person, organization, or external system that plays a role in one or more
interactions with the system

Include:

In one form of interaction, a given use case may include another. "Include is a Directed
Relationship between two use cases, implying that the behaviour of the included use case is
inserted into the behaviour of the including use case. The notation is a dashed arrow from the
including to the included use case, with the label “<<include>>”

Extend:

In another form of interaction, a given use case (the extension) may extend another. The notation
is a dashed arrow from the extension to the extended use case, with the label"«extend»".

Generalization:

In the third form of relationship among use cases, a generalization/ specialization relationship
exists. A given use case may have common behaviours, requirements, constraints, and
assumptions with a more general use case. In this case, describe them once, and deal with it in
the same way, describing any differences in the specialized cases.

Associations:

Associations between actors and use cases are indicated in use case diagrams by solid lines. An
association exists whenever an actor is involved with an interaction described by a use case.
Associations are modelled as lines connecting use cases and actors to one another, with an
optional arrowhead on one end of the line.

Identified Use Cases:

The “user model view” encompasses a problem and solution from the perspective of those
individuals whose problem the solution addresses. The view presents the goals and objectives of
the problem owners and their requirements of the solution. This view is composed of “use case
diagrams”.

Basic Use Case Diagram Symbols and Notations:

System:

Draw your system's boundaries using a rectangle that contains use cases. Place actors outside the
system's boundaries.

Use Case:

Draw use cases using ovals. Label the ovals with verbs that represent the system's functions.

Actors:

Actors are the users of a system. When one system is the actor of another system, label the actor
system with the actor stereotype.

Relationships:

Illustrate relationships between an actor and a use case with a simple line. For relationships
among use cases, use arrows labelled either "uses" or "extends." A "uses" relationship indicates
that one use case is needed by another in order to perform a task. An "extends" relationship
indicates alternative options under a certain use case.

Figure 5.4.3.1: Use Case Diagram

5.4.4 Sequence Diagram:

A message sent from outside the diagram can be represented by a message originating from a
filled-in circle (found message in UML) or from a border of sequence diagram (gate in UML)

Sequence Diagram Notations:

A sequence diagram is structured in such a way that it represents a timeline which begins at the
top and descends gradually to mark the sequence of interactions. Each object has a column and
the messages exchanged between them are represented by arrows.

A Quick Overview of the Various Parts of a Sequence Diagram:

Lifeline Notation:

A lifeline with a control element indicates a controlling entity or manager. It organizes and
schedules the interactions between the boundaries and entities and serves as the mediator
between them.

Synchronous message:

As shown in the activation bars example, a synchronous message is used when the sender waits
for the receiver to process the message and return before carrying on with another message. The
arrow head used to indicate this type of message is a solid one, like the one below.

Reflexive message:

When an object sends a message to itself, it is called a reflexive message. It is indicated with a
message arrow that starts and ends at the same lifeline like shown in the example below.

Figure 5.4.4.1: Sequence Diagram

5.4.5 ACTIVITY DIAGRAM:

Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modelling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

Figure 5.4.5.1 : Activity Diagram

5.4.6 COLLABORATION DIAGRAM:

A Sequence diagram is dynamic and, more importantly, time ordered. A Collaboration
diagram is very similar to a Sequence diagram in the purpose it achieves; in other words, it
shows the dynamic interaction of the objects in a system. A distinguishing feature of a
Collaboration diagram is that it shows the objects and their association with other objects in the
system apart from how they interact with each other. The association between objects is not
represented in a Sequence diagram.

UML Collaboration Diagram Symbols and Shapes

Pre-drawn UML collaboration diagram symbols represent object, multi-object, association role,
delegation, link to self, constraint and note.

UML collaboration diagram templates offer you many useful shapes. UML collaboration
diagram symbols like object, multi-object, association role, delegation, link to self, constraint
and note are available.
Objects are model elements that represent instances of a class or of classes.

Multi-object represents a set of lifeline instances.

Association role is optional and suppressible.

Delegation is like inheritance done manually through object composition.

Link to self is used to link the objects that fulfil more than one role.

Figure 5.4.6.1: Collaboration Diagram

Constraint is an extension mechanism that enables you to refine the semantics of a UML model
element.

Note contains comments or textual information.

5.4.7 STATE CHART DIAGRAM:

Objects have behaviors and states. The state of an object depends on its current activity or
condition. A state chart diagram shows the possible states of the object and the transitions that
cause a change in state. A state diagram, also called a state machine diagram or state chart
diagram, is an illustration of the states an object can attain as well as the transitions between
those states in the Unified Modelling Language. A state diagram resembles a flowchart in which
the initial state is represented by a large black dot and subsequent states are portrayed as boxes
with rounded corners.

Basic State Chart Diagram Symbols and Notations

States:

States represent situations during the life of an object. You can easily illustrate a state in Smart
Draw by using a rectangle with rounded corners.

Transition:

A solid arrow represents the path between different states of an object. Label the transition with
the event that triggered it and the action that results from it. A state can have a transition that
points back to itself.

Initial State:

A filled circle followed by an arrow represents the object's initial state.

Final State:

An arrow pointing to a filled circle nested inside another circle represents the object’s final
state.

Synchronization and Splitting of Control:

A short heavy bar with two transitions entering it represents a synchronization of control. The
first bar is often called a fork where a single transition splits into concurrent multiple transitions.
The second bar is called a join, where the concurrent transitions reduce back to one.

Figure 5.4.7.1 State Chart Diagram

5.4.8 COMPONENT DIAGRAM

DESCRIPTION

This chapter discusses the portion of the software development process where the design is
elaborated and the individual data elements and operations are designed in detail. First, different
views of a “component” are introduced. Guidelines for the design of object-oriented and
traditional (conventional) program components are presented.

A. Meaning of a Component

This section defines the term component and discusses the differences between object oriented,
traditional, and process-related views of component-level design. The Object Management Group
(OMG) UML defines a component as “… a modular, deployable, and replaceable part of a system
that encapsulates implementation and exposes a set of interfaces.”

B. An Object-Oriented View

A component contains a set of collaborating classes. Each class within a component has been
fully elaborated to include all attributes and operations that are relevant to its implementation

Basic Component Diagram Symbols and Notations

Component:

A component is a logical unit block of the system, a slightly higher abstraction than classes. It is
represented as a rectangle with a smaller rectangle in the upper right corner with tabs or the word
written above the name of the component to help distinguish it from a class.

Interface:
An interface (small circle or semi-circle on a stick) describes a group of operations used
(required) or created (provided) by components. A full circle represents an interface created or
provided by the component. A semi-circle represents a required interface, like a person's input.

Dependencies:

Draw dependencies among components using dashed arrows

Port:

Ports are represented using a square along the edge of the system or a component. A port is
often used to help expose required and provided interfaces of a component.

Figure 5.4.8.1 : Component Diagram

5.4.9 DEPLOYMENT DIAGRAM:

DESCRIPTION

Deployment diagrams are used to visualize the topology of the physical components of a system
where the software components are deployed. So deployment diagrams are used to describe the
static deployment view of a system. Deployment diagrams consist of nodes and their
relationships.

Purpose:

The name Deployment itself describes the purpose of the diagram. Deployment diagrams are
used for describing the hardware components where software components are deployed.
Component diagrams and deployment diagrams are closely related.

Component diagrams are used to describe the components and deployment diagrams shows how
they are deployed in hardware. UML is mainly designed to focus on software artefacts of a
system. But these two diagrams are special diagrams used to focus on software components and
hardware components.

So, most of the UML diagrams are used to handle logical components but deployment diagrams
are made to focus on hardware topology of a system. Deployment diagrams are used by the
system engineers.

The purpose of deployment diagrams can be described as:

• Visualize hardware topology of a system.


• Describe the hardware components used to deploy software components.
• Describe runtime processing nodes.
Steps to draw Deployment Diagram:
A deployment diagram represents the deployment view of a system. It is related to the component
diagram, because the components are deployed using deployment diagrams. A deployment
diagram consists of nodes, which are nothing but the physical hardware used to deploy the
application. Deployment diagrams are useful for system engineers. An efficient deployment
diagram is very important because it controls the following parameters:
• Performance
• Scalability
• Maintainability
• Portability
So before drawing a deployment diagram the following artefacts should be identified:

• Nodes
• Relationships among nodes

The following deployment diagram is a sample to give an idea of the deployment view of order
management system. Here we have shown nodes as:

• Monitor
• Modem
• Caching server
• Server
The application is assumed to be a web-based application deployed in a clustered
environment using server 1, server 2 and server 3. The user connects to the application
using the internet. The control flows from the caching server to the clustered environment. So,
the following deployment diagram has been drawn considering all the points mentioned above.

Figure 5.4.9.1 : Deployment Diagram

6. IMPLEMENTATION

Implementation is the stage of the project when the theoretical design is turned out into a
working system. Thus, it can be considered to be the most critical stage in achieving a successful
new system and in giving the user, confidence that the new system will work and be effective.
The implementation stage involves careful planning, investigation of the existing system and its
constraints on implementation, designing of methods to achieve changeover and evaluation of
changeover methods.

6.1 Introduction to Python:

Python is currently the most widely used multi-purpose, high-level programming language.

Python allows programming in object-oriented and procedural paradigms. Python programs are
generally smaller than those written in other programming languages like Java.

Programmers have to type relatively less, and the indentation requirement of the language makes
programs readable all the time.

Python language is being used by almost all tech-giant companies like – Google, Amazon,
Facebook, Instagram, Dropbox, Uber… etc.

The biggest strength of Python is its huge collection of standard libraries, which can be used for
the following:

• Machine Learning
• GUI Applications
• Web frameworks like Django
• Image processing
• Web scraping
• Test frameworks
• Multimedia

6.1.1 Advantages of Python:

1. Extensive Libraries:

Python downloads with an extensive library that contains code for various purposes like regular
expressions, documentation generation, unit testing, web browsers, threading, databases, CGI,
email and image manipulation, so we don’t have to write all of the code ourselves.

2. Extensible:

Python can be extended to other languages. You can write some of your code in languages like
C++ or C. This comes in handy, especially in projects.

3. Embeddable:

Complimentary to extensibility, Python is embeddable as well. You can put your Python code in
your source code of a different language, like C++. This lets us add scripting capabilities to our
code in the other language.

4. Improved Productivity:

The language’s simplicity and extensive libraries render programmers more productive than with
Java and C++. Also, you need to write less code to get more things done.

5. IOT Opportunities:

Since Python forms the basis of new platforms like Raspberry Pi, it finds the future bright for the
Internet of Things. This is a way to connect the language with the real world.

6. Simple and Easy:

In Python, just a print statement will do. It is also quite easy to learn, understand, and code.

7. Readable:

Python is much like reading English. This is the reason why it is so easy to learn, understand,
and code. It does not need curly braces to define blocks, and indentation is mandatory. This
further aids the readability of the code.

8. Object-Oriented:

This language supports both the procedural and object-oriented programming paradigms. While
functions help us with code reusability, classes and objects let us model the real world.

9. Free and Open-Source:

Python is freely available. But not only can you download Python for free, but you can also
download its source code, make changes to it, and even distribute it. It downloads with an
extensive collection of libraries to help you with your tasks.

10. Portable:

When you code your project in a language like C++, you may need to make some changes to it if
you want to run it on another platform. But it isn’t the same with Python. Here, you need to
code only once, and you can run it anywhere. This is called Write Once Run Anywhere.
However, you need to be careful enough not to include any system-dependent features.

11. Interpreted:

Python is an interpreted language. Since statements are executed one by one, debugging is easier
than in compiled languages.

6.1.2 Advantages of Python Over Other Languages:

1. Less Coding:

Almost all tasks done in Python require less coding than when the same task is done in other
languages. Python also has awesome standard library support, so you don’t have to search for
any third-party libraries to get your job done. This is the reason that many people suggest
learning Python to beginners.

2. Affordable:

Python is free, so individuals, small companies or big organizations can leverage the freely
available resources to build applications. Python is popular and widely used, so it gives you
better community support.

3. Python is for Everyone:

Python code can run on any machine, whether it is Linux, Mac or Windows. Programmers need to
learn different languages for different jobs, but with Python you can professionally build web apps,
perform data analysis and machine learning, automate things, do web scraping, and also build games
and powerful visualizations. It is an all-rounder programming language.

6.1.3 Disadvantages of Python:

So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing Python
over another language.

1. Speed Limitations:

We have seen that Python code is executed line by line. But since Python is interpreted, it often
results in slow execution. This, however, isn’t a problem unless speed is a focal point for the
project. In other words, unless high speed is a requirement, the benefits offered by Python are
enough to distract us from its speed limitations.

2. Weak in Mobile Computing and Browsers

While it serves as an excellent server-side language, Python is only rarely seen on the client
side. Besides that, it is rarely ever used to implement smartphone-based applications. One such
application is called Carbonnelle.
The reason it is not so famous despite the existence of Brython is that it isn’t that secure.

3. Design Restrictions

As you know, Python is dynamically typed. This means that you don’t need to declare the type of
a variable while writing the code. It uses duck typing. But wait, what’s that? Well, it just means
that if it looks like a duck, it must be a duck. While this is easy on the programmers during
coding, it can raise run-time errors.

4. Underdeveloped Database Access Layers

Compared to more widely used technologies like JDBC (Java Database Connectivity)
and ODBC (Open Database Connectivity), Python’s database access layers are a bit
underdeveloped. Consequently, it is less often applied in huge enterprises.

5. Simple

No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my example: I don’t do
Java, I’m more of a Python person. To me, its syntax is so simple that the verbosity of Java code
seems unnecessary.

This was all about the Advantages and Disadvantages of Python Programming Language.

6.1.4 History of Python: -

What do the alphabet and the programming language Python have in common? Right, both start
with ABC. If we are talking about ABC in the Python context, it's clear that the programming
language ABC is meant. ABC is a general-purpose programming language and programming
environment, which had been developed in the Netherlands, Amsterdam, at the CWI (Centrum
Wiskunde & Informatica). The greatest achievement of ABC was to influence the design of

36
Python. Python was conceptualized in the late 1980s. Guido van Rossum worked at that time on a
project at the CWI called Amoeba, a distributed operating system. In an interview with Bill
Venners, Guido van Rossum said: "In the early 1980s, I worked as an implementer on a team
building a language called ABC at Centrum voor Wiskunde en Informatica (CWI). I don't know
how well people know ABC's influence on Python. I try to mention ABC's influence because I'm
indebted to everything I learned during that project and to the people who worked on it." Later
on in the same interview, Guido van Rossum continued: "I remembered all my experience and
some of my frustration with ABC. I decided to try to design a simple scripting language that
possessed some of ABC's better properties, but without its problems. So I started typing. I created
a simple virtual machine, a simple parser, and a simple runtime. I made my own version of the
various ABC parts that I liked. I created a basic syntax, used indentation for statement grouping
instead of curly braces or begin-end blocks, and developed a small number of powerful data
types: a hash table (or dictionary, as we call it), a list, strings, and numbers."

6.1.5 Machine Learning:

The term ‘machine learning’ is often, incorrectly, interchanged with Artificial Intelligence, but
machine learning is actually a subfield/type of AI. Machine learning is also often referred to as
predictive analytics, or predictive modelling. Coined by American computer scientist Arthur
Samuel in 1959, the term ‘machine learning’ is defined as a “computer’s ability to learn without
being explicitly programmed”. At its most basic, machine learning uses programmed algorithms
that receive and analyse input data to predict output values within an acceptable range. As new
data is fed to these algorithms, they learn and optimise their operations to improve performance,
developing ‘intelligence’ over time.

There are four types of machine learning algorithms:

• Supervised
• Semi-supervised
• Unsupervised
• Reinforcement

6.1.6 Supervised learning:

In supervised learning, the machine is taught by example. The operator provides the machine learning algorithm with a known dataset that includes desired inputs and outputs, and the algorithm must find a method to determine how to arrive at those outputs from the inputs. While the operator knows the correct answers to the problem, the algorithm identifies patterns in the data, learns from observations and makes predictions. The algorithm's predictions are corrected by the operator, and this process continues until the algorithm achieves a high level of accuracy/performance.
Under the umbrella of supervised learning fall: Classification, Regression and Forecasting.

1. Classification: In classification tasks, the machine learning program must draw a conclusion
from observed values and determine to what category new observations belong. For example,
when filtering emails as ‘spam’ or ‘not spam’, the program must look at existing observational
data and filter the emails accordingly.
2. Regression: In regression tasks, the machine learning program must estimate – and understand
– the relationships among variables. Regression analysis focuses on one dependent variable and a
series of other changing variables – making it particularly useful for prediction and
forecasting.
3. Forecasting: Forecasting is the process of making predictions about the future based on the past
and present data, and is commonly used to analyze trends.
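
As a brief illustration of the classification and regression tasks described above (not part of the project code), a minimal sketch on tiny toy data, assuming scikit-learn is installed:

# Classification learns a class boundary; regression learns a continuous mapping.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]          # single input feature
y_class = [0, 0, 0, 1, 1, 1]                # categorical labels (classification)
y_value = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]    # continuous target (regression)

clf = LogisticRegression().fit(X, y_class)  # learns a class boundary
reg = LinearRegression().fit(X, y_value)    # learns a best-fit line

print(clf.predict([[2.5]]))   # predicted class label for a new input
print(reg.predict([[2.5]]))   # predicted continuous value for a new input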

6.1.7 Semi-supervised learning:

Semi-supervised learning is similar to supervised learning, but instead uses both labelled and
unlabeled data. Labelled data is essentially information that has meaningful tags so that the
algorithm can understand the data, whilst unlabeled data lacks that information. By using this
combination, machine learning algorithms can learn to label unlabeled data.

6.1.8 Unsupervised learning:

Here, the machine learning algorithm studies data to identify patterns. There is no answer key or
human operator to provide instruction. Instead, the machine determines the correlations and
relationships by analyzing available data. In an unsupervised learning process, the machine
learning algorithm is left to interpret large data sets and address that data accordingly. The
algorithm tries to organize that data in some way to describe its structure. This might mean
grouping the data into clusters or arranging it in a way that looks more organized. As it assesses
more data, its ability to make decisions on that data gradually improves and becomes more
refined. Under the umbrella of unsupervised learning, fall:

1. Clustering: Clustering involves grouping sets of similar data (based on defined criteria). It’s
useful for segmenting data into several groups and performing analysis on each data set to find
patterns.

2. Dimension reduction: Dimension reduction reduces the number of variables being considered to
find the exact information required.
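
As a brief illustration of clustering and dimension reduction (not part of the project code), a minimal sketch on synthetic data, assuming scikit-learn and NumPy are installed:

# K-Means groups similar rows; PCA reduces the number of variables considered.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])   # two blobs in 3-D

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # cluster id per row
X_2d = PCA(n_components=2).fit_transform(X)                              # project 3-D data to 2-D

print(labels[:5])      # e.g. the first few cluster assignments
print(X_2d.shape)      # (40, 2) after dimension reduction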
6.1.9 Reinforcement learning:

Reinforcement learning focuses on regimented learning processes, where a machine learning algorithm is provided with a set of actions, parameters and end values. By defining the rules, the machine learning algorithm then tries to explore different options and possibilities, monitoring and evaluating each result to determine which one is optimal. Reinforcement learning teaches the machine through trial and error. It learns from past experiences and begins to adapt its approach in response to the situation to achieve the best possible result.

Implementation is the stage of the project when the theoretical design is turned into a working system. Thus, it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective. The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, designing of methods to achieve changeover, and evaluation of the changeover methods.

6.2 Features:

• DataFrame object for data manipulation with integrated indexing.
• Tools for reading and writing data between in-memory data structures and different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, fancy indexing, and subsetting of large data sets.
• Data structure column insertion and deletion.
• Group-by engine allowing split-apply-combine operations on data sets.
• Data set merging and joining.
• Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
• Time-series functionality: date range generation and frequency conversion, moving-window statistics, moving-window linear regressions, date shifting and lagging.
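
The features listed above are those of the pandas library, which this project uses for data handling. As a brief, non-project illustration (the column names and file name below are only examples in the style of the diabetes data), a minimal sketch of a few of these features:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89],
    "Insulin": [0, 0, 94, 168],
    "Outcome": [1, 0, 1, 0],
})

df["Glucose"] = df["Glucose"].fillna(0)          # integrated missing-data handling
means = df.groupby("Outcome")["Insulin"].mean()  # split-apply-combine with group-by
df.to_csv("example.csv", index=False)            # writing to a file format
print(df.head())
print(means)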

6.3 NumPy:

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.

The ndarray data structure

The core functionality of NumPy is its "ndarray" (n-dimensional array) data structure. These arrays are strided views on memory. In contrast to Python's built-in list data structure (which, despite the name, is a dynamic array), these arrays are homogeneously typed: all elements of a single array must be of the same type. Such arrays can also be views into memory buffers allocated by C/C++, Cython, and Fortran extensions to the CPython interpreter without the need to copy data around, giving a degree of compatibility with existing numerical libraries. This functionality is exploited by the SciPy package, which wraps a number of such libraries (notably BLAS and LAPACK). NumPy has built-in support for memory-mapped ndarrays.
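
As a brief illustration of the ndarray described above (not part of the project code), a minimal sketch:

import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # 2-D array; all elements share one dtype

print(a.shape, a.dtype)            # (2, 3) float64
print(a * 2 + 1)                   # element-wise arithmetic, no Python loop
print(a.mean(axis=0))              # column means: [2.5 3.5 4.5]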

6.4 Matplotlib:

Several toolkits are available which extend matplotlib functionality. Some are separate
downloads; others ship with the matplotlib source code but have external dependencies.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations
in Python. Matplotlib makes easy things easy and hard things possible. Matplotlib is a cross-
platform, data visualization and graphical plotting library for Python and its numerical extension
NumPy. As such, it offers a viable open-source alternative to MATLAB. Developers can also use
matplotlib’s APIs (Application Programming Interfaces) to embed plots in GUI applications.

A Python matplotlib script is structured so that a few lines of code are all that is required in most instances to generate a visual data plot. The matplotlib scripting layer overlays two APIs:

• The pyplot API is a hierarchy of Python code objects topped by matplotlib.pyplot.
• An OO (Object-Oriented) API: a collection of objects that can be assembled with greater flexibility than pyplot. This API provides direct access to Matplotlib's backend layers.
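
As a brief illustration of the pyplot scripting layer (not part of the project code), a minimal sketch; the plotted function here is arbitrary:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")   # line plot from a few lines of code
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()                               # opens an interactive window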

6.5 Scikit-learn:

Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is largely written in Python, with some core algorithms written in Cython to achieve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR.
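
As a brief illustration of the scikit-learn fit/predict workflow (the estimator and dataset below are illustrative, not the ones used in this project), a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)   # train on 80%
print(accuracy_score(y_test, model.predict(X_test)))                   # evaluate on held-out 20%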

6.6 Modules:
• Upload Diabetes & Insulin Disease Dataset: in this module the user uploads the dataset.
• Execute Gradient Boosting Algorithm: in this module the Gradient Boosting algorithm is used to train on the dataset.
• Execute Linear Regression Algorithm: in this module the Linear Regression algorithm is used to train on the dataset.
• Predict Diabetes & Insulin Dosage: in this module diabetes and the insulin dosage are predicted.
• Performance Graph: in this module the performance graph is shown.

6.7 Algorithms:
Gradient boosting is one of the most powerful algorithms in the field of machine learning. As we know, the errors in machine learning algorithms are broadly classified into two categories, i.e. bias error and variance error. As gradient boosting is one of the boosting algorithms, it is used to minimize the bias error of the model.

Unlike the AdaBoost algorithm, the base estimator of the gradient boosting algorithm cannot be specified by us: the base estimator for gradient boosting is fixed, i.e. a decision stump. Like AdaBoost, we can tune the n_estimators parameter of the gradient boosting algorithm; if we do not mention a value, the default value of n_estimators for this algorithm is 100.

Gradient boosting can be used for predicting not only a continuous target variable (as a regressor) but also a categorical target variable (as a classifier). When it is used as a regressor, the cost function is Mean Squared Error (MSE), and when it is used as a classifier, the cost function is Log loss.
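
The project code in Section 8 trains an XGBClassifier; purely as an illustration of the ideas above (shallow, stump-like base learners and the n_estimators parameter), a minimal sketch using scikit-learn's GradientBoostingClassifier on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth=1 makes each base learner a decision stump; n_estimators defaults to 100
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=1, random_state=0)
gbc.fit(X_tr, y_tr)
print(gbc.score(X_te, y_te))   # classification accuracy on the held-out 20%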

6.7.1 Logistic Regression in Machine Learning:
o Logistic Regression is one of the most popular Machine Learning algorithms, and it comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.
o Logistic Regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
o Logistic Regression is much similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables used for the classification. The image below shows the logistic function:

Figure 6.7.1.1: Logistic Regression in Machine Learning

Note: Logistic Regression uses the concept of predictive modelling as regression; therefore, it is called Logistic Regression, but it is used to classify samples; therefore, it falls under the classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within the range of 0 to 1: sigmoid(z) = 1 / (1 + e^(-z)).
o The output of Logistic Regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.
o In Logistic Regression, we use the concept of a threshold value, which decides between 0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0.
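
As a brief illustration of the sigmoid function and the threshold rule described above (not part of the project code), a minimal sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real value into (0, 1)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
print(p)                          # approximately [0.018 0.269 0.5 0.731 0.982]
print((p >= 0.5).astype(int))     # thresholded class labels: [0 0 1 1 1]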

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:

The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:

o We know the equation of a straight line can be written as:
  y = b0 + b1*x1 + b2*x2 + ... + bn*xn
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 - y):
  y / (1 - y)   (which is 0 for y = 0 and infinity for y = 1)
o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:
  log[ y / (1 - y) ] = b0 + b1*x1 + b2*x2 + ... + bn*xn

The above equation is the final equation for Logistic Regression.

Types of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".

Note that in this project the insulin dosage itself is a continuous value, so a plain Linear Regression model is used for the dosage prediction, while the diabetes outcome (0 or 1) is handled as a classification task.

7. METHODOLOGY

7.1 Feature Scaling:

Feature Scaling or Standardization: it is a step of data pre-processing which is applied to independent variables or features of data. It basically helps to normalise the data within a particular range. Sometimes, it also helps in speeding up the calculations in an algorithm.

Package used:

sklearn.preprocessing

Import:

from sklearn.preprocessing import StandardScaler

Formula used in the backend:

Standardisation replaces the values by their Z-scores: z = (x - mean) / standard deviation.

Mostly the fit method is used for feature scaling:

fit(X, y=None)

computes the mean and standard deviation to be used for later scaling.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Read data from CSV
data = pd.read_csv('Insulindosage.csv')
data.head()

# Initialise the scaler
scaler = StandardScaler()

# Compute the mean and standard deviation used for scaling
scaler.fit(data)

# Apply the scaling
scaled_data = scaler.transform(data)

Why and Where to Apply Feature Scaling?

Real-world datasets contain features that vary highly in magnitude, units, and range. Normalisation should be performed when the scale of a feature is irrelevant or misleading, and should not be performed when the scale is meaningful.

The algorithms which use Euclidean Distance measure are sensitive to Magnitudes. Here feature
scaling helps to weigh all the features equally.

Formally, if a feature in the dataset is big in scale compared to others, then in algorithms where
Euclidean distance is measured this big scaled feature becomes dominating and needs to be
normalized.

Examples of algorithms where feature scaling matters:

1. K-Means uses the Euclidean distance measure, so feature scaling matters.
2. K-Nearest Neighbours also requires feature scaling.
3. Principal Component Analysis (PCA) tries to find the feature with maximum variance; here also feature scaling is required.
4. Gradient Descent: calculation speed increases, as the Theta calculation becomes faster after feature scaling.

Note: Naive Bayes, Linear Discriminant Analysis, and tree-based models are not affected by feature scaling.
In short, any algorithm which is not distance-based is not affected by feature scaling.

7.2 CLASSIFICATION ALGORITHMS:

In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. The data set may simply be bi-class (like identifying whether a person is male or female, or whether a mail is spam or not spam) or it may be multi-class too. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification, document classification, etc.

Here we have the common types of classification algorithms in machine learning:

Linear classifiers (e.g. Logistic Regression, Naive Bayes), Support Vector Machines, Decision Trees, KNN, etc.

Decision Trees:

Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a data set into smaller and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A
decision node has two or more branches and a leaf node represents a classification or decision.
The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
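
As a brief illustration (not part of the project code), a minimal decision tree sketch on a bundled dataset; the printed text shows the decision nodes and leaf nodes described above:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # root node, branches, and leaf classes as text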

Support Vector Machine (SVM):

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Though it can address regression problems as well, it is best suited for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane depends upon the number of features. If the number of input features is two, then the hyperplane is just a line. If the number of input features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.
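
As a brief illustration (not part of the project code), a minimal SVM sketch with two input features, so the separating hyperplane is simply a line:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)
svm = SVC(kernel="linear").fit(X, y)      # finds the separating hyperplane (a line here)

print(svm.predict([[0.0, 2.0]]))          # class of a new point
print(svm.coef_, svm.intercept_)          # coefficients of the separating line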

Logistic Regression (Predictive Learning Model):

It is a statistical method for analysing a data set in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). The goal of Logistic Regression is to find the best fitting model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables.

Linear Discriminant Analysis (LDA):

Linear discriminant analysis (LDA) is generally used to classify patterns between two classes; however, it can be extended to classify multiple patterns. LDA assumes that all classes are linearly separable and, based on this, multiple linear discriminant functions representing several hyperplanes in the feature space are created to distinguish the classes. If there are two classes, then LDA draws one hyperplane and projects the data onto this hyperplane in such a way as to maximize the separation of the two categories. This hyperplane is created according to two criteria considered simultaneously: maximizing the distance between the class means and minimizing the variance within each class.
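
As a brief illustration (not part of the project code), a minimal LDA sketch showing both classification and projection onto the discriminant axis:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=4, n_classes=2, random_state=0)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)

print(lda.predict(X[:5]))           # predicted classes for the first few samples
print(lda.transform(X[:5]).shape)   # data projected onto one discriminant axis: (5, 1)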

8. SOURCE CODE

from tkinter import messagebox

from tkinter import *

from tkinter.filedialog import askopenfilename

from tkinter import simpledialog

import tkinter

import numpy as np

from tkinter import filedialog

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt

from sklearn.preprocessing import normalize

import seaborn as sns

from xgboost import XGBClassifier

from sklearn.linear_model import LinearRegression

main = tkinter.Tk()

main.title("Machine learning Models for diagnosis of the diabetic patient and predicting insulin dosage")

main.geometry("1300x1200")

global filename

global diabetic_classifier

global insulin_classifier

global diabetic_X_train, diabetic_X_test, diabetic_y_train, diabetic_y_test

global insulin_X_train, insulin_X_test, insulin_y_train, insulin_y_test

global diabetic_dataset

global insulin_dataset

global diabetic_X

global diabetic_Y

global insulin_X

global insulin_Y

global gbc_acc

global lr_acc

def upload():
    global filename
    global diabetic_X_train, diabetic_X_test, diabetic_y_train, diabetic_y_test
    global insulin_X_train, insulin_X_test, insulin_y_train, insulin_y_test
    global diabetic_dataset
    global insulin_dataset
    filename = filedialog.askdirectory(initialdir=".")
    pathlabel.config(text=filename)
    text.delete('1.0', END)
    text.insert(END, 'Diabetes & insulin dataset loaded\n')
    diabetic_dataset = pd.read_csv('Dataset/diabetes.csv')
    diabetic_dataset.fillna(0, inplace=True)
    insulin_dataset = pd.read_csv('Dataset/insulin_dosage.csv', sep='\t')
    insulin_dataset.fillna(0, inplace=True)
    text.insert(END, "Diabetes Dataset\n")
    text.insert(END, str(diabetic_dataset.head()) + "\n\n")
    text.insert(END, "Insulin Dataset\n")
    text.insert(END, str(insulin_dataset.head()) + "\n\n")
    insulin_dataset.drop(['date', 'time'], axis=1, inplace=True)
    code = insulin_dataset['code']
    value = insulin_dataset['value']
    insulin_dataset = diabetic_dataset
    sns.pairplot(data=diabetic_dataset, hue='Outcome')
    plt.show()

def preprocess():
    text.delete('1.0', END)
    global diabetic_X
    global diabetic_Y
    global insulin_X
    global insulin_Y
    global diabetic_X_train, diabetic_X_test, diabetic_y_train, diabetic_y_test
    global insulin_X_train, insulin_X_test, insulin_y_train, insulin_y_test
    global diabetic_dataset
    global insulin_dataset
    diabetic_dataset = diabetic_dataset.values
    cols = diabetic_dataset.shape[1] - 1
    diabetic_X = diabetic_dataset[:, 0:cols]
    diabetic_Y = diabetic_dataset[:, cols]
    diabetic_X = normalize(diabetic_X)
    print(diabetic_X.shape)
    print(diabetic_Y)
    insulin_Y = insulin_dataset['Insulin']
    insulin_Y = np.asarray(insulin_Y)
    insulin_dataset.drop(['Insulin', 'Outcome'], axis=1, inplace=True)
    insulin_dataset = insulin_dataset.values
    insulin_X = insulin_dataset[:, 0:insulin_dataset.shape[1]]
    insulin_X = normalize(insulin_X)
    print(insulin_X.shape)
    diabetic_X_train, diabetic_X_test, diabetic_y_train, diabetic_y_test = train_test_split(diabetic_X, diabetic_Y, test_size=0.2, random_state=0)
    insulin_X_train, insulin_X_test, insulin_y_train, insulin_y_test = train_test_split(insulin_X, insulin_Y, test_size=0.2, random_state=0)
    text.insert(END, "Total records available in both datasets : " + str(diabetic_X.shape[0]) + "\n")
    text.insert(END, "Total records used to train machine learning algorithm (80%) : " + str(diabetic_X_train.shape[0]) + "\n")
    text.insert(END, "Total records used to test machine learning algorithm (20%) : " + str(diabetic_X_test.shape[0]) + "\n")

def gradientBoosting():
    global diabetic_classifier
    global diabetic_X_train, diabetic_X_test, diabetic_y_train, diabetic_y_test
    global diabetic_X
    global diabetic_Y
    global gbc_acc
    text.delete('1.0', END)
    diabetic_classifier = XGBClassifier()  # object of extreme gradient boosting
    diabetic_classifier.fit(diabetic_X, diabetic_Y)  # training XGB on the diabetes data
    predict = diabetic_classifier.predict(diabetic_X_test)
    gbc_acc = accuracy_score(predict, diabetic_y_test)
    text.insert(END, "Gradient Boosting Diabetes Prediction Accuracy : " + str(gbc_acc) + "\n\n")

def linearRegression():
    global insulin_classifier
    global insulin_X_train, insulin_X_test, insulin_y_train, insulin_y_test
    global insulin_X
    global insulin_Y
    global lr_acc
    insulin_classifier = LinearRegression()  # object of linear regression
    insulin_classifier.fit(insulin_X, insulin_Y)  # training linear regression on the insulin dataset
    predict = insulin_classifier.predict(insulin_X_test)
    lr_acc = 1 - insulin_classifier.score(insulin_X, insulin_Y)
    text.insert(END, "Linear Regression Insulin Dosage Prediction Accuracy : " + str(lr_acc) + "\n\n")

def graph():
    height = [gbc_acc, lr_acc]
    bars = ("Gradient Boosting Accuracy", "Linear Regression Accuracy")
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()

def predict():
    global diabetic_classifier
    global insulin_classifier
    text.delete('1.0', END)
    filename = filedialog.askopenfilename(initialdir="Dataset")
    diabetes_test = pd.read_csv(filename)
    insulin_test = pd.read_csv(filename)
    insulin_test.drop(['Insulin'], axis=1, inplace=True)
    diabetes_test = diabetes_test.values[:, 0:diabetes_test.shape[1]]
    insulin_test = insulin_test.values[:, 0:insulin_test.shape[1]]
    print(diabetes_test.shape)
    print(insulin_test.shape)
    diabetes_test = normalize(diabetes_test)
    insulin_test = normalize(insulin_test)
    diabetes_prediction = diabetic_classifier.predict(diabetes_test)
    insulin_prediction = insulin_classifier.predict(insulin_test)
    print(diabetes_prediction)
    print(insulin_prediction)
    for i in range(len(diabetes_test)):
        if diabetes_prediction[i] == 0:
            text.insert(END, "X=%s, Predicted = %s" % (diabetes_test[i], 'No Diabetes Detected') + "\n\n")
        if diabetes_prediction[i] == 1:
            text.insert(END, "X=%s, Predicted = %s" % (diabetes_test[i], 'Diabetes Detected & Required Insulin Dosage : ' + str(int(insulin_prediction[i]))) + "\n\n")

def close():
    main.destroy()

font = ('times', 16, 'bold')

title = Label(main, text='Machine learning Models for diagnosis of the diabetic patient and predicting insulin dosage')
title.config(bg='dark goldenrod', fg='white')
title.config(font=font)
title.config(height=3, width=120)
title.place(x=0, y=5)

font1 = ('times', 13, 'bold')

upload = Button(main, text="Upload Diabetic Dataset", command=upload)
upload.place(x=700, y=100)
upload.config(font=font1)

pathlabel = Label(main)
pathlabel.config(bg='DarkOrange1', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=700, y=150)

preprocessButton = Button(main, text="Preprocess Dataset", command=preprocess)
preprocessButton.place(x=700, y=200)
preprocessButton.config(font=font1)

gbButton = Button(main, text="Run Gradient Boosting Algorithm", command=gradientBoosting)
gbButton.place(x=700, y=250)
gbButton.config(font=font1)

lrButton = Button(main, text="Run Linear Regression Algorithm", command=linearRegression)
lrButton.place(x=700, y=300)
lrButton.config(font=font1)

graphButton = Button(main, text="Accuracy Graph", command=graph)
graphButton.place(x=700, y=350)
graphButton.config(font=font1)

predictButton = Button(main, text="Predict Diabetes & Insulin Dosage", command=predict)
predictButton.place(x=700, y=400)
predictButton.config(font=font1)

closeButton = Button(main, text="Exit", command=close)
closeButton.place(x=700, y=450)
closeButton.config(font=font1)

font1 = ('times', 12, 'bold')
text = Text(main, height=30, width=80)
scroll = Scrollbar(text)
text.configure(yscrollcommand=scroll.set)
text.place(x=10, y=100)
text.config(font=font1)

main.config(bg='turquoise')

main.mainloop()

9. TESTING

9.1 Test Dataset:

A test dataset is a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset. If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place. A better fit to the training dataset as opposed to the test dataset usually points to overfitting.

A test set is therefore a set of examples used only to assess the performance (i.e., generalization) of a fully specified classifier.
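
As a brief illustration of this idea (not part of the project code), a minimal sketch that compares training accuracy with test accuracy; a large gap between the two indicates overfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # unrestricted tree
print("train accuracy:", model.score(X_tr, y_tr))   # often near 1.0
print("test accuracy :", model.score(X_te, y_te))   # noticeably lower when overfitting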

9.2 Validation Dataset:

A validation dataset is a set of examples used to tune the hyperparameters (i.e., the architecture) of a classifier. It is sometimes also called the development set or the "dev set".

In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation dataset in addition to the training and test datasets. For example, if the most suitable classifier for the problem is sought, the training dataset is used to train the candidate algorithms, the validation dataset is used to compare their performances and decide which one to take, and, finally, the test dataset is used to obtain the performance characteristics such as accuracy, sensitivity, specificity, F-measure, and so on. The validation dataset functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing.

(The basic process of using a validation dataset for model selection, as part of the training dataset, validation dataset, and test dataset.)

Since our goal is to find the model having the best performance on new data, the simplest
approach to the comparison of different networks is to evaluate the error function using data
which is independent of that used for training. Various networks are trained by minimization of
an appropriate error function defined with respect to a training data set. The performance of the
networks is then compared by evaluating the error function using an independent validation set,
and the network having the smallest error with respect to the validation set is selected. This
approach is called the holdout method. Since this procedure can itself lead to some overfitting to
the validation set, the performance of the selected network should be confirmed by measuring its
performance on a third independent set of data called a test set.

An application of this process is in early stopping, where the candidate models are successive
iterations of the same network, and training stops when the error on the validation set grows,
choosing the previous model (the one with minimum error).

9.3 Selection of a validation dataset:

9.3.1 Holdout method

Most simply, part of the training dataset can be set aside and used as a validation set: this is
known as the holdout method. Common proportions are 70%/30% training/validation

9.3.2 Cross-validation

Alternatively, the holdout process can be repeated, repeatedly partitioning the original training dataset into a training dataset and a validation dataset: this is known as cross-validation. These repeated partitions can be done in various ways, such as dividing into 2 equal datasets and using them as training/validation and then validation/training, or repeatedly selecting a random subset as a validation dataset.
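
As a brief illustration of cross-validation (not part of the project code), a minimal sketch using scikit-learn's cross_val_score; the estimator here is illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# cv=5 means 5 different train/validation partitions of the same data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one validation score per fold
print(scores.mean())   # average validation performance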

While building our model, we split the dataset into two parts, one used for training the model and the other for testing or validating the model's performance (our implementation uses an 80%/20% train/test split).

9.4 TESTING STRATEGIES:


9.4.1 UNIT TESTING
Unit testing is a technique in which individual modules are tested by the developer to determine whether there are issues. It is concerned with the functional correctness of the standalone modules. The main aim is to isolate each unit of the system to identify, analyze and fix the defects.

Unit Testing Techniques:

Black Box Testing – the user interface, inputs and outputs are tested.
White Box Testing – the behaviour of each individual function is tested.

9.4.2 DATA FLOW TESTING

Data flow testing is a family of testing strategies based on selecting paths through the program's control flow in order to explore sequences of events related to the status of variables or data objects. Dataflow testing focuses on the points at which variables receive values and the points at which these values are used.

9.4.3 INTEGRATION TESTING

Integration testing is done upon completion of unit testing; the units or modules are integrated, which gives rise to integration testing. The purpose of integration testing is to verify the functional, performance, and reliability requirements between the modules that are integrated.

9.4.4 BIG BANG INTEGRATION TESTING

Big Bang Integration Testing is an integration testing Strategy wherein all units are linked at
once, resulting in a complete system. When this type of testing strategy is adopted, it is difficult to
isolate any errors found, because attention is not paid to verifying the interfaces across
individual units.

9.4.5 USER INTERFACE TESTING

User interface testing is a technique used to identify the presence of defects in a product/software under test through its Graphical User Interface (GUI).

9.5 TEST CASES:

S.NO | INPUT | OUTPUT | RESULT
Test Case 1 (Unit testing of Dataset) | The user gives the input in the form of Diabetes & Insulin Dataset details. | An output is predicted: diabetes and dosage. | A result is predicted on whether the patient has diabetes or not.
Test Case 2 (Unit testing of Accuracy) | The user gives the input in the form of patient test data. | An output is predicted: diabetes and dosage. | A result is predicted on whether the patient has diabetes or not, using the Linear Regression and Gradient Boosting algorithms, up to 80% accuracy.
Test Case 3 (Unit testing of Machine Learning Algorithms) | The user gives the input in the form of patient test data. | An output is predicted: diabetes and dosage. | A result is predicted on whether the patient has diabetes or not, using the Linear Regression and Gradient Boosting algorithms, up to 80% accuracy.
Test Case 4 (Integration testing of Dataset) | The user gives the input in the form of patient test data. | An output is predicted: diabetes and dosage. | A result is predicted on whether the patient has diabetes or not, using the Linear Regression and Gradient Boosting algorithms, up to 80% accuracy.
Test Case 5 (Big Bang testing) | The user gives the input in the form of patient test data. | An output is predicted: diabetes and dosage. | A result is predicted on whether the patient has diabetes or not.
Test Case 6 (Data Flow Testing) | The user gives the input in the form of Diabetes & Insulin Dataset details. | An output is predicted: diabetes and dosage. | A result is predicted on whether the patient has diabetes or not.
Test Case 7 (User Interface Testing) | The user gives the input in the form of Diabetes & Insulin Dataset details. | An output is predicted on whether the patient has diabetes, together with the dosage. | A result is predicted on whether the patient has diabetes or not.
Test Case 8 (User Interface Testing, event based) | The user gives the input in the form of patient test data. | An output is predicted: diabetes and dosage. | A result is predicted on whether the patient has diabetes or not, using the Linear Regression and Gradient Boosting algorithms, up to 80% accuracy.
10. OUTPUT SCREENSHOTS
Machine learning Models for diagnosis of the diabetic patient and predicting insulin dosage

In this project we are using a Gradient Boosting classifier to predict diabetes and then using a Linear Regression algorithm to predict the insulin dosage for patients in whom diabetes is detected. To implement this project we use the PIMA diabetes dataset and the UCI insulin dosage dataset. We train both algorithms with the above-mentioned datasets; once trained, we upload a test dataset with no class label, and then Gradient Boosting predicts the presence of diabetes and Linear Regression predicts the insulin dosage if diabetes is detected by Gradient Boosting.

Both datasets are available inside the Dataset folder, and the screen below shows the dataset details.

Figure 10.1: Test values

In the above folder we can see both datasets; in the testValues.csv file there is no class label for diabetes (0 means no diabetes detected and 1 means diabetes detected). When we apply the Gradient Boosting algorithm to the above test values, Gradient Boosting predicts the class label and Linear Regression predicts the insulin dosage.

SCREEN SHOTS:

To run the project, double-click on the 'run.bat' file to get the screen below.

Figure 10.2: After Run Project Screenshot

In the above screen, click on the 'Upload Diabetic Dataset' button to upload the dataset.

Figure 10.3: Upload Diabetic Dataset Screenshot

In the above screen we select and upload the entire 'Dataset' folder to load both the diabetes and insulin datasets, and then click on the 'Select Folder' button to get the screen below.

Figure 10.4: After Upload Dataset Screenshot

In the above screen we can see that both datasets are loaded, along with some records from each dataset, and we also get the graph below.

Figure 10.5: Graph for Diabetes Indicators

In the above graph we can see the diabetes outcome for each column, where red dots indicate the presence of diabetes and blue dots represent no diabetes detected. We plot a graph for each column value to show which values occur with and without diabetes; for example, the first column plots 'number of pregnancies' against 'presence or absence of diabetes'. Now close the above graph and then click on the 'Pre-process Dataset' button to remove missing values and to split the dataset into train and test sets.

Figure 10.6: Pre-processing of Dataset

In the above screen we can see that the dataset contains a total of 768 records; the application uses 80% of the records for training and 20% of the records to test ML accuracy. Now that the dataset is ready, click on the 'Run Gradient Boosting Algorithm' button to train gradient boosting with the above dataset.

Figure 10.7: Accuracy for Gradient Boosting Algorithm

In the above screen we got a gradient boosting accuracy of 98%. Now click on the 'Run Linear Regression Algorithm' button to build the Linear Regression model.

Figure 10.8: Accuracy for Linear Regression

In the above screen, with Linear Regression we got 51% accuracy. Now click on the 'Accuracy Graph' button to get the graph below.

Figure 10.9: Accuracy Graph

In the above graph the x-axis represents the algorithm name and the y-axis represents the accuracy of those algorithms. Now close the above graph and then click on the 'Predict Diabetes & Insulin Dosage' button to upload test values and get the prediction result.

Figure 10.10: Predicting Diabetes and Insulin Dosage

In the above screen we select and upload the 'testValues.csv' file and then click on the 'Open' button to get the result below.

Figure 10.11: Insulin Dosage when detected

In the above screen, inside the square brackets we can see the test values, and after the square brackets we can see the predicted result as 'No Diabetes Detected' or 'Diabetes Detected'; if diabetes is detected, Linear Regression predicts the insulin dosage.

11. CONCLUSION AND FUTURE ENHANCEMENT
This project was aimed at modelling a system for the prediction of the amount of insulin dosage suitable for diabetic patients. A model based on gradient boosting was used. The model uses input information about each patient, such as height, weight, blood sugar, and gender. Many experiments were conducted on 180 patients' data. We use Gradient Boosting to predict diabetes and then apply the Linear Regression algorithm to predict the insulin dosage when diabetes is detected by the Gradient Boosting algorithm. The Gradient Boosting model converged fast and gave results with high performance.

Future Enhancement:

The results of the research conducted in this project are promising for a wide range of applications in T1D therapy. Nevertheless, the performance and safety of the predictions can be improved further by generating a set of interchangeable models that predict useful blood glucose values for control and therapy purposes, based on the determination of individual-specific dynamics, lifestyle, and other factors. An extension of this work will include testing personalized blood glucose prediction models in a more challenging setting involving real subjects.

12. REFERENCES
[1] American Diabetes Association, "Diagnosis and Classification of Diabetes Mellitus," Diabetes Care, Vol. 31, No. 1, 2008, pp. 55-60, ISSN 1935-5548.

[2] E. I. Georga, V. C. Protopappas and D. I. Fotiadis, "Glucose Prediction in Type 1 and Type 2 Diabetic Patients Using Data Driven Techniques," Knowledge-Oriented Applications in Data Mining, InTech, pp. 277-296, 2011.

[3] V. Tresp, T. Briegel and J. Moody, "Neural-network models for the blood glucose metabolism of a diabetic," IEEE Transactions on Neural Networks, Vol. 10, No. 5, 1999, pp. 1204-1213, ISSN 1045-9227.

[4] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.

[5] S. Haykin, Neural Networks and Learning Machines, Pearson, 2008.

[6] W. D. Patterson, Artificial Neural Networks: Theory and Applications, Prentice Hall, Singapore, 1996.

[7] M. Pradhan and R. Sahu, "Predict the onset of diabetes disease using Artificial Neural Network (ANN)," International Journal of Computer Science & Emerging Technologies, Volume 2, Issue 2, April 2011.

[8] W. Sandham, D. Nikoletou, D. Hamilton, K. Paterson, A. Japp and C. MacGregor, "Blood glucose prediction for diabetes therapy using a recurrent artificial neural network," EUSIPCO, Rhodes, 1998, pp. 673-676.

[9] https://www.w3schools.com/python/
[10] https://www.tutorialspoint.com/python/index.html
[11] https://www.javatpoint.com/python-tutorial
[12] https://www.learnpython.org/

13. BIBLIOGRAPHY
1. Programming Python, Mark Lutz

2. Head First Python, Paul Barry

3. Core Python Programming, R. Nageswara Rao

4. Learning with Python, Allen B. Downey

5. Explicit App of Two-Level Multi-Level Linear Regression, Getnet Begashaw

6. Estimating Engineering and Manufacturing Development Cost Risk Using Logistic and Multiple Regression, John V. Bielecki, 2012
