
A

Mini Skill Based Project Report


On
Machine Learning & Optimization (270404)
In fulfilment of the requirement for the award of the degree

SUBMITTED BY
Ankit Sharma (0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore (0901AD211010)

4th SEMESTER
Artificial Intelligence And Data Science
SUBMITTED TO
Prof. Vibha Tiwari

Department Of Information Technology

Madhav Institute of Technology and Science, Gwalior


(A Govt. Aided UGC Autonomous & NAAC Accredited Institute Affiliated to RGPV, Bhopal)
Session: 2023
DECLARATION

I hereby declare that the mini skill based project for the course Machine Learning &
Optimization (270404) is being submitted in partial fulfilment of the requirements for the
award of Bachelor of Technology in Artificial Intelligence And Data Science.
All the information in this document has been obtained and presented in accordance with
academic rules and ethical conduct.
Date : 15-03-2023
Place: Gwalior

Ankit Sharma (0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore (0901AD211010)
ACKNOWLEDGEMENT

I would like to express my greatest appreciation to all the individuals who have helped and
supported me throughout this lab file. I am thankful to the whole Information Technology
department for their ongoing support during the experiments, from initial advice and provision
of contacts in the first stages through ongoing advice and encouragement, which led to the final
report of this lab file.
A special acknowledgement goes to my colleagues, who helped me complete the file by
exchanging interesting ideas for dealing with problems and by sharing their experience.
I also wish to thank our professor, Vibha Tiwari, for her undivided support and interest,
which inspired and encouraged me to go my own way, and without whom I would have been
unable to complete my project.
Finally, I want to thank my friends, who showed appreciation for my work and motivated
me to continue it.

Ankit Sharma (0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore (0901AD211010)
CHAP 1 – INTRODUCTION
1.1 PROBLEM STATEMENT
Diabetes is a chronic illness that arises when the body cannot produce insulin or cannot
use the insulin that it produces [1]. The effects of diabetes mellitus include long-term
damage, dysfunction and failure of various organs (WHO). As a result, it significantly
increases mortality in patients. There are two main types of diabetes: Type I (T1) and
Type II (T2). T1 occurs when the body is no longer able to produce insulin; it is common
in childhood and is also known as juvenile diabetes. This form of diabetes is less common;
only about 5-10% of people with diabetes have T1 (American Diabetes Association, 2010).
T2 occurs when the body is unable to utilize the insulin produced or when not enough
insulin is produced [9, 10 and 11]. In addition, there is another type named gestational
diabetes, which develops during pregnancy. Too much glucose in the blood can damage the
eyes, kidneys, and nerves. It can also cause heart disease, stroke, and insufficient blood
flow to the legs. Being overweight, lack of exercise, family history and stress increase
the risk of diabetes [14, 15]. In Bangladesh, health awareness is limited. There are
7.1 million cases of diabetes in Bangladesh, and the number is rising. Many people are
unaware of the disease and do not get tested for it.
Classification is a supervised learning technique in machine learning which is used for
prediction by learning a relationship between existing data and a target value, i.e., the
diabetes outcome in this case. Different factors are taken into consideration while
predicting whether a person is diabetic, such as age, body mass index (BMI), blood
pressure and blood glucose level. The model learns from these parameters together with
the known target values of the patients in the dataset.

1.2 Conceptual Background of the Domain Problem

The domain problem of a machine learning binary classification model to predict whether a
person is diabetic or not falls under the umbrella of healthcare and medical informatics.
Diabetes is a chronic condition that affects the way the body processes blood sugar, and it can
lead to serious complications such as heart disease, stroke, and kidney failure if left untreated.

The goal of a binary classification model in this domain is to accurately predict whether a
person has diabetes based on their medical history, physical examination, and other relevant
factors. The model would be trained on a dataset of individuals with and without diabetes, and
it would use features such as age, body mass index (BMI), blood pressure, and blood glucose
levels to make its predictions.
The development of a binary classification model for diabetes diagnosis is important because
it can help healthcare providers to make more accurate and timely diagnoses, which in turn can
lead to better treatment outcomes and improved quality of life for patients. Additionally, such
models can help identify high-risk individuals who may benefit from preventive interventions
or lifestyle modifications to reduce their risk of developing diabetes.

1.3 Motivation for the Problem Undertaken


This project was assigned to our group by Prof. Vibha Tiwari as part of the mini skill
based project. The exposure to real-world data and the opportunity to apply our skill set
to a real-world problem have been the primary motivation.
Our main objective in this project is to build a model that predicts whether a person
is diabetic with the help of the other supporting features in the dataset.
The motivation for developing a machine learning binary classification model to predict if a
person is diabetic or not is primarily driven by the need to improve healthcare outcomes for
individuals with diabetes. Diabetes is a chronic disease that affects millions of people
worldwide, and it can lead to serious complications such as heart disease, stroke, kidney failure,
and blindness if not managed properly. Early detection and treatment of diabetes are critical to
preventing these complications and improving the quality of life for people with diabetes.

Chap 2- Analytical Problem Formulation

2.1 Mathematical / Analytical Modelling of the Problem

The goal of this project is to find a model that predicts diabetes with better accuracy. We
experimented with different classification and ensemble algorithms to predict diabetes. In the
following, we briefly discuss each phase. The algorithms used in this project each rely on
their own mathematical formulation in the background. The project uses data collected from
different patient samples, from which we separate our training and testing data. Initially,
data cleaning and pre-processing are performed on the data. Ordinal encoding is performed to
convert the different ranges of Quality data into two categories. In model building, the final
model is selected from the different models and algorithms based on the evaluation benchmark.
Different graphs and plots are also drawn for a better understanding of the data.
2.2 Data Sources and their formats
The data is gathered from the UCI repository and is named the Pima Indian Diabetes Dataset.
The dataset has several attributes for 768 patients. The 9th attribute is the class variable
of each data point. This class variable takes the value 0 or 1, indicating a negative or
positive result for diabetes respectively.

2.3 Importing The Libraries


Let us start the development by importing all the needed libraries.

We will be using these libraries throughout our program.
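The original import cell is not reproduced in this report. A minimal sketch of what it might contain is given here, assuming the pandas, NumPy and scikit-learn stack that the later sections refer to; the exact list is an assumption, not the authors' cell.

```python
# Assumed import list for the steps described in this report.
# Core numerical and data-handling libraries
import numpy as np
import pandas as pd

# scikit-learn pieces for splitting, scaling, modelling and evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, roc_auc_score

# Plotting libraries such as matplotlib and seaborn would be imported here too
# for the distribution plots, box plots and the correlation heatmap.
```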

2.3.1 Importing The Diabetes Data:


Here we will import the CSV file and store its data in a variable named dataset. We will
do this using the pandas read_csv command, providing it with the location of the file as an
argument.
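The loading step can be sketched as follows. Because the actual file path is not given in this report, the sketch reads from an in-memory string instead; the three rows are made-up illustrative values and the column names are assumed, so both should be replaced by the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: made-up rows, NOT real patient records.
# The column names are an assumption; adjust them to match the actual file.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,30,100,28.5,0.45,33,0
5,160,80,35,150,34.1,0.62,47,1
0,95,64,22,0,24.3,0.20,25,0
"""

# With the real data, the file location would be passed to read_csv instead
# of the in-memory buffer used here.
dataset = pd.read_csv(io.StringIO(csv_text))
print(dataset)
```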

Let us take a look at the data


2.4 Performing EDA (Exploratory Data Analysis)
2.4.1 Head and Tail
The head and tail commands provide us with the first five and last five rows respectively.
Head:

Tail:
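A small sketch of the two calls, using a made-up eight-row table in place of the real data (the values and the reduced column set are illustrative assumptions):

```python
import pandas as pd

# Made-up values for illustration only (not taken from the real dataset)
dataset = pd.DataFrame({
    "Glucose": [120, 160, 95, 130, 110, 145, 100, 170],
    "Age": [33, 47, 25, 51, 29, 40, 22, 58],
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1],
})

first_rows = dataset.head()  # first five rows (default n=5)
last_rows = dataset.tail()   # last five rows (default n=5)
print(first_rows)
print(last_rows)
```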
2.4.2 Count
It counts the number of non-null rows for each column.

2.4.3 Describe

It calculates summary statistics of the data, such as the mean, standard deviation, minimum,
quartiles and maximum.

2.4.4 Nunique

It shows the number of unique values in each column.
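The three summary calls above can be tried on a small made-up table; the values below are illustrative assumptions, with one missing entry planted to show that count() skips nulls.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; one Glucose entry is deliberately missing
dataset = pd.DataFrame({
    "Glucose": [120, 160, np.nan, 130, 120],
    "Age": [33, 47, 25, 51, 33],
    "Outcome": [0, 1, 0, 1, 0],
})

counts = dataset.count()      # non-null entries per column
summary = dataset.describe()  # count, mean, std, min, quartiles, max per column
uniques = dataset.nunique()   # distinct non-null values per column

print(counts)
print(summary)
print(uniques)
```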


2.4.5 Info
It provides the structure of the data.

2.4.6 Shape, isna and sum


Shape shows the size of the data, i.e., the number of rows and columns.
The isna function finds the NA values in each column, and the sum function adds them up to
display the total number of NA values per column.
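A short sketch of info, shape and isna().sum() on a made-up table with a few missing entries (values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration, with missing entries planted on purpose
dataset = pd.DataFrame({
    "Glucose": [120, np.nan, 95, 130],
    "BMI": [28.5, 34.1, np.nan, np.nan],
    "Outcome": [0, 1, 0, 1],
})

dataset.info()                  # prints column names, dtypes and non-null counts
n_rows, n_cols = dataset.shape  # (number of rows, number of columns)
missing = dataset.isna().sum()  # number of NA values in each column

print(n_rows, n_cols)
print(missing)
```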

2.5 Data Pre-processing


Data preprocessing is the most important process. Healthcare-related data often contains
missing values and other impurities that can reduce the effectiveness of the data. Data
preprocessing is done to improve the quality and effectiveness of the results obtained from
the mining process. To use machine learning techniques on the dataset effectively, this
process is essential for accurate results and successful prediction. For the Pima Indian
diabetes dataset we need to perform pre-processing in two steps:
1). Missing values removal- Remove all the instances that have zero (0) as a value. A zero
is not possible for measurements such as glucose, blood pressure or BMI, therefore such
instances are eliminated. By eliminating irrelevant features/instances we make a feature
subset; this process is called feature subset selection, which reduces the dimensionality of
the data and helps the model work faster.
2). Splitting of data- After cleaning, the data is normalized and split for training and
testing the model. We train the algorithm on the training data set and keep the test data
set aside. The training process produces a trained model based on the logic of the algorithm
and the values of the features in the training data. Here the data is split randomly into
training data (70%) and testing data (30%). Now we fit a decision tree classifier model on
the training data and evaluate its performance on the testing data.
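The two steps above can be sketched as follows. The data here is randomly generated as a stand-in for the real dataset, and the list of columns in which zero counts as "missing" (measure_cols) is an assumption; neither reflects the authors' actual notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (random numbers, NOT the Pima dataset)
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=n),
    "BMI": rng.uniform(18, 45, size=n).round(1),
    "Age": rng.integers(21, 70, size=n),
})
dataset["Outcome"] = (dataset["Glucose"] > 140).astype(int)
dataset.loc[:9, "Glucose"] = 0  # plant ten impossible zero readings

# Step 1: drop rows where a measurement that cannot be zero is recorded as zero
measure_cols = ["Glucose", "BMI"]  # assumed list of such columns
cleaned = dataset[(dataset[measure_cols] != 0).all(axis=1)]

# Step 2: random 70/30 split into training and testing data
X = cleaned.drop(columns="Outcome")
y = cleaned["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training part only, then evaluate on the held-out test part
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(len(cleaned), len(X_train), len(X_test), test_accuracy)
```

Columns such as Pregnancies and Outcome can legitimately be zero, which is why the filter is restricted to a chosen set of measurement columns rather than applied to every column.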
2.6 Data Inputs - Logic - Output Relationships
A correlation heatmap is plotted to understand the relationship between the target
feature and the independent features. To gain insight into the relationship between
input and output, different types of visualization are plotted, which we will see in the
EDA section of this report. A heatmap/correlation matrix is used to find the correlation
between two features, or between a feature and a label (columns); in short, it tells how
they are related to each other. In a heatmap, each cell in the grid represents a
combination of two variables, with one variable on the x-axis and the other variable on
the y-axis. The value in each cell is represented by a colour, with lighter colours
representing higher values and darker colours representing lower values: the colour gets
lighter as the correlation tends to 1 and darker as it tends to -1.
In the above heatmap, brighter colours indicate more correlation. As we can see from the
table and the heatmap, glucose level, age, BMI and number of pregnancies all have
significant correlation with the outcome variable. Also notice the correlation between
pairs of features, like age and pregnancies, or insulin and skin thickness.
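The numbers behind such a heatmap come from a correlation matrix. A minimal sketch, assuming Pearson correlation (the pandas default) and using randomly generated stand-in data in which the outcome is made to depend on glucose:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (random numbers, NOT the Pima dataset)
rng = np.random.default_rng(1)
n = 300
glucose = rng.normal(120, 30, size=n)
bmi = rng.normal(32, 6, size=n)
dataset = pd.DataFrame({
    "Glucose": glucose,
    "BMI": bmi,
    # Outcome is constructed to depend on glucose so that a correlation exists
    "Outcome": (glucose + rng.normal(0, 20, size=n) > 130).astype(int),
})

corr = dataset.corr()  # pairwise Pearson correlation between all columns
print(corr["Outcome"].sort_values(ascending=False))

# With seaborn installed, sns.heatmap(corr, annot=True) would draw this
# matrix as a coloured grid.
```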

2.7 Hardware & Software Requirements with Tools Used

Hardware Used -
1. Processor — Intel i5 processor
2. RAM—8GB

Software utilised –

1. Anaconda – Jupyter Notebook


2. Google Colab – for hyperparameter tuning

Chap. 3- Models Development & Evaluation

3.1 Identification Of Possible Problem-Solving Approaches (Methods)

This is the most important phase, which includes model building for the prediction of
diabetes. In it we have implemented various machine learning algorithms for diabetes
prediction.
For that purpose, the first task is to convert categorical variables into numerical features.
Once data encoding is done, the data is scaled using a standard scaler. The final model is
built on this scaled data. Before any algorithm is applied, the data is split into training
and test data using train_test_split from the model_selection module of the sklearn library.
Cross-validation is primarily used in applied machine learning to estimate the skill of a
machine learning model on unseen data. That is, it uses a limited sample to estimate how the
model is expected to perform in general when making predictions on data not used during
training. After that, the model is trained with various classification algorithms and 5-fold
cross-validation is performed. Finally, hyperparameter tuning is performed on the best model
to build a more accurate one.
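The scaling, 5-fold cross-validation and tuning steps can be sketched as below. The data is a synthetic stand-in from make_classification, logistic regression is used only as one example classifier, and the grid of C values is arbitrary; none of these choices is taken from the authors' notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 numeric features (NOT the Pima dataset)
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scaling sits inside the pipeline so that each cross-validation fold is
# scaled using statistics from its own training part only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training data
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(cv_scores.mean())

# Hyperparameter tuning: grid search over the regularisation strength C
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}  # arbitrary grid
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

Keeping the scaler in the pipeline avoids leaking information from the validation folds into the scaling statistics.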

3.2 Testing of Identified Approaches (Algorithms)


The different evaluation measures and diagnostic plots used in this project to assess the
ML model are as below:
• Confusion Matrix
• Accuracy Score
• ROC curve
• AUC Score
• Dist Plot
• Box Plot
• Outlier Detection

3.2.1 Confusion Matrix


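A confusion matrix is a table that summarizes the performance of a model by comparing
the predicted and actual classes of a set of test data. Several other performance
metrics can be calculated from the confusion matrix. Here TP, TN, FP and FN denote the
numbers of true positives, true negatives, false positives and false negatives.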
1. Accuracy: The proportion of correctly classified instances out of the total number
of instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision: The proportion of true positive predictions among all positive
predictions.
Precision = TP / (TP + FP)
3. Recall: The proportion of true positive predictions among all actual positive
instances.
Recall = TP / (TP + FN)
4. F1-score: The harmonic mean of precision and recall.
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
All these values are as follows:
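The report's own output is not reproduced here. As a worked illustration of the four formulas, the sketch below uses ten made-up labels (not results from this project) and scikit-learn's confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (not results from this project)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(tp, tn, fp, fn)                   # 3 5 1 1
print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```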
3.2.2 Accuracy Score, ROC curve, AUC Score:
• The accuracy_score() method of sklearn.metrics accepts the true labels of the samples
and the labels predicted by the model as its parameters and returns the accuracy score
as a float value.
• The ROC curve is a graph that shows the performance of a classification model at all
possible thresholds (a threshold is a particular value beyond which a point is said to
belong to a particular class). The curve is plotted between two parameters:
  • TRUE POSITIVE RATE
  • FALSE POSITIVE RATE
• AUC measures how well a model is able to distinguish between classes. The AUC
score is simply the area under the ROC curve, which can be calculated by numerical
integration (for example the trapezoidal rule or Simpson's Rule). The bigger the AUC
score, the better our classifier. Let us calculate the AUC score, ROC curve and
accuracy score:
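A minimal sketch of the three computations, using four made-up samples and predicted probabilities (not this project's model output). The 0.5 cut-off used for the accuracy score is an assumption; scikit-learn's auc() integrates the curve with the trapezoidal rule.

```python
import numpy as np
from sklearn.metrics import accuracy_score, auc, roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Accuracy needs hard labels, so apply an assumed threshold of 0.5
y_pred = (y_score >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

# ROC curve: false positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC directly from the scores, and again as the area under (fpr, tpr)
auc_score = roc_auc_score(y_true, y_score)
auc_trapz = auc(fpr, tpr)

print(acc)        # 0.75
print(auc_score)  # 0.75
```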

3.2.3 Dist Plot:


A distplot plots a univariate distribution of observations. The distplot() function combines
the matplotlib hist function with the seaborn kdeplot() and rugplot() functions. Let us show
the distplots of age, pregnancies and diabetes pedigree function.
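The histogram part of such a plot is just a set of binned counts. The sketch below computes them with NumPy for ten made-up ages (illustrative values, with an arbitrary choice of four bins). Note that distplot() is deprecated in recent seaborn releases in favour of histplot() and displot().

```python
import numpy as np

# Made-up ages for illustration only
ages = np.array([21, 22, 25, 29, 31, 33, 41, 45, 52, 60])

# Raw counts per bin over four equal-width bins from 20 to 60
counts, edges = np.histogram(ages, bins=4, range=(20, 60))

# Density version: bar heights scaled so the total bar area equals 1
density, _ = np.histogram(ages, bins=4, range=(20, 60), density=True)

print(counts)  # [4 2 2 2]
print(edges)   # [20. 30. 40. 50. 60.]

# With seaborn installed, sns.histplot(ages, kde=True) would draw the
# histogram together with a kernel density estimate.
```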

[Figures: Distplot of Pregnancies; Distplot of Age; Distplot of Diabetes Pedigree Function]


3.2.4 BoxPlot:
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that
facilitates comparisons between variables or across levels of a categorical variable. The box
shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution,
except for points that are determined to be “outliers” using a method that is a function of the
inter-quartile range. Let us see the boxplots of Glucose, Insulin, BMI and Skin Thickness:

[Figures: Boxplot of Glucose; Boxplot of Insulin; Boxplot of SkinThickness; Boxplot of BMI]


3.2.5 Outlier Detection:
Outliers are data points that lie outside the general range of the data. Outliers create
problems in prediction because they can decrease accuracy, so let us find the outliers in
the data. If, after calculating the accuracy of the model, we are not satisfied with it,
we may remove the outliers, if any. Here we can see the exact number of outliers in each
column.
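The report does not state which rule was used to flag outliers. A common choice, and the one box plots usually rely on, is the 1.5 x IQR rule; the sketch below assumes that rule and applies it to made-up values. The helper name count_outliers is hypothetical.

```python
import pandas as pd

# Made-up values for illustration; 600 is planted as an obvious outlier
dataset = pd.DataFrame({
    "Insulin": [80, 90, 100, 110, 120, 130, 140, 600],
    "BMI": [22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 33.5, 35.0],
})


def count_outliers(column: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((column < lower) | (column > upper)).sum())


outlier_counts = {name: count_outliers(dataset[name]) for name in dataset.columns}
print(outlier_counts)  # {'Insulin': 1, 'BMI': 0}
```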

So we will not remove the outliers and will simply continue with them.

Chap 4- Conclusion
The main aim of this project was to design and implement diabetes prediction using machine
learning methods and to analyse the performance of those methods, and it has been achieved
successfully. The proposed approach uses various classification and ensemble learning methods,
in which Logistic Regression and Gradient Boosting classifiers are used together with Boxplot,
Distplot, Confusion Matrix, Accuracy Score, ROC curve and AUC Score for analysis and
evaluation. A classification accuracy of 0.77 (77%) has been achieved. The experimental results
can assist health care providers in making early predictions and early decisions to treat
diabetes and save human lives.
