FDS Cia-3

CIA-3
Name: VARSHA DAHIYA

Roll No.: 22111766
Subject: Fundamentals of Data Science
Course Code: BBA112L
BACKGROUND: This report covers 50 female patients who were tested for diabetes. It includes the patients with positive results. The variables in the data are:
1. Insulin rate in the body
2. Number of pregnancies
3. Glucose
4. Blood pressure
5. BMI
PREREQUISITE KNOWLEDGE:
1. Normal blood pressure is less than 120/80 mmHg.
2. A fasting blood sugar level of 99 mg/dL or lower is normal, 100 to 125 mg/dL indicates prediabetes, and 126 mg/dL or higher indicates diabetes.
3. An A1C level of 6.5% or higher on two separate tests indicates diabetes. An A1C between 5.7% and 6.4% indicates prediabetes. Below 5.7% is considered normal.
4. A normal measurement of free insulin is less than 17 mcU/mL.
5. A BMI above 25 increases the risk of diabetes.
6. After pregnancy, the female body becomes more sensitive to insulin and is at risk of low blood pressure.
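The thresholds listed above can be written as simple checks. This is a minimal sketch for illustration only; the function names are hypothetical and are not part of the report or of any library.

```python
def classify_fasting_glucose(mg_dl: float) -> str:
    """Classify a fasting blood sugar reading (mg/dL) with the cut-offs listed above."""
    if mg_dl < 100:          # 99 mg/dL or lower
        return "normal"
    if mg_dl < 126:          # 100 to 125 mg/dL
        return "prediabetes"
    return "diabetes"        # 126 mg/dL or higher


def classify_a1c(percent: float) -> str:
    """Classify an A1C reading (%) with the cut-offs listed above."""
    if percent < 5.7:
        return "normal"
    if percent < 6.5:        # 5.7% to 6.4%
        return "prediabetes"
    return "diabetes"        # 6.5% or higher


def bmi_above_25(bmi: float) -> bool:
    """Flag a BMI above 25, the level the report associates with diabetes risk."""
    return bmi > 25
```

A single reading is only a screening indication; as noted above, an A1C-based diagnosis relies on two separate tests.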
ANALYSIS:
• The women's ages lie between 23 and 62.
• More than 90% of the women have given birth, and for the majority the number of pregnancies ranges from 4 to 7.
• Around 95% of the women have a BMI above 25.
• Most of the women with diabetes have low blood pressure.
• The insulin rate is either 0 or much higher than the normal rate.
ANSWER TO QUESTIONS
1. Plug the missing values: discuss 2 methods and steps.
ANS. Imputation through the mean or median:
This method calculates the mean or median of the non-missing values in a column and replaces that column's missing values with it. Each column is handled separately and independently of the others. It can only be used with numeric data.
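Mean or median imputation of a single numeric column can be sketched as follows. The helper name `impute_column` and the sample values are hypothetical and are not taken from the report's dataset; missing entries are assumed to be represented as `None`.

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    # Each column is imputed on its own, independently of the others.
    return [fill if v is None else v for v in values]


# Illustrative glucose readings with two missing entries.
glucose = [148, None, 183, 89, None, 116]
filled_mean = impute_column(glucose, "mean")
filled_median = impute_column(glucose, "median")
```

The median is usually the safer choice when a column has extreme values, since a few very large readings can pull the mean upward.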
Imputation using the most frequent value or a zero/constant value:
Most-frequent imputation is another statistical strategy for missing values. It also works with categorical features (strings or numerical representations), replacing missing data with the most frequent value within each column.
Pros:
Works well with categorical features.
Cons:
It does not account for correlations between features.
It can introduce bias into the data.
Zero or constant imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify.
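Both strategies can be sketched in a few lines. The helper names and sample values below are hypothetical, and missing entries are again assumed to be `None`.

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None entries with the most frequent observed value in the column."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    # On a tie, Counter keeps the value that was seen first.
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def impute_constant(values, fill=0):
    """Replace None entries with a fixed value (zero by default)."""
    return [fill if v is None else v for v in values]


# Illustrative categorical and numeric columns.
result = ["positive", None, "negative", "positive"]
insulin = [94, None, 168, None]
result_filled = impute_most_frequent(result)
insulin_filled = impute_constant(insulin, fill=0)
```

Filling a measurement such as insulin with zero produces values that are not physiologically plausible, which is one way this strategy can introduce bias.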
Steps to follow:
• Right-click on the white working area.
• Search for “Impute”.
• Click on “Impute” in the results.
• Open Impute and select the method by which you want the values to be filled.
• Select Apply.
• Connect it to a Data Table to see the results.
2. Find the outliers in the data and draw the data table and distribution.
Ans:
3. Draw a mosaic plot, heat map, and feature statistics on the ‘inliers’ in the data.
ANS:
4. Discuss the relevance of inliers and outliers in the data evaluation.
ANS: Every variable is relevant in both the inliers and the outliers except the insulin rate in the body. The outliers have comparatively very high insulin rates.
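One common way to separate inliers from outliers on a single variable is the interquartile range (IQR) rule, sketched below. This is an assumed technique chosen for illustration; the report does not state which method produced its outliers, and the sample values are made up rather than taken from the dataset.

```python
from statistics import quantiles


def split_by_iqr(values, k=1.5):
    """Split values into (inliers, outliers) using the IQR fence rule."""
    q1, _, q3 = quantiles(values, n=4)   # first, second and third quartiles
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Illustrative readings with one value far above the rest.
sample = [10, 12, 11, 13, 12, 11, 10, 13, 12, 300]
inliers, outliers = split_by_iqr(sample)
```

With a column like insulin, where readings are either 0 or very high, a rule of this kind would flag the extreme readings, which matches the observation that the outliers differ mainly in insulin rate.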
5. Show linear regression and correlation.
ANS:
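For reference, the Pearson correlation coefficient and a least-squares line between two variables can be computed as in the sketch below. It is an illustration under assumed inputs, not the output shown in this report; the function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either variable is constant.
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Illustrative, perfectly linear data: y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
```

A coefficient near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship between the two variables.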
SUMMARY:
Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces. Insulin is a hormone that regulates blood glucose. Hyperglycaemia, also called raised blood glucose or raised blood sugar, is a common effect of uncontrolled diabetes and over time leads to serious damage to many of the body's systems, especially the nerves and blood vessels.
Follow-up data of diabetic patients were used as the data. The Orange data mining software was used because it is easy to use in the modelling phase and contains many methods. In this context, the chapter aims to develop an effective prediction model by using a large number of feature selection and classification methods. The results show that the proposed model successfully predicts the HbA1c parameter. In addition, the parameters that are effective in the diagnosis of diabetes were determined with the feature selection methods.
BUSINESS/SOCIAL RELEVANCE AND FUTURE SCOPE
Based on the analysis, we observed that age, gender, BMI, and blood pressure are the most useful features for predicting diabetes complications. This will help us predict diabetes in patients while it is still in the early stages. The dataset can be used to predict whether a patient has diabetes, based on certain diagnostics.
It will help businesses assess patients based on these diagnostics and treat them accordingly. It also allows society to be more aware of the implications of diabetes and the steps people can take to control it. It also allows us to be better prepared in the future and will help people all around the world tackle this problem through early prediction and recognition of diabetes before it progresses further.
CITATION:
https://www.researchgate.net/publication/349496581_Analysis_and_Prediction_Of_Pima_Indian_Diabetes_Dataset_Using_SDKNN_Classifier_Technique
