CA-02
SECTION- Q2050
“It is not possible to prepare a project report without the assistance and encouragement of
other people. This one is certainly no exception.”
This report would not have been possible without the essential and gracious support of Mrs.
Kriti Bedi (faculty) of the School of Business Management at Lovely Professional
University. Her willingness to motivate me contributed tremendously to my report. I would
also like to thank her for showing me some examples related to the topic of this report.
Besides, I would like to thank Lovely Professional University for providing me with a good
environment and facilities to complete the report. It gave me an opportunity to participate and
learn about Machine Learning and its importance in the field of Business Analytics.
Finally, I would like to thank my family for their understanding and support towards me in
completing this report.
CONTENTS
1. Linear Regression- It is used to estimate real values based on continuous variables,
for example the cost of houses, number of calls, total sales etc.
2. Logistic Regression- It is a classification method, not a regression algorithm. It is used
to predict discrete values such as binary 0/1, yes/no, true/false.
3. Decision Tree- It is used mostly for classification problems. It works with
both categorical and continuous dependent variables. The data is
split into 2 or more populations.
4. SVM- In a support vector machine, the data is plotted in n-dimensional space, with the
value of each feature being the value of a particular coordinate.
PREDICTION OF DIABETES IN
PIMA WOMEN
BACKGROUND OF STUDY
Diabetes mellitus is a group of metabolic disorders in which blood sugar levels are
higher than normal for a prolonged period of time. It is caused by insufficient
production of insulin in the body or by the improper response of the body’s cells to
insulin. It is categorised into 3 types.
URL- https://www.kaggle.com/uciml/pima-indians-diabetes-database
OBJECTIVE OF DATASET
The objective is to diagnostically predict whether or not a patient has diabetes, based on certain
diagnostic measurements included in the dataset. All patients here are females at least 21 years
of age and of PIMA INDIAN HERITAGE.
DATA DESCRIPTION
Variable Name    Data Type    Variable Description
DATA PROCESSING-
At first glance the data appeared to be clean, but further detailed analysis revealed many
abnormal values for biological measures. Variables such as Skin Thickness and Glucose had
277 and 374 zero values respectively. These zero values constitute about 20% of the
observations in the dataset. However, these zero values are not null values; they are recorded
values with a result of zero. Deleting them would lose a lot of important information,
so only the wrong values in the dataset were imputed.
CODE
OUTPUT
Fig 1.1- Output Missing value
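The zero-value check and imputation described above can be sketched as follows. This is a minimal illustration and not the code used in the report: the rows are made-up values rather than records from the dataset, the helper names `count_zeros` and `impute_zeros` are hypothetical, and replacing zeros with the median of the non-zero values is only one possible imputation rule. Which columns count as "wrong" zeros is also an assumption here.

```python
from statistics import median

# Made-up sample rows for illustration only; NOT taken from the real dataset.
rows = [
    {"Glucose": 148, "BloodPressure": 72, "BMI": 33.6},
    {"Glucose": 85, "BloodPressure": 0, "BMI": 26.6},
    {"Glucose": 0, "BloodPressure": 64, "BMI": 23.3},
    {"Glucose": 89, "BloodPressure": 66, "BMI": 0.0},
    {"Glucose": 137, "BloodPressure": 40, "BMI": 43.1},
]


def count_zeros(rows):
    """Count how many zero entries each column contains."""
    return {col: sum(1 for r in rows if r[col] == 0) for col in rows[0]}


def impute_zeros(rows, columns):
    """Return a copy of rows where zeros in the chosen columns are replaced
    by the median of that column's non-zero values (an assumed rule)."""
    medians = {c: median(r[c] for r in rows if r[c] != 0) for c in columns}
    return [
        {k: (medians[k] if k in medians and v == 0 else v) for k, v in r.items()}
        for r in rows
    ]


zero_counts = count_zeros(rows)
# Assumption: a zero in these columns is biologically implausible.
cleaned = impute_zeros(rows, ["Glucose", "BloodPressure", "BMI"])

print(zero_counts)
print(count_zeros(cleaned))
```

Columns where a zero is a genuine measurement would simply be left out of the list passed to `impute_zeros`.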
ANALYSIS
VARIABLE RESPONSE
➢ OUTCOME- 268 women were diagnosed with diabetes and 500 women did not
have diabetes. About 35% of the sample (268 of 768) shows positive records of diabetes.
VARIABLE PREDICTOR
PREGNANCIES
The graph below shows that women who have been diagnosed with diabetes had more
pregnancies than the other women. The histogram below, however, shows no significant
relationship between the number of pregnancies and the occurrence of diabetes.
GLUCOSE
The figure below shows a clear difference in the amount of glucose present in
women who have been diagnosed with diabetes and those who have not. The density
graphs show the difference in the level of glucose and its effect.
Fig- 2.7 Glucose variation in women
BLOOD PRESSURE
No clear difference in blood pressure is seen between women who have diabetes and those
who don’t. This indicates that blood pressure might not be a good indicator to predict diabetes.
SKIN THICKNESS
There is no clear difference between the two groups. Skin thickness is not a good predictor
for the response variable.
INSULIN
BMI
Women who have diabetes mostly have a BMI greater than 25, which is above the normal
level, while the BMI of women who don’t have diabetes ranges from 18 to 60.
AGE
A correlation plot shows the linear association between the variables. As mentioned in the
bivariate associations, Insulin and Glucose, and BMI and Skin Thickness, have a moderate to
high linear correlation.
Fig- 4.1 Matrix of correlation among different variables
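A correlation matrix of this kind can be computed as in the sketch below. It is an illustration under stated assumptions, not the report's own code: the values are made up, the `pearson` helper is hypothetical, and the Pearson coefficient is assumed to be the measure of linear association.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values for illustration only; NOT taken from the real dataset.
data = {
    "Glucose": [90, 110, 125, 140, 160, 180],
    "Insulin": [60, 85, 80, 130, 150, 200],
    "BMI": [22.5, 31.0, 27.4, 35.1, 29.8, 38.2],
    "SkinThickness": [15, 29, 22, 35, 26, 40],
}

# Nested dict: corr_matrix[a][b] is the correlation between columns a and b.
corr_matrix = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}

for name, row in corr_matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Each entry lies between -1 and 1, the diagonal is 1, and the matrix is symmetric, so only the upper or lower triangle needs to be read.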
INTERPRETATION
The correlation matrix shows how each variable changes with a change in the others,
covering both the independent variables and the dependent variable.
Fig- 4.1.1 and 4.2.1 Correlation Plot
TESTING DATA
INTERPRETATION
The output shows a summary of all the variables and the outcome; it is divided into 3 quartiles,
followed by the mean, mode and median. Generally, this kind of model is used to explain the
relationship between multiple independent, or predictor, variables and the outcome.
Here, it depicts the relationship between BMI, Glucose and Age. It shows a strong
correlation among the variables. The graphs below represent the scatterplots for
multiple regression.
Fig-4.8 & 4.9 relationship between the level of Glucose and BMI, relationship between age
and BMI
Fig- 5 Histogram BMI
INTERPRETATION
The above histogram shows BMI and its frequency of occurrence in the women. It
shows that the higher the BMI, the greater the chance of a woman having diabetes; BMI values
between 25 and 40 fall in the overweight-to-obese range. Women with a lower BMI,
on the other hand, are less likely to have diabetes.
Fig-5.1 scatter plot showing the 4 different variable analysis
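Y = a + bX, where Y is the dependent variable, X is the independent variable, a is the
intercept and b is the slope.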
INTERPRETATION
From the output we can see that the R square value is 1.30, and the correlation coefficients show
that the p value is greater than R. The correlation is done to see the relationship between age and
BMI. The intercept has the value 61.92, belonging to the third quartile in the dataset. However,
simple linear regression is not the best fit for this machine learning problem. There are many
outliers in the data, so it becomes difficult to assess the accuracy of the model.
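The simple linear regression Y = a + bX and its R square value can be worked through as in the sketch below. It is a minimal least-squares illustration, not the report's own code or output: the age and BMI values are made up, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b in Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(x, y, a, b):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only; NOT taken from the real dataset.
age = [21, 25, 30, 35, 40, 50]
bmi = [22.0, 30.5, 27.1, 33.0, 29.4, 35.2]

a, b = fit_line(age, bmi)
r2 = r_squared(age, bmi, a, b)
print(f"intercept a = {a:.2f}, slope b = {b:.3f}, R square = {r2:.3f}")
```

For a least-squares line fitted with an intercept, R square always lies between 0 and 1, which gives a quick sanity check on any regression output.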
APPLYING LOGISTIC REGRESSION
A training data set containing a random sample of 70% of the observations is used to fit a
Logistic Regression with “Diabetes” as the response, and the remaining 30% are predicted.
INTERPRETATION OF OUTPUT
The logistic regression is about 74% accurate, depending on the exact training-test split. It
appears to be a good choice for prediction. The misclassification rate of the model is therefore
about 26%.
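The 70/30 split, model fit and accuracy calculation can be sketched as below. This is a from-scratch illustration under stated assumptions, not the routine used in the report: the data is synthetic (two standardised stand-in predictors), the model is fitted by plain gradient descent, and the function names, learning rate and number of epochs are all hypothetical choices. The accuracy it prints has no connection to the 74% reported above.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp cannot overflow for extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Classify as 1 when the predicted probability reaches the threshold."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
        for xi in X
    ]


# Synthetic data for illustration only; NOT the real dataset.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    X.append([rng.gauss(label, 1.0), rng.gauss(0.5 * label, 1.0)])
    y.append(label)

# Random 70% training sample; the remaining 30% is held out for testing.
indices = list(range(len(X)))
rng.shuffle(indices)
cut = len(indices) * 7 // 10
train_idx, test_idx = indices[:cut], indices[cut:]

w, b = train_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
predicted = predict([X[i] for i in test_idx], w, b)
actual = [y[i] for i in test_idx]

accuracy = sum(p == t for p, t in zip(predicted, actual)) / len(actual)
misclassification = 1 - accuracy
print(f"accuracy = {accuracy:.2f}, misclassification = {misclassification:.2f}")
```

Accuracy and misclassification rate always add up to 1, and both change with the random split, which is why the figures depend on the exact training-test split.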
➢ CONFUSION MATRIX- It compares the actual outcomes with the predicted ones.
            Predicted=0             Predicted=1
ACTUAL=0    True negatives (TN)     False positives (FP)
ACTUAL=1    False negatives (FN)    True positives (TP)
High threshold:
• High specificity
• Low sensitivity
Low threshold:
• Low specificity
• High sensitivity
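The effect of the threshold on the confusion matrix can be illustrated with the sketch below. The labels and predicted probabilities are made up, and the helper names and the two thresholds (0.3 and 0.75) are hypothetical choices for illustration.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (TN, FP, FN, TP) when predicting 1 for scores >= threshold."""
    tn = fp = fn = tp = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tn, fp, fn, tp = confusion_counts(y_true, scores, threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.7, 0.35, 0.6, 0.8, 0.9]

low = sensitivity_specificity(y_true, scores, 0.3)    # low threshold
high = sensitivity_specificity(y_true, scores, 0.75)  # high threshold
print("low threshold  (sensitivity, specificity):", low)
print("high threshold (sensitivity, specificity):", high)
```

On this toy data the low threshold catches every positive case at the cost of many false positives, while the high threshold does the opposite.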
Fig- 5.6. ROC curve
INTERPRETATION OF ROC
Each point on the ROC curve represents a sensitivity/specificity pair
corresponding to a particular decision threshold. The curve lies near the upper left
corner, which shows the model is moderately accurate. As we move in the upward direction
the sensitivity rises, while the specificity tends to drop off.
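How the points of an ROC curve arise from different thresholds can be sketched as follows. This is a minimal illustration with made-up labels and predicted probabilities; the helper names `roc_points` and `auc` are hypothetical, and the area under the curve is estimated with the trapezoid rule.

```python
def roc_points(y_true, scores):
    """One (false positive rate, true positive rate) point per threshold,
    sweeping the threshold from the highest score down to the lowest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.7, 0.35, 0.6, 0.8, 0.9]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print("AUC =", area)
```

The true positive rate is the sensitivity and the false positive rate is 1 minus the specificity, so a curve that hugs the upper left corner gives an area close to 1, while a model no better than chance gives about 0.5.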
CONCLUSION
The logistic regression model was chosen as the best model for predicting the occurrence of
diabetes in PIMA women, as it reported the highest cross-validated sensitivity. Patterns
were identified using data exploration methods and validated using modelling techniques;
models such as logistic regression, classification trees, random forest and SVM were used to
analyse the data. From the ROC curve we can determine that logistic regression is best suited
to the data.
Applying classification machine learning models to the data gives a brief view of the best fit,
which is the Logistic Regression model.