
Table of Contents

1 Multiple Linear Regression
1.1 Introduction to MLR
1.2 Assumptions of MLR
2 Logistic Regression
2.1 Introduction to LR
3 Problem Statement
3.1 Problem Statement
3.2 Objective
4 Dataset
5 Methodology (Steps in Statistical Model-Building)
5.1 Define problem
5.2 Develop a conceptual model
5.3 Design study
5.4 Collect data
5.5 Examine data
5.6 Select a suitable model
5.7 Estimate parameters
5.8 Verify model
5.9 Validate the model
5.10 Interpret results
6 Conclusions

1. Multiple Linear Regression
1.1 Introduction:
Multiple linear regression (MLR) is a statistical method used to examine the
relationship between two or more independent variables (IVs), X (also called
predictor or causal variables), and a dependent variable, Y (also called the
response or criterion variable). In MLR, the relationship is assumed to be
linear and the dependent variable is assumed to follow a normal distribution.
MLR not only describes the pattern of the relationship between Y and X, but
also supports inference about the strength of that relationship. In addition,
MLR is used for prediction.
The general form of a multiple linear regression model with p independent
variables is given by the equation:
$$Y = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p + \epsilon$$
Here:
• Y is the dependent variable. The purpose of MLR is to explain Y's
relationship with the IVs, to predict future values of Y, or both.
• X1, X2, …, Xp are the independent variables.
• X0 always takes the value 1; including it lets β0 quantify the value of Y
when all IVs are 0 (zero).
• β1, β2, …, βp are constants called regression coefficients. They represent
the contributions of X1, X2, …, Xp, respectively, to explaining or
predicting Y.
• β0 is a constant equal to the value of Y when all the IVs X1, X2, …, Xp
are 0 (zero). β0 is also called the regression intercept.
• ϵ is a random error term that represents the amount (variance) of Y that
cannot be explained or predicted by the IVs X1, X2, …, Xp.
1.2 Assumptions:
A model is a representation of a real-world phenomenon and is established
under certain assumptions. The assumptions of MLR are as follows:
• Linearity of the phenomenon measured, i.e., Y is linearly related to the
IVs.
• Constant error variance across observations of the IVs, known as
homoscedasticity.
• Uncorrelated error terms, i.e., Cov(ϵi, ϵk) = 0 for i ≠ k.
• Normally distributed error terms, i.e., ϵi ~ N(0, σ²).
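The assumptions listed above can be checked on a fitted model. The following is a minimal sketch in base R on simulated data; the variable names (x1, x2, y), the sample size and the coefficient values are hypothetical and chosen only for illustration. The Durbin-Watson statistic is computed by hand here to avoid depending on an extra package.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 1)

# Fit the multiple linear regression model and extract the residuals
fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Normality of errors: Shapiro-Wilk test on the residuals
normality_p <- shapiro.test(res)$p.value

# Uncorrelated errors: Durbin-Watson statistic computed by hand
# (values near 2 suggest no first-order autocorrelation)
dw_stat <- sum(diff(res)^2) / sum(res^2)

# Linearity and homoscedasticity: residuals vs fitted values should show
# no curve and no funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

cat("Shapiro-Wilk p-value:", normality_p, "\n")
cat("Durbin-Watson statistic:", dw_stat, "\n")
```

A small Shapiro-Wilk p-value or a clear pattern in the residual plot would suggest that the corresponding assumption does not hold for the data at hand.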
2. Logistic Regression
2.1 Introduction:
Logistic regression models the probability that a given instance belongs to a
particular category. The logistic function (sigmoid function) is used to
transform the linear combination of the input features into a value between 0
and 1. The logistic regression model predicts the probability of the positive
class (1), and if this probability is above a certain threshold (commonly 0.5), the
instance is classified as belonging to the positive class; otherwise, it is classified
as belonging to the negative class (0).
The logistic regression equation is given by:
$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p)}}$$

Here:
• P(Y=1) is the probability of the positive class.
• e is the base of the natural logarithm.
• β0 is the regression intercept.
• β1, β2, …, βp are constants called regression coefficients. They represent
the contributions of X1, X2, …, Xp, respectively, to explaining or
predicting Y.
The logistic function ensures that the output is bounded between 0 and 1,
making it suitable for representing probabilities. The probability of the
negative class, P(Y=0), is simply 1 − P(Y=1).

The coefficients (β0, β1, β2, …, βp) are estimated from the training data
when the model is fitted. The fitted model then transforms the linear
combination of input features into a probability, and the threshold
(commonly 0.5) determines the predicted class.
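As a small worked illustration of the logistic function and the 0.5 threshold, the sketch below uses made-up coefficients and a single made-up observation; none of these numbers come from the diabetes data.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients and one observation (illustration only)
beta <- c(-1.0, 0.8, 0.5)   # beta0, beta1, beta2
x    <- c(1, 2.0, -1.0)     # X0 = 1, X1, X2

z       <- sum(beta * x)    # linear combination: -1.0 + 1.6 - 0.5 = 0.1
p_pos   <- sigmoid(z)       # P(Y = 1)
p_neg   <- 1 - p_pos        # P(Y = 0)
y_class <- ifelse(p_pos > 0.5, 1, 0)   # classify with the 0.5 threshold

cat("P(Y=1):", p_pos, " P(Y=0):", p_neg, " class:", y_class, "\n")
```

Because the linear combination is slightly above zero, the probability is slightly above 0.5 and the observation is assigned to the positive class. Base R also provides plogis(), which computes the same function.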
3. Problem Statement
3.1 Problem statement:
To develop a model that predicts the likelihood of diabetes in patients
based on diagnostic measurements including pregnancies, glucose levels, blood
pressure, skin thickness, insulin levels, BMI, diabetes pedigree function, and
age.
3.2 Objective:
The objective is to identify and quantify the impact of each diagnostic
parameter on the outcome variable (diabetes) for accurate prediction and
improved understanding of the disease.
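4. Dataset
The analysis uses the Pima Indian diabetes dataset. It contains eight
diagnostic measurements (Pregnancies, Glucose, BloodPressure, SkinThickness,
Insulin, BMI, DiabetesPedigreeFunction, Age) and a binary Outcome variable
indicating diabetes.
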
5. Methodology
Steps In Statistical Model-Building:
Step 1: Define problem:
To predict the likelihood of diabetes in patients based on diagnostic
measurements including pregnancies, glucose levels, blood pressure, skin
thickness, insulin levels, BMI, diabetes pedigree function, and age. The
objective is to identify and quantify the impact of each diagnostic parameter on
the outcome variable (diabetes) for accurate prediction and improved
understanding of the disease.
Step 2: Develop a conceptual model:
A conceptual model describes the relationships among the variables of interest
in the system under investigation, and is often represented pictorially.
The given problem can be conceptualized in the following steps:
i. Identify the variables of interest: pregnancies, glucose levels, blood
pressure, skin thickness, insulin levels, BMI, diabetes pedigree function,
age and outcome (diabetes).
ii. Identify the dependent variable (Y): outcome (diabetes).
iii. Identify the independent variables (X): pregnancies, glucose levels, blood
pressure, skin thickness, insulin levels, BMI, diabetes pedigree function,
age.
iv. Determine the dependence relationship: Y = f(X).

[Figure: conceptual model. The inputs X0, X1, …, X8, weighted by the
coefficients β0, β1, …, β8, together with the error term ϵ, lead to the
outcome Y.]
So, the statistical problem is to model the changes in the response
(dependent) variable, Y (outcome), associated with the explanatory
(independent) variables X1 (pregnancies), X2 (glucose levels), X3 (blood
pressure), X4 (skin thickness), X5 (insulin levels), X6 (BMI), X7 (diabetes
pedigree function) and X8 (age), as represented by the relationship
Y = f(X).
The linear relationship can be statistically modeled as:
$$Y = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_8 X_8 + \epsilon$$
The logistic regression equation is:
$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_8 X_8)}}$$

Step 3: Design study:
• Once the conceptual model is tentatively defined, the researcher needs to
collect data through an appropriate study design. The study design
ensures that the required data are collected.
• Depending on the purpose of the study (defined in Step 1), the study
design can be broadly classified as observational or experimental.
• By frequency and mode of data collection, study designs can be (i)
cross-sectional, (ii) longitudinal, or (iii) cohort.
• For the given dataset, a longitudinal study is carried out. In a
longitudinal study, data are collected at more than one point in time, and
the study can be retrospective or prospective. A longitudinal retrospective
study with a differential research design is known as a case-control study.
In a case-control design, two groups are considered: one with a particular
outcome (e.g., a disease), called the “case”, and the other without that
outcome, called the “control”.

Step 4: Collect data:
Once the study design is known, data collection becomes easier. Data are often
collected in two stages: preliminary data and full data.
Step 5: Examine data:
The diabetes data are examined as follows:
• Understand data structure: Get an overview of the diabetes dataset's size
and structure.
• Calculate basic statistics: Compute the mean, median, and standard
deviation of each variable.
• Visualize data: Use graphs to spot patterns and outliers in the diagnostic
measurements.
• Handle missing values: Decide how to deal with missing data (fill in or
remove).
• Identify outliers: Detect and assess any unusual data points.
• Explore variable relationships: Investigate connections between the
diagnostic measurements.
• Check distribution assumptions: Confirm whether the data meet the
regression assumptions.
• Scale numerical features: Standardize or normalize numerical data for
fair comparisons.
• Split data: Divide the data into training and testing sets.
• Initial model testing: Optionally, test a simple model to gauge
performance.
This step ensures the dataset is ready for predicting diabetes with multiple
linear and logistic regression by addressing missing values and outliers and
confirming relationships between variables.
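Several of the checks listed above can be sketched in base R. The data frame below is simulated and only borrows a few column names from the diabetes data; the values, the injected missing entries, the median fill and the 1.5 × IQR outlier rule are assumptions made for illustration, not the procedure used in this study.

```r
set.seed(42)
n <- 100
# Hypothetical stand-in for the diabetes data (simulated, not the real dataset)
demo <- data.frame(
  Glucose = round(rnorm(n, mean = 120, sd = 30)),
  BMI     = round(rnorm(n, mean = 32, sd = 7), 1),
  Age     = sample(21:80, n, replace = TRUE),
  Outcome = rbinom(n, size = 1, prob = 0.35)
)
demo$BMI[c(3, 17)] <- NA   # inject two missing values for the demonstration

str(demo)                  # size and structure
summary(demo)              # basic statistics per variable
missing_counts <- colSums(is.na(demo))   # missing values per column

# Handle missing values: fill BMI with the median (one possible choice)
demo$BMI[is.na(demo$BMI)] <- median(demo$BMI, na.rm = TRUE)

# Identify outliers in Glucose with the 1.5 * IQR rule
q   <- unname(quantile(demo$Glucose, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_outlier <- demo$Glucose < q[1] - 1.5 * iqr | demo$Glucose > q[2] + 1.5 * iqr

# Explore relationships between the measurements
cor_demo <- cor(demo[c("Glucose", "BMI", "Age")])

# Scale numerical features to mean 0 and standard deviation 1
scaled <- as.data.frame(scale(demo[c("Glucose", "BMI", "Age")]))
```

Whether to fill or drop missing values, and which outlier rule to apply, depends on the data and should be decided case by case.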

Step 6: Select a suitable model
A logistic regression model was chosen to predict diabetes outcomes in the
Pima Indian dataset. Logistic regression is a suitable choice for binary
classification problems such as predicting whether or not an individual has
diabetes.
Step 7: Estimate parameters
The logistic regression model, named LMLR_Model, was trained on the training
data to predict the Outcome variable from all other variables in the dataset.
70% of the data is used for training and the rest for testing. The family
parameter is specified as binomial, indicating a logistic regression model
for binary outcomes.
Model updating
• The model was updated by removing certain variables (SkinThickness,
Insulin, Age, BloodPressure). The aim is to refine the model by excluding
less relevant or potentially correlated features.
• When the adequacy of the MLR model was tested, it failed the tests of
normality and of constant variance.
• The MLR model was therefore rejected, and the analysis proceeded with the
logistic regression model.

[Figures for the MLR model: test of normality, test of variance, test of
independence]

The data are not suitable for the MLR model, because the model does not
satisfy the tests of assumptions. This problem can be reduced by transforming
the response variable, but according to the problem statement we have to
predict whether a patient is diabetes positive or negative, and for this a
logistic regression model is best suited. All further analysis was therefore
carried out with logistic regression.

LOGISTIC REGRESSION MODEL ADEQUACY TESTS

[Figures for the logistic regression model: test of normality, test of
variance, test of independence]
Step 8: Verify model
• The model summary provides the estimated coefficients, significance
levels, and goodness-of-fit statistics. This information is crucial for
understanding the contribution of each variable to the model.
• Diagnostic plots were created to assess the model's assumptions and
identify potential issues, such as heteroscedasticity or outliers.
Step 9: Validate model
• The model's performance was evaluated on the testing data using a
confusion matrix, and the accuracy was calculated. This step shows how well
the model generalizes to new, unseen data.
• A sum of squares analysis was performed to understand the total
variability (SST), the variability explained by the model (SSR), and the
unexplained variability (SSE).
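The validation quantities described above can be computed by hand in base R. This is a minimal sketch on simulated data: the single predictor, the coefficients used to generate the outcome and the sample size are hypothetical, while the 70/30 split and the 0.5 threshold follow the text.

```r
set.seed(123)
n <- 300
# Simulated data (hypothetical, for illustration only)
glucose <- rnorm(n, mean = 120, sd = 30)
outcome <- rbinom(n, size = 1, prob = plogis(-6 + 0.05 * glucose))
demo    <- data.frame(glucose, outcome)

# 70% of the rows for training, the rest for testing
train_idx <- sample(n, trunc(0.70 * n))
train <- demo[train_idx, ]
test  <- demo[-train_idx, ]

# Fit on the training set, predict probabilities on the test set
fit   <- glm(outcome ~ glucose, data = train, family = binomial)
probs <- predict(fit, newdata = test, type = "response")
pred  <- factor(ifelse(probs > 0.5, 1, 0), levels = c(0, 1))
actual <- factor(test$outcome, levels = c(0, 1))

# Confusion matrix and accuracy (share of correct predictions)
conf_mat <- table(Predicted = pred, Actual = actual)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sums of squares on the test set, using predicted probabilities
sst <- sum((test$outcome - mean(test$outcome))^2)
sse <- sum((test$outcome - probs)^2)
ssr <- sum((probs - mean(test$outcome))^2)

print(conf_mat)
cat("Accuracy:", accuracy, "\n")
```

For a logistic model the identity SST = SSR + SSE is not guaranteed to hold as it does for least-squares regression with an intercept, so these three quantities are best read as descriptive summaries.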
Step 10: Interpret results
• The logistic regression model provides a way to estimate the probability
of diabetes from the given predictor variables.
• The model has been refined by excluding certain variables, which focuses
it on the more relevant features.
• Diagnostic plots and the confusion matrix provide insights into the
model's performance and potential areas for improvement.
• The sum of squares analysis offers a quantitative assessment of the
model's goodness of fit.
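One common way to quantify the impact of each predictor in a logistic regression is to exponentiate the coefficients to obtain odds ratios. The sketch below uses simulated data with hypothetical predictors and coefficients; it is an illustration of the idea rather than a result from the diabetes model.

```r
set.seed(7)
n <- 500
# Simulated data (hypothetical, for illustration only)
glucose <- rnorm(n, mean = 120, sd = 30)
bmi     <- rnorm(n, mean = 32, sd = 7)
outcome <- rbinom(n, size = 1, prob = plogis(-8 + 0.04 * glucose + 0.08 * bmi))

fit <- glm(outcome ~ glucose + bmi, family = binomial)

# exp(coefficient) is the multiplicative change in the odds of Y = 1
# for a one-unit increase in that predictor, holding the others fixed
odds_ratios <- exp(coef(fit))

# Wald-type 95% confidence intervals on the odds-ratio scale
or_ci <- exp(confint.default(fit))

print(round(cbind(OddsRatio = odds_ratios, or_ci), 3))
```

An odds ratio above 1 indicates that higher values of the predictor are associated with higher odds of the positive class, and an interval that excludes 1 suggests the association is statistically distinguishable from no effect.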

R LANGUAGE CODE

# Load required libraries
library(readxl)
library(corrplot)
library(ggplot2)
library(caret)
# Read the Excel file
MLR_data <- read_excel("C:/Users/…./Downloads/diabetes Pima Indian Dataset.xlsx")
# Create a correlation matrix
cor_matrix <- cor(MLR_data, method = "pearson")
round(cor_matrix, 4)
# Plot the correlation matrix using corrplot
corrplot(cor_matrix, order = "hclust")
# Prepare the dataset
set.seed(123)
n <- nrow(MLR_data)
training <- sample(n, trunc(0.70 * n))
MLR_data_training <- MLR_data[training, ]
MLR_data_testing <- MLR_data[-training, ]

nrow(MLR_data_training)
nrow(MLR_data_testing)

# Training the model
LMLR_Model <- glm(Outcome ~ ., data = MLR_data_training, family = binomial)
summary(LMLR_Model)

# Update the model by removing certain variables
LMLR_Model1 <- update(LMLR_Model, ~ . - SkinThickness - Insulin - Age - BloodPressure)
summary(LMLR_Model1)

# Plot diagnostic plots
plot(LMLR_Model1)

# Testing the model


glm_probs <- predict(LMLR_Model1, newdata = MLR_data_testing, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
glm_pred <- factor(glm_pred, levels = c(0, 1))
MLR_data_testing$Outcome <- factor(MLR_data_testing$Outcome, levels = c(0, 1))

# Compute and print the confusion matrix for logistic regression
cm_glm <- confusionMatrix(glm_pred, MLR_data_testing$Outcome)
print(cm_glm)
# Extract accuracy from the confusion matrix
acc_glm_fit <- cm_glm$overall['Accuracy']
acc_glm_fit
# Calculate mean of the outcome variable
mean_outcome <- mean(MLR_data$Outcome)
# Calculate Sum of Squares Total (SST)
SST <- sum((MLR_data$Outcome - mean_outcome)^2)
# Calculate predicted values from the model
predicted_values <- predict(LMLR_Model1, newdata = MLR_data, type = "response")

# Calculate Sum of Squares Error (SSE)


SSE <- sum((MLR_data$Outcome - predicted_values)^2)
# Calculate Sum of Squares Regression (SSR)
SSR <- sum((predicted_values - mean_outcome)^2)
# Display the results
cat("Sum of Squares Total (SST):", SST, "\n")
cat("Sum of Squares Error (SSE):", SSE, "\n")
cat("Sum of Squares Regression (SSR):", SSR, "\n")

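# New data for prediction
# Only the predictors kept in LMLR_Model1 are used by predict(); the removed
# variables are set to 0 as placeholders
new_data <- data.frame(
  Pregnancies = 6,
  Glucose = 148,
  BMI = 33.6,
  DiabetesPedigreeFunction = 0.627,
  SkinThickness = 0,
  Insulin = 0,
  Age = 0,
  BloodPressure = 0
)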
# Predict the outcome
glm_new_probs <- predict(LMLR_Model1, newdata = new_data, type = "response")
glm_new_pred <- ifelse(glm_new_probs > 0.5, 1, 0)

# Display the predicted probability and outcome


cat("Predicted Probability:", glm_new_probs, "\n")
cat("Predicted Outcome:", glm_new_pred, "\n")
CODE RESULTS

1. Correlation Structure of Variables

2. Training of MLR Model Results

3. Training of MLR Model

4. Testing of LR Model

6. Sum of Squares Calculation

7. Prediction

6. Conclusion
The logistic regression model, after refinement and evaluation, appears to be a
reasonable choice for predicting diabetes outcomes in the Pima Indian dataset.
The model gives accurate results in approximately 80% of cases. Further
refinements and evaluations could be considered based on the diagnostic plots
and other model performance metrics.
