MDS5202 / Segment 04

VARIABLE SELECTION METHODS AND GOODNESS OF FIT

Table of Contents

1. Variable Selection Methods – Variable Selection in Logistic Regression 4

1.1 Forward Selection 4

1.2 Backward Elimination 6

1.3 Choice of Forward Selection or Backward Elimination 8

2. Assessing Model Performance 8

2.1 ROC Curve 14

2.2 Area Under the Curve (AUC) 16

2.3 Model Adequacy 17

2.4 Logistic Regression 18

2.5 Performance Metrics for Model Fit 18

3. Summary 20

©COPYRIGHT 2023 (VER. 1.0), ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION 2/20

Introduction
The performance of the model may vary depending on the thresholds and the number of
variables selected. Several parameters can be used to evaluate the model.

Learning Objectives

At the end of this topic, you will be able to:

● Describe and apply the variable selection methods

● Assess model performance using receiver operating characteristic curve and area
under the curve


1. Variable Selection Methods – Variable Selection in Logistic Regression
Variable selection is critical in building logistic regression models, but there is no hard-and-fast rule for it. It is often necessary to develop a model that achieves satisfactory prediction accuracy while remaining interpretable in terms of a specific theory regarding the role of the independent variables. Note that:
• Keeping too many variables may lead to overfitting.
• An overly simple model may suffer from underfitting.
• The risk of variable selection is that the model may be optimised for one particular data set.

1.1 Forward Selection


Forward selection is a type of stepwise regression that begins with an empty model and adds variables one at a time. Each forward step adds the single variable that most improves the model.

The steps are as follows:


1. Start with a model with no variables.

Figure 1: Null Model

This is the intercept-only (null) model; its log-likelihood serves as the baseline against which later models are compared.


2. Add the most significant variable.

Figure 2: Model with One Variable

The most significant, most influential variable is pulled in first. Developing this one-variable model and comparing it with the model with no predictors shows how much information has been added.

3. Keep adding the most significant variable until reaching the stopping rule or running
out of variables.

Figure 3: Model with Two Variables

In other words, add the next most significant variable and repeat until the stopping rule is reached, i.e. until the model is saturated.

Criteria for Selection

The most significant variable at each step is the one that, when added to the model:
• has the smallest p-value;
• provides the highest increase in R²; or
• provides the highest drop in the model's RSS (Residual Sum of Squares) compared with the other predictors under consideration.

Choose a stopping rule as follows:

• The stopping rule is satisfied when every remaining candidate variable would have a p-value larger than some specified threshold if added to the model.
• When this state is reached, forward selection terminates and returns a model that contains only variables with p-values below the threshold.

The threshold can be:

• a fixed value (for instance, 0.05, 0.2, or 0.5)
• determined by AIC (Akaike Information Criterion)
• determined by BIC (Bayesian Information Criterion)

If a fixed value is chosen, the threshold is the same for all variables. If AIC or BIC is used to determine the threshold automatically, it can differ for each variable.
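The forward-selection loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: `score` stands in for whichever criterion is chosen (AIC, BIC, or a p-value rule; lower is better), and `toy_aic` with its `gain` values is an invented stand-in for a real fitted model's AIC.

```python
def forward_select(candidates, score):
    """Greedy forward selection: start empty, repeatedly add the variable
    that most improves the score (lower is better, as with AIC/BIC)."""
    selected = []
    best = score(selected)
    while len(selected) < len(candidates):
        remaining = [v for v in candidates if v not in selected]
        # Try adding each remaining variable and keep the best trial score.
        trials = {v: score(selected + [v]) for v in remaining}
        v_best = min(trials, key=trials.get)
        if trials[v_best] >= best:   # stopping rule: no improvement possible
            break
        selected.append(v_best)
        best = trials[v_best]
    return selected

# Hypothetical AIC-like score: an invented information gain per variable,
# subtracted from a base deviance, plus a penalty of 2 per parameter added.
gain = {"x1": 10.0, "x2": 5.0, "x3": 0.1}

def toy_aic(variables):
    return 100.0 - sum(gain[v] for v in variables) + 2.0 * len(variables)

print(forward_select(["x1", "x2", "x3"], toy_aic))  # ['x1', 'x2']
```

With a real data set, `score` would refit the logistic regression on each candidate variable set and return its AIC; here `x3` is never added because its small gain does not offset the penalty for an extra parameter.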

1.2 Backward Elimination


In the other method, called backward elimination, we start with all the independent variables in the model:

1. Start with a model with all the independent variables.

Figure 4: Full Model

2. Remove the least significant variable.

Figure 5: Model with Four Variables

Variable X4 was found to be the least significant; it is removed, and the model is left with the four remaining variables.


3. Keep removing the least significant variables until reaching the stopping rule or running
out of variables.

Figure 6: Model with Three Variables

This is now a model with three variables.

Criteria for Elimination

The least significant variable at each step is the one that:
• has the highest p-value in the model;
• causes the smallest drop in R² when eliminated; and
• causes the smallest increase in RSS when eliminated, compared with the other predictors.

Choose a stopping rule as follows:


• The stopping rule is satisfied when all the remaining variables in the model have a p-value smaller than some pre-specified threshold.
• When this condition is satisfied, backward elimination terminates and returns the current step's model.

As with forward selection, the threshold can be,


• A fixed value (For instance: 0.05, 0.2, or 0.5)
• Determined by AIC (Akaike information criterion)
• Determined by BIC (Bayesian information criterion)
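Backward elimination mirrors the forward loop: start with everything, repeatedly drop the variable whose removal most improves the criterion. Again a sketch only, with the same invented AIC-like `toy_aic` (lower scores are better).

```python
def backward_eliminate(variables, score):
    """Greedy backward elimination: start with all variables, repeatedly
    drop the one whose removal most improves the score (lower is better)."""
    selected = list(variables)
    best = score(selected)
    while selected:
        # Try removing each variable in turn and keep the best trial score.
        trials = {v: score([w for w in selected if w != v]) for v in selected}
        v_drop = min(trials, key=trials.get)
        if trials[v_drop] >= best:   # stopping rule: every removal hurts
            break
        selected.remove(v_drop)
        best = trials[v_drop]
    return selected

# Same hypothetical AIC-like score as before: the gains are made-up numbers.
gain = {"x1": 10.0, "x2": 5.0, "x3": 0.1}

def toy_aic(variables):
    return 100.0 - sum(gain[v] for v in variables) + 2.0 * len(variables)

print(backward_eliminate(["x1", "x2", "x3"], toy_aic))  # ['x1', 'x2']
```

Starting from the full model, dropping `x3` lowers the score, while dropping either remaining variable would raise it, so the loop stops — the same model forward selection reached.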


1.3 Choice of Forward Selection or Backward Elimination


Following are guidelines for choosing between the forward selection and backward elimination methods:
• Unless the number of candidate variables exceeds the sample size (or the number of events), use a backward stepwise approach.
• Stepwise variable selection will be highly unstable, especially when the sample size is small compared with the number of variables under consideration.
• This instability is reduced when the sample size (or the number of events) exceeds 50 per candidate variable.

Whenever the backward method is not suitable for your model, use the forward selection method. In either case, one then needs to assess how good the model is, i.e. assess the model's performance.

2. Assessing Model Performance


Sensitivity is calculated using this equation:

Sensitivity = True Positives (TP) / (True Positives (TP) + False Negatives (FN))

• TP is the number of positives correctly classified as positives by the model.


• FN is the number of positives misclassified as negative by the model.

Sensitivity is also called recall.

Specificity is the ability of the diagnostic test/model to correctly classify the test as negative when the disease/event is not present:

Specificity = P(diagnostic test is negative | patient has no disease/no event)

In general, for a classification model:

Specificity = P(model classifies Yi as negative | Yi is negative)
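Both measures can be computed directly from confusion-matrix counts. A minimal sketch, with made-up counts:

```python
def sensitivity(tp, fn):
    # TP / (TP + FN): proportion of actual positives correctly classified.
    return tp / (tp + fn)

def specificity(tn, fp):
    # TN / (TN + FP): proportion of actual negatives correctly classified.
    return tn / (tn + fp)

# Hypothetical counts: 3 of 4 positives and 3 of 4 negatives correct.
print(sensitivity(tp=3, fn=1))  # 0.75
print(specificity(tn=3, fp=1))  # 0.75
```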


Figure 7: ROC Curve

Consider this example: individuals are classified as obese or not obese based on their weight. The points scattered along the line at zero correspond to the non-event (not obese), and the points along the line at one correspond to the event (obese).

So, the Y-axis has two categories. When a logistic regression curve is fitted to the data:

• The blue balls represent obese individuals.

• The red balls represent individuals who are not obese.

Figure 8: Logistic Regression Curve

The graph shows the sigmoid curve that logistic regression fits to the data. A new observation is classified as obese or not obese according to where it falls on this curve.


Figure 9: Individual is an Obese Curve

So, in the graph, the line at 0 means not obese and the line at 1 means obese. In logistic regression, the Y-axis is converted into the probability that an individual is obese.

Figure 10: Individual as Obese and Not Obese Curve

Figure 11: Individual as Obese and Not Obese Curve


To classify individuals as obese or not obese, a rule is needed to turn probabilities into classifications. In this model, a line is drawn at 0.5: individuals with probability below 0.5 are classified as non-obese, and those at 0.5 or above are classified as obese. This is how logistic regression is normally used for classification.

Then, the curve would show that,


• There is a high probability that this individual is obese.
• And this individual is not obese.

Figure 12: Four New Individuals are Not Obese Curve

Suppose there are eight new observations: four are obese, and the other four are non-obese.
So, the non-obese are shown in red, and the obese are shown in green.

Then, the curve would show that,


• These four new individuals, shown in red, are not obese.


Figure 13: Four New Individuals' Obese Curve

Suppose four new individuals are obese; they are shown in green in the Figure below.

Figure 14: Three Individuals Curve

Mark the points against the fitted sigmoid curve at the 0.5 threshold, using the classification rule that a probability below 0.5 means non-obese and 0.5 or above means obese. The matrix generated from these classifications, shown in the above Figure, is called the confusion matrix. In this example, three obese individuals are correctly classified as obese and one is misclassified as non-obese, while one non-obese individual is misclassified as obese. Once the confusion matrix is filled in, sensitivity and specificity can be calculated to evaluate this logistic regression with 0.5 as the threshold for obesity.
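Turning fitted probabilities into a confusion matrix at a chosen threshold can be sketched as follows. The probabilities and labels are made-up numbers mimicking the eight-individual example (1 = obese).

```python
def confusion_counts(probs, labels, threshold=0.5):
    """Build confusion-matrix counts (TP, FP, TN, FN) by classifying
    as positive when the predicted probability >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical fitted probabilities for 8 individuals (1 = obese).
probs  = [0.1, 0.3, 0.45, 0.6, 0.2, 0.55, 0.8, 0.9]
labels = [0,   0,   1,    0,   0,   1,    1,   1]
print(confusion_counts(probs, labels))  # (3, 1, 3, 1)
```

At threshold 0.5 this reproduces the pattern described above: three obese individuals classified correctly, one obese individual missed, and one non-obese individual classified as obese.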


Figure 15: Obesity Curve 0.1 Threshold

Consider another example, Ebola infection, as shown in the above Figure. Samples are classified as 'infected with Ebola' or 'not infected with Ebola'. Here it is essential to correctly classify every sample infected with Ebola, to minimise the risk of an outbreak, so the threshold can be lowered to 0.1. With this threshold, all four infected samples are correctly classified as infected. Of the four non-infected samples, however, only two are correctly classified as not infected; the model classifies the other two as infected. There is freedom to change the classification threshold, but misclassifications of the other class may increase.

Figure 16: Obesity Curve 0.5 Threshold and Confusion Matrix

Conversely, a higher threshold can be set, as shown in the above Figure; in this example it correctly classifies the same number of obese samples as before.


2.1 ROC Curve


If one confusion matrix is made for every threshold that matters, the result is many confusion matrices. To identify the best threshold, we instead make the plot shown below, with the False Positive Rate (1 − Specificity) on the X-axis and the True Positive Rate (Sensitivity) on the Y-axis. Each point on this plot summarises one confusion matrix.

Figure 17: ROC Graph (True Positive Rate vs False Positive Rate)

False Positive Rate = (1 − Specificity) = False Positives / (False Positives + True Negatives)

In Figure 17, each point reflects the TPR and FPR values for a different threshold. The highlighted point reflects a near-ideal scenario: a threshold with high sensitivity and almost one hundred percent specificity. The threshold corresponding to this point can be chosen for the model. Thus, the ROC curve helps us decide on the best threshold value, where there is a balance between a low False Positive Rate and a high True Positive Rate. So, we plot the sensitivity versus one minus the specificity, i.e. the True Positive Rate versus the False Positive Rate. Instead of being overwhelmed by confusion matrices, ROC graphs provide a simple way to summarise all the information.
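Sweeping the threshold and recording an (FPR, TPR) pair for each — the points of the ROC curve — can be sketched as follows; the scores and labels below are invented for illustration.

```python
def roc_points(probs, labels):
    """Sweep every distinct probability as a threshold and record
    (FPR, TPR) for each, summarising all confusion matrices at once."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # a threshold above every score: nothing positive
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical scores for two positives and two negatives,
# perfectly separated by the model.
probs  = [0.9, 0.8, 0.4, 0.2]
labels = [1,   1,   0,   0]
print(roc_points(probs, labels))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

Because the positives all score higher than the negatives here, the curve rises to TPR = 1 before any false positives appear — the ideal corner of the plot.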

Consider the confusion matrix as shown below:


Figure 18: Confusion Rate Matrix

True Positive Rate = Sensitivity = 4 / (4 + 0) = 1, i.e. one hundred percent. This is the True Positive Rate when the threshold is so low that every sample is classified as obese.

However, the False Positive Rate is also 4 / (4 + 0) = 1.

Note that an ideal model should have 100% specificity (zero FPR) and 100% sensitivity. However, this is not easy to achieve: if we try to increase the sensitivity, the specificity suffers, and vice versa.

Figure 19: True Positive Against a False Positive

If we plot the True Positive Rate against the False Positive Rate, the green diagonal line is the worst-case scenario. It is like tossing a coin to decide whether an individual is obese or not obese: there is a 50% risk of misclassification in both categories.


So, the diagonal line shows where True Positive Rate = False Positive Rate.
• A point at (0.75, 1) lies to the left of the dotted line: the proportion of obese samples correctly classified (TP) is greater than the proportion of samples incorrectly classified as obese (FP).
• A point at (0.5, 1) lies even further to the left of the dotted green line, showing that the new threshold further decreases the proportion of samples incorrectly classified as obese (FP).
• The point at (0, 0) represents a threshold that results in zero True Positives and zero False Positives.

In the previous example, represented in the graph below, the threshold at the new point (0, 0.75) correctly classified 75% of the obese samples and 100% of the samples that were not obese. Without sorting through the confusion matrices, one can tell that this threshold (represented in yellow) is better than the others.

2.2 Area Under the Curve (AUC)

Figure 20: AUC Model

In the given graph, let's consider the green line. An AUC of 0.5 or below is poor; above 0.5 is better, and the closer the AUC gets to 1, the better the model's accuracy. In this scenario, the AUC is determined to be 0.9, which is good; it then only remains to choose the operating point based on the True Positive and False Positive Rates. So, two candidate thresholds could be considered in the given graph.


Figure 21: Comparison of Two Models Using AUC Model

Suppose another model's ROC curve is generated, as shown above. When two classification models are compared, the basic objective is to see which offers the higher AUC; that model is then claimed to be the better one. In the given graph, the AUC for the red ROC curve is greater than the AUC for the blue ROC curve, so in this instance the model with the red curve is preferred.
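The AUC can be computed from the ROC points with the trapezoidal rule. A minimal sketch, using two hypothetical curves — one perfectly separating and one no better than a coin toss:

```python
def auc(points):
    """Area under an ROC curve given (FPR, TPR) points,
    via the trapezoidal rule after sorting by FPR."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# ROC points for a perfectly separating model (hypothetical).
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# ROC points for coin-flip classification (the diagonal line).
random_ = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(auc(perfect))  # 1.0
print(auc(random_))  # 0.5
```

An AUC of 1.0 corresponds to the ideal corner of the plot, while 0.5 is the diagonal worst case described earlier.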

2.3 Model Adequacy


Regression models for categorical outcomes should be evaluated for fit and adherence to
model assumptions. There are two main elements of such assessment. They are,
• Discrimination: It measures the ability of the model to classify observations into
outcome categories correctly
• Calibration: It measures how well the model-estimated probabilities agree with the
observed outcomes and is typically evaluated via a goodness-of-fit test

The (binary) logistic regression model describes the relationship between a binary outcome variable and one or more predictor variables. Four tests that can be used to assess it are as follows:

Omnibus Test
• The omnibus test is a likelihood-ratio chi-square test of the current versus the null
model.


• A significance value of less than 0.05 indicates that the current model outperforms the null model.
• Omnibus tests are generic statistical tests used for checking whether the variance
explained by the model is more than the unexplained variance.

Wald’s Test
• Wald's test checks whether an individual explanatory variable is statistically significant.
• Wald's test is a chi-square test.
• A Wald test calculates a Z statistic: W = β̂ / SE(β̂)
• This value squared follows a chi-square distribution and is used as the Wald test statistic.
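A sketch of this computation; the coefficient and standard error below are made-up numbers, and 3.84 is the chi-square critical value with one degree of freedom at the 0.05 level.

```python
def wald_statistic(beta_hat, se):
    # Z statistic: estimated coefficient divided by its standard error.
    return beta_hat / se

# Hypothetical fitted coefficient and standard error.
w = wald_statistic(beta_hat=1.0, se=0.5)
chi2 = w ** 2
print(chi2)         # 4.0
print(chi2 > 3.84)  # True: significant at the 0.05 level (1 df)
```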

Hosmer-Lemeshow Test
• It is a chi-square goodness of fit test for binary logistic regression.

Pseudo R2
• Pseudo R² is a measure of the goodness of fit of the model.
• It is called pseudo R² because it does not have the same interpretation as R² in the multiple linear regression (MLR) model.

2.4 Logistic Regression


In logistic regression, model performance is often measured using sensitivity, specificity, and precision.
• Sensitivity: the ability of the model to correctly classify positives.
• Specificity: the ability of the model to correctly classify negatives.

These terminologies originated in medical diagnostics, where sensitivity (also known as the TP rate) measures the ability of a diagnostic test to identify whether a disease is present in a patient (TP).

2.5 Performance Metrics for Model Fit


The evaluation measures for classification problems are defined from a confusion matrix containing the numbers of examples correctly and incorrectly classified for each class.


Table 1: The Confusion Matrix for a Binary Classification Problem

A confusion matrix, also called a contingency table, contains information about actual and
predicted classifications done by a classification system.

Example:
Consider the spread of Covid.

Figure 22: Example of Covid

The Figure shows the true class of each person on one side and, on the other, the class predicted by logistic regression. The performance of such systems is commonly evaluated using the data in the matrix.

F-Score (F-Measure) is another measure used in binary logistic regression; it combines precision and recall as their harmonic mean. As an equation, it can be written as:

F-Score = (2 × Precision × Recall) / (Precision + Recall)

Recall (R) = True Positives / (True Positives + False Negatives)
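Precision, recall, and F-Score follow directly from the confusion-matrix counts. A sketch with hypothetical counts:

```python
def precision(tp, fp):
    # Proportion of predicted positives that are actually positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of actual positives that were found (sensitivity).
    return tp / (tp + fn)

def f_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Hypothetical counts from a confusion matrix.
p = precision(tp=3, fp=1)   # 0.75
r = recall(tp=3, fn=1)      # 0.75
print(f_score(p, r))        # 0.75
```

When precision and recall are equal, the F-Score equals both; otherwise the harmonic mean pulls it toward the smaller of the two.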


3. Summary
In this topic, we discussed:

• The forward selection and backward elimination methods for variable selection
• When to choose which method of variable selection
• The use of Receiver Operating Characteristic (ROC) Curve and Area Under the Curve
(AUC) for the goodness of fit

