MDS5202 / Segment 04

VARIABLE SELECTION METHODS AND GOODNESS OF FIT

Table of Contents

1. Variable Selection Methods – Variable Selection in Logistic Regression 4

1.1 Forward Selection 4

1.2 Backward Elimination 6

1.3 Choice of Forward Selection or Backward Elimination 8

2. Assessing Model Performance 8

2.1 ROC Curve 14

2.2 Area Under the Curve (AUC) 16

2.3 Model Adequacy 17

2.4 Logistic Regression 18

2.5 Performance Metrics for Model Fit 18

3. Summary 20

©COPYRIGHT 2023 (VER. 1.0), ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION 2/20

Introduction
The performance of the model may vary depending on the thresholds and the number of
variables selected. Several parameters can be used to evaluate the model.

Learning Objectives

At the end of this topic, you will be able to:

● Describe and apply the variable selection methods

● Assess model performance using receiver operating characteristic curve and area
under the curve


1. Variable Selection Methods – Variable Selection in Logistic Regression
Variable selection is critical in building logistic regression models, but there is no hard-and-fast rule for it. It is often necessary to develop a model that achieves satisfactory prediction accuracy while remaining interpretable in terms of a specific theory regarding the role of the independent variables. Note that:
• Keeping too many variables may lead to overfitting.
• An overly simple model may suffer from underfitting.
• The risk of variable selection is that the model may be optimised for one particular data set.

1.1 Forward Selection


Forward selection is a type of stepwise regression that begins with an empty model and adds variables one at a time. Each forward step adds the single variable that most improves the model.

The steps are as follows:


1. Start with a model with no variables.

Figure 1: Null Model

This is the intercept-only (null) model; its log-likelihood serves as the baseline against which later models are compared.


2. Add the most significant variable.

Figure 2: Model with One Variable

The most significant, most influential variable is pulled in first. Developing this one-variable model and comparing it with the model with no predictors shows how much information has been added.

3. Keep adding the most significant variable until reaching the stopping rule or running
out of variables.

Figure 3: Model with Two Variables

In other words, add the next most significant variable and repeat until the stopping rule is reached, i.e. until the model is saturated.

Criteria for Selection

The most significant variable at each step is the one that, when added to the model:
• has the smallest p-value;
• provides the highest increase in R²; or
• provides the highest drop in the model's RSS (Residual Sum of Squares) compared with the other predictors under consideration.

Choose a stopping rule as follows:

• The stopping rule is satisfied when every remaining candidate variable would have a p-value larger than some specified threshold if added to the model.
• When this state is reached, forward selection terminates and returns a model that contains only variables with p-values below the threshold.

The threshold can be:

• a fixed value (for instance, 0.05, 0.2, or 0.5)
• determined by AIC (Akaike Information Criterion)
• determined by BIC (Bayesian Information Criterion)

If a fixed value is chosen, the threshold is the same for all variables. If AIC or BIC is used to determine the threshold automatically, it can differ for each variable.
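The forward-selection loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: `score` stands in for whichever criterion is chosen (AIC, BIC, or a p-value rule; lower is better), and `toy_aic` with its `gain` values is an invented stand-in for a real fitted model's AIC.

```python
def forward_select(candidates, score):
    """Greedy forward selection: start empty, repeatedly add the variable
    that most improves the score (lower is better, as with AIC/BIC)."""
    selected = []
    best = score(selected)
    while len(selected) < len(candidates):
        remaining = [v for v in candidates if v not in selected]
        # Try adding each remaining variable and keep the best trial score.
        trials = {v: score(selected + [v]) for v in remaining}
        v_best = min(trials, key=trials.get)
        if trials[v_best] >= best:   # stopping rule: no improvement possible
            break
        selected.append(v_best)
        best = trials[v_best]
    return selected

# Hypothetical AIC-like score: an invented information gain per variable,
# subtracted from a base deviance, plus a penalty of 2 per parameter added.
gain = {"x1": 10.0, "x2": 5.0, "x3": 0.1}

def toy_aic(variables):
    return 100.0 - sum(gain[v] for v in variables) + 2.0 * len(variables)

print(forward_select(["x1", "x2", "x3"], toy_aic))  # ['x1', 'x2']
```

With a real data set, `score` would refit the logistic regression on each candidate variable set and return its AIC; here `x3` is never added because its small gain does not offset the penalty for an extra parameter.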

1.2 Backward Elimination


In the other method, called backward elimination, we start with all the independent variables in the model:

1. Start with a model with all the independent variables.

Figure 4: Full Model

2. Remove the least significant variable.

Figure 5: Model with Four Variables

Variable X4 was found to be the least significant; it is removed, and the model is left with the four remaining variables.


3. Keep removing the least significant variables until reaching the stopping rule or running
out of variables.

Figure 6: Model with Three Variables

This is now a model with three variables.

Criteria for Elimination

The least significant variable at each step is the one that:
• has the highest p-value in the model;
• causes the smallest drop in R² when eliminated; and
• causes the smallest increase in RSS when eliminated, compared with the other predictors.

Choose a stopping rule as follows:


• The stopping rule is satisfied when all the remaining variables in the model have a p-value smaller than some pre-specified threshold.
• When this condition is satisfied, backward elimination terminates and returns the current step's model.

As with forward selection, the threshold can be,


• A fixed value (For instance: 0.05, 0.2, or 0.5)
• Determined by AIC (Akaike information criterion)
• Determined by BIC (Bayesian information criterion)
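Backward elimination mirrors the forward loop: start with everything, repeatedly drop the variable whose removal most improves the criterion. Again a sketch only, with the same invented AIC-like `toy_aic` (lower scores are better).

```python
def backward_eliminate(variables, score):
    """Greedy backward elimination: start with all variables, repeatedly
    drop the one whose removal most improves the score (lower is better)."""
    selected = list(variables)
    best = score(selected)
    while selected:
        # Try removing each variable in turn and keep the best trial score.
        trials = {v: score([w for w in selected if w != v]) for v in selected}
        v_drop = min(trials, key=trials.get)
        if trials[v_drop] >= best:   # stopping rule: every removal hurts
            break
        selected.remove(v_drop)
        best = trials[v_drop]
    return selected

# Same hypothetical AIC-like score as before: the gains are made-up numbers.
gain = {"x1": 10.0, "x2": 5.0, "x3": 0.1}

def toy_aic(variables):
    return 100.0 - sum(gain[v] for v in variables) + 2.0 * len(variables)

print(backward_eliminate(["x1", "x2", "x3"], toy_aic))  # ['x1', 'x2']
```

Starting from the full model, dropping `x3` lowers the score, while dropping either remaining variable would raise it, so the loop stops — the same model forward selection reached.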


1.3 Choice of Forward Selection or Backward Elimination


Following are guidelines for choosing between the forward selection and backward elimination methods:
• Unless the number of candidate variables exceeds the sample size (or the number of events), use a backward stepwise approach.
• Stepwise variable selection will be highly unstable, especially when the sample size is small compared with the number of variables under consideration.
• This instability is reduced when the sample size (or the number of events) exceeds 50 per candidate variable.

Whenever the backward method is not suitable for your model, use the forward selection method. In either case, one then needs to assess how good the model is, i.e. assess the model's performance.

2. Assessing Model Performance


Sensitivity is calculated using this equation:

Sensitivity = True Positives (TP) / (True Positives (TP) + False Negatives (FN))

• TP is the number of positives correctly classified as positives by the model.


• FN is the number of positives misclassified as negative by the model.

Sensitivity is also called recall.

Specificity is the ability of the diagnostic test/model to correctly classify the test as negative when the disease/event is not present:

Specificity = P(diagnostic test is negative | patient has no disease/no event)

In general, for a classification model:

Specificity = P(model classifies Yi as negative | Yi is negative)
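Both measures can be computed directly from confusion-matrix counts. A minimal sketch, with made-up counts:

```python
def sensitivity(tp, fn):
    # TP / (TP + FN): proportion of actual positives correctly classified.
    return tp / (tp + fn)

def specificity(tn, fp):
    # TN / (TN + FP): proportion of actual negatives correctly classified.
    return tn / (tn + fp)

# Hypothetical counts: 3 of 4 positives and 3 of 4 negatives correct.
print(sensitivity(tp=3, fn=1))  # 0.75
print(specificity(tn=3, fp=1))  # 0.75
```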


Figure 7: ROC Curve

Consider this example: individuals are classified as obese or not obese based on their weight. The points scattered along the line at zero correspond to the non-event (not obese), and the points along the line at one correspond to the event (obese).

So, the Y-axis has two categories. When a logistic regression curve is fitted to the data:

• The blue balls represent obese individuals.

• The red balls represent individuals who are not obese.

Figure 8: Logistic Regression Curve

The graph shows the sigmoid curve that logistic regression fits to the data. A new observation is classified as obese or not obese according to where it falls on this curve.


Figure 9: Individual is an Obese Curve

So, in the graph, the line at 0 means not obese and the line at 1 means obese. In logistic regression, the Y-axis is converted into the probability that an individual is obese.

Figure 10: Individual as Obese and Not Obese Curve

Figure 11: Individual as Obese and Not Obese Curve


To classify individuals as obese or not obese, a rule is needed to turn probabilities into classifications. In this model, a line is drawn at 0.5: individuals with probability below 0.5 are classified as non-obese, and those at 0.5 or above are classified as obese. This is how logistic regression is normally used for classification.

Then, the curve would show that,


• There is a high probability that this individual is obese.
• And this individual is not obese.

Figure 12: Four New Individuals are Not Obese Curve

Suppose there are eight new observations: four are obese, and the other four are non-obese.
So, the non-obese are shown in red, and the obese are shown in green.

Then, the curve would show that,


• These four new individuals, shown in red, are not obese.


Figure 13: Four New Individuals' Obese Curve

Suppose four new individuals are obese; they are shown in green in the Figure below.

Figure 14: Three Individuals Curve

Mark the points against the fitted sigmoid curve at the 0.5 threshold, using the classification rule that a probability below 0.5 means non-obese and 0.5 or above means obese. The matrix generated from these classifications, shown in the above Figure, is called the confusion matrix. In this example, three obese individuals are correctly classified as obese and one is misclassified as non-obese, while one non-obese individual is misclassified as obese. Once the confusion matrix is filled in, sensitivity and specificity can be calculated to evaluate this logistic regression with 0.5 as the threshold for obesity.
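Turning fitted probabilities into a confusion matrix at a chosen threshold can be sketched as follows. The probabilities and labels are made-up numbers mimicking the eight-individual example (1 = obese).

```python
def confusion_counts(probs, labels, threshold=0.5):
    """Build confusion-matrix counts (TP, FP, TN, FN) by classifying
    as positive when the predicted probability >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical fitted probabilities for 8 individuals (1 = obese).
probs  = [0.1, 0.3, 0.45, 0.6, 0.2, 0.55, 0.8, 0.9]
labels = [0,   0,   1,    0,   0,   1,    1,   1]
print(confusion_counts(probs, labels))  # (3, 1, 3, 1)
```

At threshold 0.5 this reproduces the pattern described above: three obese individuals classified correctly, one obese individual missed, and one non-obese individual classified as obese.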


Figure 15: Obesity Curve 0.1 Threshold

Consider another example, Ebola infection, as shown in the above Figure. Samples are classified as 'infected with Ebola' or 'not infected with Ebola'. Here it is essential to correctly classify every sample infected with Ebola, to minimise the risk of an outbreak, so the threshold can be lowered to 0.1. With this threshold, all four infected samples are correctly classified as infected. Of the four non-infected samples, however, only two are correctly classified as not infected; the model classifies the other two as infected. There is freedom to change the classification threshold, but misclassifications of the other class may increase.

Figure 16: Obesity Curve 0.5 Threshold and Confusion Matrix

Conversely, a higher threshold can be set, as shown in the above Figure; in this example it correctly classifies the same number of obese samples as before.


2.1 ROC Curve


If one confusion matrix is made for every threshold that matters, the result is many confusion matrices. To identify the best threshold, we instead make the plot shown below, with the False Positive Rate (1 − Specificity) on the X-axis and the True Positive Rate (Sensitivity) on the Y-axis. Each point on this plot summarises one confusion matrix.

Figure 17: ROC Graph (True Positive Rate vs False Positive Rate)

False Positive Rate = (1 − Specificity) = False Positives / (False Positives + True Negatives)

In Figure 17, each point reflects the TPR and FPR values for a different threshold. The highlighted point reflects a near-ideal scenario: a threshold with high sensitivity and almost one hundred percent specificity. The threshold corresponding to this point can be chosen for the model. Thus, the ROC curve helps us decide on the best threshold value, where there is a balance between a low False Positive Rate and a high True Positive Rate. So, we plot the sensitivity versus one minus the specificity, i.e. the True Positive Rate versus the False Positive Rate. Instead of being overwhelmed by confusion matrices, ROC graphs provide a simple way to summarise all the information.
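Sweeping the threshold and recording an (FPR, TPR) pair for each — the points of the ROC curve — can be sketched as follows; the scores and labels below are invented for illustration.

```python
def roc_points(probs, labels):
    """Sweep every distinct probability as a threshold and record
    (FPR, TPR) for each, summarising all confusion matrices at once."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # a threshold above every score: nothing positive
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical scores for two positives and two negatives,
# perfectly separated by the model.
probs  = [0.9, 0.8, 0.4, 0.2]
labels = [1,   1,   0,   0]
print(roc_points(probs, labels))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

Because the positives all score higher than the negatives here, the curve rises to TPR = 1 before any false positives appear — the ideal corner of the plot.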

Consider the confusion matrix as shown below:


Figure 18: Confusion Rate Matrix

True Positive Rate = Sensitivity = 4 / (4 + 0) = 1, i.e. one hundred percent. This is the True Positive Rate when the threshold is so low that every sample is classified as obese.

However, the False Positive Rate is also 4 / (4 + 0) = 1.

Note that an ideal model should have 100% specificity (zero FPR) and 100% sensitivity. However, this is not easy to achieve: if we try to increase the sensitivity, the specificity suffers, and vice versa.

Figure 19: True Positive Against a False Positive

If we plot the True Positive Rate against the False Positive Rate, the green diagonal line is the worst-case scenario. It is like tossing a coin to decide whether an individual is obese or not obese: there is a 50% risk of misclassification in both categories.


So, the diagonal line shows where True Positive Rate = False Positive Rate.
• A point at (0.75, 1) lies to the left of the dotted line: the proportion of obese samples correctly classified (TP) is greater than the proportion of samples incorrectly classified as obese (FP).
• A point at (0.5, 1) lies even further to the left of the dotted green line, showing that the new threshold further decreases the proportion of samples incorrectly classified as obese (FP).
• The point at (0, 0) represents a threshold that results in zero True Positives and zero False Positives.

In the previous example, represented in the graph below, the threshold at the new point (0, 0.75) correctly classified 75% of the obese samples and 100% of the samples that were not obese. Without sorting through the confusion matrices, one can tell that this threshold (represented in yellow) is better than the others.

2.2 Area Under the Curve (AUC)

Figure 20: AUC Model

In the given graph, let's consider the green line. An AUC of 0.5 or below is poor; above 0.5 is better, and the closer the AUC gets to 1, the better the model's accuracy. In this scenario, the AUC is determined to be 0.9, which is good; it then only remains to choose the operating point based on the True Positive and False Positive Rates. So, two candidate thresholds could be considered in the given graph.


Figure 21: Comparison of Two Models Using AUC Model

Suppose another model's ROC curve is generated, as shown above. When two classification models are compared, the basic objective is to see which offers the higher AUC; that model is then claimed to be the better one. In the given graph, the AUC for the red ROC curve is greater than the AUC for the blue ROC curve, so in this instance the model with the red curve is preferred.
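The AUC can be computed from the ROC points with the trapezoidal rule. A minimal sketch, using two hypothetical curves — one perfectly separating and one no better than a coin toss:

```python
def auc(points):
    """Area under an ROC curve given (FPR, TPR) points,
    via the trapezoidal rule after sorting by FPR."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# ROC points for a perfectly separating model (hypothetical).
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# ROC points for coin-flip classification (the diagonal line).
random_ = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(auc(perfect))  # 1.0
print(auc(random_))  # 0.5
```

An AUC of 1.0 corresponds to the ideal corner of the plot, while 0.5 is the diagonal worst case described earlier.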

2.3 Model Adequacy


Regression models for categorical outcomes should be evaluated for fit and adherence to
model assumptions. There are two main elements of such assessment. They are,
• Discrimination: It measures the ability of the model to classify observations into
outcome categories correctly
• Calibration: It measures how well the model-estimated probabilities agree with the
observed outcomes and is typically evaluated via a goodness-of-fit test

The (binary) logistic regression model describes the relationship between a binary outcome variable and one or more predictor variables. Four tests that can be used to assess it are as follows:

Omnibus Test
• The omnibus test is a likelihood-ratio chi-square test of the current versus the null
model.


• A significance value of less than 0.05 indicates that the current model outperforms the null model.
• Omnibus tests are generic statistical tests used for checking whether the variance
explained by the model is more than the unexplained variance.

Wald’s Test
• Wald's test checks whether an individual explanatory variable is statistically significant.
• Wald's test is a chi-square test.
• A Wald test calculates a Z statistic: W = β̂ / SE(β̂)
• This value squared follows a chi-square distribution and is used as the Wald test statistic.
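A sketch of this computation; the coefficient and standard error below are made-up numbers, and 3.84 is the chi-square critical value with one degree of freedom at the 0.05 level.

```python
def wald_statistic(beta_hat, se):
    # Z statistic: estimated coefficient divided by its standard error.
    return beta_hat / se

# Hypothetical fitted coefficient and standard error.
w = wald_statistic(beta_hat=1.0, se=0.5)
chi2 = w ** 2
print(chi2)         # 4.0
print(chi2 > 3.84)  # True: significant at the 0.05 level (1 df)
```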

Hosmer-Lemeshow Test
• It is a chi-square goodness of fit test for binary logistic regression.

Pseudo R2
• Pseudo R² is a measure of the goodness of fit of the model.
• It is called pseudo R² because it does not have the same interpretation as R² in the multiple linear regression (MLR) model.

2.4 Logistic Regression


In logistic regression, model performance is often measured using sensitivity, specificity, and precision.
• Sensitivity: the ability of the model to correctly classify positives.
• Specificity: the ability of the model to correctly classify negatives.

These terminologies originated in medical diagnostics, where sensitivity (also known as the TP rate) measures the ability of a diagnostic test to identify whether a disease is present in a patient (TP).

2.5 Performance Metrics for Model Fit


The evaluation measures for classification problems are defined from a confusion matrix containing the numbers of examples correctly and incorrectly classified for each class.


Table 1: The Confusion Matrix for a Binary Classification Problem

A confusion matrix, also called a contingency table, contains information about actual and
predicted classifications done by a classification system.

Example:
Consider the spread of Covid.

Figure 22: Example of Covid

The Figure shows the true class of each person on one side and, on the other, the class predicted by logistic regression. The performance of such systems is commonly evaluated using the data in the matrix.

F-Score (F-Measure) is another measure used in binary logistic regression; it combines precision and recall as their harmonic mean. As an equation, it can be written as:

F-Score = (2 × Precision × Recall) / (Precision + Recall)

Recall (R) = True Positives / (True Positives + False Negatives)
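Precision, recall, and F-Score follow directly from the confusion-matrix counts. A sketch with hypothetical counts:

```python
def precision(tp, fp):
    # Proportion of predicted positives that are actually positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of actual positives that were found (sensitivity).
    return tp / (tp + fn)

def f_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Hypothetical counts from a confusion matrix.
p = precision(tp=3, fp=1)   # 0.75
r = recall(tp=3, fn=1)      # 0.75
print(f_score(p, r))        # 0.75
```

When precision and recall are equal, the F-Score equals both; otherwise the harmonic mean pulls it toward the smaller of the two.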


3. Summary
In this topic, we discussed:

• The forward selection and backward elimination methods for variable selection
• When to choose which method of variable selection
• The use of Receiver Operating Characteristic (ROC) Curve and Area Under the Curve
(AUC) for the goodness of fit

