Journal of Agricultural, Biological, and Environmental Statistics

Performance of variable selection in logistic regression: A comparison of LASSO, PLS, Information criterion and Significance based procedures
--Manuscript Draft--

Manuscript Number:
Full Title: Performance of variable selection in logistic regression: A comparison of LASSO, PLS, Information criterion and Significance based procedures
Article Type: Original Article
Corresponding Author: Azonvidé Hubert DOSSA, Master
  Université Nationale des Sciences Technologies Ingénierie et Mathématiques
  Dassa-zoumé, BENIN
Corresponding Author Secondary Information:
Corresponding Author's Institution: Université Nationale des Sciences Technologies Ingénierie et Mathématiques
Corresponding Author's Secondary Institution:
First Author: Azonvidé Hubert DOSSA, Master
First Author Secondary Information:
Order of Authors: Azonvidé Hubert DOSSA, Master; Judicael LALY, Master; Dossou Seblodo Judes Charlemagne GBEMAVO, Ph.D
Order of Authors Secondary Information:
Funding Information:
Abstract: The selection of relevant variables in the context of a large number of covariates to perform accurate predictions is becoming increasingly necessary for statisticians and practitioners in general. This study aimed to evaluate the performance of LASSO, PLS, information criterion and significance-based variable selection methods frequently used in generalized linear models. A simulation study allowed us to explore the performance of these methods in terms of variable selection and prediction. We considered low, medium and high dimensional configurations, different cases of multicollinearity between covariates and different sample sizes. An application using a real medium-dimensional agronomic dataset was also included in this study. It was found that the PLS method would be effective when the sample size is equal to or close to the number of covariates, with a high correlation. The LASSO and AIC methods were suitable when the correlation was medium or low, as is the Stepwise method.

Title Page including author information

Performance of variable selection in logistic regression: A comparison of LASSO, PLS, Information criterion and Significance based procedures

Hubert A. DOSSA1,2*, Judicael LALY1,2† and Charlemagne D.S.J. GBEMAVO1,2†

1 Laboratoire de Biomathématiques et d'Estimations Forestières, University of Abomey-Calavi, Abomey-Calavi, 100190, Benin.
2 Ecole Nationale Supérieure des Biosciences et Biotechnologies Appliquées, National University of Sciences, Technologies, Engineering and Mathematics, Abomey, 10587, Benin.

*Corresponding author(s). E-mail(s): hubertadonha@gmail.com;
Contributing authors: judicaelmaurilelaly@gmail.com; cgbemavo@yahoo.fr;
† These authors contributed equally to this work.

Abstract
The selection of relevant variables in the context of a large number of covariates to perform accurate predictions is becoming increasingly necessary for statisticians and practitioners in general. This study aimed to evaluate the performance of LASSO, PLS, information criterion and significance-based variable selection methods frequently used in generalized linear models. A simulation study allowed us to explore the performance of these methods in terms of variable selection and prediction. We considered low, medium and high dimensional configurations, different cases of multicollinearity between covariates and different sample sizes. An application using a real medium-dimensional agronomic dataset was also included in this study. It was found that the PLS method would be effective when the sample size is equal to or close to the number of covariates, with a high correlation. The LASSO and AIC methods were suitable when the correlation was medium or low, as is the Stepwise method.

Keywords: Correlation; Multicollinearity; Variable selection; Statistical methods; Modeling.

Acknowledgments. This study was carried out as part of the Master's Degree in Statistics, major Biostatistics, obtained at the Laboratoire de Biomathématiques et d'Estimations Forestières (LABEF) of the University of Abomey-Calavi (UAC). We express our sincere thanks to the examination board.

Declarations
Funding: This research received no external funding
Conflict of interest: The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
Code and Data Availability: The code to run the simulation, and the code and data used to fit the
models, are publicly available at https://github.com/Hubertdossa/Variable-Selection-Methods.git

Blind manuscript
Performance of variable selection in logistic regression: A comparison of LASSO, PLS, Information criterion and Significance based procedures
Abstract
The selection of relevant variables in the context of a large number of covariates to perform accurate predictions is becoming increasingly necessary for statisticians and practitioners in general. This study aimed to evaluate the performance of LASSO, PLS, information criterion and significance-based variable selection methods frequently used in generalized linear models. A simulation study allowed us to explore the performance of these methods in terms of variable selection and prediction. We considered low, medium and high dimensional configurations, different cases of multicollinearity between covariates and different sample sizes. An application using a real medium-dimensional agronomic dataset was also included in this study. It was found that the PLS method would be effective when the sample size is equal to or close to the number of covariates, with a high correlation. The LASSO and AIC methods were suitable when the correlation was medium or low, as is the Stepwise method.

Keywords: Correlation; Multicollinearity; Variable selection; Statistical methods; Modeling.
Abbreviations

AICc: Corrected Akaike Information Criterion
GLMs: Generalized Linear Models
PLS: Partial Least Squares
LASSO: Least Absolute Shrinkage and Selection Operator
GVIF: Generalized Variance Inflation Factors
VIP: Variable Importance in Projection
Nber: Number
Imp%: Percentage of important variables selected
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
n: Sample size
p: Number of covariates
q: Number of correlated variables
Lc: Maximized likelihood value from the fitted model
Lnull: Maximized likelihood value from the null model
Pr: Probability
exp: Exponential
(.)^T: Transpose
i.i.d.: independent and identically distributed
min: Minimum
1 Introduction

In statistical analysis, the selection of the independent (predictor) variables that might influence the outcome variable is an important task in building a regression model (Zellner et al. 2004). Selecting variables refers to situations where the statistician or practitioner seeks to select a subset of variables to include in a model from a set of starting variables (x1, x2, ..., xp). It is therefore important to apply a first step of variable selection in order to correctly explain the data and avoid unnecessary noise (Freijeiro-González et al. 2021). Variable selection is nowadays one of the crucial steps in the construction of a regression model (Khan et al. 2022). Choosing too many covariates is likely to increase the variance of the estimated or trained model. Many methods of variable selection have been proposed and reviewed by researchers. Zhang et al. (2018) reported that a variety of methods exist for variable selection, but none of them is without limitations.
Correlation between covariates is one of the important problems that leads researchers to perform variable selection. Several methods are used to eliminate correlated variables, but users often have no idea of the accuracy of these methods. Finding effective solutions to this problem nowadays involves several variable selection methods established by researchers, such as Partial Least Squares (PLS) regression (Wang et al. 2015; Xiong et al. 2022), methods based on information criteria (AIC) (Cavalaro and Pereira 2022; Stewart et al. 2022), the Least Absolute Shrinkage and Selection Operator (LASSO) (El-Sheikh et al. 2022; Bag et al. 2022), and significance-level (stepwise) procedures (Hashemi et al. 2022). By taking advantage of the statistical tests associated with linear regression, it is possible to select the significant explanatory variables to be included in the PLS regression and to choose the number of PLS components to retain (Bastien et al. 2005). The acronym PLS is retained as a reference to a general methodology for relating a response variable to a set of predictors. For example, Gauchi and Chagnon (1999) asserted that the best model should initially contain few but relevant explanatory variables, that their regression coefficients should be interpretable, that is to say have the same sign as the regression coefficient of the corresponding simple regression, and that this is an advantage for variable selection. Freijeiro-González et al. (2021) asked whether LASSO is the best option, or at least a good starting point, for identifying relevant covariates. It is then important to know which of these methods will be the most accurate in terms of the model's prediction. Although some studies (Su et al. 2017) discussed this topic, totally convincing answers have not been found for the dependency scenarios. The position of these authors presents a scientific weakness because it did not provide any knowledge on the influence of, for example, the sample size, the number of variables, or the number of events per predictive variable. Nevertheless, recent studies (Ranstam and Cook 2018) have shown that LASSO regression outperforms standard variable selection methods in some contexts, one of its particularities being to avoid potential biases related to the estimation of certain elements. Muthukrishnan and Rohini (2016) compared LASSO to traditional methods and justified the performance of LASSO regression over traditional procedures such as ordinary least squares (OLS) regression, stepwise regression and partial least squares regression, which are very sensitive to random errors. Kumar et al. (2019) showed through a comparison that the performance of LASSO regression is better than stepwise regression to some extent.
However, other methods, such as those based on information criteria, are widely used in many cases (Burr et al. 2008). The two most commonly used selection criteria of this family are the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) (Kuha 2004). Alternatively, the decision to keep a variable in the model may be based on its clinical or statistical significance (Bursac et al. 2008). There are then several variable selection algorithms. These methods are mechanical and, as such, have certain limitations. This approach has an advantage when the analyst is interested in modeling risk factors and not only in prediction (Bursac et al. 2008). In addition to significant covariates, this variable selection procedure has the ability to retain important confounding variables, potentially resulting in a slightly richer model.
A comparative study of the methods used is urgently needed. This study is a scientific contribution that analyzes the performance of the most widely used selection methods to guide a judicious choice by users. It proposes to bring out the variable selection and predictive performance of four variable selection methods, namely LASSO, PLS, AIC and Stepwise, by varying the sample size, the number of predictors and the correlation level between covariates, thus defining different configurations or scenarios through simulation. The study achieved this goal by answering three (03) research questions: (1) Which of the four methods has the best variable selection performance and predictive performance over sample size and total number of predictors in the case of modelling a binary response variable? (2) Which of the four methods has the best variable selection performance and predictive performance over sample size, total number of predictors and number of correlated covariates in the case of modelling a binary response variable? (3) What happens to the variable selection performance and the predictive performance of these four methods on real data from the agricultural field? The answers to these questions will help to improve the quality of binary response variable modelling.
2 Theoretical framework of variable selection methods

2.1 Logistic regression

2.1.1 Brief overview

Logistic regression provides a method for modelling a binary response variable, which takes values 1 and 0. This regression is used to obtain odds ratios in the presence of more than one explanatory variable (Sperandei 2014). Before delving into the details of the variable selection methods, it is necessary to recall the structure of the logistic regression. In order to simplify notation, we use the quantity π(x) = E(Y | x) to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model used is:

$$\pi(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}. \qquad (1)$$

where Y denotes the response variable, x denotes a value of the independent variable, and the βj values denote the model parameters.
A transformation of π(x) that is central to our study of logistic regression is the logit transformation. This transformation is defined, in terms of π(x), as:

$$
\begin{aligned}
\pi(x) &= \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} \\
1-\pi(x) &= \frac{1+e^{\beta_0+\beta_1 x}-e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} = \frac{1}{1+e^{\beta_0+\beta_1 x}} \\
\Leftrightarrow \quad \frac{\pi(x)}{1-\pi(x)} &= e^{\beta_0+\beta_1 x}
\end{aligned} \qquad (2)
$$

By applying the natural logarithm (ln), we obtain:

$$g(x) = \ln\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0 + \beta_1 x. \qquad (3)$$

where g(.) is the link function of the binary logistic model.
The importance of this transformation is that g(x) has many of the desirable properties of a linear regression model (Al-Ghamdi 2002). The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x. The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as y = E(Y | x) + ε, where the quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x is normal with mean E(Y | x) and a constant variance. This is not the case with a dichotomous outcome variable. In this situation, we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values: if y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is to say, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean π(x).
In summary, Hosmer et al. (2000) have shown that, in a regression analysis when the outcome variable is dichotomous, first, the model for the conditional mean of the regression equation must be bounded between zero and one; the logistic regression model π(x), given in equation (1), satisfies this constraint. Second, the binomial, not the normal, distribution describes the distribution of the errors and is the statistical distribution on which the analysis is based. Third, the principles that guide an analysis using linear regression also guide us in logistic regression.
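To make the relationship between π(x) and the logit concrete, the short numerical sketch below (in Python, with arbitrary illustrative coefficients that are not estimates from this study) evaluates equation (1) and checks that the logit transformation of equation (3) recovers the linear predictor β0 + β1x.

```python
import numpy as np

# Arbitrary illustrative coefficients (not estimates from this study)
beta0, beta1 = -1.0, 0.5
x = np.linspace(-4, 4, 9)

# Equation (1): probability of success given x
pi_x = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))

# Equation (3): the logit transformation recovers the linear predictor
logit_x = np.log(pi_x / (1 - pi_x))

print(np.allclose(logit_x, beta0 + beta1 * x))  # True
```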
2.1.2 Model

Consider the random vector (X1, ..., Xp, Y), where Y is a binary variable, coded 0 or 1. We are interested in the relationship between the response variable Y and several explanatory variables X = (X1, ..., Xp). The observations are groups of individuals (strata, denoted k = 1, ..., K), consisting of a case (Y1k = 1) and M controls (Yik = 0, i = 2, ..., M + 1), each of them having a value of X. For individual i in stratum k, we have the observation vector xik = (xik1, ..., xikp), where i = 1, ..., M + 1 and k = 1, ..., K.
Let Pik be the probability (not conditional on the sampling mode) of the event occurring for subject i in stratum k. Consider the logistic model, assuming that the risk varies from one group to another:

$$
\begin{aligned}
P_{ik} &= P(Y_{ik}=1 \mid x_{ik}) \\
&= \frac{\exp\left(\alpha_0 + \sum_{i=1}^{K}\alpha_i 1_i + \sum_{j=1}^{p}\beta_j x_{ikj}\right)}{1+\exp\left(\alpha_0 + \sum_{i=1}^{K}\alpha_i 1_i + \sum_{j=1}^{p}\beta_j x_{ikj}\right)} \\
&= \frac{\exp\left(\alpha_0 + \alpha_k + x_{ik}\beta\right)}{1+\exp\left(\alpha_0 + \alpha_k + x_{ik}\beta\right)}
\end{aligned} \qquad (4)
$$

where 1i is an indicator function that is 1 if the individual belongs to stratum i and 0 otherwise; α0 is the proportion of cases in unexposed subjects; αk are the coefficients representing the effect of the matching variables on the response, which reflect the differences between the strata; and the coefficients β = (β1, ..., βp)' represent the effects of the explanatory variables or, equivalently, the log-odds ratios. This relationship between the probability of occurrence of the event (or the risk of occurrence of the event, since Yik | xik are distributed according to a Bernoulli distribution and thus Pik = E[Yik | xik]) and the values of the explanatory variables can also be expressed using the logit transformation:

$$\operatorname{logit}[P_{ik}] = \log \frac{P(Y_{ik}=1 \mid x_{ik})}{1-P(Y_{ik}=1 \mid x_{ik})} = \alpha_0 + \alpha_k + x_{ik}\beta. \qquad (5)$$
2.2 Variable selection procedures

In this setup, the response variable is always binary. If the outcome is a "success" (respectively, "failure"), we assign 1 (respectively, 0) as the value of the response variable. Let us use Y = (y1, y2, ..., yn) to denote the sample of response observations, which are assumed to depend on p covariates. For the i-th sample observation, the vector of covariates is denoted by xi = (xi1, xi2, ..., xip)'. The corresponding regression coefficients in the logistic regression are denoted by β = (β1, β2, ..., βp)', which is our primary parameter of interest. We shall use X = [x1 : ... : xn]' to denote the set of explanatory variables in the model. The order of the design matrix X is n × p. Throughout this paper, n and p denote the number of observations and the total number of covariates, respectively. Then, the logistic regression model can be written using vector-matrix notation (Bag et al. 2022) as:

$$\operatorname{logit}(Y \mid X) = X\beta \qquad (6)$$

where logit(Y | X) denotes the vector of logit transformations of yi given xi for 1 ≤ i ≤ n, defined as

$$\operatorname{logit}(y_i \mid x_i) = \log \frac{P(y_i=1 \mid x_i)}{P(y_i=0 \mid x_i)}. \qquad (7)$$

Writing the probability P(yi = 1 | xi) as πi, we note that it can be expressed as

$$\pi_i = \frac{\exp\left(x_i^{\top}\beta\right)}{1+\exp\left(x_i^{\top}\beta\right)}. \qquad (8)$$

Since the complete likelihood for the binary data is

$$L(\beta) = \prod_{i=1}^{n} \pi_i^{y_i}(1-\pi_i)^{1-y_i}, \qquad (9)$$

the log-likelihood for the regression coefficients can be written as

$$\log L(\beta) = \sum_{i=1}^{n}\left[y_i x_i^{\top}\beta - \log\left(1+\exp\left(x_i^{\top}\beta\right)\right)\right]. \qquad (10)$$
2.2.1 LASSO regression

The Least Absolute Shrinkage and Selection Operator (LASSO) was set up by Tibshirani (1996) for parameter estimation and variable (model) selection simultaneously in regression analysis. The LASSO is a particular case of penalized least squares regression with an L1-penalty function. The LASSO estimate can be defined by

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\left\{\frac{1}{2}\sum_{i=1}^{N}\left(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\} \qquad (11)$$

which can also be written as

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\sum_{i=1}^{N}\left(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\right)^2, \quad \text{subject to } \sum_{j=1}^{p}|\beta_j| \le t,$$

where λ > 0 is a regularisation parameter.
LASSO shrinks each coefficient by a constant amount λ, truncating at zero. Hence it is a forward-looking variable selection method for regression. It minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. LASSO was originally defined in the context of least squares, but it can also be extended to a wide variety of models. LASSO improves both prediction accuracy and model interpretability by combining the good qualities of ridge regression and subset selection. If there is high correlation within a group of predictors, LASSO tends to choose only one among them and shrinks the others to zero. It reduces the variability of the estimates by setting some of the coefficients exactly to zero, which produces easily interpretable models.
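As an illustration of how an L1-penalized logistic regression can be fitted in practice, the hedged sketch below uses scikit-learn on synthetic data; the penalty strength C (the inverse of λ) and the data-generating values are arbitrary choices for demonstration, not the settings used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first three covariates truly influence the binary response
true_beta = np.array([1.5, -2.0, 1.0] + [0.0] * (p - 3))
prob = 1 / (1 + np.exp(-(X @ true_beta)))
y = rng.binomial(1, prob)

# L1-penalized logistic regression (LASSO); C is the inverse of lambda
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso_logit.fit(X, y)

selected = np.flatnonzero(lasso_logit.coef_.ravel() != 0)
print("Selected covariate indices:", selected)
```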
2.2.2 PLS regression

PLS (partial least squares) regression is an old (Wold 1966) and widely used method, especially in chemometrics and in the food industry, when analyzing spectral data (Near Infra-Red or HPLC) that are discretized and therefore always of high dimension. PLS regression is an efficient method, which justifies its widespread use, but it does not lend itself to a traditional statistical analysis that would exhibit the distributions of its estimators.
The basic idea is to compute the principal scores of X and Y and to set up a regression model between the scores. In the case of a single response y and p predictors, the PLS regression model with h (h ≤ p) latent variables can be expressed as follows (Geladi and Kowalski 1986; Eriksson et al. 2001):

$$
\begin{aligned}
X &= TP^{\top} + E \\
y &= Tb + f.
\end{aligned} \qquad (12)
$$

In equation (12), X (n × p), T (n × h), P (p × h), y (n × 1) and b (h × 1) are used respectively for the predictors, the X scores, the X loadings, the response, and the regression coefficients of T. The k-th element of the column vector b describes the relation between y and tk, the k-th column vector of T. E (n × p) and f (n × 1) stand for the random errors of X and y, respectively.
Based on how variable selection is defined in PLS regression, the variable selection methods can be categorized into three main categories (Mehmood et al. 2012): filter, wrapper, and embedded methods. There is thus a large family of algorithms available for the selection of variables and their modeling. In this study, we used the filter methods, more precisely the basic Variable Importance in Projection (VIP) algorithm. It is a method that is generally used as a criterion for variable selection, and past studies have generally used this approach, especially in comparative studies of variable selection methods, as in Chong and Jun (2005) for example. The variable importance in PLS projections was introduced by Wold (1966) as "variable influence on projection" and is now known as "variable importance in projection", a term coined by Eriksson et al. (2001). The idea behind this measure is to aggregate the importance of each variable j over the components. The VIP measure vj is defined as:

$$v_j = \sqrt{\,p\sum_{a=1}^{A}\left[SS_a\,(w_{aj}/\|w_a\|)^2\right] \Big/ \sum_{a=1}^{A} SS_a\,} \qquad (13)$$

where SSa is the sum of squares explained by the a-th component. Hence, the weight vj is a measure of the contribution of each variable according to the variance explained by each PLS component, where (w_{aj}/||w_a||)^2 represents the importance of the j-th variable. Since the variance explained by each component can be computed by the expression q_a^2 t'_a t_a (Eriksson et al. 2001), vj can alternatively be expressed as

$$v_j = \sqrt{\,p\sum_{a=1}^{A}\left[q_a^2\,t_a^{\prime}t_a\,(w_{aj}/\|w_a\|)^2\right] \Big/ \sum_{a=1}^{A} q_a^2\,t_a^{\prime}t_a\,}. \qquad (14)$$

Variable j can be eliminated if vj < u for some user-defined threshold u ∈ [0, ∞). It is generally accepted that a variable should be selected if vj > 1 (Eriksson et al. 2001; Gosselin et al. 2010). Further, the importance of each variable can, if preferred, be expressed as a percentage. Also, if probabilistic statements about the importance vj are required, a bootstrap procedure can be applied. This may improve the stability of the results compared to selection based on the regression coefficients β (Eriksson et al. 2001).
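A minimal sketch of how the VIP scores of equation (14) can be computed from a fitted PLS model is given below, using scikit-learn's PLSRegression; the synthetic data and the number of components are illustrative assumptions only, not the settings of this study.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    """Compute VIP scores (equation 14) from a fitted scikit-learn PLSRegression."""
    T = pls.x_scores_        # X scores t_a, shape (n, A)
    W = pls.x_weights_       # X weights w_a, shape (p, A)
    Q = pls.y_loadings_      # y loadings q_a, shape (1, A) for a single response
    p, A = W.shape
    # Variance of y explained by each component: SS_a = q_a^2 * t_a' t_a
    ss = (Q.ravel() ** 2) * np.sum(T ** 2, axis=0)
    # Squared normalized weights (w_aj / ||w_a||)^2
    w_norm2 = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(p * (w_norm2 @ ss) / ss.sum())

# Illustrative use on synthetic data (assumed settings)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
pls = PLSRegression(n_components=3).fit(X, y)
print("Variables with VIP > 1:", np.flatnonzero(vip_scores(pls) > 1))
```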
2.2.3 Information criterion: the case of the Akaike Information Criterion (AIC)

There is a variety of established criteria. Here, we focus on the selection criterion that is most commonly used, the AIC (Chowdhury and Turin 2020). The AIC was first developed by Akaike (1973) as a way to compare different models on a given outcome. During the last fifteen years, Akaike's entropy-based Information Criterion (AIC) has had a fundamental impact on statistical model evaluation problems (Bozdogan 1987). Including different variables in the model provides different models, and the AIC attempts to select the model by balancing underfitting (too few variables in the model) and overfitting (too many variables in the model) (Burnham and Anderson 2004). Including too few variables often fails to capture the true relation, and too many variables create a generalisability problem (Aho et al. 2014). A trade-off is therefore required between simplicity and adequacy of model fitting, and the AIC can help achieve this (Snipes and Taylor 2014). The AIC tries to estimate the relative information loss compared with other candidate models. The quality of a model is believed to be better with smaller information loss, and it is important to select the model that best minimises that loss. Candidate models for the specific data are ranked from best to worst according to the value of the AIC (Burnham and Anderson 2004). Among the available models for the specific data, the model with minimum AIC is best (Snipes and Taylor 2014). The AIC only provides information about the quality of a model relative to the other models and does not provide information on the absolute quality of the model. With a small sample size (relative to a large number of parameters/variables, or indeed any number of variables/parameters), the AIC often provides models with too many variables.
The Akaike Information Criterion attempts to quantify the difference between the distribution of the data, Y, and the distribution specified by the model in question. A measure of the discrepancy between the true model and the approximating model is given by the information quantity I(f, g), which is equal to the negative of the generalized entropy (Wagenmakers and Farrell 2004). Akaike (1973) has shown that choosing the model with the lowest expected information loss (i.e., the model that minimizes the expected Kullback-Leibler discrepancy) is asymptotically equivalent to choosing the model Mi, i = 1, 2, ..., K, that has the lowest AIC value. The AIC is defined as

$$\mathrm{AIC}_i = -2\log L_i + 2V_i, \qquad (15)$$

where Li, the maximum likelihood for the candidate model i, is determined by adjusting the Vi free parameters in such a way as to maximize the probability that the candidate model has generated the observed data (Wagenmakers and Farrell 2004). Equation (15) shows that the AIC rewards descriptive accuracy via the maximum likelihood and penalizes lack of parsimony according to the number of free parameters (note that models with smaller AIC values are to be preferred). Equation (15) is based on asymptotic approximations and is valid only for sufficiently large data sets. The finite-sample correction

$$\mathrm{AICc} = -2\log L + 2V + \frac{2V(V+1)}{n-V-1} \qquad (16)$$

(Hurvich and Tsai 1991) is generally recommended when n/V < 40 (Wagenmakers and Farrell 2004). By focusing on the estimation of the relative likelihood L of the model, Akaike (1978) proposed to calculate, for each model, the difference in AIC with respect to the AIC of the best candidate model, that is:

$$\Delta_i(\mathrm{AIC}) = \mathrm{AIC}_i - \min \mathrm{AIC}. \qquad (17)$$

From the differences in AIC, we can then obtain an estimate of the relative likelihood L of model i by the simple transform:

$$L(M_i \mid \text{data}) \propto \exp\left(-\tfrac{1}{2}\Delta_i(\mathrm{AIC})\right); \qquad (18)$$

where ∝ stands for "is proportional to".
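The quantities in equations (15) to (18) can be computed directly from fitted models, as in the hedged sketch below with statsmodels; the two candidate logistic models and the simulated data are illustrative assumptions, not the configurations compared in this study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 0] - X[:, 1]))))

# Two candidate models: covariate 0 only, versus covariates 0 and 1
aics = []
for cols in ([0], [0, 1]):
    model = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0)
    V = model.df_model + 1                       # number of free parameters
    aic = -2 * model.llf + 2 * V                 # equation (15)
    aicc = aic + 2 * V * (V + 1) / (n - V - 1)   # equation (16), not used further here
    aics.append(aic)

delta = np.array(aics) - min(aics)               # equation (17)
rel_lik = np.exp(-0.5 * delta)                   # equation (18)
print(delta, rel_lik)
```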
2.2.4 Significance probability methods: significance level in stepwise selection

The idea of this method is simply to use the conventional test statistic for testing the enlarged models, which include one of the most promising candidates, against the current model, and to use a new stopping rule (Hwang and Hu 2015). The most promising candidates are chosen based on the strength of the correlations between the covariates and the residuals of the current model. The proposed stopping rule is based on the well-known theoretical properties that (1) the p-values of the test statistics are Unif(0,1) distributed if the predictors are irrelevant to the responses and (2) the minimum of m independent Unif(0,1) random variables can be assumed to be approximately beta distributed with parameters 1 and m (Loughin 2004). Three main selection methods generally constitute this family: forward selection, backward selection and both-direction selection (Stepwise). For our comparison study, we used the stepwise approach, which is the one commonly used, takes more account of the reality and works in both directions. Each dependent and independent variable was standardized by subtracting the mean and then dividing by the standard deviation of the variable, yielding standardized regression coefficients. The following formula is used to estimate the coefficients of these variables:

$$b_{j,\mathrm{std}} = b_j\left(\frac{S_{x_j}}{S_y}\right) \qquad (19)$$

where Sy and S_{x_j} are the standard deviations of the dependent variable and the corresponding j-th independent variable (Loughin 2004).
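As a hedged illustration of a significance-based search, the sketch below implements a simple forward selection for a logistic model in which, at each step, the candidate covariate with the smallest Wald p-value is added as long as that p-value stays below a threshold; this is a simplified sketch, not the exact both-direction stepwise routine used in the study.

```python
import numpy as np
import statsmodels.api as sm

def forward_pvalue_selection(X, y, alpha=0.05):
    """Greedy forward selection for a logistic model based on Wald p-values (illustrative)."""
    remaining, selected = list(range(X.shape[1])), []
    while remaining:
        pvals = {}
        for j in remaining:
            cols = selected + [j]
            fit = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0)
            pvals[j] = fit.pvalues[-1]      # p-value of the newly added covariate
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
y = rng.binomial(1, 1 / (1 + np.exp(-(1.2 * X[:, 0] - 0.9 * X[:, 2]))))
print(forward_pvalue_selection(X, y))
```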
3 Materials and methods

3.1 General procedure

The data were simulated based on the real data collected as part of the work of Loko et al. (2022). This approach was taken from the work of Yoo and Rho (2021). The real data used have the advantage of being made up of a mixture of variables from different distributions. They were collected as part of an agricultural study and consisted of a matrix of 418 rows and 28 columns. The first row contained the variable names and the other 417 rows represented data from 417 respondents. In total there were 27 explanatory variables and 1 explained variable (response variable). Quantitative and qualitative variables were included in the real data.
3.1.1 Simulation of independent variables

Data sets with different numbers n of observations were generated following the parameters of each variable determined from the real data (Yoo and Rho 2021), using Bernoulli and multinomial distributions for the categorical variables, a Poisson distribution for the count variables, and multivariate normal distributions for the continuous quantitative variables.
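A minimal sketch of such a mixed-distribution covariate generator is shown below; the distribution parameters are placeholders, since the actual values were estimated from the real data of Loko et al. (2022) and are not reproduced here.

```python
import numpy as np

def simulate_covariates(n, rng):
    """Generate a mixed-type covariate matrix (illustrative parameter values)."""
    binary = rng.binomial(1, 0.4, size=n)                            # Bernoulli variable
    categorical = rng.choice([0, 1, 2], size=n, p=[0.5, 0.3, 0.2])   # multinomial variable
    count = rng.poisson(3.0, size=n)                                 # count variable
    mean, cov = [0.0, 1.0], [[1.0, 0.3], [0.3, 1.0]]
    continuous = rng.multivariate_normal(mean, cov, size=n)          # correlated continuous pair
    return np.column_stack([binary, categorical, count, continuous])

X = simulate_covariates(500, np.random.default_rng(4))
print(X.shape)  # (500, 5)
```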
3.1.2 Simulation of response variable

In the current study, the response variable was binary in nature, such as pass/fail or absence/presence of a response in each observation, as in binary logistic regression. This response variable was generated as 0 or 1, using the logistic regression model in equation (20). Indeed, after having generated the covariates, we wrote a model which was used to derive the values of the response variable from the covariates. We chose to generate the response variable in this way in order to maintain the noise contained in the real data, so as not to fail in our objectives, since there is a relationship between these covariates which gives a response yi.

$$y_i \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Ber}(p_{x_i}), \qquad p_{x_i} = \Pr(y_i = 1 \mid x_i) = \frac{\exp\left(x_i^{T}\beta\right)}{1+\exp\left(x_i^{T}\beta\right)}. \qquad (20)$$
52
53 This study aimed to compare variable selection performance and predictive performance of the four com-
54 peting methods in our work such as LASSO, PLS, AIC and Stepwise. Thus, we drew on some previous
55 researches (Khan et al. 2022; Bag et al. 2022; Hazimeh and Mazumder 2020) to establish a configuration of
56 simulations. Based on the methodology of Bag et al. (2022), nine different configurations were considered in
57 the study (Figure 1) to encompass a wide range of problems encountered in real life. In this section, we had
58 used n, p and q to denote the number of observations in the sample, the total number of covariates and the
59 number of collinear covariates, respectively.
60
61
62
63
64
65 7
[Figure 1 about here]
Fig. 1: Simulation configurations used in this work: sample sizes n in {80, 500, 1000}, total numbers of covariates p in {10, 20, 27}, and numbers of collinear covariates q in {4, 7, 14}. Here, n is the sample size, p is the total number of covariates, and q is the number of collinear covariates.
The simulation frameworks were grouped as follows.
Let X be the matrix of the p covariates. We generated the observations for this number p of covariates to constitute the matrix X of dimension (n × p). For that, we respected the configuration according to which each simulated data set carries within it a known number of correlated covariates, that is to say a data set having a number q of correlated covariates. In order to have an idea of the correlated variables to eliminate from our generalized linear regression model, we previously submitted the variables to a test of the Variance Inflation Factor (VIF), which is the most commonly used indicator to detect multicollinearity (Tamura et al. 2019).
For each defined simulation configuration, different sample sizes n in {80, 500, 1000} were simulated, as in Khan et al. (2022), with the number of predictor variables p varying in the set {10, 20, 27}. These choices are assumed to represent low, moderate and high data dimensions, respectively. The number of correlated covariates that the four methods are expected to eliminate was varied in reasonable ways for each configuration of p, following the manner of Bag et al. (2022). A variation of q in {4, 7, 14} was respected in order to have data sets with different numbers of correlated variables.
The assignment of the data to the distributions considered in this work was arbitrary. The representativeness of each distribution in the simulated data was taken into account so that, for a simulated data set, there were variables of different distributions, thereby respecting the heterogeneity of the data sets. For the current study, the distributions were not evenly balanced; the idea was just to have a database in which the distributions are generally represented.
For the coefficients, the p × 1 parameter vector β was made of values from the estimation of the parameters of our real data. Each estimated parameter was in fact used to generate another variable of the same nature as the one used to estimate it.
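As a hedged illustration of this VIF screening step, the sketch below computes variance inflation factors with statsmodels on synthetic covariates; the data are illustrative, and the thresholds follow the convention recalled later in the application section (VIF = 1 uncorrelated, 1 < VIF ≤ 5 moderate, VIF > 5 high).

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)   # strongly correlated with x1
x3 = rng.normal(size=300)                   # independent covariate
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each covariate (the intercept column is skipped)
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(np.round(vifs, 2))   # large values flag collinear covariates
```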
3.2 Comparison methods

3.2.1 Performance measures

Six performance criteria were used to compare the four selection methods: three criteria for the variable selection performance evaluation, and three others for the prediction performance evaluation.
3.2.2 Variable selection performance evaluation

The performance criteria used were the mean squared error (MSE); McFadden's pseudo-R squared (R²), which is the appropriate measure of R² in the case of logistic regression with a binary response; and a last criterion which concerns the number of covariates retained by each method. The formal definitions of these measures are provided below (Chai and Draxler 2014).

$$MSE = \frac{1}{n}\sum_{j=1}^{m}(\beta_j - \hat{\beta}_j)^2; \qquad (21)$$

where βj is the actual coefficient value and β̂j is the estimated value for the covariates.

$$R^2_{\mathrm{McFadden}} = 1 - \frac{\log(L_c)}{\log(L_{null})} \qquad (22)$$

where Lc denotes the (maximized) likelihood value from the current fitted model, and Lnull denotes the corresponding value for the null model, i.e. the model with only an intercept and no covariates.
In order to achieve the objective, the number of variables selected in each iteration was recorded and the average over all iterations was calculated (Bag et al. 2022). This helped us to evaluate the performance of a method in terms of variable reduction. So let Var_selected be the mean number of covariates retained after all repetitions, and β̂j be the estimated coefficients:

$$Var_{selected} = p - \#\{1 \le j \le p : \hat{\beta}_j = 0\}. \qquad (23)$$

However, the number of covariates selected does not necessarily imply a correct set of covariates. Thus, in the real data application, the proportions of important and unimportant variables selected were calculated to analyze this particular property.
A good procedure should result in an almost correct number of selected variables. In addition, the percentage of important variables selected should be close to 100%, while that of unimportant variables should be close to zero. These measures are defined mathematically as follows. Let β̂j and βj for (1 ≤ j ≤ p) be the estimated and the true coefficients, and β̂ and β the corresponding vectors. Then the above quantities are defined as:

$$Imp(\%) = \frac{\#\{1 \le j \le m : \hat{\beta}_j \ne 0,\ \beta_j \ne 0\}}{\#\{1 \le j \le m : \beta_j \ne 0\}} \times 100 \qquad (24)$$

$$Unimp(\%) = 100 - Imp. \qquad (25)$$
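A short sketch of how these selection metrics can be computed for a single replicate is given below; the true and estimated coefficient vectors and the log-likelihood values are arbitrary placeholders.

```python
import numpy as np

# Placeholder true and estimated coefficients for one replicate
beta_true = np.array([1.5, -2.0, 0.0, 0.0, 0.7])
beta_hat = np.array([1.2, -1.7, 0.0, 0.1, 0.0])

mse = np.mean((beta_true - beta_hat) ** 2)                  # equation (21)
n_selected = np.sum(beta_hat != 0)                          # variables retained, equation (23)
important = (beta_true != 0)
imp_pct = 100 * np.sum((beta_hat != 0) & important) / np.sum(important)  # equation (24)
unimp_pct = 100 - imp_pct                                   # equation (25)

# McFadden's R^2 (equation 22) from log-likelihoods of the fitted and null models,
# given here as placeholder values
ll_fit, ll_null = -120.0, -200.0
r2_mcfadden = 1 - ll_fit / ll_null
print(mse, n_selected, imp_pct, r2_mcfadden)
```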
3.2.3 Prediction performance evaluation

Here, prediction Accuracy, RMSE and Precision were used to evaluate the predictive performance of the models. They are among the measures most used in logistic regression to evaluate models, and the same approaches have been used by many authors (Bag et al. 2022; Yoo and Rho 2021; Brier et al. 1950). So, to evaluate the prediction performance, the empirical accuracy was first computed as the percentage of overall correct predictions, where success is predicted by (ŷi > 0.5) (Equation 26).
Let TP, FP, TN and FN denote the True Positives, False Positives, True Negatives and False Negatives, respectively. Then the above measure is defined as follows:

$$Accuracy = 100\left(\frac{TP+TN}{TP+FP+TN+FN}\right) \qquad (26)$$

Next, the precision and the RMSE were calculated. Precision indicates how close the measurements are to each other; in statistical terms, precision is an absence of bias (Equation 27).

$$Precision = \frac{TP}{TP+FP}. \qquad (27)$$

The root mean square error (RMSE) (Equation 28) evaluated here is the prediction error of the model, that is to say the amount of error that the model makes in the prediction. Here, yj and ŷj denote respectively the observations of the actual response variable and the estimated response variable.

$$RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^{m}(y_j-\hat{y}_j)^2}. \qquad (28)$$

Each of these measures has its advantages in judging the prediction accuracy of a method. These measures lie between 0 and 1, and higher values are preferred, except for the RMSE. Evidently, the above measures give an idea of the classification abilities of the method. However, they cannot provide an adequate idea of the predictive accuracy in a regression model setup where not only the predicted category, but also the predictive distribution, is crucial (Czado et al. 2009).
The confusion matrix is used to calculate the different indices, including the Precision and the Accuracy of prediction.
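The sketch below illustrates these three prediction metrics from a confusion matrix; since the text does not specify whether ŷ in equation (28) is the fitted probability or the 0/1 classification, the sketch uses the fitted probability for the RMSE.

```python
import numpy as np

def prediction_metrics(y_true, y_prob):
    """Accuracy (eq. 26), Precision (eq. 27) and RMSE (eq. 28) for binary predictions."""
    y_pred = (y_prob > 0.5).astype(int)          # success predicted when fitted prob > 0.5
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    rmse = np.sqrt(np.mean((y_true - y_prob) ** 2))
    return accuracy, precision, rmse

# Illustrative call with placeholder observations and fitted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1])
print(prediction_metrics(y_true, y_prob))
```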
3.3 Application on real data

In this section, a medium-dimensional dataset was considered which contained socio-demographic and economic characteristics of rice farmers (age, gender, education, years of rice farming experience, rice training, household size, source of income, membership in a farmers' association), rice production system, cultivation practices (area planted, number of rice plots, type of rice varieties grown, bird control, frequency of fertilizer application, type of labor, number of weeding operations, yield, number of oxen plows, straw management), and production constraints. This dataset was derived from the study of Loko et al. (2022). The dataset contained relevant information on 28 different variables, including a binomial response variable and 27 independent variables, for a total of 417 participants or rice producers. Then, the performance of the four variable selection methods discussed and explored in the previous simulation studies was analyzed and compared on this dataset. To begin, using the full dataset, the number of variables selected by each of the four methods and the accuracy of the data fit for these methods were investigated.
4 Results

4.1 Performance of the four methods as a function of the sample size

4.1.1 Comparing the performance of the four methods in selecting variables

The variable selection performance of each method was measured by calculating the number of variables selected by each method for different sample sizes. The simulation results are presented in Table 1. This table shows that there was a high variability in the number of variables selected when the sample size varied, and that the number of variables selected depended on the method used. In particular, when the number of covariates was equal to 10 with 4 correlated variables, the AIC, LASSO and Stepwise methods selected a higher number of variables than the PLS method. Each time the sample size was increased, these methods (AIC, LASSO and Stepwise) kept the same selection capabilities, except for the Stepwise method, which experienced a drop in selection when the sample size increased to 500. Therefore, the AIC, LASSO and Stepwise methods tend to select more variables than the PLS method. Moreover, the sample size did not have a strong influence on the selection (Table 1). When the number of covariates was increased from 10 to 20 with 7 correlated covariates, the AIC and Stepwise methods still had more ability to select more variables than the PLS method, but in this case the selection capacity of the LASSO method was reduced. One important remark was observed when the number of covariates was increased to 27: in this scenario, as the sample size was increased, the LASSO and PLS methods selected more variables than the others. From these observations, the variation of the sample size has little effect on the ability of the methods to select variables in the presence of medium or strong correlation between variables.
Moreover, Table 1 also shows values of some performance metrics that corroborate the previous results. Indeed, the AIC and Stepwise methods have a high level of error (MSE = 6.43 and MSE = 5.65, respectively) whatever the sample size (80, 500, 1000). From these results, one can say that the ability of these variable selection methods to retain a large number of variables was not without error; these methods were also able to select unnecessary variables. In all cases of size variation, LASSO had less error. The error decreased for the AIC and Stepwise methods when the sample size was increased in the low correlation case. This error increased for the AIC when the sample size was increased in the medium correlation case (p = 20). It was noticed that the PLS method presented less error when the number of covariates was high. In terms of model fit obtained after selection, the R² of McFadden indicates that the AIC and Stepwise methods nevertheless presented good fits in general for all levels of correlation compared to the others. But one notices that each time the sample size increased, the model fitted less well. This observation was almost the same when moving from one level of correlation to another. Thus, the AIC and Stepwise methods fitted much less well when the sample size went from a low level to a high level.
4.1.2 Comparing the performance of the four methods in predicting

In terms of prediction, the methods presented good performance in general. Indeed, for a configuration of p = 10, q = 4, the variation of the sample size did not, in general, influence the performance of the AIC and LASSO methods, while an increase in sample size improved the accuracy in the case of the AIC and PLS methods. When the number of covariates was increased, the PLS and Stepwise methods improved their performance from 0.73 to 0.78 and from 0.68 to 0.88, respectively, as the sample size was increased (Table 2). Overall, when the sample size was small and the level of correlation was low, the AIC method performed better than the others. For a medium covariate configuration and a large sample size, the Stepwise method performed better. In terms of prediction error, the LASSO and PLS methods presented less error in the prediction compared to the others. It should be noted that each time the sample size increased for a given configuration, the prediction error of the PLS method decreased. Thus, the LASSO and PLS methods predicted with less error; with each configuration in which the sample size was increased, the prediction error (RMSE) of the PLS method decreased.
Table 1: Comparison of the four methods as a function of the sample size.

Configuration   Performance metric    n      Lasso   Stepwise   AIC    PLS
p=10, q=4       MSE                   80     0.28    5.65       6.43   1.43
                                      500    0.3     2.59       2.57   1.41
                                      1000   0.31    2.27       2.28   1.41
                R2 McFadden           80     0.08    0.28       0.21   0.05
                                      500    0.05    0.07       0.05   0.1
                                      1000   0.041   0.05       0.05   0.07
                Variables selected    80     3       3          3      2
                                      500    3       2          3      2
                                      1000   3       3          3      2
p=20, q=7       MSE                   80     0.3     9.8        2.04   1.55
                                      500    0.27    7.58       4.01   1.48
                                      1000   0.27    3.8        8.82   1.27
                R2 McFadden           80     0.11    0.59       0.65   0.62
                                      500    0.07    0.17       0.08   0.09
                                      1000   0.05    0.42       0.07   0.06
                Variables selected    80     5       7          8      5
                                      500    5       5          7      6
                                      1000   6       7          7      6
p=27, q=14      MSE                   80     0.32    5.43       -      1.83
                                      500    0.29    5.8        -      1.48
                                      1000   0.26    2.82       -      1.4
                R2 McFadden           80     0.14    0.7        -      0.67
                                      500    0.08    0.22       -      0.1
                                      1000   0.06    0.52       -      0.07
                Variables selected    80     6       11         -      6
                                      500    6       6          -      8
                                      1000   8       9          -      11
Table 2: Comparison of the four methods according to the sample size: model evaluation.

Configuration   Performance metric   n      Lasso   Stepwise   AIC    PLS
p=10, q=4       RMSE                 80     0.4     1.1        1.06   0.41
                                     500    0.39    1.08       1.07   0.4
                                     1000   0.4     1.02       1.07   0.39
                Accuracy             80     0.78    0.73       0.81   0.79
                                     500    0.79    0.77       0.78   0.8
                                     1000   0.78    0.77       0.77   0.8
p=20, q=7       RMSE                 80     0.41    1.15       1.03   0.42
                                     500    0.36    1.07       1.06   0.36
                                     1000   0.35    1.04       1.05   0.36
                Accuracy             80     0.75    0.68       0.85   0.73
                                     500    0.8     0.8        0.79   0.78
                                     1000   0.79    0.88       0.83   0.78
p=27, q=14      RMSE                 80     0.43    1.17       -      1.83
                                     500    0.37    1.08       -      0.38
                                     1000   0.34    1.03       -      0.37
                Accuracy             80     0.73    0.63       -      0.67
                                     500    0.8     0.79       -      0.73
                                     1000   0.83    0.89       -      0.74
4.2 Performance of the four methods as a function of the sample size, the number of covariates and the number of correlated covariates

4.2.1 Comparing the performance of the four methods in selecting variables

Table 3 presents the comparison of the variable selection performance of the four methods over different sample sizes and total numbers of covariates. From this table, all four variable selection methods performed relatively well in terms of variable reduction. Indeed, it was observed that when the number of covariates was increased, followed by an increase in the number of correlated covariates, the performance in terms of selection increased in general. However, varying the sample size for a given number p of covariates generally had no effect on the selection performance of the methods.
It is interesting to note that the performance of the methods in terms of the number of variables selected was probably not different in configurations where the number of covariates was the same from one configuration to another. It was also observed that, when the number of variables was increased, the overall selection performance of the LASSO and PLS methods remained almost the same, whereas a non-negligible increase in the performance of AIC and Stepwise was observed. A possible explanation is that, in the correlated case, if two variables are highly correlated and only one of them is important, then some methods fail to make the right choice. Furthermore, AIC and Stepwise tended to select more variables than the others as the number of covariates increased (Table 3). One would think that this would give these methods a chance to select the majority of the most important variables for greater accuracy. From Table 3, the AIC and Stepwise methods make the selection with the highest error, with the highest MSE. For a given small sample size, when the number of covariates was increased, the PLS and LASSO methods tended to behave in the same way, i.e. showed a slight increase in the error level, while the AIC and Stepwise methods presented the largest errors each time the number of covariates was increased. Concerning the R² of McFadden in Table 3, the AIC and PLS methods were the ones with a high R² compared to the others. This value increased for each method each time the number of covariates was increased, for each sample size configuration.
4.2.2 Comparing the performance of the four methods in predicting

Table 4 presents the comparison of the predictive performance of the four methods over different sample sizes, total numbers of covariates and numbers of correlated covariates, based on RMSE and Accuracy. From this table, the particular tendency of some methods to select more variables did not necessarily affect the accuracy of the prediction. Indeed, all methods tended to perform well in terms of accuracy. Nevertheless, it was found that the penalty-based methods (LASSO, PLS, AIC) have a higher accuracy than the method based on the p-value criterion. The AIC method, for a configuration with a small sample size and few covariates, presented a good accuracy, but when the sample size was increased, LASSO and PLS were the ones with the best accuracy. It was also noted that the PLS and LASSO methods presented less error in terms of prediction (see the RMSE values in Table 4). Based on the RMSE metric, it was observed that in most of the iterations, LASSO and PLS produced low values of the prediction errors.
To examine the predictive performance in more detail, it can be seen from Table 4 that when the number of correlated variables is about 40% of the covariates, the PLS method has a better predictive performance compared to the other methods. The LASSO method performed better each time the number of covariates was increased. The AIC method showed the best predictive performance at the beginning of each configuration; when the sample size was increased, the predictive performance of the AIC method decreased.
From these results, it was noted that the PLS method performed better in the case of a correlated structure in a high-dimensional configuration than in the case of low correlation of variables, both to select the important variables and to record good predictive accuracy. LASSO tended to keep its good performance in some configurations. The Stepwise method became more efficient in the case of low correlation of variables. Stepwise selection and AIC selection have quite similar performance (the former being much faster in terms of estimation) when the number of covariates increases for a high sample size. As the level of correlation among the predictors increases, their performance in terms of accuracy generally appears to outperform that of the others.
4.3 An application on real data
55 In this subsection, a medium-sized agronomy dataset on the socio-demographic profile of the surveyed rice
56 farmers and the characteristics of their farms was used. The dataset contained 27 covariates for 417 individ-
57 uals. A generalized linear regression was used because the response variable was considered to be whether
58 or not an individual was practicing mechanized farming. The performances of the four variables selection
59 methods were analyzed and compared, and explored on the simulated results. First, using the full data set,
60 the number of variables selected, the bias, the accuarcy and the accuracy of the data fitted for the meth-
61 ods were studied. In order to have an idea on the covariates to be eliminated i.e. the one which presented
62
63
64
65 12
Table 3: Comparison of the four methods according to the variation of covariates.

Sample size   Performance metric    Configuration   Lasso   Stepwise   AIC    PLS
80            MSE                   p=10, q=4       0.28    5.65       6.43   1.43
                                    p=20, q=7       0.3     9.8        2.04   1.55
                                    p=27, q=14      0.32    5.43       -      1.83
              R2                    p=10, q=4       0.08    0.28       0.21   0.05
                                    p=20, q=7       0.11    0.59       0.65   0.62
                                    p=27, q=14      0.14    0.7        -      0.67
              Variables selected    p=10, q=4       3       3          3      2
                                    p=20, q=7       5       7          8      5
                                    p=27, q=14      6       11         -      6
500           MSE                   p=10, q=4       0.3     2.59       2.57   1.41
                                    p=20, q=7       0.27    7.58       4.01   1.48
                                    p=27, q=14      0.29    5.8        -      1.48
              R2                    p=10, q=4       0.05    0.07       0.05   0.1
                                    p=20, q=7       0.07    0.17       0.08   0.09
                                    p=27, q=14      0.08    0.22       -      0.1
              Variables selected    p=10, q=4       3       3          3      2
                                    p=20, q=7       5       5          7      6
                                    p=27, q=14      6       6          -      8
1000          MSE                   p=10, q=4       0.31    2.27       2.28   1.41
                                    p=20, q=7       0.27    3.8        8.82   1.27
                                    p=27, q=14      0.26    2.82       -      1.4
              R2                    p=10, q=4       0.041   0.05       0.05   0.07
                                    p=20, q=7       0.05    0.42       0.07   0.06
                                    p=27, q=14      0.06    0.52       -      0.07
              Variables selected    p=10, q=4       3       3          3      2
                                    p=20, q=7       6       7          7      6
                                    p=27, q=14      8       9          -      11
37 a strong correlation, a test of VIF (Variance Inflation Factor) on the data set was carried out. The VIF of
38 all covariates were commented and judged according to the collinearity threshold defined as VIF = 1 (Not
39 correlated ), 1 < VIF ≤ 5 (Moderately correlated), and VIF > 5 (Highly correlated).
The selection results obtained with the four methods (Table 5) served as the basis for examining our real scenario, and the VIF tests helped to judge the relevance of each variable and how aggressively each method removed covariates. In this example, the p-value (Stepwise) technique selected more variables than the other three. According to the performance measures, it also showed the largest selection error (MSE = 29.72) and lower accuracy and precision for prediction (0.84 and 0.67, respectively). The LASSO technique outperformed the others in terms of prediction (Accuracy = 0.90, Precision = 0.91), with PLS second (Accuracy = 0.87, Precision = 0.86) and AIC third. The number of variables selected nevertheless varied widely across the techniques: Stepwise retained up to 19 variables and produced a very complex model, whereas LASSO and AIC retained slightly fewer (13 variables). PLS, on the other hand, retained only four variables, and its RMSE was lower than those of the others (RMSE = 0.35). Because they retained 13 variables and achieved high accuracy, the AIC and LASSO techniques were probably the most suitable strategies in terms of parsimony and precision. As the VIF test shows, these approaches kept a small number of highly correlated variables, the significant moderately correlated variables, and almost all uncorrelated variables. In contrast, the PLS technique discarded all highly correlated variables and most of the moderately correlated ones. Even though PLS retained few covariates, this selection behaviour appears the most consistent with the original aim of variable selection: reducing the number of variables while preserving the ability to predict.
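For illustration only, the snippet below sketches how the prediction metrics reported here (accuracy, precision and RMSE) could be computed in Python for any fitted binary classifier; model, X_test and y_test are hypothetical placeholders, and the exact definitions used in this manuscript (for instance whether RMSE is computed on predicted probabilities) may differ.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, mean_squared_error

def prediction_metrics(model, X_test, y_test):
    proba = model.predict_proba(X_test)[:, 1]   # predicted probability of the event
    pred = (proba >= 0.5).astype(int)           # class prediction at the usual 0.5 cut-off
    return {
        "Accuracy": accuracy_score(y_test, pred),
        "Precision": precision_score(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, proba)),  # squared error on probabilities, square-rooted
    }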
Table 4: Comparison of the four methods according to the variation of covariates: model evaluation

Sample size   Performance metric   Configuration    Lasso    Stepwise    AIC     PLS
80            RMSE                 p=10, q=4        0.4      1.1         1.06    0.41
                                   p=20, q=7        0.41     1.15        1.03    0.42
                                   p=27, q=14       0.43     1.17        –       1.83
              Accuracy             p=10, q=4        0.78     0.73        0.81    0.79
                                   p=20, q=7        0.75     0.68        0.85    0.73
                                   p=27, q=14       0.73     0.63        –       0.67
500           RMSE                 p=10, q=4        0.39     1.08        1.07    0.4
                                   p=20, q=7        0.36     1.07        1.06    0.36
                                   p=27, q=14       0.37     1.08        –       0.38
              Accuracy             p=10, q=4        0.79     0.77        0.78    0.8
                                   p=20, q=7        0.8      0.8         0.79    0.78
                                   p=27, q=14       0.8      0.79        –       0.73
1000          RMSE                 p=10, q=4        0.4      1.02        1.07    0.39
                                   p=20, q=7        0.35     1.04        1.05    0.36
                                   p=27, q=14       0.34     1.03        –       0.37
              Accuracy             p=10, q=4        0.78     0.77        0.77    0.8
                                   p=20, q=7        0.79     0.88        0.83    0.78
                                   p=27, q=14       0.83     0.89        –       0.74
– : value not reported for this configuration.
From the above results, it appears that in agronomic, social, medical and other studies it is important to identify all the essential variables that are potentially correlated with the response variable. In an application such as this one, LASSO may therefore be the most suitable method, as may PLS.
Table 5: Variables retained and not retained by each method.

Variables      VIF       Lasso   AIC   Stepwise   PLS
AnimalT        7.4287    ✖       ✖     ✖          ✖
TRF            6.815     ✔       ✔     ✖          ✖
MachineryO     6.1945    ✔       ✔     ✔          ✖
GE             5.6780    ✔       ✔     ✔          ✖
LivestockO     5.0565    ✖       ✖     ✔          ✖
Irrigation     3.1223    ✖       ✖     ✖          ✖
LRS            2.9798    ✖       ✖     ✔          ✖
OSR            2.9719    ✔       ✔     ✔          ✖
MA             2.6197    ✔       ✖     ✖          ✖
HFL            2.3756    ✖       ✖     ✔          ✖
FOPR           2.3508    ✖       ✖     ✖          ✖
LandO          2.1435    ✖       ✖     ✔          ✖
FamilyWF       2.1018    ✖       ✖     ✔          ✖
Experience     1.9540    ✔       ✔     ✔          ✔
Age            1.9432    ✔       ✔     ✔          ✔
UFertilizer    1.8285    ✔       ✔     ✔          ✖
TFarmsize      1.7821    ✖       ✖     ✔          ✖
RMCrop         1.7255    ✖       ✖     ✔          ✖
NGOS           1.7177    ✖       ✔     ✔          ✖
RA             1.6906    ✔       ✔     ✔          ✖
TWF            1.6067    ✖       ✖     ✔          ✔
Education      1.5047    ✖       ✖     ✖          ✖
HouseSize      1.4698    ✔       ✔     ✖          ✔
OFI            1.4118    ✔       ✔     ✔          ✖
UPesticides    1.3783    ✖       ✖     ✖          ✖
Region         1.0000    ✔       ✔     ✔          ✔
InterInst      1.0000    ✔       ✔     ✔          ✖
Total          –         14      13    19         5
✔ = variable retained by the method; ✖ = variable not retained.
5 Discussion
The current study examined how sample size variation under different configurations affects the performance of the variable selection methods considered, and then evaluated how the correlation between covariates influences their performance in logistic regression. Since much research has focused on choosing appropriate methods for selecting, from a large pool of available covariates, those that best explain the response variable, this study carried out a simulation comparing four of the most commonly used variable selection methods in order to assess their performance from several angles.
5.1 As a function of variation in sample size
Regarding performance as a function of sample size, the results showed some differences between methods. For the four methods considered, the number of covariates selected varies with increasing sample size for some methods and not for others, which indicates that varying the sample size for a given number of covariates can affect the selection capacity of the methods. From the results obtained, varying the sample size had little effect on the performance of a method when the number of covariates was very low. Moreover, moving from a low to a high level of correlation changed the selection capacity over the sample sizes studied: the PLS method retained more variables, and the same was observed when the number of covariates was increased. A similar remark was made by Gauchi and Chagnon (2001), who pointed out that the PLS method can handle exactly the case where the number of variables exceeds the number of observations and can cope with extreme multicollinearity. Using PLS does not entail a great loss of information either, because the method showed good predictive performance. Even so, PLS is nowadays less used than the other methods for variable selection.
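As a rough, hedged illustration of how PLS copes when the number of covariates approaches the sample size, the sketch below applies scikit-learn's PLSRegression to the 0/1 response; it is only a stand-in for the PLS generalised linear regression of Bastien et al. (2005) considered in this work, and the two components and the 0.5 threshold are arbitrary assumptions.

from sklearn.cross_decomposition import PLSRegression

def pls_binary_predict(X_train, y_train, X_test, n_components=2):
    # Project the covariates onto a few latent components, then regress the 0/1 outcome on them.
    pls = PLSRegression(n_components=n_components)
    pls.fit(X_train, y_train)
    scores = pls.predict(X_test).ravel()   # continuous fitted values
    return (scores >= 0.5).astype(int)     # crude classification at the 0.5 cut-off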
Concerning the LASSO method, its behaviour showed no strong dependence on the configuration. The results indicate that, whatever the configuration, its performance in selecting the truly relevant covariates remained rather low. The same observation was made by Freijeiro-González et al. (2022), who found that the LASSO procedure performs poorly in terms of recovering relevant covariates and avoiding noisy ones. Its use nevertheless does not entail a great loss of information, because the method showed good performance in terms of prediction.
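By way of illustration (a hedged sketch, not the implementation used in this study), an L1-penalised logistic regression with the penalty chosen by cross-validation can be written as follows; the feature_names argument, the ten folds and the grid of twenty penalty values are assumptions.

from sklearn.linear_model import LogisticRegressionCV

def lasso_selected_variables(X, y, feature_names, n_folds=10):
    # LASSO-type logistic regression: the L1 penalty shrinks some coefficients exactly to zero.
    # Covariates are assumed to have been standardized beforehand.
    lasso = LogisticRegressionCV(Cs=20, cv=n_folds, penalty="l1",
                                 solver="liblinear", scoring="neg_log_loss")
    lasso.fit(X, y)
    coefs = lasso.coef_.ravel()
    # Covariates with a non-zero coefficient are the ones the method retains.
    return [name for name, b in zip(feature_names, coefs) if b != 0]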
The p-value-based filtering method has for years been practitioners' first choice for variable selection. The results of our study showed that the Stepwise and AIC methods tend to select a large number of covariates, whether or not the sample size varies; this is also one of the main reasons why Sanchez-Pinto et al. (2018) found them to be less parsimonious. Our results highlight that the Stepwise method tends to select unimportant variables, which caused the largest prediction errors observed; nevertheless, its ability to retain the important variables, even while adding others, remains of interest. Bursac et al. (2008) reached a similar conclusion, noting that, in addition to the significant covariates, this selection procedure can retain important confounding variables, which may result in a somewhat richer model.
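A minimal sketch of AIC-driven forward selection for a logistic model is given below, assuming a pandas data frame X and a binary vector y; the stepwise implementation actually used in this study (and its p-value-based variant) may differ in its entry and removal rules.

import numpy as np
import statsmodels.api as sm

def forward_aic(X, y):
    # Greedy forward selection: at each step add the covariate that lowers the AIC most,
    # and stop as soon as no addition improves the AIC.
    remaining, selected = list(X.columns), []
    best_aic = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).aic  # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        aics = {var: sm.Logit(y, sm.add_constant(X[selected + [var]])).fit(disp=0).aic
                for var in remaining}
        best_var = min(aics, key=aics.get)
        if aics[best_var] < best_aic:
            best_aic = aics[best_var]
            selected.append(best_var)
            remaining.remove(best_var)
            improved = True
    return selected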
Studies by Bag et al. (2022) proposed selection methods for moderate dimensions, where the number of predictors is comparable to the number of observations. They recommended the AIC method, with LASSO in first place. This is not exactly what the current study found: here, the appropriate method would rather be PLS, because it performed well when the covariates were strongly correlated and their number was close to the sample size. Their suggestion was probably made because those authors did not evaluate the PLS method in their study. Comparing PLS to the LASSO, Stepwise and AIC methods, Chong and Jun (2005) showed that PLS performed very well in all configurations in identifying relevant predictors and outperformed LASSO and AIC; the same remark can be made from our study. It was also found that a model with good fit does not necessarily guarantee good variable selection performance. It would therefore be better to consider several performance metrics when selecting relevant variables rather than focusing only on metrics such as RMSE or R-squared. The results of Wang et al. (2015) and Xiong et al. (2022) were consistent with those of our study when the random effect was considered.
5.2 As a function of variation in the number of covariates and correlation between covariates
The variation in the number of covariates across configurations reveals variability within methods in the selection of variables, which suggests that the good performance of a method depends on the structure of the data (Bag et al. 2022). Indeed, the PLS method performed better when the data set contained many correlated covariates and when the number of covariates was close to the sample size; when the sample size moved away from the number of covariates, its performance decreased. This point was made by Mehmood et al. (2012), who noted that the PLS method can handle exactly the case where the number of variables exceeds the number of observations and can manage extreme multicollinearity. Regarding a method suited to most of the constraints, our results point to LASSO as a general method adapted to the majority of cases. Recall that this method was proposed by Tibshirani (1996), who found it better than the methods existing at the time. The same remark was made by Bag et al. (2022), whose study confirmed that this method has become immensely popular because of its good properties as a variable selection mechanism; our results do not suggest otherwise. Nevertheless, it should be emphasized that this method takes a back seat when the covariates depart markedly from independence (Freijeiro-González et al. 2022). Further, Bag et al. (2022) showed that the LASSO is highly dependent on its regularization parameter, which is not fixed and is calculated by cross-validation; it may therefore sometimes fail to produce accurate and consistent results. Finally, the current study found that when the level of correlation between covariates increases, LASSO generally appeared to outperform the other methods in terms of accuracy. Ranstam and Cook (2018) likewise showed that LASSO regression outperforms standard methods in some contexts, while pointing out its ability to avoid potential biases related to the estimation of certain elements.
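To make the kind of set-up discussed here concrete, the sketch below generates p covariates with a common pairwise correlation rho, of which only the first q carry non-zero effects in a logistic model; the values of n, p, q, rho and the coefficients are illustrative assumptions, not the manuscript's actual simulation design.

import numpy as np

def simulate_logistic(n=500, p=20, q=7, rho=0.5, beta_value=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Compound-symmetry correlation: every pair of covariates has correlation rho.
    sigma = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    beta = np.concatenate([np.full(q, beta_value), np.zeros(p - q)])  # only q active covariates
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))                          # logistic link, zero intercept
    y = rng.binomial(1, prob)
    return X, y

Feeding such simulated data to the four selection procedures is what allows metrics like those in Tables 3 and 4 to be compared across configurations.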
6 Conclusion
In this study, four types of variable selection techniques were studied for a logistic regression model, and their efficacy was assessed in nine simulation settings covering low, medium and high dimensions of correlated covariates and varying sample sizes. The accuracy of the approaches varies with the setting. For instance, the penalised LASSO, the AIC criterion and the p-value (Stepwise) method are acceptable with low-dimensional data, when the number of observations is much larger than the number of predictors. The study found that PLS approaches are the first choice in a moderately dimensional scenario where the number of predictors is close to the number of observations, and LASSO may then be used alongside them. LASSO and AIC can also be used in these situations, as both are fast and precise for out-of-sample prediction. The main drawback of the AIC-based technique is that it takes too long to run when there are many predictors. Finally, the results obtained on the real data confirmed those obtained by simulation.
References
Zellner, D., Keller, F., Zellner, G.E.: Variable selection in logistic regression models. Communications in Statistics - Simulation and Computation 33(3), 787–805 (2004)
Freijeiro-González, L., Febrero-Bande, M., González-Manteiga, W.: A critical review of LASSO and its derivatives for variable selection under dependence among covariates. International Statistical Review (2021)
Khan, F., Urooj, A., Khan, S.A., Khosa, S.K., Muhammadullah, S., Almaspoor, Z.: Evaluating the performance of feature selection methods using huge big data: a Monte Carlo simulation approach. Mathematical Problems in Engineering 2022 (2022)
Zhang, Z., Trevino, V., Hoseini, S.S., Belciug, S., Boopathi, A.M., Zhang, P., Gorunescu, F., Subha, V., Dai, S.: Variable selection in logistic regression model with genetic algorithm. Annals of Translational Medicine 6(3) (2018)
Wang, Z.X., He, Q.P., Wang, J.: Comparison of variable selection methods for PLS-based soft sensor modeling. Journal of Process Control 26, 56–72 (2015)
Xiong, Y., Yang, W., Liao, H., Gong, Z., Zhenzhen, X., Du, Y., Li, W.: Soft variable selection combining partial least squares and attention mechanism for multivariable calibration. Chemometrics and Intelligent Laboratory Systems, 104532 (2022)
Cavalaro, L.L., Pereira, G.H.: A procedure for variable selection in double generalized linear models. Journal of Statistical Computation and Simulation, 1–18 (2022)
Stewart, P.S., Stephens, P.A., Hill, R.A., Whittingham, M.J., Dawson, W.: Model selection in occupancy models: inference versus prediction. bioRxiv (2022)
El-Sheikh, A.A., Abonazel, M.R., Ali, M.C.: Proposed two variable selection methods for big data: simulation and application to air quality data in Italy. Commun. Math. Biol. Neurosci. 2022 (2022)
Bag, S., Gupta, K., Deb, S.: A review and recommendations on variable selection methods in regression models for binary data. arXiv preprint arXiv:2201.06063 (2022)
Hashemi, A., Dowlatshahi, M.B., Nezamabadi-pour, H.: Ensemble of feature selection algorithms: a multi-criteria decision-making approach. International Journal of Machine Learning and Cybernetics 13(1), 49–69 (2022)
Bastien, P., Vinzi, V.E., Tenenhaus, M.: PLS generalised linear regression. Computational Statistics & Data Analysis 48(1), 17–46 (2005)
Gauchi, J.-P., Chagnon, P.: Comparaison de méthodes de sélection de variables explicatives en régression PLS: application aux données de procédés de fabrication industrielle. PhD thesis, auto-saisine (1999)
Su, W., Bogdan, M., Candes, E.: False discoveries occur early on the Lasso path. The Annals of Statistics, 2133–2150 (2017)
Ranstam, J., Cook, J.: LASSO regression. Journal of British Surgery 105(10), 1348–1348 (2018)
Muthukrishnan, R., Rohini, R.: LASSO: a feature selection technique in predictive modeling for machine learning. In: 2016 IEEE International Conference on Advances in Computer Applications (ICACA), pp. 18–20. IEEE (2016)
Kumar, S., Attri, S., Singh, K.: Comparison of LASSO and stepwise regression technique for wheat yield prediction. Journal of Agrometeorology 21(2), 188–192 (2019)
Burr, T., Fry, H., McVey, B., Sander, E., Cavanaugh, J., Neath, A.: Performance of variable selection methods in regression using variations of the Bayesian information criterion. Communications in Statistics - Simulation and Computation 37(3), 507–520 (2008)
Kuha, J.: AIC and BIC: comparisons of assumptions and performance. Sociological Methods & Research 33(2), 188–229 (2004)
Bursac, Z., Gauss, C.H., Williams, D.K., Hosmer, D.W.: Purposeful selection of variables in logistic regression. Source Code for Biology and Medicine 3(1), 1–8 (2008)
Sperandei, S.: Understanding logistic regression analysis. Biochemia Medica 24(1), 12–18 (2014)
Al-Ghamdi, A.S.: Using logistic regression to estimate the influence of accident factors on accident severity. Accident Analysis & Prevention 34(6), 729–741 (2002)
Hosmer, D.W., Lemeshow, S., Cook, E.: Applied Logistic Regression, 2nd edn. John Wiley and Sons, New York (2000)
Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288 (1996)
Wold, H.: Estimation of principal components and related models by iterative least squares. Multivariate Analysis, 391–420 (1966)
Geladi, P., Kowalski, B.R.: Partial least-squares regression: a tutorial. Analytica Chimica Acta 185, 1–17 (1986)
Eriksson, L., Johansson, E., Kettaneh-Wold, N., Wold, S.: Multi- and Megavariate Data Analysis. Umetrics Academy, Umeå, 43 (2001)
Mehmood, T., Liland, K.H., Snipen, L., Sæbø, S.: A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems 118, 62–69 (2012)
Chong, I.-G., Jun, C.-H.: Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems 78(1-2), 103–112 (2005)
Gosselin, R., Rodrigue, D., Duchesne, C.: A bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications. Chemometrics and Intelligent Laboratory Systems 100(1), 12–21 (2010)
Chowdhury, M.Z.I., Turin, T.C.: Variable selection strategies and its importance in clinical prediction modelling. Family Medicine and Community Health 8(1) (2020)
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csáki, F. (eds.) Proceedings of the 2nd International Symposium on Information Theory. Akadémiai Kiadó, Budapest (1973)
Bozdogan, H.: Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3), 345–370 (1987)
Burnham, K.P., Anderson, D.R.: Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods & Research 33(2), 261–304 (2004)
Aho, K., Derryberry, D., Peterson, T.: Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95(3), 631–636 (2014)
Snipes, M., Taylor, D.C.: Model selection and Akaike information criteria: an example from wine ratings and prices. Wine Economics and Policy 3(1), 3–9 (2014)
Wagenmakers, E.-J., Farrell, S.: AIC model selection using Akaike weights. Psychonomic Bulletin & Review 11(1), 192–196 (2004)
Hurvich, C.M., Tsai, C.-L.: Bias of the corrected AIC criterion for underfitted regression and time series models. Biometrika 78(3), 499–509 (1991)
Akaike, H.: On the likelihood of a time series model. Journal of the Royal Statistical Society: Series D (The Statistician) 27(3-4), 217–235 (1978)
Hwang, J.-S., Hu, T.-H.: A stepwise regression algorithm for high-dimensional variable selection. Journal of Statistical Computation and Simulation 85(9), 1793–1806 (2015)
Loughin, T.M.: A systematic comparison of methods for combining p-values from independent tests. Computational Statistics & Data Analysis 47(3), 467–485 (2004)
Loko, Y.L.E., Gbemavo, C.D., Djedatin, G., Ewedje, E.-E., Orobiyi, A., Toffa, J., Tchakpa, C., Sedah, P., Sabot, F.: Characterization of rice farming systems, production constraints and determinants of adoption of improved varieties by smallholder farmers of the Republic of Benin. Scientific Reports 12(1), 1–19 (2022)
Yoo, J.E., Rho, M.: Large-scale survey data analysis with penalized regression: a Monte Carlo simulation on missing categorical predictors. Multivariate Behavioral Research, 1–29 (2021)
Hazimeh, H., Mazumder, R.: Fast best subset selection: coordinate descent and local combinatorial optimization algorithms. Operations Research 68(5), 1517–1537 (2020)
Tamura, R., Kobayashi, K., Takano, Y., Miyashiro, R., Nakata, K., Matsui, T.: Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor. Journal of Global Optimization 73(2), 431–446 (2019)
Chai, T., Draxler, R.R.: Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development 7(3), 1247–1250 (2014)
Yoo, J.E., Rho, M.: Large-scale survey data analysis with penalized regression: a Monte Carlo simulation on missing categorical predictors. Multivariate Behavioral Research 57(4), 642–657 (2022)
Brier, G.W., et al.: Verification of forecasts expressed in terms of probability. Monthly Weather Review 78(1), 1–3 (1950)
Czado, C., Gneiting, T., Held, L.: Predictive model assessment for count data. Biometrics 65(4), 1254–1261 (2009)
Gauchi, J.-P., Chagnon, P.: Comparison of selection methods of explanatory variables in PLS regression with application to manufacturing process data. Chemometrics and Intelligent Laboratory Systems 58(2), 171–193 (2001)
Freijeiro-González, L., Febrero-Bande, M., González-Manteiga, W.: A critical review of LASSO and its derivatives for variable selection under dependence among covariates. International Statistical Review 90(1), 118–145 (2022)
Sanchez-Pinto, L.N., Venable, L.R., Fahrenbach, J., Churpek, M.M.: Comparison of variable selection methods for clinical predictive modeling. International Journal of Medical Informatics 116, 10–17 (2018)