

Review
Mitigating the Multicollinearity Problem and Its Machine
Learning Approach: A Review
Jireh Yi-Le Chan 1, *,† , Steven Mun Hong Leow 1,† , Khean Thye Bea 1 , Wai Khuen Cheng 2 ,
Seuk Wai Phoong 3 , Zeng-Wei Hong 4 and Yen-Lin Chen 5, *

1 Faculty of Business and Finance, Universiti Tunku Abdul Rahman, Kampar 31900, Malaysia;
steven.utar@1utar.my (S.M.H.L.); beakheanthye@1utar.my (K.T.B.)
2 Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman,
Kampar 31900, Malaysia; chengwk@utar.edu.my
3 Department of Management, Faculty of Business and Economics, Universiti Malaya,
Kuala Lumpur 50603, Malaysia; phoongsw@um.edu.my
4 Department of Information Engineering and Computer Science, Feng Chia University,
Taichung 407102, Taiwan; zwhong@fcu.edu.tw
5 Department of Computer Science and Information Engineering, National Taipei University of Technology,
Taipei 106344, Taiwan
* Correspondence: jirehchan@utar.edu.my (J.Y.-L.C.); ylchen@mail.ntut.edu.tw (Y.-L.C.)
† These authors contributed equally to this work.

Abstract: Technologies have driven big data collection across many fields, such as genomics and business intelligence. This results in a significant increase in variables and data points (observations) collected and stored. Although this presents opportunities to better model the relationship between predictors and the response variables, this also causes serious problems during data analysis, one of which is the multicollinearity problem. The two main approaches used to mitigate multicollinearity are variable selection methods and modified estimator methods. However, variable selection methods may negate efforts to collect more data as new data may eventually be dropped from modeling, while recent studies suggest that optimization approaches via machine learning handle data with multicollinearity better than statistical estimators. Therefore, this study details the chronological developments to mitigate the effects of multicollinearity and up-to-date recommendations to better mitigate multicollinearity.

Keywords: multicollinearity; variable selection methods; optimization approaches; neural network; machine learning

MSC: 62M10

1. Introduction

Multicollinearity is a phenomenon that can occur when running a multiple regression model. In this age of big data, multicollinearity can also be present in the field of artificial intelligence and machine learning. There is a lack of understanding of the different methods for mitigating the effects of multicollinearity among people in domains outside of statistics [1]. This paper will discuss the development of the methods chronologically and compile the latest methods.

Forecasting in finance deals with a high number of variables, such as macroeconomic data, microeconomic data, earnings reports, and technical indicators. Multicollinearity is a common problem in finance as the dependencies between variables can vary over time and change due to economic events. Past literature tried to remove collinear data to reduce the effects of multicollinearity. This is done through stepwise regression that eventually arrives at a model with a low root mean square error (RMSE). The computational difficulty of this has led to many selection criteria being developed to choose models. A breakthrough



method to solve multicollinearity came in the form of ridge regression. Instead of selecting variables, all the variables are used. The method modifies the estimator by adding a penalty term to the ordinary least squares (OLS) estimator. The goal is to reduce variance by introducing bias. Papers published since then have built on these two ideas to handle different functional forms and improve performance. For example, the author of [2] provided a review of Poisson regression. Moreover, recent developments in computing power have introduced mathematical optimization to variable selection.
The aim of this paper is to review and propose methods to solve multicollinearity. The choice of method depends on the purpose of the regression, whether forecasting or analysis. Recent developments in machine learning and optimization have shown
better results than conventional statistical methods. The pros and cons of the methods
will be discussed in later sections. The paper is organized as follows. The multicollinear-
ity phenomenon is explained in Section 2, including its effects and ways to measure it.
Section 3 discusses the methods to reduce the effects of multicollinearity, including variable
selection, modified estimators, and machine learning methods. Section 4 presents the
concluding remarks.

2. What Is Multicollinearity?
Multicollinearity is a condition where there is an approximately linear relationship between two or more independent variables. This is a multiple linear regression model:

y = β_0 + x_1 β_1 + ... + x_p β_p + e,   (1)

where y is the dependent variable, x_1, ..., x_p represent the explanatory variables, β_0 is the constant term, β_1, ..., β_p are the coefficients of the explanatory variables, and e is the error term. The error term is the difference between the observed and the estimated values. It is normally distributed with a mean of 0 and variance σ². In the presence of multicollinearity, x_1 may be linearly dependent on another explanatory variable such as x_2. The resulting model would be unreliable. The effects and problems are discussed in the following section.
For example, consider the use of technical indicators in stock analysis. There will be a multicollinearity issue if the indicators measure the same type of information, such as momentum [3]; in such a case the different indicators are all derived from the same series of closing prices. In the context of the stock market, data are handled differently from time-series data in other fields. This is due to a few key reasons, according to the authors of [4]. The goal of compiling stock market data is to maximize profit, not to reduce prediction error. Stock market data are highly time-variant, which means the output depends on the moment of the input. They are also dependent on indeterminate events, meaning that the event that causes the response is not fixed.

2.1. Effects of Multicollinearity


According to [5], there are four main symptoms of multicollinearity. The first is a large standard error of the coefficients. Next, the sign of a variable's coefficient can differ from what theory predicts, so the explanation of the variable's effect on the output will be wrong or misleading. In addition, there can be a high correlation between a predictor variable and the outcome even though the corresponding parameter is not statistically significant. The last symptom is that some correlation coefficients among predictor variables are large in relation to the explanatory power, or R-squared, of the overall equation.
These are merely symptoms and do not guarantee the presence of multicollinearity. There are two major problems caused by multicollinearity. First, estimates are unstable due to the interdependence of the variables, and the standard errors of the regression coefficients are large. This makes the estimates unreliable and therefore decreases their precision [6]. Second, as two or more variables have linear relationships, the marginal impact of a variable is hard to measure. The model will have poor generalization ability and overfit the data, which means it will perform poorly on data it has never seen.
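To make these problems concrete, the following minimal simulation (our own illustrative sketch in Python, assuming NumPy and statsmodels are available; it is not drawn from the cited studies) fits the same regression twice, once with independent predictors and once with nearly collinear ones, and prints the estimated coefficients and their standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

def fit_and_report(rho):
    # Draw two predictors with correlation rho and a response y = 1 + 2*x1 + 0.5*x2 + noise.
    cov = [[1.0, rho], [rho, 1.0]]
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)
    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(f"rho={rho:.2f}  coef={model.params[1:].round(2)}  se={model.bse[1:].round(2)}")

fit_and_report(rho=0.0)   # independent predictors: small standard errors
fit_and_report(rho=0.98)  # near-collinear predictors: same true coefficients, much larger standard errors
```

With nearly collinear predictors, the standard errors inflate sharply even though the data-generating coefficients are unchanged, which is exactly the instability described above.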

2.2. Ways to Measure Multicollinearity


Previous literature identifies four measurements of multicollinearity. The first detector of multicollinearity is pairwise correlation, using a correlation matrix. According to [7], a bivariate correlation of 0.8 or 0.9 is commonly used as a cut-off to indicate a high correlation between two regressors. However, the problem with this method is that high pairwise correlation does not necessarily mean multicollinearity, as the two are not the same. The most widely used indicator of multicollinearity is the Variance Inflation Factor (VIF) or Tolerance (TOL) [8]. The VIF is defined as

VIF_j = 1 / (1 − R_j²),   (2)

where R_j² is the coefficient of determination for the regression of x_j on the remaining variables. The VIF is the reciprocal of TOL. There is no formal value of VIF to determine the presence of multicollinearity, but a value of 10 and above often indicates multicollinearity [9].
Another method of measuring multicollinearity is using eigenvalues, which come from the principal component approach (PCA). A smaller eigenvalue indicates a larger probability of the presence of multicollinearity. The fourth method is the Condition Index (CI), which is also based on the eigenvalues. The CI is the square root of the ratio between the maximum eigenvalue and each eigenvalue. According to [10], a CI of between 10 and 30 indicates moderate multicollinearity, while above 30 indicates severe multicollinearity.
VIF and CI are two commonly used measures to determine the severity of multicollinearity in a dataset before applying the methods to solve it. It is important to note that the effectiveness of a treatment in reducing multicollinearity is usually determined by comparing the root mean square error or the out-of-sample forecast before and after treatment [11].
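As an illustration of these diagnostics (a minimal sketch we add here; the helper names vif and condition_indices are ours), the VIF of Equation (2) can be computed by regressing each predictor on the remaining ones, and the condition indices from the eigenvalues of the predictor correlation matrix:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(len(Z)), Z])      # add an intercept
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r2 = 1.0 - (y - Z @ beta).var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

def condition_indices(X):
    """CI_k = sqrt(lambda_max / lambda_k), using eigenvalues of the predictor correlation matrix."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(np.asarray(X, dtype=float), rowvar=False))
    return np.sqrt(eigvals.max() / eigvals)

# Example: VIF > 10 or CI > 30 would flag severe multicollinearity.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=100), rng.normal(size=100)])
print("VIF:", vif(X).round(1))
print("CI: ", condition_indices(X).round(1))
```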

3. Reducing the Effects of Multicollinearity


Collecting more data is one of the simplest solutions to reduce the effects of multicollinearity because collinearity is more of a data problem than a model specification problem. However, this is not always feasible, especially when research is undertaken using convenience sampling [1]. There is a cost associated with collecting more data, and the quality of the data collected might be compromised. Methods that mitigate multicollinearity by reducing the variances of the estimated coefficients can be categorized into two groups: variable selection and modified estimators. Both can be applied at the same time. The details of the variable selection and modified estimator methods are explained in the following sub-topics. Next, the machine learning approaches are also presented.

3.1. Variable Selection Methods


Researchers are mainly concerned about multicollinearity problems when forecasting
with a linear regression model. Generally, researchers try to mitigate the effects of multi-
collinearity by using variable selection techniques so that a more reliable estimate of the
parameter can be obtained [12]. These are commonly heuristic algorithms and rely on using
indicators. The method can be completed by combining or eliminating variables. However,
caution must be taken not to compromise the theoretical model to reduce multicollinearity.
One of the earliest methods was stepwise regression, which has two basic procedures, namely forward selection and backward elimination [13]. The forward selection method starts with an empty model and adds variables one at a time, while the backward elimination method starts with the full model containing all available variables and drops them one by one. At each stage, forward selection adds the variable that gives the largest decrease in the residual sum of squares, while backward elimination drops the variable that causes the smallest increase in the residual sum of squares.
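A minimal sketch of the forward-selection loop just described (our own illustration; the residual sum of squares is used as the criterion, and a backward pass would instead start from the full set and drop variables):

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit using the given columns plus an intercept."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ beta) ** 2).sum())

def forward_selection(X, y, max_vars):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        # Add the variable that yields the largest decrease in RSS.
        best = min(remaining, key=lambda c: rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage: order = forward_selection(X, y, max_vars=5)
```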
However, there are some drawbacks to stepwise regression. According to the author
of [14], it does not necessarily yield the best model in terms of the residual sum of squares

because of the order in which the variables are added. This is especially true in the presence of multicollinearity. It is also not clear which of the two stepwise procedures is better. Furthermore, stepwise regression assumes there is one best equation when there can be equations with different variables that are equally good. Another problem of this selection approach is the computational effort required [15]. There are 2^k possible subsets for k independent variables, so the amount of computation needed increases exponentially with the total number of independent variables.
To reduce computation time, the authors of [16] therefore developed a more compre-
hensive method to fit the equation to the data. It uses a fractional factorial design with the
statistical criteria, Cp, to avoid computing all the possible equations. It also works better on
data with multicollinearity as the effectiveness of a variable is evaluated by its presence or
absence in an equation. The Cp criterion was developed by the author of [17]. It provides a
way to graphically compare different equations. The selection criterion Cp is as follows:

C_p = RSS_p / σ̂² − (n − 2p),   (3)

where p is the number of variables in the regression being considered, RSS_p is its residual sum of squares, and σ̂² is an estimate of σ², frequently the residual mean square from the complete regression. The model with a lower Cp is better.
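For illustration, Equation (3) can be evaluated for any candidate subset by comparing its residual sum of squares with the residual mean square of the complete regression; the sketch below is our own and uses statsmodels for the OLS fits:

```python
import statsmodels.api as sm

def mallows_cp(X, y, subset):
    """C_p = RSS_p / sigma_hat^2 - (n - 2p); sigma_hat^2 is the residual mean square of the full model."""
    n = len(y)
    full = sm.OLS(y, sm.add_constant(X)).fit()
    sigma2_hat = full.ssr / full.df_resid            # residual mean square of the complete regression
    sub = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
    p = sub.df_model + 1                             # parameters in the subset model, including the intercept
    return sub.ssr / sigma2_hat - (n - 2 * p)
```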
Later, the authors of [18] proposed a more general selection criterion, Sp, that has been shown to outperform the Cp criterion. Methods that are based on the least squares estimator, such as the Cp criterion, suffer in the presence of outliers and when the error variable deviates from normality. The Sp criterion solves this problem and can be used with any estimator of β without a need for modification. The Sp criterion is defined as follows:

S_p = Σ_{i=1}^{n} (Ŷ_ik − Ŷ_ip)² / σ² − (k − 2p),   (4)

where k and p are the numbers of parameters in the full and subset models, respectively.
Information criteria provide an attractive way to select models. Other criteria that are often used include the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [19]. According to [20], the difference between AIC and BIC is that BIC is consistent in selecting the true model when it is under consideration, whereas AIC aims to minimize the risk function when the true model is not one of the candidates. The choice of criterion depends on the researcher, and AIC and BIC are suggested to be used together. Table 1 provides a summary of each stepwise feature selection method and quality criterion.

Table 1. Stepwise feature selection and quality criterions.

Author | Year | Objective | Method | Pros | Cons
Ralston and Wilf [13] | 1960 | Develop a method for model selection | Forward selection and backward elimination | Simple to understand and use | Final model affected by order
Mallows [17] | 1964 | A criterion for subset selection | Cp criterion | Graphically compares quality between models | Suffers with outliers and non-normality
Gorman and Toman [16] | 1966 | Fractional factorial design for model selection | Fractional factorial design with the statistical criterion Cp | Avoids computing all possible models | Heuristic technique
Kashid and Kulkarni [18] | 2002 | A more general selection criterion than Cp when least squares is not best | Sp criterion | Applicable with any estimator | Computationally difficult and inconsistent results
Misra and Yadav [21] | 2020 | Improve classification accuracy in small sample sizes | Recursive Feature Elimination with Cross-Validation | Does not delete the records | Evaluated on small sample size

The authors of [5] proposed the use of principal component analysis (PCA) as a solu-
tion to multicollinearity among predictor variables in a regression model. It is a statistical
method to transform variables into new uncorrelated variables called principal components
and reduce the number of predictive variables. Regression analysis is done using the
principal components. The principal components are independent and thus satisfy the
OLS assumption. The principal components are ranked according to the magnitude of
the variance. This means that principal component 1 is responsible for a larger amount
of variation than principal component 2. Therefore, PCA is useful for dimensionality
reduction. Principal components with an eigenvalue near zero can be eliminated. This way
the model is sparse while not dropping variables that might contain useful information.
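A minimal principal component regression sketch along these lines (our own illustration using scikit-learn; the synthetic data and the 95% variance threshold are arbitrary choices):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Keep only the leading components (here, enough to explain 95% of the predictor variance),
# then regress the response on those uncorrelated components.
pcr = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())

# Usage sketch with synthetic collinear data:
rng = np.random.default_rng(2)
x1 = rng.normal(size=(200, 1))
X = np.hstack([x1, x1 + 0.01 * rng.normal(size=(200, 1)), rng.normal(size=(200, 3))])
y = X @ np.array([1.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(size=200)
pcr.fit(X, y)
print("components kept:", pcr.named_steps["pca"].n_components_)
```

Because the discarded components are those with eigenvalues near zero, the retained regressors are uncorrelated and the fit is no longer destabilized by the collinear pair.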
The Partial Least Squares (PLS) method was developed by the author of [22]. It
is a better alternative to multiple linear regression and PCA because it is more robust.
The model parameters do not change by much when new samples are used. PLS is like
PCA as it is also a dimension reduction technique. The difference is that it captures the
characteristics of both X and Y instead of just X as does PCA. The PLS method works by
iteratively extracting factors from X and Y while maximizing the covariance between the
extracted factors. The PLS derives its usefulness from its ability to analyze noisy data with
multicollinearity. This is because its underlying assumptions are much more realistic than
traditional multiple linear regression. The authors of [23,24] compared the PLS method
with the lasso and stepwise method and found it to be performing better.
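For comparison, a PLS sketch of the same flavour (our own illustration; scikit-learn's PLSRegression performs the iterative extraction of factors that maximize the covariance between X and y, and the number of factors would normally be tuned by cross-validation):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
x1 = rng.normal(size=(200, 1))
X = np.hstack([x1, x1 + 0.01 * rng.normal(size=(200, 1)), rng.normal(size=(200, 3))])
y = X @ np.array([1.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(size=200)

# Extract two latent factors chosen to maximize the covariance between X scores and y scores,
# then predict y from those factors.
pls = PLSRegression(n_components=2).fit(X, y)
print("R^2 on training data:", round(pls.score(X, y), 3))
```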
A few journals have made comparisons among the techniques. The authors of [25]
discussed and compared PCA and PLS as they are both dimension reduction methodologies.
Both methods are used to convert highly correlated variables into independent variables and to reduce the number of variables. The methodology of PCA does not consider the relationship between the predictor variable and the response variable, while PLS does. Therefore, PCA is an unsupervised dimension reduction technique and PLS is a supervised technique. They also found that the predictive power of the principal components does not necessarily line up with their order; for example, principal component 1 may explain the change in the response variable less than principal component 2 does. PLS is more efficient than PCA in this
regard as it is a supervised technique. PLS is extracted based on significance and predictive
power. The author of [26] compared partial least square regression (PLSR), ridge regression
(RR), and principal component regression (PCR) using a simulation study. The study used
MSE to compare the methods. They found that when the number of independent variables increases, PLSR is the best. If the number of observations and the degree of multicollinearity are large enough while the number of independent variables is small,
RR has the smallest MSE.
A recent application of PLS is seen in the chaos phase modulation technique for underwater acoustic communication. The authors of [27] adopted PLS regression into chaos phase modulation communication to overcome the multicollinearity effect. They described PLS as a machine learning method that uses the training and testing processes simultaneously. The study found that this method effectively improves the communication signals. The authors compared it with two algorithms, the Time Reversal Demodulator and a 3-layer Back Propagation Neural Network, neither of which performs feature analysis and relationship analysis, and showed that PLS regression has the best performance.
Multigene genetic programming was first developed by the authors of [28,29], who used this method to automate predictor selection to alleviate multicollinearity problems. The authors of [30] described a genetic algorithm-based machine learning approach to perform variable selection. The genetic algorithm is a general optimization algorithm based on concepts such as evolution and survival of the fittest. The model is initialized by creating a population of several individuals, where each individual is a different model and the genes of an individual are the features included in that model. An objective function is used to determine the fitness of the models. In the next generation/iteration, the best models are selected and their genes "crossover", combining some features of the parent models. Mutation may also occur with some determined probability, whereby a feature is reversed. According to [30], this machine learning concept should be combined with a derivative-based search algorithm in a hybrid model, because genetic algorithms are very good at finding generally good solutions but are not as good at finding local minima as derivative-based search algorithms. The derivative-based search can be performed after a certain number of iterations of the genetic algorithm. Iterations are continued until no improvement in the fitness of the models is seen.
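The following compact sketch illustrates this scheme (it is our own illustration, not the implementation of [30]); each candidate model is a binary gene vector over the features, fitness is the cross-validated R² of a linear model on the selected features, and selection, single-point crossover, and mutation are applied each generation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

def fitness(genes, X, y):
    """Fitness of a candidate model: cross-validated R^2 of the features switched on in `genes`."""
    if not genes.any():
        return -np.inf
    return cross_val_score(LinearRegression(), X[:, genes], y, cv=5).mean()

def ga_select(X, y, pop_size=20, generations=30, mutation_rate=0.05):
    n_features = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_features)).astype(bool)   # initial population
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]                            # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)                            # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < mutation_rate                # mutation reverses a feature
            children.append(child ^ flip)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(scores)]                                        # best gene vector found

# Usage: best_mask = ga_select(X, y); selected_columns = np.flatnonzero(best_mask)
```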
The authors of [31] proposed a quadratic programming approach to feature selection because previous methods do not consider the configuration of the dataset and are therefore not problem-dependent. The aim of using quadratic programming is to maximize the number of relevant variables and reduce similar variables. The criterion Q that represents the quality of a subset of features a is presented in quadratic form, Q(a) = aᵀQa − bᵀa, where Q ∈ R^{n×n} is a matrix of pairwise predictor similarities and b ∈ R^n is a vector of the relevance of each predictor to the target vector. The authors suggested that the similarity between features x_i and x_j and between x_i and y can be measured using Pearson's correlation coefficient [32] or the concept of mutual information [33]. However, these two measures do not directly capture feature relevance, so the authors utilized a standard t-test to estimate the normalized significance of the features. The proposed method outperforms other feature selection methods, such as stepwise, Ridge, Lasso, Elastic Net, LARS, and the genetic algorithm.
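As a small, self-contained illustration of the criterion Q(a) = aᵀQa − bᵀa (our own sketch: we take absolute Pearson correlations for both the similarity matrix Q and the relevance vector b, and enumerate small subsets instead of calling a quadratic programming solver as the authors do):

```python
import numpy as np
from itertools import combinations

def qp_feature_score(X, y, subset):
    """Evaluate Q(a) = a'Qa - b'a for a binary selection vector a over the given subset."""
    Xs = np.asarray(X, dtype=float)
    Q = np.abs(np.corrcoef(Xs, rowvar=False))                                   # pairwise predictor similarity
    b = np.abs([np.corrcoef(Xs[:, j], y)[0, 1] for j in range(Xs.shape[1])])    # relevance to the target
    a = np.zeros(Xs.shape[1])
    a[list(subset)] = 1.0
    return a @ Q @ a - b @ a

def best_subset(X, y, size):
    # Lower scores are better: low mutual similarity among selected features, high relevance to y.
    return min(combinations(range(X.shape[1]), size), key=lambda s: qp_feature_score(X, y, s))
```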
The authors of [34] presented the maximum relevance–minimum multicollinearity (MRmMC) method to perform variable selection and ranking. Its approach also focuses on relevancy and redundancy. Relevancy refers to the relationship between features and the target variable, while redundancy is the multicollinearity between features. The main advantage of this method over others is that it does not require any parameter tuning and is relatively easy to implement. Relevant features are measured with a correlation coefficient and redundancy with a squared multiple correlation coefficient. A measure J that combines relevancy and multicollinearity is developed:

J(f_j) = max_{f_j ∈ F−S} [ r²(f_j, c) − Σ_{i=1}^{k} sc(f_j, q_i) ],   (5)

where r² is the correlation coefficient between feature f and target c, and sc is the squared multiple correlation coefficient between feature f and its orthogonally transformed variable q. The first feature is selected using the optimization criterion V, and the following features are selected based on criterion J using a forward stepwise method. Although non-exhaustive, it is a very competent method for feature selection and dimension reduction.
The authors of [35] suggested that the mixed integer optimization (MIO) based ap-
proach to selecting variables has received increased attention with development in algo-
rithms and hardware. They developed mixed integer quadratic optimization (MIQO) to
eliminate multicollinearity in linear regression models. It adopts VIF as an indicator for
detecting multicollinearity. Subset selection is performed subject to an upper bound con-
straint on the VIF of each variable. It achieved a higher R-squared than heuristic-based methods such as stepwise selection. The solution is also computationally tractable and simpler to implement than a cutting plane algorithm.
The authors of [36] proposed a profiled independence screening (PIS) method for screening variables with high dimensionality and highly correlated predictors. It is built upon sure independence screening (SIS) [37]. Many variable selection methods developed before SIS do not work well on extremely high dimensional data where predictors vastly outnumber the sample size. However, SIS may break down when the predictors are highly correlated, which motivated PIS. A factor profiling operator Q(Z_I) = I_n − Z_I(Z_IᵀZ_I)⁻¹Z_Iᵀ is introduced to eliminate the correlation between predictors, and the profiled data are applied to SIS. Z_I ∈ R^{n×d} is the latent factor matrix of X and d is the number of latent factors. Factor profiling is as follows:

Q(Z_I)y = Q(Z_I)Xβ + Q(Z_I)ε,   (6)

where Q(Z_I)y is the profiled response variable and the columns of Q(Z_I)X are the profiled predictors. However, PIS may be misleading in a spiked population model. Preconditioned profiled independence screening (PPIS) solves this by using preconditioning and factor profiling. Two real data analyses show that PPIS has good performance.
Outlier detection is also a viable method for variable selection. Recently, projection
pursuit was used to perform an outlier detection-based feature selection [37]. Projection
pursuit aims to look for the most interesting linear projections. The author optimized it
to find outliers. The method was found to be effective in improving classification tasks.
However, it performs poorly when most features are highly correlated or when features are
binary. Table 2 provides a summary of the findings for variable selection approaches.

Table 2. A summary of previous studies on variable selection.

Author | Year | Objective | Method | Pros | Cons
Wold [22] | 1982 | Creates new components using the relationship of predictor and response | Partial Least Squares (PLS) | Supervised component extraction | Cannot exhibit significant non-linear characteristics
Lafi and Kanenee [5] | 1992 | Using PCA to perform regression analysis | Principal component analysis (PCA) | Reduces dimensions | Does not account for relationship with response variable
Bies et al. [30] | 2006 | Genetic algorithm-based approach to model selection | Genetic algorithm | Less subjectivity on model | Not good in finding local minima
Katrutsa and Strijov [31] | 2017 | Quadratic programming approach | Quadratic programming | Investigates the relevance and redundancy of features | Cannot evaluate multicollinearity between quantitative and nominal random variables
Senawi et al. [34] | 2017 | Feature selection and ranking | Maximum relevance–minimum multicollinearity (MRmMC) | Works well with classifying problems | Non-exhaustive
Tamura et al. [11] | 2017 | Mixed integer semidefinite optimization (MISDO) | Mixed integer optimization | Uses backward elimination to reduce computation | Only applies to low number of variables
Tamura et al. [35] | 2019 | Mixed integer quadratic optimization (MIQO) | Mixed integer optimization | Uses VIF as indicator | Only applies to low number of variables
Chen et al. [38] | 2020 | Combines the result of filter, wrapper, and embedded feature selection | Ensemble feature selection | Overcomes the local optima problem | Higher computation cost than a single solution
Zhao et al. [36] | 2020 | Variable screening based on sure independence screening (SIS) | Preconditioned profiled independence screening (PPIS) | Variable screening in a high dimensional setting | Requires decorrelation of the predictors
Larabi-Marie-Sainte [39] | 2021 | Feature selection based on outlier detection | Projection Pursuit | Found outliers correlated with irrelevant features | Does not work well when features are noisy
Singh and Kumar [40] | 2021 | Creates new variables | Linear combination and ratio of independent variables | Does not remove any variables | Based on trial-and-error

The variable selection methods aim to reduce the number of variables to the few
that are the most relevant. This may reduce the information gain from having more data
to work with. Furthermore, the modern optimization methods depend on subjectively determined indicators of relevance and similarity. This can be seen from [11], where the authors suggested other measures of multicollinearity for future research. It is therefore difficult to suggest which method is better without directly comparing performance on the same dataset. Better performance can also be due to the specific problem tested.

3.2. Modified Estimators Methods

Modified estimators are another approach that uses biased and shrunken estimators in exchange for lower variance and thus reduced overfitting [12]. The advantage is that the theoretical model is not compromised because of the dropping of variables. Its disadvantage is that the estimators are now biased. The most known method is ridge regression, developed by the author of [41]. This method adds a penalty term, the squared magnitude of the coefficients β, to the loss function. The general equation of ridge regression is as follows:

Σ_{i=1}^{n} (Y_i − Σ_{j=1}^{p} X_{ij} β_j)² + λ Σ_{j=1}^{p} β_j²,   (7)

The main issue with ridge regression is how to find the ridge parameter λ. If λ is equal to zero, then the estimate will equal the ordinary least squares estimate. However, if λ is too big, it will lead to an underfitting of the model. λ is selected by looking for the least increase in the root mean square error (RMSE) within an appropriate decrease in the variance inflation factor of each variable. A ridge trace is used to assist in this. It is a plot of the coefficients β versus λ, and it is used to pick the smallest λ at which the coefficients start to level off. Alternatively, a validation dataset is used to find the λ that minimizes the validation SSE. One can also identify the λ such that the reduction in the variance term of the slope parameter is larger than the increment in its squared bias. The authors of [42] reviewed estimation methods for the ridge parameter, and new methods were suggested. A more recent paper proposed a Bayesian approach to solving the problem of finding the ridge parameter [43]; simulation results showed that the approach is more robust and provides more flexibility in handling multicollinearity. Later, the authors of [44] proposed another way of solving the problem of finding the ridge parameter. Their generalized cross-validation approach is able to find the global minimum.

More estimators have been developed from the ridge estimator. The authors of [45] used a jack-knife procedure to reduce the significant amount of bias of estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure to find its parameter, because the estimate is a linear function of that parameter. The authors of [47] proposed the Liu-type estimator. They found that the shrinkage of ridge regression is not effective when faced with severe multicollinearity. The Liu-type estimator has a lower MSE when compared to ridge regression and fully addresses severe multicollinearity. Since then, variations on ridge and Liu-type estimators have been created for use in different types of regression. The authors of [48] proposed a Liu-type estimator for binary logistic regression that is a generalization of the Liu-type estimator for a linear model. The authors of [49] stated that not much attention has been given to shrinkage estimators for
validation
regression. SSE. estimators
dataset is have
used, been
find ƛ
created
that
ƛ. model, for
minimizes use in validatio
onvariations
the ridgeproposed estimator.
on the and
ridge It has
Liu-Type theestimator.
Liu-type addedestimators advantage
generalizedparameter
proposed
They havefound
Since of
thebeen
linear simple
that
then, This
Liu-Type the
created procedure
is
variations
logistic
regression because
shrinkage
estimator.
forregression
logistic use onin
models, to
the
ofridgefind
They
regression the
estimate
ridge found
and
that
such regression ais
Liu-type
is
that
as isa aThe
that linear
the
is
generalization
the notauthor function
shrinkage
estimators
generalization
Poisson ofofhave[46]
the
regression of
ofof ridgeproposed
been
Liu-type
the The regression
created
Liu-type alogistic
author
estimator newfor ofclass
is
estimator not
use[47]
for aof
infor
lineaest
a
of dentify
eter ƛ. Thisƛ such
effective
regression. isThethat
because
when the reduction
authors within
theofestimate
faced [48] the
proposed
different
variance
isregression
severe aeffective
linear
proposed types
term
multicollinearity.
different
a Liu-type function
when the
types
ofLiu-Type
Identify
of regression.
the
faced
estimatorof slope
Theƛ. ƛwithsuch
The
regression. Liu-Type
for
The
parameterthatThe
author
estimator.
severe
binary
authors
the
based is
ofreduction
They offound
larger
on
[47]
multicollinearity.
estimator
authors
[48]
the than
has
of
proposed
in
ridge the
that
[48] a variance
estimator.
the
lower
The
proposed
aLiu-Type
Liu-type
shrinkage a term
It has
Liu-type of
of estimator
theadded
the
ridge
estimator slope for
regression
estimator has binary
parameter
advantage
a
for loweris
binarynot is
of large
a sim
model, andauthors negative authors of binomial
[49] of stated
[49] regression
stated
that not thatmuch not much
model. attention attention
Therefore, hasthey been
hasintroduced
been
givengiven to shrinkage
ato shrinka e
he the
ed increment
Liu-Type in estimator.
itscompared
squared bias.
They found logistic
The regression
authors
that the
effective regression
of [42]
shrinkagewhen the
reviewed that
increment
ofaddresses
faced ridge iswitharegression
generalization
estimation
is ainsevere its squared
methods
parameter is not ofƛ.
bias. the
for ƛLiu-type
The
This authors
is The because estimator
ofsevere
[42]
the for
asreviewed
estimate a linear estimation
is model.
a model.
linear method
function
on that MSE when
is a generalization of to ridge
the Liu-type MSE estimator
two-parameter when
logistic and compared
fully
regression
for a linear
shrinkage thatto
model.
generalized ridge
generalized
estimator severe
regression
generalization
The formulticollinearity.
linear linear and
regression
negative of fully
the
regression
binomial addresses
Liu-type
models, models, Liu-Type
models. estimator
such such
It estimator
ismulticollinearity.
for
the aPoisson
linear
aascombination
the has
Poisson a of lower
regression The
regressio mo
and new methods were suggested. A The authors
more recent of [49]
paper stated that not much attention has been givenpaper to shrinkage estimators
e when
stated that faced Since
not with
much severe
then, attention multicollinearity.
variations hason been ridge
thegiven MSE and
authors
ridge The
Since when
to of and
Liu-Type
Liu-type
then,
shrinkage
estimator [49] proposed
compared new
variations
stated
and
methods
estimator
estimators
estimators
regression
Liu to
thataridge
Bayesian
on
regression have
not
estimator.
were
has
ridge
for
model,proposed
much approach
suggested.
amodel,
regression
been lower
and
The created
and the
andLiu-Type
Liu-type
attention
negative
and
authors
to
forA
fully
has more addresses
estimators
negative
of use been
binomial
[50]
recent
inestimator.
given
binomial
modified have severe They
been
toregression
regression
proposed
shrinkage
the found
multicollinearity.
created
model.
Jackknifed
afor
that Bayesian
estimators
model. the
use
Therefore,
ridge infor
Therefore,
appro
shrinkagethey t
olving the problem of finding the ridge for generalized
parameter [43].linear
solving regression
Simulation
the problem resultmodels,of
effective showed
finding such
when that
the as the
ridge
faced Poisson
parameter regression [43]. model,
Simulation logistic
result showe
hen regression
ear compared
differentto ridgeofregression
types
models, regression.
such as the and The fully
different
Poisson authors
regression addresses
Since
generalized of
types
regression
estimatorthen,
[48] severe
of
linear variations
proposed
regression.
model,
to multicollinearity.
regression
two-parameter
form aon
alogistic
Liu-type
The
two-parameter ridge
Modified authors
models, and
estimator
shrinkage ofLiu-type
such
shrinkage
Jackknifed [48] for proposed
asbinary
estimator thewith
estimators
estimator
Poisson Poisson
for severe
forhave
a negative
Liu-type
ridge multicollinearity.
been
regression
negative estimator
binomial
regression created
binomialmodel,forfor
models.
estimator. binary The
use It in
logistic
models. Liu-Ty
is aIt com
is
he approach isregression
more regression model, and negative binomial regression model. Therefore, they introduced a
nceel, and then, variations
logistic
negative onrobust
binomial ridge that and
and
regression provide
is aLiu-type
generalization
model.
The
more
estimators
different
logistic
regression
author Therefore,flexibility
oftypes
regression
of the
[2]
thehave
model, approach
of
Liu-type
they
reviewed
inbeen
the andhandling
regression.
that is
introduced createdis more
estimator
ridgeanegative
the
multicollinearity.
The
generalization
the ridge
estimator
biased MSE
for forrobust
use
a estimator awhen
authors
binomial inofand
linear
and
estimators of compared
the
Liu
and [48]
model.
regressionprovide
proposed
Liu-type
estimator.
inLiu The to
theestimator.
more
Poisson ridge
estimator
model. The
flexibility
regression
a Liu-type
Thefor
Therefore,
authors
regression authors
inthey
aestimator
linear
of andhandling
[50]
model fully
ofmodel. for
introduced
modified
[50]
multicollin
addresses
binary
in modified
the Thethea Jack se
the
Later, the two-parameter shrinkage estimator for negative binomial models. Itsolving
is a combination of of findin
nt types
shrinkage ofauthors
authors regression.
estimator of for
of [49] [44]
The
stated proposed
authors
negative that not ofanother
[48]
binomial much proposed
logistic
authors
presence
way
attention
two-parameter
models. of of a
[49]Itsolving
regressionLater,
Liu-type
has
isstated
a been the
shrinkagethe
that that
combination
regression
authors
estimator
is
given problem
a
not
regression to much
estimatorof
for
generalization of
shrinkage
of
estimator
[44] finding
Since
binary proposed
attention
estimatorfor then,
of the
the
estimators
negative
tomaximum
form has
to form
another
variations
Liu-type
been for
binomial
a Modified given
a Modified
wayon
estimatorto ofshrinkage
ridge
models.
Jackknifed and
for
Jackknifed It a is the
Liu-type
linear
a problem
estimators
Poisson model.
combination
Poisson estimators
ridgeridge The
for of
regressio ha
regr
idge parameter. the ridgeof multicollinearity.
estimator and Liu The
estimator. regular The authors of [50] likelihood modified methodthe Jackknifed in estimating ridge
atorregression
and Liuthat
generalized isTheir
estimator. generalized
a generalization
linear Theregression
authors ofcross-validation
the
of models,
[50]Liu-type
regression authors
generalized
themodified such
ridge ofapproach
estimatorridge
[49]
linear
as
thethe
estimator
coefficient
parameter.
stated for
Jackknifed
The
is
to
regression
Poisson
and
not
is
athat
author
The able
linearLiunot Their
to
model.
much
regression
ridge
author
reliable models, find
different
estimator.
of [2]
in
generalized
of the
The
attention
such
reviewed
the [2] global
types
model,
The
reviewed
presence as of
authors
thecross-validation
hastheregression.
been
logistic
of biased
thePoisson
ofbiased given
[50] The toapproach
authors
regression
modified
estimators
multicollinearity. shrinkage
estimators in the
Theythe of to
model, is able
[48]
estimators
Jackknifed
in Poisson
the
compared proposed
logistic
Poisson toridgefind
for
regression athe
regress Liu
minimum. regression estimator minimum. to form a Modified Jackknifed Poisson ridge regression estimator.
s of [49]
mator stated
form athat
toregression not much
model,
Modified and attention
negative
Jackknifed has
binomial been
generalized
regression
Poisson regression ridge given
regression
model, to
linear
regression
estimator shrinkage
and regression
model.
presence negative
estimator.
to estimators
Therefore,
form
presence logistic
models,
binomial they
amulticollinearity.
Modified regression
forregression
such toas
introduced
Jackknifed that
the model. is
Poisson
aregular
Poissona generalization
regression
Therefore,ridge they
regressionof the
model,
introduced Liu-type logistic
estimator. estima
a methi
More estimators have been developed
the
Theperformance
author
from theof [2] ridge
ofreviewed
four
More
estimator. theof
estimators
estimators biased
The
of
in multicollinearity.
authors
have
addition
estimators
beenofregression
[45] in
developed
theThe
thewidely The
Poissonfrom
regular
used
the
maximum
ridge
regression
ridge
maximum likelihood
estimator
model
estimator.
likelihood
and
inThe the method
authors
2]ized linear
reviewed regression
two-parameter
the biased models,
shrinkage
estimators such
estimator
in the as
found the
regression
two-parameter
for
Poisson
The that Poisson
negative
author model,
regression
Liu-type ofregression
shrinkage
binomial
[2] and
reviewed
model
regression
estimators model,
negative
estimator
models.
regressionin the
the authors
coefficient
have logistic
binomial
It for
biased
coefficient
superior of
is anegative [49]
combination
estimators
is not is stated
binomial
reliable
not
performance in that
model.
of
the
reliable in not
models.
Poisson
the in
over much
Therefore,
presence
the other It attention
ismethods
regression
presence aofthey
combination has
introduced
model
multicollinearity.
of been
multicollinearity
in the in of given
thea The to
used a jack-knife procedure to reduce presence
the significant of multicollinearity.
used
amount aof jack-knife
of bias The
ofprocedure regular to
estimators maximum
from
reduce the likelihood
significant methodamount inof estimating
bias of estimators
ion model,
ulticollinearity.the ridgeand The negative
estimator
regular binomial
and Liuregression
maximum estimator.
Poissonthe ridge
likelihood
presence model.
two-parameterThe
regression estimator
authors
method
of Therefore,shrinkage
and
multicollinearity.
model. in
the [50] they
Liu modified
estimating
performance
the estimator.
performance generalized
introduced
estimator The the
offorThe
regulara
fournegative
Jackknifed
of linear
authors
four maximum
estimators ofregression
binomial
ridge
estimators [50] in models.
modified
likelihood
addition
in models,
addition It
the
to is
method
the such
a
Jackknifed
to widely
the as
combination
in the
widely ridge
estimating
used Poissonof
used
ridge ridr
es
idge regression. The author of regression coefficient isestimator,
not reliable inLiu theestimator,
presence of multicollinearity. They compared the the Liu estim
rameter
ficient shrinkage
isregression
not reliable estimator
estimator
in the presencefor[46]
to proposed
negative
form a Modified
of binomial
the
regression a new
ridge
multicollinearity.
regression class
Jackknifedridge
models.
estimator
estimator
coefficientof regression.
They Itand is
Poisson
foundto a
form
compared
is
found not The
combination
Liu ridge
a
thatinthat
the
estimator.
Modified
reliable
Liu-type
author
regression
regression
in
Liu-type of
The ofmodel,
[46]
authors
Jackknifed
the
estimators
proposed
estimator.
presence
estimators and of
Poisson
have ofnegative
[50] a modified
new
ridge
multicollinearity.
have
superior superior
class
binomial of estimator,
the
regression
performance regression
Jackknifed
They
performance estimator.
compared
over model.
ridge
over
otherothe Th
me
based onThe the ridge estimator. It has performance ofbasedfour aestimators addition to the thewidely used ridge estimator and procedure
found
ge e of estimator
four and Liu
author
estimators ofin estimator.
[2]addition
reviewed Thetothe
the
the added
authors biased
The
widely
the
that Liu-type
advantage
of
regression
author [50]
estimators
performance
used modified
ofestimator
[2]
ridge
estimators
ofin on
reviewedsimple
the
estimator
of
Poisson the
thefour
have
ridge
procedure
Jackknifed
toPoisson
form the
estimators
and
regressionestimator.
two-parameter
abiasedModified
regression to
ridge find
Itmodel.
estimators
in
model. has
Jackknifed
model
addition the
shrinkage
in
intoadded
the
thePoisson
the Poissonadvantage
estimator
widely ridgeregression
used ofridge
for anegative
regression simplemodel
estimator binomial
estimator.
in the and to fi
mod
parameter
ion estimator
-type presence
estimatorsƛ. Thisto of ismulticollinearity.
form
have because
asuperior
Modified theperformance
estimate
Jackknifed
The The
presence is aauthor
regular
found linear
Poisson
over ofmaximum
that parameter
otherof function
ridge
[2]methods
multicollinearity.
Liu-type regression
reviewed ƛ. This
of
likelihoodƛ.insuperior
estimators The
the The
the isthe author
because
estimator.
biased
method
regular
have
performance
of
ridgeestimators the
in
superior [47]
estimator
maximum estimate
estimating in
over is
and the
performance
other
LiuPoisson
likelihood
methods
a estimator.
linearover function
regression
method other
in the
Thein of
authors
model
methods
Poisson
ƛ. The
estimating of
in author
in[50] themodo
the
proposed the Liu-Type estimator. They regression
found that model.
the proposed
shrinkage theof ridge
Liu-Type regressionestimator. is not They found that the shrinkage of ridge regression
thor
ion model. of regression
[2] reviewed the biased
coefficient estimators
is not reliable in
regressioninthe
presence
Poisson thePoissonpresence
coefficient
regression regression
of multicollinearity. ofmodel.
ismulticollinearity.
not reliable model regression
Theinregular the They estimator
presencemaximum
compared oftomulticollinearity.
form
likelihood a Modified method TheyJackknifed
incompared
estimating Poisson ri
effective the when
ce of multicollinearity. faced with
performance ofsevere
The regular
four multicollinearity.
estimators maximumthe regression
performance
in addition The
likelihood effective
coefficientLiu-Type
to ofthe method
four when estimator
isestimators
widely not faced
in The with
estimating
reliable
used has
author
ridge
in severe
a estimator
inaddition
the lower
of
presence multicollinearity.
[2] toreviewed
theand ofwidely theused
multicollinearity. Theridge
biased Liu-Type
estimators
They
estimator estimator
in and
compared the Poisso has a
MSE when
ion coefficient compared
found that is not to
Liu-type ridge
reliableestimators regression
in the presence and
found
have fully
of
the performance addresses
MSE
multicollinearity.
superior
that Liu-type when
performance severe compared
four They
of estimators multicollinearity.
estimators
over presence
compared
have to
other ridge
in regression
of multicollinearity.
addition
superior
methods to and
in the widely
performance fully over addresses
Theused regular
other ridge severe
methods maximum
estimator multicollinear
in the and likeliho
formanceSince then,
of four variations
estimators
Poisson regression model. on ridge
in and
addition Liu-type
to
found the estimators
widely
that
Poisson regression model. Liu-type Since
used have then,
ridge been
estimators variations
estimatorcreated
regression
have andfor
on use
ridge
superior in
coefficient and
performanceLiu-type
is not estimators
reliable
over in
other the have presence
methods been created
of
in multicol
the for
different types of regression.
that Liu-type estimators have superior performance The authors of [48] proposed
Poisson regression different a
over other Liu-typetypes
model. of
estimator
methods regression.
the in for
performance
the The
binary authors of [48] proposed
of four estimators in addition to the widely a Liu-type estimator for
lower variance and thus reduce overfitting [12]. The advantage is that the
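As a concrete illustration of the ridge penalty in Equation (7) and of the partial ridge idea of penalizing only the highly collinear variables, the following Python sketch solves the penalized least squares problem in closed form. The synthetic data, the VIF threshold of 10, the penalty strength, and the function names are assumptions made for this example; they do not reproduce the exact procedure of [41] or [51].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x1 and x2 are almost identical (severe collinearity), x3 is independent.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam, penalize=None):
    """Closed-form estimate (X'X + lam*D)^-1 X'y.

    `penalize` is a boolean mask; columns marked False receive no penalty,
    which corresponds to penalizing only the highly collinear variables."""
    p = X.shape[1]
    D = np.eye(p) if penalize is None else np.diag(penalize.astype(float))
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# VIFs from the diagonal of the inverse correlation matrix of the predictors.
vif = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
high_vif = vif > 10.0   # a common rule-of-thumb threshold, assumed here

beta_ols     = ridge(X, y, lam=0.0)                        # ordinary least squares
beta_ridge   = ridge(X, y, lam=5.0)                        # penalty on every coefficient
beta_partial = ridge(X, y, lam=5.0, penalize=high_vif)     # penalty only on collinear columns

print("VIF:", np.round(vif, 1))
print("OLS:", np.round(beta_ols, 2))
print("Ridge:", np.round(beta_ridge, 2))
print("Partial ridge:", np.round(beta_partial, 2))
```

With a diagonal mask D equal to the identity, the sketch reduces to standard ridge regression; zeroing the entries of D for low-collinearity columns leaves those coefficients unshrunken, which is the intuition behind the partial ridge approach.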
The Lasso regression is a method developed by the author of [52] as a result of the problems of both stepwise regression and ridge regression, namely interpretability. Stepwise regression is interpretable, but the process is very discrete, as it is not known why variables are included in or dropped from the model. Ridge regression handles multicollinearity well because of the stability of the shrunken coefficients; however, it does not reduce any coefficient to zero, which results in models that are hard to interpret. The Lasso is known as L1 regularization, while ridge regression is known as L2 regularization. The main difference between the two is that the Lasso reduces certain parameter estimates exactly to zero, which makes it well suited to variable selection. The equation is shown below:

$$\sum_{i=1}^{n}\left(Y_i-\sum_{j=1}^{p}X_{ij}\beta_j\right)^2+\lambda\sum_{j=1}^{p}\left|\beta_j\right|, \qquad (8)$$

The penalty term, the absolute value of the coefficient β, is added to the loss function. In this equation, Y is an (n × 1) vector of responses, X is an (n × p) matrix of predictor variables, and β is a (p × 1) vector of unknown constants. As with ridge regression, as λ approaches zero the solution approaches the least squares estimate; however, if λ is very large, the coefficients approach zero. Ridge regression shrinks the estimator but does nothing in terms of variable selection, while the Lasso achieves both. For this reason, the Lasso is often more desirable: it is more parsimonious and therefore better at explaining the relationship between the independent and dependent variables.
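A minimal sketch of the contrast between the L2 and L1 penalties, written with scikit-learn; the synthetic data and the regularization strengths are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # collinear with x1
x3 = rng.normal(size=n)                    # irrelevant predictor
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

# L2 penalty: shrinks all coefficients but keeps every variable in the model.
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_.round(2))

# L1 penalty: drives some coefficients exactly to zero, performing variable selection.
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))
```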
When faced with multicollinearity, Ridge and Lasso perform differently. Ridge tends to spread the effect evenly and shrink the estimators of all the variables, while the Lasso is unstable and tends to retain one of the variables and eliminate all the others. The Lasso performs poorly in the case where the number of variables, p, is larger than the number of observations, n, since it selects at most n variables; and when n > p, the performance of the Lasso is not as good as that of ridge regression. The authors of [53] proposed the elastic net, which combines both the Ridge and Lasso penalties. It has the advantages of both regularization methods, and it also shows a grouping effect: the elastic net groups variables that are highly correlated together and either drops or retains all of them together. Typically, cross-validation, originally used by [54], is applied to choose the tuning parameter; a subset of the sample is held out in order to validate the performance.
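A short sketch of the elastic net with a cross-validated tuning parameter, using scikit-learn; the data and the l1_ratio grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.05, size=n),   # correlated group
                     rng.normal(size=n)])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

# ElasticNetCV picks the penalty strength (alpha) by k-fold cross-validation,
# the hold-out idea described above; l1_ratio balances the L1 and L2 penalties.
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
print("chosen alpha:", round(model.alpha_, 4))
print("coefficients:", model.coef_.round(2))
```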
The authors of [55] developed the least angle regression (LARs) algorithm. It takes inspiration from the Lasso and stagewise regression and aims to be a computationally simpler method. LARs begins similarly to forward regression: it starts with all coefficients equal to zero and adds the predictor most correlated with the response, until another variable has as much correlation with the current residuals. LARs then proceeds equiangularly between the predictors, along the "least angle direction", until the next most correlated variable enters. The authors of [56] also improved upon the Lasso regression by using a mixed-integer programming approach. It eliminates structured noise and thus performs better in a high-dimensional environment where p > n. The authors of [57] further expanded on the idea and developed several penalized mixed-integer nonlinear programming models; the models are also solvable by a meta-heuristic algorithm.

The authors of [58] introduced a strictly concave penalty function called the modified log penalty. It is contrary to the strictly convex penalty of the elastic net and is aimed at achieving a parsimonious model even under the effects of multicollinearity. Methods such as the Elastic
net tend to focus on the grouping effect which means that collinear variables are included
together. Table 3 provides a summary of the findings for modified estimator approaches.

Table 3. A summary of previous studies on modified estimators.

Author | Year | Objective | Method | Pros | Cons
Hoerl [41] | 1962 | Adds bias in exchange for lower variance | Ridge regression | Reduces overfitting | Introduces significant amount of bias
Singh et al. [45] | 1986 | Address significant amount of bias in ridge regression | Jack-knife procedure | Simple method to obtain confidence intervals for the regression parameters | Larger variance than ridge regression
Liu [46] | 1993 | Simple procedure to find ridge parameter | Liu estimator | Ridge estimate is a linear function of ridge parameter | Does not work in severe multicollinearity
Tibshirani [52] | 1996 | Address interpretability of stepwise and ridge regression | Lasso regression | Reduces coefficients to zero | Worse performance than Ridge and does not work when p > n
Liu [47] | 2003 | Existing method does not work in severe multicollinearity | Liu-type estimator | Allows large shrinkage parameter | Two-parameter estimation
Efron et al. [55] | 2004 | Computational simplicity | Least angle regression (LARs) | Computationally simpler Lasso | Very sensitive to the presence of outliers
Zou and Hastie [53] | 2005 | Combines Ridge and Lasso regression | Elastic net | Achieves grouping effect | No parsimony
Chandrasekhar et al. [51] | 2016 | Applies Ridge parameters only on variables with high collinearity | Partial ridge regression | More precise parameter estimates | Subjective measure of high collinearity
Assaf et al. [43] | 2019 | A conditionally conjugate prior for the biasing constant | Bayesian approach to finding ridge parameter | Produces a marginal posterior of the parameter given the data | Only focuses on getting a single parameter
Nguyen and Ng [58] | 2019 | Strictly concave penalty function | Modified log penalty | Parsimonious variable selection under multicollinearity | No grouping effect
Kibria and Lukman [59] | 2020 | Alternative to the ordinary least squares estimator | Kibria–Lukman estimator | Outperforms Ridge and Liu-type regression | Results depend on certain conditions
Arashi et al. [60] | 2021 | High-dimensional alternative to Ridge and Liu | Two-parameter estimator | Has asymptotic properties | Lower efficiency in sparse model
Qaraad et al. [61] | 2021 | Tune parameter alpha of Elastic Net | Optimized Elastic Net | Effective with imbalanced and multiclass data | Accuracy metric not discussed
Modified estimators aim to improve the efficiency of parameter estimation in the presence of multicollinearity. This comes with a bias–variance trade-off. Researchers can select a method based on their purpose, such as the grouping effect or parsimony. However, it can require extensive knowledge to know which one works better on a given problem; for example, some methods are shown to work better at high or low dimensionality or at different degrees of multicollinearity. Moreover, some methods are designed for linear regression, and modifications need to be made for other functional forms of prediction or for classification problems.
3.3. Machine Learning Methods


In this section, we attempt to present the overall state of the multicollinearity problem
in machine learning and introduce interesting algorithms that deal with it implicitly. Neural networks have been shown to outperform traditional statistical models in this setting. The authors of [62]
used a feed forward artificial neural network to model data with multicollinearity and
found that it has much better performance in terms of RMSE compared to the traditional
ordinary least square (OLS). This shows that machine learning methods with more complex
architecture have the potential to produce much better estimates of the parameters than
statistical methods. The authors of [51] provided reasons why a machine learning algorithm
might be better. They have no requirement for assumptions about the function, can uncover
complex patterns, and dynamically learn changing relationships.
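To make this kind of comparison concrete (this is not the experiment of [62] itself), the following sketch fits an ordinary least squares model and a small feed-forward network on the same collinear data and compares their held-out RMSE; the data, network size, and other settings are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=n), rng.normal(size=n)])
y = np.sin(2.0 * x1) + 0.3 * X[:, 2] + rng.normal(scale=0.2, size=n)   # nonlinear signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ann = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0).fit(X_tr, y_tr)

for name, model in [("OLS", ols), ("ANN", ann)]:
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name} test RMSE: {rmse:.3f}")
```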
Next, it is observed that variable selection methods have been applied in neural
networks. The authors of [63] proposed a hybrid method that combines factor analysis
and an artificial neural network to combat multicollinearity. ANN is not able to perform variable selection; therefore, PCA is used to extract components, and the ANN is then applied to
the components. This method is named FA-ANN (factor analysis–artificial neural network).
It is compared with regression analysis and genetic programming. FA-ANN has the best
accuracy among them. The advantage of FA-ANN and genetic programming is that it is
not based on any statistical assumptions, so it is more reliable and trustworthy. In addition,
they can generalize over new sample data unlike regression analysis. However, they are
considered a black-box model and are hard to interpret. A more recent version of this
approach has been used in quality control research. The authors of [64] proposed a residual
(r) control chart for data with multicollinearity. They suggested a neural network because
a generalized linear model (GLM) may not work best in asymmetrically distributed data.
They concluded that neural network model and functional PCA (FPCA) can deal with the
high dimensional and correlated data.
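A minimal sketch of this hybrid idea, in which components are extracted first and a neural network is then fitted on them; PCA, the number of components, and the network settings below are assumptions for illustration rather than the exact FA-ANN procedure of [63].

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
n = 300
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, 8)) + rng.normal(scale=0.1, size=(n, 8))  # 8 collinear inputs
y = latent[:, 0] - 0.5 * latent[:, 1] + rng.normal(scale=0.2, size=n)

# Components are extracted first, which removes the collinearity among the raw
# inputs; the network is then trained on the (orthogonal) component scores.
fa_ann = make_pipeline(StandardScaler(), PCA(n_components=2),
                       MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
fa_ann.fit(X, y)
print("R^2 on the training data:", round(fa_ann.score(X, y), 3))
```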
Furthermore, regularization and penalty mechanisms can also be used to solve mul-
ticollinearity in machine learning models. For example, the Regularized OS-ELM algo-
rithm [65], OS-ELM Time-varying (OS-ELM-TV) [66], Timeliness Online Sequential ELM
algorithm [67], Least Squares Incremental ELM algorithm [68], and Regularized Recursive
least-squares [69]. However, these mechanisms increase the computational complexity. For
this reason, the authors of [70] proposed a method called the Kalman Learning Machine
(KLM). It is an Extreme Learning Machine (ELM) that uses a Kalman filter to update the
output weights of a Single Layer Feedforward Network (SLFN). A Kalman filter is an
equation that can efficiently estimate the state of a process that minimizes mean squared
error. The state is not updated in the learning stage as with the concept of ELM. The
resulting model has been shown to outperform basic machine learning models in prediction
error (RMSE) and computing time. However, it requires manual optimization by humans.
A constructive approach to building the model is suggested.
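As a concrete, generic illustration of a regularized extreme learning machine (an assumed bare-bones formulation, not the specific algorithm of any of the cited papers), the sketch below draws random input weights and obtains the output weights from a ridge-regularized least squares solve.

```python
import numpy as np

rng = np.random.default_rng(5)

def elm_fit(X, y, hidden=50, lam=1e-2):
    """Random input weights, one nonlinear hidden layer, regularized output weights."""
    W = rng.normal(size=(X.shape[1], hidden))   # input weights are random and never trained
    b = rng.normal(size=hidden)
    H = np.tanh(X @ W + b)                      # hidden layer activations
    # The ridge-regularized solve for the output weights mitigates the
    # ill-conditioning caused by collinear inputs (and collinear hidden units).
    beta = np.linalg.solve(H.T @ H + lam * np.eye(hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

n = 400
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=n)])   # two collinear inputs
y = np.sin(x1) + rng.normal(scale=0.1, size=n)

W, b, beta = elm_fit(X, y)
rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - y) ** 2)))
print("training RMSE:", round(rmse, 3))
```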
Although deep learning (DL) has emerged as an efficient method to automatically learn
data representation without feature engineering, its discussion in terms of multicollinearity
is very limited. Based on this motivation, our paper discusses the properties of neural networks, such as the convolutional neural network (CNN), the recurrent neural network (RNN), the attention mechanism, and the graph neural network, before illustrating examples of mitigating the multicollinearity issue.
CNN is a neural network which was first introduced by the authors of [71] in the field
of computer vision. It developed the concepts of local receptive fields and shared weights to reduce the number of network parameters, and it is interesting in the way it addresses relationships between features. Traditional deep neural networks suffer from an exploding number of parameters. CNN adopts multiple convolutional and pooling (subsampling) layers
to detect the most representative features before connecting to a fully connected network for
prediction. Specifically, the convolutional layer applies multiple feature extractors (filters) to detect local features and produces a corresponding feature map to represent each local feature. The composition of multiple feature maps may represent the entire series. The
pooling layer is a dimensional reducing method to extract the most representative features
and lower the noise. The generated feature maps are likely to be independent of each other and potentially mitigate the multicollinearity problem. For example, the authors of [72]
proposed the CNNpred framework to model the correlations among different variables to
predict the stock market movement. Two variants of CNNpred, namely 2D-CNNpred and
3D-CNNpred, were introduced in their paper to extract combined features from a diverse
set of input data. It comprises five major US stock market indices, currency exchange rates, futures contracts, commodity prices, treasury bill rates, etc. Their results showed a
significant predictive improvement as compared to the state-of-the-art baseline. Another
interesting study by the authors of [73] proposed to integrate the features learned from
different representation of the same data to predict the stock market movement. They
employed chart images (e.g., Candle bar, Line bar, and F-line bar) derived from stock prices
as additional input for predicting the SPDR S&P 500 ETF movement. The proposed
model ensembled Long Short-Term Memory (LSTM) and CNN models to exploit their
advantages in extracting temporal and image features, respectively. Thus, the result showed
that the prediction error can be efficiently reduced by integrating the temporal and image
features from the same data.
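A minimal sketch of this convolution-plus-pooling pipeline for multivariate series data, written with PyTorch; the input shape, layer sizes, and prediction target are assumptions for illustration and do not reproduce CNNpred or the LSTM-CNN ensemble.

```python
import torch
from torch import nn

# 8 (possibly collinear) input variables observed over 20 time steps.
model = nn.Sequential(
    nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3),  # local feature extractors
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),                                # dimensionality reduction
    nn.Conv1d(16, 8, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                                    # one value per feature map
    nn.Flatten(),
    nn.Linear(8, 1),                                            # e.g., next-step prediction
)

x = torch.randn(32, 8, 20)   # batch of 32 series: (batch, variables, time steps)
print(model(x).shape)        # -> torch.Size([32, 1])
```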
Other than feature maps, there is another influential development, namely the atten-
tion mechanism in the Recurrent Neural Network (RNN). RNN was first proposed by
the author of [74] to process sequential information. Based on [75], the term “recurrent”
describes the general architecture idea: a similar function is applied to each element of the sequence, and the computed output of the previous element is aggregately retained in the internal memory of the RNN until the end of the sequence. Based on this, RNN enables
compressing the information and producing a fixed-size vector to represent a sequence.
The recurrence operation of RNN is advantageous for series data since the inherent information of a sequence can be effectively captured. Unlike CNN, RNN is more flexible to
model a variable length of a sequence that can capture unbounded contextual information.
However, the authors of [76] criticized that the recurrent-based model may be problematic
in handling the long-range dependencies in data due to the memory compression issue in
which the neural network struggles to compress all the necessary information from a long
sequence input into a fixed-length vector. In other words, it is difficult to represent the en-
tire input sequence without any information loss using the fixed-length vector. Despite the
help of the gated activation function, the forgetting issue of RNN-based models becomes
serious as the length of input sequence grows. Based on this, the attention mechanism was
proposed to deal with the long-range dependencies issue by enabling the model to focus on
the relevant part of input sequence when predicting a certain part of the output sequence.
According to [77], the attention mechanism was used to simulate visual attention
where humans usually adjust their focal point over time to perceive a “high resolution”
when focusing on a particular region of an image but perceive a “low resolution” for
the surrounding image. Similarly, the attention mechanism enables the model to learn
to assign different weights according to their contribution and may capture asymmetric
influence between the features to mitigate the multicollinearity problem. For example,
the authors of [78] proposed a CNN based on deep factorization machine and attention
mechanism (FA-CNN) to enhance feature learning. In addition to capturing temporal
influence, the attention mechanism enables modeling of the intraday interaction between
the input features. The result showed a 7.38% improvement over LSTM in predicting
stock movement.
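The core weighting step of the attention mechanism can be sketched in a few lines of NumPy; the toy feature matrix and dimensions below are assumptions, and real models learn the query, key, and value projections rather than drawing them at random.

```python
import numpy as np

rng = np.random.default_rng(6)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 5 input features (e.g., collinear indicators), each embedded in 4 dimensions.
features = rng.normal(size=(5, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))

Q, K, V = features @ Wq, features @ Wk, features @ Wv
scores = Q @ K.T / np.sqrt(K.shape[1])   # pairwise relevance between features
weights = softmax(scores, axis=-1)       # each row sums to 1: learned, asymmetric influence
context = weights @ V                    # features re-expressed as weighted combinations

print(np.round(weights, 2))
```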
Recently, there has been another promising line of research that applies Graph Convolutional Networks (GCN) or graph embeddings to series data. Graph neural networks convert series data into graph-structured data while enabling the model to capture the interconnectivity between
the nodes. The interconnectivity or correlation modeling is relatively useful in reducing the
multicollinearity effect. For example, the authors of [79] proposed the hierarchical graph
attention network (HATS) to process the relational data for stock market prediction. Their
study defined the stock market graph as a spatial–temporal graph where each individual
stock (company) is regarded as a node. Each node feature represented the current state of
each company in response to its price movement and the state is dynamic over time. Based
on this, HATS can selectively aggregate important information from various relation data to
represent the company as a node. Thereafter, the model is trained to learn the interrelation
between nodes before feeding into a task-specific layer for prediction. Table 4 provides a
comprehensive summary of the machine learning approaches reviewed.
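Purely as an assumed illustration of attention-style neighborhood aggregation on a small stock graph (not the HATS architecture itself), the sketch below weights each node's related neighbors and averages their features.

```python
import numpy as np

rng = np.random.default_rng(7)

n_stocks, dim = 4, 3
node_feat = rng.normal(size=(n_stocks, dim))     # current state of each company
adj = np.array([[0, 1, 1, 0],                    # which companies are related
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

scores = node_feat @ node_feat.T                 # similarity between node states
scores = np.where(adj > 0, scores, -np.inf)      # restrict attention to neighbors
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = np.where(adj > 0, weights, 0.0)
weights /= weights.sum(axis=1, keepdims=True)

aggregated = weights @ node_feat                 # each node summarizes its neighborhood
print(np.round(aggregated, 2))
```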

Table 4. A summary of machine learning approaches on solving multicollinearities.

Author | Year | Objective | Method
Huynh and Won [65] | 2011 | Multi-objective optimization function to minimize error | Regularized OS-ELM algorithm
Garg and Tai [63] | 2012 | Hybrid method of PCA and ANN | Factor analysis-artificial neural network (FA-ANN)
Ye et al. [66] | 2013 | Input weight that changes with time | OS-ELM Time-varying (OS-ELM-TV)
Gu et al. [67] | 2014 | Penalty factor in the weight adjustment matrix | Timeliness Online Sequential ELM algorithm
Guo et al. [68] | 2014 | Smoothing parameter to adjust output weight | Least Squares Incremental ELM algorithm
Hoseinzade and Haratizadeh [72] | 2019 | Model the correlation among different features from a diverse set of inputs | CNN-pred
Kim and Kim [73] | 2019 | Using features from different representations of the same data to predict the stock movement | LSTM-CNN
Nóbrega and Oliveira [70] | 2019 | Kalman filter to adjust output weight | Kalman Learning Machine (KLM)
Hua [80] | 2020 | Decision tree to select features | XGBoost
Obite et al. [62] | 2020 | Compare ANN and OLSR in presence of multicollinearity | Artificial neural network
Zhang et al. [78] | 2021 | Applied attention to capture the intraday interaction between input features | CNN-deep factorization machine and attention mechanism (FA-CNN)
Mahadi et al. [69] | 2022 | Regularization parameter that varies with time | Regularized Recursive least-squares
4. Conclusions
Most methods of solving for multicollinearity can be categorized as one of two approaches: variable selection and modified estimators. This paper detailed the development of
different methods over the years. Variable selection has the benefit of being simple to
perform and can result in a sparse model. This makes it easy to interpret and does not
overfit. The disadvantage is that the selection is very discretionary. There is also the
underlying assumption that there is a best model when a model with different variables can
be equally good. Recent papers on solving multicollinearity capitalize on better computing
power. The subset selection problem can be presented as an optimization problem where
they search for the least redundant variables to reduce multicollinearity. In addition, they are also able to optimize for the most relevant variables. They are based on criteria developed
from previous literature that can represent relevance and redundancy. In this way, problems
where there is high dimensionality (more variables than observations) can be handled.
Modifying the estimators is a very broad approach and is more complex. There is a different modified estimator for every functional form of the data. One of the main problems
of modifying estimators is interpretability. It is very difficult to explain coefficients that are
close to but not zero. However, it is more robust and performs better in the presence of
multicollinearity. Some modified estimators can even perform variable selection.
The authors of [81] performed a stress test experiment on the following variable
selection methods: Stepwise, Ridge, Lasso, Elastic Net, LARs, and Genetic algorithms.
They compared their performance according to several quality measures on a few synthetic
datasets. The authors of [82] compared various statistical and machine learning methods. It
is important to note that comparisons between methods have their drawback. For example,
different tuning parameters can affect performances of the methods. The author of [14]
considered domain knowledge in the studied field to be important in selecting variables
as statistics alone is not enough in practice. All the methods perform to different degrees when dealing with different types of data.
Both variable selection and modified estimators can be used together. The number
of features can be rapidly reduced to below the number of samples and then modified
estimators can be applied. This can be seen in machine learning papers. The findings in this
review paper are that variable selection drops variables and reduces information gain, while
the multicollinearity measures to optimize are subjective. In addition, modified estimators
have inconsistent performance depending on the data and are not able to be applied in
every problem. The literature review also showed that machine learning algorithms are
better than the simple OLS estimator in fitting data with multicollinearity. They do not
need to have information on the relationships among the data or the distribution. This
paper suggests that the relevancy and redundancy concept from feature selection can be
adopted when training a machine learning model.

Author Contributions: J.Y.-L.C. and S.M.H.L. investigated the ideas, reviewed the systems and methods,
and wrote the manuscript. K.T.B. provided the survey studies and methods. W.K.C. conceived the
presented ideas and wrote the manuscript with support from Y.-L.C., S.W.P.; and Z.-W.H. provided
the suggestions on the experiment setup and provided the analytical results. J.Y.-L.C. and Y.-L.C. both
provided suggestions on the research ideas, analytical results, wrote the manuscript, and provided
funding supports. All authors have read and agreed to the published version of the manuscript.
Funding: This work was funded by the Fundamental Research Grant Scheme provided by the
Ministry of Higher Education of Malaysia under grant number FRGS/1/2019/STG06/UTAR/03/1.
This work was also supported by the Ministry of Science and Technology in Taiwan, grants MOST-109-
2628-E-027-004–MY3 and MOST-110-2218-E-027-004, and also supported by the Ministry of Education
of Taiwan under Official Document No. 1100156712 entitled “The study of artificial intelligence and
advanced semiconductor manufacturing for female STEM talent education and industry-university
value-added cooperation promotion”.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: No potential conflict of interest was reported by the authors.

References
1. Schroeder, M.A.; Lander, J.; Levine-Silverman, S. Diagnosing and dealing with multicollinearity. West. J. Nurs. Res. 1990, 12,
175–187. [CrossRef] [PubMed]
2. Algamal, Z.Y. Biased estimators in Poisson regression model in the presence of multicollinearity: A subject review. Al-Qadisiyah J.
Adm. Econ. Sci. 2018, 20, 37–43.
3. Bollinger, J. Using bollinger bands. Stock. Commod. 1992, 10, 47–51.
4. Iba, H.; Sasaki, T. Using genetic programming to predict financial data. In Proceedings of the 1999 Congress on Evolutionary
Computation-CEC99 (Cat. No. 99TH8406), Washington, DC, USA, 6–9 July 1999; pp. 244–251.
5. Lafi, S.; Kaneene, J. An explanation of the use of principal-components analysis to detect and correct for multicollinearity. Prev.
Vet. Med. 1992, 13, 261–275. [CrossRef]
6. Alin, A. Multicollinearity. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 370–374. [CrossRef]
7. Mason, C.H.; Perreault, W.D., Jr. Collinearity, power, and interpretation of multiple regression analysis. J. Mark. Res. 1991, 28,
268–280. [CrossRef]
8. Neter, J.; Kutner, M.H.; Nachtsheim, C.J.; Wasserman, W. Applied Linear Statistical Models; WCB McGraw-Hill: New York, NY,
USA, 1996.
9. Weisberg, S. Applied Linear Regression; John Wiley & Sons: Hoboken, NJ, USA, 2005; Volume 528.
10. Belsley, D.A.; Kuh, E.; Welsch, R.E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity; John Wiley & Sons:
Hoboken, NJ, USA, 2005.
11. Tamura, R.; Kobayashi, K.; Takano, Y.; Miyashiro, R.; Nakata, K.; Matsui, T. Best subset selection for eliminating multicollinearity.
J. Oper. Res. Soc. Jpn. 2017, 60, 321–336. [CrossRef]
12. Askin, R.G. Multicollinearity in regression: Review and examples. J. Forecast. 1982, 1, 281–292. [CrossRef]
13. Ralston, A.; Wilf, H.S. Mathematical Methods for Digital Computers; Wiley: New York, NY, USA, 1960.
14. Hamaker, H. On multiple regression analysis. Stat. Neerl. 1962, 16, 31–56. [CrossRef]
15. Hocking, R.R.; Leslie, R. Selection of the best subset in regression analysis. Technometrics 1967, 9, 531–540. [CrossRef]
16. Gorman, J.W.; Toman, R. Selection of variables for fitting equations to data. Technometrics 1966, 8, 27–51. [CrossRef]
17. Mallows, C. Choosing Variables in a Linear Regression: A Graphical Aid; Central Regional Meeting of the Institute of Mathematical
Statistics: Manhattan, KS, USA, 1964.
18. Kashid, D.; Kulkarni, S. A more general criterion for subset selection in multiple linear regression. Commun. Stat.-Theory Methods
2002, 31, 795–811. [CrossRef]
19. Montgomery, D.C.; Peck, E.A.; Vining, G.G. Introduction to Linear Regression Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2021.
20. Vrieze, S.I. Model selection and psychological theory: A discussion of the differences between the Akaike information criterion
(AIC) and the Bayesian information criterion (BIC). Psychol. Methods 2012, 17, 228. [CrossRef] [PubMed]
21. Misra, P.; Yadav, A.S. Improving the classification accuracy using recursive feature elimination with cross-validation. Int. J. Emerg.
Technol. 2020, 11, 659–665.
22. Wold, H. Soft modeling: The basic design and some extensions. Syst. Under Indirect Obs. 1982, 2, 343.
23. Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130.
[CrossRef]
24. Chong, I.-G.; Jun, C.-H. Performance of some variable selection methods when multicollinearity is present. Chemom. Intell. Lab.
Syst. 2005, 78, 103–112. [CrossRef]
25. Maitra, S.; Yan, J. Principle component analysis and partial least squares: Two dimension reduction techniques for regression.
Appl. Multivar. Stat. Models 2008, 79, 79–90.
26. Onur, T. A Comparative Study on Regression Methods in the presence of Multicollinearity. İstatistikçiler Derg. İstatistik Ve Aktüerya
2016, 9, 47–53.
27. Li, C.; Wang, H.; Wang, J.; Tai, Y.; Yang, F. Multicollinearity problem of CPM communication signals and its suppression method
with PLS algorithm. In Proceedings of the Thirteenth ACM International Conference on Underwater Networks & Systems,
Shenzhen, China, 3–5 December 2018; pp. 1–5.
28. Willis, M.; Hiden, H.; Hinchliffe, M.; McKay, B.; Barton, G.W. Systems modelling using genetic programming. Comput. Chem. Eng.
1997, 21, S1161–S1166. [CrossRef]
29. Castillo, F.A.; Villa, C.M. Symbolic regression in multicollinearity problems. In Proceedings of the 7th Annual Conference on
Genetic and Evolutionary Computation, Washington, DC, USA, 25–29 June 2005; pp. 2207–2208.
30. Bies, R.R.; Muldoon, M.F.; Pollock, B.G.; Manuck, S.; Smith, G.; Sale, M.E. A genetic algorithm-based, hybrid machine learning
approach to model selection. J. Pharmacokinet. Pharmacodyn. 2006, 33, 195. [CrossRef] [PubMed]
31. Katrutsa, A.; Strijov, V. Comprehensive study of feature selection methods to solve multicollinearity problem according to
evaluation criteria. Expert Syst. Appl. 2017, 76, 1–11. [CrossRef]
32. Hall, M.A. Correlation-Based Feature Selection for Machine Learning; The University of Waikato: Hamilton, New Zealand, 1999.
33. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and
min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [CrossRef] [PubMed]
34. Senawi, A.; Wei, H.-L.; Billings, S.A. A new maximum relevance-minimum multicollinearity (MRmMC) method for feature
selection and ranking. Pattern Recognit. 2017, 67, 47–61. [CrossRef]
35. Tamura, R.; Kobayashi, K.; Takano, Y.; Miyashiro, R.; Nakata, K.; Matsui, T. Mixed integer quadratic optimization formulations
for eliminating multicollinearity based on variance inflation factor. J. Glob. Optim. 2019, 73, 431–446. [CrossRef]
36. Zhao, N.; Xu, Q.; Tang, M.L.; Jiang, B.; Chen, Z.; Wang, H. High-dimensional variable screening under multicollinearity. Stat 2020,
9, e272. [CrossRef]
37. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 2008, 70, 849–911.
[CrossRef]
38. Chen, C.W.; Tsai, Y.H.; Chang, F.R.; Lin, W.C. Ensemble feature selection in medical datasets: Combining filter, wrapper, and
embedded feature selection results. Expert Syst. 2020, 37, e12553. [CrossRef]
39. Larabi-Marie-Sainte, S. Outlier Detection Based Feature Selection Exploiting Bio-Inspired Optimization Algorithms. Appl. Sci.
2021, 11, 6769. [CrossRef]
40. Singh, S.G.; Kumar, S.V. Dealing with Multicollinearity problem in analysis of side friction characteristics under urban heteroge-
neous traffic conditions. Arab. J. Sci. Eng. 2021, 46, 10739–10755. [CrossRef]
41. Horel, A. Applications of ridge analysis to regression problems. Chem. Eng. Prog. 1962, 58, 54–59.
42. Duzan, H.; Shariff, N.S.B.M. Ridge regression for solving the multicollinearity problem: Review of methods and models. J. Appl.
Sci. 2015, 15, 392–404. [CrossRef]
43. Assaf, A.G.; Tsionas, M.; Tasiopoulos, A. Diagnosing and correcting the effects of multicollinearity: Bayesian implications of ridge
regression. Tour. Manag. 2019, 71, 1–8. [CrossRef]
44. Roozbeh, M.; Arashi, M.; Hamzah, N.A. Generalized cross-validation for simultaneous optimization of tuning parameters in
ridge regression. Iran. J. Sci. Technol. Trans. A Sci. 2020, 44, 473–485. [CrossRef]
45. Singh, B.; Chaubey, Y.; Dwivedi, T. An almost unbiased ridge estimator. Sankhyā Indian J. Stat. Ser. B 1986, 48, 342–346.
46. Kejian, L. A new class of biased estimate in linear regression. Commun. Stat.-Theory Methods 1993, 22, 393–402. [CrossRef]
47. Liu, K. Using Liu-type estimator to combat collinearity. Commun. Stat.-Theory Methods 2003, 32, 1009–1020. [CrossRef]
48. Inan, D.; Erdogan, B.E. Liu-type logistic estimator. Commun. Stat.-Simul. Comput. 2013, 42, 1578–1586. [CrossRef]
49. Huang, J.; Yang, H. A two-parameter estimator in the negative binomial regression model. J. Stat. Comput. Simul. 2014, 84,
124–134. [CrossRef]
50. Türkan, S.; Özel, G. A new modified Jackknifed estimator for the Poisson regression model. J. Appl. Stat. 2016, 43, 1892–1905.
[CrossRef]
51. Chandrasekhar, C.; Bagyalakshmi, H.; Srinivasan, M.; Gallo, M. Partial ridge regression under multicollinearity. J. Appl. Stat.
2016, 43, 2462–2473. [CrossRef]
52. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [CrossRef]
53. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [CrossRef]
54. Mosier, C.I. Problems and designs of cross-validation 1. Educ. Psychol. Meas. 1951, 11, 5–11. [CrossRef]
55. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499. [CrossRef]
56. Roozbeh, M.; Babaie-Kafaki, S.; Aminifard, Z. A nonlinear mixed-integer programming approach for variable selection in linear
regression model. Commun. Stat.-Simul. Comput. 2021, 1–12. [CrossRef]
57. Roozbeh, M.; Babaie-Kafaki, S.; Aminifard, Z. Improved high-dimensional regression models with matrix approximations applied
to the comparative case studies with support vector machines. Optim. Methods Softw. 2022, 1–18. [CrossRef]
58. Nguyen, V.C.; Ng, C.T. Variable selection under multicollinearity using modified log penalty. J. Appl. Stat. 2020, 47, 201–230.
[CrossRef]
59. Kibria, B.; Lukman, A.F. A new ridge-type estimator for the linear regression model: Simulations and applications. Scientifica
2020, 2020, 9758378. [CrossRef]
60. Arashi, M.; Norouzirad, M.; Roozbeh, M.; Mamode Khan, N. A High-Dimensional Counterpart for the Ridge Estimator in
Multicollinear Situations. Mathematics 2021, 9, 3057. [CrossRef]
61. Qaraad, M.; Amjad, S.; Manhrawy, I.I.; Fathi, H.; Hassan, B.A.; El Kafrawy, P. A hybrid feature selection optimization model for
high dimension data classification. IEEE Access 2021, 9, 42884–42895. [CrossRef]
62. Obite, C.; Olewuezi, N.; Ugwuanyim, G.; Bartholomew, D. Multicollinearity effect in regression analysis: A feed forward artificial
neural network approach. Asian J. Probab. Stat. 2020, 6, 22–33. [CrossRef]
63. Garg, A.; Tai, K. Comparison of regression analysis, artificial neural network and genetic programming in handling the
multicollinearity problem. In Proceedings of the International Conference on Modelling, Identification and Control, Wuhan, China,
24–26 June 2012; pp. 353–358.
64. Kim, J.-M.; Wang, N.; Liu, Y.; Park, K. Residual control chart for binary response with multicollinearity covariates by neural
network model. Symmetry 2020, 12, 381. [CrossRef]
65. Huynh, H.T.; Won, Y. Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks.
Pattern Recognit. Lett. 2011, 32, 1930–1935. [CrossRef]
66. Ye, Y.; Squartini, S.; Piazza, F. Online sequential extreme learning machine in nonstationary environments. Neurocomputing 2013,
116, 94–101. [CrossRef]
67. Gu, Y.; Liu, J.; Chen, Y.; Jiang, X.; Yu, H. TOSELM: Timeliness online sequential extreme learning machine. Neurocomputing 2014,
128, 119–127. [CrossRef]
68. Guo, L.; Hao, J.-h.; Liu, M. An incremental extreme learning machine for online sequential learning problems. Neurocomputing
2014, 128, 50–58. [CrossRef]
69. Mahadi, M.; Ballal, T.; Moinuddin, M.; Al-Saggaf, U.M. A Recursive Least-Squares with a Time-Varying Regularization Parameter.
Appl. Sci. 2022, 12, 2077. [CrossRef]
70. Nobrega, J.P.; Oliveira, A.L. A sequential learning method with Kalman filter and extreme learning machine for regression and
time series forecasting. Neurocomputing 2019, 337, 235–250. [CrossRef]
71. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86,
2278–2324. [CrossRef]
72. Hoseinzade, E.; Haratizadeh, S. CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Syst. Appl.
2019, 129, 273–285. [CrossRef]
73. Kim, T.; Kim, H.Y. Forecasting stock prices with a feature fusion LSTM-CNN model using different representations of the same
data. PLoS ONE 2019, 14, e0212320. [CrossRef] [PubMed]
74. Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [CrossRef]
75. Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput.
Intell. Mag. 2018, 13, 55–75. [CrossRef]
76. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
77. Zhang, L.; Wang, S.; Liu, B. Deep learning for sentiment analysis: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018,
8, e1253. [CrossRef]
78. Zhang, X.; Liu, S.; Zheng, X. Stock Price Movement Prediction Based on a Deep Factorization Machine and the Attention
Mechanism. Mathematics 2021, 9, 800. [CrossRef]
79. Kim, R.; So, C.H.; Jeong, M.; Lee, S.; Kim, J.; Kang, J. Hats: A hierarchical graph attention network for stock movement prediction.
arXiv 2019, arXiv:1908.07999.
80. Hua, Y. An efficient traffic classification scheme using embedded feature selection and LightGBM. In Proceedings of the Information
Communication Technologies Conference (ICTC), Nanjing, China, 29–31 May 2020; pp. 125–130.
81. Katrutsa, A.; Strijov, V. Stress test procedure for feature selection algorithms. Chemom. Intell. Lab. Syst. 2015, 142, 172–183.
[CrossRef]
82. Garg, A.; Tai, K. Comparison of statistical and machine learning methods in modelling of data with multicollinearity. Int. J. Model.
Identif. Control 2013, 18, 295–312. [CrossRef]
