
Accepted Manuscript

Design of Experiments and Response Surface Methodology to Tune Machine Learning Hyperparameters, with a Random Forest Case-Study

Gustavo A. Lujan-Moreno, Phillip R. Howard, Omar G. Rojas, Douglas C. Montgomery

PII: S0957-4174(18)30317-8
DOI: 10.1016/j.eswa.2018.05.024
Reference: ESWA 11977

To appear in: Expert Systems With Applications

Received date: 25 November 2017


Revised date: 20 May 2018
Accepted date: 21 May 2018

Please cite this article as: Gustavo A. Lujan-Moreno, Phillip R. Howard, Omar G. Rojas,
Douglas C. Montgomery, Design of Experiments and Response Surface Methodology to Tune
Machine Learning Hyperparameters, with a Random Forest Case-Study, Expert Systems With
Applications (2018), doi: 10.1016/j.eswa.2018.05.024

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Highlights
• Design of experiments identified significant hyperparameters in the random forest
• No. of features and sampling with replacement were discarded in the screening
• Interaction between class weights and cutoff had the largest effect on the response
• Response surface methodology correctly tuned random forest hyperparameters
• The methodology achieved an outstanding 0.81 cross-validated BACC vs default of 0.64


Design of Experiments and Response Surface Methodology to Tune Machine Learning Hyperparameters, with a Random Forest Case-Study

Corresponding author: Gustavo A. Lujan-Moreno a,b
a Universidad Panamericana. Escuela de Ciencias Económicas y Empresariales
Prolongación Calzada Circunvalación Poniente 49
Zapopan, Jalisco, 45010, México
b Intel Corporation
Avenida del Bosque 1001, El Bajio
Zapopan, Jalisco, 45019, México
email: lujangus@hotmail.com / Telephone: +52 1 33 6059 8818

Phillip R. Howard b
b Intel Corporation
5000 W Chandler Blvd
Chandler, AZ, 85226, USA
email: prhoward@asu.edu

Omar G. Rojas a
a Universidad Panamericana. Escuela de Ciencias Económicas y Empresariales
Prolongación Calzada Circunvalación Poniente 49
Zapopan, Jalisco, 45010, México
email: orojas@up.edu.mx

Douglas C. Montgomery c
c School of Computing, Informatics and Decision Systems, Arizona State University
Brickyard Engineering, 699 S Mill Ave
Tempe, AZ, 85281, USA
email: doug.montgomery@asu.edu


Declarations of interest: none


Abstract

Most machine learning algorithms possess hyperparameters. For example, an artificial neural network requires the determination of the number of hidden layers, nodes, and many other parameters related to the model fitting process. Despite this, there is still no clear consensus on how to tune them. The most popular methodology is an exhaustive grid search, which can be highly inefficient and sometimes infeasible. Another common solution is to change one hyperparameter at a time and measure its effect on the model's performance. However, this can also be inefficient and does not guarantee optimal results since it ignores interactions between the hyperparameters. In this paper, we propose to use the Design of Experiments (DOE) methodology (factorial designs) for screening and Response Surface Methodology (RSM) to tune a machine learning algorithm's hyperparameters. An application of our methodology is presented with a detailed discussion of the results of a random forest case-study using a publicly available dataset. Benefits include fewer training runs, better parameter selection, and a disciplined approach based on statistical theory.

Keywords: design of experiments, hyperparameters, machine learning, random forest, response surface methodology, tuning.

1 Introduction

Hyperparameter tuning is essential for optimizing the performance of any machine learning
(ML) algorithm. Despite this importance, understanding how hyperparameters interact with

model performance remains an open research question. Several efforts have been made to
fine-tune specific ML algorithms. For example, in [Lalor et al., 2017], an approach for tuning
deep neural networks is proposed by using subsets of data to pre-train a model. A gradient-
based approach for neural network model tuning was proposed that calculates the derivatives

of the cross-validation error with respect to model hyperparameters in stochastic gradient de-
scent [Maclaurin et al., 2015]. In [Nickson et al., 2014], the authors proposed a stochastic algo-
rithm tuning method for big data when training with a complete dataset is not feasible. In that

study, the authors used subsets of data in order to train the algorithm. However, this method
is only applicable to Gaussian process time series data. The authors of [Bardenet et al., 2013]
presented a hyperparameter selection algorithm which incorporates knowledge learned from

previous experiments using surrogate-based ranking and optimization techniques. A Gaussian


process-based Bayesian optimization method for selecting numerical hyperparameters has also
been proposed in [Snoek et al., 2012].
Random search is a method that has recently become a popular alternative to grid search.
The authors in [Bergstra and Bengio, 2012] claim that random search is more efficient than
grid search because typically only a subset of a model’s tunable hyperparameters is important
for optimizing performance. However, random search was shown to be unreliable for tuning
the hyperparameters of deep belief networks (DBNs) [Bergstra et al., 2011]; the authors of that


work introduced two greedy sequential selection strategies which outperformed both random
search and human-guided searches for DBNs. However, these approaches do not measure the
effect that each of the hyperparameters has on model performance, and they also ignore pos-
sible interactions between hyperparameters. For example, the effect that the number of nodes
in a neural network has on model performance can depend on the number of hidden layers.
Consequently, the current tuning strategies can be seen as optimizing an unknown black-box

function [Snoek et al., 2012] while largely ignoring the question of understanding the internal
behavior of the system.

In this paper, we propose a design of experiments (DOE) methodology as the first step to
screen the most significant hyperparameters (factors) of a ML algorithm. Reducing the number

of factors to a subset which has the greatest effect on model performance considerably reduces
the number of model-fitting runs in the next round of hyperparameter tuning experiments. The
screening phase is done using fractional factorial designs, which are well-suited for scenarios
in which we do not have the luxury of running many experiments; screening may also be done
using other designs, as explained at the end of Section 2.1. Once the main factors are identified,
a full factorial experiment can be run on the factors as a confirmatory procedure. The second
phase of our method consists of applying response surface methodology (RSM) to model a first-
or second-order polynomial which approximates the performance of the model given different
hyperparameter configurations. An application of our methodology is presented with a detailed
discussion of the results of a random forest case-study using a publicly available dataset. A
total of seven hyperparameters were chosen as initial candidate factors, and balanced accuracy

(BACC) was selected as the target to optimize for the random forest classifier.
A DOE-inspired algorithm for selecting Support Vector Machine (SVM) parameters has
been previously proposed [Staelin, 2003]. The algorithm iteratively refines the boundaries and

resolution of a search grid over which different parameter values are tested. This has been
shown to result in performance nearly as good as grid search while requiring substantially fewer
model evaluations, with demonstrated success in applications of least squares SVM regression

to credit scoring [Zhou et al., 2009], credit risk evaluation [Yu et al., 2011], and crude oil price
forecasting [Yu et al., 2017]. The main difference between our methodology and this prior work
is that we address the case of having many hyperparameters to tune (SVMs have only two

tunable parameters) and we provide a more general framework for using statistical inference
to guide the parameter search process by focusing on those which have a significant effect
on model performance. Thus, the primary contribution of our work is the demonstration of

how DOE and RSM can be used to reduce the complexity of tuning machine learning models
which require the selection of more hyperparameters than can be evaluated under all possible
combinations.
The rest of this paper is organized as follows: in section 2 the DOE and RSM concepts are
introduced as well as the random forest algorithm and several performance metrics. The main
experiment is explained in section 3. Results from the experiments are provided in section 4.
Finally, a discussion of results and concluding remarks is given in section 5.


2 Background
2.1 Design of Experiments
An experiment is a series of systematic tests which attempt to find the factors which have the
largest effect on a response variable [Montgomery, 2017]. Once these factors have been iden-
tified, the main objective of the DOE methodology is to optimize this response variable. The

design of these experiments involves carefully selecting the variables, their ranges, and the

number of experiment runs in order to identify the relationship between factors and the re-
sponse variable. Traditionally, the effect of factors on the response variable has been tested by

altering the levels of one factor at a time while the other factors are held constant. However,
this approach is inefficient and misses information about possible interactions. For example,
in Figure 1 we see that the effect of factor B on the response is not affected by the levels of
factor A. On the other hand, in Figure 2 a negative interaction between factors can be observed

because the response decreases when both factors have the same sign and increases when the
factors’ levels are different. A positive interaction is also possible and happens when the re-
sponse increases when the levels of the two factors are the same and decreases when they are
different. This is often overlooked in hyperparameter tuning efforts.

Figure 1: A factorial experiment without interaction
Figure 2: A factorial experiment with a negative interaction

A response variable may be impacted by both controllable and uncontrollable factors. The
controllable factors are those for which an experimenter can alter its level while the uncontrol-
lable input factors are those variables that cannot be controlled by the experimenter, but can be
monitored and included in the statistical model in order to account for some of the variation.
According to [Montgomery, 2017], the three basic principles of DOE are randomization, repli-
cation and blocking. These three principles apply when it is desirable to understand a physical
process as well as computer simulated experiments. The principle of randomization indicates


that experiments should be run in random order to prevent external factors from affecting the
results. For example, we could be interested in measuring a performance metric such as the
total training time for a machine learning algorithm on a Hadoop cluster. The total training time
on the cluster could be affected by other people utilizing the resources on the cluster at the same
time the machine learning algorithm is running or by a failure of cluster nodes during some of
the experiments. If we do not randomize the order of the trials, the effect of the Hadoop clus-

ter health or performance could be incorrectly attributed to one of the factors being measured.
Replication is also very important as many ML algorithms are not deterministic; randomness

in model training often means that we will not obtain the same result twice even if the training
is done with the same hyperparameter levels and on the same dataset. Replication in computer

simulated experiments allows us to compute the internal standard error and it makes the com-
putation of the results more accurate as long as we run the experiments with different sequences
of pseudorandom numbers. Lastly, blocking is also important and applicable in many simulated
designs. For example, perhaps we are interested in training a model on two datasets located
in different facilities and for security reasons, we cannot transfer data from one facility to the
other. In this scenario, we would likely assign the two facilities as the block variable in order to
reduce the variability transmitted from this nuisance factor.
The most basic type of experiment is the two-level factorial design ($2^k$), in which each of the $k$ factors is set to two different levels for experimentation: low and high. In the case of the simplest experiment with only two factors, $2^2 = 4$ runs of the experiment are needed. With these few runs, we are able to identify which of the main effects are significant and also if the response variable is affected by their interactions. The effects model of a two-factor factorial design can be represented as:

$$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \epsilon_{ijk},$$

for $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$ and $k = 1, 2, \ldots, n$, where $\mu$ is the overall mean effect, $\tau_i$ is the effect of the $i$-th level of factor A, $\beta_j$ is the effect of the $j$-th level of factor B, $(\tau\beta)_{ij}$ is the interaction effect and $\epsilon_{ijk}$ is the random error. As a matter of fact, the effects model can also be represented as an ordinary regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon,$$

where $\beta_1$ and $\beta_2$ are related to the main effects and $\beta_{12}$ to the interaction.
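For illustration, the regression form can be fitted directly with ordinary least squares. The short R sketch below uses made-up responses for a replicated $2^2$ design in coded units; the data are purely illustrative and not part of our experiments.

```r
# Hypothetical replicated 2^2 factorial in coded units (-1, +1); responses are made up
doe <- data.frame(
  x1 = rep(c(-1,  1, -1,  1), times = 2),
  x2 = rep(c(-1, -1,  1,  1), times = 2),
  y  = c(0.61, 0.64, 0.66, 0.58,    # replicate 1
         0.60, 0.65, 0.67, 0.57)    # replicate 2
)

# y = b0 + b1*x1 + b2*x2 + b12*x1*x2 + error; the formula x1 * x2 expands to both
# main effects and their interaction, so the x1:x2 coefficient estimates b12
fit <- lm(y ~ x1 * x2, data = doe)
summary(fit)   # t-tests indicate which effects are significant
```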

It is easy to see that as the number of factors k increases, the number of experimental
runs grows exponentially. The fractional factorial design is an alternative to the full factorial
design where fewer runs are needed. In this type of design, only a fraction of the original
experiment is run. However, this represents a trade-off with accuracy because we have fewer
degrees of freedom to evaluate each factor and every possible interaction, which causes some
of the estimated main effects to be aliased or confounded with other higher-order interactions.
Fractional factorial designs are powerful screening methods which often provide a wealth of


information about the main effects as well as partial information about higher-order interactions.
Fractional factorials are popular because they have 3 unique characteristics that make them
highly efficient: first, the sparsity of effects principle states that only a small number of effects
are significant and that the final model tends to be composed of lower-order terms instead of
higher-order ones [Wu and Hamada, 2009]; second, the projection property states that a design
can be projected to a lower dimension using a subset of factors, making the reduced design

generally stronger than the original; third, fractional factorial experiments can be combined to
form designs of higher resolution using a technique called fold over. A fold over of the original

design is achieved by switching certain signs of the fractional factorials design matrix to isolate
effects of particular interest. For example, in a 26−3 III design with generators I=ABD, I=ACE,

and I=BCF, if we add another fraction of the design with the signs for the column D reversed,
we effectively isolate the effect of D and all of its second-order interactions. On the other hand,
if we add another fraction with all of the signs of the design matrix reversed, we isolate all of
the main effects.
One disadvantage of the $2^k$ factorial design is that it is not able to detect quadratic effects. In order to detect the curvature from second-order effects, we need to add at least a third level to one or more factors in addition to the two levels in a $2^k$ factorial design. Adding center points is a common approach because it does not impact the effect estimates $\hat{\beta}_j$ for $j \geq 1$, while $\hat{\beta}_0$ becomes the average of all observations. The reason center points do not impact the effect estimates is that the design matrix in coded units assigns (-) or -1 to the low level of a factor and (+) or +1 to the high level, while the center points are coded as 0 (the midpoint of the factor's range); therefore, the center points have no effect when computing the contrasts. Hence, adding center points serves two purposes: first, the contrast between the mean of the factorial design ($\bar{y}_f$) and the mean of the center points ($\bar{y}_c$) is used to test for lack of fit because, when the data follow a first-order model, the difference ($\bar{y}_f - \bar{y}_c$) is expected to be zero. Second, the variation at the center points provides an estimate of the pure error. This allows the residual sum of squares to be decomposed as $SS_E = SS_{PE} + SS_{LoF}$, for pure error and lack of fit respectively.
There are other efficient screening designs such as Morris’ elementary effects and sequential
bifurcation [Shi and Kleijnen, 2018]. However, these designs also exhibit a trade-off between

the number of runs and the accuracy of the results. Two-level fractional factorial designs iden-
tify more precisely the main effects and some higher-order interactions. The main drawback
of the highly efficient Morris’ elementary effects model is that it is only capable of identify-

ing main effects, although an extended approach has been proposed to identify higher-order
interactions [Fédou and Rendas, 2015]. The bifurcation approach requires fewer runs, but the
experimenter needs to know the direction of the effects a priori to ensure that the first-order
polynomial approximations are monotonic. Another advantage of the two-level fractional fac-
torial designs is that we can isolate effects in which we are particularly interested by combin-
ing fractional factorial designs (i.e. switching certain signs of alias structure). For example,
if we are interested in isolating only one effect from a two-factor interaction, we could do a
single-factor fold over. On the other hand, if we would like to isolate all of the main effects of


a resolution III fractional factorial design, we could just reverse the signs for all factors in the
alias structure to obtain a full fold over. Other alternatives include the Placket-Burman designs,
which are nonregular designs. This type of design can be run in multiples of four rather than just
multiples of 2k−p . In nonregular designs, some of the effects are partially aliased with other ef-
fects (not completely confounded); this occurs because some of the constants in the alias chains
are not equal to ± 1. Moreover, two-level fractional factorials including the Placket-Burman

T
designs have a property called projectivity, which means that they can be collapsed into a full
factorial design to find the correct subset factors. Ultimately the experimenter can choose any

suitable screening method based upon the number of available runs, the desired precision of
the results, and the potential reuse of previous runs (fold over designs, projectivity). Replicate

runs, which are independent repeated runs of each combination in the design, are recommended
whenever possible. They should be run in random order and provide two main advantages:
first, replicates allow estimation of the experimental error; and second, they offer more precise
estimation of the effects represented by the average of each combination or corner point in the
design.
2.2 Response Surface Methodology
RSM is closely related to DOE and is used to model a surface using statistical techniques for the purpose of optimizing a response [Myers et al., 2016]. For instance, if a researcher is looking for the values of two variables, $x_1$, $x_2$, which maximize or minimize the response, $y$, of a process, then $y$ is a function of the levels of the predictors:

$$y = f(x_1, x_2) + \epsilon,$$

where the error observed in the response $y$ is denoted by $\epsilon$. If we let $E(y) = f(x_1, x_2) = \eta$ be the expected value, then the response surface is:

$$\eta = f(x_1, x_2).$$

Since the shape of the surface is unknown a priori, one of the first steps is to find a model which fits the relationship between the predictors and the response using a polynomial function. A low-order model is often sufficient to describe such a relationship, with first- and second-order models being the most popular choices. A first-order model is represented as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon.$$

The second-order model is usually preferred since it works very well in modeling curvature around promising regions. The approximating function is:

$$y = \beta_0 + \sum_{j=1}^{k} \beta_j x_j + \sum_{j=1}^{k} \beta_{jj} x_j^2 + \sum_{i<j} \beta_{ij} x_i x_j + \epsilon.$$


Figure 3 shows an example of a first-order response surface represented as a flat plane. Fig-
ure 4 depicts the contour plot of this response surface where the lines are straight and parallel. In
contrast, Figure 5 shows a second-order response surface and Figure 6 depicts its corresponding
contour plot. A first-order response surface with interactions (not shown here) has the effect of
twisting the plane, which causes non-parallel curves or lines in the contour plot. Curvature on
the response surface is caused by the presence of quadratic effects.


Figure 3: A first-order response surface with estimates $\beta_1 = 9.82$ and $\beta_2 = 4.21$
Figure 4: The contour plot for the first-order model of Fig. 3

Figure 5: A second-order response surface with estimates $\beta_1 = 9.28$, $\beta_2 = 4.21$, $\beta_{1,2} = -7.75$, $\beta_{1,1} = -8.87$ and $\beta_{2,2} = -5.12$
Figure 6: The contour plot for the second-order model of Fig. 5


These low-order models only represent approximations of the real system; it is assumed
that they will behave similarly to the real system in at least a small neighborhood of the sur-
face [Myers et al., 2016]. RSM is a sequential procedure where at each step, we move in a
direction of improvement for our objective (maximization, minimization, or a target value). In
order to accomplish this, the most common method is to follow the direction of steepest ascent
(descent in case of minimization). Under this method, we try to move as efficiently as possi-

ble towards an optimal region of the response surface using a step size and a scale defined by
the experimenter [Myers et al., 2016]. This procedure is repeated several times, following the

path of steepest ascent until no more improvements are found in a local neighborhood. Scale-
independent approaches are also available (See [Kleijnen, 2015]).

The most popular RSM designs are central composite designs (CCD) and Box-Behnken de-
signs (BBD). The CCD can be seen as a factorial design (either full or fractional) with center
points along with star points that extend the cuboidal region of the original factorial design.
These extra points allow estimation of the curvature of the response across different factor lev-
els. On the other hand, BBDs do not contain an embedded factorial design and the experimental
levels are located at the midpoints of the hypercube. Each of these designs has several advan-
tages and disadvantages. CCDs are normally rotatable or spherical, which provides benefits
such as better suitability for blocking. BBDs on the other hand are highly efficient: they some-
times require fewer runs and they have desirable statistical properties such as rotatability or
near-rotatability. BBDs also do not contain the corner points of the design hypercube and there-
fore are suitable for experiments where running trials in those regions is difficult or infeasible.
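To make the two design families concrete, the sketch below generates each of them with the rsm package in R. This is a hedged illustration: ccd() and bbd() are the package's design generators, and their argument defaults may differ slightly across versions.

```r
library(rsm)   # response surface designs and analysis

# Central composite design for 3 coded factors: factorial cube + axial (star) points,
# with center points in both the cube and star portions
ccd_design <- ccd(3, n0 = c(3, 3))

# Box-Behnken design for 3 coded factors with 3 center points (12 + 3 = 15 runs);
# all experimental levels sit at edge midpoints rather than cube corners
bbd_design <- bbd(3, n0 = 3)
nrow(bbd_design)
```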

2.3 Random Forest



Ensemble methods are classification techniques which aim to predict class labels by consid-
ering the predictions of multiple base classifiers [Tan et al., 2006]. Ensemble methods tend to
perform better than traditional classification methods because they weight each vote of the base

classifiers. Random forest (RF) is a type of ensemble method proposed by Leo Breiman in
2001. The base classifiers in RF are fully-grown decision trees which vote for each of the pre-
dictions. The robustness of classification relies on the accuracy of each individual tree as well

as how independent the trees are from each other [Breiman, 2001]. A decision tree is a simple
classifier which discriminates data instances based on a series of feature splits. At each node,
a splitting criterion based on the features is constructed and the instances for which this criterion is true are sent to the right child node while instances for which the criterion is false are sent to the left child node. The top node of the tree is referred to as the root node while nodes at the bottom
of the tree with no further splits are referred to as leaves. In order to train the random forest,
K trees are fully grown with no pruning and the class label of a data instance is determined by
the most popular class across the trees. More formally, a random forest is a collection of weak
learners (decision trees) of the form:

$$h(x, \theta_k) \quad \text{for } k = 1, \ldots, K,$$


where $\theta_k$ are independent and identically distributed random feature vectors. Each of the $K$ trees casts a vote and consensus about the class of input $x$ is reached by vote majority. If the number of trees is sufficiently large, it has been shown that the generalization error is:

$$\text{error} \leq \bar{\rho}(1 - s^2)/s^2,$$

where $\bar{\rho}$ is the correlation between the trees and $s$ is a measurement of the strength of the trees. According to [Breiman, 2001], the advantages of RF include its robustness to outliers
and noise; its internal estimates of error and variable importance via out-of-bag (OOB) samples;

and its efficient training method which can be parallelized because each of the trees is grown
independent of the rest.

2.4 Performance Metrics
Several performance metrics can be used to evaluate the performance of a ML algorithm. For

binary classification problems, these include measures such as the true positive (TP), true neg-
ative (TN), false positive (FP) and false negative (FN) instances. In this study, we will use the
performance metric called balanced accuracy (BACC), which is computed by the formula:

$$BACC = (TPR + TNR)/2,$$

where $TPR = TP/(TP + FN)$ is the true positive rate and $TNR = TN/(TN + FP)$ is the
true negative rate. BACC represents the average of the true positive and true negative rates and

is a good metric to be used with highly-unbalanced data. There are other more advanced metrics
such as the area under the receiving operating characteristic curve (AU-ROC) which considers
the overall performance of the classifier as a threshold is varied. For problems with more than

two classes, a popular metric is the multi-class logarithmic loss, which directly considers the
probability given by the classifier. We use BACC in this study because of the aforementioned
properties, but the techniques proposed in this paper would be useful to tune any ML algorithm
regardless of the chosen metric.
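For concreteness, the BACC computation can be expressed as a small helper function in R; this is our own illustrative code rather than part of any package, and the counts in the example call are arbitrary.

```r
# Balanced accuracy from the four counts of a binary confusion matrix
bacc <- function(tp, tn, fp, fn) {
  tpr <- tp / (tp + fn)   # true positive rate (sensitivity)
  tnr <- tn / (tn + fp)   # true negative rate (specificity)
  (tpr + tnr) / 2
}

bacc(tp = 120, tn = 900, fp = 100, fn = 80)   # 0.75 on a made-up imbalanced split
```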

Another important aspect of measuring performance is determining a proper procedure to


follow in order to have an accurate estimate of the chosen metric. Overfitting is a very common
issue in supervised learning problems, occurring when the model fits very well on the training

data but generalizes poorly to new data. In deep neural networks, for example, overfitting
is a serious problem and several techniques such as early stopping criteria, dropout, weight
penalties and weight sharing have been proposed [Srivastava et al., 2014]. As a consequence,

robust performance estimation methods should be considered such as bootstrapping, leave-one-


out and n-fold cross-validation. However, for real-world data, 10-fold cross-validation has been
consistently found to be the most accurate method to estimate performance [Kohavi, 1995]. In
a n-fold cross-validation the data is segmented into n equally-sized partitions. At each iteration
one of the partitions is used for testing while the rest is used to train the model. This process is
repeated n times and therefore each partition is used for testing exactly once. The error is found
by aggregating the results across all of the iterations [Tan et al., 2006]. Thus, we use 10-fold
cross-validation to estimate the BACC in our experiments.
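A minimal sketch of this estimation procedure is shown below. It assumes the randomForest package introduced in the next subsection, a training data frame train_data with a binary factor income (the dataset of Section 3), and the bacc() helper defined above; fold assignment here is random rather than stratified.

```r
library(randomForest)

set.seed(1)                        # reproducible fold assignment
n_folds <- 10
folds <- sample(rep(1:n_folds, length.out = nrow(train_data)))

fold_bacc <- sapply(1:n_folds, function(k) {
  fit  <- randomForest(income ~ ., data = train_data[folds != k, ])
  pred <- predict(fit, newdata = train_data[folds == k, ])
  cm   <- table(truth = train_data$income[folds == k], pred = pred)
  bacc(tp = cm[2, 2], tn = cm[1, 1], fp = cm[1, 2], fn = cm[2, 1])
})
mean(fold_bacc)                    # cross-validated BACC estimate
```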


2.5 The randomForest package in R


We used the randomForest1 package in R for our experiments. This package is the original
implementation of Leo Breiman’s random forest model and is based on the Fortran code by
Breiman and Adele Cutler. It has several available hyperparameters from which we will only
use seven, as shown in Table 1. We are interested in discovering which of these seven hyperpa-
rameters contributes the most to maximizing the BACC as well as their potential interactions to

gain more knowledge about the RF behavior on a specific dataset.

Table 1: Hyperparameters available in the randomForest package.

Hyperparameter   Description
ntree            Number of trees to grow
mtry             Number of features to use at each split
replace          Sampling with or without replacement
nodesize         Number of instances in each leaf node
classwt          Prior probabilities for each of the classes
cutoff           Threshold for binary classification
maxnodes         Maximum number of nodes per tree
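For reference, a single training run with all seven hyperparameters set explicitly looks roughly as follows. This is a sketch only: the argument names follow the package documentation, the values shown are arbitrary, and train_data / income refer to the dataset introduced in Section 3.

```r
library(randomForest)

rf_fit <- randomForest(
  income ~ .,               # binary target, all remaining columns as predictors
  data     = train_data,
  ntree    = 100,           # number of trees to grow
  mtry     = 2,             # features tried at each split
  replace  = FALSE,         # sample without replacement
  nodesize = 1,             # minimum size of terminal (leaf) nodes
  classwt  = c(1, 10),      # prior weights, one entry per class level
  cutoff   = c(0.8, 0.2),   # vote fractions required per class (must sum to 1)
  maxnodes = 5              # maximum number of terminal nodes per tree
)
```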

3 Experiments

3.1 The dataset


For this experiment we use the adult data set from the UCI ML repository2 . We chose this
PT

dataset because the training set is fairly large with 32,561 instances and it has 14 attributes with
a mixture of continuous and categorical data. The target class is a binary indicator for whether
or not a person makes over $50,000 per year. The attributes include age, marital-status, race, sex

and capital gain. The size of the dataset also makes it ideal to perform a 10-fold cross-validation
and have robust estimates of the BACC.
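A sketch of how the training data can be loaded is given below. The UCI file ships without a header row, so the column names are assigned manually; the names and the exact file path follow the repository's dataset description and may need adjusting.

```r
url  <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
          "marital_status", "occupation", "relationship", "race", "sex",
          "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")

# strip.white removes the leading spaces present in the raw file
train_data <- read.csv(url, header = FALSE, col.names = cols,
                       strip.white = TRUE, stringsAsFactors = TRUE)
train_data$income <- factor(train_data$income)   # binary target: <=50K vs >50K
```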

3.2 Hyperparameter tuning using DOE and RSM


The general procedure for hyperparameter tuning of a ML algorithm using DOE and RSM is
proposed as follows: 1) choose a machine learning algorithm and decide on the response vari-
able to tune (accuracy, $TPR$, $F_1$-score, etc.); 2) select the hyperparameters to tune as well as
1 https://cran.r-project.org/web/packages/randomForest
2 https://archive.ics.uci.edu/ml/datasets/adult


their ranges; 3) perform a screening design (fractional factorials, extended Morris’ method, se-
quential bifurcation) and identify the important factors (hyperparameters); 4) reduce the model
and, depending on the number of experiments that are feasible to run, perform either a full or
fractional 2k factorial design; 5) fit a second-order model using RSM (CCD, BBD), selecting
the hyperparameter configuration with the best performance from the previous step as the cen-
ter of the design; and 6) recursively optimize the second-order model until the change in the

response is $\leq \epsilon$. This optimization step can be performed using ridge analysis, which is very
useful when the stationary point is outside of the experimental region but the experimenter is

interested in locating the optimal response within the boundaries of the experiment. Ridge anal-
ysis produces a canonical analysis located at the center point of the experiment and is used to

find a local minimum, maximum or saddle point [Hoerl, 1985].
Throughout each of these steps, the response variable y should be estimated using n-fold
cross-validation. In step 2, the ranges should not be so large that they affect the common
behavior of the algorithm but also should not be too small or else an effect could be missed. In
step 3, any of the screening methods can be used; typically, this choice will be influenced by the
trade-off between the number of runs and the accuracy of the results. In step 4, the researcher
will obtain very useful information about how large the effect is for each of the hyperparameters
and how the interactions (if present) affect the response.
In order to compare our proposed methodology with a baseline value, the randomForest
algorithm was run using the adult dataset. Leaving the default parameters unchanged, the 10-
fold cross-validated BACC was 0.6376. This is the starting value against which improvements

will be compared. Although the adult dataset is fairly large (32561 rows) it can still be run on
a personal computer and it is feasible to run a full factorial experiment even with replicates.
However, in order to explore this methodology in a more realistic scenario, we will assume

that resources are scarce and that an initial screening design is needed. In the initial screening
experiment, it is recommended to use two-level designs to extend the region of interest and
to keep the number of runs low [Montgomery, 2017]. Therefore, we decided to use a $2^{7-2}$ fractional factorial design. This design has resolution IV, which means that main effects are not aliased with other main effects or two-factor interactions. However, some two-factor
interactions can be aliased with other two-factor interactions. Table 2 shows the levels of the

initial design. With the factor ntree, we are trying to find significant differences between the
two levels. We know that more trees typically improve performance, but this improvement
comes with a computational cost. Hence, we are trying to find if there is a significant difference

between running the default number of trees in the package (500) and running only 100 trees,
which is computationally more efficient. If no significant effect is found in this main factor
or its interactions, then we could run the rest of the experiments at this lower level to reduce
the computation time. For the factor mtry, we selected the floor of $\log(m)$ and the ceiling of $\sqrt{m}$ to define 2 and 4 as the levels. According to [Montgomery, 2017], the distance between
the low level (-) and the high level (+) should be increased aggressively to obtain accurate
effect estimates; this is particularly important for unreplicated designs where there is no internal
estimate of the pure error. If the range of the levels is not sufficiently large, we could miss


identifying a significant effect that is present. The factor replace is set to either false or true.
In the factor classwt, we assigned 10 times more weight to the class > 50K with the opposite
for the high level. For cutoff, the low level was set to 0.2 and the high level to 0.8. Finally, the
maximum number of nodes was set to 5 for the low level and null for the high level, where a null
value means that the tree size is only restricted by the node size. The hyperparameters and their
levels are shown in Table 2. The reader can refer to the randomForest package documentation

T
in order to learn the syntax to be used, for example both classwt and cutoff expect a vector as
input.

Table 2: Initial levels for the $2^{7-2}$ fractional factorial design.

Factor     low level (-)   high level (+)
ntree      100             500
mtry       2               4
replace    FALSE           TRUE
nodesize   1               3256
classwt    1               10
cutoff     0.2             0.8
maxnodes   5               NULL
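One way to script this screening phase is sketched below, assuming the FrF2 package for regular two-level fractional factorial designs and a hypothetical helper cv_bacc() that decodes a row of coded settings to the levels of Table 2, trains the random forest and returns the 10-fold cross-validated BACC; neither the package nor the helper is prescribed by our methodology.

```r
library(FrF2)   # regular two-level fractional factorial designs

# 2^(7-2) resolution IV screening design in coded units (32 runs); the -1/+1
# levels map onto the low/high settings of Table 2
screen_design <- FrF2(nruns = 32, nfactors = 7,
                      factor.names = c("ntree", "mtry", "replace", "nodesize",
                                       "classwt", "cutoff", "maxnodes"))
summary(screen_design)   # prints the design generators and alias structure

# One model-fitting run per design row (cv_bacc() is a hypothetical helper)
runs <- as.data.frame(screen_design)
runs$BACC <- sapply(seq_len(nrow(runs)),
                    function(i) cv_bacc(train_data, runs[i, ]))

# Screening model with all main effects and two-factor interactions
screen_fit <- lm(BACC ~ .^2, data = runs)
summary(screen_fit)
```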

4 Main Results

4.1 Fractional Factorial Design for Initial Screening



The results of the initial $2^{7-2}$ fractional factorial design are shown in Table 3. The ANOVA for the overall model yields $F(25, 6) = 24.78$, $p < 0.001$ with $R^2 = 0.99$, $R^2_{adj} = 0.95$ and $PRESS = 0.073$. We observe that ntree and mtry are not statistically significant, but the rest of

the main factors have very small p-values. There are several significant two-factor interactions.
The aliasing in this design is as follows: ntree:mtry = replace:cutoff, ntree:replace = mtry:cutoff
and ntree:cutoff = mtry:replace. We found that the two-factor interaction ntree:mtry was sig-

nificant, which is unfortunate in this design because there is no way to separate the effect of
this interaction from the effect replace:cutoff. We therefore proceeded with a fold over design
to completely separate the effects of the two-factor interactions, which would be equivalent to
an unreplicated fractional factorial $2^{7-1}$. This design has resolution VII, which indicates that no main factors or two-factor interactions are aliased with each other.
The results of the $2^{7-1}$ fractional factorial design are shown in Table 4 with the ANOVA yielding $F(28, 35) = 3.698$, $p < 0.001$ and $R^2 = 0.747$, $R^2_{adj} = 0.545$ and $PRESS = 0.457$.
None of the main factors are significant. However, there are several two-factor interactions


Table 3: Results for unreplicated fractional factorial design $2^{7-2}$

Coefficients         Estimate   Std. Error   t-value    P(>|t|)
(Intercept)          0.3458     0.0043       80.503     2.47E-10 ***
ntree                0.0029     0.0043       0.684      0.5193
mtry                 -0.0069    0.0043       -1.614     0.1578
replace              -0.0253    0.0043       -5.879     0.0011 **
nodesize             0.0435     0.0043       10.132     5.37E-05 ***
classwt              -0.1364    0.0043       -31.766    6.47E-08 ***
cutoff               0.0475     0.0043       11.07      3.24E-05 ***
maxnodes             -0.0593    0.0043       -13.816    8.95E-06 ***
ntree:mtry           -0.0371    0.0043       -8.636     0.0001 ***
ntree:replace        0.0003     0.0043       0.085      0.9357
ntree:nodesize       -0.0038    0.0043       -0.877     0.4140
ntree:classwt        -0.0033    0.0043       -0.776     0.4672
ntree:cutoff         0.0018     0.0043       0.409      0.6966
ntree:maxnodes       0.0025     0.0043       0.593      0.5750
mtry:nodesize        -0.0027    0.0043       -0.622     0.5569
mtry:classwt         0.0069     0.0043       1.614      0.1577
mtry:maxnodes        0.0042     0.0043       0.978      0.3658
replace:nodesize     0.0198     0.0043       4.621      0.0036 **
replace:classwt      0.0244     0.0043       5.689      0.0013 **
replace:maxnodes     -0.0278    0.0043       -6.478     0.0006 ***
nodesize:classwt     -0.0336    0.0043       -7.814     0.0002 ***
nodesize:cutoff      0.0266     0.0043       6.192      0.0008 ***
nodesize:maxnodes    0.0475     0.0043       11.05      3.27E-05 ***
classwt:cutoff       -0.0678    0.0043       -15.794    4.09E-06 ***
classwt:maxnodes     0.0485     0.0043       11.287     2.89E-05 ***
cutoff:maxnodes      -0.0195    0.0043       -4.545     0.0039 **

***p < 0.0001; **p < 0.001; *p < 0.05; +p < 0.1

which are significant. For our model selection, we retained the following factors which had

significant two-factor interactions: nodesize, classwt, cutoff and maxnodes. This approach of
model selection is done in order to support the hierarchy principle, which states that lower-
order terms are more important than higher-order terms [Wu and Hamada, 2009] and that if a
model has a significant higher-order term, the model should also retain all of the corresponding
lower-order terms to achieve consistency [Montgomery, 2017]. However, the approach in this
case conflicts with the heredity principle which states that an interaction can only be active
if one or both of its parents (main effects) are also active [Wu and Hamada, 2009]. Under
strong heredity, an AB interaction can only be active if both parents A and B are also active,


whereas some interactions may only obey the weak heredity principle in which only one of the
parents needs to be active [Montgomery, 2017],[Wu and Hamada, 2009]. Nevertheless, some
models may perform better when non-significant terms promoting hierarchy or heredity are
not considered, particularly when the main goal is prediction [Montgomery, 2017]. In this
initial screening phase, we aimed to not discard potentially useful main effects too early and
therefore we decided to retain some factors with non-significant main effects that were involved

in significant two-factor interactions.
The factor ntree was not significant and will be removed for the next round of experiments.

Recall that the low level for this factor was 100 and the high level was 500. This result implies
that there are no significant differences between these two levels, but that does not necessarily

mean that it is an irrelevant hyperparameter, particularly because the generalization error for
RF is known to converge as the number of trees increases [Breiman, 2001]. Hence, if we know
that performance does not suffer when the number of trees is set to 100, then we can reduce the
computational burden of our experiments by setting this hyperparameter to its lower level. For
the next round, we will set this parameter to 250 in order to reduce computational time while
also utilizing a sufficiently large number of trees for the generalization error to converge. The
factor mtry was not significant, which does not mean that it is irrelevant as a hyperparameter
under other scenarios. Because the number of predictors is not very large in this case (m = 15), the difference between the two levels chosen for the experiments (low = 2, high = 4) is not significant. In the next round of experiments, we will use the default setting for this parameter ($\sqrt{m}$). Finally, the factor replace was not significant and it was not involved in any two-factor

interaction. Hence, the default will be used in the next round of experiments.

4.2 Full Factorial and Model Reduction


Now that we are left with 4 out of the original 7 factors, we can run a full factorial design. In
this case, we used a $2^4$ experiment with two replicates. The levels are set exactly as before. The results are shown in Table 5. The overall model is significant with $F(10, 21) = 7.056$, $p < 0.001$ and $R^2 = 0.77$, $R^2_{adj} = 0.661$ and $PRESS = 0.152$. None of the main factors was significant,
but all of them are involved in significant two-factor interactions.

Figure 7 shows the main effects plot for this experiment. We observe that the slopes are
predominately flat relative to the interaction plots depicted on the same scale in Figure 8. An
intersection between two lines in the interaction plot indicates that the effect a given factor A has

on the response varies depending on the level of a second factor B. The plots are consistent with
the results obtained in Table 5; for example, the most obvious interaction is between classwt and
cutoff, which has the smallest p-value (p < 0.001) and the largest magnitude in the estimate
(0.06). Conversely, when inspecting the interaction between nodesize and maxnodes, we
observe that the lines do not intersect; this is consistent with the result in Table 5 indicating
that the interaction is not significant (p = 0.5087). The design matrix with the 10-fold cross-
validated results of BACC is not shown here, but we found that run number 15 in standard
order yielded the best results for the two replicates with an average 10-fold cross-validated


Table 4: Results for unreplicated fractional factorial design $2^{7-1}$

Coefficients         Estimate    Std. Error   t-value   P(>|t|)
(Intercept)          5.92E-01    7.82E-03     75.777    2E-16
ntree                -9.07E-04   7.82E-03     -0.116    0.9082
mtry                 5.36E-03    7.82E-03     0.686     0.4975
replace              1.61E-03    7.82E-03     0.206     0.8377
nodesize             -6.41E-03   7.82E-03     -0.821    0.4174
classwt              -1.42E-02   7.82E-03     -1.818    0.0777 +
cutoff               -3.06E-03   7.82E-03     -0.391    0.6978
maxnodes             1.39E-02    7.82E-03     1.782     0.0834 +
ntree:mtry           4.29E-04    7.82E-03     0.055     0.9566
ntree:replace        -3.16E-03   7.82E-03     -0.405    0.6882
ntree:nodesize       -3.97E-04   7.82E-03     -0.051    0.9597
ntree:classwt        9.92E-04    7.82E-03     0.127     0.8997
ntree:cutoff         2.21E-04    7.82E-03     0.028     0.9777
ntree:maxnodes       -6.16E-04   7.82E-03     -0.079    0.9377
mtry:replace         -6.17E-04   7.82E-03     -0.079    0.9375
mtry:nodesize        1.29E-03    7.82E-03     0.164     0.8703
mtry:classwt         -4.16E-05   7.82E-03     -0.005    0.9958
mtry:cutoff          7.90E-04    7.82E-03     0.101     0.9200
mtry:maxnodes        -4.62E-04   7.82E-03     -0.059    0.9532
replace:nodesize     8.14E-04    7.82E-03     0.104     0.9177
replace:classwt      -1.12E-03   7.82E-03     -0.143    0.8873
replace:cutoff       -3.98E-05   7.82E-03     -0.005    0.9960
replace:maxnodes     1.54E-03    7.82E-03     0.197     0.8453
nodesize:classwt     -2.69E-02   7.82E-03     -3.447    0.0015 **
nodesize:cutoff      -2.61E-02   7.82E-03     -3.341    0.0020 **
nodesize:maxnodes    -7.93E-03   7.82E-03     -1.014    0.3174
classwt:cutoff       5.76E-02    7.82E-03     7.367     1.29E-08 ***
classwt:maxnodes     2.03E-02    7.82E-03     2.592     0.0138 *
cutoff:maxnodes      2.51E-02    7.82E-03     3.215     0.0028 **

***p < 0.0001; **p < 0.001; *p < 0.05; +p < 0.1

BACC of 0.7788. For this run, the combination of factor levels was as follows: nodesize = −1, classwt = 1, cutoff = 1 and maxnodes = 1 in coded units. We can observe that
maxnodes was set to its high level (maxnodes = NULL), which is the default setting for
the randomForest package. However, nodesize was set to the low level, which means that the
model performs better with fully-grown trees than when the number of nodes is restricted. This
finding is consistent with Breiman’s original paper [Breiman, 2001]. As a consequence, in the


Table 5: Results for the reduced model: full factorial $2^4$ with two replicates

Coefficients         Estimate   Std. Error   t-value   P(>|t|)
(Intercept)          0.5929     0.0099       60.014    2E-16 ***
nodesize             -0.0056    0.0099       -0.571    0.5741
classwt              -0.0126    0.0099       -1.274    0.2168
cutoff               -0.0044    0.0099       -0.444    0.6618
maxnodes             0.0168     0.0099       1.705     0.1030
nodesize:classwt     -0.0297    0.0099       -3.002    0.0068 **
nodesize:cutoff      -0.0256    0.0099       -2.594    0.0169 *
nodesize:maxnodes    -0.0066    0.0099       -0.672    0.5087
classwt:cutoff       0.0605     0.0099       6.121     4.50E-06 ***
classwt:maxnodes     0.0197     0.0099       1.993     0.0595 .
cutoff:maxnodes      0.0277     0.0099       2.807     0.0106 *

***p < 0.0001; **p < 0.001; *p < 0.05; +p < 0.1
next round we will leave the maxnodes parameter fixed with the default value of NULL and
control the size of the trees with nodesize.

Figure 7: Main effects plot with vertical axis as the response and horizontal axis showing low
and high levels

In the next experiment, we eliminated maxnodes and only included the other three remaining
factors. This time we considered a full factorial $2^3$ with two replicates. Results are shown in Table 6. The ANOVA yielded $F(6, 9) = 15.72$, $p < 0.0001$ and $R^2 = 0.913$, $R^2_{adj} = 0.855$



Figure 8: Plot for interactions with vertical axis as the response and horizontal axis showing
low and high levels

and $PRESS = 0.042$. The factor cutoff was significant (p = 0.046) and all of the two-factor
interactions were also significant. We have identified the main factors as well as the two-factor
interactions that have the largest effect on our response variable, and therefore we are ready to

proceed with RSM.

4.3 RSM for Final Hyperparameter Optimization



For RSM, we chose a Box-Behnken design with 3 center points for a total of 15 runs. The levels
for the design are shown in Table 7. The center (0,0,0) is located on the values for the highest

BACC from the last DOE since this is the area we would like to further explore.
The results for the second-order model are shown in Table 8. We observe that all main
factors are significant as well as all two-factor interactions. On the other hand, only the quadratic
term corresponding to classwt was significant. We proceeded to discard non-significant terms
and ran the analysis again. The lack of fit test was not significant (p = 0.7291), which means
that our reduced model has a good quality of fit. The reduced model was significant with
$R^2 = 0.998$, $R^2_{adj} = 0.996$ and $PRESS < 0.001$.
The next step was to find the path of steepest ascent using ridge analysis, which is a function


Table 6: Results for the reduced model: full factorial $2^3$ with two replicates

Coefficients        Estimate   Std. Error   t-value   P(>|t|)
(Intercept)         0.6102     0.0096       63.324    3.08E-13 ***
nodesize            -0.0116    0.0096       -1.202    0.2600
classwt             0.0066     0.0096       0.682     0.5127
cutoff              0.0222     0.0096       2.306     0.0465 *
nodesize:classwt    -0.0590    0.0096       -6.122    0.0002 ***
nodesize:cutoff     -0.0504    0.0096       -5.227    0.0005 ***
classwt:cutoff      0.0455     0.0096       4.719     0.0011 **

***p < 0.0001; **p < 0.001; *p < 0.05; +p < 0.1

Table 7: Factors and levels of the Box-Behnken design

level   nodesize   classwt   cutoff
-2      1          6         0.70
-1      3          8         0.75
0       6          10        0.80
1       9          12        0.85
2       12         14        0.90
ED

Table 8: Results for Box-Behnken design with 3 center points for a total of 15 runs. First iteration

Coefficients        Estimate   Std. Error   t-value     P(>|t|)
(Intercept)         0.7535     0.0006       1229.7295   6.75E-15 ***
nodesize            -0.0098    0.0004       -26.1806    1.52E-06 ***
classwt             -0.0035    0.0004       -9.387      0.0002314 ***
cutoff              0.0250     0.0004       66.5094     1.46E-08 ***
nodesize:classwt    -0.0015    0.0005       -2.8349     0.0365 *
nodesize:cutoff     0.0021     0.0005       3.9444      0.0109 *
classwt:cutoff      -0.0021    0.0005       -4.005      0.0103 *
nodesize²           0.0007     0.0006       1.3271      0.2419
classwt²            0.0015     0.0006       2.7753      0.0391 *
cutoff²             -0.0007    0.0006       -1.33       0.2410

***p < 0.0001; **p < 0.001; *p < 0.05; +p < 0.1


provided by the rsm R package. The path is shown in Table 9 ranging from distance 0 to 2 in
increments of 0.5. It is advisable to not extend too far from the area of the experiment because
the predictions for the response variable become unreliable. In this case we chose a distance of
2, which falls within the region of the experiment. The increase/decrease for each of the factors
is given in coded units. The response surface plot is shown in Figure 9 where the center is
located at (0,0,0). The maximum is roughly located at nodesize = 5, classwt = 9 and cutoff = 0.90. This is the location where we center the next round of analysis with another Box-Behnken design with 3 center points (15 runs).
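The RSM phase can be scripted with the rsm package; the sketch below mirrors the first iteration (Table 7 levels), with bacc_at() standing in as a hypothetical wrapper that returns the 10-fold cross-validated BACC at a given hyperparameter setting. It is an illustration of the workflow rather than the exact code used for our runs.

```r
library(rsm)

# Box-Behnken design in the three retained hyperparameters with 3 center points
# (15 runs), coded so that (0, 0, 0) corresponds to nodesize = 6, classwt = 10,
# cutoff = 0.80 as in Table 7
bb <- bbd(3, n0 = 3, coding = list(x1 ~ (nodesize - 6) / 3,
                                   x2 ~ (classwt - 10) / 2,
                                   x3 ~ (cutoff - 0.80) / 0.05))

# Evaluate the response at each design point (bacc_at() is a hypothetical helper)
natural <- decode.data(bb)
bb$BACC <- sapply(seq_len(nrow(natural)), function(i)
  bacc_at(train_data, nodesize = natural$nodesize[i],
          classwt = natural$classwt[i], cutoff = natural$cutoff[i]))

# Full second-order model: first-order terms, two-way interactions, pure quadratics
rs_fit <- rsm(BACC ~ SO(x1, x2, x3), data = bb)
summary(rs_fit)     # coefficient tests, lack-of-fit test, canonical analysis

# Path of steepest ascent (ridge analysis) from the design center, distances 0 to 2
steepest(rs_fit, dist = seq(0, 2, by = 0.5))
```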

Table 9: Path of steepest ascent from ridge analysis.

dist   nodesize   classwt   cutoff   BACC
0      0          0         0        0.753
0.5    -0.164     -0.085    0.465    0.767
1      -0.284     -0.216    0.934    0.78
1.5    -0.351     -0.404    1.405    0.794
2      -0.361     -0.643    1.85     0.807

Figure 9: Response surface plot centered at origin



The results for the second iteration using the Box-Behnken design are shown in Table 10.
This time only the results for the reduced model are shown after removing non-significant terms.
The factor nodesize is not significant, but the interaction between this factor and cutoff is sig-
nificant; consequently, we included it in the model to preserve hierarchy. The main factor cutoff
is significant along with its quadratic term. The lack of fit test is not significant (p = 0.1519)
which means that the reduced model has a good quality of fit. The reduced model is significant with $R^2 = 0.866$, $R^2_{adj} = 0.813$ and $PRESS < 0.0001$. The steepest ascent procedure


points towards the direction, in coded units, of nodesize + 1.137 and cutoff + 0.766. We move in that direction by running the same design again, slightly reducing the level ranges in order to explore only a small local neighborhood of the hyperparameter space. The BACC for all of the runs ranges from 0.802 to 0.811, where the latter figure is the maximum achieved using nodesize = 11, classwt = 7 and cutoff = 0.94. From Figure 10, we can observe that a local
maximum has been achieved. We decided to end our experimentation at this point, satisfied

that our proposed methodology improved the default model performance from 0.6367 to 0.811
in terms of BACC. The final model with the optimized hyperparameters was evaluated using a

withheld test dataset which can also be found in the UCI Machine Learning Repository, yielding
a BACC of 0.81. The withheld test dataset was not used during the optimization/tuning phase

either in the DOE or RSM phases; thus, we believe this BACC is a good estimate of the model’s
generalization performance on new data.

Table 10: Results of the Box-Behnken design with 3 center points for a total of 15 runs. Second iteration. Results show the reduced model.

Coefficients      Estimate   Std. Error   t-value    P(>|t|)
(Intercept)       0.8022     0.0017       470.9058   2.2E-16 ***
nodesize          -0.0001    0.0016       -0.0903    0.9298
cutoff            0.0068     0.0016       4.2903     0.0016 **
nodesize:cutoff   0.0123     0.0023       5.4581     0.0003 ***
cutoff²           -0.0095    0.0023       -4.0797    0.0022 **

***p < 0.0001; **p < 0.001; *p < 0.05; +p < 0.1



Figure 10: Response surface plot where a local maximum has been reached


5 Discussion and Conclusions


Tuning ML hyperparameters is often considered to be an art which requires expert knowledge,
rules-of-thumb, time, and significant computational resources [Snoek et al., 2012]. In this pa-
per, we proposed the DOE methodology to screen potential hyperparameters which have an
impact on a response variable such as an ML performance metric. In a second phase, we used

RSM to optimize this response variable by fitting a second-order polynomial function which is
assumed to approximate the real behavior of the system. The initial phase of the experiment

found that factors mtry and replace had no significant main effects or interactions and were
therefore discarded in the next rounds of experiments. The factor ntree was not significant for

levels 100 and 500 and consequently it was left fixed at 250, contributing to savings in compu-
tational time for the next rounds of experiment while still maintaining an adequate number of
trees for the model performance to converge. While it was statistically significant, the factor
maxnodes was discarded because the tree size can be controlled with the nodesize parameter.

The RSM was applied using only three factors: classwt, cutoff and nodesize. We ran a total
of 3 rounds of experiments in the RSM phase resulting in an optimized BACC value of 0.811,
which was approximately the same performance observed for the held-out test dataset and rep-
resents a substantial improvement over the default model performance of 0.6367. The 10-fold
cross-validation produces very good estimates of the response (BACC in this case), but it is also
computationally expensive. For the specific case of the RF, estimates such as the OOB error
can be used, while a testing dataset withheld from the training process could be used for other

algorithms.
The methodology proposed here allows us to not only optimize an ML performance metric,
but also understand which factors have the largest effect on the response variable. For example,

in a grid search approach, it is assumed that all factors have the same effect on the response.
As a consequence, when we select the "best" result, we could end up with a combination in
which one or more factors do not have a significant impact on the model’s performance but are

set to a level which makes the training more computationally expensive or results in the final
model being more complex than necessary. Our statistical results demonstrated this when we
found that reducing the default value of the number of trees from 500 to 100 had no effect on

performance.
We were surprised to see that the replace hyperparameter was not significant for this particu-
lar dataset, but this does not necessarily imply that it is insignificant for models trained on other
datasets. We believe that this parameter could be more relevant with smaller datasets where re-

placing the samples could help preserve the data distribution. The hyperparameter mtry, which
represents the number of features to use at each split, might be a more relevant hyperparameter
for models trained on higher-dimensional datasets. This parameter controls for the amount of
correlation among the trees but also controls for the prediction strength of each individual tree.
Hence, the levels of mtry should help fine-tune this trade-off between correlation and prediction
strength when there are a large number of features present. It was interesting to see some unex-
pected significant interactions between nodesize and classwt. It is possible that for imbalanced


datasets like the one used in this work, the interaction between these two parameters becomes
more relevant because a node of size 1 could neutralize the effect of imbalanced class weights.
The parameter cutoff was expected to be a significant effect because it controls the trade-off
between the true positive rate and the false positive rate.
Our methodology not only enables the selection of significant hyperparameters in the early
stages of experimentation, but also provides measurements of the importance of the main effects

and their interactions. For example, the absolute or squared value of the t-statistics, as well as
the standardized regression estimates, provide a measurement of the magnitude of the effect.
The tables presented in this study show the estimates for the coefficients as in a linear regression
analysis, but the same results could be presented in an ANOVA table.
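For instance, with placeholder names (doe_data is assumed to hold the coded factor settings and the observed BACC for each run), both views are available directly from a fitted linear model in R.

    # main effects and two-factor interactions of the three retained factors
    fit <- lm(BACC ~ (classwt + cutoff + nodesize)^2, data = doe_data)

    coefs <- summary(fit)$coefficients                   # estimates, std. errors, t-statistics
    sort(abs(coefs[-1, "t value"]), decreasing = TRUE)   # ranking of effect magnitudes

    anova(fit)                                           # the same fit presented as an ANOVA table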
We believe this work represents the beginning of a new area of research for improving
hyperparameter tuning. For example, it is unrealistic to believe that the entire exploration area
of the response can always be modeled with a second-order polynomial having a smooth and
convex surface. Hence, a global optimum is very difficult to prove, and in the case of our
experiments we most likely arrived at a local maximum. However, this maximum may lie at a
point of the response surface that was never directly tested, unlike the grid search approach,
where we only have information about the combinations that were actually evaluated.
For other work related to global optimization in black-box functions, the reader is referred to
[Jones et al., 1998] and [Kleijnen, 2015].
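As a sketch of how such a second-order surface can be fit and examined, assuming the rsm R package, coded variables x1, x2, x3 standing for classwt, cutoff and nodesize, and a placeholder data frame rsm_data of response-surface runs:

    library(rsm)
    # full second-order (quadratic) model in the three coded factors
    surf_fit <- rsm(BACC ~ SO(x1, x2, x3), data = rsm_data)
    summary(surf_fit)   # coefficients, lack-of-fit test, stationary point, canonical analysis
    steepest(surf_fit)  # path toward higher predicted BACC along the fitted surface

The canonical analysis printed by summary() indicates whether the stationary point is a maximum, a minimum or a saddle point within the region explored, which is precisely the kind of diagnostic that a grid search cannot provide.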
It would be interesting to apply this methodology to other algorithms such as neural net-
works, support vector machines, and boosted trees. In unsupervised training, this methodology
could also be used to tune hyperparameters in order to optimize metrics such as homogeneity,
completeness, and the V-measure if a ground truth is known. Moreover, even when the ground
truth is unknown, this methodology can still be applied. For example, in the Latent Dirichlet
Allocation (LDA), the parameters α, β, the number of topics, and the number of iterations can be tuned
to optimize performance measures such as perplexity [Blei et al., 2003].
This experiment had a total of 157 runs, which started with a screening procedure using a
fractional factorial design. For a very simple 2-level grid search which is equivalent to a full
factorial design, we would need 128 runs to test 7 hyperparameters. If we wanted to test 4
levels for each of the factors, we would have needed 16384 runs, which is likely infeasible in
many scenarios. Clearly, our approach can substantially reduce the number of runs needed,
particularly when there are several hyperparameters to be tuned. But perhaps the most impor-
tant advantage of using this approach is that it is possible to gain new knowledge about the
interactions between factors (hyperparameters), which is ignored using traditional approaches
such as the grid search and one-factor-at-a-time exploration. Knowing which interactions are
present is critical for the optimization process because it indicates that the effect a factor has
on the response depends on the level of another factor. We believe that this disciplined and
statistically-based methodology will help practitioners not only optimize their algorithms, but
also gain a better understanding of the effect that each of the hyperparameters has on the per-
formance of a machine learning algorithm. This is useful regardless of whether they are working
with a small dataset that can be run on a single machine or a very large dataset that needs to be
analyzed on a distributed system where resources and training time are constrained.

References and Notes


[Bardenet et al., 2013] Bardenet, R., Brendel, M., Kégl, B., and Sebag, M. (2013). Collab-
orative hyperparameter tuning. In International Conference on Machine Learning, pages
199–207.

[Bergstra and Bengio, 2012] Bergstra, J. and Bengio, Y. (2012). Random search for hyper-
parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

[Bergstra et al., 2011] Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algo-
rithms for hyper-parameter optimization. In Advances in Neural Information Processing
Systems, pages 2546–2554.
[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation.
Journal of Machine Learning Research, 3(Jan):993–1022.
[Breiman, 2001] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

[Fédou and Rendas, 2015] Fédou, J.-M. and Rendas, M.-J. (2015). Extending Morris method:
identification of the interaction graph using cycle-equitable designs. Journal of Statistical
Computation and Simulation, 85(7):1398–1419.

[Hoerl, 1985] Hoerl, R. W. (1985). Ridge analysis 25 years later. The American Statistician,
39(3):186–192.

[Jones et al., 1998] Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global opti-
mization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492.

[Kleijnen, 2015] Kleijnen, J. P. (2015). Design and analysis of simulation experiments.
Springer, 2nd edition.

[Kohavi, 1995] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy
estimation and model selection. In IJCAI’95: Proceedings of the 14th International Joint
Conference on Artificial Intelligence, volume 2, pages 1137–1143. Stanford, CA.

[Lalor et al., 2017] Lalor, J. P., Wu, H., and Yu, H. (2017). Improving machine learning ability
with fine-tuning. arXiv preprint arXiv:1702.08563.

[Maclaurin et al., 2015] Maclaurin, D., Duvenaud, D., and Adams, R. (2015). Gradient-based
hyperparameter optimization through reversible learning. In International Conference on
Machine Learning, pages 2113–2122.

[Montgomery, 2017] Montgomery, D. C. (2017). Design and analysis of experiments. Wiley,
Hoboken, NJ, 9th edition.

[Myers et al., 2016] Myers, R. H., Montgomery, D. C., and Anderson-Cook, C. M. (2016). Re-
sponse surface methodology: process and product optimization using designed experiments.
John Wiley & Sons, New York (Probability and Statistics Series), 4th edition.

[Nickson et al., 2014] Nickson, T., Osborne, M. A., Reece, S., and Roberts, S. J. (2014). Au-
tomated machine learning on big data using stochastic algorithm tuning. arXiv preprint
arXiv:1407.7969.

[Shi and Kleijnen, 2018] Shi, W. and Kleijnen, J. P. (2018). Testing the assumptions of sequen-
tial bifurcation for factor screening. Simulation Modelling Practice and Theory, 81:85–99.

[Snoek et al., 2012] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian
optimization of machine learning algorithms. In Advances in Neural Information Processing
Systems, pages 2951–2959.
[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhut-
dinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The
Journal of Machine Learning Research, 15(1):1929–1958.

[Staelin, 2003] Staelin, C. (2003). Parameter selection for support vector machines. Hewlett-
Packard Company, Tech. Rep. HPL-2002-354R1.

[Tan et al., 2006] Tan, P.-N. et al. (2006). Introduction to data mining. Pearson Education
India.

[Wu and Hamada, 2009] Wu, C. and Hamada, M. (2009). Experiments: Planning, analysis,
and optimization. Wiley Series in Probability and Statistics. Wiley.

[Yu et al., 2017] Yu, L., Xu, H., and Tang, L. (2017). LSSVR ensemble learning with uncertain
parameters for crude oil price forecasting. Applied Soft Computing, 56:692–701.

[Yu et al., 2011] Yu, L., Yao, X., Wang, S., and Lai, K. K. (2011). Credit risk evaluation using
a weighted least squares SVM classifier with design of experiment for parameter selection.
Expert Systems with Applications, 38(12):15392–15399.

[Zhou et al., 2009] Zhou, L., Lai, K. K., and Yu, L. (2009). Credit scoring using support vector
machines with direct search for parameters selection. Soft Computing, 13(2):149.
