Abstract—While early AutoML frameworks focused on optimizing traditional ML pipelines and their hyperparameters, a recent trend in
AutoML is to focus on neural architecture search. In this paper, we introduce Auto-PyTorch, which brings the best of these two worlds
together by jointly and robustly optimizing the architecture of networks and the training hyperparameters to enable fully automated
deep learning (AutoDL). Auto-PyTorch achieves state-of-the-art performance on several tabular benchmarks by combining multi-fidelity
optimization with portfolio construction for warmstarting and ensembling of deep neural networks (DNNs) and common baselines for
tabular data. To thoroughly study our assumptions on how to design such an AutoDL system, we additionally introduce a new
benchmark on learning curves for DNNs, dubbed LCBench, and run extensive ablation studies of the full Auto-PyTorch on typical
AutoML benchmarks, eventually showing that Auto-PyTorch performs better than several state-of-the-art competitors on average.
Index Terms—Machine learning, Deep learning, Automated machine learning, Hyperparameter optimization, Neural architecture
search, Multi-fidelity optimization, Meta-learning
1 INTRODUCTION

In recent years, machine learning (ML) has experienced enormous growth in popularity as the last decade of research efforts has led to high performances on many tasks. Nevertheless, manual ML design is a tedious task, slowing down new applications of ML. Consequently, the field of automated machine learning (AutoML) [1] has emerged to support researchers and developers in developing high-performance ML pipelines. Popular AutoML systems, such as Auto-WEKA [2], hyperopt-sklearn [3], TPOT [4], auto-sklearn [5] or Auto-Keras [6], are readily available and allow for using ML with a few lines of code.

While initially most AutoML frameworks focused on traditional machine learning algorithms, deep learning (DL) has become far more popular over the last few years due to its strong performance and its ability to automatically learn useful representations from data. While choosing the correct hyperparameters for a task is of major importance for DL [1], the optimal choice of neural architecture also varies between tasks and datasets [7], [8]. This has led to the recent rise of neural architecture search (NAS) methods [9]. Since there are interaction effects between the best architecture and the best hyperparameter configuration for it, AutoDL systems have to jointly optimize both [10].

Traditional approaches for AutoML, such as Bayesian optimization, evolutionary strategies or reinforcement learning, perform well for cheaper ML pipelines but hardly scale to the more expensive training of DL models. To achieve better anytime performance, one approach is to use multi-fidelity optimization, which approximates the full optimization problem by proxy tasks on cheaper fidelities, e.g. training only for a few epochs. Although there is evidence that the correlation between different fidelities can be weak [11], [12], we show in this paper that it is nevertheless possible to build a robust AutoDL system with strong anytime performance.

To study multi-fidelity approaches in a systematic way, we also introduce a new benchmark on learning curves in this paper, which we dub LCBench. This new benchmark allows the community and us to study in an efficient way the challenges and the potential of using multi-fidelity approaches with DL and of combining them with meta-learning, by providing learning curves of 2 000 configurations on a joint optimization space for each of 35 diverse datasets.

Finally, we introduce Auto-PyTorch Tabular, an AutoML framework targeted at tabular data that performs multi-fidelity optimization on a joint search space of architectural parameters and training hyperparameters for neural nets. Auto-PyTorch Tabular, the successor of Auto-Net [13] (part of the winning system in the first AutoML challenge [14]), combines state-of-the-art approaches from multi-fidelity optimization [15], ensemble learning [5] and meta-learning for a data-driven selection of initial configurations for warmstarting Bayesian optimization [16].

Specifically, the contributions of this paper are:

1) We address the problem of jointly and robustly optimizing hyperparameters and architectures of deep neural networks using multi-fidelity optimization.
2) We introduce LCBench¹, a new benchmark for studying multi-fidelity optimization w.r.t. learning curves on a joint optimization space of architectural and training hyperparameters across 35 datasets.
3) We study several characteristics of LCBench to gain insights on how to design an efficient AutoDL framework, including correlation between fidelities and evaluation of hyperparameter importance across different fidelities.
4) We propose to combine BOHB, as a robust optimizer for AutoDL, with automatically-designed configuration portfolios and ensembling, showing the superior performance of this combination.
5) To achieve state-of-the-art performance on tabular data, we combine tuned DNNs with simple traditional ML baselines.
6) We introduce our Auto-PyTorch Tabular² tool to make deep learning easily applicable to new tabular datasets for the DL framework PyTorch.
7) We show that Auto-PyTorch Tabular outperforms several other common AutoML frameworks: AutoKeras, AutoGluon, auto-sklearn and hyperopt-sklearn.

• L. Zimmer is with the Institute of Computer Science, University of Freiburg, Germany. E-mail: zimmerl@cs.uni-freiburg.de
• M. Lindauer is with the Institute of Information Processing, Leibniz University Hannover. E-mail: lindauer@tnt.uni-hannover.de
• F. Hutter is with the University of Freiburg and the Bosch Center for Artificial Intelligence. E-mail: fh@cs.uni-freiburg.de

Manuscript received MONTH DAY, YEAR; revised MONTH DAY, YEAR.

1. https://github.com/automl/LCBench
2. https://github.com/automl/Auto-PyTorch
In contrast to the mainstream in DL, here we do not focus on image data, but on tabular data, since we feel that DL is understudied for many applications of great relevance that feature tabular data, such as climate data, classical sensor data in manufacturing and medical data files. Although we only study tabular data in this paper, the Auto-PyTorch approach is not limited to tabular data, but can in principle also be applied to other data modalities and tasks, such as image classification, semantic segmentation, speech processing or natural language processing.

2 RELATED WORK

In the early years of automated machine learning, AutoML tools such as Auto-WEKA [2], hyperopt-sklearn [3], auto-sklearn [5] and TPOT [4] focused on traditional machine learning pipelines, because tabular data was and still is important for major applications. Popular choices for the optimization approach under the hood include Bayesian optimization, evolutionary strategies and reinforcement learning [17]. In contrast to plain hyperparameter optimization, these tools can also address the combined algorithm selection and hyperparameter (CASH) problem by choosing several algorithms of an ML pipeline and their corresponding hyperparameters [2]. Although we focus on single-level hyperparameter optimization, our approach also addresses mixed configuration spaces with hierarchical conditional structures, e.g., the choice of the optimizer as a top-level hyperparameter and its sub-hyperparameters.

Another trend in AutoML is neural architecture search [9], which addresses the problem of determining a well-performing architecture of a DNN for a given dataset. Early NAS approaches established new state-of-the-art performances [7], but were also very expensive in terms of GPU compute resources. To improve efficiency, state-of-the-art systems make use of cell search spaces [18], i.e., configuring only repeated cell architectures instead of the global architecture, and use gradient-based optimization [19]. Although gradient-based approaches are very efficient on some datasets, they are known for their sensitivity to hyperparameter settings and their instability [20]. Recent practices of doing hyperparameter optimization only as a post-processing step of NAS ignore interaction effects between hyperparameter settings and the choice of architecture, and are also an expensive additional step at the end.

Instead, we propose to jointly optimize architectures and hyperparameters by combining a robust approach based on multi-fidelity optimization with an efficiently designed configuration space, where we make use of repeated blocks and groups and minimize the number of hyperparameters by only describing the shapes of these.

There is also a growing interest in well-designed AutoML benchmarks to take reproducibility and comparability of AutoML approaches into account. For example, HPOlib [21] provides benchmarks for hyperparameter optimization, ASlib [22] for meta-learning of algorithm selection, and NAS-Bench-101, -1Shot1 and -201 [11], [23], [24] for neural architecture search. However, to the best of our knowledge, there is yet no multi-fidelity benchmark on learning curves for the joint optimization of architectures and hyperparameters. We address this problem with our new LCBench, which provides data on 35 datasets in order to allow combinations with meta-learning.

Last but not least, several papers address the problem of understanding the characteristics of AutoDL tasks [12], [25]. A typical finding, for example, is that the design space is over-engineered, leading to very simple optimization tasks where even random search can perform well. As we show in our experiments, this does not apply to our tasks. In addition to studying the characteristics of LCBench and also the full design space of Auto-PyTorch Tabular, we study hyperparameter importance from a global [25], [26], [27], but also from a local point of view [28], showing that both provide different insights.

3 AUTO-PYTORCH

We now present Auto-PyTorch, an AutoDL framework utilizing multi-fidelity optimization to jointly optimize NN architectural parameters and training hyperparameters. As the name suggests, Auto-PyTorch relies on the PyTorch framework [29] as the DL framework. Auto-PyTorch implements and automatically tunes the full DL pipeline, including data preprocessing, network training techniques and regularization methods. Furthermore, it offers warmstarting of the optimization by sampling configurations from portfolios, as well as an automated ensemble selection. In the following, we provide more background on the individual components of our framework and introduce two search spaces of particular importance for this work. In this paper, we focus on tabular datasets in Auto-PyTorch Tabular, and thus the configuration spaces are geared to this type of data.

3.1 Configuration Space

The configuration space provided by Auto-PyTorch Tabular contains a large number of hyperparameters, ranging from preprocessing options (e.g. encoding, imputation) and architectural hyperparameters (e.g. network type, number of layers) to training hyperparameters (e.g. learning rate, weight decay). Specifically, the configuration space Λ of Auto-PyTorch consists of all these hyperparameters and is structured by a conditional hierarchy, such that top-level hyperparameters can activate or deactivate sub-level hyperparameters based on their values.
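To make the conditional structure concrete, here is a minimal sketch using the ConfigSpace library (which BOHB-based systems commonly build on); the hyperparameter names and ranges below are illustrative, not the exact Auto-PyTorch space.

import ConfigSpace as CS
import ConfigSpace.hyperparameters as CSH

cs = CS.ConfigurationSpace()
optimizer = CSH.CategoricalHyperparameter("optimizer", ["sgd", "adam"])
momentum = CSH.UniformFloatHyperparameter("momentum", lower=0.1, upper=0.999)
lr = CSH.UniformFloatHyperparameter("learning_rate", lower=1e-4, upper=1e-1, log=True)
cs.add_hyperparameters([optimizer, momentum, lr])
# momentum is a sub-level hyperparameter: only active when SGD is selected
cs.add_condition(CS.EqualsCondition(momentum, optimizer, "sgd"))
print(cs.sample_configuration())  # momentum only appears for sgd samples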
3.1.1 Search Space 1

Our first search space (Configuration Space 1, Table 1) has 7 hyperparameters in total. Out of these, 5 are common training hyperparameters when training with momentum SGD. For the architecture, we deploy shaped MLPs [31] with ReLUs and dropout [32]. Shaped MLP nets offer an efficient parametrization of the architecture; we use a funnel-shaped variant that only requires a predefined number of layers and a maximum number of units as input. The first layer is initialized with the maximum number of neurons n_max, and each subsequent layer contains

n_i = n_{i-1} - (n_max - n_out) / (n_layers - 1)    (1)

neurons, where n_{i-1} is the number of neurons in the previous layer.

(Figure: shaped network architectures built from Linear, BN, ReLU and Dropout layers, organized into blocks and groups.)

TABLE 1
Hyperparameters and ranges of our Configuration Space 1.
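As a minimal sketch, the funnel shape of Eq. (1) could be materialized in PyTorch as follows; the helper names are ours, and a task-specific output head is omitted.

import torch.nn as nn

def funnel_widths(n_max, n_out, n_layers):
    """Layer widths following Eq. (1): start at n_max and shrink
    linearly until the last layer has n_out units."""
    step = (n_max - n_out) / (n_layers - 1)
    return [round(n_max - i * step) for i in range(n_layers)]

def shaped_mlp(n_in, n_max, n_out, n_layers, dropout=0.1):
    layers, prev = [], n_in
    for width in funnel_widths(n_max, n_out, n_layers):
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
        prev = width
    # a task-specific head (e.g. nn.Linear(prev, n_classes)) would follow
    return nn.Sequential(*layers)

# funnel_widths(70, 10, 4) -> [70, 50, 30, 10]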
over multiple budgets. BOHB combines Bayesian optimization (BO) [38] with Hyperband (HB) [39] and has been shown to outperform BO and HB on many tasks. It also achieves speed-ups of up to 55x over Random Search [15].

Similar to HB, BOHB consists of an outer loop and an inner loop. In the inner loop, configurations are evaluated first on the lowest budget (defined by the current outer loop step) and well-performing configurations advance to higher budgets via SuccessiveHalving (SH) [40]. In the outer loop, the budgets are calculated from a given minimum budget b_min, a maximum budget b_max and a scaling factor η, such that b_{i+1} / b_i = η for two successive budgets.
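For illustration, the resulting geometric budget sequence can be computed as follows (a sketch, not BOHB's internal code):

import math

def bohb_budgets(b_min, b_max, eta):
    """Budgets spaced by a factor of eta, ending at b_max (cf. b_{i+1}/b_i = eta)."""
    n = int(math.log(b_max / b_min, eta)) + 1
    return [round(b_max / eta ** i) for i in reversed(range(n))]

print(bohb_budgets(12, 50, 2))  # [12, 25, 50], the budgets used later in Section 4.1.3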
Unlike HB, BOHB fits a kernel density estimator (KDE) as a probabilistic model to the observed performance data and uses this to identify promising areas in the configuration space by trading off exploration and exploitation. Additionally, a fraction of configurations is sampled at random to ensure convergence. Before fitting a KDE on budget b, BOHB samples randomly until it has evaluated as many data points on b as there are hyperparameters in Λ, to ensure that the BO model is well-initialized. For sampling a configuration from the KDE, BOHB uses the KDE on the highest available budget. Since the configurations on smaller budgets are evaluated first (according to SH), the KDEs on the smaller budgets are available earlier and can guide the search until better informed models on higher budgets are available.

An important choice in setting up multi-fidelity optimization is the type of budget used to create proxy tasks. There are many possible choices, such as runtime, number of training epochs, dataset subsamples [41] or network architecture-related choices like the number of stacks [19]. We choose the number of training epochs as our budget over the runtime because of its generality and interpretability. Epochs allow a simple comparison between the performance of configurations across datasets and a transfer of configurations (e.g. learning rate schedules).

3.3 Parallel Optimization

Since BOHB samples from its KDE instead of meticulously optimizing the underlying acquisition function, a batch of sampled configurations has a certain degree of diversity. As shown by Falkner et al. [15], this approach leads to efficient scaling for parallel optimization. We make use of this by deploying a master-worker architecture, such that Auto-PyTorch can efficiently use additional compute resources. See Figure 2.

Fig. 2. Components and pipeline of a parallel Auto-PyTorch based on a master-worker principle.

3.5 Ensembles

The iterative nature of Auto-PyTorch implies that many models are trained and evaluated. This naturally allows for further boosting the predictive performance by ensembling the evaluated models. Auto-PyTorch uses an approach for ensembling inspired by auto-sklearn [43], implementing an automated post-hoc ensemble selection method [44] which efficiently operates on model predictions stored during optimization. Starting from an empty set, the ensemble selection iteratively adds the model that gives the largest performance improvement, until reaching a predefined ensemble size. Allowing multiple additions of the same model results in a weighted ensemble. Since it has been shown that regularization improves ensemble performance [45], and following auto-sklearn [43], we only consider the k best models in our ensemble selection (k = 30 in our experiments).

An advantage of our post-hoc ensembling is that we can also easily include other models besides DNNs. As we show in our experiments, this combination of different model classes is one of the key factors to achieve state-of-the-art performance on tabular data.
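The greedy loop of this selection method [44] is compact enough to sketch; we assume accuracy as the validation metric and preds[i] holding the stored class-probability predictions of model i (all names are ours):

import numpy as np

def ensemble_selection(preds, y_true, ensemble_size):
    """Greedy ensemble selection: repeatedly add (with replacement) the
    model whose inclusion maximizes accuracy of the averaged prediction."""
    chosen, running_sum = [], np.zeros_like(preds[0])
    for step in range(1, ensemble_size + 1):
        scores = [np.mean(np.argmax((running_sum + p) / step, axis=1) == y_true)
                  for p in preds]
        best = int(np.argmax(scores))
        chosen.append(best)
        running_sum += preds[best]
    # each model's ensemble weight is its selection count / ensemble_size
    return chosen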
3.6 Portfolios

BOHB is geared towards good anytime performance on large search spaces. However, it starts from scratch for each new task. Therefore, we warmstart the optimization to improve the early performance even further. auto-sklearn tried to use task meta-features to determine promising configurations for warmstarting [5]. In this work, we follow the
approach of the newer PoSH-Auto-Sklearn and utilize portfolios in a meta-feature-free approach. Auto-PyTorch simply starts BOHB's first iteration with a set of complementary configurations that cover a set of meta-training datasets well; afterwards, it transitions to BOHB's conventional sampling.

To construct the portfolio P offline, we performed a BOHB run on each of many meta-training datasets D_meta, giving rise to a set of portfolio candidates C. The incumbent configurations from the individual runs are then evaluated on all D ∈ D_meta, resulting in a performance meta-matrix, such that the portfolio can simply be constructed by table look-ups. From the candidates λ ∈ C, configurations are iteratively and greedily added to the portfolio P in order to minimize its mean relative regret R over all meta-datasets D_meta. Specifically, motivated by Hydra [46], [47], we successively add λ*_i to the previous portfolio P_{i-1} in iteration i:

λ*_i ∈ argmin_{λ ∈ C} Σ_{(D_train, D_test) ∈ D_meta} min_{λ' ∈ P_{i-1} ∪ {λ}} R(λ', D_train, D_test)

where P_0 is the empty portfolio and the relative regret R is calculated w.r.t. the best observed performance over the portfolio candidates. Therefore, in the first iteration the configuration that is best on average across all datasets is added to the portfolio. In all subsequent iterations, configurations are added that tend to be more specialized to subsets of D_meta for which further improvements are possible.

Configurations are added in this manner until a predefined portfolio size is reached. Limiting the size of the portfolio in this manner balances between warmstarting with promising configurations and the overhead induced by first running the portfolio.
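In terms of the performance meta-matrix, the greedy step above can be sketched as follows, where regret[c, d] holds the relative regret of candidate c on meta-dataset d (names are ours; using the mean instead of the sum leaves the argmin unchanged):

import numpy as np

def greedy_portfolio(regret, portfolio_size):
    """Greedily pick candidates minimizing the mean regret of the
    portfolio-best configuration per dataset (cf. the argmin above)."""
    portfolio = []
    best_so_far = np.full(regret.shape[1], np.inf)  # portfolio regret per dataset
    for _ in range(portfolio_size):
        # mean regret over datasets if candidate c were added to the portfolio
        scores = np.minimum(best_so_far, regret).mean(axis=1)
        best = int(np.argmin(scores))
        portfolio.append(best)
        best_so_far = np.minimum(best_so_far, regret[best])
    return portfolio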
This approach assumes (as all meta-learning approaches do) that we have access to a reasonable set of meta-training datasets that are representative of the meta-test datasets. We believe that this is particularly possible for tabular datasets because of platforms such as OpenML [48], but sizeable dataset collections also already exist for several other data modalities, such as images and speech (see, e.g., https://www.tensorflow.org/datasets/catalog).

4 LCBENCH: COMPREHENSIVE AUTODL STUDY FOR MULTI-FIDELITY OPTIMIZATION

To gain insights on how to design multi-fidelity optimization for AutoDL, we first performed an initial analysis on the smaller configuration space (see Section 3.1.1). Starting on a small configuration space allows us to sample configurations more densely and with many repeats, to obtain information about all regions of the space. We study (i) the performance distribution of configurations across datasets, to show that single configurations can perform well across datasets, (ii) the correlation between budgets for adaptive and non-adaptive learning rate schedules, to justify the use of multi-fidelity optimization, and (iii) the importance of architectural design decisions and hyperparameters across budgets. Based on our findings, we suggest a number of design choices for the full Auto-PyTorch Tabular design space. In particular, we argue for the benefit of portfolios and justify the use of multi-fidelity optimization in our setting.

4.1 Experimental Setup

We collected data by randomly sampling 2 000 configurations and evaluating each of them across 35 datasets and three budgets. Each evaluation is performed with three different seeds on Intel Xeon Gold 6242 CPUs with one core per evaluation, totalling 1 500 CPU hours.

4.1.1 Datasets

We evaluated the configurations on 35 out of 39 datasets from the AutoML Benchmark [49] hosted on OpenML [48]. As we designed LCBench to be a cheap benchmark, we omit the four largest datasets (robert, guillermo, riccardo, dilbert), in order to ensure low runtimes and a low memory footprint. Nevertheless, the datasets we chose are very diverse in the number of features (5 - 1 637), data points (690 - 581 012) and classes (2 - 355) (also see Figure 7), and cover binary and multi-class classification, as well as strongly imbalanced tasks. Whenever possible, we use the test split defined by the OpenML task with a 33 % test split, and additionally use a fixed 33 % of the training data as validation split. In case there is no such OpenML task with a 33 % split available for a dataset, we create a 33 % test split and fix it across the configurations.

4.1.2 Configuration Space and Training

We searched on Search Space 1 introduced in Section 3.1.1 and Table 1, which comprises 7 hyperparameters (3 integer, 4 float), two of which describe the architecture and five of which are training hyperparameters. We used SGD for training and cosine annealing [50] as the learning rate scheduler. We fixed all remaining hyperparameters to their default values.

4.1.3 Budgets

We used the number of epochs as our budget and evaluated each configuration for 12, 25 and 50 epochs. These budgets correspond to the ones BOHB selects with the parameters (b_min, b_max, η) = (12, 50, 2). For each of these evaluations, the cosine annealing scheduler was set to anneal to 10^-8 when reaching the respective budget. Since we also log the full learning curve, we are hence able to compare the adaptive learning rate scheduling (a separate cosine schedule for each budget) with the non-adaptive one (a single cosine schedule over 50 epochs, also evaluated at 12 and 25 epochs).
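Both schedules correspond to PyTorch's standard cosine annealing scheduler; a minimal sketch under the settings above (T_max set per budget, annealing to 10^-8):

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_scheduler(optimizer, budget_epochs, adaptive=True):
    """Adaptive: anneal to ~0 exactly at the evaluated budget (12, 25 or 50).
    Non-adaptive: one 50-epoch schedule, read off at 12 and 25 epochs."""
    return CosineAnnealingLR(optimizer,
                             T_max=budget_epochs if adaptive else 50,
                             eta_min=1e-8)

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = make_scheduler(opt, budget_epochs=12)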
4.1.4 Data

Throughout these evaluations we logged a vast number of metrics, including learning curves of performance metrics on the train, validation and test sets, configuration hyperparameters, and global as well as layer-wise gradient statistics over time, like gradient mean and standard deviation. Although we only use the performance metrics and configurations in this study, the full data is publicly available and we hope that it will be very helpful for future studies by the community. We dub this benchmark LCBench.
Fig. 3. Mean validation accuracy (left) and mean relative regret (right) of 2 000 evaluated configurations across 35 datasets. For better visualization, we sort by mean accuracy/regret along each axis respectively.
Fig. 6. Boxplots on the hyperparameter importance according to fANOVA (left) and LPI (right) on LCBench.

(Figures: a plot over portfolio size, and error vs. wall clock time curves for BOHB, BOHB + simple PF and BOHB + greedy PF.)
(Figure: error vs. wall clock time for BOHB + greedy PF and BOHB + greedy PF + Ens. (div.) against the baselines Extra Trees, KNN, Random Forest, LightGBM and CatBoost.)

5.3.2 Portfolios

Anytime performance further improves when we add well-constructed portfolios to Auto-PyTorch. While the simple portfolio (with all incumbents of the 100 meta-train

5.3.3 Ensembles
TABLE 4
Accuracy and standard deviation across 5 runs of different AutoML frameworks after 1h. "-" indicates that the system crashed and did not return predictions.
[7] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR'17) [60]. Published online: iclr.cc.
[8] M. Lindauer and F. Hutter. Best practices for scientific research on neural architecture search. CoRR, abs/1909.02453, 2019.
[9] T. Elsken, J. Metzen, and F. Hutter. Neural architecture search: A survey. J. Mach. Learn. Res., 20:55:1–55:21, 2019.
[10] A. Zela, A. Klein, S. Falkner, and F. Hutter. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. In ICML 2018 AutoML Workshop, 2018.
[11] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter. NAS-Bench-101: Towards reproducible neural architecture search. In Proc. of ICML, pages 7105–7114. PMLR, 2019.
[12] A. Yang, P. Esperança, and F. Carlucci. NAS evaluation is frustratingly hard. In International Conference on Learning Representations, 2020.
[13] H. Mendoza, A. Klein, M. Feurer, J. Springenberg, and F. Hutter. Towards automatically-tuned neural networks. In ICML 2016 AutoML Workshop, 2016.
[14] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, Tin Kam Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov, and E. Viegas. Design of the 2015 ChaLearn AutoML challenge. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE Computer Society Press, 2015.
[15] S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Dy and Krause [61], pages 1437–1446.
[16] M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, and F. Hutter. Practical automated machine learning for the AutoML challenge 2018. In AutoML Workshop at the International Conference on Machine Learning (ICML), 2018.
[17] F. Hutter, L. Kotthoff, and J. Vanschoren. Automated Machine Learning. Springer, 2019.
[18] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In Dy and Krause [61], pages 4092–4101.
[19] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In 7th International Conference on Learning Representations (ICLR). OpenReview.net, 2019.
[20] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter. Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations, 2020.
[21] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice (BayesOpt'13), 2013.
[22] B. Bischl, P. Kerschke, L. Kotthoff, M. Lindauer, Y. Malitsky, A. Fréchette, H. Hoos, F. Hutter, K. Leyton-Brown, K. Tierney, and J. Vanschoren. ASlib: A benchmark library for algorithm selection. Artificial Intelligence, pages 41–58, 2016.
[23] A. Zela, J. Siems, and F. Hutter. NAS-Bench-1Shot1: Benchmarking and dissecting one-shot neural architecture search. In International Conference on Learning Representations, 2020.
[24] X. Dong and Y. Yang. NAS-Bench-201: Extending the scope of reproducible neural architecture search. In International Conference on Learning Representations, 2020.
[25] A. Sharma, J. van Rijn, F. Hutter, and A. Müller. Hyperparameter importance for image classification by residual neural networks. In Discovery Science – 22nd International Conference (DS), volume 11828 of Lecture Notes in Computer Science, pages 112–126. Springer, 2019.
[26] J. van Rijn and F. Hutter. Hyperparameter importance across datasets. In Y. Guo and F. Farooq, editors, Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 2367–2376. ACM Press, 2018.
[27] F. Hutter, H. Hoos, and K. Leyton-Brown. An efficient approach for assessing hyperparameter importance. In E. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning (ICML'14), pages 754–762. Omnipress, 2014.
[28] A. Biedenkapp, J. Marben, M. Lindauer, and F. Hutter. CAVE: Configuration assessment, visualization and evaluation. In Proceedings of the International Conference on Learning and Intelligent Optimization (LION), Lecture Notes in Computer Science. Springer, 2018.
[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
[30] M. Lindauer, K. Eggensperger, M. Feurer, A. Biedenkapp, J. Marben, P. Müller, and F. Hutter. BOAH: A tool suite for multi-fidelity Bayesian optimization & analysis of hyperparameters. arXiv:1908.06756 [cs.LG].
[31] M. Kotila. Autonomio. https://mikkokotila.github.io/slate/, 2019.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[33] D. Kingma and J. Ba. Adam: A method for stochastic optimization. 2015.
[34] H. Zhang, M. Cissé, Y. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR'18), 2018. Published online: iclr.cc.
[35] X. Gastaldi. Shake-shake regularization of 3-branch residual networks. In 5th International Conference on Learning Representations, Workshop Track Proceedings. OpenReview.net, 2017.
[36] Y. Yamada, M. Iwamura, T. Akiba, and K. Kise. ShakeDrop regularization for deep residual learning. IEEE Access, 7:186126–186136, 2019.
[37] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE Computer Society, 2016.
[38] J. Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4(4):347–365, 1994.
[39] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.
[40] K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In A. Gretton and C. Robert, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51. Proceedings of Machine Learning Research, 2016.
[41] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In A. Singh and J. Zhu, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS).
APPENDIX A
ABLATION TRAJECTORIES ON ALL META-TEST DATASETS

Figure 14 shows the results of the ablation study on all meta-test datasets. For the baselines, we used scikit-learn version 0.23.0 (Random Forest, Extra Trees, KNN), LightGBM version 2.3.1 and CatBoost version 0.23.1.
APPENDIX B
COMPARISON AGAINST COMMON BASELINES AND AUTOML FRAMEWORKS

Figure 15 shows the results of Auto-PyTorch and other common AutoML frameworks, as well as our baseline models, on all meta-test datasets. For all experiments, the same resources were allocated (i.e. 6 Intel Xeon Gold 6242 CPU cores, 6 GB RAM per core). For Auto-PyTorch, the logging of the post-hoc ensemble selection allows the construction of the full trajectory. For the other frameworks, the trajectories were created by running for 1, 3, 5, 10, 30, 50, ..., 10 000 trials and using the timestamp of the trial finish as the timestamp for the point on the trajectory.
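A minimal sketch of this trajectory construction (the function name is ours):

def anytime_trajectory(trials):
    """trials: (finish_timestamp, accuracy) pairs from the 1, 3, 5, ... runs.
    Returns the best-so-far accuracy at each finish timestamp."""
    best, trajectory = float("-inf"), []
    for timestamp, accuracy in sorted(trials):
        best = max(best, accuracy)
        trajectory.append((timestamp, best))
    return trajectory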
We used Auto-Keras 1.0.2 with the default settings. For AutoGluon, we used version 0.0.9 in the "optimize for deployment" setting with the same validation split as Auto-PyTorch uses. For hyperopt-sklearn, we allowed any classifier and preprocessing to be used and used the TPE algorithm. Finally, we used auto-sklearn 0.7.0 with an ensemble size of 50, initializing with the 50 best models, model selection via 33 % holdout, and 3 workers. All other settings were set to their default values.
Fig. 14. Results on all meta-test datasets. ”BOHB + greedy PF + Ens. (div.)” refers to the full Auto-PyTorch method.
Fig. 15. Comparison against competitors on all meta-test datasets. ”BOHB + greedy PF + Ens. (div.)” refers to the full Auto-PyTorch method.