arXiv:2208.13116v1 [cs.SE] 28 Aug 2022

Abstract—The popularity of automated machine learning (AutoML) tools in different domains has increased over the past few years. Machine learning (ML) practitioners use AutoML tools to [...] introduced for automating model management, data labeling, and optimizing hyperparameters.

The popularity of AutoML tools has attracted the attention [...] other ML practitioners. In addition, developers of AutoML tools can leverage our findings to provide APIs to make the collaboration between their tool and other co-used tools easier.

[Figure 1 residue: the nine ML pipeline stages; Model requirements, Data collection, Data cleaning, Data labeling, Feature engineering, Model training, Model evaluation, Model deployment, and Model monitoring.]
Organization: The rest of the paper is organized as follows. In Section II, the background about the AutoML tools and ML pipeline is provided. In Section III, we provide a summary of the important related works. In Section IV, the experimental setup of the study is explained. The results of the research questions are presented in Section V. A discussion of these results is presented in Section VI. Section VII discusses threats to the validity of our study. In Section VIII, we conclude the paper and outline some avenues for future work.

Fig. 1. The stages of the ML pipeline based on [11].

II. BACKGROUND

AutoML tools are used to automate one or more stages (e.g., hyperparameter optimization) in the ML pipeline. This section provides background information on a typical ML pipeline and AutoML tools. In this article, the word "ML practitioner" refers to anyone who uses AutoML or ML in their project or work. An ML practitioner can be a developer, a researcher, a data scientist, or an ML engineer.

A. Machine learning pipeline

Figure 1 shows the nine stages of a typical ML pipeline for building and deploying ML software systems used at Microsoft, described by Amershi et al. [11]. We consider it as a reference ML pipeline in our study. Similar ML pipelines are also adopted by other commercial companies, such as Google [12] or IBM [13]. In the following, we describe the different stages of the ML pipeline mentioned in [11].

The model requirements stage is where ML practitioners select the type of ML model that fits their problem and products. In the data collection stage, ML practitioners identify the data to collect from different sources, such as database systems or file storage. The data cleaning stage includes cleaning the data to remove any anomalies that would likely [...]

B. Automated machine learning (AutoML)

The primary goal of automation in ML started from reducing human effort in the loop, specifically during the hyperparameter tuning and model selection stage [14]. However, the success of hyperparameter tuning and model selection also depends on other steps, such as data cleaning and preparation or feature engineering [14]–[17]. More recently, AutoML has been broadly used in automating multiple steps of the machine learning workflow, from data collection and preparation to the deployment and monitoring of the ML model in production.

A variety of AutoML tools have been proposed to automate the computational work of building an ML pipeline. Some of these tools are built upon widely used ML frameworks. For example, Auto-Sklearn [18], [19] and TPOT [20], [21] are built on scikit-learn [22]. These AutoML tools focus largely on supporting data preparation [14], [16], [23], feature engineering [24], hyperparameter tuning [25], model selection [14], [16], [26], and end-to-end development support for machine learning software systems [27].

III. RELATED WORKS

The related research papers are categorized into two groups and discussed in this section.

A. Qualitative studies on the usage of AutoML tools

Prior work performs qualitative studies (e.g., through interviews) to understand practitioners' experiences of using
AutoML tools [8]–[10], [15], [28]–[30]. For example, Xin et al. [8] investigated the use of AutoML tools through semi-structured interviews with AutoML users of different categories and expertise. They argued that although AutoML tools increase productivity for experts and make ML accessible [...] AutoML tools. Similarly, Wang et al. [9] argued that there is no need for tools that are fully automated; instead, developing AutoML tools that are explainable and that integrate people [...]

[Fig. 2 residue: overview of the experimental setup (Section IV): (1) identifying AutoML tools and their metadata; (2) filtering AutoML tools; (3) collecting GitHub projects (PN) using AutoML tools, by searching for projects from websites studied in research, extracting the configurations (CN) of AutoML tools, and downloading source code files for PN importing CN; then extracting the function calls and finding the most used AutoML tools, manually analyzing functions to identify their purpose and the stages of the ML pipeline, examining the usage of different AutoML tools in the same source code file, and analyzing the characteristics of the projects.]
using the abstract syntax tree (AST) Python library [48]. Step 5) The AST Python library is used to extract the function [...] operation statements, d) super functions of a class in which the parent is an AutoML tool are extracted. After collecting the function calls from all source code files, we counted the total number of calls of the functions of each AutoML tool.

Fig. 3. AutoML tools and the number of function calls. [x-axis: the 57 studied tools; Optuna, Hyperopt, Skopt, featuretools, Tpot, Bayes_opt, Autokeras, Autosklearn, Ax, Snorkel, Dragonfly, Pycaret, Autogluon, nni, Keras-tuner, Learn2learn, Evalml, FEDOT, Hypernets, MOE, Auto-PyTorch, Once-for-all, NASLib, Pytorch-meta, Adanet, Hyperoptsklearn, Hyperparam_hunter, Aw_nas, Merlion, Determined, FLAML, MLBox, AutoDL, Archai, Mindsdb, Optunity, FAR-HO, HPOlib2, Oboe, Hypertunity, Auptimizer, Forecasting, Devol, Autobot, Autoai, AutoGL, Igel, Runai, Ludwig, Mljar, Automl, ATM, AutoFolio, BTB, Gama, Libra, and SMAC3.]
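The AST-based extraction step described above can be illustrated with a small, self-contained sketch. Note that the source snippet, the package set, and the helper names below are illustrative, not the paper's actual implementation; it only attributes calls whose receiver chain is rooted in an imported AutoML package.

```python
import ast

# Illustrative source file that uses one AutoML tool (Optuna).
SOURCE = """
import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study()
study.optimize(objective, n_trials=10)
"""

def imported_roots(tree, packages):
    """Names under which the AutoML packages are imported in this file."""
    roots = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in packages:
                    roots.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in packages:
                for alias in node.names:
                    roots.add(alias.asname or alias.name)
    return roots

def call_names(tree):
    """Yield the dotted name of every function call in the file."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            parts, f = [], node.func
            while isinstance(f, ast.Attribute):
                parts.append(f.attr)
                f = f.value
            if isinstance(f, ast.Name):
                parts.append(f.id)
                yield ".".join(reversed(parts))

tree = ast.parse(SOURCE)
roots = imported_roots(tree, {"optuna"})
calls = [c for c in call_names(tree) if c.split(".")[0] in roots]
print(calls)  # ['optuna.create_study']
```

Calls made through intermediate variables (e.g., `study.optimize`) are not attributed by this simple sketch, which is one reason a real extraction pipeline needs the extra cases (a–d) enumerated in the text.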
Then, we sorted the AutoML tools based on the number of [...]

Analyzing the characteristics of GitHub projects us[ing] [...] the projects that use AutoML tools, such as the number of contributors, number of forks, size, age, stars, and commits.

3) Results: Figure 3 shows the total number of function calls for each AutoML tool. As seen in Figure 3, the top 10 AutoML tools with the most number of function calls [... The percentage of the function calls] that are associated with the top 10 most used tools (i.e., 18% of the 57 selected AutoML tools) is 68%; these top [10 tools are Optuna, Hyperopt, Skopt, Featuretools, Tpot,] Bayes opt, Autokeras, Auto-sklearn, AX, and Snorkel. The details of each of these top 10 most used AutoML tools are [presented in Table I].

[Fig. 4 residue: distributions of the metrics (age, commits, contributors, forks, size, stars) of the projects that use AutoML tools.]
The 18% (10 out of 57) most used AutoML tools (e.g., Optuna) account for 68% of all function calls.

Furthermore, the profile of the projects that are using AutoML tools is shown in Figure 4. For clear visualization, we removed the outliers within the last 5 percent of data in each metric and showed the distribution of the remaining 95% of the data in Figure 4. As seen in Figure 4, 75% of the repositories that are using AutoML tools are less than five months old and have fewer than 65 commits, two contributors, two forks, 35,753 lines of code, and two stars. On the other hand, the projects that use AutoML tools also include some projects that are very mature (up to 10 years in age), popular (with up to 102,983 stars), with many contributors (up to 433 contributors), and large in size (up to 17,453,577 lines of code).

The most used AutoML tools are popular tools according to their number of stars and forks. Furthermore, most of the projects that use AutoML tools are young and not popular (based on the number of stars).

B. RQ2: How do ML practitioners use AutoML tools?

1) Motivation: ML practitioners may use AutoML tools for different purposes across their ML pipeline. We analyzed the stages of the ML pipeline in which AutoML tools are used and the purposes of using AutoML tools, to help AutoML tool developers gain more insights into the usage of their tools and enable them to improve their tools more efficiently. Understanding the stages of the ML pipeline in which the AutoML tools are most popular can help developers of the tools find the shortcomings of their tools and improve them efficiently. Besides, our results can provide insights for ML practitioners to better leverage AutoML tools in different stages of their ML pipeline and for different purposes.

2) Approach: To understand the purposes of using AutoML tools and the stages of the ML pipeline where they are being used, we used the extracted function calls in Section V-A3 and labeled them. To reduce the required time and effort for manual labeling, we only focused on the function calls of the top 10 most used AutoML tools (identified in RQ1). Besides, we focused on the function calls that are most commonly
TABLE I
SUMMARY STATISTICS OF THE TOP 10 AUTOML TOOLS.
Function: Number of function calls. S-code: Number of unique source files importing the tool. Project: Number of unique projects importing the tool. Size: Size of the tool. Stars: Total number of stars. Age: Age of the tool (i.e., the difference between the latest commit date and the creation date). Forks: Number of forks associated with the tool. Releases: The total number of releases associated with the tool. Descriptions: The description of the tool.
AutoML repository (short name) | Function | S-code | Project | Size | Stars | Age | Forks | Releases | Descriptions
1 optuna/optuna (Optuna) | 34,916 | 5,482 | 3,140 | 13,577 | 5,553 | 45 | 602 | 41 | A software framework for automatic hyperparameter optimization [49]
2 hyperopt/hyperopt (Hyperopt) | 20,903 | 5,271 | 4,432 | 6,059 | 5,955 | 122 | 925 | 0 | A library to optimize hyperparameters [50]
3 scikit-optimize/scikit-optimize (Skopt) | 20,313 | 2,807 | 1,830 | 9,423 | 2,238 | 68 | 424 | 23 | "Sequential model-based optimization" [51]
4 Featuretools/featuretools (Featuretools) | 14,786 | 1,053 | 548 | 5,888 | 5,864 | 50 | 769 | 94 | A library to automate feature engineering [52]
5 EpistasisLab/tpot (Tpot) | 13,725 | 2,893 | 1,796 | 79,030 | 8,349 | 72 | 1,437 | 27 | A library to optimize machine learning pipelines [53]
6 fmfn/BayesianOptimization (Bayes opt) | 11,582 | 3,128 | 2,605 | 23,290 | 5,543 | 89 | 1,222 | 7 | "Global optimization with gaussian processes" [54]
7 keras-team/autokeras (Autokeras) | 10,331 | 1,259 | 550 | 44,487 | 8,241 | 48 | 1,334 | 53 | "An AutoML system based on Keras" [55]
8 automl/auto-sklearn (Auto-sklearn) | 8,909 | 974 | 543 | 74,993 | 5,870 | 76 | 1,092 | 29 | An AutoML toolkit based on sklearn [56]
9 facebook/Ax (AX) | 8,442 | 722 | 213 | 283,337 | 1,649 | 33 | 178 | 23 | An adaptive platform for adaptive experiments [57]
10 snorkel-team/snorkel (Snorkel) | 8,237 | 853 | 264 | 292,468 | 4,922 | 68 | 797 | 14 | A tool for generating training data [6]
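Most of the tools in Table I (Optuna, Hyperopt, Skopt, Bayes opt) are hyperparameter optimizers. The repetitive experimentation they automate can be illustrated with a minimal random-search sketch in pure Python; the toy objective below stands in for an actual training run, and real tools add adaptive samplers, pruning, and parallelism.

```python
import random

def train_and_score(lr, depth):
    """Stand-in for a real training run; returns a score to maximize.
    (Illustrative objective; a real pipeline would fit and validate a model.)"""
    return -(lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

def random_search(n_trials, seed=0):
    """Minimal version of the repetitive trial loop that HPO tools automate."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Sample a candidate hyperparameter configuration.
        params = {"lr": rng.uniform(1e-4, 1.0), "depth": rng.randint(2, 12)}
        score = train_and_score(**params)
        # Keep the best-scoring configuration seen so far.
        if best is None or score > best[0]:
            best = (score, params)
    return best

score, params = random_search(50)
print(params)
```

Each iteration is an independent experiment; tools like Optuna replace the uniform sampling above with adaptive strategies while keeping the same evaluate-and-compare loop.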
used in different projects. Thus, from the set of all unique function calls of the selected AutoML tools, we compute their total frequency and Shannon entropy value, then choose the functions that have more than 80% of 1) the Shannon entropy value and 2) the total frequency for manual labeling. In the following, we explain 1) the process of selecting functions based on Shannon entropy and usage frequency and 2) the process of manual labeling.

Selecting functions based on Shannon entropy and usage frequency: We used the normalized Shannon entropy [58], [59] to compute the distribution of the function calls across the projects:

$H_n(F) = -\sum_{i=1}^{n} p_i \log_n(p_i)$    (1)

In Equation (1), F is the function whose Shannon entropy value we want to measure, n is the total number of unique projects, i is the index of a unique project, and p_i is the probability of the use of function F in a given project i (p_i >= 0, and \sum_{i=1}^{n} p_i = 1). p_i is calculated by dividing the total number of occurrences of function F in project i by the total number of occurrences of function F in all the projects. If all the unique projects have the same probability of using function F (i.e., p_i = 1/n for all i in {1, 2, ..., n}), the Shannon entropy of F is maximal (i.e., H_n(F) = 1). Thus, a function F with higher Shannon entropy is more likely to be used by more ML practitioners compared to functions with lower Shannon entropy.

The functions with H_n(F) > 0.8 and with their respective function-call counts above the 80% threshold (sorted in descending order) were chosen for manual labeling. A total of 619 functions from the top 10 AutoML tools were selected in this step.

Manual labeling: Here we try to understand why AutoML tools are used by ML practitioners and what they are used for. To this end, we manually labeled the functions selected in the previous step and formed categories and sub-categories of purposes of using AutoML tools. We also assigned a separate label representing the stage of the ML pipeline for which the respective functions are called, following the stages presented in Figure 1. The first two authors were involved in the initial labeling and construction of the taxonomy. Both individuals are graduate students with strong knowledge in machine learning and software engineering. The primary reference during the labeling is the official documentation of the functions from the tool websites and the publications associated with the AutoML tools. For example, the functions 'autokeras.TextClassifier.evaluate' [60] and 'autokeras.ImageRegressor.evaluate' [61] of AutoKeras are used to evaluate the best model, and their purposes were labeled as 'Model evaluation'. The functions related to data formatting, generating statistics of the input data, or cleaning are assigned the stage of 'Data cleaning'. During the first iteration of labeling, the first author labeled the functions of Optuna, Skopt, Featuretools, and Ax, and the second author labeled the functions of the remaining six tools. In the second iteration, all the initial labels were combined, and the first two authors went through each label together and further discussed the labels that were not agreed upon until a consensus was achieved. A random sample of 238 functions was selected with a confidence level of 95% and a confidence interval of 5% for further validation by the other two authors (not involved in the initial labeling). They agreed on 98% of the stages of the functions and 96% of the purposes of the functions.

3) Results: Table II details the stage of the ML pipeline (i.e., the table columns) in which ML practitioners use AutoML tools. Each cell value in Table II represents the percentage of the total number of calls of the functions from the respective AutoML tool corresponding to the stages of the ML pipeline. As shown in Table II, the AutoML tools are mostly used for hyperparameter optimization (accounting for 40.9% of the function calls). One of the possible reasons is that hyperparameter optimization involves multiple repetitive tasks and requires a lot of effort and time [2]; similarly, there might be numerous different sub-tasks involved compared to other stages of the ML pipeline.

AutoML tools are mainly used (40.9%) during the Hyperparameter Optimization stage of the ML pipeline, followed by Model training (11.5%), Model evaluation (11.2%), and Feature Engineering (10.1%). AutoML tools mostly focus on providing functions for training ML models effectively, selecting a model, and evaluating the performance of the resulting model.
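The selection criterion of Equation (1) can be sketched in Python. The helper name and example counts below are illustrative; taking the logarithm in base n normalizes the entropy into [0, 1], so evenly spread usage scores 1.0 and usage concentrated in one project scores low.

```python
from math import log

def normalized_entropy(calls_per_project):
    """Normalized Shannon entropy (Equation 1) of one function's
    call counts across n projects; 1.0 means perfectly even usage."""
    n = len(calls_per_project)
    total = sum(calls_per_project)
    if n < 2 or total == 0:
        return 0.0
    h = 0.0
    for c in calls_per_project:
        if c > 0:
            p = c / total          # p_i: share of F's calls in project i
            h -= p * log(p, n)     # log base n normalizes H into [0, 1]
    return h

# Illustrative counts of one function's calls in n = 4 projects.
even = [5, 5, 5, 5]      # used evenly everywhere
skewed = [17, 1, 1, 1]   # dominated by a single project
print(normalized_entropy(even))    # 1.0
print(normalized_entropy(skewed))  # < 0.8, below the selection threshold
```

Under the paper's criterion, the evenly used function would pass the H_n(F) > 0.8 filter while the skewed one would not.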
TABLE II
THE PERCENTAGE USAGE OF AUTOML FUNCTIONS ACROSS THE ML WORKFLOW.
AutoML | Data collection | Data cleaning | Data labeling | Feature engineering | Hyperparameter optimization | Model training | Model evaluation | Model deployment | Model monitoring | Others
Optuna | 1.4 | 2.9 | 0 | 0 | 72.9 | 2.1 | 0.7 | 0.7 | 12.1 | 6.4
hyperopt | 11.5 | 1.9 | 0 | 5.8 | 75 | 0 | 0 | 0 | 3.8 | 1.9
Skopt | 0 | 0 | 0 | 0 | 90.6 | 3.1 | 3.1 | 0 | 0 | 3.1
Featuretools | 12.8 | 7.7 | 0 | 69.2 | 0 | 0 | 0 | 0 | 2.6 | 7.7
Tpot | 0 | 11.6 | 0 | 9.3 | 9.3 | 23.3 | 18.6 | 18.6 | 4.7 | 4.7
Bayes-opt | 0 | 0 | 0 | 0 | 78.9 | 0 | 0 | 0 | 5.3 | 15.8
AutoKeras | 3.6 | 3.6 | 0 | 8.3 | 8.3 | 40.5 | 25 | 7.1 | 0 | 3.6
Auto-sklearn | 2.7 | 5.4 | 1.4 | 5.4 | 4.1 | 25.7 | 44.6 | 1.4 | 4.1 | 5.4
Facebook-Ax | 0 | 4.2 | 0 | 0 | 68.8 | 4.2 | 4.2 | 2.1 | 4.2 | 12.5
Snorkel | 9.5 | 21.6 | 27 | 2.7 | 1.4 | 16.2 | 16.2 | 1.4 | 0 | 4.1
AVERAGE | 4.2 | 5.9 | 2.8 | 10.1 | 40.9 | 11.5 | 11.2 | 3.1 | 3.7 | 6.5

Figure 5 presents high-level (in the grey background) and low-level (in the white background) categories of the purposes for which ML practitioners use AutoML tools. As shown in Figure 5, the high-level categories of purposes are, in order: Hyperparameter Optimization (30.6%), Model management (26.5%), Data management (15%), Visualization (7.4%), Logging (3.4%), Utility functions (3.2%), Feature Engineering (3.1%), Storage management (2.1%), Pipeline management (2%), and User Interface (0.5%). The following section describes some of the categories and sub-categories of purposes.

[Fig. 5 residue: low-level purpose categories with their shares, e.g., Feature analysis (2.1%), Manage session (0.8%), Data/feature transform (3.5%), Model evaluation/Model error analysis (9.3%), Execute optimization (7.2%), Score pipeline (0.3%), Manage timezone (0.3%), Data cleaning/filtering (1.3%), Model/training report (1.9%), Report optimization (2.3%), Pipeline configuration (0.3%), Thread execution (0.2%), Data export (0.6%), Export/Import model (1.8%), and Sample parameters (2.1%).]

• Hyperparameter optimization (HPO): This category is about using AutoML to determine the right combination of hyperparameters that maximizes the model performance, usually through running multiple experiments in a single training process. Automating HPO is indeed crucial, particularly for Deep Neural Networks (DNN) [7], [62], which depend on a wide range of hyperparameter choices about the function tuning or regularization, parameter sampling, the network's architecture, and optimization. In Figure 5, we reported 16 different activities involved during the HPO process, from executing hyperparameter optimization and parameter sampling to HPO documentation.

• Model management: This category is about using AutoML to handle the different tasks related to training ML algorithms with some input data and the resulting model that can generate predictions using patterns in the input data. In Figure 5, we highlighted the different sub-tasks where ML practitioners use AutoML in ML model management, using AutoML for model evaluation and error analysis, model training/testing, model definition, model monitoring, and optimizing model training performance.

• Data management: In this category, AutoML is used to handle the tasks related to the data used for building ML models. This process includes a set of steps involving data, including data generation and processing steps, managing data flow to ML models, working with multiple datasets, and other tasks throughout the data pipeline. In particular, we summarized 12 different sub-tasks related to data management in Figure 5, including data transformation, data analysis, data loading, data labeling, and data export.

• Utility functions: These are functions provided by specific AutoML tools to perform some common functionality and are often reusable in the ML pipeline. The Utility functions are particularly important to reduce the pipeline complexity and increase the readability of the source code. Examples of the tasks in this category include session management (e.g., during model training), handling of input/output operations of ML models, showing training or optimization progress, and handling of thread execution.

• Feature engineering: This category is about using AutoML tools to automate the extraction of features from the given data, usually using domain knowledge of the data and the model being built. The resulting features are used as input to some algorithms to train the ML models. We reported in Figure 5 two main sub-categories of Feature engineering, including feature analysis and feature generation/extraction.

• Pipeline management: This category is about using AutoML to automate the end-to-end management of an ML model or a set of multiple models, including the orchestration of data that flows into or out of the model.

• Visualization: This category is about the graphical representation of data and information, such as visualizing the feature importance using a chart for interpretability or visual representation of data distribution during data analysis.

• Logging: This category involves using AutoML tools to track and store records related to the execution events, data inputs, processes, data outputs, resulting model performance, and other related tasks when developing and deploying ML models. ML practitioners use the resulting information to identify any suspicious activities in ML software projects' privileged operations and assess the impact of state transformations on performance.

• Storage management: In this category, AutoML tools are used for managing the data storage resources, aiming at improving and maximizing the efficiency of data storage resources, often in terms of performance, availability, recoverability, and capacity.

• User interface: This category is about the elements of AutoML tools that allow the ML practitioners to easily interact with the AutoML features, using concepts such as visual and interactive design and information architecture (e.g., providing a web-based interactive interface for displaying live code).

We used the label Others for the function calls which do not fall into any of the categories.

We derived 10 high-level categories of the purposes of using AutoML tools, including Hyperparameter optimization (30.6%), Model management (26.5%), Data management (15%), Visualization (7.4%), and Logging (3.4%). ML practitioners can save time and effort by using AutoML tools to automate their ML pipeline tasks [2] for similar purposes.

In Figure 6 we provide the breakdown of the major categories of purposes (x-axis) of using AutoML provided in the top 10 AutoML tools (y-axis). The circle size represents the composition of using the respective AutoML tools for that purpose, computed as the percentage of total function calls
defining that purpose to the total number of function calls from the individual AutoML tool.

[Fig. 6 residue: breakdown of the purposes of using the top 10 AutoML tools (Optuna, Hyperopt, Skopt, Featuretools, Tpot, Bayes-opt, Autokeras, Auto-sklearn, Facebook-Ax, and Snorkel) across the purpose categories (HPO, Model management, Data management, Utility function, Feature engineering, Pipeline management, Visualization, Logging, Storage management, User Interface, and Others).]

[...] function calls for Feature engineering (45%) and Pipeline management (22.7%) tasks are dominantly used from Featuretools and Tpot, respectively, while a minimal percentage (1.3%) of function calls related to User Interface comes from Snorkel and Optuna.

C. RQ3: Are different AutoML tools used together?

1) Motivation: From Table II, we observe that at least two of the AutoML tools can be used to automate the functionality in each stage of the ML pipeline. This question aims to understand whether the different AutoML tools are used together in the same source code file. For instance, ML practitioners may use AutoSklearn for model training and tune the hyperparameters using the functionality provided by Optuna within the same code. Answering this question could help ML practitioners choose the AutoML tools used [...]
[Figure residue, likely the pairwise co-usage of the top AutoML tools in the same source code file; rows and normalized values as extracted (each row contains a 1 for the tool itself):
bayes_opt: 0.14 0.18 0.2 0.03 0.01 1 0 0.01 0 0
tpot: 0.42 0.69 0.1 0.09 1 0.01 0.05 0.1 0 0
featuretools: 0.07 0.02 0 1 0.09 0.02 0 0 0 0
skopt: 0.08 0.26 1 0 0.07 0.11 0 0 0 0
hyperopt: 0.49 1 0.25 0.01 0.45 0.1 0 0 0 0
optuna: 1 0.29 0.05 0.03 0.16 0.05 0 0.01 0 0]

[...] the usages. Future work can explore the factors that impact the adoption of the AutoML tools.

Also, regarding the usage of the AutoML tools in the studied projects, our results indicate that most of the functions used by the ML practitioners are model-oriented, specifically in [...]
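The co-use pattern RQ3 looks for (e.g., Auto-sklearn for training plus Optuna for tuning in one file) can already be detected from imports alone. The package list, file contents, and helper name below are illustrative assumptions, not the paper's actual tooling.

```python
import ast
from itertools import combinations

# Illustrative subset of the studied AutoML packages.
AUTOML_PACKAGES = {"optuna", "hyperopt", "skopt", "featuretools",
                   "tpot", "bayes_opt", "autokeras", "autosklearn"}

def automl_imports(source):
    """Return the top-level AutoML packages imported in one source file."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & AUTOML_PACKAGES

# Hypothetical file that mixes Optuna-based tuning with auto-sklearn training,
# the co-use scenario described in the motivation above.
src = ("import optuna\n"
       "from autosklearn.classification import AutoSklearnClassifier\n")
tools = automl_imports(src)
pairs = sorted(combinations(sorted(tools), 2))
print(pairs)  # [('autosklearn', 'optuna')]
```

Counting such pairs over all collected source files yields a co-usage matrix like the one in the figure above, where each cell reflects how often two tools appear in the same file.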