Automating Data Science

review articles
DOI:10.1145/ 3495256
of interventions, to the human factors
Given the complexity of data science projects and psychology that underlie how cus-
tomers react to visual information dis-
and related demand for human expertise, plays and make decisions.
automation has the potential to transform As another example, in areas such
the data science process. as astronomy, particle physics, and
climate science, there is a rich tradi-
tion of building computational pipe-
BY TIJL DE BIE, LUC DE RAEDT, JOSÉ HERNÁNDEZ-ORALLO,
lines to support data-driven discovery
HOLGER H. HOOS, PADHRAIC SMYTH, and hypothesis testing. For instance,
AND CHRISTOPHER K.I. WILLIAMS geoscientists use monthly global
landcover maps based on satellite
Automating
imagery at sub-kilometer resolutions
to better understand how the Earth’s
surface is changing over time.50 These
maps are interactive and browsable,
Data Science
and they are the result of a complex
data-processing pipeline, in which
terabytes to petabytes of raw sensor
and image data are transformed into
databases of automatically detected
and annotated objects and informa-
tion. This type of pipeline involves
many steps, in which human deci-
sions and insight are critical, such as
instrument calibration, removal of
outliers, and classification of pixels.
DATA SCIENCE COVERS the full spectrum of deriving The breadth and complexity of
insight from data, from initial data gathering and these and many other data science sce-
interpretation, via processing and engineering of data, narios means the modern data scien-
tist requires broad knowledge and ex-
and exploration and modeling, to eventually producing perience across a multitude of topics.
novel insights and decision support systems. Together with an increasing demand
for data analysis skills, this has led to a
Data science can be viewed as overlapping or broader shortage of trained data scientists with
in scope than other data-analytic methodological appropriate background and experi-
disciplines, such as statistics, machine learning, ence, and significant market competi-
tion for limited expertise. Consider-
databases, or visualization.10 ing this bottleneck, it is not surprising
To illustrate the breadth of data science, consider, there is increasing interest in automat-
for example, the problem of recommending items
(movies, books, or other products) to customers. key insights
While the core of these applications can consist of ˽ Automation in data science aims to
facilitate and transform the work of data
algorithmic techniques such as matrix factorization, scientists, not to replace them.
Important parts of data science are
a deployed system will involve a much wider range ˽
already being automated, especially in
of technological and human considerations. the modeling stages, where techniques
ILLUSTRATION BY J UST IN M ETZ
such as automated machine learning

These range from scalable back-end transaction (AutoML) are gaining traction.
systems that retrieve customer and product data in ˽ Other aspects are more difficult
to automate, not only because of
real time, experimental design for evaluating system technological challenges, but because
open-ended and context-dependent
changes, causal analysis for understanding the effect tasks require human interaction.
76 COMM UNICATIO NS O F THE AC M | M A R C H 2022 | VO L . 65 | NO. 3

MA R C H 2 0 2 2 | VO L. 6 5 | N O. 3 | C OM M U N IC AT ION S OF T HE ACM 77
review articles
ing parts, if not all, of the data science pervised with humans in the loop. tomated machine learning (AutoML).
process. This desire and potential for The lower quadrants of Data Explo- However, much of this impact has oc-
automation is the focus of this article. ration and Exploitation are typically curred for modeling approaches based
As illustrated in these examples, closely coupled to the application do- on supervised learning, and automa-
data science is a complex process, driv- main, while the upper quadrants of tion is still far less developed for other
en by the character of the data being Data Engineering and Model Building kinds of learning or modeling tasks.
analyzed and by the questions being are often more domain agnostic. The Continuing our discussion of Figure 1,
asked and is often highly exploratory horizontal axis characterizes the de- Data Engineering tasks are estimated
and iterative in nature. Domain context gree to which different activities in the to often take 80% of the human effort
can play a key role in these exploratory overall process range from being more in a typical data analysis project.7 Con-
steps, even in relatively well-defined open-ended to more precisely speci- sequently, it is natural to expect that
processes such as predictive model- fied, such as having well-defined goals, automation could play a major role in
ing (for example, as characterized by clear modeling tasks and measurable reducing this human effort. However,
CRISP-DM5) where human expertise in performance indicators. Data Engi- efforts to automate Data Engineering
defining relevant predictor variables neering and Data Exploration are often tasks have had less success to date
can be critical. not precisely specified and are quite compared to efforts in automating
Figure 1 provides a conceptual iterative in nature, while Model Build- Model Building.
framework to guide our discussion of ing and Exploitation are often defined Data Exploration involves identify-
automation in data science, including more narrowly and precisely. In classi- ing relevant questions given a dataset,
aspects that are already being auto- cal goal-oriented projects, the process interpreting the structure of the data,
mated as well as aspects that are poten- often consists of activities in the fol- understanding the constraints provid-
tially ready for automation. The verti- lowing order: Data exploration, data ed by the domain as well as the data an-
cal dimension of the figure reflects the engineering, model building, and ex- alyst’s background and intentions, and
degree to which domain context plays ploitation. In practice, however, these identifying issues related to data eth-
a role in the process. Domain context trajectories can be much more diverse ics, privacy, and fairness. Background
not only includes domain knowledge and exploratory, with practitioners knowledge and human judgement are
but also human factors, such as the navigating through activities in these key to success. Consequently, it is not
interaction of humans with the tech- quadrants in different orders and in surprising that Data Exploration poses
nology,1 the side effects on users and an iterative fashion (for example, Mar- the greatest challenges for automation.
non-users, and all the safety and ethi- tínez-Plumed et al.31). Finally, Exploitation turns action-
cal issues, including algorithmic bias. From the layout of Figure 1 we see, able insights and predictions into deci-
These factors have various effects on for example, that Model Building is sions. As these may have a significant
data understanding and the impact where we might expect automation to impact, some level of oversight and hu-
of the extracted knowledge, once de- have the most direct impact—which is man involvement is often essential, for
ployed, and are often addressed or su- indeed the case with the success of au- example, new AI techniques can bring
new opportunities in automating the
Figure 1. The four data science quadrants used in this article to illustrate different areas reporting and explanation of results.29
where automation can take place.
Broadly speaking, automation in
The vertical dimension determines the degree of dependence on domain
the context of data science is challeng-
context, usually introduced through human interaction. The horizontal ing depending on the form it takes,
dimension determines the degree to which a process is open-ended. ranging in complexity depending on
Some activities, such as data augmentation and feature engineering, are whether it involves a single task or an
situated in data engineering near the boundary with data exploration.
entire iterative process, or whether par-
tial or complete automation is the goal.
Data Engineering: Model Building: Less 1. A first form of automation—
data wrangling, algorithm selection, dependent on mechanization—occurs when a task is
data integration, parameter optimization, domain context
data preparation, performance evaluation, so well specified that there is no need
data transformation, model selection, for human involvement. Examples of
… …
such tasks include running a cluster-
ing algorithm or standardizing the val-
Data Exploration: Exploitation: ues in a table of data. This can be done
domain understanding, model interpretation and visualization, by functions or modules in low-level
goal exploration, reporting and narratives,
data aggregation, predictions and decisions, languages, or as part of statistical and
data visualization, monitoring and maintenance,
… …
algorithmic packages that have tradi-
More tionally been used in data science.
dependent on 2. A second form of automation—
More Less domain context
open-ended open-ended composition—deals with strategic se-
quencing of tasks or integration of
different parts of a task. Support for
code or workflow reuse is available
78 COM MUNICATIO NS O F TH E AC M | M A R C H 2022 | VO L . 65 | NO. 3

review articles
in more sophisticated tools that have

emerged in recent years, from interac-
tive workflow-oriented suites (such as
KNIME, RapidMiner, IBM Modeler,
From Machine Learning
SAS Enterprise Miner, Weka Knowl- to Automated
Machine Learning
edge Flows and Clowdflows) to high-
level programming languages and en-
vironments commonly used for data
The problem of supervised machine learning can be formalized as finding a function
analysis and model building (such as
f that maps possible input instances from a given set X to possible target values from
R, Python, Stan, BUGS, TensorFlow, a set Y such that a loss function is minimized on a given set of examples, that is, as
and PyTorch). determining arg min f∈F L(f, E), where F, referred to as hypothesis space, is a set of
3. Finally, a third form of automa- functions from X to Y , L is the loss function, and E is the set of examples (or training
data), comprised of input instances and target values.
tion—assistance—derives from the When Y is a set of discrete values, this problem is called (supervised) classification;
production of elements such as vi- when it is the set of real numbers, it is known as (supervised) regression. Popular loss
sualizations, patterns, explanations, functions include cross-entropy for classification and mean squared error for regression.
among others, that are specifically tar- In this formulation, different hypothesis spaces F can be chosen for a given
supervised machine learning task. In addition to the parameters of a given model
geted at supporting human efficiency. (such as the connection weights in a neural network) that determine a specific
This includes a constant monitoring f ∈ F, there are typically further parameters that define the function space F (such
of what humans are doing during the as the structure of a neural network) or affect the performance of the model
induction process (such as learning rates). Generally, these hyperparameters can be
data science process, so that an auto- of different types (such as real numbers, integers or categorical) and may be subject
mated assistant can identify inappro- to complex dependencies (such as certain hyperparameters only being active when
priate choices, make recommenda- others take certain values). Because the performance of modern machine learning
tions, and so on. While some limited techniques critically depends on hyperparameter settings, there is a growing need for
hyperparameter optimization techniques.
form of assistance is already provided At the same time, because of the complex dependencies between hyperparameters,
in interactive suites such as KNIME sophisticated methods are needed for this optimization task. Human experts not only
and RapidMiner, the challenge is to ex- face the problem of determining performance-optimizing hyperparameter settings, but
the choice of the class of machine learning models to be used in the first place, and the
tend this assistance to the entire data
algorithm used to train these. In automated machine learning (AutoML) all these tasks,
science process. often along with feature selection, ensembling and other operations closely related to
Here, we organize our discussion model induction, are fully automated, such that performance is optimized for a given
into sections corresponding to the four use case, for example, in terms of the prediction accuracy achieved based on given
training data.
quadrants from Figure 1, highlighting
the three forms of automation where
relevant. Because the activities are ar- data are available; satisfying this as- of extensive, often onerous experimen-
ranged into quadrants rather than sumption typically poses challenges, tation, but also enables non-experts to
stages following a particular order, we which we address in later sections of obtain substantially better performance
begin with Model Building, which ap- this article (see Ratner et al.34). than otherwise possible. AutoML sys-
pears most amenable to automation, While there are different categories tems often achieve these advantages at
and then discuss the other quadrants. of machine learning problems and rather high computational cost.
methods, including supervised, un- It is worth noting that AutoML falls
Model Building: supervised, semi-supervised and rein- squarely into the first form of automa-
The Success Story of AutoML forcement learning, the definition of tion, mechanization, as discussed in the
In the context of building models (Fig- the target function and its optimization introduction. At the same time, it can be
ure 1), machine learning methods fea- is most straightforward for supervised seen as yet another level of abstraction
ture prominently in the toolbox of the learning (as discussed in “From Ma- over a series of automation stages. First,
data scientist, particularly because chine Learning to Automated Machine there is the well-known use of program-
they tend to be formalized in terms of Learning”). Focusing on supervised ming for automation. Second, machine
objective functions that directly relate learning, there are many methods for learning automatically generates hy-
to well-defined task categories. accomplishing this task, often with potheses and predictive models, which
Machine learning methods have be- multiple hyperparameters, whose val- typically take the form of algorithms (for
come very prominent over the last two ues can have substantial impact on the example, in the case of a decision tree or
decades, including relatively complex prediction accuracy of a given model. a neural network); therefore, machine
methods, such as deep learning. Auto- Faced with the choice from a large set learning methods can be seen as meta-
mation of these machine learning meth- of machine learning algorithms and an algorithms that automate programming
ods, which has given rise to a research even larger space of hyperparameter set- tasks, and hence “automate automa-
area known as AutoML, is arguably the tings, even seasoned experts often must tion.” And third, automated machine
most successful and visible application resort to experimentation to determine learning makes use of algorithms that
to date of automation within the overall what works best in each use case. Au- select and configure machine learning
data science process (for example, Hut- tomated machine learning attempts to algorithms—that is, of meta-meta-algo-
ter et al.22). It assumes, in many cases, automate this process, and thereby not rithms that can be understood as auto-
that sufficient amounts of high-quality only spares experts the time and effort mating the automation of automation.
review articles
AutoML systems have been gradu- tegration with the Weka environment. reached by AutoML systems are evident
ally automating more of these tasks: The complex optimization process at in the results from recent competitions.17
model selection, hyperparameter the heart of Auto WEKA is carried out Notably, Auto-sklearn significantly out-
optimization and feature selection. by SMAC. Auto-sklearn12 makes use of performed human experts in the human
Many of these systems also deal with the Python-based machine learning track of the 2015/2016 ChaLearn AutoML
automatically selecting learning algo- toolkit scikit-learn and is also powered Challenge. Yet, results from the same
rithms based on properties (so-called by SMAC. Unlike Auto-WEKA, Auto- competition suggest that human experts
metafeatures) of given datasets, build- sklearn first determines multiple base can achieve significant performance im-
ing on the related area of meta-learn- learning procedures, which are then provements by manually tweaking the
ing.4 In general, AutoML systems are greedily combined into an ensemble. classification and regression algorithms
based on sophisticated algorithm These AutoML methods are now obtained from the best AutoML systems.
configuration methods, such as SMAC making their way into large-scale Therefore, there appears to be consider-
(sequential model-based algorithm commercial applications enabling, able room for improvement in present
configuration),21 learning to rank and for example, non-experts to build rela- AutoML systems for standard supervised
Monte-Carlo Tree Search.33 tively complex supervised learning learning settings.
So far, most work on AutoML has models more easily. Recent work on Other systems, such as the Auto-
been focused on supervised learning. AutoML includes neural architecture matic Statistician,29 handle different
Auto-WEKA,41 one of the first AutoML search (NAS), which automates key as- kinds of learning problems, such as
systems, builds on the well-known pects of the design of neural network time series, finding not only the best
Weka machine learning environment. architectures, particularly (but not ex- form of the model, but also its param-
It encompasses all the classification ap- clusively) in deep learning (for example, eters. We will revisit this work in the
proaches implemented in Weka’s stan- Liu et al.28). Google Cloud’s proprietary section on Exploitation.
dard distribution, including many base AutoML tool, launched in early 2018, The automation of model building
classifiers, feature selection techniques, falls into this important, but restricted tasks in data science has been remark-
meta-methods that can build on any of class of AutoML approaches. Similarly, ably successful, especially in supervised
the base classifiers, and methods for Amazon SageMaker, a commercial learning. We believe the main reason
constructing ensembles. Auto-WEKA service launched in late 2017, provides for this lies in the fact that these tasks
225 additionally deals with regression some AutoML functionality and covers are usually very precisely specified and
procedures and permits the optimiza- a broad range of machine learning have relatively little dependence on the
tion of any of the performance metrics models and algorithms. given domain (see also Figure 1), which
supported by Weka through deep in- The impressive performance levels renders them particularly suitable for
mechanization. Conversely, tasks be-
Figure 2. FlashExtract.27 yond standard supervised learning,
such as unsupervised learning, have
After separating attributes by colors, FlashExtract can recognize
examples (such as Be, 9 and 0.070073; and Ti, 48 and 10.653153) and
proven to be considerably harder to au-
counter examples (such as the part struck through in red), in order to tomate effectively, because the optimi-
induce a program that is able to identify other occurrences of these zation goals are more subjective and do-
fields and put them in a spreadsheet or table for further processing.
main-dependent, involving trade-offs
between accuracy, efficiency, robust-
ness, explainability, fairness, and more.
Such machine learning methods, which
are often used for feature engineering,
domain understanding, data transfor-
mation, and so on, thus extend into
the remaining three quadrants, where
we believe that more progress can be
obtained using the other two kinds of
automation seen in the introduction:
composition and assistance.
Data Engineering:
Big Gains, Big Challenges
A large portion of the life of a data sci-
entist is spent acquiring, organizing,
and preparing data for analysis, tasks
we collectively term data engineering.a
a Data wrangling and data cleansing are

terms that are also associated with many of
these stages.
80 COM MUNICATIO NS O F TH E ACM | M A R C H 2022 | VO L . 65 | NO. 3

review articles
The goal of data engineering is to cre- ing names of people or places). Ide-
ate consolidated data that can be used ally, a dataset should be described by a
for further analysis or exploration. This data dictionary or metadata repository,
work can be time-consuming and la- which specifies information such as
borious, making it a natural target for
automation. However, it faces the chal- The automation the meaning and type of each attribute
in a table. However, this is often miss-
lenge of being more open-ended, as
per its location in Figure 1.
of model building ing or out-of-date, and it is necessary to
infer such information from the data
To illustrate the variety of tasks in- tasks in data itself. For the data type of an attribute,
volved in data engineering, consider
the study2 of how shrub growth in the
science has this may be at the syntactic level (for
example, the attribute is an integer or
tundra has been affected by global been remarkably a calendar date), or at a semantic level
warming. Growth is measured across a
number of traits, such as plant height
successful, (for example, the strings are all coun-
tries and can be linked to a knowledge
and leaf area. To carry out this analysis, especially in base, such as DBPedia).6
the authors had to: integrate tempera-
ture data from another dataset (using supervised learning. FlashExtract27 is an example of a
tool that provides assistance to the
latitude, longitude and date informa- analyst for the information extraction
tion as keys); standardize the plant task. It can learn how to extract records
names, which were recorded with from a semi-structured dataset using a
some variations (including typos); han- few examples; see Figure 2 for an illus-
dle problems arising from being un- tration. A second assistive tool is Data-
able to integrate the temperature and Diff,39 which integrates data that is re-
biological data if key data was missing; ceived in installments, for example, by
and handle anomalies by removing means of monthly or annual updates.
observations of a given taxon that lay It is not uncommon that the structure
more than eight standard deviations of the data may change between in-
from the mean. stallments, for example, an attribute is
In general, there are many stages in added if new information is available.
the data engineering process, with po- The challenge is then to integrate the
tential feedback loops between them. new data by matching attributes be-
These can be divided into three high- tween the different updates. DataDiff
level themes, around data organiza- uses the idea that the statistical distri-
tion, data quality and data transforma- bution of an attribute should remain
tion,32 as we will discuss in turn. For a similar between installments to auto-
somewhat different structuring of the mate the process of matching.
relevant issues, see Heer et al.19 In the second stage of data engi-
Beginning with the first stage, data neering, data quality, a common task
organization, one of the first steps is is standardization, involving process-
typically data parsing, determining es that convert entities that have more
the structure of the data so that it can than one possible representation
be imported into a data analysis soft- into a standard format. These might
ware environment or package. Another be phone numbers with formats like
common step is data integration, which “(425)-706-7709” or “416 123 4567,” or
aims to acquire, consolidate and re- text, for example, “U.K.” and “United
structure the data, which may exist in Kingdom.” In the latter case, stan-
heterogeneous sources (for example, dardization would need to make use
flat files, XML, JSON, relational data- of ontologies that contain informa-
bases), and in different locations. It tion about abbreviations. Missing data
may also require the alignment of data entries may be denoted as “NULL” or
at different spatial resolutions or on “N/A,” but could also be indicated by
different timescales. Sometimes the other strings, such as “?” or “-99.” This
raw data may be available in unstruc- gives rise to two problems: the identifi-
tured or semi-structured form. In this cation of missing values and handling
case it is necessary to carry out informa- them downstream in the analysis.
tion extraction to put the relevant pieces Similar issues of identification and
of information into tabular form. For repair arise if the data is corrupted by
example, natural language processing anomalies or outliers. Because much
can be used for information extraction can be done by looking at the distribu-
tasks from text (for example, identify- tion of the data only, many data sci-
review articles
storage, aggregation and data clean-

ing have been significantly automated
Five Data Exploration by recent database technology, signifi-
cant challenges remain, since data en-
Subtasks in gineering is often an iterative process
Social Network Analysis

over representation and integration
steps, involving data from very differ-
ent sources and in different formats,
Computational social scientists may wish to explore a social network to gain an
understanding of the social interactions it describes. For example, an analyst may with feedback loops between the steps
decide to look for community patterns, formalized as subsets of the nodes and the that trigger new questions (for exam-
edges connecting them. In the broad context of data exploration, five subtasks that can ple, Heer et al.19). For instance, in the
potentially be automated are outlined as follows:
1. Form of the pattern. Options include the network’s high-level topology, degree
Tundra example, one must know that
distribution, clustering coefficient, or the existence of dense subnetworks (communities) it is important to integrate the biologi-
as considered here by way of example. cal and temperature data, that the data
2. Measuring pattern ‘interestingness.’ Interestingness can be quantified as the must already be in a close-enough for-
number of edges or the average node degree within the community, the local modularity,
or subjective measures that depend on the analyst’s prior knowledge, or measures mat for the transformations to apply,
developed from scratch. and that domain knowledge is needed
3. Algorithmic strategy. Optimizing the chosen measure can require numerical to fuse variant plant names.
linear algebra, graph theory, heuristic search (for example, beam search), or bespoke
As all these data engineering chal-
approaches.
4. Pattern presentation. The most interesting communities can be presented to the lenges occupy large amounts of analyst
analyst as lists of nodes, by marking them on a suitably permuted adjacency matrix, or time, there is an incentive to automate
using other visualizations of the network. them as much as possible, as the gains
5. Interaction. Almost invariably, the analyst will want to iterate on some of the
subtasks, for example, to retrieve more communities, or to explore other pattern forms. could be high. However, doing this
poorly can have a serious negative im-
pact on the outcome of a data science
ence tools include (semi-)automated approach to data engineering, which project. We believe that many aspects
algorithms for data imputation and shows aspects of both compositional of data engineering are unlikely to be
outlier detection, which would fall and assistive automation, is the pre- fully automated soon, except for a few
under the mechanization or assistance dictive interaction framework.18 This specific tasks, but that further develop-
forms of automation. approach provides interactive recom- ments in the direction of both assistive
Finally, under the data transforma- mendations to the analyst about which and compositional semi-automation
tion heading, we consider processes at data engineering operations to apply will nonetheless be fruitful.
the interface between data engineering at a particular stage, in terms of an ap-
and model building or data explora- propriate domain specific language, Data Exploration:
tion. Feature engineering involves the ideas that form the basis of the com- More Assistance Than Automation
construction of features based on the mercial data wrangling software from Continuing our discussion of the
analyst’s knowledge or beliefs. When Trifacta. Another interesting direction quadrants in Figure 1, we next focus
the data involves sensor readings, im- is based on a concept known as data on data exploration. The purpose of
ages or other low-level information, programming, which exploits domain data exploration is to derive insight
signal processing and computer vision knowledge by means of programmatic or make discoveries from given data
techniques may be required to deter- creation and modeling of datasets for (for example, in a genetics domain,
mine or create meaningful features supervised machine learning tasks.34 understanding the relation between
that can be used downstream. Data Methods from AutoML could poten- particular genes, biological processes,
transformation also includes instance tially also help with data engineering. and phenotypes), often to determine
selection, for example, for handling For instance, Auto-sklearn12 includes a more precise goal for a subsequent
imbalanced data or addressing unfair- several pre-processing steps in its analysis (for example, in a retailing do-
ness due to bias. search space, such as simple missing main, discovering that a few variables
As well as the individual tasks data imputation and one-hot encod- explain why customers behave differ-
in data engineering, where we have ing of categorical features. However, ently, suggesting a segmentation over
seen that assistive automation can be these steps can be seen as small parts these variables). This key role of hu-
helpful, there is also the need for the of the data quality theme, which can man insight in data exploration sug-
composition of tasks. Such a focus on only be addressed once the many is- gests that the form of automation that
composition is found, for example, in sues around data organization and prevails in this quadrant is assistance,
Extraction, Transformation and Load other data quality steps (for example, by generating elements that can help
(ETL) systems, which are usually sup- the identification of missing data) have humans reach this insight. We will
ported by a collection of scripts that been carried out. These earlier steps collectively refer to all these elements
combine data scraping, source integra- are more open ended and thus much that ease human insight as patterns,
tion, cleansing and a variety of other less amenable to inclusion in the Au- capturing particular aspects or parts of
transformations on the data. toML search process. the data that are potentially striking,
An example of a more integrated While many activities related to interesting, valuable, or remarkable
82 COMM UNICATIO NS O F THE ACM | M A R C H 2022 | VO L . 65 | NO. 3

review articles
for the data analyst or domain expert, automation is possible and desirable, subtask may require understanding
and thus worthy of further investiga- without being exhaustive, it is helpful the data analyst’s intentions or pref-
tion or exploitation. Patterns can take to identify five important and com- erences,35 the perceived complexity
many forms, from the very simple (for mon subtasks in data exploration, as of the patterns, and the data analyst’s
example, merely reporting summary illustrated for a specific use case (so- background knowledge about the data
statistics for the data or subsets there- cial network analysis) in the associated domain—all of which require interac-
of), to more sophisticated ones (com- box. These five problems are discussed tion with the data analyst. The latter is
munities in networks or low-dimen- in “Five Data Exploration Subtasks in particularly relevant for the formaliza-
sional representations). Social Network Analysis.” tion of novelty and surprisingness in
The origins of contemporary data The form of the patterns (subtask 1) a subjective manner, and recent years
exploration techniques can be traced is often dictated by the data analyst, have seen significant progress along
back to Tukey and Wilk,43 who stressed that is, user involvement is inevitable this direction using information-theo-
the importance of human involvement in choosing this form. Indeed, certain retic approaches.8
in data analysis generally speaking, types of patterns may be more intel- The next stage (subtask 3) is to
and particularly in data analysis tasks ligible to the data analyst or may cor- identify the algorithms needed to opti-
aiming at ‘exposing the unanticipat- respond to a model of physical reality. mize the chosen measure. In principle,
ed’—later coined Exploratory Data As illustrated in the box, a computa- it would be attractive to facilitate this
Analysis (EDA) by Tukey42 and others. tional social scientist may be inter- task using higher-level automation, as
The goal of EDA was described as ested in finding dense subnetworks in done in AutoML. However, consider-
hypothesis generation, and was con- a social network as evidence of a tight ing the diversity of data across applica-
trasted with confirmatory analysis social structure. tions, the diversity of pattern types, and
methods, such as hypothesis testing, There are often too many possible the large number of different ways of
which would follow in a second step. patterns. Thus, a measure to quantify quantifying how interesting any given
Since the early days of EDA in the how interesting any given set of pat- pattern is, there is a risk that differ-
1970s, the array of methods for data terns of this type is to the data analyst ent data exploration tasks may require
exploration, the size and complexity is required (subtask 2). Here, ‘interest- different algorithmic approaches for
of data, and the available memory and ingness’ could be defined in terms of finding the most interesting patterns.
computing power have all vastly in- coverage, novelty, reliability, peculiar- Given the challenges in designing
creased. While this has created unprec- ity, diversity, surprisingness, utility, or such algorithms, we believe that more
edented new potential, it comes at the actionability; moreover, each of these generic techniques or declarative ap-
price of greater complexity, thus creat- criteria can be quantified either ob- proaches (such as inductive databases
ing a need for automation to assist the jectively (dependent on the data only), and probabilistic programming, cov-
human analyst in this process. subjectively (dependent also on the ered in the final section of the paper)
As an example, the ‘Queriosity’ sys- data analyst), or based on the seman- may be required to make progress in
tem48 provides a vision of automated tics of the data (thus also dependent the composition and assistance forms of
data exploration as a dynamic and in- on the data domain).14 Designing this automation for this subtask.
teractive process, allowing the system measure well is crucial but also highly The user interface of a data explora-
to learn to understand the analyst’s non-trivial, making this a prime tar- tion system often presents the data, and
evolving background and intent, to en- get for automation. Automating this identifies patterns within it, in a visual
able it to proactively show ‘interesting’
patterns. The FORSIED framework8 Figure 3. A fragment of the Automatic Statistician report for the “airline” dataset, which
considers airline passenger volume over the period from 1949 to 1961.29
has a similar goal, formalizing the
data exploration process as an interac-
tive exchange of information between Raw
Raw
Data
Data Full
Full
Model
Model
Posterior
Posterior
with
with
Extrapolations
Extrapolations
data and data analyst, accounting for 700
700 700
700
600
600 600
600
the analyst’s prior belief state. These
500
500 500
500
approaches stand in contrast to the 400
400
400
400
more traditional approach to data ex- 300
300
300
300 200
200
ploration, where the analyst repeatedly 200
200 100
100
queries the data for specific patterns 100
100 0 0
in a time- and labor-intensive process, 1950
19501952
19521954
19541956
19561956
1956
1960
19601962
1962 1950
19501952
19521954
19541956
19561956
1956
1960
19601962
1962
in the hope that some of the patterns

turn out to be interesting. This vision The structure search algorithm has identified four additive components in the data. The first 2 additive
means that the automation of data ex- components explain 98.5% of the variation in the data as shown by the coefficient of determination (R2)
values in accompanying table. Short summaries of the additive components are as follows:
ploration requires the identification
• A linearly increasing function.
of what the analyst knows (and does • An approximately periodic function with a period of 1.0 years
not know) about the domain, so that and with approximately linearly increasing amplitude.
knowledge and goals, and not only pat- • A smooth function.
terns, can be articulated by the system. • Uncorrelated noise with linearly increasing standard deviation.
To investigate the extent to which
review articles
manner to the analyst (subtask 4). This Exploitation: Automation

makes it possible to leverage the strong within the Real World
perceptual abilities of the human vi- The bottom right quadrant in Figure 1 is
sual system, as has been exploited and usually reached when the insights from
enhanced by decades of research in the
visual analytics community.23 At the It is important to other tasks must be translated back to
the application domain, often—but not
same time, the multiple comparisons
problem inherent in visual analysis
raise awareness of always—in the form of predictions or,
more generally, decisions. This quad-
may necessitate steps to avoid false the potential pitfalls rant deals with extracted knowledge
discoveries.51 Automating subtask 4
beyond some predefined visualizations
and side effects and less with data, involving the under-
standing of the patterns and models,
(as in the Automatic Statistician, see of higher levels of publishing them as building blocks for
Figure 3) requires a good understand-
ing of the particular perception and
automation in data new discoveries (for example, in scien-
tific papers or reports), putting them
cognition capacities and preferences science. into operation, validating and moni-
of each user, a question that also fea- toring their operation, and ultimately
tures prominently in the related area revising them. This quadrant is usu-
of explainable artificial intelligence, ally less open-ended, so it is no surprise
which we will discuss. that some specific activities here, such
Such visualizations and other kinds as reporting and maintenance, can be
of tools for navigating the data must al- automated to a high degree.
low for rich and intuitive forms of inter- The interpretation of the extracted
action (subtask 5), to mitigate the open- knowledge is closely related to the
endedness of typical data exploration area of explainable or interpretable
tasks. They must allow the analyst to machine learning. Recent surveys
follow leads, verify or refine hypotheses cover different ways in which explana-
by drilling deeper, and provide feed- tions can be made, but do not analyze
back to the data exploration system the degree and form of automation
about what is interesting and what is (for example, Guidotti et al.16). Clearly,
not. A huge challenge for automation the potential for automation depends
is how a novice data analyst could be strongly on whether a generic expla-
given hints and recommendations of nation of a model (global explana-
the type of an expert might use, assist- tion) or a single prediction (local ex-
ing in the process of data navigation, planation) is required, and whether
from the combinatorial explosion of the explanation has to be customized
ways of looking into the data and pos- for or interact with a given user, by
sible kinds of patterns. For instance, adaptation to their background, ex-
the SeeDB45 and Voyager49 systems in- pectations, interests and personality.
teractively recommend visualizations Explanations must go beyond the in-
that may be particularly effective, and spection or transformation of models
Interactive intent modeling35 has been and predictions, and should include
proposed to improve information seek- the relevant variables for these predic-
ing efficiency in information retrieval tions, the distribution of errors and
applications. the kind of data for which it is reliable,
Each of the five subtasks is chal- the vulnerabilities of a model, how
lenging on its own and contains many unfair it is, and so on. A prominent
design choices that may require ex- example following the mechanization
pert knowledge. We argue that the form of automation is the Automatic
limitations of current AI techniques Statistician,b,29 which can produce a
in acquiring and dealing with human textual report on the model produced
knowledge in real-world domains are (for a limited set of problem classes).
the main reason why automation in Figure 3 shows a fragment of such a
this quadrant is typically in the form of report, including graphical represen-
assistance. Meanwhile, we should rec- tations and textual explanations of the
ognize that the above subtasks are not most relevant features of the obtained
independent, as they must combine, model and its behavior.
through the composition form of auto- We believe that fully understanding
mation, to effectively assist the data the behavior and effect of the models
analyst, and non-expert users, in their
search for new insights and discoveries. b https://www.automaticstatistician.com/
84 COMM UNICATIO NS O F THE ACM | M A R C H 2022 | VO L . 65 | NO. 3

review articles
and insight produced in earlier stages ticipate effects and risks, as well as to ing the experiment), but it depends
of the data science pipeline is an inte- optimize for the best solutions. Opti- on whether the operating conditions
gral part of the validation of the entire mization tools that have proven so use- that were used initially still hold after
process, and key to a successful deploy- ful in the AutoML scenario can be used the data shift. Reliable and well under-
ment. However, ‘internal’ evaluation, to derive globally optimal decisions stood models can often be reused even
which is usually coupled with model that translate from the digital twin to in new or changing circumstances,
building or carried out immediately the real world, provided the simulator through domain adaptation, transfer
after, is done in the lab, trying to maxi- is an accurate model at the required learning, lifelong learning, or refram-
mize some metric on held-out data. In level of abstraction. The digital twin ing;20 this represents a more composi-
contrast, validation in the real world re- can also be a source of simulated data tional form of automation.
fers to meeting some goals, with which for further iterations of the entire data Data science creates many patterns,
the data, objective functions and other science process. models, decisions, and meta-knowl-
elements of the process may not be Deployment becomes more com- edge. The organization and reuse of
perfectly aligned. Consequently, this plex as more decisions are made, models and patterns can be automated
broad perspective of the ‘external’ vali- models are produced and combined, to some degree via inductive databas-
dation poses additional challenges for and many users are involved. Accord- es, via specialized databases of models
automation, as domain context plays ingly, we contend that automating (for example, machine learning model
a more important role (Figure 1). This model maintenance and monitoring is management46), or by means of large-
is especially the case in areas, where becoming increasingly relevant. This scale experimentation platforms, such
optimizing for trade-offs between ac- includes tracing all the dependencies as OpenML.c In the end, we believe the
curacy and fairness metrics may still between models, insights and deci- automation of knowledge management
end up producing undesirable global sions that were generated during train- and analysis for and from data science
effects in the long term, or areas such ing and operation, especially if re-train- activities will be a natural evolution of
as safety-critical domains, where ex- ing is needed,36 resembling software the automation of data management
perimenting with the actual systems is maintenance in several ways. Some and analysis.
expensive and potentially dangerous, aspects of monitoring trained models
for example, in medical applications or seem relatively straightforward and Perspectives and Outlook
autonomous driving. A very promising automatable, by re-evaluating indica- The quest for automation, in the broad
approach to overcome some of these tors (metrics of error, fairness, among context of data analysis and scientific
challenges is the use of simulation, others) periodically and flagging im- discovery, is not new, spanning decades
where an important part of the appli- portant deviations, as a clear example of work in statistics, artificial intelli-
cation domain is modeled, be it a hos- of the assistive form of automation, gence (AI), databases, and program-
pital11 or a city. The concept of ‘digital which allows for extensive reuse. Once ming languages. We now visit each of
twins’40 allows data scientists to deploy models are considered unfit or de- these perspectives in turn, before draw-
their models and insights in a digital graded, retraining to some new data ing some final conclusions.
copy of the real world, to understand that has shifted from the original data
and exploit causal relations, and to an- seems easily mechanizable (repeat- c www.openml.org44
Selected research challenges in automating data science, with their associated quadrants and likely forms of automation (mechanization,
composition, and assistance).
Quadrant Challenge Mechanization Composition Assistance

Generic Enhancing human-AI collaboration, by incorporating domain context for
interactively defining and refining the goal of data science activities.
× ×
Generic Addressing ethical, privacy, and legal issues in the automation
of data science.
× × ×
Model Building Extending AutoML to tasks beyond supervised learning. ×
Model Building/Data Engineering Generating meaningful features, considering domain context and task. × ×
Data Engineering Streamlining the ETL (Extract, Transform, Load) process by using
pipeline schemas and reusing preprocessing subcomponents.
×
Data Engineering Expediting the data cleaning, outlier detection and
data imputation processes.
× × ×
Data Exploration Supporting the design of interactive data and pattern visualizations. ×
Data Exploration Developing human-AI collaborative systems for
data and pattern exploration.
× ×
Exploitation Generating collaborative reports and presentations, facilitating
the interrogation, validation and explanation of models and results.
× ×
Exploitation Dealing with concept drift, monitoring the interaction of several data
science models, and assessing their effects more globally.
× ×
review articles
First, there is a long tradition in AI Third, in a database context, the con- work, and how data science projects
of attempts to automate the scientific cept of inductive query languages al- evolve from conception to deployment
discovery process. Many researchers lows a user to query the models and pat- and maintenance, will be key for more
have tried to understand, model, and terns that are held in the data. Patterns ambitious tools. Progress in areas of AI
support a wide range of scientific pro- and models become “first-class citi- such as reinforcement learning can ac-
cesses with AI, including approaches to zens” with the hope of reducing many celerate this.
leverage cognitive models for scientific activities in data science to a querying It is important to raise awareness of
discovery (such as Kepler’s laws).26 More process, in which the insights obtained the potential pitfalls and side effects of
recent and operational models of scien- from one query led to the next query, higher levels of automation in data sci-
tific discovery include robot scientists,24 until the desired patterns and models ence. These include over-reliance on
which are robotic systems that design have been found. These systems are typ- the results obtained from systems and
and carry out experiments in order to ically based on extensions of SQL and tools; the introduction of errors that are
find models or theories, for example, in other relational database languages (for subtle and difficult to detect; and cogni-
the life sciences. While these attempts example, Blockeel et al.3). Doing data tive bias towards certain types of obser-
included experimental design and not science as querying or programming vations, models and insights facilitated
only observational data, they were also may help bridge the composition and by existing tools. Also, data science
specialized to particular domains, re- mechanization forms of automation. tools in the context of human-AI collab-
ducing the challenges of the domain Fourth, in recent years, there has oration are seen as displacing the work
context (the vertical dimension in Figure been an increasing attention on proba- practice of data scientists, leading to
1). Many important challenges remain bilistic programming languages, which new roles.47 Similarly, this collaborative
in this area, including the induction or allow the expression and learning of view suggests new forms of interaction
revision of theories or models from very complex probabilistic models, extend- between data scientists and machines,
sparse data; the transfer of knowledge ed or combined with first-order logic.9 as these become proactive assistants
between domains (which is known to Probabilistic programming languages rather than tools.1
play an important role in the scientific have been used inside tools for democ- With all of this in mind, we cau-
process); the interplay between the de- ratizing data science, such as BayesDB30 tiously make the following predic-
sign of methodology, including experi- and Tabular,15 which build probabilistic tions. First, it seems likely that there
ments, and the induction of knowledge models on top of tabular databases and will continue to be useful and signifi-
from data; and the interaction between spreadsheets. Probabilistic program- cant advances in the automation of
scientists and advanced computational ming can also, for example, propagate data science in the three most acces-
methods designed to support them in uncertainty from an imputation meth- sible quadrants in Figure 1: data en-
the scientific discovery process. od for missing data into the predictive gineering (for example, automation
Second, there were efforts in the analysis and incorporate background of inference about missing data and
1980s and 1990s at the interface of knowledge into the analysis. This may of feature construction), model build-
statistics and AI to develop software support a more holistic view of automa- ing (for example, automated selection,
systems that would build models or tion by increasing the integration of the configuration and tuning beyond the
explore data, often in an interactive four quadrants in Figure 1, which may current scope of AutoML), and ex-
manner, using heuristic search or plan- mutate accordingly. ploitation (for example, automated
ning based on expert knowledge (for All four of these approaches have techniques for model diagnosis and
example, Gale13 and St. Amant et al.38). had some success in specific domains summarization). Second, for the most
This line of research ran up against the or standard situations, but still lack challenging quadrant of data explora-
limits of knowledge representation, the generality and flexibility needed tion, and for tasks in the other quad-
which proved inadequate to capture the for broader applications in data sci- rants where representation of domain
subtleties of the statistical strategies ence, as the discipline incorporates new knowledge and goals is needed, we an-
used by expert data analysts. Today, the methods and techniques at a pace that ticipate that progress will require more
idea of a ‘mechanized’ statistical data these systems cannot absorb. More sci- effort. And third, across the full spec-
analyst is still being pursued (see the entific and community developments trum of data science activities, we see
Automatic Statistician29), but with the are needed to bridge the gap between great potential for the assistance form
realization that statistical modeling of- how data scientists conduct their work of automation, through systems that
ten relies heavily on human judgement and the level of automated support that complement human experts, tracking
in a manner that is not easy to capture such approaches can provide. The ac- and analyzing workflows, spotting er-
formally, beyond the top right quad- companying table presents a series of rors, detecting and exposing bias, and
rant in Figure 1. It is then the composi- indicative technical challenges for auto- providing high-level advice. Overall,
tion and assistance forms of automation mating data science. we expect an increasing demand for
that are still targeted when modular While AutoML will continue to be a methods and tools that are better in-
data analytic operations are combined flagship example for automation in data tegrated with human experience and
into plans or workflows in current data science, we expect most progress in the domain expertise, with an emphasis
science platforms, such as KNIME and following years to involve stages and on complementing and enhancing the
Weka, or in the form of intelligent data tasks other than modeling. Capturing work of human experts rather than on
science assistants.37 information about how data scientists full mechanization.
86 COMMUNICATIO NS O F TH E AC M | M A R C H 2022 | VO L . 65 | NO. 3

review articles
35th ACM SIGPLAN Conf. on Programming Language 2018 CHI Conf. on Human Factors in Computing
References
Design and Implementation, 2014, 542–553. Systems, 1–12.
1. Amershi, S. et al. Guidelines for human-AI interaction.
28. Liu, C., et al. Progressive neural architecture search.
In Proceedings of the 2019 CHI Conf. on Human
In Proceedings of the European Conf. on Computer
Factors in Computing Systems, 2019, 1–13. Tijl De Bie (tijl.debie@ugent.be) is a professor in the
Vision, 2018, 19–34.
2. Bjorkman, A. et al. Plant functional trait change across Internet and Data Lab (IDLab) at Ghent University, Belgium.
29. Lloyd, J., Duvenaud, D., Grosse, R., Tenenbaum, J., and
a warming tundra biome. Nature 562, 7725 (2018), 57.
Ghahramani, Z. Automatic construction and natural- Luc De Raedt (luc.deraedt@kuleuven.be) is a professor at
3. Blockeel, H., Calders, T., Fromont, É., Goethals, B.,
language description of nonparametric regression the Department of Computer Science and Director of the
Prado, A., and Robardet, C. An inductive database
models. In Proceedings of the 28th AAAI Conf. on KU Leuven Institute for AI at KU Leuven, Belgium, and
system based on virtual mining views. Data Mining and
Artificial Intelligence, 2014. Wallenberg Guest Professor at Örebro University, Sweden.
Knowledge Discovery 24, 1 (2012), 247–287.
30. Mansinghka, V., Tibbetts, R., Baxter, J., Shafto, P.,
4. Brazdil, P., Carrier, C., Soares, C., and Vilalta, R. José Hernández-Orallo (jorallo@upv.es) is a professor
and Eaves, B. BayesDB: A probabilistic programming
Metalearning: Applications to Data Mining. Springer at the Valencian Research Institute for Artificial
system for querying the probable implications of data.
Science & Business Media, 2008. Intelligence, Universitat Politècnica de València, Spain.
2015; arXiv:1512.05006.
5. Chapman, P. et al. CRISP-DM 1.0 Step-by-step data
31. Martínez-Plumed, F. et al. CRISP-DM twenty
mining guide, 2000. Holger H. Hoos (hh@liacs.nl) is Professor of Machine
years later: From data mining processes to data
6. Chen, J., Jimenez-Ruiz, E., Horrocks, I., and Sutton, Learning at the Leiden Institute of Advanced Computer
science trajectories. IEEE Trans. Knowledge
C. ColNet: Embedding the semantics of Web tables Science (LIACS) at Leiden University, The Netherlands,
and Data Engineering (2020), 1; doi 10.1109/
for column type prediction. In Proceedings of the 33rd and adjunct professor of computer science at the
AAAI Conf. on Artificial Intelligence, 2019. TKDE.2019.2962680. University of British Columbia in Vancouver, Canada.
7. Dasu, T. and Johnson, T. Exploratory Data Mining and 32. Nazabal, A., Williams, C., Colavizza, G., Smith, C., and
Data Cleaning. Wiley, 2003. Williams, A. Data engineering for data analytics: A Padhraic Smyth (smyth@ics.uci.edu) is Chancellor’s
8. De Bie, T. Subjective interestingness in exploratory classification of the issues, and case studies. 2020; Professor in the Computer Science and Statistics
data mining. In Proceedings of the Intern. Symp. arXiv:2004.12929. Departments at the University of California, Irvine, USA.
Intelligent Data Analysis. Springer, 2013, 19–31. 33. Rakotoarison, H., Schoenauer, M., and Sebag, M.
Automated machine learning with Monte-Carlo tree Christopher K.I. Williams (ckiw@inf.ed.ac.uk) is
9. De Raedt, L., Kersting, K., Natarajan, S., and Poole, a professor of Machine Learning in the School of
D. Statistical relational artificial intelligence: Logic, search. In Proceedings of the 28th Intern. Joint Conf.
on Artificial Intelligence 7, (2019); doi: 10.24963/ Informatics, University of Edinburgh, U.K, and a Turing
probability, and computation. Synthesis Lectures on Fellow at the Alan Turing Institute, London, U.K.
Artificial Intelligence and Machine Learning 10, 2 ijcai.2019/457; https://doi.org/10.24963/ijcai. 2019/457.
(2016), 1–189. 34. Ratner, A., De Sa, C., Wu, S., Selsam, D., and Ré, C.
10. Donoho, D. 50 years of data science. J. Computational Data programming: Creating large training sets, Acknowledgments
and Graphical Statistics 26, 4 (2017), 745–766. quickly. In Proceedings of the 30th Intern. Conf.
on Neural Information Processing Systems, 2016, The authors thank the anonymous referees for their
11. Elbattah, M. and Molloy, O. Analytics using machine
3574–3582. comments, which helped to improve the article.
learning-guided simulations with application to
healthcare scenarios. Analytics and Knowledge Mgmt. 35. Ruotsalo, T., Jacucci, G., Myllymäki, P., and Kaski, S.
Auerbach Publications, 2018, 277–324. Interactive intent modeling: Information discovery
Funding information.
12. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., beyond search. Commun. ACM 58, 1 (Jan. 2014), 86–92.
Blum, M., and Hutter, F. Efficient and robust automated 36. Sculley, D. et al. Hidden technical debt in machine TDB: The European Research Council under the European
machine learning. Advances in Neural Information learning systems. Advances in Neural Info. Processing Union’s Seventh Framework Programme (FP7/2007-
Processing Systems 28, 2015, 2962–2970. Systems 28, (2015), 2503–2511. 2013) / ERC Grant Agreement no. 615517. The Flemish
13. Gale, W. Statistical applications of artificial 37. Serban, F., Vanschoren, J., Kietz, J., and Bernstein, A. A Government under the “Onderzoeksprogramma Artificiële
intelligence and knowledge engineering. The survey of intelligent assistants for data analysis. ACM Intelligentie Vlaanderen” programme. The Fund for
Knowledge Engineering Rev. 2, 4 (1987), 227–247. Computing Surveys 45, 3 (2013), 1–35. Scientific Research–Flanders (FWO–Vlaanderen), project
14. Geng, L. and Hamilton, H. Interestingness measures 38. St. Amant, R. and Cohen, P. Intelligent support for no. G091017N, G0F9816N, 3G042220.
for data mining: A survey. ACM Computing Surveys 38, exploratory data analysis. J. Computational and LDR: The research reported in this work was supported
3 (2006), 9. Graphical Statistics 7, 4 (1998), 545–558. by the European Research Council under the European
15. Gordon, A., Graepel, T., Rolland, N., Russo, C., 39. Sutton, C., Hobson, T., Geddes, J., and Caruana, R. Data Union’s Horizon 2020 research and innovation programme
Borgstrom, J., and Guiver, J. Tabular: A schema driven diff: Interpretable, executable summaries of changes (grant agreement No [694980] SYNTH: Synthesising
probabilistic programming language. ACM SIGPLAN in distributions for data wrangling. In Proceedings of Inductive Data Models), the EU H2020 ICT48 project
Notices 49, 1 (2014), 321–334. the 24th ACM SIGKDD Conf. on Knowledge Discovery “TAILOR” contract #952215; the Flemish Government’s
16. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., and Data Mining, 2018. “Onderzoeksprogramma Artificiële Intelligentie
Giannotti, F., and Pedreschi, D. A survey of methods for 40. Tao, F. and Qi, Q. Make more digital twins. Nature 573 Vlaanderen” programme and the Wallenberg AI,
explaining black box models. ACM Computing Surveys (2019), 490–491. Autonomous Systems and Software Program funded by
51, 5 (2018). 93. 41. Thornton, C., Hutter, F., Hoos, H., and Leyton-Brown, K. the Knut and Alice Wallenberg Foundation.
17. Guyon, I., et al. A brief review of the ChaLearn AutoML Auto-WEKA: Combined selection and hyperparameter
optimization of classification algorithms. In JHO: EU (FEDER) and the Spanish MINECO, Grant:
Challenge: Any-time any-dataset learning without RTI2018-094403-B-C3. Generalitat Valenciana, Grant:
human intervention. In Proceedings of the Workshop Proceedings of the 19th ACM SIGKDD Intern. Conf. on
Knowledge Discovery and Data Mining, 2013, 847–855. PROMETEO/2019/098. FLI, Grant RFP2-152. MIT-Spain
on Automatic Machine Learning 64 (2016), 21–30. F. INDITEX Sustainability Seed Fund. Grant: FC200944. EU
Hutter, L. Kotthoff, and J. Vanschoren, Eds. 42. Tukey, J. Exploratory Data Analysis. Pearson, 1977.
43. Tukey, J. and Wilk, M. Data analysis and statistics: An H2020. Grant: ICT48 project “TAILOR” contract #952215.
18. Heer, J., Hellerstein, J., and Kandel, S. Predictive
interaction for data transformation. In Proceedings of expository overview. In Proceedings of 1966 Fall Joint HHH: The research reported in this work was partially
Conf. on Innovative Data Systems Research, 2015. Computer Conf. (Nov. 7–10, 1966), 695–709. supported by the EU H2020 ICT48 project “TAILOR”
19. Heer, J., Hellerstein, J., and Kandel, S. Data Wrangling. 44. Vanschoren, J., Van Rijn, J., Bischl, B., and Torgo, L. contract #952215; by the EU project H2020-
Encyclopedia of Big Data Technologies. S. Sakr and A. OpenML: Networked science in machine learning. ACM FETFLAG-2018-01, “HumanE AI,” contract #820437, and
Zomaya, Eds. Springer, 2019. SIGKDD Explorations Newsletter 15, 2 (2014), 49–60. by start-up funding from Leiden University.
20. Hernández-Orallo, J., et al. Reframing in context: A 45. Vartak, M., Rahman, S., Madden, S., Parameswaran, PS: This material is based on work supported by
systematic approach for model reuse in machine A., and Polyzotis, N. SeeDB: Efficient data-driven the US National Science Foundation under awards
learning. AI Commun. 29, 5 (2016), 551–566. visualization recommendations to support visual DGE-1633631, IIS-1900644, IIS-1927245, DMS-
21. Hutter, F., Hoos, H., and Leyton-Brown, K. Sequential analytics. In Proceedings of the Intern. Conf. on Very 1839336, CNS-1927541, CNS-1730158, DUE-1535300;
model-based optimization for general algorithm Large Data Bases 8 (2015), 2182. by the US National Institutes of Health under award
configuration. In Proceedings of the Intern. Conf. on 46. Vartak, M., et al. ModelDB: A system for machine 1U01TR001801-01; by NASA award NNX15AQ06A.
Learning and Intelligent Optimization. Springer, 2011, learning model management. In Proceedings of
507–523. the ACM Workshop on Human-In-the-Loop Data CKIW: This work is supported in part by grant EP/
22. Hutter, F., Kotthoff, L., and Vanschoren, J., Eds. Analytics, 2016, 14. N510129/1 from the UK Engineering and Physical
Automated Machine Learning—Methods, Systems, 47. Wang, D., et al. Human-AI collaboration in data Sciences Research Council (EPSRC) to the Alan Turing
Challenges. Springer, 2019. science: Exploring data scientists’ perceptions of Institute. He thanks the Artificial Intelligence for Data
23. Keim, D., Andrienko, G., Fekete, J., Görg, C., automated AI. In Proceedings of the ACM Conf. on Analytics team at the Turing Institute for many helpful
Kohlhammer, J., and Melançon, G. Visual analytics: Human-Computer Interaction 3, 2019, 1–24. conversations.
Definition, process, and challenges. Information 48. Wasay, A., Athanassoulis, M., and Idreos, S. Queriosity:
Visualization. Springer, 2008, 154–175. Automated data exploration. In Proceedings of the
24. King, R., et al. Functional genomic hypothesis 2015 IEEE Intern. Congress on Big Data, 716–719. © 2022 ACM 0001-0782/22/3
generation and experimentation by a robot scientist. 49. Wongsuphasawat, K., Moritz, D., Anand, A., Mackinlay,
Nature 427, 6971 (2004, 247. J., Howe, B., and Heer, J. Voyager: Exploratory
25. Kotthoff, L., Thornton, C., Hoos, H., Hutter, F., and analysis via faceted browsing of visualization
Leyton-Brown, K. Auto-WEKA 2.0: Automatic model recommendations. IEEE Trans. Visualization and
selection and hyperparameter optimization in WEKA. Computer Graphics 22, 1 (2015), 649–658.
J. Machine Learning Research 18, 1 (2017), 826–830. 50. Wulder, M., Coops, N., Roy, D., White, J., and
26. Langley, P., Simon, H., Bradshaw, G., and Zytkow, J. Hermosilla, T. Land cover 2.0. Intern. J. Remote Watch the authors discuss
Scientific Discovery: Computational Explorations of the Sensing 39, 12 (2018), 4254–4284. this work in the exclusive
Creative Processes. MIT Press, 1987. 51. Zgraggen, E., Zhao, Z., Zeleznik, R., and Kraska, T. Communications video.
27. Le, V. and Gulwani, S. FlashExtract: A Framework for Investigating the effect of the multiple comparisons https://cacm.acm.org/videos/
data extraction by examples. In Proceedings of the problem in visual analysis. In Proceedings of the automating-data-science

Automating Data Science

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Automating Data Science

Uploaded by

Copyright:

Available Formats

review articles

such as automated machine learning

76 COMM UNICATIO NS O F THE AC M | M A R C H 2022 | VO L . 65 | NO. 3

78 COM MUNICATIO NS O F TH E AC M | M A R C H 2022 | VO L . 65 | NO. 3

in more sophisticated tools that have

a Data wrangling and data cleansing are

80 COM MUNICATIO NS O F TH E ACM | M A R C H 2022 | VO L . 65 | NO. 3

storage, aggregation and data clean-

Social Network Analysis

82 COMM UNICATIO NS O F THE ACM | M A R C H 2022 | VO L . 65 | NO. 3

in the hope that some of the patterns

manner to the analyst (subtask 4). This Exploitation: Automation

84 COMM UNICATIO NS O F THE ACM | M A R C H 2022 | VO L . 65 | NO. 3

Quadrant Challenge Mechanization Composition Assistance

86 COMMUNICATIO NS O F TH E AC M | M A R C H 2022 | VO L . 65 | NO. 3

You might also like