Professional Documents
Culture Documents
AI Operations Platforms
Thomas Rausch Waldemar Hummer Vinod Muthusamy
TU Wien / IBM Research AI IBM Research AI IBM Research AI
t.rausch@dsg.tuwien.ac.at waldermar.hummer@gmail.com vmuthus@us.ibm.com
Abstract—Operationalizing AI has become a major endeavor digit millions). Manual maintenance of these pipelines turned
in both research and industry. Automated, operationalized out to be unsustainable, especially as the company is moving
arXiv:2006.12587v1 [cs.DC] 22 Jun 2020
pipelines that manage the AI application lifecycle will form a sig- to shorter training intervals and training of large numbers of
nificant part of tomorrow’s infrastructure workloads. To optimize
operations of production-grade AI workflow platforms we can patient-specific models.
leverage existing scheduling approaches, yet it is challenging to AI platforms need to cope with an ever increasing number
fine-tune operational strategies that achieve application-specific of AI pipelines used to manage the AI application life-
cost-benefit tradeoffs while catering to the specific domain char- cycle. Capacity planning becomes more difficult given the
acteristics of machine learning (ML) models, such as accuracy, heterogeneous workloads and infrastructure required, and the
robustness, or fairness. We present a trace-driven simulation-
based experimentation and analytics environment that allows variety of automation rules that trigger pipeline executions.
researchers and engineers to devise and evaluate such operational Developing operational strategies, such as optimized schedul-
strategies for large-scale AI workflow systems. Analytics data ing of automated pipelines, therefore plays a major role
from a production-grade AI platform developed at IBM are in the development of AI platforms. Ultimately, our goal
used to build a comprehensive simulation model. Our simulation is to leverage the runtime data of pipeline executions to
model describes the interaction between pipelines and system
infrastructure, and how pipeline tasks affect different ML model build predictive models that allow us to automatically derive
metrics. We implement the model in a standalone, stochastic, optimized scheduling decisions for continuous improvement
discrete event simulator, and provide a toolkit for running of AI models. To that end, this paper contributes a trace-
experiments. Synthetic traces are made available for ad-hoc driven simulation framework, for analyzing and experiment-
exploration as well as statistical analysis of experiments to test ing in large-scale AI operations platforms. This paper is an
and examine pipeline scheduling, cluster resource allocation, and
similar operational mechanisms. extended version of the OpML’20 paper [4]. Our approach
builds on modeling the structural and qualitative aspects of
I. I NTRODUCTION AI pipelines, and executing these pipelines in a simulated
Developing and operating artificial intelligence (AI) appli- environment, to fine-tune operational strategies in the real
cations involves complex workflows that comprise many tasks system, by systematically mutating parameters in an iterative,
performed in an increasingly automated fashion. This includes exploratory process. The simulation model reflects existing
data preprocessing flows, data bias checks, machine learn- high-level AI ops platform architectures such as ModelOps [1]
ing (ML) model training, vulnerability mitigation algorithms, and IBM Watson OpenScale, and considers concepts from AI
model compression steps, and potentially many others. Some application workflow as first-class citizens. We implement the
examples of such workflows are shown in Figure 1. Systems model in a stochastic discrete event simulator using existing
such as ModelOps [1] are used to compose and automate techniques [5], [6]. To obtain a realistic representation of
such high-level AI workflows into pipelines that track both different execution parameters (e.g., job arrival patterns, task
build and runtime metrics of deployed ML models. Pipelines execution times, training data sizes), we apply clustering and
are re-executed either automatically due to events such as sampling techniques to extract statistical distributions from a
arrival of new training data, or manually by developers or data
scientists [2], [3]. Furthermore, these pipelines can be long
running. It is not uncommon for the training of a deep neural
Preprocess Train Evaluate Deploy
network, for example, to take days. Consequently, a production Preprocess Train Evaluate
system may have hundreds or thousands of pipelines executing
at any given time to build and maintain AI models. Preprocess Train Harden Compress Base
In one of the use cases we have explored as part of this Model
Bias Deploy
research, a company from the health care domain trains ML Detection Evaluate User-specific models
Adapting Windowing
Online Ensemble
potential improvement of all automated AI pipelines, while
Accuracy
Static Dynamic
maintaining SLAs and balancing infrastructure utilization. Fig-
measures – measures –
Delay onset of Recover from ure 4 conceptualizes a scheduler that deploys pipelines on to
20%
attacks attacks 250K 500K 750K
Fig. 2. Illustration of model performance over time, indicating an adversarial Scheduler / Optimizer Infrastructure / Budget
attack [22] (left), accuracy drop due to a sudden concept drift [23] (right),
●
Probabilistic Parameters
and abstract concept drift patterns [11] (bottom). ●
Model staleness
●
Expected improvement (EI)
●
Risk (e.g., need for human intervention)
In order to manage the risks and prevent models from be- ●
User Preferences
●
Priority of models
coming stale, it is crucial for AI operations to employ pipeline ●
Resource Requirements
●
Hardware, duration
automation with triggers that monitor runtime indicators of ●
Resource Availability
deployed models. An execution trigger e is a set of rules that
reason about the pipeline inputs (such as the dataset used for Fig. 4. An AI pipeline scheduler optimizes overall user satisfaction and
training, to detect changes or drift), previous executions of resource balancing
the pipelines (when was the model last built and deployed),
and performance of the deployed model. If any of the rules
C. Challenges in Operational Research for AI Ops
underlying e is met, an execution of the pipeline is triggered
automatically. Figure 3 illustrates the feedback loop. Recent work on scheduling and multi-tenant resource shar-
ing allows users to utilize training infrastructure in an efficient
way [9]. Yet, devising optimal operational strategies is still
Data
challenging due to a number of factors:
e Preprocess Train Evaluate Deploy Cutting Edge: As large-scale AI pipelines are an emerging
Performance
field, industry and research have not yet converged to a
Runtime
Monitoring
Model common configuration format for pipelines, hence pipeline
definitions are often custom-built and not readily available for
analysis or optimization.
Limited Testability: The execution of AI pipelines - in-
Fig. 3. Automated AI pipelines with execution triggers, quality gates, and cluding tasks like model training on GPU clusters - is often
runtime monitoring
long-running, resource intensive, and hence costly. Operational
The basic unifying characteristic of AI ops pipelines is research relies on experimentation with quick feedback cycles,
that they generate or augment AI models. Each task plays yet executing test workloads to evaluate large-scale strategies
a role in either training a model, or enhancing model metrics in the real system may be prohibitively expensive.
such as performance or robustness. This concept allows us to Lack of Data: As the field is still evolving, there is currently
reason about a model and its properties, and the pipeline it a lack of datasets about AI pipeline executions, capturing
is generated by, almost interchangeably. We can think of the the variety of specialized ML tasks (e.g., data preprocessing,
model as a latent component of a pipeline, whose properties model training, bias and robustness checks, model compres-
are assigned or changed once its specific tasks are executed. sion, etc), including the specific effects of each task on the
code, data, and model assets, across the entire lifecycle.
B. Optimized Operation of Automated AI Pipelines
A key challenge is to determine the trade-off of costs asso- IV. E XPERIMENTATION AND A NALYTICS FOR AI
ciated with pipeline execution, and the benefit of (expected) O PERATIONS P LATFORMS
performance improvement [13]. Optimized scheduling and re- In this section we introduce PipeSim, an experimentation
source sharing for ML pipelines under constraints is an active and analytics framework for AI operations platforms. The
research problem [9], [13]. Previous research has found that system architecture is illustrated in Figure 5. The real system,
ML platform users often do not use infrastructure resources which consists of compute clusters, pipeline executions, and
in a useful way [9]. For example, users would reserve several various lifecycle services (e.g., model compression, model
GPUs for weeks to improve models that already had an robustness, and data bias checkers), is represented in a modeled
system, which feeds into a simulator component. The system Pipeline & Tasks Preprocess Train ... Deploy
model is parameterized with simulation parameters whose
distributions are sampled from empirical data, extracted from Read Run Persist
Task Executor
real-world usage logs of the system. & System Operations Data Training Model
emulates
Modeled System Real System
labeled with the input that triggers a transition. To better
Simulation
reason about the control flow it is useful to also explicitly
Compute Lifecycle
Parameters Empirical model decisions and joins (task that are only transitioned
Clusters Services
Data
System Operational
(Usage Logs) to if all previous task have been executed). However, as
Pipe-
Model Strategies
lines
this is not the focus of the paper, we omit these defini-
tions. We model different types of tasks denoted τ , where
emulates Operational
Strategies
τ ∈ {preprocess, train, evaluate, compress, harden, ...}. We
control
shorten the notation to the first letter of the type. Instances
creates
Predictive
Models
Statistical of tasks of a specific type in a pipeline are denoted by v τ ∈
Analysis
Tool Vpτ ⊂ Vp . Each task instance holds type specific variables, for
Experiment revise example we associate the original and transformed data asset
Simulator Experiment
Parameters Database
Exploratory
D and D0 with a data preprocessing task v p .
revise
Analysis b) Resources: A resource R represents an infrastructure
Dashboard
component required for executing pipeline tasks. We model
a generic system comprising (i) a generic data storage, (ii)
Fig. 5. System Architecture of PipeSim
a training, and (iii) a compute infrastructure, but we allow
The main entry point for users is to define an experiment for custom configuration fo resource types. Data stores are
and its parameters; once it gets executed by the simulator, the abstracted in terms of read and write operations, and can there-
user can drill into the results using the exploratory analysis fore be anything from a cloud-based object store (such as S3),
dashboard. The statistical analysis tool analyzes the results to a relational or NoSQL database. Cloud-based AI platforms,
of the experiment database and is used to create predictive such as Watson OpenScale, perform training of models on
models, which themselves feed into the operational strategies, a dedicated infrastructure with specialized hardware (GPUs),
to close the feedback loop into the real system. whereas generic compute tasks, such as data preprocessing,
In the following subsections we discuss the overall approach may be performed on general purpose compute hardware
and the individual components of the system in more detail. (e.g., running Spark or Hadoop). Each compute resource is
assumed to have a specified job capacity and a work queue,
A. AI Ops Platforms: Conceptual System Model but we do not make assumptions about individual scheduling
As discussed in Section III-A, automated AI ops pipelines mechanisms.
integrate the entire lifecycle of an AI model, from training, c) Assets: An asset A denotes any data artifact that
to deployment, to runtime monitoring. In our definition of is transferred, processed, and generated by data stores and
pipelines, we distinguish between the build-time and run-time compute resources. Modeling asset characteristics is critical, as
view of the platform – analogous to the two steps in the the execution time of pipeline tasks often significantly depends
conceptual ML workflow described by [13]: the training phase on the size and dimensions of processed data sets, and is
and the inference phase. necessary to simulate traffic between tasks and data stores. We
1) Build-Time View: At build time, an AI pipeline generates distinguish between Data Assets D and Trained ML Models
or augments a model (classifier) by operating on data assets M . This allows us to store ML model specific characteristics,
and using underlying infrastructure resources (e.g., GPUs). like the number of neurons or layers of neural networks.
The build-time view of our system model is illustrated in d) Task Executors: Task executors encapsulate the actual
Figure 6. system operations performed by a task. For example, a task
a) Pipeline and Tasks: AI pipelines are compositions of executor of a training task on a distributed compute cluster is
tasks that create or operate on machine learning models [7], an iterative process of fetching training data from a remote
[13]. Formally, a pipeline Gp is a directed graph (digraph) storage (like S3) and running an optimization algorithm (e.g.,
Gp = (Vp , Ep ), where vertices Vp are tasks (i.e., work- gradient descent) on the data, and finally persisting the model
flow operations), and directed edges Ep are task transitions to the storage. We define the following system operations Ω =
{read(A), write(A), req(R), rel(R), exec(vpτ , R)} Where the there should be some randomness involved, the sequence of
first to are read and write operations on the data store for an steps in a synthetic pipeline should be sensible (e.g., a model
Asset A, req(R) and rel(R) are request and release operations validation task cannot precede a training task). This process is
for compute resources, and exec(v τ , R) is the task type τ challenging because the structure of complex AI ops pipelines,
specific execution of vpτ on R which describes how that i.e., beyond simple train–deploy workflows, is still poorly
task interacts with the system resources. A task executor understood. However, we can make some basic assumptions
is therefore a sequence of operations (ω0 , ..., ωn ), ωi ∈ Ω. about pipeline structures based on the prototypical pipelines
Typically the first and last operations in a sequence are read we have identified by analyzing both commercial and research
and write operations, respectively. use cases as presented in Figure 1.
2) Run-Time View: The outcome of a successful pipeline We also recognize that some steps within these pipeline
execution is usually a deployed model that is being served by structures may be optional. While a pipeline that generates a
the platform and used by applications for scoring. At runtime, model unconditionally requires a training step, not all require a
the deployed model has associated performance indicators that data preprocessing step if they make use of already structured
change over time. Some indicators can be measured directly, and curated datasets. When generating pipelines, some tasks
by inspecting the scoring inputs and outputs (e.g., confidence), therefore have a certain (possibly conditional) probability
whereas other metrics (e.g., bias, or drift) require continuous associated with them, that may depend on the state of the
evaluation of the runtime scoring data against the statistical pipeline currently being generated.
properties of the historical scorings and training data.
Task characteristics, such as the framework and algorithm
Model Train & Deploy Model Train & Deploy
used for training, or the number of operations in a prepro-
Pipeline triggers
Instances ... ... ... ... ... ...
cessing step, are also synthesized. Such characteristics can
be based on frequencies observed in production systems,
Models Confidence 0.78 0.82 0.80 0.85 or configured as experiment parameters. For example, while
& Metrics Trained Drift 0.09 0.10 0.21 Trained 0.08
Classifier v1 Staleness 0.00 0.01 0.02 Classifier v2 0.00 examining the AI pipeline executions of our platform we
Scoring found that 63% of training jobs use SparkML, 32% use
Requests monitors t1 t2 t3 t4
Time
Drift &
TensorFlow, 3% PyTorch, 1% Caffe, and 1% use a variety
Performance Trigger
Monitoring
Staleness Rules
Detector
✓ ✓ ✗ ✓ of other frameworks. Given that different frameworks map
to different infrastructures (e.g., a Spark cluster) and often
Fig. 7. Run-time view: Drift detection and continuous retraining correlate with significantly different execution times (see later
in Figure 9(b)), we want to easily adapt these percentages to
The ability to react to scoring events, introspect the his- observe the effect on the system.
torical payloads, and continuously compute performance met- 2) Synthetic Assets: A key ingredient of AI pipelines are as-
rics is a critical piece of functionality that itself requires a sets, i.e., training data assets, and trained models. We describe
significant amount of computational resources. In fact, model properties of assets as multivariate random variables, as this
performance monitoring tools like drift detectors are them- allows us to use common sampling methods for synthesizing
selves ML models, for example based on model explanation assets.
classifiers like IME [24], that need to be trained, deployed,
and continuously maintained. For example, we model a data asset D as an observation
Figure 7 illustrates the interaction between model assets and of a multivariate random variable D = (Dd , Dr , Db ) where
performance monitoring in PipeSim. A detector component Dd constitutes the number of dimensions of the dataset (e.g.,
monitors a trained classifier, and computes drift and staleness number of columns in tabular data), Dr the number of rows or
metrics over time. At time point t3 , a trigger rule detects that instances, and Db the disk space in bytes required to store the
the drift metric exceeds a threshold, and triggers a retraining asset (uncompressed). In Section V-A we describe how we can
pipeline which creates version v2 of the classifier. Note that obtain synthetic data assets by sampling from a multivariate
the pipeline may employ active learning with humans in the Gaussian mixture fitted on empirical data.
loop to perform data selection and labeling. Describing a trained model M in this way is not as straight
forward, as many of its properties are the result of stochastic
B. Pipeline and Data Synthesizer processes in the pipeline execution. In general, we say that
To run an experiment we need to artificially generate data a model has a set of static and dynamic properties. Static
that follow some underlying process we can control. We properties include those that are assigned by the pipeline at
describe our approach to generate synthetic pipelines and build-time, such as the prediction type Mt (e.g., binary or
assets that are used as input for a simulation. multiclass classification, or regression) or the model type Me
1) Synthetic Pipelines: Because the goal is to simulate the (e.g., Linear Regression, Random Forest, or Neural Network).
execution of a large number of pipelines, a key element of Dynamic properties include metrics we have described in
the experimentation environment is a Pipeline Synthesizer that Section III-A, such as model performance Mp , and CLEVER
stochastically generates plausible pipelines. That is, although score MCLEVER [19].
C. Process Simulator A. Statistical Modeling and Pipeline Simulation
1) Task Execution: Besides the side effects a task execution As discussed in Section IV-B, a fundamental requirement
has on the properties of an Asset, we are mostly interested for an experimentation environment is to allow meaningful
in simulating the execution duration of a task. Because a reasoning over the original system, and therefore requires
task is a sequence of system operations (ωn ), we can define generated data and underlying process simulation to reflect
the execution duration of a task t(v τ ) as the sum over properties of that system. While we can make some assump-
the execution tion about, e.g., the distribution of number of pipelines per user
P duration of the task’s system operations, i.e.,
t(v τ ) = t(ωi ). Similarly, because we define a pipeline as a (where the Pareto principle will likely apply), most data should
sequence of steps, we can define the execution duration of an be based on empirical observations. The analytics database of
entire pipeline as the sum over task execution durations (the (product name blinded) with several million rows of user and
current system model assumes that tasks are not running in system events is our primary source of empirical data. We run
parallel). This allows granular modeling of system behavior. queries on this database an fit different statistical distributions
For example, t(req(R)) equates to the time a task as to wait for on the extracted data, which include several thousand pipeline
a resource R to become available, t(read(A)) and t(write(A)) execution traces.
are functions of Ab and the up/download bandwidth and la- In general, we generate random entities in the simulator by
tency associated with the data store, and so on. These functions first using scikit-learn or SciPy to fit statistical models on the
can be expressed as analytical functions of system properties. respective observed data. The generated models or distribu-
However, for functions that are subject to complex processes, tion parameters are exported using Python’s serialization to
such as data preprocessing t(execp (v p , R)), or model training the simulator. During simulation time, we can then sample
task t(exect (v t , R)), we rely on empirical data and statistical individuals from the statistical model. In principle, fitting all
modeling of these function. necessary models that we currently use could also be done
2) Pipeline Arrivals: A fundamental parameter of a plat- on the fly when starting the simulation, but would take in the
form simulation is the rate at which pipeline executions are order of a few minutes. This allows us to plug in the live,
triggered. We say the average arrival rate is λ. A typical updated data sources for grounding simulation parameters.
way for simulating arrivals is to model the time between 1) Synthesizing Data Assets: The data processing compo-
arrivals (interarrivals) as a random variable [6] and sampling nent of (product name blinded) stores metadata about data
from the underlying distribution after every event. It is well assets, i.e., the number of dimensions and instances, as well
known that interrarivals typically follow some exponential or as the amount of bytes it transfers between object storage
related process. Researchers have found that, for example, and execution platform. To generate synthetic data assets we
TCP traffic is well described by lognormal, Pareto, or Weibull sample from the distribution of rows, columns and bytes
distributions [25]. Our data suggests that pipeline-interarrival- processed. Note that, while the usage data we sample from
times δ follows a exponentiated Weibull process (see V-A). is mostly about processing of tabular/structured data, our
However, pipelines are executed at a given time either because approach generalizes to arbitrary types of data (e.g., images,
they were triggered manually by a user (or other application), text, speech).
or they were triggered automatically due to a rule. To us as an We filter data assets with less than 50 rows and 2 columns,
operator, the former is a random process, whereas the latter as they are unlikely to be used for training models. Figure 8
we have control over. It is therefore useful to simulate the shows two log-transformed density scatter plots of a de-
processes that underly the user-defined rules (e.g., the arrivals duplicated subsample of observations. The first shows the
of new data, the expected model drift, etc.). Describing these distribution of columns and rows of data assets. We observe
processes is part of the run-time view of the system and is that there are clusters of assets with similar structure. The
ongoing work. Furthermore, it is important to preserve, to second shows the dimension of the data (rows × columns)
some degree, the complexity of workload arrival patterns, and and corresponding bytes, which reveals an (expected) linear
there are many ways to achieve this. We discuss our approach relationship, but also a large variability in values.
in more detail in Section V-A.
V. AI P LATFORM S IMULATOR
This section presents our effort towards a full implemen-
tation of the experimentation and analytics environment as
described in the previous section. The current implementation
comprises a simulator backed by empirical data from (product
name blinded), and an analytics frontend connected to a time
series database that contains the simulated system data. In Fig. 8. Observations (n = 9 821) of asset dimensions and size
this section we first describe how we acquired the data for
statistical modeling of simulation processes. We then describe The model we use to sample a random data asset is a
the implementation of individual components. multivariate Gaussian Mixture Model. We use the scikit-learn
implementation to fit a mixture with 50 components and a full the processing time. Although we do not have data on this at
covariance matrix on the three-column data set. The model present, we plan to include it in a future iteration.
is then exported into the simulator as described previously. b) Training: For a training task v t we know from the
Because the original data contain many extreme and spread assignment we made during pipeline synthesis which frame-
out values which cause too many singletons to form in the work F is used, and we know that frameworks have vastly
mixture, we fit instead on the log-transformed data. During different execution duration distributions. For example, 50%
simulation time, we transform the data back and reject out-of- of TensorFlow jobs run in under 180 seconds, whereas 50% of
bound values. Spark ML jobs run in under 10 seconds. Figure 9(b) shows two
2) Simulating Task Execution: As discussed in Sec- histograms over the compute time of a subsample (n = 50 000)
tion IV-A, the execution duration of a task typically depends on of these jobs. For a better illustration we only show values
several factors. In particular, tasks like data preprocessing and below the 99th percentile. We model t(v t ) in a similar fashion
model training are not straight forward to express analytically, as t(v p ). To estimate t(exec(v t , R)) for a given framework F ,
and we instead define statistical models for them. Figure 9 we stratify the observed execution duration by F and then
shows examples of different ways to simulate the compute fit a Gaussian mixture model pF on each stratum. During
time of data preprocessing and training tasks. Individual plots simulation time we then simply pick a random sample from
are explained in more detail below. pF . This gives us a relatively good fidelity when testing, e.g.,
the effect of specific frameworks trending (which is something
we have observed in the production system, specifically that
the number of Tensorflow builds are increasing over time). In
104 Spark ML
our implementation we also model PyTorch and Caffe training
102
Number of builds (log)
jobs.
0 50 100 150 c) Model Evaluation: For a model evaluation (validation)
104 Tensorflow task v e we have no data to correlate, only the raw compute
102 time data where we again fit a Gaussian mixture to sample
0 10000 20000 30000 40000 50000 t(exec(v e , R)) at simulation time. However, we plan to inves-
Compute time (s), p = .99 tigate this further in the next iteration of our simulator. It is
(a) Data preprocessing task (b) Training task reasonable to assume that the time it takes to validate a model
can be described by the dataset size used for validation, and
Fig. 9. Observations of compute time for data preprocessing and training the model size (which will generally be a good indicator for
tasks based on other known aspects of the task.
the inferencing time).
d) Model Compression: Model compression can be used
a) Data Preprocessing: For a data preprocessing task, to reduce the size and inferencing time of deep neural net-
we can correlate the data asset’s size with the computation works [26]. We know from the way state-of-the art model
time. Figure 9(a) illustrates this relation. The red line is an algorithms operate, that model compression requires roughly
exponential function f (x) = abx + c with parameters a = as much time as training. We can therefore re-use the execution
0.018, b = 1.330, c = 2.156, fitted on the loge -transformed duration we have simulated for training and add Gaussian
data using SciPy’s implementation of non-linear least squares. noise to it to simulate t(v c ). Compression affects several
During simulation time, we use the size of the synthesized data model metrics, specifically performance, size, and inferencing
asset to estimate the compute time from the fitted function and time. Table I shows these values from experiments we have
add noise from a log-normal distribution α = 0.15, µ = −1 to performed with GoogleNet and ResNet50 networks on the
simulate the long tail around the function. By associating the Food101 training set using Caffe.
data asset D with the task instance v p , and given the compute
resource R, we can express t(v p ) = t(req(R))+t(read(D))+ TABLE I
t(exec(v p , R)) + t(write(D0 )), and estimate t(exec(v p , R)) = E FFECT OF MODEL COMPRESSION ON MODEL PARAMETERS
f (Dd · Dr ) + n with n being a random sample from our noise
function. Prune Accuracy (%) Size (MB) Inference (ms)
2
to simulate req(R). It then simulates the task execution by
generating a timeout event with the value sampled as described
in Section V-A. Afterwards, a Trained Model Asset is created,
0 and properties associated with the model (size, performance,
0 5 10 15 20 0 2 4 6
Hour of day Weekday [monday = 0] CLEVER score, etc.) are materialized. For example, to materi-
alize the performance of the model (e.g., the AUC or F-score),
Fig. 10. Average arrivals per hour stratified by hour of day and weekday we could sample from the distribution of performance values
(n = 210 824). µ shows the average arrivals per hour aggregated over all historically observed for the estimator type. While this is not
values. Error bars show one standard deviation. an accurate estimate of the performance an individual model
will have, it will give us an idea of the overall distribution
B. Simulator of, e.g., pipelines that may not meet certain quality gates. The
At the core of the experimentation environment lies the created Trained Model Asset is then persisted to the data store
simulator. It implements the conceptual system model and and the execution trace is recorded via the logging interface.
data synthesizers as described in Section IV as a stochas-
tic, standalone, discrete-event simulator using the simulation VI. E VALUATION
framework SimPy2 . We also make use of SciPy and scikit- To assess the feasibility of our prototype, we evaluate three
learn for implementing statistical operations such as distribu- key aspects. First, we show how the simulator results can
tion fitting and sampling. Synthetic traces are persisted into be used to analyze the effect of different parameters on the
InfluxDB3 . We also developed a Python toolkit for analyzing system. Second, we examine if the data produced by the
the experiment data (which we also use for the evaluation), simulator reflect the original system under simulation. Third,
and a analytics dashboard using Grafana4 . We briefly describe we test how well the simulator scales even when simulating
1 In this paper we report relative numbers for the arrival rates, as we cannot
years of pipeline executions.
disclose the absolute numbers by company policy at the present time. Note A. Exploratory Analysis using the Dashboard
that these numbers correspond to the real-world metrics we have collected,
and are useful here for illustration purposes. The analytics frontend allows exploratory analysis of exper-
2 https://simpy.readthedocs.io
iment results. Figure 11 shows the dashboard populated with
3 https://www.influxdata.com/time-series-platform/influxdb/
4 https://grafana.com/ 5 https://simpy.readthedocs.io/en/latest/topical guides/resources.html
Fig. 11. Experiment analytics dashboard showing infrastructure and pipeline execution metrics.
data from a sample experiment run. It shows the experiment identified by unique IDs generated by the system. This way
parameters, general statistics about individual task executions the lineage of a pipeline can be tracked, and the accuracy
and wait time. The graphs shows the resource utilization over time (which are currently synthesized values and added
of compute resources, individual tasks arrivals, and overall Gaussian noise) observed.
wait time of pipelines, which allows us to quickly observe
the impact of resource utilization on pipeline wait time. B. Simulation Accuracy
The dashboard also shows the network traffic caused by the
We use the statistical analysis component of PipeSim to
execution platform and includes TCP overhead.
evaluate the accuracy of our simulation data. Figure 12 shows
This example illustrates how we can analyze the impact the results of several pipeline execution experiments to com-
of, e.g., arrival peaks on the infrastructure. Around 16:00, pare how well the empirical and simulated data agree.
a typical peak in pipeline arrivals occurs. The usage of The Q–Q plots in Figure 12(a) plot the quantiles of different
the compute cluster spikes slightly because of preprocessing task execution durations against each other. Our preprocessing
tasks being scheduled to it. However, because the learning task simulation slightly overestimates execution duration for
cluster is quickly saturated by long-running training jobs, short running tasks, but overall performs well considering
subsequent jobs have to queue, and post-processing task (like the relatively simple statistical model for this complex pro-
model validation) arrivals are delayed to a later time. This cess. Training tasks, which we simulated separately for each
scenario illustrates how the simulator can be used to examine framework exhibit a very good fit, which we attribute to the
load balancing of heterogeneous compute infrastructure given performance of Gaussian mixtures given the large amount of
different system conditions. data we have for each framework. The plot for the evaluation
Although the simulation synthesizes and logs model metrics task shows how extreme outliers are sometimes difficult to
for each pipeline, we currently do not have a good way of simulate correctly.
visualizing these data. We are working on a way to visualize The Q–Q plots of interarrivals in Figure 12(b) show that
aggregated model metrics (such as overall potential improve- both the random and realistic (clustered by weekday and hour
ment, as described in Section III) in a meaningful way. Queries of day) arrival profiles slightly overestimates pipeline interar-
can however be executed, for example, via the InfluxDB web rivals. This is acceptable because the experiment environment
UI, and metrics can be aggregated over pipelines which are takes an interarrival factor parameter that allows us to increase
log10 seconds (simualted) Processing Duration Training Duration Evaluation Duration We simulated the system for up to a year (365 days), with
6 6
4
an average interarrival of 44 seconds, which corresponds to
4 4 about 720 000 pipeline executions. This simulation took on
2 2 average 517 seconds, or 8.6 minutes, meaning that simulating
2
0 0 a pipeline execution takes, on average, around 1.4 ms on
2 4 6 0 2 4 6 0 2 4 the evaluation machine. We ran these experiments several
log10 seconds (empirical)
times and observed negligible variance. The maximum mem-
(a) Q–Q plots of task execution duration in log10 seconds
ory consumption for this run was roughly 850 MB. While
Random Clustered we found that the simulator is overall capable of running
extensive simulations on a single machine, we quickly ran
log10 seconds (simualted)
2 Simulated
range scalability.
VII. C ONCLUSION
0 5 10 15 20 0 2 4 6
Hour of day Weekday [monday = 0] Efforts of both research and industry to build platforms for
(c) Average arrivals per hour with realistic arrival profile democratizing and operationalizing AI have revealed exciting
new opportunities for operations research. The development of
Fig. 12. Statistical evaluation of simulated processes. such platforms and their operational strategies is challenging
due to the complex nature of the AI application lifecycle, as
well as the growing need for reconciling build-time and run-
or decrease the average arrivals of pipelines, and control for
time aspects of ML models.
such errors. Figure 12(c) shows a detailed view of the realistic
Being able to examine and experiment with the behavior
arrival profile, where we simulated four weeks of pipeline
of these systems is a critical requirement for AI platform
executions The black line shows average arrivals per hour in
operators as our understanding of the AI lifecycle continues
our simulation, and the red line shows the values from our
to grow. To that end, we have presented an experimentation
real system. We can see that our clustered sampling approach
and analytics framework for large-scale AI platforms. Based
generally performs well in capturing different arrival peaks.
on our current knowledge of production-grade AI systems, we
C. Simulation Performance developed a system and process model to simulate, monitor,
and analyze the operation of such platforms, enabling contin-
To asses the scalability of the simulator, we run several uous improvement with humans in the loop.
experiments with an increasing number of pipeline executions
We have shown how to build statistical models from em-
and record the (wall clock) execution duration, and memory
pirical data that accurately reflect the effect of AI pipeline
consumption of the executing python process. The simulation
executions and model scorings on the infrastructure as well
runs in a single thread and is executed on an AMD FX-
as metrics of the model itself. Simulating a year’s worth of
8350 4.0GHz CPU machine with 8GB RAM running Linux
pipeline executions takes only a few minutes on a single
Mint 18.2 and Python 3.6. Our experiment results in Fig-
machine, enabling a quick feedback cycle to run experiments.
ure 13 reveal a straight-forward linear relationship between the
Our evaluation demonstrates that the analytics components can
number of executed pipelines and the execution duration, and
be used to examine the impact of different system parameters
likely polynomial memory usage due to the internal storage
and support advanced techniques like capacity planning and
of execution traces.
resource optimization. For our future work we plan to further
extend the current simulator and establish a stronger link
Max Resident Set Size (MB)