
Accelerate Machine Learning with a Unified Analytics Architecture

Deploy Machine Learning Models in Minutes, Not Months

Ben Epstein and Paige Roberts

Beijing Boston Farnham Sebastopol Tokyo


Accelerate Machine Learning with a Unified Analytics Architecture
by Ben Epstein and Paige Roberts
Copyright © 2022 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional
use. Online editions are also available for most titles (http://oreilly.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Melissa Potter
Production Editor: Elizabeth Faerm
Copyeditor: Sharon Wilkey
Proofreader: Justin Billing
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

January 2022: First Edition

Revision History for the First Edition


2022-01-28: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Accelerate Machine Learning with a Unified Analytics Architecture, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the
publisher’s views. While the publisher and the authors have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and Micro Focus. See our
statement of editorial independence.

978-1-098-12028-3
[LSI]
Table of Contents

Introduction

1. Applied Machine Learning and Why It Fails
   Data Science Tools Were Not Built for Big Data
   Data Tooling Explosions
   Disparate Datasets and Poor Data Architectures
   Complexities of Data Science and the Cost of Skills

2. Evolution of the Data Lake and Data Warehouse
   Data Warehouse Evolution
   Data Lake Evolution
   Cooperative Architecture

3. Unified Analytics Architecture
   Prevalence of the Cloud
   Separating Compute and Storage
   Unifying Storage
   Data Life Cycle Management
   Isolating Workloads
   Unifying Analytics

4. In-Database Machine Learning
   What Is In-Database ML?
   Why Do ML in a Database?
   Why Not Do ML in a Database?
   Unified Analytics: Managing Models
   Unified Analytics: Feature Stores
   The Unified Analytics Architecture

Conclusion
Introduction

The list of potential industries to be transformed by machine learning (ML) has exploded in recent years and is poised to continue that
trajectory. We can see this in the increasing number of data scientists
employed across the globe. In time, nearly all major industries will
have embedded ML into the core of their businesses, automating
away mundane and repetitive decisions that are better made by
algorithms than humans. But the adoption discrepancy between
potential industries and actual industries continues to widen.
This report dives into the wide range of reasons so many ML
initiatives fail, and why the majority of those failures occur at the
proof-of-concept (POC) stage, right at the inflection point where
teams are ready to put their work into production. Of those that
don’t fail, studies have shown that 40% to 80% of successful projects
take from one month to a year, or more, to finally reach that finish
line.
At the end of this report, you will have an understanding of a
new data architecture that streamlines the daily workflows of data
scientists and enables the seamless transition of models from devel‐
opment into production. We will cover the evolution of data lakes
and data warehouses, looking at their strengths and weaknesses, and
how they are being used cooperatively in many current production
data architectures. We’ll explain where cooperation falls short, and
discuss the reasons behind the movement to merge these two con‐
cepts into a unified analytics architecture.

Finally, we will look at a new and still-emerging concept, in-database
machine learning. When paired with a unified analytics architec‐
ture, this may serve as the key to rapid productionalization of ML
models.

CHAPTER 1
Applied Machine Learning and Why It Fails

There are many reasons for the contrast between potential and
actual applications of ML in industry, but at its core, the failure
is due to the unique challenges of operationalizing ML, commonly
known as MLOps. Although companies across the globe have stan‐
dardized the practice of DevOps, the workflow and automation of
putting code into production systems, MLOps presents a uniquely
different set of problems.
In the practice of DevOps, the code is the product; what is written
by the engineer defines precisely how the system acts and what
the customer sees. The code is easy to track, manage, update, and
correct. But MLOps isn’t as straightforward; code is only half of the
solution, and in many cases, much less than half. MLOps requires
the seamless cooperation, coordination, and trust of code and data,
a problem that has not been fully tackled and standardized by the
industry.
When an ML model is put into production, teams must access not
only the code that created it, and the model artifact itself, but the
data it was trained on, as well as the data it begins making predic‐
tions on post-deployment. Because a model is merely an interpreta‐
tion of a set of data points, any change in that original dataset, by
definition, is a new model. And as the model makes predictions in
the real world, any discrepancies between the distributions of data
the model trained on and data it predicted on (commonly referred
to as data drift) will cause the model to underperform and require
retraining. Too small of a sample size, and your model may underfit
the data—fail to converge on an understanding of its distribution.
Too much biased data, and your model may overfit the data, gaining
an overconfidence (and sometimes memorization) of the training
data, but failing to generalize to the real world. This introduces
multiple issues in current analytic architectures, from data tooling,
storage, and compute, to human resource utilization.

Data Science Tools Were Not Built for Big Data


Talk to any data scientist, and they will tell you that their day con‐
sists mostly of SQL, Python, and R. These languages, which have
matured over years of development, enable rapid experimentation
and testing, and this is exactly what a data scientist needs in the
early stages of building an ML model. Python and R are extremely
flexible and streamline iterative workflows natively. In Example 1-1,
we see how a data scientist could quickly run through 50 ML model
configurations in a dozen or so lines of code.

Example 1-1. Rapid prototyping in Python with only a few lines of code

from pprint import pprint
import random

# Define the hyperparameter values to sample from
hyper_params = {
    'num_trees': [5, 10, 20, 40],
    'max_depth': [4, 12, 18, 24],
    'train_test_split': [(0.8, 0.2), (0.7, 0.3), (0.55, 0.45)]
}

# Randomly sample and test 50 combinations of hyperparameters
for test in range(50):
    hp_vals = {}
    for hp in hyper_params:
        hp_vals[hp] = random.choice(hyper_params[hp])
    print('-' * 50)
    print('\nTraining model using the following hyperparameters:')
    pprint(hp_vals)
    train_model(hp_vals)  # train_model is assumed to be defined elsewhere
    print('Done Training\n')

In data science, however, the major trend has been toward build‐
ing models using larger datasets that cannot fit in memory. This
creates an issue for data scientists, because Python and R are best
suited and natively built for in-memory use cases. The most popular
data science framework in Python is pandas, an in-memory data
computation and analysis library that makes the common tasks
of data scientists—such as imputing missing data (handling null
values through statistical replacement), creating statistical summa‐
ries, and applying custom mathematical functions—incredibly sim‐
ple. It works nicely with scikit-learn, one of the most popular and
easy-to-use ML frameworks in Python. Combining these two tools
enables data scientists to rapidly prototype algorithms, experiment
with ideas, and come to conclusions faster and with less effort.
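
As a minimal sketch of that in-memory workflow (the file name, column names, and model settings here are invented for illustration), a data scientist can impute missing data, summarize it, and fit a model in a handful of lines:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a hypothetical dataset small enough to fit in memory
# (assume every column is numeric and "churned" is the 0/1 target)
df = pd.read_csv("customers.csv")

# Impute missing values with each column's median
df = df.fillna(df.median(numeric_only=True))

# One-line statistical summary of every column
print(df.describe())

# Fit and evaluate a model in a few more lines
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=20).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
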
Pandas and scikit-learn work well for users as long as their data
is only a few gigabytes in size. Because these tools can run on
only a single machine at a time, you cannot distribute the compu‐
tational load across many machines in parallel. Once you surpass
the limitations of your personal machine or server, you begin to
see degradations of performance and even system failures. As the
amount of acquired data continues to explode, the work required is
increasingly beyond the capacity of these tools.
Given this limitation, data scientists often resort to training models
on subsets of their data. They can use statistical tools to capture
the most important data, but these practices inevitably lead to
lower-quality models, since the model has been fit (exposed) to less
information.

Generally speaking, exposing your model to a higher quality or greater quantity of training data will increase the accuracy more than modifying the algorithm.

Another tactic is partial learning—a practice in which a subset of data is brought into memory to train a model, then discarded and
replaced with another subset to train on. This iterative process
eventually captures the entirety of the data, but it can be expensive
and time-consuming. Many teams won’t have the time or resources
to employ this methodology, so they stick with subsampling the
original dataset.
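
Here is a minimal sketch of partial learning using scikit-learn's partial_fit API, reading one chunk into memory at a time (the file and column names are invented):

import pandas as pd
from sklearn.linear_model import SGDClassifier

# An out-of-core-capable model; the label values must be declared up front
model = SGDClassifier()
classes = [0, 1]

# Stream a hypothetical dataset in 1-million-row chunks instead of
# loading it whole (assume numeric features plus a 0/1 "label" column)
for chunk in pd.read_csv("clicks.csv", chunksize=1_000_000):
    X = chunk.drop(columns=["label"])
    y = chunk["label"]
    # Each call updates the model with one in-memory subset; the chunk
    # is then discarded and replaced by the next one
    model.partial_fit(X, y, classes=classes)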

But what is the purpose of storing all this data if data scientists
cannot use it to improve the business?

Spark
A common response to this issue is the introduction of Spark into
the data science ecosystem. Apache Spark is a popular open source
project for processing massive amounts of data in parallel. It was
originally built in Scala and now has APIs for Java, Python, and
R. Many infrastructure engineers are comfortable setting up cluster
environments for Spark processing, and significant community sup‐
port exists for the tool.
Spark, however, does not solve the entire issue. First, Spark is
not simple to use: its programming paradigms are very different
from traditional Python programming and often require specialized
engineers to properly implement. Spark also does not natively sup‐
port all of Python and R programmers’ favorite analytics tools. If,
for example, a Python developer using Spark wants to use statsmo‐
dels, a popular statistical framework, that Python package must be
installed on all distributed Spark executors in the cluster, and it must
match the version that the data scientist is working with; otherwise,
unforeseen errors may occur.
PySpark and SparkR, the Python and R API wrappers for Spark, are also notorious for hard-to-interpret error messages. Simple
syntax errors are often hidden under hundreds of lines of Scala stack
traces. Figure 1-1 shows a typical stack trace of a Python PySpark
error. Much of it is Java and Scala code, which does not leave the
data scientist with many avenues for debugging or explanation.
Many of these issues have been front and center for the Apache
Spark team, which has been working hard to mitigate them—
improving error messages, adding pandas API syntax to Spark data‐
frames, and enabling Conda (the most popular Python package
manager) files for dependency management. But this is not a catch‐
all solution, and Spark was fundamentally written for functional
programming and large data pipelines meant to be written once and
run in batch, not constantly iterated upon.
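
For example, the pandas API on Spark, included since Spark 3.2, keeps the familiar syntax while distributing the work; a brief sketch, assuming a working Spark 3.2+ environment (the path and column names are invented):

# pandas-style syntax, but execution is distributed across the cluster
import pyspark.pandas as ps

df = ps.read_parquet("s3://your-bucket/events/")  # illustrative path
summary = df.groupby("country")["revenue"].mean()
print(summary.head())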

Figure 1-1. Common PySpark stack trace

Shifting to Spark for production data science will likely require another resource, such as a data engineer, to take the work of a data
scientist and completely rewrite it using Spark to scale the workload.
This is a major hurdle for large companies and a contributing factor
to the failure of 80% of data science initiatives.

Dask
One proposed solution is to use Dask, a pure Python implementa‐
tion of a distributed compute framework. Dask has strong support
for tools like pandas, NumPy, and scikit-learn, and it supports tradi‐
tional data science workflows. Dask also works well with Python
workflow orchestration tools like Prefect, which empowers data sci‐
entists to build end-to-end workflows using tools they understand
well.
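
A brief sketch of that Dask workflow, which mirrors pandas syntax while distributing the computation (the path and column names are invented; reading from S3 requires the s3fs package):

import dask.dataframe as dd

# Looks like pandas, but the data is partitioned across workers
df = dd.read_parquet("s3://your-bucket/events/")

# Operations build a lazy task graph; nothing has executed yet
daily_revenue = df.groupby("date")["revenue"].sum()

# .compute() runs the graph on the cluster (or local threads) and
# returns an ordinary in-memory pandas object
print(daily_revenue.compute().head())
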
But Dask is not a full solution to the issue either. To start, Dask
provides no support for R, Java, or Scala, leaving many engineers
without familiar tools to work with. Additionally, Dask is not a
full drop-in replacement for pandas, and does not support the
full capabilities of the Python-based statistical and ML tools. Other
projects (for example, Modin) have stepped in to attempt to gain full
compatibility, but these are young systems and do not yet offer the
maturity necessary to be trusted and adopted by large enterprises.

The infrastructure needs of Dask present another potential hurdle.
Using Dask to its full potential requires a cluster of machines capa‐
ble of running Dask workloads. Companies may not have the neces‐
sary resources to deploy and support a Dask cluster, as it is a new
project in the space. Far more infrastructure engineers are familiar
with Spark than with Dask.
Companies such as Coiled are beginning to support Dask as a
service, managing all of the infrastructure for you. But these compa‐
nies, much like Dask itself, are young and may not support the strict
security or stability requirements of a given organization. Dask also
adds yet another tool to the ever-growing toolbox necessary to bring
data science out of the back room and into production.

Data Tooling Explosions


The preceding discussion may have exposed you to multiple projects
and technologies you had never heard of before. That is indicative of
another major hurdle to large organizations’ efforts to leverage data
science: the explosion of data science tools.
Both Python and R significantly reduce the barrier to entry for
developing new packages relative to many other programming lan‐
guages. The major advantage of this is that the open source commu‐
nity constantly creates new projects that build off of each other and
continuously ease the daily lives of programmers, data scientists,
and engineers. The difficulties come when organizations need to
support this ever-growing list of tools, many of which have conflicting dependencies and untested stability and security concerns.
Supporting this web of tooling and infrastructure is a never-ending
task, which becomes a job in and of itself, taking resources away
from the core goal: putting ML models into production.
Simply moving data from a data warehouse, data lake, or multiple
original sources to an analytic environment brings a wealth of con‐
cerns and opportunities for human error, gaps that a plethora of startups and open source projects have stepped in to address.
Picking the proper tools, training a team on their function, and
supporting their reliability is the job of an entire team for which
there may not be a budget. As many data science organizations store
their raw data in data lakes and data warehouses, each separate from
data stores being used by the data scientists, this single task can be a
major barrier to streamlined ML.

Disparate Datasets and Poor Data Architectures
ML fails at organizations primarily because of improper storage,
management, and ownership of raw data and data movement. Data
science is, at its core, an iterative and experimental field of study.
Support for rapid testing, trial, and error is core to its success in
business. If data scientists are forced to wait hours or days to access
the data they need, and every change in request requires additional
waiting periods, progress will be slow or impossible.
Across many modern data architectures, refined data is stored in
data warehouses, separate from the needs of the data scientists
and analysts, or data lakes that store far too much data in various
stages of refinement for easy location, access, transformation, and
experimentation. The problems of opaque changes in upstream data
sources and a lack of visibility in the data lineage can decrease the
efficiency of data scientists when they are dependent on data engi‐
neers to provide clean data for use in their modeling. If the under‐
lying data sent to data scientists changes between model-training
iterations, catching that change can be extremely challenging.
The enterprise data warehouse (EDW) offers advantages including
performance and concurrency, reliability, and strong security. Com‐
mercial EDWs such as Snowflake, Amazon Redshift, and Vertica
usually come with a strong SQL interface familiar to data engineers
and architects.
But data warehouses were not initially architected for complex itera‐
tive computations required by state-of-the-art data science and have
struggled to support that need. Just a few years ago, they generally
didn’t support time series, semi-structured data, or streaming data.
They often ran on specialized hardware, making scaling expensive,
and they required slow batch extract, transform, and load (ETL)
processes to get the data refined and into the hands of data analysts.
Even now, many analytical databases at the heart of data warehouses
support only a subset of the data formats needed for data science.
The structured data they store is focused on the needs of business
intelligence, not data science. These limitations have become a
major bottleneck to achieving production ML.
As an attempt to mitigate the shortcomings of the data warehouse,
the data lake was introduced in the 2010s. Streaming data and time-
series data were supported, varied data structures including semi-
structured data (such as JSON, Avro, and logs) were native, and
scaling was cheap because data lakes ran on commodity hardware.
But the data lake also came with major hurdles. Data lakes were
complex and required specialized and high-salaried engineers to
operate. Data lake performance was poor, and they struggled specif‐
ically with concurrent users, bogging down and even failing with
as few as 10 simultaneous requests. Over time, some other data
lake weaknesses, such as the lack of ACID compliance (ensuring atomicity, consistency, isolation, and durability) and weak metadata management, have been addressed.
Most important, however, data lakes could not, and still struggle to,
natively support the tools that data scientists need. Data scientists
still routinely take subsets of data out of the lake, combine those
with data from the data warehouse or other sources, and process
those combined subsets in memory by using Python and R. Data
lakes solved a problem, centralizing a company’s massive aggrega‐
tion of raw data. But they did not solve the problem of data scientists:
scaling the workflow they were used to.

Complexities of Data Science and the Cost of Skills
Lastly, but certainly of great importance, is the technical depth nec‐
essary to actually perform adequate data science. As a field, data
science and ML have exploded in popularity, and with that popular‐
ity, a plethora of easy-to-use open source tools have emerged. But
in reality, these tools do not usually address complex and company-
specific business problems. When using ML to determine who gets
a small business loan, for example, it’s simple enough to use a scikit-
learn model and analyze Shapley Additive Explanations (SHAP)
scores, but only a trained data scientist can help you understand
whether your model is inheriting biases from your historical data, or
where specifically (and why) your model is underperforming.
Truly powerful data science requires a mastery of data and statistics
alongside a strong ability to program, and a keen understanding of
storytelling and data visualization.

The intersection of these three skill sets
is minimal, and as a result the costs
of employing good data scientists are
extraordinarily high.

To mitigate that, an increasing number of architectures are trying to support the concept of a citizen data scientist—someone with a
deep understanding of the specific business and business problem,
often a business analyst, who also has a basic grasp of applicable
data science concepts. This conceptual role is generally filled by
someone who uses SQL daily and understands data visualization
tools, but typically does not have a good grasp of programming in
languages like Python or R, or the normal depth of expertise in
data science expected of someone who has spent years of training in
that field. Enabling this citizen data scientist role is a goal of many
architectures.
At this point, we’ve looked at reasons ML projects fall short of their
potential. The remainder of this report will walk through the evolu‐
tion of data warehouses and data lakes into a unified architecture, an
evolution that supports successful ML implementations across many
industries. Importantly, we’ll take a dive into in-database ML, a new
concept that will fundamentally change the way enterprises utilize
ML and democratize its usage. Finally, we’ll see how this unified
architecture helps companies productionalize ML pipelines and take
their data scientists out of the back room.

CHAPTER 2
Evolution of the Data Lake and Data Warehouse

This report assumes a basic understanding of the data warehouse and data lake architectures, as well as their advantages and disadvantages. It is important, however, to understand the nature of the
evolution of these two systems.
At its inception about a decade ago, the data lake was quite different
from the data warehouse and claimed a myriad of benefits. In the
Venn diagram of data lake and warehouse value-adds, the overlap
was minimal, aside from the general storage and retrieval of data, as
seen in Figure 2-1.
When the data lake first made waves in the data landscape, the
data warehouse ran on expensive, proprietary technologies, was not
particularly parallelizable across multiple machines, and handled
strictly structured data. This was problematic, and the data lakes of
the world—particularly Apache Hadoop—presented a compelling
alternative. For example, the Hadoop MapReduce programming
model (read file from data source A, do one processing step, write
the data back to disk, read again, do next processing step, write to
disk, repeat until data processing is completed, and then move to
data source B) led to the first real data lake that used the Hadoop
Distributed File System (HDFS) for data storage.

Figure 2-1. Venn diagram of a historical data warehouse versus a
historical data lake

The data lake broke ground by offering affordable, massively scalable ways to store and analyze a variety of data formats, from structured to semi-structured and even unstructured data. The
processing engine ran on commodity hardware, making scalability
affordable. As data grew, Hadoop administrators simply added more
inexpensive commodity hardware, and their scalability issues were
alleviated. MapReduce gained massive popularity and was the tool
of choice for analyzing all of this data. MapReduce engineer job
listings exploded, and their salaries responded accordingly. Finally,
we had a tool that could process any kind of data, any amount of
data, and store it cheaply and reliably. This would clearly pave the
way for an explosion of ML, right?
Not quite. First, MapReduce is not easy. In fact, it is particularly
complex and difficult. It is not flexible to the needs of users and
forces a strict conformity to the MapReduce paradigm—read data
from disk, do one operation, sort, shuffle, redistribute and write
data back to disk—even if the problem statement doesn’t fit that
standard. MapReduce aside, data lakes developed a bad reputation,
becoming known as dumps for unrefined and uncataloged data,
leading to the notion of a data swamp. Given the flexibility of the
data lake, companies needed to conform to some standards early on
with respect to tracking data lineage, changes, and copies (general
data governance principles); otherwise, the only option was to drain the swamp and start again.
Hadoop was also not particularly concurrency friendly. Until the
addition of Apache Hadoop YARN, Hadoop was designed to do just
one thing on a cluster at a time. Lastly, Hadoop had no built-in
security and user-management models, which added enormous risk
for large enterprises that took on this newfound technology.

Data Warehouse Evolution


But the advantages of commodity hardware, massive parallelism,
and flexible data structure requirements were not lost on the data
warehouses of the world, and they sprang into action. Over the last
decade, data warehouse databases have rebuilt their systems from
the ground up on massively parallel processing (MPP) architectures,
taking advantage of commodity hardware and native cloud support
for nearly infinite, affordable scalability.
Native, first-class support for semi-structured data like JSON, and
for powerful new columnar structured data formats that came
from data lake innovation like Apache ORC and Apache Parquet,
as well as the addition of schema-on-read capabilities, have trans‐
formed many analytical databases. This put data warehouses into a
completely new light in the competitive space of data storage and
processing. Native support for popular streaming technologies like
Apache Kafka and Amazon Kinesis has also mitigated the need for
complex ETL pipelines, reducing the time from published event to
actionable data from hours and days to seconds.
Many warehouses have even added support for ML, Python, and R,
and have added native accessibility for data scientists. Some support
notebooks for interactivity. Combine that with robust SQL support
and battle-tested reliability, security, and governance, and today’s
data warehouses present an enticing offer to consumers.

Data Lake Evolution
The data lakes of the world did not become complacent in their vic‐
tories of scalability on commodity hardware either. They recognized
their weaknesses and worked to address them.
The data lake started out as append only, which meant data could be processed only by making copies. The scalability of data lakes allowed raw data to be stored at any volume, and regardless of later refinement, that original raw data still existed. The result was many copies of an organization's data, all of which needed to be secured and managed.
First, the near-immediate recognition of MapReduce's limitations and sluggishness led, a few years later, to the almost total replacement of that technology with Apache
Spark, a far more flexible, robust, and performant data processing
engine. Spark has become a standard in data lakes, providing the
flexibility that MapReduce lacked.
Next, data lakes developed a set of SQL dialects. Tools such as
Apache Hive, Apache Impala, and Presto have made massive
strides in approaching database-like query performance. One step
forward was the creation of structured columnar storage formats
designed for analytic query performance: ORC and Parquet. Hive
on ORC now advertises full ACID compliance, a distinction typi‐
cally reserved only for SQL databases. Data lakes have also been
sprinting to improve reliability, security, and governance in their
systems.
Concurrency remains to this day a massive issue in these systems,
with the CEO of Databricks claiming, “Concurrency is where things
like Spark don’t do well. And it’s been a focus area for us.” More
than 10 users or SQL queries at a time on a single cluster could lead
to massive performance issues on queries or failures altogether. The
current solution is to auto-scale another cluster into activity when
more than 10 concurrent processes are attempted.

Data warehouses and data lakes are
racing to the middle from their extremes,
becoming less and less distinguishable,
leading to the new lakehouse, or unified
analytics architecture.

In some cases, the race starts with a data lake that now has SQL
capabilities, and ACID compliance, plus data science workbenches
on top for access to popular tools such as notebooks, Python, and
R. Sometimes the race starts with a data warehouse, which now has
affordable high-scale storage and processing, support for streaming,
and different kinds of data, among other formerly data-lake-only
capabilities. After this race, we see a much less distinguishable set of
technologies, as shown in Figure 2-2.

Figure 2-2. Venn diagram of a modern data warehouse and a modern data lake

Cooperative Architecture
In an attempt to get the benefits of both architectures, compa‐
nies began to realize that they could stitch the two technology
stacks together, simply taking advantage of each wherever necessary.
Under the hood, there are often two or more distinct engines doing
the work, and two distinct locations storing the data needed for
analytics. An attempt is usually made to take advantage of each tool
independently while abstracting away the communication layers as
much as possible. And thus, a cooperative architecture emerged,
which offered nearly all of the advantages (and countered the disad‐
vantages) of both. This concept exploded and is currently the data
architectural design of choice for many organizations.
In these cooperative architectures (shown in Figure 2-3), raw data
typically feeds the data lake, where cleaning and refinements are
performed with Spark. The recent, key subsets of data are pipelined
into the data warehouse by using tools like Apache Airflow and
Prefect.
When high-speed, low-latency data analytics or business questions
are required, the team queries the data warehouse and returns
results quickly. But when granularity is needed or more data may
provide a more insightful answer, the team migrates its inquiries to
the data lake, which often stores much longer-term data. If, however,
the team needs data from both storage locations to solve a problem,
things become tricky.
The benefits of this architecture are clear. Businesses need business intelligence and data science: high-concurrency, low-latency analytics, massive-scale long-term analysis, and ML on data from both systems.
But the drawbacks are more nuanced. First, complexity is signif‐
icant. The number of distinct, often incompatible components
that require tight communication is a constant challenge. Second,
because the data lives in two distinct locations, one locked in an
application, ETL and ELT processes are often duplicated to achieve
the same results, and governance of these pipelines becomes a seri‐
ous bottleneck to success.
And because the data gets split, so too are the analytics teams. Busi‐
ness intelligence analysts are often working in the data warehouse,
with data scientists needing raw access to the massive amounts of
data from the data lake, even though the overlap of data, and goals,
between the teams is evident. This has often led to the duplication
of data and pipelines, further complicating the system. Data scien‐
tists and business analysts may end up solving the same issue with
inconsistent views of their data, leading to conflicting answers. Con‐
sistency has become a serious issue across the teams, and it hasn’t
always been obvious whose data was correct.

Figure 2-3. Cooperative architecture diagram

Even with the advancements in each technology, data scientists
often still took subsets of their available data completely out of
the enterprise data architecture to access the tools they needed to
accomplish their goals, the fundamental issue these technologies set out to solve. Once the data scientists finished their work, a team of
engineers was still required to rebuild it for scale and deploy it to
drive decisions, applications, and automation.
The cooperative architecture depicted in Figure 2-3 shone a light on
a new vision of the future: a single, unified architecture providing
the advantages of both systems without the need for two distinct
engines. But several factors were required to turn this into reality.

CHAPTER 3
Unified Analytics Architecture

In the last few years, multiple factors have converged with great
synchronicity, paving the way to the dream of the last decade: a uni‐
fied analytics architecture, a single architecture enabling the aggre‐
gation, analysis, and modeling of the full gamut of a company’s
data. This has the potential to revolutionize the development of ML
models and the organization and processing of data. The factors that
enabled this are far reaching, but one powerful contributing factor is
universal object storage and elasticity of the cloud.

Prevalence of the Cloud


The cloud has been a core driver of unification in data storage,
analytics, modeling, and governance. The cloud abstracted away
the complex management of servers and distributed compute and
storage, enabling millions of teams to thrive that otherwise wouldn’t
have the IT support to run these workloads. Most major data storage
players (including Snowflake, Cloudera, Yellowbrick, Databricks,
and Vertica) have worked tirelessly to support cloud deployments
of their stacks, with a few top players offering on-premises, cloud,
and hybrid support out of the box for companies with complex and
varying requirements.
Cloud analytics does have drawbacks such as periodic high latency
due to the “noisy neighbor” situation (lots of people using the same
cloud network at the same time), but that’s not stopping anyone.
Companies have emerged that are powered by these storage mech‐
anisms and are implementing specialized data caching to buffer
against periodic network lag. Object storage alone has been a major
accelerator in the move to a unified analytics architecture, because
it can handle any format of data in a single location, including all
the data lake storage formats as well as the analytic database storage
format, thus unifying data storage.
But storage isn’t the only thing cloud providers have enabled. Com‐
pute has been made easier than ever. Today, you can log into an
Amazon Web Services (AWS) account, click no more than four
buttons, and have a free mini computer running Linux at your
fingertips. You can launch a website, run some Python programs,
or simply learn the basics of Linux. This mini computer is free under the AWS free tier, although more powerful machines, such as those with graphical
processing units (GPUs), will cost significant amounts. AWS, Goo‐
gle Cloud, and Microsoft Azure have a suite of tools that enable
you to scale your compute workloads infinitely and incredibly easily.
And what’s even more important, these systems of compute scale
independently from your storage.

Separating Compute and Storage


Another major contributor to the rise of the unified analytics archi‐
tecture, powered by cloud economics, is the separation of compute
from storage. Imagine a company with 4 petabytes (PB) of data
(that’s four million gigabytes). This company wants to
store that data and run massive amounts of analytics to gain insights
and improve its business.
The company can store its data in an Amazon Simple Storage Ser‐
vice (S3) account, without paying for a single CPU of compute.
This storage is relatively cheap compared to the expense of compute
and incredibly durable (AWS boasts 99.999999999% durability in a given year). Now, the company is ready to run analytics. It starts 100 Ama‐
zon Elastic Compute Cloud (EC2) instances and runs an analytics
job that takes two hours. Once complete, the compute instances
shut down and are deleted from the account. The company pays
for exactly 200 instance-hours of computation plus storage fees, and
nothing more. This on-demand usage of compute was a major evo‐
lutionary step toward serverless workflows.
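
The economics are easy to sketch in a few lines. With illustrative prices (real rates vary by provider, region, and instance type), the company pays compute rates only for the two hours the job actually runs:

# Illustrative prices only; real rates vary by provider, region, and
# instance type
storage_price_per_gb_month = 0.023  # object storage
instance_price_per_hour = 1.00      # one analytics compute instance

data_gb = 4_000_000   # 4 PB expressed in gigabytes
instances, hours = 100, 2

monthly_storage = data_gb * storage_price_per_gb_month
job_compute = instances * hours * instance_price_per_hour

print(f"storage: ${monthly_storage:,.0f} per month")  # $92,000
print(f"analytics job: ${job_compute:,.0f} total")    # $200
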
The major advantage of this model is nonpersistent compute. Some
companies need to run tens of thousands of reports every quarter,
every month, or even every day in some cases, or train ML
models on enormous datasets, or have major seasonal spikes in
compute needs (like retailers around the end of the year). With their
data persisted in an object store, they can spin up as many clusters
as necessary to run the computation they need, only while they
need them, and delete them after they are done. In an on-prem
data center, companies would have to buy, provision, and have those
machines continually available, paying to power and maintain a
huge number of machines that remain idle most of the time.
Although data lakes were originally built with the compute engine and storage as separate components, traditional databases were built
as a software monolith. Deploying and undeploying database com‐
pute brought with it extra storage baggage because of the database
requirement to preserve all data and changes made internally. It’s
perfectly reasonable to run an ordinary Hadoop data lake, or an
MPP scale-out analytical database in the cloud without separation of
compute and storage, but you lose some cloud advantages. You can
still take advantage of easy scaling and low maintenance require‐
ments, but to get the nimble flexibility of the cloud, analytical data‐
bases needed to break apart the compute engine from the storage
format. Most modern databases have done that.
This software architecture brought many advantages along with
compute elasticity: easier node recovery, more efficient load balanc‐
ing, and rapid provisioning—and one interesting capability equiva‐
lent to turning out the lights on Friday night before you go home for
the weekend. Completely turning off the database was now possible
when no compute was needed, leaving just the cloud storage main‐
taining the data.
This separation has so thoroughly broken down the monolithic analytical database
software architecture that most databases can now be fully con‐
tainerized (see Figure 3-1). This makes it possible to automate or
schedule dynamic compute scaling intelligently as the workloads
require. Also, containerization makes software far more independ‐
ent of infrastructure. Instead of having to rebuild software to be
compatible with S3 to function on AWS, then Azure Blob storage to
function on Azure, then Google Cloud Storage to function on Goo‐
gle Cloud, software can include requirements within a container.
Then, a deployment manager like Kubernetes can deploy the soft‐
ware anywhere, from a laptop or edge Internet of Things (IoT)
device to Alibaba Cloud to a private cloud you’ve never heard of.

Figure 3-1. Decomposing the database monolith

Unifying Storage
Unifying storage is another big step toward a unified architecture.
Both data warehouse and data lake architectures put a lot of empha‐
sis on where and how data is stored, but data scientists and business
analysts generally do not have preferences as to the location or
storage formats of their data. They are interested in a standardized
mechanism for accessing and querying that data, like SQL, a busi‐
ness intelligence (BI) visualization tool, or Python. How the data
gets to them is someone else’s concern.
HDFS first provided a way to store pretty much every kind of data.
Object storage improved on that model to the point where it is now
a standard. Any kind of data can be stored in an S3-compatible
object store. Analytical databases, instead of storing data inside the
monolith, now also store their carefully curated and structured data
in a format designed for analytical performance, in exactly the same
place as every other kind of data—in object storage. The data is no
longer split across native database software and other file or object
storage locations.
As the cloud vendors grew, storage options like Amazon S3 became
increasingly standard for many teams and products. S3 has become
so prevalent that an industry of tools now exists to support S3-
like clients abstracted over other storage options, like commodity
computer file storage, Google Cloud Storage, or Azure Blob (see
MinIO). Specialized hardware is now available that stores data in
S3 object stores on premises, to provide benefits similar to cloud
deployments to companies that need to stay on prem for regulatory
or other reasons (see Pure Storage FlashBlade, Scality RING, Dell
EMC Elastic Cloud Storage [ECS], and NetApp StorageGRID).

Achieving the goals of both BI and data science should not require
retrieving data from multiple locations and using multiple engines
to query and analyze it. Creating a singular location where all data
can live removes the problem of duplication of data pipelines to
multiple locations. The last thing a data scientist or business analyst
wants is to have two records of the same user that tell different sto‐
ries in two different locations and formats. By unifying storage, as
shown in Figure 3-2, you end up managing a team of data engineers
whose job is to facilitate robust and reliable pipelines for many kinds
of data to a single unified storage point.

Figure 3-2. A unified storage layer for data

Before, most data was dumped into a data lake, where it was com‐
bined and refined; some structured data was pushed to the database,
while other data was left in the lake. Data scientists and business
analysts worked in those two separate environments, even though
they often needed much of the same data.
When data storage is unified, this issue of duplication of data and
pipelines disappears. Instead of structured data in one system and
unstructured in another, cloud storage can store both, efficiently.
Raw data streams into one system, cleanup and refinement are done
in place, saving money on I/O and time to run processes, and that
same data is queried directly by users. Data engineers build one set
of pipelines, and data scientists or business analysts use the interface
they’re comfortable with to access data in one place.
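
In practice, that single storage point means different users hit the same objects with different tools. As one small sketch, a data scientist can read Parquet straight out of object storage with pandas (requires the s3fs package; the bucket and path are invented):

import pandas as pd

# Read the same Parquet files the SQL engine and BI tools query,
# directly from object storage -- no extract pipeline, no second copy
df = pd.read_parquet("s3://unified-storage/sales/2021/")
print(df.head())
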
There’s no pressure to try to force all data to be structured, or
any single format for that matter. We now have many efficient, ana‐
lyzable columnar storage formats such as Parquet, ORC, or highly
optimized database storage formats such as Read Optimized Store
(ROS), Vertica’s storage format, which is used as a data warehouse
database storage format example in the diagram. You are empow‐
ered to use the data storage format that makes the most sense for
your business, both from a cost and analytical speed perspective.
But for this to work, the query engine must be capable of interoper‐
ating with many disparate data storage formats. For data lakes, the
solution has been to produce more and more types of query and
analysis capabilities, each for a different type of data.
Database vendors have somewhat of a head start with more mature
security, access management, and governance, but data lake engines
were created for breadth: the ability to analyze many kinds of data.
The data lake engine that probably comes the closest to this unified
data concept is Presto, which can analyze a great many types of data.
Newer versions are even beginning to add the capability to train ML models, as are many databases.
Many analytical databases in the last three to five years have added
the ability to query external tables, data existing outside the database
storage itself, often in open source object storage formats such as
Parquet and ORC. External table implementations vary by database,
but they are all references to datasets that exist outside the internal
database storage, typically in object stores such as S3. All require
defining the metadata for external files as if they were internal
tables.
The main differences in various implementations are seen when a
query or other analysis is initiated. For some, at that point, the data
is imported or transformed into the internal database format before
the query can be answered. The more robust implementation of
this functionality directly queries the external data without moving,
copying, or transforming it. This essentially turns the database into
a high-speed query and analysis engine for its own data, as well as a
query and analysis engine for many other formats of data.
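
As one hedged illustration, Vertica-style DDL for defining and querying an external table over Parquet files in S3 looks roughly like the following (the table, columns, helper function, and paths are invented, and other databases use different syntax):

# Sketch over a generic Python DB-API connection; the DDL is
# Vertica-flavored and illustrative -- consult your engine's docs
conn = get_connection()  # hypothetical helper returning a DB-API connection
cur = conn.cursor()

# Register Parquet files in object storage as a queryable external table
cur.execute("""
    CREATE EXTERNAL TABLE web_events (
        event_time TIMESTAMP,
        user_id    INT,
        url        VARCHAR(2048)
    ) AS COPY FROM 's3://unified-storage/events/*.parquet' PARQUET
""")

# The engine queries the files in place: no import, no copy
cur.execute("SELECT COUNT(*) FROM web_events WHERE user_id = 42")
print(cur.fetchone())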

If the database engine supports only structured data, that still gives it
the ability to query data in columnar formats like Parquet and ORC.
Databases that already have the capability to handle semi-structured
data with automated parsing or schema-on-read vastly expand the
data that a single engine can analyze. This provides a single point
of access for a vast array of data, beginning to approximate a data
architecture called a data fabric, which is described in greater detail
by Gartner, but also providing a single unified access point for
multiple types of analysis.
With a more than three-decade head start on optimizing analytics
engines, database vendors have a big lead in performance and con‐
currency. As they expand supported data formats, the difference in
cross-format analytical capabilities between database and data lake
engines shrinks.

Data Life Cycle Management


When you shift paradigms to storing all data in a single location,
you gain tight control over user access management and data life
cycle policies. Having all data and transformations centralized helps
manage lineage, track changes, and maintain data consistency.
In this situation, data formats are based on analytic requirements
and the data life cycle can be managed by switching from one for‐
mat to another. As fresh raw data is added, it is combined, refined,
transformed, and stored in more optimized, more expensive formats
for quick reads and analytics, such as the analytical database storage
format. As that data ages and becomes stale, it can be transformed
into a cheaper but still easily analyzable storage type such as Parquet
or ORC, possibly compressed using gzip or Snappy compression to
optimize the size on disk, and minimize costs of long-term storage.
Streaming data might go into both at once, providing analytical
access to data as it comes in, such as for immediate predictions, and
constantly refreshing the long-term data that ML models often use
for training (see Figure 3-3).

Figure 3-3. Data life cycle management

Once archived, the data is ideal for infrequent access or long-term pattern analysis. This is all tracked in one place and can be automated to transform from one format to the other rapidly as needed
according to metrics defined by the use case. When required, the
analytics engine needs to be able to join the two types of data and
evaluate both at once.
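
A minimal sketch of that aging step (the table, cutoff date, and paths are invented, and conn is assumed to be an open database connection): rewriting last year's rows from the hot store into Snappy-compressed Parquet for cheap long-term storage:

import pandas as pd

# Pull the now-stale slice out of the hot, query-optimized store
stale = pd.read_sql(
    "SELECT * FROM sales WHERE sale_date < '2021-01-01'", conn)

# Rewrite the aged slice as Snappy-compressed Parquet in cheap
# object storage for infrequent access and long-term analysis
stale.to_parquet("s3://unified-storage/archive/sales_2020.parquet",
                 compression="snappy")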

Isolating Workloads
Extract, transform, and load (ETL) has been the standard for data
warehouse architectures, not extract, load, and transform (ELT),
but many architectures are shifting. One reason for using ETL
historically is that as other users are reading, updating, and query‐
ing the database, the finite resources of that engine are utilized.
Transforming the data after being ingested would add yet another
resource constraint on the database, especially with a streaming data
component that requires constant ingestion and refinement so it
can’t be scheduled for slow usage times. As a result, in the older
data warehouse architectures, consolidating and transforming data
sources was far better to do before the data went into the data
warehouse, even though database engines have always been good at
rapidly manipulating data.
Even simply inserting clean data into the database was often sched‐
uled during low database usage periods such as the middle of the
night. This was in large part to prevent contention for the precious
and expensive analytical compute power of the database. If the
database was using a fair amount of its compute power to trans‐
form or ingest data, then analytical queries, or dashboard report
drill-downs, or essential analysis-dependent applications could bog
down in performance, miss service-level agreements (SLAs), a.k.a.
required response times, or even fail.
Now, that problem is compounded by the need to capture streaming
data and run ML workflows in-database. Training ML models is
incredibly resource intensive, and data onboarding can’t be sched‐
uled for the middle of the night if data is streaming in at all
times. If you’re using the same system to train ML models as to
power business intelligence, slowdowns become common, analytic
failures happen, and mission-critical applications suffer. Even with
the strong concurrency of the analytical database software architec‐
ture, every system has its limits.
The data lake wasn’t designed for any real level of concurrency at
all, so it had an even worse starting point. Until the addition of the
YARN resource manager, the Hadoop stack could do only one thing
at a time. When data lakes first became popular for on-premises
deployment (still to this day, with about 20% of deployments
entirely on prem), workload isolation was achieved by deploying
an entirely new data lake cluster with the necessary data duplicated
to the new system. This guaranteed that your data scientists did not
interfere with your business analysts, and your data engineers could
build their ETL pipelines in peace.
But this introduced even more complexity into the already convo‐
luted system. Which data lake’s data was true? Which subset of
the duplicates should be piped over to the data warehouse or into
more rapidly analyzable formats like ORC or Parquet? How many
pipelines were needed? Which software needed to be provisioned
on each cluster? What if one workload needed more compute, and
another cluster was sitting idle because that workload was at a slow
period? These questions were potentially unanswerable and were a
nightmare for the data engineering, IT, and analytics architecture
teams.
Now, the answer is workload isolation: dedicating specific compute
resources to a single workload that doesn’t overlap with resources
dedicated to any other workload.
Today, with the prevalence of cloud elasticity, subclusters spin up as
needed. They can then be dedicated to a single workload, preventing
resource contention and making certain that each workload has
all the compute it needs and that no compute is wasted when not
needed (see Figure 3-4). If a specific workload needs fewer com‐
pute resources, that subcluster can be spun down and its compute
instances deleted. This concept completely relies on the separation
of compute and storage. No data copying or extra pipelines are
required, since all the subclusters pull data from a single shared
object storage location.

Figure 3-4. Workload isolation via subclusters

In this architecture, a single object storage layer grounds the system and preserves data integrity, with different clusters of compute, tailored
to the needs of a subset of users, instantiated as necessary to run
various workloads. The BI team has clusters to run its queries, and
may even have separate clusters to generate reports or drive dash‐
boards. Meanwhile, the data engineering team has clusters running
massively parallel ELT jobs that never stop, thanks to data constantly
streaming in. The data science team has clusters with high levels
of compute for training models on terabytes, or even petabytes, of
both structured and unstructured data. Applications can be certain
of meeting their SLAs by having all the compute they need dedicated
to their workloads. Each subcluster is independent, with its own
compute resources, all accessing the same underlying storage (see
Figure 3-5).
Because cloud networks, especially public cloud networks, often
experience the noisy neighbor effect causing network slowdown,
an automatic data-caching system can help improve analytic perfor‐
mance and dependability. But the shared data remains consistently
the same for all subclusters.

Figure 3-5. A unified storage layer with isolated workloads

As we work toward this unified analytics architecture, we mustn’t forget the underlying goal: to accelerate ML deployment. It’s easy to
get lost in all of the opportunities this architecture provides, but
we must keep our eye on the prize. An issue that has run through every chapter of this report is the need for data scientists to
take subsets of data off of the enterprise architecture and into their
own silos to operate on because of the in-memory limitations of
their most productive frameworks. A part of that has been alleviated
by the use of dedicated compute for data scientists, but that does not
solve the entire issue. The next step is to unify the data access and
analytics layer.

Unifying Analytics
As opposed to having the Spark engineers moving and transforming
data, then reserving some data for the SQL experts and BI visualiza‐
tion tools, then setting aside other data for the Python developers,
providing access to all of the data from all of the frameworks creates
tremendous opportunities. Unified data platforms support Python,
R, and SQL—data access and analysis languages whose commands the unified analytics engine executes.
Users can instruct the engine to join disparate datasets, modify data
to remove outliers or encode categorical variables as binary dummy
variables, train a model, and make a prediction without ever moving
the data offsite because they are not limited by the in-memory
capabilities of a Python or R engine. The instructions issued by
Python or R are distributed and executed in the most performant
manner possible by the highly optimized MPP database engine in
the same way a query optimizer is used in a database to execute SQL
instructions. This creates the distributed, full-scale pipelines as the work is done, so no rebuilding is necessary at the end.
If a user’s preferred, most productive framework is Python, then
Python can aggregate monthly sales just as easily as it trains a
decision tree or joins two time-series datasets and interpolates the
missing values. Results can be plotted in a Jupyter notebook with
something like Matplotlib. Graphs can be directly sent to critical
decision makers just as quickly as if the data science team were
working locally on a tiny subset of data, but with the full accuracy of
training on massive datasets.
Similarly, a business analyst can leverage SQL to build regression
models directly against tables in the database and then feed those
predictions into a Tableau dashboard. See Figure 3-6, which shows
how this all comes together.

Figure 3-6. A unified data platform
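
As a hedged illustration of that analyst workflow, Vertica-style in-database ML functions let a regression be trained and applied entirely in SQL (the model, table, and column names are invented; other engines expose similar functionality under different names):

# Vertica-flavored, illustrative SQL issued through a DB-API cursor
cur = conn.cursor()  # conn: an open connection to the database

# Train a linear regression inside the database; no data leaves it
cur.execute("""
    SELECT LINEAR_REG('price_model', 'home_sales',
                      'sale_price', 'sqft, bedrooms, age')
""")

# Score new rows with the stored model; the output can feed a dashboard
cur.execute("""
    SELECT home_id,
           PREDICT_LINEAR_REG(sqft, bedrooms, age
                              USING PARAMETERS model_name='price_model')
    FROM new_listings
""")
print(cur.fetchall())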

Each user can now use their most comfortable and productive tools
for the job and expect the same level of performance on the full
dataset regardless of scale. And because of dedicated workload isola‐
tion, training an ML algorithm on a petabyte of data will not slow
executive dashboards in the least, even though they’re all using the
same data.
As databases and data lake stacks that can support this unified approach continue to mature, their support for ML algorithms, scientific computing, and analysis matures with them, and their applications widen. Right now, geospatial analysis can be done rapidly on massive datasets and plotted on a map with SQL or Python. Time-series analysis of IoT data for anomaly or pattern detection is done every day on unified analytics platforms. Algorithms like regression, decision trees, clustering, and a growing list of others can be trained, evaluated, saved, versioned, and deployed in unified analytics platforms.
One more key step makes this possible, and it is possible only in a world with a unified analytics architecture, separated compute and storage, and strong resource and security isolation: in-database machine learning.

CHAPTER 4
In-Database Machine Learning

Being able to do machine learning where the data lives has been
a goal from the beginning of the data explosion. But what does
in-database ML mean, and how does it work?

What Is In-Database ML?


In-database machine learning is the practice of developing and executing ML workloads—exploring and preparing the data; training algorithms; evaluating, saving, and versioning models; and serving those models in production—inside the database that contains the data. As we've seen, many business architectures operate with their data science teams extracting subsets and samples of data from data lakes and data warehouses, training models, and then embarking on the journey of MLOps: building distributed pipelines to reproduce all the data preparation steps on production data, serving models (often via representational state transfer, or REST, endpoints), persisting the inputs and outputs of each prediction, analyzing those results for concept and feature drift, and finally reextracting the new data to retrain the model and start the process over (see Figure 4-1).
Many tools have been created and advertised as simplifications for this involved process, but most present the same problematic strategy: bringing the data to the model. Since modern datasets for ML tend to be huge (terabytes or even petabytes), the principle of data gravity makes moving all that data difficult, if not outright impossible. The solution during training is sampling and working elsewhere. But when it comes time to deploy, that typically involves a fair amount of rebuilding. Every single step that a data scientist has taken to prepare a tiny sample of data must now be re-created at scale for production-level datasets, which could be in the multipetabyte range.

Figure 4-1. MLOps cycle

We’ve long thought about databases and data lakes as places for
storage, reporting, and ad hoc analytics of data, but the core premise
of in-database ML is to use the powerful distributed database engine
to natively build models and predict future outcomes based both
on data stored in the database format and other data in distributed
object storage systems.
Advanced data warehouses are capable of the targeted analytics used for feature engineering, such as outlier detection, one-hot encoding, and even fast Fourier transforms. Enabling execution with the distributed database engine while providing instructions in something most data scientists are familiar with, like SQL or Python, instantly unlocks a wealth of opportunity.
Vertica’s in-database ML, to take one example, has straightforward
SQL to perform one-hot encoding, converting categorical values to
numeric features:
SELECT one_hot_encoder_fit('bTeamEncoder', 'baseball', 'team'
       USING PARAMETERS extra_levels='{"team" : ["Red Sox"]}');
For Python developers, Vertica’s VerticaPy library enables data sci‐
entists to achieve these same results by using only Python code. The
get_dummies function is applied to a virtual dataframe (familiar to
Python developers, but not limited by available memory size) and
provides the same functionality:



# churn is a VerticaPy virtual dataframe backed by a table in the database
churn = churn.select(
    ["InternetService", "MonthlyCharges", "churn"]
)
churn["InternetService"].get_dummies()
RedisML has similar capabilities. For example, to scale a value (commonly referred to as standard scaling in ML), simply use the provided ML command syntax:
ML.MATRIX.SCALE key scalar
The ML potential becomes particularly powerful when these functions are joined with the rich set of capabilities databases already have, such as joins, window functions, and aggregates. It becomes simple to join data across your tables and apply your set of feature engineering functions entirely in SQL or Python. And these ML-minded databases will continue to grow their already impressive libraries of functionality, making the feature engineering step of the ML pipeline ever simpler.
But feature engineering is only one step of that pipeline. You may be asking about actual model development and its scalability. Some databases include homegrown algorithms to support a wide range of models, from random forests and logistic regression to clustering and support vector machines. The syntax is just as simple as for feature engineering, and with the many SQL IDEs available today, like DataGrip, Apache Superset, and even Tableau, you can access these functions and create pipelines just by writing SQL. Many have great documentation, including RedisML, Vertica, and BigQuery ML. In fact, many analytical databases support at least some form of ML, though most include it as a paid add-on.
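In Vertica, for instance, training a random forest can be a single statement. The following is a sketch with hypothetical table and column names:

SELECT RF_CLASSIFIER('churn_rf', 'churn_train', 'churned',
                     'monthly_charges, tenure, internet_service'
       USING PARAMETERS ntree=100);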
In RedisML, for example, on the Redis command-line interface, you
could train a K-means model with an example dataset. This sets
a model with two clusters and three dimensions. In the following
example, the cluster centers are (1, 1, 2) and (2, 5, 4):
redis> ML.KMEANS.SET k 2 3 1 1 2 2 5 4
OK
Then predict the cluster of feature vector 1, 3, 5 like this:
redis> ML.KMEANS.predict k 1 3 5
(integer) 1
After you define and aggregate your features and train your model, the next step is often testing and deployment. When your MLOps stack is external to the database, it adds yet another step to the process: you have to move data outside the system yet again, add compute to run tests, and stand up more infrastructure. When this is all done via SQL and database-native APIs, minimal work is required, since dev and test databases are essential to normal implementations and should be the same environment as the production database. Testing and evaluating the now-trained model requires just one more SQL statement or Python command, and deployment of the model is simply the automated invocation of it on new data. We have no new infrastructure or tooling, no rebuilding, and no additional UIs for managers to monitor.
After a model has been deployed, monitoring becomes the crucial next step. As more data enters the system, data distributions change, and drift occurs. Drift is often categorized as concept drift (the distributions of the model's predictions change) or feature drift (the distributions of the features that the model predicts on no longer match the distributions of those features during training). In either case, your production model's idea of reality no longer matches the ground truth. This requires retraining.
Dozens of companies have attempted to solve this specific issue—model observability in production. Measuring and alerting on occurrences of drift in models requires a lot of infrastructure and tooling. A new persistence layer is often required to store all of the inputs and outputs to the model, along with a connection layer to the original training data for comparisons. External compute is required to calculate the various distribution algorithms that detect drift, and some alerting functionality is likely necessary, which adds to the growing list of infrastructure requirements.
When this model evaluation layer is embedded in the database, however, that additional layer of tooling and infrastructure is removed. In a database-ML world, all of the predictions of the model are persisted next to all of the inputs automatically, which, quite conveniently, are stored next to all of the training data. Distribution comparisons are simple SQL queries, and altering incorrect data (for example, mislabeled samples) can occur directly in the database. You can even create simple pipelines that automatically run the SQL statements to retrain the model when a certain distribution-shift threshold is crossed.
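As a sketch of such a comparison (assuming a table that logs predictions alongside their input features; all names here are hypothetical):

-- Compare a feature's live distribution against its training distribution
SELECT 'training' AS source,
       AVG(monthly_charges) AS mean_charge,
       STDDEV(monthly_charges) AS std_charge
  FROM churn_training
UNION ALL
SELECT 'production',
       AVG(monthly_charges),
       STDDEV(monthly_charges)
  FROM churn_predictions
 WHERE predicted_at > NOW() - INTERVAL '7 days';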



Why Do ML in a Database?
So we see now that in-database ML is possible, but why would you do it? What are the advantages?

Many ML models never make it into production at all; even among those that do, 40% take between 30 days and a year or more to get there, according to one survey. As stated previously in this report, another survey was more pessimistic, finding that 80% of respondents' companies take more than six months to deploy an artificial intelligence (AI) or ML model into production. Shortening that time significantly has to be the goal of any organization that needs ML in production to meet its needs.
As we’ve seen, one of the major benefits of in-database ML is infra‐
structure simplicity. Your DevOps engineers will thank you greatly
for removing your requests for all of those new tool integrations.
But this new approach has other major benefits, which are detailed
next.

Security
Many of the tools in the marketplace today offer an abundance of bells and whistles, and these should be considered when choosing the right tool for your company. But one of the most important features of all of these systems—sometimes overlooked and crucial to get right—is security. When your production ML models are making decisions that could make or break your business and are touching some of the most important data in your system, security must be the number one determiner of a product; it is table stakes for any consideration. With emerging standards for data protection like the European General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA), security and data governance are essential.
When running ML inside a database, one of the instant benefits you get is years of mature, battle-tested database security practices. From their inception, database systems were created to securely store your data. When you start with those pillars and then add ML on top, you don't have to worry about whether your models are safe.

Speed and Scalability
Much as with security, since you're starting with the foundation of a robust analytical database, speed and scalability are built in. MPP databases are already battle-tested and trusted by engineers. Speed and scale come in two forms for ML, however: training and deployment.

When training your models, with the data already in the database alongside the models, no movement occurs across systems—no network or I/O latencies, no waiting hours for data to arrive in your modeling environments. And there's no need to adjust data types to new environments to mitigate data type incompatibilities. This helps data scientists iterate faster and build better models.
From the serving perspective, in-database ML removes the legwork of MLOps engineers having to plan how the system will scale with more requests. When the database hosts the model, the model scales exactly as the rest of the database scales, which engineers already trust. The simplicity of the architecture enables engineers to build logic around their models with full assurance that what they create will scale with the data they train on.
Also, the model prediction function is a simple SQL call, just like
all the data transformation functions already in the data ingestion
pipeline. As data flows in, it gets prepped, the model evaluates it,
and the result is sent on to a visualization tool or AI application. A
lot of the data pipelines are a series of SQL calls anyway. What’s one
more? That automation means data flows in and a prediction flows
out, often in less than a second.
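A sketch of that pattern, assuming a random forest already trained in the database (table and column names are hypothetical):

-- Score incoming rows as one more step in the ingestion pipeline
INSERT INTO scored_events
SELECT event_id,
       PREDICT_RF_CLASSIFIER(monthly_charges, tenure
           USING PARAMETERS model_name='churn_rf') AS churn_prediction
  FROM staged_events;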

No Downsampling
Similar to scalability, but important in its own right, is the elimination of downsampling. We looked earlier at the SQL syntax added to advanced databases with ML capabilities. When you train a model in that paradigm, you don't need to consider whether the data you extract will fit locally, since you're not extracting anything.
You begin thinking at a system level. You have the scalability necessary to run the analyses you want, so you can begin optimizing for training time, accuracy, and price instead of local memory. This approach opens a world of possibilities for data scientists, and a new way of thinking that was previously available only to data engineers.



Concurrency
Modern advanced databases have nailed concurrency and resource
isolation, as we’ve discussed in previous sections. This applies
equally to in-database ML. Allocate as many or as few resources
as necessary to the data scientists and let them run free, knowing
with full confidence that their work won’t affect ELT pipelines or
critical dashboards. Let dozens of engineers train models, or dozens
of applications use model predictions, all bounded by the limits
easily set by any database administrator. This includes scheduled or
automated scaling of infrastructure, which is a common capability
of many databases, and can be defined for dedicated subclusters as
well.

Accessibility
A prohibitive issue for many companies is the cost of data scientists, something we touched on in Chapter 1. This is a serious problem, as the work they do can be critical to giving companies competitive advantages. While there is no replacement for an advanced data scientist analyzing a complex problem and building a complex, custom solution, many problems that companies face can be considered low-hanging fruit. They may be solvable by team members without deep data science skills but with familiarity with SQL, data analytics, and the needs of the business.
The issue thus far, however, has been that more accessible ML frameworks like scikit-learn are not easily deployable by these analysts. Python-based ML libraries are a completely different set of technological tools that may have no precedent in a current system. Plus, while many business analysts are skilled in SQL, not all have the same experience with Python.
With SQL-based ML, now common in many analytical databases, you can enable analysts and citizen data scientists to achieve massive results by adding a few ML-focused SQL commands to their repertoire. Alongside the joins and window functions, they can apply an XGBoost model and potentially unlock serious value for their company, without needing to hire a data science team. Many problems need advanced data science skill sets, but adding SQL support for building classical ML models enables a plethora of new participants who may otherwise not have the skills or experience to help.
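For example, assuming an XGBoost model has already been trained in the database, an analyst could score customers with nothing but familiar SQL (a sketch; the names are hypothetical):

SELECT c.customer_id,
       PREDICT_XGB_CLASSIFIER(c.monthly_charges, c.tenure
           USING PARAMETERS model_name='churn_xgb') AS will_churn
  FROM customers c
  JOIN active_subscriptions s ON s.customer_id = c.customer_id;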

Governance
A crucial component of MLOps systems is user management. Who
can train a model? Who can deploy it? Who has access to that
dataset? Who can replace or remove a model from production?
Who trained model 5, which is in production now?
Equally important is data and model governance. Which models
and versions are currently in production? How is our churn model
doing? Which data was this model trained on?
Each of these questions becomes a single SQL statement in a database-ML world. Granting privileges on a model is the same as granting them on a table. Checking who deployed a model requires simply looking at the SQL query logs. Checking the live performance of a model is simply a SQL statement combining two tables (predictions and ground truth).
With governance baked in, it’s yet again one less thing to focus on,
putting real modeling front and center.
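As a sketch of what this looks like (Vertica-style syntax; the model, role, and table names are hypothetical):

-- Who can use this model? Grant privileges like any table
GRANT USAGE ON MODEL churn_rf TO analyst_role;

-- How is the live model doing? Join predictions to ground truth
SELECT AVG(CASE WHEN p.churn_prediction = t.churned
                THEN 1 ELSE 0 END) AS live_accuracy
  FROM churn_predictions p
  JOIN churn_ground_truth t ON t.customer_id = p.customer_id;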

Production Readiness
The practice of MLOps has grown enormously. Endless numbers of companies today are building model-tracking services, feature stores, model-serving environments, and pipelining tools—some offering all of these under one roof. But the practices of DevOps and DataOps have already existed for years, with well-documented best practices. MLOps is the intersection of these two worlds, yet many companies have been approaching it in isolation. If we have a robust way to version code, and a robust way to version tables and data, we should be able to manage ML models just like tables in a database, where their engine is the database and their fuel is the data stored internally and all around them in object stores.



In-database ML offers near-instant production readiness, with already-defined ways of managing SQL code, database tables, and raw and prepared data. Serving the model can be as simple as issuing SQL or a database API call, and updating to a new model is as simple as versioning a table.

After weeks developing, training, and testing a model to solve a business problem, deploying it to production should be a single command, not an infrastructure nightmare.

Why Not Do ML in a Database?


You may be asking if there are any reasons one would avoid
in-database ML. Some use cases are not yet suited for databases.
Deeply custom models that may require GPUs for training or that
implement advanced, custom algorithms not easily compiled into
a runtime package (such as in research) may be best suited for a
more old-fashioned approach. This is not to say that ML-centered
databases won’t eventually get there, but at this time there isn’t deep
support for those use cases.
Another question, however, may be, "Why haven't we always done ML in the database?" And that is a good question, with a good answer. In-database ML leverages the database engine's relatively new ability to evaluate many data formats, not just its own, and requires both the data capacity and the analytic horsepower to train and deploy ML models. In addition, until workloads could be properly isolated, most companies were not willing to risk their BI SLAs by letting data scientists loose in the same database. Without these new developments, even with a cooperative architecture, data still moves, things are lost, deployment is complex, security is not straightforward, and resource isolation is nearly impossible. Simply put, you need a unified analytics architecture to enable in-database ML. But now that it's here, "Why not do ML in a database?" is a harder question to answer.



Unified Analytics: Managing Models
We’ve talked specifically about in-database ML as a set of custom-
built algorithms that databases provide. In these cases, a model is
simply a specialized table, but its interactions are otherwise the
same. These are powerful systems indeed, but they do not cover
the full gamut of possibilities in ML. In many cases, even with
in-database ML available, users will want to write custom code,
leveraging libraries that are particularly useful to them, and still
deploy those models like all others.
This is a common use case and one that has motivated many database companies. The ability to import external models in forms like Predictive Model Markup Language (PMML), JSON configuration (like TensorFlow models), or Open Neural Network Exchange (ONNX) format (like PyTorch and others) is becoming more and more common.
Microsoft Azure SQL Edge, for example, can directly import ONNX
models, deploy them into your database, and invoke them using
a built-in PREDICT SQL function. Similarly, Vertica enables you to
import or export PMML, and import JSON configurations of an
externally trained TensorFlow model and deploy it in a distributed
fashion on your database. This can be useful if you want to train
your deep learning models externally with GPUs but deploy them
alongside all of your other models, as deep learning models typically
do not need expensive GPUs for serving, only training.
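A sketch of that import-and-serve flow, based on Vertica's documented IMPORT_MODELS function (the path and names here are hypothetical):

-- Import an externally trained PMML model, then serve it with SQL
SELECT IMPORT_MODELS('/models/churn_pmml' USING PARAMETERS category='PMML');

SELECT PREDICT_PMML(monthly_charges, tenure
       USING PARAMETERS model_name='churn_pmml')
  FROM customers;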

Unified Analytics: Feature Stores


Another capability that the unified analytics architecture unlocks is
the feature store. A feature store is a layer between raw data and
structured feature vectors for data science. Feature stores promise
to help bridge the gap between data engineers and data scientists by
making it easy to manage how features change over time, how they
are being used in models, and how they are joined together to create
training datasets.
Most feature stores today fundamentally follow the cooperative architecture, with separate engines for offline data (such as data from a data lake) and online data (typically a key-value store for quick lookups). Even the most advanced feature stores, such as Uber's Michelangelo, follow this architecture. As we've seen, this can, and often does, lead to enormous complexity, data inconsistency, and lost days of troubleshooting. The unified analytics architecture has yet to transform this world but likely will in coming years.
Imagine a single engine powering your batch, streaming, and
real-time data, alongside transformations, an end-to-end MLOps
platform for training and serving, and a feature store for closely
managing all of the data that your models learn from. This will
become commonplace, and people will begin to wonder how we
ever did it differently.

The Unified Analytics Architecture


Put together, the unified analytics architecture (shown in Figure 4-2) is neither a data lake nor a data warehouse. It is much more than either and has the advantages of both. This architecture enables a paradigm shift in the way we see and work with our data and massively simplifies the systems we intend to build. The singular architecture makes ML easier to maintain over time and, assuming you've avoided any technologies that work only on the cloud, is flexible enough to run on premises, on any cloud, or on a hybrid system. That flexibility makes it a durable platform that can withstand the constant changes that any data architecture has to weather.



Figure 4-2. Unified analytics architecture



Conclusion

The benefits of production ML span multiple industries and numerous use cases. Predictive maintenance, customer service, fraud detection, industrials, and IoT, among countless others, are all being looked at from completely new angles with the advent of ML. Organizations that don't leverage these growing technologies will be left behind by the ones that do. As it becomes simpler to create and deploy, ML will become table stakes for any company to survive, let alone thrive. Putting a new ML model into production will become as simple as putting a new graph on a dashboard. And this democratization is good. It will lead to better products, cheaper services, and wins for both companies and consumers.
The unified analytics architecture will do for data science teams what Tableau did for data analytics teams. Building and deploying the model will no longer be the challenge; rather, understanding the business problem and how to deliver the most value will be the focus. When ML becomes the tool instead of the challenge, value will begin to explode.
Because data preparation and model training have already been
done on the full, high-scale dataset, and the environment is identical
in development, in test, and in production, moving a proven model
into production requires a single line of code. Getting an ML model
deployed into production takes minutes, not months.

About the Authors
Ben Epstein was the machine learning lead at Splice Machine, an end-to-end MLOps and feature store platform. As ML lead, Ben was responsible for bringing to market a full-stack, user-facing ML platform, supporting large-scale data and ML systems spanning from data ingestion to production model monitoring. With a focus on real-time, distributed use cases, the platform was built on Apache Spark, Kubernetes, MLFlow, and a custom-built database model deployment architecture. Ben has extensive experience designing end-to-end ML systems for scale, supporting petabytes of data, and he recognizes the challenges that come with it in today's ML tooling landscape. Today, he works as a founding engineer focusing on data-centric AI, building systems to help data scientists and machine learning engineers derive better data for their models. He also works with Washington University in St. Louis as an adjunct professor on a cloud computing and big data course, focusing on real-world use cases and skill sets.
Paige Roberts (@RobertsPaige) has worked as an engineer, trainer, support technician, technical writer, marketer, product manager, and consultant over the last 25 years. She has built data engineering pipelines and architectures, documented and tested open source analytics implementations, spun up Hadoop clusters, picked the brains of stars in data analytics, worked in different industries, and questioned a lot of assumptions. She has worked for companies like Pervasive, the Bloor Group, Actian, Hortonworks, Syncsort, and Vertica. Now, she promotes understanding of Vertica, distributed data processing, open source, large-scale data engineering architecture, and how the analytics revolution is changing the world.
