Introduction
Finally, we will look at a newer, still-emerging concept: in-database machine learning. When paired with a unified analytics architecture, it may serve as the key to rapidly putting ML models into production.
CHAPTER 1
Applied Machine Learning and Why It Fails
There are many reasons for the contrast between potential and
actual applications of ML in industry, but at its core, the failure
is due to the unique challenges of operationalizing ML, commonly
known as MLOps. Although companies across the globe have standardized the practice of DevOps, the workflow and automation of putting code into production systems, MLOps presents a fundamentally different set of problems.
In the practice of DevOps, the code is the product; what is written
by the engineer defines precisely how the system acts and what
the customer sees. The code is easy to track, manage, update, and
correct. But MLOps isn’t as straightforward; code is only half of the
solution, and in many cases, much less than half. MLOps requires
the seamless cooperation, coordination, and trust of code and data,
a problem that has not been fully tackled and standardized by the
industry.
When an ML model is put into production, teams must access not
only the code that created it, and the model artifact itself, but the
data it was trained on, as well as the data it begins making predictions on post-deployment. Because a model is merely an interpretation of a set of data points, any change in that original dataset, by
definition, is a new model. And as the model makes predictions in
the real world, any discrepancies between the distributions of data
the model trained on and data it predicted on (commonly referred to as data drift) will cause the model to underperform and require
retraining. Too small a sample size, and your model may underfit the data, failing to converge on an understanding of its distribution. Too much biased data, and your model may overfit, becoming overconfident in (and sometimes memorizing) the training data while failing to generalize to the real world. This introduces multiple issues in current analytic architectures, from data tooling, storage, and compute to human resource utilization.
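To make drift detection concrete, here is a minimal sketch of one common approach, a two-sample Kolmogorov-Smirnov test comparing a feature's training-time distribution against live traffic (the data is synthetic and the threshold illustrative):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)  # feature values at training time
live_feature = rng.normal(0.5, 1.0, 10_000)   # shifted values post-deployment

# A small p-value means the two samples likely come from different
# distributions, a signal that the model may need retraining.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic = {stat:.3f}); consider retraining.")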
In data science, however, the major trend has been toward building models using larger datasets that cannot fit in memory.
Spark
A common response to this issue is the introduction of Spark into
the data science ecosystem. Apache Spark is a popular open source
project for processing massive amounts of data in parallel. It was
originally built in Scala and now has APIs for Java, Python, and
R. Many infrastructure engineers are comfortable setting up cluster
environments for Spark processing, and significant community support exists for the tool.
Spark, however, does not solve the entire issue. First, Spark is
not simple to use: its programming paradigms are very different
from traditional Python programming and often require specialized
engineers to implement properly. Spark also does not natively support all of Python and R programmers' favorite analytics tools. If, for example, a Python developer using Spark wants to use statsmodels, a popular statistical framework, that package must be installed on all distributed Spark executors in the cluster, and it must match the version the data scientist is working with; otherwise, unforeseen errors may occur.
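As a concrete (and hedged) illustration, here is a sketch of a grouped pandas UDF that fits a statsmodels regression per group. The UDF body runs on the executors, so statsmodels must be importable, at a matching version, on every node; the table and column names are hypothetical:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def trend_slope(y: pd.Series) -> float:
    # This import runs on each executor, not the driver; a missing or
    # mismatched statsmodels on any node fails the job at runtime.
    import statsmodels.api as sm
    X = sm.add_constant(np.arange(len(y)))
    return float(sm.OLS(y.to_numpy(), X).fit().params[1])

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("a", 4.0), ("b", 3.0), ("b", 2.0), ("b", 1.0)],
    ["team", "sales"])
df.groupBy("team").agg(trend_slope("sales").alias("slope")).show()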
PySpark and SparkR, the Python and R API wrappers for Spark, are also notorious for hard-to-interpret error messages. Simple syntax errors are often buried under hundreds of lines of Scala stack traces. Figure 1-1 shows a typical stack trace from a PySpark error. Much of it is Java and Scala code, which leaves the data scientist few avenues for debugging or explanation.
Many of these issues have been front and center for the Apache Spark team, which has been working hard to mitigate them: improving error messages, adding pandas API syntax to Spark dataframes, and enabling Conda (the most popular Python package manager) files for dependency management. But this is not a catchall solution; Spark was fundamentally written for functional programming and for large data pipelines meant to be written once and run in batch, not constantly iterated upon.
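The pandas API on Spark (shipped as pyspark.pandas since Spark 3.2) is a case in point; a brief, hedged sketch with a hypothetical dataset:

import pyspark.pandas as ps

# Reads a Parquet dataset distributed across the cluster, but exposes
# the familiar pandas interface.
psdf = ps.read_parquet("/data/events.parquet")  # hypothetical path
print(psdf.groupby("team")["sales"].mean().head())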
Dask
One proposed solution is to use Dask, a pure-Python implementation of a distributed compute framework. Dask has strong support for tools like pandas, NumPy, and scikit-learn, and it supports traditional data science workflows. Dask also works well with Python workflow orchestration tools like Prefect, which empowers data scientists to build end-to-end workflows using tools they understand well.
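A minimal sketch of a Dask workflow, assuming a collection of CSV files in object storage (the path and columns are hypothetical):

import dask.dataframe as dd

# dd.read_csv lazily builds a task graph over many files; nothing is
# read into memory yet.
df = dd.read_csv("s3://my-bucket/events-*.csv")
daily_spend = df.groupby("user_id")["amount"].sum()

# .compute() triggers the parallel execution and returns a pandas object.
result = daily_spend.compute()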
But Dask is not a full solution to the issue either. To start, Dask
provides no support for R, Java, or Scala, leaving many engineers
without familiar tools to work with. Additionally, Dask is not a
full drop-in replacement for pandas, and does not support the
full capabilities of the Python-based statistical and ML tools. Other
projects (for example, Modin) have stepped in to attempt to gain full
compatibility, but these are young systems and do not yet offer the
maturity necessary to be trusted and adopted by large enterprises.
Figure 2-1. Venn diagram of a historical data warehouse versus a
historical data lake
In some cases, the race starts with a data lake that now has SQL
capabilities, and ACID compliance, plus data science workbenches
on top for access to popular tools such as notebooks, Python, and
R. Sometimes the race starts with a data warehouse, which now has
affordable high-scale storage and processing, support for streaming,
and different kinds of data, among other formerly data-lake-only
capabilities. After this race, we see a much less distinguishable set of
technologies, as shown in Figure 2-2.
Even with the advancements in each technology, data scientists
often still took subsets of their available data completely out of
the enterprise data architecture to access the tools they needed to
accomplish their goals, the very problem these technologies set out to solve. Once the data scientists finished their work, a team of
engineers was still required to rebuild it for scale and deploy it to
drive decisions, applications, and automation.
The cooperative architecture depicted in Figure 2-3 shone a light on
a new vision of the future: a single, unified architecture providing
the advantages of both systems without the need for two distinct
engines. But several factors were required to turn this into reality.
In the last few years, multiple factors have converged, paving the way to the dream of the last decade: a unified analytics architecture, a single architecture enabling the aggregation, analysis, and modeling of the full gamut of a company's data. This has the potential to revolutionize the development of ML models and the organization and processing of data. The enabling factors are far reaching, but one powerful contributor is the universal object storage and elasticity of the cloud.
Object storage alone has been a major
accelerator in the move to a unified analytics architecture, because
it can handle any format of data in a single location, including all
the data lake storage formats as well as the analytic database storage
format, thus unifying data storage.
But storage isn’t the only thing cloud providers have enabled. Com‐
pute has been made easier than ever. Today, you can log into an
Amazon Web Services (AWS) account, click no more than four
buttons, and have a free mini computer running Linux at your
fingertips. You can launch a website, run some Python programs,
or simply learn the basics of Linux. This mini computer is entirely
free, although more powerful machines, such as those with graphical
processing units (GPUs), will cost significant amounts. AWS, Goo‐
gle Cloud, and Microsoft Azure have a suite of tools that enable
you to scale your compute workloads infinitely and incredibly easily.
And what’s even more important, these systems of compute scale
independently from your storage.
Unifying Storage
Unifying storage is another big step toward a unified architecture. Both data warehouse and data lake architectures put a lot of emphasis on where and how data is stored, but data scientists and business analysts generally have no preference as to the location or storage format of their data. They are interested in a standardized mechanism for accessing and querying that data, such as SQL, a business intelligence (BI) visualization tool, or Python. How the data gets to them is someone else's concern.
HDFS first provided a way to store pretty much every kind of data.
Object storage improved on that model to the point where it is now
a standard. Any kind of data can be stored in an S3-compatible
object store. Analytical databases, instead of storing data inside the
monolith, now also store their carefully curated and structured data
in a format designed for analytical performance, in exactly the same
place as every other kind of data—in object storage. The data is no
longer split across native database software and other file or object
storage locations.
As the cloud vendors grew, storage options like Amazon S3 became increasingly standard for many teams and products. S3 has become so prevalent that an industry of tools now exists to support S3-like clients abstracted over other storage options, such as commodity computer file storage, Google Cloud Storage, or Azure Blob Storage (see MinIO). Specialized hardware is now available that stores data in S3 object stores on premises, providing benefits similar to cloud deployments for companies that need to stay on prem for regulatory or other reasons (see Pure Storage FlashBlade, Scality RING, Dell EMC Elastic Cloud Storage [ECS], and NetApp StorageGRID).
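As a sketch of how interchangeable these have become, the standard boto3 S3 client can be pointed at a MinIO or other S3-compatible endpoint just by overriding the endpoint URL (the endpoint, credentials, and bucket below are placeholders):

import boto3

# The same client API works against AWS S3, MinIO, or on-prem
# S3-compatible hardware; only the endpoint and credentials change.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.internal:9000",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",                     # placeholder
    aws_secret_access_key="SECRET_KEY",                 # placeholder
)
s3.upload_file("curated.parquet", "analytics-bucket", "curated/curated.parquet")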
Before, most data was dumped into a data lake, where it was combined and refined; some structured data was pushed to the database,
while other data was left in the lake. Data scientists and business
analysts worked in those two separate environments, even though
they often needed much of the same data.
When data storage is unified, this issue of duplication of data and
pipelines disappears. Instead of structured data in one system and
unstructured in another, cloud storage can store both, efficiently.
Raw data streams into one system, cleanup and refinement are done
in place, saving money on I/O and time to run processes, and that
same data is queried directly by users. Data engineers build one set
of pipelines, and data scientists or business analysts use the interface
they’re comfortable with to access data in one place.
There's no pressure to force all data to be structured, or into any single format for that matter. We now have many efficient, analyzable columnar storage formats, such as Parquet and ORC, as well as highly optimized database storage formats, such as Vertica's Read Optimized Store (ROS). You are empowered to use the storage format that makes the most sense for your business, from both a cost and an analytical speed perspective. But for this to work, the query engine must be capable of interoperating with many disparate storage formats. For data lakes, the solution has been to produce more and more types of query and analysis capabilities, each for a different type of data.
Database vendors have somewhat of a head start with more mature
security, access management, and governance, but data lake engines
were created for breadth: the ability to analyze many kinds of data.
The data lake engine that probably comes closest to this unified data concept is Presto, which can analyze a great many types of data. Newer versions are even beginning to add the capability to train ML models, as are many databases.
Many analytical databases in the last three to five years have added
the ability to query external tables, data existing outside the database
storage itself, often in open source object storage formats such as
Parquet and ORC. External table implementations vary by database,
but they are all references to datasets that exist outside the internal
database storage, typically in object stores such as S3. All require
defining the metadata for external files as if they were internal
tables.
The main differences in various implementations are seen when a
query or other analysis is initiated. For some, at that point, the data
is imported or transformed into the internal database format before
the query can be answered. The more robust implementation of
this functionality directly queries the external data without moving,
copying, or transforming it. This essentially turns the database into
a high-speed query and analysis engine for its own data, as well as a
query and analysis engine for many other formats of data.
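A hedged sketch of the idea, using Vertica-flavored SQL through the vertica_python client (connection details, table, and paths are placeholders; exact syntax varies by database):

import vertica_python

conn_info = {"host": "db.example.internal", "port": 5433, "user": "dbadmin",
             "password": "PASSWORD", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # The external table is only metadata; the Parquet files never move.
    cur.execute("""
        CREATE EXTERNAL TABLE sales_ext (sale_date DATE, team VARCHAR, amount FLOAT)
        AS COPY FROM 's3://my-bucket/sales/*.parquet' PARQUET
    """)
    # The engine queries the data directly in object storage.
    cur.execute("SELECT team, SUM(amount) FROM sales_ext GROUP BY team")
    print(cur.fetchall())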
Isolating Workloads
Historically, the answer to competing workloads was to copy the data each team needed to a new, separate system. This guaranteed that your data scientists did not
interfere with your business analysts, and your data engineers could
build their ETL pipelines in peace.
But this introduced even more complexity into the already convoluted system. Which data lake's data was true? Which subset of
the duplicates should be piped over to the data warehouse or into
more rapidly analyzable formats like ORC or Parquet? How many
pipelines were needed? Which software needed to be provisioned
on each cluster? What if one workload needed more compute, and
another cluster was sitting idle because that workload was at a slow
period? These questions were potentially unanswerable and were a
nightmare for the data engineering, IT, and analytics architecture
teams.
Now, the answer is workload isolation: dedicating specific compute
resources to a single workload that doesn’t overlap with resources
dedicated to any other workload.
Today, with the prevalence of cloud elasticity, subclusters spin up as
needed. They can then be dedicated to a single workload, preventing
resource contention and making certain that each workload has
all the compute it needs and that no compute is wasted when not
needed (see Figure 3-4). If a specific workload needs fewer compute resources, that subcluster can be spun down and its compute
instances deleted. This concept completely relies on the separation
of compute and storage. No data copying or extra pipelines are
required, since all the subclusters pull data from a single shared
object storage location.
Figure 3-5. A unified storage layer with isolated workloads
Unifying Analytics
Instead of having Spark engineers move and transform data, then reserving some data for the SQL experts and BI visualization tools, and setting aside other data for the Python developers, providing access to all of the data from all of the frameworks creates tremendous opportunities. Unified data platforms support Python, R, and SQL as data access and analysis control languages, with the unified analytics engine executing the commands.
Users can instruct the engine to join disparate datasets, modify data
to remove outliers or encode categorical variables as binary dummy
variables, train a model, and make a prediction without ever moving
the data offsite, because they are not limited by the in-memory capabilities of a Python or R engine. The instructions issued by Python or R are distributed and executed as performantly as possible by the highly optimized MPP database engine, in the same way a query optimizer executes SQL instructions within a database. This creates distributed, full-scale pipelines as the work is done, so no rebuilding is necessary at the end.
If a user’s preferred, most productive framework is Python, then
Python can aggregate monthly sales just as easily as it trains a
decision tree or joins two time-series datasets and interpolates the
missing values. Results can be plotted in a Jupyter notebook with
something like Matplotlib. Graphs can be directly sent to critical
decision makers just as quickly as if the data science team were
working locally on a tiny subset of data, but with the full accuracy of
training on massive datasets.
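A hedged sketch of that workflow with VerticaPy (method signatures vary by version; the table and columns are hypothetical):

import verticapy as vp

# A virtual dataframe: a pandas-like handle whose operations are pushed
# down to the database engine instead of running in local memory.
sales = vp.vDataFrame("public.sales")
monthly = sales.groupby(["month"], ["SUM(amount) AS total_sales"])
print(monthly.head(12))  # only the small aggregated result leaves the engine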
Similarly, a business analyst can leverage SQL to build regression
models directly against tables in the database and then feed those
predictions into a Tableau dashboard. See Figure 3-6, which shows
how this all comes together.
Figure 3-6. A unified data platform
Each user can now use their most comfortable and productive tools
for the job and expect the same level of performance on the full
dataset regardless of scale. And because of dedicated workload isolation, training an ML algorithm on a petabyte of data will not slow down anyone else's queries or dashboards.
CHAPTER 4
In-Database Machine Learning
Being able to do machine learning where the data lives has been
a goal from the beginning of the data explosion. But what does
in-database ML mean, and how does it work?
Traditionally, moving a model from a data scientist's sandbox into production has required a fair amount of rebuilding. Every single step that a data scientist has taken to prepare a tiny sample of data must be re-created at scale for production-level datasets, which could be in the multipetabyte range.
We've long thought of databases and data lakes as places for storage, reporting, and ad hoc analytics, but the core premise of in-database ML is to use the powerful distributed database engine to natively build models and predict future outcomes, based both on data stored in the database format and on other data in distributed object storage systems.
Advanced data warehouses are capable of the targeted analytics used
for feature engineering, such as outlier detection, one-hot encoding, and even fast Fourier transforms. Enabling execution with the distributed database engine, while providing instructions in something most data scientists are familiar with, like SQL or Python, instantly unlocks a wealth of opportunity.
Vertica’s in-database ML, to take one example, has straightforward
SQL to perform one-hot encoding, converting categorical values to
numeric features:
SELECT one_hot_encoder_fit('bTeamEncoder', 'baseball', 'team'
USING PARAMETERS extra_levels='{"team" : ["Red Sox"]}');
For Python developers, Vertica's VerticaPy library enables data scientists to achieve these same results using only Python code. The
get_dummies function is applied to a virtual dataframe (familiar to
Python developers, but not limited by available memory size) and
provides the same functionality:
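A minimal sketch of that call (hedged; exact VerticaPy signatures vary by version):

import verticapy as vp

# The vDataFrame points at the baseball table; the encoding runs inside
# the database rather than in local memory.
vdf = vp.vDataFrame("public.baseball")
vdf["team"].get_dummies()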
Security
Many of the tools in the marketplace today offer an abundance of bells and whistles, which should be weighed when choosing the right tool for your company. But one of the most important features of all of these systems, sometimes overlooked and crucial to get right, is security. When your production ML models are making decisions that could make or break your business and are touching some of the most important data in your system, security must be the number one determiner of a product; it is table stakes for any consideration. With standards for data protection like the European Union's General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA), security and data governance are essential.
When running ML inside a database, one of the instant benefits you get is years of mature, battle-tested database security practice. From their inception, database systems were built to store your data securely. When you start with those pillars and then add ML on top, you don't have to worry about whether your models are safe.
Speed and Scalability
Much as with security, since you’re starting with the foundation
of a robust analytical database, speed and scalability are built in.
MPP databases are already battle-tested and trusted by engineers.
Speed and scale come in two forms for ML, however: training and
deployment.
When training your models, with the data already in the database alongside the models, nothing moves across systems: no network or I/O latencies, no waiting hours for data to arrive in your modeling environment. And there's no need to adjust data types to new environments to mitigate incompatibilities. This helps data scientists iterate faster and build better models.
From the serving perspective, in-database ML relieves MLOps engineers of the legwork of planning how the system will grow with more requests. When the database hosts the model, the model scales exactly as the rest of the database scales, which engineers already trust. The simplicity of the architecture lets engineers build logic around their models with full assurance that what they create will scale with the data they train on.
Also, the model prediction function is a simple SQL call, just like
all the data transformation functions already in the data ingestion
pipeline. As data flows in, it gets prepped, the model evaluates it,
and the result is sent on to a visualization tool or AI application. A
lot of the data pipelines are a series of SQL calls anyway. What’s one
more? That automation means data flows in and a prediction flows
out, often in less than a second.
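A hedged sketch of such a pipeline step, using Vertica-style PREDICT_* syntax through Python (the model, table, and column names are hypothetical):

import vertica_python

with vertica_python.connect(host="db.example.internal", port=5433,
                            user="dbadmin", password="PASSWORD",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Scoring is just another function call in the SQL pipeline: rows flow
    # in, predictions flow out.
    cur.execute("""
        SELECT customer_id,
               PREDICT_XGB_CLASSIFIER(age, tenure, monthly_spend
                                      USING PARAMETERS model_name='churn_model')
                 AS will_churn
        FROM incoming_customers
    """)
    print(cur.fetchmany(5))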
No Downsampling
Similar to scalability, but important in its own right, is the elimination of downsampling. We looked earlier at SQL syntax added to
advanced databases with ML capabilities. When you train a model
in that paradigm, you don’t need to consider whether the data you
extract will fit locally since you’re not extracting anything.
You begin thinking at a system level. You have the scalability necessary to run the analyses you want, so you can begin optimizing
for training time, accuracy, and price instead of local memory. This
approach opens a world of possibilities for data scientists, and a new
way of thinking that was previously available only to data engineers.
Accessibility
A prohibitive issue for many companies is the cost of data scientists,
something we touched on in Chapter 1. This is a serious problem,
as the work they do can be critical to giving companies competitive
advantages. While there is no replacement for an advanced data
scientist analyzing a complex problem and building a complex, custom solution, many problems that companies face can be considered
low-hanging fruit. They may be solvable by team members without
deep data science skills but with familiarity with SQL, data analytics, and the needs of the business.
The issue thus far, however, has been that more accessible ML
frameworks like scikit-learn are not easily deployable by these analysts. Python-based ML libraries are a completely different set of technological tools that may have no precedent in a current system.
Plus, while many business analysts are skilled in SQL, not all have
the same experience with Python.
With SQL-based ML, common now in many analytical databases,
you can enable analysts and citizen data scientists to achieve massive
results by adding a few ML-focused SQL commands to their repertoire. Alongside the joins and window functions, they can apply an
XGBoost model and potentially unlock serious value for their company, without needing to hire a data science team. Many problems still need advanced data science skill sets, but adding SQL syntax for building classical ML models enables a plethora of new participants who might otherwise not have the skills or experience to help.
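For illustration, training such a model might look roughly like this in Vertica-flavored SQL, issued here from Python (function, table, and column names are assumptions):

import vertica_python

with vertica_python.connect(host="db.example.internal", port=5433,
                            user="analyst", password="PASSWORD",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Training an XGBoost classifier is a single SQL statement run against
    # a table that already lives in the database.
    cur.execute("""
        SELECT XGB_CLASSIFIER('churn_model', 'customers', 'churned',
                              'age, tenure, monthly_spend')
    """)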
Governance
A crucial component of MLOps systems is user management. Who
can train a model? Who can deploy it? Who has access to that
dataset? Who can replace or remove a model from production?
Who trained model 5, which is in production now?
Equally important is data and model governance. Which models
and versions are currently in production? How is our churn model
doing? Which data was this model trained on?
Each of these questions becomes a single SQL statement in a database ML world. Granting privileges on a model is the same as granting them on a table. Checking who deployed a model requires simply looking at the SQL query logs. Checking the live performance of a model is just a matter of running a SQL statement joining two tables (predictions and ground truth).
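For example (hedged; Vertica-style model grants, with hypothetical names):

import vertica_python

with vertica_python.connect(host="db.example.internal", port=5433,
                            user="dbadmin", password="PASSWORD",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Model privileges are managed exactly like table privileges.
    cur.execute("GRANT USAGE ON MODEL churn_model TO analyst_role")
    # Live performance: join stored predictions against arriving ground truth.
    cur.execute("""
        SELECT AVG((p.will_churn = t.churned)::INT) AS accuracy
        FROM predictions p JOIN ground_truth t USING (customer_id)
    """)
    print(cur.fetchone())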
With governance baked in, it’s yet again one less thing to focus on,
putting real modeling front and center.
Production Readiness
The practice of MLOps has grown enormously. Countless companies today are building model-tracking services, feature stores, model-serving environments, and pipelining tools, with some offering all of these under one roof. But the practices of DevOps and DataOps have existed for years, with well-documented best practices. MLOps is the intersection of these two worlds, yet many companies have been approaching it in isolation. If we have a robust way to version code and a robust way to version tables and data, we can have a robust way to manage ML models just like tables in a database: their engine is the database, and their fuel is the data stored internally and all around them in object stores.
About the Authors
Ben Epstein was the machine learning lead at Splice Machine, an
end-to-end MLOps and feature store platform. As ML lead, Ben was
responsible for bringing to market a full-stack, user-facing ML platform, supporting large-scale data and ML systems spanning from
data ingestion to production model monitoring. With a focus on
real-time, distributed use cases, the platform was built on Apache
Spark, Kubernetes, MLflow, and a custom-built database model
deployment architecture. Ben has extensive experience designing
end-to-end ML systems for scale, supporting petabytes of data, and
he recognizes the challenges that come with it in today’s ML tooling
landscape. Today, he works as a founding engineer focusing on
data-centric AI, building systems to help data scientists and machine
learning engineers derive better data for their models. He is also an adjunct professor at Washington University in St. Louis, teaching a cloud computing and big data course focused on real-world use cases and skill sets.
Paige Roberts (@RobertsPaige) has worked as an engineer, trainer, support technician, technical writer, marketer, product manager, and consultant over the last 25 years. She has built data engineering
pipelines and architectures, documented and tested open source
analytics implementations, spun up Hadoop clusters, picked the
brains of stars in data analytics, worked in different industries, and
questioned a lot of assumptions. She has worked for companies
like Pervasive, the Bloor Group, Actian, Hortonworks, Syncsort, and
Vertica. Now, she promotes understanding of Vertica, distributed
data processing, open source, large-scale data engineering architecture, and how the analytics revolution is changing the world.