Batch Streaming RT For Production ML
Introduction
Today, the data flows for machine learning applications are more complex and varied than ever. Often, a
company starts their ML journey with a relatively simple use case, such as weekly customer demand forecasting or churn prediction, that relies on batch historical data and an offline model. But sooner or later,
most businesses are eager to differentiate by including online models in their products or their real-time
processes, introducing a new range of operational concerns—and, often, an explosion in the number of data
sources and transformations used by the model.
This is because, for online models, predictions can generally be improved by taking advantage of the freshest
data coming in. Data scientists often want to include event stream data as signals to the model, forcing
them to integrate with more complex frameworks or systems like Kafka or Spark. Going further, models can
also benefit from real-time feature data that’s only available at the moment of the prediction.
ML products are data products, and the problem with adding more data flows is that you greatly increase
ML infrastructure complexity, time-to-market, cost, and risk. It’s one thing to experiment with streaming
or real-time features in a notebook, but quite another to make them work efficiently in a production ML
system. The more sources and pipelines you have to manage, the easier it is for something to break—or
worse, fail silently and cause your model to make predictions with incorrect or missing data.
Consider the following ML use cases, and note that the challenges build on each other. In other words, you
have to solve all the previous challenges on the list in order to progress.
In this whitepaper, we’ll show you how a feature platform can help you easily evolve from batch use cases
to leveraging a wide variety of batch, streaming, and real-time data to build powerful models. We’ll clear up
common misconceptions along the way, such as the idea that real-time ML means that the model only uses
real-time data (nope!) or that real-time transformations only happen on extremely fresh data (again, not
necessarily!).
Let’s start by exploring the main differences between batch, streaming, and real-time data, with a focus on
how these different kinds of raw data are transformed into signals for ML models.
On the other hand, real-time features are calculated exactly at the time of prediction, using the freshest
data for the process.
Most real-time models don’t just use real-time data to make predictions. They use a mixture of data sources
to get a range of signals. This could be exclusively batch data, batch and streaming data, or all three: batch,
streaming, and real-time data.
Because real-time ML tends to be in the critical path of the product, it needs to be as fast as its name implies.
Latency in fetching and transforming data or serving predictions can be extremely disruptive to the user
experience.
In the world of data processing, transformations play a critical role in shaping and refining raw data into
usable insights. But not all data is created—or processed—equally.
Since the advent of “big data” some 15–20 years ago, data scientists have been using all sorts of techniques
to transform batch data into ML features. But with the amount of app and web instrumentation today, enterprises have access to huge amounts of new data that don’t even reach the data warehouse—rather, the
data is available as an event stream or is queried on demand. We’re having to invent new ways to transform
these fresher sources of information into usable signals for models.
Streaming Transformations
In streaming transformations, data is continuously processed, typically in small batches, as soon as it’s
generated or received. This is sometimes referred to as handling “data in motion.” It allows models to get
signals that are fresh, but not necessarily real-time.
Streaming transformations are useful when a balance is needed between response time and computational
efficiency, such as when tracking website user activity or data from IoT sensors. They’re also essential for
doing stateful transformations.
Stateful transformations consider the entire history of data points: current and past data points influence
each new data point’s transformation. In other words, the result of a stateful transformation is shaped
not only by present data, but also by the accumulated information from past data points. For a model that
estimates how long a user will need to wait for a customer support agent, a stateful transformation might
look at the average wait time in the past hour for similar requests.
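As a toy sketch of such a stateful transformation (the class name and numbers are illustrative, not from any real system), a trailing-hour average can be maintained like this:

```python
import time
from collections import deque
from typing import Optional

class HourlyAverageWait:
    """Stateful streaming aggregate: the mean support wait time over the
    trailing hour. The deque is the accumulated state, so each result
    depends on past events as well as the current one."""

    def __init__(self, window_seconds: int = 3600):
        self.window = window_seconds
        self.events: deque = deque()  # (timestamp, wait_seconds) pairs

    def update(self, wait_seconds: float, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        self.events.append((now, wait_seconds))
        # Evict events that have fallen out of the trailing window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        return sum(w for _, w in self.events) / len(self.events)
```

Each call folds a new event into the running state; a streaming engine does the same bookkeeping, but distributed and fault-tolerant.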
To account for state, streaming transformations have to perform additional data processing like keeping
track of counts over time. They often rely on powerful streaming engines like Apache Spark to do this
heavy-duty lifting. These engines are robust, resilient, and can instantly perform complex calculations on
continuous data streams. However, they require substantial computational resources and infrastructure.
Real-Time Transformations
Real-time transformations process each piece of data at prediction time. For instance, the system might fetch a user’s GPS location and convert it into a Geohash via mathematical operations.
[Diagram: a GPS coordinate sent through a Geohash service returns the hash 9q8yyzkeb.]
Since real-time transformations are performed on a prediction-by-prediction basis, they don’t perform stateful operations or retain context. Thus, real-time transformation engines have simpler infrastructures and fewer computational requirements compared to streaming systems.
More nuance on “real time”: When referring to “real-time features,” we typically point to data transformed during the prediction phase. However, the data undergoing transformation might not be freshly minted—it could be hours old! For instance, a user’s bank account balance that was updated 7 hours ago could be used in real time at prediction time “right now.” In contrast, you could also transform data that has been generated at the exact moment of prediction, like a user’s search query produced just milliseconds ago.
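For illustration, here is a minimal, self-contained sketch of the standard public Geohash encoding algorithm in plain Python (no external service assumed):

```python
# Minimal Geohash encoder: alternately bisect longitude and latitude,
# interleave the resulting bits, and base32-encode 5 bits per character.
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 9) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, ch, bit, even = [], 0, 0, True
    while len(chars) < precision:
        if even:  # even bits refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch, lon_lo = (ch << 1) | 1, mid
            else:
                ch, lon_hi = ch << 1, mid
        else:     # odd bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch, lat_lo = (ch << 1) | 1, mid
            else:
                ch, lat_hi = ch << 1, mid
        even = not even
        bit += 1
        if bit == 5:  # every 5 bits becomes one base32 character
            chars.append(_BASE32[ch])
            ch, bit = 0, 0
    return "".join(chars)
```

Because the computation is pure math on the request's own data, it needs no accumulated state, which is exactly why real-time transformations can run on much simpler infrastructure.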
However, real-time transformations are still challenging because they need to happen on demand and with
very little latency. This requires a robust infrastructure capable of continuously fetching data that may
come from a wide range of different sources, such as an operational database or a third-party API. It also
means running highly efficient computations.
Batch Transformations
Real-time and streaming transformations often steal the limelight, but batch online and batch offline
scenarios are equally crucial in the ML lifecycle. Batch transformations involve processing data in large groups
or “batches” at once, often at scheduled times, and are not meant for immediate insights. Examples include
generating daily marketing reports about potential churners or analyzing trends over a certain period.
It’s important to note, however, that data from batch transformations can still be useful for real-time
models. For instance, you might want a real-time model to make predictions based on a user’s aggregated
activity over the last month, which comes from a batch transformation.
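As a highly simplified sketch of that pattern (SQLite stands in for the warehouse, and the table names are made up), a scheduled batch transformation boils down to:

```python
import sqlite3
from datetime import date, timedelta

def run_daily_batch(conn: sqlite3.Connection) -> None:
    """Recompute each user's trailing-30-day activity count and write it
    to a table that a low-latency serving layer could later read from."""
    cutoff = (date.today() - timedelta(days=30)).isoformat()
    conn.executescript(f"""
        DELETE FROM user_activity_30d;
        INSERT INTO user_activity_30d (user_id, events_30d)
        SELECT user_id, COUNT(*) FROM events
        WHERE event_date >= '{cutoff}'
        GROUP BY user_id;
    """)

# Tiny in-memory demo of the same flow.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_date TEXT);
    CREATE TABLE user_activity_30d (user_id TEXT, events_30d INTEGER);
""")
today = date.today().isoformat()
stale = (date.today() - timedelta(days=60)).isoformat()
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("u1", today), ("u1", today), ("u1", stale), ("u2", today)])
run_daily_batch(conn)
```

In production, this query would be scheduled (e.g. every 24 hours) against the warehouse rather than run ad hoc, with the results copied into a low-latency online store.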
Depending on how much data they process, batch transformations sometimes use less-complex engines
and require fewer computational resources than real-time or streaming transformations. They also require
a less robust infrastructure since the data doesn’t need to be processed immediately upon receipt. This
doesn’t mean that batch transformations lack the need for computational power; they simply require it on
a different scale and timeframe.
With batch transformations, the challenge shifts to efficiently storing high volumes of data until it’s ready
to be processed and ensuring the infrastructure can handle this large-scale processing when it occurs.
By far one of the biggest challenges is keeping latency at a minimum. Imagine you have a recommendations model that relies on a feature vector that includes real-time, batch, and streaming data. This could mean that all of the following processes need to complete successfully, in parallel, within milliseconds:
• Request-time data, such as the user ID, IP address, and search query, must be sent to the real-time
transformation pipeline and processed.
• Batch data, such as the user’s average spend over the last month, needs to be fetched. Traditional data
warehouses are too slow and aren’t built for production systems, so you need to make sure that the
results of batch transformations are stored in a low-latency database that can handle high demands—
perhaps a NoSQL or Postgres DB. Also keep in mind that the batch transformation pipeline needs to be
running successfully for this data to be accurate.
• The results of a streaming transformation that provides the user’s recent click history are fetched as well. This also needs to come from a low-latency online data store. The streaming transformation needs to run continuously in the background at low latency to ensure that the data is accurate and fresh.
• Finally, you need to stitch all of this data, which may be represented quite differently, into a single feature vector for the model to consume.
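To make the shape of the problem concrete, here is a hedged sketch of assembling such a feature vector; the fetchers are hypothetical stand-ins, not a real API:

```python
import asyncio

# Hypothetical stand-ins for a real-time transformation service and
# low-latency online stores (names and values are illustrative only).
async def transform_request_data(request: dict) -> dict:
    return {"query_length": len(request["search_query"])}

async def fetch_batch_features(user_id: str) -> dict:
    return {"avg_spend_30d": 42.5}  # precomputed by the batch pipeline

async def fetch_streaming_features(user_id: str) -> dict:
    return {"clicks_last_hour": 7}  # maintained by the streaming pipeline

async def build_feature_vector(request: dict) -> dict:
    # The three lookups share a millisecond-scale latency budget,
    # so issue them concurrently rather than sequentially.
    realtime, batch, streaming = await asyncio.gather(
        transform_request_data(request),
        fetch_batch_features(request["user_id"]),
        fetch_streaming_features(request["user_id"]),
    )
    # Stitch the differently shaped results into one vector for the model.
    return {**realtime, **batch, **streaming}

vector = asyncio.run(build_feature_vector(
    {"user_id": "u123", "search_query": "wireless headphones"}
))
```

Even this toy version shows why any one slow or failing source delays or degrades the whole prediction.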
All this work ends up massively delaying companies’ go-to-market plans for ML applications. One solution might be to outsource the entire ML app—like turning over fraud detection to a third-party vendor. But this isn’t usually an attractive option for businesses because they want end-to-end control over their mission-critical ML applications.
That’s why many companies are increasingly choosing a middle path: leveraging third-party tools to build
ML applications in-house with fewer headaches (and fewer resources). With a feature platform, you can
outsource the orchestration, monitoring, and management of your data flows for ML, letting your data
scientists and engineers focus on feature and model development—not infrastructure.
A feature platform typically provides three core components:
1. An offline store for serving training data (aka labeled data).
The feature platform is capable of joining data, applying filters, and combining features to form a comprehensive dataset for training purposes.
2. An online store that serves fresh feature values at low latency for online predictions.
3. An orchestrator that manages the coordination and execution of data pipelines that
feed the offline and online stores.
The orchestrator ingests necessary data from various sources, such as data warehouses, real-time event
streams, and operational databases.1
Tecton also continuously updates the feature vector with new data as it arrives from a data stream
or operational databases, ensuring ML models can always access the most current feature values for
optimal predictions. The feature vector is dispatched to the ML model, usually hosted on a platform such
as TensorFlow Serving or SageMaker, where it’s harnessed to generate a prediction.
All of the above is managed via the Tecton software development kit (SDK), which lets you programmatically define the data sources and transformations needed for your features.
However, don’t rule out a feature platform for batch, offline use cases—a compelling example is HelloFresh, which employs Tecton solely for this purpose and reaps considerable benefits. The core value in this scenario is feature reusability across departments and use cases, which facilitates efficient sharing and collaboration among teams, swift iteration in complex production environments, monitoring of active data workflows, and cost control.
1 These are low-latency, high-demand data stores that contain an enterprise’s source of truth for recent operational data.
This data is later ETLed to a data warehouse for long-term storage.
How Orchestration & Transformations Work Inside a Feature Platform
Tecton’s feature platform simplifies the orchestration of multiple data pathways, enabling engineers and
data scientists to focus more on extracting insights and less on managing the data workflows themselves.
Tecton can natively run batch pipelines and process data from your data warehouse or data lake. All you
need to do is point Tecton at the raw data and provide it with the feature transformation.
If you’d like, Tecton can also connect to your Spark infrastructure and leverage that to execute batch
pipelines.
[Diagram: a Tecton BatchFeatureView (e.g. user_id, user_city from user_table) defined via the Tecton SDK; the Tecton orchestrator ingests customer data from the data lake and third-party data sources and materializes features to the Tecton offline store for ML training.]
Tecton effectively understands and executes a given feature definition in the context of batch transformations.
For instance, imagine you want to calculate a customer’s age in days from their birthdate. This batch transformation is set to update this data every 24 hours. It looks at the customer’s data warehouse, specifically
the customers table, which contains columns such as user_id, signup_timestamp, and dob (date of
birth). In this batch transformation, a query is created that pulls the user_id, signup_timestamp, and a
computed field, relative_user_age, which calculates the number of days between the current date and
the user’s date of birth.
Python

customers = BatchSource(
    name="customers",
    batch_config=FileConfig(
        uri="s3://tecton.ai.public/tutorials/fraud_demavo/customers/data.pq",
        file_format="parquet",
        timestamp_field="signup_timestamp",
    ),
)

@batch_feature_view(
    sources=[customers],
    entities=[user],
    mode="spark_sql",
    online=False,
    offline=False,
    feature_start_time=datetime(2017, 1, 1),
    batch_schedule=timedelta(days=1),
    ttl=timedelta(days=3650),
    timestamp_field="signup_timestamp",
    prevent_destroy=False,
    tags={"release": "production"},
    owner="nacosta@tecton.ai",
    description="User age.",
)
def user_relative_age(customers):
    return f"""
        SELECT
            user_id,
            signup_timestamp,
            datediff(current_date(), dob) AS relative_user_age
        FROM
            {customers}
        """
For data warehouses, Tecton translates the batch transformation definition into a SQL query that it executes
against the data warehouse. The orchestrator fetches the data and runs any computations as part of the
transformation. The results are stored in an online store for online serving. Optionally, Tecton can also store the feature data in Tecton’s offline store, which typically helps speed up repeated generation of offline training or prediction data.
[Diagram: the Tecton orchestrator compiles the BatchFeatureView into a SQL query (e.g. select user_id, user_name, (now() - dob) as relative_user_age), runs it against the customer data warehouse’s users table, and writes the results to the offline store.]
On Spark, the orchestrator represents the transformation either as PySpark code or a Spark SQL query.
This is sent to the customer’s Spark cluster, which could be running on platforms such as Databricks or
AWS EMR. The Spark cluster fetches the necessary data from the data lake and performs the required
transformations to produce the desired results.
[Diagram: the same BatchFeatureView executed on the customer’s Spark cluster, which reads from the data lake and writes the results to the offline store.]
Regardless of which platform you choose, however, nothing inherently stops a user from leveraging Tecton
for batch offline scenarios. This is particularly true when factoring in the availability of Feature Tables in
Tecton, which allow users to input data from an outside pipeline, eliminating the need to rely solely on
Tecton’s own framework for importing feature data.
It’s important to note, though, that Tecton has a specific approach and set of requirements for the types of
transformations it supports. Tecton is optimized to handle aggregation-type features and transformations
in a highly efficient manner. If you have mostly non-aggregation features, it can be more challenging to
integrate them into Tecton’s framework.
Unpacking Real-Time & Streaming Transformations in Tecton
Tecton orchestrates real-time and streaming transformations by processing raw data at request time from
streams or on-demand sources. Just as with batch transformations, the orchestrator interprets simple
Python files that describe the transformation. Here is an example:
Python

# The following defines several sliding time window aggregations
# over a user's transaction amounts.
@stream_feature_view(
    source=FilteredSource(transactions_stream),
    entities=[user],
    mode="spark_sql",
    # Defines how frequently feature values get updated in the online store
    aggregation_interval=timedelta(minutes=10),
    # Defines how frequently batch jobs are scheduled to ingest into the offline store
    batch_schedule=timedelta(days=1),
    aggregations=[
        Aggregation(column="amt", function="sum", time_window=timedelta(hours=1)),
        Aggregation(column="amt", function="sum", time_window=timedelta(days=1)),
        Aggregation(column="amt", function="sum", time_window=timedelta(days=3)),
        Aggregation(column="amt", function="mean", time_window=timedelta(hours=1)),
        Aggregation(column="amt", function="mean", time_window=timedelta(days=1)),
        Aggregation(column="amt", function="mean", time_window=timedelta(days=3)),
    ],
    online=True,
    offline=True,
    feature_start_time=datetime(2022, 5, 1),
    tags={"release": "production"},
    owner="kevin@tecton.ai",
    description="Transaction amount statistics and total over a series of time windows, updated every 10 minutes.",
)
def user_transaction_amount_metrics(transactions):
    return f"""
        SELECT
            user_id,
            amt,
            timestamp
        FROM
            {transactions}
        """
In this Python code, we’ve defined a Tecton Feature View indicating the source of streaming data for user
transactions. We want to get transaction amount totals and averages over a series of time windows, and to
do so, we define several time-windowed aggregations. In addition, Spark SQL is used to define row-level
filtering or projection transformations. While these kinds of streaming data operations can be very difficult
to manage efficiently, Tecton handles all of the orchestration and transformations so you don’t have to
worry about the underlying mechanics.
To provide users with even more options, Tecton has launched the Stream Ingest API, which drastically
reduces the infrastructure burden streaming features put on Tecton’s customers. With the Stream Ingest
API, all the customer has to do is send records to Tecton. Tecton immediately handles the transformation of
those records at sub-second latency, writes them to the Tecton feature store, and ensures that streaming
features are continuously updated with the latest data. Despite being a simpler solution, the Stream Ingest
API provides a powerful, cost-effective way to support streaming features, delivering capabilities on par
with or exceeding those of traditional Spark-based streaming engines.
Whether harnessing real-time insights, incorporating streaming features, generating training data that is
consistent with production environments, or serving predictions in online applications, Tecton excels in
catering to diverse data flow scenarios. And we’re always evolving to ensure that we remain the best fit for
your evolving needs.