
Batch, Streaming & Real-Time Data
How to Leverage Their Full Power for Production ML

Contents

Introduction
An Overview of Feature Types & Transformations: Batch, Streaming & Real Time
    Streaming Transformations
    Real-Time Transformations
    Batch Transformations
The Challenge of Serving Feature Data
Feature Platforms Enable Complex Data Flows for ML
How Orchestration & Transformations Work Inside a Feature Platform
    Batch Transformations in Tecton: Under the Hood
    Unpacking Real-Time & Streaming Transformations in Tecton
Unlock MLOps Excellence

Introduction
Today, the data flows for machine learning applications are more complex and varied than ever. Often, a company starts its ML journey with a relatively simple use case, such as weekly customer demand forecasting or churn prediction, that relies on batch historical data and an offline model. But sooner or later, most businesses are eager to differentiate by including online models in their products or their real-time processes, introducing a new range of operational concerns—and, often, an explosion in the number of data sources and transformations used by the model.

This is because, for online models, predictions can generally be improved by taking advantage of the freshest
data coming in. Data scientists often want to include event stream data as signals to the model, forcing
them to integrate with more complex frameworks or systems like Kafka or Spark. Going further, models can
also benefit from real-time feature data that’s only available at the moment of the prediction.

ML products are data products, and the problem with adding more data flows is that you greatly increase
ML infrastructure complexity, time-to-market, cost, and risk. It’s one thing to experiment with streaming
or real-time features in a notebook, but quite another to make them work efficiently in a production ML
system. The more sources and pipelines you have to manage, the easier it is for something to break—or
worse, fail silently and cause your model to make predictions with incorrect or missing data.

Consider the following ML use cases, and note that the challenges build on each other. In other words, you
have to solve all the previous challenges on the list in order to progress.

Feature data                      Type of serving   Challenges

Batch only                        Offline           • Creating training and prediction data on the fly
                                                    • Feature reusability, collaboration, and sharing

Batch only                        Online            • Maintaining consistency between an online and offline store
                                                    • Stitching together feature vectors
                                                    • Monitoring production data infrastructure

Batch and streaming               Online            • Managing operational streaming infrastructure
                                                    • Ensuring features are continuously updated with fresh values

Batch, streaming, and real-time   Online            • Latency in fetching real-time data
                                                    • Managing on-demand data processing pipelines

In this whitepaper, we’ll show you how a feature platform can help you easily evolve from batch use cases
to leveraging a wide variety of batch, streaming, and real-time data to build powerful models. We’ll clear up
common misconceptions along the way, such as the idea that real-time ML means that the model only uses
real-time data (nope!) or that real-time transformations only happen on extremely fresh data (again, not
necessarily!).

Let’s start by exploring the main differences between batch, streaming, and real-time data, with a focus on
how these different kinds of raw data are transformed into signals for ML models.

An Overview of Feature Types & Transformations: Batch, Streaming & Real Time

It’s critical to distinguish between batch, streaming, and real-time features in ML applications. Both batch
and streaming features are known as pre-computed features; i.e., they’re computed in advance, separate
from the prediction-making moment.

On the other hand, real-time features are calculated exactly at the time of prediction, using the freshest
data for the process.

Examples of Different Types of Features

Batch feature A user’s city of residence stored in a database

Streaming feature A user’s click history for their web session

Real-time feature A user’s search query or GPS location

What is real-time ML?

When someone talks about real-time ML or calls a model a “real-time model,” they are almost always referring
to making predictions in real time and on demand. An example would be a fraud detection model that instantly
blocks a credit card transaction if it believes that the data indicates a fraudulent charge.

Most real-time models don’t just use real-time data to make predictions. They use a mixture of data sources
to get a range of signals. This could be exclusively batch data, batch and streaming data, or all three: batch,
streaming, and real-time data.

Because real-time ML tends to be in the critical path of the product, it needs to be as fast as its name implies.
Latency in fetching and transforming data or serving predictions can be extremely disruptive to the user
experience.

In the world of data processing, transformations play a critical role in shaping and refining raw data into
usable insights. But not all data is created—or processed—equally.

Since the advent of “big data” some 15–20 years ago, data scientists have been using all sorts of techniques to transform batch data into ML features. But with the amount of app and web instrumentation today, enterprises have access to huge amounts of new data that don’t even reach the data warehouse—rather, the data is available as an event stream or is queried on demand. We’re having to invent new ways to transform these fresher sources of information into usable signals for models.

Streaming Transformations
In streaming transformations, data is continuously processed, typically in small batches, as soon as it’s
generated or received. This is sometimes referred to as handling “data in motion.” It allows models to get
signals that are fresh, but not necessarily real-time.

Streaming transformations are useful when a balance is needed between response time and computational
efficiency, such as when tracking website user activity or data from IoT sensors. They’re also essential for
doing stateful transformations.

Stateful transformations consider the entire history of data points: current and past data points influence
each new data point’s transformation. In other words, the result of a stateful transformation is shaped
not only by present data, but also by the accumulated information from past data points. For a model that
estimates how long a user will need to wait for a customer support agent, a stateful transformation might
look at the average wait time in the past hour for similar requests.

[Figure: Historical data (request_type, customer_location, customer_queue, reps_available) flows through a stateful transformation that computes avg_wait_time = 2 minutes, surfaced to the user as “You will be connected to a live service representative in 2 minutes.”]

To account for state, streaming transformations have to perform additional data processing, like keeping track of counts over time. They often rely on powerful streaming engines like Apache Spark to do this heavy lifting. These engines are robust, resilient, and can instantly perform complex calculations on continuous data streams. However, they require substantial computational resources and infrastructure.
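
To make the wait-time example concrete, here is a minimal sketch of a stateful, time-windowed aggregation in plain Python. A production system would delegate this to a streaming engine; the class, event fields, and request types below are illustrative.

Python

from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Optional, Tuple

WINDOW = timedelta(hours=1)

class AvgWaitTracker:
    """Tracks a rolling one-hour average of wait times per request type."""

    def __init__(self) -> None:
        # request_type -> deque of (event_time, wait_minutes), oldest first
        self._events: Dict[str, Deque[Tuple[datetime, float]]] = {}

    def observe(self, request_type: str, ts: datetime, wait_minutes: float) -> None:
        q = self._events.setdefault(request_type, deque())
        q.append((ts, wait_minutes))
        # Evict events that have fallen out of the one-hour window (the "state").
        while q and q[0][0] < ts - WINDOW:
            q.popleft()

    def avg_wait(self, request_type: str) -> Optional[float]:
        q = self._events.get(request_type)
        if not q:
            return None
        return sum(wait for _, wait in q) / len(q)

tracker = AvgWaitTracker()
tracker.observe("billing", datetime(2024, 1, 1, 12, 0), 3.0)
tracker.observe("billing", datetime(2024, 1, 1, 12, 30), 1.0)
print(tracker.avg_wait("billing"))  # 2.0 -> "connected in ~2 minutes"

Notice that each new result depends on accumulated past events, not just the current one—exactly the property that makes streaming engines (which manage this state durably and at scale) necessary.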

Real-Time Transformations

Real-time transformations process each piece of data at prediction time. For instance, the system might fetch a user’s GPS location and convert it into a Geohash via mathematical operations.

[Figure: A GPS location is passed to a Geohash service, which returns the encoded value 9q8yyzkeb.]

Since real-time transformations are performed on a prediction-by-prediction basis, they don’t perform stateful operations or retain context. Thus, real-time transformation engines have simpler infrastructures and fewer computational requirements compared to streaming systems.

More nuance on “real time”

When referring to “real-time features,” we typically point to data transformed during the prediction phase. However, the data undergoing transformation might not be freshly minted—it could be hours old! For instance, a user’s bank account balance that was updated 7 hours ago could be used in real time at prediction time “right now.” In contrast, you could also transform data that has been generated at the exact moment of prediction, like a user’s search query produced just milliseconds ago.

However, real-time transformations are still challenging because they need to happen on demand and with
very little latency. This requires a robust infrastructure capable of continuously fetching data that may
come from a wide range of different sources, such as an operational database or a third-party API. It also
means running highly efficient computations.
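
As a concrete illustration of the Geohash example above, here is a minimal sketch of the encoding itself. A real system would typically call a library rather than hand-roll this, and the latency-critical part is usually fetching the GPS fix, not the arithmetic.

Python

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet

def geohash(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a latitude/longitude pair by bisecting ranges and interleaving bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, ch, even = [], 0, 0, True
    while len(chars) < precision:
        # Alternate between refining longitude and latitude.
        rng = lon_range if even else lat_range
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val > mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch = ch << 1
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:  # Every 5 bits becomes one base-32 character.
            chars.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(chars)

print(geohash(37.7749, -122.4194))  # "9q8y..." for downtown San Francisco

This is a pure function of the request-time input: no history, no accumulated state—which is why real-time transformation engines can stay so much simpler than streaming ones.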

Batch Transformations
Real-time and streaming transformations often steal the limelight, but batch online and batch offline
scenarios are equally crucial in the ML lifecycle. Batch transformations involve processing data in large groups
or “batches” at once, often at scheduled times, and are not meant for immediate insights. Examples include
generating daily marketing reports about potential churners or analyzing trends over a certain period.

[Figure: Historical data flows through a batch transformation to produce “Total Customer Service Requests in the Past Week.”]
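
As a sketch of how such a feature might be computed, here is a scheduled weekly aggregation in PySpark. The table and column names are hypothetical.

Python

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table of raw customer support events.
requests_df = spark.table("support.customer_service_requests")

# Count each customer's requests over the trailing seven days.
weekly_totals = (
    requests_df
    .where(F.col("created_at") >= F.date_sub(F.current_date(), 7))
    .groupBy("customer_id")
    .agg(F.count("*").alias("total_requests_past_week"))
)

# A scheduled job (e.g., nightly) would persist the results for downstream use.
weekly_totals.write.mode("overwrite").saveAsTable("features.weekly_request_totals")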

It’s important to note, however, that data from batch transformations can still be useful for real-time
models. For instance, you might want a real-time model to make predictions based on a user’s aggregated
activity over the last month, which comes from a batch transformation.

Depending on how much data they process, batch transformations sometimes use less-complex engines and require fewer computational resources than real-time or streaming transformations. They also require a less robust infrastructure, since the data doesn’t need to be processed immediately upon receipt. This doesn’t mean batch transformations don’t need computational power; they simply require it on a different scale and timeframe.

With batch transformations, the challenge shifts to efficiently storing high volumes of data until it’s ready
to be processed and ensuring the infrastructure can handle this large-scale processing when it occurs.

Summing up the characteristics described above:

Transformation Type   Stateful?   Data continually processed?   Low-latency requirements?   Computationally intensive?

Streaming             Yes         Yes                           No                          Yes
Real-time             No          No                            Yes                         No
Batch                 No          No                            No                          At scale

The Challenge of Serving Feature Data


There’s a lot of nuance in managing batch, streaming, and real-time transformations in order to serve a
feature vector to a model. (A feature vector is a compact representation of data used for inference.) For
instance, the transformation code a data scientist writes in their notebook will need to be translated into
production code in an entirely different system, and errors can slip in. Data pipelines can fail due to flaky
infrastructure, schema changes in the underlying data, and a wide range of other issues.

One of the biggest challenges by far is keeping latency to a minimum. Imagine you have a recommendations
model that relies on a feature vector that includes real-time, batch, and streaming data. This could mean
that all of the following processes need to complete successfully and simultaneously in milliseconds:

• Request-time data, such as the user ID, IP address, and search query, must be sent to the real-time
transformation pipeline and processed.

• Batch data, such as the user’s average spend over the last month, needs to be fetched. Traditional data
warehouses are too slow and aren’t built for production systems, so you need to make sure that the
results of batch transformations are stored in a low-latency database that can handle high demands—
perhaps a NoSQL or Postgres DB. Also keep in mind that the batch transformation pipeline needs to be
running successfully for this data to be accurate.

• The results of a streaming transformation that provides the user’s recent click history are fetched as well.
This also needs to come from a low-latency online data store. The streaming transformation needs to be
happening continuously in the background at low latency to ensure that the data is accurate and fresh.

• Finally, you need to stitch all of this data, which may be represented quite differently, into a single feature vector for the model to consume.
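
As a simplified sketch of that final stitching step, here is what assembling one flat vector from the three sources can look like. The online-store client and key layout are hypothetical.

Python

from datetime import datetime, timezone

def build_feature_vector(user_id: str, request: dict, online_store) -> dict:
    """Stitch batch, streaming, and real-time features into one flat vector.

    `online_store` stands in for a low-latency key-value store client holding
    the pre-computed feature values; its interface here is hypothetical.
    """
    # Pre-computed features materialized by the batch and streaming pipelines.
    batch_features = online_store.get(f"batch:{user_id}") or {}    # e.g. avg_spend_30d
    stream_features = online_store.get(f"stream:{user_id}") or {}  # e.g. clicks_last_hour

    # Real-time features derived from the request itself at prediction time.
    realtime_features = {
        "query_length": len(request.get("search_query", "")),
        "hour_of_day": datetime.now(timezone.utc).hour,
    }

    # The model consumes a single flat mapping; later keys win on collisions.
    return {**batch_features, **stream_features, **realtime_features}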

All this work ends up massively delaying companies’ go-to-market plans for ML applications. One solution might be to outsource the entire ML app—like turning over fraud detection to a third-party vendor. But this isn’t usually an attractive option for businesses because they want end-to-end control over their mission-critical ML applications.

That’s why many companies are increasingly choosing a middle path: leveraging third-party tools to build
ML applications in-house with fewer headaches (and fewer resources). With a feature platform, you can
outsource the orchestration, monitoring, and management of your data flows for ML, letting your data
scientists and engineers focus on feature and model development—not infrastructure.

Feature Platforms Enable Complex Data Flows for ML

Tecton’s feature platform was designed so that companies can build in-house, mission-critical ML applications and products without having to manage the overwhelming complexity of backend data systems and pipelines. It has four main components:

1. An offline store for serving training data (aka labeled data).
The feature platform is capable of joining data, applying filters, and combining features to form a comprehensive dataset for training purposes.

2. An online store for serving prediction data.
For the lowest possible latencies, the online store uses an optimized infrastructure often called a NoSQL database or a key-value store, and can serve features in as little as 5 milliseconds.

3. An orchestrator that manages the coordination and execution of data pipelines that feed the offline and online stores.
The orchestrator ingests necessary data from various sources, such as data warehouses, real-time event streams, and operational databases.¹

4. A hosted feature server.
When a live application, such as a recommendation service on a website, sends a prediction request, Tecton fetches the relevant feature data from the online store, conducts any real-time transformations, and forms a feature vector.

Tecton also continuously updates the feature vector with new data as it arrives from a data stream
or operational databases, ensuring ML models can always access the most current feature values for
optimal predictions. The feature vector is dispatched to the ML model, usually hosted on a platform such
as TensorFlow Serving or SageMaker, where it’s harnessed to generate a prediction.
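
To make this concrete, here is a rough sketch of requesting a feature vector from the hosted feature server over HTTP. The endpoint shape follows Tecton’s get-features API, but the cluster URL, workspace, feature service, join keys, and request context below are placeholder values; treat the details as illustrative.

Python

import requests

# Placeholder values; the cluster URL, workspace, service name, and API key
# depend entirely on your own Tecton deployment.
URL = "https://yourcluster.tecton.ai/api/v1/feature-service/get-features"
API_KEY = "your-service-account-api-key"

payload = {
    "params": {
        "workspace_name": "prod",
        "feature_service_name": "recommendations_feature_service",
        # Identifies whose pre-computed features to fetch from the online store.
        "join_key_map": {"user_id": "user_123"},
        # Request-time data that feeds real-time transformations.
        "request_context_map": {"search_query": "wireless headphones"},
    }
}

resp = requests.post(URL, json=payload, headers={"Authorization": f"Tecton-key {API_KEY}"})
resp.raise_for_status()
feature_vector = resp.json()["result"]["features"]  # Ready to send to the model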

All of the above is managed via the Tecton software development kit (SDK), which lets you programmatically define the data sources and transformations needed for your features.

When do you need a feature platform?

A feature platform is the most valuable when you use a variety of complex data flows for ML, especially real-time and streaming features, and when your organization needs to manage a large number of ML applications simultaneously. This is where you’ll see the biggest ROI gain compared to crafting a bespoke or in-house solution. (Check out our Build vs. Buy Guide to read more on this topic.)

However, don’t rule out a feature platform for batch, offline use cases—a compelling example is HelloFresh, which employs Tecton solely for this purpose and reaps considerable benefits. The core value in this scenario is feature reusability across departments and use cases, which facilitates efficient sharing, collaboration among teams, swift iteration in complex production environments, monitoring active data workflows, and controlling costs.

¹ These are low-latency, high-demand data stores that contain an enterprise’s source of truth for recent operational data. This data is later ETLed to a data warehouse for long-term storage.

How Orchestration & Transformations Work Inside
a Feature Platform
Tecton’s feature platform simplifies the orchestration of multiple data pathways, enabling engineers and
data scientists to focus more on extracting insights and less on managing the data workflows themselves.

Batch Transformations in Tecton: Under the Hood


It’s a common misconception that because Tecton excels at orchestrating transformations for real-time
ML, its functionality is limited solely to real-time or streaming data processing. Remember: “Real-time
ML” just means that predictions are being made live in the product—but any type of data can be used.
Tecton is well equipped to handle batch data for online and offline predictions, making it a versatile tool for
ML model development.

Tecton can natively run batch pipelines and process data from your data warehouse or data lake. All you
need to do is point Tecton at the raw data and provide it with the feature transformation.

If you’d like, Tecton can also connect to your Spark infrastructure and leverage that to execute batch
pipelines.

[Figure: Using the Tecton SDK, a BatchFeatureView (e.g., selecting user_id and user_city from user_table) is registered with Tecton. The Tecton Orchestrator executes the data pipeline on a schedule (e.g., once a day), reading from the customer’s data lake, data warehouse, and third-party data sources, and materializing results to the Tecton Offline Store to serve ML training and offline prediction systems.]

Tecton effectively understands and executes a given feature definition in the context of batch transformations.

For instance, imagine you want to calculate a customer’s age in days from their birthdate. This batch trans-
formation is set to update this data every 24 hours. It looks at the customer’s data warehouse, specifically
the customers table, which contains columns such as user_id, signup_timestamp, and dob (date of
birth). In this batch transformation, a query is created that pulls the user_id, signup_timestamp, and a
computed field, relative_user_age, which calculates the number of days between the current date and
the user’s date of birth.

Python

from tecton import Entity, BatchSource, FileConfig, batch_feature_view
from datetime import datetime, timedelta

customers = BatchSource(
    name="customers",
    batch_config=FileConfig(
        uri="s3://tecton.ai.public/tutorials/fraud_demavo/customers/data.pq",
        file_format="parquet",
        timestamp_field="signup_timestamp",
    ),
)

user = Entity(name="user", join_keys=["user_id"])

@batch_feature_view(
    sources=[customers],
    entities=[user],
    mode="spark_sql",
    online=False,
    offline=False,
    feature_start_time=datetime(2017, 1, 1),
    batch_schedule=timedelta(days=1),  # Recompute every 24 hours
    ttl=timedelta(days=3650),
    timestamp_field="signup_timestamp",
    prevent_destroy=False,
    tags={"release": "production"},
    owner="nacosta@tecton.ai",
    description="User age.",
)
def user_relative_age(customers):
    return f"""
        SELECT
            user_id,
            signup_timestamp,
            datediff(current_date(), dob) AS relative_user_age
        FROM
            {customers}
        """

For data warehouses, Tecton translates the batch transformation definition into a SQL query that it executes against the data warehouse. The orchestrator fetches the data and runs any computations as part of the transformation. The results are stored in an online store for online serving. Optionally, Tecton can also store the feature data in Tecton’s offline store, which typically helps speed up repeated generation of offline training or prediction data.

[Figure: For a data warehouse, the Tecton Orchestrator translates the BatchFeatureView definition (e.g., select user_id, user_name, (now() - dob) as relative_user_age) into a SQL query that runs against the customer’s users table, and writes the results to the Offline Store.]

On Spark, the orchestrator represents the transformation either as PySpark code or a Spark SQL query.
This is sent to the customer’s Spark cluster, which could be running on platforms such as Databricks or
AWS EMR. The Spark cluster fetches the necessary data from the data lake and performs the required
transformations to produce the desired results.

[Figure: On Spark, the Tecton Orchestrator sends the transformation as a Spark job (PySpark code or a Spark SQL query) to the customer’s Spark cluster (e.g., Databricks or AWS EMR), which reads the source data and writes the results to the Offline Store.]

Regardless of which platform you choose, however, nothing inherently stops a user from leveraging Tecton
for batch offline scenarios. This is particularly true when factoring in the availability of Feature Tables in
Tecton, which allow users to input data from an outside pipeline, eliminating the need to rely solely on
Tecton’s own framework for importing feature data.
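
For instance, pushing externally computed feature values into a Feature Table via the Tecton SDK can look roughly like the following. The workspace, table name, and schema are hypothetical, and method details may vary by SDK version.

Python

import tecton
import pandas as pd

# Feature values produced by an outside pipeline; the schema is illustrative.
df = pd.DataFrame(
    {
        "user_id": ["user_123", "user_456"],
        "timestamp": [pd.Timestamp.now(tz="UTC")] * 2,
        "custom_score": [0.87, 0.42],
    }
)

# Ingest the externally computed values into a registered Feature Table.
ws = tecton.get_workspace("prod")
feature_table = ws.get_feature_table("user_custom_scores")
feature_table.ingest(df)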

It’s important to note, though, that Tecton has a specific approach and set of requirements for the types of
transformations it supports. Tecton is optimized to handle aggregation-type features and transformations
in a highly efficient manner. If you have mostly non-aggregation features, it can be more challenging to
integrate them into Tecton’s framework.

Unpacking Real-Time & Streaming Transformations in Tecton
Tecton orchestrates real-time and streaming transformations by processing raw data at request time from
streams or on-demand sources. Just as with batch transformations, the orchestrator interprets simple
Python files that describe the transformation. Here is an example:

Python

from tecton import stream_feature_view, FilteredSource, Aggregation
from fraud.entities import user
from fraud.data_sources.transactions import transactions_stream
from datetime import datetime, timedelta

# The following defines several sliding time window aggregations
# over a user's transaction amounts.
@stream_feature_view(
    source=FilteredSource(transactions_stream),
    entities=[user],
    mode="spark_sql",
    # How frequently feature values get updated in the online store
    aggregation_interval=timedelta(minutes=10),
    # How frequently batch jobs are scheduled to ingest into the offline store
    batch_schedule=timedelta(days=1),
    aggregations=[
        Aggregation(column="amt", function="sum", time_window=timedelta(hours=1)),
        Aggregation(column="amt", function="sum", time_window=timedelta(days=1)),
        Aggregation(column="amt", function="sum", time_window=timedelta(days=3)),
        Aggregation(column="amt", function="mean", time_window=timedelta(hours=1)),
        Aggregation(column="amt", function="mean", time_window=timedelta(days=1)),
        Aggregation(column="amt", function="mean", time_window=timedelta(days=3)),
    ],
    online=True,
    offline=True,
    feature_start_time=datetime(2022, 5, 1),
    tags={"release": "production"},
    owner="kevin@tecton.ai",
    description="Transaction amount statistics and total over a series of time windows, updated every 10 minutes.",
)
def user_transaction_amount_metrics(transactions):
    return f"""
        SELECT
            user_id,
            amt,
            timestamp
        FROM
            {transactions}
        """


In this Python code, we’ve defined a Tecton Feature View indicating the source of streaming data for user
transactions. We want to get transaction amount totals and averages over a series of time windows, and to
do so, we define several time-windowed aggregations. In addition, Spark SQL is used to define row-level
filtering or projection transformations. While these kinds of streaming data operations can be very difficult
to manage efficiently, Tecton handles all of the orchestration and transformations so you don’t have to
worry about the underlying mechanics.

To provide users with even more options, Tecton has launched the Stream Ingest API, which drastically
reduces the infrastructure burden streaming features put on Tecton’s customers. With the Stream Ingest
API, all the customer has to do is send records to Tecton. Tecton immediately handles the transformation of
those records at sub-second latency, writes them to the Tecton feature store, and ensures that streaming
features are continuously updated with the latest data. Despite being a simpler solution, the Stream Ingest
API provides a powerful, cost-effective way to support streaming features, delivering capabilities on par
with or exceeding those of traditional Spark-based streaming engines.
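
In practice, sending a record can be as simple as an HTTP call. The sketch below is illustrative only; the ingest endpoint, payload shape, and source name are based on Tecton’s Stream Ingest API but depend on your Tecton version and configuration.

Python

import requests
from datetime import datetime, timezone

# Placeholder values; endpoint and payload shape vary by deployment and version.
INGEST_URL = "https://preview.yourcluster.tecton.ai/ingest"
API_KEY = "your-service-account-api-key"

payload = {
    "workspace_name": "prod",
    "dry_run": False,
    "records": {
        # Keyed by the name of the stream source registered in Tecton.
        "transactions_event_source": [
            {
                "record": {
                    "user_id": "user_123",
                    "amt": 42.50,
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                }
            }
        ]
    },
}

resp = requests.post(INGEST_URL, json=payload, headers={"Authorization": f"Tecton-key {API_KEY}"})
resp.raise_for_status()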

Unlock MLOps Excellence

Navigating the vast landscape of MLOps and data processing tools can be overwhelming. Armed with the information in this whitepaper, you can now better understand the different types of data and transformations for ML, and how a feature platform can help take the work out of managing data flows so that your team can focus on building features and models.

Whether harnessing real-time insights, incorporating streaming features, generating training data that is consistent with production environments, or serving predictions in online applications, Tecton excels at catering to diverse data flow scenarios. And we’re always evolving to ensure we remain the best fit for your changing needs.

Want to learn more? Book a demo with Tecton today.
