
Breaking Out of the Training Data Bottleneck

How to Generate High-Quality Data, Faster
Contents

Introduction
Training data 101
    Creating a training dataset
    Retraining
    The importance of accurate training data
Why generating training data is hard—and why most organizations don’t get it right
    1. Getting access to the right training datasets
    2. Backfilling data
    3. The time-travel problem
    4. Re-creating feature logic
    5. Lack of traceability and lineage
The advantages of using a feature platform
    1. Single authorship of features
    2. Easy to generate training data
    3. Time travel, solved
    4. Version and share features like code
    5. Accelerate notebook-driven development
Build real-time predictive products faster with a feature platform

Introduction
The American football coach Vince Lombardi famously said, “Practice does not make perfect. Only perfect practice makes perfect.” In other words, it’s not enough to just train. You have to invest in making that training accurate and precise, so you’re not reinforcing wrong ideas and assumptions.

For the same reasons, it’s a fact of machine learning that you can’t create a great model without great training data. You can’t get great (or even accurate) predictive outputs if you don’t train your models the right way with the right data. Just like an athlete will perform poorly if they don’t train under realistic conditions, models will perform poorly if the data they’re trained on doesn’t match the data they’ll see in production. This mismatch, called training-serving skew, is one of the biggest challenges in machine learning: the model’s mistakes can go undetected and cause significant losses over time.

Training data 101
Data scientists and ML experts, skip ahead to “Why generating training data is hard.”

Grossly simplified, training data is historical data that a model uses to learn to predict the future. From the training data, the model learns statistical patterns and relationships, which it applies to new data points.

In most cases, you can’t train a model on raw data—unprocessed, unorganized data that hasn’t been manipulated or formatted in any way. Instead, someone, typically a data scientist, needs to transform the data, which can include cleaning, filtering, aggregating, merging, or applying mathematical operations to the raw data. This is often referred to as data wrangling. These transformations remove noise from the data and extract the important signals, or features, that the model will use to make predictions.

Depending on the type of model, one of the transformation steps may include labeling the data with target
values or classes. For instance, if you want to train a classifier to identify fraudulent transactions, you will
need to label each example in your training dataset as fraudulent or non-fraudulent.
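To make this concrete, here is a minimal pandas sketch of wrangling and labeling raw transaction records for such a fraud classifier. Every column name and cleaning rule here is hypothetical:

```python
import pandas as pd

# Hypothetical raw transaction records, straight from the source system.
raw = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "amount": [25.0, None, 980.0, 15.5],
    "chargeback_reported": [False, False, True, False],
})

# Clean: drop records with missing amounts (one of many possible rules).
clean = raw.dropna(subset=["amount"]).copy()

# Label: mark each example with the target class the model will learn.
clean["is_fraud"] = clean["chargeback_reported"].astype(int)

print(clean[["transaction_id", "amount", "is_fraud"]])
```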

Creating a training dataset


Imagine you’re a company like Uber Eats and you want to build an application that predicts food delivery times. Behind the beautifully designed application that your customers will use, there’s a whole complex system of models and data pipelines that you need to run and orchestrate. The hardest part is probably the data, because deriving the variables, or features, that can effectively predict delivery time is hard. Sure, you have a ton of raw data to work with, but that raw data needs to be transformed into actionable insights, such as:

• How long it typically takes this restaurant to fulfill an order
• How many drivers are in the local area at the time of the order
• How many orders the restaurant has in the queue at the time of the order

You want your training dataset to give your model a large number of diverse examples of how long it took to
deliver food orders in the past and what all the important features were for each of those orders.

4
Once your model has been trained on this data and is able to accurately predict the delivery times for historical data points that you’ve hidden from the model, it should be able to make accurate predictions on net-new data.

To create a training dataset, you need to:

• Collect raw data from your data sources. First, you need to identify all the raw data sources that you want to use to generate training data for your model. Often, there isn’t just one source, but many sources, spanning both historical data (stored in a table somewhere in your data warehouse) and real-time data (which may or may not be stored at all!).

• Derive features from the raw data. The next step is to develop features for the model to consume.
Usually, features are some kind of aggregation or statistic on the data.

For instance, a useful feature for this model might be how long it takes a restaurant on average to
fulfill an order. Typically, this feature won’t be stored directly in your raw data. Instead, you might
have a record of historical order fulfillment times for each restaurant, and you need to use that
to create a feature representing the average time. Note, however, that this is a simple example—
feature engineering can get extremely complex, involving many chained data transformations.
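For instance, a minimal pandas sketch of that averaging transformation might look like this (table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical raw record of historical order fulfillment times.
orders = pd.DataFrame({
    "restaurant_id": ["r1", "r1", "r2", "r2", "r2"],
    "fulfillment_minutes": [22, 30, 12, 18, 15],
})

# Aggregate the raw rows into a per-restaurant feature:
# the average time to fulfill an order.
avg_fulfillment = (
    orders.groupby("restaurant_id")["fulfillment_minutes"]
    .mean()
    .rename("avg_fulfillment_minutes")
    .reset_index()
)

print(avg_fulfillment)
```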

[Figure: Raw data sources (live deliveries, live restaurant orders, historical customer purchases, historical restaurant ratings, customer search queries) are transformed into ML features such as 30min_restaurant_order_count: 17, 5min_available_drivers_count: 4, and 30d_cust_fav_restaurant_category: “mexican”.]

• Split data into training and validation sets. As data scientists generate training data, they also generate the data they’ll use to validate the model’s performance. This evaluation is typically the basis for the decision to deploy the model. The validation set is often called a “hold out set” because it’s held out, or separated, from the training data but comes from the same overall dataset.
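A common way to do this split in Python is scikit-learn’s train_test_split; the feature matrix and labels below are toy stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: X holds the feature columns, y the delivery times (minutes).
rng = np.random.default_rng(0)
X = rng.random((1000, 3))   # e.g., fulfillment time, driver count, queue depth
y = rng.random(1000) * 60

# Hold out 20% of the examples as the validation ("hold out") set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```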

Retraining
It’s essential to retrain models regularly on new data. Great training data creates an accurate snapshot of
the world. But as time progresses and the world changes, this snapshot will inevitably become out-of-date.
As a result, the model’s performance will suffer because its assumptions based on the training data are no
longer valid.
[Figure: The ML flywheel. Data is collected and organized in the data warehouse, a model learns from it and decides, the product acts, and new observations are collected, closing the loop.]
Model training is the “learn” step that improves models through the ML flywheel.

Even a few years ago, model retraining was mostly a manual process for all but the most sophisticated
tech companies; therefore, models were only retrained when their accuracy dropped significantly (and
sometimes, not even then).

Modern MLOps tools have made it much easier to retrain models on an automated basis, weekly or even daily. Just as production data is constantly changing, in this new world of continuous model development, training data is an evolving, living asset rather than a static artifact.

The importance of accurate training data


High-quality training data is critical: If there are mismatches between production (the real world) and
training (the model’s representation of the real world), models will perform poorly. Unlike software, they
won’t break or throw an error—they’ll make incorrect assumptions or bad predictions that can be very
difficult to debug.

It isn’t easy to catch these problems ahead of time because the model will appear to perform well on the
validation dataset, which is generated from the same dataset as the training data! And with automated
retraining of models, if an error happens, it can affect production systems quickly and with little forewarning.

Why generating training data is hard—and why most organizations don’t get it right
Generating accurate training data for developing and continuously retraining models is a complex problem,
particularly when incorporating real-time data into models. Most organizations are struggling with one or
more of the challenges described below.

1. Getting access to the right training datasets


In most organizations, data is siloed. While a data warehouse like Snowflake or Databricks holds a lot of historical business data, other important data, like event-stream data or on-demand data from a third-party API, lives in different systems. Furthermore, data scientists are used to working in Python and may lack the expertise, or the knowledge of how the data is formatted and stored, to query it directly.

As a result, data scientists have to make requests to specific data teams or data engineers just to generate a dataset they can use to train models—particularly if that data comes from streaming or real-time systems. This manual process can take weeks, and by the time the data scientist gets access to the raw data they requested, the business problem may have changed or the data may no longer reflect the current context.

2. Backfilling data
Backfilling is often necessary when you want to add a new feature to your model and you need to quickly
generate training data that has that feature.

Going back to the food delivery time model, imagine that we want to add a feature that looks at live traffic stats in the local area because we think it will make the model better. One option would be to start logging traffic information every time the model makes a prediction going forward. If we’re Uber Eats, we might soon have enough traffic data to train a new model.

However, most companies would need to wait much longer to accumulate enough examples to retrain their model. This “log and wait” method can take a long time and slow down experimentation. The alternative is backfilling: You look back at historical data and fill in the values that would have been there, as if your model had access to the data at that time.

7
Customer Order ID    Timestamp     Real-time Traffic Data

2394400344           03-27-2021    ?
3340273112           09-12-2022    ?
6803922503           04-29-2022    ?
1395822099           01-05-2021    ?

Backfilling training data is notoriously tricky. It can be difficult to retrieve data from the past, especially real-time and streaming data, without losing information. It’s also very easy to accidentally incorporate information from the future into your training data (the time-travel problem). Furthermore, any transformation logic that you use to backfill data must be the exact same logic your model uses to fetch data in production. Otherwise, you will end up with training-serving skew.
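In pandas, one common way to backfill a feature like the traffic stat above without leaking future information is an as-of join; the tables and column names here are hypothetical:

```python
import pandas as pd

# Historical orders missing the new traffic feature (as in the table above).
orders = pd.DataFrame({
    "order_id": [2394400344, 3340273112],
    "timestamp": pd.to_datetime(["2021-03-27 18:05", "2022-09-12 12:30"]),
})

# Hypothetical log of traffic readings over time.
traffic = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-27 18:00", "2022-09-12 12:00"]),
    "congestion_index": [0.72, 0.31],
})

# Backfill: for each order, take the most recent traffic reading at or
# before the order time, never a reading from after it.
backfilled = pd.merge_asof(
    orders.sort_values("timestamp"),
    traffic.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
)
print(backfilled)
```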

3. The time-travel problem


In constructing a training dataset, it’s extremely important not to accidentally use information from the future in any one example.

For instance, imagine we have a model that recommends ads for users to click on. To construct a training dataset, we might start with a historical log of when users clicked on ads over the past six months. One of the features that we want to pull into our model is how often a given user has clicked on a similar ad in the past, since this can be a good predictor of whether they’ll click on another ad. To do that, we need to look up the user in the events table and aggregate their clicks over time.

Here’s where the problem can come in: If we aggregate all the user’s clicks, we will be incorporating information from the future into the training dataset. Instead, it’s essential that for each data point, we only look at the user’s clicks before that point in time.
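Here is a minimal pandas sketch of a point-in-time correct version of this feature: for each click event, count only the same user’s earlier clicks (all data here is made up):

```python
import pandas as pd

# Hypothetical log of ad-click events per user.
clicks = pd.DataFrame({
    "user_id": ["tracy", "tracy", "tracy", "omar"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-15"]
    ),
}).sort_values(["user_id", "timestamp"])

# For each event, count only the same user's clicks that happened strictly
# BEFORE that event. cumcount() numbers each row by how many earlier rows
# the user has, so it never counts the current or any future click.
clicks["prior_click_count"] = clicks.groupby("user_id").cumcount()

print(clicks)
```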

8
[Figure: Point-in-time correct joins. For each prediction event on a user’s timeline (Tracy, Omar, Jin), only the events that occurred before the prediction event are in scope for the join; later events are out of scope.]

If you train a model on future information, it will seem like it performs really well—because it can see into
the future! But in production, the model won’t have any future information, and it will act erratically.

The time-travel problem with real-time models

For real-time models, the time-travel problem gets even more difficult to reason about because you have
to think about whether the model will have access to the data at the moment that it’s going to make a
prediction.

Imagine that you have a dataset of historical transactions and when they occurred, and you want to train a
real-time model on it. For a given point in time in your training dataset, all the historical transactions that
occurred earlier than that time should be fair game, right? Not quite.

You now have to think about whether the data would have actually been available to your model at that time.
If there is any delay in loading the transaction data into the real-time system that feeds data to your model
(such as a batch process that updates the transactions every hour), you need to account for that delay when
you generate your training data.
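A minimal sketch of accounting for that delay, assuming an hourly batch load (the one-hour figure is purely illustrative):

```python
import pandas as pd

# Illustrative: transactions are loaded into the online system by an hourly
# batch job, so a record isn't visible to the model until up to an hour
# after it occurred.
LOAD_DELAY = pd.Timedelta(hours=1)

transactions = pd.DataFrame({
    "txn_id": [1, 2],
    "event_time": pd.to_datetime(["2024-05-01 09:10", "2024-05-01 09:55"]),
})

# When generating training data, treat each transaction as available only
# after the load delay, not at the moment it occurred. Any point-in-time
# join should filter on available_time to mirror what the model could see.
transactions["available_time"] = transactions["event_time"] + LOAD_DELAY

print(transactions)
```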

4. Re-creating feature logic
In many organizations, data science is separate from ML engineering. Data scientists work on model training
and development in an offline environment, and then once they are satisfied with the model’s performance
in evaluations, the engineering team works on productionizing the model. They build the required data
pipelines and logic to supply feature data to the model in an online environment.

The more complicated the logic is for generating a feature (i.e., transforming a set of raw data points into a predictive input for the model), the more likely it is that there will be discrepancies between how the data scientists encoded that logic for the training dataset and how the ML engineering team ends up building that pipeline for production. (If you have really complicated logic, it’s tough to get it right even once!)

This is especially challenging when working with streaming data and on-demand features where the model requires freshness; e.g., being able to query the last 15 minutes of transactions on the fly for a fraud model. It takes a lot of effort for the data scientist to build the right query, and passing it off to the engineering team requires a lot of coordination to recreate that query in an entirely different environment. Usually, something slips through the cracks, creating inconsistencies between training and serving.

5. Lack of traceability and lineage


Re-creating feature logic isn’t only a problem when you’re going from training to serving. It can also create
issues when data scientists want to reuse features, but they don’t have any way to standardize or share
feature definitions. If they have to create the feature logic again from scratch, it’s likely that mistakes will
be made, resulting in lower-quality training data.

For example, we recently talked to a customer who had risk models that were trained a long time ago. They didn’t know how to retrain them because they no longer had the code to do so. The models were probably trained on a data scientist’s laptop using a Jupyter notebook, but the original data and logic used to create their features were never saved or stored anywhere.

The advantages of using a feature platform
With all the challenges of creating and reusing training data, many teams are turning to the MLOps space for a solution. Feature platforms manage the full lifecycle of features for machine learning models, and generally consist of:

1. A place to author feature definitions using a familiar coding language and a DSL (domain-specific language): Feature platforms let data scientists define features-as-code using .py files.

2. A catalogue of feature definitions: A centralized repository that enables users to manage feature definitions as files. With Tecton, users define features in code, version control them in git, unit test them, and roll them out safely using Continuous Delivery pipelines.

3. Transformation: Based on the definitions, the platform orchestrates and continuously runs data
pipelines to compute features from raw data sources (both streaming and batch) and stores the
resulting feature values.

4. Storage: The platform maintains consistency between two stores for feature values: an offline
store optimized for large-scale, low-cost retrieval for training, and an online store optimized for
low-latency retrieval for production.

5. Serving: The platform exposes an API/SDK to retrieve feature values quickly and reliably for
both training and production.

Feature platforms have become indispensable tools for standardizing the access and use of training data
across an organization for the following reasons.

1. Single authorship of features


Feature platforms let you write a single definition of a feature that will work in an online environment
and can be backfilled against historical data in the offline environment. This gives you consistency across
training and serving because you won’t have to write two completely different pipelines (which are not
guaranteed to work the same).
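To give a flavor of what a single feature definition looks like, here is an illustrative sketch in the style of Tecton’s Python decorators. The source, entity, and decorator arguments are assumptions from memory and can vary by SDK version; treat this as the shape of the idea, not a copy-paste definition:

```python
from datetime import datetime, timedelta
from tecton import batch_feature_view

# Illustrative only: `transactions` would be a BatchSource and `user` an
# Entity defined elsewhere in the feature repository.
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode="spark_sql",
    online=True,    # materialize to the online store for serving
    offline=True,   # materialize to the offline store for training
    feature_start_time=datetime(2022, 1, 1),  # how far back to backfill
    batch_schedule=timedelta(days=1),
)
def user_transaction_features(transactions):
    return f"""
        SELECT user_id, amount, timestamp
        FROM {transactions}
    """
```

The key point is that this one definition drives both the historical backfill and the online pipeline, so there is no second implementation to drift out of sync.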

2. Easy to generate training data
With a feature platform, you can generate an accurate training dataset on demand with just a few lines of
code. All the backfilling complexity is handled for you—all you need to provide is the feature time window.

For example, in Tecton, constructing training data involves just three steps:

1. Creating the Feature Service that assembles the features you want, if you don’t already have one.
2. Selecting the keys and timestamps for each sample you want in your training data. We call this the spine.
3. Requesting the training data from the Feature Service by giving it the spine, as in the sketch below.
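A hedged sketch of those steps with Tecton’s Python SDK (the workspace, service, and column names are hypothetical, and method names can vary by SDK version):

```python
import pandas as pd
import tecton

# The spine: one row per training example, holding just the join keys and
# the timestamp at which the prediction would have been made.
spine = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "timestamp": pd.to_datetime(["2023-01-15", "2023-02-20"]),
})

# Assumes a Feature Service named "fraud_detection" already exists in the
# "prod" workspace.
ws = tecton.get_workspace("prod")
fs = ws.get_feature_service("fraud_detection")

# Point-in-time correct feature values for every row in the spine.
training_df = fs.get_historical_features(spine).to_pandas()
```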

Resource:
• Tecton docs: Constructing training data

3. Time travel, solved


Feature platforms backfill feature data by performing point-in-time correct joins, avoiding the time-travel problem and ensuring consistency between training and serving. Some feature platforms will also let you set explicit schedules if the data is not always immediately available to the production model. For example, you might tell the platform that, because of processing, your data isn’t usually available until an hour after the end of the day; the platform can then recreate training data that mirrors the exact moments when the data would have been available to the model.

Resources:
• Tecton docs: Point-in-time correctness
• Tiled time-window aggregations for backfilling real-time features with high performance

4. Version and share features like code

By storing all feature definitions in a versioned repository as .py files, feature platforms solve several
problems. First, you can easily reuse complex features in training and serving, including streaming and
on-demand features that might be difficult to query otherwise.

Second, since you don’t have to reimplement the logic in serving, you’re less likely to introduce differences in how the features are implemented. And finally, data scientists can easily find all the features available when they are generating new training data.

5. Accelerate notebook-driven development

Typically, generating training data is a process of experimentation to see what new data and features will
improve the model. Notebook-driven development lets data teams collaborate and rapidly iterate during
model training and development. Within notebooks, data scientists can run code, explore data, and share
results all in one place.

However, it can be challenging to get the right training data into the notebook quickly. And when the data team finds something interesting, they need to “throw it over the fence” to engineering to productionize the results. Feature platforms can solve both of these problems. For example, with Tecton you can retrieve training data and define features inside a notebook with just a few lines of Python. Then you can immediately push new features to the offline and online feature stores to be consumed by your model. This rapid development is especially important for teams working on real-time use cases where there’s a need to refine models ASAP in response to external changes.

Resources:
• Tecton docs: Serverless feature retrieval from any Python environment
• Tecton blog: Notebook-Driven Development

Build real-time predictive products faster
with a feature platform
Good training data is essential to building a successful model. When data scientists struggle to get the training data they need or to share feature definitions, model development slows down. When training data is successfully generated but doesn’t match what the model sees in production, models fail—often in invisible ways that result in slightly worse predictions, eating away at value over time.

Feature platforms are designed to solve these problems, letting teams generate high-quality training data for models with just a few lines of code. By guaranteeing the same representation of features in training and serving environments, feature platforms avoid training-serving skew and cut out the double work that teams do to reconstruct features in production. Feature platforms also make it easy to share features and iterate rapidly on model training.

Want to learn more about how feature platforms can help accelerate
your ML efforts? Check out a Tecton demo.
