
Testing Machine Learning Systems: Code, Data and Models

Goku Mohandas


Testing code, data and models to ensure consistent behavior in ML systems.

Repository


Intuition

Tests are a way for us to ensure that something works as intended. We're incentivized to implement tests and discover sources of error as early in the development cycle as possible so that we can reduce downstream costs and wasted time. Once we've designed our tests, we can automatically execute them every time we change our system, and continue to build on them.

Types of tests

There are several major types of tests that are used at different points in the development cycle:

1. Unit tests: tests on individual components that each have a single responsibility (ex. function that filters a list).
2. Integration tests: tests on the combined functionality of individual components (ex. data processing).
3. System tests: tests on the design of a system for expected outputs given inputs (ex. training, inference, etc.).
4. Acceptance tests: tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT).
5. Regression tests: tests on errors we've seen before to ensure new changes don't reintroduce them.

Note

There are many other types of functional and non-functional tests as well, such as smoke tests (quick health checks), performance tests (load, stress), security tests, etc. but we can generalize these under the system tests above.

How should we test?

The framework to use when composing tests is the Arrange Act Assert methodology.

- Arrange: set up the different inputs to test on.
- Act: apply the inputs on the component we want to test.
- Assert: confirm that we received the expected output.
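For example, a minimal sketch of a test that makes each step explicit (the decode helper here is hypothetical, not part of our application):

```python
def decode(indices, index_to_class):
    """Hypothetical helper that maps indices back to class names."""
    return [index_to_class[i] for i in indices]


def test_decode():
    # Arrange: set up the inputs and the expected output
    indices = [2, 0, 1]
    index_to_class = {0: "attention", 1: "mlops", 2: "transformers"}
    expected = ["transformers", "attention", "mlops"]

    # Act: apply the inputs on the component we want to test
    classes = decode(indices=indices, index_to_class=index_to_class)

    # Assert: confirm that we received the expected output
    assert classes == expected
```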

Tip

Cleaning is an unofficial fourth step to this methodology because it's important to not leave remnants of a previous state which may affect subsequent tests. We can use packages such as pytest-randomly to test against state dependency by executing tests randomly.

In Python, there are many tools, such as unittest, pytest, etc., that allow us to easily implement our tests while adhering to the Arrange Act Assert framework above. These tools come with powerful built-in functionality such as parametrization, filters, and more, to test many conditions at scale.

Note

When arranging our inputs and asserting our expected outputs, it's important to test across the entire gamut of inputs and outputs:

- inputs: data types, format, length, edge cases (min/max, small/large, etc.)
- outputs: data types, formats, exceptions, intermediary and final outputs

Best practices

Regardless of the framework we use, it's important to strongly tie testing into the development process.

- atomic: when creating unit components, we need to ensure that they have a single responsibility so that we can easily test them. If not, we'll need to split them into more granular units.
- compose: when we create new components, we want to compose tests to validate their functionality. It's a great way to ensure reliability and catch errors early on.
- regression: we want to account for new errors we come across with a regression test so we can ensure we don't reintroduce the same errors in the future.
- coverage: we want to ensure that 100% of our codebase has been accounted for. This doesn't mean writing a test for every single line of code but rather accounting for every single line (more on this in the coverage section below).
- automate: in the event we forget to run our tests before committing to a repository, we want to auto run tests for every commit. We'll learn how to do this locally using pre-commit and remotely (i.e. main branch) via GitHub Actions in subsequent lessons (a quick sketch is shown below).
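For now, here is a rough sketch of what a local hook in a .pre-commit-config.yaml file could look like to run the test suite on every commit; treat it as an illustration rather than our exact setup, which we'll build properly in the pre-commit lesson:

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: local
    hooks:
      - id: test
        name: run tests
        entry: pytest
        language: system
        pass_filenames: false
```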

Test-driven development

Test-driven development (TDD) is the process where you write a test before completely writing the functionality to ensure that tests are always written. This is in contrast to writing functionality first and then composing tests afterwards. Here are my thoughts on this:

- it's good to write tests as we progress, but they're not a guarantee of correctness.
- initial time should be spent on design before ever getting into the code or tests.
- using a test as a guide doesn't mean that our functionality is error free.

Perfect coverage doesn't mean that our application is error free if those tests aren't meaningful and don't encompass the field of possible inputs, intermediates and outputs. Therefore, we should work towards better design and agility when facing errors, quickly resolving them and writing test cases around them to avoid them next time.

Warning

This topic is still highly debated and I'm only reflecting on my experience and what's worked well for me at a large company (Apple), a very early stage startup and running a company of my own. What's most important is that the team is producing reliable systems that can be tested and improved upon.

Application

In our application, we'll be testing the code, data and models. Be sure to look inside each of the different testing scripts after reading through the components below.

```
great_expectations/        # data tests
| ├── expectations/
| | ├── projects.json
| | └── tags.json
| ├── ...
tagifai/
| ├── eval.py              # model tests
tests/                     # code tests
├── app/
| ├── test_api.py
| └── test_cli.py
└── tagifai/
| ├── test_config.py
| ├── test_data.py
| ├── test_eval.py
| ├── test_models.py
| ├── test_train.py
| └── test_utils.py
```

Note

Alternatively, we could've organized our tests by types of tests as well (unit, integration, etc.) but I find it more intuitive for navigation to organize them by how our application is set up. We'll learn about markers below which will allow us to run any subset of tests by specifying filters.

🧪  Pytest

We're going to be using pytest as our testing framework for its powerful built-in features such as parametrization, fixtures, markers, etc.

Configuration

Pytest expects tests to be organized under a tests directory by default. However, we can also use our pyproject.toml file to configure any other test path directories as well. Once in the directory, pytest looks for python scripts starting with test_*.py but we can configure it to read any other file patterns as well.

```toml
# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
```

Assertions

Let's see what a sample test and its results look like. Assume we have a simple function that determines whether a fruit is crisp or not (notice: single responsibility):

```python
# food/fruits.py
def is_crisp(fruit):
    if fruit:
        fruit = fruit.lower()
        if fruit in ["apple", "watermelon", "cherries"]:
            return True
        elif fruit in ["orange", "mango", "strawberry"]:
            return False
        else:
            raise ValueError(f"{fruit} not in known list of fruits.")
    return False
```

To test this function, we can use assert statements to map inputs with expected outputs:

```python
# tests/food/test_fruits.py
def test_is_crisp():
    assert is_crisp(fruit="apple")  # or == True
    assert is_crisp(fruit="Apple")
    assert not is_crisp(fruit="orange")
    with pytest.raises(ValueError):
        is_crisp(fruit=None)
        is_crisp(fruit="pear")
```

Note

We can also have assertions about exceptions, like we do in the with block above, where all the operations under the with statement are expected to raise the specified exception.

Execution

We can execute our tests above using several different levels of granularity:

```bash
pytest                                           # all tests
pytest tests/food                                # tests under a directory
pytest tests/food/test_fruits.py                 # tests for a single file
pytest tests/food/test_fruits.py::test_is_crisp  # tests for a single function
```

Running our specific test above would produce the following output:

```
tests/food/test_fruits.py::test_is_crisp PASSED      [100%]
```

Had any of our assertions in this test failed, we would see the failed assertions as well as the expected output and the output we received from our function.
Note

It's important to test for the variety of inputs and expected outputs that we outlined above and to never assume that a test is trivial. In our example above, it's important that we test for both "apple" and "Apple" in the event that our function didn't account for casing!

Classes

We can also test classes and their respective functions by creating test classes. Within our test class, we can optionally define functions which will automatically be executed when we set up or tear down a class instance or use a class method.

- setup_class: set up the state for any class instance.
- teardown_class: tear down the state created in setup_class.
- setup_method: called before every method to set up any state.
- teardown_method: called after every method to tear down any state.

```python
class Fruit(object):
    def __init__(self, name):
        self.name = name


class TestFruit(object):
    @classmethod
    def setup_class(cls):
        """Set up the state for any class instance."""
        pass

    @classmethod
    def teardown_class(cls):
        """Teardown the state created in setup_class."""
        pass

    def setup_method(self):
        """Called before every method to setup any state."""
        self.fruit = Fruit(name="apple")

    def teardown_method(self):
        """Called after every method to teardown any state."""
        del self.fruit

    def test_init(self):
        assert self.fruit.name == "apple"
```

We can execute all the tests for our class by specifying the class name:

```bash
pytest tests/food/test_fruits.py::TestFruit
```

```
tests/food/test_fruits.py::TestFruit .                [100%]
```

We use test classes to test all of our class modules such as LabelEncoder, Tokenizer, CNN, etc.
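For example, a sketch of what a test class for a LabelEncoder-style component might look like (the fit/encode/decode API shown here is assumed for illustration and may differ from our actual implementation):

```python
class TestLabelEncoder:
    def setup_method(self):
        """Create a fresh encoder before every test method."""
        self.label_encoder = LabelEncoder()  # assumed application class

    def teardown_method(self):
        """Remove the encoder after every test method."""
        del self.label_encoder

    def test_fit(self):
        self.label_encoder.fit(["mlops", "nlp", "mlops"])
        assert len(self.label_encoder.classes) == 2  # assumed attribute

    def test_encode_decode(self):
        self.label_encoder.fit(["mlops", "nlp"])
        encoded = self.label_encoder.encode(["nlp"])
        assert self.label_encoder.decode(encoded) == ["nlp"]
```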

Parametrize

So far, in our tests, we've had to create individual assert statements to validate different combinations of inputs and expected outputs. However, there's a bit of redundancy here because the inputs always feed into our functions as arguments and the outputs are compared with our expected outputs. To remove this redundancy, pytest has the @pytest.mark.parametrize decorator which allows us to represent our inputs and outputs as parameters.

```python
 1 @pytest.mark.parametrize(
 2     "fruit, crisp",
 3     [
 4         ("apple", True),
 5         ("Apple", True),
 6         ("orange", False),
 7     ],
 8 )
 9 def test_is_crisp_parametrize(fruit, crisp):
10     assert is_crisp(fruit=fruit) == crisp
```

```
pytest tests/food/test_is_crisp_parametrize.py ...   [100%]
```

1. [Line 2]: define the names of the parameters under the decorator, ex. "fruit, crisp" (note that this is one string).
2. [Lines 3-7]: provide a list of combinations of values for the parameters from Step 1.
3. [Line 9]: pass in parameter names to the test function.
4. [Line 10]: include necessary assert statements which will be executed for each of the combinations in the list from Step 2.

In our application, we use parametrization to test components that require varied sets of inputs and expected outputs such as preprocessing, filtering, etc.
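A sketch of what that might look like for a text preprocessing step (the preprocess function here is a minimal stand-in, not our application's exact implementation):

```python
import pytest


def preprocess(text, lower=True):
    """Minimal stand-in for a text preprocessing function."""
    return text.lower() if lower else text


@pytest.mark.parametrize(
    "text, lower, expected",
    [
        ("Transfer learning with BERT", True, "transfer learning with bert"),
        ("Transfer learning with BERT", False, "Transfer learning with BERT"),
    ],
)
def test_preprocess(text, lower, expected):
    assert preprocess(text=text, lower=lower) == expected
```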


Note

We could pass in an exception as the expected result as well:

```python
@pytest.mark.parametrize(
    "fruit, exception",
    [
        ("pear", ValueError),
    ],
)
def test_is_crisp_exceptions(fruit, exception):
    with pytest.raises(exception):
        is_crisp(fruit=fruit)
```

Fixtures

Parametrization allows us to efficiently reduce redundancy inside test functions but what about the inputs themselves? Here, we can use pytest's builtin fixture, which is a function that is executed before the test function. This significantly reduces redundancy when multiple test functions require the same inputs.

```python
@pytest.fixture
def my_fruit():
    fruit = Fruit(name="apple")
    return fruit


def test_fruit(my_fruit):
    assert my_fruit.name == "apple"
```

We can apply fixtures to classes as well, where the fixture function will be invoked when any method in the class is called.

```python
@pytest.mark.usefixtures("my_fruit")
class TestFruit:
    ...
```

We use fixtures to efficiently pass a set of inputs (ex. Pandas DataFrame) to different testing functions that require them (cleaning, splitting, etc.).

```python
@pytest.fixture
def df():
    projects_fp = Path(config.DATA_DIR, "projects.json")
    projects_dict = utils.load_dict(filepath=projects_fp)
    df = pd.DataFrame(projects_dict)
    return df


def test_split(df):
    splits = split_data(df=df)
    ...
```

Note

Typically, when we have too many fixtures in a particular test file, we can organize them all in a fixtures.py script and invoke them as needed.
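Another common pytest convention (shown here as a sketch, not necessarily how our repository is laid out) is to place shared fixtures in a conftest.py file, which pytest discovers automatically and makes available to every test in that directory:

```python
# tests/conftest.py (sketch)
import pandas as pd
import pytest


@pytest.fixture
def df():
    """Shared DataFrame fixture available to all tests under tests/ (illustrative data)."""
    return pd.DataFrame({"title": ["Rasoee"], "description": ["A web app"], "tags": [["api"]]})
```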

Markers

We've been able to execute our tests at various levels of granularity (all tests, script, function, etc.) but we can create custom granularity by using markers. We've already used one type of marker (parametrize) but there are several other builtin markers as well. For example, the skipif marker allows us to skip execution of a test if a condition is met.

```python
@pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="Full training tests require a GPU."
)
def test_training():
    pass
```

We can also create our own custom markers with the exception of a few reserved marker names.

```python
@pytest.mark.fruits
def test_fruit(my_fruit):
    assert my_fruit.name == "apple"
```

We can execute them by using the -m flag which requires a (case-sensitive) marker expression like below:

```bash
pytest -m "fruits"      # runs all tests marked with `fruits`
pytest -m "not fruits"  # runs all tests besides those marked with `fruits`
```

The proper way to use markers is to explicitly list the ones we've created in our pyproject.toml file. Here we can specify that all markers must be defined in this file with the --strict-markers flag and then declare our markers (with some info about them) in our markers list:

```toml
# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
addopts = "--strict-markers --disable-pytest-warnings"
markers = [
    "training: tests that involve training",
]
```

Once we do this, we can view all of our existing markers by executing pytest --markers and we'll also receive an error when we're trying to use a new marker that's not defined here.

We use custom markers to label which of our test functions involve training so we can separate long-running tests from everything else.

```python
@pytest.mark.training
def test_train_model():
    experiment_name = "test_experiment"
    run_name = "test_run"
    result = runner.invoke()
    ...
```

Note

Another way to run custom tests is to use the -k flag when running pytest. The -k expression is much less strict than the marker expression since we can define expressions based on test names.
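For example, these are standard pytest invocations that select tests by name rather than by marker:

```bash
pytest -k "fruit"                      # runs all tests whose names contain "fruit"
pytest -k "crisp and not parametrize"  # supports boolean expressions over test names
```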

Coverage

As we're developing tests for our application's components, it's important to know how well we're covering our code base and to know if we've missed anything. We can use the Coverage library to track and visualize how much of our codebase our tests account for. With pytest, it's even easier to use this package thanks to the pytest-cov plugin.

```bash
pytest --cov tagifai --cov app --cov-report html
```

Here we're asking for coverage for all the code in our tagifai and app directories and to generate the report in HTML format. When we run this, we'll see the tests from our tests directory executing while the coverage plugin keeps track of which lines in our application are being executed. Once our tests are complete, we can view the generated report (default is htmlcov/index.html) and click on individual files to see which parts were not covered by any tests. This is especially useful when we forget to test for certain conditions, exceptions, etc.
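If we want coverage to be computed on every pytest invocation, one option (a sketch, assuming pytest-cov is installed) is to bake the flags into the addopts entry of our pyproject.toml:

```toml
# Pytest (sketch)
[tool.pytest.ini_options]
addopts = "--strict-markers --disable-pytest-warnings --cov=tagifai --cov=app --cov-report=html"
```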


Warning

Though we have 100% coverage, this does not mean that our application is perfect. Coverage only indicates that a piece of code executed in a test, not necessarily that every part of it was tested, let alone thoroughly tested. Therefore, coverage should never be used as a representation of correctness. However, it is very useful to maintain coverage at 100% so we can know when new functionality has yet to be tested. In our CI/CD lesson, we'll see how to use GitHub Actions to make 100% coverage a requirement when pushing to specific branches.

Exclusions

Sometimes it doesn't make sense to write tests to cover every single line in our application yet we still want to account for these lines so we can maintain 100% coverage. We have two levels of purview when applying exclusions:

1. Excluding lines by adding this comment # pragma: no cover, <MESSAGE>

```python
if self.trial.should_prune():  # pragma: no cover, optuna pruning
    pass
```

2. Excluding files by specifying them in our pyproject.toml configuration.

```toml
# Pytest coverage
[tool.coverage.run]
omit = ["app/main.py"]  # sample API calls
```

The key here is that we were able to add justification to these exclusions through comments so our team can follow our reasoning.

Machine learning

Now that we have a foundation for testing traditional software, let's dive into testing our

data and models in the context of machine learning systems.

🔢   Data

We've already tested the functions that act on our data through unit and integration tests but we haven't tested the validity of the data itself. Once we define what our data should look like, we can use (and add to) these expectations as our dataset grows.
Expectations

There are many dimensions to what our data is expected to look like. We'll briefly talk about a few of them, including ones that may not directly be applicable to our task but, nonetheless, are very important to be aware of.

- rows / cols: the most basic expectation is validating the presence of samples (rows) and features (columns). These can help identify mismatches between upstream backend database schema changes, upstream UI form changes, etc.
  - presence of specific features
  - row count (exact or range) of samples
- individual values: we can also have expectations about the individual values of specific features.
  - missing values
  - type adherence (ex. feature values are all float)
  - values must be unique or from a predefined set
  - list (categorical) / range (continuous) of allowed values
  - feature value relationships with other feature values (ex. column 1 values must always be greater than column 2 values)
- aggregate values: we can also have expectations about all the values of specific features.
  - value statistics (mean, std, median, max, min, sum, etc.)
  - distribution shift by comparing current values to previous values (useful for detecting drift)

To implement these expectations, we could compose assert statements or we could leverage the open-source library called Great Expectations. It's a fantastic library that already has many of these expectations builtin (map, aggregate, multi-column, distributional, etc.) and allows us to create custom expectations as well. It also provides modules to seamlessly connect with backend data sources such as local file systems, S3, databases and even DAG runners. Let's explore the library by implementing the expectations we'll need for our application.

First we'll load the data we'd like to apply our expectations on. We can load our data from a variety of sources (filesystem, S3, DB, etc.) which we can then wrap around a Dataset module (Pandas / Spark DataFrame, SQLAlchemy).

```python
from pathlib import Path
import great_expectations as ge
import pandas as pd
from tagifai import config, utils

# Create Pandas DataFrame
projects_fp = Path(config.DATA_DIR, "projects.json")
projects_dict = utils.load_dict(filepath=projects_fp)
df = ge.dataset.PandasDataset(projects_dict)
```

|   | id   | title | description | tags |
|---|------|-------|-------------|------|
| 0 | 2438 | How to Deal with Files in Google Colab: What Y... | How to supercharge your Google Colab experienc... | [article, google-colab, colab, file-system] |
| 1 | 2437 | Rasoee | A powerful web and mobile application that ide... | [api, article, code, dataset, paper, research,... |
| 2 | 2436 | Machine Learning Methods Explained (+ Examples) | Most common techniques used in data science pr... | [article, deep-learning, machine-learning, dim... |
| 3 | 2435 | Top “Applied Data Science” Papers from ECML-PK... | Explore the innovative world of Machine Learni... | [article, deep-learning, machine-learning, adv... |
| 4 | 2434 | OpenMMLab Computer Vision | MMCV is a python library for CV research and s... | [article, code, pytorch, library, 3d, computer... |

Built-in

Once we have our data source wrapped in a Dataset module, we can compose and apply

expectations on it. There are many built-in expectations to choose from:

```python
# Presence of features
expected_columns = ["id", "title", "description", "tags"]
df.expect_table_columns_to_match_ordered_list(column_list=expected_columns)

# Unique
df.expect_column_values_to_be_unique(column="id")

# No null values
df.expect_column_values_to_not_be_null(column="title")
df.expect_column_values_to_not_be_null(column="description")
df.expect_column_values_to_not_be_null(column="tags")

# Type
df.expect_column_values_to_be_of_type(column="title", type_="str")
df.expect_column_values_to_be_of_type(column="description", type_="str")
df.expect_column_values_to_be_of_type(column="tags", type_="list")

# Data leaks
df.expect_compound_columns_to_be_unique(column_list=["title", "description"])
```

Each of these expectations will create an output with details about success or failure, expected and observed values, exceptions raised, etc. For example, the expectation df.expect_column_values_to_be_of_type(column="title", type_="str") would produce the following if successful:

```json
{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {},
  "expectation_config": {
    "kwargs": {
      "column": "title",
      "type_": "str",
      "result_format": "BASIC"
    },
    "meta": {},
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  }
}
```

and this output if it failed (notice the counts and examples for what caused the failure):

```json
{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "expectation_config": {
    "meta": {},
    "kwargs": {
      "column": "title",
      "type_": "int",
      "result_format": "BASIC"
    },
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 2032,
    "unexpected_percent": 100.0,
    "unexpected_percent_nonmissing": 100.0,
    "partial_unexpected_list": [
      "How to Deal with Files in Google Colab: What You Need to Know",
      "Machine Learning Methods Explained (+ Examples)",
      "OpenMMLab Computer Vision",
      "..."
    ]
  },
  "meta": {}
}
```

We can group all the expectations together to create an Expectation Suite object which we can use to validate any Dataset module.

```python
# Expectation suite
expectation_suite = df.get_expectation_suite()
print(df.validate(expectation_suite=expectation_suite, only_return_failures=True))
```

```json
{
  "success": true,
  "results": [],
  "statistics": {
    "evaluated_expectations": 9,
    "successful_expectations": 9,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "evaluation_parameters": {}
}
```
Custom

Our tags feature column is a list of tags for each input. The Great Expectations library doesn't come equipped to process a list feature but we can easily do so by creating a custom expectation.

1. Create a custom Dataset module to wrap our data around.
2. Define expectation functions that can map to each individual row of the feature column (map) or to the entire feature column (aggregate) by specifying the appropriate decorator.

```python
class CustomPandasDataset(ge.dataset.PandasDataset):
    _data_asset_type = "CustomPandasDataset"

    @ge.dataset.MetaPandasDataset.column_map_expectation
    def expect_column_list_values_to_be_not_null(self, column):
        return column.map(lambda x: None not in x)

    @ge.dataset.MetaPandasDataset.column_map_expectation
    def expect_column_list_values_to_be_unique(self, column):
        return column.map(lambda x: len(x) == len(set(x)))
```

3. Wrap data with the custom Dataset module and use the custom expectations.

```python
df = CustomPandasDataset(projects_dict)
df.expect_column_values_to_not_be_null(column="tags")
df.expect_column_list_values_to_be_unique(column="tags")
```

Note

There are various levels of abstraction (following a template vs. completely from scratch) available when it comes to creating a custom expectation with Great Expectations.

Projects

So far we've worked with the Great Expectations library at the Python script level but we can organize our expectations even more by creating a Project.

1. Initialize the Project using great_expectations init. This will interactively walk us through setting up data sources, naming, etc. and set up a great_expectations directory with the following structure:

```
great_expectations/
| ├── checkpoints/
| ├── expectations/
| ├── notebooks/
| ├── plugins/
| ├── uncommitted/
| ├── .gitignore
| └── great_expectations.yml
```

2. Define our custom module under the plugins directory and use it to define our data sources in our great_expectations.yml configuration file.

```yaml
datasources:
  data:
    class_name: PandasDatasource
    data_asset_type:
      module_name: custom_module.custom_dataset
      class_name: CustomPandasDataset
    module_name: great_expectations.datasource
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: ../assets/data
```

3. Create expectations using the profiler, which creates automatic expectations based on the data, or we can also create our own expectations. All of this is done interactively via a launched Jupyter notebook and saved under our great_expectations/expectations directory.

```bash
great_expectations suite scaffold SUITE_NAME   # uses profiler
great_expectations suite new --suite           # no profiler
great_expectations suite edit SUITE_NAME       # add your own custom expectations
```

When using the automatic profiler, you can choose which feature columns to apply profiling to. Since our tags feature is a list feature, we'll leave it commented and create our own expectations using the suite edit command.

4. Create Checkpoints where a Suite of Expectations is applied to a specific Data Asset. This is a great way of programmatically applying checkpoints on our existing and new data sources.

```bash
great_expectations checkpoint new CHECKPOINT_NAME SUITE_NAME
great_expectations checkpoint run CHECKPOINT_NAME
```
5. Run checkpoints on new batches of incoming data by adding them to our testing pipeline via a Makefile, a workflow orchestrator like Airflow, etc. We can also use the Great Expectations GitHub Action to automate validating our data pipeline code when we push a change. More on using these Checkpoints with pipelines in our workflows lesson.

Data docs

When we create expectations using the CLI application, Great Expectations automatically generates documentation for our tests. It also stores information about validation runs and their results. We can launch the generated data documentation with the following command: great_expectations docs build

Best practices

We've applied expectations on our source dataset but there are many other key areas to test the data as well. Throughout the ML development pipeline, we should test the intermediate outputs from processes such as cleaning, augmentation, splitting, preprocessing, tokenization, etc. We'll use these expectations to monitor new batches of data and before combining them with our existing data assets.

Note

Currently, these data processing steps are tied to our application code but in future lessons, we'll separate these into individual pipelines and use Great Expectations Checkpoints in between to apply all these expectations in an orchestrated fashion.

Pipelines with Great Expectations Checkpoints

🤖   Models

The other half of testing ML systems involves testing our models during training, evaluation, inference and deployment.

Training

We want to write tests iteratively while we're developing our training pipelines so we can catch errors quickly. This is especially important because, unlike traditional software, ML systems can run to completion without throwing any exceptions / errors but can still produce incorrect outputs. We also want to catch errors quickly to save on time and compute.

Check shapes and values of model output:

```python
assert model(inputs).shape == torch.Size([len(inputs), num_classes])
```

Check for decreasing loss after one batch of training:

```python
assert epoch_loss < prev_epoch_loss
```

Overfit on a batch:

```python
accuracy = train(model, inputs=batches[0])
assert accuracy == pytest.approx(1.0, abs=0.05)  # 1.0 ± 0.05
```

Train to completion (tests early stopping, saving, etc.):

```python
train(model)
assert learning_rate >= min_learning_rate
assert artifacts
```

On different devices:

```python
assert train(model, device=torch.device("cpu"))
assert train(model, device=torch.device("cuda"))
```

Note

You can mark the compute-intensive tests with a pytest marker and only execute them when there is a change being made to the system affecting the model.

```python
@pytest.mark.training
def test_train_model():
    ...
```

Evaluation

When it comes to testing how well our model performs, we need to first have our priorities in line.

- What metrics are important?
- What tradeoffs are we willing to make?
- Are there certain subsets (slices) of data that are important?

Overall

We want to ensure that our key metrics on the overall dataset improve with each iteration of our model. Overall metrics include accuracy, precision, recall, f1, etc. and we should define what counts as a performance regression. For example, is a higher precision at the expense of recall an improvement or a regression? Usually, a team of developers and domain experts will establish what the key metric(s) are while also specifying the lowest regression tolerance for other metrics.

```python
assert precision > prev_precision         # most important, cannot regress
assert recall >= best_prev_recall - 0.03  # recall cannot regress > 3%
```

Slicing

Just inspecting the overall metrics isn't enough to deploy our new version to production. There may be key slices of our dataset that we expect to do really well on (i.e. minority groups, large customers, etc.) and we need to ensure that their metrics are also improving. An easy way to create and evaluate slices is to define slicing functions.

```python
# tagifai/eval.py
from snorkel.slicing import slicing_function

@slicing_function()
def cv_transformers(x):
    """Projects with the `computer-vision` and `transformers` tags."""
    return all(tag in x.tags for tag in ["computer-vision", "transformers"])
```

Here we're using Snorkel's slicing_function to create our different slices. We can visualize our slices by applying this slicing function to a relevant DataFrame using slice_dataframe.

```python
from snorkel.slicing import slice_dataframe

test_df = pd.DataFrame({"text": X_test, "tags": label_encoder.decode(y_test)})
cv_transformers_df = slice_dataframe(test_df, cv_transformers)
cv_transformers_df[["text", "tags"]].head()
```

|   | id | text | tags |
|---|----|------|------|
| 0 | 10 | vedastr vedastr open source scene text recogni... | [computer-vision, natural-language-processing,... |
| 1 | 15 | hugging captions generate realistic instagram ... | [computer-vision, huggingface, language-modeli... |
| 2 | 49 | transformer ocr rectification free ocr using s... | [attention, computer-vision, natural-language-... |
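The record array example below also uses a short_text slicing function that isn't shown here; a minimal hypothetical sketch (assuming we define "short" by the number of tokens in the text) might look like:

```python
@slicing_function()
def short_text(x):
    """Projects with short text (hypothetical threshold)."""
    return len(x.text.split()) < 8  # assumed definition of "short"
```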

We can define even more slicing functions and create a slices record array using the PandasSFApplier. The slices array has N (# of data points) items and each item has S (# of slicing functions) items, indicating whether that data point is part of that slice. Think of this record array as a masking layer for each slicing function on our data.

```python
# tagifai/eval.py | get_performance()
from snorkel.slicing import PandasSFApplier

slicing_functions = [cv_transformers, short_text]
applier = PandasSFApplier(slicing_functions)
slices = applier.apply(df)
print(slices)
```

```
[(0, 0) (0, 1) (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0)
 (1, 0) (0, 0) (0, 1) (0, 0) (0, 0) (1, 0) (0, 0) (0, 0) (0, 1) (0, 0)
 ...
 (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0)
 (0, 0) (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (1, 0)]
```

Once we have our slices record array, we can compute the performance metrics for each slice.

```python
# tagifai/eval.py | get_performance()
for slice_name in slices.dtype.names:
    mask = slices[slice_name].astype(bool)
    metrics = precision_recall_fscore_support(y_true[mask], y_pred[mask], average="micro")
```
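One way (a sketch with illustrative names, not our exact implementation) to collect these per-slice metrics into the slices section of the performance report shown below:

```python
# Illustrative sketch: gather per-slice metrics into a dictionary
from sklearn.metrics import precision_recall_fscore_support

slice_metrics = {}
for slice_name in slices.dtype.names:
    mask = slices[slice_name].astype(bool)
    if sum(mask):  # skip empty slices
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[mask], y_pred[mask], average="micro"
        )
        slice_metrics[slice_name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_samples": int(sum(mask)),
        }
```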

Note

Snorkel comes with a builtin slice scorer but we had to implement a naive version since our task involves multi-label classification.

We can add these slice performance metrics to our larger performance report to analyze downstream when choosing which model to deploy.

```json
{
  "overall": {
    "precision": 0.8050380552475824,
    "recall": 0.603411513859275,
    "f1": 0.6674448998627966,
    "num_samples": 207.0
  },
  "class": {
    "attention": {
      "precision": 0.6923076923076923,
      "recall": 0.5,
      "f1": 0.5806451612903226,
      "num_samples": 18.0
    },
    ...
    "unsupervised-learning": {
      "precision": 0.8,
      "recall": 0.5,
      "f1": 0.6153846153846154,
      "num_samples": 8.0
    }
  },
  "slices": {
    "f1": 0.7395604395604396,
    "cv_transformers": {
      "precision": 1.0,
      "recall": 0.5384615384615384,
      "f1": 0.7000000000000001,
      "num_samples": 3
    },
    "short_text": {
      "precision": 0.8333333333333334,
      "recall": 0.7142857142857143,
      "f1": 0.7692307692307692,
      "num_samples": 4
    }
  }
}
```

Extensions

We've explored user-generated slices but there is currently quite a bit of research on automatically generated slices and overall model robustness. A notable toolkit is the Robustness Gym which programmatically builds slices, performs adversarial attacks, rule-based data augmentation, benchmarking, reporting and much more.

Robustness Gym slice builders

Instead of passively observing slice performance, we could try to improve it. Usually, a slice may exhibit poor performance when there are too few samples, so a natural approach is to oversample. However, these methods change the underlying data distribution and can cause issues with the overall / other slices. It's also not scalable to train a separate model for each unique slice and combine them via Mixture of Experts (MoE). To combat all of these technical challenges and more, the Snorkel team introduced Slice Residual Attention Modules (SRAMs), which can sit on any backbone architecture (i.e. our CNN feature extractor) and learn slice-aware representations for the class predictions.
Slice Residual Attention Modules (SRAMs)

Inference

When our model is deployed, most users will be using it for inference (directly or indirectly), so it's very important that we test all aspects of it.

Loading artifacts

This is the first time we're not loading our components from in-memory so we want to ensure that the required artifacts (model weights, encoders, config, etc.) are all able to be loaded.

```python
artifacts = main.load_artifacts(run_id=run_id, device=torch.device("cpu"))
assert isinstance(artifacts["model"], nn.Module)
...
```

Prediction

Once we have our artifacts loaded, we're ready to test our prediction pipelines. We should test samples with just one input, as well as a batch of inputs (ex. padding can have unintended consequences sometimes).

```python
# tests/app/test_api.py | test_best_predict()
data = {
    "run_id": "",
    "texts": [
        {"text": "Transfer learning with transformers for self-supervised learning."},
        {"text": "Generative adversarial networks in both PyTorch and TensorFlow."},
    ],
}
response = client.post("/predict", json=data)
assert response.json()["status-code"] == HTTPStatus.OK
assert response.json()["method"] == "POST"
assert len(response.json()["data"]["predictions"]) == len(data["texts"])
...
```

Behavioral testing

Besides just testing if the prediction pipelines work, we also want to ensure that they work well. Behavioral testing is the process of testing input data and expected outputs while treating the model as a black box. The tests don't necessarily have to be adversarial in nature but are more along the lines of the perturbations we'll see in the real world once our model is deployed. A landmark paper on this topic is Beyond Accuracy: Behavioral Testing of NLP Models with CheckList which breaks down behavioral testing into three types of tests:

- invariance: Changes should not affect outputs.

```python
# INVariance via verb injection (changes should not affect outputs)
tokens = ["revolutionized", "disrupted"]
tags = [["transformers"], ["transformers"]]
texts = [f"Transformers have {token} the ML field." for token in tokens]
compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="INV")
```

- directional: Changes should affect outputs.

```python
# DIRectional expectations (changes with known outputs)
tokens = ["PyTorch", "Huggingface"]
tags = [
    ["pytorch", "transformers"],
    ["huggingface", "transformers"],
]
texts = [f"A {token} implementation of transformers." for token in tokens]
compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="DIR")
```

- minimum functionality: Simple combination of inputs and expected outputs.

```python
# Minimum Functionality Tests (simple input/output pairs)
tokens = ["transformers", "graph neural networks"]
tags = [["transformers"], ["graph-neural-networks"]]
texts = [f"{token} have revolutionized machine learning." for token in tokens]
compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="MFT")
```
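The compare_tags helper above lives in our application code; a minimal hypothetical sketch of the idea (assuming a predict_tags helper that returns the predicted tags for a text) might look like:

```python
def compare_tags(texts, tags, artifacts, test_type):
    """Check that the expected tags appear in the predictions for each text (sketch)."""
    results = {"passed": [], "failed": []}
    for text, expected_tags in zip(texts, tags):
        predicted_tags = predict_tags(text=text, artifacts=artifacts)  # assumed helper
        result = {"text": text, "expected_tags": expected_tags,
                  "predicted_tags": predicted_tags, "type": test_type}
        # A test passes if every expected tag shows up in the predictions
        if all(tag in predicted_tags for tag in expected_tags):
            results["passed"].append(result)
        else:
            results["failed"].append(result)
    return results
```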
Note

Be sure to explore the NLP Checklist package which simplifies and augments the creation of these behavioral tests via functions, templates, pretrained models and interactive GUIs in Jupyter notebooks.

NLP Checklist

We combine all of these behavioral tests to create a behavioral report (tagifai.eval.get_behavioral_report()) which quantifies how many of these tests are passed by a particular instance of a trained model. This report is then saved along with the run's artifacts so we can use this information when choosing which model(s) to deploy to production.

```json
{
  "score": 1.0,
  "results": {
    "passed": [
      {
        "input": {
          "text": "Transformers have revolutionized the ML field.",
          "tags": [
            "transformers"
          ]
        },
        "prediction": {
          "input_text": "Transformers have revolutionized the ML field.",
          "preprocessed_text": "transformers revolutionized ml field",
          "predicted_tags": [
            "natural-language-processing",
            "transformers"
          ]
        },
        "type": "INV"
      },
      ...
      {
        "input": {
          "text": "graph neural networks have revolutionized machine learning.",
          "tags": [
            "graph-neural-networks"
          ]
        },
        "prediction": {
          "input_text": "graph neural networks have revolutionized machine learning.",
          "preprocessed_text": "graph neural networks revolutionized machine learning",
          "predicted_tags": [
            "graph-neural-networks",
            "graphs"
          ]
        },
        "type": "MFT"
      }
    ],
    "failed": []
  }
}
```

Warning

When you create additional behavioral tests, be sure to reevaluate all the models you're considering on the new set of tests so their scores can be compared. We can do this since behavioral tests are not dependent on data or model versions and are simply tests that treat the model as a black box.

```bash
tagifai behavioral-reevaluation --experiment-name=best --all-runs  # update all runs in experiment
tagifai behavioral-reevaluation --run-id=0deb534                   # update specific run
```
Sorted runs

We can combine our overall / slice metrics and our behavioral tests to create a holistic

evaluation report for each model run. We can then use this information to choose which

model(s) to deploy to production.

Deployment

There is also a whole class of model tests that go beyond metrics or behavioral testing and focus on the system as a whole. Many of them involve testing and benchmarking the tradeoffs (ex. latency, compute, etc.) we discussed in the baselines lesson. These tests also need to be performed across the different systems (ex. devices) that our model may run on. For example, development may happen on a CPU but the deployed model may be loaded on a GPU and there may be incompatible components (ex. reparametrization) that may cause errors. As a rule of thumb, we should test with the system specifications that our production environment utilizes.

Note

We'll automate tests on different devices in our CI/CD lesson where we'll use GitHub Actions to spin up our application with Docker Machine on cloud compute instances (we'll also use this for training).

Once we've tested our model's ability to perform in the production environment (offline tests), we can run several types of online tests to determine the quality of that performance.

- AB tests:
  - sending production traffic to the different systems.
  - involves statistical hypothesis testing to decide which system is better.
  - need to account for different sources of bias (ex. novelty effect).
  - multi-armed bandits might be better if optimizing on a certain metric.
- Shadow tests:
  - sending the same production traffic to the different systems.
  - safe online evaluation as the new system's results are not served.
  - easy to monitor, validate operational consistency, etc.

Testing vs. monitoring

We'll conclude by talking about the similarities and distinctions between testing and monitoring. They're both integral parts of the ML development pipeline and depend on each other for iteration. Testing is assuring that our system (code, data and models) behaves the way we intend at the current time t0. Whereas, monitoring is ensuring that the conditions (i.e. distributions) during development are maintained and also that the tests that passed at t0 continue to hold true post deployment through tn. When this is no longer true, we need to inspect more closely (retraining may not always fix our root problem).

With monitoring, there are quite a few distinct concerns that we didn't have to consider during testing since it involves (live) data we have yet to see.

- features and prediction distributions (drift), typing, schema mismatches, etc.
- determining model performance (rolling and window metrics on overall and slices of data) using indirect signals (since labels may not be readily available).
- in situations with large data, we need to know which data points to label and upsample for training.
- identifying anomalies and outliers.

We'll cover all of these concepts in much more depth (and code) in our monitoring lesson.

Resources

- Great Expectations
- The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
- A Recipe for Training Neural Networks
- Effective testing for machine learning systems
- Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices
- Robustness Gym: Unifying the NLP Evaluation Landscape

To cite this lesson, please use:

```bibtex
@article{madewithml,
    title  = "Testing - Made With ML",
    author = "Goku Mohandas",
    url    = "https://madewithml.com/courses/applied-ml/testing/",
    year   = "2021",
}
```
