
Testing Machine Learning Systems: Code, Data and Models

Goku Mohandas


Testing code, data and models to ensure consistent behavior in ML systems.

Repository


Intuition

Tests are a way for us to ensure that something works as intended. We're incentivized to implement tests and discover sources of error as early in the development cycle as possible so that we can reduce downstream costs and wasted time. Once we've designed our tests, we can automatically execute them every time we change our system, and continue to build on them.

Types of tests

There are several major types of tests that are used at different points in the development cycle:

1. Unit tests: tests on individual components that each have a single responsibility (ex. function that filters a list).
2. Integration tests: tests on the combined functionality of individual components (ex. data processing).
3. System tests: tests on the design of a system for expected outputs given inputs (ex. training, inference, etc.).
4. Acceptance tests: tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT).
5. Regression tests: tests on errors we've seen before to ensure new changes don't reintroduce them.

Note

There are many other types of functional and non-functional tests as well, such as smoke tests (quick health checks), performance tests (load, stress), security tests, etc. but we can generalize these under the system tests above.

How should we test?

The framework to use when composing tests is the Arrange Act Assert methodology.

- Arrange: set up the different inputs to test on.
- Act: apply the inputs on the component we want to test.
- Assert: confirm that we received the expected output.
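For example, a minimal sketch of a test that makes each step explicit (the decode helper here is hypothetical, not part of our application):

```python
def decode(indices, index_to_class):
    """Hypothetical helper that maps indices back to class names."""
    return [index_to_class[i] for i in indices]


def test_decode():
    # Arrange: set up the inputs and the expected output
    indices = [2, 0, 1]
    index_to_class = {0: "attention", 1: "mlops", 2: "transformers"}
    expected = ["transformers", "attention", "mlops"]

    # Act: apply the inputs on the component we want to test
    classes = decode(indices=indices, index_to_class=index_to_class)

    # Assert: confirm that we received the expected output
    assert classes == expected
```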

Tip

Cleaning is an unofficial fourth step to this methodology because it's important to not leave remnants of a previous state which may affect subsequent tests. We can use packages such as pytest-randomly to test against state dependency by executing tests randomly.

In Python, there are many tools, such as unittest, pytest, etc., that allow us to easily implement our tests while adhering to the Arrange Act Assert framework above. These tools come with powerful built-in functionality such as parametrization, filters, and more, to test many conditions at scale.

Note

When arranging our inputs and asserting our expected outputs, it's important to test across the entire gamut of inputs and outputs:

- inputs: data types, format, length, edge cases (min/max, small/large, etc.)
- outputs: data types, formats, exceptions, intermediary and final outputs

Best practices

Regardless of the framework we use, it's important to strongly tie testing into the development process.

- atomic: when creating unit components, we need to ensure that they have a single responsibility so that we can easily test them. If not, we'll need to split them into more granular units.
- compose: when we create new components, we want to compose tests to validate their functionality. It's a great way to ensure reliability and catch errors early on.
- regression: we want to account for new errors we come across with a regression test so we can ensure we don't reintroduce the same errors in the future.
- coverage: we want to ensure that 100% of our codebase has been accounted for. This doesn't mean writing a test for every single line of code but rather accounting for every single line (more on this in the coverage section below).
- automate: in the event we forget to run our tests before committing to a repository, we want to auto run tests for every commit. We'll learn how to do this locally using pre-commit and remotely (i.e. main branch) via GitHub Actions in subsequent lessons (a quick sketch is shown below).
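For now, here is a rough sketch of what a local hook in a .pre-commit-config.yaml file could look like to run the test suite on every commit; treat it as an illustration rather than our exact setup, which we'll build properly in the pre-commit lesson:

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: local
    hooks:
      - id: test
        name: run tests
        entry: pytest
        language: system
        pass_filenames: false
```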

Test-driven development

Test-driven development (TDD) is the process where you write a test before completely writing the functionality to ensure that tests are always written. This is in contrast to writing functionality first and then composing tests afterwards. Here are my thoughts on this:

- it's good to write tests as we progress, but they're not a guarantee of correctness.
- initial time should be spent on design before ever getting into the code or tests.
- using a test as a guide doesn't mean that our functionality is error free.

Perfect coverage doesn't mean that our application is error free if those tests aren't meaningful and don't encompass the field of possible inputs, intermediates and outputs. Therefore, we should work towards better design and agility when facing errors, quickly resolving them and writing test cases around them to avoid them next time.

Warning

This topic is still highly debated and I'm only reflecting on my experience and what's worked well for me at a large company (Apple), a very early stage startup and running a company of my own. What's most important is that the team is producing reliable systems that can be tested and improved upon.

Application

In our application, we'll be testing the code, data and models. Be sure to look inside each of the different testing scripts after reading through the components below.

```
great_expectations/        # data tests
| ├── expectations/
| | ├── projects.json
| | └── tags.json
| ├── ...
tagifai/
| ├── eval.py              # model tests
tests/                     # code tests
├── app/
| ├── test_api.py
| └── test_cli.py
└── tagifai/
| ├── test_config.py
| ├── test_data.py
| ├── test_eval.py
| ├── test_models.py
| ├── test_train.py
| └── test_utils.py
```

Note

Alternatively, we could've organized our tests by types of tests as well (unit, integration, etc.) but I find it more intuitive for navigation to organize them by how our application is set up. We'll learn about markers below which will allow us to run any subset of tests by specifying filters.

🧪  Pytest

We're going to be using pytest as our testing framework for its powerful built-in features such as parametrization, fixtures, markers, etc.

Configuration

Pytest expects tests to be organized under a tests directory by default. However, we can also use our pyproject.toml file to configure any other test path directories as well. Once in the directory, pytest looks for python scripts starting with test_*.py but we can configure it to read any other file patterns as well.

```toml
# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
```

Assertions

Let's see what a sample test and its results look like. Assume we have a simple function that determines whether a fruit is crisp or not (notice: single responsibility):

```python
# food/fruits.py
def is_crisp(fruit):
    if fruit:
        fruit = fruit.lower()
        if fruit in ["apple", "watermelon", "cherries"]:
            return True
        elif fruit in ["orange", "mango", "strawberry"]:
            return False
        else:
            raise ValueError(f"{fruit} not in known list of fruits.")
    return False
```

To test this function, we can use assert statements to map inputs with expected outputs:

```python
# tests/food/test_fruits.py
def test_is_crisp():
    assert is_crisp(fruit="apple")  # or == True
    assert is_crisp(fruit="Apple")
    assert not is_crisp(fruit="orange")
    with pytest.raises(ValueError):
        is_crisp(fruit=None)
        is_crisp(fruit="pear")
```

Note

We can also have assertions about exceptions, like we do in the with block above, where all the operations under the with statement are expected to raise the specified exception.

Execution

We can execute our tests above using several different levels of granularity:

```bash
pytest                                           # all tests
pytest tests/food                                # tests under a directory
pytest tests/food/test_fruits.py                 # tests for a single file
pytest tests/food/test_fruits.py::test_is_crisp  # tests for a single function
```

Running our specific test above would produce the following output:

```
tests/food/test_fruits.py::test_is_crisp PASSED      [100%]
```

Had any of our assertions in this test failed, we would see the failed assertions as well as the expected output and the output we received from our function.
Note

It's important to test for the variety of inputs and expected outputs that we outlined above and to never assume that a test is trivial. In our example above, it's important that we test for both "apple" and "Apple" in the event that our function didn't account for casing!

Classes

We can also test classes and their respective functions by creating test classes. Within our test class, we can optionally define functions which will automatically be executed when we set up or tear down a class instance or use a class method.

- setup_class: set up the state for any class instance.
- teardown_class: tear down the state created in setup_class.
- setup_method: called before every method to set up any state.
- teardown_method: called after every method to tear down any state.

```python
class Fruit(object):
    def __init__(self, name):
        self.name = name


class TestFruit(object):
    @classmethod
    def setup_class(cls):
        """Set up the state for any class instance."""
        pass

    @classmethod
    def teardown_class(cls):
        """Teardown the state created in setup_class."""
        pass

    def setup_method(self):
        """Called before every method to setup any state."""
        self.fruit = Fruit(name="apple")

    def teardown_method(self):
        """Called after every method to teardown any state."""
        del self.fruit

    def test_init(self):
        assert self.fruit.name == "apple"
```

We can execute all the tests for our class by specifying the class name:

```bash
pytest tests/food/test_fruits.py::TestFruit
```

```
tests/food/test_fruits.py::TestFruit .                [100%]
```

We use test classes to test all of our class modules such as LabelEncoder, Tokenizer, CNN, etc.
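For example, a sketch of what a test class for a LabelEncoder-style component might look like (the fit/encode/decode API shown here is assumed for illustration and may differ from our actual implementation):

```python
class TestLabelEncoder:
    def setup_method(self):
        """Create a fresh encoder before every test method."""
        self.label_encoder = LabelEncoder()  # assumed application class

    def teardown_method(self):
        """Remove the encoder after every test method."""
        del self.label_encoder

    def test_fit(self):
        self.label_encoder.fit(["mlops", "nlp", "mlops"])
        assert len(self.label_encoder.classes) == 2  # assumed attribute

    def test_encode_decode(self):
        self.label_encoder.fit(["mlops", "nlp"])
        encoded = self.label_encoder.encode(["nlp"])
        assert self.label_encoder.decode(encoded) == ["nlp"]
```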

Parametrize

So far, in our tests, we've had to create individual assert statements to validate different combinations of inputs and expected outputs. However, there's a bit of redundancy here because the inputs always feed into our functions as arguments and the outputs are compared with our expected outputs. To remove this redundancy, pytest has the @pytest.mark.parametrize decorator which allows us to represent our inputs and outputs as parameters.

```python
 1 @pytest.mark.parametrize(
 2     "fruit, crisp",
 3     [
 4         ("apple", True),
 5         ("Apple", True),
 6         ("orange", False),
 7     ],
 8 )
 9 def test_is_crisp_parametrize(fruit, crisp):
10     assert is_crisp(fruit=fruit) == crisp
```

```
pytest tests/food/test_is_crisp_parametrize.py ...   [100%]
```

1. [Line 2]: define the names of the parameters under the decorator, ex. "fruit, crisp" (note that this is one string).
2. [Lines 3-7]: provide a list of combinations of values for the parameters from Step 1.
3. [Line 9]: pass in parameter names to the test function.
4. [Line 10]: include necessary assert statements which will be executed for each of the combinations in the list from Step 2.

In our application, we use parametrization to test components that require varied sets of inputs and expected outputs such as preprocessing, filtering, etc.
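A sketch of what that might look like for a text preprocessing step (the preprocess function here is a minimal stand-in, not our application's exact implementation):

```python
import pytest


def preprocess(text, lower=True):
    """Minimal stand-in for a text preprocessing function."""
    return text.lower() if lower else text


@pytest.mark.parametrize(
    "text, lower, expected",
    [
        ("Transfer learning with BERT", True, "transfer learning with bert"),
        ("Transfer learning with BERT", False, "Transfer learning with BERT"),
    ],
)
def test_preprocess(text, lower, expected):
    assert preprocess(text=text, lower=lower) == expected
```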


Note

We could pass in an exception as the expected result as well:

```python
@pytest.mark.parametrize(
    "fruit, exception",
    [
        ("pear", ValueError),
    ],
)
def test_is_crisp_exceptions(fruit, exception):
    with pytest.raises(exception):
        is_crisp(fruit=fruit)
```

Fixtures

Parametrization allows us to efficiently reduce redundancy inside test functions but what about the inputs themselves? Here, we can use pytest's builtin fixture, which is a function that is executed before the test function. This significantly reduces redundancy when multiple test functions require the same inputs.

```python
@pytest.fixture
def my_fruit():
    fruit = Fruit(name="apple")
    return fruit


def test_fruit(my_fruit):
    assert my_fruit.name == "apple"
```

We can apply fixtures to classes as well, where the fixture function will be invoked when any method in the class is called.

```python
@pytest.mark.usefixtures("my_fruit")
class TestFruit:
    ...
```

We use fixtures to efficiently pass a set of inputs (ex. Pandas DataFrame) to different testing functions that require them (cleaning, splitting, etc.).

```python
@pytest.fixture
def df():
    projects_fp = Path(config.DATA_DIR, "projects.json")
    projects_dict = utils.load_dict(filepath=projects_fp)
    df = pd.DataFrame(projects_dict)
    return df


def test_split(df):
    splits = split_data(df=df)
    ...
```

Note

Typically, when we have too many fixtures in a particular test file, we can organize them all in a fixtures.py script and invoke them as needed.
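Another common pytest convention (shown here as a sketch, not necessarily how our repository is laid out) is to place shared fixtures in a conftest.py file, which pytest discovers automatically and makes available to every test in that directory:

```python
# tests/conftest.py (sketch)
import pandas as pd
import pytest


@pytest.fixture
def df():
    """Shared DataFrame fixture available to all tests under tests/ (illustrative data)."""
    return pd.DataFrame({"title": ["Rasoee"], "description": ["A web app"], "tags": [["api"]]})
```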

Markers

We've been able to execute our tests at various levels of granularity (all tests, script, function, etc.) but we can create custom granularity by using markers. We've already used one type of marker (parametrize) but there are several other builtin markers as well. For example, the skipif marker allows us to skip execution of a test if a condition is met.

```python
@pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="Full training tests require a GPU."
)
def test_training():
    pass
```

We can also create our own custom markers with the exception of a few reserved marker names.

```python
@pytest.mark.fruits
def test_fruit(my_fruit):
    assert my_fruit.name == "apple"
```

We can execute them by using the -m flag which requires a (case-sensitive) marker expression like below:

```bash
pytest -m "fruits"      # runs all tests marked with `fruits`
pytest -m "not fruits"  # runs all tests besides those marked with `fruits`
```

The proper way to use markers is to explicitly list the ones we've created in our pyproject.toml file. Here we can specify that all markers must be defined in this file with the --strict-markers flag and then declare our markers (with some info about them) in our markers list:

```toml
# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
addopts = "--strict-markers --disable-pytest-warnings"
markers = [
    "training: tests that involve training",
]
```

Once we do this, we can view all of our existing markers by executing pytest --markers and we'll also receive an error when we're trying to use a new marker that's not defined here.

We use custom markers to label which of our test functions involve training so we can separate long-running tests from everything else.

```python
@pytest.mark.training
def test_train_model():
    experiment_name = "test_experiment"
    run_name = "test_run"
    result = runner.invoke()
    ...
```

Note

Another way to run custom tests is to use the -k flag when running pytest. The -k expression is much less strict than the marker expression since we can define expressions based on test names.
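For example, these are standard pytest invocations that select tests by name rather than by marker:

```bash
pytest -k "fruit"                      # runs all tests whose names contain "fruit"
pytest -k "crisp and not parametrize"  # supports boolean expressions over test names
```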

Coverage

As we're developing tests for our application's components, it's important to know how well we're covering our code base and to know if we've missed anything. We can use the Coverage library to track and visualize how much of our codebase our tests account for. With pytest, it's even easier to use this package thanks to the pytest-cov plugin.

```bash
pytest --cov tagifai --cov app --cov-report html
```

Here we're asking for coverage for all the code in our tagifai and app directories and to generate the report in HTML format. When we run this, we'll see the tests from our tests directory executing while the coverage plugin keeps track of which lines in our application are being executed. Once our tests are complete, we can view the generated report (default is htmlcov/index.html) and click on individual files to see which parts were not covered by any tests. This is especially useful when we forget to test for certain conditions, exceptions, etc.
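If we want coverage to be computed on every pytest invocation, one option (a sketch, assuming pytest-cov is installed) is to bake the flags into the addopts entry of our pyproject.toml:

```toml
# Pytest (sketch)
[tool.pytest.ini_options]
addopts = "--strict-markers --disable-pytest-warnings --cov=tagifai --cov=app --cov-report=html"
```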


Warning

Though we have 100% coverage, this does not mean that our application is perfect. Coverage only indicates that a piece of code executed in a test, not necessarily that every part of it was tested, let alone thoroughly tested. Therefore, coverage should never be used as a representation of correctness. However, it is very useful to maintain coverage at 100% so we can know when new functionality has yet to be tested. In our CI/CD lesson, we'll see how to use GitHub Actions to make 100% coverage a requirement when pushing to specific branches.

Exclusions

Sometimes it doesn't make sense to write tests to cover every single line in our application yet we still want to account for these lines so we can maintain 100% coverage. We have two levels of purview when applying exclusions:

1. Excluding lines by adding this comment # pragma: no cover, <MESSAGE>

```python
if self.trial.should_prune():  # pragma: no cover, optuna pruning
    pass
```

2. Excluding files by specifying them in our pyproject.toml configuration.

```toml
# Pytest coverage
[tool.coverage.run]
omit = ["app/main.py"]  # sample API calls
```

The key here is that we were able to add justification to these exclusions through comments so our team can follow our reasoning.

Machine learning

Now that we have a foundation for testing traditional software, let's dive into testing our

data and models in the context of machine learning systems.

🔢   Data

We've already tested the functions that act on our data through unit and integration tests but we haven't tested the validity of the data itself. Once we define what our data should look like, we can use (and add to) these expectations as our dataset grows.
Expectations

There are many dimensions to what our data is expected to look like. We'll briefly talk about a few of them, including ones that may not directly be applicable to our task but, nonetheless, are very important to be aware of.

- rows / cols: the most basic expectation is validating the presence of samples (rows) and features (columns). These can help identify mismatches between upstream backend database schema changes, upstream UI form changes, etc.
  - presence of specific features
  - row count (exact or range) of samples
- individual values: we can also have expectations about the individual values of specific features.
  - missing values
  - type adherence (ex. feature values are all float)
  - values must be unique or from a predefined set
  - list (categorical) / range (continuous) of allowed values
  - feature value relationships with other feature values (ex. column 1 values must always be greater than column 2 values)
- aggregate values: we can also have expectations about all the values of specific features.
  - value statistics (mean, std, median, max, min, sum, etc.)
  - distribution shift by comparing current values to previous values (useful for detecting drift)

To implement these expectations, we could compose assert statements or we could leverage the open-source library called Great Expectations. It's a fantastic library that already has many of these expectations builtin (map, aggregate, multi-column, distributional, etc.) and allows us to create custom expectations as well. It also provides modules to seamlessly connect with backend data sources such as local file systems, S3, databases and even DAG runners. Let's explore the library by implementing the expectations we'll need for our application.

First we'll load the data we'd like to apply our expectations on. We can load our data from a variety of sources (filesystem, S3, DB, etc.) which we can then wrap around a Dataset module (Pandas / Spark DataFrame, SQLAlchemy).

```python
from pathlib import Path
import great_expectations as ge
import pandas as pd
from tagifai import config, utils

# Create Pandas DataFrame
projects_fp = Path(config.DATA_DIR, "projects.json")
projects_dict = utils.load_dict(filepath=projects_fp)
df = ge.dataset.PandasDataset(projects_dict)
```

|   | id   | title | description | tags |
|---|------|-------|-------------|------|
| 0 | 2438 | How to Deal with Files in Google Colab: What Y... | How to supercharge your Google Colab experienc... | [article, google-colab, colab, file-system] |
| 1 | 2437 | Rasoee | A powerful web and mobile application that ide... | [api, article, code, dataset, paper, research,... |
| 2 | 2436 | Machine Learning Methods Explained (+ Examples) | Most common techniques used in data science pr... | [article, deep-learning, machine-learning, dim... |
| 3 | 2435 | Top “Applied Data Science” Papers from ECML-PK... | Explore the innovative world of Machine Learni... | [article, deep-learning, machine-learning, adv... |
| 4 | 2434 | OpenMMLab Computer Vision | MMCV is a python library for CV research and s... | [article, code, pytorch, library, 3d, computer... |

Built-in

Once we have our data source wrapped in a Dataset module, we can compose and apply

expectations on it. There are many built-in expectations to choose from:

```python
# Presence of features
expected_columns = ["id", "title", "description", "tags"]
df.expect_table_columns_to_match_ordered_list(column_list=expected_columns)

# Unique
df.expect_column_values_to_be_unique(column="id")

# No null values
df.expect_column_values_to_not_be_null(column="title")
df.expect_column_values_to_not_be_null(column="description")
df.expect_column_values_to_not_be_null(column="tags")

# Type
df.expect_column_values_to_be_of_type(column="title", type_="str")
df.expect_column_values_to_be_of_type(column="description", type_="str")
df.expect_column_values_to_be_of_type(column="tags", type_="list")

# Data leaks
df.expect_compound_columns_to_be_unique(column_list=["title", "description"])
```

Each of these expectations will create an output with details about success or failure, expected and observed values, exceptions raised, etc. For example, the expectation df.expect_column_values_to_be_of_type(column="title", type_="str") would produce the following if successful:

```json
{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {},
  "expectation_config": {
    "kwargs": {
      "column": "title",
      "type_": "str",
      "result_format": "BASIC"
    },
    "meta": {},
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  }
}
```

and this output if it failed (notice the counts and examples for what caused the failure):

```json
{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "expectation_config": {
    "meta": {},
    "kwargs": {
      "column": "title",
      "type_": "int",
      "result_format": "BASIC"
    },
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 2032,
    "unexpected_percent": 100.0,
    "unexpected_percent_nonmissing": 100.0,
    "partial_unexpected_list": [
      "How to Deal with Files in Google Colab: What You Need to Know",
      "Machine Learning Methods Explained (+ Examples)",
      "OpenMMLab Computer Vision",
      "..."
    ]
  },
  "meta": {}
}
```

We can group all the expectations together to create an Expectation Suite object which we can use to validate any Dataset module.

```python
# Expectation suite
expectation_suite = df.get_expectation_suite()
print(df.validate(expectation_suite=expectation_suite, only_return_failures=True))
```

```json
{
  "success": true,
  "results": [],
  "statistics": {
    "evaluated_expectations": 9,
    "successful_expectations": 9,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "evaluation_parameters": {}
}
```
Custom

Our tags feature column is a list of tags for each input. The Great Expectations library doesn't come equipped to process a list feature but we can easily do so by creating a custom expectation.

1. Create a custom Dataset module to wrap our data around.
2. Define expectation functions that can map to each individual row of the feature column (map) or to the entire feature column (aggregate) by specifying the appropriate decorator.

```python
class CustomPandasDataset(ge.dataset.PandasDataset):
    _data_asset_type = "CustomPandasDataset"

    @ge.dataset.MetaPandasDataset.column_map_expectation
    def expect_column_list_values_to_be_not_null(self, column):
        return column.map(lambda x: None not in x)

    @ge.dataset.MetaPandasDataset.column_map_expectation
    def expect_column_list_values_to_be_unique(self, column):
        return column.map(lambda x: len(x) == len(set(x)))
```

3. Wrap data with the custom Dataset module and use the custom expectations.

```python
df = CustomPandasDataset(projects_dict)
df.expect_column_values_to_not_be_null(column="tags")
df.expect_column_list_values_to_be_unique(column="tags")
```

Note

There are various levels of abstraction (following a template vs. completely from scratch) available when it comes to creating a custom expectation with Great Expectations.

Projects

So far we've worked with the Great Expectations library at the Python script level but we can organize our expectations even more by creating a Project.

1. Initialize the Project using great_expectations init. This will interactively walk us through setting up data sources, naming, etc. and set up a great_expectations directory with the following structure:

```
great_expectations/
| ├── checkpoints/
| ├── expectations/
| ├── notebooks/
| ├── plugins/
| ├── uncommitted/
| ├── .gitignore
| └── great_expectations.yml
```

2. Define our custom module under the plugins directory and use it to define our data sources in our great_expectations.yml configuration file.

```yaml
datasources:
  data:
    class_name: PandasDatasource
    data_asset_type:
      module_name: custom_module.custom_dataset
      class_name: CustomPandasDataset
    module_name: great_expectations.datasource
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: ../assets/data
```

3. Create expectations using the profiler, which creates automatic expectations based on the data, or we can also create our own expectations. All of this is done interactively via a launched Jupyter notebook and saved under our great_expectations/expectations directory.

```bash
great_expectations suite scaffold SUITE_NAME   # uses profiler
great_expectations suite new --suite           # no profiler
great_expectations suite edit SUITE_NAME       # add your own custom expectations
```

When using the automatic profiler, you can choose which feature columns to apply profiling to. Since our tags feature is a list feature, we'll leave it commented and create our own expectations using the suite edit command.

4. Create Checkpoints where a Suite of Expectations is applied to a specific Data Asset. This is a great way of programmatically applying checkpoints on our existing and new data sources.

```bash
great_expectations checkpoint new CHECKPOINT_NAME SUITE_NAME
great_expectations checkpoint run CHECKPOINT_NAME
```
5. Run checkpoints on new batches of incoming data by adding them to our testing pipeline via a Makefile, a workflow orchestrator like Airflow, etc. We can also use the Great Expectations GitHub Action to automate validating our data pipeline code when we push a change. More on using these Checkpoints with pipelines in our workflows lesson.

Data docs

When we create expectations using the CLI application, Great Expectations automatically generates documentation for our tests. It also stores information about validation runs and their results. We can launch the generated data documentation with the following command: great_expectations docs build

Best practices

We've applied expectations on our source dataset but there are many other key areas to test the data as well. Throughout the ML development pipeline, we should test the intermediate outputs from processes such as cleaning, augmentation, splitting, preprocessing, tokenization, etc. We'll use these expectations to monitor new batches of data and before combining them with our existing data assets.

Note

Currently, these data processing steps are tied to our application code but in future lessons, we'll separate these into individual pipelines and use Great Expectations Checkpoints in between to apply all these expectations in an orchestrated fashion.

Pipelines with Great Expectations Checkpoints

🤖   Models

The other half of testing ML systems involves testing our models during training, evaluation, inference and deployment.

Training

We want to write tests iteratively while we're developing our training pipelines so we can catch errors quickly. This is especially important because, unlike traditional software, ML systems can run to completion without throwing any exceptions / errors but can still produce incorrect outputs. We also want to catch errors quickly to save on time and compute.

Check shapes and values of model output:

```python
assert model(inputs).shape == torch.Size([len(inputs), num_classes])
```

Check for decreasing loss after one batch of training:

```python
assert epoch_loss < prev_epoch_loss
```

Overfit on a batch:

```python
accuracy = train(model, inputs=batches[0])
assert accuracy == pytest.approx(1.0, abs=0.05)  # 1.0 ± 0.05
```

Train to completion (tests early stopping, saving, etc.):

```python
train(model)
assert learning_rate >= min_learning_rate
assert artifacts
```

On different devices:

```python
assert train(model, device=torch.device("cpu"))
assert train(model, device=torch.device("cuda"))
```

Note

You can mark the compute-intensive tests with a pytest marker and only execute them when there is a change being made to the system affecting the model.

```python
@pytest.mark.training
def test_train_model():
    ...
```

Evaluation

When it comes to testing how well our model performs, we need to first have our priorities in line.

- What metrics are important?
- What tradeoffs are we willing to make?
- Are there certain subsets (slices) of data that are important?

Overall

We want to ensure that our key metrics on the overall dataset improve with each iteration of our model. Overall metrics include accuracy, precision, recall, f1, etc. and we should define what counts as a performance regression. For example, is a higher precision at the expense of recall an improvement or a regression? Usually, a team of developers and domain experts will establish what the key metric(s) are while also specifying the lowest regression tolerance for other metrics.

```python
assert precision > prev_precision         # most important, cannot regress
assert recall >= best_prev_recall - 0.03  # recall cannot regress > 3%
```

Slicing

Just inspecting the overall metrics isn't enough to deploy our new version to production. There may be key slices of our dataset that we expect to do really well on (i.e. minority groups, large customers, etc.) and we need to ensure that their metrics are also improving. An easy way to create and evaluate slices is to define slicing functions.

```python
# tagifai/eval.py
from snorkel.slicing import slicing_function

@slicing_function()
def cv_transformers(x):
    """Projects with the `computer-vision` and `transformers` tags."""
    return all(tag in x.tags for tag in ["computer-vision", "transformers"])
```

Here we're using Snorkel's slicing_function to create our different slices. We can visualize our slices by applying this slicing function to a relevant DataFrame using slice_dataframe.

```python
from snorkel.slicing import slice_dataframe

test_df = pd.DataFrame({"text": X_test, "tags": label_encoder.decode(y_test)})
cv_transformers_df = slice_dataframe(test_df, cv_transformers)
cv_transformers_df[["text", "tags"]].head()
```

|   | id | text | tags |
|---|----|------|------|
| 0 | 10 | vedastr vedastr open source scene text recogni... | [computer-vision, natural-language-processing,... |
| 1 | 15 | hugging captions generate realistic instagram ... | [computer-vision, huggingface, language-modeli... |
| 2 | 49 | transformer ocr rectification free ocr using s... | [attention, computer-vision, natural-language-... |
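The record array example below also uses a short_text slicing function that isn't shown here; a minimal hypothetical sketch (assuming we define "short" by the number of tokens in the text) might look like:

```python
@slicing_function()
def short_text(x):
    """Projects with short text (hypothetical threshold)."""
    return len(x.text.split()) < 8  # assumed definition of "short"
```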

We can define even more slicing functions and create a slices record array using the PandasSFApplier. The slices array has N (# of data points) items and each item has S (# of slicing functions) items, indicating whether that data point is part of that slice. Think of this record array as a masking layer for each slicing function on our data.

```python
# tagifai/eval.py | get_performance()
from snorkel.slicing import PandasSFApplier

slicing_functions = [cv_transformers, short_text]
applier = PandasSFApplier(slicing_functions)
slices = applier.apply(df)
print(slices)
```

```
[(0, 0) (0, 1) (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0)
 (1, 0) (0, 0) (0, 1) (0, 0) (0, 0) (1, 0) (0, 0) (0, 0) (0, 1) (0, 0)
 ...
 (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0)
 (0, 0) (0, 0) (1, 0) (0, 0) (0, 0) (0, 0) (1, 0)]
```

Once we have our slices record array, we can compute the performance metrics for each slice.

```python
# tagifai/eval.py | get_performance()
for slice_name in slices.dtype.names:
    mask = slices[slice_name].astype(bool)
    metrics = precision_recall_fscore_support(y_true[mask], y_pred[mask], average="micro")
```
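One way (a sketch with illustrative names, not our exact implementation) to collect these per-slice metrics into the slices section of the performance report shown below:

```python
# Illustrative sketch: gather per-slice metrics into a dictionary
from sklearn.metrics import precision_recall_fscore_support

slice_metrics = {}
for slice_name in slices.dtype.names:
    mask = slices[slice_name].astype(bool)
    if sum(mask):  # skip empty slices
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[mask], y_pred[mask], average="micro"
        )
        slice_metrics[slice_name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_samples": int(sum(mask)),
        }
```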

Note

Snorkel comes with a builtin slice scorer but we had to implement a naive version since our task involves multi-label classification.

We can add these slice performance metrics to our larger performance report to analyze downstream when choosing which model to deploy.

```json
{
  "overall": {
    "precision": 0.8050380552475824,
    "recall": 0.603411513859275,
    "f1": 0.6674448998627966,
    "num_samples": 207.0
  },
  "class": {
    "attention": {
      "precision": 0.6923076923076923,
      "recall": 0.5,
      "f1": 0.5806451612903226,
      "num_samples": 18.0
    },
    ...
    "unsupervised-learning": {
      "precision": 0.8,
      "recall": 0.5,
      "f1": 0.6153846153846154,
      "num_samples": 8.0
    }
  },
  "slices": {
    "f1": 0.7395604395604396,
    "cv_transformers": {
      "precision": 1.0,
      "recall": 0.5384615384615384,
      "f1": 0.7000000000000001,
      "num_samples": 3
    },
    "short_text": {
      "precision": 0.8333333333333334,
      "recall": 0.7142857142857143,
      "f1": 0.7692307692307692,
      "num_samples": 4
    }
  }
}
```

Extensions

We've explored user-generated slices but there is currently quite a bit of research on automatically generated slices and overall model robustness. A notable toolkit is the Robustness Gym which programmatically builds slices, performs adversarial attacks, rule-based data augmentation, benchmarking, reporting and much more.

Robustness Gym slice builders

Instead of passively observing slice performance, we could try to improve it. Usually, a slice may exhibit poor performance when there are too few samples, so a natural approach is to oversample. However, these methods change the underlying data distribution and can cause issues with the overall / other slices. It's also not scalable to train a separate model for each unique slice and combine them via Mixture of Experts (MoE). To combat all of these technical challenges and more, the Snorkel team introduced Slice Residual Attention Modules (SRAMs), which can sit on any backbone architecture (i.e. our CNN feature extractor) and learn slice-aware representations for the class predictions.
Slice Residual Attention Modules (SRAMs)

Inference

When our model is deployed, most users will be using it for inference (directly or indirectly), so it's very important that we test all aspects of it.

Loading artifacts

This is the first time we're not loading our components from in-memory so we want to ensure that the required artifacts (model weights, encoders, config, etc.) are all able to be loaded.

```python
artifacts = main.load_artifacts(run_id=run_id, device=torch.device("cpu"))
assert isinstance(artifacts["model"], nn.Module)
...
```

Prediction

Once we have our artifacts loaded, we're ready to test our prediction pipelines. We should test samples with just one input, as well as a batch of inputs (ex. padding can have unintended consequences sometimes).

```python
# tests/app/test_api.py | test_best_predict()
data = {
    "run_id": "",
    "texts": [
        {"text": "Transfer learning with transformers for self-supervised learning."},
        {"text": "Generative adversarial networks in both PyTorch and TensorFlow."},
    ],
}
response = client.post("/predict", json=data)
assert response.json()["status-code"] == HTTPStatus.OK
assert response.json()["method"] == "POST"
assert len(response.json()["data"]["predictions"]) == len(data["texts"])
...
```

Behavioral testing

Besides just testing if the prediction pipelines work, we also want to ensure that they work well. Behavioral testing is the process of testing input data and expected outputs while treating the model as a black box. The tests don't necessarily have to be adversarial in nature but are more along the lines of the perturbations we'll see in the real world once our model is deployed. A landmark paper on this topic is Beyond Accuracy: Behavioral Testing of NLP Models with CheckList which breaks down behavioral testing into three types of tests:

- invariance: Changes should not affect outputs.

```python
# INVariance via verb injection (changes should not affect outputs)
tokens = ["revolutionized", "disrupted"]
tags = [["transformers"], ["transformers"]]
texts = [f"Transformers have {token} the ML field." for token in tokens]
compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="INV")
```

- directional: Changes should affect outputs.

```python
# DIRectional expectations (changes with known outputs)
tokens = ["PyTorch", "Huggingface"]
tags = [
    ["pytorch", "transformers"],
    ["huggingface", "transformers"],
]
texts = [f"A {token} implementation of transformers." for token in tokens]
compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="DIR")
```

- minimum functionality: Simple combination of inputs and expected outputs.

```python
# Minimum Functionality Tests (simple input/output pairs)
tokens = ["transformers", "graph neural networks"]
tags = [["transformers"], ["graph-neural-networks"]]
texts = [f"{token} have revolutionized machine learning." for token in tokens]
compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="MFT")
```
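The compare_tags helper above lives in our application code; a minimal hypothetical sketch of the idea (assuming a predict_tags helper that returns the predicted tags for a text) might look like:

```python
def compare_tags(texts, tags, artifacts, test_type):
    """Check that the expected tags appear in the predictions for each text (sketch)."""
    results = {"passed": [], "failed": []}
    for text, expected_tags in zip(texts, tags):
        predicted_tags = predict_tags(text=text, artifacts=artifacts)  # assumed helper
        result = {"text": text, "expected_tags": expected_tags,
                  "predicted_tags": predicted_tags, "type": test_type}
        # A test passes if every expected tag shows up in the predictions
        if all(tag in predicted_tags for tag in expected_tags):
            results["passed"].append(result)
        else:
            results["failed"].append(result)
    return results
```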
Note

Be sure to explore the NLP Checklist package which simplifies and augments the creation of these behavioral tests via functions, templates, pretrained models and interactive GUIs in Jupyter notebooks.

NLP Checklist

We combine all of these behavioral tests to create a behavioral report (tagifai.eval.get_behavioral_report()) which quantifies how many of these tests are passed by a particular instance of a trained model. This report is then saved along with the run's artifacts so we can use this information when choosing which model(s) to deploy to production.

```json
{
  "score": 1.0,
  "results": {
    "passed": [
      {
        "input": {
          "text": "Transformers have revolutionized the ML field.",
          "tags": [
            "transformers"
          ]
        },
        "prediction": {
          "input_text": "Transformers have revolutionized the ML field.",
          "preprocessed_text": "transformers revolutionized ml field",
          "predicted_tags": [
            "natural-language-processing",
            "transformers"
          ]
        },
        "type": "INV"
      },
      ...
      {
        "input": {
          "text": "graph neural networks have revolutionized machine learning.",
          "tags": [
            "graph-neural-networks"
          ]
        },
        "prediction": {
          "input_text": "graph neural networks have revolutionized machine learning.",
          "preprocessed_text": "graph neural networks revolutionized machine learning",
          "predicted_tags": [
            "graph-neural-networks",
            "graphs"
          ]
        },
        "type": "MFT"
      }
    ],
    "failed": []
  }
}
```

Warning

When you create additional behavioral tests, be sure to reevaluate all the models you're considering on the new set of tests so their scores can be compared. We can do this since behavioral tests are not dependent on data or model versions and are simply tests that treat the model as a black box.

```bash
tagifai behavioral-reevaluation --experiment-name=best --all-runs  # update all runs in experiment
tagifai behavioral-reevaluation --run-id=0deb534                   # update specific run
```
Sorted runs

We can combine our overall / slice metrics and our behavioral tests to create a holistic

evaluation report for each model run. We can then use this information to choose which

model(s) to deploy to production.

Deployment

There is also a whole class of model tests that go beyond metrics or behavioral testing and focus on the system as a whole. Many of them involve testing and benchmarking the tradeoffs (ex. latency, compute, etc.) we discussed in the baselines lesson. These tests also need to be performed across the different systems (ex. devices) that our model may run on. For example, development may happen on a CPU but the deployed model may be loaded on a GPU and there may be incompatible components (ex. reparametrization) that may cause errors. As a rule of thumb, we should test with the system specifications that our production environment utilizes.

Note

We'll automate tests on different devices in our CI/CD lesson where we'll use GitHub Actions to spin up our application with Docker Machine on cloud compute instances (we'll also use this for training).

Once we've tested our model's ability to perform in the production environment (offline tests), we can run several types of online tests to determine the quality of that performance.

- AB tests:
  - sending production traffic to the different systems.
  - involves statistical hypothesis testing to decide which system is better.
  - need to account for different sources of bias (ex. novelty effect).
  - multi-armed bandits might be better if optimizing on a certain metric.
- Shadow tests:
  - sending the same production traffic to the different systems.
  - safe online evaluation as the new system's results are not served.
  - easy to monitor, validate operational consistency, etc.

Testing vs. monitoring

We'll conclude by talking about the similarities and distinctions between testing and monitoring. They're both integral parts of the ML development pipeline and depend on each other for iteration. Testing is assuring that our system (code, data and models) behaves the way we intend at the current time t0. Whereas, monitoring is ensuring that the conditions (i.e. distributions) during development are maintained and also that the tests that passed at t0 continue to hold true post deployment through tn. When this is no longer true, we need to inspect more closely (retraining may not always fix our root problem).

With monitoring, there are quite a few distinct concerns that we didn't have to consider during testing since it involves (live) data we have yet to see.

- features and prediction distributions (drift), typing, schema mismatches, etc.
- determining model performance (rolling and window metrics on overall and slices of data) using indirect signals (since labels may not be readily available).
- in situations with large data, we need to know which data points to label and upsample for training.
- identifying anomalies and outliers.

We'll cover all of these concepts in much more depth (and code) in our monitoring lesson.

Resources

- Great Expectations
- The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
- A Recipe for Training Neural Networks
- Effective testing for machine learning systems
- Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices
- Robustness Gym: Unifying the NLP Evaluation Landscape

To cite this lesson, please use:

```bibtex
@article{madewithml,
    title  = "Testing - Made With ML",
    author = "Goku Mohandas",
    url    = "https://madewithml.com/courses/applied-ml/testing/",
    year   = "2021",
}
```
