
Core ML Survival Guide

More than you ever wanted to know about mlmodel files and
the Core ML and Vision APIs

Matthijs Hollemans
This book is for sale at http://leanpub.com/coreml-survival-guide

This version was published on 2020-10-09

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean
Publishing process. Lean Publishing is the act of publishing an in-progress ebook using
lightweight tools and many iterations to get reader feedback, pivot until you have the
right book and build traction once you do.

© 2018 - 2020 M.I. Hollemans


Contents

About the Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Who Is This Book For? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Useful Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Part 1: The Core ML Ecosystem . . . . . . . . . . . . . . . . . . 1


What is Core ML — and What is It Not? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Core ML Version History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

The Vision Framework and Core ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

The SoundAnalysis Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Where to Get mlmodels? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Create ML: The Easiest Way to Train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Turi Create — it’s Like Create ML but in Python . . . . . . . . . . . . . . . . . . . . . . 22

coremltools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Part 2: Converting Models . . . . . . . . . . . . . . . . . . . . . . . . 28

Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

Converting TensorFlow / PyTorch With the Unified API . . . . . . . . . . . . . . . . . 38

Keras Conversion Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Converting tf.keras With the Old Converter . . . . . . . . . . . . . . . . . . . . . . . . . 57

Caffe Conversion Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58



TensorFlow 1.x Conversion With tfcoreml . . . . . . . . . . . . . . . . . . . . . . . . . . 64

TensorFlow 2.x Conversion With coremltools 3 . . . . . . . . . . . . . . . . . . . . . . . 76

PyTorch Conversion Using ONNX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

ONNX Conversion Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Torch7 Conversion Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

MXNet Conversion Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Troubleshooting the Conversion Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

Writing Your Own Converter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

NeuralNetworkBuilder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

Model Training Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

Part 3: Examining Models . . . . . . . . . . . . . . . . . . . . . . . . 126

Viewing Models With Netron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Viewing Models With visualize_spec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

The mlmodel File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

Dynamic Tensor Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

Using the Spec to Edit Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

Looking Inside an mlmodel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

Verifying the Conversion is Successful . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

Looking at Intermediate Layer Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

Checking the Layer Output Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

The mlmodel as a Big Text File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

Part 4: Model Surgery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

Filling in the Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

Changing the Image Preprocessing Options . . . . . . . . . . . . . . . . . . . . . . . . . 195



Using a Different Scale for Each Color Channel . . . . . . . . . . . . . . . . . . . . . . . 201

Saving the Weights as 16-bit Floats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

Quantizing the Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

Changing the Input Type to Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

Outputting an Image Instead of a MultiArray . . . . . . . . . . . . . . . . . . . . . . . . 219

Outputting Floats Instead of Doubles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

Tidying up MultiArray Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

Renaming Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

Inserting a New Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

Changing an Existing Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

Deleting a Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

Example: Cleaning Up a Converted Model (DeepLab v3+) . . . . . . . . . . . . . . . . 238

Replacing the Class Names of a Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

Part 5: Inside the App . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252

Understanding the Xcode-generated File . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

mlmodelc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

Running the Core ML Compiler Manually . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

Downloading and Compiling Models on the Device . . . . . . . . . . . . . . . . . . . . 268

Running the Model on the CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

The Neural Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

CPU, GPU, or Neural Engine? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277

Inspecting the Model at Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

Making Sure the Input is Correct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

Working With CVPixelBuffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288



Using CGImage or Image Files Instead of CVPixelBuffer . . . . . . . . . . . . . . . . . 296

Working With MLMultiArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

Reshaping an MLMultiArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

Transposing an MLMultiArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

Converting MLMultiArray to an Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

Converting from UIImage to MLMultiArray . . . . . . . . . . . . . . . . . . . . . . . . . 324

Computing the Argmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

Translating Class Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328

Part 6: Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . . 331

Making Multiple Predictions at Once With Batches . . . . . . . . . . . . . . . . . . . . 332

Size Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

Using the MLModel API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

Vision FeaturePrint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351

Using Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356

Creating Your Own Custom Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357

Creating Your Own Custom Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

Building Pipeline Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

Linked Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385

Control Flow in Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

Working With Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396

MobileNetV2 + SSDLite Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408

Using Protobuf Without coremltools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436

Encrypting Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441

Performance Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444



Part 7: On-device Personalization . . . . . . . . . . . . . . . 447

What is on-device personalization? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448

Making a Model Updatable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454

Training a Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465

k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479


About the Author
Hi, my name is Matthijs Hollemans.

I am an independent consultant who specializes in machine learning and AI on mobile and edge devices. If you want to add machine learning to your app, I’m the guy to talk to.

You can read more about my services at machinethink.net/hire1 .

My blog about machine learning on mobile devices is at machinethink.net/blog2 .

I’ve also written a number of books that are published by the popular tutorial website
raywenderlich.com, such as The iOS Apprentice and more recently Machine Learning by
Tutorials3 .

Feel free to email me at matt@machinethink.net if you have any questions about this book
or if you just want to chat about machine learning.

You can also find me on Twitter as @mhollemans4 .

My GitHub5 has a lot of machine learning and Core ML-related repos. Check it out!
1 https://machinethink.net/hire/
2 https://machinethink.net/blog/
3 https://store.raywenderlich.com/products/machine-learning-by-tutorials
4 https://twitter.com/@mhollemans
5 https://github.com/hollance

Introduction
This guide is a collection of tips and tricks for working with Core ML and mlmodel files. I
collected these during my work as machine learning consultant for iOS and macOS apps,
which often involves making models suitable for use with Core ML.

At first sight, the Core ML API appears very simple to use. Xcode even automatically
generates a class for you that makes using machine learning models as easy as:

let model = YourModel()
let result = try? model.prediction(input: someInput)

Such a high-level API is really convenient for the developer when everything goes smoothly
and works as intended. Unfortunately, things don’t always go smoothly…

Core ML hides away a lot of the complexity involved in running machine learning models.
But that complexity is still there and — like it or not — you’re going to have to deal with it
when the bits start hitting the fan.

I know for a fact that a lot of developers struggle with the finer points of Core ML because
for the past couple of years I’ve been keeping a close eye on the Core ML topics on
StackOverflow6 and the Apple Developer Forums7 . I also receive a lot of questions on my
GitHub repos and by email.

It’s clear that using Core ML isn’t always as easy as it’s supposed to be…

That’s why I wrote this guide. It contains pretty much everything I know about Core ML.
As I learn new things, I’ll keep updating this guide.

I primarily work with deep learning and computer vision, so most of the tips in this guide
are about neural networks that take image inputs — but a lot of the advice applies to other
model types and data types too. The code examples are in Python and Swift, but should be
easy enough to translate to Objective-C.

Let me know if anything is still unclear, or if you have a question about Core ML that is not
answered in this book. Happy reading!
6 https://stackoverflow.com/questions/tagged/coreml
7 https://developer.apple.com/forums/tags/core-ml

Who Is This Book For?
I wrote this book to help out developers who have questions about Core ML, or who want
to learn how to get the best performing Core ML models in their apps.

Just to be clear: This book does not teach machine learning!

You’re already supposed to know what a model is, the difference between training and
inference, and so on. If you’re completely new to machine learning, please check out my
other book Machine Learning by Tutorials8 .

This book also isn’t a beginner’s guide to Core ML. The intended audience is people who
want to get more out of Core ML, or who are stuck trying to get their models to work.

If you haven’t used Core ML before, I suggest you first follow one of the many beginner
tutorials9 available online and then come back to this book if you run into any problems —
or if you want to learn more, of course!
8 https://store.raywenderlich.com/products/machine-learning-by-tutorials
9 https://www.raywenderlich.com/7960296-core-ml-and-vision-tutorial-on-device-training-on-ios

Useful Links
You may find the following links useful in your Core ML journey.

Official Core ML documentation:

• The coremltools documentation is a good place to get started. This is for coremltools
version 4 and later.

• The most up-to-date specification of the Core ML file format is in the proto files, which
you’ll look at in great detail in this book.

Old documentation that can still be useful sometimes:

• User manual for coremltools version 3.

• Code snippets and Jupyter notebooks for various tasks such as changing input/output
datatypes, quantizing, printing out layer attributes, making updatable models, and so
on.

Source code:

• The coremltools source code. Sometimes the only way to convert models is to hack
coremltools, so it’s good to know your way around this repo.

• The GitHub repo with the accompanying source code files for this book.

• CoreMLHelpers makes it a little easier to work with Core ML in Swift.

Tutorials and blogs:

• A curated list of machine learning resources, with a focus on Apple technologies.

• Machine learning tutorials at raywenderlich.com.

• Heartbeat by fritz.ai, a great blog about machine learning on the edge.

• My own blog, which covers Core ML but also Metal Performance Shaders and other
topics related to machine learning in Swift.


Questions and answers:

• Check out the list of open issues to see if your problem is already known and perhaps
has a workaround.

• Core ML questions on StackOverflow.

• The Core ML section of the Apple Developer Forums.


Part 1: The Core ML Ecosystem

What can you do with Core ML and where does it fit into the world of machine learning?
What is Core ML — and What is It Not?
Apple’s marketing department has done a good job of hyping up Core ML, but it’s important
to understand exactly what Core ML is and also what it isn’t. The goal of this chapter is to
set realistic expectations.

Core ML is really two things:

1. The mlmodel file format definition. This is an open file format for machine learning
models.

2. The CoreML.framework that lets you perform inference in your iOS, tvOS, watchOS
and macOS apps using the model from an mlmodel file.

Most Core ML tutorials focus on the API from the framework but to make the most of Core
ML you also need to understand how the mlmodel format works. In this guide you’ll find
a lot of tips for working with mlmodel files.

Core ML and the machine learning community

So where does Core ML fit into the larger picture?

The Core ML framework and the mlmodel format are on the fringes of the ML ecosystem.
Literally, because Core ML does machine learning “on the edge” as opposed to “in the cloud”
— but also because to the vast majority of machine learning practitioners, Core ML is not
important at all.

Most of the effort in machine learning goes into designing and training the models, while
Core ML is just one technology of many for deploying the trained models into production.
And mlmodel is just another file format.

It’s good that Apple opened up the mlmodel specification. This has made it easier for
other tools to adopt the format. However, mlmodel is not the industry standard for the
interchange of models between platforms. It looks like ONNX10 , the Open Neural Network
eXchange format, is going to win that race.
10 http://onnx.ai


Perhaps things will change as machine learning on mobile becomes more popular, but right
now Core ML isn’t having much of an impact on the industry. One reason may be that Core
ML, unlike pretty much all other machine learning tools, is not open source and therefore
does not get to benefit from an active community of users and developers.

Don’t get me wrong… Core ML is great for iOS and macOS developers who want to add
machine learning to their apps, but it’s just one technology among many competitors that
all try to do the same thing. Even on Apple platforms, Core ML is not the only choice for
adding machine learning to your apps.

Benefits of using Core ML

The following are good reasons for using Core ML in your app:

• The API is easy to use, especially with the tips from this guide. ;-)

• You only need to provide the mlmodel file. You don’t need to implement any of the
machine learning algorithms yourself.

• Core ML can run your model on the CPU, on the GPU, or on the Neural Engine,
depending on what hardware is available in the user’s device. This allows it to make
optimal use of the available resources. Core ML can even split up the model to only
perform certain parts on the GPU (tasks that need to do a lot of computation) and the
other parts on the CPU.

• Because Core ML can run models on the CPU, you can develop and test your apps on
the iOS Simulator. This is not the case for apps that directly use the Metal APIs, since
the Simulator does not allow Metal apps. Nor do Metal apps play well with unit tests.
Core ML doesn’t have these problems. (Update: as of iOS 13, Metal apps can run on the
Simulator too but Metal Performance Shaders still cannot.)

• The biggest benefit: Core ML can use the Neural Engine on devices with the A12 chip
or better. The Neural Engine gives a great performance boost and right now Core ML
is the only way to get your hands on this extraordinary computing power.

Prior to the release of the iPhone XS, I advised my clients to prefer Metal Performance
Shaders over Core ML in order to get the best possible speed, especially for models that
need to process real-time video. I still recommend that for older devices, but having access
to this new Neural Engine really changes things and makes Core ML much more attractive.

Downsides of Core ML

There’s also some bad news — the biggest downsides of Core ML are:

• Creating an mlmodel file is not always as easy as it looks. You may need to wrestle with
coremltools and Python code. (That’s why I called this the Core ML “survival” guide.)

• Core ML supports only a limited number of neural network layer types. It is possible to
create custom layers but that requires implementing the machine learning algorithms
yourself, and you lose the ability to use the Neural Engine. Core ML 3 introduced
many new layer types, including low-level mathematics operations that can be used to
implement new kinds of layers directly inside the mlmodel, but these are not always
sufficient.

• The speed can be unpredictable. Sometimes a model will suddenly run a lot slower
than before. It appears as if Core ML is trying to be clever in how it schedules things,
but doesn’t always succeed.

• You have little control. Core ML is a black box and so it’s hard to understand why and
how it does certain things.

• Troubleshooting is difficult. For example, you cannot easily look at the output pro-
duced by intermediate layers.

• The mlmodel files are unprotected. Even though your app doesn’t ship with the
original mlmodel file but with a compiled version, it’s still possible for competitors
to download your app and reverse engineer your model. (Update: As of iOS 14 and
macOS 11, Core ML models can be automatically encrypted.)

• Core ML is not open source and is only updated with new OS releases. Therefore, Core
ML will always lag behind the state-of-the-art.

Some of these limitations are more serious than others, but if you want to use Core ML in
your app, you’ll have to learn to work with them or around them. That’s what this guide is
for!

Alternatives to Core ML

For performing inference on the device you have a few options besides Core ML:

• TensorFlow Lite. While it’s possible to convert some TensorFlow models to Core ML,
only a limited number of operations are supported. If you have a TensorFlow graph
that Core ML can’t handle, you can use TF-Lite11 instead. Note that TF-Lite requires you
to write C++ and supports fewer operations than the full version of TensorFlow. TF-
Lite can run models on the CPU, on the GPU using the GpuDelegate, and on the Neural
Engine using the CoreMLDelegate (in which case TF-Lite simply converts the model to
Core ML format and then uses Core ML to run it).

• PyTorch Mobile. I don’t have a lot of experience yet with this framework12 , but
currently it only supports model execution on the CPU, not on the GPU or Neural
Engine.

• Metal Performance Shaders (MPS). This is a toolkit for running neural networks
with Metal, Apple’s language for programming the GPU. Core ML actually uses MPS
internally too, but you can often get a speed improvement by implementing the neural
network yourself directly with MPS — especially on devices that do not have a Neural
Engine. MPS also can do training on the device.

• Roll your own. Using the Accelerate framework, BNNS, Metal or OpenGL, and third-
party libraries such as Eigen. Generally, I recommend against doing this as it’s a lot
of work and usually not worth the effort. (I’ve done this a few times with TensorFlow
models to avoid the overhead of the TensorFlow API.)

Even if you use Core ML, you may need to write additional pre- or post-processing code
using the Accelerate framework. And if your model needs a custom layer, you’ll have to
write some Metal code. So even with Core ML you may still end up using these other APIs.
11 https://www.tensorflow.org/lite/
12 https://pytorch.org/mobile/home/
Core ML Version History
Even though Core ML was only introduced in 2017, both the Core ML framework and the
mlmodel file format have already gone through a number of revisions.

If you’re considering using Core ML in your app, it’s important to know which functionality
is supported by which version of Core ML. For example, models using layer types that were
introduced in iOS 13 will not be backwards compatible and won’t be able to run on iOS 12.

In some cases it might be possible to perform surgery on a model to make it compatible
with earlier versions of Core ML, but it’s best to already keep these limitations in mind
when you design and train your models.

Unfortunately, the version history of Core ML is not very well documented and trying to
find out what is supported where can be a bit of a puzzle. This chapter attempts to clear up
the timeline.

Finding the mlmodel specification version

Every mlmodel file includes a specificationVersion field that tells you which version of
Core ML is required to run this model.

Note: When you convert a model using coremltools, the specificationVersion is set to the
minimum version needed to support all the operations or layers in your model. For exam-
ple, if a model only uses functionality from mlmodel version 1, its specificationVersion is
set to 1 and the model will run without problems on iOS 11.0.

Use the following Python snippet to view the specification version number for a Core ML
mlmodel file:

import coremltools as ct
spec = ct.utils.load_spec("YourModel.mlmodel")
print(spec.specificationVersion)

You can also use Netron to view the specification version number.


The history of Core ML so far…

The following table lists the mlmodel versions and the platforms they are available on:

Specification version   iOS     tvOS    watchOS   macOS
1                       11.0    11.0    4.0       10.13 (High Sierra)
2                       11.2    11.2    4.2       10.13.2
3                       12.0    12.0    5.0       10.14 (Mojave)
4                       13.0    13.0    6.0       10.15 (Catalina)
5                       14.0    14.0    7.0       11.0 (Big Sur)

mlmodel version 1

The first version of the specification, introduced in 2017.

mlmodel version 2

This is a small update from late 2017 that adds the following features:

• 16-bit floating point weights in the mlmodel file

• custom layers in neural network models

mlmodel version 3 (Core ML 2)

This new version was announced at WWDC 2018 under the name Core ML 2, but the
specification version number for the new features is actually 3.

Version 3 has the following new features:

• quantized weights

• flexible image sizes and input shapes

• MLSequence for dealing with sequential data and RNNs

• batch predictions

• custom models

There are also new model types:

• BayesianProbitRegressor

• TextClassifier

• VisionFeaturePrint.Scene

• WordTagger

• NonMaximumSuppression

For neural network models, the following layer types were added:

• CropResizeLayerParams: extract regions of interest

• ResizeBilinearLayerParams: bilinear resize to arbitrary image sizes

Note: Version 3 models can only be opened by Xcode 10 and later.

mlmodel version 4 (Core ML 3)

Core ML 3 is a major update that was announced at WWDC 2019. The specification version
number for the new features is 4.

The biggest changes are:

• Several new model types.

• Support for on-device training, or “personalization” as Apple calls it.

• Many new layer types.

• Neural networks now support tensors with arbitrary shapes.

• Non-max suppression can now also be used as a layer inside a neural network.

• Dynamic graphs, including loops and branches.

The new model types are:

• k-Nearest Neighbors

• ItemSimilarityRecommender

• SoundAnalysisPreprocessing

• Gazetteer

• WordEmbedding

• Linked models

Training is available for k-Nearest Neighbors models, neural networks, and pipelines
containing these. For neural networks, only convolutional layers and fully-connected
layers can be trained.

There are many new neural network layer types — too many to mention here. Most of these
were added to support the new N-dimensional tensors, making Core ML more suitable for
data other than just images. Other notable additions are low-level mathematics operations
and layers for flow control.

Refer to this blog post13 for a detailed overview of all the new layer types.

Note: Version 4 models can only be opened by Xcode 11 and later.

mlmodel version 5

This is the 2020 update of Core ML. The “marketing” version number got dropped — Apple
doesn’t call this new version Core ML 4, just Core ML. However, coremltools did get bumped
to version 4. The specification version number for the new features is 5.

New layer types:

• Convolution3DLayer, Pooling3DLayer, GlobalPooling3DLayer

• OneHotLayer

• ClampedReLULayer

• ArgSortLayer

• CumSumLayer

• SliceBySizeLayer
13 https://machinethink.net/blog/new-in-coreml3

Changes to existing layers:

• InnerProductLayer and BatchedMatMulLayer can do their computations with 8-bit quantized values.

• UpsampleLayer can have a fractional scaling factor. In BILINEAR mode, there are new
options for how the grid points are sampled (“align corners”).

• ReorganizeDataLayerParams has a PIXEL_SHUFFLE mode.

• SliceStaticLayer and SliceDynamicLayer now have a squeezeMasks property.

• TileLayer accepts a second input tensor, so that you can specify the number of
repetitions dynamically.

• ConcatNDLayerParams has an interleave option.

New model types:

• VisionFeaturePrint.Objects

• SerializedModel

Other improvements:

• It is now possible to send model updates through Apple Cloud to users’ devices.

• Automatic encryption of Core ML models.

• On A11 Bionic and up, Core ML can use 16-bit floating point operations when running
models on the CPU. Float16 is now also a first-class Swift data type.

For a full description of all the new stuff that was added in 2020 to Core ML and Apple’s
other machine learning frameworks, see this blog post14 .

Note: Version 5 models can only be opened by Xcode 12 and later.


14 https://machinethink.net/blog/new-in-apple-machine-learning-2020/
The Vision Framework and Core ML
There are two different ways to make predictions with Core ML models in your apps:

1. using CoreML.framework

2. using Vision.framework

The Vision framework is a high-level API for performing specific computer vision tasks,
such as face detection and object tracking. It also has an API for running Core ML models
on images.

What’s so special about Vision?

When using CoreML.framework, you are responsible for making sure the input image is
already in the format expected by the model. You need to resize the image, make sure it’s
in the proper orientation, put it into a CVPixelBuffer object, etc.

Even though Xcode automatically generates a handy class for your model, this class will
not actually do any of these things for you!

That’s where Vision comes in. It automatically takes care of preparing the image and more:

• Automatic resizing. Vision will crop and/or resize the image to the input size expected
by the mlmodel.

• Automatic rotation. Vision orients the image so that “up” is really up. This is
important because photos often store their pixel data in the physical orientation of
the camera’s sensors. Which side is really up is stored in the EXIF metadata. Without
accounting for the correct orientation, the image that Core ML sees may be rotated
by 90 degrees or be upside down — and that may affect the quality of the model’s
predictions. This is true for photos and video taken by the iPhone’s own camera but
also for images downloaded from the internet.

• More image formats. In addition to CVPixelBuffer, Vision also accepts CGImage and
CIImage types as input. This means you can use Core Image to process the input image
before you pass it to Vision. It also can work directly on PNG and JPEG image files.


• Camera intrinsics. You can pass the camera’s intrinsics matrix to Vision. This matrix
includes the focal length of the camera lens and other information that describes how
3D coordinates are transformed into 2D image coordinates. (I’m not sure if and how
Vision actually uses this information when it runs Core ML models.)

• Colorspace conversion. The model is probably trained on RGB or BGR pixels, but your
input images may be in a different colorspace such as sRGB or P3. Vision automatically
performs the conversion from the image’s colorspace to the model’s.

Because of these reasons, it’s recommended that you use Core ML models that work on
images through the Vision framework.

Note: As of Core ML 3, the regular API also lets you pass in a CGImage object or a local image
file instead of a CVPixelBuffer. It allows you to crop and resize these images in the same
way as Vision. But this new API is primarily intended to help with on-device training, not
as a replacement for Vision. In Xcode 12, the auto-generated class also has functions that
accept CGImage objects, but these have no options for cropping and resizing.

Using Vision to drive Core ML

To use Vision with a Core ML model, you first create a VNCoreMLModel and a VNCoreMLRequest
instance. You only need to do this once. The code is really simple:

let visionModel = try VNCoreMLModel(for: coreMLmodel)

let request = VNCoreMLRequest(model: visionModel) { request, error in
    /* do something with the results */
}

request.imageCropAndScaleOption = .scaleFill

Then for every new image, you create a new VNImageRequestHandler instance and tell it to
perform the request:

let handler = VNImageRequestHandler(cgImage: yourImage)
try? handler.perform([request])

The call to handler.perform() blocks the current thread until the request completes. After-
wards, you can look at request.results to get the prediction results — although usually
you’d do this in the completion handler for VNCoreMLRequest.

The Vision result objects

The request.results array contains zero or more objects that are subclasses of VNObservation.
The exact type of the observation objects in the array depends on the output of the Core ML
model.

Here are the possibilities:

• Zero objects: there were no results. For example, if the Core ML model is an object
detector but the image contained no objects whose confidence score is over the
minimum threshold.

• VNClassificationObservation: used when the model is a classifier. The results array has one object for every possible class, and is sorted from highest to lowest probability. As of Core ML 3, the observation object can also contain a precision-recall curve.

• VNRecognizedObjectObservation: used when the model is an object detector (such as YOLO). You get one VNRecognizedObjectObservation instance for every object found in the image. It contains the coordinates for a bounding box rectangle plus an array of VNClassificationObservation instances that describe the class probabilities for the object inside the bounding box.

• VNCoreMLFeatureValueObservation: used when the Core ML model has a MultiArray output. This object contains the unfiltered output values from the model and you’ll need to write additional code to interpret these numbers. You get a VNCoreMLFeatureValueObservation instance for every output in the model.

• VNPixelBufferObservation: used when at least one output of the Core ML model is an image. Useful for image-to-image models such as style transfer.

There are also other types of Vision result objects, such as VNFaceObservation, but these are
not the result of applying a Core ML model.

To examine the Vision result objects, you’d write something like this:

if let observations = request.results as? [VNClassificationObservation] {
    /* do something with the prediction */
}

Image cropping and scaling

One thing to be aware of is how the image you’re passing to Vision gets cropped and resized
before it’s given to the Core ML model. This is determined by the imageCropAndScaleOption
property of the VNCoreMLRequest:

• centerCrop: scales the image to fit the shortest side, then crops out a rectangle around
the image’s center. This maintains the original aspect ratio, but you lose part of the
image if it wasn’t square to begin with.

• scaleFit: resizes the image to fit the longest side. The shortest side is padded with
zeros. This maintains the aspect ratio and you keep the whole image.

• scaleFill: resizes the image without regard to the aspect ratio. The image may be
squashed or stretched.

To get the best results, you should use the option that best matches the sort of cropping and
resizing that was performed during training. If none of these options are suitable, you’ll
have to do your own cropping beforehand using Core Image or the vImage framework.

To tell Vision to only look at a specific portion of the image, you can set the regionOfInterest
property on the request object.
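
For example, a quick sketch of both settings; the rectangle values here are made up, and regionOfInterest is specified in normalized coordinates with the origin in the lower-left corner:

request.imageCropAndScaleOption = .centerCrop

// Only analyze the middle 50% of the image (hypothetical values).
request.regionOfInterest = CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5)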

Models with multiple inputs

Previously, Vision could only make predictions on models that had a single image input. As
of iOS 13 / macOS 10.15, you can also use Vision with models that have multiple inputs.

You configure the VNCoreMLModel object as follows. This example is for a model with three
inputs: image, iouThreshold, and confidenceThreshold.

let visionModel = try VNCoreMLModel(for: coreMLmodel)

visionModel.inputImageFeatureName = "image"
visionModel.featureProvider = try MLDictionaryFeatureProvider(dictionary: [
    "iouThreshold": MLFeatureValue(double: 0.45),
    "confidenceThreshold": MLFeatureValue(double: 0.25),
])

The inputImageFeatureName property should be the name of the input from the Core ML
model that will receive the image that you’ll pass into the VNImageRequestHandler.

Any additional input values are supplied through the featureProvider dictionary. In the
example we’re using doubles, but any data type from MLFeatureValue should work. They can
even be additional image inputs, in which case you’d use MLFeatureValue(pixelBuffer:).

If you also need to know which model output produced which observation, you can look at
the featureName property of the VNCoreMLFeatureValueObservation or VNPixelBufferObservation
object.

Advanced pipelines

You can also create advanced image processing pipelines, using the computer vision
techniques provided by the Vision framework as preprocessing steps for your neural
network. Here are some examples:

• Use VNDetectFaceRectanglesRequest to detect the position and size of a person’s face, crop the input image to that region using Core Image, and run your neural network on just the part of the image where the face is.

• Use VNDetectRectanglesRequest to detect rectangular areas in the image, apply a perspective correction filter with Core Image and convert the image to grayscale, then run a Core ML classifier.

• Use VNTranslationalImageRegistrationRequest to see if the user is holding the camera still. While the camera is moving too much you don’t run the Core ML model in order to save battery power. See the WWDC 2018 session Vision with Core ML for a code example.

• Use VNGenerateImageFeaturePrintRequest to get a feature vector from Vision.FeaturePrint for an image. This is useful if you don’t want to train your own feature extractor.

The combination of Vision’s built-in algorithms with your own Core ML models is very
powerful. Take advantage of it!
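
As an illustration, here is a rough sketch of the first pipeline from the list above. The ciImage and the already-configured coreMLRequest are assumed to exist:

import Vision
import CoreImage

let faceRequest = VNDetectFaceRectanglesRequest()
let handler = VNImageRequestHandler(ciImage: ciImage)
try? handler.perform([faceRequest])

if let face = (faceRequest.results as? [VNFaceObservation])?.first {
    // The bounding box is normalized, so convert it to pixel coordinates first.
    let box = VNImageRectForNormalizedRect(face.boundingBox,
                                           Int(ciImage.extent.width),
                                           Int(ciImage.extent.height))
    // Crop to the face region and run the Core ML request on just that part.
    let faceHandler = VNImageRequestHandler(ciImage: ciImage.cropped(to: box))
    try? faceHandler.perform([coreMLRequest])
}

In a real app you would also handle the case where no face was found, and reuse the request objects instead of recreating them for every frame.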

Note: There are a few advanced things that Vision is unable to do at the moment. For
example, you cannot make predictions on batches of images. In those cases you’ll need
to use the regular Core ML API.

Ready-made models

Vision can do a lot more than run Core ML models: it already has many machine learning
models built in. Using these is a lot easier than building your own!

• VNClassifyImageRequest: a general purpose image classifier that knows about many different categories of objects. If this already has the classes you’re interested in, it might be worth using instead of your own classifier model.

• VNGenerateAttentionBased- and ObjectnessBasedSaliencyImageRequest: output heatmaps that show which part of the image is interesting in some way.

• VNDetectAnimalRectanglesRequest and VNDetectHumanRectanglesRequest: object-detectors trained to detect cats, dogs, and humans.

• VNRecognizeTextRequest: use this together with the document scanner from VisionKit
to perform OCR (optical character recognition) to find text in the image and return it
as strings.

• VNDetectFaceLandmarksRequest: find facial features such as a person’s eyes, nose, mouth.

• VNDetectFaceCaptureQualityRequest: get a score based on how good a photo of a human face looks from a photography perspective.

• VNDetectHumanHandPoseRequest: detect human hands and the poses they make.

• VNDetectHumanBodyPoseRequest: multi-person full-body pose detection.
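
Using these built-in requests looks much the same as running your own Core ML model. Here is a minimal sketch for the general-purpose classifier, assuming cgImage already exists:

let request = VNClassifyImageRequest()
let handler = VNImageRequestHandler(cgImage: cgImage)
try? handler.perform([request])

if let results = request.results as? [VNClassificationObservation] {
    // Print the five most likely labels and their confidences.
    for observation in results.prefix(5) {
        print(observation.identifier, observation.confidence)
    }
}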


The SoundAnalysis Framework
Core ML is not just about computer vision but can handle other types of data as well,
including sound. The SoundAnalysis framework is like Vision but for audio.

You can use this framework to classify sounds using a MLSoundClassifier model that was
trained by Create ML. It is possible to analyze local audio files, such as M4A and WAV files,
but also streaming audio from AVAudioEngine.

Using SoundAnalysis is a lot simpler than writing your own audio handling code. The
framework takes care of all the grunt work: sample rate conversion, buffering the data,
splitting the audio stream into overlapping windows, and so on. All you have to do is
provide a Core ML model to perform the classification.

Even though the documentation for SoundAnalysis claims that the model must be trained
with Create ML, it accepts any mlmodel. This means you can also train your classifier
using any other training package, as long as it takes audio buffers as input and produces
a classification prediction. Tip: Use the built-in CoreMLModels.SoundAnalysisPreprocessing
model to preprocess the audio buffers before passing them on to your classifier model.
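
To give an idea of what this looks like, here is a minimal sketch that classifies a local audio file. The model and file names are hypothetical; any sound classifier mlmodel should work:

import Foundation
import CoreML
import SoundAnalysis

// Load a compiled sound classifier model (hypothetical file name).
let modelURL = URL(fileURLWithPath: "SoundClassifier.mlmodelc")
let request = try SNClassifySoundRequest(mlModel: try MLModel(contentsOf: modelURL))

// The observer receives a result for every analyzed window of audio.
class ResultsObserver: NSObject, SNResultsObserving {
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let best = result.classifications.first else { return }
        print("\(best.identifier): \(best.confidence)")
    }
}

let observer = ResultsObserver()
let analyzer = try SNAudioFileAnalyzer(url: URL(fileURLWithPath: "recording.wav"))
try analyzer.add(request, withObserver: observer)
analyzer.analyze()   // blocks until the whole file has been processed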

Where to Get mlmodels?
To use the Core ML or Vision API you need to have an mlmodel file. The quality of the
mlmodel file is key to a good Core ML experience.

There are a few ways to go about this:

• Find an mlmodel that someone has made freely available15 . The downside is that these
models usually won’t be tailored to your own use case.

• You can also download Core ML models from Apple16 . You can choose between a full-precision download, a half-precision version (16-bit floats), or a quantized model (8 bits). The
16-bit model should work just as well as the full-precision one but is only half the size;
the quantized model might give slightly worse results. Some of these models also have
a link to sample code that shows how to use them.

• Train a model using TensorFlow, Keras, Caffe, PyTorch or any of the other training
tools and convert it to an mlmodel file. This is the most flexible option but you need
to be careful in how you design your model, as Core ML or the conversion tools may
not support all operations.

• Apple provides easy-to-use training tools such as Create ML and Turi Create, that
directly export mlmodel files for use with Core ML. You’re guaranteed the model will
work with Core ML, but these tools only support specific use cases.

• As of Core ML 3, you can train models with Core ML itself, but this isn’t intended for
training new models from scratch, only to finetune them on the user’s data.

A caution about pretrained models

Academic research on machine learning for computer vision often uses VGG16 or ResNet-
101 as the base model because these are known to give good results. Unfortunately these
models have tens of millions of parameters, making them unsuitable for use on mobile.

In general, you can’t just grab any machine learning model off of the Internet and expect it
to work well with Core ML. In order to get good results on mobile, you need to use models
that have been designed with mobile in mind.
15 https://github.com/likedan/Awesome-CoreML-Models
16 https://developer.apple.com/machine-learning/models/


The solution is to re-architect the model using a smaller feature extractor such as MobileNet
or SqueezeNet, and train it again. Now you’ll end up with a model that is much smaller,
faster, and more energy-efficient. As a rule of thumb, aim for 10 million parameters or
less.
Create ML: The Easiest Way to Train
Even though there are Core ML converters for many training tools including TensorFlow,
PyTorch, Keras and Caffe, it’s even easier to use a training tool that was made specifically
for Core ML! You can use Create ML to directly train Core ML models.

With Xcode 10, Create ML is a framework that is only available on the Mac (it needs macOS
10.14 Mojave or later). Training a model is as easy as dragging and dropping your training
data17 into a macOS Swift Playground.

As of Xcode 11, Create ML is a separate app that you launch from Xcode’s Open Developer
Tool menu. This app only runs on macOS 10.15 (Catalina) or later. The framework version
is still present, so you can still train your models from a Swift Playground if you prefer that
over using the app (this also works on 10.14).
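
For example, training an image classifier from a playground boils down to just a few lines. The paths below are hypothetical, and the training folder is assumed to contain one subfolder per class:

import CreateML
import Foundation

let trainingDir = URL(fileURLWithPath: "/Users/you/Datasets/Flowers")
let classifier = try MLImageClassifier(trainingData: .labeledDirectories(at: trainingDir))

// Save the trained model as an mlmodel file for use with Core ML.
try classifier.write(to: URL(fileURLWithPath: "/Users/you/Models/FlowerClassifier.mlmodel"))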

Create ML can train a few different model types:

• general purpose classifiers, MLClassifier, and regressors, MLRegressor

• image classifier, MLImageClassifier

• text classifier, MLTextClassifier

• word classifier, MLWordTagger

The general-purpose classifiers and regressors are made from the standard model types
that were already supported by Core ML version 1. This includes decision trees, random
forests, boosted trees, linear & logistic regression, and SVMs.

The image, text, and word classifiers use Core ML model types that are only available on
iOS 12 and macOS 10.14 and later.

If you create an MLImageClassifier, the Core ML model is actually very small. That is possible
because it is based on a special feature extractor called Vision FeaturePrint.Scene that is
built into iOS 12 and macOS 10.14 and later. Any mlmodels that use this feature extractor
only need to store the weights for the final classification layer, which is just a logistic
regression on top of the extracted features.

Many of the other model types use some form of transfer learning as well.
17 https://www.raywenderlich.com/5653-create-ml-tutorial-getting-started


As of Xcode 11 (iOS 13 and macOS 10.15), Create ML can also train the following model
types:

• object detection, MLObjectDetector

• sound classification, MLSoundClassifier

• activity classification, MLActivityClassifier

• recommender model, MLRecommender

• helpers for natural language processing, MLGazetteer and MLWordEmbedding

Xcode 12 adds the following model types (iOS 14 / macOS 11.0):

• activity classification from video, MLActionClassifier

• style transfer for images and videos, MLStyleTransfer

• object detection using transfer learning, using the new VisionFeaturePrint.Objects feature extractor

• transfer learning for word tagging with dynamic word embeddings

Create ML is still quite limited and if you already know how to use a tool such as TensorFlow
it’s probably not worth your time. But it’s great for building simple models, and for when
you want to build an image classifier or object detector on top of Vision FeaturePrint.

Also, Create ML is one of the few training tools that will use your Mac’s GPU to accelerate
training. This can make the difference between training for minutes and training for hours!
Turi Create — it’s Like Create ML but in Python
In addition to Create ML, Apple also provides a Python-based training tool, Turi Create18 .

Turi Create can do most of what Create ML can, plus a bunch of additional stuff that Create
ML can’t (yet?).

Here are the advantages of Turi Create:

• It has built-in support for many common machine learning tasks, such as image clas-
sification, object detection, style transfer, recommender systems, text classification,
clustering, and a few others.

• The models that you train with Turi Create are optimized for use with Core ML. It
directly saves mlmodel files so you don’t need to convert them afterwards.

• Turi Create has a Python API, so you can fit it into your other Python workflows.

• Like Create ML, it can do GPU-accelerated training on the Mac.

• It also works on older versions of macOS (10.12 and up), Linux, and Windows 10.

• It’s open source, so you can learn interesting things from it by reading the code!
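
To give a small taste of the Python API, here is a hedged sketch that trains an image classifier and exports it to Core ML. The folder layout and file names are assumptions:

import turicreate as tc

# Assumes a folder organized as training_images/<class name>/<image files>.
data = tc.image_analysis.load_images("training_images/", with_path=True)
data["label"] = data["path"].apply(lambda path: path.split("/")[-2])

model = tc.image_classifier.create(data, target="label")
model.export_coreml("MyClassifier.mlmodel")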

If you already are familiar with tools such as TensorFlow then perhaps Turi Create is not
that useful to you. However, if the only machine you have available for training is a Mac,
Turi may be a good choice, as it’s one of the few training tools that supports GPU training
on the Mac (provided your computer is not too old).

Note: Turi Create can only use the GPU for training on macOS 10.14 (Mojave) and later. If
you’re running an older version of macOS, training will be a lot slower.

18 https://github.com/apple/turicreate

coremltools
Chances are that your model was trained with a Python-based tool such as TensorFlow,
Keras, PyTorch, or MXNet. Because Python is the language of choice for machine learning,
Apple has provided the coremltools package for working with Core ML models from
Python.

The primary reason for using coremltools is to convert your model to Core ML format, but
it does more than just conversions alone. You can also use it to tweak existing models and
even build models from scratch.

What does it do?

coremltools lets you perform three main tasks:

1. Converting a trained model to Core ML format.

2. Using a Core ML model to make predictions from Python.

3. Changing properties of an existing Core ML model.

The coremltools package also contains the official specification of the Core ML file format.
You’ll look at this in detail in the mlmodel file format chapter.

coremltools 4 is different!

The way to convert models has changed quite a bit from coremltools v3. The old neural
network converters are deprecated and have been replaced by a newer, more flexible
design.

In this book you’ll find instructions for both coremltools 4 and 3, because sometimes you
may still want to use the old converters. For example, I prefer to use coremltools 3 to
convert Keras models when possible because this results in much simpler graphs (with
some tricks, this is even possible for tf.keras models). And even though it’s no longer
recommended to convert PyTorch models through ONNX, sometimes this workaround is
still necessary when coremltools 4’s PyTorch converter chokes on the model.


The built-in converters

coremltools can convert neural networks made by the following tools:

• TensorFlow (1.x and 2.x), including tf.keras models

• PyTorch

• Caffe (not Caffe2)

• Keras (the standalone version, up to 2.2.4)

• ONNX

It also has converters for linear models and decision trees made by these packages:

• LIBSVM

• scikit-learn

• XGBoost

To convert TensorFlow and PyTorch models, you use the new unified conversion API.

You no longer need to use the separate tfcoreml or onnx-coreml packages to convert these
types of models! It’s also not recommended to use ONNX for PyTorch models anymore,
since coremltools can now directly convert PyTorch models.
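
For instance, converting a traced PyTorch model with the unified API looks roughly like this. The torchvision model and the input shape are just placeholders:

import torch
import torchvision
import coremltools as ct

# Trace the model with an example input so coremltools can read the graph.
torch_model = torchvision.models.mobilenet_v2(pretrained=True).eval()
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(torch_model, example_input)

# Convert the traced model and save it as an mlmodel file.
mlmodel = ct.convert(traced_model,
                     inputs=[ct.TensorType(name="input", shape=example_input.shape)])
mlmodel.save("MobileNetV2.mlmodel")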

The converters for Keras, Caffe, and ONNX do not use this new API. These are still the old
converters from coremltools 3. The ONNX converter is the old onnx-coreml package, which
is now part of coremltools. These converters will not be developed further and will only
get bug fixes.

For converting MXNet models, you’ll still need a separate converter in addition to coremltools.

Doing it by hand

If a converter does not exist for your model format, or the existing converter cannot handle
your trained model for some reason, you can always write your own converter:

• NeuralNetworkBuilder from the coremltools.models.neural_network.builder module lets you construct a neural network layer-by-layer.

• KNearestNeighborsClassifierBuilder from coremltools.models.nearest_neighbors lets you construct a k-Nearest Neighbors classifier.

• TreeEnsembleClassifier and TreeEnsembleRegressor from coremltools.models.tree_ensemble let you construct tree ensemble models such as random forests.

• Pipeline, PipelineClassifier, and PipelineRegressor from coremltools.models.pipeline let you build pipelines, which consist of several models in sequence that act as if they were a single model.

Note: Because mlmodel files are in protobuf format, it’s also possible to create them without
using coremltools at all by directly writing protobuf messages.

Making predictions

To make predictions with a converted Core ML model from a Python script or Jupyter note-
book, you use the coremltools.models.MLModel object. This is primarily useful for verifying
the model was converted correctly.

Note: This functionality is only available on macOS 10.13 or better. Even though coreml-
tools works fine on Linux for converting models, it can’t make predictions from Linux.
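
A minimal sketch; the input name and shape are hypothetical and should be looked up in your own model’s spec:

import numpy as np
import coremltools as ct

model = ct.models.MLModel("YourModel.mlmodel")

# Feed a dummy tensor into the (hypothetical) input named "input_1".
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = model.predict({"input_1": x})
print(outputs)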

Nurse, scalpel please!

The other functionality inside the coremltools.models module, as well as coremltools.utils,
is used to perform surgery on existing Core ML models.

For example, you can use these functions to quantize the model’s weights, to remove or
rename layers, and anything else you might want to do to patch up your Core ML model.
You’ll see many examples in this guide.
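
As a quick taste, the following sketch renames one of the model’s outputs; the feature names are made up:

import coremltools as ct

spec = ct.utils.load_spec("YourModel.mlmodel")

# Rename the (hypothetical) output "Identity" to something more descriptive.
ct.utils.rename_feature(spec, "Identity", "class_scores")

ct.models.MLModel(spec).save("YourModel.mlmodel")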

Installing coremltools

The easiest way to install coremltools is through the Python package manager pip:

pip install -U coremltools

coremltools works with both Python versions 2 and 3. You can use it on macOS but also
from Linux, although certain functionality that requires the Core ML framework, such as
making predictions, only works on the Mac.

Tip: coremltools is an active open source project and the official PyPI package19 is not
always up-to-date with the latest changes. If you don’t mind living on the edge, or if the
official release does not yet have a bug fix or a new feature that you need, you can install
coremltools directly from GitHub:

pip install -U git+https://github.com/apple/coremltools.git

For example, if loading a model using coremltools gives an error such as the following, then
try installing the latest coremltools directly from the GitHub repo.

Error compiling model: "Error reading protobuf spec. validator error:
The .mlmodel supplied is of version 3, intended for a newer version of
Xcode. This version of Xcode supports model version 2 or earlier.

But be warned that using the GitHub version may cause problems of its own… It is
under active development, which occasionally breaks things. I often find myself switching
between the GitHub version and the last official release.

Using coremltools

To use coremltools in your Python script, write:

import coremltools as ct

When Python loads the coremltools package, you may get warning messages such as the
following:

19 https://pypi.python.org/pypi/coremltools

WARNING:root:Keras version 2.3.1 detected. Last version known to be fully
compatible of Keras is 2.2.4 .
WARNING:root:TensorFlow version 1.15.0 detected. Last version known to be
fully compatible is 1.14.0 .

Usually you can safely ignore these messages, but be aware that you may run into errors
when converting your Keras or TensorFlow models. Especially the Keras API tends to
change a lot between releases, and if coremltools depends on some functionality that got
moved or renamed, the conversion will fail.

Tip: If coremltools throws an exception while converting your model, create a new
virtualenv and install the last known compatible versions of Keras and TensorFlow into
that environment — Keras 2.2.4 and TensorFlow 1.14.0 according to the warning, although
by the time you read this the version numbers may have changed — and try the conversion
from inside this virtualenv.

In the next chapters you’ll see how to use coremltools to convert models. In parts 3 and
4, you’ll learn how coremltools can be used to take apart models, fix issues, and put them
back together again.
Part 2: Converting Models

To use a machine learning model with Core ML, you first need to convert it to the mlmodel
format. This part of the book explains how to avoid potential pitfalls.
Image Preprocessing
The most common reason for getting wrong predictions out of Core ML is using incorrect
image preprocessing options.

A CVPixelBuffer usually contains pixels in ARGB or BGRA format where each color channel
is 8 bits. That means the pixel values in the image are between 0 and 255.

But your machine learning model may not actually expect pixels between 0 and 255. Other
common choices are:

• between 0 and 1

• between -1 and +1

• between -2 and +2

• between -128 and +128 with the average values of R, G, and B subtracted

• the color channels in BGR order instead of RGB

If your model expects pixel values in a different range than 0 – 255, you need to let Core
ML know about this. Core ML can then convert the CVPixelBuffer into the format that your
model understands, using a special preprocessing stage in the neural network.

This is very important! With the incorrect preprocessing settings, coremltools will create
an mlmodel file that will interpret your input images wrongly. The model will be working
on data it does not understand — and that produces results that don’t make sense. Garbage
in, garbage out.20

Note: For grayscale images, it’s also important to know what value is considered black and
what value is considered white. In some models 0 is black and 1 is white, but in others 1 is
black and 0 is white.

20 https://en.wikipedia.org/wiki/Garbage_in,_garbage_out
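As a quick illustration, here is a hedged sketch for a hypothetical inverted-grayscale model
(one that expects black = 1.0 and white = 0.0, while the input image uses the usual 0 = black,
255 = white convention), using the scale and bias preprocessing options explained later in
this chapter:

# Hypothetical inverted-grayscale model: black should become 1.0, white 0.0.
# Core ML's preprocessing computes: pixel * scale + bias.
scale = -1 / 255.0
gray_bias = 1.0

for pixel in (0, 255):                  # black, white
    print(pixel * scale + gray_bias)    # prints 1.0, then 0.0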


What is the correct preprocessing for my model?

Usually the Python training code for the model has a preprocess_input() or normalize()
function that converts the pixels into the correct format. Keras, for example, uses the
following arithmetic for models trained on ImageNet:

def preprocess_input(x):
    x /= 127.5
    x -= 1.0
    return x

You need to tell Core ML to perform this exact same preprocessing step.

If you did not train the model yourself, but you’re using a pretrained model that you
downloaded from the web, you should try to find out what sort of preprocessing is done
on the images before they go into the first neural network layer.

RGB or BGR?

Most training tools will load images in the RGB pixel order. Caffe, however, loads images
as BGR. If your training script uses OpenCV to load images, it probably uses the BGR pixel
order too!

This is no problem, as Core ML will automatically swap the red and blue color channels if
necessary. But you do need to tell it whether your model expects RGB or BGR.

If you get the pixel order wrong, your model may still appear to make reasonable predictions
— but the results won't be as good as they could be! This can be a tricky bug to find.

See chapter Making Sure the Input is Correct for a way to make 100% sure your images are
in the right pixel order.

The available preprocessing options

You specify the preprocessing options in the Python script that converts the model. There
are two different ways to do this, depending on which converter you’re using.

Using the unified conversion API from coremltools 4

With the unified converter (TensorFlow, PyTorch), you provide a ct.ImageType object for
each image input. The ImageType constructor takes the following arguments:

• scale: the pixel values will be multiplied by this number.

• bias: this is a list of 3 values that will be added to the RGB values of each pixel. If the
image is grayscale, bias should be a single value.

• color_layout: for a color image this is "RGB" or "BGR", for grayscale images it is "G".

• channel_first: lets you choose between NCHW (True) and NHWC (False) channel order.
Usually you can leave out this option and let the converter automatically choose based
on the model.

The trick is to choose appropriate values for these options!

Note: Before coremltools 4, the bias values were supplied as three separate arguments.
With the new API, you need to put these three values into a list, [red_bias, green_bias,
blue_bias], and pass that to the bias argument. For a grayscale image, bias should not be a
list but just a single value (I will call this gray_bias).
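For example, a model that expects pixels in the range -1 to +1 (the common Keras ImageNet
convention, see the usage examples below) could be described like this. This is just a sketch:
the shape is a made-up channels-last example.

import coremltools as ct

# Map 0-255 pixels to -1 ... +1: pixel * (2/255) - 1.
image_input = ct.ImageType(shape=(1, 224, 224, 3),
                           scale=2/255.0,
                           bias=[-1.0, -1.0, -1.0],
                           color_layout="RGB")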

Using the old converters or coremltools 3

With the old Keras, Caffe, and ONNX converters, the following options are available:

• image_scale: the pixel values will be multiplied by this number

• red_bias, blue_bias, green_bias: these will be added to the RGB values of each pixel

• gray_bias: like the RGB biases but for grayscale images

• is_bgr: you typically need to set this to True for Caffe models, False otherwise

Instead of the channel bias values, the Caffe converter can also use a mean image from a
“binaryproto” file. This is literally the average image over all the images in the training set.

If the model has multiple input images, you can supply a dictionary instead of a single
value, to give each input its own preprocessing options. For example:

red_bias = { "image1": -90, "image2": -110 }

How does it work?

The standard formula for normalization of data is:

normalized = (data - mean) / std

First you subtract the mean value, then you divide by the standard deviation of the data.
What Core ML does is similar but not quite the same:

normalized = data*scale + bias

The red_bias, blue_bias and green_bias values act as the normalization mean for the three
color channels (or just the gray_bias when the image only has one channel). The difference
is that the mean is subtracted while the bias is added, so if you’re given the mean you need
to flip its sign to get the bias:

red_bias = -red_mean
green_bias = -green_mean
blue_bias = -blue_mean

The scale acts like the standard deviation (std in the first formula), except that it is applied
before the bias is added, while the standard deviation would be applied after the biases are
added.

So if your training script supplies a standard deviation, you can use the scale option for
that, but you have to take the reciprocal:

scale = 1.0 / std

and you also have to apply this to the bias values to make the math work out:

red_bias /= std
green_bias /= std
blue_bias /= std
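A quick numeric check makes it easier to convince yourself the math works out. Suppose the
training code used a (hypothetical) mean of 127.5 and standard deviation of 127.5, which
maps 0 – 255 onto -1 … +1:

mean, std = 127.5, 127.5

scale = 1.0 / std       # about 0.00784
bias = -mean / std      # -1.0

for pixel in (0, 127.5, 255):
    print(pixel * scale + bias)    # prints -1.0, 0.0, 1.0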

Note: If you need to scale each color channel individually, you cannot use scale. See the
chapter Using a Different Scale for Each Color Channel.

Usage examples

0 - 255

If your model expects pixel values in the range 0 – 255, you don’t have to set preprocessing
options. Simply use the default values:

scale = 1
red_bias = 0
green_bias = 0
blue_bias = 0

0-1

If your model expects pixel values in the range 0 – 1, you should set:

scale = 1/255.0
red_bias = 0
green_bias = 0
blue_bias = 0

The bias values use the default value of 0 and have no effect.

-1 … +1

If your model expects pixel values in the range -1 to +1, you should set:

scale = 2/255.0
red_bias = -1
green_bias = -1
blue_bias = -1

This first multiplies the pixels by 2/255, which is the same as dividing by 127.5. Now they
are between 0.0 and 2.0. Then it subtracts 1.0 from each color channel to put the pixel
values between -1.0 and 1.0. This is the normalization used by Keras for ImageNet-based
models.

Caffe models

Models trained with Caffe on the ImageNet dataset will usually subtract the mean RGB
values:

red_bias = -123.68
green_bias = -116.779
blue_bias = -103.939
is_bgr = True

Now the pixels are roughly in the range -128 to +128.

Note that Caffe loads images in the BGR pixel order. Your training script may have the mean
color values listed as [103.939, 116.779, 123.68]. The first element in this list is for the blue
channel color, not red. This can be confusing, so pay attention when you copy-paste these
values from your training scripts!

This is also why is_bgr should be set to True.

Sometimes Caffe models also have a standard deviation applied during preprocessing. The
options then become:

image_scale = 0.017
red_bias = -123.68 * 0.017
green_bias = -116.779 * 0.017
blue_bias = -103.939 * 0.017
is_bgr = True

Here, the standard deviation is 58.8, which gives an image scale of 1/58.8 = 0.017. Since
scaling happens before the bias is added, you will need to multiply the red/green/blue_bias
by this scaling factor as well.

For Caffe models you can also specify the path to a mean.binaryproto file, if you have one.
It contains the mean image, i.e. the per-pixel average over all the training images. You
would use this instead of red/green/blue_bias.

Mean & standard deviation

If you are given a mean for the RGB channels and a standard deviation, you can set the
preprocessing options as follows:

scale = 1.0 / std


red_bias = -red_mean / std
green_bias = -green_mean / std
blue_bias = -blue_mean / std

Here, the standard deviation is assumed to be the same for each color channel.

Note: The coremltools converters require that the input pixels are in the range 0 - 255.
But some training tools assume that pixels are between 0 and 1. If the mean and standard
deviation are given for the 0 - 1 range, you will need to divide the scale by 255.

Per-image mean and standard deviation

If your model does the following, you can’t use the preprocessing options for this:

normalized = (img - mean(img)) / std(img)

The scale and bias options are assumed to be fixed constants that are the same for every
image. If you want to normalize each image with its own mean and standard deviation
values, look into adding a MeanVarianceNormalizeLayer to the model instead.

Different scaling for each color channel

PyTorch models often do their image preprocessing as follows:



def preprocess_input(x):
    x /= 255.0
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]
    return (x - mean) / std

This uses a separate standard deviation for each color channel. You can approximate this
with the following options:

scale = 1.0 / (255.0 * 0.226)


red_bias = -0.485 / 0.226
green_bias = -0.456 / 0.226
blue_bias = -0.406 / 0.226

However, this isn’t exactly the same thing: it uses an average standard deviation of 0.226
but the original model has a slightly different standard deviation for each color channel.
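Here is a small sketch of how you could compute those approximate values from the mean
and std lists above (plain Python, no coremltools needed):

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

avg_std = sum(std) / len(std)    # roughly 0.226

scale = 1.0 / (255.0 * avg_std)
red_bias, green_bias, blue_bias = [-m / avg_std for m in mean]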

The proper solution is to insert a “scale layer” at the beginning of the model that can scale
each channel by its own standard deviation. However, this requires model surgery. You
can read more about it in the chapter Using a Different Scale for Each Color Channel.

The model already has its own preprocessing layers

Sometimes the preprocessing calculations are part of the model already. For example, a
model's graph may start with a "multiply" layer followed by an "add" layer.

The multiply layer multiplies the pixels by the constant value 0.0078431, which is the same
as 1/127.5. The add layer subtracts the constant 1. Now the pixels are in the range -1 to +1.

If your model does something similar, then you should not supply any preprocessing
options to Core ML, otherwise the preprocessing happens twice.

Changing the preprocessing for an existing model

To examine and possibly change the preprocessing options from an existing mlmodel, see
the chapter Changing the Image Preprocessing Options.
Converting TensorFlow / PyTorch With
the Unified API
As of coremltools 4, there is a new way to convert models to Core ML. This is known as the
unified conversion API. Previously, each model format had its own specialized converter,
but now there is a single converter that can handle many different formats.

With the new conversion API, you can currently convert the following model formats to
Core ML:

• TensorFlow 1.x

• TensorFlow 2.x

• tf.keras

• PyTorch

It is still possible to convert old Keras and Caffe models, but these use their own converters
and are not part of the unified API.

The new converters are built on something called MIL or Model Intermediate Language.
Each model format has its own front-end that reads the model file and converts it to MIL.
coremltools then performs optimizations on the MIL graph and converts it to a Core ML
file. Adding new model formats is now simply a matter of implementing a new front-end.

It’s therefore likely that we’ll see support for other model formats in this unified API in the
future. If at some point a new deep learning framework becomes popular, it can get its own
MIL converter.

Note: You no longer have to use ONNX to convert PyTorch models. The new converter
directly uses a TorchScript trace from PyTorch, which is a much nicer solution than going
through an intermediate format such as ONNX.


Using the converter

The basic operation of the unified conversion API is as follows:

import coremltools as ct

class_labels = [ "cat", "dog" ]

image_input = ct.ImageType(shape=(1, 224, 224, 3),
                           scale=2/255.,
                           bias=[-1, -1, -1])

model = ct.convert(
    your_original_model,
    inputs=[ image_input ],
    classifier_config=ct.ClassifierConfig(class_labels)
)

model.save("YourModel.mlmodel")

The main conversion function is ct.convert(). You pass it the following arguments:

• model: the model object or file path

• source: which training framework was used. If you use the default of "auto", coremltools
inspects the model file to see what format it is and automatically chooses the
appropriate converter.

• inputs: describes the model’s inputs

• outputs: the names of the model’s output nodes (this is optional)

• classifier_config: used when the model is a classifier, leave empty otherwise

• minimum_deployment_target: which iOS version you want the model to target (optional)
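
As an example of that last option, you can ask the converter to target a specific OS release.
This is a sketch that reuses the objects from the example above; ct.target also has values
for other OS versions:

model = ct.convert(
    your_original_model,
    inputs=[ image_input ],
    minimum_deployment_target=ct.target.iOS13
)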

To specify the model’s inputs, you need to provide a list containing one or more TensorType
or ImageType objects.

For image inputs, you use ct.ImageType. It has the following properties:

• name: the name of the input

• shape: the shape(s) that are valid for this input

• scale, bias, color_layout: for image preprocessing



• channel_first: to choose between NCHW and NHWC layouts

Most of these properties are optional and can often be inferred from the model itself.
However, make sure you fill in the image preprocessing options correctly!

For inputs that are not images, you use ct.TensorType. This has the following properties:

• name: the name of the input

• shape: the shape(s) that are valid for this input

• dtype: the datatype of the input, for example np.float32

• default_value: if specified, this input is considered optional

The list of inputs is not always required. For example, for TensorFlow models the converter
can often automatically determine what the model’s inputs are.
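
To make the TensorType options concrete, a non-image input could be declared like this.
The input name and shape are made up for illustration:

import numpy as np

tensor_input = ct.TensorType(name="features",
                             shape=(1, 128),
                             dtype=np.float32)

model = ct.convert(your_original_model, inputs=[ tensor_input ])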

If the model is a classifier, you also need to tell it what the class labels are. This is done
using a ct.ClassifierConfig object.

Read the manual!

I’m not going to explain the conversion API in great detail here because the docs for
coremltools 4 are actually pretty good. The official documentation gives lots of examples
of how to use the new converter, and also explains how to do more advanced things such
as supporting custom layers.

• coremltools User’s Guide21

• API Reference22 — this explains exactly what arguments you need to pass into
ct.convert() for your particular model type

• Examples23

I wrote the Core ML Survival Guide because this kind of information was sorely missing in
the past, but now I recommend first looking at the official docs, and only consult my book
if you can’t find the answers there. :-)
21 https://coremltools.readme.io/docs
22 https://coremltools.readme.io/reference/convertersconvert
23 https://coremltools.readme.io/page/examples
Keras Conversion Tips
IMPORTANT: This chapter is about converting Keras models that were made with the
standalone version of Keras (2.2.4 or earlier), not with tf.keras. If you have a tf.keras
model, you should use the unified conversion API instead. The rest of this chapter assumes
you have a model that can be loaded with the standalone version of Keras.

The manual for the official Keras-to-Core ML conversion tool is at this link24 . If you
haven’t read it yet, I suggest doing so first. This chapter points out things that may not
be immediately obvious when converting models from Keras.

The steps for converting are always:

1. Load the Keras model.

2. Call coremltools.converters.keras.convert() with the correct options.

3. Optionally: change the properties of the converted model.

4. Save the converted model to an mlmodel file.

Usually you’d write a Python script to perform all these steps but you can also do them
inside the Python REPL or a Jupyter notebook. My preferred approach is to do everything
inside IPython first, then copy the code to a script so I can run it again at a later date.

I prefer to use Python 3 but if your Keras model was trained with Python 2, you should also
use Python 2 for the conversion. A model saved with Python 2 may not load in Python 3,
especially if it contains Lambda layers.

Note: Keras supports multiple backends, such as TensorFlow, Theano, and CNTK. The Keras
converter from coremltools currently only supports the TensorFlow backend. If you trained
your Keras model using another backend, you’ll have to write your own converter.

24 https://apple.github.io/coremltools/generated/coremltools.converters.keras.convert.html


Keras version woes

Keras evolves faster than coremltools, so chances are you trained the model with a version
of Keras that is not officially supported by coremltools.

When you import the coremltools package, it may print the following message:

WARNING:root:Keras version 2.3.1 detected. Last version known to be fully
compatible of Keras is 2.2.4 .
WARNING:root:TensorFlow version 1.15.0 detected. Last version known to be
fully compatible is 1.14.0 .

This means your installed versions of Keras and TensorFlow are too new for coremltools
and are not guaranteed to work correctly.

This is not necessarily a problem, unless coremltools or Keras gives an error message when
you try to convert certain models. In that case, you can try downgrading your installation
of Keras and TensorFlow to the latest supported versions.

Note: Even if you have an older version of Keras installed, it’s often possible to convert a
model that was trained with a later version of Keras — but not always. You may have to
make your model backwards compatible with the older version before you can convert it.

To downgrade, you can run pip install -U keras==2.2.4, which will uninstall the current
version and install the one you specified.

If you don’t want to mess up your current Python environment, a cleaner solution is to
create a new virtual environment and work from that:

$ virtualenv -p python3 myenv
$ source myenv/bin/activate
$ pip install tensorflow==1.14.0
$ pip install keras==2.2.4
$ pip install coremltools

Note: There should be no spaces anywhere in the path to this new environment. Folder
names with spaces break virtualenv. You can also use Anaconda and the conda package
manager.

Of course, it’s possible that now your Keras model no longer loads if it requires Keras or
TensorFlow features that are not present in this older version. Sometimes you can work
around this (see the next section). If all else fails, I recommend writing your own converter
using NeuralNetworkBuilder.

Loading the Keras model

In general, loading a Keras model looks something like this:

import keras
keras_model = keras.models.load_model("YourModel.h5")
keras_model.summary()

Tip: The line keras_model.summary() isn’t strictly needed, but is useful to verify that your
model was loaded correctly.

It’s possible you get a warning at this point, “Error in loading the saved optimizer state.”
This is no big deal, it just means that the model file does not include state that is used by
the training process. But you don’t need that right now anyway.

You’ve got bigger problems when you get an error that includes the message “No model
found in config.” This means the model file could not be loaded because it only contains
the weights but not the definition of the model architecture. This happens when the model
was saved with model.save_weights() instead of model.save().

Using the coremltools converter is easiest when you have a Keras model that was saved
with model.save(). However, if all you have is a weights file without the model definition,
you can reconstruct the Keras model in the Python script and then call load_weights() to
load the weights into the model, like so:

def create_model():
    model = Sequential()
    model.add(...)
    return model

keras_model = create_model()
keras_model.load_weights("YourModel.h5")

Some models have custom objects. MobileNet25, for example, uses the ReLU6 activation
function, which until recently was not officially part of Keras. You need to provide such
custom objects when you call load_model() or Keras gives an error message:

import keras_applications
keras_model = keras.models.load_model("mobilenet_1_0_224.h5",
    custom_objects={ "relu6": keras_applications.mobilenet_v2.relu6 })

One annoying thing about Keras is that the API keeps changing in significant ways between
minor releases. If you’re using Keras 2.2.1 or later the above code no longer works because
the relu6 function has been removed. Fortunately, you can still provide your own:

import keras.backend as K

def my_relu6(x):
    return K.relu(x, max_value=6)

keras_model = keras.models.load_model("mobilenet_1_0_224.h5",
    custom_objects={ "relu6": my_relu6 })

I simply copy-pasted that my_relu6 function from an older version of Keras (2.2.0) and now
the model loads again.

This is something to be aware of with Keras: a newer version of Keras may not be able to
load your older models without resorting to such tricks.

By the way, you don’t really need to call load_model() by hand, you can also pass the path to
the Keras model file directly to coremltools.converters.keras.convert(). However, I prefer
to use load_model(), which also allows me to inspect the Keras model and potentially make
changes to it, before running the converter.

In case you want to use the converter directly on the Keras .h5 file but your model uses
custom objects, you can write the following:

25 https://github.com/keras-team/keras-applications/blob/master/keras_applications/mobilenet.py

import coremltools as ct
from keras.utils.generic_utils import CustomObjectScope

with CustomObjectScope({ "relu6": my_relu6 }):
    coreml_model = ct.converters.keras.convert(
        "YourModel.h5", custom_conversion_functions={})

Note: For some reason you cannot pass the custom_objects dictionary directly into the
convert() function, even though all the plumbing is there to handle it. Maybe they’ll enable
this in a future version.

Conversion options

When you call ct.converters.keras.convert(), the only argument you are required to pass in
is the Keras model object (or the path to the .h5 file):

import coremltools as ct
coreml_model = ct.converters.keras.convert(keras_model)

However, this treats any inputs and outputs as MultiArrays and gives them default names.
It’s a good idea to provide the following additional arguments:

• input_names=["image"]. These are the names you want to use for the model’s inputs. If
you don’t specify this, the first input is named input1, the second is input2, and so on.
You can always change the input names later in the mlmodel spec, but it’s much easier
to already tell the converter what you want the input names to be.

• image_input_names=["image"]. Which of the inputs listed in input_names should be treated
as images instead of MultiArrays. You only need to use this if your model expects one
or more images as inputs. Now Core ML lets you provide these images as CVPixelBuffer
objects instead of MLMultiArrays.

• output_names=["labelProbs"]. The names of the model's outputs. Without this option,
the outputs will be given default names such as output1, output2, and so on.

Note: There is no image_output_names option. If your model produces an image as output,
or a tensor that can be interpreted as such, you'll need to change this yourself in the spec.

For models that work on images, it’s common to normalize the pixel values. You’ll need to
provide one or more of the following arguments:

• image_scale

• red_bias, blue_bias, green_bias for color images

• gray_bias for grayscale images

• is_bgr — usually False unless your model was trained on images loaded by OpenCV

To read more about choosing the correct values for these options, see the chapter Image
Preprocessing.

Tip: If your model has multiple image inputs and they use different preprocessing values,
you can supply these options as dictionaries instead of single values, such as
red_bias={"image1": -100, "image2": -120}.

If your model is a classifier, you also need to provide the following:

• class_labels="class_names.txt". This is the name of a text file with the class names, one
per line. You can also provide the class names as a list, which is useful if you only have
a few classes: class_labels=["cat", "dog"]. If you want to use integer labels instead of
strings, you can do class_labels=list(range(num_classes)) .

• predicted_probabilities_output="labelProbs". This is the name of the output that predicts
the class probabilities. It needs to match one of the names from output_names.

• predicted_feature_name="label". For classifier models, Core ML outputs a dictionary
that maps the class names to their predicted probabilities, but also a string with the
name of the winning class. By default this class name output is called "classLabel" but
you can change it to something else with this option.

Other possible conversion options:

• model_precision="float16". If you supply this, the model weights will be saved as 16-bit
floats. You can also change this afterwards so it’s no big deal if you forget.

• use_float_arraytype=True. With this option, any inputs or outputs that are MultiArrays
will now have 32-bit floating point values instead of 64-bit doubles. Single-precision
floats are often faster than double-precision.

Here’s a full example of how to convert a Keras classifier model to Core ML format:

import coremltools as ct

coreml_model = ct.converters.keras.convert(keras_model,
    input_names=["image"],
    image_input_names=["image"],
    output_names=["labelProbs"],
    image_scale=2/255.0,
    red_bias=-1.0,
    green_bias=-1.0,
    blue_bias=-1.0,
    class_labels="imagenet_labels.txt",
    predicted_probabilities_output="labelProbs",
    predicted_feature_name="label",
    model_precision="float16")

Undefined input shapes

When creating a Keras model, it’s possible to specify an input shape that is partially None.
Here is a basic example:

from keras.models import *
from keras.layers import *

def create_model():
    inp = Input(shape=(None, None, 3))   # look here!
    x = Conv2D(16, (3, 3), padding="same")(inp)
    x = GlobalAveragePooling2D()(x)
    x = Dense(10)(x)
    return Model(inp, x)

keras_model = create_model()
coreml_model = ct.converters.keras.convert(keras_model,
    input_names=["image"],
    image_input_names=["image"])

Converting this model gives an error message such as “ValueError: not enough values to
unpack (expected 3, got 1)” or “ValueError: Invalid input shape for image”.

The problem is that the input shape is (None, None, 3). The Keras converter now thinks the
model’s input has a single dimension of size 3.

You can see this if you run the conversion without the image_input_names argument, so that
coremltools does not try to turn this input into an image but leaves it as a MultiArray.
Now print(coreml_model) shows that the model has an input with a single shape dimension
instead of three shapes:

input {
  name: "image"
  type {
    multiArrayType {
      shape: 3
      dataType: DOUBLE
    }
  }
}

This is wrong, as the original Keras model really expected an input with three dimensions.
(Depending on the version of coremltools you’re using it’s also possible you already get
an error message at this point and that the conversion doesn’t work at all, even without
image_input_names.)

There are two ways to fix this:

1. When creating the Keras model, don’t use None in the input shape but fill in actual
numbers. If you want the model to support flexible input sizes, you can change this in
the mlmodel file afterwards. See the chapter Size Flexibility.

2. Supply the parameter input_name_shape_dict. This tells the Keras converter what the
input shape is supposed to be for each input.

Example:

coreml_model = ct.converters.keras.convert(keras_model,
    input_names=["image"],
    image_input_names=["image"],
    input_name_shape_dict={"image": [None, 224, 224, 3]})

This tells the converter that the input named "image" has the shape [None, 224, 224, 3].

The first value should be None as this is the batch size and you should leave it unspecified.
The second dimension is the height, the third is the width, and the last is the number of
input channels. Note that this is in Keras order, not Core ML order (which puts channels
first, not last)!

Note: At the time of writing, input_name_shape_dict only works with the version of coreml-
tools from GitHub, not the official package. It also works only with Keras 2.0 or higher.

Custom layers

If during conversion you get an error message such as "ValueError: Keras layer '<class
'xxx'>' not supported", then chances are the Keras model uses a custom layer that
coremltools does not know about.

Keras allows you to define custom layers using the special Lambda layer type or by
subclassing the Layer class. Naturally, the coremltools Keras converter does not understand
how to handle such layers and you have to help it out a little.

If your Keras model has Lambda layers or other custom layers, you have to provide these
additional arguments to the converter:

• add_custom_layers=True. This will add any unknown layer types to the mlmodel file as
CustomLayerParams objects.

• custom_conversion_functions={}. This dictionary lets you provide a function for each
unknown layer that fills in the properties for that layer.

You always have to supply the custom_conversion_functions argument. If you don’t, coreml-
tools will give an error when it finds a custom layer. It’s OK to leave the dictionary empty.
In that case, the converter will simply add empty CustomLayerParams layers to the model for
all the unknown layer types it encounters.

Here is a minimal example:



import keras
from keras.models import *
from keras.layers import *
from keras import backend as K

def my_custom_layer(x):
    return K.pow(x, 2)

def create_model():
    inp = Input(shape=(256, 256, 3))
    x = Lambda(my_custom_layer, name="custom1")(inp)   # look here!
    x = GlobalAveragePooling2D()(x)
    return Model(inp, x)

keras_model = create_model()

This creates a model with just two layers. The first one is a Lambda layer — this is the custom
layer — and the second one is a GlobalAveragePooling2D layer, just for fun.

The keras_model.summary() is as follows:

Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 256, 256, 3)       0
_________________________________________________________________
custom1 (Lambda)             (None, 256, 256, 3)       0
_________________________________________________________________
global_average_pooling2d_1 ( (None, 3)                 0
=================================================================

To convert this model to Core ML, you should write:

import coremltools as ct

coreml_model = ct.converters.keras.convert(keras_model,
    input_names="image",
    image_input_names="image",
    output_names="output",
    add_custom_layers=True,
    custom_conversion_functions={})

After the conversion completes, print(coreml_model._spec.neuralNetwork.layers[0]) will show
the following:

name: "custom1"
input: "image"
output: "custom1_output"
custom {
}

Notice how this layer is of type custom. Before you can actually use the mlmodel you
still need to fill in the properties for this layer. If you don't, then Xcode won't be able to
load the model and it will give an error message such as: "validator error: Custom layer
custom1 has an empty 'className' field. This field is required in order for Core ML to link
to the implementation for this custom class."

To fill in the properties of the custom layer, you need to write the following:

custom_layer = coreml_model._spec.neuralNetwork.layers[0]
custom_layer.custom.className = "MyCustomLayer"
custom_layer.custom.description = "Takes the element-wise square"

At the very least you need to fill in the className so that Xcode knows which MLCustomLayer
class it needs to load in order to execute this layer when you run the model.

The CustomLayerParams object has several other properties, for the weights and any other
parameters your custom layer may have.

Rather than filling in the properties of your custom layers afterwards, you can also do it
while the model is being converted by providing a conversion function. Like so:

def convert_lambda(layer):
    print("converting lambda layer '%s'" % layer.name)
    params = ct.proto.NeuralNetwork_pb2.CustomLayerParams()
    params.className = "MyCustomLayer"
    params.description = "Takes the element-wise square"
    return params

coreml_model = ct.converters.keras.convert(keras_model,
    input_names="image",
    image_input_names="image",
    output_names="output",
    add_custom_layers=True,
    custom_conversion_functions={ "Lambda": convert_lambda })

Now the convert_lambda() function is called for every Lambda layer in the model. It fills in
the className and description like before.

Note: If your model has more than one Lambda layer, this same conversion function is used
for all of them. You can add a check such as if layer.function == my_custom_layer to tell the
different layers apart.

This works the same way for custom layers that subclass the Keras Layer class. You would
make such a subclass for custom layers that have trainable weights or other parameters.
In the dictionary you now use the name of that subclass instead of "Lambda".

If you used Activation(my_custom_layer) instead of Lambda(my_custom_layer) then you need
to use "my_custom_layer" as the dictionary key.

For more information on custom layers in Core ML, see the chapter Creating Your Own
Custom Layers.

Note: Custom layers are only supported for Keras 2.0 and up.

Making the model updatable

As of Core ML 3, models can be trained on-device. This is really easy to enable for Keras
models, simply pass in the argument respect_trainable=True when you call convert().

coreml_model = ct.converters.keras.convert(keras_model, ...,
    respect_trainable=True)

However, if the Keras model was not compiled before it was saved to a .h5 file, you may get
messages such as the following:

WARNING:root:Core ML conversion was asked to respect trainable parameters
from the Keras model, but the input model does not include a loss layer.
WARNING:root:Core ML conversion was asked to respect trainable parameters
from the Keras model, but the input model does not include an optimizer.

In that case, first compile the model before converting it:



keras_model.compile(loss=keras.losses.categorical_crossentropy,
    optimizer=keras.optimizers.SGD(lr=0.01),
    metrics=["accuracy"])

Currently, the supported losses are:

• keras.losses.categorical_crossentropy

• keras.losses.mean_squared_error

And the supported optimizers are:

• keras.optimizers.SGD

• keras.optimizers.Adam

It’s probably not a good idea to set all layers to be trainable. Usually you’d only train the
last couple of layers. To only select certain layers to be updated, do:

# First disable training on all layers:
for layer in keras_model.layers:
    layer.trainable = False

# Then enable training on the last few layers again:
keras_model.layers[-3].trainable = True
keras_model.layers[-1].trainable = True

You’ll need to compile() the Keras model again if you change the trainable property of
any layers. Tip: use keras_model.summary() to verify the model has the correct number of
trainable params.

Now, convert(keras_model, respect_trainable=True) will convert the model and make any
layer that is trainable in the Keras model also trainable in the Core ML model.

Note: Core ML 3 only supports training of convolutional and fully-connected layers. You
may see messages such as, “layer ‘name’ is marked updatable, but Core ML does not yet
support updating layers of this type. The layer will be frozen in Core ML.” This means that
training has no effect on that particular layer.

After converting the model, the following should print True:



print(coreml_model._spec.isUpdatable)

The new Core ML model now includes the loss function and the optimizer configuration. In
addition to the regular inputs, which are used only for inference, the model also has new
training inputs. View these with:

print(coreml_model._spec.description.trainingInput)

For an image classifier model, there will be two training inputs:

1. The first is for the training image. This has the same name and data type as the normal
image input. When training, this image input is used instead of the regular one.

2. The second is for the target. In the case of a classifier, the target would be the true
class label. This is connected to the loss function, so that Core ML can compare the
predicted label with the correct one and compute the loss.

At this point it’s a good idea to verify that everything was set up correctly. You can do this
using NeuralNetworkBuilder.

builder = ct.models.neural_network.NeuralNetworkBuilder(spec=coreml_model._spec)

To view the layers that can be trained:

builder.inspect_updatable_layers()

And to view the loss function and the optimizer settings:

builder.inspect_loss_layers()
builder.inspect_optimizer()

By default, the number of training epochs is set to 1, but you can change this as follows:

builder.set_epochs(10)

And that’s it, you now have a model that can be trained on-device.

It’s still possible to get an error message when opening the model in Xcode. For example,
“validator error: There is a layer (xxx), which does not support backpropagation, between
an updatable marked layer and the loss function.” That means your model can only be
trained up until that particular layer, but any layers before it cannot be trained.

Note: If converting the model with respect_trainable=True fails for some reason, you can
always make the model updatable afterwards.

Changing the properties of the converted model

The Keras converter returns an MLModel object. You can use this to change some basic
properties:

coreml_model.short_description = "Your awesome model"
coreml_model.author = "Your name"
coreml_model.license = "Copyright by you"

coreml_model.input_description["image"] = "The input image"

coreml_model.output_description["labelProbs"] = "The predicted probabilities"
coreml_model.output_description["label"] = "The class with the highest score"

It’s wise to fill in the input and output descriptions, so it is clear what they are for. Note
that you should use the names from the input_names and output_names arguments you used
earlier.

To see a summary of your model, do print(coreml_model).

For more invasive changes, you can get the spec object and tweak the model to suit your
needs. For example, to change the output type to image or to quantize the weights.

It’s also a good idea at this point to verify the conversion is successful by making a few
predictions on test data.
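
For example, on a Mac a quick sanity check could look like this. It is a sketch that assumes
the classifier example from this chapter, with an input named "image" and outputs named
"labelProbs" and "label"; the test image and the 224×224 input size are placeholders:

from PIL import Image

img = Image.open("test_image.jpg").resize((224, 224))
result = coreml_model.predict({"image": img})

print(result["label"])                      # the winning class
top5 = sorted(result["labelProbs"].items(), key=lambda kv: -kv[1])[:5]
print(top5)                                 # the five highest probabilities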

Tip: You can also use separate Python scripts to modify and test the model. That way you
have one script for doing the conversion, one script for doing model surgery (if any), and
a third script for verifying that the model works correctly.

Saving the model

This is the easy part:



coreml_model.save("YourModel.mlmodel")

If you made changes to the model’s spec, you should save as follows:

new_model = ct.models.MLModel(spec)
new_model.save("YourModel.mlmodel")

By creating a new MLModel instance, Core ML will try to compile the spec. If there’s anything
wrong with the spec, you’ll know right away.
Converting tf.keras With the Old
Converter
There are two ways to convert Keras models to Core ML:

• For models made with standalone Keras (up to version 2.2.4), use the old Keras
converter.

• For models made with tf.keras, use the new unified conversion API.

However, in some cases it is possible to convert a tf.keras model using the old Keras
converter.

Why would you do this? A significant downside of using the official tf.keras converter is
that it always produces a model that only works on iOS 13 or above (Core ML 3).

To run your tf.keras model on iOS 12 or earlier, you can try the following:

1. Export the model’s weights in h5 format using model.save_weights().

2. Switch to a TensorFlow 1.x environment and install Keras with pip install keras==2.2.4.

3. Create a Python script that defines your model but instead of using tf.keras.layers,
use keras.layers — in other words, remove the tf. prefix everywhere. The easiest way
to do this is to import tf.keras as keras in your Python code.

4. Now you can load the weights from the .h5 file into this standalone Keras model, using
model.load_weights().

5. Convert with coremltools.converters.keras.convert(). See the Keras chapter for details
on how to use this converter.

Tip: I've successfully used this method to convert tf.keras models where the new conversion
API failed. It is not always possible to use this method, but sometimes it's a useful
workaround.
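
Here is a rough sketch of what steps 3 to 5 can look like, using a small hypothetical model.
Your own create_model() will obviously be different, and this only works if every layer you
use also exists in standalone Keras 2.2.4 and the layer names and order match the weights
file:

# Run this inside the TensorFlow 1.x / Keras 2.2.4 environment.
import keras
from keras.models import Sequential
from keras.layers import Conv2D, GlobalAveragePooling2D, Dense
import coremltools as ct

def create_model():
    # Same architecture as the original tf.keras model, but written
    # against standalone keras.layers instead of tf.keras.layers.
    model = Sequential()
    model.add(Conv2D(16, (3, 3), padding="same", input_shape=(224, 224, 3)))
    model.add(GlobalAveragePooling2D())
    model.add(Dense(10, activation="softmax"))
    return model

keras_model = create_model()
keras_model.load_weights("YourModel_weights.h5")   # saved with model.save_weights()

coreml_model = ct.converters.keras.convert(keras_model,
    input_names=["image"],
    image_input_names=["image"])
coreml_model.save("YourModel.mlmodel")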

Caffe Conversion Tips
The manual for the official Caffe-to-Core ML conversion tool is at this link26 . If you
haven’t read it yet, I suggest doing so first. This chapter points out things that may not
be immediately obvious when converting models from Caffe.

Tip: Usually you’d write a Python script to perform the conversion but you can also do it
from the Python REPL or a Jupyter notebook. My preferred approach is to do everything
inside IPython first, then copy the code to a script so I can run it again at a later date.

Caffe models are binary files with the extension .caffemodel. In addition to the binary file
you often will have one or more .prototxt files.

To convert the model you need the prototxt file with “deploy” in its name. This contains
the definition of the model architecture used for inference. The other prototxt files (if you
have them) are for training. You can open these prototxt files with a text editor and make
changes to them if needed.

Note: Caffe can also save models in HDF5 or .h5 format, but the coremltools converter
cannot read such files. It only handles .caffemodel files, which are in a binary protobuf
format.

Converting the Caffe model to Core ML is as straightforward as:

import coremltools as ct

coreml_model = ct.converters.caffe.convert(
    ("YourModel.caffemodel", "YourModel_deploy.prototxt")
)

If you don’t have a deploy prototxt but you do have one for training, you can use that instead
with a few modifications. You’ll have to remove the loss and accuracy layers and any other
things that are used for training, such as data augmentation on the input layer.
26 https://apple.github.io/coremltools/generated/coremltools.converters.caffe.convert.html


If you don’t have any prototxt files at all, you can try passing just the .caffemodel file to
caffe.convert(). It might work if no critical information is missing, but I tried it with a few
models and none of them converted without the deploy prototxt.

Note: When using the version of coremltools from GitHub, I got the error message:
“ImportError: cannot import name ‘libcaffeconverter’”. Using the official package fixes
this. If you really must have the very latest version, you can also clone the coremltools
repo, build it using cmake, and then use that. See the coremltools README for details.

Official Caffe only!

Important: There is not just one version of Caffe. There are many different forks.

Caffe is written in C++ and to add a custom layer type you have to change the source code
of Caffe itself. So every time a researcher wants to try something new, they fork Caffe and
make the changes in their own fork.

The coremltools Caffe converter can only convert models that were trained with the official
version of Caffe27 .

If you used a customized Caffe to train your model, it may have a layer type that coremltools
does not understand and the conversion will give an error message such as “RuntimeError:
Unable to load caffe network Prototxt file: YourModel.prototxt”.

There are a couple of ways to work around this:

1. Hack coremltools to support your version of Caffe. The caffeconverter folder contains
the C++ source code for libcaffeconverter, which is used as a Python plug-in by
coremltools. You can add a converter routine for your custom layer type to the C++
libcaffeconverter library and build your own local version of coremltools.

2. Write your own converter using NeuralNetworkBuilder.

Conversion options

The Caffe converter reads the names of the model’s inputs and outputs from the deploy
prototxt. Often the input is named "data". If you want to use other names, you can change
them in the prototxt before you run the converter, or you can change them in the mlmodel
afterwards.

27 http://caffe.berkeleyvision.org

If your model expects one or more inputs to be images, you should also provide the
following argument to the Caffe converter:

• image_input_names=["data"]. Which of the model's inputs should be treated as images
instead of MultiArrays. Now Core ML lets you provide these images as CVPixelBuffer
objects instead of MLMultiArrays.

Note: There is no image_output_names option. If your model produces an image as output,
or a tensor that can be interpreted as such, you'll need to change this yourself in the spec.

For models that work on images, it’s common to normalize the pixel values. You’ll need to
provide one or more of the following arguments:

• is_bgr=True because Caffe uses OpenCV to load the images in BGR pixel order

• image_scale

• red_bias, blue_bias, green_bias for color images

• gray_bias for grayscale images

To read more about choosing the correct values for these options, see the chapter Image
Preprocessing.

Tip: If your model has multiple image inputs and they use different preprocessing values,
you can supply these options as dictionaries instead of single values, such as
red_bias={"image1": -100, "image2": -120}.

Instead of using a constant bias value for the red, green and blue channels, some Caffe
models subtract a different mean value for every pixel. This mean image is literally the
average of all the images in the training set. If your model does this, you can supply the
mean image like so:

coreml_model = ct.converters.caffe.convert(
    ("YourModel.caffemodel", "YourModel_deploy.prototxt", "MeanImage.binaryproto"),
    ...other options...)

If your model is a classifier, you also need to provide the following:

• class_labels="class_names.txt". This is the name of a text file with the class names,
one per line. Unlike with the Keras converter, you cannot provide the class labels as a
list, only as a text file.

• predicted_feature_name="label". For classifier models, Core ML outputs a dictionary
that maps the class names to their predicted probabilities, but also a string with the
name of the winning class. By default this class name output is called "classLabel" but
you can change it to something else with this option.

Other possible conversion options:

• model_precision="float16". If you supply this, the model weights will be saved as 16-bit
floats. You can also change this afterwards so it’s no big deal if you forget.

Here’s a full example of how to convert a Caffe classifier model to Core ML format:

import coremltools as ct

scale = 0.017

coreml_model = ct.converters.caffe.convert(
    ("YourModel.caffemodel", "YourModel_deploy.prototxt"),
    image_input_names="data",
    is_bgr=True,
    image_scale=scale,
    red_bias=-123.68*scale,
    green_bias=-116.78*scale,
    blue_bias=-103.94*scale,
    class_labels="synset_words.txt",
    predicted_feature_name="label",
    model_precision="float16")

Note that for this particular model, the image preprocessing options used a standard
deviation of about 58, which is why the image_scale is 0.017 or 1/58 and the biases are also
multiplied by that scale factor.

Undefined input dimensions

You may get an error such as: “RuntimeError: Unable to infer input name and dimensions.
Please provide a .prototxt file with ‘Input’ layer and dimensions defined.”

This happens when the prototxt does not define the dimensions for the model’s input. To
fix this, open the deploy prototxt file in a text editor and add the following at the top of the
file:

input: "data"
input_dim: 1
input_dim: 3
input_dim: 224
input_dim: 224

The order of the dimensions is: batch size, channel, height, width. You need to add one of
these sections for every input in the model. The name of the input must be the same as the
bottom field in the first layer.

Changing the properties of the converted model

The Caffe converter returns an MLModel object. You can use this to change some basic
properties:

coreml_model.short_description = "Your awesome model"
coreml_model.author = "Your name"
coreml_model.license = "Copyright by you"

coreml_model.input_description["data"] = "The input image"

coreml_model.output_description["prob"] = "The predicted probabilities"
coreml_model.output_description["label"] = "The class with the highest score"

It’s wise to fill in the input and output descriptions, so it is clear what they are for. Note that
you should use the correct input and output names, here "data" and "prob", respectively.
The "label" output is from the converter option predicted_feature_name.

To see a summary of your model, do print(coreml_model). This includes the names of the
inputs and outputs, in case you’re not sure what they are.

For more invasive changes, you can get the spec object and tweak the model to suit your
needs. For example, to change the output type to image or to quantize the weights.

It’s also a good idea at this point to verify the conversion is successful by making a few
predictions on test data.

Tip: You can also use separate Python scripts to modify and test the model. That way you
have one script for doing the conversion, one script for doing model surgery (if any), and
a third script for verifying that the model works correctly.

Saving the model

This is the easy part:

coreml_model.save("YourModel.mlmodel")

If you made changes to the model’s spec, you should save as follows:

new_model = ct.models.MLModel(spec)
new_model.save("YourModel.mlmodel")

By creating a new MLModel instance, Core ML will try to compile the spec. If there’s anything
wrong with the spec, you’ll know right away.

Note: At the moment, the Caffe converter cannot create trainable models. See the chapter
Making a Model Updatable on how to do this by hand.
TensorFlow 1.x Conversion With
tfcoreml
IMPORTANT: As of coremltools 4, there is a new and recommended way to convert
TensorFlow models. The information in this chapter applies to coremltools version 3 only —
it will not work with coremltools 4 or later! Even though this is no longer the recommended
way to convert TensorFlow models, I am keeping this information in the book. It is still a
valid way to convert models.

There are three ways to convert TensorFlow 1.x models:

1. using the separate package tfcoreml

2. using coremltools.converters.tensorflow that is included in coremltools 3

3. using the new unified conversion API from coremltools 4

As mentioned, the new coremltools 4 approach is the recommended one. You can read
more about it in the chapter Converting TensorFlow / PyTorch With the Unified API.

This chapter describes how to use tfcoreml for converting TF 1.x models. You can also use
the method from coremltools 3, but I don’t recommend it.

One reason why tfcoreml is still a valid choice is that the other converters produce models
with layers that only work with Core ML 3 and later — for example, SoftmaxND instead
of a regular Softmax. Such models only run on iOS 13 and up.

A disadvantage of tfcoreml is that it only supports frozen graphs (.pb files), not the more
modern SavedModel files.

Tip: The coremltools repo contains a number of examples28 of how to use tfcoreml to
convert TensorFlow 1.x models to Core ML. Recommended reading!
28 https://github.com/apple/coremltools/tree/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/neural_network_inference/tensorflow_converter/Tensorflow_1


Using tfcoreml

Installing tfcoreml is as simple as:

pip install -U coremltools==3.4
pip install -U tfcoreml

To use the very latest version straight from GitHub, you can do:

pip install -U git+https://github.com/tf-coreml/tf-coreml.git

Naturally, you’ll also need to have TensorFlow installed.

Tip: Usually you’d write a Python script to perform the conversion but you can also do it
from the Python REPL or a Jupyter notebook. My preferred approach is to do everything
inside IPython first, then copy the code to a script so I can run it again at a later date.

Limitations

TensorFlow isn't just for training neural networks, it's a tool for building computational
graphs. These graphs are composed of low-level operations such as addition, exponentiation,
matrix multiplications, and so on.

The building blocks used by Core ML, neural network layers, are at a much higher level
of abstraction. What this means is that you cannot just convert any arbitrary TensorFlow
graph to Core ML — only those graphs that use operations supported by Core ML.

tfcoreml is pretty clever and will do its best to convert the operations from the TensorFlow
graph into something that Core ML understands. But still, it’s quite likely that you won’t be
able to convert your TensorFlow model to Core ML, or at least not the entire graph.

The tfcoreml source code29 has a list of all the supported operations. There is also an older
version of the README30 with a list of supported ops.
29 https://github.com/tf-coreml/tf-coreml/blob/master/tfcoreml/_ops_to_layers.py
30 https://github.com/tf-coreml/tf-coreml/blob/6cbddf78de0c69c988229d46b80e487ec97e93e0/README.md

Different model types

TensorFlow lets you save your models in a few different formats:

• a full graph definition, including training operations

• a training checkpoint

• a frozen graph that is optimized for inference

• a SavedModel object

• a tflite file for use with TensorFlow Lite

The tfcoreml converter can only convert frozen graphs to Core ML. If your TensorFlow
model is saved in a different format, you’ll have to convert it first.

To convert a checkpoint to a frozen graph, you can use the freeze_graph.py31 script. This
removes all the training operations and optimizes the model for inference.

You can also do this straight from your own Python code. For example, if your model is a
SavedModel you would first read it into a new Graph object:

import tensorflow as tf

def load_saved_model(path):
    the_graph = tf.Graph()
    with tf.Session(graph=the_graph) as sess:
        tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], path)
    return the_graph

the_graph = load_saved_model("YourModelFolder/")

Next, call strip_unused() on the graph to remove all unnecessary operations. You’ll need to
pass in the names of the tensors that you want to use for the model’s inputs and outputs.
(This is useful for when you only want to convert a portion of the complete graph.)

After optimizing the graph, it is saved as a new frozen graph file:

31 https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py

from tensorflow.python.tools import strip_unused_lib
from tensorflow.python.framework import dtypes
from tensorflow.python.platform import gfile

def optimize_graph(graph, output_path, input_nodes, output_nodes):
    gdef = strip_unused_lib.strip_unused(
        input_graph_def=graph.as_graph_def(),
        input_node_names=input_nodes,
        output_node_names=output_nodes,
        placeholder_type_enum=dtypes.float32.as_datatype_enum)

    with gfile.GFile(output_path, "wb") as f:
        f.write(gdef.SerializeToString())

optimize_graph(the_graph, "frozen_graph.pb", ["input"], ["output_tensor"])

Now you can pass this "frozen_graph.pb" path to the tfcoreml converter.

Note: The names of the input and output tensors here do not have :0 at the end!

What you need to know about tensor names

To be able to convert your TensorFlow graph, tfcoreml needs to know the names of any
output tensors. (It can figure out the input tensors by itself.)

Tensor names are a little peculiar in that they have a :0 at the end, e.g. "output_tensor:0".
Every operation in the graph has such a tensor associated with it. The :0 refers to the first
output of the operation. If an operation has multiple outputs, they’re named :1, :2, and so
on.

In order to figure out which tensors are the inputs and outputs for your model, you
can inspect the TensorFlow graph with Netron. You can also load the graph using the
TensorFlow Python API and print out a list of all the operations. The chapter Verifying
the Conversion is Successful shows how to do this. The tfcoreml repo32 also includes some
useful helper scripts in the utils folder, such as inspect_pb.py.

Often the first tensor in the model is the input tensor and the last tensor in the model is the
output tensor, but there is no guarantee this is the case for all models.
32 https://github.com/tf-coreml/tf-coreml
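If you just want a quick look without leaving Python, here is a minimal sketch (TF 1.x) that loads a frozen graph and prints every operation, so you can spot candidate input and output tensors. The "frozen_graph.pb" path is a placeholder:

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

# Print the name, type, and output shapes of every operation.
for op in graph.get_operations():
    print(op.name, op.type, [t.shape for t in op.outputs])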

The tensor names can be quite the mouthful. For example, the output tensor in MobileNet
is named "MobilenetV1/Predictions/Reshape_1:0". Note that whoever created this model
spelled it Mobilenet not MobileNet. Case sensitivity matters in tensor names!

tfcoreml will convert any : and / characters inside the input and output tensor names to two
underscores __ so that these can be used as Swift or Objective-C identifiers. That will result
in really ugly names in your mlmodel. The MobileNet output is now named "MobilenetV1__Predictions__Reshape_1__0". Yuck.

To do this name conversion yourself, write:

tensor_swiftified = tensor_name.replace(":", "__").replace("/", "__")

Sometimes tfcoreml uses the original name with the :0 at the end, sometimes it uses the
Swiftified name. Pay attention to which spelling is used when!

Note: tfcoreml only renames the tensors for the model’s inputs and outputs. The layers
inside the model keep the original TensorFlow name.

Doing the conversion

The TensorFlow converter is very similar to the ones from coremltools, which is not so
strange as tfcoreml is built on top of coremltools. To convert the model, write:

import tfcoreml

coreml_model = tfcoreml.convert(
    tf_model_path="your_frozen.pb",
    mlmodel_path="YourModel.mlmodel",
    output_feature_names=["output_tensor:0"])

Note: When you do import tfcoreml, it also loads coremltools. You may get warnings that the
installed version of TensorFlow is not officially supported by coremltools, but that warning
is only relevant to the Keras converter. tfcoreml seems to work fine with any recent version
of TensorFlow.

At the very least you need to fill in the following when you call tfcoreml.convert():

• tf_model_path: the path to the frozen TensorFlow graph file. If you don’t have a frozen
graph yet, you need to make one first.

• mlmodel_path: the path where the converted Core ML model file will be written.

• output_feature_names: the names of the outputs in the frozen TensorFlow graph. Note
that these tensor names must have a :0 at the end.

The tfcoreml.convert() function returns an MLModel object that you can use to change the
metadata such as the input and output descriptions, to make test predictions, or to tweak
the model’s spec. You don’t need to save this MLModel object to a file because tfcoreml already
did so as part of the conversion process — unless you make changes to it, of course.

Conversion options

tfcoreml will automatically read the names of the input tensors from the graph. If your
model expects one or more inputs to be images, you should also provide the following
argument to tfcoreml.convert():

• image_input_names=["input:0"]. Which of the model's inputs should be treated as images instead of MultiArrays. Now Core ML lets you provide these images as CVPixelBuffer objects instead of MLMultiArrays. Don't forget the :0 at the end of the input name!

Note: There is no image_output_names option. If your model produces an image as output, or a tensor that can be interpreted as such, you'll need to change this yourself in the spec.

For models that work on images, it’s common to normalize the pixel values. You’ll need to
provide one or more of the following arguments:

• image_scale

• red_bias, blue_bias, green_bias for color images

• gray_bias for grayscale images

• is_bgr — usually False unless your model was trained on images loaded by OpenCV

To read more about choosing the correct values for these options, see the chapter Image
Preprocessing. Sometimes TensorFlow graphs already include preprocessing operations of
their own, in which case you should use the default values for these conversion options.

Tip: If your model has multiple image inputs and they use different preprocessing values, you can supply these options as dictionaries instead of single values, such as red_bias={"image1": -100, "image2": -120}.

If your model is a classifier, you also need to provide the following:

• class_labels="class_names.txt". This is the name of a text file with the class names, one per line. You can also provide the class names as a list, which is useful if you only have a few classes: class_labels=["cat", "dog"]. If you want to use integer labels instead of strings, you can do class_labels=list(range(num_classes)).

• predicted_probabilities_output="output_tensor". This is the name of the output that predicts the class probabilities. This should be one of the output_feature_names but with the : and / replaced by __. You also need to provide the predicted_feature_name argument, or else tfcoreml will ignore your predicted_probabilities_output and assume the last output layer is the one that predicts the probabilities.

• predicted_feature_name="label". For classifier models, Core ML outputs a dictionary that maps the class names to their predicted probabilities, but also a string with the name of the winning class. By default this class name output is called "classLabel" but you can change it to something else with this option.

Other conversion options:

• minimum_ios_deployment_target="13". The default deployment target is iOS 12, which means tfcoreml can only use layers that are compatible with Core ML 2. If you set this to "13" (including quotes), tfcoreml calls coremltools.converters.tensorflow to convert your model. This produces an mlmodel file that has Core ML 3-style layers in it and can only run on iOS 13 and later.

Unlike the Keras and Caffe converters, there is no model_precision option to save the model
weights as 16-bit floats, but you can change this afterwards.
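For example, here's a minimal sketch of how you could do that afterwards with the quantization utilities from coremltools (the file names are placeholders):

import coremltools
from coremltools.models.neural_network import quantization_utils

model = coremltools.models.MLModel("YourModel.mlmodel")

# Convert the weights to 16-bit floats, roughly halving the file size.
model_fp16 = quantization_utils.quantize_weights(model, nbits=16)
model_fp16.save("YourModel_fp16.mlmodel")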

Note: If the original TensorFlow model was quantized, the Core ML model won’t be. You
will need to quantize it again yourself.

Here’s a full example of how to convert a TensorFlow classifier model to Core ML format:

import tfcoreml

input_tensor = "input:0"
output_tensor = "output_tensor:0"
output_swiftified = output_tensor.replace(":", "__").replace("/", "__")

coreml_model = tfcoreml.convert(
    tf_model_path="your_frozen.pb",
    mlmodel_path="YourModel.mlmodel",
    output_feature_names=[output_tensor],
    image_input_names=input_tensor,
    is_bgr=False,
    red_bias=-1.0,
    green_bias=-1.0,
    blue_bias=-1.0,
    image_scale=2./255,
    class_labels="imagenet_labels.txt",
    predicted_feature_name="label",
    predicted_probabilities_output=output_swiftified)

Undefined input shapes

When creating a TensorFlow graph, it’s possible to specify an input shape that is partially
None. Converting a graph with an input shape that is not completely defined gives an error
message such as: “ValueError: input:0 is a placeholder with incomplete shape (1, ?, ?, 3).”

If this happens, you need to supply an input_name_shape_dict, like so:



input_width = 224
input_height = 224

coreml_model = tfcoreml.convert(
    ...other options...,
    input_name_shape_dict={ "input:0" : [1, input_height, input_width, 3] })

This tells the converter that the tensor named "input:0" has the shape [1, 224, 224, 3].

The first dimension should be 1 as this is the batch size. The second dimension is the height,
the third is the width, and the last is the number of input channels. Note that this is in
TensorFlow order, not Core ML order (which puts channels first, not last)!

Note: Usually you’d use None for the batch size dimension to indicate the batch size is
flexible. Here, however, it should be 1 and not None. The batch size is ignored in the final
Core ML model, but has to be a valid number during conversion as tfcoreml will run the
TensorFlow graph with a batch of fake data in order to determine the output shapes of all
the layers.

Unsupported operations

As I mentioned earlier, tfcoreml can only convert graphs whose operations are compatible
with Core ML. If you get an error such as, “NotImplementedError: Unsupported Ops of type:
xxx” then you have two options:

1. Pass the argument minimum_ios_deployment_target="13" to tfcoreml.convert. This makes all the new Core ML 3 layer types available. Of course, now your model requires iOS 13 or later. It does not always solve all problems, because even Core ML 3 does not support all the operations from TensorFlow.

2. If most of your graph is OK except for some operation near the end of the graph, then at least you can convert the graph up to that point. Find a tensor as close to the end as possible that still works and put that into output_feature_names, so that tfcoreml only runs the graph up to the point where all operations are still supported. You can implement the rest of the model by manually inserting custom layers into the mlmodel, or postprocess the MultiArray result on the CPU using Swift or Objective-C code.

If you’re really dedicated, it’s also possible to split the TensorFlow model into several
smaller graphs, convert each of them separately, and then glue the resulting mlmodel files
back together again.

You can use the optimize_graph() function shown earlier to split up the TensorFlow graph.
Part 4 of this book gives some hints on how to perform the kind of model surgery needed
to combine the layers from several different mlmodel files into a single Core ML model.

Note: If you get an error such as “ValueError: Graph has cycles.” then there’s not much you
can do. Such graphs cannot be automatically converted to Core ML. You’ll have to write
your own converter.

Custom layers

If your TensorFlow graph has unsupported operations, you can also let tfcoreml insert
custom layers into the Core ML model automatically. You have to provide the following
additional arguments to the converter:

• add_custom_layers=True. This will add any unknown operations to the mlmodel file as
CustomLayerParams objects.

• custom_conversion_functions={}. This dictionary lets you provide a function for each unknown layer that fills in the properties for that layer.

It’s OK to leave the custom_conversion_functions dictionary empty. In that case, the converter
will simply add empty CustomLayerParams layers to the model for all the unknown operation
types it encounters.

After the conversion completes, tfcoreml gives a summary of what custom layers were
added. To look at the actual custom layers in the mlmodel, you can write:

for i, layer in enumerate(coreml_model._spec.neuralNetwork.layers):
    if layer.WhichOneof("layer") == "custom":
        print("layer index:", i)
        print(layer)

For my model this printed:



layer index: 452

name: "Postprocessor/Decode/transpose/Rank"
input: "concat:0"
output: "Postprocessor__Decode__transpose__Rank__0"
custom {
  className: "Rank"
  description: "Custom layer that corresponds to the TensorFlow op Rank"
}

As you can see, tfcoreml has already filled in the CustomLayerParams object for this custom
layer with the name of the TensorFlow operator as the suggested class name. You can of
course overwrite these fields with your own class name and description.

For more control, you can provide a conversion function for each unknown operation in
the custom_conversion_functions dictionary. For a great example of how to do this, as well
as more advanced uses of custom layers with tfcoreml, check out this example notebook33 .

Changing the properties of the converted model

tfcoreml returns an MLModel object. You can use this to change some basic properties:

coreml_model.short_description = "Your awesome model"
coreml_model.author = "Your name"
coreml_model.license = "Copyright by you"

coreml_model.input_description["input__0"] = "The input image"

coreml_model.output_description["output_tensor__0"] = "The predicted probabilities"
coreml_model.output_description["label"] = "The class with the highest score"

It’s wise to fill in the input and output descriptions, so it is clear what they are for. Note that
you should use the correct input and output names, in their Swiftified form, where any :
and / characters are replaced by a double underscore.

To see a summary of your model, do print(coreml_model). This includes the names of the
inputs and outputs, in case you’re not sure what they are.

For more invasive changes, you can get the spec object and tweak the model to suit your
needs. For example, to change the output type to image or to quantize the weights.

I’m not really a big fan of using the TensorFlow input and output names. In this example
model, I would rename "input__0" to "image" and "output_tensor__0" to something like
"probabilities". See the chapter Cleaning Up a Converted Model for an explanation of how
to do this.
33 https://github.com/tf-coreml/tf-coreml/blob/master/examples/custom_layer_examples.ipynb
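As a rough sketch, renaming comes down to calling coremltools.utils.rename_feature() on the spec (note that for a classifier you may also need to update fields such as predictedProbabilitiesName, which is what that chapter covers):

import coremltools

spec = coreml_model.get_spec()
coremltools.utils.rename_feature(spec, "input__0", "image")
coremltools.utils.rename_feature(spec, "output_tensor__0", "probabilities")
coreml_model = coremltools.models.MLModel(spec)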

It’s also a good idea at this point to verify the conversion is successful by making a few
predictions on test data.
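On macOS you can make a test prediction directly from Python. A quick sketch, assuming the renamed inputs from the sketch above and a test image on disk:

import PIL.Image

img = PIL.Image.open("test.jpg").resize((224, 224))

# Run the model on the CPU to rule out float16 precision differences.
result = coreml_model.predict({"image": img}, useCPUOnly=True)
print(result["label"])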

Tip: You can also use separate Python scripts to modify and test the model. That way you
have one script for doing the conversion, one script for doing model surgery (if any), and
a third script for verifying that the model works correctly.

Saving the model

If you didn’t make any changes to the MLModel object or its spec after the conversion
completed, you don’t need to save it to a file as tfcoreml has already done this for you.
However, if you did make changes, you’ll need to save the model again:

coreml_model.save("YourModel.mlmodel")

If you made changes to the model’s spec, you should save as follows:

new_model = coremltools.models.MLModel(spec)
new_model.save("YourModel.mlmodel")

By creating a new MLModel instance, Core ML will try to compile the spec. If there’s anything
wrong with the spec, you’ll know right away.
TensorFlow 2.x Conversion With coremltools 3
IMPORTANT: As of coremltools 4, there is a new and recommended way to convert
TensorFlow models. The information in this chapter applies to coremltools version 3 only!
I strongly suggest using the new converter from coremltools 4. Only if that doesn’t work
for you for some reason, I’d suggest reading this chapter.

The original TensorFlow-to-Core ML converter was tfcoreml, but that does not support
TensorFlow 2.x or the new layer types that were introduced with Core ML 3.

To support TF 2.x and the new layer types, coremltools 3 shipped with an improved con-
verter, coremltools.converters.tensorflow. This has now been superseded by coremltools
4.

Note: It’s also possible to use tfcoreml to convert a TF 2.x model. To do so, pass the argument
minimum_ios_deployment_target="13" to the tfcoreml.convert() function. Under the hood this
simply uses coremltools.converters.tensorflow to perform the actual conversion.

Using the coremltools 3 converter

I did not include a detailed description of the TensorFlow converter in this book, because
it works in the same way as the other converters.

Important: You must use coremltools version 3 in order to use this converter! It is not
available in coremltools 4.

pip install -U coremltools==3.4

To learn more, check out the following resources:

• The manual34 for the TensorFlow converter. This explains all the options.
34 https://apple.github.io/coremltools/generated/coremltools.converters.tensorflow.convert.html


• The coremltools repo contains a number of examples35 of how to use coremltools to convert TensorFlow 2.x models to Core ML.

• The TF converter has similar options to tfcoreml. If you’re unclear about any of the
options, I suggest checking out that chapter in this book.

By the way, when you import the coremltools package with TensorFlow 2 installed in your
Python environment, it may print the following message:

WARNING:root:TensorFlow version 2.1.0 detected.
Last version known to be fully compatible is 1.14.0 .

You can safely ignore this message, as obviously the TF 2 converter will work fine with
TensorFlow 2.

Not for iOS 12 and below…

One potential downside of this TensorFlow converter is that it produces models that require
at least Core ML 3. You cannot use these models on iOS 12 and earlier, because they will be
made of layer types that are not supported by older versions of Core ML.

Note: coremltools.converters.tensorflow isn’t just for TF 2.x; you can also use it to convert
TF 1.x models. However, I still prefer to use tfcoreml for that, because tfcoreml creates
models that are compatible with iOS 12.

It’s also for tf.keras

It’s likely that your TF 2 model is built with the tf.keras API. Despite the name, you cannot
use the regular Keras converter with such models — you’ll need to use the TensorFlow
converter.

As mentioned, this produces a model that requires iOS 13+ (Core ML 3). See this chapter for
a workaround that lets you run the model on iOS 12 and earlier as well.
35 https://github.com/apple/coremltools/tree/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/neural_network_inference/tensorflow_converter/Tensorflow_2
PyTorch Conversion Using ONNX
IMPORTANT: As of coremltools 4, there is a new and recommended way to convert PyTorch
models. This chapter describes the old way of first exporting to ONNX and then using the
ONNX converter to make a Core ML model. Even though this is no longer the recommended
way to convert PyTorch models, I am keeping this information in the book. It is still a useful
alternative for when the new conversion API doesn’t work (it happens).

There are two ways to convert PyTorch models to Core ML:

1. Using the unified conversion API from coremltools 4. This is the recommended
approach.

2. First exporting the PyTorch model as an ONNX file and then using the ONNX converter.

I suggest you try method 1 first. If that fails, try method 2.

This chapter shows how to convert PyTorch models using ONNX as an intermediate step,
as well as the kinds of problems you can expect.

Note: PyTorch is not the same thing as Torch. To convert a Torch model, see this chapter.

Exporting the model from PyTorch

First, instantiate the PyTorch model and load its weights from a checkpoint. How to do this
depends on the model in question.

Unlike models from other frameworks, PyTorch models usually do not include the defini-
tion of the model’s architecture, only the learned weights. This means you need to have
access to the original Python source code that defines the model.

If the model contains CUDA tensors — which can only be loaded if your computer has an
NVIDIA GPU, and most Macs don’t — then load the checkpoint as follows. This converts the
tensors to the CPU:


checkpoint = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(checkpoint)

Now that the model is loaded, you can convert it to ONNX format:

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "YourModel.onnx", verbose=True)

To do the conversion, PyTorch will perform a forward pass through the model. That’s why
you need to provide the dummy_input tensor. If your model’s forward() takes multiple inputs,
dummy_input should be a tuple of tensors.

In this example, we’re giving it a 3×224×224 image. It doesn’t matter what is in this tensor,
only that it is the correct size. Note that in PyTorch the shape is (batch size, channels, height,
width).

By default, onnx.export() assigns numeric identifiers to the model’s inputs and outputs (and
layer names). That’s not very nice, which is why you can also provide human-readable
names:

torch.onnx.export(model, dummy_input, "YourModel.onnx", verbose=True,
                  input_names=["image"], output_names=["prediction"])

Use verbose=True to print out the layers from the converted model. This is useful for
confirming that the model conversion was successful. Plus it shows the sizes of the
intermediate tensors, which is always good to check. It’s also a good idea to view the ONNX
file with Netron, to verify everything is where it should be.

What if ONNX export fails?

If the call to torch.onnx.export gives a nasty looking error message, it usually means that
the PyTorch model performs an operation that is not supported by the ONNX converter.
Usually the error message is “RuntimeError: ONNX export failed” but I’ve also seen PyTorch
crash.

ONNX does not support all operations from PyTorch, and conversely, torch.onnx.export may
not support all operations that are possible in ONNX. The PyTorch documentation36 lists the
operators that torch.onnx.export can handle.
36 https://pytorch.org/docs/stable/onnx.html#supported-operators

There are actually different versions of ONNX. PyTorch’s ONNX export uses opset_version=9
by default. If you get an error about unsupported operations, try passing the opset_version=11 argument to torch.onnx.export. Even if this works, there's no guarantee you can
actually convert the ONNX model to Core ML. But it’s always worth a try.
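For example, the export call from before would then look like this:

torch.onnx.export(model, dummy_input, "YourModel.onnx", verbose=True,
                  input_names=["image"], output_names=["prediction"],
                  opset_version=11)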

If your model has layers that ONNX does not support, and they’re at the very end of the
model, you can convert up to the last supported layer and implement the rest of the model
yourself as custom layers in Core ML.

If the unsupported operations are somewhere in the middle, you may need to split up the
model into two or more smaller models and convert those separately, then glue them back
together after converting to Core ML. Not fun. (Unfortunately, there is currently no way to
add placeholders for the unsupported operations in the ONNX file.)

Exporting a partial PyTorch model is pretty easy to do. Since you have the model source
code anyway, just comment out any of the unsupported layers in the forward() method, like
so:

class MyModel(torch.nn.Module):
    ...

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)

        # Comment out the unsupported layers
        #x = self.unsupportedLayer(x)
        return x

If all went well, you now have an ONNX model! Next, learn how to convert this model to
Core ML.

Important: PyTorch models usually don’t include operations for preprocessing the input,
and so these are not added to the ONNX file. If your model does image preprocessing, don’t
forget to manually set these options when you convert the ONNX model to Core ML.

It’s tricky!

Personally, I’ve had mixed success with exporting PyTorch models through ONNX. If it
works, great. But if it doesn’t, remember that just because the ONNX export fails, this
doesn’t mean Core ML is unable to run the model. It might still be possible to create your
own converter.

I find that for certain PyTorch models you often have to modify the PyTorch code significantly before the model can be successfully converted. PyTorch lets you write very dynamic
code, but Core ML — or the converter — can’t always handle that.

For example, if the PyTorch script does h, w = someTensor.shape and then uses h and w for
other computations, you may need to hardcode h and w when exporting to ONNX, instead
of dynamically deriving them from the tensor’s dimensions.

# Workaround to make the model less dynamic
if torch.onnx.is_in_onnx_export():
    h, w = 480, 640
else:
    _, _, h, w = someTensor.shape

Tip: If you’re using PyTorch 1.x and the exported model does not work, try PyTorch 0.4.1.

Links:

• Instructions from the coremltools Neural Network Guide37

• PyTorch / ONNX conversion examples38

37 https://github.com/apple/coremltools/blob/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/NeuralNetworkGuide.md
38 https://github.com/apple/coremltools/tree/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/neural_network_inference/onnx_converter
ONNX Conversion Tips
ONNX39 stands for Open Neural Network eXchange format. Just like Core ML’s mlmodel
files, it is a way to export trained deep learning models into an open format that can be
written and read by different tools. And just like mlmodel, ONNX files are based on the
protobuf standard.

Pretty much all the big players in deep learning support ONNX, although Google doesn’t
appear to be in any rush to add it to TensorFlow.

You’re probably reading this chapter because you have a model in ONNX format that
you wish to convert to Core ML. The other common scenario is that there is no Core ML
converter for a specific deep learning framework. In that case, you may be able to first
export the model to ONNX and then convert that to Core ML. (This used to be the approach
for PyTorch models, but now PyTorch support is built into coremltools.)

You can view ONNX files with Netron. This is useful because you'll need to know the names of the model's inputs and possibly also the outputs.

You can find the official documentation for the ONNX converter in the coremltools User
Guide40 .

Tip: Usually you’d write a Python script to perform the conversion but you can also do it
from the Python REPL or a Jupyter notebook. My preferred approach is to do everything
inside IPython first, then copy the code to a script so I can run it again at a later date.

onnx-coreml is now deprecated

Until recently, ONNX models were converted using the onnx-coreml package. However,
as of coremltools 4, the ONNX converter is built into coremltools. This is exactly the
same as the old onnx-coreml package, which is now no longer supported. Therefore, it is
recommended to use coremltools 4 from now on.
39 https://onnx.ai
40 https://coremltools.readme.io/docs/onnx-conversion


You can still install and use onnx-coreml, but I would only recommend this if you’re stuck
on coremltools 3:

pip install -U coremltools==3.4
pip install -U onnx-coreml

In the text below, we’ll be using the ONNX converter from coremltools. The API for using
onnx-coreml is exactly the same but you use it as follows:

# Using coremltools
import coremltools as ct
coreml_model = ct.converters.onnx.convert(...)

# Using onnx-coreml
from onnx_coreml import convert
coreml_model = convert(...)

Doing the conversion

Using the ONNX converter is similar to the model converters built into coremltools,
although some of the arguments have slightly different names.

This is the minimum amount of code necessary to load an ONNX model, convert it to Core
ML, and save the resulting mlmodel file:

import coremltools as ct
coreml_model = ct.converters.onnx.convert("YourModel.onnx")
coreml_model.save("YourModel.mlmodel")

If all goes well, the call to convert() prints out a list of all the operations in the model and
you get a valid MLModel object.

It’s also possible to get an error at this point. When that happens, it usually is because the
ONNX model has an operation that is not compatible with Core ML. You might be able to
work around this by using custom layers (see below).

Tip for advanced users: Rather than passing in the filename of the ONNX model, you can
also load the ONNX model into memory first using protobuf.

from onnx import onnx_pb

onnx_model_file = open("YourModel.onnx", "rb")
onnx_model_proto = onnx_pb.ModelProto()
onnx_model_proto.ParseFromString(onnx_model_file.read())

Now you’d write convert(onnx_model_proto) to start the conversion. This can be useful when
you want to modify the ONNX file by hand before starting the conversion, for example
to fix an issue with an unsupported operation. There is also a convenient Python API
for modifying ONNX models, but for full control you can find ONNX’s protobuf definition
here41 .

Conversion options

Calling convert() without any other options will treat any inputs and outputs as Multi-
Arrays. That may not be what you want if you’re using images. In that case, provide the
following additional arguments:

• image_input_names=["image"]. Which of the inputs should be treated as images instead of MultiArrays. You only need to use this if your model expects one or more images as inputs. Now Core ML lets you provide these images as CVPixelBuffer objects instead of MLMultiArrays.

• image_output_names=["predicted_image"]. This is useful if the model outputs an image. Without this option, outputs are always MultiArrays.

You won’t get an error message if you get the image_input_names or image_output_names
wrong. That input or output will simply be ignored by the converter and remains a
MultiArray. So double-check your input and output names with Netron before you do the
conversion!
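A quick way to double-check those names without leaving Python is to load the model with the onnx package (a small sketch):

import onnx

onnx_model = onnx.load("YourModel.onnx")
print("inputs: ", [inp.name for inp in onnx_model.graph.input])
print("outputs:", [out.name for out in onnx_model.graph.output])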

Note: There is no option that lets you rename the inputs and outputs. If you don’t like the
input and output names from the original ONNX model, you can rename them yourself
afterwards using coremltools.utils.rename_feature().

For models that work on images, it’s common to normalize the pixel values. You’ll need to
provide the preprocessing_args argument, which is a dictionary containing one or more of
the following values:
41 https://github.com/onnx/onnx/blob/master/onnx/onnx.in.proto

• "image_scale"

• "red_bias", "blue_bias", "green_bias" for color images

• "gray_bias" for grayscale images

• "is_bgr" — usually False unless your model was trained on images loaded by OpenCV

These preprocessing options are applied to all the inputs listed in image_input_names. To
read more about choosing the correct values for these options, see the chapter Image
Preprocessing.

For models that have image outputs, you can also provide a similar dictionary named
deprocessing_args. These "de-processing" options are applied to the outputs listed in image_output_names. This is useful for converting the pixels in the predicted images to the range 0 - 255.
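For example, if the model internally works with pixels in the range 0 - 1, the conversion call might look something like this (a hypothetical sketch, your scale values may differ):

coreml_model = ct.converters.onnx.convert("YourModel.onnx",
    image_input_names=["image"],
    image_output_names=["predicted_image"],
    preprocessing_args={"image_scale": 1/255.0},
    deprocessing_args={"image_scale": 255.0})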

Note: If the ONNX model already contains operations for doing image preprocessing or
deprocessing, you should not provide the preprocessing_args or deprocessing_args options
to convert() or they will be applied twice.

If your model is a classifier, you also need to provide the following:

• mode="classifier"

• class_labels="class_names.txt". This is the name of a text file with the class names, one
per line. You can also provide the class names as a list, which is useful if you only have
a few classes: class_labels=["cat", "dog"]. If you want to use integer labels instead of
strings, you can do class_labels=list(range(num_classes)) .

• predicted_feature_name="label". For classifier models, Core ML outputs a dictionary


that maps the class names to their predicted probabilities, but also a string with the
name of the winning class. By default this class name output is called "classLabel" but
you can change it to something else with this option.

If your model is a regressor, you’ll need to provide the following option:

• mode="regressor"

Note: If you don’t specify a mode, or mode=None, the converter will create a regular neural
network that is neither a classifier nor a regressor. More about this distinction in the chapter
The mlmodel File Format.

Other possible conversion options:

• minimum_ios_deployment_target. By default, the ONNX converter will try to output a model that is compatible with Core ML 2. If that fails you'll get an error message about unsupported operations. In that case, add the option minimum_ios_deployment_target="13". Now the converter can also use Core ML 3 layer types.

• onnx_coreml_input_shape_map. This is an advanced option that lets you change the tensor shapes of the model inputs. Core ML tensors are always 5-dimensional and have the shape (sequence, batch size, channel, height, width). If the ONNX model has inputs that expect a different shape, you can map them to the Core ML shape using this option.

Here’s a full example of how to convert an ONNX classifier model to Core ML format:

import coremltools as ct

coreml_model = ct.converters.onnx.convert("YourModel.onnx",
    mode="classifier",
    image_input_names=["image"],
    preprocessing_args={
        "image_scale": 2/255.0,
        "red_bias": -1.0,
        "green_bias": -1.0,
        "blue_bias": -1.0 },
    class_labels="imagenet_labels.txt",
    predicted_feature_name="label")

Unsupported operations

The ONNX file format defines a lot of different operators42 but not all of these are supported
by Core ML.
42 https://github.com/onnx/onnx/blob/master/docs/Operators.md

All the ONNX operators that can be converted to Core ML are listed here43 .

There are two things to try when you get an error message about unsupported operations:

1. There are different versions of ONNX. Try exporting the ONNX model again using opset_version=9. This will use a more limited set of layers.

2. Pass the argument minimum_ios_deployment_target="13" to convert(). This makes all the new Core ML 3 layer types available. Of course, now your model requires iOS 13 or later. It does not always solve all problems, because even Core ML 3 does not support all the operations from ONNX.

If your ONNX model has an operator or layer type that the converter cannot handle, you
can implement this as a custom layer in Core ML.

Custom layers

If your ONNX model has operations that are not supported by Core ML, you can let the
converter insert custom layers into the Core ML model automatically. To do this, provide
the following additional arguments to the converter:

• add_custom_layers=True. This will add any unknown operations to the mlmodel file as CustomLayerParams objects.

• custom_conversion_functions={}. This dictionary lets you provide a function for each unknown layer that fills in the properties for that layer.

Please refer to the chapter Creating Your Own Custom Layers for more information about
how to add custom layers. Also see the Keras and TensorFlow conversion chapters — the
process is very similar for ONNX models.

Changing the properties of the converted model

The ONNX converter returns an MLModel object. You can use this to change some basic
properties:

43 https://github.com/onnx/onnx-coreml/blob/3af826dfb0f17de4310d989acc7d6c5aea42e407/onnx_coreml/_operators_nd.py#L2233

coreml_model.short_description = "Your awesome model"
coreml_model.author = "Your name"
coreml_model.license = "Copyright by you"

coreml_model.input_description["image"] = "The input image"

coreml_model.output_description["labelProbs"] = "The predicted probabilities"
coreml_model.output_description["label"] = "The class with the highest score"

It’s wise to fill in the input and output descriptions, so it is clear what they are for.

Note that the Core ML model will use the input and output names from the original ONNX
model, which are not always nice to read. For example, the output in one of my models
was named "output_flatten0_reshape0". Fortunately, you can easily rename the inputs and
outputs at this point.

To see a summary of your model, do print(coreml_model).

For more invasive changes, you can get the spec object and tweak the model to suit your
needs. For example, to quantize the weights or to change them to 16-bit floats.

It’s also a good idea at this point to verify the conversion is successful by making a few
predictions on test data.

Saving the model

This is the easy part:

coreml_model.save("YourModel.mlmodel")

If you made changes to the model’s spec, you should save as follows:

new_model = ct.models.MLModel(spec)
new_model.save("YourModel.mlmodel")

By creating a new MLModel instance, Core ML will try to compile the spec. If there’s anything
wrong with the spec, you’ll know right away.
Torch7 Conversion Tips
Torch44 is a scientific computing framework that can also be used for building neural
networks. Unlike most other deep learning tools, it uses the Lua language instead of Python.
Torch files have the .t7 extension.

Note: Torch is the predecessor of PyTorch, but they’re different tools. To convert a PyTorch
model, see this chapter.

There is no official Torch converter in coremltools, but you can use torch2coreml. Fortunately, this package is written in Python, so you won't have to deal with Lua. :-)

Like all Python packages, torch2coreml can be installed with pip:

pip install -U torch2coreml

Currently, torch2coreml is only available for Python 2. If you use pip3 or pip inside a Python 3 environment, you may get an error message that the package could not be found.

You will also need to have PyTorch45 installed, but pip should take care of this when you
install torch2coreml. Even though PyTorch is not the same thing as Torch, it can read .t7
files (with some limitations), which is why torch2coreml uses it.

Tip: Usually you’d write a Python script to perform the conversion but you can also do it
from the Python REPL or a Jupyter notebook. My preferred approach is to do everything
inside IPython first, then copy the code to a script so I can run it again at a later date.

You need CUDA for certain models

CUDA is the programming language for NVIDIA GPUs. If the Torch model was serialized as
a GPU model, it will contain "CUDA tensor" objects. Unfortunately, these cannot be loaded
by torch2coreml.
44 http://torch.ch
45 https://pytorch.org

One solution is to load the model in Torch and convert it to a CPU model. Install Torch46 if
you haven’t already, then type th in a Terminal to launch it. Run the following:

require 'torch'
require 'nn'
require 'cunn'

model = torch.load("torch_model.t7").model
model_cpu = model:float()
torch.save("torch_model_cpu.t7", model_cpu)

Note: This only works if you have CUDA installed on your computer (very unlikely if you
have a Mac). Without CUDA, the cunn package does not work and torch.load() will give an
error message “unknown Torch class <torch.CudaTensor>”. The only way to load this model
into Torch is to use a computer that has CUDA installed.

If your Torch model has CUDA tensors in it, and you are unable to convert these to CPU
tensors, you won’t be able to easily convert the model to Core ML. In that case, the only
option is to create your own converter from scratch and read the weights directly from the
t7 model file.

Converting the model

First import the torch2coreml package:

from torch2coreml import convert

If this gives an error message, your installation of PyTorch may be too new. Run the
following from Terminal to downgrade PyTorch to a known good version:

pip install -U torch==0.4.1

Converting the Torch model to Core ML is straightforward:

46 http://torch.ch/docs/getting-started.html

input_shape = (3, 224, 224)
coreml_model = convert("torch_model.t7", input_shapes=[input_shape])

The input_shapes argument tells the converter how large the input tensors or images are.
Note that in Torch, the number of channels comes first, so the shape is (channels, height,
width).

If this gives an error message such as “T7ReaderException: don't know how to deserialize
Lua class”, you will need to load the model in a different way:

from torch.utils.serialization import load_lua

torch_model = load_lua("torch_model.t7", unknown_classes=True)

input_shape = (3, 224, 224)
coreml_model = convert(torch_model, input_shapes=[input_shape])

If you get the error message “KeyError: 'torch.CudaTensor'”, it means the Torch model was
serialized as a GPU model but this is not supported by the load_lua() function. See the
previous section for tips.

After all this effort, it is still possible that you get errors when trying to load the Torch model.
This can happen when the model contains operations that are not supported by PyTorch.
This blog post47 by the creators of torch2coreml gives an excellent explanation of how to
hack the model in order to work around these issues.

The torch2coreml GitHub repo48 lists the Torch 7 layers that are supported.

The GitHub page also lists additional arguments that you can pass to convert(), such as
image preprocessing options and how to handle custom layers.

Note: An alternative way to convert Torch models to Core ML is to load the model with
PyTorch’s load_lua() function, then export to ONNX using torch.onnx.export(), and finally
use the ONNX converter to convert to Core ML.

47 https://blog.prismalabs.ai/diy-prisma-app-with-coreml-6b4994cc99e1
48 https://github.com/prisma-ai/torch2coreml
MXNet Conversion Tips
Apache MXNet49 is Amazon’s deep learning framework of choice. It’s not as popular as
TensorFlow or PyTorch, but it has its fair share of followers. Even Turi Create uses MXNet
under the covers.

MXNet includes a tool for converting models to Core ML. You can find it on GitHub50 . The
documentation is excellent and I encourage you to check it out.

Note: I don’t have any experience yet with converting MXNet models to Core ML. I’ll add
more to this chapter when I have something useful to say about this topic. Get in touch
if you need assistance with converting MXNet models and I’ll bump this chapter on my
priority list.

49 https://mxnet.apache.org
50 https://github.com/apache/incubator-mxnet/tree/master/tools/coreml

Troubleshooting the Conversion Process
Even though the model converters from coremltools (and others) are pretty good, you
can still run into problems during the conversion process — or after. In this chapter I’ve
collected common issues that occur when converting models to Core ML.

Image preprocessing / input normalization

The most common reason for getting wrong predictions out of Core ML is incorrect
image preprocessing options. That’s why I dedicated a whole chapter to it. If the model’s
predictions are not what you expected, double-check to make sure you’re using the correct
settings for image preprocessing.

Some models already include their own image preprocessing operations. This may show up
as Add, Multiply, and LoadConstant layers at the very top of your model. Sometimes there
is also a Gather layer to convert between RGB and BGR. You should delete such layers —
either from the original model before conversion or from the mlmodel afterwards — and
use Core ML’s built-in image preprocessing layer instead.
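Deleting such a layer from the converted mlmodel comes down to editing the spec. Here's a rough sketch that removes the first layer and rewires the following layers to read from that layer's input instead; it assumes a plain neural network (for a classifier the field is neuralNetworkClassifier) and that the layer you want to drop really is the first one:

import coremltools

coreml_model = coremltools.models.MLModel("YourModel.mlmodel")
spec = coreml_model.get_spec()
layers = spec.neuralNetwork.layers

# Remember what the unwanted layer (e.g. a Multiply) was connected to.
removed_input = layers[0].input[0]
removed_output = layers[0].output[0]
del layers[0]

# Any layer that consumed the removed layer's output now reads its input.
for layer in layers:
    for i, name in enumerate(layer.input):
        if name == removed_output:
            layer.input[i] = removed_input

coreml_model = coremltools.models.MLModel(spec)
coreml_model.save("YourModel.mlmodel")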

If you’re not using images but MultiArrays as input, you may need to do some kind of
normalization as well. Core ML does not have a special preprocessing layer for non-image
data. You can do this in your Swift or Objective-C code, or perform the normalization by
adding layers to the model. Or build a pipeline.

Remember, if you’re not giving Core ML the correct input values, it will not produce the
correct outputs either.

The input (or output) should be an image but is a MultiArray

If your model is supposed to take an image as input but after conversion it expects a
MultiArray, you forgot to specify the option image_input_names to the converter. Read more
about this.

For models that generate images as output, you may need to do more work. Some converters may support an image_output_names option but not all do. In that case, you can change
the output to an image after conversion.


The converter gives an error message

Different ML frameworks have different features, and conversion from one set of features
to another is not always possible.

For example, PyTorch has bilinear resizing with the option align_corners=True. When
exporting to ONNX using the default “opset” version 9, you may get an error message that
the align_corners=True option is not supported unless you use a newer opset. But if you
export to ONNX with opset_version=11, the ONNX graph will now contain operations that
cannot be converted to Core ML…

The solution to such woes is to edit the original PyTorch model and remove the offending
operation so that the conversion succeeds, then patch up the mlmodel file afterwards with
an equivalent operation. This is only possible if you can figure out a clever way to make
the mlmodel do the same thing as the original PyTorch model.

For the bilinear resizing example, this isn’t possible — Core ML simply does not support
this operation and there is no way to add it (except for using a custom layer). You can read
more about this particular issue in this blog post51 .

If no such workaround is possible, it means you need to re-train your model, this time
without the operations that Core ML does not support!

Tip: It’s smart to figure out in advance if your model can run on Core ML, before you spend
a lot of time and money training it. You can do this by trying to convert the untrained model.
This also lets you try out how fast your model is going to be, even if it doesn’t output usable
predictions yet.

Even if the conversion from PyTorch to ONNX completes without errors, the ONNX to
Core ML conversion may give errors on unsupported operations, such as “Gather” or
“ReshapeStatic”. When that happens, you need to pass the option minimum_ios_deployment_target="13" to convert(). This tells the converter that it's OK to use Core ML 3 layers, which
supports many more operations. The downside is that now your model won’t work on iOS
12 or earlier.
51 https://machinethink.net/blog/coreml-upsampling/

The converter is buggy

Sometimes the converter may give an error message that is actually a bug in the converter
itself or some unsupported or untested situation. You can tell because it looks like a Python
crash rather than a properly handled error message.

When this happens, check out the issues page on the converter’s GitHub repo. This may
already have a solution or workaround. And if not, file a bug.

Everything appears to be correct but I still get the wrong output?!

First, run the model on the CPU. You can’t always trust the output from the GPU or ANE (see
the next item).

Open the model in Netron and put it side-by-side with the original. Compare the layers
one-by-one to make sure everything was correctly converted.

Even so, there may be differences between the training framework and Core ML. For example, do you know what padding rules Core ML uses when a convolution layer has stride 2?
As it turns out, these padding rules are different between TensorFlow and PyTorch… Core
ML typically uses TensorFlow conventions; a model converted from PyTorch may need a
special Padding layer to deal with this.

What I do in such situations: Create a Jupyter notebook that loads the mlmodel and
removes all the layers beyond a certain point. Run the model from the notebook (with
useCPUOnly=True) and save the resulting output tensor in a variable, out_coreml. Also run the
original TensorFlow/Keras/PyTorch/etc model up to that same point and save the results in
another variable, out_original.

Then compare the contents of these two tensors. In theory, they should be identical. A
quick way to check is to print(np.max(np.abs(out_coreml - out_original))), which prints
the largest difference between the two tensors. I will allow for small differences due to
numerical precision issues — something on the order of 1e-7 — but if there are large
differences, something is not right and you should remove more layers to find the one
where it first goes wrong.

Tip: It also helps to plot the feature maps as images so you can inspect them visually.

The model gives different results on CPU, GPU or ANE

Suppose your model does not give the correct output when running on the GPU or ANE but
it is correct when you try it on the CPU. There are two possibilities:

1. There is a bug in Core ML. I've seen this happen with a model where Clip layers were
used as the activation function after convolution. The model had 20 or so of these Clip
layers. When running on the GPU, two Clip layers malfunctioned and always produced
zeros, while the rest were fine. Replacing those problematic layers with a ReLU fixed
the issue. A bug is the only explanation for this, as the model worked fine on the CPU
and ANE.

2. There is a floating-point precision issue. Remember that the CPU does its computations
with 32-bit floats but the GPU / ANE use 16-bit floats. The range of float16 is a lot smaller
than float32. If your layers have very small weights, the activations may become so
small that they get rounded down to zero by the GPU. The solution is to re-train your
model with some kind of constraint on the weights that keeps them from being too
small or too large.

My model is too slow

Usually this means the model is too big to be practical on mobile. The solution is to re-architect the model to use fewer layers or smaller layers. For tips on how to estimate how
fast your model will be, refer to this blog post52 .

Sometimes the model is slow because the converter picked a suboptimal way to convert
a certain operation. For example, experiments have shown that in Core ML, pow(x, 2) is
much slower than doing x * x.

It’s also possible that you can replace certain layer types with simpler versions. For
example, I’ve found that Add layers are faster than AddBroadcastable (likewise for other
types of broadcastable layers). If an Add layer is good enough for your model, it can be
worth writing a small script that replaces those layer types.
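As a starting point, here's a small sketch that counts how often each layer type occurs in a model, which makes it easy to spot broadcastable layers worth investigating (it assumes a plain neural network; for a classifier use spec.neuralNetworkClassifier):

from collections import Counter
import coremltools

spec = coremltools.models.MLModel("YourModel.mlmodel").get_spec()
counts = Counter(layer.WhichOneof("layer") for layer in spec.neuralNetwork.layers)
for layer_type, count in counts.most_common():
    print(layer_type, count)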

52 https://machinethink.net/blog/how-fast-is-my-model/

Help, I’m getting desperate!!!

Still no luck? Here are a few things you can try:

1) Install coremltools from master to get the very latest version. The version from pip may
be out-of-date already. Here’s how you do that:

pip install -U git+https://github.com/apple/coremltools.git

Likewise for other packages such as tfcoreml, onnx-coreml, and so on. The official releases
are often behind on the latest fixes.

2) If all else fails, use NeuralNetworkBuilder to make a custom conversion script. It’s more
work but you know exactly what you’re getting.

Tip: If I don’t know how to do something, I usually look at the official source code53 of
NeuralNetworkBuilder. For example, if you want to know how to add a LoadConstant layer
by hand to your mlmodel, you’d look at the code for add_load_constant().

3) Read the rest of this book. ;-) There’s a lot of other info that can help you work around
Core ML issues.

4) Hire an expert. I often help clients with these kinds of issues. Because I’ve got a lot of
experience working around Core ML limitations, I can often diagnose these problems quite
quickly and recommend possible fixes.
53 https://github.com/apple/coremltools/blob/master/coremltools/models/neural_network/builder.py
Writing Your Own Converter
When the official converter for your training tool doesn’t like your model and gives
conversion errors, or if you’ve trained your model with a tool that doesn’t have a Core
ML converter, you might need to roll your own.

Don’t worry: If you understand the mlmodel format, writing your own converter isn’t really
that bad.

To create your own Core ML converter, you have the following options:

• Hack an existing converter. Let’s say the Keras converter almost works but gives
problems with one particular layer in your model. My suggestion in that case would
be to clone the source code of coremltools54 and change it. That’s why it’s open source!
All the Python code is inside the coremltools subfolder. If you put your conversion
script inside the top-level folder of the coremltools repo, then import coremltools will
load your local version instead of the pip package. (There is also C++ code inside the
repo. If you change that, you’ll have to rebuild with cmake, see the README for details.)

• Use NeuralNetworkBuilder from coremltools. This lets you construct a neural network layer-by-layer. It's what the other converters use internally too. More about this in the next chapter.

• Create the mlmodel file directly from a spec object. The chapter Making Sure the
Input is Correct has an example of how to do this. It’s a little more convoluted than
using NeuralNetworkBuilder but you get ultimate control.

• Directly write protobuf files. See the chapter Using Protobuf Without coremltools.
This is useful for when you want to convert your model from a language other than
Python, such as R or Java.

When you write your own converter, pay special attention to the following:

• Order of the weights. Core ML expects the weights for convolution layers to be stored in the shape (outputChannels, inputChannels, kernelHeight, kernelWidth). For fully-connected layers it is (fanOut, fanIn). It's quite likely your training tool stores the weights in a different order. TensorFlow and Keras, for example, use (kernelHeight, kernelWidth, inputChannels, outputChannels). For the exact shapes expected by Core ML, look in NeuralNetwork.proto55. There's a small sketch of this after the list below.
54 https://github.com/apple/coremltools

• Padding. Different training tools apply zero-padding in different ways. Core ML con-
volution and pooling layers support a fixed amount of border padding (or cropping)
with ValidPadding, or asymmetric padding like what TensorFlow uses with SamePadding.
If neither of these is suitable, you can use a separate padding layer, PaddingLayerParams,
for example for reflection padding.

• Output shapes. Always verify that the outputs of the converted layers have the same
shapes as those in the original model. See also Checking the Layer Output Shapes.

• Tensor dimension order. The Core ML data blobs that go from one layer to the
next have shape (sequence length, batch size, channels, height, width). This may be a
different order than what your training tool uses, and so the axes for slicing operations
etc may be different than in your original model. (Core ML 3 is more flexible than its
predecessors and supports arbitrary tensor shapes.)

• Unsupported operations. Core ML only understands a limited number of neural network layer types. If your model performs an operation that is not supported by Core ML, you'll need to write a custom layer or combine several other layer types in a clever way. For example, the ReLU6 activation function is not supported directly by Core ML but can be implemented by a sequence consisting of a ReLU layer, a threshold layer, and two linear activations. (Or, as of Core ML 3, using a single clip layer.)
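To illustrate the weight ordering issue, here is a small sketch with dummy data that transposes Keras-style weights into the order Core ML expects:

import numpy as np

# Convolution: Keras/TF store (kernelHeight, kernelWidth, inputChannels,
# outputChannels); Core ML wants (outputChannels, inputChannels,
# kernelHeight, kernelWidth).
keras_conv = np.random.rand(3, 3, 16, 32).astype(np.float32)   # dummy weights
coreml_conv = keras_conv.transpose(3, 2, 0, 1)

# Fully-connected: Keras stores (fanIn, fanOut); Core ML wants (fanOut, fanIn).
keras_dense = np.random.rand(128, 10).astype(np.float32)       # dummy weights
coreml_dense = keras_dense.transpose(1, 0)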

Tip: Looking at the source code of existing converters for inspiration is always a good idea.
You can learn a lot from them. For example, the convert_activation() function from the
coremltools Keras converter shows how to implement the ReLU6 activation function. Also
be prepared to dig through NeuralNetwork.proto to see what’s possible.

55 https://github.com/apple/coremltools/blob/master/mlmodel/format/NeuralNetwork.proto
NeuralNetworkBuilder
Writing your own converter may sound like a daunting prospect but NeuralNetworkBuilder
from coremltools makes it pretty easy.

If you look through the source code of the existing coremltools converters, you may feel
your heart sink. It looks like they do a lot of work… Trust me, it’s not that bad. Where
the Keras, Caffe and TensorFlow converters need to be generic and handle a multitude of
different models and graph types, your converter only has to work for your own model.

That’s a lot simpler, because you only have to write conversion code for a specific model
architecture, not for all possible architectures!

Note: The built-in converters also use NeuralNetworkBuilder to do their work. If you’re ever
wondering about how to use NeuralNetworkBuilder and want to see some sample code,
look at the source code of the Keras converter. The Caffe converter is harder to read
because it’s written in C++, not Python. The TensorFlow and PyTorch converters don’t use
NeuralNetworkBuilder but the MIL builder, which is even more powerful.

Prerequisites

In order to write your own converter, you need to know:

1. the architecture of the model,

2. the learned parameters or weights of the model.

If the weights are stored in a special file format, you must have some way to read and parse
those kinds of files.

This chapter demonstrates how to use NeuralNetworkBuilder with a Caffe model. But the
approach taken in this chapter is the same regardless of what format your model is in: you
need to load the weights and create the layers one-by-one to duplicate the original model’s
architecture.


Example: Caffe CIFAR-10 model

In this chapter I’ll be using the Caffe CIFAR-10 example56 . This is a simple model with just
a handful of layers. Because the Caffe repo does not include a trained model that you can
download, I have included the trained model files in the repo for this book57 .

The result of training the model from the example is a .caffemodel.h5 file. Unlike regular
Caffe models, this is stored in the HDF5 file format. The Caffe converter from coremltools
cannot handle this file format, so this will make for a nice demonstration of how to use
NeuralNetworkBuilder.

Note: Keras models are also stored in HDF5 format. HDF5 is a general purpose container
for any kind of structured data. Caffe stores its data in a different way than Keras does, even
though both use HDF5 files. Just because a model may have the .h5 file extension, it doesn’t
automatically mean it’s a Keras file. The same thing is true for the regular .caffemodel files,
by the way. Just like TensorFlow graphs and Core ML’s own mlmodel files, .caffemodel is
based on protobuf. Even though these tools all use the same serialization mechanism —
protobuf — the actual structure of these files is very different!

The model architecture

The architecture of a Caffe model is described in a .prototxt file. This is a text file that uses
protobuf syntax to describe the layers and their properties. For this chapter we’re using
cifar10_quick.prototxt58 :

name: "CIFAR10_quick_test"
layer {
name: "data"
type: "Input"
top: "data"
input_param { shape: { dim: 1 dim: 3 dim: 32 dim: 32 } }
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
56 https://github.com/BVLC/caffe/tree/master/examples/cifar10
57 https://github.com/hollance/coreml-survival-guide
58 https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_quick.prototxt
NeuralNetworkBuilder 102

top: "conv1"
convolution_param {
num_output: 32
pad: 2
kernel_size: 5
stride: 1
}
}
...

Just like in Core ML, the layers are connected through named inputs and outputs. The input
for a layer is named bottom, the output is named top. That makes sense if you think of this as
a graph that starts at the bottom and grows upwards (even though pretty much everyone
draws their models starting at the top and growing downwards).

The "data" layer is the model input and has shape (1, 3, 32, 32). The first dimension is the
batch size, and then it is channels, height, width — just like in Core ML.

Everything you need to understand the Caffe model is in this prototxt file, but it’s not always
easy to read. Personally, I prefer to look at pictures.

You can view prototxt files in my tool of choice, Netron.

You can also use Netscope59 , a web-based tool for visualizing Caffe models. Paste the
contents of the prototxt file in the text box on the left and press Shift+Enter:

59 http://ethereon.github.io/netscope/#/editor

Hover over a layer to see its properties, such as the kernel size, stride, and padding.

The blob shapes field tells you what the output shape of this layer is supposed to be. That’s
important to know when you do the conversion later.

Note: There are many different forks of Caffe. If your model was trained with one of those
forks, chances are that Netscope will not be able to display your model, especially if it uses
custom layers that are not part of the official Caffe distribution.

If you have Caffe installed, you can use the following command to convert the prototxt into
a picture:

python3 /path/to/caffe/python/draw_net.py cifar10_quick.prototxt net.png

The layers in cifar10_quick

Netscope shows that this model has 13 layers:

• data is the input layer

• there are three convolution layers, conv1 to conv3

• there are also three pooling layers, pool1 to pool3

• the convolution or pooling is followed by a ReLU

• there are two fully-connected layers, ip1 and ip2

• the output is a softmax layer called prob

Notice how Netscope groups the relu layers together with the preceding conv or pool layer.
That’s because in Caffe the activation function is performed in-place, so it’s not really a
separate operation.

To convert this network to Core ML, we have to add these same layers. Fortunately, Core
ML supports all of these operations — convolution, pooling, ReLU, fully-connected layer,
softmax — so we won’t have to write any custom layers.

Note: If you do require custom layers, that’s no problem for NeuralNetworkBuilder. You just
have to fill in the CustomLayerParams object. See the chapter Creating Your Own Custom
Layers for more info on how to add custom layers to your mlmodel.

The learned weights

There are several ways you can load a Caffe model to examine its architecture and/or its
learned parameters:

1. Use pycaffe to manipulate the model from Python. Using pycaffe is a good idea if you
already have a working installation of Caffe. However, installing Caffe can be a little
tricky as it requires compiling everything from source.

2. For .caffemodel files that are in protobuf format: directly load the file through the
protobuf API. You’ll see a small example later on in this chapter. Also check out the
chapter Using Protobuf Without coremltools.

3. For .caffemodel.h5 files that are in HDF5 format, use the h5py library to load the file
and examine it. That’s what you’ll be doing here.

Tip: When writing my own converters, I like to work in an IPython shell or Jupyter
notebook so that I can add layers one at a time and check that everything went OK. Once
I’ve got something that works correctly, I copy the code to a Python script so I can easily
run it again at a later date.

The code to load an HDF5 file looks like this:

import numpy as np
import h5py

caffemodel_path = "cifar10_quick_iter_5000.caffemodel.h5"
f = h5py.File(caffemodel_path, "r")

Now you can use the variable f to access the data from the HDF5 file. For example, to
examine what is in the file you can write:

f.visititems(lambda name, obj: print(name, obj))

For cifar10_quick_iter_5000.caffemodel.h5, this prints:

data <HDF5 group "/data" (13 members)>
data/cifar <HDF5 group "/data/cifar" (0 members)>
data/conv1 <HDF5 group "/data/conv1" (2 members)>
data/conv1/0 <HDF5 dataset "0": shape (32, 3, 5, 5), type "<f4">
data/conv1/1 <HDF5 dataset "1": shape (32,), type "<f4">
data/conv2 <HDF5 group "/data/conv2" (2 members)>
data/conv2/0 <HDF5 dataset "0": shape (32, 32, 5, 5), type "<f4">
data/conv2/1 <HDF5 dataset "1": shape (32,), type "<f4">
...

Some of these items are “groups”, others are “dataset” objects. For example, data is a top-
level group containing data/conv1, which is another group. This in turn has two members,
data/conv1/0 which has the weights for the conv1 layer, and data/conv1/1 which has the
values for that layer’s biases.

There is a group for each layer with learned parameters. There are also groups for the pool
and ReLU layers but these are empty and we can safely ignore them.

To get the weights for a layer as a NumPy array, you can write:

weights = f["data/conv1/0"][...]

I’ve made a little helper function that grabs both the weights and the biases for a layer at
the same time. This works for convolutional as well as fully-connected layers:

def get_weights(layer_name):
    weights = f[layer_name + "/0"][...]

    # Transpose the weights for a convolutional layer.
    if weights.ndim == 4:
        weights = weights.transpose(2, 3, 1, 0)

    biases = f[layer_name + "/1"][...]
    return weights, biases

To get the weights and biases for a layer, you’d write the following, where "data/conv1" is
the name of the group in the HDF5 file:

W, b = get_weights("data/conv1")

When converting to Core ML, it’s important that you put the weights in the correct order!

Caffe stores its weights as (output channels, input channels, kernel height, kernel width).
For example, the weights for conv1 are (32, 3, 5, 5).

NeuralNetworkBuilder, however, wants the weights as (height, width, input channels, output
channels). That’s why we call transpose(). If you don’t put the weights in the right order,
Core ML will not give the correct predictions!

Note: Interestingly enough, the mlmodel itself uses the same weights ordering as Caffe.
I’m not sure why NeuralNetworkBuilder wants a different order. It doesn’t really matter, just
remember to always pay attention to the expected order of the dimensions!

When the get_weights() function is used on a fully-connected layer, for example "data/ip1",
the weights array will only have two dimensions: (output channels, input channels).
Transposing is not necessary in that case, as that’s also the order expected by NeuralNetworkBuilder.

Creating the builder

Now that you know the model architecture, i.e. what the layers for this model are, and you
have a way to read the weights for the layers, you can translate this into a Core ML model
using NeuralNetworkBuilder.

It’s very straightforward: create a NeuralNetworkBuilder instance and then add the layers in
the correct order. That is all!

Here is how you create the builder object:



import coremltools as ct
from coremltools.models import datatypes
from coremltools.models import neural_network

input_features = [ ("image", datatypes.Array(3, 32, 32)) ]
output_features = [ ("labelProbs", None) ]

builder = neural_network.NeuralNetworkBuilder(
    input_features, output_features, mode="classifier")

Every Core ML model has one or more inputs and one or more outputs. You define those
here, including their shape. The input_features list has the names and shapes of the model’s
inputs; the output_features list has the names and shapes for the model’s outputs.

The model we’re converting here is a classifier that takes an image as input and produces
a probability distribution over the ten CIFAR-10 classes as output.

The shape for the input is given by a datatypes.Array() object. Since CIFAR-10 images are
32×32-pixel RGB images, the shape for the "image" input is (3, 32, 32).

Note: You don’t yet tell the model that the input named "image" is really supposed to be an
image instead of an array. You do that in a separate step later on.

The shape for the output "labelProbs" is None. That means we’ll let NeuralNetworkBuilder
figure out for itself how big the output shape should be. If you already know the shape of
the output, you can also supply it here, again using a datatypes.Array object:

output_features = [ ("labelProbs", datatypes.Array(10)) ]

In our case, because of the mode="classifier" argument, NeuralNetworkBuilder will create a
model of type neuralNetworkClassifier. In that case the output will be a dictionary of class
names and their probabilities, and so the output shape doesn’t really matter.

If you use mode=None, you get a regular neuralNetwork model. In that case I recommend filling
in the shape of the output if you know it (you can always fix this later in the spec).

Note: The input in the prototxt was named "data" and the output was named "prob" but
here we’re using different names: "image" and "labelProbs". That’s fine — we can name the
model inputs and outputs whatever we want.

If you now do print(builder.spec.description), it should print the following:

input {
  name: "image"
  type {
    multiArrayType {
      shape: 3
      shape: 32
      shape: 32
      dataType: DOUBLE
    }
  }
}
output {
  name: "labelProbs"
  type {
    multiArrayType {
      dataType: DOUBLE
    }
  }
}

Both input and output are of type MultiArray. The shape for the output is not filled in
because we left this as None. Next, you’ll tell NeuralNetworkBuilder that the input should really
be of type image, and the output will be the class name / probabilities dictionary.

Image preprocessing

To tell Core ML that the input is an image, you call set_pre_processing_parameters().
This takes the familiar image preprocessing arguments that are also used by the built-in
converters:

builder.set_pre_processing_parameters(image_input_names=["image"],
                                      is_bgr=False,
                                      red_bias=-125.3,
                                      green_bias=-122.95,
                                      blue_bias=-113.87)

The mean RGB values for CIFAR-10 are used here as the biases. As part of the training
process, Caffe also provides a “mean image”, which is the average of all the training images.
I calculated the red/green/blue_bias values from this mean image. Later in this chapter
you’ll see how to actually use the full mean image in the Core ML model.

Interestingly, is_bgr needs to be False for this model. Caffe models usually expect images
in BGR pixel order but the CIFAR10 example doesn’t use OpenCV to load the images, and so
the pixel order is RGB instead of the usual BGR that we see with Caffe.

Now if you do print(builder.spec.description.input), it says:

[name: "image"
type {
imageType {
width: 32
height: 32
colorSpace: RGB
}
}
]

The "image" input is no longer a MultiArray but a proper 32×32-pixel image.

A classifier needs classes

Because we used mode="classifier" when creating the NeuralNetworkBuilder, this is not a
plain neural network but a special classifier model. This kind of model needs to include
the labels for the classes. To provide those, call set_class_labels():

cifar10_labels = ["airplane", "automobile", "bird", "cat", "deer",
                  "dog", "frog", "horse", "ship", "truck"]

builder.set_class_labels(class_labels=cifar10_labels,
                         predicted_feature_name="label",
                         prediction_blob="labelProbs")

The name from prediction_blob must correspond to the output of the model that predicts
the probabilities. This is the output we named "labelProbs" earlier. The predicted_feature_name
is the name of the output that has the class with the highest probability; you can name
this anything you want.

Now if you do print(builder.spec.description.output), it says:



[name: "labelProbs"
type {
dictionaryType {
stringKeyType {
}
}
}
, name: "label"
type {
stringType {
}
}
]

Previously, the output "labelProbs" was a MultiArray but now it’s a dictionary that will have
the class names and their predicted probabilities. You can also see that a second output,
"label", has been added to the model.

Adding the layers

Now you’ll add the layers one-by-one to the model. The order that you add them is
important: you should only add a layer when its inputs — the layers that it is connected
to — have been added already. You’d think that NeuralNetworkBuilder could sort the layers,
but it’s not that smart. ;-)

Let’s add the first layer:

W, b = get_weights("data/conv1")

builder.add_convolution(name="conv1",
kernel_channels=3,
output_channels=32,
height=5,
width=5,
stride_height=1,
stride_width=1,
border_mode="valid",
groups=1,
W=W,
b=b,
has_bias=True,
input_name="image",
output_name="conv1_output",
padding_top=2,
padding_bottom=2,
padding_left=2,
padding_right=2)

I kept the layer name the same as in the original Caffe model, "conv1".

Every layer also has an input_name and an output_name. This is how it is connected to the
other layers in the model. The input is "image", as this first layer is connected to the model
input. For the output of the layer we could have used the layer name "conv1", but I like to
append _output to make it clear this name refers to an output and not the whole layer.

The rest of the arguments are taken straight from the Caffe model’s prototxt file:

• The number of input channels, named kernel_channels here, is 3 because the input is
an RGB image.

• The number of convolution filters, or output_channels, is 32.

• The height and width are the convolution kernel size, which is 5×5 for this model.

• The border_mode is used for padding and is "valid". You use "valid" mode if you want to
specify a fixed amount of zero-padding, which we’ve done here using the padding_top=2
etc arguments.

• You also provide the weights W and biases b.

And that’s all you need to do to add a convolution layer. For a complete explanation of the
add_convolution() arguments, see the docs for NeuralNetworkBuilder60 .

The next layer is a pooling layer. The code for adding this layer looks very similar to what
you just did:

builder.add_pooling(name="pool1",
                    height=3,
                    width=3,
                    stride_height=2,
                    stride_width=2,
                    layer_type="MAX",
                    padding_type="INCLUDE_LAST_PIXEL",
                    input_name="conv1_output",
                    output_name="pool1_output")

This time the input_name is the same as the output_name from the previous layer,
"conv1_output". That’s how Core ML knows that "pool1" follows "conv1".

Again, the settings for the layer are taken from the prototxt file. This is a max-pooling layer
with kernel size 3×3 and a stride of 2.
60 https://coremltools.readme.io/reference/modelsneural_network

Worth noting is the padding_type. Possible choices are "VALID", "SAME" and
"INCLUDE_LAST_PIXEL". For pooling layers in Caffe models, you want the latter. Different
training tools have different ways to implement padding, and you have to be careful to
choose the proper Core ML option.

Initially I had set this to "VALID" but it made the output of the pooling layer too small.
The input is a 32×32 feature map and the output should be 16×16 (you can verify this in
Netscope by hovering over the pool1 layer). But with valid padding the output was only
15×15. Using "INCLUDE_LAST_PIXEL" fixed this. I verified that the official Caffe converter also
uses this option for pooling layers.

It’s only a small detail, but this is where mistakes happen. This is why I like to add layers
one-by-one and then verify their output shapes, to make sure I’m still on the right track.

Sanity checking the model

At this point you have a model with two layers. It’s a good idea to check that the output
shapes of these layers correspond to the blob shapes from the original Caffe model. Here
is one way you can do that.

Note: The technique described in this section does not work anymore with Xcode 11 or
later. Unfortunately, coremlc no longer outputs the inferred layer shapes. See the chapter
Checking the Layer Output Shapes for alternative ways to verify that the data has the
correct shape.

Normally the last layer’s output_name must be equal to the name of a model output. But
currently your last layer is "pool1" and its output is called "pool1_output", while the model’s
output is "labelProbs".

You could temporarily change the layer’s output name, but another solution is to add a fake
layer that connects "pool1_output" to "labelProbs":

builder.add_activation(name="fake",
non_linearity="LINEAR",
input_name="pool1_output",
output_name="labelProbs")

A linear activation has no effect on the computations performed by the model, and so this
is a useful kind of layer to add anywhere into the model when you want to try some kind
of experiment.

Now create an MLModel object from the builder and save it to a temporary file:

mlmodel = ct.models.MLModel(builder.spec)
mlmodel.save("temp.mlmodel")

Note: It’s possible you get an error message at this point. Creating the MLModel will invoke the
Core ML compiler. If something is not right in your model definition, the compiler returns
an error. In that case, you need to fix the error before you can continue.

The output from the Core ML compiler is actually what we’re after but it does not show up
inside the Python REPL. Fortunately, you can also run the Core ML compiler yourself. Open
a Terminal and write the following:

xcrun coremlc compile temp.mlmodel output

This should print something like the following:

Input name(s) and shape(s):
    image : (C,H,W) = (3, 32, 32)
Neural Network compiler 0: 100 , name = conv1, output shape : (C,H,W) = (32, 32, 32)
Neural Network compiler 1: 120 , name = pool1, output shape : (C,H,W) = (32, 16, 16)
Neural Network compiler 2: 130 , name = fake, output shape : (C,H,W) = (32, 16, 16)

I wanted to see if the output from pool1 was indeed 16×16 pixels with 32 channels and it is!
So this layer appears to be configured correctly.

Tip: If you’re not sure what the output shapes of the layers are supposed to be, and you
can’t use a tool such as Netscope to look at the blob shapes, then perform a forward pass
of the original model with some random data and look at the sizes of the output blobs. See
also the chapter Looking at Intermediate Layer Outputs.

When you’re done, write the following to delete the fake layer again:

del builder.spec.neuralNetworkClassifier.layers[-1]

Tip: If your model is not a classifier, you’d use builder.spec.neuralNetwork.layers instead.



Adding the rest of the layers

Two layers done, eleven to go. These are pretty much all done in the same way. For example,
after pool1 comes a ReLU activation:

builder.add_activation(name="relu1",
non_linearity="RELU",
input_name="pool1_output",
output_name="relu1_output")

As before, it’s just a matter of choosing the correct settings — here just the type of the
activation used — and filling in the input_name and output_name.

Next up are two more convolution layers, two more pooling layers, and two more ReLUs.
I’m skipping these in order to save space. You can find the full source code in the book’s
GitHub repo.

After the pool3 layer comes a fully-connected or dense layer, also called an inner product
layer by Core ML. With most training tools, you first need to “flatten” the convolution output
into a vector before you can use a fully-connected layer. The output from pool3 is a (64, 4,
4) tensor. Flattened this becomes a vector of 1024 elements.

In Core ML you don’t really need a flatten layer. It is smart enough to automatically flatten
the tensor into a vector. However, it doesn’t hurt to add a flatten layer anyway:

builder.add_flatten(name="flatten1",
                    mode=0,
                    input_name="pool3_output",
                    output_name="flatten1_output")

And now you can add the first fully-connected layer:

W, b = get_weights("data/ip1")

builder.add_inner_product(name="ip1",
W=W,
b=b,
input_channels=1024,
output_channels=64,
has_bias=True,
input_name="flatten1_output",
output_name="ip1_output")

It’s important to set the proper value for input_channels. As I mentioned, the output from
pool3 is 64×4×4 = 1024 elements, and so input_channels should be 1024.

As in the Caffe model, this fully-connected layer is immediately followed by another:

W, b = get_weights("data/ip2")

builder.add_inner_product(name="ip2",
W=W,
b=b,
input_channels=64,
output_channels=10,
has_bias=True,
input_name="ip1_output",
output_name="ip2_output")

And the final layer is a softmax for the probabilities:

builder.add_softmax(name="softmax",
                    input_name="ip2_output",
                    output_name="labelProbs")

Since this is the last layer, its output_name should be the same as the model output name,
"labelProbs". That’s how Core ML knows which layer the model output comes from.

And that concludes the model! You’ve defined the inputs and outputs, set image prepro-
cessing options, defined the class names, and added all the layers.

Pretty boring, actually. :-) It’s just a bunch of builder.add_xxx() calls in a row, one for each
layer. I hope you’ll agree with me it wasn’t very difficult after all.

Tip: It’s possible you get an error when adding a new layer. In that case, it may still have
been added but with an incomplete configuration. You should check if this is the case
with print(builder.spec.neuralNetworkClassifier.layers[-1].name) and if so, remove the
incomplete layer with del builder.spec.neuralNetworkClassifier.layers[-1].

Saving the model

At this point, it’s smart to run the Core ML compiler to see if the model you’ve just defined
is actually valid. NeuralNetworkBuilder will already try to catch errors as you add the layers,
but it’s still possible to make a mistake or forget a step. An easy mistake is to spell a layer’s
input or output name wrong.

mlmodel = ct.models.MLModel(builder.spec)

This invokes the Core ML compiler and creates a new MLModel object. If you get an error
message here, something is wrong in the definition of your model.

For example, I got the error: “validator error: Convolution layer ‘conv3’ has weight matrix
of size 25600 to encode a 64 x 32 x 5 x 5 convolution.” What happened was that I had made
a copy-paste error and used the weights for conv2 with layer conv3.

No error messages? Great! Now you can fill in your model’s metadata and save it:

mlmodel.short_description = "cifar10_quick"
mlmodel.author = "https://github.com/BVLC/caffe/tree/master/examples/cifar10"
mlmodel.license = "https://github.com/BVLC/caffe/blob/master/LICENSE"

mlmodel.input_description["image"] = "The input image"
mlmodel.output_description["labelProbs"] = "The predicted probabilities"
mlmodel.output_description["label"] = "The class with the highest score"

mlmodel.save("CIFAR10.mlmodel")

This is also the ideal place to make additional changes, to convert the weights to 16-bit
floats, to make the input sizes flexible, or to perform any other type of model surgery.

Tip: Just to make sure the model really is correct, now is a good time to check the layer
output shapes.

Making a test prediction

It’s also a good idea to test the model on some images. Just because Core ML compiled the
model without problems, that doesn’t mean the model will work correctly. It’s especially
easy to make a mistake with the weights — and if the weights are wrong, the model will
predict nonsense.

import PIL
img = PIL.Image.open("boat.jpg")
img = img.resize((32, 32), PIL.Image.BILINEAR)
prediction = mlmodel.predict({"image": img}, useCPUOnly=True)
print(prediction)

Compare this to what the original Caffe model predicts for the same image to make sure
the results are the same. They will never be 100% identical but still should be pretty close.

For my test image, the original model predicted “ship” 0.86478 as the winning class. Core
ML also predicted “ship” but with probability 0.88286. Close enough. If several other test
images also give very similar results, it’s safe to say the model works OK.
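One way to make that comparison programmatic is sketched below. It assumes you exported the original Caffe model’s probabilities for the same test image to a NumPy file beforehand (for example with pycaffe), in the same class order as cifar10_labels; the file name is hypothetical:

import numpy as np

caffe_probs = np.load("caffe_probs_boat.npy")   # hypothetical file, shape (10,)
coreml_probs = np.array([prediction["labelProbs"][name] for name in cifar10_labels])

print("max absolute difference:", np.abs(caffe_probs - coreml_probs).max())
# A difference of around 0.01 or 0.02 is normal; anything much larger usually
# points at a conversion mistake, such as wrongly ordered weights.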

Bonus topic: using the mean image

When Caffe trains the model from the CIFAR-10 example, it also saves a mean.binaryproto.
This file includes the average values for the RGB color channels for every pixel in the
training images. Because CIFAR-10 images are 32×32 pixels, mean.binaryproto contains
32×32×3 = 3072 color values.

Core ML allows you to use such a mean image instead of the red/green/blue_bias prepro-
cessing parameters. Just for the fun of it, let’s add this mean image into the model.

First, we need a way to load this file. It’s not saved as HDF5 but in a protobuf format. To
load a protobuf file, you need to convert the original .proto definitions using protoc, the
protobuf compiler. You can read more about this in the chapter Using Protobuf Without
coremltools.

In the repo for this book, I’ve included a caffe_pb2.py that was created using protoc, as well
as the mean.binaryproto file. To load the mean image into Python, you write:

import caffe_pb2
mean_image = caffe_pb2.BlobProto()
mean_image.ParseFromString(open("mean.binaryproto", "rb").read())
mean_image = np.array(mean_image.data)

Now mean_image is a NumPy array with 3072 elements. To find the average colors for the
RGB color channels, we can reshape this array to (3, 32, 32) and take the mean:

mean_image.reshape(3, -1).mean(axis=-1)

This prints array([125.3069178 , 122.95039426, 113.86538316]), which corresponds to the
values of red/green/blue_bias we used earlier.

To tell Core ML to use this mean image instead of the other preprocessing options, do:

builder.spec.neuralNetworkClassifier.preprocessing[0].meanImage\
.meanImage.extend(mean_image)

This copies the values from the NumPy array into the model definition. (Note: it is correct
that this says .meanImage twice. The first one refers to the NeuralNetworkMeanImage structure,
the second one is the array of data values inside that structure.)

And now you can save the model again. If you make a new test prediction it should still
give the correct answer, although the exact probabilities may be slightly different since the
preprocessing is slightly different too.

Changing an existing model

As of coremltools 3.0, NeuralNetworkBuilder can also be used to make changes to an existing
mlmodel file. You do this by passing in the spec of that model when you construct the
builder:

existing_model = ct.utils.load_spec("YourModel.mlmodel")
builder = ct.models.neural_network.NeuralNetworkBuilder(spec=existing_model)

You can now use the various add_xxx(...) functions to add new layers, use the set_xxx()
functions to change the model’s inputs and outputs, or make the model updatable so that
it can be trained on device.

Note: The NeuralNetworkBuilder constructor also takes an nn_spec argument. This is differ-
ent from using spec. You would use nn_spec for building nested neural nets that go inside
loops or branches.

NeuralNetworkBuilder has handy inspection methods that can print out information about
the model:

builder.inspect_input_features()
builder.inspect_output_features()
builder.inspect_layers(last=3)
builder.inspect_conv_channels("layer_name")
builder.inspect_innerproduct_channels("layer_name")

Especially inspect_layers() is useful for finding out what the last few layers in the model
are, since those are usually the ones you’d want to make updatable.

To make a model updatable, you can use the following functions:

builder.make_updatable(["layer1", ...])

training_features = [ ("input", datatypes.Array(3)),
                      ("target", Double) ]
builder.set_training_input(training_features)

builder.set_categorical_cross_entropy_loss(...)
builder.set_mean_squared_error_loss(...)

builder.set_sgd_optimizer(...)
builder.set_adam_optimizer(...)
builder.set_epochs(...)

Read more about how to use these in the chapter Making a Model Updatable.

Note: If you’re building the model from scratch, you can also pass the training_features
argument to the NeuralNetworkBuilder constructor in order to describe the training inputs.
But you’ll still need to specify the updatable layers, the loss function, and the optimizer
afterwards using these functions.

You can use the following inspection methods to look at a model’s training settings:

builder.inspect_updatable_layers()
builder.inspect_loss_layers()
builder.inspect_optimizer()

When you’re done making changes, save the model like so:

ct.utils.save_spec(builder.spec, "YourNewModel.mlmodel")
Model Training Tips
It’s a good idea to keep the restrictions and limitations of mobile devices in mind when
designing and training your model.

Just because the model gives good results when you evaluate it on a desktop or server-class
machine, doesn’t mean it’s suitable for use on mobile.

You’ll need to balance the following performance characteristics:

• How good are the model’s predictions?

• How much space does the model file take up in your app bundle?

• How fast does it run?

• How much free RAM does an inference pass require?

• How long or how often can you run it, without making the device too hot or using too
much battery power?

Tip: To test the speed of a model you don’t need to train it first. Just define the model’s
architecture and initialize the layers with random numbers, then convert the model to
Core ML. The predictions from the model won’t make any sense at this point, since it is
not trained, but you do get accurate speed measurements. Plus you’ll immediately know if
the Core ML conversion of your architecture is going to give any problems.

Here are some things to consider when you design and train your models.

The number of parameters

This affects the size of the model in your app bundle but also how fast the model loads
and runs. Fewer is better but too few parameters and your model won’t give good
results anymore. The trick is to find a good balance between number of parameters and
performance. The rule-of-thumb I personally use is 10 million parameters or less.
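If you’re not sure how many parameters your converted model ended up with, a quick sketch like the following counts them by walking the layers of the mlmodel spec. It assumes the weights are stored as 32-bit floats (quantized or 16-bit weights live in other fields) and only looks at convolution and inner product layers:

import coremltools as ct

spec = ct.utils.load_spec("YourModel.mlmodel")
nn = spec.neuralNetworkClassifier   # or spec.neuralNetwork, depending on the model type

total = 0
for layer in nn.layers:
    kind = layer.WhichOneof("layer")
    if kind == "convolution":
        params = layer.convolution
    elif kind == "innerProduct":
        params = layer.innerProduct
    else:
        continue
    # WeightParams stores the learned values in a flat list of floats.
    total += len(params.weights.floatValue) + len(params.bias.floatValue)

print("parameters:", total)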


Model compression

Core ML can save your model weights as 16-bit floats or quantized between 1 and 8 bits per
weight. However, this only changes the size in the app bundle. Core ML always dequantizes
the weights before use, and during runtime it does the computations with 16-bit or 32-bit
floats anyway.
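A minimal sketch of what the quantization step looks like, assuming coremltools 3 or newer:

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

model = ct.models.MLModel("YourModel.mlmodel")

# Store the weights as 8-bit quantized values; nbits=16 gives half-precision floats.
# (Depending on the coremltools version and platform this may return a spec
# instead of an MLModel.)
model_8bit = quantization_utils.quantize_weights(model, nbits=8)
model_8bit.save("YourModel_8bit.mlmodel")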

There are other model compression techniques that can make the model smaller, such as
decomposing the convolution kernels or using fixed-point numbers for the weights. But
Core ML won’t be able to benefit from these techniques, as it doesn’t have runtime support
for such kernels.

A technique that does help is pruning, where you literally remove portions of the model so
that it has fewer layers or convolution filters. If your model is too large and/or too slow, try
pruning61 it.

FLOPS and MACCs

One way to get an idea of the speed of your model is to count how many floating point
operations62 (FLOPS) or multiply-accumulate operations (MACCs) it performs. But this only gives
you a very rough idea. Just because one model has more FLOPS than another model doesn’t
automatically mean it is slower.
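If you do want to count the operations yourself, a rough sketch for a single standard convolution layer (usually the dominant cost in a vision model) looks like this:

def conv_maccs(kernel_size, in_channels, out_channels, out_height, out_width):
    """Multiply-accumulates for one standard convolution layer, ignoring the bias."""
    return kernel_size * kernel_size * in_channels * out_channels * out_height * out_width

# Example: the conv1 layer from the CIFAR-10 model in the previous chapter,
# a 5x5 kernel going from 3 to 32 channels on a 32x32 output feature map.
print(conv_maccs(5, 3, 32, 32, 32))   # 2,457,600 MACCs, roughly 4.9 million FLOPS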

GPUs and NPUs such as the Neural Engine are optimized for doing a lot of computations in
parallel. The more work these chips get to do, the more efficiently they run. So sometimes
more FLOPS is better.

The point I’m trying to make is: don’t worry too much about the FLOPS. This number only
tells you how many operations the model performs, but not what kind of operations. A
simple operation that involves reading and writing RAM is much more expensive than a
complicated calculation that doesn’t access memory.

Input and feature map sizes

On current computer architectures, a single memory access from main memory is much
slower than a single floating point operation — by a factor of about 100 or more! This
makes memory bandwidth more important for the speed of your model than the number
of FLOPS.
61 https://machinethink.net/blog/compressing-deep-neural-nets/
62 https://machinethink.net/blog/how-fast-is-my-model/

In a neural network, most of the memory accesses are from the convolution kernels as they
slide the convolution window over the input feature map to read its pixels. The bigger your
input image, the more FLOPS are needed to compute the convolutions, but also the more
memory accesses are required to read the pixels and write the results.

If you can get away with using smaller input images, that’s the easiest way to make your
models run faster. A 128×128 image only requires one-fourth the number of memory
accesses as a 256×256 image. It should therefore run about 4 times faster.

The same thing goes for the intermediate feature maps: use the largest output stride you
can get away with (this determines how often the input is downsampled) and the smallest
number of convolutional filters.

Tip: To run models on the GPU, Core ML uses the Metal Performance Shaders framework
(MPS). Because Metal is primarily a graphics programming API, it stores the feature
map data in textures that have 4 color channel values per pixel. MPS achieves the best
performance when the number of filters or channels in a convolutional layer is also a
multiple of 4. This is especially important when concatenating multiple feature maps. If
they all have a number of channels that is a multiple of 4, concatenation can be done for
free. Otherwise it involves making a copy.

Number of layers

The downside of having more layers is that it also involves reading and writing more
feature maps, which is done by relatively slow RAM accesses. But more layers isn’t
necessarily a bad thing: it depends on what the layers do.

MobileNet, for example, splits up a regular 3×3 convolution into two simpler convolution
layers that each get their own batch normalization and ReLU. There are now more layers,
so more intermediate feature maps. But you still come out ahead because the simpler
convolution layers perform much fewer RAM accesses than the full convolution would
have.

However, what works well for one type of hardware may not work well everywhere. The
MobileNet approach is really fast on the GPU, but the Neural Engine seems to prefer fewer
but heavier convolution operations.

The amount of RAM the model takes up at runtime

If your input images are large, your app will crash if an intermediate feature map becomes
too big to fit in RAM. How big is too big? Well, a 1024×1024 feature map with 512 channels
in 16-bit floating point format uses up 1 GB of RAM already.

Core ML will immediately reuse memory from an old layer when it’s no longer needed, so
it doesn’t need to fit the entire computational graph in memory. Still, most iPhones only
have 1 or 2 GB of RAM and you have to share that with other apps too.

If your model has to work on very high-resolution images, you may need to split up
that image into smaller tiles, run the model once for each tile, and combine the results
afterwards.

Use a smaller feature extractor

Many modern neural networks consist of a feature extractor followed by a number of
additional layers that perform the model’s actual task. Often the feature extractor is an
off-the-shelf convolutional net such as VGG or ResNet.

You can replace the feature extractor with a smaller network such as SqueezeNet or
MobileNet63 and get similar — but faster — results. This requires re-training the model,
which is why it’s smart to already use a mobile-friendly feature extractor right from the
start.

Make it work with 16-bit floats

Apple GPUs, and most likely the Neural Engine as well, prefer to work with 16-bit floating
point numbers, also known as half precision floats.

Regardless of the datatype in which your model’s weights are stored inside the mlmodel file
— 32-bit floats, 16-bit floats, or quantized — Core ML will internally convert the weights to
half precision floats before running the model on the GPU / Neural Engine.

Operations such as convolution always happen with 16-bit floats under the hood. The
intermediate results of such computations are kept as 32-bit, but the final result is written
to memory as 16-bit again.

MLModelConfiguration has an option allowLowPrecisionAccumulationOnGPU. If set to true, Core
ML will also use 16-bit floats for the intermediate results. This gives a nice little speed boost,
but it only works if your model produces activations that aren’t too large or too small.
63 https://machinethink.net/blog/mobilenet-v2

16-bit floats have a very limited range: the maximum value is 65504, and the smallest
positive value is about 6e-5. As is true for all floating point numbers, the larger or smaller
they get, the worse the precision becomes.

What this means is: if the activations produced by your model’s layers are too big, a 16-bit
accumulator will overflow and your model won’t work correctly.

Some tips to prevent this:

• Normalize your input data. It’s better for pixels to be in the range [-1, 1] than in the
range [0, 255] (see the sketch after this list).

• Use something like a ReLU6 to limit how large the activations can become. ReLU6 is
like the regular ReLU activation function but it prevents activations from becoming
greater than 6.

• Use a regularizer or a constraint on the weights to prevent them from becoming very
large during training.
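As a sketch of the first tip: one convenient place to apply the [-1, 1] scaling at inference time is the Core ML model’s own image preprocessing, so that it matches whatever normalization you used during training. Assuming a NeuralNetworkBuilder with an input named "image", like the one from the NeuralNetworkBuilder chapter:

builder.set_pre_processing_parameters(image_input_names=["image"],
                                      image_scale=2 / 255.0,
                                      red_bias=-1.0,
                                      green_bias=-1.0,
                                      blue_bias=-1.0)
# Core ML now computes pixel * (2/255) - 1, which maps [0, 255] to [-1, 1].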

Supported operations

Core ML may not have some of the layer types you need. For example, before mlmodel
version 3, Core ML did not have a bilinear resizing layer that supported arbitrary output
sizes, only an upsampling layer with an integer upsampling factor. The DeepLab v3+ model
works best with 513×513 images but this is impossible to achieve with Core ML on iOS 11.

To make it work on iOS 11, you can hack the trained DeepLab model to use an upsampling
layer with a ratio of 8 instead of the bilinear resizing layer, but that makes the output
512×512 instead of 513×513. A one pixel difference may not seem like a big deal, but it
causes the segmentation masks to be shifted and no longer fit the actual objects.

Knowing about this kind of limitation ahead of time allows you to train the model
accordingly!

You can always implement a custom layer, of course, but currently that means your model
won’t be able to run on the Neural Engine. If at all possible, design your model to work
within the capabilities of Core ML.

Design for the Neural Engine

Your model will work best if it can run on the Neural Engine. Unfortunately, Apple has not
published any guidelines on what sort of model architectures are suitable for their NPU.

The Neural Engine seems to work best with regular convolution layers and fully-connected
layers, but not so well with the following:

• RNN layers such as LSTM and GRU

• custom layers

• dilated convolutions

• depthwise convolutions

Avoid such layers if you want to take advantage of the Neural Engine.

I’ve been collecting tips and tricks about the Neural Engine in this GitHub repo64 . Check it
out for the latest insights!
64 https://github.com/hollance/neural-engine
Part 3: Examining Models

The mlmodel file format is the cornerstone of Core ML. This part of the book dives deep
into how this file format works. Knowing this is important to understand how Core ML
does things, but also how to fix issues with your mlmodel files.
Viewing Models With Netron
An indispensable tool for examining machine learning models is Netron by Lutz Roeder.
It supports many different model formats, including Core ML mlmodels. Netron is open
source, completely free, and runs on macOS, Linux, Windows and even in your web
browser. Download and installation instructions are on the Netron GitHub page65 .

Netron shows the model on the left and details about the selected layer on the right. You
can also examine the weights and biases for your layers, and even save them to disk as
NumPy arrays. Click on the model’s input or output to see the properties of the mlmodel
file itself, including the Core ML specification version it uses (under “format”).

Using this tool to inspect your models is highly recommended!

Tip: After converting a model to Core ML, open both the original model and the converted
model in Netron. Put them side-by-side to compare how the different layers were translated
to Core ML.

65 https://github.com/lutzroeder/netron

Viewing Models With visualize_spec
Note: This no longer works with coremltools 4 and up.

Even though it’s not nearly as pretty as Netron, coremltools comes with its own visualizer.

import coremltools as ct
model = ct.models.MLModel("SqueezeNet.mlmodel")
model.visualize_spec()

This starts a local web server and opens an interactive web page in your browser that shows
the layers in the model. Hover over a layer to see its properties.


Especially handy is the Shapes On/Off button that shows the output shapes of the layers.
As is common in Core ML, these shapes are five dimensional: (sequence length, batch size,
channels, height, width).

Currently, visualize_spec() doesn’t show any of the training stuff for updatable models,
such as the loss function and the optimizer. Perhaps that will be added at some point.

Tip: If a browser window opens but says “can’t connect to server”, or you get the error
message “Error: No module named SimpleHTTPServer”, you can fix this as follows:

First, find where coremltools is installed. For example, if you use Anaconda it’s installed in:

/Users/USERNAME/anaconda3/envs/ENVNAME/lib/python3.7/site-packages/coremltools

Open the file coremltools/models/_graph_visualization.py and replace the following line:

subprocess.Popen(['python', '-m', 'SimpleHTTPServer', str(port)])

with this one:

subprocess.Popen(['python', '-m', 'http.server', str(port)])

Start a new Python session and try model.visualize_spec() again.


The mlmodel File Format
To use a machine learning model with Core ML it needs to be in the mlmodel format.

The mlmodel file describes the layers in your model, the input and outputs, the class labels,
and any preprocessing that needs to happen on the data. It also contains all the learned
parameters: the weights and biases.

For neural network models that are updatable — models that can be trained on device —
the mlmodel file also contains the loss function, the optimizer, and the training hyperpa-
rameters.

Everything you need to use the model with Core ML is inside this one file.

You can use coremltools to convert your trained models into an mlmodel file. Hopefully
all goes well and you’re good to go! But if your converted model doesn’t work as expected
with Core ML, you’ll need to examine the mlmodel file and possibly do some surgery on it.
Therefore, it makes sense to learn how mlmodel files work.

Knowing how to read the mlmodel specification also gives you a better idea of what Core
ML can do and what it cannot.

Protobuf files

The mlmodel file format is an open standard that is based on Protocol Buffers, or protobuf,
a method of serializing structured data that was invented by Google. You can compare pro-
tobuf to XML or JSON, but it produces smaller files and is faster. Models from TensorFlow
and Caffe are saved using protobuf formats too.

The mlmodel specification is defined in a number of .proto files. These are text files
containing the schema that is used by protobuf to read and write mlmodel files. (Core ML
uses the proto3 version of the protocol buffers language.)

You can find these proto files inside the coremltools GitHub repo, in the folder
mlmodel/format66. There is also a copy in the Turi Create repo67.
66 https://github.com/apple/coremltools/tree/master/mlmodel/format
67 https://github.com/apple/turicreate/tree/master/src/external/coremltools_wrap/coremltools/mlmodel/format


Everything that Core ML can do is described by these proto files. If it’s not in here, Core ML
does not support it.

Note: Whenever a new capability is added to Core ML, such as a new neural network layer
type, it is added to these proto files. One way to find out what functionality is supported by
the different versions of Core ML is to look at the commit history for this folder.

Model.proto

The main file in the mlmodel specification is Model.proto. The most important definition
in this file is at the bottom and is also named Model. Here is an excerpt:

message Model {
  int32 specificationVersion = 1;
  ModelDescription description = 2;
  bool isUpdatable = 10;

  oneof Type {
    PipelineClassifier pipelineClassifier = 200;
    PipelineRegressor pipelineRegressor = 201;
    Pipeline pipeline = 202;

    GLMRegressor glmRegressor = 300;
    SupportVectorRegressor supportVectorRegressor = 301;
    TreeEnsembleRegressor treeEnsembleRegressor = 302;
    ...
  }
}

The protobuf syntax may take a little getting used to, but it’s not really that difficult once
you understand the terminology.

As with all data serialization formats, the different data structures that can occur in the
serialized files need to be defined somewhere. In protobuf parlance, a data structure is
called a “message”.

The code snippet above defines the Model data structure, which is the top-level message in
the mlmodel format. Model consists of four fields:

• specificationVersion, an integer

• description, a so-called ModelDescription object

• isUpdatable, a boolean

• and Type, a special oneof structure that defines the actual type of the model, such as a
SupportVectorRegressor or a TreeEnsembleRegressor.

Notice how each field in the message definition consists of a data type, a name, and a
number. This number is called the “tag”. When the mlmodel is written to disk as a binary
file, the message objects are stored as dictionaries. Each field in the serialized message is
a key-value pair where the key is the numeric tag. This is more efficient than storing the
actual field name.

Note: Occasionally when I read through these proto files, I get confused about the tag
values. When it says int32 specificationVersion = 1; that makes it seem you’re assigning
the value 1 to the specification version of the model. But that’s not the case: it just says that
the dictionary key for this field is 1, just like how a dictionary in Swift would use a hash
value for the key. It’s best to simply ignore these tags as you read through the proto files
— they’re just an implementation detail of how protobuf files work and not important for
understanding Core ML.
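Incidentally, these Model fields are easy to peek at from Python, because coremltools hands you the protobuf objects directly. A small sketch, using the CIFAR-10 classifier built earlier in this book:

import coremltools as ct

spec = ct.utils.load_spec("CIFAR10.mlmodel")
print(spec.specificationVersion)    # the mlmodel spec version used by this model
print(spec.isUpdatable)             # False, since we never made it updatable
print(spec.WhichOneof("Type"))      # "neuralNetworkClassifier"
print(spec.description.metadata)    # author, license, and the other descriptions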

Digging through the proto files

Most of the fields in the Model message refer to other messages, such as ModelDescription.
You can find this message definition in Model.proto too:

/**
 * A description of a model,
 * consisting of descriptions of its input and output features.
 * Both regressor and classifier models require the name of the
 * primary predicted output feature (``predictedFeatureName``).
 * Classifier models can specify the output feature containing
 * probabilities for the predicted classes
 * (``predictedProbabilitiesName``).
 */
message ModelDescription {
  repeated FeatureDescription input = 1;
  repeated FeatureDescription output = 10;

  string predictedFeatureName = 11;
  string predictedProbabilitiesName = 12;

  repeated FeatureDescription trainingInput = 50;

  Metadata metadata = 100;
}

This message has six fields. There is also a comment at the top that describes what these
fields mean. Not everything in the proto files is documented but there are plenty of
comments that help you figure out what’s going on.

The ModelDescription structure contains the data that you see when you open the mlmodel
in Xcode:

• One or more FeatureDescriptions for the model’s inputs. This is a repeated field, which
is an array or list of such objects.

• Likewise, a list of FeatureDescription objects for the model’s outputs.

• There is also a list of FeatureDescription objects for training inputs. These are used
instead of the regular inputs when you’re training the model.

• A predictedFeatureName and a predictedProbabilitiesName field, both strings.

• A Metadata object with additional information about the model.

If you want to know more about what a FeatureDescription or a Metadata is, look for its
message definition elsewhere in the proto file or in one of the other proto files.

Classifiers and regressors

As the comment at the top of message ModelDescription says, predictedFeatureName must be
the name of the model’s main output for regressor and classifier models. And if the model
is a classifier, predictedProbabilitiesName should be the name of the output that predicts
the probabilities.

Core ML makes a distinction between regressors, classifiers, and “other” models.

For example, there is a NeuralNetworkRegressor model type, a NeuralNetworkClassifier model
type, and a plain NeuralNetwork model type. For classical machine learning models such
as Support Vector Machines or Tree Ensembles, there are also separate regressor and
classifier variants.

This allows Core ML to perform some extra magic that makes it easier to use these models.
If a model is a classifier, and thus fills in the predictedProbabilitiesName field, Core ML
can automatically turn the numbers from that output into a dictionary of class names and
probabilities.

Of course, for this to work the mlmodel must also contain the class names. You can see this
in the message describing the classifier model, for example GLMClassifier looks like this:

message GLMClassifier {
  ...
  oneof ClassLabels {
    StringVector stringClassLabels = 100;
    Int64Vector int64ClassLabels = 101;
  }
}

Note: The fields inside the protobuf messages are always optional. If a model is not a classi-
fier or regressor, simply don’t fill in the predictedFeatureName and predictedProbabilitiesName
fields and they will not appear in the serialized mlmodel. If a field is required by Core ML,
the comments say so.

Supported model types

Core ML can handle several different types of models, such as:



• linear regression and logistic regression

• support vector machines (SVM)

• tree ensembles such as random forests and boosted trees

• neural networks: feed-forward, convolutional, recurrent

• k-Nearest Neighbors

Most of these can be used for regression as well as classification. In addition your model can
contain typical ML preprocessing steps like one-hot encoding, feature scaling, imputation
of missing values, and so on.

This kind of high-level overview is all well and good, but to know exactly what Core ML can
do, you have to look inside the proto files. The list of supported model types is in message
Model from Model.proto, which you saw earlier.

Model has a oneof field that lists all the available model types. Here’s the entire list as of Core
ML 3 (it is possible the most recent proto file contains additional model types):

oneof Type {
  PipelineClassifier pipelineClassifier = 200;
  PipelineRegressor pipelineRegressor = 201;
  Pipeline pipeline = 202;

  GLMRegressor glmRegressor = 300;
  SupportVectorRegressor supportVectorRegressor = 301;
  TreeEnsembleRegressor treeEnsembleRegressor = 302;
  NeuralNetworkRegressor neuralNetworkRegressor = 303;
  BayesianProbitRegressor bayesianProbitRegressor = 304;

  GLMClassifier glmClassifier = 400;
  SupportVectorClassifier supportVectorClassifier = 401;
  TreeEnsembleClassifier treeEnsembleClassifier = 402;
  NeuralNetworkClassifier neuralNetworkClassifier = 403;
  KNearestNeighborsClassifier kNearestNeighborsClassifier = 404;

  NeuralNetwork neuralNetwork = 500;
  ItemSimilarityRecommender itemSimilarityRecommender = 501;

  CustomModel customModel = 555;
  LinkedModel linkedModel = 556;

  OneHotEncoder oneHotEncoder = 600;
  Imputer imputer = 601;
  FeatureVectorizer featureVectorizer = 602;
  DictVectorizer dictVectorizer = 603;
  Scaler scaler = 604;
  CategoricalMapping categoricalMapping = 606;
  Normalizer normalizer = 607;
  ArrayFeatureExtractor arrayFeatureExtractor = 609;
  NonMaximumSuppression nonMaximumSuppression = 610;

  Identity identity = 900;

  CoreMLModels.TextClassifier textClassifier = 2000;
  CoreMLModels.WordTagger wordTagger = 2001;
  CoreMLModels.VisionFeaturePrint visionFeaturePrint = 2002;
  CoreMLModels.SoundAnalysisPreprocessing soundAnalysisPreprocessing = 2003;
  CoreMLModels.Gazetteer gazetteer = 2004;
  CoreMLModels.WordEmbedding wordEmbedding = 2005;
}

A oneof declaration means that the message can only include one of these fields. This is like
a switch statement in Swift, where each entry is one of the cases.

I want to highlight some of the model types:

• Identity is just for testing. It simply returns the inputs as the outputs.

• GLMRegressor and GLMClassifier are for linear regression and logistic regression, re-
spectively. These models are created by the scikit-learn converter.

• SupportVectorRegressor and SupportVectorClassifier are — what else? — for support
vector machines. These models are from LIBSVM or scikit-learn.

• TreeEnsembleRegressor and TreeEnsembleClassifier are for decision trees, random forests,
and boosted trees. These models are from XGBoost or scikit-learn.

• There are three types of neural nets: NeuralNetworkRegressor, NeuralNetworkClassifier,
and a generic NeuralNetwork.

• KNearestNeighborsClassifier is especially useful for building models that can be trained
on-device on data provided by the user.

• CustomModel is for wrapping your own algorithms into an mlmodel. When using a cus-
tom model, the mlmodel file is only a container for your model’s learned parameters
and any metadata. It doesn’t know anything about the actual model algorithm. See
also the chapter on custom models.

• The CoreMLModels are the model types that you can train with Create ML or Turi Create.

There are also a number of model types for feature engineering tasks (most of these are
used with scikit-learn):

• OneHotEncoder and CategoricalMapping: convert categorical features into numbers

• FeatureVectorizer and DictVectorizer: for converting features to arrays

• Imputer: fills in missing values in the data

• Scaler and Normalizer: for normalizing the input data

Also worth mentioning is a model type that is specifically meant for post-processing the
output data:

• NonMaximumSuppression: filters bounding box predictions from object detection models

Usually you’d do something like one-hot encoding or normalizing before passing the data to
a classifier model, and non-maximum suppression is done after running an object detection
model. So it may seem a little weird that these things are considered to be separate models
by Core ML — especially because there is only a single Model object in the mlmodel file.

Fear not! This is what the Pipeline model type is for. A pipeline lets you combine multiple
models as if they are one model. The Pipeline message is defined as follows:

message Pipeline {
  repeated Model models = 1;
  repeated string names = 2;
}

A pipeline is nothing more than a list of Model objects, letting you nest multiple Model objects
inside a single mlmodel file. The output of one model is the input to the next.

For example, when you train an object detection model with Turi Create, it produces
an mlmodel file whose model is of type Pipeline. The first model in this pipeline is a
NeuralNetwork that implements YOLO and predicts a fixed number of bounding boxes. The
second model in this pipeline is NonMaximumSuppression, which filters the predicted bounding
boxes and only keeps the best ones.
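
If you’re curious what’s inside such a pipeline, here is a minimal sketch using the coremltools
spec API (covered in the next chapters). The filename is just a placeholder, and it assumes the
model type is a plain pipeline:

import coremltools as ct

spec = ct.utils.load_spec("ObjectDetector.mlmodel")
if spec.WhichOneof("Type") == "pipeline":
    for i, sub_model in enumerate(spec.pipeline.models):
        print(i, sub_model.WhichOneof("Type"))

For the Turi Create object detector this should print something like neuralNetwork followed
by nonMaximumSuppression.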

Core ML 3 also introduced the concept of a LinkedModel, a placeholder inside a pipeline
that refers to another mlmodel file. This allows you to re-use the same model in multiple
pipelines, which is more efficient than giving each pipeline an identical copy.

Read more about pipelines in the chapter Building Pipeline Models.



Note: For a full description of what each of these model types does, look at their corre-
sponding proto files to see what kind of functionality they support. The documentation
isn’t always perfect, but you should be able to get a pretty good idea from the proto file.
Bonus tip: Also look at the source code for the different converters from the coremltools
repo to see how they use these model types.

Neural networks

If you’re interested in using Core ML for deep learning, check out NeuralNetwork.proto
to see all the neural network layers you can use. This is by far the largest proto file from
the mlmodel specification. Fortunately, it has extensive documentation.

The main data structure for a neural network model is:

message NeuralNetwork {
repeated NeuralNetworkLayer layers = 1;
repeated NeuralNetworkPreprocessing preprocessing = 2;

NeuralNetworkMultiArrayShapeMapping arrayInputShapeMapping = 5;
NeuralNetworkImageShapeMapping imageInputShapeMapping = 6;
NetworkUpdateParameters updateParams = 10;
}

This contains a list (repeated) of the layers, as well as a list of NeuralNetworkPreprocessing
definitions, one for each input in the model. Specification version 4 added the other three
fields: the two shape mapping fields control how the inputs are mapped to tensors, and
updateParams enables on-device training.

Want to know what sort of preprocessing is available in Core ML? Look at the message
definition:

message NeuralNetworkPreprocessing {
string featureName = 1;
oneof preprocessor {
NeuralNetworkImageScaler scaler = 10;
NeuralNetworkMeanImage meanImage = 11;
}
}

The featureName is the name of the model’s input to which this preprocessing gets applied.
You have the choice between a NeuralNetworkImageScaler or a NeuralNetworkMeanImage. Let’s
look at the scaler to see what that does:

message NeuralNetworkImageScaler {
float channelScale = 10;
float blueBias = 20;
float greenBias = 21;
float redBias = 22;
float grayBias = 30;
}

These are the same preprocessing options that you supply to coremltools when you convert
the model. Now you know that these are stored inside a NeuralNetworkImageScaler object,
inside a NeuralNetworkPreprocessing object, inside a NeuralNetwork object, inside a Model
object, inside the mlmodel file. Phew!
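
To make that nesting concrete, here is a minimal sketch that walks down the chain with
coremltools (more about the spec API in the next chapters). It assumes a model of type
NeuralNetwork with an image input that uses the scaler preprocessor:

import coremltools as ct

spec = ct.utils.load_spec("YourModel.mlmodel")      # Model
nn = spec.neuralNetwork                             # NeuralNetwork

for pre in nn.preprocessing:                        # NeuralNetworkPreprocessing
    print(pre.featureName, pre.WhichOneof("preprocessor"))
    if pre.HasField("scaler"):                      # NeuralNetworkImageScaler
        print(pre.scaler.channelScale, pre.scaler.redBias,
              pre.scaler.greenBias, pre.scaler.blueBias)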

The definition for NeuralNetworkLayer is pretty big, because it encompasses all the possible
layer types:

message NeuralNetworkLayer {
string name = 1;
repeated string input = 2;
repeated string output = 3;
repeated Tensor inputTensor = 4;
repeated Tensor outputTensor = 5;
bool isUpdatable = 10;

oneof layer {
ConvolutionLayerParams convolution = 100;
PoolingLayerParams pooling = 120;
ActivationParams activation = 130;
InnerProductLayerParams innerProduct = 140;
EmbeddingLayerParams embedding = 150;
/* and many more... */
}
}

As you can see, in Core ML a neural network layer has a name, one or more inputs, one or
more outputs, and a layer type (from the oneof). As of Core ML 3, layers can also be marked
as being updatable for on-device training.

The inputs and outputs are strings with the output and input names of the layers that are
connected to this one. Most layers will have just one input and one output. A “concat” layer
can have multiple inputs; a “split” layer will produce multiple outputs.

Using these three fields — name, input and output — you can describe the complete graph of
the layers in the neural network.

To know more about a particular layer type, check out its message definition.

Note that in the mlmodel specification, an activation function is considered to be a separate
layer. To see the supported activation functions, look at ActivationParams:

message ActivationParams {
oneof NonlinearityType {
ActivationLinear linear = 5;
ActivationReLU ReLU = 10;
ActivationLeakyReLU leakyReLU = 15;
ActivationThresholdedReLU thresholdedReLU = 20;
...
}
}

And so on… everything you want to know about Core ML’s neural network layers is inside
this proto file. I suggest reading through it; there are a lot of comments explaining how
everything works.

Tip: I wrote a blog post68 about everything that’s new in Core ML 3. This also describes the
available model types and neural network layer types in more detail, so be sure to check
that out. See also the 2020 update69 .
68 https://machinethink.net/blog/new-in-coreml3/
69 https://machinethink.net/blog/new-in-apple-machine-learning-2020/
Dynamic Tensor Shapes
One of the claims of Core ML 3 is that it is much more “dynamic” than the previous versions.
What exactly does this mean?

For Core ML 2 and earlier, the data that flowed through the neural network always was
a tensor of rank 5. That means each data blob is a tensor made up of the following five
dimensions, in this order:

(sequence length, batch size, channels, height, width)

This choice makes a lot of sense when the input to your neural network is mostly images,
but it’s not very accommodating to other types of data.

For example, in a neural network that processes one-dimensional vectors — or rank-1
tensors — you were supposed to use the channels dimension to describe the size of the
vector and set the other dimensions to size 1. In that case, the shape of the input tensor
would be (1, 1, number of elements, 1, 1). That’s just awkward.

As of model spec v4, Core ML models support tensors of arbitrary rank and shape, making
Core ML much more suitable to data other than images. Many new layer types were added
just for this purpose, and most of the existing layers were improved so they can now handle
tensors of any rank and shape.

With these changes in Core ML 3, it’s no longer needed to stuff your data into rank-5 tensors.
If you want your model to work on vector data, you just give it a rank-1 tensor. Want to use
a matrix, use a rank-2 tensor. Want to use lots of dimensions for some reason, use a rank-6
tensor. And so on…

Core ML is now just as flexible as tools like TensorFlow, PyTorch, and NumPy — and that
makes it possible to convert many more types of neural network architectures to Core ML.

Note: Layers may still have certain limitations. For example, the input to a convolution
layer must be at least a rank-4 tensor of shape (batch, channels, height, width). The tensor
may have more than four dimensions, but not fewer. So if you try to feed a rank-3 tensor
into a convolution layer, Core ML will give an error.


NeuralNetwork.proto70 explains how each layer type handles tensors of different ranks. If
you’re ever wondering what the right tensor shape is for a layer, that’s the place to look.

How is this different from size flexibility?

Since version 2, Core ML has supported flexible input shapes, which tell the model how big
it can expect the input data to be. This may seem similar to the topic of this chapter, but is
not quite the same thing.

In the first version of Core ML, models always had a hardcoded input size. For example, if
you made an image classifier that took a 224×224 RGB image as input, it would only work
on images of 224×224 pixels. Since that’s a little limiting, Core ML 2 allowed you to specify
a range of possible sizes instead of just a single fixed size. If the model outputs an image or
a multi-array, you can also give the outputs flexible sizes.

Where size flexibility only describes the possible sizes of the model’s inputs and outputs,
the new dynamic tensor stuff in Core ML 3 describes what happens to the tensors inside
the neural network.

In certain model architectures, the shapes of the intermediate tensors may be different
from one run to the next, even if the input tensor is always a 224×224 RGB image.

For example, if the first part of your model predicts a bounding box, crops the corresponding
region from an image, and feeds the cropped region into the second part of your model,
it’s likely that the cropped image tensor will have a different size every time. Therefore, the
layers in the second part of the model cannot make any assumptions about the shape of that
cropped tensor.

That’s why Core ML 3 is much more dynamic than before: the operations performed by the
neural network are no longer completely determined beforehand when the model is built,
but are determined at runtime, where what happens depends on the data being processed.

Input shape mapping

In previous versions of Core ML, all inputs to the neural network were mapped to that
ubiquitous rank-5 tensor (sequence, batch, channels, height, width) because that’s the
format that all layers expected. This is no longer the case with Core ML 3, as the layers
have become much more flexible.
70 https://github.com/apple/coremltools/blob/master/mlmodel/format/NeuralNetwork.proto

When creating your neural network, you can now decide how your input data gets mapped
into tensors. In NeuralNetwork.proto there are two new definitions for this:

• NeuralNetworkImageShapeMapping for image inputs. Instead of the old rank-5 tensor,
RANK5_IMAGE_MAPPING, you can choose to use a rank-4 tensor, RANK4_IMAGE_MAPPING, which
is just (batch size, channels, height, width). This drops the “sequence length”
dimension, which you usually don’t need for images (unless, of course, you have a
sequence of them).

• NeuralNetworkMultiArrayShapeMapping for multi-array inputs. You can choose between
the old way, RANK5_ARRAY_MAPPING, which creates that rank-5 tensor shown previously,
or the new way, EXACT_ARRAY_MAPPING, which simply passes the input tensor through
unchanged. For most types of data that is not images, it makes sense to use this new
method. You will also need to use the new method if your model uses any of the new
layer types that were added with Core ML 3.

The coremltools converters will automatically choose the appropriate option when converting
your model. If you’re using NeuralNetworkBuilder, you can pass in the
disable_rank5_shape_mapping=True argument to use the new format. Enabling this means the
model needs iOS 13 or up.
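
As a rough sketch of what that looks like (the input and output names and shapes here are
made up):

import coremltools as ct
from coremltools.models import datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder

# Hypothetical rank-2 input and output. With the rank-5 mapping disabled,
# these shapes should be used exactly as specified (EXACT_ARRAY_MAPPING).
input_features = [("tokens", datatypes.Array(128, 64))]
output_features = [("scores", datatypes.Array(128, 10))]

builder = NeuralNetworkBuilder(input_features, output_features,
                               disable_rank5_shape_mapping=True)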

ND-layers

Many of the layer types from model spec v3 and earlier have been extended to work with
arbitrary size tensors, but usually they still have some restrictions.

For example, the old Softmax layer now accepts any tensor of rank 3 or higher but always
applies the softmax on the axis N-3 where N is the rank of the input. For the old-style rank-5
tensors that means the channels axis.

That’s not very flexible, of course, so there is now a new version named SoftmaxND. The “ND”
stands for N-dimensional. It does the same thing as the old Softmax layer but can apply the
softmax operation to any of the tensor’s axes.

Several other layer types now have “ND” variants too, such as Concat and Split. If you’re
using the new dynamic tensors in your model, you’ll want to use the “ND” versions of these
layers.

Broadcasting

Broadcasting is a technique that lets you perform an operation on tensors of different
ranks.

For example, let’s say you want to subtract the RGB values [0.485, 0.456, 0.406] from every
pixel in a 224×224 color image. The problem is that you can’t directly subtract a rank-1
tensor from a rank-3 tensor…

Without broadcasting you’d first have to copy the RGB values into a (3, 224, 224) tensor,
and then subtract that from the image tensor. That’s not very efficient. With broadcasting,
this copying is done implicitly and you don’t have to worry about it.
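
If you want to see the idea in NumPy terms first, here’s a minimal sketch (the shapes are just
examples):

import numpy as np

image = np.random.rand(3, 224, 224)      # rank-3 tensor: (channels, height, width)
mean = np.array([0.485, 0.456, 0.406])   # rank-1 tensor: one value per channel

# Reshape to (3, 1, 1) so the values broadcast across the height and width axes.
result = image - mean.reshape(3, 1, 1)
print(result.shape)                      # (3, 224, 224)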

The existing element-wise operations between two or more tensors already supported
broadcasting to some extent, but Core ML 3 adds new layer types that fully support
NumPy-style broadcasting. In this example, you’d use SubtractBroadcastableLayer.

There are also BroadcastToStaticLayer, BroadcastToLikeLayer, and BroadcastToDynamicLayer
operations. These change the shape of the tensor according to the common NumPy
broadcasting rules.

Many other layer types that don’t have the word “broadcastable” in their name, such as
EqualLayer, also support broadcasting. This makes Core ML much more capable for general-
purpose tensor processing.

Note: As an illustration of why broadcasting is great, the object detection model from Turi
Create needs to multiply a tensor of shape (1, boxes, classes) by another of shape (1,
boxes). These tensors have a different rank. We can turn the rank-2 tensor into a rank-3
one by reshaping it as (1, boxes, 1), but that doesn’t fix the problem as both tensors still
have a different number of elements.

The only way to do this in Core ML 2 and before is to use a concat layer that repeats the (1,
boxes, 1) tensor classes times. The output of that layer is then also a (1, boxes, classes)
tensor. Now both tensors have the same rank and the same number of elements, and they
can be multiplied. With Core ML 3 and the new MultiplyBroadcastableLayer, the concat
operation becomes unnecessary.

Static and dynamic layer types

A number of layer types have static and dynamic variants.



Static means, “everything is known about this operation beforehand” while dynamic
means “the arguments of this operation can change at runtime”. For example, the static
version of a layer may have a hardcoded outputShape property while the dynamic version
can use a different output shape every time.

One of the layer types that has been made more dynamic is the reshape layer, which you’d
use to change the shape of a tensor. In addition to the old ReshapeLayer, which only works
on rank-5 tensors, Core ML 3 adds three new variants: ReshapeStaticLayer, ReshapeLikeLayer,
and ReshapeDynamicLayer.

Likewise, there are now FillStaticLayer, FillLikeLayer, and FillDynamicLayer, for filling a
tensor with a constant scalar value.

Notice how these layer types come in three different variants: Static, Like, and Dynamic.
What do these mean?

• Static is the simplest one: all the properties for this layer are hardcoded in the
mlmodel file. If you know that, regardless of what happens, you’re always going to
need a tensor with shape (32, 10, 7), you would use FillStaticLayer or ReshapeStaticLayer.
(Note that FillStaticLayer does not take an input tensor but ReshapeStaticLayer does.)

• Like takes an additional input tensor and outputs a new tensor that has the same shape
as that input. The layer ignores the actual values from that extra input tensor – it only
looks at its shape. FillLikeLayer takes only one input and uses this to determine the
shape of the output tensor, while ReshapeLikeLayer takes two input tensors: the one to
reshape, and the second one whose shape it will reshape to.

• Dynamic is similar to Like: it also takes an additional input tensor, but this time it’s not
the shape of that tensor that’s important but its contents. For example, to fill a tensor
of shape (32, 10, 7) you would pass in a rank-1 tensor of shape (3) that has three
values: 32, 10, and 7.

Note: Static / dynamic isn’t always about the output shape, it depends on the layer.
For example, Core ML 3 also has random distribution layers, such as RandomNormal and
RandomUniform, that fill a tensor with random values. Here, you can set the random seed
dynamically too. Some of the dynamic layers have several different inputs that let you
override their default properties. Interestingly, FillDynamicLayer doesn’t let you pass in the
scalar value dynamically, only the tensor shape.

Also worth mentioning is GetShapeLayer. This returns a vector containing the shape of the
input tensor, which lets you inspect at runtime how big a given tensor is. You could use this
with one of these Dynamic layer types.

Thanks to the support for arbitrary shape tensors, layer types that don’t need to have all
their settings hardcoded, and new operations for creating, shaping, and manipulating ten-
sors, Core ML models can now express much more complex and interesting computational
graphs that go far beyond basic convolutional neural networks!
Using the Spec to Edit Models
After you convert a model to Core ML format using coremltools, you get an MLModel object
from the coremltools.models module.

You can also create an MLModel object by loading an mlmodel file:

import coremltools as ct
model = ct.models.MLModel("YourModel.mlmodel")

The coremltools.models.MLModel class provides a very basic interface to using Core ML from
Python. You can use it to change the model’s metadata and make predictions. (Note that this
is not the same as the MLModel class from CoreML.framework. We’re in Python land here.)

That’s nice, but to make more extensive changes to the model definition you need the spec,
which contains the protobuf objects that describe the model’s properties in detail. With the
spec you can add new layers, remove layers, rename inputs and outputs, and change the
model in pretty much any way you please.

The Python MLModel object is really just a very simple wrapper around such a spec object.
For most model surgery tasks, you’ll need to use the spec directly.

Tip: I prefer to work interactively from the Python REPL when dealing with spec objects. I
recommend using IPython71 or even a Jupyter notebook72 .

Getting the spec

There are a few ways to get the spec object. You can get it from the MLModel:

spec = model.get_spec()

This creates a deep copy of the spec, so changing anything in this copy does not affect the
existing MLModel object.

If you don’t want to use a copy but directly make changes to the MLModel object, you can do:
71 https://ipython.readthedocs.io/en/stable/
72 https://jupyter.org


spec = model._spec

In Python, symbols starting with an underscore are private. You’re not really supposed to
use _spec directly, but there’s also no one stopping you. Any changes you make to spec are
now immediately reflected in the MLModel. (Although if you want to make predictions with
the changed spec, you’ll need to create a new MLModel instance, which runs the Core ML
compiler again.)

The third way to get a spec is to skip MLModel altogether and load it directly:

spec = ct.utils.load_spec("YourModel.mlmodel")

A quick way to make sure the spec was loaded correctly is to print the model description:

print(spec.description)

This prints out the model’s metadata as well as the descriptions of its inputs and outputs.
(This is the contents of the ModelDescription message from Model.proto.)

Tip: don’t do print(spec). That will print out the entire model, including all of its learned
parameters. You’ll regret it!

Making changes using the protobuf API

The spec consists of protobuf objects. To see this, run the following command from the
Python REPL:

type(spec)

This prints Model_pb2.Model. The _pb2 suffix means this is an object that is described by the
Model.proto file.

All the protobuf “message” objects in the spec have similar types. For example, writing
type(spec.description) prints out Model_pb2.ModelDescription, and so on. These types liter-
ally correspond to the definitions in the proto files. Therefore, understanding the mlmodel
specification is important in order to work with the spec (see the previous chapter).

Because these are all protobuf objects, you will need to use the Python protobuf API to
manipulate them. This API is actually quite convenient to use, but not always obvious if
you’re new to it.

To change a field that is an integer or string, you can simply assign it a new value. For
example, the human-readable description of the model that is shown in Xcode:

spec.description.metadata.shortDescription = "This is my awesome model!"

If you don’t want to include such a description, simply set it to the empty string "". You can
also reset the field to its default value:

spec.description.metadata.ClearField("shortDescription")

All the fields in the spec objects are optional. Numeric fields and string fields will always
have a default value (0 or the empty string) but fields that are messages may not be present
at all. To inspect if the model has a certain message, you can write:

spec.description.HasField("metadata")

If a field is a oneof, you can use WhichOneof() to figure out which one it is (notice the spelling
of Oneof!). Recall that spec is an instance of the Model message that has a field oneof Type
that determines what sort of machine learning model is in the mlmodel file.

print(spec.WhichOneof("Type"))

This may print something like neuralNetworkClassifier. You can look at the properties of
this object by writing:

spec.neuralNetworkClassifier.someProperty

Where someProperty is a field from the message type NeuralNetworkClassifier. (Don’t try to
print out all of spec.neuralNetworkClassifier as it dumps the contents of all the layers and
their learned parameters.)

If you write spec.neuralNetwork or spec.treeEnsembleClassifier or any of the other model
type names, you’ll get None as the answer (or just nothing) because those are not the correct
type for this particular mlmodel. The name you use, in this case spec.neuralNetworkClassifier,
has to match the result from WhichOneof("Type").

Note: type(spec.neuralNetworkClassifier) is NeuralNetwork_pb2.NeuralNetworkClassifier
because this message definition is located in the file NeuralNetwork.proto.

The NeuralNetworkClassifier has a repeated field called layers that consists of zero or more
NeuralNetworkLayer objects. This works like a list and so you can iterate through it:

for layer in spec.neuralNetworkClassifier.layers:
    print(layer.name)

Or if you just want to look at the convolution layers:

for layer in spec.neuralNetworkClassifier.layers:
    if layer.HasField("convolution"):
        print(layer.name)

Or equivalently:

for layer in spec.neuralNetworkClassifier.layers:
    if layer.WhichOneof("layer") == "convolution":
        print(layer.name)

I hope you’re following along with NeuralNetwork.proto to check that this makes sense!

To look at a specific layer you can write:

layer = spec.neuralNetworkClassifier.layers[0]
print("name", layer.name)
print("inputs", layer.input)
print("outputs", layer.output)

Because input and output are repeated fields, you can treat them as Python lists. For
example, to add a second input to the layer:

layer.input.append("another_input")
print(len(layer.input))

Of course, if you now save the spec to an mlmodel file, Xcode won’t be able to load it because
this new input isn’t defined anywhere. So let’s delete it again:

del layer.input[-1]

You can also add new layers. But because layers are message objects, you cannot write
append() like you just did. That only works for simple types like strings and numbers. You
first have to add a new layer object:

new_layer = spec.neuralNetworkClassifier.layers.add()

This new_layer object is of type NeuralNetwork_pb2.NeuralNetworkLayer but it doesn’t have any
contents yet. To give it something to do, simply assign values to its fields:

new_layer.name = "My new layer"
new_layer.input.append("old_output")
new_layer.output.append("new_output")

This connects the new layer to the existing output named "old_output", which now becomes
the input to this new layer.

Note: You can’t write new_layer.input = ["old_output"]. That is an error with the Python
protobuf API. Since these are not regular Python lists but protobuf objects, the API is
sometimes a bit different than what you may be used to.
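
What does work on repeated fields is extend(), and you can clear them with del. A small
sketch:

# Append several inputs at once:
new_layer.input.extend(["old_output", "another_output"])

# Remove all the inputs again and start over:
del new_layer.input[:]
new_layer.input.append("old_output")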

The new layer doesn’t have a type yet. This is where the protobuf API is a little weird. Let’s
say you want to make this a linear activation layer. According to the proto file, activation
layers are described by the ActivationParams message.

Unlike what you might perhaps expect, you don’t create a new ActivationParams object and
then assign it to the layer. This doesn’t work:

new_layer.activation = ct.proto.NeuralNetwork_pb2.ActivationParams()

Instead, you simply write something like:

new_layer.activation.linear.alpha = 1

By accessing the .activation field you’re telling protobuf you want this layer to have an
ActivationParams object. By writing .linear you’re saying it should be an ActivationLinear.
And by accessing the .alpha field you actually make all of this happen.

If you now print out the properties of new_layer, you should see the following:

name: "My new layer"
input: "old_output"
output: "new_output"
activation {
  linear {
    alpha: 1
  }
}

Sometimes messages don’t have any fields, such as ActivationReLU. To change the activation
function of this layer to ReLU, you have to write the following instead:

new_layer.activation.ReLU.SetInParent()

Or use the following trick:

new_layer.activation.ReLU.MergeFromString(b"")

If you wanted to change this layer to a completely different type, write something like this:

new_layer.innerProduct.inputChannels = 100

Now this layer magically becomes an inner product layer — what Core ML calls a fully-
connected layer — of type NeuralNetwork_pb2.InnerProductLayerParams.

Note: In the proto file for NeuralNetworkLayer the layer types are defined inside the oneof
layer declaration. But note that you cannot use the new_layer.layer field to get its layer
type. You always access the thing from the oneof directly by its name, such as activation or
innerProduct. The name of the oneof itself is only used with WhichOneof().

It should be obvious by now that you really need to have the proto files handy to know what
fields and messages you can use here! By the way, to explore what these protobuf objects
from the proto files can do, you can use the dir() command in the Python REPL:

dir(ct.proto.Model_pb2.Model)

This is not as easy to understand as the actual proto files, but it can be useful to quickly
look up what a certain field is called.

To remove a layer, simply delete it. For example, to remove the layer you just added:

del spec.neuralNetworkClassifier.layers[-1]

This was only a short introduction to the most common things you can do with the Python
protobuf API. For the full story, check out the documentation73 . In the next chapters you’ll
be using these techniques for all kinds of advanced model surgery.

Saving the new model

Once you’re done making your changes to the spec, you can write everything to an mlmodel
file again:

ct.utils.save_spec(spec, "YourNewModel.mlmodel")

Tip: Use a different filename. It’s really easy to mess something up when you edit the spec
by hand and you don’t want to overwrite your original mlmodel file!

If you used model._spec to get the spec, you can simply save the MLModel object:

model.save("YourNewModel.mlmodel")

Note: this doesn’t work when you used model.get_spec() since that created a copy.

You can also make a new MLModel object from the spec and then save the model:

new_model = ct.models.MLModel(spec)
new_model.save("YourNewModel.mlmodel")

Personally, I prefer to load and save specs directly without using MLModel unless I also want
to make predictions with the model from Python. Everything else is just as easy with just
the spec.

Note: Because the mlmodel format is based on protobuf, you can load, save, and change
mlmodel files with any language that supports protobuf. That means you don’t necessarily
need to use Python to create new mlmodel files — you can even use Swift! See the chapter
Using Protobuf Without coremltools for more details.

73 https://developers.google.com/protocol-buffers/docs/reference/python-generated
Looking Inside an mlmodel
In the previous chapters you’ve seen the mlmodel specification and how to obtain a spec
object. Now let’s put this knowledge to good use. Examining the insides of an mlmodel is
useful to make sure the conversion was successful, and for troubleshooting problems with
the model in case it wasn’t.

First, get a spec object either by loading the MLModel or using coremltools.utils.load_spec().
See the previous chapter for details.

The model description

The model description, from the ModelDescription message in Model.proto, describes the
model’s inputs, outputs, and metadata.

print(spec.description)

This prints something like the following:

input {
name: "image"
shortDescription: "Input image to be classified"
type {
imageType {
width: 224
height: 224
colorSpace: BGR
}
}
}
output {
name: "classLabelProbs"
shortDescription: "Probability of each category"
type {
dictionaryType {
stringKeyType {
}
}
}


}
output {
name: "classLabel"
shortDescription: "Most likely image category"
type {
stringType {
}
}
}
predictedFeatureName: "classLabel"
predictedProbabilitiesName: "classLabelProbs"
metadata {
shortDescription: "..."
author: "..."
}

You can also see this information in Xcode when you select the mlmodel in the Project
navigator, but that doesn’t necessarily show all of the details. At a glance you can see that:

• This model has one input named "image". True to its name, this really is an image input
(its type is imageType). It expects a 224×224 image with BGR pixel order.

• This model has two outputs:

– classLabelProbs, which is also the predictedProbabilitiesName. This is a dictionary
that maps each class label to its predicted probability.
– classLabel, which is also the predictedFeatureName. This is a string output that
contains the label of the most likely class (the one with the highest probability).

• The metadata includes human-readable descriptions of the model.

For some models, the metadata also includes a dictionary of userDefined data. This user-
defined data is not visible in Xcode. For example, the YOLO object detector model trained
by Turi Create contains the following:

metadata {
userDefined {
key: "training_iterations"
value: "13000"
}
userDefined {
key: "turicreate_version"
value: "5.1"
}
...

You can add as much user-defined data to your model as you want, and access this data
from your iOS app.
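
Adding your own entries is easy because userDefined is a protobuf map of strings. A minimal
sketch (the keys and values are just examples):

import coremltools as ct

spec = ct.utils.load_spec("YourModel.mlmodel")
spec.description.metadata.userDefined["dataset_version"] = "v2"
spec.description.metadata.userDefined["notes"] = "retrained with extra augmentation"
ct.utils.save_spec(spec, "YourModel_with_metadata.mlmodel")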

Models of specification version 4 and later can also have training inputs, for example:

trainingInput {
name: "image"
shortDescription: "Example image"
type {
imageType {
width: 224
height: 224
colorSpace: RGB
}
}
}
trainingInput {
name: "labelTrue"
shortDescription: "True label"
type {
stringType {
}
}
}

While the regular inputs are used for performing inference, i.e. making predictions, the
training inputs are used during training. Often these will be the same as the regular inputs,
but with an additional training input for the ground-truth labels.

The input and output types

As you’ve seen, the ModelDescription contains the model’s inputs and outputs. To learn more
about the datatype of a specific input or output you can look at the FeatureType from its
FeatureDescription object.

print(spec.description.input[0].type.WhichOneof("Type"))

The different types are defined in FeatureTypes.proto. They are:

• Int64FeatureType for integer numbers

• DoubleFeatureType for double-precision floating point numbers

• StringFeatureType for text strings

• ImageFeatureType for images

• ArrayFeatureType for multi-dimensional arrays of numbers (tensors)

• DictionaryFeatureType for integer:double or string:double dictionaries

• SequenceFeatureType for a sequence of integers or strings

The FeatureType message also has a boolean isOptional field. By default this is False,
meaning that the user of this model is always required to supply data for this input.

If the input or output is an image, its type is an ImageFeatureType object:

print(spec.description.input[0].type.imageType)

This describes the image’s expected width and height, and its pixel order:

width: 224
height: 224
colorSpace: RGB

In a model with Size Flexibility (mlmodel version 3 and up), you might see something like
this instead:

colorSpace: RGB
imageSizeRange {
widthRange {
lowerBound: 256
upperBound: 1024
}
heightRange {
lowerBound: 256
upperBound: 1024
}
}

For more about this, see the chapter Size Flexibility.

Many interesting models will produce outputs of type ArrayFeatureType. A model for semantic
segmentation, for example, may output a 21×512×512 tensor. The image size here is
512×512 pixels and for every pixel it predicts 21 class probabilities (20 regular classes and
one “background” class for pixels that don’t belong to any object in the scene).

The output description of such a model will look like this:



output {
name: "scores"
shortDescription: "Predicted probabilities per pixel"
type {
multiArrayType {
shape: 21
shape: 512
shape: 512
dataType: DOUBLE
}
}
}

Such multi-array objects can have the data type DOUBLE, FLOAT32, or INT32. Core ML will use
doubles by default but you can change this to floats in the spec object.
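
For example, here is a minimal sketch that switches the first output to 32-bit floats (assuming
that output really is a multi-array):

import coremltools as ct
from coremltools.proto import FeatureTypes_pb2 as ft

spec = ct.utils.load_spec("YourModel.mlmodel")
output_type = spec.description.output[0].type
if output_type.WhichOneof("Type") == "multiArrayType":
    output_type.multiArrayType.dataType = ft.ArrayFeatureType.FLOAT32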

The multi-array has a list of shapes. These describe the dimensions of the array. Here,
the shape is (21, 512, 512). Notice that in Core ML, the order of the dimensions is always
(channels, height, width). To look at the shape of this output, you can write:

print(spec.description.output[0].type.multiArrayType.shape)

This prints [21, 512, 512]. For mlmodel version 3 and later, the multi-array shapes can also
be flexible.

The model type

Because the mlmodel format can describe different types of machine learning models, you
need to know what the actual type is before you can examine the rest of the mlmodel file:

print(spec.WhichOneof("Type"))

Let’s suppose this prints out neuralNetworkClassifier. To examine the rest of the model you
can now write,

print(spec.neuralNetworkClassifier.someProperty)

where someProperty is the name of the thing you want to look at.

Tip: Keep the mlmodel proto files74 handy while you’re examining the model. Without
this specification it’s almost impossible to understand how to dig through the different
messages and properties.
74 https://github.com/apple/coremltools/tree/master/mlmodel/format

There’s more than one kind of neural net

To look at the layers in the neural network, you first need to find out what sort of neural net-
work it is. Even though NeuralNetwork, NeuralNetworkClassifier and NeuralNetworkRegressor
all have very similar message definitions, these are treated as three separate types due to
the way the protobuf API works.

The following makes dealing with the different types of neural networks a little easier:

def get_nn(spec):
    if spec.WhichOneof("Type") == "neuralNetwork":
        return spec.neuralNetwork
    elif spec.WhichOneof("Type") == "neuralNetworkClassifier":
        return spec.neuralNetworkClassifier
    elif spec.WhichOneof("Type") == "neuralNetworkRegressor":
        return spec.neuralNetworkRegressor
    else:
        raise ValueError("MLModel does not have a neural network")

Use this helper function as follows:

nn = get_nn(spec)

If you write type(nn) it will print something like NeuralNetwork_pb2.NeuralNetworkClassifier,
depending on the actual type of the model.

From now on, you use the nn object to examine the rest of the neural network’s spec.

Tip: Put the get_nn() function in a new Python file named helpers.py. Then the next time
you need it, you just have to write from helpers import get_nn.

As of coremltools 3.0, you can also do:

from coremltools.models.neural_network.builder import _get_nn_spec as get_nn



Neural network layers

A neural network definition consists of two things: a list of layers and the preprocessing
options for any image inputs. See the chapter Changing the Image Preprocessing Options
to learn more about this.

Here, let’s focus on nn.layers.

Note: Don’t write print(nn.layers) or even print(nn.layers[i]) because this also prints out
all of the layer’s weights and biases, which can be thousands of numbers! It’s better to look
at specific layer properties.

For example, to count how many layers there are:

print(len(nn.layers))

By looking at the layer’s input and output fields you can see what other layers it is connected
to. The input of the first layer is usually connected to the model’s input:

print(nn.layers[0].input)

This should print the value from spec.description.input[0].name, for example ['image'].
Python puts this in a list because a layer can have more than one input.

For a NeuralNetworkClassifier, the last layer’s output should be connected to the output
given by spec.description.predictedProbabilitiesName:

print(nn.layers[-1].output)

For the example model from the beginning of this chapter, this prints ['classLabelProbs'].

Note: Classifier models are not required to use the name "classLabelProbs" as the model’s
output. They can use any name they like, as long as you put that name in the model
description’s predictedProbabilitiesName and in the output from the last layer (the softmax
layer). In the chapter Cleaning up a Converted Model you’ll see how to rename your model’s
inputs and outputs.

You can only index nn.layers by number, so one way to find the layer you’re looking for is
to loop through all the layers and print out their list index and name:

for i, layer in enumerate(nn.layers):
    print(i, layer.name)

Personally, I find it easier to create a dictionary that maps the layer names to their indices,

layer_dict = {layer.name:i for i, layer in enumerate(nn.layers)}

and then grab a layer by its name:

layer = nn.layers[layer_dict["conv1"]]

To learn more about a specific neural network layer, you first need to know what type of
layer this is:

print(layer.WhichOneof("layer"))

If this prints convolution, then you can write layer.convolution.someProperty to examine
the layer in more detail. Again, you need to look at NeuralNetwork.proto for the possible
property names.

print(layer.convolution.outputChannels)
print(layer.convolution.kernelChannels)
print(layer.convolution.kernelSize)
print(layer.convolution.stride)

The weights (and optional biases) for a layer are given by a WeightParams object. This usually
stores the weights data as a list of single-precision floating points:

print(len(layer.convolution.weights.floatValue))

But it can also use weights.float16Value when it uses half-precision floats or weights.rawValue
for quantized weights (or custom layers). Looking at the weights can be useful to see if they
were converted properly from the original model:

import numpy as np

cout = layer.convolution.outputChannels
cin = layer.convolution.kernelChannels // layer.convolution.nGroups
kh = layer.convolution.kernelSize[0]
kw = layer.convolution.kernelSize[1]

W = np.array(layer.convolution.weights.floatValue)
W = W.reshape(cout, cin, kh, kw)

Now W is a NumPy array of shape (output channels, input channels, kernel height, kernel
width). By the way, // means integer division in Python.

To see if a layer has bias you can use one of the following:

layer.convolution.HasField("bias")
layer.convolution.hasBias

If either of these is True, the biases are also stored as WeightParams object.

Each layer type has its own fields, and using the above techniques you can peek into every
part of your mlmodel file.

Tip: You can also use NeuralNetworkBuilder to inspect the layers. See here for details.

Model summary

coremltools has a handy utility for summarizing the model.

ct.models.neural_network.printer.print_network_spec(spec)

This prints the model’s layers and their attributes:



Inputs:
image (imageType { width: 224 height: 224 colorSpace: BGR})
Outputs:
classLabelProbs (dictionaryType { stringKeyType { }})
classLabel (stringType {})

Layers:
[0] (convolution) conv1
Input blobs: ['image']
Output blobs: ['conv1_1conv1/bn']
Parameters:
outputChannels = 32
kernelChannels = 3
nGroups = 1
kernelSize = [3, 3]
stride = [2, 2]
...

If you add a second argument, print_network_spec(spec, True), it only shows the inputs and
outputs but not the layers.
Verifying the Conversion is Successful
Just because the conversion completed without error messages doesn’t mean the Core
ML model is guaranteed to be correct. The preprocessing options may be erroneous, the
converter may have made a mistake, you did some model surgery after the conversion but
messed up… there are lots of things that can go wrong.

Even when the model is not correct, it will still create predictions — sometimes even
predictions that appear reasonable — but you don’t want to put such a model into your app
and ship it. It’s smart to verify that the Core ML model gives the same prediction results as
the original model!

If you’re doing the model conversion on macOS 10.13 or later, you can make predictions
directly from coremltools. The idea is to make a few of these predictions with the Core ML
model, also make predictions on the same input data with the original model, and compare
the results to see if they are the same — or at least close enough.

Tip: Use the Python REPL, or even better IPython or a Jupyter notebook, to do this.

Load the mlmodel into an MLModel object:

import coremltools as ct
model = ct.models.MLModel("YourModel.mlmodel")

You can now use model.predict() to make a prediction, but first you need some test data.

Models with image inputs

If your model expects an image as input, you will need to supply this as a PIL Image object.
It’s useful to have a collection of test JPEG or PNG files handy.


import numpy as np
import PIL
img = PIL.Image.open("whosagoodboy.jpg")

You need to resize the image to the width and height as expected by the model. Rather than
hardcoding this, you can get the expected dimensions from the model’s spec:

spec = model._spec
img_width = spec.description.input[0].type.imageType.width
img_height = spec.description.input[0].type.imageType.height
img = img.resize((img_width, img_height), PIL.Image.BILINEAR)

Note: You don’t need to worry about RGB or BGR pixel order. Core ML will automatically
load the data from the PIL Image into a CVPixelBuffer and do any necessary color space
conversions.

Now you can make a prediction using the model:

y = model.predict({"image": img}, useCPUOnly=True)

The dictionary {"image": img} is what maps the test data to the model’s inputs. Here you’re
passing the img with the PIL Image object to the input named "image". It’s quite likely your
model’s input has a different name. You can get this from the spec again:

input_name = spec.description.input[0].name
y = model.predict({input_name: img}, useCPUOnly=True)

Of course, if the model has multiple inputs, you need to supply all of them in the data
dictionary.
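
For example, a hypothetical object detection model with extra threshold inputs might be
called like this (the input names are made up; check spec.description.input for the real ones):

y = model.predict({"image": img,
                   "iouThreshold": 0.45,
                   "confidenceThreshold": 0.25}, useCPUOnly=True)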

There are a few things that can go wrong at this point:

• If you’re not supplying the correct input name, model.predict() gives an error message:
“Required input feature not passed to neural network.”

• If the size of the image is wrong, you’ll get an error message such as: “Input image
feature image does not match model description.”

• If you’re not passing in a PIL Image object, you’ll get an error such as: “The model
expects input feature image to be an image, but the input is of type 5.” Using a NumPy
array won’t work for image inputs!

• You get the error message “Exception: Unable to load CoreML.framework.” This
happened to me when I was using the bleeding edge version of coremltools directly
from GitHub. Downgrading coremltools to the latest stable version fixed it.

• The model has a custom layer. Making predictions on such models is not supported,
as they do not contain any code for executing the custom layer.

CPU versus GPU


The useCPUOnly argument tells model.predict() whether to run the model on the CPU or on
the GPU (if available). It’s a good idea to test with useCPUOnly=True first because this will use
32-bit floating point numbers for all the computations, which is also what training tools
such as TensorFlow, Keras, and Caffe use.

With useCPUOnly=False you may see larger errors because the GPU does its arithmetic with
16-bit floating point numbers, which have lower precision. So the outcomes between CPU
and GPU may be different, but should still be quite similar — if there is a massive difference,
then something is really wrong somewhere!

Looking at the predictions

When all went well, y now is a dictionary with the predictions. If you’re doing this in the
Python REPL, you can write y.keys() to see which objects the dictionary contains. Usually
there will be one entry for each model output.

For a classifier model, y will look something like this:

{'classLabel': 'Labrador retriever',
 'classLabelProbs': {'pole': 2.5477111581435274e-17,
   'black and gold garden spider, Argiope aurantia': 1.6184569215134048e-21,
   'assault rifle, assault gun': 1.489192315020529e-16,
   'solar dish, solar collector, solar furnace': 3.0364130415960055e-19,
   ...

The two outputs are "classLabelProbs", a dictionary with the classes and their predicted
probabilities, and "classLabel", the name of the highest scoring class. The actual names
of these outputs depend on your model. You can examine what your model’s outputs are
called by writing print(spec.description.output). See also the chapter Looking Inside an
mlmodel.

To view the top-5 classification results, define the following function:

def printTop5(resultsDict):
    # Put probabilities and labels into their own lists.
    probs = np.array(list(resultsDict.values()))
    labels = list(resultsDict.keys())

    # Find the indices of the 5 classes with the highest probabilities.
    top5Probs = probs.argsort()[-5:][::-1]

    # Find the corresponding labels and probabilities.
    top5Results = map(lambda x: (labels[x], probs[x]), top5Probs)

    # Print them from high to low.
    for label, prob in top5Results:
        print("%.5f %s" % (prob, label))

And then call it as follows:

printTop5(y["classLabelProbs"])

For my model this printed out:

0.37743 Labrador retriever
0.24369 golden retriever
0.18978 clumber, clumber spaniel
0.08162 kuvasz
0.05437 English setter

That result makes sense because I used a picture of a golden retriever. It’s not a very
confident prediction and it actually finds the wrong class, but at this point you don’t really
care how good or how useful the prediction is — just that it gives the same results as the
original model.

Now try your original model in TensorFlow, Keras, Caffe, etc. using the same picture. You
should get pretty much the same result. If the result from the original model is totally
different, then something is not right. The typical culprit is the preprocessing options, so
that would be the first place to look.

Comparing with Keras

To view the output of a Keras model, start a new IPython session or a Jupyter notebook.
First, load your trained model. The exact syntax for this may be different, depending on
whether your model uses custom objects, etc.

import numpy as np
import PIL
import keras
from keras.models import load_model
from keras.preprocessing.image import img_to_array

model = load_model("YourModel.h5")

Next, load the test image:

img = PIL.Image.open("filename.jpg")
img = img.resize((224, 224), PIL.Image.BILINEAR)
x = np.expand_dims(img_to_array(img), axis=0)

Before you can use the PIL Image object with Keras, it first needs to be converted into a
NumPy array with img_to_array() and be given an extra dimension for the batch size with
np.expand_dims().

Note: Make sure to load the image in exactly the same way as before. Use PIL and not some
other image loading library. Resize the image with exactly the same resampling option
(BILINEAR). If you’re not careful with this, the image may end up being loaded slightly
differently and you will also get different predictions. Naturally, you can only properly
compare the outputs of the two models if they are given exactly the same inputs.

To make the actual prediction:

x = preprocess(x)
y = model.predict(x)

Important: Usually the Keras model doesn’t do any image preprocessing of its own, so
you need to do this yourself before passing the array into the model. Exactly how this
preprocess() function works depends on how the original model was trained.
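
As an illustration only: a model trained with “MobileNet-style” preprocessing would scale
the pixels to the range -1 to +1, something like the sketch below. Your model may need a
completely different preprocess(), such as per-channel mean subtraction or dividing by 255.

def preprocess(x):
    # Example only: map pixel values from [0, 255] to [-1, 1].
    return x / 127.5 - 1.0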

The output y is a list containing a NumPy array for every output of the model. If this is a
classifier model, then y[0] is a NumPy array with the probabilities for the classes. Because
Core ML puts its predictions into a dictionary, you can’t directly compare this with the Keras
output. You first need to put the Keras output into a dictionary too.

Putting the classification results into a dictionary

You need to supply the names of the classes. If there aren’t too many labels, then you can
just list them:

labels = [ "class1", "class2", ... ]

Or you can load them from a text file:

labels = np.loadtxt("mylabels.txt", str, delimiter="\n")

Define the following helper method to convert the predictions into a dictionary:

def makeResultsDict(probs):
    probs = probs.squeeze()
    return {labels[i]: probs[i] for i in range(len(labels))}

Then call this function on the NumPy array that has the predicted probabilities:

labelProbs = makeResultsDict(y[0])

Now labelProbs is a dictionary that associates each class name with its predicted
probability. You can use the printTop5() function again to look at the five best predictions:

printTop5(labelProbs)

Compare these to the predictions from the Core ML model. Hopefully, they’re very similar!

It is possible that these results are slightly different from the Core ML predictions. If you
see differences two or three places behind the decimal point, that’s no cause for concern.
For example, if the Core ML prediction is 0.71439 and the Keras prediction is 0.70775, then
it may seem like these are very different but the actual difference is only 0.00664. That
counts as correct in my book.

Why are these not 100% identical? There will always be small differences in how each
machine learning library implements its basic operations such as convolutions, batch
normalization, and so on. Also, if the Core ML model stores its weights as 16-bit floats or
quantized, then that will cause small errors to creep into the precision too.

Comparing with TensorFlow

In your Python REPL or Jupyter notebook, first import the required packages:

import numpy as np
import PIL
import tensorflow as tf

Then load your model. Exactly how you need to do this depends on the format the model
was saved in. For example, if your model is a (frozen) .pb graph then use the following
code:

def load_graph(filename):
    the_graph = tf.Graph()
    with the_graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(filename, "rb") as fid:
            serialized_graph = fid.read()
            od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name="")
    return the_graph

graph = load_graph("YourModel.pb")

If it’s a saved_model.pb you can use the following instead:

def load_saved_model(path):
    the_graph = tf.Graph()
    with tf.Session(graph=the_graph) as sess:
        tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], path)
    return the_graph

graph = load_saved_model("YourModelFolder/")

Next, load the image using PIL and put it into a NumPy array. Make sure to resize the image
to the width and height expected by the model.

img = PIL.Image.open("filename.jpg")
img = img.resize((224, 224), PIL.Image.BILINEAR)
x = np.array(img.getdata())
x = x.reshape((1, img.size[1], img.size[0], 3)).astype(np.uint8)

Tip: Do print(x) to compare the actual pixel values of this image to the ones that were used
with Core ML, to make 100% sure they are identical. (Core ML expects pixel values between
0 and 255.)

Before you make a prediction, first pre-process the data in the array (unless the TensorFlow
graph already does its own preprocessing). Exactly how this preprocess() function works
depends on how the original model was trained.

x = preprocess(x)

And now you can run the graph to make the prediction:

with graph.as_default():
    with tf.Session(graph=graph) as sess:
        input_tensor = graph.get_tensor_by_name("input:0")
        output_tensor = graph.get_tensor_by_name("output:0")
        y = sess.run([output_tensor], feed_dict={input_tensor: x})

You need to use the correct names for the input and output tensors. Instead of "input:0"
your input may be named "ImageTensor:0". And your output tensor may be something like
"Predictions/Reshape_1:0".

Note: Don’t forget to add the :0 at the end of the tensor name.

One way to find the tensor names used by your graph is to print them out:

for op in graph.get_operations(): print(op.name)

Or just look at the first and last ones, although these aren’t always guaranteed to be the
actual inputs and outputs:

print(graph.get_operations()[0].name)
print(graph.get_operations()[-1].name)

You can also use Netron to visually inspect the graph but some TensorFlow graphs are really
big and complicated and Netron may have trouble rendering them.

After sess.run() completes, y is a list containing a NumPy array for every output of the
model. If this is a classifier model, then y[0] is a NumPy array with the probabilities for the
classes.

Because Core ML puts its classification results into a dictionary, you can’t directly compare
this with the TensorFlow output. See the previous section on how to put the TensorFlow
predictions into a dictionary too.

Comparing with Caffe

To run Caffe models from Python, you must have pycaffe installed. As usual, first import
the needed packages:

import numpy as np
import PIL
import sys

sys.path.insert(0, "/path/to/caffe/python")
import caffe

Next, load the model. This requires a .caffemodel file and a .prototxt file. If you have
multiple prototxt files for the model, use the one with “deploy” in the name.

caffe.set_mode_cpu()
net = caffe.Net("YourModel_deploy.prototxt", "YourModel.caffemodel", caffe.TEST)

Load the image. Caffe has its own function for this, caffe.io.load_image(), but I suggest
using PIL so that the image is exactly the same as with the Core ML model. Don’t forget to
resize the image to the expected width and height, again using the same options as before.

img = PIL.Image.open("filename.jpg")
img = img.resize((224, 224), PIL.Image.BILINEAR)
x = np.asarray(img)

Usually Caffe image preprocessing is done using a caffe.io.Transformer object. Here is a
typical transformer configuration for the input named "data":

transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2,0,1))
transformer.set_channel_swap("data", (2,1,0))
transformer.set_mean("data", np.array([103.94, 116.78, 123.68]))
transformer.set_input_scale("data", 1.0)

This swaps the color channels from RGB to BGR, subtracts the specified means (listed in
BGR order), and multiplies by the input scale.

Note: PIL images have pixels in the range 0 - 255, but Caffe’s io.load_image() returns an
array with pixels in the range 0 - 1. The transformer may be used to put these back in the
range 0 - 255 with set_raw_scale("data", 255). You obviously don’t want to do this on PIL
pixels because they’re already in that range (or first write x /= 255.0).

To preprocess the image and make the prediction:

net.blobs["data"].reshape(1, 3, 224, 224)
net.blobs["data"].data[...] = transformer.preprocess("data", x)
y = net.forward()

As always, make sure this uses the correct image height and width — note that height comes
before width in the call to reshape().

The output y is a dictionary containing a NumPy array for every output of the model. If this
is a classifier model, then y["prob"] has the NumPy array with the predicted probabilities.

It’s possible that in your model the input blob isn’t called "data" and the output isn’t called
"prob". You can see what the actual names are in the YourModel.prototxt file, using Netron,
or by printing the names of the blobs in the model:

for layer_name, blob in net.blobs.items():
    print(layer_name.ljust(20) + " " + str(blob.data.shape))

Because Core ML puts its classification results into a dictionary, you can’t directly compare
this with the Caffe output. See the previous section on how to put the Caffe predictions into
a dictionary and compare them to the Core ML results.

Models that are not classifiers

As you’ve seen, Core ML treats classifier models in a special way. For convenience, it returns
a dictionary that maps the predicted probabilities to the class names. If your model is not
a classifier, then the outputs are usually just NumPy arrays with numbers.

For example, let’s say the output description from the spec looks like this:

output {
name: "scores"
type {
multiArrayType {
shape: 91
shape: 1917
dataType: DOUBLE
}
}
}

That means you can expect this model to produce an output named scores that is a
“MultiArray” object — which in Python is a NumPy array — of shape (91, 1917).

Note: If the description doesn’t include shape fields then the size of the output array is not
known until runtime.

So if you run y = model.predict({...}) then the result y["scores"] is a NumPy array. You
can directly compare this to the NumPy array that is returned by Keras, TensorFlow, Caffe,
etc. One way to do this is to save the NumPy array to a file:

np.save("output_coreml.npy", y["scores"])

And then later on when you also have obtained the output from the original model, you
load the results back in:

y_coreml = np.load("output_coreml.npy").squeeze()
y_keras = np.load("output_keras.npy").squeeze()

At this point you should make sure both these arrays have the same shape. Often there may be additional “empty” dimensions of size 1. For example, the Keras output may be (1, 91, 1917) instead of just (91, 1917). You can strip off these unused dimensions with squeeze().

It’s also possible that the NumPy arrays have their dimensions in a different order. For example, y_coreml.shape may be (91, 1917) but y_keras.shape could be the other way around, (1917, 91). In that case you need to transpose one of them:

y_keras = y_keras.transpose(1, 0)

Now that you have both arrays and they are in the same format, you can compare them.
One way is to take the absolute error between every pair of predictions:

abs_err = np.abs(y_coreml - y_keras)


print(abs_err)

If this is a large array, then it’s still hard to see how good or bad the predictions are.
Something I like to do is count how many items have an error that is larger than a given
threshold, relative to the total number of items:

mask = abs_err > 0.01


print(mask.sum() / mask.size)

For my model this printed 0.0012, meaning that only 0.12% of the predictions had an error that was greater than 0.01. To me, that’s a good result.

To get an idea of how “wrong” the Core ML predictions are on average, write print(abs_err.mean()). This should be a small number that is close to 0.

It’s still worth taking a look at the items that have large-ish errors. To see the error values
greater than 0.01, write print(abs_err[mask]). The largest error is print(abs_err.max()).

If the mean is quite small but there are a few huge errors, then it’s a good idea to investigate
further. Where do these huge errors come from? Is it always the same predictions that
have large errors, even when you use different inputs?

To see which entries in the array have these large errors, write print(np.where(mask)). For
example, if your model predicts a segmentation mask and it’s always the pixels around the
edges that have large errors while the rest of the pixels seem OK, then something may be
wrong with the padding used by the model.

Models with non-image inputs

Everything I’ve explained thus far also applies to models with inputs that are not images.
The only difference is that now you need to supply the input data as a NumPy array, a
dictionary, or whatever else the model expects. As always, make sure to use exactly the
same inputs for the Core ML model as for the original model.
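For instance, a minimal sketch for a model with a single MultiArray input might look like this — the input name "input", the shape, and the coreml_model / original_model variables are all assumptions:

import numpy as np

x = np.random.rand(1, 10).astype(np.float32)                  # must match the model's input shape
y_coreml = coreml_model.predict({"input": x}, useCPUOnly=True)
y_original = original_model.predict(x)                        # e.g. for a Keras model
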
Looking at Intermediate Layer Outputs
Let’s say the output of your Core ML model is wrong and you’re sure the preprocessing
options are correct. Obviously there is a mistake in the model somewhere, but where? One
way to find out is to look at the output of intermediate layers to locate where the output
starts to diverge from the original model.

You need two things for this:

• the ability to run a Core ML model up to a certain layer and grab the output

• likewise for the original model

TensorFlow and Keras let you run the model up to a certain operation or layer. Caffe always
runs the entire model but afterwards it lets you look at the data blob from any one of the
layers. Unfortunately, Core ML models are as black as black boxes get, and you can’t peek
inside them.

But don’t let that stop you — if you can’t look inside the box, then break it! :-)

The trick is to take a converted mlmodel and remove all the layers after the one you’re
interested in. Then make a prediction with this amputated model and look at the results.

First, grab the spec for the mlmodel. For example:

import coremltools as ct
model = ct.models.MLModel("YourModel.mlmodel")
spec = model._spec

If the model is a classifier…

If the model is a NeuralNetworkClassifier, you first need to turn it into a regular NeuralNetwork.
Core ML always expects classifiers to predict probabilities and when you change the output
type to something else it gives a model compiler error.

Try the following first:


print(spec.WhichOneof("Type"))

If that prints neuralNetworkClassifier, perform the steps from this section. However, if it
prints neuralNetwork, you can skip ahead to the next section.

Begin by copying the relevant parts from spec.neuralNetworkClassifier:

import copy
layers = copy.deepcopy(spec.neuralNetworkClassifier.layers)
preprocessing = copy.deepcopy(spec.neuralNetworkClassifier.preprocessing)

And then assign them to spec.neuralNetwork:

spec.neuralNetwork.layers.extend(layers)
spec.neuralNetwork.preprocessing.extend(preprocessing)

You also need to remove the old outputs:

spec.description.ClearField("predictedFeatureName")
spec.description.ClearField("predictedProbabilitiesName")
del spec.description.output[:]

Now add an output for the final layer so that you can compile the model again.

spec.description.output.add()
spec.description.output[-1].name = spec.neuralNetwork.layers[-1].output[0]

import coremltools.proto.FeatureTypes_pb2 as ft
spec.description.output[-1].type.multiArrayType.dataType = ft.ArrayFeatureType.DOUBLE

Finally, create a new MLModel object from the spec. If this doesn’t give any compilation
errors, you’re good to go:

new_model = ct.models.MLModel(spec)

Find the layer, break the graph

To make things a little easier to write, use the get_nn() function:



from helpers import get_nn


nn = get_nn(spec)

Now find the layer that you want to look at. One way to do this is to make a dictionary
mapping the layer names to their indices:

layer_dict = {layer.name:i for i, layer in enumerate(nn.layers)}

Let’s say you want to look at the output of the layer conv10. Find the index of this layer:

layer_idx = layer_dict["conv10"]

You will now remove everything from this layer onward. First make a backup copy of all
the layers:

import copy
old_layers = copy.deepcopy(nn.layers)

And now remove all the layers after conv10:

del nn.layers[layer_idx + 1:]

To double-check that conv10 really is the last layer, you can do:

print(nn.layers[-1].name)

Next up, you have to replace the current model outputs with the output from this layer:

del spec.description.output[:]
spec.description.output.add()
spec.description.output[0].name = nn.layers[-1].output[0]

import coremltools.proto.FeatureTypes_pb2 as ft
spec.description.output[0].type.multiArrayType.dataType = ft.ArrayFeatureType.DOUBLE

If you now do print(spec.description), it should show something like the following:



output {
name: "conv10_62relu_conv10"
type {
multiArrayType {
dataType: DOUBLE
}
}
}

Create a new MLModel object from this spec. Even if you directly used the model._spec to
modify the layers, you still need to make a new MLModel because Core ML needs to compile
the model again.

new_model = ct.models.MLModel(spec)

It’s possible you get an error at this point. That means you deleted too many layers or
messed up somewhere, and now the graph is invalid. This could happen if you split up
the model inside a residual block, for example. Check with Netron what would be a better
place to break up the model and try again.

Making a prediction

With this new model you can now make a prediction. See the chapter Verifying the
Conversion is Successful for tips on how to do that.

As an example, let’s say the model takes an image as input:

import PIL
img = PIL.Image.open("picture.jpg")
spec = new_model._spec
img_width = spec.description.input[0].type.imageType.width
img_height = spec.description.input[0].type.imageType.height
input_name = spec.description.input[0].name
img = img.resize((img_width, img_height), PIL.Image.BILINEAR)
y = new_model.predict({input_name: img}, useCPUOnly=True)

If everything goes well, y is a dictionary with a single entry named after your model’s
output:

y_coreml = y[spec.description.output[0].name]

And now y_coreml is a NumPy array with the output from layer "conv10" or whichever layer
you’re looking at.

Since you did not specify a shape when you created the output for this intermediate layer,
it is five dimensional. For my model, y_coreml.shape was (1, 1, 1000, 14, 14). This is in
the standard order that Core ML uses for all its data blobs: (sequence length, batch size,
channels, height, width).

Now that you have the output from Core ML, you can compare it to the output from the
original model for that same intermediate layer.

Getting intermediate output from Keras

Load your model into Keras and prepare an input image, just like in the previous chapter.
However, don’t use model.predict() but the following function:

from keras import backend as K

def get_intermediate_output(model, layer_idx, img):
    f = K.function([model.layers[0].input, K.learning_phase()],
                   [model.layers[layer_idx].output])
    output = f([img, 0])
    return output

It can also be useful to make a dictionary to lookup the layers:

layer_dict = {layer.name:i for i, layer in enumerate(model.layers)}

Now call the function as follows:

y = get_intermediate_output(model, layer_dict["conv10"], img)

where img is the preprocessed test image. The result y is a list that contains a NumPy array
with the output of the requested layer.

y_keras = y[0]

To see how to compare y_keras with y_coreml, skip ahead to “Comparing the results”.

Getting intermediate output from TensorFlow

Load your model into TensorFlow and prepare an input image, just like in the previous
chapter.

You will use exactly the same code to run the TensorFlow graph — all you need to change
is the name of the output tensor. For example, if the "conv10" layer is really named
"yourmodel/Conv2d_10/convolution" in the TensorFlow graph, then the code to run the graph
looks like this:

input_tensor_name = "input:0"
output_tensor_name = "yourmodel/Conv2d_10/convolution:0"

with graph.as_default():
with tf.Session(graph=graph) as sess:
input_tensor = graph.get_tensor_by_name(input_tensor_name)
output_tensor = graph.get_tensor_by_name(output_tensor_name)
y = sess.run([output_tensor], feed_dict={input_tensor: x})

Note: Don’t forget the :0 behind the tensor name!

As before, the variable y is a list with one NumPy array for every output in the model.

y_tf = y[0]

To see how to compare y_tf with y_coreml, skip ahead to “Comparing the results”.

Getting intermediate output from Caffe

Load your model into Caffe and prepare an input image, just like in the previous chapter.
Also run the model as usual. Afterwards, you can look at the output of any intermediate
layer using:

y_caffe = net.blobs["conv10"].data.transpose(0, 2, 3, 1)

Now y_caffe is a NumPy array with the output from layer "conv10". So that is actually very
simple! The next section describes how to compare y_caffe with y_coreml.

Note: Caffe models usually apply the ReLU activation in-place and so you can’t look at the
output of the layer from before the activation. If yours is a Caffe model, always pick a layer
from the mlmodel with relu in the name.

Comparing the results

The shape of the NumPy array from Core ML is (sequence length, batch size, channels,
height, width). Most training packages will put their data into a different shape:

• Keras and TensorFlow usually have (batch size, height, width, channels). For the
"conv10" layer in my model that is (1, 14, 14, 1000).

• Caffe returns the blob as (batch size, channels, height, width) or (1, 1000, 14, 14).

So if you want to compare these outputs to y_coreml you’ll have to transpose and reshape
them. For Keras and TensorFlow, for example:

y_coreml = y_coreml.reshape(1000, 14, 14).transpose(1, 2, 0)


y_keras = y_keras.squeeze()
# or for TensorFlow: y_tf = y_tf.squeeze()

Now both arrays have the shape (14, 14, 1000) and you can compare them.

For Caffe it’s a little simpler because the order of the dimensions is the same already. You
only have to strip off the extra dimensions for the batch size:

y_coreml = y_coreml.squeeze()
y_caffe = y_caffe.squeeze()

Now both arrays have the shape (1000, 14, 14) and you can compare them.

To compare the outputs, I usually look at all the channels for a handful of pixels. If the
arrays are (height, width, channels), you can write:

abs_err = np.abs(y_coreml[r, c, :] - y_keras[r, c, :])

where r is the row for the pixel (the y-coordinate) and c is the column for the pixel (the
x-coordinate). Just by looking at a few different pixels in the image, especially around
the edges, you can already get a pretty good idea of whether this layer has produced the
expected output or not.

Trying another layer

When troubleshooting the model, you want to find the first layer where the predictions
start to diverge.

I would suggest a divide-and-conquer approach: Start in the middle. If this layer gives the
correct predictions, then continue looking at the second half of the model. If the layer was
wrong, however, then dig further into the first half of the model. And so on until you find
the layer that is incorrect.

Since you’ve still got a copy of all the layers in the old_layers variable, it’s pretty easy to
look at another intermediate layer. First restore the original layers:

del nn.layers[:]
nn.layers.extend(old_layers)

And now repeat the above steps.


Checking the Layer Output Shapes
Every layer in the neural network produces an output tensor of a certain size. The shape
describes how many dimensions the tensor has and the number of elements in each
dimension.

Particularly important for the output shape of a layer is the amount of zero-padding that is
applied by the layer. Making mistakes with the padding is a common cause of errors when
converting between model formats, especially if you’re writing your own converter or if
you’re performing surgery on the model afterwards.

If the model was correctly converted to Core ML, the output shapes from the mlmodel’s
layers should be identical to the original model’s output shapes. It’s a good idea to verify
that these shapes are correct.

There are several ways to look at the layer output shapes of a Core ML model.

Note: Many of the techniques shown here no longer work with the latest versions of Xcode
and coremltools. This is unfortunate, as having the ability to check the output shapes for
each layer is very useful.

Using model surgery

One technique that does work but is rather cumbersome, is to take your mlmodel file and
remove its layers one-by-one. After removing a layer, create a new MLModel instance, use
the MLModel to make a prediction, and record the shape of the output. There is no easy API
for this, you’ll have to hack this together yourself.
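Here is a rough sketch of that idea for a plain NeuralNetwork model. It assumes dummy_input is a dictionary with valid inputs for your model; if yours is a classifier, first turn it into a regular NeuralNetwork as described in the chapter Looking at Intermediate Layer Outputs:

import copy
import numpy as np
import coremltools as ct
import coremltools.proto.FeatureTypes_pb2 as ft

spec = ct.models.MLModel("YourModel.mlmodel")._spec
nn = spec.neuralNetwork
old_layers = copy.deepcopy(nn.layers)

for i in range(len(old_layers)):
    # Keep only the first i+1 layers and point the model output at the last one.
    del nn.layers[:]
    nn.layers.extend(old_layers[:i + 1])
    del spec.description.output[:]
    spec.description.output.add()
    spec.description.output[0].name = nn.layers[-1].output[0]
    spec.description.output[0].type.multiArrayType.dataType = ft.ArrayFeatureType.DOUBLE

    try:
        truncated = ct.models.MLModel(spec)
        y = truncated.predict(dummy_input, useCPUOnly=True)
        print(nn.layers[-1].name, np.array(y[nn.layers[-1].output[0]]).shape)
    except Exception as e:
        print(nn.layers[-1].name, "skipped:", e)

Expect this to be slow — Core ML recompiles the model on every iteration of the loop.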

Looking at the mlmodelc

The Core ML compiler creates an .mlmodelc. This is a folder with a number of files that
describe the model in a proprietary format.


You can also find the output shapes for the layers here, in the file model.espresso.shape.
This is a JSON file that you can view with any text editor. It looks something like this:

"layer_shapes" : {
"conv3" : {
"k" : 64,
"w" : 14,
"n" : 1,
"h" : 14
},
"pool1" : { ...

The "n" property is the batch size, usually 1; "k" is the number of channels; "h" is the height
of the tensor; and "w" is the width.

Note: Since this is a dictionary, the layers are not necessarily in order.
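If you want to dump all the shapes at once, a short sketch like this will do — the path to the .mlmodelc is an assumption (it lives inside your built app bundle, or wherever you compiled the model):

import json

with open("YourModel.mlmodelc/model.espresso.shape") as f:
    shapes = json.load(f)

for name, s in shapes["layer_shapes"].items():
    print(name, "(n, k, h, w) =", (s["n"], s["k"], s["h"], s["w"]))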

Using Xcode 10

Note: The method described in this section no longer works with Xcode 11 or later.

You can view the layer output shapes for the mlmodel in Xcode’s Report navigator. Select
the latest Build on the left and find the entry for Compile CoreML model. Expand this to
reveal the output from the Core ML compiler, including a list of all the layers in the model
and their output shapes:

The tensor shapes are given as (channels, height, width).

Note: For a Core ML model with Size Flexibility, the output shapes reported are for the
smallest configured size.

If the mlmodel has a custom layer, the output shape is reported as (-1, -1, -1) because the
Core ML compiler does not have any information about how the custom layer works.

By the way, you can also run the Core ML compiler directly from the command line.
This produces the same output as shown in Xcode. Open a Terminal window and run the
following:

xcrun coremlc compile YourModel.mlmodel output

Unfortunately, this method also no longer works as of Xcode 11.

Using coremltools

coremltools also has a built-in Core ML compiler. Unlike the one from Xcode 11, this
compiler still outputs the layer shapes (at least on macOS Mojave). Put the following into a
Python script and run it:

import coremltools as ct
mlmodel = ct.models.MLModel("YourModel.mlmodel")
_ = ct.models.MLModel(mlmodel._spec)

You will now see the names of all the layers in the neural network and their inferred shapes,
as (channels, height, width).

Note: You can also do the above from the Python REPL or in an IPython session, but you
won’t see the output until you exit out of the REPL.

Using visualize_spec

coremltools has a built-in visualization tool, visualize_spec. It’s not very pretty but it does
the job. Click the Shapes On/Off button to view the output shapes of the layers.

See the chapter Viewing Models With visualize_spec for more info on this tool. Note: This
technique no longer works with coremltools 4.

Using NeuralNetworkShaper

Note: This method no longer works as of coremltools 4.

coremltools includes a NeuralNetworkShaper class, whose job it is to compute the intermediate tensor shapes for a neural network model. Exactly what we’re after!

import coremltools as ct
mlmodel = ct.models.MLModel("YourModel.mlmodel")
shaper = ct.models.NeuralNetworkShaper(mlmodel._spec)

To get shape information about a given layer, do:

shaper.shape("layer_name")

Or to get the shapes of all layers:



for x in mlmodel._spec.neuralNetwork.layers:
    try:
        print(x.name, shaper.shape(x.name))
    except:
        print("No shape found for:", x.name)

The exception handler is necessary because not all layers may have shape information.

Note: Use mlmodel._spec.neuralNetworkClassifier if your model is a classifier.

The output of NeuralNetworkShaper looks like this:

{'S': (0, -1), 'B': (0, -1), 'C': (256, 256), 'H': (14, 14), 'W': (14, 14)}

The key 'S' stands for the sequence dimension; 'B' is the batch size; 'C' is the number of
channels; 'H' is the height of the tensor; and 'W' is the width.

Each dimension has two numbers: the minimum possible value and the maximum possible
value. If the maximum for a dimension is -1, it means there is no maximum.

How to compare the shapes

Once you know the output shapes of the layers from the Core ML model, what do you
do with this information? You compare it against the output shapes for the layers in the
original model, especially the last few layers and any layer you’re suspicious about.

Training tools such as TensorFlow and Keras will let you run the original model up to a
given layer. You can compare the shape of the data this produces, to the output shape
reported by the Core ML compiler for that same layer. With Caffe you always have to run
the entire model, but afterwards it lets you look at the output blob of any layer.

See the previous chapter, Looking at Intermediate Layer Outputs, for details. Instead of
looking at the actual output from the intermediate layer, this time you’re just interested in
its shape. In Python, print(layer_output.shape).
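For a Keras model, for example, listing all the layer output shapes to compare against the Core ML shapes is a two-liner (a sketch, assuming model is your loaded Keras model):

for layer in model.layers:
    print(layer.name, layer.output_shape)
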
The mlmodel as a Big Text File
The mlmodel file as used by Core ML is a binary protobuf file, but there is also a text-based
version of protobuf that is useful for debugging.

First, install the protobuf Python package:

pip install -U protobuf

Then in Python, get the spec for the mlmodel, and do the following to write your entire
mlmodel to a text file:

from google.protobuf import text_format

with open("YourModel.prototxt", "w") as f:
    f.write(text_format.MessageToString(spec))

Warning! This creates a huge file. As you can imagine, the text version of protobuf is much
less efficient than the binary version.

If you open the text file — preferably in an editor that can handle very large files, such as
Visual Studio Code — it looks something like this:

specificationVersion: 1
description {
input {
name: "image"
shortDescription: "Input image to be classified"
type {
imageType {
width: 224
height: 224
colorSpace: BGR
...


and so on… This should look very similar to what you see when you print out the spec (or
parts of it) from the Python REPL. That’s not so strange, as it’s exactly the same thing.

You can now make changes to this text file. For example, change the colorSpace to RGB — or
anything else you need to do such as removing layers, renaming things, etc. Just make sure
to use valid protobuf syntax.

Now you can read the file back into a new spec object in Python:

import coremltools as ct
from google.protobuf import text_format

new_spec = ct.proto.Model_pb2.Model()

with open("YourModel.prototxt", "r") as f:
    text_format.Parse(f.read(), new_spec)

And save it again as a binary mlmodel file:

ct.utils.save_spec(new_spec, "YourModelNew.mlmodel")

You can’t use the text-based version of the mlmodel file with Core ML on iOS or macOS,
only the binary one, but it’s certainly handy for examining and editing the model by hand.
Part 4: Model Surgery

Model surgery is the act of changing an mlmodel after it has been converted. This is useful
for cleaning up a model after conversion or to add missing functionality.
Filling in the Metadata
The metadata for a model includes the following:

• a short description

• the name of the author

• the license for how the model may be used

• a version number

• human-readable descriptions of the inputs and outputs

• any custom data that you may want to add

The metadata is what you see in Xcode when you view the mlmodel file.

In Python, if you have an MLModel object you can do print(mlmodel) to view the metadata, or
print(mlmodel._spec.description) to get it from the spec.

Inside your app, you can get the metadata information from the MLModelDescription object.
See also Inspecting the Model at Runtime.

After converting a model to Core ML, you can fill in the metadata as follows:

mlmodel = convert(original_model, ...)

mlmodel.short_description = "Your awesome model"


mlmodel.author = "Your name"
mlmodel.license = "Copyright by you"

mlmodel.input_description["image"] = "Image to be classified"


mlmodel.output_description["labelProbs"] = "Probability of each category"
mlmodel.output_description["label"] = "Category with the highest score"

mlmodel.save("YourModel.mlmodel")

You can also use the spec object:


import coremltools as ct
spec = ct.utils.load_spec("YourModel.mlmodel")

spec.description.metadata.shortDescription = "Your awesome model"


spec.description.metadata.author = "Your name"
spec.description.metadata.license = "Copyright by you"
spec.description.metadata.versionString = "v1.0"

spec.description.input[0].shortDescription = "Image to be classified"


spec.description.output[0].shortDescription = "Probability of each category"
spec.description.output[1].shortDescription = "Category with the highest score"

ct.models.utils.save_spec(spec, "YourModel.mlmodel")

Note that the versionString property is only available on the spec object, not on MLModel. This
version number is also not displayed in Xcode, but you can query it with MLModelDescription.

The spec object is also the only way you can change the descriptions of the training inputs
for models that are updatable:

spec.description.trainingInput[1].shortDescription = "True class label"

Besides the standard properties you can also add your own user-defined properties into
the model’s metadata. This consists of key/value pairs — in other words, a dictionary —
but both keys and values must be strings.

To add your own custom properties, simply set a value on the user_defined_metadata
dictionary:

mlmodel.user_defined_metadata["confidence_threshold"] = "0.35"
mlmodel.user_defined_metadata["iou_threshold"] = "0.5"

Or using the spec object:

user_defined_metadata = {
"confidence_threshold": "0.35",
"iou_threshold": "0.5"
}
spec.description.metadata.userDefined.update(user_defined_metadata)

User-defined data is not displayed by Xcode, but you can read it from the MLModelDescription
object at runtime.

If you print the metadata in Python, it now looks like this:



metadata {
...
userDefined {
key: "confidence_threshold"
value: "0.35"
}
userDefined {
key: "iou_threshold"
value: "0.5"
}
}

You can use this for any additional data that is useful to provide along with the model. For
example, in the chapter MobileNetV2 + SSDLite Example, we use the user-defined metadata
to add the names of the classes to the mlmodel file. (While Core ML can store the class names
for classifier-type models, SSDLite is not a classifier and so we have to use the metadata for
this.)
Changing the Image Preprocessing
Options
In the chapter on Image Preprocessing, I explained that the Core ML model must have the
same options for image normalization as the original model. Otherwise the model does not
understand the data from the CVPixelBuffer.

This chapter shows how to verify that the image preprocessing options from your mlmodel
are correct, and what to do if they aren’t.

Note: This only applies to neural networks. Other types of models will need to do their own
preprocessing, but usually you wouldn’t use images as the inputs for such models anyway.

The NeuralNetworkPreprocessing object

The NeuralNetwork.proto file from the mlmodel specification defines that a NeuralNetwork
model has a list of NeuralNetworkPreprocessing objects, one for every image input.

NeuralNetworkPreprocessing can use one of two methods for normalizing the image pixels:

1. NeuralNetworkImageScaler: multiplies all the pixels by a constant channelScale and then adds the redBias, greenBias, and blueBias to the individual color channels (or a grayBias for images with just one channel). These correspond to the preprocessing options that you pass into the model converter.

2. NeuralNetworkMeanImage: subtracts an “average image” from the input image. This is sometimes used with Caffe models (only the Caffe converter has an option for it). The average image is stored as a list of floats inside the mlmodel.

To make sure your mlmodel performs the same kind of preprocessing as the original model,
it’s useful to examine what is inside the mlmodel file’s NeuralNetworkPreprocessing.


Examining the preprocessing options

First, get the neural network object from the spec using the get_nn() function:

nn = get_nn(spec)

Viewing the preprocessing options from the mlmodel is now as simple as:

print(nn.preprocessing)

For Inceptionv3.mlmodel75 , this prints:

[scaler {
channelScale: 0.007843137718737125
blueBias: -1.0
greenBias: -1.0
redBias: -1.0
}
]

Recall that the neural network will have one NeuralNetworkPreprocessing object for every
input, so what you see here is an array, hence the [ ] brackets. Since this model only has
one input, there is just one entry in the array.

Note: The NeuralNetworkPreprocessing definition also includes a featureName field, which is the name of the model input to which the preprocessing is applied. For a model with a single input the name is usually not filled in.

The preprocessor for this first input is a scaler object. In other words, the oneof for this
input’s NeuralNetworkPreprocessing object is of type NeuralNetworkImageScaler.

The pixels in the CVPixelBuffer are 8-bit, which means they have values between 0 and 255.
The preprocessing step turns these numbers into floating point values.

For Inceptionv3, Core ML first multiplies every pixel with 0.007843 — which happens to be
2/255 or 1/127.5 — and then subtracts 1.0 from every color channel. Effectively this scales
pixels from the range [0, 255] to the range [-1, 1].

For ResNet50.mlmodel76 , the preprocessing is:


75 https://docs-assets.developer.apple.com/coreml/models/Inceptionv3.mlmodel
76 https://docs-assets.developer.apple.com/coreml/models/Resnet50.mlmodel

scaler {
channelScale: 1.0
blueBias: -103.93900299072266
greenBias: -116.77899932861328
redBias: -123.68000030517578
}

This subtracts the mean RGB values but it doesn’t scale the pixels.

Tip: You can also view the preprocessing options using Netron. This shows the preprocess-
ing step as a scaler node at the beginning of the model:

As you can see, different models have different ways of normalizing the pixels from the
input image. I don’t think it really matters much which type of preprocessing is used during
training — some models don’t do any pixel normalization at all — as long as you use the
exact same options inside the mlmodel file.

But is it correct?

For MobileNet.mlmodel77 the preprocessing is:

scaler {
channelScale: 0.017000000923871994
blueBias: -1.7669800519943237
greenBias: -1.985260009765625
redBias: -2.102560043334961
}

These numbers are a little unusual. Let’s see if they make sense.

First, Core ML multiplies the pixels by the channelScale of 0.017, so now the pixels are in the range 0 to 4.335. And then it adds the (negative) bias values, so the final pixel range is roughly between -2.1 and +2.6.
77 https://docs-assets.developer.apple.com/coreml/models/MobileNet.mlmodel

That is an odd range indeed, so it’s worth double checking where these numbers come from.

If you look at the original training script, you’ll probably find totally different bias and
scaling values. That’s because Core ML uses a slightly different formula for doing the
normalization. When the model was trained, the images were probably normalized using
a mean and a standard deviation — but that’s not what Core ML uses.

To compare the preprocessing options between the mlmodel and the original model, you
should convert the numbers from nn.preprocessing back to the original mean RGB values:

redMean = -redBias / channelScale = 2.10256 / 0.017 = 123.68
greenMean = -greenBias / channelScale = 1.98526 / 0.017 = 116.78
blueMean = -blueBias / channelScale = 1.76698 / 0.017 = 103.94

Note that the mean is always subtracted while the bias is added, so you need to flip the sign.
The standard deviation used to normalize the pixels was:

std = 1 / channelScale = 1 / 0.017 = 58.82

If these are indeed the values used in the original training script, then the Core ML model
does the correct preprocessing!
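You can let Python do this arithmetic for you. Here is a small sketch that recovers the training-time mean and standard deviation from the scaler options (it assumes the usual scaler-style preprocessing with a non-zero channelScale):

p = nn.preprocessing[0].scaler
print("red mean:  ", -p.redBias / p.channelScale)
print("green mean:", -p.greenBias / p.channelScale)
print("blue mean: ", -p.blueBias / p.channelScale)
print("std:       ", 1.0 / p.channelScale)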

Changing the preprocessing

Of course, if something is wrong with the preprocessing options, you can change it directly
in the spec. For example, the SqueezeNet.mlmodel78 that you can download from Apple’s
Core ML page has the following preprocessing options:

[scaler {
channelScale: 1.0
blueBias: 104.0
greenBias: 117.0
redBias: 123.0
}
]

That seems wrong to me… instead of subtracting the RGB means, it adds them! An easy
mistake to make, as most training tools use mean values but Core ML uses bias values,
which have the opposite sign.

Let’s fix it. Notice that you need to use [0] to index the array, and then it’s just a matter of
assigning the correct values to the fields in the NeuralNetworkImageScaler object:
78 https://docs-assets.developer.apple.com/coreml/models/SqueezeNet.mlmodel

nn.preprocessing[0].scaler.blueBias = -104.0
nn.preprocessing[0].scaler.greenBias = -117.0
nn.preprocessing[0].scaler.redBias = -123.0

When you’re done, save the spec to a new mlmodel file:

ct.utils.save_spec(spec, "SqueezeNetFixed.mlmodel")

Note that you save the entire spec here. You made the changes to nn, which is a reference
to a part of the spec, so any modifications to nn are also done to spec itself.

Tip: As of coremltools 3.0, you can also use NeuralNetworkBuilder to change these options:

existing_model = ct.utils.load_spec("YourModel.mlmodel")
builder = ct.models.neural_network.NeuralNetworkBuilder(spec=existing_model)
builder.set_pre_processing_parameters(image_input_names=["image"], is_bgr=False,
red_bias=-123.0, green_bias=-117.0,
blue_bias=-104.0, image_scale=1.0)
ct.utils.save_spec(builder.spec, "YourNewModel.mlmodel")

What about BGR images?

The Keras and Caffe converters have an is_bgr option. For models trained with Caffe this
option should be True since it uses OpenCV to load images in the BGR pixel order. If this
option is incorrect, the red and blue color channels will be flipped and the model won’t be
able to make optimal predictions.

is_bgr is not actually part of the image preprocessing options but of the model’s input
descriptions. You can examine this with:

print(spec.description.input)

It may print something like the following:



[name: "image"
shortDescription: "Input image to be classified"
type {
imageType {
width: 227
height: 227
colorSpace: BGR
}
}
]

As you can see, the colorSpace for the input with name "image" is BGR.

To change it, you can write:

import coremltools.proto.FeatureTypes_pb2 as ft
new_colorspace = ft.ImageFeatureType.ColorSpace.Value("RGB")
spec.description.input[0].type.imageType.colorSpace = new_colorspace

To view the allowed values for colorSpace, write ft.ImageFeatureType.ColorSpace.keys() in the Python REPL or look up the definition for ImageFeatureType in FeatureTypes.proto.
Using a Different Scale for Each Color
Channel
The preprocessing options for a neural network only let you specify a single image scaling
factor that is applied to the red, green, and blue color channels equally. But some models
use a different scale for each color channel.

For example, the RGB channels in the ImageNet training set have the following standard
deviations:

red_std = 58.395
green_std = 57.12
blue_std = 57.375

If your training script loads images with pixels in the range 0 - 1, or divides them by 255
before normalizing, then these standard deviations are usually given as:

red_std = 58.395 / 255 = 0.229
green_std = 57.12 / 255 = 0.224
blue_std = 57.375 / 255 = 0.225

You can approximate this by using a single standard deviation of roughly 58 for all
channels. The coremltools converters always assume the input pixels are in the range 0
- 255, so the image_scale option for the converter would be 1/58 = 0.017.

However, with a bit of model surgery it’s also possible to give each color channel its own
scale parameter — making the mlmodel exactly the same as the original.

First, convert the model using the proper image_scale and red/green/blue_bias options. Let’s
assume the RGB means are 123.68, 116.779, and 103.939, respectively. For the image scale
you no longer use the standard deviation, but simply 1/255. To get the biases you divide the
means by 255 as well:


image_scale = 1 / 255.0
red_bias = -123.68 / 255.0 = -0.485
green_bias = -116.779 / 255.0 = -0.456
blue_bias = -103.939 / 255.0 = -0.406

Tip: If you have already converted your model and don’t want to go through that again,
then you can also change these options directly in the spec. Make sure you have the spec
and the get_nn() function, and then write:

nn = get_nn(spec)

nn.preprocessing[0].scaler.channelScale = 1 / 255.0
nn.preprocessing[0].scaler.redBias = -0.485
nn.preprocessing[0].scaler.greenBias = -0.456
nn.preprocessing[0].scaler.blueBias = -0.406

Now that you’ve set these scale and bias options on the model, the preprocessing part of
the neural network does the following to each pixel:

normalized_r = r * image_scale + red_bias = r / 255.0 - 0.485
normalized_g = g * image_scale + green_bias = g / 255.0 - 0.456
normalized_b = b * image_scale + blue_bias = b / 255.0 - 0.406

To complete the normalization arithmetic so that it does exactly the same thing as the
training script, what still needs to happen is dividing by the standard deviation:

really_normalized_r = normalized_r / red_std = normalized_r / 0.229
really_normalized_g = normalized_g / green_std = normalized_g / 0.224
really_normalized_b = normalized_b / blue_std = normalized_b / 0.225

The Core ML preprocessing options cannot do this last step, because it involves scaling each
color channel independently. However, you can achieve the same effect by inserting a scale
layer as the first layer of the model.

Adding a new layer at the beginning of the model is a little tricky. There isn’t a convenient
method in the protobuf API for inserting a new object at the front or in the middle of a
repeating list. But you also can’t just add this new layer to the end of the spec because Core
ML wants the layer definitions to be in the right order in the mlmodel file.

With the spec and nn objects, make a copy of all the existing layers first:

import copy
old_layers = copy.deepcopy(nn.layers)

Then delete all the layers from the model. Because you made a copy, you can put these
layers back later.

del nn.layers[:]

In Core ML, layers are connected through the names of their inputs and outputs. Because
the new scaling layer will go in between the model’s input and the old first layer, you need
to find the name of the input (this can be different for every model):

input_name = old_layers[0].input[0]
new_layer_output = input_name + "_scaled"

The new scale layer’s output will also need to have a name, which you create by adding
"_scaled" to the original input name.

Now you can make the new scale layer object:

new_layer = nn.layers.add()

The line nn.layers.add() creates a new NeuralNetworkLayer object and adds it to the spec.
Since you just removed all existing layers, the neural network now has only this single
layer. The new layer is completely empty at this point and you still need to fill in all its
properties.

new_layer.name = "scale_layer"
new_layer.input.append(input_name)
new_layer.output.append(new_layer_output)

By setting the input you connect this layer to the model’s input, and by setting its output you
can connect it to the other layers later.

Next, turn this new layer into a scale layer and give it the correct scaling factors. This
is done by accessing new_layer.scale, which makes this a ScaleLayerParams layer (see
NeuralNetwork.proto), and then assigning a value to at least one of its properties.

new_layer.scale.scale.floatValue.extend([1/0.229, 1/0.224, 1/0.225])

The ScaleLayerParams layer has a field, also called scale, that holds the actual values as
floating point numbers (it looks weird but .scale.scale is not a typo here).

Note: You’re using the reciprocal of the standard deviations, because a ScaleLayerParams
layer multiplies the pixel values by the scale values, while we’re supposed to divide by the
standard deviation.

You also need to set the shapeScale field, which takes the form (channels, height, width).
This says that the three values you’ve supplied are across the channel dimension. Just like
NumPy and most other libraries, Core ML will automatically broadcast the scale values to
match the tensor shape:

new_layer.scale.shapeScale.extend([3, 1, 1])

If you’re in the Python REPL and do print(new_layer), you should see something like this:

name: "scale_layer"
input: "image"
output: "image_scaled"
scale {
shapeScale: 3
shapeScale: 1
shapeScale: 1
scale {
floatValue: 4.366812227074235
floatValue: 4.464285714285714
floatValue: 4.444444444444445
}
}

Note: Your names for input and output may be different. These names totally depend on
your own model.

Now that you’re done constructing the scale layer, add back the old layers:

nn.layers.extend(old_layers)

The final step is to connect the new layer to the old layers. You do this by telling the old first
layer to no longer read from the model input but from the scale layer:

nn.layers[1].input[0] = new_layer_output

And now the graph is complete again! Save the spec and you’re all done.

ct.utils.save_spec(spec, "NewModel.mlmodel")

I usually like to check with Netron that the graph is indeed correct. There is now a new scale
layer between the “scaler” from the preprocessing step and the first convolution layer:

Another way to check for problems is to turn the spec back into an MLModel. If this doesn’t
give any error messages, you’re good to go.

new_mlmodel = ct.models.MLModel(spec)

If you get an error such as: “Error reading protobuf spec. validator error: Layer ‘conv1’
consumes a layer named ‘image_scaled’ which is not present in this network.” then
something went wrong connecting the old layers to the new one.
Saving the Weights as 16-bit Floats
Neural networks often have millions of learned parameters. When you convert a model to
Core ML, these parameters are saved as 32-bit floats or 4 bytes per parameter.

That means a model with 10 million parameters will take up 40 MB in your app bundle.
That’s 40 additional megabytes that users of your app will have to download. And 10 million
parameters isn’t really that much yet — there are models with 50 or even 100 million
parameters.

With a bit of model surgery it’s easy to change the weights and biases to be 16-bit floats
instead. This cuts the storage size of your model in half.

Note: This is a feature of mlmodel version 2, supported by iOS 11.2 and macOS 10.13.2.

To tell whether a model is using 16-bit weights or the full 32-bit weights, you can look at
the weights field from any of your layers. The weight values are stored either in floatValue,
float16Value, or rawValue. For example, for a convolution layer:

def examine_weights(weights):
if len(weights.floatValue) > 0:
print("Weights are 32-bit")
elif len(weights.float16Value) > 0:
print("Weights are 16-bit")
elif len(weights.rawValue) > 0:
print("Weights are quantized or custom layer")
else:
print("This layer has no weights")

nn = get_nn(spec)
examine_weights(nn.layers[i].convolution.weights)

To convert the weights for the entire model to 16-bit floats, you can use the function
quantize_weights() from the quantization_utils module. This works on all neural network-
type models and also on pipeline models.


Note: This function used to be named convert_neural_network_spec_weights_to_fp16() but this is deprecated as of coremltools 4.

Here is a handy utility script that you can use to convert an existing model to 16-bit:

import sys
import coremltools as ct
from coremltools.models.neural_network import quantization_utils

if len(sys.argv) != 3:
    print("USAGE: %s <input_mlmodel> <output_mlmodel>" % sys.argv[0])
    sys.exit(1)

input_model_path = sys.argv[1]
output_model_path = sys.argv[2]

model = ct.models.MLModel(input_model_path)
model_fp16 = quantization_utils.quantize_weights(model, nbits=16)
model_fp16.save(output_model_path)

Save this script as convert_to_float16.py and then invoke it as follows:

$ python3 convert_to_float16.py input.mlmodel output.mlmodel

I tried this script on a few models from Apple’s developer site79 . Here are the file sizes of
the original models and saved with 16-bit floats:

Model name            32-bit weights (bytes)   16-bit weights (bytes)
SqueezeNet.mlmodel         4,971,516                2,500,504
MobileNet.mlmodel         17,136,858                8,585,292
Inceptionv3.mlmodel       94,704,130               47,382,082
ResNet50.mlmodel         102,586,628               51,313,154

As you can tell from these results, the 16-bit version is indeed 50% of the original size.

Keep in mind that 32-bit or 16-bit is only for the way your weights are stored inside the
mlmodel file. What happens during runtime is independent of the storage method! The
runtime data type for the weights depends on the hardware that is running the model:

• CPU: uses 32-bit weights, even if the model is saved as 16-bit.

• GPU: uses 16-bit weights, even if the model is saved as 32-bit.

• Neural Engine: who knows?

79 https://developer.apple.com/machine-learning/models/

I think it’s a good idea to always save your models using 16-bit floats. You do lose a bit of
precision when compared to the full 32-bit numbers the model was originally trained with,
but on average these errors cancel out and it shouldn’t make the model’s predictions any
worse.
Quantizing the Weights
The last chapter showed how to cut your mlmodel file size in half by replacing the full 32-
bit weights with 16-bit floats. If your model is still too big, you can compress the neural
network weights even further using quantization.

With 8-bit quantization your mlmodel file becomes 4× smaller — and coremltools allows
even smaller quantization sizes too!

Note: Model quantization is available as of mlmodel version 3, which means you need at
least iOS 12 or macOS 10.14 to run the model. Of course, you can still quantize a model even if you’re not on macOS 10.14, and then try it out in the iOS Simulator.

What is quantization?

Quantization means that the weights from a layer are divided up into a fixed number of
“bins” or “buckets”. Each bin gets its own weight value, so all the weights that are put into
a certain bin will have the same value after quantization.

Core ML lets you choose quantization levels from 1 bit to 8 bits. The fewer bits you use, the
fewer bins you get. With 8-bit quantization the weights are placed into 256 bins. With 4-bit
quantization, there are only 16 bins. And with 1-bit quantization, there are only two bins!

Because the weights are rounded off to the value of their nearest bin, you lose some
information. The fewer bins there are, the smaller the model file will become but the more
information you lose. The trade-off with quantization is therefore between model size and
prediction accuracy.

For example, let’s say a layer has 100,000 weights. Stored as 32-bit floats, most of these
weights will be unique, although many of them will be similar (differing only a few places after the decimal point). After quantizing the model with 4 bits, that layer will still have 100,000
weights but each will now be one of only 16 possible values.

Naturally, this will also affect the predictions made by the model. Usually the accuracy of
the model becomes worse the more quantization you use.


Bins do not all need to have the same width. The wider a bin is, the more weights it contains
and the more these weights are rounded off — and therefore the less precise they are.
The trick is to find a good arrangement of bins that matches how the weight values are
distributed along the number line, reducing the amount by which they get rounded off.
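To make this concrete, here is a tiny NumPy illustration of what plain linear quantization into 2^nbits equal-width bins does to a set of weights. This is only a sketch of the concept, not the actual coremltools implementation:

import numpy as np

def linear_quantize(weights, nbits=8):
    num_bins = 2 ** nbits
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (num_bins - 1)
    indices = np.round((weights - w_min) / scale)    # what would be stored (0 .. num_bins-1)
    return indices * scale + w_min                   # the rounded-off weight values

weights = np.random.randn(100000).astype(np.float32)
approx = linear_quantize(weights, nbits=4)
print("unique values:", len(np.unique(approx)))               # at most 16
print("mean rounding error:", np.abs(weights - approx).mean())

The fewer bits you give it, the larger the mean rounding error becomes — exactly the size versus accuracy trade-off described above.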

Note: Quantization only affects the size of the mlmodel in your app bundle. At runtime,
Core ML will de-quantize the weight values when it loads the model, and the computations
still use 32-bit and 16-bit floating points. There is certain hardware that can work directly
with quantized numbers, but the iPhone CPU and GPU don’t — although perhaps the Neural
Engine can, but not much is known about the inner workings of this chip yet.

Quantizing with coremltools

Inside coremltools is a module models.neural_network.quantization_utils with functions to quantize your models and also to compare the performance of the full-precision model with the quantized model.

There are two functions for performing the quantization:

• quantize_weights(): Takes an MLModel object with 32-bit weights and returns a new
MLModel object that has quantized weights. If desired, it also performs a comparison
between the original and quantized models.

• quantize_spec_weights(): Does the same thing but works directly on the spec object.
Use this if you prefer working with the spec instead of the MLModel, but it is somewhat
less convenient to use. (No longer available in coremltools 4.)

Both functions take the following arguments:

• The original MLModel or spec.

• The number of bits to use. This must be between 1 and 8 for integer quantization. If
you use 16 here, you get 16-bit floats.

• The quantization mode.

• Optionally, a custom function for computing the quantization look-up table (LUT).

Core ML supports a few different quantization modes that determine how the original
weight values are mapped into the bins:

• Linear mode. This is the simplest mode. The original weights are rounded off to fall on
a straight line that is given by the equation scale*quantized + bias. You get a separate
scale and bias for each output channel in the layer. This method works very well if
the weights are uniformly distributed, as it assumes that the bins all have the same
width.

• Symmetric linear mode. Same as linear but the range of the weights is assumed
to be symmetrical around 0. With linear mode, the range of the weights goes from
[min(weights), max(weights)]. In symmetric mode, the range is [-X, X] where X =
max(abs(weights)).

• Linear with lookup table. This is exactly like linear mode but instead of a single scale
and bias value, it stores the quantized weight values in a look-up table. (I’m not quite
sure what the advantage of this mode is.)

• Lookup table computed by k-means clustering. If you’re using 256 bins, for exam-
ple, the k-means algorithm finds the 256 most common weight values. This is a good
choice if the weights aren’t evenly distributed along the number line. The bins will
now all have different widths.

• Custom lookup table. You can also provide your own algorithm for computing the
bins. For example, you could use a clustering algorithm that pays more importance to
the larger weights, because they will have a greater effect on the result than smaller
weights.

There is also a dequantization mode that takes a quantized model and restores the weights
back to 32-bit floats. That can be useful for examining the difference between the two
versions.

A few more notes about quantization in Core ML:

• The linear mode will quantize the weights for each convolutional filter separately, but
the look-up table modes only use a single look-up table that is shared by all the filters
in the layer. Which method is better depends on the actual weight values used by the
layer, and you may want to experiment a little.

• Each layer uses its own set of bins, and in theory you could quantize some layers
more than others. Or not quantize certain layers at all. The quantize_weights() function
currently does not support this but you could do it by hand by directly editing the spec
and calling some of the private functions from coremltools’ quantization_utils.py.

• If a neural network layer has biases, these are quantized separately from the main
weights. Interesting piece of trivia: when the number of channels in a layer is less
than the number of bins, there’s enough room in the look-up table so that each bias
gets its own bin and effectively they don’t get quantized at all.

• Custom layers are not quantized. If your custom layer has weights that should be
quantized too, you need to do this yourself.

How to do it

Here is a simple script for quantizing a model:

import sys
import coremltools as ct
from coremltools.models.neural_network import quantization_utils

if len(sys.argv) < 4:
    print("USAGE: %s <input_mlmodel> <output_mlmodel> <mode> <bits>" % sys.argv[0])
    sys.exit(1)

input_model_path = sys.argv[1]
output_model_path = sys.argv[2]
mode = sys.argv[3]
nbits = int(sys.argv[4]) if len(sys.argv) > 4 else 8

model = ct.models.MLModel(input_model_path)
quant_model = quantization_utils.quantize_weights(model, nbits, mode)
quant_model.save(output_model_path)

Save this as quantize.py and use it as follows from Terminal:

python3 quantize.py YourModel.mlmodel YourModelQuantized.mlmodel linear 8

The mode argument can be linear or kmeans (and, as shown below, dequantization). After running this script, the specification version of the mlmodel is now bumped up to 3.

You can also use this script to de-quantize a previously quantized model:

python3 quantize.py YourModelQuantized.mlmodel YourModel.mlmodel dequantization

Of course, the weights in this de-quantized model won’t be exactly the same as in the
original model. But they will be the actual weights that Core ML uses at runtime, since
Core ML de-quantizes the model too before making predictions with it.

A closer look at the weights

I thought it would be fun to look at the before-and-after quantization weights. First, I used kmeans mode on SqueezeNet.mlmodel to quantize it to 8 bits, and then used the dequantization mode to turn the weights back to floating point numbers.

Here you’ll be looking at the layer fire7/expand1x1, which has 9216 weights (48 input
channels, 192 output channels, 1×1 convolution kernel).

I put the original floating point weights into a NumPy array and plotted a histogram to see
how the weights from this layer were distributed. Notice how this creates a nice bell-shaped
curve with most of the weights centered around 0.

I did the same for the quantized weights and it created a similar curve, which is a good
thing, but it is not as clean. This shows that quantization still keeps the weights mostly the
same — but not exactly.

By the way, to put the weights of a Core ML layer into a NumPy array, you can write:

weights = np.array(nn.layers[i].convolution.weights.floatValue)

To see how many weights there are per bin, and what these weight values are, you can use
np.unique(). It indeed returns 256 unique values.
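If you want to reproduce these plots, a minimal matplotlib sketch could look like this. The layer index i, and having both the original and the de-quantized model loaded as nn and nn_dequant, are assumptions:

import numpy as np
import matplotlib.pyplot as plt

original = np.array(nn.layers[i].convolution.weights.floatValue)
dequantized = np.array(nn_dequant.layers[i].convolution.weights.floatValue)

plt.hist(original, bins=100, alpha=0.5, label="original")
plt.hist(dequantized, bins=100, alpha=0.5, label="after 8-bit kmeans")
plt.legend()
plt.show()

print("bins used:", len(np.unique(dequantized)))   # 256 for 8-bit quantization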

Here is a plot of how many weights are inside each bin:

And here are the actual weight values represented by each bin:

If you use linear mode, this would be a straight line.

Finally, I also calculated the average error for each bin. I did this by first computing the
absolute error for every single weight, then adding the errors to their respective bins, and
finally dividing by the number of weights inside the bin. The plot for average error per bin:

As you can see, the errors are quite low near the center (weight values close to 0) but they
become larger as you go outward.
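Here is a sketch of that per-bin error computation, reusing the original and dequantized arrays from the previous snippet:

bins, inverse = np.unique(dequantized, return_inverse=True)  # one entry per bin value
abs_err = np.abs(original - dequantized)

sum_err_per_bin = np.zeros(len(bins))
np.add.at(sum_err_per_bin, inverse, abs_err)                 # total error per bin
avg_err_per_bin = sum_err_per_bin / np.bincount(inverse)     # divide by weights per bin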

For linear quantization mode the story is a little different. Because each output channel
is quantized using a different line equation, i.e. with a different scale and bias value,
np.unique() now finds 8173 unique weights in the quantized layer. That’s almost as many as in the original layer, which had 9216 weights.

So you’d think that linear mode would give better results than kmeans. However, it still only
has 256 bins per output channel and if you look at the average error per bin for any given
output channel, the errors are more noisy. That is not so strange, as linear mode assumes
the weights are distributed uniformly, while you saw in the first histogram that they were
shaped more like a bell curve.

How good / bad is the quantized model?

Quantization is a trial-and-error process because it’s hard to say beforehand what the best
strategy is for a given model. The only way to find out is to try it. Here is the approach I
would suggest:

• First convert the model normally, using 32-bit weights, and verify that it is correct.

• Create a small test set of representative images (or other data) for comparing the
predictions from the quantized model with the full 32-bit version.

• Quantize the model with 8 bits. Run it on the test set and compare the predictions
with the full-precision model to see how the quantization has affected the results. If
the results are much worse, then give up on quantization and convert the model to
16-bit floats instead — that’s apparently the best you can do for this model.

• If the quantized model still performs well, reduce the number of quantization bits
from 8 to 7 and try again. Keep going until the results are no longer acceptable.

• Note that you should try all the quantization modes. Sometimes linear quantization
works better, sometimes the look-up table does.

To help with this process, coremltools has a compare_models() function that can compare
the performance of a full-precision model versus the quantized model. This only works on
macOS 10.14 (Mojave) and better because it requires the latest CoreML.framework to make
predictions on quantized models.

For a model that has a single image input, you can do:

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

full_model = ct.models.MLModel("YourModel.mlmodel")
quant_model = ct.models.MLModel("YourModelQuantized.mlmodel")

quantization_utils.compare_models(full_model, quant_model, sample_data="ImagesFolder")

Here, "ImagesFolder" is the name of a directory with test images. compare_models() will load
the PNG and JPEG images from that folder and make predictions using both models.

To evaluate a classifier model, compare_models() counts how many of the quantized model’s
top-5 predictions are also in the top-5 of the original model. For image-to-image models, it
calculates the signal-to-noise ratio between the two generated images.

For SqueezeNet.mlmodel, quantizing in linear mode with 8-bits, the comparison on my small
test had the following results:

Output classLabelProbs:
-------------------------
Top 5 Agreement: 77.42%

Output classLabel:
--------------------
Top 1 Agreement: 45.16%

The same model quantized using kmeans mode, also with 8-bits, scored 100% top-5 agreement
and 93.55% on the top 1. So for this particular model, using kmeans is a lot more accurate
than linear mode.

Note: If the model has more than one input or the input isn’t an image, then sample_data
should be a list of dictionaries containing the data for each input.

If you pass the sample_data argument to quantize_weights(), it will automatically run
compare_models() after the quantization completes. (But only on macOS 10.14 and up.)
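To tie these steps together, here is a hedged sketch of such a trial-and-error loop using
quantize_weights(). The file and folder names are placeholders; on macOS quantize_weights()
returns an MLModel, on other platforms it returns a spec:

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

full_model = ct.models.MLModel("YourModel.mlmodel")

for nbits in [8, 7, 6]:
    for mode in ["linear", "kmeans"]:
        # sample_data triggers the automatic compare_models() run on macOS 10.14+
        quant_model = quantization_utils.quantize_weights(
            full_model, nbits, quantization_mode=mode,
            sample_data="ImagesFolder")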

Inside the spec

The weights for each layer are stored in a WeightParams object. Any biases get their own
WeightParams. Normal 32-bit weights are in the floatValue field, half precision 16-bit floats
are in float16Value, while quantized weights can be found in the rawValue field.

• For linear quantization, each element from rawValue is the x-coordinate that goes into
the equation for the line scale*x + bias.

• For the look-up table modes, each element from rawValue represents an index in the
look-up table.

Before they can be used, the contents of rawValue must be decoded. The datatype of rawValue
is an array of bytes, but only in the case of 8-bit quantization can you directly read each
element. For 7 bits or fewer, the elements are all packed tightly together. (In the extreme
case of 1-bit quantization, a single byte contains the look-up table indices for eight weight
values.)

The quantization method used is described by a QuantizationParams object:

message QuantizationParams {
  uint64 numberOfBits = 1;
  oneof QuantizationType {
    LinearQuantizationParams linearQuantization = 101;
    LookUpTableQuantizationParams lookupTableQuantization = 102;
  }
}

The LinearQuantizationParams are for “linear” mode, while LookUpTableQuantizationParams is
used for the other modes that all use look-up tables. The look-up table is simply an array
of 32-bit floating point values, one for each bin. For example, if numberOfBits is 8, then this
array has 256 elements.
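As a hedged sketch, this is roughly how you could peek at those fields for a quantized
convolution layer. It assumes nn is the neural network spec and that the layer at index i is
a quantized convolution:

import numpy as np

wp = nn.layers[i].convolution.weights      # a WeightParams object
q = wp.quantization
print(q.numberOfBits)

if q.WhichOneof("QuantizationType") == "lookupTableQuantization":
    lut = np.array(q.lookupTableQuantization.floatValue)   # 2**numberOfBits entries
    if q.numberOfBits == 8:
        # only with 8 bits can rawValue be read byte-for-byte
        indices = np.frombuffer(wp.rawValue, dtype=np.uint8)
        decoded = lut[indices]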

You probably shouldn’t mess around too much with the contents of rawValue yourself, but
if you must, check out the code in quantization_utils.py from coremltools for handy helper
functions.
Changing the Input Type to Image
A common issue is that after converting a model to Core ML, the data type of the input is
“MultiArray” and not “image”. That is inconvenient — if a model works on images, it should
accept images as input.

This is easy to fix if you have access to the original model. In your conversion script, provide
a ct.ImageType object for each of the image inputs.
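With the unified converter that could look roughly like this sketch, where your_original_model,
the shape, and the scale are placeholders for your own model:

import coremltools as ct

mlmodel = ct.convert(
    your_original_model,    # e.g. a traced PyTorch model or a TensorFlow model
    inputs=[ct.ImageType(name="image",
                         shape=(1, 3, 224, 224),   # adjust to your model's input
                         scale=1/255.0)])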

If you’re using one of the old converters, pass the argument image_input_names="your_input"
to the convert function. It tells coremltools that the input named "your_input" should be
treated as an image. For a model with multiple inputs, you can supply a list of names:
image_input_names=["first_input", "second_input"].

If you only have the .mlmodel file and not the original model or the conversion script, you
can still fix this using a bit of Python:

import coremltools as ct
import coremltools.proto.FeatureTypes_pb2 as ft

spec = ct.utils.load_spec("YourModel.mlmodel")

input = spec.description.input[0]
input.type.imageType.colorSpace = ft.ImageFeatureType.RGB
input.type.imageType.height = 224
input.type.imageType.width = 224

ct.utils.save_spec(spec, "YourNewModel.mlmodel")

This script changes the data type of the first input, input[0], to expect an image.

This assumes that the image is RGB. If necessary, change colorSpace to GRAYSCALE or BGR.
You should use BGR if the model was trained with Caffe or if OpenCV was used to load the
training images. Core ML will automatically convert the pixel order of your input images to
whatever format the model requires, but it can only do this if you set colorSpace correctly.

Also make sure that the width and height are correct. To support flexible image sizes, see
the chapter Size Flexibility.

Outputting an Image Instead of a MultiArray
The converters provided with coremltools have an option image_input_names that tells the
converter which of the inputs should be treated as images, so that Core ML lets you pass in
a CVPixelBuffer object for these inputs.

However, there is no image_output_names. So if you have an image-to-image model, any
image outputs will become MultiArray outputs in the mlmodel.

There are two things you can do to get a CVPixelBuffer as output from Core ML:

1. convert the MultiArray to an image in the app yourself, or

2. change the mlmodel so that it knows the output should be an image.

Option number two is what this chapter is all about. For example, if your spec.description
prints the following:

output {
  name: "generatedImage"
  type {
    multiArrayType {
      shape: 3
      shape: 300
      shape: 150
      dataType: DOUBLE
    }
  }
}

Then you can turn this output description into an image by writing:


output = spec.description.output[0]

import coremltools.proto.FeatureTypes_pb2 as ft
output.type.imageType.colorSpace = ft.ImageFeatureType.RGB
output.type.imageType.height = 300
output.type.imageType.width = 150

And now the output is an RGB image:

output {
  name: "generatedImage"
  type {
    imageType {
      width: 150
      height: 300
      colorSpace: RGB
    }
  }
}

If you want the image to have BGR pixel order, write colorSpace = ft.ImageFeatureType.BGR
instead of RGB.

Important: You can’t just turn any MultiArray into an image. Pay attention to the shapes:

• the first shape is the number of channels

• the second is the image’s height

• the third is the image’s width

You can only turn a MultiArray into an RGB or BGR image if the number of channels is 3. If
the number of channels is 1, or if there are only two shape values (height and width), then
you must use colorSpace = ft.ImageFeatureType.GRAYSCALE.

If there are more than three shape values listed, and the first or last of these are 1, then you
can delete these unused dimensions. For example, if you have:

multiArrayType {
  shape: 1
  shape: 3
  shape: 300
  shape: 150
  dataType: DOUBLE
}

Then do the following to get rid of that first dimension with size 1:

del output.type.multiArrayType.shape[0]

Removing a dimension of size 1 doesn’t change the actual data, and now it lets Core ML
interpret the data as being an image.

Note: Naturally, the output of your model really must be an image. The model must output
a tensor with values in the range 0 - 255. Larger values or negative values will be clipped to
this range. If your model outputs a different range, you may need to insert a multiply layer
at the end to scale the values.
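For example, if your model outputs values in the 0 - 1 range, a hedged sketch of such a
scaling layer could look like this. It uses the same rewiring pattern as the chapter Inserting
a New Layer, and "generatedImage" is the example output name from above:

nn = spec.neuralNetwork    # for a plain neural network; otherwise use get_nn(spec)

new_layer = nn.layers.add()
new_layer.name = "scale_output"
new_layer.multiply.alpha = 255.0    # scalar multiply when the layer has a single input

new_layer.output.append(nn.layers[-2].output[0])   # now produces "generatedImage"
nn.layers[-2].output[0] = "scale_output_input"
new_layer.input.append(nn.layers[-2].output[0])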

To make it a little more convenient to convert MultiArray outputs to images, I’ve added a
helper function to helpers.py from the book’s GitHub repo80 . You can use this function as
follows:

from helpers import convert_multiarray_to_image

convert_multiarray_to_image(spec.description.output[0], is_bgr=False)

This function does exactly the same thing as described in this chapter, but it also checks if
the MultiArray can be converted to an image and throws an error if not.

Tip: You can also use this function on a model input by passing in spec.description.input[i].
That’s handy if you forgot to specify image_input_names when converting the model and you
want to fix it afterwards.

80 https://github.com/hollance/coreml-survival-guide
Outputting Floats Instead of Doubles
When a Core ML model outputs its predictions as a MultiArray, the data type of the numbers
inside that MultiArray is always double precision, i.e. 64-bit floating point. This is true
even if 32-bit floats (CPU) or 16-bit floats (GPU) are used to do the computations.

Having doubles is usually no problem, but if you prefer 32-bit single precision floats you
can change this in the mlmodel.

As of coremltools 3.2, the easiest way to do this is to pass the argument
use_float_arraytype=True to the converter. This automatically changes the data type of any
input or output MultiArrays to 32-bit floats.

If you’re using a converter that doesn’t have this option or if you already have an existing
mlmodel file, you can make the change as follows. First, get a spec object for the mlmodel.
To view what data type is used by the model’s outputs, write print(spec.description). This
should show something like this:

output {
  name: "outputName"
  type {
    multiArrayType {
      dataType: DOUBLE
    }
  }
}

Define the following helper function to change that DOUBLE into FLOAT32:

import coremltools.proto.FeatureTypes_pb2 as ft

def update_multiarray_to_float32(feature):
    if feature.type.HasField("multiArrayType"):
        feature.type.multiArrayType.dataType = ft.ArrayFeatureType.FLOAT32

And use this function as follows:


for feature in spec.description.output:
    update_multiarray_to_float32(feature)

Note: It’s also possible to use ft.ArrayFeatureType.INT32 if you want the output to be integers
instead of floating point numbers.

If the model uses a MultiArray as input and you also want to change the input data type to
floats, then do the same thing on spec.description.input:

for feature in spec.description.input:
    update_multiarray_to_float32(feature)

Afterwards, do a print(spec.description) again to verify that the output type really has
changed. Save the changed spec to a new mlmodel file.

If you open the mlmodel in Xcode, it now says the MultiArray has the data type Float32:

Here is a handy script that you can use to convert the inputs and outputs of an existing
model from doubles to floats:

import sys
import coremltools as ct
import coremltools.proto.FeatureTypes_pb2 as ft

def update_multiarray_to_float32(feature):
    if feature.type.HasField("multiArrayType"):
        feature.type.multiArrayType.dataType = ft.ArrayFeatureType.FLOAT32

if len(sys.argv) != 3:
    print("USAGE: %s <input_mlmodel> <output_mlmodel>" % sys.argv[0])
    sys.exit(1)

input_model_path = sys.argv[1]
output_model_path = sys.argv[2]

spec = ct.utils.load_spec(input_model_path)

for feature in spec.description.input:
    update_multiarray_to_float32(feature)

for feature in spec.description.output:
    update_multiarray_to_float32(feature)

ct.utils.save_spec(spec, output_model_path)

Save this script as convert_multiarrays_to_floats.py and then invoke it as follows:

$ python3 convert_multiarrays_to_floats.py input.mlmodel output.mlmodel

Note that this script only affects models that predict one or more MultiArray objects. It has
no effect on the output of classifiers.

Tip: coremltools now also has a built-in function to do this. Call
ct.utils.convert_double_to_float_multiarray_type(spec) on your spec object to convert all
input and output MultiArrays from doubles to 32-bit float.
Tidying up MultiArray Shapes
Sometimes when you convert a model that predicts one or more MultiArrays, the shapes for
these outputs are not completely filled in. For example, this is what I got after converting
the MobileNet + SSD model from TensorFlow to Core ML:

output {
  name: "scores"
  type {
    multiArrayType {
      dataType: DOUBLE
    }
  }
}
output {
  name: "boxes"
  type {
    multiArrayType {
      shape: 4
      shape: 1917
      shape: 1
      dataType: DOUBLE
    }
  }
}

I know for a fact that the first output, "scores", will always be a (91, 1917) array because
there are 91 classes in this model and 1917 predicted bounding boxes. It would be nice if
that output shape was already described by the model. That way, the user of this model
knows what size MultiArray to expect.

The second output, "boxes", has a strange shape of (4, 1917, 1). The 4 and 1917 are correct,
but what’s up with that empty dimension at the back?

These things happen because the converter may have trouble inferring the shape of the
data as it flows through the model. This especially seems to happen with TensorFlow
models, because these often include operations that the converter cannot bring over to
Core ML.

But no worries! If you know what the output shapes ought to be, you can fix this in the spec
with some model surgery.

To take care of the "scores" output, fill in the shape property on the MultiArray:


spec.description.output[0].type.multiArrayType.shape.append(91)
spec.description.output[0].type.multiArrayType.shape.append(1917)

And for the "boxes" output, delete the last entry from the shape list:

del spec.description.output[1].type.multiArrayType.shape[-1]

Now print(spec.description) shows the proper shapes that you’d expect for this model:

output {
  name: "scores"
  type {
    multiArrayType {
      shape: 91
      shape: 1917
      dataType: DOUBLE
    }
  }
}
output {
  name: "boxes"
  type {
    multiArrayType {
      shape: 4
      shape: 1917
      dataType: DOUBLE
    }
  }
}

Much neater!

Note: Not all models need to have their output shape completely defined in the mlmodel
file. Some model types, such as NonMaximumSuppression, can output a varying number
of results, in which case you can leave the shape undefined or use Size Flexibility.
Renaming Inputs and Outputs
Most model converters let you specify the desired input and output names, using the
input_names and output_names arguments.

mlmodel = convert(original_model, input_names="image", output_names="prediction", ...)

If you don’t provide these, the converter will use the names from the original model or
placeholder names. In that case, you may end up with an input or output named something
like "ImageTensor__0", or worse, "173".

The names of the layers inside the model are not very important, but it’s a good idea to
give readable, clean names to the inputs and outputs. These names will also be used as
properties inside the automatically generated code, so it’s important they are valid Swift /
Objective-C identifiers.

The easiest way to rename inputs or outputs is with rename_feature() from coremltools.utils.

from coremltools import utils

spec = utils.load_spec("YourModel.mlmodel")

utils.rename_feature(spec, "UglyInputName", "image", rename_outputs=False)
utils.rename_feature(spec, "UglyOutputName", "prediction", rename_inputs=False)

utils.save_spec(spec, "YourNewModel.mlmodel")

In this example, you rename the input named "UglyInputName" to "image", and the output
"UglyOutputName" to "prediction". When you set rename_outputs to false, rename_feature()
only looks at inputs with the given name, and vice versa.

Remember that these names must be valid Swift or Objective-C identifiers, so names may
not start with a number or contain spaces, dashes, slashes, and so on. Just stick to letters
and underscores.

Important: rename_feature() does not update the image preprocessing options. If the input
is an image and the model has preprocessing options, you still have to update these by
hand. The model won’t work unless you do this!

To also update the preprocessing options, do:


spec.neuralNetwork.preprocessing[0].featureName = "image"

The featureName should match the name of the model’s input.

You can also rename the inputs and outputs by hand by directly manipulating the spec. See
the chapter Cleaning Up a Converted Model for an example of how to do this.
Inserting a New Layer
Sometimes you cannot convert the entire model at once because the converter gets stuck
on an unsupported layer type. You may be able to work around this manually by inserting
the rest of the layers by hand. Or perhaps you want to do something clever after conversion
that requires adding a layer.

Using the spec and the get_nn() function, it’s relatively easy to insert new layers anywhere
into the model. The trick is to insert the layer at the right position in the layer list, and to
fix up the connections between the layers afterwards.

Inserting at the end

To append a new layer to the end of the model, write this:

import coremltools as ct
spec = ct.utils.load_spec("YourModel.mlmodel")
nn = get_nn(spec)

new_layer = nn.layers.add()

And then fill in the properties of the new_layer object. You do have to re-route the connec-
tions so that the output of the previous last layer becomes the input of this one, and the
output of the new layer is equal to the output of the model.

Note that in a Core ML model, the inputs and outputs are always identified by name. The
inputs for a layer are given by a list of strings, and so are the outputs. To connect two layers,
add an output name from one layer to the list of input names of the other layer.

As an example, let’s add a linear activation function after the last layer. A linear activation
has no effect and so this shouldn’t change the working of the model at all, but it makes for
a good demonstration:

new_layer.name = "test_layer"
new_layer.activation.linear.alpha = 1.0

To re-route the connections, first connect the new layer to the same output as the previous
last layer, which is now found in nn.layers[-2]:


new_layer.output.append(nn.layers[-2].output[0])

The old last layer’s output must now go into the new layer’s input. You do that by giving
them the same name, such as "test_layer_input":

nn.layers[-2].output[0] = "test_layer_input"
new_layer.input.append(nn.layers[-2].output[0])

Note: Exactly what this name is doesn’t matter, as long as it’s unique and not the same as
one of the model’s outputs.

Now the new layer looks like this:

name: "test_layer"
input: "test_layer_input"
output: "classLabelProbs"
activation {
linear {
alpha: 1.0
}
}

Its output is the output of the model, "classLabelProbs". Its input is the output from the
previous last layer, which you renamed from "classLabelProbs" to "test_layer_input".

See also the chapter Cleaning Up a Converted Model for a practical example of this
technique where you insert an ArgMax layer at the end of a semantic segmentation model.

Tip: As of coremltools 3.0, another way to append new layers is to load the spec into
NeuralNetworkBuilder and call the builder.add_xxx() methods, like so:

existing_model = ct.utils.load_spec("YourModel.mlmodel")
builder = ct.models.neural_network.NeuralNetworkBuilder(spec=existing_model)
builder.add_activation(...)

Even so, you’ll still need to use the spec to change the connections between the lay-
ers and the outputs. See the chapter NeuralNetworkBuilder for more details on using
NeuralNetworkBuilder.

Inserting at the beginning

This is more tricky because Core ML wants the layers to be in the correct order in the
mlmodel. If you add the new layer to the end of the list but connect it to the input of the
model — rather than the output like you did in the previous section — then Xcode won’t be
able to open the mlmodel file.

So this won’t work:

new_layer = nn.layers.add()
new_layer.name = "test_layer"
new_layer.activation.linear.alpha = 1.0

new_layer.input.append(nn.layers[0].input[0])
new_layer.output.append("test_layer_output")
nn.layers[0].input[0] = "test_layer_output"

Instead, you’ll have to remove all the existing layers from the model first, add the new layer
so that it really is the first layer in the model, and then put the old layers back. Finally,
connect the old first layer to the new one.

import copy
old_layers = copy.deepcopy(nn.layers)

del nn.layers[:]

new_layer = nn.layers.add()
new_layer.name = "test_layer"
new_layer.activation.linear.alpha = 1.0

new_layer.input.append(old_layers[0].input[0])
new_layer.output.append("test_layer_output")

nn.layers.extend(old_layers)
nn.layers[1].input[0] = "test_layer_output"

See also the chapter Using a Different Scale for Each Color Channel for a practical applica-
tion of this technique.

Inserting in the middle

This has the same problem as inserting at the front: you have to put the layer into the correct
position in the list, otherwise Xcode won’t be able to open the mlmodel file. Unfortunately,
the Python protobuf API does not have a convenient way to do this. Again, you’ll have to
remove the layers and put them back manually.

It’s useful to create a dictionary that lets you look up the index of a layer by its name:

layer_dict = {layer.name:i for i,layer in enumerate(nn.layers)}

Now find the index of the layer before which you want to insert the new layer:

layer_idx = layer_dict["conv6"]

You should make a copy of this layer and all the layers that follow it, and then delete them
from the model:

import copy
old_layers = copy.deepcopy(nn.layers[layer_idx:])

del nn.layers[layer_idx:]

Create the new layer and connect it to the last layer in the model:

new_layer = nn.layers.add()
new_layer.name = "test_layer"
new_layer.activation.linear.alpha = 1.0

new_layer.output.append("test_layer_output")
new_layer.input.append(nn.layers[-2].output[0])

Finally, put the old layers back and connect the first of them to the new layer.

nn.layers.extend(old_layers)
nn.layers[layer_idx + 1].input[0] = "test_layer_output"

Note that you have to use layer_idx + 1 here because the new layer you just inserted is at
layer_idx now.

Note: If the model is complex and has many branching structures, you need to pay close
attention to the exact order of the layers in the model and where you can insert the new
layer. The layers should be ordered so that the inputs to each layer have been previously
defined.
Changing an Existing Layer
Everyone makes mistakes, even model converters. Especially on complicated models such
as TensorFlow graphs, it’s possible that a layer in your mlmodel isn’t 100% the same as in
the original model. In that case you may want to dig into the spec and change the properties
of that layer.

For example, a convolution layer may have the wrong kind of padding.

layer = spec.neuralNetwork.layers[7]
print(layer.convolution.valid)

This may print something like:

paddingAmounts {
  borderAmounts {
  }
  borderAmounts {
  }
}

In this case, no zero-padding is added to the edges of the image before it goes into this
convolution layer. If that is incorrect for your model, you can overwrite these settings with
the correct padding amounts.

For example:

del layer.convolution.valid.paddingAmounts.borderAmounts[:]

y_pad = layer.convolution.valid.paddingAmounts.borderAmounts.add()
y_pad.startEdgeSize = 1
y_pad.endEdgeSize = 1

x_pad = layer.convolution.valid.paddingAmounts.borderAmounts.add()
x_pad.startEdgeSize = 1
x_pad.endEdgeSize = 1

And now this convolution layer applies one pixel of padding all around:


paddingAmounts {
  borderAmounts {
    startEdgeSize: 1
    endEdgeSize: 1
  }
  borderAmounts {
    startEdgeSize: 1
    endEdgeSize: 1
  }
}

Obviously, changing the padding of a layer will also change the shape of its output, which
may affect the output shapes of subsequent layers as well. You should check the layer
output shapes to make sure everything is (still) in order.

You can use this same technique to change any layer property, including its weights. Just
make sure your changes don’t break anything downstream in the model.
Deleting a Layer
Having to delete a layer from a Core ML model is not a common occurrence but it can be
useful to know how.

For example, sometimes model converters leave a bunch of unused layers in the model.
This isn’t a problem most of the time but you still may want to clean this up.

In the chapter Cleaning Up a Converted Model, the converted Core ML model had two
resizeBilinear layers in a row. Removing one of these layers actually makes the model a
bit faster!

Don’t include the layer in the first place

Some converters, such as tfcoreml, let you specify what the first and last layers are, and
they will only convert the original model in between these layers. If there are layers at the
beginning or end of the model that you do not wish to include, simply don’t convert those.

Likewise, if you can already delete the layer from the original model before conversion,
that is perhaps easier than deleting it afterwards.

Replace the layer with an identity layer

Instead of physically removing the layer from the model, you can convert it to a so-called
“identity” layer. These perform no function but simply pass any data straight through.

Given a spec, you can write the following to turn the layer named "conv_30" into an identity
layer:


# Find the layer
nn = get_nn(spec)
layer_dict = {layer.name: i for i, layer in enumerate(nn.layers)}
layer_to_delete = nn.layers[layer_dict["conv_30"]]

# Replace it with an identity operation
layer_to_delete.activation.linear.alpha = 1
layer_to_delete.activation.linear.beta = 0

Core ML doesn’t have an actual “identity” layer type, but you can achieve the same thing
using a linear activation layer. By filling in the activation.linear property, you overwrite
whatever type this layer had before.

Removing a layer in the mlmodel

In theory, Core ML could remove any unused layers, such as identity layers, when it loads
the model onto the device but it’s unknown if it actually makes that optimization. To be
100% sure that Core ML doesn’t perform any computations that it doesn’t need to, you can
completely remove the layer from the mlmodel file.

To delete the layer named "conv_30", do the following:

# Find the layer
nn = get_nn(spec)
layer_dict = {layer.name: i for i, layer in enumerate(nn.layers)}
layer_index = layer_dict["conv_30"]

# Connect the next layer's input to the previous layer's output
nn.layers[layer_index + 1].input[0] = nn.layers[layer_index].input[0]

# Remove the layer from the model
del nn.layers[layer_index]

Some things to pay attention to:

• If you’re deleting a layer that is connected to a model output, you now have to connect
the previous layer to this output. You do this by putting the name of the model output
into the previous layer’s output[0] property.

• If you’re deleting a layer that is connected to a model input, you’ll have to connect the
next layer to this input.

• If the layer that you’re deleting is connected to multiple other layers, you’ll need to fix
up all of these connections!
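For the first of these cases, a minimal sketch: delete the model's very last layer and
reconnect the model output to the layer before it (assuming nn comes from get_nn(spec)):

# The last layer produces the model output, e.g. "classLabelProbs".
output_name = nn.layers[-1].output[0]

# Let the previous layer produce that output instead, then delete the last layer.
nn.layers[-2].output[0] = output_name
del nn.layers[-1]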
Example: Cleaning Up a Converted Model (DeepLab v3+)
After converting your trained model to Core ML format, it may be a little messy. Some of
the things you may want to clean up are:

• Filling in descriptions for the inputs and outputs.

• Filling in metadata such as author information and license terms.

• Changing the names of the inputs and outputs.

• Tidying up the shape of any MultiArray outputs.

• Inserting new layers.

• Changing the properties or behavior of existing layers.

• Removing layers that don’t really serve a purpose.

• Converting the weights to 16-bit floats or quantizing them.

• Converting the outputs from doubles to floats.

• Converting the output from a MultiArray to an image.

• For classifier models, replacing the class names.

In this chapter you’ll see how to do several of these things to clean up the DeepLab v3+
model for semantic segmentation.

Tip: I usually go through these steps in the IPython REPL or a Jupyter notebook to make
sure they work, and then copy them into a script that I can run again later to convert the
model again.

You can find the full script for this chapter as deeplab.py in the book’s GitHub repo81 .
81 https://github.com/hollance/coreml-survival-guide


Apple also makes a version of DeepLab available for download82 , but that just appears to
be a rough conversion from TensorFlow to Core ML. They didn’t do any of the clean-up
work from this chapter. See if you can spot the differences between their model and ours!

Note: This chapter assumes you’re using tfcoreml 0.3.0, TensorFlow 1.14.0, and coremltools
3.4. It has not been tested with coremltools 4! If you use a later version of tfcoreml, the
converted model may not match what is described in this chapter.

DeepLab v3+

As the web page says83, DeepLab is a state-of-the-art model for semantic image segmentation,
where the goal is to assign semantic labels (e.g. person, dog, cat and so on) to every pixel
in the input image.

Download the mobilenetv2_coco_voc_trainval model from this link84. This is a TensorFlow
model so first you'll need to convert it to an mlmodel using tfcoreml. Run the following
code:

import tfcoreml

input_path = "deeplabv3_mnv2_pascal_trainval/frozen_inference_graph.pb"
output_path = "DeepLab.mlmodel"

input_tensor = "ImageTensor:0"
input_name = "ImageTensor__0"
output_tensor = "ResizeBilinear_3:0"

model = tfcoreml.convert(tf_model_path=input_path,
                         mlmodel_path=output_path,
                         output_feature_names=[output_tensor],
                         input_name_shape_dict={input_tensor: [1, 513, 513, 3]},
                         image_input_names=input_name)

Errors?
You may get an error during the conversion process, "Error compiling model: Error reading
protobuf spec. validator error: The .mlmodel supplied is of version 3, intended for a newer
version of Xcode. This version of Xcode supports model version 2 or earlier."

This is not a major problem as the mlmodel file will be saved just fine. It just means that
your version of coremltools is unable to compile the spec into an MLModel object because it
cannot find Xcode 10. One way to fix this is to install the very latest version of coremltools
directly from GitHub: pip install -U git+https://github.com/apple/coremltools.git

82 https://developer.apple.com/machine-learning/models/
83 https://github.com/tensorflow/models/tree/master/research/deeplab
84 http://download.tensorflow.org/models/deeplabv3_mnv2_pascal_trainval_2018_01_29.tar.gz

After the tfcoreml conversion completes, you have a new DeepLab.mlmodel file. Open
this in Xcode 10 or better. One of the things you may notice is that the inputs and outputs
have really ugly names — this is typical for models converted from TensorFlow. It’s one of
the things this chapter will show how to fix.

By the way, you didn’t specify any preprocessing options for tfcoreml.convert(). As it
happens, this TensorFlow graph already does its own preprocessing.

If you look into the converted mlmodel using Netron, then near the top is a multiply layer
with a scaling factor of 0.007843 followed by an add layer with a value of -1. In other words,
this divides the pixels by 127.5 and subtracts 1.

There are also a few other add, resizeBilinear, padding, and loadConstant layers here. In the
original TensorFlow graph, these are used to add mean RGB padding around the image if
it’s not exactly 513×513 pixels, but in the converted mlmodel they don’t really serve much
of a purpose. Later in this chapter you’ll get rid of all these extraneous layers.

The mlmodel specification version is 3, because this graph contains several ResizeBilinear
layers and these are only supported by version 3 and better.

Filling in the descriptions and metadata

The call to tfcoreml.convert() returns an MLModel object. You can use this to directly fill in
the model’s metadata and input / output descriptions:

model.short_description = "DeepLab v3+ on MobileNet v2"
model.author = "https://github.com/tensorflow/models/tree/master/research/deeplab"
model.license = "Apache License"
model.input_description["ImageTensor__0"] = "Input image"
model.output_description["ResizeBilinear_3__0"] = "Segmentation map"
model.save(output_path)

However, you can also do this in the spec object for the mlmodel. Using the spec is nicer
because it also lets you rename those ugly inputs and outputs, as well as perform any other
required model surgery.

import coremltools as ct
spec = ct.utils.load_spec("DeepLab.mlmodel")

spec.description.metadata.versionString = "v1.0"
spec.description.metadata.shortDescription = "DeepLab v3+ on MobileNet v2"
spec.description.metadata.author = \
"https://github.com/tensorflow/models/tree/master/research/deeplab"
spec.description.metadata.license = "Apache License"

Note that this also lets you specify a versionString, which MLModel doesn’t.

Renaming the inputs and outputs

The original TensorFlow model had an input named "ImageTensor:0", which tfcoreml converted
to "ImageTensor__0" because the : cannot be used inside a Swift or Objective-C identifier.
The original output was named "ResizeBilinear_3:0" which becomes "ResizeBilinear_3__0" in
Swift. Yuck!

Let’s rename these to something nicer. You also fill in their descriptions here:

old_input_name = "ImageTensor__0"
new_input_name = "image"
spec.description.input[0].name = new_input_name
spec.description.input[0].shortDescription = "Input image"

old_output_name = "ResizeBilinear_3__0"
new_output_name = "scores"
spec.description.output[0].name = new_output_name
spec.description.output[0].shortDescription = "Segmentation map"

Just renaming the model’s inputs and outputs is not enough. If you were to save the spec
now and open it in Xcode, it will give an error message such as, “validator error: Layer ‘xxx’
consumes an input named ‘ImageTensor__0’ which is not present in this network.” That’s
because the first and last layers are connected to the model inputs and outputs by name,
and some of these connections still use the old names.

Use the get_nn() function to grab the proper neural network object from the spec:

from helpers import get_nn

nn = get_nn(spec)

Then loop through the layers of the neural network and replace the old connection names
with the new. Usually the first layer takes its input from the model's input and that still
says "ImageTensor__0". Likewise, the last layer's output is still connected to
"ResizeBilinear_3__0". Because you renamed these inputs and outputs in the model, you also
need to rename them inside the layers.

for i in range(len(nn.layers)):
    if len(nn.layers[i].input) > 0:
        if nn.layers[i].input[0] == old_input_name:
            nn.layers[i].input[0] = new_input_name
    if len(nn.layers[i].output) > 0:
        if nn.layers[i].output[0] == old_output_name:
            nn.layers[i].output[0] = new_output_name

Note: The reason you’re using a loop is that there’s no guarantee it’s actually the very first
layer that is connected to the input — and it turns out in DeepLab it isn’t, the first layer in
the mlmodel file is a loadConstant layer. Also, more than one layer could be connected to
the same input.

You’re still not done! The neural network’s preprocessing may also be attached to the input,
and so you’ll also have to rename that connection:

spec.neuralNetwork.preprocessing[0].featureName = new_input_name

If you save the spec to a new mlmodel file now and open it in Xcode, it looks a lot cleaner
already:

Removing unused layers

If you look at the last layers of the original DeepLab.mlmodel that was produced by
tfcoreml, you’ll notice that it has two resizeBilinear layers in a row:

The first one scales up the features to 65×65 pixels and the second one scales them up once
again, this time to 513×513 pixels. Why not directly scale them up to 513×513 and save a
step? This is another artefact of how the original TensorFlow graph was constructed. Let’s
get rid of that second-to-last resizeBilinear layer.

nn.layers[-1].input[0] = nn.layers[-2].input[0]
del nn.layers[-2]

The layer to remove is given by nn.layers[-2]. Just removing this layer is not enough, you
also have to connect the preceding batchnorm layer to the remaining resizeBilinear layer,
which is in nn.layers[-1]. That’s done by setting the input to the correct name.

At the beginning of the model there are also a number of layers that you don’t appear to
need:

There is a resizeBilinear layer that resizes the input image to 513×513 but Core ML already
requires that the input image has that size. Then there is an add layer that subtracts 127.5
from every channel in the input image, followed by a padding layer that appears to add no
pixels, followed by another add layer that adds 127.5 back again.

This seems pretty pointless. In the original TensorFlow graph, these operations are used to
pad the input image by the mean RGB color. If you first subtract the mean, then pad with
zeros and add the mean back again, the zero padding you just added now has the mean
RGB color.

But in the converted Core ML model, no padding happens and so these layers are just doing
work that we can avoid. Let’s get rid of them.

To figure out exactly which layers to delete, create a helper function first:

def find_layer_index(name):
    for i, layer in enumerate(nn.layers):
        if layer.name == name:
            return i
    return None

The first layer to keep is the multiply layer that does the pixel normalization. You need
to connect this layer to the model’s input. To get the name of that input, look at the
resizeBilinear layer:

resize_layer = nn.layers[find_layer_index("ResizeBilinear:0")]
multiply_layer = nn.layers[find_layer_index("mul_1:0")]
multiply_layer.input[0] = resize_layer.input[0]

Note: I know these layer names because I’m looking at the mlmodel with Netron. You
shouldn’t assume that the resize layer is actually the first layer in the model — in my model
the loadConstant layer comes first.

At this point, the multiply layer is now connected directly to the model input. The resize-
Bilinear layer and everything in between is still part of the model but are in an unused
branch.

You can actually save the model now and use it, but it’s best to remove these unused layers.

del nn.layers[find_layer_index("ResizeBilinear:0")]
del nn.layers[find_layer_index("negated_Reshape:0_sub_2:0")]
del nn.layers[find_layer_index("sub_2:0")]
del nn.layers[find_layer_index("Pad:0")]
del nn.layers[find_layer_index("Reshape/tensor:0")]
del nn.layers[find_layer_index("add_2:0")]

Note that the remaining multiply and add layer perform image preprocessing. You might
as well put that into nn.preprocessing:

nn.preprocessing[0].scaler.channelScale = 1/127.5
nn.preprocessing[0].scaler.redBias = -1.0
nn.preprocessing[0].scaler.greenBias = -1.0
nn.preprocessing[0].scaler.blueBias = -1.0

Now the channelScale is 0.007843, just like in the multiply layer. Finally, remove these two
layers just like you did before:

conv_layer = nn.layers[find_layer_index("MobilenetV2/Conv/Conv2D:0")]
conv_layer.input[0] = multiply_layer.input[0]

del nn.layers[find_layer_index("mul_1:0")]
del nn.layers[find_layer_index("sub_7:0")]

Nice! You managed to remove nine layers from the model and it still works the same.

Note: You don’t always have to remove all the unused layers from the model. Sometimes the
conversion process will leave “Identity” layers in the model that don’t really do anything.
The Core ML compiler is already smart enough to strip those out when it builds the
runtime inference graph. Likewise, it will combine activation and batchnorm layers with
the convolution layer into a single operation.

Inserting an argmax layer

The output of DeepLabv3+ is a tensor of size 21×513×513. This is an "image" of 513×513
pixels and for each pixel there is a 21-element vector with probabilities. There are 20
normal classes for different types of objects, plus one special "background" class that is
used if the pixel does not belong to any recognized object.

Exactly what you want to do with the results from this model depends on your app. Perhaps
you want to use it to replace the background with another image. In that case, you only
need to know what the winning class is for each pixel, but you don’t care so much about
the actual probabilities — only which probability is the highest.

You can write post-processing code that loops through the MultiArray and takes the argmax
over the probabilities for every pixel.

It turns out, Core ML has a built-in layer type for this. The original TensorFlow graph
already has an ArgMax operation too, but the tfcoreml converter that I used (version 0.3.0)
was not able to convert this to Core ML. Let’s add it ourselves then.

Add a new layer to the end of the model. This is a ReduceLayerParams layer that should
perform the ARGMAX operation:

new_layer = nn.layers.add()
new_layer.name = "argmax"

params = ct.proto.NeuralNetwork_pb2.ReduceLayerParams
new_layer.reduce.mode = params.ARGMAX
new_layer.reduce.axis = params.C

The axis along which the reduction operation is performed is C, the channel axis, which
means it reduces the (21, 513, 513) array to (1, 513, 513).

Now connect this new layer into the rest of the graph:

new_layer.output.append(nn.layers[-2].output[0])
nn.layers[-2].output[0] = nn.layers[-2].name + "_output"
new_layer.input.append(nn.layers[-2].output[0])

And just like that, you have an ArgMax layer between the bilinear resizing and the model
output. However, you’re not quite done yet…

Tidying up the outputs

The output description of the model still says:

output {
  name: "scores"
  shortDescription: "Segmentation map"
  type {
    multiArrayType {
      shape: 21
      shape: 513
      shape: 513
      dataType: DOUBLE
    }
  }
}

But now that you’ve applied an ARGMAX operation the data no longer has 21 channels but
just one. So the output description is wrong and that may cause problems. Let’s fix it by
removing that first dimension:

del spec.description.output[0].type.multiArrayType.shape[0]

You could also have written spec.description.output[0].type.multiArrayType.shape[0] = 1
but I don't like to have dimensions of size 1 in my output shapes unless they serve a real
purpose.

Now the output of the model is a MultiArray of size 513×513. Each number in this array is
a class index. A value of 0 means this pixel is background, a value between 1 and 20 means
it's some kind of object. (DeepLab is trained on the 20 classes from the Pascal VOC dataset.)

However, class indices are integers, not doubles. Yet the type of the output is still DOUBLE.
MultiArrays can hold integers, so change the type of the MultiArray to INT32:

spec.description.output[0].type.multiArrayType.dataType = \
ct.proto.FeatureTypes_pb2.ArrayFeatureType.INT32

And now you have a model that outputs an integer segmentation mask!

Odds and ends

By the way, instead of outputting a MultiArray you can also make the model produce a
grayscale image. That’s as simple as changing the data type of the output:

spec.description.output[0].type.imageType.colorSpace = \
ct.proto.FeatureTypes_pb2.ImageFeatureType.GRAYSCALE
spec.description.output[0].type.imageType.width = 513
spec.description.output[0].type.imageType.height = 513

However, since DeepLabv3+ is only trained on 21 classes, the output values in the grayscale
image will be very dark (between 0 and 20). You may want to add a multiply layer at the
end that multiplies the ArgMax results by a factor of about 12 so that it’s more in the 0 - 255
range. You should know how to do this by now. :–)

The DeepLabv3+ model has about 4 million learned parameters. To save some space,
convert them to 16-bit floating point:

spec = ct.utils.convert_neural_network_spec_weights_to_fp16(spec)

Finally, save the cleaned up model to a new mlmodel file:



ct.models.utils.save_spec(spec, "DeepLabClean.mlmodel")

That’s it for this chapter. I hope it was useful to see how you can still make plenty of
improvements to a model after converting it.
Replacing the Class Names of a Classifier
Core ML classifier models store the names of the classes inside the mlmodel file. When you
convert the model, you can pass in a class_labels option with a list of class names or a path
to a text file that has the class labels.
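With the coremltools 4 unified converter, the equivalent is a ClassifierConfig. A rough
sketch, where your_original_model and the labels are placeholders:

import coremltools as ct

class_labels = ["apple", "banana", "candy"]    # or the path to a labels text file
mlmodel = ct.convert(your_original_model,
                     classifier_config=ct.ClassifierConfig(class_labels))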

To change these class names afterwards, you don’t have to convert the model again — you
can simply edit the mlmodel file.

For a neural network classifier, the class names are stored in the field stringClassLabels or
int64ClassLabels from spec.neuralNetworkClassifier. The latter is used when you pass in a
list of integers as the class_labels during conversion.

To view the class names, grab the spec for the mlmodel and write:

print(spec.neuralNetworkClassifier.stringClassLabels)

This will print something like:

vector: "apple"
vector: "banana"
vector: "candy"
...

The data type of stringClassLabels is a StringVector, which is just a repeating field of string
values. In other words, it’s a list of strings.

To change these labels, for example to translate them to another language, you can write:

labels = spec.neuralNetworkClassifier.stringClassLabels
labels.vector[0] = "Appel"
labels.vector[1] = "Banaan"
labels.vector[2] = "Snoep"

Or if you already have the new labels in a list, do the following to replace the old list with
the new one:


labels = spec.neuralNetworkClassifier.stringClassLabels
del labels.vector[:]

new_labels = [ ... ]
labels.vector.extend(new_labels)

And save the spec again. Make sure that you have at least as many labels as there are classes,
otherwise bad things happen!

This same method is used for other model types too. For example, if your model is a logistic
regression, then look at spec.glmClassifier.stringClassLabels.

If the model is an object detector pipeline with non-maximum suppression, then the class
names are in spec.pipeline.models[-1].nonMaximumSuppression.stringClassLabels.

Tip: Instead of modifying the spec directly, you can also use NeuralNetworkBuilder to load
the spec and then call set_class_labels().
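A rough sketch of that approach (the file names are placeholders):

import coremltools as ct

spec = ct.utils.load_spec("YourClassifier.mlmodel")
builder = ct.models.neural_network.NeuralNetworkBuilder(spec=spec)
builder.set_class_labels(["Appel", "Banaan", "Snoep"])
ct.utils.save_spec(builder.spec, "YourClassifierNew.mlmodel")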
Part 5: Inside the App

Learn how to make the most out of the APIs from the Core ML and Vision frameworks.
Understanding the Xcode-generated File
When you add an mlmodel to your app project, Xcode automatically generates a source file
with a few classes that make it very easy to use the model from your code.

This chapter looks at what is inside this automatically generated source file. You don’t really
need to know everything that goes on here in order to use Core ML, but it’s useful if you
want to understand the API better.

Build settings

There are two relevant options in the Build Settings for your target:

• CoreML Model Class Generation Language. This lets you choose the programming
language of the class: Swift or Objective-C.

• CoreML Generated Model Inherits NSObject. Yes or no. Useful only if you use Swift, as
in Objective-C the model will always inherit from NSObject.

If you set the language to None, no class is generated for the mlmodel and you’ll have to do
everything through the MLModel API by hand.

Tip: When including a Core ML model inside a CocoaPod or other framework, the automatic
code generation may not work. Fortunately, you can still use the generated class: Build your
project manually (or add the .mlmodel file to a new, empty, project), then copy-paste the
generated code into a new Swift or Objective-C source file and include that file in your
CocoaPod. Finally, disable the automatic code generation by setting the language to None.

The generated source file

The automatically generated Swift or Objective-C source file for your mlmodel is not visible
in your project or Git repository. It’s a temporary file that ends up somewhere in the
Library/Developer/Xcode/DerivedData folder.


If your model file is named YourModel.mlmodel, the generated source file will have the
following three classes, in this order:

• YourModelInput

• YourModelOutput

• YourModel

The YourModel class represents the model as a whole. The other two classes represent the
model’s inputs and outputs. In your app you’ll be working mostly with YourModel and also
with YourModelOutput, but usually not with YourModelInput.

For a model that is updatable, i.e. if it can be trained on the device, there is also a
YourModelTrainingInput class that represents the inputs that are used during training.
(YourModelInput is only used for inference.)

Exactly what you’ll see in the generated source file totally depends on your model’s
inputs and outputs. Two different models will also get classes with different methods
and properties. So don’t panic if what you see for your own models differs from what is
described in this chapter.

Note: If while building your app you get the compiler error “unresolved identifier
YourModel”, it means the automatic code generation has failed. Check the Xcode build log
for more information.

The model class

YourModel is a wrapper around the MLModel API from CoreML.framework. It does two things:

1. Load the model into the MLModel object. This is the object that does all the actual work.

2. Provide convenient methods for making predictions.

In your app, using the generated YourModel class is as simple as:



// 1: load the model
let config = MLModelConfiguration()
let yourModel = try YourModel(configuration: config)

// 2: make predictions
if let result = try? yourModel.prediction(input: ...) {
    /* do something with the result */
}

All of the code in the generated source file exists to make using Core ML as easy as those
couple of lines.

The yourModel.model property lets you access the underlying MLModel object, which is needed
for using Core ML with the Vision framework, or if you want to inspect the attributes of the
model programmatically.

Loading the model

The wrapper class actually has a couple of different ways to load the model:

1. The YourModel(configuration:) initializer loads the compiled model from the folder
YourModel.mlmodelc in the app bundle. This is the most common way to load the
model.

2. With YourModel(contentsOf:configuration:) you can load the mlmodelc from a different
location, which is mostly useful if the compiled model is part of another bundle such as
an embedded framework, or when you've downloaded the model at runtime and compiled it on
the device.

3. New in iOS 14 and macOS 11.0 is YourModel.load(...), which loads the model
asynchronously and returns the result in a completion handler. This is mostly useful for
loading encrypted models, but you can also use it with your regular models.

The initializers will throw an error if loading the MLModel fails; the load method calls the
completion handler with a Result object that either contains the model or an Error.

On iOS 12 and up, you can provide an MLModelConfiguration when instantiating the model.
This lets you configure whether the model should only run on the CPU, or whether it can
also use the GPU and/or the Neural Engine.

Note: The YourModel() initializer, the one that doesn’t take any parameters, is deprecated
as of Xcode 12 because it does not let you handle loading errors gracefully.

Making predictions

All the other methods in YourModel are for making predictions and handling the results.
Exactly which methods you get and what they look like depends on the types and names of
the inputs and outputs from your mlmodel.

At the very least there is a method like the following, that takes a YourModelInput object and
returns a YourModelOutput object:

func prediction(input: YourModelInput, options: MLPredictionOptions)
                throws -> YourModelOutput {
    let outFeatures = try model.prediction(from: input, options: options)
    return YourModelOutput(features: outFeatures)
}

Both YourModelInput and YourModelOutput are classes that implement the MLFeatureProvider
protocol.

Core ML uses protocols to describe the model’s inputs and outputs so that MLModel can
handle a wide variety of model types. But that also makes it a little more cumbersome
to use MLModel directly, which is why Xcode has generated these wrappers for you.

Because working with YourModelInput still is a bit awkward, Xcode has also added a
prediction() method that uses more convenient datatypes. For example, if the model has
an input that expects an image, YourModel will have a method like this:

func prediction(image: CVPixelBuffer) throws -> YourModelOutput {
    let input_ = YourModelInput(image: image)
    return try self.prediction(input: input_)
}

Important: The name of the argument is taken directly from the name of the input in the
mlmodel. Here, the mlmodel has an input that is literally named "image", hence the method
is prediction(image:).

If your mlmodel’s input has a different name, the argument name will be different too. For
a model converted from TensorFlow, the method signature could look something like this:

func prediction(ImageTensor__0: CVPixelBuffer) throws -> YourModelOutput

Here, the input was named "ImageTensor:0" in the original TensorFlow graph, which is a
typical TensorFlow name. It got converted to ImageTensor__0 by tfcoreml so that it could be
used as an identifier in Swift.

So remember, just because your model expects an image as the input, that doesn't necessarily
mean you get a prediction(image:) method. The name of the method depends on the name of the
input in the mlmodel.

Tip: If you don’t want your model to use an ugly name such as ImageTensor__0, you can fix
this in the mlmodel file after converting. See the chapter Cleaning up a Converted Model.

If the model has more than one input, the prediction method also takes more than one
argument:

func prediction(image: CVPixelBuffer, threshold: Double)
                throws -> YourModelOutput

The output of these convenience methods is still always a YourModelOutput object.

In theory, the prediction(image:) method for a classifier could, instead of a YourModelOutput,
return a dictionary of [String: Double] values with the predicted probabilities for each
class. But the generated class isn't that clever (yet?).

As of iOS 12 and macOS 10.14, there is also a method for making batch predictions:

func predictions(inputs: [YourModelInput], options: MLPredictionOptions)
                 throws -> [YourModelOutput]

Instead of one input and output object, this uses arrays of such objects. If you look inside
this method, you’ll see that it makes only a single prediction using an MLArrayBatchProvider
for the input data, and then copies the results into an array of YourModelOutput objects.

Note: Several of the prediction methods also take an options argument. MLPredictionOptions
has one property, usesCPUOnly. If true, then Core ML will perform the prediction using
the CPU, not on the GPU or the Neural Engine. But there already is MLModelConfiguration,
so why two different ways to do the same thing? I’m not sure, but my guess is that
MLPredictionOptions is an older API that may get deprecated in favor of MLModelConfiguration.

The input class

Every time you want to make a prediction, YourModel creates a new instance of YourModelInput.
This object describes which input will receive which piece of data.

Other tools such as TensorFlow simply use a dictionary to map the data to model inputs,
but Core ML requires an object that implements the MLFeatureProvider protocol. In the
automatically generated file, this object is called YourModelInput.

Note: You don’t need to worry about YourModelInput if you use the convenience prediction()
method from YourModel. But it’s still interesting to see how this class is implemented.

Here is a typical YourModelInput for an mlmodel that has a single image input:

class YourModelInput : MLFeatureProvider {
    var image: CVPixelBuffer

    var featureNames: Set<String> {
        get {
            return ["image"]
        }
    }

    func featureValue(for featureName: String) -> MLFeatureValue? {
        if (featureName == "image") {
            return MLFeatureValue(pixelBuffer: image)
        }
        return nil
    }

    init(image: CVPixelBuffer) {
        self.image = image
    }
}

Note: With Xcode 12 and up, if there are image inputs this class will also have initializers
that accept a CGImage object or an image file URL. These initializers are available on iOS 13,
macOS 10.15, or better. Unfortunately, there is no option that tells YourModelInput how to
crop or scale the image. For that, you need to make your own MLFeatureValue object.
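If you do want control over cropping and scaling, a hedged sketch of making such an
MLFeatureValue yourself with the iOS 13+ initializer could look like this. The input name
"image" and the cgImage variable are placeholders for your own model and image, and the
imports are only needed in a standalone source file:

import CoreML
import Vision

let constraint = yourModel.model.modelDescription
                     .inputDescriptionsByName["image"]!.imageConstraint!
let featureValue = try MLFeatureValue(
    cgImage: cgImage,
    constraint: constraint,
    options: [.cropAndScale: VNImageCropAndScaleOption.centerCrop.rawValue])
// featureValue now wraps a pixel buffer that was resized and cropped for the model.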

The MLFeatureProvider protocol requires that the class provides implementations for:

• the featureNames property. This returns the names of the model’s inputs.

• the featureValue(for:) method. This wraps the data for the input with the given name
in an MLFeatureValue object, or returns nil if no such input exists.

If the model has more than one input, YourModelInput will look something like this:

class YourModelInput : MLFeatureProvider {
    var ImageTensor__0: CVPixelBuffer
    var threshold: Double

    var featureNames: Set<String> {
        get {
            return ["ImageTensor__0", "threshold"]
        }
    }

    func featureValue(for featureName: String) -> MLFeatureValue? {
        if (featureName == "ImageTensor__0") {
            return MLFeatureValue(pixelBuffer: ImageTensor__0)
        }
        if (featureName == "threshold") {
            return MLFeatureValue(double: threshold)
        }
        return nil
    }

    init(ImageTensor__0: CVPixelBuffer, threshold: Double) {
        self.ImageTensor__0 = ImageTensor__0
        self.threshold = threshold
    }
}

Note that the name of the image input for this model is "ImageTensor__0", and that the model
now takes a second input named "threshold" that is a Double value.

The only job of the MLFeatureProvider is to return the MLFeatureValue object with the data
that should go into a certain model input. So if Core ML asks, “Give me the data for
input ImageTensor__0”, then the feature provider should return an MLFeatureValue object
containing the CVPixelBuffer with the image.

And if Core ML says, “Give me the data for input threshold”, then the feature provider
should return an MLFeatureValue object containing the Double value you want to use for that
input.

MLFeatureValue is a wrapper object around all the types of data that Core ML can pass into
the model’s inputs and receive from the model’s outputs. If MLFeatureValue doesn’t support
a particular data type (see the enumeration MLFeatureType), then Core ML cannot use it.

Currently the supported data types are:

• integer and double for scalar numerical values

• MLMultiArray for multi-dimensional numerical values (tensors)

• CVPixelBuffer objects for images

• string

• dictionary of NSNumbers

• MLSequence for a sequence of numbers or strings

As you can see, the YourModelInput class is just a lot of boilerplate, which is another reason
it’s nice that Xcode automatically writes this code for you.

Note: If you want to use the MLModel API directly, you don't always have to write your own
class that implements MLFeatureProvider. Core ML includes an MLDictionaryFeatureProvider
object that lets you pass the input objects as a dictionary of input name: value pairs, just
like in TensorFlow.
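For example, here is a rough sketch of calling MLModel directly with such a dictionary. The
input names "image" and "threshold" are just placeholders for your own model's input names:

let inputFeatures: [String: Any] = [
    "image": MLFeatureValue(pixelBuffer: pixelBuffer),
    "threshold": 0.5
]
let provider = try MLDictionaryFeatureProvider(dictionary: inputFeatures)
let outFeatures = try model.prediction(from: provider)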

Training inputs

Models that can be trained on the device also have training inputs in addition to the regular
inputs. The regular inputs are used during inference, the training inputs are used during
training. An image classifier, for instance, would have two training inputs: 1) the training
image, and 2) the true label for that image.

The YourModelTrainingInput class works much in the same way as YourModelInput. It implements
the MLFeatureProvider protocol and it has properties for the training examples and
their labels. There is nothing particularly special about this class — it's just provided for
convenience.

Note: The YourModelTrainingInput class is only generated for models that are marked
as being updatable. Unlike YourModelInput, you can’t directly use it with the generated
YourModel class — there is no model.train(input: YourModelTrainingInput) method. Training
is done using a different API.
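That other API is MLUpdateTask. A minimal sketch, assuming an updatable model in the app
bundle and a YourModelTrainingInput array that you filled with labeled examples, looks
roughly like this:

let trainingSamples: [YourModelTrainingInput] = []   // fill with labeled examples
let trainingData = MLArrayBatchProvider(array: trainingSamples)
let modelURL = Bundle.main.url(forResource: "YourModel",
                               withExtension: "mlmodelc")!

let task = try MLUpdateTask(forModelAt: modelURL,
                            trainingData: trainingData,
                            configuration: nil,
                            completionHandler: { context in
    // context.model is the updated model; save it with context.model.write(to:)
    // if you want to keep using it after the app restarts.
    print("Training finished")
})
task.resume()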

The output class

Unlike YourModelInput, which you almost never have to use directly, the prediction(...)
methods will always return the inference result as a YourModelOutput object.

The method that performs the actual prediction, MLModel.prediction(from:options:), always
returns the results inside an MLFeatureProvider object, which allows Core ML to support
models that have more than one output, as well as predictions of different data types.

YourModelOutput is then an MLFeatureProvider that wraps around that other MLFeatureProvider
that holds the actual results:

class YourModelOutput : MLFeatureProvider {
    private let provider: MLFeatureProvider

    var featureNames: Set<String> {
        return self.provider.featureNames
    }

    func featureValue(for featureName: String) -> MLFeatureValue? {
        return self.provider.featureValue(for: featureName)
    }
    ...

For example, if YourModel is an image classifier, you could obtain the results this way in
your application code:

if let output = try? yourModel.prediction(image: ...) {
    if let outputValue = output.featureValue(for: "classLabelProbs"),
       let probabilities = outputValue.dictionaryValue as? [String: Double] {
        /* do something with the predicted probabilities */
    }
}

Here, output is the YourModelOutput object. You ask it for the MLFeatureValue object for the
output named "classLabelProbs", which in turn contains a dictionary of class names and
their predicted probabilities.

The above code certainly works but is cumbersome to write. That’s why Xcode already adds
convenient properties to YourModelOutput. For classifier models these are:

class YourModelOutput : MLFeatureProvider {
    /// Probability of each category as dictionary of strings to doubles
    lazy var classLabelProbs: [String : Double]

    /// Most likely image category as string value
    lazy var classLabel: String

    ...

The names of these properties are literally the names of the mlmodel’s outputs. The
comments are taken from the output descriptions in the mlmodel file.

Using these convenience properties is a lot easier than digging through the feature values
yourself with output.featureValue(for:). Compare this code snippet to the one from earlier:

if let output = try? yourModel.prediction(image: ...) {
    let probabilities = output.classLabelProbs
    /* do something with the predicted probabilities */
}

Much shorter code and a lot easier to read!

It should be clear by now why Xcode generates this source file with these wrapper classes.
The MLModel API is very generic and needs to support a lot of different model configurations
and data types. It’s a flexible API but using it is a bit of a chore. These wrapper classes create
a more streamlined interface that is tailored to your particular mlmodel.

Tip: Earlier I mentioned that for classifier models it would be useful if prediction(image:)
could already return a dictionary of class probabilities instead of a YourModelOutput. Well,
there is no reason you couldn’t write that code yourself in an extension:

extension YourModel {
func prediction(image: CVPixelBuffer) throws -> [String: Double] {
let input = YourModelInput(image: image)
let output = try self.prediction(input: input)
return output.classLabelProbs
}
}

// To use it:
if let prediction: [String: Double] = try? model.prediction(image: ...) {
/* do something with the predicted probabilities */
}

Note: Make sure to put this extension in a different source file, not the automatically
generated one! The contents of that file get overwritten the next time you build your app
and you’ll lose any changes.

This is probably more than you ever wanted to know about the contents of the automatically
generated file. But be thankful for its existence — it saves you from writing a lot of
code yourself!
mlmodelc
It might surprise you to learn that, after all the effort you've gone through to create your
mlmodel file, this file is not actually included in your app bundle when you ship it to
the App Store.

Instead, Xcode runs a compiler on the mlmodel file to create an mlmodelc. This is not just
one file but a folder containing several binary and JSON files — and that is what goes into
your app bundle.

If you open your .app file in Finder and choose Show Package Contents, it will show what
is inside your app bundle. This is what it looks like for the Inceptionv3.mlmodel85 :

The Inceptionv3.mlmodelc folder contains the following files:

• coremldata.bin: a binary file that appears to have the model’s metadata (name of
author etc) and the classification labels

• model.espresso.net: a JSON file that describes the structure of the model, which
layers it uses and how they are connected to each other

• model.espresso.shape: a JSON file with the output sizes of the layers in the neural
network

• model.espresso.weights: a binary file with the learned parameters of the model

• model/coremldata.bin: ?

• neural_network_optionals/coremldata.bin: ?

The weights file is usually the largest. For Inception v3 it is 96 MB.

85 https://docs-assets.developer.apple.com/coreml/models/Inceptionv3.mlmodel


Note: “Espresso” is the internal name Apple uses for the Core ML engine that runs neural
networks.

Models that can be trained on-device will also have an updateParameters folder with binary
files describing the layers that can be trained, the loss function, the hyperparameters,
and so on.

Why two different file formats for the same thing?

mlmodel is supposed to be an open, standard file format, so that other tools can load and
save mlmodel files. This allows people to use their mlmodel files on Android or Linux too,
for example.

On the other hand, the CoreML.framework is proprietary and only available for iOS, tvOS,
watchOS and macOS. Therefore, mlmodelc is a proprietary file format that optimizes the
mlmodel for use with CoreML.framework.

As an example of such optimizations, mlmodelc can “fuse” different layers together to
reduce inference time.

If the mlmodel specifies that a convolution layer is followed by a scale layer, then the Core
ML compiler can remove that scale layer and instead scale the weights of the convolution
layer accordingly. The result is the same but it saves a computation step. This is also how
batch normalization layers can be “folded” into the preceding layer.

Another example is activation functions. In the mlmodel file, these are specified as separate
layers of their own, but at runtime the activation function is usually combined into the
kernel that computes the convolution.

Why should you care about mlmodelc?

First off, you should be aware that it exists and that your mlmodel file is not part of the app
bundle.

Second, you can use it to check the model’s layers and their output shapes if you don’t mind
doing a bit of spelunking inside these JSON files. If your model doesn’t work as expected,
these files could give some clues as to why.

Third, you can use it to spy on other people’s models. ;-) And of course they can spy on
yours. It’s not exactly easy to reverse engineer the mlmodelc file format and turn it back
into a regular mlmodel, but not impossible either86 . Something to be aware of!
86 https://heartbeat.fritz.ai/reverse-engineering-core-ml-6d6f1c2bdab0
Running the Core ML Compiler Manually
To run the Core ML compiler outside of Xcode, do the following inside Terminal:

xcrun coremlc compile YourModel.mlmodel output

This creates the “output” folder and places YourModel.mlmodelc inside it.

You can also use coremlc to generate the source file. This is useful for when you want
to download the mlmodel file and compile it on the device, in which case Xcode doesn’t
generate the source file for you. The workaround is to generate it yourself and copy-paste
the files into your project.

xcrun coremlc generate Model.mlmodel output --language Swift --swift-version 5.1

Leave off the --language argument to generate Objective-C code, or use --language Objective-C.

Use the optional argument --no-documentation if you don’t want any comments in this
source file.

Note: If your app doesn’t come bundled with an mlmodel but downloads it from the
internet at runtime, you will have to tell Core ML to compile the model on the device. After
all, Core ML uses only the mlmodelc, not the mlmodel. See also the chapter Downloading
and Compiling Models on the Device.

If Xcode has a problem loading your mlmodel file, or if you get runtime errors when trying
to make predictions, try the following:

xcrun coremlc repair YourModel.mlmodel

This doesn’t appear to actually repair the model but at least it will give warnings about
the things that are confusing the Core ML compiler. You can do this on an mlmodel or an
mlmodelc.

Downloading and Compiling Models on the Device
Normally when you add a trained mlmodel file to your Xcode project, this gets shipped
inside your app bundle as an mlmodelc, the compiled version of the model. Xcode also
automatically generates a class that makes it easy to use the model.

Xcode makes things really convenient — but on the flipside, whenever you improve the
model you also need to make a new build and publish a new version of your app.

There are reasons why you wouldn’t want to bundle the mlmodel with the app, but
download it separately at runtime. For example:

• If you retrain the model often to incorporate the latest data, let’s say a few times per
week, you need the ability to update your Core ML model without publishing a new
version of the app every time. The app can now immediately download a new version
of the model when it becomes available.

• Perhaps you have different versions of your model for different locales and regions,
or different use cases. Or even for different device types: a smaller model for older
devices, a larger but better model for newer devices.

• Or maybe your model is fairly large and you want to keep the App Store file size down.
Users now only download the model if they actually use that particular feature from
your app, perhaps after paying for it using an In-App Purchase.

• If you want to encrypt your mlmodel file, you can’t add the mlmodelc to the app bundle
because it will be unprotected. See also the chapter Encrypting Models.

In all these scenarios, your app will download an mlmodel file from somewhere. I will not
explain how to do the downloading, as it’s just a networking request to grab a file from a
server.

What’s important is that you’re downloading the actual mlmodel file, not the compiled
mlmodelc. Before you can use this mlmodel with Core ML it needs to be compiled into
an mlmodelc first. This now needs to happen from within the app, on the user’s device.


Note: I guess you could also use Xcode or the command-line coremlc compiler to generate
the mlmodelc folder, zip it up, and download that zip file into the app. Now you don’t have
to compile the mlmodel at runtime, but you can directly instantiate the MLModel with the
path to where you downloaded the mlmodelc folder. This approach should work too but
personally I’d use the official method.

Compiling the model at runtime

The key method here is compileModel(at:) from MLModel. You use this to turn the mlmodel
into an mlmodelc.

compileModel(at:) takes a URL to where the downloaded mlmodel file is saved. This needs to
be a local file URL; you can't use a web address like http://yourserver.com/YourModel.mlmodel
here!

The code to compile a model looks something like this:

DispatchQueue.global().async {
let downloadedUrl = ...
do {
let compiledUrl = try MLModel.compileModel(at: downloadedUrl)
let model = try MLModel(contentsOf: compiledUrl)
/* use the model */
} catch {
print("Error:", error)
}
}

The compileModel(at:) method returns a temporary URL of where the mlmodelc folder is
stored. You pass this URL to the MLModel class to load the model. And then you’re good to go!

Because compiling can be time consuming, you should not do this on the main thread.
And you should only do it once, if possible. Every time you call compileModel(at:) it will
compile the model again, even if you already compiled it before. The compiled version
isn’t automatically cached.

That means it’s a good idea to copy the compiled model to a more permanent location.
The next time the user starts your app, you’d first attempt to load the mlmodelc from that
location. If the mlmodelc is not available (yet), you download and compile the mlmodel.

Here’s how you could implement that:



func downloadAndCompileModelIfNecessary(
        completionHandler: @escaping (MLModel?) -> Void) {
    let fileManager = FileManager.default
    let appSupportDirectory = try! fileManager.url(for: .applicationSupportDirectory,
                                                   in: .userDomainMask,
                                                   appropriateFor: nil,
                                                   create: true)

    // Do we already have a local copy of the compiled model?
    let modelFileName = "YourModel.mlmodelc"
    let permanentUrl = appSupportDirectory.appendingPathComponent(modelFileName)
    if fileManager.fileExists(atPath: permanentUrl.path),
       let model = try? MLModel(contentsOf: permanentUrl) {
        completionHandler(model)
        return
    }

    // Not found, so download the model...

This looks for YourModel.mlmodelc inside the Library/Application Support/ folder for your
app. If found, you load it into a new MLModel object and return it to the caller. This happens
inside a completion handler, because downloading and compiling the model — if necessary
— may take a while, and so this should be done asynchronously.

If YourModel.mlmodelc isn’t found, you should now download the model. I did not include
this code, but it typically happens in a background thread. When downloading is done, you
continue:

    downloadModel() { downloadedUrl in
        do {
            let compiledUrl = try MLModel.compileModel(at: downloadedUrl)

            // Copy the mlmodelc to a permanent location.
            if fileManager.fileExists(atPath: permanentUrl.path) {
                try fileManager.removeItem(at: permanentUrl)
            }
            try fileManager.copyItem(at: compiledUrl, to: permanentUrl)

            let model = try MLModel(contentsOf: permanentUrl)
            completionHandler(model)

            // Optional: delete the downloaded mlmodel file.
            //try fileManager.removeItem(at: downloadedUrl)
        } catch {
            print("Error:", error)
            completionHandler(nil)
        }
    }
}

This is similar to the code from before, except now you copy the compiled mlmodelc from
its temporary URL to the permanent URL inside Library/Application Support/, and call the
completion handler with the initialized MLModel object.

Tip: If you want to test this code without writing the downloading part yet, or for debug
builds, you can add the mlmodel file to the Xcode project but rename the file extension
from .mlmodel to something else such as .bin, so that Xcode does not make the mlmodelc
already. Then add the following method:

func downloadModel(completionHandler: @escaping (URL) -> Void) {
    let url = Bundle.main.url(forResource: "YourModel", withExtension: "bin")!
    DispatchQueue.global().async {
        completionHandler(url)
    }
}

This fakes the downloading process and simply loads YourModel.bin from the app bundle.
Naturally, you’d replace this with the real downloading code for the final version of the
app.

Note: Alternatively, if you don’t want to rename from .mlmodel to .bin, you can remove the
file from the target’s Compile Sources build phase and add it to the Copy Files build phase
instead.

Using the model

The above code returns an MLModel object, which requires you to use the MLModel API to
interact with your model. That’s certainly possible but not necessarily the most convenient
way. Xcode did not generate that handy helper class because your project does not include
an mlmodel file anymore.

There are two ways you can still use that generated class:

1. Add the mlmodel file to the Xcode project, go to the generated class, and copy-paste it
into an actual Swift file that you add to the project. Remove the mlmodel file from the
project again.

2. Generate the source file using the command-line Core ML compiler:



xcrun coremlc generate YourModel.mlmodel output --language Swift --swift-version 4.2

This creates a Swift source code file for your mlmodel and puts it into the output folder (you
have to make this folder first). Now you can add this Swift source file to the Xcode project.

Previously you may have instantiated the generated class like so:

let yourModel = try YourModel()

That will no longer work because it tries to load the mlmodelc from the app bundle. Instead,
use the other init method:

let yourModel = try YourModel(contentsOf: permanentUrl)

Tip: Once you’ve loaded the model, use MLModelDescription to verify the model’s inputs and
outputs are indeed what your code expects. You don’t want your app to crash because it
downloaded an mlmodel that is incorrect. See also the chapter Inspecting the Model at
Runtime.
Running the Model on the CPU
Core ML will automatically choose whether to run your model on the CPU, on the GPU, or
on the Neural Engine. It’s not really clear how it decides what hardware to run the model
on, but the CPU is usually the slowest choice.

Note: On the iPhone 6 and earlier models, Core ML appears to always use the CPU only. In
order to run Core ML models on the GPU, the device must have a GPU Family of 3 or higher,
available on devices with the A9 processor or better. The Neural Engine is only available
on devices with an A12 or better.

Even though the CPU is slower than the GPU or the Neural Engine, there are several good
reasons why you’d want to run your Core ML model on the CPU:

• To make sure the model was converted correctly. Core ML on the GPU uses 16-bit
floats for the computations, which have much less precision than the 32-bit floats used
by TensorFlow, Keras, Caffe, etc. These differences in precision can cause the model to
output slightly different predictions.
So if your converted model gives different results from the original, is that because of
the 16-bit floats — or is there something else wrong with the model? One way to check
is to run the model on the CPU, in which case Core ML will also use 32-bit floats. The
Core ML model on the CPU should give the same answers as the original model.

• To run the model while the app is in the background. The GPU and Neural Engine
cannot be used while the app is in the background.

• When the GPU is required for other heavy tasks, such as rendering complicated
scenes. In that case you want the model to run on the Neural Engine if available, but
on the CPU if not.

• To measure the speed difference with respect to the GPU and/or the Neural Engine.
Your model may already be fast enough on the CPU.

When the model has custom layers and you did not provide a Metal implementation for
these layers, Core ML may run parts of the model on the CPU anyway.

There are two ways to force your model to run on the CPU:


• For iOS 12 and up: When you create the model, provide an MLModelConfiguration. Set
the computeUnits property to .cpuOnly.

• For iOS 11 and up: When you call model.prediction(), pass in a MLPredictionOptions
object with usesCPUOnly set to true.

Note that not all versions of the prediction() method in the automatically generated source
file allow you to pass in an MLPredictionOptions object. Tip: create your own version of this
method in an extension.
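For instance, a hedged sketch of such an extension (YourModel, YourModelInput and
YourModelOutput are the Xcode-generated names; the body simply mirrors what the generated
prediction method does, with usesCPUOnly switched on):

extension YourModel {
    func predictionOnCPU(image: CVPixelBuffer) throws -> YourModelOutput {
        let options = MLPredictionOptions()
        options.usesCPUOnly = true
        let input = YourModelInput(image: image)
        let outFeatures = try model.prediction(from: input, options: options)
        return YourModelOutput(features: outFeatures)
    }
}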

To verify that the model really is running on the CPU, check out the chapter CPU, GPU, or
Neural Engine?

Note: When you run Core ML apps on the iOS Simulator, it also uses CPU execution. That’s
because the Metal Performance Shaders framework cannot run on the Simulator.
The Neural Engine
The first Neural Engine was introduced with the A11 Bionic processor in the iPhone X, but
Core ML does not make use of it on those devices (apparently it’s just for Face ID).

The A12 and A12X processors have a new, more powerful Neural Engine that can do 5
trillion operations per second. Apple claims that the A12’s Neural Engine makes Core ML
up to 9 times faster at 1/10th the power. I have indeed seen massive speed boosts on the
iPhone XS but only with certain types of models.

The A13 chip in the iPhone 11 and later models appears to have much the same Neural
Engine but it's a bit faster and more energy efficient.

Not a lot is known about the Neural Engine, or ANE as it’s also called, but here is what we
do know:

• Core ML automatically runs your models on the Neural Engine when it can. This frees
up the GPU to do other work.

• There is no public API for using the Neural Engine. You can only use it through Core
ML. (There are private frameworks but these are obviously not documented and you
can’t use them in App Store apps.)

• If you use custom layers anywhere in your model, even if only at the beginning or at
the very end, the model may not fully run on the Neural Engine. As of yet, there is
no method in MLCustomLayer that you can implement to make your custom layer work
with the Neural Engine.

• The Neural Engine seems to be optimized for large matrix multiplications. A model
such as TinyYOLO, which consists of a small number of heavy convolution layers, is
quite slow on the GPU but extremely fast on the Neural Engine. On the other hand, a
model such as MobileNetV2 + SSDLite, which has more layers but that do less work, is
very fast on the GPU but slower on the Neural Engine.

• Not all layer types are supported. For example, if your model has RNN layers such
as LSTM or GRU, it won’t run on the Neural Engine. The same is true for dilated
convolutions, and possibly other layer types as well.

It’s hard to exactly understand and explain some of these results since there are no profiling
tools for the Neural Engine, nor any documentation about how this chip works. (Also, Core
ML can be a bit temperamental, making it hard to get consistent measurements.)


In any case, it appears that the design tradeoffs for efficient models are different for the
Neural Engine than for the GPU, but figuring them out so far has been mostly trial-and-
error.

Tip: To prevent a model from running on the Neural Engine, you can set the computeUnits
field from MLModelConfiguration to .cpuOnly or .cpuAndGPU. To include the Neural Engine, use
computeUnits = .all.
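A minimal sketch, assuming YourModel is the Xcode-generated class (newer Xcode versions
generate an init(configuration:) for it):

let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU     // keeps the model off the Neural Engine
let model = try YourModel(configuration: config)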

I’ve started a community project on GitHub to document the Neural Engine87 , so check that
out for the latest insights!
87 https://github.com/hollance/neural-engine
CPU, GPU, or Neural Engine?
Because Core ML automatically chooses on what hardware to run your model, you may
want to check whether it’s actually using the CPU, the GPU, or the Neural Engine.

You can use MLModelConfiguration to tell Core ML what your preferences are, but there is no
API to ask it at runtime on which hardware it is currently running the model.

For debugging purposes, there are a few things you can do:

• Set a breakpoint on the line that calls model.prediction(...) or handler.perform([...])
if you're using Vision. Turn off optimizations and run the app. Press the Step Into
button from the debugger toolbar a number of times and look at the stacktrace for
“Espresso”. This is the C++ framework behind Core ML. When Core ML is running on
the CPU, it is using Espresso::BNNSEngine. The Espresso::MPSEngine is used for running
models on the GPU. For the Neural Engine, Core ML uses Espresso::ANERuntimeEngine.

• If you have a model that runs on live video or in a loop, just press the Pause button in
the debugger at an opportune time. Find the thread that’s being used by Core ML and
see which Espresso:???Engine shows up in the stacktrace.

• Press the Pause button in the debugger. If there's a thread H11ANEServicesThread
then Core ML is using the Neural Engine.

• Set a symbolic breakpoint on one of the symbols from BNNS (Basic Neural Network
Subroutines), the library used for neural networks on the CPU. One useful symbol is
apply_convolution_layer_filter. If this breakpoint gets hit, Core ML is using the CPU.

• Set a symbolic breakpoint on Espresso::MPSEngine::blob_container::get_mps_image. If
this breakpoint gets hit, Core ML is using the GPU.

• Set a symbolic breakpoint on -[_ANEModel program]. ANE stands for Apple Neural
Engine and _ANEModel is a symbol from the private AppleNeuralEngine.framework.

• While the app is running, go to the Debug navigator in Xcode and open the Energy
Impact report. This shows whether the GPU is being used or not. (If your app also
does a lot of drawing, you should disable the drawing code or you won’t be able to tell
if it’s really Core ML who is using the GPU.)


• Use GPU Frame Capture from the debugger toolbar. This will capture everything that
Metal is doing, including running Core ML models on the GPU (if any). Inside the
list of captured GPU commands there should be several “ComputeCommandEncoder”
entries with names such as “cnnConv”. Of course, you actually need to be doing a Core
ML prediction while the frame capture takes place.
Inspecting the Model at Runtime
Let’s say you have a model that expects a 416×416 image as input and produces a MultiArray
of shape 125×13×13 as output. You could hard-code those dimensions in your app, but if you
or your colleagues later train a new version of the model that changes the input or output
sizes, you’ll also have to remember to make those same changes in the app.

We all know we shouldn’t really put such magic numbers in our code, so a better solution is
to ask the model what sizes it expects. As a form of defensive programming, it’s also smart
to check that the model will actually give the kinds of outputs that you expect it to.

Every MLModel instance has a modelDescription property. This contains the same information
you see in Xcode's mlmodel viewer. The MLModelDescription class lets you access the
information stored in the spec.description from the mlmodel. That includes the descriptions
of the model's inputs and outputs, plus any metadata such as author info and custom
metadata.

Inspecting an image input

Here is how you can get the expected size of the input image:

let inputs = yourModel.model.modelDescription.inputDescriptionsByName

guard let input = inputs["image"],
      let constraint = input.imageConstraint else {
    fatalError("Expected 'image' input")
}
let inputHeight = constraint.pixelsHigh
let inputWidth = constraint.pixelsWide
let pixelFormat = constraint.pixelFormatType

This assumes yourModel is an instance of the automatically generated class and therefore
yourModel.model is the MLModel object.

The inputDescriptionsByName property returns a dictionary that maps the model’s input
names to MLFeatureDescription objects. These are the Swift equivalent of the FeatureDescription
message from Model.proto.

The MLFeatureDescription describes a model input or output, and contains:


• the input or output’s name

• a datatype given by MLFeatureType

• a “constraint” that describes the expected size or shape of the input or output

• an isOptional boolean that says whether this input may be nil or not

Note: Interestingly, you cannot access the description field from the mlmodel file, even
though Xcode displays this in the mlmodel summary.

In the above code, you first check that the dictionary returned by inputDescriptionsByName
contains an entry for the input named "image". Because the names of the inputs can vary
per model, you should use your own model’s input name here.

The variable input is now an instance of MLFeatureDescription. Because this will be an image
input — you can check this with input.type == .image if you want to — it should have an
imageConstraint object. That constraint describes the size of the image with the pixelsHigh
and pixelsWide properties.

There is also a pixelFormatType property that tells you the expected pixel format for the
CVPixelBuffer that should hold the image for this input. The data type for this field is OSType,
which is a four-character-code. When you do print(constraint.pixelFormatType), it may
output something like 1111970369. Translated to ASCII this reads 'BGRA'.

A pixelFormatType of 'BGRA' means you will need to create a CVPixelBuffer object with the
pixel format kCVPixelFormatType_32BGRA and copy the image into it. You can find these four-
character-codes and their corresponding constants in the file CVPixelBuffer.h from the
CoreVideo framework.
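A small helper like this (purely illustrative, not part of the generated code) turns such an
OSType back into its four-character string, which makes debugging output easier to read:

func fourCharCode(_ code: OSType) -> String {
    let bytes: [UInt8] = [UInt8((code >> 24) & 0xFF),
                          UInt8((code >> 16) & 0xFF),
                          UInt8((code >> 8) & 0xFF),
                          UInt8(code & 0xFF)]
    return String(bytes: bytes, encoding: .ascii) ?? "????"
}

print(fourCharCode(constraint.pixelFormatType))   // "BGRA"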

Of course, if you use Vision instead of pure Core ML, you don’t have to worry about any of
this. Vision will look at the MLModelDescription object for you and figure out how to make
the input image compatible with the model’s expectations.

Note: You get the 'BGRA' code even for a RGB input, not just for BGR inputs. If your model
expects an image with BGR pixel order, Core ML can read data from the CVPixelBuffer right
away. But if your model expects an RGB image instead, Core ML will automatically reorder
the pixels in the CVPixelBuffer before it sends the image through the model.

For models on iOS 12 or higher that use size flexibility, the MLImageConstraint for the input
contains a sizeConstraint property with an MLImageSizeConstraint object that describes the
image sizes that are allowed in addition to the default pixelsWide and pixelsHigh.

Inspecting a MultiArray output

This is how you can get the expected size of a MultiArray output named "grid":

let outputs = yourModel.model.modelDescription.outputDescriptionsByName

guard let output = outputs["grid"],
      let constraint = output.multiArrayConstraint else {
    fatalError("Expected 'grid' output")
}
assert(constraint.shape.count == 3)
let arrayChannels = constraint.shape[0].intValue
let arrayHeight = constraint.shape[1].intValue
let arrayWidth = constraint.shape[2].intValue
let dataType = constraint.dataType

It’s the same approach as before, except now you look at the outputDescriptionsByName and
grab the multiArrayConstraint. What constraint object you get depends on the type of the
input or output. These are the following possibilities:

• imageConstraint

• multiArrayConstraint

• dictionaryConstraint: for tabular data

• sequenceConstraint: for sequential data

Since this output is a MultiArray — if you want, you can also check that output.type ==
.multiArray — the constraint is an MLMultiArrayConstraint instance. The most important
property in this object is shape, which describes the dimensions of the array.

In the code snippet you assert that the shape has three dimensions, and then interpret the
first dimension as the number of channels, the second as the height, and the third as the
width.

The dataType describes whether the multi-array contains integers, single precision floats,
or double precision floats.

If your model uses size flexibility, then there is also an MLMultiArrayShapeConstraint object
that describes the possible size ranges for the different dimensions.

Note: This technique only works when the mlmodel already knows the shape of the output.
It’s possible that after converting your model to Core ML, the mlmodel file has an output
with a (partially) unknown shape. In that case, you can do some model surgery on the
mlmodel file to fill in the correct shapes. See the chapter Cleaning Up a Converted Model
for details.

When you make a prediction with the model, you can also query the MLMultiArray object
that holds the prediction results about its shape, which may be easier than looking at these
constraint objects.

Getting the output shape from the MLModelDescription is mostly useful if you need to allocate
or prepare objects ahead of time, before making the first prediction.

Unknown output shapes

Some models have an unknown output shape on purpose. For example, a model that uses
NonMaximumSuppression has an output with a shape of (0, 4) where the 4 stands for the
four coordinates of a bounding box. The 0 means that the number of bounding boxes isn’t
known until you actually make the prediction, as it can vary from one image to the next.

Note: In the mlmodel file, this first dimension is actually marked as having size -1, not 0.

In this case, it’s not very useful to look at the MLModelDescription to find out what the output
shape is. The output.multiArrayConstraint.shape will return [0, 4]. You simply won’t know
ahead of time how much data you’ll get from the model.

The only way to find out is to look at the MLMultiArray object that holds the prediction results.

Tip: As a defensive programming technique, when you load this kind of model you could
still assert that the second dimension of this output is indeed 4. The first dimension may
be flexible but the second dimension isn’t! It’s really easy for someone to (inadvertently)
change the model and break something, and you want to detect this before you start making
predictions. This is particularly important when you download models and compile them
on the device.

Metadata

Besides info about the inputs and outputs, MLModelDescription also has a metadata
dictionary. To read from this dictionary you need to use one of the special keys from
MLModelMetadataKey, for example:

let metadata = yourModel.model.modelDescription.metadata

let author = metadata[MLModelMetadataKey.author]

The possible keys are:

• description for the human-readable description of the model’s purpose

• author for the name of the model’s author

• license for licensing information

• versionString for version information. Note that this is not the same thing as the
mlmodel specification version, but a version number you can choose yourself.

• creatorDefinedKey for custom data you want to add to the model

The custom metadata is another dictionary of type [String: String]. Turi Create often uses
this to store additional information, such as the version of Turi Create that was used to train
the model, how long it was trained for, etc. You can use this dictionary to store whatever
you want, except for data that is used by the model itself (that goes into the model’s layers).
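Reading it back looks something like this (the "trained_by" key is purely hypothetical; use
whatever keys you stored yourself):

let metadata = yourModel.model.modelDescription.metadata
if let custom = metadata[MLModelMetadataKey.creatorDefinedKey] as? [String: String] {
    print(custom["trained_by"] ?? "key not present")
}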

MLModelDescription also has the following fields that should look familiar from the mlmodel
specification:

• predictedFeatureName: for regressor and classification models. This is the name of the
output with the regression results (a number) or classification results (the winning
class label). For other types of models this is usually nil.

• predictedProbabilitiesName: for classification models. This is the name of the output
that has the probabilities for all the classes. Core ML automatically turns this into a
dictionary that maps these probabilities to the respective class names.

If your model is a classifier, it can be useful to know what the class labels are without
actually having to make a prediction first. As of iOS 14 and macOS 11, you can get the list
of class labels using modelDescription.classLabels. If you also need to support earlier OS
versions, one workaround is to store the label names in the model’s metadata as well.
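A short sketch of reading these classifier-related fields (the printed values of course depend
on your model):

let desc = yourModel.model.modelDescription
print(desc.predictedFeatureName ?? "not a classifier or regressor")
print(desc.predictedProbabilitiesName ?? "not a classifier")
if #available(iOS 14, macOS 11, *) {
    print(desc.classLabels ?? [])   // [String] or [NSNumber]
}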

Training parameters

To check whether a model can be trained on the device, or to inspect the training inputs or
hyperparameters, you can also ask MLModelDescription:

model.modelDescription.isUpdatable
model.modelDescription.trainingInputDescriptionsByName
model.modelDescription.parameterDescriptionsByKey

The parameterDescriptionsByKey is a dictionary of MLParameterKey and MLParameterDescription
objects. The keys correspond to the names of the hyperparameters defined in the mlmodel
file, such as .learningRate or .numberOfNeighbors. The description object has the default
value for that hyperparameter, but also the minimum and maximum values it can be set
to.
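For example, a hedged sketch of inspecting the learning rate (this only works if the mlmodel
actually defines that hyperparameter):

if let lr = model.modelDescription.parameterDescriptionsByKey[.learningRate] {
    print("default:", lr.defaultValue)
    if let constraint = lr.numericConstraint {
        print("allowed range:", constraint.minValue, "...", constraint.maxValue)
    }
}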

And that’s it for MLModelDescription. You are not required to use it, but I suggest using it as a
defensive measure to verify that the model’s input and output definitions are indeed what
your code expects them to be. Use it to find bugs!
Making Sure the Input is Correct
Even if the trained model was converted 100% correctly to Core ML format, it is still possible
to get the wrong predictions. If you’re not certain about the results from Core ML, it’s smart
to double check that your inputs are indeed what the model expects.

For models that take images as inputs, some of the things that can go wrong are:

• The input image is incorrect, for example it is all black. This can happen when you’re
using Core ML without Vision, in which case you are responsible for resizing and
cropping the input image yourself…

• The image orientation is incorrect. You need to tell Vision about the EXIF orientation
from the image file, or about the current orientation of the device if you’re working
on live video.

• The imageCropAndScaleOption is not the same as what your model was trained with.

• The image preprocessing options from the mlmodel may be wrong. It’s the number
one mistake I see people make when they first try out Core ML.

Checking what the input image looks like is easy: just display the image in a UIImageView or
view it using the Xcode debugger.

However, it’s harder to verify what the image looks like that is actually fed into the model,
because Vision and the neural network’s preprocessing options will modify the image
before it goes into the model’s first layer.

Fortunately, there is a simple trick that you can use to spy on Vision and Core ML. Create a
new mlmodel that:

1. takes an image as input,

2. has one linear activation layer that effectively does nothing,

3. produces an image as output again.

This way the output image is your original input image plus any modifications that were
done by Vision and the mlmodel’s preprocessing options.

Here is how to build such a model from scratch:


import coremltools as ct
import coremltools.proto.FeatureTypes_pb2 as ft

spec = ct.proto.Model_pb2.Model()
spec.specificationVersion = 1

new_input = spec.description.input.add()
new_input.name = "image"
new_input.type.imageType.width = 256
new_input.type.imageType.height = 256
new_input.type.imageType.colorSpace = ft.ImageFeatureType.RGB

new_output = spec.description.output.add()
new_output.name = "generatedImage"
new_output.type.imageType.width = 256
new_output.type.imageType.height = 256
new_output.type.imageType.colorSpace = ft.ImageFeatureType.RGB

new_prepro = spec.neuralNetwork.preprocessing.add()
new_prepro.scaler.channelScale = 1.0
new_prepro.featureName = spec.description.input[0].name

new_layer = spec.neuralNetwork.layers.add()
new_layer.name = "test_layer"
new_layer.input.append(spec.description.input[0].name)
new_layer.output.append(spec.description.output[0].name)
new_layer.activation.linear.alpha = 1.0

ct.utils.save_spec(spec, "Image2Image.mlmodel")

As far as image-to-image models go, they don’t come much simpler than this! To try out this
model in your app, you can write the following:

let visionModel = try VNCoreMLModel(for: Image2Image().model)

let request = VNCoreMLRequest(model: visionModel, completionHandler: { request, error in
    if let results = request.results as? [VNPixelBufferObservation],
       let pixelBuffer = results.first?.pixelBuffer {
        let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
        let resultImage = UIImage(ciImage: ciImage)
        /* look at the resultImage */
    }
})

request.imageCropAndScaleOption = .scaleFill
// optionally: request.regionOfInterest = CGRect(...)

Vision will return the generated image as a VNPixelBufferObservation object that contains a
CVPixelBuffer. You can convert this to a UIImage and show it in an image view to compare
it with the original input image, for example. It’s a great way to study the impact of the
different imageCropAndScaleOptions and whether the orientation of the image is correct.

To see the effect of different preprocessing options, you can change the spec like so:

spec.neuralNetwork.preprocessing[0].scaler.channelScale = 2/255.0
spec.neuralNetwork.preprocessing[0].scaler.redBias = -1.0
spec.neuralNetwork.preprocessing[0].scaler.greenBias = -1.0
spec.neuralNetwork.preprocessing[0].scaler.blueBias = -1.0

But note that now the output of the model is in the range [-1, 1] while image outputs
should be between 0 and 255. So you may want to change the scaling factor applied by
the activation layer to compensate, otherwise you won’t be able to see anything:

spec.neuralNetwork.layers[0].activation.linear.alpha = 255.0

The GitHub repo for this book88 contains a demo app, CheckInputImage, that runs this
simple image-to-image model and shows the before and after image. It also lets you
experiment with different crop and resizing options. Give it a go!
88 https://github.com/hollance/coreml-survival-guide
Working With CVPixelBuffer
To make an image prediction with Core ML you need to supply the image as a CVPixelBuffer.

Vision also accepts CGImage and CIImage objects. So if you already have a CGImage — or a
UIImage from which you can easily get the CGImage — then it’s easiest to run your model
using Vision. As of iOS 13 you can also use this new API on MLFeatureValue.

But in many situations, such as when using video, you’ll get CVPixelBuffers anyway. For
image-to-image models, the predicted output is also a CVPixelBuffer object.

Since it’s the preferred image storage format for Core ML, it’s good to understand a bit about
how these CVPixelBuffer objects work.

These are the things you may need to do with pixel buffers:

• Put the image data into the CVPixelBuffer, if you currently have it in some other format.

• Resize the pixel buffer so that it has the dimensions expected by the Core ML model.

• Rotate the pixel buffer so that it has the proper orientation.

• Read from the CVPixelBuffer, if your model produces images as output.

Vision will automatically take care of the first three of these, so that’s usually the most
convenient option. If you’re using Core ML directly without Vision, you’ll need to do all of
these things yourself. In this chapter you’ll read how.

Pixel formats

CVPixelBuffer is a flexible container object that can handle different types of image data.
The data type and arrangement of the pixels is described by the pixel format.

A typical example is kCVPixelFormatType_32BGRA. As the name implies, this is a 32-bit pixel
format with the color channels in BGRA order. The pixel format is given as an OSType, a
so-called four-character-code. The code for kCVPixelFormatType_32BGRA is 'BGRA'.

• The BGRA format is often used with pixel buffers that come from the camera, because
BGRA is the native format of the camera’s sensors.


• CVPixelBuffers that are the output of a Core ML model are also in BGRA format.

• To draw a UIImage into a CVPixelBuffer, you should use kCVPixelFormatType_32ARGB. If
you'd use 32BGRA here, then the red and blue color channels will be swapped.

• If the image is grayscale, you should use kCVPixelFormatType_OneComponent8 or 'L008'.

• ARKit’s pixel buffers are in planar YCbCr or YUV format, code '420f', or the mouthful
kCVPixelFormatType_420YpCbCr8BiPlanarFullRange.

Note: You can find a list of all the possible pixel formats and their four-character-codes in
the file CVPixelBuffer.h from the Core Video framework.

Where to get CVPixelBuffer objects

To use an image with Core ML, you first have to convert it to a CVPixelBuffer. You also may
have to resize the image and possibly rotate it.

If you have a UIImage, CGImage, or CIImage, the easiest option is to use Vision since it requires
no additional effort on your part. In case you don’t want to use Vision:

• For a UIImage or CGImage you’ll have to create a CVPixelBuffer object by hand (see the
next section) and draw the image into it.

• For a CIImage, you can use the CIContext method render(to:) to draw the image into a
pixel buffer (a short sketch follows after the note below).

Note: CIImage has a pixelBuffer property but this is only filled in if you originally created
the CIImage from a CVPixelBuffer — you can’t use it to convert an arbitrary CIImage into a
CVPixelBuffer object.
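The CIImage route is the shortest one. A minimal sketch, assuming pixelBuffer was already
created with the size and pixel format the model expects:

let ciContext = CIContext()
ciContext.render(ciImage, to: pixelBuffer)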

Some APIs may already provide you with a CVPixelBuffer object. For example, ARKit’s
ARFrame has a capturedImage property that gives you the camera image as a CVPixelBuffer.

Warning: you need to make sure the pixel format of such a CVPixelBuffer is compatible with
Core ML. The only two pixel formats Core ML can currently handle are kCVPixelFormatType_32BGRA
and kCVPixelFormatType_32ARGB.

ARKit’s pixel buffers are in a planar YCbCr format. That means the image actually consists
of two separate memory regions, one for the Y channel and one for the CbCr channel. Vision
can automatically convert such pixel buffers to 32BGRA or 32ARGB, but if you’re not using
Vision you’ll have to do this conversion yourself!

Note: To convert a YCbCr image to a format that is compatible with Core ML, you can use the
vImageConvert_420Yp8_Cb8_Cr8ToARGB8888() function from the vImage framework. But really,
use Vision.

When using AVFoundation to read video frames from the camera, you will get CMSampleBuffer
objects. You can get a CVPixelBuffer from these, using:

let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)

Don’t forget to configure the AVCaptureVideoDataOutput object to produce pixel buffers in


32BGRA format otherwise you may get YCbCr buffers:

let settings: [String : Any] = [
    kCVPixelBufferPixelFormatTypeKey as String: NSNumber(value: kCVPixelFormatType_32BGRA)
]

let videoOutput = AVCaptureVideoDataOutput()
videoOutput.videoSettings = settings
captureSession.addOutput(videoOutput)

Likewise, if you’re using an AVAssetReader to read the frames from a video file, tell the
AVAssetReaderTrackOutput object to produce 32BGRA pixel buffers:

let settings: [String : Any] = [
    kCVPixelBufferPixelFormatTypeKey as String: NSNumber(value: kCVPixelFormatType_32BGRA)
]

let readerTrackOutput = AVAssetReaderTrackOutput(track: track, outputSettings: settings)
assetReader.add(readerTrackOutput)

A quick tour of the CVPixelBuffer API

The “CV” stands for Core Video. Unlike most other iOS and macOS frameworks, Core Video
is rather low level and consists of a few opaque types such as CVBuffer and many standalone
functions.

Note: The type CVPixelBuffer is actually a typealias for CVImageBuffer, which itself is a
typealias for the CVBuffer class. So any APIs that use CVImageBuffer or CVBuffer are also
available for CVPixelBuffer.

To create a new CVPixelBuffer object, you can write:

var pixelBuffer: CVPixelBuffer?

let status = CVPixelBufferCreate(nil, width, height,
                                 kCVPixelFormatType_32BGRA,
                                 nil, &pixelBuffer)
if status != kCVReturnSuccess {
    print("Error: could not create pixel buffer, \(status)")
}

This requires a width, height, and pixel format. Here you’re using the preferred format,
kCVPixelFormatType_32BGRA.

You can also pass in an optional dictionary of attributes, which is especially useful to create
CVPixelBuffer objects that must be compatible with Metal textures, OpenGL, or IOSurfaces.

let attributes = [
    kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
    kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue,
    kCVPixelBufferMetalCompatibilityKey: kCFBooleanTrue
]

let status = CVPixelBufferCreate(nil, width, height,
                                 kCVPixelFormatType_32BGRA,
                                 attributes as CFDictionary,
                                 &pixelBuffer)

There is also a CVPixelBufferCreateWithBytes() function that takes a pointer to a memory
location that it will use to store the pixel data. You'd use that function if you already have
a buffer with pixel data and you only want to create a CVPixelBuffer wrapper for it. Just
make sure the format of your data matches the pixel format of the CVPixelBuffer.

Note: There is also a CVPixelBufferPool object that is useful for when you need to create lots
of pixel buffer objects. The pool recycles the CVPixelBuffers when you’re done with them.
Certain APIs, such as AVAssetWriterInputPixelBufferAdaptor, provide such a pool already.

Once you have a CVPixelBuffer, you can get access to its memory as follows:

let flags = CVPixelBufferLockFlags(rawValue: 0)

if kCVReturnSuccess == CVPixelBufferLockBaseAddress(pixelBuffer, flags) {
    if let pixelData = CVPixelBufferGetBaseAddress(pixelBuffer) {
        /* do stuff with the pixel data here */
    }
    CVPixelBufferUnlockBaseAddress(pixelBuffer, flags)
}

First you need to lock the pixel buffer. If you plan on modifying the pixel data, flags should
be 0 or []. But if you’re only going to read from the pixel buffer, use flags = .readOnly, so
Core Video knows that it doesn’t need to invalidate any existing caches since you won’t be
changing the data inside the pixel buffer.

pixelData is an UnsafeMutableRawPointer. You can use this pointer:

• to directly read or write the pixel buffer’s memory

• as the data pointer for a CGContext, allowing you to use Core Graphics to draw directly
into the CVPixelBuffer object

• as the data pointer for a vImage_Buffer, so that you can use the vImage functions to
manipulate the pixels.

When you’re done, don’t forget to unlock the pixel buffer again.

There are a number of useful functions for querying the pixel buffer:

let width = CVPixelBufferGetWidth(pixelBuffer)
let height = CVPixelBufferGetHeight(pixelBuffer)
let pixelFormat = CVPixelBufferGetPixelFormatType(pixelBuffer)
let bytesPerRow = CVPixelBufferGetBytesPerRow(pixelBuffer)

Note: When another API asks you to provide the number of bytes per row, make sure to
use CVPixelBufferGetBytesPerRow() instead of calculating this number by hand, as it is not
always width * 4.

For CVPixelBuffers that store multiple planes, such as ARKit's YCbCr pixel buffers, you can
use the following:

let numberOfPlanes = CVPixelBufferGetPlaneCount(pixelBuffer)
let width = CVPixelBufferGetWidthOfPlane(pixelBuffer, planeIndex)
let height = CVPixelBufferGetHeightOfPlane(pixelBuffer, planeIndex)

If CVPixelBufferGetPlaneCount() is more than zero, you'll have to use these functions to
query the individual planes. To access the pixel data for a specific plane, you now need
to use CVPixelBufferGetBaseAddressOfPlane().

CVPixelBuffers and orientation

A CVPixelBuffer does not know anything about the image orientation and, depending on
how the camera was held while the image was taken, may therefore not be upright.

The intended display orientation for an image is described by the CGImagePropertyOrientation
enum. If this is .up, it means the pixel data inside the image is already properly oriented.
For any other values, the image needs to be rotated before use.

Note: Portrait images taken on the iPhone usually have the orientation .right because
the camera’s sensor is physically in landscape orientation, and so to display the image in
portrait it first needs to be rotated by 90 degrees clockwise (to the right).

To get the orientation for an image:

• If the input is a UIImage, you can use the image’s imageOrientation property.

• CGImage and CIImage do not keep track of their orientation, so you’ll have to do this
manually.

• If you’re getting a CVPixelBuffer directly from the camera using AVCaptureSession, you
can use the device’s current orientation from UIDevice.current.orientation.

Unfortunately, UIImage and UIDevice have different enums for describing the orientation.
The Vision API uses yet another one, CGImagePropertyOrientation. In the CoreMLHelpers
repo89 you can find an extension that easily lets you convert between these different
enums.
89 https://github.com/hollance/CoreMLHelpers

Resizing and rotating your images

As I’ve mentioned a few times already, when using Vision you don’t need to resize the
images yourself. You also don’t need to rotate the image, as long as you provide the
correct CGImagePropertyOrientation when creating the VNImageRequestHandler object. That’s
why Vision is the recommended way to use Core ML models that have image inputs.
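
As a minimal sketch of that approach (it assumes you already have a visionModel, i.e. a
VNCoreMLModel, a pixelBuffer from the camera, and that .right is the correct orientation for
how the image was captured):

import Vision

// Vision resizes, crops, and rotates for you when you pass the orientation here.
let request = VNCoreMLRequest(model: visionModel) { request, error in
    /* handle the results here */
}
request.imageCropAndScaleOption = .centerCrop

let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                    orientation: .right,
                                    options: [:])
try? handler.perform([request])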

When using Core ML directly, you need to provide a CVPixelBuffer object that has the correct
dimensions and is in the proper orientation:

• If you have a UIImage, you can create a new CVPixelBuffer object with the width and
height that the model expects, use the pixel buffer’s memory as the backing store for
a CGContext, and then do image.draw(in: context). This respects the image orientation
and draws the image properly rotated — and resized — into the pixel buffer.

• If you have a CGImage, convert it to a UIImage and use the above method. You do need to
supply the proper orientation when creating the UIImage as CGImage itself doesn’t keep
track of this.

• A CIImage can be rotated using its oriented(forExifOrientation) method. To resize a
CIImage you can apply a CIFilter of type "CILanczosScaleTransform", or use transformed(by)
with a scaling matrix. And then use a CIContext to render the image into a CVPixelBuffer.

• A CVPixelBuffer can be rotated 90, 180, or 270 degrees using vImageRotate90_ARGB8888()
from the Accelerate framework. You can also use vImageScale_ARGB8888() to resize the
pixel buffer. If the pixel format of the CVPixelBuffer is not 32ARGB or 32BGRA then you
also need to use the vImageConvert functions.

• You can also directly use CGImage — and therefore UIImage — with vImage, using
vImageBuffer_InitWithCGImage() and then vImageScale_ARGB8888() and
vImageRotate90_ARGB8888().

• If you’re grabbing pixel buffers directly from the camera, you can set the videoOrientation
property of AVCaptureConnection so that it already rotates the pixel buffers if necessary.
You do still need to resize them afterwards.

As you can imagine, the code for doing this kind of image manipulation can get pretty nasty.
I will refer you to the CoreMLHelpers repo again, which has several handy functions for
putting your images into rotated and resized CVPixelBuffers.
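
To make the first bullet above more concrete, here is a minimal sketch of the UIImage approach.
It is not the CoreMLHelpers implementation, just an illustration, and it assumes a 32BGRA pixel
buffer is acceptable to your model:

import UIKit
import CoreVideo

// A sketch: draw a UIImage, resized and correctly rotated, into a new
// 32BGRA CVPixelBuffer with the dimensions the model expects.
func pixelBuffer(from image: UIImage, width: Int, height: Int) -> CVPixelBuffer? {
    let attrs = [kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
                 kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue]
    var maybeBuffer: CVPixelBuffer?
    guard CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                              kCVPixelFormatType_32BGRA, attrs as CFDictionary,
                              &maybeBuffer) == kCVReturnSuccess,
          let buffer = maybeBuffer else { return nil }

    CVPixelBufferLockBaseAddress(buffer, [])
    defer { CVPixelBufferUnlockBaseAddress(buffer, []) }

    guard let context = CGContext(data: CVPixelBufferGetBaseAddress(buffer),
                                  width: width, height: height,
                                  bitsPerComponent: 8,
                                  bytesPerRow: CVPixelBufferGetBytesPerRow(buffer),
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue
                                            | CGBitmapInfo.byteOrder32Little.rawValue)
    else { return nil }

    // UIKit's coordinate system is flipped compared to Core Graphics.
    context.translateBy(x: 0, y: CGFloat(height))
    context.scaleBy(x: 1, y: -1)
    UIGraphicsPushContext(context)
    image.draw(in: CGRect(x: 0, y: 0, width: width, height: height))
    UIGraphicsPopContext()
    return buffer
}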

Turning a CVPixelBuffer back into an image

If your model outputs a CVPixelBuffer, or if you did a bunch of image processing on your
own pixel buffers and you want to make sure it looks all right, it can be useful to display the
pixel buffer in a UIImageView. To do that, you first need to convert it back into a UIImage.

The easiest way to turn a CVPixelBuffer into a UIImage is to take a small detour through Core
Image:

let image = UIImage(ciImage: CIImage(cvPixelBuffer: pixelBuffer))

CoreMLHelpers has additional methods for converting CVPixelBuffers back into images.

Tip: You can also view the contents of a CVPixelBuffer inside the Xcode debugger. Set a
breakpoint, select the variable with the CVPixelBuffer instance and click the QuickLook
button. Also click the little info button to see even more details about the pixel buffer, such
as its attributes dictionary.
Using CGImage or Image Files Instead of CVPixelBuffer
If your model takes images as input, the Xcode-generated class for the model has a
method prediction(input: CVPixelBuffer). This means you need to provide the images as
CVPixelBuffer objects. That’s handy if your images are really video frames but not if you
have a UIImage or CGImage.

You can convert the CGImage to a CVPixelBuffer yourself. Just converting is not enough: you
also need to resize and possibly crop the image. For this you can use CoreMLHelpers
(https://github.com/hollance/CoreMLHelpers).

A little easier is using the Vision framework, which can work directly on CGImage objects.
And it can resize and crop the images as well.

As of iOS 13 and macOS 10.15, there is a new API that lets you create an MLFeatureValue
object directly from CGImage or an image file such as PNG or JPEG. This new API was added
to support on-device training, but you can use it for inference too.

MLFeatureValue is a wrapper around all the data types that Core ML supports. When you
call prediction(input: CVPixelBuffer) it puts the CVPixelBuffer into an MLFeatureValue object,
which in turn goes into an MLFeatureProvider object. This abstraction is necessary because it
allows Core ML to work with other types of inputs too, such as strings or arrays of numbers.

You can use the new API to create an MLFeatureValue object from the CGImage or image file.
This makes loading the image simpler as you no longer need to deal with CVPixelBuffer and
resizing / cropping.

Note: As of Xcode 12, the auto-generated class also makes it possible to use CGImage objects
and image file URLs. However, that gives you no control over cropping and resizing. For this
reason, I prefer the MLFeatureValue API described in this chapter.

Loading images into MLFeatureValue

Here is the code you’d use to load a CGImage into an MLFeatureValue object:

let image: CGImage = /* your image */

let imageConstraint = model.modelDescription
                           .inputDescriptionsByName["image"]!
                           .imageConstraint!

let imageOptions: [MLFeatureValue.ImageOption: Any] = [
    .cropAndScale: VNImageCropAndScaleOption.scaleFill.rawValue
]

let featureValue = try? MLFeatureValue(cgImage: image,
                                       orientation: .up,
                                       constraint: imageConstraint,
                                       options: imageOptions)

Note: To load a PNG or JPEG file, you’d use the MLFeatureValue(imageAt: URL, ...) initializer
instead.

This might be more code than you expected. Here’s what’s going on:

1. The image constraint

The image could be any size but MLFeatureValue will resize it to the dimensions expected by
the model. For this, you need to tell MLFeatureValue what the correct size is, and also what
the pixel format should be (RGB or BGR).

You can get that information from the MLModel object through its MLModelDescription. (Note
that the model variable in this code snippet is not the generated class, but the MLModel it
contains.)

Here, we grab the MLImageConstraint for the input named "image". This MLImageConstraint
object describes the size and format of the image that is expected by this particular model
input. (Naturally, if your model’s input has a different name, you need to change that here.)

There is also a version of the MLFeatureValue initializer that lets you pass in the desired
width, height, and pixel format separately. That is useful for when the model uses size
flexibility and you want to decide for yourself exactly how large the input will be.
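
As a hedged sketch of that variant (the 512×512 size and the BGRA pixel format are made-up
values for illustration, not something your model necessarily expects):

let flexibleValue = try? MLFeatureValue(
    cgImage: image,
    pixelsWide: 512,
    pixelsHigh: 512,
    pixelFormatType: kCVPixelFormatType_32BGRA,
    options: [.cropAndScale: VNImageCropAndScaleOption.scaleFill.rawValue])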

2. Image resizing options

Next, we need to tell Core ML how to scale and/or crop the input image. This is done through
a dictionary of MLFeatureValue.ImageOption values.

With .cropAndScale you can pass in a VNImageCropAndScaleOption from the Vision framework.
So this new API can behave in exactly the same way as Vision’s VNCoreMLRequest.

Note: Put .rawValue behind the VNImageCropAndScaleOption value or it won't work.

You can also pass in the .cropRect option with a rectangle for doing a custom crop. This is
a CGRect object using normalized coordinates (i.e. between 0 and 1). That’s useful for doing
random cropping for data augmentation during training, but also when you don’t like any
of the options offered by VNImageCropAndScaleOption.

Some things you might want to know about cropping:

• (0,0) is the bottom-left corner (not top-left)!

• A cropRect of (0,0) - (1,1) is the same as using .scaleFill.

• You can crop outside the image, in which case the rest will be black.

• You can't make the width or height larger than 1; that just gives a gray image (it doesn't
work).

• Flipping the image by using a negative width or height doesn't work either (you get an
error message).

You can also combine these two options. It first crops according to the rectangle, and then
scales that according to the VNImageCropAndScaleOption. (So with center cropping, it might
crop your crop again if it’s not square.)
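
For example, here is a hedged sketch of a custom crop that takes the bottom-left quarter of the
image (remember that (0,0) is the bottom-left corner) and then scale-fills it to the model's
input size. The CGRect is wrapped in an NSValue because the options dictionary stores objects:

let customCropOptions: [MLFeatureValue.ImageOption: Any] = [
    .cropRect: NSValue(cgRect: CGRect(x: 0.0, y: 0.0, width: 0.5, height: 0.5)),
    .cropAndScale: VNImageCropAndScaleOption.scaleFill.rawValue
]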

3. Creating the MLFeatureValue

Finally, we create the MLFeatureValue object. Since CGImage objects do not store orientation
info, you can pass in a CGImagePropertyOrientation value here if you know the image was
rotated. For image files, it uses the EXIF data.

After the call to try? MLFeatureValue(...) succeeds, the feature value object now contains
a valid CVPixelBuffer of the size expected by the model and in the correct pixel format.

Making predictions with the MLFeatureValue object

Great, you now have an MLFeatureValue object. What next?

One way to use it is to obtain the CVPixelBuffer by writing:

if let pixelBuffer = featureValue.imageBufferValue {
    ...
}

To make the prediction, you can now pass this CVPixelBuffer to model.prediction(...).

Internally, your model’s generated class now creates a new MLFeatureValue object and
puts the CVPixelBuffer inside that, so this is a little wasteful since we already have an
MLFeatureValue.

Slightly more efficient is to do the prediction as follows:

let inputs: [String: Any] = [ "image": featureValue ]

if let provider = try? MLDictionaryFeatureProvider(dictionary: inputs),


let result = try? model.prediction(from: provider,
options: predictionOptions),
...
}

Note that model here is an MLModel object, not your generated class. Read more about this
method in the chapter Using the MLModel API.
Working With MLMultiArray
Machine learning often works with large multi-dimensional arrays, also known as tensors.
In Core ML these are known as multi-arrays and you use them through the MLMultiArray
class.

If your model has a multi-array input or output, then you need to use MLMultiArray objects.
To make effective use of MLMultiArray, it’s helpful to understand how it stores the values in
memory.

This chapter shows everything you need to know to read and write MLMultiArrays like a
pro. Of course, it’s still up to you to actually interpret what the data in this multi-array
represents, as that is different for every model.

Note: Many computer languages support arrays of multiple dimensions. In Objective-C, for
example, you can write float myArray[200][50][75]; to get a 200×50×75 array. Swift only
has one-dimensional arrays. You can make an array that contains other arrays but that’s
not quite the same thing. That’s why Core ML has this MLMultiArray class.

How to get an MLMultiArray

Usually you’ll get one of these if your mlmodel is a NeuralNetwork that is not a classifier.
Core ML then wraps the output of the model into a multi-array.

If you use Vision, you can get the MLMultiArray object like this:

if let results = request.results as? [VNCoreMLFeatureValueObservation],
   let featureValue = results.first?.featureValue,
   let multiArray = featureValue.multiArrayValue {
    /* do something with multiArray */
}

There will be one VNCoreMLFeatureValueObservation object for every multi-array output in
the model.


If you use Core ML without Vision, the result from model.prediction(...) is a YourModelOutput
object. This will have a property named after the output that is an MLMultiArray object. For
example, TinyYOLO.mlmodel (https://github.com/hollance/YOLO-CoreML-MPSNNGraph) has a
multi-array output named "grid":

let tinyYOLO = TinyYOLO()

if let output = try? tinyYOLO.prediction(image: ...) {
    let multiArray = output.grid
    /* do something with multiArray */
}

If you’re using the MLModel API, then the output of model.prediction(from: ...) is an
MLFeatureProvider object. You can ask it for the MLMultiArray like so:

let model = MLModel(...)

if let output = try? model.prediction(from: ...),
   let multiArray = output.featureValue(for: "grid")?.multiArrayValue {
    /* do something with multiArray */
}

If you’re supposed to supply a multi-array as input to your model, you’ll need to create the
MLMultiArray object by hand. That goes something like this:

let multiArray = try? MLMultiArray(shape: ..., dataType: .double)

And then you need to put your data into the multi-array, which is explained later in this
chapter.

The shape

A multi-array has a shape, which describes how many dimensions there are and how large
each dimension is.

For example, a shape of (200, 50, 75) means that the multi-array has three dimensions
and is 200-by-50-by-75. However, it doesn’t say anything yet about what these dimensions
represent.

In Core ML, all tensors are represented as 5-dimensional multi-arrays with the shape:

(sequence length, batch size, channels, height, width)

If a dimension isn’t used, it is set to 1. So if the (200, 50, 75) array is really (1, 1, 200, 50, 75),
it is interpreted by Core ML as having 200 channels, a height of 50, and a width of 75.

Note: The height comes before the width!

Other machine learning tools, such as TensorFlow, may use a different order. Often they
will use tensors of shape (batch size, height, width, channels), also known as NHWC. The
Core ML order would be NCHW. The letter N stands for the batch size, or number of
examples in the array. (There is no letter for the sequence length dimension.)

Knowing the shape of the MLMultiArray object is essential to using it correctly. You can
examine the shape in Swift as follows:

let multiArray: MLMultiArray = ...

let numDim = multiArray.shape.count
print(multiArray.shape)

shape is an array with one element for each dimension in the data. If the output shape in
the mlmodel is unknown, numDim will always be 5 and multiArray.shape will be something
like this:

[1, 1, 200, 50, 75]

However, if the Core ML converter hasn’t filled in the output shape properly, it’s better to
fix this in the mlmodel. For this mlmodel file you can set the output shape specifically to
(200, 50, 75) in the spec, like so:

output {
name: "theOutput"
type {
multiArrayType {
shape: 200
shape: 50
shape: 75
dataType: DOUBLE
}
}
}

In that case, numDim is 3 and multiArray.shape is:



[200, 50, 75]

The unused dimensions of size 1 at the front have disappeared. It is important to realize
that different models may produce multi-arrays with different numbers of dimensions. It
all depends on how this is set up in the mlmodel spec.

Tip: When you convert a model to Core ML, pay close attention to the order in which it
puts the multi-array’s dimensions. This may be different than what is used in the original
model. The channels may now be at the front instead of in the back.

Another important detail: because Core ML needs to be compatible with Objective-C, the
shape array is really an array of NSNumber objects. To get the actual size of a given dimension
in Swift, you need to write:

let channels = multiArray.shape[0].intValue
let height = multiArray.shape[1].intValue
let width = multiArray.shape[2].intValue

Note: This particular code snippet assumes that the shape has three dimensions. If it has
more or fewer, you’ll need to adjust the indices so they point to the correct dimension. You
can’t assume this code is correct for all possible models.

By the way, the MLMultiArray also has a count property that tells you how many elements
are in the array. The count is always equal to all the shapes multiplied by one another, i.e.
multiArray.shape.map { $0.intValue }.reduce(1, *) is the same as multiArray.count.

The data type

All the numbers inside a given Core ML multi-array will always have the same datatype.
You can choose from the following possibilities:

• double

• float32

• int32

By default, the Core ML converters will set the data type to DOUBLE in the mlmodel spec, but
you can change this afterwards if you’d rather have 32-bit floats or integers. See also the
chapter Outputting Floats Instead of Doubles.

You can ask the MLMultiArray what type of values it holds by querying its dataType property,
which is described by the MLMultiArrayDataType enum.

Tip: I usually define a typealias:

typealias OutputType = Double

This way I can read / write the MLMultiArray using OutputType. If at some point I decide to
change the datatype of the mlmodel’s output to floats, I can just change this typealias to
OutputType = Float.

Subscripting the array

MLMultiArray has two subscript functions that let you read or write values in the array:

• subscript(Int) -> NSNumber

• subscript([NSNumber]) -> NSNumber

The first one is useful for multi-arrays that have only a single dimension, the second for
arrays that have more than one dimension. However, I find using these functions to be a
pain in the butt.

Because of Objective-C compatibility, they return NSNumber instances rather than the Double,
Float, or Int32 value that is actually in the array. Additionally, you need to specify the indices
of the value you want to read as an array of NSNumbers too. Like this:

let channel: Int = ...
let y: Int = ...
let x: Int = ...
let value = multiArray[[channel, y, x] as [NSNumber]].doubleValue

Here, channel is the index in the first dimension, y is the index in the height dimension, and
x is the index in the width dimension. All this casting to and from NSNumber objects makes
the code harder to read and also is rather slow.

Indexing the fast way

The solution is to work with pointers and directly access the memory inside the MLMultiArray.
This is less scary than it sounds because MLMultiArray has some useful properties that help
with this.

When using pointers, you are responsible for calculating where to read or write in the
computer’s memory. If you make a mistake, you may end up reading from the wrong
memory location or — perhaps worse — writing into memory you don’t own and crashing
your app. So the trick to make this work is to calculate the correct memory address.

Important for this is the stride. This is the distance in memory between the current “thing”
and the next thing in a particular dimension.

For example, the channel stride is the number of elements in memory between channel 0
and channel 1. The row stride is the number of elements in memory between row 0 and
row 1. And the column stride is the number of elements between a value in column 0 and
column 1.

Fortunately, you don’t have to compute these strides manually — MLMultiArray already
provides them for you. For the array with shape (200, 50, 75) the strides are:

[3750, 75, 1]

Recall that the order of the dimensions is (channels, height, width). This means the column
stride, or the stride for the width dimension, is 1 because two values in the same row are
right next to each other in memory.

The row stride, or the stride for the height dimension, equals the width: the value directly
below a given value in row 0 sits 75 elements further along in memory. And the channel stride
is therefore 50 × 75 = 3750 because one channel is made up of height × width values.

But you may immediately forget this again. All you need to do is use the MLMultiArray’s
stride property in combination with the pointer to the multi-array’s memory.

let ptr = UnsafeMutablePointer<OutputType>(OpaquePointer(multiArray.dataPointer))

First you create the pointer to the multi-array’s data. Note that I used the OutputType
typealias here, so that this same code works just fine regardless of the kind of data that
is inside the multi-array.

Next, you need to get the strides for this array:


let channelStride = multiArray.strides[0].intValue
let rowStride = multiArray.strides[1].intValue
let columnStride = multiArray.strides[2].intValue

Again these are NSNumber values, so you write intValue to turn them into integers. Also
note that the indices you're using here need to match those of the multi-array's shape! For
example, if the height is dimension 1 of the shape, then you must also use index 1 to get the
row stride.

To read the value in channel channel at row y in column x, you then write:

let value = ptr[channel*channelStride + y*rowStride + x*columnStride]

That’s all you need to do. You multiply the index for each dimension with the stride for that
dimension, and add them all up to get the memory address.

To write into the MLMultiArray, you do the same thing but now assign to the pointer:

ptr[channel*channelStride + y*rowStride + x*columnStride] = value

In this example you’re using three dimensions and three strides. If your multi-array has
more dimensions, you’d also use the strides for those extra dimensions. And if your multi-
array has fewer dimensions, you’d naturally only use the strides for the dimensions that
you have.

By the way, you can make this even cleaner by defining a helper function. Since Swift allows
nested functions, the best place for this helper is inside the current function. Like this:

func processResults(multiArray: MLMultiArray) {
    let channelStride = multiArray.strides[0].intValue
    let rowStride = multiArray.strides[1].intValue
    let columnStride = multiArray.strides[2].intValue

    @inline(__always) func offset(_ channel: Int, _ y: Int, _ x: Int) -> Int {
        return channel*channelStride + y*rowStride + x*columnStride
    }
    ...

And now accessing the multi-array is as easy as:

let value = ptr[offset(channel, y, x)]

Pay attention to the order of x and y here! Generally it’s more common to put x before y,
but here it’s a good idea to stick to the same order as the array dimensions.

Note: You still need to make sure that your indices — channel, y, and x — don’t go outside
their respective dimensions. If the shape for the channel dimension is 200, then valid values
for the channel index are 0 - 199. If you try to read or write at index 200 or larger, then your
app may crash — or it may not, in which case you’re just reading bogus data. Unlike with
regular Swift arrays, you won’t get out-of-bounds assertions when using pointers. So you
still need to be careful!

But what does it all mean?

Just because you know how to read or write a multi-array doesn’t mean you know what to
do with this data. I can’t really help you with that — it completely depends on your model.
But I can give an example that might make all this theory easier to grasp.

The TinyYOLO model I mentioned earlier outputs a multi-array of shape (125, 13, 13). That
is interpreted as a grid of 13×13 cells. The purpose of such a cell is to predict bounding
boxes for any objects that were detected in a particular region in the original 416×416 input
image.

Each bounding box prediction consists of 25 numbers:

• 4 numbers for the coordinates of the bounding box rectangle

• 1 number for the confidence score, which tells you how likely the model thinks this
bounding box really contains an object

• 20 probabilities that predict the class of the object in the box

Each cell always outputs five of these bounding box predictions, which is why the number
of channels is 5 × 25 = 125.

Note: You could argue that the shape of the multi-array should have been something like
(25, 5, 13, 13). For each of the 13×13 cells there are now 5 bounding box predictions that
each are made up of 25 numbers. This is, in fact, how the numbers are organized in memory
but the shape of the multi-array combines the 25 and 5 into a single dimension.

In the case of TinyYOLO, the multi-array doesn’t actually contain real bounding box
coordinates yet. They still need to be “decoded” using so-called anchor boxes. We’ll ignore
the specifics of that for now. You can read more about how this sort of thing works in this
blog post (https://machinethink.net/blog/object-detection/).

The code to read through the MLMultiArray looks like this:

let boxesPerCell = 5
let numClasses = 20
let gridHeight = multiArray.shape[1].intValue
let gridWidth = multiArray.shape[2].intValue

for y in 0..<gridHeight {
    for x in 0..<gridWidth {
        for b in 0..<boxesPerCell {
            let channel = b*(numClasses + 5)

            let tx = ptr[offset(channel, y, x)]
            let ty = ptr[offset(channel + 1, y, x)]
            let tw = ptr[offset(channel + 2, y, x)]
            let th = ptr[offset(channel + 3, y, x)]
            let tc = ptr[offset(channel + 4, y, x)]

            /* decode the coordinates here */

            var probabilities = [Float](repeating: 0, count: numClasses)
            for c in 0..<numClasses {
                probabilities[c] = Float(ptr[offset(channel + 5 + c, y, x)])
            }

            /* find the winning class here */
        }
    }
}

These nested loops look at the bounding box b for the grid cell (y, x).

This performs a bit of extra arithmetic to find out the channel to read from, and then it
uses the same pointer logic as before. You read the four bounding box coordinates plus the
confidence score, and then the 20 class probabilities.

For your own models you’ll probably need to do something completely different, but the
method for reading from the MLMultiArray is exactly the same.

Creating a new MLMultiArray

I already mentioned that you can create a new MLMultiArray object by writing:

let multiArray = try? MLMultiArray(shape: [200, 50, 75], dataType: .double)

The shape argument is again an array of NSNumber objects, but in Swift if you use integer
literals you don’t need to cast them to NSNumber. The compiler is clever enough to figure
that out. (Otherwise, cast it with as [NSNumber].)

Now you have an MLMultiArray object that has room for 200×50×75 double precision floating
point numbers. You can write your own data into this array using the subscript method, or
using pointers, or using something like memcpy to copy many values at once.
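
For instance, here is a minimal sketch of the memcpy route, copying a plain [Double] buffer
(laid out in the (channels, height, width) order the model expects) into the multi-array in
one go:

if let multiArray = try? MLMultiArray(shape: [200, 50, 75], dataType: .double) {
    var source = [Double](repeating: 0, count: multiArray.count)
    /* fill source here */
    source.withUnsafeBufferPointer { src in
        memcpy(multiArray.dataPointer, src.baseAddress!,
               multiArray.count * MemoryLayout<Double>.stride)
    }
}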

When such an MLMultiArray instance is deallocated, it also frees the memory that holds its
data.

If you want to avoid making copies and manage the memory yourself, there is an MLMultiArray
constructor that takes a pointer but does not assume ownership of the data:

let yourPointer: UnsafeMutableRawPointer = ...

let multiArray = try? MLMultiArray(dataPointer: yourPointer,
                                   shape: [...],
                                   dataType: .double,
                                   strides: [...]) { pointer in
    /* free your data here if needed */
}

This simply assigns your pointer to the multi-array’s dataPointer property. You need to make
sure the shape and strides are correct. You also supply a closure that is invoked when the
MLMultiArray is deallocated, so you can free your data there if necessary.

With this constructor, the MLMultiArray becomes a very simple wrapper around your own
data. It’s a bit more work to manage but you avoid making unnecessary copies and
therefore gain some speed.

New with Xcode 11, and Swift only, is an MLMultiArray constructor that takes any Collection
type whose elements are floats, doubles, or any of the integer types:

var myData: [Float] = [ 3.14, -100, 7.5 ]
let multiArray = try? MLMultiArray(myData)

Since collections in Swift are value types, this copies the data. Modifying the collection
afterwards will not modify the contents of the multi-array.

New in iOS 14 and macOS 11.0 is MLMultiArray(concatenating:axis:dataType:) that makes
a new multi-array by concatenating several existing multi-arrays. These must all have the
same shape except for the specified axis, along which they'll be concatenated. It looks like
this was added specifically for doing predictions on videos.
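
A hedged sketch of how you might use it (the frameArrays variable and the idea of stacking
per-frame arrays along axis 0 are assumptions for illustration; the arrays must all share the
same shape apart from that axis):

let frameArrays: [MLMultiArray] = /* one multi-array per video frame */
let stacked = MLMultiArray(concatenating: frameArrays, axis: 0, dataType: .float32)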

Tip: If you get the error “[coreml] Failure verifying inputs” when making a prediction, it
means that the shape of your input MLMultiArray does not match the shape that is expected
by the model.
Reshaping an MLMultiArray
Let’s say you have an MLMultiArray with shape (1, 1, 3, 200, 150) but now you need to pass
it to some function that expects an array with just three dimensions and not five. In that
case you’d have to reshape the multi-array.

For example, you could reshape this array to (3, 200, 150). Those two dimensions of size 1
at the front serve no real purpose and so you can get rid of them without any problem.

Or if you wanted to interpret the data in this multi-array in a totally different way, you
could also reshape it to (300, 2, 150) or (4500, 20) or however you want to look at this data.

Here’s a real-world example: In the previous chapter I mentioned that the TinyYOLO model
outputs a multi-array of shape (125, 13, 13). But to process this data we actually split up
those 125 channels into five chunks of 25 elements each. It’s therefore quite reasonable to
want to reshape that multi-array to (25, 5, 13, 13).

Reshaping doesn’t change the data in memory — only the way that data is interpreted.

Unfortunately, MLMultiArray does not offer a built-in way to do this. But you can quite easily
implement this yourself with an extension.

Tip: Rather than doing this in Swift, you can also add a new layer at the end of the
mlmodel that reshapes the data. That way Core ML will give you an MLMultiArray that
is already in the desired shape. The type of layer to add is ReshapeLayerParams. There
is also a FlattenLayerParams for when you want to make the data one-dimensional. See
NeuralNetwork.proto for more details.

Here’s how you can implement the reshape operation in Swift:


extension MLMultiArray {
    @nonobjc public func reshaped(to dimensions: [Int]) throws -> MLMultiArray {
        let newCount = dimensions.reduce(1, *)
        precondition(newCount == count, "Cannot reshape \(shape) to \(dimensions)")

        var newStrides = [Int](repeating: 0, count: dimensions.count)
        newStrides[dimensions.count - 1] = 1
        for i in stride(from: dimensions.count - 1, to: 0, by: -1) {
            newStrides[i - 1] = newStrides[i] * dimensions[i]
        }

        let newShape_ = dimensions.map { NSNumber(value: $0) }
        let newStrides_ = newStrides.map { NSNumber(value: $0) }

        return try MLMultiArray(dataPointer: self.dataPointer,
                                shape: newShape_,
                                dataType: self.dataType,
                                strides: newStrides_)
    }
}

You use it like this:

let multiArray: MLMultiArray = ...

withExtendedLifetime(multiArray) {
    if let reshapedArray = try? multiArray.reshaped(to: [3, 200, 150]) {
        /* do something with the reshaped array */
    }
}

The to: parameter contains the new shape of the array. If you multiply these numbers
together, they must equal the number of elements in the original array. In other words,
you cannot reshape a (10, 200, 150) array to (3, 200, 150) because the number of elements
does not match.

Using these new dimensions, the reshaped() method computes the strides for the new array.
And finally, it creates a new MLMultiArray instance using the same data pointer and data type
but with the new shape and strides.

It’s important to realize that the new, reshaped multi-array still points at the data in the
original array! It’s your responsibility to keep the original MLMultiArray array alive while
you still need the reshaped array, otherwise the pointer goes stale.

The Swift optimizer can be pretty aggressive in deallocating objects that it thinks are no
longer used, and if you still have a pointer to such an object then you’re in trouble. That’s
why the example wraps everything in a withExtendedLifetime block.
Transposing an MLMultiArray
Suppose you have an MLMultiArray with the shape (100, 200, 4) but you need to pass this
into a function that expects the shape (4, 100, 200). In other words, the last dimension must
become the first.

This is a common occurrence during training. NumPy and TensorFlow offer a transpose()
function for this, while PyTorch calls it permute().

Unlike reshaping, which only changes how you view the dimensions but doesn’t change
the data, transposing or permuting will literally move around the bytes in the array.

For example, to convert the (100, 200, 4) array into (4, 100, 200) you would call:

let transposedArray = multiArray.transposed(to: [2, 0, 1])

The notation [2, 0, 1] means that dimension 2 moves to the front and the other two
dimensions will shift one position towards the back.

Unfortunately, MLMultiArray has no API for doing this, so you’ll have to write it yourself.

Tip: Rather than doing this in Swift, you can also add a new layer at the end of the mlmodel
that rearranges the data. That way Core ML will automatically handle the permutation and
you don’t have to worry about any of it. The type of layer to add is PermuteLayerParams. See
NeuralNetwork.proto for more details.

To move the last dimension to the front as in the above example, the transpose routine
would look like this in pseudocode:


for h in 0..<height {
    for w in 0..<width {
        for c in 0..<channels {
            let srcIndex = h*srcStrides[0] + w*srcStrides[1] + c*srcStrides[2]
            let dstIndex = c*dstStrides[0] + h*dstStrides[1] + w*dstStrides[2]
            dstPtr[dstIndex] = srcPtr[srcIndex]
        }
    }
}

You read the value using the source strides and write it using the destination strides. Pretty
straightforward. However, this code has a couple of limitations:

• It assumes the array will always have three dimensions.

• It always swaps (h, w, c) to (c, h, w), so it can only transpose one way.

The code for a transpose function that can handle any number of dimensions and any
possible permutation order is more complex. Here is one way to write it:

extension MLMultiArray {
    @nonobjc public func transposed(to order: [Int]) throws -> MLMultiArray {
        let ndim = order.count
        precondition(ndim == strides.count)

        let newShape = shape.indices.map { shape[order[$0]] }
        let newArray = try MLMultiArray(shape: newShape, dataType: self.dataType)

        let srcPtr = UnsafeMutablePointer<Double>(OpaquePointer(dataPointer))
        let dstPtr = UnsafeMutablePointer<Double>(OpaquePointer(newArray.dataPointer))

This first swaps around the dimensions so they are in the new order and then creates a
MLMultiArray object with that new shape. You also grab the pointers to the memory of both
arrays.

Note: This function only works with multi-arrays that have the datatype .double. To support
.float32 or .int32 you need to change the type of the UnsafeMutablePointer.

Inside the loop that will copy the values from the original to the new array, you need to
compute the source index for reading from and the destination index for writing to. The
problem is that you don’t know beforehand how many dimensions the array will have, and
so you can’t use multiple nested loops.

Instead, you’ll be using a single loop that treats the multi-arrays as being one-dimensional.
You will need to map that scalar index to the different dimensions in both the source and
destination array. For that, you first define some variables:

        let srcShape = shape.map { $0.intValue }
        let dstStride = newArray.strides.map { $0.intValue }
        var idx = [Int](repeating: 0, count: ndim)

The shape and strides are NSNumbers, so to gain a bit of speed you turn these into plain Int
values. The idx array is a list of integer indices, one for every dimension. This is how you’ll
keep track of where you’re reading inside the source array. Initially, these values are all 0.

Next up is the actual loop:

        for j in 0..<count {
            var dstIndex = 0
            for i in 0..<ndim {
                dstIndex += idx[order[i]] * dstStride[i]
            }

            dstPtr[dstIndex] = srcPtr[j]

It’s a straight for loop that goes from 0 to the number of elements in the array. The source
index that you use for reading is simply j.

The destination index must be composed of the destination strides and the permuted
indices from the idx array. This is how the reading position in the source array is mapped
to the writing position in the destination array. And then you simply copy the value from
the source into the destination.

Finally, you need to update the values of idx to move them ahead to the next reading
position. There are two ways to do this: 1) divide j by the source strides using division
and the remainder, 2) simply increment the values in idx. Because dividing is relatively
slow, you’ll choose option number two here:

            var i = ndim - 1
            idx[i] += 1
            while i > 0 && idx[i] >= srcShape[i] {
                idx[i] = 0
                idx[i - 1] += 1
                i -= 1
            }
        }
        return newArray
    }
}

Let’s say the source array has three dimensions of sizes (100, 200, 4). The value of j will go
from 0 to 80000 because that’s how many elements there are (i.e. 100 × 200 × 4).

Initially, idx is [0, 0, 0]. Every iteration of the loop you increment the last element of idx,
so after the first iteration it is [0, 0, 1].

After 4 iterations, idx is [0, 0, 4]. But that is one position beyond the length of this last
dimension in the source array. Now you roll over that last index and increment the second-
to-last index instead, so that idx becomes [0, 1, 0].

After another 4 iterations, idx will be [0, 2, 0], and so on. It works like a clock, where the
minute is incremented when seconds reach 60, and the hour is incremented when minutes
reach 60.

And that’s it for a basic transpose routine. This code isn’t particularly fast, especially for
large arrays, but it’s useful for debugging and testing things.
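
As a quick usage sketch of the extension above, this moves the last dimension of a
(100, 200, 4) array to the front, producing a new (4, 100, 200) array that owns its own copy
of the data:

let hwcArray: MLMultiArray = ...
if let chwArray = try? hwcArray.transposed(to: [2, 0, 1]) {
    /* use chwArray here */
}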

Note: For certain special cases you can use the vImage framework to do very fast
permutations of the dimensions. For example, vImageConvert_PlanarFtoARGBFFFF() will
convert a multi-array of shape (4, height, width) to (height, width, 4). There is also
vImageConvert_PlanarFtoRGBFFF() for when you have 3 channels instead of 4. These functions
will only work for MLMultiArrays of type .float32 but they're much faster than writing your
own code.
Converting MLMultiArray to an Image
If your model produces a MultiArray output with a width and height and 3 or 4 channels,
you can convert the contents of this array into a color image.

A multi-array with only height and width dimensions, or with just 1 channel, can be
converted into a grayscale image.

There is no built-in way to do this, so you’ll have to write your own code to perform this
conversion.

Note: For image-to-image models, I would actually recommend changing the output type of
the model from MultiArray to Image, so that Core ML will automatically do this conversion
for you. It now gives you a CVPixelBuffer instead of an MLMultiArray object. See also the
chapter Outputting an Image Instead of a MultiArray.

The easiest and most flexible — and arguably also the slowest — way to create the image is
to loop through all the rows and columns from the MLMultiArray and copy the values to an
array of UInt8 values:

func createImage(from features: MLMultiArray) -> UIImage? {
    let ptr = UnsafeMutablePointer<Double>(OpaquePointer(features.dataPointer))

    let channels = features.shape[0].intValue
    let height = features.shape[1].intValue
    let width = features.shape[2].intValue

    let cStride = features.strides[0].intValue
    let yStride = features.strides[1].intValue
    let xStride = features.strides[2].intValue

    var pixels = [UInt8](repeating: 255, count: width * height * 4)

    for x in 0..<width {
        for y in 0..<height {
            for c in 0..<channels {
                let value = ptr[c*cStride + y*yStride + x*xStride]
                /* you can change the value here if not between 0 and 255 */
                pixels[(y*width + x)*4 + c] = UInt8(value)
            }
        }
    }
    return UIImage.fromByteArrayRGBA(pixels, width: width, height: height)
}

The destination array, pixels, has 4 bytes for every pixel in the image, that is one byte for
each of the RGBA color channels. You initially set all these bytes to 255, so that the alpha
channel is always set to fully opaque. Then it’s just a matter of filling in the bytes with
the actual pixel values. When you’re done, you can convert the pixels array into an image
object.

The chapter Working With MLMultiArray explains how the shape and strides work, and
how to read the values from the multi-array using a pointer. Here, you read a Double value
at every array location. Then you convert this Double to a UInt8 and write it into the pixels
array.

This code makes a number of assumptions:

• The multi-array must have three dimensions. The first is the number of channels, the
second is the image height, and the third is the image width. If your MLMultiArray has
a shape that is not (channels, height, width) then you need to change the indices used
to read from the shape and strides arrays. Or you can reshape or transpose the multi-
array first.

• The values in the MLMultiArray must all between 0 and 255. If not, you can write
some additional code inside the loop that properly converts the Double to a UInt8. If
your MLMultiArray has Float or Int32 values, you’ll also need to change the type of the
UnsafeMutablePointer.

• The first channel is interpreted as red, the second is green, and the third is blue. If
there is a fourth channel, it will be used for the alpha channel.

The destination index in the pixels array is given by (y*width + x)*4 + c. Here’s how this
works: Each color channel, c, is a single byte. Each pixel, x, is made up of 4 bytes (RGBA).
And each row of pixels, y, is made up of width pixels or width*4 bytes.

If you want to write the pixels in a different format, you’ll have to change this index
calculation. Some modifications you can make:

• To flip the red and blue channels to get BGRA instead of RGBA, use (2-c) instead of c.

• To put the alpha channel first to get ARGB or ABGR pixels, you should add + 1 to the
index. (This assumes the input array has only 3 color channels and you always want
alpha to be 255.)

• To get a grayscale image, the destination index should be just y*width + x. The channel
c isn’t used and each pixel is now just one byte.

It should be clear now that you can do whatever you want in this loop. This is why I said
this is the most flexible way to convert the multi-array. All you need to do is figure out how
to put the input values into the right position in the pixels array.

Let’s say your MLMultiArray has 20 channels instead of 3. You can still convert that multi-
array to an image, as long as you find some way to map those 20 channels to the RGBA
color channels in the pixels array. For example, in a semantic segmentation model the 20
channels would represent the probabilities of 20 different classes. Inside the loop you’d
then find the highest scoring class for each pixel and assign it a unique color.
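
Here is a hedged sketch of that segmentation variant, reusing the ptr, channels, height, width,
cStride, yStride, xStride, and pixels variables from the createImage(from:) function above
(the colors table is made up for this example):

let colors: [(UInt8, UInt8, UInt8)] = /* one RGB color per class */
for y in 0..<height {
    for x in 0..<width {
        var bestClass = 0
        var bestScore = -Double.greatestFiniteMagnitude
        for c in 0..<channels {
            let score = ptr[c*cStride + y*yStride + x*xStride]
            if score > bestScore { bestScore = score; bestClass = c }
        }
        let (r, g, b) = colors[bestClass]
        let i = (y*width + x)*4
        pixels[i] = r; pixels[i + 1] = g; pixels[i + 2] = b   // alpha stays 255
    }
}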

Once you have this pixels array with RGBA values for every pixel in the image — or BGRA
or anything else you want — there are many different ways you can convert it to an image:

• Here, you're using a helper function UIImage.fromByteArrayRGBA() to turn the pixels
array into a UIImage object. You can find this helper function in the CoreMLHelpers
repo on GitHub (https://github.com/hollance/CoreMLHelpers). That repo also has a more
advanced version of the conversion function from this chapter.

• In the same manner it’s also possible to convert the pixels array to a CVPixelBuffer, an
MTLTexture, or any other kind of image format. As long as the method accepts a memory
buffer of bytes in RGBA order, you’re good to go. And if they expect a different order,
such as ABGR or whatever, all you need to do is change the loops.

Note: The above function is for creating color images. To create a grayscale image from
an MLMultiArray with only one channel, you need to modify the code a little. If the multi-
array only has two dimensions you’ll have to change the way you read from the shape and
strides arrays. You should also remove the loop for the channels, and only write a single
byte per pixel instead of four. Finally, use the helper function UIImage.fromByteArrayGray()
to create the UIImage object. If you don’t feel like writing this code yourself, check out the
full version of the conversion method from the CoreMLHelpers repo, which also handles
grayscale images.


Using vImage

The downside of using nested loops is that they’re not particularly fast, especially for large
arrays.

Unfortunately, you can't just copy the contents of the MLMultiArray straight into the image's
memory buffer, for two reasons. First, the multi-array contains doubles or floats or 32-bit
integers, while images typically use UInt8s. Second, Core ML puts the dimensions in a different
order than images expect: in an image the channels should be the last dimension, not the first.

When the data is in the order (channels, height, width) it is known as a planar image
because it’s made up of three separate color planes of size height × width. The first plane
has all the pixels for the red color channel, the second plane has all the pixels for the green
color channel, and the last plane has the pixels for the blue color channel.

However, to create a UIImage or CVPixelBuffer you need to put the pixels in interleaved or
chunky order — an array with the shape (height, width, channels) — where the RGBA values
for each pixel are right next to each other in memory.

So you need some way to convert from planar to interleaved format. Transposing the array
is one approach, but it’s still quite slow as it involves the same nested loops.

For fast image processing routines, your first stop should always be the vImage frame-
work, which is part of the Accelerate framework. Fortunately, it can also help you out here:
vImage knows how to convert between planar and interleaved formats — and it’s pretty
fast at it too.

You’ll be using the function vImageConvert_PlanarFToBGRX8888().

The PlanarF in the name means that vImage expects the multi-array to have 32-bit floating
point numbers, so this method will only work for MLMultiArray objects that have dataType
== .float32, or FLOAT32 in the spec.

The BGRX8888 means that the output will be a 32-bit format with 8 bits per color channel.
The color channels are in the order BGRA, so the alpha channel comes last. The X means
that the alpha channel has a constant value for all pixels.

So this method only works for MLMultiArrays that have three color channels, although
vImage also has a function for converting images with four color planes instead of just
three (where one of them is used for the alpha channel).

Note: There is also a vImageConvert_PlanarFToXRGB8888() function. The difference is that it
puts the alpha channel first. By the way, whether it says RGB or BGR in the function name
isn't very important. It's easy to swap the red and blue channels, so these functions are not
limited to just BGRA and ARGB — they can output RGBA and ABGR images just fine too.

Using vImage tends to be a little verbose, so I’ll be showing the function in steps:

import Accelerate

func createUIImage(fromFloatArray features: MLMultiArray) -> UIImage? {
    assert(features.dataType == .float32)
    assert(features.shape.count == 3)

    let ptr = UnsafeMutablePointer<Float>(OpaquePointer(features.dataPointer))

    let height = features.shape[1].intValue
    let width = features.shape[2].intValue
    let channelStride = features.strides[0].intValue
    let rowStride = features.strides[1].intValue

This is all stuff you’ve seen before: you grab the pointer to the MLMultiArray’s data, get the
width and height from the shape, and look up the strides.

vImage requires that you wrap the image data into vImage_Buffer objects, so create those:

    let srcRowBytes = rowStride * MemoryLayout<Float>.stride

    var blueBuffer = vImage_Buffer(data: ptr,
                                   height: vImagePixelCount(height),
                                   width: vImagePixelCount(width),
                                   rowBytes: srcRowBytes)
    var greenBuffer = vImage_Buffer(data: ptr.advanced(by: channelStride),
                                    height: vImagePixelCount(height),
                                    width: vImagePixelCount(width),
                                    rowBytes: srcRowBytes)
    var redBuffer = vImage_Buffer(data: ptr.advanced(by: channelStride * 2),
                                  height: vImagePixelCount(height),
                                  width: vImagePixelCount(width),
                                  rowBytes: srcRowBytes)

Don’t worry too much about the names of these buffers: the PlanarFToBGRX8888() function
treats the first plane as the blue color plane instead of the red color plane, so that’s why
I’ve called it blueBuffer. But vImage really doesn’t care what color channel this is, so if the
first channel in your MLMultiArray really has all the red pixels then that’s OK too.

Note that greenBuffer adds the channelStride to the pointer, and the redBuffer adds twice
the channelStride. That’s because each plane contains channelStride number of pixels, so
that’s how far the planes are apart in memory.

Note: If you wanted to flip RGB to BGR here, you’d swap the pointers for blueBuffer and
redBuffer.

You also need to create a vImage_Buffer to hold the output:

    let destRowBytes = width * 4
    var pixels = [UInt8](repeating: 0, count: height * destRowBytes)
    var destBuffer = vImage_Buffer(data: &pixels,
                                   height: vImagePixelCount(height),
                                   width: vImagePixelCount(width),
                                   rowBytes: destRowBytes)

And then you can call the conversion function:

    let error = vImageConvert_PlanarFToBGRX8888(
        &blueBuffer, &greenBuffer, &redBuffer,
        Pixel_8(255), &destBuffer,
        [255, 255, 255], [0, 0, 0], vImage_Flags(0))

The Pixel_8(255) is the constant to use for alpha.

The arrays [255, 255, 255] and [0, 0, 0] are for the maximum and minimum values that are
in the multi-array, respectively. For example, if your MLMultiArray contains values between
0 and 1 instead of 255, you’d change the maximum to [1, 1, 1]. This is how vImage knows
how to map the floating point values to the range 0 - 255.

Now the pixels array contains RGBA values again — or BGRA, depending on what the first
channel in your multi-array represents. All the alpha values are 255.

The last thing you do is convert the pixels array to a UIImage, just like before:

    if error == kvImageNoError {
        return UIImage.fromByteArrayRGBA(pixels, width: width, height: height)
    } else {
        return nil
    }
}

And that's it. Using this function is at least 10 times faster than the one with the nested
loops, but it's also less flexible: it only works on .float32 multi-arrays that have 3 color
channels.

If you have a different kind of multi-array, you’ll have to write another conversion routine
that uses a different vImage function. But if you like fast code, it’s definitely worth it!
Converting from UIImage to MLMultiArray
The situation: You have a Core ML model that takes as input a MultiArray of shape (3,
height, width), for example (3, 224, 224). But in your app you have an image, so how do
you convert the UIImage to an MLMultiArray object that you can pass to the model?

The answer is: you probably shouldn’t do this! A better solution is to change the model to
expect an image instead of a MultiArray. Now Core ML will directly accept your UIImage as
the input — no need to convert it to an MLMultiArray first.

You will still need to convert your UIImage to a CVPixelBuffer object but that's easy with the
helper code from CoreMLHelpers (https://github.com/hollance/CoreMLHelpers). Even simpler is
using Core ML through Vision, which lets you pass in a CGImage object that you can easily
obtain from your UIImage. And on iOS 13 and macOS 10.15 you can also use the new
MLFeatureValue(cgImage:) API.

Note: In case you're wondering, it is definitely possible to convert a UIImage to MLMultiArray.
It's the opposite of Converting MLMultiArray to an Image. You have to loop through
the pixels in the image and copy the color values into the rows and columns of a new
MLMultiArray object. More trouble than it's worth, really — it's smarter to let Core ML and
Vision handle this for you.

Computing the Argmax
It’s very common in machine learning to want to know what the maximum is of a list of
numbers. For example, if you have 20 numbers that represent a probability distribution
over 20 classes, the class with the maximum probability value is the winner.

The max function would give you the largest probability value. The argmax function, on
the other hand, gives you the array index of the thing with the largest value.

In the case of a probability distribution, the argmax tells you which class won but not what
its probability was, while the max only tells you the largest probability but not which class
it belongs to. For example:

let prob = [ 0.2, 0.1, 0.13, 0.5, 0.07 ]

prob.max()    // is 0.5
argmax(prob)  // is 3 because prob[3] == 0.5

Often you want to know both of these things, so you’ll use both the max and argmax.
Sometimes they are combined into a single operation. That said, it’s pretty easy to find
the max if you already have the argmax — just look it up in the array at the given index.

The simplest way to compute the argmax for a Core ML model is to insert a ReduceLayerParams
layer at the end of the mlmodel. A reduce layer can perform different operations and ARGMAX
is one of them (it can also do regular MAX).

The reduce layer also lets you choose along which axis you want the argmax to be
computed. You can see an example of how to use this kind of layer in the chapter Cleaning
Up a Converted Model.

Tip: For Core ML 3 and later, you can also use an ArgMaxLayerParams layer. The difference is
that this new layer type can also handle dynamic tensor shapes.

It’s not always possible or convenient to let Core ML compute the argmax, so sometimes
you’ll need to do this yourself.


For example, the TinyYOLO model outputs a multi-array of shape (125, 13, 13). Those 125
channels are really five different things of 25 numbers each. And you only want to compute
the argmax over 20 of these 25 numbers… ReduceLayerParams is quite flexible but not that
flexible.

Note: To be fair, you could use a SliceLayerParams layer first to chop those 125 channels into
smaller chunks and then add a separate ReduceLayerParams to each of these chunks. But this
does make the Core ML model more complicated and at some point you might as well do
these operations yourself in Swift as a post-processing step.

The “obvious” way to implement argmax is to loop through the array and keep track of the
largest element you’ve seen so far. That certainly works, but you can also use the Accelerate
framework for this. Using Accelerate is usually 5 to 10 times faster than writing a loop
because it makes use of the CPU’s vector instruction set, which can process multiple array
elements at the same time.

Here’s how to do argmax using Accelerate:

import Accelerate

public func argmax(_ array: [Double]) -> (Int, Double) {
    var maxValue: Double = 0
    var maxIndex: vDSP_Length = 0
    vDSP_maxviD(array, 1, &maxValue, &maxIndex, vDSP_Length(array.count))
    return (Int(maxIndex), maxValue)
}

This calls the vDSP_maxviD() function to do the hard work. It returns both the index — the
argmax — as well as the largest value in the array — the max. So you catch two birds with
one stone!

This code snippet works on Double values. If instead you want the argmax of an array of
Float values, you’ll have to call vDSP_maxvi() without the D, and pass in a [Float] array.

Often you’ll want to compute the argmax over only a portion of an MLMultiArray object, as
in the TinyYOLO example. In that case you need to pass in a pointer instead:

public func argmax(_ ptr: UnsafePointer<Double>,
                   count: Int, stride: Int = 1) -> (Int, Double) {
    var maxValue: Double = 0
    var maxIndex: vDSP_Length = 0
    vDSP_maxviD(ptr, vDSP_Stride(stride), &maxValue, &maxIndex, vDSP_Length(count))
    return (Int(maxIndex), maxValue)
}

And you’d use it as follows:

let multiArray: MLMultiArray = ...

let basePtr = UnsafeMutablePointer<Double>(OpaquePointer(multiArray.dataPointer))
let ptr = basePtr.advanced(by: ...)
let (index, value) = argmax(ptr, count: 20)

This time you also have to pass in the count so that vDSP_maxviD() knows how many elements
it needs to look at.

It may also happen that the numbers you want to compute the argmax over aren’t right
next to each other in memory.

For example, let’s say your model makes a prediction of 20 class probabilities for each one
of 13×13 pixels.

If the multi-array has shape (13, 13, 20) then every group of 20 probabilities is a contiguous
array of 20 elements in memory and you can use argmax() as shown above. Write a for-loop
that loops 13×13 times, move the pointer ahead by 20 elements in every iteration and call
argmax(). Simple enough!
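
A minimal sketch of that loop, assuming the (13, 13, 20) layout where the 20 probabilities per
cell are contiguous in memory:

let basePtr = UnsafeMutablePointer<Double>(OpaquePointer(multiArray.dataPointer))
for cell in 0..<(13 * 13) {
    let (index, value) = argmax(basePtr.advanced(by: cell * 20), count: 20)
    /* index is the winning class for this cell, value is its probability */
}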

However, in Core ML the “channel” dimension usually comes first and so the multi-array
will have shape (20, 13, 13). The order in memory is now: first there are 13×13 numbers for
the first probability of every pixel, then there are 13×13 numbers for the second probability
of every pixel, and so on.

The 20 probability values for any given pixel are now no longer contiguous in memory but
each is separated by 13×13 values for the other pixels.

The solution is to supply a stride to the argmax:

let multiArray: MLMultiArray = ...

let channelStride = multiArray.strides[0].intValue
let (index, value) = argmax(ptr, count: 20, stride: channelStride)

This tells vDSP_maxviD() that it should skip channelStride values to go from one element to
the next.
Translating Class Labels
You have trained a classifier and converted it to Core ML, but now you want to use this
model in an app that is localized for different languages. Unfortunately, the class names
are hardcoded in the mlmodel file, so how do you use different class names for different
languages?

If you just want to translate the hardcoded labels for a single language, you can change the
class names directly in the mlmodel file. See the chapter Replacing the Class Names of a
Classifier.

However, to support multiple languages with the same mlmodel you'll need to use a
different approach. One way is to simply define a dictionary that maps the class names
from the mlmodel to the language you want to display:

let translatedLabels = [
"originalLabel0": "translatedLabel0",
"originalLabel1": "translatedLabel1",
"originalLabel2": "translatedLabel2",
...
]

And then use it as follows with Vision:

if let observations = request.results as? [VNClassificationObservation] {
    for observation in observations {
        let originalLabel = observation.identifier
        let translatedLabel = translatedLabels[originalLabel] ?? originalLabel
        /* use the translated label */
    }
}

If you’re using Core ML without Vision, you get the prediction results in a [String: Double]
dictionary. You’d then take the dictionary key, which is the original class label, and look it
up in the translatedLabels dictionary.
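
For example, a sketch (the property name classLabelProbs is an assumption; it's whatever
your generated output class calls the probabilities dictionary):

if let output = try? model.prediction(image: pixelBuffer) {
    for (originalLabel, probability) in output.classLabelProbs {
        let translatedLabel = translatedLabels[originalLabel] ?? originalLabel
        /* use translatedLabel and probability */
    }
}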

If you trained the classifier yourself, then you probably already have the list of original
class names that you can use to make this translation dictionary. If not, you can extract the
original labels from the model like so:


import coremltools as ct
import numpy as np
spec = ct.utils.load_spec("YourModel.mlmodel")
labels = list(spec.neuralNetworkClassifier.stringClassLabels.vector)
np.savetxt("labels.txt", labels, delimiter="\n", fmt="%s")

This writes a text file labels.txt with one label per line, in the same order they appear in
the mlmodel. You can now translate these labels and load them into a Swift dictionary.

An alternative method is to replace the string class labels in the mlmodel file with integers,
and then do the lookup using an array instead of a dictionary.

With your mlmodel’s spec, do:

num_labels = len(spec.neuralNetworkClassifier.stringClassLabels.vector)
spec.neuralNetworkClassifier.int64ClassLabels.vector.extend(range(num_labels))

This assigns the numbers from 0 to num_labels - 1 to the mlmodel’s class labels.

You also need to tell the spec that the output is now a dictionary of integers instead of
strings. Assuming that output 0 is the one with the probabilities dictionary and output 1 is
the output for the winning label, write:

spec.description.output[0].type.dictionaryType.int64KeyType.SetInParent()
spec.description.output[1].type.int64Type.SetInParent()

And now save the spec to a new mlmodel file. In your app, you’d do the following:

let translatedLabels = [
"translatedLabel0",
"translatedLabel1",
...
]

if let observations = request.results as? [VNClassificationObservation] {
    for observation in observations {
        if let originalLabel = Int(observation.identifier) {
            let translatedLabel = translatedLabels[originalLabel]
            /* use the translated label */
        }
    }
}

This is similar to before except now the value of observation.identifier is always a number
(but still inside a string) and you use it as the index in the translatedLabels array.

If you’re using Core ML without Vision, then you now get a dictionary of [Int64: Double]
values. Note that the dictionary key is no longer a String but an Int64. You can use this
numeric key to look up the new label in the translatedLabels array.
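
A sketch of that lookup (again assuming a generated output property named classLabelProbs):

if let output = try? model.prediction(image: pixelBuffer) {
    for (classIndex, probability) in output.classLabelProbs {   // now [Int64: Double]
        let translatedLabel = translatedLabels[Int(classIndex)]
        /* use translatedLabel and probability */
    }
}
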
Part 6: Advanced Topics

Become a Core ML pro with these advanced chapters.


Making Multiple Predictions at Once
With Batches
When you do model.prediction(...) it only makes a single prediction at a time. That’s useful
for getting immediate results, such as when you’re making predictions on live video.

As of iOS 12 and macOS 10.14, Core ML also has a batch prediction mode that can make
multiple predictions at once.

If you’re familiar with training of machine learning models, you’ll know that the batch size
is one of the hyperparameters you can choose during training. With Core ML you can now
use a batch size greater than 1 for inference too.

This can give quite the performance boost for offline processing tasks, such as applying
the model to the frames of an mp4 video or to a collection of images from the user’s photo
library.

When combining multiple prediction jobs into a batch, Core ML can more efficiently use
the GPU and the Neural Engine. A heavily parallel processor such as the GPU works best
when it is fully occupied. Larger batches mean the GPU stays busier — just what it likes!

From my testing, it looks like the GPU benefits most from batching, the Neural Engine
slightly less so, and the CPU almost not at all.

Limits of using batches

Using a batch of size 2 does not mean you get two predictions for the price of one! It doesn’t
magically cut the processing time for each prediction in half.

The speed gain is more like being able to do 5 predictions in the time of 4. The exact speed
gain depends on the complexity of your model and what hardware it’s running on.

So it doesn’t make your app twice as fast, but you do gain a significant amount of time
simply because Core ML can use the hardware more efficiently. This is especially useful
with offline tasks: it’s the difference between a job that is done in 8 minutes and one that’s
done in 10 minutes.


In other words, using a batch gives higher throughput — the app can handle more
predictions in less time.

The flipside is that it also increases latency, which is the time between making a prediction
and getting the result. If a single prediction takes 1 second, the latency is only 1 second. Let’s
say a batch of 5 predictions takes 4 seconds. It’s faster on average — only 0.8 seconds each
— but the latency is now 4 seconds. You have to wait longer until the entire batch is done.

That means using batches isn’t very useful for real-time, live processing of data, where you
want the latency to be as small as possible.

Besides higher latency, another downside of using batches is that they increase the amount
of RAM that is required to make the prediction. A batch of size 5 also needs five times
as much working memory! So you probably don’t want to do this with image-based deep
learning models that eat up hundreds of MBs of RAM for each prediction, or your app will
run out of free memory real quick!

Making a batch prediction

As of Xcode 10, the generated class has a predictions() method — plural! — that is used for
batch predictions. You use it like this:

let batchSize = 10
var inputs: [YourModelInput] = []
for b in 0..<batchSize {
    inputs.append(YourModelInput(image: pixelBuffers[b]))
}

if let results = try? model.predictions(inputs: inputs) {
    for (i, result) in results.enumerated() {
        /* look at the i-th result from the batch */
    }
}

Just as with a regular prediction, you provide a YourModelInput object with the data for your
model’s inputs. Except now you have an array of these objects, one for each prediction you
want to make.

You pass this array to model.predictions(inputs:) and when Core ML is done you get an
array with YourModelOutput objects containing the prediction results.

So it still works very much like before except now you have arrays of YourModelInput and
YourModelOutput objects.

Note: Vision has no batch API. To take advantage of batching you’ll need to go through the
Core ML API, either using the generated class or MLModel.

Under the hood

Let’s look at what happens inside the predictions() method:

func predictions(inputs: [YourModelInput],
                 options: MLPredictionOptions = MLPredictionOptions())
                 throws -> [YourModelOutput] {
    // 1
    let batchIn = MLArrayBatchProvider(array: inputs)
    // 2
    let batchOut = try model.predictions(from: batchIn, options: options)
    // 3
    var results: [YourModelOutput] = []
    results.reserveCapacity(inputs.count)
    for i in 0..<batchOut.count {
        let outProvider = batchOut.features(at: i)
        let result = YourModelOutput(features: outProvider)
        results.append(result)
    }
    return results
}

The contents of the batch, here called inputs, is the array of YourModelInput objects. What
matters here is that the YourModelInput class implements the MLFeatureProvider protocol.

1. You put these MLFeatureProvider objects into an MLArrayBatchProvider instance. This is
nothing more than a list of feature providers, one for each prediction.

2. You call MLModel’s predictions(from:options:) method to make the predictions for the
entire batch at once. This method is new since iOS 12 and macOS 10.14.

3. The output from step 2 is another MLArrayBatchProvider but this time with the results
for each of the predictions. If you put five objects into the inputs array, this output
batch provider also contains five MLFeatureProvider objects.

The loop extracts the individual result objects and adds them to an array of YourModelOutputs
as those are more convenient to work with than naked MLFeatureProvider objects.

Doing it yourself

You’re not required to use the generated YourModel.predictions() method to make a batch
prediction. You can also roll your own.

For example, if you don’t want to use YourModelInput objects, you could use an array
of MLDictionaryFeatureProvider objects to initialize the MLArrayBatchProvider. See also the
chapter Using the MLModel API.

Or if your model has multiple inputs and their values are stored in separate arrays, you
can also initialize MLArrayBatchProvider as follows:

let batchIn = try MLArrayBatchProvider(dictionary: [
    "image": [ pixelBuffer0, pixelBuffer1, pixelBuffer2, pixelBuffer3 ],
    "originalWidth": [ 640.0, 320.0, 1280.0, 640.0 ],
    "originalHeight": [ 480.0, 240.0, 720.0, 240.0 ],
])

The model in this example takes three inputs — image, originalWidth, and originalHeight —
and there are four predictions to be made for this batch. Note that all these arrays must
have the same length, in this case 4.

If MLArrayBatchProvider is not powerful enough for you, then you can always write your
own batch provider by implementing the methods from the MLBatchProvider protocol. For
example, you could create a class that reads the next batch_size frames from a video file
straight into the batch.
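
For instance, a bare-bones custom batch provider might look like this. This is only a sketch:
it wraps an array of pixel buffers that is already in memory, and it assumes the model's
input is named "image"; a real implementation would read the frames lazily.

import CoreML
import CoreVideo

class PixelBufferBatchProvider: NSObject, MLBatchProvider {
    let pixelBuffers: [CVPixelBuffer]

    init(pixelBuffers: [CVPixelBuffer]) {
        self.pixelBuffers = pixelBuffers
        super.init()
    }

    // The number of predictions in this batch.
    var count: Int { pixelBuffers.count }

    // The feature provider for the i-th prediction.
    func features(at index: Int) -> MLFeatureProvider {
        // Force-try keeps the sketch short; handle the error properly in real code.
        return try! MLDictionaryFeatureProvider(dictionary: ["image": pixelBuffers[index]])
    }
}

You'd pass an instance of this class straight to model.predictions(from:options:), just like
an MLArrayBatchProvider.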

Batch size in the model

Recall that the data blobs that get passed from one layer of your model to the next have the
shape:

(sequence length, batch size, channels, height, width)

For example, if your batch has five things in it, that second dimension is 5. However, it
doesn’t appear that Core ML offers any layers that can access that batch dimension — most
layers only work on the last three dimensions (C, H, W) and sometimes also the sequence
length.

As far as the model itself is concerned, the batch dimension might as well not exist. You
can’t do anything special across a batch with the built-in layers. Custom layers also do not
get access to the entire batch, they only get to see one item from the batch at a time.

Update: Core ML 3 has many new layers that are a lot more flexible in what they allow.
With Core ML 3, tensors can have arbitrary shapes and are therefore no longer limited to
the fixed shape described here. Most of these new operations will let you access the batch
dimension. Look at NeuralNetwork.proto to see how the various layer types interpret the
tensor shapes.
Size Flexibility
As of mlmodel specification version 3, introduced with iOS 12 and macOS 10.14, Core ML
models can have flexible input and output sizes. If a model uses flexible sizes, the mlmodel
viewer in Xcode shows a Flexibility column:

Previously, Core ML always required inputs to have a fixed size. If a model expected a
256×256-pixel image as input, then any other image sizes would trigger a runtime error.

As of mlmodel version 3 you have these options:

1. A single input size of fixed dimensions, like before.

2. A range of possible sizes, for example any images between 128×128 and 512×512 pixels
(as in the picture above).

3. A list of allowed sizes, for example only 128×128, 256×256, or 512×512 pixels.

The flexible sizes are available for inputs of type image as well as MultiArray. Usually when
the input is larger, the output will be proportionally larger as well, and so you can specify
a flexible size for the model outputs too.

Note: You cannot use size flexibility with all models. For example, if your neural network
has a fully-connected layer at the end, using a smaller input image may result in fewer
neurons than the fully-connected layer is expecting. Size flexibility generally only works
with fully convolutional neural networks, or with models where the fully-connected layer
is preceded by a global pooling layer instead of a flatten layer.

When converting the model…

The model converters from coremltools do not allow you to specify flexible sizes for the
model inputs. They always expect each input to be of a particular size. To make the inputs
flexible, you’ll have to change this in the model spec afterwards.


Once the model has been converted, the inputs and/or outputs from the model look like
this in the spec:

imageType {
width: 256
height: 256
colorSpace: RGB
}

And for a multi-array:

multiArrayType {
shape: 125
shape: 13
shape: 13
dataType: DOUBLE
}

These are fixed-size inputs and outputs. It’s also possible the multi-array doesn’t have any
shape fields at all. The output will still always have the same shape, but the converter wasn’t
able to figure out what that is. The final shape won’t be known until you actually make a
prediction.

I suggest that when you convert the model to Core ML you specify the smallest possible
input size that the model will allow or that you intend to use in your app. The smaller the
input, the faster the model runs.

In addition, most models already have a natural limit on how small the input can be. For
example, if a convnet has five pooling layers that each halve the feature map size, the
minimum input size is 32×32 pixels. For smaller inputs, the model will eventually end up
with feature maps that are 0×0 pixels or that have negative dimensions… Not a good idea.

Make it flexible

The fixed input size that you specify during conversion is treated as the “default” size for
that input. To add options for flexibility, you can edit the spec yourself but it’s easier to use
the functions from coremltools.models.neural_network.flexible_shape_utils.

The “flexible shape utils” functions work on spec objects, so grab the spec for your
converted model:

import coremltools as ct
from coremltools.models.neural_network import flexible_shape_utils
spec = ct.utils.load_spec("YourModel.mlmodel")

What you need to do next depends on whether the model inputs are images or multi-arrays.
First, let’s look at images.

Image inputs and outputs

Range of image sizes

Let’s say your model has an input named "image" and you want it to handle images from
anywhere between 128×128 pixels and 512×512 pixels. Given the spec object, write the
following:

img_size_ranges = flexible_shape_utils.NeuralNetworkImageSizeRange()
img_size_ranges.add_height_range((128, 512))
img_size_ranges.add_width_range((128, 512))
flexible_shape_utils.update_image_size_range(spec, feature_name="image",
size_range=img_size_ranges)

First you create a NeuralNetworkImageSizeRange object and then add a range for the height
and a range for the width. The range is a tuple, e.g. (128, 512) where 128 is the lower bound
and 512 is the upper bound.

With that NeuralNetworkImageSizeRange object, you call update_image_size_range() to change
the spec for the specified input, in this case "image".

If you don’t want there to be an upper bound, specify -1. For example, image classifiers
trained with Create ML have a width and height range of (299, -1), meaning that the image
must be at least 299×299 pixels but can also be larger. (And the width doesn’t have to be
the same as the height.)

Note that -1 is only valid for the upper bound, not the lower bound. So there must always
be a minimum image width or height that is greater than 0.

Now if you look at the spec.description, the type for this input still has the fixed default size
of 256×256, but it also includes an imageSizeRange with the size flexibility settings:

imageType {
width: 256
height: 256
colorSpace: BGR
imageSizeRange {
widthRange {
lowerBound: 128
upperBound: 512
}
heightRange {
lowerBound: 128
upperBound: 512
}
}

The default input size, here 256×256, must be within the size range. The Core ML compiler
gives an error message if it isn’t.

That’s really all you need to change in the spec to make this work, so you could also have
done this by modifying the spec directly by hand. Still, I recommend using the
flexible_shape_utils functions because they do error checking that can help prevent mistakes. (It
also bumps the mlmodel specification number to 3, as this functionality is only supported
by iOS 12 and macOS 10.14 and later.)

Now you can perform predictions on images that are anywhere between 128 and 512 pixels
wide and tall. If you try to make a prediction with an image that is not inside this range,
Core ML will throw an error like this one:

Input image feature image does not match model description ...
Image width (100) is not in allowed range (128..512)

If your model is an image-to-image model, you can call update_image_size_range() on the
image output too:

flexible_shape_utils.update_image_size_range(spec, feature_name="generatedImage",
size_range=img_size_ranges)

That said, I've found that Core ML doesn't particularly care about this setting, so it's really
more of a hint to the developer who will be using the model.

Enumerated image sizes

When you specify a range between 128×128 and 512×512 pixels, the model needs to be able
to work with any image size inside that range, such as 412×131 or whatever. Not all model
designs are that pliable.

If you want to offer a variety of input sizes but restrict this to only a few choices, it’s better
to explicitly list the allowed sizes. Like so:

image_sizes = [
flexible_shape_utils.NeuralNetworkImageSize(128, 128),
flexible_shape_utils.NeuralNetworkImageSize(256, 256),
flexible_shape_utils.NeuralNetworkImageSize(512, 512),
]
flexible_shape_utils.add_enumerated_image_sizes(spec, feature_name="image",
sizes=image_sizes)

First you define the list of sizes you want to allow, then call add_enumerated_image_sizes()
for the inputs and/or outputs this applies to.

In the spec.description it now looks like this:

imageType {
width: 256
height: 256
colorSpace: BGR
enumeratedSizes {
sizes {
width: 128
height: 128
}
sizes {
width: 256
height: 256
}
sizes {
width: 512
height: 512
}
}
}

Instead of an imageSizeRange field there is enumeratedSizes. The Core ML model will only
accept images with these dimensions. If you try to make a prediction on an image that is
not 128×128, 256×256 or 512×512, Core ML will give an error:

Input image feature image does not match model description ...
Image size 100 x 100 not in allowed set of image sizes

Note: If the default image size you specified during the initial conversion is not in the list,
it is added to the enumeratedSizes anyway. Otherwise you wouldn’t be able to use it.

Multi-array inputs and outputs

Flexible sizes are not just for images but also for MultiArray inputs and outputs. Again, you
can either specify a range or a list of allowed shapes.

Tip: If you mess up, you can remove the flexibility options for an input or output with:

spec.description.input[0].type.multiArrayType.ClearField("ShapeFlexibility")

Range of array shapes

Whereas an image size must have a range for the width and height, a multi-array must
describe a range for the channel dimension also:

shape_range = flexible_shape_utils.NeuralNetworkMultiArrayShapeRange()
shape_range.add_channel_range((125, 125))
shape_range.add_height_range((13, 21))
shape_range.add_width_range((13, 21))
flexible_shape_utils.update_multiarray_shape_range(spec, feature_name="grid",
shape_range=shape_range)

If your model uses any of the Core ML 3 dynamic layer types, use the following code instead:

flexible_shape_utils.set_multiarray_ndshape_range(spec, feature_name="grid",
lower_bounds=[125, 13, 13],
upper_bounds=[125, 21, 21])

Note: Just like for images, you can put -1 for the upper bound, which means there is no
limit to how high it can go.

The spec.description for this multi-array is as follows:



multiArrayType {
shape: 125
shape: 13
shape: 13
dataType: DOUBLE
shapeRange {
sizeRanges {
lowerBound: 125
upperBound: 125
}
sizeRanges {
lowerBound: 13
upperBound: 21
}
sizeRanges {
lowerBound: 13
upperBound: 21
}
}
}

The shapeRange field has three sizeRanges. The first is for the channels, the second for the
array height, and the third for the array width.

Note: If your data is one-dimensional, i.e. a vector, you should only provide a shape for the
channel dimension, not for the height and width.

In a model with NonMaximumSuppression, such as the YOLO object detector models produced
by Turi Create, the output is a multi-array of shape (unknown, 4). In Xcode this is shown as
“Double 0 x 4”. In the spec it looks like this:

multiArrayType {
dataType: DOUBLE
shapeRange {
sizeRanges {
upperBound: -1
}
sizeRanges {
lowerBound: 4
upperBound: 4
}
}
}

There are two sizeRanges, so the multi-array also has two dimensions. The upperBound of the
first dimension is given as -1. This means the size of this dimension is unknown (the lower
bound is 0 by default). The lowerBound and upperBound of the second dimension are both 4,
so that the size of this dimension is always 4.

What does it mean for the first dimension to be unknown? The NonMaximumSuppression model
can return a varying number of bounding box predictions, so the size of this dimension
is going to be different for every prediction. The second dimension contains the four
coordinates of the bounding box.

Enumerated shapes

Instead of a range you can also limit the multi-array to a few predefined shapes:

array_shapes = [
flexible_shape_utils.NeuralNetworkMultiArrayShape(125, 13, 13),
flexible_shape_utils.NeuralNetworkMultiArrayShape(125, 21, 21)
]
flexible_shape_utils.add_enumerated_multiarray_shapes(spec, feature_name="grid",
shapes=array_shapes)

If your model uses any of the Core ML 3 dynamic layer types, use the following code instead:

shapes = [ (125, 13, 13), (125, 21, 21) ]

flexible_shape_utils.add_multiarray_ndshape_enumeration(spec, feature_name="grid",
    enumerated_shapes=shapes)

Now the spec has an enumeratedShapes field:

multiArrayType {
shape: 125
shape: 13
shape: 13
dataType: DOUBLE
enumeratedShapes {
shapes {
shape: 125
shape: 13
shape: 13
}
shapes {
shape: 125
shape: 21
shape: 21
}
}
}

Saving the model

When you’re done, save the spec to an mlmodel file:

ct.utils.save_spec(spec, "YourNewModel.mlmodel")

However, I suggest turning the spec into an MLModel object first. This runs the Core ML
compiler, which will tell you if something went wrong.

mlmodel = ct.models.MLModel(spec)
mlmodel.save("YourNewModel.mlmodel")

Even though the flexible_shape_utils functions may not have complained when you chose
the new sizes, running the Core ML compiler is the only way to know for sure you didn’t
make any mistakes. For example, I got the error:

Error compiling model: "compiler error: Invalid data blob: 'pool5/7x7_s1'
shape (C,H,W = 1024,-2,-2) for output of layer: 'pool5/7x7_s1'.

The pool5/7x7_s1 layer does pooling with window size 7×7 and then feeds the results into
a flatten layer and an inner product layer. This works fine if the feature map at this point
really is 7×7 pixels but not if it’s smaller. I apparently specified an input size that was too
small, resulting in a nonsensical feature map of -2 × -2 pixels. (Images that are too large
seem to be OK with Core ML, even though I would consider this an error too. The pooling
layer now outputs too much data that simply gets ignored by the inner product layer.)

Even when you get no errors, it’s a good idea to check the output shapes of the model at
this point, to make sure they still correspond to those from the original model for the given
input sizes. Keep in mind that the output shapes reported by the Core ML compiler are for
the smallest allowed size.

Note: Even if you succeed in making the input size for your model flexible, it may still
happen that the output is always the same size regardless of the input image measurements.
There are certain layers in Core ML that have their output shape hardcoded, such
as a deconvolution layer, a bilinear resizing layer, or a crop layer. The targetSize for these
layers is always a constant in the mlmodel spec, and I’m not aware of any techniques that let
you dynamically change it. Unlike in TensorFlow you can’t say, “I want to resize the feature
map to the dimensions of this other layer.” As of Core ML 3, there are certain operations
that can dynamically adapt to the shape of your tensors, but this is not true for all of them.

What does Vision do?

When you use Core ML directly, you are responsible for resizing the input image before
calling model.prediction(...). With size flexibility, the size of the image must be inside the
allowed range or from the enumeration of supported sizes.

One of the benefits of using Vision is that it automatically resizes the image for you. That
brings up a question: which input size does Vision use?

I ran some experiments but found it hard to get a consistent answer. Sometimes Vision
would use the default image size, but other times it would use the maximum allowed size,
while still keeping the aspect ratio intact. So I guess the answer is that Vision will use
whatever image size it feels is appropriate.
Using the MLModel API
Most of the time you’ll want to use Core ML through the Vision framework, or through the
automatically generated classes. However, you can also directly use the MLModel API. That’s
useful when you’ve downloaded the mlmodel on-the-fly, for example.

This chapter shows how to directly use the MLModel API to interact with your Core ML
models. It’s not really that hard, but the code can get a bit more cumbersome — which
is why Xcode auto-generates that source file in the first place.

Disable the auto-generated file

First, disable the generation of the source file since you won’t need it. Go to your project’s
Build Settings, and set CoreML Model Class Generation Language to None.

You’ll notice that Xcode’s mlmodel viewer no longer has a button to view the generated
code. It should say “Model class generation disabled”.

Load the model

The primary object you’ll be using to interact with your Core ML model is MLModel. Before
you can do anything, you first need to create an MLModel instance. For example:

import CoreML

func loadModel() -> MLModel? {
    if let url = Bundle.main.url(forResource: "YourModel",
                                 withExtension: "mlmodelc") {
        do {
            let config = MLModelConfiguration()
            config.computeUnits = .all
            return try MLModel(contentsOf: url, configuration: config)
        } catch {
            print("Error loading model: \(error)")
        }
    }
    return nil
}


The loadModel() function tries to locate YourModel.mlmodelc in the app bundle and then
uses this URL to instantiate the MLModel object. Note that you’re using mlmodelc here, the
compiled version of the model.

You can also pass in an MLModelConfiguration object, which determines whether the model
will be allowed to run on the CPU, the GPU, and/or the Neural Engine.

Is the model OK?

Once you have an MLModel object, it’s a good idea to examine the MLModelDescription to make
sure that the model’s inputs and outputs are indeed what your code expects.

This is a defensive programming measure. Unlike Swift code, which is statically typed, Core
ML models are totally dynamic. There is no guarantee that the MLModel you’re loading is
compatible with the code that tries to use the model — especially if you download the
mlmodel from a server at runtime.

You can read more about this in the chapter Inspecting the Model at Runtime.
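
A quick sanity check could look something like this (a sketch; the expected names and types
are assumptions, so match them to your own model):

let description = model.modelDescription
let hasImageInput = description.inputDescriptionsByName["image"]?.type == .image
let hasLabelOutput = description.outputDescriptionsByName["classLabel"]?.type == .string
if !hasImageInput || !hasLabelOutput {
    // Bail out, or fall back to a model you know is compatible.
    print("Model does not match what this code expects:", description)
}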

Make a prediction

Let’s suppose this mlmodel takes an image as input. You will need to provide the input
image as a CVPixelBuffer object. One way is to use a helper method from CoreMLHelpers95 :

if let image = UIImage(named: "sunflower.jpg"),
   let pixelBuffer = image.pixelBuffer(width: 227, height: 227) {
    /* TODO: make the prediction */
}

Note: As of iOS 13 and macOS 10.15, you can also use a new API on MLFeatureValue to convert
a CGImage object, or a PNG or JPEG file, into a new CVPixelBuffer object. Just as easy!

You cannot directly give this CVPixelBuffer to MLModel. All inputs and outputs must be
wrapped inside MLFeatureValue objects and these must be placed inside an MLFeatureProvider
object in turn.
95 https://github.com/hollance/CoreMLHelpers

You could create your own class that implements the MLFeatureProvider protocol. It’s the job
of this class to provide the inputs to the MLModel. For more details on how this works, see
the chapter Understanding the Xcode-generated File.

Rather than create your own subclass, it’s easier to use MLDictionaryFeatureProvider, a
convenience class that is already provided by Core ML. You use it like this:

let inputs: [String: Any] = [ "image": pixelBuffer ]

let options = MLPredictionOptions()
options.usesCPUOnly = false

if let provider = try? MLDictionaryFeatureProvider(dictionary: inputs),
   let result = try? model.prediction(from: provider, options: options) {
    /* do something with the results */
}

The inputs variable is a dictionary with one entry for each input that the model expects.
This model has only one input, named "image", and you assign it the CVPixelBuffer object
containing the input image.

From this dictionary you create the MLDictionaryFeatureProvider and pass that to MLModel’s
prediction(from:options:) method. And that’s all you need to do to make a prediction!

Note: If your model has a different input name, or has more than one input, then adjust
your inputs dictionary accordingly. Make sure to supply values for all the inputs and in the
proper datatype. For images this should be CVPixelBuffer, for multi-dimensional tensors
it should be MLMultiArray, for sequences it should be MLSequence, and for single values it
should be Double, Int64, or String. See the MLFeatureType enum for the full list of supported
data types.

Look at the results

After the prediction completes, the results are given to you inside a new MLFeatureProvider
object.

For this particular model, which is a classifier, the results feature provider contains two
values, classLabelProbs with the predicted probabilities for each class and classLabel with
the name of the highest scoring class.

Note: The names of these outputs will vary between models. They probably have different
names in your own mlmodel. To find out what the names are, double-click the mlmodel
file to open it in Xcode and look under Model Evaluation Parameters. You can also write
print(result.featureNames).

To read the value of an output, you can do the following:

let bestLabel = result.featureValue(for: "classLabel")!.stringValue
let probs = result.featureValue(for: "classLabelProbs")!.dictionaryValue
            as! [String : Double]

When you do result.featureValue(for:) it doesn’t return the actual object, but an in-
stance of MLFeatureValue. You have to ask that for the proper object using .stringValue
and .dictionaryValue like you did here, or using .multiArrayValue when the output is an
MLMultiArray or .imageBufferValue when it’s an image. See the MLFeatureValue documenta-
tion for more details.

The above works but it’s a little ugly. The reason Core ML uses these MLFeatureProvider and
MLFeatureValue objects is so that mlmodels can handle a variety of input and output types.
But it does require a little more effort on your end to read the results.

It turns out that result is really another MLDictionaryFeatureProvider object. So you can also
do it like this:

if let outputs = result as? MLDictionaryFeatureProvider,
   let probs = outputs["classLabelProbs"]?.dictionaryValue as? [String: Double],
   let bestLabel = outputs["classLabel"]?.stringValue {
    /* do something with probs and bestLabel */
}

And that’s it! As you can see, it’s not really that much work to make predictions directly
with MLModel. The only nasty bit is reading the results from the model’s outputs.
Vision FeaturePrint
When you train an image classifier with Create ML, the resulting Core ML model is really
small, only a few KB — yes, that’s correct: kilobytes!

Other models are easily tens or hundreds of megabytes, so how come Create ML can create
such a tiny model?

The mlmodel file that is produced by Create ML only includes the learned weights for the
classifier, which is just a logistic regression. If your classifier knows about 20 classes, then it
has 20 × 2048 = 40,960 weights. These are stored as double precision floating point numbers,
8 bytes per weight, so that comes to about 320 KB total.

The 2048 input features for the logistic regression are provided by Vision FeaturePrint.Scene,
a model that is built into iOS 12 and macOS 10.14. The reason the mlmodel is so small is
that the weights for the neural network that performs feature extraction are not included
in the model file — they’re baked into Core ML instead.

An image classifier trained with Create ML looks like this in Netron:

The mlmodel is a pipeline classifier consisting of two parts:

1. The Vision FeaturePrint.Scene feature extractor.

2. The GLMClassifier that performs the logistic regression on the extracted features.

Let’s take a look at the spec of this Create ML model:


import coremltools as ct
spec = ct.utils.load_spec("Your_CreateML_Model.mlmodel")
print(spec.pipelineClassifier.pipeline.models[0].WhichOneof("Type"))

This prints out visionFeaturePrint. In Model.proto you can see that this corresponds to a
model of the type CoreMLModels.VisionFeaturePrint.

The full definition of this model type is in VisionFeaturePrint.proto96. There really isn't
much to see here. The VisionFeaturePrint model has a type, which currently has two op-
tions, Scene and Objects. The comment in the proto file says, "Scene extracts features useful
for identifying contents of natural images in both indoor and outdoor environments."
96 https://github.com/apple/coremltools/blob/master/mlmodel/format/VisionFeaturePrint.proto

It’s likely Apple will add other types of feature extractors to Vision FeaturePrint in the
future, for use with different types of images, such as drawings or perhaps camera depth
maps or who knows what.

The Scene model type also has a version number. If a future OS version includes an
improved Vision FeaturePrint model, this version number gets bumped. To take advantage
of these improvements, you’ll need to retrain your classifier.

So the proto file doesn’t tell us much about the insides of Vision FeaturePrint. That’s too
bad — but it’s also not so strange, as this is a proprietary model that is trained by Apple
and they have no reason to share these details with us. Just think of Vision FeaturePrint as
a version of ResNet, VGG, or Inception that is built into the operating system.

We may not be able to look at the insides, but we can certainly look at the expected inputs
and outputs. print(spec.pipelineClassifier.pipeline.models[0].description) will print:

input {
name: "image"
type {
imageType {
width: 299
height: 299
colorSpace: BGR
imageSizeRange {
widthRange {
lowerBound: 299
upperBound: -1
}
heightRange {
lowerBound: 299
upperBound: -1
}
}
}
}
}
output {
name: "sceneprint"
type {
multiArrayType {
shape: 2048
dataType: DOUBLE
}
}
}

As you can see here, the input image is supposed to be 299×299 pixels or larger and the
output is a 2048-element vector. So this model takes in an input image and extracts 2048
features.

The logistic regression classifier is then trained on these 2048 features. You can see this with
print(spec.pipelineClassifier.pipeline.models[1].description). The input to the classifier
is indeed a 2048-element MultiArray.

Create ML is not the only tool that can create image classifiers on top of Vision
FeaturePrint. You can also use it as the feature extractor for Turi Create. Simply specify
model="VisionFeaturePrint_Scene" when you call tc.image_classifier.create().

By the way, just because it has “Vision” in its name doesn’t mean you’re required to use the
Vision framework to run this model. You can use it with the regular Core ML API too.

Using Vision FeaturePrint without Create ML or Turi

Here’s a nice hack for when you don’t want to use Create ML or Turi Create to train your
model, but you still want to use Vision FeaturePrint as the feature extractor.

First, create a new Core ML model that consists of just VisionFeaturePrint and use this to
extract features for all your training and validation images. The easiest way to do this is to
take a Create ML-trained model and save the first part of the pipeline as a model of its own:

ct.utils.save_spec(spec.pipelineClassifier.pipeline.models[0],
"FeatureExtractor.mlmodel")

Note: If you don’t have a Create ML-trained image classifier lying around, just create one
with a few bogus images. We don’t care about the classifier part here, so it’s fine if the
trained model doesn’t work very well.

Next, load this FeatureExtractor.mlmodel into a new MLModel object and call model.predict()
on all your training images. For each image this returns a dictionary with a 2048-element
NumPy array. Save the contents of this array to a file or stick them all together into one
large array.
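
Here's a sketch of that feature extraction step in Python (coremltools can only run
predictions on macOS; image_paths is assumed to be your own list of training image files,
and the input/output names come from the spec shown earlier):

import coremltools as ct
import numpy as np
from PIL import Image

model = ct.models.MLModel("FeatureExtractor.mlmodel")

features = []
for path in image_paths:
    img = Image.open(path).convert("RGB").resize((299, 299))
    out = model.predict({"image": img})      # {"sceneprint": array of 2048 floats}
    features.append(out["sceneprint"])

np.save("features.npy", np.array(features))  # shape: (num_images, 2048)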

If you don’t want to do the feature extraction in Python, you can also write a Swift app to do
this using the MLModel API. That is in fact what Turi Create does when it extracts features
during its training process.

Now you can train your model on top of these features, using whatever training tool you
prefer. When you’re done, convert your model to Core ML. It should take a MultiArray of
size 2048 as input, with the name "sceneprint".

Finally, create a new pipeline classifier model that combines VisionFeaturePrint with your
own model into a new mlmodel file. See the chapter Building Pipeline Models for more
info.

Even simpler: VNGenerateImageFeaturePrintRequest

On iOS 13 and macOS 10.15, Vision has a new request type, VNGenerateImageFeaturePrintRequest.
This simply runs the built-in Vision FeaturePrint.Scene model on your image and returns
the features in a VNFeaturePrintObservation object.

You can access the extracted features as a Data object, but there is also a computeDistance()
method for comparing how similar two observations are. This sample code97 shows how
to use these new APIs to analyze the similarity of images.
97 https://developer.apple.com/documentation/vision/analyzing_image_similarity_with_feature_print
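
A sketch of how you might use it (imageA and imageB are assumed to be CGImage objects you
already have):

import Vision

func featurePrint(for image: CGImage) throws -> VNFeaturePrintObservation? {
    let request = VNGenerateImageFeaturePrintRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

// The smaller the distance, the more similar the two images are.
if let printA = try? featurePrint(for: imageA),
   let printB = try? featurePrint(for: imageB) {
    var distance: Float = 0
    try? printA.computeDistance(&distance, to: printB)
    print("distance:", distance)
}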

Vision FeaturePrint.Objects

As of iOS 14 and macOS 11.0, there is a new Vision FeaturePrint.Objects model that is
optimized for extracting features that are useful for object detection. It’s used by Create
ML when you train an object detector with transfer learning enabled.

This new Objects model works on a 299×299 input image and outputs two multi-arrays of
shape (288, 35, 35) and (768, 17, 17), respectively. These are not bounding box predictions
yet, only “raw” features.

A full object detector, such as the one made by Create ML, still needs to add logic to turn
these features into bounding boxes and class labels, just like a classifier model needs to add
an extra model on top of VisionFeaturePrint.Scene to make its predictions.
Using Sequences
Note: Sequences are something you’d use with recurrent neural networks such as LSTM
and GRU. I haven’t done much work yet with sequences and Core ML, so currently I don’t
have a lot to say about it. But it’s definitely a topic I want to cover in this book eventually. In
the mean time I will refer you to the documentation for MLSequence and the article Making
Predictions with a Sequence of Inputs98 .

The book Machine Learning by Tutorials99 that I co-wrote has a number of chapters devoted
to dealing with sequence models. Worth reading!
98 https://developer.apple.com/documentation/coreml/core_ml_api/making_predictions_with_a_sequence_of_inputs
99 https://store.raywenderlich.com/products/machine-learning-by-tutorials

Creating Your Own Custom Layers
A small disadvantage of Core ML is that it only supports a limited number of neural network
layer types. New layer types are only added with new OS releases, roughly once a year.

If you're trying to convert a model with an operation that is not in NeuralNetwork.proto100,
you've got a problem.
100 https://github.com/apple/coremltools/blob/master/mlmodel/format/NeuralNetwork.proto

Fortunately, as of mlmodel version 2, Core ML allows models to have custom layers.

Think of a custom layer as being a placeholder inside the mlmodel file. If some operation in
your original model isn’t officially supported by Core ML, you can add such a placeholder
instead of the original operation. Your Swift or Objective-C app must then provide the code
that performs this operation.

Note: The mlmodel file will not contain any executable code. When you create a Lambda
layer with Keras, the code from that layer is serialized into the .h5 file. That does not happen
with Core ML: the custom layer only says, “I am a placeholder for such-and-such operation.”
You must always provide the code for this custom layer in the app itself; it is not part of the
model file.

CustomLayerParams

Inside the mlmodel file, every neural network layer is represented by a message type from
NeuralNetwork.proto. For a convolution layer there is a ConvolutionLayerParams object, for
a pooling layer there is a PoolingLayerParams object, and so on.

If your model has a layer type that is unsupported by Core ML, that layer should be
described in the mlmodel by a CustomLayerParams object. That’s the placeholder that makes
a promise to Core ML that you will provide an implementation for this layer.

The CustomLayerParams message definition looks like this:


message CustomLayerParams {
string className = 10;
repeated WeightParams weights = 20;
map<string, CustomLayerParamValue> parameters = 30;
string description = 40;
}

The bare minimum you have to do to add a custom layer to your model, is create a
CustomLayerParams object and fill in the className. That is the name of the Swift or Objective-
C class that implements the logic for this layer. This class must conform to the MLCustomLayer
protocol. Later on in this chapter you’ll see an example of how to write such a class.

Let’s say you’re converting a model that uses a hot new technique called “inside-out
convolution”. If you were adding the custom layer by hand to the spec, you’d write this:

new_layer = spec.neuralNetwork.layers.add()
new_layer.name = "custom1"
new_layer.input.append("previous_layer_output")
new_layer.output.append("custom1_output")
new_layer.custom.className = "InsideOutConvolution"

The line that fills in new_layer.custom.className is what turns this into a custom layer. It just
says, “this layer will be implemented by the class named InsideOutConvolution.”

It’s also useful to fill in the description field:

new_layer.custom.description = "Inside-out convolution as described in the paper https://arxiv.org/abs/1812.1234"

Now if you save this spec and open the mlmodel in Xcode, it shows a new Dependencies
section:

Note: You must also set the spec.specificationVersion to at least 2 because custom layers
are not supported by mlmodel version 1. Xcode doesn’t appear to care about this version
number, but at least Core ML will now prevent you from running this kind of model on iOS
versions lower than 11.2.

Most of the time you won’t be adding the custom layers to the spec by hand, but as part
of the conversion process. The Core ML converters can automatically add custom layers
whenever they encounter an unsupported operation, but you still need to help them out a
little…

Using a converter

Both the Keras and tfcoreml converters have an add_custom_layers option. Normally, any
unknown operations or layer types will result in an error but with add_custom_layers=True,
they will be assigned a CustomLayerParams placeholder instead:

coreml_model = ct.converters.keras.convert(keras_model,
...other settings...,
add_custom_layers=True,
custom_conversion_functions={})

You still need to fill in the className field for these custom layers. One way to find the custom
layers in the converted model is by doing the following after the conversion has finished:

for i, layer in enumerate(coreml_model._spec.neuralNetwork.layers):
    if layer.WhichOneof("layer") == "custom":
        print("layer index:", i)
        print(layer)

This may print something like the following:

layer index: 17
name: "custom1"
input: "layer16_output"
output: "custom1_output"
custom {
}

Apparently layer 17 is an unsupported operation. To see which operation this is, you can
look at the original model and compare it to the converted mlmodel, for example with
Netron. Some converters, such as tfcoreml, will conveniently display which custom layers
were added during the conversion process.

Because there is no className inside the custom { } section, you can fill it in by writing:

layer = coreml_model._spec.neuralNetwork.layers[17]
layer.custom.className = "InsideOutConvolution"
layer.custom.description = "A useful description"

Note: If your model is a classifier, you should use _spec.neuralNetworkClassifier instead of
just neuralNetwork to access the layers.

It can be a little cumbersome to fill in the class names for these custom layers by hand
afterwards, especially if there is more than one. The Keras and TensorFlow converters
also allow you to specify a custom_conversion_functions dictionary. As soon as the converter
encounters an unsupported operation, it will look up that operation in the dictionary and
call the conversion function that you provided for it (if any).

Let’s say you have a Keras model with a custom layer subclass:

class InsideOutConvolution(keras.layers.Layer):
    def __init__(self): ...
    def compute_output_shape(self, input_shape): ...
    def call(self, x): ...

When converting this model you can now write:

coreml_model = ct.converters.keras.convert(keras_model,
...other settings...,
add_custom_layers=True,
custom_conversion_functions={ "InsideOutConvolution": convert_insideout })

The converter will call the convert_insideout() function whenever it encounters a layer of
type InsideOutConvolution inside the original model. That function should look something
like this:

def convert_insideout(layer):
    print("converting layer '%s'" % layer.name)
    custom = ct.proto.NeuralNetwork_pb2.CustomLayerParams()
    custom.className = "InsideOutConvolution"
    custom.description = "A useful description"
    return custom

As before, you fill in the className and description. The difference is that this time you don’t
do this directly in the layer but in a new CustomLayerParams object.

This approach also works for Keras Lambda and Activation layers. For a Lambda layer, you
would write:

custom_conversion_functions={ "Lambda": convert_lambda }

If you have more than one Lambda layer, this will use the same convert_lambda function
for all of them. Inside the conversion function you can look at layer.function to distinguish
between the different Lambda layers.

For custom Activation layers, such as Activation(my_activation), you would use
"my_activation" as the dictionary key.

In tfcoreml, the dictionary key is an operation name from the graph (not a tensor name, so
without the :0 at the end). See this example notebook101 for a good example of how to use
custom layers with tfcoreml.

Note: The Caffe converter does not let you convert models with custom layers. This is
because Caffe is written in C++. To add a custom layer to a Caffe model, you need to fork
Caffe and write a bunch of C++ code. The converter from coremltools can only convert the
layer types from the official Caffe version. If you want to convert a Caffe model with custom
layers, you should probably write your own converter or hack the code from coremltools.

If you’re using NeuralNetworkBuilder, you can use add_custom() to add a new custom layer.
As before, it involves filling in a CustomLayerParams object:

custom = ct.proto.NeuralNetwork_pb2.CustomLayerParams()
custom.className = "InsideOutConvolution"
custom.description = "A useful description"

builder.add_custom(name="custom1",
input_names=["previous_layer_output"],
output_names=["custom1_output"],
custom_proto_spec=custom)

Parameters and weights

Besides className and description, the CustomLayerParams type has two other fields:

• weights for any learned parameters

• parameters for settings that are not learned, such as hyperparameters


101 https://github.com/tf-coreml/tf-coreml/blob/master/examples/custom_layer_examples.ipynb

For our hypothetical inside-out convolution, the parameters could store the size of the
convolution kernel or the amount of padding to add around the input. It’s a dictionary
where the keys are always strings and the values can be double, string, 32-bit integer, 64-
bit integer, or boolean.

For example, if custom is the CustomLayerParams for this layer, you can write:

custom.parameters["kernelHeight"].intValue = 7
custom.parameters["padding"].boolValue = True
custom.parameters["method"].stringValue = "clockwise"

You shouldn't store loads of data inside the parameters dictionary; it's really only for
configuring the layer.

If the layer has learned weights, these go into the weights field. You can also use this to store
other large amounts of data. Suppose orig_weights is a NumPy array containing the weights
for this layer in the original model. Then you can write:

weights = custom.weights.add()
weights.floatValue.extend(orig_weights.flatten())

The NumPy array must be made one-dimensional before you can put it in CustomLayerParams.
This is also a good place to transpose the weights if necessary. Note that Core ML doesn’t
care at all what you put inside the weights array. How you organize this data is totally up
to you.

The weights are stored in a WeightParams object. This is the same structure that is used by
Core ML’s own layers. You have a few choices for how to store the data:

• floatValue: uses 32-bit floating points (single precision)

• float16Value: uses 16-bit floating points (half precision)

• rawValue: uses whatever format you want, this is simply an array of bytes

• quantized

If the weights are originally 32-bit floats and you want to store them as 16-bit floats, you
will have to do the conversion yourself, because quantization_utils.quantize_weights()
ignores custom layers. It's pretty easy with NumPy:

weights.float16Value = orig_weights.flatten().astype(np.float16).tobytes()

Likewise, the quantization utilities will skip custom layers. If you want to quantize these
weights, you're on your own — although you can use the private functions from
quantization_utils.py102, such as _quantize_wp_field(), to do most of the work.

Tip: If your custom layer has more than one set of learned parameters, such as biases in
addition to the regular weights, then simply call custom.weights.add() a second time. You
can add as many weight arrays as you want.

The MLCustomLayer class

Before you can use a Core ML model with a custom layer in your app, you first need to im-
plement a class for the custom layer. After all, the mlmodel file only has CustomLayerParams
placeholders that just contain the data that is needed by the layer, but it doesn’t actually
know what these layers do.

You must provide implementations for all the classes shown in the Dependencies section of
the Xcode mlmodel viewer. Without this, the app will compile fine but trying to instantiate
the Core ML model will give an error message like the following:

A Core ML custom neural network layer requires an implementation named
'InsideOutConvolution' which was not found in the global namespace.
[coreml] Error creating Core ML custom layer implementation from factory
for layer "InsideOutConvolution".
[coreml] Error in loading user-provided custom layer during neural network
evaluation.

Here is a bare bones version of such a class:

102 https://github.com/apple/coremltools/blob/master/coremltools/models/neural_network/quantization_utils.py

import CoreML

@objc(InsideOutConvolution) class InsideOutConvolution: NSObject, MLCustomLayer {

    required init(parameters: [String : Any]) throws {
        print(#function, parameters)
        super.init()
    }

    func setWeightData(_ weights: [Data]) throws {
        print(#function, weights)
    }

    func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws -> [[NSNumber]] {
        print(#function, inputShapes)
        return inputShapes
    }

    func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
        print(#function, inputs.count, outputs.count)
    }

    /* This method is optional:
    func encode(commandBuffer: MTLCommandBuffer,
                inputs: [MTLTexture], outputs: [MTLTexture]) throws {
        print(#function, inputs.count, outputs.count)
    }
    */
}

The class needs to extend NSObject, have an @objc() modifier to make it visible to the
Objective-C runtime, and implement the MLCustomLayer protocol. This protocol consists of
four required methods and one optional:

• init(parameters:). The constructor. This gets the parameters dictionary with the addi-
tional configuration options for the layer.

• setWeightData(). Tells the layer what its learned weights are. Even if the layer has no
weights, this is still called but the array is empty.

• outputShapes(forInputShapes:). This determines how the layer modifies the size of the
input data.

• evaluate(inputs:outputs:). This performs the actual computation — it is where the
magic happens! This method is required. It gets called when the model is run on the
CPU.

• encode(commandBuffer:inputs:outputs:). This method is optional. It also implements the
layer's logic but this time on the GPU.

So there are two different functions that provide the implementation of the layer: one for
the CPU and one for the GPU. The CPU method is required — you must always provide at
least a CPU version of your layer. The GPU method is optional, but recommended.

Note: Currently there is no method that lets you provide a version of the layer for the Neural
Engine, as there is no public API for programming the Neural Engine. In fact, it appears that
Core ML will not run the model on the Neural Engine at all if it has a custom layer! If using
the Neural Engine is important to your app, you may want to rearchitect your model so it
only uses Core ML’s officially supported operations and doesn’t require custom layers.

I always find it useful to add some print statements so I can see what’s going on when the
InsideOutConvolution class is used by Core ML. When running the model, they print the
following:

init(parameters:) ["engineName": InsideOutConvolution, "kernelHeight": 7,
    "method": clockwise, "padding": 1]

setWeightData [80 bytes]

outputShapes(forInputShapes:) [[1, 1, 3, 256, 256]]
outputShapes(forInputShapes:) [[1, 1, 3, 256, 256]]
outputShapes(forInputShapes:) [[1, 1, 3, 256, 256]]

evaluate(inputs:outputs:) 1 1

Obviously, init(parameters:) is called first. Its parameters dictionary contains the values
from the parameters field in the CustomLayerParams, plus an item "engineName" with the name
of the class. At this point, you’d store these parameters in instance variables.

This is followed by a call to setWeightData(), which gets zero or more Data objects. It’s up
to you to read and interpret the values from these Data objects. In this case, the layer
has 20 weights in 32-bit float format, so to read these floating point numbers you would
need to grab a pointer to the Data’s underlying memory buffer and cast this into an
UnsafePointer<Float>.

If the weights were 16-bit floats, you’d need to use a UnsafePointer<UInt16> instead (Swift
has no 16-bit float type). And if the weights were quantized or in another format, you’d
need to decode them yourself at this point.

Then outputShapes(forInputShapes:) is called several times in a row. I’m not sure why it gets
called so often, but it’s no big deal since you typically don’t do a lot of work in that method
anyway. Note that the shapes are given in five dimensions, using the convention:

[ sequence length, batch size, channels, height, width ]

According to the debug output, our custom layer is receiving 1 image of 256×256 pixels
that contains 3 channels. If the layer doesn’t change the shape of the input tensor(s) you
can simply return the input shapes. Otherwise, you have to compute the new shapes for
the output, so that Core ML knows how big it should make the MLMultiArray or MTLTexture
objects that will hold the output from this layer.
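
For instance, a hypothetical layer that upscales its input by a factor of two would return shapes with doubled height and width. A quick sketch:

func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws -> [[NSNumber]] {
  return inputShapes.map { shape in
    // Shape convention: [sequence length, batch size, channels, height, width]
    var newShape = shape
    newShape[3] = NSNumber(value: shape[3].intValue * 2)   // height
    newShape[4] = NSNumber(value: shape[4].intValue * 2)   // width
    return newShape
  }
}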

Note: I find it curious that you don’t have to specify the custom layer’s output shape during
model conversion. The Core ML compiler always prints (-1, -1, -1) for the output shape of a
custom layer because it doesn’t know what the size is. If the converter already knew what
output shape it could expect from the custom layer, it could also verify that this makes sense
for the subsequent layers. But that’s not what happens — Core ML doesn’t know what the
layer’s output size is until it calls outputShapes(forInputShapes:).

Evaluating the layer

The init(parameters:), setWeightData(), and outputShapes(forInputShapes:) methods are
invoked while the Core ML model is being loaded. When you make a prediction, one of
the evaluation methods is called.

You’re only required to provide evaluate(inputs:outputs:). This runs the layer on the CPU.
There is also an evaluation method for the GPU, but that is optional.

Of course, exactly how you implement the evaluate method depends on what your custom
layer does. Here is the simplest possible implementation that simply copies the contents of
the input MLMultiArray to the output MLMultiArray:

func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
  for i in 0..<inputs.count {
    let input = inputs[i]
    let output = outputs[i]

    for j in 0..<input.count {
      let x = input[j].floatValue
      output[j] = NSNumber(value: x)
    }
  }
}

Note that the inputs and outputs are arrays of MLMultiArray objects. That is because some
types of layers may accept multiple inputs and/or produce multiple outputs.

If you’re making a batch prediction, then evaluate() is called once for every image in the
batch. The batch dimension of the MLMultiArray is always 1. So you can’t evaluate the whole
batch at once — the custom layer always sees only the data for a single prediction at a time.

Tip: Usually you wouldn’t implement evaluate() using a for loop. That’s useful for im-
plementing and debugging a first version of the layer’s algorithm, but it’s not exactly a
speed monster. Instead, I suggest using the vectorized CPU functions from the Accelerate
framework. For more information about MLMultiArray, see the chapter Working With
MLMultiArray. For an example of using Accelerate, see Computing the Argmax.
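
As a taste of what that looks like, here is a sketch of the same copy done with a single vectorized call, assuming the multi-arrays hold 32-bit floats:

import Accelerate
import CoreML

func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
  for (input, output) in zip(inputs, outputs) {
    let count = input.count
    let src = input.dataPointer.bindMemory(to: Float.self, capacity: count)
    let dst = output.dataPointer.bindMemory(to: Float.self, capacity: count)
    // cblas_scopy copies `count` floats from src to dst in one vectorized call.
    // A real layer would replace this with the vDSP/BLAS routines for its own math.
    cblas_scopy(Int32(count), src, 1, dst, 1)
  }
}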

For even more speed, you can also implement encode(commandBuffer:inputs:outputs:) to
run the layer on the GPU. In this function you would encode a Metal compute kernel into
the supplied MTLCommandBuffer object. The inputs and outputs are MTLTexture objects of
type .type2DArray with the pixel format .rgba16Float.

Just like with the CPU version, encode(commandBuffer:inputs:outputs:) is invoked for every
image in the batch, so it cannot compute the results for the entire batch at once.

Note: Even if you don’t provide the GPU method for a custom layer, that doesn’t mean
your entire model will now run on the CPU. Core ML can mix-and-match. It will still run as
much of your model on the GPU as possible, switch to the CPU just for your custom layer,
and then switch back to the GPU for the rest of the layers. So even if you don’t have Metal
programming chops, you can still implement your own custom layers. Having only a CPU
implementation doesn’t degrade the performance of the rest of the model.

A detailed example

In this chapter you’ve seen how to fill in the CustomLayerParams placeholder for layer types
that are unsupported by Core ML, and how to write the Swift or Objective-C class that
implements the custom layer’s logic. But it might be useful to see an actual example.

For that I will refer you to my blog post Custom Layers in Core ML103, which shows in detail
how to add a new activation function called “swish” to a Core ML model, complete with a
downloadable demo project.
103 https://machinethink.net/blog/coreml-custom-layers/

You don’t always need a custom layer

Even though Core ML supports only a limited number of layer types, this doesn’t mean you
always need to write a custom layer if your operation isn’t among the happy few. You can
sometimes achieve the same effect by cleverly combining layer types that Core ML does
support.

The ReLU6 activation function, for example, is missing from Core ML. ReLU6 is like a
regular ReLU but clamps the activation values so that they never become larger than 6.
You can implement ReLU6 in the following manner:

1. First, use a regular ReLU layer.

2. Follow this up with a linear activation layer with alpha = -1. This multiplies everything
by -1 to effectively negate the values in the tensor.

3. Next is a UnaryFunctionLayerParams layer of type THRESHOLD that clamps any values less
than -6 to -6.

4. Finally, add another linear activation layer with alpha = -1 to negate the results again.

It’s a bit of a roundabout way to do it, and it requires four layers instead of just one, but it’s
100% Core ML.

It would be pretty straightforward to implement ReLU6 with a custom layer, but then the
model can no longer run on the Neural Engine. So the trade-off is: four layers but Neural
Engine support, or only one layer but no Neural Engine. I’d go for the former, especially
since Core ML might be able to combine these four layers internally into a single operation
anyway.

Note: As of Core ML 3, you can actually implement ReLU6 using a clip layer with minimum
value 0 and maximum value 6.
Creating Your Own Custom Models
A custom model is an mlmodel file that only contains a bunch of (hyper)parameters but no
actual logic. It’s up to your Swift or Objective-C code to implement an MLCustomModel class to
do all the actual work.

There are a couple of benefits to wrapping up your model inside a custom Core ML model:

• It lets the app developer run the model through the usual Core ML API.

• It lets you put custom logic inside a larger pipeline of models. For example, if the model
has three steps A, B, C but Core ML doesn’t support B, you can either split this up into
separate mlmodel files, or easier, use a pipeline with B as a custom model.

My guess is this functionality was added primarily for Turi Create, which uses custom
models for model types that are not directly supported by Core ML, such as their audio
preprocessing and recommender models. (Search the turicreate repo104 for MLCustomModel
to see some examples.)

Custom models require at least iOS 12, watchOS 5, or macOS 10.14.

What can a custom model do?

The definition of custom models is in CustomModel.proto105 . This shows that a custom
model has only three properties:

• className: the name of your MLCustomModel class that implements the logic

• description: an optional description that’s shown in Xcode

• parameters: a dictionary of string keys and CustomModelParamValue objects

Any data that you want to store along with your custom model goes into the parameters
dictionary. These values can have the following data types:
104 https://github.com/apple/turicreate
105 https://github.com/apple/coremltools/blob/master/mlmodel/format/CustomModel.proto


• double (doubleValue)

• string (stringValue)

• int32 (intValue)

• int64 (longValue)

• bool (boolValue)

• bytes (bytesValue)

To store learned weights or any other large blobs of data along with the custom model,
you’d use the bytes type, which as its name implies is an array of byte values.

Let’s build a custom model

Unfortunately, coremltools does not have a CustomModelBuilder that makes it easy to create
custom models. But it’s also not that hard to write out the code by hand.

First we’ll define the model in Python:

import coremltools as ct
import coremltools.proto.FeatureTypes_pb2 as ft

spec = ct.proto.Model_pb2.Model()
spec.specificationVersion = 3

The specificationVersion should be at least 3, as that is the first version of Core ML that
supports custom models.

The model must have some inputs and outputs. Since this is only a simple example, we’re
giving it an input that is a vector of 10 numbers and an output that is a single integer.

new_input = spec.description.input.add()
new_input.name = "vector"
new_input.type.multiArrayType.dataType = ft.ArrayFeatureType.FLOAT32
new_input.type.multiArrayType.shape.extend([10])

new_output = spec.description.output.add()
new_output.name = "count"
new_output.type.multiArrayType.dataType = ft.ArrayFeatureType.INT32
new_output.type.multiArrayType.shape.extend([1])

Next, we tell Core ML this is a custom model:

spec.customModel.className = "MyCustomModel"
spec.customModel.description = "Simple example of a custom model"

Our custom model will simply count the number of values in the input that are over a
certain threshold value. We’ll add this threshold as a parameter to the model:

spec.customModel.parameters["threshold"].doubleValue = 0.5

In case you want to add large amounts of data such as learned weights, you’ll have to
convert it to a bytes array first, and then assign that to bytesValue. We’re not doing this
for our simple model, but the syntax would look like this:

import numpy as np

weights = np.random.randn(100, 50)
spec.customModel.parameters["weights"].bytesValue = bytes(weights)

You can store any binary data you want here. Naturally, if you do this you’ll have to decode
those bytes again in the MLCustomModel class inside your app.

Finally, save the model:

ct.utils.save_spec(spec, "MyAwesomeModel.mlmodel")

Note: There’s a good reason we’re not calling this file MyCustomModel.mlmodel. Just like
any other Core ML model, when you add the mlmodel file into your project, Xcode will
automatically generate a source file for it. The main class in this source file is named after
the mlmodel file. This must obviously be a different name from the customModel.className,
otherwise you end up with two classes with identical names!

Using the custom model in the app

Before you can use the custom model to make predictions, you first have to provide a Swift
or Objective-C implementation of the class MyCustomModel.

@objc(MyCustomModel) class MyCustomModel: NSObject, MLCustomModel {
  let threshold: Double

  required init(modelDescription: MLModelDescription, parameters: [String: Any]) throws {
    threshold = parameters["threshold"] as! Double
  }

The class must have the same name that you put into spec.customModel.className and it
must be visible to Objective-C, hence the @objc. It extends from NSObject and implements
the MLCustomModel protocol.

The initializer takes an MLModelDescription object, which describes the inputs and outputs,
as well as any other metadata that you put into the mlmodel file. We don’t use that here,
but it’s useful in case you want to introspect the model.

init also gets a parameters dictionary with the contents of spec.customModel.parameters.
In our case, that dictionary just has a single element, threshold, that we’ll read and place
in the corresponding variable.
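
If you had also stored binary data under the hypothetical "weights" key from the earlier bytesValue example (an np.random.randn(100, 50) matrix, so raw float64 bytes), you could decode it in this same initializer. A sketch, assuming the bytes arrive as a Data value in the parameters dictionary:

if let data = parameters["weights"] as? Data {
  // Reinterpret the raw bytes as 64-bit floats (NumPy's default dtype).
  let doubles = data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> [Double] in
    Array(raw.bindMemory(to: Double.self))
  }
  print("decoded", doubles.count, "values")   // 5000 for a 100×50 matrix
}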

The second required method in the MLCustomModel protocol is the prediction method that
gets called when the user runs the model:

func prediction(from input: MLFeatureProvider, options: MLPredictionOptions)
    throws -> MLFeatureProvider {
  // Get the input MultiArray
  let vector = input.featureValue(for: "vector")!.multiArrayValue!

  // The model's custom logic
  var count = 0
  for i in 0..<vector.count {
    if vector[i].floatValue >= Float(threshold) {
      count += 1
    }
  }

  // Create a MultiArray for the output
  let countArray = try MLMultiArray([count])
  let outputs = [ "count": countArray ]
  return try MLDictionaryFeatureProvider(dictionary: outputs)
}

The input is an MLFeatureProvider object containing values for all the inputs. Here, we just
have one input named "vector" that is a MultiArray.

There is also MLPredictionOptions that tells you whether Core ML expects to run this model
on the CPU or whether you’re allowed to use the GPU as well. (The model from our example
will always run on the CPU.)

After performing the custom model’s logic, you need to put the output values into a new
MLFeatureProvider object. Here, we use the helper class MLDictionaryFeatureProvider to do
that. For more info on how this works, see the chapter Using the MLModel API.

Note that the variable count is an Int but the model expects the output to be a MultiArray,
so we first put it into a new MLMultiArray object.

To use this model, you could write the following. It creates an array with 10 random
numbers between -1 and +1, and then makes a prediction.

let config = MLModelConfiguration()
let model = try! MyAwesomeModel(configuration: config)

var values = [Float]()
for _ in 0..<10 {
  values.append(Float.random(in: -1...1))
}

if let input = try? MLMultiArray(values),
   let output = try? model.prediction(vector: input) {
  print("The count is:", output.count[0])
}

And that’s all you need to do to implement a custom model!


Building Pipeline Models
A pipeline is nothing more than several models in a row:

input ⟶ Model A ⟶ Model B ⟶ Model C ⟶ output

The pipeline pretends that these different models are just one single model.

The models are connected through their output and input names. For example, if model A
has an output feature_vector then model B needs to have an input named feature_vector.

Why do this?

Pipelines are especially useful for scikit-learn models. A nice example is Sentiment Polarity
CoreML Demo106 by Vadym Markov that performs sentiment analysis on text. The original
scikit-learn model consists of a dictionary vectorizer that turns the words from the text into
numbers, followed by a support vector classifier that makes the actual prediction.

The coremltools scikit-learn converter turns this into the following pipeline model:

text ⟶ [ DictVectorizer ⟶ FeatureVectorizer ] ⟶ GLMClassifier ⟶ predictions

The DictVectorizer and FeatureVectorizer themselves are a pipeline of their own, so this is
a pipeline inside of another pipeline.

But pipelines are not just for scikit-learn… A few other examples:

• When you train an image classifier using Create ML, you get a pipeline model that
consists of the Vision FeaturePrint model that extracts features from the input image,
followed by a GLMClassifier model that makes a prediction using those extracted
features.

• When you create an object detector using Turi Create, you also get a pipeline made up
of two models. The first is TinyYOLO that predicts a large number of bounding boxes
(over 2500). The second model is NonMaximumSuppression, which filters the predictions
and only returns the best bounding boxes.
106 https://github.com/cocoa-ai/SentimentCoreMLDemo


• For on-device training, you can build a pipeline consisting of a pre-trained neural
network for feature extraction, followed by a trainable k-Nearest Neighbors classifier.

Without these pipelines, you’d need to have two or more separate Core ML models and pass
around the data from one model to the next by yourself. It’s much more convenient — and
efficient — to let Core ML handle this.

Creating your own pipeline

The easiest way to connect multiple models into a single pipeline is to use the module
coremltools.models.pipeline.

Let’s build a very simple classifier model with Keras and then connect this to a OneHotEncoder
model in the following pipeline:

input ⟶ OneHotEncoder ⟶ Keras model ⟶ classification

The classifier

First, we’ll create the Keras model. This consists of just a fully-connected layer that expects
7 input values and produces 3 output values. We’ll pretend this is a classifier of some kind,
so the layer gets a softmax activation.

import keras
from keras.models import *
from keras.layers import *

keras_model = Sequential()
keras_model.add(Dense(3, input_shape=(7,), activation="softmax"))

Because this is only a simple example we won’t train the model, but immediately convert
it to Core ML:

import coremltools as ct

labels = ["rain", "snow", "dry"]

clf_model = ct.converters.keras.convert(keras_model,
input_names="feature_vector",
output_names="probabilities",
class_labels=labels)

The clf_model is a Core ML neural network classifier model (“clf” is shorthand for classifier).
To make a prediction with this model you’d write:

import numpy as np

data = np.array([0, 1, 0, 0, 0, 0, 0], dtype=np.float)
clf_model.predict({"feature_vector": data})

This will output something like the following:

{'probabilities': {'dry': 0.22885005176067352,
                   'rain': 0.2978954613208771,
                   'snow': 0.4732545018196106},
 'classLabel': 'snow'}

As you can see, the input to the model must be a NumPy array with 7 numbers. Now let’s
suppose the data you want to make predictions on is categorical. The 7-element input vector
then is the one-hot encoded representation of this categorical data.

You could certainly one-hot encode the inputs yourself in Swift, but it’s easier to let Core
ML do this for you.
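
For comparison, a hand-rolled Swift version might look like this sketch (the helper and its day list are made up for illustration, not part of the pipeline we are building):

let days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

func oneHotEncode(_ day: String) -> [Double] {
  var vector = [Double](repeating: 0, count: days.count)
  if let index = days.firstIndex(of: day) {
    vector[index] = 1          // unknown strings stay all zeros
  }
  return vector
}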

The one-hot encoder

One of the available model types in Core ML is OneHotEncoder. This can convert string or
integer inputs into one-hot encoded vectors. That’s exactly what we need!

Let’s assume the feature_vector input to the classifier model really is supposed to be a one-
hot encoded version of the day of the week where "Monday" is [1,0,0,0,0,0,0], "Tuesday" is
[0,1,0,0,0,0,0], and so on…

There isn’t a convenient helper function for creating a OneHotEncoder model, so let’s write
some protobuf code by hand:

import coremltools.proto.FeatureTypes_pb2 as ft

ohe_spec = ct.proto.Model_pb2.Model()
ohe_spec.specificationVersion = 1

new_input = ohe_spec.description.input.add()
new_input.name = "day_of_week"
new_input.type.stringType.SetInParent()

new_output = ohe_spec.description.output.add()
new_output.name = "feature_vector"
new_output.type.multiArrayType.shape.extend([7])
new_output.type.multiArrayType.dataType = ft.ArrayFeatureType.DOUBLE

This creates a new protobuf Model object and gives it an input named "day_of_week" that
expects a single string value. It also adds an output named "feature_vector" that is a multi-
array of size 7, which is exactly the same as the input from the classifier model. You can
see this with print(clf_model._spec.description.input), which prints the following:

[name: "feature_vector"
type {
multiArrayType {
shape: 7
dataType: DOUBLE
}
}
]

Now do print(ohe_spec.description.output). It should print exactly the same thing. That’s
how we’re going to connect the output from the one-hot encoder model to the input of the
classifier.

The final step is to turn this spec into a one-hot encoder model:

import coremltools.proto.OneHotEncoder_pb2 as OHE

ohe_spec.oneHotEncoder.outputSparse = False
ohe_spec.oneHotEncoder.handleUnknown = OHE.OneHotEncoder.IgnoreUnknown
ohe_spec.oneHotEncoder.stringCategories.vector.extend(
["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])

The stringCategories field contains the strings that will be mapped to the different one-hot
encoded vectors. Because handleUnknown is IgnoreUnknown, any string that is not in this list
will be returned as a vector with all zeros.

Check out OneHotEncoder.proto107 to see exactly what these fields do, and what other
functionality is available.

Now you can turn the spec into an MLModel object:

ohe_model = ct.models.MLModel(ohe_spec)

To check that it works, make a prediction:

ohe_model.predict({"day_of_week": "Monday"})

This should print an array that is all zeros except for the first element:

{'feature_vector': array([1., 0., 0., 0., 0., 0., 0.])}

Also check that the other day names work, and what happens when you use a string that
is not in the list from stringCategories.

Putting it together into a pipeline

You now have two Core ML models: ohe_model with the one-hot encoder and clf_model
with the classifier. Let’s combine them into a pipeline so that you can directly perform
classifications on days of the week.

Core ML supports three different types of pipelines:

• PipelineClassifier, for pipelines that end with a classifier

• PipelineRegressor, for pipelines that output a real-valued number

• Pipeline, for any other kind of prediction

Since our example model is a classifier, let’s build a PipelineClassifier object:

107 https://github.com/apple/coremltools/blob/master/mlmodel/format/OneHotEncoder.proto

from coremltools.models.pipeline import *
from coremltools.models import datatypes

input_features = [ ("day_of_week", datatypes.String()) ]
output_features = [ "classLabel", "probabilities" ]

pipeline = PipelineClassifier(input_features, labels, output_features)

You need to specify the inputs to the pipeline in input_features. We have one input,
"day_of_week", that is a string.

Because the output from the Keras model was named "probabilities" instead of the default
"classProbability", we must specify this in the output_features argument. You also need to
provide the class labels (not sure why because these labels are already inside the classifier
model).

Now that you have the PipelineClassifier object, simply add the two models:

pipeline.add_model(ohe_model)
pipeline.add_model(clf_model)

And convert the pipeline into a new MLModel:

pipeline_model = ct.models.MLModel(pipeline.spec)

Now you can make a prediction using this final model:

pipeline_model.predict({"day_of_week": "Friday"})

Previously doing a prediction on a day name just gave the one-hot encoded vector, but now
the pipeline will automatically pass that one-hot encoded vector into the classifier model.
The output is something like:

{'probabilities': {'dry': 0.2183634489774704,
                   'rain': 0.5823769569396973,
                   'snow': 0.19925953447818756},
 'classLabel': 'rain'}

Of course the predictions here are totally random because the model isn’t trained.

Finally, save the model and open it in Xcode:



pipeline_model.save("MyFirstPipeline.mlmodel")

As you can see, the type is now “Pipeline Classifier”. The input is a String rather than the
usual array of numbers, and the output is a classification dictionary.

For more information on how to create pipelines, see the coremltools documentation108 .

Looking at the spec

In case you’re curious how this works inside the mlmodel file, a pipeline model is literally
just a list of other models (which may include other pipelines). You can see this in the spec:

spec = pipeline_model._spec
print(len(spec.pipelineClassifier.pipeline.models))

This prints 2 because the pipelineClassifier.pipeline.models list contains two models.

Note: If your pipeline is a regular pipeline and not a classifier pipeline, you should write
just spec.pipeline.models and leave out the pipelineClassifier part.

To look at the individual models, do:

print(spec.pipelineClassifier.pipeline.models[0].WhichOneof("Type"))
108 https://apple.github.io/coremltools/generated/coremltools.models.pipeline.html

This prints oneHotEncoder, while the same thing for models[1] prints neuralNetworkClassifier.
You can use spec.pipelineClassifier.pipeline.models[i] to look at — or change! — the
properties of that particular model, for example:

model1 = spec.pipelineClassifier.pipeline.models[1]
nn = model1.neuralNetworkClassifier
print(len(nn.layers))

To extract a model from an existing pipeline, it’s as simple as writing:

extracted_spec = spec.pipelineClassifier.pipeline.models[1]
ct.utils.save_spec(extracted_spec, "Classifier.mlmodel")

This saves the second model from the pipeline, in this example the classifier, into its own
mlmodel file. This is a complete Core ML model and you can use it right away.

You can find the full definition of the Pipeline, PipelineClassifier and PipelineRegressor
types in Model.proto109 .

Referring to the models by name

As of Core ML 3 / mlmodel version 4, the models in a pipeline can have names. If you don’t
set these names, the models will get default names, such as model0, model1, etc. It’s a good
idea to assign them more meaningful names.

For example, for the pipeline from this chapter:

names = ["OneHot", "Classifier"]


pipeline_model._spec.pipelineClassifier.pipeline.names.extend(names)

Now the first model can be referred to by the name “OneHot” and the second by “Classifier”.

This is useful when the models have hyperparameters that you want to access inside your
app, especially if two submodels can have the same hyperparameters. In that case you’d
use MLParameterKey.keyName.scoped(to: "ModelName") to look at the hyperparameters of just
the model with that name.
109 https://github.com/apple/coremltools/blob/master/mlmodel/format/Model.proto
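
For example, here is a hedged Swift sketch that overrides the learning rate of just the "Classifier" submodel when kicking off on-device training, and then reads it back through the same scoped key. compiledModelURL and trainingBatch are placeholders for your own .mlmodelc URL and MLBatchProvider:

import CoreML

let config = MLModelConfiguration()
config.parameters = [
  MLParameterKey.learningRate.scoped(to: "Classifier"): 0.001
]

let task = try MLUpdateTask(forModelAt: compiledModelURL,   // URL of the compiled .mlmodelc
                            trainingData: trainingBatch,    // an MLBatchProvider with examples
                            configuration: config,
                            completionHandler: { context in
  // Look at the hyperparameter of just the "Classifier" submodel.
  if let lr = context.parameters[MLParameterKey.learningRate.scoped(to: "Classifier")] {
    print("learning rate used by the Classifier submodel:", lr)
  }
})
task.resume()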

Making a pipeline trainable

Also new as of Core ML 3 is updatable models, which can be trained on-device. You can
make a pipeline model updatable too, with the restriction that only the last model in the
pipeline can be updatable. This must be a k-Nearest Neighbors model or a neural network.

Unfortunately, the PipelineClassifier object we used to build the pipeline is not very useful
for creating updatable pipelines. It’s easier to directly create the protobuf objects.

First, you should make the last model from the pipeline updatable. In this example, that’s
the Keras classifier model. The easiest way to turn this into an updatable model is to run
the Keras converter again, this time with respect_trainable=True:

keras_model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.SGD(lr=0.01),
metrics=["accuracy"])

clf_model = ct.converters.keras.convert(keras_model,
input_names="feature_vector",
output_names="probabilities",
class_labels=labels,
respect_trainable=True)

Note: If you don’t have access to the original (Keras) model, you can also make an mlmodel
updatable by modifying its protobuf objects. See the chapter Making a Model Updatable for
more info.

Make sure the training inputs for the model are properly filled in. The Keras converter
will have done this automatically, and print(clf_model._spec.description.trainingInput)
should print the following:

[name: "feature_vector"
type {
multiArrayType {
shape: 7
dataType: DOUBLE
}
}
, name: "classLabel"
type {
stringType {
}
}
]

The first training input, feature_vector, is identical to the neural network’s regular input.
The classLabel training input is new — this lets you provide the true class label belonging
to the training example.

Now you can make the pipeline:

ohe_spec = ohe_model._spec   # first model
clf_spec = clf_model._spec   # second model

pipeline_spec = ct.proto.Model_pb2.Model()
pipeline_spec.specificationVersion = ct._MINIMUM_UPDATABLE_SPEC_VERSION
pipeline_spec.isUpdatable = True

pipeline_spec.pipelineClassifier.pipeline.models.add().CopyFrom(ohe_spec)
pipeline_spec.pipelineClassifier.pipeline.models.add().CopyFrom(clf_spec)

pipeline_spec.pipelineClassifier.pipeline.names.extend(["OneHot", "Classifier"])

Note: isUpdatable must be set to True on both the pipeline and the last model in the pipeline.
The specificationVersion for the updatable submodel as well as the pipeline model itself
must be set to at least version 4 (or _MINIMUM_UPDATABLE_SPEC_VERSION).

Next, assign inputs and outputs to the pipeline. It’s easiest to copy these from the submodels.
We take the input(s) from the first model and the output(s) from the last model:

pipeline_spec.description.input.extend(ohe_spec.description.input[:])
pipeline_spec.description.output.extend(clf_spec.description.output[:])

pipeline_spec.description.predictedFeatureName = \
clf_spec.description.predictedFeatureName
pipeline_spec.description.predictedProbabilitiesName = \
clf_spec.description.predictedProbabilitiesName

Because the last model is a classifier, we also take over the predictedFeatureName and
predictedProbabilitiesName properties.

It’s important to set the training inputs on the pipeline too. Currently, only the last model
in the pipeline — the neural network classifier — has training inputs: the feature vector
and its corresponding true class label. We’ll keep the training input for the true label, but
instead of the feature vector we’ll use the input that goes into the one-hot encoder model:

pipeline_spec.description.trainingInput.extend([ohe_spec.description.input[0]])
pipeline_spec.description.trainingInput.extend([clf_spec.description.trainingInput[1]])

Thus, the training inputs for the pipeline model are:

[name: "day_of_week"
type {
stringType {
}
}
, name: "classLabel"
type {
stringType {
}
}
]

You can also fill in the shortDescription property of these training inputs with a human-
readable description, but this is not mandatory.

Finally, save the model:

ct.utils.save_spec(pipeline_spec, "UpdatablePipeline.mlmodel")

Open this model in Xcode and you’ll see that, in addition to the regular “Prediction”
section listing the model inputs and outputs, it now also has an “Update” section with
the training inputs, and a “Parameters” section with the training hyperparameters for the
neural network.
Linked Models
Suppose your app needs two or more classification models, each for a different selection
of objects. You will probably train these classifiers using transfer learning, where you start
from a pre-trained base model — for example, MobileNetV2 — and only train the classifier
layer.

This approach works fine, except that you now have several models that are completely
identical except for their very last layers. This is not very efficient, especially because
neural networks tend to make for large files.

As of Core ML 3, there is a solution for this: linked models. A linked model is simply a
reference from one mlmodel to another mlmodel file in the app bundle or a folder in the
app’s sandbox.

Your classifiers can now share the layers that they have in common. This is done by
extracting the shared parts into a new model, and then linking to it from the classifier
models. This lets you reuse expensive feature extractors without having to duplicate them.

This saves space in your app bundle, but also during runtime. Core ML will only load the
linked model once, even if it’s being used in multiple places.

You need to make a pipeline

To make use of a linked model, you first create a pipeline model, and add a new sub-model
of type LinkedModel to this pipeline.

You can find the definition in LinkedModel.proto110 . It contains just two properties:

• linkedModelFileName, which is the name of the mlmodelc file in your app bundle. Note
that you don’t link to an mlmodel but to the compiled version, mlmodelc. Makes sense,
because that’s what your app bundle contains.

• linkedModelSearchPath, which is where Core ML will look for the linked model file.
110 https://github.com/apple/coremltools/blob/master/mlmodel/format/LinkedModel.proto


The search path can be relative or absolute. The default value is ".", which is the same
location as the pipeline model. You can also use the special symbol $BUNDLE_MAIN to look in
the main app bundle, or $BUNDLE_IDENTIFIER(identifier) to look inside another bundle. You
can specify multiple search paths by separating them with a : symbol.
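
If the linked model lives in the app's sandbox rather than in the bundle, it must already be in compiled .mlmodelc form. One hedged sketch of how it could get there: compile a raw .mlmodel at runtime and move the result into place (downloadedURL and destinationURL are placeholders for your own locations):

import CoreML
import Foundation

func installLinkedModel(from downloadedURL: URL, to destinationURL: URL) throws {
  // Compile the .mlmodel; Core ML returns the .mlmodelc in a temporary directory.
  let compiledURL = try MLModel.compileModel(at: downloadedURL)

  // Move it to the folder that the linked model's search path points at.
  let fileManager = FileManager.default
  if fileManager.fileExists(atPath: destinationURL.path) {
    try fileManager.removeItem(at: destinationURL)
  }
  try fileManager.moveItem(at: compiledURL, to: destinationURL)
}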

Follow these steps

To create a linked model, you need to do the following:

1. Take the shared portion of your model and export it as its own mlmodel file. This
means you’d save the feature extraction portion of the model — everything except the
classification layers — into a separate mlmodel file.
The input to this model is an image. The output is a MultiArray with the extracted
features, usually a vector of 512 or 1024 floating point numbers.

2. Create a pipeline model. You’ll build a separate one for each classifier.

3. The first sub-model in the pipeline is the LinkedModel. Give it the same inputs and
outputs as the extracted model you saved in step 1.

4. The second model in the pipeline is a neural network classifier that takes as input
the MultiArray with the extracted features, and outputs the classification prediction.
Often this sub-model will only have a handful of layers, or even just a single layer.

5. Add both the pipeline mlmodels and the linked mlmodel file into your Xcode project.

Want to see some code? The coremltools repo has a detailed code example111 that shows
how to set up the linked model and the pipeline. (The example uses a nearest neighbor
classifier and makes it updatable, but you can skip those steps.)

Note: This technique is not limited to just classifiers, but that is probably the most common
use case. You can use it for other types of models as well. Whenever there is a significant
portion to reuse between models, it makes sense to extract the shared layers into a linked
model.

111 https://github.com/apple/coremltools/blob/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/updatable_models/updatable_linked_model.ipynb
Control Flow in Graphs
The traditional neural network structure has no loops or branches. It simply runs from top
to bottom for each prediction. Even layers that do perform loops, such as LSTM, always run
the same loop.

This fixed structure is fine for many tasks, but having the ability for decision-making and
looping allows us to create much more powerful — and more exciting! — models.

Core ML 3 adds a number of layer types that make neural networks a lot more dynamic.
Depending on the situation, it can now repeat certain parts of the neural network graph
(loops) and skip other parts (branches).

Exactly which parts of the neural network get executed by Core ML can vary from one run
of the model to the next, depending on the contents of your input data.

Thanks to these control flow layers, Core ML 3 graphs can go way beyond the traditional
acyclic graphs used in neural networks. This opens up lots of new possibilities.

Nested neural networks

Key to making this work is putting NeuralNetwork objects inside other NeuralNetwork objects.
This is a little weird, so let’s look at it in detail.

When you write logic in Swift or Objective-C, you’re often nesting code inside other code,
like so:

result = doSomething(inputs)

if result > value {
  /* nested code */
} else {
  /* other nested code */
}

The nested code has its own scope. A variable defined inside the nested code cannot be
used outside it, for example.


Inside the Core ML model something similar happens. When you use these new loops and
branches, the insides of those looping and branching layers are new NeuralNetwork objects.

Yep, you read that correctly: a loop or branch layer consists of a smaller neural network
inside the main neural network. This sub-network, in turn, can contain its own looping and
branching layers with sub-sub-networks, and so on.

Note: This is different from a pipeline, which is a sequential list of models, whereas this is
more like a Russian doll that has models inside models inside models. (I’m not sure how
deeply you can nest them.)

One important difference with regular programming is that these sub-networks do not
have their own input and outputs. They can use tensors from the outer scope as input,
but they do not “return” values. Instead, you have to copy the result into some other tensor
that lives in the outer scope. Let’s see how this works in practice.

A simple example

We’ll create a very simple neural network that takes a single number as input and then
branches based on the value of that number. Here is the equivalent in Swift:

func model(value: Double) -> Double {
  if value > 0.5 {
    return 2 * value
  } else {
    return value + 3
  }
}

Although what we’re going to build will look more like the following:

func model(value: Double) -> Double {
  var result = 0.0

  func ifBranch() {
    result = 2 * value
  }

  func elseBranch() {
    result = value + 3
  }

  let condition = value > 0.5
  if condition {
    ifBranch()
  } else {
    elseBranch()
  }
  return result
}

To implement this with Core ML we’ll create three different NeuralNetwork objects, for:

• the entire model

• the “true” branch

• the “false” branch

We’ll use NeuralNetworkBuilder to set this all up. First, define the top-level model:

import numpy as np
import coremltools as ct
import coremltools.models.datatypes as datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder

input_features = [ ("value", datatypes.Array(1)) ]
output_features = [ ("result", datatypes.Array(1)) ]

builder = NeuralNetworkBuilder(input_features, output_features,
                               disable_rank5_shape_mapping=True)

The model has one input and one output. Both are tensors containing a single floating-
point value. We do disable_rank5_shape_mapping so that we’re not stuck with the old Core
ML 2-style tensors but can use tensors of any shape.

First, we’ll add a new tensor that will be used to store the output of whichever branch we
end up taking:

builder.add_load_constant_nd(name="result",
output_name="result",
constant_value=np.zeros((1,)),
shape=(1,))

We need this tensor because the branch’s child networks do not have outputs of their own
and cannot return values. Instead, they need to copy their results into a tensor in the parent
network. The LoadConstantND layer outputs a tensor with a constant value, so that’s a good
place to copy those branch results into. Initially, this tensor has the value 0.0 (because of
np.zeros((1,))).

You may be wondering why we need this tensor if the model already has an output that is
named "result" as well? It turns out you cannot copy values directly into a model output.
Because the LoadConstantND layer’s output has the same name as the model output, this layer
produces a tensor that is connected to the model output, and we can copy into that.

Next, we’ll create a layer that does the comparison value > 0.5:

builder.add_greater_than(name="condition",
input_names=["value"],
output_name="value_gt",
alpha=0.5)

This layer compares the value from the tensor named "value", which is taken from the
model’s input, with the constant 0.5. The output of this operation is a new tensor, "value_gt",
that is 1.0 if true or 0.0 if false.

We can now connect this "value_gt" tensor to a branching layer:

branch_layer = builder.add_branch(name="branch_layer", input_name="value_gt")

OK, here comes the fun part. We will now create a new NeuralNetwork object for the true-
branch, which Core ML calls the ifBranch:

builder_ifbranch = NeuralNetworkBuilder(nn_spec=branch_layer.branch.ifBranch)

To create a new NeuralNetwork, we need a new NeuralNetworkBuilder. Normally when you
use NeuralNetworkBuilder it defines not just the layers of the network but also all the other
metadata a model needs, such as inputs and outputs. But these nested NeuralNetwork objects
do not have inputs and outputs — all they are is a list of layers.

With nn_spec you tell the builder that you just want to fill in those layers. In other words,
all you want is a NeuralNetwork, not a complete Model. Here, we’ll be filling in the layers for
the ifBranch network.

From now on we will use the builder_ifbranch to add layers. As shown in the Swift version,
when the condition is true, we’ll multiply the value by 2, so we’ll need a multiply layer:

builder_ifbranch.add_elementwise(name="mult",
input_names=["value"],
output_name="result_ifbranch",
mode="MULTIPLY",
alpha=2.)

The input to this layer is taken from the "value" tensor, which comes from the top-level
NeuralNetwork object. You can use any tensors from the parent model.

Now that we’ve done the calculation, we will copy the results into the "result" tensor using
a special CopyLayer:

builder_ifbranch.add_copy(name="copy",
input_name="result_ifbranch",
output_name="result")

Recall that "result" is a tensor that lives in the top-level network. The Swift version of
this code did something similar: it assigned the output of the computation to a variable
defined outside the if-statement. You can think of this CopyLayer as the equivalent of a return
statement.

We’re done with the “true” branch. The “false” branch, or elseBranch, is similar: create a
new NeuralNetworkBuilder object and add the layers to it.

builder_elsebranch = NeuralNetworkBuilder(nn_spec=branch_layer.branch.elseBranch)

builder_elsebranch.add_elementwise(name="add",
input_names=["value"],
output_name="result_elsebranch",
mode="ADD",
alpha=3.)

builder_elsebranch.add_copy(name="copy",
input_name="result_elsebranch",
output_name="result")

Note that we gave the CopyLayer the same name as in the ifBranch, both are named
"copy". That’s fine. These are different NeuralNetwork objects and therefore have their own
namespace.

However, the intermediate output of the multiply and add layers cannot be called "result"
because that would conflict with the "result" tensor in the top-level model. And you can’t
directly write into that top-level tensor from the multiply or add layers — only CopyLayer
is allowed to do that. That’s why we’ve named these temporary tensors "result_ifbranch"
and "result_elsebranch".

That’s it, we’re done defining the model. Let’s save it:

spec = builder.spec
model = ct.models.MLModel(spec)
ct.utils.save_spec(spec, "BranchModel.mlmodel")

If you look at this model in Netron, it shows up as two disconnected parts. This is fine, as the
model output is connected to the LoadConstantND layer and therefore always has a value. The
branches from the BranchLayer simply overwrite that default value with something else.

Note that only the ifBranch is required, the elseBranch is optional. If you don’t supply a
child network for this branch and the condition is false, then simply nothing happens to
the "result" tensor and the model will output its initial value (which we set to 0).

Let’s try out this model:

model.predict({"value": np.array([2.1])}, useCPUOnly=True)

This should give the output 4.2 (or 4.19999 due to floating point precision). Also test that
the else branch works:

model.predict({"value": np.array([-2.1])}, useCPUOnly=True)



This prints 0.9. Sweet!

I hope this section gave you some idea of how these control flow layers are supposed
to work. It feels a bit like programming in an awkward language that has no return
statements, but it does allow for some powerful new constructs. We’ll now take a look at
the available layers in more detail.

The control flow layers

Core ML currently supports two types of control flow structures:

• BranchLayer: this is like an if-else statement. It contains two NeuralNetwork objects,
one that runs when the input to this layer is true, the other when the input is false
(optional). Because Core ML doesn’t have a boolean tensor type, you’ll actually pass in
1.0 or 0.0 instead of true or false. (Core ML considers the condition to be true
if the value is greater than 1e-6.)

• LoopLayer: this is like a while loop. If no input is given, the loop repeats for the
maximum number of iterations specified in the layer. You can override this by passing
in a tensor containing the number of iterations you want to loop for.
The LoopLayer contains a “body” NeuralNetwork that represents the inside of the while
loop. The layers from this neural net are run on every iteration.
It’s also possible to include a NeuralNetwork that acts as the condition of the while loop.
This “condition” neural network is run once before the loop starts and again before
every new iteration. As long as it outputs a value greater than 0.0, the loop keeps
repeating.

You only get branches and loops — there is no such thing as a "goto" that lets you jump to
a different part of the model, or a way to run sub-networks as if they were function calls.

The BranchLayer and LoopLayer do not have outputs. They always pass control to one of their
child NeuralNetwork objects. As you’ve seen in the example above, these child NeuralNetworks
cannot have inputs and outputs of their own. The only way to pass results back is to copy
them into a tensor that lives in the parent network.

There are a few special layers that you can use inside of a loop:

• LoopBreakLayer: put this into the loop’s body NeuralNetwork to terminate the loop, just
like a regular break statement.

• LoopContinueLayer: put this into the body NeuralNetwork if you want to stop the current
loop iteration and skip ahead to the next one, just like a regular continue statement.

There are a number of layer types whose purpose is to compare tensors:

• EqualLayer, NotEqualLayer

• LessThanLayer, LessEqualLayer

• GreaterThanLayer, GreaterEqualLayer

• LogicalOrLayer, LogicalXorLayer, LogicalAndLayer

• LogicalNotLayer

These output a new tensor that is 1.0 wherever the condition is true, and 0.0 wherever
the condition is false. This is done in an elementwise fashion and therefore the output is a
tensor “mask” made up of ones and zeros. These layer types support broadcasting, so you
can compare tensors of different ranks. You can also compare a tensor with a (hardcoded)
scalar value.

Using these layers allows you to branch based on the outcome of a comparison, or create a
loop that keeps repeating until a certain condition becomes false.

There is one more layer that provides the magic sauce that makes all of this work:

• CopyLayer: overwrites a previous tensor, for example to replace an old result with a
new one.

This is very much like an assignment statement in Swift or Objective-C. Normally, the value
of a tensor can never change once it has been computed. But if you’re looping, you’ll want
to overwrite the contents of an existing tensor with a new value. That’s the purpose of the
CopyLayer. As you’ve seen in the simple example above, you also use the CopyLayer to return
values from branches.

A more detailed example

For an example of the power of these new control flow layers, have a look at this Jupyter
notebook112 from the coremltools repo. It shows how to implement an iterative process
inside the Core ML model and uses many of the new layer types.
112 https://github.com/apple/coremltools/blob/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/neural_network_inference/Neural_network_control_flow_power_iteration.ipynb

The example shows how to use the “power method” to “compute the largest eigenvalue of a
real symmetric matrix”. You can immediately forget all that — it’s just an example of some
kind of iterative process that we’ll run until it gets an answer that is good enough, i.e. until
it converges.

The example works like this:

1. use a LoadConstantND layer to load the value 0 into the tensor named iteration_count

2. add a LoopLayer that will loop for a certain maximum number of iterations

3. inside the loop, add a new neural network that performs some kind of computation —
feel free to ignore this part of the example as it’s just a demonstration of some kind of
iterative process (you could swap this out for another iterative process and the rest
of the example would still work the same way)

4. at the end of the computation, use a layer to increment the current value from
iteration_count, and then a CopyLayer to overwrite the value inside the iteration_count
output

5. use another CopyLayer to copy the result of the computation back into the original
tensor, so that the next iteration of the loop can use this new value

6. add a LessThanLayer to compare the output of the computation to some convergence
threshold, and feed this yes/no result into a BranchLayer

7. add a new neural network to the BranchLayer that just has LoopBreakLayer inside it.
In other words, if the branch happens — because the output of the computation was
small enough to go under the convergence threshold — then we’ll break out of the
loop created in step 2.

This way of structuring a loop is a little strange perhaps, but very flexible! Key point is to
remember to use the CopyLayer to overwrite existing tensors with new values.

After you run the model, the iteration_count output will now have counted how many times
the loop was repeated. For this particular example, the iteration count may be different
every time, depending on the values of the inputs to the model, as some input values will
converge quicker than others.

This demonstrates exactly why having control flow layers is so powerful: your neural
network no longer has to do the exact same thing every time it runs. Pretty cool!
Working With Video
When working with AVCaptureSession to capture video from the live camera feed, your app
receives CMSampleBuffer objects in the following delegate method:

public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    /* make a prediction using the pixel buffer */
  }
}

When working with ARKit to capture live video, your app receives ARFrame objects in the
following callback:

func session(_ session: ARSession, didUpdate frame: ARFrame) {
  if case .normal = frame.camera.trackingState {
    let pixelBuffer = frame.capturedImage
    /* make a prediction using the pixel buffer */
  }
}

And when using an AVAssetReader to read video frames from a file, your app receives
CMSampleBuffer objects in a loop:

while assetReader.status == .reading {
  autoreleasepool {
    if let sampleBuffer = readerTrackOutput.copyNextSampleBuffer(),
       let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
      /* make a prediction using the pixel buffer */
    }
  }
}

What all these methods have in common is that they will receive many CVPixelBuffer
objects per second. Your app needs to decide how to handle all those pixel buffers.

• If the Core ML or Vision prediction for the previous frame is still running when a new
frame arrives, you will need to ignore this new frame in a real-time setting. Your app
simply isn’t able to keep up with the FPS of the live camera feed.


• When processing frames offline — such as when reading from an mp4 file — you will
need to temporarily block reading until the model is ready for the next frame.

• To get higher throughput, you can already schedule the next Core ML or Vision request
while the previous one is still running. That’s because some part of the model will run
on the CPU and some part on the GPU. If the CPU is just sitting there waiting for the
GPU to complete, it’s more efficient to use that idle time to already prepare the next
request. However, you want to limit this to two or three requests at most, or you’ll
eventually end up with a huge backlog of unprocessed pixel buffers.

Exactly how to handle this situation depends on the demands of your app. This chapter
shows a few different approaches.

Note: When using live video, you don’t want to hang on to any given pixel buffer for too
long. The camera only has a small, fixed pool of buffers that it reuses over and over. If
you’re holding on to all these buffers, waiting for them to be processed by Core ML at some
point in the future, the camera’s buffer pool will run out of available buffers and it has
nowhere left to store the next frame. That’s asking for trouble. You want to recycle the
buffers quickly.

Blocking the video thread

The easiest solution is to block the video thread while the model is busy making a
prediction. In the delegate method, do the following:

public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    let resizedPixelBuffer = ...
    if let output = try? model.prediction(image: resizedPixelBuffer) {
      /* do something with the output */
    }
  }
}
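
The resizedPixelBuffer placeholder stands for whatever scaling you do to bring the camera frame down to the model's input size (when you go through Vision instead of the raw Core ML API, it handles this for you). One possible way to fill it in, sketched with Core Image and a BGRA destination buffer:

import CoreImage
import CoreVideo

let ciContext = CIContext()

func resizePixelBuffer(_ pixelBuffer: CVPixelBuffer, width: Int, height: Int) -> CVPixelBuffer? {
  var resized: CVPixelBuffer?
  let status = CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                                   kCVPixelFormatType_32BGRA, nil, &resized)
  guard status == kCVReturnSuccess, let output = resized else { return nil }

  // Scale the source frame to the requested size and render it into the new buffer.
  let image = CIImage(cvPixelBuffer: pixelBuffer)
  let scaleX = CGFloat(width) / CGFloat(CVPixelBufferGetWidth(pixelBuffer))
  let scaleY = CGFloat(height) / CGFloat(CVPixelBufferGetHeight(pixelBuffer))
  ciContext.render(image.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY)),
                   to: output)
  return output
}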

Core ML’s model.prediction(...) method will block the current thread until the result is
available. Likewise for the perform() method from Vision’s VNImageRequestHandler.

With this approach, you make the video capture thread wait until Core ML is done. That’s
fine when the model is really fast and finishes before the next frame arrives from the
camera.

But what happens when the model is slower than the camera? In that case, AVCaptureSession
will not be able to call captureOutput(_:didOutput:from:) again on the video capture thread,
because that thread is currently occupied with running the Core ML model.

Fortunately, AVCaptureSession is smart: if the video capture thread is blocked for too long,
AVCaptureSession will automatically drop the next frame. When that happens, instead of
captureOutput(_:didOutput:from:), the delegate method captureOutput(_:didDrop:from:) is
called to inform you that your app was too slow.

Note: The setting for this is alwaysDiscardsLateVideoFrames from AVCaptureVideoDataOutput.

All the logic for dropping the frames happens in AVCaptureSession and so you don’t have to
worry about it. Simply run the model directly on the video capture thread, so that it blocks
this thread while Core ML is working. (This appears to be true for ARKit too, which uses
AVCaptureSession behind the scenes.)

In most apps the camera will run at 30 FPS. If the model takes longer than 1/30th of a second,
or 33 ms, then not all video frames can be delivered. Let’s say the model takes 50 ms instead.
The approach taken here will drop every other frame and model.prediction(...) will only
be called 15 times per second instead of 30.

This also highlights the drawback of this simple method: a model that takes 50 ms can ac-
tually run at 20 FPS instead of 15. This is also true for offline processing with AVAssetReader
— if you can get this model to process 20 frames per second, it’s done in 3/4th the time. To
get the maximum amount of throughput, you’ll need to be a bit more clever.

Running the model on a separate thread

One way to try and be more clever is to not run the prediction on the video capture
thread, but on a separate thread or dispatch queue. You might be tempted to wrap it in
a DispatchQueue.async:

public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    DispatchQueue.global(qos: .userInitiated).async {
      let resizedPixelBuffer = ...
      if let output = try? model.prediction(image: resizedPixelBuffer) {
        /* do something with the output */
      }
    }
  }
}

This is a bad idea! Now model.prediction(...), or the VNImageRequestHandler.perform()
method if you’re using Vision, will no longer block the video capture thread. And when
this thread is not blocked, AVCaptureSession will no longer automatically drop frames when
your app is too slow.

For every new frame that arrives from the camera — 30 times per second — this code will
schedule another Core ML prediction on the dispatch queue. If the model is slower than 33
ms, you’ll quickly build up a huge backlog of unprocessed requests, as Core ML won’t be
able to keep up.

Eventually, AVCaptureSession will start dropping frames anyway. This happens because the
global DispatchQueue only has so many free threads to run all the Core ML requests. At some
point all your CPU cores will be in use and there is no thread that is free to handle the next
incoming video frame.

Alternatively, because of the backlog of pending requests, AVCaptureSession can eventually
run out of CVPixelBuffer objects and shut down. Not good. :-(

Another problem with having a backlog is that it increases the latency of the requests.
Let’s say the first 10 frames will all get scheduled on the DispatchQueue, then the next 60
frames are dropped because your CPU is completely maxed out and is slowly working its
way through that backlog of the first ten frames. By the time you get the results of frame
number 10, it’s already two seconds later.

If your model isn’t fast enough to keep up with the camera, it’s better to drop more frames
than to get way behind and deliver results for frames that are no longer relevant.

The solution here is to block the video capture thread again, but now manually. The
simplest approach is to ignore any new frames — i.e. to drop them by hand — until Core
ML is free for work again. First, you need to add a new instance variable:

var currentBuffer: CVPixelBuffer?

Then in your video capture delegate method do the following:

public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  if currentBuffer == nil,
     let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    currentBuffer = pixelBuffer
    DispatchQueue.global(qos: .userInitiated).async {
      let resizedPixelBuffer = ...
      if let output = try? model.prediction(image: resizedPixelBuffer) {
        /* do something with the output */
      }
      self.currentBuffer = nil
    }
  }
}

The new frame will only be processed by Core ML when currentBuffer is nil, i.e. when Core
ML is not already doing anything. Otherwise, nothing happens.

With this method you still only do a single prediction at a time, but now you run it on a
thread of your own choosing. Using a different thread in this manner actually improves
the performance by quite a bit!

Using a semaphore

The above method already works quite well, but it's not 100% thread-safe. The currentBuffer
variable is written to from different threads: the video capture thread but also any threads
that belong to the DispatchQueue. There's a small but real chance that these writes will
interfere with each other.

You could solve this by locking the currentBuffer variable before reading or writing it, but
there’s another solution that neatly avoids this issue, and that is to use a semaphore.

First, create the semaphore object:

let semaphore = DispatchSemaphore(value: 1)

Then the video capture delegate becomes:



public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    semaphore.wait()
    DispatchQueue.global(qos: .userInitiated).async {
      let resizedPixelBuffer = ...
      if let output = try? model.prediction(image: resizedPixelBuffer) {
        /* do something with the output */
      }
      self.semaphore.signal()
    }
  }
}

The first time semaphore.wait() is called, it doesn’t actually wait. It only decrements the
semaphore’s value from 1 to 0 and the thread continues. You can safely run a Core ML
prediction on that frame.

If the next frame arrives before Core ML is finished, semaphore.wait() really will wait. It
blocks the video thread because now the semaphore’s value is 0. AVCaptureSession will
automatically drop any other frames that arrive in the mean time.

Once Core ML is done, you call semaphore.signal() to increment the value of the semaphore
from 0 to 1. Now semaphore.wait() sees that the semaphore is no longer 0 and wakes up the
video capture thread again.

And so on… Using a semaphore is completely thread-safe and a really neat solution, plus
with a small modification it allows you to perform multiple Core ML requests at once.

Note: Besides this being a 100% thread-safe solution, there is another small difference with
the method from the previous section. There you ignored any new frames while Core ML
was still busy. After Core ML has finished, you may have to wait a short while until the next
frame arrives (anywhere between 0 and 33 ms).

With the method from this section, you’re letting AVCaptureSession take care of dropping
the frames. When Core ML is finished and the video capture thread gets unblocked, it will
immediately start a new prediction — no need to wait until the next frame arrives, because
you already have a frame.

However, this is actually an older pixel buffer from a few frames ago, not the most recent
one. So that’s the trade-off: with the semaphore method you get higher throughput but what
Core ML sees may be a little behind on what’s happening right now. Naturally, the faster
your model, the less of a concern this is.

Multiple requests at a time

The methods described so far were all used to make sure you only perform one Core ML or
Vision request at a time, so that you don’t build up a large backlog and the video capture
session does not run out of buffers.

That’s a good idea indeed. However, there is a speed advantage to running multiple
requests at a time, as long as you limit this to 2 or 3 simultaneous requests at most.

When Core ML runs a model on the GPU, it also does a bunch of work on the CPU such as
encoding all the Metal kernels into command buffers. This can take several milliseconds.
But once the job has been handed over to the GPU, the CPU sits idle and waits until the GPU
is done.

The CPU and GPU are two parallel processors that can work independently, so it’s a waste
of time to make the CPU wait on the GPU and vice versa. While the GPU is handling request
n, the CPU can already do the preparation work for request n+1 and possibly even request
n+2. Keep ‘em busy!

To do this in your app, simply change the value of the semaphore when you initialize it:

let semaphore = DispatchSemaphore(value: 2)

Now semaphore.wait() needs to be called two times before it will block. That means your
video capture thread will be able to schedule two separate Core ML requests at the same
time. This is no problem for Core ML because internally it uses a serial queue to manage
multiple requests.

As soon as the first request is done, semaphore.signal() unblocks the video thread again,
making room for the next frame to be processed. This way you’ll always have two Core ML
predictions running concurrently. This scheme is also called double-buffering.

Double-buffering will appear to make your model run a little faster, because it overlaps
some of the CPU work with the GPU work from a previous frame.

The timeline looks like this:



start prediction for frame 1
start prediction for frame 2
block thread for frame 3
drop frame 4
finish prediction for frame 1
start prediction for frame 3
block thread for frame 5
drop frame 6
drop frame 7
finish prediction for frame 2
start prediction for frame 5
...and so on...

After the predictions for the first two frames are underway, the video capture thread
blocks on frame 3. Any new frames arriving after this will be automatically dropped by
AVCaptureSession.

As soon as the prediction for frame 1 finishes, the video capture thread resumes. That
thread got blocked on frame 3, so that’s the frame it makes the prediction for. Note that
this is not the most recent frame, as frame 4 got dropped in the mean time.

The next frame that arrives, 5, will block the video capture thread again. The other
prediction that’s still running at this point is the one for frame 2. Once that is done, frame
5 can have a go, and so on.

Note: If your model is fast enough, no frames will actually get dropped. The above timeline
is for a model that is significantly slower than 33 ms per frame.

Why stop at two concurrent predictions, why not three or even more? In practice it turns
out you gain the most from having two predictions in parallel. On faster devices it may be
worth using a semaphore with value 3, known as triple-buffering. (The faster the device,
the quicker the model runs but also the more idle time there is to take advantage of.)

There’s an important trade-off here: the more requests you schedule at the same time, the
longer it will take before you get the results. Look at the above timeline: you don’t get the
results for frame 2 until about five frames later. With triple-buffering, this time delay —
known as the latency — only becomes longer.

You get a higher throughput, i.e. the device is able to squeeze as much performance out of
the CPU and GPU as possible, but you lose responsiveness.

For offline work, such as when you're using AVAssetReader to read the frames from a video
file, the latency isn't important. But the law of diminishing returns also applies here: you
won't gain much from doing more than two or three predictions at a time. If your CPU and
GPU are both maxed out already, adding more work won't make them go any faster…
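
For reference, here is a minimal sketch of what such an offline loop with AVAssetReader might look like. This is just an illustration, not code from the book's sample projects; the process(pixelBuffer:) closure stands in for whatever prediction code you use (for example the semaphore-based scheme from above).

import AVFoundation

// A bare-bones sketch of offline frame processing with AVAssetReader.
// `process` is a hypothetical stand-in for your Core ML / Vision code.
func processVideoFile(url: URL, process: (CVPixelBuffer) -> Void) throws {
  let asset = AVAsset(url: url)
  guard let track = asset.tracks(withMediaType: .video).first else { return }

  let reader = try AVAssetReader(asset: asset)
  let settings: [String: Any] = [
    kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
  ]
  let output = AVAssetReaderTrackOutput(track: track, outputSettings: settings)
  reader.add(output)
  reader.startReading()

  // Unlike the camera, this loop delivers frames as fast as possible,
  // so only throughput matters here, not latency.
  while let sampleBuffer = output.copyNextSampleBuffer() {
    if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
      process(pixelBuffer)
    }
  }
}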

The in-flight index: whose turn is it?

When scheduling multiple Core ML requests at a time, there is an important detail you need
to take care of. Because every request will run from its own thread, these threads should
not be sharing resources.

Suppose you’re using Core Image to resize the camera’s CVPixelBuffer to the dimensions
that are expected by the mlmodel. Instead of allocating a new destination CVPixelBuffer for
every frame, it’s more efficient to just allocate it once and put it in an instance variable —
let’s call it resizedPixelBuffer — and keep re-using that.

However, it’s possible that two different prediction threads will now try to access the
resizedPixelBuffer variable at the same time. That’s a race condition waiting to happen…

The solution is to allocate one of these resizedPixelBuffer objects for each possible Core ML
request. You also need to write code to cycle through these buffers. First, add some instance
variables:

let maxInflightBuffers = 2 // or 3
let semaphore: DispatchSemaphore
var inflightBuffer = 0
var resizedPixelBuffers: [CVPixelBuffer] = []

Initialize them like so:

semaphore = DispatchSemaphore(value: maxInflightBuffers)

for _ in 0..<maxInflightBuffers {
  var resizedPixelBuffer: CVPixelBuffer = ...
  resizedPixelBuffers.append(resizedPixelBuffer)
}

Then change the video capture delegate method to:



public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    semaphore.wait()

    let inflightIndex = inflightBuffer
    inflightBuffer += 1
    if inflightBuffer >= maxInflightBuffers {
      inflightBuffer = 0
    }

    DispatchQueue.global(qos: .userInitiated).async {
      let resizedPixelBuffer = self.resizedPixelBuffers[inflightIndex]
      /* actually resize the pixel buffer */

      if let output = try? model.prediction(image: resizedPixelBuffer) {
        /* do something with the output */
      }
      self.semaphore.signal()
    }
  }
}

Now inflightIndex will cycle between 0 and maxInflightBuffers - 1. With maxInflightBuffers
= 2, there will be at most two Core ML predictions happening in parallel, each in its
own thread. One of these will always have inflightIndex = 0 and the other will have
inflightIndex = 1. Each thread is therefore always reading from and writing to its own
resizedPixelBuffer object.

Note: The code that increments inflightIndex must always be run from the same thread, in
this case the video capture thread. Don’t put this inside the DispatchQueue or it’s no longer
thread-safe.

Vision requests are not thread-safe!

In the previous examples I’ve only shown the code for using Core ML directly. When using
double or triple-buffering, you’ll need to treat the Vision request object as a resource that
should not be shared between threads too.

It’s OK to call Core ML’s model.prediction(...) method from multiple threads, but a Vision
request object should only be used by one thread at a time.

So if you’re planning to use Vision in combination with the semaphore, you need to do
allocate as many Vision request objects as your maxInflightBuffers:

var visionRequests = [VNCoreMLRequest]()

for _ in 0..<maxInflightBuffers {
  let request = VNCoreMLRequest(model: ...) { request, error in
    /* handle the results */
  }
  visionRequests.append(request)
}

It’s OK for all these VNCoreMLRequest objects to have the same completion handler. (I usually
put it in a separate function.)

Now the code in the camera capture delegate will be slightly different:

public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    semaphore.wait()

    let inflightIndex = inflightBuffer
    inflightBuffer += 1
    if inflightBuffer >= maxInflightBuffers {
      inflightBuffer = 0
    }

    DispatchQueue.global(qos: .userInitiated).async {
      let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
      let request = self.visionRequests[inflightIndex]
      try? handler.perform([request])
      self.semaphore.signal()
    }
  }
}

Note that you can also call semaphore.signal() from the VNCoreMLRequest completion handler.

Is this really a good idea?

Having multiple predictions in progress at once really improves the throughput of the
model, because now the CPU and GPU are as busy as they can be. However, on an iPhone
that is not always the best idea. You may need to use the GPU for drawing as well, especially
if you're doing things with AR.

Maxing out the GPU for a long stretch of time will also make it hot, and eventually the GPU
will throttle down to run at a lower clock speed.

These high throughput techniques certainly have a place, but use them judiciously.

Even if your model is fast enough to run at 30 FPS or faster, I would encourage you to
find ways to run the Core ML model as little as possible. Keep the camera at 30 FPS but add
some logic to captureOutput(_:didOutput:from:) to skip every other frame, for example. This
will seriously improve the energy usage of your app and keep users' batteries charged for
longer.
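
As a bare-bones sketch of that "skip every other frame" idea — the frameCounter property and the predict(on:) helper are just illustrative names, not code from this chapter's sample project:

var frameCounter = 0

public func captureOutput(_ output: AVCaptureOutput,
                          didOutput sampleBuffer: CMSampleBuffer,
                          from connection: AVCaptureConnection) {
  // Only run the model on every other frame: the camera stays at 30 FPS
  // but Core ML effectively runs at 15 FPS.
  frameCounter += 1
  if frameCounter % 2 != 0 { return }

  if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
    predict(on: pixelBuffer)   // hypothetical helper that runs Core ML / Vision
  }
}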

For the best user experience, my advice is to run the model at the lowest frame rate you
can possibly get away with. For example, if the user is waving the camera around too much,
pause the predictions and wait until they've stopped moving before running the model
again.

How big is my preview?

If you’re displaying the live video feed from the camera using AVCaptureVideoPreviewLayer,
the video may be letterboxed on the sides or on the top and bottom. This depends on the
screen size of the device and the preview layer’s videoGravity setting.

Sometimes it’s useful to know what the screen coordinates are of the actual video, for
example to draw something on top, such as a detected bounding box or a segmentation
mask. You can find these screen coordinates with the following code snippet:

let videoPreviewRect = previewLayer.layerRectConverted(
    fromMetadataOutputRect: CGRect(x: 0, y: 0, width: 1, height: 1))
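
As a rough example of how you might use that rectangle: Vision reports bounding boxes in normalized coordinates with the origin in the lower-left corner, so to draw one on top of the preview you flip the y-axis and scale it into videoPreviewRect. A minimal sketch, assuming boundingBox comes from a Vision observation:

// Converts a normalized Vision bounding box (origin in the lower-left corner)
// to screen coordinates inside the visible video area.
func convertToPreviewCoordinates(_ boundingBox: CGRect) -> CGRect {
  // Flip the y-axis, since UIKit's origin is in the top-left corner.
  let flipped = CGRect(x: boundingBox.origin.x,
                       y: 1 - boundingBox.origin.y - boundingBox.height,
                       width: boundingBox.width,
                       height: boundingBox.height)

  // Scale and offset into the actual video rectangle on the screen.
  return CGRect(x: videoPreviewRect.origin.x + flipped.origin.x * videoPreviewRect.width,
                y: videoPreviewRect.origin.y + flipped.origin.y * videoPreviewRect.height,
                width: flipped.width * videoPreviewRect.width,
                height: flipped.height * videoPreviewRect.height)
}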
MobileNetV2 + SSDLite Example
In this chapter we’ll take a fairly complicated deep learning model and convert it to Core
ML. This will use many of the techniques that were shown throughout the book, such as:

• converting a model

• cleaning up the model after conversion

• using NeuralNetworkBuilder to create a new model

• directly writing and modifying the spec

• building a pipeline to combine the different models into one

• using Vision to run the pipeline model in an iOS app

The model in question is SSD, which stands for Single Shot Multibox Detector113 — the M
appears to have gone missing from the acronym. SSD is an object detection model that is
fast enough it can be used on real-time video.

113 https://arxiv.org/abs/1512.02325


An object detector can find the locations of several different types of objects in the image.
The detections are described by bounding boxes, and for each bounding box the model also
predicts a class.

There are many variations of SSD. The one we’re going to use here employs MobileNet V2
as the backbone and has depthwise separable convolutions for the SSD layers, also known
as SSDLite.

SSD vs. YOLO

SSD isn’t the only way to do real-time object detection. Another common architecture is
YOLO.

I wanted to mention YOLO because when you train an object detector114 with Turi Create,
it produces a model with the TinyYOLO v2 architecture. The “tiny” YOLO model is smaller
and therefore less accurate than the full one, but it’s also faster. Like SSD it was designed
to run in real-time.

(Tiny)YOLO and SSD(Lite) work along the same lines. There are many architectural differences
between them, but in the end both models make predictions on a fixed-size grid. Each
cell in this grid is responsible for detecting objects in a particular location in the original
input image.

It’s not super important that you understand the inner workings of these models — what
matters is that they take an image as input and produce a tensor, or multi-array as Core
ML calls it, of a certain size as output. This tensor contains the bounding box predictions
in one form or another.

For an in-depth explanation of how these kinds of models work and how they are trained,
see my blog post One-shot object detection115 .

The output tensor of TinyYOLO v2 is interpreted as a grid that has 13×13 cells. Turi Create’s
version predicts 15 different bounding boxes per grid cell, or 13×13×15 = 2535 bounding
boxes in total.

For SSDLite there are several different grids ranging in size from 19×19 to 1×1 cells. The
number of bounding boxes per cell is 3 for the largest grid and 6 for the others, giving a
total of 1917 boxes.
114 https://github.com/apple/turicreate/blob/master/userguide/object_detection/README.md
115 https://machinethink.net/blog/object-detection/

These models always predict the same number of bounding boxes, even if there is no object
at a particular location in the image. To filter out the useless predictions, a post-processing
step called non-maximum suppression (or NMS) is necessary.

In addition, the predictions aren’t of real bounding boxes but are relative to so-called
anchor boxes. In order to turn the predictions into true rectangles, they must be decoded
first. YOLO and SSD both use anchor boxes but have different ways of doing this.

Vision makes it easier

Until recently, the decoding and NMS post-processing steps had to be performed afterwards
in Swift116 . The model would output an MLMultiArray containing the grid of predictions, and
you had to loop through the cells and perform these calculations yourself.

But as of iOS 12 and macOS 10.14, things have become a lot easier. The YOLO object detector
from Turi Create is directly supported by the Vision framework. You simply perform a
Vision request on the image and the result is an array of VNRecognizedObjectObservation
objects that contain the coordinates and class labels for the bounding boxes.

Vision automatically decodes the predictions for you and even performs NMS. How
convenient is that!

The goal of this chapter is to create a version of MobileNetV2+SSDLite that works with
Vision, just like Turi Create’s model does. First, we’ll convert the original model from
TensorFlow to Core ML, then we’ll add operations that will decode the predictions using
the anchor boxes, and finally we’ll add NMS to the model as well.

Once all that is done, you can use SSD with Vision too — and get your hands on those
precious VNRecognizedObjectObservation objects.

Converting from TensorFlow

We’ll be using the model ssdlite_mobilenet_v2_coco. You can download it here117 .

This model was trained using the TensorFlow Object Detection API118 and so we’ll need to
use tfcoreml to convert it to Core ML.

If you don’t have tfcoreml installed yet, do so first:


116 https://github.com/hollance/YOLO-CoreML-MPSNNGraph
117 http://download.tensorflow.org/models/object_detection/ssdlite_mobilenet_v2_coco_2018_05_09.tar.gz
118 https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

pip install -U tfcoreml

The conversion process will give us a version of SSD that will work with Core ML but you
won’t be able to use it with the new Vision API just yet.

Note: The following instructions were tested with coremltools 2.0, tfcoreml 0.3.0, and
TensorFlow 1.7.0. If you’re using different versions of any of these packages, you may also
get different results. Especially tfcoreml is being updated and improved regularly, so some
of the issues we’ll encounter here may have been fixed in later versions.

The ssdlite_mobilenet_v2_coco download contains the trained SSD model in a few different
formats: a frozen graph, a checkpoint, and a SavedModel. tfcoreml needs to use a frozen
graph but the downloaded one gives errors — it contains "cycles" or loops, which are a
no-go for tfcoreml. We'll use the SavedModel and convert it to a frozen graph without cycles.

First, load the SavedModel into a new TensorFlow graph object:

import tensorflow as tf
from tensorflow.python.tools import strip_unused_lib
from tensorflow.python.framework import dtypes
from tensorflow.python.platform import gfile

def load_saved_model(path):
    the_graph = tf.Graph()
    with tf.Session(graph=the_graph) as sess:
        tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], path)
    return the_graph

saved_model_path = "ssdlite_mobilenet_v2_coco_2018_05_09/saved_model"
the_graph = load_saved_model(saved_model_path)

Next, we’ll use a helper function to strip away unused subgraphs and save the result as
another frozen model:

frozen_model_file = "frozen_model.pb"
input_node = "Preprocessor/sub"
bbox_output_node = "concat"
class_output_node = "Postprocessor/convert_scores"

def optimize_graph(graph):
    gdef = strip_unused_lib.strip_unused(
        input_graph_def=graph.as_graph_def(),
        input_node_names=[input_node],
        output_node_names=[bbox_output_node, class_output_node],
        placeholder_type_enum=dtypes.float32.as_datatype_enum)

    with gfile.GFile(frozen_model_file, "wb") as f:
        f.write(gdef.SerializeToString())

optimize_graph(the_graph)

The strip_unused() function will only keep the portion of the graph in between the specified
input and output nodes, and removes everything else. What’s left is the only piece of the
original graph that Core ML can actually handle – the rest is full of unsupported operations.

The part of the TensorFlow graph that we keep has one input for the image and two
outputs: one for the bounding box coordinate predictions and one for the classes. While
YOLO combines the coordinate and class predictions into a single tensor, SSD makes these
predictions on two separate branches. That’s why we also have to supply the names of two
output nodes.

I looked up these node names in the saved_model.pb file using Netron. TensorFlow models
can be quite complicated, so it usually takes a bit of searching to find the nodes you need.
Another trick is to simply print out a list of all the operations in the graph and look for
ones that seem reasonably named, then run the graph up to that point and see what sort of
results you get.

By the way, I didn’t come up with this code myself — it’s taken from this tfcoreml example
notebook119. Interestingly, they use a different output node. SSD does multi-label classification
on the class predictions, which applies a sigmoid activation to the class scores. The
node that they use, "concat_1", happens before the sigmoid, while "Postprocessor/convert_scores"
is after the sigmoid. Including the latter node saves us from having to apply the
sigmoid ourselves.

Now we’ve got something that tfcoreml will be happy with. To convert the frozen TensorFlow
graph to Core ML, do the following:

119 https://github.com/tf-coreml/tf-coreml/blob/master/examples/ssd_example.ipynb

import tfcoreml

coreml_model_path = "MobileNetV2_SSDLite.mlmodel"

input_width = 300
input_height = 300

input_tensor = input_node + ":0"
bbox_output_tensor = bbox_output_node + ":0"
class_output_tensor = class_output_node + ":0"

ssd_model = tfcoreml.convert(
    tf_model_path=frozen_model_file,
    mlmodel_path=coreml_model_path,
    input_name_shape_dict={ input_tensor: [1, input_height, input_width, 3] },
    image_input_names=input_tensor,
    output_feature_names=[bbox_output_tensor, class_output_tensor],
    is_bgr=False,
    red_bias=-1.0,
    green_bias=-1.0,
    blue_bias=-1.0,
    image_scale=2./255)

This is fairly straightforward usage of tfcoreml. We specify the same input and output
names again, but this time they need to have :0 appended.

The expected size of the input image is 300×300 pixels. Fun fact: YOLO uses larger images
of 416×416 pixels.

The image preprocessing options are typical for TensorFlow image models: first divide the
pixels by 127.5 to put them in the range [0, 2], then subtract 1 to put them in the range [-1, 1].

Often TensorFlow models already do their own normalization and this one is no exception.
However, when we stripped away the parts of the model we couldn’t use, such as anything
before the input node "Preprocessor/sub:0", we also got rid of the original model’s built-in
preprocessing operations.

After a few brief moments, if all goes well, tfcoreml completes the conversion and saves
the mlmodel to a file. We now have a Core ML model that takes a 300×300 image as input
and produces two outputs: a multi-array with the coordinates for 1917 bounding boxes and
another multi-array with the class predictions for the same 1917 bounding boxes.

At this point, the predicted coordinates aren’t “real” coordinates yet — they still have to
be decoded using the anchor boxes. Also remember that the model always predicts 1917
bounding boxes for any image. Most of these bounding boxes will not have detected an
actual object. In that case, the class “unknown” has the highest score. We’ll later use non-
maximum suppression to filter out these predictions.

Cleaning it up

In theory we could start using the converted model already, but I always like to clean up
the converted model first.

Due to how the input and output tensors were named in the original model, the Core ML
model’s input is named "Preprocessor__sub__0" while the two outputs are
"Postprocessor__convert_scores__0" and "concat__0". Those are pretty meaningless — and ugly! — names.

Let’s rename the input to "image" and the two outputs to "scores" and "boxes", respectively.
This requires using the model’s spec object:

spec = ssd_model.get_spec()

spec.description.input[0].name = "image"
spec.description.input[0].shortDescription = "Input image"
spec.description.output[0].name = "scores"
spec.description.output[0].shortDescription = "Predicted class scores for each bounding box"
spec.description.output[1].name = "boxes"
spec.description.output[1].shortDescription = "Predicted coordinates for each bounding box"

It’s not enough to change these names in the spec.description. Any layers that are connected
to the old input or output names must now use the new names too. Likewise for the object
that handles the image preprocessing.

input_mlmodel = input_tensor.replace(":", "__").replace("/", "__")
class_output_mlmodel = class_output_tensor.replace(":", "__").replace("/", "__")
bbox_output_mlmodel = bbox_output_tensor.replace(":", "__").replace("/", "__")

for i in range(len(spec.neuralNetwork.layers)):
    if spec.neuralNetwork.layers[i].input[0] == input_mlmodel:
        spec.neuralNetwork.layers[i].input[0] = "image"
    if spec.neuralNetwork.layers[i].output[0] == class_output_mlmodel:
        spec.neuralNetwork.layers[i].output[0] = "scores"
    if spec.neuralNetwork.layers[i].output[0] == bbox_output_mlmodel:
        spec.neuralNetwork.layers[i].output[0] = "boxes"

spec.neuralNetwork.preprocessing[0].featureName = "image"

If we now look at the outputs using print(spec.description), the "scores" output correctly
shows up as a multi-array but its shape is not filled in:

type {
  multiArrayType {
    dataType: DOUBLE
  }
}

We know for a fact that this always outputs an array of shape (91, 1917) because there
are 91 classes and 1917 bounding boxes. Why 91 classes? This model was trained on the
COCO dataset and so it can detect 90 possible types of objects, plus one class for “no object
detected”.

Let’s remedy that and fill in the output shape:

num_classes = 90
num_anchors = 1917
spec.description.output[0].type.multiArrayType.shape.append(num_classes + 1)
spec.description.output[0].type.multiArrayType.shape.append(num_anchors)

Note that we define num_classes to be 90, not 91, because we want to ignore the bounding
boxes that did not detect anything. In fact, later on we’ll remove any detections for this
“unknown” class from the model completely.

The "boxes" output’s shape has something wrong with it too:

multiArrayType {
  shape: 4
  shape: 1917
  shape: 1
  dataType: DOUBLE
}

The first two dimensions are correct, but there is no reason to have that third dimension
of size 1, so we might as well get rid of it:

del spec.description.output[1].type.multiArrayType.shape[-1]

Note: The model will work fine even when the output shapes aren’t filled in completely or
even incorrectly. But the shape also serves as documentation for the user of the model, and
so I like to make sure it is right.

Finally, let’s convert the weights to 16-bit floats:



import coremltools as ct
spec = ct.utils.convert_neural_network_spec_weights_to_fp16(spec)

And save the model again. We could just save the spec, but it’s easier to create a new MLModel
object — and we’ll need this later to build the pipeline anyway.

ssd_model = ct.models.MLModel(spec)
ssd_model.save(coreml_model_path)

Now if you open MobileNetV2_SSDLite.mlmodel in Xcode, it shows the following:

The input is a 300×300-pixel image and there are two multi-array outputs. The scores output
is pretty straightforward to interpret: for every one of the 1917 bounding boxes there is a
91-element vector containing a multi-label classification.

However, in order to use this model in an app, you still need some way to make sense out
of the predicted bounding box “coordinates” from the boxes output…

Decoding the bounding box predictions

As I mentioned, the values coming out of the boxes output are not real coordinates yet.
Instead, these numbers describe how to modify the anchor boxes.

An anchor box is nothing more than a pre-defined rectangle that is located somewhere in
the original image. It is described by four numbers: x, y, width, and height. There are 1917
unique anchor boxes, one for each prediction from boxes.

The purpose of the anchor boxes is to give the model some idea of what sizes of objects
to look for. The anchor boxes represent the most common (rectangular) shapes among the
objects in the dataset. They act as hints for the model during the training process — without
these anchor boxes, it’s much harder for the model to learn its objective.

The four numbers that SSD predicts for each bounding box describe how the position and
size of the corresponding anchor box should be modified in order to fit the detected object.

For example, the predicted numbers may say, “move my anchor box 20 pixels to the left,
and make it 5% wider but also 3% less tall.”

The model is trained to make its prediction using the anchor box that best fits the detected
object, and then tweak the box a little so that it fits perfectly.

The anchor boxes are chosen prior to training and are always fixed. In other words, they
are a hyperparameter. The original YOLO ran a clustering algorithm on the training set
to determine the most common object shapes, but SSD and also Turi Create use a simple
mathematical formula for selecting the positions and sizes of the anchor boxes.

To be able to decode the coordinate predictions, we first need to know what the anchor
boxes are. It’s possible to dig up the mathematical formula for computing the anchor box
positions and sizes — but as this formula is part of the original TensorFlow model, we can
also simply ask the graph:

import numpy as np

def get_anchors(graph, tensor_name):
    image_tensor = graph.get_tensor_by_name("image_tensor:0")
    box_corners_tensor = graph.get_tensor_by_name(tensor_name)
    box_corners = sess.run(box_corners_tensor, feed_dict={
        image_tensor: np.zeros((1, input_height, input_width, 3))})

    ymin, xmin, ymax, xmax = np.transpose(box_corners)
    width = xmax - xmin
    height = ymax - ymin
    ycenter = ymin + height / 2.
    xcenter = xmin + width / 2.
    return np.stack([ycenter, xcenter, height, width])

anchors_tensor = "Concatenate/concat:0"
with the_graph.as_default():
    with tf.Session(graph=the_graph) as sess:
        anchors = get_anchors(the_graph, anchors_tensor)

Note: This is one of the reasons why tfcoreml cannot convert the original frozen model:
because of the operations that generate the anchor boxes. We first had to remove those
operations from the graph.

To get the appropriate anchor boxes for our desired input image size, we must run the
graph on such an image. Here we’re simply using a fake image that is all zeros. For the
anchor boxes it doesn’t matter what is actually in the image, only how large it is.

The TensorFlow graph computes each anchor box as the rectangle (ymin, xmin, ymax,
xmax). We convert these min/max values to a center coordinate plus width and height.
By the way, the anchor box coordinates are normalized, i.e. between 0 and 1, so that they
are independent of the original image size.

After you run this code, anchors is a NumPy array of shape (4, 1917). There is exactly one
anchor box for each bounding box prediction, described by the four numbers (y, x, height,
width). Yes, that’s a little weird — you probably expected the order to be (x, y, width, height)
— but this is a TensorFlow convention.

Before the introduction of the new Vision VNRecognizedObjectObservation API, you would’ve
had to save the anchors array to a file and load it into your app. Your Swift code would then
need to do the following to decode the predictions into real coordinates:

for b in 0..<numAnchors {
  // Read the anchor coordinates:
  let ay = anchors[b]
  let ax = anchors[b + numAnchors]
  let ah = anchors[b + numAnchors*2]
  let aw = anchors[b + numAnchors*3]

  // Read the predicted coordinates:
  let ty = boxes[b]
  let tx = boxes[b + numAnchors]
  let th = boxes[b + numAnchors*2]
  let tw = boxes[b + numAnchors*3]

  // The decoding formula:
  let x = (tx / 10) * aw + ax
  let y = (ty / 10) * ah + ay
  let w = exp(tw / 5) * aw
  let h = exp(th / 5) * ah

  // The final bounding box is given by (x, y, w, h)
}

Here, numAnchors is 1917. This code snippet loops through each predicted bounding box
and applies a simple formula involving exp() and some arithmetic to get the decoded
coordinates.

You could optimize this logic using the vectorized functions from the Accelerate framework,
or even implement it as a custom layer in the Core ML model. This is a massively parallel
computation — the exact same formula is applied to each of the 1917 predictions — and so
it’s well suited to running on the GPU.
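
To give a flavor of the Accelerate route, here is a rough sketch using the vDSP and vForce Swift overlays introduced in iOS 13 — the [Float] arrays for the predictions and anchors are assumptions for illustration, not part of the conversion script:

import Accelerate

// Rough sketch of vectorized decoding (assumes iOS 13's vDSP/vForce Swift API).
// ty, tx, th, tw and ay, ax, ah, aw are assumed to be [Float] arrays of
// length numAnchors, taken from the model output and the anchors array.
func decode(ty: [Float], tx: [Float], th: [Float], tw: [Float],
            ay: [Float], ax: [Float], ah: [Float], aw: [Float])
            -> (x: [Float], y: [Float], w: [Float], h: [Float]) {
  // x = (tx / 10) * aw + ax, and likewise for y.
  let x = vDSP.add(ax, vDSP.multiply(vDSP.divide(tx, 10), aw))
  let y = vDSP.add(ay, vDSP.multiply(vDSP.divide(ty, 10), ah))

  // w = exp(tw / 5) * aw, and likewise for h.
  let w = vDSP.multiply(vForce.exp(vDSP.divide(tw, 5)), aw)
  let h = vDSP.multiply(vForce.exp(vDSP.divide(th, 5)), ah)
  return (x, y, w, h)
}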

But I want to use the shiny new Vision API! And so we’re going to take another approach
and add this logic to the mlmodel itself using built-in Core ML operations.

Note: The above decoding formula is unique to SSD. For YOLO you also need to decode the
predictions but the calculations are slightly different.

Decoding inside the Core ML model

Adding the decoding logic to the mlmodel involves taking the above formula and implementing
it using various Core ML layer types. We could directly add these layers to the SSD
model that we just converted, but instead let’s create a completely new model from scratch.
Later, we’ll connect these models together using a pipeline.

The easiest way to build this decoder model is to use NeuralNetworkBuilder:

from coremltools.models import datatypes
from coremltools.models import neural_network

input_features = [ ("scores", datatypes.Array(num_classes + 1, num_anchors, 1)),
                   ("boxes", datatypes.Array(4, num_anchors, 1)) ]

output_features = [ ("raw_confidence", datatypes.Array(num_anchors, num_classes)),
                    ("raw_coordinates", datatypes.Array(num_anchors, 4)) ]

builder = neural_network.NeuralNetworkBuilder(input_features, output_features)

The inputs to the decoder model are exactly the same as the outputs from the SSD model.
Well, almost. The boxes output from SSD has shape (4, num_anchors) but here we say the
shape is (4, num_anchors, 1). Similarly for the scores output.

In Core ML, if the input to a neural network model is a multi-array it must have either one
or three dimensions. Since our arrays only have two dimensions, we need to add an unused
dimension of size 1 at the front or back. C’est la vie.

The outputs of the decoder model are very similar to its inputs, except that the order of the
dimensions has been flipped around. There are only two dimensions now, and the number
of bounding boxes is in the first dimension. The next step in the pipeline, non-maximum
suppression, demands it that way.

Note: Even though the input multi-arrays are not allowed to have two dimensions, this is
fine for the outputs. The NonMaximumSuppression model is not a neural network, so it doesn’t
have that restriction.

Also notice that we use num_classes for the decoder output, not num_classes + 1. We’re not
interested in bounding boxes that do not belong to any object and so the decoder will ignore
predictions for the “unknown” class.

SSD does multi-label classification, which means that the same bounding box can have
more than one class. Any class whose predicted probability is over a certain “confidence
threshold” counts as a valid prediction. For the vast majority of the 1917 predicted
bounding boxes, the “unknown” class will have a high score and all the other classes will
have scores below the threshold.

Note: With YOLO / Turi Create this is a bit different. YOLO’s classification is multi-class
(softmax), not multi-label (sigmoid). It also predicts a separate confidence score for the
bounding box itself, which indicates whether YOLO thinks this bounding box contains
an object or not. In SSD this is represented by the score for the “unknown” class. The
confidence score for the bounding box is multiplied with the probability for the highest
scoring class. If that combined confidence score is over a certain threshold, the bounding
box is kept.

All right, let’s build this decoder model. First let’s look at the scores input. The decoder
needs to do two things with the scores: 1) swap around the dimensions, and 2) strip off the
predictions for the “unknown” class. To swap the dimensions we use a permute operation:

builder.add_permute(name="permute_scores",
dim=(0, 3, 2, 1),
input_name="scores",
output_name="permute_scores_output")

Even though our input is a tensor with three dimensions, (91, 1917, 1), the permute layer
treats it as having four dimensions. The first dimension is used for sequences and we’ll
leave it alone.

After permuting, the shape of the data is now (1, 1, 1917, 91). Each bounding box prediction
gets a 91-element vector with the class scores, the first of which is the prediction for class
“unknown”. To strip this off we use a slice operation that works on the “width” axis (the
last one). We only want to keep the elements 1 through 90, so we set start_index=1 and
end_index=91 (the end index is exclusive):

builder.add_slice(name="slice_scores",
input_name="permute_scores_output",
output_name="raw_confidence",
axis="width",
start_index=1,
end_index=num_classes + 1)

Now the data has shape (1, 1, 1917, 90). That tensor can go straight into the decoder model’s
first output, "raw_confidence". Note that we declared this output to have shape (1917, 90).
The first two dimensions are automatically dropped by Core ML because they are of size 1.

Next up is the second input, boxes, with the bounding box “coordinates”. As I mentioned in
the previous section, the formula for decoding a single bounding box is as follows,

x = (tx / 10) * aw + ax
y = (ty / 10) * ah + ay
w = exp(tw / 5) * aw
h = exp(th / 5) * ah

where tx, ty, tw, th are the predictions from SSD and ax, ay, aw, ah are the anchor box
coordinates. This happened in a for loop. Seems simple enough, except that in Core ML
we can’t have loops, so we’ll have to work on the entire tensor at once.

The shape of the boxes tensor is (4, 1917, 1). Recall that we had to add that extra dimension
in order to make this input three dimensional. Since adding a dimension of size 1 doesn’t
change the data, we could have added it to the front, but we put it in the back for a good
reason: we want the coordinates to be in what Core ML calls the “channel” dimension. This
is important for later on when we need to use a concatenation layer.

It doesn’t really matter much where the 1917 goes, it can go either in the “height” dimension
or the “width” dimension. I put these terms inside quotes because it is only by convention
that they’re named that way and we’re not using them for that purpose right now.

Let’s start with the x and y coordinates. According to the decoding formula, we must divide
these by 10. But we can’t simply apply the division operation on the entire boxes array
because that would also change the width and height (these need to be divided by 5 instead).
So first, we grab only the x and y coordinates by slicing up the array:

builder.add_slice(name="slice_yx",
input_name="boxes",
output_name="slice_yx_output",
axis="channel",
start_index=0,
end_index=2)

Since the coordinates are in the “channels” dimension, that’s what we put for axis. Here
we’re slicing off the first two channels.

Notice that I called this layer "slice_yx" with y coming before x. That’s because SSDLite’s
prediction of the 4 coordinates is in the order (y, x, height, width). The anchor box
coordinates are also stored in this order.

It doesn’t really matter one way or the other what the chosen order is, as long as we’re
careful to use the correct one. The non-maximum suppression stage requires that the
decoder model outputs bounding boxes for which the coordinates are (x, y, width, height),
so at some point we’ll have to flip y and x, and height and width.

Now that we’ve isolated the y/x coordinates, we can divide them by the scalar constant 10.
In Core ML you can do this by using a multiply layer. Normally this performs element-wise
multiplication between the tensors from two inputs, but if you only supply one input it
multiplies every element in that tensor with a constant value.

builder.add_elementwise(name="scale_yx",
input_names="slice_yx_output",
output_name="scale_yx_output",
mode="MULTIPLY",
alpha=0.1)

Next up comes an interesting challenge: We must now multiply the “x” result with aw, the
width of the anchor box, and the “y” result with ah, the height of the anchor box. That
means we need to put these anchor box coordinates into the Core ML model somehow. The
operator for this is load constant.

Earlier we retrieved the anchor boxes from the TensorFlow graph. The anchors variable
contains a NumPy array of shape (4, 1917). We need to split this up into two arrays of shape
(2, 1917, 1). One array contains the y, x coordinates of the anchor boxes, the other their
heights and widths. We can do this with NumPy slicing:

anchors_yx = np.expand_dims(anchors[:2, :], axis=-1)
anchors_hw = np.expand_dims(anchors[2:, :], axis=-1)

Notice that these arrays have an empty dimension of 1 at the end. That’s what expand_dims()
is for. We’re doing this because the tensors also have the shape (2, 1917, 1).

Let’s add both of these anchors arrays into the Core ML model:

builder.add_load_constant(name="anchors_yx",
output_name="anchors_yx",
constant_value=anchors_yx,
shape=[2, num_anchors, 1])

builder.add_load_constant(name="anchors_hw",
output_name="anchors_hw",
constant_value=anchors_hw,
shape=[2, num_anchors, 1])

Now we’ll perform an element-wise multiplication between "anchors_hw" and the output
from "scale_yx":

builder.add_elementwise(name="yw_times_hw",
input_names=["scale_yx_output", "anchors_hw"],
output_name="yw_times_hw_output",
mode="MULTIPLY")

And then we’ll do an element-wise addition with the values from "anchors_yx":

builder.add_elementwise(name="decoded_yx",
input_names=["yw_times_hw_output", "anchors_yx"],
output_name="decoded_yx_output",
mode="ADD")

That completes the formula for the y and x coordinates. We’ve divided by 10 — or actually
multiplied by 1/10 — then multiplied by the anchor height/width, and added the
anchor center position.

We still have to calculate the true height and width of the predicted bounding box. This
happens in a very similar manner. First, we slice off the height and width from the original
boxes tensor (the last two channels), then divide these predictions by 5 using a multiply
layer with alpha=0.2:

builder.add_slice(name="slice_hw",
input_name="boxes",
output_name="slice_hw_output",
axis="channel",
start_index=2,
end_index=4)

builder.add_elementwise(name="scale_hw",
input_names="slice_hw_output",
output_name="scale_hw_output",
mode="MULTIPLY",
alpha=0.2)

Next, we use a unary function layer to exponentiate these values:

builder.add_unary(name="exp_hw",
input_name="scale_hw_output",
output_name="exp_hw_output",
mode="exp")

And finally, we multiply by the anchor height and width:

builder.add_elementwise(name="decoded_hw",
input_names=["exp_hw_output", "anchors_hw"],
output_name="decoded_hw_output",
mode="MULTIPLY")

Great, that was all the math we needed for decoding the predictions using the anchor boxes.
But we’re not done yet. We have two tensors now, both of size (2, 1917, 1). These need to
be combined into one big tensor of size (1917, 4). So that will involve concatenation and a
permutation of some kind.

But there is a small wrinkle: if we were to simply use a concat layer to put the two tensors
together, then the order of the coordinates is (y, x, height, width) — but we need them as (x,
y, width, height). So instead, we’ll slice them up into four separate tensors of size (1, 1917,
1) and then concatenate these in the right order.

builder.add_slice(name="slice_y",
input_name="decoded_yx_output",
output_name="slice_y_output",
axis="channel",
start_index=0,
end_index=1)

builder.add_slice(name="slice_x",
input_name="decoded_yx_output",
output_name="slice_x_output",
axis="channel",
start_index=1,
end_index=2)

builder.add_slice(name="slice_h",
input_name="decoded_hw_output",
output_name="slice_h_output",
axis="channel",
start_index=0,
end_index=1)

builder.add_slice(name="slice_w",
input_name="decoded_hw_output",
output_name="slice_w_output",
axis="channel",
start_index=1,
end_index=2)

builder.add_elementwise(name="concat",
input_names=["slice_x_output", "slice_y_output",
"slice_w_output", "slice_h_output"],
output_name="concat_output",
mode="CONCAT")

Now we can permute this from (4, 1917, 1) to (1, 1917, 4) and write the results to the second
output of the decoder model, "raw_coordinates":

builder.add_permute(name="permute_output",
dim=(0, 3, 2, 1),
input_name="concat_output",
output_name="raw_coordinates")

The final predicted bounding box coordinates are already normalized, meaning that they
are between 0 and 1 (although they can be slightly smaller or larger too). That’s good,
because the next stage, non-maximum suppression, expects them that way. For other
types of models that don’t use normalized coordinates, you might need to do some extra
arithmetic to scale the bounding box coordinates down. Turi’s YOLO model, for example,
has a scale layer at the end that divides the coordinates by 13 to normalize them.

Now that we’re done building the decoder, let’s turn it into an actual MLModel and save it:

decoder_model = ct.models.MLModel(builder.spec)
decoder_model.save("Decoder.mlmodel")

We don’t really need to save this model, as we’ll put it into a pipeline shortly, but sometimes
coremltools will crash the Python interpreter and it’s nice to not have to repeat all the work
we just did.

Just for fun, this is what the model looks like in Netron:

Tip: It’s useful to occasionally save the model and inspect it with Netron while you’re
developing it, to make sure all the layers are connected properly etc. In that case it’s best
to save the model using coremltools.utils.save_spec(builder.spec, "Decoder.mlmodel"), as
creating the MLModel object is one of the things that tends to crash Python if there is a
problem with your model definition.

Non-maximum suppression

The third and final model we’ll build is the one for non-maximum suppression. In Core ML
this is not a neural network layer but a separate model type.

Note: As of Core ML 3, there is now also a NonMaximumSuppressionLayer that lets you use NMS
directly inside the neural network model. In that case it’s not necessary to build a pipeline.
However, in this chapter we will use the separate NonMaximumSuppression model instead of
the neural network layer.

There is no convenient API for building this kind of model, so we’ll have to write some
protobuf code by hand:

nms_spec = ct.proto.Model_pb2.Model()
nms_spec.specificationVersion = 3

The specification version needs to be 3 because that’s the earliest version of the mlmodel
format that supports non-maximum suppression.

First, we need to define the inputs and outputs for this model. Because the inputs must be
the same as the decoder model’s outputs, we can simply copy their definitions:

for i in range(2):
    decoder_output = decoder_model._spec.description.output[i].SerializeToString()

    nms_spec.description.input.add()
    nms_spec.description.input[i].ParseFromString(decoder_output)

    nms_spec.description.output.add()
    nms_spec.description.output[i].ParseFromString(decoder_output)

nms_spec.description.output[0].name = "confidence"
nms_spec.description.output[1].name = "coordinates"

Because non-maximum suppression may return a different number of bounding boxes
for every input image, we should make the output shapes flexible. The confidence and
coordinates outputs may have anywhere from 0 to 1917 bounding boxes, but right now
their sizes are fixed to always return the maximum of 1917 predictions.

Normally you’d use the helper functions from coremltools.models.neural_network.flexible_shape_utils
but these are really meant for changing the inputs and outputs of neural
network models. But this isn’t going to be a neural network, so we’ll have to modify the
spec ourselves:

output_sizes = [num_classes, 4]
for i in range(2):
    ma_type = nms_spec.description.output[i].type.multiArrayType
    ma_type.shapeRange.sizeRanges.add()
    ma_type.shapeRange.sizeRanges[0].lowerBound = 0
    ma_type.shapeRange.sizeRanges[0].upperBound = -1
    ma_type.shapeRange.sizeRanges.add()
    ma_type.shapeRange.sizeRanges[1].lowerBound = output_sizes[i]
    ma_type.shapeRange.sizeRanges[1].upperBound = output_sizes[i]
    del ma_type.shape[:]

The first dimension from each output represents the number of bounding boxes that will
be returned. We set the lower bound for this dimension to 0, so the minimum predicted
amount is no bounding boxes. This happens when none of the boxes has a predicted class
score that is over the confidence threshold.

The upper bound is -1, meaning there is no limit to the maximum number of predictions
— although you’ll never get more than 1917 results, and in practice it’s usually less than 10
or so.

The second dimension of the output tensor is not flexible at all because the lower bound
and upper bound are equal here. For the confidence output it is always num_classes (90), for
the coordinates output it’s always 4.

We also remove the old fixed-size shape from the outputs using del ma_type.shape[:]
because that conflicts with the flexible size ranges.

Tip: In order to figure out how to do this sort of thing, I spent some time reading through
the Turi Create source code120 , notably the method export_coreml(). Recommended practice
for anyone who wants to learn how to build advanced Core ML models!

Now let’s turn this into a model of type NonMaximumSuppression. You can find the full
definition of this model type in NonMaximumSuppression.proto121 . We begin by filling out
the “feature name” fields; these tell the NMS model which input / output is used for what.

120 https://github.com/apple/turicreate/blob/master/src/python/turicreate/toolkits/object_detector/object_detector.py
121 https://github.com/apple/coremltools/blob/master/mlmodel/format/NonMaximumSuppression.proto

nms = nms_spec.nonMaximumSuppression
nms.confidenceInputFeatureName = "raw_confidence"
nms.coordinatesInputFeatureName = "raw_coordinates"
nms.confidenceOutputFeatureName = "confidence"
nms.coordinatesOutputFeatureName = "coordinates"
nms.iouThresholdInputFeatureName = "iouThreshold"
nms.confidenceThresholdInputFeatureName = "confidenceThreshold"

We’ll also choose some good thresholds for the IOU (Intersection-over-Union) computation
and for the class confidences. These settings determine how strict the NMS is in filtering out
(partially) overlapping bounding boxes. The values provided here are defaults. The user of
the model will be able to override them because we’ll also make inputs for these settings.

default_iou_threshold = 0.6
default_confidence_threshold = 0.4
nms.iouThreshold = default_iou_threshold
nms.confidenceThreshold = default_confidence_threshold

Because SSD uses multi-label classification, the same bounding box can have more than
one class. We should therefore tell the NMS to only filter out overlapping predictions if
they have the same class. That’s what the following option is for:

nms.pickTop.perClass = True

We also add the class labels. These live in a text file with one line for each class, 90 lines in
total.

labels = np.loadtxt("coco_labels.txt", dtype=str, delimiter="\n")
nms.stringClassLabels.vector.extend(labels)

Finally, also save this model:

nms_model = ct.models.MLModel(nms_spec)
nms_model.save("NMS.mlmodel")

Note: In case you’re curious how the non-maximum suppression algorithm works, the
CoreMLHelpers122 repo has an implementation in Swift. But we won’t need to use that here,
as our model now includes its own NonMaximumSuppression stage.
122 https://github.com/hollance/CoreMLHelpers
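
If you just want the gist of the algorithm without digging through that repo, here is a bare-bones sketch of greedy NMS — a simplified illustration only, not the CoreMLHelpers code, and without the per-class handling our model uses:

import CoreGraphics

struct Prediction {
  let classIndex: Int
  let score: Float
  let rect: CGRect
}

// Bare-bones greedy non-maximum suppression: repeatedly keep the highest-
// scoring box and discard any remaining box that overlaps it too much.
func nonMaxSuppression(_ predictions: [Prediction], iouThreshold: Float) -> [Prediction] {
  var kept: [Prediction] = []
  let sorted = predictions.sorted { $0.score > $1.score }

  for candidate in sorted {
    let overlapsKeptBox = kept.contains { other in
      let inter = candidate.rect.intersection(other.rect)
      let interArea = inter.width * inter.height
      let unionArea = candidate.rect.width * candidate.rect.height
                    + other.rect.width * other.rect.height
                    - interArea
      return Float(interArea / unionArea) > iouThreshold
    }
    if !overlapsKeptBox {
      kept.append(candidate)
    }
  }
  return kept
}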

Putting it together into a pipeline

We now have three separate models:

1. The MobileNetV2+SSDLite model that produces class scores and coordinate predic-
tions that still need to be decoded.

2. The decoder model that uses the anchor boxes to turn the predictions from SSD into
real bounding box coordinates.

3. The non-maximum suppression model that only keeps the best predictions.

We can combine these three into a single pipeline model, so that the image goes into one
end, and zero or more bounding box predictions come out the other.

To build the pipeline, we’ll use the coremltools.models.pipeline module.

from coremltools.models.pipeline import *

input_features = [ ("image", datatypes.Array(3, 300, 300)),
                   ("iouThreshold", datatypes.Double()),
                   ("confidenceThreshold", datatypes.Double()) ]

output_features = [ "confidence", "coordinates" ]

pipeline = Pipeline(input_features, output_features)

First we define which inputs and outputs are exposed by the pipeline model. The "image"
input goes into the SSD model, while the two threshold inputs go into the NMS model. The
two outputs also come from the NMS model.

We can now add the different models to the Pipeline object. However, at this point there
is something we need to fix. Recall that the SSDLite model outputs two multi-arrays with
shapes (91, 1917) and (4, 1917), respectively. But because Core ML demands that multi-array
inputs to a neural network are one or three dimensional, we added a dimension of size 1
to the input shapes of the decoder model: (91, 1917, 1) and (4, 1917, 1).

In order for Pipeline to connect these two models, their outputs and inputs must be
identical. Therefore, we also need to add that extra dimension to the output of the SSDLite
model:

ssd_output = ssd_model._spec.description.output
ssd_output[0].type.multiArrayType.shape[:] = [num_classes + 1, num_anchors, 1]
ssd_output[1].type.multiArrayType.shape[:] = [4, num_anchors, 1]

And now we can add the three models, in order:

pipeline.add_model(ssd_model)
pipeline.add_model(decoder_model)
pipeline.add_model(nms_model)

Ideally, this should be all we’d need to do, but unfortunately the definition of the pipeline’s
inputs and outputs is not exactly correct. The "image" input is defined to be a multi-array
while we want it to be an image. Similarly, the types of the two outputs are wrong. This is
a limitation in the Pipeline API. The easiest solution is to copy the proper definitions from
the first (SSD) and last (NMS) models.

pipeline.spec.description.input[0].ParseFromString(
ssd_model._spec.description.input[0].SerializeToString())
pipeline.spec.description.output[0].ParseFromString(
nms_model._spec.description.output[0].SerializeToString())
pipeline.spec.description.output[1].ParseFromString(
nms_model._spec.description.output[1].SerializeToString())

While we’re at it, let’s add human-readable descriptions to these inputs and outputs. I
borrowed the ones from Turi Create’s YOLO model:

pipeline.spec.description.input[1].shortDescription = "(optional) IOU Threshold override"
pipeline.spec.description.input[2].shortDescription = "(optional) Confidence Threshold override"
pipeline.spec.description.output[0].shortDescription = u"Boxes \xd7 Class confidence"
pipeline.spec.description.output[1].shortDescription = u"Boxes \xd7 [x, y, width, height] (relative to image size)"

Let’s also add some metadata to the model:

pipeline.spec.description.metadata.versionString = "ssdlite_mobilenet_v2_coco_2018_05_09"
pipeline.spec.description.metadata.shortDescription = "MobileNetV2 + SSDLite, trained on COCO"
pipeline.spec.description.metadata.author = "Converted to Core ML by Matthijs Hollemans"
pipeline.spec.description.metadata.license = "https://github.com/tensorflow/models/blob/master/research/object_detection"

Turi Create’s YOLO model also adds some extra information to the user-defined metadata,
which sounds like a good idea. For example, it might be useful to include the class labels:

user_defined_metadata = {
"classes": ",".join(labels),
"iou_threshold": str(default_iou_threshold),
"confidence_threshold": str(default_confidence_threshold)
}
pipeline.spec.description.metadata.userDefined.update(user_defined_metadata)

Before I forget, we should set the specification version to 3 because our pipeline uses the
NonMaximumSuppression model type that is only available on iOS 12 or macOS 10.14 and
better. Xcode will still load the model just fine even if it has the wrong specification version,
but Core ML may crash if you try to use it on a device with an OS version that is too old.

pipeline.spec.specificationVersion = 3

And finally, we can save the model:

final_model = ct.models.MLModel(pipeline.spec)
final_model.save(coreml_model_path)

And there you have it: MobileNetV2_SSDLite.mlmodel is now a complete object detector
model, including the logic that decodes the bounding box coordinates and non-maximum
suppression. When you open the mlmodel file in Xcode, it now looks like this:

You can find the full conversion script, ssdlite.py, in the repo for the book (https://github.com/hollance/coreml-survival-guide).

Using the model in an app

This is as easy as it gets. First create the VNCoreMLModel instance:

let coreMLModel = MobileNetV2_SSDLite()

lazy var visionModel: VNCoreMLModel = {
    do {
        return try VNCoreMLModel(for: coreMLModel.model)
    } catch {
        fatalError("Failed to create VNCoreMLModel: \(error)")
    }
}()

Then create the VNCoreMLRequest object:

lazy var visionRequest: VNCoreMLRequest = {
    let request = VNCoreMLRequest(model: visionModel, completionHandler: {
        [weak self] request, error in
        if let results = request.results as? [VNRecognizedObjectObservation] {
            /* do something with the results here */
        }
    })
    request.imageCropAndScaleOption = .scaleFill
    return request
}()

And when you’re ready to run the model on an image, create the VNImageRequestHandler:

let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)

do {
    try handler.perform([self.visionRequest])
} catch {
    print("Failed to perform Vision request: \(error)")
}

Inside the completion handler for the VNCoreMLRequest you get the results as an array of
VNRecognizedObjectObservation objects. There is one such object for every detected object in
the image. It contains a property labels with the classification scores for the class labels,
and a property boundingBox with the coordinates of the bounding box rectangle.
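
For example, here is a minimal sketch of the loop you might put inside that completion handler, using the results array from the request above (the print statement is just for illustration):

for observation in results {
    // The labels array is sorted by confidence, so the first entry is the best guess.
    guard let topLabel = observation.labels.first else { continue }

    // boundingBox is in normalized coordinates (0...1), with the origin
    // in the lower-left corner.
    let box = observation.boundingBox

    print("\(topLabel.identifier): \(topLabel.confidence), box: \(box)")
}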

And that’s really all there is to it!

All the math and post-processing is done inside the Core ML model already. All you need to
do is make the request and look at the VNRecognizedObjectObservation result objects.

Note: The IOU and confidence thresholds are optional but it doesn’t look like you can set
these with Vision. There is no API in Vision that lets you pass in any input values except for
a single image. Of course, if you use Core ML directly you can set these thresholds, but you’ll
also need to read the bounding box coordinates and class scores from the MLMultiArray by
yourself, so you lose the convenience of having VNRecognizedObjectObservation objects.

If you want to draw the bounding box rectangles, there are a few things to keep in mind (a small conversion sketch follows this list):

• The imageCropAndScaleOption is important because it determines how the original input image is transformed before it goes into the Core ML model. You will need to apply the inverse transformation to the bounding box coordinates.

• The predicted bounding box is in normalized image coordinates, which go from 0 to 1. You will need to scale them up to the size at which you’re displaying the image.

• The origin of the bounding box coordinates is in the lower-left corner. That’s just how Vision likes to do things.
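
As a rough sketch, here is one way to do that conversion for the .scaleFill option used above. The viewSize parameter is a placeholder for the size at which you display the image; other crop-and-scale options need a different inverse transform:

func convertToViewRect(boundingBox: CGRect, viewSize: CGSize) -> CGRect {
    // Scale the normalized coordinates up to the display size.
    // With .scaleFill the model input covers the whole image, so a plain scale works.
    let width = boundingBox.width * viewSize.width
    let height = boundingBox.height * viewSize.height
    let x = boundingBox.origin.x * viewSize.width

    // Vision's origin is in the lower-left corner, UIKit's is in the
    // upper-left corner, so flip the y-coordinate.
    let y = (1 - boundingBox.origin.y - boundingBox.height) * viewSize.height

    return CGRect(x: x, y: y, width: width, height: height)
}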

The repo for the book (https://github.com/hollance/coreml-survival-guide) contains a demo app that runs the MobileNetV2+SSDLite model on the live camera feed. Here it is in action in my kitchen:

By the way, recall that we stored the names of the class labels inside the model’s metadata.
You can read those label names using the following code. The demo app uses this to assign
a unique color to each class label when the model is first loaded.

guard let userDefined = coreMLModel.model.modelDescription
                          .metadata[MLModelMetadataKey.creatorDefinedKey]
                          as? [String: String],
      let allLabels = userDefined["classes"] else {
    fatalError("Missing metadata")
}

let labels = allLabels.components(separatedBy: ",")

Was it all worth it?

After having seen the effort it took to add the coordinate decoding and NMS into the Core
ML model, you may be wondering if it’s really worth it — maybe doing this in a custom
layer or with a bit of Swift code wasn’t such a bad idea after all?

The answer is: it depends.

If your goal is to get a model that runs as fast as possible, you’d have to measure the speed
of the different implementations on different devices. That’s the only way to find out which
approach is fastest.

The advantage of putting everything inside the same model is that it is much more
convenient to use. It also lets Core ML choose the best hardware to run the model on. On
the iPhone XS, this model runs on the Neural Engine. Currently that’s not possible if you
use a custom layer.

It’s certainly possible that doing the decoding by yourself — using the Accelerate framework, for instance — ends up being a little faster than letting Core ML do it. With Core ML
you have to use generic operations such as slice, permute and concat, while in your own
code it’s just a matter of changing a pointer address.

But does it matter? Most of the time taken by this model is spent in the SSD neural network.
Optimizing the coordinate decoding may not make any difference to the overall speed.
Again, the only way to find out is to measure it!

Note: It turns out that MobileNetV2+SSDLite actually runs quite slowly on the Neural
Engine. Turi Create’s TinyYOLO is much faster — which is interesting because on iPhones
that do not have a Neural Engine it’s the other way around… SSDLite on the GPU is way
faster than TinyYOLO. I think this is because of the depthwise separable convolutions —
the Neural Engine doesn’t really seem to get along very well with those.
Using Protobuf Without coremltools
Because an mlmodel is nothing more than a binary file in the protobuf format, you don’t
actually need coremltools at all. All you need is the protobuf library. You can directly read
and write mlmodel files using any programming language that supports protobuf — even
Swift.

Here’s how this works: first you use protoc, the protobuf compiler, to generate source files
in a particular programming language that correspond to the Core ML proto files. Then
you use the objects from these generated source files to manipulate the mlmodel in that
programming language.

First let’s do it in Python and then in Swift.

Install the protobuf compiler

To compile the proto files to your programming language of choice, you need protoc. The
easiest way to install this on macOS is through the Homebrew package manager (https://brew.sh/):

$ brew update
$ brew install protobuf

You may also need to install the Xcode Command Line Tools:

$ xcode-select --install

Also install the protobuf library for Python if you don’t have it already:

$ pip install -U protobuf

Download the proto files

The mlmodel specification consists of a number of .proto files. The easiest way to get the
most up-to-date version of these is from the coremltools GitHub repo.

$ git clone https://github.com/apple/coremltools.git

You only need the contents of the folder coremltools/mlmodel/format, so copy those to a
new folder:

$ cp -R coremltools/mlmodel/format ./proto

Create a new folder named proto_python. This will receive the output from the protobuf
compiler:

$ mkdir proto_python

Now go into the proto folder (the one containing the proto files) and run the protobuf
compiler:

$ cd proto
$ protoc --python_out=../proto_python *.proto

This will create a _pb2.py file for every .proto file. The protobuf compiler has turned every
definition from the proto files into an object that you can use from Python.

Note: Feel free to take a look inside these pb2.py files, but they’re pretty unreadable.

Python example

Now go into the proto_python folder and start a Python REPL or write a new script. You can
do the following from Python to load an mlmodel file:

import Model_pb2

spec = Model_pb2.Model()

with open("YourModel.mlmodel", "rb") as f:


spec.ParseFromString(f.read())

The line import Model_pb2 loads the Model_pb2.py file you just generated from Model.proto,
which in turn imports all the other generated files. The Model_pb2 module contains a Model
class that can be used to load an mlmodel file.

And now you have a valid spec object again. You can do anything you want with this spec
object, in exactly the same manner as before. In fact, the above thing is what coremltools
does too — the coremltools repo even contains the pb2.py files already, in the folder
coremltools/proto.

To save this spec to a new mlmodel file, you can write:

with open("YourModelNew.mlmodel", "wb") as f:


f.write(spec.SerializeToString())

This shows that you don’t actually need coremltools at all — you only need the proto files
with the mlmodel specification!

Swift example

Lots of programming languages support protobuf, including Swift!

This opens up all kinds of interesting possibilities… Training models with Core ML is still very limited (see Part 7 on on-device personalization), but your app could implement its own training logic, then use protobuf to write the trained weights to an mlmodel file, and finally use the on-device Core ML compiler to turn this into an mlmodelc. It’s quite the undertaking, but not impossible. ;-)

First install the protoc-gen-swift extension that allows protoc to generate Swift code. The
easiest way to do this is through Homebrew again:

$ brew install swift-protobuf

Now make a new folder for the generated source files:



$ mkdir proto_swift

And run the protobuf compiler again, this time telling it to generate Swift code:

$ cd proto
$ protoc --swift_out=../proto_swift *.proto

There is now a .pb.swift file for every .proto file. If you look inside these Swift files, you’ll
notice that the messages from the proto files now all have corresponding Swift structs. (The
message names are prefixed by CoreML_Specification_.)

To read and write protobuf files in Swift, you need to add these pb.swift files to your
Xcode project. You also need the SwiftProtobuf library. There are different ways to add
this library to your project, but to keep things simple we’ll use the copy-paste method of
package management.

First clone the swift-protobuf GitHub repo (https://github.com/apple/swift-protobuf):

$ git clone https://github.com/apple/swift-protobuf.git

Open the SwiftProtobuf.xcodeproj and build the SwiftProtobuf_iOS target.

To test this all works, create a new iOS app project and copy the proto_swift folder with
the pb.swift files into the project. Also make it link with the SwiftProtobuf.framework you
built in the previous step — just drag that from the SwiftProtobuf Xcode project into the
new project’s “Embedded Binaries” section (make sure “Copy items if needed” is checked).

Note: In the Build Settings for your new project, make sure Framework Search Paths has
the folder in it where SwiftProtobuf.framework is located. Otherwise you’ll get the error
“No such module SwiftProtobuf” when you build the app.

Also add an mlmodel file to the app, for example MobileNet.mlmodel (https://docs-assets.developer.apple.com/coreml/models/MobileNet.mlmodel). Because Xcode
will try to compile this to an mlmodelc, give it a different file extension such as .bin.

In ViewController.swift, add an import:


import SwiftProtobuf

In viewDidLoad(), add the following code to load the mlmodel file:

if let url = Bundle.main.url(forResource: "MobileNet", withExtension: "bin"),
   let data = try? Data(contentsOf: url),
   let model = try? CoreML_Specification_Model(serializedData: data) {
    print(model.description_p)
    /* do other stuff with model */
}

This first loads the MobileNet.bin file into a Data object and unserializes it into a CoreML_-
Specification_Model object. This is the exact same as the Model object you saw earlier in the
Python code, but now in Swift.

To use this model object, access its properties like you would in Python:

print(model.description_p)
print(model.neuralNetworkClassifier.classLabels!)
print(model.neuralNetworkClassifier.layers.last!)
/* and so on... */

Note: The description field is called description_p because otherwise it clashes with the
Swift symbol.

To save the CoreML_Specification_Model object to a file again, first turn it into Data object
and then save that somewhere:

if let outData: Data = try? model.serializedData() {
    try? outData.write(to: outURL)
}

You can now use Swift code to inspect and modify the mlmodel!
Encrypting Models
Note: As of iOS 14 and macOS 11.0, Core ML models can be encrypted. Xcode will generate
an encryption key for the model and securely store it on Apple’s servers. To decrypt the
model when the app instantiates it, Core ML will fetch the encryption key over the network
from Apple’s servers (the first time only) and automatically decrypts the model in memory.
It’s a great solution if you’re worried that people will steal your proprietary tech.

Because of this new feature, the contents of this chapter are now a little outdated. To learn
how to use the new model encryption stuff, see the Apple developer page for Core ML.

Core ML model files are stored in an open format. Anyone can read mlmodel files, for
example using tools such as Netron or simply by loading them into a spec object using
coremltools and a bit of Python.

If someone gets their hands on your mlmodel file, or even the compiled mlmodelc, they
can easily use it in their own app. That’s something you will want to prevent if your model
is proprietary and gives you a competitive advantage.

Or perhaps you’re required by law to protect your model because its learned parameters
can reveal sensitive information about the original training set.

I don’t want to be a scaremonger here, but not everyone realizes that mlmodel files are
so revealing. I’m a big fan of open formats and I’m really happy that Core ML models are
accessible, but you should be aware of the downsides too. (The same is true for TensorFlow
graphs, by the way.)

A few things to keep in mind:

• Your app bundle does not include the mlmodel file but the compiled version, mlmodelc. This is not an open format but it’s also not that hard to reverse engineer. In fact,
it’s super simple to copy the mlmodelc folder from an existing app into another app
and use it through the MLModel API.

• When you train a model and convert it to Core ML, it is your intellectual property. Someone else using that model without your permission is a form of copyright infringement, and you can take legal action against them. That might be enough of a deterrent against people stealing your work. However, you may still want to protect your model so that your competitors can’t learn your trade secrets.

Note: I am not a legal expert nor a cryptography expert, so don’t just take my word for it.
This chapter is intended to highlight a potential drawback of using Core ML, but as a non-
expert I can only provide handwavy solutions. If this issue matters to you or your company,
you should seek additional professional advice.

Anyone who downloads your app onto their device can also get it on their computer. The
IPA file is just a ZIP file and the app bundle is just a folder, so it’s really easy to look inside
the contents of an app. As of iTunes 12.7 this has become a little harder because it no longer
shows the IPAs for apps you downloaded from the App Store. But you can still copy the IPAs
using Apple Configurator 2, available from the Mac App Store.

In other words: whatever is in your IPA file is accessible to the world. If you don’t protect
your Core ML model, then the mlmodelc folder is there for everyone to see. People can copy
that folder into their own app and start using your model. They don’t even need to have
the original mlmodel file or the Xcode-generated class.

If you don’t want anyone to see what is in your Core ML model, then you should not include
it unprotected in the app bundle. One way to do this is to encrypt the mlmodelc folder.
Anyone peeking into the IPA will only see an encrypted binary file.

Note: As a low budget solution you can zip the mlmodelc and call it DetailViewController.nib
or something else innocuous. At runtime, unzip this file and restore the mlmodelc folder.
This is only a very basic obfuscation method but it might be enough to fool some people.

You can also not include the mlmodelc in the app bundle at all but download the model
and compile it when the app is in use. Of course, you should do the download over HTTPS
so that this is encrypted as well.
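
As a rough sketch, downloading and compiling a model could look something like this. MLModel.compileModel(at:) is the real Core ML API; the remote URL, the error handling, and where you keep the compiled model are placeholders. As the points below explain, you may prefer to download an encrypted mlmodelc rather than a raw mlmodel.

import CoreML
import Foundation

func downloadAndCompileModel(from remoteURL: URL,
                             completion: @escaping (MLModel?) -> Void) {
    URLSession.shared.downloadTask(with: remoteURL) { localURL, _, error in
        guard let localURL = localURL, error == nil else {
            completion(nil)
            return
        }
        do {
            // Compiles the downloaded mlmodel into an mlmodelc folder.
            let compiledURL = try MLModel.compileModel(at: localURL)

            // You would normally move compiledURL to a permanent location here.
            let model = try MLModel(contentsOf: compiledURL)
            completion(model)
        } catch {
            print("Failed to compile model: \(error)")
            completion(nil)
        }
    }.resume()
}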

However, be aware of the following:

• While your app is using Core ML, it needs to have access to the unencrypted mlmodelc
folder. So your app needs to decrypt the model before you can use it.

• This also means that while your app is running, the unprotected mlmodelc folder is stored on the user’s device. You should make sure that this folder is not backed up or synced with iCloud (see the sketch after this list). And you should probably remove it again when the user stops using the app.

• It is not possible for anyone to read the contents of your app’s sandbox, so in theory
your decrypted mlmodelc is safe. However, if an attacker is dedicated enough they can
install your app on a jailbroken device and then they have full access to the sandbox,
including the mlmodelc.

• If you download the model and compile it on the device, you will want to delete the
original mlmodel file as soon as the compilation finishes. However, if you care about
protecting your trade secrets, I would recommend against downloading the mlmodel
at all — even if it’s encrypted. You never want the original mlmodel to end up on
the attacker’s device, or allow for the possibility of it being intercepted during the
download. If you need to download anything, make it an encrypted mlmodelc instead.

• The decryption key must also necessarily be part of your app, otherwise the app
wouldn’t be able to decrypt the files. If they know where to look, an attacker will have
access to the decryption key too. This isn’t particularly easy, as it involves jailbreaking,
using disassemblers and other special skills — but all your secrets are in the attacker’s
possession and they can be found and deciphered. Encrypting your models really is
just a fancy method of obfuscation.
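
For the backup point in the list above, here is a small sketch of how you could exclude the decrypted mlmodelc folder from backups. The folder URL is a placeholder for wherever your app stores the decrypted model:

import Foundation

func excludeFromBackup(_ folderURL: URL) throws {
    var url = folderURL
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try url.setResourceValues(values)
}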

Doing this properly is a lot of work and you’ll have to decide whether it’s worth the effort.

While writing this chapter I decided to look inside the IPAs for a number of Core ML-
enabled apps on the App Store. Quite a few of them include an unprotected mlmodelc.
Interestingly, Adobe’s Photoshop Mix includes a readable prototxt file for one of their Caffe
models and with a little digging I could probably find the weights file in the IPA too. So
apparently even a company like Adobe doesn’t put too much effort into protecting their
models…

Perhaps Core ML will include a built-in encryption feature in the future. After all, iOS takes
great pains to encrypt the binary code in your executable, so why not the “code” from your
machine learning models?

Tip: One way to prevent people from stealing your mlmodelc is to have a custom layer. The
model is useless without the Swift or Objective-C implementation for that layer, which is
not part of the mlmodelc but is compiled into the app binary and is therefore much harder
to steal. Of course, it must not be a trivial custom layer because those are easy to replicate.
Performance Tips
Speed is the big enemy of machine learning on mobile. Running a deep neural network
requires the device to perform millions or even billions of computations and memory
accesses.

iPhones and iPads are pretty speedy but they have their limitations, especially compared
to the computers that machine learning models are usually trained on. A model that takes
a fraction of a second on a big fat desktop computer may take tens of seconds on an iPhone.

What matters for the inference speed is:

• The design of your model. A smaller model will run faster than a larger model. This
is related to the number of learned parameters in the model, but also to the number
and types of layers being used. For tips about model design for mobile, see the chapter
Model Training Tips.

• The size of the input data. A fully-convolutional model can work on input images of
any size, but it will be much faster on smaller images. Try to use the smallest possible
input sizes you can get away with.

• The processor that is in the user’s device. If the user has an iPhone 6 or earlier, then
Core ML only uses the CPU. Deep learning models are not a great fit for these older
devices. If your model can take advantage of the Neural Engine, it will usually run
faster than on the GPU.

• How often you run the model. A heavy model on live video or in an offline processing
loop quickly makes the device hot and drains battery power. On a laptop or desktop,
you’ll hear the fans spinning at top speed soon enough. That is not necessarily the best
kind of user experience.

There is often a trade-off between prediction quality and prediction speed. A smaller
model with fewer parameters will give faster predictions, but they won’t be as good as
the predictions from a larger but slower model. The trick is to find the right balance that
works for your app.

You could even have different versions of the model for different device types: a small but
less accurate model for iPhone 7, a slightly larger and better model for iPhone 8 and X, and
a state-of-the-art model for the iPhone XS that makes full use of the Neural Engine. To save space in the app bundle, you could download the appropriate model for the user’s device at runtime.

Don’t use the model if you don’t have to!

Core ML is a black box and so you don’t have a lot of control over what Core ML does
internally. You just have to hope it will choose the most optimal way to run your model.

However, you do have control over how often the model runs.

My advice is to try and use the model as little as possible. Do you really need to run the
model 30 times per second, or is 15 FPS still real-time enough for your use case? If 30 FPS is
a struggle for the device but 15 FPS or less still gives good enough results, then running at
a lower frame rate probably makes your users happier. See also the chapter Working With
Video.
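
A minimal sketch of one way to throttle this, assuming an AVCaptureVideoDataOutput delegate that receives frames at 30 FPS and the visionRequest from the SSDLite example (frameCounter is just an illustrative name):

var frameCounter = 0

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    frameCounter += 1

    // Only run the Core ML model on every other frame, i.e. at roughly 15 FPS.
    guard frameCounter % 2 == 0 else { return }

    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
    try? handler.perform([visionRequest])
}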

Even better, only run the model when the camera isn’t moving. See the WWDC 2018 session
Vision with Core ML for an example of how to use VNTranslationalImageRegistrationRequest
to see if the user is holding the camera still. Run the model once to get a prediction, then
pause the model until the user has moved again.

That same WWDC session also explains how you can make object detection more efficient
by only running the object detection model every so often. In between, you can use Vision’s
object tracking methods to follow the detected object. Vision’s object tracking is much faster
than running the object detection model every frame.

If you’re clever about how often the user really needs to get predictions from the model,
you can probably find ways to avoid running the model when it’s not necessary. Machine
learning is great, but on mobile you need to be judicious about when and how to use it!

MLModelConfiguration

When instantiating your MLModel object, you can pass in an MLModelConfiguration object. This
lets you set the following options (a short usage sketch follows the list):

• computeUnits. To take advantage of the Neural Engine, set this to .all. However, some
models may run slower on the Neural Engine than on the GPU — in that case, set
computeUnits to .cpuAndGPU.

• preferredMetalDevice (iOS 13 / macOS 10.15 or later). Lets you choose which Metal
device to use for GPU execution. This is mostly useful for Macs, which can have more
than one GPU. A discrete GPU or an eGPU will likely be faster than an integrated one.

• allowLowPrecisionAccumulationOnGPU (iOS 13 / macOS 10.15 or later). The GPU uses 16-bit floats in its calculations but it keeps track of temporary results in 32-bit floats for extra precision. The GPUs in Apple devices generally work much faster with 16-bit floats, so if you enable low-precision accumulation, the temporary results get stored in 16-bit floats too. This gives a nice speed boost to things like convolution layers, but be sure to verify that your model still gives good results.
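
Here is a minimal sketch of how these options could be set when loading a model. The loadModel function and modelURL are placeholders; for an Xcode-generated class you can get the URL from its urlOfModelInThisBundle property:

import CoreML

func loadModel(at modelURL: URL) throws -> MLModel {
    let config = MLModelConfiguration()

    // Let Core ML pick the best hardware, including the Neural Engine.
    config.computeUnits = .all

    // If the model turns out to run slower on the Neural Engine, use this instead:
    // config.computeUnits = .cpuAndGPU

    // Allow 16-bit accumulation on the GPU for extra speed (verify the results!).
    config.allowLowPrecisionAccumulationOnGPU = true

    return try MLModel(contentsOf: modelURL, configuration: config)
}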

Tip: It makes sense to keep 16-bit floats in mind when training your model!

Quantizing won’t help!

Core ML lets you quantize the weights in the mlmodel file. Contrary to popular belief,
quantizing doesn’t make the runtime speed any faster. It only makes the mlmodel file
smaller inside your app bundle. By all means quantize your models but don’t expect them to
run any faster, because Core ML currently does not have quantized versions of convolution
and other operations.

(In the future Apple might build a version of the Neural Engine that can directly execute
quantized models — in which case you can expect a speed boost from quantizing — but the
current version does not appear to do this.)

Note: As of iOS 14 and macOS 11.0, layers that do matrix multiplication (InnerProductLayer
and BatchedMatMulLayer) can perform their computations with 8-bit quantized values. It’s
likely that in the future more layers may be able to do this.
Part 7: On-device Personalization

Core ML 3 introduced training models on the device. Learn how to use this exciting new
feature!
What is on-device personalization?
Simply put, on-device personalization lets you train Core ML models on the user’s device,
but in a limited way.

The idea here is that apps can fine-tune an existing model based on the user’s own data, so
that it changes from a one-size-fits-all model into something that is more tailored to each
individual user.

Because training happens on the device, all data stays on the device too, so updating the
model in this manner does not affect the user’s privacy.

What this is not: a way to train big neural networks from scratch on iPhones. Apple is
careful not to call it “training” but “updating” of models.

A short summary

The Model object in an mlmodel file now has an isUpdatable property. If this is true, the
model can be trained on-device with new data. In other words, you can’t use the training
APIs with an mlmodel file unless it specifically allows it.

On-device training currently only works for neural networks and k-Nearest Neighbors.

These can be standalone models or the last model in a pipeline. If the updatable model is
inside a pipeline, you also set the pipeline to be isUpdatable.

In a neural network, any layers that you want to be trainable must also have their
isUpdatable property set to true.

In addition to the regular inputs, which are used only during inference, the model must
define training inputs. For example, for an image classifier one of the training inputs is the
image (same as for inference) but there is also a second training input that contains the
ground-truth label for the image.

Training k-Nearest Neighbors (k-NN for short) is very straightforward and fast: you give it
new data and it simply memorizes that data. Training a neural network is more involved
and uses a training loop with a loss function and backpropagation, just like you’d expect.


Neural network models have a NetworkUpdateParameters field containing the hyperparameters, such as the loss function to use, the optimizer to use, optimizer momentum, how many epochs to train, the mini-batch size, and so on.

Some of these hyperparameters, such as the learning rate, can be overridden inside the
app.

Note: What Core ML calls “parameters” are really hyperparameters, not the learned
parameters of the model. Those are called weights and biases by Core ML.

Core ML 3 offers the following loss functions:

• categorical cross entropy (for classification)

• MSE, mean squared error (for regression)

A model with multiple outputs can also have multiple loss functions. For example, in a
model that predicts both a class label and a bounding box you would use cross-entropy for
the class label output and MSE for the bounding box coordinates.

You can choose from the following optimizers:

• SGD

• Adam

The following neural network layers can be trained:

• convolution layers

• fully-connected layers

That’s quite a limited selection, but I’m sure more layers will be added in the future. Many
of the other layer types can be backpropagated through, but not all.

Usually you’ll just be retraining the last couple of layers in your model anyway, as updating
the entire model is much more computationally expensive and typically not needed for
fine-tuning.

Inside the app

You use the MLUpdateTask API for training the models. This is a fire-and-forget API: you tell
it to start training and then get a notification when it is done. For k-NN, training is pretty
much instantaneous.

The data is provided by an MLBatchProvider object. You need to implement this yourself to
load the training examples and their ground-truth labels.

For a neural network, MLUpdateTask runs the training loop on a background thread for
the specified number of epochs. You can get progress updates after every mini-batch and
epoch, which give you a limited amount of influence over the training process — for
example, you can use this to dynamically change the learning rate.

When it’s done, MLUpdateTask will notify you via a completion handler. At this point you’d
save the updated model to a new mlmodel file. If you were also using the model to make
predictions, you now reload the MLModel object from that new file to get the changes.
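
As a rough sketch, kicking off an update could look something like this. MLUpdateTask, MLUpdateContext, and the write(to:) method are the real Core ML APIs; the function, the model URL, and the save location are placeholders:

import CoreML

func updateModel(at compiledModelURL: URL, with trainingData: MLBatchProvider) throws {
    let updateTask = try MLUpdateTask(
        forModelAt: compiledModelURL,
        trainingData: trainingData,
        configuration: nil,
        completionHandler: { context in
            // context.model is the updated model; write it out so the
            // changes survive the next app launch.
            let saveURL = compiledModelURL   // placeholder: pick your own location
            try? context.model.write(to: saveURL)
        })

    // Training runs in the background; this call returns immediately.
    updateTask.resume()
}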

Some of the more interesting training additions to the Core ML API are:

• MLModelDescription:
– has an isUpdatable property that is true if the model supports training
– trainingInputDescriptionsByName tells you what the training inputs are
– parameterDescriptionsByKey: during training this contains the current values of the
hyperparameters

• MLModelConfiguration has a parameters dictionary that lets you override the hyperpa-
rameters that will be used in this training session.

• MLUpdateProgressHandlers is a collection of callbacks that provide you with progress updates during the training loop.

• MLUpdateContext: this object is passed to the progress callbacks. It has a dictionary containing the values of the hyperparameters, but also a metrics dictionary with the current loss and epoch.

• MLFeatureValue has a new convenience initializer that accepts CGImage objects and can
load images from local files. It can also apply different cropping options to resize the
image. You’d use this in your MLBatchProvider to easily load images — no need to mess
around with CVPixelBuffers anymore.

Using BGTaskScheduler and BGProcessingTaskRequest from the BackgroundTasks framework, you can schedule the Core ML update task to run in the background, for example to train the model at night. That is useful for when you have lots of training data or very complex models.

Limitations

There’s quite a bit of training functionality included in Core ML 3, but if you’ve trained mod-
els before there are also some obvious omissions. These are not necessarily unreasonable
limitations, but you should be aware of them.

• Almost everything about training is hardcoded in the mlmodel file. There are only
a few things you can change in the app before or during training: the number of
epochs to train, the mini-batch size, the learning rate, optimizer momentum, whether
to shuffle the data, and the random seed. Any other kinds of changes must be made to
the mlmodel file, and the app can’t do that.

• Limited choice of loss functions and optimizers.

• You cannot define your own loss functions or optimizers.

• You can’t train RNN layers.

• If you have a layer that does not support backprop that sits between two trainable
layers, you cannot train the one that comes before the unsupported layer.

• You can’t selectively fine-tune layers by turning the isUpdatable flag for a layer on/off
in your app. Which layers are trainable is fixed in the mlmodel file. You cannot
“unfreeze” additional layers from within the app.

• There is only one learning rate for all layers (no discriminative learning rates).

• Even though you can have more than one loss function, there is no way to weight these
losses so that one counts more than the other. (It’s impossible to modify the output of
the loss layer before it goes into the optimizer. The optimizer isn’t “connected” to the
loss function through a named output or anything.)

• While the mlmodel can contain multiple loss functions, Core ML 3 cannot currently
load such models. For the time being, we’ll have to make do with just one loss function
per model.

• The cross-entropy loss does not support weighting for imbalanced classes.

• Only the last model in a pipeline can be updatable.

• There is no way to add weight decay or other regularization (and no dropout layer).

• There are no built-in data augmentations, only basic cropping. If you want random
crops you have to choose random CGRects yourself. This is not a big deal, as it is possible
to do your own data augmentation, but it would have been nice to have.

I’m sure that in future versions of Core ML, a lot of these limitations will be resolved.

Read my blog post series

I have written an in-depth series of blog posts about on-device training with Core ML. If
you’re new to on-device training, I suggest you read through these posts first:

• Part 1: Introduction to on-device training (https://machinethink.net/blog/coreml-training-part1/)

• Part 2: Rock, Paper, Scissors (Lizard? Spock?) (https://machinethink.net/blog/coreml-training-part2/)

• Part 3: k-Nearest Neighbors (https://machinethink.net/blog/coreml-training-part3/)

• Part 4: Training a Neural Network (https://machinethink.net/blog/coreml-training-part4/)

Don’t forget Apple’s documentation

Core ML is generally not very well documented (hence this book) but there is some useful
documentation from Apple on how to do on-device training:

• Jupyter notebooks that explain how to create trainable models (https://github.com/apple/coremltools/tree/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/updatable_models)

• On-Device Model Personalization API Usage (https://github.com/apple/coremltools/blob/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/updatable_models/OnDeviceTraining_API_Usage.md)

• Personalizing a Model with On-Device Updates (https://developer.apple.com/documentation/coreml/core_ml_api/personalizing_a_model_with_on-device_updates) — includes sample code

You should definitely have a look at these resources.



What if this isn’t good enough for my model?

Sometimes you can use a clever workaround. For example, Core ML supports linear
regression models but these cannot be on-device trained yet. However, you could build a
very simple neural network to perform the regression (just a single fully-connected layer)
and train that.

Of course, you can always write your own training code. There is no rule that says you have
to use Core ML.

Currently, Core ML does not allow you to train an LSTM. It’s not easy but definitely possible
to implement the forward and backward passes for LSTM yourself in Swift. Just because
Core ML can’t do it, doesn’t mean it’s impossible. If you’re truly adventurous, you could
even write the trained weights back into the compiled mlmodelc files.

Note: An alternative technique to training the last few layers of the model is weight
imprinting. Core ML does not currently support this, but you could write the code for this
yourself. Weight imprinting basically averages the weights from the final layer with the
activations from the training images. This lets you train with just a few examples.
Making a Model Updatable
Core ML won’t let you use on-device training unless the mlmodel is configured to be
updatable. So how do you make a model updatable?

Note: This chapter only applies to neural networks. Other types of models cannot be trained
with Core ML at the moment (with the exception of k-Nearest Neighbors, the topic of the
next chapter). For making a pipeline updatable, refer to the chapter Building Pipeline
Models.

Note: An mlmodel that is updatable can only be opened by Xcode 11 or later, and will only
work on iOS 13 and up — even for making predictions.

Using the converter

The easiest way to make a model trainable is during conversion. To do this, pass the option
respect_trainable=True to the converter. Any layers that were trainable in the original
model now also become trainable in the Core ML model.

Currently, only the Keras converter supports this. See the chapter Keras Conversion Tips
for a detailed description.

Using NeuralNetworkBuilder

NeuralNetworkBuilder isn’t just for creating new models, you can also use it to make changes
to existing models, in particular to add training stuff.


Note: For a full description of how to use NeuralNetworkBuilder, see the chapter NeuralNet-
workBuilder.

First, load the model and create a NeuralNetworkBuilder to edit it:

import coremltools as ct
model = ct.models.MLModel("YourModel.mlmodel")
builder = ct.models.neural_network.NeuralNetworkBuilder(spec=model._spec)

To make the model updatable, you must at least do the following:

• choose which layers will be trained

• add training inputs

• add the loss layer(s)

• add the optimizer

• set a default number of epochs

• specificationVersion must at least be 4

Fortunately, NeuralNetworkBuilder already does some of this for you, such as setting the
specificationVersion to the correct version number.

Example: SqueezeNet

Suppose that we’re making an updatable version of SqueezeNet (https://docs-assets.developer.apple.com/coreml/models/Image/ImageClassification/SqueezeNet/SqueezeNet.mlmodel). Usually you’d only want
to make the last layer trainable, or at most the last couple of layers.

In the case of SqueezeNet, the final layers are (ordered from last to first):

• softmax

• global pooling layer

• ReLU

• convolution layer conv10

• concat layer

• …

Of these only the convolution layer has trainable weights, so let’s choose layer conv10 to
be the one that we’re training. (Some versions of SqueezeNet may actually have a fully-
connected layer between the pooling layer and the softmax, but the one downloaded from
Apple doesn’t.)

Tip: You can use Netron to find out which layers are in your model, or execute builder.inspect_layers() to get a list of all the layers.

Mark layers as updatable

Call make_updatable with the names of the layers that you want to train:

builder.make_updatable(["conv10"])

This automatically sets the isUpdatable flags on the layer, on the layer’s weights, and on the
model object itself. If you do print(builder.spec.isUpdatable) now, it prints True.

To see which layers are currently marked as trainable, do builder.inspect_updatable_layers(). This should output:

Name: conv10 (Type: convolution)
Input blobs: ['fire9/concat']
Output blobs: ['conv10_62relu_conv10']

Note: In order to train the conv10 layer, Core ML must backpropagate through the softmax,
pooling, and ReLU layers. It will automatically take care of that — you don’t need to set
these layers to be updatable. Only layers with weights need to have this flag.

Choose the loss function

Since SqueezeNet is a classifier, the cross-entropy loss is appropriate. (If your model is not
a classifier, you’d use the MSE loss instead, see also below.)

builder.set_categorical_cross_entropy_loss(name="lossLayer", input="classLabelProbs")

A loss layer has two inputs:

• the predicted value

• the ground-truth label, also known as the target

When you add the loss function to the model, you only specify the input for the predicted
value. NeuralNetworkBuilder will automatically add the target input.

Here, we specify "classLabelProbs" to be the input. This is the name of the dictionary that
has the categories and their predicted probabilities. All Core ML classifiers output such a
dictionary but in your own model this may have a different name. It should be the same as
the model’s predictedProbabilitiesName property.

Note: You can see the model’s outputs with builder.inspect_output_features() or by doing
print(builder.spec.description).

Write builder.inspect_loss_layers() to look at the new loss layer:

[Id: 0], Name: lossLayer (Type: categoricalCrossEntropyLossLayer)
Loss Input: classLabelProbs
Loss Target: classLabelProbs_true

This tells us that classLabelProbs_true is the tensor that should hold the ground-truth for
the current training example. But where does this tensor get its value from? That’s what
the training inputs are for.

Note: In theory, the neural network can have more than one loss layer. This allows you to combine classification and regression in a single model, or do multi-task training. However, having more than one loss layer in the model currently gives the error message, "This model has more than one loss layers specified, which is not supported at the moment." For now we'll have to make do with a single loss layer.

Training inputs

An updatable Core ML model has two sets of inputs:

• the regular inputs for inference

• “training inputs” for training

During prediction, it uses the regular inputs. During training, it uses the training inputs.
Seems obvious enough, but when I first learned about this it confused me — I assumed
that the training inputs were used in addition to the regular inputs.

In other ML frameworks, the “training input” is just the ground-truth label, but with Core
ML the regular inputs are not used at all during training. Therefore, the training inputs also
include copies of the regular inputs.

For an image classifier model such as SqueezeNet, there will be two training inputs:

1. The first is for the training image. This has the same name and data type as the image
input used for inference. (Because it has the same name, the same image preprocessing
will be applied to this input at training time too.)

2. The second training input is for the target. In the case of a classifier, the target would
be the true class label. This training input is connected to the loss function, so that Core
ML can compare the predicted label with the correct one and compute the loss.

Tip: To view your model’s regular inputs, do builder.inspect_input_features(). However, this does not print the training inputs!

NeuralNetworkBuilder has a function set_training_input() that lets you add the training
inputs by hand, but you won’t need to use that function as adding the loss layer also adds
the required training inputs already.

To view the model’s training inputs, do print(builder.spec.description.trainingInput). For our SqueezeNet model this outputs the following:

[name: "image"
shortDescription: "Input image to be classified"
type {
imageType {
width: 227
height: 227
colorSpace: BGR
}
}
, name: "classLabel"
type {
stringType {
}
}
]

The "image" input is the same as for inference, and was simply copied over from the model’s
regular inputs.

The "classLabel" input is new and is for the ground-truth label. The name for this input is
taken from the model’s predictedFeatureName property.

It is not immediately obvious, but the "classLabel" training input is connected to the
"classLabelProbs_true" input of the cross-entropy loss layer. You’d think these two things
would need to have the same name, as that is how Core ML usually connects everything
together, but classifiers are treated special by Core ML.

By making the name of the label input the same as the model’s predictedFeatureName, Core
ML lets you pass in the training labels as strings. That’s actually quite convenient, as the
cross-entropy loss normally doesn’t accept string values! Core ML will automatically map
the string to the correct categorical index and then pass it into the target of the cross-
entropy loss layer.

So, with this approach you can use the actual label values during training. In other words,
if the training example is an image of a cat, the label is simply the string "cat". You don’t
need to convert it to an index or one-hot encode it.

Note: This also works if your classifier has numeric labels. In that case, the data type of
the training input will be int64Type instead of stringType, and you can simply pass in the
numeric label during training.

Even though the training inputs were added automatically, the one for the label doesn’t
have a human-readable description yet. You can fill it in as follows:

builder.spec.description.trainingInput[1].shortDescription = "True label"

Once you’ve saved the model, the inputs will look like this in the Xcode model viewer:

The inference inputs (and outputs) are under Prediction, the training inputs are under
Update. The Update section does not list any outputs, as training does not produce any
outputs (only the loss metric).

If the model isn’t a classifier

Directly using the actual labels, such as "cat", is the most convenient. As you saw, this is
done by connecting the loss layer to the model’s predictedProbabilitiesName, and giving the
training input the same name as the model’s predictedFeatureName.

However, if your model is not a NeuralNetworkClassifier but a plain NeuralNetwork, it will not
output that special dictionary with the predicted probabilities. You still can use the cross-
entropy loss, but now you cannot pass in the actual label values — you’ll have to use the
categorical index of the label.

Adding the loss is the same as before:

builder.set_categorical_cross_entropy_loss(name="lossLayer", input="softmax_output")

The input to the loss layer must be the output of a softmax layer, here named "softmax_output".

Adding the loss layer creates a new training input with the name "softmax_output_true".
This is also the name of the loss layer’s target (you can confirm this with builder.inspect_loss_layers()).

The training input is now a MultiArray of type INT32 that contains just a single element.
This element is the categorical index of the label, i.e. the index of the softmax output
corresponding to the true label. There’s no need to one-hot encode the labels. The data
type of the MLMultiArray is also allowed to be FLOAT32 or DOUBLE.

Note: If you don’t like the name of the training input, here "softmax_output_true", you’ll
have to rename the softmax layer’s output in the model before adding the loss layer.

Using MSE loss

To train the output of a layer that is not softmax, you’d use the mean squared error loss
instead.

feature = ("pool10", datatypes.Array(1000))
builder.set_mean_squared_error_loss(name="MSE_loss", input_feature=feature)

Here, we’ve attached the MSE loss to the "pool10" layer of SqueezeNet, which outputs a
tensor with 1000 elements. You need to tell Core ML what the size is of this tensor, so it
knows how large to make the training input.

Note: MSE works on tensors of any shape. For example, to put MSE loss on the layer "relu_conv10" that precedes the pooling layer, you’d specify datatypes.Array(1000, 7, 7) as the tensor shape.

As before, NeuralNetworkBuilder automatically fills in the training inputs.

print(builder.spec.description.trainingInput[1]) shows the training input for the ground-truth:

name: "pool10_true"
type {
multiArrayType {
shape: 1000
dataType: DOUBLE
}
}

In the app, during the training loop, you’ll have to provide an MLMultiArray of this shape,
containing the ground-truth target for the current training example.
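
As a small sketch, creating such a target in Swift could look like this. The one-hot style filling and the classIndex parameter are only placeholders; which values you fill in depends on what the "pool10" output should be for your training example:

import CoreML

func makeTarget(classIndex: Int) throws -> MLMultiArray {
    // Ground-truth target for the MSE loss: a 1000-element array.
    let target = try MLMultiArray(shape: [1000], dataType: .double)

    // Fill in the desired output values for this training example.
    for i in 0..<1000 {
        target[i] = 0.0
    }
    target[classIndex] = 1.0
    return target
}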

Note: For fun, I tried adding both cross-entropy and MSE to SqueezeNet (connected to
different layers), but even though the mlmodel can have more than one loss layer, Core
ML 3 cannot currently deal with more than one loss function. So no multi-task learning
with Core ML 3 yet.

Choose the optimizer

There are currently two available optimizers, SGD and Adam. To add an SGD optimizer to
the model, you write:

from coremltools.models.neural_network import SgdParams

sgd_params = SgdParams(lr=0.001, batch=8, momentum=0.9)
builder.set_sgd_optimizer(sgd_params)

To use Adam, you write:

from coremltools.models.neural_network import AdamParams

adam_params = AdamParams(lr=0.001, batch=8, beta1=0.9, beta2=0.999, eps=1e-8)
builder.set_adam_optimizer(adam_params)

The SgdParams and AdamParams objects describe the default values for the hyperparameters.
For some parameters you can choose a range or set of allowed values. For example,

sgd_params = SgdParams(lr=0.001, batch=8, momentum=0.9)
sgd_params.set_batch(8, [1, 8, 16])
sgd_params.set_lr(0.01, 0.001, 0.1)
sgd_params.set_momentum(0.9, 0, 0.999)

For Adam you can also set ranges for beta1, beta2, and epsilon.

Inside the app you can override the default values of the hyperparameters, for example to
do learning rate annealing, but any new value you choose must fall within the constraints
defined in the mlmodel file.

With this model, trying to use a learning rate that is greater than 0.1, or a batch size that is
not 1, 8, or 16, will give a runtime error.
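
Here is a small sketch of how an app could override these hyperparameters through MLModelConfiguration before starting an MLUpdateTask. MLParameterKey and the parameters dictionary are the real Core ML APIs; the chosen values just have to satisfy the constraints above:

import CoreML

let config = MLModelConfiguration()
config.parameters = [
    MLParameterKey.learningRate: 0.05,   // must lie within 0.001...0.1
    MLParameterKey.miniBatchSize: 16     // must be 1, 8, or 16
]
// Pass this configuration to MLUpdateTask when starting training.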

To view the optimizer settings, do builder.inspect_optimizer(). It prints:

Optimizer Type: sgdOptimizer
lr: 0.01, min: 0.001, max: 0.1
batch: 8, allowed_set: [1, 8, 16]
momentum: 0.9, min: 0.0, max: 0.999

Other hyperparameters

When you tell the app to train the neural network, it always runs the training loop for the
exact number of epochs that is defined in the mlmodel file. (You must set the number of
epochs, or Core ML refuses to load the model.)

builder.set_epochs(5, allowed_set=[1, 5, 10])

For this model, the training loop does 5 epochs, unless the app changes this to 1 or 10. Any
other number of epochs will give a runtime error.

Another hyperparameter you can set is shuffle. By default this is true, which means Core
ML will randomly reorder the training examples in each epoch. You can also set the random
seed used for shuffling:

builder.set_shuffle(seed=1234)

There is no way to turn shuffling off with this function, but in case you wanted to do that,
write:

builder.nn_spec.updateParams.shuffle.defaultValue = False

Save the new model

To verify that everything is set up correctly, do print(builder.nn_spec.updateParams). This prints the contents of the NetworkUpdateParameters object.

When you’re satisfied, save the model:

new_model = ct.models.MLModel(builder.spec)
new_model.save("YourUpdatableModel.mlmodel")

And that’s it, you now have a model that can be trained on-device.

If you open the mlmodel file in Xcode, it now shows the training inputs in addition to the
regular inputs (see image above). There is also a section for the hyperparameters:

Note: Even if the model saves without errors, it’s still possible to get an error message
when opening the model in Xcode. For example, “validator error: There is a layer (xxx),
which does not support backpropagation, between an updatable marked layer and the loss
function.” That means your model can only be trained up until that particular layer, but
any layers before it cannot be trained.

With SqueezeNet, if you also try to make the convolution layer fire9/expand3x3 trainable,
Core ML gives this error because it cannot backpropagate through the concat layer that sits
between conv10 and fire9/expand3x3.

Don’t forget: The coremltools repo has a number of Jupyter notebooks with examples of how to make models updatable (https://github.com/apple/coremltools/tree/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/updatable_models). Recommended reading!
Training a Neural Network
The chapter Making a Model Updatable shows how to adapt a Core ML model so that it can
be trained on the device. This chapter explains how to do the actual training inside the iOS
or macOS app.

An updatable model looks like this in the Xcode model viewer:

• The Prediction section contains the inputs and outputs used for inference.

• The Update section lists the training inputs. Note that inference and training both
have an input named image but training also has a string input for the ground-truth
label. Training never has outputs.

• The Parameters section shows the default hyperparameter settings. This particular
model will always be trained for 5 epochs at a time, with a mini-batch size of 8
examples, a learning rate of 0.01, and so on. The app can override these values, but
within limits.


For tutorials of how to use the Core ML training APIs in an app, check out the following
resources:

• On-device training with Core ML – part 4 (https://machinethink.net/blog/coreml-training-part4/) (my blog post)

• Personalizing a Model with On-Device Updates (https://developer.apple.com/documentation/coreml/core_ml_api/personalizing_a_model_with_on-device_updates) (Apple documentation)

• On-Device Model Personalization API Usage (https://github.com/apple/coremltools/blob/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/updatable_models/OnDeviceTraining_API_Usage.md)

Tip: On-device training of neural networks is supported in the iOS simulator.

Loading the training data

As is common with neural networks, the Core ML training loop works on mini-batches.
Every iteration of the loop, Core ML asks the app to provide a mini-batch of training
examples and their targets. The mini-batch size is hardcoded in the mlmodel file, although
the app can override it.

To load the training data, you have to make a class that implements the MLBatchProvider
protocol. This is a very simple protocol with just two things:

• a count property

• the method features(at:)

Even though this is called a batch provider, it only returns a single training example at a
time. Core ML will repeatedly call features(at:) with different indices to fill up the mini-
batch. If shuffling is enabled, those indices will be picked randomly.

The MLBatchProvider actually doesn’t know anything about batches (!) or about how the
examples get stored inside a batch. So the name is a little misleading: it doesn’t provide
batches, only things that go inside batches.

Here is what an MLBatchProvider for training an image classifier could look like:
137 https://machinethink.net/blog/coreml-training-part4/
138 https://developer.apple.com/documentation/coreml/core_ml_api/personalizing_a_model_with_on-device_updates
139 https://github.com/apple/coremltools/blob/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/updatable_models/OnDeviceTraining_API_Usage.md

import CoreML
import Vision

class TrainingBatchProvider: MLBatchProvider {
    let folderURL: URL
    let imageConstraint: MLImageConstraint

    let filenames: [String]
    let labels: [String]

    init(folderURL: URL, imageConstraint: MLImageConstraint) {
        self.folderURL = folderURL
        self.imageConstraint = imageConstraint

        /* scan the folder and store the filenames and labels in the arrays */
    }

    var count: Int {
        filenames.count
    }

    func features(at index: Int) -> MLFeatureProvider {
        let url = folderURL.appendingPathComponent(filenames[index])

        let imageOptions: [MLFeatureValue.ImageOption: Any] = [
            .cropAndScale: VNImageCropAndScaleOption.centerCrop.rawValue
        ]

        guard let featureValue = try? MLFeatureValue(imageAt: url,
                                                     constraint: imageConstraint,
                                                     options: imageOptions)
        else {
            print("Could not load image at index \(index)")
            return try! MLDictionaryFeatureProvider(dictionary: [:])
        }

        let trueLabel = labels[index]

        let inputs: [String: Any] = ["image": featureValue,
                                     "label": trueLabel]
        return try! MLDictionaryFeatureProvider(dictionary: inputs)
    }
}

The method features(at:) should return both the image and its true label. That’s why
it returns an MLFeatureProvider object. (Read more about feature providers in Using the
MLModel API.)

The above code uses MLFeatureValue(imageAt:...) to load the images. This API was introduced
specifically for training. It makes it easy to load images into MLFeatureValue objects,
without having to mess with CVPixelBuffers yourself. The MLImageConstraint tells
MLFeatureValue how large the input image is supposed to be.

As mentioned, Core ML expects you to return both the image and the true label for this
example in a single MLFeatureProvider object. Here, we put them into a dictionary and then
into MLDictionaryFeatureProvider. Your Xcode project will have a generated class named
YourModelTrainingInput, which you can also use for this, as that is an MLFeatureProvider too.

In case of an error, you can return an empty feature provider, and training will immediately
end with a “failed” state.

Note: There is a built-in MLArrayBatchProvider class that you can use if your training data
is small enough to fit completely into memory. Simply load your data into MLFeatureValue
objects, put them into an array, and construct a new MLArrayBatchProvider instance. This
probably isn’t feasible when you’re working with lots of image data, but for smaller
inputs it’s a nice solution.

Data augmentation

Core ML has no built-in support for on-the-fly data augmentations, such as random flips,
rotations, color shifts, and so on. However, it’s possible to write this code yourself and add
it to your MLBatchProvider class.

This data augmentation code should modify the contents of the CVPixelBuffer object that
holds the training image. To apply these kinds of effects to CVPixelBuffer, I suggest using
the vImage framework or Core Image.
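
For example, a random horizontal flip could be done with Core Image along these lines.
This is a sketch only; the function name and the shared CIContext are my own, and depending
on your pipeline you may prefer to render into a copy of the buffer rather than the original:

import CoreImage
import CoreVideo

// Sketch: flip the training image horizontally with 50% probability,
// rendering the result back into the same CVPixelBuffer.
let augmentationContext = CIContext()

func randomHorizontalFlip(_ pixelBuffer: CVPixelBuffer) {
    guard Bool.random() else { return }

    let image = CIImage(cvPixelBuffer: pixelBuffer)
    let flipped = image
        .transformed(by: CGAffineTransform(scaleX: -1, y: 1))
        .transformed(by: CGAffineTransform(translationX: image.extent.width, y: 0))
    augmentationContext.render(flipped, to: pixelBuffer)
}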

If you’re using MLFeatureValue(imageAt:) or MLFeatureValue(cgImage:) to load the images, it’s


easy to do random cropping. Create a CGRect with a random origin and size, and set up the
imageOptions dictionary as follows:

var randomCropRect = CGRect()
randomCropRect.origin.x = CGFloat.random(in: 0..<1)
randomCropRect.origin.y = CGFloat.random(in: 0..<1)
randomCropRect.size.width = CGFloat.random(in: 0..<1)
randomCropRect.size.height = CGFloat.random(in: 0..<1)

let imageOptions: [MLFeatureValue.ImageOption: Any] = [
    .cropAndScale: VNImageCropAndScaleOption.scaleFill.rawValue,
    .cropRect: randomCropRect
]

This will first crop a random rectangle from the image and then apply the .scaleFill mode
to resize the image to the correct dimensions.

MLUpdateTask

The training process is managed by MLUpdateTask. Training is asynchronous: you tell the
update task to start and then wait for a completion event. When training a neural network,
Core ML also gives you periodic progress updates, which you can use to update the UI,
compute a validation score, and so on.

You set up the MLUpdateTask as follows:

let trainingData = /* your MLBatchProvider object */

let config = MLModelConfiguration()
config.computeUnits = .all

let handlers = MLUpdateProgressHandlers(
    forEvents: [.trainingBegin, .miniBatchEnd, .epochEnd],
    progressHandler: progressHandler,
    completionHandler: completionHandler)

updateTask = try? MLUpdateTask(forModelAt: modelURL,
                               trainingData: trainingData,
                               configuration: config,
                               progressHandlers: handlers)
updateTask?.resume()

Once the MLUpdateTask is constructed, call resume() to start training.

Note that MLUpdateTask does not perform the training on an existing MLModel instance.
Instead, you need to give it the URL to the mlmodelc directory. It will load the model from
this location and then train that.

Which URL should you use here? Your app bundle contains an mlmodelc directory, but the
app cannot write into its own bundle. Therefore, it makes sense to first copy the mlmodelc
folder from the app bundle into the app’s Documents or Library folder.
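
A sketch of that first copy (the model name "YourModel" and the helper function are
hypothetical; here the copy goes into Application Support, which lives inside Library):

import Foundation

// Sketch: copy the compiled model out of the read-only app bundle into a
// writable location, so MLUpdateTask can update it in place.
func writableModelURL() throws -> URL {
    let fileManager = FileManager.default
    let appSupport = try fileManager.url(for: .applicationSupportDirectory,
                                         in: .userDomainMask,
                                         appropriateFor: nil,
                                         create: true)
    let destinationURL = appSupport.appendingPathComponent("YourModel.mlmodelc")

    if !fileManager.fileExists(atPath: destinationURL.path) {
        // Hypothetical resource name; use whatever your compiled model is called.
        guard let bundleURL = Bundle.main.url(forResource: "YourModel",
                                              withExtension: "mlmodelc") else {
            throw CocoaError(.fileNoSuchFile)
        }
        try fileManager.copyItem(at: bundleURL, to: destinationURL)
    }
    return destinationURL
}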

Note: Any MLModel instances that you created in your app before training started do not
automatically get updated with the changes from training. After training finishes, you’ll
need to re-instantiate the MLModel from the updated mlmodelc folder!

Progress updates

Because training a neural network might take a while, Core ML gives the app progress
updates every so often. Some of the things you can do in this progress handler:

• update the UI

• change hyperparameters such as the learning rate

• save a model checkpoint when an epoch completes

• compute validation metrics

• inspect the weights and biases for the layers being trained

You can register the progress handler for the following events:

• trainingBegin: happens once when training starts

• miniBatchEnd: happens after every mini-batch

• epochEnd: happens after every epoch

There are no “epochStart”, “miniBatchStart”, or “trainingEnd” callbacks.

The progress handler itself looks like this:



let progressHandler = { (context: MLUpdateContext) in
    switch context.event {
    case .trainingBegin:
        print("Training begin")

    case .miniBatchEnd:
        let batchIndex = context.metrics[.miniBatchIndex] as! Int
        let batchLoss = context.metrics[.lossValue] as! Double
        print("Mini batch \(batchIndex), loss: \(batchLoss)")

    case .epochEnd:
        let epochIndex = context.metrics[.epochIndex] as! Int
        let trainLoss = context.metrics[.lossValue] as! Double
        print("Epoch \(epochIndex), loss: \(trainLoss)")

        /* use context.model to compute validation loss and accuracy */
        /* save a checkpoint every few epochs */
        /* update the UI */

    default:
        print("Unknown event")
    }
}

The MLUpdateContext object describes the current state of the training process:

• Its event property tells you why the progress callback was invoked. The most interesting
events are .miniBatchEnd and .epochEnd.

• The metrics dictionary has the current loss value and mini-batch/epoch index.

• The parameters dictionary has the current hyperparameter values.

• It has a reference to the MLUpdateTask, which you’ll need in order to change
hyperparameters.

• context.model is a reference to the MLModel that’s being trained. You’ll need this to
compute validation metrics or save a checkpoint.

The MLUpdateTask has a state property that tells you what the task is currently doing. Inside
the progress handler, context.task.state is always .suspended (paused).

There are also some limitations to what you can do inside the progress handler:

• The reported loss appears to be the training loss over the most recent mini-batch. You
may want to keep track of a moving average and update this after each batch.

• You don’t get access to the mini-batch’s training data, so you can’t compute your own
metrics, such as the training accuracy, over the mini-batch.

• There is no built-in support for validation. If you want to compute the validation loss,
you have to create your own validation set, run the prediction loop, and compute any
metrics yourself. Core ML only computes the loss over the training set — for anything
else, you have to do your own math.
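
As an illustration, a sketch of computing a validation accuracy from within the progress
handler might look like this. The "label" and "classLabel" feature names and the
validationData provider are assumptions, not part of the Core ML API:

import CoreML

// Sketch: compute a validation accuracy using context.model and a separate
// MLBatchProvider containing held-out examples. Feature names are hypothetical.
func validationAccuracy(model: MLModel, validationData: MLBatchProvider) -> Double {
    var correct = 0
    for i in 0..<validationData.count {
        let example = validationData.features(at: i)
        guard let trueLabel = example.featureValue(for: "label")?.stringValue,
              let output = try? model.prediction(from: example),
              let predicted = output.featureValue(for: "classLabel")?.stringValue
        else { continue }
        if predicted == trueLabel { correct += 1 }
    }
    return Double(correct) / Double(max(validationData.count, 1))
}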

Completion handler

The MLUpdateTask completion handler is called just once, after the last epoch completes or
if training is aborted due to an error.

let completionHandler = { (context: MLUpdateContext) in
    if context.task.state == .failed {
        /* handle the error */
        return
    }

    let trainLoss = context.metrics[.lossValue] as! Double
    print("Final loss: \(trainLoss)")

    /* save the model */
    /* update the UI */
}

If something went wrong, the completion handler is still called but the state of the
MLUpdateTask is set to .failed. You may end up in this .failed state if the batch provider
returns an invalid MLFeatureProvider object, for example.

If all went well, context.task.state is .completed. (The completion handler does not appear
to be called when the task is cancelled by the app and the state is .cancelling.)

Core ML does not automatically save the updated model, so that’s something you’d typically
do in the completion handler (see next section).

At this point, if the app is using the model to make predictions, that copy of the model is now
outdated. To use the newly trained version, you can use the MLModel from context.model. Or
load the model again by making a new instance using the init(contentsOf:) initializer with
the URL to where you saved the updated mlmodelc.
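
A sketch of that reload, where savedModelURL is a hypothetical URL pointing at the
updated mlmodelc:

// Sketch: swap in the freshly trained model for future predictions.
// `savedModelURL` is wherever you wrote the updated .mlmodelc.
let config = MLModelConfiguration()
let updatedModel = try? MLModel(contentsOf: savedModelURL, configuration: config)
// Use `updatedModel` for predictions from now on, instead of the old instance.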

Note: Once the task has completed, you can’t start it again. You’ll need to create a new
MLUpdateTask to train for another set of epochs.

Saving the model

Assuming all went well, the completion handler is a good place to save the updated model
to a file. You can also save the model from within the progress handler, although it only
makes sense to save such checkpoints if you’re training for many epochs or if each epoch
takes a long time.

Here is a helper function for saving the model:

private func saveUpdatedModel(_ model: MLModel & MLWritable, to url: URL) {
    do {
        let tempURL = /* create a temporary URL */
        try model.write(to: tempURL)
        _ = try FileManager.default.replaceItemAt(url, withItemAt: tempURL)
    } catch {
        print("Error saving neural network model to \(url):", error)
    }
}

Note that the object from context.model also conforms to the MLWritable protocol, which
means you can call model.write(to:) on it. A regular MLModel is not MLWritable, so this only
works from within an update context.

The filename in the URL must have the .mlmodelc extension.

Important: You cannot directly overwrite the old mlmodelc folder; this fails for some
reason. That’s why you save to a temporary URL first and then call replaceItemAt to swap
out the old mlmodelc.

The URL where you’re saving the model cannot be in your app bundle — i.e. don’t use
Bundle.main.url(for...) to obtain this URL — because apps cannot write into their own
bundle. Instead, save the model to somewhere your app’s Documents or Library folder.
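
For completeness, here is a sketch of the two pieces the helper above leaves open. The names
tempURL and savedModelURL are hypothetical:

// Sketch: one way to create the temporary URL inside the helper above.
let tempURL = FileManager.default.temporaryDirectory
    .appendingPathComponent(UUID().uuidString)
    .appendingPathExtension("mlmodelc")

// And a possible call site from the completion handler, where `savedModelURL`
// is a hypothetical URL inside the app's Documents folder:
if let writableModel = context.model as? (MLModel & MLWritable) {
    saveUpdatedModel(writableModel, to: savedModelURL)
}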

Training flow chart

This is the flow of what happens after you call updateTask.resume():



MLUpdateTask.resume() immediately returns and the training loop happens in a background
thread. You don’t need to put this on a background queue yourself. The progress and
completion handlers are also called on a background thread.

It’s important to realize that calling MLUpdateTask.resume() will always train for the number
of epochs that is defined in the mlmodel file. The mini-batch size is also defined in the
mlmodel file.

With shuffling enabled, the training examples are placed into the batches in random order.
But because the random seed is also hardcoded in the mlmodel file, the order of the shuffled
images is the same every time you run the update task. (Unless you manually override the
seed, see below.)

If the number of training examples doesn’t evenly divide by the batch size, Core ML fills up
the remainder of the last mini-batch by starting from the beginning again. This means the
model might see some images more than once in the epoch.

Note: In the first mini-batch of the first epoch, the first image is requested twice for some
reason. I’m assuming Core ML does this on purpose to inspect what the data will be like.
It’s not a big deal, but the result from the very first call to your MLBatchProvider might not
actually be used. Also, the batch provider’s .count is called a whole bunch of times, so make
sure this property isn’t doing heavy work.

Overriding the hyperparameters

The values for the training hyperparameters are set in the mlmodel file. This is useful
because Core ML can simply use these settings when you tell it to train. The app developer
won’t need to worry about choosing good values for these parameters.

To give the developer more control over the training process, Core ML allows you to over-
ride these hyperparameters. You do this using MLModelConfiguration’s parameters dictionary.
Initially this dictionary is nil, but you can fill it in as follows:

let parameters: [MLParameterKey: Any] = [
    .miniBatchSize: 2,
    .epochs: 3,
    .shuffle: false,
    .seed: 4321,
    .learningRate: 0.05,
]

let config = MLModelConfiguration()
config.computeUnits = .all
config.parameters = parameters

And then you create the MLUpdateTask with this config object. In other words, you can only
change these hyperparameters before training starts. (Some can also be changed during
training, see the next section.)

When you set the hyperparameters, keep in mind that you can’t just choose any values here.
For each parameter, the mlmodel file specifies a range of allowed values. If you choose a
number that is not in the allowed range, Core ML will throw an exception when you start
training.

The Xcode mlmodel viewer only shows the default values of the hyperparameters, but not
their allowed values. Fortunately, there is an API that lets you inspect this at runtime. The
MLModelDescription object has a dictionary parameterDescriptionsByKey, but this only works
from within the MLUpdateTask’s progress handler (it’s empty elsewhere).

Add the line print(model.modelDescription.parameterDescriptionsByKey) to the progress handler
and start training. This will print out something like the following for all the
hyperparameters:

miniBatchSize:
parameterKey: miniBatchSize
defaultValue: 8
numericConstraint:
minValue: 1
maxValue: 16
enumeratedNumbers: {(
16,
8,
1
)}, ...

This only tells you what the default and allowed values are for these hyperparameters, not
their current values. (For that, use context.parameters.)

The seed parameter is used for the random number generator when shuffle is enabled. If
you don’t set a default seed, Core ML chooses some kind of fixed seed anyway (I checked
and it’s not 0).

With a fixed random seed, repeating the training several times will give the exact same
results each time. You can override the seed using the model configuration parameters,
just like any of the other hyperparameters. For true randomness, you can set the seed to
the current time, for example.
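
For example, a minimal sketch:

// Sketch: derive the shuffle seed from the current time, so every
// training run shuffles the examples differently.
let parameters: [MLParameterKey: Any] = [
    .seed: Int(Date().timeIntervalSince1970)
]
let config = MLModelConfiguration()
config.parameters = parameters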

Changing hyperparameters during training

It’s possible to change the values of some of the hyperparameters while training is happen-
ing. In particular, you can change the optimizer settings: the learning rate, momentum, etc.
The other hyperparameters — shuffle, mini-batch size, and number of epochs — are fixed
once training starts.

A typical hyperparameter you’d want to change during training is the learning rate. It’s
common to start with a high learning rate and to make it gradually smaller over time.

In order to change a hyperparameter during training, you’d do the following in your
progress handler:

let progressHandler = { (context: MLUpdateContext) in
    switch context.event {
    /* previous code omitted */
    }

    // Simple learning rate annealing:
    var newParameters = context.parameters
    let annealed = (newParameters[.learningRate] as! Double) * 0.99
    newParameters[.learningRate] = annealed
    context.task.resume(withParameters: newParameters)
}

You need to create a new dictionary containing the parameters you changed, and then
resume the MLUpdateTask with these new parameters.

By the way, to examine the current values of the hyperparameters, look at context.parameters,
not at context.model.configuration.parameters (these are what training started out with
and will never change).

Inspecting the weights and biases

Another fun thing you can do inside the MLUpdateTask’s progress or completion handler is
grab the weights and biases of a layer in the model, for example by writing:

do {
    let multiArray = try context.model.parameterValue(for:
        MLParameterKey.weights.scoped(to: "conv10")) as! MLMultiArray
    print(multiArray.shape)
} catch {
    print(error)
}

You now get an MLMultiArray object containing the weights of the layer named conv10. You
can also do this for .biases.

This only works for the layers that are trainable. If you use the name of another layer that
is not trainable, this crashes somewhere deep inside Core ML.

These weights and biases appear to be read-only. You can change the contents of the
returned MLMultiArray but this has no effect on the model or the training procedure (it’s
most likely a copy of the actual weights).

You can use this for debugging or visualizations.

Cancelling training

Given the MLUpdateTask instance, you can call cancel() on it to interrupt the training process.
This immediately stops training.

However, be aware that the completion handler does not get called in this case! This also
means that any changes to the model will not be saved and you lose all your work. One way to
prevent this is to save checkpoints in the progress handler.
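
A sketch of such a checkpoint, written at the end of every epoch. The checkpointURL is a
hypothetical URL inside the app’s Library folder, and this logic could equally be merged
into your existing progress handler:

// Sketch: save a checkpoint whenever an epoch ends, so that cancelling
// the task later doesn't throw away all progress.
let checkpointingHandler = { (context: MLUpdateContext) in
    if context.event.contains(.epochEnd),
       let writableModel = context.model as? (MLModel & MLWritable) {
        try? writableModel.write(to: checkpointURL)
    }
}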

If you don’t stop the training task when going into the background, Core ML will automati-
cally pause the task and resume when the app becomes active again. Training while in the
background is possible, but you’ll need to use the BackgroundTasks framework for that.
k-Nearest Neighbors
k-Nearest Neighbors, k-NN for short, can be used as a model on its own, or inside a pipeline
where the first model is usually a neural network that acts as a feature extractor and k-NN
is used to classify those feature vectors.

For many on-device training situations, k-NN is a better choice than a neural network. Some
benefits of using a k-NN classifier:

• k-NN models are very fast to train because they don’t actually “learn” anything. They
just memorize all the training examples.

• k-NN only needs a handful of training examples for each class, not the hundreds of
examples required to train a neural network.

• You can train on each new example as soon as you receive it — no need for a training
loop that cycles through all the training data.

• With a neural network classifier, the number of classes is always fixed (determined by
the size of the output layer). With k-NN, however, you can add as many classes as you
want. These classes do not have to be known beforehand. This makes k-NN especially
useful for when the user is the one who determines what the classes should be.

• A k-NN model is much smaller than a neural network, although it grows in size as it
learns more data.

Example use case: The Natural Language framework uses k-NN to search through large sets
of word embeddings really quickly.

Note: k-NN only works on iOS 13 or macOS 10.15 and above. If you’re on iOS 12, you’ll have
to roll your own k-NN implementation.

k-NN in Core ML

The k-NN model in Core ML is a classifier; you cannot use it for regression. It takes a feature
vector as input and outputs the probabilities for the class labels.


The KNearestNeighborsClassifier object has the following properties:

• numberOfNeighbors. The number of neighbors to use for making the predictions. This is
the k in k-NN. Instead of always using a fixed value for k, Core ML lets you set a range
of allowed values, so that the app can change the number of neighbors at runtime.

• numberOfDimensions. The size of the feature vectors. Recall that k-NN works on one-
dimensional vectors of numbers. If your input data is not one-dimensional, it needs to
be flattened to numberOfDimensions first.

• A default class label. This is the class that gets predicted when the k-NN model is
untrained. This can be a string or an integer.

• A weighting scheme. Core ML 3 supports two weighting schemes: uniform weighting
and inverse distance weighting. Once k-NN has found the k closest feature vectors, it
takes a vote to determine the majority class label among these k vectors. With uniform
weighting, each vote counts equally. With inverse distance weighting, the votes of the
closer neighbors count more than the votes of far-away neighbors.

• A distance function that determines how far away the feature vectors are. In Core ML
3, this is always the squared Euclidean distance.

• The “index”, which is where the memorized examples are stored. Core ML 3 lets you
choose between a linear index or a K-d tree index. For the K-d tree you also have to
set leafSize, which is the number of data points contained within a leaf node of the
tree. The K-d tree is best when you have many training examples, as it allows for more
efficient lookups than a linear index.

• The labels for the memorized examples. You can use string labels or numeric labels.

Just like any model, k-NN has an input and an output. The input is a MultiArray containing
the one-dimensional feature vector as FLOAT32 (not DOUBLE). The output is the label of the
predicted class. Just like a neural network classifier, k-NN can also output a dictionary with
the predicted probabilities for each of the classes.

The k-NN model that ships with your app can be empty, i.e. it has not learned anything yet.
Of course, you can also already add data to it.

The full definition of the k-NN model is in NearestNeighbors.proto140 .


140 https://github.com/apple/coremltools/blob/master/mlmodel/format/NearestNeighbors.proto

Converting an existing k-NN model

If you have an existing k-NN model that was trained with scikit-learn, you can convert it
with the standard scikit-learn converter141 .

For example, let’s say we have the following model that is trained on the iris dataset:

from sklearn.datasets import load_iris

iris_samples = load_iris()

from sklearn.neighbors import KNeighborsClassifier

scikit_model = KNeighborsClassifier(algorithm="auto", n_neighbors=5)
scikit_model.fit(iris_samples.data, iris_samples.target)

Note: I used scikit-learn version 0.21.3 for this example, as 0.22 (the latest at the time of
writing) broke the version of coremltools I was using (v3.3).

Converting is as simple as:

import coremltools as ct
coreml_model = ct.converters.sklearn.convert(scikit_model)

Note: It’s possible you get an error message here. Not all types of k-NN are supported, it
depends on the algorithm used. Core ML 3 only supports linear and K-d tree indexes.

The Core ML model now has the following input:

141 https://apple.github.io/coremltools/generated/coremltools.converters.sklearn.convert.html

input {
name: "input"
type {
multiArrayType {
shape: 4
dataType: FLOAT32
}
}
}

Because the iris dataset uses feature vectors of length 4, the input has a MultiArray of shape
4. The data type is FLOAT32 or single-precision floating-point values.

The Core ML model has the following output:

output {
name: "classLabel"
type {
int64Type {
}
}
}
predictedFeatureName: "classLabel"

This is the label of the predicted class. The scikit-learn converter did not add a dictionary
output for the predicted probabilities, but you can add this yourself afterwards:

spec = coreml_model._spec
output_label_probs = spec.description.output.add()
output_label_probs.name = spec.description.output[0].name + "Probs"
output_label_probs.type.dictionaryType.int64KeyType.MergeFromString(b"")
spec.description.predictedProbabilitiesName = output_label_probs.name

Notice how the data type of the "classLabel" output is int64. That’s because the labels from
the iris dataset are integers.

Write print(coreml_model._spec.kNearestNeighborsClassifier.int64ClassLabels) to view the
labels of the memorized training examples. There are 150 training examples in the iris
dataset, so this will print 150 rows like the following:

vector: 0
vector: 0
vector: 0
...

Tip: With _spec.kNearestNeighborsClassifier.nearestNeighborsIndex.floatSamples you can
look at the memorized training data.

The sklearn converter always uses int64 labels for k-NN, even if the original labels are
strings. If your scikit-learn model has string labels and you want the Core ML model to also
use strings, do the following:

spec = coreml_model._spec
classes_by_name = [scikit_model.classes_[i] for i in \
spec.kNearestNeighborsClassifier.int64ClassLabels.vector]
spec.kNearestNeighborsClassifier.stringClassLabels.vector.extend(classes_by_name)
spec.description.output[0].type.stringType.SetInParent()

If you’ve also added the output for the probabilities dictionary, then make its labels strings
as well:

output_label_probs.type.dictionaryType.stringKeyType.MergeFromString(b"")

When you’re done, save the model:

coreml_model.save("kNN.mlmodel")

Making k-NN updatable

You can use a pretrained k-NN to make predictions in your app, but most likely the reason
you’re interested in k-NN is to continue training the model once it’s on the device. For that,
you need to make the Core ML model updatable.

spec = coreml_model._spec
spec.isUpdatable = True
spec.specificationVersion = 4

The model must be marked as being updatable and the specificationVersion must be at
least 4 or Xcode won’t be able to load it.

Next, you need to add training inputs. First we’ll copy the regular input to the training
inputs:

input_feature = spec.description.input[0]
input_name = input_feature.name
input_type = input_feature.type.multiArrayType.dataType
input_shape = input_feature.type.multiArrayType.shape

training_features = spec.description.trainingInput.add()
training_features.name = input_name
training_features.type.multiArrayType.dataType = input_type
training_features.type.multiArrayType.shape.extend(input_shape)
training_features.shortDescription = "Example input vector"

You also need to create a training input for the ground-truth label. This gets the same name
and data type as the predicted class label, which we take from the k-NN’s output:

output_feature = spec.description.output[0]
output_name = output_feature.name
output_type = output_feature.type

training_features = spec.description.trainingInput.add()
training_features.name = output_name
training_features.type.CopyFrom(output_type)
training_features.shortDescription = "True label"

Note: The actual name of the label’s training input doesn’t seem to matter — Core ML
probably just assumes that the second input is for the label — but it’s a good idea to
use the name of the predicted output anyway. You can also get that from the model’s
predictedFeatureName property.

For the iris classifier, if you do print(spec.description.trainingInput) it will now show the
following:

[name: "input"
shortDescription: "Example input vector"
type {
multiArrayType {
shape: 4
dataType: FLOAT32
}
}
, name: "classLabel"
shortDescription: "True label"
type {
int64Type {
}
}
]

The first training input receives the feature vector, a MultiArray of size 4. The second
training input is for the true label, which is an int64 value. If your k-NN predicts string
labels, this would have stringType as the data type.

Finally, you can set a default label. This is what gets predicted when the k-NN does not have
any data in it:

spec.kNearestNeighborsClassifier.defaultInt64Label = -1

Or if the model works with string labels:

spec.kNearestNeighborsClassifier.defaultStringLabel = "unknown"

Even though the model has a value for numberOfNeighbors, the app can override this. By
default the allowed range is between 1 and the number of training examples used (150 for
the iris classifier). If you want to limit this, you can write:

spec.kNearestNeighborsClassifier.numberOfNeighbors.defaultValue = 5
spec.kNearestNeighborsClassifier.numberOfNeighbors.range.minValue = 2
spec.kNearestNeighborsClassifier.numberOfNeighbors.range.maxValue = 10

Or to restrict numberOfNeighbors to a set of specific values:



allowed_values = [ 3, 5, 7 ]
spec.kNearestNeighborsClassifier.numberOfNeighbors.defaultValue = 5
spec.kNearestNeighborsClassifier.numberOfNeighbors.set.values.extend(allowed_values)

When you’re done, save the model:

coreml_model.save("UpdatablekNN.mlmodel")

Creating a new model using KNearestNeighborsClassifierBuilder

When you don’t have a model yet, or perhaps one that’s not made by scikit-learn, the easiest
way to get a Core ML model is to build it from scratch using KNearestNeighborsClassifierBuilder
(which I’ll shorten to kNNBuilder from now on).

from coremltools.models.nearest_neighbors import KNearestNeighborsClassifierBuilder

builder = KNearestNeighborsClassifierBuilder(input_name="input",
output_name="classLabel",
number_of_dimensions=4,
default_class_label=-1,
number_of_neighbors=5,
weighting_scheme="inverse_distance",
index_type="kd_tree",
leaf_size=30)

For the documentation of the various options, refer to the coremltools docs142 .

Some remarks:

• kNNBuilder will automatically add two outputs. Here, the first one is named "classLabel"
and will have the predicted class label. The second output is named "classLabelProbs"
and has the dictionary of predicted probabilities for each class.
If you don’t like the name of the dictionary output, you can change it by editing the
spec. Also make sure to update the model’s predictedProbabilitiesName if you do so.

• By default, the number of neighbors has an allowed range of (1, 1000) so that inside
the app you can change k to be anywhere between 1 and 1000. The value given to
number_of_neighbors is merely the default.
142 https://apple.github.io/coremltools/generated/coremltools.models.nearest_neighbors.builder.html

• If the default_class_label is a string, the class labels will be strings, otherwise they are
numeric (int64).

• kNNBuilder always makes the model updatable and automatically adds the training
inputs.

You may want to add descriptions for the inputs and outputs:

builder.spec.description.input[0].shortDescription = "Input vector to classify"
builder.spec.description.output[0].shortDescription = "Predicted label"
builder.spec.description.output[1].shortDescription = "Probabilities for each possible label"

builder.spec.description.trainingInput[0].shortDescription = "Example input vector"
builder.spec.description.trainingInput[1].shortDescription = "True label"

Of course, you can also fill in other metadata at this point.

If you don’t like the default range of (1, 1000) for number_of_neighbors, you can change it as
follows:

builder.set_number_of_neighbors_with_bounds(5, allowed_range=(1, 10))

You can also use the allowed_set argument to limit it to a set of allowed values, for example:

builder.set_number_of_neighbors_with_bounds(5, allowed_set=set([3, 5, 7]))

To add trained data to this model, do the following:

builder.add_samples(iris_samples.data, iris_samples.target)

The type of the labels must match that of the default_class_label. You can call
builder.add_samples() more than once.

When you’re done, save the model:

from coremltools.models import MLModel

coreml_model = MLModel(builder.spec)
coreml_model.save("kNN.mlmodel")

Note: You cannot use KNearestNeighborsClassifierBuilder to load an existing model and add
more data to it. Currently, it can only create new models.

Making predictions in the app

A k-NN model looks like this in Xcode:

• The Prediction section shows the inputs and outputs used for inference. The input for
this particular model is a MultiArray with 4 single-precision floating-point elements.
The output is an Int64 class label and a dictionary of probabilities.

• The Update section lists the training inputs. It has the same MultiArray as used for
inference, as well as a training input for the class label that is associated with the
training example.

• The Parameters section shows that there is one hyperparameter that can be changed
by the user, the number of neighbors. Here, the default value is 5.

Just like neural network models, a k-NN model has an auto-generated class. Look inside
and you’ll find that it is exactly the same as for any other Core ML model. Core ML doesn’t
really care what is inside the model; it treats the inner workings as a black box. As long as
you give it an input in the format that is expected, it’s happy.

Making a prediction is just like any other Core ML model:

let config = MLModelConfiguration()
let model = try! IrisClassifier(configuration: config)

let features = try! MLMultiArray([4.3, 1.4, 2.5, 1.2])

if let output = try? model.prediction(input: features) {
    print(output.classLabel)
    print(output.classLabelProbs)
}

The input is an MLMultiArray object. Note that even though the k-NN model expects the input
feature vector to be FLOAT32, using an array of double values like we’re doing here works
just fine.

The predicted probabilities for this input example are:

0: 0.1389919124491633
1: 0.8610080875508367
2: 0.0

The predicted class label is 1, which makes sense as that has the highest probability.

To override the value of k, you have to initialize the model as follows:

let config = MLModelConfiguration()
config.parameters = [MLParameterKey.numberOfNeighbors: 3]

let model = try? IrisClassifier(configuration: config)

If you provide a new value that is not within the allowed range or set, Core ML will throw
an error. Note that you cannot change the value of numberOfNeighbors after the model has
been constructed; model.configuration.parameters = [MLParameterKey.numberOfNeighbors: 3]
has no effect.

To query what the allowed range is for numberOfNeighbors, you can look at the model’s
MLModelDescription object:

modelDescription.parameterDescriptionsByKey[.numberOfNeighbors]

This returns an MLParameterDescription object that has an MLNumericConstraint with the
range and/or set of allowed values.

Note: There is no API to inspect what the k-NN model has learned. You cannot get a list of
all the classes that it knows about. If knowing which classes the model has been trained on
is important for your app, you’ll have to keep track of this separately.
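
A minimal sketch of that bookkeeping, using UserDefaults (the key name is arbitrary and the
helper functions are my own):

import Foundation

// Sketch: remember which class labels have been added to the k-NN model,
// since Core ML provides no API to query them back.
let knownClassesKey = "kNNTrainedClasses"

func rememberTrainedClass(_ label: String) {
    var known = Set(UserDefaults.standard.stringArray(forKey: knownClassesKey) ?? [])
    known.insert(label)
    UserDefaults.standard.set(Array(known), forKey: knownClassesKey)
}

func trainedClasses() -> Set<String> {
    Set(UserDefaults.standard.stringArray(forKey: knownClassesKey) ?? [])
}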

Training k-NN in the app

The cool thing about k-NN is that you can add new training examples — and even new
class labels — at any given time. You just say, “Hey k-NN, here is some new data,” and it
will simply add that new data to its memory.

Core ML uses the same MLUpdateTask API for training k-NN and neural networks. The main
difference is that with k-NN you can train on a single example at a time, whereas with a
neural network you always have to loop through the entire dataset (often multiple times).
This is why training k-NN is much faster than training a neural network.

To tell Core ML about your training data, you will need to implement the MLBatchProvider
protocol. For k-NN models you can use the built-in class MLArrayBatchProvider:

var batchInputs: [MLFeatureProvider] = []

let example = try! MLMultiArray([4.3, 1.4, 2.5, 1.2])
let trueLabel: Int64 = 1
let featureProvider = YourModelTrainingInput(input: example, classLabel: trueLabel)
batchInputs.append(featureProvider)

let trainingData = MLArrayBatchProvider(array: batchInputs)

There is one MLFeatureProvider for every training example. It contains the feature vector
and its ground-truth label. If you’re training a k-NN that’s part of a pipeline, you’d pass in
an image here instead of the feature vector, and the pipeline’s feature extractor will turn
the image into a suitable feature vector.

The YourModelTrainingInput class is auto-generated by Xcode and already implements the
MLFeatureProvider protocol. It takes an MLMultiArray with the feature vector and an Int64 (or
String) class label.

We add the MLFeatureProvider objects for our training examples into an array and then
make an MLArrayBatchProvider. In the above code snippet we’re just training on a single
example.

As I’ve mentioned before, with k-NN you can add new training examples one-by-one as they
become available, so training on a single example at a time is typical. If you have more than
one example, simply add all of them to the array.

Note: If you train on the same example more than once, it gets added to the model multiple
times. Be aware this might affect the predictions because there may be more neighbors
now that all share the same feature vector.

Now that the training example(s) are in a batch provider, you add them to the model by
creating an MLUpdateTask and calling resume(). When it is done training, Core ML will invoke
the completion handler.

do {
    let updatableModelURL = /* URL to the mlmodelc folder */

    let config = MLModelConfiguration()

    let updateTask = try MLUpdateTask(forModelAt: updatableModelURL,
                                      trainingData: trainingData,
                                      configuration: config,
                                      completionHandler: { context in
                                          /* save the updated model */
                                          /* update the UI */
                                      })

    updateTask.resume()
} catch {
    print("Error training k-NN:", error)
}

Note that MLUpdateTask does not perform the training on an existing MLModel instance.
Instead, you need to give it the URL to the mlmodelc directory. It will load the model from
this location and then train that.

Which URL should you use here? Your app bundle contains an mlmodelc directory, but the
app cannot write into its own bundle. Therefore, it makes sense to first copy the mlmodelc
folder from the app bundle into the app’s Documents or Library folder.

Note: Any MLModel instances that you created in your app before training started do not
automatically get updated with the changes from training. After training finishes, you’ll
need to re-instantiate the MLModel from the updated mlmodelc folder!

Training a k-NN model is really fast and the completion handler will be called pretty much
instantaneously. If the k-NN is part of a pipeline, the only thing that takes any time is the
feature extractor part of the model.

The completion handler is a good place to save the updated model. Here is a helper function
for saving the model:

private func saveUpdatedModel(_ model: MLModel & MLWritable, to url: URL) {
    do {
        let tempURL = /* create a temporary URL */
        try model.write(to: tempURL)
        _ = try FileManager.default.replaceItemAt(url, withItemAt: tempURL)
    } catch {
        print("Error saving neural network model to \(url):", error)
    }
}

The model to save can be obtained from the completion handler’s context parameter, which
is an MLUpdateContext. The object from context.model also conforms to the MLWritable
protocol, which means you can call model.write(to:) on it. A regular MLModel is not
MLWritable, so saving only works from within a completion handler.

The filename in the URL must have the .mlmodelc extension.

Important: You cannot directly overwrite the old mlmodelc folder; this fails for some
reason. That’s why you save to a temporary URL first and then call replaceItemAt to swap
out the old mlmodelc.

The URL where you’re saving the model cannot be in your app bundle — i.e. don’t use
Bundle.main.url(for...) to obtain this URL — because apps cannot write into their own
bundle. Instead, save the model to somewhere your app’s Documents or Library folder.

Note: You cannot remove data from the k-NN model. Once an example has been added, it
cannot be changed. The only way to remove data is to load an empty instance of the model
and re-add all the training data, except for the ones you no longer want.

k-NN in a pipeline

If you’re working with complex data such as images, you’ll likely want to use k-NN in a
pipeline so that it works on the output of some kind of feature extractor (typically a neural
network).

You can read more about setting up pipelines in the chapter Building Pipeline Models.

The following links have examples of putting a k-NN model inside a pipeline:

• On-device training with Core ML – part 3143 (my blog post)

• coremltools Jupyter notebooks144

The process of converting the k-NN model is the same as explained above. It just needs to
be placed inside the pipeline and hooked up to the feature extractor.

143 https://machinethink.net/blog/coreml-training-part3/
144 https://github.com/apple/coremltools/tree/0730319f2c2294532f578c84dca88a6a0d65ff92/examples/updatable_models
