
Embedded Electronic Systems

for Artificial Intelligence and


Machine Learning
From training in the cloud
to inference at the edge
Mario R. Casu, Luciano Lavagno
Politecnico di Torino
Outline

• Traffic sign classification: the GTSRB dataset


• Training and testing a DNN over the GTSRB dataset
• Exporting a TensorFlow graph (.pb file)
• Intel’s OpenVINO framework
• Intel’s VPU edge device: the Neural Compute Stick 2
– Testing the DNN on CPU and VPU
• Creating a custom layer in Keras
• Creating a custom layer in OpenVINO
– Testing the DNN on CPU and VPU
• Basics on OpenCL

30/11/21 - 2 EESAM - © 2020 MC- LL


Traffic sign classification

• Advanced Driver Assistance Systems (ADASs) can


help reduce the number of accidents by automating
tasks such as traffic sign recognition
• Traffic sign recognition systems include two main
stages: detection and classification
• Some ADASs of commercial cars like Honda Accord
2020 have this capability

30/11/21 - 3 EESAM - © 2020 MC- LL


2020 Honda Accord

30/11/21 - 4 EESAM - © 2020 MC- LL


Even my car has it!

Two operations involved: object detection + object classification


30/11/21 - 5 EESAM - © 2020 MC- LL
The GTSRB dataset

• We focus on the classification (easier than detection)


• The German Traffic Sign Recognition Benchmark (GTSRB) is
an image classification dataset
• The images are photos of traffic signs. The images
are classified into 43 classes
• The training set contains 39209 labeled images, and
the test set contains 12630 images
• Image sizes vary from 15x15 to 250x250 pixels (a preprocessing sketch follows below)
• See more details about the dataset here
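Since the sizes vary, every image must be resized to a fixed shape before training. Below is a minimal preprocessing sketch; the 32x32 target size matches the input shape used later with the Model Optimizer, while the path and the use of Pillow are illustrative assumptions, not the course code:

# Hedged sketch of GTSRB image preprocessing; path, library and 32x32 size are assumptions.
import numpy as np
from PIL import Image

def load_and_preprocess(path, size=(32, 32)):
    img = Image.open(path).convert("RGB").resize(size)   # original sizes: 15x15 .. 250x250
    return np.asarray(img, dtype=np.float32) / 255.0     # normalize pixels to [0, 1]

# Example (hypothetical file name):
# x = load_and_preprocess("GTSRB/train/00000_00000.png")  # x.shape == (32, 32, 3)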

30/11/21 - 6 EESAM - © 2020 MC- LL


TrafficSignNet

• TrafficSignNet is a simple convolutional neural


network for traffic sign classification
– Originally developed by A. Rosebrock of pyimagesearch.com
– Ported to a Colab Jupyter Notebook for this course
• TrafficSignNet has 5 Conv2D and 2 Dense layers
– 2 Conv2D with kernel size 5x5 and 3 with kernel size 3x3
– Both Dense layers have 128 neurons
– Final stage is a softmax logistic classifier
• It is built with the Keras Sequential API (a minimal sketch follows below)
• We now train it over the GTSRB dataset and export
the network “frozen graph” (TensorFlow jargon)
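A minimal Keras Sequential sketch consistent with the layer counts above; filter counts, pooling and optimizer settings are illustrative assumptions, not the exact TrafficSignNet definition:

# Hedged sketch: 2 Conv2D 5x5, 3 Conv2D 3x3, 2 Dense of 128, softmax over 43 classes.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_trafficsignnet_sketch(num_classes=43, input_shape=(32, 32, 3)):
    return models.Sequential([
        layers.Conv2D(8, (5, 5), padding="same", activation="relu",
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(16, (5, 5), padding="same", activation="relu"),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),   # softmax logistic classifier
    ])

model = build_trafficsignnet_sketch()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])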

30/11/21 - 7 EESAM - © 2020 MC- LL


Inference at the edge

• We now want to run inference on an edge device


using the TensorFlow frozen graph of TrafficSignNet
• We will use the Intel Movidius Neural Compute Stick 2

30/11/21 - 8 EESAM - © 2020 MC- LL


Intel’s OpenVINO framework

• Open Visual Inferencing & Neural Network Optimization


• OpenVINO is an open-source development framework
dedicated to deploying inference on Intel’s hardware
– CPU, GPU, VPU, and FPGA

30/11/21 - 9 EESAM - © 2020 MC- LL




OpenVINO model zoo

• You can skip the Model Optimizer (MO) and feed the


Inference Engine (IE) directly with a file pair
(.xml, .bin) from a zoo of pretrained models
– The zoo includes various public models (e.g. Inception,
AlexNet, etc.) and Intel models
• The framework also features several example
applications (samples and demos) using these
models, including
– Face and human-related detection and recognition
– Vehicle-related detection and recognition
– Text detection
• Public models can be retrained if needed

30/11/21 - 11 EESAM - © 2020 MC- LL


Model Optimizer (MO)

• MO takes in a frozen graph and produces an


Intermediate Representation (IR) of the network,
which can be read, loaded, and executed for inference by the IE
• The IR is a pair of files describing the model:
– .xml - Describes the network topology
– .bin - Contains the weights and biases binary data
• MO optimizes the graph
– It removes training-only layers (e.g., Dropout)
– It replaces groups of operations that can be represented as a
single mathematical operation
» Example for linear operations: BatchNormalization and
ScaleShift layers contain MUL-ADD operations that can be
fused with the MUL-ADD operations of Conv2D if they are
executed in sequence (sketched below)
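As a concrete illustration of this kind of fusion (the standard folding algebra, not the actual MO implementation), a BatchNormalization that follows a Conv2D can be absorbed into the convolution weights and bias:

# Hedged sketch of BatchNorm-into-Conv folding; names are illustrative, not MO internals.
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    # W: conv weights (out_ch, in_ch, kh, kw); b: conv bias (out_ch,)
    scale = gamma / np.sqrt(var + eps)          # per-output-channel multiplier
    W_folded = W * scale[:, None, None, None]   # rescale each output-channel filter
    b_folded = (b - mean) * scale + beta        # fold the additive terms into the bias
    # conv(x, W_folded) + b_folded  ==  BN(conv(x, W) + b)
    return W_folded, b_folded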

30/11/21 - 12 EESAM - © 2020 MC- LL


Example Linear Operations Fusing

• From Resnet269, before and after optimization

30/11/21 - 13 EESAM - © 2020 MC- LL


Example Intermediate Representation

• A graph like the one here gets translated into XML:


<?xml version="1.0" ?>
<net name="model_file_name" version="10">
<layers>
<layer id="0" name="input" type="Parameter" version="opset1">
<data element_type="f32" shape="1,3,32,100"/> <!-- attributes of operation -->
<output>
<!-- description of output ports with type of element and tensor dimensions -->
<port id="0" precision="FP32">
<dim>1</dim>
<dim>3</dim>
<dim>32</dim>
<dim>100</dim>
</port>
</output>
</layer>
<layer id="1" name="conv1/weights" type="Const" version="opset1">
<data element_type="f32" offset="0" shape="64,3,3,3" size="6912"/>
<output>
<port id="1" precision="FP32">
<dim>64</dim>
<dim>3</dim>
<dim>3</dim>
<dim>3</dim>
</port>
</output>
</layer>
30/11/21 - 14 EESAM - © 2020 MC- LL
Example Intermediate Representation

<layer id="2" name="conv1" type="Convolution" version="opset1">


<data auto_pad="same_upper" dilations="1,1" output_padding="0,0" pads_begin="1,1"
pads_end="1,1" strides="1,1"/>
<input>
<port id="0">
<dim>1</dim>
<dim>3</dim>
<dim>32</dim>
<dim>100</dim>
</port>
<port id="1">
<dim>64</dim>
<dim>3</dim>
<dim>3</dim>
<dim>3</dim>
</port>
</input>
<output>
<port id="2" precision="FP32">
<dim>1</dim>
<dim>64</dim>
<dim>32</dim>
<dim>100</dim>
</port>
</output>
</layer>

30/11/21 - 15 EESAM - © 2020 MC- LL


Example Intermediate Representation

<layer id="3" name="conv1/activation" type="ReLU" version="opset1">


<input>
<port id="0">
<dim>1</dim>
<dim>64</dim>
<dim>32</dim>
<dim>100</dim>
</port>
</input>
<output>
<port id="1" precision="FP32">
<dim>1</dim>
<dim>64</dim>
<dim>32</dim>
<dim>100</dim>
</port>
</output>
</layer>
<layer id="4" name="output" type="Result" version="opset1">
<input>
<port id="0">
<dim>1</dim>
<dim>64</dim>
<dim>32</dim>
<dim>100</dim>
</port>
</input>
</layer>
30/11/21 - 16 EESAM - © 2020 MC- LL
Example Intermediate Representation

The layers and their ports are connected to build the graph:
</layers>
<edges>
    <!-- Connections between layer nodes based on ids for layers and ports -->
    <edge from-layer="0" from-port="0" to-layer="2" to-port="0"/>
    <edge from-layer="1" from-port="1" to-layer="2" to-port="1"/>
    <edge from-layer="2" from-port="2" to-layer="3" to-port="0"/>
    <edge from-layer="3" from-port="1" to-layer="4" to-port="0"/>
</edges>
<meta_data>
    <!-- Auxiliary information that serves for debugging purposes. -->
    <MO_version value="2019.1"/>
    <cli_parameters>
        <blobs_as_inputs value="True"/>
        <caffe_parser_path value="DIR"/>
        <data_type value="float"/>
        ...
        <!-- Omitted a long list of CLI options for debugging purposes. -->
    </cli_parameters>
</meta_data>
</net>

30/11/21 - 17 EESAM - © 2020 MC- LL


Operation Set (1/2)

• Abs • CumSum • GroupConvolution


• Acos • DeformableConvolution • GroupConvolutionBackpropData
• Add • DeformablePSROIPooling • GRUCell
• Asin • DepthToSpace • HardSigmoid
• Assign • DetectionOutput • Interpolate
• Atan • Divide • Less
• AvgPool • Elu • LessEqual
• BatchNormInference • EmbeddingBagOffsetsSum • Log
• BatchToSpace • EmbeddingBagPackedSum • LogicalAnd
• BinaryConvolution • EmbeddingSegmentsSum • LogicalNot
• Broadcast • Equal • LogicalOr
• Bucketize • Erf • LogicalXor
• CTCGreedyDecoder • Exp • LRN
• Ceiling • ExtractImagePatches • LSTMCell
• Clamp • FakeQuantize • LSTMSequence
• Concat • Floor • MatMul
• Constant • FloorMod • MaxPool
• Convert • Gather • Maximum
• ConvertLike • GatherTree • Minimum
• Convolution • Gelu • Mod
• ConvolutionBackpropData • Greater • MVN
• Cos • GreaterEqual • Multiply
• Cosh • GRN • Negative
30/11/21 - 18 EESAM - © 2020 MC- LL
Operation Set (2/2)

• NonMaxSuppression • ReduceSum • SpaceToDepth


• NonZero • RegionYolo • Split
• NormalizeL2 • ReorgYolo • Sqrt
• NotEqual • Reshape • SquaredDifference
• OneHot • Result • Squeeze
• Pad • Reverse • StridedSlice
• Parameter • ReverseSequence • Subtract
• Power • RNNCell • Tan
• PReLU • ROIAlign • Tanh
• PriorBoxClustered • ROIPooling • TensorIterator
• PriorBox • ScatterElementsUpdate • Tile
• Proposal • ScatterUpdate • TopK
• PSROIPooling • Select • Transpose
• Range • Selu • Unsqueeze
• ReLU • ShapeOf • VariadicSplit
• ReadValue • ShuffleChannels
• ReduceLogicalAnd • Sigmoid
• ReduceLogicalOr • Sign
• ReduceMax • Sin
• ReduceMean • Sinh
• ReduceMin • SoftMax
• ReduceProd • SpaceToBatch

30/11/21 - 19 EESAM - © 2020 MC- LL


From Frozen Graph to IR

• In general, each node of a frozen graph (for example


in Tensorflow) is converted to
– one node of the IR graph if there is a one-to-one
correspondence with an operation in the Operation Set
– a subgraph combining several operations if there is no such
correspondence
» Example: SquaredDifference = (a-b)^2

[Figure: the single SquaredDifference node with inputs a and b is expanded into the
subgraph neg(b) → add(a, neg(b)) → square, which computes (a-b)^2]

30/11/21 - 20 EESAM - © 2020 MC- LL


Custom layers in MO

• Not all TF and PyTorch layers can be converted to IR,


but MO can accept user-defined custom layers
• Moreover, it can be useful to provide a custom layer
that fuses several TF layers (to speed up inference)
• We’ll see how to do it in practice: it’s a little
complicated, especially for the VPU case
– Referred to as “Extending Model Optimizer with New
Primitives” in the OpenVINO documentation

30/11/21 - 21 EESAM - © 2020 MC- LL


Quantization…

• OpenVINO can convert a network trained using FP32


or FP16 into one using INT8 during inference
• Two possible strategies (Step 1 and Step 2 in the figure)
– i.e. post-training quantization vs. quantization-aware training (sketched below)
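As a rough illustration of what post-training quantization does to each tensor, here is a minimal sketch of affine INT8 quantization; this is not the OpenVINO implementation, which calibrates scales and zero-points on real data:

# Hedged sketch of affine INT8 quantization (illustrative only).
import numpy as np

def quantize_int8(x):
    # derive scale and zero-point from the observed dynamic range
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s, zp = quantize_int8(x)
print(np.abs(dequantize(q, s, zp) - x).max())   # quantization error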

30/11/21 - 22 EESAM - © 2020 MC- LL


…but some Intel devices are picky
about data type

Input Precision                               Output Precision

Plugin  FP32  FP16  U8  U16  I8  I16          Plugin  FP32  FP16
CPU     Y     N     Y   Y    N   Y            CPU     Y     N
GPU     Y     Y     Y   Y    N   Y            GPU     Y     Y
VPU     Y     Y     Y   N    N   N            VPU     Y     Y
GNA     Y     N     Y   N    Y   Y            GNA     Y     N

Model Precision

Plugin  FP32  FP16  I8
CPU     Y*    Y     Y
GPU     Y     Y*    Y
VPU     N     Y     N
GNA     Y     Y     N

Key: Y = Supported, Y* = Supported and preferred, N = Not supported

30/11/21 - 23 EESAM - © 2020 MC- LL


Inference Engine

• IE is a set of C++ libraries providing a common API to


deliver inference solutions on a specific device
– C libraries and Python bindings are also available
• The IE API reads the Intermediate Representation, sets
the input and output formats, and executes the model
on the selected device
– A set of classes is typically used in sequence within the user
application

30/11/21 - 24 EESAM - © 2020 MC- LL


Example calls in the user application

1. Create Inference Engine Core(*) to manage available devices


and read network objects:
• C++
– #include <inference_engine.hpp>
– InferenceEngine::Core ie;
• Python
– from openvino.inference_engine import IECore
– ie = IECore()

(*) The IE API is executed on the LEON host processor in the case of a VPU target

30/11/21 - 25 EESAM - © 2020 MC- LL


Example calls in the user application

2. Read a model IR created by the Model Optimizer (.xml file):


• C++
– auto network = ie.ReadNetwork("model.xml");
• Python
– network = ie.read_network(model="model.xml", weights="model.bin")

30/11/21 - 26 EESAM - © 2020 MC- LL




Example calls in the user application

3. Configure input and output. Request input and output info


– InferenceEngine::InputsDataMap input_info = network.getInputsInfo();
– InferenceEngine::OutputsDataMap output_info = network.getOutputsInfo();
• Use the info to configure I/O. Example input…
– for (auto &item : input_info) {
auto input_data = item.second;
input_data->setPrecision(Precision::U8);
input_data->setLayout(Layout::NCHW);
input_data->getPreProcess().setResizeAlgorithm(RESIZE_BILINEAR);
input_data->getPreProcess().setColorFormat(ColorFormat::RGB);
}

30/11/21 - 28 EESAM - © 2020 MC- LL


Example calls in the user application

3. Configure input and output. Request input and output info


– InferenceEngine::InputsDataMap input_info = network.getInputsInfo();
– InferenceEngine::OutputsDataMap output_info = network.getOutputsInfo();
• Use the info to configure I/O. Example input…and output
– for (auto &item : output_info) {
auto output_data = item.second;
output_data->setPrecision(Precision::FP32);
output_data->setLayout(Layout::NC);
}

• In Python there are methods with the same purpose

30/11/21 - 29 EESAM - © 2020 MC- LL




Example calls in the user application

4. Load the model to the device using LoadNetwork():


• Ex: CPU
– auto executable_network = ie.LoadNetwork(network, "CPU");
• Ex: VPU
– auto executable_network = ie.LoadNetwork(network, "MYRIAD");
5. Create an infer request:
– auto infer_request = executable_network.CreateInferRequest();
6. Prepare the input (various methods, typically using GetBlob())
– a blob is the binary input or output tensor (exposed as a NumPy array in the Python API)

30/11/21 - 32 EESAM - © 2020 MC- LL




Example calls in the user application

7. Do inference by calling either the StartAsync method for an asynchronous


request (non-blocking; call Wait to block until the result is available):
– infer_request.StartAsync();
– infer_request.Wait(IInferRequest::WaitMode::RESULT_READY);
or by calling the Infer method for a synchronous request (blocking):
– infer_request.Infer();
8. Go over the output blobs and process the results
• Let’s see now how we can go through the entire flow in practice with
our TrafficSignNet model imported in OpenVINO
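Before moving to the actual scripts, here is a minimal Python sketch that strings steps 1-8 together with the IE Python API; method names follow the 2020/2021-era releases, and file names and preprocessing are placeholders, not the course code:

# Hedged sketch of steps 1-8 with the IE Python API (names from 2020/2021-era OpenVINO).
import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()                                                        # 1. create the Core
net = ie.read_network(model="tf_model.xml", weights="tf_model.bin")  # 2. read the IR

input_blob = next(iter(net.input_info))                              # 3. query I/O names
out_blob = next(iter(net.outputs))

exec_net = ie.load_network(network=net, device_name="MYRIAD")        # 4. load on the VPU
                                                                     # 5. infer requests are
                                                                     #    created by load_network
image = cv2.imread("sign.png")                                       # 6. prepare the input blob
image = cv2.resize(image, (32, 32)).transpose((2, 0, 1))             #    HWC -> CHW
image = np.expand_dims(image, axis=0).astype(np.float32)             #    add the batch dim

exec_net.start_async(request_id=0, inputs={input_blob: image})       # 7. asynchronous inference
if exec_net.requests[0].wait(-1) == 0:
    res = exec_net.requests[0].output_blobs[out_blob].buffer         # 8. process the results
    print("Top class:", int(np.argmax(res)))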

30/11/21 - 35 EESAM - © 2020 MC- LL


Step 1: Model Optimizer

• Download the frozen graph (tf_model.pb) and save it in a


directory
• Initialize the OpenVINO environment variables
– source /opt/intel/openvino/bin/setupvars.sh
• In the directory where you saved the graph, run:
– mo_tf.py --input_model tf_model.pb --output_dir . --input_shape
[1,32,32,3] --reverse_input_channels --scale 255.0 --data_type FP16
• This calls MO with TF-specific settings (mo_tf.py); notice 1)
the input blob shape, 2) the reversal of the input channels
(from RGB to BGR), 3) the scaling of the data, normalized
to [0,1], and 4) the data type FP16
• Check the created IR files: XML file tf_model.xml
(readable) and binary file tf_model.bin (unreadable)
30/11/21 - 36 EESAM - © 2020 MC- LL
Step 2: Create and run the inference code

• We adapt an existing OpenVINO Python sample called


“classification_sample_async.py” and rename it as
“trafficsignnet_async.py”
– It calls the device in the asynchronous mode seen before
• Code available here
• Now we run it on both CPU and VPU
1. Connect via ssh: ssh <user>@wormtongue.polito.it
2. From wormtongue, connect to tfgpu: ssh <user>@tfgpu
3. Run with option "-d MYRIAD" for VPU and "-d CPU" for CPU:
python3
/home/casu/EESAM/TrafficSignNet/openvino/async/trafficsignnet_async.py
-m /home/casu/EESAM/TrafficSignNet/tf_models/tf_model.xml
-i /home/casu/EESAM/datasets/GTSRB/test/0000*.png
--labels /home/casu/EESAM/datasets/GTSRB/labels.txt
--gndt /home/casu/EESAM/datasets/GTSRB/Test_labels.csv -nt 5 -d MYRIAD

30/11/21 - 37 EESAM - © 2020 MC- LL


TrafficSignNet with custom layer

• In Keras it is possible to define new layers


– To make training possible, the gradient of the implemented
function must be defined, unless it is a combination of known
and differentiable functions
– As an example, we use a function (mysq) that computes the
square of all activations (no need to specify the gradient; a
minimal Keras sketch follows after this list)
» We use it in TrafficSignNet to replace one ReLU activation
• In OpenVINO it is also possible to customize layers
• We now go over the entire flow
1. Define the new layer and train a modified TrafficSignNet
2. Export the graph and instruct OpenVINO’s MO to use mysq
3. Use OpenCL to describe the function and compile for VPU
4. Run Inference on VPU (MYRIAD)
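A minimal sketch of what such a custom layer can look like in Keras; the class name and the exact placement in the network are assumptions, not the course notebook code:

# Hedged sketch of a "mysq" custom activation layer in Keras.
# Squaring is built from differentiable TF ops, so no custom gradient is required.
import tensorflow as tf

class MySquare(tf.keras.layers.Layer):
    def call(self, inputs):
        return tf.math.square(inputs)   # element-wise square of all activations

# Illustrative use in place of one ReLU after a convolution:
# x = tf.keras.layers.Conv2D(16, (3, 3), padding="same")(x)   # conv + bias
# x = MySquare()(x)                                            # instead of ReLU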

30/11/21 - 38 EESAM - © 2020 MC- LL


Custom layer in OpenVINO’s MO (1/4)

• Edit the text version of the new frozen graph and


replace “Square” op with “MySquare”
– This is needed to prevent MO from using the existing Square
operation and to force it to use the new “Op extension”
…
node {
  name: "mysq/Square"
  op: "Square"              <-- becomes  op: "MySquare"
  input: "conv2d_1/BiasAdd"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
}
…
30/11/21 - 39 EESAM - © 2020 MC- LL
Custom layer in OpenVINO’s MO (2/4)

• Create the directories user_mo_extensions/front/tf and a


Python file named “mysquare_ext.py” in tf
from mo.front.extractor import FrontExtractorOp
from mo.ops.op import Op

class mysquareFrontExtractor(FrontExtractorOp):
    op = 'MySquare'
    enabled = True

    @staticmethod
    def extract(node):
        proto_layer = node.pb
        param = proto_layer.attr
        # extracting parameters from the TensorFlow layer and preparing them for IR
        attrs = {
            'op': __class__.op
        }
        # update the attributes of the node
        Op.get_op_class_by_name(__class__.op).update_node_stat(node, attrs)

        return __class__.enabled
30/11/21 - 40 EESAM - © 2020 MC- LL
Custom layer in OpenVINO’s MO (3/4)

• Create the directories user_mo_extensions/ops and a


Python file named “mysquare.py” in ops
import numpy as np
from mo.graph.graph import Node
from mo.ops.op import Op
from mo.front.extractor import FrontExtractorOp
from mo.front.common.partial_infer.elemental import copy_shape_infer

class mysquareOp(Op):
    op = 'MySquare'

    def __init__(self, graph, attrs):
        mandatory_props = dict(
            type=__class__.op,
            op=__class__.op,
            infer=mysquareOp.infer
        )
        super().__init__(graph, mandatory_props, attrs)

    @staticmethod
    def infer(node: Node):
        # Add your shape calculation implementation here if the input shape differs
        # from the output shape. Otherwise, use copy_shape_infer(node).
        return copy_shape_infer(node)

30/11/21 - 41 EESAM - © 2020 MC- LL
Custom layer in OpenVINO’s MO (4/4)

• Run the model optimizer to generate .xml & .bin files


– mo_tf.py --input_model tf_model_mysq.pbtxt --extensions
mysquare/user_mo_extensions --output_dir . --input_shape
[1,32,32,3] --reverse_input_channels --scale 255.0 --data_type FP16
--input_model_is_text
• Check tf_model_mysq.xml
<layer id="15" name="mysq/Square" type="MySquare" version="extension">
<input>
<port id="0">
<dim>1</dim>
<dim>16</dim>
<dim>16</dim>
<dim>16</dim>
</port>
</input>
<output>
<port id="1" precision="FP16">
<dim>1</dim>
<dim>16</dim>
<dim>16</dim>
<dim>16</dim>
</port>
</output>
</layer>
30/11/21 - 42 EESAM - © 2020 MC- LL
Extending the Operation Set for VPU

• Create OpenCL file “mysquare_kernel.cl” describing


the new operation and the directives for parallelization
– Exploit work-groups and work-items
__kernel void mysquare_kernel(
// Insert pointers to inputs, outputs as arguments here
// If your layer has one input and one output, arguments will be:
// use half* as pointers to FP16 tensors (use float for FP32)
const __global half* input0, __global half* output
)
{
// Add the kernel implementation here:
// sets index for iterations over the input tensor “input0”
int ii = get_global_id(0) + get_global_id(1) * get_global_size(0) +
get_global_id(2) * get_global_size(0) * get_global_size(1);
// actual “square” computation
output[ii] = input0[ii]*input0[ii];
}

30/11/21 - 43 EESAM - © 2020 MC- LL


Compiling the OpenCL “kernel”

• We need to initialize some environment variables


before compiling for SHAVE processor targets (VPU)
• Create and execute bash script “shaveproc.sh”
#!/bin/bash

export SHAVE_MA2X8XLIBS_DIR=/opt/intel/openvino/deployment_tools/tools/cl_compiler/lib/
export SHAVE_LDSCRIPT_DIR=/opt/intel/openvino/deployment_tools/tools/cl_compiler/ldscripts/
export SHAVE_MYRIAD_LD_DIR=/opt/intel/openvino/deployment_tools/tools/cl_compiler/bin/
export SHAVE_MOVIASM_DIR=/opt/intel/openvino/deployment_tools/tools/cl_compiler/bin/

• Run the OpenCL compiler to create a binary runnable


with the Inference Engine
– /opt/intel/openvino/deployment_tools/tools/cl_compiler/bin/clc
--strip-binary-header mysquare_kernel.cl -o mysquare_kernel.bin

30/11/21 - 44 EESAM - © 2020 MC- LL


OpenCL kernel configuration file

• To bind the custom layer to the topology IR, we need a


configuration file, e.g. “mysquare_kernel.xml,” so that
the Inference Engine can find the kernel parameters
and the execution work grid
– BFYX (batch, channel, height, width) is the tensor structure
– WorkSizes specifies the global and local sizes of the
execution work grid
• Details can be found here
<CustomLayer name="MySquare" type="MVCL" version="1">
<Kernel entry="mysquare_kernel">
<Source filename="mysquare_kernel.bin"/>
<Parameters>
<Tensor arg-name="input0" type="input" port-index="0" format="BFYX"/>
<Tensor arg-name="output" type="output" port-index="0" format="BFYX"/>
</Parameters>
<WorkSizes dim="input,0" global="X,Y,F" local="X,Y,1"/>
</Kernel>
</CustomLayer>
30/11/21 - 45 EESAM - © 2020 MC- LL
Running IE with extension

• We can run the Inference Engine (IE) again with the


extension (custom layer) through the -v option
python3 /home/casu/EESAM/TrafficSignNet/openvino/async/trafficsignnet_async.py
-m /home/casu/EESAM/TrafficSignNet/tf_models/tf_model.xml
-i /home/casu/EESAM/datasets/GTSRB/test/0000*.png
--labels /home/casu/EESAM/datasets/GTSRB/labels.txt
--gndt /home/casu/EESAM/datasets/GTSRB/Test_labels.csv -nt 5 -d MYRIAD
-v /home/casu/EESAM/TrafficSignNet/tf_models/mysquare/opencl/mysquare_kernel.xml
• In the Python code, the -v option tells how to
configure the IE so that it can use the extension:
def build_argparser():
    ...
    args.add_argument("-v", "--vpu_extension",
                      help="Optional. Required for VPU custom layers. Absolute path to a "
                           "shared library with the kernels implementations.",
                      type=str, default=None)

if args.vpu_extension and 'MYRIAD' in args.device:
    ie.set_config({'VPU_CUSTOM_LAYERS': args.vpu_extension}, device_name="MYRIAD")

01/12/21 - 46 EESAM - © 2020 MC- LL
An N-dimensional domain of work-items

• Global Dimensions:
– 1024x1024 (whole problem space)
• Local Dimensions:
– 128x128 (work-group, executes together)

• Synchronization between work-items is possible only within
work-groups: barriers and memory fences
• There is no synchronization between work-groups within a kernel

• Choose the best dimensions (1, 2, 3) for your algorithm


– For tensors in NNs, typically 3 (C,H,W)

01/12/21 - 47 Slide credit: T. Mattson EESAM - © 2020 MC- LL


OpenCL model for the Myriad VPU

OpenCL Model     VPU Mapping
Device code      Executed on the SHAVE cores
Private memory   Mapped to CMX internal memory, limited to 100 kB per work-group,
                 valid only while the work-group is executed
Local memory     Mapped to CMX internal memory, limited to 100 kB per work-group,
                 valid only while the work-group is executed
Global memory    Mapped to DDR, used to pass execution parameters for inputs,
                 outputs, and blobs
Work group       Executed on a single SHAVE core iterating over multiple work-items

01/12/21 - 48 EESAM - © 2020 MC- LL


Execution model (kernels)

• OpenCL execution model


– define a problem domain and execute an instance of a kernel
for each point in the domain
__kernel void times_two(
__global float* input,
__global float* output)
{
int i = get_global_id(0);
output[i] = 2.0f * input[i];
}

01/12/21 - 49 Slide credit: T. Mattson EESAM - © 2020 MC- LL


One-Dimensional NDRange

01/12/21 - 50 Slide credit: Xilinx EESAM - © 2020 MC- LL


Two-Dimensional NDRange

01/12/21 - 51 Slide credit: Xilinx EESAM - © 2020 MC- LL


Three-Dimensional NDRange

01/12/21 - 52 Slide credit: Xilinx EESAM - © 2020 MC- LL


Our “mysquare” kernel

• Input is a 3D-tensor with size (C,H,W)=(16,16,16)


– global_id(0) is the current index along W (width)
– global_id(1) is the current index along H (height)
– global_id(2) is the current index along C (channels)
• The compiler automatically vectorizes and executes on 16
SHAVE cores, leveraging their vector datapath
__kernel void mysquare_kernel(
// Insert pointers to inputs, outputs as arguments here
// If your layer has one input and one output, arguments will be:
// use half* as pointers to FP16 tensors (use float for FP32)
const __global half* input0, __global half* output
)
{
// Add the kernel implementation here:
// sets index for iterations over the input tensor “input0”
int ii = get_global_id(0) + get_global_id(1) * get_global_size(0) +
get_global_id(2) * get_global_size(0) * get_global_size(1);
// actual “square” computation
output[ii] = input0[ii]*input0[ii];
}
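To check that this index arithmetic simply walks the flattened (C,H,W) tensor in C order, here is a small host-side emulation in Python (an illustration only, not part of the OpenVINO flow):

# Hedged host-side emulation of the kernel's NDRange indexing (illustration only).
import numpy as np

C, H, W = 16, 16, 16
inp = np.random.rand(C, H, W).astype(np.float16)
out = np.empty_like(inp)

flat_in, flat_out = inp.ravel(), out.ravel()
for gz in range(C):                # get_global_id(2)
    for gy in range(H):            # get_global_id(1)
        for gx in range(W):        # get_global_id(0)
            ii = gx + gy * W + gz * W * H   # same linearization as the kernel
            flat_out[ii] = flat_in[ii] * flat_in[ii]

assert np.allclose(out, inp * inp)  # matches an element-wise square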
01/12/21 - 53 EESAM - © 2020 MC- LL
Our “mysquare” kernel

• Input is a 3D-tensor with size (C,H,W)=(16,16,16)


– global_id(0) is the current index along W (width)
– global_id(1) is the current index along H (height)
– global_id(2) is the current index along C (channels)
• The compiler automatically vectorizes and executes on 16
SHAVE cores, leveraging their vector datapath
• Each work-group has “local” dimensions
– WorkSize local parameter: in this case “X,Y,1” ⇒ 16x16x1
<CustomLayer name="MySquare" type="MVCL" version="1">
<Kernel entry="mysquare_kernel">
<Source filename="mysquare_kernel.bin"/>
<Parameters>
<Tensor arg-name="input0" type="input" port-index="0" format="BFYX"/>
<Tensor arg-name="output" type="output" port-index="0" format="BFYX"/>
</Parameters>
<WorkSizes dim="input,0" global="X,Y,F" local="X,Y,1"/>
</Kernel>
</CustomLayer>
01/12/21 - 54 EESAM - © 2020 MC- LL
Vector types

• The OpenCL C kernel programming language


provides a set of vector instructions:
– These are portable between different vector instruction sets
• These instructions support vector lengths of 2, 4, 8,
and 16 …for example:
– char2, ushort4, int8, float16, …
• Vector literal
int4 vi0 = (int4) (2, 3, -7, -7);
int4 vi1 = (int4) (0, 1, 2, 3);
• Vector ops
vi0 += vi1;
vi0 = abs(vi0);

01/12/21 - 55 Slide credit: T. Mattson EESAM - © 2020 MC- LL


Unroll pragmas

• A user may specify that a loop be unrolled via a pragma


and an optional unroll factor
– Syntax: #pragma unroll [unroll-factor]
• The pragma must be placed immediately before the
loop and only applies to that loop.
Ex. 1: UNROLL 32 times on i

int i, j, k;
#pragma unroll 32
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    C[i*N+j] = 0.0f;
    …
  }
}

Ex. 2: UNROLL 16 times on i and 2 on j

int i, j, k;
#pragma unroll 16
for (i = 0; i < N; i++) {
  #pragma unroll 2
  for (j = 0; j < N; j++) {
    C[i*N+j] = 0.0f;
    …
  }
}
01/12/21 - 56 EESAM - © 2020 MC- LL
Homework

• Rewrite mysquare as a fully sequential OpenCL kernel


– You should have three loops (3D tensor)
– See the instructions on the course website about how to
connect to the server
– Compile the kernel and run the inference on the GTSRB
dataset; verify that you obtain the same accuracy as before
» Although the kernel mysquare is slower, you won’t see the
difference in performance because it’s only one, relatively small
layer out of many other layers executed in sequence
» In the lab, we will experiment with another example in which
sequential code completely kills performance
• Use pragmas to unroll one or more of the three loops
– Compile and run the inference
– Check that the results in terms of accuracy are the same

01/12/21 - 57 EESAM - © 2020 MC- LL


Conclusions

• We have seen a complete flow from training in the


cloud to inference at the edge
• OpenVINO is a powerful development tool for Intel
devices
– It can target CPU, GPU, FPGA, and VPU, also in combination
(heterogeneous mode, which we have not covered)
• OpenCL is the language used to define new kernels for the
VPU, compile them, and use them with OpenVINO’s
Inference Engine
– OpenCL is used in GPU programming and now also for FPGA
programming using a high-level approach (no RTL)

01/12/21 - 58 EESAM - © 2020 MC- LL
