CORE ARCHITECTURES TO
ACCELERATE AI ALGORITHMS
Table of Contents

1 Google TPU (Tensor Processing Unit)
1.1 What is Google Cloud TPU?
1.2 An in-depth look at Google’s first Tensor Processing Unit (TPU v1)
1.2.1 Prediction with neural networks
1.2.2 Quantization in neural networks
1.2.3 RISC, CISC and the TPU instruction set
1.2.4 Parallel Processing on the Matrix Multiplier Unit
1.2.5 The heart of the TPU: A systolic array
1.2.6 Minimal and deterministic design
1.2.7 Neural-specific architecture
1.3 The improvements of Google’s second Tensor Processing Unit (TPU v2)
1.3.1 Bfloat16 semantics
1.3.2 Choosing bfloat16
1.3.3 Mixed-precision training
1.3.4 Performance wins
1.3.5 More improvements
1.4 The improvements of Google’s third Tensor Processing Unit (TPU v3)
1.4.1 Performance benefits of TPU v3 over v2
1.4.2 Production Applications
1.5 When to use TPUs?
2 NVDLA
2.1 What is NVDLA?
2.2 Functional Description
2.2.1 Convolution Operations
2.2.2 Single Data Point Operation
2.2.3 Planar Data Operations
2.2.4 Multi-Plane Operations
2.2.5 Data Memory and Reshape Operations
2.3 External Interfaces
2.3.1 Configuration space bus (CSB)
2.3.2 Host Interrupt
2.3.3 System interconnect: DBBIF
2.3.4 On-Chip SRAM Interfaces - SRAMIF
2.3.5 Example of External Interface
3 References
Extra Assignment of LSI Logic Design (CO309D) - Academic Year 2022-2023 Page 1/36
Ho Chi Minh City University of Technology
Computer Science and Engineering faculty
TPUs train your models more efficiently using hardware designed for performing large ma-
trix operations often found in machine learning algorithms. TPUs have on-chip high-bandwidth
memory (HBM) letting you use larger models and batch sizes. TPUs can be connected in groups
called Pods that scale up your workloads with little to no code changes.
Cloud TPU is tightly integrated with TensorFlow, Google’s open source machine learning
(ML) framework. You can use dedicated TensorFlow APIs to run workloads on TPU hardware.
Cloud TPU lets you create clusters of TensorFlow computing units, which can also include CPUs
and regular graphical processing units (GPUs).
TPU v1 Overview:
This example on the TensorFlow Playground trains a neural network to classify a data point
as blue or orange based on a training dataset.
The process of running a trained neural network to classify data with labels or estimate some
missing or future values is called inference. For inference, each neuron in a neural network does
the following calculations:
• Multiply the input data (x) with weights (w) to represent the signal strength
• Add the results to aggregate the neuron’s state into a single value
• Apply an activation function (f) (such as ReLU, Sigmoid, tanh or others) to modulate the
artificial neuron’s activity.
For example, if you have three inputs and two neurons with a fully connected single-layer
neural network, you have to execute six multiplications between the weights and inputs and add
up the multiplications in two groups of three. This sequence of multiplications and additions
can be written as a matrix multiplication. The outputs of this matrix multiplication are then
processed further by an activation function. Even when working with much more complex neural
network model architectures, multiplying matrices is often the most computationally intensive
part of running a trained model.
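The sequence described above can be sketched in a few lines of Python; the weights, inputs, and choice of ReLU here are made-up values for illustration only.

```python
# Toy single-layer, fully connected inference: two neurons, three inputs.
# Six multiplications, summed in two groups of three, then an activation.
def relu(v):
    return max(0.0, v)

def dense_layer(x, W):
    # Each output is the dot product of one weight row with the input,
    # passed through the activation function.
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W]

x = [1.0, 2.0, 3.0]               # three inputs
W = [[0.1, 0.2, 0.3],             # weights of neuron 1
     [-0.4, 0.5, -0.6]]           # weights of neuron 2
y = dense_layer(x, W)             # approximately [1.4, 0.0]
```

The list comprehension over `W` is exactly the matrix multiplication the text describes; the activation is applied to each element of the result.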
How many multiplication operations would you need at production scale? In July 2016, Google
surveyed six representative neural network applications across its production services and
summed up the total number of weights in each neural network architecture. You can see
the results in the table below.
As you can see in the table, the number of weights in each neural network varies from 5 mil-
lion to 100 million. Every single prediction requires many steps of multiplying processed input
data by a weight matrix and applying an activation function.
In total, this is a massive amount of computation. As a first optimization, rather than execut-
ing all of these mathematical operations with ordinary 32-bit or 16-bit floating point operations
on CPUs or GPUs, Google applies a technique called quantization that allows working with
integer operations instead. This reduces the total amount of memory and computing
resources required to make useful predictions with Google’s neural network models.
Quantization is a powerful tool for reducing the cost of neural network predictions, and the
corresponding reductions in memory usage are important as well, especially for mobile and em-
bedded deployments. For example, when you apply quantization to Inception, the popular image
recognition model, it gets compressed from 91MB to 23MB, about one-fourth the original size.
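As a sketch of how such quantization can work, consider the symmetric, per-tensor scheme below. The scheme and the numbers are illustrative assumptions, not Google's exact method.

```python
# Linear (symmetric) quantization of floating-point weights to 8-bit integers.
def quantize(values, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.91, -0.42, 0.07, -1.27]
q, scale = quantize(weights)    # q holds small integers, e.g. [91, -42, 7, -127]
approx = dequantize(q, scale)   # close to the original weights
```

Storing `q` instead of `weights` needs one byte per value rather than four, and the matrix arithmetic can then be done with cheap integer multipliers, which is the point the text makes.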
Being able to use integer rather than floating point operations greatly reduces the hardware
footprint and energy consumption of Google’s TPU. A TPU contains 65,536 8-bit integer
multipliers. The popular GPUs widely used in cloud environments contain a few thousand
32-bit floating-point multipliers. As long as you can meet the accuracy requirements of your
application with 8-bit integers, the TPU offers 25X or more multipliers.
Most modern CPUs are heavily influenced by the Reduced Instruction Set Computer (RISC)
design style. With RISC, the focus is to define simple instructions (e.g., load, store, add and
multiply) that are commonly used by the majority of applications and then to execute those
instructions as fast as possible. Google chose the Complex Instruction Set Computer (CISC)
style as the basis of the TPU instruction set instead. A CISC design focuses on implementing
high-level instructions that run more complex tasks (such as calculating multiply-and-add many
times) with each instruction. Let’s take a look at the block diagram of the TPU.
• Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units for matrix operations
• Unified Buffer (UB): 24MB of SRAM that works as registers
• Activation Unit (AU): Hardwired activation functions
To control how the MXU, UB and AU proceed with operations, Google defined a dozen
high-level instructions specifically designed for neural network inference. Five of these op-
erations are highlighted below.
– Read_Host_Memory: Read data from memory.
– Read_Weights: Read weights from memory.
– MatrixMultiply/Convolve: Multiply or convolve with the data and weights, accumulate
the results.
– Activate: Apply activation functions.
– Write_Host_Memory: Write result to memory.
This instruction set focuses on the major mathematical operations required for neural net-
work inference that Google mentioned earlier: execute a matrix multiply between input data and
weights and apply an activation function.
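The five highlighted instructions can be modeled, at a purely software level, as one inference pass. The dictionaries and helper function below are illustrative stand-ins, not the actual TPU programming interface.

```python
# The five highlighted instructions, modeled as one inference pass.
def relu(v):
    return max(0.0, v)

def run_inference(host_memory, weight_memory, activation=relu):
    x = host_memory["input"]                      # Read_Host_Memory
    W = weight_memory["layer0"]                   # Read_Weights
    acc = [sum(w * xi for w, xi in zip(row, x))   # MatrixMultiply/Convolve:
           for row in W]                          # multiply and accumulate
    y = [activation(a) for a in acc]              # Activate
    host_memory["output"] = y                     # Write_Host_Memory
    return host_memory["output"]

out = run_inference({"input": [1.0, 2.0]}, {"layer0": [[1.0, -1.0]]})
```

The CISC flavor of the design shows here: a single MatrixMultiply instruction covers an entire matrix of multiply-accumulates, rather than one scalar operation per instruction.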
Norm says: “Neural network models consist of matrix multiplies of various sizes — that’s
what forms a fully connected layer, or in a CNN, it tends to be smaller matrix multiplies. This
architecture is about doing those things — when you’ve accumulated all the partial sums and are
outputting from the accumulators, everything goes through this activation pipeline. The nonlin-
earity is what makes it a neural network even if it’s mostly linear algebra.” (from “First in-depth
look at Google’s TPU architecture,” The Next Platform)
In short, the TPU design encapsulates the essence of neural network calculation, and can be
programmed for a wide variety of neural network models. To program it, Google created a com-
piler and software stack that translates API calls from TensorFlow graphs into TPU instructions.
Modern processors gain much of their throughput from vector processing, where the same
operation is performed concurrently across a large number of data elements. CPUs incorporate
instruction set extensions such as SSE and AVX that express such vector operations. The
streaming multiprocessors (SMs) of GPUs are effectively vector processors, with many such SMs
on a single GPU die. Machines with vector processing support can process hundreds to thousands
of operations in a single clock cycle.
In the case of the TPU, Google designed its MXU as a matrix processor that processes
hundreds of thousands of operations (an entire matrix operation) in a single clock cycle. Think
of it as the difference between printing a document one character at a time, one line at a time,
and one whole page at a time.
• How a CPU works
A CPU is a general-purpose processor based on the von Neumann architecture. That means
a CPU works with software and memory like this:
The greatest benefit of CPUs is their flexibility. You can load any kind of software on a
CPU for many different types of applications. For example, you can use a CPU for word
processing on a PC, controlling rocket engines, executing bank transactions, or classifying
images with a neural network.
A CPU loads values from memory, performs a calculation on the values and stores the
result back in memory for every calculation. Memory access is slow when compared to the
calculation speed and can limit the total throughput of CPUs. This is often referred to as
the von Neumann bottleneck.
• How a GPU works
To gain higher throughput, GPUs contain thousands of Arithmetic Logic Units (ALUs) in
a single processor. A modern GPU usually contains 2,500 to 5,000 ALUs, which means it can
execute thousands of multiplications and additions simultaneously.
This GPU architecture works well on applications with massive parallelism, such as matrix
operations in a neural network. In fact, on a typical training workload for deep learning, a
GPU can provide an order of magnitude higher throughput than a CPU.
But, the GPU is still a general-purpose processor that has to support many different ap-
plications and software. Therefore, GPUs have the same problem as CPUs. For every cal-
culation in the thousands of ALUs, a GPU must access registers or shared memory to read
operands and store the intermediate calculation results.
Google designed Cloud TPUs as a matrix processor specialized for neural network work-
loads. TPUs can’t run word processors, control rocket engines, or execute bank transactions,
but they can handle massive matrix operations used in neural networks at fast speeds.
The primary task for TPUs is matrix processing, which is a combination of multiply and
accumulate operations. TPUs contain thousands of multiply-accumulators that are directly
connected to each other to form a large physical matrix. This is called a systolic array
architecture. Cloud TPU v3 contains two systolic arrays of 128 × 128 ALUs on a single
processor.
The TPU host streams data into an infeed queue. The TPU loads data from the infeed
queue and stores them in HBM memory. When the computation is completed, the TPU
loads the results into the outfeed queue. The TPU host then reads the results from the
outfeed queue and stores them in the host’s memory.
To perform the matrix operations, the TPU loads the parameters from HBM memory into
the MXU.
Then, the TPU loads data from HBM memory. As each multiplication is executed, the
result is passed to the next multiply-accumulator. The output is the summation of all
multiplication results between the data and parameters. No memory access is required
during the matrix multiplication process.
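The accumulation just described can be modeled as a chain of multiply-accumulators; this toy version handles a single output value.

```python
# One column of a systolic array: each cell holds a fixed weight (parameter),
# multiplies the input streaming past it, and adds the partial sum handed
# down from its neighbor. No memory access occurs between the cells.
def mac_column(weights, inputs):
    partial_sum = 0.0
    for w, x in zip(weights, inputs):
        partial_sum = partial_sum + w * x   # one multiply-accumulate per cell
    return partial_sum

result = mac_column([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # 4 + 10 + 18 = 32.0
```

In the real array many such columns run in lockstep, one per output element, with the partial sums flowing between physically adjacent ALUs instead of through a loop variable.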
CPUs and GPUs often spend energy to access multiple registers per operation. A systolic
array chains multiple ALUs together, reusing the result of reading a single register.
For an MXU, however, matrix multiplication reuses both inputs many times as part of
producing the output. The MXU can read each input value once, but use it for many different
operations without storing it back to a register. Wires only connect spatially adjacent ALUs,
which makes them short and energy-efficient. The ALUs perform only multiplications and
additions in fixed patterns, which simplifies their design.
The design is called systolic because the data flows through the chip in waves, reminiscent
of the way that the heart pumps blood. The particular kind of systolic array in the MXU is
optimized for power and area efficiency in performing matrix multiplications, and is not well
suited for general-purpose computation.
The TPU Matrix Multiplication Unit has a systolic array mechanism that contains 256 ×
256 = 65,536 ALUs. That means a TPU can process 65,536 multiply-and-adds for 8-bit
integers every cycle. Because a TPU runs at 700MHz, it can compute 65,536 × 700,000,000
= 46 × 10^12 multiply-and-add operations, or 92 teraops per second (92 × 10^12), in the
matrix unit.
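The arithmetic above can be checked directly:

```python
# Peak-throughput arithmetic for the TPU v1 matrix unit (figures from the text).
alus = 256 * 256                        # 65,536 8-bit MAC units
clock_hz = 700_000_000                  # 700 MHz
macs_per_second = alus * clock_hz       # about 46 x 10^12 multiply-and-adds
ops_per_second = 2 * macs_per_second    # counting multiply and add separately
teraops = ops_per_second / 1e12         # about 92 teraops
```

Note the factor of 2: each multiply-and-add counts as two arithmetic operations, which is how 46 × 10^12 MACs become roughly 92 teraops.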
Let’s compare the number of operations per cycle between CPU, GPU and TPU.
In comparison, a typical RISC CPU without vector extensions can only execute one or
two arithmetic operations per instruction, and a GPU can execute thousands of operations per
instruction. With the TPU, a single cycle of a MatrixMultiply instruction can invoke hundreds
of thousands of operations.
During the execution of this massive matrix multiply, all intermediate results are passed
directly between the 64K ALUs without any memory access, significantly reducing power
consumption and increasing throughput. As a result, the CISC-based matrix processor design
delivers an outstanding performance-per-watt ratio: the TPU provides an 83X better ratio than
contemporary CPUs and a 29X better ratio than contemporary GPUs.
Because general-purpose processors such as CPUs and GPUs must provide good performance
across a wide range of applications, they have evolved myriad sophisticated, performance-oriented
mechanisms. As a side effect, the behavior of those processors can be difficult to predict, which
makes it hard to guarantee a certain latency limit on neural network inference. In contrast, TPU
design is strictly minimal and deterministic as it has to run only one task at a time: neural
network prediction. You can see its simplicity in the floor plan of the TPU die.
If you compare this with floor plans of CPUs and GPUs, you’ll notice the red parts (control
logic) are much larger (and thus more difficult to design) for CPUs and GPUs since they need to
realize the complex constructs and mechanisms mentioned above. In the TPU, the control logic
is minimal and takes under 2% of the die.
More importantly, despite having many more arithmetic units and large on-chip memory, the
TPU chip is half the size of the other chips. Since the cost of a chip is roughly a cubic function
of its area (smaller chips yield more dies per silicon wafer, and yield is higher because small
chips are less likely to contain manufacturing defects), halving the chip size reduces chip cost
by roughly a factor of 8 (2^3).
With the TPU, Google can easily estimate exactly how much time is required to run a neural
network and make a prediction. This allows us to operate at near-peak chip throughput while
maintaining a strict latency limit on almost all predictions. For example, despite a strict 7ms
limit in the above-mentioned MLP0 application, the TPU delivers 15–30X more throughput than
contemporary CPUs and GPUs.
Google uses neural network predictions to support end-user-facing products and services, and
everyone knows that users become impatient if a service takes too long to respond. Thus, for
the MLP0 application, they limit the 99th-percentile prediction latency to around 7 ms, for a
consistently fast user experience from TPU-based Google services. The following is an overall
performance (predictions per second) comparison between the TPU and a contemporary CPU
and GPU across six neural network applications under a latency limit. In the most spectacular
case, the TPU provides 71X performance compared with the CPU for the CNN1 application.
After the TPU v1 was created, the team took the lessons they learned and applied them to
designing the TPU v2. As you can see, it’s considerably larger than the first TPU and features
four chips instead of just one. It delivers 180 teraflops of compute, meaning it can perform 180
trillion floating-point operations per second. And it does both training and prediction now.
The layout of the TPU v2 is quite interesting. Each board has those four chips, but each
chip has two cores. Each core then contains a matrix unit, a vector unit, and a scalar unit,
all connected to 8 gigabytes of high-bandwidth memory. That means in total, each board has
8 cores and 64 gigabytes of memory. And the matrix unit is a 128 × 128 systolic array.
What’s special about the TPU v2, though, is its use of a new data type called bfloat16, the
secret to high performance on Cloud TPUs. This custom floating point format is called “Brain
Floating Point Format,” or “bfloat16” for short. The name comes from “Google Brain,” an
artificial intelligence research group at Google where the idea for this format was conceived.
Bfloat16 is carefully used within systolic arrays to accelerate matrix multiplication operations
on Cloud TPUs. More precisely, each multiply-accumulate operation in a matrix multiplication
uses bfloat16 for the multiplication and 32-bit IEEE floating point for accumulation.
Bfloat16 is a custom 16-bit floating point format for machine learning that is composed of
one sign bit, eight exponent bits, and seven mantissa bits. This is different from the industry-
standard IEEE 16-bit floating point format, which was not designed with deep learning
applications in mind. Figure 1 diagrams the internals of three floating point formats: (a) FP32:
IEEE single-precision, (b) FP16: IEEE half-precision, and (c) bfloat16.
As the above image shows, bfloat16 has a greater dynamic range—i.e., number of exponent
bits—than FP16. In fact, the dynamic range of bfloat16 is identical to that of FP32. We’ve
trained a wide range of deep learning models, and in our experience, the bfloat16 format works
as well as the FP32 format while delivering increased performance and reducing memory usage.
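The relationship between FP32 and bfloat16 can be illustrated with plain bit manipulation: bfloat16 is simply the top 16 bits of an FP32 value. Real hardware typically uses round-to-nearest-even; simple truncation keeps this sketch short.

```python
import math
import struct

def float_to_bfloat16_bits(x):
    # Reinterpret the FP32 value as a 32-bit integer and keep the top 16 bits:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_float(b):
    # Re-expand to FP32 by padding the dropped mantissa bits with zeros.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

pi_bf16 = bfloat16_bits_to_float(float_to_bfloat16_bits(math.pi))  # 3.140625
```

Because the exponent field survives intact, any value representable in FP32 keeps its order of magnitude in bfloat16; only the low mantissa bits are lost, which is why the dynamic range matches FP32 exactly.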
Based on their years of experience training and deploying a wide variety of neural networks
across Google’s products and services, Google knew when they designed Cloud TPUs that neural
networks are far more sensitive to the size of the exponent than that of the mantissa. To ensure
identical behavior for underflows, overflows, and NaNs, bfloat16 has the same exponent size
as FP32. However, bfloat16 handles denormals differently from FP32: it flushes them to zero.
Unlike FP16, which typically requires special handling via techniques such as loss scaling [Mic
17], BF16 comes close to being a drop-in replacement for FP32 when training and running deep
neural networks.
When programming Cloud TPUs, the TPU software stack provides automatic format conver-
sion: values are seamlessly converted between FP32 and bfloat16 by the XLA compiler, which is
capable of optimizing model performance by automatically expanding the use of bfloat16 as far
as possible without materially changing the math in the model. This allows ML practitioners to
write models using the FP32 format by default and achieve some performance benefits without
having to worry about any manual format conversions—no loss scaling or code changes required.
While it is possible to observe the effects of bfloat16, this typically requires careful numerical
analysis of the computation’s outputs.
than the v1. Now for the really exciting part: the TPU v2 is arranged into pods. One TPU v2
pod is 64 TPUs, all connected together, and Google can use an entire pod as if it were one
machine. And since there are two cores per chip, four chips per board, and 64 TPUs per pod,
that multiplies out to 512 cores (256 chips) in a TPU pod, totaling over 11 petaflops of
processing power. Google can use smaller subdivisions of the pods as well, such as a quarter
pod or half pod. What’s really extraordinary is that Google has started using TPU pods to
train state-of-the-art models on benchmark datasets (e.g., ResNet-50) in 30 minutes, and that
with only a half pod of just 32 TPUs. The training process uses a batch size of 8,000 images.
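The pod arithmetic works out as follows:

```python
# TPU v2 pod totals (figures from the text).
boards_per_pod = 64          # one pod = 64 TPU boards
chips_per_board = 4
cores_per_chip = 2
teraflops_per_board = 180

chips = boards_per_pod * chips_per_board                      # 256 chips
cores = chips * cores_per_chip                                # 512 cores
pod_petaflops = boards_per_pod * teraflops_per_board / 1000   # 11.52 PFLOPS
```

64 boards at 180 teraflops each gives 11.52 petaflops, matching the "over 11 petaflops" figure in the text.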
As you can see, they took the v2 and they made it blue. But additionally, of course, these
chips use water cooling, which allows each board to take up much less vertical space, so the
TPU v3 pods can support many more TPUs. A full TPU v3 pod comes in at eight times faster
than a v2 pod and weighs in at over 100 PFlops (petaflops) of compute power. TPU boards
still have 4 chips per board, so Google has 1,024 chips in the TPU v3 supercomputer (pod).
The increased FLOPS per core and memory capacity in TPU v3 configurations can improve
the performance of your models in the following ways:
• TPU v3 configurations provide significant performance benefits per core for compute-bound
models. Memory-bound models on TPU v2 configurations might not achieve this same
performance improvement if they are also memory-bound on TPU v3 configurations.
• In cases where data does not fit into memory on TPU v2 configurations, TPU v3 can
provide improved performance and reduced recomputation of intermediate values (re-
materialization).
• TPU v3 configurations can run new models with batch sizes that did not fit on TPU v2
configurations. For example, TPU v3 might allow deeper ResNets and larger images with
RetinaNet.
Models that are nearly input-bound ("infeed") on TPU v2 because training steps are waiting
for input might also be input-bound with Cloud TPU v3. The pipeline performance guide can
help you resolve infeed issues.
• Multi-Layer Perceptrons (MLP)
– MLP0 is unpublished
– MLP1 is RankBrain [Cla15]
• Convolutional Neural Networks (CNN)
– CNN0 is AlphaZero, which mastered the games of chess, Go, and shogi [Sil18]
– CNN1 is a Google-internal model for image recognition
• Recurrent Neural Networks (RNN)
CPUs:
• Quick prototyping that requires maximum flexibility
• Simple models that do not take long to train
• Small models with small, effective batch sizes
• Models that contain many custom TensorFlow/PyTorch/JAX operations written in C++
• Models that are limited by available I/O or the networking bandwidth of the host system
GPUs:
• Models with a significant number of custom TensorFlow/PyTorch/JAX operations that
must run at least partially on CPUs
• Models with TensorFlow/PyTorch ops that are not available on Cloud TPU
• Medium-to-large models with larger effective batch sizes
TPUs:
• Models dominated by matrix computations
• Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
2 NVDLA
NVDLA provides free Intellectual Property (IP) licensing to anyone wanting to build a chip
that uses CNNs for inference applications. The accelerator is written in Verilog, and NVDLA
provides a complete solution with a C-model, compiler, Linux drivers, test benches and test
suites, kernel- and user-mode software, and software development tools. It is also configurable
and scalable to meet many different architecture needs (only available in NVDLA v2). NVDLA
is merely an accelerator, and any process must be scheduled and arbitrated by an outside entity
such as a CPU.
Full hardware acceleration of a Convolutional Neural Network (CNN) is achieved by exposing
individual blocks, each of which accelerates the operations related to one CNN layer (such as
convolution, deconvolution, fully-connected, pooling, activation, and local response
normalization).
Maintaining separate and independently configurable blocks means that the NVDLA can be
scaled and sized appropriately for many smaller applications. This modular architecture gives a
highly-configurable solution that is suitable to meet specific inferencing needs.
NVDLA is applied in many products, for example Nvidia’s Jetson Xavier NX:
• a small circuit board, about the size of a credit card, which includes a 6-core ARMv8.2
64-bit CPU
• an integrated 384-core Volta GPU with 48 Tensor Cores
Each block in the NVDLA architecture is responsible for supporting specific operations in-
tegral to inference on deep neural networks. These Inference operations are divided into five
groups:
• Convolution operations: convolution core and buffer blocks
• Single Data Point operations (SDP): activation engine block
• Planar Data operations (PDP): pooling engine block
• Multi-Plane operations: local response normalization block
• Data Memory and Reshape operations: reshape and bridge DMA blocks
Inference operations are required across a wide range of deep learning applications, so the
performance, area, and power requirements for any given NVDLA design will vary as well. The
NVDLA architecture therefore defines a series of hardware parameters for feature selection and
design sizing, which allow each design to be tuned to its requirements.
There are two key factors that impact convolution function performance:
1. Memory bandwidth
2. MAC efficiency.
⇒ Example: A 60% sparse network (60% of the data are zero) can almost cut the memory
bandwidth requirement to half.
• Second memory interface: Provides efficient on-chip buffering, which can increase mem-
ory traffic bandwidth and also reduce the memory traffic latency.
⇒ Example: Usually an on-chip SRAM can provide 2x to 4x of DRAM bandwidth with
1/10 to 1/4 of the latency.
⇒ Example: If the NVDLA design specification has Atomic-C = 16 and Atomic-K = 64 (which
would result in 1024 MAC instances), and one layer of the network has the input feature
data channel number = 8 and output feature data kernel number = 16, then the MAC
utilization will be only 1/8th (only 128 MACs will be utilized with the others being idle at
all times).
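That utilization figure can be reproduced with a small helper. The function name and the `min`-based model are ours, for illustration; they capture the simple case where each idle MAC lane corresponds to a missing input channel or output kernel.

```python
# MAC utilization for a direct-convolution layer on an NVDLA configuration.
def mac_utilization(atomic_c, atomic_k, in_channels, out_kernels):
    total_macs = atomic_c * atomic_k
    used_macs = min(in_channels, atomic_c) * min(out_kernels, atomic_k)
    return used_macs, used_macs / total_macs

used, util = mac_utilization(atomic_c=16, atomic_k=64,
                             in_channels=8, out_kernels=16)
# 128 MACs busy out of 1024, i.e. 1/8 utilization
```

This is why sizing Atomic-C and Atomic-K against the shapes of the target network's layers matters: an oversized configuration leaves most of its MACs idle.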
Hardware Parameters:
• Atomic – C sizing
• Atomic – K sizing
Because the image surface format is quite different from the normal feature data format, fea-
ture data fetching operations follow a different path from direct convolution operations. There-
fore, a special mode for image-input is needed for optimizations.
Hardware Parameters:
Hardware Parameters:
⇒ Allowing multiple sets of activations to share the same weight data means they can run at
the same time (reducing overall run-time).
Note: Maximum batching size is limited by the convolution buffer size, so the maximum
batching number is a hardware limitation in the design specification.
Hardware Parameters:
• Linear function: provides native support for linear functions by working with simple bias
and scaling
• Non-linear function: uses lookup tables (LUTs) to implement non-linear functions.
This combination supports most common activation functions and other element-wise
operations, including ReLU, PReLU, precision scaling, batch normalization, and bias addition,
as well as more complex non-linear functions such as sigmoid or hyperbolic tangent.
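The LUT path can be illustrated with a small sketch (an assumption for illustration only; NVDLA's actual table format and interpolation scheme are not reproduced here) that approximates a sigmoid with a lookup table plus linear interpolation:

```python
import numpy as np

# Sketch: approximating a non-linear activation (sigmoid) with a lookup
# table and piecewise-linear interpolation -- the general technique a
# LUT-based activation path implements. Table size and range are made up.

def make_sigmoid_lut(lo=-8.0, hi=8.0, entries=257):
    """Sample sigmoid at evenly spaced points over [lo, hi]."""
    xs = np.linspace(lo, hi, entries)
    return xs, 1.0 / (1.0 + np.exp(-xs))

def lut_lookup(x, xs, ys):
    """Piecewise-linear interpolation into the table, clamped at the ends."""
    return np.interp(x, xs, ys)

xs, ys = make_sigmoid_lut()
x = np.array([-2.0, 0.0, 2.0])
approx = lut_lookup(x, xs, ys)
exact = 1.0 / (1.0 + np.exp(-x))
print(np.max(np.abs(approx - exact)) < 1e-3)  # True: 257 entries suffice here
```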
Hardware Parameters:
• SDP throughput
• Precision Scaling: controls memory bandwidth throughout the full inference process.
Feature data can be scaled to its full range before being converted to a lower precision and
written to memory.
• Batch Normalization: in inference, batch normalization reduces to a linear function
with trained scaling factors. SDP can support a per-layer or per-channel parameter for the
batch-normalization operation. By re-centering and re-scaling, batch normalization makes
artificial neural networks faster and more stable.
• Bias Addition: some layers require a bias function at the output side, meaning they
need an offset (from a per-layer setting, a per-channel memory surface, or a per-feature
memory surface) added to the final result.
• Element-Wise Operation: NVDLA supports common operations such as add, subtract,
multiply, and max/min comparison for two feature data cubes that have the same W, H,
and C sizes.
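The batch-normalization point can be made concrete: at inference time the trained statistics fold into one per-channel scale and bias, which is exactly a linear multiply-add. A sketch of the standard folding math (the values and function names are hypothetical, not SDP code):

```python
import numpy as np

# Standard inference-time batch-norm folding (plain math, not SDP
# programming): BN collapses to a per-channel scale and bias, i.e. a
# single multiply-add per element, which a linear datapath can apply.

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Return (scale, bias) such that scale * x + bias == BN(x)."""
    scale = gamma / np.sqrt(var + eps)
    bias = beta - mean * scale
    return scale, bias

# Per-channel trained parameters and statistics (made-up values).
gamma = np.array([1.0, 0.5]); beta = np.array([0.0, 1.0])
mean  = np.array([2.0, -1.0]); var  = np.array([4.0, 1.0])
eps = 1e-5

scale, bias = fold_batchnorm(gamma, beta, mean, var, eps)
x = np.array([3.0, 0.0])                                  # one value per channel
y_folded = scale * x + bias                               # single multiply-add
y_full = gamma * (x - mean) / np.sqrt(var + eps) + beta   # textbook BN
print(np.allclose(y_folded, y_full))                      # True
```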
Hardware Parameters:
• SDP throughput
Hardware Parameters:
• PDP throughput
Hardware Parameters:
• CDP throughput
Hardware Parameters:
• “slice” operations may be used to separate out different features or spatial regions of an
image.
• “reshape-transpose” operations, which are commonly used in deconvolutional neural
networks, create output data with larger dimensions than the input dataset.
⇒ The combination of convolutional and deconvolutional neural networks is applied to
image-to-image problems.
Figure 10: Combination of a Convolutional Neural Network and a Deconvolutional Neural Network
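The reshape-transpose idea can be sketched in NumPy (an illustrative depth-to-space transform under an assumed (C, H, W) layout; this is not NVDLA's register-level operation):

```python
import numpy as np

# Sketch: a reshape-transpose that turns channel depth into spatial size --
# the kind of step a deconvolution's second phase needs to produce an
# output with larger spatial dimensions than its input. Layout is assumed.

def depth_to_space(cube, stride):
    """(C*s*s, H, W) -> (C, H*s, W*s) by reshaping and transposing axes."""
    c_ext, h, w = cube.shape
    c = c_ext // (stride * stride)
    x = cube.reshape(c, stride, stride, h, w)  # split off the channel extension
    x = x.transpose(0, 3, 1, 4, 2)             # interleave the factors into H and W
    return x.reshape(c, h * stride, w * stride)

cube = np.arange(2 * 2 * 2 * 3 * 3).reshape(8, 3, 3)  # C=2 extended by s=2
out = depth_to_space(cube, stride=2)
print(out.shape)  # (2, 6, 6): spatial dims grow, channel count shrinks
```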
To increase performance, NVDLA provides the Rubik function, which transforms the data
mapping format without performing any data calculations. NVDLA supports three working
modes:
• Contract Mode: contract mode in Rubik transforms the mapping format to de-extend
the data cube. It is the second hardware layer supporting deconvolution. Normally, a
software deconvolution layer has deconvolution x-stride and y-stride greater than 1; with
these strides, the output of the phase-I hardware layer is a channel-extended data cube.
• Split Mode and Merge Mode: split and merge are two opposite operation modes in
Rubik.
– Split transforms a data cube into M-planar format (NCHW). The number of planes
is equal to the channel size.
– Merge transforms a series of planes into a feature data cube.
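Split and merge can be mimicked in a few lines of NumPy (a software stand-in for the Rubik hardware, using an assumed (C, H, W) cube layout):

```python
import numpy as np

# Sketch: split turns one (C, H, W) feature cube into C single-channel
# planes -- one plane per channel, as the text describes -- and merge is
# the exact inverse. This models the data movement, not the hardware.

def split(cube):
    """(C, H, W) cube -> list of C planes, each of shape (H, W)."""
    return [cube[c] for c in range(cube.shape[0])]

def merge(planes):
    """List of (H, W) planes -> one (C, H, W) feature data cube."""
    return np.stack(planes, axis=0)

cube = np.arange(3 * 4 * 5).reshape(3, 4, 5)
planes = split(cube)
print(len(planes))                          # 3: one plane per channel
print(np.array_equal(merge(planes), cube))  # True: merge inverts split
```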
Hardware Parameters:
• An SRAM is attached to NVDLA. (Other units in the system may also have connections
to this SRAM and share it for their own needs; this is not shown in the diagram.) The
SRAM works as a cache to boost NVDLA’s performance.