CORE ARCHITECTURES TO
ACCELERATE AI ALGORITHMS
Table of Contents

1 Google TPU (Tensor Processing Unit)
1.1 What is Google Cloud TPU?
1.2 An in-depth look at Google’s first Tensor Processing Unit (TPU v1)
1.2.1 Prediction with neural networks
1.2.2 Quantization in neural networks
1.2.3 RISC, CISC and the TPU instruction set
1.2.4 Parallel Processing on the Matrix Multiplier Unit
1.2.5 The heart of the TPU: A systolic array
1.2.6 Minimal and deterministic design
1.2.7 Neural-specific architecture
1.3 The improvements of Google’s second Tensor Processing Unit (TPU v2)
1.3.1 Bfloat16 semantics
1.3.2 Choosing bfloat16
1.3.3 Mixed-precision training
1.3.4 Performance wins
1.3.5 More improvements
1.4 The improvements of Google’s third Tensor Processing Unit (TPU v3)
1.4.1 Performance benefits of TPU v3 over v2
1.4.2 Production Applications
1.5 When to use TPUs?
2 NVDLA
2.1 What is NVDLA?
2.2 Functional Description
2.2.1 Convolution Operations
2.2.2 Single Data Point Operation
2.2.3 Planar Data Operations
2.2.4 Multi-Plane Operations
2.2.5 Data Memory and Reshape Operations
2.3 External Interfaces
2.3.1 Configuration space bus (CSB)
2.3.2 Host Interrupt
2.3.3 System interconnect: DBBIF
2.3.4 On-Chip SRAM Interfaces - SRAMIF
2.3.5 Example of External Interface
3 References
Extra Assignment of LSI Logic Design (CO309D) - Academic Year 2022-2023 Page 1/36
Ho Chi Minh City University of Technology
Computer Science and Engineering faculty
TPUs train your models more efficiently using hardware designed for performing large ma-
trix operations often found in machine learning algorithms. TPUs have on-chip high-bandwidth
memory (HBM) letting you use larger models and batch sizes. TPUs can be connected in groups
called Pods that scale up your workloads with little to no code changes.
Cloud TPU is tightly integrated with TensorFlow, Google’s open source machine learning
(ML) framework. You can use dedicated TensorFlow APIs to run workloads on TPU hardware.
Cloud TPU lets you create clusters of TensorFlow computing units, which can also include CPUs
and regular graphical processing units (GPUs).
TPU v1 Overview:
This example on the TensorFlow Playground trains a neural network to classify a data point
as blue or orange based on a training dataset.
The process of running a trained neural network to classify data with labels or estimate some
missing or future values is called inference. For inference, each neuron in a neural network does
the following calculations:
• Multiply the input data (x) with weights (w) to represent the signal strength
• Add the results to aggregate the neuron’s state into a single value
• Apply an activation function (f) (such as ReLU, Sigmoid, tanh or others) to modulate the
artificial neuron’s activity.
For example, if you have three inputs and two neurons with a fully connected single-layer
neural network, you have to execute six multiplications between the weights and inputs and add
up the multiplications in two groups of three. This sequence of multiplications and additions
can be written as a matrix multiplication. The outputs of this matrix multiplication are then
processed further by an activation function. Even when working with much more complex neural
network model architectures, multiplying matrices is often the most computationally intensive
part of running a trained model.
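The sequence described above can be sketched in a few lines of Python; the weights, inputs, and choice of ReLU here are made-up values for illustration only.

```python
# Toy single-layer, fully connected inference: two neurons, three inputs.
# Six multiplications, summed in two groups of three, then an activation.
def relu(v):
    return max(0.0, v)

def dense_layer(x, W):
    # Each output is the dot product of one weight row with the input,
    # passed through the activation function.
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W]

x = [1.0, 2.0, 3.0]               # three inputs
W = [[0.1, 0.2, 0.3],             # weights of neuron 1
     [-0.4, 0.5, -0.6]]           # weights of neuron 2
y = dense_layer(x, W)             # approximately [1.4, 0.0]
```

The list comprehension over `W` is exactly the matrix multiplication the text describes; the activation is applied to each element of the result.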
How many multiplication operations would you need at production scale? In July 2016, Google
surveyed six representative neural network applications across its production services and
summed up the total number of weights in each neural network architecture. You can see
the results in the table below.
As you can see in the table, the number of weights in each neural network varies from 5 mil-
lion to 100 million. Every single prediction requires many steps of multiplying processed input
data by a weight matrix and applying an activation function.
In total, this is a massive amount of computation. As a first optimization, rather than execut-
ing all of these mathematical operations with ordinary 32-bit or 16-bit floating point operations
on CPUs or GPUs, Google applies a technique called quantization that allows working with
integer operations instead. This reduces the total amount of memory and computing
resources required to make useful predictions with Google’s neural network models.
Quantization is a powerful tool for reducing the cost of neural network predictions, and the
corresponding reductions in memory usage are important as well, especially for mobile and em-
bedded deployments. For example, when you apply quantization to Inception, the popular image
recognition model, it gets compressed from 91MB to 23MB, about one-fourth the original size.
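As a sketch of how such quantization can work, consider the symmetric, per-tensor scheme below. The scheme and the numbers are illustrative assumptions, not Google's exact method.

```python
# Linear (symmetric) quantization of floating-point weights to 8-bit integers.
def quantize(values, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.91, -0.42, 0.07, -1.27]
q, scale = quantize(weights)    # q holds small integers, e.g. [91, -42, 7, -127]
approx = dequantize(q, scale)   # close to the original weights
```

Storing `q` instead of `weights` needs one byte per value rather than four, and the matrix arithmetic can then be done with cheap integer multipliers, which is the point the text makes.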
Being able to use integer rather than floating point operations greatly reduces the hardware
footprint and energy consumption of Google’s TPU. A TPU contains 65,536 8-bit integer
multipliers. The popular GPUs widely used in cloud environments contain a few thousand
32-bit floating-point multipliers. As long as you can meet the accuracy requirements of your
application with 8-bit integers, the TPU offers 25X or more multipliers.
Most modern CPUs are heavily influenced by the Reduced Instruction Set Computer (RISC)
design style. With RISC, the focus is to define simple instructions (e.g., load, store, add and
multiply) that are commonly used by the majority of applications and then to execute those
instructions as fast as possible. Google chose the Complex Instruction Set Computer (CISC)
style as the basis of the TPU instruction set instead. A CISC design focuses on implementing
high-level instructions that run more complex tasks (such as calculating multiply-and-add many
times) with each instruction. Let’s take a look at the block diagram of the TPU.
• Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units for matrix operations
• Unified Buffer (UB): 24MB of SRAM that works as registers
• Activation Unit (AU): Hardwired activation functions
To control how the MXU, UB and AU proceed with operations, Google defined a dozen
high-level instructions specifically designed for neural network inference. Five of these op-
erations are highlighted below.
– Read_Host_Memory: Read data from memory.
– Read_Weights: Read weights from memory.
– MatrixMultiply/Convolve: Multiply or convolve with the data and weights, accumulate
the results.
– Activate: Apply activation functions.
– Write_Host_Memory: Write result to memory.
This instruction set focuses on the major mathematical operations required for neural net-
work inference that Google mentioned earlier: execute a matrix multiply between input data and
weights and apply an activation function.
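The five highlighted instructions can be modeled, at a purely software level, as one inference pass. The dictionaries and helper function below are illustrative stand-ins, not the actual TPU programming interface.

```python
# The five highlighted instructions, modeled as one inference pass.
def relu(v):
    return max(0.0, v)

def run_inference(host_memory, weight_memory, activation=relu):
    x = host_memory["input"]                      # Read_Host_Memory
    W = weight_memory["layer0"]                   # Read_Weights
    acc = [sum(w * xi for w, xi in zip(row, x))   # MatrixMultiply/Convolve:
           for row in W]                          # multiply and accumulate
    y = [activation(a) for a in acc]              # Activate
    host_memory["output"] = y                     # Write_Host_Memory
    return host_memory["output"]

out = run_inference({"input": [1.0, 2.0]}, {"layer0": [[1.0, -1.0]]})
```

The CISC flavor of the design shows here: a single MatrixMultiply instruction covers an entire matrix of multiply-accumulates, rather than one scalar operation per instruction.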
Norm says: “Neural network models consist of matrix multiplies of various sizes — that’s
what forms a fully connected layer, or in a CNN, it tends to be smaller matrix multiplies. This
architecture is about doing those things — when you’ve accumulated all the partial sums and are
outputting from the accumulators, everything goes through this activation pipeline. The nonlin-
earity is what makes it a neural network even if it’s mostly linear algebra.” (from “First in-depth
look at Google’s TPU architecture,” The Next Platform)
In short, the TPU design encapsulates the essence of neural network calculation, and can be
programmed for a wide variety of neural network models. To program it, Google created a com-
piler and software stack that translates API calls from TensorFlow graphs into TPU instructions.
Modern processors gain much of their throughput from vector processing, where the same
operation is performed concurrently across a large number of data elements. CPUs incorporate
instruction set extensions such as SSE and AVX that express such vector operations. The
streaming multiprocessors (SMs) of GPUs are effectively vector processors, with many such SMs
on a single GPU die. Machines with vector processing support can process hundreds to thousands
of operations in a single clock cycle.
In the case of the TPU, Google designed its MXU as a matrix processor that processes
hundreds of thousands of operations (an entire matrix operation) in a single clock cycle. Think
of it as the difference between printing a document one character at a time, one line at a time,
and one whole page at a time.
• How a CPU works
A CPU is a general-purpose processor based on the von Neumann architecture. That means
a CPU works with software and memory like this:
The greatest benefit of CPUs is their flexibility. You can load any kind of software on a
CPU for many different types of applications. For example, you can use a CPU for word
processing on a PC, controlling rocket engines, executing bank transactions, or classifying
images with a neural network.
A CPU loads values from memory, performs a calculation on the values and stores the
result back in memory for every calculation. Memory access is slow when compared to the
calculation speed and can limit the total throughput of CPUs. This is often referred to as
the von Neumann bottleneck.
• How a GPU works
To gain higher throughput, GPUs contain thousands of Arithmetic Logic Units (ALUs) in
a single processor. A modern GPU usually contains 2,500 to 5,000 ALUs, which means it can
execute thousands of multiplications and additions simultaneously.
This GPU architecture works well on applications with massive parallelism, such as matrix
operations in a neural network. In fact, on a typical training workload for deep learning, a
GPU can provide an order of magnitude higher throughput than a CPU.
But, the GPU is still a general-purpose processor that has to support many different ap-
plications and software. Therefore, GPUs have the same problem as CPUs. For every cal-
culation in the thousands of ALUs, a GPU must access registers or shared memory to read
operands and store the intermediate calculation results.
Google designed Cloud TPUs as a matrix processor specialized for neural network work-
loads. TPUs can’t run word processors, control rocket engines, or execute bank transactions,
but they can handle massive matrix operations used in neural networks at fast speeds.
The primary task for TPUs is matrix processing, which is a combination of multiply and
accumulate operations. TPUs contain thousands of multiply-accumulators that are directly
connected to each other to form a large physical matrix. This is called a systolic array
architecture. Cloud TPU v3 contains two systolic arrays of 128 × 128 ALUs on a single
processor.
The TPU host streams data into an infeed queue. The TPU loads data from the infeed
queue and stores them in HBM memory. When the computation is completed, the TPU
loads the results into the outfeed queue. The TPU host then reads the results from the
outfeed queue and stores them in the host’s memory.
To perform the matrix operations, the TPU loads the parameters from HBM memory into
the MXU.
Then, the TPU loads data from HBM memory. As each multiplication is executed, the
result is passed to the next multiply-accumulator. The output is the summation of all
multiplication results between the data and parameters. No memory access is required
during the matrix multiplication process.
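The accumulation just described can be modeled as a chain of multiply-accumulators; this toy version handles a single output value.

```python
# One column of a systolic array: each cell holds a fixed weight (parameter),
# multiplies the input streaming past it, and adds the partial sum handed
# down from its neighbor. No memory access occurs between the cells.
def mac_column(weights, inputs):
    partial_sum = 0.0
    for w, x in zip(weights, inputs):
        partial_sum = partial_sum + w * x   # one multiply-accumulate per cell
    return partial_sum

result = mac_column([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # 4 + 10 + 18 = 32.0
```

In the real array many such columns run in lockstep, one per output element, with the partial sums flowing between physically adjacent ALUs instead of through a loop variable.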
CPUs and GPUs often spend energy to access multiple registers per operation. A systolic
array chains multiple ALUs together, reusing the result of reading a single register.
For an MXU, however, matrix multiplication reuses both inputs many times as part of
producing the output. The MXU can read each input value once, but use it for many different
operations without storing it back to a register. Wires only connect spatially adjacent ALUs,
which makes them short and energy-efficient. The ALUs perform only multiplications and
additions in fixed patterns, which simplifies their design.
The design is called systolic because the data flows through the chip in waves, reminiscent
of the way that the heart pumps blood. The particular kind of systolic array in the MXU is
optimized for power and area efficiency in performing matrix multiplications, and is not well
suited for general-purpose computation.
The TPU Matrix Multiplication Unit has a systolic array mechanism that contains 256 ×
256 = 65,536 ALUs. That means a TPU can process 65,536 multiply-and-adds for 8-bit
integers every cycle. Because a TPU runs at 700MHz, it can compute 65,536 × 700,000,000
= 46 × 10^12 multiply-and-add operations, or 92 teraops per second (92 × 10^12), in the
matrix unit.
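The arithmetic above can be checked directly:

```python
# Peak-throughput arithmetic for the TPU v1 matrix unit (figures from the text).
alus = 256 * 256                        # 65,536 8-bit MAC units
clock_hz = 700_000_000                  # 700 MHz
macs_per_second = alus * clock_hz       # about 46 x 10^12 multiply-and-adds
ops_per_second = 2 * macs_per_second    # counting multiply and add separately
teraops = ops_per_second / 1e12         # about 92 teraops
```

Note the factor of 2: each multiply-and-add counts as two arithmetic operations, which is how 46 × 10^12 MACs become roughly 92 teraops.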
Let’s compare the number of operations per cycle between CPU, GPU and TPU.
In comparison, a typical RISC CPU without vector extensions can only execute one or
two arithmetic operations per instruction, and a GPU can execute thousands of operations per
instruction. With the TPU, a single cycle of a MatrixMultiply instruction can invoke hundreds
of thousands of operations.
During the execution of this massive matrix multiply, all intermediate results are passed
directly between the 64K ALUs without any memory access, significantly reducing power
consumption and increasing throughput. As a result, the CISC-based matrix processor design
delivers an outstanding performance-per-watt ratio: the TPU provides an 83X better ratio than
contemporary CPUs and a 29X better ratio than contemporary GPUs.
Because general-purpose processors such as CPUs and GPUs must provide good performance
across a wide range of applications, they have evolved myriad sophisticated, performance-oriented
mechanisms. As a side effect, the behavior of those processors can be difficult to predict, which
makes it hard to guarantee a certain latency limit on neural network inference. In contrast, TPU
design is strictly minimal and deterministic as it has to run only one task at a time: neural
network prediction. You can see its simplicity in the floor plan of the TPU die.
If you compare this with floor plans of CPUs and GPUs, you’ll notice the red parts (control
logic) are much larger (and thus more difficult to design) for CPUs and GPUs since they need to
realize the complex constructs and mechanisms mentioned above. In the TPU, the control logic
is minimal and takes under 2% of the die.
More importantly, despite having many more arithmetic units and large on-chip memory, the
TPU chip is half the size of the other chips. Since the cost of a chip is roughly a cubic function
of its area (smaller chips yield more dies per silicon wafer, and yield is higher because small
chips are less likely to contain manufacturing defects), halving the chip size reduces chip cost
by roughly a factor of 8 (2^3).
With the TPU, Google can easily estimate exactly how much time is required to run a neural
network and make a prediction. This allows us to operate at near-peak chip throughput while
maintaining a strict latency limit on almost all predictions. For example, despite a strict 7ms
limit in the above-mentioned MLP0 application, the TPU delivers 15–30X more throughput than
contemporary CPUs and GPUs.
Google uses neural network predictions to support end-user-facing products and services, and
everyone knows that users become impatient if a service takes too long to respond. Thus, for
the MLP0 application, they limit the 99th-percentile prediction latency to around 7 ms, for a
consistently fast user experience from TPU-based Google services. The following is an overall
performance (predictions per second) comparison between the TPU and a contemporary CPU
and GPU across six neural network applications under a latency limit. In the most spectacular
case, the TPU provides 71X performance compared with the CPU for the CNN1 application.
After the TPU v1 was created, the team took the lessons they learned and applied them to
designing the TPU v2. As you can see, it’s considerably larger than the first TPU and features
four chips instead of just one. It delivers 180 teraflops of compute, meaning it can perform 180
trillion floating-point operations per second. And it does both training and prediction now.
The layout of the TPU v2 is quite interesting. Each board has those four chips, but each
chip has two cores. Each core then contains a matrix unit, a vector unit, and a scalar unit,
all connected to 8 gigabytes of high-bandwidth memory. That means in total, each board has
8 cores and 64 gigabytes of memory. And the matrix unit is a 128 × 128 systolic array.
What’s special about the TPU v2, though, is its use of a new data type called bfloat16, the
secret to high performance on Cloud TPUs. This custom floating point format is called “Brain
Floating Point Format,” or “bfloat16” for short. The name comes from “Google Brain,” an
artificial intelligence research group at Google where the idea for this format was conceived.
Bfloat16 is carefully used within systolic arrays to accelerate matrix multiplication operations
on Cloud TPUs. More precisely, each multiply-accumulate operation in a matrix multiplication
uses bfloat16 for the multiplication and 32-bit IEEE floating point for accumulation.
Bfloat16 is a custom 16-bit floating point format for machine learning that is composed of
one sign bit, eight exponent bits, and seven mantissa bits. This is different from the industry-
standard IEEE 16-bit floating point format, which was not designed with deep learning
applications in mind. Figure 1 diagrams the internals of three floating point formats: (a) FP32:
IEEE single-precision, (b) FP16: IEEE half-precision, and (c) bfloat16.
As the above image shows, bfloat16 has a greater dynamic range—i.e., number of exponent
bits—than FP16. In fact, the dynamic range of bfloat16 is identical to that of FP32. We’ve
trained a wide range of deep learning models, and in our experience, the bfloat16 format works
as well as the FP32 format while delivering increased performance and reducing memory usage.
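The relationship between FP32 and bfloat16 can be illustrated with plain bit manipulation: bfloat16 is simply the top 16 bits of an FP32 value. Real hardware typically uses round-to-nearest-even; simple truncation keeps this sketch short.

```python
import math
import struct

def float_to_bfloat16_bits(x):
    # Reinterpret the FP32 value as a 32-bit integer and keep the top 16 bits:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_float(b):
    # Re-expand to FP32 by padding the dropped mantissa bits with zeros.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

pi_bf16 = bfloat16_bits_to_float(float_to_bfloat16_bits(math.pi))  # 3.140625
```

Because the exponent field survives intact, any value representable in FP32 keeps its order of magnitude in bfloat16; only the low mantissa bits are lost, which is why the dynamic range matches FP32 exactly.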
Based on their years of experience training and deploying a wide variety of neural networks
across Google’s products and services, Google knew when they designed Cloud TPUs that neural
networks are far more sensitive to the size of the exponent than that of the mantissa. To ensure
identical behavior for underflows, overflows, and NaNs, bfloat16 has the same exponent size
as FP32. However, bfloat16 handles denormals differently from FP32: it flushes them to zero.
Unlike FP16, which typically requires special handling via techniques such as loss scaling [Mic
17], BF16 comes close to being a drop-in replacement for FP32 when training and running deep
neural networks.
When programming Cloud TPUs, the TPU software stack provides automatic format conver-
sion: values are seamlessly converted between FP32 and bfloat16 by the XLA compiler, which is
capable of optimizing model performance by automatically expanding the use of bfloat16 as far
as possible without materially changing the math in the model. This allows ML practitioners to
write models using the FP32 format by default and achieve some performance benefits without
having to worry about any manual format conversions—no loss scaling or code changes required.
While it is possible to observe the effects of bfloat16, this typically requires careful numerical
analysis of the computation’s outputs.
than the v1. Now for the really exciting part: the TPU v2 is arranged into pods. One TPU v2
pod is 64 TPUs, all connected together, and Google can use an entire pod as if it were one
machine. And since there are two cores per chip, four chips per board, and 64 TPUs per pod,
that multiplies out to 512 cores (256 chips) in a TPU pod, totaling over 11 petaflops of
processing power. Google can use smaller subdivisions of the pods as well, such as a quarter
pod or half pod. What’s really extraordinary is that Google has started using TPU pods to
train state-of-the-art models on benchmark datasets (e.g., ResNet-50) in 30 minutes, and that
with only a half pod of just 32 TPUs. The training process uses a batch size of 8,000 images.
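The pod arithmetic works out as follows:

```python
# TPU v2 pod totals (figures from the text).
boards_per_pod = 64          # one pod = 64 TPU boards
chips_per_board = 4
cores_per_chip = 2
teraflops_per_board = 180

chips = boards_per_pod * chips_per_board                      # 256 chips
cores = chips * cores_per_chip                                # 512 cores
pod_petaflops = boards_per_pod * teraflops_per_board / 1000   # 11.52 PFLOPS
```

64 boards at 180 teraflops each gives 11.52 petaflops, matching the "over 11 petaflops" figure in the text.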
As you can see, they took the v2 and they made it blue. But additionally, of course, these
chips use water cooling, which allows each board to take up much less vertical space, so the
TPU v3 pods can support many more TPUs. A full TPU v3 pod comes in at eight times faster
than a v2 pod and weighs in at over 100 PFlops (petaflops) of compute power. TPU boards
still have 4 chips per board, so Google has 1,024 chips in the TPU v3 supercomputer (pod).
The increased FLOPS per core and memory capacity in TPU v3 configurations can improve
the performance of your models in the following ways:
• TPU v3 configurations provide significant performance benefits per core for compute-bound
models. Memory-bound models on TPU v2 configurations might not achieve this same
performance improvement if they are also memory-bound on TPU v3 configurations.
• In cases where data does not fit into memory on TPU v2 configurations, TPU v3 can
provide improved performance and reduced recomputation of intermediate values (re-
materialization).
• TPU v3 configurations can run new models with batch sizes that did not fit on TPU v2
configurations. For example, TPU v3 might allow deeper ResNets and larger images with
RetinaNet.
Models that are nearly input-bound ("infeed") on TPU v2 because training steps are waiting
for input might also be input-bound with Cloud TPU v3. The pipeline performance guide can
help you resolve infeed issues.
• Multi-Layer Perceptrons (MLP)
– MLP0 is unpublished
– MLP1 is RankBrain [Cla15]
• Convolutional Neural Networks (CNN)
– CNN0 is AlphaZero, which mastered the games of chess, Go, and shogi [Sil18]
– CNN1 is a Google-internal model for image recognition
• Recurrent Neural Networks (RNN)
CPUs:
• Quick prototyping that requires maximum flexibility
• Simple models that do not take long to train
• Small models with small, effective batch sizes
• Models that contain many custom TensorFlow/PyTorch/JAX operations written in C++
• Models that are limited by available I/O or the networking bandwidth of the host system
GPUs:
• Models with a significant number of custom TensorFlow/PyTorch/JAX operations that
must run at least partially on CPUs
• Models with TensorFlow/PyTorch ops that are not available on Cloud TPU
• Medium-to-large models with larger effective batch sizes
TPUs:
• Models dominated by matrix computations
• Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
2 NVDLA
NVDLA provides free Intellectual Property (IP) licensing to anyone wanting to build a chip
that uses CNNs for inference applications. The accelerator is written in Verilog, and NVDLA
provides a complete solution with a C-model, compiler, Linux drivers, test benches and test
suites, kernel- and user-mode software, and software development tools. It is also configurable
and scalable to meet many different architecture needs (only available in NVDLA v2). NVDLA
is merely an accelerator, and any process must be scheduled and arbitrated by an outside entity
such as a CPU.
Full hardware acceleration of a Convolutional Neural Network (CNN) is achieved by exposing
individual blocks, each of which accelerates the operations related to one CNN layer (such as
convolution, deconvolution, fully-connected, pooling, activation, and local response
normalization).
Maintaining separate and independently configurable blocks means that the NVDLA can be
scaled and sized appropriately for many smaller applications. This modular architecture gives a
highly-configurable solution that is suitable to meet specific inferencing needs.
NVDLA is applied in many products, for example Nvidia’s Jetson Xavier NX:
• a small circuit board, about the size of a credit card, which includes a 6-core ARMv8.2
64-bit CPU
• an integrated 384-core Volta GPU with 48 Tensor Cores
Each block in the NVDLA architecture is responsible for supporting specific operations in-
tegral to inference on deep neural networks. These Inference operations are divided into five
groups:
• Convolution operations: convolution core and buffer blocks
• Single Data Point operations (SDP): activation engine block
• Planar Data operations (PDP): pooling engine block
• Multi-Plane operations: local response normalization block
• Data Memory and Reshape operations: reshape and bridge DMA blocks
Inference operations are required across a wide range of deep learning applications, so the
performance, area, and power requirements for any given NVDLA design will vary as well. The
NVDLA architecture therefore defines a series of hardware parameters for feature selection and
design sizing, which allow each design to be tuned to its requirements.
There are two key factors that impact convolution function performance:
1. Memory bandwidth
2. MAC efficiency.
⇒ Example: A 60% sparse network (60% of the data are zero) can almost cut the memory
bandwidth requirement to half.
• Second memory interface: Provides efficient on-chip buffering, which can increase mem-
ory traffic bandwidth and also reduce the memory traffic latency.
⇒ Example: Usually an on-chip SRAM can provide 2x to 4x of DRAM bandwidth with
1/10 to 1/4 of the latency.
⇒ Example: If the NVDLA design specification has Atomic-C = 16 and Atomic-K = 64 (which
would result in 1024 MAC instances), and one layer of the network has the input feature
data channel number = 8 and output feature data kernel number = 16, then the MAC
utilization will be only 1/8th (only 128 MACs will be utilized with the others being idle at
all times).
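That utilization figure can be reproduced with a small helper. The function name and the `min`-based model are ours, for illustration; they capture the simple case where each idle MAC lane corresponds to a missing input channel or output kernel.

```python
# MAC utilization for a direct-convolution layer on an NVDLA configuration.
def mac_utilization(atomic_c, atomic_k, in_channels, out_kernels):
    total_macs = atomic_c * atomic_k
    used_macs = min(in_channels, atomic_c) * min(out_kernels, atomic_k)
    return used_macs, used_macs / total_macs

used, util = mac_utilization(atomic_c=16, atomic_k=64,
                             in_channels=8, out_kernels=16)
# 128 MACs busy out of 1024, i.e. 1/8 utilization
```

This is why sizing Atomic-C and Atomic-K against the shapes of the target network's layers matters: an oversized configuration leaves most of its MACs idle.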
Hardware Parameters:
• Atomic – C sizing
• Atomic – K sizing
Because the image surface format is quite different from the normal feature data format, fea-
ture data fetching operations follow a different path from direct convolution operations. There-
fore, a special mode for image-input is needed for optimizations.
Hardware Parameters:
Hardware Parameters:
⇒ Allowing multiple sets of activations to share the same weight data means they can run at
the same time (reducing overall run-time).
Note: Maximum batching size is limited by the convolution buffer size, so the maximum
batching number is a hardware limitation in the design specification.
Hardware Parameters:
• Linear function: provides native support for linear functions by working with simple bias
and scaling
• Non-linear function: uses lookup tables (LUTs) to implement non-linear functions.
This combination supports most common activation functions and other element-wise
operations, including ReLU, PReLU, precision scaling, batch normalization, and bias addition,
as well as more complex non-linear functions such as sigmoid or hyperbolic tangent.
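The LUT path can be illustrated with a small sketch (an assumption for illustration only; NVDLA's actual table format and interpolation scheme are not reproduced here) that approximates a sigmoid with a lookup table plus linear interpolation:

```python
import numpy as np

# Sketch: approximating a non-linear activation (sigmoid) with a lookup
# table and piecewise-linear interpolation -- the general technique a
# LUT-based activation path implements. Table size and range are made up.

def make_sigmoid_lut(lo=-8.0, hi=8.0, entries=257):
    """Sample sigmoid at evenly spaced points over [lo, hi]."""
    xs = np.linspace(lo, hi, entries)
    return xs, 1.0 / (1.0 + np.exp(-xs))

def lut_lookup(x, xs, ys):
    """Piecewise-linear interpolation into the table, clamped at the ends."""
    return np.interp(x, xs, ys)

xs, ys = make_sigmoid_lut()
x = np.array([-2.0, 0.0, 2.0])
approx = lut_lookup(x, xs, ys)
exact = 1.0 / (1.0 + np.exp(-x))
print(np.max(np.abs(approx - exact)) < 1e-3)  # True: 257 entries suffice here
```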
Hardware Parameters:
• SDP throughput
• Precision Scaling: controls memory bandwidth throughout the full inference process.
Feature data can be scaled to its full range before being converted to a lower precision and
written to memory.
• Batch Normalization: in inference, batch normalization reduces to a linear function
with trained scaling factors. SDP can support a per-layer or per-channel parameter for the
batch-normalization operation. By re-centering and re-scaling, batch normalization makes
artificial neural networks faster and more stable.
• Bias Addition: some layers require a bias function at the output side, meaning they
need an offset (from a per-layer setting, a per-channel memory surface, or a per-feature
memory surface) added to the final result.
• Element-Wise Operation: NVDLA supports common operations such as add, subtract,
multiply, and max/min comparison for two feature data cubes that have the same W, H,
and C sizes.
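The batch-normalization point can be made concrete: at inference time the trained statistics fold into one per-channel scale and bias, which is exactly a linear multiply-add. A sketch of the standard folding math (the values and function names are hypothetical, not SDP code):

```python
import numpy as np

# Standard inference-time batch-norm folding (plain math, not SDP
# programming): BN collapses to a per-channel scale and bias, i.e. a
# single multiply-add per element, which a linear datapath can apply.

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Return (scale, bias) such that scale * x + bias == BN(x)."""
    scale = gamma / np.sqrt(var + eps)
    bias = beta - mean * scale
    return scale, bias

# Per-channel trained parameters and statistics (made-up values).
gamma = np.array([1.0, 0.5]); beta = np.array([0.0, 1.0])
mean  = np.array([2.0, -1.0]); var  = np.array([4.0, 1.0])
eps = 1e-5

scale, bias = fold_batchnorm(gamma, beta, mean, var, eps)
x = np.array([3.0, 0.0])                                  # one value per channel
y_folded = scale * x + bias                               # single multiply-add
y_full = gamma * (x - mean) / np.sqrt(var + eps) + beta   # textbook BN
print(np.allclose(y_folded, y_full))                      # True
```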
Hardware Parameters:
• SDP throughput
Hardware Parameters:
• PDP throughput
Hardware Parameters:
• CDP throughput
Hardware Parameters:
• “slice” operations may be used to separate out different features or spatial regions of an
image.
• “reshape-transpose” operations, which are commonly used in deconvolutional neural
networks, create output data with larger dimensions than the input dataset.
⇒ The combination of convolutional and deconvolutional neural networks is applied to
image-to-image problems.
Figure 10: Combination of a Convolutional Neural Network and a Deconvolutional Neural Network
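The reshape-transpose idea can be sketched in NumPy (an illustrative depth-to-space transform under an assumed (C, H, W) layout; this is not NVDLA's register-level operation):

```python
import numpy as np

# Sketch: a reshape-transpose that turns channel depth into spatial size --
# the kind of step a deconvolution's second phase needs to produce an
# output with larger spatial dimensions than its input. Layout is assumed.

def depth_to_space(cube, stride):
    """(C*s*s, H, W) -> (C, H*s, W*s) by reshaping and transposing axes."""
    c_ext, h, w = cube.shape
    c = c_ext // (stride * stride)
    x = cube.reshape(c, stride, stride, h, w)  # split off the channel extension
    x = x.transpose(0, 3, 1, 4, 2)             # interleave the factors into H and W
    return x.reshape(c, h * stride, w * stride)

cube = np.arange(2 * 2 * 2 * 3 * 3).reshape(8, 3, 3)  # C=2 extended by s=2
out = depth_to_space(cube, stride=2)
print(out.shape)  # (2, 6, 6): spatial dims grow, channel count shrinks
```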
To increase performance, NVDLA provides the Rubik function, which transforms the data
mapping format without performing any data calculations. NVDLA supports three working
modes:
• Contract Mode: contract mode in Rubik transforms the mapping format to de-extend
the data cube. It is the second hardware layer supporting deconvolution. Normally, a
software deconvolution layer has deconvolution x-stride and y-stride greater than 1; with
these strides, the output of the phase-I hardware layer is a channel-extended data cube.
• Split Mode and Merge Mode: split and merge are two opposite operation modes in
Rubik.
– Split transforms a data cube into M-planar format (NCHW). The number of planes
is equal to the channel size.
– Merge transforms a series of planes into a feature data cube.
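Split and merge can be mimicked in a few lines of NumPy (a software stand-in for the Rubik hardware, using an assumed (C, H, W) cube layout):

```python
import numpy as np

# Sketch: split turns one (C, H, W) feature cube into C single-channel
# planes -- one plane per channel, as the text describes -- and merge is
# the exact inverse. This models the data movement, not the hardware.

def split(cube):
    """(C, H, W) cube -> list of C planes, each of shape (H, W)."""
    return [cube[c] for c in range(cube.shape[0])]

def merge(planes):
    """List of (H, W) planes -> one (C, H, W) feature data cube."""
    return np.stack(planes, axis=0)

cube = np.arange(3 * 4 * 5).reshape(3, 4, 5)
planes = split(cube)
print(len(planes))                          # 3: one plane per channel
print(np.array_equal(merge(planes), cube))  # True: merge inverts split
```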
Hardware Parameters:
• An SRAM is attached to NVDLA. (Other units in the system may also have connections
to this SRAM and share it for their own needs; this is not shown in the diagram.) The
SRAM works as a cache to boost NVDLA’s performance.