
DaVinci: A Scalable Architecture for Neural Network Computing

Heng Liao, Jiajin Tu, Jing Xia, Xiping Zhou

2019-07


Key element to enable intelligence in physical devices: control, process, transfer, store ……

Ubiquitous AI computation

[Scatter chart: applications plotted by memory footprint (GB) and AI computation (TOPS), with scalability increasing toward the upper right; from IoT devices, smart toys, smart cameras, embedded AI and phones, through drones, robotics, intelligent surveillance, industrial IP cameras and cloud AI inference, up to autonomous driving, smart city, model training, AutoML, meta-learning, algorithm discovery and general artificial intelligence]

Applications span roughly a 10^6 range in performance.


Ubiquitous AI computation

[Scatter chart: the DaVinci family plotted by memory footprint (GB) and computation (TOPS), showing the scalability of a single architecture; DaVinci Tiny and DaVinci Lite cover wearables, phones and wireless, DaVinci Mini covers surveillance and edge training, and DaVinci Max covers cloud training]
Rich Variety of Computing Architectures in the Huawei Portfolio
• Wide range of performance & efficiency
• CPU: general purpose
• GPU: graphics
• NPU: DNN
• ISP: camera sensor pipeline
• DSP: camera post-processing, AR
• VPU: vision processing unit
• NP: network processor
• Each category represents a different PPA (power, performance, area) curve

Target: search for the optimal PPA point in the design space (performance vs. power)
Architecture Overview of DaVinci

Building Blocks and their Computation Intensity

1D Scalar Unit (full flexibility) + 2D Vector Unit (rich & efficient operations) + 3D Matrix Unit (high intensity)

Operations per instruction for side length N:
N   | N²   | N³
1   | 1    | 1
2   | 4    | 8
4   | 16   | 64
8   | 64   | 512
16  | 256  | 4096
32  | 1024 | 32768
64  | 4096 | 262144

                           | GPU + Tensor core | AI core + SRAM
Area (normalized to 12 nm) | 5.2 mm²           | 13.2 mm²
Compute power              | 1.7 TOPS FP16     | 8 TOPS FP16
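To make the intensity scaling concrete, here is a small, purely illustrative Python sketch: it tabulates the operations a 1D, 2D and 3D unit retire per instruction for side length N, and converts the 16³ cube into a TOPS figure. The roughly 1 GHz clock used in the conversion is an assumption (not stated on the slide); only the N / N² / N³ values and the 8 TOPS FP16 figure come from the table above.

```python
# Illustrative only: operations retired per instruction by 1D, 2D and 3D units
# of side length N, mirroring the N / N^2 / N^3 table above.
def ops_per_instruction(n: int) -> dict:
    return {"scalar_1d": n, "vector_2d": n ** 2, "cube_3d": n ** 3}

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, ops_per_instruction(n))

# At N = 16 the cube performs 16**3 = 4096 FP16 MACs per cycle. Assuming a
# clock of roughly 1 GHz (an assumption, not stated on the slide), that is
# 4096 MACs * 2 ops/MAC * 1e9 Hz ~= 8.2 TOPS FP16, in line with the 8 TOPS
# figure quoted in the comparison above.
print(4096 * 2 * 1e9 / 1e12, "TOPS FP16 at an assumed 1 GHz")
```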
DaVinci Core

DaVinci Core (16³ Cube)

[Block diagram: the L1 Buffer (1 MB) feeds, through the MTE (with decompress, img2col and 16² A/B transpose paths, 2048-bit L1 load), the L0 Buffer A (64 KB) and L0 Buffer B (64 KB), which supply the 16³ Cube over a 4096-bit path; results land in the L0 Buffer C (256 KB) with its accumulator, then pass through FP16->FP32 / FP32->FP16 conversion and ReLU to the 8×16 Vector Unit and the Unified Buffer (256 KB) over 2048-bit paths; a Scalar Unit with GPR/SPR, AGU and mask generation feeds the Cube, Vector and MTE queues through the instruction dispatch and PSQ, with event sync between the queues; 32 KB I-cache, config port, BIU and a 1024-bit ×2 L2 access port]

• Cube: 4096 (16³) FP16 MACs + 8192 INT8 MACs
• Vector: 2048-bit INT8/FP16/FP32 vector unit with special functions
  (activation functions, NMS - Non-Maximum Suppression, ROI, SORT)
• Explicit memory hierarchy design, managed by the MTE (Memory Transfer Engine)
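The following is a minimal NumPy sketch of how a larger FP16 matrix multiply decomposes into 16×16×16 cube steps with FP32 accumulation, mirroring the L0A/L0B/L0C roles above. Tile scheduling, img2col and buffer management are simplified away, so treat it as an illustration of the granularity, not of the real hardware dataflow.

```python
import numpy as np

T = 16  # cube edge: one cube instruction computes a 16x16x16 partial matmul

def cube_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Tile an (M,K) x (K,N) FP16 matmul into 16x16x16 cube steps.

    Illustrative only: real scheduling, img2col and buffer management are
    handled by the MTE and the compiler, not spelled out here.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % T == 0 and k % T == 0 and n % T == 0
    c = np.zeros((m, n), dtype=np.float32)              # accumulate in FP32 (L0C)
    for i in range(0, m, T):
        for j in range(0, n, T):
            for p in range(0, k, T):
                a_tile = a[i:i+T, p:p+T].astype(np.float32)   # tile held in L0A
                b_tile = b[p:p+T, j:j+T].astype(np.float32)   # tile held in L0B
                c[i:i+T, j:j+T] += a_tile @ b_tile            # one cube operation
    return c

a = np.random.rand(64, 48).astype(np.float16)
b = np.random.rand(48, 32).astype(np.float16)
ref = a.astype(np.float32) @ b.astype(np.float32)
assert np.allclose(cube_matmul(a, b), ref, atol=1e-2)
```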
Micro Architecture Configurations

Core Version | Cube Ops/cycle | Vector Ops/cycle | L0 Bus Width     | L1 Bus Width            | L2 Bandwidth
DaVinci Max  | 8192           | 256              | A: 8192, B: 2048 | Matches execution units | 910: 3 TB/s ÷ 32; 610: 2 TB/s ÷ 8; 310: 192 GB/s ÷ 2
DaVinci Lite | 4096           | 128              | A: 8192, B: 2048 | Matches execution units | 38.4 GB/s
DaVinci Tiny | 512            | 32               | A: 2048, B: 512  | Matches execution units | None

Design intent per column: the Cube sets the performance baseline; the Vector unit is sized to minimize vector-bound cases; the L0 bus width ensures L0 is not a bound; the L1 bus width matches the execution units so it is not a bottleneck; L2 bandwidth is scarce (limited by the NoC), so being bound by it is avoided where possible.
Resource Matching: Vector

[Charts: cube-cycle / vector-cycle ratio per layer for BERT (task0 ~ task240), MobileNet and ResNet-50 (conv1 through res5c); the ratio varies widely across layers]

• Balance compute power between the Cube and the Vector unit by overlapping Cube computation time with Vector computation (see the timing sketch below).
• Carefully allocate the number of MACs in the Cube and in the Vector unit.
• Support multiple matrix-times-vector multiply operations in the Cube.
• Expand the width of the data bus between the L1 feature-map buffer and the Cube.
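The toy timing model below illustrates the overlap idea in the first bullet: when consecutive tasks are pipelined, the time spent per task approaches max(cube, vector) rather than their sum. The cycle counts are invented; only the shape of the argument matters.

```python
# Toy timing model (made-up cycle counts): overlapping cube and vector work
# across consecutive tasks hides the shorter of the two stages.
tasks = [(1000, 40), (800, 120), (1200, 60), (500, 400)]  # (cube_cycles, vector_cycles)

serial = sum(c + v for c, v in tasks)

overlapped = 0
prev_vector = 0
for cube, vector in tasks:
    # while task i runs on the cube, task i-1's vector work runs in parallel
    overlapped += max(cube, prev_vector)
    prev_vector = vector
overlapped += prev_vector  # drain the last vector stage

print(f"serial: {serial} cycles, overlapped: {overlapped} cycles")
```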
Resource Matching: Memory Hierarchy

DaVinci carefully balances the memory-hierarchy design so that bandwidth does not become a bottleneck at key locations.

Examples:
• Reduce the DDR bandwidth requirement by reusing data within L1, L0A and L0B (a rough traffic estimate follows below).
• Provide asymmetric bandwidth according to the nature of the computation:
  • L1 -> L0A bandwidth >> L1 -> L0B bandwidth, because the feature-map size W×H can be much larger than the number of output channels.
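As a hedged back-of-envelope estimate (the layer shape is illustrative, not a measured workload), the sketch below compares DDR traffic for one convolution with and without on-chip reuse of the weights and input tiles.

```python
# Back-of-envelope DDR traffic for one conv layer (illustrative shapes, FP16).
H, W, Cin, Cout, K = 56, 56, 64, 64, 3
bytes_fp16 = 2

ifmap   = H * W * Cin * bytes_fp16
weights = K * K * Cin * Cout * bytes_fp16
ofmap   = H * W * Cout * bytes_fp16

# Without reuse: every output pixel re-reads its KxKxCin input patch and the
# full weight set from DDR.
no_reuse = H * W * (K * K * Cin * bytes_fp16 + weights) + ofmap

# With reuse in L1/L0A/L0B: the input feature map and the weights are each
# fetched from DDR once and reused on chip.
with_reuse = ifmap + weights + ofmap

print(f"no reuse:   {no_reuse / 1e6:.1f} MB")
print(f"with reuse: {with_reuse / 1e6:.1f} MB "
      f"({no_reuse / with_reuse:.0f}x less DDR traffic)")
```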
More Challenges of DaVinci
Overview of the DSA Developer Stack

Level                                                            | GPU stack      | NPU stack
Level 3 Library (written by novice programmers)                  |                | TBE LIB
Level 3 Compiler (mathematical programming model)                | TVM / XLA      | TBE
Level 2 Library (written by skilled programmers)                 | cuDNN / cuBLAS | TIK LIB
Level 2 Compiler (parallel/kernel programming model)             | CUDA / OpenCL  | TIK
Level 1 Library (written by experts)                             |                | CCE Lib
Level 1 Compiler (intrinsic C, architecture-defined programming) |                | CCE C
Instruction Set Architecture                                     | GPU            | NPU
Challenge 1: How to Enable Parallelism with a Single Thread

[Pipeline diagram: the scalar pipe issues instructions 0~7 in program order; the dispatch unit routes them into the Cube, Vector, MTE and Sync queues, so Cube instr 0, Vector instr 0/1 and MTE instr 0/1 execute in parallel, constrained only by data dependencies, resource limits and a global barrier along the time axis 0~9]

• Programmers are comfortable with sequential code.
• DaVinci's C-like programming interface (CCE) lets the programmer control the parallelism explicitly; a toy model of the queue-and-event scheme follows below.
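The following is a toy Python model of that dispatch scheme, not the CCE API: instruction names, durations and event names are all made up. It shows the core idea: one sequential program is split into per-unit queues, and explicit set/wait event pairs carry the data dependencies, so independent Cube, Vector and MTE instructions overlap in time.

```python
# Toy model of single-thread dispatch with explicit synchronization.
# Instruction = (unit, duration, waits_on, sets). Events are the only ordering
# between units; within a unit, the queue preserves program order.
program = [
    ("MTE",    3, [],                       ["a_loaded"]),  # load tile A
    ("MTE",    3, [],                       ["b_loaded"]),  # load tile B
    ("CUBE",   5, ["a_loaded", "b_loaded"], ["c_ready"]),   # matmul on the cube
    ("VECTOR", 2, ["c_ready"],              ["c_act"]),     # activation on output
    ("MTE",    3, ["c_act"],                []),            # store result
]

unit_free = {"MTE": 0, "CUBE": 0, "VECTOR": 0}
event_time = {}
for unit, dur, waits, sets in program:
    start = max([unit_free[unit]] + [event_time[e] for e in waits])
    finish = start + dur
    unit_free[unit] = finish
    for e in sets:
        event_time[e] = finish
    print(f"{unit:<6} start={start:2d} finish={finish:2d}")
```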
Solution with Multi-threading?
What about supporting a hardware multi-threading feature?
• The code in each thread is sequential
• The Cube is a shared resource between threads
• Hardware multi-threading has a hardware cost
How does it work - TIK

• Typical sequential DaVinci code is a combination of nested FOR loops.
• Software multi-threading can be added to any FOR loop body (iterator kernel), as in the sketch below.

[Figure: programmer view of multi-threading, with iterations of the loop body distributed across software threads]
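As an illustration of adding software multi-threading to a FOR loop body, here is a minimal TIK-style kernel in Python. The module path, intrinsic names and argument layouts (for_range with thread_num, data_move, vec_muls, BuildCCE) are assumptions based on public TIK documentation and may differ between CANN releases; treat this as a sketch, not an authoritative example.

```python
# Hedged TIK-style sketch: scale a 128x1024 FP16 buffer by 2 with the loop
# body double-buffered across two software threads.
from te import tik

ROWS, COLS = 128, 1024
tik_instance = tik.Tik()
src = tik_instance.Tensor("float16", (ROWS * COLS,), name="src", scope=tik.scope_gm)
dst = tik_instance.Tensor("float16", (ROWS * COLS,), name="dst", scope=tik.scope_gm)

# thread_num=2 asks TIK to double-buffer the loop body: while iteration i runs
# on the Vector unit, iteration i+1's MTE data movement is already in flight.
with tik_instance.for_range(0, ROWS, thread_num=2) as i:
    ub = tik_instance.Tensor("float16", (COLS,), name="ub", scope=tik.scope_ubuf)
    tik_instance.data_move(ub, src[i * COLS], 0, 1, COLS * 2 // 32, 0, 0)   # GM -> UB
    tik_instance.vec_muls(128, ub, ub, 2.0, COLS // 128, 8, 8)              # scale by 2
    tik_instance.data_move(dst[i * COLS], ub, 0, 1, COLS * 2 // 32, 0, 0)   # UB -> GM

tik_instance.BuildCCE(kernel_name="scale_by_two", inputs=[src], outputs=[dst])
```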
Advanced Compiler Techniques
• Architecture-independent lowering process: DSL -> C -> binary
• The traversal order determines the data reuse factor (see the sketch below)
• There are millions of legitimate mappings of the tensor data set onto the kernel work space
• Find the optimal mapping to bridge the ~2,000x memory bandwidth gap
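A tiny illustration of why the traversal order matters: for the same tiled matmul, two legal loop orders move very different numbers of tiles between L1 and the next memory level. The single-tile-resident buffer model and the sizes below are assumptions chosen for clarity, not a description of the real compiler's cost model.

```python
# Assume L1 holds just one tile each of A, B and C (a deliberately minimal
# model); count tile moves between L1 and the next memory level for two
# traversal orders of C[M,N] += A[M,K] @ B[K,N].
M = N = K = 64                     # problem size, measured in tiles

# Order m, n, k: the C tile stays resident across the whole k loop.
traffic_mnk = (M * N * K) + (M * N * K) + (M * N)        # A + B + C (write once)

# Order m, k, n: the A tile stays resident across the n loop, but the C tile
# must be re-read and re-written on every k step.
traffic_mkn = (M * K) + (M * K * N) + (2 * M * N * K)    # A + B + C (read+write)

print("m,n,k order:", traffic_mnk, "tile moves")   # 528,384
print("m,k,n order:", traffic_mkn, "tile moves")   # 790,528
```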


Putting All This Together

User program (TensorFlow, PyTorch, MindSpore): the AI model description

Graph Engine (L2: architecture independent)

CANN (L1: architecture dependent): operator library, user-defined operators

Task scheduler

DaVinci Core | DaVinci Core | DaVinci Core

• Users program AI models in the familiar frameworks
• The operator library is extended with user-defined operators when necessary
• Tasks are executed on a single node or over a network cluster
DaVinci AI Core in SoCs

Mobile AP SoC, Automotive SoC, AI Inference SoC

AI Inference SoC (Ascend 310), attached to an x86/ARM host over the network or as a PCIe EP device:
• 2 DaVinci AI cores
• A55 CPU cluster (8 cores) with DSU, plus a low-power M3 and the TS (task scheduler)
• CHIE interconnect, 512 bits wide
• 8 MB on-chip buffer and 3 MB last-level cache
• DMA engines, FHD video and JPEG/PNG codecs
• I/O: PCIe 3.0 controller (RC/EP), USB 3.0 device, Gigabit Ethernet, SPI / UART / I2C / Flash SPI / GPIO, etc.
• 2 × 64-bit LPDDR4X channels to LPDDR4 chips
AI Training SoC

Ascend 910 = the Vitruvian compute die + the Nimbus V3 I/O die, with on-chip HBM dies on either side.

[Block diagram; the recoverable elements are:]
• 32 DaVinci AI cores, arranged in clusters 0~7, 8~15, 16~23 and 24~31
• Taishan ARM CPU subsystem (MP4 clusters), L3 cache and TS (task scheduler) subsystem
• DVPP (digital vision pre-processing) units and DMA engines
• 4 × HBM 2.0 stacks and 4 × DDR4 DIMM channels
• 32 MB on-chip buffer
• CHIE NoC (mesh) interconnect
• Nimbus I/O die: PCIe subsystem to the x86/ARM host, network subsystem, HAC, IMU, CCIX link to an FPGA, and the Hydra subsystem for the extended DaVinci board
Wireless SoC

[Block diagram: CPU, DSP and AVP clusters, each with its own bus and shared memory, plus an embedded DaVinci core and an L3 cache; a ring bus and a mesh bus connect to DDR, IO/SerDes/Ethernet, a queue manager, a CGRA, LDPC/Polar and FFT accelerators]
Alleviating the Memory Wall and the Slowdown of Moore's Law

Memory Wall & I/O Wall

Level                       | Bandwidth | Ratio    | Challenge
Execution engine (512 TOPS) | 2048 TB/s | 1        | Faster EUs can be built, but there is no way to feed them data
L0 memory                   | 2048 TB/s | 1/1      | Very wide datapath, hard to do scatter-gather; inner-loop data reuse
L1 memory                   | 200 TB/s  | 1/10     | Intra-kernel data reuse
L2 memory                   | 20 TB/s   | 1/100    | Inter-kernel data reuse
HBM memory                  | 1 TB/s    | 1/2000   | HBM size limits the memory footprint
Intra-node bandwidth        | 50 GB/s   | 1/40000  | Scaling up a node increases the memory footprint but is severely bandwidth constrained
Inter-node bandwidth        | 10 GB/s   | 1/200000 | Model parallelism across nodes is severely bandwidth constrained

Compiler innovation targets the on-chip levels (L0 to L2 data reuse); cluster innovation targets the HBM, intra-node and inter-node levels.
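A quick arithmetic check on the table, using only the numbers quoted above, shows why the on-chip reuse strategy has to deliver roughly three orders of magnitude of traffic reduction before data ever reaches HBM.

```python
# Rough check of the ~2000x gap between execution-engine and HBM bandwidth,
# using the figures from the table above.
eu_bandwidth_Bps  = 2048e12     # bytes/s the execution engine can consume
hbm_bandwidth_Bps = 1e12        # bytes/s HBM can supply
print("EU : HBM bandwidth ratio =", eu_bandwidth_Bps / hbm_bandwidth_Bps)   # ~2048x

# Equivalent view: operations each HBM byte must feed to sustain 512 TOPS.
peak_ops_per_s = 512e12
print("ops per HBM byte at peak:", peak_ops_per_s / hbm_bandwidth_Bps)      # 512
```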
Technology Challenges: Why Do We Need 3DIC?

• Memory wall: bounded by memory bandwidth and capacity
• Interface wall: interface I/O speed does not scale with the process node
• Logic wall: process scaling cannot satisfy the need for more die area (many chips have reached full reticle size in the 7+ process)
• Power wall: the slowdown of Dennard scaling causes power limits
• Thermal wall: ultimate system performance is thermally limited

• 3DIC can help alleviate the memory wall, I/O wall and logic wall.
• The search for new transistors can help with the power and thermal walls.
• Architecture innovation is needed to reach new ground despite all of these challenges.
AI Training SoC: Logic + 3D-SRAM + 12 HBM

[Package diagram: the AI SoC die and an I/O die with 3D-SRAM dies stacked below the SoC die, surrounded by 12 HBM2E stacks on an interposer over the substrate]

• Customized HBM2E with two stacks per site to increase HBM bandwidth
• Large 3D-SRAM serves as the AI core cache
Mobile AP: LoL + MoL

[Package diagrams for the three steps, each on the package board]

Step 1
• One logic die + 3D DRAM (3DM)
• 3DM + PoP LPDDR

Step 2
• Two logic dies + 3D DRAM
• PoP LPDDR

Step 3
• Multi-layer 3D DRAM (remove the PoP LPDDR)
• Multi-layer logic die
Physical Design of DaVinci AI Chips

Ascend Architecture

Ascend 310 (Ascend-Mini): high power efficiency
• Architecture: DaVinci
• FP16: 8 TFLOPS; INT8: 16 TOPS
• 16-channel video decode, 1-channel video encode (H.264/H.265)
• Power: 8 W
• Process: 12 nm

Ascend 910 (Ascend-Max): high computing density
• Architecture: DaVinci
• FP16: 256 TFLOPS; INT8: 512 TOPS
• 128-channel video decode (H.264/H.265)
• Power: 350 W
• Process: 7+ nm EUV

Comparison of Computing Density

Chip                 | Die area      | Peak compute
Tesla V100 (12 nm)   | 416 mm²       | 120 TFLOPS FP16
Google TPU v3 (? nm) | (undisclosed) | 105 TFLOPS bfloat16
Ascend 910 (7+ nm)   | 182.4 mm²     | 256 TFLOPS FP16
Ascend 910 AI Server

8 × Ascend 910 (DaVinci node) + x86 node

AI Server specification:
Node         | 8 × DaVinci (Ascend 910); 2 × Xeon CPU + 24 DIMMs
Performance  | 2 PFLOPS per chassis, 256 TFLOPS per AI module
Memory       | 24 DIMMs, up to 1.5 TB
Storage      | 6 × 2.5-inch NVMe, 24 TB; 2 × 2.5-inch SAS/SATA, RAID 1
Interface    | 8 × 100G fiber; 4 × PCIe IO
Power        | 6000 W
Ambient temp | 5~35 ℃
Ascend 910 Cluster
• 2048 nodes × 256 TFLOPS = 512 PFLOPS
• 1024~2048 node cluster

[Diagram: the Ascend cluster network connects AI servers; within each AI server, 8 DaVinci chips (D) sit on an Ascend 910 board with board-level interconnection]
Ascend 910 Die Shot
 A total of 8 dies are integrated in the package
 Two dummy dies are added to ensure mechanical uniformity
 Total silicon area: 456 + 168 + 96×4 + 110×2 = 1228 mm²
   (Vitruvian compute die 456 mm², Nimbus I/O die 168 mm², 4 × HBM at 96 mm² each, 2 × dummy dies at 110 mm² each)
Ascend 910 Floorplan
 A mesh NoC connects the 32 DaVinci cores in the Ascend 910 compute die
 The NoC provides 128 GB/s read bandwidth + 128 GB/s write bandwidth per core
 Ascend 910, 7 nm: the 8 TFLOPS tensor unit occupies 1.6 mm × 1.2 mm and the 128 GFLOPS vector unit 0.9 mm × 1 mm
 Inter-chip connections:
   3 × 240 Gbps HCCS ports for NUMA connections
   2 × 100 Gbps RoCE interfaces for networking

[Die photo with the tensor and vector units highlighted; compute die approximately 31.25 mm × 14.6 mm]
Ascend 910 NoC

1024-bit, 2 GHz NoC mesh
 Topology: 6 rows × 4 columns
 Access bandwidth to the on-chip L2 cache: 4 TB/s
 Access bandwidth to the off-chip HBM: 1.2 TB/s
 NoC bandwidth is fairly shared among the DaVinci cores (see the consistency check below)
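A quick consistency check on these figures, assuming the 32 cores share the mesh evenly (an assumption about the sharing policy, not a stated number):

```python
# Consistency check of the NoC figures, assuming even sharing across cores.
cores = 32
read_per_core_GBps = 128
print("aggregate L2 read bandwidth:",
      cores * read_per_core_GBps / 1000, "TB/s")            # ~4.1 TB/s, matches 4 TB/s

hbm_total_TBps = 1.2
print("HBM bandwidth per core:",
      hbm_total_TBps * 1000 / cores, "GB/s")                # 37.5 GB/s per core
```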
Ascend 310 Die Shot

Ascend 310, 12 nm: the tensor unit occupies 3.08 mm × 1.73 mm and the vector unit 1.6 mm × 1.35 mm

[Die photo with the tensor and vector units highlighted; die approximately 10.65 mm × 9.8 mm]
Kunpeng vs. Ascend

[Die shots of Ascend 310, Kunpeng 920 and Ascend 910 shown side by side for scale]
Future: DaVinci 3D-SRAM Floorplan

[3D SRAM architecture diagram: the logic die on top connects through BPV/TDV and a ~20 um dielectric layer, with TSVs and ubumps, to the memory die below, which sits on the interposer/substrate]
More Challenges…
• Generalized AutoML
• Efficiency for reinforcement learning and GNNs?
• Generalized methods for data/model/pipeline parallelism
• How to unify data precision?
• Finding the sweet-spot architecture
 Big chip vs. small chip
 Dense vs. sparse
 Out-of-memory vs. near-memory vs. in-memory computing
Thank you.

Bring digital to every person, home and
organization for a fully connected,
intelligent world.

Copyright©2018 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.

Huawei Confidential
