
DaVinci: A Scalable Architecture for Neural Network Computing

Heng Liao, Jiajin Tu, Jing Xia, Xiping Zhou

2019-07


Key element to enable intelligence in physical devices: control, process, transfer, store ……

Ubiquitous AI computation

[Scatter chart: applications plotted by memory footprint (GB) and AI computation (TOPS), with scalability increasing toward the upper right; from IoT devices, smart toys, smart cameras, embedded AI and phones, through drones, robotics, intelligent surveillance, industrial IP cameras and cloud AI inference, up to autonomous driving, smart city, model training, AutoML, meta-learning, algorithm discovery and general artificial intelligence]

Applications span roughly a 10^6 range in performance.


Ubiquitous AI computation

[Scatter chart: the DaVinci family plotted by memory footprint (GB) and computation (TOPS), showing the scalability of a single architecture; DaVinci Tiny and DaVinci Lite cover wearables, phones and wireless, DaVinci Mini covers surveillance and edge training, and DaVinci Max covers cloud training]
Rich Variety of Computing Architectures in the Huawei Portfolio
• Wide range of performance & efficiency
• CPU: general purpose
• GPU: graphics
• NPU: DNN
• ISP: camera sensor pipeline
• DSP: camera post-processing, AR
• VPU: vision processing unit
• NP: network processor
• Each category represents a different PPA (power, performance, area) curve

Target: search for the optimal PPA point in the design space (performance vs. power)
Architecture Overview of DaVinci

Building Blocks and their Computation Intensity

1D Scalar Unit (full flexibility) + 2D Vector Unit (rich & efficient operations) + 3D Matrix Unit (high intensity)

Operations per instruction for side length N:
N   | N²   | N³
1   | 1    | 1
2   | 4    | 8
4   | 16   | 64
8   | 64   | 512
16  | 256  | 4096
32  | 1024 | 32768
64  | 4096 | 262144

                           | GPU + Tensor core | AI core + SRAM
Area (normalized to 12 nm) | 5.2 mm²           | 13.2 mm²
Compute power              | 1.7 TOPS FP16     | 8 TOPS FP16
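To make the intensity scaling concrete, here is a small, purely illustrative Python sketch: it tabulates the operations a 1D, 2D and 3D unit retire per instruction for side length N, and converts the 16³ cube into a TOPS figure. The roughly 1 GHz clock used in the conversion is an assumption (not stated on the slide); only the N / N² / N³ values and the 8 TOPS FP16 figure come from the table above.

```python
# Illustrative only: operations retired per instruction by 1D, 2D and 3D units
# of side length N, mirroring the N / N^2 / N^3 table above.
def ops_per_instruction(n: int) -> dict:
    return {"scalar_1d": n, "vector_2d": n ** 2, "cube_3d": n ** 3}

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, ops_per_instruction(n))

# At N = 16 the cube performs 16**3 = 4096 FP16 MACs per cycle. Assuming a
# clock of roughly 1 GHz (an assumption, not stated on the slide), that is
# 4096 MACs * 2 ops/MAC * 1e9 Hz ~= 8.2 TOPS FP16, in line with the 8 TOPS
# figure quoted in the comparison above.
print(4096 * 2 * 1e9 / 1e12, "TOPS FP16 at an assumed 1 GHz")
```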
DaVinci Core

DaVinci Core (16³ Cube)

[Block diagram: the L1 Buffer (1 MB) feeds, through the MTE (with decompress, img2col and 16² A/B transpose paths, 2048-bit L1 load), the L0 Buffer A (64 KB) and L0 Buffer B (64 KB), which supply the 16³ Cube over a 4096-bit path; results land in the L0 Buffer C (256 KB) with its accumulator, then pass through FP16->FP32 / FP32->FP16 conversion and ReLU to the 8×16 Vector Unit and the Unified Buffer (256 KB) over 2048-bit paths; a Scalar Unit with GPR/SPR, AGU and mask generation feeds the Cube, Vector and MTE queues through the instruction dispatch and PSQ, with event sync between the queues; 32 KB I-cache, config port, BIU and a 1024-bit ×2 L2 access port]

• Cube: 4096 (16³) FP16 MACs + 8192 INT8 MACs
• Vector: 2048-bit INT8/FP16/FP32 vector unit with special functions
  (activation functions, NMS - Non-Maximum Suppression, ROI, SORT)
• Explicit memory hierarchy design, managed by the MTE (Memory Transfer Engine)
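The following is a minimal NumPy sketch of how a larger FP16 matrix multiply decomposes into 16×16×16 cube steps with FP32 accumulation, mirroring the L0A/L0B/L0C roles above. Tile scheduling, img2col and buffer management are simplified away, so treat it as an illustration of the granularity, not of the real hardware dataflow.

```python
import numpy as np

T = 16  # cube edge: one cube instruction computes a 16x16x16 partial matmul

def cube_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Tile an (M,K) x (K,N) FP16 matmul into 16x16x16 cube steps.

    Illustrative only: real scheduling, img2col and buffer management are
    handled by the MTE and the compiler, not spelled out here.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % T == 0 and k % T == 0 and n % T == 0
    c = np.zeros((m, n), dtype=np.float32)              # accumulate in FP32 (L0C)
    for i in range(0, m, T):
        for j in range(0, n, T):
            for p in range(0, k, T):
                a_tile = a[i:i+T, p:p+T].astype(np.float32)   # tile held in L0A
                b_tile = b[p:p+T, j:j+T].astype(np.float32)   # tile held in L0B
                c[i:i+T, j:j+T] += a_tile @ b_tile            # one cube operation
    return c

a = np.random.rand(64, 48).astype(np.float16)
b = np.random.rand(48, 32).astype(np.float16)
ref = a.astype(np.float32) @ b.astype(np.float32)
assert np.allclose(cube_matmul(a, b), ref, atol=1e-2)
```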
Micro Architecture Configurations

Core Version | Cube Ops/cycle | Vector Ops/cycle | L0 Bus Width     | L1 Bus Width            | L2 Bandwidth
DaVinci Max  | 8192           | 256              | A: 8192, B: 2048 | Matches execution units | 910: 3 TB/s ÷ 32; 610: 2 TB/s ÷ 8; 310: 192 GB/s ÷ 2
DaVinci Lite | 4096           | 128              | A: 8192, B: 2048 | Matches execution units | 38.4 GB/s
DaVinci Tiny | 512            | 32               | A: 2048, B: 512  | Matches execution units | None

Design intent per column: the Cube sets the performance baseline; the Vector unit is sized to minimize vector-bound cases; the L0 bus width ensures L0 is not a bound; the L1 bus width matches the execution units so it is not a bottleneck; L2 bandwidth is scarce (limited by the NoC), so being bound by it is avoided where possible.
Resource Matching: Vector

[Charts: cube-cycle / vector-cycle ratio per layer for BERT (task0 ~ task240), MobileNet and ResNet-50 (conv1 through res5c); the ratio varies widely across layers]

• Balance compute power between the Cube and the Vector unit by overlapping Cube computation time with Vector computation (see the timing sketch below).
• Carefully allocate the number of MACs in the Cube and in the Vector unit.
• Support multiple matrix-times-vector multiply operations in the Cube.
• Expand the width of the data bus between the L1 feature-map buffer and the Cube.
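The toy timing model below illustrates the overlap idea in the first bullet: when consecutive tasks are pipelined, the time spent per task approaches max(cube, vector) rather than their sum. The cycle counts are invented; only the shape of the argument matters.

```python
# Toy timing model (made-up cycle counts): overlapping cube and vector work
# across consecutive tasks hides the shorter of the two stages.
tasks = [(1000, 40), (800, 120), (1200, 60), (500, 400)]  # (cube_cycles, vector_cycles)

serial = sum(c + v for c, v in tasks)

overlapped = 0
prev_vector = 0
for cube, vector in tasks:
    # while task i runs on the cube, task i-1's vector work runs in parallel
    overlapped += max(cube, prev_vector)
    prev_vector = vector
overlapped += prev_vector  # drain the last vector stage

print(f"serial: {serial} cycles, overlapped: {overlapped} cycles")
```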
Resource Matching: Memory Hierarchy

DaVinci carefully balances the memory-hierarchy design so that bandwidth does not become a bottleneck at key locations.

Examples:
• Reduce the DDR bandwidth requirement by reusing data within L1, L0A and L0B (a rough traffic estimate follows below).
• Provide asymmetric bandwidth according to the nature of the computation:
  • L1 -> L0A bandwidth >> L1 -> L0B bandwidth, because the feature-map size W×H can be much larger than the number of output channels.
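As a hedged back-of-envelope estimate (the layer shape is illustrative, not a measured workload), the sketch below compares DDR traffic for one convolution with and without on-chip reuse of the weights and input tiles.

```python
# Back-of-envelope DDR traffic for one conv layer (illustrative shapes, FP16).
H, W, Cin, Cout, K = 56, 56, 64, 64, 3
bytes_fp16 = 2

ifmap   = H * W * Cin * bytes_fp16
weights = K * K * Cin * Cout * bytes_fp16
ofmap   = H * W * Cout * bytes_fp16

# Without reuse: every output pixel re-reads its KxKxCin input patch and the
# full weight set from DDR.
no_reuse = H * W * (K * K * Cin * bytes_fp16 + weights) + ofmap

# With reuse in L1/L0A/L0B: the input feature map and the weights are each
# fetched from DDR once and reused on chip.
with_reuse = ifmap + weights + ofmap

print(f"no reuse:   {no_reuse / 1e6:.1f} MB")
print(f"with reuse: {with_reuse / 1e6:.1f} MB "
      f"({no_reuse / with_reuse:.0f}x less DDR traffic)")
```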
More Challenges of DaVinci
Overview of the DSA Developer Stack

Level                                                            | GPU stack      | NPU stack
Level 3 Library (written by novice programmers)                  |                | TBE LIB
Level 3 Compiler (mathematical programming model)                | TVM / XLA      | TBE
Level 2 Library (written by skilled programmers)                 | cuDNN / cuBLAS | TIK LIB
Level 2 Compiler (parallel/kernel programming model)             | CUDA / OpenCL  | TIK
Level 1 Library (written by experts)                             |                | CCE Lib
Level 1 Compiler (intrinsic C, architecture-defined programming) |                | CCE C
Instruction Set Architecture                                     | GPU            | NPU
Challenge 1: How to Enable Parallelism with a Single Thread

[Pipeline diagram: the scalar pipe issues instructions 0~7 in program order; the dispatch unit routes them into the Cube, Vector, MTE and Sync queues, so Cube instr 0, Vector instr 0/1 and MTE instr 0/1 execute in parallel, constrained only by data dependencies, resource limits and a global barrier along the time axis 0~9]

• Programmers are comfortable with sequential code.
• DaVinci's C-like programming interface (CCE) lets the programmer control the parallelism explicitly; a toy model of the queue-and-event scheme follows below.
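The following is a toy Python model of that dispatch scheme, not the CCE API: instruction names, durations and event names are all made up. It shows the core idea: one sequential program is split into per-unit queues, and explicit set/wait event pairs carry the data dependencies, so independent Cube, Vector and MTE instructions overlap in time.

```python
# Toy model of single-thread dispatch with explicit synchronization.
# Instruction = (unit, duration, waits_on, sets). Events are the only ordering
# between units; within a unit, the queue preserves program order.
program = [
    ("MTE",    3, [],                       ["a_loaded"]),  # load tile A
    ("MTE",    3, [],                       ["b_loaded"]),  # load tile B
    ("CUBE",   5, ["a_loaded", "b_loaded"], ["c_ready"]),   # matmul on the cube
    ("VECTOR", 2, ["c_ready"],              ["c_act"]),     # activation on output
    ("MTE",    3, ["c_act"],                []),            # store result
]

unit_free = {"MTE": 0, "CUBE": 0, "VECTOR": 0}
event_time = {}
for unit, dur, waits, sets in program:
    start = max([unit_free[unit]] + [event_time[e] for e in waits])
    finish = start + dur
    unit_free[unit] = finish
    for e in sets:
        event_time[e] = finish
    print(f"{unit:<6} start={start:2d} finish={finish:2d}")
```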
Solution with Multi-threading?
What about supporting a hardware multi-threading feature?
• The code in each thread is sequential
• The Cube is a shared resource between threads
• Hardware multi-threading has a hardware cost
How does it work - TIK

• Typical sequential DaVinci code is a combination of nested FOR loops.
• Software multi-threading can be added to any FOR loop body (iterator kernel), as in the sketch below.

[Figure: programmer view of multi-threading, with iterations of the loop body distributed across software threads]
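As an illustration of adding software multi-threading to a FOR loop body, here is a minimal TIK-style kernel in Python. The module path, intrinsic names and argument layouts (for_range with thread_num, data_move, vec_muls, BuildCCE) are assumptions based on public TIK documentation and may differ between CANN releases; treat this as a sketch, not an authoritative example.

```python
# Hedged TIK-style sketch: scale a 128x1024 FP16 buffer by 2 with the loop
# body double-buffered across two software threads.
from te import tik

ROWS, COLS = 128, 1024
tik_instance = tik.Tik()
src = tik_instance.Tensor("float16", (ROWS * COLS,), name="src", scope=tik.scope_gm)
dst = tik_instance.Tensor("float16", (ROWS * COLS,), name="dst", scope=tik.scope_gm)

# thread_num=2 asks TIK to double-buffer the loop body: while iteration i runs
# on the Vector unit, iteration i+1's MTE data movement is already in flight.
with tik_instance.for_range(0, ROWS, thread_num=2) as i:
    ub = tik_instance.Tensor("float16", (COLS,), name="ub", scope=tik.scope_ubuf)
    tik_instance.data_move(ub, src[i * COLS], 0, 1, COLS * 2 // 32, 0, 0)   # GM -> UB
    tik_instance.vec_muls(128, ub, ub, 2.0, COLS // 128, 8, 8)              # scale by 2
    tik_instance.data_move(dst[i * COLS], ub, 0, 1, COLS * 2 // 32, 0, 0)   # UB -> GM

tik_instance.BuildCCE(kernel_name="scale_by_two", inputs=[src], outputs=[dst])
```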
Advanced Compiler Techniques
• Architecture-independent lowering process: DSL -> C -> binary
• The traversal order determines the data reuse factor (see the sketch below)
• There are millions of legitimate mappings of the tensor data set onto the kernel work space
• Find the optimal mapping to bridge the ~2,000x memory bandwidth gap
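A tiny illustration of why the traversal order matters: for the same tiled matmul, two legal loop orders move very different numbers of tiles between L1 and the next memory level. The single-tile-resident buffer model and the sizes below are assumptions chosen for clarity, not a description of the real compiler's cost model.

```python
# Assume L1 holds just one tile each of A, B and C (a deliberately minimal
# model); count tile moves between L1 and the next memory level for two
# traversal orders of C[M,N] += A[M,K] @ B[K,N].
M = N = K = 64                     # problem size, measured in tiles

# Order m, n, k: the C tile stays resident across the whole k loop.
traffic_mnk = (M * N * K) + (M * N * K) + (M * N)        # A + B + C (write once)

# Order m, k, n: the A tile stays resident across the n loop, but the C tile
# must be re-read and re-written on every k step.
traffic_mkn = (M * K) + (M * K * N) + (2 * M * N * K)    # A + B + C (read+write)

print("m,n,k order:", traffic_mnk, "tile moves")   # 528,384
print("m,k,n order:", traffic_mkn, "tile moves")   # 790,528
```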


Putting All This Together

User program (TensorFlow, PyTorch, MindSpore): the AI model description

Graph Engine (L2: architecture independent)

CANN (L1: architecture dependent): operator library, user-defined operators

Task scheduler

DaVinci Core | DaVinci Core | DaVinci Core

• Users program AI models in the familiar frameworks
• The operator library is extended with user-defined operators when necessary
• Tasks are executed on a single node or over a network cluster
DaVinci AI Core in SoCs

Mobile AP SoC, Automotive SoC, AI Inference SoC

AI Inference SoC (Ascend 310), attached to an x86/ARM host over the network or as a PCIe EP device:
• 2 DaVinci AI cores
• A55 CPU cluster (8 cores) with DSU, plus a low-power M3 and the TS (task scheduler)
• CHIE interconnect, 512 bits wide
• 8 MB on-chip buffer and 3 MB last-level cache
• DMA engines, FHD video and JPEG/PNG codecs
• I/O: PCIe 3.0 controller (RC/EP), USB 3.0 device, Gigabit Ethernet, SPI / UART / I2C / Flash SPI / GPIO, etc.
• 2 × 64-bit LPDDR4X channels to LPDDR4 chips
AI Training SoC

Ascend 910 = the Vitruvian compute die + the Nimbus V3 I/O die, with on-chip HBM dies on either side.

[Block diagram; the recoverable elements are:]
• 32 DaVinci AI cores, arranged in clusters 0~7, 8~15, 16~23 and 24~31
• Taishan ARM CPU subsystem (MP4 clusters), L3 cache and TS (task scheduler) subsystem
• DVPP (digital vision pre-processing) units and DMA engines
• 4 × HBM 2.0 stacks and 4 × DDR4 DIMM channels
• 32 MB on-chip buffer
• CHIE NoC (mesh) interconnect
• Nimbus I/O die: PCIe subsystem to the x86/ARM host, network subsystem, HAC, IMU, CCIX link to an FPGA, and the Hydra subsystem for the extended DaVinci board
Wireless SoC

[Block diagram: CPU, DSP and AVP clusters, each with its own bus and shared memory, plus an embedded DaVinci core and an L3 cache; a ring bus and a mesh bus connect to DDR, IO/SerDes/Ethernet, a queue manager, a CGRA, LDPC/Polar and FFT accelerators]
Alleviating the Memory Wall and the Slowdown of Moore's Law

Memory Wall & I/O Wall

Level                       | Bandwidth | Ratio    | Challenge
Execution engine (512 TOPS) | 2048 TB/s | 1        | Faster EUs can be built, but there is no way to feed them data
L0 memory                   | 2048 TB/s | 1/1      | Very wide datapath, hard to do scatter-gather; inner-loop data reuse
L1 memory                   | 200 TB/s  | 1/10     | Intra-kernel data reuse
L2 memory                   | 20 TB/s   | 1/100    | Inter-kernel data reuse
HBM memory                  | 1 TB/s    | 1/2000   | HBM size limits the memory footprint
Intra-node bandwidth        | 50 GB/s   | 1/40000  | Scaling up a node increases the memory footprint but is severely bandwidth constrained
Inter-node bandwidth        | 10 GB/s   | 1/200000 | Model parallelism across nodes is severely bandwidth constrained

Compiler innovation targets the on-chip levels (L0 to L2 data reuse); cluster innovation targets the HBM, intra-node and inter-node levels.
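A quick arithmetic check on the table, using only the numbers quoted above, shows why the on-chip reuse strategy has to deliver roughly three orders of magnitude of traffic reduction before data ever reaches HBM.

```python
# Rough check of the ~2000x gap between execution-engine and HBM bandwidth,
# using the figures from the table above.
eu_bandwidth_Bps  = 2048e12     # bytes/s the execution engine can consume
hbm_bandwidth_Bps = 1e12        # bytes/s HBM can supply
print("EU : HBM bandwidth ratio =", eu_bandwidth_Bps / hbm_bandwidth_Bps)   # ~2048x

# Equivalent view: operations each HBM byte must feed to sustain 512 TOPS.
peak_ops_per_s = 512e12
print("ops per HBM byte at peak:", peak_ops_per_s / hbm_bandwidth_Bps)      # 512
```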
Technology Challenges: Why Do We Need 3DIC?

• Memory wall: bounded by memory bandwidth and capacity
• Interface wall: interface I/O speed does not scale with the process node
• Logic wall: process scaling cannot satisfy the need for more die area (many chips have reached full reticle size in the 7+ process)
• Power wall: the slowdown of Dennard scaling causes power limits
• Thermal wall: ultimate system performance is thermally limited

• 3DIC can help alleviate the memory wall, I/O wall and logic wall.
• The search for new transistors can help with the power and thermal walls.
• Architecture innovation is needed to reach new ground despite all of these challenges.
AI Training SoC: Logic + 3D-SRAM + 12 HBM

[Package diagram: the AI SoC die and an I/O die with 3D-SRAM dies stacked below the SoC die, surrounded by 12 HBM2E stacks on an interposer over the substrate]

• Customized HBM2E with two stacks per site to increase HBM bandwidth
• Large 3D-SRAM serves as the AI core cache
Mobile AP: LoL + MoL

[Package diagrams for the three steps, each on the package board]

Step 1
• One logic die + 3D DRAM (3DM)
• 3DM + PoP LPDDR

Step 2
• Two logic dies + 3D DRAM
• PoP LPDDR

Step 3
• Multi-layer 3D DRAM (remove the PoP LPDDR)
• Multi-layer logic die
Physical Design of DaVinci AI Chips

Ascend Architecture

Ascend 310 (Ascend-Mini): high power efficiency
• Architecture: DaVinci
• FP16: 8 TFLOPS; INT8: 16 TOPS
• 16-channel video decode, 1-channel video encode (H.264/H.265)
• Power: 8 W
• Process: 12 nm

Ascend 910 (Ascend-Max): high computing density
• Architecture: DaVinci
• FP16: 256 TFLOPS; INT8: 512 TOPS
• 128-channel video decode (H.264/H.265)
• Power: 350 W
• Process: 7+ nm EUV

Comparison of Computing Density

Chip                 | Die area      | Peak compute
Tesla V100 (12 nm)   | 416 mm²       | 120 TFLOPS FP16
Google TPU v3 (? nm) | (undisclosed) | 105 TFLOPS bfloat16
Ascend 910 (7+ nm)   | 182.4 mm²     | 256 TFLOPS FP16
Ascend 910 AI Server

8 × Ascend 910 (DaVinci node) + x86 node

AI Server specification:
Node         | 8 × DaVinci (Ascend 910); 2 × Xeon CPU + 24 DIMMs
Performance  | 2 PFLOPS per chassis, 256 TFLOPS per AI module
Memory       | 24 DIMMs, up to 1.5 TB
Storage      | 6 × 2.5-inch NVMe, 24 TB; 2 × 2.5-inch SAS/SATA, RAID 1
Interface    | 8 × 100G fiber; 4 × PCIe IO
Power        | 6000 W
Ambient temp | 5~35 ℃
Ascend 910 Cluster
• 2048 nodes × 256 TFLOPS = 512 PFLOPS
• 1024~2048 node cluster

[Diagram: the Ascend cluster network connects AI servers; within each AI server, 8 DaVinci chips (D) sit on an Ascend 910 board with board-level interconnection]
Ascend 910 Die Shot
 A total of 8 dies are integrated in the package
 Two dummy dies are added to ensure mechanical uniformity
 Total silicon area: 456 + 168 + 96×4 + 110×2 = 1228 mm²
   (Vitruvian compute die 456 mm², Nimbus I/O die 168 mm², 4 × HBM at 96 mm² each, 2 × dummy dies at 110 mm² each)
Ascend 910 Floorplan
 A mesh NoC connects the 32 DaVinci cores in the Ascend 910 compute die
 The NoC provides 128 GB/s read bandwidth + 128 GB/s write bandwidth per core
 Ascend 910, 7 nm: the 8 TFLOPS tensor unit occupies 1.6 mm × 1.2 mm and the 128 GFLOPS vector unit 0.9 mm × 1 mm
 Inter-chip connections:
   3 × 240 Gbps HCCS ports for NUMA connections
   2 × 100 Gbps RoCE interfaces for networking

[Die photo with the tensor and vector units highlighted; compute die approximately 31.25 mm × 14.6 mm]
Ascend 910 NoC

1024-bit, 2 GHz NoC mesh
 Topology: 6 rows × 4 columns
 Access bandwidth to the on-chip L2 cache: 4 TB/s
 Access bandwidth to the off-chip HBM: 1.2 TB/s
 NoC bandwidth is fairly shared among the DaVinci cores (see the consistency check below)
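A quick consistency check on these figures, assuming the 32 cores share the mesh evenly (an assumption about the sharing policy, not a stated number):

```python
# Consistency check of the NoC figures, assuming even sharing across cores.
cores = 32
read_per_core_GBps = 128
print("aggregate L2 read bandwidth:",
      cores * read_per_core_GBps / 1000, "TB/s")            # ~4.1 TB/s, matches 4 TB/s

hbm_total_TBps = 1.2
print("HBM bandwidth per core:",
      hbm_total_TBps * 1000 / cores, "GB/s")                # 37.5 GB/s per core
```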
Ascend 310 Die Shot

Ascend 310, 12 nm: the tensor unit occupies 3.08 mm × 1.73 mm and the vector unit 1.6 mm × 1.35 mm

[Die photo with the tensor and vector units highlighted; die approximately 10.65 mm × 9.8 mm]
Kunpeng vs. Ascend

[Die shots of Ascend 310, Kunpeng 920 and Ascend 910 shown side by side for scale]
Future: DaVinci 3D-SRAM Floorplan

[3D SRAM architecture diagram: the logic die on top connects through BPV/TDV and a ~20 um dielectric layer, with TSVs and ubumps, to the memory die below, which sits on the interposer/substrate]
More Challenges…
• Generalized AutoML
• Efficiency for reinforcement learning and GNNs?
• Generalized methods for data/model/pipeline parallelism
• How to unify data precision?
• Finding the sweet-spot architecture
 Big chip vs. small chip
 Dense vs. sparse
 Out-of-memory vs. near-memory vs. in-memory computing
Thank you.

Bring digital to every person, home and
organization for a fully connected,
intelligent world.

Copyright©2018 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.

Huawei Confidential
