2019-07
Control, Process, Transfer, Store ……
Ubiquitous AI computation
[Figure: AI applications plotted by memory footprint (GB) against computation (TOPS). Labels: IoT, smart toy, smartphone, IP camera, embedded AI, robotics, drones, industrial, intelligent surveillance, autonomous driving, smart city, cloud AI inference, model training, algorithm discovery, meta-learning, Auto ML, general artificial intelligence; scalability.]
[Figure: DaVinci core variants ordered by computation (TOPS): D-Tiny, D-Lite, D-Mini, DaVinci. Application labels: wearable, phone, wireless, surveillance, edge training, cloud training.]
Rich Variety of Computing architectures in Huawei Portfolio
• Wide range of performance & efficiency
• CPU: General purpose
• GPU: Graphics
• NPU: DNN
• ISP: Camera sensor pipeline
• DSP: Camera post processing, AR
• VPU: Vision Processing Unit
• NP: Network Processor
• Each category represents a different PPA curve
Target: Search for Optimal PPA in Design Space
[Figure: design space plotted by performance and power.]
Architecture Overview of DaVinci
Building Blocks and their Computation Intensity
1D Scalar Unit (full flexibility) + 2D Vector Unit (rich & efficient operations) + 3D Matrix Unit (high intensity)

N     N^2     N^3
1     1       1
2     4       8
4     16      64
8     64      512
16    256     4096
32    1024    32768
64    4096    262144

                               GPU + Tensor core    AI core + SRAM
Area (normalized to 12 nm)     5.2 mm^2             13.2 mm^2
Compute power                  1.7 Tops fp16        8 Tops fp16
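The N / N^2 / N^3 scaling above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `ops_per_step` and the dictionary keys are made up for this example and are not part of any DaVinci tooling.

```python
# Illustrative only: work per step for a 1D, 2D and 3D unit of edge length n.
# `ops_per_step` and its keys are hypothetical names for this sketch.

def ops_per_step(n: int) -> dict:
    """Return the number of operations a unit of edge length n covers."""
    return {"scalar_1d": n, "vector_2d": n ** 2, "cube_3d": n ** 3}

# Rebuild the intensity table for the edge lengths listed on the slide.
table = [(n, n ** 2, n ** 3) for n in (1, 2, 4, 8, 16, 32, 64)]
for n, n2, n3 in table:
    print(f"{n:>3} {n2:>6} {n3:>8}")

# A 16^3 cube, the size named for the DaVinci core, covers 4096 products.
cube_16 = ops_per_step(16)["cube_3d"]
```

The cubic column is why a matrix unit reaches a much higher intensity than a vector unit of the same edge length.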
DaVinci Core (16^3 Cube)
[Block diagram of the DaVinci Core. Recoverable components:
• Cube: 16^3 Cube, Buffer A L0 (64KB), Buffer B L0 (64KB), Buffer C L0 (256KB), accumulator, Accum DFF, A/B DFF, 16^2
• Conversion and activation: FP16->FP32, FP32->FP16, ReLU
• Vector: 8*16 Vector Unit, Unified Buffer (256KB)
• Scalar: Scalar Unit / AGU / Mask Gen, GPR, SPR
• Memory: L1 Buffer (1MB), L1 DMAC, MTE (decomp, img2col, trans)
• Datapaths: L0 load 4096-bit, L1 load 2048-bit, L/S 2048-bit, L2 access 1024-bit *2
• Control: I Cache (32KB), Scalar PSQ, Instr. Dispatch, Cube Queue, Vector Queue, MTE Queue, Event Sync, System Control, Config Port, BIU]
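[Figure: per-layer chart. Only the axis tick labels survive; they are ResNet layer names: res2b_br2a, res2c_br2c, res3b_br2a, res3c_br2c, res4a_br2a, res4b_br2c, res4d_br2b, res4f_br2a, res5a_br2b, res5c_br2a.]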
Resource Matching ---- Memory Hierarchy
Level 3 Compiler (mathematical programming model): TVM/XLA, TBE
Level 2 Compiler (parallel/kernel programming model): Cuda/OpenCL, TIK
[Figure: execution timeline over time steps 0 to 9. Labels: Dispatch Unit, global Barrier.]
• Typical sequential DaVinci code is a combination of nested FOR loops.
• Software multi-threading can be added to any FOR loop body (iterator kernel).
[Figure: Programmer view of multi-thread]
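As a loose analogy for adding threads to a FOR loop body, the sketch below writes a small matrix multiply as nested loops and then hands the outer loop body to a thread pool. It uses ordinary Python threads, not TIK or any DaVinci programming interface, and the names `row_kernel`, `matmul_sequential` and `matmul_threaded` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical helper: the body of the outer FOR loop, i.e. one output row.
def row_kernel(args):
    a_row, b = args
    cols = len(b[0])
    # Inner nested loops: over output columns j and the reduction index k.
    return [sum(a_row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]

def matmul_sequential(a, b):
    # Plain nested FOR loops: the outer loop walks the rows of a.
    return [row_kernel((row, b)) for row in a]

def matmul_threaded(a, b, workers=4):
    # Same loop body, but each outer iteration is submitted to a thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row_kernel, [(row, b) for row in a]))

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
seq = matmul_sequential(a, b)
par = matmul_threaded(a, b)
```

Both versions return the same result because the loop iterations are independent; only the scheduling of the loop body changes.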
Advanced Compiler Techniques
• Architecture-independent lowering process: DSL → C → Binary
• Traversal order determines data reuse factor
• Millions of legitimate mappings
• Find the optimal mapping to bridge the 2,000x memory bandwidth gap
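To illustrate how traversal order changes the data reuse factor, the sketch below counts operand fetches for a tiled matrix multiply under all six loop orders. The model is a deliberate simplification and an assumption of this example, not the DaVinci compiler's cost model: the buffer holds only the most recent A tile and B tile, and traffic for the output tile is ignored. The function name `reuse_factor` is hypothetical.

```python
from itertools import permutations, product

def reuse_factor(n_tiles: int, tile: int, order: str) -> float:
    """MACs per fetched input element for a given loop order such as 'ijk'.

    order[0] is the outermost loop and order[2] the innermost.
    Assumption: only the last A tile and last B tile stay buffered.
    """
    last_a = last_b = None
    loads = 0
    for idx in product(range(n_tiles), repeat=3):
        pos = dict(zip(order, idx))
        a_tile = (pos["i"], pos["k"])  # A is indexed by (row, reduction)
        b_tile = (pos["k"], pos["j"])  # B is indexed by (reduction, column)
        if a_tile != last_a:
            loads += tile * tile
            last_a = a_tile
        if b_tile != last_b:
            loads += tile * tile
            last_b = b_tile
    macs = (n_tiles * tile) ** 3
    return macs / loads

# A 64x64x64 multiply split into 16x16 tiles, evaluated for every loop order.
results = {"".join(p): reuse_factor(4, 16, "".join(p)) for p in permutations("ijk")}
for order, factor in sorted(results.items()):
    print(order, factor)
```

Even in this toy model the loop orders do not all give the same reuse factor, and real mappings add tile shapes, buffer levels and fusion choices on top, which is where the very large number of legitimate mappings comes from.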
[Diagram: SoC with a task scheduler and two 64-bit LPDDR4x interfaces, each connected to an LPDDR4 chip.]
AI Training SoC
Ascend 910
[Block diagram: Vitruvian die and Nimbus V3 die with on-chip HBM dies. Blocks shown: DaVinci AI Cores, Taishan MP4 clusters, DVPP, DMA engines, HBM 2.0, IO subsystem, HAC subsystem, network subsystem, CHIE NOC (Mesh). Index ranges shown in the diagram: 0~1, 0~3, 0~7, 8~15, 0~15.]
[Diagram: ring bus and mesh bus connecting DDR, IO/Serdes/ETH, Queue Manager, CGRA, LDPC/Polar and FFT blocks.]
Alleviate Memory Wall and Slowdown of Moore's Law
Memory Wall & I/O Wall

Level                          Bandwidth    Bandwidth ratio   Challenges
Execution Engine (512 TOPS)    2048 TB/s    1                 Can build faster EU, but no way to feed data
L0 Memory                      2048 TB/s    1/1               Very wide datapath, hard to do scatter-gather; inner-loop data reuse
L1 Memory                      200 TB/s     1/10              Intra-kernel data reuse
L2 Memory                      20 TB/s      1/100             Inter-kernel data reuse

Side labels on the slide: Compiler Innovation, Cluster Innovation
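The bandwidth ratios in the table translate directly into how much data reuse each memory level must supply to keep the execution engine fed. The sketch below derives that factor from the listed bandwidths; the dictionary name `bandwidth_tb_s` and the variable `required_reuse` are illustrative names, and the calculation assumes the engine consumes data at its full 2048 TB/s.

```python
# Bandwidths as listed in the table, in TB/s.
bandwidth_tb_s = {
    "Execution Engine": 2048,
    "L0 Memory": 2048,
    "L1 Memory": 200,
    "L2 Memory": 20,
}

engine = bandwidth_tb_s["Execution Engine"]

# Each byte fetched from a level must be used this many times on average
# for that level to sustain the engine (assumption: engine runs at peak).
required_reuse = {
    level: engine / bw
    for level, bw in bandwidth_tb_s.items()
    if level != "Execution Engine"
}

for level, factor in required_reuse.items():
    print(f"{level}: about {factor:.1f}x reuse needed")
```

The computed factors (1, about 10, about 100) line up with the 1/1, 1/10 and 1/100 ratios on the slide, which pairs each level with the kind of reuse expected to close the gap: inner-loop, intra-kernel and inter-kernel.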
[Figure: 3DIC at the center, surrounded by four walls]
• Interface wall: interface IO speed does not scale with process node.
• Logic wall: process scaling cannot satisfy the need for more die area (many chips have reached full reticle size in the 7+ process).
• Thermal wall: ultimate system performance is thermal limited.
• Power wall: the Dennard scaling slowdown causes power limits.

• 3DIC can help alleviate the memory wall, IO wall and logic wall.
• The search continues for a new transistor to help with the power and thermal walls.
• Architecture innovation is needed to reach new ground despite all these challenges.
AI Training SoC: Logic + 3DSRAM + 12 HBM
3DSRAM dies stacked below SOC die
[Package diagrams. Labels: HBM2E, 3D-SRAM, interposer, substrate; HBM, 3DM, logic, package board.]
Step 1
• One logic die + 3D DRAM
• 3DM + POP LPDDR
Step 2
• Two logic dies + 3D DRAM
• POP LPDDR
Step 3
• Multi-layer 3D DRAM (remove POP LPDDR)
• Multi-layer logic die
Physical Design of Davinci AI Chips
Ascend Architecture
Ascend-Mini Ascend-Max
Architecture: DaVinci Architecture: DaVinci
Chip            Process    Area             Peak compute
Tesla V100      12 nm      416 mm²          120 TFlops fp16
Google TPUv3    ? nm       (undisclosed)    105 TFlops bfloat16
Ascend910       7+ nm      182.4 mm²        256 TFlops fp16
Ascend 910 AI Server
X86 Node
Ascend 910 Cluster
• 2048 Node x 256TFlops = 512 Peta Flops
• 1024~2048 Node Cluster
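The cluster figure can be checked from numbers stated elsewhere in the deck: 32 DaVinci cores per Ascend 910, 8 TFlops of tensor compute per core, and up to 2048 nodes. The sketch below is just that arithmetic; the constant names are chosen for this example. Note that the slide's "512 Peta Flops" matches a factor of 1024 TFlops per PFlops, while a decimal factor of 1000 gives a slightly larger number.

```python
CORES_PER_CHIP = 32      # DaVinci cores on the Ascend 910 compute die
TFLOPS_PER_CORE = 8      # fp16 tensor throughput per core
NODES = 2048             # upper end of the stated cluster size

chip_tflops = CORES_PER_CHIP * TFLOPS_PER_CORE    # per-chip peak
cluster_tflops = NODES * chip_tflops              # whole-cluster peak

# The slide's 512 PFlops corresponds to dividing by 1024, not 1000.
cluster_pflops_binary = cluster_tflops / 1024
cluster_pflops_decimal = cluster_tflops / 1000

print(chip_tflops, cluster_tflops, cluster_pflops_binary, cluster_pflops_decimal)
```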
[Diagram: Ascend cluster. AI servers are connected through the cluster network; within each AI server, two rows of eight DaVinci (D) chips are joined by board-level interconnection.]
Ascend910 Board
Ascend910 Die Shot
Total 8 Dies integrated
Two dummy dies are added to ensure
mechanical uniformity
Total size: 456 + 168 + 96x4 + 110x2 = 1228 mm²
• Vitruvian: 456 mm²
• Nimbus: 168 mm²
• HBM: 96 mm² each
• Dummy die: 110 mm² each
[Die shot dimension callouts: 0.72 mm, 2 mm.]
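The total die count and area follow from the per-die figures on this slide. A short check, with the dictionary name `dies` chosen for this example:

```python
# (area in mm^2, count) for each die type named on the slide.
dies = {
    "Vitruvian": (456, 1),
    "Nimbus": (168, 1),
    "HBM": (96, 4),
    "Dummy": (110, 2),
}

total_dies = sum(count for _, count in dies.values())
total_area_mm2 = sum(area * count for area, count in dies.values())

print(total_dies, total_area_mm2)
```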
Ascend910 Floorplan
Mesh NoC connects 32 Davinci Cores in
the Ascend 910 Compute Die
NoC provides Read Bandwidth of
128GBps + Write Bandwidth of 128GBps
Ascend910 (7nm), per core
• Tensor: 8 TFlops, 1.6mm * 1.2mm
• Vector: 128 GFlops, 0.9mm * 1mm
Ascend310 (12nm)
• Tensor: 3.08mm * 1.73mm
• Vector: 1.6mm * 1.35mm
[Floorplan figures. Labels: inter-chip connections, Ascend910 NoC, Tensor, Vector. Dimension callouts: 31.25mm, 14.6mm, 3mm, 1.9mm, 10.65mm, 9.8mm, 4.55mm, 3.68mm.]
Kunpeng Vs Ascend
Ascend310
Kunpeng920
Ascend910
Future: Davinci 3DSRAM Floorplan
3D SRAM Architecture Diagram
[Diagram: 3D SRAM stack on an interposer/substrate. Labels: logic, memory, BPV, TDV, TSV, dielectric, ubump, ~20 um.]
More Challenges…
• Generalized Auto ML
• Efficiency for reinforcement learning, GNN?
• Generalized method for Data/Model/Pipeline parallelism
• How to unify data precision?
• Finding the sweet spot architecture
Big chip vs Small chip
Dense vs Sparse
Out of memory, near memory, in memory
Thank you.
Bring digital to every person, home and organization for a fully connected, intelligent world.
Huawei Confidential