Professional Documents
Culture Documents
Ten Lessons From Three Generations Shaped Google S Tpuv4i
Ten Lessons From Three Generations Shaped Google S Tpuv4i
Published in 2021 at the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
3
What is a TPU?
4
How did the TPU design evolve? 2.3x better
perf/TDP
5
Outline
6
Summary of the 10 Lessons
7
① Logic, wires, SRAM, & DRAM improve unequally
Energy per operation [pJ] 45nm vs 7nm
8
② Leverage prior compiler optimizations
10
Summary of the 10 Lessons
11
④ Support Backwards ML Compatibility
12
⑤ Inference DSAs need air cooling for global scale
TPU version TDP Cooling Peak TFLOPS/Chip
13
⑥ Some inference apps need floating point arithmetic
14
Summary of the 10 Lessons
15
⑦ Production inference normally needs multi-tenancy
16
Lessons Regarding DNN Application evolution
⑧ DNNs grow ~1.5x/year in memory and compute
Google ML Workload
18
Outline
19
Compiler compatibility
High-Level Operations
● Compiler compatibility
○ XLA compiler separates: LLO LLO LLO
■ High-Level Operations (HLO)
● Hardware agnostic TPUv2 TPUv3 TPUv4i
■ Low-Level Operations (LLO)
● Hardware dependent
○ Maintains compiler compatibility ②
20
On-chip storage & DMA
21
Custom on-chip interconnect (OCI)
22
Arithmetic unit & TDP
23
ICI scaling & Workload analysis
● 2 ICI links
○ 4 chips/board access nearby memory
quickly for future DNN growth ⑧
● Extensive tracing a HW
performance counters included
○ Analyze system-level bottlenecks
○ Increased Design Time but worth it
since target is perf/TCO, not
perf/CapEx ③
○ System-level performance
improvements (compiler) ②
○ Boost developer productivity ⑦ ⑧ ⑨
24
Outline
25
Performance/Watt
Breakdown:
● CMEM ~1.5x
● 7nm ~1.3x
● Other contributions ~1.2x
26
Strengths
architecture decisions
27
Weaknesses
28
Key Takeaways and insights
29
Discussion and Questions
30
Discussion and Questions
31
Discussion and Questions
32
As Moore's law reaches plateau and Dennard scaling finishes, what
are the next steps to keep pace with growing DNN models?
33
Is compiler, ML and hardware backwards compatibility restricting
innovation in production hardware?
34
Is sacrificing flexibility, that Google seeks in its TPU hardware,
a viable approach in extracting more performance from
current hardware?
35
Thank you to my advisors
36
Backup Slides: Roofline model
37
Backup Slides: CMEM size
38
Comparison to NVIDIA T4
39
Backup Slides: 10 Lessons summary
40