Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
https://efficientml.ai
@SongHan_MIT
[Figure: Model size (#params, in billions) of language models vs. GPU memory, 2017-2022. Models grow from Transformer (0.05B), GPT (0.11B), BERT (0.34B), GPT-2 (1.5B), MegatronLM (8.3B), T-NLG (17B), and GPT-3 (175B) to MT-NLG (530B), while GPU memory grows only from 16GB (TPUv2, V100) to 32GB (TPUv3, V100) and 40GB/80GB (A100), assuming FP16 data. Model compression bridges the gap.]
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (ICML 2023)
Model Compression and Efficient AI
Bridges the Gap between the Supply and Demand of AI Computing
“Deep Compression” and EIE opened a new opportunity to build hardware accelerators for sparse and compressed neural networks (a minimal pruning sketch follows the chart below).
[Figure: Number of pruning/sparsity publications per year, 1989-2022 (0 to ~2,400), rising sharply in recent years; annotated milestones: Optimal Brain Damage, Deep Compression, EIE.]
Source: https://github.com/mit-han-lab/pruning-sparsity-publications
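To make the pruning idea concrete, here is a minimal magnitude-pruning sketch in PyTorch. It is an illustration only, not the course lab code; the layer shape and the 80% sparsity ratio are arbitrary choices. The idea: keep the largest-magnitude weights and zero out the rest.

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps the largest-magnitude weights."""
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight)
    # threshold = magnitude of the num_prune-th smallest weight
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    return (weight.abs() > threshold).float()

# Example: prune 80% of the weights of a random linear layer.
layer = torch.nn.Linear(512, 512)
mask = magnitude_prune(layer.weight.data, sparsity=0.8)
layer.weight.data *= mask  # re-apply the same mask after each fine-tuning step

Deep Compression combines this kind of pruning with quantization and Huffman coding; the labs later in the course build these pieces step by step.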
Deep Learning is Everywhere
But deep learning models are computationally costly.
How can we make them lightweight and fast?
Deep Learning for Image Classification
DNNs achieve super-human classification accuracy on ImageNet
[Figure, left: ImageNet classification error rate by competition year, falling from 16.4% (AlexNet, 2012) through 6.7% (GoogLeNet, 2014) and 3.6% (ResNet, 2015) to 2.3% (SENet, 2017), surpassing the human level of 5.1%. Right: ImageNet top-1 accuracy (%) vs. MACs (billions, 0-9) for models including InceptionV2/V3, Xception, ResNet-50/101, ResNeXt-50/101, DenseNet-121/169/264, and DPN-92, spanning roughly 69-79% top-1 accuracy.]
Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
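The x-axis metric, MACs (multiply-accumulate operations), can be counted directly from layer shapes. Below is a rough sketch for a single 2D convolution; the helper is written for this note (not from the course materials), with the output size and grouping taken as inputs you would read off the network.

def conv2d_macs(in_channels, out_channels, kernel_size, out_h, out_w, groups=1):
    """Multiply-accumulate operations of one Conv2d layer (bias ignored)."""
    macs_per_output = (in_channels // groups) * kernel_size * kernel_size
    return macs_per_output * out_channels * out_h * out_w

# e.g. a 7x7 stride-2 conv from 3 to 64 channels on a 224x224 input (112x112 output):
print(conv2d_macs(3, 64, 7, 112, 112))  # ~118 million MACs

Summing this over all layers gives the "MACs (Billion)" numbers plotted above.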
Efficient Image Classification
Neural architecture search reduces the computational cost (14x less computation)
[Figure: ImageNet top-1 accuracy (%) vs. MACs (billions; the lower the better) and model size (2M-64M, log scale), comparing handcrafted models (e.g., MobileNetV1/V2, ShuffleNet, InceptionV2/V3, Xception, ResNet-50/101, ResNeXt-50, DPN-92, IGCV3-D) with AutoML-designed models (e.g., NASNet-A, PNASNet, DARTS, EfficientNet). Once-for-All (ours) reaches 80.0% top-1 accuracy at 595M MACs; the higher the accuracy, the better.]
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
Efficient Image Classification
Efficient deep learning enables daily-life applications on mobile phones
Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation [Wang et al., CVPR 2022]
Recognizing People in Photos Through Private On-Device Machine Learning [Apple, 2021]
Efficient Image Recognition on MCUs
MCUNet enables TinyML on IoT devices
(ARM Cortex-M7 + OpenMV Cam)
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
Efficient On-Device Training on Edge
AI systems need to continually adapt to new data collected locally
[Figure: cloud-based learning vs. on-device learning.]
SAM runs at only 12 images per second due to its large vision transformer model (ViT-Huge).
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [Cai et al., ICCV 2023]
Efficient Prompt Segmentation
EfficientViT accelerates Segment Anything Model by 70 times
EfficientViT: 842 images/s vs. SAM (ViT-Huge): 12 images/s
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [Cai et al., ICCV 2023]
From Discriminative Models to Generative Models
Diffusion models create realistic images from a natural language description
Prompts: “Teddy bears mixing sparkling chemicals as mad scientists” · “A bowl of soup as a planet in the universe, as a 1960s poster” · “A photo of an astronaut riding a horse on Mars”
[Figure: GAN Compression reduces computation by 9x-21x across three benchmarks: Horse→zebra (CycleGAN, Zhu et al.), Edge→shoes (Pix2pix, Isola et al.), and Cityscapes (GauGAN, Park et al.); e.g., 3.50G MACs (16.2x reduction), 40.0 FPS (3.3x speedup), FID 53.6. Measured on an NVIDIA Jetson Xavier GPU; lower FID indicates better performance.]
GAN Compression: Efficient Architectures for Interactive Conditional GANs [Li et al., CVPR 2020]
Efficient Image Generation
• Generative Adversarial Networks (GANs) are computationally heavy and slow
• Anycost GAN: train once, then run the generator at flexible costs for interactive image synthesis and editing
Anycost GANs for Interactive Image Synthesis and Editing [Lin et al., CVPR 2021]
Efficient Image Generation
SIGE accelerates Stable Diffusion by >4X with spatial sparsity
Prompts: “A photograph of a horse on a grassland.” · “A fantasy beach landscape, trending on artstation.”
FastComposer: https://fastcomposer.hanlab.ai/
Efficient Image Generation
FastComposer achieves tuning-free multi-subject image generation
DreamFusion: https://dreamfusion3d.github.io/
Video Generation
Diffusion models create realistic videos from a natural language description
“Video modeling is a harder task for which performance is not yet saturated at 5.6B model size”
Imagen Video: https://imagen.research.google/video/
From 2D Vision to 3D Vision
Deep learning helps machines perceive the surrounding environment
Efficient and Robust LiDAR-Based End-to-End Navigation [Liu et al., ICRA 2021]
Efficient 3D Perception
BEVFusion supports efficient multi-task multi-sensor fusion
[Figure legend: Car, Truck, Pedestrian, Barrier, Drivable Area, Lane Divider, Walkway, Crosswalk]
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [Liu et al., arXiv 2022]
Deep Learning is Everywhere
ChatGPT and Large Language Model
Large language models produce human-like text based on past conversations
ChatGPT: https://chat.openai.com/
Code Generation
GitHub Copilot can make meaningful coding suggestions based on context
Conversation Translation (on iPhone)
Google Translate: https://translate.google.com/
Efficient Neural Machine Translation
Lite Transformer reduces the model size with pruning and quantization
[Figure: Model size (MB) and BLEU score for the Transformer baseline (176 MB) vs. Lite Transformer (about 2.5x smaller), +Quantization, and +Quantization & Pruning (9.7 MB, an 18.2x overall reduction), with BLEU roughly preserved (around 39.5-39.9).]
Lite Transformer with Long-Short Range Attention [Wu et al., ICLR 2020]
Large Language Models
Large language models show emergent behaviors: zero-shot/few-shot learning
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
Large Language Models
Large language models show emergent behaviors: chain-of-thought
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
Large Language Models
But these capabilities come at the cost of very large model sizes
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
Large Language Models
Model size of language models is growing exponentially
[Figure: Model size (#params, in billions) vs. single-GPU memory, 2017-2022: Transformer 0.05B, GPT 0.11B, BERT 0.34B, GPT-2 1.5B, MegatronLM 8.3B, T-NLG 17B, GPT-3 175B, MT-NLG 530B, while GPU memory grows only from 16GB (TPUv2, V100) to 32GB (TPUv3, V100) and 40GB/80GB (A100); data assumed to be FP16.]
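A quick back-of-the-envelope calculation (assuming FP16, i.e., 2 bytes per parameter, and counting weights only, ignoring activations and the KV cache) shows why the curves diverge:

GIB = 1024 ** 3

def weight_memory_gb(num_params_billion, bytes_per_param=2):
    """Memory to store the weights alone, FP16 (2 bytes/param) by default."""
    return num_params_billion * 1e9 * bytes_per_param / GIB

for name, params_b in [("GPT-2", 1.5), ("GPT-3", 175), ("MT-NLG", 530)]:
    print(f"{name}: {weight_memory_gb(params_b):.0f} GB")
# GPT-2: ~3 GB, GPT-3: ~326 GB, MT-NLG: ~987 GB -- far beyond a single 80 GB A100.

GPT-2 fits comfortably on one GPU, while GPT-3 and MT-NLG exceed even an 80 GB A100 by a wide margin; this is the gap that model compression must bridge.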
- Deploying LLMs on the edge is useful: running copilot services (code completion, office, game chat) locally on laptops, cars, robots, and more. These devices are resource-constrained, low-power, and sometimes have no Internet access.
- Data privacy is important: users do not want to share personal data with large companies.
Images are generated by Midjourney
Efficient Large Language Models
- LLMs are too large to fit into the memory of edge devices. We enable edge deployment of LLMs through quantization: SmoothQuant and AWQ (a simplified sketch follows below).
- TinyChatEngine implements the compressed inference; it is built from scratch in C/C++ and is easy to install and port to edge platforms.
TinyChatEngine: https://github.com/mit-han-lab/TinyChatEngine
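As a rough illustration of the kind of low-bit weight-only quantization involved, here is a plain group-wise INT4 quantizer in PyTorch. It is a sketch only: real AWQ additionally applies activation-aware per-channel scaling before quantizing, and SmoothQuant targets W8A8 rather than 4-bit weights; the tensor shape and group size below are arbitrary assumptions.

import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit quantization of an [out, in] weight matrix, one scale/zero-point per group."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = (wg / scale + zero).round().clamp(0, 15)     # values an INT4 kernel would store
    w_hat = ((q - zero) * scale).reshape(out_features, in_features)
    return q.to(torch.uint8), scale, zero, w_hat

w = torch.randn(4096, 4096)
q, scale, zero, w_hat = quantize_int4_groupwise(w)
print((w - w_hat).abs().mean())  # small reconstruction error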
TinyChat on Orin
Running LLMs on resource-constrained edge GPUs
Vision-Language Models
LLaVA achieves general-purpose visual and language understanding
LLaVA: https://llava-vl.github.io/
Efficient Vision-Language Models
AWQ quantizes vision-language models to 4 bits with high quality
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Lin et al., 2023]
Vision-Language-Action Models
Robotics transformers control robots based on language instructions
They run at only 3 Hz due to the high computational cost and networking latency.
RT-1: https://robotics-transformer1.github.io/
Deep Learning for Games
AlphaGo masters the game of Go with DNNs & tree search
Compute: 1920 CPUs and 280 GPUs ($3000 electric bill per game)
AlphaGo: https://www.deepmind.com/research/highlighted-research/alphago
Deep Learning for Scientific Discovery
AlphaFold reveals the structure of the protein universe
[Figure: experimental result vs. computational prediction. AlphaFold (Nature 2021).]
Architectural support for quantization/pruning brings tremendous improvement
FP32 => FP16 => Int8; dense => sparse
[Figure: GPU architectural support: integer matrix multiplication and accumulation, half-precision matrix multiplication and accumulation, and 8-bit integer 4-element vector dot product.]
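A minimal NumPy sketch of what such integer pipelines compute: quantize both operands to INT8, multiply and accumulate in INT32 (the role of the matrix and dot-product units above), and rescale the result once at the end. The shapes and the per-tensor symmetric scheme are assumptions for illustration.

import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: x ≈ scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a, b = rng.standard_normal((64, 256)), rng.standard_normal((256, 32))
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
# Integer multiplies accumulated in INT32, then one floating-point rescale at the end.
acc_int32 = qa.astype(np.int32) @ qb.astype(np.int32)
c_approx = acc_int32.astype(np.float32) * (sa * sb)
print(np.abs(c_approx - a @ b).mean())  # small error vs. the FP32 result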
The software cost dominates the cost breakdown of advanced technology nodes [source].
We focus on designing new algorithms and software for efficient computing.
GPU (SXM)   Dense FP16 Perf (TOPS)   Memory Bandwidth (GB/s)   Power (W)   Memory (GB)
P100        18.7                     732                       250         16
V100        125                      900                       300         32
A100        312                      2,039                     400         80
H100        990                      3,430                     700         96
https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors
Edge AI Hardware
Apple Neural Engine
• The Apple Neural Engine (ANE) is an energy-efficient and high-throughput engine
for ML inference on Apple silicon.
NanoReview. https://nanoreview.net/en
Edge AI Hardware
Nvidia Jetson
• NVIDIA Jetson is a complete System on Module (SOM) that includes a GPU, CPU,
memory, power management, high-speed interfaces, and more.
[Figure: Jetson module comparison (Nano, TX2, Xavier NX, AGX Xavier, AGX Orin 32GB, AGX Orin 64GB): AI performance grows from 0.5 TOPS (Nano) and 1.3 TOPS (TX2) to 21 TOPS (Xavier NX), 32 TOPS (AGX Xavier), and 200+ TOPS (AGX Orin), with memory and power budgets growing accordingly.]
NanoReview. https://connecttech.com/jetson/jetson-module-comparison/
Edge AI Hardware
Tensor Processing Unit
• Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated
circuit (ASIC) developed by Google for neural network machine learning, using Google's
own TensorFlow software.
Current Landscape of Deep Learning
Today's deep learning relies on big computation, heavy engineering, and a lot of data; the goal of efficient deep learning is to achieve the same with less.
[Diagram: Efficient Training (gradient compression, on-device training, federated learning); Application-Specific Optimizations (LLM, AIGC, video, point-cloud).]
Course Overview
• Efficient Inference: pruning, quantization, neural architecture search, distillation
• Efficient Training: gradient compression, on-device training, federated learning
• Application-Specific Optimizations: LLM, AIGC, video, point-cloud
The topics span the System and Algorithm levels and sit at the intersection of EE, CS, and AI+D.
Course Overview
Computation Structures Computer System Architecture
(6.1910[6.004]), pre-req Efficient Inference (6.5900 [6.823])
Hardware Architecture for pruning, quantization, neural Software Performance Engineering
Deep Learning EE architecture search, distillation
CS (6.1060[6.172])
(6.5930[6.825]) Mobile and Sensor Computing
Microcomputer Project Lab (6.1820[6.808])
(6.2060[6.115])
Efficient Training
System gradient compression, on-device Algorithm
training, federated learning
Application-Specific
Optimizations
LLM, AIGC, video, point-cloud
• 23 Lectures (90min)
• 1 Guest Lecture
• 5 Lab Assignments
• 1 Final Project:
- proposal
- presentation + demo
- written report
Pruning
Quantization
Knowledge Distillation
Part II: Application-Specific Optimizations
Distributed Training
On-Device Learning
TinyEngine on MCU
Lab 0 — Getting Started with PyTorch
Lab 1 — Pruning
Lab 2 — Quantization
Large Language Model
Generative AI
Driver Assistance
TinyML System