Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
https://efficientml.ai
@SongHan_MIT
[Figure: Model size (#params, in billions) of language models vs. GPU memory, 2017-2022. Models grow from Transformer (0.05B), GPT (0.11B), BERT (0.34B), GPT-2 (1.5B), MegatronLM (8.3B), T-NLG (17B), and GPT-3 (175B) to MT-NLG (530B), while GPU memory grows only from 16GB (TPUv2, V100) to 32GB (TPUv3, V100) and 40GB/80GB (A100), assuming FP16 data. Model compression bridges the gap.]
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (ICML 2023)
Model Compression and Efficient AI
Bridges the Gap between the Supply and Demand of AI Computing
“Deep Compression” and EIE opened a new opportunity to build hardware accelerators for sparse and compressed neural networks (a minimal pruning sketch follows the chart below).
[Figure: Number of pruning/sparsity publications per year, 1989-2022 (0 to ~2,400), rising sharply in recent years; annotated milestones: Optimal Brain Damage, Deep Compression, EIE.]
Source: https://github.com/mit-han-lab/pruning-sparsity-publications
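To make the pruning idea concrete, here is a minimal magnitude-pruning sketch in PyTorch. It is an illustration only, not the course lab code; the layer shape and the 80% sparsity ratio are arbitrary choices. The idea: keep the largest-magnitude weights and zero out the rest.

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps the largest-magnitude weights."""
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight)
    # threshold = magnitude of the num_prune-th smallest weight
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    return (weight.abs() > threshold).float()

# Example: prune 80% of the weights of a random linear layer.
layer = torch.nn.Linear(512, 512)
mask = magnitude_prune(layer.weight.data, sparsity=0.8)
layer.weight.data *= mask  # re-apply the same mask after each fine-tuning step

Deep Compression combines this kind of pruning with quantization and Huffman coding; the labs later in the course build these pieces step by step.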
Deep Learning is Everywhere
But deep learning models are computationally costly.
How can we make them lightweight and fast?
Deep Learning for Image Classification
DNNs achieve super-human classification accuracy on ImageNet
[Figure, left: ImageNet classification error rate by competition year, falling from 16.4% (AlexNet, 2012) through 6.7% (GoogLeNet, 2014) and 3.6% (ResNet, 2015) to 2.3% (SENet, 2017), surpassing the human level of 5.1%. Right: ImageNet top-1 accuracy (%) vs. MACs (billions, 0-9) for models including InceptionV2/V3, Xception, ResNet-50/101, ResNeXt-50/101, DenseNet-121/169/264, and DPN-92, spanning roughly 69-79% top-1 accuracy.]
Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
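The x-axis metric, MACs (multiply-accumulate operations), can be counted directly from layer shapes. Below is a rough sketch for a single 2D convolution; the helper is written for this note (not from the course materials), with the output size and grouping taken as inputs you would read off the network.

def conv2d_macs(in_channels, out_channels, kernel_size, out_h, out_w, groups=1):
    """Multiply-accumulate operations of one Conv2d layer (bias ignored)."""
    macs_per_output = (in_channels // groups) * kernel_size * kernel_size
    return macs_per_output * out_channels * out_h * out_w

# e.g. a 7x7 stride-2 conv from 3 to 64 channels on a 224x224 input (112x112 output):
print(conv2d_macs(3, 64, 7, 112, 112))  # ~118 million MACs

Summing this over all layers gives the "MACs (Billion)" numbers plotted above.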
Efficient Image Classification
Neural architecture search reduces the computational cost (14x less computation)
[Figure: ImageNet top-1 accuracy (%) vs. MACs (billions; the lower the better) and model size (2M-64M, log scale), comparing handcrafted models (e.g., MobileNetV1/V2, ShuffleNet, InceptionV2/V3, Xception, ResNet-50/101, ResNeXt-50, DPN-92, IGCV3-D) with AutoML-designed models (e.g., NASNet-A, PNASNet, DARTS, EfficientNet). Once-for-All (ours) reaches 80.0% top-1 accuracy at 595M MACs; the higher the accuracy, the better.]
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
Efficient Image Classification
Efficient deep learning enables daily-life applications on mobile phones
Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation [Wang et al., CVPR 2022]
Recognizing People in Photos Through Private On-Device Machine Learning [Apple, 2021]
Efficient Image Recognition on MCUs
MCUNet enables TinyML on IoT devices
(ARM Cortex-M7 + OpenMV Cam)
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
Efficient On-Device Training on Edge
AI systems need to continually adapt to new data collected locally
[Figure: cloud-based learning vs. on-device learning.]
SAM runs at only 12 images per second due to its large vision transformer model (ViT-Huge).
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [Cai et al., ICCV 2023]
Efficient Prompt Segmentation
EfficientViT accelerates Segment Anything Model by 70 times
EfficientViT: 842 images/s vs. SAM (ViT-Huge): 12 images/s
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [Cai et al., ICCV 2023]
From Discriminative Models to Generative Models
Diffusion models create realistic images from a natural language description
Prompts: “Teddy bears mixing sparkling chemicals as mad scientists” · “A bowl of soup as a planet in the universe, as a 1960s poster” · “A photo of an astronaut riding a horse on Mars”
[Figure: GAN Compression reduces computation by 9x-21x across three benchmarks: Horse→zebra (CycleGAN, Zhu et al.), Edge→shoes (Pix2pix, Isola et al.), and Cityscapes (GauGAN, Park et al.); e.g., 3.50G MACs (16.2x reduction), 40.0 FPS (3.3x speedup), FID 53.6. Measured on an NVIDIA Jetson Xavier GPU; lower FID indicates better performance.]
GAN Compression: Efficient Architectures for Interactive Conditional GANs [Li et al., CVPR 2020]
Efficient Image Generation
• Generative Adversarial Networks (GANs) are computationally heavy and slow
• Anycost GAN: train once, then run the generator at flexible costs for interactive image synthesis and editing
Anycost GANs for Interactive Image Synthesis and Editing [Lin et al., CVPR 2021]
Efficient Image Generation
SIGE accelerates Stable Diffusion by >4X with spatial sparsity
Prompts: “A photograph of a horse on a grassland.” · “A fantasy beach landscape, trending on artstation.”
FastComposer: https://fastcomposer.hanlab.ai/
Efficient Image Generation
FastComposer achieves tuning-free multi-subject image generation
DreamFusion: https://dreamfusion3d.github.io/
Video Generation
Diffusion models create realistic videos from a natural language description
“Video modeling is a harder task for which performance is not yet saturated at 5.6B model size”
Imagen Video: https://imagen.research.google/video/
From 2D Vision to 3D Vision
Deep learning helps machines perceive the surrounding environment
Efficient and Robust LiDAR-Based End-to-End Navigation [Liu et al., ICRA 2021]
Efficient 3D Perception
BEVFusion supports efficient multi-task multi-sensor fusion
[Figure legend: Car, Truck, Pedestrian, Barrier, Drivable Area, Lane Divider, Walkway, Crosswalk]
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [Liu et al., arXiv 2022]
Deep Learning is Everywhere
ChatGPT and Large Language Model
Large language models produce human-like text based on past conversations
ChatGPT: https://chat.openai.com/
Code Generation
GitHub Copilot can make meaningful coding suggestions based on context
Conversation Translation (on iPhone)
Google Translate: https://translate.google.com/
Efficient Neural Machine Translation
Lite Transformer reduces the model size with pruning and quantization
[Figure: Model size (MB) and BLEU score for the Transformer baseline (176 MB) vs. Lite Transformer (about 2.5x smaller), +Quantization, and +Quantization & Pruning (9.7 MB, an 18.2x overall reduction), with BLEU roughly preserved (around 39.5-39.9).]
Lite Transformer with Long-Short Range Attention [Wu et al., ICLR 2020]
Large Language Models
Large language models show emergent behaviors: zero-shot/few-shot learning
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
Large Language Models
Large language models show emergent behaviors: chain-of-thought
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
Large Language Models
But these capabilities come at the cost of very large model sizes
Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
Large Language Models
Model size of language models is growing exponentially
[Figure: Model size (#params, in billions) vs. single-GPU memory, 2017-2022: Transformer 0.05B, GPT 0.11B, BERT 0.34B, GPT-2 1.5B, MegatronLM 8.3B, T-NLG 17B, GPT-3 175B, MT-NLG 530B, while GPU memory grows only from 16GB (TPUv2, V100) to 32GB (TPUv3, V100) and 40GB/80GB (A100); data assumed to be FP16.]
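A quick back-of-the-envelope calculation (assuming FP16, i.e., 2 bytes per parameter, and counting weights only, ignoring activations and the KV cache) shows why the curves diverge:

GIB = 1024 ** 3

def weight_memory_gb(num_params_billion, bytes_per_param=2):
    """Memory to store the weights alone, FP16 (2 bytes/param) by default."""
    return num_params_billion * 1e9 * bytes_per_param / GIB

for name, params_b in [("GPT-2", 1.5), ("GPT-3", 175), ("MT-NLG", 530)]:
    print(f"{name}: {weight_memory_gb(params_b):.0f} GB")
# GPT-2: ~3 GB, GPT-3: ~326 GB, MT-NLG: ~987 GB -- far beyond a single 80 GB A100.

GPT-2 fits comfortably on one GPU, while GPT-3 and MT-NLG exceed even an 80 GB A100 by a wide margin; this is the gap that model compression must bridge.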
- Deploying LLMs on the edge is useful: running copilot services (code completion, office, game chat) locally on laptops, cars, robots, and more. These devices are resource-constrained, low-power, and sometimes have no Internet access.
- Data privacy is important: users do not want to share personal data with large companies.
Images are generated by Midjourney
Efficient Large Language Models
- LLMs are too large to fit into the memory of edge devices. We enable edge deployment of LLMs through quantization: SmoothQuant and AWQ (a simplified sketch follows below).
- TinyChatEngine implements the compressed inference; it is built from scratch in C/C++ and is easy to install and port to edge platforms.
TinyChatEngine: https://github.com/mit-han-lab/TinyChatEngine
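As a rough illustration of the kind of low-bit weight-only quantization involved, here is a plain group-wise INT4 quantizer in PyTorch. It is a sketch only: real AWQ additionally applies activation-aware per-channel scaling before quantizing, and SmoothQuant targets W8A8 rather than 4-bit weights; the tensor shape and group size below are arbitrary assumptions.

import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit quantization of an [out, in] weight matrix, one scale/zero-point per group."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = (wg / scale + zero).round().clamp(0, 15)     # values an INT4 kernel would store
    w_hat = ((q - zero) * scale).reshape(out_features, in_features)
    return q.to(torch.uint8), scale, zero, w_hat

w = torch.randn(4096, 4096)
q, scale, zero, w_hat = quantize_int4_groupwise(w)
print((w - w_hat).abs().mean())  # small reconstruction error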
TinyChat on Orin
Running LLMs on resource-constrained edge GPUs
Vision-Language Models
LLaVA achieves general-purpose visual and language understanding
LLaVA: https://llava-vl.github.io/
Efficient Vision-Language Models
AWQ quantizes vision-language models to 4 bits with high quality
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Lin et al., 2023]
Vision-Language-Action Models
Robotics transformers control robots based on language instructions
They run at only 3 Hz due to the high computational cost and networking latency.
RT-1: https://robotics-transformer1.github.io/
Deep Learning for Games
AlphaGo masters the game of Go with DNNs & tree search
Compute: 1920 CPUs and 280 GPUs ($3000 electric bill per game)
AlphaGo: https://www.deepmind.com/research/highlighted-research/alphago
Deep Learning for Scientific Discovery
AlphaFold reveals the structure of the protein universe
[Figure: experimental result vs. computational prediction. AlphaFold (Nature 2021).]
Architectural support for quantization/pruning brings tremendous improvement
FP32 => FP16 => Int8; dense => sparse
[Figure: GPU architectural support: integer matrix multiplication and accumulation, half-precision matrix multiplication and accumulation, and 8-bit integer 4-element vector dot product.]
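A minimal NumPy sketch of what such integer pipelines compute: quantize both operands to INT8, multiply and accumulate in INT32 (the role of the matrix and dot-product units above), and rescale the result once at the end. The shapes and the per-tensor symmetric scheme are assumptions for illustration.

import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: x ≈ scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a, b = rng.standard_normal((64, 256)), rng.standard_normal((256, 32))
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
# Integer multiplies accumulated in INT32, then one floating-point rescale at the end.
acc_int32 = qa.astype(np.int32) @ qb.astype(np.int32)
c_approx = acc_int32.astype(np.float32) * (sa * sb)
print(np.abs(c_approx - a @ b).mean())  # small error vs. the FP32 result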
The software cost dominates the cost breakdown of advanced technology nodes [source].
We focus on designing new algorithms and software for efficient computing.
GPU (SXM)   Dense FP16 Perf (TOPS)   Memory Bandwidth (GB/s)   Power (W)   Memory (GB)
P100        18.7                     732                       250         16
V100        125                      900                       300         32
A100        312                      2,039                     400         80
H100        990                      3,430                     700         96
https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors
Edge AI Hardware
Apple Neural Engine
• The Apple Neural Engine (ANE) is an energy-efficient and high-throughput engine
for ML inference on Apple silicon.
NanoReview. https://nanoreview.net/en
Edge AI Hardware
Nvidia Jetson
• NVIDIA Jetson is a complete System on Module (SOM) that includes a GPU, CPU,
memory, power management, high-speed interfaces, and more.
[Figure: Jetson module comparison (Nano, TX2, Xavier NX, AGX Xavier, AGX Orin 32GB, AGX Orin 64GB): AI performance grows from 0.5 TOPS (Nano) and 1.3 TOPS (TX2) to 21 TOPS (Xavier NX), 32 TOPS (AGX Xavier), and 200+ TOPS (AGX Orin), with memory and power budgets growing accordingly.]
NanoReview. https://connecttech.com/jetson/jetson-module-comparison/
Edge AI Hardware
Tensor Processing Unit
• Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated
circuit (ASIC) developed by Google for neural network machine learning, using Google's
own TensorFlow software.
Current Landscape of Deep Learning
Today's deep learning relies on big computation, heavy engineering, and a lot of data; the goal of efficient deep learning is to achieve the same with less.
[Diagram: Efficient Training (gradient compression, on-device training, federated learning); Application-Specific Optimizations (LLM, AIGC, video, point-cloud).]
Course Overview
• Efficient Inference: pruning, quantization, neural architecture search, distillation
• Efficient Training: gradient compression, on-device training, federated learning
• Application-Specific Optimizations: LLM, AIGC, video, point-cloud
The topics span the System and Algorithm levels and sit at the intersection of EE, CS, and AI+D.
Course Overview
Computation Structures Computer System Architecture
(6.1910[6.004]), pre-req Efficient Inference (6.5900 [6.823])
Hardware Architecture for pruning, quantization, neural Software Performance Engineering
Deep Learning EE architecture search, distillation
CS (6.1060[6.172])
(6.5930[6.825]) Mobile and Sensor Computing
Microcomputer Project Lab (6.1820[6.808])
(6.2060[6.115])
Efficient Training
System gradient compression, on-device Algorithm
training, federated learning
Application-Specific
Optimizations
LLM, AIGC, video, point-cloud
• 23 Lectures (90min)
• 1 Guest Lecture
• 5 Lab Assignments
• 1 Final Project:
- proposal
- presentation + demo
- written report
Pruning
Quantization
Knowledge Distillation
Part II: Application-Specific Optimizations
Distributed Training
On-Device Learning
TinyEngine on MCU
Lab 0 — Getting Started with PyTorch
Lab 1 — Pruning
Lab 2 — Quantization
Large Language Model
Generative AI
Driver Assistance
TinyML System