You are on page 1of 84

TinyML and Efficient

Deep Learning Computing

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
https://efficientml.ai
@SongHan_MIT

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai


Prof. Song Han

Beijing Stanford
Boston

• B.S. from Tsinghua University • MIT Technology Review, 35 Innovators under 35


• Ph.D. from Stanford University, advised by Prof. Bill Dally • NSF Career Award
• Deep Compression (best paper award of ICLR) • IEEE “AIs 10 to Watch: The Future of AI” Award
• EIE (top5 cited paper in 50 years of ISCA) • Sloan Research Fellowship

• Cofounder of DeePhi (now part of AMD)


• Cofounder of OmniML (now part of NVIDIA)
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 2
Model Compression and Efficient AI
Bridges the Gap between the Supply and Demand of AI Computing

MT-NLG
1000 530B
GPT-3
Model Size (#Params in Billion) 175B Model compression
100 bridges the gap.
T-NLG
17B
A100
10 TPUv3
A100 80GB
V100
TPUv2 MegatronLM 40GB
32GB 32GB
16GB 8.3B
1
GPT-2
1.5B
BERT
0.1 0.34B Model Size
GPT
0.11B GPU Memory
Transformer
0.05B Assume data are FP16.
0.01
2017 2018 2020 2021 2022

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (ICML 2023)
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 3
Model Compression and Efficient AI
Bridges the Gap between the Supply and Demand of AI Computing

“Deep Compression” and EIE opened a new opportunity to build hardware accelerator
for sparse and compressed neural networks

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 4


Pruning and Sparsity: Impact
3200
#publications on pruning and sparse neural networks

2400
# Publications

1600
EIE

Optimal
800
Brain Damage
Deep
Compression
0
1989 1992 1995 1998 2001 2004 2007 2010 2013 2016 2019 2022

Quickly increased since 2015, including both algorithms and systems.


Influenced the design of NVDIA Ampere Sparse Tensor Core

Source: https://github.com/mit-han-lab/pruning-sparsity-publications
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 5
Deep Learning is Everywhere
But they are computationally costly
How to make them light-weighted and fast?

Vision Language Multimodal

Image source: 1, 2, 3
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 6
Deep Learning is Everywhere
But they are computationally costly
How to make them light-weighted and fast?

Vision Language Multimodal

Image source: 1, 2, 3
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 7
Deep Learning for Image Classification
DNNs achieve super-human classification accuracy on ImageNet

ImageNet Contest Winning Entry: Top 5 Error Rate (%)


Deep Learning
28.2
25.8

16.4

11.7

6.7
5.1
3.6 3.0 2.3
Image Classification 2010 2011 2012 2013 2014 2015 2016 2017 Human
(AlexNet) (GoogLeNet) (ResNet) (SENet)

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 8


Efficient Image Classification
High accuracy comes at the cost of high computation
81

InceptionV3 Xception

79 ResNetXt-50

DPN-92
ImageNet Top-1 accuracy (%)

DenseNet-169 ResNetXt-101
77
DenseNet-264
DenseNet-121

75 ResNet-101
ResNet-50
InceptionV2

73 2M 4M 8M 16M 32M 64M


IGCV3-D
Model Size

71

MobileNetV1 (MBNetV1) Handcrafted AutoML

69
0 1 2 3 4 5 6 7 8 9
MACs (Billion)
Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 9
Efficient Image Classification
Neural architecture search reduces the computational cost
14x less computation
81
595M MACs Once-for-All (ours) InceptionV3 Xception
80.0% Top-1
79 ResNetXt-50
EfficientNet NASNet-A
DPN-92
ImageNet Top-1 accuracy (%)

MBNetV3 DenseNet-169 ResNetXt-101


77
ProxylessNAS DenseNet-264
AmoebaNet DenseNet-121

75 MBNetV2 ResNet-101
PNASNet ResNet-50
InceptionV2
ShuffleNet
73 DARTS 2M 4M 8M 16M 32M 64M
IGCV3-D
Model Size The higher the better


71

69
0
MobileNetV1 (MBNetV1)

1 2 3 4
Handcrafted

5
AutoML

6
The lower the better
7
→ 8 9
MACs (Billion)
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 10
Efficient Image Classification
Efficient deep learning enables daily life application on mobile phones

Image Classification Photo Tags (on iPhone)

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 11


Efficient Image Recognition on Phones
Efficient deep learning enables daily life application on mobile phones

People Recognition (on iPhone) On-Device Pose Estimation

Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation [Wang et al., CVPR 2022]
Recognizing People in Photos Through Private On-Device Machine Learning [Apple, 2021]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 12
Efficient Image Recognition on MCUs
MCUNet enables TinyML on IoT devices
(ARM Cortex-M7 + OpenMV Cam)

Facial Mask Detection Person Detection

MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 13
Efficient On-Device Training on Edge
AI systems need to continually adapt to new data collected locally

Cloud-based Learning

On-device Learning

New and Sensitive


data cannot be sent to the
Data cloud for privacy reason

User Intelligent Edge Devices Cloud Server

● On-device learning: better privacy, lower cost, customization, life-long learning


● Training is more expensive than inference, hard to fit edge hardware (limited memory)

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 14


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 15
Prompt Segmentation
Segment Anything Model (SAM) realizes promptable zero-shot segmentation

Segment based on interactive point prompts Automatically segment everything in an image

SAM runs at 12 images per second due to the large vision transformer model (ViT-Huge).

Segment Anything: https://segment-anything.com/


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 16
Efficient Prompt Segmentation
EfficientViT accelerates Segment Anything Model by 70 times
SAM ViT-Huge
842 image/s 12 image/s
EfficientViT

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [Cai et al., ICCV 2023]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 17
Efficient Prompt Segmentation
EfficientViT accelerates Segment Anything Model by 70 times
SAM ViT-Huge
842 image/s 12 image/s
EfficientViT

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [Cai et al., ICCV 2023]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 18
From Discrimitive Model to Generative Model
Diffusion models create realistic images from a natural language description

Teddy bears, mixing sparkling A bowl of soup, as a planet in the A photo of an astronaut riding a
chemicals as mad scientists universe, as a 1960s poster horse on mars

Training Stable Diffusion costs $600,000 (256 A100s, 150k hours)


Midjourney: https://www.midjourney.com/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 19
Efficient Image Generation
GAN Compression reduces the computation by 9-21X by pruning
1000
Original Model
GAN Compression

281

100 9×

Original CycleGAN; MACs: 56.8G; FPS: 12.1; FID: 61.5


MACs (G)

57 57

31.7
12×
21×
10

4.8

2.7
GAN Compression; MACs: 3.50G (16.2×); FPS: 40.0 (3.3×); FID: 53.6
1
Horse→zebra Edge→shoes Cityscapes Measured on NVIDIA Jetson Xavier GPU
CycleGAN, Zhu et al. Pix2pix, Isola et al. GauGAN, Park et al. Lower FID indicates better Performance.
GAN Compression: Efficient Architectures for Interactive Conditional GANs [Li et al., CVPR 2020]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 20
Efficient Image Generation
• Generative Adversarial Network (GAN)
is computationally heavy and slow

• Difficult for interactive photo editing


on mobile device (iPad)

• Anycost GAN with once-for-all network:

Train once

Small sub-net: Large sub-net:


low cost, high-quality
fast prototyping finalization
Anycost GAN, CVPR’21
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 21
Efficient Image Generation
AnycostGAN achieves interactive image synthesis and editing on a laptop

Compute Budget 1x 0.7x 0.5x 0.4x 0.2x


Tiered Pricing $0.01 $0.007 $0.005 $0.004 $0.002
The quality is still
reasonably good

Anycost GANs for Interactive Image Synthesis and Editing [Lin et al., CVPR 2021]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 22
Efficient Image Generation
SIGE accelerates Stable Diffusion by >4X with spatial sparsity
A photograph of a horse on a grassland. A fantasy beach landscape, trending on artstation.

Original 11.6% Masked Original 2.9% Edited

Stable Diffusion: Ours: Stable Diffusion+SDEdit: Ours:


1855GMACs 369ms 514G (3.6×) 95.0ms (3.9×) 1855GMACs 369ms 353G (5.3×) 76.4ms (4.8×)

Image Inpainting Image Editing

Latency Measured on NVIDIA RTX 3090


Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models [Li et al., NeurIPS 2022]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 23
Efficient Image Generation
FastComposer achieves tuning-free multi-subject image generation
• Create personalized images based on user-specified inputs
• Existing work:
• need extra fine-tuning stage to learn subjects (computationally expensive),
• generate poor multi-subject images (limited composability).
• overfit reference images (limited editability).

“Song Han riding a horse”


(Midjourney) FastComposer: https://fastcomposer.hanlab.ai/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 24
Efficient Image Generation
FastComposer achieves tuning-free multi-subject image generation

FastComposer: https://fastcomposer.hanlab.ai/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 25
Efficient Image Generation
FastComposer achieves tuning-free multi-subject image generation

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 26


3D Generation
Diffusion models create 3D objects from a natural language description

DreamFusion: https://dreamfusion3d.github.io/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 27
Video Generation
Diffusion models create realistic videos from a natural language description

“Video modeling is a harder task for which performance is not yet saturated at 5.6B model size”
Imagen Video: https://imagen.research.google/video/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 28
From 2D Vision to 3D Vision
Deep learning helps machine perceive the surrounding environment

A whole trunk of workstation


Waymo Driver

Waymo Driver: https://www.youtube.com/watch?v=2CVInKMz9cA


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 29
Too slow to drive

Efficient 3D Perception fps


30fps
fps Real time!

Fast-LiDARNet accelerates 3D perception fps

with algorithm/system co-design

Efficient and Robust LiDAR-Based End-to-End Navigation [Liu et al., ICRA 2021]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 30
Efficient 3D Perception
BEVFusion supports efficient multi-task multi-sensor fusion

3D Object Detection BEV Map Segmentation

: Car : Truck : Pedestrian : Barrier : Drivable Area : Lane Divider : Walkway : Crosswalk

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [Liu et al., Arxiv 2022]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 31
Deep Learning is Everywhere

Vision Language Multimodal

Image source: 1, 2, 3
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 32
ChatGPT and Large Language Model
Large language models produce human-like text based on past conversations

ChatGPT: https://chat.openai.com/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 33
Code Generation
GitHub CoPilot can make meaningful coding suggestions based on context

Image credit: https://techcrunch.com/2021/06/29/github-previews-new-ai-tool-that-makes-coding-suggestions/


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 34
Neural Machine Translation
Neural machine translation bridges the language barrier

Conversation Translation
(on iPhone)
Google Translate: https://translate.google.com/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 35
Efficient Neural Machine Translation
Lite Transformer reduces the model size with pruning and quantization
200 45

176

BLEU Score on En-Fr Translation


Model Size (MB)

133.3 41.7

2.5
x
18.2×
39.9
39.6 39.6 39.5

69.2
66.7 4. 38.3
0x

17.3 1.8x
9.7
0 35
Transformer Lite Transformer +Quantization +Quantization & Pruning

Lite Transformer with Long-Short Range Attention [Wu et al., ICLR 2020]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 36
Large Language Models
Large language models show emergent behaviors: zero/few learning

Brown et al., GPT-3, 2020


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 37
Large Language Models
But it comes at the cost of large model size

Brown et al., GPT-3, 2020


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 38
Large Language Models
Large language models show emergent behaviors: chain-of-thought

Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 39
Large Language Models
Large language models show emergent behaviors: chain-of-thought

Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 40
Large Language Models
But it comes at the cost of large model size

Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 41
Large Language Models
Model size of language models is growing exponentially

MT-NLG
1000 530B
GPT-3
Model Size (#Params in Billion)

175B

100
T-NLG
17B
A100
10 TPUv3
A100 80GB
V100
TPUv2 MegatronLM 40GB
32GB 32GB
16GB 8.3B
1
GPT-2
1.5B
BERT
0.1 0.34B Model Size
GPT
0.11B GPU Memory
Transformer
0.05B Assume data are FP16.
0.01
2017 2018 2020 2021 2022

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 42


Efficient Large Language Models
SpAtten accelerates language models by pruning redundant tokens
Attention Prob Attention Prob

Accumulate vertically

Accumulate vertically
0.4 1.0 0.3 1.2 1.7 1.0 0.4 1.8 0.6 1.9 1.4 0.3 0.9 0.4 0.4 1.0 0.3 1.2 1.7 1.0 0.4 1.8 0.6 1.9 1.4 0.3 0.9 0.4

Tokens with small cumulative importance scores are pruned away

SpAtten: Efficient Natural Language Processing [Wang et al., HPCA 2021]


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 43
Efficient Large Language Models
Running LLMs on the edge is very important

- Deploying LLM on the edge is useful: running copilot services (code completion, office, game chat)
locally on laptops, cars, robots, and more. These devices are resource-constrained, low-power
and sometimes do not have access to the Internet.

- Data privacy is important. Users do not want to share personal data with large companies.
Images are generated by Midjourney
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 44
Efficient Large Language Models
- LLMs are too large to fit into the memory of edge devices. We enable edge deployment of LLMs through
quantization: SmoothQuant and AWQ
- TinyChatEngine implements the compressed inference, built from C/C++ from scratch, easy to install and migrate
to edge platforms

TinyChatEngine: https://github.com/mit-han-lab/TinyChatEngine
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 45
TinyChat on Orin
Running LLMs on resource-constrained edge GPUs

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 46


TinyChat on Orin
TinyChat delivers 30 tokens / second performance for LLaMA2

LLaMA-2-7B (W4A16, AWQ): 30 tokens / s


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 47
TinyChat on Orin
TinyChat enables local inference of 13B LLMs on edge devices

LLaMA-2-13B (W4A16, AWQ): 17 tokens / s


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 48
TinyChat
TinyChat brings about 3.3x speedup to LLaMA-2 on 4090

LLaMA-2-7B (FP16): 50 tokens / s LLaMA-2-7B (W4A16, AWQ): 166 tokens / s*


All-Python user interface, diverse and flexible support for different LLMs (e.g. LLaMA, MPT, Falcon)
*: If measured in exLLaMA’s setting, we can achieve 195 tokens / s.
TinyChat: https://tinychat.hanlab.ai
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 49
TinyChat
TinyChat seamlessly supports different LLM architectures

MPT-7B (W4A16): Falcon-7B (W4A16): Vicuna-7B (W4A16):


31 tokens / s 22 tokens / s 33 tokens / s
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 50
Deep Learning is Everywhere

Vision Language Multimodal

Image source: 1, 2, 3
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 51
Vision-Language Models
LLaVA achieves general-purpose visual and language understanding

LLaVA uses a 13B LLaMA for language understanding.

LLaVA: https://llava-vl.github.io/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 52
Efficient Vision-Language Models
AWQ quantizes vision-language models to 4 bits with high quality

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Lin et al., 2023]
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 53
Vision-Language-Action Models
Robotics transformers control robots based on language instructions

Run at only 3Hz due to the high computational cost and networking latency.
RT-1: https://robotics-transformer1.github.io/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 54
Deep Learning for Games
AlphaGo masters the game of Go with DNNs & tree search

AlphaGo (Nature 2016) AlphaGo versus Lee Sedol (4-1)

Compute: 1920 CPUs and 280 GPUs ($3000 electric bill per game)
AlphaGo: https://www.deepmind.com/research/highlighted-research/alphago
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 55
Deep Learning for Scientific Discovery
AlphaFold reveals the structure of the protein universe

Experimental result
Computational prediction
AlphaFold (Nature 2021)

Compute: 16 TPUv3s (128 TPUv3 cores) for a few weeks


AlphaFold 2: https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 56
Deep Learning: Three Pillars

Algorithm Hardware Data

3
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 57
Architectural support for quantization/pruning
brings tremendous improvement
FP32 => FP16 => Int8; dense => sparse

integer matrix
multiplication
and
accumulation
half matrix
multiplication
and
8-bit integer accumulation
4-element
vector dot
product

Data source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten, K. Rupp

GPUs, Machine Learning, and EDA — Bill Dally


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 58
Software Innovation is important in
advanced technology node

The software cost dominates the cost breakdown of advanced technology nodes [source].
We focus on designing new algorithms and software for efficient computing.

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 59


Cloud AI Hardware
NVIDIA

P100 (2016) V100 (2017) A100 (2020) H100 (2022)

Dense FP16 Perf (TOPs) Memory Bendwidth (GB/s) Power (W) Memory (GB)
990
1000 4000 700 100
3,430 700
96
560
750 3000 75 80
2,039 420
500 2000 400 50
312 280
900 300
250 1000732 250 25 32
125 140
18.7 16
0 0 0 0
P100 V100 A100 H100 P100 V100 A100 H100 P100 V100 A100 H100 P100 V100 A100 H100
SXM SXM SXM SXM SXM SXM SXM SXM SXM SXM SXM SXM

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 60


Edge AI Hardware
Qualcomm Hexagon DSP
• Qualcomm Hexagon is a family of digital signal processor (DSP) products by Qualcomm.
It is designed to deliver performance with low power over a variety of applications

Performance (TOPs) Power (W) Memory (GB)


60 52 10 16
10 10 10 10 16 16
45 9.6 12
12
9.2
30 26 8
15
8.8 9 8
15 7 8.4 4
3 4
0 8 0
S845 S855 S865 S888 S8G1 S845 S855 S865 S888 S8G1 S845 S855 S865 S888 S8G1

https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 61
Edge AI Hardware
Apple Neural Engine
• The Apple Neural Engine (ANE) is an energy-efficient and high-throughput engine
for ML inference on Apple silicon.

Performance (TOPs) Power (W) Memory (GB)


20 17
8 8
15.8 8 8
15 6 6
11 6 6 6 6 6 6 6
10 4 4
6 4 4
5
5 2 2 3
0.6
0 0 0
A11 A12 A13 A14 A15 A16 A11 A12 A13 A14 A15 A16 A11 A12 A13 A14 A15 A16

NanoReview. https://nanoreview.net/en
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 62
Edge AI Hardware
Nvidia Jetson
• NVIDIA Jetson is a complete System on Module (SOM) that includes a GPU, CPU,
memory, power management, high-speed interfaces, and more.

Performance (TOPs) Power (W) Memory (GB)


60 64
300 275.0 60 64

225 200.0 45 40 48
30 32
150 30 32
20
15 16
75 32.0 15 10 16 8
21.0 4 4
0.5 1.3
0 0 0
Nano TX2 Xavier NX AGX Xavier AGX Orin AGX Orin Nano TX2 Xavier NX AGX Xavier AGX Orin AGX Orin Nano TX2 Xavier NX AGX Xavier AGX Orin AGX Ori
32GB 64GB 32GB 64GB 32GB 64GB

NanoReview. https://connecttech.com/jetson/jetson-module-comparison/
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 63
Edge AI Hardware
Tensor Processing Unit
• Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated
circuit (ASIC) developed by Google for neural network machine learning, using Google's
own TensorFlow software.

Performance (TOPs) Power (W) Memory (GB)


5 2.5 10
4 2 8
4 2 8
3 1.5 6
2 1 4
1 0.5 2
0 0 0
Edge TPU Edge TPU Edge TPU

Tensor Processing Unit. https://en.wikipedia.org/wiki/Tensor_Processing_Unit


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 64
Edge AI Hardware
FPGA-based Accelerators
• Field Programmable Gate Arrays (FPGA) delivers higher performance compared to
a fixed-architecture AI accelerator like a GPU due to efficiency of custom hardware
acceleration.

Performance (TOPs) Power (W) Memory (GB)


4 60 16
16
3.8
3 45 53 12
2.7
2 30 8

1 15 4
4
12
0 0 0
ZU9EG KCU1500 ZU9EG KCU1500 ZU9EG KCU1500

Neural Network Accelerator Comparison. https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator.html


MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 65
Edge AI Hardware
Microcontrollers (MCU)
• A microcontroller is a compact integrated circuit designed for embedded systems. A typical
microcontroller includes a processor, memory and input/output (I/O) peripherals on a single
chip.

Performance (DMIPS*) Power (mW) Memory (KB)


320
500 462 400 360 320

375 300 240 196

250 210 200 160


158 132 96
125 100 80 32
45 26
6
0 0 0
Arduino Arduino STM32 STM32 Arduino Arduino STM32 STM32 Arduino Arduino STM32 STM32
Zero DUE F407VGT F746NG Zero DUE F407VGT F746NG Zero DUE F407VGT F746NG

* Dhrystone Million Instructions Per Second (DMIPs) is an index for integer computation.

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 66


Edge AI Hardware
Edge AI devices still have huge gap to cloud processors

Cloud AI Mobile AI Tiny AI

Memory (Activation) 80GB 4GB 320kB

Storage (Weights) ~TB/PB 256GB 1MB

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 67


Deep Learning: Three Pillars

Hardware Data Algorithm

Image source: 1, 2, 3
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 68
Current Landscape of Deep Learning
Big computation, engineer and data

A lot of computation Many engineers


A lot of carbon

A lot of data

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 69


Tiny Deep Learning
Tiny computation, engineer and data

less computation TinyML fewer engineers


less carbon

less data

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 70


Course Overview

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 71


Course Overview

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 72


Course Overview
Efficient Inference
pruning, quantization, neural
architecture search, distillation

Efficient Training
gradient compression, on-device
training, federated learning

Application-Specific
Optimizations
LLM, AIGC, video, point-cloud

73
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai
Course Overview
Efficient Inference
pruning, quantization, neural
architecture search, distillation

Efficient Training
System gradient compression, on-device Algorithm
training, federated learning

Application-Specific
Optimizations
LLM, AIGC, video, point-cloud

74
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai
Course Overview
Efficient Inference
pruning, quantization, neural
EE architecture search, distillation
CS

Efficient Training
System gradient compression, on-device Algorithm
training, federated learning

Application-Specific
Optimizations
LLM, AIGC, video, point-cloud

AI+D

75
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai
Course Overview
Computation Structures Computer System Architecture
(6.1910[6.004]), pre-req Efficient Inference (6.5900 [6.823])
Hardware Architecture for pruning, quantization, neural Software Performance Engineering
Deep Learning EE architecture search, distillation
CS (6.1060[6.172])
(6.5930[6.825]) Mobile and Sensor Computing
Microcomputer Project Lab (6.1820[6.808])
(6.2060[6.115])
Efficient Training
System gradient compression, on-device Algorithm
training, federated learning

Application-Specific
Optimizations
LLM, AIGC, video, point-cloud

AI+D Introduction to Machine Learning (6.3900[6.036]), pre-req


Deep Learning (6.S898)
Advances in Computer Vision (6.8300)
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai
Lecture Structure

• 23 Lectures (90min)
• 1 Guest Lecture
• 5 Lab Assignments
• 1 Final Project:
- proposal
- presentation + demo
- written report

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 77


Part I: Efficient Inference

Pruning

Quantization

Neural Architecture Search

Knowledge Distillation
MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 78
Part II: Application-Specific Optimizations

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 79


Part III: Efficient Training

Distributed Training

On-Device Learning

TinyEngine on MCU

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 80


Part IV: Quantum ML

Basics Quantum Circuit Quantum ML, Noise Robustness

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 81


Labs

Lab 0 — Getting
Lab 1 — Lab 2 —
Started with
Pruning Quantization
PyTorch

Lab 3 — Neural Lab 4 — Lab 5 — LLM


Architecture Search LLM Compression Deployment on Laptop

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 82


Final Projects — Last Year Knowledge
Distillation and
Quantization for
Efficient Keyword
Spotting
Arman Dave and Julian Hamelberg

The First Real Time Mobile ML Analytics for


both Para and Non-Para Rowers

Emelie, Sol, Veronica

Optimizing TinyEngine Kernels and Experimental Kernel


MIT 6.S965: TinyML and Efficient Deep Learning Computing
https://efficientml.ai
Generation with Multistaging and Meta-Programming

5th in the 2022 World Rowing Championships

Anne Ouyang, Pranav Krishna

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 83


Impacts
Art Generation Video Synthesis Search Engine Revolution Chatbots
Predictive Maintenance Question Answering Augmented Reality
Gesture Recognition Storytelling Autonomous Driving
Video Recognition Music Composition Sentiment Analysis Blind Spot Detection
Health Monitoring Fashion Design Machine Translation Adaptive Cruise Control

Large Driver
Generative
Language Assistance TinyML
AI
Model System

AI Application Model AI Hardware


(demand of computation) Compression (supply of computation)

MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai 84

You might also like