Multi-Wafer AI Cluster
Cerebras Systems
Sean Lie
Co-founder & Chief Hardware Architect
Founded in 2016
Offices
Silicon Valley | San Diego | Toronto | Tokyo
Customers
North America | Asia | Europe
[Chart: model memory requirement, GB (log scale, 1 to 100,000), by model: BERT Base (110M), BERT Large (340M), GPT-2 (1.5B), Megatron-LM (8B), T5 (11B), T-NLG (17B), GPT-3 (175B), MSFT-1T (1T). Model size grew from hundreds of millions to a trillion parameters in just 2 years.]
Today, GPT-3 with 175 billion parameters is trained on 1024 GPUs for 4 months. Tomorrow: multi-trillion parameter models and beyond.
Push-Button Scaling
[Diagram: model training decomposed into memory, dataset, and compute. Model weights live in a weight memory and stream into the CS-2; input samples and labels stream in from a dataset server. Forward, delta, and gradient computation is pipelined across layers L1 to L4, with the loss closing the forward pass; gradients stream back to the optimizer, which performs the weight update in memory.]
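To make the execution model above concrete, here is a minimal numpy sketch of one weight-streaming training step. It is a toy under stated assumptions, not the Cerebras implementation: MemoryStore, stream_weights, and apply_gradient are hypothetical names standing in for the external weight memory and optimizer.

```python
import numpy as np

class MemoryStore:
    """Stand-in for external weight memory: holds weights, runs the optimizer."""
    def __init__(self, layer_sizes, lr=0.01):
        rng = np.random.default_rng(0)
        self.weights = [rng.normal(0, 0.1, (n_out, n_in))
                        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
        self.lr = lr

    def stream_weights(self, layer):
        # Weights stream out to the compute system; none are resident there.
        return self.weights[layer]

    def apply_gradient(self, layer, grad):
        # Gradients stream back; the weight update happens in memory.
        self.weights[layer] -= self.lr * grad

def train_step(store, x, y):
    # Forward pass: weights arrive layer by layer, only activations stay local.
    acts = [x]
    for layer in range(len(store.weights)):
        acts.append(np.tanh(store.stream_weights(layer) @ acts[-1]))
    # Delta pass: propagate errors backward, streaming each layer's gradient
    # out as soon as it is produced.
    delta = (acts[-1] - y) * (1.0 - acts[-1] ** 2)   # tanh derivative
    for layer in reversed(range(len(store.weights))):
        grad = np.outer(delta, acts[layer])
        if layer > 0:
            delta = (store.stream_weights(layer).T @ delta) * (1.0 - acts[layer] ** 2)
        store.apply_gradient(layer, grad)

store = MemoryStore([8, 16, 16, 4])                  # toy 3-layer network
train_step(store, np.ones(8), np.zeros(4))
```

The point of the split is that model capacity is bounded by the external weight memory rather than by what fits on the compute system.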
Solving the Latency Problem
• Efficient performance scaling by removing latency-sensitive communication
• Coarse-grained pipelining
  • Forward, delta, and gradient passes are fully pipelined
  • All streams within an iteration have no inter-layer dependencies
• Fine-grained pipelining (see the timeline sketch after the chart below)
  • Overlapping the weight update with the forward pass covers the inter-iteration dependency
[Chart: y-axis 30% to 50%; x-axis model layer size (square matrix dimension), 1024 to 65536.]
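The overlap described in the bullets above can be shown on a toy timeline. A minimal sketch, assuming a 4-layer network and unit-time stages; the timing and labels are illustrative, not the real hardware schedule.

```python
# Toy pipeline timeline: 4 layers, 2 iterations, unit-time stages.
LAYERS, ITERS = 4, 2
events = []

for it in range(ITERS):
    base = it * (2 * LAYERS - 1)
    # Coarse-grained: the forward stream flows through L1..L4 back-to-back.
    for k in range(LAYERS):
        events.append((base + k, f"it{it} forward    L{k + 1}"))
    # The delta/gradient stream follows immediately, L4 back down to L1; each
    # layer's weight update is applied the moment its gradient arrives.
    for k in reversed(range(LAYERS)):
        t = base + LAYERS + (LAYERS - 1 - k)
        events.append((t, f"it{it} delta+grad L{k + 1}"))
        events.append((t, f"it{it} update     L{k + 1}"))

for t, what in sorted(events):
    print(f"t={t:2d}  {what}")
```

Running it shows iteration 1's forward pass on L1 starting in the same slot as iteration 0's update of L1, i.e. the fine-grained overlap of weight update and forward pass that hides the inter-iteration dependency.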
SwarmX
[Diagram: MemoryX (weight memory and optimizer) connected through SwarmX to multiple CS-2 systems; weights fan out to every CS-2 and gradients flow back.]
• Data-parallel training across CS-2s (see the sketch after this list)
  • Weights are broadcast to all CS-2s
  • Gradients are reduced on the way back
• Multi-system scaling with the same execution model as a single system
  • Same system architecture
  • Same network execution flow
  • Same software user interface
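As a rough illustration of the broadcast/reduce flow in the list above, here is a numpy sketch; swarm_broadcast and swarm_reduce are hypothetical stand-ins for the SwarmX fabric, not real APIs.

```python
import numpy as np

def swarm_broadcast(weights, n_systems):
    # Fan the same weight stream out to every CS-2 replica.
    return [weights for _ in range(n_systems)]

def swarm_reduce(grads):
    # Sum the per-system gradients on the way back to MemoryX.
    return np.sum(grads, axis=0)

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8))        # held in MemoryX
shards = rng.normal(size=(3, 16, 8))     # each CS-2 trains on its own data shard

replicas = swarm_broadcast(weights, n_systems=3)
# Each system computes a local gradient on its shard (toy loss 0.5 * ||W x||^2).
local_grads = [(w @ x.T) @ x for w, x in zip(replicas, shards)]
grad = swarm_reduce(local_grads) / len(local_grads)   # average across systems
weights -= 0.01 * grad                                # optimizer update in MemoryX
```

Because the broadcast and reduction live in the fabric, each CS-2 runs exactly the single-system execution flow, which is why the architecture, execution flow, and user interface stay the same.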
[Chart: performance scaling (y-axis 1 to 16, log scale) vs. number of CS-2s (1 to 256), with one curve per model size: Megatron-LM 8B, T5 11B, T-NLG 17B, GPT-3 175B, MSFT-1T 1T.]
[Chart: model memory requirement, GB (log scale, 1 to 100,000), same model progression as above, from BERT Base (110M) to MSFT-1T (1T).]
Exponential growth:
• We need it.
• But we also need more.
Algorithmic efficiency lets us use FLOPs more effectively.
C += A·B (matrix-matrix, GEMM) → y += A·x (matrix-vector, GEMV) → y += α·x (per-weight AXPY)
• Core array receives a sparse weight stream containing only non-zero weights
• Reduced as networks grow larger
[Chart: speedup (1x to 6x) vs. unstructured sparsity factor (% of zeros), marked at 0%, 50%, 67%, 75%, 80%, 83%, 86%, 88%, 89%, 90%.]
Near-linear sparsity acceleration (see the sketch below)
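To see where near-linear acceleration comes from, here is a sketch of the GEMM-to-AXPY decomposition above: the matmul is executed as one y += α·x operation per weight, so zero weights are simply never streamed. The function name and loop structure are illustrative, not how the core array is actually programmed.

```python
import numpy as np

def sparse_streaming_matmul(A, B):
    """C += A @ B as a stream of AXPYs, one per non-zero weight of A."""
    C = np.zeros((A.shape[0], B.shape[1]))
    ops = 0
    for i in range(A.shape[0]):
        for k in range(A.shape[1]):
            alpha = A[i, k]
            if alpha == 0.0:            # the sparse stream carries only non-zeros
                continue
            C[i, :] += alpha * B[k, :]  # AXPY: y += alpha * x
            ops += 1
    return C, ops

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64))
A[rng.random(A.shape) < 0.9] = 0.0      # ~90% unstructured sparsity
B = rng.normal(size=(64, 64))

C, ops = sparse_streaming_matmul(A, B)
assert np.allclose(C, A @ B)            # result matches the dense matmul
print(f"AXPYs executed: {ops} of {A.size} -> ~{A.size / ops:.0f}x fewer")
```

At ~90% zeros roughly 10x fewer AXPYs execute; that op-count reduction is the ceiling that the near-linear acceleration in the chart tracks.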