[Figure: example classification tasks (professional vs. culture documents; cat, dog, raccoon images) and a small network with weights w1 … w5; training propagates the error back through the network to obtain ∂error/∂w1, …, ∂error/∂w5 and drive the loss down.]

For a toy computation, the forward and backward passes can be written out by hand in NumPy:
import numpy as np

N, D = 3, 4
x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

# Forward pass: c = sum(x * y + z)
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: gradients derived by hand via the chain rule
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x
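The hand-derived gradients can be sanity-checked against numerical differentiation; a quick sketch, perturbing one entry of x:

# Finite-difference check: dc/dx[0,0] should match grad_x[0,0] (= y[0,0])
eps = 1e-6
x[0, 0] += eps
c_plus = np.sum(x * y + z)
x[0, 0] -= eps
print((c_plus - c) / eps, grad_x[0, 0])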
Ideally, using a deep-learning framework should be as simple as:

x, y = load_data()
y_pred = xxlib.resnet152(x)
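With a modern framework this ideal is close to reality. A minimal sketch using torchvision's pretrained ResNet-152 (load_data is the slide's placeholder helper, assumed to return an image batch and integer labels; the weights argument varies across torchvision versions):

import torch
import torchvision

x, y = load_data()  # placeholder helper: image batch and labels
model = torchvision.models.resnet152(weights="IMAGENET1K_V1")
y_pred = model(x)                                    # forward pass
loss = torch.nn.functional.cross_entropy(y_pred, y)
loss.backward()                                      # gradients via autodiff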
Existing frameworks trade off between two goals: efficient, layer-based libraries sit at one end of the spectrum, flexible, Python-like imperative programming at the other. In a layer-based framework, every layer is hand-implemented for each device, as in this Caffe-style sketch:
class AttentionLayer<CPU>
{
    void forward(inputs..)
    {
        // hand-written CPU forward pass
    }
    void backward(inputs, grad)
    {
        // hand-written CPU backward pass
    }
    ....
};

class AttentionLayer<GPU>
{
    ...
};

REGISTER_LAYER("Attention", AttentionLayer);
Weights are updated with stochastic gradient descent (SGD): 𝑤 ← 𝑤 − 𝜂 ∂𝐿/∂𝑤, where 𝜂 is the learning rate. An overview of gradient-descent variants:
https://ruder.io/optimizing-gradient-descent/
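A minimal sketch of the update rule itself, assuming params and grads are parallel lists of NumPy arrays (hypothetical names, for illustration):

lr = 0.01                       # learning rate eta
for w, g in zip(params, grads):
    w -= lr * g                 # in-place SGD step: w <- w - eta * g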
Frontend programming languages and interfaces
Python, Lua, R, C++
Automatic differentiation (Auto Differentiation)
Unified model representation: the computational dataflow graph
[Figure: a small dataflow graph computing y = x * w + b]
Graph optimization and scheduled execution
Batching, Cache, Overlap
Kernel code optimization and compilation
GPU kernel, auto kernel generation
Computing hardware
CPU, GPU, RDMA devices
The graph is built from a fixed vocabulary of operators, for example:
Add, Sub, Mul, Div, Relu, Tanh, Exp, Floor, Log, Sigmoid, MatMul, Conv, BatchNorm, Loss, Transpose, Concatenate, Reshape, Select, BroadCast, Reduce, Map, Merge, While, .....
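To make the graph representation concrete, here is a minimal sketch (illustrative names, not any real framework's API) of recording such operators as nodes of a dataflow graph instead of executing them eagerly:

class Node:
    """One operator in the dataflow graph."""
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs

def matmul(a, b): return Node("MatMul", [a, b])
def add(a, b):    return Node("Add", [a, b])

# Building y = x*w + b records three nodes; nothing is computed yet.
x, w, b = Node("Input", []), Node("Param", []), Node("Param", [])
y = add(matmul(x, w), b)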
NumPy vs. TensorFlow: the same computation, expressed explicitly as a dataflow graph.
[Figure: forward graph x, y → (*) → a; a, z → (+) → b; b → (Σ) → c, mirrored by a gradient graph (*g, +g, Σg) that produces ∇x, ∇y, ∇z.]
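For comparison, a sketch of the same computation in TensorFlow's static-graph style (written against the tf.compat.v1 API; details vary across versions):

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()
N, D = 3, 4
x = tf.compat.v1.placeholder(tf.float32, (N, D))
y = tf.compat.v1.placeholder(tf.float32, (N, D))
z = tf.compat.v1.placeholder(tf.float32, (N, D))

c = tf.reduce_sum(x * y + z)                          # builds graph nodes only
grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])   # gradient subgraph

with tf.compat.v1.Session() as sess:
    feed = {t: np.random.randn(N, D) for t in (x, y, z)}
    print(sess.run([c, grad_x, grad_y, grad_z], feed_dict=feed))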
The training objective is 𝐿(𝑤) = Loss(𝑓(𝑤, 𝑥ᵢ), 𝑦ᵢ), and what the optimizer needs is ∂𝐿(𝑤)/∂𝑤.
[Figure: the gradient graph is derived mechanically from the forward graph: each forward node (*, +, Σ) gets a corresponding gradient node (*g, +g, Σg).]
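A minimal sketch of how that derivation works (illustrative structures; topological_order is a hypothetical helper listing nodes from inputs to output):

def backward(output):
    """Reverse-mode autodiff over a recorded graph (sketch)."""
    grads = {output: 1.0}
    for node in reversed(topological_order(output)):  # hypothetical helper
        g = grads.get(node)
        if g is None:
            continue
        if node.op == "Mul":                 # d(a*b): b*g flows to a, a*g to b
            a, b = node.inputs
            grads[a] = grads.get(a, 0.0) + g * b.value
            grads[b] = grads.get(b, 0.0) + g * a.value
        elif node.op in ("Add", "Sum"):      # gradient passes through unchanged
            for inp in node.inputs:
                grads[inp] = grads.get(inp, 0.0) + g
    return grads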
[Figure: dataflow graphs of a recurrent cell: matrix multiplies of h_{t-1} and x_t with the weights Rf, Ro, Rz and Wf, Wo, Wz feed the σ, σ, tanh gates, which are combined with × and + to produce h_t; a second panel shows the same cell with the per-gate weights fused into single matrices R and W (a batching optimization).]
[Figure: the forward and gradient computation is itself just a dataflow graph composed of such operators.]
Explicit graph partitioning
[Figure: the graph for Y = MatMul(Sigmoid(MatMul(W1, X)), W2) is split by the user across devices: the inner MatMul and Sigmoid producing H run on GPU 1, and the outer MatMul producing Y runs on GPU 0.]
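In TensorFlow 1.x-style code, such explicit placement looks roughly like this sketch (shapes chosen arbitrarily for illustration):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()
X = tf.compat.v1.placeholder(tf.float32, (4, 8))
W1 = tf.Variable(tf.random.normal((16, 4)))
W2 = tf.Variable(tf.random.normal((10, 16)))

with tf.device("/gpu:1"):
    H = tf.sigmoid(tf.matmul(W1, X))   # lower layer pinned to GPU 1
with tf.device("/gpu:0"):
    Y = tf.matmul(W2, H)               # upper layer pinned to GPU 0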
Edges that cross device boundaries are automatically replaced with a pair of Send/Recv operators, so dispatching graph partitions comes with a transparent tensor-transmission mechanism.
[Figure: after partitioning, Server0 holds H = σ(MatMul(W1, X)) followed by a Send, and Server1 holds a Recv feeding Y = MatMul(H, W2).]
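A minimal sketch of that edge-rewriting step (illustrative data structures, not TensorFlow's actual implementation):

def insert_send_recv(edges, placement):
    """Replace each cross-device edge with a Send/Recv pair (sketch)."""
    rewritten = []
    for src, dst in edges:
        if placement[src] == placement[dst]:
            rewritten.append((src, dst))        # same device: keep the edge
        else:
            send, recv = ("Send", src), ("Recv", dst)
            placement[send] = placement[src]    # Send runs on the producer's device
            placement[recv] = placement[dst]    # Recv runs on the consumer's device
            rewritten += [(src, send), (send, recv), (recv, dst)]
    return rewritten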
[Figure: every operator in the forward/gradient dataflow graph is backed by device-specific kernel implementations (CPU code and GPU code).]
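A minimal sketch of per-device kernel dispatch (illustrative; gpu_add stands in for a hand-written or generated CUDA kernel):

import numpy as np

def gpu_add(a, b):
    raise NotImplementedError("stand-in for a CUDA kernel")  # hypothetical

KERNELS = {
    ("Add", "cpu"): np.add,     # vectorized CPU implementation
    ("Add", "gpu"): gpu_add,    # device-specific GPU implementation
}

def run_op(op, device, *args):
    return KERNELS[(op, device)](*args)  # look up the kernel for (op, device)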
• Caffe: programming via config files; large kernel granularity
• TensorFlow, CNTK, Caffe2: declarative programming; graph optimization
• Python, Numpy, Scipy: no programming restrictions, but cannot leverage the GPU
import torch

N, D = 3, 4
x = torch.randn(N, D, device="cuda", requires_grad=True)
y = torch.randn(N, D, device="cuda", requires_grad=True)
z = torch.randn(N, D, device="cuda", requires_grad=True)

c = 0
for i in range(10):
    a = x * y                # each iteration records new graph nodes
    b = a + z
    c = c + torch.sum(b)
c.backward()                 # backprop through the dynamically built graph
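The payoff of the dynamic style is that ordinary Python control flow can depend on tensor values, which a static graph could only express with special While/Merge operators; a small sketch:

import torch

h = torch.randn(3, requires_grad=True)
v = h
while v.norm() < 10:      # data-dependent loop, decided at run time
    v = 2 * v
v.sum().backward()        # autograd replays exactly the ops that ran
print(h.grad)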
More efficiency  ←  layer-based  ·  static graph  ·  dynamic graph  ·  Python-like  →  more flexibility