
[Figure: an image classifier (cat / dog / raccoon) with weights w1 … w5; training computes the loss and then the gradient of the error with respect to each weight: ∂error/∂w1, …, ∂error/∂w5.]
import numpy as np

N, D = 3, 4
x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

# Forward pass: c = sum(x * y + z)
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: gradients written out by hand, from c back to x, y, z
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x

Design spectrum: Efficiency ←→ Python-like Flexibility

Ideally, a deep learning library would feel like plain Python:

import xxlib

x, y = load_data()

y = xxlib.resnet152(x)

Spectrum: Efficiency · library · Python-like Flexibility

Spectrum: Efficiency · library · Layer-based · Python-like Flexibility
// Layer-based framework: each layer implements forward and backward
// for every device it supports.
class AttentionLayer<CPU>
{
  void forward(inputs...)
  {
  }
  void backward(inputs, grad)
  {
  }
  ...
};

class AttentionLayer<GPU>
{
  ...
};

REGISTER_LAYER("Attention", AttentionLayer);
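As a rough intuition for what each registered layer provides, here is a minimal Python sketch of the same pattern; the registry, class, and function names are illustrative and not taken from any particular framework.

import numpy as np

LAYER_REGISTRY = {}

def register_layer(name, cls):
    # Analogous to REGISTER_LAYER("Attention", AttentionLayer)
    LAYER_REGISTRY[name] = cls

class ReluLayer:
    def forward(self, x):
        self.mask = x > 0               # cache what backward will need
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask     # dL/dx = dL/dy where x > 0, else 0

register_layer("Relu", ReluLayer)

layer = LAYER_REGISTRY["Relu"]()
y = layer.forward(np.array([-1.0, 2.0]))
dx = layer.backward(np.ones(2))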
SGD: w ← w − η∇w

SGD with momentum: w ← w − (γ∇w^(t−1) + η∇w^(t))
https://ruder.io/optimizing-gradient-descent/
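A minimal numpy sketch of these update rules (the learning rate, momentum value, and toy loss are illustrative; the momentum implementation follows the velocity form used in the linked post):

import numpy as np

def sgd_step(w, grad_w, lr=0.01):
    # SGD: w <- w - eta * grad_w
    return w - lr * grad_w

def momentum_step(w, grad_w, v, lr=0.01, gamma=0.9):
    # momentum: v <- gamma * v + eta * grad_w;  w <- w - v
    v = gamma * v + lr * grad_w
    return w - v, v

w, v = np.zeros(5), np.zeros(5)
for _ in range(100):
    g = 2 * (w - 1.0)                  # gradient of the toy loss ||w - 1||^2
    w, v = momentum_step(w, g, v)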
Framework stack:

Frontend programming languages and interfaces: Python, Lua, R, C++

Automatic differentiation (Auto Differentiation)

Unified model representation: the dataflow graph (e.g. y = w·x + b)

Graph optimization and scheduled execution: batching, caching, overlap

Kernel code optimization and compilation: GPU kernels, automatic kernel generation

Computing hardware: CPU, GPU, RDMA devices
Typical operators: Add, Sub, Mul, Div, Relu, Tanh, Exp, Floor, Log, MatMul, Conv, BatchNorm, Loss, Transpose, Concatenate, Sigmoid, While, Merge, BroadCast, Reduce, Map, Reshape, Select, …

Numpy

import numpy as np
np.random.seed(0)

N, D = 3, 4
x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

# Forward pass: c = sum(x * y + z)
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: gradients derived by hand
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x
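A quick way to check hand-derived gradients like these is a finite-difference comparison; the sketch below perturbs one entry of x and compares the numerical slope with grad_x (the epsilon and the checked index are arbitrary):

import numpy as np
np.random.seed(0)

N, D = 3, 4
x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

c = np.sum(x * y + z)
grad_x = y                              # hand-derived: dc/dx = y

eps = 1e-6
x2 = x.copy()
x2[0, 0] += eps
numeric = (np.sum(x2 * y + z) - c) / eps
print(numeric, grad_x[0, 0])            # the two numbers should match closely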

Numpy → TensorFlow

[Figure: the dataflow graph for c = Σ(x·y + z), together with its gradient graph: each forward operator (*, +, Σ) gets a matching gradient operator (*g, +g, Σg), producing ∇x, ∇y, ∇z.]
L(w) = Loss(f(w, xᵢ), yᵢ)  →  ∂L(w)/∂w

Example: L(x) = exp(exp(x) + exp(x)²) + sin(exp(x) + exp(x)²)
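For this example function, the derivative an autodiff system has to produce can also be written out by hand, reusing the shared subexpression u = exp(x) + exp(x)²; a small illustrative sketch with a finite-difference check:

import numpy as np

def L_and_grad(x):
    ex = np.exp(x)
    u = ex + ex ** 2                          # shared subexpression
    L = np.exp(u) + np.sin(u)
    du_dx = ex + 2 * ex ** 2                  # d/dx [exp(x) + exp(x)^2]
    dL_dx = (np.exp(u) + np.cos(u)) * du_dx   # chain rule through u
    return L, dL_dx

L0, g = L_and_grad(0.5)
eps = 1e-6
print(g, (L_and_grad(0.5 + eps)[0] - L0) / eps)   # should agree closely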
TensorFlow

[Figure: the same graph built with TensorFlow: forward operators *, +, Σ plus automatically added gradient operators *g, +g, Σg that compute ∇x, ∇y, ∇z.]
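A minimal sketch of the same computation in TensorFlow's graph mode, written against the 1.x-style tf.compat.v1 API (exact calls can differ across versions):

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

N, D = 3, 4
x = tf.placeholder(tf.float32, (N, D))
y = tf.placeholder(tf.float32, (N, D))
z = tf.placeholder(tf.float32, (N, D))

c = tf.reduce_sum(x * y + z)                           # forward graph: c = sum(x*y + z)
grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])    # gradient sub-graph added automatically

with tf.Session() as sess:
    feed = {t: np.random.randn(N, D) for t in (x, y, z)}
    print(sess.run([c, grad_x], feed_dict=feed))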
Framework stack:

Frontend programming languages and interfaces: Python, Lua, R, C++

Automatic differentiation (Auto Differentiation)

Unified model representation: the dataflow graph (e.g. y = w·x + b)

Graph optimization and scheduled execution: batching, caching, overlap

Kernel code optimization and compilation: GPU kernels, automatic kernel generation

Computing hardware: CPU, GPU, RDMA devices
[Figure: two copies of the dataflow graph for y = w·x + b.]

 Batch same-type operators to leverage GPU massive parallelism

[Figure: data-flow graph of a GRU cell. ht-1 and xt are multiplied by six separate weight matrices (Rf, Ro, Rz and Wf, Wo, Wz), followed by adds, σ/tanh activations, and element-wise products to produce ht.]

 Batch same-type operators to leverage GPU massive parallelism

[Figure: the GRU cell's data-flow graph before and after batching. Left: six separate matrix multiplications (Rf, Ro, Rz with ht-1 and Wf, Wo, Wz with xt). Right: the same-type multiplications are batched into two larger ones (R with ht-1 and W with xt), whose results feed the gates.]
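A numpy sketch of the batching idea (shapes and names here are illustrative, not the framework's actual graph rewrite): three per-gate matrix multiplications become one multiplication with the concatenated weight matrix, and the result is split afterwards.

import numpy as np

H, X = 64, 32                                    # hidden size, input size
xt = np.random.randn(X)
Wf, Wo, Wz = (np.random.randn(H, X) for _ in range(3))

# Unbatched: three separate matmuls (three kernel launches on a GPU)
f, o, z = Wf @ xt, Wo @ xt, Wz @ xt

# Batched: one larger matmul, then split the result
W = np.concatenate([Wf, Wo, Wz], axis=0)         # shape (3H, X)
f2, o2, z2 = np.split(W @ xt, 3)

assert np.allclose(f, f2) and np.allclose(o, o2) and np.allclose(z, z2)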

[Figure: the forward/backward dataflow graph for c = Σ(x·y + z); at runtime its operators are scheduled and executed one by one.]

 Explicit graph partitioning

[Figure: a two-layer model Y = MatMul(Sigmoid(MatMul(X, W1)), W2). The lower MatMul over W1 and X is placed on GPU 1; the Sigmoid and the upper MatMul producing Y are placed on GPU 0.]
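In TensorFlow's graph API, this explicit partitioning is expressed with device scopes; a minimal 1.x-style sketch (the layer sizes and the exact placement of the Sigmoid are illustrative):

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

X = tf.random_normal((128, 784))

with tf.device('/gpu:1'):
    W1 = tf.Variable(tf.random_normal((784, 256)))
    HW = tf.matmul(X, W1)                 # lower MatMul on GPU 1

with tf.device('/gpu:0'):
    W2 = tf.Variable(tf.random_normal((256, 10)))
    H = tf.sigmoid(HW)                    # upper part of the graph on GPU 0
    Y = tf.matmul(H, W2)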
 Cross-device edges are automatically replaced with a pair of Send/Recv operators

[Figure: the two-layer model from the previous slide is partitioned into per-server sub-graphs; the cross-device edge is replaced by a Send operator on Server0 and a matching Recv operator on Server1, giving a transparent tensor transmission mechanism when dispatching partitions.]
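A toy sketch of the rewrite itself (purely conceptual; not TensorFlow's implementation): whenever an edge connects operators placed on different devices, the partitioner splits that edge into a Send node on the producer's device and a Recv node on the consumer's device.

# Toy dataflow graph: op name -> (device, list of input op names)
graph = {
    "X":       ("gpu:1", []),
    "W1":      ("gpu:1", []),
    "matmul1": ("gpu:1", ["X", "W1"]),
    "sigmoid": ("gpu:0", ["matmul1"]),    # consumes a tensor produced on gpu:1
    "W2":      ("gpu:0", []),
    "matmul2": ("gpu:0", ["sigmoid", "W2"]),
}

def insert_send_recv(graph):
    out = {}
    for name, (dev, inputs) in graph.items():
        new_inputs = []
        for src in inputs:
            src_dev = graph[src][0]
            if src_dev == dev:
                new_inputs.append(src)
            else:
                # Cross-device edge: Send on the producer side, Recv on the consumer side
                out["send_" + src] = (src_dev, [src])
                out["recv_" + src] = (dev, ["send_" + src])
                new_inputs.append("recv_" + src)
        out[name] = (dev, new_inputs)
    return out

for op, (dev, ins) in insert_send_recv(graph).items():
    print(dev, op, ins)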


[Figure: the forward/backward dataflow graph again; each operator is executed by a corresponding kernel, compiled as CPU code or GPU code.]
Framework stack:

Frontend programming languages and interfaces: Python, Lua, R, C++

Automatic differentiation (Auto Differentiation)

Unified model representation: the dataflow graph (e.g. y = w·x + b)

Graph optimization and scheduled execution: batching, caching, overlap

Kernel code optimization and compilation: GPU kernels, automatic kernel generation

Computing hardware: CPU, GPU, RDMA devices
• Layer-based (Caffe): programming with config files; large kernel granularity
• Static graph (TensorFlow, CNTK, Caffe2): declarative programming; graph optimization
• Python-like (Python, Numpy, Scipy): no programming restrictions; cannot leverage the GPU

More Efficiency ← Layer-based · Static graph · Python-like → More Flexibility


PyTorch

import torch
from torch.autograd import Variable

N, D = 3, 4

x = Variable(torch.randn(N, D).cuda(), requires_grad=True)
y = Variable(torch.randn(N, D).cuda(), requires_grad=True)
z = Variable(torch.randn(N, D).cuda(), requires_grad=True)

c = 0
for i in range(10):
    a = x * y                 # the graph is built on the fly, as the code runs
    b = a + z
    c = c + torch.sum(b)

c.backward()                  # gradients end up in x.grad, y.grad, z.grad

• Layer-based (Caffe): programming with config files; large kernel granularity
• Static graph (TensorFlow, CNTK, Caffe2): declarative programming; graph optimization
• Dynamic graph (PyTorch, Chainer, DyNet): imperative programming (define-by-run); no graph optimization
• Python-like (Python, Numpy, Scipy): no programming restrictions; cannot leverage the GPU

More Efficiency ← Layer-based · Static graph · Dynamic graph · Python-like → More Flexibility

A compiler is used to optimize a general framework to be more efficient, while keeping the existing flexibility!

Evolution: custom-purpose machine learning algorithms → deep learning frameworks (Theano, DisBelief, Caffe), which provide easier ways to leverage various libraries → machine learning languages and compilers.

• A full-featured programming language for ML: expressive and flexible, with control flow, recursion, and sparsity
• A powerful compiler infrastructure: code optimization, sparsity optimization, hardware targeting

Today's AI frameworks build on algebra and linear-algebra libraries (dense matmul, CPU engines, GPUs); the compiler direction extends this toward SIMD → MIMD, sparsity support, control flow and dynamicity, and associated memory.

[Link]