[Figure 2: breakdown of the end-to-end latency of device-only and edge-based execution under different bandwidths (legend recoverable from the residue: Device; Server at 1Mbps, 100kbps, 50kbps).]
edge server respectively, running the classical AlexNet [1] DNN model for image recognition over the Cifar-10 dataset [8]. Fig. 2 plots the breakdown of the end-to-end latency of the different approaches under varying bandwidth between the edge and the mobile device. It clearly shows that executing the model on the resource-limited Raspberry Pi takes more than 2s. Moreover, the performance of the edge-based execution approach is dominated by the input data transmission time (the edge server computation time stays at ∼10ms) and is thus highly sensitive to the available bandwidth: as the bandwidth drops from 1Mbps to 50kbps, the end-to-end latency climbs from 0.123s to 2.317s. Considering the scarcity of network bandwidth in practice (e.g., due to network resource contention among users and apps) and the computing resource limitations of mobile devices, both the device-based and the edge-based approach struggle to support the emerging real-time intelligent mobile applications with stringent latency requirements.
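As a quick sanity check, these two measurements jointly imply an input transfer of roughly 14KB per inference (treating the ∼10ms server computation time as bandwidth-independent, as the text indicates):

\[
\frac{S}{6.25\,\mathrm{KB/s}} - \frac{S}{125\,\mathrm{KB/s}} = 2.317\,\mathrm{s} - 0.123\,\mathrm{s} = 2.194\,\mathrm{s}
\;\Rightarrow\; S \approx 14.4\,\mathrm{KB},
\]

where 50kbps = 6.25KB/s and 1Mbps = 125KB/s. At 50kbps, transmitting $S$ alone then takes about 2.3s, i.e., virtually the entire end-to-end latency, which is exactly the bandwidth sensitivity described above.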
2.3 Enabling Edge Intelligence with DNN Partitioning and Right-Sizing

While DNN partitioning greatly reduces latency through the synergy between the edge server and mobile device, we should note that the optimal DNN partitioning is still constrained by the run time of the layers that remain on the device. For a further reduction of latency, DNN partitioning can be combined with the approach of DNN right-sizing. DNN right-sizing promises to accelerate model inference through an early-exit mechanism: by training a DNN model with multiple exit points, each of a different size, we can choose a DNN of a size tailored to the application demand, alleviating the computing burden at the model division and thus reducing the total latency. Fig. 4 illustrates a branchy AlexNet with five exit points. Currently, the early-exit mechanism is supported by the open-source framework BranchyNet [15]. Intuitively, DNN right-sizing further reduces the amount of computation required by DNN inference tasks.

Problem Description: Obviously, DNN right-sizing incurs a latency-accuracy tradeoff: while early exit reduces the computing time on the device side, it also deteriorates the accuracy of the DNN inference. Considering that some applications (e.g., VR/AR games) have stringent deadline requirements but can tolerate moderate accuracy loss, we hence strike a balance between latency and accuracy in an on-demand manner: given the predefined and stringent latency goal, we maximize the inference accuracy without violating the latency requirement.
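To make the early-exit mechanism concrete, here is a minimal, framework-agnostic Python sketch of inference in a branchy model (the Layer callables, class name and exit representation are our own illustrative assumptions, not BranchyNet's actual API):

```python
from typing import Callable, List, Tuple

Layer = Callable[[object], object]  # any tensor-in/tensor-out transform

class BranchyModel:
    """A shared backbone with side-branch classifiers attached at several depths."""

    def __init__(self, backbone: List[Layer], exits: List[Tuple[int, Layer]]):
        self.backbone = backbone
        self.exits = exits  # per exit point: (depth into the backbone, branch classifier)

    def infer(self, x: object, exit_point: int) -> object:
        depth, branch = self.exits[exit_point]
        for layer in self.backbone[:depth]:  # an early exit skips the deeper layers
            x = layer(x)
        return branch(x)
```

Choosing a smaller exit point skips the deeper (and costlier) backbone layers altogether, which is precisely the computation that right-sizing trades against accuracy.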
[Figure 4: a branchy AlexNet with five exit points, composed of CONV, RELU, POOL and FC layers.]

[Figure 5: Edgent overview — (1) Offline Training Stage: train regression models for layer latency prediction and train the BranchyNet model; (2) Online Optimization Stage: select the exit point and the partition point given the latency requirement; (3) Co-Inference Stage: model co-inference on the edge server and the mobile device.]
At offline training stage, two initializations are performed: (1) profiling the mobile device and the edge server to generate regression-based performance prediction models (Sec. 3.2) for the different types of DNN layers (e.g., convolution, pooling, etc.); and (2) using BranchyNet to train DNN models with various exit points, thus enabling early exit. Note that the performance profiling is infrastructure-dependent, while the DNN training is application-dependent; thus, given the sets of infrastructures (i.e., mobile devices and edge servers) and applications, the two initializations only need to be done once, in an offline manner.

At online optimization stage, the DNN optimizer selects the best partition point and early-exit point of the DNN to maximize the accuracy while providing a performance guarantee on the end-to-end latency, based on three inputs: (1) the profiled layer latency prediction models and the BranchyNet-trained DNN models of various sizes; (2) the observed available bandwidth between the mobile device and the edge server; and (3) the pre-defined latency requirement. The optimization algorithm is detailed in Sec. 3.3.

At co-inference stage, according to the partition and early-exit plan, the edge server executes the layers before the partition point and the rest run on the mobile device.
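As an illustration of the co-inference stage, here is a minimal sketch of the split execution (layers abstracted as Python callables and the network transport omitted; the function names are our own, not Edgent's actual implementation):

```python
from typing import Callable, Sequence

Layer = Callable[[object], object]

def run_on_server(layers: Sequence[Layer], x: object, p: int) -> object:
    """The edge server executes the layers before the partition point p."""
    for layer in layers[:p - 1]:
        x = layer(x)
    return x  # intermediate output, sent back to the device over the network

def run_on_device(layers: Sequence[Layer], x: object, p: int) -> object:
    """The mobile device executes the remaining layers to produce the result."""
    for layer in layers[p - 1:]:
        x = layer(x)
    return x

def co_inference(layers: Sequence[Layer], x: object, p: int) -> object:
    # p == 1 means device-only execution, so nothing is uploaded at all.
    if p > 1:
        x = run_on_server(layers, x, p)  # upload input, compute front layers remotely
    return run_on_device(layers, x, p)
```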
3.2 Layer Latency Prediction

When estimating the runtime of a DNN, Edgent models the per-layer latency rather than modeling at the granularity of the whole DNN. This greatly reduces the profiling overhead, since there are only a limited number of layer classes. Through experiments, we observe that the latency of different layers is determined by various independent variables (e.g., input data size, output data size), which are summarized in Table 1. We also observe that the DNN model loading time has an obvious impact on the overall runtime; therefore, we further take the DNN model size as an input parameter to predict the model loading time. Based on the above inputs for each layer, we establish a regression model that predicts the latency of each layer from its profile. The final regression models of some typical layers are shown in Table 2 (size is in bytes and latency is in ms).

Table 1: The independent variables of the regression models

Layer Type                      Independent Variables
Convolution                     amount of input feature maps, (filter size/stride)^2 × (number of filters)
ReLU                            input data size
Pooling                         input data size, output data size
Local Response Normalization    input data size
Dropout                         input data size
Fully-Connected                 input data size, output data size
Model Loading                   model size
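As an illustration of how such a per-layer model can be fitted, here is a minimal NumPy sketch for a fully-connected layer, using the Table 1 features plus an intercept (the profiling samples are made up for illustration; the actual fitted models are those in Table 2):

```python
import numpy as np

# Hypothetical profiling samples for a fully-connected layer:
# columns are input data size (bytes), output data size (bytes), measured latency (ms).
samples = np.array([
    [4096.0,  1024.0, 0.9],
    [8192.0,  2048.0, 1.7],
    [16384.0, 4096.0, 3.2],
    [32768.0, 8192.0, 6.5],
])
X = np.hstack([samples[:, :2], np.ones((len(samples), 1))])  # features + intercept
y = samples[:, 2]

# Ordinary least squares: latency ≈ w0 * input_size + w1 * output_size + b
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_fc_latency(input_size: float, output_size: float) -> float:
    """Predict the latency (ms) of a fully-connected layer from its data sizes."""
    return float(w @ np.array([input_size, output_size, 1.0]))

print(predict_fc_latency(12288.0, 3072.0))
```

One such model is fitted per layer type and per platform (device and server), which is what Algorithm 1 later queries as f_device and f_edge.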
3.3 Joint Optimization on DNN Partition and DNN Right-Sizing

At online optimization stage, the DNN optimizer receives the latency requirement from the mobile device, and then searches for the optimal exit point and partition point of the trained branchynet model. The whole process is given in Algorithm 1. For a branchy model with $M$ exit points, we denote that the $i$-th exit point has $N_i$ layers; a larger index $i$ corresponds to a more accurate inference model of larger size. We use the above-mentioned regression models to predict $ED_j$, the runtime of the $j$-th layer when it runs on the device, and $ES_j$, the runtime of the $j$-th layer when it runs on the server; $D_p$ is the output data size of the $p$-th layer. Under a specific bandwidth $B$ and with input data $Input$, we calculate the whole runtime when the $p$-th layer is the partition point of the model with the $i$-th exit point as
\[
A_{i,p} = \sum_{j=1}^{p-1} ES_j + \sum_{k=p}^{N_i} ED_k + Input/B + D_{p-1}/B.
\]
When $p = 1$, the model runs only on the device, so $ES_j = 0$, $D_{p-1}/B = 0$ and $Input/B = 0$; when $p = N_i$, the model runs only on the server, so $ED_k = 0$ and $D_{p-1}/B = 0$.
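To make this expression concrete, consider a toy sub-model with $N_i = 4$ layers and made-up profiles: $ES_j = 1$ms and $ED_j = 10$ms for every layer, $Input = 10$KB, $B = 1$KB/ms, $D_1 = 5$KB and $D_2 = 2$KB. Evaluating some candidate partition points:

\[
\begin{aligned}
A_{i,1} &= ED_1 + ED_2 + ED_3 + ED_4 = 40\,\mathrm{ms} \quad (\text{device-only, no transfer}),\\
A_{i,2} &= ES_1 + (ED_2 + ED_3 + ED_4) + Input/B + D_1/B = 1 + 30 + 10 + 5 = 46\,\mathrm{ms},\\
A_{i,3} &= (ES_1 + ES_2) + (ED_3 + ED_4) + Input/B + D_2/B = 2 + 20 + 10 + 2 = 34\,\mathrm{ms}.
\end{aligned}
\]

Offloading pays off only once the intermediate output to be downloaded is small relative to the device-side computation it saves; the linear search over $p$ in Algorithm 1 simply evaluates all candidates and keeps the smallest.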
In this way, we can find the best partition point, i.e., the one yielding the smallest latency, for the model with the $i$-th exit point. Since the model partition does not affect the inference accuracy, we can then sequentially try the DNN inference models with different exit points (i.e., with different accuracy) and find the one that has the largest size while still satisfying the latency requirement. Note that since the regression models for layer latency prediction are trained beforehand, Algorithm 1 mainly involves linear search operations and runs very fast (no more than 1ms in our experiments).
Algorithm 1 Exit Point and Partition Point Search

Input:
  $M$: number of exit points in a branchy model
  $\{N_i \mid i = 1, \dots, M\}$: number of layers in the $i$-th exit point
  $\{L_j \mid j = 1, \dots, N_i\}$: layers of the $i$-th exit point
  $\{D_j \mid j = 1, \dots, N_i\}$: output data size of each layer of the $i$-th exit point
  $f(L_j)$: regression models for layer runtime prediction
  $B$: current wireless network uplink bandwidth
  $Input$: the input data of the model
  $latency$: the target latency of the user requirement
Output:
  Selection of exit point and its partition point

 1: Procedure
 2: for $i = M, \dots, 1$ do
 3:   Select the $i$-th exit point
 4:   for $j = 1, \dots, N_i$ do
 5:     $ES_j \leftarrow f_{edge}(L_j)$
 6:     $ED_j \leftarrow f_{device}(L_j)$
 7:   end for
 8:   $A_{i,p} \leftarrow \min_{p = 1, \dots, N_i} \left( \sum_{j=1}^{p-1} ES_j + \sum_{k=p}^{N_i} ED_k + Input/B + D_{p-1}/B \right)$, with $p$ the minimizer
 9:   if $A_{i,p} \le latency$ then
10:     return selection of exit point $i$ and its partition point $p$
11:   end if
12: end for
13: return NULL    ◁ cannot meet the latency requirement
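For concreteness, here is a direct Python transcription of Algorithm 1. It is a sketch under our own naming conventions (layer profiles as plain dicts, predictors f_edge/f_device standing in for the trained regression models), not Edgent's actual code; the $p = 1$ special case is handled explicitly and the $p = N_i$ case is folded into the general formula:

```python
from typing import Callable, Optional, Sequence, Tuple

LayerProfile = dict  # a layer's independent variables (cf. Table 1)

def search_exit_and_partition(
    layers_per_exit: Sequence[Sequence[LayerProfile]],  # exit i -> its N_i layer profiles
    output_sizes: Sequence[Sequence[float]],            # exit i -> D_j (bytes) per layer
    f_edge: Callable[[LayerProfile], float],            # predicted server latency (ms)
    f_device: Callable[[LayerProfile], float],          # predicted device latency (ms)
    bandwidth: float,                                   # B, uplink bandwidth (bytes/ms)
    input_size: float,                                  # model input size (bytes)
    latency_req: float,                                 # target latency (ms)
) -> Optional[Tuple[int, int]]:
    """Return (exit point, partition point), or None if the deadline cannot be met."""
    # Try exit points from the most accurate model down to the smallest one.
    for i in range(len(layers_per_exit) - 1, -1, -1):
        layers, sizes = layers_per_exit[i], output_sizes[i]
        es = [f_edge(l) for l in layers]    # ES_j
        ed = [f_device(l) for l in layers]  # ED_j
        best_p, best_lat = 1, float("inf")
        for p in range(1, len(layers) + 1):
            server = sum(es[:p - 1])        # layers before the partition point
            device = sum(ed[p - 1:])        # remaining layers
            # p == 1 runs everything on the device: nothing crosses the network.
            transfer = 0.0 if p == 1 else (input_size + sizes[p - 2]) / bandwidth
            total = server + device + transfer  # A_{i,p}
            if total < best_lat:
                best_p, best_lat = p, total
        if best_lat <= latency_req:
            return i, best_p  # largest model that meets the deadline
    return None  # no exit point satisfies the latency requirement
```

A toy invocation (all numbers made up) illustrates the interface:

```python
f_edge = lambda l: 0.02 * l["flops"]   # hypothetical fast-server cost model
f_device = lambda l: 0.5 * l["flops"]  # hypothetical slow-device cost model
exits = [[{"flops": 4.0}, {"flops": 2.0}],
         [{"flops": 4.0}, {"flops": 2.0}, {"flops": 2.0}]]
sizes = [[8000.0, 100.0], [8000.0, 4000.0, 100.0]]
print(search_exit_and_partition(exits, sizes, f_edge, f_device,
                                bandwidth=1000.0, input_size=3000.0,
                                latency_req=5.0))  # -> (1, 1)
```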
4 EVALUATION

We now present our preliminary implementation and evaluation results.

4.1 Prototype

We have implemented a simple prototype of Edgent to verify the feasibility and efficacy of our idea. To this end, we take a desktop PC to emulate the edge server; it is equipped with a quad-core Intel processor at 3.4 GHz and 8 GB of RAM, and runs Ubuntu. We further use a Raspberry Pi 3 tiny computer to act as the mobile device; the Raspberry Pi 3 has a quad-core ARM processor at 1.2 GHz and 1 GB of RAM. The available bandwidth between the edge server and the mobile device is controlled with the WonderShaper [10] tool. As the deep learning framework, we choose Chainer [11], which can well support branchy DNN structures.

For the branchynet model, we train a branchy AlexNet, based on the standard AlexNet model, for image recognition over the large-scale Cifar-10 dataset [8]. The branchy AlexNet has five exit points, as shown in Fig. 4 (Sec. 2); each exit point corresponds to a sub-model of the branchy AlexNet. Note that in Fig. 4 we only draw the convolution and fully-connected layers, ignoring the other layers for ease of illustration. The five sub-models have 12, 16, 19, 20 and 22 layers, respectively.

For the regression-based latency prediction models of each layer, the independent variables are shown in Table 1, and the obtained regression models are shown in Table 2.

4.2 Results

We deploy the branchynet model on the edge server and the mobile device to evaluate the performance of Edgent. Specifically, since both the pre-defined latency requirement and the available bandwidth play vital roles in Edgent's optimization logic, we evaluate the performance of Edgent under various latency requirements and available bandwidths.

We first investigate the effect of the bandwidth by fixing the latency requirement at 1000ms and varying the bandwidth from 50kbps to 1.5Mbps. Fig. 6(a) shows the best partition point and exit point under different bandwidths. While the best partition point may fluctuate, the best exit point gets higher as the bandwidth increases, meaning that higher bandwidth leads to higher accuracy. Fig. 6(b) shows that as the bandwidth increases, the model runtime first drops substantially and then ascends suddenly. This is reasonable, since the accuracy gets better while the latency still stays within the requirement as the bandwidth increases from 1.2Mbps to 2Mbps. Fig. 6(b) also shows that our proposed regression-based latency prediction approach can well estimate the actual DNN model runtime. We further fix the bandwidth at 500kbps and vary the latency requirement from 100ms to 1000ms. Fig. 6(c) shows the best partition point and exit point under different latency requirements. As expected, the best exit point gets higher as the latency requirement increases, meaning that a larger latency goal gives more room for accuracy improvement.
Figure 6: The results under different bandwidths and different latency requirements. (a) Selection of partition point and exit point under different bandwidths; (b) model runtime under different bandwidths, comparing the inference latency on the device, the target latency, and the actual vs. predicted co-inference latency; (c) selection of partition point and exit point under different latency requirements.
…Research and Development Program of China under grant No.2017YFB1001703; the National Science Foundation of China (No.U1711265); the Fundamental Research Funds for the Central Universities (No.17lgjc40), and the Program…