Quantized CNNs through the lens of PDEs
The research reported in this paper was supported by the Israel Innovation Authority through the Avatar consortium, and by grant no. 2018209 from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel. ME is supported by the Kreitman High-Tech scholarship.
Ido Ben-Yair, Gil Ben Shalom, Moshe Eliasof and Eran Treister
Computer Science Department, Ben-Gurion University of the Negev, Beer Sheva, Israel
E-mail: erant@cs.bgu.ac.il
∗ The authors Ido Ben-Yair, Gil Ben Shalom and Moshe Eliasof have contributed equally.
1 Introduction
2 Background
f(y, θ) ≈ c   (1)
for input-output pairs {(yi, ci)}, i = 1, . . . , s, from a certain data set Y × C. Typical
tasks include regression or classification, and the function f can be viewed as
an interpolation of the true function that we want the machine to learn.
In learning tasks involving images and videos, the learnt function f commonly includes a CNN that filters the input data. Among the many architectures, we focus on feed-forward networks of a common type known as ResNets [34]. In the simplest case, the forward propagation of a given example y through an N-layer ResNet can be written as
x0 = σ(Kopen y) (4)
where [0, T ] is an artificial time interval related to the depth of the network.
Relating spatial convolution filters with differential operators, the authors of
[56] propose a layer function representation F that renders Eq. (5) similar to
a parabolic diffusion-like PDE
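As a concrete illustration, the following is a minimal PyTorch sketch of one such symmetric, diffusion-like residual step, xj+1 = xj − h Kj⊤ σ(Kj xj), corresponding to the symmetric variant referred to later as Eq. (6); the class name, the step size h, and the single 3 × 3 convolution are illustrative assumptions, not the exact architecture of [56].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricBlock(nn.Module):
    """One forward-Euler step x_{j+1} = x_j - h * K^T sigma(K x_j), i.e. a
    parabolic, diffusion-like residual layer (sketch of the symmetric variant)."""

    def __init__(self, channels: int, h: float = 0.1):
        super().__init__()
        self.h = h
        self.K = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.K(x))  # sigma(K x)
        # conv_transpose2d with the same weights applies the adjoint operator K^T
        z = F.conv_transpose2d(z, self.K.weight, padding=1)
        return x - self.h * z
```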
Fig. 1 (panels): (a) Original distribution, (b) Integer representation, (c) Real quantized values.
qb(t) = round((2^b − 1) · t) / (2^b − 1),   (8)
where t is a real-valued scalar in one of the ranges [−1, 1] or [0, 1] for signed or unsigned quantization¹, respectively, and b is the number of bits used to represent t as an integer at inference time (during training, t and qb(t) are real-valued). During the forward pass, we force each quantized value into one of the ranges above by applying a clipping function to the weights and activations before applying Eq. (8):
wb = Qb(w) = αw qb−1(clip(w/αw, −1, 1)),
xb = Qb(x) = αx qb(clip(x/αx, 0, 1)).   (9)
Here, w, wb are the real-valued and quantized weight tensors, x, xb are the real-
valued and quantized input tensors, and αw , αx are their associated clipping
parameters (also called scales), respectively. An example of Eq. (9) applied to a weight tensor using 4-bit signed quantization is given in Fig. 1. During
training, we iterate on the floating-point values of the weights w, while both
the weights and activation maps are quantized in the forward pass (i.e., xb
and wb are passed through the network). The STE [5] is used to compute the gradient in the backward pass, where the derivative of qb is ignored and the derivatives w.r.t. wb are used in the SGD optimization to update w iteratively.
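As a minimal sketch of Eqs. (8)-(9) combined with the STE (the helper names RoundSTE, quantize_weights and quantize_activations are ours for illustration, not the authors' implementation):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through estimator: the derivative of q_b is
    ignored in the backward pass, as described in the text."""
    @staticmethod
    def forward(ctx, t, levels):
        return torch.round(levels * t) / levels

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # identity gradient w.r.t. t, none for `levels`

def quantize_weights(w, alpha_w, b):
    """Q_b(w) of Eq. (9), signed: clip w/alpha_w to [-1, 1] and quantize with
    b-1 magnitude bits (one bit is reserved for the sign)."""
    t = torch.clamp(w / alpha_w, -1.0, 1.0)
    return alpha_w * RoundSTE.apply(t, 2 ** (b - 1) - 1)

def quantize_activations(x, alpha_x, b):
    """Q_b(x) of Eq. (9), unsigned: clip x/alpha_x to [0, 1] (ReLU outputs)."""
    t = torch.clamp(x / alpha_x, 0.0, 1.0)
    return alpha_x * RoundSTE.apply(t, 2 ** b - 1)
```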
The quantization scheme in Eq. (9) involves the clipping parameters αw
and αx , also called scales, that are used to translate the true value of each
integer in the network to its floating-point value. This way, at inference time,
two integers from different layers can be multiplied in a meaningful way that
is faithful to the original floating-point values [48, 37]. These scales also control
the quantization error, and they need to be chosen according to the values propagating through the network and the bit allocation b, for both the weights and the activations. As shown in [4], the quantization error can be evaluated by
MSE(x, xb) = (1/n) Σ_{i=1}^{n} (xi − (xb)i)²,   (10)
and it is reasonable to reduce the MSE to maximize the accuracy of the quan-
tized network. In particular, the scales αw , αx can be computed by minimizing
the MSE based on sampled batches of the training data [4].
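For example, a simplified calibration sketch that grid-searches αx by minimizing Eq. (10) on a sampled batch could look as follows (the grid range and the unsigned setting are our assumptions, not the exact procedure of [4]):

```python
import torch

def quantization_mse(x, alpha, b):
    """MSE(x, x_b) of Eq. (10) for a candidate clipping scale alpha (unsigned case)."""
    levels = 2 ** b - 1
    t = torch.clamp(x / alpha, 0.0, 1.0)
    x_b = alpha * torch.round(levels * t) / levels
    return torch.mean((x - x_b) ** 2)

def calibrate_scale(x, b, num_candidates=100):
    """Pick the scale that minimizes the quantization MSE on a sampled batch x."""
    candidates = torch.linspace(0.05, 1.0, num_candidates) * x.abs().max()
    errors = torch.stack([quantization_mse(x, a, b) for a in candidates])
    return candidates[torch.argmin(errors)]
```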
Alternatively, the works [47, 25], which we follow here, introduced an ef-
fective gradient-based optimization to find the clipping values αx , αw for each
layer. Given Eq. (9), the gradients w.r.t. αw and αx can be approximated using the STE [47, 25]. For the activation maps, for example, this resolves to:
∂xb/∂αx =  0,              if x ≤ 0,
           1,              if x ≥ αx,       (11)
           xb/αx − x/αx,   if 0 < x < αx.
¹ We assume that the ReLU activation function is used in between any convolution operator, resulting in non-negative activation maps, which can be quantized using an unsigned scheme. If a different activation function that is not non-negative is used, like tanh(), signed quantization should be used instead.
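A possible realization of Eq. (11) as a custom autograd function is sketched below; the class name, the treatment of the input gradient by the STE, and the assumption that αx is a learnable scalar tensor are ours, following the spirit of [47, 25] rather than reproducing their code.

```python
import torch

class LearnedScaleActQuant(torch.autograd.Function):
    """Unsigned activation quantization with a learnable scale alpha_x.
    Backward implements Eq. (11) for d x_b / d alpha_x and an STE for d x_b / d x."""

    @staticmethod
    def forward(ctx, x, alpha, b):
        # alpha is expected to be a scalar tensor (e.g. an nn.Parameter)
        levels = 2 ** b - 1
        t = torch.clamp(x / alpha, 0.0, 1.0)
        x_b = alpha * torch.round(levels * t) / levels
        ctx.save_for_backward(x, alpha, x_b)
        return x_b

    @staticmethod
    def backward(ctx, grad_out):
        x, alpha, x_b = ctx.saved_tensors
        lower, upper = x <= 0, x >= alpha
        middle = ~(lower | upper)
        # Eq. (11): 0 below the range, 1 above it, x_b/alpha - x/alpha inside it
        d_alpha = torch.where(upper, torch.ones_like(x),
                              torch.where(middle, (x_b - x) / alpha,
                                          torch.zeros_like(x)))
        grad_alpha = (grad_out * d_alpha).sum().view_as(alpha)
        grad_x = grad_out * middle.to(grad_out.dtype)  # STE inside the clipping range
        return grad_x, grad_alpha, None
```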
The quantization scheme in Eq. (9) is applied to both the weights and acti-
vation (feature) maps throughout the network. The weights are known dur-
ing training and are fixed during inference. Thus, their post-quantization be-
haviour is known and we can regularize them during training. On the other
hand, the activations are available only at inference time and depend on the
input data. Therefore, they are prone to erratic behaviour given unseen, pos-
sibly noisy data. For this reason, we view the difference ηi = xi − (xb)i at the i-th layer as noise generated by the quantization of the activations.
In what follows, we propose a method to design a network that prevents the
amplification of this noise.
One of the most popular and effective approaches for image denoising is the
Total Variation method suggested by Rudin, Osher, and Fatemi [46]. This
approach seeks to regularize the data loss with the TV norm of the image,
‖u‖TV(Ω) = ∫Ω ‖∇u(x)‖ dx,   (13)
where for the norm ‖·‖ one can use ℓ2 or ℓ1, applied to the gradient of the image. In this work we consider the ℓ1 norm, and regularize the feature maps to have a low TV norm during training.
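For concreteness, an ℓ1 (anisotropic) discretization of Eq. (13) on a feature-map tensor can be written as the following sketch; the reduction by the mean and the regularization weight are illustrative choices.

```python
import torch

def tv_l1(x: torch.Tensor) -> torch.Tensor:
    """Anisotropic l1 total variation of a feature-map tensor of shape (N, C, H, W):
    the mean absolute forward difference along the two spatial axes."""
    dh = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
    dw = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    return dh + dw

# Usage sketch: add the term to the data loss with a small weight lambda_tv
# loss = task_loss + lambda_tv * sum(tv_l1(f) for f in selected_feature_maps)
```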
Since a low TV norm is known to encourage piecewise constant images, we incur fewer errors when quantizing such images, because they typically contain fewer gray (mid-range) values. Furthermore, extreme values influence the MSE in Eq. (10) the most, and one will typically choose a smaller clipping parameter αx as the number of bits decreases, to better approximate the distribution of values by the quantized distribution [4]. Since these two factors counteract each other, we wish to eliminate extreme values (outliers) from the distribution as much as possible, so that the optimization of αx can better fit the distributions. Smoothing the feature maps essentially averages pixels, which promotes more uniform distributions and reduces outliers, as demonstrated in Fig. 2.
However, in our experience, regularizing solely towards smooth feature maps also harms the performance of the network, since it de-emphasizes other important parts of the loss function during training. We have observed experimentally that following an aggressive application of the TV regularization in addition to the loss function, the 3 × 3 filters in the network gravitate towards learning low-pass filters rather than serving as spatial feature extractors, like edge detectors. To remedy this, we add TV smoothing layers to the network, acting as non-pointwise activation layers. This addition results in piecewise smooth feature maps with significantly less noise, allowing other filters in the network to focus on feature extraction and minimization of the loss function. As we show next, this can be done at a lower cost compared to the convolution operations.
In practice, together with every non-linear activation function, we also apply an edge-aware denoising step. The denoising step is applied as
(a) Original feature map, [min, max] = [−3.78, 3.11]; (b) TV-smoothed feature map, [min, max] = [−2.05, 1.25]; (c) Original dist. ∈ [−3.78, 3.11]; (d) TV-smoothed dist. ∈ [−2.05, 1.25].
Fig. 2: An example of a feature map from the 4-th layer of a ResNet50 encoder for an image from the Cityscapes data set. (a) and (b) show the feature map before
and after 3 iterations of the TV-smoothing operator in Eq. (14), with γ 2 = 0.1.
It is evident that fine details are preserved after the rather light smoothing.
(c) and (d) show the corresponding value distributions. It is evident that the
TV-smoothing eliminates outliers in addition to denoising the image a bit.
After a 4-bit signed quantization with α = 0.5, the MSE between each map and its quantized version is 0.16 for the original map and 0.05 for the TV-smoothed map. Hence, the TV-smoothing outputs a feature map that is better suited for quantization.
using three TV layers, while in our network we apply one TV layer in every
step. Indeed, a single TV step has a lesser effect than three, but throughout
the forward application of our network, the smoothness of the feature maps accumulates with the layers, leading to a similar effect on the feature maps of the deeper layers.
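Since Eq. (14) itself is not reproduced in this excerpt, the sketch below shows a generic explicit, edge-aware (regularized TV-flow) smoothing step with a stabilization constant γ², matching the γ² = 0.1 mentioned in the caption of Fig. 2; the step size h and the particular discretization are our assumptions and may differ from the layer actually used.

```python
import torch
import torch.nn.functional as F

def tv_smooth_step(x, h=0.2, gamma2=0.1):
    """One explicit, edge-aware smoothing step on a (N, C, H, W) feature map:
    x <- x + h * div( grad(x) / sqrt(|grad(x)|^2 + gamma^2) ).
    Applying it a few times (e.g. 3 iterations, as in Fig. 2) accumulates smoothing."""
    # forward differences, zero-padded at the right/bottom boundary
    dx = F.pad(x[:, :, :, 1:] - x[:, :, :, :-1], (0, 1, 0, 0))
    dy = F.pad(x[:, :, 1:, :] - x[:, :, :-1, :], (0, 0, 0, 1))
    mag = torch.sqrt(dx ** 2 + dy ** 2 + gamma2)
    px, py = dx / mag, dy / mag
    # divergence via backward differences (adjoint of the forward differences above)
    px_pad = F.pad(px, (1, 0, 0, 0))
    py_pad = F.pad(py, (0, 0, 1, 0))
    div = (px_pad[:, :, :, 1:] - px_pad[:, :, :, :-1]) + \
          (py_pad[:, :, 1:, :] - py_pad[:, :, :-1, :])
    return x + h * div
```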
where Qb denotes the quantization operation in Eq. (9). Let us denote the
feature maps of the non-quantized network by x̂j (i.e., assuming qb (x) = x in
Eq. (9)), and let the error between corresponding activations be denoted by
ηj = xj − x̂j . We consider the same weights K between the two architectures,
whether quantized or not, and also assume that the absolute quantization
error for any scalar is bounded by some δ. We analyze the propagation of the
quantization error ηj+1 as a function of ηj .
We start unwrapping the block in Eq. (15). First, we subtract x̂j+1 from
both sides, and replace the outer quantization with the error term ηj1 .
Next, we remove the other instance of Qb and add another error term ηj2 :
ηj+1 = (I − h Kj⊤ σ′(Kj x̂j) Kj) ηj + ηj1 − h Kj⊤ ηj2.   (18)
The key ingredient of the analysis above is that ηj1 and ηj2 are fixed and bounded for every layer. On the other hand, it is the Jacobian matrix of the block, Jj = I − h Kj⊤ σ′(Kj x̂j) Kj, that multiplies ηj at every iteration and propagates the previous error into the next block. Assuming that σ is non-decreasing, the matrix Kj⊤ σ′(·) Kj is positive semi-definite, and with a proper choice of Kj and h, we can force ρ(Jj) < 1, so that the error decays. To ensure this forward stability we must set h < 2(L‖Kj‖₂²)⁻¹ for every layer j in the network, where L is the upper bound for σ′(·) [1]. This is generally possible in ResNets only if we use the symmetric variant in Eq. (6). A similar stability result was obtained in [56, 62], but not in the context of quantization.
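In practice, one way to respect this bound is to estimate ‖Kj‖₂ by power iteration on Kj⊤Kj and choose h below 2(L‖Kj‖₂²)⁻¹; the sketch below illustrates this, where the number of iterations, the safety factor, and the Lipschitz bound L of σ′ are assumed inputs.

```python
import torch
import torch.nn.functional as F

def conv_spectral_norm(weight, spatial_shape, iters=50):
    """Estimate ||K_j||_2 of a 3x3 convolution by power iteration on K^T K."""
    x = torch.randn(1, weight.shape[1], *spatial_shape)
    for _ in range(iters):
        x = x / x.norm()
        y = F.conv2d(x, weight, padding=1)            # K x
        x = F.conv_transpose2d(y, weight, padding=1)  # K^T K x
    return x.norm().sqrt().item()  # ||x|| ~ lambda_max(K^T K) = ||K||_2^2

def stable_step_size(weight, spatial_shape, lip_sigma=1.0, safety=0.9):
    """Choose h < 2 / (L * ||K_j||_2^2), the forward-stability bound of the text."""
    s = conv_spectral_norm(weight, spatial_shape)
    return safety * 2.0 / (lip_sigma * s ** 2)
```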
The classical CNN architecture typically applies several layers like Eq. (2).
However, to obtain better representational capabilities for the network, the
number of channels typically increases every few layers. In CNNs, this is typi-
cally accompanied by a down-sampling operation performed by a pooling layer
or a strided convolution. In such cases, the residual equation Eq. (2) cannot
be used, since the update F (xj , θ j ) does not match xj in terms of dimensions.
For this reason, ResNets typically include 3-4 steps like
xj+1 = Kj xj + F (xj , θ j ), or xj+1 = F (xj , θ j ), (19)
throughout the network to match the changing resolution and number of chan-
nels. These layers are harder to control than Eq. (2) and (6) if one wishes to
ensure stability.
In this work we are interested in testing and demonstrating the theoretical
property described in Sec. 4 empirically, using fully stable networks (except
for the very first and last layers). To this end, when increasing the channel space from ncin channels for xj to ncout channels for xj+1, we simply concatenate
the output of the step with the necessary channels from the current iteration
to maintain the same dimensions. That is, we apply the following:
xj+1 = [ xj + hF(xj, θj) ; (xj)1:ncout−ncin ],   (20)
where (xj )1:ncout −ncin are the first ncout − ncin channels of xj . This assumes
that each of the channel changing steps satisfies ncin ≤ ncout ≤ 2ncin , which is
quite common in CNNs. This way, we only use the symmetric dynamics as in
Eq. (6) throughout the network, which is guaranteed to be stable for a proper
choice of parameters. Finally, we do not apply strides as part of Eq. (20), and
to reduce the spatial resolution of the feature maps we simply apply average pooling following Eq. (20), which is a stable, parameter-less operation.
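A sketch of this channel-changing step is given below: Eq. (20) followed by average pooling where the spatial resolution is reduced. Here, f_step stands for the symmetric update F(xj, θj) of Eq. (6) and is a placeholder supplied by the caller.

```python
import torch
import torch.nn.functional as F

def channel_change_step(x, f_step, nc_out, h=0.1, pool=True):
    """Eq. (20): apply the stable residual update and concatenate the first
    (nc_out - nc_in) channels of x so that the output has nc_out channels.
    Average pooling (not strides) reduces the spatial resolution when needed."""
    nc_in = x.shape[1]
    assert nc_in <= nc_out <= 2 * nc_in, "Eq. (20) assumes nc_in <= nc_out <= 2*nc_in"
    y = x + h * f_step(x)                              # the usual symmetric step
    y = torch.cat([y, x[:, : nc_out - nc_in]], dim=1)  # pad with copied channels
    return F.avg_pool2d(y, kernel_size=2) if pool else y
```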
5 Numerical experiments
Then, after adding the TV layers, we changed the initial learning rate to 0.001
and the batch size to 4 and trained for 10 more epochs. In both phases, the
cross-entropy loss is used. We measure both the accuracy and mean intersec-
tion over union (mIoU) metrics, which are summarized in Tab. 3. We see that
whether the network is quantized or not, incorporating the TV layers into an existing architecture leads to favorable performance.
network - one at 4 bits and the other at 32 bits. This divergence is summa-
rized as the per-entry MSE between the activation maps throughout the two
networks. Tables 4 and 5 show that the divergence of the two runs is greater in
the non-symmetric (original) network variants, as predicted by the theoretical
analysis. We show further evidence in Fig. 3, where we plot the same MSE
difference along the layers of each network. We note an intriguing observation
in these plots: the difference decreases where strided operations are applied.
This is not surprising as such down-sampling is equivalent to averaging out
quantization errors for which cancellation occurs. Somewhat surprisingly, the
networks which were trained using 4-bit activation maps maintain their accu-
racy when removing the quantization. This is consistent for both the symmetric
and non-symmetric networks. Still, having the stable behaviour of the network
is a desired property, especially for sensitive applications, as input data in real
life may be unpredictable.
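This per-layer divergence can be measured, for instance, with forward hooks that record the activation maps of the two runs on the same input; the sketch below is a simplified version of that measurement (the module lists and the normalization by the mean over entries are our assumptions).

```python
import torch

def collect_activations(model, x, modules):
    """Record the outputs of the given modules during one forward pass."""
    acts, handles = [], []
    for m in modules:
        handles.append(m.register_forward_hook(
            lambda _mod, _inp, out: acts.append(out.detach())))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return acts

def per_layer_mse(model_q, model_fp, x, modules_q, modules_fp):
    """Per-entry MSE between corresponding activation maps of the quantized (e.g. 4-bit)
    and full-precision runs, one value per layer (as plotted in Fig. 3)."""
    acts_q = collect_activations(model_q, x, modules_q)
    acts_fp = collect_activations(model_fp, x, modules_fp)
    return [torch.mean((aq - af) ** 2).item() for aq, af in zip(acts_q, acts_fp)]
```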
In this section we employ the graph networks PDE-GCND (sym.) and PDE-
GCND (non-sym.) from Eq. (23)-(24) for semi-supervised node-classification on
the Cora, CiteSeer and PubMed data-sets. Namely, we measure the impact of
symmetry on the network’s accuracy, as well as its ability to maintain similar
behaviour with respect to activation similarity under activation quantization
[Fig. 3 panels: per-layer MSE (y-axis) vs. layer index (x-axis); legends: ResNet56 vs. Sym. ResNet56 (top and middle), MobileNetV2 vs. Sym. MobileNetV2 (bottom).]
Fig. 3: Per-layer MSE between the activation maps of symmetric and non-
symmetric network pairs - in each pair one network has quantized activation
maps and the other does not. Values are normalized per-layer to account for
the different dimensions of each layer. In all the cases, the symmetric variants
(in red) exhibit a bounded divergence, while the non-symmetric networks di-
verge as the information propagates through the layers (in blue), and hence
they are unstable. Both networks in each pair achieve similar classification
accuracy. Top to bottom: (a) ResNet56 architecture for the CIFAR-10 data-
set (b) ResNet56 architecture for the CIFAR-100 data-set (c) MobileNetV2
architecture for the CIFAR-100 data-set.
6 Summary
References
1. Alt, T., Peter, P., Weickert, J., Schrader, K.: Translating numerical concepts for pdes
into neural architectures. arXiv preprint arXiv:2103.15419 (2021)
2. Alt, T., Schrader, K., Augustin, M., Peter, P., Weickert, J.: Connections between numer-
ical algorithms for pdes and neural networks. arXiv preprint arXiv:2107.14742 (2021)
3. Ambrosio, L., Tortorelli, V.M.: Approximation of functionals depending on jumps by elliptic functionals via Γ-convergence. Communications on Pure and Applied Mathematics 43, 999–1036 (1990)
4. Banner, R., Nahshan, Y., Soudry, D.: Post training 4-bit quantization of convolutional
networks for rapid-deployment. NeurIPS pp. 7948–7956 (2019)
5. Bengio, Y.: Estimating or propagating gradients through stochastic neurons for condi-
tional computation. arXiv (2013)
6. Blalock, D., Ortiz, J., Frankle, J., Guttag, J.: What is the state of neural network
pruning? MLSys (2020)
7. Bodner, B.J., Shalom, G.B., Treister, E.: Gradfreebits: Gradient free bit allocation for
dynamic low precision neural networks. IJCAI Workshop on Data Science Meets Opti-
mization (2021)
8. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine
learning. Siam Review 60(2), 223–311 (2018)
9. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan,
A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners.
arXiv preprint arXiv:2005.14165 (2020)
10. Cai, W., Li, W.: Weight normalization based quantization for deep neural network
compression. CoRR abs/1907.00593 (2019). URL http://arxiv.org/abs/1907.00593
11. Chan, T., Vese, L.: Active contours without edges. IEEE Transactions on Image Pro-
cessing 10(2), 266–277 (2001). DOI 10.1109/83.902291
12. Chaudhari, P., Oberman, A., Osher, S., Soatto, S., Carlier, G.: Deep relaxation: partial
differential equations for optimizing deep neural networks. Research in the Mathematical
Sciences 5(3), 30 (2018)
13. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for
semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
14. Chen, M., Wei, Z., Huang, Z., Ding, B., Li, Y.: Simple and deep graph convolutional
networks. In: 37th International Conference on Machine Learning (ICML), vol. 119, pp.
1725–1735 (2020)
15. Chen, T.Q., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differen-
tial equations. In: Advances in Neural Information Processing Systems, pp. 6571–6583
(2018)
16. Chen, Y., Xie, Y., Song, L., Chen, F., Tang, T.: A survey of accelerator architectures
for deep neural networks. Engineering (2020)
17. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: Model compression and acceleration for
deep neural networks: The principles, progress, and challenges. IEEE Signal Processing
Magazine 35(1), 126–136 (2018)
18. Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.J., Srinivasan, V., Gopalakrishnan, K.: PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018)
19. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke,
U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding.
In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2016)
20. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale
hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 248–255 (2009). DOI 10.1109/CVPR.2009.5206848
21. E, W.: A proposal on machine learning via dynamical systems. Communications in
Mathematics and Statistics 5(1), 1–11 (2017)
22. Eliasof, M., Haber, E., Treister, E.: PDE-GCN: Novel architectures for graph neural
networks motivated by partial differential equations (2021)
23. Eliasof, M., Treister, E.: DiffGCN: Graph convolutional networks via differential opera-
tors and algebraic multigrid pooling. 34th Conference on Neural Information Processing
Systems (NeurIPS 2020), Vancouver, Canada. (2020)
24. Ephrath, J., Eliasof, M., Ruthotto, L., Haber, E., Treister, E.: LeanConvNets: Low-cost
yet effective convolutional neural networks. IEEE Journal of Selected Topics in Signal
Processing 14(4), 894–904 (2020)
25. Esser, S.K., McKinstry, J.L., Bablani, D., Appuswamy, R., Modha, D.S.: Learned step
size quantization. arXiv preprint arXiv:1902.08153 (2019)
26. Gholami, A., Keutzer, K., Biros, G.: ANODE: Unconditionally accurate memory-
efficient gradients for neural odes. In: IJCAI (2019)
27. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
28. Goodfellow, I., McDaniel, P., Papernot, N.: Making machine learning robust against
adversarial inputs. Communications of the ACM 61(7), 56–66 (2018)
29. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial exam-
ples. In: 3rd International Conference on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
30. Gou, J., Yu, B., Maybank, S., Tao, D.: Knowledge distillation: A survey. CoRR
abs/2006.05525 (2020). URL https://arxiv.org/abs/2006.05525
31. Haber, E., Lensink, K., Triester, E., Ruthotto, L.: Imexnet: A forward stable deep neural
network. arXiv preprint arXiv:1903.02639 (2019)
32. Haber, E., Ruthotto, L.: Stable architectures for deep neural networks. Inverse Problems
pp. 1–21 (2017)
33. Han, S.: Deep compression: Compressing deep neural networks with pruning, trained
quantization and huffman coding. arXiv (2016)
34. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
770–778 (2016)
35. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. Intell. 17, 185–203 (1981)
36. Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y.,
Pang, R., Vasudevan, V., et al.: Searching for MobileNetv3. In: Proceedings of the IEEE
International Conference on Computer Vision, pp. 1314–1324 (2019)
37. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural
networks: Training neural networks with low precision weights and activations. Journal
of Machine Learning Research 18, 187:1–187:30 (2017)
38. Jakubovitz, D., Giryes, R.: Improving dnn robustness to adversarial attacks using jaco-
bian regularization. In: Proceedings of the European Conference on Computer Vision
(ECCV), pp. 514–529 (2018)
39. Jin, Q., Yang, L., Liao, Z., Qian, X.: Neural network quantization with scale-adjusted
training. In: British Machine Vision Conference (BMVC) (2020)
40. Jung, S., Son, C., Lee, S., Son, J., Han, J.J., Kwak, Y., Hwang, S.J., Choi, C.: Learning
to quantize deep networks by optimizing quantization intervals with task loss. pp.
4350–4359 (2019)
41. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
42. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional net-
works. The International Conference on Learning Representations (ICLR) (2017)
43. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
44. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolu-
tional neural networks. Advances in Neural Information Processing Systems (NIPS)
61, 1097–1105 (2012)
45. LeCun, Y., Boser, B.E., Denker, J.S.: Handwritten digit recognition with a back-
propagation network. In: Advances in neural information processing systems, pp. 396–
404 (1990)
46. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60, 259–268 (1992)
47. Li, Y., Dong, X., Wang, W.: Additive powers-of-two quantization: An efficient non-
uniform discretization for neural networks. International Conference on Learning Rep-
resentations (ICLR) (2019)
48. LLC, G.: gemmlowp: a small self-contained low-precision gemm library (1999). URL
https://github.com/google/gemmlowp
49. Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method
to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 2574–2582 (2016)
50. Nagel, M., van Baalen, M., Blankevoort, T., Welling, M.: Data-free quantization through
weight equalization and bias correction. pp. 1325–1334 (2019)
51. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241 (2015)
52. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE
Trans. Pattern Anal. Mach. Intell. 12, 629–639 (1990)
53. Raina, R., Madhavan, A., Ng, A.Y.: Large-scale deep unsupervised learning using graph-
ics processors. In: 26th ICML, pp. 873–880. ACM, New York, USA (2009). DOI
10.1145/1553374.1553486
54. Ren, P., Xiao, Y., Chang, X., Huang, P., Li, Z., Chen, X., Wang, X.: A comprehensive
survey of neural architecture search: Challenges and solutions. CoRR abs/2006.02903
(2020). URL https://arxiv.org/abs/2006.02903
55. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algo-
rithms. Physica D: Nonlinear Phenomena 60, 259–268 (1992)
56. Ruthotto, L., Haber, E.: Deep neural networks motivated by partial differential equa-
tions. Journal of Mathematical Imaging and Vision pp. 1–13 (2019)
57. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted
residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 4510–4520 (2018)
58. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective
classification in network data. AI magazine 29(3), 93–93 (2008)
59. Yang, Z., Cohen, W., Salakhudinov, R.: Revisiting semi-supervised learning with graph
embeddings. In: International conference on machine learning, pp. 40–48. PMLR (2016)
60. Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., Xin, J.: Blended coarse gradient descent
for full quantization of deep neural networks. Research in the Mathematical Sciences
6(1), 1–23 (2019)
61. Zhang, D.: Lq-nets: Learned quantization for highly accurate and compact deep neural
networks. ECCV (2018)
62. Zhang, L., Schaeffer, H.: Forward stability of resnet and its variants. Journal of Math-
ematical Imaging and Vision 62(3), 328–351 (2020)
63. Zhao, R., Hu, Y., Dotzel, J., De Sa, C., Zhang, Z.: Improving neural network quantiza-
tion without retraining using outlier channel splitting. ICML 97, 7543–7552 (2019)
64. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low bitwidth
convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160
(2016). URL http://arxiv.org/abs/1606.06160
65. Zhou, Y., Moosavi-Dezfooli, S.M., Cheung, N.M., Frossard, P.: Adaptive quantization
for deep neural network. In: AAAI, pp. 4596–4604 (2018)