
Domain Generalization on Constrained Platforms: On the Compatibility with Pruning Techniques

Baptiste Nguyen(1,2), Pierre-Alain Moëllic(1,2), and Sylvain Blayac(3)

1 CEA Tech, Centre CMP, Equipe Commune CEA Tech - Mines Saint-Etienne, 13541 Gardanne, France
  {baptiste.nguyen,pierre-alain.moellic}@cea.fr
2 Univ. Grenoble Alpes, CEA, Leti, 38000 Grenoble, France
3 Mines Saint-Etienne, CMP, Department of Flexible Electronics, 13541 Gardanne, France
  blayac@emse.fr

Abstract. The wide deployment of Machine Learning models is an essential evolution of Artificial Intelligence, predominantly through the porting of deep neural networks to constrained hardware platforms such as 32-bit microcontrollers. For many IoT applications, the deployment of such complex models is hindered by two major issues that are usually handled separately. For supervised tasks, training a model requires a large quantity of labelled data, which is expensive to collect or even intractable in many real-world applications. Furthermore, the inference process demands memory, computing and energy capacities that are not available on typical IoT platforms. We jointly tackle these issues by investigating the efficiency of model pruning techniques under the scope of the single domain generalization problem. Our experiments show that a pruned neural network retains the benefit of training with single domain generalization algorithms, despite a larger impact of pruning on its performance. We emphasize the importance of the pruning method, in particular the choice between structured and unstructured pruning, as well as the benefit of data-agnostic heuristics that preserve their properties in the single domain generalization setting.

Keywords: Deep learning · Neural network pruning · Single domain generalization · Embedded systems

1 Introduction

For many IoT domains and applications, edge computing reduces bandwidth requirements and unnecessary network communications that may raise critical security threats. Due to the success of deep learning across a large variety of application domains, deploying state-of-the-art deep neural network models on edge devices is a growing field of research [22]. However, this deployment faces several challenges of different natures, the most critical ones being related to the training data and hardware constraints.

Fig. 1. Illustration of the scope of our study. Pruning and single domain generalization
techniques are jointly used to train a model on a source domain and test on an unseen
target domain. The model must fit in a constrained MCU.

First, collecting and managing large-scale real-world datasets can be challenging [12], extremely time-consuming, and may require large infrastructure and human expertise. These difficulties hinder the use of neural networks, which require large amounts of data for their training. A common solution is to train a model on a publicly available dataset similar to the target use case or to create simulated data. Since these datasets cannot perfectly substitute for real-world data, techniques such as domain adaptation or generalization are extensively studied in the AI community. These approaches aim to learn, from a source data distribution, a model that performs well on a different (but related) target data distribution.
Second, the memory and computational requirements of an inference limit the deployment on typical IoT platforms such as 32-bit MCUs. For example, the state-of-the-art InceptionTime model [8] for time series classification has 400K parameters and requires approximately 100 MFLOPS per inference, which may be prohibitive for real-time applications on most ARM Cortex-M MCUs. This incompatibility led to the emergence of more efficient architectures (e.g. MobileNet [6]) and of compression techniques, such as quantization or pruning, that aim at removing parameters from an over-parameterized model.
This work evaluates the compatibility of model pruning with the single domain generalization problem, as illustrated in Fig. 1: a neural network is trained on a single source dataset and tested on multiple unseen but related datasets. Our contributions are as follows:
– We perform several experiments on two typical benchmarks (digit recognition and human activity recognition) with state-of-the-art pruning and domain generalization techniques.
– We show that, on the whole, pruning remains efficient in the domain generalization setting, even at strong compression rates.
– However, we highlight the importance of the type of pruning as well as of the pruning heuristics, in particular the distinctions between structured and unstructured pruning and between data-agnostic and data-dependent heuristics.
To the best of our knowledge, this work is the first to focus on model compression techniques in a single domain generalization setting, two essential challenges for modern AI-based IoT systems. For reproducibility purposes and further experiments, code and experiments are publicly available at https://gitlab.emse.fr/b.nguyen/randconvpruning.

2 Background

2.1 Single Domain Generalization

Single domain generalization (hereafter, SDG) is a challenging setting where a model is trained on a single source dataset with the objective of generalizing to unseen but related target datasets. Traditionally, the target domain represents a real-world application with very little available training data (e.g. anomaly detection from sensors). A source domain is selected according to its closeness to the target domain and the ability to gather a sufficient amount of labelled data (e.g. simulated data). The most common way to tackle SDG is data augmentation, for example with a combination of standard input transformations found by an evolutionary algorithm, as in [21]. Adversarial data augmentation is the most popular of these approaches: it alternates between training phases and data augmentation phases, in which the dataset is augmented with samples from a fictitious target domain that is "hard" under the current model [14].
As a reference method, we use the work of Xu et al. [23], which recently reached state-of-the-art performance with a scalable approach. For image classification, the authors start from the observation that semantics often rely more on object shapes than on local textures, while local textures are one of the main sources of difference between domains (as with the dogs in Fig. 1). To learn texture-invariant representations, they augment the training dataset with random convolutions that "create an infinite number of new domains" [23]. At each training iteration, images are augmented with a probability p up to three times; each augmentation convolves the image with a kernel of randomly generated size and values, creating copies of the input image with different textures. Furthermore, they introduce a consistency loss (based on the Kullback-Leibler divergence) to encourage the model to predict the same output for all augmented images. A parameter λ tunes the contribution of the consistency loss to the global loss.
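For concreteness, the following PyTorch sketch illustrates this augmentation under our own simplifying assumptions (depthwise random kernels and a mean-prediction KL consistency term; function names such as rand_conv are ours, and the reference implementation of [23] differs in its details):

```python
import random
import torch
import torch.nn.functional as F

def rand_conv(x, kernel_sizes=(1, 3, 5, 7), p=0.5):
    """Texture randomization: with probability p, keep the batch unchanged;
    otherwise convolve it with a random depthwise kernel of random size."""
    if random.random() < p:
        return x
    k = random.choice(kernel_sizes)
    c = x.shape[1]
    weight = torch.randn(c, 1, k, k, device=x.device) / (k * k)
    return F.conv2d(x, weight, padding=k // 2, groups=c)  # same spatial size

def consistency_loss(logits_list):
    """KL divergence between each view's prediction and the mean prediction."""
    probs = [F.softmax(l, dim=1) for l in logits_list]
    mean_p = torch.stack(probs).mean(dim=0)
    return sum(F.kl_div(mean_p.log(), q, reduction="batchmean")
               for q in probs) / len(probs)

# One training step: three augmented views, task loss plus the weighted
# consistency term (lambda_ plays the role of the factor in the text).
# views = [rand_conv(x) for _ in range(3)]
# logits = [model(v) for v in views]
# loss = (sum(F.cross_entropy(l, y) for l in logits) / 3
#         + lambda_ * consistency_loss(logits))
```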

2.2 Neural Network Pruning

Nearly all pruning methods derive from [2], removing parameters according to a score based on a pruning heuristic. Pruning approaches can therefore be distinguished along four features:
– Sparsity structure: unstructured pruning [3] removes individual parameters, producing highly efficient sparse neural networks. In contrast, structured pruning [11] removes weights in groups, e.g. entire neurons or filters. Furthermore, some methods [2,10] prune a fixed fraction of weights across the whole model (global pruning) while other methods [3,5] prune a fraction of weights within each layer of the network (local pruning); see the sketch after this list.

Table 1. Mapping of the pruning algorithms used in our study.

Pruning techniques    Types                      [11]  [5]   [3]   [2]   [15]  [10]  [20]  [17]
Sparsity structure    Structured                  ✓     ✓
                      Unstructured                            ✓     ✓     ✓     ✓     ✓     ✓
                      Local                       ✓     ✓     ✓
                      Global                                        ✓     ✓     ✓     ✓     ✓
Pruning heuristic     Magnitude-based             ✓           ✓     ✓     ✓
                      Gradient-based                                            ✓     ✓     ✓
                      Others                            ✓
                      Iterative scoring                                               ✓     ✓
                      Data-agnostic               ✓     ✓     ✓     ✓     ✓                 ✓
Pruning schedule      One-shot                    ✓                             ✓           ✓
                      Iterative                         ✓     ✓     ✓     ✓           ✓
Retraining procedure  Fine-tuning                 ✓     ✓     ✓
                      Weight rewinding                              ✓
                      Learning rate rewinding                             ✓

– Pruning heuristic: owing to its empirical success, estimating the importance of an individual parameter by its magnitude [3,11] is the standard heuristic. Gradient-based heuristics [10,17] are another popular approach, and other heuristics tackle specific issues, such as FPGM [5], which targets redundancy between filters in structured pruning. An important factor in the choice of a heuristic is whether its computation uses training data (data-dependent) or not (data-agnostic).
– Pruning schedule: some methods [10,17] prune the weights in one iteration, mainly before training. Others [3] follow an iterative procedure which alternates between pruning a small fraction of the weights and retraining the model.
– Retraining procedure: the most common technique, fine-tuning, continues training the network with the trained weights and the last learning rate. Recent alternatives propose weight [2] and learning rate [15] rewinding, in which the weights and/or the learning rate are reset to an earlier state before the retraining phase.
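To make the global/local and structured/unstructured distinctions concrete, the sketch below applies one-shot global unstructured magnitude pruning with PyTorch's built-in torch.nn.utils.prune module; `model` is a placeholder for any network and the 90% sparsity is illustrative:

```python
import torch
import torch.nn.utils.prune as prune

def global_magnitude_prune(model, sparsity=0.9):
    """One-shot global unstructured pruning: the `sparsity` fraction of
    weights with the smallest magnitude is removed, ranked across the
    whole model rather than layer by layer (which would be local)."""
    params = [
        (m, "weight")
        for m in model.modules()
        if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))
    ]
    prune.global_unstructured(
        params, pruning_method=prune.L1Unstructured, amount=sparsity
    )
```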

Table 1 sums up the different approaches and the state-of-the-art references used in this work.
The challenge of porting neural networks to constrained platforms such as microcontrollers has led to the creation of embedding tools (e.g. TFLM, https://www.tensorflow.org/lite/microcontrollers, or STM32CubeMX-AI, https://www.st.com/en/embedded-software/x-cube-ai.html) with which structured pruning is generally effortless. However, unstructured pruning (which leads to sparse structures) is more challenging and requires a specific sparse computation library (e.g. [19]) to actually decrease the model's memory footprint and inference cost.
We focus our experiments on three common pruning settings. The first is one-shot global unstructured pruning at initialization: global unstructured pruning algorithms are known to be the most efficient at producing sparse neural networks, and one-shot techniques do not increase the training budget. The second is iterative global unstructured pruning, which reduces the loss of accuracy at the cost of a larger training budget. The third is iterative local structured pruning, since structured methods are easily compatible with standard development platforms.
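As a counterpart to the global unstructured sketch above, the third setting can be sketched as follows; `train_fn` stands for a hypothetical retraining loop and the per-round fraction of 20% is illustrative (PyTorch stacks successive pruning masks, so each round prunes a fraction of the remaining filters):

```python
import torch
import torch.nn.utils.prune as prune

def iterative_structured_prune(model, train_fn, rounds=5, amount=0.2):
    """Iterative local structured pruning (sketch): each round removes
    20% of the remaining output filters of every Conv2d layer, ranked
    by L2 norm within that layer, then retrains the model."""
    for _ in range(rounds):
        for module in model.modules():
            if isinstance(module, torch.nn.Conv2d):
                # dim=0 prunes whole output filters (structured); the
                # ranking is computed per layer (local).
                prune.ln_structured(module, name="weight",
                                    amount=amount, n=2, dim=0)
        train_fn(model)  # hypothetical retraining loop
```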

3 Experiments on Digit Recognition Benchmark


3.1 Datasets and Setup
As in [23], we use the digit recognition datasets MNIST [9], SVHN [13] and USPS [7]. We use two classical CNN architectures: ResNet20 [4] and a variant of Lenet [9] composed of two convolution layers (32 and 64 filters with 5×5 kernels), each followed by a max-pooling layer, and three fully connected layers (128, 128 and 10 neurons). Both models have about the same number of parameters (273K and 276K respectively). The models are trained on MNIST (the source domain). We follow the experimental setting of [23] with random kernels of various sizes within [1–7]. The original data fraction parameter p and the consistency loss factor λ are fixed at 0.5 and 5 respectively. Unless specified otherwise, the models are trained for 150 epochs with the Adam optimizer, a learning rate of 10^-4 and a batch size of 32, with 50 epochs of retraining for iterative pruning. Our results are averaged over three training seeds (setups are detailed at https://gitlab.emse.fr/b.nguyen/randconvpruning).
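For reference, a sketch of this Lenet variant is given below, under our assumption of 32×32 RGB inputs (a common preprocessing when mixing the three digit datasets); with this assumption, the model has roughly 276K parameters, matching the count above:

```python
import torch.nn as nn

class LenetVariant(nn.Module):
    """Lenet variant from the text: two 5x5 convolutions (32 and 64
    filters), each followed by max-pooling, then three fully connected
    layers (128, 128 and 10 neurons). Input shape assumed (3, 32, 32)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```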

3.2 Unstructured Pruning at Initialization

Influence of Iterative Ranking. Before training, pruning a network iteratively (i.e. at each iteration, the heuristic is computed and a small fraction of the network's parameters is pruned) improves the performance of the pruned network [20]. This procedure also avoids potential layer collapse (i.e. the premature pruning of an entire layer, which leads to an abrupt drop of accuracy [17]). To check whether this property holds under SDG, we use two state-of-the-art algorithms, SNIP [10] and SynFlow [17], applied at initialization with two ranking budgets (a scoring sketch follows the list):
– computing the parameters' scores and pruning the model in one pass with a single batch (referred to as one batch, one iteration in Fig. 2);
– computing the parameters' scores and pruning the neural network over 100 iterations [17] with a single batch (one batch, 100 iterations).
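For illustration, a single SynFlow scoring pass can be sketched as below; this is our simplification, assuming a feed-forward model in eval mode and ignoring details such as batch-normalization statistics. Iterative ranking simply repeats this pass, each time pruning the lowest-scored fraction of the remaining parameters:

```python
import torch

def synflow_scores(model, input_shape):
    """One SynFlow scoring pass [17] (sketch): linearize the network by
    taking |weights|, push an all-ones input through it and score each
    parameter by |theta * dR/dtheta|. No training data is involved."""
    signs = {}
    with torch.no_grad():                    # linearize: theta <- |theta|
        for name, p in model.named_parameters():
            signs[name] = p.sign()
            p.abs_()
    r = model(torch.ones(1, *input_shape)).sum()   # R = sum of outputs
    r.backward()
    scores = {n: (p.grad * p).abs().detach()
              for n, p in model.named_parameters() if p.grad is not None}
    with torch.no_grad():                    # restore the original signs
        for name, p in model.named_parameters():
            p.mul_(signs[name])
    model.zero_grad()
    return scores
```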
As shown in Fig. 2, iterative ranking helps SynFlow to avoid layer collapse on all domains. For SNIP, however, iterations do not affect the accuracy on the different domains.

Fig. 2. Influence of iterative ranking on SNIP and SynFlow heuristics.

The data agnosticism of SynFlow can explain this difference. With enough iterations, and independently of the dataset, SynFlow is designed to satisfy the Maximal Critical Compression axiom: the algorithm never prunes a parameter if doing so leads to layer collapse while another prunable parameter could avoid it (see [17]). Meanwhile, the SNIP heuristic is designed to discover the connections of the network that are important for its training on the source task. Since Xu et al. [23] relax this task through random convolutions, the SNIP scores are less relevant and iterative ranking does not improve the network's performance.

Influence of Pruning Heuristic. We compare the baseline magnitude pruning to SNIP and SynFlow, using for each heuristic the best ranking budget found above. As shown in Fig. 3 and consistent with [17], SynFlow outmatches the other heuristics at high sparsity rates on all domains, and the magnitude heuristic suffers from layer collapse with Lenet networks (at 80% sparsity). A heuristic which outperforms the others on the source domain is likely to outperform them on the other domains as well. However, the impact of pruning with a given heuristic may differ between the source and the target domains: on the source domain, the accuracy only begins to drop sharply at an extreme sparsity rate (around 95%), while on the target domains it starts to decrease almost linearly from a high sparsity rate (around 70%).

3.3 Iterative Unstructured Pruning

Influence of Retraining Procedure. We compare fine-tuning, weight rewinding [2] and learning rate rewinding [15] as retraining techniques. To compare these three methods, the learning rate is initialized at 10^-4 and divided by 10 at epoch 120. The magnitude heuristic is used for pruning, and after each pruning step the network is retrained for 150 epochs.

Fig. 3. Comparison of pruning heuristics for one-shot unstructured pruning.
For all domains and networks, an Occam's hill [18] is observed in Fig. 4: at low sparsity rates, the accuracy increases, since pruning acts as a regularization process which forces the model to focus on the more important and general aspects of the task [18]; at high sparsity rates, the network's performance classically collapses. This local gain in generalization is confirmed with weight rewinding, where the network's parameters receive the same number of gradient updates for each sparsity level. Learning rate rewinding outperforms the other methods, in accordance with [15]; however, its large increase in accuracy is mostly due to the additional training iterations (gradient updates) at a high learning rate. For the following experiments, learning rate rewinding is used.
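The three procedures differ only in which state is reset before retraining, as the following sketch illustrates; `restore_unpruned`, `train_one_epoch` and `lr_schedule` are hypothetical helpers, and the epoch budgets mirror the setup above:

```python
def retrain_after_pruning(model, mode, rewind_epoch, rewind_state,
                          lr_schedule, total_epochs=150):
    """Sketch of the three retraining procedures. lr_schedule(e) returns
    the learning rate used at epoch e of the original run; rewind_state
    is a checkpoint saved early in training."""
    if mode == "weight_rewinding":
        # [2]: reset the surviving weights to the early checkpoint and
        # replay the original learning rate schedule from that point.
        restore_unpruned(model, rewind_state)
        epochs = range(rewind_epoch, total_epochs)
    elif mode == "lr_rewinding":
        # [15]: keep the trained weights but replay the schedule.
        epochs = range(rewind_epoch, total_epochs)
    else:
        # Fine-tuning: keep the weights and the last (small) learning rate.
        epochs = range(total_epochs - 50, total_epochs)
    for e in epochs:
        train_one_epoch(model, lr=lr_schedule(e))
```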

Influence of Pruning Heuristic. We compare the baseline magnitude pruning to SNIP and SynFlow. Figure 5 shows few differences between the pruning heuristics at low sparsity rates. At high sparsity rates, the magnitude heuristic underperforms on all domains and networks.

3.4 Iterative Structured Pruning

Fig. 4. Comparison of retraining procedures for iterative pruning.

To study the impact of structured pruning, baseline algorithms such as magnitude pruning [11] and FPGM [5] are used. Furthermore, we adapt SNIP and SynFlow to structured pruning by averaging the parameters' scores over each filter, as in [11]. The results presented in Fig. 6 do not allow us to confirm the superiority of any heuristic. For all domains, the accuracy first decreases slightly, then this loss accelerates at higher sparsity rates. An important observation is that structured pruning is not perfectly suited to domain generalization, since this acceleration appears earlier on the target domains, especially on SVHN. Another important, but expected, observation is that structured pruning yields a worse accuracy than unstructured pruning at any sparsity rate.

4 Experiments on RealWorld HAR Dataset

We extend our experiments to a second benchmark dedicated to Human Activity Recognition (HAR), a challenging task representative of many sensor-based IoT applications that process time series.

4.1 Datasets and Setup

The RealWorld HAR dataset [16] gathers data from fifteen subjects equipped with smartphones and smartwatches at seven different body positions (head, chest, upper arm, waist, forearm, thigh, and shin) while performing eight activities (climbing stairs down, climbing stairs up, jumping, lying, standing, sitting, running/jogging, and walking). From these devices, accelerometer and gyroscope data are sampled at 50 Hz.
We follow the reference procedure of Chang et al. [1]. The accelerometer signals are segmented into fixed-width sliding windows of 3 s (no overlap). A window is discarded if it includes a transition between activities, timestamp noise, or data points without labels. The neural network is trained with the data from one body location (chest), then tested on the other body locations.
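A minimal NumPy sketch of this windowing step, under our assumptions on the trace layout (a (T, channels) signal array with per-sample labels), could look as follows; it only implements the single-activity filter, not the timestamp-noise checks:

```python
import numpy as np

def make_windows(signal, labels, fs=50, win_s=3):
    """Cut a (T, channels) trace into non-overlapping windows of
    fs * win_s samples (150 at 50 Hz) and keep only windows whose
    per-sample labels all agree (no activity transition)."""
    w = fs * win_s
    xs, ys = [], []
    for start in range(0, len(signal) - w + 1, w):
        lab = labels[start:start + w]
        if (lab == lab[0]).all():
            xs.append(signal[start:start + w])
            ys.append(lab[0])
    return np.stack(xs), np.array(ys)
```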
For our experiments, we use a variant of the model proposed in [1] in which instance normalization layers are replaced with standard batch-normalization layers. We adapt the technique of Xu et al. [23] using temporal convolutions with random kernels of various sizes within [1–7]. The original data fraction parameter p and the consistency loss factor λ are fixed at 0.5 and 5 respectively. We keep the SynFlow heuristic since it performs well in all settings of the digit benchmark. Our results are averaged over three training seeds.

Fig. 5. Comparison of pruning heuristics for iterative unstructured pruning.

4.2 Impact of the Pruning Settings

For these experiments, the network is trained for 70 epochs with the Adam optimizer, a batch size of 32 and an initial learning rate of 0.001, which is divided by 2 at epochs 40 and 60. For iterative pruning, the network is retrained for 50 epochs with learning rate rewinding after each pruning step. We also follow the evaluation process of [1] and measure the F1-score with macro-averaging (the mean of the per-class F1 scores).
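For clarity, macro-averaging with scikit-learn on toy labels (illustrative only):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]   # toy activity labels, for illustration only
y_pred = [0, 1, 1, 2, 2, 0]
# Macro averaging takes the unweighted mean of the per-class F1 scores,
# so rare activities weigh as much as frequent ones.
print(f1_score(y_true, y_pred, average="macro"))
```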
A first observation from Fig. 7 is the efficiency of our customized version of the method of Xu et al. [23]: for all target domains, random convolutions enable the model to reach a higher F1-score than a classically trained model, despite a lower F1-score on the source domain. Second, we highlight an interesting compatibility between [23] and pruning techniques: compression ratios of up to 80% for unstructured pruning and 50% for structured pruning can be reached without loss of accuracy on the source domain, although models trained with random convolutions are more impacted by high compression rates, particularly for structured pruning (right).
Fig. 6. Comparison of pruning heuristics for iterative structured pruning.

Fig. 7. Pruning on RealWorld HAR: trained with (top) and without (bottom) random convolutions, one-shot at initialization (left) and iterative (centre) unstructured pruning and iterative structured pruning (right).

Figure 7 shows that pruning improves the generalization capacity: without random convolutions (bottom), the F1-score of the network increases on target domains at high sparsity rates for all pruning settings. Furthermore, with random convolutions, this increase is also observed in the one-shot unstructured pruning setting (top-left) for the body positions farthest from the source domain (chest), namely the thigh and shin. On the contrary, for iterative pruning (top-centre and top-right), pruning increases the F1-score on target domains close to the source domain while decreasing it on target domains far from the source domain. This effect can be explained by the additional training iterations (gradient updates) caused by iterative pruning with learning rate rewinding.

5 Conclusion

We experimentally evaluate the impact of pruning techniques in the single domain generalization setting with state-of-the-art methods and two benchmarks on image classification and human activity recognition. Our results show an interesting compatibility between pruning methods, which significantly reduce the number of parameters, and single domain generalization approaches. Pruning improves the ability of a model to generalize, especially on domains far from the source domain. Moreover, all the properties of pruning techniques based on data-agnostic heuristics remain valid in the single domain generalization setting. The combination of these methods therefore represents a powerful tool to ease the deployment of neural network models on constrained platforms like microcontrollers, for real-world applications in which the availability of training data is challenging. However, this combination is not free from drawbacks, since the impact of pruning on performance is larger in the single domain generalization setting. In particular, the additional training steps due to iterative pruning can cause a drop in performance on domains far from the source domain, and for pruning algorithms with data-dependent heuristics, some properties, such as the benefit of iterative scoring, do not carry over to the single domain generalization setting. These results highlight the need to develop and evaluate advanced domain generalization approaches for embedded applications that use highly compressed models.

Acknowledgments. This work benefited from the French Jean Zay supercomputer thanks to the AI dynamic access program. This collaborative work is partially supported by the IPCEI on Microelectronics and Nano2022 actions, by the European project InSecTT (www.insectt.eu, ECSEL Joint Undertaking grant 876038; the JU receives support from the European Union's H2020 program and Au, Sw, Sp, It, Fr, Po, Ir, Fi, Sl, Po, Nl, Tu; the document reflects only the author's view and the Commission is not responsible for any use that may be made of the information it contains), and by the French National Research Agency (ANR) in the framework of the Investissements d'Avenir program (ANR-10-AIRT-05, irtnanoelec).

References

1. Chang, Y., Mathur, A., Isopoussu, A., Song, J., Kawsar, F.: A systematic study of unsupervised domain adaptation for robust human-activity recognition. Proc. ACM Interact. Mobile Wearable Ubiquit. Technol. 4(1), 1–3 (2020)
2. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: International Conference on Learning Representations (2019)
3. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. Adv. Neural Inf. Proc. Syst. 1, 1135–1143 (2015)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (2016)
5. He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (2019)
6. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
7. Hull, J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)
8. Ismail Fawaz, H., et al.: InceptionTime: finding AlexNet for time series classification. Data Min. Knowl. Disc. 34, 1–27 (2020)
9. LeCun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
10. Lee, N., Ajanthan, T., Torr, P.: SNIP: single-shot network pruning based on connection sensitivity. In: International Conference on Learning Representations (2018)
11. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient ConvNets. In: International Conference on Learning Representations (2017)
12. Munappy, A., Bosch, J., Olsson, H.H., Arpteg, A., Brinne, B.: Data management challenges for deep learning. In: 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE (2019)
13. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. NIPS (2011)
14. Qiao, F., Zhao, L., Peng, X.: Learning to learn single domain generalization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (2020)
15. Renda, A., Frankle, J., Carbin, M.: Comparing rewinding and fine-tuning in neural network pruning. In: International Conference on Learning Representations (2020)
16. Sztyler, T., Stuckenschmidt, H.: On-body localization of wearable devices: an investigation of position-aware activity recognition. In: 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE (2016)
17. Tanaka, H., Kunin, D., Yamins, D.L., Ganguli, S.: Pruning neural networks without any data by iteratively conserving synaptic flow. Adv. Neural Inf. Proc. Syst. 33, 6377–6389 (2020)
18. Thodberg, H.H.: Improving generalization of neural networks through pruning. Int. J. Neural Syst. 1(4), 317–326 (1991)
19. Trommer, E., Waschneck, B., Kumar, A.: dCSR: a memory-efficient sparse matrix representation for parallel neural network inference. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). IEEE (2021)
20. Verdenius, S., Stol, M., Forré, P.: Pruning via iterative ranking of sensitivity statistics. arXiv preprint arXiv:2006.00896 (2020)
21. Volpi, R., Murino, V.: Addressing model vulnerability to distributional shifts over image transformation sets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)
22. Wang, X., Han, Y., Leung, V.C., Niyato, D., Yan, X., Chen, X.: Convergence of edge computing and deep learning: a comprehensive survey. IEEE Commun. Surv. Tutorials 22, 869–904 (2020)
23. Xu, Z., Liu, D., Yang, J., Raffel, C., Niethammer, M.: Robust and generalizable visual representation learning via random convolutions. In: International Conference on Learning Representations (2021)
