
Multimedia Tools and Applications

https://doi.org/10.1007/s11042-020-09301-x

Unsupervised densely attention network for infrared


and visible image fusion

Yang Li1 · Jixiao Wang1 · Zhuang Miao1 · Jiabao Wang1

Received: 8 April 2019 / Revised: 1 May 2020 / Accepted: 6 July 2020 /


© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Integrating the information of infrared and visible images without human supervision is a
long-standing problem. A key technical challenge in this domain is how to extract features
from heterogeneous data sources and fuse them appropriately. Prior deep learning works
either extract middle-layer information or rely on a costly training step to improve fusion
performance, which limits their performance in cluttered scenes and real-time applications.
In this paper, we introduce a novel and pragmatic unsupervised infrared and visible image
fusion method based on a pre-trained deep network, which employs a densely connected
structure and incorporates the attention mechanism to achieve high fusion performance. Fur-
thermore, we propose a cross-dimensional weighting and aggregation scheme to compute
the attention map for infrared and visible image fusion. The attention map enables more
efficient feature extraction and captures more structural information from the source images.
We evaluate our method and compare it with ten typical state-of-the-art fusion methods.
Extensive experimental results demonstrate that our method achieves state-of-the-art fusion
performance in both subjective and objective evaluation.

Keywords Image fusion · Deep learning · Dense connection · Attention mechanism

 Yang Li
solarleeon@outlook.com

Jixiao Wang
jixiao wang@126.com

Zhuang Miao
miao zhuang@163.com

Jiabao Wang
jiabao 1108@163.com

1 Command and Control Engineering College, Army Engineering University of PLA, Nanjing,
210007, China

1 Introduction

Infrared and visible image fusion is a classical yet still very active topic in the field of image
processing [15]. By integrating the detailed texture information in visible images and the thermal
radiation information in infrared images, infrared and visible image fusion is directly related to
various practical applications, such as object recognition [25], object detection [12], image
enhancement [4], and video surveillance [28]. Therefore, how to design a robust image
fusion method has drawn a significant amount of interest from both academia and industry.
During the past few years, a large number of infrared and visible image fusion methods
have been proposed. According to [28], classical image fusion methods can be divided
into several categories according to their adopted theories, i.e., multi-scale transform-
based methods [10, 11], sparse representation-based methods [20, 21, 36], subspace-based
methods [2], saliency-based methods [37], hybrid methods [26], and other methods [25].
However, the reliance on handcrafted features and fixed fusion procedures has limited the
robustness and accuracy of these techniques.
Recently, deep learning based image fusion methods have made notable progress, achieving
better fusion performance than traditional methods. For example, Liu
et al. [22] introduced a convolutional neural network (CNN) method for multi-focus image
fusion. In this method, different blurred versions of the input image are used to train the network,
and a decision map is obtained for image fusion. Du and Gao proposed an improved approach
in [3] by introducing a multi-scale convolutional neural network for generating the decision
map. However, these methods [3, 22] are only suitable for multi-focus image fusion.
In [23], Liu et al. proposed a CNN-based approach for infrared and visible image fusion,
whose merging procedure is conducted in a multiscale manner via image pyramids. In [16],
Li et al. proposed a VGG-based [31] fusion framework for infrared and visible image fusion.
In this method [16], the source images were decomposed into base parts and detail content.
The base parts were then fused by weighted averaging, and the detail content was fused
using multi-layer deep features. Finally, the fused image was reconstructed by combining the
fused base part and the fused detail content. Although middle-layer information was used
by this VGG-based fusion method [16], with the increase of network depth, a degradation
problem [30] is exposed and much useful information is lost during feature extraction.
To overcome the degradation problem in image fusion, Li et al. introduced supervised
learning to train a CNN with a dense block, which can extract more useful features and boost the
performance of infrared and visible image fusion [14]. However, this training step [14] has
the main drawback of requiring large efforts for collecting and annotating a large dataset and
for training, which is sometimes not feasible [24]. Such methods also do not perform well on unseen
scenarios due to the large discrepancy in feature distributions [13].
To solve the aforementioned problems, we propose a novel unsupervised infrared and
visible image fusion framework based on pre-trained convolutional neural networks. We
cast the fused image as the weighted average of the corresponding pixels in the input images.
Rather than spending large efforts on a training step [14], we focus on using pre-trained CNN
features more effectively. The key idea of our approach is to exploit the transferability of
the information encoded in a CNN, not only through its multi-scale features, but also through its abil-
ity to focus attention on the most representative regions of the source images. To this
end, we propose to use dense feature extraction and an attention mechanism within a unified
fusion framework. In this manner, our method captures more structural informa-
tion from the source images and hence significantly boosts the performance of infrared and visible
image fusion.

In summary, the main contributions of this paper are as follows:


• Unsupervised Densely Attention Network: We present an unsupervised infrared and
visible image fusion framework, which does not require large efforts for collecting, anno-
tating, and training. This method can not only exploit multi-scale features from global
and local perspectives, but also focus attention on the most representative regions
of the source images.
• Novel Multimodal Fusion Strategy: We propose a cross-dimensional weighting and
aggregation method to compute the attention map for infrared and visible image fusion.
The attention map enables more efficient feature extraction and improves image fusion
performance.
• New State-of-the-art Results: We validate the performance of our method on an image
fusion dataset and set a new state-of-the-art in both subjective and objective evaluation.
The rest of this paper is organized as follows. Section 2 briefly reviews related works.
Section 3 presents the detailed description of the proposed methodology. Section 4 intro-
duces the experimental configuration and analyzes the results. Finally, Section 5 concludes
the paper.

2 Related work

Many works have been proposed to improve image fusion performance. Broadly, these
works can be grouped into two main categories: network architecture and attention
mechanism. In this section, we briefly review different methods for improving image
fusion performance.

2.1 Network architecture for image fusion

Designing an advanced neural network architecture is one of the most effective yet challeng-
ing ways to improve image fusion performance. For example, Liu et al. [23] designed an
image fusion network with three convolutional layers and one max-pooling layer. Li et al.
[16] proposed to choose a few layers from a pre-trained VGG-19 network for image fusion.
Li et al. [14] also proposed a novel architecture with a dense block, in which each layer has
direct connections to all subsequent layers, for image fusion. More recently, Li et al.
[18] introduced a fusion algorithm based on ResNet50 [7] and zero-phase component anal-
ysis (ZCA). To sum up, the above-mentioned architectures show significant improvement
as generic feature extractors in the image fusion task. However, these methods simply sum
up the features of adjacent stages without considering their diverse representations, which
may be insufficient to merge all the useful information from multiple layers.
In this paper, we propose a Siamese network architecture based on a pre-trained
DenseNet-201 [8] network, which can exploit multi-scale features and enrich the fusion pro-
cess within a single layer. To the best of our knowledge, this is the deepest network applied to
image fusion, containing more than 200 weight layers.

2.2 Attention mechanism for image fusion

Infrared and visible image fusion aims to integrate the detailed texture information in
visible images and the thermal radiation information in infrared images [15]. Meanwhile, the
detailed texture areas and the thermal radiation areas are the salient parts in visible images
and infrared images, respectively. According to the attention mechanism of the human visual
system, saliency-based fusion methods can maintain the integrity of the salient object region
and improve the visual quality of the fused image [28].
The attention mechanism [5, 29, 32, 34, 35] has been studied extensively in previous
works, and in recent years researchers have adopted attention mechanisms for infrared
and visible image fusion. On one hand, the attention mechanism can be used to reconstruct
the fused image. For example, Ma et al. [27] used an improved visual saliency map and
weighted least squares optimization to fuse the base and detail layers. On the other hand,
the attention mechanism can also be used to extract the significant object regions of the source
images. For example, Liu et al. [19] integrated saliency detection into the sparse repre-
sentation framework for infrared and visible image fusion, adopting global and local
saliency maps to obtain the weights for fused image reconstruction.
However, as far as we know, no CNN-based method with an attention mechanism has
been applied to the infrared and visible image fusion task. Therefore, we propose to integrate
the attention mechanism into a deep learning framework for infrared and visible image fusion.
Our method is closely related to the dense block of DenseFuse [14]. However, the
biggest difference between the two approaches is that our method is unsupervised.
In addition, our densely connected structure and attention mechanism help our method
maintain as much of the original content structure as possible by keeping the diversity
of global and local patterns.

3 Image fusion pipeline

In this section, we elaborate the proposed method in detail. The image fusion pipeline
of the proposed method is depicted in Fig. 1. It is essentially a two-branch Siamese network
that takes a pair of infrared and visible images as input and aims to extract deep features
and construct attention maps for image fusion. As a result, our method contains three main
stages (see Fig. 1): a DenseNet shared by the two branches, an attention unit, and an image
fusion module.
As shown in Fig. 1, the two main branches of the network have the same base net-
work architecture and share their parameters, hence the name Siamese.



Fig. 1 Flowchart of the proposed method. The proposed approach consists of three components: 1) DenseNet
architecture for deep feature extraction (Section 3.1); 2) Attention unit for attention maps (Section 3.2); 3)
Image fusion method (Section 3.3). Best viewed on color display

In addition, the different source images have been strictly aligned in advance. After feature
vectors are computed for the input images using the base network, they are fed into an
attention unit to construct two attention maps, one for the infrared image and one for the
visible image. Finally, the original input images and their attention maps are used to
reconstruct the fused image. In the following subsections, we elaborate the proposed
DenseNet architecture, the attention unit, and the fusion method in detail.

3.1 DenseNet architecture

DenseNet is a CNN that extracts a deep representation from the input images. Owing to
its densely connected structure, each layer of DenseNet is directly connected to every other
layer in a feed-forward manner. In this paper, we use DenseNet-201 [8] pre-trained on the Ima-
geNet classification benchmark. Although DenseNet-201 was initially created for image
classification, our experiments show that its capability of gathering the fea-
tures extracted from different layers and aggregating the multi-scale information is innately
beneficial to image fusion tasks.
In DenseNet, each layer obtains additional inputs from all preceding layers, which
encourages feature reuse. Please note that there are two CNNs (top CNN and bottom CNN)
in Fig. 1. These two CNNs have the same structure and share the same weights; that is,
both the infrared and visible images are passed through the same network to extract deep features.
With more than 200 weight layers, this is, to the best of our knowledge, the deepest network used for
image fusion. The detailed configuration of DenseNet-201 can be found in [8].
The infrared image I1(i, j) and the visible image I2(i, j) are fed forward through
the CNN to compute the convolutional features of a selected layer. Let θ denote all
the parameters of DenseNet-201; the outputs F1 ∈ R^(K×W×H) and F2 ∈ R^(K×W×H)
are two 3-dimensional feature tensors taken from the last concatenation layer l (named
"conv5_block32_concat" in [8]), where K is the total number of channels and W × H is the
spatial resolution. Note that the spatial dimensions may vary per image depending on its
original size. Taking advantage of the dense connections, the outputs F1 and F2 preserve
maximum multi-scale information flow between layers, which is particularly effective
for image fusion.
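
As a concrete illustration of this feature extraction step, the following is a minimal sketch using a pre-trained DenseNet-201 from torchvision. Here the output of the `features` module (the last dense block followed by a batch normalization layer) is used as an approximation of the "conv5_block32_concat" activations; the function name `extract_dense_features`, the file paths, and the preprocessing choices are our own assumptions rather than the authors' exact pipeline.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Shared pre-trained DenseNet-201; "features" ends after the last dense block,
# which approximately corresponds to the paper's "conv5_block32_concat" layer.
densenet = models.densenet201(weights=models.DenseNet201_Weights.DEFAULT)
backbone = densenet.features.eval()

preprocess = T.Compose([
    T.Grayscale(num_output_channels=3),   # infrared/visible inputs may be single channel
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_dense_features(image_path):
    """Return a K x H' x W' feature tensor for one source image (no training involved)."""
    img = Image.open(image_path)
    x = preprocess(img).unsqueeze(0)       # 1 x 3 x H x W
    return backbone(x).squeeze(0)          # K x H' x W' (K = 1920 for DenseNet-201)

# Both branches share the same weights (Siamese), so the same backbone is reused.
F1 = extract_dense_features("infrared.png")   # placeholder file names
F2 = extract_dense_features("visible.png")
```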
So far, we have obtained dense features from both the infrared and visible images. A
naive approach would be to generate a global feature from the dense infrared and visi-
ble features. However, due to the degradation problem [30], the set of features from the
previous step may contain features of pixels on unimportant objects or parts of the back-
ground. Therefore, blindly fusing infrared and visible features globally would degrade the
performance of the estimation. In the following, we describe a novel pixel-wise attention
mechanism that effectively combines the extracted features.

3.2 Attention unit

The key idea of our attention mechanism [17] is to enhance the different representations of
objects in the different source images. Inspired by previous methods [5, 32, 34], we adaptively
select the important parts and minimize the effects of noise. Concretely, after feature
vectors are computed for the input images using the base network, F1 ∈ R^(K×W×H) and
F2 ∈ R^(K×W×H) are fed into an attention unit to construct two attention maps, one for the
infrared image and one for the visible image.

Consider the convolutional feature maps F1 ∈ R^(K×W×H) and F2 ∈ R^(K×W×H). We can
compute their attention maps from the statistics of these values across the channel
dimension:

$$A_1(i, j) = \sum_{k=1}^{K} \alpha_k^1 \cdot |F_1(k, i, j)| \qquad (1)$$

$$A_2(i, j) = \sum_{k=1}^{K} \alpha_k^2 \cdot |F_2(k, i, j)| \qquad (2)$$

where $\alpha_k^1$ and $\alpha_k^2$ are the per-channel weights calculated by:

$$\alpha_k^1 = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_1(k, i, j) \qquad (3)$$

$$\alpha_k^2 = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_2(k, i, j) \qquad (4)$$

As shown in (1) and (2), the features at different stages have different degrees of dis-
crimination, which results in different effects on fusion. Therefore, the α_k values are applied
to the feature maps, which amounts to feature selection with channel-wise soft attention.
More specifically, the global contextual information is gathered by global average pooling
(in (3) and (4)) across the spatial dimensions [17]. With this design, the network obtains
discriminative features and produces improved fusion results.
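
A minimal NumPy sketch of this attention unit is given below; it directly implements (1)–(4), with `attention_map` being a name we introduce for illustration.

```python
import numpy as np

def attention_map(F):
    """Compute the attention map of a K x W x H feature tensor, following Eqs. (1)-(4)."""
    K = F.shape[0]
    # Eqs. (3)/(4): per-channel weights via global average pooling over the spatial dimensions.
    alpha = F.reshape(K, -1).mean(axis=1)            # shape (K,)
    # Eqs. (1)/(2): weighted aggregation of absolute activations across channels.
    A = np.tensordot(alpha, np.abs(F), axes=(0, 0))  # shape (W, H)
    return A

# Example with dummy features; in practice F1 and F2 are the DenseNet-201 outputs.
F = np.random.rand(1920, 14, 18).astype(np.float32)
A = attention_map(F)
print(A.shape)  # (14, 18)
```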

3.3 Image fusion

Bicubic interpolation [9] is used to resize the attention maps to the source image size.
Then the final weight maps are obtained by (5) and (6):

$$w_1(i, j) = \frac{A_1(i, j)}{A_1(i, j) + A_2(i, j)} \qquad (5)$$

$$w_2(i, j) = \frac{A_2(i, j)}{A_1(i, j) + A_2(i, j)} \qquad (6)$$

Finally, the fused image is reconstructed by (7), a weighted-average scheme that combines
the source images [1]:

$$Fused(i, j) = w_1(i, j)\, I_1(i, j) + w_2(i, j)\, I_2(i, j) \qquad (7)$$
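
The fusion step can be sketched as follows. This is an illustrative implementation under our own assumptions (grayscale float inputs, OpenCV for the bicubic resize, and a small epsilon added to guard against division by zero), not the authors' released code.

```python
import numpy as np
import cv2

def fuse_images(I1, I2, A1, A2, eps=1e-12):
    """Fuse infrared image I1 and visible image I2 using their attention maps A1, A2.

    I1, I2: grayscale source images as float arrays of shape (H, W).
    A1, A2: attention maps of any spatial size; they are resized to the source
            size by bicubic interpolation, as described in Section 3.3.
    """
    H, W = I1.shape
    # Resize the attention maps to the source image size (bicubic interpolation [9]).
    A1r = cv2.resize(A1.astype(np.float32), (W, H), interpolation=cv2.INTER_CUBIC)
    A2r = cv2.resize(A2.astype(np.float32), (W, H), interpolation=cv2.INTER_CUBIC)
    # Eqs. (5) and (6): normalize the two maps into pixel-wise weight maps.
    denom = A1r + A2r + eps      # eps is our addition to avoid division by zero
    w1 = A1r / denom
    w2 = A2r / denom
    # Eq. (7): pixel-wise weighted average of the source images.
    return w1 * I1 + w2 * I2
```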

4 Experiments and analysis

In this section, we first introduce the experimental setting, including datasets and evaluation
metrics. Then we present the implementation details of our proposed approach. Finally, we
perform a series of experiments to thoroughly investigate the performance of our proposed
method.

4.1 Experimental settings

To verify the effectiveness of the proposed fusion method, 21 pairs of infrared and visible
images are collected from [26] and [33]. Please note that all of these source images are
widely used in the field of infrared and visible image fusion. Figure 2 shows some example
image pairs from the dataset used in our experiments.
For the performance evaluation, we compare the proposed method with ten state-of-the-
art fusion methods. These methods can be divided into two categories: (1) Traditional fusion
methods including: cross bilateral filter fusion method (CBF) [11], discrete cosine harmonic
wavelet transform (DCHWT) [10], joint sparse representation (JSR) [36], saliency detection
in sparse domain (JSRSD) [21], gradient transfer and total variation minimization (GTF)
[25], weighted least square optimization (WLS) [26], and convolutional sparse representa-
tion (ConvSR) [20]. (2) Deep learning fusion methods including: VGG19 and multi-layers
(VGG ML) [16], DeepFuse [14], and ResFuse [18].
All experiments were implemented in Matlab R2018a on an Intel Core i5-4690
CPU (3.50 GHz) with 16 GB DDR3 RAM. The parameters of all the evaluated methods are set to the
default values from their publicly available code.

4.2 Objective evaluation

Objective evaluation plays an important role in image fusion. In this paper, we select three
quality metrics: FMI_pixel [6], N_abf [10], and SSIM_a [16]. FMI_pixel [6] measures
the feature mutual information and reflects how well the fused image preserves certain
features of the sources, such as edges, details, and contrast. N_abf [10] denotes the artifacts or
noise added to the fused image by the fusion process. SSIM_a [16] represents the abil-
ity to preserve structural information and is used to quantify image loss and
distortion.
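
As an illustration of how such a metric can be computed, the sketch below assumes that SSIM_a is the average structural similarity between the fused image and each source image; this definition is an assumption on our part, so the exact formulation in [16] should be consulted.

```python
from skimage.metrics import structural_similarity as ssim

def ssim_a(fused, ir, vis, data_range=255):
    """Average SSIM between the fused image and the two source images.

    Assumes SSIM_a = (SSIM(Fused, I1) + SSIM(Fused, I2)) / 2; the exact
    definition used in [16] may differ.
    """
    return 0.5 * (ssim(fused, ir, data_range=data_range) +
                  ssim(fused, vis, data_range=data_range))
```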
The average results of FMI_pixel [6], N_abf [10], and SSIM_a [16] over the 21 fused images
are summarized in Table 1, where the first and second best results are highlighted in bold and
underlined. As shown in Table 1, compared with the other fusion methods, our algo-
rithm obtains the best values of N_abf [10] and SSIM_a [16], which indicates that our method
introduces less noise and preserves more structural information from the source images. Although
the FMI_pixel [6] value of our method is not the best, it is still very close to the best one.

Fig. 2 Four pairs of infrared and visible images. The top row contains 4 visible images, and the second row
contains 4 infrared images

Table 1 Comparison with the state-of-the-art infrared and visible image fusion methods in terms of
FMI_pixel, N_abf, and SSIM_a

Methods          FMI_pixel   N_abf     SSIM_a
CBF [11]         0.87203     0.31727   0.59957
DCHWT [10]       0.91377     0.12295   0.73132
JSR [36]         0.88463     0.34712   0.54073
JSRSD [21]       0.86429     0.34657   0.54127
GTF [25]         0.90207     0.07951   0.70016
WLS [26]         0.89788     0.21257   0.72360
ConvSR [20]      0.91691     0.01958   0.75335
VGG ML [16]      0.91070     0.00120   0.77799
DeepFuse [14]    0.90470     0.09178   0.72882
ResFuse [18]     0.90921     0.00062   0.77825
Our method       0.90881     0.00047   0.77845

The first and second best results are highlighted in bold and underlined.

4.3 Subjective evaluation

Subjective evaluation is based on the human visual system, which plays an
important role in fusion quality evaluation. It can consistently compare different methods
in terms of image distortion, image details, and object completeness.
Figure 3 presents fused results of the 10 existing methods and the proposed method.
As we can see, under various challenging conditions, our method consistently outperforms the
compared methods. For example, the fused images obtained by CBF [11], JSR
[36], and JSRSD [21] contain more artificial noise, and the salient features are not clear.
In contrast, the fused images obtained by the proposed method contain less noise
and fewer ringing artifacts. In addition, we observe that our fusion method preserves more detail
information and the thermal radiation distribution in the red box.


Fig. 3 Experiment on “street” images. a Infrared image; b Visible image; c CBF [11]; d DCHWT [10]; e JSR
[36]; f JSRSD [21]; g GTF [25]; h WLS [26]; i ConvSR [20]; j VggML [16]; k DeepFuse [14]; l ResFuse
[18]; m Ours

5 Conclusions

In this work, we propose an unsupervised densely attention network for infrared and visible
image fusion. The core idea of our method is to utilize a densely connected structure
and an attention mechanism for efficient feature extraction. Additionally, the proposed method
does not require any manual annotation or re-training procedure, yet it still outperforms
the state-of-the-art methods on a public standard benchmark. Comprehensive experiments
demonstrate that our method can produce accurate image fusion results. Based on this
superior performance and flexibility, we plan to apply the framework to other multi-modal
applications. We hope that our method will provide new insight into image fusion
strategies and that our implementation will benefit future work on infrared and visible image fusion.

Acknowledgments This work was supported by the National Natural Science Foundation of China
(No.61806220). The authors would like to thank the anonymous reviewers for their comments and insight,
which helped to shape the final version of this paper.

References

1. Bai X, Chen X, Zhou F, Liu Z, Xue B (2013) Multiscale top-hat selection transform based infrared
and visual image fusion with emphasis on extracting regions of interest. Infrared Physics & Technology
60:81–93
2. Bavirisetti DP, Xiao G, Liu G (2017) Multi-sensor image fusion based on fourth order partial differential
equations. In: International conference on information fusion, pp 1–9
3. Du C, Gao S (2017) Image segmentation-based multi-focus image fusion through multi-scale convolu-
tional neural network. IEEE Access 5(99):15750–15761
4. Dümbgen F, Helou ME, Gucevska N, Süsstrunk S (2018) Near-infrared fusion for photorealistic image
dehazing. Electronic Imaging (16):321–325
5. Gao L, Guo Z, Zhang H, Xu X, Shen HT (2017) Video captioning with attention-based lstm and semantic
consistency. IEEE Trans Multimed 19(9):2045–2055
6. Haghighat M, Razian MA (2014) Fast-fmi: non-reference image fusion metric. In: IEEE international
conference on application of information and communication technologies, pp 1–3
7. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference
on computer vision and pattern recognition, pp 770–778
8. Huang G, Liu Z, Weinberger KQ (2017) Densely connected convolutional networks. In: IEEE conference
on computer vision and pattern recognition, pp 2261–2269
9. Hwang JW, Lee HS (2004) Adaptive image interpolation based on local gradient features. IEEE Signal
Processing Letters 11:359–362
10. Kumar BKS (2013) Multifocus and multispectral image fusion based on pixel significance using discrete
cosine harmonic wavelet transform. Signal Image & Video Processing 7(6):1125–1143
11. Kumar BKS (2015) Image fusion based on pixel significance using cross bilateral filter. Signal Image &
Video Processing 9(5):1193–1204
12. Lahoud F, Susstrunk S (2018) Ar in vr: simulating infrared augmented vision. In: IEEE international
conference on image processing, pp 3893–3897
13. Lahoud F, Süsstrunk S (2019) Fast and efficient zero-learning image fusion. arXiv:1905.03590
14. Li H, Wu X-J (2018) Densefuse: a fusion approach to infrared and visible images. IEEE Trans Image
Process 28(5):2614–2623
15. Li S, Kang X, Fang L, Hu J, Yin H (2017) Pixel-level image fusion: a survey of the state of the art.
Information Fusion 33:100–112
16. Li H, Wu XJ, Kittler J (2018) Infrared and visible image fusion using a deep learning framework. In:
International conference on pattern recognition, pp 2705–2710
17. Li X, Wang W, Hu X, Yang J (2019) Selective kernel networks. In: IEEE conference on computer vision and pattern recognition, pp 510–519
18. Li H, Wu X, Durrani TS (2018) Infrared and visible image fusion with resnet and zero-phase component
analysis. arXiv:1806.07119

19. Liu C, Qi Y, Ding W (2017) Infrared and visible image fusion method based on saliency detection in
sparse domain. Infrared Physics & Technology 83:94–102. https://doi.org/10.1016/j.infrared.2017.04.
018
20. Liu Y, Chen X, Ward RK, Wang ZJ (2016) Image fusion with convolutional sparse representation. IEEE
Signal Process Lett 23(12):1882–1886
21. Liu CH, Qi Y, Ding WR (2017) Infrared and visible image fusion method based on saliency detection in
sparse domain. Infrared Physics & Technology 83:94–102
22. Liu Y, Chen X, Peng H, Wang Z (2017) Multi-focus image fusion with a deep convolutional neural
network. Information Fusion 36:191–207
23. Liu Y, Chen X, Cheng J, Peng H, Wang Z (2017) Infrared and visible image fusion with convolutional
neural networks. International Journal of Wavelets Multiresolution & Information Processing 16(3):20–
52
24. Liu Y, Chen X, Wang Z, Wang ZJ, Ward RK, Wang X (2018) Deep learning for pixel-level image fusion:
recent advances and future prospects. Information Fusion 42:158–173
25. Ma J, Chen C, Li C, Huang J (2016) Infrared and visible image fusion via gradient transfer and total
variation minimization. Information Fusion 31:100–109
26. Ma J, Zhou Z, Wang B, Zong H (2017) Infrared and visible image fusion based on visual saliency map
and weighted least square optimization. Infrared Physics and Technology 82:8–17
27. Ma J, Zhou Z, Wang B, Zong H (2017) Infrared and visible image fusion based on visual
saliency map and weighted least square optimization. Infrared Physics & Technology 82:8–17.
https://doi.org/10.1016/j.infrared.2017.02.005
28. Ma J, Ma Y, Li C (2019) Infrared and visible image fusion methods and applications: a survey.
Information Fusion 45:153–178
29. Mnih V, Heess N, Graves A, Kavukcuoglu K (2014) Recurrent models of visual attention. In:
International conference on neural information processing systems, pp 2204–2212
30. Prabhakar KR, Srikar VS, Babu RV (2017) Deepfuse: a deep unsupervised approach for exposure fusion
with extreme exposure image pairs. In: IEEE international conference on computer vision, pp 4724–4732
31. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition.
In: International conference on learning representations, pp 1–13
32. Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2018) From deterministic to generative: multimodal
stochastic rnns for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058
33. Toet A (2014) TNO Image Fusion Dataset. Figshare, https://doi.org/10.6084/m9.figshare.1008029.v1
34. Wang X, Gao L, Wang P, Sun X, Liu X (2017) Two-stream 3-d convnet fusion for action recognition in
videos with arbitrary size and length. IEEE Trans Multimed 20(3):634–644
35. Zhai Y, Shah M (2006) Visual attention detection in video sequences using spatiotemporal cues. In:
International conference on multimedia, pp 815–824
36. Zhang Q, Fu Y, Li H, Zou J (2013) Dictionary learning method for joint sparse representation-based
image fusion. Opt Eng 52(5):7006–7018
37. Zhang X, Ma Y, Fan F, Zhang Y, Huang J (2017) Infrared and visible image fusion via saliency analysis
and local edge-preserving multi-scale decomposition. Journal of the Optical Society of America A Optics
Image Science & Vision 34(8):1400–1410

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Yang Li is currently an associate professor of Army Engineering University of PLA, Nanjing, China. He
received the B.S. degree from Beihang University, Beijing, China, in 2007 and the M.S. degree from PLA
University of Science and Technology, Nanjing, China, in 2010. He then received the Ph.D. degree from
Army Engineering University of PLA, Nanjing, China, in 2018. His current research interests include
computer vision, deep learning and image processing.

Jixiao Wang received the B.S. degree from PLA University of Science and Technology, Nanjing, China, in
2014. He is currently working towards the M.S. degree in Command and Control Engineering College, Army
Engineering University of PLA. His current research interests include computer vision, deep learning and
image processing.

Zhuang Miao is currently an associate professor of Army Engineering University of PLA, Nanjing, China.
He received the Ph.D. degree from PLA University of Science and Technology, Nanjing, China, in 2007. His
current research focuses on artificial intelligence, pattern recognition and computer vision.

Jiabao Wang is currently an assistant professor of Army Engineering University of PLA, Nanjing, China. He
received the Ph.D. degree in Computational Intelligence from PLA University of Science and Technology,
Nanjing, China, in 2012. His current research focuses on computer vision and machine learning.
