
Defence Technology xxx (xxxx) xxx

Contents lists available at ScienceDirect

Defence Technology
journal homepage: www.keaipublishing.com/en/journals/defence-technology

MTTSNet: Military time-sensitive targets stealth network via real-time mask generation

Siyu Wang, Xiaogang Yang*, Ruitao Lu, Zhengjie Zhu, Fangjia Lian, Qing-ge Li, Jiwei Fan

PLA Rocket Force University of Engineering, Xi'an 710025, China

a r t i c l e   i n f o

Article history:
Received 14 June 2023
Received in revised form 10 September 2023
Accepted 22 September 2023
Available online xxx

Keywords:
Deep learning
Military application
Targets stealth network
Mask generation
Generative adversarial network

a b s t r a c t

The automatic stealth task of military time-sensitive targets plays a crucial role in maintaining national military security and mastering battlefield dynamics in military applications. We propose a novel Military Time-sensitive Targets Stealth Network via Real-time Mask Generation (MTTSNet). To our knowledge, this is the first technology to automatically remove military targets in real time from videos. The critical steps of MTTSNet are as follows: First, we designed a real-time mask generation network based on the encoder-decoder framework, combined with the domain expansion structure, to effectively extract mask images. Specifically, the ASPP structure in the encoder could achieve advanced semantic feature fusion. The decoder stacked high-dimensional information with low-dimensional information to obtain an effective mask layer. Subsequently, the domain expansion module guided the adaptive expansion of mask images. Second, a context adversarial generation network based on gated convolution was constructed to achieve background restoration of mask positions in the original image. In addition, our method worked in an end-to-end manner. A particular semantic segmentation dataset for military time-sensitive targets has been constructed, called the Military Time-sensitive Target Masking Dataset (MTMD). The MTMD dataset experiment successfully demonstrated that this method could create a mask that completely occludes the target and that the target could be hidden in real time using this mask. We demonstrated the concealment performance of our proposed method by comparing it to a number of well-known and highly optimized baselines.

© 2023 China Ordnance Society. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

The future of the battlefield is constantly changing, and the combat mode is developing in the direction of information and intelligence. As an essential intelligence source for information-based warfare, the Internet has become a key to improving battlefield situational generation, reconnaissance, surveillance, and command decision-making in modern warfare. With the increasing development of the Internet and artificial intelligence technology, a large amount of military information has been leaked onto the Internet, which has caused great harm to national security and interests. Military time-sensitive targets include ships, aircraft, etc., whose strike opportunities are limited by time windows and which have high military value. Therefore, all kinds of images or videos of military time-sensitive targets related to national security need to be declassified before they are released to the public. However, there is still a lack of a method that can quickly hide targets, and how to automatically hide military time-sensitive targets in real time is an urgent research topic.

Traditional target stealth methods rely on manual or semi-automatic methods to find specific military time-sensitive targets, and the main techniques are patch-based and diffusion-based. Patch-based methods [1-5] fill in the missing spaces in the image by looking for better patches in non-sensitive regions. Diffusion-based algorithms [6-8] carry out a filling procedure by smoothly propagating the image material from the boundary to the missing sensitive region. Such algorithms are relatively complex, unable to learn the high-dimensional information of the image, and the repair results have a low resolution.

* Corresponding author.
E-mail address: doctoryxg@163.com (X. Yang).
Peer review under responsibility of China Ordnance Society.

https://doi.org/10.1016/j.dt.2023.09.010
2214-9147/© 2023 China Ordnance Society. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).


Deep learning [9-11] has produced amazing advancements in computer vision in recent years. To facilitate the accomplishment of more complex image structures, current image inpainting research efforts have shifted to a data-driven scheme that learns deep semantic information in an image and then infers what is missing based on this information, which can be used to remove specific targets in a snap and can coherently recover texture and structural features of the removed regions [12-16]. To better repair the texture details and overall structural information of the missing areas, researchers have proposed various deep learning-based image inpainting algorithms, including deep convolutional neural network-based (CNN-based) methods [16-20] and generative adversarial network-based methods [12-15,21-23].

In recent years, significant research has been performed regarding generative adversarial network methods due to their powerful image-generation abilities [24]. GANs include a generator responsible for capturing the distribution of sample data through iterative data training and a discriminator accountable for determining whether the samples are generated spurious data or real data. Benefitting from their excellent image generation ability, GANs can directly learn the areas needing completion while utilizing adversarial loss constraint training to generate more realistic target stealth results. However, GAN-based target stealth methods can only handle a single image and cannot complete the military target concealment task for videos. Therefore, some methods based on video target stealth have been proposed [25-29]. Among them, typical flow-based methods consider video inpainting as a pixel propagation problem to fully utilize the internal spatial and temporal information of videos. Nevertheless, such methods require information between the front and back frame images and cannot realize real-time concealment of military target videos.

To address these flaws, in this paper, we proposed a novel concept of automatic hiding of military time-sensitive targets, in which masks can automatically hide military time-sensitive targets in real-time video, and the hidden area is replaced by the similar backgrounds around it. The key idea is to generate a mask that effectively covers the target in real time. This real-time processing mode for video sequences substantially improved the efficiency of algorithms, and it could provide target-hiding effects no worse than non-real-time processing. The target hiding area was prone to distortion and incomplete hiding challenges, such as the generated mask being unable to cover the target to be hidden entirely. To solve the above problem, we built an encoder-decoder framework based on the mask expansion module to enhance the mask generation effect. Specifically, the original image was encoded and decoded to produce a coarse mask image. We performed external expansion and internal hole filling on the coarse mask to generate the ideal real-time mask. The background reconstruction network received both the created mask and the original picture of the current frame as input. A coarse network based on gated convolution and a refinement network based on contextual attention were used to achieve background filling of the target hidden region.

In summary, the main contributions of this study are as follows:

(1) We introduce a novel Military Time-sensitive targets Stealth Network (MTTSNet). MTTSNet, as an end-to-end framework, can automatically obtain mask images and achieve real-time stealth of targets. It can produce results that are visually reasonable and satisfy temporal consistency.
(2) We design a new domain expansion module to accurately and comprehensively cover all the target areas to be hidden. This domain expansion module is responsible for guiding the adaptive expansion of mask images. It is simple, fast and produces high-quality targets stealth results.
(3) Numerous experimental results have shown that MTTSNet achieves significant results on two standard distortion-oriented metrics (i.e., PSNR and SSIM) compared to many existing state-of-the-art methods. Qualitative analysis and user studies likewise demonstrate the superiority of our method.

The remaining sections of this article are arranged as follows. Section 2 provides a brief introduction to the related work. The main content of Section 3 is a detailed description of the proposed method. The results are presented in Section 4. The conclusion of the article is given in Section 5.

2. Related work

Our proposed Military Time-sensitive Targets Stealth Network via Real-time Mask Generation mainly includes two modules: a real-time mask generation module based on semantic segmentation and a target region repair module. This section reviews existing work and the methods most relevant to our approach, including target semantic segmentation methods and image and video inpainting methods.

2.1. Target semantic segmentation

As a fundamental computer vision task, semantic segmentation is viewed as a dense prediction problem with the goal of labeling every pixel in the input image [30]. DeConvNet [31], a deconvolutional network consisting of a deconvolutional layer and an unpooling layer, is used to identify pixel-level labels and predict segmentation masks. Subsequently, a fully convolutional encoding and decoding architecture known as SegNet was proposed by Badrinarayanan et al. [32]. A pixel-level classification layer, a decoder, and an encoder make up the majority of SegNet's essential components. The decoder's method for upsampling the low-resolution input feature maps is what makes it innovative. Specifically, nonlinear up-sampling is performed using the pooling index calculated during the encoder's maximum pooling phase. To obtain more explicit boundary information, algorithms such as PSPNet [33] and the DeepLab family [34,35] use spatial pyramid pooling modules [36,37] and parallel atrous convolution at different rates for feature extraction and fusion. Due to the success of depthwise separable convolution [38,39], atrous separable convolution was used for ASPP (Atrous Spatial Pyramid Pooling) and decoder networks [34], and the encoder-decoder structure based on the atrous convolution is shown in Fig. 1. However, the mask images generated by most segmentation algorithms cannot fully cover the target area, and the segmentation effect at the target edge could be better. Therefore, to solve the above problems, this paper proposes a mask dilation model.

2.2. Image, video inpainting

Image inpainting was first proposed as an image processing task [40], which aims to repair damaged areas of an image and remove or replace selected targets in the image. Subsequently, researchers have extensively worked on patch-based [1-5] and diffusion-based [6-8] algorithms for image inpainting. Still, these traditional algorithms can only adapt to more superficial background structures and lack mechanisms to model high-level semantic information. Therefore, the repair is ineffective in complex environments, and the resolution of the complementary content could be higher.

In recent years, image inpainting models have shifted toward deep learning-based approaches that utilize data-driven schemes to fill in the area to be restored end-to-end with generative models [12-15,21-24]. These models are trained on a large quantity of data in order to extract the image's deep features and generate the background of the area to be restored.

Fig. 1. The encoder-decoder structure based on the atrous convolution.

FCN [41] and U-Net [42] are the primary image inpainting architectures based on convolutional neural networks. Both networks use an encoder-decoder architecture. The primary principle of FCN is to extract the high-dimensional feature map from the input image using convolutional and pooling layers. The deconvolutional layer is then used in place of the fully-connected layer of a CNN to produce a feature map that is the same size as the input by up-sampling. Similar to FCN, U-Net is an entirely symmetric structure. The encoder part extracts features from the image by convolutional layers and down-sampling, and the decoder part outputs feature maps of the same size using up-sampling. The difference is that for residual concatenation, FCN uses addition, while U-Net uses concatenation to fuse features.

In order to increase image resolution, Sasaki et al. [43] presented an automatic FCN-based image restoration approach that substitutes a nearest neighbor up-sampling technique for the deconvolution layer. As GAN is increasingly employed in image inpainting, FCN is being used more and more as a generator for GAN. A contextual encoder network, a combined CNN and GAN encoder network, was proposed by Pathak et al. [19]. Since the area to be repaired does not contain valid information, the effect of repair will be blurred if it is not distinguished from the valid information. As a result, different convolution patterns based on U-Net have been suggested by academics, including gated convolution [15], the learnable bidirectional attention map (LBAM), and partial convolution [14]. A contextual attention mechanism was introduced to the network by Yu et al. [14], which produced images of superior quality. Yan et al. [18], Yu et al. [15], and Liu et al. [44] also managed to solve the problem of irregular region repair.

Recently, the video target removal task [25-29] has also developed rapidly as an extension of image inpainting. A new end-to-end paradigm for flow-guided video inpainting was put forth by Li et al. [28], and both qualitative and quantitative measures supported the method's efficacy. However, such methods mainly focus on already generated video frames. These methods can only process less than one frame per second, with a large gap from real-time video processing. No effective real-time algorithm can be implemented for the real-time hiding of military time-sensitive targets. Therefore, we propose an automatic target stealth method based on real-time mask generation, using a data-driven approach for learning and finally generating the hidden video data in real time.

3. Proposed method

3.1. Overview of our method

Previous work has verified the effectiveness of generative adversarial networks for image-hiding tasks. However, today there is still a lack of an algorithm that can quickly hide military time-sensitive targets in real time. We proposed an intelligent algorithm for hiding military time-sensitive targets based on an encoding and decoding framework and generative adversarial networks to address this problem. Fig. 2 depicts the algorithm's overall picture.

Our algorithm consisted of two main stages: The first stage used an improved semantic segmentation algorithm to obtain the real-time mask of the current frame input image. The second stage used a GAN to achieve time-sensitive target stealth, thus wholly erasing the time-sensitive target from the image. To address the problem of inaccurate segmentation of time-sensitive targets in semantic segmentation, and drawing on the wide application of morphology in binary image processing, we added an external expansion module and a hole-filling module to lay the foundation for the second stage of target stealth. In the next step, the input image and the real-time mask were fed into a generative adversarial network that consisted of two main parts: first, an initial image reconstruction used a coarse feature network based on an encoding-decoding structure and, later, the final target stealth result was produced through a two-branch refinement network with contextual attention. A minimal sketch of this two-stage flow is given below.
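The following sketch illustrates the two-stage flow just described for a single frame. It is an illustration under stated assumptions, not the authors' released implementation: `mask_generator` and `inpaint_generator` are hypothetical stand-ins for the stage-1 segmentation network and the stage-2 gated-convolution network, and `expand_fn` stands for the morphological mask expansion step detailed in Section 3.2.3.

```python
import torch

def stealth_frame(frame, mask_generator, inpaint_generator, expand_fn,
                  threshold=0.5):
    """Hide targets in one video frame.

    frame: float tensor of shape (1, 3, H, W) in [0, 1].
    mask_generator / inpaint_generator: the stage-1 and stage-2 networks.
    expand_fn: the morphological mask expansion step (Section 3.2.3).
    """
    with torch.no_grad():
        # Stage 1: per-pixel target probabilities -> binary mask,
        # then adaptive expansion so the mask fully covers the target.
        logits = mask_generator(frame)                      # (1, 1, H, W)
        mask = (torch.sigmoid(logits) > threshold).float()
        mask = expand_fn(mask)

        # Stage 2: blank out the masked region and let the generative
        # network repaint it from the surrounding background.
        corrupted = frame * (1.0 - mask)
        completed = inpaint_generator(corrupted, mask)

        # Keep original pixels outside the mask; only hidden pixels change.
        return completed * mask + frame * (1.0 - mask)
```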
3.2. Real-time mask generation

To apply a generative network to hide military time-sensitive targets, we first need to generate a mask matrix in real time to contaminate the area to be hidden. In this paper, we implemented real-time mask generation with a semantic segmentation method and a mask expansion module. Chen et al. [34] extended the encoding-decoding architecture based on the classical semantic segmentation algorithm Deeplabv3. They proposed the Deeplabv3+ method, which uses Deeplabv3 as an encoder and constructs a simple and effective decoder to achieve the fusion of features at different levels, reducing the number of parameters while increasing the running speed. Therefore, this method was selected as the first stage of the semantic segmentation model. The network architecture of the semantic segmentation algorithm based on encoding-decoding is shown in Fig. 3.

3.2.1. Backbone network optimization
Unlike other Deeplab [34,35] series algorithms, the Xception [38] series was used as the backbone feature extraction network in the architecture of this paper. The Xception series model is an improvement of Inception-v3 [11], which has shown high performance in image classification and target detection.

Fig. 2. Overview of the intelligent stealth algorithm for military time-sensitive targets.

Fig. 3. Encoder-Decoder semantic segmentation network architecture.


Based on this, in this paper we improved on Aligned Xception by modifying the structure of the entry flow network and replacing all max pooling layers with depthwise separable convolutions with stride, thus improving the model effectiveness and computational efficiency.

In the original depthwise separable convolution, the depthwise convolution of the input feature map is performed first, followed by the pointwise convolution. In contrast, the depthwise separable convolution in Xception expands the channels using a 1 × 1 convolution, followed by the depthwise convolution, and finally concatenation. Combined with dilation, this is called atrous separable convolution, and we can extract feature maps with arbitrary resolution.
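As a concrete reference, the sketch below shows a separable convolution block with stride and dilation in PyTorch. It is a generic illustration rather than the paper's exact backbone block: it uses the common depthwise-first ordering, whereas, as noted above, the Xception variant applies the pointwise step first; the normalization and activation choices are also assumptions.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: a per-channel (optionally atrous)
    3x3 depthwise convolution followed by a 1x1 pointwise convolution that
    mixes channels. stride=2 can stand in for a max pooling layer."""
    def __init__(self, in_ch, out_ch, stride=1, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_ch, in_ch, kernel_size=3, stride=stride,
            padding=dilation, dilation=dilation,
            groups=in_ch, bias=False)   # groups=in_ch -> one filter per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```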
3.2.2. Atrous convolution-based encoder-decoder architecture
Usually, we use down-sampling to reduce the computation and increase the receptive field in deep neural networks, but this leads to a large amount of missing information at the expense of spatial resolution. On the one hand, we can increase the receptive field to achieve large-target segmentation by using atrous convolution. On the other hand, we can improve the resolution to attain precise target localization compared to the down-sampling method.

Fig. 2 shows that an ASPP (Atrous Spatial Pyramid Pooling) structure is also included in the encoder, i.e., it contains a 1 × 1 convolution, three atrous convolutions at different rates, and an image pooling layer. First, the ASPP operation is performed on the advanced feature layer, which is compressed four times: the local features are extracted from the image by the convolution operations, and image pooling implements global feature extraction; feature superposition and fusion are performed after the multi-scale features are extracted. Finally, the features are compressed by a 1 × 1 convolution.

In the decoder, we adjusted the number of channels using a 1 × 1 convolution for the initial effective feature layer compressed twice, and then concatenated it with the up-sampled results of the effective feature layer after the atrous convolution. Then, we performed a refinement operation on the stacked features using a 3 × 3 convolution, followed by quadruple bilinear up-sampling. At this point, we obtained a final effective feature layer, which is the feature condensation of the whole image. A sketch of the ASPP block follows.
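The following is a minimal PyTorch sketch of the ASPP block just described. The atrous rates (6, 12, 18) are the common DeepLabv3+ defaults and are an assumption here; the paper does not list its exact rates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """ASPP: a 1x1 convolution, three atrous convolutions at different
    rates, and global image pooling, fused by concatenation and compressed
    by a final 1x1 convolution."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),              # global feature extraction
            nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1,
                                 bias=False)      # compress fused features

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```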
3.2.3. Mask expansion
Due to the limited quality of the data annotation used for training and of the mask output by the segmentation model, some targets in the mask image cannot be fully segmented. This increases the difficulty of the hiding algorithm, resulting in poor target hiding performance. Therefore, we added a mask optimization algorithm at the mask output port, as shown in the mask expansion module in Fig. 1. Morphology offers a comprehensive and effective solution to numerous image processing issues. Inspired by the widespread use of mathematical morphology for binary image processing, our mask expansion module consists of two main parts: external dilation and hole filling.

External expansion is similar to "domain expansion", where the target region in the image is expanded to "grow" or "coarsen" the target in the binary image. The external expansion can be expressed as follows:

$$I \oplus E = \left\{ z \mid (\hat{E})_{z} \cap I \neq \varnothing \right\} \tag{1}$$

where $I$ is the original mask image output by the model, $E$ is the structuring element, $z$ is the translation, and $\hat{E}$ is the reflection of $E$ with respect to the origin. The external expansion of $E$ to $I$ is the set of all shifts $z$ under the condition that the foreground elements of $\hat{E}$ overlap with at least one element of $I$.

Hole filling fills the background regions that are surrounded by a connected border of foreground pixels. In short, we first find a point in the hole, expand it with the structuring element, and use the complement of the original image as a constraint; the expansion and constraint are repeated until the image no longer changes (i.e., convergence), and the intersection with the original image then gives the filled result. Let $I(x,y)$ represent a binary mask image, and define a marker image $F(x,y)$ that is 0 everywhere except at the border, where it equals $1 - I$:

$$F(x,y) = \begin{cases} 1 - I(x,y), & (x,y) \text{ on the border of } I \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Thus, a binary mask image $H(x,y)$ with the holes filled is given by the following equation:

$$H(x,y) = \left[ R^{D}_{I(x,y)^{c}}\big(F(x,y)\big) \right]^{c} \tag{3}$$

where $R^{D}$ is the morphological dilation reconstruction and $c$ is the complement operation. Specifically, we first perform the complement operation on the binary mask image $I(x,y)$. The complement $I(x,y)^{c}$ of $I(x,y)$ sets all foreground pixels to background pixels, while setting background pixels to foreground pixels. A hole is surrounded by foreground pixels by definition; therefore, the operation creates a wall of zeros around the hole. We then perform dilation reconstruction of $F(x,y)$ with the template $I(x,y)^{c}$. As a result, all positions of the reconstructed image corresponding to the foreground pixels of $I(x,y)$ are now 0. Finally, a complement operation is performed to obtain a mask image $H(x,y)$ in which the holes have been filled.
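A possible implementation of the two-part mask expansion module with OpenCV is sketched below. The structuring-element size is an assumed value, and the hole filling is realized with a border flood fill, which produces the same result as the geodesic dilation reconstruction of Eqs. (2) and (3) rather than being a literal transcription of it.

```python
import cv2
import numpy as np

def expand_mask(mask, kernel_size=15, iterations=1):
    """Mask expansion as in Eqs. (1)-(3): morphological dilation followed by
    hole filling. `mask` is a binary uint8 image (0 background, 255 target).
    kernel_size is an assumed value, not taken from the paper."""
    # External expansion (Eq. (1)): dilate the foreground with a structuring
    # element so the mask fully covers the target and its edges.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    dilated = cv2.dilate(mask, kernel, iterations=iterations)

    # Hole filling (Eqs. (2)-(3)) via flood fill from the border: every
    # background pixel not reachable from the border is a hole. Assumes the
    # top-left pixel is background.
    h, w = dilated.shape
    flood = dilated.copy()
    ff_mask = np.zeros((h + 2, w + 2), np.uint8)  # floodFill needs a 2-px pad
    cv2.floodFill(flood, ff_mask, (0, 0), 255)
    holes = cv2.bitwise_not(flood)                # pixels never reached = holes
    return cv2.bitwise_or(dilated, holes)
```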
Fig. 4. Structure of the time-sensitive object stealth network.

3.3. Target stealth network

The traditional GAN-based image inpainting algorithm suffers from problems such as distorted complementary regions and blurred detailed texture edges. Yu et al. [14,15] proposed a system for free-form image inpainting by incorporating gated convolution into deep adversarial generative networks. Inspired by their work, we derived a contextual generative network for hiding targets. Combined with the real-time mask generation module in the previous section, we applied this generative network to achieve real-time stealth of military time-sensitive targets.

We used the complete model architecture from the literature [15] for reconstructing the hidden target's background region, and the algorithm's overall structure is shown in Fig. 4. Like traditional generative adversarial networks, the architecture includes a coarse-to-fine two-stage generator based on gated convolution and a fully convolutional spectral-normalized Markovian discriminator. The generator is responsible for capturing the distribution of sample data through iterative data training. The discriminator is responsible for determining whether the samples are generated fake data or real data. The generator consists of a coarse network and a two-branch contextual attention-based refinement network, each using an encoder-decoder architecture, where the core is the use of gated convolution in place of all normal convolutions for free-form military time-sensitive target stealth. Assuming that the numbers of input and output channels are $C$ and $C'$, respectively, the gated convolution formula for each pixel located at $(b,a)$ is as follows:

$$\left\{\begin{aligned}
P(b,a) &= \sum_{i=-k_h'}^{k_h'} \sum_{j=-k_w'}^{k_w'} M_{p}\!\left(k_h'+i,\, k_w'+j\right) \cdot I(b+i,\, a+j) \\
Q(b,a) &= \sum_{i=-k_h'}^{k_h'} \sum_{j=-k_w'}^{k_w'} M_{q}\!\left(k_h'+i,\, k_w'+j\right) \cdot I(b+i,\, a+j) \\
J(b,a) &= \phi\big(Q(b,a)\big) \odot \sigma\big(P(b,a)\big)
\end{aligned}\right. \tag{4}$$

where $a$ and $b$ represent the $a$-axis and $b$-axis of the output map, $k_h$ and $k_w$ are the kernel height and width, $k_h' = (k_h-1)/2$, $k_w' = (k_w-1)/2$, $M_p \in \mathbb{R}^{k_h \times k_w \times C \times C'}$ and $M_q \in \mathbb{R}^{k_h \times k_w \times C \times C'}$ are two different convolutional filters, and $I(b+i,\, a+j)$ and $J(b,a)$ are the input and output. $\sigma$ represents the sigmoid activation function; consequently, the output gating value lies between 0 and 1. $\phi$ denotes an activation function (e.g., ELU or LeakyReLU), and $\odot$ denotes element-wise multiplication.

We learned a dynamic feature selection mechanism for each channel and spatial location using gated convolution, such that it can select features based on background and mask. Even in the deep layers, the gated convolution learns to emphasize masked regions in various channels in order to produce a better background for target stealth. A sketch of such a layer is given below.
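The sketch below expresses Eq. (4) as a PyTorch layer: two parallel convolutions over the same input produce the feature map $Q$ and the gating map $P$, and the output is $\phi(Q) \odot \sigma(P)$. It is a minimal reading of the formula, not the authors' code, and the hyperparameter defaults are assumptions.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution per Eq. (4): the gating branch (M_p) decides, per
    channel and spatial location, how much of the feature branch (M_q) to
    pass through."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                                 padding, dilation)   # M_q in Eq. (4)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding, dilation)      # M_p in Eq. (4)
        self.phi = nn.ELU()                           # feature activation

    def forward(self, x):
        # J = phi(Q) * sigmoid(P); the gate value lies in (0, 1).
        return self.phi(self.feature(x)) * torch.sigmoid(self.gate(x))
```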
For stable training of generative adversarial networks, we used spectral normalization [45], a fast-convergence method, and the hinge loss is expressed as

$$\mathcal{L}_{G} = -\mathbb{E}_{z \sim P_{z}(z)}\left[ D\big(G(z)\big) \right] \tag{5}$$

$$\mathcal{L}_{D} = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}\left[ \mathrm{ReLU}\big(1 - D(x)\big) \right] + \mathbb{E}_{z \sim P_{z}(z)}\left[ \mathrm{ReLU}\big(1 + D(G(z))\big) \right] \tag{6}$$

where $\mathcal{L}_{G}$ and $\mathcal{L}_{D}$ represent the objective functions of the generator and discriminator, respectively, $D$ represents the spectral-normalized discriminator, and $G$ is the image inpainting network that takes the incomplete image $z$.
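Eqs. (5) and (6) translate directly into the following loss functions, assuming `d_real = D(x)` and `d_fake = D(G(z))` are the discriminator's outputs on real and completed images.

```python
import torch
import torch.nn.functional as F

def generator_hinge_loss(d_fake):
    """Eq. (5): the generator pushes D's score on completed images up."""
    return -d_fake.mean()

def discriminator_hinge_loss(d_real, d_fake):
    """Eq. (6): hinge loss for the spectral-normalized discriminator."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()
```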
4. Results

4.1. Settings

4.1.1. Hardware/software environment
The experiments below were carried out on a server with two Intel Xeon Gold 6230 (2.1 GHz/20C/27.5M L3) CPUs and two NVIDIA RTX 8000 GPUs with 192 GB of memory, with the Python code running on Ubuntu 18.04. With CUDA 10.1, CuDNN 7.6.5, PyTorch 1.6, and other widely used deep learning and image processing libraries installed, our development environment was PyCharm 2022.

4.1.2. Dataset
Semantic segmentation datasets for standard objects, such as PASCAL VOC [46] and MSCOCO, have been released, and they contain some time-sensitive objects such as aircraft and ships. However, the military targets in these datasets are minimal, and it is challenging to meet the needs of real-time mask generation for military time-sensitive targets. In this work, we established a dataset based on PASCAL VOC and DIOR [47], called the Military time-sensitive Target Masking Dataset (MTMD). We selected 867 images of aircraft and ships from the PASCAL VOC2012 and DIOR datasets as military time-sensitive targets. In addition, we obtained 594 images of time-sensitive targets through software simulation and Internet downloads, mainly including ships and aircraft. EISeg-develop [48], a semantic segmentation annotation tool, was used to annotate the military time-sensitive target images in PASCAL VOC format.

4.1.3. Training details
Our model training was divided into two parallel branches, training the mask generation network and the object stealth network. First, we divided the MTMD dataset, with a training set to validation set to test set ratio of 7:2:1. Then, the real-time mask generation network was trained using the MTMD dataset. The training parameters were set as follows: the batch size, learning rate, optimizer and epochs were initialized as 32, $7 \times 10^{-4}$, SGD and 300, respectively. The target hiding network had a total of 4.1 M parameters. We used the public dataset Places2 [49] to train the model.

The following training settings were established: the learning rate, beta1, beta2, iterations and batch size were initialized as 0.0001, 0.5, 0.9, 500,000 and 32, respectively. The size of the feature map in the discriminator was 64, while that of the feature map in the generator was 32.
4.2. Object segmentation mask comparison

For the proposed real-time military time-sensitive target stealth method, as mentioned above, the essential point is that the segmentation mask obtained in real time can completely cover the time-sensitive target. Otherwise, there will be spatial distortion, texture detail disorder, etc. It can be observed that the aircraft and ship targets in Fig. 5 are only partially hidden. We can see the contours of airplanes and ships; the main reason for the failure is that during the real-time mask generation stage, the network did not segment a valid mask image (i.e., the mask image cannot fully cover the hidden area). Therefore, in the target stealth stage, it is impossible to achieve good results, and contours that are not masked can cause defects such as texture distortion when the method generates the background. For algorithm evaluation, we favor high recall over precision. In this section, we experimentally validated the object segmentation mask algorithm on the MTMD dataset. Following previous work [50], we chose mIoU, mPrecision, and mRecall as our evaluation metrics. We compared our approach to modern baselines, including PSPNet [33], UNet [42], and HrNet [51]. To ensure a fair comparison, we configured the training hyperparameters (epoch, learning rate, whether to pre-train, etc.) for each algorithm in the same way on the same device.

Table 1 contains the quantitative evaluation findings for our approach and the other benchmarks. Our method significantly outperforms the previous baselines in the mIoU and mRecall metrics and is suboptimal in the mPrecision metric. This means that a more complete and accurate target segmentation mask can be obtained using our method and, more importantly, that the foundation for subsequent target stealth can be laid.

Table 1
Segmentation quantitative results on the MTMD dataset.

Method    mIoU     mPrecision    mRecall
PSPNet    81.21    94.05         84.49
HRNet     85.49    90.20         93.20
UNet      86.65    91.12         93.77
Ours      88.57    93.29         93.94

Precision refers to the probability of being a positive sample among all predicted positive samples, i.e., it is specific to the predicted results. Recall is specific to the original samples and is the probability that an actual positive sample is predicted as positive. We hope for both high precision and high recall, but these two indicators are contradictory and cannot both be maximized. The phenomenon of incomplete segmentation of the target increases the difficulty of the hiding method. To overcome this problem, our method adds a mask optimization step, the mask dilation module, which causes the segmentation results to spread outward and thereby decreases the precision of the method. However, the recall rate of the method is improved as a result, laying a good foundation for the second stage of target hiding. Compared with PSPNet, the algorithm proposed in this paper has a significant advantage in recall rate at the cost of some precision. Therefore, the precision of the algorithm proposed in this paper is slightly lower than that of PSPNet. The computation of these per-mask scores is sketched below.
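For reference, the per-mask scores behind Table 1 can be computed as follows; this is the standard formulation of IoU, precision, and recall from a confusion matrix, not code from the paper. The mIoU, mPrecision, and mRecall values would be these scores averaged over classes and test images.

```python
import numpy as np

def segmentation_scores(pred, gt):
    """IoU, precision, and recall for one binary mask pair.
    `pred` and `gt` are boolean arrays of the same shape."""
    tp = np.logical_and(pred, gt).sum()     # correctly masked pixels
    fp = np.logical_and(pred, ~gt).sum()    # over-segmented pixels
    fn = np.logical_and(~pred, gt).sum()    # missed target pixels
    iou = tp / (tp + fp + fn + 1e-9)
    precision = tp / (tp + fp + 1e-9)   # reduced by mask dilation spreading
    recall = tp / (tp + fn + 1e-9)      # raised by full target coverage
    return iou, precision, recall
```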
Fig. 6 shows the qualitative results of our segmentation method. The first column displays the original image, and the following columns display the Ground Truth (GT) and the corresponding results of the five segmentation methods: HrNet, PSPNet, UNet, SAM [52] and MTTSNet (Ours). The hidden results obtained based on these segmentation results are also provided. These examples demonstrate the effectiveness of our method in fully covering the target. Compared with the HrNet, PSPNet, and UNet algorithms, the SAM method has more accurate segmentation and better segmentation performance, which can also be seen indirectly through the target hiding results. Compared with the other algorithms, our method can more completely segment military targets and lay the foundation for subsequent target hiding.

Fig. 5. Object stealth failure results.


Fig. 6. Visual comparison of different segmentation approaches. Left to right: Original, GT, HrNet, PSPNet, UNet, SAM, ours.

In the hiding stage, incomplete segmentation will generate many target shadows, resulting in a poor hiding effect. Even if the segmentation is complete, the hiding effect can still be degraded by edge shadows during the target hiding process.

4.3. Object stealth comparison

4.3.1. Quantitative results
We compared our algorithm with three classes of classical algorithms, including CRfill [53], CTSDG [54], and RFR [55]. To demonstrate the method's effectiveness, we next evaluated the targets stealth algorithm on the MTMD dataset. Inpainting lacks effective quantitative evaluation metrics, as stated in Ref. [14]. As a result, it is challenging to evaluate image completion outcomes quantitatively, and there is no agreement on the best statistic to apply. Here, we present the findings of a number of distinct measures for reference. First, the peak signal-to-noise ratio (PSNR) metric is used to assess the visual quality of the recovered images produced by the various methods. This metric measures the difference between the original image and the repaired image as the ratio between the maximum possible power of the signal and the power of the added noise. In addition, the structural similarity index (SSIM) metric is used to measure the structural similarity between the original image and the restored image. The evaluation metrics are calculated as follows:

$$\mathrm{PSNR}_{S,I} = 10 \cdot \lg\!\left[ \frac{\left(2^{t}-1\right)^{2}}{\frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left[S(i,j)-I(i,j)\right]^{2}} \right] \tag{7}$$

where $\mathrm{PSNR}_{S,I}$ stands for the peak signal-to-noise ratio between the complemented image and the source image, $m$ is the image height, $n$ is the image width, $S$ is the original image, and $I$ is the complemented image. $(2^{t}-1)^{2}$ is the square of the maximum value of the signal, and $t$ is the number of bits per sample point.

$$\mathrm{SSIM}_{S,I} = \frac{2\mu_{s}\mu_{f} + C_{1}}{\mu_{s}^{2} + \mu_{f}^{2} + C_{1}} \cdot \frac{2\sigma_{s}\sigma_{f} + C_{2}}{\sigma_{s}^{2} + \sigma_{f}^{2} + C_{2}} \cdot \frac{\sigma_{sf} + C_{3}}{\sigma_{s}\sigma_{f} + C_{3}} \tag{8}$$

where the parameters $C_{1}$ and $C_{2}$ are used to stabilize the algorithm. $\mathrm{SSIM}_{S,I}$ denotes the similarity between the source image and the inpainted image. $\mu$ and $\sigma$ denote the mean and standard deviation of the images.

A random selection of 146 images from the MTMD test set and irregularly masked images of various sizes are used to evaluate the targets stealth algorithm. The irregularly masked images are divided into five groups according to size. As shown in Table 2, PSNR and SSIM represent the mean of all values tested. The greater the PSNR value and the closer the SSIM value is to 1, the higher the image fidelity. The approach presented in this study is the most effective of the four approaches across all measures. The increase in PSNR and SSIM suggests that our results are closer to the actual data. These superior results show that our algorithm can generate less distorted images, visually more believable content, and better spatial and temporal consistency.

Table 2
Quantitative comparison between different methods.

Metric   Missing Area    CRfill     CTSDG      RFR        Ours
PSNR     [0.01, 0.1]     44.2097    31.6181    39.2776    44.4526
         (0.1, 0.2]      37.3279    30.7680    37.4248    39.3939
         (0.2, 0.3]      34.0579    30.4263    32.5658    36.0683
         (0.3, 0.4]      31.0331    30.1276    32.5257    33.8734
         (0.4, 0.5]      25.9663    28.4673    26.0712    28.1882
SSIM     [0.01, 0.1]     0.9854     0.8881     0.9713     0.9855
         (0.1, 0.2]      0.9719     0.8677     0.9583     0.9824
         (0.2, 0.3]      0.9524     0.8598     0.9370     0.9585
         (0.3, 0.4]      0.9312     0.8489     0.9219     0.9416
         (0.4, 0.5]      0.8785     0.8162     0.8765     0.8977

Fig. 7. Comparison of several inpainting techniques visually. Original, Mask, CTSDG, CRfill, RFR, and ours, from left to right.

4.3.2. Qualitative results
Fig. 7 shows the visual comparison results of the various target stealth methods. The first column in the figure contains eight images to be hidden; the aircraft and ships in the figures are the targets to be hidden. The second column contains the real-time mask images generated from these eight images. The following four columns show the hidden results obtained by four methods: CTSDG, CRfill, RFR, and Ours. In order to present the results more prominently, we enlarge some of the red box areas. The hidden targets in Tests 1 and 2 account for a large proportion of the entire image, and it can be seen that the RFR method still retains contour information of the hidden targets, while the CTSDG and CRfill methods show some pixel overflow. Compared to these methods, only our method performs excellently in this scenario, completing the background information in more detail. Tests 3 and 4 demonstrate the stealth results of the various algorithms for dense aircraft and ship targets in an air-to-ground view. The CTSDG and RFR methods suffer from many visual artifacts and edge responses, while CRfill produces better results; however, the faded area still shows some chromatic aberration compared to the surrounding background, while our method produces a more natural image. There is more interference information around the target to be faded in Tests 5 and 6. It can be seen that the stealth results of the three previous algorithms all retain a large amount of target structure information. In contrast, our method achieves a high degree of realism and preserves edge details while hiding the target.


The difficulty of Tests 7 and 8 lies in the complex background around the target being hidden, which can cause image distortion and other issues during background restoration. Currently, most background restoration methods rely on the information around the region to be repaired, which is propagated through surrounding pixels to achieve restoration. It can be seen that the stealth results of the CTSDG and RFR methods still show target vignettes and poor stealth effects, and the CRfill method makes the stealth area more blurred, while the stealth results of the algorithm in this paper have the best visual effect.

4.3.3. User study
We conducted a user study of the targets stealth algorithm and enlisted the help of 14 participants in order to undertake a more thorough comparison. From the test set, 20 pictures were chosen at random. Then, for comparison, we computed the outcomes of the following four methods: (1) CTSDG, (2) CRfill, (3) RFR, and (4) our model. We performed two types of user studies. (A) We rank each method's stealth outcomes (1-10), with greater scores being better, and then calculate the average score for each method. (B) Out of the findings produced by the four approaches, each participant was asked to select the best one. The results of this test depend on the subjective visual perception of each participant. The results of the user study are presented below.

4.4. Video real-time object hiding experiment

The proposed method was compared to the well-known video-based image completion method E2FGVI [28]. E2FGVI utilizes the spatiotemporal information of the video sequence to fill in the hidden target area. However, such methods cannot process the collected data in real time and require early annotation of the video to be processed. We evaluated the performance of E2FGVI and our proposed method based on the inference time. According to the experimental findings, the inference time for each frame of E2FGVI is 0.16 s, while our method only needed 0.069 s to complete the inference process for one image frame. The visual comparison results of the two methods are shown in Fig. 8, and it can be seen that the method proposed in this paper has superior detail restoration capability, and the stealth effect on the target is more pronounced in certain scenarios.

Fig. 8. Visual results compared with E2FGVI.
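A per-frame latency measurement of the kind used for this comparison might look like the sketch below. It is a generic timing harness, assuming a model that maps one frame tensor to one completed frame, and is not the authors' benchmark code.

```python
import time
import torch

def mean_frame_latency(model, frames, device='cuda'):
    """Average per-frame inference time, the measure used above to compare
    MTTSNet (0.069 s/frame) with E2FGVI (0.16 s/frame)."""
    model.eval().to(device)
    times = []
    with torch.no_grad():
        for frame in frames:
            frame = frame.to(device)
            if device == 'cuda':
                torch.cuda.synchronize()   # exclude previously queued kernels
            start = time.perf_counter()
            _ = model(frame)
            if device == 'cuda':
                torch.cuda.synchronize()   # wait for the GPU to finish
            times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```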


5. Conclusions

MTTSNet is a novel end-to-end military time-sensitive target stealth network proposed in this article. The method consists of two components, an encoder-decoder network based on an expansion module and a two-branch generative adversarial network, which are used to solve the real-time mask generation and object stealth problems, respectively. In the real-time mask generation stage, the encoder-decoder structure was used to achieve multi-scale feature fusion, and we enhanced the target localization and segmentation accuracy by introducing the expansion and padding modules. In the target stealth stage, a gated convolution-based generator model was used to generate the image after the target is hidden, whereas intensive training increases the discriminator's capacity to discern between real and fake images. Extensive qualitative and quantitative experimental results indicate that MTTSNet outperformed the majority of advanced methods in quantitative metrics and visual perception. In the future, we will, on the one hand, further optimize the model to improve the performance of target concealing and, on the other hand, lighten the network to improve the algorithm's real-time performance. In terms of application, we hope that this method can be used not only in the military but also in the civilian field.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62276274), the Shaanxi Natural Science Foundation (Grant No. 2023-JC-YB-528), and the Chinese Aeronautical Establishment (Grant No. 201851U8012).

References

[1] Xu R, et al. Texture memory-augmented deep patch-based image inpainting. IEEE Trans Image Process 2021;30:9112-24.
[2] Wan W, Liu J. Nonlocal patches based Gaussian mixture model for image inpainting. Appl Math Model 2020;87:317-31.
[3] Ruzic T, Pizurica A. Context-aware patch-based image inpainting using markov random field modeling. IEEE Trans Image Process 2015;24(1):444-56.
[4] Jin KH, Ye JC. Annihilating filter-based low-rank hankel matrix approach for image inpainting. IEEE Trans Image Process 2015;24(11):3498-511.
[5] Ding D, Ram S, Rodriguez JJ. Image inpainting using nonlocal texture matching and nonlinear filtering. IEEE Trans Image Process 2019;28(4):1705-19.
[6] Zhang YL, et al. Feature pyramid network for diffusion-based image inpainting detection. Inf Sci 2021;572:29-42.
[7] Sridevi G, Kumar SS. Image inpainting based on fractional-order nonlinear diffusion for image reconstruction. Circ Syst Signal Process 2019;38(8):3802-17.
[8] Li HD, Luo WQ, Huang JW. Localization of diffusion-based inpainting in digital images. IEEE Trans Inf Forensics Secur 2017;12(12):3050-64.
[9] He KM, et al. Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition (CVPR) 2016; 2016. p. 770-8. Seattle, USA.
[10] Lu R, Yang X, Jing X, et al. Infrared small target detection based on local hypergraph dissimilarity measure. Geosci Rem Sens Lett IEEE 2020;19:1-5.
[11] Lu R, Yang X, Li W, et al. Robust infrared small target detection via multidirectional derivative-based weighted contrast measure. Geosci Rem Sens Lett IEEE 2020;19:1-5.
[12] Zheng HT, et al. Image inpainting with cascaded modulation GAN and object-aware training. In: European conference on computer vision (ECCV) 2022. Tel Aviv, ISRAEL; 2022. p. 277-96.
[13] Yu YS, et al. High-fidelity image inpainting with GAN inversion. In: European conference on computer vision (ECCV) 2022. Tel Aviv, ISRAEL; 2022. p. 242-58.
[14] Yu JH, et al. Generative image inpainting with contextual attention. In: IEEE conference on computer vision and pattern recognition (CVPR) 2018; 2018. p. 5505-14. Salt Lake City, UT, USA.
[15] Yu JH, et al. Free-form image inpainting with gated convolution. In: IEEE/CVF international conference on computer vision (ICCV) 2019. Seoul, SOUTH KOREA; 2019. p. 4470-9.
[16] Suvorov R, et al. Resolution-robust large mask inpainting with fourier convolutions. In: IEEE/CVF winter conference on applications of computer vision (WACV) 2022; 2022. p. 3172-82. Waikoloa, USA.
[17] Zeng YH, et al. Learning pyramid-context encoder network for high-quality image inpainting. In: IEEE conference on computer vision and pattern recognition (CVPR) 2019; 2019. p. 1486-94. Long Beach, CA, USA.
[18] Yan ZY, et al. Shift-net: image inpainting via deep feature rearrangement. In: European conference on computer vision (ECCV) 2018; 2018. p. 3-19. Munich, GERMANY.
[19] Pathak D, et al. Context encoders: feature learning by inpainting. In: IEEE conference on computer vision and pattern recognition (CVPR) 2016; 2016. p. 2536-44. Seattle, USA.
[20] Chang YL, et al. VORNet: spatio-temporally consistent video inpainting for object removal. In: IEEE conference on computer vision and pattern recognition (CVPR) 2019; 2019. p. 1785-94. Long Beach, CA, USA.
[21] Zhao L, et al. UCTGAN: diverse image inpainting based on unsupervised cross-space translation. In: IEEE conference on computer vision and pattern recognition (CVPR) 2020; 2020. p. 5740-9. Seattle, USA.
[22] Zeng Y, et al. CR-fill: generative image inpainting with auxiliary contextual reconstruction. In: IEEE/CVF international conference on computer vision (ICCV) 2021. Montreal, Canada; 2021. p. 14144-53.
[23] Liu HY, et al. PD-GAN: probabilistic diverse GAN for image inpainting. In: IEEE conference on computer vision and pattern recognition (CVPR) 2021; 2021. p. 9367-76. online.
[24] Goodfellow I, et al. Generative adversarial networks. Commun ACM 2020;63(11):139-44.
[25] Zhang KD, Fu JJ, Liu D. Flow-guided transformer for video inpainting. In: European conference on computer vision (ECCV) 2022. Tel Aviv, ISRAEL; 2022. p. 74-90.
[26] Yang WQ, et al. Deep face video inpainting via UV mapping. IEEE Trans Image Process 2023;32:1145-57.
[27] Ouyang H, et al. Internal video inpainting by implicit long-range propagation. In: IEEE/CVF international conference on computer vision (ICCV) 2021; 2021. p. 14559-68. Montreal, Canada.
[28] Li Z, et al. Towards an end-to-end framework for flow-guided video inpainting. In: IEEE conference on computer vision and pattern recognition (CVPR) 2022; 2022. p. 17541-50. New Orleans, LA, USA.
[29] Kang J, Oh SW, Kim SJ. Error compensation framework for flow-guided video inpainting. In: European conference on computer vision (ECCV) 2022; 2022. p. 375-90. Tel Aviv, ISRAEL.
[30] Liu J, et al. BFMNet: bilateral feature fusion network with multi-scale context aggregation for real-time semantic segmentation. Neurocomputing 2023;521:27-40.
[31] Noh H, et al. Learning deconvolution network for semantic segmentation. In: IEEE/CVF international conference on computer vision (ICCV) 2015. Santiago, CHILE; 2015. p. 1520-8.
[32] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39(12):2481-95.
[33] Zhao HS, et al. Pyramid scene parsing network. In: IEEE conference on computer vision and pattern recognition (CVPR) 2017; 2017. p. 6230-9. Honolulu, HI, USA.
[34] Chen LCE, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: European conference on computer vision (ECCV) 2018; 2018. p. 833-51. Munich, GERMANY.
[35] Chen LC, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40(4):834-48.
[36] He KM, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 2015;37(9):1904-16.
[37] Grauman K, Darrell T. The pyramid match kernel: discriminative classification with sets of image features. In: IEEE/CVF international conference on computer vision (ICCV) 2005; 2005. p. 1458-65. Beijing, CHINA.
[38] Chollet F. Xception: deep learning with depthwise separable convolutions. In: IEEE conference on computer vision and pattern recognition (CVPR) 2017; 2017. p. 1800-7. Honolulu, HI, USA.
[39] Vanhoucke V. Learning visual representations at scale. ICLR invited talk 2014;1(2).
[40] Bertalmio M, Sapiro G, Caselles V, et al. Image inpainting. In: Proceedings of the 27th annual conference on computer graphics and interactive techniques 2000; 2000. p. 417-24. New Orleans, USA.
[41] Chaudhury S, Roy H. Can fully convolutional networks perform well for general image restoration problems? In: 15th IAPR international conference on machine vision applications (MVA) 2017. Nagoya, JAPAN: Nagoya Univ; 2017. p. 254-7.
[42] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: 18th international conference on medical image computing and computer-assisted intervention (MICCAI) 2015; 2015. p. 234-41. Munich, GERMANY.
[43] Sasaki K, et al. Joint gap detection and inpainting of line drawings. In: IEEE conference on computer vision and pattern recognition (CVPR) 2017; 2017. p. 5768-76. Honolulu, HI, USA.
[44] Liu G, et al. Image inpainting for irregular holes using partial convolutions. In: European conference on computer vision (ECCV) 2018; 2018. p. 85-100. Munich, GERMANY.
[45] Miyato T, Kataoka T, Koyama M, et al. Spectral normalization for generative adversarial networks. arXiv preprint 2018;arXiv:1802.05957.
[46] Everingham M, Eslami SMA, Van Gool L, et al. The pascal visual object classes challenge: a retrospective. Int J Comput Vis 2015;111:98-136.


[47] Li K, Wan G, Cheng G, et al. Object detection in optical remote sensing images: a survey and a new benchmark. ISPRS J Photogrammetry Remote Sens 2020;159:296-307.
[48] Hao Y, Liu Y, Chen Y, et al. EISeg: an efficient interactive segmentation tool based on PaddlePaddle. arXiv preprint 2022;arXiv:2210.08788.
[49] Zhou B, Lapedriza A, Khosla A, et al. Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell 2017;40(6):1452-64.
[50] Le TT, Almansa A, Gousseau Y, et al. Object removal from complex videos using a few annotations. Comput Visual Media 2019;5:267-91.
[51] Xie E, Wang W, Yu Z, et al. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst 2021;34:12077-90.
[52] Kirillov A, Mintun E, Ravi N, et al. Segment anything. arXiv preprint 2023;arXiv:2304.02643.
[53] Zeng Y, Lin Z, Lu H, et al. Cr-fill: generative image inpainting with auxiliary contextual reconstruction. In: IEEE conference on computer vision and pattern recognition (CVPR) 2021; 2021. p. 14164-73. online.
[54] Guo X, Yang H, Huang D. Image inpainting via conditional texture and structure dual generation. In: IEEE conference on computer vision and pattern recognition (CVPR) 2021; 2021. p. 14134-43. online.
[55] Li J, Wang N, Zhang L, et al. Recurrent feature reasoning for image inpainting. In: IEEE conference on computer vision and pattern recognition (CVPR) 2020; 2020. p. 7760-8. Seattle, USA.

Si-yu Wang was born in Xuzhou, Jiangsu, PRC, in 1996. He received the M.S. degree from Xi'an Shiyou University, Xi'an, China, in 2021. He is currently pursuing the Ph.D. degree in the PLA Rocket Force University of Engineering, Xi'an, China. His research interests include pattern recognition, image processing, and object detection.

(Corresponding Author) Xiao-gang Yang was born in Xi'an, Shaanxi, PRC, in 1978. He received the Ph.D. degree in control science from Rocket Force University of Engineering, Shaanxi, China, in 2006. He is currently a Faculty Member with the Department of Control Engineering, Rocket Force University of Engineering. He is the author of 90 articles and 25 inventions. His research interests include precision guidance and image processing.

Rui-tao Lu received the Ph.D. degree in control science from the National University of Defense Technology, Changsha, China, in 2016. He is currently a Faculty Member with the Department of Control Engineering, Rocket Force University of Engineering, Xi'an, China. His current research interests include pattern recognition, image processing, and machine learning.

Zheng-jie Zhu received the B.E. and master's degrees from the College of Missile Engineering, Rocket Force University of Engineering, Xi'an, China, in 2020, where he is currently pursuing the Ph.D. degree. His main research interests include computer vision, image processing and deep learning.

Fang-jia Lian is currently pursuing the bachelor's degree in the PLA Rocket Force University of Engineering, Xi'an, China. Her research interests include object detection, image processing.

Qingge Li is currently pursuing the Ph.D. degree in the PLA Rocket Force University of Engineering, Xi'an, China. Her research interests include visual navigation, object detection, image processing.

Jiwei Fan was born in Shulan, Jilin, PRC in 1990. He received the M.S. degree from Northeast Electric Power University, Jilin, China, in 2018. He is currently pursuing the Ph.D. degree in control science and engineering with Xi'an high-tech institution. His research interests include image processing, machine learning, precision guidance, UAV target detection and tracking.
