This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TMM.2022.3194309
Abstract—Cross-modal communications, devoted to collaboratively delivering and processing audio, visual, and haptic signals, have gradually become the supporting technology for emerging multi-modal services. However, the inevitable resource competition among different modality signals, as well as unexpected packet loss and latency during transmission, seriously affects the quality of the received signals and the end user's immersive experience (especially the visual experience). To overcome these dilemmas, this paper proposes a cross-modal signal reconstruction strategy from the perspective of human perceptual facts. It tries to guarantee visual signal quality by considering potential correlations among modalities when processing audio and haptic signals. On the one hand, a time-frequency masking-based audio-haptic redundancy elimination mechanism is designed by resorting to the similarity of audio-haptic characteristics and human masking effects. On the other hand, based on the fact that non-visual perception can assist in forming and enhancing visual perception, an audio-haptic fused visual signal restoration (AHFVR) approach for handling impaired and delayed visual signals is proposed. Experiments on a standard multi-modal database and a constructed practical platform evaluate the performance of the proposed perception-aware cross-modal signal reconstruction strategy.

Index Terms—cross-modal communications, perception, visual signal reconstruction, audio-haptic redundancy elimination.

I. INTRODUCTION

With the dramatic development of wireless communications and multimedia technologies, humans have begun to pursue more extreme interaction and immersion now that their audio-visual sensation has been satisfied [1]. Accordingly, multi-modal services, typically integrating audio, visual, and haptic signals, will obviously provide a much more immersive experience in various scenarios, such as entertainment, education, healthcare, and industry [2]. For example, haptic-enabled telesurgery will improve the doctor's interaction and scene experience by providing 360 degree panoramic video, high definition audio, and force feedback, as well as guaranteeing transmission and signal processing issues [3].

In order to support multi-modal services and enhance the end user's immersive experience, cross-modal communications have come into being [4], [5]. Compared with existing multi-modal communication schemes [1], [6], they try to leverage the potential correlation among modalities to satisfy demands such as low latency, high reliability, and high throughput. However, when transmitting and receiving audio, visual, and haptic information under the existing cross-modal communication framework, two main challenges remain. First, haptic, audio, and visual streams inevitably compete for the limited resources during transmission, which has an especially bad effect on visual streams. Existing modality priority-based scheduling schemes give the haptic stream higher priority to guarantee its low latency and high reliability requirements (see Table I) [3], [6]. In other words, when a haptic stream appears, visual streams with lower priority are interrupted. After the transmission of the high-priority haptic stream has finished, the visual stream is resumed. Though a haptic stream has a smaller scale compared to a visual stream, its high occurrence frequency and bursty characteristic still severely affect the quality of visual streaming [7].

TABLE I: Transmission requirements of haptic, visual, and audio signals [4]

                  Haptic         Visual           Audio
  Delay           [1 - 60] ms    ≤ 350 ms         ≤ 150 ms
  Jitter          [1 - 10] ms    ≤ 30 ms          ≤ 30 ms
  Data loss rate  ∼ 10⁻⁵ %       ∼ 20 %           ∼ 30 %
  Data rate       ∼ 128 Kbps     [2.5 - 40] Mbps  [22 - 200] Kbps

Second, due to unexpected packet loss and transmission latency, the received quality of visual signals can hardly be satisfied. During multi-modal streaming, the changeable transmission environment unavoidably produces side effects, especially for the received visual signals [8]. The reason is that visual streams are obviously large in scale, which may cause impairment during reception. Moreover, owing to their lower transmission priority and the limited bandwidth, the delayed arrival of visual streams is prone to happen. Traditional methods, such as restoration within the visual modality or retransmission of the destroyed streams, are not suitable for solving this issue, as they may harm the synchronization between visual and non-visual (haptic, audio) modalities and the user's immersive experience [9], [10].

To deal with the above issues, a cross-modal signal reconstruction strategy is considered an effective way forward [4]. Before transmission, the volume of non-visual signals should be reduced as much as possible to guarantee visual transmission throughput. For example, if the redundancy of haptic signals is analyzed and their occurrence frequency can be reasonably reduced, the competition between haptic and visual streams can be weakened and the throughput of visual streaming will

Manuscript received April 4, 2022; revised June 14, 2022; accepted July 12, 2022. This work was supported by the National Natural Science Foundation of China under Grant 62071254 and the Priority Academic Program Development of Jiangsu Higher Education Institutions. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Zhi Wang. (Corresponding Author: Liang Zhou.)
The authors are with the School of Communications and Information Engineering, and Key Lab of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China. (E-mails: {xwei, 1019010631, 1221014002, liang.zhou}@njupt.edu.cn).
© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Macau Univ of Science and Technology. Downloaded on January 15,2023 at 18:45:21 UTC from IEEE Xplore. Restrictions apply.
be improved. After receiving, if visual impairment or delay phenomena emerge, a restoration approach can be adopted that takes advantage of potential correlations between visual and non-visual signals. It is considered an effective substitute for the retransmission scheme.

In the human perception system, there exist inherent interactions among the auditory, haptic, and visual sensory subsystems, which comprehensively determine human experience [11], [12]. As the ultimate goal of multi-modal services is to promote the user's immersive sensory experience, this motivates us to borrow the related mechanisms from the human perception system when designing the cross-modal signal reconstruction architecture. Specifically, as two representative non-visual perceptions, there are plenty of similarities between the audio and haptic senses. On the one hand, audio and haptic signals not only have a relatively small size requiring low bandwidth (∼ Kbps in Table I), but also share similar waveforms in the time and frequency domains for specific content (e.g., wood texture in Fig. 1). On the other hand, humans have analogous masking effects in both their auditory and haptic subsystems [13], [14]. The former fact inspires us to take audio and haptic signals together, while the latter phenomenon stimulates us to reduce the volume of these two non-visual signals. In other words, considering and eliminating the redundancy of these two non-visual signals together may be suitable for enhancing visual streaming efficiency at the sender, which has not attracted attention in current works. Meanwhile, how to explore and leverage these correlations among modalities to realize visual signal restoration is still difficult.

Based on the above analysis, we aim at designing a cross-modal reconstruction strategy by resorting to the mechanisms of human perception. The contributions of this paper are listed as follows:

1) We construct a perception-aware cross-modal signal reconstruction architecture. It elaborately contains an audio-haptic redundancy elimination mechanism at the sender as well as an audio-haptic fused visual signal restoration approach at the receiver.

2) We design a time-frequency masking-based redundancy elimination mechanism for the audio and haptic streams to be delivered at the sender. It takes advantage of the masking effect to reduce the occurrence frequency of audio and haptic streams.

3) We propose an audio-haptic fused visual signal restoration (AHFVR) approach. It utilizes modality correlation for handling the impaired or delayed visual signals at the receiver. In this approach, low-level features extracted from the received audio and haptic signals are first fused with semantic constraints. Then, an adversarial scheme is introduced for the fused features to capture their latent correlation in the real visual space. Finally, a hierarchical fine-grained representation structure and the knowledge distillation technique are adopted to realize the desired visual signal generation.

The rest of this paper is arranged as follows: Section II introduces the related work. Section III describes the constructed perception-aware cross-modal signal reconstruction architecture. The details of the designed time-frequency masking-based audio-haptic redundancy elimination mechanism and the proposed AHFVR approach are presented in Section IV and Section V, respectively. Experimental results and analysis on a standard dataset are given in Section VI. In Section VII, the effectiveness of the proposed perception-aware cross-modal signal reconstruction is verified on a practical platform. Finally, Section VIII gives concluding remarks.
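Before moving on, the Table I transmission requirements and the priority-based scheduling behavior described in this introduction can be sketched in a few lines. This is our illustrative Python, not part of the paper; the names and the simplified preemption model are assumptions.

```python
# Hypothetical sketch: encode the Table I bounds and the modality
# priority rule under which an arriving haptic stream preempts
# lower-priority visual traffic. Names and structure are ours.

REQUIREMENTS = {
    # modality: (max delay in ms, max jitter in ms), per Table I
    "haptic": (60, 10),
    "audio": (150, 30),
    "visual": (350, 30),
}

# Lower number means higher priority: haptic preempts audio and visual.
PRIORITY = {"haptic": 0, "audio": 1, "visual": 2}

def meets_requirements(modality, delay_ms, jitter_ms):
    """Check a measured (delay, jitter) pair against the Table I bounds."""
    max_delay, max_jitter = REQUIREMENTS[modality]
    return delay_ms <= max_delay and jitter_ms <= max_jitter

def next_stream(pending):
    """Pick the pending stream with the highest priority."""
    return min(pending, key=lambda m: PRIORITY[m])

print(meets_requirements("haptic", 45, 8))         # True
print(next_stream(["visual", "haptic", "audio"]))  # haptic
```

Under this rule a newly arriving haptic stream is always scheduled before pending audio or visual traffic, which is the interruption behavior the introduction describes.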
[Fig. 2 residue: the constructed architecture, with a display screen, headphones, and touch device at the receiver fed by the Audio-Haptic Fused Visual Signal Restoration module, and a robot arm, camera, and microphone at the sender feeding the Time-Frequency Masking-based Audio-Haptic Redundancy Elimination module; priority assignment, multiplexing, label tagging, and decoding lie along the transmission path.]
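As a rough model of the master-slave loop in the Fig. 2 architecture, the interaction might be sketched as below. This is our own hypothetical illustration: the class names and the idealized lossless network are assumptions, not the paper's implementation.

```python
# Illustrative-only sketch of the three-domain loop: a master domain
# (operator + human system interface), a slave domain (robot arm with
# sensors), and a network domain relaying signals between them.

class NetworkDomain:
    """Stand-in for the communication medium; ideal and lossless here."""
    def deliver(self, message):
        return message

class SlaveDomain:
    """Controlled robot arm with haptic, visual, and audio sensors."""
    def sense(self, command):
        # Return multi-modal feedback for the received command.
        return {"haptic": f"force@{command}",
                "visual": f"frame@{command}",
                "audio": f"sound@{command}"}

class MasterDomain:
    """Human system interface issuing commands and rendering feedback."""
    def __init__(self, network, slave):
        self.network, self.slave = network, slave

    def interact(self, command):
        feedback = self.slave.sense(self.network.deliver(command))
        return self.network.deliver(feedback)

master = MasterDomain(NetworkDomain(), SlaveDomain())
print(sorted(master.interact("probe").keys()))  # ['audio', 'haptic', 'visual']
```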
high priority. The other modality streams (audio and visual streams) are taken as a whole when designing the delivery protocol. Moreover, different compression schemes are respectively used for the three modality streams, affecting the efficiency of the delivery scheme. Recently, perceptual redundancy elimination techniques, which investigate the masking threshold or just noticeable distortion, have been applied to either audio or haptic signals [20], [21]. However, they are rarely considered as a joint problem, which is one of the main purposes of this paper.

B. Visual Signal Restoration

In [22], previous research and applications of visual perception in different industrial fields, such as product surface defect detection, intelligent agricultural production, intelligent driving, image synthesis, and event reconstruction, are reviewed. Visual signal restoration is important and difficult in multi-modal signal processing, especially for visual perception. To effectively eliminate visual distortions, a semi-supervised learning-based model and optimization approach is designed in [23]. In [24], a sparse optimization approach named ℓ0TV-PADMM is proposed, which takes advantage of total variation to deal with noise errors. In [25], the noise is related to the semantic segments and an efficient approach to diverse image synthesis is presented. Besides noise, packet loss occurring in the wireless network can also jeopardize the performance of visual streams. In [26], a multi-channel error recovery approach is proposed to guarantee high-quality and real-time video streaming. It integrates a priority queue, quick start, and a scalable reliable channel to overcome high packet loss ratios. In [27], a deep learning-based restoration algorithm is developed for handling packet loss in wireless multimedia sensor networks, which reconstructs the impaired visual frames with information from neighboring available frames. In addition to noise and packet loss, interference from transmission environments should not be neglected. To restore images with interference, [28] utilizes the common dark channel prior and incorporates adaptive color correction for removing color casts.

Based on the above analysis, when considering visual signal restoration, the majority of current approaches rely on the visual modality itself [29]. Due to the severe content loss or delay of the desired visual streams during transmission, the results will be far from satisfactory. To handle this issue, cross-modal-based approaches have been considered. In [30], a cross-modality transfer learning algorithm is designed to connect image and text data, enhancing the accuracy of a classification task with the aid of modality transfer. In [8], visual signals are restored by using the haptic modality. However, it neither considers the co-existence of audio signals nor designs schemes from the perspective of human perceptual facts, which are valuable and meaningful. As a result, how to realize effective visual signal restoration from the other non-visual modalities is still the focus of this paper.

III. PERCEPTION-AWARE CROSS-MODAL SIGNAL RECONSTRUCTION ARCHITECTURE

As visualized in Fig. 2, the constructed perception-aware cross-modal signal reconstruction architecture can be decomposed into three parts: the master domain, the network domain, and the slave domain. The master domain at the receiver represents a human operator and a human system interface (HSI). The HSI consists of an input device, such as a Geomagic Touch for haptic positioning and orienting, and output devices for multi-modal display, e.g., a haptic device for force feedback, a video display, and headphones. The slave domain at the sender contains a controlled robot arm equipped with multiple haptic sensors, a video camera, and a microphone. The network domain connecting them provides the communication medium. Controlled by the master domain, the slave domain is able to directly sense and interact with the remote environment and return multi-modal feedback over the network domain. Moreover, motivated by the benefits of edge intelligence [5], we deploy edge nodes in both the master and slave domains to deal with multi-modal signal processing.

When the devices at the slave terminal collect auditory, visual, and touch stimuli from the surroundings or get the commands from the master terminal, the related multi-modal
having no sense for transmission can be filtered. Moreover, frequency masking is extensively investigated in practical applications, while the time masking phenomenon is rarely implemented quantitatively. Actually, time masking, especially backward masking, cannot be ignored due to its long duration and large masking amount, which is essential for compression [31]. Therefore, it is necessary to consider frequency masking and time masking jointly.

Inspired by this, to enhance transmission efficiency and

[Fig. 5(b) residue: perceived quality versus compression ratio, for compression ratios of roughly 5 to 35.]
[Figure residue: masking units against the masking threshold (amplitude axis).]
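The joint masking idea above can be sketched numerically. The following is our own toy stand-in, assuming a simple local-mean threshold in place of the paper's unsupervised, nonparametric estimator; the function and parameter names are hypothetical.

```python
# Hypothetical sketch: measure signal energy per time-frequency unit,
# estimate a per-unit masking threshold from a local neighborhood
# (covering both time and frequency directions), and drop units below
# the threshold so they are never transmitted.
import numpy as np

def mask_tf_units(tf_energy, neighborhood=1, scale=0.5):
    """Keep only time-frequency units whose energy exceeds a local
    threshold; tf_energy is a 2-D (time x frequency) array of energies.
    The local-mean heuristic stands in for the paper's estimator."""
    t, f = tf_energy.shape
    keep = np.zeros_like(tf_energy, dtype=bool)
    for i in range(t):
        for j in range(f):
            lo_t, hi_t = max(0, i - neighborhood), min(t, i + neighborhood + 1)
            lo_f, hi_f = max(0, j - neighborhood), min(f, j + neighborhood + 1)
            threshold = scale * tf_energy[lo_t:hi_t, lo_f:hi_f].mean()
            keep[i, j] = tf_energy[i, j] > threshold
    return np.where(keep, tf_energy, 0.0)

energy = np.array([[9.0, 0.1, 0.1],
                   [0.1, 8.0, 0.1],
                   [0.1, 0.1, 0.2]])
masked = mask_tf_units(energy)
print(int((masked > 0).sum()))  # 2 units survive for transmission
```

Only the two high-energy units survive here; everything masked by its neighborhood is zeroed out, mirroring how sub-threshold audio-haptic content would be excluded from transmission.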
[Fig. 6 diagram: the received haptic and audio signals pass through Fh (a GRU network) and Fa, are fused by Ff into f under a hierarchical constraint, aligned with the visual features Fv by discriminator D1 (0/1), and mapped by generator G, judged by discriminator D2 (0/1), to the generated visual signal V̂; the three stages are audio-haptic semantic fusion, modality irrelevant latent space learning, and cross-modal visual signal generation.]

Fig. 6: Illustration of the proposed audio-haptic fused visual signal restoration (AHFVR) approach
amplitudes are higher than the threshold. Furthermore, Fig. 4 completely depicts the diagram of the proposed scheme. Concretely, the audio and haptic signals are first switched into time-frequency representations through the wavelet transform. This can provide more features compared with the sole time or frequency domain. The obtained time-frequency plane is filtered, denoised, and further utilized to extract variance-based features, while the signal energy is measured at every time-frequency unit at the same time. Then, an unsupervised and nonparametric adaptive method is adopted to estimate the masking threshold via the time-frequency vector [32], associating it with the previous and future frames of the current signals. It is noted that this masking threshold is dynamic according to the actual characteristics of the audio and haptic signals. From the masking threshold, the signal masking ratio or compression ratio can be determined. Finally, this ratio is applied to every time-frequency unit for quantization and entropy coding. If the audio and haptic features in the time-frequency space are lower than their respective masking thresholds, they should not be transmitted. As a result, the redundancy of the audio and haptic signals will be greatly eliminated or compressed without any perceptual difference being felt.

We apply the above redundancy elimination mechanism under the constructed architecture in Section III. The results are shown in Fig. 5. First, the average throughput of multi-modal streams during a certain period of time is given in Fig. 5(a). On the one hand, the transmission efficiency of the audio and haptic modality streams can be promoted due to redundancy elimination of these two modalities. On the other hand, as these two modality streams have been compressed, the resource competition between the audio-haptic and visual modalities can be weakened. This leads to a throughput improvement of the visual stream when compared with the scheme in [6]. Moreover, Fig. 5(b) clearly shows that the perceived quality of the compressed audio and haptic signals degrades gracefully as higher compression ratios kick in. However, the proposed redundancy elimination mechanism has better performance than that in [6]. The reason is that the proposed scheme uses time-frequency masking effects for both audio and haptic modalities, achieving joint optimization under cross-modal communication scenarios.

V. AHFVR: AUDIO-HAPTIC FUSED VISUAL SIGNAL RESTORATION APPROACH

In this section, to remedy the impaired or delayed visual signals, we propose a simple yet effective approach named the audio-haptic fused visual signal restoration (AHFVR) approach. Its core idea comes from the psychological theory that humans' non-visual perception can assist in forming and enhancing their visual experience. As illustrated in Fig. 6, the overall approach can be further divided into three components: audio-haptic semantic fusion, modality irrelevant latent space learning, and cross-modal visual signal generation.

A. Audio-Haptic Semantic Fusion

Here, we introduce some basic notations utilized in this section. We define the multi-modal dataset as {V, A, H} and the three modalities involved as V for visual signals, A for audio signals, and H for haptic signals. Y denotes the category label set. In the first stage, we try to extract low-level features from the received audio and haptic signals and produce a high-level feature fusion by semantic constraints.

For the received audio signals, a spectrogram is acquired through spectrum analysis and then a deep convolutional neural network (CNN) is used to obtain hierarchical auditory features. For the received haptic signals, the power spectrum density is used as the feature descriptor and then combinations of gated recurrent unit (GRU) and fully-connected layers are adopted to obtain haptic features. The specific process can be represented as follows:

a = Fa(A; θa) : a ∈ R^da,  (1)

h = Fh(H; θh) : h ∈ R^dh,  (2)

where a and h denote the low-level features of audio and haptic signals, R^da and R^dh are their dimensions, and θa and θh respectively denote the parameters of the audio mapping network Fa and the haptic mapping network Fh.

Subsequently, we employ a deep semantic network to jointly fuse the audio and haptic features. Here, supervision information such as the category label is utilized to maintain the semantically discriminative ability of the fused feature f. It can be achieved by defining a semantic discriminative loss Ldis:

f = Ff(a, h; θf) : f ∈ R^df,  (3)

Ldis = (1/K) Σ_{i=1}^{K} fsoftmax(fi, yi; θi),  (4)

where R^df is the dimension of the fused feature and θf are the parameters of the deep semantic network Ff. Besides, fi, yi, and θi are the fused feature, category label, and softmax parameters belonging to the ith category, and fsoftmax(fi, yi; θi) is the softmax cross entropy loss function. It is noted that the category label here means which class an audio or a haptic signal belongs to. In a real cross-modal communication system, the label tagging task is usually performed and loaded at the sender. When the packet has been received, this information can be parsed and taken as important semantic information.

By minimizing the above semantic discriminative loss, the obtained fused features gain strong representation and semantic discrimination abilities.

B. Modality Irrelevant Latent Space Learning

To bridge the heterogeneous gap among different modalities, adversarial learning is selected to map the audio-haptic fused feature and the feature of real visual signals into a modality irrelevant latent space. In other words, by exploring this latent space, the discrepancy between the fused feature and the visual feature can be reduced as much as possible.

Specifically, we combine a CNN with several fully-connected layers as the visual mapping network to produce the associated visual features:

v = Fv(V; θv) : v ∈ R^dv,  (5)

where v is the R^dv-dimensional feature of the real visual signals V, and θv are the parameters of the visual mapping network Fv.

Then, the generative adversarial mechanism is utilized to learn the latent space, as shown in Fig. 6. Here, we build a feature discriminator D1 for distinguishing the features, through which the fused features f can continuously approach v. In other words, a modality irrelevant latent space can be explored. According to this principle, the discriminator D1 and the fusion network Ff are trained by alternately minimizing LD1 in Eq. (6) and LFf in Eq. (7),

LD1 = −E_{v∼Pdata(v)}[log D1(v; θd1)] − E_{f∼Pdata(f)}[log(1 − D1(f; θd1))],  (6)

LFf = −E_{f∼Pdata(f)}[log D1(f; θd1)] + λLdis,  (7)

where λ is the regularization parameter.

The idea of selecting the generative adversarial mechanism is that when D1 cannot distinguish f and v, the audio-haptic fused feature has gradually approached the feature of real visual signals from the aspect of intrinsic semantics. The latent space with modality irrelevant characteristics can thereby be derived.

C. Cross-Modal Visual Signal Generation

After the fused features have learned rich modality irrelevant semantic information from real visual signals in the latent space, the final step is to generate the desired visual signals from the fused features. Here, a generative adversarial network (GAN) is taken to finish this task. Specifically, in this GAN, the generator G converts f into visual signals V̂, and the discriminator D2 distinguishes the real visual signals V from the generated visual signals V̂. In other words, G learns the real visual signal distribution to confuse D2, while D2 continuously enhances its discriminative ability to distinguish false visual signals. The objective function of this step can be defined as:

LD2 = −E_{V∼Pdata(V)}[log D2(V; θd2)] − E_{V̂∼Pdata(V̂)}[log(1 − D2(V̂; θd2))],  (8)

LG = −E_{V̂∼Pdata(V̂)}[log D2(V̂; θd2)],  (9)

where V̂ denotes the generated visual signal. By minimizing LD2 and LG, G is able to continuously capture the distribution of V and use it to guide the generation of V̂.

Besides the discriminative loss LD2, we should also consider the pixel-wise loss. Here, the Euclidean distance is used to match the generated visual signal V̂ with the real visual signal V, forming the restoration loss. The objective function of the restoration loss Lres can be expressed as

Lres = ||V − V̂||_2.  (10)

Actually, due to the discrepancy of the data distributions between real and generated visual signals in the high-dimensional pixel space, only considering the reconstruction loss may generate blurred visual signals. To handle this issue, we employ knowledge distillation as an effective tool for further improving the quality of the generated visual signal. Here, knowledge distillation refers to separating key distribution information from an original model and transferring it to a new model, aiming at increasing the fine-grained degree of the generated visual signal. Driven by this, we select the visual mapping network as the distillation model to assist the final visual signal generation through hierarchical knowledge transferring.

The structure of knowledge distillation is illustrated in Fig. 7. Specifically, during the generation process from the fused feature f to the desired visual signal V̂, the intermediate result produced by layer l of G can be represented as G^(l)(f; θg^(l)), where θg^(l) is the parameter set of layer l in G. Meanwhile, during the feature extraction process from V to v, the corresponding intermediate result is Fv^(L−l)(V; θv^(L−l)), which can also be produced by the output of layer l in G. θv^(L−l) is the parameter set of layer L − l in Fv, and L is the number of layers in both Fv and G. Based on this, the hierarchical constraint loss of visual signals is defined as

LH = Σ_l ||G^(l)(f; θg^(l)) − Fv^(L−l)(V; θv^(L−l))||_1,  (11)
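For concreteness, the objectives in Eqs. (8)-(11) can be written out numerically. This is our numpy sketch with toy inputs standing in for the real networks; the function names are hypothetical.

```python
# Hypothetical sketch of the generation-stage losses: the adversarial
# objectives of Eqs. (8)-(9), the Euclidean restoration loss of
# Eq. (10), and the hierarchical (distillation) constraint of Eq. (11).
import numpy as np

def d2_losses(d_real, d_fake):
    """Binary cross-entropy adversarial losses: discriminator D2 on
    real/generated signals (Eq. 8) and the generator loss (Eq. 9)."""
    l_d2 = -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))  # Eq. (8)
    l_g = -np.mean(np.log(d_fake))                                   # Eq. (9)
    return l_d2, l_g

def restoration_loss(v, v_hat):
    """Eq. (10): Euclidean (L2) distance between real and generated signals."""
    return np.linalg.norm(v - v_hat)

def hierarchical_loss(g_layers, fv_layers):
    """Eq. (11): summed L1 distance between layer-l generator outputs and
    the matching visual features; fv_layers is assumed pre-ordered so that
    entry l corresponds to layer L-l of the visual mapping network Fv."""
    return sum(np.abs(g - f).sum() for g, f in zip(g_layers, fv_layers))

# Toy stand-ins for discriminator outputs and image tensors.
d_real, d_fake = np.array([0.9, 0.8]), np.array([0.2, 0.1])
l_d2, l_g = d2_losses(d_real, d_fake)
v, v_hat = np.ones((2, 2)), np.zeros((2, 2))
print(restoration_loss(v, v_hat))       # 2.0
print(hierarchical_loss([v_hat], [v]))  # 4.0
```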
[Fig. 7 residue: the visual feature extraction network applied to the real visual signal V, with intermediate outputs Fv^(L)(V; θv^(L)), …, Fv^(L−l)(V; θv^(L−l)), …, Fv^(1)(V; θv^(1)).]
It can be seen from Fig. 6 that the structures of G and Fv are Algorithm 1 The optimization process of the proposed AH-
symmetrical, for instance, have the same network layers and FVR approach
output dimensions. By this way, the knowledge from multiple Input:
levels of visual features are transferred into layers of G. The multi-modal dataset V, A, H, category label Y
quality and granularity of the generated visual signals are Output:
improved. Finally, the whole objective function of cross-modal the generated visual images V̂
visual signal generation process is Initialization:
network parameters θv , θa , θh , θf , θd1 , θd2 , θg
Lgen = LG + αLres + βLH , (12) hyper-parameters: α, β, λ
where α and β are the hyper-parameters that aim to balance learning rate: µ1 , µ2
the three terms. mini-batch size: N
1: Repeat:
2: h ← Fh (H), a ← Fa (A), v ← Fv (V ), f ← Ff (a; h),
D. Optimization Process V̂ ← G (f )
The whole optimization process of the AHFVR approach 3: update θd1 by SGD for Eq. (6):
can be divided in two main stages. In the first stage, the 4: θd1 ← θd1 − µ2 ∇θd1 LD1
objective functions to be optimized are LD1 and LFf (con- 5: update θa , θh , θv , θf by SGD for Eq. (7):
taining Ldis ). In the second stage, the objective functions to 6: θa ← θa − µ1 ∇θa LFf
be optimized are LD2 and Lgen . 7: θh ← θh − µ1 ∇θh LFf
Specifically, during the optimization, we first initialize all network parameters. Then, the fusion network Ff and discriminator D1 are trained by minimizing the loss function in the first stage. Subsequently, the generator G and discriminator D2 are trained. Finally, all the training steps above are repeated until convergence. Note that the stochastic gradient descent (SGD) algorithm is adopted to minimize all the loss functions and update all network parameters, with a different learning rate for each network so as to balance the convergence speed. The specific optimization process is summarized in Algorithm 1. The convergence condition considers both the outputs of the two discriminators and the reconstruction results. Specifically, after each training epoch ends, the outputs of D1 and D2 are calculated. If these two values approach 0.5, Lgen is computed on randomly selected validation samples. If the variation of Lgen falls below a pre-set threshold, the algorithm is considered to have converged.

Algorithm 1 (concluding steps):
8:  θv ← θv − µ1∇θv LFf
9:  θf ← θf − µ1∇θf LFf
10: update θd2 by SGD for Eq. (8):
11:   θd2 ← θd2 − µ2∇θd2 LD2
12: update θg by SGD for Eq. (12):
13:   θg ← θg − µ2∇θg Lgen
14: Until convergence.

Once the AHFVR has been trained by the above steps, the parameters of all the networks are well optimized. When subsequent audio and haptic signals are received while the desired visual signals are impaired or delayed, the AHFVR is activated. Specifically, the audio signal A′ and the haptic signal H′ are projected into the feature space through the CNN and GRU networks, respectively. The extracted features a′ and h′ are then fused via semantic constraints. Subsequently, the fused feature f′ is fed into the trained G, whose output is the generated visual signal V̂′. When the delayed visual signal is eventually received, it can be utilized to update the AHFVR by Algorithm 1, further enhancing the accuracy and robustness of the approach.
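The epoch-end stopping rule described above can be sketched in a few lines of plain Python. The function name and the tolerances `d_tol` and `lgen_tol` are illustrative assumptions, not values taken from the paper:

```python
def converged(d1_out, d2_out, lgen_history, d_tol=0.05, lgen_tol=1e-3):
    """Epoch-end convergence test: both discriminator outputs must
    approach the 0.5 equilibrium (real and generated samples are no
    longer distinguishable), and the validation loss L_gen must have
    stopped changing by more than a pre-set threshold."""
    # Are both discriminators close enough to the 0.5 equilibrium?
    if abs(d1_out - 0.5) > d_tol or abs(d2_out - 0.5) > d_tol:
        return False
    # L_gen on the validation samples is only checked once the
    # discriminators are near equilibrium; two epochs are needed
    # to measure its variation.
    if len(lgen_history) < 2:
        return False
    return abs(lgen_history[-1] - lgen_history[-2]) < lgen_tol
```

The check would be called once per epoch with the latest discriminator outputs and the running list of validation losses, terminating the loop of Algorithm 1 when it returns true.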
© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Macau Univ of Science and Technology. Downloaded on January 15,2023 at 18:45:21 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TMM.2022.3194309
Fig. 8: The restored visual signals by (a) GAN, (b) WGAN, (c) StackGAN, (d) ViTac, (e) AHFVR(haptic-only), and (f) AHFVR.
surface materials are classified into nine categories: mesh, stone, glossy, wood, rubber, fiber, foam, foils and paper, textile and fabric. To avoid repetition, we randomly use 80% of the data in each category as the training set and the remaining 20% as the testing set.

B. Implementation Details

Network Structure: We adopt a CNN to perform feature extraction for the audio and visual modality signals; it contains four convolution layers and three fully-connected layers (1024 → 128 → 6). The numbers of convolution kernels are 512, 256, 128, and 64, and the kernel size is 5 × 5. Moreover, a GRU with 256 units is used for haptic feature extraction. The fusion network is built as a five-layer fully-connected neural network (512 → 1024 → 512 → 128 → 64), and the discriminator D1 is a four-layer fully-connected neural network (512 → 1024 → 512 → 1). The generator G has the same but reversed network structure as the visual feature extraction network. The discriminator D2 is composed of four convolution layers and two fully-connected layers (1024 → 1); the numbers of convolution kernels are 512, 256, 128, and 64, and the kernel size is 5 × 5. Except for the softmax and sigmoid layers, the feature layers are activated by the ReLU function with batch normalization.

Training Details: The batch size is set to 25 in the training procedure, and category labels are coded as one-hot vectors. Network parameters are learned by the optimization described in Section V-D for all modules, with λ = 10⁻³, α = 0.1, and β = 10⁻⁵. In detail, the learning rate is 0.0005 for the generator and 0.0001 for the discriminator. Training runs for 200 epochs on all the datasets.

C. Compared Approaches

Here, we select the following representative models and approaches for performance comparison to demonstrate the effectiveness of the proposed AHFVR approach.

GAN [34] is the most basic generative adversarial model, which generates visual signals through a min-max optimization procedure.

WGAN [35] is a variant of GAN whose training objective is to minimize the Wasserstein distance.

StackGAN [36] is another variant of GAN, used to generate visual signals conditioned on textual descriptions.

ViTac [37] provides an approach based on an improved GAN to generate visual signals conditioned on semantic features encoded from the haptic modality; it considers the correlation between the haptic and visual modalities.

AHFVR(haptic-only) is the AHFVR with the audio modality removed. In this case, the feature fusion component is discarded and only the haptic modality is employed to generate the corresponding visual signals. It also serves to verify the contribution of the audio modality and multi-modal fusion.

Moreover, we also conduct some baseline experiments, including AHFVR(audio-only), without latent space learning, and without distillation, to assess the validity of each component in the proposed AHFVR approach.

D. Evaluation Metrics

Here, we choose the following metrics for evaluation.

Inception Score is a numerical metric for quantitative evaluation, measuring the quality of the generated visual signals. Its calculation formula is

IS = exp( E_V̂ [ KL( p(ŷ|V̂) ‖ p(ŷ) ) ] ),   (13)

where V̂ represents a generated visual signal and ŷ is the label predicted by the Inception network. p(ŷ) denotes the marginal distribution, and p(ŷ|V̂) denotes the probability that the given visual signal V̂ belongs to category ŷ. A higher Inception Score indicates better performance.
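As a minimal sketch, Eq. (13) can be evaluated directly from per-sample class probabilities (e.g., the softmax outputs of an Inception network). The function below is a pure-Python illustration under that assumption, not the evaluation code used in the experiments:

```python
import math

def inception_score(probs):
    """IS = exp(E_V[KL(p(y|V) || p(y))]) as in Eq. (13), where `probs`
    holds one class-probability vector per generated sample."""
    n, k = len(probs), len(probs[0])
    # Marginal p(y): average the conditionals over all generated samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    # Mean KL divergence between each conditional and the marginal
    # (terms with p_j = 0 contribute nothing to the sum).
    mean_kl = sum(
        sum(pj * math.log(pj / mj) for pj, mj in zip(p, marginal) if pj > 0)
        for p in probs
    ) / n
    return math.exp(mean_kl)
```

If every sample yields the same distribution the KL term vanishes and IS = 1; if n samples are each confidently assigned to a distinct one of n classes, the score approaches n, which is why a higher value indicates sharper, more diverse generations.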
(Figure: experimental platform spanning a master domain, a network domain, and a slave domain, with a microphone, camera, force sensors, base station, gateway, router, and database.)
Fig. 11: The restored visual signals by (a) GAN, (b) WGAN, (c) StackGAN, (d) ViTac, (e) AHFVR(haptic-only), and (f) cross-modal signal construction with AHFVR.
Fig. 11 shows the reconstructed visual signals and provides some quantitative information about the generation quality. In particular, SSIM and PSNR are selected to measure the similarity between the reconstructed visual signals and the real visual signals. The results are given in Table III. It can be seen that the AHFVR achieves the best performance on the practical platform, while the basic generative models, such as GAN and WGAN, achieve the worst. Although StackGAN makes some improvements, it is still limited by the discrepancy of the heterogeneous streams. Compared with AHFVR, the performance of AHFVR(haptic-only) is inferior due to the lack of supplementary audio information. In summary, the proposed perception-aware cross-modal signal reconstruction strategy obtains the best performance, guaranteeing the quality of the received signals.

TABLE III: SSIM and PSNR of the proposed cross-modal signal reconstruction with AHFVR and the other competing schemes

Approach             | SSIM  | PSNR
Original             | 0.950 | −
GAN                  | 0.199 | 9.782
WGAN                 | 0.201 | 10.056
StackGAN             | 0.216 | 10.178
ViTac                | 0.235 | 10.678
AHFVR(haptic-only)   | 0.241 | 10.831
AHFVR                | 0.412 | 14.021

Finally, in order to verify the performance of the proposed strategy, users' quality of experience (QoE) is measured based on their subjective assessment of the synthesized images. In this qualitative evaluation, 15 participants with normal senses are invited to manipulate the haptic devices at the master terminal and perceive the received visual signals. They rate their experience on a 1–5 scale from two aspects: a quality scale and a classification scale (the discrimination ability of the reconstructed images), which respectively reflect the quality and the discriminability of the reconstructed images compared with the real images. The scores from the participants are averaged, and the results are shown in Fig. 12. It can be observed that the proposed cross-modal signal reconstruction strategy with the AHFVR achieves the best performance in terms of end users' experience.
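Of the two similarity metrics reported in Table III, PSNR has a particularly simple closed form, PSNR = 10·log10(MAX²/MSE). The sketch below assumes 8-bit images flattened to plain Python sequences; SSIM, which models structural similarity over local windows, is considerably more involved and omitted here:

```python
import math

def psnr(reference, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length
    sequences of pixel values."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images: noise power is zero
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Higher PSNR (and SSIM closer to 1) means the reconstruction is closer to the real visual signal, which is the sense in which AHFVR's 14.021 dB in Table III outperforms the baselines.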
[30] S. Niu, Y. Jiang, and B. Chen, "Cross-modality transfer learning for image-text information management," ACM Transactions on Management Information Systems, vol. 13, no. 1, p. 14, 2022.
[31] R. M. Chen, N. N. Mahatme, Z. J. Diggins, and L. Wang, "Impact of temporal masking of flip-flop upsets on soft error rates of sequential circuits," IEEE Trans. Nucl. Sci., vol. 64, no. 8, pp. 2098-2106, 2017.
[32] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. 9, no. 1, pp. 62-66, 1979.
[33] M. Strese, C. Schuwerk, A. Iepure, and E. Steinbach, "Multimodal feature-based surface material classification," IEEE Trans. Haptics, vol. 10, no. 2, pp. 226-239, 2017.
[34] C. Wang, C. Xu, X. Yao, and D. Tao, "Evolutionary generative adversarial networks," IEEE Trans. Evolut. Comput., vol. 23, no. 6, pp. 921-934, 2019.
[35] I. Deshpande, Y. Hu, R. Sun, A. Pyrros, and A. Schwing, "Max-sliced Wasserstein distance and its use for GANs," IEEE/CVF Conf. Comput. Vis. Pattern Recog. (CVPR), Long Beach, USA, pp. 10640-10648, 2019.
[36] H. Zhang, T. Xu, H. Li, and S. Zhang, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947-1962, 2019.
[37] J. Lee, D. Bollegala, and S. Luo, "Touching to see and seeing to feel: Robotic cross-modal sensory data generation for visual-tactile perception," Int. Conf. Robot. Autom. (ICRA), Montreal, Canada, pp. 4276-4282, 2019.

Liang Zhou (SM'18) received his Ph.D. degree in electronic engineering jointly from École Normale Supérieure (E.N.S.), Cachan, France, and Shanghai Jiao Tong University, Shanghai, China, in 2009. He is currently a professor at Nanjing University of Posts and Telecommunications, China. His research interests are in the area of multimedia communications and computing.