Perception-Aware Cross-Modal Signal Reconstruction: From Audio-Haptic To Visual

This article has been accepted for publication in IEEE Transactions on Multimedia.
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TMM.2022.3194309
IEEE TRANSACTIONS ON MULTIMEDIA 1
Perception-Aware Cross-Modal Signal

Reconstruction: From Audio-Haptic to Visual
Xin Wei, Member, IEEE, Yuyuan Yao, Haoyu Wang, and Liang Zhou, Senior Member, IEEE
Abstract—Cross-modal communications, devoting to collabora- potential correlation among modalities to satisfy demands such
tively delivering and processing audio, visual, and haptic signals, as low-latency, high reliability, high throughput. However,
have gradually become the supporting technology for the emerg- when transmitting and receiving audio, visual, and haptic
ing multi-modal services. However, the inevitable resource compe-
titions among different modality signals as well as the unexpected information under the existing cross-modal communication
packet loss and latency during transmission seriously affect qual- framework, it still faces two main challenges. First, haptic,
ity of the received signals and end user’s immersive experience audio, and visual streams inevitably compete for the limited
(especially visual experience). To overcome these dilemmas, this resources during transmission, which especially has much
paper proposes a cross-modal signal reconstruction strategy from bad effect on visual streams. Existing modality priority-based
the perspective of human’s perceptual facts. It tries to guarantee
visual signal quality by considering potential correlations among scheduling schemes give haptic stream higher priority to
modalities when processing audio and haptic signals. On the one guarantee its low latency and high reliability requirements (see
hand, a time-frequency masking-based audio-haptic redundancy Table I) [3], [6]. In other words, when haptic stream appears,
elimination mechanism is designed by resorting to the similarity visual streams with lower priority should be interrupted. After
of audio-haptic characteristics and human’s masking effects. On the transmission of the high-priority haptic stream has finished,
the other hand, based on the fact that non-visual perception can
assist to form and enhance visual perception, an audio-haptic visual stream is resumed. Though haptic stream has a smaller
fused visual signal restoration (AHFVR) approach for handling scale compared to visual stream, its high occurrence frequency
the impaired and delayed visual signals is proposed. Experiments and bursty characteristic still severely affect the quality of
on a standard multi-modal database and a constructed practical visual streaming [7].
platform evaluate the performance of the proposed perception-
aware cross-modal signal reconstruction strategy. TABLE I: Transmission requirements of haptic, visual, and
Index Terms—cross-modal communications, perception, visual audio signals [4]
signal reconstruction, audio-haptic redundancy elimination.
Haptic Visual Audio
Delay [1 − 60]ms ≤ 350ms ≤ 150ms
I. I NTRODUCTION Jitter [1 − 10]ms ≤ 30ms ≤ 30ms
∼ 10−5 %
W ITH the dramatic development of wireless communi-
cations and multimedia technologies, humans begin to
pursue more extreme interaction and immersion after their
Data loss rate
Data rate ∼ 128Kbps
∼ 20%
[2.5 − 40]M bps
∼ 30%
[22 − 200]Kbps
audio-visual sensation has been satisfied [1]. Accordingly, Second, due to unexpected packet loss and transmission
multi-modal services, typically integrating audio, visual, and latency, the received quality of visual signals can hardly
haptic signals, will obviously provide a much more immersive be satisfied. During multi-modal streaming, the changeable
experience in various scenarios, such as entertainment, educa- transmission environment unavoidably provides side effects,
tion, healthcare, and industry [2]. For example, haptic-enabled especially for the received visual signals [8]. The reason is
telesurgery will improve the doctor’s interaction and scene that, visual streams are obviously large scale which may
experiences when providing 360 degree panoramic video, high cause impairment during receiving. Moreover, owing to the
definition audio, and force feedback as well as guaranteeing lower transmission priority and the limited bandwidth, the
transmission and signal processing issues [3]. delayed arrival of visual streams is prone to happen. Tradi-
In order to support multi-modal services and enhance end tional methods such as restoration within visual modality or
user’s immersive experience, cross-modal communications retransmission of the destroyed streams are not suitable for
come into being [4], [5]. Compared with existing multi- solving this issue, which may influence the synchronization
modal communication schemes [1], [6], it tries to leverage the between visual and non-visual (haptic, audio) modalities and
user’s immersive experience [9], [10].
Manuscript received April 4, 2022; revised June 14, 2022; accepted July
12, 2022. This work was supported by the National Natural Science Foun- To deal with the above issues, cross-modal signal recon-
dation of China under Grant 62071254 and the Priority Academic Program struction strategy is considered as an effective way [4]. Before
Development of Jiangsu Higher Education Institutions. The associate editor transmission, the volume of non-visual signals should be
coordinating the review of this manuscript and approving it for publication
was Dr. Zhi Wang. (Corresponding Author: Liang Zhou.) reduced as much as possible to guarantee visual transmission
The authors are with School of Communications and Information Engi- throughput. For example, if the redundancy of haptic signals
neering, and Key Lab of Broadband Wireless Communication and Sensor is analyzed and the occurrence frequency can be reasonably
Network Technology, Ministry of Education, Nanjing University of Posts and
Telecommunications, Nanjing 210003, China. (E-mails: {xwei, 1019010631, reduced, the competition between haptic and visual streams
1221014002, liang.zhou}@njupt.edu.cn). can be weaken and the throughput of visual streaming will
© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Macau Univ of Science and Technology. Downloaded on January 15,2023 at 18:45:21 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version which has not been fully edited and
be improved. After receiving, if visual impairment or delayed how to explore and leverage these correlations among modal-
phenomena emerge, restoration approach can be adopted by ities to realize visual signal restoration is still difficult.
taking advantage of potential correlations between visual and Based on the above analysis, we aim at designing a cross-
non-visual signals. It is considered as an effective substitution modal reconstruction strategy by resorting to the mechanisms
for retransmission scheme. in human’s perception. The contributions of this paper are
In the human’s perception system, there exist inherent inter- listed as follows:
actions among auditory, haptic, and visual sensory subsystems, 1) We construct a perception-aware cross-modal signal
which comprehensively determines human’s experience [11], reconstruction architecture. It elaborately contains an audio-
[12]. As the ultimate goal of multi-modal services is to haptic redundancy elimination mechanism at the sender as well
promote user’s immersive sensory experience, it motivates us as an audio-haptic fused visual signal restoration approach at
to borrow the related mechanisms from the human’s percep- the receiver.
tion system when designing cross-modal signal reconstruction 2) We design a time-frequency masking-based redundancy
architecture. Specifically, as two representative non-visual elimination mechanism for audio and haptic streams to be
perceptions, there are plenty of similarities between audio and delivered at the sender. It takes advantage of masking effect to
haptic sensories. On the one hand, audio and haptic signals reduce the occurrence frequency of audio and haptic streams.
not only have relatively small size requiring low bandwidth 3) We propose an audio-haptic fused visual signal restora-
(∼ Kbps in Table I), but also share the similar waveforms tion (AHFVR) approach. It utilizes modality correlation for
in time and frequency domains for the specific content (e. handling the impaired or delayed visual signals at the re-
g. wood texture in Fig. 1). On the other hand, humans have ceiver. In this approach, low-level features extracted from the
analogous masking effects both in their auditory and haptic received audio and haptic signals are first fused with semantic
subsystems [13], [14]. The former fact inspires us whether constraints. Then, an adversarial scheme is introduced for
to take audio and haptic signals together, while the latter the fused features to capture its latent correlation in real
phenomenon stimulates us to reduce the volume of these two visual space. Finally, a hierarchical fine-grained representation
non-visual signals. In other words, considering and eliminating structure and the knowledge distillation technique are adopted
redundancy of these two non-visual signals together may to realize the desired visual signal generation.
be suitable for enhancing visual streaming efficiency at the The rest of this paper is arranged as follows: Section II
sender, which has not been attracted attention in current works. introduces the related work. Section III describes the construct-
ed perception-aware cross-modal signal reconstruction archi-
tecture. The details of the designed time-frequency masking-
based audio-haptic redundancy elimination mechanism and the
proposed AHFVR approach are presented in Section IV and
Section V, respectively. Experimental results and analysis in a
standard dataset are given in Section VI. In Section VII, the
effectiveness of the proposed perception-aware cross-modal
signal reconstruction is verified in a practical platform. Finally,
Section VIII gives concluding remarks.
II. R ELATED W ORK

A. Multi-Modal Stream Transmission
When considering multi-modal stream transmission, modal-
ity priority assignment technique is widely studied for allocat-
ing available transmission rates [18]. Although this scheme
can well handle the delivery of multi-modal streams, it is
not adaptive to dynamic changes of networks. Subsequently,
multiplexing is another probably suitable solution for resource
Fig. 1: Use case of the audio and haptic signals of wood allocation. Typically, the work in [19] investigates an adaptive
texture in time and frequency domains. application layer multiplexing framework concerning multi-
modal streams. In [6], a multiplexing scheme for multi-modal
Moreover, in psychological theory, it shows that humans’ feedback (visual, audio and haptic) is proposed to give high
non-visual perception can assist in forming and enhancing priority to the haptic data. Moreover, a preemptive-resume
their visual experiences [15]. For instance, blind persons or scheduling strategy is also adopted for multi-modal stream-
infants can recognize and reconstruct visual forms of envi- ing. To further consider characteristics among modalities, we
ronments by hearing and touching the related objects after propose the architecture of cross-modal communications and
conscious training. The reason is that there exists intrinsic further design the cross-modal scheduling strategy to collab-
content consistency among multi-modal sensory signals [16], oratively allocate resources for the specific stream according
[17]. This idea can be transferred to restore the impaired to application requirements [4], [3].
and delayed visual signals at the receiver, where multi-modal In summary, the main aim of the current transmission strate-
signals co-exist and also have potential correlations. However, gies is to guarantee haptic stream delivery with a relatively
Master Domain Network Domain Slave Domain

terminal edge node wireless network edge node terminal
haptic Intact haptic
decoder
haptic visual reduction label tagging
impaired
touch
delayed
robot arm
audio priority audio
multiplexing assignment
headphone microphone
visual visual
Time-Frequency Masking-
display Audio-Haptic Fused based Audio-Haptic
screen Visual Signal Restoration Redundancy Elimination camera
Fig. 2: The perception-aware cross-modal signal reconstruction architecture
high priority. The other modality streams (audio and visual Based on the above analysis, when considering visual signal
streams) are taken as a whole when designing delivery proto- restoration, the majority of current approaches rely on the
col. Moreover, different compression schemes are respectively visual modality itself [29]. Due to the severe content loss or
used for three modality streams, affecting the efficiency of the delay of desired visual streams during transmission, the results
delivery scheme. Recently, perceptual redundancy elimination will be far from satisfactory. To handle this issue, cross-modal-
techniques which investigate masking threshold or just no- based approaches have been considered. In [30], a cross-
ticeable distortion have been taken for either audio or haptic modality transfer learning algorithm is designed to connect
signals [20], [21]. However, they are rarely considered as a image and text data, enhancing accuracy of classification task
joint problem, which is one of the main purposes in this paper. with the aid of modality transfer. In [8], it tries to restore
visual signals by using haptic modality. However, it does not
consider the co-existence of audio signals. It also does not
B. Visual Signal Restoration design schemes from the perspective of human’s perceptual
facts, which are valuable and meaningful. As a result, how to
In [22], the previous research and application of visu- realize effective visual signal restoration from the other non-
al perception in different industrial fields such as product visual modalities is still the focus in this paper.
surface defect detection, intelligent agricultural production,
intelligent driving, image synthesis and event reconstruction
III. P ERCEPTION -AWARE C ROSS -M ODAL S IGNAL
are reviewed. Visual signal restoration is important and d-
R ECONSTRUCTION A RCHITECTURE
ifficult in multi-modal signal processing, especially visual
perception. To effectively eliminate visual distortions, a semi- As visualized in Fig. 2, the constructed perception-aware
supervised learning-based model and optimization approach cross-modal signal reconstruction architecture can be decom-
is designed [23]. In [24], a sparse optimization approach posed into three parts: master domain, network domain, and
named `0 TV-PADMM is proposed, which takes advantage of slave domain. The master domain at the receiver represents
total variation to deal with noise errors. In [25], it relates a human operator and a human system interface (HSI). The
the noise to the semantic segments and presents an efficient HSI consists of an input device such as Geomagic Touch
approach to diverse image synthesis. Besides noise, packet for haptic positioning and orientating, and output devices for
loss occurring in the wireless network can also jeopardize multi-modal displaying, e.g., haptic device for force feedback,
the performance of visual streams. In [26], a multi-channel video display, and headphones. The slave domain at the sender
error recovery approach is proposed to guarantee high-quality contains a controlled robot arm equipped with multiple haptic
and real-time video streaming. It integrates priority queue, sensors, a video camera and a microphone. The network
quick start, and scalable reliable channel to overcome the high domain connecting them provides the communication medium.
packet loss ratio. In [27], a deep learning-based restoration Controlled by the master domain, the slave domain is able
algorithm is developed for handling packet loss in wireless to directly sense and interact with the remote environment
multimedia sensor network, which reconstructs the impaired and return multi-modal feedback over the network domain.
visual frames with information from neighboring available Moreover, motivated by the benefits of edge intelligence [5],
frames. In addition to noise and packet loss, interference from we deploy edge nodes in both master and slave domains to
transmission environments should not be neglected. To restore deal with multi-modal signal processing.
images with interference, [28] utilizes common dark channel When the devices at the slave terminal collect audito-
prior and incorporates adaptive color correction for removing ry, visual, and touch stimuli from surroundings or get the
color casts. commands from the master terminal, the related multi-modal
signals are initially uploaded to the edge node for modality

variance-
and category labeling, forming streams. The labeled streams signal
based
energy
are then divided into two categories: audio-haptic streams and feature
estimation
audio blocking extraction quantization audio
visual streams, which is different from existing schemes. The signal stream
reason is that audio and haptic signals have similar character- discrete time-frequency
istics and transmission requirements. Then, a joint redundancy haptic wavelet masking threshold entropy haptic
signal transform seeking coding stream
elimination mechanism (described in Section IV) is designed
preprocessing compression
for the audio-haptic streams from the perspective of human’s masking ratio
masking effect, which can enhance the efficiency of visual determination
streams to be delivered. Finally, the priority assignment and Fig. 4: Procedure of the proposed audio-haptic redundancy
multiplexing methods are adopted, which gives the compressed elimination mechanism
audio-haptic streams higher priority compared with visual
streams.
When the multi-modal packets have been delivered via
wireless channels and received by the edge node managing the
master terminal, the corresponding signals are first decoded
from packets. Due to the relatively low priority and trans-
mission guarantee, if visual signals are impaired or delayed,
signal restoration (described in Section V) can be adopted by
exploring and leveraging intrinsic content consistency among
modalities. Finally, the intact or restored visual signals as well
as the received audio and haptic signals are transferred from
the edge node to the associated devices of master terminal.
IV. T IME -F REQUENCY M ASKING -BASED AUDIO -H APTIC

R EDUNDANCY E LIMINATION M ECHANISM (a)
6 0
The main idea of masking effect comes from theory about th e p ro p o se d sc h e m e (a u d io )
human perception. It is to remove the perceptual redundancies 5 5
th e p ro p o se d sc h e m e (h a p tic )
while causing no difference to the original signals. More sc h e m e in [ 6 ] (a u d io )
sc h e m e in [ 6 ] (h a p tic )
precisely, it implies that components of audio signals or haptic 5 0
signals with a strong frequency will mask components with a
weak frequency. As a result, the weak frequency components
P S N R
4 5
having no sense for transmission can be filtered. Moreover,
frequency masking is extensively investigated in practical 4 0
applications while time masking phenomenon is rarely imple-
mented quantitatively. Actually, time masking, especially the 3 5
backward masking, cannot be ignored due to its long duration
and large masking amount, which is essential for compression 3 0
5 1 0 1 5 2 0 2 5 3 0 3 5
[31]. Therefore, it is necessary to joint consideration between c o m p re s s io n ra tio
frequency masking and time masking.
(b)
Fig. 5: (a) Average throughput; b) PSNR of the proposed

redundancy elimination mechanism and that in [6]
5
4 masking unit
masking threshold Inspired by this, to enhance transmission efficiency and
3
amplitude
guarantee receiving efficiency, we consider audio and haptic

2 streams together and design a joint time-frequency masking-
1 the masked
based redundancy elimination strategy at the sender. Specif-
units ically, the whole procedure is conducted in time-frequency
0
10 domain, where time and frequency information concerning
10 global and local attributes can be simultaneously captured,
5 8
6
4 shown in Fig. 3. In this figure, the two-dimensional curved
2
frequency(Hz) 0 0 time(s)
surface represents masking threshold, and the green solid
circles are the masking signals. The blue solid circles whose
Fig. 3: Illustration of masking effect in time-frequency amplitude less than the masking threshold can be masked,
domains while the red ones are selected for transmission because their
received real visual

audio signals A signals V
Fa a v
Fv
D1 0/1 hierarchical D2 0/1
constraint
received f
H Ff
Vˆ
generated
haptic signals
visual signals
GRU
Fh h G
audio-haptic modality irrelevant cross-modal visual
semantic fusion latent space learning signal generation
Fig. 6: Illustration of the proposed audio-haptic fused visual signal restoration (AHFVR) approach
amplitudes are higher than the threshold. Furthermore, Fig. proposed scheme uses time-frequency masking effects for both
4 completely depicts the diagram of the proposed scheme. audio and haptic modalities, achieving joint optimization under
Concretely, the audio and haptic signals are first switched cross-modal communication scenarios.
into time-frequency representations through wavelet transform.
This can provide more features compared with sole time V. AHFVR:AUDIO -H APTIC F USED V ISUAL S IGNAL
or frequency domains. The obtained time-frequency plane is R ESTORATION A PPROACH
filtered, denoised, and further utilized to extract variance-based In this section, to remedy the impaired or delayed visual
features, while the signal energy is measured at every time- signals, we propose a simple yet effective approach named
frequency unit at the same time. Then, an unsupervised and audio-haptic fused visual signal restoration (AHFVR) ap-
nonparametric adaptive method is adopted to estimate the proach. Its core idea comes from the psychological theory
masking threshold via time-frequency vector [32], associating that humans’ non-visual perception can assist in forming and
to the previous and future frames of the current signals. It is enhancing their visual experience. As illustrated in Fig. 6, the
noted that this masking threshold is dynamic according to the overall approach can be further divided into three components:
actual characteristics of audio and haptic signals. By obtaining audio-haptic semantic fusion, modality irrelevant latent space
masking threshold, the signal masking ratio or compression learning, cross-modal visual signal generation.
ratio can be determined. Finally, this ratio is applied to every
time-frequency unit for quantization and entropy coding. If
A. Audio-Haptic Semantic Fusion
the audio and haptic features in the time-frequency space
are lower their respective masking threshold, they should not Here, we introduce some basic notations utilized in this
be transmitted. As a result, the redundancy of audio and section. We define multi-modal dataset as {V, A, H} and the
haptic signals will be greatly eliminated or compressed without three modalities involved as V for visual signals, A for audio
feeling the perceptual difference. signals, and H for haptic signals. Y denotes the category label
We apply the above proposed redundancy elimination mech- set. In the first stage, we try to extract low-level features from
anism under the constructed architecture in Section III. The the received audio and haptic signals and produce high-level
results are shown in Fig. 5. First, the average throughput of feature fusion by semantic constraints.
multi-modal streams during a certain period of time is given For the received audio signals, spectrogram is acquired
in Fig. 5(a). On the one hand, the transmission efficiency through spectrum analysis and then deep convolutional neural
of audio and haptic modality streams can be promoted due network (CNN) is used to obtain hierarchical auditory features.
to redundancy elimination of these two modalities. On the For the received haptic signals, power spectrum density is used
other hand, as these two modality streams have been com- as the feature descriptors and then the combinations of gated
pressed, the resource competition between audio-haptic and recurrent unit (GRU) and fully-connected layers are adopted to
visual modalities can be weakened. It leads to the throughput obtain haptic features. The specific process can be represented
improvement of visual stream when compared with the scheme as follows:
in [6]. Moreover, in Fig. 5(b), it clearly shows that the a =Fa (A; θa ) : a ∈ Rda , (1)
perceived quality of the compressed audio and haptic signals
h =Fh (H; θh ) : h ∈ Rdh , (2)
degrades gracefully as higher compression ratios kick in.
However, the proposed redundancy elimination mechanism has where a and h denote low-level features of audio and haptic
better performance than that in [6]. The reason is that the signals. Rda and Rdh are their dimensions. θa and θh respec-
tively denote parameters of audio mapping network Fa and latent space with modality irrelevant characteristics can be
haptic mapping network Fh . derived.
Subsequently, we employ a deep semantic network to jointly
fuse audio and haptic features. Here, supervision information C. Cross-Modal Visual Signal Generation
such as category label is utilized to maintain the semantically After the fused features have learned rich modality ir-
discriminative ability for the fused feature f . It can be achieved relevant semantic information from real visual signals in
by defining a semantic discriminative loss Ldis : latent space, the final step is to generate the desired visual
f =Ff (a, h; θf ) : f ∈ Rdf , (3) signals from the fused features. Here, a generative adversarial
K
network (GAN) is taken to finish this task. Specifically, in
1 X this GAN, the generator G converts f into visual signals V̂ ,
Ldis = fsof tmax (fi , yi ; θi ), (4)
K i=1 and the discriminator D2 distinguishes the real visual signals
V from the generated visual signals V̂ . In other words, this
where the dimension of the fused feature is Rdf and θf are process mainly indicates that G learns the real visual signal
parameters of deep semantic network Ff . Besides, fi , yi , and distribution to confuse D2 , and D2 continuously enhances its
θi are fused feature, category label, and softmax parameters discriminative ability to distinguish false visual signals. The
belonging to the ith category. fsof tmax (fi , yi ; θi ) is the soft- objective function of this step can be defined as:
max cross entropy loss function. It is noted that the category
label here means which class an audio or a haptic signal LD2 =−EV ∼Pdata(V ) [log D2 (V ; θd2 )]
belongs to. In the real cross-modal communication system, −EV̂ ∼P [log(1 − D2 (V̂ ; θd2 ))], (8)
data(V̂ )
label tagging task is usually performed and loaded at the
sender. When the packet has been received, this information LG =−EV̂ ∼P [log D2 (V̂ ; θd2 )], (9)
data(V̂ )
can be parsed and taken as an important semantic information.
where V̂ denotes the generated visual signal. By minimizing
By minimizing the above semantic discriminative loss,
LD2 and LG , G is able to continuously capture the distribution
the obtained fused features have strong representation and
of V and use it to guide the generation of V̂ .
semantic discrimination abilities.
Besides the discriminative loss LD2 , we should also con-
sider the pixel-wise loss. Here, Euclidean distance is used to
B. Modality Irrelevant Latent Space Learning completely match the generated visual signal V̂ with the real
To bridge the heterogeneous gap among different modalities, visual signal V , forming the restoration loss. The objective
adversarial learning is selected to map the audio-haptic fused function of the restoration loss Lrec can be expressed as
feature and feature of real visual signals into a modality
Lres = V − V̂ . (10)

irrelevant latent space. In other words, by exploring this latent 2
space, the discrepancy between the fused feature and the visual Actually, due to the discrepancy of data distribution-
feature can be reduced as much as possible. s between real and generated visual signals in the high-
Specifically, we combine a CNN with several fully- dimensional pixel space, only considering reconstruction loss
connected layers as the visual mapping network to produce may generate the blur visual signal. To handle this issue, we
the associated visual features: employ knowledge distillation as an effective tool for further
v = Fv (V ; θv ) : v ∈ Rdv , (5) improving the quality of generated visual signal. Here, knowl-
edge distillation refers to the separation of key distribution
where v is Rdv dimensional features of real visual signals information from an original model and transfers it to a new
V . θv are parameters of the visual mapping network Fv . model, aiming at increasing the fine-grained degree of the
Then, the generative adversarial mechanism is utilized to generated visual signal. Driven by this, we select the visual
learn the latent space, as shown in Fig. 6. Here, we build mapping network as distillation model to assist the process of
a feature discriminator D1 for distinguishing the features, final visual signal generation through hierarchical knowledge
through which the fused features f can continuously approach transferring.
v. In other words, a modality irrelevant latent space can be The structure of knowledge distillation is illustrated in Fig.
explored. According to this principle, the discriminator D1 7. Specifically, during the generation process from the fused
and fusion network Ff are trained by alternatively minimizing feature f to the desired visual signal V̂ , the intermediate result
LD1 in Eq. (6) and LFf in Eq. (7), (l)
produced by layer l of G can be represented as G(l) (f ; θg ).
(l)
LD1 =−Ev∼Pdata(v) [log D1 (v; θd1 )] θg is the parameter set of layer l in G. Meanwhile, during
the feature extraction process from V to v, the corresponding
−Ef ∼Pdata(f ) [log (1 − D1 (f ; θd1 ))] , (6) (L−l) (L−l)
intermediate result is Fv (V ; θv ), which can also be
LFf =−Ef ∼Pdata(f ) [log D1 (f ; θd1 )] + λLdis , (7) produced by the output of layer l in G. θv
(L−l)
is the parameter
where λ is the regularization parameter. set of layer L − l in Fv . L is the number of layers in both Fv
The idea of selecting the generative adversarial mechanism and G. Based on this, the hierarchical constraint loss of visual
is that when D1 cannot distinguish f and v, it makes the signals is defined as
audio-haptic fused feature gradually approach the feature of X (l)

LH = G (f ; θg(l) ) − Fv(L−l) (V ; θv(L−l) ) ,

real visual signals from the aspect of intrinsic semantics. The 1
(11)
l
real visual
visual feature extraction network signal
Fv(L )(V ; v(L ) ) ... Fv(L l )(V ; v(L l ) ) ... Fv(1)(V ; v(1) ) V
LH LH L ... LHl ... LH1
f G (1)( f ; g(1) ) ... G (l )( f ; g(l ) ) ... G (L )( f ; g(L ) ) Vˆ

fused visual generation
... network generated
feature visual signal
Fig. 7: The structure of knowledge distillation for visual signal generation
It can be seen from Fig. 6 that the structures of G and Fv are Algorithm 1 The optimization process of the proposed AH-
symmetrical, for instance, have the same network layers and FVR approach
output dimensions. By this way, the knowledge from multiple Input:
levels of visual features are transferred into layers of G. The multi-modal dataset V, A, H, category label Y
quality and granularity of the generated visual signals are Output:
improved. Finally, the whole objective function of cross-modal the generated visual images V̂
visual signal generation process is Initialization:
network parameters θv , θa , θh , θf , θd1 , θd2 , θg
Lgen = LG + αLres + βLH , (12) hyper-parameters: α, β, λ
where α and β are the hyper-parameters that aim to balance learning rate: µ1 , µ2
the three terms. mini-batch size: N
1: Repeat:
2: h ← Fh (H), a ← Fa (A), v ← Fv (V ), f ← Ff (a; h),
D. Optimization Process V̂ ← G (f )
The whole optimization process of the AHFVR approach 3: update θd1 by SGD for Eq. (6):
can be divided in two main stages. In the first stage, the 4: θd1 ← θd1 − µ2 ∇θd1 LD1
objective functions to be optimized are LD1 and LFf (con- 5: update θa , θh , θv , θf by SGD for Eq. (7):
taining Ldis ). In the second stage, the objective functions to 6: θa ← θa − µ1 ∇θa LFf
be optimized are LD2 and Lgen . 7: θh ← θh − µ1 ∇θh LFf
Specifically, during the optimization, we first initialize all 8: θv ← θv − µ1 ∇θv LFf
network parameters. Then, the fusion network Ff and dis- 9: θf ← θf − µ1 ∇θf LFf
criminator D1 are trained by minimizing the loss function in 10: update θd2 by SGD for Eq. (8):
the first stage. Subsequently, the generator G and discrim- 11: θd2 ← θd2 − µ2 ∇θd2 LD2
inator D2 are trained. Finally, all the training steps above 12: update θg by SGD for Eq.(12):
are repeated until convergence. It is noted that the stochastic 13: θg ← θg − µ2 ∇θg Lgen
gradient descent (SGD) algorithm is adopted to minimize 14: Until convergence.
all the loss functions and upgrade all network parameters.
Different learning rates are set for every network, aiming at
balancing the convergence speed. The specific optimization features a0 and h0 are then fused via semantic constraints.
process is summarized in Algorithm 1. For the condition of Subsequently, the fused feature f 0 is fed into the trained G
convergence, it needs to consider both the outputs of two whose output is the generated visual signal Vˆ0 . For the delayed
discriminators and the reconstruction results. Specifically, after visual signal, when it is successfully received, it can be utilized
each training epoch ends, calculate the outputs of D1 and D2 . to update the AHFVR by Algorithm 1, further enhancing the
If these two values approach 0.5, calculate the Lgen by the accuracy and robustness of the approach.
randomly selected samples for validation. If the variation of
Lgen becomes lower than the pre-set threshold, this algorithm VI. E XPERIMENTAL R ESULTS
is considered as convergence.
When the AHFVR has been trained by the above steps, the A. Dataset
parameters of all the networks are already well optimized. In this section, we evaluate the effectiveness of the proposed
When the subsequent audio and haptic signals have been AHFVR approach based on a standard LMT dataset [33].
received while the desired visual signals are impaired or Generally, The LMT dataset is composed of sound, haptic
delayed, the AHFVR will be activated. Specifically, the audio acceleration, and texture images of 108 different surface
signal A0 and haptic signal H 0 are respectively projected into materials gathered by a three-axis accelerometer, a microphone
feature space through CNN and GRU networks. Both extracted and a camera. According to the material properties, these
(a) GAN (b) WGAN (c) StackGAN
(d) ViTac (e) AHFVR(haptic-only) (f) AHFVR
Fig. 8: The restored visual signals by (a) GAN, (b) WGAN, (c) StackGAN, (d) ViTac, (e) AHFVR(haptic-only), and (f)
AHFVR
surface materials are classified into nine categories: mesh, effectiveness of the proposed AHFVR approach.
stone, glossy, wood, rubber, fiber, foam, foils and paper, textile GAN [34] is the most basic generative adversarial model,
and fabric. To avoid repetition, we randomly use 80% data in which can generate visual signals by a procedure of max-min
each category as the training set, and the remaining 20% as optimization.
the testing set. WGAN [35] is a variant of GAN whose training aim is to
minimize the Wasserstein distance.
B. Implementation Details StackGAN [36] is an another variant of GAN which is used
Network Structure: We adopt CNN to perform feature to generate visual signals conditioned on textual description.
extraction for audio and visual modality signals, which con- ViTac [37] provides an approach based on the improved
tains four convolution layers and three fully-connected layers GAN to generate visual signals conditioning on the semantic
(1024 → 128 → 6). The number of convolution kernels is features encoded by haptic modalities. It considers the corre-
512, 256, 128, 64. The size of the kernels is 5 × 5. Moreover, lation between haptic and visual modalities.
GRU with 256 units is used for the haptic feature extraction. AHFVR(haptic-only) is the AHFVR which removes the
The fusion network is built as a five-layer fully-connected audio modality. In this case, feature fusion component is
neural network (512 → 1024 → 512 → 128 → 64), and the discarded and only haptic modality is employed to generate
discriminator D1 is a four-layer fully-connected neural net- corresponding visual signals. It can also be utilized to verify
work (512 → 1024 → 512 → 1). The generator G has the the effectiveness of audio modality and multi-modal fusion.
same but the reversed network structure with the visual fea- Moreover, we also conduct some baseline experiments,
ture extraction network. The network of discriminator D2 is including AHFVR(audio-only), without latent space learning,
composed of four convolution layers and two fully-connected and without-distillation to estimate the validity of each com-
layers (1024 → 1). The number of convolution kernels is 512, ponent in the proposed AHFVR approach.
256, 128, 64. The size of the kernels is 5 × 5. Except for
softmax layer and sigmoid layer, the other feature layers are D. Evaluation Metrics
activated by relu function and batchnorm regularization.
Here, we choose the following metrics for evaluation.
Training Details: We first set batch size as 25 in the train-
Inception Score is a numerical metric for quantitative
ing procedure. Category labels are coded as one-hot vector.
evaluation, measuring the performance of the generated visual
Network parameters are learned by optimization described in
signals. Its calculation formula is
Section V-D for all modules with λ = 10−3 , α = 0.1 and
β = 10−5 . In details, the learning rate for the generator is

IS = exp EV̂ KL(p(ŷ|V̂ )kp(ŷ)) , (13)
0.0005 and the discriminator is 0.0001. The training procedure
is adopted at 200 epochs for all the datasets. where V̂ represents a generated visual signal, and ŷ is the
predicted label through Inception networks. p(ŷ) denotes the
C. Compared Approaches marginal distribution. p(ŷ|V̂ ) denotes the possibility of the
Here, we select the following representative models and category ŷ the given visual signal V̂ belongs to. The higher
approaches for performance comparison to demonstrate the inception scores means the better performance.
FID calculates the distance between real signal and gen-

erated signal in the feature space. We first use the Inception
network to extract features and then apply Gaussian model
to fit the feature space, finally calculate the distance between
the two features. The smaller FID represents the better perfor-
mance.
SSIM is an image quality assessment method, generally
used to measure similarity among a couple of visual signals.
The higher SSIM represents the better performance.
PSNR is a common and widely used objective evaluation
metric for visual signals, defined as:
!
2
(2n − 1)
P SN R = 10 × log10 , (14)
M SE
where M SE is the mean squared error between the generated
and the real visual signals. n represents the number of bits.
The higher PSNR means the better performance. Fig. 9: SSIM performance of the approaches in the ablation
experiment
E. Experimental Results and Analysis
The inception score and FID of the proposed AHFVR and Only Haptic/Audio Modality for Fusion: This scheme
the other competing approaches are given in Table II. From only employs haptic or audio signals as prior knowledge to
this table, we can see that the proposed AHFVR approach train the AHFVR. From Fig. 9 we can obviously see that the
obtains the highest inception score and the smallest FID results are excessively unsatisfied. In other words, it is difficult
score. The reasons lie in two aspects: on the one hand, the to use only haptic or audio modality signals to generate visual
AHFVR has the highest semantic correlations, demonstrating signals with explicit content.
that the semantic consistency between the received audio- Without Latent Space Learning: This scheme cannot
haptic signals and the generated visual signals has been perform the modality irrelevant latent space learning in Section
explored. On the other hand, it also indirectly reveals that the V-B. Comparing with the AHFVR approach, this scheme
AHFVR performs well in both quality and diversity. Moreover, achieves worse performance. Thus, it is evident that latent
compared with WGAN and StackGAN, the results of ViTac space learning can offer more accurate and detailed visual
and AHFVR(haptic-only) approaches are slightly better. The elements from the real visual signals in training dataset.
reason is that they add a semantic correlation learning module Without Distillation: This scheme generates visual signals
between non-visual modality and visual modality. What’s without the guidance of real image in hierarchical latent space.
more, in Fig. 8, the fine-grained visual signals generated by In other words, the knowledge distillation is not considered.
the AHFVR reflect the importance of knowledge distillation. We can clearly observe from Fig. 9 that the quality of the
Finally, compared with AHFVR(haptic-only) approach, the generated visual signals is poor. Therefore, knowledge distil-
combination and fusion of audio and haptic modalities in lation technique can help the AHFVR achieve more desirable
the AHFVR approach is highly significant for accuracy and performance.
robustness.
TABLE II: Inception score and FID of the proposed AHFVR VII. P RACTICAL P LATFORM FOR E VALUATING THE
and the other competing approaches C ROSS -M ODAL S IGNAL R ECONSTRUCTION S TRATEGY
In this section, a practical cross-modal communication plat-
Approach Inception score FID form is constructed for evaluating the proposed signal recon-
Original 10.25 ± 0.65 36.97 struction strategy. The structure of this platform is depicted in
GAN 7.73 ± 0.46 134.12
WGAN 7.23 ± 0.77 142.06
Fig. 10, which is composed of the master domain, the network
StackGAN 8.75 ± 0.87 122.34 domain, and the slave domain. Among them, the master
ViTac 7.36 ± 0.64 125.20 domain is a haptic device connected to a laptop. The haptic
AHFVR(haptic-only) 8.84 ± 0.79 111.48 device is used to send control commands and receive force
AHFVR 9.88 ± 0.84 48.84
feedback, and the laptop is used to display the received images
and audio. The slave domain is the combination of a robot
Furthermore, the ablation experiment are also performed on arm equipped with force sensors, a microphone, and a camera
the LMT dataset to verify the effectiveness of components for the collection of haptic, audio, and visual signals. For the
in the proposed AHFVR approach. Several baseline schemes network domain, multi-modal streams are delivered via the
are formed by discarding different components in AHFVR public 5G network and edge servers. All of the received signals
during training phase. We calculate the SSIM metric every belong to nine semantic categories: slate, wood, cardboard,
50 epoches. The results are presented in Fig. 9, which are silk, foam, brass, linen, bubble cushion film, porous plastic
analyzed as follows: sheet.
base station
Micro-
phone
gateway
camera force sensors
database
router
Master Domain Network domain Slave Domain
Fig. 10: Schematic overview of the practical cross-modal communication platform
(a) GAN (b) WGAN (c) StackGAN
(d) ViTac (e) AHFVR(haptic-only) (f) AHFVR
Fig. 11: The restored visual signals by (a) GAN, (b) WGAN, (c) StackGAN, (d) ViTac, (e) AVHR(haptic-only), and (f)
cross-modal signal construction with AHFVR
Fig. 11 shows those reconstructed visual signals and provide TABLE III: SSIM and PSNR of the proposed cross-modal
some quantitative information about generation quality. In signal reconstruction with AHFVR and the other competing
particular, SSIM and PSNR are selected to measure similarity schemes
between the reconstructed visual signals and real visual signal-
Approach SSIM PSNR
s. The results are given in Table III. It can be seen that the AH- Original 0.950 −
FVR achieves the best performance on the practical platform. GAN 0.199 9.782
Moreover, the basic generative models, like GAN and WGAN, WGAN 0.201 10.056
always achieve the worst performance. Although StackGAN StackGAN 0.216 10.178
ViTac 0.235 10.678
has made some improvements, it is still limited by discrep- AHFVR(haptic-only) 0.241 10.831
ancy of heterogeneous streams. Compared with AHFVR, the AHFVR 0.412 14.021
performance of AHFVR(haptic-only) is inferior due to the
lack of supplementary audio information. In summary, the
proposed perception-aware cross-modal signal reconstruction They give their experience scores 1∼5 from two aspects: qual-
strategy obtains the best performance, guaranteeing quality of ity scale and classification scale (discrimination ability of the
the received signals. reconstructed images). They respectively represent quality and
Finally, in order to verify the performance of the proposed discrimination quality of the reconstructed images compared
strategy, user’s quality of experience (QoE) is measured based with those of the real images. Scores from the participants
on their subjective assessment of the synthesized images. are averaged and the results are shown in Fig. 12. It can be
Therefore, in this qualitative evaluation, 15 participants who observed that the proposed cross-modal signal reconstruction
have normal senses are invited to manipulate the haptic devices strategy with the AHFVR achieves the best performance on
at the master terminal and perceive the received visual signals. end user’s experience.
[5] X. Wei and L. Zhou, “AI-enabled cross-modal communications,” IEEE

Wirel. Commun., vol. 28, no. 4, pp. 182-189, 2021.
[6] B. Cizmeci, X. Xu, R. Chaudhari, N. Alt, and E. Steinbach, “A
multiplexing scheme for multimodal teleoperation,” ACM Trans. Multim.
Comput., vol. 13, no. 2, p. 21, 2017.
[7] M. S. Elbamby, C. Perfecto, M. Bennis, and K. Doppler, “Toward low-
latency and ultra-reliable virtual reality,” IEEE Network, vol. 32, no. 2,
pp. 78-84, 2018.
[8] X. Wei, M. Zhang, and L. Zhou, “Cross-modal transmission strategy,”
IEEE Trans. Circ. Syst. Vid., vol. 32, no. 6, pp. 3991-4003, 2022.
[9] H. Cheng, Y. Guo, J. Yin, H. Chen, J. Wang, and H. Nie, “Audio-
driven talking video frame restoration,” IEEE Trans. Multimedia, DOI:
AHFVR 10.1109/TMM.2021.3118287, 2021.
(haptic-only) [10] S. Zou, Q. Wang, J. Ge, and Y. Tian, “Peer-assisted video streaming with
RTMFP flash player: A measurement study on PPTV,” IEEE Trans. Circ.
Syst. Vid., vol. 28, no. 1, pp. 158-170, 2018.
[11] M. T. Hossan, M. Z. Chowdhury, M. Shahjalal, and Y. M. Jang, “Human
bond communication with head-mounted displays: Scope, challenges,
solutions, and applications,” IEEE Commun. Mag., vol. 57, no. 2, pp.
26-32, 2019.
Fig. 12: QoE performance of all the approaches on the [12] X. W. Tang, X. L. Huang, F. Hu, and Q. Shi, “Human-perception-
oriented pseudo analog video transmissions with deep learning,” IEEE
practical cross-modal communication platform Trans. Circ. Syst. Vid., vol. 69, no. 9, pp. 9896-9909, 2020.
[13] X. Yi, K. Yang, X. Zhao, Y. Wang, and H. Yu, “AHCM: Adaptive
huffman code mapping for audio steganography based on psychoacoustic
model,” IEEE Trans. Inf. Foren. Sec., vol. 14, no. 8, pp. 2217-2231, 2019.
VIII. C ONCLUSION [14] R. Chaudhari, C. Schuwerk, M. Danaei, and E. Steinbach, “Perceptual
and bitrate-scalable coding of haptic surface texture signals,” IEEE J.
In this paper, we propose cross-modal signal reconstruction Sel. Top. Signal Process., vol. 9, no. 3, pp. 462-473, 2015.
from the perspective of human’s perceptual facts. The key [15] B. Li, J. P. Munoz, X. Rong, Q. Chen, and J. Xiao, “Vision-based mobile
insight is that it borrows the related mechanisms from the indoor assistive navigation aid for blind people,” IEEE Trans. Mobile
Comput., vol. 18, no. 3, pp. 702-714, 2019.
human’s perception system, which has several similarities with [16] M. Yuan and Y. Peng, “CKD: Cross-task knowledge distillation for text-
cross-modal communications. As an initial trial, we design a to-image synthesis,” IEEE Trans. Multimedia, vol. 22, no. 8, pp. 1955-
time-frequency masking-based audio-haptic redundancy elim- 1968, 2020.
[17] H. Liu, Y. Wu, F. Sun, B. Fang, and D. Guo, “Surface material retrieval
ination mechanism as well as an audio-haptic fused visual using weakly paired cross-modal learning,” IEEE Trans. Autom. Sci.
signal restoration approach. The ultimate aim is to guarantee Eng., vol. 16, no. 2, pp. 781-791, 2019.
quality of received visual signals and end-user’s experience. [18] M. S. Siddiqui, S. O. Amin, and C. S. Hong, “A set-top box for end-to-
end QoS management and home network gateway in IMS,” IEEE Trans.
From a standard multi-modal dataset and a practical cross- Consum. Electr., vol. 55, no. 2, pp. 527-534, 2009.
modal communication platform, the effectiveness of the pro- [19] E. Mohamad and A. E. Saddik, “Admux communication protrocol for
posed architecture and strategy can be verified. real-time multimodal intreaction,” IEEE/ACM Int. Sym. Distr. Simulat.
Real Time Appl., Dublin, Ireland, pp. 118-123, 2012.
The present paper also leaves several open and interest- [20] L. Madmoni, S. Tibor, I. Nelken, and B. Rafaely, “The effect of
ing questions. First, what’s the theoretical threshold about partial time-frequency masking of the direct sound on the perception of
restoration of one modality signals from the other modality reverberant speech,” IEEE-ACM Trans. Audio Spe., vol. 29, pp. 2037-
2047, 2021.
signals. Second, during transmission in several scenarios, the [21] R. Hassen, B. Gulecyuz, and E. Steinbach, “PVC-SLP: Perceptual
impaired or delayed phenomenon of visual, audio, haptic vibrotactile-signal compression based-on sparse linear prediction,” IEEE
signals may randomly occur. How to effectively handle this Trans. Multimedia, vol. 23, pp. 4455-4468, 2021.
[22] J. Yang, C. Wang, B. Jiang, H. Song, and Q. Meng, “Visual perception
issue when more than one modality signal is in trouble. Third, enabled industry intelligence: State of the art, challenges and prospects,”
the perceptual facts and theory adopted in this paper is rela- IEEE Transactions on Industrial Informatics, vol. 17, no. 3, pp. 2204-
tively superficial. More thorough analysis and exploitation for 2219, 2021.
[23] J. Yin, B. Chen, and Y. Li, “Highly accurate image reconstruction
characteristics and interaction mechanisms of human’s visual, for multimodal noise suppression using semisupervised learning on big
auditory, haptic sensory systems are needed for designing data,” IEEE Trans. Multimedia, vol. 20, no. 11, pp. 3045-3056, 2018.
more powerful cross-modal transmission and signal processing [24] G. Yuan and B. Ghanem, “`0 TV: A Sparse optimization method for
impulse noise image restoration,” IEEE Trans. Pattern Anal., vol. 41,
approaches. We will carefully study them in our ongoing no. 2, pp. 352-364, 2019.
works. [25] Z. Yang, H. Liu, and D. Cai, “On the diversity of conditional image
synthesis with semantic layouts,” IEEE Trans. Image Process., vol. 28,
no. 6, pp. 2898-2907, 2019.
R EFERENCES [26] H. Xie, A. Boukerche, and A. A. F. Loureiro, “MERVS: A novel
multichannel error recovery video streaming protocol for vehicle ad hoc
[1] Z. Yuan, G. Ghinea, and G-M. Muntean, “Beyond multimedia adap- networks,” IEEE Trans. Veh. Technol., vol. 65, no. 1, pp. 923-935, 2016.
tion: Quality of experience-aware multi-sensorial media delivery,” IEEE [27] T. Lin, H. Tseng, Y. Wen, F. Lai, C. Lin, and C. Wang, “Reconstruction
Trans. Multimedia, vol. 17, no. 1, pp. 104-117, 2015. algorithm for lost frame of multiview videos in wireless multimedia sen-
[2] W. Zhu, X. Wang, and W. Gao, “Multimedia intelligence: When mul- sor network based on deep learning multilayer perceptron regression,”
timedia meets artificial intelligence,” IEEE Trans. Multimedia, vol. 22, IEEE Sens. J., vol. 18, no. 23, pp. 9792-9801, 2018.
no. 7, pp. 1823-1835, 2020. [28] Y. Peng, K. Cao, and P. C. Cosman, “Generalization of the dark channel
[3] L. Zhou, D. Wu, X. Wei, and J. Chen, “Cross-modal stream scheduling prior for single image restoration,” IEEE Trans. Image Process., vol. 27,
for ehealth,” IEEE J. Sel. Area Comm., vol. 39, no. 2, pp. 426-437, 2021. no. 6, pp. 2856-2868, 2018.
[4] L. Zhou, D. Wu, J. Chen, and X. Wei, “Cross-modal collaborative [29] Z. Jin, M. Z. Iqbal, D. Bobkov, W. Zou, X. Li, and E. Steinbach,
communications,” IEEE Wirel. Commun., vol. 27, no. 2, pp. 112-117, “A flexible deep CNN framework for image restoration,” IEEE Trans.
2020. Multimedia, vol. 22, no. 4, pp. 1055-1068, 2020.
[30] S. Niu, Y. Jiang, and B. Chen, “Cross-modality transfer learning for Liang Zhou (SM’18) received his Ph.D. degree
image-text information management,” ACM Transactions on Manage- major at electronic engineering both from École
ment Information Systems, vol. 13, no. 1, p. 14, 2022. Normale Supérieure (E.N.S.), Cachan, France and
[31] R. M. Chen, N. N. Mahatme, Z. J. Diggins, and L. Wang, “Impact of Shanghai Jiao Tong University, Shanghai, China in
temporal masking of flip-flop upsets on soft error rates of sequential 2009. Now, he is a professor in Nanjing Univer-
circuits,” IEEE Trans. Nucl. Sci., vol. 64, no. 8, pp. 2098-2106, 2017. sity of Posts and Telecommunications, China. His
[32] N. Otsu, “A threshold selection method from gray-level histograms,” research interests are in the area of multimedia
IEEE Transactions on Syst, Man, Cybernetics, vol. 9, no. 1, pp. 62-66, communications and computing.
1979.
[33] M. Strese, C. Schuwerk, A. Iepure, and E. Steinbach, “Multimodal
feature-based surface material classification,” IEEE Trans. Haptics, vol.
10, no. 2, pp. 226-239, 2017.
[34] C. Wang, C. Xu, X. Yao, and D. Tao, “Evolutionary generative adversar-
ial networks,” IEEE Trans. Evolut. Comput., vol. 23, no. 6, pp. 921-934,
2019.
[35] I. Deshpande, Y. Hu, R. Sun, A. Pyrros, and A. Schwing, “Max-sliced
Wasserstein distance and its use for GANs,” IEEE/CVF Conf. Comput.
Vis. Pattern Recog. (CVPR), Long Beach, USA, pp. 10640-10648, 2019.
[36] H. Zhang, T. Xu, H. Li, and S. Zhang, “StackGAN++: Realistic image
synthesis with stacked generative adversarial networks,” IEEE Trans.
Pattern Anal., vol. 41, no. 8, pp. 1947-1962, 2019.
[37] J. Lee, D. Bollegala, and S. Luo, “Touching to see and seeing to
feel: Robotic cross-modal sensory data generation for visual-tactile
perception,” Int. Conf. Robot. Autom. (ICRA), Montreal, Canada, pp.
4276-4282, 2019.
Xin Wei (M’19) received his Ph.D. degree major

at information and communication engineering from
Southeast University, Nanjing, China in 2009. Now,
he is a professor in Nanjing University of Posts and
Telecommunications, China. His research interests
are in the areas of multimedia communications and
computing, educational technology.
Yuyuan Yao received her M.S. degree major at

information and communication engineering from
Nanjing University of Posts and Telecommunica-
tions, Nanjing, China in 2022. Her research interest
is in the area of multimedia communications.
Haoyu Wang received his B.E. degree major at

communication engineering from Nanjing Universi-
ty of Posts and Telecommunications, Nanjing, China
in 2021, where he is currently pursuing the M.S.
degree major at information and communication
engineering. His research interest is in the area of
multimedia communications.

Perception-Aware Cross-Modal Signal Reconstruction: From Audio-Haptic To Visual

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Perception-Aware Cross-Modal Signal Reconstruction: From Audio-Haptic To Visual

Uploaded by

Copyright:

Available Formats

This article has been accepted for publication in IEEE Transactions on Multimedia.

IEEE TRANSACTIONS ON MULTIMEDIA 1

Perception-Aware Cross-Modal Signal

IEEE TRANSACTIONS ON MULTIMEDIA 2

II. R ELATED W ORK

IEEE TRANSACTIONS ON MULTIMEDIA 3

Master Domain Network Domain Slave Domain

haptic Intact haptic

Fig. 2: The perception-aware cross-modal signal reconstruction architecture

IEEE TRANSACTIONS ON MULTIMEDIA 4

signals are initially uploaded to the edge node for modality

IV. T IME -F REQUENCY M ASKING -BASED AUDIO -H APTIC

Fig. 5: (a) Average throughput; b) PSNR of the proposed

guarantee receiving efficiency, we consider audio and haptic

IEEE TRANSACTIONS ON MULTIMEDIA 5

received real visual

IEEE TRANSACTIONS ON MULTIMEDIA 6

IEEE TRANSACTIONS ON MULTIMEDIA 7

LH LH L ... LHl ... LH1

f G (1)( f ; g(1) ) ... G (l )( f ; g(l ) ) ... G (L )( f ; g(L ) ) Vˆ

Fig. 7: The structure of knowledge distillation for visual signal generation

IEEE TRANSACTIONS ON MULTIMEDIA 8

(a) GAN (b) WGAN (c) StackGAN

(d) ViTac (e) AHFVR(haptic-only) (f) AHFVR

IEEE TRANSACTIONS ON MULTIMEDIA 9

FID calculates the distance between real signal and gen-

IEEE TRANSACTIONS ON MULTIMEDIA 10

Fig. 10: Schematic overview of the practical cross-modal communication platform

(a) GAN (b) WGAN (c) StackGAN

(d) ViTac (e) AHFVR(haptic-only) (f) AHFVR

IEEE TRANSACTIONS ON MULTIMEDIA 11

[5] X. Wei and L. Zhou, “AI-enabled cross-modal communications,” IEEE

IEEE TRANSACTIONS ON MULTIMEDIA 12

Xin Wei (M’19) received his Ph.D. degree major

Yuyuan Yao received her M.S. degree major at

Haoyu Wang received his B.E. degree major at

You might also like