Contrastive Learning based Semantic Communication for Wireless Image Transmission

Shunpu Tang, Qianqian Yang, Lisheng Fan, Xianfu Lei, Yansha Deng, and Arumugam Nallanathan, Fellow, IEEE

Abstract—Recently, semantic communication has been widely applied in wireless image transmission systems, as it prioritizes the preservation of meaningful semantic information in images over the accuracy of transmitted symbols, leading to improved communication efficiency. However, existing semantic communication approaches still face limitations in achieving considerable inference performance in downstream AI tasks such as image recognition, or in balancing the inference performance against the quality of the reconstructed image at the receiver. Therefore, this paper proposes a contrastive learning (CL)-based semantic communication approach to overcome these limitations. Specifically, we regard the image corruption during transmission as a form of data augmentation in CL and leverage CL to reduce the semantic distance between the original image and the corrupted reconstruction while maintaining the semantic distance among irrelevant images for better discrimination in downstream tasks. Moreover, we design a two-stage training procedure and the corresponding loss functions for jointly optimizing the semantic encoder and decoder to achieve a good trade-off between the performance of image recognition in the downstream task and the reconstruction quality. Simulations are finally conducted to demonstrate the superiority of the proposed method over competitive approaches. In particular, the proposed method achieves up to a 56% accuracy gain on the CIFAR-10 dataset when the bandwidth compression ratio is 1/48.

Index Terms—Semantic communication, image transmission, contrastive learning, joint source-channel coding.

arXiv:2304.09438v1 [cs.IT] 19 Apr 2023

S. Tang and L. Fan are both with the School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China (e-mail: tangshunpu@e.gzhu.edu.cn, lsfan@gzhu.edu.cn). Q. Yang is with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China (e-mail: qianqianyang20@zju.edu.cn). X. Lei is with the Institute of Mobile Communications, Southwest Jiaotong University, Chengdu, China (e-mail: xflei@home.swjtu.edu.cn). Y. Deng is with the Department of Engineering, King's College London, London, U.K. (e-mail: yansha.deng@kcl.ac.uk). A. Nallanathan is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London, U.K. (e-mail: a.nallanathan@qmul.ac.uk).

I. INTRODUCTION

Recently, semantic communication has emerged as a promising approach for efficient image transmission in wireless networks, and has attracted increasing research interest from academia. Compared with the conventional communication paradigm based on Shannon's theory, semantic communication prioritizes the preservation of meaningful semantic information over the accuracy of transmitted symbols, which can significantly reduce the amount of data to be transmitted and improve communication efficiency [1].

A major challenge in semantic communication for image transmission is to effectively extract semantic information at the transmitter while accurately reconstructing it at the receiver under limited communication conditions. To overcome this issue, deep learning (DL) has been applied to semantic communication, owing to its growing success in semantics extraction and reconstruction. In this direction, the authors in [2], [3] proposed a DL-based joint source-channel coding scheme (DeepJSCC), where the encoder and decoder were designed as an autoencoder and jointly optimized for semantic information transmission to achieve good image reconstruction quality. Moreover, a collaborative training framework for semantic communication was proposed in [4], where users could train their semantic encoders to improve the performance of unknown downstream inference tasks. However, these approaches still suffer from limitations in achieving considerable inference performance and in making a good trade-off between inference performance and image reconstruction quality according to communication conditions.

In essence, the semantic distance between two irrelevant images is large due to their different main semantic information, while the distance between two nearly identical images sharing the same semantic information is small. Motivated by this fact, we propose to integrate contrastive learning (CL) with semantic communication and design a two-stage training procedure. The key idea of this framework is that we regard the image corruption that occurs during transmission over a limited channel as a form of the data augmentation operation in [5], and propose semantic contrastive coding to reduce the semantic distance (a.k.a. semantic similarity) between the original and reconstructed image while maintaining the semantic distance among irrelevant images for better discrimination. Based on the contrastive loss, we design the two-stage training procedure and loss functions for jointly optimizing the encoder and decoder, thereby achieving a good balance between the inference performance in the downstream task and the reconstruction quality. Simulations are finally conducted to demonstrate the superiority of the proposed method over competitive approaches.

II. SYSTEM MODEL

This paper investigates a semantic communication system for wireless image transmission, where a convolutional neural network (CNN) based semantic encoder and decoder are deployed at the transmitter and receiver, respectively. The semantic encoder extracts the semantic information of the input image x ∈ R^(c×h×w) and directly realizes the non-linear mapping from semantic information to the k-dimensional complex-valued vector s̃ ∈ C^k, given by

    s̃ = E_θ1(x),    (1)

where E_θ1(·) represents the semantic encoding operation with parameter θ1, and c, h, and w denote the number of channels,
Fig. 1: Network architecture of the semantic encoder and decoder. The encoder comprises a head conv, two down-sampling modules, and a channel coding module ending in power normalization; the decoder mirrors it with a re-coding module, two up-sampling modules (PixelShuffle), and a head conv, with ResBlocks in the sampling modules.

Fig. 2: Illustration of the proposed semantic contrastive coding. The anchor (original image) and positive (reconstruction) are mapped by the CNN backbone and projection network into the semantic space, while irrelevant samples serve as negatives.
height, and width of the image, respectively. For simplicity, we use n = c × h × w to denote the dimension of x. Typically, k < n should be satisfied to meet the bandwidth constraint, and k/n is referred to as the bandwidth compression ratio. In particular, a large bandwidth compression ratio indicates a favorable communication condition, whereas a small one indicates limited bandwidth usage. In addition, a power normalization layer [2] is used at the end of the semantic encoding network to satisfy the average power constraint (1/k)E[s*s] ≤ P at the transmitter, which can be written as

    s = √(kP) · s̃ / √(s̃*s̃),    (2)

where s is the channel input signal that meets the power constraint, and * denotes the conjugate transpose. Next, s is transmitted over the additive white Gaussian noise (AWGN) channel, given by

    ŝ = s + ε,    (3)

where ŝ is the received signal, and ε ∈ C^k denotes the independent and identically distributed (IID) channel noise sample, which follows the symmetric complex Gaussian distribution CN(0, σ²I) with zero mean and variance σ².

The semantic decoder deployed at the receiver reconstructs the original image x̂ ∈ R^(c×h×w) from ŝ according to

    x̂ = D_θ2(ŝ),    (4)

where D_θ2(·) is the semantic decoding operation parameterized by θ2. Subsequently, x̂ is used to perform the downstream task and obtain the inference results through the following process:

    f_x = F^b_φ1(x̂),    (5)

where F^b_φ1(·), characterized by parameter φ1, denotes the feature extraction operation performed by the CNN backbone of the downstream task, and f_x = {f^(1), f^(2), ..., f^(C)} is the output feature map with C channels. The inference result ŷ can be obtained by passing f_x to the classifier F^cls_φ2(·) with parameter φ2, which can be expressed as

    ŷ = F^cls_φ2(f_x).    (6)

Since maintaining semantic information in the reconstructed image is crucial for the inference performance, especially when the channel bandwidth is limited, it is of vital importance to carefully design the semantic encoder and decoder, as well as the training procedure.

III. PROPOSED FRAMEWORK

In this section, we introduce the proposed CL-based semantic communication framework. Specifically, we first present the architecture of the semantic encoder and decoder, and then provide the details of semantic contrastive coding and the training procedure.

A. Architecture of Semantic Encoder and Decoder

The network architecture of the semantic encoder and decoder plays a critical role in the extraction of semantic information. Therefore, we do not adopt the straightforward approach of stacked convolutional layers in [2], as such a simple architecture lacks this ability. The architecture of the proposed semantic encoder and decoder is presented in Fig. 1. The semantic encoder comprises a 5 × 5 head convolution, two down-sampling modules, and a channel coding module. Each down-sampling module includes a basic block of ResNet [6] (referred to as a ResBlock) for capturing the spatial features of the image, and a 4 × 4 convolution with stride 2 for down-sampling the image. The channel coding module mitigates channel corruption and outputs the k-dimensional complex-valued channel input that satisfies the bandwidth and power constraints.

Moreover, we adopt a symmetrical architecture in the decoder, which consists of a 5 × 5 head convolution, two up-sampling modules, and a re-coding module. In the up-sampling modules, ResBlocks are also used as in the encoder, and we adopt the Pixel-Shuffle technique [7] to up-sample the image, as it provides a more efficient computing paradigm and better reconstruction performance than the transposed convolution used in [2]. The re-coding module consists of a 3 × 3 convolution followed by a Sigmoid activation function to generate the reconstructed image. Notably, batch normalization and a parametric rectified linear unit (PReLU) activation function follow all convolutions unless otherwise specified.
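As a concrete illustration, the power normalization in (2) and the AWGN channel in (3) can be sketched in a few lines of NumPy. This is a minimal sketch under the paper's stated model, not the authors' implementation; all function and variable names are illustrative.

```python
import numpy as np

def power_normalize(s_tilde, P=1.0):
    """Scale the encoder output so that the average symbol power
    (1/k) * s^H s equals P, as in Eq. (2)."""
    k = s_tilde.shape[0]
    energy = np.sqrt(np.real(np.vdot(s_tilde, s_tilde)))  # sqrt(s̃* s̃)
    return np.sqrt(k * P) * s_tilde / energy

def awgn_channel(s, snr_db):
    """Add IID symmetric complex Gaussian noise as in Eq. (3); the noise
    variance is set from the SNR assuming unit average signal power."""
    k = s.shape[0]
    sigma2 = 10 ** (-snr_db / 10)  # noise variance sigma^2 for P = 1
    noise = np.sqrt(sigma2 / 2) * (np.random.randn(k) + 1j * np.random.randn(k))
    return s + noise

# Example: a 32x32 RGB image has n = 3 * 32 * 32 = 3072 real values;
# at bandwidth compression ratio k/n = 1/48 it maps to k = 64 complex symbols.
n = 3 * 32 * 32
k = n // 48  # 64
s_tilde = np.random.randn(k) + 1j * np.random.randn(k)
s = power_normalize(s_tilde)
s_hat = awgn_channel(s, snr_db=20)
print(np.real(np.vdot(s, s)) / k)  # average power ≈ 1.0
```

Note that after normalization the average power constraint holds with equality, so the SNR at the receiver is determined entirely by the noise variance σ².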

B. Semantic Contrastive Coding

The key design of semantic contrastive coding is inspired by the success of CL, which employs data augmentation to generate samples with similar visual representations and minimizes the distance among them to pretrain the backbone. We modify the CL process to adapt it to the semantic communication system. Specifically, we replace the data augmentation with the process of wireless transmission, as the image corruption that occurs during transmission can be viewed as a form of data augmentation, and the original image and the reconstructed one should keep a small semantic distance in an efficient semantic communication system. Moreover, we utilize a pretrained backbone to extract features and a learnable projection network to map these features into the semantic space. Furthermore, by incorporating the contrastive loss in the semantic space, we jointly optimize the semantic encoder and decoder rather than pretraining the backbone as in CL.

The details of the proposed semantic contrastive coding are shown in Fig. 2. The process begins with the semantic encoding and decoding of a typical image x in a training batch B, from which we obtain the reconstruction x̂. The backbone network F^b_φ1(·) is applied to x and x̂, generating the feature maps f_x = F^b_φ1(x) and f_x̂ = F^b_φ1(x̂), respectively. Next, a fully connected projection network P_ψ(·) with learnable parameter ψ, followed by a normalization operation, maps the features into a semantic space defined as a hypersphere. During the training stage, P_ψ(·) is updated to enhance the understanding of the features, thereby learning the mapping from features to semantics. Specifically, the projected results of f_x and f_x̂ can be represented as q_x = P_ψ(f_x) and v_+ = P_ψ(f_x̂), respectively, where q_x is referred to as the anchor and v_+ is called the positive. We can use the cosine similarity between the anchor and the positive to define the semantic distance between x and x̂.

For the remaining samples m ∈ B\{x} within the training batch B, the same procedure is followed. Specifically, we obtain the feature map f_m = F^b_φ1(m) by feeding m into the backbone network, and then project f_m into the semantic space using v_m = P_ψ(f_m), where we refer to v_m as a negative. Similarly, the semantic distance between x and m can be defined as the cosine similarity between the anchor and the negative.

The objective of semantic contrastive coding is to minimize the semantic distance between the original and reconstructed image while maximizing the semantic distance between the original image and the irrelevant images. Therefore, we can use the InfoNCE function [5] to define the semantic contrastive loss for the training batch B, which can be expressed as

    L_sem = E_{x∈B}[ −log( exp(q_x · v_+/τ) / Σ_{m∈B\{x}} exp(q_x · v_m/τ) ) ],    (7)

where τ > 0 is the temperature coefficient used to smooth the probability distribution. Next, we introduce how semantic contrastive coding and the semantic contrastive loss are used to design the loss functions and the training procedure.

C. Loss Function and Training Procedure

Based on semantic contrastive coding, we design a two-stage training strategy to optimize the semantic encoder and decoder. The first stage is pre-training, where we employ the semantic contrastive coding approach to train the weights of the encoder θ1, the decoder θ2, and the projection network ψ simultaneously. However, it is difficult to achieve fast convergence when we optimize only the semantic contrastive loss. To tackle this issue, we combine it with the reconstruction loss between x and x̂, since reducing the reconstruction loss helps improve the convergence speed in the early training rounds. Specifically, we apply the mean square error (MSE) function to evaluate the reconstruction loss over a training batch B, which can be expressed as

    L_rec = E_{x∈B}[ (1/n) ||x − x̂||²₂ ].    (8)

Therefore, the loss function in the first training stage can be summarized as the linear combination

    L_1 = α_1 L_rec + (1 − α_1) L_sem,    (9)

where α_1 ∈ [0, 1] is a hyper-parameter that controls the trade-off between the two loss terms. For instance, we can set α_1 = k/n in a practical semantic communication system. In this context, the system prioritizes the preservation of semantic information over the reconstruction quality when the bandwidth compression ratio is small. In contrast, as the bandwidth compression ratio increases, the system shifts its focus towards maintaining the reconstruction quality.

In the second training stage, we aim to further optimize the performance of the semantic communication system by jointly fine-tuning the encoder, decoder, and classifier with a small learning rate to achieve considerable inference performance and reconstructed image quality. One reason for fine-tuning the classifier is that the weights of the backbone and classifier are typically trained without considering channel corruption, which causes the outputs of the backbone network to undergo substantial changes when the reconstructed images are fed in instead of the original images. This can result in performance degradation. Therefore, fine-tuning the classifier together with the semantic encoder and decoder mitigates this issue and helps enhance the semantics transmission. The loss function of this stage can be expressed as

    L_2 = α_2 L_rec + (1 − α_2) L_Task,    (10)

where α_2 ∈ [0, 1] is a hyper-parameter like α_1 and L_Task is the loss function of the downstream task. Specifically, when the downstream task is a classification problem, the cross-entropy function can be employed to model the loss, given by

    L_Task = E_{x∈B}[ −(1/N_cls) Σ_{i=1}^{N_cls} y_i log(ŷ_i) ],    (11)

where y_i and ŷ_i represent the ground truth and the predicted probability of the i-th class, respectively, and N_cls denotes the number of classes in the dataset.
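A minimal NumPy sketch of the semantic contrastive loss in (7) and the stage-1 objective in (9) may make the computation concrete. The batch here is random toy data, the projections of the other original images in the batch serve as the negatives (as in the description above), and every name is illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def normalize(z):
    """Project a batch of embeddings onto the unit hypersphere (row-wise)."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def semantic_contrastive_loss(q, v_pos, tau=0.1):
    """Semantic contrastive loss of Eq. (7).
    q:     (B, d) unit-norm anchors, projections of the original images
    v_pos: (B, d) unit-norm positives, projections of the reconstructions
    For anchor i the negatives are the projections q[j], j != i, of the
    other original images in the batch; dot products of unit vectors are
    cosine similarities."""
    B = q.shape[0]
    pos = np.sum(q * v_pos, axis=1) / tau       # q_x · v_+ / τ
    sim = (q @ q.T) / tau                       # pairwise q_x · v_m / τ
    mask = ~np.eye(B, dtype=bool)               # exclude m = x itself
    denom = np.sum(np.exp(sim) * mask, axis=1)  # Σ_{m∈B\{x}} exp(q_x·v_m/τ)
    return float(np.mean(-(pos - np.log(denom))))

def stage1_loss(x, x_hat, q, v_pos, alpha1, tau=0.1):
    """Stage-1 objective of Eq. (9): α1·L_rec + (1−α1)·L_sem,
    with the per-pixel MSE reconstruction loss of Eq. (8)."""
    l_rec = float(np.mean((x - x_hat) ** 2))
    return alpha1 * l_rec + (1 - alpha1) * semantic_contrastive_loss(q, v_pos, tau)

# Toy batch of 4 images with 32-dim projected embeddings.
rng = np.random.default_rng(0)
x = rng.random((4, 3, 32, 32))
x_hat = x + 0.05 * rng.standard_normal(x.shape)  # stand-in "reconstruction"
q = normalize(rng.standard_normal((4, 32)))
v_pos = normalize(rng.standard_normal((4, 32)))
loss = stage1_loss(x, x_hat, q, v_pos, alpha1=1/48)
```

Setting alpha1 = k/n (here 1/48) weights the reconstruction term lightly, matching the discussion after (9): at low bandwidth the system favors preserving semantics over pixel fidelity.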

Fig. 3: Performance on CIFAR-10 versus the bandwidth compression ratio with SNR = 20dB: (a) test accuracy, (b) PSNR, for the proposed method, DeepJSCC, DeepJSCC-ft, and DeepSC.

Fig. 4: Performance on CIFAR-10 versus the bandwidth compression ratio with SNR = 5dB: (a) test accuracy, (b) PSNR, for the proposed method, DeepJSCC, DeepJSCC-ft, and DeepSC.
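For reference, the PSNR reported in Figs. 3 and 4 is computed from the mean square error between the original and reconstructed images. A small sketch, assuming pixel values normalized to [0, 1] (the function name is illustrative):

```python
import numpy as np

def psnr(x, x_hat, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with peak value max_val."""
    mse = np.mean((x - x_hat) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on every pixel gives MSE = 0.01, i.e. about 20 dB.
x = np.zeros((3, 32, 32))
x_hat = np.full((3, 32, 32), 0.1)
print(psnr(x, x_hat))  # ≈ 20.0
```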

IV. SIMULATIONS

In order to verify the effectiveness of the proposed framework, we conduct experiments on CIFAR-10, which contains 60,000 32 × 32 color images divided into 10 classes. The training set comprises 50,000 images, while the test set contains 10,000 images. A pre-trained ResNet-56 [6] is used as the backbone network and classifier for downstream inference, while the structure of the semantic encoder and decoder is shown in Fig. 1. The projection network adopts a two-layer fully connected structure with an output dimension of 32. The number of training epochs for the two stages is set to 200 and 100, respectively, with a batch size of 128. Besides, we use the Adam optimizer with a learning rate of 0.01 for the first pre-training stage and 0.0001 for the second fine-tuning stage. These learning rates are decayed every 50 epochs with a decay factor of 0.5. We compare the proposed method with the following DL-based semantic communication methods:

• DeepJSCC [2]: DL-based joint source-channel coding that maps the original input to the channel input through the structure of an autoencoder.
• DeepJSCC-ft: an extension of DeepJSCC with the second-stage fine-tuning strategy of our proposed method, in which the encoder, decoder, and classifier are updated with a small learning rate.
• DeepSC [4]: a DL-based semantic coding framework that trains the semantic encoder and decoder with both semantic and observation losses to achieve efficient semantic information transmission. For a fair comparison, we freeze its encoder and decoder after training and retrain the classifier with a small learning rate.

Fig. 3a compares the accuracy of the proposed method, DeepJSCC, DeepJSCC-ft, and DeepSC, where the SNR is set to 20dB and the bandwidth compression ratio k/n varies from 1/48 to 1/2.5. From this figure, we find that the proposed method consistently outperforms or matches the competitive methods in accuracy. Specifically, when the compression ratio is 1/2.5, the proposed method, DeepJSCC, and DeepJSCC-ft achieve a test accuracy of about 94%, whereas DeepSC performs poorly with an accuracy of only 91%. As the bandwidth compression ratio decreases, the proposed method still maintains comparable accuracy. For instance, the proposed method achieves accuracies of 90.14% and 87.65% at bandwidth compression ratios of 1/24 and 1/48, respectively, which outperforms DeepSC by about 2.5% and 1% at the corresponding bandwidth compression ratios and also shows an accuracy gain of up to 56% over DeepJSCC. These results indicate that the proposed framework can effectively extract semantic information to meet the requirements of the downstream task and remove irrelevant redundant information so that the semantic information can be transmitted successfully, especially when the channel bandwidth is limited.

Fig. 3b presents the peak signal-to-noise ratio (PSNR) of the proposed method and the three competing methods, where the SNR is set to 20dB and the bandwidth compression ratio varies from 1/48 to 1/2.5. As shown in the figure, as the bandwidth compression ratio increases, the PSNRs of all methods improve. Although the proposed method sacrifices some image quality to prioritize semantic information when the bandwidth compression ratio is low, it can quickly catch up

Fig. 5: Visualization comparison of several methods on the Kodak dataset, where SNR = 20dB and the bandwidth compression ratio is 1/48. PSNR|MS-SSIM for the two example images: (a) DeepJSCC 20.23dB|0.83, DeepJSCC-ft 20.93dB|0.82, DeepSC 16.85dB|0.68, Proposed 19.83dB|0.82; (b) DeepJSCC 21.11dB|0.89, DeepJSCC-ft 21.88dB|0.88, DeepSC 17.92dB|0.73, Proposed 21.21dB|0.88.
with DeepJSCC's PSNR at higher compression ratios. Specifically, the proposed method achieves a PSNR of 38.75dB, which is close to the 39.27dB of DeepJSCC, and outperforms DeepSC with 36.91dB when the bandwidth compression ratio is 1/2.5. These results indicate that the proposed method prioritizes the transmission of semantic information over irrelevant background information to ensure the performance of the downstream task in bandwidth-limited scenarios, while transmitting enough background information to obtain good image quality when the bandwidth is not a bottleneck. This further demonstrates the effectiveness of the proposed method.

Fig. 4a and Fig. 4b present the performance comparison of several methods under poor channel conditions in terms of accuracy and PSNR, respectively. Specifically, both figures consider a low SNR of 5dB, and the bandwidth compression ratio varies from 1/48 to 1/2.5. From Fig. 4a, we observe that the proposed method still shows superiority in terms of accuracy over the three competitive methods, indicating its robustness in low-SNR scenarios. From Fig. 4b, we find that the proposed framework can adaptively sacrifice global information to obtain comparable semantic performance when the bandwidth compression ratio is low, while obtaining sufficient reconstruction quality in terms of PSNR as the bandwidth compression ratio increases. These results in both figures further verify the effectiveness and robustness of the proposed method in low-SNR scenarios.

We also provide a visual comparison of several methods on the Kodak dataset in Fig. 5, where the encoder and decoder are trained on the STL10 dataset, the SNR is set to 20dB, and the bandwidth compression ratio is 1/48. From this figure, we observe that the proposed method can effectively remove redundant background information while preserving the main semantic information, resulting in less image corruption in semantic regions (e.g., the macaws and rafters in these figures) compared to the competitive methods. Moreover, the proposed method achieves PSNR and multi-scale structural similarity (MS-SSIM) performance similar to DeepJSCC and DeepJSCC-ft, indicating the effectiveness of the proposed method in reconstructing semantic information. These results further demonstrate the superiority of the proposed method in achieving leading accuracy in downstream tasks over the compared methods.

V. CONCLUSION

In this paper, we proposed a CL-based semantic communication framework for wireless image transmission. The framework incorporates semantic contrastive coding and a two-stage training procedure to enhance the extraction of semantic information and to achieve a better trade-off between the reconstruction quality and the performance of the downstream task. We evaluated the effectiveness of the proposed method through simulations on the CIFAR-10 and Kodak datasets, and the results demonstrated the superiority of our approach over existing methods.

REFERENCES

[1] D. Gündüz, Z. Qin, I. E. Aguerri, H. S. Dhillon, Z. Yang, A. Yener, K. Wong, and C. Chae, "Beyond transmitting bits: Context, semantics, and task-oriented communications," IEEE J. Sel. Areas Commun., vol. 41, no. 1, pp. 5–41, 2023.
[2] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, "Deep joint source-channel coding for wireless image transmission," IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 3, pp. 567–579, 2019.
[3] D. B. Kurka and D. Gündüz, "DeepJSCC-f: Deep joint source-channel coding of images with feedback," IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 178–193, 2020.
[4] H. Zhang, S. Shao, M. Tao, X. Bi, and K. B. Letaief, "Deep learning-enabled semantic communication systems with task-unaware transmitter and dynamic data," IEEE J. Sel. Areas Commun., vol. 41, no. 1, pp. 170–185, 2023.
[5] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, "A simple framework for contrastive learning of visual representations," in Proc. Int. Conf. Mach. Learn. (ICML), vol. 119, 2020, pp. 1597–1607.
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 770–778.
[7] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 1874–1883.
