You are on page 1of 4

IEEE COMMUNICATIONS LETTERS, VOL. 24, NO.

7, JULY 2020 1459

Cooperative Spectrum Sensing Meets Machine Learning:


Deep Reinforcement Learning Approach
Rahil Sarikhani and Farshid Keynia

Abstract— Cognitive radio network (CRN) emerged to utilize the quality of the well-known procedures in the fusion
the frequency bands efficiently. To use the frequency bands center (FC). Recently, reinforcement learning (RL) and deep
efficiently without any interference on the licensed user, detection learning (DL) are combined to improve the accuracy of
of the frequency holes is the first step, which is called spectrum
sensing in the context. In order to increase the quality of local
machine learning [5], [6].
spectrum sensing results, cooperative spectrum sensing (CSS) is In this letter, we have utilized deep reinforcement learning
introduced in the literature to combine the local sensing results. (DRL) to improve the classification performance of CSS. The
Recently, machine learning techniques are designed to improve DL-based CSS needs the updating of the measurement from
the classification of the images and signals. Specifically, Deep all the SUs in the network. While it would be very resource-
Reinforcement Learning (DRL) is of interest for its substantial consuming and network flow consuming. In the proposed
improvement in the classification problems. In this letter, we have
proposed DRL based CSS algorithm, which is employed to approach, we have employed Reinforcement Learning (RL) to
decrease the signaling in the network of SUs. The simulation introduce the required SU to update its measurements. Hence,
results represent the superiority of the proposed approach to the proposed DRL can increase the CSS efficiently from a
state-of-the-art approaches, including Deep Cooperative Sensing resource and time viewpoint. Initially, the agent selects an
(DCS), K-out-of-N, and Support Vector Machine (SVM) based SU to share its local sensing based on some approaches, and
CSS algorithms.
then the selected SU is notified. Subsequently, the selected
Index Terms— Cognitive radio, cooperative spectrum sensing,
SU determines the presence of the PU based on local energy
deep cooperative sensing, deep reinforcement learning, machine
learning. detection and acquaint the agent from its local sensed energy
values. Then, the agent determines the presence or absence
I. I NTRODUCTION of the PU, globally. Finally, the Q-learning parameters are
updated based on the sensing results, and this approach is
C OGNITIVE radio networks (CRN) are exceptional cases
of networks that have been designed to utilize the
frequency bands dynamically and efficiently. Two crucial
continued to the convergence of the algorithm. Accordingly,
we have formulated the CSS problem as a Markov Decision
players of these networks are primary users (PU) who pay Problem (MDP) and introduced state and action spaces and
for the frequency bands and secondary users (SU), which strategy selection policy. Moreover, the reward function is
utilize the vacant bands dynamically. Hence, determining the introduced based on the correlation of cooperating users.
unoccupied bands is crucial for the SUs due to not interrupt Furthermore, we have defined a DL unit to be employed by the
the PUs. Although the spectrum sensing is essential in each proposed DRL algorithm in order to increase the reliability in
SU, the individual sensing results are highly disposed to the large scale problems. Utilizing the proposed DRL approach
fading channel conditions and other destructive effects [1]–[3]. decreases the network flow and the number of cooperat-
Consequently, to improve the sensing quality, Cooperative ing users, which yield in less resource-consuming approach
Spectrum Sensing (CSS) is introduced in the literature. CSS for CSS. Besides, the proposed approach represents a more
utilizes the results of some of cooperating SUs to enhance robust criterion because of omitting correlated measurements.
the quality of sensing and develop the accuracy. However, The remainder of the letter is as follows: After the system
the channel condition affects the optimal strategy of coop- model in Section II, the DRL based CSS algorithm is consid-
eration in CSS [1], [2]. ered in Section III. Subsequently, the simulation results and
Accordingly, some efficient approaches based on K-out-of-N concluding remarks are expressed in Section IV and V.
scheme have been introduced [1]. Moreover, multi-
dimensional correlation in individual sensing was considered II. S YSTEM M ODEL
for the CSS in [2]. Besides, machine learning-based We consider a CRN included NSU SU users who try to
approaches were introduced in [3] and [4], which improve accompany sensing the spectrum holes to transmit the in-queue
data. The positions of the SUs are randomly selected in each
Manuscript received February 4, 2020; revised March 5, 2020; accepted scenario. Moreover, some PUs are located randomly in the
March 19, 2020. Date of publication March 30, 2020; date of current version
July 10, 2020. The associate editor coordinating the review of this letter network and working in the licensed network. Furthermore,
and approving it for publication was M. Chafii. (Corresponding author: we consider a wideband channel to be sensed with NB bands
Rahil Sarikhani.) and the total bandwidth of W .
Rahil Sarikhani is with the Department of Computer Engineering,
Kerman Branch, Islamic Azad University, Kerman 7635131167, Iran (e-mail:
The PUs can transmit in NP bands arbitrarily, and none
rahil.sarikhani@iauk.ac.ir). of the NSU are aware of the utilized bands. The power of
Farshid Keynia is with the Department of Energy Management and Opti- transmitting is fixed at P . Additionally, the Additive White
mization, Institute of Science and High Technology and Environmental Gaussian Noise (AWGN) is assumed where the power spectral
Sciences, Graduate University of Advanced Technology, Kerman 7635131167,
Iran. density is N0 and zbi (n) is the noise of the i-th SU in the
Digital Object Identifier 10.1109/LCOMM.2020.2984430 band b at the time of n. Moreover, the fading and shadowing
1558-2558 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL. Downloaded on November 30,2022 at 14:20:54 UTC from IEEE Xplore. Restrictions apply.
1460 IEEE COMMUNICATIONS LETTERS, VOL. 24, NO. 7, JULY 2020

effect, together with the hidden PU, makes some SUs unable to states. Consequently, if we represent each state by si for
detect the presence of the PUs by local sensing. Hence, the SU i = 0, 1, . . . , NSU + 1, then state s0 is the start state and
broadcasts the request of cooperative sensing, and all the one- sNSU +1 is the stop state and others are according to the SUs
hop neighbors will respond to its appeal by their local sensing in the spectrum sensing network. Subsequently, the actions in
results, intermittently. The initiating SU, which requests for each state are selected from the set of actions as A, which can
cooperative sensing, is called the agent in this phase. The be determined by a0 , a1 , . . . , aA . In this set, a0 is according to
agent will combine the local sensing results and take the role the final action, which means the learning is finalized. Others
of Fusion Center (FC) in the network. are identical to the secondary users in the spectrum sensing
The FC, which is called the agent from now on in this network. It means when an agent selects the action ak for
letter, begins the DRL-based cooperative spectrum sensing k = 0, the secondary user nk is selected to transmit its data
and detects the presence or absence of the PU based on about the spectrum. Action selection in each agent is based
the surrounding one-hop neighbors. The number of one-hop on Boltzman distribution. In other words, the probability of
neighbors surrounding the agent is N . In each period of action ak is based on Boltzman distribution which can be
cooperative sensing, C < N nodes are selected based on the formulated as
reliance of the cooperation, which is estimated by the location- eQ(sk ,ak =i)/τk
based correlation metric ρji , and it would change based on the p(ak = j|sk ) = NSU (2)
Q(sk ,ai )/τk
conditions of the channel and location of SUs. These C nodes i=1 e
are informed step-by-step to update their local sensing results. for j = 1, . . . , NSU . In this equation, Q(sk , ak ) is the state-
To broadcast the fusion decision, the agent utilizes a reliable action value function, which determines the value of selecting
channel to transmit its cooperation consequence and inform action ak in state sk and ςk is the temperature control which
other one-hop nodes about the last result. Besides, the channel balances the exploration versus exploitation in each stage
is assumed to be unavailable if all of the available bands of the algorithm. In this algorithm, increasing the value of
are detected to be occupied by the PUs. The same as other the temperature will increase the independence to the large
references, such as [3], the birth-death process is assumed for value of Q(sk , ak ). Consequently, the agent will explore more
the simulation of the PU activity. opportunities. Since the notion of the algorithm should vary
Let yi,j be the received energy samples from the SU i from from exploration to exploitation, the value of the temperature
the j-th band as follows is chosen to be linearly decreased in each stage to improve
the exploitation versus exploration.
1 
N
yi,j = |xji (n)|2 (1) For the reward calculation, the correlation of the newly
NED n=1 selected SU is considered. Since each SU is located in different
places in the cognitive network, the sense of the PU will be
where xji (n) is the nth received sample in the jth band of correlated or uncorrelated between them. It can be enumerated
ith SU. Moreover, NED is the number of samples to be by the correlation coefficient which is calculated by location
employed for energy detection. The agent collects all the information of the SUs [7]. Based on this information, the cost
measurements in the matrix Y = {yi,j } ∈ RC×NB . After is calculated by the summation of the correlation coefficients
collecting all the measurements, the agent will determine the from the first step up to now. Accordingly, the correlation of
value function using DL select an appropriate action from the kth-step of cooperation could be calculated as
action space A. Then, it receives the local sensing values
and updates the measurement matrix. By running the NN, 
k−1
Ck = ρji (sm , am = j) (3)
it calculates the new Q values and selects the other node
m=0
(appropriate action) if needed. At the end of cooperation,
it will inform all the users from the cooperated result. As it is known, the best policy which is the goal of RL
algorithm is thepolicy π in which the discounted reward
III. D EEP R EINFORCEMENT L EARNING -BASED function Rt = t=0 γ rt for discount factor 0 < γ ≤ 1
t

C OOPERATIVE S PECTRUM S ENSING is maximized. Here, we utilized a neural network in the


context of deep Q-network to estimate the optimal action-value
The agent takes action from A in each episode of coop-
function (Q) as:
eration, and the environment will be affected by the selected
action. This influence will be replied by the environment uti- Q∗ (s, a) = max E(rt + γrt+1 + . . . |st = s, at = a, π) (4)
π
lizing two main phenomena called the state and reward. Then,
the reward will be used in the agent, and the value function This function represents the maximum of the expected dis-
will be determined using deep learning. In the following, all counted reward under policy π which takes action a in state s.
the parameters in the learning process will be determined. A simpler form of this equation is considered using Bellman
Equation.
 
A. Reinforcement Learning
Q∗ (s, a) = Es r + γ max
Q∗ (s , a )|s, a (5)
In this part, we will explain the sets of actions A, states S, a

and reward calculation methodology in order to determine For the known value Q (s , a ), where s and a are the next
∗  

the RL based wide-band spectrum sensing. The state set S states in the sequence of state action spaces. The optimal
is involved in S different states, which are every other node policy is the action a which maximizes the expected value
together with two additional states, including start and stop of r + γQ∗ (s , a ). A nonlinear function estimator is used

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL. Downloaded on November 30,2022 at 14:20:54 UTC from IEEE Xplore. Restrictions apply.
SARIKHANI AND KEYNIA: CSS MEETS MACHINE LEARNING: DRL APPROACH 1461

to approximate the action-value function in deep Q-network. Algorithm 1 DRL-Based Cooperative Spectrum Sensing
If the weights of the utilized neural network are assumed to 1: Input: A, N, Q
be θ, then Q(s, a; θ) ≈ Q∗ (s, a). In such networks, the training 2: Output: K  , P 
is based on the minimizing of the updated loss function Li (θi ). 3: for n = 1 to N do
To integrate the RL and DL, the data-set of the recently 4: for k = 0 to A + 1 do
experienced transition and experience replay, which represents 5: ak is selected according to Eq. (2).
the recent experiences, are critical. The loss function, which is 6: if ak = 0 then
used in each iteration i in the Q-learning update is as follows: 7: Request of channel sensing is sent to the SUak
 
Li (θi ) = E (r + γ max Q(s  
, a ; θ ) − Q(s, a; θ ))2
(6) 8: The energy of the sub-channels is sensed by the

i−1 i
a selected SU
where θi and θi−1 are the current and previous network 9: for i = 1 to 3 do
parameters. Accordingly, the gradient of the loss function 10: Convolutional Layer,
could be formulated as 11: ReLU,
 12: Max pooling Layer,
∇θi Li (θi ) = E (r + γ max

Q(s , a ; θi−1 ) − Q(s, a; θi )) 13: end for
a
×∇θi Q(s, a; θi )] (7) 14: Fully Connected,
15: ReLU,
16: Fully Connected,
B. Deep Learning
17: rk+1 (sk , ak ) = C1k based on Eq. (3)
In the agent, the Convolutional Neural Network (CNN) 18: the parameters of learning are updated,
model is utilized to combine individual sensing consequences 19: else
to calculate the action-value function for the DRL to determine 20: break,
the absence or presence of the PU. The proposed CNN for 21: end if
calculating the action-value function is composed of the con- 22: end for
volution part and fully connected (FC) part as two main parts. 23: u = CoopSenDec(ui , k, A)
The input is the matrix Y which is included of the energy 24: Broadcast the sensing result
samples from neighbor nodes in different bands. The Y matrix 25: end for
is entered into the convolution part, which composed of three
different sub-blocks called convolutional layer, rectifier linear
unit (ReLU), and max-pooling layer where connected in Indian mean that the agent decides the presence or absence of the PU
file. The details of each sub-block can be inferred from [5], based on the local results.
[6]. Specifically, ReLU chooses the maximum value among
the input and zero. Moreover, the max-pooling unit, which is IV. N UMERICAL R ESULTS
fed by a vector, selects the maximum entry of the vector. Three Here, we will evaluate the performance of the proposed
consecutive sub-blocks are included in each convolution layer. algorithm to the other machine learning based approaches. All
Subsequently, two FC sub-blocks are included. FC sub-blocks the simulations are done on MATLAB software. A single PU
are responsible for the gathering of features extracted from the is assumed which is located randomly among number of SUs
convolutional layers. To this end, these two sub-blocks mix the in an area limited by 500 meters in 2 dimensions. Furthermore,
extracted features utilizing the weighting process and biasing the number of frequency bands is set to NB = 160 where each
process. Furthermore, an extra ReLU is positioned between sub-band occupies 1 MHz of bandwidth. Moreover, the trans-
two FC sub-blocks to handle the non-linear classifications. The mitter power is assumed to be P = 30 dBm, and β = 3000,
details can be obtained from [5], [6]. α = 3.8, and σ = 7.9 dB [8], where β, α, and σ denote path
loss constant, path loss exponent, and the standard deviation
C. Deep Reinforcement Learning of the shadow fading, respectively. Besides, the temperature
The DRL-based cooperative spectrum sensing algorithm is control is initialized by 0.01 and increased exponentially in the
represented in Algorithm 1 step-by-step. In this algorithm, iterations. Furthermore, the training of the required coefficients
the number SUs, and all the states together with the actions are in DL is performed based on adaptive moment estimation
considered as input. The output would be the result of CSS. algorithm. Additionally, spatially correlated fading is assumed
In the agent, one of the available SUs is selected and informed for spatial correlation sensing which is modeled based on
dA−B
( )
to represent its local sensing result. The local sensing results ρA−B = e dref with dref = 50 m [9].
are gathered by measuring the available energy in different To evaluate the performance of the proposed algorithm,
subchannels and transmitted to the agent. After that, the algo- we have compared the proposed approach with other machine
rithm begins to run in the agent. To this end, the sensing learning-based cooperative spectrum sensing, including K-out-
matrix Y is gathered, and after three convolutional layers and of-N, SVM [3], and DL [4] approaches as the state-of-the-art
two subsequent FC layers, the result of global sensing through schemes. Furthermore, the comparison metrics are Region of
CNN is obtained. Then, in the agent, the action-value and Convergence (ROC), number of active SUs, and the sensing
rewards are updated according to the cost function and the error, which is the average of probability of false alarm (Pf )
result of local sensing. In the algorithm, by CoopSenDec we and the probability of the miss detection (Pm ).

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL. Downloaded on November 30,2022 at 14:20:54 UTC from IEEE Xplore. Restrictions apply.
1462 IEEE COMMUNICATIONS LETTERS, VOL. 24, NO. 7, JULY 2020

Fig. 1. The ROC of the proposed scheme for P = 30 dBm. Fig. 3. Sensing error vs. Tx power and number of samples.

proposed approach which will lower the information flow in


the network while the performance is not degraded.
Eventually, the sensing error is represented in Fig. 3, where
transmitting power and the number of samples are changed.
As depicted, the sensing error in both cases for the proposed
approach is more reliable than others. Specifically, two initial
results are the lowest sensing error and lowest fluctuations
in the sensing error by changing the number of samples and
transmission power.
V. C ONCLUDING R EMARKS
Fig. 2. Number of active users vs. number of cooperating SUs. In this letter, the cooperative spectrum sensing is considered
in CRN by utilizing the machine learning techniques, specif-
In Fig. 1 the ROC, which is the probability of detection ically DRL. Here, we have proposed a DRL based algorithm
(Pd ) in corresponding Pf of four mentioned schemes are to sense the spectrum holes, cooperatively. In this algorithm,
compared. Apparently, the proposed approach outperforms we have used RL to accurately choose the needed SUs to
others in a lower Pf , while in a higher ones, the result of SVM cooperate and a DL in each SU, which is chosen to sense
approach is comparable to the proposed approach. Of course, the presence or absence of the PU locally. The proposed
in lower Pf , the results of the proposed approach is compara- approach is very suitable from the computational complexity
ble with the result of the DL-based sensing method. Moreover, point since there is no need for the measurements of all
the area under the curve of these three schemes is calculated as the SUs to be updated in each time slot. The reinforcement
0.822, 0.943, 0.958, and 0.961 for K-out-of-N, SVM, DL, and part of the algorithm determines the required SU to share its
proposed schemes, respectively. Obviously, the area under the measurements and local sensing results.
curve of the schemes represents the highest sensing accuracy R EFERENCES
than conventional schemes. Further, the average number of
[1] A. Hajihoseini Gazestani and S. Ali Ghorashi, “Distributed diffusion-
active SUs in each sensing time slot was 10.401, which based spectrum sensing for cognitive radio sensor networks considering
represents that all the SUs did not update their local sensing. link failure,” IEEE Sensors J., vol. 18, no. 20, pp. 8617–8625, Oct. 2018.
In other words, in the proposed approach, the optimal users [2] A. W. Min and K. G. Shin, “An optimal sensing framework based on
spatial RSS-profile in cognitive radio networks,” in Proc. 6th Annu. IEEE
from correlation viewpoint are selected to update their own Commun. Soc. Conf. Sensor, Mesh Ad Hoc Commun. Netw., Rome, Italy,
local sensing. It will decrease the flow of data in the network Jun. 2009, pp. 1–9.
in updating local sensing results. Hence, the complexity of [3] K. M. Thilina, K. Won Choi, N. Saquib, and E. Hossain, “Machine
learning techniques for cooperative spectrum sensing in cognitive radio
the network will be less rather than the comparable DL-based networks,” IEEE J. Sel. Areas Commun., vol. 31, no. 11, pp. 2209–2221,
approach. Moreover, the presence of correlated users in Nov. 2013.
DL-based approach will increase the noise of measurements [4] W. Lee, M. Kim, and D.-H. Cho, “Deep cooperative sensing: Cooperative
and the number of neurons to be initialized, which will spectrum sensing based on convolutional neural networks,” IEEE Trans.
Veh. Technol., vol. 68, no. 3, pp. 3005–3009, Mar. 2019.
decrease the accuracy of sensing according to the ROC. [5] W. Lee, M. Kim, and D.-H. Cho, “Deep power control: Transmit power
Since the ROC of DL-based approach and DRL-based control scheme based on convolutional neural network,” IEEE Commun.
proposed approach are almost identical, we have defined Lett., vol. 22, no. 6, pp. 1276–1279, Jun. 2018.
[6] M. Kim, N.-I. Kim, W. Lee, and D.-H. Cho, “Deep learning-aided
active SUs which denotes the average number of cooperating SCMA,” IEEE Commun. Lett., vol. 22, no. 4, pp. 720–723, Apr. 2018.
users in the algorithm. In DL-based approach, all the SUs [7] X.-Y. Zhang, K. Zhang, X. Yun, S. Wang, X. Bao, and Q. Yuan,
are cooperating to sense the unoccupied bands, while in the “Location-based correlation estimation in social network via collaborative
learning,” in Proc. IEEE Conf. Comput. Commun. Workshops (INFOCOM
proposed DRL-based approach the active cooperating SUs are WKSHPS), San Francisco, CA, USA, Apr. 2016, pp. 1073–1074.
defined based on the correlation reward. This metric which [8] Channel Models for IEEE 802.20 MBWA System Simulations,
is depicted in Fig. 2, will show that in the network the document IEEE C802.20-03/70, Jan. 2007.
[9] A. Algans, K. I. Pedersen, and P. E. Mogensen, “Experimental analysis
information exchange will be decreased while the ROC is of the joint statistical properties of azimuth spread, delay spread, and
identical and the performance is not affected. This decrease shadow fading,” IEEE J. Sel. Areas Commun., vol. 20, no. 3, pp. 523–531,
in the number of active users is the main advantage of the Apr. 2002.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL. Downloaded on November 30,2022 at 14:20:54 UTC from IEEE Xplore. Restrictions apply.

You might also like