
2074 IEEE INTERNET OF THINGS JOURNAL, VOL. 6, NO. 2, APRIL 2019

Caching Transient Data for Internet of Things: A Deep Reinforcement Learning Approach

Hao Zhu, Yang Cao, Member, IEEE, Xiao Wei, Wei Wang, Member, IEEE, Tao Jiang, Fellow, IEEE, and Shi Jin, Senior Member, IEEE

Abstract—Connected devices in Internet-of-Things (IoT) continuously generate enormous amounts of data, which are transient and would be requested by IoT application users, such as autonomous vehicles. Transmitting IoT data through wireless networks would lead to congestion and long delays, which can be tackled by caching IoT data at the network edge. However, it is challenging to jointly consider IoT data transiency and dynamic context characteristics. In this paper, we advocate the use of deep reinforcement learning (DRL) to solve the problem of caching IoT data at the edge without knowing future IoT data popularity, user request patterns, or other context characteristics. By defining data freshness metrics, the aim of determining the IoT data caching policy is to strike a balance between the communication cost and the loss of data freshness. Extensive simulation results corroborate that the proposed DRL-based IoT data caching policy outperforms other baseline policies.

Index Terms—Caching, data-transiency, deep reinforcement learning (DRL), Internet-of-Things (IoT).

I. INTRODUCTION

WITH the rapid development of Internet-of-Things (IoT) in areas including intelligent transportation, smart grid, industrial and home automation, e-Health, and so on, the number of connected devices (e.g., tags, sensors, embedded devices, and hand-held devices) keeps increasing. Nowadays, there are already 6 to 9 billion connected devices, and the number of IoT devices in 2020 is expected to reach 24 billion [1]. IoT devices generate unprecedented amounts of data, which will be delivered to applications, where the information is processed and analyzed to provide services. These traffic flows create great challenges for today's storage systems and communication networks. For example, wireless networks will suffer from congested wireless backhaul and spectrum overcrowding problems [2], [3]. Meanwhile, transferring the tremendous IoT data is expensive, consuming an enormous amount of bandwidth, energy, and time [4].

Edge caching is a promising approach to avoid unnecessary end-to-end communications by utilizing the idle storage resources of edge nodes. By caching popular data files at edge nodes, requesting end-points can obtain these files from the edge without explicitly communicating with data sources. Thus, redundant network traffic can be reduced, and requests for cached files can be answered more quickly with a better quality of service (QoS) or quality of experience (QoE) [5]. There have been many existing works on edge caching [6]–[10]. Typically, edge nodes decide what to cache according to the estimated file popularity. For example, in [6], the caching decision was made by caching the most popular files greedily until no storage space remained, after the file popularity was predicted through collaborative filtering by exploiting user–file correlations. In [7], the problem of file placement in a small base station (SBS) was solved through the knapsack approach, where the file popularity was estimated based on the rate of received requests. In [8], the estimation of the file popularity was incorporated into the file caching process through the multiarmed bandit approach. The aforementioned studies mainly explore caching policies for in-transient data (e.g., multimedia files), which never expire at edge nodes. However, these typical edge caching techniques cannot be directly applied to the case of IoT data, because IoT data are transient [11]. IoT data files expire in a certain time period after they are generated at source locations. In other words, an IoT data file has a lifetime, during which it is useful. When the IoT data file expires, it becomes useless and must be discarded. This is different from in-transient data, which never expire.

Caching transient IoT data at edge nodes still has the potential to benefit network traffic control and QoS or QoE guarantees [12]. For example, in monitoring applications, such as urban environment monitoring [13], [14] and vehicular traffic monitoring [15], IoT data are collected at specific locations. An update on the conditions of the monitored objects, such as the air quality index at a specific location, is popular among many location-based services that deliver local information to end-users. If popular IoT data can be temporarily cached at edge nodes, requests for these data do not have to be answered all the way by IoT data sources, and in-network traffic is thus reduced. Moreover, local data retrieval rather than remote server retrieval could allow faster response to

Manuscript received May 15, 2018; revised August 15, 2018 and October 23, 2018; accepted November 9, 2018. Date of publication November 21, 2018; date of current version May 8, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant 61729101, Grant 61601193, Grant 61720106001, Grant 61871441, and Grant 91738202 and in part by the Major Program of National Natural Science Foundation of Hubei in China under Grant 2016CFA009. (Corresponding author: Yang Cao.)
H. Zhu, Y. Cao, X. Wei, W. Wang, and T. Jiang are with the Wuhan National Laboratory for Optoelectronics, School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: zhuhao@hust.edu.cn; ycao@hust.edu.cn; weixiao1991@hust.edu.cn; weiwangw@hust.edu.cn; tao.jiang@ieee.org).
S. Jin is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: jinshi@seu.edu.cn).
Digital Object Identifier 10.1109/JIOT.2018.2882583
2327-4662 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: COMSATS INSTITUTE OF INFORMATION TECHNOLOGY. Downloaded on August 02,2020 at 07:46:51 UTC from IEEE Xplore. Restrictions apply.

requests. It means that rigorous low-delay information can be provided for delay-sensitive applications through edge caching. Therefore, caching transient IoT data at the edge is an important way to improve the efficiency and QoS of IoT.

The cache placement policy at edge nodes remains to be deliberately designed to meet the various needs of IoT services, such as mobility and geo-distribution support. On the one hand, with mobile and IoT traffic becoming the trend, applications related to mobility and geo-distribution (like pipeline monitoring and sensor networks) are in urgent need of obtaining monitoring, measurement, and automation data [16]. The popularity distribution of these data may change with time and location, and popularity prediction schemes should be designed accordingly. On the other hand, the requirements of IoT applications on data freshness also need to be taken into consideration when designing caching policies for IoT. Some applications are delay-tolerant [17], while others may be delay-sensitive [18]. For example, in the context of caching data for e-Health, which is delay-sensitive, high data freshness is needed to support a quick response to sudden changes in a patient's blood pressure and heart rate [19].

There are few works about caching transient IoT data. Vural et al. [20], [21] considered data transiency when designing caching policies for Internet content routers. They proposed an analytical model, in which the lifetime of a data item and the rate of received requests are variables, to capture the tradeoff between multihop communication costs and data freshness. Zhang et al. [22] proposed a cooperative in-network caching scheme for information-centric networking IoT, based on the least recently used (LRU) replacement policy. They took the lifetime of IoT data items into consideration by setting a threshold for each node to decide whether to cache the data or not. However, although the above works take transient data caching into consideration, the caching policies used in them are static. Besides, these works construct caching policies based on explicit assumptions about the IoT environment, such as that the popularity distribution of data is given or that requests for data obey the Poisson distribution. Because request rates at the edge of IoT are time-variable, it is difficult to accurately obtain the popularity of data.

To the best of our knowledge, most existing works on edge caching and IoT caching mainly focus on caching in-transient data and/or are based on assumptions such as that the popularity distribution of data is given and that user requests obey a specific distribution (e.g., the Poisson distribution). Different from these works, this paper investigates caching transient data at the edge in IoT and presents an efficient caching policy, which can make intelligent caching decisions without assuming the popularity distribution of data or the request distribution of users. The distinct features of this paper are listed as follows.
1) We propose an edge caching-based IoT system framework for caching and transmitting transient IoT data, where "data freshness" and "edge caching" are both taken into consideration. A cost function, which makes a tradeoff between data freshness and communication cost when fetching IoT data, is proposed. To minimize the long-term cost of fetching IoT data, we formulate the cache replacement problem as a Markov decision process (MDP) problem.
2) We solve the formulated problem based on a novel deep reinforcement learning (DRL) approach. A DRL-based caching policy is proposed, which is able to intelligently perceive the environment and then automatically learn the caching policy according to historical and current raw observations of the environment, without any explicit assumptions about the operating environment.
3) Extensive simulation results demonstrate that the proposed DRL-based caching policy outperforms two baseline caching policies in striking a balance between the communication cost and the loss of data freshness. For different network setup parameters, the long-term cost of fetching IoT data can be decreased.

The remainder of this paper is organized as follows. The system model is provided in the following section. In Section III, we present the analytical formulation of the cache replacement problem. In Section IV, we discuss the DRL-based edge caching policy, which is the solution of the formulated MDP problem. Section V evaluates the performance of the proposed DRL-based edge caching policy. Finally, conclusions are drawn in Section VI.

II. SYSTEM MODEL

In this section, we introduce the system model with some basic notations. This paper concentrates on a single edge node covering a set of data producers in an edge caching-based IoT system, as illustrated in Fig. 1. There are three main components in the considered scenario, namely the edge node, data producers, and data consumers.
1) Edge Node: This can be a static network facility (e.g., a gateway or an SBS) fixed at the edge of the network, which covers a set of IoT data producers. The edge node acts as a relay between data producers and data consumers. In other words, requests generated by data consumers are gathered at the edge node and then forwarded to data producers. Meanwhile, the data generated by data producers are gathered at the edge node and then transmitted to data consumers. In addition, the edge node is empowered with the capability of edge caching. This means that the edge node is able to cache data coming from data producers and answer a request directly if the requested data has been cached. More details on the caching mechanism are described in Section II-B.
2) Data Producer: This can be an IoT sensor fixed in the coverage of the edge node. Each data producer may generate data values for several contents. Each content is uniquely named with a static content identifier, referred to as a CID. A specific reading value for a content at a time instant is referred to as a data item. Each data item has a lifetime during which it is valid. When a data item expires, it becomes invalid and should not be treated as an answer to a request. A data producer generates a data item only when it receives a request for the corresponding content. More details on data items are described in Section II-A.


Fig. 1. Edge caching-based IoT system.

Fig. 2. Data item freshness. (a) Fresh. (b) Nonfresh.

3) Data Consumer: This can be an IoT application instance executed on static devices (e.g., desktop computers) or mobile devices (e.g., smart phones), which requests the data generated by data producers for analysis or processing. For example, users would like to view environment monitoring data with applications on their smart phones. Each request message carries a CID field which indicates what content is requested. Within the coverage of the edge node, data consumers are assumed to be able to establish good communications with the edge node since they usually move at limited rates. Thus, we assume that a data consumer is able to fetch the requested data item before it moves out of the coverage of the edge node. Note that in the scenario of this paper, we do not make explicit assumptions on the arrival model for data consumers or the generation pattern of user requests.

The way of fetching a data item when it is not cached in the edge node is illustrated as the green line in Fig. 1. Upon receiving a request message (i.e., the first step of the green line), the edge node forwards this request to the data producer (i.e., the second step of the green line), and the data producer sends its data through the edge node to the data consumer (i.e., the third and fourth steps of the green line). In the edge caching-based IoT scenario, the edge node possesses caching capability for storing data items. Then, besides fetching the data item from the data producer, the data consumer can fetch the data item from the edge node if a valid data item associated with the requested content is stored there. The latter way of fetching a data item is illustrated as the blue line in Fig. 1.

A. Contents and Data Items

Different data items associated with a common content share the same CID. Besides the CID field, each data item contains two other fields: 1) the timestamp field and 2) the lifetime field. The timestamp field indicates when the data item was generated at the data producer, and the lifetime field indicates the duration for which the value carried in the item is valid after the item is generated. Assume that all data items of all contents have the same size. We denote a data item by d, the time of generating d at the data producer by t_gen(d), and the lifetime of d by T_life(d). At time t, the age of data item d can be denoted as t_age(d) = t − t_gen(d). As shown in Fig. 2, we say that data item d is fresh if t_age(d) < T_life(d), and it is nonfresh if t_age(d) ≥ T_life(d).

B. Data Item Caching

Denote each request from IoT applications by k, and the requested content by f_k. The time when request message k arrives at the edge node is denoted by t_k. The set of data items cached in the edge node at time t_k is denoted by D_k = {d_k^1, d_k^2, ..., d_k^I}, and the set of contents associated with these cached data items is denoted by F_k = {f_k^1, f_k^2, ..., f_k^I}, where I is the maximum number of items, determined by the storage capacity. Note that at any given time, the edge node caches only one of the different data items associated with the same content. The cached data item d_k^i (1 ≤ i ≤ I) is associated with content f_k^i. We use a mapping function to denote the association between a cached data item and its associated content, that is, d_k^i = p(f_k^i).

Denote the data item returned for request k by d_k. The detailed process of fetching d_k is as follows. Upon receiving request k, the edge node checks whether a fresh data item associated with f_k exists in the cache. There are three cases.

Case 1: f_k ∈ F_k and t_age(p(f_k)) < T_life(p(f_k)). That is, the requested content f_k belongs to the set of cached contents F_k, and the cached data item p(f_k) associated with f_k is fresh. In this case, the edge node returns this cached data item p(f_k) as d_k to the data consumer directly.

Case 2: f_k ∈ F_k and t_age(p(f_k)) ≥ T_life(p(f_k)). That is, there is a nonfresh data item of f_k in the cache of the edge node. In this case, the edge node fetches a new data item from the data producer as d_k and then returns this new data item to the data consumer.

Case 3: f_k ∉ F_k, i.e., there is no data item associated with f_k in the cache of the edge node. In this case, the edge node also fetches a new data item from the data producer as d_k and then returns d_k to the data consumer.

In a word, d_k can be expressed as

d_k = { p(f_k), if f_k ∈ F_k and t_age(p(f_k)) < T_life(p(f_k)); new item, otherwise.
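The freshness test and the three-case lookup above can be sketched in code as follows. This is an illustrative sketch only, not the authors' implementation; the class names `DataItem` and `EdgeCache` and the callback-style `producer` are hypothetical, and the replacement decision in case 3 is deliberately left to an external caching policy, as in the system model.

```python
class DataItem:
    """A transient IoT data item with a CID, timestamp, and lifetime field."""
    def __init__(self, cid, t_gen, lifetime):
        self.cid = cid            # content identifier (CID field)
        self.t_gen = t_gen        # generation timestamp at the producer
        self.lifetime = lifetime  # validity duration, T_life(d)

    def age(self, t):
        return t - self.t_gen     # t_age(d) = t - t_gen(d)

    def is_fresh(self, t):
        return self.age(t) < self.lifetime  # fresh iff t_age(d) < T_life(d)


class EdgeCache:
    """Edge-node cache holding at most one data item per content (up to I items)."""
    def __init__(self, capacity):
        self.capacity = capacity  # cache size I
        self.items = {}           # maps content f -> cached item p(f)

    def fetch(self, cid, t, producer):
        """Return (data item d_k, case number) for a request arriving at time t."""
        cached = self.items.get(cid)
        if cached is not None and cached.is_fresh(t):
            return cached, 1      # case 1: fresh cached item answers the request
        new_item = producer(cid, t)
        if cached is not None:
            self.items[cid] = new_item  # case 2: replace the nonfresh item
            return new_item, 2
        return new_item, 3        # case 3: whether/what to replace is up to the policy


# usage: a producer that generates an item with lifetime 5 on demand
producer = lambda cid, t: DataItem(cid, t, lifetime=5.0)
cache = EdgeCache(capacity=4)
item, case = cache.fetch("air-quality/42", t=0.0, producer=producer)  # case 3
cache.items["air-quality/42"] = item           # suppose the policy caches it
_, case2 = cache.fetch("air-quality/42", t=3.0, producer=producer)    # case 1
```

A request at t = 6.0 on the same content would then hit case 2, since the cached item's age (6.0) is no longer smaller than its lifetime (5.0).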


The edge node updates its cached data items in a reactive replacement way: it decides whether to cache a new data item when that data item arrives. If a caching decision has been made for the new data item, another already cached data item has to be removed from the cache due to the limited cache space. We denote the caching action taken by the edge node for d_k as a_k, and the action space as A = {a_0, a_1, ..., a_I}. a_k = a_0 means that the cached data items remain unchanged. This occurs when the request is directly answered by a cached data item in the edge node, or when d_k is a new data item fetched from the data producer but is not cached. a_k = a_i (1 ≤ i ≤ I) means that d_k is a new data item and is cached in the edge node by replacing d_k^i. According to the different cases of fetching data items, the selection of a_k is described as follows. In case 1, no new data item arrives, so a_k = a_0. In case 2, the new data item d_k replaces the nonfresh cached data item p(f_k), which is also associated with content f_k. In case 3, the cache replacement decision is made according to a caching policy π(a_k), which selects an action from A. Note that D_k and F_k will change to D_{k+1} and F_{k+1} according to the choice of caching action a_k.

C. Cost Function and Utility Function

Denote the communication cost for fetching a cached data item from the edge node by c_1, and the communication cost for fetching a new data item from the data producer by c_2. Without loss of generality, we assume that c_1 < c_2. This can be explained by the fact that both latency and energy consumption are reduced when fetching data directly from the edge node rather than from the data producer. The communication cost of fetching d_k can be represented as

c(d_k) = { c_1, if f_k ∈ F_k and t_age(p(f_k)) < T_life(p(f_k)); c_2, otherwise.  (1)

Besides the communication cost, we also consider the freshness loss cost. Each data item is considered to have a certain level of data freshness. The freshness of a transient data item d can be defined as g(d) = (T_life(d) − t_age(d))/T_life(d). Items with a nonpositive freshness value (when the data age is not smaller than the lifetime) will not be fetched by data consumers. Due to edge caching, cached items have finite nonzero data ages when fetched by data consumers. This reduces the residual lifetime of the data item, during which it can be considered to have some level of freshness. We define the freshness loss of a data item d as t_age(d)/T_life(d). Then, the freshness loss cost of d_k can be represented as

l(d_k) = { t_age(p(f_k))/T_life(p(f_k)), if f_k ∈ F_k and t_age(p(f_k)) < T_life(p(f_k)); 0, otherwise.  (2)

The freshness loss cost of fetching a new data item (with zero data age) from the data producer is zero.

Fetching a transient data item from the edge caching-based IoT system involves a tradeoff between the communication cost and the freshness loss cost. If data item d_k is fetched from the data producer, the freshness loss cost is minimal, i.e., l(d_k) = 0, while the communication cost is maximal, i.e., c(d_k) = c_2. If the data item is fetched from the edge node, the freshness loss cost is positive, i.e., l(d_k) > 0, while the communication cost is minimal, i.e., c(d_k) = c_1. In order to strike a balance between these two contradicting objectives, the following cost function is defined:

C(d_k) = α · c(d_k) + (1 − α) · l(d_k)  (3)

where α ∈ [0, 1] is a coefficient weighting the relative importance of the communication cost. A higher α means a higher communication cost and indicates that an IoT application does not prefer frequent data retrieval from the data producer.

Note that minimizing the cost function C(d_k) is equivalent to maximizing U(d_k) = B − C(d_k) when B is a constant. For the sake of simplicity, we let B be big enough to guarantee that U(d_k) is positive in all cases, and we refer to U(d_k) as the utility of fetching data item d_k. Without loss of generality, we let B be equal to the constant value α · c_2 + (1 − α); then

U(d_k) = α · (c_2 − c(d_k)) + (1 − α) · g(d_k).  (4)

III. PROBLEM FORMULATION

This paper aims at finding a policy, which selects an appropriate caching action for each data item, to minimize the long-term cost of fetching transient data items, or equivalently to maximize the long-term utility of fetching transient data items. Note that only the selection of actions in case 3 can be optimized (because caching actions in cases 1 and 2 are determined directly by the system rule). In the remainder of this paper, we focus only on requests belonging to case 3, which are indexed by n. We define time step n as the period between the moment when data item d_n arrives at the edge node and the moment when data item d_{n+1} arrives at the edge node.

We formulate the cache replacement problem as an MDP problem. The MDP model can be defined by the tuple {S, A, M(s_{n+1}|s_n, a_n), R(s_n, a_n)}.
1) S is the set of states of the edge caching-based IoT system. We define s_n as the state at time step n. It can be represented by values of information about cached/arrived data items, or other network conditions that are directly or indirectly related to the performance of edge caching. Details on the expression of s_n considered in this paper are introduced in Section IV.
2) A is the set of caching actions. The action selected by the edge node at time step n is denoted by a_n.
3) M(s_{n+1}|s_n, a_n) is the state transition probability that maps a state-action pair at time step n onto a distribution of states at time step n + 1.
4) R(s_n, a_n) is the immediate/instantaneous reward function that determines the reward fed back to the edge node when performing action a_n in state s_n. During time step n, several requests belonging to cases 1 and 2 may arrive at the edge node. Besides, request n + 1, which belongs to case 3, also arrives during time step n, since the edge node receives request n + 1 before d_{n+1} arrives at the edge node. We define r_n
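Equations (1)–(4) can be sketched as follows; this is a minimal illustrative sketch, with hypothetical function names and example values c_1 = 1, c_2 = 5, and it also checks the identity U(d_k) = B − C(d_k) with B = α·c_2 + (1 − α).

```python
def fetch_costs(hit_fresh, t_age, t_life, c1=1.0, c2=5.0):
    """Communication cost c(d_k), eq. (1), and freshness loss cost l(d_k), eq. (2)."""
    if hit_fresh:              # case 1: answered by a fresh cached item
        return c1, t_age / t_life
    return c2, 0.0             # cases 2/3: new item fetched from the producer

def cost(hit_fresh, t_age, t_life, alpha, c1=1.0, c2=5.0):
    """C(d_k) = alpha * c(d_k) + (1 - alpha) * l(d_k), eq. (3)."""
    c, l = fetch_costs(hit_fresh, t_age, t_life, c1, c2)
    return alpha * c + (1 - alpha) * l

def utility(hit_fresh, t_age, t_life, alpha, c1=1.0, c2=5.0):
    """U(d_k) = alpha * (c2 - c(d_k)) + (1 - alpha) * g(d_k), eq. (4),
    using g(d_k) = 1 - l(d_k) = (T_life - t_age) / T_life for a cache hit."""
    c, l = fetch_costs(hit_fresh, t_age, t_life, c1, c2)
    return alpha * (c2 - c) + (1 - alpha) * (1.0 - l)

# sanity check: U(d_k) = B - C(d_k) with B = alpha * c2 + (1 - alpha)
alpha = 0.6
B = alpha * 5.0 + (1 - alpha)
for hit in (True, False):
    u = utility(hit, t_age=2.0, t_life=5.0, alpha=alpha)
    assert abs(u - (B - cost(hit, 2.0, 5.0, alpha))) < 1e-12
```

For example, a cache hit on an item aged 2 out of a lifetime of 5 with α = 0.6 gives C = 0.6·1 + 0.4·0.4 = 0.76, while a producer fetch gives C = 0.6·5 = 3.0, illustrating the communication/freshness tradeoff that α controls.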


as the sum utility of all data items that are requested during time step n.

We define the caching policy π = π(a_n|s_n) as a mapping from state s_n to a probability of choosing action a_n. In the MDP framework, the process at the edge node in time step n can be described as follows.
1) At the beginning of time step n, the edge node observes the system and obtains its current state s_n ∈ S.
2) According to the caching policy π, the edge node takes action a_n after s_n is observed. Note that caching policy π indicates how to select caching actions in case 3; caching actions in cases 1 and 2 are determined by the system rule rather than by policy π.
3) After action a_n is taken, the IoT system obtains a reward r_n and transitions to a new state s_{n+1} according to the environment dynamics R(s_n, a_n) and M(s_{n+1}|s_n, a_n).
4) The reward is fed back to the edge node and the process is repeated.

The accumulated reward is defined as the return R_n = Σ_{m=0}^{∞} γ^m r_{n+m}, with a discount factor γ ∈ (0, 1]. γ determines the effect of future rewards on current caching decisions; a lower value of γ places more emphasis on immediate rewards. Our aim is to find the optimal caching policy π*, which achieves the maximum expected return from all states:

π* = arg max_π E[R_n | π].  (5)

To measure how good π is, value functions are defined for policy π. The state-value function is defined as V^π(s) = E_π[R_n | s_n = s], which represents the expected return for following policy π from state s. The optimal state value for state s is V*(s) = max_π V^π(s). The action-value function is defined as Q^π(s, a) = E_π[R_n | s_n = s, a_n = a], which represents the expected return for deterministically selecting initial action a in state s and then following π. The optimal action value for state s and action a is Q*(s, a) = max_π Q^π(s, a). Note that V*(s) = max_a Q*(s, a). If V*(s) or Q*(s, a) is available, we could obtain the optimal policy by choosing among all actions available at state s_n and greedily picking the action a that maximizes E_{s_{n+1} ∼ M(s_{n+1}|s_n, a)}[V*(s_{n+1})] or Q*(s_n, a).

According to the Markov property, V^π(s) can be decomposed into the Bellman equation

V^π(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [r + γ V^π(s')]

where p(s', r | s, a) denotes the probability that the system obtains reward r and transitions to state s' if action a is performed in state s. The distribution p(s', r | s, a) is decided by the environment dynamics R and M. Q^π(s, a) can also be decomposed into the Bellman equation

Q^π(s, a) = Σ_{s', r} p(s', r | s, a) [r + γ Σ_{a'} π(a'|s') Q^π(s', a')].

If R and M are available, we can use dynamic programming methods to obtain the optimal policy. Specifically, we can use policy evaluation to calculate the state/action value function of a policy, e.g., Q^π(s, a). Meanwhile, we can use value iteration and policy iteration to improve the policy until an optimal policy is found. Given Q^π(s, a), the best policy can be found by greedily selecting a at every state: arg max_a Q^π(s, a). This kind of method, which relies on given models of M and R, is called a model-based method.

When there are no models of M and R, we resort to model-free reinforcement learning (RL) methods. RL agents learn value functions or policies from experience. Two kinds of RL methods have been proposed in the literature: 1) methods based on value functions and 2) methods based on policy search. The key to value-function-based methods is to properly and efficiently estimate the value function. With the improvement of the estimate, the policy can naturally be improved by greedily picking actions according to the updated value function. Policy-search-based methods, in contrast, search directly for an optimal policy by parameterizing the policy as π_θ and optimizing the parameters θ with the aim of maximizing the expected return E_{π_θ}[R]. Gradient-based or gradient-free optimization is usually adopted to find the optimal parameters θ. In addition, to harvest the benefits of both value functions and policy search, hybrid actor–critic approaches have been proposed. The "actor" (policy) learns by using feedback from the "critic" (value function), i.e., the value function acts as a baseline for policy gradients.

RL agents need to derive appropriate representations of the environment, enabling themselves to generalize past experience to new situations with fairly low complexity. Thus, traditional RL is inherently limited to domains with fully observed, low-dimensional state spaces. However, the operating environment of the edge caching-based IoT system is complex and dynamic, and it is hard to manually extract all useful features of the environment as low-dimensional state spaces. The rise of DRL has made it possible for agents to be trained directly on raw high-dimensional observations, rather than on handcrafted features or low-dimensional state spaces. In the next section, we adopt a DRL algorithm to automatically find the optimal policy, without using any explicit prior knowledge about the system.

IV. DRL-BASED CACHING POLICY

Neural networks (NNs) have recently been applied successfully to solve large-scale RL problems [23], [24], by utilizing the advantages of deep NNs in automatically learning low-dimensional feature representations. NNs do not need handcrafted features and can be trained directly with raw high-dimensional observation signals. DRL generally employs deep NNs to approximate the policy and/or the value function V or Q. Fig. 3 summarizes how DRL can be applied to the edge caching-based IoT system. As shown, the caching agent obtains several raw signals by observing the state of the environment. These signals can be user requests, context information, and network conditions. The deep NN is fed with these signals and outputs the value function or the policy. According to the output, the agent selects a caching action and observes the reward of performing that action. Then, the agent can train and improve the deep NN model with the reward.
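The Bellman backup and the greedy policy arg max_a Q(s, a) discussed above can be illustrated on a toy tabular MDP. This sketch is purely illustrative: the two-state dynamics and rewards are invented for the example and are unrelated to the paper's caching MDP, which has no known model M or R.

```python
# Toy deterministic MDP: two states, two actions, discount factor 0.9.
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9

def dynamics(s, a):
    """Invented toy model: returns (next_state, reward)."""
    if a == 1:                  # action 1 moves toward (or stays in) state 1
        return 1, (1.0 if s == 1 else 0.0)
    return 0, 0.1               # action 0 falls back to state 0 for a small reward

def value_iteration(n_iters=200):
    """Repeated Bellman optimality backup: Q(s,a) = r + gamma * max_a' Q(s',a')."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_Q = {}
        for s in STATES:
            for a in ACTIONS:
                s2, r = dynamics(s, a)
                new_Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q = new_Q
    return Q

Q = value_iteration()
# Greedy policy: pick arg max_a Q(s, a) at every state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
```

Here value iteration converges to Q(1, 1) = 1/(1 − γ) = 10, and the greedy policy chooses action 1 in both states. This model-based computation is exactly what becomes infeasible in the caching problem, where M and R are unknown and a model-free, DRL-based method is used instead.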


The proposed caching policy is designed based on A3C [25], which is a state-of-the-art actor–critic method. As shown in Fig. 4, the proposed policy involves training two NNs, i.e., the actor network and the critic network. In A3C, advantage updates are combined with the actor and critic networks, which are trained and updated in an asynchronously parallel style. The detailed functionalities of these networks are explained below.

Fig. 3. Applying DRL to caching IoT data.

Inputs: After d_n is fetched, the caching agent takes state inputs s_n = (x̂_n^0, x̂_n^1, …, x̂_n^I, y_n^0, y_n^1, …, y_n^I, z_n^0, z_n^1, …, z_n^I) to its NNs. The superscripts 0, 1, …, I indicate the currently requested data item (0) and the cached data items (ranging from 1 to the cache size I). x̂_n^i = (x_n^i[1], x_n^i[2], …, x_n^i[J]) is a vector which represents the number of requests for content f_n^i within the past J groups of requests, where each group consists of G requests. For instance, x_n^i[j] (1 ≤ j ≤ J) is the number of requests for content f_n^i within the jth past group of requests. As for y_n^i and z_n^i, they denote the lifetime and freshness of d_n^i, respectively. Even though only the request information of data items is utilized to extract features in this paper, other raw observations on user contexts and network conditions can also be included in the state inputs.

Policy: Upon obtaining s_n, the caching agent selects an action a_n based on policy π(a_n|s_n), which is the probability of selecting action a_n in state s_n. Since the lifetime and freshness of data items are continuous real numbers, there are intractably many {state, action} pairs. To deal with the curse of dimensionality brought by raw observations, we use an NN to represent the caching policy with a manageable number of adjustable parameters, θ. The parameterized policy is denoted as π(a_n|s_n; θ) (or π_θ for simplicity), which is represented as the actor network in Fig. 4. The design of the specific architecture of the NN is presented in Section V.

State-Value Function: We use an NN, depicted as the critic network in Fig. 4, to estimate V^{π_θ}(s). The parameters of the critic network are denoted as θ_v, and the estimate of the state-value function is denoted as V^{π_θ}(s; θ_v).

Fig. 4. Actor–critic method used by DRL-based caching policy.

Policy Gradient Training: After each action is performed, the environment provides the caching agent with a reward r_n, based on which the NN parameters can be trained and updated. In this paper, we utilize a policy gradient method to train the policy in the actor–critic algorithm [25], where the key lies in estimating the gradient of the expected total reward by observing the trajectories of executions. The gradient of the expected return with respect to the actor network parameters, θ, can be computed as

∇_θ E_{π_θ}[R_n] = E_{π_θ}[ ∇_θ log π(a_n|s_n; θ) A^{π_θ}(s_n, a_n; θ_v) ].

Then, the update of the actor network parameters θ is

θ ← θ + λ Σ_n ∇_θ log π(a_n|s_n; θ) A^{π_θ}(s_n, a_n; θ_v)    (6)

where λ is the learning rate. The advantage A^{π_θ}(s_n, a_n) can be estimated based on the critic network, where A^{π_θ}(s_n, a_n) = r_n + γ V^{π_θ}(s_{n+1}; θ_v) − V^{π_θ}(s_n; θ_v).

We adopt the temporal difference method to train the critic network parameters. Specifically, we set the loss function as the mean square error of the difference between the state values V(s_n; θ_v) and the target values r_n + γ V(s_{n+1}; θ_v) [26]. The update of the critic network parameters θ_v can be represented as

θ_v ← θ_v − λ_v Σ_n ∇_{θ_v} ( r_n + γ V^{π_θ}(s_{n+1}; θ_v) − V^{π_θ}(s_n; θ_v) )²    (7)

where λ_v is the learning rate.

In the process of RL training, the tradeoff between exploration and exploitation is important for the agent to converge to a good policy. Exploitation means taking the perceived optimal action, while exploration tries to cover the action space adequately by taking perceived nonoptimal actions. An entropy regularization term can be added into (6) for trading off exploration and exploitation [25]. Specifically, (6) can be modified as

θ ← θ + λ Σ_n [ ∇_θ log π(a_n|s_n; θ) A^{π_θ}(s_n, a_n; θ_v) + β ∇_θ H(π(·|s_n; θ)) ]    (8)


where H(π(·|s_n; θ)) is the entropy of the policy (the probability distribution over actions) at each time step. This term pushes θ in the direction of higher entropy, thus encouraging exploration. A larger value of the weighting factor β means that more exploration is encouraged.

There are two ways to train the proposed caching policy, i.e., online and offline. In the online style, the caching policy is directly trained at the edge node and periodically updated when a new data item arrives. In the offline style, the caching policy is first generated during a training phase and then deployed at the edge node. After the deployment, the caching policy remains unchanged. Parallel training can be utilized to enhance and speed up training in A3C [25]. We adopt 16 parallel agents and configure different sets of input parameters (e.g., network traces) for each of them. The {state, action, reward} tuples of all agents are continuously sent to a central agent, which aggregates them to generate a single caching policy model. Upon receiving a sequence of tuples, the central agent updates its NN model by performing a gradient descent step according to (7) and (8). Then, the new network model of the central agent is passed to the agent which sent those tuples. It is evident that the aforementioned operations among different agents are independent and can happen asynchronously [25].

V. SIMULATION RESULTS

In this section, numerical results are presented to demonstrate the performance of our DRL-based caching policy.

A. Simulation Setup

We conduct our simulations using Python 3.5 and TensorFlow 1.8.0 on a workstation equipped with an Intel Core i7-6850K processor (3.6 GHz, 6 cores), four GeForce GTX1080Ti GPUs, and 64 GB of RAM. The scenario illustrated in Fig. 1 is used for the simulation. There is one edge node (e.g., a gateway) which covers C = 50 data producers (e.g., IoT sensors). Each sensor collects data values of a specific content from the real world. User applications send their requests for these data items to the edge node. The edge node decides to answer a request directly or forward it to the data producer, according to whether there is a fresh data item in its storage. The lifetime T_life of each IoT content is randomly selected from six levels, i.e., [0.1 min, 0.2 min, 0.25 min, 0.5 min, 0.75 min, 1 min], with equal probability. Similar to [27], we assume that users arrive and depart according to a Poisson process, and they request content items at random times with independent and identically distributed random inter-request times. The average sum request rate of users in the coverage area is denoted as w, which varies from 0.5/min to 6.5/min. Moreover, the popularity of different IoT contents is characterized by the Zipf distribution with a parameter η = 0.5. It means that in each request, the probability of requesting the IoT content whose popularity ranks j is P_j = j^(−η) / Σ_{j′=1}^{C} j′^(−η). Each request is assumed to be answered before the requesting user departs. Note that other arrival models for user requests are also applicable to the proposed DRL-based edge caching policy, since it only extracts features from the history request information. Without loss of generality, we let c_2 = 1 and c_1 = 0. The cost weight factor α varies from 0 to 1.

The NN structure of the DRL-based caching policy is illustrated in Fig. 4. To acquire the inputs of the network model at each time step n, the edge node maintains a list of the past J · G received requests. The number of past groups J is set as 6, and the number of requests contained in each group is set as G = 100. Then, we can obtain the value of x_n^i[j] by counting the number of requests for content f_n^i in the jth group of requests. The agent passes x̂_n^i for each considered content to a 1-D convolution layer with 128 filters, each of size 3 with stride 1. Results from these layers and other inputs (i.e., the lifetime and freshness of the considered data items) are then aggregated in a hidden layer with 128 neurons. In the actor network, the last layer employs the softmax function to output the caching policy. As for the critic network, it has the same network structure as the actor network, but its final output is a linear neuron. These networks are trained in the online way. The discount factor is γ = 0.99 and the minibatch size is 200. The learning rates for the actor and the critic are set as λ = 10^−4 and λ_v = 10^−3, respectively. The entropy factor β is 0.1.

We evaluate the performance of the proposed policy on cache hit ratio, data freshness, and data fetching cost. Cache hit ratio is the ratio of the number of requests directly answered by the edge node to the number of total requests. Two common baseline caching policies are considered, i.e., LRU and least fresh first (LFF). Like the proposed policy, these baseline policies also deal with how to select actions in case 3, while action selection in cases 1 and 2 is decided by the system rule. Upon receiving each request n in case 3, the details of these baseline policies are as follows.
1) LRU: The edge node keeps track of the number of requests for every cached content. The new data item is cached by removing the data item associated with the content which is requested the fewest times in the cache.
2) LFF: The edge node keeps track of the freshness of cached data items. The new data item is cached by removing the cached data item which is the least fresh.

B. Results

In Fig. 5, we can see the effect of the cost weight factor α on the system performance. Note that α controls the tradeoff between avoiding the freshness loss cost by fetching freshly generated data items and saving the communication cost by fetching cached data items. As α increases from 0 to 1, more weight is given to the communication cost in (3), i.e., the requirement for data freshness becomes lower. Fig. 5 shows that the proposed policy is able to adjust its caching actions to cater for different values of α, with the goal of minimizing the average cost. More cache hit events are allowed by the proposed policy with a larger α, and needless cache hit events are avoided with a smaller α. As for the LRU policy and the LFF policy, they replace the cached data items without consideration of α even though α is a weight coefficient in the fetching cost. Then, with


the variation of α, their performance on cache hit ratio and freshness remains unchanged, while the costs achieved by them change linearly. In short, we can see that the proposed policy is able to cater for different requirements for data freshness, and it outperforms the LRU policy and the LFF policy for different α.

Fig. 5. Freshness, cache hit ratio, and cost with varying cost weight factor. (a) Average cache hit ratio. (b) Average freshness. (c) Average cost.

Fig. 6. Freshness, cache hit ratio, and cost with varying request rate. (a) Average cache hit ratio. (b) Average freshness. (c) Average cost.

Fig. 7. Freshness, cache hit ratio, and cost with varying popularity skewness. (a) Average cache hit ratio. (b) Average freshness. (c) Average cost.

In Fig. 6, we can see the impact of the request rate w, which varies from 0.5/min to 6.5/min. When the request rate is relatively low, a cached data item will become nonfresh before the next request for that cached content arrives, in a statistical sense. As the request rate increases, the data items can be requested more times before they expire. It means that more data requests can arrive before the cached data items expire. Then, the proposed policy has more options in choosing the replacement time instance for the cached data items. With more options, the DRL-based edge caching policy may find better solutions. Hence, the cost achieved by the proposed policy decreases with the increase of the request rate. In addition, we can observe that the LRU policy and the LFF policy are less able to cater for the variation of the request rate, and their performance changes very slightly when the request rate is high enough to allow cache hit events.

Fig. 7 shows the impact of the popularity skewness parameter η. As η increases, the Zipf distribution becomes more concentrated, i.e., the first few popular content items account for the majority of requests. In the LRU policy and the LFF policy, more and more requests are answered by the edge node which caches the few popular contents. Then, with the increase of


η, the cache hit ratio increases and the freshness decreases when the request rate and the lifetime are fixed, as shown in Fig. 7(a) and (b). Moreover, we observe that the proposed policy is able to achieve the lowest average cost under different η, as shown in Fig. 7(c).

Fig. 8. Freshness, cache hit ratio, and cost with varying cache size. (a) Average cache hit ratio. (b) Average freshness. (c) Average cost.

Fig. 8 shows the impact of the cache size. The horizontal axis shows the ratio of the cache size I to the number of contents C. It can be observed that the proposed policy achieves the lowest average cost for different cache sizes I. Compared with the LRU policy and the LFF policy, the gain achieved by the proposed policy first becomes larger with the increase of I, and then becomes smaller when I increases to a value close to C. An extreme example is that all policies achieve the same performance when I is equal to C. The reason is that the edge node caches all contents under all policies if I = C. Upon receiving a new data item of a specific content, the edge node just caches this new data item by replacing the old data item of that specific content. In other words, when the cache is large enough to hold all contents, all requests belong to cases 1 and 2, and there is no room for improving the performance by optimizing the caching actions in case 3.

Fig. 9. Illustration of the performance fluctuation.

Fig. 9 shows the fluctuation of the fetching cost as time goes by, with a fixed cache size I = 25. The fetching cost in each time frame is averaged over 5000 requests. We can observe that the proposed policy is able to converge to a stable performance on the fetching cost. Moreover, the proposed policy outperforms the LRU policy and the LFF policy, provided that efficient training of the NN is guaranteed.

VI. CONCLUSION

In this paper, we proposed an edge caching-based IoT system framework for caching and transmitting transient IoT data. To minimize the long-term cost of fetching IoT data, we formulated the cache replacement problem as an MDP. To solve the formulated problem, a DRL-based caching policy was proposed, which can determine the caching policy without any explicit assumptions about the operating environment. Extensive simulation results demonstrated that the proposed DRL-based caching policy outperforms two baseline caching policies for a variety of network setup parameters, so the long-term cost of users fetching IoT data can be decreased. Possible future directions include the following. On the one hand, knowledge from other domains could be utilized to mine more useful features for the caching agent, thus making it easier to converge to the optimal caching policy. On the other hand, cooperative caching in IoT systems with multiple edge nodes (e.g., SBSs) could be taken into consideration. In the scenario of multiple edge nodes, neighboring edge nodes may have overlapping coverage and share cached data with each other. Thus, the caching action taken by an edge node may change the environment experienced by another edge node. To achieve equilibria among multiple edge nodes and avoid duplicate caching, multiagent DRL could be adopted as an efficient solution.

REFERENCES

[1] A. Nordrum, “Popular Internet of Things forecast of 50 billion devices by 2020 is outdated,” IEEE Spectr. [Online]. Available: https://spectrum.ieee.org/tech-talk/telecom/internet/popular-internet-of-things-forecast-of-50-billion-devices-by-2020-is-outdated


[2] M. Nitti, M. Murroni, M. Fadda, and L. Atzori, “Exploiting social Internet of Things features in cognitive radio,” IEEE Access, vol. 4, pp. 9204–9212, 2016.
[3] C. Long, Y. Cao, T. Jiang, and Q. Zhang, “Edge computing framework for cooperative video processing in multimedia IoT systems,” IEEE Trans. Multimedia, vol. 20, no. 5, pp. 1126–1139, May 2018.
[4] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
[5] H. Zhu, Y. Cao, W. Wang, B. Liu, and T. Jiang, “QoE-aware resource allocation for adaptive device-to-device video streaming,” IEEE Netw., vol. 29, no. 6, pp. 6–12, Nov./Dec. 2015.
[6] E. Bastug, M. Bennis, and M. Debbah, “Living on the edge: The role of proactive caching in 5G wireless networks,” IEEE Commun. Mag., vol. 52, no. 8, pp. 82–89, Aug. 2014.
[7] P. Blasco and D. Gündüz, “Learning-based optimization of cache content in a small cell base station,” in Proc. IEEE ICC, Sydney, NSW, Australia, Jun. 2014, pp. 1897–1903.
[8] J. Song, M. Sheng, T. Q. S. Quek, C. Xu, and X. Wang, “Learning-based content caching and sharing for wireless networks,” IEEE Trans. Commun., vol. 65, no. 10, pp. 4309–4324, Oct. 2017.
[9] M. Leconte et al., “Placing dynamic content in caches with small population,” in Proc. IEEE INFOCOM, San Francisco, CA, USA, Apr. 2016, pp. 1–9.
[10] S. M. S. Tanzil, W. Hoiles, and V. Krishnamurthy, “Adaptive scheme for caching YouTube content in a cellular network: Machine learning approach,” IEEE Access, vol. 5, pp. 5870–5881, 2017.
[11] W. Wang, S. De, R. Toenjes, E. Reetz, and K. Moessner, “A comprehensive ontology for knowledge representation in the Internet of Things,” in Proc. IEEE TrustCom, Liverpool, U.K., Jun. 2012, pp. 1793–1798.
[12] X. Sun and N. Ansari, “EdgeIoT: Mobile edge computing for the Internet of Things,” IEEE Commun. Mag., vol. 54, no. 12, pp. 22–29, Dec. 2016.
[13] Y. Liu et al., “CitySee: Not only a wireless sensor network,” IEEE Netw., vol. 27, no. 5, pp. 42–47, Sep./Oct. 2013.
[14] P. Bellavista, G. Cardone, A. Corradi, and L. Foschini, “Convergence of MANET and WSN in IoT urban scenarios,” IEEE Sensors J., vol. 13, no. 10, pp. 3558–3567, Oct. 2013.
[15] V. Kostakos, T. Ojala, and T. Juntunen, “Traffic in the smart city: Exploring city-wide sensing for traffic control center augmentation,” IEEE Internet Comput., vol. 17, no. 6, pp. 22–29, Nov./Dec. 2013.
[16] N. K. Giang, M. Blackstock, R. Lea, and V. C. M. Leung, “Developing IoT applications in the fog: A distributed dataflow approach,” in Proc. IEEE IOT, Seoul, South Korea, Oct. 2015, pp. 155–162.
[17] S. Chen, N. B. Shroff, and P. Sinha, “Heterogeneous delay tolerant task scheduling and energy management in the smart grid with renewable energy,” IEEE J. Sel. Areas Commun., vol. 31, no. 7, pp. 1258–1267, Jul. 2013.
[18] Y. Xu and W. Wang, “Wireless mesh network in smart grid: Modeling and analysis for time critical communications,” IEEE Trans. Wireless Commun., vol. 12, no. 7, pp. 3360–3371, Jul. 2013.
[19] D. B. Santana, Y. A. Zócalo, and R. L. Armentano, “Integrated e-Health approach based on vascular ultrasound and pulse wave analysis for asymptomatic atherosclerosis detection and cardiovascular risk stratification in the community,” IEEE Trans. Inf. Technol. Biomed., vol. 16, no. 2, pp. 287–294, Mar. 2012.
[20] S. Vural et al., “In-network caching of Internet-of-Things data,” in Proc. IEEE ICC, Sydney, NSW, Australia, Jun. 2014, pp. 3185–3190.
[21] S. Vural, N. Wang, P. Navaratnam, and R. Tafazolli, “Caching transient data in Internet content routers,” IEEE/ACM Trans. Netw., vol. 25, no. 2, pp. 1048–1061, Apr. 2017.
[22] Z. Zhang, C.-H. Lung, I. Lambadaris, and M. St-Hilaire, “IoT data lifetime-based cooperative caching scheme for ICN-IoT networks,” in Proc. IEEE ICC, Kansas City, MO, USA, May 2018, pp. 1–7.
[23] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[24] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[25] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. ICML, New York, NY, USA, Jun. 2016, pp. 1928–1937.
[26] V. R. Konda and J. N. Tsitsiklis, “Actor–critic algorithms,” in Proc. Adv. NIPS, Denver, CO, USA, 2000, pp. 1008–1014.
[27] L. Wang, H. Wu, Z. Han, P. Zhang, and H. V. Poor, “Multi-hop cooperative caching in social IoT using matching theory,” IEEE Trans. Wireless Commun., vol. 17, no. 4, pp. 2127–2145, Apr. 2018.

Hao Zhu received the B.S. degree in information and communication engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2014, where he is currently pursuing the Ph.D. degree at the School of Electronic Information and Communications. His current research interests include mobile edge networks and multimedia communications.

Yang Cao (S’09–M’14) received the Ph.D. degree in information and communication engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2014. He is currently an Associate Professor with the Wuhan National Laboratory for Optoelectronics, School of Electronic Information and Communications, HUST. His current research interests include 5G cellular networks, Internet of Things, and future networks.

Xiao Wei received the B.S. degree in information and communication engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2014, where she is currently pursuing the Ph.D. degree at the School of Electronic Information and Communications. Her current research interests include wireless communications, signal processing, and massive MIMO.

Wei Wang (S’10–M’16) received the Ph.D. degree from the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. He is currently a Professor with the Wuhan National Laboratory for Optoelectronics, School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China. His current research interests include PHY/MAC designs and mobile computing in wireless systems. Dr. Wang served on the TPC of INFOCOM and GLOBECOM. He served as a Guest Editor for Wireless Communications and Mobile Computing and for the IEEE COMSOC MMTC Communications.

Tao Jiang (M’06–SM’10–F’19) received the Ph.D. degree in information and communication engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2004. He was with Brunel University London, London, U.K., and the University of Michigan–Dearborn, Dearborn, MI, USA. He is currently a Chair Professor with the Wuhan National Laboratory for Optoelectronics, School of Electronic Information and Communications, HUST. Dr. Jiang was a recipient of the NSFC Distinguished Young Scholars Award in 2013. He served as an Associate Editor for the IEEE Transactions on Signal Processing and the IEEE Communications Surveys and Tutorials. He is an Associate Editor-in-Chief of China Communications.

Shi Jin (S’06–M’07–SM’17) received the Ph.D. degree in information and communications engineering from Southeast University, Nanjing, China, in 2007. From 2007 to 2009, he was a Research Fellow with the Adastral Park Research Campus, University College London, London, U.K. He is currently a Professor with the Faculty of the National Mobile Communications Research Laboratory, Southeast University. His current research interests include space-time wireless communications, random matrix theory, and information theory. Dr. Jin is an Associate Editor of the IEEE Transactions on Wireless Communications, IEEE Communications Letters, and IET Communications.
