Reinforcement Learning
Zhiqiang Wan∗, Hepeng Li†, and Haibo He∗
∗ Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island
RI 02881 USA, Email: {zwan, he}@ele.uri.edu
† Lab. of Networked Control Systems, Shenyang Institute of Automation, Chinese Academy of Sciences
Abstract—A smart home with battery energy storage can take part in the demand response program. With proper energy management, consumers can purchase more energy at off-peak hours than at on-peak hours, which reduces electricity costs and helps to balance electricity demand and supply. However, it is hard to determine an optimal energy management strategy because of the uncertainty of the electricity consumption and the real-time electricity price. In this paper, a deep reinforcement learning based approach is proposed to solve this residential energy management problem. The proposed approach does not require any knowledge about the uncertainty and can directly learn the optimal energy management strategy based on reinforcement learning. Simulation results demonstrate the effectiveness of the proposed approach.

I. INTRODUCTION

According to the International Energy Outlook 2017 [1], world energy consumption is projected to increase 28% by 2040. Apart from installing new power plants, demand response (DR) provides an economically efficient way to alleviate the increasingly tight electricity supply. By shifting electricity consumption from on-peak hours to off-peak hours, DR offers a chance for end-consumers to participate in the operation and energy management of electric power grids to improve energy efficiency. In general, real-time electricity prices are used by electric system planners and operators to incentivize consumers to participate in DR programs.

For residential DR programs, the energy management system plays an essential part in automatically responding to real-time electricity prices and managing household appliance consumption. Many studies in the literature have focused on developing efficient DR policies for optimal residential energy management (REM). To name a few, in [2], a learning-based DR strategy for optimal control of a heating, ventilation and air conditioning (HVAC) system is developed to minimize electricity costs. In [3], a DR strategy for scheduling of aggregated residential HVACs is investigated. In [4], the DR capability of electric water heaters (EWH) is evaluated for load-shifting and balancing reserve. In [5], [6], co-optimization of several types of home appliances, including controllable and shiftable appliances, is studied. In [7], a mixed integer linear or nonlinear programming model is proposed to make optimal DR schedules for different types of appliances while considering the users' comfort. In [8], deep Q-learning and deep policy gradient are applied to control flexible loads such that the electricity cost and the load peak can be reduced. However, these works did not consider the value of battery energy storage in REM for DR programs.

Many researchers have investigated the REM problem while considering battery energy storage. Abdulgader et al. [9] analyze a smart home scenario with battery energy storage. The Grey Wolf Optimizer, an evolutionary algorithm based on swarm intelligence, is applied to manage the energy storage system by charging the battery when the electricity price is low and discharging it when the price is high. Melhem et al. [10] investigate the energy consumption and production of a smart home where the integration of battery energy storage, renewable energy sources, and an electric vehicle is considered. The REM is optimized by mixed integer linear programming such that the electricity cost is minimized. Wei et al. [11] report an error-tolerant iterative adaptive dynamic programming algorithm to optimize battery energy storage management in a smart home with photovoltaic (PV) panels.

Recently, deep neural networks (DNN) have been used in numerous applications [12]–[17], including image recognition, visual question answering, and machine translation. Deep reinforcement learning (RL), the combination of DNN and RL, has obtained significant success in complex decision-making processes [18]–[23]. For example, a deep Q-network is developed in [18] to learn to make successful decisions based only on image observations. The discriminative features of the image observation are extracted by a convolutional neural network. The proposed deep Q-network is evaluated on the Atari games and its performance is comparable to that of a professional player. The great success of AlphaGo [19], which defeated 18-time world champion Lee Sedol by 4 games to 1, is attributed to the integration of deep RL and Monte Carlo tree search. Beyond these perfect-information applications, deep RL has also obtained promising results in imperfect-information settings. Heinrich et al. [20] proposed Neural Fictitious Self-Play (NFSP), which combines deep RL with fictitious self-play to learn an approximate Nash equilibrium for imperfect-information games. NFSP has been successfully applied to two-player zero-sum games, such as Leduc poker and Limit Texas Hold'em. DeepStack, introduced in [21], defeated professional poker players in heads-up no-limit Texas Hold'em poker by integrating recursive reasoning to tackle information asymmetry.
Fig. 1. The schematic diagram of the REM system.

In this paper, a deep RL based approach is proposed to solve the REM problem. This problem is formulated as a finite Markov decision process (MDP). The objective of this REM problem is to determine an optimal sequence of actions such that the total electricity cost is minimized. It is worth noting that the proposed approach is model-free and does not require any knowledge about the uncertainties of the home load and the real-time electricity price. The proposed model consists of two networks, an actor network and a critic network. The actor network generates an action which determines the amount of electricity to be purchased from the utility. The critic network assesses the performance of executing this action. In these two networks, a gated recurrent unit (GRU) network is applied to extract discriminative features from the input time-series data. Simulation results under a real-world scenario verify the effectiveness of the proposed approach.

The remainder of this paper is organized as follows. Section II provides the problem formulation. Then, Section III introduces the proposed approach. Its effectiveness is verified by simulation results in Section IV. Lastly, a conclusion is given in Section V.

II. PROBLEM FORMULATION

A. Environment Setup

A smart REM system is introduced in [24]. The schematic diagram of the REM system is presented in Fig. 1. The REM system consists of the power grid, the energy storage (battery) system, and the home loads. At each hour, the REM system decides the amount of electricity to be purchased from the utility. We consider a real-world scenario where the electricity price and home loads vary hourly. The real-world hourly electricity price is from the California ISO [25]. The household consumption is generated based on user behavior modeling in accordance with [7].

B. Problem Formulation

A finite MDP with discrete time steps t \in \{1, 2, ..., 24\} is applied to model the sequential decision-making process of this REM problem over the horizon of one day. The MDP is defined as the five-tuple (S, A, P(\cdot,\cdot), R(\cdot,\cdot), \gamma), where S denotes the system state, A is the set of feasible actions, P(\cdot,\cdot) is the state transition function, R(\cdot,\cdot) is the reward function, and \gamma is the discount factor. The state transition is

s_{t+1} = f(s_t, a_t, \omega_t),    (1)

where the state transition is determined by the action a_t and the randomness \omega_t. Specifically, the remaining battery energy is determined by the battery model E_{t+1} = E_t + a_t - L_t. However, the transitions of the electricity price P_t and the home loads L_t are subject to randomness because of the lack of information about future prices and future loads.

4) Reward: The reward is defined as

r_t = \begin{cases} -P_t a_t, & t \neq t_{24} \\ -P_t a_t - \lambda \max(E_{min} - E_t, 0), & t = t_{24} \end{cases}    (2)

where P_t a_t represents the electricity cost at hour t, E_{min} denotes the allowed minimum battery energy, and \lambda is a penalty factor.

In this problem formulation, an action-value function Q^\pi(s, a) is used to assess the performance of executing the action a under the state s. Q^\pi(s, a) is defined as the expected sum of discounted future rewards starting from state s, executing the action a, and thereafter following the policy \pi, i.e.,

Q^\pi(s, a) = E_\pi\left[\sum_{k=0}^{K} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right],    (3)

where \pi is the policy which maps from state s to the probability \pi(a \mid s) of taking action a when given state s. In Eq. (3), K denotes the number of future time steps, and \gamma is a discount factor balancing the importance of the immediate reward against the future rewards. In this problem formulation, \gamma = 1, which means the immediate reward has the same importance as the future rewards.

The objective of this REM problem is to find an optimal policy \pi^* such that the sum of future rewards is maximized, i.e.,

Q^*(s, a) = \max_\pi Q^\pi(s, a),    (4)

where Q^*(s, a) is called the optimal action-value function. With this optimal policy \pi^*, we can derive the sequence of optimal actions a^* = \arg\max_{a \in A} Q^*(s, a) to maximize the sum of future rewards as in Eq. (5), which is equivalent to minimizing the total electricity cost.
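As a minimal sketch of the battery transition E_{t+1} = E_t + a_t - L_t and the reward in Eq. (2): the price and load values below are illustrative placeholders, not the California ISO data or the modeled household loads used in the paper.

```python
# Sketch of the battery model (from Eq. 1) and reward (Eq. 2).
# Price/load values are illustrative placeholders only.

def step_battery(E_t, a_t, L_t):
    """Battery model: E_{t+1} = E_t + a_t - L_t (all in kWh)."""
    return E_t + a_t - L_t

def reward(P_t, a_t, E_t, t, t24=24, E_min=2.0, lam=100.0):
    """r_t = -P_t * a_t, plus an end-of-day penalty if E_t < E_min."""
    r = -P_t * a_t
    if t == t24:
        r -= lam * max(E_min - E_t, 0.0)
    return r

E = 10.0                                   # remaining battery energy (kWh)
E = step_battery(E, a_t=4.0, L_t=2.0)      # -> 12.0 kWh
r = reward(P_t=5.0, a_t=4.0, E_t=E, t=1)   # -> -20.0 (no penalty before t24)
```

The penalty term only activates at the last hour of the day, discouraging the agent from draining the battery below E_min to game the daily cost.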
Fig. 2. The overall diagram of the deep RL based residential energy management. The proposed model consists of an actor network and a critic network. Their inputs are the battery SOC, past 24-h electricity price, and past 24-h home loads. The real-time action a_t is generated by the actor network, and the reward r_t is fed into the critic network.
Q^*(s_1, a) = \max \sum_{k=0}^{23} r_{k+1} = \min \sum_{k=0}^{23} P_{k+1} a_{k+1}    (5)

where s_1 is the initial state.

III. PROPOSED APPROACH

It is hard to analytically determine the optimal policy \pi^* because the future home loads and electricity price are generally unknown. In order to solve this problem, Q-learning can be used to update the action-value function Q(s, a) according to the Bellman equation [26] as

Q_{i+1}(s, a) = E\left[r_t + \gamma \max_{a_{t+1}} Q_i(s_{t+1}, a_{t+1}) \,\middle|\, s_t = s, a_t = a\right],    (6)

where i represents the iteration number. Q(s, a) will converge to the optimal action-value function Q^*(s, a) [27] when i \to \infty.

Fig. 2 illustrates the overall diagram of the deep RL based residential energy management (REM). The inputs of the proposed model are the battery SOC, the past 24-h electricity price, and the past 24-h home loads. Then, the real-time action a_t is generated, and the reward r_t is calculated.

A. Architecture of Proposed Network

The architecture of the proposed deep RL based model in Fig. 2 consists of an actor network and a critic network. The actor network generates the action a_t, i.e., the amount of electricity to be purchased from the utility. The critic network outputs the action-value Q(s, a).

1) Actor Network: Extracting good features from the input electricity price and home loads is important for this REM problem. Recurrent neural networks have achieved great success in processing time-series data [28]–[30]. The gated recurrent unit (GRU) network [31], a type of recurrent neural network, is used to extract discriminative features from the input time-series data. The gradient of the past 24-h electricity price, d_t = P_t - P_{t-1}, ..., d_{t-22} = P_{t-22} - P_{t-23}, is inputted into the 23 cells of the GRU network, respectively. All 23 cells share a set of parameters. For the cell C_{22}, its input is d_{t-22}, and its output y^{pa}_{t-22} is directly passed to the next cell C_{21}. Then, the past information y^{pa}_{t-22} and the new input information d_{t-21} are fused in the cell C_{21}. This fusion process is repeated for the remaining cells.

Fig. 3. Information flow in the GRU cell C_0 [31].

The architecture of the GRU cell C_0 is presented in Fig. 3. The reset gate g_t determines the amount of past information y_{t-1} to be stored into the cell. The reset gate is computed by

g_t = \sigma(W_g d_t + U_g y_{t-1} + b_g)    (7)

where \sigma denotes the sigmoid activation function, and W_g, U_g, and b_g represent the corresponding weights and bias term. The processed past information y_{t-1} is fused with the input information d_t as

c_t = \tanh[W d_t + U(g_t \odot y_{t-1}) + b]    (8)
where tanh is the hyperbolic tangent activation function, \odot denotes the element-wise multiplication operation, and W, U, and b denote the weights and bias term. Then, the update gate z_t decides the amount of information to be outputted from c_t and y_{t-1}. Similar to the reset gate, the update gate is computed as

z_t = \sigma(W_z d_t + U_z y_{t-1} + b_z)    (9)

where W_z, U_z, and b_z are the weights and bias term. Thus, the final output y_t is

y_t = z_t \odot c_t + (1 - z_t) \odot y_{t-1}    (10)

The outputs of the two GRU networks are concatenated with the battery SOC. These concatenated features x_a are inputted into a three-layer fully-connected neural network. The value of the hidden unit is computed as

h_a = \mathrm{relu}(W_{a1} x_a + b_{a1})    (11)

where relu denotes the rectified linear activation function, and W_{a1} and b_{a1} are the weights and bias term. Then, the output of this fully-connected neural network is calculated by

o = \sigma(W_{a2} h_a + b_{a2})    (12)

Then, o is scaled to the range of -e_{max} \sim e_{max} and outputted as a_t.

2) Critic Network: Similar to the actor network, the features of the electricity price and the home loads are extracted by two GRU networks, respectively. These features are concatenated with the output of the actor network and the battery SOC. Then, the concatenated features are fed into a three-layer fully-connected neural network. Its output is the action-value Q(s, a).

B. Training Algorithm

The proposed model is trained to generate optimal actions based on the deep deterministic policy gradient [32]. The training process is provided in Algorithm 1. Apart from the main actor network and critic network, two target networks are applied to calculate the target action-value. Their initial parameters are copied from the main networks. When the episode number is smaller than G, we refer to this phase as the pretrain phase, where the action a_t is randomly sampled from a uniform distribution U(-e_{max}, e_{max}). After the pretrain phase, the action a_t is computed according to the main actor network and the exploration noise as

a_t = A(s_t \mid \theta^A) + \nu N_t,    (13)

where the noise N_t is sampled from a uniform distribution bounded between 0 and 1. The noise scale \nu is calculated as

\nu = 0.2 e_{max} + \nu_0 \beta^{ep}    (14)

where \nu_0 is the initial noise scale, \beta is the noise decay rate, and ep is the episode index.
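A small sketch of the exploration scheme in Eqs. (13)–(14) follows. The decay form \beta^{ep} is reconstructed from the garbled source equation, and the actor output here is a placeholder scalar rather than the GRU-based network; the default constants match Section IV-A3 (e_max = 10 kWh, \nu_0 = 2 e_{max} = 20, \beta = 0.9985).

```python
import random

# Exploration during training, following Eqs. (13)-(14): the actor output
# is perturbed by uniform noise whose scale decays with the episode index.
# The decay exponent beta**ep is an assumption reconstructed from the text.

def noise_scale(ep, e_max=10.0, nu0=20.0, beta=0.9985):
    """Eq. (14): nu = 0.2 * e_max + nu0 * beta**ep."""
    return 0.2 * e_max + nu0 * beta ** ep

def noisy_action(actor_out, ep):
    """Eq. (13): a_t = A(s_t | theta^A) + nu * N_t, with N_t ~ U(0, 1)."""
    return actor_out + noise_scale(ep) * random.random()

nu_start = noise_scale(0)        # 22.0 at episode 0
nu_late = noise_scale(100_000)   # decays toward the 0.2 * e_max = 2.0 floor
```

Keeping a nonzero floor (0.2 e_max) preserves a little exploration even late in training, while the decaying term lets early episodes explore the full purchase range.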
where y_j = r_j + \gamma Q'(s_{j+1}, A'(s_{j+1} \mid \theta^{A'}) \mid \theta^{Q'}) is the target action-value calculated by the target actor network A' and the target critic network Q'. The parameters of the actor network are updated with the sampled policy gradient

\nabla_{\theta^A} J \approx \sum_{j=1}^{F} \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_j, a=A(s_j)} \nabla_{\theta^A} A(s \mid \theta^A)\big|_{s_j}.    (16)

Then, the parameters of the actor network are updated by

\theta^A \leftarrow \theta^A + \alpha \nabla_{\theta^A} J    (17)

where \alpha denotes the learning rate. Then, the parameters of the target networks are updated by slowly tracking the main networks:

\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'},
\theta^{A'} \leftarrow \tau \theta^A + (1 - \tau)\theta^{A'},    (18)

where \tau is a small constant coefficient.

IV. EXPERIMENTAL RESULTS

The performance of the proposed deep RL based REM is evaluated by simulation analysis. The simulation setup is introduced in Section IV-A. Then, the training results are provided in Section IV-B. Finally, in Section IV-C, the proposed model is evaluated and compared with several benchmark algorithms.

A. Simulation Setup

1) The Environment: The simulation is based on the real-world REM scenario shown in Fig. 1. The REM system decides how much electricity to purchase from the utility. The allowed maximum purchased electricity is e_{max} = 10 kWh. The purchased electricity can be used to supply the home demand and charge the battery. The battery can discharge to power the home loads as well. The battery capacity is 20 kWh and the allowed minimum battery energy is E_{min} = 2 kWh. The penalty factor is \lambda = 100. The electricity price is obtained from a real-world electricity market, the California ISO [25]. The price data starts from March 1st, 2014 and contains 300 days, where 200 days are used for training and the remaining 100 days are used for testing. The household consumption is generated based on user behavior modeling in accordance with [7].

Fig. 4. The training process of the proposed model.

2) Model Structure: The structure of the proposed model is presented as follows. In the actor network, the GRU network has 23 cells, and its output is a 16-dimensional vector. The input layer of the fully-connected neural network has 33 units, and its hidden layer has 16 units. Its output is scaled to the range of -10 \sim 10. In the critic network, the GRU network has the same structure as the one in the actor network. The input layer of the fully-connected neural network has 34 units, and its hidden layer has 16 units.

3) Training Parameters: During the training process, the pretrain phase lasts 8000 episodes. The size of the replay buffer is set as 300,000. The size of the mini-batch is 64. The initial noise scale is 2 e_{max} = 20 kWh and the noise decay rate \beta is set as 0.9985. For the target network, the updating coefficient \tau is 0.01.

B. Training Phase

The proposed model is trained for 100,000 episodes to generate the optimal actions. Each episode has 24 time steps. The training process of the proposed approach is presented in Fig. 4. During the first 8,000 episodes, the action is randomly sampled from the uniform distribution U(-e_{max}, e_{max}). After that, the action is selected according to the main actor network and the exploration noise. Fig. 4 shows that the electricity cost decreases steadily during the training process. After about 60,000 episodes, the electricity cost converges to around 4,000 cents. This training curve shows that the proposed model learns to minimize the electricity cost after sufficient training episodes.

C. Performance Evaluation

The performance of the proposed model is evaluated on the test days. In order to demonstrate the effectiveness of the proposed model, its performance is compared with several benchmark algorithms.
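The soft target-network update of Eq. (18), with the coefficient \tau = 0.01 given in Section IV-A3, can be sketched with plain lists standing in for the actor/critic parameter vectors:

```python
# Soft target update, Eq. (18): theta' <- tau * theta + (1 - tau) * theta'.
# Plain lists stand in for the network parameter vectors.

def soft_update(target_params, main_params, tau=0.01):
    for i in range(len(target_params)):
        target_params[i] = tau * main_params[i] + (1.0 - tau) * target_params[i]

target = [0.0, 0.0]   # target-network parameters (copied at initialization)
main = [1.0, 2.0]     # main-network parameters after some training
soft_update(target, main)   # target becomes [0.01, 0.02]
```

Because \tau is small, the target networks track the main networks slowly, which stabilizes the bootstrapped target action-values used during training.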
[14] L. Ma, Z. Lu, and H. Li, "Learning to answer questions from image using convolutional neural network," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[15] Z. Wan, H. He, and B. Tang, "A generative model for sparse hyperparameter determination," IEEE Transactions on Big Data, vol. PP, no. 99, pp. 1–1, 2017.
[16] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[17] Z. Wan, X. Hu, H. He, and Y. Guo, "A learning based approach for social force model parameter estimation," in 2017 International Joint Conference on Neural Networks (IJCNN), May 2017, pp. 4058–4064.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[20] J. Heinrich and D. Silver, "Deep reinforcement learning from self-play in imperfect-information games," arXiv preprint arXiv:1603.01121, 2016.
[21] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling, "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker," Science, vol. 356, no. 6337, pp. 508–513, 2017.
[22] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, "Optimal and autonomous control using reinforcement learning: A survey," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–21, 2017.
[23] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653–664, March 2017.
[24] M. Boaro, D. Fuselli, F. D. Angelis, D. Liu, Q. Wei, and F. Piazza, "Adaptive dynamic programming algorithm for renewable energy scheduling and battery management," Cognitive Computation, vol. 5, no. 2, pp. 264–277, Jun 2013.
[25] California ISO, "http://oasis.caiso.com/mrioasis/logon.do."
[26] R. Bellman, "Dynamic programming," 1958.
[27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998, vol. 1, no. 1.
[28] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct 2017.
[29] M. Han and M. Xu, "Laplacian echo state network for multivariate time series prediction," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 1, pp. 238–244, Jan 2018.
[30] E. Santana, M. S. Emigh, P. Zegers, and J. C. Principe, "Exploiting spatio-temporal structure with recurrent winner-take-all networks," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–9, 2017.
[31] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[32] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[33] J. Löfberg, "YALMIP: A toolbox for modeling and optimization in MATLAB," in Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.