
A Deep Reinforcement Learning Based Energy Storage System Control Method for Wind Farm Integrating Prediction and Decision

Jiajun Yang, Ming Yang, Pingjing Du, Fangqing Yan, Yixiao Yu


Shandong University
School of Electrical Engineering
Jinan, China
myang@sdu.edu.cn

Abstract—In the electricity market, wind power producers face the challenge of how to maximize their income under the uncertainty of wind power. This paper proposes an integrated scheduling mode that combines wind power prediction and energy storage system (ESS) decision making, avoiding the loss of decision-making information in the wind power prediction. Secondly, deep Q network, a deep reinforcement learning (DRL) algorithm, is introduced to construct the end-to-end ESS controller. The uncertainty of wind power is automatically considered during the DRL-based optimization, without any assumption. Finally, the superiority of the proposed method is verified through the analysis of a case wind farm located in Jiangsu Province.

Keywords—Deep Q network, deep reinforcement learning, electricity market, energy storage system, wind farm schedule.

I. INTRODUCTION

In recent years, as the most economical form of power generation among non-hydro renewable energy sources, wind power has accounted for an increasing proportion of total power generation. It is an inevitable trend for wind farms to maximize their generation profits as wind power producers in the electricity market [1]. However, the uncertainty of wind power poses a challenge for wind farm control.

Integrating an energy storage system (ESS) into the wind farm is an effective way to increase the profits obtained under this uncertainty [2], [3]. There have been many studies on the control of the ESS considering wind power uncertainty [4]-[6]. In [4], a chance-constrained optimization based model for pumped storage power station control is proposed to alleviate the fluctuation of the integrated power; in this model, the wind power forecast error is assumed to obey a normal distribution. In [5], based on the Monte Carlo method and scene reduction technology, a wind-storage scheduling model is established on multiple time scales, and the specific output of the wind power and the ESS is arranged in detail. Reference [6] introduced reinforcement learning (RL) into the decision making of ESS control and established a two-stage learning model based on the Q-learning algorithm.

In previous studies, the ESS scheduling in a wind farm follows two separate processes: wind power prediction and ESS decision making. In the wind power prediction, high-dimensional meteorological data from wind farms are compressed into forecasted wind power values, which causes the loss of effective decision-making information contained in the original meteorological data. Meanwhile, in the decision-making process based on mathematical optimization algorithms, the uncertainty of wind power is generally assumed to follow a specific probability distribution [7]. The inaccurate expression of wind power uncertainty also reduces the scheduling benefits of wind farms [8].

To overcome the defects above, a deep reinforcement learning (DRL) based method for ESS control in wind farms under the integration of prediction and decision is proposed. The integration of prediction and decision means that the ESS control is directly driven by the high-dimensional original data (including meteorological data). Such an end-to-end integrated scheduling mode can effectively utilize the hidden decision-making information in the original data to improve the scheduling profits. Secondly, deep Q network (DQN), a DRL algorithm, is introduced to construct the optimal controller under the integration of prediction and decision. The data-driven optimization algorithm allows the wind power uncertainty laws contained in big data to be automatically captured and utilized by the machine.

II. OVERVIEW OF THE PROPOSED METHOD

A. Integration of Prediction and Decision in Wind Farm

In the traditional scheduling mode, wind power prediction and decision making are independent of each other. In the wind power prediction, the input of the wind power prediction system generally includes the real-time and historical output power of the wind turbines and the real-time, historical and even predicted meteorological data (wind speed, wind direction, temperature, air pressure, etc.). The output is the forecasted wind power value P_{w,t}^{fore} for the future period. In the decision making, the controller determines the charge/discharge power of the ESS based on P_{w,t}^{fore}, the current state of charge of the ESS and the price of electricity.
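To make the difference between the two modes concrete, the following minimal sketch contrasts their interfaces. It is illustrative only and not the authors' implementation; the functions predict, decide and controller are hypothetical placeholders.

    # Minimal sketch of the two scheduling modes (hypothetical interfaces,
    # not the authors' implementation).

    def traditional_schedule(meteo_data, hist_power, soc, price, predict, decide):
        """Two separate stages: prediction compresses the raw data first."""
        p_forecast = predict(meteo_data, hist_power)   # high-dim data -> one forecast value
        return decide(p_forecast, soc, price)          # ESS charge/discharge instruction

    def integrated_schedule(meteo_data, hist_power, soc, price, controller):
        """Integrated mode: the controller consumes the raw state directly."""
        state = (*meteo_data, *hist_power, soc, price) # no intermediate forecast step
        return controller(state)                       # end-to-end ESS instruction

The key design difference is that the integrated mode never collapses the raw measurements into a single forecast value, so whatever information the forecast step would discard remains available to the controller.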
Fig. 1 Energy storage system scheduling under the integration of prediction and decision (the wind turbines, meteorology data and ESS state of charge feed the controller/agent, which issues the instruction P_ESS,t to the energy storage and energy conversion system, while the combined power P_sys,t is injected into the power grid)

Fig. 1 shows the integration of the prediction and decision in the wind farm. In the integrated scheduling mode, all meteorological data, the historical wind power, the price of electricity and the ESS state are directly input into the controller as the decision-making basis. The controller automatically extracts the data features that are beneficial to increasing the sales revenue from the numerous input data.

After the dispatch is completed, the power P_sys,t injected from the wind farm into the grid is the sum of the actual wind power P_w,t at that time and the charge/discharge power P_ESS,t of the ESS determined by the controller, as shown in (1):

P_{\mathrm{sys},t} = P_{w,t} + P_{\mathrm{ESS},t}    (1)

where a positive value of P_ESS,t indicates that the ESS is in the state of discharge, and a negative value indicates that the ESS is in the state of charge.
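As a quick illustration of this sign convention (numbers chosen here purely for illustration): if the wind farm produces P_w,t = 30 MW while the controller orders the ESS to charge at 5 MW (P_ESS,t = -5 MW), the injected power is P_sys,t = 30 - 5 = 25 MW; if the ESS instead discharges at 5 MW (P_ESS,t = +5 MW), then P_sys,t = 35 MW.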
B. Deep Reinforcement Learning Based Controller Optimization

In the integrated scheduling mode, the optimization algorithm can optimize the entire scheduling process end-to-end. The input state in Fig. 2 means all available decision-making information, which includes the meteorological data, the historical wind power, the price of electricity and the ESS state. The data features (such as uncertainties) contained in the input state can be effectively mined and utilized to optimize the parameters of the controller.

Fig. 2 Energy storage system optimization control based on deep reinforcement learning (the input state, including meteorological data, is mapped by the controller under the integration of prediction and decision to an ESS instruction; the DRL algorithm optimizes the controller toward higher wind farm income)

In order to effectively deal with the high-dimensional input state and extract the high-order data features for optimization, this paper introduces the DQN algorithm to construct the controller, as shown in Fig. 2. DRL takes the historical scheduling experience as the learning samples and continuously updates the parameters in the controller with the goal of maximizing profit.

III. DEEP REINFORCEMENT LEARNING ALGORITHM

A. Deep Reinforcement Learning Basic Principle

The basic principle of RL is to continuously encourage the agent (controller) to output, with higher probability, a feasible action that can bring a high reward [9], [10]. The agent is essentially a mapping relationship from the state space S to the action space A. Through interacting with the external environment, reinforcement learning directly optimizes the mapping relationship inside the agent, without considering the physical mechanism between the state s_t and the action a_t [11], as shown in Fig. 3.

Fig. 3 Basic principle of reinforcement learning (the controller/agent maps the state s_t to the action a_t; the external environment returns the next state s_{t+1} and the reward r_t)

In traditional RL algorithms, the mapping relationship generally exists in the form of a two-dimensional table, and it is difficult to directly characterize and process a continuous, high-dimensional input state. The continuous state space must be discretized to match the algorithm, resulting in unnecessary information loss.

In this paper, DQN, a DRL algorithm, is applied to the control of wind farms, so that the controller can process the continuous and high-dimensional wind farm state under the integrated scheduling mode. DQN introduces deep neural networks to fit the mapping relationship. The deep neural network (deep learning), as a big data mining technology, can effectively mine the high-order data features in the state space and screen out redundant information [12], improving the optimization effect of DQN.
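The agent-environment interaction of Fig. 3 can be summarized in a few lines. The sketch below is illustrative only; controller and environment.step are hypothetical stand-ins for the trained mapping and the electricity market simulation.

    # Sketch of one agent-environment interaction step from Fig. 3
    # (illustrative; `controller` and `environment.step` are hypothetical).

    def interaction_step(controller, environment, state_t):
        action_t = controller(state_t)                    # mapping: state -> action
        state_t1, reward_t = environment.step(action_t)   # market returns next state and reward
        return state_t1, reward_t, (state_t, action_t, reward_t, state_t1)  # experience tuple

The returned experience tuple (s_t, a_t, r_t, s_{t+1}) is exactly the sample format consumed by the learning updates described next.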
B. Deep Q Network

DQN, developed by the DeepMind team in 2015, achieved performance beyond the human level on Atari games and is one of the most classic algorithms in DRL [13]. DQN takes advantage of a deep neural network (the evaluation network) to approximate the mapping relationship between the input state and the Q value, enabling DQN to tackle a continuous state space. The Q value is the discounted expected value of the accumulated reward obtained by the action over numerous trials. Secondly, DeepMind established a replay buffer for DQN to break the correlation among adjacent data samples and realize offline learning. Finally, in addition to the evaluation network, DeepMind also sets up a separate target network to eliminate the correlation of the network parameters in the TD-error. The iterative process of the Q value is shown as

Q(s_t, a_t; \theta_t) \leftarrow Q(s_t, a_t; \theta_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^-) - Q(s_t, a_t; \theta_t) \right]    (2)

where Q(s_t, a_t; θ_t) is the Q value of the action a_t in the state s_t obtained through the evaluation network, whose network parameters are represented as θ_t; the parameters of the target network are represented as θ⁻; α is the learning rate of the evaluation network in supervised learning; r_t is the immediate reward from the external environment; and γ is the discount factor that determines the current value of the rewards to be received in the future. The TD-error is the loss function used to train the evaluation network, and its specific form in (2) is r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ⁻) − Q(s_t, a_t; θ_t).
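For one experience sample, the TD target and TD-error of (2) can be computed as in the following sketch. It is illustrative only: q_eval and q_target are hypothetical functions returning the vector of Q values of all discrete actions from the evaluation network and the target network, respectively.

    # Sketch of the Q-value iteration in (2) for one experience sample
    # (q_eval / q_target are hypothetical functions returning the Q values
    # of all discrete actions; not the authors' code).

    def td_target_and_error(q_eval, q_target, sample, gamma=0.9):
        s_t, a_t, r_t, s_t1 = sample
        target = r_t + gamma * max(q_target(s_t1))   # r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-)
        td_error = target - q_eval(s_t)[a_t]         # TD-error used as the training loss signal
        return target, td_error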
C. Action Selection Policy

The mapping relationship from the state space to the action space consists of the evaluation network and an action selection strategy; the ε-greedy policy is used to select an action based on the Q values. The policy defined in (3) calculates the probability π with which each action in the action space is selected:

\pi(a, s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A|}, & a = \arg\max_a Q(s, a) \\ \dfrac{\varepsilon}{|A|}, & a \neq \arg\max_a Q(s, a) \end{cases}    (3)

where ε (≠ 0) is used to determine the probability of selecting an action randomly.
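A common implementation of (3) is sketched below (illustrative only): with probability ε an action is drawn uniformly from the whole action set (including the greedy one), so the greedy action ends up with total probability 1 − ε + ε/|A|, exactly as in (3).

    # Sketch of epsilon-greedy action selection as in (3)
    # (q_values is a list of Q values, one per discrete action).

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(len(q_values))   # explore: uniform over all actions
        return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit: greedy action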
IV. APPLICATION OF DRL

Before applying the DQN algorithm described in the previous section to the ESS integrated scheduling mode of the wind farm, it is necessary to determine the state space S, the action space A and the reward value r returned by the external environment.

A. State Space

The state space of the wind farm consists of the forward-looking electricity price λ_t, the current electricity value E_{t−1} stored in the ESS, and the measurement data of the wind farm, as shown in (4):

S = \{ s \mid s_t = (\lambda_t, E_{t-1}, M_t^{(1)}, M_t^{(2)}, \cdots, M_t^{(m)}) \}    (4)

where M_t represents the real-time and historical output power of the wind turbines, together with the wind speed, wind direction, air pressure, humidity and other real-time, historical and even predicted meteorological data.

B. Action Space

In the integrated scheduling mode, the controller directly outputs the scheduling instruction of the ESS. Therefore, the action space consists of n discrete quantities of the charge/discharge power P_ESS,t of the ESS, as shown in (5):

A = \{ a_1, a_2, \cdots, a_n \}    (5)

C. Reward

In this paper, the external environment refers to the electricity market environment, and the reward is the dispatch income obtained by the wind farm, calculated as follows:

r_t = P_{\mathrm{sys},t} \, \lambda_t \, \Delta t    (6)

where λ_t is the selling price of the wind farm during period t. In the current work, in order to reduce the operating loss, the ESS is scheduled once every hour.

The ESS is limited by its operational constraints. In this paper, a battery pack is chosen as the energy storage component. The charge/discharge power of the ESS can be further expressed as:

P_{\mathrm{ESS},t} = U_{\mathrm{ESS},t}^{\mathrm{dis}} P_{\mathrm{ESS},t}^{\mathrm{dis}} - U_{\mathrm{ESS},t}^{\mathrm{ch}} P_{\mathrm{ESS},t}^{\mathrm{ch}}    (7)

U_{\mathrm{ESS},t}^{\mathrm{dis}} + U_{\mathrm{ESS},t}^{\mathrm{ch}} \leq 1    (8)

where P_ESS,t^dis / P_ESS,t^ch is the discharging/charging power of the ESS in period t, and U_ESS,t^dis / U_ESS,t^ch are the discharging/charging state variables of the ESS, where a value of 0 indicates no and a value of 1 indicates yes. Equation (8) indicates that the charging state and the discharging state cannot exist at the same time.

Battery pack charging/discharging power constraints:

0 \leq P_{\mathrm{ESS},t}^{\mathrm{dis}} \leq P_{\mathrm{ESS},\max}^{\mathrm{dis}} U_{\mathrm{ESS},t}^{\mathrm{dis}}    (9)

0 \leq P_{\mathrm{ESS},t}^{\mathrm{ch}} \leq P_{\mathrm{ESS},\max}^{\mathrm{ch}} U_{\mathrm{ESS},t}^{\mathrm{ch}}    (10)

where P_ESS,max^ch / P_ESS,max^dis are the maximum charge/discharge power allowed by the ESS.

ESS capacity constraints:

E_{\min} \leq E_t \leq E_{\max}    (11)

where E_max / E_min is the maximum/minimum electricity value stored in the ESS.
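The following sketch shows how one hourly step could enforce constraints (7)-(11) and compute the reward (6). It is a simplified illustration: a lossless energy balance E_t = E_{t-1} - P_ESS,t * Δt is assumed, and the battery efficiencies of Table I are deliberately omitted; the numeric limits are taken from Table I.

    # Sketch of one hourly ESS step under constraints (7)-(11) and reward (6).
    # A lossless energy balance is assumed for illustration; Table I battery
    # efficiencies are not modeled here.

    P_DIS_MAX, P_CH_MAX = 7.5, 7.5        # MW, from Table I
    E_MIN, E_MAX = 5.0, 45.0              # MWh, from Table I
    DT = 1.0                              # h, hourly scheduling

    def ess_step(e_prev, p_ess, p_wind, price):
        if p_ess >= 0.0:                                  # discharging (U_dis = 1, U_ch = 0)
            p_ess = min(p_ess, P_DIS_MAX, (e_prev - E_MIN) / DT)
        else:                                             # charging (U_ch = 1, U_dis = 0)
            p_ess = max(p_ess, -P_CH_MAX, -(E_MAX - e_prev) / DT)
        e_new = e_prev - p_ess * DT                       # stored energy respects (11)
        reward = (p_wind + p_ess) * price * DT            # r_t = P_sys,t * lambda_t * dt, eq. (6)
        return e_new, reward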
D. Implementation of Deep Q Network

DQN uses the replay buffer to perform the Q-value iteration and the evaluation network training based on the samples that have already been experienced. The new controller is then used to continue producing samples for subsequent iterations. After experiencing a certain number of samples, the incomes obtained by the controller tend to be stable, and the network parameters in the controller converge. The implementation process of the DQN algorithm is shown in Fig. 4.

Fig. 4 Implementation process based on DQN (initialize the controller parameters; calculate the Q values of all actions in the current state s_t; choose an action a using the ε-greedy policy; observe the new state s_{t+1} and calculate the reward r_t; update the parameters in the controller; set t = t + 1 and repeat until the parameters converge)

The training process of the evaluation network is carried out by the RMSProp optimizer. Whenever the evaluation network has been updated N times, the parameters of the evaluation network are copied to the target network.
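A compact sketch of the loop of Fig. 4 is given below, reusing the pieces sketched in Section III. It is illustrative only: env.reset, env.step, q_eval, q_target, rmsprop_update and q_target.load_from are hypothetical helpers standing in for the market simulation, the two networks, the RMSProp gradient step and the parameter copy, respectively.

    # Compact sketch of the training loop of Fig. 4 (hypothetical helpers:
    # env.reset/step, q_eval, q_target, rmsprop_update, q_target.load_from).

    import random
    from collections import deque

    def train(env, q_eval, q_target, episodes, buffer_size=3000, batch=32,
              n_copy=300, gamma=0.9, epsilon=0.1):
        buffer = deque(maxlen=buffer_size)                 # replay buffer of experience tuples
        updates = 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = epsilon_greedy(q_eval(s), epsilon)     # Section III-C sketch
                s_next, r, done = env.step(a)              # hourly market interaction
                buffer.append((s, a, r, s_next))
                if len(buffer) >= batch:
                    rmsprop_update(q_eval, q_target, random.sample(buffer, batch), gamma)
                    updates += 1
                    if updates % n_copy == 0:
                        q_target.load_from(q_eval)         # copy parameters every N updates
                s = s_next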
V. SIMULATION RESULTS

A. Simulation Data

This paper takes a wind farm with an installed capacity of 50 MW in Jiangsu Province as a case to analyze and verify the proposed method. The battery pack parameters of the wind farm are shown in Table I.

TABLE I. BATTERY PACK PARAMETERS

P_ESS,max^ch (MW)   P_ESS,max^dis (MW)   E_max (MWh)   E_min (MWh)   η_ESS^ch   η_ESS^dis
7.5                  7.5                  45            5             0.85       0.95
The wind farm state space consists of the forward-looking electricity price, the electricity value stored in the ESS, and the real-time wind farm measurement data. The measurement data include the real-time wind speed at the 10 m, 30 m, 50 m and 70 m levels of the wind tower and at hub height; the wind direction at the 10 m, 30 m, 50 m and 70 m levels of the wind tower and at hub height; the wind farm air pressure; the humidity; and the historical wind turbine output power. The entire state space therefore consists of 15 dimensions of data. The electricity price for the electricity sold in each time period is shown in Table II.

TABLE II. ELECTRICITY PRICES AT DIFFERENT TIME INTERVALS

Interval index t     1    2    3    4    5    6    7    8    9    10   11   12
Reserve price (¥)    205  195  185  185  185  190  195  200  205  210  215  220
Interval index t     13   14   15   16   17   18   19   20   21   22   23   24
Reserve price (¥)    225  230  235  240  245  250  255  255  245  235  225  245

In the action space, the charge/discharge power of the ESS is divided into 31 actions: {-7.5, -7.0, ..., 0, ..., 7.0, 7.5} MW.
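The 15-dimensional state of (4) and the 31-action set can be assembled as in the sketch below. The field names are illustrative and do not reflect the authors' data schema.

    # Sketch of the concrete state vector (15 dimensions) and action set (31 actions).

    def build_state(price_next, energy_stored, wind_speeds, wind_dirs,
                    pressure, humidity, hist_power):
        # 1 price + 1 stored energy + 5 wind speeds + 5 wind directions
        # + pressure + humidity + historical output power = 15 dimensions
        state = [price_next, energy_stored, *wind_speeds, *wind_dirs,
                 pressure, humidity, hist_power]
        assert len(state) == 15
        return state

    ACTIONS = [round(-7.5 + 0.5 * i, 1) for i in range(31)]   # {-7.5, -7.0, ..., 7.0, 7.5} MW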

The evaluation network structure is shown in Fig. 5; it is a fully connected neural network with two hidden layers. In the training process of the evaluation network, the learning rate α is set as 0.001, the memory of the replay buffer is set as 3000 samples, N is set as 300, the update interval of the target network is 300, the reward discount factor γ is 0.9, and the ε in the ε-greedy policy is set as 0.1.

Fig. 5 Evaluation network structure (input layer: the 15 state dimensions λ_t, E_{t−1}, M_t^(1), ..., M_t^(13); two hidden layers of 60 neurons each; output layer: the 31 Q values Q_1, ..., Q_31)
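A minimal forward pass consistent with the 15-60-60-31 structure of Fig. 5 is sketched below. The ReLU activation and the random weight initialization are assumptions made only for this sketch; the paper does not specify them.

    # Minimal forward pass of a 15-60-60-31 fully connected network as in Fig. 5.
    # ReLU activation and the initialization scheme are assumptions of this sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.1, (15, 60)), np.zeros(60)
    W2, b2 = rng.normal(0, 0.1, (60, 60)), np.zeros(60)
    W3, b3 = rng.normal(0, 0.1, (60, 31)), np.zeros(31)

    def q_values(state):                       # state: array of 15 features
        h1 = np.maximum(0.0, state @ W1 + b1)  # hidden layer 1 (60 neurons)
        h2 = np.maximum(0.0, h1 @ W2 + b2)     # hidden layer 2 (60 neurons)
        return h2 @ W3 + b3                    # 31 Q values, one per discrete action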
B. Simulation Results

The gains of the wind farm fluctuate with the fluctuations of the wind power. Fig. 6 shows the variation of the wind farm's average income with the number of samples experienced by the controller. In this optimization process, the mapping from the state space to the action space is continuously optimized; the income of the wind farm shows a significant rising phase as the number of samples increases and then settles into a stable fluctuation range. When the income curve is stable, the average income is 6216.1 ¥/h. The stabilized income of the wind farm indicates that the parameters of the controller have converged.

Fig. 6 Change curve of the average gain of the wind farm (average revenue per time interval in ¥, plotted against the number of samples experienced by the controller, from 0 to 33600)

Fig. 7 Change curve of the electricity stored in the ESS
The fluctuation curve of the electricity stored in the ESS is shown in Fig. 7. It can be seen that the value of the electricity stored in the ESS can always be maintained in a moderate state, thereby avoiding the situation in which the ESS loses its adjustment capability because the stored energy reaches the upper or lower limit of the storage.

C. Comparative Studies

In order to further illustrate the effectiveness of the proposed method, this paper compares and analyzes the traditional scheduling methods, in which the wind power prediction and the decision making are separate from each other.

In the wind power prediction, forward-looking wind power prediction values are obtained from the wind farm's high-dimensional state space after correlation analysis and wind power prediction algorithms. The mean absolute error (MAE) of the forecasted wind power is used as the evaluation index of the prediction stage:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - y_i^{\mathrm{act}} \right|}{C}    (12)

where y_i / y_i^act is the forecasted/actual wind power at time i, n is the number of samples, and C is the installed capacity of the wind farm.
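Because (12) is normalized by the installed capacity C, it is a capacity-normalized MAE; a one-line computation is sketched below for clarity (the numbers in the usage comment are illustrative only).

    # Capacity-normalized MAE of (12); forecasts and actuals in MW, capacity C in MW.

    def mae(y_pred, y_act, capacity):
        return sum(abs(p - a) for p, a in zip(y_pred, y_act)) / (len(y_pred) * capacity)

    # Example: mae([20.0, 31.5], [22.0, 30.0], 50.0) -> 0.035, i.e. 3.5 %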
In the decision making, the controller uses the forecasted wind power value as the decision basis to determine the charge/discharge power of the ESS. At this stage, the wind power uncertainty is reflected in the prediction error of the wind power. The DQN algorithm and scenario-based stochastic programming (SSP) are used, respectively, to optimize the control strategy in the decision-making phase. In the SSP based optimization process, the prediction error of the wind power is assumed to follow a normal distribution [14]. Table III shows the scheduling incomes of the wind farm under different conditions after 4000 hours of operation. The comparison conclusions are as follows.

TABLE III. AVERAGE GAINS OF THE WIND FARM UNDER DIFFERENT CONDITIONS

Scheduling mode               Prediction (MAE)   Decision making   Average income (¥/hour)   Condition number
Traditional scheduling mode   5.17%              DQN               5611.1                     1
Traditional scheduling mode   5.17%              SSP               5041.5                     2
Traditional scheduling mode   3.94%              DQN               6053.1                     3
Traditional scheduling mode   3.94%              SSP               5362.9                     4
Integrated scheduling mode    —                  DQN algorithm     6216.1                     5

By comparing conditions 1 and 5, or 3 and 5, it can be concluded that, compared with the traditional scheduling mode, the integrated scheduling mode brings higher benefits to the wind farm.

By comparing conditions 1 and 2, or 3 and 4, it can be concluded that, compared with the mathematical optimization algorithm, DQN does not require specific assumptions or descriptions of the wind power uncertainty, which avoids the loss of revenue caused by the modeling error of the probability distribution.

In summary, the integrated scheduling mode based on DQN (condition 5) maximizes the scheduling benefits of the wind farm, which proves the effectiveness of the proposed integrated scheduling mode and the DRL algorithm in wind farm control. Additionally, there is a large gap in calculation time between SSP and DQN: the latter requires less calculation time and memory space.

VI. CONCLUSION

This paper proposes a DQN based ESS scheduling method for wind farms under the integration of prediction and decision to improve their income in the electricity market environment. The conclusions are as follows:

1) An integrated scheduling mode, which combines the wind power prediction and the ESS decision making, is proposed to effectively utilize the hidden decision-making information in the original data.

2) DQN, a DRL algorithm, is introduced to construct the end-to-end controller under the integrated scheduling mode. No assumption about the uncertainty of wind power is required during the optimization, which further improves the scheduling income.

3) The proposed integrated scheduling mode and the advantages of the DQN algorithm have been validated by simulation results for an ESS-integrated wind farm located in Jiangsu Province, China.

ACKNOWLEDGMENT

This work was supported by the State Grid Corporation of China (52060018000X).
REFERENCES

[1] Wang Qingran, Xie Guohui, Zhang Lizi, "An integrated generation consumption dispatch model with wind power," Automation of Electric Power Systems, vol. 35, no. 5, pp. 15-18, 30, 2011.
[2] Kyung S. K., McKenzie K. J., Liu Y. L., et al., "A study on applications of energy storage for the wind power operation in power systems," in IEEE Power Engineering Society General Meeting, IEEE, 2006.
[3] Yan Gangui, Liu Jia, Cui Yang, et al., "Economic evaluation of improving the wind power scheduling scale by energy storage system," Proceedings of the CSEE, vol. 36, no. 22, pp. 45-52, 2013.
[4] M. Young, The Technical Writer's Handbook. Mill Valley, CA: University Science, 1989.
[5] Ding H., Hu Z., Song Y., "Stochastic optimization of the daily operation of wind farm and pumped-hydro-storage plant," Renewable Energy, vol. 48, no. 6, pp. 571-578, 2012.
[6] Wu Xiong, Wang Xiuli, Li Jun, et al., "A joint operation model and solution for hybrid wind energy storage systems," Proceedings of the CSEE, vol. 33, no. 13, pp. 10-17, 2013.
[7] Li J., Wan C., Xu Z., "Robust offering strategy for a wind power producer under uncertainties," in IEEE International Conference on Smart Grid Communications, IEEE, 2016.
[8] Liu Guojing, Han Xueshan, Wang Shang, et al., "Optimal decision-making in the cooperation of wind power and energy storage based on reinforcement learning algorithm," Power System Technology, vol. 40, no. 9, pp. 2729-2736, 2016.
[9] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 38, no. 2, pp. 156-172, 2008.
[10] R. S. Sutton and A. G. Barto, "Reinforcement learning: An introduction," Machine Learning, vol. 8, no. 3-4, pp. 225-227, 1992.
[11] C. Szepesvari, "Algorithms for reinforcement learning," vol. 4, no. 1, pp. 632-636, 2009.
[12] Deng Li, Yu Dong, "Deep learning: methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197-387, 2014.
[13] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[14] F. Yao, Z. Y. Dong, K. Meng, Z. Xu, H. C. Iu, and K. P. Wong, "Quantum-inspired particle swarm optimization for power system operations considering wind power uncertainty and carbon tax in Australia," IEEE Transactions on Industrial Informatics, vol. 8, no. 4, pp. 880-888, 2012.
