
Green Energy and Intelligent Transportation 1 (2022) 100028
Full length article
Deep reinforcement learning based energy management strategy for fuel cell/battery/supercapacitor powered electric vehicle
Jie Wang, Jianhao Zhou *, Wanzhong Zhao
College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Highlights
• TD3 is leveraged to formulate EMS for fuel cell-battery-supercapacitor HES.
• Battery health and fuel cell lifespan are considered within the proposed EMS.
• The superiority of TD3 based EMS is exhibited towards DDPG and NEMS based EMSs.
Keywords: Deep reinforcement learning; Energy management strategy; Fuel cell; Hybrid electric vehicle; TD3

Abstract
Vehicles that use a single fuel cell as their power source often suffer from slow dynamic response and cannot recover braking energy, so fuel cell vehicles on the market today are mainly hybrids. This study takes a fuel cell hybrid commercial vehicle as the research object, proposes a fuel cell/battery/supercapacitor energy topology, and designs an energy management strategy (EMS) for this topology based on the twin delayed deep deterministic policy gradient (TD3) algorithm. The strategy takes fuel cell hydrogen consumption, fuel cell life loss and battery life loss as optimization goals. The supercapacitor coordinates the power output of the fuel cell and the battery, which widens the range over which the fuel cell and the battery can be optimized. Compared with the deep deterministic policy gradient (DDPG) strategy and the nonlinear programming strategy, the proposed strategy reduces hydrogen consumption, fuel cell loss and battery loss, which improves the economy and service life of the power system. Because the proposed EMS is built on the TD3 algorithm from deep reinforcement learning, it optimizes several indicators simultaneously, which helps prolong the service life of the power system.
* Corresponding author.
E-mail addresses: akitaw@foxmail.com (J. Wang), zhoujianhao@nuaa.edu.cn (J. Zhou), zwz@nuaa.edu.cn (W. Zhao).
https://doi.org/10.1016/j.geits.2022.100028
Received 14 February 2022; Received in revised form 8 May 2022; Accepted 4 August 2022
Available online 19 September 2022
2773-1537/© 2022 The Authors. Published by Elsevier Ltd on behalf of Beijing Institute of Technology Press Co., Ltd. This is an open access article under the CC BY-
NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Nomenclature

Abbreviation
EMS  Energy management strategy
ECMS  Equivalent hydrogen consumption minimization strategy
RL/DRL  Reinforcement learning/deep reinforcement learning
MFC  Multi-stack fuel cell system
HEV  Hybrid electric vehicle
HESS  Hybrid energy storage system
SOC  State of charge
ΔSOC  Variation of SOC
DNN  Deep neural network
DDPG  Deep deterministic policy gradient
DQN  Deep Q-network
DPG  Deterministic policy gradient
PG  Policy gradient
PEMFC  Proton exchange membrane fuel cell
FLC  Fuzzy logic control
GA  Genetic algorithm
DP  Dynamic programming
VCU  Vehicle control unit
PMP  Pontryagin's minimum principle
MPPT  Maximum power point tracking
MPC  Model predictive control
EF  Equivalence factor
SA  Simulated annealing
LMIs  Linear matrix inequalities
MDP  Markov decision process
MCU  Motor controller unit
DC  Direct current
WLTP class 2  Worldwide Harmonized Light Vehicle Test Procedures class 2
WVUTC  West Virginia University 5-Peak Truck Cycle
TCMS  Travel costs minimization strategy
ad-TCMS  APF and DDPG enhanced TCMS
AC  Actor-Critic
TD3  Twin delayed deep deterministic policy gradient

Symbols
$U_{fc}$  Fuel cell voltage
$N$  Number of cells
$E_{nernst}$  Ideal potential
$V_{act}$  Activation overpotential
$V_{ohm}$  Ohmic overpotential
$V_{con}$  Concentration overpotential
$\Delta G$  Gibbs free energy
$\Delta S$  Variation of system entropy
$F$  Faraday constant
$T$  Cell temperature
$P_{H_2}$  Partial pressure of hydrogen
$P_{O_2}$  Partial pressure of oxygen
$\xi_i$  Parametric coefficients
$I_{st}$  Stack current
$C_{O_2}$  Concentration of dissolved oxygen
$R_M$  Equivalent membrane resistance
$R_C$  Equivalent contact resistance
$B$  Constant dependent on the fuel cell type and its operation mode
$J$  Actual current density
$J_{max}$  Maximum limit of current density
$P_{fc}$  Net power
$P_{stack}$  Gross power
$P_{aux}$  Auxiliary power
$P_{cp}$  Compressor power
$C_p$  Heat capacity
$T_{air}$  Temperature of air
$\lambda$  Curve fitting coefficients
$\eta_{fc,ref}$  Reference efficiency
$I_{fc,min}$  Minimum allowable output current
$I_{fc,max}$  Maximum allowable output current
$I_{charge\_lim}$  Battery charge limit
$I_{discharge\_lim}$  Battery discharge limit
$\kappa$  Adiabatic coefficient
$p_{in}$  Inlet pressure of air
$p_{out}$  Outlet pressure of air
$q_{air}$  Mass flow rate of air
$\eta_{mech}$  Efficiency of compressor
$\eta_{mot}$  Efficiency of drive motor
$\eta_{DC\text{-}DC}$  Efficiency of the DC-DC converter
$\eta_m$  Electric motor efficiency
$T_{mot}$  Motor torque
$\omega_{mot}$  Motor rotation speed
$P_{req}$  Traction power
$P_{mot}$  Motor power
$V_{oc}$  Battery open circuit voltage
$I_{bat}$  Battery current
$R_{int}$  Battery internal resistance
$P_{bat}$  Requested battery power
$Q_b$  Battery capacity
$\eta_{cou}$  Battery coulomb efficiency
$v$  Speed
$f$  Rolling resistance coefficient
$C_D$  Aerodynamic drag coefficient
$A$  Frontal area
$\rho$  Air density
$a$  Acceleration
$\alpha$  Road slope
$I_{ref}$  Reference current
$D$  Virtual distance
$I_{low\_lim}$  Minimum target operating current
$I_{high\_lim}$  Maximum target operating current
$a_{fc}$  Adjustable parameter
$SOC_{ref}$  Reference SOC
$SOC_{min}$  Minimum SOC
$SOC_{max}$  Maximum SOC
$SOC_{high\_lim}$  Maximum target SOC
$SOC_{low\_lim}$  Minimum target SOC
$a_b$  Adjustable parameter
$s_t$  System state
$\pi(a_t, s_t)$  Current strategy
$a_t$  Action
$\theta$  Weight of actor-network
$u$  Deterministic policy
$\omega$  Weight of critic-network
$Q(s, a, \omega)$  Value-action function
$\theta'$  Weight of target actor-network
$a_{t+1}$  Next optimal action
$s_{t+1}$  Next state
$\omega'$  Weight of target critic-network
$y_t$  Q-value
$r_t$  Reward function
$J(\omega)$  Loss function
$\nabla J(\theta)$  Policy gradient
$\tau$  Soft updating factor
$\mathcal{N}$  Exploratory noise
$\Gamma_i$  Weighting factors
$\sigma_{fc}$  Average prices of hydrogen
$\dot{C}_{H_2}^{DFC}$  Cost of hydrogen consumption
$\dot{C}_{elec}$  Equivalent electricity consumption
$C_{deg}^{DFC}$  Degradation and depreciation cost
$d_i$  Unit degradation costs
$P_{fcmax}$  Maximum power of the fuel cell
$\mu_i$  Power degradation rate
$I_{idle}$  Current of FC
$n_{max}$  Maximum start/stop times
$M_{H_2}$  Molar mass of hydrogen
$U_{req}$  Requested voltage
$I_{req}$  Requested current
1. Introduction

Fuel cell hybrid vehicles generally contain more than one power source, so their energy management is more complicated than that of fuel cell electric vehicles with a single power source. In the automotive field, existing energy management strategies (EMS) for multiple power sources generally take fuel economy and driving range as control goals. EMS can be broadly divided into rule-based and optimization-based control strategies. With the development of artificial intelligence (AI) technology, many scholars have begun to apply AI algorithms to EMS, and AI-based EMS have gradually emerged.

Rule-based EMS are easy to implement and apply, but require sufficient experience and on-line calibration. The designer generally formulates the EMS from the current road conditions and an experience-based understanding of the hybrid power system. Typical examples are state machine/operation mode control, power-following control, power decoupling and fuzzy logic control (FLC) [1–4]. Wang et al. [5] proposed an EMS for hybrid electric vehicles (HEV) with fuel cell/battery/supercapacitor power sources in which the power was distributed among the three sources in real time with the help of power demand prediction; its hydrogen consumption and SOC maintenance were verified under real driving conditions. Gao et al. [6] proposed an EMS for a fuel cell hybrid bus in which FLC determines the fuel cell power allocation from the required power and the regenerative braking power; experiments verified that the strategy follows the demand power well.

Optimization-based EMS can be divided into global optimization strategies and real-time optimization strategies. Global optimization algorithms, mainly dynamic programming (DP), genetic algorithms (GA) and particle swarm algorithms, are generally run offline under the premise of known driving conditions or power requirements. In reality, road conditions change dynamically, which makes implementing them in the Vehicle Control Unit (VCU) impractical, and their high computational cost further limits their application in energy management. However, because their result is a global optimum, they can provide a data basis for other online and real-time control strategies. Xu et al. [7] proposed a DP-based EMS to optimize the driving cost of the fuel cell and lithium battery.

Unlike global optimization algorithms, real-time optimization algorithms allocate energy by minimizing the instantaneous cost function of the system, which greatly reduces the computational cost. Examples are Maximum Power Point Tracking (MPPT), the Equivalent Consumption Minimization Strategy (ECMS) and Model Predictive Control (MPC). Han et al. [8] proposed an ECMS based on an adaptive equivalent factor: while a fuel cell hybrid vehicle drives in a fixed operating condition, the ECMS automatically selects the optimal equivalent factor according to the operating conditions. Shen et al. [9] used a fuzzy modeling framework to build a robust MPC controller and expressed the constraints of the optimization problem with the Linear Matrix Inequality (LMI) technique. The control effect was verified, but because fuel cell and battery efficiency and loss were not considered, the strategy is likely to incur unnecessary cost.

With the increasing application of AI-based algorithms, many researchers have begun to adopt deep reinforcement learning (DRL) based EMS. DRL combines the perception ability of deep learning with the decision-making ability of reinforcement learning. In EMS development, deep learning mainly determines the parameters of the control algorithm, perceives the current state of the environment, or predicts the state at the next moment for further control and analysis, while reinforcement learning controls and makes decisions based on real-time feedback. Therefore, unlike traditional control schemes, DRL algorithms can learn control actions through continuous trial-and-error interaction with the environment under appropriate reward and punishment mechanisms [10,11]. Moreover, DRL does not require a detailed physical model and can continuously learn and optimize control actions, so it is well suited to complex dynamic systems and can optimize multiple objectives simultaneously. Liu et al. [12] proposed a Q-learning-based EMS to allocate engine torque, which showed near-optimal performance in comparison with a DP-based EMS. Yuan et al. [13] proposed a Q-learning-based EMS for plug-in FCHEVs that optimizes fuel cell start-stop and thereby helps suppress fuel cell aging. Reddy et al. [14] proposed a DRL-based EMS to reduce battery loss and fuel consumption and to improve economy while maintaining battery SOC. Li et al. [15] used DRL to develop an EMS for series hybrid electric vehicles, integrating historical accumulated trip information to control the state of charge more effectively. Li et al. [16] proposed a DRL-based EMS for an electric vehicle hybrid battery system that accounts for the electrical and thermal characteristics of the battery, aiming to reduce energy losses and improve the electrical and thermal safety of the entire system. Han et al. [17] proposed an EMS for dual-motor-driven hybrid crawler vehicles based on a double deep Q-learning algorithm, which prevents the training process from falling into over-optimistic estimation of the strategy value and shows significant advantages in convergence rate and optimization performance. In conclusion, the above scholars have successfully applied reinforcement learning to the energy management of fuel cell hybrid electric vehicles, providing ideas and references for this research. However, the EMS proposed in the above research are not well suited to energy topologies with multiple power sources, such as fuel cell/battery/supercapacitor.

Using a fuel cell and a lithium battery as a hybrid energy source (HES) solves the slow-response problem of vehicles powered by a fuel cell alone. However, this power topology also has certain defects, such as the rapid depletion of battery health. Therefore, the fuel cell/lithium battery/supercapacitor based HES was proposed to improve battery health and longevity. Liu et al. [18] proposed a fuel cell/battery/supercapacitor based EMS and used ADVISOR to carry out a systematic simulation analysis of a hybrid vehicle. Compared with fuel cell/battery vehicles, the supercapacitor widens the range over which the fuel cell and the battery pack can be optimized, improving the economy and service life of the power system. The disadvantage is that it increases the complexity of the system and requires more optimization objectives to be considered. Most existing EMS for three power sources only optimize hydrogen consumption and do not consider fuel cell degradation and battery longevity. In this case, the EMS obtained by training and verification cannot achieve good optimization results in other complex systems, and its generalization ability is limited. Cai et al. [19] proposed a decentralized EMS based on hybrid virtual impedance droop control for a fuel cell/battery/supercapacitor hybrid system and verified the reliability of the strategy by numerical simulation. This strategy integrates various evaluation indicators of dynamic systems and achieves good optimization results, but it transfers poorly and has limited applicability in other complex systems. As a complex dynamic structure, the fuel cell/lithium battery/supercapacitor based HES has multiple optimization objectives. DRL is suitable for this complex structure and generalizes better, and the trained EMS also transfers well.

In this study, a DRL-based EMS is proposed for logistics trucks with a fuel cell/battery/supercapacitor topology. The EMS aims to reduce overall costs by stabilizing the SOC of the battery and the supercapacitor, reducing hydrogen consumption, and extending the life of the fuel cell and the battery. The rest of the paper is organized as follows. Section 2 builds the topology of the fuel cell hybrid vehicle and the fuel cell/battery/supercapacitor HES. Section 3 proposes a novel DRL-based EMS for the HES. Section 4 numerically verifies the proposed EMS under the WLTP class 3 driving cycle and analyzes the results. Section 5 summarizes the conclusions of the study.

Fig. 2. Fuel cell/battery/supercapacitor hybrid logistics truck topology.
2. Modeling of hybrid energy sources and truck

In this paper, a long-endurance logistics truck is taken as the reference vehicle, and the fuel cell/battery/supercapacitor HES is adopted as its energy source. The proposed DRL-based EMS is in principle data-driven and model-free, and is therefore generally not sensitive to the specific topology of the hybrid powertrain. The HEV model and driving cycles described in the following sections are used in the numerical experiments to generate the sample sets on which the proposed DRL agent is trained. The corresponding signals are assumed to be measurable or observable via onboard sensors once the EMS is deployed on the HEV.

2.1. Topology of HES

This study needs to optimize the performance of the fuel cell and the battery in the fuel cell/battery/supercapacitor hybrid energy system at the same time. To balance cost and performance, the topology shown in Fig. 1 is adopted.

Fig. 1. Topology of fuel cell/battery/supercapacitor hybrid system.

In this topology, the output voltages of the fuel cell and the supercapacitor fluctuate greatly. If they were connected directly to the DC bus, the system would be unstable and battery life would be affected. The fuel cell is therefore connected to the DC bus through a unidirectional DC/DC converter, and the supercapacitor through a bidirectional DC/DC converter. In this way, the voltages of the fuel cell and the supercapacitor can be directly controlled to match the DC bus. The supercapacitor coordinates the power output of the fuel cell and the battery, supplying supplementary power so that the fuel cell and the battery can be operated as close to their optimum as possible. The resulting vehicle topology is shown in Fig. 2.

2.2. Fuel cell

Within the fuel cell stack, the continuous catalytic reaction between hydrogen and oxygen occurs on the electrodes and consumes energy to overcome certain resistances, which is known as the polarization effect [20]. This paper adopts the Amphlett static model [21] to predict the performance of the fuel cell.

The overpotential losses consist of three parts, as shown in Eq. (1): the activation overpotential $V_{act}$, the ohmic overpotential $V_{ohm}$ and the concentration overpotential $V_{con}$ [21].

\[ U_{fc} = N \left( E_{nernst} - V_{act} - V_{ohm} - V_{con} \right) \tag{1} \]

According to the Nernst equation and the variation of the Gibbs free energy, the thermodynamic potential can be expressed as [28]:

\[ E_{nernst} = \frac{\Delta G}{2F} + \frac{\Delta S}{2F}\left( T - T_{ref} \right) + \frac{RT}{2F}\left( \ln P_{H_2} + 0.5 \ln P_{O_2} \right) \tag{2} \]

where $\Delta G$ is the Gibbs free energy; $\Delta S$ is the variation of system entropy; $F$ is the Faraday constant; $R$ is the universal gas constant; $T$ is the cell temperature and $T_{ref}$ is the reference temperature; $P_{H_2}$ and $P_{O_2}$ represent the partial pressures of hydrogen and oxygen, respectively.

The activation loss is the voltage required to overcome the activation energy of the electrochemical reaction on the catalytic surface, which is empirically expressed as [21]:

\[ V_{act} = \xi_1 + \xi_2 T + \xi_3 T \ln C_{O_2} + \xi_4 \ln I_{st} \tag{3} \]

where $\xi_1$, $\xi_2$, $\xi_3$ and $\xi_4$ are parametric coefficients that are affected by temperature and pressure; $I_{st}$ is the stack current; $C_{O_2}$ is the concentration of dissolved oxygen at the cathode catalyst layer.

The ohmic overpotential is the voltage drop caused by the equivalent internal resistance of the cell during electricity generation, which is expressed as [21]:

\[ V_{ohm} = I_{st} R_{int} = I_{st} \left( R_M + R_C \right) \tag{4} \]

where $R_M$ denotes the equivalent membrane resistance to proton conduction and $R_C$ refers to the equivalent contact resistance to electron conduction.

The concentration overpotential represents the voltage drop resulting from the decrease in the concentration of oxygen and hydrogen. It can be defined by [21]:

\[ V_{con} = -B \ln\left( 1 - \frac{J}{J_{max}} \right) \tag{5} \]

where $B$ is a constant dependent on the fuel cell type and its operation mode; $J$ is the actual current density and $J_{max}$ is the maximum limit of current density.
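The polarization relations in Eqs. (1) and (3)–(5) can be checked numerically with a short sketch such as the one below. It is only an illustration under stated assumptions: every numeric value (the coefficients passed as `xi`, the resistances, `b`, the cell count and the Nernst potential) is a hypothetical placeholder rather than an identified parameter of the stack used in this paper, and the concentration overpotential is assumed to carry a leading minus sign so that it is a positive loss. The Nernst potential of Eq. (2) is passed in as a plain number.

```python
import math


def activation_overpotential(i_st, temp, c_o2, xi):
    """Eq. (3): V_act = xi1 + xi2*T + xi3*T*ln(C_O2) + xi4*ln(I_st)."""
    xi1, xi2, xi3, xi4 = xi
    return xi1 + xi2 * temp + xi3 * temp * math.log(c_o2) + xi4 * math.log(i_st)


def ohmic_overpotential(i_st, r_m, r_c):
    """Eq. (4): V_ohm = I_st * (R_M + R_C)."""
    return i_st * (r_m + r_c)


def concentration_overpotential(j, j_max, b):
    """Eq. (5), assumed sign convention: V_con = -B * ln(1 - J/J_max)."""
    if not 0.0 <= j < j_max:
        raise ValueError("current density must satisfy 0 <= J < J_max")
    return -b * math.log(1.0 - j / j_max)


def stack_voltage(n_cells, e_nernst, v_act, v_ohm, v_con):
    """Eq. (1): U_fc = N * (E_nernst - V_act - V_ohm - V_con)."""
    return n_cells * (e_nernst - v_act - v_ohm - v_con)


# Hypothetical operating point; all numbers are illustrative placeholders.
v_act = activation_overpotential(i_st=100.0, temp=338.15, c_o2=1.0,
                                 xi=(0.2, 0.0, 0.0, 0.03))
v_ohm = ohmic_overpotential(i_st=100.0, r_m=8e-4, r_c=2e-4)
v_con = concentration_overpotential(j=0.5, j_max=1.0, b=0.05)
u_fc = stack_voltage(n_cells=300, e_nernst=1.2,
                     v_act=v_act, v_ohm=v_ohm, v_con=v_con)
print(f"V_act={v_act:.3f} V, V_ohm={v_ohm:.3f} V, "
      f"V_con={v_con:.3f} V, U_fc={u_fc:.1f} V")
```

Keeping each loss term in its own function mirrors the structure of Eq. (1), so an identified parameter set could be substituted without touching the stack-level sum.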
Within the above mathematical model, the simulation accuracy of the stack characteristics depends on a group of parametric coefficients. These parameters were identified using the GA method based on the experimental results provided by the manufacturer [21].

For the fuel cell system, the net power $P_{fc}$ is the difference between the gross power $P_{stack}$ and the auxiliary power $P_{aux}$, which can be computed by:

\[ P_{stack} = U_{fc} I_{st}, \qquad P_{fc} = P_{stack} - P_{aux} \tag{6} \]

The electric air compressor system and the cooling system are the main auxiliary equipment of the fuel cell system [22]. The power demand of the compressor can be expressed as:

\[ P_{aux} \approx P_{cp} = \frac{C_p T_{air}}{\eta_{mech}\,\eta_{mot}} \left[ \left( \frac{p_{out}}{p_{in}} \right)^{\frac{\kappa - 1}{\kappa}} - 1 \right] q_{air} \tag{7} \]

where $P_{cp}$ refers to the compressor power; $C_p$ and $T_{air}$ denote the heat capacity and temperature of air, respectively; $\kappa$ denotes the adiabatic coefficient; $p_{in}$ and $p_{out}$ are the inlet and outlet pressures of air, respectively; $q_{air}$ is the mass flow rate of air; $\eta_{mech}$ and $\eta_{mot}$ stand for the efficiency of the compressor and of its drive motor, respectively.

The hydrogen consumption rate can be computed by [23]:

\[ \dot{m}_{H_2} = \frac{N M_{H_2}}{nF} I_{st} \tag{8} \]

where $N$ is the number of cells in the stack, $M_{H_2}$ is the molar mass of hydrogen, and $n$ is the number of transferred charges.

For a given hydrogen consumption rate, the efficiency of the fuel cell can be defined as the ratio of the output power of the fuel cell to the power released by the hydrogen:

\[ \eta_{fc} = \frac{P_{fc}}{\dot{m}_{H_2}\, LHV} \tag{9} \]

where $LHV$ is the lower heating value of hydrogen.

2.3. Battery

The Rint equivalent circuit model, also known as the resistance model, simplifies the lithium battery to an ideal voltage source connected in series with a resistor [24]. The resistor represents the ohmic internal resistance and the polarization internal resistance of the battery. The circuit structure is shown in Fig. 3.

Fig. 3. Equivalent circuit model structure.

From the equivalent circuit, the battery output voltage $V_b$ can be expressed as follows:

\[ V_b = V_{oc} - I_b R_{int} \tag{10} \]

The output power $P_b$ of the lithium battery is as follows:

\[ P_b = V_b I_b \tag{11} \]

Assuming that the output power of the lithium battery is known, the current of the lithium battery can be calculated according to the following formula:

\[ I_b = \frac{V_{oc} - \sqrt{V_{oc}^2 - 4 R_{int} P_b}}{2 R_{int}} \tag{12} \]

The characteristic curve of a single cell of the lithium battery used in the experiment is shown in Fig. 4.

Fig. 4. Lithium battery characteristic curve.

The SOC of the lithium battery can be calculated by the ampere-hour method. Let the rated capacity of the lithium battery be $Q_b$. If the SOC of the battery at time $k-1$ is known, the SOC of the battery at time $k$ can be expressed as follows:

\[ SOC(t_k) = SOC(t_{k-1}) - \frac{\eta_b}{Q_b} \int_{k-1}^{k} I_{bat}\, dt \tag{13} \]

where $\eta_b$ is the charge-discharge efficiency of the battery.

2.4. Supercapacitor

Although the energy density of a supercapacitor is low, it offers fast charging, high charging/discharging power, a long lifespan and high energy conversion efficiency, and it has been recognized as an ideal energy buffer for electric vehicles [25]. In this study, the supercapacitor is mainly used to coordinate the power output of the fuel cell and the battery so that the control targets can be better achieved. When the instantaneous power demand is large, the supercapacitor supplies the peak power; during regenerative braking it absorbs the peak power, which relieves the load on the battery and extends battery life [26]. Compared with the other optimization items, the operating loss cost of the supercapacitor is almost negligible, so it is not included in the overall cost optimization index [27].

To simplify the calculation, a classic RC circuit is used to model the supercapacitor: it is regarded as an ideal capacitor connected in series with a resistor, which represents the internal resistance $R_c$ of the supercapacitor. The RC equivalent circuit is shown in Fig. 5. $V_{c\_oc}$ and $V_c$ represent the voltage
across the ideal capacitor and the terminal voltage of the supercapacitor, respectively, and $I_c$ is the output current of the supercapacitor.

Fig. 5. Super capacitor RC circuit.

From the RC circuit, the output power $P_{sc}$ of the supercapacitor can be expressed as follows:

\[ P_{sc} = V_c I_c = \left( V_{c\_oc} - I_c R_c \right) I_c \tag{14} \]

Similar to the lithium battery, the state of charge of the supercapacitor at time $k$ can be expressed as follows:

\[ SOC_{sc}(k) = SOC_{sc}(k-1) - \frac{I_c(k)}{Q_c} \tag{15} \]

where $Q_c$ is the maximum charge of the supercapacitor.

In addition, the voltage of the supercapacitor is closely related to its state of charge. In general, the ideal open circuit voltage at the current moment can be expressed as follows:

\[ V_{c\_oc}(k) = SOC_{sc}(k)\, V_{c\_oc\_max} \tag{16} \]

where $V_{c\_oc\_max}$ is the maximum value of the supercapacitor voltage.

2.5. DC/DC converter

A DC-DC converter is needed to implement bidirectional boost and buck (step-down) operation. In this work, two DC-DC converters in parallel are required for DFC, as shown in Fig. 1. To facilitate energy analysis and cost calculation, the efficiency of the DC-DC converters is regarded as a fixed value, so that the DC-DC converter can be modeled as follows:

\[ U_{req} I_{req} = \eta_{DCDC}\, U_{fc} I_{st} \tag{17} \]

where $U_{req}$ and $I_{req}$ represent the requested voltage and current from the DC bus, respectively; $\eta_{DCDC}$ refers to the efficiency of the DC-DC converter.

2.6. Motor

The motor used in this paper is directly connected to the end of the drive shaft. It can operate as a motor to provide traction torque, or as a generator to absorb braking torque and thereby achieve regenerative braking. The motor efficiency $\eta_m$ depends on which of the two working modes is active and on the motor torque $T_{mot}$ and speed $\omega_{mot}$. The motor power is calculated as follows:

\[ P_{mot} = P_{req} = \begin{cases} T_{mot}\, \omega_{mot}\, \eta_m^{\operatorname{sgn}(P_{req})}, & \text{(Motor mode)} \\ \eta_m^{\operatorname{sgn}(P_{req})}\, T_{mot}\, \omega_{mot}, & \text{(Generator mode)} \end{cases} \tag{18} \]

where $P_{mot}$ is the output power of the motor; $P_{req}$ is the required power of the vehicle.

The motor efficiency map used in this paper is shown in Fig. 6. For a known speed and torque, the motor efficiency can be obtained from a look-up table.

Fig. 6. Motor efficiency map.

2.7. Longitudinal dynamics model of logistics truck

The research object of this paper is a fuel cell hybrid logistics truck, and its main parameters are shown in Table 1. According to the power balance during driving, the required power can be obtained as follows:

\[ P_{req} = \left( m g f \cos\alpha + 0.5 \rho C_D A v^2 + m a + m g \sin\alpha \right) \frac{v}{1000} \tag{19} \]
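As a quick numerical check of Eq. (19), the following sketch evaluates the required power for a given speed, acceleration and road slope. It is a minimal illustration, not the simulation code used in this study. The default arguments are the vehicle parameters listed in Table 1, and three points are assumptions: the speed `v` is taken in m/s (so that dividing by 1000 yields kW), the slope `alpha` is taken in radians, and the gravitational acceleration is read as 9.8 m/s^2. The function name and argument names are hypothetical.

```python
import math


def traction_power_kw(v, a, alpha=0.0, m=5500.0, g=9.8, f=0.012,
                      c_d=0.55, rho=1.209, area=6.55):
    """Required power from Eq. (19), in kW.

    v     : vehicle speed in m/s (assumed unit)
    a     : acceleration in m/s^2 (negative while braking)
    alpha : road slope in radians (assumed unit)
    Remaining defaults follow Table 1; g = 9.8 m/s^2 is an assumed reading.
    """
    rolling = m * g * f * math.cos(alpha)       # rolling resistance force, N
    aero = 0.5 * rho * c_d * area * v ** 2      # aerodynamic drag force, N
    inertia = m * a                             # acceleration force, N
    grade = m * g * math.sin(alpha)             # grade resistance force, N
    return (rolling + aero + inertia + grade) * v / 1000.0


# Steady cruise at 20 m/s on a flat road, then braking at -2 m/s^2.
p_cruise = traction_power_kw(v=20.0, a=0.0)
p_brake = traction_power_kw(v=20.0, a=-2.0)
print(f"cruise: {p_cruise:.2f} kW, braking: {p_brake:.2f} kW")
```

A negative result corresponds to the braking case, where the power demand changes sign and part of it can be recovered by regenerative braking.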
In Eq. (19), $P_{req}$ is the power required to drive the vehicle and $a$ is the acceleration of the vehicle.

Table 1
Main dynamic parameters of the vehicle.
Parameter name                            Numerical value
Vehicle mass, m (kg)                      5,500
Gravitational acceleration, g (m·s⁻²)     9.8
Wheel radius, r (m)                       0.478
Rolling resistance coefficient, f         0.012
Air resistance coefficient, C_D           0.55
Air density, ρ (kg·m⁻³)                   1.209
Frontal area, A (m²)                      6.55

The HES discussed in this part consists of a fuel cell pack, a power battery pack and a supercapacitor, and its main performance parameters are shown in Table 2.

Table 2
Main performance parameters of fuel cell/battery/supercapacitor hybrid energy.
Part name            Parameter name                     Value
Fuel cell stack      Rated power (kW)                   60
                     Maximum output power (kW)          76
Power battery pack   Battery capacity (Ah)              30
                     Maximum charging power (kW)        35
                     Maximum discharge power (kW)       35
Super capacitor      Maximum charging power (kW)        20
                     Maximum discharge power (kW)       20

3. Multi-criteria cost-effective EMS design

3.1. Formulation of TD3-based EMS

Traditional policy gradient methods often over-estimate the Q function. The over-estimation accumulates continuously and leads to a high bias. The twin delayed deep deterministic policy gradient (TD3) algorithm was proposed to address this problem.

As shown in Fig. 7, TD3 builds on the DDPG architecture but uses two independent Critic networks, which still share the same experience replay pool. Both the Actor and the Critics are deep neural networks (DNNs). The Actor uses the policy gradient (PG) method to learn and select actions in the current environment, and the Critic generates signals that evaluate the actions performed by the Actor. In short, the Actor network with network parameter $\theta$ is a function approximator used to build a deterministic policy, and the Critic network with network parameter $\omega$ evaluates the action through the value function $Q(s, a, \omega)$. During the network update process, the two Critic target Q values can be expressed as:

Fig. 7. Schematic diagram of TD3 framework.

\[ \begin{cases} y_1 = r + \gamma\, Q(s', a, \omega_1') \\ y_2 = r + \gamma\, Q(s', a, \omega_2') \end{cases} \tag{20} \]

where $\gamma$ is the discount coefficient and $r$ is the reward function. Of the two target Q values, one will always be higher than the other, and a Q value that is too high inevitably leads to over-estimation. The TD3 algorithm therefore uses the smaller of the two target Q values to update the two Critic networks, as follows:
y ¼ minðy1 ; y2 Þ (21) consumption and reduce the lifetime loss of fuel cells and batteries, while
maintaining battery SOC, thereby reducing the overall cost of energy
The utilization of two Critics for training can effectively solve the
system operation.
deviation caused by over-estimation of the Q value, but in the process of
A large number of studies have shown that the degradation of fuel
network update, each step will generate a small error, and after multiple
cells mainly includes low load, high load, and frequent load changes, so
updates, the error will be amplified. Ultimately, the inaccurate Q value
the loss of fuel cell life Cfc can be expressed as follows:
leads to a high variance problem. In order to reduce the variance, a twin-
delayed update manner is adopted, and the current Actor and target C_ fc ¼ Clow þ Chigh þ Cchange (23)
network will not be updated immediately after the current Critic update.
Other networks do not update until Critic is updated N times, and the In the above formula, Clow is the life loss of the fuel cell at low load;
update of the target network continues the soft update method of the Chigh is the life loss of the fuel cell at high load; Cchange is the life loss of the
traditional DDPG algorithm. fuel cell with frequent load changes.
After solving the high variance problem, for the error itself, the value The life loss of the fuel cell during low-load operation can be
function needs to be estimated more accurately [19], so a certain noise μ expressed as a function of the low-load operation time Tlow of the fuel cell
is added to the target Q value as follows: [29]:
0 0 0
yi ¼ r þ γQ ðs ; a þ μ; ω Þ
0
(22) λlow Tlow Mfc
Clow ¼ (24)
~ fc
V
The pseudo-code of the TD3 algorithm is shown in Table 3.
where λlow is the decay rate; Mfc is the cost of the fuel cell. According to
literature [25], the average cost of the fuel cell is 593.95 ￥/kw, and the
Table 3
Pseudo-code of TD3.

Pseudo-code for offline training of the TD3 algorithm
1: Initialize the network parameters $\omega_1$, $\omega_2$ of the two Critic networks, and the Actor's network parameters $\theta$
2: Initialize the Critic and Actor networks in the target network: $\omega'_1 \leftarrow \omega_1$, $\omega'_2 \leftarrow \omega_2$, $\theta' \leftarrow \theta$
3: Empty the experience replay pool $R$
4: For episode = 1 to $M$ do
5:   Begin with an Ornstein-Uhlenbeck (OU) noise $\mathcal{N}$ for exploration
6:   Observe the initial state $s_1$
7:   For $t$ = 1 to $T$ do
8:     Select action $a_t = \pi_\theta(s_t) + \mathcal{N}$ according to the current policy and the exploration noise
9:     Execute action $a_t$
10:    Observe the reward $r_t$ fed back by the system
11:    Observe the system state at the next moment, $s_{t+1}$
12:    Store the transition $(s_t, a_t, r_t, s_{t+1})$ into $R$
13:    Sample a random mini-batch of $m$ transitions from $R$
14:    Add noise to the output of the Actor in the target network: $\tilde{a} \leftarrow \pi_{\theta'}(s') + \mu$, $\mu \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$
15:    Calculate the target Q value using the smaller of the two target Critic outputs: $y_t = r_t(s_t, a_t) + \gamma \min_{i=1,2} Q(s'_t, \tilde{a}; \omega'_i)$
16:    Minimize the loss function to update the Critic networks: $J(\omega_i) = \frac{1}{m} \sum_{j=1}^{m} \left( y_j - Q(s_j, a_j; \omega_i) \right)^2$
17:    If $t$ mod $N$ = 0 then
18:      Update the Actor network by the policy gradient method: $\nabla_\theta J(\theta) = \frac{1}{m} \sum_{j=1}^{m} \left[ \nabla_a Q(s, a; \omega) \big|_{s=s_j,\, a=\pi_\theta(s)} \nabla_\theta \pi_\theta(s) \big|_{s=s_j} \right]$
19:      Soft update the target networks: $\omega'_i \leftarrow \tau \omega_i + (1 - \tau) \omega'_i$, $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$
20:    End if
21:  End for
22: End for

maximum output power of the fuel cell studied in this paper is 76 kW, so the cost of the fuel cell is 76 × 593.95 ≈ 45,140.49 ￥; $\tilde{V}_{fc}$ is the voltage drop at which the fuel cell is scrapped (generally, a 10% drop in the fuel cell voltage at rated power).

The degradation cost of the fuel cell under high load operation can also be expressed as a function of the high load operation time $T_{high}$ of the fuel cell [29] as follows:

$$C_{high} = \frac{\lambda_{high} T_{high} M_{fc}}{\tilde{V}_{fc}} \tag{25}$$

where $\lambda_{high}$ is the decay rate.

The degradation cost of the fuel cell when the load changes can be expressed as a function of the fuel cell power change rate $\dot{P}_{fc}$ as follows [29]:

$$C_{change} = \int \frac{\lambda_{change} M_{fc} \dot{P}_{fc}}{1000\, n_{fc} \tilde{V}_{fc}} \, dt \tag{26}$$

where $\lambda_{change}$ is the decay rate and $n_{fc}$ is the number of fuel cells.

In addition to the life loss of the fuel cell, this paper also considers the life loss of the battery. Generally, when the capacity loss of the battery reaches 20%, the battery can no longer be used. Therefore, the life loss cost of the battery can be calculated as follows [31]:

$$\dot{C}_{bat} = \frac{\Delta Q_{loss} M_{bat}}{20\%} \tag{27}$$

In the formula, $\Delta Q_{loss}$ is the capacity loss and $M_{bat}$ is the battery cost. According to the literature [25], the battery cost is generally 1,139.43 ￥/kWh. The maximum output power of the lithium battery pack used in this paper is 35 kW, so the battery cost is 1139.43 × 35 ≈ 39,880.17 ￥.
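The degradation cost terms of Eqs. (25) to (27) can be illustrated with a short Python sketch. It is only a rough reading of the formulas: the function names are hypothetical, the integral in Eq. (26) is approximated by summing the absolute power changes between samples (the absolute value is an assumption of this sketch), and the decay rates and voltage drop must be supplied by the user because their values are not given in this section.

```python
from typing import Sequence

M_FC = 45_140.49   # fuel cell cost in ￥, as stated in the text
M_BAT = 39_880.17  # battery cost in ￥, as stated in the text


def high_load_cost(lambda_high: float, t_high: float, v_fc: float,
                   m_fc: float = M_FC) -> float:
    """Eq. (25): C_high = lambda_high * T_high * M_fc / V_fc."""
    return lambda_high * t_high * m_fc / v_fc


def load_change_cost(lambda_change: float, p_fc: Sequence[float],
                     n_fc: int, v_fc: float, m_fc: float = M_FC) -> float:
    """Eq. (26), discretised.

    The time integral of the power change rate is approximated by the sum of
    absolute power differences between consecutive samples (assumption).
    """
    total_change = sum(abs(b - a) for a, b in zip(p_fc, p_fc[1:]))
    return lambda_change * m_fc * total_change / (1000.0 * n_fc * v_fc)


def battery_life_cost(delta_q_loss: float, m_bat: float = M_BAT,
                      end_of_life_loss: float = 0.20) -> float:
    """Eq. (27): capacity loss scaled by the 20% end-of-life threshold."""
    return delta_q_loss * m_bat / end_of_life_loss
```

For example, `battery_life_cost(0.20)` returns the full battery cost up to floating-point rounding, which matches the statement that the battery is no longer usable once 20% of its capacity is lost.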
3.2. Design of reward function

The block diagram of the TD3-based EMS is shown in Fig. 8. The input states consist of vehicle speed, acceleration, battery SOC and supercapacitor SOC, and the control variables are fuel cell power and battery power. The reward function is an important factor for the offline training of the TD3-based EMS. The control goal of this study is to optimize hydrogen consumption and the longevity of the fuel cell and battery, while maintaining the SOC of the battery and supercapacitor in an appropriate range. The TD3-based EMS is employed to coordinate the power output of the fuel cell and the battery, so that the control target can be better achieved.

Fig. 8. Diagram of TD3-based EMS.

The reward function is an important factor in the offline training of the TD3 algorithm, that is, the optimization goal in the optimization problem. The control objectives of this study are to reduce hydrogen

To sum up, set the reward function $r_t$ as:

$$r_t = -\begin{cases} \dot{C}_h + \dot{C}_{fc} + \dot{C}_{bat} + \dot{C}_{bat\_soc}, & 0.1 < SOC_{sc} < 0.9 \\ \dot{C}_h + \dot{C}_{fc} + \dot{C}_{bat} + \dot{C}_{bat\_soc} + \zeta, & \text{others} \end{cases} \tag{28}$$

where $\dot{C}_h$ is the hydrogen consumption of the fuel cell, converted into a cost according to the unit price of hydrogen in 2020 [29] (25.55 ￥/kg); $\dot{C}_{fc}$ is the life loss of the fuel cell, which is mainly composed of the low load life loss $C_{low}$, the high load life loss $C_{high}$ and the frequent load change life loss $C_{change}$; $\dot{C}_{bat}$ is the battery pack life loss; $\zeta$ is the penalty factor applied when the supercapacitor exceeds the given SOC range, whose specific value depends on the operating conditions and vehicle model; $\dot{C}_{bat\_soc}$ is the battery SOC adjustment term used to maintain the battery SOC, and its expression is as follows:
$$\dot{C}_{bat\_soc} = \lambda_{soc} \, \dot{SOC} \tag{29}$$

where $\dot{SOC}$ is the SOC change rate, whose expression has been given in Section 2; $\lambda_{soc}$ is the discount coefficient, which is valued according to the specific vehicle model and operating conditions.

3.3. Formulation of nonlinear programming-based EMS

In order to validate the performance of the TD3-based EMS, this study takes the nonlinear programming based EMS (NEMS) with the same objective function as the TD3 [22]. The optimization problem is solved by the sequential quadratic programming (SQP) method as follows [30]:

$$\min J = \begin{cases} \dot{C}_h + \dot{C}_{fc} + \dot{C}_{bat} + \dot{C}_{bat\_soc}, & 0.1 < SOC_{sc} < 0.9 \\ \dot{C}_h + \dot{C}_{fc} + \dot{C}_{bat} + \dot{C}_{bat\_soc} + \zeta, & \text{others} \end{cases}$$

$$\text{s.t.} \begin{cases} P_{fc} + P_b + P_{sc} = P_{req} \\ I_{st,\min} \le I_{st} \le I_{st,\max} \\ \left| I_{st}(t) - I_{st}(t-1) \right| \le \Delta I_{st,\max} \\ I_{charge\_lim}(SOC) \le I_{bat} \le I_{discharge\_lim}(SOC) \\ I_{sc\_charge\_lim}(SOC_{sc}) \le I_{sc} \le I_{sc\_discharge\_lim}(SOC_{sc}) \end{cases} \tag{30}$$

where $I_{st,\min}$ and $I_{st,\max}$ are the lower and upper current limits of the fuel cell, respectively; $\Delta I_{st,\max}$ is the maximum allowable current change rate of the fuel cell; $I_{bat}$ is the battery current; $I_{charge\_lim}(SOC)$ and $I_{discharge\_lim}(SOC)$ are the maximum charging current and maximum discharging current of the battery pack; $I_{sc}$ is the supercapacitor current; $I_{sc\_charge\_lim}(SOC_{sc})$ and $I_{sc\_discharge\_lim}(SOC_{sc})$ are the maximum charging current and maximum discharging current of the supercapacitor.

4. Results and discussions

In this research, the proposed EMS was tested and verified under WLTP class 3 conditions. The working condition is shown in Fig. 9. The cycle time of this working condition is 1800 s and the maximum speed is 36.47 m/s. The load requirement of this working condition is more complicated than that of NEDC, and the speed and acceleration fluctuate greatly, which makes it a suitable case for verifying the performance of the strategy.

4.1. Analysis of offline training performance

The TD3 algorithm is trained offline with the parameter settings in Table 4. In order to characterize the offline training process of deep reinforcement learning, the cumulative reward obtained in each step of training is normalized, and the data obtained after taking its positive value, denoted mean_reward, is used to represent the offline training process.

In order to reflect the improvement of the TD3 algorithm over the DDPG algorithm, this study applies the previously mentioned reward function to DDPG and uses it for the energy management of the fuel cell/battery/supercapacitor system in this chapter, while maintaining the corresponding network settings. After offline training, the two deep reinforcement learning-based strategies yield the convergence curves shown in Fig. 10.

It can be seen from Fig. 10 that at the beginning of training, TD3 begins to explore the environment. Since the Critic network's evaluation of actions is not yet accurate enough at this time, the accumulated reward fluctuates strongly in the early stage. This reflects that TD3 gathers as much environmental information under different actions as possible, so as to better evaluate the output action. After 35 episodes, the fluctuation of the cumulative reward decreases and the curve becomes stable, which indicates that the Critic network's evaluation of actions has become relatively accurate and that the training of the TD3 algorithm is complete. The convergence curve of DDPG in Fig. 10 shows a similar trend, and its training is completed at around 31 episodes, which is not much different from TD3. However, the fluctuation of DDPG in the early stage is much larger than that of TD3, which may be because the double-delay update method of TD3 reduces the variance of the Q value in the early stage. It can also be seen that between episodes 38 and 48 DDPG may fall into a local optimum, resulting in an increase in mean_reward, whereas no such problem occurs after TD3 converges. This is attributed to the fact that TD3 uses two Critic networks for training, which effectively reduces the bias caused by over-estimation of the Q value and helps avoid falling into a local optimum.
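Both learning-based strategies compared in this section are trained on the reward of Eqs. (28) and (29). As a rough illustration, that reward can be sketched in Python as follows. The function names are hypothetical, the per-step cost terms are assumed to be computed elsewhere, and the reward is taken as the negative of the cost, which is this sketch's reading of Eq. (28) and agrees with mean_reward being reported as a positive quantity.

```python
def soc_adjustment_cost(lambda_soc: float, soc_rate: float) -> float:
    """Eq. (29): battery SOC adjustment term, lambda_soc times the SOC change rate."""
    return lambda_soc * soc_rate


def reward(c_h: float, c_fc: float, c_bat: float, c_bat_soc: float,
           soc_sc: float, zeta: float,
           soc_min: float = 0.1, soc_max: float = 0.9) -> float:
    """Eq. (28): negative sum of the per-step costs.

    The penalty zeta is added whenever the supercapacitor SOC leaves the
    open interval (soc_min, soc_max).
    """
    cost = c_h + c_fc + c_bat + c_bat_soc
    if not (soc_min < soc_sc < soc_max):
        cost += zeta
    return -cost
```

With this form, a larger (less negative) reward corresponds to a lower operating cost, so maximizing the return of the agent and minimizing the objective of Eq. (30) point in the same direction.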
Fig. 9. Schematic diagram of speed and acceleration under WLTP class 3 operating conditions.
4.2. Analysis of power distribution of HES

Figs. 11–13 are the power allocation diagrams of the TD3-based EMS (TD3), the DDPG-based EMS (DDPG), and the nonlinear programming-based EMS (NEMS) under the same objective function.

Fig. 11 shows the output power of the fuel cell under the different EMSs. The transient power of the fuel cell under the TD3-based EMS is clearly smaller than that under DDPG and NEMS, and the overall output demand on the fuel cell is also smaller. It can therefore be inferred that when the load of the power system changes suddenly, the fuel cell loss under the TD3-based strategy will be smaller, and the hydrogen consumption of the fuel cell will also be smaller.

Fig. 12 shows the output power of the battery under the three strategies. Since TD3 reduces the power demand on the fuel cell, the power demand on the battery increases accordingly in order to meet the power demand of the entire vehicle, so the battery power changes more frequently. Because the optimization goal of the TD3 strategy includes the battery life term, and the battery life is closely related to the transient operating conditions, the transient power of the battery under the TD3 strategy is smaller than under the other two strategies.

Table 4
Parameter settings for offline training of deep reinforcement learning.

Preset parameters              Value
Discount factor                0.99
Actor learning rate            0.001
Critic learning rate           0.01
Experience pool capacity       10,000
Number of replay samples       64
Soft update discount factor    0.01

Fig. 10. Schematic diagram of offline training convergence.
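To make the role of the discount factor and the soft update factor in Table 4 concrete, the following Python sketch illustrates steps 14, 15 and 19 of the TD3 pseudo-code in Table 3 for scalar actions. It is a minimal, framework-free sketch under stated assumptions: the callables `target_actor`, `target_q1` and `target_q2` are hypothetical stand-ins for the target networks, the parameters are plain lists of floats, and terminal-state handling is omitted because it is not described in Table 3.

```python
import random

GAMMA = 0.99  # discount factor (Table 4)
TAU = 0.01    # soft update factor (Table 4)


def clip(x: float, lo: float, hi: float) -> float:
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(target_actor, next_state, sigma: float, c: float,
                           rng=random) -> float:
    """Step 14: target action plus Gaussian noise clipped to [-c, c]."""
    noise = clip(rng.gauss(0.0, sigma), -c, c)
    return target_actor(next_state) + noise


def td3_target(reward: float, next_state, next_action,
               target_q1, target_q2, gamma: float = GAMMA) -> float:
    """Step 15: bootstrap from the smaller of the two target Critic values."""
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + gamma * q_min


def soft_update(target_params, online_params, tau: float = TAU):
    """Step 19: move each target parameter a fraction tau toward the online one."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

Taking the minimum of the two target Critic values is what limits the over-estimation of the Q value discussed in Section 4.1, while the small soft update factor keeps the target networks changing slowly between updates.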
Fig. 11. Fuel cell output power.
Fig. 12. Battery output power.
Fig. 13 shows the output power of the supercapacitor. The supercapacitor under the control of the TD3 strategy has the highest utilization rate. The strategy makes full use of the regulating effect of the supercapacitor, that is, on the premise of optimizing the fuel cell and battery as much as possible, it uses the supercapacitor to make up the remaining power demand.

Fig. 14 reflects the fuel cell power distribution for the three control strategies under WLTP class 3 conditions. In general, when a fuel cell operates in a more efficient operating region, not only is its average system efficiency improved, but the fuel cell degradation caused by low/high loads is also effectively reduced. Because the low/high-load fuel cell life loss is included in the optimization objective, the working state of the fuel cell under the TD3 strategy is more stable, and the frequency of operation in the high-efficiency range is also higher. Combining the results in Fig. 14, it can be seen that when TD3 performs power distribution, the efficiency corresponding to the output power of the fuel cell is as high as possible, which can reduce, to a certain extent, the life loss of the fuel cell caused by its start/stop, and the increase of the average efficiency can also reduce the hydrogen consumption of the fuel cell.

4.3. SOC trajectories of battery and supercapacitor

The battery SOC and supercapacitor SOC trends of the three strategies are shown in Fig. 15. It can be seen from Fig. 15 that TD3 and NEMS constrain the SOC of the lithium battery more strongly than DDPG, which shows that DDPG tends to use the lithium battery in the power distribution process. The SOC changes of the supercapacitor in Fig. 15 show that TD3 is more inclined to use the supercapacitor to compensate the required power in power distribution. Among the three strategies, TD3 can better utilize the supercapacitor to enlarge the optimization space. In addition, TD3 and DDPG are global optimization strategies, whereas NEMS is a real-time optimization strategy, and the battery SOC or supercapacitor SOC under NEMS is lower than under the other two strategies. This also indirectly confirms that, in order to meet the multi-objective optimization of hydrogen consumption and fuel cell/
battery loss, TD3 and DDPG are more inclined to use supercapacitors to compensate the required power during power distribution.

Fig. 13. Supercapacitor output power.

4.4. Cost analysis

Hydrogen consumption is one of the optimization indicators of TD3, DDPG and NEMS. This chapter converts the mass of hydrogen consumed into a cost through the unit price of hydrogen. Fig. 16 shows the optimization results of these three strategies for hydrogen consumption. Combined with the SOC curve in Fig. 15, the hydrogen consumption curve and SOC curve of the TD3-based strategy show a relatively stable upward or downward trend over the whole working process, and the slope fluctuates little, which is a typical global optimization trend [29]. In contrast, the slopes of the hydrogen consumption curves and SOC curves of the other two strategies change greatly, and the trends of the curves fluctuate continuously. As far as the final result is concerned, the hydrogen consumption of TD3 is 36.40% lower than that of DDPG and 50.87% lower than that of NEMS. Although TD3 and DDPG converge in a similar episode, the subsequent training of TD3 does not show a rising process like that of DDPG. This is due to TD3's dual-critic and dual-delay update mechanism, which makes the estimation of the Q value by TD3 more accurate and can effectively avoid overfitting and falling into a local optimum.

In addition to hydrogen consumption, the optimization objectives of TD3, DDPG and NEMS also include fuel cell life loss and battery life loss. Among them, fuel cell life loss mainly includes low load loss, high load loss and transient load loss. Fig. 17 shows the comparison of each loss of the three strategies. As can be seen from Fig. 17, NEMS tends to optimize low load loss, TD3 tends to optimize high load loss and power battery loss, and DDPG tends to optimize transient load loss and power battery loss. Although the other two strategies perform better than the TD3 strategy on low load loss, TD3 is better than or equal to the other two strategies in terms of high load loss, power battery loss and transient load loss. On the whole, the optimization effect of the TD3 strategy is the best among the three strategies.

Fig. 18 shows the distribution of overall loss and hydrogen consumption for the three strategies. Compared with DDPG and NEMS, the total cost of TD3 is reduced by 17.36% and 26.83%, respectively. For
Fig. 14. Fuel cell power distribution.
Fig. 15. SOC trends of battery and supercapacitor.
Fig. 16. Comparison of cumulative hydrogen consumption cost.
Fig. 17. Comparison of loss cost of three strategies.
Fig. 18. Comparison of the overall loss and the sum of hydrogen consumption of the three strategies.
TD3, it can be clearly seen that hydrogen consumption, fuel cell loss (including three loss terms) and battery loss show a uniform distribution trend. Given that the high load loss is small enough to be neglected, the fuel cell loss of TD3 is reduced by 18.15% compared with NEMS in one cycle, and the fuel cell loss of DDPG is reduced by 27.22% compared with NEMS.

Compared with DDPG and NEMS, the battery loss of TD3 is reduced by 17.44% and 2.16%, respectively, according to the values in Table 5. Although the fuel cell loss of TD3 is higher than that of DDPG, the overall cost of TD3 is lower than that of DDPG owing to its lower hydrogen consumption and power battery loss.

The specific values of the compared indicators are given in Table 5. By comparing with DDPG and NEMS, the performance of TD3 in reducing hydrogen consumption, reducing fuel cell loss and prolonging battery life has been verified.

Table 5
Summary table of various indicators.

Indicator                                 TD3      DDPG     NEMS
Hydrogen consumption cost (RMB)           3.97     6.25     8.08
Low load loss cost of fuel cell (RMB)     2.69     2.45     2.28
High load loss cost of fuel cell (RMB)    0.0067   0.0095   0.015
Transient loss cost of fuel cell (RMB)    1.63     1.30     2.39
Overall loss cost of fuel cell (RMB)      4.33     3.85     5.29
Battery loss cost (RMB)                   4.07     4.93     4.16
Total cost (RMB)                          12.38    14.98    16.92

5. Conclusion

In this study, a novel EMS is formulated based on the TD3 algorithm in deep reinforcement learning and used to solve the cost optimization problem of long-distance logistics trucks with a battery/fuel cell/supercapacitor power structure. Aiming at the fuel cell/battery/supercapacitor hybrid system proposed in this paper, a multi-objective optimal EMS is established based on the TD3 algorithm, while considering the hydrogen consumption level, fuel cell life loss level and battery life loss level. The method effectively suppresses the aging of core components, optimizes the hydrogen consumption of the fuel cell, and improves the operation cycle and economy of the entire power system.

The comparison with the strategy based on the nonlinear programming algorithm and the strategy based on DDPG shows that the TD3 strategy is more effective in optimizing the evaluation index. That is, the optimized power system works more stably and operates more frequently in the high-efficiency range, thereby reducing the loss of the power system and obtaining a lower overall cost, which greatly improves the economy of the power system.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China [Grant No. 51805254]. Any opinions expressed in this paper are solely those of the authors and do not represent those of the sponsors. The authors would like to thank the reviewers for their helpful corrections and insightful suggestions.

References

[1] Xie C, Quan S, Du C. Research on energy management system of fuel cell electric vehicle [J]. Automotive Engineering 2007;29(9):758–60.
[2] Wang Q, Xiao Y, Qi W. Research on vehicle energy management of fuel cell hybrid electric vehicle [J]. Power Technology 2012;(10):1459–62.
[3] Zhang S. Research on electric vehicle charging strategy and control technology based on dual energy sources [D]. Chongqing University of Technology; 2019.
[4] Zhang C, Dong J, Liu J, et al. Control strategy of battery and supercapacitor hybrid energy storage system [J]. Journal of Electrotechnical Technology 2014;(4):334–40.
[5] Wang Y, Sun Z, Zonghai C. Rule-based energy management strategy of a lithium-ion battery, supercapacitor and PEM fuel cell system [J]. Energy Procedia 2019;158:2555–60.
[6] Gao D, Jin Z, Lu Q. Energy management strategy based on fuzzy logic for a fuel cell hybrid bus [J]. Journal of Power Sources 2008;185(1):311–7.
[7] Xu L, Ouyang M, Li J, et al. Dynamic programming algorithm for minimizing operating cost of a PEM fuel cell vehicle [C]. In: 2012 IEEE international symposium on industrial electronics; 2012. p. 1490–5.
[8] Han J, Park Y, Kum D. Optimal adaptation of equivalent factor of equivalent consumption minimization strategy for fuel cell hybrid electric vehicles under active state inequality constraints [J]. Journal of Power Sources 2014;267:491–502.
[9] Shen D, Lim CC, Shi P. Robust fuzzy model predictive control for energy management systems in fuel cell vehicles [J]. Control Engineering Practice 2020;98:104364.
[10] Vazquez-Canteli JR, Nagy Z. Reinforcement learning for demand response: a review of algorithms and modeling techniques [J]. Applied Energy 2019;235:1072–89.
[11] Hu Y, Li W, Xu K, et al. Energy management strategy for a hybrid electric vehicle based on deep reinforcement learning [J]. Applied Sciences 2018;8(2):8020187.
[12] Liu T, Zou Y, Liu D, et al. Reinforcement learning of adaptive energy management with transition probability for a hybrid electric tracked vehicle [J]. IEEE Transactions on Industrial Electronics 2015;62(12):7837–46.
[13] Yuan J, Yang L, Chen Q. Intelligent energy management strategy based on hierarchical approximate global optimization for plugin fuel cell hybrid electric vehicles [J]. International Journal of Hydrogen Energy 2018;43(16):8063–78.
[14] Reddy NP, Pasdeloup D, Zadeh MK, et al. An intelligent power and energy management system for fuel cell/battery hybrid electric vehicle using reinforcement learning [C]. In: 2019 IEEE transportation electrification conference and expo (ITEC). p. 1–6.
[15] Li Yuecheng, He Hongwen, et al. Deep reinforcement learning-based energy management for a series hybrid electric vehicle enabled by history cumulative trip information [J]. IEEE Transactions on Vehicular Technology 2019;68(8):7416–30.
[16] Li Weihan, Han Cui, et al. Deep reinforcement learning-based energy management of hybrid battery systems in electric vehicles [J]. Journal of Energy Storage 2021;36:102355.
[17] Han Xuefeng, He Hongwen, et al. Energy management based on reinforcement learning with double deep Q-learning for a hybrid electric tracked vehicle [J]. Applied Energy 2019;254:113708.
[18] Liu L. Simulation analysis and control of fuel cell hybrid electric vehicle multi-energy system [D]. Jilin University; 2007.
[19] Cai Kuncheng, Chen Jiawei, Song Qingchao. Decentralized energy management strategy of fuel cell/battery/supercapacitor-hybrid electric vehicle [A]. In: Proceedings of the 2020 China automation conference (CAC2020) [C]. China Society of Automation; 2020. p. 1–5.
[20] Larminie James. Fuel cell systems explained. 2nd ed. [M]. John Wiley; 2003.
[21] Amphlett JC, Baumert RM, Mann RF, et al. Performance modeling of the Ballard Mark IV solid polymer electrolyte fuel cell II. Empirical model development [J]. Journal of The Electrochemical Society 1995;142(1):9–15.
[22] Hu X, Zou C, Tang X, et al. Cost-optimal energy management of hybrid electric vehicles using fuel cell/battery health-aware predictive control [J]. IEEE Transactions on Power Electronics 2019;35(1):382–92.
[23] Sarioglu L, Klein OP, Schroder H, et al. Energy management for fuel-cell hybrid vehicles based on specific fuel consumption due to load shifting [J]. IEEE Transactions on Intelligent Transportation Systems 2012;13(4):1772–81.
[24] Zhang C, Allafi W, Dinh Q, et al. Online estimation of battery equivalent circuit model parameters and state of charge using decoupled least squares technique [J]. Energy 2018;142:413–20.
[25] Hua L. Development of supercapacitor and new energy vehicle [J]. Automobile & Parts 2009;(4):26–32.
[26] Liu Chunna. Supercapacitor and its application in new energy vehicles [J]. Chinese Journal of Power Sources 2010;34(12):1223–5.
[27] Li Hui, Zhao Bin, et al. Battery/supercapacitor energy management for streetcars [J]. Battery Bimonthly 2022;51(1):48–52.
[28] Zhang F, Li J, Li Z. A TD3-based multi-agent deep reinforcement learning method in mixed cooperation-competition environment [J]. Neurocomputing 2020;411:206–15.
[29] Sun Z, Wang Y, Chen Z, et al. Min-max game based energy management strategy for fuel cell/supercapacitor hybrid electric vehicles [J]. Applied Energy 2020;267:115086.
[30] Schittkowski K. NLPQL: a FORTRAN-subroutine solving constrained nonlinear programming problems [J]. Annals of Operations Research 1985;5:485–500.
[31] Li Yuecheng, He Hongwen, et al. Deep reinforcement learning-based energy management for a series hybrid electric vehicle enabled by history cumulative trip information [J]. IEEE Transactions on Vehicular Technology 2019;68(8).
