
Energy

Energy management strategy based on deep reinforcement learning and speed prediction for hybrid electric vehicle with multi-dimensional continuous control

--Manuscript Draft--

Manuscript Number: EGY-D-22-12752

Article Type: Full length article

Keywords: Hybrid electric vehicles; Deep reinforcement learning; Speed prediction; Multi-dimensional continuous control; Transfer learning


Highlights

• An energy management strategy for power-split hybrid electric vehicles is proposed.

• The EMS is based on the TD3 deep reinforcement learning algorithm.

• An LSTM RNN speed prediction model is embedded in the EMS.

• The transfer learning ability of DRL-based EMSs is investigated.



Energy management strategy based on deep reinforcement learning and speed prediction for hybrid electric vehicle with multi-dimensional continuous control

Xing Liu, Ying Wang*, Kaibo Zhang, Wenhe Li

School of Energy and Power Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China

Corresponding author:
Ying Wang, PhD, Professor.
School of Energy and Power Engineering, Xi'an Jiaotong University
No.28 Xianning west road, Xi'an, 710049, Shaanxi, P.R. China.
E-mail: yingw@mail.xjtu.edu.cn

* Corresponding author. E-mail address: yingw@mail.xjtu.edu.cn (Y. Wang).
Nomenclature

EV      Electric vehicles
HEV     Hybrid electric vehicles
EMS     Energy management strategy
RBS     Rule-based strategy
OBS     Optimization-based strategy
LBS     Learning-based strategy
CD-CS   Charge depleting-charge sustaining
SOC     State of charge
DP      Dynamic programming
PMP     Pontryagin's minimum principle
MPC     Model predictive control
ML      Machine learning
DRL     Deep reinforcement learning
DQN     Deep Q-learning
DDQN    Double deep Q-learning
DDPG    Deep deterministic policy gradient
TD3     Twin delayed deep deterministic policy gradient
V2X     Vehicle to everything
LSTM    Long short-term memory
RNN     Recurrent neural network
BSFC    Brake specific fuel consumption
Abstract

An efficient energy management strategy (EMS) is significant to improve the economy of hybrid electric vehicles (HEV). In this paper, a power-split HEV model is built and validated against test results, and an energy management strategy is then proposed for this model based on a long short-term memory recurrent neural network speed prediction model and deep reinforcement learning (DRL) algorithms. A rule-based logic local controller and global optimal empirical knowledge are introduced to enhance the convergence speed. The results show that the twin delayed deep deterministic policy gradient algorithm (TD3) achieves more satisfactory performance in convergence speed and energy efficiency. The networks of the DRL algorithms with continuous control update more robustly during iterations, in contrast to the discrete ones. The analysis in this paper illustrates that the multidimensional control space has greater potential for energy savings for power-split HEVs. As a result, the fuel consumption of the TD3-based EMS with multidimensional continuous control differed from the global optimal algorithm by only 5.7%. Besides, the migration capability of the EMSs proposed in this paper is investigated. The performance on fuel consumption is still satisfactory in the new test environment, and the battery SOC is maintained more stably after transfer learning.

Keywords: Hybrid electric vehicles; Deep reinforcement learning; Speed prediction; Multi-dimensional continuous control; Transfer learning
1. Introduction

As awareness of environmental protection increases and emission regulations become stricter, traditional engine-only vehicles are criticized for producing greenhouse gas emissions that accelerate global warming, thereby endangering human living space and triggering extreme weather in local areas [1-3]. It was shown that passenger cars accounted for about 20% of energy consumption in the transportation sector and that CO2 emissions from cars alone would reach 5 billion tons in 2020. Driving range is still one shortcoming of pure electric vehicles (EVs) [4]. Hybrid electric vehicles (HEV) combine the advantages of both electric motors and engines, and have great energy-saving potential during the transition period from internal combustion engine drive to electric drive [5].

Hybrid vehicles include multiple power sources with different operating characteristics, with the battery acting as an intermediate storage device [6]. The energy management strategy (EMS) therefore mainly focuses on reasonably coordinating each energy device so that they work together, keeping the energy consumption of the vehicle low while meeting the driver's power demand. The EMSs of hybrid vehicles can be classified into rule-based strategy (RBS), optimization-based strategy (OBS), and learning-based strategy (LBS) [7].

Rule-based energy management strategies include deterministic rule-based control and fuzzy rule-based control. The deterministic rule is commonly known as charge depleting-charge sustaining (CD-CS), which regulates the power distribution between the engine and the battery according to the SOC and the demand power [8]. Rule-based strategies are simple and reliable by design. However, when facing complex road scenes and diverse driving styles, they underperform in fuel consumption and depend on testing and calibration during development. Optimization-based strategies are further divided into global optimization and transient (real-time) optimization. Global optimization algorithms include dynamic programming (DP) [9], stochastic dynamic programming [10], Pontryagin's minimum principle (PMP) [11, 12], genetic algorithms (GA) [13, 14], etc. Dynamic programming is often used as a benchmark to verify the merits of other control strategies or to calculate optimal state-action databases as empirical data for artificial-intelligence-assisted training. Optimal control sequences can theoretically be achieved with a global optimization algorithm; however, this requires knowledge of the whole driving cycle in advance and incurs large computational costs. Equivalent fuel consumption minimization [15, 16] and model predictive control (MPC) [17-19] are commonly used as transient optimal control strategies, compromising between computational cost and the global optimization strategy. An accurate whole-vehicle model is necessary for the application of both global optimization and real-time optimization control strategies [20].

In recent years, with the rapid development of artificial intelligence (AI) technology, AI has been widely utilized in many fields such as natural language understanding, decision planning, machine vision, and pattern recognition. The energy management of hybrid vehicles based on machine learning (ML) and reinforcement learning (RL) is also becoming one of the research hotspots [21-24]. In the reinforcement learning process, an intelligent agent senses the environment through state-action-reward feedback and then optimizes its control strategy during iterations. As model-free energy management strategies, DRL-based EMSs are capable of adapting to dynamic driving conditions [3, 25].

Due to the merits of model-free and data-driven learning, research on DRL-based EMSs for HEVs has attracted increasing attention in recent years, as shown in Table 1. Hu et al. [26] proposed a DQN-based energy management strategy for a parallel hybrid vehicle and conducted simulation experiments based on a joint MATLAB and ADVISOR simulation environment. In that study, discrete engine torque was used as the control space, the inverse of the fuel consumption was used as the instantaneous reward, and the Q-network included three fully connected hidden layers. Han et al. [27] applied the DDQN algorithm to eliminate the Q-value overestimation problem in DQN. The results showed that the DDQN-based strategy had significant advantages over DQN in iterative convergence speed and optimization performance. The DQN algorithm is only applicable to discrete control because the evaluation network outputs Q values for each deterministic action, and DDPG was proposed to output continuous actions directly. Wu et al. [28] put forward a multidimensional continuous control strategy based on DDPG for a series-parallel hybrid bus traveling on a fixed route. To simulate the actual driving of the bus, traffic information such as travel time, passenger load, and route speed of the PHEB was generated with a traffic simulator and was considered as states of the bus.

DDPG suffers from the same Q-value overestimation defect as DQN, and dual critic networks were introduced in the twin delayed deep deterministic policy gradient (TD3) framework to mitigate this problem. Liu et al. [29] pointed out that the TD3-based EMS converged more easily than the DDPG-based EMS, with computation time reduced by 10% and fuel consumption by 7.3%. Huang et al. [30] established a DRL-based energy management strategy for a power-split hybrid vehicle, quantified the battery health status, and added it to the action and state space. The simulation results showed that, compared to ignoring battery damage, the strategy that considered battery health reduced the loss of battery capacity by about 60%, despite a 2.12% decrease in fuel economy. Wu et al. [31] designed a multi-mode plug-in hybrid EMS based on TD3 and a Gaussian distribution using the Chevrolet Volt-II. The output vectors representing the operating modes were reparameterized using a Gaussian distribution, and the operating mode with the highest probability was selected. A control space containing both discrete and continuous actions also appeared in the study of Zhang et al. [32]. The state space contained eight variables including road information. Ultimately, the proposed DRL strategy differed from the dynamic programming results by about 8% in terms of fuel consumption, and NOx emissions were reduced.
Table 1 EMS based on deep reinforcement learning

| Literature | Drive system architecture | Algorithm framework | State | Control | Control space | Year |
| [26] | Parallel hybrid | DQN | Demand power, SOC | Engine torque | Discrete | 2018 |
| [27] | Series hybrid | DDQN | Demand power, vehicle speed, acceleration, SOC, engine speed | Engine power | Discrete | 2019 |
| [28] | Series-parallel plug-in hybrid | DDPG | Speed, acceleration, SOC, number of passengers, distance traveled, road information | Engine speed, engine torque, motor torque | Continuous | 2019 |
| [33] | Power split | DDPG | SOC, vehicle speed, acceleration | Engine power | Continuous | 2020 |
| [34] | Series hybrid | DQN | SOC, generator speed, demand power | Throttle opening | Discrete | 2020 |
| [35] | Parallel hybrid | DQN | Engine speed, engine torque, SOC | Engine power | Discrete | 2021 |
| [36] | Power split | DDPG | SOC, vehicle speed, acceleration, image | Engine power | Continuous | 2021 |
| [29] | Power split | TD3 | SOC, vehicle speed, acceleration | Engine power | Continuous | 2021 |
| [37] | Single-axis parallel | TD3, DDPG | Vehicle speed, acceleration, SOC, vehicle mass, road slope | Engine torque | Continuous | 2021 |
| [38] | Parallel hybrid | DQN hierarchical reinforcement learning | Engine speed, engine torque, SOC | Engine power | Discrete | 2022 |
| [39] | Series crawler hybrid | SAC+DP | Demand power, vehicle speed, SOC, generator speed | Throttle opening | Continuous | 2022 |
| [31] | Multi-mode | TD3 + Gaussian distribution | SOC, vehicle speed, demand torque | Operating mode, engine speed | Discrete + continuous | 2022 |
| [30] | Power split | TD3 | Speed, acceleration, battery SOC, battery health status | Engine power | Continuous | 2022 |
| [40] | Parallel hybrid | A3C | SOC, vehicle speed, acceleration | Engine torque | Continuous | 2022 |
| [41] | Power split | DDPG | SOC, vehicle speed, acceleration | Engine speed, torque | Continuous | 2022 |
| [42] | Parallel hybrid | DDPG+DQN | Speed, acceleration, SOC, gear | Throttle opening, gearing | Continuous + discrete | 2022 |
| [43] | Parallel hybrid | DDPG | Demand power, SOC | Equivalence factor | Continuous | 2022 |
| [32] | Power split | TD3 | Vehicle speed, acceleration, SOC deviation, remaining mileage, distance to nearest traffic light ahead, past and future road speed vectors, past and future road gradient | Engine power, engine combustion mode | Continuous + discrete | 2022 |
| [44] | Series plug-in | SAC | SOC, SOC reference value, ΔSOC, acceleration, demand power | Engine power change | Continuous | 2022 |
| [45] | Power split | DDPG | Vehicle speed, acceleration, SOC | Engine power | Continuous | 2022 |
Research on DRL-based EMSs for HEVs is gradually deepening to tap their potential in complex road conditions and dynamic driving cycles. According to Table 1, many studies collected environmental information from the road [46, 47] to help the agent plan its strategy in advance for better fuel consumption and SOC maintenance. However, the premise that the traffic state is accessible rests on the vehicle-to-everything (V2X) connection assumption. For power-split or multi-mode hybrid configurations, a wide control space often means that flexible mode switching and fine-grained energy distribution can be realized, potentially resulting in more satisfactory overall performance. Currently, related research tends to embed the empirical knowledge of the optimal fuel consumption curve in the EMS to simplify the control, yet little attention is paid to the possible drawbacks of this practice. Besides, the hybrid vehicle energy management problem is a multi-constrained system optimization problem, and the obstacles to the exploration of reinforcement learning may come from the sparse reward and the large action space. An agent designed with traditional exploration strategies, such as random actions or greedy strategies, finds it difficult to encounter high-reward trajectories because the time horizon of the whole cycle is long.

Based on the above literature research and analysis, this paper proposes an energy management strategy that combines long short-term memory (LSTM) recurrent neural network (RNN) speed prediction with the TD3 deep reinforcement learning algorithm for a power-split HEV, where the control space includes engine speed and torque. To the best of the authors' knowledge, this algorithmic framework is used here for the first time in a power-split HEV EMS with a multidimensional continuous engine control space. The main contributions of this paper are as follows.

(1) Based on the historical driving data of the vehicle itself, an LSTM RNN is utilized to predict vehicle speed, which solves the long-term dependency problem commonly found in general RNNs.

(2) A TD3-based EMS with multidimensional continuous engine control is proposed and embedded in a well-validated power-split hybrid vehicle energy simulation model. Systematic comparisons with DDPG, DQN, and double DQN based EMSs are conducted.

(3) The optimal control trajectories obtained by DP starting from different initial conditions, together with a rule-based local controller, are introduced to improve the learning efficiency and avoid meaningless engine operating intervals.

(4) A comparative study on the effect of one- and two-dimensional engine control on the EMS is conducted. The transfer learning ability of the DRL-based EMS is discussed, which demonstrates its ability to adapt to dynamic driving conditions.

The remainder of this paper is arranged as follows. The powertrain model is introduced and the model verification is discussed in Section 2. Section 3 describes the theoretical foundations of the LSTM and DRL algorithms and their application to the hybrid vehicle EMS in this paper, and also presents the assisted learning strategies for the DRL-based EMS. Section 4 shows the training process and the main results. Section 5 concludes the paper.
2. Powertrain modeling

The Toyota Prius is one of the most classic hybrid vehicle models and has been widely used in energy management research [48]. In this paper, we build an inverse vehicle energy model with the Prius as the target.

As shown in Figure 1, the main components in the Prius model include the engine, the generator (MG1), the drive motor (MG2), the power battery, and the planetary gear set. The engine has a maximum power of 73 kW, the battery has a rated discharge power of 27 kW, and the whole vehicle can provide a maximum power output of 100 kW. The main parameters of the whole vehicle are shown in Table 2.

Figure 1 Vehicle model topology schematic.
Table 2 HEV main parameters

| Module | Parameter | Value |
| Whole vehicle | Overall vehicle mass / kg | 1360 |
| | Windward area / m2 | 2.25 |
| | Rolling friction coefficient | 0.015 |
| | Air resistance coefficient | 0.3 |
| | Wheel radius / m | 0.306 |
| Engine | Displacement / L | 1.8 |
| | Maximum power / kW | 73 |
| | Max. torque / N·m | 142 |
| Drive motor | Maximum power / kW | 60 |
| | Max. torque / N·m | 207 |
| Generator | Maximum power / kW | 42 |
| | Max. torque / N·m | 144 |
| NiMH battery | Rated voltage / V | 201.6 |
| | Capacity / Ah | 6.5 |
| Power-split planetary gear set | Sun gear / ring gear tooth ratio | 30/78 |
The demand power of the vehicle is expressed by the following equation:

$$P_{req} = \left( Mg\sin\alpha + Mgf\cos\alpha + \frac{1}{2}\rho C_a A v^2 + M\frac{dv}{dt} \right) v / 3600 \tag{1}$$

where $M$ is the whole vehicle mass (kg); $g$ denotes the acceleration of gravity (m·s⁻²); $\alpha$ is the road slope; $f$ is the rolling resistance coefficient; $\rho$ is the air density (kg·m⁻³); $C_a$ is the air resistance coefficient; $A$ is the windward area (m²); and $v$ is the vehicle speed. The power required by the vehicle is provided by both the engine and the battery. The engine in the model is an inline four-cylinder Atkinson-cycle gasoline engine. The MAP of the effective fuel consumption rate for this engine [49] is plotted in Figure 2. The instantaneous fuel consumption of the engine can be calculated by the following equation:

$$Fuel_{eng} = \frac{T_e\, n_e\, b_e}{9.55 \times 10^{6}} \tag{2}$$
Figure 2 Static engine fuel consumption MAP [49].
where $b_e$ is the instantaneous engine fuel consumption rate (g·(kW·h)⁻¹), $T_e$ is the engine torque (N·m), and $n_e$ is the engine speed (rpm). The efficiency MAPs of MG2 [50] and MG1 are adopted in the calculation, so the conversion efficiency between mechanical and electrical energy can be looked up from the corresponding tables. The detailed efficiency MAP charts are placed in the Appendix. In addition, the planetary gear set is an important part of this model for realizing power coupling, and its detailed mathematical description is also given in the Appendix. A simplified battery RC model is used. The battery SOC is estimated as follows:
$$\begin{cases} P_b = U_b I_b - R_b I_b^{2} \\[4pt] SOC(t) = SOC_0 - \dfrac{\displaystyle\int_0^{t} \dfrac{E_b - \sqrt{E_b^{2} - 4000 R_b P_b}}{2 R_b}\, dt}{Q_n} \end{cases} \tag{3}$$

where $Q_n$ is the battery capacity (A·h); $SOC_0$ is the initial SOC value of the battery; $P_b$ is the battery discharge power (kW); $U_b$ is the open-circuit battery voltage (V); $I_b$ is the current (A); $R_b$ is the internal resistance (Ω); and $E_b$ is the battery potential (V).
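To make the backward energy-flow calculation concrete, the sketch below implements Equations (1) and (3) in Python. It is a minimal illustration only: the vehicle parameters come from Table 2, SI units (m/s) are used for speed, and `rho`, `eb`, and `rb` are assumed placeholder values rather than figures reported in the paper.

```python
import numpy as np

# Vehicle parameters from Table 2; rho is an assumed air density (kg/m^3).
M, g, f, Ca, A, rho = 1360.0, 9.81, 0.015, 0.3, 2.25, 1.2

def demand_power_kw(v_mps, dv_dt, slope_rad=0.0):
    """Wheel demand power following Eq. (1); v in m/s, result in kW."""
    force = (M * g * np.sin(slope_rad)
             + M * g * f * np.cos(slope_rad)
             + 0.5 * rho * Ca * A * v_mps ** 2
             + M * dv_dt)
    return force * v_mps / 1000.0

def soc_step(soc, p_batt_kw, dt=1.0, eb=201.6, rb=0.3, qn_ah=6.5):
    """One-step SOC update following Eq. (3).
    eb (battery potential, V) and rb (internal resistance, ohm) are assumed values;
    qn_ah is the rated capacity from Table 2 (converted to A*s below)."""
    # Battery current from P_b = E_b*I_b - R_b*I_b^2, with P_b in kW (hence the 4000 factor).
    ib = (eb - np.sqrt(eb ** 2 - 4000.0 * rb * p_batt_kw)) / (2.0 * rb)
    return soc - ib * dt / (qn_ah * 3600.0)

# Example: 40 km/h cruise with the battery supplying 5 kW for one second.
p_req = demand_power_kw(40 / 3.6, dv_dt=0.0)
print(f"P_req = {p_req:.1f} kW, SOC after 1 s = {soc_step(0.65, 5.0):.5f}")
```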
The test results of the Prius in Ref. [48] under the US06 cycle were utilized to verify the vehicle model in this paper. The simulation reproduces the trajectories of the SOC and the cumulative fuel consumption. According to Figure 3, the difference between the calculated and tested cumulative fuel consumption at the end of the trip is only 2.2%. The calculated SOC curve is basically consistent with the experimentally measured one, and the final discrepancy is 2.1%. This comparison provides confidence that the simulation model can be applied in the subsequent study of this paper.
Figure 3 Comparison between experiments [48] and simulations: (a) SOC; (b) cumulative fuel consumption.
3. Energy management strategy for HEV

3.1 LSTM RNN based speed prediction

Vehicle speed prediction is considered an important part of the EMS for enhancing the efficiency of the powertrain, especially for hybrid electric vehicles [51]. When a traditional neural network model is adopted for vehicle speed prediction, information is isolated between earlier and later time steps because the nodes within each layer are unconnected [52]. However, vehicle speed has the characteristics of a time-series signal, like human language. As a modified RNN, the LSTM recurrent neural network adds feedback from previous time steps and also overcomes the vanishing and exploding gradient problems of the traditional RNN [53], which makes it better suited to processing and predicting vehicle speed data.
The LSTM cell consists of four interacting layers, and the transmission of states is controlled by gates, as shown in Figure 4. The input gate processes the input data. The forget gate discards part of the previous information and retains the important information. The memory gate extracts the valid information and performs the filtering. The LSTM solves the long-term dependency problem that is common in general recurrent neural networks. The state update and output calculations are defined as follows in Equation 4:
$$\begin{cases} i_t = \sigma(W_i X_t + R_i h_{t-1} + b_i) \\ f_t = \sigma(W_f X_t + R_f h_{t-1} + b_f) \\ g_t = \tanh(W_g X_t + R_g h_{t-1} + b_g) \\ o_t = \sigma(W_o X_t + R_o h_{t-1} + b_o) \\ C_t = f_t \odot C_{t-1} + i_t \odot g_t \\ h_t = o_t \odot \tanh(C_t) \end{cases} \tag{4}$$

where $X$, $h$, and $C$ denote the input, output, and memory states, respectively; the weights are expressed by $W$ and $R$, and the biases are denoted by $b$; $\sigma$ and $\tanh$ represent the sigmoid and tanh activation functions, respectively.
Figure 4 Basic principle of LSTM network.
In this study, a model including both an LSTM layer and fully connected layers is employed to forecast the speed within a future time window. A deep network is adopted to ensure that the model can handle complex and nonlinear transportation environments, with 128 units in each layer. Vehicle speed data from the past 20 s are normalized and fed into the network. The predicted speed is transferred to the EMS for better energy planning.
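A minimal PyTorch sketch of such a speed predictor is given below. The layer width (128 units), the 20 s history window, and the 10 s prediction horizon follow the description above; the class name, the number of LSTM layers, and the training step are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SpeedPredictor(nn.Module):
    """LSTM + fully connected speed predictor: past 20 s of speed -> next 10 s."""
    def __init__(self, history=20, horizon=10, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, num_layers=2,
                            batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, horizon))
        self.history, self.horizon = history, horizon

    def forward(self, v_past):
        # v_past: (batch, history, 1) normalized speeds
        out, _ = self.lstm(v_past)
        return self.fc(out[:, -1, :])      # (batch, horizon) predicted speeds

# Example usage with one batch of normalized speed histories.
model = SpeedPredictor()
v_past = torch.rand(8, 20, 1)              # 8 samples, 20 s of past speed
v_future = model(v_past)                   # 8 x 10 predicted future speeds
loss = nn.MSELoss()(v_future, torch.rand(8, 10))
loss.backward()                            # standard supervised training step
```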
3.2 TD3 based EMS

The DDPG and TD3 algorithms can be considered as a combination of deterministic policy algorithms and deep neural networks, and can also be regarded as an extension of deep Q-learning to the continuous action space. Both a Q-value function (critic) and a policy function (actor) are constructed in DDPG, with the critic networks updated by the same temporal-difference method as the deep Q algorithm. The policy function is updated by the policy gradient method using the estimate of the Q-value function. In the deep deterministic policy gradient algorithm, the actor network plays the role of a deterministic policy function, denoted as $\pi(s)$. Each action is computed directly through $a_t = \pi(s_t|\phi) + N$ without sampling from a stochastic policy. The random noise $N$ is added during the training process to balance exploration and exploitation under such a deterministic policy.
The critic networks are updated via the Bellman equation in the DDPG algorithm framework. The next state $s_{t+1}$ and the reward $r_t$ are obtained from the environment. DDPG computes the target value $y_i$ and minimizes the loss function using the gradient descent algorithm, as in Equation 5:

$$Loss = \frac{1}{L}\sum_{i=1}^{L}\left( y_i - Q(s_i, a_i) \right)^2 \tag{5}$$

The policy function $\pi$ is updated by applying the chain rule to the expected return $J$, using mini-batch samples as in Equation 6:

$$\nabla_{\phi} J(\phi) = \frac{1}{L}\sum \nabla_a Q_1(s, a)\Big|_{a=\pi(s)} \nabla_{\phi}\pi(s) \tag{6}$$

The deep deterministic policy gradient algorithm softly updates the target networks with exponential smoothing instead of direct parameter replacement to improve the stability of learning, as in Equation 7:

$$\begin{cases} \theta_i' \leftarrow \tau \theta_i + (1-\tau)\theta_i' \\ \phi' \leftarrow \tau \phi + (1-\tau)\phi' \end{cases} \tag{7}$$
In DDPG, the maximum-Q action is always chosen for each state, which makes the algorithm unusually sensitive to overestimated Q values of the corresponding actions. Double critic networks are therefore introduced in TD3, and the minimum of the two Q values is used to compute the Bellman target. TD3 also reduces the update frequency of the policy network, which allows a smaller variance in the estimate of the Q-value function and results in higher-quality policy updates. By adding truncated, normally distributed noise as a regularization to each action in the target network, TD3 avoids overfitting to narrow peaks in the value estimate. The framework of the TD3 algorithm is shown in Figure 5 and the pseudocode is listed in Table 3.

Figure 5 The framework of TD3 algorithm.
Table 3 The pseudocode of TD3 algorithm.

Algorithm: Twin Delayed Deep Deterministic Policy Gradient
  Initialize critic networks Q_θ1, Q_θ2 and actor network π_φ with random parameters θ1, θ2 and φ.
  Initialize target networks θ1' ← θ1, θ2' ← θ2, φ' ← φ.
  Initialize experience pool E with capacity N.
  Initialize expert experience pool EE through DP.
  For episode = 1 to M do
    For t = 1 to T do
      Select an action with Gaussian exploration noise, a = π_φ(s) + ε, ε ~ N(0, σ), and observe reward r and next state s'.
      Store the transition tuple (s, a, r, s') in E.
      If t < K then
        Select a tuple (s_EE, a_EE, r_EE, s_EE') randomly from EE and store it in E.
      If the number of stored tuples = N then
        Sample a mini-batch of L transitions from E.
        ã ← π_φ'(s') + ε, ε ~ clip(N(0, σ̃), −c, c), and y ← r + γ·min_{i=1,2} Q_θi'(s', ã).
        Update the critic networks: θ_i ← argmin_{θi} L⁻¹ Σ (y − Q_θi(s, a))².
        If t mod d then
          Update φ by the deterministic policy gradient:
            ∇_φ J(φ) = L⁻¹ Σ ∇_a Q_θ1(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s)
          Soft-update the target networks:
            θ_i' ← τθ_i + (1−τ)θ_i',  φ' ← τφ + (1−τ)φ'
        End if
      End if
    End for
  End for
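For readers who prefer code to pseudocode, the sketch below shows one TD3 update step in PyTorch, mirroring Table 3 (twin critics, clipped target-policy noise, delayed actor updates, and soft target updates). Network sizes, hyperparameter values, and the mini-batch interface are illustrative assumptions rather than the exact implementation used in this work.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(inp, out, out_act=nn.Identity):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out), out_act())

state_dim, action_dim = 5, 2          # s = {v, P_req, SOC, X, v_mean}; a = normalized {eng_speed, eng_torque}
actor = mlp(state_dim, action_dim, out_act=nn.Tanh)     # actions scaled to [-1, 1]
critic1 = mlp(state_dim + action_dim, 1)
critic2 = mlp(state_dim + action_dim, 1)
actor_t, critic1_t, critic2_t = map(copy.deepcopy, (actor, critic1, critic2))
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)

def td3_update(batch, step, gamma=0.99, tau=0.005, noise=0.2, clip=0.5, d=2):
    s, a, r, s2 = batch               # tensors: (L, 5), (L, 2), (L, 1), (L, 5)
    with torch.no_grad():
        eps = (torch.randn_like(a) * noise).clamp(-clip, clip)   # target policy smoothing
        a2 = (actor_t(s2) + eps).clamp(-1.0, 1.0)
        q_target = torch.min(critic1_t(torch.cat([s2, a2], 1)),
                             critic2_t(torch.cat([s2, a2], 1)))  # twin critics -> take the minimum
        y = r + gamma * q_target
    sa = torch.cat([s, a], 1)
    critic_loss = F.mse_loss(critic1(sa), y) + F.mse_loss(critic2(sa), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    if step % d == 0:                 # delayed policy update
        actor_loss = -critic1(torch.cat([s, actor(s)], 1)).mean()
        opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)        # soft target update, Eq. (7)
```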
Fuel economy is one of the key concerns of an HEV EMS; in addition, the SOC level should be kept steady to ensure a healthy battery condition. Therefore, the reward function of the DRL agent in this paper is given in Equation (8):

$$reward = \begin{cases} -\left( m_f(t) + \beta_1 \left( SOC(t) - SOC_{ref} \right)^2 \right) & (0.4 \le SOC \le 0.8) \\ -\left( m_f(t) + \beta_2 \left( SOC(t) - SOC_{ref} \right)^2 \right) & (\text{else}) \end{cases} \tag{8}$$

where $m_f(t)$ represents the fuel consumption in step $t$, $SOC(t)$ represents the value of the SOC at time $t$, and $SOC_{ref}$ equals 0.65 in this paper. $\beta_1$ and $\beta_2$ are weights that regulate the balance between fuel consumption and SOC maintenance. As shown in Figure 6, too small a $\beta_1$ causes poor battery charge protection, while too large a value sacrifices fuel economy; therefore $\beta_1 = 1$ is selected as a compromise. $\beta_2 > 1$ is set to guide the agent away from extreme SOC levels.
Figure 6 The effect of the weight β1 in the cost equation on SOC maintenance and fuel consumption, calculated by DP under the WLTP cycle.
In this paper, the vehicle speed $v$, demand power $P_{req}$, SOC, proportion of miles traveled $X$, and expected future average speed $\bar{v}$ are chosen as the vehicle state variables, i.e., $s = \{v, P_{req}, SOC, X, \bar{v}\}$. The engine operating conditions are selected as the control variables, including continuous speed and torque, i.e., $a = \{\omega_{eng}, T_{eng}\}$. The control variable of the one-dimensional control algorithm involved in this paper is the engine speed only, with the engine torque then determined from the best BSFC curve.
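A small Python sketch of the state vector and the reward of Equation (8) is shown below; it is illustrative only, the per-step fuel mass `fuel_g` is assumed to come from the engine model, and the weight values follow the assumptions stated above (β1 = 1, β2 > 1, SOC_ref = 0.65).

```python
import numpy as np

SOC_REF, BETA1, BETA2 = 0.65, 1.0, 5.0   # BETA2 > 1 is an assumed illustrative value

def state_vector(v, p_req, soc, x_progress, v_future_mean):
    """s = {v, P_req, SOC, X, v_mean}: speed, demand power, SOC, trip progress,
    and the mean of the LSTM-predicted speeds over the next 10 s."""
    return np.array([v, p_req, soc, x_progress, v_future_mean], dtype=np.float32)

def reward(fuel_g, soc):
    """Eq. (8): penalize fuel use plus squared SOC deviation; heavier weight outside [0.4, 0.8]."""
    beta = BETA1 if 0.4 <= soc <= 0.8 else BETA2
    return -(fuel_g + beta * (soc - SOC_REF) ** 2)

print(reward(fuel_g=0.8, soc=0.62), reward(fuel_g=0.8, soc=0.35))
```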
3.3 Assisted learning strategies

At the beginning of agent training, the parameters of the policy network are initialized randomly, so the control variables decided by the agent are often counterintuitive. What is worse, large noise is added to the policy in the initial iterations in order to explore a larger space. At this stage, the agent is biased toward exploring and perceiving the boundaries of the solution space. This phase may pass slowly, and the experience obtained in it is of limited help in optimizing the DRL networks. For more efficient learning, prior knowledge is mixed into the experience pool at the beginning of training in this paper to improve the initial policy optimization.
The SOC trajectories designed by the DP algorithm under the WLTC cycle are shown in Figure 7, with the cost function kept consistent with Equation 8. The global optimal control strategy ensures a low fuel consumption level while keeping the battery power as balanced as possible. In order to simulate possible extreme battery power levels, the decision tracks obtained by DP with different starting times (t = 0 s, 600 s, and 1200 s) and different SOC initializations (SOC = 0.4 and 0.8) are also included in the prior experience database.
Figure 7 The SOC trajectory by DP at different initial conditions.
Physical constraints exist for each component of the drive system. As shown in Equation 9, the maximum output torque varies with speed, and the discharge and charging power of the battery should also be kept within a safe range. Due to the mechanical and electrical coupling between the devices, as shown in Figure 8(a), when the vehicle speed is small and the demand power is moderate, the engine power output cannot be too large, to avoid exceeding the maximum battery charging power; when the vehicle speed increases and the demand power is large, the minimum engine output power is limited by the battery discharge power under the premise of providing enough demand power. The legal control space boundary of the engine thus changes with the driving conditions, which undoubtedly increases the difficulty of the task for the deep reinforcement learning algorithm.

$$\begin{cases} T_e \le T_{\max,e}(n_e) \\ \left| T_m \right| \le T_{\max,m}(n_m) \\ \left| T_g \right| \le T_{\max,g}(n_g) \\ P_{discharge} \le P_{bat} \le P_{charge} \end{cases} \tag{9}$$
Figure 8 The legal engine operation region in different driving conditions: (a) vehicle speed 40 km/h, wheel demand torque 240 N·m, SOC = 0.65; (b) 80 km/h, 430 N·m and 0.65, respectively.
Unnecessary and unreasonable engine torque and speed commands should be avoided because of these practical physical limitations. However, when using DRL-based EMSs, attention is rarely paid to such dynamic limitation rules [28, 31, 54]. To improve the exploration efficiency and avoid illegal operating points, a local controller (LC) based on heuristic rules is designed in this paper, as shown in Figure 9. By embedding the LC, real-time monitoring of the output of the online actor network is achieved. When the action decided by the agent falls outside the control domain boundary and cannot satisfy the constraints in Equation 9, the control action is readjusted by the LC according to the rule-logic EMS.

The rule-based control logic in Figure 9 adopts the CD-CS strategy, where P_req refers to the instantaneous demand power, SOC_min and SOC_max denote the lower and upper SOC limits, respectively, P_eng_max refers to the maximum engine power, and P_mot_max denotes the maximum power of the drive motor at the current speed. The hybrid energy allocation strategy combining the RBS and DRL is designed not only to reduce the oscillation of the DRL control strategy caused by irrational exploration but also to keep its powerful adaptive capability for dynamic driving conditions.
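The sketch below illustrates one way such a rule-based local controller could be written around the flow of Figure 9. The threshold names mirror the text (SOC_min, SOC_max, P_eng_max, P_mot_max), while P_ref, P_chg, and the helper callables `is_valid` and `power_to_operating_point` are simplified assumptions rather than the exact rules used by the authors.

```python
def local_controller(p_req, soc, eng_action, limits):
    """Override an illegal DRL engine action with a CD-CS style fallback.

    eng_action: (engine_speed_rpm, engine_torque_Nm) proposed by the actor.
    limits: dict with SOC_min, SOC_max, P_eng_max, P_mot_max, P_ref, P_chg and a
    callable is_valid(p_req, soc, action) implementing the Eq. (9) checks.
    """
    if limits["is_valid"](p_req, soc, eng_action):
        return eng_action                       # action is legal: pass it through unchanged

    # Fallback CD-CS logic (simplified): decide an engine power, then map it
    # to a speed/torque pair on a feasible operating line.
    if soc < limits["SOC_min"]:                 # battery low: engine supplies demand plus a charge reserve
        p_eng = min(p_req + limits["P_chg"], limits["P_eng_max"])   # P_chg: assumed charging reserve
    elif soc > limits["SOC_max"]:               # battery high: prefer electric drive
        p_eng = 0.0 if p_req <= limits["P_mot_max"] else p_req - limits["P_mot_max"]
    else:                                       # charge-sustaining band
        p_eng = p_req if p_req > limits["P_ref"] else 0.0

    return limits["power_to_operating_point"](p_eng)
```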
Figure 9 Flow chart of the local controller.
The overall control flow of this study is shown in Figure 10; it mainly includes the vehicle speed prediction module, the DP experience library, the local controller, and the TD3-based EMS. The vehicle simulation model and different test cycles are used as the training environment for the DRL agent, and all parts collaborate in the simulation experiments.
Figure 10 Framework of EMS based on TD3 and LSTM.
4. Training and results

The deep reinforcement learning code used in this paper relies on the Python PyTorch toolkit. The computing platform includes an i7 4.0 GHz CPU and 32 GB of RAM, and a GPU with 2 GB of memory is exploited to accelerate the training of the neural networks. The computation time is about 4 h per 1000 epochs.
4.1 Training the LSTM based speed prediction model

In general, the forecast accuracy is considered to depend strongly on the selection of the training datasets [55], and test data are therefore often selected to target certain scenarios. For greater versatility and migration capability of the LSTM-based vehicle speed prediction model, several test cycles covering different driving conditions such as urban, highway, and suburban roads, including WLTP, FTP75, US06, UDDS, and JE05, with a total duration of about 7300 seconds, are adopted to train the network. Longer prediction windows tend to offer more optimization potential [51], yet are often limited by poorer prediction performance.

In this paper, prediction horizons of 5 s, 10 s, and 15 s are compared. The comparisons between predicted and actual speeds under the WLTP test cycle are shown in Figure 11. The models with 5 s and 10 s prediction horizons basically reflect the future speed trend, whereas the result of the model with a 15 s prediction horizon differs noticeably from the real speed. As shown in Figure 12(a), not only does the root mean square error (RMSE) of the vehicle speed increase as the prediction window grows, but the network also becomes more unstable during the iterative process. Figure 12(b) demonstrates the robustness of these models. The validation data include three test cycles (NEDC, CLTC_P, and FTP72), and the WLTP training data are also displayed for comparison. The largest RMSE difference between training and test data is exhibited when the prediction horizon is 15 s. Considering the above analysis, the prediction horizon is specified as 10 s in this paper.
46
47 Actal Actal
140 a. Prediction horizon=5s 140 b. Prediction horizon=10s
48 Prediction Prediction
120 120
49
100 100
50
Speed(km/h)

Speed(km/h)

80 80
51
60 60
52
40 40
53
20 20
54
0 0
55 0 200 400 600 800 1000 1200 1400 1600 1800 0 200 400 600 800 1000 1200 1400 1600 1800
56 Time(s) Time(s)

57
58
59
60
61
26
62
63
64
65
140 Actal
1 c. Prediction horizon=15s
Prediction
120
2
100
3

Speed(km/h)
80
4
5 60

6 40

7 20

8 0
0 200 400 600 800 1000 1200 1400 1600 1800
9 Time(s)
10
11 Figure 11 Comparison of predicted and actual vehicle speed under WLTP cycle.
Figure 12 Comparison of LSTM NN performance at different forecast horizons.
4.2 Convergence performance

In the early stage of the iterations, in order to explore the rewards of a larger action space, large additional noise is applied to the action decided by the DRL policy network. As the neural networks stabilize after repeated updates, excessive exploration noise is no longer conducive to the convergence of the algorithm, so the noise is adjusted dynamically as the exploration proceeds. In the continuous DRL control algorithms in this paper, the exploration action is drawn from a Gaussian probability distribution with an initial standard deviation of 0.5, while the discrete DRL control algorithms use an ε-greedy strategy with an initial ε of 0.9. In the following, 1D and 2D denote one- and two-dimensional continuous control spaces, respectively, and the suffix "A" indicates that the EMS uses the assisted learning strategies described in Section 3.3.
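A simple exponential decay of the exploration parameters, consistent with the initial values quoted above (σ = 0.5 for the continuous algorithms, ε = 0.9 for the discrete ones), could look like the sketch below; the decay rate and floors are assumed values, since the paper does not report the exact schedule.

```python
def exploration_schedule(epoch, sigma0=0.5, eps0=0.9,
                         decay=0.995, sigma_min=0.05, eps_min=0.05):
    """Per-epoch exploration parameters: Gaussian std for TD3/DDPG, epsilon for DQN/DDQN."""
    sigma = max(sigma_min, sigma0 * decay ** epoch)   # noise added to continuous actions
    eps = max(eps_min, eps0 * decay ** epoch)         # probability of a random discrete action
    return sigma, eps

for epoch in (0, 100, 500):
    print(epoch, exploration_schedule(epoch))
```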
As shown in Figure 13(a), the epoch reward of all DRL algorithms used in this paper tends to stabilize after 100 epochs; however, significant uncertainty is still exhibited by the discrete control algorithms DQN and DDQN, which is consistent with the study of Zhou et al. [37]. The performance of the DDPG algorithm is limited, possibly due to its overly optimistic estimation of the Q values, and the same conclusion can also be drawn from Figure 13(b).
Figure 13 Comparison of convergence of DRL algorithms: (a) without assisted learning strategy; (b) with assisted learning strategy.
Comparing Figures 13(a) and 13(b), DRL with the assisted learning strategies achieves faster convergence and better stability during the exploration process. In addition, the simulation experiments in this paper suggest that DRL with multidimensional control is more dependable, which may be due to its larger fault-tolerance space. Besides, although the best epoch reward of conventional TD3 with 2D control is slightly worse than that with 1D control, a significant improvement is achieved with the assistance of the LC and prior knowledge, which demonstrates the greater optimization potential of the multidimensional engine control mode.
4.3 Fuel consumption and charge maintenance

Table 4 summarizes the comparison of fuel consumption and termination SOC among the various EMSs in this paper, where the DDQN algorithm performs the worst. Despite the lower fuel consumption level of the DQN algorithm, its final SOC is only about 0.63, which means a partial loss of battery energy. The algorithms with the assisted learning strategy show more or less improvement in energy consumption, the most satisfactory being TD3-2D-A, which achieves about a 3% reduction in fuel consumption while increasing the energy stored in the battery. It is also noted that good fuel consumption is easier to obtain with the one-dimensional control along the optimal BSFC curve.
Table 4 Comparison of economics under WLTP cycle.

| Algorithm | Fuel consumption (g/100km) | Termination SOC | Fuel economy compared with DP (%) |
| DP | 3148.26 | 0.6498 | 100 |
| CD-CS | 3562.31 | 0.6522 | 113.15 |
| DQN | 3410.21 | 0.6284 | 108.32 |
| DDQN | 3580.91 | 0.6201 | 113.74 |
| TD3-1D | 3396.60 | 0.6599 | 107.89 |
| TD3-1D-A | 3390.23 | 0.6798 | 107.69 |
| DDPG-2D | 3397.26 | 0.6513 | 107.91 |
| DDPG-2D-A | 3385.36 | 0.6530 | 107.53 |
| TD3-2D | 3391.97 | 0.6519 | 107.74 |
| TD3-2D-A | 3326.74 | 0.6691 | 105.67 |
As shown in Figure 14, compared with the other algorithms, the engine operating points of the DP control strategy are concentrated in the 1000-2500 RPM range, and are especially widely distributed in the lowest fuel consumption interval of 1500-2500 RPM, with an average fuel consumption rate of about 217 g/kW·h. TD3-1D achieves low fuel consumption levels, with most of its operating points below 222 g/kW·h. TD3-1D-A, despite working briefly in the 3500-4500 RPM range, has significantly less engine operating time than TD3-1D and thus shows a smaller total energy consumption.
Figure 14 Engine operating point distribution of EMSs with 1-D control space.
On the other hand, TD3-1D-A achieves smoother battery SOC variation than the conventional TD3 algorithm, with an SOC fluctuation range of [0.5, 0.7], as shown in Figure 15(a), which is beneficial to battery longevity. In contrast, the CD-CS control strategy cannot guarantee battery health because of its overly frequent adjustment of engine power.
Figure 15 SOC trajectory of EMSs with control space of (a) 1-D and (b) 2-D.
Figure 16 displays the engine operating points of the two-dimensional control EMSs. The number of operating points with engine torque less than 50 N·m is significantly reduced for the DRL-A strategies, which means that these control strategies avoid selecting actions in the high fuel consumption intervals. Operating points with a fuel consumption rate around 220 g/kW·h chosen by the DDPG-2D-A EMS are significantly more numerous than those chosen by DDPG-2D. In particular, almost all actions of TD3-2D-A lie within 230 g/kW·h and its operating points are highly concentrated. The assisted learning strategy also enhances the ability to maintain battery power, as shown in Figure 15(b). In particular, TD3-2D-A achieves a very satisfactory SOC trend with values between [0.55, 0.75], while the SOC fluctuation range under the control trajectory planned by the DP algorithm is [0.6, 0.75].
Figure 16 Engine operating point distribution of EMSs with 2-D control space.
Since losses exist in the conversion between electrical and mechanical energy, a control action that is optimal in terms of engine fuel consumption does not necessarily yield the optimal overall vehicle energy consumption. The losses due to energy conversion in the motor and generator are plotted in Figure 17. In Figure 17(a), the energy conversion loss is about 0.8-4.8 kW and occupies about 6.2%-24.8% of the total energy consumption, while the optimal BSFC curve just misses the high-efficiency area. There is only a very narrow overlap between the BSFC curve and the low-loss interval in Figure 17(b). Therefore, although reducing the control dimension of the power-split hybrid according to the engine operating characteristics facilitates faster search and convergence of the DRL algorithm, the complex coupling of the engine-generator-motor-battery system may bottleneck its optimal strategy. This is one of the reasons why the energy efficiency of the 1-D control EMSs is not significantly better than that of the 2-D ones in Table 4.
Figure 17 Energy conversion loss caused by MG1 and MG2 under different engine actions: (a) v = 40 km/h, Preq = 12 kW; (b) v = 78 km/h, Preq = 35 kW. Blank areas indicate meaningless operating points due to physical limitations.
The detailed engine and battery power distributions of the EMSs discussed here are shown in Figure 18. The DP-based control strategy ensures relatively small battery charging and discharging power by mobilizing the engine to meet the vehicle demand power. The EMS based on TD3-2D-A keeps the engine power basically around 10 kW during the first 1100 s while assuring the battery charge level. For the first 300 s of DDPG-2D, DDPG-2D-A, and TD3-1D-A, the engine does not provide torque output, resulting in a battery energy deficit. After that, however, the EMS based on TD3-1D-A makes the engine output high power, which brings the battery SOC back to around 0.65.
Figure 18 Engine and battery power allocations of the EMSs in this study.
4.4 Migration ability examination

The policy optimization of deep reinforcement learning algorithms depends on feedback from the environment [38], so the parameters adapt in response to dynamic environments. However, the current literature mostly focuses on the performance of DRL in the training environment [33, 37, 38, 56], with little attention to the transfer learning capability of DRL when applied to hybrid vehicle energy management problems. To evaluate the generalization ability of the proposed strategy, a TD3-2D-A based EMS well trained under the WLTC driving cycle is deployed in the FTP75 and CLTC_P test environments; online learning is then carried out and the performance is observed at intervals.
30
31
32 satisfactory fuel consumption performance under the unexperienced test road. The fuel
33
34
35 economy of proposed EMS was only 105.6% and 107.3% of that of DP algorithm in
36
37
38 FTP75 and CLTC_P cycle, respectively, which was attributed to the trained TD3-2D-A
39
40
41
agent that preferred choosing actions with high engine efficiency. The energy
42
43 consumption level was further improved during the transfer learning process. The
44
45
46 trained model basically met the requirements for battery power maintenance in both
47
48
49 new environments, however, the terminated SOC value was only 0.55 in CLTC_P
50
51
52
operating conditions, which indicated that the EMS failed to cope with sudden
53
54 acceleration near the end. Overall, the migration learning made the agent more and
55
56
57 more adapted to the new driving environment, and the SOC trajectory kept approaching
58
59
60 the results designed by DP algorithm, especially in CLTC_P test cycle. The results of
61
34
62
63
64
65
1 this paper demonstrated the effectiveness of the DRL based EMSs for migration
2
3
4
learning.
5
Figure 19 SOC trajectory of transfer learning under FTP75 and CLTC_P cycle.
Table 5 Fuel consumption performance of transfer learning.

| Iteration | FTP75 (g/100km) | CLTC_P (g/100km) |
| DP | 3010.6 | 2798.3 |
| 0th | 3180.3 (105.63%) | 3002.3 (107.29%) |
| 50th | 3244.1 (107.76%) | 2987.2 (106.75%) |
| 100th | 3139.3 (104.27%) | 2938.4 (105.01%) |

Percentages are relative to the DP result.
39 5. Conclusion
40
41
42
43
44
In this paper, an energy management strategy combining an LSTM speed prediction model with TD3 deep reinforcement learning was proposed for power-split hybrid vehicles, featuring multidimensional continuous control that directly commands engine speed and torque. First, a simplified simulation model of the power-split HEV was built, and comparative verification of its energy consumption against test results confirmed the reliability of the model. The LSTM neural network with time-series characteristics was used to predict the vehicle speed; the RMSEs on the training and test sets were 1.52 km/h and 1.01 km/h, respectively, for a prediction horizon of 10 s. Considering the physical limitations of each power component of the HEV, a local rule-based controller and DP prior knowledge were embedded in the EMS to cope with the dynamic control-space boundary and to accelerate the convergence of the algorithm. The performance of the EMSs based on the TD3, DDPG, DQN, and Double-DQN algorithms was systematically compared: TD3 performed best in reducing energy consumption and maintaining the SOC level, with its optimal fuel consumption differing from the DP result by only about 5% while keeping the battery SOC stable around the specified reference.

In addition, both the simulation experiments and the energy-conversion loss analysis suggested that reducing the control dimension of the DRL based EMS for the power-split HEV by embedding the engine's optimal BSFC curve may create a bottleneck for optimization. Furthermore, this paper tested the migration capability of the DRL EMS by placing a well-trained agent into a new driving environment. The results showed that, despite the expected flaws at the beginning, the performance of the EMS improved continuously as transfer learning proceeded, demonstrating the strong self-adaptive capability of the DRL based EMS.
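As an illustration of the two-dimensional continuous action interface summarized above, the following sketch shows how a raw TD3 action in [-1, 1]^2 could be mapped onto physically feasible engine speed and torque commands; the numerical limits and the speed-dependent torque bound are placeholders for illustration, not the calibrated values of this study.

```python
import numpy as np

# Hypothetical engine limits (placeholders, not the calibrated values of this paper).
ENGINE_SPEED_RANGE = (1000.0, 4500.0)   # rpm
ENGINE_TORQUE_MAX = 140.0               # Nm, assumed flat limit for simplicity

def map_action(raw_action, max_torque_at=lambda speed: ENGINE_TORQUE_MAX):
    """Map a TD3 action in [-1, 1]^2 to a feasible (engine speed, engine torque) pair.

    The torque bound may depend on the commanded speed, which is how a local
    controller can enforce a dynamic control-space boundary at every time step.
    """
    a_speed, a_torque = np.clip(raw_action, -1.0, 1.0)
    lo, hi = ENGINE_SPEED_RANGE
    speed = lo + 0.5 * (a_speed + 1.0) * (hi - lo)           # rescale to [lo, hi] rpm
    torque = 0.5 * (a_torque + 1.0) * max_torque_at(speed)   # respect speed-dependent limit
    return speed, torque
```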
Exploration in real environments is considered expensive for model-free deep reinforcement learning methods. The hybrid energy management strategy proposed in this paper therefore checks whether each decision made by the DRL EMS is legitimate; otherwise, the RBS controller takes over. However, this work was based on simulation, and safety-related factors such as computation time and irreversible device wear still require further verification in real vehicles.
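The legitimacy check can be summarized by the following minimal sketch; the feasibility conditions and the rule_based_action fallback are assumptions used for illustration rather than the exact rules of this study.

```python
def safe_action(drl_action, state, limits, rule_based_action):
    """Return the DRL action if it is feasible, otherwise fall back to the RBS controller.

    `limits` is assumed to provide the current engine speed/torque bounds and the
    allowed SOC window; `rule_based_action(state)` is the rule-based fallback policy.
    """
    speed, torque = drl_action
    feasible = (
        limits.speed_min <= speed <= limits.speed_max
        and 0.0 <= torque <= limits.torque_max(speed)
        and limits.soc_min <= state.soc <= limits.soc_max
    )
    return drl_action if feasible else rule_based_action(state)
```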
Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (xtr012019002).
References
[1] Zhao M, Sun T. Dynamic spatial spillover effect of new energy vehicle industry policies on carbon emission of transportation sector in China. Energy Policy. 2022;165:112991.
[2] Gupta RS, Tyagi A, Anand S. Optimal allocation of electric vehicles charging infrastructure, policies and future trends. Journal of Energy Storage. 2021;43:103291.
[3] Liu Y, Huang B, Yang Y, Lei Z, Zhang Y, Chen Z. Hierarchical speed planning and energy management for autonomous plug-in hybrid electric vehicle in vehicle-following environment. Energy. 2022;260:125212.
[4] Han L, Yang K, Ma T, Yang N, Liu H, Guo L. Battery life constrained real-time energy management strategy for hybrid electric vehicles based on reinforcement learning. Energy. 2022;259:124986.
[5] Saiteja P, Ashok B. Critical review on structural architecture, energy control strategies and development process towards optimal energy management in hybrid vehicles. Renewable and Sustainable Energy Reviews. 2022;157:112038.
[6] Wang Z, Wei H, Xiao G, Zhang Y. Real-time energy management strategy for a plug-in hybrid electric bus considering the battery degradation. Energy Conversion and Management. 2022;268:116053.
[7] Zhang P, Yan F, Du C. A comprehensive analysis of energy management strategies for hybrid electric vehicles based on bibliometrics. Renewable and Sustainable Energy Reviews. 2015;48:88-104.
[8] Peng JK, Fan H, He HW, Pan D. A Rule-Based Energy Management Strategy for a Plug-in Hybrid School Bus Based on a Controller Area Network Bus. Energies. 2015;8(6):5122-42.
[9] Chen Z, Zhang H, Xiong R, Shen W, Liu B. Energy management strategy of connected hybrid electric vehicles considering electricity and oil price fluctuations: A case study of ten typical cities in China. Journal of Energy Storage. 2021;36:102347.
[10] Leroy T, Vidal-Naquet F, Tona P. Stochastic Dynamic Programming based Energy Management of HEV's: an Experimental Validation. IFAC Proceedings Volumes. 2014;47(3):4813-8.
[11] Zhang N, Ma XH, Jin LM. Energy Management for Parallel HEV Based on PMP Algorithm. 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE). 2017:177-82.
[12] Sanchez M, Delprat S. Hybrid vehicle energy management: Avoiding the explicit Hamiltonian minimization. 2018 IEEE Vehicle Power and Propulsion Conference (VPPC). 2018.
[13] Panday A, Bansal HO. Energy management strategy for hybrid electric vehicles using genetic algorithm. J Renew Sustain Ener. 2016;8(1).
[14] Li Y, Luo Y, Lu X. PHEV Energy Management Optimization Based on Multi-Island Genetic Algorithm. SAE International; 2022.
[15] Hwang H-Y. Developing Equivalent Consumption Minimization Strategy for Advanced Hybrid System-II Electric Vehicles. Energies. 2020;13(8):2033.
[16] Gong C, Hu M, Li S, Zhan S, Qin D. Equivalent consumption minimization strategy of hybrid electric vehicle considering the impact of driving style. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering. 2018;233(10):2610-23.
[17] Zhao ZC, Xun J, Wan X, Yu RC. MPC Based Hybrid Electric Vehicles Energy Management Strategy. IFAC-PapersOnLine. 2021;54(10):370-5.
[18] Guo JQ, He HW, Peng JK, Zhou NN. A novel MPC-based adaptive energy management strategy in plug-in hybrid electric vehicles. Energy. 2019;175:378-92.
[19] Rodriguez R, Trovão JPF, Solano J. Fuzzy logic-model predictive control energy management strategy for a dual-mode locomotive. Energy Conversion and Management. 2022;253:115111.
[20] Sabri MFM, Danapalasingam KA, Rahmat MF. A review on hybrid electric vehicles architecture and energy management strategies. Renewable and Sustainable Energy Reviews. 2016;53:1433-42.
[21] Feiyan Q, Weimin L. A Review of Machine Learning on Energy Management Strategy for Hybrid Electric Vehicles. 2021 6th Asia Conference on Power and Electrical Engineering (ACPEE). 2021. p. 315-9.
[22] Venkatasatish R, Dhanamjayulu C. Reinforcement learning based energy management systems and hydrogen refuelling stations for fuel cell electric vehicles: An overview. International Journal of Hydrogen Energy. 2022;47(64):27646-70.
[23] Li W, Cui H, Nemeth T, Jansen J, Ünlübayir C, Wei Z, et al. Deep reinforcement learning-based energy management of hybrid battery systems in electric vehicles. Journal of Energy Storage. 2021;36:102355.
[24] Wang H, Ye Y, Zhang J, Xu B. A comparative study of 13 deep reinforcement learning based energy management methods for a hybrid electric vehicle. Energy. 2023;266:126497.
[25] Bo L, Han L, Xiang C, Liu H, Ma T. A Q-learning fuzzy inference system based online energy management strategy for off-road hybrid electric vehicles. Energy. 2022;252.
[26] Hu Y, Li W, Xu K, Zahid T, Qin F, Li C. Energy Management Strategy for a Hybrid Electric Vehicle Based on Deep Reinforcement Learning. Applied Sciences. 2018;8(2):187.
[27] Han X, He H, Wu J, Peng J, Li Y. Energy management based on reinforcement learning with double deep Q-learning for a hybrid electric tracked vehicle. Applied Energy. 2019;254:113708.
[28] Wu Y, Tan H, Peng J, Zhang H, He H. Deep reinforcement learning of energy management with continuous control strategy and traffic information for a series-parallel plug-in hybrid electric bus. Applied Energy. 2019;247:454-66.
[29] Liu ZE, Zhou Q, Li YF, Shuai SJ. An Intelligent Energy Management Strategy for Hybrid Vehicle with irrational actions using Twin Delayed Deep Deterministic Policy Gradient. IFAC-PapersOnLine. 2021;54(10):546-51.
[30] Huang R, He H, Zhao X, Wang Y, Li M. Battery health-aware and naturalistic data-driven energy management for hybrid electric bus based on TD3 deep reinforcement learning algorithm. Applied Energy. 2022;321:119353.
[31] Wu C, Ruan J, Cui H, Zhang B, Li T, Zhang K. The application of machine learning based energy management strategy in multi-mode plug-in hybrid electric vehicle, part I: Twin Delayed Deep Deterministic Policy Gradient algorithm design for hybrid mode. Energy. 2023;262:125084.
[32] Zhang H, Liu S, Lei N, Fan Q, Li SE, Wang Z. Learning-based supervisory control of dual mode engine-based hybrid electric vehicle with reliance on multivariate trip information. Energy Conversion and Management. 2022;257:115450.
[33] Lian R, Peng J, Wu Y, Tan H, Zhang H. Rule-interposing deep reinforcement learning based energy management strategy for power-split hybrid electric vehicle. Energy. 2020;197:117297.
[34] Du G, Zou Y, Zhang X, Liu T, Wu J, He D. Deep reinforcement learning based energy management for a hybrid electric vehicle. Energy. 2020;201:117591.
[35] Qi C, Zhu Y, Song C, Cao J, Xiao F, Zhang X, et al. Self-supervised reinforcement learning-based energy management for a hybrid electric vehicle. Journal of Power Sources. 2021;514:230584.
[36] Wang Y, Tan H, Wu Y, Peng J. Hybrid Electric Vehicle Energy Management With Computer Vision and Deep Reinforcement Learning. IEEE Transactions on Industrial Informatics. 2021;17(6):3857-68.
[37] Zhou J, Xue S, Xue Y, Liao Y, Liu J, Zhao W. A novel energy management strategy of hybrid electric vehicle via an improved TD3 deep reinforcement learning. Energy. 2021;224:120118.
[38] Qi C, Zhu Y, Song C, Yan G, Xiao F, Da W, et al. Hierarchical reinforcement learning based energy management strategy for hybrid electric vehicle. Energy. 2022;238:121703.
[39] Sun W, Zou Y, Zhang X, Guo N, Zhang B, Du G. High robustness energy management strategy of hybrid electric vehicle based on improved soft actor-critic deep reinforcement learning. Energy. 2022:124806.
[40] Zhou J, Xue Y, Xu D, Li C, Zhao W. Self-learning energy management strategy for hybrid electric vehicle via curiosity-inspired asynchronous deep reinforcement learning. Energy. 2022;242:122548.
[41] Xu J, Li Z, Du G, Liu Q, Gao L, Zhao Y. A Transferable Energy Management Strategy for Hybrid Electric Vehicles via Dueling Deep Deterministic Policy Gradient. Green Energy and Intelligent Transportation. 2022:100018.
[42] Tang X, Chen J, Pu H, Liu T, Khajepour A. Double Deep Reinforcement Learning-Based Energy Management for a Parallel Hybrid Electric Vehicle With Engine Start–Stop Strategy. IEEE Transactions on Transportation Electrification. 2022;8(1):1376-88.
[43] Tang XL, Chen JX, Yang K, Toyoda M, Liu T, Hu XS. Visual Detection and Deep Reinforcement Learning-Based Car Following and Energy Management for Hybrid Electric Vehicles. IEEE Transactions on Transportation Electrification. 2022;8(2):2501-15.
[44] Fang Z, Chen Z, Yu Q, Zhang B, Yang R. Online Power Management Strategy for Plug-in Hybrid Electric Vehicles Based on Deep Reinforcement Learning and Driving Cycle Reconstruction. Green Energy and Intelligent Transportation. 2022:100016.
[45] Hu D, Zhang YY. Deep Reinforcement Learning Based on Driver Experience Embedding for Energy Management Strategies in Hybrid Electric Vehicles. Energy Technol-Ger. 2022;10(6).
[46] Zhang L, Liu W, Qi B. Energy optimization of multi-mode coupling drive plug-in hybrid electric vehicles based on speed prediction. Energy. 2020;206:118126.
[47] Zegong N, Hongwen H, Yong W, Ruchen H. Energy Management Optimization for Connected Hybrid Electric Vehicle with Offline Reinforcement Learning. 2022 IEEE 12th International Conference on Electronics Information and Emergency Communication (ICEIEC). 2022. p. 103-6.
[48] Kim N, Rousseau A, Rask E. Vehicle-level control analysis of 2010 Toyota Prius based on test data. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering. 2012;226(11):1483-94.
[49] Zhang X. Design of Power Split Hybrid Powertrains with Multiple Planetary Gears and Clutches. 2015.
[50] Burress TA, Campbell SL, Coomer C, Ayers CW, Wereszczak AA, Cunningham JP, et al. Evaluation of the 2010 Toyota Prius Hybrid Synergy Drive System. 2011.
[51] Liu YG, Li J, Gao J, Lei ZZ, Zhang YJ, Chen Z. Prediction of vehicle driving conditions with incorporation of stochastic forecasting and machine learning and a case study in energy management of plug-in hybrid electric vehicles. Mech Syst Signal Pr. 2021;158.
[52] Meng X, Fu H, Peng L, Liu G, Yu Y, Wang Z, et al. D-LSTM: Short-Term Road Traffic Speed Prediction Model Based on GPS Positioning Data. IEEE Transactions on Intelligent Transportation Systems. 2020:1-10.
[53] Jamali H, Wang Y, Yang YH, Habibi S, Emadi A. Rule-Based Energy Management Strategy for a Power-Split Hybrid Electric Vehicle with LSTM Network Prediction Model. IEEE Ener Conv. 2021:1447-53.
[54] Wu J, He H, Peng J, Li Y, Li Z. Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus. Applied Energy. 2018;222:799-811.
[55] Ye M, Chen J, Li X, Ma K, Liu Y. Energy Management Strategy of a Hybrid Power System Based on V2X Vehicle Speed Prediction. Sensors (Basel). 2021;21(16).
[56] Du G, Zou Y, Zhang X, Guo L, Guo N. Energy management for a hybrid electric vehicle based on prioritized deep reinforcement learning framework. Energy. 2022;241.
Supplementary Material: Appendix.docx