
Received: 16 April 2022    Revised: 16 May 2022    Accepted: 18 May 2022    IET Renewable Power Generation

DOI: 10.1049/rpg2.12508

ORIGINAL RESEARCH

Agent based online learning approach for power flow control of electric vehicle fast charging station integrated with smart microgrid

Mohammad Amir1    Zaheeruddin1    Ahteshamul Haque1    V. S. Bharath Kurukuru1    Farhad Ilahi Bakhsh2    Akbar Ahmad3

1 Advance Power Electronics Research Lab, Department of Electrical Engineering, Jamia Millia Islamia, New Delhi, India
2 Department of Electrical Engineering, National Institute of Technology Srinagar, Srinagar, India
3 Faculty of Science and Information Technology, MI College, Alimas Magu, Malé, Maldives

Correspondence
Dr. Akbar Ahmad, Faculty of Science and Information Technology, MI College, Alimas Magu, Malé 20260, Maldives.
Email: akbar@micollege.edu.mv

Abstract
In stochastic power systems, electric vehicle (EV) fast charging stations (FCS) are being installed rapidly, which adversely impacts the distribution network. Consequently, improper offline charging control policies for EVs may increase voltage fluctuation and instability. To analyse these aspects, this paper investigates the problems associated with offline (dis)charging control for effective utilization of battery storage and grid power through different modes of operation. Further, the need for real-time charging control is identified to mitigate the adverse impacts of FCS on the distribution network. Hence, an online controller using reinforcement learning (RL) is designed to distinguish the uncertainties in real time and to schedule the (dis)charging of an EV against these uncertainties based on its travelling pattern. The RL based online controller uses a deep neural network (DNN), where the agents are programmed to control the bi-directional power flow (V2G/G2V). The effectiveness of the RL reward controller is evaluated over the different charging states of the battery. The performance of the online (dis)charging controller, which uses the DNN to operate at optimal power flow set-points for all sessions, is examined in detail. Finally, the effectiveness of the online RL controller is validated with hardware results obtained on a real-time hardware-in-the-loop simulator.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2022 The Authors. IET Renewable Power Generation published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

1 INTRODUCTION

Electric vehicle (EV) penetration and the integration of intermittent renewable energy resources in the grid are increasing day by day in the distribution network. EVs are growing rapidly, which is projected to provide more flexibility for the stochastic distributed power system network [1, 2]. In several countries such as Japan, the U.S.A., Sweden, Germany, the Netherlands, and China, fast charging station (FCS) infrastructure is in a growing phase. With rising concerns about the adverse impacts of intermittent and uncontrolled charging on the distribution network [3], it is important to develop controlled charging schemes that keep the load within its limits during peak time. It would be ideal if EV fast charging demand could be estimated in advance, but the FCS faces considerable uncertainty in demand due to flexible EV travel behaviours, involving their charging patterns and intermittent load demands [4]. Vehicle to Grid (V2G) is a promising technology in which EVs can discharge their power back to the grid in order to reduce the stress on the distribution network [5, 6].
   The online charge control approach has become a novel and promising paradigm for real-time scheduling of FCS against various uncertainties [7]. Online charge control strategies do not consider future information; they rely only on present and past EV patterns. This information involves the EVs' arrival and departure times and the fast charging requirements of the EV fleet, which are shared between the grid, EVs, and FCS to enhance the charging control and enable real-time optimal decisions [8].





It is simpler to estimate vehicle travel behaviours, which makes it possible to schedule an appropriate number of vehicles and take optimal charging decisions. This paper aims to implement an online charging algorithm for an FCS integrated with the grid without knowing upcoming information.
   Recently, significant efforts have been made to build real-time online control algorithms for FCS under intermittent distributed generation. An earlier study [9] presented a Lyapunov optimization-based online algorithm to investigate the power scheduling problem of an EV fleet. Another study [10] suggested an online EV charging control method for real-time charging priority, cost evaluation, and optimal location. Further, in [11], the authors proposed a centralized online learning algorithm to reduce electricity cost based on a time- and consumer-demand-dependent approach. An enhanced multi-objective optimization algorithm was developed in [12] to investigate the peak demand and its variation, which reduces the operational cost. Researchers in [13] demonstrated a multi-stage stochastic technique for power optimization of an EV fleet integrated with distributed generation. Still, the algorithms presented in [14, 15] depend on certain model parameters and mainly work under specific constraints.
   Instead, the primary aim of this paper is to develop an online algorithm for the transitioning control of EVs with FCS, which is model-free and whose control is based on real-time EV travelling behaviours. Hence, the optimal scheduling of online EV charging is achieved by utilizing an RL based controller framework to control the power flow in the distribution network. Earlier, an enhanced Q-learning-based procedure was suggested to reduce the electricity price [16]. A bi-directional short-term memory (STM) based learning controller of a parallel RL network was demonstrated in [17] to design an energy management approach for hybrid electric vehicles (HEVs). An agent-based Q-learning online control algorithm was developed in [18] to schedule real-time load applications under uncertainties. However, recent research [19] used an agent-based control technique in which the charging actions of EVs were taken as discrete action values, which limits the behaviour of the charging model. Since EV travel patterns are by nature discrete values, a multi-head attention model was demonstrated in [20] to overcome the offline charging issues by estimating the discrete travel pattern values. If the charging control of a discrete complex system is implemented through travel pattern control actions, it will be difficult to accomplish an optimal charging schedule. Computation over a small number of discrete values may result in weak learning capability, but for larger discrete value sets the controller can be modified with adaptive learning capability, which leads to higher computational effort [21].
   A new technique was proposed in [22] to estimate the number of policy iterations needed to enhance the desired policy. The deep neural network (DNN) technique typically provides the desired policy better than other learning methods [23, 24]. Thus, in this paper the DNN based RL method is implemented to resolve the problems associated with offline control. The agent technique can utilize the DNN to predict the desired function values, which gives a superior estimation capability. Several studies have established agent-based learning methods for different applications. Earlier, in [25], a concurrent actor-critic based RL learning system was suggested to obtain the optimal solution for multiple players in an infinite-horizon application. Further, a distributed RL system was demonstrated in [26] to speed up learning by reducing the number of variables. An agent-based approach was suggested in [27] to estimate the optimal performance using a dynamic programming control technique. An online incentive-based technique was recommended to support the distribution companies (DISCOs) and balance the distribution system under power fluctuations, and it provided grid network reliability based on a deep RL technique. Lu et al. [28] suggested an incentive-based energy demand estimation technique to examine the issues of power scheduling and optimal location in power grid networks. However, the actor-critic based algorithm may not achieve good convergence when it is associated with an offline policy [29]. Therefore, this paper develops an online EV charging control approach using a DNN. Here, a modified power flow control algorithm is designed which significantly enhances the convergence rate and achieves the desired solution for the scheduling of EVs associated with the fast-charging station [30]. The key contributions of this research paper are summarized as follows.
   Firstly, the problems associated with an offline algorithm, in which the travelling patterns of EVs are known, are formulated. Because of the stochastic behaviours, improper offline charging control policies for EVs may cause voltage instability. To deal with these issues, an online EV charging control strategy is demonstrated in a grid system to utilize the (dis)charging capacity of the EV battery while fulfilling all the desired constraints. Further, to maintain the power flow in V2G/G2V/UPS mode, this paper examines EV charging scheduling issues during the peak load time, without knowing future information on EV travelling behaviours (arrival/departure time). The modelling of the online controller for EV charging control in different states of charge (SoC) is then demonstrated. A DNN based EV charging control method is modified to utilize the power under the EVs' different modes of operation, which maintains a continuous amount of charging for individual EVs using RL. To further enhance the effectiveness of the proposed online charging method, an algorithm is developed which selects the optimal charging parameters and considerably improves the computational convergence.
   The remainder of the paper is organized as follows: Section 2 discusses the problem formulation and key issues associated with offline power flow control. These problems are overcome in Section 3 by the proposed online algorithm. Section 4 presents the implementation of the RL algorithm based on the optimal learning and convergence rate. The numerical simulation results and real-time hardware validation are presented in Section 5 to examine the controller performance. Finally, the key conclusions and further discussion are given in Section 6.

2 OFFLINE EV CHARGING PROBLEMS

This section investigates the key issues, which are mainly involved in offline EV charging [29].

2.1 Problem formulation

Here, the optimization issues of offline charging control are formulated. Usually, offline charging control requires the past data of EV arrival/departure times and (dis)charging demand profiles as known variables. The optimized charging solution C_i(t) for EV_i, which reduces the dependency on past parameters over a specific time duration (T), is expressed in Equation (1a) as:

\min_{C_i(t)} \sum_{t=1}^{T} \left( K_0 \sum_{i \in N} C_i(t) + K_1 \Big( \sum_{i \in N} C_i(t) \Big)^{2} + 2 K_1 L_b(t) \sum_{i \in N} C_i(t) \right);
\quad 0 \le C_i(t) \le C_i^{max}; \; i \in N,\; t \in T    (1a)

CD_i = \sum_{t = t_i^{arrival}}^{t_i^{departure}} C_i(t), \quad i \in T    (1b)

where CD_i is the vehicle charging demand of EV_i, while t_i^{arrival} and t_i^{departure} are the arrival and departure times, respectively. The major issue with Equation (1b) is that it forms a convex optimization problem [31]. When the patterns of t_i^{arrival}, t_i^{departure}, and CD_i are known, the optimized charging solution C_i(t) can be achieved by an online EV charging controller.

2.2 Operational mode for achieving maximum SoC

The SoC optimized mode is assumed to be similar to the uncontrolled charging scenario, such that the EVs charge with the maximum available power. Thus, the proposed online controller can be utilized if the EV user does not intend to schedule the EV and if the vehicle SoC has fallen below the minimum value of battery SoC. The key objective function for optimizing the SoC of an EV can be presented by Equation (2) as:

\text{Minimize}_{[n_t,\, P_t^{ml}]} \sum_{t \in T} -(n_t); \quad \text{subject to: } x_0 = x_{in}    (2)

(E_d^{t} \ne 0 \;\vee\; n_t \ne N) \rightarrow P_d^{t} = P_{limit}^{max}    (3)

For achieving the maximum SoC, Equation (2) is repeated in a loop until the value of the energy demand (E_d) is zero, where N is the next milestone of SoC and P_{limit}^{max} is the maximum power limit of the electric vehicle supply equipment (EVSE) referred to in Equation (3) [32]. The maximum SoC objective applies only when there is no discharging phenomenon.

3 PROPOSED ONLINE MODEL

In this section, the online algorithm learns the optimal function and control policy based on the temporal difference error discussed in Section 2. Here, policy iterations are used to learn the system operation and stabilize the control policy. There are certain constraints for the implementation of the RL controller that must be satisfied by the rewards to ensure the desired results. These constraints comprise the controller system having to follow the specified SoC limit.

3.1 Online reinforcement learning controller

In recent times, agent-based reinforcement learning has been broadly recognized as effective for obtaining optimal decisions under uncertain conditions. The online charging control does not depend on future information of the EV travel pattern; it takes real-time practical decisions based on the uncertainties in the SoC level. The key benefit of the RL algorithm is that it can learn from a smaller amount of knowledge data for analysing the distribution of uncertainty [32]. Therefore, the online control application of RL algorithms has motivated researchers to use it for EV charging/discharging management. However, the first challenging task is the selection of the state and action space for getting the desired reward signals.

3.2 Agent-based algorithm for EV bi-directional charging control

The agent-based online learning algorithm comprises online learning control. Based on the current vehicle charging mode and the existing battery SoC, the agent-based algorithm is used to select the corresponding charging algorithm [33]. This is carried out until the least SoC level is achieved. Further, the battery storage can be charged until the desired SoC level is accomplished, and the SoC level can be categorized into a desired operational mode. Consequently, the SoC target is chosen for the optimal EV user operational mode. Moreover, it can be beneficial to switch to a cost-effective mode, from normal battery SoC to optimized SoC, as well as from normal operation to an economical mode that feeds the surplus power in V2G/G2V mode accordingly. These modes of operation are validated in the EV battery discharging state, which is enabled by the specific EV user. Here, the online controller is developed to discharge the battery power only through its allowed internal signals in order to maintain the desired SoC level of the battery. A bi-directional control can be followed by surplus grid power before scheduling the EVs' arrival/departure times in the FCS [33, 34].
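As an illustration of the mode categorization described above, the following minimal Python sketch maps the battery SoC and grid status onto the G2V/V2G/UPS operating modes. The threshold values (soc_min, soc_target), the surplus-power flag, and the function name are illustrative assumptions rather than values taken from the paper.

def select_operating_mode(soc, grid_available, surplus_grid_power,
                          soc_min=0.2, soc_target=0.8):
    """Illustrative SoC-based mode selection for the bi-directional charger.

    soc                : current battery state of charge in [0, 1]
    grid_available     : True when the utility grid is healthy
    surplus_grid_power : True when the grid can spare power for charging
    soc_min/soc_target : assumed lower/upper SoC limits (not from the paper)
    """
    if not grid_available:
        # Grid failure: the EV supplies local loads in standalone (UPS) mode.
        return "UPS"
    if soc < soc_min:
        # Below the minimum SoC the battery must be charged first (G2V).
        return "G2V"
    if soc >= soc_target and not surplus_grid_power:
        # Sufficiently charged and the grid is stressed: feed power back (V2G).
        return "V2G"
    # Otherwise keep charging towards the desired SoC while surplus power exists.
    return "G2V" if surplus_grid_power else "IDLE"

print(select_operating_mode(soc=0.15, grid_available=True, surplus_grid_power=True))   # G2V
print(select_operating_mode(soc=0.9,  grid_available=True, surplus_grid_power=False))  # V2G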

FIGURE 1 Framework of agent–environment interaction in fast EV charging control system for a different mode of operation

3.3 Modelling of RL based optimal EV charging control

It is stated that the optimized constraints of the proposed RL controller model are directly affected by the vehicle's stochastic actions [34]. These limits are generally called boundary conditions (BCs), which cannot be observed exactly in real-time scenarios, and they consequently affect the charging schedule. On the other hand, these BCs comprise SoC_{initial}, SoC_{expected}, t_{arrival}, t_{departure}, and the selection of the EV charging station.
   To get an optimal solution of the above optimization model, the parameters of the BCs should be estimated before solving the system. In Figure 1, the environment is selected as the EV, grid, and FCS, and the agent is chosen as the RL based (dis)charging controller. For maximum rewards, the proposed (dis)charging control model can be explained by Equation (4).

\max_{d} \; \text{Profit} = f(d) \quad \text{s.t.} \quad d \in \Omega(b)    (4)

From Equation (4), Ω(b) is the feasible region with respect to the associated constraints. These constraints depend on the BCs (b), and d is the decision vector as shown in Equation (5).

d = \{ P_{w,t}^{C_{dis}},\; P_{w,t}^{C_{ch}},\; PF_{i,j,t},\; QF_{i,j,t},\; I_{i,t},\; V_{i,t},\; SoC_{w,t} \}    (5)

where PF_{i,j,t} and QF_{i,j,t} are the active and reactive power flow, respectively, at the i-th, j-th FCS during time (t). Also, the BCs vector (b) is related to the EV rider's uncertain pattern, which is demonstrated in Equation (6).

b = \{ SoC_{initial},\; SoC_{expected},\; T_{ari},\; T_{dep},\; EV_{Char.st} \}, \quad \text{where}
SoC_{initial}  = \big( SoC_1^{ini}, \ldots, SoC_{|W^{Ch}|}^{ini} \big),
SoC_{expected} = \big( SoC_1^{exp}, \ldots, SoC_{|W^{Ch}|}^{exp} \big),
T_{ari} = \big( T_{ari,1}, \ldots, T_{ari,|W^{Ch}|} \big),
T_{dep} = \big( T_{dep,1}, \ldots, T_{dep,|W^{Ch}|} \big),
EV_{Char.st} = \big( EV_{Char.st.1}, \ldots, EV_{Char.st.|W^{Ch}|} \big)    (6)

Here, |W^{Ch}| is the charging limit, and the numbers (1, 2, …, n) represent the number of EVs; W is the selection of charging stations (e.g. if W = 4, the EV selects the fast-charging station at bus 4). As referred from Figure 1, the BCs may be affected by environmental factors, which vary with respect to time and with the EVs' travelling pattern as well [35]. So, it is necessary to update the estimations at all time steps with the most recent environment values. Thus, the BCs are depicted by a time-dependent approach (b_0, b_1, b_2, …, b_T).

k_T = E\left[ \sum_{\delta=t}^{\infty} \gamma^{\delta}\, r_{\delta} \right]    (7)

To distinguish the sequential relationship of the uncertain factors [36], an RL based five-tuple agent model is adopted to provide an optimal solution as (S, A, R, P, K), with State (S_0, S_1, …, S_t, …, S_T), Action (A_0, A_1, …, A_t, …, A_T), Reward (R_0, R_1, …, R_t, …, R_T), Policies (P_0, P_1, …, P_t, …, P_T), and Return (k_0, k_1, …, k_t, …, k_T). In Equation (7), γ is the reward discount factor.

3.3.1 State (S)

The state S is employed to illustrate the environment (E). Whenever the environmental conditions vary with respect to time, they directly affect the state. Thus, the state updates the BCs to prevent violation of the BCs limits. Here, the voltage of the EV battery is not included in the state space (S).

S \in s_{0-T} : \{ 0,\; 0 + SoC_{Step},\; 0 + 2\,SoC_{Step},\; \ldots,\; 1 \}    (8)

The SoC values are discretized into a finite SoC range between 0 and 1 (0%, 20%, 40%, 60%, 80%, and 100%). The SoC, charging current (I_{Ch}), and charging voltage (V_{Ch}) are evaluated to get the reward value. Here, from Equation (8), SoC_{Step} is the step size of the battery SoC.

3.3.2 Action (A)

The action A is described as the status of the charging and discharging current. The charging capability depends on the maximum range (if the FCS has G2V mode enabled) and the minimum range (if the FCS has V2G mode enabled). In Equation (9), A is the decision value, which is prepared by the agent and further utilized for the desired states in the proposed charging environment. From Figure 1, a_t is termed the action at time (t) for the estimation of the different EV patterns.

A \in a_{0-T} : \{ P_{t+1}^{\Delta},\; P_{BCs}^{\Delta},\; \ldots,\; P_{T}^{\Delta} \}    (9)

Here, P^{\Delta} estimates the BCs over the desired time duration and Δ = {t + 1, …, T}.
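As a concrete illustration of the discretized state and action spaces in Equations (8) and (9), the following Python sketch enumerates the SoC grid in 20% steps and a small set of discrete (dis)charging power levels. The specific power levels, the control interval, the battery capacity, and the step-function interface are assumptions made for this example only; they are not specified in the paper.

import numpy as np

# State space (Eq. 8): SoC discretized from 0 to 1 in SoC_Step increments.
SOC_STEP = 0.2                      # 20% step, as in the paper
soc_states = np.arange(0.0, 1.0 + SOC_STEP, SOC_STEP)   # [0.0, 0.2, ..., 1.0]

# Action space (Eq. 9): discrete (dis)charging power set-points in kW.
# Negative = V2G (discharge to grid), positive = G2V (charge), 0 = idle/UPS.
# These particular levels are illustrative assumptions.
actions_kw = np.array([-50.0, -25.0, 0.0, 25.0, 50.0])

def step_soc(soc, power_kw, dt_h=0.25, capacity_kwh=60.0):
    """Advance the battery SoC by one control interval (assumed 15 min, 60 kWh pack)."""
    soc_next = soc + power_kw * dt_h / capacity_kwh
    return float(np.clip(soc_next, 0.0, 1.0))

# Example: one step of charging at 50 kW from 40% SoC.
print(step_soc(0.4, actions_kw[-1]))   # -> approximately 0.608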

When an EV is connected to the FCS, the unknown information of SoC_{initial}, SoC_{expected}, T_{ari}, and EV_{Char.st} in the BCs must be exhibited, while the EV departure time (T_{dep}) remains unknown. Also, parameters such as SoC_{initial}, SoC_{expected}, T_{ari}, and EV_{Char.st} in the BCs do not need to be estimated in the future.

3.3.3 Reward (R)

The reward R_t is a scalar feedback signal that defines the objective of the RL problem [37]. This signal allows the agent to distinguish positive actions from negative ones in order to reinforce and improve its behaviour. It is crucial to observe the reward in real time because it describes only the value of the latest action. Besides, by receiving a conspicuous reward at a specific time step, or by sacrificing immediate reward at intermediate time steps, greater rewards can be achieved for an action. In this context, many features make reinforcement learning (RL) distinct from supervised as well as unsupervised learning. The major one is that there is no supervisor, hence the agent has to decide what action to take. In addition, there is no entity that can determine what the optimal decision is at that specific moment. Furthermore, the agent does not learn from a set of labelled objects taken from a knowledge base as in supervised learning, but exploits the direct experience composed of observations and rewards as feedback. From the point of view of unsupervised learning, the presence of a mapping between input and output is also a main difference, as the objective there is to find underlying patterns rather than mappings. Further, the actions may have long-term consequences, as they may delay the reward signals. The agent receives only a reward signal, which may be delayed compared to the moment in which it has to perform the next action. This fact brings out another major significance of the time dependency: it sequentially links all actions taken by the agent, making the resulting data no longer independent and identically distributed. Given these definitions, it is noticeable that the primary purpose of the agent is to maximize the cumulative reward, called the return. The return g_t is the total discounted reward starting from time step (t), defined by Equation (10), where γ is a discount factor.

g_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \quad \gamma \in [0, 1)    (10)

Not only does the preference of the system behaviour for immediate rewards rather than future ones motivate the presence of this factor, it is also mathematically necessary: an infinite-horizon sum of rewards may not converge to a finite value. Indeed, the return is a geometric series, so if γ ∈ [0, 1) the series converges to a finite value, equal to 1/(1 − γ) for unit rewards. For the same sake of convergence, the case with γ = 1 makes sense only with a finite-horizon cumulative discounted reward.
   In this paper, the rewards are considered as: the power withdrawn from the grid to the EV, that is, P_{G2V}(s, a); the power fed from the EV to the grid, that is, P_{V2G}(s, a); and the power in standalone (UPS) mode, that is, P_{UPS}(s, a).

Policies and learning
A policy is history dependent or Markovian deterministic and is represented as P. It gives a way of selecting actions as per the agent's instruction [38]. The significance of a policy is to return a probability function over the actions or just return an action, that is, P_t : S → A_{S_t}, such that P(S_t) provides an action to choose. Here, the policy may be either randomized or deterministic. The key assignment associated with the RL model is that it aims at obtaining the optimal policy to get optimal actions. Equation (11) depicts the optimal approach to find the desired policies for the RL controller in order to maximize the projected rewards (R).

E\left[ \sum_{t'=t}^{T} \lambda^{t'-1} \, r_{t'}\big( S_{t'}^{P}, a_{t'}^{P} \big) \right]    (11)

Here, (S_t^P) and (a_t^P) signify the state and action, respectively, at time (t), which are drawn from a policy (p). The dynamic range of the vehicle varies and mainly depends on the different time steps, from the moment the vehicle arrives until the departure time. For an infinite time horizon, the objectives must be marginally changed to avoid unbounded cumulative rewards, so δ is introduced with δ ∈ (0, 1), as shown in Equation (12).

E\left[ \sum_{t'=t}^{T} \delta \, r_{t'}\big( S_{t'}^{P}, a_{t'}^{P} \big) \right]    (12)

where δ controls the rate of future rewards (r); a low value of δ effectively favours instant rewards over upcoming rewards. The value function (v) signifies a technique to formalize issues and their optimal solution. The v_T^P is a mathematical formulation derived from the desired objective functions; therefore, the proposed solution approaches within RL are based on estimating v.

v_T^{P}(s) = E\left[ \sum_{t'=t}^{T} r_{t'}\big( S_{t'}^{P}, a_{t'}^{P} \big) \;\middle|\; S_1^{P} = s \right]    (13)

From Equation (13), v_T^P(s) is the average value of the reward (r) that can be accomplished under the policy (p) when starting in the state (s). This also shows how significant the states are for execution under the policies (p).

Online charging control algorithm
The online charging control algorithm derives its associated policy based on dynamic programming, which further utilizes the desired reward signals. So, the policy (ρ_i) can maximize the desired reward signals based on the policies, which also minimizes the impact of EV charging on the utility grid while reaching the desired SoC level (resulting in a reduction in battery degradation).

Algorithm 1: Dynamic programming algorithm [39] to get the desired function values (y) of state action.
Steps
1  Procedure for Dynamic programming (time (t), battery states of charge (s), EV charger currents (a))
2    y ← φ^(T×S×A)
3    for t ∈ [t_end − 1, …, 1, 0] do
4      for s ∈ battery states of charge do
5        for a ∈ charger currents do
6          t*, s* ← charge(t, s, a)
7          y[t, s, a] = rewards(t, s, a) + maximum_a(u[t*, s*])
8        end for
9      end for
10   end for
11   return y
12 end procedure

Policies and learning
The deterministic policy approaches discussed in this section are implemented with the EV charging station integrated with a smart microgrid to achieve online learning control. The policies of the online charging control algorithm consist of the selection of actions with specific function values (having the maximum value), as referred to in Algorithm 2. To achieve this, the states and boundaries corresponding to the operational mode of the EV charging station are required by the generated policy.

Algorithm 2: Policy (ρ_i) for the online charging control algorithm
Steps
1  Procedure for Policy (time, state, y)
2    ampere ← argmax_a(μ[time, state])
3    return ampere
4  end procedure

4 ALGORITHM IMPLEMENTATION

The deterministic policy approaches discussed in this section are implemented with the EV charging station integrated with a smart microgrid to achieve online learning control. The states and boundaries corresponding to the operation of the EV charging station are required by the generated policy [40]. Further, these states are not capable of bounding themselves. Hence, a termination statement is marked at the end of each episode while defining the algorithm to bound them. The termination statement is based on the error measured between the rated characteristics and the measured characteristics during different states of operation of the EV charging station integrated with a smart microgrid. When the difference between the measured and required characteristics is minimum, the algorithm terminates the episode and hence bounds the states of the system. Apart from the states and boundaries of the environment, the RL algorithm is initialized with a set of hyperparameters. The hyperparameters for training the EV charging station integrated with a smart microgrid environment with RL algorithms are empirically optimized such that there exists a direct mapping between the agent and the observations to the low-level control. The hyperparameters used for learning the states of the systems are given in Table 1.

TABLE 1 Hyperparameters for RL approach
Hyperparameter            Value
Soft update coefficient   0.005
Interpolation factor      0.9
Learning rate             1e-4
Discount factor           0.99
Replay buffer size        1,000,000
Minibatch size            100
Exploration noise         0.1

The training of the aforementioned RL algorithm is done in the MATLAB/Simulink environment with data related to an EV charging station operating with a three-phase grid-connected system. The estimated target for the RL algorithm is the regulation of power flow between the EV and the grid with reference to the load requirement and the battery SoC. The RL agent decides the action which needs to be taken in order to perform the optimal learning and improve the convergence rate. The actor representation, with input observations and output actions, is based on a deep neural network (DNN) as demonstrated in [41]. The measured error depends on the estimated characteristics and the setpoint characteristics, which keep updating as per the reward generation. This DNN model is utilized with the RL and is based on a single fully connected layer in the proposed network structure. Each current state generates three actions through the action network. Later, these actions are evaluated by the critic network, and the higher action values are actually executed to perform the optimal control action and conduct the policy search. The objective of the algorithm is to regulate the impact of peak loading on the grid and provide ancillary services to the grid through EV charging stations. The algorithm learns each task for 1e+6 time steps and provides the average return for each time step. The corresponding average return obtained for the set and initialized hyperparameters of the RL algorithm, controlling the charging-station-integrated inverters and the voltage source converter control, is shown in Figure 2.
   From the results, it is identified that the RL achieves optimal control with a high learning speed and maximum average return. Further, the reward convergence of the algorithm is plotted in Figure 3. To test the convergence efficiency of the algorithm, the training action is carried out for 1000 episodes.
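Algorithms 1 and 2 can be made concrete with the short Python sketch below, which fills a value table backwards in time over the discretized (time, SoC, charging-current) grid and then extracts the greedy charging action for a given state. The reward model, the SoC transition, and the grid sizes are illustrative assumptions rather than the exact functions used in the paper.

import numpy as np

T_STEPS = 8                                          # number of control intervals (assumed)
soc_grid = np.round(np.arange(0.0, 1.01, 0.2), 2)    # discretized SoC states (Eq. 8)
currents = np.array([-50.0, 0.0, 50.0])              # charger power levels in kW (assumed)

def charge(t, s_idx, a_idx, dt_h=0.25, cap_kwh=60.0):
    """Transition: next time index and next SoC index after applying action a (assumed model)."""
    soc_next = np.clip(soc_grid[s_idx] + currents[a_idx] * dt_h / cap_kwh, 0.0, 1.0)
    return t + 1, int(np.argmin(np.abs(soc_grid - soc_next)))

def rewards(t, s_idx, a_idx):
    """Illustrative reward: favour a high SoC, penalize grid stress from fast (dis)charging."""
    return soc_grid[s_idx] - 0.01 * abs(currents[a_idx])

# Algorithm 1: backward dynamic programming fill of the value table y[t, s, a].
y = np.zeros((T_STEPS, len(soc_grid), len(currents)))
for t in range(T_STEPS - 2, -1, -1):
    for s in range(len(soc_grid)):
        for a in range(len(currents)):
            t_next, s_next = charge(t, s, a)
            y[t, s, a] = rewards(t, s, a) + y[t_next, s_next].max()

# Algorithm 2: greedy policy extraction for a given (time, state).
def policy(time, s_idx):
    return currents[int(np.argmax(y[time, s_idx]))]

print(policy(0, 1))   # charger set-point when SoC = 20% at the first interval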

FIGURE 2 Learning rate with average return

FIGURE 3 Reward convergence for achieving optimal control

The convergence plot indicates that the RL algorithm efficiently converges around 400 episodes.

5 EXPERIMENTAL ANALYSIS AND RESULTS

The experimental analysis is carried out to examine the significance of the proposed online controller in a real-time scenario. The modes for charging an EV at an FCS are effectively selected by the developed approach based on the requirement and availability of power. The trained RL algorithm applies the optimal policies whenever the EV is integrated at the FCS, while before or after the charging sessions the RL does not apply any policy. In the proposed approach, the optimal policy is defined based on how the power flow needs to be scheduled in the system such that the impact on the grid is minimized, especially during the peak load time. This approach of policy optimization is considered the best strategy for an online charging control algorithm during a specified charging session. Further, the investigation of the optimum charging mode based on SoC behaviour in the smart microgrid is done by comparing the online charging control algorithm to the offline charging controller [42, 43]. The flowchart shown in Figure 4 highlights the learning steps and control policy of the online algorithm. Initially, the availability of the grid and the EV are identified and the requirement at the local load is analysed. Further, at the EV side, the battery SoC is categorized under three different (dis)charging limits. To control and achieve optimal transitioning between the different learning modes, the online controller is employed as per the grid status, local load, EV status, and the battery SoC level in the system.

5.1 HIL experimental setup

The HIL SCADA panel offers the interface between the EVs, FCS, and grid as shown in Figure 5. The measurement of the converter terminal voltage, SoC, and active/reactive power of the grid is achieved in the SCADA panel. It also includes five different LED groups that show the status of the EV connection to the grid (V2G/G2V) and indicate the power flow direction (UPS mode when no power exchange takes place). Here, the inverter control mode is controlled manually by the macro button in the controller logic.
   Figure 6 shows the experimental setup, which mainly consists of a Typhoon HIL 402 module, HIL breakout board, host laptop, Altera field programmable gate array (FPGA) controller, and an oscilloscope. The control policy developed in Section 3 is dumped into the FPGA using the Quartus software through the very high speed integrated circuit hardware description language (VHDL). The FPGA is utilized for the selection of the optimal charging mode based on the desired SoC level. The proposed RL controller is validated for different modes of EV operation.

5.2 Results of DCFCS integrated with smart grid

The proposed Typhoon HIL SCADA real-time panel measures the vehicle battery responses, which are integrated with DC fast charging stations (DCFCS), for controlling the active power flow under V2G/G2V/UPS mode. From Table 2, the response of the inverter output voltage is evaluated with respect to the reference active/reactive power. The vehicle is only connected to the FCS if the EV motor is in standstill condition (speed is nearly zero); the EV standstill condition is achieved manually through the SCADA controller. Once the EV is connected to the FCS, the EV motor cannot be operated until the FCS is decoupled by turning off the manual charging button through the online controller. The operational characteristics to reach the desired battery SoC limits must satisfy their respective mode (referred from Figure 4). These SoC limits map to three different operational modes (V2G/G2V/UPS) for either grid-following or grid-forming operation. The status of the EV integration with the grid is summarized in Table 2.

FIGURE 4 Proposed learning modes and control policy for the online algorithm

TABLE 2 Modes of operation

Operational mode   Reference active power (kW)   Grid integration   Conditions                                                             Grid operation mode
Mode-1 (G2V)       −158                          On                 —                                                                      Following
Mode-2 (V2G)       158                           On                 If EV SoC > 80% or P_Grid < local load handling capability             Following
Mode-3 (UPS)       0–2                           Off                If EV SoC ≥ 80% (EV sufficiently charged) or grid failure, while the   Forming
                                                                    local load and available grid power are balanced

5.2.1 Mode 1: Grid to vehicle (G2V)

In this mode, the power required for charging the EVs is drawn from the grid, considering that the other sources in the system are either operating in a grid-feeding mode or do not generate enough power in the system. Here, the developed control approach, which is implemented using the FPGA, identifies the states of the system and optimizes the policy according to the G2V mode of operation. The optimal policy is generated while the other sources in the system are available and are capable enough to support the load at the EVCS. Besides, the optimal policy is also achieved during the G2V mode until the load is completely satisfied, irrespective of whether the other sources are available or not.
   The voltage at the point of common coupling (PCC) is shown in Figure 7a, and the inverter terminal current and voltage at the output filter of the charging station are shown in Figures 7b and 7c, respectively. Here, the time needed to operate the online controller is nearly equal to one utility cycle during transitioning from normal grid-connected operation to the G2V operational mode. During G2V mode, the EV battery is charged from the grid, which results in an increasing SoC as shown in Figure 7d. The DC bus voltage while charging the battery is shown in Figure 7e, and the active power drawn from the grid is shown in Figure 7f. Further, the negative value of the reference grid power (P_Gr = −158 kW) indicates that local grid power is adapted to charge the EV.
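The mode conditions summarized in Table 2 can be encoded directly as a small decision function. The sketch below is a minimal Python illustration; the threshold values come from Table 2, while the function signature, the local-load comparison variable, and the 2 kW balance tolerance are assumptions made for this example.

def table2_mode(ev_soc, p_grid_kw, local_load_kw, grid_failure=False):
    """Decision rule following the conditions of Table 2.

    ev_soc        : EV battery state of charge in [0, 1]
    p_grid_kw     : power the grid can currently supply (kW)
    local_load_kw : local load demand the grid must handle (kW)
    grid_failure  : True when the utility grid is unavailable
    """
    # Mode-3 (UPS): grid disconnected (grid-forming); the 2 kW balance band is an assumption.
    if grid_failure or (ev_soc >= 0.80 and abs(p_grid_kw - local_load_kw) < 2.0):
        return "Mode-3 (UPS)", 0.0
    # Mode-2 (V2G): EV supports the grid, reference active power +158 kW.
    if ev_soc > 0.80 or p_grid_kw < local_load_kw:
        return "Mode-2 (V2G)", 158.0
    # Mode-1 (G2V): EV charges from the grid, reference active power -158 kW.
    return "Mode-1 (G2V)", -158.0

print(table2_mode(ev_soc=0.55, p_grid_kw=500.0, local_load_kw=300.0))  # Mode-1 (G2V)
print(table2_mode(ev_soc=0.90, p_grid_kw=200.0, local_load_kw=300.0))  # Mode-2 (V2G)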

FIGURE 5 Interface framework between the HIL 402 module and breakout board for different modes of operation: (a) illustrates the schematic of the HIL 402 interface system and its controller circuits; (b) presents the EVs' integration with the FCS in the HIL SCADA panel, where the interface of the RL controller logic with the HIL SCADA panel is employed. Here, the EVs' interconnection with the grid system is realized by connecting a three-phase inverter output to an FCS module to investigate the operational modes

FIGURE 6 FCS integrated with grid and interfaced with the FPGA controller in Typhoon HIL

5.2.2 Mode 2: Vehicle to grid (V2G)

In this mode, the power required for supplying the grid is drawn from the EV, considering that the other sources in the system are either operating in a vehicle-feeding mode or do not charge enough battery in the FCS. Here, the developed online control approach, which is implemented using the FPGA, identifies the states of the system and optimizes the policy according to the V2G mode of operation. The optimal policy is generated while the other sources in the system are available and are capable enough to support the grid through the EVFCS. Besides, the optimal policy is also achieved during the V2G mode until the EV battery is discharged to a certain limit, irrespective of whether the other sources are available or not.
   In this mode, the time needed to operate the online controller is nearly equal to one utility cycle during transitioning from the G2V to the V2G operational mode. Further, setting the dual-stage three-phase inverter (switch of the contactors open) to V2G mode, the EV is integrated with a grid that supports the grid forming mode, and the utility grid consumes the power from the EVFCS. The inverter terminal current and voltage response at the output filter are depicted in Figures 8a and 8b, respectively. Here, the EV is operating in an EV-feeding and grid-following mode; the EV battery is discharged to the grid and the change in SoC is shown in Figure 8c. The corresponding DC link voltage is presented in Figure 8d, and the active power injected during the V2G mode operation is shown in Figure 8e. The positive value of the reference vehicle power (P_vr = 158 kW) denotes that the EV supports the distribution grid.

5.2.3 Mode 3: Standalone (UPS)

Finally, the operation changes the FCS to grid forming mode and the EV is disconnected from the grid. In this mode, P_demand is nearly zero. In the scenario of a residential blackout, the EVs can act as an emergency backup power supply (V2G mode). The inverter terminal current and voltage at the output filter of the charging station are depicted in Figures 9a and 9b, respectively. Here, the bandwidth (BW) of the online controller is regulated within 0.0894 s to accommodate the UPS operational mode. As the EV is operating to provide power to the local loads in a standalone mode, the instance of transition and the variation in power fed into the grid are shown in Figure 9c. Here, it is identified that during the V2G operation the battery SoC decreases, and after the transition process the EV battery SoC is almost constant with a slight decrease, as shown in Figure 9d. Further, the corresponding DC link voltage is shown in Figure 9e, where the DC link sees a slight variation of around 1–2 V, which indicates the power flow from higher potential to lower potential during the V2G operation. The operation of the EV in both transient and ideal conditions is shown in Figures 9f and 9g, respectively.
   From the above results, it is identified that the online learning algorithm achieved a smooth power controlling operation for both grid forming and following modes in order to maintain the EV loads and peak demand. Besides, it is also clear that the algorithm mitigates unnecessary transients during the sudden transition instants, and the online controller achieves a controlled DC link voltage with very low error tolerance.

6 CONCLUSION

This paper successfully investigates the issues of EV fast charging in a grid connected system. A reinforcement learning (RL) based online charging controller is designed to support the continuous charging flow in different operating modes of the grid connected, distributed generation-based EV fast charging system. In the perspective of uncertain EV travelling behaviours, an RL controller is used to capture the sequential characteristics of the charging uncertainties. Also, the online charging algorithm is utilized to solve the power scheduling problems in V2G/G2V mode. Since the charging control for an EV in real time will affect the boundary conditions (BCs) of the EV model, the proposed control approach follows the temporal correlations with better convergence. In this paper, an enhanced policy has been obtained utilizing an online controller through dynamic programming to achieve the desired reward function, taking battery SoC and charging current control into account.

FIGURE 7 (a) Response of normal grid voltage (Vga, Vgb, Vgc) after filtering. (b) Characteristics of charging station battery inverter current (Ia, Ib, Ic) before filtering. (c) Real time charging station battery inverter voltage measurement (Va, Vb, Vc) after filtering in G2V mode. (d) SoC of EV battery (in transient state, G2V mode). (e) DC bus voltage corresponding to change in EV battery SoC (referred from (d)). (f) Active power (P_Gr) during G2V mode (when battery achieved 90% SoC)

FIGURE 8 (a) Characteristics of charging station battery inverter current (Ia, Ib, Ic) before filtering; (b) real time charging station battery inverter voltage measurement (Va, Vb, Vc) after filtering in V2G mode; (c) SoC of EV battery (during transient state, sudden G2V to V2G mode); (d) DC bus voltage corresponding to change in EV battery SoC (referred from (c)); (e) active power (P_vr) during V2G mode (when battery achieved 97% SoC)

FIGURE 9 (a) Characteristics of charging station battery inverter current (Ia, Ib, Ic) before filtering; (b) real time charging station battery inverter voltage measurement (Va, Vb, Vc) before filtering; (c) active power (sudden V2G to UPS mode); (d) SoC of EV battery (in a transient state, sudden V2G to UPS mode); (e) DC bus voltage corresponding to change in EV battery SoC (referred from (d)); (f) battery SoC when the EV disconnects from the FCS in UPS mode (in the transient state); (g) DC bus voltage corresponding to battery SoC (100% achieved), when disconnecting from the FCS (in steady state)

Standard policies were modified for online fast charging control accordingly. The experimental analysis in real-time hardware-in-loop (HIL) simulations suggested that the proposed online charging scheme acts as an optimal prosumer (producer and consumer), which will be advantageous for optimal power flow control in a practical system. Further, it has been validated that the proposed online controller provides superior control during EV (dis)charging over the conventional offline controller.
   Furthermore, future research work will consider the coordination of large EV fleets. The online controller can also be utilized where the grid faces more fluctuations and the chances of grid instability are higher. Future research will also incorporate the agent as an online depth of discharge (DoD) monitor to estimate the charging saturation point of EV batteries.

ACKNOWLEDGEMENTS
This Hardware-In-the-Loop (HIL) analysis was performed in the Advanced Power Electronics Research Laboratory, Department of Electrical Engineering, Jamia Millia Islamia (Central University), New Delhi, India.

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

FUNDING INFORMATION
The author(s) received no specific funding for this work.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in repositories and research publications; the references are cited in the manuscript.

ORCID
Mohammad Amir https://orcid.org/0000-0003-3432-4217
Akbar Ahmad https://orcid.org/0000-0002-2785-7296

REFERENCES
1. Sanguesa, J.A., Torres-Sanz, V., Garrido, P., Martinez, F.J., Marquez-Barja, J.M.: A review on electric vehicles: Technologies and challenges. Smart Cities 4(1), 372–404 (2021)
2. International Energy Agency: Global EV outlook 2021 – Accelerating ambitions despite the pandemic. Glob. EV Outlook 2021. 101 (2021)
3. Nour, M., Chaves-Ávila, J.P., Magdy, G., Sánchez-Miralles, Á.: Review of positive and negative impacts of electric vehicles charging on electric power systems. Energies 13(18), 4675 (2020)
4. Khan, W., Ahmad, A., Ahmad, F., Saad Alam, M.: A comprehensive review of fast charging infrastructure for electric vehicles. Smart Sci. 6, 256–270 (2018)
5. Amir, M., Zaheeruddin, Haque, A.: Integration of EVs aggregator with microgrid and impact of V2G power on peak regulation. In: 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON). Kuala Lumpur, Malaysia, pp. 1–6 (2021)
6. Wang, L., Qin, Z., Slangen, T., Bauer, P., van Wijk, T.: Grid impact of electric vehicle fast charging stations: Trends, standards, issues and mitigation measures – An overview. IEEE Open J. Power Electron. 2, 56–74 (2021)
7. Cao, Y., Wang, H., Li, D., Zhang, G.: Smart online charging algorithm for electric vehicles via customized actor-critic learning. IEEE Internet Things J. 9, 684–694 (2021)
8. Amin, A., Tareen, W.U.K., Usman, M., et al.: A review of optimal charging strategy for electric vehicles under dynamic pricing schemes in the distribution charging network. Sustainability 12(23), 1–28 (2020)
9. Jin, C., Sheng, X., Ghosh, P.: Energy efficient algorithms for electric vehicle charging with intermittent renewable energy sources. In: 2013 IEEE Power & Energy Society General Meeting. Washington, DC, pp. 1–5 (2013)
10. Kang, Q., Wang, J., Zhou, M., Ammari, A.C.: Centralized charging strategy and scheduling algorithm for electric vehicles under a battery swapping scenario. IEEE Trans. Intell. Transp. Syst. 17(3), 659–669 (2016)
11. Li, P., Wang, H., Zhang, B.: A distributed online pricing strategy for demand response programs. IEEE Trans. Smart Grid 10(1), 350–360 (2019)
12. Gunantara, N.: A review of multi-objective optimization: Methods and its applications. Cogent Eng. 5(1), 1502242 (2018)
13. Quddus, M.A., Shahvari, O., Marufuzzaman, M., Usher, J.M., Jaradat, R.: A collaborative energy sharing optimization model among electric vehicle charging stations, commercial buildings, and power grid. Appl. Energy 229, 841–857 (2018)
14. Tao, Y., Qiu, J., Lai, S., Zhang, X., Wang, Y., Wang, G.: A human-machine reinforcement learning method for cooperative energy management. IEEE Trans. Ind. Inf. 18(5), 2974–2985 (2021)
15. He, D., Chan, S., Guizani, M.: Privacy-friendly and efficient secure communication framework for V2G networks. IET Commun. 12(3), 304–309 (2018)
16. Haq, E.U., Lyu, C., Xie, P., Yan, S., Ahmad, F., Jia, Y.: Implementation of home energy management system based on reinforcement learning. Energy Rep. 8, 560–566 (2022)
17. Zhou, Q., Du, C.: A quantitative analysis of model predictive control as energy management strategy for hybrid electric vehicles: A review. Energy Rep. 7, 6733–6755 (2021)
18. Bahrami, S., Wong, V.W.S., Huang, J.: An online learning algorithm for demand response in smart grid. IEEE Trans. Smart Grid 9(5), 4712–4725 (2018)
19. van der Kam, M., Peters, A., van Sark, W., Alkemade, F.: Agent-based modelling of charging behaviour of electric vehicle drivers. J. Artif. Soc. Social Simul. 22(4), 7 (2019)
20. Li, Y., Zhang, L., Lv, Z., Wang, W.: Detecting anomalies in intelligent vehicle charging and station power supply systems with multi-head attention models. IEEE Trans. Intell. Transp. Syst. 22(1), 555–564 (2021)
21. Antonopoulos, I., Robu, V., Couraud, B., et al.: Artificial intelligence and machine learning approaches to energy demand-side response: A systematic review. Renewable Sustainable Energy Rev. 130, 109899 (2020)
22. Lee, J., Sutton, R.S.: Policy iterations for reinforcement learning problems in continuous time and space – Fundamental theory and methods. Automatica 126, 109421 (2021)
23. Sarker, I.H.: Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2(6), 420 (2021)
24. Alzubaidi, L., Zhang, J., Humaidi, A.J., et al.: Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8(1), 53 (2021)
25. Grondman, I., Xu, H., Jagannathan, S., Babuska, R.: Solutions to finite horizon cost problems using actor-critic reinforcement learning. In: The 2013 International Joint Conference on Neural Networks (IJCNN). Dallas, TX, pp. 1–7 (2013)
26. Barreto, A., Hou, S., Borsa, D., Silver, D., Precup, D.: Fast reinforcement learning with generalized policy updates. Proc. Natl. Acad. Sci. U. S. A. 117(48), 30079–30087 (2020)
27. Zhang, J., Peng, Z., Hu, J., Zhao, Y., Luo, R., Ghosh, B.K.: Internal reinforcement adaptive dynamic programming for optimal containment control of unknown continuous-time multi-agent systems. Neurocomputing 413, 85–95 (2020)
28. Lu, R., Hong, S.H.: Incentive-based demand response for smart grid with reinforcement learning and deep neural network. Appl. Energy 236, 937–949 (2019)
29. Zanette, A., Wainwright, M.J., Brunskill, E.: Provable benefits of actor-critic methods for offline reinforcement learning. NeurIPS Workshop on Theory of RL (2021). http://arxiv.org/abs/2108.08812
30. Shariff, S.M., Alam, M.S., Ahmad, F., Rafat, Y., Asghar, M.S.J., Khan, S.: System design and realization of a solar-powered electric vehicle charging station. IEEE Syst. J. 14(2), 2748–2758 (2020)
31. Lin, Z., Zhang, H.: Optimization algorithms. In: Low-Rank Models in Visual Analysis, pp. 55–110. Elsevier, Amsterdam (2017)
32. Shariff, S.M., Alam, M.S., Faraz, S., Khan, M.A., Abbas, A., Amir, M.: Economic approach to design of a level 2 residential electric vehicle supply equipment. In: Singh, S., Pandey, R., Panigrahi, B., Kothari, D. (eds.) Advances in Power and Control Engineering. Lecture Notes in Electrical Engineering, vol. 609, pp. 25–40. Springer, Singapore (2020)
33. Heydari-doostabad, H., O'Donnell, T.: A wide-range high-voltage-gain bidirectional DC–DC converter for V2G and G2V hybrid EV charger. IEEE Trans. Ind. Electron. 69(5), 4718–4729 (2022)
34. Abdullah, H.M., Gastli, A., Ben-Brahim, L.: Reinforcement learning based EV charging management systems – A review. IEEE Access 9, 41506–41531 (2021)
35. Liessner, R., Schroer, C., Dietermann, A., Bäker, B.: Deep reinforcement learning for advanced energy management of hybrid electric vehicles. In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence. Funchal, Portugal, pp. 61–72 (2018)
36. Hanna, J.P., Niekum, S., Stone, P.: Importance sampling in reinforcement learning with an estimated behavior policy. Mach. Learn. 110(6), 1267–1317 (2021)
37. Silver, D., Singh, S., Precup, D., Sutton, R.S.: Reward is enough. Artif. Intell. 299, 103535 (2021)
38. Lauri, M., Pajarinen, J., Peters, J.: Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement. Auton. Agents Multi-Agent Syst. 34(2), 42 (2020)
39. Sörme, J.: Intelligent charging algorithm for electric vehicles. Degree Proj. Comput. Sci. Eng. (2020)
40. Pavić, I., Pandžić, H., Capuder, T.: Electric vehicle based smart e-mobility system – Definition and comparison to the existing concept. Appl. Energy 272, 115153 (2020)
41. Li, S., Yu, J.: Deep transfer network with adaptive joint distribution adaptation: A new process fault diagnosis model. IEEE Trans. Instrum. Meas. 71, 1–13 (2022)
42. Diaz-Cachinero, P., Munoz-Hernandez, J.I., Contreras, J.: A probability-based algorithm for electric vehicle behaviour in a microgrid with renewable energy and storage devices. In: 2020 International Conference on Smart Energy Systems and Technologies (SEST). Istanbul, Turkey, pp. 1–6 (2020)
43. Shafiq, S., Al-Awami, A.T.: An autonomous charge controller for electric vehicles using online sensitivity estimation. IEEE Trans. Ind. Appl. 56(1), 22–33 (2020)

How to cite this article: Amir, M., Zaheeruddin, Haque, A., Kurukuru, V.S.B., Bakhsh, F.I., Ahmad, A.: Agent based online learning approach for power flow control of electric vehicle fast charging station integrated with smart microgrid. IET Renew. Power Gener. 1–13 (2022). https://doi.org/10.1049/rpg2.12508

APPENDIX

Parameters
CD_i            charging demand for the i-th EV
T_ari           EVs arrival time
S               state space
P_V2G(s, a)     vehicle to grid (state, agent)
ρ_i             policy
T_dep           EVs departure time
E_d             energy demand
EV_Char.st      EVs charging station
EV_SoC          EVs battery state of charge
P_limit^max     maximum power limit
P_BCs^Δ         estimation of BCs at time {t + 1, .., T}
SoC_initial     initial state of charge
v_T^P(s)        average values of reward
I_i,t / V_i,t   current/voltage at the i-th FCS at time (t)
W^Ch            charging limit
SoC_expected    expected state of charge

Abbreviations
BCs             Boundary conditions
(Dis)charging   Discharging and charging phenomena
DNN             Deep neural network
DISCOs          Distribution companies
EVs             Electric vehicles
DCFCS           Direct current fast charging station
EVFCS           Electric vehicle fast charging station
FCS             Fast charging stations
G2V             Grid to Vehicle
HIL             Hardware in loop
HEVs            Hybrid electric vehicles
RL              Reinforcement learning
V2G             Vehicle to Grid
UPS             Uninterruptible power supply
