

Article
Model-Free Approach to DC Microgrid Optimal Operation
under System Uncertainty Based on Reinforcement Learning
Roni Irnawan 1,2, * , Ahmad Ataka Awwalur Rizqi 1 , Muhammad Yasirroni 1 , Lesnanto Multa Putranto 1,2 ,
Husni Rois Ali 1 , Eka Firmansyah 1 and Sarjiya 1,2, *

1 Department of Electrical Engineering and Information Technology, Universitas Gadjah Mada,


Yogyakarta 55281, Indonesia; ahmad.ataka.ar@ugm.ac.id (A.A.A.R.);
muhammad.yasirroni@mail.ugm.ac.id (M.Y.); lesnanto@ugm.ac.id (L.M.P.); husni.rois.ali@ugm.ac.id (H.R.A.)
2 Center for Energy Studies, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
* Correspondence: roniirnawan@ugm.ac.id (R.I.); sarjiya@ugm.ac.id (S.); Tel.: +62-274-552305 (R.I.)

Abstract: There has been tremendous interest in the development of DC microgrid systems, which
consist of interconnected DC renewable energy sources. However, operating a DC microgrid system
optimally by minimizing operational cost and ensuring stability remains a problem when the system's
model is not available. In this paper, a novel model-free approach to perform operation control of
DC microgrids based on reinforcement learning algorithms, specifically Q-learning and Q-network,
has been proposed. This approach circumvents the need to know the accurate model of a DC grid
by exploiting an interaction with the DC microgrids to learn the best policy, which leads to more
optimal operation. The proposed approach has been compared with mixed-integer quadratic
programming (MIQP) as the baseline deterministic model that requires an accurate system model.
The result shows that, in a system of three nodes, both Q-learning (74.2707) and Q-network (74.4254)
are able to learn to make a control decision that is close to the MIQP (75.0489) solution. With the
introduction of both model uncertainty and noisy sensor measurements, the Q-network performs
better (72.3714) compared to MIQP (72.1596), whereas Q-learning fails to learn.

Keywords: DC microgrids; optimisation; Q-learning; Q-network; reinforcement learning

1. Introduction

There has been tremendous interest in generating electric energy from renewable energy resources [1,2], owing to environmental concerns and current advances in power electronics. Unlike conventional power plants, which often come in large sizes, the capacity of renewable power plants strongly depends on local potential and ranges from a few kilowatts to several megawatts.

Currently, small-sized renewable energy power plants, such as micro hydro, photovoltaic (PV), or wind turbines, are often connected to local loads at the distribution level to configure autonomous alternating current (AC) microgrids [3]. These systems are expected to be able to work with and without grid connections. However, the problems of reactive power compensation and frequency control in AC microgrids (ACMG) can be more challenging than in conventional AC systems due to the intermittent nature of wind and solar power sources [4].

As most renewable energy sources, especially type-4 wind turbines and PV, inherently produce direct current (DC) outputs, it is natural to interconnect them using DC voltage to configure direct current microgrids (DCMG). Unlike AC microgrids, DC microgrids have no problems with reactive power and frequency control [4]. Moreover, in a DC microgrid system, the use of an energy storage system (ESS), notably a battery energy storage system (BESS), becomes indispensable to balance the intermittency. This has been demonstrated in [5], where a method to coordinate several ESSs has been proposed.


DC voltage in a DC grid can be considered the power balance indicator, i.e., similar
to frequency in an AC grid [4,6–9]. By controlling the DC voltage level within the grid,
a certain power flow can be achieved. A DC source connected to a DC grid can be operated
in one of three modes, i.e., droop-controlled voltage, constant power, or constant
voltage mode. The response of these modes when subjected to a disturbance is illustrated
in Figure 1.

Figure 1. The operation modes of a DC source represented by a single-slope active power (Pdc) and
DC voltage (Udc) relationship: (a) droop-controlled voltage, (b) constant power, and (c) constant
voltage mode [6]. The pre-disturbance operating point of the converter is indicated by the red dot,
while the blue dot represents the post-disturbance operating point.

Wind or PV power plants are usually operated at their maximum power point; hence,
these DC sources are often operated in the constant power mode, i.e., injecting active power into
the DC microgrid system regardless of the DC voltage condition in the DC microgrid.
This also means that the intermittent operation of these power plants is reflected in
DC voltage fluctuations. In order to counter the changes in the DC voltage, a BESS can be
operated in the constant DC voltage mode to maintain a constant DC voltage. In this mode, the BESS
absorbs or injects power whenever there is a surplus or deficiency of
power in the DC microgrid. However, the operation of a BESS is a complex problem as it has
special constraints such as depth of discharge (DOD) requirements, state of charge (SOC)
limitations, and charge and discharge rates, among others [10]. Therefore, the BESS may
switch to the DC voltage droop mode to limit its contribution to maintaining the
DC voltage within the DC microgrid, or even operate in the constant power mode when
reaching its charge or discharge limit.

1.1. Related Work


In order to operate DC microgrids in an optimal way, a method to minimize operating
costs has been proposed in [11] using a particle swarm optimization (PSO) algorithm.
A similar idea of minimizing operating cost while improving stability has been proposed
in [12] using a switched-system approach called supervisory control (SC), which appropriately
selects the operating mode to optimize the control objectives. In other studies, using adaptive
dynamic programming (ADP) [13] and tabu search (TS) [14], researchers found that
a central control system that coordinates distributed control devices across multiple
nodes in the microgrid can help optimize a common objective. The switching problem
of voltage sources across multiple nodes has been further studied in [15] to guarantee
the analytic performance of the proposed controller. The authors of [15] formulated the
switching problem of multi-node voltage sources as a mixed-integer linear programming
(MILP) problem. The stability study used in [15] is based on the previous study of a small-
signal model for low-voltage DCMG (LVDCMG) in [16].
Although the proposed methods in the literature have demonstrated satisfactory
performance in the scenarios presented in the corresponding papers, they necessitate an
exact model of the DC grids. This assumption is very restrictive for several reasons. Firstly,
from the practical aspect, DC microgrids consist of several components coming from
different vendors. Each vendor is normally unwilling to share the details of its components,
e.g., the transfer functions of the components or their controllers, to safeguard its
intellectual property. Secondly, even if accurate models are available, some components
experience wear and tear after a certain time, which leads to model uncertainty. An incomplete
state model is explored in [17] by considering communication delay. The authors overcame
the lack of an exact system model by implementing a heuristic-based approach, that is,
a genetic algorithm (GA), because a GA does not require an exact model to obtain a sufficient
solution approximation. Hence, the natural remedy is to use a method that does not rely
on an exact model of the DC grids.
In the current power system hierarchy, large amounts of data are generated by,
for example, smart meters and phasor measurement units (PMUs). These data
contain very rich information regarding the operating status of the system at any time instance.
On the other hand, recent progress in applied mathematics and data science makes it
possible to extract useful information for power system operation without prior supervisory
knowledge of how to extract meaningful information from data. An example of this
is the reinforcement learning (RL) [18] approach, a machine learning method that
allows an agent to take a sequence of appropriate actions to maximize a certain objective.
This method has recently been adopted in power systems [19], for instance to solve the
Volt/Var control (VVC) problem in medium-voltage AC (MVAC) distribution systems [20],
real-time wide-area stabilizing control in a high-voltage AC (HVAC) system [21], and
data-driven load frequency control in a multi-area HVAC system [22].
The purpose of this study is to propose a novel method, based on a reinforcement learning
algorithm, to solve the DC microgrid switching problem optimally given limited information
about the exact model of the DC grids. The limitation is modeled as noise and error
in the system matrix model, simulating noise in sensor measurements and an
imprecise power system model. Table 1 highlights the differences between our study
and other works. The data challenges shown in the table refer to communication delay,
measurement noise, and model error.

1.2. Contributions
Because accurate models of DC grids are not always readily available,
this paper proposes a novel model-free approach to performing operation control of DC
microgrids based on reinforcement learning. Variants of the reinforcement learning
algorithm are proposed in this paper because of their ability to learn efficiently from past state
and action data, such as the current and operation cost data which correspond to
any mode of operation in the context of a DC microgrid system. In addition, we show
that our proposed approach can cope with model uncertainty due to noise in sensor
measurements and an imprecise power system model, while producing a near-optimal policy
which takes both stability and operational cost into consideration.
The contributions of this paper can be summarized as follows:
• Propose a novel model-free approach for solving the LVDCMG optimal switching
problem;
• Demonstrate the ability of a reinforcement learning algorithm to solve the LVDCMG
optimal switching problem under measurement noise and an imprecise power system
model;
• Provide a minimal working example for applying reinforcement learning parameters
in the LVDCMG optimal switching problem.

Table 1. Comparison of Related Works.

Methods:

Paper | Systems | Algorithm | Objective             | Unknown
[11]  | DCMG    | PSO       | Economic, Environment | Droop
[12]  | DCMG    | SC        | Stability             | Mode
[15]  | DCMG    | MILP      | Economic, Stability   | Mode
[13]  | DCMG    | ADP       | Stability             | Policy
[14]  | DCMG    | RL, TS    | Topology, Stability   | Edge, Policy
[17]  | ACMG    | GA        | Stability, Frequency  | Generation
[20]  | AC      | RL        | Economic, Stability   | Tap Changer
[21]  | AC      | RL        | Stability             | Control Signal
Ours  | DCMG    | RL        | Economic, Stability   | Mode

Data Challenges:

Paper | Voltage Level | Delay | Noise | Error
[11]  | LV            | -     | -     | -
[12]  | LV            | -     | -     | -
[15]  | LV            | -     | -     | -
[13]  | LV            | -     | -     | -
[14]  | LV            | -     | -     | -
[17]  | LV            | X     | -     | -
[20]  | MV            | -     | -     | -
[21]  | HV            | -     | -     | -
[22]  | HV            | -     | -     | -
Ours  | LV            | -     | X     | X

2. Operation Control of DC Microgrids


2.1. Models of DC Microgrids
Each power source in a distributed droop-controlled LVDCMG can be operated in various
modes of operation, that is, as a droop-controlled voltage source (DVS), a constant
power source (CPS), or a constant power load (CPL) [15,16], as can be seen in Figure 2.
For instance, a battery can be operated either as a CPS when discharging, as a CPL when charging,
or as a DVS during both charging and discharging.

Figure 2. Each node of the DC microgrid systems can be modeled as a droop-controlled voltage
source, a constant power source, or a constant power load.

For each droop-controlled voltage source $j \in \mathcal{N}_d$,

$$V_j = V_{j0} - d_j P_{Dj}, \qquad (1)$$

where $V_j$ is the droop controller output voltage, $V_{j0}$ is the droop controller nominal voltage, $d_j$ is the droop controller gain, and $P_{Dj}$ is the droop controller output power. The small-signal representation of the droop-controlled voltage source can be modeled as

$$v_j = -\left(\frac{d_j \bar{V}_j}{1 + d_j \bar{I}_j}\right) i_{Dj}, \qquad (2)$$

where $v_j$ is the small-signal perturbation of the voltage, $\bar{V}_j$ and $\bar{I}_j$ are the DC voltage and current at the steady-state operating point, respectively, and $i_{Dj}$ is the corresponding small-signal perturbation of the current. The individual small-signal models of all nodes are aggregated, which leads to

$$\mathbf{v} = -\tilde{\mathbf{d}}\,\mathbf{i}_D, \qquad (3)$$

where $\mathbf{v} := [v_1, v_2, \dots, v_{N_d}]$, $\mathbf{i}_D := [i_1, i_2, \dots, i_{N_d}]$, and $\tilde{\mathbf{d}}$ is a diagonal matrix with diagonal elements $\frac{d_j \bar{V}_j}{1 + d_j \bar{I}_j}$, $j = 1, 2, \dots, N_d$.
Suppose that the microgrid contains $N_{CPS}$ CPSs out of $N_s$ sources, where the set of CPSs is given by $\mathcal{N}_{CPS} := \{1, 2, \dots, N_{CPS}\}$. By modeling each CPS $j \in \mathcal{N}_{CPS}$ as a current source in parallel with a conductance, we have the following model:

$$i_{CPSj} = g_{CPSj}\, v_j + \tilde{i}_{CPSj}, \qquad (4)$$

where $i_{CPSj}$ and $v_j$ are the corresponding CPS small-signal output current and voltage. The values of the corresponding conductance $g_{CPSj}$ and current source $\tilde{i}_{CPSj}$ are determined by the power and voltage at an operating point as follows:

$$g_{CPSj} = \frac{P_{CPSj}}{V_{DCj}^2}, \qquad (5)$$

$$I_{CPSj} = \frac{2 P_{CPSj}}{V_{DCj}}, \qquad (6)$$

where $P_{CPSj}$ and $V_{DCj}$ are the output power and nominal voltage at the operating point of the CPS. It should be noted that $I_{CPSj}$ is the nominal current source value, while $\tilde{i}_{CPSj}$ is the small-signal perturbation of the current source. All the small-signal models can be aggregated into the following equation:

$$\mathbf{i}_{CPS} = \mathbf{g}_{CPS}\, \mathbf{v}_{CPS} + \tilde{\mathbf{i}}_{CPS}, \qquad (7)$$

where $\mathbf{i}_{CPS} := [i_{CPS1}, i_{CPS2}, \dots, i_{CPS N_{CPS}}]$, $\mathbf{v}_{CPS} := [v_1, v_2, \dots, v_{N_{CPS}}]$, $\tilde{\mathbf{i}}_{CPS} := [\tilde{i}_{CPS1}, \tilde{i}_{CPS2}, \dots, \tilde{i}_{CPS N_{CPS}}]$, and $\mathbf{g}_{CPS} = \mathrm{diag}(g_{CPS1}, g_{CPS2}, \dots, g_{CPS N_{CPS}})$.

Similar to a CPS, a CPL is modeled as a negative conductance in parallel with a current sink. Suppose that the set of CPLs is given by $\mathcal{N}_{CPL} := \{1, 2, \dots, N_{CPL}\}$, where $N_{CPL}$ refers to the number of CPLs in the microgrid. Each CPL $j \in \mathcal{N}_{CPL}$ can be modeled as follows:

$$i_{CPLj} = -g_{CPLj}\, v_j + \tilde{i}_{CPLj}, \qquad (8)$$

where $i_{CPLj}$ and $v_j$ are the corresponding small-signal output current and voltage of the CPL, respectively. The values of the corresponding conductance $g_{CPLj}$ and current sink $\tilde{i}_{CPLj}$ are determined by the power and voltage at an operating point as follows:

$$g_{CPLj} = \frac{P_{CPLj}}{V_{DCj}^2}, \qquad (9)$$

$$I_{CPLj} = \frac{2 P_{CPLj}}{V_{DCj}}, \qquad (10)$$

where $P_{CPLj}$ and $V_{DCj}$ are the output power and DC nominal voltage at the operating point of the CPL. The small-signal models can be aggregated into

$$\mathbf{i}_{CPL} = -\mathbf{g}_{CPL}\, \mathbf{v}_{CPL} + \tilde{\mathbf{i}}_{CPL}, \qquad (11)$$

where $\mathbf{i}_{CPL} := [i_{CPL1}, i_{CPL2}, \dots, i_{CPL N_{CPL}}]$, $\mathbf{v}_{CPL} := [v_1, v_2, \dots, v_{N_{CPL}}]$, $\tilde{\mathbf{i}}_{CPL} := [\tilde{i}_{CPL1}, \tilde{i}_{CPL2}, \dots, \tilde{i}_{CPL N_{CPL}}]$, and $\mathbf{g}_{CPL} = \mathrm{diag}(g_{CPL1}, g_{CPL2}, \dots, g_{CPL N_{CPL}})$.

Suppose that the set of $N_b$ power lines connecting adjacent vertices (sources and loads) in the microgrid is given by $\mathcal{N}_b := \{1, 2, \dots, N_b\}$. Each power line $j \in \mathcal{N}_b$ can be modeled as follows:

$$v_{bj} = l_{bj}\, \frac{d i_{bj}}{dt} + r_{bj}\, i_{bj}, \qquad (12)$$

where $v_{bj}$ and $i_{bj}$ are the power line voltage and current, respectively, and $l_{bj}$ and $r_{bj}$ are the inductance and resistance of the power line with length $\ell_j$, respectively. The aggregate of all individual small-signal models can be expressed as

$$\mathbf{v}_b = \mathbf{l}_b\, \frac{d\mathbf{i}_b}{dt} + \mathbf{r}_b\, \mathbf{i}_b, \qquad (13)$$

where $\mathbf{v}_b := [v_{b1}, v_{b2}, \dots, v_{bN_b}]$, $\mathbf{i}_b := [i_{b1}, i_{b2}, \dots, i_{bN_b}]$, $\mathbf{l}_b = \mathrm{diag}(l_{b1}, l_{b2}, \dots, l_{bN_b})$, and $\mathbf{r}_b = \mathrm{diag}(r_{b1}, r_{b2}, \dots, r_{bN_b})$.

Finally, the DC microgrid obeys Kirchhoff's voltage law (KVL) and Kirchhoff's current law (KCL) as follows:

$$\mathbf{v}_b = \mathbf{M}\mathbf{v}, \qquad (14)$$

$$\mathbf{i}_D - \mathbf{i}_{CP} = \mathbf{M}^T \mathbf{i}_b. \qquad (15)$$

By combining (3), (7), (11), and (13)–(15), the dynamics of the DC microgrid can be summarized in the following state equation:

$$\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} + \mathbf{B}, \qquad (16)$$

where $\mathbf{x} = \mathbf{i}_b \in \mathbb{R}^{N_x}$ and

$$\mathbf{A} = -\mathbf{l}_b^{-1}\left(\mathbf{M}\big(\tilde{\mathbf{d}}^{-1} - \mathbf{g}_{CP}\big)^{-1}\mathbf{M}^T + \mathbf{r}_b\right), \qquad (17)$$

$$\mathbf{B} = -\mathbf{l}_b^{-1}\mathbf{M}\big(\tilde{\mathbf{d}}^{-1} - \mathbf{g}_{CP}\big)^{-1}. \qquad (18)$$
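As an illustration, the state matrices in (17) and (18) can be assembled numerically as in the short sketch below; the line parameters, droop terms, conductances, and mapping matrix are hypothetical placeholders chosen only to show the matrix algebra, not values used in this paper.

```python
import numpy as np

# Hypothetical three-node, two-line example (all values are placeholders).
l_b = np.diag([1e-3, 1e-3])              # line inductances l_b (H)
r_b = np.diag([0.1, 0.1])                # line resistances r_b (ohm)
M = np.array([[1.0, -1.0, 0.0],          # mapping between node and line quantities, v_b = M v
              [0.0, 1.0, -1.0]])
d_tilde = np.diag([0.05, 0.04, 0.06])    # droop terms d_j*V_j/(1 + d_j*I_j)
g_cp = np.diag([0.02, -0.03, 0.01])      # aggregated CPS/CPL conductances

# Equations (17) and (18).
core = np.linalg.inv(np.linalg.inv(d_tilde) - g_cp)
A = -np.linalg.inv(l_b) @ (M @ core @ M.T + r_b)
B = -np.linalg.inv(l_b) @ M @ core

print(A.shape, B.shape)  # (2, 2) and (2, 3)
```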
Any operational mode (i.e., DVS, CPS, or CPL) follows the linear state equation described in (16), albeit with different values of $\mathbf{A}$ and $\mathbf{B}$. Therefore, the dynamics can be written as

$$\dot{\mathbf{x}} = \mathbf{A}_\sigma \mathbf{x} + \mathbf{B}_\sigma, \qquad (19)$$

where $\mathbf{A}_\sigma$ and $\mathbf{B}_\sigma$ stand for the matrices $\mathbf{A}$ and $\mathbf{B}$, respectively, for mode $\sigma \in \{1, \dots, N\}$, where $N$ refers to the total number of modes under consideration.

The DC microgrid's dynamics can be reformulated in discrete form as follows:

$$\mathbf{x}(k+1) = \mathbf{x}(k) + \dot{\mathbf{x}}(k)\,\Delta t, \qquad (20)$$

where $\Delta t$ is the integration time step. Substituting (19) into (20), the complete discrete-time dynamics can be rewritten as

$$\mathbf{x}(k+1) = \mathbf{A}_{k\sigma}\mathbf{x}(k) + \mathbf{B}_{k\sigma}, \qquad (21)$$

where $\mathbf{A}_{k\sigma} = \mathbf{I} + \mathbf{A}_\sigma \Delta t$ and $\mathbf{B}_{k\sigma} = \mathbf{B}_\sigma \Delta t$. Therefore, the decision variable that we can give to the system is $\mathbf{u}(k) = [u_1(k)\;\; u_2(k)\;\; \dots\;\; u_N(k)]^T$, where $u_\sigma \in \{0, 1\}$ refers to the status of mode $\sigma$ of the DC microgrid and is given by

$$u_\sigma = \begin{cases} 1 & \text{if mode } \sigma \text{ is active,} \\ 0 & \text{if mode } \sigma \text{ is inactive.} \end{cases} \qquad (22)$$

Note that only one mode can be active at a time.
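To make the switched discrete-time model concrete, a minimal sketch of stepping (21) forward under an active mode is shown below; the helper name `step` is ours, and the two mode matrices mirror the simple environment described later in Section 4.

```python
import numpy as np

def step(x, sigma, A_modes, B_modes, dt):
    """Advance the state one step under active mode sigma, Equation (21)."""
    A_k = np.eye(len(x)) + A_modes[sigma] * dt   # A_{k,sigma} = I + A_sigma * dt
    B_k = B_modes[sigma] * dt                    # B_{k,sigma} = B_sigma * dt
    return A_k @ x + B_k

# Two modes with single-state dynamics (same values as the simple environment in Section 4).
A_modes = {1: np.array([[1.0]]), 2: np.array([[-10.0]])}
B_modes = {1: np.array([1.0]),   2: np.array([1.0])}

x = np.zeros(1)
u = [0, 1]                       # one-hot decision vector of Equation (22): mode 2 active
sigma = 1 + int(np.argmax(u))    # only one mode can be active at a time
print(step(x, sigma, A_modes, B_modes, dt=0.1))
```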

2.2. Problem Statement


In this paper, we seek the optimal operational control for the DC microgrids modeled in Section 2.1. The problem can be formulated as follows:

Problem 1 (Optimal operation control). Given a microgrid system following a certain dynamics $\mathbf{x}(k+1) = \mathbf{f}(\mathbf{x}(k), \mathbf{u}(k))$ which produces an operational cost $c(\mathbf{x}(k), \mathbf{u}(k))$, what is the input signal $\mathbf{u}^*(k) = [u_1^*(k)\;\; u_2^*(k)\;\; \dots\;\; u_N^*(k)]^T$ which minimizes the total cost $C$ defined as

$$C = \sum_{k=0}^{M} c(\mathbf{x}(k), \mathbf{u}(k)). \qquad (23)$$

Here, $N$ denotes the number of available modes of operation, $u_i \in \{0, 1\}$ for every $i \in \{1, \dots, N\}$, and $M$ denotes the number of steps.

Suppose that the cost function at iteration $k$ is given by

$$c(k) = \frac{1}{2}\mathbf{x}(k)^T \mathbf{Q}\mathbf{x}(k) + \gamma_\sigma u_\sigma(k), \qquad (24)$$

where $\mathbf{Q} \in \mathbb{R}^{N_x \times N_x}$ is a positive definite matrix and $\gamma_\sigma$ is the cost of operating the current active mode $\sigma$. Here, the first term corresponds to the stability of the system (i.e., an unstable system causing large values of the state $\mathbf{x}$ will be penalized), while the second term corresponds to the operating cost of the operational mode being employed. If the system dynamics $\mathbf{f}(\mathbf{x}(k), \mathbf{u}(k))$ is fully known and linear while the cost function $c(\mathbf{x}(k), \mathbf{u}(k))$ is linear or quadratic, one way to solve the presented optimization problem is to employ linear/quadratic programming. In this case, the system dynamics in (19) can be written as a constraint of the optimization problem. One can then employ an algorithm such as mixed-integer quadratic programming (MIQP) to solve the optimization problem, as reported in [15].
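For illustration, the per-step cost (24) and the total cost (23) can be evaluated along a recorded trajectory as in the sketch below; the weights and states are placeholder values, loosely in the spirit of Table 2, and the helper names are ours.

```python
import numpy as np

def step_cost(x, sigma, Q, gamma_sigma):
    """Per-step cost of Equation (24): quadratic state penalty plus mode operating cost."""
    return 0.5 * x @ Q @ x + gamma_sigma[sigma]

def total_cost(trajectory, Q, gamma_sigma):
    """Total cost C of Equation (23) over a sequence of (state, active mode) pairs."""
    return sum(step_cost(x, sigma, Q, gamma_sigma) for x, sigma in trajectory)

# Placeholder weights and a two-step trajectory.
Q = np.eye(2) * 1e-5
gamma_sigma = {1: 1.0, 2: 1.0}
trajectory = [(np.array([40.0, 60.0]), 1), (np.array([42.0, 58.0]), 2)]
print(total_cost(trajectory, Q, gamma_sigma))
```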
However, the requirement to fully know the system dynamics accurately might be
difficult to realize in many practical situations. Some parameters in a DC microgrid system
might not be known beforehand due to communication delay [17]. Meanwhile, system
identification to acquire the model parameters could be complicated, especially if the
system contains a large number of components.

3. Reinforcement-Learning-Based Operation Control


3.1. DC Microgrids as a Markov Decision Process
To solve the optimization problem in Problem 1, we can consider the DC Microgrids
with unknown system dynamics as a Markov Decision Process (MDP) [23]. By doing so,
we assume that the DC Microgrids system fulfills the following criteria:
• the future state s(k + 1) only depends on the current state s(k), not the previous
state history,
• the system accepts a finite set of actions a(k) at every step,
• the system will provide state information s(k) and reward r (k) at every step.
The reward function $r(k)$ denotes how good the performance of the state $\mathbf{s}(k)$ at iteration $k$ is. Therefore, the optimization problem described in Problem 1, where the cost function $C$ needs to be minimized, can be transformed into an optimization problem to find a policy $\pi = \mathbf{u}^*(k)$ which maximizes the total reward $R = \sum_{k=0}^{M} r(k)$ over $M$ steps.

The discrete dynamics in (21) can be transformed into an MDP where the state $\mathbf{s}$ is given by $\mathbf{s} = [\mathbf{x}\;\; c]^T$ and the action is $\mathbf{a} = \mathbf{u}$. We assume that the cost $c(k)$ follows Equation (24). The reward function $r(k)$ can be derived from the cost function $c(k)$ as follows:

$$r(k) = \begin{cases} c_s - c(k) & \text{if } c_s > c(k), \\ 0 & \text{if } c_s \le c(k), \end{cases} \qquad (25)$$

where $c_s$ denotes a positive constant. Thus, the objective function used to evaluate each model is

$$\max_{\mathbf{u}} \sum_{k=0}^{M} r(k). \qquad (26)$$
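The reward shaping (25) follows directly from the cost; a minimal sketch is shown below, using the constant c_s = 2 of the complex environment in Table 2 as an example.

```python
def reward(cost_k, c_s):
    """Reward of Equation (25): positive only while the step cost stays below c_s."""
    return c_s - cost_k if c_s > cost_k else 0.0

# Example: with c_s = 2 (complex environment, Table 2), a step cost of 0.3 yields a reward of 1.7.
print(reward(0.3, c_s=2.0))
```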

3.2. Q-Learning for Near-Optimal Operation Control


To solve the optimization problem for the MDP modeling the dynamics of DC microgrids,
a reinforcement learning approach is employed. This approach allows an agent to interact
with an environment (i.e., the system) without knowledge of the system's dynamics by
applying an action $a$ and receiving state information $s$ as well as a reward $r$, as shown in
Figure 3. The agent then learns from this information to decide the most optimal policy
$\pi = a(k)$, $k \in (0, M)$, which maximizes the cumulative reward $R = \sum_{k=0}^{M} r(k)$
over $M$ steps.

Figure 3. A reinforcement learning agent works by interacting with an environment, such as a DC
microgrid system, performing an action and gathering observations and rewards to learn the most
optimal policy.

The reinforcement learning algorithm employed in this paper falls under the class
of algorithms called Q-learning [18], which has been successfully employed for various
engineering problems, such as gaming [24] and robotics [25]. The algorithm works by
predicting an action-value function $Q(s, a)$ for every state $s$ and action $a$, which reflects the
expected final reward when choosing action $a$ in state $s$. Once $Q(s, a)$ is established,
the optimal policy $\pi(s)$ for every state $s$ selects the action with the largest action value
(i.e., the one that will yield the most reward in the long term), as given by [18],

$$\pi(s) = \arg\max_a Q(s, a). \qquad (27)$$

For a discrete number of states, the way to build an estimate of the action-value function
$Q(s, a)$ is to construct a table which maps a state $s$ and action $a$ into a value $Q(s, a)$, as
shown in Figure 4. This algorithm, called tabular Q-learning, starts with an empty table.
The table is updated upon interaction with the environment by taking into account
information regarding the current state $s$, the given action $a$, the current reward $r(s)$, and the
next state $s' = s(k + 1)$. The update equation is based on Bellman's optimality equation
and is given by [18],

$$Q(s, a) \leftarrow Q(s, a) + \alpha\Big(r(s) + \gamma \max_{a'} Q(s', a') - Q(s, a)\Big). \qquad (28)$$

Here, $\alpha$ stands for a learning-rate constant which determines how much we allow the
value of $Q(s, a)$ to change at every iteration, and $\gamma$ stands for the reward's discount factor,
which determines how much we value the contribution of future rewards to the value of
the current state. The iteration process stops once the cumulative reward $R$ over $M$ steps
reaches a certain threshold $R_s$.

Figure 4. Q-learning works by building a Q-table to estimate the value $Q(s, a)$ of each action $a$ in a
particular state $s$.
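A minimal sketch of the tabular update (28) together with the greedy policy (27) is given below; the dictionary-based table and epsilon-greedy exploration are our own simplifications, and states are assumed to be hashable (e.g., discretized) values.

```python
import random
from collections import defaultdict

class TabularQ:
    def __init__(self, actions, alpha=0.2, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)          # Q(s, a), starting from an empty table
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def policy(self, s):
        """Equation (27) with epsilon-greedy exploration: pick the action with the largest value."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s_next):
        """Equation (28): Bellman-based temporal-difference update of the Q-table."""
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

# The defaults alpha = 0.2 and gamma = 0.9 follow the Q-learning settings in Table 2.
agent = TabularQ(actions=[1, 2])
```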

One of the main reasons for choosing Q-learning to solve the optimization problem in
Problem 1 is that it is an off-policy algorithm, i.e., it is able to learn from and reuse any
past experience regardless of the policy used to gather this experience. In the context
of DC microgrids, Q-learning is able to learn from past state and action pairs,
such as the current and operation cost data corresponding to any mode of operation.
Due to this fact, Q-learning offers better data-sampling efficiency than online
reinforcement learning approaches such as policy gradient methods [26].
Another reason is that it is designed specifically for problems with a discrete action space,
which is a characteristic of the operation control problem, i.e., determining which mode
of operation is the most optimal among a discrete number of possible operations.

3.3. Q-Network for Operation Control under Uncertainty


The tabular Q-learning approach has been found to be extremely effective in various applications
when the number of states $s$ is discrete. However, this is mostly not the case for
DC microgrid systems, which can have a large number of possible current values and
operational costs as a state. The presence of uncertainty in the measurement of the state
value adds further complications. To handle a system with a large number of states, or
even a continuous state, the Q-table can be replaced with a function approximator which
directly maps state $s$ and action $a$ into $Q(s, a)$. One such function approximator which has
gained prominence in recent years is the neural network, owing to its ability to approximate
highly non-linear functions solely from raw data [27]. A study employing neural
networks in Q-learning as a function approximator of the action-value function $Q(s, a)$ has
been carried out in [28]. This approach, called Q-network, works essentially the same
as Q-learning, except that the Q-table is replaced with a neural network and the update
equation in (28) is replaced with training steps of the neural network.
The neural network $\phi(s)$, as depicted in Figure 5, accepts a state $s$ as an input and
returns the values $Q(s, a)$ of each possible discrete action $a$. The target output of the network
is derived from Bellman's equation, similar to the term in Equation (28), as follows:

$$Q_t(s, a) = r(s) + \gamma \max_{a'} Q(s', a'). \qquad (29)$$

The loss, given by $L = Q(s, a) - Q_t(s, a)$, is used to update the parameters of the network
$\phi(s)$ using the gradient descent algorithm.

Figure 5. Instead of a table, a neural network can serve as an action-value approximator in Q-learning.
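As an illustration, a Q-network matching the architecture described in Section 4 (one hidden layer with four nodes, one output per mode) and the Bellman target (29) could look as follows; the choice of PyTorch is our assumption, since the paper does not state the framework used.

```python
import torch
import torch.nn as nn

# State [current, cost] -> one Q-value per mode; one hidden layer with 4 nodes (Section 4).
q_net = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 2))

def bellman_target(r, s_next, gamma=0.99):
    """Equation (29): r(s) + gamma * max_a' Q(s', a'), evaluated without tracking gradients."""
    with torch.no_grad():
        return r + gamma * q_net(s_next).max()
```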

To facilitate better training of the neural network (which ideally requires uncorrelated
training data), a strategy called an experience buffer is employed. Upon interacting with the
environment, the agent does not update the neural network's parameters solely based on the
current information, but saves a tuple $\tau = (s, a, s', r)$ into a buffer of length $L_b$.
In every iteration, a total of $N_t < L_b$ tuples are randomly sampled from the experience
buffer and used as training data. The training process stops once the average cumulative
reward over $M$ steps, $R_{avg}$, in the last $N_E$ completed episodes reaches
a certain threshold $R_s$. To obtain diverse data at the initial stage of training, the agent
starts with purely random actions, i.e., the probability $\epsilon$ of choosing a random action
is set to 1. This parameter $\epsilon$ is then linearly decreased at every iteration until it reaches a
certain minimum value $\epsilon_{min}$ after $N_e$ iterations. Afterwards, the agent chooses
a random action with probability $\epsilon_{min}$ and, with probability $1 - \epsilon_{min}$, chooses
the most optimal policy in (27) from the output $Q(s, a)$ provided by the neural network.
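A condensed sketch of this training procedure (experience buffer, random mini-batch sampling, and linear epsilon decay) is given below; it continues the PyTorch-based assumption above, the buffer length, batch size, and epsilon schedule follow the complex-environment column of Table 2, and the environment interface `env.observe`/`env.apply` is hypothetical.

```python
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 2))   # as in the sketch above
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.001)
buffer = deque(maxlen=50_000)                                        # experience buffer of length L_b
eps, eps_min, eps_decay_iters = 1.0, 0.02, 250_000                   # linear epsilon schedule
gamma, batch_size = 0.99, 40                                         # discount and N_t tuples per update

def train_step(env):
    """One interaction with the environment followed by one mini-batch update."""
    global eps
    s = env.observe()                                                # hypothetical: state tensor [current, cost]
    a = random.randrange(2) if random.random() < eps else int(q_net(s).argmax())
    s_next, r = env.apply(a)                                         # hypothetical: apply mode a, get next state and reward
    buffer.append((s, a, s_next, r))                                 # store tuple tau = (s, a, s', r)
    eps = max(eps_min, eps - (1.0 - eps_min) / eps_decay_iters)      # decay epsilon down to eps_min

    if len(buffer) >= batch_size:
        loss = torch.zeros(())
        for s_b, a_b, s_next_b, r_b in random.sample(list(buffer), batch_size):
            with torch.no_grad():                                    # Bellman target, Equation (29)
                target = r_b + gamma * q_net(s_next_b).max()
            loss = loss + (q_net(s_b)[a_b] - target) ** 2
        optimizer.zero_grad()
        (loss / batch_size).backward()
        optimizer.step()
```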

4. Methodologies
To evaluate the performance of the algorithms, simulations in two environments are used, as follows:
1. A simple environment: a simplified version of a DC microgrid system consisting
of a single source and a single load connected via a transmission line (Nx = 1).
The system can be operated in two different modes (N = 2). The number of update
steps in a single episode of operation is set to M = 28. For simplicity, the state
equation matrices are assumed to be A = B = 1 for mode σ = 1 and
A = −10, B = 1 for mode σ = 2.
2. A complex environment: a more realistic DC microgrid system consisting of three nodes.
Each node consists of a source and/or a load. Two transmission lines (Nx = 2) connect
Node 1 and Node 2 as well as Node 2 and Node 3, as shown in Figure 6. The system
can be operated in two different modes (N = 2). The number of update steps in a
single episode of operation is set to M = 40. In this environment, it is assumed
that the system's mode of operation can only be updated once every 1 s. The state
equation matrices used in this environment are derived from (16)–(22).

Figure 6. The complex environment consists of DC microgrid systems with 3 nodes and 2 bus
connections.

The two environments are chosen for their simplicity, while remaining representative
of a DCMG power system model formulated as an MIQP problem. The first
environment is the simplest possible representation of the system model, while the
second environment is a basic building block for any DCMG system topology. In future
research, this second environment can be used to build larger and more complex DCMG
system topologies.
To evaluate the performance of the algorithms, we design three different scenarios.
The first is when the model is assumed to be perfectly known to the MIQP algorithm
and the current measurement is perfectly accurate, i.e., there is no modeling or measurement
uncertainty. The second is when there is a modeling error, i.e., the model information
used by the MIQP algorithm does not perfectly match the real model of the system. In this
scenario, the real transition matrix $\mathbf{A}_\sigma$ and input vector $\mathbf{B}_\sigma$ for each mode $\sigma$ have offsets
of $\Delta A$ and $\Delta B$, respectively, with respect to the model information known to the MIQP
algorithm. Finally, the third scenario adds Gaussian noise with mean $\mu$ and standard
deviation $\upsilon$ to the current measurement $\mathbf{x}$; thus, the third scenario
contains both the Gaussian noise and the modeling error at the same time.
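The sketch below shows one possible way to inject the modeling error and measurement noise of the second and third scenarios; the relative offsets and noise parameters follow the complex-environment column of Table 2, the multiplicative form of the offset is our reading of the percentage values, and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_model(A_sigma, B_sigma, rel_offset=0.10):
    """Second scenario: the real plant matrices differ from the MIQP model by a relative offset (Delta A, Delta B)."""
    return A_sigma * (1.0 + rel_offset), B_sigma * (1.0 + rel_offset)

def noisy_measurement(x, mu=0.0, upsilon=0.1):
    """Third scenario: Gaussian noise (mean mu, std upsilon, in amperes) added to the current measurement."""
    return x + rng.normal(mu, upsilon, size=np.shape(x))
```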
Apart from implementing Q-learning and Q-network, we also implement an optimiza-
tion technique based on mixed-integer quadratic programming (MIQP) employed in [15].
The MIQP approach is used as the baseline approach and simulated under the assumption
that the dynamics model of the DC microgrid is known. We tested the Q-learning, the Q-
network, and the MIQP algorithm to find the most optimal action for each scenario in the
given environment. Note that the Q-learning and Q-network do not know anything about
the dynamics of the system while the MIQP requires information regarding the model’s
dynamics. The neural network employed in the Q-network algorithm contains 1 hidden
layer with 4 nodes. The parameters used in the algorithm are listed in detail in Table 2.

Table 2. List of Parameter Values.

Param          | Simple | Complex | Units
γ (Q-learning) | 0.9    | 0.9     | -
α (Q-learning) | 0.2    | 0.2     | -
γ (Q-network)  | 0.99   | 0.99    | -
α (Q-network)  | 0.01   | 0.001   | -
c_s            | 0.2    | 2       | -
µ              | 0      | 0       | ampere
υ              | 0.01   | 0.1     | ampere
∆t             | 0.1    | 0.0001  | second
∆A, ∆B         | 30     | 10      | %
Q              | 1      | 0.00001 | -
γ_σ            | 0.1    | 1       | -
N_t            | 28     | 40      | -
L_b            | 30,000 | 50,000  | -
N_e            | 10,000 | 250,000 | -
ε_min          | 0.02   | 0.02    | -

5. Results
5.1. Simple Environment
First, we present the results of the simulation in the first scenario, when there is
no modeling or measurement error. For this problem, the Q-learning and Q-network
algorithms produce exactly identical solutions, as shown in Figure 7a,b. The best policy
produced by both algorithms is shown in Figure 7a and the resulting state dynamics is
shown in Figure 7b.

Figure 7. (a) The action given to the simple DC microgrid system produced by Q-learning or
Q-network and (b) the produced state dynamics.

To evaluate the performance of both algorithms, we compare the cumulative reward
per episode as a function of the training iteration number in Figure 8a. The red, blue,
and green lines indicate the cumulative reward produced by the Q-learning, Q-network,
and MIQP algorithms, respectively. Note that the cumulative reward of the Q-network
is presented as an average cumulative reward over the last 100 episodes to take into
account the stochastic nature of the Q-network algorithm. We can observe that the Q-
learning algorithm (red line) converges very quickly towards its most optimal solution.
The produced cumulative reward is also very close to the optimal solution produced
by MIQP (green line). The Q-network algorithm (blue line) produces a solution with
a lower average cumulative reward compared to Q-learning and MIQP. However, as can be
observed in Figure 7a, the best solution produced by the Q-network matches the one produced
by Q-learning. Consequently, as shown in Figure 8b, the reward accumulated by the Q-
network over 28 steps matches the reward accumulated by Q-learning. Overall, the total
reward accumulated by both of these algorithms (red line for Q-learning and blue line for
Q-network) over 28 steps is only slightly smaller than the optimal solution predicted by
MIQP (green line), as shown in Figure 8b. This can also be concluded from Table 3,
where the final reward of Q-learning and Q-network (3.5867) is shown to be very close to
the reward of MIQP (3.5871). Note that this performance is achieved by Q-learning and
Q-network without prior information about the system's dynamics, contrary to MIQP.

Figure 8. (a) Comparison of cumulative reward per episode as a function of training iteration produced
by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP
(green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by
Q-learning (red), Q-network (blue), and MIQP (green) in the case of the normal DC microgrid system
without model error. Note that the red and blue lines are on top of each other.

Table 3. Summary of Results.

Best Reward/Cost per Episode:

System  | Model Error | Noise | Q-Learning | Q-Network | MIQP
Simple  | -           | -     | 3.5867     | 3.5867    | 3.5871
Simple  | X           | -     | 3.3092     | 3.6554    | 3.6293
Simple  | X           | X     | -          | 3.6685    | 3.6667
Complex | -           | -     | 74.2707    | 74.4254   | 75.0489
Complex | X           | -     | 72.4335    | 72.3714   | 72.1596
Complex | X           | X     | -          | 72.3714   | 72.1596

A dash in the Q-Learning column indicates that the algorithm did not converge in that scenario.

The simulation results of the second scenario, where there is a modeling error in
the system, are shown in Figure 9. From the plot of cumulative reward per episode as a
function of training iteration number in Figure 9a, we can observe that the Q-learning
algorithm (red line) once again converges rapidly to the final solution. However, the
solution reached by this algorithm is no longer close to the solution provided by MIQP
(green line), even in the presence of model error. The average cumulative reward produced
by the Q-network (blue line) is also lower than in the case without model error. However,
if we observe the cumulative reward gathered over the course of 1 episode (28 steps) in
Figure 9b, the best solution of the Q-network (blue line) is slightly better than the one
produced by MIQP (green line). This is the case because MIQP no longer provides the
global optimum solution for the system due to the presence of modeling error. The
Q-network algorithm, on the other hand, learns directly from the system without prior
information regarding the model, and thus its performance is not affected by the modeling
error. We can observe this more clearly from Table 3, where the best reward produced by
the Q-network (3.6554) is better than the reward produced by Q-learning (3.3092)
and MIQP (3.6293).

The simulation results of the third scenario, in the presence of both the modeling error
and the measurement noise, are shown in Figure 10. In this scenario, the Q-learning algorithm
fails to converge towards a solution. This is mainly caused by the fact that the measurement
noise introduces a lot of variation in the state values, which consequently causes the Q-table's
dimension to become too large. In contrast, the Q-network algorithm does not fall into this
problem because it uses a continuous function approximator in the form of a neural
network. In Figure 10, we can observe that the Q-network algorithm in this case converges
much faster (slightly after 10,000 iterations) compared to the previous cases. This demonstrates
how the Q-network algorithm actually works better in a more complex scenario, such as
the one with modeling and measurement uncertainty. The best solution produced by the
Q-network algorithm (blue line) also yields a cumulative reward which is very close to
the solution provided by MIQP (green line) in Figure 10. In fact, the best cumulative reward
of the Q-network (3.6685) is once again better than the one produced by MIQP (3.6667), as
shown in Table 3. This demonstrates the power of the Q-network algorithm to solve complex
optimization problems in the presence of uncertainty without relying on prior information
about the system's model. These features make the Q-network algorithm a very strong
candidate for use in a real operational control problem of a DC microgrid system.

Figure 9. (a) Comparison of cumulative reward per episode as a function of training iteration produced
by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP
(green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by
Q-learning (red), Q-network (blue), and MIQP (green) in the case of the normal DC microgrid system
with model error.

Figure 10. (a) Comparison of cumulative reward per episode as a function of training iteration pro-
duced by Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and
(b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by Q-network (blue)
and MIQP (green) in the case of the normal DC microgrid system with model error and measurement
noise. The Q-learning solution is not shown due to non-convergence.

5.2. Complex Environment


In this subsection, we present the results of the complex simulation consisting of
three nodes. The configuration of the network is shown in Figure 6. This subsection
is almost identical to Section 5.1, differing only in the system parameters. The system
parameters are derived from the state equation matrices presented in (16)–(22). A sample
of the resulting transient current transfer between nodes as an impact of the switching
can be seen in Figure 11. First, in a system without modeling and measurement error,
the cumulative reward per episode, consisting of 40 steps, as a function of training iteration
number can be observed in Figure 12. We can observe that the Q-learning algorithm (red
line) behaves as in Section 5.1, converging very quickly towards its most optimal solution.
However, as we can see from Figure 12, Q-learning achieves the lowest score, followed by
the Q-network, while MIQP achieves the highest score. Furthermore, the increase in system
complexity can be seen to have an impact on the model-free approach, as the difference
of both Q-learning (74.27071075) and Q-network (74.42539101) from the global optimum
solution of MIQP (75.04887579) widens compared to the simple system in Section 5.1.
This happens because the information gap between the model-free approach and the MIQP
solution, which uses fully known system dynamics parameters, also widens.

Figure 11. Sample current transient from mode 1 to mode 2 in the case of a complex DC micro-
grid system.

Figure 12. (a) Comparison of cumulative reward per episode as a function of training iteration
produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by
MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by
Q-learning (red), Q-network (blue), and MIQP (green) in the case of the complex DC microgrid system
without model error.

The simulation results of the second scenario, a complex system with a modeling
error, are shown in Figure 13. From the plot of cumulative reward per episode
as a function of training iteration number in Figure 13, we can observe that the Q-learning
algorithm (red line) again converges rapidly to the final solution, this time even achieving the
best score, followed by the Q-network. From Figure 13, we can see that over the course of
1 episode (40 steps), both Q-learning (72.4335) and Q-network (72.3714) perform
relatively well even without prior knowledge of the model's dynamics, compared to MIQP
(72.1596). Furthermore, the presence of modeling error greatly shifts the MIQP solution
away from the global optimum in the complex system scenario, showing the vulnerability of the
MIQP approach under an inaccurate system model. Nevertheless, the Q-network requires many
iterations to converge (around 500,000) compared to the first scenario, where no
model error exists.
The simulation results of the third scenario, where there are both modeling error
and measurement noise in the system, are shown in Figure 14. In this scenario, the Q-
learning algorithm again fails to converge towards a solution, just as in the simple model.
In contrast, the Q-network algorithm is still capable of handling the noise using its neural
network model. Furthermore, we can see from Figure 14 that the solution of the Q-network
(72.3714) achieves a better score compared to the solution provided by MIQP (72.1596). This
result shows that the Q-network algorithm is capable of handling both modeling
and measurement uncertainty even in a complex scenario. These features once again further
solidify the Q-network algorithm as a very strong candidate for use in a real operational
control problem of a DC microgrid system. Nevertheless, further research is needed to better
understand the limits of system complexity and measurement error that the Q-network is able
to handle in a feasible manner with acceptable results.
To investigate the impact of node quantity on the performance and training time of
the Q-network, a series of experiments was conducted. The simulations involved placing
clusters of nodes in a sequential arrangement, following the configuration of the complex
network illustrated in Figure 6. The clusters consisted of 50, 100, and 1000 copies of the
three-node configuration, which can be equivalently represented as 150, 300, and 3000 nodes,
respectively. The results of these experiments can be found in Table 4. The findings indicate
that increasing the system size leads to a higher computational burden for the Q-network
during training. However, the rate of increase in training time is lower than the rate of
increase in the number of nodes. This suggests that the increase in computational time is
primarily caused by the small-signal calculations rather than by the Q-network algorithm itself.

Figure 13. (a) Comparison of cumulative reward per episode as a function of training iteration pro-
duced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by
MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by
Q-learning (red), Q-network (blue), and MIQP (green) in the case of the complex DC microgrid system
with model error.

Figure 14. (a) Comparison of cumulative reward per episode as a function of training iteration pro-
duced by Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and
(b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by Q-network (blue)
and MIQP (green) in the case of the complex DC microgrid system with model error and measurement
noise. The Q-learning solution is not shown due to non-convergence.

Table 4. Impact of System Size.

System Multiplier | Number of Nodes | Training Time (s)
1                 | 3               | 13.949822300113738
100               | 300             | 28.190732199931517
1000              | 3000            | 125.25556980003603

6. Conclusions and Future Works


In this paper, we propose a reinforcement learning algorithm to solve the optimal
operational control problem of a DC microgrid system. The proposed algorithm works
without relying on prior information regarding the system’s dynamics, but rather it learns
the most optimal policy by interacting directly with the system. We tested two algorithms
in this paper, namely Q-learning and Q-network, and compared the performance with the
MIQP algorithm found in the literature as the baseline. The proposed algorithms are tested
in two environments, a simple and a complex one, and under three scenarios, namely
perfect information, modeling error, and measurement noise. In both the first
and second environments, the Q-learning (3.5867 and 74.2707) and Q-network (3.5867
and 74.4254) algorithms produce solutions very close to the optimal solution
of MIQP (3.5871 and 75.0489). These results are achieved by the RL methods without
knowledge of the system's dynamics, whereas the MIQP solution requires an accurate
system model. It is shown that, when an accurate system model does not exist, i.e., there
is error in the system's model, the Q-network (3.6554 and 72.3714) can outperform MIQP
(3.6293 and 72.1596), which requires an accurate system model, and Q-learning (3.3092 and
72.4335) also outperforms MIQP in the complex environment. Moreover, with the introduction
of measurement noise on top of the modeling error, the proposed Q-network algorithm is shown
to produce a better solution than MIQP. These preliminary results demonstrate that the
reinforcement learning algorithm is a strong candidate for solving the optimal operation control
of a DC microgrid system, especially when the system's dynamics model is not available.
By showing the capabilities of reinforcement-learning-based algorithms in solving
the model-free distributed droop switching problem of DC microgrids, this paper suggests
that other off-policy algorithms that do not require an exact system model might also be
able to solve the same problem. As a starting point for implementing reinforcement
learning in such environments, this paper also opens a challenge for newer
and more complex reinforcement learning algorithms that can be explored to achieve better
results. Other kinds of machine-learning algorithms that do not require an exact system
model and can learn from experience are also interesting topics to explore.

In the future, more complicated environments and scenarios, such as those consisting
of large numbers of sources or modes, and more complicated constraints can be
used, further exploring the limits of system complexity, mode choices, and uncertainties
due to modeling and measurement error that the Q-network can handle. We will also explore
the use of more complex neural network architectures to improve the performance of the
Q-network algorithm. Finally, employing real data from a DC microgrid system in the
training process will also be investigated.

Author Contributions: Conceptualization, R.I. and S.; methodology, R.I., H.R.A., L.M.P. and E.F.;
software, A.A.A.R. and M.Y.; validation, R.I., L.M.P. and S.; formal analysis, R.I.; investigation, R.I.,
A.A.A.R. and M.Y.; resources, R.I. and S.; data curation, A.A.A.R. and M.Y.; writing—original draft
preparation, A.A.A.R. and M.Y.; writing—review and editing, R.I., A.A.A.R. and M.Y.; visualization,
A.A.A.R. and M.Y.; supervision, S.; project administration, L.M.P.; funding acquisition, S. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was supported by the Indonesian Ministry of Research and Technology/National
Agency for Research and Innovation and the Indonesian Ministry of Education and Culture, under the
World Class University Program managed by Institut Teknologi Bandung.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

AC alternating current
ACMG AC microgrids
ADP adaptive dynamic programming
BESS battery energy storage system
CPS constant power source
CPL constant power load
DC direct current
DCMG DC microgrids
DOD depth of discharge
DVS droop-controlled voltage source
ESS energy storage system
GA genetic algorithm
HVAC high voltage AC
KCL Kirchhoff's current law
KVL Kirchhoff's voltage law
LVDCMG low-voltage DCMG
MDP Markov decision process
MILP mixed-integer linear programming
MIQP mixed-integer quadratic programming
MVAC medium voltage AC
PMU phasor measurement unit
PSO particle swarm optimization
PV photo-voltaic
RL reinforcement learning
SC supervisory control
SOC state of charge
TS tabu search
VVC Volt/Var control

References
1. Blaabjerg, F.; Teodorescu, R.; Liserre, M.; Timbus, A.V. Overview of control and grid synchronization for distributed power
generation systems. IEEE Trans. Ind. Electron. 2006, 53, 1398–1409. [CrossRef]
2. Carrasco, J.M.; Franquelo, L.G.; Bialasiewicz, J.T.; Galván, E.; PortilloGuisado, R.C.; Prats, M.M.; León, J.I.; Moreno-Alfonso,
N. Power-electronic systems for the grid integration of renewable energy sources: A survey. IEEE Trans. Ind. Electron. 2006,
53, 1002–1016. [CrossRef]
3. Hatziargyriou, N.; Asano, H.; Iravani, R.; Marnay, C. Microgrids. IEEE Power Energy Mag. 2007, 5, 78–94. [CrossRef]
4. Guerrero, J.M.; Vasquez, J.C.; Matas, J.; de Vicuna, L.G.; Castilla, M. Hierarchical Control of Droop-Controlled AC and DC
Microgrids—A General Approach Toward Standardization. IEEE Trans. Ind. Electron. 2011, 58, 158–172. [CrossRef]
5. Hou, N.; Li, Y. Communication-Free Power Management Strategy for the Multiple DAB-Based Energy Storage System in Islanded
DC Microgrid. IEEE Trans. Power Electron. 2021, 36, 4828–4838. [CrossRef]
6. Irnawan, R.; da Silva, F.F.; Bak, C.L.; Lindefelt, A.M.; Alefragkis, A. A droop line tracking control for multi-terminal VSC-HVDC
transmission system. Electr. Power Syst. Res. 2020, 179, 106055. [CrossRef]
7. Peyghami, S.; Mokhtari, H.; Blaabjerg, F. Chapter 3—Hierarchical Power Sharing Control in DC Microgrids. In Microgrid;
Mahmoud, M.S., Ed.; Butterworth-Heinemann: Oxford, UK, 2017; pp. 63–100. [CrossRef]
8. Shuai, Z.; Fang, J.; Ning, F.; Shen, Z.J. Hierarchical structure and bus voltage control of DC microgrid. Renew. Sustain. Energy Rev.
2018, 82, 3670–3682. [CrossRef]
9. Abhishek, A.; Ranjan, A.; Devassy, S.; Kumar Verma, B.; Ram, S.K.; Dhakar, A.K. Review of hierarchical control strategies for DC
microgrid. IET Renew. Power Gener. 2020, 14, 1631–1640. [CrossRef]
10. Chouhan, S.; Tiwari, D.; Inan, H.; Khushalani-Solanki, S.; Feliachi, A. DER optimization to determine optimum BESS
charge/discharge schedule using Linear Programming. In Proceedings of the 2016 IEEE Power and Energy Society Gen-
eral Meeting (PESGM), Boston, MA, USA, 17–21 July 2016; pp. 1–5. [CrossRef]
11. Maulik, A.; Das, D. Optimal operation of a droop-controlled DCMG with generation and load uncertainties. IET Gener. Transm.
Distrib. 2018, 12, 2905–2917. [CrossRef]
12. Dragičević, T.; Guerrero, J.M.; Vasquez, J.C.; Škrlec, D. Supervisory Control of an Adaptive-Droop Regulated DC Microgrid With
Battery Management Capability. IEEE Trans. Power Electron. 2014, 29, 695–706. [CrossRef]
13. Massenio, P.R.; Naso, D.; Lewis, F.L.; Davoudi, A. Assistive Power Buffer Control via Adaptive Dynamic Programming. IEEE
Trans. Energy Convers. 2020, 35, 1534–1546. [CrossRef]
14. Massenio, P.R.; Naso, D.; Lewis, F.L.; Davoudi, A. Data-Driven Sparsity-Promoting Optimal Control of Power Buffers in DC
Microgrids. IEEE Trans. Energy Convers. 2021, 36, 1919–1930. [CrossRef]
15. Ma, W.J.; Wang, J.; Lu, X.; Gupta, V. Optimal Operation Mode Selection for a DC Microgrid. IEEE Trans. Smart Grid 2016,
7, 2624–2632. [CrossRef]
16. Anand, S.; Fernandes, B.G. Reduced-Order Model and Stability Analysis of Low-Voltage DC Microgrid. IEEE Trans. Ind. Electron.
2013, 60, 5040–5049. [CrossRef]
17. Alizadeh, G.A.; Rahimi, T.; Babayi Nozadian, M.H.; Padmanaban, S.; Leonowicz, Z. Improving Microgrid Frequency Regulation
Based on the Virtual Inertia Concept while Considering Communication System Delay. Energies 2019, 12, 2016. [CrossRef]
18. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 2005, 16, 285–286. [CrossRef]
19. Glavic, M. (Deep) Reinforcement learning for electric power system control and related problems: A short review and perspectives.
Annu. Rev. Control 2019, 48, 22–35. [CrossRef]
20. Wang, W.; Yu, N.; Gao, Y.; Shi, J. Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution
systems. IEEE Trans. Smart Grid 2019, 11, 3008–3018. [CrossRef]
21. Hadidi, R.; Jeyasurya, B. Reinforcement learning based real-time wide-area stabilizing control agents to enhance power system
stability. IEEE Trans. Smart Grid 2013, 4, 489–497. [CrossRef]
22. Yan, Z.; Xu, Y. A Multi-Agent Deep Reinforcement Learning Method for Cooperative Load Frequency Control of a Multi-Area
Power System. IEEE Trans. Power Syst. 2020, 35, 4599–4608. [CrossRef]
23. Bellman, R.E.; Dreyfus, S.E. CHAPTER XI. Markovian Decision Processes. In Applied Dynamic Programming; Princeton University
Press: Princeton, NJ, USA, 31 December 1962; pp. 297–321. [CrossRef]
24. Goldwaser, A.; Thielscher, M. Deep Reinforcement Learning for General Game Playing. Proc. AAAI Conf. Artif. Intell. 2020, 34,
1701–1708. [CrossRef]
25. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to Train Your Robot with Deep Reinforcement Learning;
Lessons We’ve Learned. arXiv 2021, arXiv:2102.02915.
26. Graesser, L.; Keng, W. Foundations of Deep Reinforcement Learning: Theory and Practice in Python; Addison-Wesley Data & Analytics
Series; Pearson Education: London, UK, 2019.
27. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Adaptive Computation and Machine Learning Series; MIT Press:
Cambridge, MA, USA, 2016.
28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.;
Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef] [PubMed]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
