
IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 33, NO. 5, SEPTEMBER 2018

Reinforcement Learning Approach for Optimal Distributed Energy Management in a Microgrid

Elham Foruzan, Graduate Student Member, IEEE, Leen-Kiat Soh, Member, IEEE, and Sohrab Asgarpoor, Senior Member, IEEE

Abstract—In this paper, a multiagent-based model is used to study distributed energy management in a microgrid (MG). The suppliers and consumers of electricity are modeled as autonomous agents, capable of making local decisions in order to maximize their own profit in a multiagent environment. For every supplier, a lack of information about customers and other suppliers creates challenges to optimal decision making in order to maximize its return. Similarly, customers face difficulty in scheduling their energy consumption without any information about suppliers and electricity prices. Additionally, there are several uncertainties involved in the nature of MGs due to variability in renewable generation output power and continuous fluctuation of customers' consumption. In order to prevail over these challenges, a reinforcement learning algorithm was developed to allow generation resources, distributed storages, and customers to develop optimal strategies for energy management and load scheduling without prior information about each other or the MG system. Case studies are provided to show how the overall performance of all entities converges, as an emergent behavior, to the Nash equilibrium, benefiting all agents.

Index Terms—Microgrid, reinforcement learning, distributed control, renewable generation.

Manuscript received October 13, 2017; revised February 6, 2018; accepted March 28, 2018. Date of publication April 5, 2018; date of current version August 22, 2018. Paper no. TPWRS-01548-2017. (Corresponding author: Elham Foruzan.)
E. Foruzan is with the Department of Electrical and Computer Engineering, and Computer Science and Engineering, University of Nebraska—Lincoln, Lincoln, NE 68588 USA (e-mail: elham.foruzan@huskers.unl.edu).
L.-K. Soh is with the Department of Computer Science and Engineering, University of Nebraska—Lincoln, Lincoln, NE 68588 USA (e-mail: lksoh@cse.unl.edu).
S. Asgarpoor is with the Department of Electrical and Computer Engineering, University of Nebraska—Lincoln, Lincoln, NE 68588 USA (e-mail: sasgarpoor1@unl.edu).
Digital Object Identifier 10.1109/TPWRS.2018.2823641

I. INTRODUCTION

HIGH penetration of distributed energy resources (DERs), small-scale renewable and non-renewable generators, and distributed energy storage into power systems has revolutionized these systems from traditionally passive to active systems in which small-scale generation resources are located in the vicinity of the customers' load [1]. In this regard, effective coordinated operation of these small-scale DERs and loads inside a power system is achievable through microgrid (MG) systems [2]. An MG is a low-voltage, autonomous, small-scale power grid that interconnects loads and DERs. It has defined electrical boundaries and acts as a single controllable entity with respect to the rest of a power system (or maingrid). Microgrids also offer a range of technical benefits, including improved system reliability and local energy delivery [3]. MGs also have economic benefits, as they can attract additional sources of capital investment for small-scale DERs, which in turn leads to rapid extension of MGs. Consequently, they reduce the customers' dependency on maingrid electricity, which carries high production and transmission investment.

Notwithstanding the benefits of MGs, the operation of multiple small-scale DERs introduces considerable complexity into a system, which may not be addressed using existing traditional controller mechanisms. Specifically, DERs and customers may have different owners, and therefore they tend to adjust their operations to maximize their own profits without considering other DERs' operations and the overall system performance. Thus, the actions or decisions of these individual competing entities may violate the robustness of an MG, for example in terms of maintaining the load-generation balance, or cause inefficient use of DER resources, which may lead to excessive power transfer from the maingrid [3]. Additionally, renewable DERs have variable output power, so serving customers' loads with these resources without a controller may cause system instability [4]. Systematic energy and load management is therefore critical to provide a market model that enables competitive participation of DERs and customers within an MG, to optimize the entities' utility while reducing the MG's dependency on the maingrid, and also to mitigate the variability of the system so as to match the often-unpredictable energy supply provided by renewable resources.

Recently, a promising approach based on distributed energy and load management has emerged that can be achieved using local monitoring and information exchange in MGs [5], [6]. Indeed, such solutions have gained popularity, especially in the form of multiagent systems (MAS), for energy management and load scheduling in MGs [7]–[11]. MAS-based solutions, as a distributed control approach, provide an effective solution for MGs with different DER owners and customers, since all entities can be modeled as autonomous agents with a certain degree of intelligence and the capability to improve their performance.

MAS-based solutions have been used to simulate and analyze the emergent behavior of systems as a result of autonomous decision makers called agents. Several studies have presented comprehensive market models based on MAS to control energy dispatch in MGs [12]–[15]. In [12], [13], customer and DER agents exchange energy and maximize their utility in a market via many-to-many negotiations on electricity price and amount between m buyers and n sellers at every time frame, requiring a long computational time of O(mn) for a real-time market.


On the other hand, for many-to-many negotiation, auction mechanisms seem more intuitively appropriate [16], as all agents submit their bids and electricity amounts to the market and the auctioneer clears the market based on the auction rules. Indeed, auction-based markets have proven to be a suitable mechanism for MGs to coordinate renewable and conventional generation resources and to match supply and demand [14], [15]. Likewise, in a competitive auction market, finding an optimal strategy for agents to improve their utility is essential.

Reinforcement learning (RL) has been shown to achieve excellent performance in optimizing agents' utilities in MAS-based modeling of power systems. Indeed, several studies have used RL to improve agents' decision making and increase their expected utility in an energy market [17]–[19]. For example, in [19] Q-learning, an RL algorithm, was used to learn a cost-effective day-ahead consumption plan for charging a fleet of electric vehicles. The authors in [17] used Q-learning to optimize the power suppliers' profit function in an auction-based framework. Other researchers have used RL to improve agent decision making in an MG system [20]–[22]. For example, strategic bidding for load shedding in an MG using RL is discussed in [20]. In [21], RL was used to minimize customer agents' energy costs through strategic battery scheduling. In [22], the authors used RL for both customers and a service provider to minimize their expected costs in an MG. In the latter work, however, the authors did not consider direct energy management for DERs, especially those having stochastic energy output, as they assumed that the MG buys power through the power service provider. In most of the reported advances in MAS-based MGs, there is a lack of modeling of both DERs and customers as individual agents in an MG. Such modeling is critical to understand the interactive relationships between the supply and demand sides, with their competing needs and perhaps conflicting goals, in order to adapt to others in the environment and decrease the MG's dependency on maingrid electricity, in addition to focusing on single-entity optimal strategies. Additionally, most of these studies fail to address the stochastic behavior of both renewable DERs and customers' loads in an MG, while it is more realistic to consider all agents' variability in models that aim to maintain the load-generation balance.

Therefore, we present a MAS-based distributed energy and load management approach for DERs, including battery storage resources, and customers by adopting reinforcement learning in an MG energy market. In this approach, each agent, on either the supply or the demand side, is designed to optimize its utility in an hourly MG auction-based market via reinforcement learning, allowing it to adapt its behavior to the other agents present in a competitive and stochastic MG energy market. The reinforcement learning capability, based on the model-free Q-learning algorithm [23], allows agents to find the optimal policy to maximize their utility without direct communication with other entities in the system. To address the challenge of including agents' variability, we used random models to capture the variability of renewable DERs and customers' consumption in our model.

In terms of contributions, we developed an RL-based energy and load management model which can be executed in a distributed manner at each DER and customer. Our proposed model provides a framework for MG energy and load management that takes into account the variability of the stochastic entities in the MG. In this framework, both customers and suppliers are rational autonomous agents that can adapt to each other's behaviors. We further propose a set of performance metrics to evaluate the effectiveness of the distributed RL method for DERs, customers, and the overall MG system, including the MG's dependency on the maingrid for supplying its local customers. We examine several configurations to investigate the operation of the proposed model and validate its effectiveness for all participating agents in the system.

II. ENVIRONMENT AND MARKET DESIGN

In this section, first a multiagent environment is described, which includes a set of physical entities: DERs, customers, and the maingrid. We couple each DER as well as each customer with the notion of agent-hood, such that both DERs and customers have full control over their own decisions on power generation/consumption and profits/costs. Then, the auction design, which matches supply and demand, is presented.

A. Environment Design

The MG contains N_G agents comprising energy buyers, sellers, an auctioneer, and the maingrid. Each customer is an energy buyer and decides on the amount of electricity it needs to purchase. The renewable resources (PVs and wind generators) and diesel generators are energy sellers that provide their bids and amounts of electricity to the energy market. A storage system can be either a buyer or a seller; at each time slot, each storage system can only assume one role. Furthermore, we also assume an auction-based market design for our environment in which there is an auctioneer agent. The auctioneer agent is an independent entity without any physical property. Its duty is to control the auction process between the buyers and sellers according to the auction rules and to clear the market for each time period in order to define the dispatch points of the resources and the market clearing price (PC). At the same time, because of the variable energy outputs of the renewable resources and the variable load demands, the MG often relies on the maingrid agent to absorb any excess power and subsidize any deficits, resulting in (1):

P_{customer} = P_{wind} + P_{diesel} + P_{storage} + P_{PV} + P_{grid}    (1)

where P_{customer} is all customers' energy consumption and P_{wind}, P_{diesel}, P_{storage}, P_{PV}, and P_{grid} are the wind, diesel, storage, PV, and maingrid energy generation, respectively. The maingrid buys and sells electricity at different rates, P^t_{main,b} and P^t_{main,s}, respectively; therefore, we represent the tuple P^t_{main} = (P^t_{main,s}, P^t_{main,b}) as the maingrid price at time t. The maingrid's price also varies during the day: there are three different price rates, depending on the time of use (TOU), over four time periods. The maingrid buys at 5–15 ¢/kWh and sells from three uniform distributions, 13–16, 23–26, and 40–42 ¢/kWh, for the time intervals 11 p.m. to 8 a.m.; 8 a.m. to 12 p.m. and 6 p.m. to 11 p.m.; and 12 p.m. to 6 p.m., respectively [24].
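To make the time-of-use tariff concrete, the sketch below, written in Java (consistent with the Repast Simphony/Java implementation mentioned in Section V), draws a maingrid price tuple (P^t_{main,s}, P^t_{main,b}) for a given hour from the uniform bands and intervals quoted above. The class and method names are illustrative assumptions, not taken from the authors' code.

```java
import java.util.Random;

/** Samples hourly maingrid buy/sell prices from the TOU bands of Section II-A (illustrative sketch). */
public class MaingridPriceModel {
    private final Random rng = new Random();

    /** Returns {sellPrice, buyPrice} in cents/kWh for the given hour of day (0-23). */
    public double[] samplePrice(int hour) {
        double buy = uniform(5.0, 15.0);              // maingrid buys from the MG at 5-15 cents/kWh
        double sell;
        if (hour >= 23 || hour < 8) {                 // off-peak: 11 p.m. to 8 a.m.
            sell = uniform(13.0, 16.0);
        } else if (hour >= 12 && hour < 18) {         // on-peak: 12 p.m. to 6 p.m.
            sell = uniform(40.0, 42.0);
        } else {                                      // mid-peak: 8 a.m.-12 p.m. and 6 p.m.-11 p.m.
            sell = uniform(23.0, 26.0);
        }
        return new double[] { sell, buy };
    }

    private double uniform(double lo, double hi) {
        return lo + (hi - lo) * rng.nextDouble();
    }

    public static void main(String[] args) {
        MaingridPriceModel m = new MaingridPriceModel();
        for (int h = 0; h < 24; h++) {
            double[] p = m.samplePrice(h);
            System.out.printf("hour %2d: sell=%.1f buy=%.1f cents/kWh%n", h, p[0], p[1]);
        }
    }
}
```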


The desired emergent behavior of the MAS is for the MG's energy resources to supply electricity that meets the demands of the customers with as little dependency on maingrid energy as possible, while optimizing the expected rewards of all agents.

B. Market Design
Our MG market is auction-based, and it consists of an auctioneer agent, customers (or buyer agents), and energy resources (or seller agents). Every seller agent submits its amount of electricity and the corresponding bid at each time interval t. Buyer agents likewise submit their energy demand to the auctioneer agent. The auctioneer clears the market by finding the intersection of the demand curve from the loads and the ascending supply curve from the generation resources in a uniform-price auction. If a seller's submitted energy is not accepted in the local MG auction, the seller can sell that energy directly to the maingrid at the maingrid's buying rate.
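As an illustration of the clearing rule just described, the following sketch sorts seller offers by ascending bid and accepts them until the submitted demand is covered, with the bid of the marginal accepted offer setting the uniform clearing price PC. This is a minimal interpretation of the paper's auction; the class names, tie handling, and treatment of unserved demand (which falls to the maingrid) are our own simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal uniform-price auction clearing sketch (names and tie handling are assumptions). */
public class UniformPriceAuction {

    public static class Offer {
        final String seller;
        final double price;   // bid in cents/kWh
        final double energy;  // offered energy in kWh
        public Offer(String seller, double price, double energy) {
            this.seller = seller; this.price = price; this.energy = energy;
        }
    }

    /** Returns the market clearing price PC, or -1 if no offer was accepted. */
    public static double clear(List<Offer> offers, double totalDemand) {
        offers.sort(Comparator.comparingDouble(o -> o.price));   // ascending supply curve
        double served = 0.0;
        double clearingPrice = -1.0;
        for (Offer o : offers) {
            if (served >= totalDemand) break;                    // demand already covered
            served += o.energy;                                  // accept this offer (possibly partially)
            clearingPrice = o.price;                             // marginal accepted bid sets PC
        }
        return clearingPrice;
    }

    public static void main(String[] args) {
        List<Offer> offers = new ArrayList<>();
        offers.add(new Offer("wind", 9.0, 40));
        offers.add(new Offer("diesel", 22.0, 60));
        offers.add(new Offer("storage", 15.0, 20));
        System.out.println("PC = " + clear(offers, 70) + " cents/kWh");
    }
}
```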
Each agent i aims to maximize its profit by selling energy to the MG, E^*. Based on the market rules, agent i will also sell its surplus energy E to the maingrid to salvage cost. For each agent i in the system, the agent's profit from selling electricity is calculated based on (2):

Profit_i = Profit_i^{MG}(E^*) + Profit_i^{maingrid}(E) - Cost(E^* + E)    (2)

where Profit_i^{MG}(E^*) is the profit from selling the amount of energy E^* in the MG's market and Profit_i^{maingrid}(E) is the profit from selling the amount of energy E to the maingrid. The function Cost(.) calculates the operational cost of the DER agents. Each DER's cost function is considered to be a quadratic function of its operational point with constant coefficients.
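As a numerical illustration of (2), the snippet below evaluates a DER agent's profit for given sold quantities E^* and E, assuming a quadratic operational cost aP^2 + bP + c as stated above. The coefficient values and the helper names are made up for the example and are not the authors' parameters.

```java
/** Illustrative evaluation of the profit function in (2) with a quadratic cost (coefficients are examples only). */
public class DerProfit {

    /** Quadratic operational cost of producing p kWh: a*p^2 + b*p + c. */
    static double cost(double p, double a, double b, double c) {
        return a * p * p + b * p + c;
    }

    /**
     * Profit_i = PC * Estar + mainBuyRate * E - Cost(Estar + E), where Estar is the energy sold in the
     * MG market at clearing price PC and E is the surplus energy sold to the maingrid at its buying rate.
     */
    static double profit(double clearingPrice, double estar, double mainBuyRate, double e,
                         double a, double b, double c) {
        return clearingPrice * estar + mainBuyRate * e - cost(estar + e, a, b, c);
    }

    public static void main(String[] args) {
        // Example: 50 kWh cleared in the MG at 22 cents/kWh, 10 kWh sold to the maingrid at 8 cents/kWh.
        double p = profit(22.0, 50.0, 8.0, 10.0, 0.01, 5.0, 20.0);
        System.out.printf("Profit = %.1f cents%n", p);
    }
}
```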
III. MODEL DESIGN

In the MG, an agent's utility depends not only on the actions it takes but also on the actions of the other agents. Likewise, agents do not have stationary, deterministic policies, due to the stochastic environment and their ability to change their policies over time to improve their expected profit. Therefore, a Markov game is used to represent the non-stationary interactions of the agents inside the MG [25]. Then, RL, a model-free algorithm, is discussed to solve the distributed optimization for each agent so that each agent adapts to the others [26], [27]. Next, random models are discussed to represent the stochastic behavior of the agents in the MG. Thus, agents account for their variability in their autonomous decisions and converge, using RL, to a Nash equilibrium, in which each agent's best response is a response to every other agent's best response, resulting in a stable solution. Finally, the proposed system evaluation indices for analyzing the distributed energy management are presented. A summary of the proposed algorithm is given in Table I.

TABLE I: PROPOSED ENERGY MANAGEMENT

A. Markov Game

The Markov game has been shown to be effective in modeling stochastic multiagent systems with a non-cooperative nature [23], [28]. Here, a Markov game is used to model the stochastic and adaptive behavior of all agents in a sequence of finite iterations. At each iteration, an agent senses the current state and decides on an action. After that, it receives an immediate reward that depends on the current state, the chosen action, and the resulting next state. The transition to the next state is stochastic. Mathematically, a Markov game is defined as a tuple (n, S, A, r, P), where n is the set of agents; S is a finite state space; A = A_1 × ... × A_n is the joint action space of the n players; r is the immediate reward; and P : S × A × S → [0, 1] is a transition function over the set S [26]. In a stochastic game, the objective of each agent is to maximize the discounted sum of rewards with a discount factor β ∈ [0, 1). Let π_i be the strategy of player i; player i tries to maximize its value at each state s ∈ S, which is defined by V_i(s):

V_i(s, π_1, π_2, ..., π_n) = \sum_{t=0}^{\infty} β^t r_i(π_1, π_2, ..., π_n)    (3)

Considering that the n agents present in the game can adapt to each other, the Nash equilibrium is often taken as a joint policy such that no self-interested agent can improve its expected discounted reward by deviating to a different strategy [25], [28]. The Nash equilibrium in a stochastic game with n players is defined as a joint policy π = {π_i}_{i=1,..,n} in which, for every agent i, the policy π_i is the best response to the others'. In other words, if all agents except agent i play the joint policy π_{−i}, agent i gets the maximum possible reward by playing policy π_i [29].

B. Agent Learning

Because of the stochastic environment and the actors adapting to each other, Q-learning, an RL algorithm, is used to power each agent's reasoning to discover its optimal policy with respect to the other agents' actions. Q-learning is essentially built on a Markov decision process and follows directly from the properties of a Markov game [30].


Considering that S is a discrete set of environment states and A is a discrete set of actions, an agent experiences every state s ∈ S and possible sequences of actions a ∈ A through the finite learning horizon. After performing an action a at state s, the agent makes a transition to a new state s' ∈ S. The agent observes the immediate reward r_t for performing action a in state s and also observes the new state s', then updates the value of state s. Given the information ⟨s, a, s', r_t⟩, the updating rule of Q-learning is:

Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [ r_t + β V^*(s') ]    (4)

where α ∈ [0, 1) is the learning rate, β ∈ [0, 1) is the discount factor, and V^*(s') is the optimal value of state s'. The optimal value of a state at each iteration is obtained by computing the maximum value that the agent could obtain by carrying out an action in that state [23].
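A compact sketch of the tabular update in (4) is given below, with states and actions indexed by integers. The ε-greedy exploration rule is our own stand-in, since the paper does not specify the exploration scheme, and the state/action encodings are illustrative rather than the authors' implementation.

```java
import java.util.Random;

/** Tabular Q-learning agent implementing the update rule in (4); encodings and exploration are illustrative. */
public class QLearningAgent {
    private final double[][] q;        // Q(s, a)
    private final double alpha;        // learning rate in [0, 1)
    private final double beta;         // discount factor in [0, 1)
    private final double epsilon;      // exploration probability (assumption, not from the paper)
    private final Random rng = new Random();

    public QLearningAgent(int numStates, int numActions, double alpha, double beta, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.beta = beta;
        this.epsilon = epsilon;
    }

    /** epsilon-greedy action selection over the current Q-values. */
    public int selectAction(int state) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(q[state].length);
        }
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) best = a;
        }
        return best;
    }

    /** Q_{t+1}(s,a) = (1 - alpha) Q_t(s,a) + alpha * (r + beta * V*(s')), with V*(s') = max_a' Q_t(s',a'). */
    public void update(int s, int a, double reward, int sNext) {
        double vStar = q[sNext][0];
        for (int a2 = 1; a2 < q[sNext].length; a2++) {
            vStar = Math.max(vStar, q[sNext][a2]);
        }
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (reward + beta * vStar);
    }

    public double value(int s, int a) { return q[s][a]; }
}
```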
Considering the Markov game tuple (n, S, A, r, P), we now illustrate how the model is mapped onto the MG problem. In the system, there are five types of agents, namely: wind resources, photovoltaic (PV) or solar resources, diesel generators, battery storage, and customers. Thus there are five types of self-interested agents capable of learning, i.e., of updating the value of each state based on their actions and the immediate rewards received from the environment. The discrete and finite set of actions A = {a_1, a_2, ..., a_A} for the distributed energy resource (DER) agents (wind, PV, diesel, and battery) includes their bid and the amount of power they want to produce, while the customers' actions are to decide on their consumption in each state. The immediate reward function r is defined as a function of the profit/expenses received from the market. Also, for each agent type except battery storage, the state s ∈ S at time slot t is a tuple of the current time slot t and the maingrid price, where different values for the maingrid price are selected randomly from the maingrid's discrete price distribution during the learning process. The state of battery storage has an extra element, which represents the battery storage's state of charge (SOC). Additionally, the well-known Q-learning approach does not require prior information about the transition function P. The DERs do not know the information of other DERs. However, during the learning procedure, each DER updates its strategy based on the performance of different actions in various states without explicit modeling of the environment. The interactions of the agents and the environment during the learning process lead the system to asymptotically converge to the Nash equilibrium. A detailed summary of all agents' actions, states, and rewards is given in Table II.

TABLE II: AGENTS' STATES, ACTIONS, AND IMMEDIATE REWARDS. PC^t, E^{*t}, AND E^t ARE THE MARKET CLEARING PRICE, THE ENERGY SOLD TO THE MG MARKET, AND THE ENERGY SOLD TO THE MAINGRID AT TIME SLOT t, RESPECTIVELY.

C. Random Models (RMs)

To represent the stochastic characteristics of the customers' consumption and the renewable generators' output power, random models are used. In this paper, random values for customer consumption are generated based on actual data corresponding to the average values of the feature under study. In other words, the random samples of a customer's load consumption during a day are modeled using an exponentially distributed random variable whose mean follows the daily profile given in [31]. Additionally, the output power of the wind and PV resources is sampled from Weibull and Beta probability distribution functions, respectively, based on [32]. For each renewable generation resource, each simulation time frame was divided into four subintervals, and several samples from the respective probability distribution are taken for each subinterval. The average value of these samples is used to realize the generated power for each subinterval of that time frame. Each agent offers the lowest value of the four subintervals to the market, to make sure that it can track its offer.
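The sketch below shows one way this sampling could be realized without an external statistics library: an exponential load sample around an hourly mean profile and, for a renewable unit, several Weibull draws per subinterval whose averages are formed, with the smallest subinterval average offered to the market as described. All parameter values are placeholders, not the paper's, and the Beta draw for PV is omitted to keep the sketch dependency-free.

```java
import java.util.Random;

/** Illustrative sampling of the random models in Section III-C; distribution parameters are placeholders. */
public class RandomModels {
    private final Random rng = new Random();

    /** Exponential load sample with the given hourly mean (inverse-transform sampling). */
    public double sampleLoad(double hourlyMeanKw) {
        return -hourlyMeanKw * Math.log(1.0 - rng.nextDouble());
    }

    /** Weibull wind-power sample with shape k and scale lambda (inverse-transform sampling). */
    public double sampleWind(double k, double lambda) {
        return lambda * Math.pow(-Math.log(1.0 - rng.nextDouble()), 1.0 / k);
    }

    /**
     * Offer for one hour: the hour is split into four subintervals, several samples are averaged per
     * subinterval, and the lowest subinterval average is offered so that the agent can track its offer.
     * (A Beta draw for PV would be used analogously.)
     */
    public double hourlyWindOffer(double k, double lambda, int samplesPerSubinterval) {
        double offer = Double.MAX_VALUE;
        for (int sub = 0; sub < 4; sub++) {
            double sum = 0.0;
            for (int i = 0; i < samplesPerSubinterval; i++) {
                sum += sampleWind(k, lambda);
            }
            offer = Math.min(offer, sum / samplesPerSubinterval);
        }
        return offer;
    }

    public static void main(String[] args) {
        RandomModels rm = new RandomModels();
        System.out.printf("load  = %.2f kW%n", rm.sampleLoad(3.5));
        System.out.printf("offer = %.2f kW%n", rm.hourlyWindOffer(2.0, 8.0, 5));
    }
}
```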
D. Agent Design

Our proposed energy and load management framework consists of five types of agents, namely: wind resources, photovoltaic (solar) resources, diesel generators, battery storages, and customers. Each agent type behaves as a self-interested learning agent that updates the value of the possible actions in each state based on the rewards received from the environment. Details of the components in the MG are described below.


1) Customer Agent: Each customer i has a set of aggregated loads, AD_i, which is defined as the total amount of electricity that customer i should use for its appliances. Also, in our environment, a customer can practice load management, which means curtailing some amount of consumption when the electricity price is high in order to reduce cost [24]. The customer agent i tries to minimize its cost function during the day by choosing an energy function F_i^t ∈ F_i, where F_i is the set of energy functions for customer i. By choosing F_i^t, customer i satisfies F_i^t(AD_i^t) = El_i^t of its total aggregated demand at time slot t. Despite the fact that customer i obtains a profit by curtailing the flexible portion of its aggregated load, AD_i^t − El_i^t, it also loses some utility from not using this amount of its aggregated load. The function g_i(.) models the utility dissatisfaction for agent i. This function was set to g_i(AD_i^t − El_i^t) = K |AD_i^t − El_i^t|, where K is a dissatisfaction coefficient. Different customers may have different values of K; for example, customers who value electricity more should choose a higher value for the coefficient K.

The total expense of the customer agent i at time slot t is defined as:

Expense_i = PC^t · El_i^t + g_i(AD_i^t − El_i^t)    (5)
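To illustrate (5), the snippet below evaluates a customer's expense for a few candidate curtailment levels and picks the cheapest one. The candidate set and price values are illustrative assumptions, and in the paper this choice is made by the learning agent rather than by exhaustive evaluation.

```java
/** Evaluates the customer expense in (5) for candidate curtailment levels (illustrative values). */
public class CustomerExpense {

    /** Expense = PC^t * El + K * |AD - El|, where El is the satisfied portion of the aggregated demand AD. */
    static double expense(double clearingPrice, double el, double ad, double k) {
        return clearingPrice * el + k * Math.abs(ad - el);
    }

    public static void main(String[] args) {
        double pc = 40.0;    // clearing price in cents/kWh (on-peak example)
        double ad = 10.0;    // aggregated demand in kWh
        double k = 10.0;     // dissatisfaction coefficient, as in the case studies
        double[] curtailment = {0.0, 0.1, 0.2, 0.3};   // candidate curtailed fractions (assumption)

        double bestFraction = 0.0;
        double bestExpense = Double.MAX_VALUE;
        for (double c : curtailment) {
            double el = ad * (1.0 - c);                // satisfied demand after curtailment
            double e = expense(pc, el, ad, k);
            if (e < bestExpense) { bestExpense = e; bestFraction = c; }
            System.out.printf("curtail %.0f%% -> expense %.1f cents%n", 100 * c, e);
        }
        System.out.printf("cheapest: curtail %.0f%% (%.1f cents)%n", 100 * bestFraction, bestExpense);
    }
}
```

Note that with these example numbers, curtailing more is always cheaper because the clearing price exceeds the per-kWh dissatisfaction penalty K, which matches the intuition that curtailment pays off during high-price hours.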
2) Generation: Two renewable energy resources, i.e., wind and PV energy resources, are modeled in the environment [29]. Also, diesel generators are considered as non-renewable generation with deterministic output power.

3) Battery Storage Systems: A battery storage system can be charged and discharged by buying and selling electricity during operation, and it changes its energy state at each time slot. We denote the battery storage state of charge (SOC) at time t by SOC^t. The operational constraints of the storage agent require that the SOC remain bounded within limits:

SOC_{min} < SOC < SOC_{max}

where SOC_{min} and SOC_{max} are the minimum and maximum battery SOC and are set to 0.2 and 0.8 in our model. Let x_i^t denote the charging or discharging energy during time slot t for the ith storage unit; its status at time t + 1 is:

SOC^{t+1} = α SOC^t + x^t    (6)

A positive value of x^t indicates battery charging, and a negative value means battery discharging. For each battery storage, the net profit is calculated as the difference between its profit from selling electricity and its cost of buying electricity.
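A minimal sketch of the storage state update in (6) is given below, with the SOC bounds enforced at every step. The interpretation of α as a per-slot retention (self-discharge) coefficient and the clamping behavior when a bound is hit are our assumptions, not details given in the paper.

```java
/** Battery SOC bookkeeping following (6), with SOC kept inside [SOC_min, SOC_max]; clamping is an assumption. */
public class BatteryStorage {
    private static final double SOC_MIN = 0.2;
    private static final double SOC_MAX = 0.8;

    private final double alpha;     // coefficient in (6), interpreted here as self-discharge retention
    private double soc;             // state of charge, expressed as a fraction of capacity

    public BatteryStorage(double alpha, double initialSoc) {
        this.alpha = alpha;
        this.soc = initialSoc;
    }

    /**
     * Applies SOC^{t+1} = alpha * SOC^t + x^t, where x > 0 is charging (bought energy, as a fraction of
     * capacity) and x < 0 is discharging (sold energy). Returns the exchange actually applied after clamping.
     */
    public double step(double x) {
        double next = alpha * soc + x;
        double clamped = Math.max(SOC_MIN, Math.min(SOC_MAX, next));
        double applied = x + (clamped - next);      // reduce the exchange if a bound was hit
        soc = clamped;
        return applied;
    }

    public double soc() { return soc; }

    public static void main(String[] args) {
        BatteryStorage b = new BatteryStorage(0.98, 0.5);
        System.out.println("charge 0.2    -> applied " + b.step(0.2) + ", SOC " + b.soc());
        System.out.println("charge 0.3    -> applied " + b.step(0.3) + ", SOC " + b.soc());
        System.out.println("discharge 0.4 -> applied " + b.step(-0.4) + ", SOC " + b.soc());
    }
}
```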
IV. EVALUATION METRICS

Three metrics are designed to evaluate the desired emergent behavior of the proposed multiagent model. Among these three metrics, the first two are proposed to evaluate the performance of each agent and to estimate how the proposed method assists agents in optimizing their expected reward. The last metric evaluates the performance of the MG and how the agents' learning contributes to reducing the MG's dependency on the maingrid.

A. Average Energy Generation Profit of a DER (AEP)

This metric is calculated by dividing the daily energy profit of a DER i, Profit_i, by its daily generated power:

AEP_i = \frac{\sum_{t=1}^{24} Profit_i^t}{\sum_{t=1}^{24} (E_i^{*t} + E_i^t)}    (7)

B. Fairness Factor (FF)

This metric is calculated by dividing the average value of AEP over all DERs by the average value of the electricity cost of the customers (AEC) over all customers in the system. It indicates how well the designed system optimizes the expected rewards of both customers and DERs:

FF = \frac{\sum_{i=1}^{N} AEP_i / N}{\sum_{i=1}^{N_L} AEC_i / N_L}    (8)

where N and N_L are the numbers of DERs and customers, respectively. The value of AEC_i for customer i is calculated by dividing the daily energy expenses of the customer by its daily power consumption:

AEC_i = \frac{\sum_{t=1}^{24} Expense_i^t}{\sum_{t=1}^{24} AD_i^t}

The FF metric increases with an increasing average profit of the DERs, a decreasing average cost of the customers, or both. Therefore, a system that optimizes the expected rewards of both customers and DERs has the highest fairness factor and is fair to all agents.

C. Electricity Purchased from Maingrid (EPM)

This metric captures the daily energy that the MG purchases from the maingrid, normalized by the daily average MG load:

EPM = \frac{\sum_{t=1}^{24} (Profit − Cost)_{maingrid}^t}{\sum_{i=1}^{N_L} \sum_{t=1}^{24} AD_i^t / N_L}    (9)

where (Profit − Cost)_{maingrid}^t is the difference between the maingrid's profit from selling electricity to the MG and its cost of buying electricity from the MG. A lower value of EPM indicates that the maingrid makes a smaller profit, due to a smaller amount of electricity being transferred to the MG.
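The three indices can be computed directly from the hourly logs of the simulation. The helper below sketches that computation under the definitions (7)-(9); the array layouts and the way the maingrid's net profit is logged are our assumptions.

```java
/** Computes the AEP, FF, and EPM indices of Section IV from hourly logs (array layouts are assumptions). */
public class EvaluationMetrics {

    /** AEP_i: daily profit of DER i divided by its daily generated energy, per (7). */
    static double aep(double[] hourlyProfit, double[] hourlyEnergyMg, double[] hourlyEnergyMain) {
        double profit = 0, energy = 0;
        for (int t = 0; t < hourlyProfit.length; t++) {
            profit += hourlyProfit[t];
            energy += hourlyEnergyMg[t] + hourlyEnergyMain[t];
        }
        return profit / energy;
    }

    /** AEC_i: daily expenses of customer i divided by its daily consumption. */
    static double aec(double[] hourlyExpense, double[] hourlyDemand) {
        double expense = 0, demand = 0;
        for (int t = 0; t < hourlyExpense.length; t++) {
            expense += hourlyExpense[t];
            demand += hourlyDemand[t];
        }
        return expense / demand;
    }

    /** FF: mean AEP over DERs divided by mean AEC over customers, per (8). */
    static double fairnessFactor(double[] aepPerDer, double[] aecPerCustomer) {
        return mean(aepPerDer) / mean(aecPerCustomer);
    }

    /** EPM: daily maingrid net profit divided by the average daily customer load, per (9). */
    static double epm(double[] hourlyMaingridNetProfit, double[][] hourlyDemandPerCustomer) {
        double net = 0;
        for (double v : hourlyMaingridNetProfit) net += v;
        double totalDemand = 0;
        for (double[] customer : hourlyDemandPerCustomer) {
            for (double v : customer) totalDemand += v;
        }
        return net / (totalDemand / hourlyDemandPerCustomer.length);
    }

    private static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    public static void main(String[] args) {
        double[] aeps = { 12.0, 9.5, -1.0, 15.0 };   // toy AEP values for four DERs
        double[] aecs = { 6.0, 7.5 };                // toy AEC values for two customers
        System.out.printf("FF = %.2f%n", fairnessFactor(aeps, aecs));
    }
}
```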
V. RESULTS

In this section, first, a sensitivity analysis is performed to analyze the role of the learning parameters, i.e., the learning rate (α) and the discount rate (β). Then, the performance of the proposed learning-based solution approach is investigated for four different configurations of agents. The model was implemented in Repast Simphony, based on the Java programming language. In this set of experiments, one day consists of 24 time slots, each of which lasts one hour. Here, the value of the dissatisfaction coefficient, K, was set to 10 for all customers. Table III shows the parameters for our experiments.

TABLE III: INPUT PARAMETERS OF CASE STUDIES

A. Sensitivity Analysis with Respect to Learning Parameters


In the process of learning, the learning parameters, i.e., the learning rate and the discount rate, are internal variables of every agent that affect its optimal strategy. An agent uses its experience to improve its estimate, blending new information into its prior experience according to a learning rate. So, agents that are greedy about new experience and prefer to choose the maximum immediate reward right away have a high learning rate [17]. On the other hand, the discount factor controls how much effect future rewards have on the optimal decisions; thus, small values of the discount factor emphasize near-term gain, and larger values give significant weight to later rewards. Thus, agents choose a high discount rate if the expected future reward is valuable to them. To further study the variation in the performance of DER learning with different learning parameters, a sensitivity analysis is performed for all DERs.

Fig. 1. Storage average profit as a function of the learning parameters.

Fig. 1 shows the value of the storage average net profit in a three-dimensional coordinate system defined by the learning rate, the discount rate, and the storage average net profit. Such a representation of the storage average net profit for different values of α and β gives a comprehensive illustration of the preferred storage strategy. In this experiment, the learning parameters of the other agents were set to 0.5 and 0.5. As can be seen in Fig. 1, a high value of the discount rate and a low value of the learning rate yield the maximum profit for a storage agent. This is because storage agents gain from trading electricity, as opposed to producing it, so they should be able to buy electricity from the microgrid when the price is low and sell it back when the electricity price is high. Therefore, it is very important for the storage agents to have a forward-looking view. A high value of β and a low α suit this requirement by allowing them to explore more state-action pairs to find the optimal policy.

TABLE IV: INPUT PARAMETERS OF CASE STUDIES

The same study was performed for the diesel and renewable generation resources. The learning parameters that lead to the maximum average profit are presented in Table IV. Like the battery storage, the diesel agent attains its maximum profit with a high value of β and a low value of α. In contrast to the storage and diesel agents, the renewable resources prefer a lower value of β and a higher value of α. This result shows that the renewable generation resources follow a policy that maximizes their profit by choosing the maximum immediate reward. This is because renewable generators have an unpredictable energy supply with a negligible operational cost. Thus, they prefer to trade all of their available energy and maximize their average profit by following a greedy policy. The learning parameters of the agents have been set based on Table IV.

Fig. 2. Wind agent daily profit (upper) and generated energy (lower) in the four configurations.

B. Simulation Results and Analysis

In this section, four configurations are presented to show how RL affects the agents' behavior and their expected profits/expenses in the MG environment. These four configurations are: (1) a system without learning capability (Configuration NoL), (2) a system with learning capability for DERs only and fixed consumption for customers (Configuration GenL), (3) a system with learning capability for customers only (Configuration LoadL), and (4) a system with learning capability for all agents (Configuration AllL). Figs. 2 to 7 show the results of these four configurations after the agents have learned their best responses.

1) Configuration NoL: System Without Learning: In this configuration, the agents do not have learning capability. Figs. 2 to 7, Configuration NoL, illustrate the agents' actions in the MG without using their learning feature. As can be seen, the expected cost of the agents is quite varied and the system did not reach convergence.

Configuration GenL: System with Learning Capability for DERs and Fixed Consumptions: In this set of simulations, agents choose a value to bid in the market from a discrete set of 9 prices distributed uniformly between 7 and 45 ¢/kWh. Also, the discrete amount of electricity that diesel agents can choose has 4 values: their full capacity (100%), medium capacity (between 70% and 80% of their capacity), low capacity (between 30% and 40% of their capacity), and zero capacity. Therefore, a diesel agent's action set has 9 × 4 = 36 different actions. Battery storages can buy and sell electricity; they submit one of four values to sell energy or negative values to buy energy, so a battery agent's action set has 9 × 7 = 63 values. The renewable resources sell their available capacity; thus, each of their action sets has 9 values.
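The discrete action sets described above can be enumerated directly as (bid, quantity) pairs. The sketch below builds the 9-price grid and the diesel and battery quantity levels; the exact medium/low capacity fractions and the battery's seven exchange levels are assumptions chosen only to be consistent with the counts 9 × 4 = 36 and 9 × 7 = 63.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates discrete (bid, quantity) actions for DER agents; level choices are illustrative assumptions. */
public class ActionSets {

    /** Nine bid prices distributed uniformly between 7 and 45 cents/kWh. */
    static double[] bidPrices() {
        double[] bids = new double[9];
        for (int i = 0; i < 9; i++) {
            bids[i] = 7.0 + i * (45.0 - 7.0) / 8.0;
        }
        return bids;
    }

    /** Cartesian product of the bid grid and a set of quantity levels (fractions of capacity). */
    static List<double[]> actions(double[] quantityLevels) {
        List<double[]> actions = new ArrayList<>();
        for (double bid : bidPrices()) {
            for (double q : quantityLevels) {
                actions.add(new double[] { bid, q });
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        // Diesel: full, medium, low, and zero capacity -> 9 x 4 = 36 actions (0.75 and 0.35 are assumed picks).
        List<double[]> diesel = actions(new double[] { 1.0, 0.75, 0.35, 0.0 });
        // Battery: positive levels sell, negative levels buy -> 9 x 7 = 63 actions (levels are assumptions).
        List<double[]> battery = actions(new double[] { 1.0, 0.66, 0.33, 0.0, -0.33, -0.66, -1.0 });
        System.out.println("diesel actions:  " + diesel.size());
        System.out.println("battery actions: " + battery.size());
    }
}
```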


Fig. 3. Solar agent daily profit (upper) and generated energy (lower) in the four configurations.

Fig. 4. Diesel agent daily net profit (upper) and generated energy (lower) in the four configurations.

Fig. 5. Storage agent daily net profit (upper) and generated energy (lower) in the four configurations.

Fig. 6. Customer agent daily expenses (upper) and consumed electricity (lower) in the four configurations.

A total of four optimality equations of the form (4) were formulated, corresponding to the aforementioned four DER agent types. Based on the design of the microgrid system's agents, each DER agent updates its own Q-table through interactions (explorations) with the environment in the form of rewards, i.e., the selling and buying prices of electricity resulting in profits, and thus learns a best response strategy allowing the environment to reach a steady state, or Nash equilibrium. The decision made by every DER agent in the environment impacts the other agents' gains in the system. Yet, the DER agents eventually converge to a Nash equilibrium point in fewer than 9600 time slots (hours), or 400 "days," of learning. The total computational time on a PC with 16 GB of RAM was 20 s. After an agent has learned the optimal policy, it chooses its actions based on the optimal policy derived from its Q-table, as can be seen in Figs. 2 to 7, GenL.
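Putting the pieces together, the loop below sketches how a DER agent's Q-table could be driven toward its best response over repeated 24-hour days, in the spirit of the convergence behavior reported above. It reuses the illustrative QLearningAgent sketched in Section III-B; the market step is a toy placeholder, and nothing here reproduces the authors' Repast Simphony model.

```java
import java.util.Random;

/** Skeleton of the hourly learning loop; the market step is a placeholder, not the authors' simulation. */
public class LearningLoop {

    /** Stand-in for one auction round: returns the agent's immediate reward for (state, action). */
    interface MarketStep {
        double reward(int state, int action);
    }

    public static void run(QLearningAgent agent, MarketStep market, int days, int statesPerDay) {
        for (int day = 0; day < days; day++) {                 // e.g., 400 simulated days
            for (int hour = 0; hour < 24; hour++) {            // 24 one-hour time slots per day
                int state = hour % statesPerDay;               // simplistic state index (hour only)
                int action = agent.selectAction(state);
                double reward = market.reward(state, action);  // profit/expense from the cleared market
                int nextState = (hour + 1) % statesPerDay;
                agent.update(state, action, reward, nextState);
            }
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        QLearningAgent agent = new QLearningAgent(24, 9, 0.5, 0.5, 0.1);
        // Toy market: mid-day actions with higher bids happen to pay more, plus noise.
        MarketStep toyMarket = (s, a) -> (s >= 12 && s < 18 ? 2.0 : 1.0) * a + rng.nextGaussian();
        run(agent, toyMarket, 400, 24);
        System.out.printf("Q(12, 8) = %.2f after training%n", agent.value(12, 8));
    }
}
```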
The renewable resources, Figs. 2 and 3, need to decide on their offered hourly bids. If they offer bids higher than the hourly market clearing price, they lose out on selling energy in the MG market. Conversely, submitting a lower bid in the market, if it is accepted, leads to a lower profit. Therefore, they choose a policy that increases the expected profit in the long run. Fig. 5 shows the net profit and daily net sold electricity of the storage agent. As can be seen, the storage agent gains a higher expected net profit with a lower amount of electricity traded with the system. The daily expenses of a customer agent and its daily consumption are shown in Fig. 6. The customer agents did not make any decisions in this configuration, since they satisfy all of their loads. However, as a result of the strategic actions of the other agents in the multiagent environment, their expected expenses were reduced. Fig. 7 also verifies the performance of the designed agent system. The rational, optimal decisions of the agents in the presence of learning capability led to a Nash equilibrium in this configuration. Consequently, the electricity interactions between the maingrid and the MG decrease by 14%.

2) Configuration LoadL: System with Learning Capability for Customers: In this configuration, only the customers have the capability of learning an effective load scheduling policy, adapting to what they observe and receive as rewards from the environment over time. Fig. 7, Configuration LoadL, shows that the maingrid, on average, buys electricity (as indicated by the negative amount of transmitted power) as a result of the excess energy produced in the MG, but the maingrid still has a positive profit. This is because, on average, the maingrid makes more profit from selling electricity to the MG than it spends buying excess power from the MG. This advantage for the maingrid appears to be due to the DER agents not following any particular policy to maximize their profits, leading to random bids and losses in their transactions with the MG.


Fig. 7. Maingrid daily profit (upper) and transmitted electricity from the maingrid to the MG (lower) in the four configurations.

Fig. 8. The patterns of customers' load curtailment during 24 hours for different values of the dissatisfaction coefficient, K.

To better illustrate the effectiveness of learning for the customer agents, the daily curtailment percentages for six different dissatisfaction coefficient values, K, are plotted in Fig. 8. The customers select their load curtailment percentage from the range {0, 0.05, ..., 0.65, 0.70}. We can observe that as K increases, a customer agent decreases its curtailment, as K can be seen as a penalty factor. This creates a trade-off for customers between (1) saving money by forgoing some portion of their loads and (2) their dissatisfaction from having curtailed electricity. As the sequence of plots in Fig. 8 shows, customer agents learn to maintain high curtailment during on-peak periods and to reduce their curtailment during off-peak periods. These observations confirm that the RL mechanism enables customers to adapt and manage their individual consumption amounts when the utility from using electricity is less than the money saved by curtailing loads.

3) Configuration AllL: System with Learning Capability for All Agents: All of the agents in the system have the reinforcement learning capability in this configuration. Figs. 2 to 7, Configuration AllL, show the daily generated/consumed electricity and the electricity profits/expenses of the agents inside the MG. RL, as in Configuration GenL, enables the DERs to learn best response strategies and increase their expected profit. Additionally, as in Configuration LoadL, customers can schedule their consumption and curtail some portion of their flexible loads if it is preferable. Fig. 7 shows that the load reduction significantly reduces the maingrid profit, to 874 cents.

Fig. 9. Comparison of the average energy generation profit (AEP) for all DER agents in the four configurations.

C. Comparison of NoL, GenL, LoadL, and AllL

In this section, the four configurations NoL through AllL are evaluated using the performance metrics introduced earlier.

Fig. 9 shows the values of the Average Energy Generation Profit (AEP) for the four different types of DERs in the four configurations NoL, GenL, LoadL, and AllL. The AEP values are the highest overall in Configuration GenL for all DERs. The DERs also roughly maintain their high values in Configuration AllL, though lower than those in GenL. This is due to the impact of the load-curtailing customers in AllL, where the customer agents are able to learn and adapt to the decisions of the DERs to maximize their own profits. In Configurations NoL and LoadL, the DER agents do not learn to maximize their profits, which leads to relatively lower daily profits. The AEP values of the DER agents have their lowest values in Configuration LoadL because the DERs, without learning capabilities, are disadvantaged against customer agents that have learned their best response policies, resulting in reduced profits for the DERs.

We see from Fig. 9 that the AEP value for a battery storage is negative in Configurations NoL and LoadL. A negative value of AEP for storage indicates that, in these configurations, this agent is buying electricity at a higher price than what it sells back to the grid. This is due to the storage agent not taking into account future prices for buying and selling. On the other hand, in Configurations GenL and AllL, the storage agent learns to improve its profits by buying electricity when the electricity price is low and selling it back to the market when the electricity price is high, leading to much higher daily profits than those in the NoL and LoadL configurations. Therefore, the impact, in terms of average energy profit, that storage agents can gain from RL is much more substantial than for the other types of DER agents in the environment.


Fig. 10. Comparison of the fairness factor (FF) and electricity purchased from the maingrid (EPM) in the four configurations.

The values of FF for the four configurations are provided in Fig. 10. Configuration AllL has the highest FF value (∼1.81). In Configuration AllL, DER agents learn to effectively improve their profit while customers simultaneously learn to effectively reduce their average expenses, which leads to the maximum value of FF among all configurations.

Now, let us look at EPM. The EPM index has its lowest value in Configuration AllL (∼11.78). Two main factors combine to lower the maingrid profit. First, DERs can effectively offer their bids to the market and, therefore, instead of selling to the maingrid, DERs sell electricity to the customers. Second, customer demand is lowered such that the MG has a smaller electricity shortfall and does not need to import as much electricity from the maingrid. As a result, the MG ends up having fewer interactions with the maingrid and becomes more independent of it. In summary, Configuration AllL has the least dependency on the maingrid, which is an important emergent behavior stemming from our design. Additionally, Configuration AllL fairly allows all agent types to learn their best responses to each other and improve their performance. Hence, this configuration is better suited for MGs with different owners for DERs and customers, since all agents are individually capable of effectively learning to improve their performance in the MG and adapting to each other. In short, the MG has the best performance, in terms of fairness and independence from the maingrid, when both customers and DER agents are capable of learning.

VI. CONCLUSION

A distributed multiagent approach for adaptive control of energy management in the MG framework was proposed and presented in this paper. The agents modeled in the MG include load customers, energy supply resources such as DERs, and storage agents. In this system, DERs and customers were modeled as autonomous, self-interested agents who can learn the best response policy to increase their expected rewards in the microgrid market by adapting to the other agents in the MG. By adopting the multiagent learning structure for supplier and customer agents, the supplier agents found the optimal solution for their bidding problem at each time slot t with learning strategies that effectively improve their expected profits, while the customer agents learned to make rational decisions about their consumption based on current prices and their actual demands to achieve best response strategies. This would not have been possible had the different types of agents, i.e., DERs and customers, been modeled and investigated individually, without considering that their counterparts are also capable of modeling and adapting in the same environment. Furthermore, the MG's electricity dependency on the maingrid was reduced to its lowest value as an emergent behavior of the proposed model. Moreover, with the stochastic models of the agents, the applied Q-learning allowed all agents to account for the variability of customer and DER agents in their optimal policies. The distributed RL approach offers a scalable solution. Therefore, if an agent changes its behavior, e.g., the value of electricity for a customer, it can update its model locally, based on the new information, and learn its new best response without the need for excessive communication with a central controller or other agents. Thus, our proposed model provides a framework for MG energy and load management that takes into account the variability of the stochastic entities in the MG.

REFERENCES

[1] J. P. Lopes, N. Hatziargyriou, J. Mutale, P. Djapic, and N. Jenkins, "Integrating distributed generation into electric power systems: A review of drivers, challenges and opportunities," Electric Power Syst. Res., vol. 77, no. 9, pp. 1189–1203, 2007.
[2] P. Basak, S. Chowdhury, S. H. nee Dey, and S. Chowdhury, "A literature review on integration of distributed energy resources in the perspective of control, protection and stability of microgrid," Renewable Sustain. Energy Rev., vol. 16, no. 8, pp. 5545–5556, 2012.
[3] Microgrids: Benefits, Barriers and Suggested Policy Initiatives for the Commonwealth of Massachusetts. Burlington, MA, USA: KEMA Inc., 2014.
[4] H. Farhangi, "The path of the smart grid," IEEE Power Energy Mag., vol. 8, no. 1, pp. 18–28, Jan./Feb. 2010.
[5] H. Cai, G. Hu, F. L. Lewis, and A. Davoudi, "A distributed feedforward approach to cooperative control of ac microgrids," IEEE Trans. Power Syst., vol. 31, no. 5, pp. 4057–4067, Sep. 2016.
[6] A. Pantoja and N. Quijano, "A population dynamics approach for the dispatch of distributed generators," IEEE Trans. Ind. Electron., vol. 58, no. 10, pp. 4559–4567, Oct. 2011.
[7] V. Crespi, A. Galstyan, and K. Lerman, "Top-down vs bottom-up methodologies in multi-agent system design," Auton. Robots, vol. 24, no. 3, pp. 303–313, 2008.
[8] A. Belgana, B. P. Rimal, and M. Maier, "Open energy market strategies in microgrids: A Stackelberg game approach based on a hybrid multiobjective evolutionary algorithm," IEEE Trans. Smart Grid, vol. 6, no. 3, pp. 1243–1252, May 2015.
[9] E. Mojica-Nava, C. Barreto, and N. Quijano, "Population games methods for distributed control of microgrids," IEEE Trans. Smart Grid, vol. 6, no. 6, pp. 2586–2595, Nov. 2015.
[10] P. Shamsi, H. Xie, A. Longe, and J.-Y. Joo, "Economic dispatch for an agent-based community microgrid," IEEE Trans. Smart Grid, vol. 7, no. 5, pp. 2317–2324, Sep. 2016.
[11] Y. Zhang, N. Gatsis, and G. B. Giannakis, "Robust energy management for microgrids with high-penetration renewables," IEEE Trans. Sustain. Energy, vol. 4, no. 4, pp. 944–953, Oct. 2013.
[12] Y. F. Eddy, H. B. Gooi, and S. X. Chen, "Multi-agent system for distributed management of microgrids," IEEE Trans. Power Syst., vol. 30, no. 1, pp. 24–34, Jan. 2015.
[13] R. Duan and G. Deconinck, "Multi-agent coordination in market environment for future electricity infrastructure based on microgrids," in Proc. IEEE Int. Conf. Syst., Man Cybern., 2009, pp. 3959–3964.
[14] D. Divényi and A. M. Dán, "Agent-based modeling of distributed generation in power system control," IEEE Trans. Sustain. Energy, vol. 4, no. 4, pp. 886–893, Oct. 2013.
[15] M. H. Cintuglu, H. Martin, and O. A. Mohammed, "Real-time implementation of multiagent-based game theory reverse auction model for microgrid market operation," IEEE Trans. Smart Grid, vol. 6, no. 2, pp. 1064–1072, Mar. 2015.
[16] B. An, N. Gatti, and V. Lesser, "Alternating-offers bargaining in one-to-many and many-to-many settings," Ann. Math. Artif. Intell., vol. 77, no. 1–2, pp. 67–103, 2016.


[17] M. Rahimiyan and H. R. Mashhadi, "An adaptive-learning algorithm developed for agent-based computational modeling of electricity market," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 5, pp. 547–556, Sep. 2010.
[18] D. Li and S. K. Jayaweera, "Distributed smart-home decision-making in a hierarchical interactive smart grid architecture," IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 1, pp. 75–84, Jan. 2015.
[19] S. Vandael, B. Claessens, D. Ernst, T. Holvoet, and G. Deconinck, "Reinforcement learning of heuristic fleet charging in a day-ahead electricity market," IEEE Trans. Smart Grid, vol. 6, no. 4, pp. 1795–1805, Jul. 2015.
[20] Y. Lim and H.-M. Kim, "Strategic bidding using reinforcement learning for load shedding in microgrids," Comput. Electr. Eng., vol. 40, no. 5, pp. 1439–1446, 2014.
[21] E. Kuznetsova, Y.-F. Li, C. Ruiz, E. Zio, G. Ault, and K. Bell, "Reinforcement learning for microgrid energy management," Energy, vol. 59, pp. 133–146, 2013.
[22] B.-G. Kim, Y. Zhang, M. van der Schaar, and J.-W. Lee, "Dynamic pricing and energy consumption scheduling with reinforcement learning," IEEE Trans. Smart Grid, vol. 7, no. 5, pp. 2187–2198, Sep. 2016.
[23] C. J. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3–4, pp. 279–292, 1992.
[24] [Online]. Available: https://www.sce.com/
[25] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proc. 11th Int. Conf. Mach. Learn., 1994, pp. 157–163.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[27] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
[28] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," J. Mach. Learn. Res., vol. 4, pp. 1039–1069, 2003.
[29] X. Wang and T. Sandholm, "Reinforcement learning to play an optimal Nash equilibrium in team Markov games," in Proc. 15th Int. Conf. Neural Inf. Process. Syst., 2002, pp. 1571–1578.
[30] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proc. 11th Int. Conf. Mach. Learn., 1994, pp. 157–163.
[31] E. Foruzan, S. Asgarpoor, and J. M. Bradley, "Hybrid system modeling and supervisory control of a microgrid," in Proc. North Amer. Power Symp., 2016, pp. 1–6.
[32] Y. Atwa, E. El-Saadany, M. Salama, and R. Seethapathy, "Optimal renewable resources mix for distribution system energy loss minimization," IEEE Trans. Power Syst., vol. 25, no. 1, pp. 360–370, Feb. 2010.

Elham Foruzan (GS'17) received the B.Sc. degree from the Ferdowsi University of Mashhad, Mashhad, Iran, in 2008, and the M.Sc. degrees in electrical and computer engineering and in computer science and engineering from the University of Tehran, Tehran, Iran, and the University of Nebraska—Lincoln, Lincoln, NE, USA, in 2010 and 2017, respectively, where she is currently working toward the Ph.D. degree. Her research interests include smart grid, multi-agent systems, microgrid, machine learning, and cyber-physical systems.

Leen-Kiat Soh (M'98) received the B.S. degree (with highest distinction) and the M.S. and Ph.D. degrees (with honors) in electrical engineering from the University of Kansas, Lawrence, KS, USA. He is currently a Professor with the Department of Computer Science and Engineering, University of Nebraska—Lincoln, Lincoln, NE, USA. He has authored more than 180 peer-reviewed journal and conference publications. His research interests include multiagent systems and intelligent agents, computer-aided education systems, computer science education, and intelligent image analysis. He has applied his multiagent systems research to smart grids and distributed GIS. He is a member of ACM and AAAI.

Sohrab Asgarpoor (S'80–M'86–SM'91) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Texas A&M University, College Station, TX, USA. From 1986 to 1989, he was with ABB Network Management Inc. as a Lead Engineer. Since September 1989, he has been with the University of Nebraska—Lincoln, Lincoln, NE, USA, where he is currently the Interim Associate Dean of Undergraduate Programs with the College of Engineering and a Professor with the Department of Electrical and Computer Engineering. His research interests include reliability evaluation, maintenance optimization, and advanced computer applications in security and optimization of power systems.
