
Multi Agent Reinforcement Learning based Distributed Optimization of Solar Microgrid

R Leo, SSN College of Engineering, leor@ssn.edu.in
R S Milton, SSN College of Engineering, miltonrs@ssn.edu.in
A Kaviya, SSN College of Engineering, kaviyaindia@gmail.com

Abstract—We consider a grid-connected solar microgrid system that contains local consumers, solar photovoltaic (PV) systems, load and battery. The consumer, as an agent, continuously interacts with the environment and learns to take optimal actions through a model-free reinforcement learning algorithm, namely Q learning. The aim of the agent is to optimally schedule the battery to increase the utility of the battery and the solar photovoltaic system, and thereby to achieve the long term objective of reducing the power consumption from the grid. Multiple agents sense the states of the environment components and make collective decisions about how to respond to randomness in load and intermittent solar power by using a multi-agent reinforcement learning algorithm, namely Coordinated Q Learning (CQ Learning). Each agent learns to optimize individually and to contribute to global optimization. The grid power consumed when each solar PV system operates individually, using Q learning, is compared with the operation of many such solar PV systems in a distributed environment using CQ learning, and it is shown that the grid power requirement is considerably lower with CQ learning than with Q learning. Simulation results using real numerical data are presented for a reliability test of the system.

Index Terms—Solar microgrid; Reinforcement learning; Q-learning; CQ Learning; Battery scheduling; Distributed Optimization.

I. INTRODUCTION

Electricity is a crucial input to propel the socio-economic growth of a nation. Over the past decade, the electricity generation, transmission and distribution landscape has changed significantly. The open electricity market, the penetration of large scale renewable generation, and similar developments have further increased the complexity of the traditional electricity grid in providing secure and reliable supply with quality. The key element driving change is the emergence of the smart grid. Renewable energy plays a significant role in building a green and sustainable environment. Solar and wind are the only solution to the growing energy crisis in the world [1]. The smart grid paradigm represents a transition towards intelligent, digitally enhanced, two-way power delivery grids. The aim of the smart grid is to promote and enhance the efficient management and operation of power generation and delivery facilities by incorporating advanced communications, information technology, automation and control methodologies into the power grid. A microgrid is a group of interconnected loads and distributed energy resources within clearly defined electrical boundaries that acts as a single controllable entity with respect to the grid. The microgrid is the building block of the smart grid.

Integrating renewable energy in the microgrid is the way forward for economic and environmental optimization, generating clean and green energy and thereby providing a solution to global warming [2]. A microgrid can connect to and disconnect from the grid, enabling it to operate in either grid-connected or island mode. Microgrids can provide quantifiable benefits for utilities, their customers, and energy service companies, regardless of ownership. Solar photovoltaic (PV) uptake by customers is increasing due to incentives, improved performance and lower cost. Large utility customers increasingly seek reliability, autonomy, cost certainty and resilience. The importance of having more reliable, efficient, smart systems is getting more public attention. In the present scenario the consumer wants not only intelligent but smart machines, which can think and operate autonomously and optimally. Smart energy management of microgrids using genetic algorithms is discussed in [2] and [3]. Energy management of microgrids using fuzzy logic is discussed in [4]. Energy management of hybrid renewable energy generation using constrained optimization was proposed in [5]. An agent-based modeling approach is used to model microgrids, and the interactions between individual intelligent decision-makers are analyzed by simulation in [2], [3] and [8]. Expert systems and other classical and heuristic algorithms for energy management of microgrids are discussed in [6], [7] and [9]. Reinforcement learning for optimization of wind energy systems is discussed in [10] and [15]. The Coordination Q-Learning method in multi-agent reinforcement learning is discussed in [12] and [16]. The aim of this paper is to introduce a smart decision making system using a multi-agent reinforcement learning method, called Coordination Q-Learning (CQ Learning), for optimization of distributed energy management in the microgrid. The system behaves in a strategic manner when dealing with operational scenarios, aiming to achieve the lowest possible cost of power generation.

The rest of the paper is organized as follows. In section II, the solar microgrid is explained with the details of the solar photovoltaic system. Section III presents the modeling framework of the consumer agent with a comprehensive framework of reinforcement learning. In section IV, the multi-agent reinforcement learning framework is explained with CQ Learning. In section V, the performance improvement of the solar microgrid through the optimal scheduling of the battery, to increase the utility of the battery and solar power and to reduce the power consumption from the grid, is discussed in detail. The performance of the solar PV systems when operating individually using Q learning and when operating in a distributed environment using CQ learning is also analyzed and compared. Conclusions and suggestions for possible improvements are given in the last section.




II. MODEL OF THE SOLAR MICROGRID

Microgrids comprise low voltage distribution systems with distributed energy resources (DER) (micro turbines, fuel cells, PV, etc.) together with storage devices (flywheels, energy capacitors and batteries) and flexible loads. Such systems can be operated in a non-autonomous way, if interconnected to the grid, or in an autonomous way, if disconnected from the main grid. The operation of micro sources in the network can provide distinct benefits to the overall system performance, if managed and coordinated efficiently [2]. We consider a solar microgrid with a battery and load. The solar microgrid involves a consumer with a dynamically varying load Dt, a transformer providing electric power from the external grid, a solar generator (solar photovoltaic system) with available power output Psp, and a storage facility with a level of battery charge Rt. The architecture of the considered microgrid is shown in Fig. 1. The consumer can cover his demand partly by using the electricity produced by solar energy, store electricity in the battery when the solar source is available, and draw from the storage when needed. The main aim of the solar microgrid is to satisfy the dynamic requirements of the load while maximizing the utilization of the solar power and the battery. Since the solar PV system gives dc power, proper converters (DC/DC and DC/AC) are used for conversion.

Fig. 1. Solar Microgrid

A. Solar Microgrid Neural Network Model

A photovoltaic (PV) module is the basic element of each photovoltaic system. A PV module is a clean energy source used in power systems; it absorbs solar irradiation and converts it into electric energy [13]. PV generation fits the load demand very well since solar irradiation is higher in the daytime. A solar PV module consists of a number of solar cells connected in series or parallel based on the requirement of voltage and current. Modules may then be connected together into a photovoltaic array. Solar irradiance (G) and temperature (T) are the two main factors which influence the solar power. A simple feed-forward neural network is trained on G and T to obtain the equivalent circuit parameters, as shown in Fig. 2. Once the neural network is trained with a sufficient number of examples, we can determine the current and the voltage of the solar module for untrained values of G and T by generalization [14]. The maximum power Psp is found by the Maximum Power Point Tracking (MPPT) algorithm. MPPT refers to the point with a maximum power output under a specific external temperature and solar irradiation [13].

Fig. 2. Neural network model of PV
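As a rough illustration of the mapping described above, the following sketch fits a small feed-forward regressor from (G, T) to module power. It is not the model of [14]: that work recovers equivalent circuit parameters, whereas this simplification regresses power directly; the network size, the scikit-learn implementation and the toy training data are assumptions made only for the example.

```python
# Illustrative sketch only: layer sizes, library choice and training data are assumed.
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic samples: irradiance G (W/m^2) and temperature T (deg C) -> module power (kW)
rng = np.random.default_rng(0)
G = rng.uniform(0, 1000, 500)
T = rng.uniform(10, 45, 500)
p_out = 0.2 * G / 1000 * (1 - 0.004 * (T - 25))   # toy PV characteristic, not from the paper

model = MLPRegressor(hidden_layer_sizes=(10, 10), max_iter=2000, random_state=0)
model.fit(np.column_stack([G, T]), p_out)

# Generalization to unseen (G, T), as described in Section II-A
print(model.predict([[850.0, 32.0]]))
```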
B. Model for the battery storage

A simplified model of the battery storage dynamics is adopted by implementing a discrete time system for the power flows over the time step interval δt:

Rt = Rt−1 + Rt^store.charge + Rt^store.discharge    (1)

In equation (1), Rt and Rt−1 are the levels of energy stored in the battery at times t and t−1 respectively, and Rt^store.charge, Rt^store.discharge are the power flows over the time step interval δt from the solar generator to the battery and from the battery to the consumer load, respectively [10].
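A minimal sketch of the discrete-time update in equation (1) is given below. The capacity clamp, the kWh units and the sign convention (the discharge flow is entered as a negative quantity so that both flows are simply added, as in (1)) are assumptions made for illustration, not values stated in the paper.

```python
def battery_step(r_prev, charge_flow, discharge_flow, capacity_kwh):
    """One step of equation (1): R_t = R_{t-1} + charge + discharge.

    charge_flow is the energy moved from the solar generator to the battery
    during the interval; discharge_flow is the energy delivered from the
    battery to the load, entered as a negative number so both terms are
    added as in (1).  The result is clamped to [0, capacity], an assumption
    (the paper does not state a capacity limit explicitly).
    """
    r_t = r_prev + charge_flow + discharge_flow
    return min(max(r_t, 0.0), capacity_kwh)

# Example: 40 kWh stored, 5 kWh charged from PV, 3 kWh sent to the load
print(battery_step(40.0, 5.0, -3.0, capacity_kwh=100.0))   # -> 42.0
```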

III. MODELING OF THE CONSUMER AGENT

The dynamic variations of the load, the solar power and the battery are considered to constitute the external environment. The consumer is modeled as an individual agent who makes use of reinforcement learning for its decision making, action taking and moving towards its goal. Reinforcement learning deals with learning in sequential decision making in problems with limited feedback. The Markov Decision Process (MDP) has become the standard formalism for learning in sequential decision making. In an MDP, the environment is modeled as a set of states, and actions can be performed to control the system state. The effect of an action taken in a state depends only on that state and not on the prior history. The goal is to control the system in such a way that some performance criterion is maximized. This section presents the reinforcement learning algorithm used by the consumer agent to interact, adapt, and take decisions towards its goal, defined in the form of reward functions in the MDP environment, characterized by the available solar power output Psp, the load Dt and the level of battery charge Rt.

A. Markov Decision Process

A Markov Decision Process (MDP) is a way to model sequential decision making under uncertainty. We formalize an MDP considering discrete states and actions. The initial state is s0 and each state has a reward r associated with it. The transition function T(s'|s, a) indicates the probability of transitioning from state s to s' when action a is taken. A discount factor γ in the range 0 . . . 1 is applied to future rewards. This represents the notion that a current reward is more valuable than one in the future. If γ is near zero, future rewards are almost ignored; a value near one places great value on future rewards. The reward of a policy is the sum of the discounted expected utility of each state visited by that policy. The optimal policy is the policy that maximizes the total expected discounted reward.
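A one-line calculation makes the effect of γ concrete; the reward sequence used below is arbitrary and only for illustration.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.1))   # ~1.11: future rewards almost ignored
print(discounted_return(rewards, 0.9))   # ~4.10: future rewards valued highly
```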
B. Reinforcement Learning

A reinforcement learning algorithm is used to model the consumer's adaptation to a dynamically changing environment by performing battery scheduling actions in an MDP environment [10]. The agent observes the environment and takes an action. It gets a reward or punishment from the environment, takes the next action to optimize the reward in the long run and, after a sufficient number of interactions, finds the optimal policy that achieves the long term objective. The goal of the agent is thus to find the optimal policy through interactive learning with the environment. Fig. 3 shows a simple reinforcement learning scheme. The environment can be characterized by the values of a certain number of its features, collectively called its state, denoted by S(t) at time t. Every state has an intrinsic value, based on reward or cost, denoted by R(t) at time t. The training information is used to evaluate the actions taken by the agent, in terms of the reward or punishment received from the environment. The agent's choice of action is based on its past experience of the action taken in a certain state and the concomitant reward/cost experienced, which updates its decision making for future actions. In Dynamic Programming (DP), value iteration and policy iteration are the two methods to find the optimal policy; but these methods require a model of the environment, so we choose a model-free reinforcement learning method [11].

Fig. 3. Reinforcement Learning

C. Reward function

We optimize the battery scheduling of the solar microgrid by reinforcement learning. This is a process of action-reward dynamics, driven by quantitative performance indicators which evaluate the action or sequence of actions undertaken and feed the value back to adjust future scheduling decisions. The optimization of the numerical reward is achieved through the choice of the actions a0 and a1 of battery scheduling. The consumer aims at increasing its performance by selecting an optimal sequence of actions. The reward function is the response we get from the environment for the action taken. If the battery is charging (a1), the reward is the minimum of Psp and Bdifference; if it is discharging (a0), the reward is the minimum of Dt and Blevel. Here, Bdifference is the difference between the maximum possible charge and the current battery level (Blevel). The optimal scheduling of the battery, and thus the increase of the solar microgrid performance with respect to the consumer goals, is done by Q learning.
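The charge/discharge reward just described can be written directly as a small function. The action encoding (0 = discharge, 1 = charge) follows the a0/a1 labels in the text; the variable names are otherwise illustrative.

```python
def reward(action, p_sp, d_t, b_level, b_max):
    """Reward of Section III-C.

    action:  1 = charge (a1), 0 = discharge (a0)
    p_sp:    available solar power Psp
    d_t:     current load demand Dt
    b_level: current battery level; b_max: maximum possible charge
    """
    b_difference = b_max - b_level
    if action == 1:                      # charging: limited by solar power and free capacity
        return min(p_sp, b_difference)
    return min(d_t, b_level)             # discharging: limited by demand and stored energy

print(reward(1, p_sp=60.0, d_t=80.0, b_level=90.0, b_max=100.0))   # -> 10.0
print(reward(0, p_sp=60.0, d_t=80.0, b_level=90.0, b_max=100.0))   # -> 80.0
```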

D. Q learning

Q learning is a model-free reinforcement learning method in which the agent explores the environment and estimates, for each action, the immediate reward plus the best the agent can do from the next state. In Q learning, the agent does not need a model of the environment. It only needs to know what states exist and what actions are possible in each state. We assign each state-action pair an estimated value, called a Q value [11]. When we visit a state and take an action, we receive a reward. We use this reward to update our estimate of the value of that action in the long run. We visit the states infinitely often, and the action values (Q values) are continuously updated until they converge. The Q learning algorithm is outlined in Algorithm 1 [11]. In the algorithm, γ is the discount factor and α the learning rate. The discount factor contributes to determining the value of future rewards, and the learning rate influences the speed of convergence of the Q values.

Algorithm 1: Q-learning

1  Set γ and the reward matrix R.
2  Initialize Q(s, a) arbitrarily.
3  foreach episode do
4      Initialize s arbitrarily.
5      repeat for each step of the episode
6          Select a in s using a policy derived from Q
7          Take action a, observe reward r and next state s'
8          Q(s, a) ← Q(s, a) + α[ r + γ max_a' Q(s', a') − Q(s, a) ]
9          s ← s'
10     until s is terminal
11 end
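A compact Python rendering of Algorithm 1 for a generic tabular problem is sketched below. The ε-greedy action selection, the episode and step counts, and the dictionary-based Q table are implementation choices of this sketch, not details fixed by the paper; the environment object is a stand-in for the battery-scheduling MDP of Section III.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, steps=24, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q learning (Algorithm 1) for an environment exposing
    env.actions, env.reset() -> state and env.step(state, action) -> (next_state, reward).
    The 24-step hourly horizon and the epsilon-greedy policy are assumptions."""
    q = defaultdict(float)                         # Q(s, a), initialized to 0

    def greedy(state):
        return max(env.actions, key=lambda a: q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps):
            # policy derived from Q, with epsilon-greedy exploration
            action = random.choice(env.actions) if random.random() < epsilon else greedy(state)
            next_state, reward = env.step(state, action)
            best_next = max(q[(next_state, a)] for a in env.actions)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

With the state taken as, for example, the discretized battery level (which, as noted in Section V, indicates the present state of the system) and the actions a0 and a1 of Section III-C, this routine produces the Q table used by a single agent.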
our campus are considered. The hourly basis load graphs are
drawn for the department and the hostel as shown in Fig. 6
and Fig. 7. In the department peak loading happens in the
morning and in the hostel peak load happens in evening. The
IV. M ULTI AGENT R EINFORCEMENT L EARNING hourly basis solar power, Psp and the load, Dt feeds the Q-
Learning algorithm in both the agents (department and hostel)
for optimization. The explorer problem is solved using Monte
Carlo method where the actions are chosen stochastically and
A Multi-Agent System (MAS) is a loosely coupled network averaged over 50 times.
of software agents that interact to solve problems that are
beyond the individual capacities or knowledge of each problem
solver. MAS is used to design and develop decentralized
system for distributed optimization. Here autonomous control
process is assumed by each controllable element, namely,
diesel generator, storage and loads. MAS enhances overall
system performance, specifically along the dimensions of
reliability, computational efficiency and scalability. It is built
on two basic pillars namely the reinforcement learning and
the interdisciplinary work on game theory. The agents can
do reasoning, control, optimize and take collective decisions
about how to respond to user requests and unplanned con-
tingencies. Also the agents can be reactive, proactive and
social. Through a reward based system, the agent eventually
learns its optimal behavior by trial-and-error, where the reward
system provides feedback for each of the chosen actions (either
a reward or a penalty depending on the case). The agent Fig. 4. Solar power for 200 kw unit in a year
attempts to maximize its overall reward, and as a results it
converges in the end to an optimal solution. The problem The agent learns optimal single agent policy when acting
arises when several such RL based agents interact within the alone in the environment. Using this prior knowledge, every
same environment, transforming static environments (from a agent has a model of its expected rewards for every state action
single agent’s perspective) in non-stationary environments, as pair. The Q value table is learned by considering the agent as
each agent has a particular influence over the environment an isolated entity in the environment. Each agent learn to find
itself. All agents attempt to learn simultaneously to optimize the policy for local optimization and coordinate with other
individually. Distributed optimization of multiple agent is done agents for global optimization. The coordination q-learning is
through Coordinated Q Learning (CQ Learning). Here we applicable when the different agents, within which optimal
consider two solar microgrids with capacity of 200 KW and coordination have to be achieved, have already accumulated
150 KW for electrical department and hostel respectively. the anticipated Q-value table. The Coordination Q Learning
Each agent optimizes the scheduling of the battery to increase is implemented in three steps. First is to identify the safe and
the utility of the solar power and the battery to reach long dangerous states using a test followed by selection of actions
term objective of reducing the power from the grid. The two and then update the Q values. The test can be approximated
agents cooperate to manage the various scenarios like peak to be equal to the standard deviation for large samples [16].
load management, fault, etc., due to intermittent solar power To accumulate anticipated Q value table through single agent
supply and randomness in load. learning requires a large number of iteration hence the standard

Fig. 5. Solar power for 150 kW unit in a year

Fig. 6. Load pattern of the department

Fig. 7. Load pattern of the hostel

Thus, together with the mean and the deviation, we can identify whether a state is safe or dangerous and mark it accordingly. The safe and dangerous states are maintained in separate lists to decide which Q value table to use. The battery levels of each agent indicate the present state of the system, and the number of possible states is chosen so as to increase the productivity of the algorithm while still keeping the state space small. After having acquired the Q value table, the agent maintains the mean and variance of the various states obtained through the single-agent Q Learning method. Then the agents can be made to undergo Coordinated Q Learning. The algorithm uses the test to detect whether there are changes in the observed rewards for the selected state-action pair. If it detects a change in the reward, the algorithm treats this state as a joint state, in which a collision occurs, and marks it as dangerous. State-action pairs that did not cause collisions are marked as safe [15]. Each time an agent encounters a marked state, it checks whether it is dangerous or safe, and the following update rule is used:

Qjk(js, ak) = Qjk(js, ak) + α[ r(js, ak) + γ max_a Qk(sk, a) − Qjk(js, ak) ]    (2)

In equation (2), Qk stands for the Q table containing the independent states, and Qjk contains the joint states (js). The second Q table is initially empty and gets updated only when the agent encounters the joint state space. α is the learning rate and γ is the discount factor. Python programming is used to implement the CQ Learning algorithm; the various scenarios the agents encounter are analysed, and the optimized solution is found to meet the long term objective of reducing the cost of energy consumption. A comparison is made between CQ Learning in a distributed environment and individual agents optimizing independently with Q Learning. CQ Learning proves to be the better way of optimizing the power.
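A minimal sketch of this update is shown below, assuming the single-agent Q tables from Algorithm 1 are already available. The dictionary layout and the way a joint state is encoded as a tuple of the two agents' local states are illustrative assumptions, not specifics from the paper.

```python
from collections import defaultdict

def cq_update(q_joint, q_local, joint_state, local_state, action,
              reward, actions, alpha=0.1, gamma=0.9):
    """Equation (2): update the joint-state table Q_jk, bootstrapping
    from the already learned single-agent table Q_k."""
    best_local = max(q_local[(local_state, a)] for a in actions)
    td = reward + gamma * best_local - q_joint[(joint_state, action)]
    q_joint[(joint_state, action)] += alpha * td
    return q_joint

# Toy usage: the department agent is in local state "dept_level3";
# the joint state pairs it with the hostel agent's local state.
q_local = defaultdict(float, {("dept_level3", "a1"): 5.0})
q_joint = defaultdict(float)
actions = ["a0", "a1"]
cq_update(q_joint, q_local, joint_state=("dept_level3", "hostel_level7"),
          local_state="dept_level3", action="a1", reward=2.0, actions=actions)
print(q_joint[(("dept_level3", "hostel_level7"), "a1")])   # 0.1 * (2.0 + 0.9*5.0) = 0.65
```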
Three indicators, B0, S0 and Pg, show the improvement in the performance of the microgrid obtained by using the Q Learning algorithm. The average values of the utility of the solar power, the utility of the battery and the power consumption from the grid for the department are computed as given below and simulated for ten years, as shown in Fig. 8, Fig. 9 and Fig. 10. The increase in the utilization of electricity from the battery is estimated by B0, defined as the ratio of the yearly cumulative power used from the battery to the yearly cumulative load:

B0 = Σ (battery to load) / Σ load(Dt)

The increase in the utilization rate of the solar PV system, evaluated by S0, is defined as the ratio of the yearly cumulative power used from the solar PV system to the yearly cumulative available solar power:

S0 = Σ (solar to battery) / Σ solar power(Psp)

Finally, the parameter Pg indicates the cumulative annual power received from the external grid:

Pg = Σ grid = Σ load(Dt) − Σ (battery to load)
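Given per-interval records of the simulated flows, the three indicators reduce to simple ratios of annual sums, as sketched below; the list-based record format is an assumption for illustration.

```python
def yearly_indicators(load, battery_to_load, solar_to_battery, solar_power):
    """B0, S0 and Pg from per-interval series covering one year.

    load:             consumer demand Dt per interval
    battery_to_load:  energy delivered from the battery to the load
    solar_to_battery: energy stored from the PV system into the battery
    solar_power:      available solar power Psp per interval
    """
    b0 = sum(battery_to_load) / sum(load)          # utilization of the battery
    s0 = sum(solar_to_battery) / sum(solar_power)  # utilization of the PV system
    pg = sum(load) - sum(battery_to_load)          # cumulative power drawn from the grid
    return b0, s0, pg

# Toy four-interval example (units arbitrary)
print(yearly_indicators(load=[80, 90, 100, 70],
                        battery_to_load=[20, 30, 25, 10],
                        solar_to_battery=[40, 50, 20, 0],
                        solar_power=[60, 70, 30, 0]))
```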

Fig. 8. Utility of battery in department

Fig. 9. Utility of solar power in department

Fig. 10. Grid power with QL in department

Fig. 11. Grid power with QL in hostel

Fig. 11 shows the reduction in the power drawn from the grid in the hostel. The reduction in power consumption from the grid for both units (the department and the hostel) due to Q Learning is observed in Fig. 14 and is compared with the power consumption from the grid under Coordinated Q Learning in a distributed environment. It is found from Fig. 15 that with Coordinated Q Learning the power consumed from the grid is further reduced due to distributed optimization. In the first year, the grid power required for the solar microgrids with individual Q learning is 718000 kW, while the grid power required using CQ Learning is 713000 kW. Thus 5000 kW is saved in the first year, and the grid power requirement is considerably reduced in the following years due to distributed optimization using the Coordinated Q Learning method.

VI. CONCLUSION

The dynamic interaction between the consumer agent and its environment is carried out for autonomous optimization of the battery scheduling, so as to increase the utility of the battery and of the solar power and thereby reduce the power consumption from the grid in the long run. Distributed optimization of solar microgrids is done with a multi-agent reinforcement learning approach, namely Coordinated Q learning (CQ Learning), and the performance is compared with the optimization of solar microgrids operating independently using Q learning. A simulation model was developed in Python to show that the CQ learning method requires less power from the grid than the Q learning method under intermittent solar power and randomness of the load. The proposed framework gives the intelligent consumer the ability to explore and understand the stochastic environment and to reuse this experience for selecting the optimal energy management actions, reducing dependency on the grid in a distributed environment. Future work will focus on the extension to multiple, diverse renewable generators (solar and wind) using multi-agent reinforcement learning with several intelligent consumers having conflicting requirements. Generalization of the state space can be done through neural networks to reduce the complexity of the problem.

Fig. 12. Utility of battery with CQL

Fig. 13. Utility of solar power with CQL

Fig. 14. Grid power with individual Q Learning

Fig. 15. Grid power with CQ Learning

REFERENCES

[1] Chen C, Duan S, Cai T, Liu B, Hu G, “Smart energy management system for optimal microgrid economic operation,” IET Renewable Power Generation, 5(3), pp. 258-67, 2011.
[2] Ross Guttromson, Steve Glover, “The advanced microgrid integration and interoperability,” Sandia National Laboratories, Sandia report, March 2014.
[3] Reddy P P, Veloso M M, “Strategy learning for autonomous agents in smart grid markets,” in Proc. Twenty-Second International Joint Conference on Artificial Intelligence, pp. 1446-51, 2011.
[4] Hatziargyriou N D, “Microgrid and energy management,” special issue, European Transactions on Electrical Power, pp. 1139-1141, December 2010.
[5] Mohamed F A, Koivo H N, “System modelling and online optimal management of MicroGrid with battery storage,” International Journal of Electrical Power and Energy Systems, 32(5), pp. 398-407, 2010.
[6] Colson C M, Nehrir M H, Pourmousavi S A, “Towards real-time microgrid power management using computational intelligence methods,” IEEE, pp. 1-8, 2010.
[7] Abdirahman M Abdilahi, M W Mustafa, G Aliyu, J Usman, “Autonomous Integrated Microgrid (AIMG) System,” International Journal of Education and Research, vol. 2, no. 1, pp. 77-82, January 2014.
[8] Jun Z, Junfeng L, Jie W, and Ng H, “A multi-agent solution to energy management in hybrid renewable energy generation system,” Renewable Energy, vol. 36, no. 5, pp. 1352-63, 2011.
[9] Aymen Chaouachi, Rashad M Kamel, Ridha Andoulsi, and Ken Nagasaka, “Multiobjective Intelligent Energy Management for a Microgrid,” IEEE Transactions on Industrial Electronics, 60(4), pp. 1688-99, 2013.
[10] Kuznetsova E, Li Y F, Ruiz C, Zio E, Ault G, and Bell K, “Reinforcement learning for microgrid energy management,” Energy, 59, pp. 133-46, 2013.
[11] Sutton R S, Barto A G, “Reinforcement Learning: An Introduction,” London, England: The MIT Press, pp. 1-398, 1998.
[12] De Hauwere Y M, Vrancx P, Nowé A, “Learning multi-agent state space representations,” in Proc. 9th International Conference on Autonomous Agents and Multi-Agent Systems, Toronto, Canada, pp. 715-722, 2010.
[13] Engin Karatepe, Mutlu Boztepe, Metin Colak, “Neural network based solar cell model,” Energy Conversion and Management, 47, pp. 1159-78, 2006.
[14] Hiyama T, Kitabayashi K, “Neural network based estimation of maximum power generation from PV module using environmental information,” IEEE Transactions on Energy Conversion, 12(3), pp. 241-52, 1997.
[15] Yujin Lim, Hak-Man Kim, “Strategic bidding using reinforcement learning for load shedding in microgrid,” Computers and Electrical Engineering, Elsevier, 2014.
[16] Marco Wiering and Martijn van Otterlo (Eds.), “Reinforcement Learning: State of the Art,” Springer-Verlag, Berlin Heidelberg, 2012.
